Name
Navigating the Service Metric Swamp
Description

This session makes the case for simple service level objectives based on the correctness and speed of the system from the end-user perspective. When well selected, these can be used to calculate a more meaningful availability number.

Many of the concepts that we interact with daily are not actually that helpful in describing the problems we face. I’m talking about things like mean time between failures (MTBF), mean time to repair (MTTR), and even availability. The underlying concepts are strong, particularly MTBF and MTTR, but none of them are really usable as a service level objective. Availability usually appears in service level agreements, and it is also problematic. All of these are shallow metrics. Even in the best of circumstances, they don’t provide much useful insight on their own; that has to come from additional analysis. This all raises the question: why do we use them?

MTBF is problematic because we want it to be predictive, but it is not. It is an after-the-fact calculation of how we did, and it’s usually calculated incorrectly. The only insight we gain is that we should work towards having fewer outages. MTTR is great in concept; we should always seek to reduce the impact of an outage once it begins. However, you can’t use MTTR to measure your progress against that goal, because outage durations don’t follow a normal statistical distribution (i.e., a bell curve). The distribution is heavily skewed towards shorter times, so the mean isn’t what we think it is: a line in the middle of a bell curve. In “Incident Metrics in SRE,” Davidovič demonstrates that it’s quite possible to reduce the time of every outage by a few minutes and still have a worse MTTR.
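
The MTTR point can be illustrated with a small simulation. This is only a sketch, not a reproduction of the analysis in “Incident Metrics in SRE”: the log-normal parameters, the 50 incidents per period, and the 3-minute improvement are all assumptions chosen for illustration. It draws two independent periods of skewed outage durations, shortens every outage in the second period, and counts how often the second period still reports a higher MTTR.

```python
import random
from statistics import mean


def fraction_worse_mttr(trials=2000, incidents=50, improvement_min=3.0, seed=7):
    """Fraction of simulated period pairs where MTTR looks worse after improving.

    All parameters are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    worse = 0
    for _ in range(trials):
        # Baseline period: skewed (log-normal) outage durations in minutes.
        before = [rng.lognormvariate(3.4, 1.0) for _ in range(incidents)]
        # Improved period: a fresh set of outages, each one shortened by
        # improvement_min minutes (never below zero).
        after = [
            max(0.0, rng.lognormvariate(3.4, 1.0) - improvement_min)
            for _ in range(incidents)
        ]
        if mean(after) > mean(before):
            worse += 1
    return worse / trials


if __name__ == "__main__":
    share = fraction_worse_mttr()
    print(f"MTTR looked worse in {share:.0%} of simulated comparisons")
```

Under these assumptions, a sizable share of comparisons shows a higher MTTR even though every outage in the second period was shortened, because the long tail of the distribution dominates the mean.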

There are only two things that are critically important to users: correctness and speed. The site should behave as expected and do it promptly. If both are true, the site is available. Both can be measured with data from just one or a few synthetic transactions. We’ll end by showing the power of a single synthetic to illustrate both correctness and performance, and build an availability chart on top of that data.
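
As a rough sketch of how an availability number could be derived from one synthetic transaction: each run counts as "good" only if it returned the correct result within a latency threshold, and availability is the share of good runs. The `SyntheticResult` record, the function names, and the 2000 ms threshold are hypothetical, introduced here for illustration; they are not taken from the session.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SyntheticResult:
    """One run of a synthetic transaction (hypothetical record layout)."""
    timestamp: int      # epoch seconds when the run started
    correct: bool       # did the transaction return the expected result?
    latency_ms: float   # end-to-end duration of the run


def is_good(result, latency_slo_ms=2000.0):
    # A run is good only if it was both correct and fast enough.
    return result.correct and result.latency_ms <= latency_slo_ms


def availability(results, latency_slo_ms=2000.0):
    """Share of synthetic runs that were both correct and fast."""
    if not results:
        raise ValueError("no synthetic results to evaluate")
    good = sum(1 for r in results if is_good(r, latency_slo_ms))
    return good / len(results)


def availability_by_bucket(results, bucket_s=3600, latency_slo_ms=2000.0):
    """Availability per time bucket, keyed by bucket start (chart input)."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r.timestamp - r.timestamp % bucket_s].append(r)
    return {
        start: availability(runs, latency_slo_ms)
        for start, runs in sorted(buckets.items())
    }
```

The per-bucket values are what one might plot as an availability chart; the threshold itself would be chosen as the service level objective for speed.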

Top three takeaways:

  • Users care about speed and correctness
  • Stop using MTTR and MTBF
  • Start using synthetics
     
Date & Time
Wednesday, May 25, 2022, 10:25 AM - 10:50 AM
David Owczarek