
Distinctions Between Service Level Indicators, Objectives, and Agreements

If you’re someone who doesn’t want to spend hours watching dashboards, let’s talk about the technical service metrics that can spare you that.
Generally speaking, your product fails in one of two modes. In the first mode, the functionality technically works, but users don’t engage, churn goes up, and so on. This is the typical state of any growing project, and the path here is well-trodden: you dig into analytics, run discovery, launch A/B tests, and so on.
But that’s not what we’re talking about today.
I’d like to discuss the second mode: the technical part doesn’t work properly, and consequently, the whole product doesn’t work either, even if users really want to use it.
So what is a product manager (PM) to do? We will discuss how to address this mode below, but before anything else, you must measure the problem.
When you ask your technical lead, they start showcasing the 100 graphs they built with the team:
Here is the traffic per API, the split between data centers, and CPU load—and this is all for each of the 10 services.
An inexperienced product manager listens intently and tries to figure it all out. A more tech-savvy one proposes defining Service Level Indicators (SLIs) instead.
SLIs (Service Level Indicators) are the 2–3 main technical service metrics that determine a client’s happiness. A simple test: if one of the SLIs goes red, the client should be upset. If they aren’t, it’s not an SLI, because it doesn’t hurt the client. In other words, they don’t care about this metric, but they probably care about another one you don’t measure yet.

SLI Example

A classic example of an SLI is the response success rate (or reliability), measured as the percentage of requests a service handles successfully in a given period (5 mins, 1 hour, 1 day, 1 month; you define it).
Let’s consider a weather forecast API. If you send it 100 requests per second (for example, from the Apple Weather application) and, after measuring all responses across 5 minutes, you count 13% errors, then the SLI for this period = 87%. You won’t be happy with the forecast API’s account manager, to whom you probably pay a fee for each API call.
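Here is a minimal Python sketch of that success-rate calculation. The numbers mirror the example above, and treating HTTP statuses below 500 as successes is an assumption for illustration; teams define "success" differently per service.

    # Success-rate SLI over one measurement window (illustrative sketch).
    def success_rate_sli(status_codes: list[int]) -> float:
        """Percentage of responses counted as successful (here: status < 500)."""
        ok = sum(1 for code in status_codes if code < 500)
        return 100.0 * ok / len(status_codes)

    # 5-minute window from the weather API example:
    # 100 requests/sec * 300 sec = 30,000 responses, 13% of them errors.
    window = [200] * 26_100 + [503] * 3_900
    print(f"SLI = {success_rate_sli(window):.0f}%")  # SLI = 87%
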
And if, say, CPU utilization in one of the weather service’s data centers sits at 80% instead of 20% while the API works normally, the client doesn’t care. That means CPU load is not an SLI (although it is a good technical metric that developers watch for their own needs).
The picture above shows the behavior of a real SLI (success rate). As you can see, it is not always equal to 100%; something constantly happens to the service. If you measure success over different time intervals (for example, per hour), you can mark each interval of the service’s “health” as red (something is wrong), yellow, or green (everything is good). What counts as green, and what counts as red? We will come back to this when we talk about SLOs.

Looking for a second SLI

Latency is the next candidate. Why is it an important metric? Imagine that you are looking for a taxi right now, and the application has been resolving the address you entered for 10 seconds, 20 seconds, 30 seconds, and still hasn’t finished. I don’t know about you, but I’ll close this application; fortunately, there are always competitor apps. Such an emotional reaction means that the Response Latency metric fits the definition of an SLI very well: the client clearly cares.
How should this metric be expressed as an SLI? Should you take the service’s average response time over a period? Or the maximum: out of 100 requests, pick the slowest one and call that the metric? Or use percentiles, i.e., count the percentage of requests that finish before a certain threshold (e.g., 5 sec) over a time period (e.g., 5 mins)? The latter is probably the answer in most cases, but to be sure, discuss it with the team!
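To make the percentile-style option concrete, here is a hedged Python sketch of a threshold-based latency SLI. The 5-second threshold and the traffic numbers are illustrative, not recommendations.

    # Latency SLI: share of requests that finished under the threshold.
    def latency_sli(latencies_sec: list[float], threshold_sec: float = 5.0) -> float:
        fast = sum(1 for t in latencies_sec if t <= threshold_sec)
        return 100.0 * fast / len(latencies_sec)

    # Hypothetical 5-minute window: most requests are fast,
    # a few are stuck behind a slow address lookup.
    window = [0.4] * 9_700 + [12.0] * 300
    print(f"Latency SLI = {latency_sli(window):.0f}%")  # Latency SLI = 97%
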

And why does a PM need to know SLIs?

Instead of 100 metrics, you and your team define 2–4 SLIs once and look only at them from then on. You simply stop caring about the rest of the metrics. I have a dashboard with the main SLIs (my product has six of them, three per service), and I rarely go deeper. If something goes wrong, the team pulls out those 100 other graphs to find the cause of the problem, but it all starts with the SLI “cockpit.”
Learning to spot the “main” metrics doesn’t come naturally; it takes practice. For example, what are the SLIs for the WhatsApp message storage service? How do they differ from the SLIs of the YouTube streaming service? What about the SLIs of a payment API? They are not the same, because clients care about different things, and you can practice defining them on the free resource here.

What about SLOs?

We defined a metric (SLI), but it doesn't answer the question, “Is the business process healthy?” For that, we need an SLO (Service Level Objective)—the health threshold of the metric.
For example, you might measure that 96% of requests sent to the Weather Forecast API are successful (4% return errors). If the SLO for this metric is set at 95%, then all good: the service is healthy, even if not perfect (96 > 95). If it is set at 99%, then nope, the service is not feeling good (96 < 99).
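In code, the SLO check itself is almost trivially simple, and that is the point: once the threshold is agreed on, health becomes a yes/no question. A sketch using the numbers from the example above:

    # An SLO is just an agreed threshold on an SLI.
    def is_healthy(sli_percent: float, slo_percent: float) -> bool:
        return sli_percent >= slo_percent

    measured_sli = 96.0  # 96% of Weather Forecast API requests succeeded
    print(is_healthy(measured_sli, slo_percent=95.0))  # True: healthy, if not perfect
    print(is_healthy(measured_sli, slo_percent=99.0))  # False: not feeling good
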
The classic pitfall of non-technically savvy PMs is trying to set SLOs to 100% to “make things simpler.” The reality is that backend logic doesn’t “just always work.” Something constantly happens: servers (actual physical computers in data centers) break, network connectivity loses data packets, code contains bugs, and so on. That’s why SLI graphs look like a “saw” in the pictures. This has two implications.
First, a 100% SLO is simply impossible. Technically speaking, it is achievable over short periods, but over a month it is highly unlikely. Even Google doesn’t always load (even though it has some of the strictest SLOs in the world).
Second, every additional “nine” (99%, 99.9%, 99.99%, etc.) will cost you more and more effort. Think about it: if you promised 99.99%, that allows you 0.01% of failures. Assuming your service has a constant traffic load, a 30-day month contains 60 sec × 60 min × 24 h × 30 days = 2,592,000 seconds, and 0.01% of that is roughly 259 seconds. In other words, you can afford just 4.3 minutes of outages in a month! Be very careful about promising such lightspeed outage resolutions unless you are Stripe or Amazon.
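The same back-of-the-envelope arithmetic as a small Python sketch, assuming a 30-day month and constant traffic, as above:

    # Monthly outage budget implied by an SLO.
    def monthly_outage_budget_minutes(slo_percent: float, days: int = 30) -> float:
        seconds_in_month = 60 * 60 * 24 * days
        return seconds_in_month * (1 - slo_percent / 100) / 60

    for slo in (99.0, 99.9, 99.99):
        print(f"{slo}% -> {monthly_outage_budget_minutes(slo):.1f} min of outage per month")
    # 99.0% -> 432.0 min, 99.9% -> 43.2 min, 99.99% -> 4.3 min
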
How do you choose the right SLO? Talk to clients! They will push it up toward 100%, while your job is to push it down to a value that doesn’t block your own innovation: less room for failure means fewer features and experiments to play with.

Finally — SLAs

Service Level Agreements (SLAs) are the rules service owners commit to follow if they violate their SLOs. They might include a complete freeze on new development, investment in testing and outage-response practices, and, when it is agreed in a contract, even a financial penalty.
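As a toy illustration (entirely hypothetical; real SLAs are contracts and processes, not code), SLA terms can be thought of as a lookup of consequences triggered by an SLO breach:

    # Hypothetical SLA terms: consequences of breaching the SLO.
    SLA_ACTIONS = {
        "freeze_new_development": True,
        "invest_in_testing": True,
        "penalty_usd_per_breach": 0,  # non-zero only when contractually agreed
    }

    def on_slo_breach(sli_percent: float, slo_percent: float) -> None:
        if sli_percent < slo_percent:
            for action, value in SLA_ACTIONS.items():
                print(f"trigger: {action} = {value}")

    on_slo_breach(sli_percent=96.0, slo_percent=99.0)
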
Outside of top tech companies, SLAs are quite rare because they require rigor and discipline to follow, but you can occasionally see them around critical product areas (e.g., Stripe’s card-processing engine).

Summary

Product professionals should keep in mind:
  • SLI is a metric, one of the 2–3 most important ones for a service from a client’s perspective
  • SLO is a health threshold of SLI
  • SLA defines service owner obligations in case of an SLO breach
Defining the right SLIs/SLOs/SLAs for a service is hard, especially at the beginning, but it pays off: product teams get to focus only on the metrics that matter to their stakeholders and clients, which frees them up to drive innovation. And that is exactly what we are here for.