Site Reliability Engineering
Chapter 4: Service Level Objectives
SLI stands for Service Level Indicator. SLIs measure how some aspect of your service is performing. Common service attributes tracked through SLIs are availability, durability, latency, and throughput.
An SLO is just an SLI with a target layered in. This target could provide an upper bound, a lower bound, or both on your SLI. As an illustrative example, your SLI might measure availability as the percentage of requests that are successful, and your SLO for this SLI could state that within each 7-day window the average value of the SLI will stay above 99.99%.
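To make this concrete, here is a minimal sketch in Python (the request counts and helper names are made up for illustration) of an availability SLI computed as the percentage of successful requests over a window and checked against a 99.99% target:

```python
# Minimal sketch: an availability SLI compared against an SLO target.
# The request counts below are hypothetical illustration values.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: percentage of requests that succeeded."""
    if total_requests == 0:
        return 100.0
    return 100.0 * successful_requests / total_requests

def meets_slo(sli_value: float, target: float = 99.99) -> bool:
    """SLO: the SLI must stay at or above the target within the window."""
    return sli_value >= target

# One 7-day window's worth of (hypothetical) request counts.
total = 5_000_000
successful = 4_999_400

sli = availability_sli(successful, total)
print(f"availability SLI: {sli:.4f}%  meets 99.99% SLO: {meets_slo(sli)}")
# availability SLI: 99.9880%  meets 99.99% SLO: False
```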
SLAs are just SLOs with legal contracts attached. To decide whether something is an SLA or an SLO, ask yourself, “What happens if the threshold is violated?” If the answer involves money, credits, or legal action, you are dealing with an SLA; otherwise you are dealing with an SLO. While engineering can help clarify the technical inputs to an SLA, SLAs are ultimately concerns for business and legal, not for engineering.
Indicators in Practice
Picking SLIs
While it might be helpful to have large operational metrics dashboards for a service, you should only select a small number of critical metrics as service level indicators. If you select too many indicators, it is hard to pay sufficient attention to each one. However, if you select too few, you will leave large portions of your service’s surface area unexplored. Most services will want SLIs around availability and latency. Then, depending on the service type, things like durability, throughput, and end-to-end latency can also be useful.
Aggregation
Most metrics need to be aggregated to be useful. Consider a metric that records each time a request arrives: on its own it tells you very little. It becomes useful once you apply a sum aggregation over a specified window, i.e. the total number of requests per second.
Doing aggregations can get surprisingly tricky. For example, consider the simple aggregation of requests per second. Depending on the rate at which metrics are recorded, this aggregation could mean very different things. If your metrics reporter only publishes once per minute, then your per-second number is implicitly an average over that minute, rather than the true instantaneous requests/second for any given second.
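As a sketch of that pitfall (the per-second counts are invented), dividing a once-per-minute counter by 60 gives you the average request rate for the minute and hides any burst within it:

```python
# Sketch: a counter published once per minute hides per-second bursts.
# The per-second counts are invented for illustration.

per_second_counts = [10] * 59 + [610]   # steady 10 rps, then a 1-second burst

reported_per_minute = sum(per_second_counts)   # what a once-per-minute reporter sees
implied_rps = reported_per_minute / 60         # "requests per second" derived from that report

print(f"reported counter for the minute: {reported_per_minute}")
print(f"implied requests/second (average): {implied_rps:.1f}")
print(f"true peak requests/second: {max(per_second_counts)}")
# implied ~20 rps vs. a true peak of 610 rps
```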
Latency aggregations get even trickier. Typically, the way we think about aggregating latency metrics is to group a batch of requests into a fixed-size time window and then compute a distribution of request completion times within that window. If our service received 1,000 requests in a 10-second window, and 999 of those requests completed within 100ms, we say that our p99.9 latency is 100ms, because 99.9% of requests in that window completed in under 100ms. Now consider measuring second-level latency at p99.99: your service would need to receive 10K requests per second just to have enough samples to measure it.
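A rough sketch of such a windowed percentile computation follows (the latencies are synthetic and the nearest-rank method is just one way to compute a percentile); note the sample-count guard, since p99.99 is not meaningful with fewer than 10,000 requests in the window:

```python
import random

# Sketch: percentile latency over a fixed window of requests.
# Latencies are synthetic; real SLIs would come from your metrics pipeline.

def percentile(latencies_ms, p):
    """Return the latency at or below which roughly p percent of requests completed."""
    ordered = sorted(latencies_ms)
    index = max(0, int(len(ordered) * p / 100.0) - 1)
    return ordered[index]

# A 10-second window with 1,000 requests: enough samples for p99.9 ...
window = [random.uniform(5, 100) for _ in range(999)] + [450.0]
print(f"p99.9 over {len(window)} requests: {percentile(window, 99.9):.1f} ms")

# ... but far too few for p99.99, which needs at least 10,000 requests
# before the tail value reflects more than a single sample.
if len(window) < 10_000:
    print("p99.99 not reported: fewer than 10,000 requests in the window")
```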
Yikes… Aggregations are hard.
Standardize
In order to raise the bar for SLIs across a company, it is useful to publish templates for SLIs that teams across the company can use. You don’t want every team building their own indicators to track latency — you want a template which says, “You should think about tracking latency like this.”
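One possible shape for such a template, sketched here as a hypothetical Python dataclass rather than any real library, is a latency SLI definition whose blanks each team fills in:

```python
from dataclasses import dataclass

# Sketch of a shared SLI template; the class and field names are hypothetical.
# Teams fill in the blanks instead of inventing their own definition of "latency".

@dataclass(frozen=True)
class LatencySliTemplate:
    service: str
    percentile: float            # e.g. 99.0 for p99
    threshold_ms: float          # requests must complete within this time
    window_seconds: int = 60     # aggregation window for the distribution
    measured_at: str = "server"  # "server" or "client" side measurement

    def describe(self) -> str:
        return (f"{self.service}: p{self.percentile:g} latency over "
                f"{self.window_seconds}s windows, measured {self.measured_at}-side, "
                f"target <= {self.threshold_ms} ms")

# A team instantiates the template rather than defining latency from scratch.
checkout_latency = LatencySliTemplate(service="checkout", percentile=99.0,
                                      threshold_ms=300.0)
print(checkout_latency.describe())
```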
Objectives in Practice
Be Specific
Tell your users what you actually mean by your SLO. If you are computing latency distributions over a window size of one minute, then say that. If you compute availability using server side metrics instead of client side metrics, then say that.
You don’t want users guessing what you mean by your SLO.
Picking SLOs
- Keep it Simple: While this stuff can get complicated, your goal should be simplicity. Simple SLOs are easier to reason about and explain.
- Minimize SLOs: Have as few SLOs as possible without leaving significant parts of what your users care about unexplored.
- Start Small: It is better to start by under-promising and over-delivering than it is to over-promise and under-deliver. Your SLOs should be realistic, not aspirational. Promise what you can actually live up to, and if that is not good enough for your users, make your service better before promising more.
Picking Windows
Making SLO promises on a short time horizon (e.g. 24 hours) is setting yourself up to fail. The fact is your service is going to have bad days, and if your SLO promises daily availability, you are going to be in violation some days.
A better model is to make SLO promises over a 7- or 28-day window. For example, set an availability SLO of 99.99% over a 28-day window instead of over a 24-hour window. By doing this you implicitly give yourself a large enough error budget to make adjustments as needed over that 28-day window.
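As a back-of-the-envelope sketch (assuming a uniform request rate so time and requests are interchangeable), a 99.99% SLO over a 28-day window leaves roughly four minutes of full-downtime error budget:

```python
# Sketch: error budget implied by a 99.99% availability SLO over 28 days,
# assuming a uniform request rate so time and requests are interchangeable.

slo_target = 0.9999
window_days = 28

window_minutes = window_days * 24 * 60
budget_fraction = 1 - slo_target            # fraction of the window allowed to fail
budget_minutes = window_minutes * budget_fraction

print(f"window: {window_minutes} minutes")
print(f"error budget: {budget_minutes:.1f} minutes of downtime per {window_days} days")
# error budget: ~4.0 minutes, versus ~8.6 seconds on a 24-hour window
```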
Monitor
If we think of an error budget like a monthly household budget, then we don’t want to wait until our account balance hits zero before realizing there is an issue. Instead, we want to check the balance throughout the month to monitor the burn rate. To aid in this monitoring, a notification from your bank when you are spending too fast for the month would also be helpful.
In software, this monitoring should take the form of paging when the error budget burn rate is too high and having your on-call periodically check your SLO dashboards.
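A minimal sketch of such a burn-rate check (the counters and the 10x paging threshold are illustrative assumptions, not prescriptions) might look like this:

```python
# Sketch: page when the error budget is burning too fast.
# The counters and the 10x threshold are illustrative assumptions.

def burn_rate(errors: int, requests: int, slo_target: float = 0.9999) -> float:
    """How many times faster than 'exactly on budget' we are consuming budget."""
    if requests == 0:
        return 0.0
    observed_error_ratio = errors / requests
    allowed_error_ratio = 1 - slo_target
    return observed_error_ratio / allowed_error_ratio

def should_page(errors: int, requests: int, threshold: float = 10.0) -> bool:
    """Page the on-call if the recent burn rate exceeds the threshold."""
    return burn_rate(errors, requests) >= threshold

# Last hour of (hypothetical) traffic: 2 million requests, 600 errors.
print(f"burn rate: {burn_rate(600, 2_000_000):.1f}x")   # ~3.0x: watch, don't page
print(f"page? {should_page(600, 2_000_000)}")            # False
print(f"page? {should_page(6_000, 2_000_000)}")          # True at ~30x
```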