Site Reliability Engineering

Chapter 3: Embracing Risk

Andrew Dawson
2 min read · Aug 22, 2023

Building reliable software comes at a cost. If you did not care about your product’s reliability, you could race new features out the door and run your product on the cheap. The problem is no one wants to use an unreliable service.

However, somewhat counterintuitively, people also don’t want to use a product that is 100% reliable. A product that achieves 100% reliability is going to cost a fortune and will almost never add new features.

Deciding how reliable a product should be is a business strategy question, not a technical one. If you run a pacemaker company, you probably want to be as close to 100% reliable as possible. If you run a brand new social networking website, you probably care much more about release velocity and about not burning through your Series A funding.

There is no single correct level of reliability. Instead, there is a set of tradeoffs that follow from whatever reliability target you set. If you decide your product only needs to be 80% reliable, you will be able to ship new features very quickly and run cheaply, but your users will have to tolerate a lot of failures.

By formalizing reliability targets, builders can make principled decisions about how to balance risk and product reliability. Without an explicit target, it’s not clear whether more effort should go toward reliability or whether development is being overly conservative and more risks should be taken.

Service Level Objectives (SLOs) are the tool through which we formalize reliability targets. The most common type of SLO is an availability SLO. Availability is commonly expressed either as the fraction of requests that succeed over a period of time or as the percentage of time the service was considered up.
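To make the two definitions of availability concrete, here is a minimal Python sketch. The function names and the choice to treat a window with no traffic as fully available are my own assumptions for illustration, not something the chapter prescribes.

```python
def request_availability(successful: int, total: int) -> float:
    """Request-based availability: fraction of requests that succeeded."""
    if total < 0 or successful < 0 or successful > total:
        raise ValueError("need 0 <= successful <= total")
    if total == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return successful / total


def time_availability(uptime_seconds: float, window_seconds: float) -> float:
    """Time-based availability: fraction of the window the service was up."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 <= uptime_seconds <= window_seconds:
        raise ValueError("need 0 <= uptime_seconds <= window_seconds")
    return uptime_seconds / window_seconds


def meets_slo(availability: float, slo: float) -> bool:
    """True when measured availability is at or above the SLO target."""
    return availability >= slo
```

For example, request_availability(9_950, 10_000) meets a 0.99 target but not a 0.999 target. The two measures can disagree for the same service, which is why an SLO should state which one it uses.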

When we define an SLO, we are also implicitly defining an error budget. If your availability SLO specifies that 99% of requests will succeed in a 10-day window, and you receive 1,000 requests per day, you expect 10,000 requests in that window. You can therefore fail 100 of them (1% of 10,000) and still be within the error budget implied by your SLO.
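The arithmetic behind that example can be sketched in a few lines of Python. The function name and parameters are hypothetical. The SLO is passed as a percentage string and handled with Decimal so that the result is not thrown off by floating-point rounding.

```python
import math
from decimal import Decimal


def error_budget(slo_percent: str, requests_per_day: int, window_days: int) -> int:
    """Number of requests that may fail in the window while still meeting the SLO."""
    slo = Decimal(slo_percent)
    if not Decimal(0) <= slo <= Decimal(100):
        raise ValueError("slo_percent must be between 0 and 100")
    total_requests = requests_per_day * window_days
    allowed_failure_fraction = (Decimal(100) - slo) / Decimal(100)
    # Round down: failing a fraction of a request more than this would break the SLO.
    return math.floor(allowed_failure_fraction * total_requests)


# The example from the text: 99% SLO, 1,000 requests per day, 10-day window.
budget = error_budget("99", 1000, 10)
print(budget)  # 100
```

Tightening the same SLO to 99.9% shrinks the budget to 10 failed requests over the same window.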

Error budgets and SLOs provide a tool to formalize the tradeoff between risk and reliability. If your service is consistently running much more reliably than your SLO requires, you can be more aggressive, take more risk, or try to run more cheaply. If your service is consistently missing its SLO and exhausting its implicit error budget, then you need to slow down and focus on reliability.
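One way to turn that rule of thumb into something checkable is sketched below in Python. The function, the returned labels, and the 50% "slack" threshold are all assumptions chosen for illustration. A real policy would pick its own thresholds and its own definition of "consistently".

```python
def release_posture(
    budget: int,
    failures_per_window: list[int],
    slack_fraction: float = 0.5,  # hypothetical threshold for "much more reliable"
) -> str:
    """Suggest a posture from failed-request counts over several recent windows."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    if not failures_per_window:
        raise ValueError("need at least one window of data")

    # Every recent window blew through the budget: the SLO is consistently missed.
    if all(failed > budget for failed in failures_per_window):
        return "focus on reliability"

    # Every recent window used only a small share of the budget: there is room for risk.
    if all(failed <= slack_fraction * budget for failed in failures_per_window):
        return "take more risk"

    return "hold steady"
```

With a budget of 100 failed requests per window, release_posture(100, [150, 120, 130]) suggests focusing on reliability, release_posture(100, [10, 20, 5]) suggests taking more risk, and a mixed history such as [10, 120, 60] suggests holding steady.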

Risk is not a bad thing when developing software, and reliability is not the ultimate good. All these things are tradeoffs that need to be balanced. SLOs and error budgets enable us to be intentional and explicit about these tradeoffs.


Written by Andrew Dawson

Senior software engineer with an interest in building large scale infrastructure systems.
