Site Reliability Engineering

Chapter 5: Eliminating Toil

Andrew Dawson
1 min read · Aug 31, 2023

When building and operating a software product, some amount of toil is inevitable, but we should strive to minimize it.

Toil is work that tends to be manual, repetitive, automatable, and tactical, that has no enduring value, and that scales linearly with service size. Not all operational work is toil. For example, revamping your alerts to provide a higher signal-to-noise ratio is operational work, but it is not toil. Doing the dishes is toil… installing a better dishwasher is not!

Toil should be aggressively minimized on a team. Over a long period of time, toil results in career stagnation, low morale, slow progress, and lower-quality feature development.

Toil can easily become a runaway train. If toil is not proactively managed, it tends to increase as software gets built. There seems to be a critical tipping point at which there is so much toil on a team that not only does the team lack cycles for new feature development, it does not even have cycles to make the improvements required to reduce the toil. This is like when a country's interest on their debt is higher than their GDP: they are in deep shit.
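To make the tipping point concrete, here is a minimal toy model in Python. Everything in it is an assumption made for illustration, not something taken from the SRE book: the team has a fixed weekly `capacity` in hours, toil grows by a constant `growth` hours per week as the service grows, the team spends a fraction `invest` of its toil-free hours on automation, and each such hour removes `payoff` hours of weekly toil. The function names are hypothetical too.

```python
from typing import Optional


def tipping_point(capacity: float, growth: float, payoff: float, invest: float) -> float:
    """Toil level above which toil outgrows the team's ability to reduce it.

    Assumed model: weekly change in toil = growth - payoff * invest * (capacity - toil).
    That change is zero at capacity - growth / (payoff * invest) and positive above it.
    """
    rate = payoff * invest
    if rate <= 0:
        # No automation work at all: any starting toil eventually swamps the team.
        return 0.0
    return max(0.0, capacity - growth / rate)


def weeks_until_swamped(
    capacity: float,
    toil: float,
    growth: float,
    payoff: float,
    invest: float,
    horizon: int = 520,
) -> Optional[int]:
    """Return the first week in which toil consumes the whole capacity.

    Returns None if that does not happen within `horizon` weeks.
    """
    for week in range(1, horizon + 1):
        free = max(0.0, capacity - toil)       # hours left after toil
        reduction = payoff * invest * free     # toil removed by automation work
        toil = max(0.0, toil + growth - reduction)
        if toil >= capacity:
            return week
    return None


if __name__ == "__main__":
    # Hypothetical numbers: a 40-hour week, toil growing by 1 hour per week.
    print(tipping_point(40, 1, 0.5, 0.2))
    print(weeks_until_swamped(40, toil=10, growth=1, payoff=0.5, invest=0.0))
    print(weeks_until_swamped(40, toil=10, growth=1, payoff=0.5, invest=0.2))
    print(weeks_until_swamped(40, toil=35, growth=1, payoff=0.5, invest=0.2))
```

Under these assumptions, the same 20% investment in automation behaves very differently depending on where the team starts: below the tipping point toil shrinks, above it toil keeps growing until there are no free hours left to fight it. The real dynamics on a team are messier, but the shape of the problem is the same.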

"If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow."
- Carla Geisser, Google SRE

