Site Reliability Engineering

Chapter 6: Monitoring Distributed Systems

Andrew Dawson
2 min read · Sep 12, 2023
  • Monitoring is the process of collecting real-time quantitative data about a system, such as query counts, CPU usage, and server lifetimes. Monitoring enables us to analyze long-term trends, compare results across experimental groups, alert when there is an issue, build dashboards to visualize a system’s performance, and conduct retrospective debugging.
  • Monitoring a complex system is a significant engineering endeavor. Producing good metrics and then consuming those metrics in alerts, dashboards, and SLO reports is complex. It is easy to not treat monitoring as a first-class engineering problem when building a system, but it should be treated as such.
  • Given the complexity of building monitoring systems, it is generally better to favor simplicity over smart monitoring tools. For example, don’t build monitoring tools that use fancy ML to automatically infer thresholds, and don’t build a complex dependency hierarchy of monitoring signals to try to reduce noise. Keep monitoring simple and depend on humans to do the more complex analysis using the monitoring tools.
  • The four golden signals of monitoring are latency, traffic, errors, and saturation. Covering these four gives a service at least decent monitoring coverage.
  • Monitoring can get complex, so keep it as simple as possible, but no simpler. Alerting rules should be simple, reliable, and frequently exercised. An alert that is overly noisy, or that has not been exercised in at least a quarter, is a candidate for removal.
  • It can be tempting to build monitoring tools that do everything: one tool for logging, profiling, metrics, dashboards, etc. But as with almost all software projects, it’s better to have well-defined, smaller projects with loosely coupled points of integration.
  • Paging a human is a very expensive operation. For pages to be effective, they should follow a few rules: (1) a human can only urgently respond to a few pages per day, so if you have more than that, you need to cut back on pages; (2) every page should be actionable; (3) every page should require human judgment. If the human just mindlessly runs an automated script, it should not be a page.
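
The four golden signals can be made concrete with a small sketch. This is not code from the book — just a minimal, hypothetical in-process tracker showing what measuring latency, traffic, errors, and saturation for a service might look like (the class name, fields, and percentile math are all illustrative assumptions):

```python
class GoldenSignals:
    """Minimal illustrative tracker for the four golden signals
    of one service over a single reporting window."""

    def __init__(self, capacity):
        self.capacity = capacity   # max concurrent requests (used for saturation)
        self.latencies_ms = []     # latency: how long each request took
        self.request_count = 0     # traffic: demand placed on the system
        self.error_count = 0       # errors: count of failed requests
        self.in_flight = 0         # saturation input: requests currently being served

    def record(self, latency_ms, ok=True):
        """Record one completed request."""
        self.request_count += 1
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.error_count += 1

    def snapshot(self):
        """Summarize the window as the four golden signals."""
        n = len(self.latencies_ms)
        # Naive p99: index into the sorted latencies (fine for a sketch).
        p99 = sorted(self.latencies_ms)[max(0, int(n * 0.99) - 1)] if n else 0.0
        return {
            "latency_p99_ms": p99,
            "traffic_requests": self.request_count,
            "error_rate": self.error_count / self.request_count if self.request_count else 0.0,
            "saturation": self.in_flight / self.capacity,
        }
```

In a real system these numbers would come from a metrics library and be exported to a monitoring backend, but the four quantities being tracked are the same.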
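
The "keep alerting simple" and paging rules above can also be sketched. The point of this hypothetical example is the shape of the rule, not the numbers: a plain static threshold per signal (no ML-inferred thresholds, no signal dependency hierarchy), where each firing alert names the breached signal so the page is actionable. The threshold values are invented for illustration and would need tuning per service:

```python
# Illustrative static thresholds — these values are assumptions, not recommendations.
THRESHOLDS = {
    "latency_p99_ms": 500,   # page if p99 latency exceeds 500 ms
    "error_rate": 0.01,      # page if more than 1% of requests fail
    "saturation": 0.8,       # page if the service is more than 80% "full"
}

def evaluate_alerts(snapshot):
    """Compare each signal against its static threshold and return
    the names of the signals that should page a human. Simple enough
    to reason about at 3 a.m., which is the goal."""
    return [name for name, limit in THRESHOLDS.items()
            if snapshot.get(name, 0) > limit]
```

Because the rule is a flat comparison, it is trivially reliable and easy to exercise; diagnosing *why* a signal breached its threshold is the part left to human judgment, which is exactly what a page should require.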
