Crash-Only Software
A friend recently recommended Marc Brooker’s blog to me. I have read a handful of the posts so far, and it’s great. I strongly recommend it.
The most recent post I read was about crash-only software. The post was inspired by this paper from 2003 on crash-only software.
The paper starts by introducing the concept of crash-only software. The basic idea is that there are many ways in which computers, VMs, and processes can fail. They can be hard down, shutting down gracefully, partially partitioned from the network, dropping some percentage of requests, and so on. The paper asserts that this total state space is too large and unpredictable to reason about effectively. The implication is that recovery or shutdown logic built to cover the whole state space is going to be brittle, untested, and prone to breaking. Developers are better off giving up on covering all of it.
The paper goes on to suggest an alternative called crash-only software, in which every type of failure is converted into a hard crash. Instead of shutting down gracefully or failing partially when system invariants are broken, the process should simply be forced to crash. This compresses the failure state space down to a single type of failure. In other words, the software is “crash-only.”
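To make the idea concrete, here is a minimal Python sketch of turning a broken invariant into a hard crash. This is my own illustration, not code from the paper or from Marc’s post: `crash_unless`, `apply_debit`, and the exit code 70 are all hypothetical names and choices. It uses `os._exit`, which ends the process immediately without running `atexit` handlers or `finally` blocks.

```python
import os
import sys


def crash_unless(invariant_holds: bool, reason: str) -> None:
    """Hard-crash the process when an invariant is broken (hypothetical helper)."""
    if not invariant_holds:
        sys.stderr.write(f"invariant violated: {reason}\n")
        sys.stderr.flush()
        # os._exit skips atexit handlers and finally blocks, so there is
        # no graceful-shutdown path to get wrong. Exit code 70 is arbitrary.
        os._exit(70)


def apply_debit(balance: int, amount: int) -> int:
    """Illustrative operation that refuses to continue in a bad state."""
    new_balance = balance - amount
    crash_unless(new_balance >= 0, "balance went negative")
    return new_balance
```

The point of the sketch is that there is only one failure path. Anything that recovers from it has to run at startup, which is the same code that runs after a power loss or a kill signal.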
Some of the paper’s musings about crash-only software did not age well between 2003 and today. Notably, most distributed systems today do handle a failure state space beyond simple hard crashes. The paper also suggests that an external process called a “crasher” would be responsible for forcing processes to crash; in practice, this is not how it works today.
While some of the conclusions about crash-only software did not turn out to be correct, the design principles the paper suggests for building systems that function well under a crash-only paradigm hold up remarkably well. The paper discusses five of these principles.
The five principles are:
- All important non-volatile state is managed by dedicated state stores
- Components have externally enforced boundaries
- All interactions between components have a timeout
- All resources are leased, rather than permanently allocated
- Requests are entirely self-describing
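As one example, the leasing principle can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not an implementation from the paper: `Lease`, `LeaseTable`, and `acquire` are hypothetical names, and the table lives in memory only so the example stays self-contained. In a real crash-only system it would sit in a dedicated state store, per the first principle.

```python
import time
from dataclasses import dataclass


@dataclass
class Lease:
    holder: str
    expires_at: float


class LeaseTable:
    """Toy lease table: a resource is held only until its lease expires."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be simulated
        self._leases = {}

    def acquire(self, resource: str, holder: str, ttl: float) -> bool:
        """Grant or renew a lease; refuse if another holder's lease is live."""
        now = self._clock()
        current = self._leases.get(resource)
        if (
            current is not None
            and current.expires_at > now
            and current.holder != holder
        ):
            return False
        self._leases[resource] = Lease(holder, now + ttl)
        return True
```

Notice there is no `release` method. If a holder crashes, nothing needs to clean up after it: the lease simply runs out and the next caller gets the resource.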
I think Marc’s blog does a great job giving some high-level notes on each of these, and I do not have anything else to add. I recommend you check out his post here to dig deeper into these principles.
Thanks for reading, folks.