Recently I published a post on Why Google Built Spanner. If you have not already read that post, I recommend you check it out before reading this one. Understanding the motivation behind Spanner and the abstractions it provides to its users is critical context for understanding Spanner’s inner workings.
The purpose of this article is to explain how Spanner is built. I will start by going over the 10,000-foot architecture view. Then we will go over the TrueTime API and why it’s so powerful. With this context in hand, we will go through a series of queries, in increasing order of complexity, and explain how they would run on Spanner. I hope you enjoy…
What does Spanner’s architecture look like from 10,000 feet?
In order to understand how Spanner is architected, we must first understand what splits are. Let’s imagine a database table with the following schema.
ID Int64 PRIMARY KEY
In a database that runs on a single machine, data would be stored on disk in sorted order by primary key. It might look like this:
Now this is straightforward as long as data lives on only a single node. But once data starts to live on multiple nodes we need to start thinking about what data lives where. This is where Spanner’s splits come into the picture. A split is nothing more than a contiguous block of primary keys. The following diagram shows an example of 4 splits over the above data.
Here the following things should be observed about splits:
- Each split consists of a contiguous, sorted block of primary keys.
- The splits do not have to be equal in size.
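These two properties suggest a simple lookup structure. As a toy illustration (the boundary values and function name here are made up, not Spanner’s), mapping a primary key to its split is just a binary search over the sorted split start keys:

```python
import bisect

# Hypothetical split boundaries: split i covers keys in
# [split_starts[i], split_starts[i+1]). Values are made up for illustration.
split_starts = [0, 100, 250, 600]  # split 0 starts at 0, split 1 at 100, ...

def split_for_key(pk: int) -> int:
    """Return the index of the split that owns primary key pk."""
    # bisect_right finds the first boundary strictly greater than pk;
    # the owning split is the one just before it.
    return bisect.bisect_right(split_starts, pk) - 1

print(split_for_key(42))   # key 42 falls in split 0
print(split_for_key(100))  # key 100 falls in split 1
print(split_for_key(700))  # key 700 falls in split 3
```

Because the boundaries are just a sorted list of keys, each lookup is a cheap binary search.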
Spanner uses splits as the abstraction to divide data across machines. The following diagram shows an example of what this could look like:
Here we see that the four splits are divided across 3 nodes. Node 1 owns split 0. Node 2 owns splits 2 and 3. Node 3 owns split 1. At this point we have successfully spread data out across multiple nodes but the data is not replicated. Having a single node be responsible for the data presents issues regarding the durability and availability of your data. So let’s replicate each split across nodes in separate zones. Each zone is its own failure domain, such that a failure in one zone will not impact the operations of other zones.
Now the splits are not only spread out across multiple nodes, but each split is replicated in a total of three zones. For example, split 0 exists in ZoneA.Node1, ZoneB.Node3 and ZoneC.Node3. For every split, the three nodes that hold it form a Paxos group that needs to reach consensus on writes. Paxos requires a leader, and Spanner uses long-lived leaders, so each split has a single node acting as the leader for that split. Let’s continue to extend our diagram to show this.
The leaders are all highlighted in green in the above diagram. Notice that leaders do not all exist in the same zone.
A piece we are still missing is the query routing piece. We need a component that can answer questions like: “Which split does row foo belong in?” and “Which node is the leader for split 0?” The piece that does this is the API layer, which user requests are routed through. Let’s show this in our diagram:
Okay, great. Let’s recap:
- We have an API layer which can do lookups for leader locations and split boundaries.
- We have our data replicated across zones.
- We have a single leader for each split.
- Data exists in separate zones with totally separate failure domains.
There is one more important piece to add to this architecture diagram. The creators of Spanner had the brilliant idea to completely decouple data nodes from compute nodes. So all the nodes we have been talking about so far actually don’t store any data on their disks. The nodes we have been talking about are stateless compute nodes that form Paxos groups. This greatly simplifies the problem of node failures and scaling up/down. Since all these nodes are stateless, a dead node can simply be replaced with another node; state does not need to be copied over during recovery. But now you are asking: where does the actual data live? The data is stored in a file system at Google called Colossus. Colossus replicates data under the hood, handles disk failures and disk corruptions, and optimizes data storage. This means Spanner can totally ignore all the problems associated with actually writing data to disk and instead focus on managing stateless compute nodes which access that data. Let’s finish our diagram by adding Colossus to the diagram.
At this point our 10,000-foot view of Spanner’s architecture is complete. But before we move on, I want to make a few closing remarks on this architecture:
- In the diagram I assumed there were three zones. This is actually configurable by the user. Since these zones define Paxos groups, the user will want to select an odd number of zones, and a higher number of zones for higher durability. For example, if a use case must tolerate the total failure of two zones, the database creator should opt for five zones so that the Paxos group can still form a quorum even when two zones fail.
- In addition to selecting the number of zones data should be replicated to, the user can also select where the zones exist geographically. The user will want to select zones that are geographically close to their end customers to minimize latencies.
- So far we have been talking about zones not regions. Zones are physically independent datacenters that exist close enough together that the latency between them is small. A collection of zones comprises what is called a region. Spanner offers two modes: single region and multi-region. Most users will opt for single region because latencies will be lower. But multi-region is available and can improve durability and read latencies. In this post we will not touch on multi-region. The fundamental principles of Spanner’s workings can be understood by looking at single region deployments.
- When a user configures their Spanner database they will specify the number of nodes they want. This number expresses the number of nodes that exist within a single zone. As an example, our architecture diagram above represents a three node cluster. Note that the total number of nodes is the number of zones multiplied by the number of nodes per zone.
At this point we have a good grasp on the high level architecture and we are ready to dive into TrueTime next.
What time is it anyway… The power of the TrueTime API
“What time is it?” If everyone who is reading this just checked their watch, every single person would be guaranteed to see a slightly different time depending on how their watch is calibrated. No two clocks will have precisely the same time. Even clocks on servers that run synchronization algorithms will have slight differences in time between the servers. It is a fundamental tenet of distributed systems that clocks between various servers are not perfectly synchronized.
Clock drift between servers matters because, if clocks never drifted, producing a global ordering of events would be trivial — and a global ordering of events is a powerful consistency guarantee to be able to rely on. To understand this, let’s look at an example.
Suppose we have a database which is replicated on servers Foo and Bar. Furthermore, imagine we have two friends Alice and Bob who are writing to these database replicas. Consider the following sequence of events:
- Bob writes X = 1
- Database replica Foo records this write and marks it using its local timestamp. Let’s say server Foo, according to its local clock, recorded that this write happened at time=3.
- Foo responds back to Bob saying the write was applied successfully.
- Bob tells Alice that his write was applied and she goes ahead and issues a query to the database to write X=2.
- Alice’s query arrives at server Bar instead of server Foo.
- Server Bar applies Alice’s write and marks it with a timestamp based on the server’s local clock. Let’s say time=2. Note that since Foo’s clock and Bar’s clock will not match exactly, it is very possible that Alice and Bob observe Bob’s write happening first while Alice’s write is marked with an earlier database timestamp.
- Alice gets back a 200.
- Servers Foo and Bar do asynchronous replication between themselves and see a conflict for the value of X. Now what value for X should the database set? Since the clocks on Foo and Bar do not match exactly, it’s impossible to determine which event happened first.
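The clock-skew problem in the steps above can be simulated in a few lines. Everything here (replica names, skew values) is invented purely to show how per-server timestamps can invert the real order of events:

```python
# Toy simulation of the Foo/Bar example: each replica stamps writes with its
# own (skewed) local clock. Names and skew values are made up.
class Replica:
    def __init__(self, name, clock_skew):
        self.name = name
        self.clock_skew = clock_skew
        self.log = []  # list of (timestamp, key, value)

    def write(self, true_time, key, value):
        local_ts = true_time + self.clock_skew  # local clock reading
        self.log.append((local_ts, key, value))
        return local_ts

foo = Replica("Foo", clock_skew=+2)   # Foo's clock runs 2 units fast
bar = Replica("Bar", clock_skew=-1)   # Bar's clock runs 1 unit slow

t1 = foo.write(true_time=1, key="X", value=1)  # Bob's write happens first
t2 = bar.write(true_time=2, key="X", value=2)  # Alice's write happens second

# Merge the logs by timestamp, as asynchronous replication would:
merged = sorted(foo.log + bar.log)
print(merged)  # Alice's write sorts *before* Bob's, inverting the real order
```

Even though Bob’s write truly happened first, its local timestamp (3) is greater than the timestamp of Alice’s later write (1), so timestamp order and real-world order disagree.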
The key insight to take from this example is that if all clocks could agree on a time exactly then global ordering would be simply solved and all these conflict issues would go away.
It is also important to note that having a leader for row X that all writes go to, does not completely solve the issue. If there are two rows which have a relationship between themselves, but belong in different database partitions we can have a violation of the observed order of transactions even in a system which has one leader per partition. Constructing an example to demonstrate this is outside of scope for this post, but you can read about an example here.
The bottom line is that global ordering is highly desirable, but due to clock drift between machines, establishing a global ordering is not at all trivial. The novel insight Spanner had was to introduce an API called TrueTime which exposes time as an interval. An API call to TrueTime produces an interval of the form [TT.Earliest, TT.Latest], where the “absolute” time is guaranteed to fall within the bounds of this interval. The power of exposing clock uncertainty in a time API is that it enables a server to assign a timestamp and then wait for the duration of the uncertainty before responding back to the user. This ensures that if the user receives a response and then issues another query, that second query is guaranteed to receive a later timestamp, because the first query’s time had certainly already passed by the time the user issued the second query. If you are curious to dive deeper into the proof, Notes on The Google Spanner Paper explains this in fairly clear terms.
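As a rough sketch of this commit-wait idea (the epsilon value and helper names are assumptions for illustration, not Spanner’s actual API):

```python
import time

# Toy model of TrueTime and commit wait.
EPSILON = 0.005  # pretend clock uncertainty is +/- 5 ms

def tt_now():
    """Return a TrueTime-style interval (earliest, latest) around real time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_with_wait(apply_write):
    earliest, latest = tt_now()
    commit_ts = latest            # pick a timestamp at/after TT.latest
    apply_write(commit_ts)
    # Commit wait: block until commit_ts is guaranteed to be in the past,
    # i.e. until TT.earliest has moved past the chosen timestamp.
    while tt_now()[0] < commit_ts:
        time.sleep(0.001)
    return commit_ts              # only now is it safe to ack the client

ts = commit_with_wait(lambda t: None)
# After commit wait, the chosen timestamp is certainly in the past:
assert tt_now()[0] >= ts
```

The wait costs roughly twice the uncertainty window, which is why keeping that window small is so important to Spanner’s performance.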
If TrueTime were used to wait out clock uncertainty in our example above, it would have been guaranteed that Bob’s write was assigned an earlier timestamp than Alice’s write. This would mean the database could order events such that any queries that depend on the values written by Alice and Bob would always see Bob’s event being applied first. Notice that Spanner’s ability to assign timestamps in accordance with a global ordering is equivalent to the definition of external consistency. If that is not clear, take a look back at the definition of external consistency in my previous Spanner post.
The details of how TrueTime is implemented are out of scope for this post. But I just want to jot down a few notes about it:
- TrueTime is supported using fancy, expensive hardware. The reason Spanner cannot simply be an open source project that runs on arbitrary hardware is that it has a hard dependency on expensive atomic clocks and GPS receivers.
- Spanner can be a fast database as long as TrueTime does not have that much uncertainty. Spanner basically slows down in order to out-wait clock uncertainty. That means anyone could build Spanner using nodes that had some guaranteed amount of synchronization between their clocks. The tricky part is that most clock synchronization methods keep clocks within a few hundred milliseconds of each other, and this level of uncertainty would be far too long to simply wait out. In Spanner, the fancy hardware behind TrueTime keeps the clock uncertainty well under 10 ms (and it seems it’s typically much lower than this).
At this point we understand Spanner’s high level architecture and we understand that the TrueTime API enables Spanner to provide an externally consistent global ordering of events. With this understanding in hand we are ready to explain how various queries would be served by Spanner. Let’s dive in…
How do point writes work?
Let’s imagine that a Spanner user named Bob created the following database schema
ID Int64 PRIMARY KEY
This is the schema that Bob is going to be using for the rest of this article. Now let’s imagine that Bob wants to write the row (10, "Andrew"). Spanner will do the following in order to satisfy this write:
- The API layer will figure out which split ID=10 belongs to. Let’s just say it belongs to split 1.
- Then the API layer will look up the leader for split 1 and forward the write to that split leader.
- The leader starts a transaction and attempts to take out a write lock for the row ID=10. It is possible the transaction is blocked behind another in-progress transaction. It is also possible there is a deadlock, in which case a standard deadlock detection algorithm will run to detect and break the deadlock.
- Once the lock is acquired, the leader calls TrueTime and gets an interval of time. The leader assigns the commit timestamp to be no less than the upper bound of the interval.
- The leader sends the transaction and timestamp to the replicas and waits to get a quorum.
- At this point the leader is waiting for two things. First the leader is waiting to get a quorum back from the replicas AND the leader is also waiting for enough time to pass by to ensure that the selected commit timestamp is certainly in the past.
- Only after the leader gets the quorum and the commit timestamp is guaranteed to have passed does the leader actually commit.
- The leader responds back to the client indicating the write has been persisted. And the leader tells the replicas to commit.
There are two critical insights here:
- The write is blocked until a quorum is achieved. This differs from most NoSQL systems which do non-blocking asynchronous replication.
- The write is blocked until Spanner can be positive the commit timestamp is in the past. This ensures that any transaction the user can observe as coming after the current transaction will receive a later timestamp.
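The leader-side steps above can be sketched as follows. The classes, the counter standing in for TrueTime, and the quorum math are all illustrative stand-ins (and commit wait is elided), not Spanner’s real implementation:

```python
import threading

class SplitLeader:
    def __init__(self, replicas):
        self.replicas = replicas
        self.locks = {}    # row key -> lock
        self.clock = 0     # stand-in for TrueTime (a simple counter)

    def tt_latest(self):
        self.clock += 1
        return self.clock

    def write(self, key, value):
        lock = self.locks.setdefault(key, threading.Lock())
        with lock:                               # 1. take the write lock
            commit_ts = self.tt_latest()         # 2. pick ts >= TT.latest
            acks = sum(r.replicate(key, value, commit_ts)
                       for r in self.replicas)   # 3. send to replicas
            # 4. block until a majority of the group has acked
            assert acks + 1 > (len(self.replicas) + 1) // 2
            # 5. commit wait elided: block until commit_ts is in the past
            return commit_ts                     # 6. ack the client

class Replica:
    def __init__(self):
        self.data = {}
    def replicate(self, key, value, ts):
        self.data[key] = (value, ts)
        return 1  # ack

leader = SplitLeader([Replica(), Replica()])
print(leader.write(10, "Andrew"))  # commits at timestamp 1
```

The two blocking points in steps 4 and 5 correspond exactly to the two insights above: waiting on the quorum and waiting on the clock.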
How do point reads work?
Now that Bob has finished writing row (10, "Andrew"), let’s say Bob calls out to Alice and asks her to issue select * where ID=10. Spanner’s external-consistency guarantee ensures that Alice will see Bob’s write reflected. Note that external consistency actually provides a much stronger guarantee than just read-your-writes, but it certainly also provides read-your-writes, and this example is sufficient to demonstrate how reads are served.
- The API layer would receive the request and look up the split which ID=10 belongs in. Let’s say the row is still in split 1. Note that the boundary between splits can change (Spanner handles this automatically), but for the sake of simplicity let’s just say the row has stayed in split 1 since Bob’s write.
- The API layer looks up the replica best suited to service the read. Any replica can service the read but some of the factors the API layer considers in determining which replica to service the read from are load and physical distance. Once the API layer has selected a replica it forwards the read request to that replica.
- The replica could be the leader or any other replica. The question the node needs to answer at this point is, “Am I up to date enough on this split to service this read?” If the node is itself the leader, then it is definitionally up to date with the split because all writes go through the leader. Therefore let’s ignore the case of reading from the leader because it’s trivial.
- Assuming we are reading from a non-leader replica, the replica needs to determine if it is up to date enough to service the read. Therefore a time of read needs to be selected from TrueTime. Note that by the invariant of write timestamp assignments, the selected read timestamp will be greater than the timestamp of any previously committed transactions. In other words the timestamps represent a global ordering such that the read timestamp is greater than all write timestamps that happened before it. Once the replica has the time of read from TrueTime, the replica reaches out to the leader and asks “hey, leader do I have all the committed transactions at least up until this timestamp?” The leader will either say “yup you got all the commits that happened strictly before that time” or “nope, there are some transactions that have been committed with an earlier timestamp you do not yet have.”
- If the replica learns from the leader that it is already sufficiently up to date, the replica simply does the read, responds back to the API layer, and the API layer sends the response back to the user.
- If however, the replica learns from the leader that it is not up to date enough — i.e. there is a commit with a timestamp less than the selected read timestamp that has already committed, then the replica simply waits until it receives all the commits it needs to receive. Once it’s caught up it services the read.
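A toy version of this replica-side check might look like the following, where the leader tracks commits and the replica asks whether it is caught up through the read timestamp. All names and structure here are assumptions for illustration:

```python
class Leader:
    def __init__(self):
        self.committed = {}   # key -> (value, commit_ts)
        self.latest_ts = 0
    def commit(self, key, value, ts):
        self.committed[key] = (value, ts)
        self.latest_ts = max(self.latest_ts, ts)
    def min_commit_ts(self, read_ts):
        # Highest commit timestamp <= read_ts the replica must have applied.
        return min(self.latest_ts, read_ts)

class ReadReplica:
    def __init__(self, leader):
        self.leader = leader
        self.applied = {}     # key -> (value, commit_ts)
        self.applied_ts = 0   # highest commit timestamp applied locally

    def catch_up(self):
        # Stand-in for receiving the missing commits from the leader.
        for key, (value, ts) in self.leader.committed.items():
            self.applied[key] = (value, ts)
            self.applied_ts = max(self.applied_ts, ts)

    def read(self, key, read_ts):
        # "Hey leader, do I have all commits up until read_ts?"
        while self.applied_ts < self.leader.min_commit_ts(read_ts):
            self.catch_up()   # not caught up: wait for missing commits
        value, _ = self.applied.get(key, (None, None))
        return value

leader = Leader()
leader.commit(10, "Andrew", ts=5)
replica = ReadReplica(leader)
print(replica.read(10, read_ts=7))  # replica catches up, returns "Andrew"
```

The key point mirrored here is that the replica never serves a read at a timestamp it has not fully caught up to; it either confirms with the leader that it is current or waits.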
It is critical to understand that Spanner’s use of TrueTime to assign a globally meaningful order to all events is at the core of Spanner’s consistency guarantee.
Now that we understand the default case for point writes and point reads we can extend our understanding of reads a little bit more by looking at how Spanner provides reads without locks and how Spanner supports stale reads.
Read only transactions in Spanner require no locks. Spanner achieves lock free read transactions by keeping track of multiple versions. You can imagine that for a given timestamp a Spanner database has a certain database state at that timestamp that will never change. New events can be applied at later timestamps, but for a given timestamp data in Spanner never changes. Since reads operate over data which has already been committed (i.e. is in the past) all reads are happening over immutable data and therefore do not require locks. Note that this only applies to read only transactions. If a transaction has any writes in it locks will have to be taken out.
Spanner also provides bounded stale reads. So far we have only talked about externally-consistent reads, but there are applications that would prefer to trade off some amount of data freshness in exchange for lower latencies. For these use cases Spanner provides bounded stale reads. The application specifies a staleness threshold (say 10 seconds), and as long as the replica servicing the read knows it is up to date within this threshold, the replica can respond without checking in with the leader at all. This can reduce the overall latency of the read, because the round trip between the replica and the leader is saved when the replica knows for sure it is already up to date within the staleness bound.
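The decision a replica makes for a bounded stale read can be sketched like this (the field names and time representation are made up for illustration):

```python
class StaleReadReplica:
    def __init__(self):
        self.data = {}
        self.safe_time = 0.0  # locally known "caught up through" timestamp

    def read(self, key, max_staleness, now):
        """Serve locally if caught up to within max_staleness of now."""
        if self.safe_time >= now - max_staleness:
            # Fast path: the staleness bound is satisfied locally,
            # so no round trip to the leader is needed.
            return ("local", self.data.get(key))
        # Otherwise fall back to the strong-read protocol (check with leader).
        return ("needs_leader_check", None)

r = StaleReadReplica()
r.data["X"] = 2
r.safe_time = 95.0  # replica knows it is caught up through t=95

print(r.read("X", max_staleness=10, now=100.0))  # ('local', 2)
print(r.read("X", max_staleness=2, now=100.0))   # ('needs_leader_check', None)
```

With a 10-second bound the replica is within the allowed staleness and answers immediately; with a 2-second bound it cannot prove freshness locally and must fall back to checking with the leader.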
We have covered a lot at this point, so let’s recap what we have touched on:
- The data flow of point writes
- The data flow of point reads
- The use of immutable multi-versions to provide lock free read only transactions
- The client’s ability to read with bounded staleness
The last type of query we are going to cover is cross split read/write transactions. Understanding how this works requires a lot of context, but if you have gotten this far you should have all the context that is needed. Let’s finish strong.
How do cross split reads work?
Up until this point we have only dealt with reads and writes that are within a single split. Now we will extend this to talk about how reads that span multiple splits are handled. Let’s say Bob issues the query SELECT * FROM EMPLOYEE WHERE ID < 700, and that the rows containing ID < 700 span splits 0, 1 and 2. This query will be handled as follows:
- The API layer will get the query and will assign a time to the read based on the current TrueTime.
- The API layer will lookup the splits that the query involves and then lookup the corresponding split leaders.
- The API layer will send the query to some replica in split 0, some replica in split 1 and some replica in split 2.
- The replicas will service the read using the same approach as a read that is within a single split and the replicas will respond back to the API layer.
- The API layer will combine the results and return the result to the user.
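This scatter-gather flow can be sketched as follows, with made-up split contents and a predicate standing in for the WHERE clause:

```python
# Hypothetical split contents: (primary key, value) rows per split.
splits = {
    0: [(5, "Ann"), (120, "Ben")],
    1: [(300, "Cal"), (450, "Dee")],
    2: [(610, "Eve"), (890, "Fay")],
}

def read_split(split_id, predicate, read_ts):
    # Stand-in for "send the query to some replica in this split".
    # In Spanner the shared read_ts would make every split read at
    # the same consistent snapshot; here it is just carried along.
    return [row for row in splits[split_id] if predicate(row)]

def cross_split_read(split_ids, predicate, read_ts):
    results = []
    for sid in split_ids:                  # scatter to each involved split
        results.extend(read_split(sid, predicate, read_ts))
    return sorted(results)                 # gather and merge by primary key

rows = cross_split_read([0, 1, 2], lambda r: r[0] < 700, read_ts=42)
print(rows)
# [(5, 'Ann'), (120, 'Ben'), (300, 'Cal'), (450, 'Dee'), (610, 'Eve')]
```

The single read timestamp assigned up front is what makes the combined result a consistent snapshot rather than a mix of states from different moments.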
How do cross split writes work?
The hardest type of query to support is a cross split write. But with all the context we have built up it really won’t seem that complex.
- The API layer will figure out the splits and leaders involved.
- One of the leaders will be selected as the coordinator for the transaction. The leaders of the other involved splits are called participants.
- The coordinator will determine which mutations should be sent to which split leaders and the coordinator will send the mutations.
- The participants follow roughly the same write protocol as a single split write; the only difference is that the participants cannot commit until they hear it is okay to do so from the coordinator.
- Once a participant determines the mutation can be applied the participant responds back to the coordinator and says “yup, I can commit if you tell me to”
- Once the coordinator gets confirmation from all participants, the coordinator tells them all to commit and finally responds back to the user through the API layer.
So basically multi split write transactions are nothing more than a two phase commit plus a bunch of regular Spanner single split writes.
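A stripped-down sketch of that two-phase commit (participant internals such as locking and Paxos replication are elided, and all names are illustrative):

```python
class Participant:
    def __init__(self, name):
        self.name = name
        self.staged = None
        self.committed = {}

    def prepare(self, mutation):
        # Lock + replicate within the split happens here (elided).
        self.staged = mutation
        return True   # "yup, I can commit if you tell me to"

    def commit(self):
        key, value = self.staged
        self.committed[key] = value
        self.staged = None

def two_phase_commit(coordinator, participants, mutations):
    group = [coordinator] + participants
    # Phase 1: every split leader prepares its piece of the transaction.
    votes = [leader.prepare(m) for leader, m in zip(group, mutations)]
    if not all(votes):
        return "aborted"
    # Phase 2: only after unanimous yes-votes does everyone commit.
    for leader in group:
        leader.commit()
    return "committed"

leaders = [Participant("split0"), Participant("split1"), Participant("split2")]
result = two_phase_commit(leaders[0], leaders[1:],
                          [(1, "a"), (301, "b"), (650, "c")])
print(result)  # committed
```

Each prepare call hides a full single-split write underneath, which is why a cross-split transaction costs noticeably more latency than a single-split one.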
What can we learn from Spanner?
Spanner is a really amazing database. I learnt so much as I was studying it. Here are a few highlights.
- Spanner has a philosophy that you don’t pay for what you don’t use. For example, if you can model your data such that you don’t need cross split transactions, then you don’t pay the latency penalty associated with 2PC. So you can think of Spanner as a toolbox that offers very powerful primitives; some of the primitives are more latency-expensive than others, but you only incur the latency hit for the primitives you decide to use. This à la carte approach to software is something I really like.
- Spanner started by having a great customer to work with, and they built a product for that customer. In the case of Spanner, that customer was Google’s Ads team. Generally I think this framing of really valuing your customers as an opportunity to learn how to build a great product is very valuable.
- Spanner convinced me that the CAP theorem is close to being a useless tool for actually understanding real systems. While the CAP theorem is true, Spanner convinced me that it’s far too simplistic for understanding real-world systems. Instead I would prefer to ask questions like, “What options do users have to trade off latency for data consistency?” or “In the case of a network partition, what does your system do?” These questions are certainly highly related to the CAP theorem, but I think simply labeling a system AP or CP is not actually that useful for real-world systems — more info is needed.
Anyways that was a lot about Spanner. I hope you all enjoyed. Let me know if you liked, hated, agreed, or disagreed with anything I said — I am all ears. Next time I will be writing about Vitess — so keep an eye out for that.