How Google Built Spanner

CREATE TABLE Employee (
  ID INT64 NOT NULL,
  Name STRING(MAX)
) PRIMARY KEY (ID);
  • Each split consists of a contiguous, sorted range of primary keys.
  • The splits do not have to be equal in size.
  • We have an API layer that can look up leader locations and split boundaries.
  • We have our data replicated across zones.
  • We have a single leader for each split.
  • Data exists in separate zones with totally separate failure domains.
  • In the diagram I assumed there were three zones. This is actually configurable by the user. Since these zones define Paxos groups, the user will want to select an odd number of zones, and a higher number of zones for higher durability. For example, if a use case must tolerate the total failure of two zones, the database creator should opt for five zones so that the Paxos group can still form a quorum after two zone failures.
  • In addition to selecting the number of zones data should be replicated to, the user can also select where those zones exist geographically. The user will want to select zones that are close to their end customers to minimize latencies.
  • So far we have been talking about zones not regions. Zones are physically independent datacenters that exist close enough together that the latency between them is small. A collection of zones comprises what is called a region. Spanner offers two modes: single region and multi-region. Most users will opt for single region because latencies will be lower. But multi-region is available and can improve durability and read latencies. In this post we will not touch on multi-region. The fundamental principles of Spanner’s workings can be understood by looking at single region deployments.
  • When a user configures their Spanner database they will specify the number of nodes they want. This number expresses the number of nodes within a single zone. As an example, our architecture diagram above represents a three-node cluster. Note that the total number of nodes is the number of zones multiplied by the number of nodes per zone; the sketch after this list makes the split and node bookkeeping concrete.
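To make that bookkeeping concrete, here is a minimal sketch in Python of how an API layer might map a primary key to its split and that split's leader, plus the node math. The split boundaries, zone names, and counts are made-up values for illustration, not anything Spanner actually exposes.

import bisect

# Hypothetical split boundaries: split i covers primary keys in
# [split_starts[i], split_starts[i + 1]). Splits are contiguous,
# sorted ranges of the primary key and need not be equal in size.
split_starts = [0, 1000, 50000]                          # splits 0, 1, 2
split_leaders = {0: "zone-a", 1: "zone-b", 2: "zone-c"}  # one leader per split

def split_for_key(pk):
    """Return the index of the split that owns primary key pk."""
    return bisect.bisect_right(split_starts, pk) - 1

def leader_for_key(pk):
    """Return the zone hosting the leader replica for pk's split."""
    return split_leaders[split_for_key(pk)]

# The configured node count is per zone, so the total number of
# servers is zones * nodes_per_zone (3 * 3 = 9 in the example above).
zones, nodes_per_zone = 3, 3
total_nodes = zones * nodes_per_zone

print(split_for_key(10), leader_for_key(10), total_nodes)   # -> 0 zone-a 9
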
  1. Bob writes X = 1
  2. Database replica Foo records this write and marks it using its local timestamp. Let’s say that, according to its local clock, server Foo recorded this write as happening at time=3.
  3. Foo responds back to Bob saying the write was applied successfully.
  4. Bob tells Alice that his write was applied and she goes ahead and issues a query to the database to write X=2.
  5. Alice’s query arrives at server Bar instead of server Foo.
  6. Server Bar applies Alice’s write and marks it with a timestamp based on the server’s local clock. Let’s say time=2. Note that since Foo’s clock and Bar’s clock will not match exactly, it is very possible that Alice and Bob observe Bob’s write happening first while Alice’s write is marked with an earlier database timestamp.
  7. Alice gets back a 200.
  8. Servers Foo and Bar do asynchronous replication between themselves and see a conflict for the value of X. Now what value for X should the database set? Since the clocks on Foo and Bar do not match exactly, it’s impossible to determine which event happened first. The small simulation after this list shows the anomaly in action.
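Here is a tiny Python simulation of the anomaly in the steps above. The skew values and server names are made up; the point is that each server stamps writes with its own local clock, so last-writer-wins replication can pick the wrong value.

import itertools

real_time = itertools.count(start=100)   # an ideal wall clock the servers cannot see

class Server:
    def __init__(self, name, skew):
        self.name, self.skew, self.log = name, skew, []

    def write(self, key, value):
        # Each server stamps the write using its own (skewed) local clock.
        local_ts = next(real_time) + self.skew
        self.log.append((local_ts, key, value))
        return local_ts

foo = Server("Foo", skew=+2)    # Foo's clock runs a little ahead
bar = Server("Bar", skew=-3)    # Bar's clock runs a little behind

t_bob = foo.write("X", 1)       # Bob writes first in real time
t_alice = bar.write("X", 2)     # Alice writes second, after seeing Bob's ack

# Asynchronous replication later merges the logs by timestamp.
# Last-writer-wins keeps Bob's X=1 because Bar's clock was behind,
# even though Alice's X=2 really happened later.
merged = sorted(foo.log + bar.log)
print(merged[-1])               # -> (102, 'X', 1)
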
  • TrueTime is supported using fancy, expensive hardware. The reason Spanner cannot simply be an open source project that runs on arbitrary hardware is that it has a hard dependency on expensive atomic clocks.
  • Spanner can be a fast database as long as TrueTime does not report much uncertainty, because Spanner basically slows down in order to out-wait clock uncertainty. That means anyone could build Spanner using nodes that had some guaranteed amount of clock synchronization. The tricky part is that most clock synchronization methods only keep clocks within a few hundred milliseconds of each other, and that level of uncertainty would be far too long to simply wait out. Through the fancy hardware behind TrueTime, Spanner manages to keep the clock uncertainty well under 10 ms (and it seems like it’s much lower than this). The sketch below shows how waiting out the uncertainty works and why its size maps directly to commit latency.
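The sketch below illustrates the principle, not Spanner's implementation: a TrueTime-style clock returns an interval rather than a point, a commit timestamp is chosen at the top of that interval, and the writer then waits until the timestamp is guaranteed to be in the past. The 7 ms and 200 ms uncertainty figures are assumptions chosen only to show why the width of the interval drives commit latency.

import time

def make_tt_now(epsilon_s):
    """Build a TrueTime-style clock whose answers are intervals of width 2 * epsilon_s."""
    def tt_now():
        t = time.time()
        return t - epsilon_s, t + epsilon_s   # true time is guaranteed to lie inside
    return tt_now

def commit_wait(tt_now, commit_ts):
    """Block until commit_ts is guaranteed to be in the past on every correct clock."""
    while tt_now()[0] < commit_ts:
        time.sleep(0.0005)

for label, epsilon in [("tight sync, ~7 ms", 0.007), ("loose sync, ~200 ms", 0.200)]:
    tt_now = make_tt_now(epsilon)
    start = time.time()
    _, latest = tt_now()
    commit_wait(tt_now, latest)               # wait out roughly 2 * epsilon per commit
    print(f"{label}: waited ~{time.time() - start:.3f}s")
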
CREATE TABLE Employee (
  ID INT64 NOT NULL,
  Name STRING(MAX)
) PRIMARY KEY (ID);
  1. The API layer will figure out which split ID=10 belongs to. Let’s just say it belongs to split 1.
  2. Then the API layer will look up the leader for split 1 and forward the write to that split leader.
  3. The leader starts a transaction and attempts to take out a write lock for the row ID=10. It is possible the transaction is blocked behind another in-progress transaction. It is also possible there is a deadlock, in which case a standard deadlock detection algorithm will run to detect and break it.
  4. Once the lock is acquired the leader calls TrueTime and gets an interval of time. The leader assigns the commit timestamp to be no less than the upper bound of that interval.
  5. The leader sends the transaction and timestamp to the replicas and waits to get a quorum.
  6. At this point the leader is waiting for two things: it is waiting to get a quorum back from the replicas AND it is waiting for enough time to pass to ensure that the selected commit timestamp is certainly in the past.
  7. Only after the leader gets the quorum and the commit timestamp is guaranteed to have passed does the leader actually do the commit.
  8. The leader responds back to the client indicating the write has been persisted. And the leader tells the replicas to commit.
  1. The write is blocked until a quorum is achieved. This differs from most NoSQL systems which do non-blocking asynchronous replication.
  2. The write is blocked until Spanner can be certain the commit timestamp is in the past. This ensures that any transaction the user can observe as coming after the current transaction will receive a later timestamp. The sketch below puts both blocking conditions, and the rest of the write path, together.
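Here is a single-process Python sketch that puts the write path together. Locks, replication, and TrueTime are all simulated in-process, and the class and method names are inventions for illustration; only the ordering of the steps (lock, pick the timestamp at the top of the interval, replicate, wait for quorum and for commit-wait, then commit and ack) mirrors the description above.

import time, threading

EPSILON_S = 0.005                        # assumed clock uncertainty

def tt_now():
    t = time.time()
    return t - EPSILON_S, t + EPSILON_S

class Replica:
    def apply(self, pk, value, ts):
        return True                      # pretend the mutation was durably staged

class SplitLeader:
    def __init__(self, replicas):
        self.replicas = replicas
        self.row_locks = {}              # pk -> threading.Lock

    def write(self, pk, value):
        lock = self.row_locks.setdefault(pk, threading.Lock())
        with lock:                                        # step 3: take the write lock
            _, latest = tt_now()
            commit_ts = latest                            # step 4: ts >= interval upper bound
            acks = 1 + sum(r.apply(pk, value, commit_ts)  # step 5: send to replicas
                           for r in self.replicas)        #         (+1 for the leader itself)
            assert acks > (1 + len(self.replicas)) // 2   # step 6: quorum...
            while tt_now()[0] < commit_ts:                #         ...and commit wait
                time.sleep(0.0005)
            return commit_ts                              # steps 7-8: commit and ack the client

leader = SplitLeader([Replica(), Replica()])
print(leader.write(10, "Bob"))
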
  1. The API layer would receive the request and look up the split that ID=10 belongs to. Let’s say the row is still in split 1. Note that the boundary between splits can change (Spanner handles this automatically). But for the sake of simplicity let’s just say the row has stayed in split 1 since Bob’s write.
  2. The API layer looks up the replica best suited to service the read. Any replica can service the read but some of the factors the API layer considers in determining which replica to service the read from are load and physical distance. Once the API layer has selected a replica it forwards the read request to that replica.
  3. The replica could be the leader or any other replica. The question the node needs to answer at this point is, “Am I up to date enough on this split to service this read?” If the node is itself the leader, then it is definitionally up to date with the split because all writes go through the leader. Therefore let’s ignore the case of reading from the leader because it’s trivial.
  4. Assuming we are reading from a non-leader replica, the replica needs to determine if it is up to date enough to service the read. Therefore a time of read needs to be selected from TrueTime. Note that by the invariant of write timestamp assignments, the selected read timestamp will be greater than the timestamp of any previously committed transactions. In other words the timestamps represent a global ordering such that the read timestamp is greater than all write timestamps that happened before it. Once the replica has the time of read from TrueTime, the replica reaches out to the leader and asks “hey, leader do I have all the committed transactions at least up until this timestamp?” The leader will either say “yup you got all the commits that happened strictly before that time” or “nope, there are some transactions that have been committed with an earlier timestamp you do not yet have.”
  5. If the replica learns from the leader that it is already sufficiently up to date, the replica simply does the read, responds back to the API layer, and the API layer sends the response back to the user.
  6. If, however, the replica learns from the leader that it is not up to date enough (i.e. some transaction with a timestamp less than the selected read timestamp has already committed but has not yet reached this replica), then the replica simply waits until it receives the commits it is missing. Once it’s caught up it services the read. The sketch after this list shows the shape of that check.
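The sketch below shows the shape of that read path in Python. The “am I caught up?” exchange with the leader is reduced to a single question (“what is the newest commit timestamp below my read timestamp?”), which is a simplification of my own, not Spanner’s wire protocol.

import time

EPSILON_S = 0.005                        # assumed clock uncertainty

def tt_now():
    t = time.time()
    return t - EPSILON_S, t + EPSILON_S

class Leader:
    def __init__(self):
        self.commit_log = []             # committed timestamps, in order

    def latest_commit_before(self, read_ts):
        """Highest committed timestamp strictly below read_ts (0.0 if none)."""
        return max((ts for ts in self.commit_log if ts < read_ts), default=0.0)

class ReadReplica:
    def __init__(self, leader):
        self.leader = leader
        self.applied_ts = 0.0            # newest commit this replica has applied
        self.data = {}                   # pk -> value, maintained by replication (not shown)

    def read(self, pk):
        _, read_ts = tt_now()                               # pick a read timestamp
        needed = self.leader.latest_commit_before(read_ts)  # "do I have everything before read_ts?"
        while self.applied_ts < needed:
            time.sleep(0.001)            # not caught up: wait for replication to deliver the commits
        return self.data.get(pk)         # caught up: serve the read locally

replica = ReadReplica(Leader())
print(replica.read(10))                  # -> None in this empty sketch
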
  • The data flow of point writes
  • The data flow of point reads
  • The use of immutable multi-versions to provide lock-free read-only transactions
  • The client’s ability to read with bounded staleness (the multi-version sketch below illustrates both of these last two points)
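Those last two points both fall out of the same mechanism: every committed value is kept as an immutable version tagged with its commit timestamp, so a read-only transaction can pick a timestamp and read a consistent snapshot without taking any locks, and a bounded-staleness read simply lets the client accept a slightly older timestamp. A minimal multi-version store sketch in Python (the store layout is illustrative, not Spanner’s storage format):

import bisect

class MVCCStore:
    def __init__(self):
        self.versions = {}               # pk -> sorted list of (commit_ts, value)

    def write(self, pk, value, commit_ts):
        bisect.insort(self.versions.setdefault(pk, []), (commit_ts, value))

    def read_at(self, pk, read_ts):
        """Lock-free snapshot read: the newest version with commit_ts <= read_ts."""
        versions = self.versions.get(pk, [])
        timestamps = [ts for ts, _ in versions]
        i = bisect.bisect_right(timestamps, read_ts)
        return versions[i - 1][1] if i else None

store = MVCCStore()
store.write("X", 1, commit_ts=3)
store.write("X", 2, commit_ts=7)
print(store.read_at("X", read_ts=5))     # -> 1: the consistent snapshot as of ts=5
# A bounded-staleness read just means the client accepts any read_ts >= now - max_staleness,
# which lets a nearby replica serve the read without waiting to catch up to the newest commits.
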
  1. The API layer will get the query and will assign a time to the read based on the current TrueTime.
  2. The API layer will lookup the splits that the query involves and then lookup the corresponding split leaders.
  3. The API layer will send the query to some replica in split 0, some replica in split 1 and some replica in split 2.
  4. The replicas will service the read using the same approach as a read that is within a single split and the replicas will respond back to the API layer.
  5. The API layer will combine the results and return the combined result to the user (see the scatter-gather sketch after this list).
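A Python sketch of that scatter-gather flow follows. The splits are plain in-memory lists here and versioning is ignored; the important detail is that a single read timestamp is chosen up front and shared by every split, which is what makes the combined result a consistent snapshot.

import time

EPSILON_S = 0.005

def tt_now():
    t = time.time()
    return t - EPSILON_S, t + EPSILON_S

class Split:
    """Stand-in for one split's chosen replica; ignores multi-versioning for brevity."""
    def __init__(self, rows):
        self.rows = rows

    def read_all(self, read_ts):
        # A real replica would first confirm it is caught up through read_ts.
        return self.rows

def cross_split_read(splits, predicate):
    """Read matching rows from every involved split at one shared timestamp."""
    _, read_ts = tt_now()                    # step 1: one read timestamp for the whole query
    results = []
    for split in splits:                     # steps 3-4: each split serves its part of the read
        results.extend(row for row in split.read_all(read_ts) if predicate(row))
    return results                           # step 5: combine and return to the user

splits = [Split([{"ID": 1, "Name": "Ann"}]), Split([{"ID": 1001, "Name": "Bob"}])]
print(cross_split_read(splits, lambda row: row["Name"].startswith("B")))
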
  1. The API layer will figure out the splits and leaders involved.
  2. One of the leaders will be selected as the coordinator for the transaction. The leaders of other splits involved that were not selected as the coordinator are called participants.
  3. The coordinator will determine which mutations should be sent to which split leaders and the coordinator will send the mutations.
  4. The participants go ahead and follow roughly the same write protocol that would be followed for a single-split write; the only difference is that the participants cannot commit until they hear it’s okay to do so from the coordinator.
  5. Once a participant determines the mutation can be applied the participant responds back to the coordinator and says “yup, I can commit if you tell me to”
  6. Once the coordinator gets confirmation from all participants, the coordinator tells them all to commit and finally responds back to the user through the API layer (a compact sketch of this exchange follows the list).
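A compact Python sketch of that coordinator/participant exchange is below. Prepare and commit are plain method calls, failure handling and the per-split locking/quorum work are omitted, and the names are mine; it only shows the two-phase shape: everyone stages the mutation and votes, and nothing commits until the coordinator has heard yes from all of them.

class SplitTxnLeader:
    """A split leader taking part in a cross-split transaction."""
    def __init__(self, name):
        self.name, self.staged, self.committed = name, {}, {}

    def prepare(self, mutations):
        # Run roughly the single-split write path (locks, timestamp, quorum),
        # but stop short of committing until the coordinator says so.
        self.staged.update(mutations)
        return True                              # "yup, I can commit if you tell me to"

    def commit(self):
        self.committed.update(self.staged)
        self.staged.clear()

def coordinate(coordinator, participants, mutations_by_split):
    everyone = [coordinator] + participants
    # Phase 1: fan out the mutations and collect votes.
    ready = all(p.prepare(mutations_by_split[p.name]) for p in everyone)
    if not ready:
        return False                             # abort (rollback of staged state not shown)
    # Phase 2: every split said yes, so tell them all to commit, then ack the user.
    for p in everyone:
        p.commit()
    return True

leaders = {name: SplitTxnLeader(name) for name in ("split-0", "split-1")}
ok = coordinate(leaders["split-0"], [leaders["split-1"]],
                {"split-0": {10: "Bob"}, "split-1": {2000: "Alice"}})
print(ok, leaders["split-1"].committed)          # -> True {2000: 'Alice'}
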
  • Spanner has a philosophy that you don’t pay for what you don’t use. For example, if you can model your data such that you don’t need cross-split transactions, then you don’t pay the latency penalty associated with 2PC. So you can think of Spanner as a toolbox that offers very powerful primitives; some of the primitives are more latency-expensive than others, but you only incur the latency hit for the primitives you decide to use. This à la carte approach to software is something I really like.
  • Spanner started by having a great customer to work with and building a product for that customer. In the case of Spanner, that customer was Google’s Ads team. Generally I think this framing of really valuing your customers as an opportunity to learn how to build a great product is very valuable.
  • Spanner convinced me that the CAP theorem is close to being a useless tool for actually understanding real systems. While the CAP theorem is true, Spanner convinced me that it is far too simplistic for reasoning about real-world systems. Instead I would prefer to ask questions like, “What options do users have to trade off latency for data consistency?” or “In the case of a network partition what does your system do?” These questions are certainly highly related to the CAP theorem, but I think simply labeling a system AP or CP is not actually that useful for real-world systems; more info is needed.
