In 2012 Google published a paper on a new database it had built named Spanner. Spanner offered a collection of features that no database before it had offered. It was groundbreaking. To understand the true impact of Spanner and the problem it solves, we have to go back in time to the creation of databases. So without further ado — let us embark on a brief history of databases.
In 1970 Edgar F. Codd published a paper on the relational database model. This paper presented a brilliant idea: organize data as a collection of related tables. In 1979 the first commercial relational database product entered the market. Throughout the 1980s and into the 1990s there was an explosion of relational database products. A big reason developers latched onto these relational databases was that they provided a standard language for data access (SQL) and they formalized a set of guarantees that the database would provide (ACID).
Developers were happily using these relational databases until the early 2000s. But around the turn of the millennium the volume of data started exploding — and developers had a rude awakening. Relational databases had been designed to run on single machines, not clusters of machines. As data grew larger and larger it became simply impossible to support some use cases with relational databases. This gave birth to a new class of databases known as NoSQL. These databases were designed to scale horizontally and automatically recover from node failures. This new class of database was critical in managing the increased data volume applications were dealing with.
While these NoSQL databases were able to scale out, they did not offer some of the primitives that relational databases did. Developers had come to know and love primitives such as table joins and transactions — but these primitives were often left out of or weakened in NoSQL databases. The reason for leaving them out is that providing ACID-compliant transactions in a globally replicated, fault tolerant, low latency database is VERY hard. These NoSQL databases decided to trade away consistency in exchange for low latency and fault tolerance.
So by 2010 there were basically two classes of established databases: relational databases and NoSQL databases. The following chart provides a simplified view of what each of these classes of databases offered:
While this chart is likely overly simplistic, it demonstrates a key point — each of these classes of databases offered something valuable that the other did not have. From this insight a simple question arises — what about applications that need all of these features?
Google’s ads platform was a prime example of an application that needed all of these features. It required massive write throughput, it needed to scale out with Google’s business growth, AND it needed strongly consistent transactions.
Ultimately Google’s ads platform was the catalyst that resulted in Spanner’s creation. The pitch for Spanner was that it offered the horizontal scalability and fault tolerance of NoSQL databases along with the strong ACID-compliant transactions of relational databases. Before we dive deeper into Spanner, let’s take a step back and understand what Google was using before Spanner — and why it ultimately was not good enough.
Before Spanner, Google’s ads platform was built on top of a sharded MySQL database. It is worth understanding what a sharded relational database is because it was, and still is, a common pattern used to scale out a relational database. The following diagram shows what a sharded relational database looks like:
In a sharded relational database the application provides some shard_id with each query. This shard_id is mapped to a single database shard, and each database shard is a logically separate relational database. As long as transactions only operate on data within a single shard, they get all the goodness of ACID transactions and joins. This works because each database shard is owned by a single host. In essence, a bunch of logically separate databases are created and distributed over a collection of hosts, and the application structures its queries such that every transaction touches data in only one shard. This enables scaling out the database by simply adding more hosts. To see why, imagine you have 100 database shards and 10 hosts, so each host owns 10 shards. If you double the number of hosts to 20, each host now owns only 5 shards. Each shard still belongs to a single host — so providing transactions within a shard is still trivial, but each host now has half the work to do.
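The shard and host math above can be sketched in a few lines of Python. This is a toy illustration — the modulo hash function and the counts are assumptions for the example, not Google's actual scheme:

```python
# Toy sketch of shard-to-host mapping in a sharded relational database.
# The hash function and shard/host counts are illustrative assumptions.

NUM_SHARDS = 100

def shard_for(shard_id: int) -> int:
    """Map an application-provided shard_id to one logical shard."""
    return shard_id % NUM_SHARDS

def host_for(shard: int, num_hosts: int) -> int:
    """Each shard is owned by exactly one host."""
    return shard % num_hosts

# With 10 hosts, each host owns 10 of the 100 shards.
shards_per_host_10 = sum(1 for s in range(NUM_SHARDS) if host_for(s, 10) == 0)

# Doubling to 20 hosts halves each host's load to 5 shards.
shards_per_host_20 = sum(1 for s in range(NUM_SHARDS) if host_for(s, 20) == 0)

print(shards_per_host_10, shards_per_host_20)  # 10 5
```

Note that adding hosts only redistributes whole shards; the number of logical shards stays fixed, which is exactly why a single hot shard cannot be helped by adding hosts.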
This sharded relational database model sounds pretty great but there are several issues with it:
- Hot Shard Problem: This system works well when work is balanced across the shards. But when a single shard ends up containing data that is used far more often than data in other shards, the host that owns that shard becomes overwhelmed. Notice that simply adding more hosts does not necessarily help, because that shard is still owned by a single host. This problem is known as the hot shard problem.
- Resharding: Resolving the hot shard problem may require resharding: changing the hash function or the application so that data is distributed differently. The problem is that resharding entails a long, expensive data migration. The last time Google resharded its ads database, the migration took two years!
- Application Aware Sharding: The final issue with this setup is that it puts extra responsibility on the application to figure out how to shard data. The application has to be aware of sharding and structure its table design and query patterns around it. For some applications sharding fits very naturally and it’s not a big issue — but for other applications it can be very burdensome.
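To see why resharding is so expensive, count how many rows change shards when the shard count changes. A quick sketch with made-up numbers (purely illustrative — real migrations are complicated by the fact that the data must move while the system stays online):

```python
# Illustration of why resharding is expensive: when the number of
# shards changes, most keys map to a different shard, so their rows
# must physically migrate. Key count and shard counts are made up.

def shard_for(key: int, num_shards: int) -> int:
    return key % num_shards

keys = range(10_000)
old, new = 100, 150  # e.g. growing from 100 to 150 shards

moved = sum(1 for k in keys if shard_for(k, old) != shard_for(k, new))
fraction_moved = moved / len(keys)
print(f"{fraction_moved:.0%} of rows must migrate")  # 66% of rows must migrate
```

With naive modulo sharding, a majority of rows land on a different shard after the change — every one of those rows has to be copied, all while the database keeps serving live traffic.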
Wow — that ended up being a not so short history. Let’s recap briefly.
- There were relational databases that worked great on single nodes. They provided ACID compliant transactions.
- As data exploded developers needed to make databases distributed.
- Making a distributed database support ACID transactions is super hard. So developers introduced weaker consistency models in exchange for getting fault tolerant, low latency distributed databases — NoSQL.
- SQL and NoSQL databases each lack features that the other offers.
- It is possible to horizontally scale relational databases using sharding but it presents several issues that need to be dealt with.
So Google found itself in a position where its sharded relational database was not working anymore, a relational database would not scale and a NoSQL database would not provide the type of consistency guarantees Google needed. From this place of need — Spanner was born. Spanner is a horizontally scalable, multi-versioned, distributed, fault tolerant, externally consistent, SQL based database. That is a lot of buzz words to basically mean it’s a database that offers the best of NoSQL AND the best of a relational database.
You should now understand what Spanner is and what problem it is meant to solve. You should also be thinking “Wow, Spanner sounds amazing — what’s the catch?” We will get to the downsides of Spanner later in the post — but for now, if you are impressed with what Google built and ready to dive deeper into the details of Spanner, then you are in a good place as a reader.
What is external consistency?
In the previous section I mentioned that Spanner is externally consistent. Unless you have read about Spanner before, you will not know what this term means. Understanding what external consistency is, is critical to understanding why Spanner is such a compelling database option. To understand it, we will first talk about the guarantees provided by traditional relational databases that run on single machines, and then turn our attention to the guarantees provided by databases that replicate data to multiple machines.
Before diving in I just want to call out that the distinction between relational and NoSQL is not the same as the distinction between replicated and non-replicated — relational databases can be set up to do replication. A database is considered replicated simply if it stores the same piece of data on multiple machines. While being a relational database does not imply being non-replicated, the classical literature around relational databases typically assumes the database is running on a single host (i.e. it is not replicated). Therefore we will use classical relational database literature to talk about the guarantees of a relational database that is NOT replicated. So without further ado, let’s get started understanding external consistency.
Relational databases define a set of guarantees that can be summarized by the acronym ACID — Atomicity, Consistency, Isolation, Durability. There are tons of great resources on ACID so I will not cover it in depth here. For our discussion we only care about the isolation component of ACID.
The isolation guarantee describes the extent to which concurrently running transactions can perceive the effects of other concurrently running transactions. The isolation guarantee does not actually describe a single guarantee but instead refers to a spectrum of possible isolation levels a database can provide. On one side of the scale the database will be configured to favor concurrency of execution and on the other side of the scale the database will be configured to favor isolation. Concurrency and isolation are inherently at odds with each other and there is no right answer for which one to favor — different applications will need to set this tradeoff differently.
The strictest of the isolation levels is serializability. At this level of isolation the database user is given the illusion that transactions run in a serial order. They do not actually run serially — databases typically use two-phase locking to provide serializability while still allowing some level of concurrency.
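The core idea of two-phase locking can be sketched with a toy lock table (illustrative only — no production engine looks like this): a transaction acquires locks as it touches rows (the growing phase) and releases them all only at commit (the shrinking phase), which is what forces an apparent serial order:

```python
# Toy sketch of two-phase locking (2PL). Locks are acquired as rows
# are touched (growing phase) and released only at commit (shrinking
# phase). Illustrative only, not a production implementation.

class LockTable:
    def __init__(self):
        self.owner = {}  # row key -> transaction id holding the lock

    def acquire(self, txn: str, key: str) -> bool:
        """Grant the lock if it is free or already held by txn."""
        holder = self.owner.get(key)
        if holder is None or holder == txn:
            self.owner[key] = txn
            return True
        return False  # conflict: in a real engine, txn would wait here

    def release_all(self, txn: str):
        """Shrinking phase: drop every lock held by txn at commit."""
        self.owner = {k: v for k, v in self.owner.items() if v != txn}

locks = LockTable()

# Transaction A locks row0 and holds the lock until it commits.
assert locks.acquire("A", "row0")

# Transaction B conflicts and must wait -- this blocking is what
# forces the two transactions into an apparent serial order.
b_got_lock = locks.acquire("B", "row0")

locks.release_all("A")                          # A commits
b_got_lock_after = locks.acquire("B", "row0")   # now B may proceed

print(b_got_lock, b_got_lock_after)  # False True
```

Because B cannot touch row0 until A has released every lock, the interleaved execution is indistinguishable from running A entirely before B.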
While serializability is a strong level of isolation, it still allows for some unexpected behavior. This is because the serializability isolation level does not prescribe any specific serial order for transactions. It is therefore valid for the database to run transactions in a serial order that differs from the user-visible order. The following example demonstrates this undesirable behavior:
// Starting DB state:
//   ID | Count
//    0 | 1

// Time = 0
Start Transaction A;
Transaction A: Update Count=5 where ID=0;
Commit Transaction A;

// Time = 10 (on the same host that ran the above)
Start Transaction B;
Transaction B: Select Count where ID=0; --> returns 1
Commit Transaction B;
Notice that from a user’s perspective it is super weird that Transaction B got a result of 1. The user saw that Transaction A completed, and only AFTER it completed did the user run Transaction B. Therefore, the user would expect Transaction B to see the write from Transaction A. While this is what the user expects, the serializability guarantee does not say anything about the user-visible order — it only says that the transactions appear to run in SOME serial order. In the above case that serial order is B and then A. So even though the user ran A and then B, it is completely valid for the database to run B and then A. In fact, under the literal definition of serializability it would even be valid for every read-only transaction to return the empty result set — doing so would NOT violate serializability.
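We can make this concrete with a tiny simulation (a toy key-value store, purely illustrative): both serial orders below satisfy serializability, but only one of them matches the order the user actually observed.

```python
# Toy demonstration that serializability permits multiple serial
# orders. db is a trivial dict acting as the table; illustrative only.

def txn_a(db):            # Update Count=5 where ID=0
    db[0] = 5

def txn_b(db):            # Select Count where ID=0
    return db[0]

def run_serially(order):
    """Run the transactions in the given serial order; return B's read."""
    db = {0: 1}           # starting state: ID=0, Count=1
    result = None
    for txn in order:
        out = txn(db)
        if txn is txn_b:
            result = out
    return result

# The user submitted A then B, so they expect B to read 5...
print(run_serially([txn_a, txn_b]))  # 5

# ...but the serial order "B then A" is equally valid under plain
# serializability, and there B reads the old value 1.
print(run_serially([txn_b, txn_a]))  # 1
```

Both runs are serial executions, so both satisfy serializability; only strict serializability rules out the second one.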
At this point, if you have used a relational database before, you are likely confused — saying something like, “But I have used a relational database and I never see this.” That is because the literature around ACID guarantees speaks about databases that run on single machines — the isolation levels do not consider data replication! When the database runs on a single machine, the order in which the database applies transactions matches the user-visible order without doing anything special. Relational databases implement serializability using two-phase locking, which on a single machine will by default produce a serial order that matches the serial order observed by the user. A serial order that matches the user-observable execution order is called strict serializability (closely related to linearizability). Strict serializability is the same thing as what Google calls external consistency.
Okay so at this point we understand what serializability is and we understand that if you get serializability on a single node database — then for all practical purposes you get strict serializability (external consistency) for free. But this story changes drastically once data starts to be replicated. Once the same piece of data is stored on multiple nodes a whole new set of semantics around consistency needs to be defined. In order to demonstrate this let’s look at the following diagram.
This diagram shows a database that replicates the data associated with the row where ID=0 to three nodes. It also shows three database users who each issue queries concerning the row with ID=0, and these queries arrive at different database nodes. Understanding what should happen in this case is super hard — what value should the reader see? Which of the writes should ultimately become the source of truth? Should both writes be able to go through if they are run at roughly the same time? As we can see, the complexity around data consistency gets much harder once data exists on more than a single node.
To address this complexity, the distributed systems community started to define terms like read-your-writes, prefix reads and causal consistency. These terms all aim to describe the guarantees provided by a database that replicates data.
Now that we understand the context behind isolation levels for single node databases AND we understand that replicating data introduces a whole new set of concerns, we are ready to define external consistency. External consistency is nothing more than strict serializability in a replicated database. What this means is that the same piece of data is stored on multiple machines, but the user perceives the database as though it were a single node database in which all transactions run in a serial order that matches the user-observable order. That is pretty freaking amazing! In case the amazement did not sink in, let me explain the external consistency guarantee another way. External consistency provides the perception that all transactions have a global ordering, that transactions are applied to the database in this global order, and that the global order is consistent with the user-visible ordering. No replicated database before Spanner had ever provided such a guarantee — it is stronger than serializability, AND Google managed to provide it in a replicated database!
Okay now I am done ranting about how amazing external consistency is. At this point you should understand what external consistency is and the background behind it. Also make sure to remember that Spanner managed to provide this guarantee while still providing horizontal scalability, fault tolerance and the other bits of awesomeness associated with NoSQL databases. At this point we should have our heads around what Spanner is — now we are going to turn to understanding how users interact with it.
How to start using Spanner?
Armed with the background of what Spanner is, we are now in a position to understand how users interact with it. I will not attempt to write a getting started guide for Spanner — Google already has great documentation around this. The purpose of this section is simply to give a brief overview of what it looks like to get started using Spanner.
To start using Spanner you will need a Google Cloud account. Spanner can only run on Google’s cloud due to its dependency on Google-specific hardware. Once you have a Google Cloud account you will create and configure a Spanner instance. The Spanner instance is a container of resources that will be used to run your database. When you configure an instance you have to select whether you want it to be regional or multi-regional. A regional instance stores all data within a single region and simply replicates between zones within that region. A multi-regional instance replicates between independent regions. Additionally, your instance configuration will need to declare the zones where data should be replicated to and the number of nodes in your instance. Defining the replica zones is useful because it enables you to locate your data near your customers.
Once you have an instance configured you can create your database inside that instance. Creating a database looks very much like it does in the traditional relational database world. You will define a strongly typed schema. The schema will define tables. The rows in a table will be uniquely identified by a primary key, and the data on disk will be laid out in primary key sorted order.
Once the instance, database and tables have been created you are ready to start reading and writing to your Spanner database. When clients read and write to the Spanner database they don’t need to think about the fact that the database is distributed or replicated. This has lots of implications:
- The clients do not need to worry about sharding. Spanner takes care of sharding automatically and automatically handles rebalancing.
- The clients do not need to worry about a shard getting hot. If a shard does get hot Spanner will automatically split the shard to distribute load.
- The clients do not need to worry about consistency of reads after writes — as we talked about clients get external consistency.
- The clients don’t have to worry about database host failures.
As your application’s scale increases you will need your database to be able to handle a larger working set of data and more transactions per second. This can be achieved simply by scaling up the number of hosts in your instance. Notice that this is horizontal scaling rather than vertical scaling.
At this point your application and database should be humming along.
What other cool features does Spanner provide?
Interleaved Tables: Interleaved tables enable defining a parent child relationship between two or more independent tables. This enables the client to control data locality between the rows in what would otherwise be totally independent tables.
Atomic Schema Updates: The database schema can be updated atomically. This guarantee relates to the guarantee of external consistency in that it is as though the schema update took place at a single point in time globally.
Stale Reads: By default reads will be done at the external consistency level. But for some use cases this high level of consistency is not needed. For use cases that are more latency sensitive and can tolerate some staleness of data Spanner offers timestamp bound reads. These reads enable the customer to read some data AND specify that they are okay reading data that is up to X seconds old. This will generally result in decreased read latency.
Strongly Typed Data: The schema in Spanner is strongly typed just like traditional relational databases.
Secondary Indices: Spanner automatically indexes the primary key of each table, but it also offers the ability to define secondary indices on non-primary-key columns.
Transactions: Spanner provides fully fledged transactions. These transactions can span multiple shards (the user does not need to think about sharding at all) and can include reads and writes. Spanner also offers read-only transactions, which do not need to be committed and do not require taking out database locks.
Commit Timestamps: Spanner enables adding a commit timestamp column to each row. The timestamp will reflect the commit timestamp of the last time that row was altered/created. The important thing to note here is that this commit timestamp matches the global ordering established by the external consistency guarantee. So if row A has an earlier timestamp than row B it is actually guaranteed that globally row B was written after row A.
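The ideas behind timestamp-bound (stale) reads and commit timestamps can both be sketched with a tiny multi-versioned store. This toy uses plain integer timestamps and is purely illustrative — Spanner's real implementation (built on TrueTime) is far more sophisticated:

```python
# Toy multi-versioned store sketching stale reads and commit
# timestamps. Timestamps are plain integers here; this is only an
# illustration of the concept, not Spanner's actual design.

import bisect

class MVStore:
    def __init__(self):
        self.versions = {}  # key -> sorted list of (commit_ts, value)

    def write(self, key, value, commit_ts):
        """Record a new version stamped with its commit timestamp."""
        self.versions.setdefault(key, []).append((commit_ts, value))

    def read(self, key, read_ts):
        """Return the latest value committed at or before read_ts."""
        hist = self.versions.get(key, [])
        i = bisect.bisect_right(hist, (read_ts, float("inf")))
        return hist[i - 1][1] if i else None

store = MVStore()
store.write("balance", 100, commit_ts=10)
store.write("balance", 250, commit_ts=20)

now = 25
print(store.read("balance", now))       # strong read -> 250
print(store.read("balance", now - 10))  # read up to 10s stale -> 100
```

A stale read simply picks an older read timestamp, which lets it be served from any replica that is caught up to that timestamp instead of waiting on the latest writes — that is where the latency win comes from.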
Okay Spanner is amazing, what’s the catch?
Spanner is an amazing database. It combines the best features of NoSQL and relational databases — but just like every piece of infrastructure it has its downsides.
Higher Write Latencies: Spanner’s write latencies are good, but they cannot compete with the fastest NoSQL databases. NoSQL databases that replicate data asynchronously simply have less work to do to process a write than Spanner does. If your application really does not need strong consistency guarantees, then a NoSQL database will likely give you better write latencies.
Expensive: Spanner depends on specialty Google hardware. It is therefore expensive to use.
Google Vendor Lock In: Spanner is only offered as part of Google’s cloud. This means if you use Spanner you bind yourself to Google as your vendor.
Wrapping up… What can we learn?
We have covered a lot of ground in this post. We started by going over the history of databases that led up to Spanner. We then did a deep dive into external consistency. After that we explored how users actually interact with Spanner and some of the features it provides. My hope is at this point you have a grasp on what Spanner is, what motivated its creation and an appreciation for how groundbreaking of a database it is.
There are a lot of things we can learn from Spanner’s creation — but the thing that sticks with me is understanding that Spanner was really motivated by a specific product need. Spanner was not developed in a vacuum for some theoretical product — it was developed to provide a database solution to Google’s ads platform. As I was researching Spanner there was a quote I came across from one of the creators — “Great infrastructure starts with a great product.” I think this is really true, and as I develop infrastructure I will try to keep a product driven mindset.
Well, that is all for now. This post really viewed Spanner through the lens of someone using it — the next post will be focused on how Spanner is built. It will answer questions like — How is automatic shard rebalancing done? Does Spanner violate CAP? How is external consistency provided? Why is TrueTime important?
Until next time… Andrew.