How does Bigtable’s Control Plane Work?

This post is going to cover the internals of Bigtable’s control plane. The next post in this series will cover the internals of Bigtable’s data plane.

This post will assume you already have an understanding of what Bigtable is and the user facing abstractions it provides. If you need more background in these areas you can checkout my previous Bigtable post here.

Let’s dive in…

Terms

The following terms need to be understood for the rest of this post to make any sense (they will be continually reenforced throughout the post).

  • Chubby: Chubby is Google’s version of Zookeeper. It is a strongly consistent store used to hold small pieces of metadata. The data is structured as a file system. Clients can use the files as locks and setup watches on directories. Chubby gets used for many things, including — group membership, distributed locking and config management.
  • Tablet: A tablet is a Bigtable term which refers to a contagious key range block. The data for a given table will be split across many tablets. A tablet can be uniquely referred to by its table name and key range end point.
  • Tablet Server: A node which is responsible for reading and writing to a collection of tablets. A single tablet server will be responsible for many tablets. No durable state is actually stored on tablet servers. All state for a tablet is recorded in GFS. So a tablet server acts as the access point to a tablet’s data but does not own the actual physical storage of the data.

With those terms in hand we are ready to dive into the control plane internals.

Control Plane

The control plane needs to deal with the following questions:

  1. How is the assignment of tablets to tablet servers stored?
  2. How and when does this assignment get updated?
  3. How do client requests get routed to the correct tablet server?

Let’s take these questions in order.

How is the assignment of tablets to tablet servers stored?

Tablets simply represent a contagious block of keys in Bigtable. In order for a tablet’s data to be access there must be a node which can read and write the data in the tablet. Tablet servers are the nodes which read and write data for tablets. A single tablet server will be responsible for owning multiple tablets.

Tablets are assigned to a single tablet server at any given time. This is important because having multiple tablet servers operating on a single tablet makes keeping the data in a tablet consistent very hard. Additionally assigning ownership of tablets to tablet servers enables the tablet server to hold in-memory state regarding the tablets it owns. If every tablet server could operate on arbitrary tablets then little to no in-memory space in tablet servers could be dedicated per tablet.

Now that we understand that each tablet is assigned to at most one tablet server, the natural question that follows is where/how is this mapping stored?

Fair warning — this is about to get complex …

The assignment of tablets to tablet servers is actually stored in tablets on tablet servers…. WHAT?!?!?

Thats right, the assignment of tablets to tablet servers is actually just stored as another table in Bigtable and that table itself is divided into several tablets which are spread out across tablet servers.

This table is referred to as the Metadata table. The Metadata table has keys of form <user_table_name>-<end_row_key> . The Metadata table contains values which include the address of the tablet server for the given key as well as some additional metadata. The following shows an example of a few rows of the Metadata table.

Metadata Tablet Sample Rows

These four Metadata table rows show that there are two tables foo and bar. Table foo has two tablets, one covers the row keys from the start of the key range up until bbb and the other tablet covers the keys from bbb — ccc . The bar table also has two tablets one tablet covers key ranges from the start of the range to 199 and the other tablet covers from 199 -500 . These tablets have been spread across three different tablet servers. Notice that node_19891 owns multiple tablets. While tablet servers node_23421 and node_92448 only own a single tablet. In practice a single tablet server would own hundreds of tablets.

The relationship between the Metadata table and user tables is modeled as a tree structure. Let’s take a look at the following diagram to illustrate what this looks like.

Metadata Table Location Structure

In order to understand this structure lets work from the leaf nodes to the root node. The leaf nodes of this tree structure are user tablets. These tablets hold the information for tables the user has created. These user tablets are assigned to some tablet server. The assignment of tablet to tablet server is stored in the metadata tablets. The metadata tablets are simple tablets which make up the data in the Metadata table shown above. Just like user tablets need to be assigned to a tablet server in order to be readable or writable, metadata tablets also need to be assigned to a tablet server to be readable or writable. The assignment of metadata tablets to tablet servers is stored in a special metadata tablet called the root metadata tablet. This root tablet is exactly like the metadata tablets except for it references other metadata tablets instead of user tablets. This root metadata tablet also needs to be assigned to a tablet server in order to be readable or writable.

There is a circular dependency in this modeling — tablet location is stored in tablets but tablets are not readable or writeable until they have been assigned to a tablet server. In order to break this circular dependency the root of the tree is a metadata file in Chubby which references the root tablet of the Metadata table. Through storing a reference to the root metadata tablet outside of any tablet the whole tree can be traversed simply starting from Chubby. This is how the circular dependency is broken.

Wow, that was dense let’s recap…

Chubby stores a reference to the root metadata tablet. The root metadata tablet keeps track of assignments from metadata tablets to tablet servers. Metadata tablets keep track of assignments from user tablets to tablet servers. User tablets store the actual data users read and write to the database.

How and when does this assignment get updated?

Now that we understand the structure that is used to store the mapping of tablets to tablet servers the question that arises naturally is — how does this structure get updated?

The Metadata table needs to get updated whenever the set of tablet servers or the set of tablets changes. More specifically the following events lead to changes in these sets:

  • A new table is created
  • A table is deleted
  • A tablet server dies
  • A new tablet server is added
  • Two tablets are merged
  • A single tablet is split

Bigtable uses a singleton leader node to update the Metadata table in response to these changes. Let’s work through these changes one at a time to understand the protocol that gets used to update the Metadata table.

Table Is Created: When a new table is created it will start as an empty table with a single empty tablet to back it. This tablet needs to be assigned to some tablet server. The leader node keeps track of in-memory a set of unassigned tablets. The leader continually searches the tablet servers it knows about to figure out if there exists a tablet server which is available and has enough space to handle the unassigned tablet. If/when the leader finds such a tablet server it will call the LoadTablet RPC on the tablet server and provide the tablet name. The tablet server will then load the tablet and start to be responsible for serving it. At this point the leader can update the Metadata table to reflect the fact that this new tablet has been assigned.

Table Is Deleted: When a table is deleted the leader will find all tablets for that table by scanning the Metadata table. The leader will then call UnloadTablet for the discovered tablets on each owning tablet server. The leader will then delete the records in the Metadata table.

Tablet Server Fails: When a tablet server starts up it creates a unique file in Chubby in a specific directory. There are a few things that should be noted about this

  • The directory where tablet servers create this file in Chubby is used by the leader to discover the set of tablet servers which exist.
  • The tablet server holds a lock on the file it created at startup and periodically refreshes its ownership of this lock. Chubby will release the lock if the tablet server fails to update the lock within some TTL. This enables the file to be used as a liveness detection mechanism.
  • This file is only created once when a tablet server starts up and will be unique for every tablet server start up.

The Bigtable leader gets the list of tablet servers from this directory and caches those members in memory. The leader then goes over the full list of tablet server and sends an RPC to each of them inquiring into the state of their lock. Most of the time the tablet servers will respond back to the leader indicating they are alive and they have recently successfully updated their Chubby lock.

However sometimes the leader won’t get a response from a tablet server or the tablet server will indicate that it has not successfully updated its Chubby lock recently. In these cases the leader needs to continue forward to determine if the tablet server should be considered dead. The leader does this using the following protocol

  • Attempt to acquire the tablet server’s lock in Chubby.
  • If the lock could not be acquired, the leader assumes the tablet server is still alive and the leader will simply wait to check again until the next iteration.
  • However if the tablet server’s lock was acquired by the leader then the leader will consider the tablet server as dead. The leader will delete the lock in Chubby. After the lock is deleted the tablet server will never be considered eligible to server tablets again.
  • The leader will then continue to scan the Metadata table to find all tablets that were assigned to that tablet server and add those tablets to the unassigned tablet pool.
  • Those tablets will get reassigned to other tablet servers.
  • As these tablets are reassigned the Metadata table is updated by the leader.

New Tablet Server is Added: If a new tablet server is added that tablet server will create its unique file in Chubby. The leader will then discovery that tablet server and assign that tablet server ownership over some unassigned tablets. If there are no assigned tablets the leader can unload tablets from highly loaded tablet servers and load those tablets on the newly added tablet server.

Tablet Merge: If the leader detects that two tablets are small, the leader can merge those tablets together. Let’s refer to the original tablets as the source tablets and the new tablet as the target tablet (we will use this terminology here and later in this post).

  • The leader creates a target tablet which represents the merging of the two source tablets.
  • The leader triggers unloading of the two source tablets.
  • The leader adds the new target tablet to the unassigned pool such that it will be assigned to a tablet server.
  • The leader updates the Metadata table to reflect the new assignment.

Tablet Split: Tablet split is a bit different from tablet merge because the tablet server initiates the split rather than the leader. In the case of tablet split, the tablet server itself will update the Metadata table to reflect this split. After committing the split the tablet server will notify the leader of the split.

In the happy path, the leader will get this notification. However, in the less happy path this notification will be lost. If the notification is lost the leader will learn about the split lazily. The next time the leader makes a query to the source tablet the owning tablet server will notify the leader of the split.

And with that we have covered all the cases which lead to updates to the Metadata table. However there should be a big burning question at this point — What happens if the leader dies?

If the leader dies the following protocol will run

  1. When the leader first starts up it creates a unique leader file in Chubby. This file gets used for two purposes. Firstly, it is used to ensure only one leader exists at a time. Secondly, it operates as a liveness detection mechanism for the leader. Based on this file nodes can determine if there exist a live leader, or if the leader has died and needs to be replaced.
  2. If it was determined by some node that the leader needs to be replaced that new node will create its leader file in Chubby and become the new leader.
  3. The first thing the new leader does is read from the Chubby tablet server discovery directory to find all the tablet servers.
  4. The new leader then reaches out to all the tablet servers to learn which tablets are assigned to those tablet servers.
  5. The new leader then reads over the Metadata table to learn of any tablets which are unassigned. Unassigned tablets are tablets that exist in the Metadata table but were not discovered as part of step 4.
  6. The new leader will add those unassigned tablets to an in-memory pool for later assignment.

At this point the new leader is all brought up to date.

The highly astute reader will notice a problem in the above protocol. If the leader fails to discover the root Metadata tablet in step 4, then the leader will have no way to read the Metadata table. If this case arises then the leader will look to Chubby to learn about the root Metadata tablet. Once the leader has the root tablet the full Metadata tree can be traversed.

How do client requests get routed to the correct tablet server?

Clients keep track of a cache of tablet to tablet server assignment. This cache is not durably persisted and its fine if it is incomplete/wrong.

When a client makes a request to Bigtable its local cache is used to determine which tablet servers are responsible for the tablets involved in the query. The client will then reach out directly to those tablet servers.

It is possible that the client cache is wrong and the client gets a miss. If this happens the client will make requests traversing up the Metadata table until it finds the tablet to tablet server mapping it needs to satisfy its request.

So basically the client is making educated guesses about where data lives and if it’s wrong it fetches updated information from the server.

Bigtable introduces several optimizations to make this more efficient:

  • The Metadata table is actually stored both on disk and in-memory. This means reads from the Metadata table are very fast.
  • When the client gets a miss it does not just fetch the single row of the Metadata table it needs, instead it does a batch fetch. This increases the changes that on future calls it will have a local cached record for the tablets it cares about.

Summary

This post covered Bigtable’s control plane. We first learnt that Bigtable uses tablets and tablet servers to store the location information for user tablets. We learnt that this structure is modeled as a tree with a Chubby file as the root.

We then learnt that Bigtable uses a single leader to perform updates on the Metadata table when the set of tablet servers or tablets changes.

Finally we learnt that the client caches the Metadata table locally in order to make direct requests to tablet servers. We learnt that if the client has wrong or missing information it simply fetches new information from the server.

This post was pretty dense and covered a lot of ground. I hope that you found it educational. In the next post we will go over Bigtable’s data plane in detail.

Thanks for reading.

Senior software engineer with an interest in building large scale infrastructure systems.