How Does Bigtable's Data Plane Work?

Andrew Dawson
Nov 28, 2021 · 4 min read

This post is going to cover the internals of Bigtable’s data plane. If you have not followed along with my Bigtable series, there are a few posts you might want to check out before reading this.

This post will focus on the physical data storage for a single tablet. First we will cover the components involved in the data storage. Then we will cover how reads/writes are serviced. Finally we will go over Bigtable’s data compaction strategy.

Let’s dive in.

Storage Components

The following are the components involved in the physical storage of data for a Bigtable tablet.

  • GFS: GFS (the Google File System) is Google's distributed file system, used to store large blobs of data across many machines; it fills a role similar to the one HDFS or S3 fills today. All of Bigtable's persistent data is stored in GFS.
  • SSTable: An SSTable (Sorted String Table) is a standard file format for recording sorted key-value data to disk (in this case, GFS). Because the data is persisted in sorted order, it can be searched quickly. It is also important to note that these files are immutable, and that multiple SSTables can be merged together efficiently.
  • Memtable: A memtable is a sorted in-memory structure which holds key-value pairs. Because it is sorted, a memtable can be serialized directly to disk as an SSTable. Note that both memtables and SSTables are commonly used across many NoSQL databases; they are not specific to Bigtable. (A minimal sketch of both follows this list.)
  • Redo Log: A redo log is an append-only record of every mutation that has been committed to the database. Nearly every database has some form of redo (write-ahead) log.
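To make these pieces concrete, here is a minimal Python sketch of an SSTable and a memtable. These are toy illustrations with names I made up, not Bigtable's actual code; the real structures also carry timestamps, block indexes, and other metadata.

```python
import bisect

class SSTable:
    """Immutable, sorted run of key-value pairs (a toy stand-in
    for the real on-disk SSTable format)."""
    def __init__(self, items):
        # `items` must already be sorted by key.
        self._keys = [k for k, _ in items]
        self._values = [v for _, v in items]

    def get(self, key):
        # The sorted layout is what makes fast point lookups
        # possible; here that is a simple binary search.
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def items(self):
        # Iterate entries in sorted key order (used for merging).
        return zip(self._keys, self._values)

class Memtable:
    """Mutable in-memory map that can emit entries in key order."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def items_sorted(self):
        # Emitting entries in key order is what lets a memtable be
        # serialized directly into an SSTable.
        return sorted(self._data.items())
```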

Now that we understand what components exist, we are ready to see how they are used to service reads and writes.

Servicing Reads and Writes

The following diagram illustrates how reads and writes are serviced. It is important to remember that the client makes direct calls to tablet servers; there is no leader that orchestrates client calls across multiple tablet servers. The client simply asks a tablet server for data, and that tablet server is responsible for serving the data back to the user.

[Diagram: Serving Reads and Writes]

When a write operation arrives at a tablet server, it is first recorded in the redo log. Once the record has been durably written to the redo log, the write is considered committed. The write is then also applied to the memtable. Remember that the redo log lives in GFS (durable storage) while the memtable lives in memory.
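In sketch form, the write path looks something like this. The Tablet class and the tab-separated record format are my own toy stand-ins (reusing the Memtable sketch from earlier), not Bigtable's real interfaces:

```python
class Tablet:
    """Hypothetical container for one tablet's storage state."""
    def __init__(self, redo_log_file):
        self.log = redo_log_file    # append-only file in GFS (durable)
        self.memtable = Memtable()  # sorted in-memory structure
        self.sstables = []          # immutable on-disk runs, newest-first

    def write(self, key, value):
        # 1. Append the mutation to the redo log and force it to
        #    durable storage. Once this succeeds, the write is committed.
        self.log.write(f"{key}\t{value}\n")
        self.log.flush()
        # 2. Apply the mutation to the memtable so reads can see it.
        self.memtable.put(key, value)
```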

When a read arrives at a tablet server, the server constructs a merged view over the SSTables and the memtable in order to service the read. Since both the memtable and the SSTables keep their entries in sorted (lexicographic) key order, this merged view can be constructed efficiently.
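Here is a sketch of that merge, reusing the toy classes from earlier. The newest-first ordering and tie-breaking are my assumptions about how "newest wins" is resolved; Bigtable's real read path also deals with timestamps, column families, and so on.

```python
import heapq

def merged_read_view(memtable, sstables):
    """Yield (key, value) over all sources in sorted key order.

    Sources are ordered newest-first: the memtable, then SSTables
    from most recent to oldest. For duplicate keys, the newest wins.
    """
    sources = [memtable.items_sorted()] + [t.items() for t in sstables]
    # Tag each entry as (key, age, value). Each source is already
    # sorted by key and has a constant age, so each tagged stream is
    # sorted, which is exactly what heapq.merge requires.
    tagged = [
        ((key, age, value) for key, value in src)
        for age, src in enumerate(sources)
    ]
    last_key = object()
    for key, age, value in heapq.merge(*tagged):
        if key != last_key:  # first occurrence of a key is the newest
            last_key = key
            yield key, value
```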

That is all there is to serving reads and writes — it’s actually a very simple data access flow. Now let’s move on to compactions.

Compactions

The above strategy for servicing reads and writes works, but once the system has been running for a while we start to hit problems. The first is that the memtable grows without bound. This actually presents two problems.

  1. We will run out of memory on the tablet server.
  2. If the tablet server dies, the memtable has to be reconstructed on another tablet server. This is done by reading through the redo log and rebuilding the memtable's in-memory state from the committed records (sketched below). If the memtable is allowed to grow without bound, this recovery process will take a long time.
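Recovery is essentially a replay of that log. A sketch, assuming the toy tab-separated record format from the write-path snippet above:

```python
def recover_memtable(redo_log_file):
    # Rebuild the in-memory state by re-applying every committed
    # mutation in log order. The longer the log, the longer this
    # takes, which is why the memtable's size must be bounded.
    memtable = Memtable()
    for line in redo_log_file:
        key, value = line.rstrip("\n").split("\t")
        memtable.put(key, value)
    return memtable
```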

To resolve both of these issues, Bigtable bounds the size of the memtable. When the memtable reaches its size threshold, it is frozen and a new memtable is created to absorb incoming writes. The frozen memtable is then serialized into an SSTable and written to GFS. This is the first of Bigtable's compaction types; the Bigtable paper calls it a minor compaction.
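In sketch form, a minor compaction looks roughly like this (again with the toy classes; the real system writes a proper SSTable file format into GFS):

```python
def minor_compaction(tablet):
    # Freeze the current memtable and swap in a fresh one so that
    # incoming writes are never blocked.
    frozen = tablet.memtable
    tablet.memtable = Memtable()
    # Serialize the frozen memtable (already in key order) into a new
    # immutable SSTable and make it visible to reads, newest-first.
    tablet.sstables.insert(0, SSTable(frozen.items_sorted()))
    # The redo-log records covered by `frozen` are now durable in the
    # SSTable, so that prefix of the log can be discarded.
```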

This solves the memory explosion problem, but it introduces a new one: the number of SSTables grows without bound. That is a problem because every read now has to fan out across more and more SSTables, which slows reads down.

To keep the SSTable count in check, Bigtable performs two further kinds of compaction.

  1. Merging Compaction: a merging compaction reads a few SSTables (and possibly the memtable) and writes out a single SSTable that replaces them.
  2. Major Compaction: a major compaction rewrites all of a tablet's SSTables into exactly one SSTable. Because the output is complete, a major compaction can also drop deletion markers and reclaim the space taken by deleted data.

Merging and major compactions run as background processes; neither requires downtime.
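The merge itself is the same k-way merge used on the read path, just written back out as one new SSTable. A sketch, reusing the toy classes from earlier (a real implementation would stream blocks through GFS rather than materialize everything in memory):

```python
import heapq

def merging_compaction(sstables):
    """Merge several SSTables (newest-first) into a single SSTable.

    Duplicate keys collapse to their newest value, exactly as on the
    read path. A major compaction is the same operation applied to
    all of a tablet's SSTables at once.
    """
    tagged = [
        ((key, age, value) for key, value in t.items())
        for age, t in enumerate(sstables)
    ]
    merged, last_key = [], object()
    for key, age, value in heapq.merge(*tagged):
        if key != last_key:
            last_key = key
            merged.append((key, value))
    return SSTable(merged)
```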

That is it for Bigtable’s data plane. The next post in the Bigtable series will focus on optimizations done in Bigtable.

Cheers!

