Bigtable is a distributed key/value storage system built at Google in 2005. It is designed to scale to store billions of rows and handle millions of reads/writes per second.
This post will explore Bigtable’s data model, consistency primitives and common use cases. My following posts on Bigtable will explore its implementation and optimizations.
Let’s dive in.
A Bigtable instance consists of multiple tables, and each table consists of multiple rows. So far this is pretty standard, but once we dive into the structure of a single row things get a bit more complex. Let’s take a look at the following diagram to understand what a row looks like in Bigtable.
The above picture depicts just a single row. In this diagram the blue boxes indicate terms and the green boxes indicate example data. In order to understand the structure of a row, let’s take it term by term.
Row Key: As part of a table’s schema a row key must be defined. A row key uniquely identifies a single row within a table. In our example, employeeID was selected as the row key and we are looking at the row where employeeID=25.
Column Family: As part of a table’s schema column families must be defined. Column families are used to store buckets of related entities. Our example shows two column families: Contact Information and Manager Rating.
Column Qualifier: Within a family there can be arbitrary qualifiers. The qualifiers within a family should be related to each other, and they should be thought of as data rather than as part of the schema. Let’s take a look at the Manager Rating family in order to understand what is meant by this. Within the family of Manager Rating we have two qualifiers, one of which is 15. In this case these qualifiers represent the employeeIDs of managers who have been given a rating by the employee with employeeID=25. The mental model I found helpful for understanding qualifiers is to think of a column family as the name of a map, and of qualifiers as arbitrary keys within that map.
Cell and Timestamped Value: A row key, column family and column qualifier uniquely identify a single cell. A cell holds a collection of values. These values are organized into a map where the key is a timestamp and the value is a piece of user data.
You can conceptualize Bigtable as a collection of nested maps. The outermost one maps row keys to rows. Within each row there is a collection of column families, and each column family can be thought of as a map in its own right. A single column family map has keys, which are referred to as column qualifiers, and values, which are referred to as cells. The deepest nested map is the cell — which can be thought of as a map from timestamp to value.
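The nested-map view above can be sketched directly in code. This is purely illustrative — the row key matches our running example, but the qualifier values and timestamps below are made up, and the real system stores this data in on-disk structures, not in-memory dicts:

```python
# Toy sketch of Bigtable's nested-map data model:
# row key -> column family -> column qualifier -> timestamp -> value
table = {
    "employeeID=25": {
        "Contact Information": {
            # hypothetical qualifier and value
            "email": {1700000000: "jane@example.com"},
        },
        "Manager Rating": {
            # qualifier "15" is a manager's employeeID; "4" is a made-up rating
            "15": {1700000000: "4"},
        },
    },
}

def read_cell(table, row_key, family, qualifier):
    """Return the most recent value in a cell (highest timestamp wins)."""
    cell = table[row_key][family][qualifier]
    return cell[max(cell)]
```

Reading a cell then walks the maps level by level, which mirrors how a Bigtable read is addressed: `(row key, column family, column qualifier)` selects a cell, and the timestamp selects a version within it.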
Now that we understand the structure of a single row we can zoom back out and make a few more notes on the data model.
- A table is lexicographically sorted by row key. This enables schema designers to control the relative locality of their data by carefully selecting row keys.
- A single table is designed to have on the order of 100 column families. Within each column family an arbitrary number of qualifiers can be used.
- Bigtable is great at modeling sparse data because a column qualifier that is not specified takes up no space in the row. A typical Bigtable use case will therefore involve millions of unique qualifiers within a table, while each individual row remains small because it is sparse relative to the set of all column qualifiers in the table.
- All data is immutable in Bigtable. When a new record is written either a new qualifier is added to a family or a new timestamp is added to a cell — data is never modified.
- The timestamps in the cells can either be assigned by the user or assigned by Bigtable. If the user assigns timestamps it is the responsibility of the user to ensure the timestamps are unique.
- All data in Bigtable (with one small exception we will ignore) is simply strings.
Now that we understand the data model of Bigtable we will turn towards its consistency primitives.
Guarantees: Bigtable supports strong consistency within a single region and eventual consistency in cross-regional deployments. Strong consistency means that every read reflects all previously completed writes, so all readers and writers see the same view of the data.
Transactions: Bigtable does not support general purpose transactions. However, it does support single row transactions. A single row transaction enables reading and updating a single row as an atomic operation.
Intra-Row Guarantees: All updates to a single row are atomic. This is impressive and useful given the high cardinality of columns within a row.
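To make the single-row transaction contract concrete, here is a toy in-memory model that serializes all mutations to one row while leaving different rows independent. This illustrates the guarantee only — it is not how Bigtable’s servers actually implement it:

```python
import threading

class ToyTable:
    """Toy model of Bigtable's single-row transaction contract:
    operations on one row are atomic; there is no cross-row coordination."""

    def __init__(self):
        self._rows = {}              # row key -> {column -> value}
        self._locks = {}             # row key -> per-row lock
        self._meta = threading.Lock()

    def _row_lock(self, row_key):
        with self._meta:
            return self._locks.setdefault(row_key, threading.Lock())

    def read_modify_write(self, row_key, column, fn):
        """Atomically read one column of one row and replace it with fn(old)."""
        with self._row_lock(row_key):
            row = self._rows.setdefault(row_key, {})
            row[column] = fn(row.get(column))
            return row[column]
```

A classic use of this primitive is a counter increment: two concurrent `read_modify_write` calls on the same row cannot interleave, so no increment is lost — while rows with different keys proceed in parallel.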
Bigtable is great for use cases that have the following needs/attributes:
- Throughput: Requires high QPS for both reads and writes (Bigtable supports millions of reads and writes per second).
- Data Types: Data has high cardinality and is sparse (Bigtable supports hundreds of column families per row and an arbitrary number of qualifiers per family).
- Data Size: A massive amount of data needs to be stored (Bigtable can scale to billions of rows).
- Latency: Application needs low latency for reads and writes (Bigtable supports sub-10ms P99 latency for reads and writes).
A classic use case for Bigtable is storing sensor data and then running MapReduce jobs over it. This type of use case is a perfect fit for Bigtable because it typically requires high throughput and low latency. Additionally, the nature of this data is likely sparse with high cardinality.
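A common row-key pattern for such time-series data is to combine an entity ID with a reversed timestamp, so that each sensor’s readings cluster together and the newest reading sorts first. The sensor IDs and the timestamp bound below are hypothetical, chosen only to show the sorting behavior:

```python
MAX_TS_MILLIS = 10**13  # assumed upper bound on millisecond timestamps

def sensor_row_key(sensor_id, ts_millis):
    """Hypothetical row key: '<sensorID>#<reversed, zero-padded timestamp>'.
    Reversing the timestamp makes newer readings sort lexicographically
    first within a sensor's key range."""
    reversed_ts = MAX_TS_MILLIS - ts_millis
    return f"{sensor_id}#{reversed_ts:013d}"

keys = sorted([
    sensor_row_key("sensorA", 1_700_000_000_000),
    sensor_row_key("sensorA", 1_700_000_001_000),  # the later reading
])
# the later reading sorts first, so "latest N readings" is a prefix scan
```

With this layout, fetching the most recent readings for one sensor is a short range scan from the sensor’s key prefix, and a MapReduce job over all sensors is a full-table scan in key order.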
Within Google, Bigtable is or has been used for Google Earth, Search, Ads and more.
That is it for now. In this post we covered the data model of Bigtable, its consistency primitives and some classic use cases. The next post on Bigtable will dive into its implementation.
Until next time… cheers.