Vitess VIndexes — A Better Way to do Sharding?
One of the most magical pieces of Vitess is the clean abstractions it provides around sharding. Vitess enables the application to largely ignore the fact that data is sharded while still providing the flexibility to control how data get distributed between shards. In order to understand how Vitess managed this we will introduce a series of Vitess concepts that build upon each other. By the end of this post you will understand the key abstractions that Vitess uses to model sharding and why they are so powerful.
Every row in Vitess has a KeyspaceID. The KeyspaceID is used to determine which shard a row belongs in. It is like an address for the row. Superficially the KeyspaceID might look like a sharding key in the NoSQL world. But it is different in two key ways
- The KeyspaceID is computed when it is needed — it is not physically stored in the row.
- KeyspaceID is completely internal to Vitess — the application does not know or care about the KeyspaceID for a row.
So in Vitess you have a row which when run through some function gives a KeyspaceID and then that KeyspaceID is assigned to some shard.
Shards in Vitess are declared as ranges of KeyspaceIDs. So for example a shard could be named
40-80 that means this shard owns all KeyspaceIDs that are in the range of
40-80. The key observation to have here is that expressing shards this way decouples the value of the KeyspaceID from the shard the row belongs in. If shards are split or merged together the KeyspaceIDs do not need to change — they simply now map into different shards.
If you are interested to learn more about how shards are declared in Vitess you can read more here.
In Vitess the VSchema is a piece of config written by the Vitess user which declares how a sharded Keyspace will be sharded. Per table in the Keyspace, the VSchema will declare how rows in that table should be mapped to a single KeyspaceID.
Vitess offers the flexibility to control both the columns used to compute the KeyspaceID as well as the function that is applied over the columns. Most NoSQL stores offers the ability to select sharding columns but the actual function over those columns to compute the shard is hard coded within the database. Vitess offers a collection of functions as well as the ability to write new functions and plug them in. This offers increased flexibility for how data gets sharded.
Lets explain how the VSchema gets used by looking at how Vitess will route a query
- A query arrives at a VTGate host
- The VTGate host parses the query
- The VTGate host looks up the information in the VSchema relating to the tables that the query is operating on.
- VTGate uses the WHERE clause in the query and the information in VSchema to compute a KeyspaceID for the rows in the query.
- VTGate then looks up the shard(s) that own those KeyspaceIDs to figure out where to route the query to.
Now that we understand what the VSchema is let’s dive deeper into these sharding functions.
VIndexes are the actual functions that Vitess offers to map a set of columns to a KeyspaceID. VIndexes provide a method for taking a set of columns in a row and mapping them to one or more KeyspaceIDs.
You might be surprised to learn that VIndexes might map a row to more than one KeyspaceID. In the NoSQL world, a row is sharded and assigned to a single shard and that is the end of the story. But Vitess offers two types of VIndexes: unique and non-unique. A unique VIndex is simply one that given an input row a single KeyspaceID is produced. A non-unique VIndex produces one or more KeyspaceIDs for a row. Every table in a Keyspace must have a single unique VIndex. This unique VIndex is called the primary VIndex and it is very much like NoSQLs sharding keys.
Non-unique VIndexes are a lot like MySQL secondary indexes. In order to understand this lets look at an example of a table with the following schema
# MySQL Schema
CREATE TABLE person (
id UUID PRIMARY KEY,
);# Vitess VSchema
What is going on here?
- We declare a single MySQL table named person.
- Table person has two columns: id and name.
- We declare a VSchema which indicates that our Keyspace is sharded.
- The VSchema declares that we have a single table named person.
- We also define a primary VIndex on the person table which says use the column id and VIndex function
hashin order to get KeyspaceIDs for rows in the table.
Now suppose the following query arrives at VTGate
SELECT name WHERE id=0676b28-1637-11ec-9621-0242ac130002 This query is going to run quickly because VTGate can apply the VIndex function to the id column, get a KeyspaceID, lookup the shard that owns that KeyspaceID and then route the query to a signal shard — this query is fast because a signal shard can be determined based upon the query! But what happens if you want to write a query like
SELECT * WHERE name=Bob ?????
This query is a problem because there is no VIndex in the table that can help VTGate know how to route this query. So as of now the only solution would be to fan out to all the shards. But what if we had another VIndex that could take a tuple like
name=Bob and return a set of KeyspaceIDs? This would enable us to fan out only to the shards which are known to contain rows where
name=Bob This is the whole point of secondary (non-unique) VIndexes.
Secondary VIndexes help make queries more efficient because you don’t need to fan out to all shards. In the NoSQL world users must either do a full table scan or must provide the fully qualified shard key — there is no concept of partial shard fan out. With Vitess there are three options
- Target a single shard
- Fan out to a subset of shards by using a secondary VIndex
- Fan out to all shards
There is a lot more to know about VIndexes including
- The type of functions that are supported
- Functional vs lookup
- Sharded VIndexes vs non-sharded VIndexes
But covering all these details are out of scope here. You can read more about these details here.
Well that is all for Vitess sharding — lets just recap a little bit
- Vitess sharding introduces an abstraction between the physical row and the shard called KeyspaceID. Creating this separation enables hiding KeyspaceID from users and decoupling the KeyspaceID a row maps to from the shard the row ends up in.
- Vitess offers users the ability to select not only the rows that will be used to compute the KeyspaceID but also the function that will get used. Furthermore these are plugable.
- Vitess offers the concept of a primary VIndex and secondary VIndexes — secondary VIndexes help avoid total shard fan out.
- Vitess abstracts all this sharding logic into a single config file called the VSchema.
I think this is a very powerful and flexible way to operate on a sharded database — let me know what you think.