Understanding the Architecture of Vitess

A deep dive into Vitess architecture, query handling and cross regional deployments.

This is the third post in my Vitess series. This post is going to go over the architecture of Vitess, the major components and walk through how queries are handled. In case you missed the first two posts in the series you can check them out here:

Why Sharding Gets Hard, and Why You Need Vitess

An Overview of the Vitess Feature Set

Architecture Overview

Vitess is a cluster management solution for sharded MySQL databases. In this section we will incrementally show what the Vitess architecture looks like. First let us consider just two of the most critical Vitess components.

Here we see the two most important components of Vitess: VTGate and VTTablet.

VTGate is a collection of stateless nodes that act a frontend for users. All user queries are routed to an arbitrary VTGate host. That VTGate host inspects the query to determine which shards are involved in the query. VTGate then forwards the query to the applicable shard(s).

Each shard consists of N tablets, one of which is the leader. A tablet consists of two components which run on the same host — a MySQL instance and a VTTablet instance. The VTTablet is the Vitess component which is responsible for watching over the local MySQL. VTTablet can kill queries, ensure the MySQL instance is healthy, enforce limits on queries and report health back to VTGate. The VTTablet and the MySQL instance are always deployed together on a single host.

The above architecture diagram depicted VTGate and VTTablet but it was missing a metadata store. A metadata store is need in order to help answer the following types of questions:

  • How is a keyspace sharded?
  • What tablets exist for a shard?
  • Which node is the leader for a shard?

These pieces of information are stored in the topology service (metadata store). Let’s add this to our diagram.

The topology service needs to be a strongly consistent store that stores small amounts of metadata. The Vitess designers designed this component to be a plugin. It has implementations on top of Zookeeper, etcd and Consul. Any strongly consistent metadata store that supports watches can be used to implement topology service. VTGate pulls and writes information to the topology service.

A very critical piece of the Vitess architecture to understand is that the topology service is never used in the hot path for queries. It is accessed on node startup and it updated periodically as metadata changes. But it is NOT accessed per query — any shard routing information needed per query is cached by VTGate.

So far our architecture has been limited to a single failure zone. Vitess supports cross region deployments. Vitess has a concept called cells. A cell is a grouping of hosts that is considered to have a separate failure boundary from other cells. For example a deployment of Vitess might use AWS regions as Vitess cells. Let’s show what the architecture looks like once we expand to a cross region deployment.

This is a pretty crazy diagram — so lets break it down.

There are two types of topology services: local and global. The global topology service is not associated with any cell and it stores data that does not change very often. The local topology service is cell specific. The local topology service holds all the same information that the global topology service holds but it additionally holds information that is specific to its cell. For example the local topology service is responsible for holding host addresses for tablets in the current cell. When there is an update to the information in the global topology service that is pushed to all the cell local topology services. Lets make a few more notes about the relationship between cell local topology services and the global topology service

  • The global topology service should have nodes deployed in multiple separate failure zones. Therefore it can tolerate the outage of a full cell.
  • The components within a cell access only the local topology service — not the global topology service. If the local topology service goes down it only effects that specific cell.
  • The global topology service can go down for a while without impacting local cells.

Now let’s turn our attention to what the shards look like in the cross cell deployment. A single shard will span multiple cells. The whole point of cells is that they are separate failure domains — if all data for a single shard lived in a single cell it would defeat the purpose of cells. Just like the single cell deployment there is still only a single leader per shard across all cells. So if we had a deployment where a shard was backed by 5 nodes and was deployed across three cells one valid configuration would be to have 1 node in cell A, 2 nodes in cell B and 2 nodes in cell C. One of these nodes would be the leader.

There are two more components we are going to introduce as part of this architecture diagram. They are the vtctlclient and the vtctld server. These two components are simply a pair that forms an HTTP client and HTTP server that enables reads and updates to the topology service. Let’s add these pieces in.

Here we see that customers use the vtctlclient to make calls to the vtctld server. This client-server pair is how users can impact change in the cluster and read state in the cluster. Here are a few examples of what a user would do with vtctlclient

  • Trigger a reparenting so that a different node is the leader for a shard.
  • Read metadata regarding the tablets in a shard.
  • Look up which node is the leader for a shard.
  • Trigger a resharding workflow.
  • Mark a node as read only.

In summary Vitess has the following components

  • VTTablets: The data layer
  • VTGate: The frontend routing layer
  • vtctld and vtctlclient: HTTP client/server to interact with cluster metadata
  • Local/Global Topology Service: Cluster metadata store

Life of a Read and Life of a Write

When a user issues a query it will arrive at some VTGate node. The VTGate node will inspect the query and will inspect cluster metadata that it has cached from the topology server. Based on this inspection VTGate will determine if the query spans multiple shards or a single shard. There are several cases here:

Query is Read Only and Applies to Single Shard: In this case the VTGate can forward the query to any serving replica in the shard. The database does all the work and VTGate simply returns the result back to the user.

Query is Read+Write and Applies to Single Shard: This case is identical to the previous case except that the query must be forwarded to the leader for the shard. In a cross cell deployment this could involve making a cross cell call.

Query is Read Only and Applies to Multiple Shards: VTGate forwards the read only query to a replica in all the shards that are involved in the query (this could be all the shards in the cluster or a subset — but we we talk about that more in a future post). The VTGate gets the results back from the various shards, combines the results and returns them back to the user.

Query is Read+Write and Applies to Multiple Shards: Same as previous except all queries need to be forwarded to shard leaders.

If users are able to model their data access to avoid fan out queries that is ideal. When a query fans out to multiple shards the user takes a hit on two fronts:

  • Latency: The latency of the query will be bound by the slowest of all the shards involved.
  • Availability: If the availability for a single shard is 99.9% then a fan out query that hits 3 shards will have availability of 99.9% * 99.9% * 99.9% = 99.7%.

In addition to supporting fan out queries — Vitess also supports cross shard transactions. Although it supports this using a very slow 2PC protocol. Therefore it is highly recommended to avoid using cross shard transactions.

I hope you enjoyed this post and now have a better understanding of what the Vitess architecture looks like. I think there are a few learnings we can take away from this architecture

  • There are advantages of separating out a global metadata store from the region specific metadata store.
  • On the query hot path there should be as few components involved as possible. In the case of Vitess there are only two components on the query hot path.
  • Structuring the topology service as a plugin made Vitess more usable. Every company must have some Zookeeper like store — but not every company specifically uses Zookeeper. Implementing it as a plugin enables more companies to use Vitess.

My next post will go over the details of VIndexes and VSchemas. This will help us better understand how Vitess actually manages sharding.

Senior software engineer with an interest in building large scale infrastructure systems.