An Overview of Cadence

What problem does it solve?

Let’s imagine you want to build Uber Eats. There are fundamentally two separate concerns you are going to have to address

  1. How do you write the actual business logic? (e.g. how do you charge a credit card, how do you notify a driver, how do you figure out which driver to dispatch)

It is important to observe that while these concerns can be coupled together, they are logically distinct concerns. It is also important to note that getting the second concern right is hard, lets look at a few examples that demonstrate this

  1. Let’s say your business logic needs to call service Foo. So you write some code to call service Foo and run it on some machine. Now let’s say that machine dies before Foo is actually called. What do you do now? You need to have some other machine take over. This means you need a method to determine that the first machine died, and a method to rebalance the work that machine was doing to a new machine.

Solving problems like this are doable. Typically these problems are solved by some combination of the following tools

  1. Durable queues with a pool of workers polling off items from the queue and processing them.

While these methods do indeed enable building the type of long running applications we have been talking about, using these tools together correctly is hard. Importantly getting this orchestration right has absolutely nothing to do with the actual business logic you care about.

Cadence provides a set of primitives which make doing this type of orchestration much easier. Think of Cadence as a developer friendly SDK + managed service which abstracts away the common problems of micro-service orchestration. You write your business logic and Cadence handles the orchestration. More specifically Cadence handles things like the following

  1. Failure detection

How do users interact with Cadence?

Users of Cadence write a workflow using the Cadence client SDK. This workflow contains two types of code.

  1. Workflow Code: Workflow code is a representation of the orchestration steps that need to be preformed. The key feature of workflow code is it can be run an arbitrary number of times without producing any side effects, and always runs the exact same way. Another mental model to think about workflow code is it is a deterministic state machine. Such that some set of inputs (state transitions) are provided and the code applies those inputs to the workflow definition to reach a deterministic state.

The user of Cadence writes their workflow definition code and their activity code using the Cadence client SDK. The SDK is a fat client that does three things

  1. Provides a framework for users to define their workflows against.

So when users are defining their workflows they do not actually need to directly make calls to the Cadence server. They simply define their workflow using the Cadence client SDK and the fat client takes care of the rest.

In order to make this a little more clear let’s look at a sample workflow definition.

package helloworld

func init() {
workflow.Register(Workflow)
activity.Register(helloworldActivity)
}

func Workflow(ctx workflow.Context, name string) error {
ao := workflow.ActivityOptions{
ScheduleToStartTimeout: time.Minute,
StartToCloseTimeout: time.Minute,
HeartbeatTimeout: time.Second * 20,
}
ctx = workflow.WithActivityOptions(ctx, ao)

logger := workflow.GetLogger(ctx)
logger.Info("helloworld workflow started")
var helloworldResult string
err := workflow.ExecuteActivity(ctx, helloworldActivity, name).Get(ctx, &helloworldResult)
if err != nil {
logger.Error("Activity failed.", zap.Error(err))
return err
}

logger.Info("Workflow completed.", zap.String("Result", helloworldResult))

return nil
}

func helloworldActivity(ctx context.Context, name string) (string, error) {
logger := activity.GetLogger(ctx)
logger.Info("helloworld activity started")
return "Hello " + name + "!", nil
}

Here we can see the user is using the Cadence client to do the following

  1. Register a workflow definition and an activity definition.

At this point it should be clear what problem Cadence solves and how users interact with Cadence. The next section will explain how Cadence is built.

How is it built?

What does the lifecycle of a workflow look like?

In order to understand how Cadence is built, it is useful to understand what Cadence does to manage a workflow from workflow from start to finish. So let’s tell a little story about Bob writing a workflow to build Uber Eats and run an instance of that workflow.

  1. Bob decides he wants to use Cadence to build Uber Eats. He goes ahead and writes his workflow definition and activities using the Cadence client SDK.

And that is the story of how Bob ran instance Foo123 of the Uber Eats workflow from start to finish. Now that we understand how Cadence manages the lifecycle of a single workflow, let’s take a deeper dive into what the internals of the Cadence server look like.

What is the high level architecture of Cadence?

100k Foot View of Architecture

Cluster of Workflow Pollers: Collection of hosts managed by Cadence customer, which constantly issue long polls against the Cadence server to get notifications of when there are updates to apply to the workflow.

Cluster of Activity Pollers: Collection of hosts managed by Cadence customer, which constantly issue long polls against the Cadence server to get notifications when there are activities to run.

Frontend: A cluster of stateless nodes that handle all the types of tasks you would expect a frontend to handle: AuthN/AuthZ, rate limiting, request validation, etc… The frontend also serves as the routing layer by forwarding requests to the correct history/matching hosts.

History: A cluster of nodes organized into a consistent hash ring. Workflows are sharded onto this ring by workflowID. So a single shard contains many workflows, and a single host owns many shards. If a history host dies a gossip based protocol will be used to rebalance the shards to other nodes, those new nodes will load state from the database for the workflows they now own, and those nodes will start managing updates to the workflows. There is no durable state stored in the history nodes (that all belongs to the persistence layer) however the history nodes are stateful in the sense that certain workflows belong to certain history nodes.

Matching: A cluster of nodes organized into a consistent hash ring. The entities sharded onto the ring are called task lists. Task lists are basically just queues of tasks that workers are polling from. The details here are not that important — what is important to understand is matching is the component which owns the actual dispatching of workflow/activity tasks to the customer’s pollers. So when a long poll arrives at Cadence frontend, that call will be routed to the owning matching host which will manage the long poll and unblock it when a task becomes available for dispatch.

Worker: This is a collection of stateless nodes that preform background processing on behalf of the Cadence server. Examples of this include

  • Archiving old workflow histories

Sharded Database: Cadence is built to be able to run on top of different types of databases. Fundamentally the only thing Cadence needs from a database is the ability to support consistency between entities. In the NoSQL world this is done with conditional updates, and in the SQL world this is done with row level locking and transactions. In theory the database does not have to be sharded but you will quickly run into scaling problems if you cannot scale out the persistence layer. Additionally note that since matching and history components both use sharding, the Cadence data model fits nicely into using a sharded database.

What are the main components of the history service?

It should be apparent at this point that the history component of Cadence is where the real magic happens. As we have learnt Cadence fundamentally manages append only logs of workflows’ executions. Ensuring the correctness of that log and adding to it is owned by the history component.

The history component in Cadence is too complex to dive into all the details of it, however it’s worth understanding at a high level what it is doing. Basically the history component preforms the following steps

  1. Receives a batch of next steps from the customer’s workflow.

So if we really zoom out what history component is doing is writing updates to the workflow, creating tasks, processing those tasks and dispatching notifications to the workflow that some progress was made.

What else is part of Cadence?

Believe it or not, despite all the complexity was have already gone over — everything we have covered so far is just the barebones core engine of Cadence. There are tons of things built around this core engine. In this section I will just give a super fast overview of some of the other things that exist as part of Cadence.

  • Cross Region Replication: Cadence automatically handles replicating workflow state to remote regions. This enables failing over from one region to another without losing workflow progress.

Cadence offers much more than this feature wise. And within Cadence there are some pretty awesome optimizations, but I won’t spend too much time diving into all of these — just know that we have only scrapped the surface of Cadence’s internals and feature sets.

What can we learn from it?

In order to close I want to take some time to reflect on the higher level learnings we can take from Cadence.

From a technical perspective what did Cadence do well?

Value of Fat Client: The Cadence client SDK, as we have talked about, is very fat. It does caching, it provides a workflow execution environment, it abstracts away communication with the Cadence server and more. This fat client was totally necessary for Cadence. The reason the client being fat was a good thing is because the client-server interaction model for Cadence is super non-trivial. It would be unreasonable to expect the customer to get this interaction model correct. By having a fat client that complexity can be abstracted away. Additionally the fat client is needed to cache workflow state. If the client were a lightweight wrapper that just called into the Cadence server workflow state would have to be completely recreated every time there was an update — but by virtue of having a fat client the client can actually cache state and create a session with the Cadence server. This is a critical optimization within Cadence that is only possible by virtue of the client being fat.

Server Separation of Concerns: Cadence did a good job with the high level structuring of the server. As we discussed — it is broken up into four components which can be deployed and scaled separately. These four components each own a well defined set of responsibilities. The value of this separation is really two fold: (1) Separation of concerns helps manage complexity (2) Actually separating out different roles such that they can be deployed separately enables more fine grained scaling and service monitoring.

Start with Rock Solid Core: Cadence started by building a rock solid core engine. This core engine was composed of a few well defined building blocks. This core engine alone produced customer value. It also provided a foundation upon which more features could be added.

What are the weakness of Cadence?

Fixed Number of Shards: When a Cadence cluster is first provisioned, it is provisioned with a fixed number of shards. There is no way to update this number later without doing a complex database migration and taking downtime. This presents a problem because setting the original number of shards is a non-trivial problem. If the number is too low then the system will run into scaling bottlenecks. But if the number is too large then per shard overhead will significantly increase the cost of the system on the whole. I have seen that databases can split the shard space resulting in a doubling of the number of shards. It seems like something like this should be possible for Cadence.

Large Blob Sizes: When customers write their workflows they are subject to limitations around sizes of data they pass around. There are lots of limitations around blob sizes, and its not worth getting into the details, but just know that when customers write their workflow they have to be thoughtful of the size of data they pass around. In many cases this can make writing a Cadence workflow more awkward than writing “normal” code.

Customer Backwards Compatibility: Just like databases have backwards compatibility issues when a new schema is rolled out, Cadence customers also have to think about backwards compatibility when they rollout a new code definition for their workflow. As we know Cadence keeps track of the history of workflow execution. If new code is rolled out when there are running workflows in progress, it’s possible that those existing workflow histories cannot be applied to the new workflow code. This forces customers to do version checks which can get complex.

What non-technical things did Cadence do well?

Manage Velocity vs Debt Tradeoff: There is always a tradeoff between how fast a project is being developed and how much tech debt accumulates. A project with no tech debt is a poorly led project AND a project with so much tech debt its hard to implement anything new is also poorly led. There are lots of factors that go into deciding how to tradeoff velocity against tech debt. Features like the maturity of the project, the area of the trade off and the risk of the change are all considerations. Throughout the lifecycle of developing Cadence a good balance has been struck between these considerations.

Developer Friendly: Something Cadence really got right is putting the developer experience first. The Cadence team obsessed over making the client SDK abstractions clear and flexible. Something Cadence did a really good job of was not just blindly listening to single customers and implementing features based upon that single customer’s feedback, but instead trying to understand holistically the problems that were being faced by Cadence customers and designing clear abstractions to address those problems.

Easy to Start Small: One of the key attributes of Cadence that enabled it to spread so rapidly was the fact that it was very easy to start with a small use case and expand from there. Cadence has seen over and over that teams will test Cadence in some limited capacity and then once they discover its power they will use it in larger and larger ways. Cadence is very different from a database product in this way. Introducing a new database to an existing project requires a data migration. This is a huge effort and typically is treated as a one way door. Cadence is totally different than this — customers can try to use Cadence in some small part of their system and if it goes well they can consume it in more and more places.

I hope you enjoyed this overview of Cadence and that you learnt something valuable.

Senior software engineer with an interest in building large scale infrastructure systems.