Cohesion: Rethinking Workflow Development

Cohesion is a workflow system designed for a delightful developer experience and operational simplicity.

Why build a new workflow tool?

There are many workflow systems in wide use already, but they share two areas of weakness: operational complexity and developer experience.

Workflow engines are operationally complex

A workflow runs a set of tasks. A workflow system has to figure out which tasks to run when, actually run those tasks, and keep track of its state.

Consider just a few of the important parts of a workflow system:

[Figure: Workflow Scheduler, Persistent State, Task compute runtime]
  • A workflow engine, which contains a scheduler and keeps track of workflow state in some sort of persistent storage, either a database or a message queue.
  • A compute runtime for tasks from the workflow, usually a cluster of worker nodes.
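To make these parts concrete, here is a minimal sketch of a toy engine in Python. It is not how any real workflow system is built, and the names (`ToyEngine`, `ready`, `checkpoint`) are made up for illustration. The scheduler decides which tasks can run, an in-process function call stands in for the compute runtime, and a plain dict stands in for the persistent store.

```python
import json


class ToyEngine:
    """Toy workflow engine (hypothetical, for illustration only)."""

    def __init__(self, tasks, deps):
        self.tasks = tasks  # task name -> callable taking a dict of dependency results
        self.deps = deps    # task name -> list of task names it depends on
        self.state = {}     # task name -> result; stands in for persistent storage

    def ready(self):
        """Scheduler: tasks that have not run yet and whose dependencies are done."""
        return [
            name for name in self.tasks
            if name not in self.state
            and all(d in self.state for d in self.deps.get(name, []))
        ]

    def checkpoint(self):
        """Serialize state, as a real engine would before writing to a database or queue."""
        return json.dumps(self.state, sort_keys=True)

    def run(self):
        while True:
            runnable = self.ready()
            if not runnable:
                break
            for name in runnable:
                inputs = {d: self.state[d] for d in self.deps.get(name, [])}
                # "Compute runtime": here just an in-process function call.
                self.state[name] = self.tasks[name](inputs)
                self.checkpoint()
        return self.state


engine = ToyEngine(
    tasks={
        "extract": lambda inputs: [1, 2, 3],
        "double": lambda inputs: [x * 2 for x in inputs["extract"]],
        "total": lambda inputs: sum(inputs["double"]),
    },
    deps={"double": ["extract"], "total": ["double"]},
)
result = engine.run()
```

Even in this toy, every step depends on the state store being correct and quick to update, which is where the operational burden discussed next comes from.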

Most workflow systems come with lots of operational complexity.

For one, a cluster has to be set up and managed for running the tasks (often that’s Kubernetes, a complex thing in itself).

More importantly, workflow engines need reliable and fast storage for their state. If it’s not reliable, the user’s code must take on the complexity of handling that unreliability, and much of the point of using the workflow system is lost. And if it’s not fast, the workflow engine becomes a bottleneck, slowing down all workflows.

Storage with these requirements is complex to manage at scale (whether it’s in the form of a database or a message queue).

Finally, most engines also have fairly complex architectures and not enough observability built into their internals, making it hard to debug why something is slow.

Workflow developer experiences suck

The typical workflow development experience is not great, with two major problems: language, and testing.

Most workflow systems have a custom language, often in JSON or YAML. This works okay for “static” workflows with a few tasks; but for a non-trivial workflow with conditionals, loops, error handling, and more than a dozen or so tasks, YAML starts looking like a pretty terrible programming language.

(Aside about Airflow, since it uses Python: Airflow uses Python to create a static workflow DAG; so you don’t really get to use Python control flow constructs. And it relies heavily on templating so the programmer has to think carefully about “template application time” versus “run time”.)

Further, most workflows are a pain for a developer to test properly. Most systems try to get testing to work on a single developer machine, but workflows along with their dependencies often form too big a unit to test on a single laptop. And local setups for the workflow system and its dependencies are tricky to create and need lots of maintenance effort as complexity and needs grow.

Cohesion

We're building a workflow system to address these problems. Cohesion lets you:

  1. Build workflows in regular programming languages
  2. Test in the cloud in an isolated test account
  3. Deploy into AWS Step Functions, a serverless workflow system

1. Building Workflows in Code

Most workflow systems have a custom language to describe a workflow. These languages vary in what they’re capable of, from being restricted to simple task sequences, to DAGs, all the way to full Turing-complete languages. But writing a workflow is programming, and the best human interface for programming is a programming language.

So, Cohesion workflows are just regular code. For now, we focus on Python. Here's a small example:

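import cohesion  # tasks are referenced under cohesion.task

def myWorkflow():
    # Run the first task and keep its output.
    a = cohesion.task.myFirstTask()
    # Pass the first task's output to the second task.
    b = cohesion.task.mySecondTask(a)
    return b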

How does this workflow run? We need a workflow engine and a place to run tasks. We could build that, but then our first problem applies: workflow engines are hard to operate, and this difficulty is intrinsic to the problem. A shiny new workflow system won’t necessarily be any simpler to operate.

However, there is still a way to avoid operational complexity. We can use a managed workflow runtime: a high-level service API that accepts a workflow definition and runs it. Turns out, AWS Step Functions is exactly what we want.

Cohesion transforms Python code into the workflow language for AWS Step Functions, and a set of serverless functions. This transformation is separate from the running of the workflow — this means you run a "natively AWS" workflow, with nothing added at runtime.
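As a rough illustration of what such a transformation involves, the sketch below uses Python's standard `ast` module to turn the straight-line example above into a Step Functions-style state machine. This is not Cohesion's actual compiler. The function name `compile_linear_workflow` and the Lambda ARNs are placeholders, and it only handles a plain sequence of task calls.

```python
import ast
import json

SOURCE = '''
def myWorkflow():
    a = cohesion.task.myFirstTask()
    b = cohesion.task.mySecondTask(a)
    return b
'''


def compile_linear_workflow(source: str) -> dict:
    """Turn a straight-line workflow function into a state machine definition."""
    func = ast.parse(source).body[0]
    # Each `x = some.task(...)` assignment becomes one Task state,
    # named after the task function being called.
    names = [
        stmt.value.func.attr
        for stmt in func.body
        if isinstance(stmt, ast.Assign)
        and isinstance(stmt.value, ast.Call)
        and isinstance(stmt.value.func, ast.Attribute)
    ]
    states = {}
    for i, name in enumerate(names):
        # Placeholder ARN: REGION and ACCOUNT_ID are not real values.
        state = {
            "Type": "Task",
            "Resource": f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{name}",
        }
        if i + 1 < len(names):
            state["Next"] = names[i + 1]  # chain to the following task
        else:
            state["End"] = True           # last task ends the workflow
        states[name] = state
    return {"StartAt": names[0], "States": states}


definition = compile_linear_workflow(SOURCE)
print(json.dumps(definition, indent=2))
```

The sketch does not model how variables such as `a` flow between tasks. It leans on the Step Functions default that a state's output becomes the next state's input, which is enough for a simple chain but not for general programs.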

You can think of it as a Python compiler, except that it targets the Step Functions workflow runtime rather than a typical computer.

Cohesion lets you use the full set of Python control flow constructs: if-statements, loops, try/except blocks, and functions. All these constructs are transformed to the underlying workflow language.
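Here is a sketch of what a workflow using those constructs might look like. The task functions (`fetch_batches`, `process`, `notify`) are hypothetical local stand-ins so the example runs on its own. In a Cohesion workflow they would be task calls like the `cohesion.task` calls in the earlier example.

```python
# Hypothetical stand-ins for workflow tasks, defined locally so this runs as-is.
def fetch_batches():
    return [[1, 2], [], [3]]


def process(batch):
    if not batch:
        raise ValueError("empty batch")
    return sum(batch)


def notify(message):
    return message


def batch_workflow():
    totals = []
    for batch in fetch_batches():          # loop over task output
        try:
            totals.append(process(batch))  # task call that may fail
        except ValueError:                 # error handling
            totals.append(0)
    if sum(totals) > 5:                    # conditional branch
        return notify("large")
    return notify("small")


outcome = batch_workflow()
```

The control flow here is ordinary Python, so it can be read, reviewed, and refactored like any other function.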

2. Testing in the cloud

Testing code on a developer’s laptop is attractive. The “inner loop” of coding — change, test, repeat — thrives on fast feedback. Deploying to the cloud on every change seems like it would just slow things down when you can test in a local docker container.

But as the system being tested gets larger, local testing gets harder. If there are a few interconnected services, then a local setup requires some combination of multiple docker containers, a local DB, etc. After a point, the test setup is a non-trivial project in itself.

Workflows tend to be beyond the level of complexity you want to test within a single docker container — local testing tends to be a poor fit for them.

So: why not test in the cloud? We have to overcome two challenges: multiple developers have to be able to test in isolation from each other, and deploying changes to the cloud can be slow.

For isolation, Cohesion simply creates one AWS account per developer. AWS does not charge for the accounts themselves, only for the resources used in them, and accounts can be created fairly quickly. For deployment speed, Cohesion contains some optimizations to avoid any slow provisioning operations on many kinds of changes. (We'll dive into these details in a future post.)

All in all, you can get cloud testing for both task-level and workflow-level changes with less than a second of overhead using Cohesion.

3. AWS Step Functions

AWS Step Functions is a “serverless” workflow system. In the same vein as AWS Lambda, there’s no cluster to manage, just a higher-level service API. Give it a workflow definition, and it runs it. A cloud-managed workflow runtime means you can avoid the complexity of operating a workflow engine and its database. It also has fine-grained usage-based pricing: you pay only for workflow state transitions, and it’s free while the workflow is waiting.

The workflow definition language that Step Functions uses (the Amazon States Language) is a set of named "state" objects written in JSON. It's a flexible language: you can express arbitrary control flow (branching, loops, error handling, etc.) and you can manipulate data between workflow tasks.
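As an example of control flow in this language, the sketch below writes a simple polling loop as a Python dict that serializes to a Step Functions definition. The state names, the `$.status` field, and the Lambda ARN are placeholders chosen for illustration. The loop is formed by a Task state, a Choice state that checks the result, and a Wait state that jumps back to the Task.

```python
import json

# A polling loop as a state machine definition.
# State names and the Lambda ARN are placeholders, not real resources.
polling_loop = {
    "StartAt": "CheckJob",
    "States": {
        "CheckJob": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:checkJob",
            "Next": "IsJobDone",
        },
        "IsJobDone": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.status", "StringEquals": "DONE", "Next": "Finished"}
            ],
            "Default": "WaitBeforeRetry",
        },
        "WaitBeforeRetry": {"Type": "Wait", "Seconds": 5, "Next": "CheckJob"},
        "Finished": {"Type": "Succeed"},
    },
}


def transition_targets(state):
    """Return every state name that the given state can jump to."""
    targets = [state[key] for key in ("Next", "Default") if key in state]
    targets += [choice["Next"] for choice in state.get("Choices", [])]
    return targets


# Sanity check: every transition points at a state that exists.
for name, state in polling_loop["States"].items():
    for target in transition_targets(state):
        assert target in polling_loop["States"], f"{name} -> {target} is undefined"

print(json.dumps(polling_loop, indent=2))
```

In Python the same logic is roughly a two-line `while` loop with a `time.sleep(5)` in its body. Here it takes four states, and the loop exists only implicitly through the `Next` pointers.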

However, the language feels artificial as a programming interface for humans. Common patterns like loops are not simple to write. We think Step Functions is a great workflow runtime, but lacks a good programming interface. This is where Cohesion comes in: by compiling Python to Step Functions, we bring an excellent developer experience to an operationally simple workflow runtime.

Try it out!

Cohesion is an exciting new tool for workflows with dev and ops simplicity. We're rolling it out to a few initial users: we'd like to understand users better, and fine-tune our features based on user feedback before opening it up to everybody.

If you're interested in trying out Cohesion right now, we'd love to have you on board! Give us your email address below, and we'll send you an invite. We'll include $100 of cloud credits for each person who signs up, until we run out.

Sign up for an early invite