Leveraging AWS to Scale R&D Workflows
Originally posted to Indigo Ag’s Engineering Blog.
To identify and deliver commercially viable products for our growers, Indigo's Research and Development teams analyze bacterial and fungal microbes with bioinformatic analysis tools at scale. To deliver the scalability and efficiency required by the high-throughput nature of Indigo's R&D pipelines, the Biomation team has built cloud-native solutions on AWS.
The tools that Biomation builds infrastructure around are written by Indigo's data scientists, usually in the form of Python modules. These tools require varying amounts of compute resources, making it difficult and highly inefficient to run these pipelines locally: difficult because every tool requires special setup within whatever local environment it runs from (specific binaries, configurations, etc.), and inefficient because compute resources are limited to a single machine. By building cloud infrastructure for our bioinformatics tools, we can scale with workloads as necessary, ensuring that jobs run as efficiently as possible while removing the technical overhead of configuring tools to run on any specific machine.
We have a set of tools used by our R&D scientists known as the Genomic Pipeline. Microbes are passed through this pipeline after undergoing whole-genome sequencing; the pipeline preps them for the further research and analysis that eventually yields commercial products. The individual tools that comprise the Genomic Pipeline are Python scripts, some packaged as Python modules for programmatic access. Around each tool, we developed a microservices architecture focused on scalability, usability, and efficiency.
Architecting R&D Tools
At the macro level, Biomation needed to build an architecture around the Python tools produced by Indigo's data scientists that would increase the number of microbes that could run through the Genomic Pipeline. At the micro level, this meant ensuring that the tools were easy for our scientists to use, could flexibly scale to handle many microbes at once, and delivered easily accessible results in a timely manner. At Indigo, we like to move fast, so making the pipeline as efficient as possible was and remains a high priority.
To make the pipeline easier to use, we obviated the need for local installations and configurations by running the Python scripts within Docker containers. This lent us several advantages, chief among them only having to manage dependencies and various bioinformatic binaries for a single, portable environment. Each tool had its own Dockerfile that installed the dependencies required to run that tool successfully; when a container was spun up from the image generated by that Dockerfile, the Python tool running in that container would always have its necessary dependencies. All of our Docker images were pushed to AWS Elastic Container Registry (ECR), a central repository for our images in the cloud.
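As a sketch, a per-tool Dockerfile might look like the following. The tool name, module path, and the specific binary are hypothetical, not the actual contents of our images:

```dockerfile
FROM python:3.8-slim

# Bioinformatics binaries this hypothetical tool depends on
RUN apt-get update && apt-get install -y --no-install-recommends \
        samtools \
    && rm -rf /var/lib/apt/lists/*

# Install the tool's Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy in the tool itself and run it as the container's entrypoint
COPY genome_annotation/ ./genome_annotation/
ENTRYPOINT ["python", "-m", "genome_annotation"]
```

Because each tool gets its own image, a dependency change for one tool never risks breaking another.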
While Docker containerization helped manage wrangling dependencies, we still needed a way to run these containers in an environment that could scale with our workloads. AWS Batch provided the solution for this problem. Batch lets you define a compute environment with certain parameters and manages the overhead of spinning up compute resources as you submit jobs to the Job Queue. The compute resources spun up by Batch are AWS EC2 instances — the type of instance provided by Batch for a particular job is specific to that job. Jobs that require more compute resources will run on instances that can accommodate that need, all within the bounds of the parameters set in your Batch compute environment.
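To make this concrete, here is a minimal sketch of how a job submission payload for Batch can be built. The queue, job definition, and input names are illustrative, not our actual resource names:

```python
import json

def build_batch_job_request(tool_name, job_queue, inputs):
    """Build the request payload for a hypothetical Batch job submission.

    `inputs` is a dict of user-supplied parameters, surfaced to the
    tool's container as environment variables.
    """
    return {
        "jobName": f"{tool_name}-job",
        "jobQueue": job_queue,
        "jobDefinition": f"{tool_name}-job-definition",
        "containerOverrides": {
            "environment": [
                {"name": key.upper(), "value": str(value)}
                for key, value in inputs.items()
            ]
        },
    }

# In a real pipeline this payload would be passed to
# boto3.client("batch").submit_job(**request).
request = build_batch_job_request(
    "genome-annotation", "genomic-pipeline-queue", {"sample_id": "ABC123"}
)
print(json.dumps(request, indent=2))
```

Batch then matches the job definition's resource requirements against the compute environment and provisions an appropriately sized EC2 instance.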
With Batch managing compute resources and ECR holding our images, we had a cloud native, microservices architecture paradigm that could be applied to all of our tools. Even better, all of our cloud infrastructure could be managed as code via AWS CloudFormation and we’d have sufficient logging made available by AWS CloudWatch.
So when a user initiated a job as part of the Genomic Pipeline, that job was passed to our Job Queue, where it awaited the allocation of a compute environment by Batch. When Batch provided the job with a compute environment, the Docker image associated with the tool the user was running was pulled from ECR, and a container was started on that compute environment. Inputs from the user were surfaced to the container as environment variables. After the job completed, relevant output files were synced to S3 and data was synced to our Snowflake databases via Lambda functions.
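The "predictable location" half of this flow can be sketched as a small helper that a container entrypoint might use. The bucket and key layout here are illustrative, not our actual scheme (the `AWS_BATCH_JOB_ID` environment variable, however, is one Batch really does set on its containers):

```python
import os

def output_location(bucket, tool_name, job_id):
    """Compute the predictable S3 prefix where a job's outputs are synced.

    Bucket name and key layout are hypothetical.
    """
    return f"s3://{bucket}/{tool_name}/{job_id}/"

# Inside the container, user inputs arrive as environment variables.
# After the tool finishes, its output files are synced to this prefix
# (e.g. with `aws s3 sync`), and a Lambda picks the data up for Snowflake.
job_id = os.environ.get("AWS_BATCH_JOB_ID", "local-test")
prefix = output_location("genomic-pipeline-results", "genome-annotation", job_id)
print(prefix)
```

Deriving the prefix from the tool name and job ID is what lets users find their results without ever touching the compute environment.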
Thus, users didn't need to worry about managing compute environments or dependencies; they could expect the relevant output files from their analysis to be synced to a predictable location in S3, and the data associated with any job to be made available for further analysis through our database.
This was the pattern we used for each step in our Genomic Pipeline: every tool had a Docker image that managed its dependencies, ran on a compute environment provided by Batch, and made relevant outputs easily accessible to users. This highly efficient pattern improved the overall throughput of our R&D pipelines. Before we built this architecture around the Genomic Pipeline, only 30 or so microbes would make it through the pipeline in a given quarter; after, we were consistently running hundreds in the same time period.
While our new architecture was highly efficient, all software can be iterated on and improved. In keeping with a microservices architecture, each of our tools was an isolated Python module or script, and each ran in its own compute environment as a Batch task. The problem with this paradigm was that the Genomic Pipeline comprises a finite set of tools that often need to be run in sequence. So while our users could run individual tools in the pipeline in isolation, the most common use case required running each tool in sequence.
While this was entirely possible within the confines of the architecture we had built, it meant additional overhead and cognitive load for our users: before invoking downstream tools, they would need to gather the necessary inputs and manually execute jobs in sequence. We noticed that this could be automated: if a user wanted to run the steps of the Genomic Pipeline in sequence, which was often the case, the overhead of ensuring that intermediate inputs and outputs were in the correct place should be handled automatically. At the same time, users should still be able to run individual steps of the pipeline in isolation, without triggering downstream steps.
Step Functions and State Machines
Each step in the Genomic Pipeline runs as its own Batch task when triggered by the user. We needed a way to orchestrate Batch tasks so that users could simply specify the steps in the pipeline they want to start and end at, without having to manually trigger each job. AWS Step Functions provided a convenient solution.
Step Functions let you define a state machine graph that maps to your exact workflow. We used Step Functions to define a state machine for the Genomic Pipeline, such that each state in the state machine graph was a step in the pipeline. The Genomic State Machine orchestrated our Batch tasks so that users could specify a start state and an end state, while the overhead of transitioning between intermediate steps in the pipeline was handled deterministically.
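A state machine like this is just an Amazon States Language document where each state submits a Batch job and waits for it. Here is a minimal sketch that generates such a definition; the step names, queue, and job definitions are placeholders, not our real pipeline:

```python
import json

# Hypothetical pipeline steps, in order.
PIPELINE_STEPS = ["assembly", "annotation", "classification"]

def build_state_machine(steps):
    """Build an Amazon States Language definition in which each state
    runs one pipeline step as a Batch job and waits for it to finish."""
    states = {}
    for i, step in enumerate(steps):
        state = {
            "Type": "Task",
            # The .sync integration makes Step Functions wait for the
            # Batch job to complete before moving to the next state.
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": step,
                "JobQueue": "genomic-pipeline-queue",
                "JobDefinition": f"{step}-job-definition",
            },
        }
        if i + 1 < len(steps):
            state["Next"] = steps[i + 1]
        else:
            state["End"] = True
        states[step] = state
    return {"StartAt": steps[0], "States": states}

definition = build_state_machine(PIPELINE_STEPS)
print(json.dumps(definition, indent=2))
```

Letting a user start and end at arbitrary steps then amounts to choosing which slice of the step list the definition (or the execution input) covers, rather than asking them to trigger each Batch job by hand.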
We still had the same benefits of our overall cloud infrastructure: CloudFormation allowed us to manage our infrastructure as code, CloudWatch provided logging, S3 provided file storage, and Batch managed compute resources.
Orchestrating our pipeline using Step Functions also brought new benefits. Whereas in the past we had to manually keep track of errors between the steps of the pipeline as individual Batch tasks, Step Functions have built-in error handling that manages errors within your state machine graph. This was especially convenient for us: if a job failed at a particular step in the pipeline, we wanted to catch that error, sync intermediate results and data to S3 and Snowflake so they remained available to our users, and then fail gracefully.
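In Amazon States Language, that catch-sync-fail pattern is expressed with a `Catch` field on each Task state. The sketch below shows the shape of one such state; the state names and the Lambda function are hypothetical:

```python
# A Task state for one pipeline step. On any failure, Catch routes the
# execution to a state that syncs intermediate results before failing.
annotation_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::batch:submitJob.sync",
    "Parameters": {
        "JobName": "annotation",
        "JobQueue": "genomic-pipeline-queue",
        "JobDefinition": "annotation-job-definition",
    },
    "Catch": [
        {
            # States.ALL matches any error thrown by this state.
            "ErrorEquals": ["States.ALL"],
            "Next": "SyncPartialResults",
        }
    ],
    "Next": "classification",
}

# Hypothetical Lambda that syncs intermediate outputs to S3/Snowflake,
# then hands off to a terminal Fail state ("FailGracefully" below).
sync_partial_results = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": "sync-partial-results"},
    "Next": "FailGracefully",
}

fail_gracefully = {"Type": "Fail", "Error": "PipelineStepFailed"}

print(annotation_state["Catch"][0]["Next"])
```

The execution still ends in a failed status, but only after the intermediate data has been preserved for the user.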
Step Functions also allowed us to redefine what it meant to be a step in the Genomic Pipeline. We noted that some steps in the pipeline were actually composed of multiple substeps. Before the state machine, when users manually triggered individual steps of the pipeline, Batch would provide a task with an EC2 instance with the minimum number of CPUs required by that entire step. For tools composed of multiple substeps, this meant running on an EC2 instance with as many CPUs as required by the most computationally expensive substep of that tool, no matter how short that substep was. Our state machine definition made it easy to break steps into substeps so that each state's Batch task received only the compute resources it required. So instead of paying for a large instance that was only used to its full potential for a fraction of its total running time, our state machine definition ensures that Batch gives us EC2 instances that are used more efficiently. Optimizing our use of compute resources reduced our costs without hurting the efficiency of the overall pipeline.
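The sizing difference can be sketched with per-substep Batch resource requirements. The substep names and numbers below are invented for illustration; the point is that each substep's job definition asks only for what that substep needs:

```python
# Hypothetical vCPU/memory needs per substep of one tool. Before the
# split, the whole tool ran on an instance sized for "assemble"; after,
# each substep's Batch job requests only its own requirements.
SUBSTEP_RESOURCES = {
    "trim-reads":    {"vcpus": 2,  "memory_mib": 4096},
    "assemble":      {"vcpus": 16, "memory_mib": 65536},
    "quality-check": {"vcpus": 2,  "memory_mib": 4096},
}

def resource_requirements(substep):
    """Build the Batch `resourceRequirements` list for a substep's
    job definition (values are strings in the Batch API)."""
    res = SUBSTEP_RESOURCES[substep]
    return [
        {"type": "VCPU", "value": str(res["vcpus"])},
        {"type": "MEMORY", "value": str(res["memory_mib"])},
    ]

print(resource_requirements("assemble"))
```

With this split, the 16-vCPU instance is only billed for the assembly substep's runtime rather than for the whole tool.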
By reducing individual tools into component substeps through Step Functions, we were also able to reduce bloat in our Docker containers. Each substep has its own Dockerfile, which specifies only the dependencies necessary to run that particular substep. When tools were triggered and run in isolation, the Dockerfile for a given tool had to account for all the dependencies required by each of its substeps; this was highly inefficient and often made it difficult to update individual dependencies or debug complicated tracebacks arising from interconnected dependencies. Step Functions allowed us to combine isolated tools with an orchestration layer.
A key priority of our team is to provide cost-effective solutions that help automate and increase the throughput of Indigo's R&D pipelines. We've used AWS cloud infrastructure to scale our workloads and provide tools that make our end users — Indigo's scientists — more efficient. In pursuit of improving the tools we've built, we've found Step Functions to be an effective way to orchestrate Batch tasks and deterministically handle jobs that must run in sequence. Our users can now run hundreds of microbes through the Genomic Pipeline each day. We're continuing to improve our pipeline, tooling, and infrastructure — if working on scalable genomic analysis and automating bioinformatics workflows interests you, we'd love to chat. We're hiring!