Special thanks to Yudho Diponegoro and Alexander Burton from AWS for introducing us to AWS SageMaker and guiding us along the way.

Problem statement

At Portcast (intro at the end), we run a cluster of Celery workers on AWS ECS alongside our API servers to handle background tasks. We use Celery to run tasks asynchronously, outside of the HTTP request-response cycle. These kinds of tasks are often time-consuming and resource-intensive.

Recently, we created a new Celery task that loads data from selected rows of a database table and uploads it to Amazon S3, so that our clients can download the file as an exported CSV.
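A simplified sketch of such an export task follows. The function name, table schema, and data are all illustrative (not our actual code), and an in-memory SQLite table stands in for the real database so the sketch runs without AWS credentials:

```python
import csv
import io
import sqlite3


def export_rows_to_csv(rows, header):
    """Serialise database rows into an in-memory CSV buffer.

    In production the buffer would then be handed to boto3, e.g.
    s3_client.upload_fileobj(buf, bucket, key) -- omitted here so the
    sketch stays self-contained.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    buf.seek(0)
    return buf


# Demo: an in-memory SQLite table standing in for our real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (id INTEGER, port TEXT)")
conn.executemany(
    "INSERT INTO shipments VALUES (?, ?)",
    [(1, "Singapore"), (2, "Rotterdam")],
)
rows = conn.execute("SELECT id, port FROM shipments").fetchall()
csv_file = export_rows_to_csv(rows, ["id", "port"])
print(csv_file.read())
```

The catch, as described below, is that the real query materialises every selected row in the worker's memory before the CSV is written.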

The problem is that each export involves loading a couple of GBs of data into memory on our Celery worker instance. Because of the memory limits imposed on our worker instances, the export task blocks other tasks from executing and sometimes fails completely, leaving unfinished tasks dangling.

Celery worker instances run indefinitely, so if we allocated more memory to a particular worker instance, we would have to pay for that additional memory indefinitely.

Instead of increasing the memory for our worker instances, we wanted to explore an approach that orchestrates the whole process: spinning up new instances, running the task, and automatically shutting the instances down afterwards. This way, the additional memory and compute the task needs is allocated only for the duration of the task, independent of our ECS cluster, and our Celery worker instances stay light.

Solution

To achieve this, we explored AWS SageMaker Processing.

AWS SageMaker is best known for its machine learning services. SageMaker Processing, however, is a lesser-known module within SageMaker that was only launched late last year. Even though it is meant for ML data processing and model evaluation, it has exactly the feature we need: on-demand resource orchestration for heavy-lifting processes.

Amazon SageMaker Processing - Fully Managed Data Processing and Model Evaluation | Amazon Web Services
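As a preview of what this looks like, here is a minimal job submission with the SageMaker Python SDK. This is a sketch, not our production code: the image URI, IAM role, bucket, and script name are all placeholders.

```python
# Sketch only: assumes the `sagemaker` Python SDK is installed and that
# valid AWS credentials, an IAM role, and a pushed container image exist.
from sagemaker.processing import ScriptProcessor, ProcessingOutput

processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/csv-export:latest",
    command=["python3"],
    role="arn:aws:iam::<account>:role/SageMakerProcessingRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",  # paid for only while the job runs
)

# SageMaker provisions the instance, runs the script inside the container,
# uploads anything written to /opt/ml/processing/output to S3, and then
# tears the instance down automatically.
processor.run(
    code="export.py",
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://my-bucket/exports/",
        )
    ],
)
```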

Our solution has three parts:

  1. Containerise the script