Why we built a serverless machine learning platform—instead of using AWS Lambda

AWS Lambda is an appealing option for model deployment. On the surface, the benefits are obvious. Lambda:

  • Lets data scientists and MLEs deploy without managing infrastructure
  • Maximizes availability while minimizing costs
  • Provides a simple interface for defining prediction APIs
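
To make that last point concrete, a prediction API on Lambda can be as simple as a single handler function. The sketch below is illustrative only; the pickled scikit-learn model and the request shape are assumptions, not a real deployment:

```python
import json
import pickle

# Illustrative: assumes a small scikit-learn model is packaged alongside the function code.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)  # loaded once per container, reused across warm invocations

def handler(event, context):
    # With an API Gateway proxy integration, the request body arrives as a JSON string.
    payload = json.loads(event["body"])
    prediction = model.predict([payload["features"]])
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": prediction.tolist()}),
    }
```

For a model that fits comfortably in a deployment package, this really is all the infrastructure code you write, which is why Lambda looks so attractive at first.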

The problem, however, is that while these are all benefits of a serverless architecture, general-purpose serverless platforms like Lambda often impose limitations that make them suboptimal for machine learning.

We learned this firsthand. Before working on Cortex, we tried running deployments through Lambda. In fact, it was Lambda’s shortcomings that, in part, pushed us to build a serverless compute platform specifically for machine learning.

Shortcomings such as…

1. Lambda can’t deploy large models (like Transformer models)

By now, you’ve read many articles on the growth of deep learning models. Suffice to say that in many niches of deep learning—natural language processing, in particular—models are getting much bigger, very quickly.

For example, over the last couple of years, Hugging Face’s Transformers library has become the most popular NLP library, and anecdotally, we see it frequently in our users’ production APIs. The library provides convenient interfaces for models like:

  • GPT-2, which is ~6 GB fully trained.
  • BlenderBot, which is ~5 GB fully trained.
  • RoBERTa, which is > 1 GB fully trained.

And those are just the moderately sized ones. Some models, like T5, can exceed 40 GB, though admittedly I haven’t met many teams deploying models of that size at scale.

A serverless platform for modern machine learning needs to be able to deploy large models, and Lambda cannot. Lambda limits deployment packages to 250 MB uncompressed and caps function memory at 3,008 MB. If you’re looking to run any sort of state-of-the-art language model, Lambda is ruled out almost by default.
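
If you want to sanity-check your own model against that limit, a rough file-size check is enough. This is an illustrative sketch, not part of any deployment tooling:

```python
import os

# Lambda's limit at the time of writing: 250 MB uncompressed deployment package.
LAMBDA_PACKAGE_LIMIT_MB = 250

def fits_in_lambda(model_path: str) -> bool:
    """Rough check: does a serialized model even fit in Lambda's deployment package?"""
    size_mb = os.path.getsize(model_path) / (1024 * 1024)
    print(f"{model_path}: {size_mb:,.0f} MB (limit: {LAMBDA_PACKAGE_LIMIT_MB} MB)")
    return size_mb < LAMBDA_PACKAGE_LIMIT_MB

# A fully trained GPT-2 checkpoint (~6 GB) fails this check by more than an order of magnitude.
```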

2. GPU/ASIC support is required for serving models

As models get bigger, their resource requirements increase as well. For some of the large models we’ve discussed, GPU inference is the only way to reliably serve them with anything near realtime latency.

Similarly, ASICs like Inferentia and TPUs are changing the economics of model serving in some situations, and have the potential to do so on a much larger scale as they continue to mature. Inferentia is still relatively young, but we’ve already benchmarked certain models running roughly an order of magnitude more efficiently on it.

In the past, GPU/ASIC inference was considered a more niche use case, but more and more, it is becoming the standard in machine learning engineering. Unfortunately, Lambda doesn’t support either.

For a large number of Cortex users, this alone disqualifies Lambda as an option for deploying models to production.

3. Lambda is too inefficient for serving models

Lambda instances are capable of serving consecutive—but not concurrent—requests. In model serving, this is a big deal.

Inference is a computationally expensive task, and often a high-latency one (hence the frequent need for GPUs/ASICs). To keep inference costs from soaring, it is important to allocate compute resources as efficiently as possible without negatively impacting latency.

One way we do this in Cortex is by providing pre- and post-prediction hooks, which can execute code asynchronously. Commonly, these are used when an I/O request is attached to the inference function: fetching user information from a database, writing to logs, and so on.

The advantage these async hooks provide is that they allow you to free up the resources needed for inference as soon as the prediction is generated, as opposed to after the response is sent.
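
As an illustration of the pattern (not Cortex’s exact API; the predictor class, the db client, and the hook name below are all assumptions), a post-prediction hook can be as simple as handing the I/O work to a background thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative only: the class and hook names here are not Cortex's exact API.
class Predictor:
    def __init__(self, model, db, workers=4):
        self.model = model
        self.db = db  # hypothetical client used for post-prediction I/O
        self.executor = ThreadPoolExecutor(max_workers=workers)

    def predict(self, payload):
        prediction = self.model.predict(payload["features"])
        # Post-prediction hook: logging happens in the background, so the
        # compute used for inference is free as soon as the prediction exists.
        self.executor.submit(self._post_predict, payload, prediction)
        return {"prediction": prediction}

    def _post_predict(self, payload, prediction):
        # e.g. write the request/prediction pair to a database for monitoring
        self.db.insert({"input": payload, "output": prediction})
```

The caller gets the response as soon as predict() returns, while the database write finishes in the background.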

In Lambda, however, this is not possible.

As a result, if you are serving models with Lambda, you are likely over-scaling due to the wasted resources idling on each instance.
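
To see why, here is a back-of-the-envelope calculation. The request rate, inference latency, and I/O tail below are made-up numbers for illustration only:

```python
# Assumptions (illustrative, not benchmarks):
requests_per_second = 100
inference_seconds = 0.20   # time the model actually needs the instance's compute
io_seconds = 0.10          # time spent on logging / DB writes after the prediction

# Lambda: each instance holds one request end to end, including the I/O tail.
lambda_concurrency = requests_per_second * (inference_seconds + io_seconds)

# With async post-prediction hooks, the instance is released after inference.
hooked_concurrency = requests_per_second * inference_seconds

print(f"Instances needed on Lambda:        {lambda_concurrency:.0f}")   # 30
print(f"Instances needed with async hooks: {hooked_concurrency:.0f}")   # 20
```

Under these assumptions, Lambda needs roughly 50% more concurrent instances to serve the same traffic.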

Machine learning needs a dedicated serverless platform

The serverless architecture is a natural fit for model deployment. The problem, which you run into in just about any niche of MLOps, is that the needs of the machine learning use-case are just specific enough to make popular DevOps tools (like Lambda) a poor fit.

Part of our mission in building Cortex has been to build a platform that provides the ease of use we love in Lambda, while solving the specific challenges of ML infrastructure.

If that sounds like an interesting challenge to you as well, we’re always looking for new contributors.

Like Cortex? Leave us a Star on GitHub
