How we scale machine learning model deployment on Kubernetes


At this point, Kubernetes is the de facto choice for orchestrating containers. As such, when we initially began working on our model deployment system, which conceptualizes deployed models as containerized prediction APIs, Kubernetes was the obvious choice.

Building a cluster for production-scale machine learning inference, however, requires many features which go beyond the capabilities of vanilla Kubernetes, particularly when it comes to serving large inference workloads.

I’m going to summarize a few of the key features we’ve implemented in the Cortex cluster to handle these large workloads. If you’re working on ML infrastructure, hopefully the following will be of some help. Additionally, because Cortex is free and open source, you can explore any of the features yourself and/or try using Cortex as your model serving infrastructure.

1. GPU/ASIC support

A simple way to handle larger workloads is to use more powerful resources, like GPUs and ASICs (Inferentia, TPUs, etc.). This is probably the most obvious feature and the easiest to implement, but it’s important to include.

To get a sense of how important these compute resources can be, it’s useful to look at a tangible example of a latency-sensitive ML application. Most people have used Gmail and seen its predictive Smart Compose feature.

According to the Gmail team, the prediction API behind Smart Compose needs to serve models in under 100 ms—otherwise, the user will type faster than Gmail can display predictions.

As the Gmail team points out in their write-up, even with a model designed for speed, CPUs still delivered latencies of several hundred milliseconds. The team had to use an ASIC (in this case, TPUs) to reach acceptable latency.

Implementing support for different compute resources within a cluster is fairly straightforward, as there are usually well-documented device plugins. The tricky part, however, is how these new resources affect autoscaling.
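Before getting to autoscaling, here is a minimal sketch of the straightforward part. Once a device plugin such as NVIDIA’s is installed, a GPU becomes a schedulable resource just like CPU or memory. This example uses the official Kubernetes Python client; the image and pod names are hypothetical, and this is an illustration rather than Cortex’s actual code:

```python
from kubernetes import client, config

# Load credentials from ~/.kube/config (use load_incluster_config() inside a cluster).
config.load_kube_config()

container = client.V1Container(
    name="predictor",
    image="example.com/my-model-api:latest",  # hypothetical prediction API image
    resources=client.V1ResourceRequirements(
        # "nvidia.com/gpu" is the extended resource exposed by the NVIDIA device plugin.
        limits={"nvidia.com/gpu": "1"},
    ),
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-predictor"),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)

# The scheduler will only place this pod on a node that advertises a free GPU.
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```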

2. Request-based autoscaling

Autoscaling deployed models is an interesting problem. The Kubernetes Horizontal Pod Autoscaler is a solid tool in that it allows us to scale up individual APIs—rather than instances—but it has a key problem: by default, it relies on CPU utilization.

The problems with this are twofold. First, not all of our deployed models run on CPUs. Second, because different models have different latency requirements depending on their task, utilization is an incomplete metric.

Our solution was to build an autoscaler with more customizable logic. In Cortex, users can specify a concurrency limit for each individual API. The Cortex autoscaler uses this concurrency limit to scale the API according to the length of its incoming request queue (for more information, see the autoscaling docs).
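To make the idea concrete, here is a simplified sketch of the core calculation, assuming the autoscaler knows each API’s target concurrency and the total number of in-flight and queued requests. This illustrates the general technique of request-based scaling, not Cortex’s actual implementation:

```python
import math

def desired_replicas(total_in_flight: int, target_concurrency: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Scale an API based on its request queue rather than CPU utilization.

    total_in_flight: requests currently being processed plus those waiting in the queue.
    target_concurrency: how many concurrent requests one replica should handle.
    """
    raw = math.ceil(total_in_flight / max(target_concurrency, 1))
    return max(min_replicas, min(raw, max_replicas))

# A latency-critical API with a low concurrency target scales out aggressively...
print(desired_replicas(total_in_flight=80, target_concurrency=1,
                       min_replicas=1, max_replicas=100))   # -> 80
# ...while an API that tolerates queueing needs far fewer replicas for the same load.
print(desired_replicas(total_in_flight=80, target_concurrency=16,
                       min_replicas=1, max_replicas=100))   # -> 5
```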

As an example, imagine you operate an ecommerce site. You may have several deployed models:

  • A recommendation engine to personalize which products a user sees.
  • A conversational agent to handle customer support.
  • A fraud detection model to flag suspicious activities.

Depending on your situation, all of these models will likely have different latency requirements. A recommendation engine can run as a regular batch job, caching the user’s preferences for later use. A conversational agent can tolerate a few seconds of latency, hidden behind an “Agent is typing…” message. A fraud detection model might need to run in realtime in order to stop transactions before your logistics center begins processing the order.

With Cortex, you can optimize your scaling for each of these situations, irrespective of resource utilization. For the recommendation engine, you can schedule a batch job, which requires no autoscaling at all. For the conversational agent, you can set a fairly high concurrency limit, allowing requests to wait a little longer in the queue. For the fraud detection model, you can set a low limit, making sure you have enough available replicas for realtime inference.

This implementation allows users to be more precise about the autoscaling behavior of each individual API, and therefore more cost effective. This is a huge deal in production machine learning, where inference costs can surge dramatically with new users.

3. Spot instance support

This last feature has less to do with the performance challenges of large workloads, and more to do with their costs. Simply put, inference costs can be extreme, even when your inference pipeline is highly optimized.

For example, AI Dungeon was originally built using a finetuned GPT-2 model, which is ~6 GB (it still uses a GPT-2 model for some inferences, but also utilizes the new GPT-3 API). GPT-2 is large, resource hungry, and can only handle a couple of concurrent requests if you’re aiming for decent latency. As a result, just a few weeks after launch, AI Dungeon was running over 700 GPU instances to support active players.

AWS’s cheapest GPU instance, the g4dn.xlarge, costs $0.526 per hour on-demand. 700 of these would cost $368.20 per hour, or over $265,000 per month.

While reducing the number of needed instances via aggressive autoscaling and other optimizations is helpful, the biggest thing that helped AI Dungeon was Cortex’s support for Spot instances (unused instances that AWS sells at a steep discount). A g4dn.xlarge costs only $0.1578 per hour on Spot. With the combination of Spot deployments, efficient autoscaling, and better optimization, AI Dungeon was able to reduce its spend by over 90%.
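Using the prices above, a quick back-of-the-envelope calculation shows where the savings come from: Spot pricing alone cuts the per-instance cost by roughly 70%, with the rest of the reduction coming from running fewer, better-utilized instances.

```python
ON_DEMAND = 0.526        # $/hour, g4dn.xlarge on-demand
SPOT = 0.1578            # $/hour, g4dn.xlarge Spot (price at the time)
INSTANCES = 700
HOURS_PER_MONTH = 24 * 30

on_demand_monthly = ON_DEMAND * INSTANCES * HOURS_PER_MONTH   # ~$265,104
spot_monthly = SPOT * INSTANCES * HOURS_PER_MONTH             # ~$79,531
discount = 1 - SPOT / ON_DEMAND                               # ~70% per instance

print(f"On-demand: ${on_demand_monthly:,.0f}/mo")
print(f"Spot:      ${spot_monthly:,.0f}/mo ({discount:.0%} cheaper per instance)")
```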

Implementing Spot instances was its own journey. Technically, it simply involved interfacing with AWS’s APIs. Architecturally, however, we had to reason through a number of challenges. How should Cortex behave when no Spot instances are available? What should the default behavior be when an instance is reclaimed? How should Cortex manage autoscaling groups that include Spot instances?

For details on how we answered these questions, see the Spot instance docs.
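As one illustration of the moving parts involved (and not necessarily how Cortex implements it), a mixed-instances Auto Scaling group is one way to express “prefer Spot, keep a small on-demand baseline” through AWS’s APIs. This sketch uses boto3; the group name, launch template ID, and subnet are placeholders:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="gpu-workers",                 # hypothetical group name
    MinSize=1,
    MaxSize=100,
    VPCZoneIdentifier="subnet-0123456789abcdef0",       # placeholder subnet ID
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-0123456789abcdef0",  # placeholder template
                "Version": "$Latest",
            },
            # Fall back to comparable instance types if g4dn.xlarge Spot capacity dries up.
            "Overrides": [
                {"InstanceType": "g4dn.xlarge"},
                {"InstanceType": "g4dn.2xlarge"},
            ],
        },
        "InstancesDistribution": {
            # Keep one on-demand instance as a baseline; everything above it runs on Spot.
            "OnDemandBaseCapacity": 1,
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```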

Making inference at scale accessible

Part of what makes web development so powerful is its accessibility. Running a CRUD app with 1,000 concurrent users is, in many cases, achievable using strictly free tier resources.

In order for machine learning engineering to become truly accessible, it needs to reach similar levels of ease, or at least get as close as possible. Much, if not most, of this work is on the infrastructure side. To that end, beyond the features listed above, we’ve recently released, and are currently working on, many features we’re excited about:

  • Multi-model caching/live reloading. Models can now be cached from a remote repository, allowing a single API to serve requests from thousands of models without needing to spin up extra instances. (Released in 0.22)
  • Multi-cloud support. Deployments can be moved across clouds to take advantage of resources or discounts without extra configuration. (GCP support releasing in 0.24)
  • Serverless interface. Cortex can infer which instance types and autoscaling configuration your model needs, if you prefer not to specify them. (Forthcoming)

And those are just a few.

Our hope is that in the near future, there will effectively be no infrastructure bottleneck to deploying models at scale. If this is exciting to you as well, we’re always happy to meet new contributors.

Like Cortex? Leave us a Star on GitHub
