How to deploy Transformer models for language tasks

Transformer models represent the current state-of-the-art in almost all text and language tasks. Here, I want to provide a brief guide to deploying Transformers to production while optimizing for the specific challenges they present.

This guide does not require a deep familiarity with the Transformer architecture. If you are curious about Transformers and haven’t done so yet, I’d recommend reading the original Transformer paper, “Attention is All You Need.”

Additionally, this guide focuses on realtime inference because that is where most of the big challenges of serving Transformer models crop up. If you’re interested in running batch inference on Transformer models, see the Cortex batch prediction documentation, as the process is the same as for any other model architecture.

What makes deploying Transformer models hard

Transformer models have several characteristics that make them difficult to deploy to production:

  • Size. Transformer models are very large. Most of the largest models in use today, including GPT-3, T5, and T-NLG, are Transformers.
  • Latency. As a result of their size and complexity, Transformers often require a lot of compute power to serve predictions with decent latency.
  • Concurrency. Related to both of the above challenges, Transformer models typically cannot handle many concurrent requests.

These issues combine to create a number of challenges. The first and most obvious is cost. Running a lot of big servers with powerful compute resources to handle a relatively small number of requests is not particularly economical.

The second involves infrastructure. Building a platform that serves Transformer models while accounting for these challenges is a major infrastructure project, one that ideally shouldn’t be handled in-house (though, as a maintainer of Cortex, I’m a bit biased here).

All of that being said, these problems are not insurmountable. For the rest of this guide, I’m going to walk through some steps you can take to deploy a Transformer model with Cortex, such that your deployed model serves requests quickly, elastically scales to meet demand, and does all of this without significantly increasing costs.

1. Defining a performant API to serve predictions

To serve predictions from a model, we define a Predictor—an API that generates predictions in response to requests.

The request handling portion of the API, which we’ll see in a second, is a fairly straightforward Python class with init() and predict() methods. To serve predictions in the most performant way possible, there are two key things we want to implement in our script:

  • Loading the model into GPU memory for inference.
  • Relocating any non-blocking procedures to asynchronous functions.

We can load the model into GPU memory in our init() method, and make use of Cortex’s pre- and post-processing hooks to perform non-blocking tasks asynchronously while freeing resources to serve more predictions.
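
Here is a minimal sketch of what that predictor can look like. The GPT-2 model is just an example, send_to_logging_service() is a hypothetical stand-in for our logging service, and the exact name and signature of the post-prediction hook depends on your Cortex version, so check the Cortex documentation before lifting this verbatim:

    # predictor.py
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer


    def send_to_logging_service(payload, response):
        # Hypothetical stand-in for a call to an external logging service
        print({"payload": payload, "response": response})


    class PythonPredictor:
        def __init__(self, config):
            # Load the model into GPU memory if a GPU is available
            self.device = "cuda" if torch.cuda.is_available() else "cpu"
            self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
            self.model = GPT2LMHeadModel.from_pretrained("gpt2").to(self.device)
            self.model.eval()

        def predict(self, payload):
            # Tokenize the request, generate a completion on the GPU, and return it
            input_ids = self.tokenizer.encode(
                payload["text"], return_tensors="pt"
            ).to(self.device)
            with torch.no_grad():
                output = self.model.generate(input_ids, max_length=50)
            return self.tokenizer.decode(output[0], skip_special_tokens=True)

        def post_predict(self, response, payload):
            # Illustrative post-processing hook: ship logs to the external logging
            # service after the response has been delivered, so it never adds
            # latency to the request itself (the hook name and signature vary
            # by Cortex version)
            send_to_logging_service(payload, response)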

In the above script, we initialize our predictor by setting the device to use GPUs, if they are available. We then run our logging service (which, in this hypothetical scenario, is necessary to run in addition to Cortex’s built-in logging) asynchronously, after the response has been delivered.

Note that we are simply using Hugging Face’s Transformers library for convenience here. The approach will be the same regardless of which Transformer model you use and where you access it from.

2. Provisioning GPUs while limiting costs

So now we have the request serving portion of our API built, but we still need to define its infrastructure. One concern we need to address in this step is the cost of GPU instances.

For context, AWS’s cheapest GPU instance, the g4dn.xlarge, costs $0.526 per hour on-demand. Larger g4 instances, unsurprisingly, scale quickly in cost. When you consider the low concurrency capabilities of larger Transformer models, this becomes a major issue, as a relatively small number of concurrent users can require a large number of expensive instances.
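
As a purely illustrative back-of-the-envelope calculation: if each replica can only handle one concurrent request, serving 40 concurrent users means running 40 g4dn.xlarge instances, which at $0.526 per hour works out to roughly $15,000 per month of on-demand compute, before accounting for traffic growth or redundancy.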

Cortex will automatically share GPU resources across APIs deployed to the same instance, giving us some optimization out of the gate. However, we want to reduce costs even further by using Spot instances.

Spot instances are unused instances that AWS sells at a discount. The g4dn.xlarge, for example, costs only $0.1578 per hour as a Spot instance, a roughly 70% discount.

The risk with Spot instances is that while they are significantly cheaper, they are not guaranteed to be available. Because of this, Cortex also allows us to define backup instance types and fallback behavior for Spot deployments.

We can define all of this behavior in a cluster.yaml file, which Cortex uses to configure our cluster.
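
Here’s a rough sketch of what that file can look like. The field names follow Cortex’s cluster configuration format, but the defaults and the backup instance types listed below are illustrative, so double-check them against the docs for the Cortex version you’re running:

    # cluster.yaml
    cluster_name: cortex
    region: us-east-1

    # primary instance type and cluster size
    instance_type: g4dn.xlarge
    min_instances: 0
    max_instances: 10

    # run workloads on Spot instances, with fallbacks
    spot: true
    spot_config:
      # backup instance types to try if g4dn.xlarge Spot capacity runs out
      instance_distribution: [g4dn.xlarge, g4dn.2xlarge, p2.xlarge]
      # minimum number of on-demand instances (integer)
      on_demand_base_capacity: 0
      # percentage of instances above the base that should be on-demand
      on_demand_percentage_above_base_capacity: 0
      # the most we're willing to pay per Spot instance, per hour
      max_price: 0.526
      # number of Spot instance pools to spread requests across
      instance_pools: 3
      # fall back to on-demand instances if no Spot capacity is available
      on_demand_backup: true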

I’ve included a lot of optional attributes with their default values, just to illustrate some of the knobs exposed by Cortex around Spot deployments. Of particular note:

  • Instance distribution. Here we can define backup instances to try if there are no g4dn.xlarge Spot instances. The instances we’ve defined here all have the resources we need, and the most expensive of them (p2.xlarge) still only costs $0.27 per hour.
  • On-demand base capacity. Cortex allows us to fall back to on-demand instances if all other options are exhausted. We can set a minimum (defined as an integer) and a maximum (defined as a percentage of total instances) number of on-demand instances to improve availability.
  • Max price. Because Spot instance prices can fluctuate somewhat, we can be precise about how much we’re willing to spend on Spot instances.

When AI Dungeon transitioned to Spot instances on Cortex, they reduced their total cloud spend by over 90% (see “How we scaled AI Dungeon 2 to support over 1,000,000 users” for the full story).

3. Handling traffic spikes without over-scaling

Even with more affordable instances, the low concurrency limits of large Transformer models present obstacles. To handle thousands of concurrent users with realtime latency, we likely need to deploy a large number of replicas; multi-hundred-node deployments are not uncommon with Transformers.

To keep this cost as low as possible while avoiding any tradeoffs around latency, we need to autoscale as precisely as possible. The tricky part about autoscaling production machine learning deployments is in selecting the right metric for defining your autoscaling logic.

Most traditional DevOps autoscalers, like Kubernetes’ Horizontal Pod Autoscaler, use some kind of resource utilization as their metric: when an instance or pod hits a certain utilization threshold, scale up. This makes perfect sense for traditional software, but is less ideal for machine learning.

The problem with machine learning is that every model has different compute needs, and every task has variable latency requirements. To effectively use resource utilization as a metric, we’d need a way to distill CPU, GPU, and ASIC usage into a single metric, defined relative to the particular requirements of the task a model is being used for. This, ultimately, is an unwieldy approach.

Instead, Cortex uses what we call request-based autoscaling. In this paradigm, we set a concurrency capacity for individual APIs, which Cortex uses to autoscale them relative to the length of their request queue.
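
As a rough mental model (this is an approximation of the idea, not Cortex’s exact implementation), the autoscaler works out how many replicas it wants by dividing the number of in-flight requests by the concurrency each replica is expected to handle:

    import math

    def desired_replicas(in_flight_requests, target_replica_concurrency,
                         min_replicas, max_replicas):
        # Scale replicas in proportion to the number of in-flight requests,
        # clamped to the configured minimum and maximum
        desired = math.ceil(in_flight_requests / target_replica_concurrency)
        return max(min_replicas, min(desired, max_replicas))

    # e.g. 120 queued/in-flight requests, with each replica targeting one
    # concurrent request, scales toward 120 replicas (capped at max_replicas)
    print(desired_replicas(120, 1, min_replicas=1, max_replicas=200))  # 120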

To configure this, we just have to set a few properties in our API’s YAML manifest.
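
Here is an illustrative manifest for a single API. The API name and predictor path are placeholders, and the exact field names and defaults vary between Cortex versions, so treat the values below as a starting point rather than gospel:

    # cortex.yaml
    - name: text-generator
      predictor:
        type: python
        path: predictor.py
      compute:
        gpu: 1
      autoscaling:
        # total replica count
        min_replicas: 1
        max_replicas: 100
        init_replicas: 1
        # per-replica concurrency
        workers_per_replica: 1
        threads_per_worker: 1
        target_replica_concurrency: 1
        max_replica_concurrency: 1024
        # stabilization factors
        window: 60s
        downscale_stabilization_period: 5m
        upscale_stabilization_period: 1m
        max_downscale_factor: 0.75
        max_upscale_factor: 1.5
        downscale_tolerance: 0.05
        upscale_tolerance: 0.05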

Similar to the last example, I’ve included quite a few optional fields with their default values here for illustrative purposes. As you can see, Cortex exposes many knobs, allowing you to get very granular with your autoscaling logic. The different fields can be loosely bucketed as pertaining to:

  • Total replica count. These fields allow you to set minimum, maximum, and initial values for the total count of replicas deployed.
  • Per replica concurrency. These fields determine the total number of concurrent requests a single replica can handle, as well as how many parallel workers/threads-per-worker to run on each replica.
  • Stabilization factors. These fields allow you to control the pace at which your deployments scale, as well as their general sensitivity.

For more information, I’d recommend reading the autoscaling documentation.

Bonus: Increasing throughput with ONNX Runtime

This step is less universally applicable, but it is particularly high-impact for many Transformer deployments.

ONNX Runtime contains many built-in optimizations for running inference on Transformer models. In addition, the ONNX team has released several open source tools for optimizing Transformer models via quantization and other means once they’ve been exported to ONNX.

I won’t go into all the steps in this post (I have an entire guide to optimizing Transformer models with ONNX here), but suffice it to say that in our benchmarks, we were able to increase throughput per API by 40x by converting Transformer models from PyTorch to ONNX and applying optimizations.
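
For a flavor of the workflow (the BERT classifier, file names, and export settings below are purely illustrative), exporting a Hugging Face model to ONNX and running it with ONNX Runtime boils down to something like this:

    import torch
    import onnxruntime as ort
    from transformers import BertForSequenceClassification, BertTokenizer

    # Export an (illustrative) BERT classifier from PyTorch to ONNX
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", return_dict=False
    )
    model.eval()

    dummy = tokenizer("hello world", return_tensors="pt")
    torch.onnx.export(
        model,
        (dummy["input_ids"], dummy["attention_mask"]),
        "model.onnx",
        input_names=["input_ids", "attention_mask"],
        output_names=["logits"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "attention_mask": {0: "batch", 1: "sequence"},
        },
        opset_version=12,
    )

    # Run inference with ONNX Runtime
    session = ort.InferenceSession("model.onnx")
    inputs = tokenizer("This movie was great", return_tensors="np")
    logits = session.run(
        None,
        {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]},
    )[0]
    print(logits)

From there, ONNX Runtime’s transformer optimization and quantization tooling can be applied to the exported model.onnx before loading it in your predictor.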

Production-scale text processing with Transformers

Large Transformer models frequently draw the ire of critics who believe they are impractical for production use. In some ways, the critics are right—if you accept that production machine learning is forever stuck with traditional DevOps and software infrastructure.

However, as the MLOps ecosystem continues to mature, platforms like Cortex are making large deep learning models like GPT-2 more and more accessible for production use.

Deploying GPT-3 (if it is ever open sourced) may still be a challenge, but at this point, infrastructure should no longer be the bottleneck preventing you from deploying Transformer models to production.

Like Cortex? Leave us a Star on GitHub
