There is (rightfully) quite a bit of emphasis on testing and optimizing models pre-deployment in the machine learning ecosystem, with meta machine learning platforms like Comet becoming a standard part of the data science stack.
There has been less of an emphasis, however, on testing and optimizing models post-deployment, at least as far as tooling is concerned. This dearth of tooling has forced many to build extra in-house infrastructure, adding yet another bottleneck to deploying to production.
We’ve spent a lot of time thinking about A/B testing deployed models in Cortex, and after several iterations, we’ve built a set of features that make it easy to conduct scalable, automated A/B tests of deployed models. In this guide, I want to explain both the how and why of our approach, and hopefully, give you a better way to test your models in production.
First, some context.
How we think about A/B testing deployed models
There are several things that differentiate our approach to A/B testing deployed models from our thinking around optimizing and validating models pre-deployment:
- When we refer to a deployed model, we are looking at more than just the model itself. A deployment consists of the model artifact, its inference serving code, and the configuration of its infrastructure.
- Similarly, when we test a deployed model, we care about more than just its accuracy. We also care about its latency, its concurrency capabilities, and other performance-related factors.
- When we deploy a model, it is often as part of a pipeline that includes several other deployed models. Testing, evaluating, and updating a deployed model as a piece of a bigger pipeline presents specific challenges.
This view of model deployment is reflected in Cortex’s basic architecture. Cortex adopts an API-centric view of the world, treating a model artifact, its inference serving code, and its infrastructure configuration—the essentials needed to deploy a model as an API—as an atomic unit of inference. On deploy, Cortex packages these elements together, versions them, and deploys them to the cluster.
A pipeline, in this API-centric worldview, is a chain of APIs. For example, a chat monitoring pipeline might consist of many interconnected APIs, each performing different tasks—named entity recognition, sentiment analysis, semantic similarity analysis, etc.
Given this worldview, A/B testing in Cortex is primarily concerned with deploying different versions of APIs, routing traffic to them according to some configurable logic, and tracking their performance in a way that is attributable and comparable.
How to A/B test machine learning models with Cortex
Configuring an A/B test in Cortex is fairly straightforward due to the Traffic Splitter, a configurable request router that sits in front of your deployed APIs and sends them traffic according to your specification.
For example, let’s say we were deploying a face recognition API, and we wanted to test two different versions of our model (which we’ll creatively call version A and version B). We would create a different API for each model, which we’ll similarly call face_recognition_a and face_recognition_b.
Our inference serving code would look exactly the same for each model:
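The original listing isn't reproduced here, but a minimal sketch of what the shared predictor file might look like follows. The `onnx_client` is injected by Cortex at startup and bound to whichever model the API's configuration points at; the payload shape and any pre/post-processing are application-specific, so treat the body as a placeholder:

```python
# predictor.py -- shared by face_recognition_a and face_recognition_b.
# Cortex instantiates this class once per API replica.

class ONNXPredictor:
    def __init__(self, onnx_client, config):
        # onnx_client wraps the ONNX model declared in the API's configuration
        self.client = onnx_client

    def predict(self, payload):
        # payload is the parsed request body; its shape (and any
        # pre/post-processing around the model call) is application-specific
        return self.client.predict(payload)
```

Because the same file is referenced by both APIs, any difference in behavior between the two deployments comes entirely from the model artifact each API is configured with, which is exactly what an A/B test wants.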
Note: There’s no particular reason why I’m using Cortex’s ONNX Predictor here; you could just as easily use the TensorFlow Serving client or the Python Predictor.
But each model will require slightly different configuration. In our configuration file, we’re going to define each API separately, and define the Traffic Splitter:
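The configuration file itself isn't shown above; a sketch of what it might contain follows. The API names come from this guide, but the `kind` values, field names, and S3 paths are assumptions that vary by Cortex version and by where your models live, so check your version's documentation:

```yaml
# cortex.yaml -- two model-specific APIs plus a Traffic Splitter.
# Field names follow recent Cortex releases; model paths are placeholders.

- name: face_recognition_a
  kind: RealtimeAPI
  predictor:
    type: onnx
    path: predictor.py
    model_path: s3://my-bucket/face-recognition/version_a/  # placeholder

- name: face_recognition_b
  kind: RealtimeAPI
  predictor:
    type: onnx
    path: predictor.py
    model_path: s3://my-bucket/face-recognition/version_b/  # placeholder

- name: face_recognition
  kind: TrafficSplitter
  apis:
    - name: face_recognition_a
      weight: 50
    - name: face_recognition_b
      weight: 50
```

Adjusting the split later (say, 90/10 while a new model earns trust) is just a matter of changing the weights and redeploying.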
Now, we’ve created an API that uses version_a, an API that uses version_b, and a Traffic Splitter that will send 50% of all traffic to each API.
We can deploy all three of these services to the cluster at the same time with the Cortex CLI:
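The command isn't shown above; it's the standard Cortex deploy command, run from the directory containing the configuration file:

```
$ cortex deploy
```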
And we can check on the status of our deployment and find our API’s endpoint from the CLI as well:
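Again the listing is missing; querying the Traffic Splitter by name with `cortex get` prints its status and endpoint (the exact output format varies by CLI version):

```
$ cortex get face_recognition
```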
From now on, as long as we query the endpoint provided by the Traffic Splitter, our requests will be routed to our models according to the weights we configured.
Now, how do we track the performance of these APIs? Via Cortex’s built-in prediction tracking. Cortex automatically monitors your APIs and streams metrics to CloudWatch.
In addition, you can configure Cortex to track predictions however you’d like, and export the data to any service. So, you can very easily configure Cortex’s prediction tracking to log predictions in a way that includes the model version, the overall performance of the API, or whatever other data is relevant to your A/B test.
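As a sketch of what that might look like, here is a plain-Python helper (the names are hypothetical, not part of Cortex's API) that a predictor could call on each request to emit a structured log record tagged with the model version, so that downstream analysis can attribute each prediction to one arm of the A/B test:

```python
import json
import logging

logger = logging.getLogger("predictions")

def log_prediction(model_version, payload, prediction):
    # Hypothetical helper: emit one structured record per request.
    # Downstream, records can be grouped by model_version to compare
    # the two arms of the A/B test.
    record = {
        "model_version": model_version,  # e.g. "version_a" or "version_b"
        "payload": payload,
        "prediction": prediction,
    }
    logger.info(json.dumps(record))
    return record  # returned for convenience
```

Because the record is plain JSON, the same logs can be shipped to CloudWatch or any other service you already use for analysis.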
Iteratively improving production machine learning systems
Improving a production system is an incremental process, and this iteration relies on infrastructure. When you kludge together a brittle production system, you may shorten your initial time to deploy, but you essentially freeze your pipeline in time.
Its performance won’t improve, because testing improvements would require changes—oftentimes rapid ones—and those changes would break the entire pipeline.
Solving this problem is a core focus of Cortex. We want it to be easy not just to deploy to production, but to build production machine learning systems that are continuously improving.
If you’re working on a production machine learning system, we would love to hear about it—and if you’re simply interested in production ML, we’re always looking for contributors!