How to serve batch predictions with TensorFlow Serving


One of our standard recommendations for reducing the cost of running a production machine learning system is to be discerning about which predictions need to be realtime, and to serve the rest through batch inference.

For example, people commonly assume that recommendation engines need to serve realtime inference, and to be fair there are some really impressive examples of this, like the online training system run by TikTok. However, many orgs have figured out that for their use-case, they don’t actually need to generate a new recommendation every time the user accesses their app.

They can generate predictions for users on a regular basis, cache the results, and use those stored recommendations for all of a user’s interactions. Instead of generating hundreds (or potentially thousands) of predictions for a user over a window of time, they now only generate one.

In this guide, I’m going to walk through a process for serving batch predictions that will add no infrastructure overhead to your team (assuming you’re already using Cortex for realtime inference). I’ll be using TensorFlow Serving as an example, but the following process is nearly identical for any of Cortex’s predictor types.

1. Define a batch API for TensorFlow Serving

The following assumes you have a Cortex cluster running. If you do not, it is very easy to spin one up by following the instructions here.

First, we need to define an API for performing inference. If you’ve defined a realtime API with Cortex, this will be very familiar.

Our API is essentially a Python class with an initializer and a predict() method, which is used to perform inference. With the Cortex Python client, we can define this Python API and deploy it in the same script. Below is an example batch API used to run image classification:
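A minimal sketch of what such a predictor might look like, assuming Cortex's TensorFlow predictor interface (the tensorflow_client argument wraps the connection to TensorFlow Serving); the config keys, S3 layout, and preprocessing details below are placeholder assumptions:

```python
import json
import os


class TensorFlowPredictor:
    def __init__(self, tensorflow_client, config):
        # boto3 is imported lazily so this sketch can be parsed without the AWS SDK
        import boto3

        self.client = tensorflow_client          # connection to TensorFlow Serving
        self.labels = config["labels"]           # class names for the model's outputs
        self.bucket = config["dest_s3_bucket"]   # where batch results are written
        self.s3 = boto3.client("s3")             # connection to AWS

    def preprocess(self, img_url):
        # download the image and reshape it into the tensor the model
        # expects (details elided in this sketch)
        ...

    def predict(self, payload, batch_id):
        # payload is one batch of items; run inference on each image
        results = []
        for img_url in payload:
            tensor = self.preprocess(img_url)
            prediction = self.client.predict(tensor)  # inference via TF Serving
            results.append({"url": img_url, "prediction": prediction})
        # store this batch's results as a JSON object in S3
        self.s3.put_object(
            Bucket=self.bucket,
            Key=os.path.join("results", batch_id + ".json"),
            Body=json.dumps(results),
        )
```

In recent versions of the Python client, a class like this can be passed to create_api() along with the API configuration, which is what lets us define and deploy in a single script.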

As you can see, the initializer sets up the predictor by loading the model into memory, initializing the labels, defining the preprocessing method, and starting the connection to AWS. The predict() method then runs inference on every image passed into the method, and then stores the results as a JSON object in S3.

We’ve defined the API's requirements and configuration in this file as well, but if you prefer, we could also do this in YAML:
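If we go the YAML route, the configuration might look roughly like this (field names follow Cortex's API spec as best I recall it; the model path and compute values are placeholders):

```yaml
- name: image-classifier
  kind: BatchAPI
  predictor:
    type: tensorflow
    path: predictor.py
    models:
      path: s3://my-bucket/image-classifier/
  compute:
    cpu: 1
```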

We would then deploy by running cortex deploy with the Cortex CLI.

2. Submit jobs to the batch API

Now that we have an API capable of running batch inference, we need to submit batch jobs to it. There are a few different ways to submit batch jobs in Cortex, all of which involve posting a JSON object to your batch API endpoint.

The first method is to include the input data directly in your request, which looks something like this:
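Sketched as a Python dict (the image URLs are placeholders, and the "workers" field, which sets how many parallel workers run the job, is an assumption from Cortex's job spec rather than something stated above):

```python
import json

job_spec = {
    "workers": 1,  # assumed field: number of parallel workers for the job
    "item_list": {
        "items": [
            # placeholder data; in practice, populated dynamically
            "https://example.com/images/cat.jpg",
            "https://example.com/images/dog.jpg",
        ],
        "batch_size": 2,  # how many items go into each predict() call
    },
}

# submit by POSTing the JSON to the batch API endpoint, e.g.:
#   requests.post(batch_endpoint, data=json.dumps(job_spec))
print(json.dumps(job_spec, indent=2))
```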

The “item_list” object is what Cortex looks for to discover your payload, and “batch_size” tells Cortex how many items to include in each batch passed to predict(). In practice, you would likely populate the items in “item_list” dynamically.

The second method is to pass in a list of S3 paths that contain the relevant data for inference. The request submission follows the same process, but the JSON object itself will look more like this:
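A sketch of that request body, under the assumption that the list of S3 paths lives in a "file_path_lister" field (recalled from Cortex's batch documentation); the bucket name and glob patterns are placeholders:

```python
import json

job_spec = {
    "workers": 1,
    "file_path_lister": {
        "s3_paths": ["s3://my-bucket/images/"],  # S3 prefixes to crawl (placeholder)
        "includes": ["**.jpg"],                  # keep only files matching these globs
        "excludes": ["**/thumbnails/**"],        # skip extraneous files
        "batch_size": 8,
    },
}

print(json.dumps(job_spec, indent=2))
```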

The “includes” and “excludes” fields provide ways to handle extraneous files in your S3 buckets.

Finally, if your dataset is one giant JSON object which needs to be broken up into smaller jobs, we can pass that in as well. The request object looks like this:
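A sketch assuming Cortex's "delimited_files" field name for this mode; the S3 path is a placeholder:

```python
import json

job_spec = {
    "workers": 1,
    "delimited_files": {
        # one large delimited JSON file to be split up (placeholder path)
        "s3_paths": ["s3://my-bucket/dataset.json"],
        "batch_size": 16,  # Cortex breaks the file into batches of this many items
    },
}

print(json.dumps(job_spec, indent=2))
```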

No matter how you submit your job, Cortex will process it and run inference behind the scenes.

3. Monitor and update batch jobs

Each batch job in Cortex is assigned a specific id, which Cortex returns when the job is first submitted. Using that id, we can monitor jobs directly.

To do this, we can either send a GET request to our batch endpoint with “jobID=<job_id>” appended as a query parameter, or we can use the Cortex CLI and run cortex get <api_name> <job_id>. Either approach will return an object that looks like this:
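A rough illustration of such a status object; the field names and values here are assumptions for illustration, not copied from Cortex's docs:

```python
import json

status_response = {
    "job_status": {
        "job_id": "69d9c0013c2d0d97",  # placeholder id
        "api_name": "image-classifier",
        "status": "status_running",
        "batches_in_queue": 3,  # batches still waiting for a worker
        "batch_metrics": {"succeeded": 5, "failed": 0},
    },
}

print(json.dumps(status_response, indent=2))
```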

We can also stop any in-progress job by sending a DELETE request to the same URL.

Updating a batch API in Cortex is as simple as updating a realtime API: modify the predictor, and run the deploy command again. Cortex will handle the update process and guarantee no downtime.

The importance of batch prediction

There is rightfully a lot of excitement (including among us) about realtime inference, and there are a lot of projects pushing the boundaries of what is possible with it. Batch, however, is often treated as old hat.

While a lot of our energy in building Cortex goes into serving realtime inference, we are equally invested in improving batch inference, as we don’t see the two as competitive in any way.

What we’re seeing more and more of is the bifurcation of inference jobs within machine learning orgs. Teams frequently have both batch and realtime pipelines, all of which need to be managed and orchestrated within a broader production ML system.

Our goal is for Cortex to provide a single interface for managing the deployment and inference stage of the entire system, one which makes it easy to deploy the best pipeline for the job.

Like Cortex? Leave us a Star on GitHub

