Create a `requirements.txt` file to specify the dependencies needed by `predictor.py`. Cortex will automatically install them into your runtime once you deploy.

Create a `cortex.yaml` file and add the configuration below. A `RealtimeAPI` provides a runtime for inference and makes your `predictor.py` implementation available as a web service that can serve real-time predictions.

Running `cortex deploy` takes your Predictor implementation along with the configuration from `cortex.yaml` and creates a web API. Monitor the status of your APIs with `cortex get`, and show additional information about a single API with `cortex get <api_name>`.

You can use `curl` to test your API (it will take a few seconds to generate the text).

The `--env` flag specifies the name of the CLI environment to use. CLI environments contain the information necessary to connect to your cluster. The default environment is `local`, and when the cluster was created, a new environment named `aws` was created to point to the cluster. You can change the default environment with `cortex env default <env_name>`.

If you make a change to `predictor.py` or your `cortex.yaml`, you can update your API by re-running `cortex deploy`.

For example, modify `predictor.py` to set the length of the generated text based on a query parameter, then run `cortex deploy` to perform a rolling update of your API. You can monitor the update with `cortex get`.

Add a `compute` field to your API configuration and run `cortex deploy` to update your API with this configuration. Use `cortex get` to check the status of your API; once it's live, prediction requests should be faster.

Set `max_surge` to `0` in the `update_strategy` configuration.

When you're finished, use `cortex delete` to delete each API. Running `cortex delete` will free up cluster resources and allow Cortex to scale down to the minimum number of instances you specified during cluster creation. It will not spin down your cluster.
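To make the configuration steps above concrete, here is a sketch of what a `cortex.yaml` with the `compute` and `update_strategy` fields might look like. The API name, predictor path, and resource values are illustrative assumptions, not taken from the original, and field names may vary between Cortex versions:

```yaml
# Hypothetical cortex.yaml sketch; names and values are illustrative.
- name: text-generator        # assumed API name
  kind: RealtimeAPI
  predictor:
    type: python
    path: predictor.py
  compute:                    # the compute field discussed above
    cpu: 1
    gpu: 1                    # e.g. request a GPU to speed up predictions
  update_strategy:
    max_surge: 0              # avoid extra instances during rolling updates
```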
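The query-parameter change to `predictor.py` described above could be sketched as follows. The parameter name `length`, the payload shape, and the stub "model" are assumptions so the sketch stays self-contained; check the Predictor interface for your Cortex version before relying on the exact `predict` signature:

```python
class PythonPredictor:
    def __init__(self, config):
        # A real implementation would load the text-generation model here;
        # this stub "generates" text by truncating the prompt so the sketch
        # runs without any model dependencies.
        self.generate = lambda prompt, length: (prompt + " ...")[:length]

    def predict(self, payload, query_params):
        # Read the generation length from a query parameter, falling back
        # to a default when the parameter is absent.
        length = int(query_params.get("length", 50))
        return self.generate(payload.get("text", ""), length)
```

A request such as `?length=20` would then control the length of the generated text, with no change to the request body.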