Serve models at scale.
```yaml
- kind: api
  name: <string>  # API name (required)
  model: <string>  # path to an exported model (e.g. s3://my-bucket/exported_model)
  model_format: <string>  # model format, must be "tensorflow" or "onnx" (default: "onnx" if model path ends with .onnx, "tensorflow" if model path ends with .zip or is a directory)
  request_handler: <string>  # path to the request handler implementation file, relative to the cortex root
  tf_signature_key: <string>  # name of the signature def to use for prediction (required if your model has more than one signature def)
  tracker:
    key: <string>  # key to track (required if the response payload is a JSON object)
    model_type: <string>  # model type, must be "classification" or "regression"
  compute:
    min_replicas: <int>  # minimum number of replicas (default: 1)
    max_replicas: <int>  # maximum number of replicas (default: 100)
    init_replicas: <int>  # initial number of replicas (default: <min_replicas>)
    target_cpu_utilization: <int>  # CPU utilization threshold (as a percentage) to trigger scaling (default: 80)
    cpu: <string | int | float>  # CPU request per replica (default: 200m)
    gpu: <int>  # GPU request per replica (default: 0)
    mem: <string>  # memory request per replica (default: Null)
```
See packaging models for how to export the model.
```yaml
- kind: api
  name: my-api
  model: s3://my-bucket/my-model.onnx
  request_handler: handler.py
  compute:
    gpu: 1
```
Request handlers are used to decouple the interface of an API endpoint from its model. A `pre_inference` request handler can be used to modify request payloads before they are sent to the model. A `post_inference` request handler can be used to modify model predictions in the server before they are sent to the client.
See request handlers for a detailed guide.
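A minimal sketch of what `handler.py` might contain is shown below. The function signatures and the `metadata` argument are assumptions here; consult the request handlers guide for the authoritative interface.

```python
# handler.py -- hypothetical request handler sketch

def pre_inference(sample, metadata):
    # modify the request payload before it is sent to the model,
    # e.g. scale raw pixel values into [0, 1]
    sample["pixels"] = [p / 255.0 for p in sample["pixels"]]
    return sample

def post_inference(prediction, metadata):
    # modify the model's prediction before it is sent to the client,
    # e.g. map the predicted class index to a label (labels are illustrative)
    labels = ["negative", "positive"]
    return {"label": labels[int(prediction[0])]}
```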
`tracker` can be configured to collect API prediction metrics and display real-time stats in `cortex get <api_name>`. The tracker looks for scalar values in the response payload (after the execution of the `post_inference` request handler, if provided). If the response payload is a JSON object, `key` can be set to extract the desired scalar value. For regression models, the tracker should be configured with `model_type: regression` to collect float values and display regression stats such as min, max, and average. For classification models, the tracker should be configured with `model_type: classification` to collect integer or string values and display the class distribution.
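For example, if a classifier's response payload is a JSON object such as `{"label": "positive"}`, the tracker could be configured as follows (the API name, model path, and key are illustrative):

```yaml
- kind: api
  name: my-api
  model: s3://my-bucket/my-model.onnx
  tracker:
    key: label
    model_type: classification
```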
You can log more information about each request by adding a `?debug=true` parameter to your requests. This will print:

1. The raw sample
2. The value after running the `pre_inference` function (if applicable)
3. The value after running inference
4. The value after running the `post_inference` function (if applicable)
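For example, using curl against your API's endpoint (the endpoint placeholder and payload are illustrative):

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"pixels": [0, 128, 255]}' \
  "<api_endpoint>?debug=true"
```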
Cortex adjusts the number of replicas that are serving predictions by monitoring the compute resource usage of each API. The number of replicas will be at least `min_replicas` and no more than `max_replicas`.
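For example, the sketch below keeps an API between 2 and 10 replicas and triggers scaling when average CPU utilization crosses 90% (the values are illustrative):

```yaml
compute:
  min_replicas: 2
  max_replicas: 10
  target_cpu_utilization: 90
```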
Cortex spins up and down nodes based on the aggregate resource requests of all APIs. The number of nodes will be at least `$CORTEX_NODES_MIN` and no more than `$CORTEX_NODES_MAX` (configured during installation and modifiable via the AWS console).
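A sketch of setting these bounds, assuming the installer reads them from your shell environment as described above (the values are illustrative):

```bash
# set before installing Cortex; values are illustrative
export CORTEX_NODES_MIN=2
export CORTEX_NODES_MAX=5
```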