Serve models at scale.
```yaml
- kind: api
  name: <string>  # API name (required)
  model: <string>  # path to an exported model (e.g. s3://my-bucket/model.zip)
  model_format: <string>  # model format, must be "tensorflow" or "onnx" (default: "onnx" if the model path ends with .onnx, "tensorflow" if the model path ends with .zip)
  request_handler: <string>  # path to the request handler implementation file, relative to the cortex root
  compute:
    min_replicas: <int>  # minimum number of replicas (default: 1)
    max_replicas: <int>  # maximum number of replicas (default: 100)
    init_replicas: <int>  # initial number of replicas (default: <min_replicas>)
    target_cpu_utilization: <int>  # CPU utilization threshold (as a percentage) to trigger scaling (default: 80)
    cpu: <string>  # CPU request per replica (default: 400m)
    gpu: <string>  # GPU request per replica (default: 0)
    mem: <string>  # memory request per replica (default: Null)
```
See packaging models for how to create the zipped model.
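As a rough illustration (not the packaging models guide itself), producing the zipped model might look something like the sketch below, assuming the TensorFlow model format; the toy model, directory names, and upload step are assumptions made here for the example.

```python
# Sketch: export a toy TensorFlow model and zip it so it can be referenced by the
# `model` field (e.g. s3://my-bucket/my-model.zip). Paths and names are illustrative.
import shutil

import tensorflow as tf


class Doubler(tf.Module):
    """A toy model used only to produce a SavedModel for this example."""

    @tf.function(input_signature=[tf.TensorSpec([None, 4], tf.float32)])
    def __call__(self, x):
        return 2.0 * x


tf.saved_model.save(Doubler(), "export")          # writes a SavedModel into export/
shutil.make_archive("my-model", "zip", "export")  # produces my-model.zip from export/

# Upload my-model.zip to S3 (e.g. aws s3 cp my-model.zip s3://my-bucket/my-model.zip)
# and reference it via `model: s3://my-bucket/my-model.zip`.
```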
```yaml
- kind: api
  name: my-api
  model: s3://my-bucket/my-model.zip
  request_handler: inference.py
  compute:
    min_replicas: 5
    max_replicas: 20
    cpu: "1"
```
Request handlers are used to decouple the interface of an API endpoint from its model. A `pre_inference` request handler can modify request payloads before they are sent to the model, and a `post_inference` request handler can modify model predictions on the server before they are returned to the client.
See request handlers for a detailed guide.
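As a rough sketch, a request handler file (like `inference.py` in the example above) might look something like the following; the exact function signatures Cortex passes to these hooks are assumptions here, and the request handlers guide is authoritative.

```python
# inference.py -- a minimal request handler sketch. The hook names follow the
# description above; the signatures are assumed for illustration, so consult
# the request handlers guide for the real interface.
import numpy as np


def pre_inference(payload):
    # Convert the JSON request payload into the numeric input the model expects.
    return np.asarray(payload["features"], dtype=np.float32).reshape(1, -1)


def post_inference(prediction):
    # Map the raw model output to a client-friendly response.
    probabilities = np.asarray(prediction).flatten().tolist()
    return {"class_id": int(np.argmax(probabilities)), "probabilities": probabilities}
```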