To use Inferentia, set the instance type to an Inferentia instance (e.g. `inf1.xlarge`) when creating your Cortex cluster, and set the `inf` field in the `compute` configuration for your API. One unit of `inf` corresponds to one Inferentia ASIC with 4 NeuronCores (not the same thing as `cpu`) and 8GB of cache memory (not the same thing as `mem`). Fractional requests are not allowed.

The Inferentia instance types have the following topology:

* `inf1.xlarge`/`inf1.2xlarge` - each has 1 Inferentia ASIC
* `inf1.6xlarge` - has 4 Inferentia ASICs
* `inf1.24xlarge` - has 16 Inferentia ASICs

## NeuronCore Groups

NeuronCore Groups (NCGs) are created automatically based on the number of processes per replica (the `processes_per_replica` field in the API configuration for Realtime APIs). Each NCG will have an equal share of NeuronCores. Therefore, the size of each NCG will be `4 * inf / processes_per_replica` (`inf` refers to your API's compute request, and it's multiplied by 4 because there are 4 NeuronCores per Inferentia chip).

For example, if your API requests 2 `inf` chips, there will be 8 NeuronCores available. If you set `processes_per_replica` to 1, there will be one copy of your model running on a single NCG of size 8 NeuronCores. If `processes_per_replica` is 2, there will be two copies of your model, each running on a separate NCG of size 4 NeuronCores. If `processes_per_replica` is 4, there will be 4 NCGs of size 2 NeuronCores, and if `processes_per_replica` is 8, there will be 8 NCGs of size 1 NeuronCore. In this scenario, these are the only valid values for `processes_per_replica`. In other words, the total number of requested NeuronCores (which equals 4 * the number of requested Inferentia chips) must be divisible by `processes_per_replica`.

## Compiling models

Before a model can be deployed on Inferentia, it must be compiled for Inferentia. The number of NeuronCores a model is compiled for should match the size of the NCG that will run it, which is `4 * inf / processes_per_replica` (`inf` refers to your API's compute request, and it's multiplied by 4 because there are 4 NeuronCores per Inferentia chip). See NeuronCore Groups above for an example, and see Improving performance below for a discussion of choosing the appropriate number of NeuronCores.

The versions of `tensorflow-neuron` and `torch-neuron` that are used by Cortex are found in the Realtime API pre-installed packages list and the Batch API pre-installed packages list. When installing these packages with pip to compile models of your own, use the extra index URL `--extra-index-url=https://pip.repos.neuron.amazonaws.com`.

Model compilation examples can be found in the `aws/aws-neuron-sdk` repo for TensorFlow and for PyTorch, and there are also 2 examples implemented with Cortex.

## Improving performance

* If your model is compiled for more NeuronCores than necessary, you will have to set `processes_per_replica` to be lower so that the NeuronCore Group has access to the number of NeuronCores for which you compiled your model. This is acceptable if latency is your top priority, but if throughput is more important to you, this tradeoff is usually not worth it. To maximize throughput, compile your model for as few NeuronCores as possible and increase `processes_per_replica` to the maximum possible value (see above for a sample calculation).
* Use the `--static-weights` compiler option when possible. This option tells the compiler to cache the entire model onto the NeuronCores, which avoids a lot of back-and-forth between the machine's CPU/memory and the Inferentia ASICs.
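
As an illustration of compiling for a specific NCG size, here is a minimal `torch-neuron` sketch. It is not one of the official examples; it assumes `torch-neuron`, `neuron-cc`, and `torchvision` are installed (using the extra index URL above), uses ResNet-50 as a stand-in for your model, and targets an NCG of 1 NeuronCore:

```python
import torch
import torch_neuron  # registers torch.neuron; install with --extra-index-url=https://pip.repos.neuron.amazonaws.com
from torchvision import models

model = models.resnet50(pretrained=True).eval()
example_input = torch.zeros(1, 3, 224, 224)

# Compile for the size of the NCG that will run the model,
# i.e. 4 * inf / processes_per_replica. An NCG of 1 NeuronCore is assumed here,
# which favors throughput (it allows the highest processes_per_replica).
compiler_args = ["--neuroncore-pipeline-cores", "1"]
# Optionally cache the entire model on the NeuronCores (see the tip above):
# compiler_args.append("--static-weights")

model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example_input],
    compiler_args=compiler_args,
)

# Save the compiled model; this artifact is what gets deployed with the API.
model_neuron.save("resnet50_neuron.pt")
```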
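
A similar sketch for a TensorFlow SavedModel, assuming the `tensorflow-neuron` 1.x `saved_model.compile` API; the directory names are hypothetical placeholders:

```python
import tensorflow.neuron as tfn  # install with --extra-index-url=https://pip.repos.neuron.amazonaws.com

# Compile an existing SavedModel for Inferentia. As above, the NeuronCore count
# should match the NCG size (4 * inf / processes_per_replica); 1 is assumed here.
tfn.saved_model.compile(
    "resnet50_savedmodel",  # hypothetical path to the original SavedModel
    "resnet50_neuron",      # hypothetical path for the compiled SavedModel
    compiler_args=["--neuroncore-pipeline-cores", "1"],
)
```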