
What is Model Runtime Performance?

Every deep learning (DL) model is evaluated on two axes: its accuracy and its runtime performance. Runtime performance covers the measurements that determine the model's efficiency and compute cost. It consists of:

  • Latency is the time from when the model receives a sample until it returns a prediction, measured in milliseconds. Latency is a crucial metric for deep learning models used in real-time and time-sensitive applications. It includes uploading the sample from memory to the chip, computing the neural network’s forward pass, and downloading the prediction from the chip back to memory (the “transfer time”). Latency is measured for the entire batch, so the latency of a single image/sample in a batch equals the latency of the whole batch.

  • Throughput is the maximum number of samples the neural network can predict within a given time frame (typically 1 second). When scaling deep learning models to production, throughput is a key metric for assessing performance characteristics and cost-to-serve. As with latency, the throughput measurement includes the transfer time for uploading samples from memory to the chip, computing the neural network’s forward pass, and downloading predictions from the chip back to memory.

  • Memory footprint is the amount of processor (GPU or CPU) memory used during the forward pass. In neural networks, memory is needed to store the input data, the weight parameters, and the intermediate activations as an input propagates through the network. Inference behaves differently from training: during training, activations from the forward pass must be retained until they are used to compute the error gradients in the backward pass, whereas inference can discard each activation once it is no longer needed.
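The latency and throughput definitions above can be sketched with a small timing harness. This is a minimal illustration, not a production benchmark: `predict` is a hypothetical stand-in for a real model's forward pass, and on an accelerator you would additionally need to synchronize the device before reading the clock.

```python
import time

def predict(batch):
    """Hypothetical stand-in for a model's forward pass."""
    # A real model would run its forward pass (and transfers) here.
    return [sum(sample) for sample in batch]

def measure_latency_ms(batch, warmup=3, runs=10):
    """Median wall-clock latency of one batch, in milliseconds.

    Times everything between submitting the batch and receiving the
    predictions, mirroring the 'transfer + forward pass' definition.
    """
    for _ in range(warmup):  # warm-up runs amortize one-time setup costs
        predict(batch)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(batch)
        timings.append((time.perf_counter() - start) * 1000.0)
    return sorted(timings)[len(timings) // 2]

batch = [[0.1] * 224 for _ in range(8)]  # hypothetical batch of 8 samples
latency_ms = measure_latency_ms(batch)

# Throughput follows directly: samples processed per second at this latency.
throughput = len(batch) / (latency_ms / 1000.0)
```

Note how the two metrics are linked: for a fixed batch size, higher throughput means processing the same batch in less wall-clock time.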

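The memory-footprint components listed above (input data, weights, activations) can be estimated by simple counting. The sketch below is a back-of-the-envelope estimate for a single hypothetical fully connected layer in float32; real frameworks add allocator overhead, workspace buffers, and padding on top of this.

```python
BYTES_FP32 = 4  # each float32 value occupies 4 bytes

def dense_layer_memory_bytes(in_features, out_features, batch_size):
    """Rough inference-time memory for one fully connected layer.

    Counts the weight matrix and bias vector plus the input and output
    activations for the whole batch. Training would need more, since
    forward-pass activations must be kept for the backward pass.
    """
    parameters = in_features * out_features + out_features        # W and b
    activations = batch_size * (in_features + out_features)       # in + out
    return (parameters + activations) * BYTES_FP32

# e.g. a hypothetical 1024 -> 512 layer with batch size 32:
mem = dense_layer_memory_bytes(1024, 512, 32)  # ≈ 2.2 MB
```

Note that the activation term scales with batch size while the parameter term does not, which is why larger batches raise the memory footprint even though the model itself is unchanged.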
