Quick Start

What is infery?

infery (from the word inference) is Deci’s proprietary deep-learning runtime inference engine. It turns a model into a siloed, efficient runtime server and lets you load and run the model from Python code.

infery enables efficient inference and seamless deployment, on any hardware. infery is essential for overcoming the complex challenges of making deep learning models production-ready.

infery benefits include –

  • Simplifies Deployment – Load models using a quick and simple Python package, built for scalability and fast deployment.
  • Improves Latency/Throughput – Enjoy accelerated inference performance for DL models provided by our platform, optimized for any given target hardware (CPU or GPU).
  • Runs Anywhere – Deci enables model portability across common frameworks and across various types of production hosts. infery offers inference performance optimization and model portability across multiple hardware types, platforms and frameworks. You can change runtime backends (platforms) out of the box, without touching your code.
  • Reduces Cost-to-Serve – Deci reduces total cost of ownership by up to 80% by maximizing hardware utilization. infery enables the pipelining and performance scaling of multiple models on a single host.
  • Measures Your Model's Performance During Production – Deci reveals how your models really behave on your production hardware. Just load your model using infery to see how it behaves (in terms of latency, ms). This gives you the ability to debug and calculate the compute capacity you'll need for your task.

Quick Start

The following describes how to install infery and load a model so that you can run inference using Python.
If you encounter errors, see Error Handling.

Install

Full instructions can be found in Installation Instructions.

# For CPU:
python3 -m pip install -U pip
python3 -m pip install infery


# For GPU:
python3 -m pip install -U pip

# Compile pycuda for the local CUDA. The example uses CUDA 11.2
export PATH=/usr/local/cuda-11.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
python3 -m pip install -U pycuda

# Install infery-gpu from PyPI and TensorRT from NVIDIA's pip repository
python3 -m pip install -U --extra-index-url https://pypi.ngc.nvidia.com infery-gpu
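
After installation, a quick check that the packages are visible to the interpreter can save a confusing import error later. The sketch below uses only the standard library. The helper name is_installed is hypothetical, and pycuda is assumed to be the import name of the package installed above (infery is the import name used throughout this guide).

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


# "pycuda" is only expected on GPU installations (assumed import name).
for name in ("infery", "pycuda"):
    status = "found" if is_installed(name) else "not found"
    print(f"{name}: {status}")
```

This only confirms that the module can be located; it does not import it or validate the CUDA setup.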

Loading Models

In a Python environment, load the model using the infery.load function. It accepts the following parameters:

  • model_path – Specify the exact path to where you downloaded/saved your model.
  • framework_type – Specify the framework used to develop this model (for example, onnx). The supported options are listed in the table below.
  • inference_hardware – Specify either gpu or cpu, according to the hardware on which the model will run.
  • static_batch_size – The batch size for which the model graph is frozen if the model is static; None if the model is dynamic.
  • inference_dtype – The numpy data type to use for inference.

import infery

model = infery.load(model_path='model.onnx', framework_type='onnx', inference_hardware='gpu')

Loading Model Example:

# Download a pre-trained ResNet50 (ImageNet) model that supports inference with a batch size of up to 64.
from urllib.request import urlretrieve
urlretrieve('https://dips-models-public.s3.amazonaws.com/resnet50_batchsize_64.onnx',
            '/tmp/model.onnx')

# Load it with infery
import infery
model = infery.load(model_path='/tmp/model.onnx', framework_type='onnx')

If the model loaded successfully, you should see log output similar to the following –

__init__ -INFO- Infery was successfully imported with 2 CPUS and 1 GPUS.
infery_manager -INFO- Loading model /tmp/model.onnx to the GPU
infery_manager -INFO- Successfully loaded /tmp/model.onnx to the GPU.

If infery fails to load the model for any reason, the log shows the cause of the error with a verbose description –

infery_manager -ERROR- Failed to register model: The model file does not exist at /tmp/model.onnx
---------------------------------------------------------------------------
      1 import infery
----> 2 model = infery.load(model_path='/tmp/model.onnx', framework_type='onnx')
FileNotFoundError: The model file does not exist at /tmp/model.onnx
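
One way to surface a wrong path earlier is to check it before calling the loader. The following is a minimal sketch: load_checked is a hypothetical helper, not part of infery, and the loader is passed in as an argument so that the check can be tried without the library installed.

```python
from pathlib import Path
from typing import Any, Callable


def load_checked(model_path: str, loader: Callable[..., Any], **load_kwargs: Any) -> Any:
    """Verify that model_path points to a file, then delegate to loader."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"The model file does not exist at {path}")
    return loader(model_path=str(path), **load_kwargs)


# Stand-in loader for demonstration only. With infery installed you would
# pass infery.load instead, for example:
#   load_checked('/tmp/model.onnx', infery.load, framework_type='onnx')
def fake_loader(model_path: str, **kwargs: Any) -> dict:
    return {"model_path": model_path, **kwargs}


try:
    load_checked("/definitely/missing/model.onnx", fake_loader, framework_type="onnx")
except FileNotFoundError as err:
    print(f"Could not load model: {err}")
```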

Predict - Run Inference

Once the model is loaded (see Loading Models), it is ready for inference.
In the example below, x represents the input tensor to be processed by the model. Here it is a randomly generated tensor; replace it with the real tensor that you want to pass to the model.predict command.

import numpy as np
from typing import List

x = np.random.random((1, 3, 224, 224)).astype('float32')

y: List[np.ndarray] = model.predict(x)
# or
y: List[np.ndarray] = model(x)
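
When replacing the random tensor with real data, the input needs the same layout and data type as the tensor above. The following is a minimal sketch, assuming the model expects float32 input of shape (batch, 3, height, width) as in the example. The helper name to_model_input and the scaling of pixel values to [0, 1] are assumptions; use the preprocessing your model was actually trained with.

```python
import numpy as np


def to_model_input(image_hwc: np.ndarray) -> np.ndarray:
    """Convert one HWC uint8 RGB image into a (1, 3, H, W) float32 tensor."""
    if image_hwc.ndim != 3 or image_hwc.shape[2] != 3:
        raise ValueError(f"Expected an (H, W, 3) image, got shape {image_hwc.shape}")
    # Reorder channels-last (H, W, C) to channels-first (C, H, W).
    chw = np.transpose(image_hwc, (2, 0, 1)).astype(np.float32)
    # Assumed preprocessing: scale pixel values from [0, 255] to [0, 1].
    chw /= 255.0
    # Add the batch dimension and make the memory layout contiguous.
    return np.ascontiguousarray(chw[np.newaxis, ...])


# A synthetic 224x224 image stands in for real data.
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
x_real = to_model_input(image)
print(x_real.shape, x_real.dtype)
```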

infery supports passing extra keyword arguments (kwargs) to benchmark() and predict().

Any extra arguments passed to these functions are forwarded to your model's framework as-is.

The output of the model is a list containing one numpy.ndarray with shape (1, 1000), because the model was trained on the ImageNet dataset, which has 1000 labels –

[array([[-2.16313553e+00, -7.49338865e-01, -4.13975000e-01,
         -5.33734620e-01, -6.56776190e-01, -1.02638006e+00,
         -1.13409054e+00, -9.78322923e-01, -1.32272959e+00,
         -1.02403033e+00,  6.21275842e-01,  1.09605193e+00,
                    ...
          2.25341439e+00]], dtype=float32)]

Infery will always return a list of numpy arrays as a result.
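
To turn such an output into predicted classes, the scores can be ranked per sample. The following is a minimal sketch, assuming the array holds raw class scores (logits), which the negative values above suggest but do not confirm. The helper name top_k_classes is hypothetical, and a random array stands in for the result of model.predict.

```python
import numpy as np
from typing import List, Tuple


def top_k_classes(outputs: List[np.ndarray], k: int = 5) -> Tuple[np.ndarray, np.ndarray]:
    """Return (indices, probabilities) of the k highest-scoring classes per sample."""
    logits = outputs[0]  # first (and here only) output, shape (batch, num_classes)
    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Sort class indices by descending probability and keep the first k.
    top_idx = np.argsort(-probs, axis=1)[:, :k]
    top_probs = np.take_along_axis(probs, top_idx, axis=1)
    return top_idx, top_probs


# Stand-in for y = model.predict(x): a list with one (1, 1000) float32 array.
rng = np.random.default_rng(0)
y_fake = [rng.standard_normal((1, 1000)).astype("float32")]
indices, probabilities = top_k_classes(y_fake, k=5)
print(indices.shape, probabilities.shape)
```

Mapping the returned indices to label names requires the label file that matches the model's training data.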

Benchmark (Measure) The Model Performance On The Current HW

Run the model.benchmark command from your application, as follows –

# Benchmark implicitly
model.benchmark(batch_size=1)

# Or, Benchmark explicitly
model.benchmark(batch_size=8,
                input_dims=(3,224,224),
                repetitions=100,
                warmup_calls=10,
                dtype='float16')

The operation consists of the following parameters –

  • BATCH_SIZE is the batch size for which the measurement will be made. This should be the batch size that the model is configured to handle.
  • INPUT_DIMS (optional) is the size/shape of the input to be used to measure the model’s performance. This should be the size/shape that the model is configured to handle.
  • WARMUP_CALLS (optional) is the number of warmup calls to perform PRIOR to the benchmark. Warmup lets the clocks on the HW ramp up so that the HW reaches its peak compute power before the benchmark starts, which makes the results more consistent.
  • REPETITIONS (optional) is the number of times the measurement request will be sent in order to improve accuracy. The measurement that is presented in the Deci platform represents the average of the measurements in the responses to each of these requests.
  • DTYPE (optional) is the data type of the inference (numpy compatible, e.g. float32, float16, int8, etc.). The data type affects inference because it changes the amount of data that needs to be copied to the memory and back (for inference).
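
The roles of the warmup and repetition parameters can be illustrated with a plain timing loop. This is a sketch of the general technique only, not infery's implementation: time_callable is a hypothetical helper, and the workload is a stand-in for a call such as model.predict(x).

```python
import statistics
import time
from typing import Callable, Dict


def time_callable(fn: Callable[[], object],
                  repetitions: int = 100,
                  warmup_calls: int = 10) -> Dict[str, float]:
    """Time fn over several repetitions after untimed warmup calls."""
    if repetitions < 1:
        raise ValueError("repetitions must be at least 1")
    # Warmup calls are executed but not measured.
    for _ in range(warmup_calls):
        fn()
    samples_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples_ms),
        "variance_ms2": statistics.pvariance(samples_ms),
    }


# Stand-in workload; replace with a call to your model.
stats = time_callable(lambda: sum(range(10_000)), repetitions=50, warmup_calls=5)
print(f"mean: {stats['mean_ms']:.3f} ms")
```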

The following is an example of a request –

model.benchmark(64, duration_sec=3)

The following is an example of a response –

<ModelBenchmarks: {
    "batch_size": 64,
    "batch_inf_time": "10.72 ms",
    "batch_inf_time_variance": "0.01 ms",
    "system_startpoint_memory_used": "146.00 mb",
    "model_memory_used": "147.00 mb",
    "post_inference_memory_used": "1486.00 mb",
    "total_memory_size": "8192.00 mb",
    "throughput": "5970.04 fps",
    "sample_inf_time": "0.17 ms",
    "include_io": true,
    "framework_type": "trt",
    "framework_version": "8.0.1.6",
    "inference_hardware": "GPU",
    "infery_version": "3.7.0rc20324",
    "date": "14:34:56__11-08-2022",
    "ctime": 1667910896,
    "h_to_d_mean": "3.26 ms",
    "d_to_h_mean": "0.03 ms",
    "h_to_d_variance": "0.01 ms",
    "d_to_h_variance": "0.00 ms"
}>
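
Some of the fields in this response appear to be related by simple arithmetic. The sketch below assumes that throughput is the batch size divided by the batch inference time and that sample_inf_time is the batch inference time divided by the batch size. These relationships are inferred from the numbers above rather than taken from the infery documentation, and parse_ms is a hypothetical helper.

```python
def parse_ms(value: str) -> float:
    """Parse a string such as '10.72 ms' into a float number of milliseconds."""
    number, unit = value.split()
    if unit != "ms":
        raise ValueError(f"Expected a value in ms, got {value!r}")
    return float(number)


batch_size = 64
batch_ms = parse_ms("10.72 ms")

# Samples per second: samples in a batch divided by the batch time in seconds.
throughput_fps = batch_size / (batch_ms / 1000.0)
# Average time per sample within the batch.
sample_ms = batch_ms / batch_size

print(f"throughput: {throughput_fps:.2f} fps")
print(f"per-sample time: {sample_ms:.4f} ms")
```

The computed throughput is close to, but not exactly, the reported 5970.04 fps, which would be consistent with the reported batch time being rounded for display.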

Profiling - benchmark the inference layer-by-layer.

To benchmark a model layer by layer, pass profiling=True to infery.load.

model = infery.load(
    model_path="../../models/hardware_specific_models/nvidia/rtx3070/resnet18_batchsize_64_RTX3070.pkl",  # Nvidia RTX 3070 (TensorRT 8.0.1.6)
    framework_type="trt",
    profiling=True,
)
print(model.benchmark(batch_size=1, duration_sec=5))

# List all the layers, with each layer's percentage of the execution time
print("\nModel Layers:")
layers_df = model.get_layers_profile_dataframe()
print(layers_df)

print("\nTop 20 layers by % of time:")
model.get_bottlenecks(num_layers=20)

The following is an example of the output –

                                           Layer Name        ms  Percentage
1                           node_of_123 + node_of_125  1.560576   21.800138
5             node_of_130 + node_of_132 + node_of_133  0.405504    5.664603
7             node_of_137 + node_of_139 + node_of_140  0.392192    5.478644
2                                         node_of_126  0.387072    5.407121
4                           node_of_127 + node_of_129  0.384000    5.364207
6                           node_of_134 + node_of_136  0.376832    5.264076
12            node_of_153 + node_of_155 + node_of_156  0.302080    4.219843
11                          node_of_150 + node_of_152  0.292864    4.091102
9                                         node_of_144  0.286720    4.005275
17            node_of_169 + node_of_171 + node_of_172  0.279552    3.905143
16                          node_of_166 + node_of_168  0.278528    3.890838
22            node_of_185 + node_of_187 + node_of_188  0.278528    3.890838
14                                        node_of_160  0.271456    3.792048
19                                        node_of_176  0.271360    3.790707
21                          node_of_182 + node_of_184  0.270336    3.776402
0   Reformatting CopyNode for Input Tensor 0 to no...  0.219936    3.072350
8                           node_of_141 + node_of_143  0.192512    2.689256
3   Reformatting CopyNode for Output Tensor 0 to n...  0.163840    2.288728
13                          node_of_157 + node_of_159  0.152576    2.131378
18                          node_of_173 + node_of_175  0.150528    2.102769
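
A table like this can be used to find the smallest set of layers that accounts for a given share of the inference time. The following is a minimal sketch in plain Python, using the first five rows copied from the table above. The helper name layers_covering is hypothetical and is not part of infery.

```python
from typing import List, Tuple

Row = Tuple[str, float, float]  # (layer name, time in ms, percentage of total time)

# First five rows of the table above.
rows: List[Row] = [
    ("node_of_123 + node_of_125", 1.560576, 21.800138),
    ("node_of_130 + node_of_132 + node_of_133", 0.405504, 5.664603),
    ("node_of_137 + node_of_139 + node_of_140", 0.392192, 5.478644),
    ("node_of_126", 0.387072, 5.407121),
    ("node_of_127 + node_of_129", 0.384000, 5.364207),
]


def layers_covering(rows: List[Row], target_percentage: float) -> List[Row]:
    """Return the slowest layers whose combined share reaches target_percentage."""
    ordered = sorted(rows, key=lambda row: row[2], reverse=True)
    selected: List[Row] = []
    covered = 0.0
    for row in ordered:
        if covered >= target_percentage:
            break
        selected.append(row)
        covered += row[2]
    return selected


hot_layers = layers_covering(rows, target_percentage=30.0)
for name, ms, pct in hot_layers:
    print(f"{pct:6.2f}%  {ms:.6f} ms  {name}")
```

If the rows supplied do not add up to the target, the function returns all of them.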

Visualizing the model

The model can be visualized using Netron.

model.open_in_netron()