
Inference

Infery offers a straightforward interface for making predictions using each of its supported frameworks. There are three different predict options:

- Predict
- Predict Async
- Benchmark

Predict

Predict using ONNX

First, download ResNet50 (resnet50-v1-12.onnx) from the ONNX Model Zoo. The ONNX Model Zoo is a repository of pre-trained deep learning models in ONNX format, allowing users to easily access and use these models for various machine learning tasks.

from infery import Model

model = Model("resnet50-v1-12.onnx")
input_feed = model.make_dummy_inputs()
result = model.predict(input_feed)

Predict using PyTorch

To predict using PyTorch, the user must define their own TensorSpec.

import torchvision
from infery import FrameworkType, ModelSourceType, Model
from infery.types import TensorSpec, DType

nn_module_model = torchvision.models.resnet50(pretrained=True)
model = Model(nn_module_model, framework=FrameworkType.PYTORCH, source_type=ModelSourceType.LOADED_MODEL)

pytorch_input_specs = [TensorSpec(axes=("batch_size", 3, 224, 224), dtype=DType.FP32, name='input_1')]
input_feed = model.make_dummy_inputs(specs=pytorch_input_specs)
result = model.predict(*input_feed.values())

Note that the arguments are passed to the contained nn.Module as-is, and therefore input_feed requires *-unpacking in this case.
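
Since the call is forwarded directly to the wrapped nn.Module, a real tensor can be passed in place of the dummy inputs. A minimal sketch, reusing the model from the snippet above; the tensor name is illustrative:

import torch

# Any correctly shaped tensor is forwarded to the wrapped nn.Module as-is.
image_batch = torch.rand(2, 3, 224, 224)
result = model.predict(image_batch)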

Predict Async

Predict Async using OpenVINO

Predict Async is supported for both TensorRT and OpenVINO. First, make sure you have OpenVINO installed; if not, please check out our installation guide. Next, download and convert SSD300 using the OpenVINO Open Model Zoo tools:

omz_downloader --name ssd300
omz_converter --name ssd300

from infery import Model

model = Model("ssd300.xml")
model.start_engine()
input_feed = model.make_dummy_inputs()

handles_list = [model.engine.predict_async(input_feed=input_feed) for _ in range(100)]
result_list = [handle.get() for handle in handles_list]
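
Predict Async for TensorRT follows the same handle pattern. A minimal sketch, assuming the TensorRT engine exposes the same predict_async/get interface after compiling the ONNX model (the compile call is also shown in the Output Bindings section below):

from infery import Model, FrameworkType

# Compile the ONNX model to TensorRT, then issue asynchronous predictions.
trt_model = Model("resnet50-v1-12.onnx").compile(target_framework=FrameworkType.TENSORRT, target_batch_size=1)
input_feed = trt_model.make_dummy_inputs()

handles_list = [trt_model.engine.predict_async(input_feed=input_feed) for _ in range(100)]
result_list = [handle.get() for handle in handles_list]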

Benchmark

Infery can also benchmark models; here we reuse the ResNet50 ONNX model downloaded in the Predict section.

from infery import Model, BenchmarkParams

model = Model("resnet50-v1-12.onnx")
benchmark_params = BenchmarkParams(batch_size=8, repetitions=100, warmups=100)
model.benchmark(benchmark_params=benchmark_params)

The following result will be output by running the benchmark function (actual results may vary depending on your environment):

BenchmarkResult:
    'batch_size': 8,
    'framework': <FrameworkType.ONNX: 'ONNX'>,
    'framework_version': '1.14.1',
    'gpu_temp_celsius': 59.0,
    'inference_device': {'device_id': 0, 'hardware': <HWType.GPU: 'GPU'>},
    'infery_version': '4.0.1',
    'latency_mean_ms': 11.611,
    'latency_variance_ms': 0.042,
    'memory_consumption_mb': 97.825,
    'throughput_qps': 688.984

For a one-line benchmark, you may simply pass the desired batch size along with other benchmark options as keyword arguments:

from infery import Model

model = Model("resnet50-v1-12.onnx")
model.benchmark(batch_size=8, duration_secs=5)
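
Assuming benchmark() returns the BenchmarkResult shown above, its fields can then be inspected programmatically; the attribute-style access below is an assumption about the result object:

from infery import Model

model = Model("resnet50-v1-12.onnx")
result = model.benchmark(batch_size=8, duration_secs=5)
# Field names taken from the sample output above; the access style may differ.
print(result.throughput_qps, result.latency_mean_ms)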

ONNX providers

When using an ONNX model, it can be initialized with one or more execution providers that best suit your requirements or hardware capabilities.

from infery import Model
model = Model("resnet50-v1-12.onnx")
model.start_engine(providers=["CPUExecutionProvider"])
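
If a GPU-enabled ONNX Runtime build is installed, several providers can be listed in priority order; a minimal sketch using the standard ONNX Runtime provider names, assuming Infery forwards the list to ONNX Runtime as given:

from infery import Model

model = Model("resnet50-v1-12.onnx")
# Providers are tried in order; CPUExecutionProvider acts as a fallback.
model.start_engine(providers=["CUDAExecutionProvider", "CPUExecutionProvider"])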

Output Bindings

In TensorRT, pre-allocating a pinned output buffer helps to reduce the time spent on copying.

from infery import Model, FrameworkType

# Load and compile model
trt_model = Model("resnet50-v1-12.onnx").compile(target_framework=FrameworkType.TENSORRT, target_batch_size=1)

# Run inference with pre-allocated bindings
input_feed = trt_model.make_dummy_inputs()
output_feed = trt_model.engine.make_output_bindings(pinned=True)
trt_model.predict(input_feed=input_feed, output_bindings=output_feed)
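
Because the output buffers are allocated once, they can be reused across repeated predictions; a short sketch, assuming the engine writes each result into the same pre-allocated bindings:

# Reuse the same pinned output buffers for many predictions.
for _ in range(100):
    trt_model.predict(input_feed=input_feed, output_bindings=output_feed)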

Output Device

By default, outputs will be placed on the same device the inputs arrived on (unless output bindings were used). Users can choose to skip unwanted data copies by specifying the device the outputs should be left on:

import torchvision
from infery import FrameworkType, ModelSourceType, Model
from infery.types import TensorSpec, DType, HWType

# Prep model
torch_model = torchvision.models.resnet50(pretrained=True)
model = Model(torch_model, framework=FrameworkType.PYTORCH, source_type=ModelSourceType.LOADED_MODEL)

# Run inference
pytorch_input_specs = [TensorSpec(axes=("batch_size", 3, 224, 224), dtype=DType.FP32, name='input_1')]
input_feed = model.make_dummy_inputs(specs=pytorch_input_specs)
result = model.predict(*input_feed.values(), output_device=HWType.GPU)

Returned tensor types

Infery supports zero-copy conversions between many tensor libraries, so users can choose to pass their inputs and receive their outputs in any supported framework:

import torch
from infery import Model
model = Model("resnet50-v1-12.onnx")
torch_input = torch.randn(1, 3, 224, 224)
# get torch tensors back
torch_output = model.predict(torch_input, return_as="pt")