Infery Advanced Features
Infery is a high-performance Inference Engine that simplifies the deployment of Deep Learning models. Because deployment is a complex field, we incorporated a variety of power features into Infery to help you reach optimal performance.
Asynchronous Inference
Once a model is trained and optimized, we load it into memory and run predictions. One of the main struggles at that point is how to use the model efficiently: how do I leverage the power of the machine I am currently using? How do I squeeze even more out of the hardware?
With GPUs in general, and especially in the cloud, where we often pay per instance, we strive for high GPU utilization on every instance, because higher utilization decreases the number of instances required in a cluster.
When running multiple inference requests at the same time, preprocessing and postprocessing of one request can run while inference of another takes place, as in the sketch below.
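For illustration, here is a minimal sketch of that overlap. It assumes a model already loaded with infery.load(..., concurrency=...) as shown later on this page; the preprocess helper and the input shapes are hypothetical placeholders for your own pipeline.
import numpy as np

def preprocess(frame):
    # Hypothetical preprocessing (resize/normalize/transpose would go here).
    return np.expand_dims(frame.astype(np.float32), axis=0)

frames = [np.random.rand(224, 224, 3) for _ in range(4)]  # placeholder inputs

# Submit the first request, then preprocess the next frame while the GPU is busy.
handle = model.predict_async(preprocess(frames[0]))
outputs = []
for frame in frames[1:]:
    next_input = preprocess(frame)   # CPU work overlaps the in-flight inference
    outputs.append(handle.get())     # collect the previous result (blocking)
    handle = model.predict_async(next_input)
outputs.append(handle.get())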
Concurrent inference can offer several benefits:
- Increased throughput: With concurrent inference, an inference server can process multiple requests simultaneously, which can increase the overall throughput of the server. This means that the server can handle more requests in a given amount of time, which can be particularly beneficial in situations where there are many requests to process.
- Improved latency: Concurrent inference can also improve the latency of the inference server. When multiple requests are processed concurrently, the time it takes to process each individual request may be reduced, which can result in lower latencies and faster response times.
- Resource utilization: By processing multiple requests concurrently, an inference server can make more efficient use of available hardware resources, such as CPUs and GPUs. This can result in better resource utilization and more cost-effective inference.
- Scalability: Concurrent inference can also improve the scalability of an inference server. As the number of requests increases, the server can scale up to handle more requests by processing them concurrently, rather than having to rely on scaling up individual resources.
How to run async inference
import infery

model = infery.load(..., concurrency=CONCURRENCY)
test_input = model.example_inputs

inference_tasks = []
for _ in range(CONCURRENCY):
    # Submit the request and keep its execution handle
    execution_handle = model.predict_async(test_input)
    inference_tasks.append(execution_handle)

# Block until every output is ready
model_outputs = [_.get() for _ in inference_tasks]
# do something with the model outputs.
# …
Comparing synchronous and asynchronous inference:
Although changing your prediction strategy to asynchronous usually yields better results, comparing it to synchronous prediction is recommended.
from infery.common.enums.benchmark_mode import TRTBenchmarkMode

print(
    "Sync Benchmark:",
    model.benchmark(
        batch_size=1,
        duration_sec=5,
    ),
)
print(
    "Async Benchmark:",
    model.benchmark(
        batch_size=1,
        duration_sec=5,
        benchmark_mode=TRTBenchmarkMode.SINGLE_CONTEXT_ASYNC,
    ),
)
Run predict_async on the model with auto-generated example inputs:
import time

import infery

WARMUP_ITERATIONS = 200
CONCURRENCY = 10
ENGINE_PATH = '../models/hardware_specific_models/nvidia/a4000/yolox_s.engine'

# Load model
model = infery.load(ENGINE_PATH, concurrency=CONCURRENCY)

# Use infery's example_inputs to get a random np.ndarray input tensor
test_input = model.example_inputs

# Warm up the engine before measuring anything
[model.predict(test_input) for _ in range(WARMUP_ITERATIONS)]

# Get an async execution handle
execution_handle = model.predict_async(test_input)
# Check (without blocking) whether the result is ready
execution_handle.completed()
# Block until the output is ready
model_output = execution_handle.get()

# Benchmark predict_async's throughput.
test_input_list = [model.example_inputs for _ in range(CONCURRENCY)]
start = time.perf_counter()
tasks = [model.predict_async(_) for _ in test_input_list]
results = [_.get() for _ in tasks]
print(f'Predict Asynchronous TOOK: {((time.perf_counter() - start) * 1000):.7f} [ms]')

# Benchmark predict's throughput.
start = time.perf_counter()
results = [model.predict(_) for _ in test_input_list]
print(f'Predict Synchronous TOOK: {((time.perf_counter() - start) * 1000):.7f} [ms]')
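If you prefer to compare the two runs as throughput rather than raw wall-clock time, a small follow-up sketch (reusing model, test_input_list and CONCURRENCY from the snippet above, and repeating the measurement so the elapsed times are kept in variables) could look like this:
start = time.perf_counter()
tasks = [model.predict_async(_) for _ in test_input_list]
results = [_.get() for _ in tasks]
async_sec = time.perf_counter() - start

start = time.perf_counter()
results = [model.predict(_) for _ in test_input_list]
sync_sec = time.perf_counter() - start

# Convert elapsed seconds into requests per second and a relative speed-up
print(f'Async throughput: {CONCURRENCY / async_sec:.1f} requests/sec')
print(f'Sync throughput:  {CONCURRENCY / sync_sec:.1f} requests/sec')
print(f'Speed-up:         {sync_sec / async_sec:.2f}x')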
Layer Profiling
Another great power feature that Infery provides for all ONNX and TensorRT models is layer profiling. After loading an engine, a one-line call to the get_layers_profile_dataframe method returns a table with the latency of each of the deep learning model’s layers. This is especially useful for weeding out slow, unexpressive layers that cost far more in latency than they contribute to the model’s quality. Here’s a simple example of how to generate a dataframe of the model’s layers and the time each layer takes, in milliseconds:
model = infery.load(
    model_path="resnet18.engine",
    framework_type="trt",
    profiling=True,
)
model.get_layers_profile_dataframe()
Once you’ve got the layer profiles, calling the model.get_bottlenecks method will sort the layers and leave the most problematic ones at the top of the list. Below is an example of this method, along with a plot that illustrates the results.
bottlenecks = model.get_bottlenecks(num_layers=10)

# Plot the total time (ms) spent in each layer
ax = bottlenecks.plot.bar(
    x="Layer Name", y="ms", rot=90, title="ResNet18 ONNX Bottlenecks - Total Time"
)
ax.set_xlabel("Layer Name")
ax.set_ylabel("Inference Time [ms]")

# Plot the percentage of total inference time spent in each layer
ax = bottlenecks.plot.bar(
    x="Layer Name",
    y="Percentage",
    rot=90,
    title="ResNet18 ONNX Bottlenecks - Percentage",
)
The plots make it easy to identify which layers run quickly and which take the most time.
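If you prefer numbers over plots, and assuming get_bottlenecks returns a regular pandas DataFrame with the "Layer Name", "ms" and "Percentage" columns used above (an assumption of this sketch, not a documented guarantee), you can inspect the worst offenders directly:
# Assumption: bottlenecks is a pandas DataFrame with the columns used above.
top = bottlenecks.sort_values("ms", ascending=False).head(5)
print(top[["Layer Name", "ms", "Percentage"]])

# Share of total inference time spent in the five slowest layers
print(f'Top-5 layers account for {top["Percentage"].sum():.1f}% of inference time')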