How to Deploy Your Trained Model for High Performance Inference

Dr Gabriel Noaje¹

¹NVIDIA, Singapore, Singapore


Inference is where we interact with AI. Chatbots, digital assistants, recommendation engines, fraud protection services, and other applications that you use every day all rely on inference to get you the information that you need.

Given the wide array of uses for AI inference, evaluating performance poses numerous challenges for developers. For AI inference on data center, edge, and mobile platforms, MLPerf Inference 1.0 measures performance across computer vision, medical imaging, natural language, and recommender systems. The MLPerf benchmarks were developed by a consortium of AI industry leaders and provide the most comprehensive set of performance data available today, both for AI training and inference.

Performing well across the wide array of tests in this benchmark takes a full-stack platform with great ecosystem support, both for frameworks and networks. NVIDIA was the only company to make submissions for all data center and edge tests and to deliver the best performance on all of them. One of the great byproducts of this work is that many of the optimizations behind these results found their way into inference developer tools like TensorRT and Triton.

In this session, we will step through some of these optimizations, including the use of Triton Inference Server and the A100 Multi-Instance GPU feature.
These features are available to all data science practitioners who want to move their deep learning work to the next stage after spending significant time optimizing the training portion of their model.


Dr Gabriel Noaje is a Senior Solutions Architect at NVIDIA APAC South, specializing in HPC and DL. Gabriel has more than 15 years of experience in accelerator technologies and parallel computing.

Gabriel holds a PhD in Computer Sciences from the University of Reims Champagne-Ardenne, France.

Jul 08 2021


12:30 pm - 1:30 pm