NVIDIA TensorRT
NVIDIA® TensorRT™ is an ecosystem of tools for developers to achieve high-performance deep learning inference. TensorRT includes inference compilers, runtimes, and model optimizations that deliver low latency and high throughput for production applications. The TensorRT ecosystem includes the TensorRT compiler, TensorRT-LLM, TensorRT Model Optimizer, and TensorRT Cloud.
How TensorRT Works
TensorRT speeds up inference by up to 36X compared to CPU-only platforms.
Built on the NVIDIA® CUDA® parallel programming model, TensorRT includes libraries that optimize neural network models trained on all major frameworks, calibrate them for lower precision with high accuracy, and deploy them to hyperscale data centers, workstations, laptops, and edge devices. TensorRT optimizes inference using techniques such as quantization, layer and tensor fusion, and kernel tuning.
TensorRT provides post-training quantization and supports models trained with quantization-aware training, enabling FP8, FP4, and integer formats for deep learning inference. Reduced-precision inference significantly reduces latency, which is required for many real-time services as well as autonomous and embedded applications.
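As a sketch of this workflow, the snippet below uses the TensorRT Python API (assuming TensorRT 10.x and a hypothetical model.onnx) to parse an ONNX model, enable FP16 kernels where the hardware supports them, and serialize an optimized engine.

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()              # explicit-batch network (TensorRT 10 default)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:             # hypothetical ONNX model file
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)       # allow reduced-precision kernels

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)                       # serialized engine for the TensorRT runtime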

Read the Introductory TensorRT Blog
Learn how to apply TensorRT optimizations and deploy a PyTorch model to GPUs.
Watch On-Demand TensorRT Sessions From GTC
Learn more about TensorRT and its features from a curated list of webinars at GTC.
Get the Complete Developer Guide
See how to get started with TensorRT in this step-by-step developer and API reference guide.
Navigate AI Infrastructure and Performance
Learn how to lower your cost per token and get the most out of your AI models with our ebook.
Key Features
Large Language Model Inference
NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes inference performance of large language models (LLMs) on the NVIDIA AI platform with a simplified Python API.
Developers accelerate LLM performance on NVIDIA GPUs in the data center or on workstation GPUs—including NVIDIA RTX™ systems on native Windows—with the same seamless workflow.
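As a rough sketch, the high-level LLM API looks like the following; the model name, prompt, and sampling settings are illustrative, and supported models are listed in the TensorRT-LLM documentation.

from tensorrt_llm import LLM, SamplingParams

# Build or load an optimized engine for a Hugging Face model (illustrative checkpoint).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(max_tokens=64, temperature=0.8)

# Generate completions for a batch of prompts.
outputs = llm.generate(["What does TensorRT-LLM do?"], params)
for out in outputs:
    print(out.outputs[0].text)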
Compile in the Cloud
NVIDIA TensorRT Cloud is a developer-focused service for generating hyper-optimized engines for given constraints and KPIs. Given an LLM and inference throughput/latency requirements, a developer can invoke TensorRT Cloud service using a command-line interface to hyper-optimize a TensorRT-LLM engine for a target GPU. The cloud service will automatically determine the best engine configuration that meets the requirements. Developers can also use the service to build optimized TensorRT engines from ONNX models on a variety of NVIDIA RTX, GeForce, Quadro®, or Tesla®-class GPUs. TensorRT Cloud is available in limited access to select partners. Apply for access, subject to approval.
Optimize Neural Networks
NVIDIA TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques, including quantization, sparsity, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM and TensorRT to efficiently optimize inference on NVIDIA GPUs.
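The sketch below shows post-training quantization with Model Optimizer's PyTorch quantization module (pip package nvidia-modelopt); the model and calibration loop are stand-ins for a real network and representative data.

import torch
import modelopt.torch.quantization as mtq

# Stand-in for a trained PyTorch model.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)

def forward_loop(m):
    # Feed a small amount of representative data through the model for calibration.
    for _ in range(8):
        m(torch.randn(4, 128))

# Apply an INT8 post-training quantization recipe; FP8 and other configs are also available.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

The quantized model can then be exported and deployed through downstream frameworks such as TensorRT or TensorRT-LLM.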
Major Framework Integrations
TensorRT integrates directly into PyTorch and Hugging Face to achieve 6X faster inference with a single line of code. TensorRT provides an ONNX parser to import ONNX models from popular frameworks into TensorRT. MATLAB is integrated with TensorRT through GPU Coder to automatically generate high-performance inference engines for NVIDIA Jetson™, NVIDIA DRIVE®, and data center platforms.
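As a minimal sketch of the PyTorch integration, Torch-TensorRT compiles a module with one call; the toy model, input shape, and precision setting here are illustrative.

import torch
import torch_tensorrt

# Stand-in for a trained model; a real workflow would load pretrained weights.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU()
).eval().cuda()

example_inputs = [torch.randn(1, 3, 224, 224, device="cuda")]

# Compile with the TensorRT backend, allowing FP16 kernels where supported.
trt_model = torch_tensorrt.compile(
    model,
    inputs=example_inputs,
    enabled_precisions={torch.half},
)

with torch.no_grad():
    print(trt_model(*example_inputs).shape)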
Deploy, Run, and Scale With Triton
TensorRT-optimized models are deployed, run, and scaled with NVIDIA Triton™ inference-serving software that includes TensorRT as a backend. The advantages of using Triton include high throughput with dynamic batching, concurrent model execution, model ensembling, and streaming audio and video inputs.
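As a rough illustration, a Triton model configuration (config.pbtxt) for a serialized TensorRT engine placed in the model repository might look like the following; the model name and batching settings are placeholders.

# config.pbtxt (hypothetical): serve a TensorRT engine with dynamic batching.
name: "my_tensorrt_model"
platform: "tensorrt_plan"             # selects Triton's TensorRT backend
max_batch_size: 16
dynamic_batching {
  max_queue_delay_microseconds: 100   # wait briefly to form larger batches
}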
Accelerate Every Inference Platform
TensorRT can optimize models for applications across the edge, laptops and desktops, and data centers. It powers key NVIDIA solutions—such as NVIDIA TAO, NVIDIA DRIVE, NVIDIA Clara™, and NVIDIA JetPack™—and is integrated with application-specific SDKs, such as NVIDIA NIM™, NVIDIA DeepStream, NVIDIA® Riva, NVIDIA Merlin™, NVIDIA Maxine™, NVIDIA Morpheus, and NVIDIA Broadcast Engine.
TensorRT provides developers a unified path to deploy intelligent video analytics, speech AI, recommender systems, video conferencing, AI-based cybersecurity, and streaming apps in production.
Get Started With TensorRT
TensorRT is an ecosystem of APIs for high-performance deep learning inference.
Download TensorRT
The TensorRT inference library provides a general-purpose AI compiler and an inference runtime that delivers low latency and high throughput for production applications.
Download TensorRT Model Optimizer
TensorRT Model Optimizer is available for free on NVIDIA PyPI, with examples and recipes on GitHub.
Get Started With TensorRT Frameworks
TensorRT Frameworks add TensorRT compiler functionality to frameworks like PyTorch.
Download ONNX and Torch-TensorRT
Torch-TensorRT brings the TensorRT compiler to PyTorch workflows, and the TensorRT ONNX parser imports models exported in the ONNX format from other frameworks.
ONNX:
Torch-TensorRT:

Experience Tripy: Pythonic Inference With TensorRT
Experience high-performance inference and excellent usability with Tripy. Expect intuitive APIs, easy debugging with eager mode, clear error messages, and top-notch documentation to streamline your deep learning deployment.
Deploy
Get a free license to try NVIDIA AI Enterprise in production for 90 days using your existing infrastructure.
World-Leading Inference Performance
TensorRT was behind NVIDIA’s wins across all inference performance tests in the industry-standard MLPerf Inference benchmark. TensorRT-LLM accelerates the latest large language models for generative AI, delivering up to 8X more performance, 5.3X better TCO, and nearly 6X lower energy consumption.
See All Benchmarks
8X Increase in GPT-J 6B Inference Performance
4X Higher Llama2 Inference Performance
Total Cost of Ownership
Energy Use

Starter Kits
Beginner Guide to TensorRT
Watch Video: Getting Started With NVIDIA TensorRT
Beginner Guide to TensorRT-LLM
Watch Video: Getting Started With NVIDIA TensorRT-LLM
Beginner Guide to TensorRT Model Optimizer
Beginner Guide to Torch-TensorRT
Watch Video: Getting Started With NVIDIA Torch-TensorRT
Read Blog: Accelerate Inference up to 6X in PyTorch
Download Notebook: Object Detection With SSD (Jupyter Notebook)
Beginner Guide to TensorRT Pythonic Frontend: Tripy
TensorRT Ecosystem
Widely Adopted Across Industries

More Resources
Ethical AI
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.
Get started with TensorRT today, and use the right inference tools to develop AI for any application on any platform.