AI Inference
Inference can be deployed in many ways, depending on the use case. Offline data processing is best done at larger batch sizes, which deliver optimal GPU utilization and throughput. However, increasing throughput also tends to increase latency. Generative AI and large language model (LLM) deployments aim to deliver great experiences by keeping latency low, so developers and infrastructure managers must balance throughput against latency to provide responsive service and the best possible throughput while containing deployment costs.
When deploying LLMs at scale, a typical way to balance these concerns is to set a time-to-first-token (TTFT) limit and optimize throughput within that limit. The data presented in the Large Language Model Low Latency section show the best throughput at a time limit of one second, which delivers low latency for most users while making efficient use of compute resources.
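The selection logic described above can be sketched in a few lines: among configurations whose measured TTFT fits the budget, pick the one with the highest throughput. The measurements below are hypothetical, for illustration only.

```python
# Sketch: pick the largest batch size whose measured time-to-first-token (TTFT)
# stays within a latency budget, then report its throughput.
# All numbers here are hypothetical illustration data, not benchmark results.

TTFT_BUDGET_S = 1.0  # one-second time-to-first-token limit

# (batch_size, measured_ttft_seconds, measured_tokens_per_second) - hypothetical
measurements = [
    (1, 0.12, 450),
    (8, 0.35, 2900),
    (32, 0.80, 8100),
    (128, 2.10, 13500),  # highest throughput, but over the TTFT budget
]

def best_config(measurements, budget_s):
    """Return the config with the highest throughput whose TTFT fits the budget."""
    within = [m for m in measurements if m[1] <= budget_s]
    return max(within, key=lambda m: m[2]) if within else None

batch, ttft, tput = best_config(measurements, TTFT_BUDGET_S)
print(f"batch={batch}, TTFT={ttft}s, throughput={tput} tokens/sec")
```

In this sketch the batch-128 configuration is rejected despite its higher throughput, because its first token arrives too late.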
Click here to view other performance data.
MLPerf Inference v4.1 Performance Benchmarks
Offline Scenario, Closed Division
Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset |
---|---|---|---|---|---|---|
Llama2 70B | 11,264 tokens/sec | 1x B200 | NVIDIA B200 | NVIDIA B200-SXM-180GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca |
Llama2 70B | 34,864 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca |
Llama2 70B | 24,525 tokens/sec | 8x H100 | NVIDIA DGX H100 | NVIDIA H100-SXM-80GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca |
Llama2 70B | 4,068 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca |
Mixtral 8x7B | 59,335 tokens/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | OpenOrca, GSM8K, MBXP |
Mixtral 8x7B | 52,818 tokens/sec | 8x H100 | SMC H100 | NVIDIA H100-SXM-80GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | OpenOrca, GSM8K, MBXP |
Mixtral 8x7B | 8,021 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | OpenOrca, GSM8K, MBXP |
Stable Diffusion XL | 18 samples/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
Stable Diffusion XL | 16 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
Stable Diffusion XL | 2.3 samples/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
ResNet-50 | 768,235 samples/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | 76.46% Top1 | ImageNet (224x224) |
ResNet-50 | 710,521 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 76.46% Top1 | ImageNet (224x224) |
ResNet-50 | 95,105 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 76.46% Top1 | ImageNet (224x224) |
RetinaNet | 15,015 samples/sec | 8x H200 | ThinkSystem SR685a V3 | NVIDIA H200-SXM-141GB | 0.3755 mAP | OpenImages (800x800) |
RetinaNet | 14,538 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 0.3755 mAP | OpenImages (800x800) |
RetinaNet | 1,923 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 0.3755 mAP | OpenImages (800x800) |
BERT | 73,791 samples/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | 90.87% f1 | SQuAD v1.1 |
BERT | 72,876 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 90.87% f1 | SQuAD v1.1 |
BERT | 9,864 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 90.87% f1 | SQuAD v1.1 |
GPT-J | 20,552 tokens/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail |
GPT-J | 19,878 tokens/sec | 8x H100 | ESC-N8-E11 | NVIDIA H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail |
GPT-J | 2,804 tokens/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail |
DLRMv2 | 639,512 samples/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 80.31% AUC | Synthetic Multihot Criteo Dataset |
DLRMv2 | 602,108 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 80.31% AUC | Synthetic Multihot Criteo Dataset |
DLRMv2 | 86,731 samples/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | 80.31% AUC | Synthetic Multihot Criteo Dataset |
3D-UNET | 55 samples/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | 0.863 DICE mean | KiTS 2019 |
3D-UNET | 52 samples/sec | 8x H100 | AS-4125GS-TNHR2-LCC | NVIDIA H100-SXM-80GB | 0.863 DICE mean | KiTS 2019 |
3D-UNET | 7 samples/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | 0.863 DICE mean | KiTS 2019 |
Server Scenario - Closed Division
Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
---|---|---|---|---|---|---|---|
Llama2 70B | 10,756 tokens/sec | 1x B200 | NVIDIA B200 | NVIDIA B200-SXM-180GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca |
Llama2 70B | 32,790 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca |
Llama2 70B | 23,700 tokens/sec | 8x H100 | AS-4125GS-TNHR2-LCC | NVIDIA H100-SXM-80GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca |
Llama2 70B | 3,884 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca |
Mixtral 8x7B | 57,177 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca, GSM8K, MBXP |
Mixtral 8x7B | 51,028 tokens/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca, GSM8K, MBXP |
Mixtral 8x7B | 7,450 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca, GSM8K, MBXP |
Stable Diffusion XL | 17 samples/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
Stable Diffusion XL | 16 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
Stable Diffusion XL | 2.02 samples/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
ResNet-50 | 681,328 queries/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 76.46% Top1 | 15 ms | ImageNet (224x224) |
ResNet-50 | 634,193 queries/sec | 8x H100 | SYS-821GE-TNHR | NVIDIA H100-SXM-80GB | 76.46% Top1 | 15 ms | ImageNet (224x224) |
ResNet-50 | 77,012 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 76.46% Top1 | 15 ms | ImageNet (224x224) |
RetinaNet | 14,012 queries/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 0.3755 mAP | 100 ms | OpenImages (800x800) |
RetinaNet | 13,979 queries/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 0.3755 mAP | 100 ms | OpenImages (800x800) |
RetinaNet | 1,731 queries/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | 0.3755 mAP | 100 ms | OpenImages (800x800) |
BERT | 58,091 queries/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | 90.87% f1 | 130 ms | SQuAD v1.1 |
BERT | 58,929 queries/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 90.87% f1 | 130 ms | SQuAD v1.1 |
BERT | 7,103 queries/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | 90.87% f1 | 130 ms | SQuAD v1.1 |
GPT-J | 20,139 queries/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail |
GPT-J | 19,811 queries/sec | 8x H100 | AS-4125GS-TNHR2-LCC | NVIDIA H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail |
GPT-J | 2,513 queries/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail |
DLRMv2 | 585,209 queries/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset |
DLRMv2 | 556,101 queries/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset |
DLRMv2 | 81,010 queries/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset |
Power Efficiency Offline Scenario - Closed Division
Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
---|---|---|---|---|---|---|
Llama2 70B | 25,262 tokens/sec | 4 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca |
Mixtral 8x7B | 48,988 tokens/sec | 8 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca, GSM8K, MBXP |
Stable Diffusion XL | 13 samples/sec | 0.002 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Subset of coco-2014 val |
ResNet-50 | 556,234 samples/sec | 112 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | ImageNet (224x224) |
RetinaNet | 10,803 samples/sec | 2 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenImages (800x800) |
BERT | 54,063 samples/sec | 10 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | SQuAD v1.1 |
GPT-J | 13,097 samples/sec | 3 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | CNN Dailymail |
DLRMv2 | 503,719 samples/sec | 84 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Synthetic Multihot Criteo Dataset |
3D-UNET | 42 samples/sec | 0.009 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | KiTS 2019 |
Power Efficiency Server Scenario - Closed Division
Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
---|---|---|---|---|---|---|
Llama2 70B | 23,113 tokens/sec | 4 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca |
Mixtral 8x7B | 45,497 tokens/sec | 7 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca, GSM8K, MBXP |
Stable Diffusion XL | 13 queries/sec | 0.002 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Subset of coco-2014 val |
ResNet-50 | 480,131 queries/sec | 96 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | ImageNet (224x224) |
RetinaNet | 9,603 queries/sec | 2 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenImages (800x800) |
BERT | 41,599 queries/sec | 8 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | SQuAD v1.1 |
GPT-J | 11,701 queries/sec | 2 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | CNN Dailymail |
DLRMv2 | 420,107 queries/sec | 69 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Synthetic Multihot Criteo Dataset |
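The two efficiency columns above are related by simple arithmetic: throughput per watt is throughput divided by average system power, so the implied power draw can be back-computed. A minimal sketch (note the published per-watt figures are rounded, so these wattages are only approximate):

```python
# Sketch: relate the Throughput and Throughput-per-Watt columns.
# throughput_per_watt = throughput / average_power, therefore:
def implied_power_watts(throughput, throughput_per_watt):
    """Approximate average system power implied by the two table columns."""
    return throughput / throughput_per_watt

# Llama2 70B, server scenario, 8x H200 (figures from the table above).
# Rounded inputs make this a rough estimate, not a measured value.
power = implied_power_watts(23_113, 4)
print(f"approx. system power: {power:,.0f} W")
```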
MLPerf™ v4.1 Inference Closed: Llama2 70B 99.9% of FP32, Mixtral 8x7B 99% of FP32 and 99.9% of FP32, Stable Diffusion XL, ResNet-50 v1.5, RetinaNet, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net 99.9% of FP32 accuracy target, GPT-J 99.9% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 4.1-0005, 4.1-0021, 4.1-0027, 4.1-0037, 4.1-0038, 4.1-0043, 4.1-0044, 4.1-0046, 4.1-0048, 4.1-0049, 4.1-0053, 4.1-0057, 4.1-0060, 4.1-0063, 4.1-0064, 4.1-0065, 4.1-0074. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
NVIDIA B200 is a preview submission.
Llama2 70B Max Sequence Length = 1,024.
Mixtral 8x7B Max Sequence Length = 2,048.
BERT-Large Max Sequence Length = 384.
For MLPerf™ data across various scenarios, click here
For MLPerf™ latency constraints, click here
LLM Inference Performance of NVIDIA Data Center Products
H200 Inference Performance - High Throughput
Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
Llama v3.1 405B | 1 | 8 | 128 | 128 | 3,874 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 128 | 2048 | 5,938 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 128 | 4096 | 5,168 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 405B | 8 | 1 | 2048 | 128 | 764 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14a | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 5000 | 500 | 669 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 500 | 2000 | 5,084 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 3,400 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 2,941 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 535 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 1 | 128 | 128 | 4,021 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 1 | 128 | 2048 | 4,166 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 2 | 128 | 4096 | 6,527 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 1 | 2048 | 128 | 466 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 1 | 5000 | 500 | 560 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 2 | 500 | 2000 | 6,848 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 1 | 1000 | 1000 | 2,823 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 2 | 2048 | 2048 | 4,184 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 2 | 20000 | 2000 | 641 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 128 | 128 | 29,526 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 25,399 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 17,371 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,794 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,988 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 21,021 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 17,538 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 11,969 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,804 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 128 | 128 | 31,938 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 128 | 2048 | 27,409 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 128 | 4096 | 18,505 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 2048 | 128 | 3,834 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 5000 | 500 | 4,042 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 500 | 2000 | 22,355 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 1000 | 1000 | 18,426 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 2048 | 2048 | 12,347 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 20000 | 2000 | 1,823 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 1 | 128 | 128 | 17,158 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 1 | 128 | 2048 | 15,095 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 2 | 128 | 4096 | 21,565 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 1 | 2048 | 128 | 2,010 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 1 | 5000 | 500 | 2,309 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 1 | 500 | 2000 | 12,105 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 1 | 1000 | 1000 | 10,371 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 2 | 2048 | 2048 | 14,018 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 2 | 20000 | 2000 | 2,227 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 128 | 128 | 25,179 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 128 | 2048 | 32,623 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 128 | 4096 | 25,753 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 2048 | 128 | 3,095 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 5000 | 500 | 4,209 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 500 | 2000 | 27,430 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 1000 | 1000 | 20,097 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 2048 | 2048 | 15,799 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 20000 | 2000 | 2,897 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200 |
TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read the Llama v3.1 405B blog
Output tokens/second on Llama v3.1 405B includes the time to generate the first token (tokens/sec = total generated tokens / total latency)
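The definition in the note above can be written out directly; the run below is a hypothetical example, not a measured result:

```python
# Throughput as defined above: total generated tokens divided by total latency,
# where total latency includes the time to generate the first token.
def output_tokens_per_sec(total_generated_tokens, total_latency_s):
    return total_generated_tokens / total_latency_s

# Hypothetical run: 1,000 requests x 128 output tokens, completed in 33 seconds.
print(f"{output_tokens_per_sec(1000 * 128, 33.0):,.0f} output tokens/sec")
```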
GH200 Inference Performance - High Throughput
Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,637 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Llama v3.1 70B | 1 | 4 | 128 | 2048 | 10,358 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
Llama v3.1 70B | 1 | 4 | 128 | 4096 | 6,628 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
Llama v3.1 70B | 1 | 1 | 2048 | 128 | 425 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Llama v3.1 70B | 1 | 1 | 5000 | 500 | 422 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Llama v3.1 70B | 1 | 4 | 500 | 2000 | 9,091 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
Llama v3.1 70B | 1 | 1 | 1000 | 1000 | 1,746 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Llama v3.1 70B | 1 | 4 | 2048 | 2048 | 4,865 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
Llama v3.1 70B | 1 | 4 | 20000 | 2000 | 959 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
Llama v3.1 8B | 1 | 1 | 128 | 128 | 29,853 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 21,770 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 14,190 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,844 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,933 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 17,137 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 16,483 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 10,266 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,560 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mistral 7B | 1 | 1 | 128 | 128 | 32,498 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mistral 7B | 1 | 1 | 128 | 2048 | 23,337 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mistral 7B | 1 | 1 | 128 | 4096 | 15,018 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mistral 7B | 1 | 1 | 2048 | 128 | 3,813 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mistral 7B | 1 | 1 | 5000 | 500 | 3,950 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mistral 7B | 1 | 1 | 500 | 2000 | 18,556 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mistral 7B | 1 | 1 | 1000 | 1000 | 17,252 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mistral 7B | 1 | 1 | 2048 | 2048 | 10,756 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mistral 7B | 1 | 1 | 20000 | 2000 | 1,601 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mixtral 8x7B | 1 | 1 | 128 | 128 | 16,859 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mixtral 8x7B | 1 | 1 | 128 | 2048 | 11,120 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mixtral 8x7B | 1 | 4 | 128 | 4096 | 30,066 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
Mixtral 8x7B | 1 | 1 | 2048 | 128 | 1,994 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mixtral 8x7B | 1 | 1 | 5000 | 500 | 2,078 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mixtral 8x7B | 1 | 1 | 500 | 2000 | 9,193 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mixtral 8x7B | 1 | 1 | 1000 | 1000 | 8,849 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mixtral 8x7B | 1 | 1 | 2048 | 2048 | 5,545 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
Mixtral 8x7B | 1 | 1 | 20000 | 2000 | 861 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
TP: Tensor Parallelism
PP: Pipeline Parallelism
H100 Inference Performance - High Throughput
Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,378 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
Llama v3.1 70B | 1 | 2 | 128 | 4096 | 3,897 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
Llama v3.1 70B | 1 | 2 | 2048 | 128 | 774 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB |
Llama v3.1 70B | 1 | 2 | 500 | 2000 | 4,973 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
Llama v3.1 70B | 1 | 2 | 1000 | 1000 | 4,391 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
Llama v3.1 70B | 1 | 2 | 2048 | 2048 | 2,898 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
Llama v3.1 70B | 1 | 4 | 20000 | 2000 | 920 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
Mixtral 8x7B | 1 | 1 | 128 | 128 | 15,962 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
Mixtral 8x7B | 1 | 2 | 128 | 2048 | 23,010 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB |
Mixtral 8x7B | 1 | 2 | 128 | 4096 | 14,237 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
Mixtral 8x7B | 1 | 1 | 2048 | 128 | 1,893 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
Mixtral 8x7B | 1 | 2 | 5000 | 500 | 3,646 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
Mixtral 8x7B | 1 | 2 | 500 | 2000 | 18,186 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.14.0 | H100-SXM5-80GB |
Mixtral 8x7B | 1 | 2 | 1000 | 1000 | 15,932 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.14.0 | H100-SXM5-80GB |
Mixtral 8x7B | 1 | 2 | 2048 | 2048 | 10,686 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
Mixtral 8x7B | 1 | 2 | 20000 | 2000 | 1,757 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
TP: Tensor Parallelism
PP: Pipeline Parallelism
L40S Inference Performance - High Throughput
Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
Llama v3.1 8B | 1 | 1 | 128 | 128 | 9,105 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 5,366 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 3,026 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 1,067 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 981 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 4,274 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 4,055 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 2,225 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 328 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Mixtral 8x7B | 4 | 1 | 128 | 128 | 15,278 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
Mixtral 8x7B | 2 | 2 | 128 | 2048 | 9,087 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
Mixtral 8x7B | 1 | 4 | 128 | 4096 | 5,736 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Mixtral 8x7B | 4 | 1 | 2048 | 128 | 2,098 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
Mixtral 8x7B | 2 | 2 | 5000 | 500 | 1,558 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
Mixtral 8x7B | 2 | 2 | 500 | 2000 | 7,974 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
Mixtral 8x7B | 2 | 2 | 1000 | 1000 | 6,579 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
Mixtral 8x7B | 2 | 2 | 2048 | 2048 | 4,217 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
TP: Tensor Parallelism
PP: Pipeline Parallelism
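The PP and TP columns in these tables compose multiplicatively: the GPU count for each row equals pipeline parallelism times tensor parallelism. A minimal sketch of that relationship:

```python
# Sketch: GPU count for a row = pipeline parallelism (PP) x tensor parallelism (TP).
# Examples mirror rows in the tables above (e.g., Mixtral 8x7B at PP=2, TP=2 on 4x L40S).
def gpus_required(pp: int, tp: int) -> int:
    """Total GPUs needed for a given pipeline/tensor parallelism configuration."""
    return pp * tp

print(gpus_required(2, 2))  # PP=2, TP=2 -> 4 GPUs
print(gpus_required(1, 8))  # single pipeline stage, TP=8 -> 8 GPUs
```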
Inference Performance of NVIDIA Data Center Products
H200 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
Stable Diffusion v2.1 (512x512) | 1 | 4.33 images/sec | - | 231.26 | 1x H200 | DGX H200 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | NVIDIA H200 |
Stable Diffusion v2.1 (512x512) | 4 | 6.8 images/sec | - | 588.08 | 1x H200 | DGX H200 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | NVIDIA H200 |
Stable Diffusion XL | 1 | 0.86 images/sec | - | 1157.27 | 1x H200 | DGX H200 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA H200 |
ResNet-50v1.5 | 8 | 20,801 images/sec | 62 images/sec/watt | 0.38 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
ResNet-50v1.5 | 128 | 65,045 images/sec | 107 images/sec/watt | 1.97 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
EfficientNet-B0 | 8 | 16,769 images/sec | 77 images/sec/watt | 0.48 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
128 | 56,981 images/sec | 122 images/sec/watt | 2.25 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 | |
EfficientNet-B4 | 8 | 4,507 images/sec | 14 images/sec/watt | 1.78 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
128 | 8,991 images/sec | 15 images/sec/watt | 14.24 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 | |
HF Swin Base | 8 | 5,090 samples/sec | 11 samples/sec/watt | 1.57 | 1x H200 | DGX H200 | 25.01-py3 | Mixed | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
32 | 8,204 samples/sec | 12 samples/sec/watt | 3.9 | 1x H200 | DGX H200 | 25.01-py3 | Mixed | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 | |
HF Swin Large | 8 | 3,382 samples/sec | 6 samples/sec/watt | 2.37 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
32 | 4,676 samples/sec | 7 samples/sec/watt | 6.84 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 | |
HF ViT Base | 8 | 9,006 samples/sec | 19 samples/sec/watt | 0.89 | 1x H200 | DGX H200 | 25.01-py3 | FP8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
64 | 15,640 samples/sec | 23 samples/sec/watt | 4.09 | 1x H200 | DGX H200 | 25.01-py3 | FP8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 | |
HF ViT Large | 8 | 3,439 samples/sec | 6 samples/sec/watt | 2.33 | 1x H200 | DGX H200 | 25.01-py3 | FP8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
64 | 5,471 samples/sec | 8 samples/sec/watt | 11.7 | 1x H200 | DGX H200 | 25.01-py3 | FP8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 | |
QuartzNet | 8 | 6,741 samples/sec | 25 samples/sec/watt | 1.19 | 1x H200 | DGX H200 | 25.01-py3 | Mixed | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
128 | 34,280 samples/sec | 92 samples/sec/watt | 3.73 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 | |
RetinaNet-RN34 | 8 | 3,015 images/sec | 8 images/sec/watt | 2.65 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
512x512 image size, 50 denoising steps for Stable Diffusion v2.1
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
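The batch size, throughput, and latency columns are linked: for a saturated engine, the time to process one batch is approximately batch size divided by throughput. A hedged sketch checking this against the H200 ResNet-50v1.5 row (batch 128 at 65,045 images/sec, reported latency 1.97 ms):

```python
# Approximate per-batch latency from throughput; this is a sanity-check
# relationship, not how the published latencies were measured.

def batch_latency_ms(batch_size: int, images_per_sec: float) -> float:
    """Approximate time to drain one batch, in milliseconds."""
    return batch_size / images_per_sec * 1000.0

# H200 ResNet-50v1.5, BS=128: 65,045 images/sec.
print(round(batch_latency_ms(128, 65_045), 2))  # -> 1.97
```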
GH200 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
Stable Diffusion v2.1 (512x512) | 1 | 4.27 images/sec | - | 234.4 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB |
4 | 5.82 images/sec | - | 687.91 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB | |
Stable Diffusion XL | 1 | 0.68 images/sec | - | 1149.44 | 1x GH200 | NVIDIA P3880 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | GH200 96GB |
ResNet-50v1.5 | 8 | 21,533 images/sec | 63 images/sec/watt | 0.37 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
128 | 63,043 images/sec | 99 images/sec/watt | 2.03 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB | |
EfficientNet-B0 | 8 | 16,695 images/sec | 67 images/sec/watt | 0.48 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
128 | 56,674 images/sec | 113 images/sec/watt | 2.26 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB | |
EfficientNet-B4 | 8 | 4,531 images/sec | 13 images/sec/watt | 1.77 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
128 | 8,784 images/sec | 14 images/sec/watt | 14.57 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB | |
HF Swin Base | 8 | 5,106 samples/sec | 10 samples/sec/watt | 1.57 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
32 | 8,197 samples/sec | 12 samples/sec/watt | 3.9 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB | |
HF Swin Large | 8 | 3,403 samples/sec | 6 samples/sec/watt | 2.35 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
32 | 4,846 samples/sec | 6 samples/sec/watt | 6.6 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | GH200 96GB | |
HF ViT Base | 8 | 8,990 samples/sec | 18 samples/sec/watt | 0.89 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
64 | 15,562 samples/sec | 21 samples/sec/watt | 4.11 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB | |
HF ViT Large | 8 | 3,707 samples/sec | 6 samples/sec/watt | 2.16 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
64 | 5,703 samples/sec | 7 samples/sec/watt | 11.22 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB | |
QuartzNet | 8 | 6,688 samples/sec | 22 samples/sec/watt | 1.2 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
128 | 34,272 samples/sec | 85 samples/sec/watt | 3.73 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB | |
RetinaNet-RN34 | 8 | 2,945 images/sec | 4 images/sec/watt | 2.72 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
512x512 image size, 50 denoising steps for Stable Diffusion v2.1
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
H100 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
Stable Diffusion v2.1 (512x512) | 1 | 4.22 images/sec | - | 236.8 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | H100 SXM5-80GB |
4 | 6.41 images/sec | - | 624.6 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | H100 SXM5-80GB | |
Stable Diffusion XL | 1 | 0.83 images/sec | - | 1210.08 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100 SXM5-80GB |
ResNet-50v1.5 | 8 | 21,588 images/sec | 63 images/sec/watt | 0.37 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
128 | 59,535 images/sec | 99 images/sec/watt | 2.15 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB | |
EfficientNet-B0 | 8 | 16,351 images/sec | 67 images/sec/watt | 0.49 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
128 | 55,498 images/sec | 116 images/sec/watt | 2.31 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB | |
EfficientNet-B4 | 8 | 4,550 images/sec | 12 images/sec/watt | 1.76 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
128 | 8,144 images/sec | 15 images/sec/watt | 15.72 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB | |
HF Swin Base | 8 | 5,072 samples/sec | 9 samples/sec/watt | 1.58 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
32 | 7,706 samples/sec | 11 samples/sec/watt | 4.15 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB | |
HF Swin Large | 8 | 3,299 samples/sec | 6 samples/sec/watt | 2.42 | 1x H100 | DGX H100 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
32 | 4,463 samples/sec | 7 samples/sec/watt | 7.17 | 1x H100 | DGX H100 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB | |
HF ViT Base | 8 | 9,078 samples/sec | 17 samples/sec/watt | 0.88 | 1x H100 | DGX H100 | 25.02-py3 | FP8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
64 | 15,210 samples/sec | 22 samples/sec/watt | 4.21 | 1x H100 | DGX H100 | 25.02-py3 | FP8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB | |
HF ViT Large | 8 | 3,440 samples/sec | 6 samples/sec/watt | 2.33 | 1x H100 | DGX H100 | 25.02-py3 | FP8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
64 | 5,363 samples/sec | 8 samples/sec/watt | 11.93 | 1x H100 | DGX H100 | 25.02-py3 | FP8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB | |
QuartzNet | 8 | 6,767 samples/sec | 22 samples/sec/watt | 1.18 | 1x H100 | DGX H100 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
128 | 35,389 samples/sec | 77 samples/sec/watt | 3.62 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB | |
RetinaNet-RN34 | 8 | 2,827 images/sec | 8 images/sec/watt | 2.83 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
512x512 image size, 50 denoising steps for Stable Diffusion v2.1
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
L40S Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
Stable Diffusion v2.1 (512x512) | 1 | 2.49 images/sec | - | 401.48 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L40S |
4 | 2.91 images/sec | - | 1372.72 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L40S | |
Stable Diffusion XL | 1 | 0.37 images/sec | - | 2678.19 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L40S |
ResNet-50v1.5 | 8 | 23,472 images/sec | 78 images/sec/watt | 0.34 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
32 | 37,069 images/sec | 109 images/sec/watt | 0.86 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S | |
BERT-BASE | 8 | 8,412 sequences/sec | 26 sequences/sec/watt | 0.95 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
128 | 13,169 sequences/sec | 38 sequences/sec/watt | 9.72 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S | |
BERT-LARGE | 8 | 3,188 sequences/sec | 10 sequences/sec/watt | 2.51 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
24 | 4,034 sequences/sec | 12 sequences/sec/watt | 31.73 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S | |
EfficientDet-D0 | 8 | 4,696 images/sec | 17 images/sec/watt | 1.7 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0.26 | NVIDIA L40S |
EfficientNet-B0 | 8 | 20,534 images/sec | 106 images/sec/watt | 0.39 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
32 | 41,526 images/sec | 140 images/sec/watt | 0.77 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L40S | |
EfficientNet-B4 | 8 | 5,149 images/sec | 17 images/sec/watt | 1.55 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
16 | 6,116 images/sec | 18 images/sec/watt | 2.62 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S | |
HF Swin Base | 8 | 3,843 samples/sec | 11 samples/sec/watt | 2.08 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0.23 | NVIDIA L40S |
16 | 4,266 samples/sec | 12 samples/sec/watt | 7.5 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0.26 | NVIDIA L40S | |
HF Swin Large | 8 | 1,932 samples/sec | 6 samples/sec/watt | 4.14 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | Mixed | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
16 | 2,141 samples/sec | 6 samples/sec/watt | 7.47 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S | |
HF ViT Base | 8 | 5,799 samples/sec | 17 samples/sec/watt | 1.38 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | FP8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
HF ViT Large | 8 | 1,926 samples/sec | 6 samples/sec/watt | 4.15 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | FP8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
Megatron BERT Large QAT | 8 | 4,213 sequences/sec | 13 sequences/sec/watt | 1.9 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
24 | 5,097 sequences/sec | 15 sequences/sec/watt | 4.71 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S | |
QuartzNet | 8 | 7,643 samples/sec | 32 samples/sec/watt | 1.05 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
128 | 22,595 samples/sec | 65 samples/sec/watt | 5.66 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0.23 | NVIDIA L40S | |
RetinaNet-RN34 | 8 | 1,463 images/sec | 7 images/sec/watt | 5.47 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0.23 | NVIDIA L40S |
512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
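Because the Efficiency column is throughput per watt, dividing throughput by efficiency recovers an approximate sustained power draw. A sketch using the L40S ResNet-50v1.5 batch-32 row (37,069 images/sec at 109 images/sec/watt); note the published efficiency figures are rounded, so the result is an estimate only:

```python
# Back out approximate board power from the two published columns.
# This is an inference from rounded figures, not a measured value.

def implied_power_watts(throughput: float, eff_per_watt: float) -> float:
    """Approximate sustained power = throughput / (throughput per watt)."""
    return throughput / eff_per_watt

# L40S ResNet-50v1.5, BS=32: 37,069 images/sec at 109 images/sec/watt.
print(round(implied_power_watts(37_069, 109)))  # -> 340
```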
L4 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
Stable Diffusion v2.1 (512x512) | 1 | 0.82 images/sec | - | 1221.73 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
Stable Diffusion XL | 1 | 0.11 images/sec | - | 9098.4 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
ResNet-50v1.5 | 8 | 9,649 images/sec | 134 images/sec/watt | 0.83 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
32 | 10,101 images/sec | 111 images/sec/watt | 16.27 | 1x L4 | GIGABYTE G482-Z54-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 | |
BERT-BASE | 8 | 3,323 sequences/sec | 46 sequences/sec/watt | 2.41 | 1x L4 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
24 | 4,052 sequences/sec | 56 sequences/sec/watt | 5.92 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 | |
BERT-LARGE | 8 | 1,081 sequences/sec | 15 sequences/sec/watt | 7.4 | 1x L4 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
13 | 1,314 sequences/sec | 19 sequences/sec/watt | 9.9 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 | |
EfficientNet-B4 | 8 | 1,844 images/sec | 26 images/sec/watt | 4.34 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
HF Swin Base | 8 | 1,221 samples/sec | 17 samples/sec/watt | 6.55 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
HF Swin Large | 8 | 621 samples/sec | 9 samples/sec/watt | 12.89 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
HF ViT Base | 16 | 1,844 samples/sec | 26 samples/sec/watt | 4.34 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | FP8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
HF ViT Large | 8 | 617 samples/sec | 9 samples/sec/watt | 12.96 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | FP8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
Megatron BERT Large QAT | 24 | 1,789 sequences/sec | 25 sequences/sec/watt | 13.42 | 1x L4 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
QuartzNet | 8 | 3,886 samples/sec | 54 samples/sec/watt | 2.06 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
128 | 6,144 samples/sec | 85 samples/sec/watt | 20.83 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 | |
RetinaNet-RN34 | 8 | 355 images/sec | 5 images/sec/watt | 22.51 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
A40 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50v1.5 | 8 | 11,177 images/sec | 40 images/sec/watt | 0.72 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
128 | 15,473 images/sec | 52 images/sec/watt | 8.27 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 | |
BERT-BASE | 8 | 4,257 sequences/sec | 15 sequences/sec/watt | 1.88 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
128 | 5,667 sequences/sec | 19 sequences/sec/watt | 22.59 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 | |
BERT-LARGE | 8 | 1,573 sequences/sec | 5 sequences/sec/watt | 5.08 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
128 | 1,966 sequences/sec | 7 sequences/sec/watt | 65.11 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 | |
EfficientNet-B0 | 8 | 11,130 images/sec | 61 images/sec/watt | 0.72 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
128 | 20,078 images/sec | 67 images/sec/watt | 6.38 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 | |
EfficientNet-B4 | 8 | 2,145 images/sec | 8 images/sec/watt | 3.73 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
128 | 2,689 images/sec | 9 images/sec/watt | 47.59 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 | |
HF Swin Base | 8 | 1,697 samples/sec | 6 samples/sec/watt | 4.71 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
32 | 1,842 samples/sec | 6 samples/sec/watt | 17.38 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 | |
HF Swin Large | 8 | 959 samples/sec | 3 samples/sec/watt | 8.34 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
32 | 1,010 samples/sec | 3 samples/sec/watt | 31.68 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 | |
HF ViT Base | 8 | 2,175 samples/sec | 7 samples/sec/watt | 3.68 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
64 | 2,324 samples/sec | 8 samples/sec/watt | 27.54 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 | |
HF ViT Large | 8 | 694 samples/sec | 2 samples/sec/watt | 11.53 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
64 | 750 samples/sec | 2 samples/sec/watt | 85.34 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 | |
Megatron BERT Large QAT | 8 | 2,059 sequences/sec | 7 sequences/sec/watt | 3.89 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
128 | 2,650 sequences/sec | 9 sequences/sec/watt | 48.31 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 | |
QuartzNet | 8 | 4,388 samples/sec | 21 samples/sec/watt | 1.82 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
128 | 8,453 samples/sec | 28 samples/sec/watt | 15.14 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 | |
RetinaNet-RN34 | 8 | 706 images/sec | 2 images/sec/watt | 11.34 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
A30 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50v1.5 | 8 | 10,261 images/sec | 71 images/sec/watt | 0.78 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
128 | 16,465 images/sec | 101 images/sec/watt | 7.77 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 | |
BERT-BASE | 8 | 4,334 sequences/sec | 26 sequences/sec/watt | 1.85 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 |
128 | 5,820 sequences/sec | 35 sequences/sec/watt | 21.99 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 | |
BERT-LARGE | 8 | 1,500 sequences/sec | 10 sequences/sec/watt | 5.33 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 |
128 | 2,053 sequences/sec | 13 sequences/sec/watt | 62.34 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 | |
EfficientNet-B0 | 8 | 8,993 images/sec | 81 images/sec/watt | 0.89 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
128 | 17,119 images/sec | 105 images/sec/watt | 7.48 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 | |
EfficientNet-B4 | 8 | 1,875 images/sec | 13 images/sec/watt | 4.27 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
128 | 2,397 images/sec | 15 images/sec/watt | 53.4 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 | |
HF Swin Base | 8 | 1,646 samples/sec | 10 samples/sec/watt | 4.86 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
32 | 1,851 samples/sec | 11 samples/sec/watt | 17.28 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 | |
HF Swin Large | 8 | 907 samples/sec | 6 samples/sec/watt | 8.82 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
32 | 1,000 samples/sec | 6 samples/sec/watt | 32 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 | |
HF ViT Base | 8 | 2,058 samples/sec | 13 samples/sec/watt | 3.89 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
64 | 2,271 samples/sec | 14 samples/sec/watt | 28.18 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 | |
HF ViT Large | 8 | 675 samples/sec | 4 samples/sec/watt | 11.86 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
64 | 708 samples/sec | 4 samples/sec/watt | 90.34 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 | |
QuartzNet | 8 | 3,434 samples/sec | 29 samples/sec/watt | 2.33 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
128 | 9,997 samples/sec | 73 samples/sec/watt | 12.8 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 | |
RetinaNet-RN34 | 8 | 703 images/sec | 4 images/sec/watt | 11.39 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
A10 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50v1.5 | 8 | 8,499 images/sec | 57 images/sec/watt | 0.94 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
128 | 10,654 images/sec | 71 images/sec/watt | 12.01 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 | |
BERT-BASE | 8 | 3,109 sequences/sec | 21 sequences/sec/watt | 2.57 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A10 |
128 | 3,822 sequences/sec | 26 sequences/sec/watt | 33.49 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A10 | |
BERT-LARGE | 8 | 1,086 sequences/sec | 7 sequences/sec/watt | 7.36 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA A10 |
128 | 1,265 sequences/sec | 8 sequences/sec/watt | 101.17 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA A10 | |
EfficientNet-B0 | 8 | 9,679 images/sec | 65 images/sec/watt | 0.83 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
128 | 14,418 images/sec | 96 images/sec/watt | 8.88 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 | |
EfficientNet-B4 | 8 | 1,633 images/sec | 11 images/sec/watt | 4.9 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
128 | 1,863 images/sec | 12 images/sec/watt | 68.72 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 | |
HF Swin Base | 8 | 1,214 samples/sec | 8 samples/sec/watt | 6.59 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
32 | 1,258 samples/sec | 8 samples/sec/watt | 25.44 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 | |
HF Swin Large | 8 | 623 samples/sec | 4 samples/sec/watt | 12.84 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
32 | 656 samples/sec | 4 samples/sec/watt | 48.75 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 | |
HF ViT Base | 8 | 1,370 samples/sec | 9 samples/sec/watt | 5.84 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
64 | 1,503 samples/sec | 10 samples/sec/watt | 42.59 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 | |
HF ViT Large | 8 | 453 samples/sec | 3 samples/sec/watt | 17.68 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
Megatron BERT Large QAT | 8 | 1,566 sequences/sec | 10 sequences/sec/watt | 5.11 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
128 | 1,801 sequences/sec | 12 sequences/sec/watt | 71.06 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 | |
QuartzNet | 8 | 3,842 samples/sec | 26 samples/sec/watt | 2.08 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
128 | 5,867 samples/sec | 39 samples/sec/watt | 21.82 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 | |
RetinaNet-RN34 | 8 | 516 images/sec | 4 images/sec/watt | 15.5 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
Inference Performance of NVIDIA GPUs in the Cloud
A100 Inference Performance in the Cloud
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50v1.5 | 8 | 13,768 images/sec | - images/sec/watt | 0.58 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
128 | 30,338 images/sec | - images/sec/watt | 4.22 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB | |
BERT-LARGE | 8 | 2,308 sequences/sec | - sequences/sec/watt | 3.47 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
128 | 4,045 sequences/sec | - sequences/sec/watt | 31.64 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
BERT-Large: Sequence Length = 128
View More Performance Data
Training to Convergence
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.
Learn More
AI Pipeline
NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
Learn More