AI Inference

Inference can be deployed in many ways, depending on the use case. Offline processing of data is best done at large batch sizes, which deliver optimal GPU utilization and throughput. However, increasing throughput also tends to increase latency, and generative AI and large language model (LLM) deployments must keep latency low to deliver a good experience. Developers and infrastructure managers therefore need to strike a balance between throughput and latency that delivers a responsive user experience at the best possible throughput while containing deployment costs.
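The batching trade-off can be sketched with a toy cost model (the numbers below are illustrative, not measurements): a GPU step has a fixed overhead plus a per-sample cost, so larger batches amortize the overhead and raise throughput, while every request in the batch waits for the whole step, raising latency.

```python
# Hypothetical cost model: fixed per-step overhead + per-sample compute time.
def step_time_ms(batch_size, fixed_ms=5.0, per_sample_ms=0.5):
    """Time for one GPU step over `batch_size` samples (made-up constants)."""
    return fixed_ms + per_sample_ms * batch_size

def throughput_and_latency(batch_size):
    t = step_time_ms(batch_size)
    # samples/sec for the batch, and ms each request waits for its result
    return batch_size / (t / 1000.0), t

for bs in (1, 8, 64):
    thr, lat = throughput_and_latency(bs)
    print(f"batch={bs:3d}  throughput={thr:8.1f} samples/sec  latency={lat:5.1f} ms")
```

Under this model, batch 64 delivers roughly 10x the throughput of batch 1, but each request waits several times longer, which is the tension the rest of this page quantifies.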


When deploying LLMs at scale, a typical way to balance these concerns is to set a time-to-first-token (TTFT) limit and optimize throughput within that limit. The data presented in the Large Language Model Low Latency section shows the best throughput at a time limit of one second, which enables high throughput at low latency for most users while making efficient use of compute resources.
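The tuning policy described above can be expressed in a few lines: among candidate operating points, keep only those whose TTFT fits the budget, then take the highest-throughput survivor. The operating points below are made up for illustration; real deployments would sweep batch size (and other knobs) and measure each point.

```python
# Hypothetical (batch_size, tokens_per_sec, ttft_seconds) measurements.
operating_points = [
    (1,    450, 0.12),
    (8,   2600, 0.45),
    (32,  6800, 0.95),
    (128, 9900, 2.60),  # highest throughput, but blows the latency budget
]

def best_under_ttft(points, ttft_limit_s=1.0):
    """Highest-throughput point whose TTFT is within the limit, else None."""
    feasible = [p for p in points if p[2] <= ttft_limit_s]
    return max(feasible, key=lambda p: p[1]) if feasible else None

print(best_under_ttft(operating_points))  # -> (32, 6800, 0.95)
```

With a one-second TTFT limit, the batch-128 point is excluded despite its superior throughput, mirroring the trade-off the section describes.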



MLPerf Inference v4.1 Performance Benchmarks

Offline Scenario, Closed Division

| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset |
|---|---|---|---|---|---|---|
| Llama2 70B | 11,264 tokens/sec | 1x B200 | NVIDIA B200 | NVIDIA B200-SXM-180GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca |
| Llama2 70B | 34,864 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca |
| Llama2 70B | 24,525 tokens/sec | 8x H100 | NVIDIA DGX H100 | NVIDIA H100-SXM-80GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca |
| Llama2 70B | 4,068 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca |
| Mixtral 8x7B | 59,335 tokens/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | OpenOrca, GSM8K, MBXP |
| Mixtral 8x7B | 52,818 tokens/sec | 8x H100 | SMC H100 | NVIDIA H100-SXM-80GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | OpenOrca, GSM8K, MBXP |
| Mixtral 8x7B | 8,021 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | OpenOrca, GSM8K, MBXP |
| Stable Diffusion XL | 18 samples/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
| Stable Diffusion XL | 16 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
| Stable Diffusion XL | 2.3 samples/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
| ResNet-50 | 768,235 samples/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | 76.46% Top1 | ImageNet (224x224) |
| ResNet-50 | 710,521 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 76.46% Top1 | ImageNet (224x224) |
| ResNet-50 | 95,105 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 76.46% Top1 | ImageNet (224x224) |
| RetinaNet | 15,015 samples/sec | 8x H200 | ThinkSystem SR685a V3 | NVIDIA H200-SXM-141GB | 0.3755 mAP | OpenImages (800x800) |
| RetinaNet | 14,538 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 0.3755 mAP | OpenImages (800x800) |
| RetinaNet | 1,923 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 0.3755 mAP | OpenImages (800x800) |
| BERT | 73,791 samples/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | 90.87% f1 | SQuAD v1.1 |
| BERT | 72,876 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 90.87% f1 | SQuAD v1.1 |
| BERT | 9,864 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 90.87% f1 | SQuAD v1.1 |
| GPT-J | 20,552 tokens/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail |
| GPT-J | 19,878 tokens/sec | 8x H100 | ESC-N8-E11 | NVIDIA H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail |
| GPT-J | 2,804 tokens/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail |
| DLRMv2 | 639,512 samples/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 80.31% AUC | Synthetic Multihot Criteo Dataset |
| DLRMv2 | 602,108 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 80.31% AUC | Synthetic Multihot Criteo Dataset |
| DLRMv2 | 86,731 samples/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | 80.31% AUC | Synthetic Multihot Criteo Dataset |
| 3D-UNET | 55 samples/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | 0.863 DICE mean | KiTS 2019 |
| 3D-UNET | 52 samples/sec | 8x H100 | AS-4125GS-TNHR2-LCC | NVIDIA H100-SXM-80GB | 0.863 DICE mean | KiTS 2019 |
| 3D-UNET | 7 samples/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | 0.863 DICE mean | KiTS 2019 |

Server Scenario - Closed Division

| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints | Dataset |
|---|---|---|---|---|---|---|---|
| Llama2 70B | 10,756 tokens/sec | 1x B200 | NVIDIA B200 | NVIDIA B200-SXM-180GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca |
| Llama2 70B | 32,790 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca |
| Llama2 70B | 23,700 tokens/sec | 8x H100 | AS-4125GS-TNHR2-LCC | NVIDIA H100-SXM-80GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca |
| Llama2 70B | 3,884 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca |
| Mixtral 8x7B | 57,177 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca, GSM8K, MBXP |
| Mixtral 8x7B | 51,028 tokens/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca, GSM8K, MBXP |
| Mixtral 8x7B | 7,450 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca, GSM8K, MBXP |
| Stable Diffusion XL | 17 samples/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
| Stable Diffusion XL | 16 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
| Stable Diffusion XL | 2.02 samples/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
| ResNet-50 | 681,328 queries/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 76.46% Top1 | 15 ms | ImageNet (224x224) |
| ResNet-50 | 634,193 queries/sec | 8x H100 | SYS-821GE-TNHR | NVIDIA H100-SXM-80GB | 76.46% Top1 | 15 ms | ImageNet (224x224) |
| ResNet-50 | 77,012 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 76.46% Top1 | 15 ms | ImageNet (224x224) |
| RetinaNet | 14,012 queries/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 0.3755 mAP | 100 ms | OpenImages (800x800) |
| RetinaNet | 13,979 queries/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 0.3755 mAP | 100 ms | OpenImages (800x800) |
| RetinaNet | 1,731 queries/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | 0.3755 mAP | 100 ms | OpenImages (800x800) |
| BERT | 58,091 queries/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | 90.87% f1 | 130 ms | SQuAD v1.1 |
| BERT | 58,929 queries/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 90.87% f1 | 130 ms | SQuAD v1.1 |
| BERT | 7,103 queries/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | 90.87% f1 | 130 ms | SQuAD v1.1 |
| GPT-J | 20,139 queries/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail |
| GPT-J | 19,811 queries/sec | 8x H100 | AS-4125GS-TNHR2-LCC | NVIDIA H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail |
| GPT-J | 2,513 queries/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail |
| DLRMv2 | 585,209 queries/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset |
| DLRMv2 | 556,101 queries/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset |
| DLRMv2 | 81,010 queries/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset |

Power Efficiency Offline Scenario - Closed Division

| Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
|---|---|---|---|---|---|---|
| Llama2 70B | 25,262 tokens/sec | 4 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca |
| Mixtral 8x7B | 48,988 tokens/sec | 8 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca, GSM8K, MBXP |
| Stable Diffusion XL | 13 samples/sec | 0.002 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Subset of coco-2014 val |
| ResNet-50 | 556,234 samples/sec | 112 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | ImageNet (224x224) |
| RetinaNet | 10,803 samples/sec | 2 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenImages (800x800) |
| BERT | 54,063 samples/sec | 10 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | SQuAD v1.1 |
| GPT-J | 13,097 samples/sec | 3 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | CNN Dailymail |
| DLRMv2 | 503,719 samples/sec | 84 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Synthetic Multihot Criteo Dataset |
| 3D-UNET | 42 samples/sec | 0.009 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | KiTS 2019 |

Power Efficiency Server Scenario - Closed Division

| Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
|---|---|---|---|---|---|---|
| Llama2 70B | 23,113 tokens/sec | 4 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca |
| Mixtral 8x7B | 45,497 tokens/sec | 7 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca, GSM8K, MBXP |
| Stable Diffusion XL | 13 queries/sec | 0.002 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Subset of coco-2014 val |
| ResNet-50 | 480,131 queries/sec | 96 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | ImageNet (224x224) |
| RetinaNet | 9,603 queries/sec | 2 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenImages (800x800) |
| BERT | 41,599 queries/sec | 8 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | SQuAD v1.1 |
| GPT-J | 11,701 queries/sec | 2 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | CNN Dailymail |
| DLRMv2 | 420,107 queries/sec | 69 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Synthetic Multihot Criteo Dataset |

MLPerf™ v4.1 Inference Closed: Llama2 70B 99.9% of FP32, Mixtral 8x7B 99% of FP32 and 99.9% of FP32, Stable Diffusion XL, ResNet-50 v1.5, RetinaNet, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net 99.9% of FP32 accuracy target, GPT-J 99.9% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 4.1-0005, 4.1-0021, 4.1-0027, 4.1-0037, 4.1-0038, 4.1-0043, 4.1-0044, 4.1-0046, 4.1-0048, 4.1-0049, 4.1-0053, 4.1-0057, 4.1-0060, 4.1-0063, 4.1-0064, 4.1-0065, 4.1-0074. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
NVIDIA B200 is a preview submission.
Llama2 70B Max Sequence Length = 1,024.
Mixtral 8x7B Max Sequence Length = 2,048.
BERT-Large Max Sequence Length = 384.
Additional MLPerf™ scenario data and the full latency constraints are available from MLCommons (https://mlcommons.org/).

LLM Inference Performance of NVIDIA Data Center Products

H200 Inference Performance - High Throughput

| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.1 405B | 1 | 8 | 128 | 128 | 3,874 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 128 | 2048 | 5,938 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 128 | 4096 | 5,168 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 8 | 1 | 2048 | 128 | 764 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14a | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 5000 | 500 | 669 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 500 | 2000 | 5,084 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 3,400 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 2,941 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 535 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 128 | 128 | 4,021 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 128 | 2048 | 4,166 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 2 | 128 | 4096 | 6,527 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 2048 | 128 | 466 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 5000 | 500 | 560 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 2 | 500 | 2000 | 6,848 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 1000 | 1000 | 2,823 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 2 | 2048 | 2048 | 4,184 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 2 | 20000 | 2000 | 641 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 29,526 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 25,399 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 17,371 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,794 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,988 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 21,021 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 17,538 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 11,969 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,804 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 128 | 128 | 31,938 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 128 | 2048 | 27,409 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 128 | 4096 | 18,505 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 2048 | 128 | 3,834 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 5000 | 500 | 4,042 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 500 | 2000 | 22,355 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 1000 | 1000 | 18,426 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 2048 | 2048 | 12,347 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 20000 | 2000 | 1,823 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 128 | 128 | 17,158 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 128 | 2048 | 15,095 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 2 | 128 | 4096 | 21,565 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 2048 | 128 | 2,010 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 5000 | 500 | 2,309 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 500 | 2000 | 12,105 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 1000 | 1000 | 10,371 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 2 | 2048 | 2048 | 14,018 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 2 | 20000 | 2000 | 2,227 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 128 | 128 | 25,179 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 128 | 2048 | 32,623 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 128 | 4096 | 25,753 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 2048 | 128 | 3,095 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 5000 | 500 | 4,209 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 500 | 2000 | 27,430 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 1000 | 1000 | 20,097 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 2048 | 2048 | 15,799 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 20000 | 2000 | 2,897 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200 |

TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read Llama v3.1 405B Blog
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency)
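The footnote's formula can be written out directly: throughput counts every generated token over the full wall-clock time, including the time to the first token. The request counts and timings below are hypothetical, purely to show the arithmetic.

```python
def output_tokens_per_sec(total_generated_tokens, total_latency_s):
    """Throughput inclusive of time-to-first-token, per the footnote's formula."""
    return total_generated_tokens / total_latency_s

# e.g. 10 requests, each generating 2048 output tokens, completed in
# 5.12 s of total wall-clock time (including first-token latency):
print(output_tokens_per_sec(10 * 2048, 5.12))  # -> 4000.0
```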

GH200 Inference Performance - High Throughput

| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,637 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 128 | 2048 | 10,358 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 128 | 4096 | 6,628 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 1 | 2048 | 128 | 425 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 1 | 5000 | 500 | 422 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 500 | 2000 | 9,091 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 1 | 1000 | 1000 | 1,746 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 2048 | 2048 | 4,865 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 20000 | 2000 | 959 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 29,853 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 21,770 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 14,190 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,844 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,933 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 17,137 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 16,483 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 10,266 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,560 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 128 | 128 | 32,498 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 128 | 2048 | 23,337 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 128 | 4096 | 15,018 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 2048 | 128 | 3,813 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 5000 | 500 | 3,950 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 500 | 2000 | 18,556 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 1000 | 1000 | 17,252 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 2048 | 2048 | 10,756 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 20000 | 2000 | 1,601 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 128 | 128 | 16,859 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 128 | 2048 | 11,120 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 4 | 128 | 4096 | 30,066 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 2048 | 128 | 1,994 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 5000 | 500 | 2,078 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 500 | 2000 | 9,193 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 1000 | 1000 | 8,849 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 2048 | 2048 | 5,545 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 20000 | 2000 | 861 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |

TP: Tensor Parallelism
PP: Pipeline Parallelism

H100 Inference Performance - High Throughput

| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,378 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 128 | 4096 | 3,897 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 2048 | 128 | 774 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 500 | 2000 | 4,973 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 1000 | 1000 | 4,391 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 2048 | 2048 | 2,898 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 4 | 20000 | 2000 | 920 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 1 | 128 | 128 | 15,962 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 128 | 2048 | 23,010 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 128 | 4096 | 14,237 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 1 | 2048 | 128 | 1,893 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 5000 | 500 | 3,646 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 500 | 2000 | 18,186 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.14.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 1000 | 1000 | 15,932 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.14.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 2048 | 2048 | 10,686 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 20000 | 2000 | 1,757 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |

TP: Tensor Parallelism
PP: Pipeline Parallelism

L40S Inference Performance - High Throughput

| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 9,105 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 5,366 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 3,026 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 1,067 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 981 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 4,274 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 4,055 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 2,225 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 328 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Mixtral 8x7B | 4 | 1 | 128 | 128 | 15,278 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 128 | 2048 | 9,087 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 1 | 4 | 128 | 4096 | 5,736 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Mixtral 8x7B | 4 | 1 | 2048 | 128 | 2,098 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 5000 | 500 | 1,558 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 500 | 2000 | 7,974 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 1000 | 1000 | 6,579 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 2048 | 2048 | 4,217 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |

TP: Tensor Parallelism
PP: Pipeline Parallelism

Inference Performance of NVIDIA Data Center Products

H200 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 4.33 images/sec | - | 231.26 | 1x H200 | DGX H200 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | NVIDIA H200 |
| Stable Diffusion v2.1 (512x512) | 4 | 6.8 images/sec | - | 588.08 | 1x H200 | DGX H200 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | NVIDIA H200 |
| Stable Diffusion XL | 1 | 0.86 images/sec | - | 1157.27 | 1x H200 | DGX H200 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA H200 |
| ResNet-50v1.5 | 8 | 20,801 images/sec | 62 images/sec/watt | 0.38 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| ResNet-50v1.5 | 128 | 65,045 images/sec | 107 images/sec/watt | 1.97 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| EfficientNet-B0 | 8 | 16,769 images/sec | 77 images/sec/watt | 0.48 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| EfficientNet-B0 | 128 | 56,981 images/sec | 122 images/sec/watt | 2.25 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| EfficientNet-B4 | 8 | 4,507 images/sec | 14 images/sec/watt | 1.78 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| EfficientNet-B4 | 128 | 8,991 images/sec | 15 images/sec/watt | 14.24 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| HF Swin Base | 8 | 5,090 samples/sec | 11 samples/sec/watt | 1.57 | 1x H200 | DGX H200 | 25.01-py3 | Mixed | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| HF Swin Base | 32 | 8,204 samples/sec | 12 samples/sec/watt | 3.9 | 1x H200 | DGX H200 | 25.01-py3 | Mixed | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| HF Swin Large | 8 | 3,382 samples/sec | 6 samples/sec/watt | 2.37 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| HF Swin Large | 32 | 4,676 samples/sec | 7 samples/sec/watt | 6.84 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| HF ViT Base | 8 | 9,006 samples/sec | 19 samples/sec/watt | 0.89 | 1x H200 | DGX H200 | 25.01-py3 | FP8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| HF ViT Base | 64 | 15,640 samples/sec | 23 samples/sec/watt | 4.09 | 1x H200 | DGX H200 | 25.01-py3 | FP8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| HF ViT Large | 8 | 3,439 samples/sec | 6 samples/sec/watt | 2.33 | 1x H200 | DGX H200 | 25.01-py3 | FP8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| HF ViT Large | 64 | 5,471 samples/sec | 8 samples/sec/watt | 11.7 | 1x H200 | DGX H200 | 25.01-py3 | FP8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| QuartzNet | 8 | 6,741 samples/sec | 25 samples/sec/watt | 1.19 | 1x H200 | DGX H200 | 25.01-py3 | Mixed | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| QuartzNet | 128 | 34,280 samples/sec | 92 samples/sec/watt | 3.73 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |
| RetinaNet-RN34 | 8 | 3,015 images/sec | 8 images/sec/watt | 2.65 | 1x H200 | DGX H200 | 25.01-py3 | INT8 | Synthetic | TensorRT 10.8.0.40 | NVIDIA H200 |

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256

GH200 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 4.27 images/sec | - | 234.4 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB |
| Stable Diffusion v2.1 (512x512) | 4 | 5.82 images/sec | - | 687.91 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB |
| Stable Diffusion XL | 1 | 0.68 images/sec | - | 1149.44 | 1x GH200 | NVIDIA P3880 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | GH200 96GB |
| ResNet-50v1.5 | 8 | 21,533 images/sec | 63 images/sec/watt | 0.37 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| ResNet-50v1.5 | 128 | 63,043 images/sec | 99 images/sec/watt | 2.03 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| EfficientNet-B0 | 8 | 16,695 images/sec | 67 images/sec/watt | 0.48 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| EfficientNet-B0 | 128 | 56,674 images/sec | 113 images/sec/watt | 2.26 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| EfficientNet-B4 | 8 | 4,531 images/sec | 13 images/sec/watt | 1.77 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| EfficientNet-B4 | 128 | 8,784 images/sec | 14 images/sec/watt | 14.57 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| HF Swin Base | 8 | 5,106 samples/sec | 10 samples/sec/watt | 1.57 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| HF Swin Base | 32 | 8,197 samples/sec | 12 samples/sec/watt | 3.9 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| HF Swin Large | 8 | 3,403 samples/sec | 6 samples/sec/watt | 2.35 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| HF Swin Large | 32 | 4,846 samples/sec | 6 samples/sec/watt | 6.6 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| HF ViT Base | 8 | 8,990 samples/sec | 18 samples/sec/watt | 0.89 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| HF ViT Base | 64 | 15,562 samples/sec | 21 samples/sec/watt | 4.11 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| HF ViT Large | 8 | 3,707 samples/sec | 6 samples/sec/watt | 2.16 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| HF ViT Large | 64 | 5,703 samples/sec | 7 samples/sec/watt | 11.22 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| QuartzNet | 8 | 6,688 samples/sec | 22 samples/sec/watt | 1.2 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| QuartzNet | 128 | 34,272 samples/sec | 85 samples/sec/watt | 3.73 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |
| RetinaNet-RN34 | 8 | 2,945 images/sec | 4 images/sec/watt | 2.72 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB |

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256

H100 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 4.22 images/sec | - | 236.8 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | H100 SXM5-80GB |
| Stable Diffusion v2.1 (512x512) | 4 | 6.41 images/sec | - | 624.6 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | H100 SXM5-80GB |
| Stable Diffusion XL | 1 | 0.83 images/sec | - | 1210.08 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100 SXM5-80GB |
| ResNet-50v1.5 | 8 | 21,588 images/sec | 63 images/sec/watt | 0.37 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| ResNet-50v1.5 | 128 | 59,535 images/sec | 99 images/sec/watt | 2.15 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| EfficientNet-B0 | 8 | 16,351 images/sec | 67 images/sec/watt | 0.49 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| EfficientNet-B0 | 128 | 55,498 images/sec | 116 images/sec/watt | 2.31 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| EfficientNet-B4 | 8 | 4,550 images/sec | 12 images/sec/watt | 1.76 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| EfficientNet-B4 | 128 | 8,144 images/sec | 15 images/sec/watt | 15.72 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| HF Swin Base | 8 | 5,072 samples/sec | 9 samples/sec/watt | 1.58 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| HF Swin Base | 32 | 7,706 samples/sec | 11 samples/sec/watt | 4.15 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| HF Swin Large | 8 | 3,299 samples/sec | 6 samples/sec/watt | 2.42 | 1x H100 | DGX H100 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| HF Swin Large | 32 | 4,463 samples/sec | 7 samples/sec/watt | 7.17 | 1x H100 | DGX H100 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| HF ViT Base | 8 | 9,078 samples/sec | 17 samples/sec/watt | 0.88 | 1x H100 | DGX H100 | 25.02-py3 | FP8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| HF ViT Base | 64 | 15,210 samples/sec | 22 samples/sec/watt | 4.21 | 1x H100 | DGX H100 | 25.02-py3 | FP8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| HF ViT Large | 8 | 3,440 samples/sec | 6 samples/sec/watt | 2.33 | 1x H100 | DGX H100 | 25.02-py3 | FP8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| HF ViT Large | 64 | 5,363 samples/sec | 8 samples/sec/watt | 11.93 | 1x H100 | DGX H100 | 25.02-py3 | FP8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| QuartzNet | 8 | 6,767 samples/sec | 22 samples/sec/watt | 1.18 | 1x H100 | DGX H100 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| QuartzNet | 128 | 35,389 samples/sec | 77 samples/sec/watt | 3.62 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |
| RetinaNet-RN34 | 8 | 2,827 images/sec | 8 images/sec/watt | 2.83 | 1x H100 | DGX H100 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | H100-SXM5-80GB |

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
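The latency column in the table above is consistent with simple per-batch arithmetic: latency ≈ batch size ÷ throughput, and dividing throughput by the efficiency column gives the implied measured board power. A minimal sketch of these relations (helper names are illustrative, not NVIDIA tooling), checked against the ResNet-50 v1.5 rows:

```python
# Back-of-envelope checks on the table figures above (illustrative helpers).

def batch_latency_ms(batch_size: int, throughput_per_sec: float) -> float:
    """Per-batch latency in milliseconds: batch size divided by throughput."""
    return batch_size / throughput_per_sec * 1000.0

def implied_power_watts(throughput_per_sec: float, efficiency_per_watt: float) -> float:
    """Board power implied by the Throughput and Efficiency columns."""
    return throughput_per_sec / efficiency_per_watt

# ResNet-50 v1.5 on H100, batch 8: 21,588 images/sec at 63 images/sec/watt
print(round(batch_latency_ms(8, 21_588), 2))    # ~0.37 ms, matching the table
print(round(implied_power_watts(21_588, 63)))   # ~343 W measured draw
```

The same arithmetic reproduces the batch-128 row (128 / 59,535 ≈ 2.15 ms), which is a useful sanity check when reading unfamiliar rows.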

L40S Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 2.49 images/sec | - | 401.48 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L40S |
| Stable Diffusion v2.1 (512x512) | 4 | 2.91 images/sec | - | 1372.72 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L40S |
| Stable Diffusion XL | 1 | 0.37 images/sec | - | 2678.19 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L40S |
| ResNet-50v1.5 | 8 | 23,472 images/sec | 78 images/sec/watt | 0.34 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| ResNet-50v1.5 | 32 | 37,069 images/sec | 109 images/sec/watt | 0.86 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| BERT-BASE | 8 | 8,412 sequences/sec | 26 sequences/sec/watt | 0.95 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| BERT-BASE | 128 | 13,169 sequences/sec | 38 sequences/sec/watt | 9.72 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| BERT-LARGE | 8 | 3,188 sequences/sec | 10 sequences/sec/watt | 2.51 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| BERT-LARGE | 24 | 4,034 sequences/sec | 12 sequences/sec/watt | 31.73 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| EfficientDet-D0 | 8 | 4,696 images/sec | 17 images/sec/watt | 1.7 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0.26 | NVIDIA L40S |
| EfficientNet-B0 | 8 | 20,534 images/sec | 106 images/sec/watt | 0.39 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| EfficientNet-B0 | 32 | 41,526 images/sec | 140 images/sec/watt | 0.77 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L40S |
| EfficientNet-B4 | 8 | 5,149 images/sec | 17 images/sec/watt | 1.55 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| EfficientNet-B4 | 16 | 6,116 images/sec | 18 images/sec/watt | 2.62 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| HF Swin Base | 8 | 3,843 samples/sec | 11 samples/sec/watt | 2.08 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0.23 | NVIDIA L40S |
| HF Swin Base | 16 | 4,266 samples/sec | 12 samples/sec/watt | 7.5 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0.26 | NVIDIA L40S |
| HF Swin Large | 8 | 1,932 samples/sec | 6 samples/sec/watt | 4.14 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | Mixed | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| HF Swin Large | 16 | 2,141 samples/sec | 6 samples/sec/watt | 7.47 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| HF ViT Base | 8 | 5,799 samples/sec | 17 samples/sec/watt | 1.38 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | FP8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| HF ViT Large | 8 | 1,926 samples/sec | 6 samples/sec/watt | 4.15 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | FP8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| Megatron BERT Large QAT | 8 | 4,213 sequences/sec | 13 sequences/sec/watt | 1.9 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| Megatron BERT Large QAT | 24 | 5,097 sequences/sec | 15 sequences/sec/watt | 4.71 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| QuartzNet | 8 | 7,643 samples/sec | 32 samples/sec/watt | 1.05 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| QuartzNet | 128 | 22,595 samples/sec | 65 samples/sec/watt | 5.66 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0.23 | NVIDIA L40S |
| RetinaNet-RN34 | 8 | 1,463 images/sec | 7 images/sec/watt | 5.47 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0.23 | NVIDIA L40S |

1,024 x 1,024 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

L4 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 0.82 images/sec | - | 1221.73 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| Stable Diffusion XL | 1 | 0.11 images/sec | - | 9098.4 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| ResNet-50v1.5 | 8 | 9,649 images/sec | 134 images/sec/watt | 0.83 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
| ResNet-50v1.5 | 32 | 10,101 images/sec | 111 images/sec/watt | 16.27 | 1x L4 | GIGABYTE G482-Z54-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| BERT-BASE | 8 | 3,323 sequences/sec | 46 sequences/sec/watt | 2.41 | 1x L4 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| BERT-BASE | 24 | 4,052 sequences/sec | 56 sequences/sec/watt | 5.92 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| BERT-LARGE | 8 | 1,081 sequences/sec | 15 sequences/sec/watt | 7.4 | 1x L4 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| BERT-LARGE | 13 | 1,314 sequences/sec | 19 sequences/sec/watt | 9.9 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| EfficientNet-B4 | 8 | 1,844 images/sec | 26 images/sec/watt | 4.34 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
| HF Swin Base | 8 | 1,221 samples/sec | 17 samples/sec/watt | 6.55 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
| HF Swin Large | 8 | 621 samples/sec | 9 samples/sec/watt | 12.89 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
| HF ViT Base | 16 | 1,844 samples/sec | 26 samples/sec/watt | 4.34 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | FP8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
| HF ViT Large | 8 | 617 samples/sec | 9 samples/sec/watt | 12.96 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | FP8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
| Megatron BERT Large QAT | 24 | 1,789 sequences/sec | 25 sequences/sec/watt | 13.42 | 1x L4 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| QuartzNet | 8 | 3,886 samples/sec | 54 samples/sec/watt | 2.06 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
| QuartzNet | 128 | 6,144 samples/sec | 85 samples/sec/watt | 20.83 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |
| RetinaNet-RN34 | 8 | 355 images/sec | 5 images/sec/watt | 22.51 | 1x L4 | GIGABYTE G482-Z54-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA L4 |

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
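Rows like the QuartzNet entries above (batch 8 at 2.06 ms vs. batch 128 at 20.83 ms) show the throughput/latency trade-off described in the introduction: larger batches raise throughput but also latency. Given such measurements, one simple policy is to take the highest-throughput batch size that still fits a latency budget. A sketch of that selection (the rows are copied from the table; the helper is illustrative, not NVIDIA tooling):

```python
# Pick the batch size that maximizes throughput within a latency budget,
# using (batch, samples/sec, latency_ms) rows from the QuartzNet-on-L4 table.
rows = [
    (8, 3_886, 2.06),     # batch 8
    (128, 6_144, 20.83),  # batch 128
]

def best_batch(rows, latency_budget_ms):
    """Return the row with the highest throughput whose latency fits the budget."""
    feasible = [r for r in rows if r[2] <= latency_budget_ms]
    return max(feasible, key=lambda r: r[1], default=None)

print(best_batch(rows, 10.0))   # only batch 8 fits a 10 ms budget
print(best_batch(rows, 25.0))   # batch 128 wins once the budget allows it
```

The same policy generalizes to the token-level budgets used for LLM serving, e.g. the one-second time-to-first-token limit discussed at the top of this page.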

 

A40 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 11,177 images/sec | 40 images/sec/watt | 0.72 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| ResNet-50v1.5 | 128 | 15,473 images/sec | 52 images/sec/watt | 8.27 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| BERT-BASE | 8 | 4,257 sequences/sec | 15 sequences/sec/watt | 1.88 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| BERT-BASE | 128 | 5,667 sequences/sec | 19 sequences/sec/watt | 22.59 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| BERT-LARGE | 8 | 1,573 sequences/sec | 5 sequences/sec/watt | 5.08 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| BERT-LARGE | 128 | 1,966 sequences/sec | 7 sequences/sec/watt | 65.11 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| EfficientNet-B0 | 8 | 11,130 images/sec | 61 images/sec/watt | 0.72 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| EfficientNet-B0 | 128 | 20,078 images/sec | 67 images/sec/watt | 6.38 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| EfficientNet-B4 | 8 | 2,145 images/sec | 8 images/sec/watt | 3.73 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| EfficientNet-B4 | 128 | 2,689 images/sec | 9 images/sec/watt | 47.59 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| HF Swin Base | 8 | 1,697 samples/sec | 6 samples/sec/watt | 4.71 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| HF Swin Base | 32 | 1,842 samples/sec | 6 samples/sec/watt | 17.38 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| HF Swin Large | 8 | 959 samples/sec | 3 samples/sec/watt | 8.34 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| HF Swin Large | 32 | 1,010 samples/sec | 3 samples/sec/watt | 31.68 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| HF ViT Base | 8 | 2,175 samples/sec | 7 samples/sec/watt | 3.68 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| HF ViT Base | 64 | 2,324 samples/sec | 8 samples/sec/watt | 27.54 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| HF ViT Large | 8 | 694 samples/sec | 2 samples/sec/watt | 11.53 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| HF ViT Large | 64 | 750 samples/sec | 2 samples/sec/watt | 85.34 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| Megatron BERT Large QAT | 8 | 2,059 sequences/sec | 7 sequences/sec/watt | 3.89 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| Megatron BERT Large QAT | 128 | 2,650 sequences/sec | 9 sequences/sec/watt | 48.31 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| QuartzNet | 8 | 4,388 samples/sec | 21 samples/sec/watt | 1.82 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| QuartzNet | 128 | 8,453 samples/sec | 28 samples/sec/watt | 15.14 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |
| RetinaNet-RN34 | 8 | 706 images/sec | 2 images/sec/watt | 11.34 | 1x A40 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A40 |

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

 

A30 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 10,261 images/sec | 71 images/sec/watt | 0.78 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| ResNet-50v1.5 | 128 | 16,465 images/sec | 101 images/sec/watt | 7.77 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| BERT-BASE | 8 | 4,334 sequences/sec | 26 sequences/sec/watt | 1.85 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 |
| BERT-BASE | 128 | 5,820 sequences/sec | 35 sequences/sec/watt | 21.99 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 |
| BERT-LARGE | 8 | 1,500 sequences/sec | 10 sequences/sec/watt | 5.33 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 |
| BERT-LARGE | 128 | 2,053 sequences/sec | 13 sequences/sec/watt | 62.34 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 |
| EfficientNet-B0 | 8 | 8,993 images/sec | 81 images/sec/watt | 0.89 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| EfficientNet-B0 | 128 | 17,119 images/sec | 105 images/sec/watt | 7.48 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| EfficientNet-B4 | 8 | 1,875 images/sec | 13 images/sec/watt | 4.27 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| EfficientNet-B4 | 128 | 2,397 images/sec | 15 images/sec/watt | 53.4 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| HF Swin Base | 8 | 1,646 samples/sec | 10 samples/sec/watt | 4.86 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| HF Swin Base | 32 | 1,851 samples/sec | 11 samples/sec/watt | 17.28 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| HF Swin Large | 8 | 907 samples/sec | 6 samples/sec/watt | 8.82 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| HF Swin Large | 32 | 1,000 samples/sec | 6 samples/sec/watt | 32 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| HF ViT Base | 8 | 2,058 samples/sec | 13 samples/sec/watt | 3.89 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| HF ViT Base | 64 | 2,271 samples/sec | 14 samples/sec/watt | 28.18 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| HF ViT Large | 8 | 675 samples/sec | 4 samples/sec/watt | 11.86 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| HF ViT Large | 64 | 708 samples/sec | 4 samples/sec/watt | 90.34 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| QuartzNet | 8 | 3,434 samples/sec | 29 samples/sec/watt | 2.33 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| QuartzNet | 128 | 9,997 samples/sec | 73 samples/sec/watt | 12.8 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |
| RetinaNet-RN34 | 8 | 703 images/sec | 4 images/sec/watt | 11.39 | 1x A30 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A30 |

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

 

A10 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 8,499 images/sec | 57 images/sec/watt | 0.94 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| ResNet-50v1.5 | 128 | 10,654 images/sec | 71 images/sec/watt | 12.01 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| BERT-BASE | 8 | 3,109 sequences/sec | 21 sequences/sec/watt | 2.57 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A10 |
| BERT-BASE | 128 | 3,822 sequences/sec | 26 sequences/sec/watt | 33.49 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A10 |
| BERT-LARGE | 8 | 1,086 sequences/sec | 7 sequences/sec/watt | 7.36 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA A10 |
| BERT-LARGE | 128 | 1,265 sequences/sec | 8 sequences/sec/watt | 101.17 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA A10 |
| EfficientNet-B0 | 8 | 9,679 images/sec | 65 images/sec/watt | 0.83 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| EfficientNet-B0 | 128 | 14,418 images/sec | 96 images/sec/watt | 8.88 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| EfficientNet-B4 | 8 | 1,633 images/sec | 11 images/sec/watt | 4.9 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| EfficientNet-B4 | 128 | 1,863 images/sec | 12 images/sec/watt | 68.72 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| HF Swin Base | 8 | 1,214 samples/sec | 8 samples/sec/watt | 6.59 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| HF Swin Base | 32 | 1,258 samples/sec | 8 samples/sec/watt | 25.44 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| HF Swin Large | 8 | 623 samples/sec | 4 samples/sec/watt | 12.84 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| HF Swin Large | 32 | 656 samples/sec | 4 samples/sec/watt | 48.75 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| HF ViT Base | 8 | 1,370 samples/sec | 9 samples/sec/watt | 5.84 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| HF ViT Base | 64 | 1,503 samples/sec | 10 samples/sec/watt | 42.59 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| HF ViT Large | 8 | 453 samples/sec | 3 samples/sec/watt | 17.68 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | Mixed | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| Megatron BERT Large QAT | 8 | 1,566 sequences/sec | 10 sequences/sec/watt | 5.11 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| Megatron BERT Large QAT | 128 | 1,801 sequences/sec | 12 sequences/sec/watt | 71.06 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| QuartzNet | 8 | 3,842 samples/sec | 26 samples/sec/watt | 2.08 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| QuartzNet | 128 | 5,867 samples/sec | 39 samples/sec/watt | 21.82 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |
| RetinaNet-RN34 | 8 | 516 images/sec | 4 images/sec/watt | 15.5 | 1x A10 | GIGABYTE G482-Z52-00 | 25.02-py3 | INT8 | Synthetic | TensorRT 10.8.0.43 | NVIDIA A10 |

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
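For a cross-GPU view of the tables above, it helps to pull one network's best row from each. The snippet below does this for ResNet-50 v1.5 at the largest batch size each table lists (note the L40S and L4 rows use batch 32, the others batch 128, so the comparison is indicative rather than strictly like-for-like; figures are copied from the tables and the normalization is illustrative):

```python
# ResNet-50 v1.5 throughput at each GPU's largest listed batch size,
# copied from the tables above (L40S and L4 at batch 32, the rest at batch 128).
resnet50_images_per_sec = {
    "H100": 59_535,
    "L40S": 37_069,
    "A30": 16_465,
    "A40": 15_473,
    "A10": 10_654,
    "L4": 10_101,
}

# Normalize to the A10 and print in descending order of throughput.
baseline = resnet50_images_per_sec["A10"]
for gpu, ips in sorted(resnet50_images_per_sec.items(), key=lambda kv: -kv[1]):
    print(f"{gpu}: {ips:,} images/sec ({ips / baseline:.1f}x A10)")
```

The same pattern works for any network that appears in several tables (e.g. QuartzNet or BERT-Large), as long as batch sizes and precisions are compared with care.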

Inference Performance of NVIDIA GPUs in the Cloud

A100 Inference Performance in the Cloud

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 13,768 images/sec | - | 0.58 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
| ResNet-50v1.5 | 128 | 30,338 images/sec | - | 4.22 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
| BERT-LARGE | 8 | 2,308 sequences/sec | - | 3.47 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
| BERT-LARGE | 128 | 4,045 sequences/sec | - | 31.64 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |

BERT-Large: Sequence Length = 128

View More Performance Data

Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best way to verify that an AI system is ready to be deployed in the field and deliver meaningful results.

Learn More

AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Learn More