NVIDIA has announced its latest supercomputer, the DGX SATURNV, designed to help build smarter cars and next-generation GPUs. Billed as the world's most efficient supercomputer, the DGX SATURNV is built around NVIDIA's Pascal-based Tesla P100 GPUs.
NVIDIA's DGX SATURNV Supercomputer Is The World's Most Efficient - Utilizes Tesla P100 GPUs
The DGX SATURNV is ranked 28th on the Top500 list of supercomputers and is also the most efficient of them all. The supercomputer houses 124 DGX-1 units, NVIDIA's custom-designed server nodes based on its Tesla P100 graphics chips. Before SATURNV, the most efficient machine on the Top500 list was rated at 6.67 GigaFlops/Watt. The NVIDIA-designed DGX SATURNV delivers an incredible 9.46 GigaFlops/Watt, a 42% improvement.
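The 42% figure follows directly from the two efficiency ratings quoted above; a quick sketch of the arithmetic:

```python
# Sketch: verify the ~42% efficiency improvement over the prior leader.
previous_best = 6.67   # GigaFlops/Watt, previous most efficient Top500 machine
saturnv = 9.46         # GigaFlops/Watt, NVIDIA DGX SATURNV

improvement_pct = (saturnv / previous_best - 1) * 100
print(f"Improvement: {improvement_pct:.0f}%")  # roughly 42%
```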
That efficiency is key to building machines capable of reaching exascale speeds — that’s 1 quintillion, or 1 billion billion, floating-point operations per second. Such a machine could help design efficient new combustion engines, model clean-burning fusion reactors, and achieve new breakthroughs in medical research. via NVIDIA
What Powers The DGX SATURNV?
Powering the NVIDIA DGX SATURNV are 124 DGX-1 units. The NVIDIA DGX-1 is a supercomputer inside a box, delivering a large amount of compute performance in a small package.
The NVIDIA DGX-1 is a complete supercomputing solution that houses NVIDIA’s latest hardware and software innovations, ranging from the Pascal architecture to the NVIDIA SDK suite. The DGX-1 has performance throughput equivalent to 250 x86 servers. This level of performance lets users get their own supercomputer for HPC- and AI-specific workloads.
Assembled by a team of a dozen engineers using 124 DGX-1s — the AI supercomputer in a box we unveiled in April — SATURNV helps us build the autonomous driving software that’s a key part of our NVIDIA DRIVE PX 2 self-driving vehicle platform. via NVIDIA
Some of the key specifications of NVIDIA’s DGX-1 Unit include:
- Up to 170 teraflops of half-precision (FP16) peak performance
- Eight Tesla P100 GPU accelerators, 16GB memory per GPU
- NVLink Hybrid Cube Mesh
- Dual 20-core Intel Xeon E5-2698 v4 (Broadwell) CPUs @ 2.2 GHz
- 7TB SSD DL Cache
- Dual 10GbE, Quad InfiniBand 100Gb networking
- 3U – 3200W
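The headline 170 TFLOPs FP16 figure in the spec list above follows from the eight Tesla P100s per node, each rated at 21.2 TFLOPs FP16; a rough sketch of how the per-node and system-wide peaks decompose:

```python
# Sketch: decomposing the DGX-1's 170 TFLOPs FP16 peak, and scaling
# it across SATURNV's 124 nodes. Figures taken from the spec list above.
p100_fp16_tflops = 21.2   # FP16 peak per Tesla P100 (SXM2)
gpus_per_dgx1 = 8
dgx1_nodes = 124

dgx1_fp16 = p100_fp16_tflops * gpus_per_dgx1    # ≈ 169.6 TFLOPs per node
saturnv_fp16 = dgx1_fp16 * dgx1_nodes / 1000    # aggregate, in PFLOPs
print(f"DGX-1: {dgx1_fp16:.1f} TFLOPs FP16, SATURNV: {saturnv_fp16:.1f} PFLOPs FP16 peak")
```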
DGX-1 is an appliance that integrates deep learning software, development tools and eight of our Tesla P100 GPUs — based on our new Pascal architecture — to pack computing power equal to 250 x86 servers into a device about the size of a stove top. via NVIDIA
The Tesla P100 is the heart of the DGX-1 platform. Featuring the fifth-generation Pascal architecture with 3584 CUDA cores, 224 texture mapping units, clock speeds up to 1480 MHz and 16 GB of HBM2 VRAM (732 GB/s memory bandwidth), the DGX-1 is all prepped for the most intensive workloads pitted against it. The chip delivers 5.3 TFLOPs of FP64, 10.6 TFLOPs of FP32 and 21.2 TFLOPs of FP16 compute performance. It comes in a 300W package yet delivers up to 17.7 GFLOPs/Watt at double-precision compute.
“This system is internally at Nvidia for our self-driving car initiatives,” says Buck. “We are also using it for chip and wafer defect analysis and for our own sales and marketing analytics. We are also taking the framework we are using on this system and using it as the starting point for the CANDLE framework for cancer research. You only need 36 of these nodes to reach one petaflops, and it really speaks to our strategy of building strong nodes. The small number of nodes makes it really tractable for us to build a system like Saturn V.” via NextPlatform
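Buck's "36 nodes to reach one petaflops" remark checks out against SATURNV's sustained Linpack score; the figure below (roughly 3,307 TFLOPs Rmax on the November 2016 Top500 list) is an assumption not stated in the article itself:

```python
# Sketch: sanity-checking "36 nodes to reach one petaflops".
# Assumes SATURNV's November 2016 Top500 Linpack (Rmax) score of
# ~3,307 TFLOPs across 124 DGX-1 nodes -- a figure NOT quoted in
# the article, included here only for illustration.
rmax_tflops = 3307.0
total_nodes = 124

per_node = rmax_tflops / total_nodes     # ≈ 26.7 TFLOPs sustained per node
pf_for_36_nodes = per_node * 36 / 1000   # ≈ 0.96 PFLOPs
print(f"{per_node:.1f} TFLOPs/node -> {pf_for_36_nodes:.2f} PFLOPs with 36 nodes")
```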
The DGX SATURNV proves that NVIDIA's Pascal GP100 was designed for the AI and datacenter market, bringing major power-efficiency gains along with increased performance over previous-generation graphics processing units.
NVIDIA Tesla Graphics Card Specs (Kepler To Volta):
NVIDIA Tesla Graphics Card | Tesla K40 (PCI-Express) | Tesla M40 (PCI-Express) | Tesla P100 (PCI-Express) | Tesla P100 (SXM2) | Tesla V100 (PCI-Express) | Tesla V100 (SXM2) | Tesla V100S (PCIe) |
---|---|---|---|---|---|---|---|
GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GV100 (Volta) | GV100 (Volta) |
Process Node | 28nm | 28nm | 16nm | 16nm | 12nm | 12nm | 12nm |
Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 21.1 Billion | 21.1 Billion |
GPU Die Size | 551 mm2 | 601 mm2 | 610 mm2 | 610 mm2 | 815mm2 | 815mm2 | 815mm2 |
SMs | 15 | 24 | 56 | 56 | 80 | 80 | 80 |
TPCs | 15 | 24 | 28 | 28 | 40 | 40 | 40 |
CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64 |
CUDA Cores (Total) | 2880 | 3072 | 3584 | 3584 | 5120 | 5120 | 5120 |
Texture Units | 240 | 192 | 224 | 224 | 320 | 320 | 320 |
FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32 |
FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1792 | 2560 | 2560 | 2560 |
Base Clock | 745 MHz | 948 MHz | 1190 MHz | 1328 MHz | 1230 MHz | 1297 MHz | TBD |
Boost Clock | 875 MHz | 1114 MHz | 1329MHz | 1480 MHz | 1380 MHz | 1530 MHz | 1601 MHz |
FP16 Compute | N/A | N/A | 18.7 TFLOPs | 21.2 TFLOPs | 28.0 TFLOPs | 30.4 TFLOPs | 32.8 TFLOPs |
FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 14.0 TFLOPs | 15.7 TFLOPs | 16.4 TFLOPs |
FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.0 TFLOPs | 7.80 TFLOPs | 8.2 TFLOPs |
Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 |
Memory Size | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 16 GB HBM2 @ 732 GB/s 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 1134 GB/s |
L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 6144 KB | 6144 KB | 6144 KB |
TDP | 235W | 250W | 250W | 300W | 250W | 300W | 250W |