NVIDIA Volta was announced at GTC 2017 and boy, is it a beast. The next-gen graphics processing unit is the world's first chip built on TSMC's industry-leading 12nm FinFET process, so let's cover the full details of this compute powerhouse.
NVIDIA Volta GV100 Unveiled – Tesla V100 With 5120 CUDA Cores, 16 GB HBM2, and 12nm FinFET Process
At last year's GTC, NVIDIA announced the Pascal-based GP100 GPU, which was then the fastest graphics chip designed for supercomputers. This year the company is taking the next leap in graphics performance with its Volta-based GV100 GPU, and we are going to take a deep look at this next-gen GPU designed for artificial intelligence workloads.
First of all, we need to talk about the workloads this specific chip is designed to handle. The NVIDIA Volta GV100 GPU is designed to power the most computationally intensive HPC, AI, and graphics workloads.
The GV100 GPU packs 21.1 billion transistors into a die size of 815mm², fabricated on TSMC's new 12nm FFN high-performance manufacturing process customized for NVIDIA. That makes it much bigger than the 610mm² Pascal GP100 GPU. NVIDIA Volta GV100 delivers considerably more compute performance and adds many new features compared to its predecessor, the Pascal GP100 GPU, and its architecture family.
The chip itself is a behemoth, featuring a brand new architecture with insane raw specifications. The NVIDIA Volta GV100 GPU is composed of six Graphics Processing Clusters (GPCs). It has a total of 84 Volta streaming multiprocessors (SMs) and 42 TPCs. Each SM packs 64 CUDA cores, so we are looking at a total of 5376 CUDA cores on the complete die; the Tesla V100 ships with 80 SMs enabled, for 5120 CUDA cores. All 5376 CUDA cores can execute both FP32 and INT32 instructions, while there are also a total of 2688 FP64 cores. Aside from these, we are looking at 672 Tensor Cores and 336 texture units.
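As a sanity check, the per-SM counts multiply out cleanly from the published figures. A quick sketch (full-die numbers; the shipping Tesla V100 has 80 of the 84 SMs enabled):

```python
# Full GV100 die topology, from NVIDIA's published figures
gpcs = 6
sms_per_gpc = 14           # 84 SMs spread across 6 GPCs
sms = gpcs * sms_per_gpc   # 84 streaming multiprocessors
tpcs = sms // 2            # 2 SMs per TPC -> 42 TPCs

fp32_per_sm = 64
fp64_per_sm = 32
tensor_per_sm = 8
tex_per_sm = 4

print(sms, tpcs)              # 84 42
print(sms * fp32_per_sm)      # 5376 FP32 CUDA cores (full die)
print(sms * fp64_per_sm)      # 2688 FP64 cores
print(sms * tensor_per_sm)    # 672 Tensor Cores
print(sms * tex_per_sm)       # 336 texture units
print(80 * fp32_per_sm)       # 5120 CUDA cores on Tesla V100 (80 SMs)
```

Every headline count in the spec sheet is just one of these products, which is why the full-die and V100 figures differ by exactly four SMs' worth of resources.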
The memory architecture is also updated, with eight 512-bit memory controllers making up a total 4096-bit bus interface that supports up to 16GB of HBM2 VRAM. The memory is clocked higher at 900 MHz, delivering an increased transfer rate of 900 GB/s compared to 720 GB/s on Pascal GP100. Each memory controller is attached to 768 KB of L2 cache, for a total of 6 MB of L2 cache across the entire chip.
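The 900 GB/s figure follows directly from the bus width and the per-pin data rate. A rough back-of-the-envelope check (assuming pin rates of about 1.4 Gbps for P100's HBM2 and about 1.75 Gbps for V100's, which are the rates NVIDIA's quoted bandwidths imply):

```python
# Peak bandwidth = bus width (bits) * per-pin data rate (Gbps) / 8 bits-per-byte
BUS_BITS = 4096  # eight 512-bit HBM2 memory controllers

def peak_bandwidth_gbs(pin_rate_gbps):
    return BUS_BITS * pin_rate_gbps / 8

print(round(peak_bandwidth_gbs(1.406)))  # ~720 GB/s (Tesla P100)
print(round(peak_bandwidth_gbs(1.758)))  # ~900 GB/s (Tesla V100)
```

So the 25% bandwidth uplift over GP100 comes entirely from faster HBM2 stacks; the 4096-bit bus width is unchanged between the two generations.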
NVIDIA Tesla Graphics Cards Comparison
|Tesla Graphics Card Name||NVIDIA Tesla M2090||NVIDIA Tesla K40||NVIDIA Tesla K80||NVIDIA Tesla P100||NVIDIA Tesla V100|
|GPU Name||GF110||GK110||GK210 x 2||GP100||GV100|
|Transistor Count||3.00 Billion||7.08 Billion||7.08 Billion||15.3 Billion||21.1 Billion|
|CUDA Cores||512 CCs (16 CUs)||2880 CCs (15 CUs)||2496 CCs (13 CUs) x 2||3840 CCs||5120 CCs|
|Core Clock||Up To 650 MHz||Up To 875 MHz||Up To 875 MHz||Up To 1480 MHz||Up To 1455 MHz|
|FP32 Compute||1.33 TFLOPs||4.29 TFLOPs||8.74 TFLOPs||10.6 TFLOPs||15.0 TFLOPs|
|FP64 Compute||0.66 TFLOPs||1.43 TFLOPs||2.91 TFLOPs||5.30 TFLOPs||7.50 TFLOPs|
|VRAM Size||6 GB||12 GB||12 GB x 2||16 GB||16 GB|
|VRAM Bus||384-bit||384-bit||384-bit x 2||4096-bit||4096-bit|
|VRAM Speed||3.7 GHz||6 GHz||5 GHz||700 MHz||900 MHz|
|Memory Bandwidth||177.6 GB/s||288 GB/s||240 GB/s||720 GB/s||900 GB/s|
Key compute features of the NVIDIA Volta GV100 based Tesla V100 include the following:
- New Streaming Multiprocessor (SM) Architecture Optimized for Deep Learning Volta features a major new redesign of the SM processor architecture that is at the center of the GPU. The new Volta SM is 50% more energy efficient than the previous generation Pascal design, enabling major boosts in FP32 and FP64 performance in the same power envelope. New Tensor Cores designed specifically for deep learning deliver up to 12x higher peak TFLOPs for training. With independent, parallel integer and floating point data paths, the Volta SM is also much more efficient on workloads with a mix of computation and addressing calculations. Volta’s new independent thread scheduling capability enables finer-grain synchronization and cooperation between parallel threads. Finally, a new combined L1 Data Cache and Shared Memory subsystem significantly improves performance while also simplifying programming.
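The headline FLOP rates in the bullet above fall straight out of core counts and clock, since each fused multiply-add counts as two floating-point operations. A quick sketch using the Tesla V100 figures from the table (5120 CUDA cores, ~1455 MHz boost):

```python
# Peak FLOPs = cores * 2 ops per FMA * clock (GHz), giving GFLOPs -> /1000 for TFLOPs
boost_ghz = 1.455
fp32_cores = 5120
fp64_cores = 2560   # FP64 runs at half rate on V100's 80 enabled SMs

fp32_tflops = fp32_cores * 2 * boost_ghz / 1000
fp64_tflops = fp64_cores * 2 * boost_ghz / 1000
print(round(fp32_tflops, 1))  # ~14.9 TFLOPs FP32
print(round(fp64_tflops, 1))  # ~7.4 TFLOPs FP64

# Each Tensor Core performs a 4x4x4 matrix FMA per clock:
# 64 multiply-adds = 128 mixed-precision ops per clock
tensor_cores = 640  # 80 SMs * 8 Tensor Cores
tensor_tflops = tensor_cores * 128 * boost_ghz / 1000
print(round(tensor_tflops))   # ~119 TFLOPs for deep learning
```

That ~119 TFLOPs of Tensor Core throughput against the P100's 10.6 TFLOPs of FP32 is where the "up to 12x for training" claim comes from.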
- Second-Generation NVLink The second generation of NVIDIA’s NVLink high-speed interconnect delivers higher bandwidth, more links, and improved scalability for multi-GPU and multi-GPU/CPU system configurations. GV100 supports up to six NVLink links at 25 GB/s in each direction, for a total of 300 GB/s of aggregate bandwidth. NVLink now supports CPU mastering and cache coherence capabilities with IBM POWER9 CPU-based servers. The new NVIDIA DGX-1 with V100 AI supercomputer uses NVLink to deliver greater scalability for ultra-fast deep learning training.
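The 300 GB/s total only works out if you count both directions of each link, which is worth making explicit:

```python
# Second-generation NVLink on GV100
links = 6
gbs_per_direction = 25  # per link, per direction

total_gbs = links * gbs_per_direction * 2  # send + receive
print(total_gbs)  # 300 GB/s aggregate bidirectional bandwidth
```

For comparison, first-generation NVLink on GP100 topped out at four links at 20 GB/s per direction, or 160 GB/s aggregate, so GV100 nearly doubles GPU-to-GPU interconnect bandwidth.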
- HBM2 Memory: Faster, Higher Efficiency Volta’s highly tuned 16GB HBM2 memory subsystem delivers 900 GB/sec peak memory bandwidth. The combination of both a new generation HBM2 memory from Samsung, and a new generation memory controller in Volta provides 1.5x delivered memory bandwidth versus Pascal GP100 and greater than 95% memory bandwidth efficiency running many workloads.
- Volta Multi-Process Service Volta Multi-Process Service (MPS) is a new feature of the Volta GV100 architecture providing hardware acceleration of critical components of the CUDA MPS server, enabling improved performance, isolation, and better quality of service (QoS) for multiple compute applications sharing the GPU. Volta MPS also triples the maximum number of MPS clients from 16 on Pascal to 48 on Volta.
- Enhanced Unified Memory and Address Translation Services GV100 Unified Memory technology in Volta GV100 includes new access counters to allow more accurate migration of memory pages to the processor that accesses the pages most frequently, improving efficiency for accessing memory ranges shared between processors. On IBM Power platforms, new Address Translation Services (ATS) support allows the GPU to access the CPU’s page tables directly.
- Cooperative Groups and New Cooperative Launch APIs Cooperative Groups is a new programming model introduced in CUDA 9 for organizing groups of communicating threads. Cooperative Groups allows developers to express the granularity at which threads are communicating, helping them to express richer, more efficient parallel decompositions. Basic Cooperative Groups functionality is supported on all NVIDIA GPUs since Kepler. Pascal and Volta include support for new Cooperative Launch APIs that support synchronization amongst CUDA thread blocks. Volta adds support for new synchronization patterns.
- Maximum Performance and Maximum Efficiency Modes In Maximum Performance mode, the Tesla V100 accelerator will operate unconstrained up to its TDP (Thermal Design Power) level of 300W to accelerate applications that require the fastest computational speed and highest data throughput. Maximum Efficiency Mode allows data center managers to tune power usage of their Tesla V100 accelerators to operate with optimal performance per watt. A not-to-exceed power cap can be set across all GPUs in a rack, reducing power consumption dramatically, while still obtaining excellent rack performance.
- Volta Optimized Software New versions of deep learning frameworks such as Caffe2, MXNet, CNTK, TensorFlow, and others harness the performance of Volta to deliver dramatically faster training times and higher multi-node training performance. Volta-optimized versions of GPU-accelerated libraries such as cuDNN, cuBLAS, and TensorRT leverage the new features of the Volta GV100 architecture to deliver higher performance for both deep learning and High-Performance Computing (HPC) applications. The NVIDIA CUDA Toolkit version 9.0 includes new APIs and support for Volta features to provide even easier programmability.
|GPU Family||AMD Vega||AMD Navi||NVIDIA Pascal||NVIDIA Volta|
|Flagship GPU||Vega 10||Navi 10?||NVIDIA GP100||NVIDIA GV100|
|GPU Process||FinFET||7nm FinFET?||TSMC 16nm FinFET||TSMC 12nm FinFET|
|GPU Transistors||15-18 Billion||TBC||15.3 Billion||21.1 Billion|
|Memory (Consumer Cards)||HBM2||Next-Gen Memory||GDDR5X/HBM2||GDDR6/HBM2?|
|Memory (Dual-Chip Professional/ HPC)||HBM2||Next-Gen Memory||HBM2||HBM2|
|HBM2 Bandwidth||512 GB/s (Instinct MI25)||>1 TB/s?||732 GB/s (Peak)||900 GB/s|
|Graphics Architecture||Next Compute Unit (Vega)||Next Compute Unit (Navi)||5th Gen Pascal CUDA||6th Gen Volta CUDA|
|Successor of (GPU)||Radeon RX 500 Series?||Radeon RX 600 Series?||GM200 (Maxwell)||GP100 (Pascal)|
Share your thoughts on the NVIDIA Volta GV100 12nm FinFET GPU in the comments section below.