Today NVIDIA announced its new Ampere architecture, along with the new A100 GPU built on it. It's a significant step up from Turing, itself an AI-centric architecture that powers high-end data centers and ML-powered ray tracing in consumer graphics.
If you want a full collection of all the very technical details, you can read NVIDIA's in-depth architecture overview. We will break down the most important things.
The new die is absolutely huge
Right out of the gate, they go all out with this new chip. The previous-generation Tesla V100 was already a massive 815 mm² chip with 21.1 billion transistors; the A100 packs 54 billion transistors onto an 826 mm² die built on TSMC's 7 nm process.
This new GPU has 19.5 teraflops of FP32 performance, 6,912 CUDA cores, 40 GB of memory, and 1.6 TB/s of memory bandwidth. In one fairly specific workload (sparse INT8), the A100 actually cracks one peta-operations per second of raw compute. That is of course INT8, but the card is still very powerful.
Then, as with the V100, they took eight of these GPUs and built a mini-supercomputer, which they sell for $200,000. You'll likely see them coming to cloud providers like AWS and Google Cloud Platform soon.
Unlike the V100, however, each of these is not one monolithic GPU: an A100 can be partitioned into up to seven separate GPU instances that can be virtualized and rented out for different tasks, with far higher memory throughput to boot.
As for what all those transistors buy you, the new chip is much faster than the V100. For AI training and inference, the A100 delivers a 6x speedup for FP32, 3x for FP16, and an effective 7x speedup when all eight GPUs work together.
Note that the V100 highlighted in the second graph is the 8 GPU V100 server, not a single V100.
NVIDIA also promises up to 2x acceleration in many HPC workloads:
As for the raw TFLOPS numbers, the A100's double-precision FP64 performance is 20 TFLOPS, versus 8 TFLOPS for the V100.
TensorFloat-32: a new number format optimized for Tensor Cores
With Ampere, NVIDIA introduces a new number format designed to replace FP32 in some workloads. FP32 essentially uses 8 bits for the number's range (how large or small it can be) and 23 bits for its precision, plus one sign bit.
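To make those bit fields concrete, here's a short Python sketch that pulls the sign, exponent, and mantissa out of a float's IEEE-754 single-precision encoding (the function name is ours, just for illustration):

```python
import struct

def fp32_fields(x: float):
    """Split a float's IEEE-754 single-precision encoding into its
    sign bit, 8 exponent (range) bits, and 23 mantissa (precision) bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # 8 range bits, biased by 127
    mantissa = bits & 0x7FFFFF      # 23 precision bits
    return sign, exponent, mantissa

print(fp32_fields(1.5))  # (0, 127, 4194304): exponent 127 encodes 2^0
```

The value 1.5 is 1.1 in binary, so only the top mantissa bit is set, and the biased exponent of 127 corresponds to a scale factor of 2⁰.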
NVIDIA claims that those 23 precision bits aren't fully necessary for many AI workloads, and that you can get similar results and better performance from just 10 of them. The new format is called TensorFloat-32 (TF32). On top of the die shrink and the increased core count, this is how they claim the huge 6x speedup in AI training.
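You can get a feel for what TF32 gives up with a small Python sketch that cuts an FP32 value down to 10 mantissa bits (the hardware rounds rather than truncates, so this only approximates the precision loss; the helper name is ours):

```python
import struct

def to_tf32(x: float) -> float:
    """Approximate TF32 precision by zeroing the low 13 mantissa bits,
    keeping 10 precision bits and all 8 range bits of the FP32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~0x1FFF  # clear the 13 lowest mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(to_tf32(3.14159265))  # 3.140625: only about 3 decimal digits survive
```

Because all 8 exponent bits are kept, TF32 covers the same numeric range as FP32; only the fine-grained precision shrinks, which is exactly the trade-off NVIDIA argues most training workloads can tolerate.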
They claim that “users don't have to make any code changes, because TF32 only runs inside the A100 GPU. TF32 operates on FP32 inputs and produces results in FP32. Non-tensor operations continue to use FP32.” That means it should be a drop-in replacement for workloads that don't need the extra precision.
If you compare FP32 performance on the V100 with TF32 performance on the A100, you can see where those massive speedups come from: TF32 is up to ten times faster. Of course, much of that is down to Ampere's other improvements, which roughly double performance across the board, so it isn't an apples-to-apples comparison.
They have also introduced a concept called fine-grained structured sparsity, which boosts the compute performance of deep neural networks. The idea is that some weights matter less than others, so the matrix math can be compressed to improve throughput. Throwing away data might not sound like a great idea, but NVIDIA claims it doesn't hurt the accuracy of the trained network for inferencing; it just speeds it up.
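Ampere's structured sparsity uses a 2:4 pattern: in every group of four consecutive weights, two are kept and two are forced to zero. Below is a minimal pure-Python sketch of magnitude-based 2:4 pruning (the function name is ours; real hardware also stores index metadata so the tensor cores can skip the zeros):

```python
def prune_2_of_4(weights):
    """Fine-grained structured (2:4) pruning: in every group of 4
    consecutive weights, keep the 2 with the largest magnitude and
    zero the other 2, so hardware can skip half the multiplies."""
    assert len(weights) % 4 == 0
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the two largest-magnitude entries in this group
        keep = sorted(range(4), key=lambda j: abs(group[j]))[-2:]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned

print(prune_2_of_4([0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]))
# [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.25, 0.0]
```

Because exactly half of each group is zero, the compressed matrix multiply can run at up to twice the dense throughput, which is where the doubled "sparse" peak numbers in NVIDIA's spec sheet come from.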
For sparse INT8 calculations, the peak performance of a single A100 is 1,250 TOPS, an astonishingly high number. It will of course be hard to find a real workload that runs purely on sparse INT8, but ops are ops.