With the increasing demand for HPC and AI-powered cloud applications, there is a need for very powerful data center GPUs. Usually NVIDIA is king of this field, but AMD’s latest MI100 GPU offers some serious competition.
A card for the HPC market
The card is fast, seriously fast. NVIDIA’s high-end A1
NVIDIA’s cards, of course, support other acceleration techniques for AI-specific workloads in different number formats, such as the TensorFloat-32 precision format and fine-grained structured sparsity. For AI and machine learning workloads, NVIDIA is still king, as its cards are built specifically for tensor-based operations.
But for general purpose high-performance computing, the MI100 takes the cake in raw computing power. Plus, it’s nearly half the price and much more efficient per watt.
The new architecture also improves mixed-precision performance, with AMD’s “Matrix Core” technology delivering 7x better FP16 performance than its previous-generation cards.
AMD CPUs and Instinct GPUs both power two of the US Department of Energy’s exascale supercomputers. The “Frontier” supercomputer is expected to be built next year with current Epyc CPUs and MI100s, and will deliver more than 1.5 exaflops of peak processing power. The “El Capitan” supercomputer is expected to be built on next-generation hardware in 2023 and will deliver more than 2 exaflops of double-precision performance.
Can ROCm compete with CUDA?
All this power is of course useless if the software doesn’t support it. It’s no secret that NVIDIA has managed to make machine learning a bit of a walled garden.
NVIDIA’s compute framework is called CUDA, or Compute Unified Device Architecture. It is proprietary and only works with NVIDIA’s cards. But since those cards have historically been the fastest, many applications are built with CUDA support first.
There are cross-platform programming models, most notably OpenCL, which AMD supports very well through its ROCm platform. Both NVIDIA and AMD cards support OpenCL, but since NVIDIA only supports it by transpiling to CUDA, OpenCL is actually slower on an NVIDIA card. As a result, not all applications support it.
Ultimately, you will need to do your own research to see whether the application you want to use can run on AMD cards, and you should be prepared for some tinkering and bug fixing. NVIDIA GPUs, on the other hand, are usually plug-and-play, so even if AMD is faster, NVIDIA can continue to hold it back with closed-source software.
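As a first step in that research, it can help to see which vendor tooling is installed on a machine at all. The following is a minimal sketch, assuming the usual command-line utilities are on the PATH (`rocminfo` and `rocm-smi` for ROCm, `nvidia-smi` for NVIDIA). The names `VENDOR_TOOLS` and `find_gpu_stacks` are made up for this illustration, and finding a tool says nothing about whether a particular application will actually run.

```python
import shutil

# Command-line tools that normally ship with each vendor's driver stack.
# The dictionary keys are illustrative labels, not official names.
VENDOR_TOOLS = {
    "amd_rocm": ["rocminfo", "rocm-smi"],
    "nvidia_cuda": ["nvidia-smi"],
}


def find_gpu_stacks(which=shutil.which):
    """Return the vendor stacks whose tools are found on PATH.

    `which` is injectable so the lookup can be exercised without a GPU.
    """
    found = {}
    for stack, tools in VENDOR_TOOLS.items():
        # Keep only the tools that resolve to an executable.
        present = [tool for tool in tools if which(tool) is not None]
        if present:
            found[stack] = present
    return found
```

Calling `find_gpu_stacks()` with no arguments checks the real PATH and returns an empty dictionary when neither stack is installed.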
However, this situation is getting better: AMD is committed to open source and to creating an open ecosystem. TensorFlow and PyTorch, two very popular ML frameworks, both support ROCm.
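For PyTorch in particular, one quick way to tell which backend an installed build targets is to look at its version metadata: ROCm builds set `torch.version.hip`, while CUDA builds set `torch.version.cuda`. The sketch below assumes nothing beyond those two attributes. The helper names `classify_build` and `detect_backend` are hypothetical, and the code returns "missing" instead of failing when PyTorch is not installed.

```python
import importlib.util


def classify_build(hip_version, cuda_version):
    """Map PyTorch's version strings to a backend label."""
    if hip_version:
        return "rocm"
    if cuda_version:
        return "cuda"
    return "cpu"


def detect_backend():
    """Return 'rocm', 'cuda', 'cpu', or 'missing' for the local PyTorch."""
    if importlib.util.find_spec("torch") is None:
        return "missing"
    import torch

    # getattr guards against builds that lack one of the attributes.
    return classify_build(
        getattr(torch.version, "hip", None),
        getattr(torch.version, "cuda", None),
    )
```

Note that ROCm builds of PyTorch expose the GPU through the same `torch.cuda` API, so code that calls `torch.cuda.is_available()` generally works unchanged on supported AMD cards.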
Hopefully, the raw specifications of AMD’s latest offering can push the industry into a more competitive environment. After all, they are used in supercomputers.