NVIDIA has released the first MLPerf 4.1 results for its Blackwell B200 processor. According to these results, the Blackwell GPU offers up to four times the performance of the Hopper-based H100. However, there are some important caveats to keep in mind when evaluating these numbers. Here are the details…
NVIDIA Blackwell B200: up to 4x faster than the H100
According to NVIDIA’s results, the Blackwell-based B200 GPU delivers 10,755 tokens per second in the server test and 11,264 tokens per second in the offline test of the MLPerf Llama 2 70B benchmark. That is roughly what four H100 GPUs deliver together, which backs NVIDIA’s claim that Blackwell is 3.7 to 4 times faster than the H100 on a per-GPU basis.
Part of this performance increase, however, comes from Blackwell’s use of FP4 (4-bit floating point) precision, which its fifth-generation Tensor Cores support, whereas the H100 goes no lower than FP8 (8-bit floating point). FP4 offers twice the peak throughput of FP8, which plays a significant role in Blackwell’s result.
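To give a feel for how coarse 4-bit floating point is, the short sketch below enumerates every value an FP4 element can represent, assuming the common E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit, bias 1). NVIDIA has not published the exact encoding behind this benchmark run, so treat this as an illustration rather than a description of Blackwell’s hardware.

```python
# Sketch: all values representable by a 4-bit float in the E2M1 layout
# (1 sign, 2 exponent, 1 mantissa bit, exponent bias 1). This layout is an
# assumption for illustration, not taken from NVIDIA's MLPerf submission.

def e2m1_values():
    values = set()
    for sign in (1.0, -1.0):
        for exp in range(4):          # 2 exponent bits -> 0..3
            for man in range(2):      # 1 mantissa bit  -> 0..1
                if exp == 0:          # subnormal: (mantissa/2) * 2^(1 - bias)
                    v = (man / 2) * 2 ** (1 - 1)
                else:                 # normal: (1 + mantissa/2) * 2^(exp - bias)
                    v = (1 + man / 2) * 2 ** (exp - 1)
                values.add(sign * v)
    return sorted(values)

print(e2m1_values())
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Fifteen representable values, versus the few hundred an 8-bit float can encode, is why FP4 requires careful scaling of the weights and activations, and also why the hardware can push FP4 math through at twice the FP8 rate.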
It’s also worth noting that NVIDIA is comparing a single B200 GPU against four H100 GPUs. Multi-GPU setups lose some per-GPU throughput to scaling overhead, so a single GPU usually looks better on a per-GPU basis, and it’s hard to call this comparison entirely fair.
In addition, MLPerf 4.1 lists no single-GPU result for the H100; the closest single-GPU comparison is the H200, which generates 4,488 tokens per second in the offline test, meaning the B200 is only about 2.5 times faster. Memory capacity and bandwidth also play a big role in these differences.
The B200 GPU tested carries 180GB of HBM3E memory, while the H100 SXM carries 80GB of HBM (up to 96GB in some configurations) and the H200 comes with 96GB of HBM3 or up to 144GB of HBM3E.
GPU | # of GPUs | Offline (tokens/s) | Server (tokens/s) | Per-GPU Offline | Per-GPU Server
Nvidia B200 180GB HBM3E | 1 | 11,264 | 10,755 | 11,264 | 10,755
Nvidia H100 80GB HBM3 | 4 | 10,700 | 9,522 | 2,675 | 2,381
Nvidia H200 141GB HBM3E | 1 | 4,488 | 4,202 | 4,488 | 4,202
Nvidia H200 141GB HBM3E | 8 | 32,124 | 29,739 | 4,016 | 3,717
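For readers who want to check the ratios quoted above, here is a quick sketch that recomputes the per-GPU columns and the headline speedups from the published tokens-per-second figures; the numbers are copied from the table, and the labels are just convenient names.

```python
# Recompute per-GPU throughput and speedup ratios from the MLPerf 4.1
# Llama 2 70B figures in the table above (tokens per second).
results = {
    # label: (gpu_count, offline_tok_s, server_tok_s)
    "B200 x1": (1, 11264, 10755),
    "H100 x4": (4, 10700, 9522),
    "H200 x1": (1, 4488, 4202),
    "H200 x8": (8, 32124, 29739),
}

per_gpu = {name: (off / n, srv / n) for name, (n, off, srv) in results.items()}

for name, (off, srv) in per_gpu.items():
    print(f"{name}: {off:,.0f} offline / {srv:,.0f} server per GPU")

b200_off, b200_srv = per_gpu["B200 x1"]
print(f"B200 vs per-GPU H100 (offline): {b200_off / per_gpu['H100 x4'][0]:.1f}x")  # ~4.2x
print(f"B200 vs single H200 (offline):  {b200_off / per_gpu['H200 x1'][0]:.1f}x")  # ~2.5x
```

Run as-is, it reproduces the per-GPU columns of the table along with the roughly 4x and 2.5x figures discussed above.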
For now, NVIDIA has only shared the Blackwell B200’s performance in the generative AI benchmark on the Llama 2 70B model in MLPerf 4.1. It is unclear why results for the other tests have not been shared; they may simply still be in progress.
What do you think? You can write your opinions in the comments section below.