Machine learning is advancing quickly, and the metrics used to gauge success struggle to keep up. In its latest round, MLPerf, the biannual machine-learning competition sometimes called “the Olympics of AI,” unveiled three new benchmark tests that reflect emerging trends in the field.
“Lately, it has been challenging trying to follow what happens in the field,” says Miro Hodak, AMD engineer and MLPerf Inference working group co-chair.
“We see that the models are becoming progressively larger, and in the last two rounds we have introduced the largest models we’ve ever had.”
Usual suspects
As IEEE Spectrum reports, the usual suspects, Nvidia, AMD, and Intel, supplied the chips that tackled these new benchmarks.
Nvidia’s new Blackwell Ultra GPU, submitted in a GB300 rack-scale design, shot to the top of the charts.
AMD showed impressive performance with submissions featuring its recent MI325X GPUs. Intel demonstrated with its Xeon submissions that inference can still be done on CPUs, but the company also entered the GPU competition with an Intel Arc Pro submission.
Last round, MLPerf introduced its largest benchmark yet, a large language model based on Llama3.1-405B. This round it outdid itself again, introducing a benchmark based on the DeepSeek R1 671B model, which has more than 1.5 times as many parameters as the previous largest.
DeepSeek R1 is a reasoning model: it works through a query in multiple steps of a chain of thought. That makes this benchmark even more demanding, because far more computation happens during inference than in ordinary LLM operation. Reasoning models are claimed to be the most accurate, which makes them the method of choice for science, math, and complex programming problems.
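To get a feel for why reasoning inflates inference cost, consider a back-of-the-envelope sketch. It uses the common rule of thumb that decoding costs roughly two FLOPs per active parameter per generated token; the chain-of-thought length below is an invented assumption, not an MLPerf figure.

```python
# Back-of-the-envelope sketch: why reasoning models cost more at inference.
# Rule of thumb: decoding costs roughly 2 FLOPs per active parameter per
# generated token. All numbers are illustrative assumptions, not MLPerf data.

ACTIVE_PARAMS = 37e9  # DeepSeek R1 activates roughly 37B of its 671B parameters per token

def decode_flops(output_tokens: int, params: float = ACTIVE_PARAMS) -> float:
    return 2 * params * output_tokens

direct = decode_flops(200)            # a plain LLM answering in ~200 tokens
reasoning = decode_flops(200 + 4000)  # the same answer after an assumed 4,000-token chain of thought

print(f"direct answer:  {direct:.2e} FLOPs")
print(f"with reasoning: {reasoning:.2e} FLOPs ({reasoning / direct:.0f}x more)")
```

The answer itself may be short; it is the hidden chain of thought that multiplies the number of generated tokens, and with it the inference compute.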
At the other end of the scale, MLPerf also introduced its smallest LLM benchmark yet, based on Llama3.1-8B.
According to Taran Iyengar, chair of the MLPerf Inference task force, low-latency but high-accuracy reasoning is increasingly in demand in the industry.
Small LLMs can provide this and are an excellent option for edge applications and text summarisation tasks.
This results in a bewildering total of four LLM-based benchmarks: the new, smaller Llama3.1-8B benchmark; the preexisting Llama2-70B benchmark; the Llama3.1-405B benchmark introduced last round; and the largest, the new DeepSeek R1 benchmark. This suggests that LLMs are here to stay, at the very least.
In addition to the many LLMs, this round of MLPerf Inference included a new voice-to-text benchmark based on Whisper-large-v3.
This benchmark reflects the growing number of voice-enabled applications, whether speech-based AI interfaces or smart devices.
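For a sense of what the benchmark exercises, here is a minimal transcription sketch using the open-source whisper package (installed with `pip install openai-whisper`, which also requires ffmpeg); the audio filename is a placeholder.

```python
# Minimal sketch: transcription with the open-source whisper package.
# Assumes `pip install openai-whisper` plus ffmpeg; "meeting.wav" is a placeholder file.
import whisper

model = whisper.load_model("large-v3")    # the same model the new benchmark is based on
result = model.transcribe("meeting.wav")  # returns a dict with the text and timed segments
print(result["text"])
```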
Categories of MLPerf Inference competition
The MLPerf Inference competition is divided into two main categories: “open,” which permits some model modifications, and “closed,” which mandates using the reference neural network model exactly as is. Within those, several subcategories pertain to the infrastructure type and test methodology.
However, this article will concentrate on the “closed” datacenter server results.
Nobody was surprised to learn that an Nvidia GPU-based system had the best performance per accelerator on every benchmark, at least in the “server” category.
Additionally, Nvidia introduced the Blackwell Ultra, which topped the rankings for the two biggest benchmarks: DeepSeek R1 reasoning and Llama3.1-405B.
Blackwell Ultra is a more potent version of the Blackwell architecture, with significantly larger memory capacity, twice the acceleration for attention layers, 1.5 times more AI compute, and faster memory and connectivity than standard Blackwell. It is aimed at the most demanding AI workloads, such as the two benchmarks it was tested on.
Dave Salvator, Nvidia’s director of accelerated computing products, credits Blackwell Ultra’s success to the hardware enhancements plus two key changes.
The first is NVFP4, Nvidia’s proprietary 4-bit floating-point format. Salvator claims that while utilising much less processing power, “we can deliver comparable accuracy to formats like BF16.”
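NVFP4’s internals are Nvidia’s own, but the general shape of block-scaled 4-bit floating point is easy to illustrate. The sketch below quantises values to the E2M1 grid with one shared scale per block; the 16-element block size and scale handling are assumptions for illustration, not Nvidia’s actual scheme.

```python
# Generic sketch of block-scaled 4-bit float quantisation, in the spirit of
# formats like NVFP4. Block size and scale handling are illustrative
# assumptions; Nvidia's actual format and kernels are proprietary.
import numpy as np

# Magnitudes representable by an E2M1 (1 sign, 2 exponent, 1 mantissa bit) float
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x: np.ndarray) -> np.ndarray:
    """Quantise one block to FP4 values sharing a single scale factor."""
    scale = np.abs(x).max() / FP4_GRID[-1]  # map the block's largest magnitude to 6.0
    if scale == 0.0:
        return x.copy()
    nearest = np.abs(np.abs(x)[:, None] / scale - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[nearest] * scale  # dequantised approximation

block = np.random.randn(16).astype(np.float32)  # assumed 16-element block
print("max abs error:", np.abs(block - quantize_block(block)).max())
```

Each 4-bit value can take only 16 distinct codes, so the shared per-block scale is what lets the format track the dynamic range of real weights and activations.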
Disaggregated serving
The second is “disaggregated serving.” The idea is to split the inference workload into two primary stages: prefill, which loads the query (“Please summarise this report.”) and its full context window (the report) into the LLM, and generation/decoding, which actually computes the output.
The requirements for these two stages are different. Prefill requires a lot of computation, but generation and decoding rely far more on memory bandwidth. Salvator claims that Nvidia attains a performance boost of almost 50% by allocating distinct sets of GPUs to the two different stages.
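In code, the idea boils down to two worker pools linked by a handoff of the prompt’s computed state. Here is a minimal single-process sketch under invented names; real systems run the two pools on separate GPUs and ship the KV cache between them over fast interconnect.

```python
# Minimal sketch of disaggregated serving: a prefill pool (compute-bound)
# feeds a decode pool (memory-bandwidth-bound). Names and the queue handoff
# are illustrative assumptions; real systems move the KV cache between GPUs.
import queue
import threading
import time

prefill_queue: queue.Queue = queue.Queue()
decode_queue: queue.Queue = queue.Queue()

def prefill_worker() -> None:
    while True:
        prompt = prefill_queue.get()
        kv_cache = f"<KV cache for {prompt!r}>"  # stand-in for the real prefill pass
        decode_queue.put((prompt, kv_cache))     # hand off to the decode pool

def decode_worker() -> None:
    while True:
        prompt, kv_cache = decode_queue.get()
        print(f"decoding answer to {prompt!r} from {kv_cache}")  # token-by-token in reality

threading.Thread(target=prefill_worker, daemon=True).start()
threading.Thread(target=decode_worker, daemon=True).start()
prefill_queue.put("Please summarise this report.")
time.sleep(0.1)  # let the daemon threads finish for this demo
```

Because the two stages stress different resources, each pool can be sized and provisioned for its own bottleneck instead of compromising on one configuration for both.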
AMD’s newest accelerator chip, the MI355X, was released in July, and the company submitted it only in the “open” category, which permits software modifications to the model.
Like the Blackwell Ultra, the MI355X supports 4-bit floating point and has increased high-bandwidth memory. In the open Llama2-70B benchmark, the MI355X outperformed its predecessor, the MI325X, by a factor of 2.7, according to Mahesh Balasubramanian, senior director of data centre GPU product marketing at AMD.
AMD’s “closed” submissions included systems with MI300X and MI325X GPUs. The more advanced MI325X machines performed comparably to systems built with Nvidia H200s on the Llama2-70B benchmark, as well as on the mixture-of-experts and image-generation tests.
This round also featured the first hybrid submission, which used both AMD MI300X and MI325X GPUs for the same inference task, on the Llama2-70B benchmark. Hybrid use matters because new GPUs arrive every year while the older, widely deployed models are not going away, so being able to distribute workloads among different GPU types is an essential step (a simple version of the idea is sketched below).
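One straightforward way to share a workload across mixed GPU generations is to weight request assignment by each model’s throughput. The sketch below does exactly that; the throughput figures are invented for illustration, and this is not AMD’s actual scheduling method.

```python
# Illustrative sketch: splitting requests across mixed GPU generations in
# proportion to throughput. The numbers are invented; this is not AMD's
# actual scheduling method.
THROUGHPUT = {"MI300X": 1.0, "MI325X": 1.4}  # assumed relative tokens/second

def assign_requests(num_requests: int) -> dict[str, int]:
    """Split a batch of requests across GPU types, weighted by throughput."""
    total = sum(THROUGHPUT.values())
    shares = {gpu: round(num_requests * t / total) for gpu, t in THROUGHPUT.items()}
    # Absorb any rounding drift into the fastest GPU so every request is assigned
    shares[max(THROUGHPUT, key=THROUGHPUT.get)] += num_requests - sum(shares.values())
    return shares

print(assign_requests(1000))  # {'MI300X': 417, 'MI325X': 583}
```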
Intel has maintained that machine learning can be done without a GPU. Indeed, submissions using Intel’s Xeon CPU performed on par with the Nvidia L4 on the object detection benchmark, though they trailed on the recommender system benchmark.
This round also marked the first appearance of an Intel GPU: the Intel Arc Pro, which debuted in 2022.
The MaxSun Intel Arc Pro B60 Dual 48G Turbo graphics card, which has two GPUs and 48 gigabytes of memory, was included in the MLPerf submission.
The system trailed Nvidia’s L40S on the Llama2-70b benchmark but performed similarly on the small LLM benchmark.