Nvidia debuts massive Blackwell-powered systems

News | Mar 18, 2024 | 3 mins
CPUs and Processors | Data Center | High-Performance Computing

The DGX SuperPOD features eight or more DGX GB200 systems and can scale to tens of thousands of Nvidia Superchips.

Nvidia GB200 NVL72 system
Credit: Nvidia

Along with its new Blackwell architecture, Nvidia is unveiling new DGX systems that offer significant performance gains compared to the older generation.

There are several iterations of Nvidia’s existing DGX servers, ranging from eight Hopper processors up to 256, with prices that start at $500,000 and scale to several million dollars. Nvidia is following a similar configuration structure for the Blackwell generation, but no prices are available yet.

At the high end of the new lineup is the Nvidia GB200 NVL72, a liquid-cooled, rack-scale system built for the most compute-intensive workloads. Each DGX GB200 system features 36 Grace Blackwell Superchips, for a total of 72 Blackwell GPUs and 36 Grace CPUs, connected by the newest-generation NVLink interconnect. The platform acts as a single GPU with 1.4 exaflops of AI performance and 30TB of fast memory.
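To put those rack-level figures in perspective, here is a rough back-of-envelope split across the 72 GPUs. The per-GPU numbers are simply Nvidia's rack-scale claims divided by the GPU count, not official per-chip specifications:

```python
# Illustrative division of the rack-scale figures quoted above
# (1.4 exaflops of AI performance, 30TB of fast memory, 72 GPUs).
# These are not official per-GPU specs from Nvidia.

RACK_AI_EXAFLOPS = 1.4     # rack-scale AI performance (exaflops)
RACK_FAST_MEMORY_TB = 30   # rack-scale fast memory (TB)
GPUS_PER_RACK = 72         # Blackwell GPUs per GB200 NVL72 rack

per_gpu_petaflops = RACK_AI_EXAFLOPS * 1000 / GPUS_PER_RACK
per_gpu_memory_gb = RACK_FAST_MEMORY_TB * 1000 / GPUS_PER_RACK

print(f"~{per_gpu_petaflops:.1f} petaflops of AI compute per GPU")  # ~19.4
print(f"~{per_gpu_memory_gb:.0f} GB of fast memory per GPU")        # ~417
```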

The new DGX systems are about more than just speeds and feeds; they offer a whole new form of interchip communication, said Charlie Boyle, vice president of DGX systems at Nvidia. “On a very large AI training job … you might spend 60% of your time just talking to each other on the GPU. If I can increase that network speed dramatically by putting that over NVLink, which is a memory-based network not a traditional database network, I can get that work done much more efficiently,” he said.
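A simple Amdahl's-law-style model illustrates Boyle's point: if communication eats 60% of a job's wall-clock time, speeding up only the interconnect still delivers a large end-to-end gain. The 60% figure is from the quote above; the speedup factors below are hypothetical:

```python
# Sketch of the effect Boyle describes: accelerate only the communication
# portion of a training job and see how the whole job speeds up.
# The interconnect speedup factors (2x, 4x, 8x) are illustrative, not
# measured NVLink numbers.

def job_speedup(comm_fraction: float, comm_speedup: float) -> float:
    """End-to-end speedup when only the communication share gets faster."""
    new_time = (1 - comm_fraction) + comm_fraction / comm_speedup
    return 1 / new_time

for factor in (2, 4, 8):
    print(f"{factor}x faster interconnect -> "
          f"{job_speedup(0.60, factor):.2f}x faster job")
# 2x -> 1.43x, 4x -> 1.82x, 8x -> 2.11x
```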

A DGX rack is a 44U cabinet with 18 compute trays, nine switch trays, a pair of power distribution units, a management switch, a liquid-cooling manifold, and an NVLink backplane. Until now, DGX systems have been air-cooled; this is the first with liquid cooling, a tacit admission that these chips run hot. Boyle declined to comment on rumors that the Blackwell processor will draw more than 1,000 watts of power.
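The tray count lines up with the chip totals quoted earlier. The split of superchips per tray below is inferred from those totals rather than stated directly in Nvidia's materials:

```python
# Sanity check of the rack composition: 36 Grace Blackwell Superchips
# spread across 18 compute trays, each superchip pairing one Grace CPU
# with two Blackwell GPUs. The per-tray split is derived from the
# article's own numbers, not an official spec sheet.

COMPUTE_TRAYS = 18
SUPERCHIPS = 36
SUPERCHIPS_PER_TRAY = SUPERCHIPS // COMPUTE_TRAYS  # -> 2
CPUS = SUPERCHIPS                                  # one Grace CPU per superchip
GPUS = SUPERCHIPS * 2                              # two Blackwell GPUs per superchip

print(f"{SUPERCHIPS_PER_TRAY} superchips per compute tray")
print(f"{CPUS} Grace CPUs and {GPUS} Blackwell GPUs per rack")
```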

“It was done for efficiency and density,” said Boyle. “To get 72 GPUs in a rack and get that NVlink all together, they have to be very dense in there. We make this technology available to our OEM and ODM partners. They could choose to bring out different configurations, different densities. But for the product that I’m selling as DGX, it’s liquid cooled because of the high density in the system.”

The new version of the DGX SuperPOD with DGX GB200 systems won’t make other versions obsolete, but it does have capabilities unique to this system. For example, RAS (reliability, availability, serviceability) features are built into the chip and extend into the server, with capabilities such as predictive maintenance and constant monitoring of system health across thousands of data points.

Nvidia has also developed what it calls the DGX Ready data center program, working with its data center partners so they are prepared to host these systems, liquid cooling included, with minimal setup effort.

“When these systems ship to customers, and I believe most of these will wind up in colocation data centers, some customers do have native liquid and some are building next gen data centers, but we make it easy for customers that want to adopt this,” he said.

The new DGX systems are set to ship later this year.