New ML benchmarks show best algorithms for training chatbots

News Analysis
Jul 14, 2023 · 2 mins
Generative AI, Servers

In a benchmark meant to measure the performance of training machine-learning models, Nvidia came out on top.

MLCommons, a group that develops benchmarks for AI technology, has revealed the results of a new test that measures how quickly systems can train the algorithms used to create chatbots like ChatGPT.

MLPerf Training 3.0 is meant to provide an industry-standard set of benchmarks for evaluating ML model training. Model training can be a lengthy process, taking weeks or even months depending on the size of the data set. That consumes a great deal of power, so training can get expensive.

The MLPerf Training benchmark suite is a full series of tests that stress machine-learning models, software, and hardware for a broad range of applications. The latest round shows performance gains of up to 1.54x compared to just six months ago and between 33x and 49x compared to the first round in 2018.

As AI and ML have grown, MLCommons has kept updating its MLPerf Training benchmarks. The latest revision, Training version 3.0, adds testing for training large language models (LLMs), specifically GPT-3, the LLM on which ChatGPT is based. This is the first revision of the benchmark to include such testing.

All told, the test yielded 250 performance results from 16 vendors’ hardware, including systems from Intel, Lenovo and Microsoft Azure. Notably absent from the test was AMD, which has a highly competitive AI accelerator in its Instinct line. (AMD did not respond to queries as of press time.)

Also notable is that Intel did not submit its Xeon or GPU Max products and instead opted to test its Gaudi2 dedicated AI processor from Habana Labs. Intel told me it chose Gaudi2 because it is purpose-designed for high-performance, high-efficiency deep learning training and inference, and is particularly well suited to generative AI and large language models, including GPT-3.

Using a cluster of 3,584 H100 GPUs built in partnership with AI cloud startup CoreWeave, Nvidia posted a training time of 10.94 minutes. Habana Labs took 311.945 minutes, but with a much smaller system equipped with 384 Gaudi2 chips. The question then becomes: Which is the cheaper option when you factor in both acquisition and operational costs? MLCommons didn’t go into that.
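One rough way to frame that question is to multiply each system's accelerator count by its training time, which gives "chip-minutes" of hardware occupied. The short Python sketch below does that arithmetic using only the two results quoted above. The Submission class and the chip-minutes metric are illustrative assumptions of this sketch, not part of the MLPerf results, and no pricing is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """One benchmark result (hypothetical helper type, not an MLPerf structure)."""
    name: str
    accelerators: int   # number of chips in the cluster
    minutes: float      # reported time to train, in minutes

    @property
    def chip_minutes(self) -> float:
        # Total accelerator time occupied: chips multiplied by wall-clock minutes.
        return self.accelerators * self.minutes


# Figures as quoted in the article.
nvidia = Submission("Nvidia H100 (CoreWeave)", 3584, 10.94)
habana = Submission("Habana Gaudi2", 384, 311.945)

# How much longer the smaller cluster took, and how much more
# accelerator time it occupied in total.
wall_clock_ratio = habana.minutes / nvidia.minutes
chip_minute_ratio = habana.chip_minutes / nvidia.chip_minutes

for s in (nvidia, habana):
    print(f"{s.name}: {s.chip_minutes:,.2f} chip-minutes")
print(f"Wall-clock ratio (Gaudi2 / H100): {wall_clock_ratio:.1f}x")
print(f"Chip-minute ratio (Gaudi2 / H100): {chip_minute_ratio:.2f}x")
```

By this measure the Gaudi2 run took roughly 28.5 times longer on the clock but used only about three times as many chip-minutes. That is still not a cost comparison: it ignores the price of each chip, power draw, and how well either system scales to a different cluster size.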

The faster results reflect faster silicon, naturally, but also optimizations in algorithms and software. Optimized models mean faster development of models for everyone.

The benchmark results show how various configurations performed, so you can decide based on configuration and price whether the performance is a fit for your application.