Google says its TPU v4 supercomputer is more powerful and efficient than ever, thanks to optical circuit switching and a new architecture, and presents a challenge to Nvidia.

A new white paper from Google details the company’s use of optical circuit switches in its machine learning training supercomputer, saying that the TPU v4 system with those switches in place offers better performance and greater energy efficiency than general-purpose processors.

Google’s Tensor Processing Units — the basic building blocks of the company’s AI supercomputing systems — are essentially ASICs, meaning that their functionality is built in at the hardware level, as opposed to the general-purpose CPUs and GPUs used in many AI training systems. The white paper details how, by interconnecting more than 4,000 TPUs through optical circuit switching, Google has been able to achieve speeds 10 times faster than previous models while consuming less than half as much energy.

Aiming for AI performance, price breakthroughs

The key, according to the white paper, is the way optical circuit switching (performed here by switches of Google’s own design) enables dynamic changes to the system’s interconnect topology. Compared to a system like InfiniBand, which is commonly used in other HPC areas, Google says its system is cheaper, faster, and considerably more energy efficient.

“Two major architectural features of TPU v4 have small cost but outsized advantages,” the paper said. “The SparseCore [data flow processors] accelerates embeddings of [deep learning] models by 5x-7x by providing a dataflow sea-of-cores architecture that allows embeddings to be placed anywhere in the 128 TiB physical memory of the TPU v4 supercomputer.”

According to Peter Rutten, research vice president at IDC, the efficiencies described in Google’s paper are in large part due to the inherent characteristics of the hardware being used — well-designed ASICs are almost by definition better suited to their specific task than general-purpose processors trying to do the same thing.

“ASICs are very performant and energy efficient,” he said. “If you hook them up to optical circuit switches where you can dynamically configure the network topology, you have a very fast system.”

While the system described in the white paper is only for Google’s internal use at this point, Rutten noted that the lessons of the underlying technology could have broad applicability for machine learning training.

“I would say it has implications in the sense that it offers them a sort of best practices scenario,” he said. “It’s an alternative to GPUs, so in that sense it’s definitely an interesting piece of work.”

Google-Nvidia comparison is unclear

While Google also compared TPU v4’s performance to systems using Nvidia’s A100 GPUs, which are common HPC components, Rutten noted that Nvidia has since released the much faster H100 processors, which may shrink any performance difference between the systems.

“They’re comparing it to an older-gen GPU,” he said. “But in the end it doesn’t really matter, because it’s Google’s internal process for developing AI models, and it works for them.”
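To give a rough sense of what placing embeddings across a pod’s aggregate memory looks like from the software side, here is a minimal JAX sketch that shards an embedding table along its vocabulary axis across whatever devices are available and gathers rows from it. This is an illustrative assumption, not Google’s SparseCore implementation; the mesh axis name, table shape, and lookup function are made up for the example.

```python
# Illustrative sketch only (not Google's SparseCore): shard an embedding
# table across a mesh of accelerators in JAX so each device holds one
# slice of the table, in the spirit of spreading embeddings across a
# pod's combined memory rather than pinning them to a single chip.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D mesh over whatever devices are present (TPU, GPU, or CPU).
mesh = Mesh(np.array(jax.devices()), axis_names=("devices",))

# A (vocab, dim) embedding table, sharded along the vocabulary axis.
vocab, dim = 65_536, 128
table = jax.device_put(
    jnp.zeros((vocab, dim)),
    NamedSharding(mesh, P("devices", None)),
)

@jax.jit
def lookup(table, ids):
    # Gather rows by ID; the compiler inserts whatever cross-device
    # communication is needed to fetch rows held on other devices' shards.
    return jnp.take(table, ids, axis=0)

ids = jnp.array([3, 42, 65_535])
print(lookup(table, ids).shape)  # (3, 128)
```

Sharding along the vocabulary axis keeps each device’s slice of the table small; the trade-off is the gather traffic between devices, which is exactly the kind of communication a fast, reconfigurable interconnect is meant to absorb.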