The Splunk buy and the newly announced partnership with Nvidia are part of Cisco’s efforts to build data-center network infrastructure that supports AI/ML workloads.
It wasn’t that long ago that ideas about revamping data-center networking operations to handle AI workloads would have been confined to a whiteboard. But conditions have changed drastically in the past year.
“AI and ML were on the radar, but the past 18 months or so have seen significant investment and development – especially around generative AI. What we expect in 2024 is more enterprise data-center organizations will use new tools and technologies to drive an AI infrastructure that will let them get more data, faster, and with better insights from the data sources,” said Kevin Wollenweber, senior vice president and general manager of Cisco’s networking, data center and provider connectivity organization. Enterprises also will be able to “better handle the workloads that entails,” he said.
A flurry of recent Cisco activity attests to AI’s growth at the enterprise level.
Cisco’s $28 billion Splunk acquisition, which closed this week, is expected to drive AI advancements across Cisco’s security and observability portfolios, for example. And Cisco’s newly inked agreement with Nvidia will yield integrated software and networking hardware that promises to help customers more easily spin up infrastructure to support AI applications.
As part of the partnership, Nvidia’s newest Tensor Core GPUs will be available in Cisco’s M7 Unified Computing System (UCS) rack and blade servers, including UCS X-Series and UCS X-Series Direct, to support AI and data-intensive workloads in the data center and at the edge, the companies stated. The integrated package will include Nvidia AI Enterprise software, which features pretrained models and development tools for production-ready AI.
“The Nvidia alliance is actually an engineering partnership, and we are building solutions together with Nvidia to make it easier for our customers – enterprises and service providers – to consume AI technology,” Wollenweber said. The technologies they deliver will enable AI productivity and will include toolsets to build, monitor and troubleshoot the fabrics so they run as efficiently as possible, Wollenweber said. “Driving this technology into the enterprise is where this partnership will grow in the future.”
AI accelerates network investments
Greater network capacity will be a requirement for AI deployments, industry watchers note.
According to research firm IDC, revenues in the data-center portion of the Ethernet switching market rose 13.6% in 2023 as enterprises and service providers required ever-faster Ethernet switches to support rapidly maturing AI workloads. “To illustrate this point, revenues for 200/400 GbE switches rose 68.9% for the full year in 2023,” IDC analyst Brandon Butler said in a Network World article.
“The Ethernet switching market in 2023 was dominated by the impact of AI, with the overall market rising 20.1% in 2023 to reach $44.2 billion,” Butler said.
The Dell’Oro Group also wrote recently about how AI networks will accelerate the transition to higher speeds. “For example, 800 Gbps is expected to comprise the majority of the ports in AI back-end networks by 2025, within just two years of the latest 800 Gbps product introduction,” wrote Sameh Boujelbene, vice president at Dell’Oro Group.
“While most of the market demand will come from Tier 1 Cloud Service Providers, Tier 2/3 and large enterprises are forecast to be significant, approaching $10 B over the next five years. The latter group will favor Ethernet,” Boujelbene stated.
Ethernet as a technology gets tons of investment and evolves quickly, Wollenweber said. “We’ve gone from 100G to 400G to 800G, and now we’re building 1.6 terabit Ethernet, and it’s also the predominant networking technology for the rest of the data center,” he said.
The 650 Group reported this week that networking speeds will continue to increase at a rapid pace to keep up with AI and machine learning (ML) workloads. Early 2024 demonstrations of 1.6 terabit Ethernet (1.6 TbE) show that Ethernet is keeping pace with AI/ML networking requirements, and 650 Group projects that 1.6 TbE solutions will be the dominant port speed by 2030.
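As a rough illustration of why those port speeds matter: exchanging the gradients of a 70-billion-parameter model trained in 16-bit precision means moving roughly 140 GB, or about 1,120 gigabits, per synchronization step. On a single 400 Gbps link that transfer alone takes around 2.8 seconds; at 800 Gbps it drops to about 1.4 seconds. Because the exchange recurs every training iteration, back-end AI fabrics push toward the fastest ports available. (The model size and precision here are illustrative assumptions, not figures from the article.)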
Ethernet plus AI
Ethernet is the foundation of most enterprise data-center networks today, so when enterprises want to add GPU-based systems for AI workloads, it makes sense to stick with it: IT and engineering staffs already understand the technology, can get consistent performance from it, and can integrate AI compute nodes into the fabrics they run, Wollenweber said.
“An AI/ML workload or job – such as for different types of learning that use large data sets – may need to be distributed across many GPUs as part of an AI/ML cluster to balance the load through parallel processing,” Wollenweber wrote in a blog about AI networking.
“To deliver high-quality results quickly – particularly for training models – all AI/ML clusters need to be connected by a high-performance network that supports non-blocking, low-latency, lossless fabric,” Wollenweber wrote. “While less compute-intensive, running AI inferencing in edge data centers will also involve requirements on network performance, scale and latency control to help quickly deliver real-time insights to a large number of end-users.”
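To make that distribution concrete, here is a minimal data-parallel training sketch, assuming PyTorch with CUDA GPUs and the NCCL backend (the framework and launcher are illustrative assumptions; the article does not name any). The gradient all-reduce in the backward pass is the traffic that the lossless, low-latency fabric described above has to carry.

```python
# Minimal data-parallel training sketch (illustrative; the article does not
# prescribe a framework). Assumes PyTorch with CUDA GPUs and NCCL.
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = torch.nn.Linear(1024, 1024).to(device)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 1024, device=device)
        loss = ddp_model(x).sum()
        optimizer.zero_grad()
        # backward() triggers an all-reduce of gradients across every GPU
        # in the job; this is the traffic the cluster fabric carries.
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

In a real cluster, NCCL carries those all-reduces over whatever fabric the hosts expose, which is where the network design Wollenweber describes comes in.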
Wollenweber cited the remote direct memory access (RDMA) over Converged Ethernet (RoCE) network protocol as a means to improve throughput and lower latency for compute and storage traffic; RoCEv2 lets one host access memory on a remote host directly, without involving the remote CPU.
“Ethernet fabrics with RoCEv2 protocol support are optimized for AI/ML clusters with widely adopted standards-based technology, easier migration for Ethernet-based data centers, proven scalability at lower cost-per-bit, and designed with advanced congestion management to help intelligently control latency and loss,” Wollenweber wrote.
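On the host side, RoCE-capable NICs surface through the Linux kernel’s standard RDMA sysfs tree. A small sketch, assuming a Linux machine with RDMA drivers loaded (these are generic kernel interfaces, not a Cisco tool), that reports whether each RDMA port runs over Ethernet (RoCE) or native InfiniBand:

```python
# List RDMA devices and whether each port's link layer is Ethernet (RoCE)
# or native InfiniBand, by reading the standard Linux RDMA sysfs tree.
# Assumes a Linux host with RDMA drivers loaded; not a Cisco-specific tool.
from pathlib import Path

RDMA_SYSFS = Path("/sys/class/infiniband")

def list_rdma_ports():
    if not RDMA_SYSFS.is_dir():
        print("no RDMA devices found (is the RDMA subsystem loaded?)")
        return
    for dev in sorted(RDMA_SYSFS.iterdir()):
        for port in sorted((dev / "ports").iterdir()):
            # "Ethernet" here means the port is doing RoCE;
            # "InfiniBand" means a native IB fabric.
            link_layer = (port / "link_layer").read_text().strip()
            state = (port / "state").read_text().strip()
            print(f"{dev.name} port {port.name}: {link_layer}, {state}")

if __name__ == "__main__":
    list_rdma_ports()
```

On a RoCE-enabled host, each RDMA port would report a link layer of Ethernet.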
Cisco’s AI infrastructure
Customers will need better operational tools to schedule AI/ML workloads across GPUs more efficiently. In Cisco’s case, those tools include its Nexus Dashboard.
“How do we actually make it simpler and easier for customers to tune these Ethernet networks and connect this massive amount of compute as efficiently as possible? That’s what we are looking at,” Wollenweber said.
Cisco’s recent spate of news builds on earlier work to shape its AI data-center direction. Last summer, for example, Cisco published a blueprint defining how organizations can use existing data-center Ethernet networks to support AI workloads.
A core component of that blueprint is its Nexus 9000 data-center switches, which “have the hardware and software capabilities available today to provide the right latency, congestion management mechanisms, and telemetry to meet the requirements of AI/ML applications,” Cisco wrote in its Data Center Networking Blueprint for AI/ML Applications. “Coupled with tools such as Cisco Nexus Dashboard Insights for visibility and Nexus Dashboard Fabric Controller for automation, Cisco Nexus 9000 switches become ideal platforms to build a high-performance AI/ML network fabric.”
Another element of Cisco’s AI network infrastructure is its high-end programmable Silicon One processors, which are aimed at large-scale AI/ML infrastructures for enterprises and hyperscalers.