How will AI affect networks within and between data centers? Ciena’s Brian Lavallée explains the importance of high-performance networks in realizing AI's full potential.
Although there’s a significant amount of hype surrounding Artificial Intelligence (AI), there’s no debate that AI is real and is already reshaping a wide range of industries, driving innovation and efficiency to previously unimaginable levels. However, like the disruptive technologies that preceded it, such as steam engines, electricity, and the internet, AI brings its own unique challenges and opportunities.
Figure 1: Artificial Intelligence (AI), a new technology inflection point
AI infrastructure challenges lie in cost-effectively scaling storage, compute, and network infrastructure while also addressing massive increases in energy consumption and long-term sustainability. To better understand these challenges, let’s start with how AI impacts networks within the data centers hosting AI infrastructure and then work our way outward to where data centers are interconnected over increasingly longer distances.
Intra-data center networks for Artificial Intelligence
AI was born in the data centers that host the traditional cloud services we use daily in our business and personal lives. However, AI storage, compute, and network requirements quickly became too complex and demanding for traditional cloud infrastructure to handle use cases like Large Language Model (LLM) training, the technology underlying Generative AI (GenAI) applications such as the widely popular ChatGPT. Traditional cloud infrastructure succeeds by being cost-effective, flexible, and scalable, attributes that are also essential for AI infrastructure. However, AI demands a new and more extensive range of network performance requirements, as shown in Figure 2. Today, AI infrastructure technology is mostly closed and proprietary, but the industry has rallied to form new standardization groups, such as the Ultra Ethernet Consortium (UEC) and the Ultra Accelerator Link (UALink) Promoter Group, to create a broader technology ecosystem that drives faster innovation and leverages a more secure, multi-vendor supply chain.
Figure 2: Comparison of traditional cloud and AI infrastructure requirements
AI applications, such as LLM training leveraging Deep Learning (DL) and artificial neural networks, involve moving massive amounts of data within a data center over short, high-bandwidth, low-latency networks operating at 400Gb/s and 800Gb/s today, and at 1.6Tb/s and higher in the future. Just as customized AI-specific processors, including Central Processing Units (CPUs) and Graphics Processing Units (GPUs), are being developed, network technology innovation is also required to fully optimize AI infrastructure. This includes advances in optical transceivers, Optical Circuit Switches (OCS), co-packaged modules, Network Processing Units (NPUs), standards-based UEC and UALink platforms, and other networking technologies.
Figure 3: AI is enabled by high-performance networks within and between data centers
Although these network technology advancements will address AI performance challenges, the massive associated space and energy requirements will lead to many more data centers being constructed and interconnected. The different distances within and between these data centers will require different network solutions.
AI campus networks
A single modern GPU, the foundational element of AI compute clusters, can consume as much as 1,000 watts, so when tens to hundreds of thousands (or more) are interconnected for purposes like LLM training, the associated energy consumption becomes a monumental challenge for data center operators. New AI infrastructure will rapidly consume the energy and space available within existing data centers. This will lead to new data centers being built as a “campus,” where individual data centers are separated by less than 10 kilometers to minimize latency and improve AI application performance. Campuses will need to be located near sources of energy that are reliable, sustainable, and cost-effective. Campus data centers will be connected to each other, and to distant data centers, using optics optimized for specific cost, power, bandwidth, latency, and distance requirements.
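To put those numbers in perspective, here is a rough back-of-envelope sketch in Python. The ~1,000-watt-per-GPU figure comes from the paragraph above; the cluster sizes and the ~5 microseconds-per-kilometer fiber propagation figure are illustrative assumptions, not vendor specifications.

```python
# Back-of-envelope sketch (illustrative assumptions, not vendor figures):
# estimate GPU cluster power draw and campus fiber propagation latency.

GPU_POWER_W = 1_000          # ~1 kW per modern GPU, per the paragraph above
FIBER_LATENCY_US_PER_KM = 5  # light in fiber travels roughly 5 us per km

def cluster_power_mw(num_gpus: int) -> float:
    """Total GPU power draw in megawatts (GPUs only, excluding cooling and networking)."""
    return num_gpus * GPU_POWER_W / 1e6

def campus_one_way_latency_us(distance_km: float) -> float:
    """One-way fiber propagation delay in microseconds."""
    return distance_km * FIBER_LATENCY_US_PER_KM

for gpus in (10_000, 100_000):
    print(f"{gpus:>7,} GPUs -> ~{cluster_power_mw(gpus):.0f} MW of GPU power alone")

print(f"A 10 km campus link adds ~{campus_one_way_latency_us(10):.0f} us of one-way propagation delay")
```

Even before cooling and networking overhead, a 100,000-GPU cluster lands in the range of 100 MW, while keeping campus links under 10 km holds propagation delay to tens of microseconds.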
Data Center Interconnection (DCI) networks
As AI infrastructure is hosted in new and existing data centers, those data centers will need to be interconnected, just as they are interconnected today for traditional cloud services. This will be achieved using similar optical transport solutions, albeit at higher rates, including 1.6Tb/s, an industry first enabled by Ciena’s WaveLogic™ 6 technology. How much new traffic are we talking about? According to a recent analysis from research firm Omdia, monthly AI-enriched network traffic is forecast to grow at a Compound Annual Growth Rate (CAGR) of approximately 120% from 2023 to 2030, as shown in Figure 4. That is a lot of additional traffic for global networks to carry going forward.
Figure 4: Monthly AI-enriched network traffic growth forecast from 2023 – 2030 (source: Omdia)
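The shape of that curve is easier to appreciate with a little compounding arithmetic. The sketch below simply compounds a 120% CAGR year over year, with 2023 traffic normalized to 1.0; it is not a reproduction of Omdia’s model, just an illustration of how quickly such a growth rate stacks up (roughly 250x by 2030 under this simple compounding).

```python
# Illustrative only: what a ~120% CAGR means when compounded from 2023 to 2030.
# Traffic is normalized to 1.0 in 2023; Omdia's absolute volumes are not reproduced here.

CAGR = 1.20  # 120% year-over-year growth rate

traffic = 1.0
for year in range(2023, 2031):
    print(f"{year}: ~{traffic:,.1f}x the 2023 baseline")
    traffic *= (1 + CAGR)
```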
For enterprises, AI will drive an increasing need to migrate data and applications to the cloud due to economics, gaps in in-house AI expertise, and challenging power and space limitations. As cloud providers offer AI-as-a-Service and/or GPU-as-a-Service, performing LLM training in the cloud will require enterprises to move huge amounts of training data securely between their premises and the cloud, as well as across different cloud instances. This will drive the need for more dynamic, higher-speed bandwidth interconnections, requiring more cloud exchange infrastructure, which represents a new revenue-generating opportunity for telcos.
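To get a feel for what “huge amounts of training data” means in practice, here is a hedged sketch of transfer time versus interconnect speed. The 1 PB dataset size, the link rates, and the 80% utilization figure are hypothetical assumptions chosen purely for illustration.

```python
# Illustrative transfer-time estimate for moving a training dataset into the cloud.
# Dataset size, link rates, and utilization are hypothetical; real-world throughput
# also depends on protocol overhead, encryption, and parallelism, ignored here.

def transfer_time_hours(dataset_tb: float, link_gbps: float, utilization: float = 0.8) -> float:
    """Hours needed to move dataset_tb terabytes over a link_gbps link at the given utilization."""
    bits = dataset_tb * 8e12                        # TB -> bits (decimal units)
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 3600

for gbps in (10, 100, 400):
    print(f"1 PB over {gbps:>3} Gb/s -> ~{transfer_time_hours(1_000, gbps):,.0f} hours")
```

Under these assumptions, moving a single petabyte takes roughly 280 hours at 10 Gb/s but only about 7 hours at 400 Gb/s, which is why higher-speed, more dynamic interconnections matter.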
Optimized AI performance at the network edge
Once an LLM is properly trained, it will be optimized and “pruned” to provide an acceptable inferencing (i.e., using AI in the real-world) accuracy within a much smaller footprint in terms of compute, storage, and energy requirements. These optimized AI algorithms are pushed out to the edge to reduce the strain on core data centers hosting LLM training, reduce latency, and abide by regulations related to data privacy concerns by hosting data locally. Placing AI storage and compute assets in geographically distributed data centers closer to where AI is created and consumed, whether by humans or machines, allows for faster data processing for near real-time AI inferencing to be achieved. This means more edge data centers to interconnect.
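The latency argument can also be put into rough numbers. The sketch below uses the common ~5 microseconds-per-kilometer rule of thumb for light in fiber; the distances and the 20 ms “near real-time” budget are hypothetical, and processing and queuing delays are ignored.

```python
# Rough sketch: round-trip fiber propagation delay versus distance to the data center.
# The ~5 us/km figure is a standard rule of thumb; distances and the 20 ms budget are
# hypothetical, and processing/queuing delays are ignored.

FIBER_US_PER_KM = 5
LATENCY_BUDGET_MS = 20  # hypothetical end-to-end budget for a near real-time application

for distance_km in (50, 500, 2_000):
    rtt_ms = 2 * distance_km * FIBER_US_PER_KM / 1_000
    verdict = "within" if rtt_ms < LATENCY_BUDGET_MS else "exceeds"
    print(f"{distance_km:>5,} km away -> ~{rtt_ms:.1f} ms RTT (propagation only), {verdict} a {LATENCY_BUDGET_MS} ms budget")
```

An edge data center 50 km away contributes well under a millisecond of round-trip propagation delay, while one on the other side of a continent can consume the entire latency budget on propagation alone.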
Balancing electrical power consumption and sustainability
AI is progressing at an increasingly rapid pace, creating new opportunities and challenges to address. For example, AI models involving DL and artificial neural networks are notoriously power-hungry during the LLM training phase, consuming immense amounts of electricity. This will only increase as models become more complex, requiring ever-greater amounts of compute, storage, and networking capability.
Figure 5: Ciena WaveLogic innovation constantly improves power and space savings per bit
Although AI compute and storage infrastructure consumes far more electrical energy than the networks that interconnect it, network power consumption cannot be allowed to scale linearly with bandwidth growth, as that would be neither sustainable nor cost-effective. This means network technology must consistently reduce the electrical power (and space) consumed per bit to “do its part” in an industry so critical to enabling AI capabilities. Figure 5 illustrates how Ciena’s relentless WaveLogic technology evolution continually increases achievable spectral efficiency while also reducing the required power and space per bit.
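The “power per bit” idea boils down to simple arithmetic: if each coherent modem generation multiplies capacity faster than it adds power, the energy cost of every bit falls even as total module power rises. The numbers below are purely illustrative assumptions, not WaveLogic specifications.

```python
# Purely illustrative power-per-bit arithmetic (hypothetical numbers, not WaveLogic specs):
# doubling capacity while total module power grows 40% cuts power per bit by roughly 30%.

old_capacity_gbps, old_power_w = 800, 100     # hypothetical previous generation
new_capacity_gbps, new_power_w = 1_600, 140   # hypothetical next generation

old_w_per_gbps = old_power_w / old_capacity_gbps
new_w_per_gbps = new_power_w / new_capacity_gbps
reduction = 1 - new_w_per_gbps / old_w_per_gbps

print(f"Power per bit: {old_w_per_gbps:.3f} -> {new_w_per_gbps:.3f} W per Gb/s "
      f"({reduction:.0%} lower per bit despite higher total module power)")
```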
AI data is only valuable if it can move
Hype aside, AI will provide unprecedented benefits across industries, positively affecting our business and personal lives. However, the rapid and widespread adoption of AI presents a range of new challenges related to its foundational infrastructure, which encompasses compute, storage, and network building blocks. Successfully addressing these challenges requires extensive cross-industry innovation and collaboration, because AI will only scale successfully if data can move securely, sustainably, and cost-effectively from the core data centers hosting LLM training to the edge data centers hosting AI inferencing.