
AI Data Center Networking

Posted on Mar 28, 2024

What Is AI Data Center Networking?

AI data center networking encompasses the networking infrastructure within data centers that facilitates artificial intelligence (AI) capabilities. It caters to the demanding requirements of AI and machine learning (ML) workloads, particularly during the intensive AI training phase, by ensuring network scalability, high performance, and low latency.

In the early stages of high-performance computing (HPC) and AI training networks, InfiniBand emerged as a popular proprietary networking technology because it provided fast, efficient communication between servers and storage systems. However, Ethernet, an open alternative, has gained significant traction in the AI data center networking market and is expected to become the dominant technology.

The increasing adoption of Ethernet in AI data center networking can be attributed to several factors, with operational efficiency and cost being prominent. Ethernet benefits from a vast pool of network professionals capable of building and managing networks, in contrast to the more limited availability of expertise for proprietary InfiniBand networks. Additionally, Ethernet offers a wider range of tools for managing networks compared to InfiniBand, which is primarily sourced through Nvidia.

What AI-Driven Requirements Are Addressed by AI Data Center Networking?

AI data center networking addresses the specific requirements driven by generative AI and large deep-learning AI models. The development of an AI model involves three phases:

    • Phase 1: Data preparation - Collecting and organizing datasets to be used in training the AI model.

    • Phase 2: AI training - Training the AI model by exposing it to large volumes of data, allowing it to learn patterns and relationships to develop intelligence.

    • Phase 3: AI inference - Applying the trained model in real-world scenarios to make predictions or decisions based on new, unseen data.

While Phase 3 generally utilizes existing data center and cloud networks, Phase 2 (AI training) requires significant data and compute resources to support the iterative learning process. Graphics processing units (GPUs) are commonly used for AI training and inference, typically in clustered configurations for efficiency. However, scaling up clusters increases costs, which makes it essential that the AI data center network does not hinder cluster efficiency.

Training large models requires connecting numerous GPU servers, sometimes tens of thousands, with each server costing over $400,000 in 2023. Therefore, optimizing job completion time and minimizing tail latency (where outlier AI workloads slow down overall job completion) are crucial for maximizing the return on GPU investment. In this context, the AI data center network must be reliable without causing efficiency degradation in the cluster.
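To see why tail latency matters, consider that in synchronous training each iteration finishes only when the slowest worker does. The following minimal Python sketch (the cluster size and timings are invented for illustration) shows how a single congested link stretches overall job completion time:

```python
# Illustrative only: job completion time (JCT) in synchronous AI training
# is gated by the slowest worker, so one delayed flow stretches the whole job.

def job_completion_time(iteration_times_per_worker):
    """Each training iteration ends only when every worker finishes,
    so per-iteration time is the max across workers."""
    iterations = zip(*iteration_times_per_worker)
    return sum(max(step) for step in iterations)

# Hypothetical cluster: 4 workers, 3 iterations, times in seconds.
balanced = [[1.0, 1.0, 1.0]] * 4
one_straggler = [[1.0, 1.0, 1.0]] * 3 + [[1.0, 2.5, 1.0]]  # one congested link

print(job_completion_time(balanced))       # 3.0
print(job_completion_time(one_straggler))  # 4.5 -- tail latency dominates
```

One worker running 2.5x slower on a single iteration inflates the whole job by 50 percent, which is exactly the efficiency degradation the network must avoid.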

How Does AI Data Center Networking Work?

AI data centers rely heavily on GPU servers, which account for a significant share of overall costs, so the network plays a crucial role in maximizing GPU utilization. Ethernet, a proven and open technology, is well suited to AI data center networking when the architecture is tuned to the demands of AI workloads: congestion management, load balancing, and low latency combine to optimize job completion time (JCT), while simplified management and automation ensure reliability and consistent performance.

Fabric Design

AI data center networking can employ various fabric designs, but the recommended choice is an any-to-any, non-blocking Clos fabric. This design supports the training framework with a consistent networking speed of 400 Gbps (with a potential increase to 800 Gbps) from the NIC through the leaf and spine layers. Depending on the number of GPUs and the size of the model, either a two-layer, three-stage non-blocking fabric or a three-layer, five-stage non-blocking fabric can be implemented.
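As a rough illustration of how these fabrics scale, the sketch below sizes a two-layer, three-stage non-blocking fabric, assuming hypothetical 64-port 400 GbE switches with one 400 GbE NIC per GPU (the port count is an assumption for the example, not a recommendation):

```python
# Back-of-the-envelope sizing for a non-blocking two-layer Clos fabric.
# Assumptions (illustrative, not from the article): 64-port 400 GbE switches,
# one 400 GbE NIC per GPU, and a 1:1 (non-blocking) oversubscription ratio.

SWITCH_PORTS = 64

# Non-blocking: each leaf splits its ports evenly between GPU-facing
# (downlink) and spine-facing (uplink) ports.
downlinks_per_leaf = SWITCH_PORTS // 2   # 32 GPUs per leaf
uplinks_per_leaf = SWITCH_PORTS // 2     # 32 uplinks per leaf

# Each spine needs one port per leaf, so a 64-port spine caps the
# fabric at 64 leaves; 32 uplinks per leaf means 32 spines.
max_leaves = SWITCH_PORTS                # 64
spines = uplinks_per_leaf                # 32

max_gpus = max_leaves * downlinks_per_leaf
print(f"{spines} spines, {max_leaves} leaves -> up to {max_gpus} GPUs")
# 32 spines, 64 leaves -> up to 2048 GPUs
```

Under these assumptions the two-layer design tops out at about 2,048 GPUs; clusters beyond that point are where the three-layer, five-stage fabric comes in.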

Flow Control and Congestion Avoidance

Aside from fabric capacity, additional design considerations contribute to the overall dependability and efficiency of the fabric. These include appropriately sized fabric interconnects with the correct number of links, which make it possible to identify and correct flow imbalances before they cause congestion and packet loss. The combination of explicit congestion notification (ECN), data center quantized congestion notification (DCQCN), and priority-based flow control (PFC) resolves flow imbalances and keeps transmission lossless.
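The sketch below illustrates the DCQCN idea in simplified form: when ECN marks trigger a congestion notification back to the sender, the sender cuts its rate and then gradually recovers. The constants and update rules are simplified for illustration, not a faithful implementation:

```python
# Simplified sketch of DCQCN-style sender rate control (constants are
# illustrative). On each congestion notification (CNP), the sender cuts
# its rate; between notifications, it recovers toward the last target.

class DcqcnSender:
    def __init__(self, line_rate_gbps=400.0, g=1 / 16):
        self.rate = line_rate_gbps        # current sending rate
        self.target = line_rate_gbps      # rate to recover toward
        self.alpha = 1.0                  # estimate of congestion severity
        self.g = g                        # EWMA gain for alpha

    def on_cnp(self):
        """Receiver saw ECN-marked packets and sent a CNP: cut the rate."""
        self.target = self.rate
        self.rate *= (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g  # congestion seen

    def on_quiet_period(self):
        """No CNPs for a while: decay alpha and recover toward target."""
        self.alpha = (1 - self.g) * self.alpha
        self.rate = (self.rate + self.target) / 2        # fast recovery step

sender = DcqcnSender()
sender.on_cnp()                  # congestion: rate drops
print(round(sender.rate, 1))     # 200.0
sender.on_quiet_period()
print(round(sender.rate, 1))     # 300.0 -- recovering toward 400
```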

To address congestion, dynamic and adaptive load balancing techniques are implemented at the switch level. Dynamic load balancing redistributes flows within the switch locally to achieve a balanced distribution. Adaptive load balancing continuously monitors flow forwarding and next hop tables, identifying imbalances and redirecting traffic away from congested paths.
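Here is a minimal sketch of the difference between static hashing and adaptive path selection; the load metric, hop names, and flow key are hypothetical, and real switches implement this in hardware:

```python
# Minimal sketch of adaptive load balancing at a switch (illustrative,
# not any vendor's implementation): instead of hashing a flow to a fixed
# next hop, the switch tracks per-path load and steers new flows (or
# flowlets) onto the least-congested equal-cost path.

import hashlib

class AdaptiveBalancer:
    def __init__(self, next_hops):
        self.load = {hop: 0.0 for hop in next_hops}  # e.g. link utilization

    def static_hash(self, flow_key):
        """Classic ECMP: fixed path per flow, blind to congestion."""
        hops = sorted(self.load)
        digest = hashlib.sha256(flow_key.encode()).digest()
        return hops[digest[0] % len(hops)]

    def adaptive_pick(self, flow_key):
        """Adaptive: send the flow(let) to the least-loaded next hop."""
        return min(self.load, key=self.load.get)

    def observe(self, hop, utilization):
        """Telemetry update, e.g. from queue-depth monitoring."""
        self.load[hop] = utilization

lb = AdaptiveBalancer(["spine1", "spine2", "spine3"])
lb.observe("spine1", 0.9)   # spine1 congested
lb.observe("spine2", 0.2)
lb.observe("spine3", 0.5)
print(lb.static_hash("10.0.0.1->10.0.1.7:4791"))    # fixed, load-blind choice
print(lb.adaptive_pick("10.0.0.1->10.0.1.7:4791"))  # spine2, least loaded
```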

In cases where congestion cannot be entirely avoided, ECN provides early notification to the endpoints. When queues begin to build, leaf and spine switches mark ECN-capable packets, informing senders about congestion and prompting them to slow transmission before packets are dropped in transit. If endpoints fail to react promptly, PFC enables Ethernet receivers to signal buffer availability back to senders: during periods of congestion, leaf and spine switches can pause or regulate traffic on specific links, reducing congestion and preventing packet drops. This ensures lossless transmission for specific traffic classes.
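The following sketch captures the PFC mechanism at its simplest: a receiver watches the buffer for a lossless traffic class and signals pause or resume around high and low watermarks. The thresholds are illustrative only:

```python
# Sketch of the PFC idea (thresholds illustrative): when an ingress
# buffer for a lossless traffic class crosses a high watermark, the
# receiver sends a PAUSE frame upstream; when it drains below a low
# watermark, it lets the sender resume, keeping that class lossless.

XOFF_THRESHOLD = 80   # KB of buffer in use: ask sender to pause
XON_THRESHOLD = 20    # KB of buffer in use: allow sender to resume

def pfc_action(buffer_kb, currently_paused):
    if not currently_paused and buffer_kb >= XOFF_THRESHOLD:
        return "send PAUSE frame for this priority"
    if currently_paused and buffer_kb <= XON_THRESHOLD:
        return "send resume (zero-quanta PAUSE)"
    return "no action"

print(pfc_action(85, currently_paused=False))  # send PAUSE frame ...
print(pfc_action(15, currently_paused=True))   # send resume ...
```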

Scale and Performance

Ethernet has emerged as the favored open-standard solution for the rigorous demands of high-performance computing and AI applications. It has evolved continuously, with advances such as the transition to 800 GbE and data center bridging (DCB), to offer greater speed, reliability, and scalability. Consequently, Ethernet is well positioned to handle the substantial data throughput and low-latency requirements essential to mission-critical AI applications.

Automation

Automation plays a vital role in an effective AI data center networking solution, although the quality of automation varies. To fully realize its value, automation software must prioritize experience-first operations. It is utilized throughout the design, deployment, and ongoing management of the AI data center, enabling automated and validated lifecycle processes from Day 0 through Day 2+. This approach ensures repeatable, continuously validated AI data center designs and deployments, eliminating human error and leveraging telemetry and flow data for performance optimization, proactive troubleshooting, and outage prevention.
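As a small taste of what telemetry-driven Day 2 operations can look like, the sketch below flags an interface whose ECN-mark counter deviates sharply from the fabric baseline; the interface names, counter values, and anomaly rule are all hypothetical:

```python
# Illustrative sketch of telemetry-driven troubleshooting: flag interfaces
# whose ECN-mark counters deviate sharply from the fabric-wide baseline so
# congestion is caught before it impacts JCT. All names and values are
# hypothetical.

from statistics import median

telemetry = {
    "leaf1:et-0/0/1": 120, "leaf1:et-0/0/2": 95,
    "leaf2:et-0/0/1": 110, "leaf2:et-0/0/2": 2400,  # congested outlier
}

baseline = median(telemetry.values())
for iface, ecn_marks in telemetry.items():
    if ecn_marks > 5 * baseline:   # crude anomaly rule for the sketch
        print(f"ALERT {iface}: {ecn_marks} ECN marks vs baseline {baseline}")
```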
