AI Intelligent Computing Center Network Architecture Design Practice
Data center networks for conventional cloud setups are generally architected with a traffic pattern in mind that prioritizes the needs of external clients. This orientation results in a predominant flow of data from the data centers toward end-users, a directional movement often referred to as north-south traffic. In contrast, the traffic that moves laterally inside the cloud, labeled as east-west, takes a backseat in this model. Nevertheless, this foundational network infrastructure, which facilitates Virtual Private Cloud (VPC) networks and underpins smart computing tasks, encounters a number of difficulties.
Network Congestion: Not all servers generate outbound traffic simultaneously. To control network construction costs, the downlink and uplink port bandwidth of leaf switches is not provisioned at a 1:1 ratio; instead, a convergence (oversubscription) ratio is applied. Generally, the uplink bandwidth is only one-third of the downlink bandwidth, so when many servers transmit at once, the uplinks become a congestion point.
High Latency for Internal Cloud Traffic: Communication between two servers across different leaf switches requires traversing spine switches, resulting in a three-hop forwarding path, which introduces additional latency.
Limited Bandwidth: In most cases, a single physical machine is equipped with only one network interface card (NIC) for connecting to the VPC network. The bandwidth of a single NIC is relatively limited, and currently available commercial NICs typically do not exceed 200 Gbps.
For intelligent computing scenarios, a recommended practice is to build a dedicated high-performance network to accommodate intelligent computing workloads, meeting the requirements of high bandwidth, low latency, and lossless transmission.
High Bandwidth Design
The intelligent computing servers can be fully equipped with 8 GPU cards and have 8 PCIe network card slots reserved. When building a GPU cluster across multiple machines, the burst bandwidth for communication between two GPUs may exceed 50 Gbps. Therefore, it is common to associate each GPU with a network port of at least 100 Gbps. In this scenario, you can configure either 4 network cards with two 100 Gbps ports each or 8 network cards with a single 100 Gbps port each. Alternatively, you can configure 8 network cards with a single port capacity of 200/400 Gbps.
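The three NIC configurations above can be compared with a short sketch. The helper name `node_network_bandwidth_gbps` is illustrative, not from any vendor API; it simply multiplies out the per-node RDMA bandwidth for each option:

```python
def node_network_bandwidth_gbps(num_nics, ports_per_nic, port_gbps):
    """Total RDMA network bandwidth of one 8-GPU node for a given NIC layout."""
    return num_nics * ports_per_nic * port_gbps

# The three configurations from the text:
print(node_network_bandwidth_gbps(4, 2, 100))  # 4 NICs x 2 ports x 100G -> 800
print(node_network_bandwidth_gbps(8, 1, 100))  # 8 NICs x 1 port  x 100G -> 800
print(node_network_bandwidth_gbps(8, 1, 200))  # 8 NICs x 1 port  x 200G -> 1600
```

The first two options both deliver 100 Gbps per GPU; the single-port 200/400 Gbps option raises that to 200 Gbps or more per GPU.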
The key to a non-blocking network design is to adopt a Fat-Tree architecture, in which the downlink and uplink bandwidth of each switch follows a 1:1 non-converged design. For example, if there are 64 downlink ports with a bandwidth of 100 Gbps each, there are also 64 uplink ports with a bandwidth of 100 Gbps each.
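The contrast between a converged VPC leaf and a non-converged Fat-Tree leaf can be expressed numerically. The function name and the 48/16-port VPC example are illustrative assumptions, not figures from the text; only the 64/64 Fat-Tree case and the roughly 3:1 VPC ratio come from the article:

```python
def convergence_ratio(down_ports, up_ports, down_gbps, up_gbps):
    """Ratio of total downlink to total uplink bandwidth (1.0 = non-blocking)."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# Fat-Tree leaf from the text: 64 x 100G down, 64 x 100G up -> non-blocking
print(convergence_ratio(64, 64, 100, 100))  # 1.0
# A hypothetical VPC leaf with uplink = 1/3 of downlink: 48 down, 16 up
print(convergence_ratio(48, 16, 100, 100))  # 3.0
```

A ratio above 1.0 means the uplinks are oversubscribed and can congest under simultaneous load, which is exactly the problem the Fat-Tree design removes.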
In addition, data center-grade switches with non-blocking forwarding capability should be used. The mainstream data center switches available in the market generally provide full-port non-blocking forwarding capability.
Low-Latency Design: AI-Pool
In terms of low-latency network architecture design, Baidu Intelligent Cloud has implemented and deployed the AI-Pool network solution based on Rail optimization. In this network solution, 8 access switches form an AI-Pool group. Taking a two-layer switch networking architecture as an example, this network architecture achieves one-hop communication between different intelligent computing nodes within the same AI-Pool.
In the AI-Pool network architecture, network ports with the same number on different intelligent computing nodes should be connected to the same switch. For example, RDMA port 1 of intelligent computing node 1, RDMA port 1 of intelligent computing node 2, and so on, up to RDMA port 1 of intelligent computing node P/2, should all be connected to the same switch.
Within each intelligent computing node, the upper-layer communication library matches the GPU cards with the corresponding network ports based on the on-node network topology. This enables direct communication with only one hop between two intelligent computing nodes that have the same GPU card number.
For communication between intelligent computing nodes with different GPU card numbers, the Rail Local technology in the NCCL communication library can make full use of the NVSwitch bandwidth between GPUs within the host, transforming cross-card communication between multiple machines into communication between the same GPU card numbers across machines.
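The rail-optimized wiring rule described above can be sketched as follows. This is an illustrative model of the topology, not NCCL internals: it assumes GPU i on every node attaches to NIC i, and NIC i on every node connects to leaf switch i (its "rail"):

```python
def leaf_switch_for(gpu_index):
    """Rail-optimized wiring: GPU/NIC i on every node lands on leaf switch i."""
    return gpu_index

def one_hop_possible(src_gpu, dst_gpu):
    """Cross-node traffic stays at one switch hop iff both GPUs share a rail."""
    return leaf_switch_for(src_gpu) == leaf_switch_for(dst_gpu)

# GPU 3 on node A -> GPU 3 on node B: both on leaf switch 3, one hop
print(one_hop_possible(3, 3))  # True
# GPU 3 -> GPU 5 across nodes: different rails; Rail Local first moves the
# data over NVSwitch to local GPU 5, then sends it out on rail 5
print(one_hop_possible(3, 5))  # False
```

This is why Rail Local matters: by converting cross-rail traffic into same-rail traffic inside the host, it keeps inter-node transfers on the one-hop path.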
For communication between two physical machines in different AI-Pools, traffic must pass through aggregation switches, resulting in three-hop communication.
The scalability of GPUs that the network can support is related to the port density and network architecture of the switches used. As the network becomes more hierarchical, it can accommodate a larger number of GPU cards, but the number of hops and latency for forwarding also increase. Therefore, a trade-off should be made based on the actual business requirements.
Two-level Fat-Tree Architecture
8 access switches form an intelligent computing resource pool called AI-Pool. In the diagram, P represents the number of ports on a single switch. Each switch can have a maximum of P/2 downlink ports and P/2 uplink ports, which means a single switch can connect to up to P/2 servers and P/2 switches. A two-level Fat-Tree network can accommodate a total of P * P/2 GPU cards.
Three-level Fat-Tree Architecture
In a three-level network architecture, there are additional aggregation switch groups and core switch groups. The maximum number of switches in each group is P/2. The maximum number of aggregation switch groups is 8, and the maximum number of core switch groups is P/2. A three-level Fat-Tree network can accommodate a total of P * (P/2) * (P/2) = P^3/4 GPU cards.
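The two capacity formulas can be checked with a short sketch (function names are illustrative):

```python
def two_level_capacity(p):
    """Max GPUs in a two-level Fat-Tree of P-port switches: P * P/2."""
    return p * (p // 2)

def three_level_capacity(p):
    """Max GPUs in a three-level Fat-Tree: P * (P/2) * (P/2) = P^3 / 4."""
    return p * (p // 2) * (p // 2)

# For a 40-port switch (e.g. 40-port HDR):
print(two_level_capacity(40))    # 800
print(three_level_capacity(40))  # 16000
```

Plugging in P = 40 reproduces the 800-card and 16,000-card figures cited later in the article.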
In the context of a three-level Fat-Tree network, 40-port 200 Gbps InfiniBand HDR switches can accommodate a maximum of 16,000 GPUs. This scale of 16,000 GPU cards is currently the largest GPU cluster network using InfiniBand in China, and Baidu holds the current record.
Comparison of two-level and three-level fat tree network architectures
The scale of accommodated GPU cards
The most significant difference between a two-level Fat-Tree and a three-level Fat-Tree lies in the capacity to accommodate GPU cards. In the diagram below, N represents the scale of GPU cards, and P represents the number of ports on a single switch. For example, for a switch with 40 ports, a two-tier Fat-Tree architecture can accommodate 800 GPU cards, while a three-tier Fat-Tree architecture can accommodate 16,000 GPU cards.
Another difference between the two-level Fat-Tree and three-level Fat-Tree network architectures is the number of hops in the network forwarding path between any two nodes.
In the two-level Fat-Tree architecture, within the same intelligent computing resource pool (AI-Pool), the forwarding path between nodes with the same GPU card number is 1 hop. The forwarding path between nodes with different GPU card numbers, without Rail Local optimization within the intelligent computing nodes, is 3 hops.
In the three-level Fat-Tree architecture, within the same intelligent computing resource pool (AI-Pool), the forwarding path between nodes with the same GPU card number is 3 hops. The forwarding path between nodes with different GPU card numbers, without Rail Local optimization within the intelligent computing nodes, is 5 hops.
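The hop counts in the two paragraphs above can be collected into one lookup (the function is an illustrative summary of the values stated in the text, not a general routing model):

```python
def forwarding_hops(levels, same_rail, rail_local=True):
    """Switch hops between two nodes in an AI-Pool Fat-Tree.

    levels: Fat-Tree depth (2 or 3); same_rail: the communicating GPUs share
    a card number; rail_local: NCCL Rail Local converts cross-rail traffic
    into same-rail traffic via intra-node NVSwitch.
    """
    if rail_local:
        same_rail = True  # Rail Local keeps inter-node traffic on one rail
    if levels == 2:
        return 1 if same_rail else 3
    if levels == 3:
        return 3 if same_rail else 5
    raise ValueError("levels must be 2 or 3")

print(forwarding_hops(2, same_rail=True))                     # 1
print(forwarding_hops(2, same_rail=False, rail_local=False))  # 3
print(forwarding_hops(3, same_rail=False, rail_local=False))  # 5
```

The table of outcomes makes the trade-off concrete: going from two to three levels multiplies capacity by P/2 but adds two hops to every path.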
AI HPC Network Architecture Typical Practice
Based on the currently mature commercial switches, we recommend several specifications for physical network architectures, taking into consideration the different models of InfiniBand/RoCE switches and the supported scale of GPUs.
Regular: InfiniBand two-tier Fat-Tree network architecture based on InfiniBand HDR switches, supporting a maximum of 800 GPU cards in a single cluster.
Large: RoCE two-tier Fat-Tree network architecture based on 128-port 100G data center Ethernet switches, supporting a maximum of 8192 GPU cards in a single cluster.
XLarge: InfiniBand three-tier Fat-Tree network architecture based on InfiniBand HDR switches, supporting a maximum of 16,000 GPU cards in a single cluster.
XXLarge: Based on InfiniBand Quantum-2 switches or equivalent-performance Ethernet data center switches, adopting a three-tier Fat-Tree network architecture, supporting a maximum of 100,000 GPU cards in a single cluster.
At the same time, high-speed network connectivity is essential to ensure efficient data transmission and processing.
FS provides high-quality connection products to meet the requirements of AI model network deployment. FS's product lineup includes 200G and 400G InfiniBand switches, data center switches, 10G/40G/100G/400G network cards, and 10/25G, 40G, 50/56G, and 100G optical modules, which can accelerate the AI model training and inference process. Optical modules provide high bandwidth, low latency, and low error rates, enhancing the capabilities of data center networks and enabling faster and more efficient AI computing. Choosing FS's connection products can optimize network performance and support the deployment and operation of large-scale AI models.