Deep Dive into Network Requirements for Large AI Models
From the emergence of Transformers to the widespread adoption of ChatGPT in 2023, a consensus has gradually formed that increasing a model's parameter count improves its performance, in line with a scaling law that relates the two. In particular, once the parameter count exceeds several hundred billion, the language understanding, logical reasoning, and problem analysis capabilities of large AI models improve rapidly.
Concurrently, the shift in model size and performance has brought about alterations in the network requirements for training large AI models compared to traditional models.
To meet the demands of efficient distributed computing in large-scale training clusters, large AI model training typically combines several parallel computing modes, such as data parallelism, pipeline parallelism, and tensor parallelism. All of these modes rely on collective communication operations among multiple computing devices. Moreover, training usually runs synchronously: the collective communication operations among multiple machines and multiple cards must complete before the next iteration or computation step can begin. The design of an efficient cluster networking scheme is therefore pivotal. It must deliver low latency and high throughput for inter-machine communication in order to reduce the communication overhead of data synchronization among machines and cards. This in turn improves the GPU's effective computation time ratio (GPU computation time / overall training time), a crucial factor in the efficiency of AI distributed training clusters. The following analysis examines the network requirements of large AI models in terms of scale, bandwidth, latency, stability, and network deployment.
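As a rough illustration of the effective computation time ratio, the sketch below assumes that overall training time is simply GPU computation time plus the communication time that is not hidden behind computation. The function name and the sample timings are hypothetical and chosen only to show the arithmetic.

```python
def effective_compute_ratio(compute_s: float, exposed_comm_s: float) -> float:
    """GPU computation time / overall training time.

    Assumption: overall time = compute time + communication time that is
    NOT overlapped with compute (a simplification for illustration).
    """
    return compute_s / (compute_s + exposed_comm_s)


# Hypothetical iteration: 8 s of GPU compute, 2 s of exposed communication.
slow_network = effective_compute_ratio(8.0, 2.0)
# Same iteration if the network cut exposed communication to 0.5 s.
fast_network = effective_compute_ratio(8.0, 0.5)

print(slow_network)  # 0.8
print(fast_network > slow_network)  # True
```

Under this simplification, any reduction in exposed communication time raises the ratio directly, which is why the network design matters for cluster efficiency.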
Challenges in Scaling GPU Networks for Efficient Training of Ultra-Large AI Models
The computational demands of AI applications are growing exponentially as models expand to massive scales. Parameter counts have surged by a factor of a hundred thousand, and current large AI models range from hundreds of billions to trillions of parameters. Training such models requires substantial computational power. Ultra-large models also place high demands on memory. For instance, a 1 trillion parameter model stored at 2 bytes per parameter would consume 2 terabytes of storage space. Moreover, during training, the intermediate variables generated by forward computation, the gradients from backward computation, and the optimizer states needed for parameter updates must all be stored, and these intermediate variables keep accumulating within a single iteration. A training run using the Adam optimizer, for example, produces intermediate variables that peak at several times the size of the model parameters. Memory consumption on this scale means that dozens to hundreds of GPUs are needed just to hold the complete training state of one model.
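The memory arithmetic can be sketched as follows. This is a back-of-the-envelope estimate, not a measurement: it assumes 2 bytes per parameter for weights and gradients and 12 bytes per parameter for Adam-related state (a common mixed-precision accounting of FP32 master weights plus two moment estimates), ignores activations entirely, and uses a hypothetical 80 GB GPU.

```python
TB = 10**12
GB = 10**9


def model_state_bytes(num_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Return (weights, gradients, optimizer state) in bytes.

    The per-parameter byte counts are assumptions for illustration;
    activation memory is deliberately left out.
    """
    return (num_params * weight_bytes,
            num_params * grad_bytes,
            num_params * optimizer_bytes)


def min_gpus(total_bytes, gpu_memory_bytes):
    """Smallest GPU count whose combined memory holds total_bytes (ceiling division)."""
    return -(-total_bytes // gpu_memory_bytes)


params = 10**12  # 1 trillion parameters
weights, grads, optimizer = model_state_bytes(params)
total = weights + grads + optimizer

print(weights // TB)              # 2   (TB for the weights alone)
print(total // TB)                # 16  (TB for weights + gradients + optimizer state)
print(min_gpus(total, 80 * GB))   # 200 (hypothetical 80 GB GPUs, before activations)
```

Even before activations are counted, the training state under these assumptions is several times the size of the weights, which is consistent with needing hundreds of GPUs for storage alone.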
However, a large number of GPUs alone is not enough to train ultra-large models effectively. The key to training efficiency lies in adopting suitable parallelization methods. Three main methods are currently used for ultra-large models: data parallelism, pipeline parallelism, and tensor parallelism. All three are combined when training models with hundreds of billions to trillions of parameters. Training such models requires clusters of thousands of GPUs. At first glance this may seem modest compared to the tens of thousands of interconnected servers in cloud data centers. In reality, interconnecting thousands of GPU nodes is more challenging, because network capability must be closely matched to computational capability. Cloud data centers primarily rely on CPU computation, and their network requirements typically range from 10 Gbps to 100 Gbps over traditional TCP transport. In contrast, large AI model training runs on GPUs, whose computational power is several orders of magnitude higher than that of CPUs. Consequently, network demands range from 100 Gbps to 400 Gbps, and RDMA protocols are used to reduce transmission latency and increase network throughput.
Specifically, achieving high-performance interconnection among thousands of GPUs poses several challenges in terms of network scale:
1. Issues encountered in large-scale RDMA networks, such as head-of-line blocking and PFC deadlock storms.
2. Network performance optimization, including more efficient congestion control and load balancing techniques.
3. Network card connectivity issues: a single host is subject to hardware performance limitations, which raises the question of how to establish thousands of RDMA QP connections.
4. Network topology selection, considering whether the traditional Fat Tree structure is preferable or whether high-performance computing network topologies such as Torus and Dragonfly can serve as a reference.
Optimizing GPU Communication for Efficient AI Model Training Across Machines
In AI large-scale model training, the collective communication operations among GPUs within and across machines generate a substantial volume of communication data. Consider GPU communication within a single machine first: for AI models with billions of parameters, the collective communication data produced by model parallelism can reach hundreds of gigabytes. Completion time therefore depends heavily on the communication bandwidth and modes available between GPUs within the machine. GPUs within a server should support high-speed interconnection protocols, which avoid the need to copy data through CPU memory multiple times during GPU communication.
Furthermore, GPUs are typically connected to network cards via PCIe buses, and the bandwidth of the PCIe bus determines whether the single-port bandwidth of the network card can be fully utilized. For instance, a PCIe 3.0 bus with 16 lanes offers a unidirectional bandwidth of 16 GB/s, or about 128 Gbps. If inter-machine communication is equipped with a single-port bandwidth of 200 Gbps, the PCIe bus becomes the bottleneck and the network bandwidth between machines cannot be fully utilized.
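The PCIe comparison is a unit conversion, sketched below. The 16 GB/s and 200 Gbps figures come from the text; the function names are illustrative, and protocol overheads are ignored.

```python
def gbytes_to_gbits(gbytes_per_s: float) -> float:
    """Convert GB/s to Gbps (8 bits per byte)."""
    return gbytes_per_s * 8


def nic_utilization_ceiling(pcie_gbytes_per_s: float, nic_gbits_per_s: float) -> float:
    """Upper bound on the fraction of NIC port bandwidth the PCIe link can feed."""
    return min(1.0, gbytes_to_gbits(pcie_gbytes_per_s) / nic_gbits_per_s)


pcie3_x16_gbytes = 16   # unidirectional PCIe 3.0 x16 bandwidth from the text
nic_port_gbits = 200    # single-port NIC bandwidth from the text

print(gbytes_to_gbits(pcie3_x16_gbytes))                          # 128
print(nic_utilization_ceiling(pcie3_x16_gbytes, nic_port_gbits))  # 0.64
```

In this example at most 64% of the 200 Gbps port can be driven through the PCIe 3.0 x16 link, so the bus generation needs to be matched to the port speed.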
Crucial Factors in AI Large-Scale Model Training Efficiency
Network latency during data communication comprises two components: static latency and dynamic latency. Static latency encompasses data serialization latency, device forwarding latency, and electro-optical transmission latency. It is determined by the capabilities of the forwarding chip and transmission distance, representing a constant value when network topology and communication data volume are fixed. Conversely, dynamic latency significantly impacts network performance, encompassing queuing latency within switches and latency due to packet loss and retransmission, often caused by network congestion.
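To make the static components concrete, the sketch below adds up serialization, forwarding, and propagation delay for one packet. All input values are hypothetical, the per-hop forwarding delay is an assumed constant, serialization is counted once rather than per hop, and light in fiber is taken to travel at roughly 2 x 10^8 m/s.

```python
FIBER_SPEED_M_PER_S = 2e8  # approximate speed of light in optical fiber


def static_latency_us(packet_bytes, link_gbps, hops, per_hop_forwarding_us, distance_m):
    """Return the static latency components in microseconds.

    Simplification: serialization is counted once, and queuing (dynamic
    latency) is excluded entirely.
    """
    serialization = packet_bytes * 8 / (link_gbps * 1e9) * 1e6
    forwarding = hops * per_hop_forwarding_us
    propagation = distance_m / FIBER_SPEED_M_PER_S * 1e6
    return {
        "serialization_us": serialization,
        "forwarding_us": forwarding,
        "propagation_us": propagation,
        "total_us": serialization + forwarding + propagation,
    }


# Hypothetical path: 4096-byte packet, 100 Gbps links, 3 switch hops at
# 1 microsecond each, 100 m of fiber end to end.
for name, value in static_latency_us(4096, 100, 3, 1.0, 100).items():
    print(f"{name}: {value:.3f}")
```

With the topology and packet size fixed, every term here is constant, which is the sense in which static latency does not vary from one transfer to the next.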
Taking the training of a GPT-3 model with 175 billion parameters as an example, theoretical analysis indicates that when dynamic latency increases from 10μs to 1000μs, the proportion of effective GPU computing time decreases by nearly 10%. A packet loss rate of one in a thousand (0.1%) reduces effective GPU computing time by 13%, and at a 1% loss rate the proportion drops to less than 5%. Reducing communication latency and increasing network throughput are therefore critical to fully leveraging the computational power available for AI large-scale model training.
Beyond latency, network variations introduce latency jitter, which also affects training efficiency. The collective communication process of computing nodes during training involves multiple parallel point-to-point (P2P) communications. For instance, Ring AllReduce among N nodes consists of 2*(N-1) data communication substeps, and in each substep all nodes must complete their P2P communication in parallel. Network fluctuations noticeably increase the flow completion time (FCT) of P2P communication between specific nodes. Because a substep finishes only when its slowest P2P transfer finishes, the transfer delayed by jitter becomes the weakest link and lengthens the completion time of the whole substep. Network jitter therefore reduces the efficiency of collective communication and, with it, the training efficiency of AI large-scale models.
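The weakest-link effect can be illustrated with a toy timing model of Ring AllReduce. This is only a sketch: it assumes each of the 2*(N-1) substeps takes as long as its slowest P2P transfer, and it models jitter as a uniformly distributed extra delay, which is an arbitrary choice made for illustration.

```python
import random


def ring_allreduce_time_us(n_nodes, base_us, jitter_us, seed=0):
    """Total time of a Ring AllReduce under a simple per-substep timing model.

    Each substep has n_nodes parallel P2P transfers; the substep ends when
    the slowest one ends. Jitter adds uniform(0, jitter_us) to each transfer.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(2 * (n_nodes - 1)):
        transfers = [base_us + rng.uniform(0.0, jitter_us) for _ in range(n_nodes)]
        total += max(transfers)  # the slowest transfer gates the substep
    return total


ideal = ring_allreduce_time_us(8, base_us=100.0, jitter_us=0.0)
jittery = ring_allreduce_time_us(8, base_us=100.0, jitter_us=50.0)

print(ideal)            # 1400.0 (14 substeps x 100 us)
print(jittery > ideal)  # True
```

Because the maximum is taken in every substep, the total tends toward the worst-case transfer time rather than the average, and the effect grows as more nodes take part in each substep.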
Network Stability: Critical for Computational Power in Large-Scale AI Model Training
The emergence of Transformers signaled the onset of rapid evolution in large-scale models. Over the past five years, model size has surged from 61 million to 540 billion parameters, an increase of nearly 10,000 times. The computational power of the cluster plays a pivotal role in determining the speed of AI model training. For instance, training GPT-3 on a single V100 GPU would take an impractical 335 years, whereas a cluster of 10,000 V100 GPUs, scaling ideally, could complete the training in approximately 12 days.
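The ideal-scaling estimate is a single division, sketched here with the figures quoted for GPT-3 on V100 GPUs. Perfect linear scaling is an assumption; real clusters lose some of it to communication overhead.

```python
def ideal_training_days(single_gpu_years: float, n_gpus: int, days_per_year: int = 365) -> float:
    """Training time in days if work divides perfectly across n_gpus."""
    return single_gpu_years * days_per_year / n_gpus


days = ideal_training_days(335, 10_000)
print(round(days, 1))  # 12.2
```

A single-GPU time of 335 years spread perfectly over 10,000 GPUs comes to roughly 12 days, and any loss of scaling efficiency in the network pushes the real figure above that.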
The reliability of the network system is foundational in ensuring the computational stability of the entire cluster. Network failures can exert a widespread impact, disrupting the connectivity of numerous compute nodes in the event of a network node failure, thereby compromising the overall computational power of the system. Additionally, fluctuations in network performance can affect the entire cluster, given that the network is a shared resource, unlike individual compute nodes that are more easily isolated. Performance fluctuations have the potential to adversely impact the utilization of all computational resources. Thus, maintaining a stable and efficient network is of utmost importance throughout the training cycle of AI large-scale models, presenting new challenges for network operations.
When a failure occurs during a training task, fault-tolerant replacement or elastic scaling may be needed to deal with the faulty nodes. Changes in the positions of participating nodes may make the current communication patterns suboptimal, so jobs may need to be redistributed and rescheduled to restore overall training efficiency. Furthermore, unexpected network failures such as silent packet loss not only reduce the efficiency of collective communication but can also trigger communication library timeouts, leaving training processes stalled for long periods and significantly reducing efficiency. Fine-grained information about the throughput, packet loss, and other parameters of the business flow is therefore essential for detecting faults promptly and self-healing within seconds.
The Role of Automated Deployment and Fault Detection in Large-Scale AI Clusters
The establishment of intelligent lossless networks often relies on RDMA protocols and congestion control mechanisms, accompanied by an array of intricate and diverse configurations. Any misconfiguration of these parameters has the potential to impact network performance and may lead to unforeseen issues. Statistics indicate that over 90% of high-performance network failures stem from configuration errors. The primary cause of such problems lies in the multitude of configuration parameters for network cards, contingent on architecture versions, business types, and network card types. In the context of large-scale AI model training clusters, the complexity of configurations is further heightened. Therefore, efficient and automated deployment and configuration can effectively enhance the reliability and efficiency of large-scale model cluster systems.
Automated deployment and configuration necessitate the capability to deploy configurations in parallel across multiple machines, automatically select relevant parameters for congestion control mechanisms, and choose appropriate configurations based on network card types and business requirements.
Similarly, within complex architectures and configuration scenarios, the ability to pinpoint faults promptly and accurately during business operations is essential to overall business efficiency. Automated fault detection speeds up problem localization, sends precise notifications to management personnel, and reduces the cost of problem identification. It enables swift identification of root causes and provides corresponding solutions.
Choosing FS to Accelerate AI Model Networks
The analysis underscores the specific requirements of AI large-scale models concerning scale, bandwidth, stability, latency/jitter, and automation capability. However, a technological gap persists in fully meeting these requirements with the current configuration of data center networks.
The demand for network capabilities in AI large-scale models is exceptionally high, given their substantial parameter sizes and intricate computational needs. Adequate computing and storage resources are essential to support their training and inference processes, while high-speed network connectivity is crucial for efficient data transmission and processing. FS addresses these needs by offering high-quality connectivity products tailored to the unique circumstances of each customer, thereby enhancing network performance and user experience.
FS's product portfolio extends beyond switches and network cards to include optical modules with rates ranging from 100G to 800G, as well as AOCs and DACs. These products facilitate efficient data transmission, accelerating AI model training and inference processes. In large-scale AI training, optical modules connect distributed computing nodes, collaborating to execute complex computational tasks. With attributes such as high bandwidth, low latency, and a low error rate, these products expedite model updates and optimization, reducing communication delays and fostering faster and more efficient artificial intelligence computing.
Opting for FS's connectivity products enhances the capabilities of data center networks to better support the deployment and operation of AI large-scale models. For further details, please visit the official FS website.