
Meeting the Five Major Network Demands for AIGC Large Models

Posted on Apr 26, 2024

AIGC Large Models have revolutionized the field of artificial intelligence, enabling breakthroughs in natural language processing, computer vision, and other complex tasks. With their massive size and parameter count, these models have become indispensable for various AI applications. However, their success hinges on meeting the network demands associated with their scale. In this article, we will explore the five major network demands from the perspectives of scale, bandwidth, latency, stability, and network deployment, which must be addressed to unleash the full potential of AIGC large models.

Ultra-scale Network Demand

The computational demand of AI applications is growing exponentially, with models reaching billions to trillions of parameters. Training such mega-models requires ultra-high computing power and vast amounts of GPU memory: the weights, gradients, and optimizer states alone exceed any single accelerator's memory, so tens or even hundreds of GPUs are needed just to hold a training run's state. However, a large number of GPUs by itself is not enough to train an effective large model. For more information, you can read the article Why GPUs Are so Crucial for AI?
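As a rough back-of-envelope illustration (the parameter count, the bytes-per-parameter rule of thumb, and the per-GPU memory figure below are assumptions, not measurements of any specific model), the memory footprint of a training run can be estimated like this:

```python
# Rough memory estimate for mixed-precision training with Adam, using the common
# rule of thumb of ~16 bytes per parameter (FP16 weights + FP16 gradients +
# FP32 master weights and optimizer states). Activations are ignored here.

params = 175e9          # parameters, e.g. a GPT-3-class model (assumption)
bytes_per_param = 16    # rule-of-thumb for weights + gradients + optimizer state
gpu_memory_gb = 80      # one high-end accelerator, as an assumption

total_gb = params * bytes_per_param / 1e9
print(f"~{total_gb:,.0f} GB of training state, i.e. at least "
      f"{total_gb / gpu_memory_gb:.0f} GPUs just to hold it")
```

Even this optimistic estimate, which ignores activation memory entirely, already lands in the "tens of GPUs" range mentioned above.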

Effective training relies on three parallelism methods: data, pipeline, and tensor parallelism. All three are combined when training models at the scale of hundreds of billions to trillions of parameters. Interconnecting thousands of GPU nodes so that network capacity keeps pace with computing power is harder than connecting tens of thousands of general-purpose servers, which poses networking challenges at this scale. Cloud data centers built around CPU computing typically need 10Gbps-100Gbps of network bandwidth and rely on the traditional TCP protocol, whereas AI mega-model training runs on GPUs, demands far higher computing power, and uses the RDMA protocol for improved throughput.
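As a minimal sketch (the parallelism degrees below are hypothetical, not a recommendation for any particular model), the three parallelism degrees multiply to give the number of GPUs a single training job must interconnect:

```python
# Rough illustration: how the three parallelism degrees combine into a GPU count.
# The figures below are hypothetical, not measurements of any specific model.

def total_gpus(data_parallel: int, pipeline_parallel: int, tensor_parallel: int) -> int:
    """Total GPUs = product of the three parallelism degrees."""
    return data_parallel * pipeline_parallel * tensor_parallel

# Example: a very large training job might combine
#   8-way tensor parallelism (within a server),
#   16-way pipeline parallelism (across servers), and
#   16-way data parallelism (replicas of the whole pipeline).
print(total_gpus(data_parallel=16, pipeline_parallel=16, tensor_parallel=8))  # 2048 GPUs
```

It is this product, in the thousands, that the interconnect has to serve without becoming the bottleneck.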

Specifically, this raises issues such as RDMA network tuning, performance optimization, NIC connectivity, and network topology selection. Addressing these challenges involves optimizing congestion control and load balancing, and selecting a suitable network topology such as Fat-Tree or Torus.

Ultra-High-Bandwidth Network Demand

Bandwidth plays a vital role in the efficient transfer of data between the components of AIGC large models. In AI large-model training, the collective communication operations between machines generate enormous volumes of traffic. With the increasing size of models and the need to process vast amounts of training data, high-bandwidth networks are essential. Technologies such as high-speed interconnects, advanced data compression algorithms, and data caching mechanisms can significantly improve data transfer rates and minimize bottlenecks. By ensuring sufficient network bandwidth, organizations can accelerate model training, reduce latency, and enhance overall AI performance.
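A hedged back-of-envelope estimate, assuming a ring all-reduce over FP16 gradients and an illustrative 400 Gbps link (all figures are assumptions, not benchmarks), shows how quickly this traffic becomes bandwidth-bound:

```python
# Back-of-envelope estimate of gradient-synchronization traffic per training step.
# Assumes ring all-reduce over FP16 gradients; all numbers are illustrative.

params = 175e9            # model parameters (e.g., a GPT-3-class model)
bytes_per_param = 2       # FP16 gradients
n = 8                     # GPUs in the data-parallel group

grad_bytes = params * bytes_per_param
# Ring all-reduce sends roughly 2*(n-1)/n of the buffer per GPU per step.
per_gpu_traffic = 2 * (n - 1) / n * grad_bytes

bandwidth_bps = 400e9     # a 400 Gbps link, as an example
seconds = per_gpu_traffic * 8 / bandwidth_bps
print(f"~{per_gpu_traffic / 1e9:.0f} GB per GPU per step, "
      f"~{seconds:.1f} s on a 400 Gbps link")
```

Gradient accumulation, compression, and overlapping communication with computation all reduce this cost in practice, but the raw volume explains why every extra gigabit of bandwidth matters.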

Ultra-Low Delay Network Demand

Network delay during data transmission consists of a static component and a dynamic component. Static delay includes serialization delay, device forwarding delay, and optical/electrical propagation delay, which are determined by the capability of the forwarding chip and the transmission distance. Dynamic delay comprises queuing delay inside switches and packet-loss retransmission delay, both typically caused by network congestion and packet loss.
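As a worked example of the static components (the link speed, fiber distance, and per-hop forwarding delay below are assumptions chosen purely for illustration):

```python
# Worked example of the static delay components described above.
# Link speed, distance, and forwarding delay are assumptions for illustration.

message_bytes = 4096           # a 4 KB packet
link_bps = 400e9               # 400 Gbps link
distance_km = 0.1              # 100 m of fiber inside a data center hall
fiber_us_per_km = 5.0          # light in fiber travels ~5 microseconds per km
forwarding_us = 1.0            # assumed per-hop switch forwarding delay

serialization_us = message_bytes * 8 / link_bps * 1e6
propagation_us = distance_km * fiber_us_per_km
static_delay_us = serialization_us + propagation_us + forwarding_us
print(f"serialization {serialization_us:.2f} us, propagation {propagation_us:.2f} us, "
      f"total static delay ~{static_delay_us:.2f} us per hop")
```

The static part is fixed by hardware and distance at a few microseconds per hop; it is the dynamic part, driven by congestion, that can balloon to milliseconds and must be engineered away.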

Besides absolute latency, fluctuations in the network introduce latency jitter, which also hurts training efficiency. The communication among computing nodes can be divided into parallel P2P (point-to-point) transfers. For example, Ring AllReduce among N nodes consists of 2*(N-1) communication sub-processes, and every node must complete its P2P transfer within each sub-process. When the network fluctuates, the P2P transfer between a particular pair of nodes can take significantly longer, and that slowest transfer becomes the weakest link, prolonging the completion of the whole sub-process. Consequently, network jitter reduces collective communication efficiency and, in turn, the training efficiency of AI models.
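The toy simulation below, with purely illustrative latency figures, shows how jitter on a single link stretches every sub-process and therefore the whole collective:

```python
import random

# Toy simulation of the "weakest link" effect in Ring AllReduce among N nodes:
# each of the 2*(N-1) sub-processes only completes when its slowest P2P transfer
# completes, so jitter on a single link stretches the whole collective.
# All latency figures are illustrative assumptions, not measurements.

random.seed(0)
N = 8                  # nodes in the ring
base_ms = 10.0         # nominal P2P transfer time per sub-process
jitter_ms = 15.0       # worst-case extra delay on one congested link

ideal_ms = 2 * (N - 1) * base_ms
actual_ms = 0.0
for _ in range(2 * (N - 1)):                      # 2*(N-1) sub-processes
    transfers = [base_ms] * N                     # N parallel P2P transfers
    transfers[0] += random.random() * jitter_ms   # one link fluctuates
    actual_ms += max(transfers)                   # each step waits for the slowest

print(f"ideal: {ideal_ms:.0f} ms, with one jittery link: {actual_ms:.0f} ms")
```

A single fluctuating link is enough to stretch the collective well beyond its ideal completion time, even though the other N-1 transfers finish on schedule.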

Ultra-High Stability Network Demand

The network system's availability plays a crucial role in ensuring computational stability across the cluster. Network failures can significantly impact the connectivity of multiple computing nodes and compromise system integrity. Additionally, network performance fluctuations pose challenges for shared resources, impacting the utilization rate of computational resources. Therefore, maintaining a stable and efficient network is of utmost importance during large-scale AI model training, presenting new challenges for network operation and maintenance.

In the event of a failure, fault-tolerant replacement or elastic scaling may be necessary to address the affected nodes. Changes in node placement can require job rescheduling and scheduling optimizations to maintain overall training efficiency. Unanticipated network failures, such as silent packet loss, not only reduce collective communication efficiency but also trigger communication timeouts, leading to prolonged disruption of training jobs. Detailed, per-flow visibility into throughput and packet loss enables problems to be avoided or self-healed within seconds, minimizing the impact on training efficiency.
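A minimal sketch of such telemetry-driven detection, with hypothetical flow records and alarm thresholds (a real system would feed these signals into rerouting or self-healing logic), might look like this:

```python
# Minimal sketch of threshold-based anomaly detection on per-flow telemetry,
# the kind of signal used to trigger rerouting or self-healing.
# Flow records, field names, and thresholds are hypothetical.

from dataclasses import dataclass

@dataclass
class FlowStats:
    flow_id: str
    throughput_gbps: float
    packets_sent: int
    packets_lost: int

def suspect_flows(samples: list[FlowStats],
                  min_throughput_gbps: float = 50.0,
                  max_loss_ratio: float = 1e-5) -> list[str]:
    """Return flows whose throughput or loss ratio crosses the alarm thresholds."""
    flagged = []
    for s in samples:
        loss_ratio = s.packets_lost / max(s.packets_sent, 1)
        if s.throughput_gbps < min_throughput_gbps or loss_ratio > max_loss_ratio:
            flagged.append(s.flow_id)
    return flagged

print(suspect_flows([
    FlowStats("gpu03->gpu17", 92.0, 10_000_000, 4),        # healthy
    FlowStats("gpu05->gpu21", 31.5, 10_000_000, 2_400),    # silent packet loss
]))  # ['gpu05->gpu21']
```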

Network Automation Deployment

Intelligent lossless networks rely on RDMA protocols and congestion-control mechanisms, but their complex configuration poses challenges. Misconfigured parameters can degrade service performance and lead to unexpected issues; over 90% of high-performance network failures are attributed to misconfiguration, often because of the sheer number of NIC configuration parameters. Automating configuration deployment therefore improves the reliability and efficiency of large-model cluster systems. Automated deployment involves pushing multiple configurations in parallel, automatically selecting congestion-control parameters, and choosing configurations based on NIC model and service type.
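A minimal sketch of this kind of automation, with hypothetical NIC names, service types, and parameter values (a real deployment would map them onto the vendor's actual configuration knobs), could look like this:

```python
# Minimal sketch of automated configuration selection and parallel rollout.
# NIC names, service types, and parameter values are hypothetical placeholders.

from concurrent.futures import ThreadPoolExecutor

# Congestion-control profile chosen by (NIC model, service type).
PROFILES = {
    ("nic-a", "training"):  {"ecn": True, "cc_algorithm": "dcqcn",  "pfc": True},
    ("nic-a", "inference"): {"ecn": True, "cc_algorithm": "dcqcn",  "pfc": False},
    ("nic-b", "training"):  {"ecn": True, "cc_algorithm": "timely", "pfc": True},
}

def apply_config(node: str, nic: str, service: str) -> str:
    profile = PROFILES.get((nic, service))
    if profile is None:
        return f"{node}: no profile for ({nic}, {service}), skipped"
    # In practice this step would push the profile to the node's NIC and switch.
    return f"{node}: applied {profile}"

nodes = [("gpu-node-01", "nic-a", "training"),
         ("gpu-node-02", "nic-b", "training"),
         ("gpu-node-03", "nic-a", "inference")]

with ThreadPoolExecutor() as pool:                 # deploy to all nodes in parallel
    for result in pool.map(lambda n: apply_config(*n), nodes):
        print(result)
```

Keeping the profile table as data, rather than hand-editing each node, is what lets the same configuration be rolled out consistently across thousands of NICs.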

Similarly, in complex architectural and configuration scenarios, fast and accurate fault localization is vital for maintaining overall business efficiency. Automated fault detection swiftly identifies problems, notifies management, and reduces the cost of locating and resolving the root cause of issues.

Build Lossless and High-Performance AI Networks with FS

FS is a leading network solutions provider focused on creating a digitally interconnected world. We are committed to delivering innovative, efficient, and reliable products, solutions, and services that cater to the diverse needs of our users. Our comprehensive range of offerings, including InfiniBand switches and SmartNICs, is designed to optimize data centers, high-performance computing, edge computing, AI, and more. By providing cost-effective solutions with exceptional performance, we empower our customers to accelerate their business capabilities.

Our lossless network solutions, based on InfiniBand and RoCE technologies, create a network environment that ensures zero data loss and high-performance computing capabilities. We understand that different application scenarios and user requirements call for tailored solutions. That's why we analyze local conditions to choose the optimal solution, providing users with high bandwidth, low latency, and efficient data transmission. By addressing network bottlenecks and enhancing overall performance, we elevate the network experience for our users.

Contact our solution experts, who will work closely with you to understand your specific needs and develop a customized solution. Let us help you unlock the full potential of your network.
