RDMA-Enhanced High-Speed Network for Training Large Models

Posted on Dec 19, 2023

As artificial intelligence technology continues to evolve, demand grows for high-speed networks that can support large model training. This article explores how RDMA-enhanced networks provide the speed and efficiency that such training requires. With RDMA, data centers can significantly reduce latency and increase throughput, which keeps HPC applications performing well. The sections below give an overview of the two main RDMA technologies for large model training, InfiniBand and RoCE.

Understanding RDMA for High-Speed Networks

Remote Direct Memory Access (RDMA) is an ultra-high-speed network memory access technology that lets programs quickly access the memory of remote computing nodes. The figure below shows why it is so fast. With RDMA, network access no longer has to pass through the operating system's kernel (e.g., Sockets, TCP/IP). Bypassing the kernel avoids the CPU time that kernel operations consume and lets the Network Interface Card (NIC), also known as the Host Channel Adapter (HCA) in certain contexts, access memory directly.

traditional-vs-rdma

In hardware, RDMA is implemented by three key technologies: InfiniBand, RoCE, and iWARP. Of these, InfiniBand and RoCE are currently the most widely used.

Unveiling InfiniBand: Pinnacle of Bandwidth Excellence

Today, the mainstream transmission rates in the InfiniBand ecosystem are 100G and 200G. Enhanced Data Rate (EDR, 100G) and High Data Rate (HDR, 200G) are the InfiniBand terms for these rates. Some brands now offer even higher rates: for example, FS has introduced a 400G network card, and 800G optical modules are also available. InfiniBand technology is advancing rapidly.

Despite its capabilities, InfiniBand is overlooked by many IT professionals because its high cost puts it out of reach for general use. In the supercomputing centers of major universities and research institutions, however, InfiniBand is an almost indispensable standard, particularly for critical supercomputing tasks.

Unlike networks built from conventional switches, InfiniBand networking uses a network topology known as "Fat Tree" so that the network cards of any two computing nodes can communicate with each other. The Fat Tree structure described here has two layers: the core layer, which forwards traffic and is not connected directly to computing nodes, and the access layer, which connects the computing nodes.

The high cost of a Fat Tree topology in an InfiniBand network comes largely from its port and cable requirements. For example, on an aggregation switch with 36 ports, half of the ports must connect to computing nodes and the other half to upper-layer core switches for lossless communication. Each cable costs around $1.3K, and redundant connections are required for lossless communication.

fat-tree-topology
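
The port-splitting rule above can be turned into a quick sizing estimate. The following Python sketch assumes a two-layer fat tree built entirely from identical switches, with every access switch wired to every core switch; the function name and dictionary keys are illustrative, and the cable price is simply the approximate figure quoted above, applied to the access-to-core cables only.

```python
def two_layer_fat_tree(ports_per_switch: int) -> dict:
    """Size a two-layer fat tree in which each access switch uses half of
    its ports for compute nodes and half for uplinks to core switches."""
    if ports_per_switch <= 0 or ports_per_switch % 2:
        raise ValueError("ports_per_switch must be a positive even number")
    half = ports_per_switch // 2
    return {
        "downlinks_per_access_switch": half,
        "uplinks_per_access_switch": half,
        # One uplink from each access switch to each core switch.
        "core_switches": half,
        # Each core switch port feeds a different access switch.
        "access_switches": ports_per_switch,
        "max_compute_nodes": ports_per_switch * half,
        "access_to_core_cables": ports_per_switch * half,
    }


CABLE_COST_USD = 1300  # approximate per-cable price quoted in the text

plan = two_layer_fat_tree(36)
uplink_cable_cost = plan["access_to_core_cables"] * CABLE_COST_USD

print(plan["max_compute_nodes"])  # 648
print(uplink_cable_cost)          # 842400
```

Under these assumptions, 36-port switches top out at 648 compute nodes, and the cables between the two layers alone account for a large share of the budget before any switches or node cables are counted.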

The adage "You get what you pay for" applies to InfiniBand: it delivers very high bandwidth and low latency. According to Wikipedia, InfiniBand has significantly lower latency than Ethernet, at 100 nanoseconds versus 230 nanoseconds. This performance has made InfiniBand a cornerstone technology in some of the world's foremost supercomputers, used by industry giants such as Microsoft and NVIDIA and by national laboratories in the United States.

Unlocking the Potential of RoCE: An Affordable RDMA Solution

RoCE (RDMA over Converged Ethernet) is a cost-effective alternative to high-priced technologies such as InfiniBand. It is not cheap, but it is a more budget-friendly way to get RDMA capabilities over Ethernet. RoCE has developed rapidly in recent years and is gaining momentum as a substitute for InfiniBand, especially where InfiniBand's cost is prohibitive.

Even so, building a truly lossless network with RoCE is challenging, and it is difficult to keep the overall network cost below 50% of what an InfiniBand network would cost.

Empowering Large-Scale Model Training: GPUDirect RDMA Unleashed

In large-scale model training, inter-node communication is a major cost. Combining InfiniBand with GPUs enables a key feature known as GPUDirect RDMA, which lets GPUs on different nodes communicate directly without involving host memory or the CPU. Put simply, the GPUs of two nodes exchange data directly through the InfiniBand network interface cards instead of taking the traditional route through CPU and memory.

GPUDirect RDMA matters especially for large-scale model training, because the models reside on the GPUs. Copying a model to the CPU already takes considerable time, and sending it to other nodes via the CPU would slow data transfer even further.

gpu-direct-rdma
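
To make the difference between the two data paths concrete, the sketch below lists the stages a buffer passes through in each case and counts them. It is only a bookkeeping model of the description above, not a measurement: the stage names and the `summarize` helper are illustrative, and real transfers are pipelined rather than strictly sequential.

```python
# Conventional path: data is staged in host memory on both nodes.
STAGED_PATH = [
    "GPU memory", "host memory", "NIC",
    "network",
    "NIC", "host memory", "GPU memory",
]

# GPUDirect RDMA path: the NIC reads and writes GPU memory directly.
DIRECT_PATH = [
    "GPU memory", "NIC",
    "network",
    "NIC", "GPU memory",
]


def summarize(path):
    """Count the hops on a path and how often it stages data in host memory."""
    return {
        "hops": len(path) - 1,
        "host_memory_copies": path.count("host memory"),
    }


staged = summarize(STAGED_PATH)
direct = summarize(DIRECT_PATH)

print(staged)  # {'hops': 6, 'host_memory_copies': 2}
print(direct)  # {'hops': 4, 'host_memory_copies': 0}
```

The two host-memory copies that disappear on the direct path are exactly the steps that would otherwise occupy the CPU and system memory on each node.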

Optimizing Large Model Networks: Strategic Configuration

For large models, optimal performance depends on careful configuration, especially of how GPUs are paired with InfiniBand network cards. The DGX system from our partner NVIDIA sets a benchmark here by advocating a one-to-one pairing of GPU and InfiniBand network card. In this design, a standard compute node can accommodate nine InfiniBand NICs: one is dedicated to the storage system, and the remaining eight are each assigned to an individual GPU card.

This configuration is optimal but costly, which prompts the search for more budget-friendly alternatives. A workable compromise is a ratio of one InfiniBand network card to four GPU cards.

In practice, both the GPUs and the InfiniBand cards attach to a PCI-E switch, typically with two GPUs per switch. Ideally, each GPU has its own dedicated InfiniBand network card. When two GPUs share a single InfiniBand network card and PCI-E switch, however, they contend for access to the shared card.

PCI-E switch-connection
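
The effect of sharing can be estimated with a small model. The sketch below assumes neighbouring GPUs are grouped onto the same network card and that GPUs sharing a card split its bandwidth evenly when they all send at once; both assumptions, the function names, and the 200 Gbps example speed are illustrative rather than taken from any vendor tool.

```python
from collections import Counter


def assign_nics(num_gpus: int, num_nics: int) -> dict:
    """Map each GPU index to a NIC index, grouping neighbouring GPUs."""
    if num_gpus <= 0 or num_nics <= 0 or num_gpus % num_nics:
        raise ValueError("num_gpus must be a positive multiple of num_nics")
    gpus_per_nic = num_gpus // num_nics
    return {gpu: gpu // gpus_per_nic for gpu in range(num_gpus)}


def per_gpu_share_gbps(num_gpus: int, num_nics: int, nic_gbps: float) -> float:
    """Worst-case bandwidth per GPU when all GPUs on a NIC send at once."""
    mapping = assign_nics(num_gpus, num_nics)
    busiest = max(Counter(mapping.values()).values())
    return nic_gbps / busiest


one_to_one = per_gpu_share_gbps(8, 8, 200.0)   # 200.0
two_per_nic = per_gpu_share_gbps(8, 4, 200.0)  # 100.0
four_per_nic = per_gpu_share_gbps(8, 2, 200.0) # 50.0
```

With eight GPUs, moving from one card per GPU to one card per four GPUs cuts the worst-case share of each GPU to a quarter, which is the contention cost that the cheaper ratio trades for.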

The number of InfiniBand network cards directly determines the level of contention and, consequently, the communication efficiency between nodes, as the accompanying diagram shows. With a single 100 Gbps network card, the bandwidth is about 12 GB/s, and it increases almost linearly as more network cards are added. A configuration of eight H100 cards paired with eight 400G InfiniBand NDR cards would therefore yield a very high data transfer rate.

nvidia-ib-bw
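
Link speeds are quoted in gigabits per second, while measured bandwidth is usually reported in gigabytes per second, so a short conversion helps when reading such charts. The sketch below computes the ideal line-rate figure only; it assumes perfectly linear scaling and ignores protocol overhead, and the function names are illustrative.

```python
def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert a link speed in gigabits per second to gigabytes per second."""
    return gbps / 8


def node_bandwidth_gbytes_per_s(num_nics: int, nic_gbps: float) -> float:
    """Ideal aggregate bandwidth of a node, assuming linear scaling."""
    return num_nics * gbps_to_gbytes_per_s(nic_gbps)


single_100g = node_bandwidth_gbytes_per_s(1, 100)  # 12.5
eight_ndr = node_bandwidth_gbytes_per_s(8, 400)    # 400.0
```

Measured results sit somewhat below these ceilings, but the ceiling itself scales with the number of cards, which is why an eight-card NDR node can in principle move hundreds of gigabytes per second.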

One network card per GPU is the ideal situation:

gpu-network-solution

Architecting Excellence: Rail Optimization for Large Model Network Topology

For large-scale model work, the key is a fat-tree network topology designed for the purpose. Compared with the fat trees of traditional high-performance computing (HPC), a "Rails"-optimized topology delivers enhanced performance.

Lower-End Fat-tree and Rails-Optimized Topology

This illustration shows a basic version of the fat-tree and Rails-optimized topology. It uses two switches, where MQM8700 denotes an HDR switch. Four HDR cables connect the two HDR switches to guarantee the interconnection speed between them. Each DGX GPU node has a total of nine InfiniBand (IB) cards, labeled Host Channel Adapters (HCAs) in the diagram. One card is dedicated to storage (Storage Target), while the remaining eight are used for large-scale model training. Specifically, HCA1/3/5/7 connect to the first switch, and HCA2/4/6/8 connect to the second switch.

Full-Speed Rails-Optimized Topology

For a high-performance network, a non-blocking, fully rail-optimized topology is recommended, as depicted in the above diagram. Each DGX GPU node has eight IB cards, each connecting to its own switch. These eight switches are called leaf switches: HCA1 links to the first leaf switch, HCA2 to the second, and so forth. Spine switches provide high-speed connectivity between the leaf switches.
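
Both wiring schemes follow the same rule: the HCA number decides which switch a cable goes to. A minimal Python sketch of that rule, using a hypothetical helper named `rail_switch` and 1-based numbering to match the HCA labels in the text:

```python
def rail_switch(hca: int, num_leaf_switches: int) -> int:
    """Return the 1-based switch number that a 1-based HCA number connects to."""
    if hca < 1 or num_leaf_switches < 1:
        raise ValueError("hca and num_leaf_switches must be at least 1")
    return (hca - 1) % num_leaf_switches + 1


# Two-switch layout: odd HCAs go to switch 1, even HCAs to switch 2.
two_switch = {hca: rail_switch(hca, 2) for hca in range(1, 9)}

# Full-speed layout: eight leaf switches, one per HCA.
eight_switch = {hca: rail_switch(hca, 8) for hca in range(1, 9)}

print(two_switch)    # {1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2, 7: 1, 8: 2}
print(eight_switch)  # {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8}
```

Because every node applies the same mapping, the HCAs with a given number on all nodes land on the same switch, so traffic between same-numbered HCAs stays within a single switch.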

The underlying topology, shown in the following diagram, has two green spine switches and four blue leaf switches. A total of 80 cables interconnect the blue and green switches, and the blue switches sit at the bottom, connected to the compute nodes. This configuration avoids bottlenecks, so each IB card can communicate at high speed with every other IB card in the network. As a result, any GPU can communicate with any other GPU at high speed.

spine to leaf to server nodes diagram
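
The cable count in this example can be broken down per switch. The sketch below assumes the 80 cables are spread evenly across every leaf and spine pair, which the text does not state explicitly; the function name and dictionary keys are illustrative.

```python
def leaf_spine_links(num_leaves: int, num_spines: int, total_cables: int) -> dict:
    """Split a leaf-to-spine cable count evenly across all switch pairs."""
    pairs = num_leaves * num_spines
    if pairs <= 0 or total_cables % pairs:
        raise ValueError("total_cables must divide evenly across leaf-spine pairs")
    per_pair = total_cables // pairs
    return {
        "cables_per_leaf_spine_pair": per_pair,
        "uplinks_per_leaf": per_pair * num_spines,
        "downlinks_per_spine": per_pair * num_leaves,
    }


layout = leaf_spine_links(num_leaves=4, num_spines=2, total_cables=80)
print(layout)
# {'cables_per_leaf_spine_pair': 10, 'uplinks_per_leaf': 20, 'downlinks_per_spine': 40}
```

Under that even-split assumption, each leaf switch has 20 uplinks, so it can serve up to 20 node-facing ports at the same speed without oversubscribing its path to the spine layer.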

Deciding Excellence: FS's InfiniBand and RoCE Solutions

For high-performance, lossless networks, the choice between InfiniBand and RoCE depends on the specific demands of your application and infrastructure. Both offer low latency, high bandwidth, and minimal CPU overhead, which makes them well suited to high-performance computing applications.

FS offers an extensive range of high-speed products for both InfiniBand and Ethernet solutions. Our modules are available in speeds from 40G to 800G, with options for multi-rate DACs and AOCs to meet the diverse needs of our customers. We also provide NVIDIA® switches and NICs to further enhance your networking capabilities. These products deliver strong performance and help customers accelerate their business at an economical cost.