RDMA: Accelerating Cluster Performance

Posted on Dec 17, 2023


Driven by enterprise digitization, new applications keep emerging and going into production. As data becomes a core business asset, demand grows for high-performance computing, big data analytics, and storage applications. Traditional data transmission protocols such as TCP/UDP run into numerous bottlenecks under these new requirements.

RoCE Technological Benefits and Ecosystem Development

RDMA (Remote Direct Memory Access) is a high-performance network communication technology and a fundamental component of the InfiniBand network standard. It builds on DMA (Direct Memory Access), in which a device accesses host memory directly without CPU intervention. RDMA extends this across the network: the network interface reads and writes application memory directly, bypassing the operating system kernel on the data path. The result is low-latency, high-throughput communication that is particularly well suited to large parallel computing clusters.

By optimizing the transport layer and offloading work to the network interface card, RDMA lets applications make better use of network link resources. RDMA was first implemented on InfiniBand and later extended to traditional Ethernet to meet growing demand. Ethernet-based RDMA comes in two forms, iWARP and RoCE, and RoCE is further divided into RoCEv1 and RoCEv2. Compared with the more expensive InfiniBand, RoCE and iWARP have significantly lower hardware costs.

RDMA running over Ethernet is known as RoCE (RDMA over Converged Ethernet). Today, the prevalent choice for high-performance Ethernet networks is RoCE v2, which carries RDMA traffic over UDP/IP and is therefore routable across Layer 3 networks. By converging Ethernet and RDMA, it is used in a wide range of Ethernet deployment scenarios.
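To make the layering concrete, here is a minimal Python sketch of how a RoCE v2 packet nests an InfiniBand transport header inside a UDP datagram. It assumes the registered RoCE v2 UDP destination port (4791) and the 12-byte InfiniBand Base Transport Header (BTH) layout. The queue pair number, sequence number, source port, and opcode are placeholder values chosen for illustration, and the Ethernet/IP headers and the trailing ICRC are omitted.

```python
import struct

ROCEV2_UDP_DPORT = 4791  # UDP destination port registered for RoCE v2


def build_bth(opcode, dest_qp, psn, pkey=0xFFFF, ack_req=False):
    """Pack a 12-byte InfiniBand Base Transport Header (BTH)."""
    flags = 0                                   # SE=0, M=0, PadCnt=0, TVer=0
    qp_word = dest_qp & 0xFFFFFF                # 8 reserved bits + 24-bit destination QP
    psn_word = (int(ack_req) << 31) | (psn & 0xFFFFFF)  # AckReq + 7 reserved + 24-bit PSN
    return struct.pack("!BBHII", opcode, flags, pkey, qp_word, psn_word)


def build_udp(src_port, payload):
    """Prefix the payload with a UDP header (checksum left at 0 for brevity)."""
    length = 8 + len(payload)
    return struct.pack("!HHHH", src_port, ROCEV2_UDP_DPORT, length, 0) + payload


# Placeholder values: 0x0A is assumed here to be the RC "RDMA WRITE Only"
# opcode; check the InfiniBand specification before relying on it.
bth = build_bth(opcode=0x0A, dest_qp=0x000012, psn=1, ack_req=True)
datagram = build_udp(src_port=49152, payload=bth + b"hello")
print(len(bth), len(datagram))
```

Because the RDMA transport header rides inside ordinary UDP/IP, standard IP routers can forward the packet without understanding RDMA; only the endpoints' network cards interpret the BTH.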

[Figure: Socket vs. RDMA]

Compared with TCP/IP, RDMA uses kernel bypass and zero-copy techniques to reduce latency and CPU usage, relieve memory bandwidth bottlenecks, and achieve high bandwidth utilization. RDMA provides an I/O channel through which an application reads and writes remote virtual memory directly via the RDMA device.

[Figure: TCP/IP vs. RDMA (RoCE)]

RDMA establishes a dedicated data path between the application and the network that bypasses the system kernel. On this path the host CPU does almost no work to move data, because the transfer is handled by the ASIC on the network adapter. RDMA moves data over the network directly from the memory of one system into the memory of another without involving either operating system, which minimizes the compute resources spent on communication.

This removes the overhead of extra memory copies and context switches, freeing memory bandwidth and CPU cycles to improve application performance and overall cluster efficiency. RDMA is widely adopted in supercomputing centers and at internet companies and has a mature application and network ecosystem. Its adoption in large-scale enterprise data centers marks a new stage in the development of that ecosystem.
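As a rough illustration of where the savings come from, the Python sketch below tallies CPU copies and user/kernel transitions on a conventional socket path and on an RDMA path. The step lists are a simplified, assumed model (one send and one receive, no zero-copy socket options, connection setup and memory registration excluded), not a measurement of any particular stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    description: str
    cpu_copy: bool         # True if the CPU copies the payload in this step
    kernel_crossing: bool  # True if the step needs a user/kernel transition


# Assumed, simplified send + receive over a kernel TCP socket.
SOCKET_PATH = [
    Step("send(): copy user buffer into kernel socket buffer", True, True),
    Step("NIC DMA-reads kernel buffer and transmits", False, False),
    Step("remote NIC DMA-writes into kernel receive buffer", False, False),
    Step("recv(): copy kernel buffer into user buffer", True, True),
]

# Assumed, simplified one-sided RDMA write between registered buffers.
RDMA_PATH = [
    Step("app posts work request to NIC queue from user space", False, False),
    Step("NIC DMA-reads registered user buffer and transmits", False, False),
    Step("remote NIC DMA-writes into registered user buffer", False, False),
]


def summarize(path):
    """Count CPU copies and kernel crossings along a data path."""
    return {
        "cpu_copies": sum(step.cpu_copy for step in path),
        "kernel_crossings": sum(step.kernel_crossing for step in path),
    }


print("socket:", summarize(SOCKET_PATH))
print("rdma:  ", summarize(RDMA_PATH))
```

The model only covers the steady-state data path. Setting up a connection and registering memory still go through the kernel, but that cost is paid once rather than per transfer.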

Enhancing HPC Application Efficiency with GPU Direct-RDMA

Traditional TCP networks rely heavily on the CPU for packet processing and often struggle to make full use of the available bandwidth. In HPC environments, RDMA has therefore become an essential network transport technology, particularly for large-scale cluster training.

RDMA is not limited to transferring user-space data held in CPU memory; it can also move data between GPUs on different servers in a GPU cluster. This is where GPU Direct technology, a key component for optimizing HPC performance, comes into play. As deep learning models grow more complex and data volumes surge, a single machine can no longer meet the computational requirements, so distributed training across multiple machines and GPUs has become necessary. Communication between machines then becomes a critical factor in distributed training performance, and GPU Direct RDMA is instrumental in accelerating GPU communication across machines.

➢ GPU Direct RDMA: Using the RoCE capability of the network cards, it enables high-speed exchange of GPU memory data between server nodes in a GPU cluster.

At the network design level, NVIDIA improves GPU cluster performance by supporting GPU Direct RDMA. The diagram below illustrates how GPU Direct RDMA works.

[Figure: GPU Direct RDMA]

GPU cluster networking places stringent demands on network latency and bandwidth. Traditional network transmission can limit the parallel processing capability of GPUs and leave resources underused. On the conventional path, high-bandwidth transfers between GPUs on different nodes are staged through CPU memory, which introduces bottlenecks in memory reads and writes and adds CPU load. GPU Direct RDMA addresses this by exposing the network card directly to the GPU, allowing direct remote access between GPU memory spaces. This raises bandwidth and lowers latency, substantially improving the efficiency of GPU cluster operations.
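The effect of removing the host-memory staging step can be sketched with a back-of-the-envelope model in Python. It treats each hop as a separate store-and-forward transfer and ignores latency, pipelining, and protocol overhead, so it exaggerates absolute times; the link speeds are hypothetical placeholders, not measured or vendor figures.

```python
def transfer_time_s(size_bytes, hop_bandwidths_gbps):
    """Estimate the time to move a payload across sequential hops.

    Each hop is modeled as size / bandwidth; latency, pipelining, and
    protocol overhead are ignored.
    """
    return sum(size_bytes * 8 / (bw * 1e9) for bw in hop_bandwidths_gbps)


# Placeholder link speeds in Gbit/s (hypothetical, for illustration only).
PCIE_GBPS = 200.0
NETWORK_GBPS = 100.0

size = 1 * 1024**3  # 1 GiB payload

# Staged path: GPU -> host memory, host memory -> NIC, wire,
# NIC -> host memory, host memory -> GPU.
staged = transfer_time_s(
    size, [PCIE_GBPS, PCIE_GBPS, NETWORK_GBPS, PCIE_GBPS, PCIE_GBPS]
)

# GPU Direct RDMA path: GPU -> NIC, wire, NIC -> GPU.
direct = transfer_time_s(size, [PCIE_GBPS, NETWORK_GBPS, PCIE_GBPS])

print(f"staged: {staged:.3f} s, direct: {direct:.3f} s")
```

Under these assumptions the direct path saves exactly the two host-memory hops; in a real system the gain also depends on PCIe topology and on how well transfers overlap.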

The FS RS6460 is a 4U dual-socket rackmount GPU server for accelerated computing. It is powered by 2nd Gen Intel® Xeon® Scalable Processors, holds up to 24 hot-swap SAS/SATA hard drives, and supports up to 8 double-width PCIe GPUs, making it well suited to demanding applications such as HPC, deep learning, and large language model training.

Lossless Network Solution for Data Center Switches

[Figure: RoCE solution]

The set of switch features that supports RoCE traffic is commonly referred to as the lossless Ethernet solution. It comprises the following key technologies:

➢ ECN Technology: ECN (Explicit Congestion Notification) provides an end-to-end congestion notification mechanism at the IP and transport layers. It uses the two-bit ECN field in the IP packet header, next to the DS field, to indicate congestion along the transmission path. ECN-capable endpoints detect congestion from these markings and adjust their sending behavior to keep congestion from escalating. Enhanced Fast ECN marks the ECN field when a packet is dequeued, which minimizes the marking delay during forwarding. The receiving server therefore sees ECN-marked packets sooner, and the sender's rate adjustment is triggered faster.

➢ PFC Technology: PFC (Priority-based Flow Control) provides hop-by-hop flow control per priority. A device maps packets to queues according to their priority and schedules and forwards them accordingly. When packets of a given priority arrive faster than the receiver can drain them and its available buffer space runs low, the receiving device sends a PFC PAUSE frame to the previous hop. On receiving the PAUSE frame, the previous hop stops sending packets of that priority and resumes only after it receives a PFC XON frame or the pause timer expires. Because only the congested priority is paused, congestion in one traffic type does not disrupt forwarding of the other traffic types on the same link.
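The two mechanisms can be sketched together as a toy queue for a single priority class, written here in Python. The ECN codepoint values follow RFC 3168; everything else is an assumption for illustration: the class name `LosslessQueue`, the thresholds (counted in packets, whereas real switches count bytes or cells), and the event strings are hypothetical, not any vendor's implementation.

```python
from collections import deque

ECN_MASK = 0b11  # low two bits of the IPv4 ToS / IPv6 Traffic Class byte
ECN_CE = 0b11    # "Congestion Experienced" codepoint (RFC 3168)


def mark_ce(tos):
    """Return the ToS byte with its ECN field set to CE."""
    return tos | ECN_CE


def is_ect(tos):
    """True if the packet is ECN-capable (ECT(0), ECT(1)) or already CE."""
    return tos & ECN_MASK != 0


class LosslessQueue:
    """Toy queue for one priority class with ECN marking and PFC signaling."""

    def __init__(self, ecn_threshold=4, xoff=8, xon=3):
        self.q = deque()
        self.ecn_threshold = ecn_threshold  # hypothetical marking threshold
        self.xoff, self.xon = xoff, xon     # hypothetical PFC thresholds
        self.paused_upstream = False
        self.events = []

    def enqueue(self, tos):
        self.q.append(tos)
        if not self.paused_upstream and len(self.q) >= self.xoff:
            self.paused_upstream = True
            self.events.append("PFC PAUSE")  # ask previous hop to stop this priority

    def dequeue(self):
        depth = len(self.q)                  # depth observed at dequeue time
        tos = self.q.popleft()
        if depth >= self.ecn_threshold and is_ect(tos):
            tos = mark_ce(tos)               # mark on dequeue, as Fast ECN does
        if self.paused_upstream and len(self.q) <= self.xon:
            self.paused_upstream = False
            self.events.append("PFC XON")    # let previous hop resume
        return tos


queue = LosslessQueue()
for _ in range(8):
    queue.enqueue(0b10)                      # ECT(0) packets arrive in a burst
out = [queue.dequeue() for _ in range(8)]
print(out, queue.events)
```

Setting the ECN threshold below the XOFF threshold, as in this sketch, reflects the usual design intent: senders should be told to slow down via ECN before the switch has to fall back on pausing the link with PFC.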

Streamlining RDMA and RoCE Product Selection

Drawing on practical experience with lossless Ethernet deployments, NVIDIA has adopted ECN as its core congestion control technology. With hardware-accelerated Fast ECN, the system responds quickly to congestion. ETS (Enhanced Transmission Selection) and optimized physical buffer management further tune resource scheduling to the traffic model. PFC, by contrast, carries a risk of network deadlock. Comparative evaluations indicate that PFC flow control is of limited effectiveness in improving network reliability and resolving congestion-induced packet loss, and that it brings its own risks and performance bottlenecks.
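One way to picture the deadlock risk: a PFC deadlock can arise when paused queues end up waiting on each other in a loop, so that no device in the loop can drain. The Python sketch below checks a hypothetical "is paused by" graph for such a cycle. The switch names and topologies are invented for illustration, and real deadlock analysis also has to consider priorities and routing.

```python
def find_pause_cycle(waits_on):
    """Return one cycle in a directed 'X is paused by Y' graph, or None."""
    WHITE, GREY, BLACK = 0, 1, 2             # unvisited, on current path, done
    color = {node: WHITE for node in waits_on}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in waits_on.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:                # back edge: the path loops
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(waits_on):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical pause dependencies between switches.
looped = {
    "leaf1": ["spine1"],
    "spine1": ["leaf2"],
    "leaf2": ["spine2"],
    "spine2": ["leaf1"],
}
loop_free = {"leaf1": ["spine1"], "leaf2": ["spine1"], "spine1": []}

print(find_pause_cycle(looped))
print(find_pause_cycle(loop_free))
```

A loop-free dependency graph cannot deadlock in this model, which is one reason congestion control that acts before PFC triggers is attractive.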

RDMA achieves efficient end-to-end network communication by accelerating remote data transfers. It combines kernel bypass on the host side, transport-layer offload on the network card, and congestion control on the network side. The results are low latency, high throughput, and minimal CPU overhead. Current RDMA implementations still have constraints, however, including limited scalability and complex configuration and modification.

As the technology evolves, RDMA and RoCE product selection should track new developments and account for these limitations to ensure smooth integration and sustained network performance.

Building a high-performance RDMA network requires more than RDMA adapters and capable servers; high-speed optical modules, switches, and optical cables are equally important. FS provides high-speed data transmission products and solutions for this purpose, including high-performance switches, 200/400/800G optical modules, and smart network cards, designed for low-latency, high-speed data transmission. To meet customers' high-bandwidth, low-latency network requirements, FS also provides a series of high-performance InfiniBand network switches and optical modules. The InfiniBand optical modules support multiple speeds to suit networks of different sizes and a range of application scenarios.

FS's products and solutions are widely deployed across various industries, serving large-scale scientific computing, real-time data analysis, and the strict low-latency requirements of financial transactions. FS offers a balance between cost-effectiveness and operational efficiency for deploying high-performance networks.