How to Build a Lossless Network with RDMA?

Posted on Dec 18, 2023

Anyone who looks into RDMA and lossless networks soon runs into two fundamental questions: Why is a lossless network necessary? What advantages do these technologies offer?

These questions are hard to answer from a networking standpoint alone. FS's technical experts therefore approach them with examples from both the front-end business and the back-end application perspectives.

Why is a Lossless Network Essential?

Online businesses (search, shopping, live streaming, and more) must respond quickly to high-frequency user requests. Even slight latency introduced anywhere in the data center can degrade the user experience and, with it, website traffic, reputation, and the number of active users.

Moreover, as the trends in machine learning surge, the demand for computational power is skyrocketing. To cater to the intricate requirements of complex neural networks and deep learning models, data centers are deploying numerous distributed computing clusters. However, the communication latency inherent in large-scale parallel programs can significantly impede the overall computational efficiency.

To handle the growing volume of data stored and retrieved within data centers, distributed storage networks built on converged Ethernet are gaining popularity. However, storage traffic is dominated by elephant flows, and when congestion causes packet loss, these large flows must be retransmitted, which lowers efficiency and makes the congestion worse.

From the perspective of both front-end user experience and back-end application efficiency, the current prerequisites for data center networks are clear: lower latency is better, and higher efficiency is paramount.

To mitigate internal network latency and enhance processing efficiency in data centers, RDMA technology has emerged. Allowing user-level applications to directly read from and write to remote memory without involving the CPU in multiple memory copies, RDMA bypasses the kernel and writes data directly to the network card. This achieves high throughput, ultra-low latency, and minimal CPU overhead.

Presently, the prevailing transport protocol for RDMA over Ethernet is RoCEv2 (RDMA over Converged Ethernet v2). RoCEv2 runs over UDP (User Datagram Protocol), which is connectionless, and its protocol processing is offloaded to the network card, so it is faster and consumes fewer CPU resources than the connection-oriented TCP (Transmission Control Protocol). However, it lacks TCP's sliding-window mechanism and fine-grained loss recovery. When a packet is lost, recovery is expensive: many RoCE network cards retransmit everything from the lost packet onward (a go-back-N scheme), which sharply reduces the efficiency of RDMA transmission.

To unlock the true performance of RDMA and overcome network performance bottlenecks in large-scale distributed systems within data centers, establishing a lossless network environment specifically tailored for RDMA is essential. The key to achieving lossless transmissions lies in effectively addressing network congestion.

What is RDMA?

RDMA (Remote Direct Memory Access) is a sophisticated technology crafted to mitigate the latency associated with server-side data processing during network transfers.

In the conventional mode of transferring data between applications on two servers, the process unfolds as follows:

Data is copied from the application buffer to the TCP protocol stack buffer in the kernel.
It is then copied to the driver layer.
Finally, it is copied to the NIC (Network Interface Card) buffer.

Multiple memory copies necessitate CPU intervention, leading to significant processing latency, often extending into tens of microseconds. Moreover, the extensive CPU involvement throughout this process consumes a considerable amount of CPU performance, potentially impacting normal data computations.

Enter RDMA mode, where application data can circumvent the kernel protocol stack and be directly written to the network card. This approach delivers noteworthy benefits, including:

Substantial reduction of processing latency from tens of microseconds to within 1 microsecond.
Minimal CPU involvement throughout the process, resulting in performance savings.
Enhanced transmission bandwidth.

What Does RDMA Demand from the Network?

RDMA is increasingly used in high-performance computing, big data analytics, and high-concurrency I/O scenarios. Technologies like iSCSI, SAN, Ceph, MPI, Hadoop, Spark, and TensorFlow are adopting RDMA for their operations. For the underlying network supporting end-to-end transmission, the most crucial requirements are low latency (on the order of microseconds) and losslessness.

Low Latency

Network forwarding latency predominantly occurs at device nodes (excluding optical-electrical transmission latency and data serialization latency). Device forwarding latency encompasses three key parts:

Store-and-forward latency: The chip's forwarding pipeline processing delay, approximately 1 microsecond per hop (industry attempts to use cut-through mode aim to reduce single-hop latency to around 0.3 microseconds).
Buffer caching latency: In network congestion situations, packets are buffered before forwarding. The larger the buffer, the longer the packets are cached, resulting in higher latency. For RDMA networks, optimal buffer size selection is crucial, and a larger buffer is not necessarily better.
Retransmission latency: The delay added when a lost packet has to be sent again. RDMA networks leverage various techniques to prevent packet loss so that this cost is not incurred in the first place.
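To put rough numbers on the first two components, the following minimal Python sketch estimates per-path forwarding latency and the time a packet waits behind a full buffer. The per-hop figures come from the text above; the hop count, buffer size, and link speed are illustrative assumptions, and the function names are hypothetical.

```python
def forwarding_latency_us(hops, per_hop_us):
    """Total chip forwarding latency across `hops` switches, in microseconds."""
    return hops * per_hop_us


def buffer_drain_time_us(buffered_bytes, link_gbps):
    """Time for `buffered_bytes` queued ahead of a packet to leave the port.

    1 Gb/s equals 1e3 bits per microsecond, so the result is in microseconds.
    """
    return (buffered_bytes * 8) / (link_gbps * 1e3)


# Assumed example path: 5 hops, store-and-forward vs. cut-through.
store_and_forward = forwarding_latency_us(hops=5, per_hop_us=1.0)
cut_through = forwarding_latency_us(hops=5, per_hop_us=0.3)

# Assumed example: 1 MB already queued on a 100 Gb/s egress port.
queueing = buffer_drain_time_us(buffered_bytes=1_000_000, link_gbps=100)
print(store_and_forward, cut_through, queueing)
```

Under these assumptions a single congested port holding 1 MB adds about 80 microseconds, far more than the forwarding latency of the whole path, which is why a larger buffer is not necessarily better for RDMA.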

Lossless Network

RDMA achieves full-rate transmission in a lossless state, but performance sharply declines when packet loss and retransmissions occur. In traditional network modes, large buffers are the primary means to achieve losslessness. However, as mentioned earlier, this contradicts the requirement for low latency. In an RDMA network environment, the objective is to achieve losslessness with smaller buffers.

Within this constraint, RDMA attains losslessness primarily through network flow control techniques based on PFC (Priority Flow Control) and ECN (Explicit Congestion Notification).

Key Technology for Achieving Lossless RDMA Networks: PFC

Priority-based Flow Control (PFC) is a queue-based backpressure mechanism that operates on priority levels. It prevents buffer overflow and packet loss by sending Pause frames, signaling upstream devices to halt packet transmission.

PFC enables the individual pausing and resuming of specific virtual channels without impacting other traffic. In the illustrated scenario, when Queue 7's buffer consumption hits the configured PFC flow control threshold:

The local switch initiates a PFC Pause frame transmission upstream.
The upstream device receiving the Pause frame temporarily suspends packet transmission from that queue.
If the upstream device's buffer also reaches a threshold, it continues triggering Pause frames to apply backpressure upstream.
Ultimately, packet loss is avoided because the sending rate of that priority queue is reduced.
When buffer occupancy falls below the recovery threshold, a PFC resume frame (a Pause frame with a pause time of zero) is sent so that the upstream device starts transmitting again.
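The pause and resume behavior described above can be sketched as a small simulation. This is a simplified model, not a switch implementation: the class name, the threshold values, and the unit ("cells") are hypothetical, and it assumes a Pause frame takes effect immediately, whereas real devices reserve extra headroom for packets already in flight.

```python
from dataclasses import dataclass


@dataclass
class PfcQueue:
    xoff_threshold: int  # send Pause when occupancy reaches this level
    xon_threshold: int   # send resume when occupancy falls to this level
    capacity: int        # total buffer available to this priority queue
    occupancy: int = 0
    upstream_paused: bool = False
    dropped: int = 0

    def enqueue(self, cells):
        """Accept arriving cells; return "PAUSE" if a Pause frame is triggered."""
        accepted = min(cells, self.capacity - self.occupancy)
        self.dropped += cells - accepted
        self.occupancy += accepted
        if not self.upstream_paused and self.occupancy >= self.xoff_threshold:
            self.upstream_paused = True
            return "PAUSE"
        return None

    def dequeue(self, cells):
        """Forward cells out of the queue; return "RESUME" once it has drained."""
        self.occupancy = max(0, self.occupancy - cells)
        if self.upstream_paused and self.occupancy <= self.xon_threshold:
            self.upstream_paused = False
            return "RESUME"
        return None


def simulate(queue, arrival_per_tick, drain_per_tick, ticks):
    """Run the queue for `ticks` steps and collect (tick, signal) events."""
    events = []
    for tick in range(ticks):
        if not queue.upstream_paused:  # a paused upstream sends nothing
            signal = queue.enqueue(arrival_per_tick)
            if signal:
                events.append((tick, signal))
        signal = queue.dequeue(drain_per_tick)
        if signal:
            events.append((tick, signal))
    return events


queue = PfcQueue(xoff_threshold=80, xon_threshold=40, capacity=100)
events = simulate(queue, arrival_per_tick=30, drain_per_tick=10, ticks=12)
print(events, "dropped:", queue.dropped)
```

With arrivals three times faster than the drain rate, the queue alternates between pausing and resuming the upstream device and never overflows, which is the trade the text describes: lower throughput in exchange for zero loss.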

Key Technology for Achieving Lossless RDMA Networks: ECN

Explicit Congestion Notification (ECN) is a long-established technology that saw little use in the past but is now widely deployed for congestion signaling between hosts.

When the queue at the egress port of a network device grows beyond the ECN threshold, the device marks packets using the ECN field in the IP header. This marking indicates that the packet has encountered network congestion. Upon identifying the ECN marking in a packet, the receiving server promptly generates a Congestion Notification Packet (CNP) and dispatches it back to the source server. This CNP contains information about the flow responsible for congestion. Upon receiving the CNP, the source server adjusts the sending rate of the corresponding flow, mitigating network congestion and preventing packet loss.
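The marking and reaction loop can be sketched in a few lines of Python. The two-bit ECN codepoints are the standard ones from the IP header; everything else is an assumption made for illustration: the function and class names are hypothetical, the single fixed threshold stands in for whatever marking profile a switch actually uses, and the halve-then-recover rate rule is a simplified placeholder for the congestion control algorithm running on the network card.

```python
# Standard two-bit ECN codepoints in the IP header.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


def switch_mark(ecn_bits, queue_depth, ecn_threshold):
    """Egress-port behavior: mark ECN-capable packets once the queue is too deep."""
    if ecn_bits != NOT_ECT and queue_depth > ecn_threshold:
        return CE  # Congestion Experienced
    return ecn_bits


def receiver_needs_cnp(ecn_bits):
    """Receiver behavior: a CE-marked packet triggers a CNP back to the sender."""
    return ecn_bits == CE


class Sender:
    """Assumed rate control: cut the rate on each CNP, recover gradually afterwards."""

    def __init__(self, line_rate_gbps):
        self.line_rate_gbps = line_rate_gbps
        self.rate_gbps = line_rate_gbps

    def on_cnp(self, cut=0.5):
        self.rate_gbps *= (1 - cut)

    def on_quiet_period(self, step_gbps=5):
        self.rate_gbps = min(self.line_rate_gbps, self.rate_gbps + step_gbps)


sender = Sender(line_rate_gbps=100)
marked = switch_mark(ECT0, queue_depth=120, ecn_threshold=100)
if receiver_needs_cnp(marked):
    sender.on_cnp()          # the flow slows down before any buffer overflows
sender.on_quiet_period()     # no further CNPs: the rate creeps back up
print(sender.rate_gbps)
```

Note that a packet sent as not ECN-capable is never marked in this sketch, so ECN-based control only works when the end hosts enable ECN on their RDMA traffic.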

Both mechanisms are triggered by buffer thresholds, so achieving end-to-end lossless transmission through PFC and ECN hinges on configuring those thresholds correctly. Setting them accurately requires careful management of the switch's Memory Management Unit (MMU), which governs how the switch's buffer is allocated.

Conclusion: Achieving Lossless Transmission in RDMA Networks

RDMA networks achieve lossless transmission through the deployment of PFC and ECN functionalities. PFC technology controls RDMA-specific queue traffic on the link, applying backpressure to upstream devices during congestion at the switch's ingress port. With ECN technology, end-to-end congestion control is achieved by marking packets during congestion at the egress port, prompting the sending end to reduce its transmission rate.

Optimal network performance is achieved by adjusting the buffer thresholds for ECN and PFC so that ECN triggers before PFC. This allows the network to keep forwarding data at full speed while the servers actively reduce their transmission rate to relieve congestion. Only if congestion persists does PFC take over, pausing packet transmission from upstream switches; this reduces network throughput but does not cause packet loss.
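The ordering rule above lends itself to a simple sanity check before a configuration is rolled out. The sketch below is a hypothetical helper, not a vendor tool: the parameter names and the kilobyte values are assumptions, and comparing an egress ECN threshold directly with an ingress PFC threshold is a deliberate simplification of how a real MMU accounts for buffer usage.

```python
def check_thresholds(ecn_kb, pfc_xoff_kb, pfc_xon_kb, buffer_kb):
    """Return a list of problems found in a set of buffer thresholds (in KB)."""
    problems = []
    if not ecn_kb < pfc_xoff_kb:
        problems.append("ECN threshold should be below the PFC pause threshold")
    if not pfc_xon_kb < pfc_xoff_kb:
        problems.append("PFC recovery threshold should be below the pause threshold")
    if not pfc_xoff_kb < buffer_kb:
        problems.append("PFC pause threshold should leave headroom below the buffer size")
    return problems


# Assumed example values: ECN fires first, PFC acts as the last line of defense.
print(check_thresholds(ecn_kb=100, pfc_xoff_kb=400, pfc_xon_kb=200, buffer_kb=1000))

# ECN set above the pause threshold: PFC would trigger before senders slow down.
print(check_thresholds(ecn_kb=500, pfc_xoff_kb=400, pfc_xon_kb=200, buffer_kb=1000))
```

An empty list means the thresholds respect the ordering described in the text; it does not mean the absolute values are well tuned for a given switch or traffic mix.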

Deploying RDMA in data center networks requires meeting its demand for lossless network transmission. Fine-grained operations and maintenance are crucial in environments that are sensitive to both latency and packet loss. Challenges remain, including PFC storms, deadlocks, and complex ECN threshold design in multi-tier networks. FS's R&D team is committed to delivering enhanced services and optimal solutions for customers while maintaining a high level of reliability. FS offers a range of products, including NVIDIA® InfiniBand Switches, 100G/200G/400G/800G InfiniBand transceivers, and NVIDIA® InfiniBand Adapters, and aims to become a professional provider of communication and high-speed network system solutions.