RoCE Technology in High-Performance Computing: Insights and Applications
Evolution of HPC Networks and Emergence of RoCE
During the initial era of High-Performance Computing (HPC) systems, specialized networking solutions such as Myrinet, Quadrics, and InfiniBand were commonly chosen over Ethernet. These custom networks addressed the limitations of contemporary Ethernet, delivering higher bandwidth, lower latency, better congestion control, and distinctive features. In 2010, the InfiniBand Trade Association (IBTA) published the RoCE (RDMA over Converged Ethernet) protocol standard, followed in 2014 by the RoCEv2 standard, which added Layer 3 routability and substantially improved scalability. Together with the steady rise in Ethernet speeds, these enhancements have generated growing interest in high-performance network solutions compatible with traditional Ethernet. This shift has reversed the declining share of Ethernet among the HPC clusters on the TOP500 list, enabling Ethernet to maintain a prominent position in the rankings.
While Myrinet and Quadrics have faded from the scene, InfiniBand continues to play a crucial role in high-performance networks, as do proprietary interconnects such as Cray's network series, Tianhe's networks, and the Tofu series.
Introduction to RoCE Protocol
The RoCE protocol is a cluster network communication protocol that enables Remote Direct Memory Access (RDMA) over Ethernet. It offloads packet send/receive processing to the network card, so that, unlike with the TCP/IP stack, communication does not require a transition into kernel mode. This eliminates much of the overhead of copying, encapsulation, and decapsulation, sharply reducing Ethernet communication latency. It also lowers CPU utilization during communication, eases network congestion, and makes more efficient use of the available bandwidth.
The RoCE protocol consists of two versions: RoCE v1 and RoCE v2. RoCE v1 operates as a link-layer protocol, requiring both communicating parties to be within the same Layer 2 network. In contrast, RoCE v2 functions as a network-layer protocol, enabling RoCE v2 protocol packets to be routed at Layer 3, providing superior scalability.
RoCE V1 Protocol
RoCE v1 retains the interface, transport layer, and network layer of InfiniBand (IB) and replaces IB's link layer and physical layer with those of Ethernet. In the link-layer frame of a RoCE packet, the Ethertype field carries the IEEE-assigned value 0x8915, unambiguously identifying it as a RoCE packet. Because RoCE v1 does not adopt the Ethernet network layer, its packets carry no IP header; they therefore cannot be routed at the network layer and can only be forwarded within a single Layer 2 network.
RoCE V2 Protocol
RoCE v2 builds substantially on RoCE v1. It replaces the InfiniBand network layer used by RoCE v1 with the Ethernet/IP network layer plus a transport layer using UDP (destination port 4791), and it harnesses the DSCP and ECN fields of the IP header to implement congestion control. RoCE v2 packets can therefore be routed at Layer 3, giving the protocol far better scalability. Because RoCEv2 fully supersedes the original RoCE protocol, references to "the RoCE protocol" generally mean RoCE v2 unless the first generation is explicitly specified.
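As a rough illustration of this layering, the sketch below assembles placeholder RoCE v2 headers with Python's `struct` module. The field values are dummies; only the per-layer sizes and the IANA-assigned UDP destination port 4791 reflect the real protocol.

```python
import struct

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCE v2

def rocev2_headers():
    """Return the four headers that wrap an RDMA payload in RoCE v2.

    All field values are illustrative placeholders; only the layer
    sizes and the UDP destination port matter for this sketch.
    """
    eth = struct.pack("!6s6sH", b"\x00" * 6, b"\x00" * 6, 0x0800)  # 14 B, Ethertype = IPv4
    ip = struct.pack("!BBHHHBBH4s4s",
                     0x45, 0, 0, 0, 0, 64, 17, 0,   # version/IHL, DSCP+ECN, ..., proto = UDP
                     b"\x00" * 4, b"\x00" * 4)       # 20 B IPv4 header
    udp = struct.pack("!HHHH", 0, ROCEV2_UDP_PORT, 0, 0)  # 8 B UDP header
    bth = struct.pack("!BBHII", 0, 0, 0, 0, 0)       # 12 B InfiniBand Base Transport Header
    return eth, ip, udp, bth

lens = [len(h) for h in rocev2_headers()]
print(lens, sum(lens))  # [14, 20, 8, 12] 54 (plus the 4-byte Ethernet FCS on the wire)
```

The DSCP and ECN bits used for congestion control live in the second byte of the IPv4 header built above.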
Lossless Networks and RoCE Congestion Control Mechanism
Ensuring lossless transmission of RoCE traffic is pivotal in RoCE protocol-based networks. During RDMA communication, data packets must reach their destination without loss and in the correct order. Any packet loss or out-of-order arrival triggers a go-back-N retransmission, because packets arriving after the missing one are discarded rather than buffered at the receiver.
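The cost of go-back-N can be made concrete with a tiny sketch (the function name here is ours, not part of any RoCE API):

```python
def go_back_n_retransmissions(packets_sent, lost_index):
    """With go-back-N, losing the packet at `lost_index` forces
    retransmission of it and every packet sent after it, because the
    receiver discards (rather than buffers) out-of-order arrivals."""
    return packets_sent - lost_index

# Losing packet 3 of a 10-packet burst costs 7 retransmissions, not 1.
print(go_back_n_retransmissions(10, 3))  # 7
```

This amplification is why RoCE networks work so hard to avoid loss in the first place.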
The RoCE protocol implements congestion control in two stages: a first stage that slows senders down gradually using DCQCN (Data Center Quantized Congestion Notification), and a second stage that pauses transmission entirely using PFC (Priority Flow Control). Strictly speaking these are a congestion control strategy and a flow control strategy, respectively, but they are commonly treated as the two stages of RoCE congestion control.
In many-to-one communication patterns, congestion frequently arises, showing up as rapid growth of the pending send queue buffered at a single switch port. Left uncontrolled, the buffer saturates and packets are dropped. In the first stage, therefore, when the switch detects that the send queue at a port has reached a certain threshold, it marks the ECN field in the IP header of the RoCE packet. If the recipient sees this ECN mark, it sends a Congestion Notification Packet (CNP) back to the sender, prompting the sender to reduce its sending rate.
Crucially, not every packet is marked once the ECN threshold is reached. Two parameters, Kmin and Kmax, govern the process. When the congestion queue length is below Kmin, no marking occurs. Between Kmin and Kmax, the marking probability rises with queue length. Above Kmax, every packet is marked. The receiver does not send a CNP for every ECN-marked packet it receives; instead, it sends at most one CNP per time interval in which ECN-marked packets arrive. The sender can then adjust its sending rate based on the number of CNPs it receives.
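The marking rule above can be sketched as a small function. The linear ramp between Kmin and Kmax and the `pmax` knob are assumptions modeled on RED-style marking; the exact curve is implementation-dependent:

```python
def ecn_mark_probability(queue_len, kmin, kmax, pmax=1.0):
    """Probability that the switch ECN-marks a packet, as described
    above: no marking below Kmin, a ramp between Kmin and Kmax, and
    marking of every packet at or above Kmax. `pmax` (the probability
    reached just below Kmax) is an assumed tuning knob, as on real
    switches; the linear shape is an illustrative choice."""
    if queue_len < kmin:
        return 0.0
    if queue_len >= kmax:
        return 1.0
    return pmax * (queue_len - kmin) / (kmax - kmin)

for q in (5, 15, 25):
    print(q, ecn_mark_probability(q, kmin=10, kmax=20))
```

With Kmin=10 and Kmax=20, a queue of 5 is never marked, a queue of 15 is marked half the time, and a queue of 25 is always marked.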
If congestion worsens and the switch detects that the pending send queue of a port has reached a higher threshold, the switch sends a PFC (Priority Flow Control) pause frame to the upstream sender, halting data transmission until the congestion at the switch subsides. Once it does, the switch sends another PFC frame upstream, with a pause time of zero, to signal that transmission may resume. PFC supports pausing individual traffic classes: each class can be assigned a share of the total bandwidth, and pausing traffic on one class does not affect data transmission on the others.
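For concreteness, a PFC pause frame's payload (per IEEE 802.1Qbb) carries an opcode, an 8-bit class-enable vector, and one 16-bit pause timer per priority. The sketch below assembles such a payload in Python; the MAC/Ethertype framing around it is omitted:

```python
import struct

def build_pfc_payload(pause_times):
    """Assemble the payload of an IEEE 802.1Qbb PFC frame: opcode
    0x0101, a class-enable vector, and one 16-bit pause timer per
    priority (in units of 512 bit times). A timer of 0 means resume."""
    assert len(pause_times) == 8
    enable_vector = 0
    for prio, t in enumerate(pause_times):
        if t > 0:
            enable_vector |= 1 << prio  # enable bit for each paused priority
    return struct.pack("!HH8H", 0x0101, enable_vector, *pause_times)

# Pause only priority 3 (e.g. the lossless class carrying RoCE traffic):
frame = build_pfc_payload([0, 0, 0, 0xFFFF, 0, 0, 0, 0])
print(len(frame))  # 20 bytes: opcode + enable vector + 8 timers
```

Because each priority has its own timer and enable bit, one class can be paused while the other seven keep flowing, exactly the per-channel behavior described above.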
RoCE & Soft-RoCE
In the realm of high-performance Ethernet cards, most now support the RoCE protocol, but some still do not. To close this gap, Mellanox and other contributors created the open-source Soft-RoCE project, which implements the protocol in software. Nodes whose cards lack RoCE support can use Soft-RoCE to communicate with nodes whose cards support it, as illustrated in the diagram. While this does not improve the former's performance, it lets the latter exploit its hardware fully. In scenarios such as data centers, upgrading only the high-I/O storage servers to RoCE-capable Ethernet cards can significantly boost overall performance and scalability. The combination of RoCE and Soft-RoCE also suits gradual cluster upgrades, avoiding the need for a simultaneous full-scale upgrade.
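On Linux, a Soft-RoCE device can typically be attached to an ordinary Ethernet interface using the mainline `rxe` kernel driver and the `rdma` tool from iproute2. The commands below are a sketch of that setup; `eth0` is a placeholder for your actual interface name:

```shell
# Assumes the mainline rxe driver plus the iproute2 and rdma-core packages.
sudo modprobe rdma_rxe                        # load the Soft-RoCE kernel module
sudo rdma link add rxe0 type rxe netdev eth0  # attach an rxe device to eth0
rdma link show                                # verify the new RDMA link
ibv_devices                                   # rxe0 should now appear as an RDMA device
```

After this, standard verbs-based applications can use `rxe0` like any hardware RoCE device, albeit without the hardware offload benefits.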
Challenges in Implementing RoCE in HPC Environments
Essential Requirements of HPC Networks
As outlined by FS, HPC networks hinge on two fundamental prerequisites: ① low latency and ② the ability to sustain low latency in rapidly evolving traffic patterns.
For ① low latency, RoCE is specifically engineered to tackle this concern. Reiterating from earlier discussions, RoCE efficiently offloads network operations to the network card, resulting in low latency and reduced CPU utilization.
For ② maintaining low latency in dynamically changing traffic patterns, the primary focus shifts to congestion control. The intricacy of highly dynamic HPC traffic patterns poses a challenge for RoCE, leading to suboptimal performance in this regard.
RoCE's Low Latency
In contrast to traditional TCP/IP networks, both InfiniBand and RoCEv2 circumvent the kernel protocol stack, leading to a substantial enhancement in latency performance. Empirical tests have demonstrated that bypassing the kernel protocol stack can reduce end-to-end latency at the application layer within the same cluster from 50 microseconds (TCP/IP) to 5 microseconds (RoCE) or even 2 microseconds (InfiniBand).
RoCE Packet Structure
Assuming we want to send 1 byte of data using RoCE, the additional costs to encapsulate this 1-byte data packet are as follows:
Ethernet Link Layer: 14 bytes MAC header + 4 bytes CRC
Ethernet IP Layer: 20 bytes
Ethernet UDP Layer: 8 bytes
IB Transport Layer: 12 bytes Base Transport Header (BTH)
Total: 58 bytes
Assuming we want to send 1 byte of data using IB, the additional costs to encapsulate this 1-byte data packet are as follows:
IB Link Layer: 8 bytes Local Route Header (LRH) + 6 bytes CRC
IB Network Layer: 0 bytes (When there is only a Layer 2 network, the Link Next Header (LNH) field in the link layer can indicate that the packet has no network layer)
IB Transport Layer: 12 bytes Base Transport Header (BTH)
Total: 26 bytes
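The two tallies above can be checked side by side (header sizes copied from the lists above):

```python
def header_overhead(layers):
    """Total per-packet header cost in bytes for a stack of layers."""
    return sum(layers.values())

roce_v2 = {"Ethernet MAC + CRC": 14 + 4, "IP": 20, "UDP": 8, "IB BTH": 12}
ib = {"IB LRH + CRC": 8 + 6, "IB network layer": 0, "IB BTH": 12}

for name, layers in [("RoCE v2", roce_v2), ("InfiniBand", ib)]:
    print(f"{name}: {header_overhead(layers)} bytes per packet")
# RoCE v2: 58 bytes per packet
# InfiniBand: 26 bytes per packet
```

For tiny messages the difference dominates: a 1-byte payload rides 58 bytes of headers over RoCE versus 26 over IB, more than double the fixed cost per packet.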
If it is a customized network, the packet structure can be simplified further. For example, the Mini-packet (MP) header of Tianhe-1A consists of 8 bytes.
From this, it can be seen that the heavy underlying structure of Ethernet is one of the obstacles to applying RoCE to HPC.
Ethernet switches in data centers often must provide many additional functionalities, such as SDN and QoS, which carry their own implementation costs. This raises two questions: are these functionalities compatible with RoCE, and do they affect RoCE's performance?
Challenges in RoCE Congestion Control
The congestion control mechanisms in both facets of the RoCE protocol pose specific challenges that may hinder the maintenance of low latency in dynamic traffic patterns.
Priority Flow Control (PFC) relies on pause frames to stop senders before receive buffers overflow. Compared with credit-based flow control, this pause-based approach tends either to under-utilize buffers (the pause must be triggered early enough to absorb packets already in flight) or to risk packet loss if it is triggered too late. This is especially challenging for the small-buffer switches typically associated with low latency. Credit-based approaches, by contrast, allow more precise buffer management.
Data Center Quantized Congestion Notification (DCQCN) in RoCE, like InfiniBand's congestion control, uses backward notification: congestion information travels to the destination and is then returned to the sender, which limits its rate. RoCE follows a fixed set of formulas for its slowdown and speedup strategies, whereas InfiniBand allows the strategies to be customized; default configurations are what most deployments use, but having the option to customize is preferable. Notably, the testing in the referenced paper generated at most one Congestion Notification Packet (CNP) every 50 microseconds, and whether this interval can be reduced remains uncertain. In InfiniBand, the CCTI_Timer can be set as low as 1.024 microseconds, though whether so small a value is practical is undetermined.
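For reference, the fixed sender-side update rules of DCQCN (following the published DCQCN algorithm) can be sketched as below. The gain `g` is a tunable constant, and the additive- and hyper-increase recovery phases are omitted for brevity:

```python
class DCQCNRateState:
    """Sketch of DCQCN's sender-side rate updates, after the published
    DCQCN algorithm. `g` is a tunable gain (1/256 here, an assumed
    conventional value); only cut and fast-recovery steps are shown."""

    def __init__(self, rate, g=1 / 256):
        self.rc = rate    # current sending rate
        self.rt = rate    # target rate to recover toward
        self.alpha = 1.0  # running estimate of congestion severity
        self.g = g

    def on_cnp(self):
        # A CNP arrived: raise the congestion estimate and cut the rate.
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)

    def on_quiet_period(self):
        # No CNP for a timer period: decay the estimate and recover
        # halfway toward the target rate (fast recovery).
        self.alpha = (1 - self.g) * self.alpha
        self.rc = (self.rt + self.rc) / 2
```

Starting from a rate of 100 (arbitrary units) with `alpha` at 1, a single CNP halves the rate to 50 while remembering 100 as the recovery target; one quiet period then recovers to 75. The point of contrast with InfiniBand is that these formulas are fixed, whereas IB hardware lets the response curve be reprogrammed.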
An ideal approach would return congestion information directly from the point of congestion to the source, known as forward notification. Ethernet's limitations here are understandable given its specifications, but why InfiniBand has not adopted this approach is an open question.
RoCE Applications in HPC: Slingshot and Performance Testing
The latest supercomputers in the United States feature the innovative Slingshot network, an enhanced version of Ethernet. Utilizing Rosetta switches compatible with traditional Ethernet, the network addresses specific RoCE limitations. Enhanced features come into play when both ends of a link support dedicated devices like network cards and Rosetta switches. These features include minimizing IP packet frame size to 32 bytes, sharing queue occupancy information with neighboring switches, and implementing improved congestion control. While the average switch latency of 350ns is comparable to high-performance Ethernet switches, it falls short of the low latency achieved by InfiniBand (IB) and some specialized supercomputer switches, including the previous generation of Cray XC supercomputer switches.
In practical applications, the Slingshot network demonstrates commendable performance. Notably, the paper "An In-Depth Analysis of the Slingshot Interconnect" primarily compares it with the previous generation of Cray supercomputers, lacking a direct comparison with InfiniBand.
Additionally, CESM and GROMACS applications underwent testing using both 25G Ethernet with low latency and 100G Ethernet with higher bandwidth. Despite a fourfold difference in bandwidth between the two, the results offer valuable insights into their comparative performance.
With a proficient technical team and extensive experience across diverse application scenarios, FS has garnered trust and preference from customers. However, FS acknowledges challenges in applying RoCE to high-performance computing (HPC) based on market demands and user project implementation experiences:
Ethernet switches exhibit higher latency in comparison to IB switches and certain HPC custom network switches.
RoCE's flow control and congestion control strategies have room for improvement.
The cost of Ethernet switches remains relatively high.
In the context of rapidly evolving AI data center networks, the choice of an appropriate solution is pivotal. Traditional TCP/IP protocols no longer suffice for AI applications demanding high network performance. RDMA technology, particularly in the form of InfiniBand and RoCE, has emerged as highly regarded network solutions. InfiniBand has showcased exceptional performance in realms like high-performance computing and large-scale GPU clusters. In contrast, RoCE, being an RDMA technology based on Ethernet, provides enhanced deployment flexibility.
For those seeking high-performance and efficient AI data center networks, the selection of the right network solution tailored to specific requirements and application scenarios becomes a crucial step. FS offers a range of products, including NVIDIA® InfiniBand Switches, 100G/200G/400G/800G InfiniBand transceivers and NVIDIA® InfiniBand Adapters, establishing itself as a professional provider of communication and high-speed network system solutions for networks, data centers, and telecom clients.