RDMA over Converged Ethernet Guide
In the era of data, the demand for faster, more efficient, and more scalable networks has never been greater. Traditional TCP/IP Ethernet connections are CPU-intensive and require extra processing and copying of data, so they can no longer meet current network needs. This is the context in which RDMA over Converged Ethernet (RoCE) emerged. To understand what RoCE is, it is worth looking at RDMA first.
What Is RDMA?
Remote Direct Memory Access (RDMA) is a technology that enables direct access from the memory of one host or server to the memory of another host or server without involving the CPU. This frees the CPUs for the work they are meant to do, such as running applications and processing massive amounts of data. As a result, lower latency, lower CPU load, and higher bandwidth can be achieved cost-effectively for both the network and the hosts.
Figure 1: RDMA Technology
What Is RoCE?
As a type of RDMA, RoCE is a network protocol defined in the InfiniBand Trade Association (IBTA) standard that allows RDMA over a converged Ethernet network. In short, it can be regarded as the application of RDMA technology in hyper-converged data centers, cloud, storage, and virtualized environments. It combines all the benefits of RDMA technology with the familiarity of Ethernet. To understand the differences between RoCE and InfiniBand, see the article RoCE vs Infiniband vs TCP/IP.
Types of RoCE
Generally, there are two versions of RDMA over Converged Ethernet: RoCE v1 and RoCE v2. Which one is used depends on the network adapter or card.
RoCE v1: The RoCE v1 protocol is an Ethernet link layer protocol that allows two hosts in the same Ethernet broadcast domain (VLAN) to communicate. It uses Ethertype 0x8915 and is carried directly in Ethernet frames, so the usual Ethernet frame length limits apply: 1500 bytes for a standard Ethernet frame and 9000 bytes for an Ethernet jumbo frame.
RoCE v2: The RoCE v2 protocol overcomes the version 1 limitation of being bound to a single broadcast domain (VLAN). By changing the packet encapsulation to include IP and UDP headers, RoCE v2 can be used across both L2 and L3 networks. This enables Layer 3 routing, which brings RDMA to networks with multiple subnets and greatly improves scalability. For this reason, RoCE v2 is also known as Routable RoCE (RRoCE). Because RoCE v2 packets are carried over IP, IP multicast is also possible.
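The encapsulation difference between the two versions can be illustrated with a short Python sketch that inspects a raw Ethernet frame. It is a minimal illustration, not a full parser: it handles only untagged or single-VLAN-tagged frames and IPv4, and the function and helper names are hypothetical. The Ethertype 0x8915 comes from the text above; UDP destination port 4791 is the IANA-registered port for RoCE v2 and is not stated in this article.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # Ethertype given in the text for RoCE v1
IPV4_ETHERTYPE = 0x0800
VLAN_ETHERTYPE = 0x8100     # 802.1Q VLAN tag
UDP_PROTOCOL = 17
ROCE_V2_UDP_PORT = 4791     # IANA-registered port (not stated in the article)


def classify_frame(frame: bytes) -> str:
    """Return 'RoCE v1', 'RoCE v2', or 'other' for a raw Ethernet frame."""
    if len(frame) < 14:
        return "other"
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    offset = 14  # start of the Ethernet payload
    if ethertype == VLAN_ETHERTYPE:
        # Skip one 802.1Q tag: the real Ethertype follows the 2-byte TCI.
        if len(frame) < 18:
            return "other"
        ethertype = struct.unpack_from("!H", frame, 16)[0]
        offset = 18
    if ethertype == ROCE_V1_ETHERTYPE:
        return "RoCE v1"  # RDMA headers sit directly on Ethernet
    if ethertype == IPV4_ETHERTYPE and len(frame) >= offset + 20:
        header_len = (frame[offset] & 0x0F) * 4  # IPv4 IHL in bytes
        protocol = frame[offset + 9]
        udp_start = offset + header_len
        if (header_len >= 20 and protocol == UDP_PROTOCOL
                and len(frame) >= udp_start + 8):
            dst_port = struct.unpack_from("!H", frame, udp_start + 2)[0]
            if dst_port == ROCE_V2_UDP_PORT:
                return "RoCE v2"  # RDMA headers sit inside IP/UDP
    return "other"


def make_frame(ethertype: int, payload: bytes = b"") -> bytes:
    """Hypothetical helper: zeroed MAC addresses plus Ethertype and payload."""
    return bytes(12) + struct.pack("!H", ethertype) + payload


def make_ipv4_udp(dst_port: int) -> bytes:
    """Hypothetical helper: minimal IPv4 header (no options) plus UDP header."""
    ip = bytearray(20)
    ip[0] = 0x45          # version 4, header length 5 words
    ip[9] = UDP_PROTOCOL
    udp = struct.pack("!HHHH", 49152, dst_port, 8, 0)
    return bytes(ip) + udp
```

For example, `classify_frame(make_frame(0x8915))` identifies a RoCE v1 frame, while `classify_frame(make_frame(0x0800, make_ipv4_udp(4791)))` identifies a RoCE v2 frame. Because the second form is ordinary IP/UDP traffic, Layer 3 devices can route it.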
Figure 2: RoCE v1 vs RoCE v2 Packet Format
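To show how the frame length limits interact with the two encapsulations, the following Python sketch estimates the largest RDMA path MTU that fits in a given Ethernet MTU. It rests on several assumptions that are not taken from this article: the 1500- and 9000-byte limits are treated as the Ethernet payload size, the header sizes are the usual ones (40-byte GRH, 20-byte IPv4 header without options, 8-byte UDP header, 12-byte BTH, 4-byte ICRC), a 16-byte allowance is reserved for an extended transport header, and the path MTU is limited to the InfiniBand values 256 to 4096 bytes. The function name is hypothetical.

```python
# RDMA path MTU values defined by the InfiniBand transport (bytes).
IB_PATH_MTUS = (256, 512, 1024, 2048, 4096)

# Assumed per-packet header bytes carried inside the Ethernet payload.
HEADER_BYTES = {
    "v1": 40 + 12 + 4,      # GRH + BTH + ICRC
    "v2": 20 + 8 + 12 + 4,  # IPv4 + UDP + BTH + ICRC
}


def largest_path_mtu(ethernet_mtu, version, extended_header_bytes=16):
    """Return the largest RDMA path MTU that fits, or None if none fits.

    extended_header_bytes is an assumed allowance for one extended
    transport header (for example a 16-byte RETH on RDMA writes).
    """
    room = ethernet_mtu - HEADER_BYTES[version] - extended_header_bytes
    fitting = [mtu for mtu in IB_PATH_MTUS if mtu <= room]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    for ethernet_mtu in (1500, 9000):
        for version in ("v1", "v2"):
            print(ethernet_mtu, version, largest_path_mtu(ethernet_mtu, version))
```

Under these assumptions, a standard 1500-byte Ethernet MTU leaves room for a 1024-byte path MTU with either version, and a 9000-byte jumbo MTU leaves room for 4096 bytes.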
Benefits of RoCE
Because RDMA over Converged Ethernet accesses memory directly through the network interface rather than through the kernel, it enables low-latency, high-performance transmission.
Low CPU involvement: Access the memory of a remote switch or server without consuming CPU cycles on the remote server, which allows full use of the available bandwidth and higher scalability.
Zero-copy: Send and receive data directly to and from remote buffers, without intermediate copies.
High efficiency: RoCE lowers latency and raises throughput, which significantly improves network performance.
Cost-saving: With RoCE there is no need to buy new equipment or replace the Ethernet infrastructure to handle massive amounts of data, which greatly reduces capital expenditure for companies.
Figure 3: Before Vs. After RoCE
How to Implement RoCE?
Generally, to implement RDMA over Converged Ethernet in a data center, you install network adapters (cards) and drivers that support RoCE; every Ethernet NIC involved must be a RoCE-capable network adapter card. RoCE drivers are available for Red Hat and other Linux distributions, Microsoft Windows, and other common operating systems. RoCE support is needed in two places. For the network switch, choose a switch whose operating system supports PFC (priority flow control). For a rack server or host, use a RoCE-capable network adapter card, such as ConnectX-3 Pro or ConnectX-4 and above.
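On a Linux host, one way to check whether an installed adapter and driver expose RoCE is to read the GID type entries that the kernel RDMA subsystem publishes in sysfs. The Python sketch below assumes the layout /sys/class/infiniband/<device>/ports/<port>/gid_attrs/types/<index>, where each populated entry contains a string such as "IB/RoCE v1" or "RoCE v2". This layout is an assumption about the host's kernel and driver, not something described in this article, and the function name is hypothetical.

```python
from pathlib import Path


def list_gid_types(sysfs_root="/sys/class/infiniband"):
    """Return (device, port, gid_index, gid_type) for populated GID entries.

    Assumes the Linux RDMA sysfs layout described in the lead-in. Returns an
    empty list if the directory is missing (no RDMA devices or drivers).
    """
    root = Path(sysfs_root)
    found = []
    if not root.is_dir():
        return found
    for device in sorted(root.iterdir()):
        ports_dir = device / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            types_dir = port / "gid_attrs" / "types"
            if not types_dir.is_dir():
                continue
            indexes = sorted(
                int(p.name) for p in types_dir.iterdir() if p.name.isdigit()
            )
            for index in indexes:
                try:
                    gid_type = (types_dir / str(index)).read_text().strip()
                except OSError:
                    continue  # unpopulated GID slots cannot be read
                if gid_type:
                    found.append((device.name, port.name, index, gid_type))
    return found


if __name__ == "__main__":
    entries = list_gid_types()
    if not entries:
        print("No RDMA GID entries found")
    for device, port, index, gid_type in entries:
        print(f"{device} port {port} GID {index}: {gid_type}")
```

An entry reporting "RoCE v2" suggests that the adapter and driver can use the routable encapsulation. The switch side (PFC) still has to be configured separately.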
FAQs About RoCE
Here are some frequently asked questions about RDMA over Converged Ethernet to help you understand it better.
1. Which FS switches or network cards/adapters support RoCE?
At present, FS N series switches and S58/80 series switches all support RoCE v1 and v2, except for the S5860 series and the S5850-24S2Q and S5850-24S2Q-DC switches. Customers need to enable the PFC function after buying an RDMA switch. RoCE adapters and cards are not yet available from FS.
2. Can RoCE adapters communicate with other adapter types, like iWARP?
RoCE adapters can only communicate with other RDMA over converged Ethernet adapters. Any configurations that attempt to mix adapter types, say RoCE adapters combined with iWARP adapters, will probably revert to traditional TCP/IP connections.
3. What’s the difference between RoCE and iWARP?
Like RoCE, iWARP (Internet Wide Area RDMA Protocol) is a network protocol that supports RDMA with low latency, but the two differ in some respects.
On the one hand, RoCE is the only industry-standard Ethernet-based RDMA solution with a multi-vendor ecosystem delivering network adapters, and it operates over standard Layer 2 and Layer 3 Ethernet switches. iWARP, by contrast, has seen only minimal support.
On the other hand, iWARP uses a complex mix of layers, including DDP (Direct Data Placement), a tweak known as MPA (Marker PDU Aligned framing), and a separate RDMA protocol (RDMAP), to deliver RDMA services over TCP/IP. With such a complex architecture, it is hard for iWARP to apply RDMA to existing software transport frameworks. As a result of this compromise, the throughput, latency, and CPU utilization of iWARP suffer.
Figure 4: iWARP's Complex Network Layers Vs. RoCE’s Simpler Model
Running RDMA in data centers offloads data movement and makes more CPU resources available to applications. Adopters of RoCE can benefit from RDMA’s capabilities without changing their network infrastructure. By reducing Ethernet network latency and CPU overhead, RoCE increases performance in search, storage, database, financial, and high-transaction-rate applications. By increasing CPU efficiency and improving application performance, RoCE can reduce the number of servers needed, thereby saving energy and reducing the footprint of Ethernet-based data centers.