Intelligent Lossless Ethernet Network for HPC (High-Performance Computing)

Posted on Dec 18, 2023


Data centers are becoming hubs of computing power, and the computing clusters they house keep growing in scale. As clusters grow, so do the performance expectations for the network that links their computing nodes. Networking has become an indispensable component of data center computing power, reflecting a broader trend toward the deep integration of computing and networking.

The Increasing Demands of High-Performance Computing Workloads on Networking Infrastructure

As revolutionary technologies like 5G, big data and the Internet of Things (IoT) permeate various facets of society, the trajectory toward an intelligent, digitally centered society is inevitable over the next two to three decades. Data center computing power has emerged as a potent driving force, transitioning from a focus on resource scale to computing power scale. The industry has widely embraced the concept of computational power centers, where networks play a pivotal role in facilitating high-performance computing within data centers. Elevating network performance stands as a crucial factor in boosting the energy efficiency of data center computing power.

The industry is pursuing more computing power on multiple fronts. Progress in single-core chip technology has hit a bottleneck at 3nm. Multi-core stacking adds computing power, but power consumption per unit of computing power rises notably as the core count grows. Computing unit technology is nearing its limits, with Moore's Law (performance doubling roughly every 18 months) approaching exhaustion. High-Performance Computing (HPC) has therefore become essential to meeting the growing demand for computing power, particularly as computing clusters scale from petascale (P-scale) to exascale (E-scale). This transition requires ever higher performance from interconnection networks and marks a distinct trend toward the deep integration of computation and networking.

High-Performance Computing (HPC) aggregates computing power to tackle scientific computing problems beyond the capacity of standard workstations, including simulations, modeling, and rendering. As clusters grow to deliver that power, the requirements on interconnection network performance rise with them, and computation and networking become more tightly coupled.

HPC introduces varied network performance requirements across three typical scenarios:

Loose coupling computing scenario: In scenarios with low interdependence between computing nodes, such as financial risk assessment or remote sensing, the demand for network performance is relatively modest.
Tight coupling scenario: Tight coupling scenarios such as electromagnetic simulation and fluid dynamics are characterized by strong coordination dependencies between computing nodes, synchronized computation, and rapid information exchange. They require a low-latency network.
Data-intensive computing scenario: In data-intensive scenarios like weather forecasting and gene sequencing, where computing nodes handle substantial data volumes and generate significant intermediate data, a high-throughput network is crucial, accompanied by specific requirements for network latency.
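The contrast between the tight coupling and data-intensive scenarios can be illustrated with a simple latency-plus-bandwidth cost model. The sketch below is only an illustration: the function names, the message sizes, the 5-microsecond latency, and the 100 Gbps link speed are assumptions chosen for the example, not figures from a measured system.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_gbps):
    """Estimate one-way transfer time in microseconds.

    Simple model: fixed per-message latency plus serialization time.
    """
    bits = message_bytes * 8
    bits_per_us = bandwidth_gbps * 1e3  # 1 Gbps = 1,000 bits per microsecond
    return latency_us + bits / bits_per_us


def latency_share(message_bytes, latency_us, bandwidth_gbps):
    """Fraction of the transfer time spent on fixed latency."""
    total = transfer_time_us(message_bytes, latency_us, bandwidth_gbps)
    return latency_us / total


if __name__ == "__main__":
    # Hypothetical workloads: small synchronization messages vs. bulk data.
    workloads = [
        ("tight coupling, 256 B message", 256),
        ("data-intensive, 64 MiB transfer", 64 * 1024 * 1024),
    ]
    for label, size in workloads:
        share = latency_share(size, latency_us=5.0, bandwidth_gbps=100.0)
        print(f"{label}: {share:.1%} of transfer time is latency")
```

Under this model, small synchronization messages are dominated by latency, while bulk transfers are dominated by bandwidth, which is why the two scenarios stress different network properties.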

In summary, high-performance computing (HPC) imposes stringent demands for high throughput and low latency on networks. To fulfill these requirements, the industry commonly adopts Remote Direct Memory Access (RDMA) as a substitute for the TCP protocol, reducing latency and minimizing CPU utilization on servers. Despite its benefits, RDMA's sensitivity to network packet loss underscores the importance of a lossless network.

Evolution of High-Performance Computing Networks

Traditional data center networks have historically employed multi-hop symmetric architectures based on Ethernet technology and relied on the TCP/IP protocol stack for transmission. However, despite over 30 years of development, the inherent technical characteristics of the traditional TCP/IP network make it less suited to meet the demands of high-performance computing (HPC). A significant shift has occurred with RDMA (Remote Direct Memory Access) technology gradually supplanting TCP/IP as the preferred protocol for HPC networks. Additionally, the choice of RDMA's network layer protocol has evolved from expensive lossless networks based on the InfiniBand (IB) protocol to intelligent lossless networks based on Ethernet. FS's technical experts will elucidate the reasons behind these technological transitions and advancements.

From TCP to RDMA

In traditional data centers, Ethernet technology and the TCP/IP protocol stack have been the norm for building multi-hop symmetric network architectures. However, the TCP/IP network has become insufficient for the demands of high-performance computing due to two main limitations:

Latency Issues: Sending and receiving packets through the TCP/IP protocol stack involves multiple context switches in the kernel, which add roughly 5-10 microseconds of latency. That delay becomes a bottleneck in microsecond-level systems, impacting tasks such as data processing and distributed SSD storage.
CPU Utilization: Beyond latency, the TCP/IP network requires the host CPU to perform multiple memory copies within the protocol stack. As network scale and bandwidth increase, the CPU scheduling burden grows, leading to sustained high CPU loads. By the common rule of thumb that transmitting 1 bit per second of data consumes about 1 Hz of CPU frequency, network bandwidths above 25G (at full load) demand a significant portion of CPU capacity.
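The rule of thumb above can be turned into quick arithmetic. The following is a rough sketch, not a performance model: the helper name `cores_needed` and the 2.5 GHz core clock are assumptions made for illustration, and real protocol-processing cost varies with packet size, offloads, and the software stack.

```python
def cores_needed(bandwidth_gbps, core_ghz, hz_per_bps=1.0):
    """Cores consumed by protocol processing under the rule of thumb.

    hz_per_bps: CPU cycles per second assumed per bit per second of traffic
    (1.0 reproduces the "1 bit costs 1 Hz" rule).
    """
    required_ghz = bandwidth_gbps * hz_per_bps
    return required_ghz / core_ghz


if __name__ == "__main__":
    # Hypothetical 2.5 GHz cores; link speeds at full load.
    for gbps in (10, 25, 40, 100):
        cores = cores_needed(gbps, core_ghz=2.5)
        print(f"{gbps}G at full load: about {cores:.0f} cores")
```

Under these assumptions, a fully loaded 25G link would occupy about ten 2.5 GHz cores, which shows why higher link speeds become impractical without offloading the transport from the CPU.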

To address these challenges, RDMA functionality has been introduced on the server side. RDMA is a direct memory access technology that transfers data directly between the memories of two computers without involving either operating system, thereby bypassing time-consuming processor operations. This approach achieves high bandwidth, low latency, and low resource utilization.

From IB to RoCE

RDMA's kernel bypass mechanism, as depicted in the diagram below, enables direct data read and write between applications and network cards. This circumvents TCP/IP limitations, reducing protocol stack latency to nearly 1 microsecond. RDMA's zero-copy mechanism allows the receiving end to read data directly from the sender's memory, significantly reducing CPU burden and enhancing CPU efficiency. In comparison, a 40Gbps TCP/IP flow can saturate all CPU resources, whereas RDMA at 40Gbps sees CPU utilization drop from 100% to 5%, with network latency decreasing from milliseconds to below 10 microseconds.

[Figure: RDMA kernel bypass mechanism]

Currently, there are three options for RDMA network layer protocols: InfiniBand, iWARP (Internet Wide Area RDMA Protocol), and RoCE (RDMA over Converged Ethernet).

InfiniBand: Specifically designed for RDMA, InfiniBand guarantees lossless networking at the hardware level, providing high throughput and low latency. However, its closed architecture poses interoperability challenges and vendor lock-in risks.
iWARP: This protocol runs RDMA over TCP. It requires special network cards and loses much of RDMA's performance advantage because of TCP's limitations.
RoCE: RoCE applies RDMA technology to Ethernet, enabling remote memory access over Ethernet. It runs on standard Ethernet switches and requires only RDMA-capable network cards. Two versions exist: RoCEv1 and RoCEv2. RoCEv1 is a link-layer protocol confined to a single broadcast domain, while RoCEv2 runs over UDP/IP, so it can be routed and allows access between hosts in different broadcast domains.
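The difference between the two RoCE versions shows up in how each packet is encapsulated. The sketch below tallies per-packet header overhead for a plain send without extended transport headers, assuming an untagged Ethernet frame and IPv4 without options. The function names and the 4096-byte payload are illustrative choices, and preamble and inter-frame gap are not counted.

```python
ROCEV1_ETHERTYPE = 0x8915  # RoCEv1 frames are identified by this EtherType
ROCEV2_UDP_PORT = 4791     # RoCEv2 packets use this UDP destination port

# Header and trailer sizes in bytes (no VLAN tag, no IP options).
ETH_HEADER = 14
ETH_FCS = 4
IB_GRH = 40       # InfiniBand Global Route Header, carried by RoCEv1
IPV4_HEADER = 20  # replaces the GRH in RoCEv2
UDP_HEADER = 8
IB_BTH = 12       # InfiniBand Base Transport Header
IB_ICRC = 4       # InfiniBand invariant CRC


def overhead_bytes(version):
    """Per-packet overhead in bytes for RoCE version 1 or 2."""
    common = ETH_HEADER + IB_BTH + IB_ICRC + ETH_FCS
    if version == 1:
        return common + IB_GRH
    if version == 2:
        return common + IPV4_HEADER + UDP_HEADER
    raise ValueError("version must be 1 or 2")


def payload_efficiency(version, payload_bytes):
    """Fraction of bytes on the wire that carry application payload."""
    return payload_bytes / (payload_bytes + overhead_bytes(version))


if __name__ == "__main__":
    for version in (1, 2):
        eff = payload_efficiency(version, 4096)
        print(f"RoCEv{version}: {overhead_bytes(version)} B overhead, "
              f"{eff:.1%} efficiency at 4096 B payload")
```

Because RoCEv2 carries the InfiniBand transport headers inside ordinary UDP/IP, standard IP routers can forward it, which is what allows access across broadcast domains.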

Despite RoCE's benefits, its sensitivity to packet loss requires lossless Ethernet support. This evolution in HPC networks showcases the ongoing pursuit of enhanced performance, efficiency, and interoperability.
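One way to see why packet loss hurts so much is a textbook go-back-N model, in which a single lost packet forces the sender to resend everything that was in flight behind it. The sketch below rests on that assumption. Actual RDMA network cards may recover differently, and the function name, loss rates, and window size are hypothetical.

```python
def go_back_n_efficiency(loss_rate, packets_in_flight):
    """Link efficiency under the classic go-back-N approximation.

    loss_rate: independent per-packet loss probability (0 <= p < 1).
    packets_in_flight: packets outstanding when a loss is detected.
    """
    if not 0 <= loss_rate < 1:
        raise ValueError("loss_rate must be in [0, 1)")
    if packets_in_flight < 1:
        raise ValueError("packets_in_flight must be at least 1")
    p = loss_rate
    return (1 - p) / (1 - p + p * packets_in_flight)


if __name__ == "__main__":
    # Hypothetical window of 1000 packets in flight on a fast link.
    for p in (0.0, 0.0001, 0.001, 0.01):
        eff = go_back_n_efficiency(p, packets_in_flight=1000)
        print(f"loss rate {p:.2%}: efficiency {eff:.1%}")
```

In this model, a loss rate of only 0.1% with 1000 packets in flight cuts efficiency to roughly half, which is consistent with the emphasis on a lossless fabric.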

Conclusion

As the demands for data centers and high-performance computing escalate, RDMA technology remains a pivotal player in facilitating high-performance, low-latency data transfers. The decision between InfiniBand technology and RDMA-enabled Ethernet technologies necessitates careful consideration of specific requirements and practical needs by both users and vendors. In the realm of supercomputing, InfiniBand technology boasts broad applications and a well-established ecosystem. On the other hand, RoCE and iWARP prove to be more fitting for high-performance computing and storage scenarios within Ethernet environments.

FS is a professional provider of communication and high-speed network system solutions to networking, data center, and telecom customers. Leveraging NVIDIA® InfiniBand Switches, 100G/200G/400G/800G InfiniBand transceivers, and NVIDIA® InfiniBand Adapters, FS provides customers with a complete set of solutions based on InfiniBand and lossless Ethernet (RoCE). These solutions cater to diverse application requirements, empowering users to accelerate their business and enhance performance. For more information, visit the official FS.COM website.