
Ushering a New Era of HPC with NVIDIA InfiniBand

Posted on Jun 7, 2024

High-Performance Computing (HPC) systems require robust networking solutions to efficiently manage vast amounts of data and complex computations. InfiniBand technology has emerged as a leading solution in this field, offering unparalleled performance in terms of bandwidth and latency. This article explores the use of InfiniBand in HPC, highlighting its benefits, solutions, and prospects.

What Is an InfiniBand Network?

InfiniBand is a high-speed communication protocol designed to provide low-latency, high-throughput data transfers. It supports RDMA (Remote Direct Memory Access), allowing direct memory access between computers without involving the CPU, which significantly reduces latency. Unlike traditional Ethernet, InfiniBand is optimized for HPC environments, ensuring that data is transferred quickly and efficiently across the network.
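To make the RDMA idea more concrete, below is a minimal C sketch using the standard libibverbs API. It assumes an InfiniBand adapter and the libibverbs library are installed, and it is an illustration rather than a complete application: it opens the first adapter and registers a buffer so the NIC can read and write that memory directly, which is the foundation of RDMA's CPU bypass.

```c
/* Minimal libibverbs sketch: open the first HCA and register a buffer
 * that the NIC can access directly (the basis of RDMA's zero-copy path).
 * Build with: gcc rdma_sketch.c -libverbs */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list || num_devices == 0) {
        fprintf(stderr, "No InfiniBand devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(dev_list[0]);  /* open the HCA */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                    /* protection domain */

    size_t len = 4096;
    void *buf = malloc(len);

    /* Register the buffer: the HCA pins it and returns local/remote keys
     * that peers later use to access this memory directly via RDMA. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);

    printf("Registered %zu bytes: lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```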

InfiniBand networks currently support link speeds such as FDR 56Gbps, EDR 100Gbps, HDR 200Gbps, and even NDR 400Gbps/800Gbps (data rates quoted at a 4x link width). InfiniBand networks are primarily used in HPC scenarios, connecting multiple servers into a high-performance computing cluster whose aggregate performance scales almost linearly with the number of servers. It is this interconnect technology that has made today's supercomputing clusters possible.
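The per-generation figures above follow directly from the per-lane signalling rate multiplied by the 4x link width. The short sketch below simply works through that arithmetic; the per-lane data rates used (FDR ~14Gbps, EDR 25Gbps, HDR 50Gbps, NDR 100Gbps) are commonly quoted values assumed here for illustration.

```c
/* Worked arithmetic for the 4x link widths quoted above.
 * Per-lane effective data rates (Gb/s) are assumptions of this sketch:
 * FDR ~14, EDR 25, HDR 50, NDR 100. */
#include <stdio.h>

int main(void)
{
    const char *gen[]   = { "FDR", "EDR", "HDR", "NDR" };
    const double lane[] = { 14.0625, 25.0, 50.0, 100.0 };
    const int width = 4;                       /* 4x link width */

    for (int i = 0; i < 4; i++)
        printf("%s: %.0f Gb/s per lane x %d lanes = %.0f Gb/s\n",
               gen[i], lane[i], width, lane[i] * width);
    return 0;
}
```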

For more information on InfiniBand networks, you can read: InfiniBand Network and Architecture Overview

InfiniBand and Ethernet

From its inception to its current dominant position in the HPC field, InfiniBand has always been compared with the widely adopted Ethernet. The two technologies compare as follows.

InfiniBand significantly surpasses Ethernet in both data transfer speed and latency. Its low-latency design makes it particularly well suited to HPC applications, and it also offers considerable cost advantages per unit of bandwidth. According to the latest global HPC TOP500 list, the adoption rate of InfiniBand continues to rise and it dominates the TOP100 rankings, while the usage of Ethernet has been steadily declining year by year.

|  | InfiniBand | Ethernet |
| --- | --- | --- |
| Bandwidth | 40Gbps, 56Gbps, 100Gbps, 200Gbps, 400Gbps, 800Gbps | 1Gbps, 10Gbps, 25Gbps, 40Gbps, 100Gbps, 200Gbps, 400Gbps, 800Gbps |
| Latency | Less than or equal to 1 microsecond | Close to 10 microseconds |
| Application Areas | Supercomputing, enterprise storage | Internet, metropolitan area networks, data centre backbone networks, etc. |
| Advantages | Extremely low latency and high throughput | Wide range of applications; a standard interconnect technology generally recognised by the industry |
| Disadvantages | Requires expensive proprietary interconnect equipment on server hardware | Latency is difficult to reduce further |

How Does InfiniBand Work?

InfiniBand is a unified interconnect architecture capable of handling storage I/O, network I/O, and inter-process communication (IPC). It can interconnect RAID, SANs, LANs, servers, and clustered servers, as well as connect to external networks (such as WAN, VPN, and the Internet). InfiniBand is primarily designed for enterprise data centres, whether large or small. Its main objectives are to achieve high reliability, availability, scalability, and performance. InfiniBand provides high bandwidth and low latency transmission over relatively short distances, and supports redundant I/O channels within single or multiple interconnected networks, ensuring that the data centre remains operational in the event of localised failures.

What Are the Benefits of InfiniBand in HPC?

In HPC, where tasks such as scientific simulations and large-scale data analytics are common, the need for rapid data processing is critical. InfiniBand addresses this need by providing the low-latency and high-bandwidth connections necessary to maintain the performance of these demanding applications. Its ability to handle vast amounts of data quickly makes it an essential component of modern HPC systems.

Effortless Network Management

InfiniBand represents an innovative network architecture designed specifically for software-defined networking (SDN), supervised by a subnet manager. The subnet manager configures the local subnet, ensuring seamless network operations. To manage traffic, all channel adapters and switches must implement a subnet management agent (SMA) that collaborates with the subnet manager. Each subnet requires at least one subnet manager for initial setup and reconfiguration when links are established or broken. An arbitration mechanism designates the primary subnet manager, with others running in standby mode. In standby mode, each subnet manager maintains backup topology information and verifies the subnet's operational status. If the primary subnet manager fails, a standby subnet manager takes over, ensuring uninterrupted subnet management.
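For a concrete view of the subnet manager's presence on the fabric, here is a small C sketch (again using libibverbs, and assuming a configured HCA with an active port 1) that queries a local port and prints both the port's own LID and the LID of the subnet manager currently serving it.

```c
/* Sketch: ask the local HCA which subnet manager serves its port.
 * Assumes libibverbs and an active InfiniBand port (port 1 here). */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) {
        fprintf(stderr, "No InfiniBand devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_port_attr attr;

    if (ibv_query_port(ctx, 1, &attr) == 0) {
        /* sm_lid identifies the active subnet manager for this subnet;
         * lid is this port's own address, assigned by that subnet manager. */
        printf("port LID = 0x%x, subnet manager LID = 0x%x, SM SL = %d\n",
               (unsigned)attr.lid, (unsigned)attr.sm_lid, (int)attr.sm_sl);
    }

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```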

Efficient CPU Offloading

CPU offloading is a key technique for enhancing computing performance, and the NVIDIA InfiniBand network architecture facilitates data transfer with minimal CPU resources in the following ways:

  • Hardware offloading of the entire transport layer protocol stack.

  • Kernel bypass and zero copy.

  • RDMA (Remote Direct Memory Access), a process that writes data directly from one server's memory to another's without involving the CPU.

Another option is using GPUDirect technology, which allows direct access to data in GPU memory and accelerates data transfers from GPU memory to other nodes. This feature boosts the performance of computational applications like deep learning training and machine learning.
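As a rough illustration of the offloaded data path described above, the C sketch below posts an RDMA WRITE work request with libibverbs: the CPU only submits the request, and the HCA moves the payload from local memory straight into the remote node's registered buffer. Queue pair setup and the out-of-band exchange of the remote buffer address and rkey are assumed to have happened already; those parameters are hypothetical placeholders, not something taken from this article.

```c
/* Sketch of the one-sided RDMA WRITE path: the CPU posts a work request
 * and the HCA performs the transfer without remote CPU involvement.
 * QP creation/connection and rkey exchange are omitted (assumed done). */
#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* source: registered local buffer */
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,
    };

    struct ibv_send_wr wr;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided write */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* completion on the local CQ */
    wr.wr.rdma.remote_addr = remote_addr;         /* peer's registered buffer */
    wr.wr.rdma.rkey        = remote_rkey;         /* key obtained out of band */

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);       /* hand off to the HCA */
}
```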

Lower Latency

The latency comparison between InfiniBand and Ethernet spans both the switch and the NIC (Network Interface Card). At the switch level, Ethernet switches handle complex services (such as IP, MPLS, and QinQ), leading to delays typically measured in microseconds; even with cut-through support, latency generally exceeds 200ns. In contrast, InfiniBand switches greatly simplify layer 2 processing, relying solely on 16-bit LID forwarding. Using cut-through technology, they can reduce latency to below 100ns, outperforming Ethernet switches.

At the NIC level, RDMA technology enables InfiniBand NICs to avoid CPU involvement in message processing, significantly reducing latency. The send and receive latency for InfiniBand NICs is approximately 600ns, while Ethernet TCP/UDP applications have a latency of around 10us. This results in a latency difference of over tenfold between the two technologies.
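One way to see why cut-through forwarding matters at these speeds is the serialization delay a store-and-forward switch must add, since it has to receive an entire packet before forwarding it. The sketch below works through that arithmetic for an assumed 4KB message at a few of the link rates mentioned in this article; the message size is an illustrative assumption.

```c
/* Store-and-forward delay per hop = packet size / line rate; a cut-through
 * switch starts forwarding after the header and avoids most of this wait.
 * The 4 KB message size is an illustrative assumption. */
#include <stdio.h>

int main(void)
{
    const double rates_gbps[] = { 100.0, 200.0, 400.0 };  /* EDR, HDR, NDR */
    const double packet_bits  = 4096.0 * 8.0;             /* 4 KB message */

    for (int i = 0; i < 3; i++) {
        double delay_ns = packet_bits / (rates_gbps[i] * 1e9) * 1e9;
        printf("%.0f Gb/s link: store-and-forward adds ~%.0f ns per hop\n",
               rates_gbps[i], delay_ns);
    }
    return 0;
}
```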


FS NVIDIA InfiniBand Solutions Accelerate HPC

As a partner of NVIDIA®, FS offers original InfiniBand switches and adaptors, cutting costs by 30% with its extensive product ecosystem.

  • High Performance: Utilises NVIDIA® H100 GPU and InfiniBand switches for ultra-low latency and high bandwidth.

  • Cost Efficiency: Provides cost-effective, high-quality InfiniBand modules and cables, supporting up to 400G/800G speeds.

  • Reliability: Ensures lossless data transmission with traffic control and CRC redundancy checks.

  • Scalability: Supports 400Gb/s interconnects with the NVIDIA® Quantum-2 MQM9790 InfiniBand switch, enhancing data centre network performance.

  • Compliance: Features Broadcom DSP technology, low power consumption, and adherence to industry standards like OSFP MSA.


Conclusion

As InfiniBand technology continues to advance, it is increasingly positioned to replace Gigabit and 10 Gigabit Ethernet as the preferred solution for high-speed interconnect networks. Looking ahead, InfiniBand has broad application prospects in areas such as GPUs, SSDs, and clustered databases.

FS offers highly reliable NVIDIA InfiniBand network architecture and professional technical services, delivering customised network designs and product lists within 48 hours to meet specific client needs. Contact us now to learn more about our InfiniBand solutions!
