
Exploring the Significance of InfiniBand Networking and HDR in Supercomputing

Posted on Jun 25, 2024

Supercomputing, or high-performance computing (HPC), plays a pivotal role in solving complex computational problems across various fields, including climate modeling, genomics, and artificial intelligence. The continuous evolution of supercomputing necessitates advancements in networking technologies to handle massive data volumes and intricate computations. This article delves into the significance of InfiniBand networking and HDR in supercomputing, highlighting their contributions to enhanced performance, efficiency, and scalability.

The Popularity of InfiniBand in Supercomputers and HPC Data Centers

In June 2015, InfiniBand dominated the global Top 500 supercomputer list, holding an impressive 51.8% share, a year-over-year increase of 15.8%.

Figure: The popularity of InfiniBand

In the latest November 2023 Top500 list, InfiniBand maintained its leading position, highlighting its ongoing growth trend. Key trends included:

  • InfiniBand-based supercomputers led with 189 systems.

  • InfiniBand-based supercomputers dominated the top 100 systems with 59 installations.

  • NVIDIA GPUs and networking products, especially Mellanox Quantum QM87xx HDR switches and BlueField DPUs, appeared in over two-thirds of the supercomputers on the list, serving as the primary accelerators and interconnects.

Beyond traditional HPC applications, InfiniBand networks are widely used in enterprise data centers and public clouds. For instance, NVIDIA's Selene, a leading enterprise supercomputer, and Microsoft's Azure public cloud both use InfiniBand to deliver superior business performance.

Advantages of InfiniBand Networks

InfiniBand is recognized as a future-proof standard for high-performance computing (HPC), renowned for its role in supercomputers, storage systems, and LANs. Key advantages include simplified management, high bandwidth, complete CPU offloading, ultra-low latency, cluster scalability and flexibility, quality of service (QoS), and SHARP support.

Effortless Network Management 

InfiniBand features a pioneering network architecture designed for software-defined networking (SDN), managed by a subnet manager. This manager configures local subnets to ensure seamless network operation. All channel adapters and switches implement subnet management agents (SMAs) to cooperate with the subnet manager. Each subnet requires at least one subnet manager for initial setup and reconfiguration, with a failover mechanism to maintain uninterrupted subnet management.

Figure: Network management

Superior Bandwidth 

InfiniBand consistently outperforms Ethernet in network data rates, which is crucial for server interconnects in HPC. Deployments around 2014 commonly ran at 40Gb/s QDR and 56Gb/s FDR; many supercomputers have since moved to 100Gb/s EDR and 200Gb/s HDR, and 400Gb/s NDR products are now being evaluated for high-performance computing systems.

Efficient CPU Offloading 

InfiniBand enhances computing performance by minimizing CPU resource use through:

  • Hardware offload of the entire transport-layer protocol stack

  • Kernel bypass with zero-copy data movement

  • RDMA (Remote Direct Memory Access), which transfers data directly between server memories without CPU involvement

GPUDirect technology further improves performance by letting the network adapter access GPU memory directly, which is essential for HPC workloads such as deep learning and machine learning.
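To make the kernel-bypass and zero-copy ideas concrete, here is a minimal C sketch using libibverbs, the standard user-space RDMA verbs library. It opens the first InfiniBand adapter and registers (pins) a buffer so the HCA can access it directly; the buffer size and access flags are illustrative choices, and the queue-pair setup and out-of-band exchange with the remote peer are omitted.

```c
/* Minimal libibverbs sketch: register a buffer so the HCA can DMA to it
 * directly (zero copy, kernel bypass). Compile with: gcc rdma_reg.c -libverbs */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no RDMA-capable devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!ctx || !pd) {
        fprintf(stderr, "failed to open device or allocate protection domain\n");
        return 1;
    }

    size_t len = 4096;                 /* illustrative buffer size */
    void *buf = malloc(len);

    /* Pin the buffer and hand its translation to the adapter; a connected
     * peer that knows the address and rkey can then read or write it with
     * RDMA, without involving this host's CPU or kernel. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }

    printf("device %s: registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           ibv_get_device_name(devs[0]), len, mr->lkey, mr->rkey);

    /* Queue-pair creation and the exchange of addr/rkey with the remote
     * side (normally done over a side channel) are omitted for brevity. */
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```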

Low Latency 

InfiniBand achieves significantly lower latency than Ethernet. InfiniBand switches streamline layer 2 processing and use cut-through forwarding, reducing per-hop latency to below 100ns, whereas Ethernet switches typically incur higher latency due to more complex layer 2 processing. On the host side, InfiniBand NICs (Network Interface Cards) use RDMA to bring application latency down to around 600ns, while TCP/IP-based Ethernet applications typically sit around 10us, an order-of-magnitude difference.

Figure: Measured latency for MPI
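As a back-of-the-envelope illustration of the cut-through advantage described above, the short C sketch below compares the time a store-and-forward switch spends waiting for a whole frame to arrive against the header-sized portion a cut-through switch inspects before forwarding. The frame and header sizes are illustrative assumptions, not measured values.

```c
#include <stdio.h>

/* Per-hop serialization delay: bits on the wire divided by the link rate.
 * With the rate in Gb/s, (bytes * 8) / rate gives nanoseconds directly. */
static double serialization_ns(double bytes, double gbps)
{
    return bytes * 8.0 / gbps;
}

int main(void)
{
    double link_gbps   = 200.0;   /* HDR link rate */
    double frame_bytes = 4096.0;  /* illustrative MTU-sized packet */
    double hdr_bytes   = 64.0;    /* illustrative header a cut-through switch waits for */

    printf("store-and-forward wait: %.1f ns\n", serialization_ns(frame_bytes, link_gbps));
    printf("cut-through wait:       %.1f ns\n", serialization_ns(hdr_bytes, link_gbps));
    return 0;
}
```

The remaining per-hop latency comes from the switch's forwarding pipeline itself, which is where InfiniBand's simplified layer 2 processing helps.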

Scalability and Flexibility 

InfiniBand supports up to 48,000 nodes in a single subnet; because address resolution and routing are handled centrally by the subnet manager rather than by broadcast, it avoids broadcast storms and the bandwidth they waste. It supports various network topologies for scalability, from a 2-layer fat-tree for smaller networks to 3-layer fat-tree and Dragonfly for larger deployments.
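To give a feel for these scale points, the sketch below applies the standard non-blocking folded-Clos (fat-tree) capacity formulas: with k-port switches, a 2-layer fat-tree supports k²/2 end ports and a 3-layer fat-tree supports k³/4. The 40-port radix is just an example matching the QM87xx switches described later in this article.

```c
#include <stdio.h>

int main(void)
{
    int k = 40;  /* switch radix, e.g. a 40-port QM87xx HDR switch */

    /* Standard non-blocking folded-Clos (fat-tree) capacities:
     * 2 layers -> k^2 / 2 end ports, 3 layers -> k^3 / 4 end ports. */
    printf("2-layer fat-tree with %d-port switches: %d end ports\n", k, k * k / 2);
    printf("3-layer fat-tree with %d-port switches: %d end ports\n", k, k * k * k / 4);
    return 0;
}
```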

Quality of Service (QoS) Support 

InfiniBand provides QoS by prioritizing traffic so that high-priority applications are served first. This is implemented with virtual lanes (VLs), separate logical channels that isolate traffic classes by priority, allowing critical applications to retain their performance under load.

Stability and Resilience 

InfiniBand switches include an integrated self-healing mechanism that recovers from link failures in as little as 1ms, orders of magnitude faster than software-based recovery.

Figure: Network topology

Optimized Load Balancing 

InfiniBand uses adaptive routing for load balancing, dynamically distributing traffic across switch ports to prevent congestion and optimize bandwidth utilization.

In-Network Computing Technology: SHARP

InfiniBand's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) offloads collective communication operations from CPUs and GPUs into the switches themselves, so data crosses the network fewer times, significantly boosting performance in HPC applications.
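The kind of operation SHARP accelerates is an ordinary MPI collective, such as the allreduce in the C sketch below. With SHARP enabled in the MPI stack (typically through NVIDIA's HPC-X/HCOLL integration, whose exact configuration depends on the deployment), the summation is performed inside the switch ASICs instead of being bounced between end hosts; the application code itself does not change.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local  = (double)rank;   /* each rank contributes its own value */
    double global = 0.0;

    /* The reduction below is the type of collective SHARP can offload:
     * with SHARP active, partial sums are aggregated in the switches. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %f\n", size, global);

    MPI_Finalize();
    return 0;
}
```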

Diverse Network Topologies 

InfiniBand supports multiple topologies like fat-tree, Torus, Dragonfly+, Hypercube, and HyperX, catering to different requirements for scalability, cost efficiency, latency minimization, and transmission distance.

Overview of InfiniBand HDR Products

As customer demands grow, 100Gb/s EDR is gradually being phased out, and HDR is being widely adopted for its flexibility, since it offers both HDR100 (100Gb/s) and full HDR (200Gb/s) options.

InfiniBand HDR Switches

NVIDIA offers two types of InfiniBand HDR switches. The first is the HDR CS8500 modular chassis switch, a 29U switch with up to 800 HDR 200Gb/s ports. Each 200G port can split into 2x100G, supporting up to 1600 HDR100 (100Gb/s) ports. The second type is the QM87xx series fixed switch, with a 1U form factor integrating 40 200G QSFP56 ports. These ports can split into up to 80 HDR100 ports and also support EDR rates for 100G EDR NIC connections. Note that a single 200G HDR port can only downgrade to 100G for EDR NICs and cannot split into 2x100G for two EDR NICs.

The 200G HDR QM87xx switches come in two models: MQM8700-HS2F and MQM8790-HS2F. The only difference between these models is the management method: the QM8700 supports out-of-band management through a dedicated management port, while the QM8790 requires the NVIDIA UFM (Unified Fabric Manager) platform for management.

Both QM8700 and QM8790 switches offer two airflow options. Details of the QM87xx series switches are as follows:

Product | Ports | Link Speed | Interface Type | Rack Units | Management
MQM8790-HS2F | 40 | 200Gb/s | QSFP56 | 1 RU | In-band
MQM8700-HS2F | 40 | 200Gb/s | QSFP56 | 1 RU | In-band / out-of-band

InfiniBand HDR NICs 

Compared with HDR switches, HDR NICs come in a wider variety of models, with two data-rate options: HDR100 and HDR.

HDR100 NICs support 100Gb/s transmission, and two HDR100 ports can connect to a single HDR switch port using a 200G HDR-to-2x100G HDR100 splitter cable. Unlike 100G EDR NICs, the 100G ports of HDR100 NICs support both 4x25G NRZ and 2x50G PAM4 signaling.

200G HDR NICs support 200Gb/s transmission rates and can connect directly to switches using 200G direct cables.

Each data rate is available in single-port and dual-port models with different PCIe host interfaces, allowing businesses to choose based on their needs. Common InfiniBand HDR NIC models include:

Product | Ports | InfiniBand Speeds | Ethernet Speeds | Interface Type | Host Interface
MCX653105A-ECAT | Single | HDR100, EDR, FDR, QDR, DDR, SDR | 100, 50, 40, 25, 10Gb/s | QSFP56 | PCIe 4.0 x16
MCX653106A-ECAT | Dual | HDR100, EDR, FDR, QDR, DDR, SDR | 100, 50, 40, 25, 10Gb/s | QSFP56 | PCIe 4.0 x16
MCX653105A-HDAT | Single | HDR, HDR100, EDR, FDR, QDR, DDR, SDR | 200, 100, 50, 40, 25, 10Gb/s | QSFP56 | PCIe 4.0 x16
MCX653106A-HDAT | Dual | HDR, HDR100, EDR, FDR, QDR, DDR, SDR | 200, 100, 50, 40, 25, 10Gb/s | QSFP56 | PCIe 4.0 x16

Conclusion 

InfiniBand networks, particularly with HDR technology, are pivotal in supercomputing. Their high throughput, low latency, scalability, reliability, and energy efficiency make them indispensable for modern HPC environments. As supercomputing evolves, InfiniBand will remain a cornerstone technology, driving advancements and enabling groundbreaking research and innovation.
