
Building HPC Data Center Networking Architecture with FS InfiniBand Solution

Posted on May 24, 2024

In the ever-evolving landscape of high-performance computing, the backbone of future HPC business development lies in HPC networking and infrastructure. As HPC applications grow in complexity and data volume, resilient, scalable, and efficient networks become imperative. The architecture of an HPC network serves as the bedrock of HPC system operations, playing a pivotal role in data processing, management, and large-scale storage. This article delves into the key components of HPC network architecture, explains the advantages of HPC data center networking, and explores FS's comprehensive solution and product portfolio for the different partitions of the HPC data center network architecture.

What Is HPC Networking Architecture and What Are Its Key Components?

The network architecture for HPC workloads is meticulously designed and comprises three key components: the computing network, the management network, and the storage network. Together, they tackle the most complex algorithms, unlocking new potential across various domains.

Computing Network

The computing network serves as the computational backbone of HPC networking systems, consisting of HPC computing networks and general-purpose computing networks.

HPC computing networks are tailored for performance-critical HPC tasks, efficiently processing extensive data volumes and executing workloads that require complex computation, such as image recognition, natural language processing, and model inference. They typically comprise GPU servers, high-performance switches, high-speed modules, and high-grade DAC/AOC cables, forming large computing clusters that collaborate to accelerate HPC workloads and deliver real-time insights. The HPC computing cluster is the most performance-demanding part of the architecture: as the backbone for processing HPC workloads, it requires a network with high performance, lossless transmission, low latency, and scalability. HPC computing networks therefore typically consist of 400G and higher connections and rely on powerful networking technologies such as InfiniBand interconnects.
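As a quick illustration of working with such a fabric from a compute node, the InfiniBand adapters a server exposes can be enumerated with pyverbs, the Python bindings that ship with rdma-core. This is a minimal sketch, assuming rdma-core with pyverbs is installed and at least one HCA is present:

```python
# Minimal sketch: enumerate the InfiniBand adapters visible on a compute
# node using pyverbs (the Python bindings shipped with rdma-core).
# Assumes rdma-core with pyverbs is installed and an HCA is present.
import pyverbs.device as d

for dev in d.get_device_list():
    name = dev.name.decode()
    ctx = d.Context(name=name)
    attr = ctx.query_device()
    # Per-port link state and speed could be read with ctx.query_port(1).
    print(f"{name}: {attr.phys_port_cnt} physical port(s)")
    ctx.close()
```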

General-purpose computing networks primarily handle general application traffic, offering the versatile computing resources that HPC networks rely on, such as deep learning platforms and other software. They provide a flexible computing environment capable of accommodating diverse workloads and applications, apart from the data-intensive computing tasks handled by the HPC computing network. This network usually consists of 10/25/100/400/800 Gigabit Ethernet (GE) connections.

Management Network

The management network primarily deploys service management systems and operational support components to efficiently allocate workloads and distribute resources, ensuring optimal performance and resource utilization.

The management network in HPC data center network architecture can be divided into out-of-band and in-band management networks. The out-of-band management network hosts the management-port access of the various terminal types in the data center and monitors and manages the status of the cluster's physical devices, enabling unified operation and remote maintenance. The in-band management network connects to the business/office network and provides Internet access to the data center.

Data centers targeting HPC workloads are typically enormous, with possibly thousands of 100G-800G ports per cluster. To enhance network interoperability and enable unified management of such a large fleet of network devices, the management network typically employs open networking operating systems to create a highly resilient, flexible, and reliable network.

Storage Network

In HPC data centers, the storage network utilizes high-speed, high-bandwidth interconnected storage systems primarily designed to store the vast datasets generated by HPC applications. This network includes components such as storage servers, storage devices, and storage management software. Storage servers connect the components of the HPC network, enabling seamless data exchange and access to the data they hold. Storage devices typically feature high-speed, high-capacity attributes to accommodate extensive datasets. Meanwhile, rapid and efficient data transmission depends on deploying high-speed network infrastructure, including switches and optical modules. Storage management software plays a pivotal role in overseeing and controlling storage systems, encompassing functions like data management, storage resource management, data backup, recovery, and data security.

In well-designed HPC network architectures, the storage network and its infrastructure are optimized for high throughput and low latency, ensuring cost-effective and reliable data storage.

HPC Networking Architecture

How Does HPC Data Center Networking Stand Out?

Different fabrics within HPC data center network architecture collaborate to construct a lossless, high-performance, and scalable network. This network efficiently distributes workloads among multiple interconnected computing resources, enabling enterprises to rapidly scale large-scale multi-node training workloads and stay ahead of the competition. The following characteristics of HPC data center networking enable it to meet varied HPC workload and scale requirements.

  • Parallel Computing – HPC data center networks support parallel processing, enabling the simultaneous execution of multiple workloads. With thousands of tasks processed concurrently, large jobs complete in a fraction of the time serial execution would take (see the sketch after this list). This empowers industries to train bigger, better, and more accurate models, accelerating industry advancements.

  • Size – HPC data centers are typically massive in scale, potentially comprising thousands of computing engines (such as GPUs and CPUs) and a vast array of network connectivity infrastructure operating at different speeds.

  • Bandwidth – High-bandwidth traffic needs to flow in and out of servers for applications to operate effectively. In modern data center deployments, HPC functions are achieving interface speeds of up to 400G per compute engine.

  • Latency – The completion time of HPC workloads is a critical factor influencing user experience. Therefore, HPC data center networks often adopt low-latency network technologies such as InfiniBand and RDMA.

  • Lossless – A lossless network minimizes packet loss, enabling smooth and efficient data transmission, which is essential for data centers handling HPC workloads to maintain data integrity and optimize performance.

  • Unified Management – Large-scale HPC networks consist of numerous network infrastructures. Typically, unified management platforms are employed to configure, monitor, and oversee these components, thereby simplifying operations and boosting system security.
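To make the parallel-computing point concrete, here is a minimal, self-contained sketch that spreads independent work items across a pool of worker processes; the worker count and the synthetic workload are illustrative placeholders, not tied to any FS product:

```python
# Minimal sketch of the parallel-computing idea: spread many independent
# work items over a pool of workers instead of running them one by one.
from concurrent.futures import ProcessPoolExecutor
import math

def work_item(n: int) -> float:
    # Stand-in for a compute-heavy task (e.g., one inference batch).
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    jobs = [2_000_000] * 64  # 64 independent tasks
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(work_item, jobs))
    print(f"completed {len(results)} tasks across 8 workers")
```

The same fan-out pattern, scaled from one host's worker pool to thousands of nodes over a lossless fabric, is what the computing network described above is built to support.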

For more details about RDMA, see RDMA-Enhanced High-Speed Network for Training Large Models.

Building an Effective Network for HPC Workloads with the FS InfiniBand Solution

As HPC applications and HPC computing gain adoption worldwide, FS has unveiled its high-performance HPC solution. Leveraging high-speed, low-latency InfiniBand technology and an elastic, efficient network operating platform (PicOS® and AmpCon™), the FS H100 InfiniBand solution helps enterprises optimize HPC workloads, simplify HPC business processes, and drive intelligent applications of HPC across various industries.

Full Range of NVIDIA® InfiniBand Products Empowering Computing Network

NVIDIA® InfiniBand is globally recognized as a high-speed, low-latency, and scalable solution tailored for supercomputers, HPC, and cloud data centers, making it the prime choice for HPC computing networks. As a trusted Elite Partner in the NVIDIA Partner Network, FS offers a comprehensive range of NVIDIA® InfiniBand products, serving as a reliable solution provider in the HPC field.

As shown in the diagram below, FS offers the NVIDIA® Quantum-2 MQM9790 InfiniBand switch, NVIDIA® ConnectX®-7 InfiniBand Adapter, and InfiniBand transceivers and cables with speeds of up to 800G, forming a specialized InfiniBand network for HPC computing. This network provides the fastest networking performance and feature sets available to tackle the world’s most challenging problems.

InfiniBand Quantum-2 Switches

FS NVIDIA® QM9700/9790 InfiniBand switches provide 64 400Gb/s ports or 128 200Gb/s ports on 32 physical OSFP connectors. They deliver an aggregate 51.2 Tb/s of bidirectional throughput and a capacity of more than 66.5 billion packets per second (BPPS), delivering world-leading networking performance.
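These headline figures follow directly from the port configuration, as a quick back-of-the-envelope check shows:

```python
# Back-of-the-envelope check of the QM9700/9790 headline figures.
ports = 64                  # 400Gb/s ports on 32 physical OSFP connectors
speed_gbps = 400            # per-port line rate
unidirectional_tbps = ports * speed_gbps / 1000  # 25.6 Tb/s one way
bidirectional_tbps = unidirectional_tbps * 2     # counted in both directions
print(bidirectional_tbps)   # -> 51.2, matching the quoted aggregate
```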

NVIDIA® ConnectX®-6/7 InfiniBand Adapters

FS NVIDIA® InfiniBand adapters support PCIe 5.0 and deliver a single network port at 400Gb/s. FS's NVIDIA® ConnectX-7 InfiniBand adapters include advanced In-Network Computing capabilities and additional programmable engines that enable the preprocessing of data algorithms and offload application control paths to the network.

InfiniBand Transceivers and Cables

Various FDR, EDR, HDR, and NDR transceivers and DAC/AOC/ACC cables with 1-to-2 and 1-to-4 splitter options provide maximum flexibility to build the topology of choice. FS NVIDIA® InfiniBand modules and cables are 100% tested on original NVIDIA® equipment, ensuring full compatibility with NVIDIA® Quantum-2 switches and ConnectX-7 adapters.

InfiniBand Network

PicOS® and AmpCon™ Platform Enabling Intelligent Network Management

In the management network, FS PicOS® switches combine the advanced PicOS® software with the AmpCon™ management platform's feature sets, empowering customers to efficiently provision, monitor, manage, proactively troubleshoot, and maintain the HPC infrastructure, achieving higher utilization and reducing overall OpEx. The FS PicOS® software and AmpCon™ management platform work in synergy to enable visualized, HPC-driven operations and management across the entire HPC data center network. Their specific advantages include:

FS PicOS® Software

  • PicOS® is fully standardized and backward compatible with existing networks, making it easy to integrate with switches from Cisco, Juniper, and others. This allows customers to gradually upgrade their networks according to their budget.

  • Enable zero-trust security for access layers with integrated leading NAC Policy Manager and comprehensive security mechanism support.

  • Work with AmpCon™ to automate switch provisioning, deployment, and error-free configuration at scale, resulting in reduced OpEx.

  • Use an open solution with spine-leaf arrays to support flexible and scalable virtualization architectures.

  • Achieve full network visibility with SNMP and sFlow, while gNMI delivers efficient and effective open telemetry (a minimal polling sketch follows this list).
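As one example of that visibility, interface counters exposed over SNMP can be polled with any standard toolchain. The sketch below uses the classic pysnmp high-level API; the switch address, community string, and interface index are placeholders:

```python
# Hedged sketch: poll an interface byte counter over SNMPv2c with pysnmp.
# Switch address, community string, and interface index are placeholders.
from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, getCmd,
)

error_indication, error_status, error_index, var_binds = next(
    getCmd(
        SnmpEngine(),
        CommunityData("public", mpModel=1),       # SNMPv2c
        UdpTransportTarget(("192.0.2.10", 161)),  # placeholder switch IP
        ContextData(),
        ObjectType(ObjectIdentity("IF-MIB", "ifHCInOctets", 1)),
    )
)
if error_indication:
    print(error_indication)
else:
    for name, value in var_binds:
        print(f"{name} = {value}")
```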

FS AmpCon™ Management Platform

  • Support for Zero-Touch Provisioning greatly simplifies the installation and deployment process, enabling effortless deployment of hundreds or even thousands of PicOS® switches (a generic provisioning sketch follows this list).

  • Robust graphical user interfaces (GUIs) enable real-time monitoring of network performance and conditions, with the capability to store monitoring data in either an on-premises or cloud-based database for further analysis.

  • End-to-end network lifecycle management with automated provisioning, maintenance, compliance checking, and upgrades to prevent misconfigurations and downtime.

  • As an open and extendable platform, AmpCon™ is ready to take advantage of telemetry and other emerging technologies, continuously evolving to bring new levels of analysis and automation in the HPC era.
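The AmpCon™ interfaces themselves are beyond the scope of this post, but the general shape of push-style bulk provisioning can be sketched with the open-source netmiko library; the inventory, credentials, device type, and configuration line below are all placeholders, not AmpCon™ or PicOS® specifics:

```python
# Generic illustration of bulk provisioning over SSH (not the AmpCon™ API):
# apply the same configuration lines to a fleet of switches.
# Hostnames, credentials, device_type, and config lines are placeholders.
from netmiko import ConnectHandler

SWITCHES = [f"10.0.0.{i}" for i in range(1, 4)]   # placeholder inventory
CONFIG = ["snmp-server community public ro"]      # placeholder config line

for host in SWITCHES:
    conn = ConnectHandler(
        device_type="linux",   # placeholder; set per platform
        host=host,
        username="admin",
        password="********",
    )
    conn.send_config_set(CONFIG)
    conn.disconnect()
    print(f"{host}: configured")
```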

Management Network

FS PicOS® Switches Enhancing Large-Scale Data Storage Efficiency

Connected via 100G optical modules, FS PicOS® switches establish a scalable, high-bandwidth network that facilitates efficient data transmission for HPC data center storage systems. FS PicOS® switches also support BGP with powerful routing control capabilities, ensuring optimal forwarding paths and low-latency forwarding across the storage network. These robust switches significantly boost storage network performance, easily meeting the demanding requirements of modern HPC workloads.
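To put the 100G link speed in perspective, a rough transfer-time estimate (ignoring protocol overhead and assuming the link is the bottleneck) looks like this:

```python
# Rough transfer-time estimate over a 100Gb/s storage link,
# ignoring protocol overhead and assuming the link is the bottleneck.
dataset_tb = 1.0                            # dataset size in terabytes
link_gbps = 100                             # link rate in gigabits/second
seconds = dataset_tb * 8_000 / link_gbps    # 1 TB = 8,000 Gb
print(f"~{seconds:.0f} s to move {dataset_tb} TB")  # -> ~80 s
```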

Storage Network

Final Thoughts

As the construction of HPC data center networks continues to expand, FS stands out as a global provider of HPC computing solutions. In addition to offering highly reliable solutions and products, FS operates seven local warehouses around the globe, serving over 200 countries, backed by a robust and agile supply chain. This allows for swift product delivery, shortening customer project cycles and helping customers seize opportunities in the HPC market. FS tailors customized solutions for the different partitions of HPC data center network architecture, enabling precise configuration within customer budgets and helping them effectively manage project costs.

As the field of HPC advances, FS remains committed to the HPC era, continuously innovating cutting-edge HPC computing solutions to accelerate the adoption of HPC technology across diverse industries.


Related Articles:

The Rise of HPC Data Centers: FS Empowering Next-gen Data Centers

InfiniBand Insights: Powering High-Performance Computing in the Digital Age

FS AmpCon™: Your Network Automation Partner
