
Data Center Networking in the Age of HPC

Posted on Jun 25, 2024

Traditional cloud data centers have been the cornerstone of computing infrastructure for over a decade, serving a diverse range of users and applications. In recent years, however, data centers have evolved to accommodate advancements in technology and the rising demand for HPC. This post explores the pivotal role that networking plays in shaping the future of data centers and facilitating the era of HPC.

Dedicated Data Centers: HPC Factories and HPC Clouds

Two distinct categories of data centers are emerging: HPC factories and HPC clouds. These centers are tailored to meet the specific demands of HPC workloads, which heavily rely on accelerated computing.

HPC factories are designed to manage extensive, large-scale workflows and the development of foundational models like large language models (LLMs). These models serve as the building blocks for more advanced HPC systems. A robust and high-performance network infrastructure is crucial for seamless scalability and efficient resource utilization across thousands of GPUs.

HPC clouds extend the capabilities of traditional cloud infrastructure to support large-scale generative HPC applications. Generative HPC goes beyond conventional systems by producing new content such as images, text, and audio based on the data the models were trained on. Managing HPC clouds with numerous users necessitates advanced management tools and a networking infrastructure capable of efficiently handling diverse workloads.

HPC and Distributed Computing 

HPC workloads, particularly those involving large and complex models like ChatGPT and BERT, are computationally intensive. To expedite model training and process vast datasets, HPC practitioners have embraced distributed computing. This approach distributes the workload across interconnected servers or nodes linked by a high-speed, low-latency network.
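
To make this concrete, below is a minimal sketch of data-parallel training using PyTorch's torch.distributed with the NCCL backend, a common pattern on GPU clusters. The model and tensor sizes are placeholders, and the script assumes it is launched with torchrun so that rank and world size come from the environment; it is an illustration of the pattern, not a production recipe.

```python
# Minimal sketch of data-parallel training across interconnected nodes,
# assuming PyTorch with the NCCL backend and launch via torchrun
# (which sets RANK, WORLD_SIZE, and MASTER_ADDR in the environment).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")  # rendezvous over the cluster network
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for a real model
    ddp_model = DDP(model)  # gradients are all-reduced over the network each step
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda")
        loss = ddp_model(x).square().mean()
        opt.zero_grad()
        loss.backward()  # triggers NCCL all-reduce; at scale, network
                         # latency and bandwidth dominate this step
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Every backward pass in this loop synchronizes gradients across all nodes, which is why the speed and scalability of the interconnect directly bound end-to-end training throughput.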

The scalability and capacity of the network to accommodate a growing number of nodes are critical for the success of HPC. A highly scalable network enables researchers to tap more computational resources, leading to faster training and better overall performance.

When designing the network architecture for HPC data centers, it's essential to prioritize an integrated solution with distributed computing capabilities. Data center architects must carefully consider network design and tailor solutions to meet the unique demands of the HPC workloads they intend to deploy.

NVIDIA Quantum-2 InfiniBand and NVIDIA Spectrum-X are two networking platforms specifically engineered and optimized to address the networking challenges of HPC data centers, each offering unique features and innovations.

Enhancing HPC Performance with InfiniBand 

InfiniBand technology has significantly boosted large-scale supercomputing deployments for complex distributed scientific computing. It has emerged as the preferred network for HPC factories, playing a crucial role in accelerating mainstream HPC and AI applications today thanks to its ultra-low latency. The NVIDIA Quantum-2 InfiniBand platform integrates several essential network capabilities needed for efficient HPC systems.

In-network computing, powered by InfiniBand, integrates hardware-based computing engines directly into the network fabric. This approach scales efficiently and utilizes NVIDIA's Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), enhancing data bandwidth for collective operations and overall performance.
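
From the application's point of view, SHARP does not change the collective call itself; the offload is negotiated between the fabric and the communication library. The sketch below shows the kind of all-reduce that SHARP can aggregate in-network, assuming a PyTorch/NCCL environment (NCCL documents a CollNet toggle, NCCL_COLLNET_ENABLE, for SHARP-capable fabrics; treat the exact setting as deployment-specific).

```python
# Sketch: the same all-reduce collective that SHARP can offload into
# InfiniBand switches. The application code is unchanged; the offload
# is selected by the fabric and the library (e.g., NCCL_COLLNET_ENABLE=1
# in the job environment on SHARP-capable fabrics).
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

grad = torch.full((1 << 20,), float(rank), device="cuda")  # fake gradient shard

# Without SHARP, this reduction is computed by the GPUs (ring/tree
# algorithms) and data crosses each link multiple times; with SHARP,
# switches aggregate in-network, cutting the data traversing the fabric.
dist.all_reduce(grad, op=dist.ReduceOp.SUM)

print(f"rank {rank}: reduced element = {grad[0].item()}")
dist.destroy_process_group()
```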

Adaptive routing in InfiniBand optimizes traffic flow by dynamically selecting congestion-free paths based on real-time network conditions. Managed by a Subnet Manager, this routing strategy maximizes efficiency without compromising packet delivery order.
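
The routing decision itself is made in switch hardware under the Subnet Manager's control, not in user code, but the core idea can be shown with a toy sketch: among equal-cost output ports, prefer the least congested one. Real implementations also take care to preserve packet delivery order, which this deliberately simplified illustration does not model.

```python
# Toy illustration of adaptive routing (the real logic lives in switch
# hardware and the Subnet Manager): among equal-cost output ports,
# forward via the least-congested one rather than a fixed hash.
import random

class Port:
    def __init__(self, name):
        self.name = name
        self.queue_depth = 0  # proxy for congestion, updated by telemetry

def pick_port(ports):
    # Static routing would hash each flow to a fixed port; adaptive
    # routing instead consults current congestion state per decision.
    return min(ports, key=lambda p: p.queue_depth)

ports = [Port("p1"), Port("p2"), Port("p3")]
for pkt in range(6):
    for p in ports:
        p.queue_depth = random.randint(0, 10)  # simulated telemetry
    chosen = pick_port(ports)
    print(f"packet {pkt} -> {chosen.name} (queue={chosen.queue_depth})")
```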

The InfiniBand Congestion Control Architecture ensures deterministic bandwidth and latency by employing a three-stage congestion management process. This architecture prevents bottlenecks in HPC workloads, ensuring consistent performance.
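
The details of those stages live in the fabric, but broadly the loop they implement is: a congested switch marks forward traffic, the receiver echoes the notification back to the sender, and the sender throttles its injection rate before gradually recovering. The sketch below captures that control loop in a few lines; the constants are purely illustrative, not taken from the InfiniBand specification.

```python
# Simplified sketch of a congestion-control loop in the InfiniBand style:
# (1) a congested switch marks packets, (2) the receiver echoes the mark
# back to the sender, (3) the sender throttles, then gradually recovers.
# All constants here are illustrative only.
class Sender:
    def __init__(self, line_rate=100.0):
        self.line_rate = line_rate
        self.rate = line_rate      # current injection rate (Gb/s)

    def on_congestion_notification(self):
        self.rate *= 0.5           # back off when congestion is signaled

    def on_timer(self):
        # Recover toward line rate while no congestion marks arrive.
        self.rate = min(self.line_rate, self.rate + 5.0)

s = Sender()
for congested in [True, True, False, False, False]:
    if congested:
        s.on_congestion_notification()
    else:
        s.on_timer()
    print(f"injection rate: {s.rate:.1f} Gb/s")
```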

These inherent optimizations enable InfiniBand to meet the demanding requirements of HPC applications, driving superior performance and efficiency.

Exploring Ethernet Deployment in HPC

Deploying Ethernet networks for HPC infrastructures involves addressing specific requirements inherent to the Ethernet protocol. Over time, Ethernet has evolved with a feature-rich and sometimes complex set of capabilities tailored to diverse network scenarios.

However, traditional Ethernet is not inherently optimized for high performance. HPC clouds using traditional Ethernet struggle to achieve the performance levels attainable with purpose-built networks.

In multi-tenant environments where multiple HPC jobs run concurrently, performance isolation becomes critical to prevent degradation. Because traditional Ethernet is optimized for everyday enterprise traffic rather than for high-performance workloads built on the NVIDIA Collective Communications Library (NCCL), its handling of link faults alone can cut cluster performance by as much as 50%.

These performance limitations stem from inherent characteristics of traditional Ethernet, including higher switch latencies typical of commodity ASICs, a split-buffer switch architecture that can lead to unfair bandwidth distribution, and suboptimal load balancing for the large data flows generated by HPC workloads.
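
The load-balancing point is easy to see with a toy simulation: static ECMP-style hashing spreads flows statistically, which works well for many small flows but poorly for the few large "elephant" flows typical of HPC traffic. The numbers below are illustrative; real switches hash on packet headers rather than choosing randomly, but the statistics are the same.

```python
# Toy simulation of why static ECMP hashing struggles with a few large
# ("elephant") flows: with few flows, hash collisions oversubscribe some
# uplinks while leaving others idle. Random choice stands in for a
# header hash; the collision statistics are equivalent.
import random
from collections import Counter

def avg_worst_link_load(num_flows, num_links, trials=1000):
    worst = 0
    for _ in range(trials):
        loads = Counter(random.randrange(num_links) for _ in range(num_flows))
        worst += max(loads.values())
    return worst / trials

# Normalize by the ideal per-link load to get an oversubscription factor.
print("8 elephant flows on 8 links :",
      avg_worst_link_load(8, 8) / (8 / 8))       # roughly 2.5x ideal
print("800 small flows on 8 links  :",
      avg_worst_link_load(800, 8) / (800 / 8))   # close to 1x ideal
```

With only eight flows on eight links, the busiest link routinely carries two to three flows while others sit idle; with hundreds of small flows, the law of large numbers smooths the load out. HPC traffic looks like the first case, which is why flow-hash load balancing alone falls short.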

The Spectrum-X networking platform addresses these challenges and more. Built on the Ethernet protocol with RDMA over Converged Ethernet (RoCE) extensions, Spectrum-X enhances HPC performance by integrating InfiniBand's best practices.

Conclusion 

The era of HPC is here, and the network is the cornerstone of its success. To fully harness the potential of HPC, data center architects must design networks that cater to the unique demands of HPC workloads. Proper network design is crucial for unlocking the full capabilities of HPC technologies and driving innovation in the data center industry.

NVIDIA Quantum InfiniBand is an excellent choice for HPC factories due to its ultra-low latencies, scalable performance, and advanced features. For organizations building Ethernet-based HPC clouds, NVIDIA Spectrum-X offers a groundbreaking solution with its purpose-built technology innovations.

As an official NVIDIA partner, FS maintains ample inventory of NVIDIA InfiniBand and Ethernet switches, ensuring availability and quick delivery. FS also provides access to technical experts who can develop tailored network architecture solutions to meet your specific HPC requirements.
