Understanding the Connection Between Switches and HPC

Posted on Dec 20, 2023 by

In the ever-evolving landscape of computer networks, protocols play a pivotal role in governing data exchange. One such cornerstone is the OSI seven-layer protocol, a global standard introduced in the 1980s. This protocol, designed to standardize communication between computers, unfolds its intricacies through a layered network model. From the hardware-oriented physical layer to the application-centric application layer, each tier contributes to seamless communication. As we delve deeper, we explore the evolution from traditional TCP/IP to the realm of RDMA, addressing the demands of High-Performance Computing (HPC) with an emphasis on high throughput and low latency. Join us on a journey through network architectures, the role of switches, and the intriguing choice between Ethernet and InfiniBand in the pursuit of optimal performance and cost-effectiveness. This exploration is a testament to the dynamic nature of network technologies, where adaptability is key in meeting the ever-growing demands of modern data centers.

Understanding OSI Protocol and the Transition to RDMA in HPC

A protocol is a set of rules, standards, or agreements established for data exchange within a computer network. The OSI (Open Systems Interconnection) seven-layer model is the de jure international standard. Introduced in the 1980s, it aimed to standardize communication between computers and to meet the requirements of open networks through its seven-layer network model.

The physical layer governs how hardware communicates and establishes standards for physical devices, including interface types and transmission rates, facilitating the transmission of bit streams (data represented by 0s and 1s).

The data link layer primarily manages framing and error control. On the receive side, it assembles the bit stream from the physical layer into frames and passes them to the upper layer. On the transmit side, it wraps data from the network layer in frames and hands them to the physical layer as a bit stream. Checksums carried in each frame provide error detection and, in some link protocols, error correction.

The network layer establishes logical paths between nodes. It uses IP for logical addressing (each node has an IP address), routes traffic between networks, and transmits data in packets.

The transport layer oversees the quality of data transfer between two nodes, ensuring correct order and handling issues such as loss, duplication, and congestion control.

The session layer administers session connections in network devices, providing session control and synchronization for coordinating communication between different devices.

The presentation layer manages data format conversion and encryption/decryption operations, ensuring correct interpretation and processing by applications on different devices.

The application layer delivers direct network services and application interfaces to users, encompassing various applications like email, file transfer, and remote login.

These layers collectively form the OSI seven-layer model, each with specific functions and responsibilities facilitating communication and data exchange between computers.
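As an illustration of the checksum-based error detection attributed to the data link layer above, here is a minimal Python sketch. It assumes a simplified frame made of a payload followed by a 4-byte CRC-32 trailer; the function names `build_frame` and `check_frame` are hypothetical, and real Ethernet framing has additional fields and bit-ordering rules that are not modeled here.

```python
import zlib


def build_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer to the payload (simplified frame)."""
    fcs = zlib.crc32(payload).to_bytes(4, "big")
    return payload + fcs


def check_frame(frame: bytes) -> bool:
    """Recompute the CRC-32 over the payload and compare it with the trailer."""
    payload, fcs = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == fcs


frame = build_frame(b"hello, link layer")

# Simulate a single bit flipped in transit.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]

print(check_frame(frame))      # intact frame passes the check
print(check_frame(corrupted))  # damaged frame is rejected
```

The receiver in this sketch only detects the damage. Recovering the data (by retransmission or a correcting code) would be a separate mechanism.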

It's crucial to note that real-world network protocols may deviate from the OSI model, designed and implemented based on practical requirements and network architectures.

TCP/IP is a protocol suite rather than a single protocol. Its protocols are roughly grouped into four layers: the application layer, transport layer, network layer, and data link layer. TCP/IP can be viewed as a simplified, practical counterpart of the seven-layer model.

High-Performance Computing (HPC) demands high throughput and low latency, so HPC networks have moved from TCP/IP toward RDMA (Remote Direct Memory Access). TCP/IP has two main drawbacks in this setting: it adds latency, and it imposes significant CPU overhead, because transmission involves multiple context switches and the CPU performs the packet encapsulation.

RDMA permits direct access to a remote host's memory over the network interface without involving the operating system kernel. It enables high-throughput, low-latency network communication, making it well-suited for large-scale parallel computing clusters. RDMA does not specify the entire protocol stack, but it places stringent requirements on the underlying transport: minimal packet loss, high throughput, and low latency. The main implementations are InfiniBand, which is a dedicated fabric, and RoCE (RDMA over Converged Ethernet) and iWARP (Internet Wide Area RDMA Protocol), which carry RDMA over Ethernet. Each has its own technological nuances and cost considerations.

Spine-Leaf Architecture vs. Traditional Three-Layer Networks

Switches and routers operate at distinct layers within the network. A switch functions at the data link layer, using MAC addresses to identify devices and forward frames between them. A router, also referred to as a gateway, operates at the network layer and uses IP addressing to connect different subnetworks.
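To make MAC-based forwarding concrete, here is a minimal Python sketch of a learning switch. The learning-and-flooding behavior is the common textbook approach and is an assumption here, since the article only states that switches forward by MAC address. The class name `LearningSwitch` and the shortened MAC strings are hypothetical.

```python
from typing import Dict, List


class LearningSwitch:
    """Toy layer-2 switch: learns which port each source MAC lives on."""

    def __init__(self, num_ports: int) -> None:
        self.num_ports = num_ports
        self.mac_table: Dict[str, int] = {}  # MAC address -> port number

    def receive(self, in_port: int, src_mac: str, dst_mac: str) -> List[int]:
        """Return the list of ports the frame is sent out of."""
        # Learn (or refresh) the location of the sender.
        self.mac_table[src_mac] = in_port

        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the ingress port.
            return [p for p in range(self.num_ports) if p != in_port]
        if out_port == in_port:
            # Destination is on the same port the frame arrived on: drop it.
            return []
        return [out_port]


switch = LearningSwitch(num_ports=4)
first = switch.receive(0, "aa:aa", "bb:bb")   # bb:bb unknown, so flood
reply = switch.receive(1, "bb:bb", "aa:aa")   # aa:aa was learned on port 0
again = switch.receive(0, "aa:aa", "bb:bb")   # bb:bb was learned on port 1
print(first, reply, again)
```

A router would make the equivalent decision one layer up, matching the destination IP address against subnet prefixes instead of looking up an exact MAC address.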

Conventional data centers often adopt a three-tier architecture, comprising the access layer, aggregation layer, and core layer. The access layer is typically directly connected to servers, with commonly utilized access switches being Top of Rack (TOR) switches. The aggregation layer serves as an intermediary between the access layer and the core layer. The core switch handles traffic entering and leaving the data center, establishing connectivity with the aggregation layer.

However, traditional three-tier network architectures exhibit notable drawbacks, which become more pronounced with the evolution of cloud computing:

Bandwidth waste: Each aggregation switch group manages a Point of Delivery (POD), and each POD has independent VLAN networks. The use of Spanning Tree Protocol (STP) often results in only one active aggregation switch for a VLAN network, blocking others. This impedes the horizontal scalability of the aggregation layer.
Large failure domain: Due to the STP algorithm, network topology changes necessitate convergence, leading to potential network disruptions.
High latency: As data centers expand, the rise in east-west traffic results in significant latency. Communication between servers in a three-tier architecture traverses multiple switches, and upgrading the performance of core and aggregation switches incurs higher costs.

The spine-leaf architecture presents notable advantages, encompassing a flattened design, low latency, and high bandwidth. In a spine-leaf network, leaf switches fulfill a role similar to traditional access switches, while spine switches serve as core switches.

Leaf and spine switches dynamically select among multiple paths using Equal Cost Multi-Path (ECMP). In the absence of bottlenecks in the access ports and uplinks of the leaf layer, this architecture achieves non-blocking performance. Because each leaf in the fabric connects to every spine, the failure of one spine causes only a slight degradation in the data center's throughput.
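The ECMP path selection described above can be sketched as a hash over a flow's identifying fields. This is an illustrative Python sketch, not how any particular switch implements it: the choice of CRC-32 as the hash, the five-field flow tuple, and the names `pick_spine`, `remaining_capacity`, and `spine-1` through `spine-4` are all assumptions.

```python
import zlib
from typing import List, Tuple

Flow = Tuple[str, str, str, int, int]  # src IP, dst IP, protocol, src port, dst port


def pick_spine(flow: Flow, spines: List[str]) -> str:
    """Hash the flow fields so every packet of a flow takes the same spine."""
    key = "|".join(str(field) for field in flow).encode()
    return spines[zlib.crc32(key) % len(spines)]


def remaining_capacity(num_spines: int, failed: int = 1) -> float:
    """Fraction of leaf uplink capacity left if `failed` spines go down,
    assuming every leaf has equal-speed links to every spine."""
    return (num_spines - failed) / num_spines


spines = ["spine-1", "spine-2", "spine-3", "spine-4"]
flow = ("10.0.0.1", "10.0.1.7", "tcp", 40000, 443)

chosen = pick_spine(flow, spines)
print(chosen)                 # one of the four spines, stable for this flow
print(remaining_capacity(4))  # 0.75
```

Hashing per flow rather than per packet keeps the packets of one flow in order. With four spines, losing one leaves three quarters of the uplink capacity, which is the "slight degradation" the text refers to; the share lost shrinks as the number of spines grows.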

A Deep Dive into NVIDIA SuperPOD Architecture

A SuperPOD refers to a server cluster designed to deliver high-throughput performance through the interconnection of multiple compute nodes. Taking the NVIDIA DGX A100 SuperPOD as an illustration, the recommended configuration incorporates the QM8790 switch, providing 40 ports, each operating at 200G.

The architecture follows a non-blocking fat-tree structure. In the first layer, each DGX A100 server has 8 interfaces, each connecting to one of 8 leaf switches. Every 20 servers form a scalable unit (SU), so the first layer requires 8 × SU leaf switches, where SU is the number of scalable units. In the second layer, because the network is non-blocking and port speeds are uniform, each leaf switch needs at least as many uplink ports toward the spine layer as it has downlink ports toward the servers, and the spine switches must provide enough ports to terminate those uplinks. Therefore, 1 SU corresponds to 8 leaf switches and 5 spine switches, 2 SUs correspond to 16 leaf switches and 10 spine switches, and so forth. It is also recommended to add a core layer switch when the number of SUs exceeds 6.
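The sizing arithmetic above can be checked with a short Python sketch. It uses only the figures given in the text (20 servers per SU, 8 leaf and 5 spine switches per SU, 40-port switches) and only covers the two-layer case of up to 6 SUs. The function name `size_fabric` and the dictionary keys are hypothetical.

```python
SERVERS_PER_SU = 20   # servers in one scalable unit
LEAVES_PER_SU = 8     # one leaf per server interface
SPINES_PER_SU = 5     # per the recommended configuration
SWITCH_PORTS = 40     # QM8790 ports, all at the same speed


def size_fabric(num_sus: int) -> dict:
    """Count switches and ports for a two-layer (leaf/spine) compute fabric."""
    if not 1 <= num_sus <= 6:
        raise ValueError("the two-layer rule is only stated for 1 to 6 SUs")

    leaves = LEAVES_PER_SU * num_sus
    spines = SPINES_PER_SU * num_sus

    # Each server sends one link to each leaf of its SU, so a leaf has
    # 20 downlinks; the remaining ports are available as uplinks.
    downlinks_per_leaf = SERVERS_PER_SU
    uplinks_per_leaf = SWITCH_PORTS - downlinks_per_leaf

    return {
        "leaves": leaves,
        "spines": spines,
        "downlinks_per_leaf": downlinks_per_leaf,
        "uplinks_per_leaf": uplinks_per_leaf,
        "non_blocking": uplinks_per_leaf >= downlinks_per_leaf,
        "spine_ports_needed": leaves * uplinks_per_leaf,
        "spine_ports_available": spines * SWITCH_PORTS,
    }


one_su = size_fabric(1)
two_su = size_fabric(2)
print(one_su)
print(two_su)
```

For 1 SU, the 8 leaves need 160 spine ports and the 5 spines offer 200, so the leaf layer is non-blocking with ports to spare. The ratio is the same at 2 SUs (320 needed, 400 available).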

For the DGX A100 SuperPOD, the server-to-switch ratio for the compute network is approximately 1:1.17 (based on 7 SUs). However, when considering storage and network management requirements, the server-to-switch ratio for DGX A100 SuperPOD and DGX H100 SuperPOD is roughly 1:1.34 and 1:0.50, respectively.

In terms of ports, the recommended configuration for DGX H100 includes 31 servers per SU. DGX H100, designed with 4 interfaces for compute purposes, utilizes the QM9700 switch, offering 64 ports, each at 400G.

Regarding switch performance, the QM9700 in the DGX H100 SuperPOD's recommended configuration introduces SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) technology. SHARP uses the Aggregation Manager to construct Streaming Aggregation Trees (SATs) in the physical topology. Multiple switches in a tree perform the aggregation in parallel, which reduces latency and improves network performance. The QM8700/8790+CX6 supports a maximum of 2 SATs, while the QM9700/9790+CX7 supports up to 64 SATs. In addition, because each switch has more ports, fewer switches are needed.
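The idea behind an aggregation tree can be illustrated with a toy Python sketch. It is not the SHARP protocol or its API, just a two-level sum in which each leaf switch combines the values from its own hosts before forwarding a single partial result to the root. The function name `reduce_in_network` and the sample values are hypothetical.

```python
from typing import List, Tuple


def reduce_in_network(leaf_groups: List[List[int]]) -> Tuple[int, int]:
    """Sum host values per leaf switch, then sum the partials at the root.

    Returns the final total and the number of values the root receives.
    """
    # Each leaf switch aggregates its own hosts; leaves work independently.
    partials = [sum(host_values) for host_values in leaf_groups]
    # The root only combines one partial sum per leaf.
    return sum(partials), len(partials)


# Three leaf switches with 3, 2, and 1 hosts attached.
leaf_groups = [[1, 2, 3], [4, 5], [6]]

total, values_at_root = reduce_in_network(leaf_groups)
values_without_aggregation = sum(len(group) for group in leaf_groups)

print(total)                       # 21
print(values_at_root)              # 3
print(values_without_aggregation)  # 6
```

In this toy case the root handles 3 values instead of 6. The saving grows with the number of hosts per leaf, which is where the latency benefit of aggregating inside the switches comes from.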

Switch Choices: Ethernet, InfiniBand, and RoCE Compared

The fundamental distinction between Ethernet switches and InfiniBand switches comes down to the protocols they were built around: Ethernet traditionally carries TCP/IP, whereas InfiniBand is designed for RDMA (Remote Direct Memory Access). At present, Ethernet switches are more widely used in traditional data centers, whereas InfiniBand switches are more common in storage networks and High-Performance Computing (HPC) environments. Both Ethernet and InfiniBand switches can reach a maximum bandwidth of 400G.

RoCE vs InfiniBand vs TCP/IP

Key Considerations:

High Scalability: All three network protocols scale well, with InfiniBand scaling the furthest. A single InfiniBand subnet can support tens of thousands of nodes, and InfiniBand routers can link subnets together, allowing virtually unlimited cluster sizes.
High Performance: TCP/IP introduces additional CPU processing overhead and latency, resulting in comparatively lower performance. RDMA over Converged Ethernet (RoCE) improves speed and efficiency in data centers by leveraging existing Ethernet infrastructure. InfiniBand goes further, delivering faster and more efficient communication over a switched fabric that transmits data serially, one bit at a time.
Ease of Management: While RoCE and InfiniBand offer lower latency and higher performance than TCP/IP, TCP/IP is generally easier to deploy and manage. Network administrators using TCP/IP for device and network connectivity require minimal centralized management.
Cost-Effectiveness: InfiniBand may pose challenges for budget-conscious enterprises as it relies on expensive IB switch ports to handle a significant load of applications, contributing to higher computational and maintenance costs. In contrast, RoCE and TCP/IP, utilizing Ethernet switches, present a more cost-effective solution.
Network Equipment: RoCE and TCP/IP leverage Ethernet switches for data transmission, while InfiniBand utilizes dedicated IB switches to carry applications. IB switches typically need to be interconnected with devices supporting the IB protocol, making them relatively closed and challenging to replace.

Modern data centers demand maximum bandwidth and extremely low latency from their underlying interconnect. In such scenarios, traditional TCP/IP network protocols fall short, introducing CPU processing overhead and high latency.

For enterprises deciding between RoCE and InfiniBand, careful consideration of their own requirements and cost factors is essential. Those prioritizing the highest-performance network connectivity may find InfiniBand preferable, while those seeking a balance of performance, ease of management, and cost-effectiveness may opt for RoCE in their data centers.

FS InfiniBand & RoCE Solutions

FS offers a wide range of products that support RoCE or InfiniBand. Whichever option you choose, FS provides lossless network solutions built on it. These solutions enable users to build high-performance computing capabilities and lossless networking environments. FS focuses on customizing solutions for specific application scenarios and user needs, providing high-bandwidth, low-latency, high-performance data transmission. This alleviates network bottlenecks, enhances overall network performance, and improves user experience.