
Unveiling the Foundations of GPU Computing-1

Posted on Dec 27, 2023

In large-scale model training, the foundation is typically a cluster built from single-machine 8-GPU units, such as 8x A100, A800, H100, or H800, and possibly newer configurations like {4, 8} L40S. The following illustrates the hardware topology of GPU computing within a typical host equipped with eight A100 GPUs:

hardware topology of GPU computing

This article will delve into fundamental concepts and terminology based on the depicted diagram.

PCIe Switch Chip

In high-performance GPU computing, key components such as CPUs, memory modules, NVMe storage, GPUs, and network cards are interconnected via the PCIe (Peripheral Component Interconnect Express) bus or dedicated PCIe switch chips.

PCIe has progressed through five generations, with Gen5 being the latest in mainstream deployment, and each generation roughly doubles the per-lane transfer rate. This continual evolution underscores PCIe's central role in high-performance computing, raising data transfer rates and enabling the seamless collaboration of interconnected devices in modern computing clusters.
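To make the generational gains concrete, the short Python sketch below estimates the theoretical bandwidth of an x16 link for each generation from publicly documented per-lane transfer rates and line-coding overheads; these figures are assumptions drawn from the PCIe specifications, not from this article. Note how the Gen4 and Gen5 results land close to the 64 GB/s and 128 GB/s figures cited later in the HBM section, which simply round away the coding overhead.

```python
# Minimal sketch: theoretical PCIe bandwidth per generation for an x16 slot.
# Per-lane raw transfer rate (GT/s) and the line coding used by each generation.
PCIE_GENERATIONS = {
    "Gen1": (2.5, 8 / 10),     # 8b/10b encoding
    "Gen2": (5.0, 8 / 10),     # 8b/10b encoding
    "Gen3": (8.0, 128 / 130),  # 128b/130b encoding
    "Gen4": (16.0, 128 / 130),
    "Gen5": (32.0, 128 / 130),
}

def pcie_bandwidth_gbps(gen: str, lanes: int = 16) -> float:
    """Return the usable one-directional bandwidth in GB/s for a PCIe link."""
    rate_gt, coding = PCIE_GENERATIONS[gen]
    # GT/s * coding efficiency gives usable Gb/s per lane; divide by 8 for GB/s.
    return rate_gt * coding * lanes / 8

if __name__ == "__main__":
    for gen in PCIE_GENERATIONS:
        one_way = pcie_bandwidth_gbps(gen)
        print(f"{gen} x16: ~{one_way:.1f} GB/s per direction, "
              f"~{2 * one_way:.0f} GB/s bidirectional")
```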

NVLink

Definition of NVLink

NVLink is a high-speed, wire-based serial multi-lane communications link developed by Nvidia. The definition of NVLink on Wikipedia is as follows:

NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia. Unlike PCI Express, a device can consist of multiple NVLinks, and devices use mesh networking to communicate instead of a central hub. The protocol was first announced in March 2014 and uses a proprietary high-speed signaling interconnect (NVHS).

The technology supports full-mesh interconnection between GPUs on the same node and has evolved through multiple generations, enhancing bidirectional bandwidth for improved performance in high-performance computing applications.

Evolution of NVLink: NVLink 1.0 to NVLink 4.0

As shown in the figure below, NVLink has evolved through four generations.

NVLink evolution: 1/2/3/4 generations

NVLink 1.0:

  • Connection Method: Utilizes a 4-link connection per GPU.

  • Total Bandwidth: Achieves a bidirectional total bandwidth of up to 160 GB/s.

  • Purpose: Primarily designed to accelerate data transfer between GPUs, enhancing collaborative computing performance.

NVLink 2.0:

  • Connection Method: Based on a 6-link connection per GPU.

  • Total Bandwidth: Improved to a bidirectional total bandwidth of 300 GB/s.

  • Performance Enhancement: Provides higher data transfer speeds, improving communication efficiency between GPUs.

NVLink 3.0:

  • Connection Method: Adopts a 12-link connection per GPU.

  • Total Bandwidth: Reaches a bidirectional total bandwidth of 600 GB/s.

  • New Features: Introduces new technologies and protocols, enhancing communication bandwidth and efficiency.

NVLink 4.0:

  • Connection Method: Uses an 18-link connection per GPU.

  • Total Bandwidth: Further increases to a bidirectional total bandwidth of 900 GB/s.

  • Performance Improvement: By increasing the number of channels, it provides greater bandwidth to meet the growing demands of high-performance computing and artificial intelligence applications.

The key differences among NVLink 1.0, 2.0, 3.0, and 4.0 lie in the number of links per GPU, the total bidirectional bandwidth, and the resulting performance.
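As a quick sanity check on the figures above, the sketch below derives each generation's bidirectional total as links multiplied by per-link bandwidth. The per-link values (40 GB/s for NVLink 1.0, 50 GB/s thereafter) are commonly published numbers and are an assumption on my part rather than something stated in this article.

```python
# Total bidirectional NVLink bandwidth per GPU = links per GPU x per-link bandwidth.
NVLINK_GENERATIONS = {
    # generation: (links per GPU, bidirectional GB/s per link)
    "NVLink 1.0 (P100)": (4, 40),
    "NVLink 2.0 (V100)": (6, 50),
    "NVLink 3.0 (A100)": (12, 50),
    "NVLink 4.0 (H100)": (18, 50),
}

for gen, (links, per_link) in NVLINK_GENERATIONS.items():
    print(f"{gen}: {links} links x {per_link} GB/s = {links * per_link} GB/s total")
```

Running this reproduces the 160 / 300 / 600 / 900 GB/s totals listed above.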

NVSwitch

NVSwitch is a switching chip developed by NVIDIA, designed specifically for high-performance computing and artificial intelligence applications. Its primary function is to provide high-speed, low-latency communication between multiple GPUs within the same host.

The figure below shows the typical hardware topology of an 8-card A100 host.

hardware topology of GPU computing

The picture below shows a side view of the Inspur NF5488A5 NVIDIA HGX A100 8-GPU assembly.

The eight A100 GPUs are visible in the picture, and beneath the six thick heat sinks on the right side lie the six NVSwitch chips.

Inspur NF5488A5 NVIDIA HGX A100

NVLink Switch

The NVLink Switch is a specialized switching device developed by NVIDIA, designed to facilitate high-performance communication and connectivity between GPU devices across different hosts. Unlike the NVSwitch, which is integrated into GPU modules within a single host, the NVLink Switch serves as a standalone switch specifically engineered for linking GPUs in a distributed computing environment.

The naming can be confusing when comparing the NVLink Switch and the NVSwitch: both names suggest a switch, but the NVSwitch is in fact a switching chip on the GPU baseboard inside a single host. It was not until 2022 that NVIDIA developed this chip into a standalone switch, officially naming it the NVLink Switch.
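If you want to see which paths the GPUs inside a given host actually use, one practical check is the topology matrix reported by nvidia-smi. The minimal wrapper below simply shells out to that CLI and assumes an NVIDIA driver is installed; on an NVSwitch-equipped 8x A100 host, each GPU pair is typically reported as NV12 (twelve NVLink links), while PCIe-only paths show up as PIX/PXB/PHB/SYS.

```python
# Minimal sketch: print the GPU interconnect matrix via the nvidia-smi CLI.
import subprocess

def print_gpu_topology() -> None:
    """Print the topology matrix (NV# = NVLink links, PIX/PXB/PHB/SYS = PCIe paths)."""
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)

if __name__ == "__main__":
    print_gpu_topology()
```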

HBM

Traditionally, memory, much like conventional DDR (Double Data Rate) modules, is plugged into the motherboard and reached by the processor (CPU or GPU) over the PCIe interface. This arrangement, however, makes PCIe the bottleneck: Gen4 offers a bandwidth of 64 GB/s, and Gen5 increases it to 128 GB/s.

In response to this limitation, several GPU manufacturers (notably, though not only, NVIDIA) have adopted a different approach: they stack multiple DDR chips and integrate them with the GPU in the same package, forming what is known as High Bandwidth Memory (HBM). With this design, seen for example in the H100, a GPU no longer needs to traverse the PCIe switch chip to reach its own memory, and data transfer speeds improve dramatically, by roughly an order of magnitude or more. The term "High Bandwidth Memory" aptly describes this memory architecture.
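To put a rough number on that claim, the back-of-the-envelope sketch below compares a commonly published H100 SXM HBM3 figure (about 3.35 TB/s, an assumption not taken from this article) with the per-direction bandwidth of a PCIe Gen5 x16 link.

```python
# Rough comparison of on-package HBM bandwidth vs. a PCIe Gen5 x16 link.
H100_HBM3_BANDWIDTH_GBPS = 3350          # ~3.35 TB/s, commonly published figure (assumption)
PCIE_GEN5_X16_PER_DIRECTION = 128 / 2    # article cites 128 GB/s bidirectional

ratio = H100_HBM3_BANDWIDTH_GBPS / PCIE_GEN5_X16_PER_DIRECTION
print(f"HBM3 vs PCIe Gen5 x16 (one direction): ~{ratio:.0f}x more bandwidth")
# -> roughly 50x, i.e. well over an order of magnitude
```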

Evolution of HBM: HBM1 to HBM3e

Evolution of HBM: HBM1 to HBM3e

Bandwidth Unit

In large-scale GPU training, performance hinges directly on data transfer speeds, which in turn depend on pathways such as PCIe bandwidth, memory bandwidth, NVLink bandwidth, HBM bandwidth, and network bandwidth. When gauging these data rates, several different bandwidth units come into play.

In networking contexts, data rates are commonly expressed in bits per second (b/s) and are usually specified per direction (TX/RX) to distinguish transmission from reception. Conversely, for other modules such as PCIe, memory, NVLink, and HBM, bandwidth is typically expressed in bytes per second (B/s) or transfers per second (T/s), and these figures often represent the bidirectional total, counting both upstream and downstream traffic.

Therefore, when comparing and evaluating bandwidth across diverse components, it is crucial to discern and convert between these units appropriately. This ensures a comprehensive understanding of the data transfer capabilities critical for optimizing large-scale GPU training performance.
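As an illustration only, the small helpers below put a unidirectional network figure and a bidirectional NVLink figure on the same footing; the 400 Gb/s NIC and the 900 GB/s NVLink 4.0 total are example values.

```python
# Unit-conversion helpers for comparing bandwidth figures quoted in different conventions.
def network_gbps_to_GBps(gbits_per_s: float) -> float:
    """Convert a unidirectional network rate in Gb/s to GB/s (1 byte = 8 bits)."""
    return gbits_per_s / 8

def bidirectional_to_per_direction(total_GBps: float) -> float:
    """Split a bidirectional total (e.g. an NVLink figure) into per-direction GB/s."""
    return total_GBps / 2

# Example: a 400 Gb/s NIC versus NVLink 4.0's 900 GB/s bidirectional total.
nic = network_gbps_to_GBps(400)               # 50 GB/s per direction
nvlink = bidirectional_to_per_direction(900)  # 450 GB/s per direction
print(f"400G NIC: {nic:.0f} GB/s per direction; NVLink 4.0: {nvlink:.0f} GB/s per direction")
```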
