
Unveiling the Foundations of GPU Computing-2

Posted on Dec 27, 2023

Training large models is typically done on clusters of machines, preferably with many GPUs per server. In the previous article Unveiling the Foundations of GPU Computing-1, we introduced common terms and concepts in GPU networks. In this article, we will continue discussing common GPU architectures.

8x NVIDIA A100 GPU Node/8x NVIDIA A800 GPU Node

Typical 8-card A100 host hardware topology

As shown in the topology diagram of the A100 GPU above, the topology of 8 A100 GPUs includes the following components:

  • 2 CPU Chips (each with its own attached memory, forming two NUMA nodes): The CPUs are responsible for general-purpose computing tasks.

  • 2 Storage Network Cards (for accessing distributed storage, in-band management, etc.): These network cards handle access to distributed storage as well as node-management traffic.

  • 4 PCIe Gen4 Switch Chips: PCIe Gen4 is the fourth generation of the PCIe interface, offering double the per-lane data rate of PCIe Gen3 (see the short calculation after this list).

  • 6 NVSwitch Chips: NVSwitch facilitates direct GPU-to-GPU communication at extremely high speeds, which is crucial for the efficient operation of large-scale deep learning models and parallel computing tasks.

  • 8 GPUs: The A100 GPUs are the primary processing units responsible for executing parallelized computations, particularly well-suited for AI and deep learning workloads.

  • 8 GPU-Dedicated Network Cards: Each GPU has a dedicated network card to optimize communication between GPUs and enhance the overall performance of parallel processing tasks.
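
As a quick sanity check on the PCIe figure mentioned in the list above, here is a minimal sketch that derives the approximate PCIe Gen4 x16 bandwidth from the publicly documented per-lane signalling rate (16 GT/s) and 128b/130b line encoding; these two inputs are not stated in this article itself. The 64GB/s / 32GB/s numbers quoted later are these values rounded.

```python
# Approximate PCIe Gen4 x16 bandwidth from first principles.
GEN4_GT_PER_S_PER_LANE = 16      # giga-transfers per second, per lane
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding
LANES = 16

# One transfer carries one bit, so GT/s * efficiency / 8 gives GB/s per lane.
unidir_per_lane = GEN4_GT_PER_S_PER_LANE * ENCODING_EFFICIENCY / 8
unidir_x16 = unidir_per_lane * LANES  # ~31.5 GB/s, usually quoted as ~32 GB/s
bidir_x16 = unidir_x16 * 2            # ~63 GB/s, usually quoted as ~64 GB/s

print(f"PCIe Gen4 x16: ~{unidir_x16:.1f} GB/s unidirectional, ~{bidir_x16:.1f} GB/s bidirectional")
```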

We will provide detailed explanations of these components in the following sections.

The following picture gives more detailed topology information for reference.

NVIDIA DGX A100 host (official 8-card machine) hardware topology

Storage Network Card

The positioning of the storage network card in the GPU architecture primarily involves its connection to the central processing unit (CPU) through the PCIe bus, and its responsibility for facilitating communication with distributed storage systems. Here are the main purposes of the storage network card in the GPU architecture:

  • Reading and Writing Distributed Storage Data: One of the primary functions of the storage network card is to efficiently read and write data from and to distributed storage systems. This is crucial in the training process of deep learning models where frequent access to training data from distributed storage and writing of training results to checkpoint files is essential.

  • Node Management Tasks: The storage network card is not limited to data transfer; it also encompasses node management tasks. This includes tasks such as remote login through SSH (Secure Shell), monitoring system performance, and collecting relevant data. These tasks contribute to monitoring and maintaining the operational status of GPU clusters.

Although BF3 DPU is officially recommended, in practice, alternative solutions can be chosen as long as they meet bandwidth requirements. For example, for cost-effectiveness, RoCE can be considered, while for maximizing performance, InfiniBand is the preferred choice.

NVSwitch Fabric

In a full-mesh network topology, each node is connected directly to every other node. In these hosts, the 8 GPUs are connected in a full-mesh configuration through six NVSwitch chips, also referred to as an NVSwitch fabric.

In this full-mesh structure, the total NVLink bandwidth available to each GPU is n * bw-per-nvlink-lane, where n is the number of NVLink lanes. The A100 GPU uses NVLink3, which provides 50GB/s per lane, so each GPU in the full mesh gets 12 * 50GB/s = 600GB/s. Note that this figure is bidirectional, covering both transmission and reception; the unidirectional bandwidth is 300GB/s.

In contrast, the A800 GPU reduces the number of NVLink lanes from 12 to 8. Consequently, each GPU's total bandwidth in the full-mesh structure is 8 * 50GB/s = 400GB/s, with a unidirectional bandwidth of 200GB/s.
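
The arithmetic behind these figures is simple enough to capture in a short sketch. The snippet below uses only the lane counts and per-lane rate stated above to compute the per-GPU NVLink bandwidth for the A100 and A800:

```python
# NVLink full-mesh bandwidth per GPU, using the figures from the text above:
# NVLink3 provides 50 GB/s of bidirectional bandwidth per lane;
# the A100 exposes 12 lanes, the A800 only 8.

NVLINK3_BW_PER_LANE_GBPS = 50  # GB/s, bidirectional, per lane

def nvlink_bandwidth(num_lanes: int, bw_per_lane: float = NVLINK3_BW_PER_LANE_GBPS):
    """Return (bidirectional, unidirectional) NVLink bandwidth in GB/s."""
    bidirectional = num_lanes * bw_per_lane
    return bidirectional, bidirectional / 2

for name, lanes in [("A100", 12), ("A800", 8)]:
    bidir, unidir = nvlink_bandwidth(lanes)
    print(f"{name}: {bidir:.0f} GB/s bidirectional, {unidir:.0f} GB/s unidirectional")

# Expected output:
# A100: 600 GB/s bidirectional, 300 GB/s unidirectional
# A800: 400 GB/s bidirectional, 200 GB/s unidirectional
```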

Below is the nvidia-smi topology for an 8*A800 machine.

nvidia-smi topology for an 8*A800 machine

    • GPU-to-GPU connections (top-left region): All are denoted NV8, representing 8 NVLink connections.

    • NIC-to-NIC connections:

      On the same CPU die: labeled NODE, meaning the path stays within one NUMA node but must cross PCIe switch chips.

      On different CPU dies: labeled SYS, meaning the path must traverse the NUMA boundary.

    • GPU-to-NIC connections:

      On the same CPU die and under the same PCIe switch chip: labeled NODE, meaning only PCIe switch chips need to be crossed.

      On the same CPU die but not under the same PCIe switch chip: labeled NODE, meaning both PCIe switch chips and the PCIe host bridge must be crossed.

      On different CPU dies: labeled SYS, meaning the path traverses the NUMA boundary as well as PCIe switch chips, covering the longest distance.
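
The matrix above comes from `nvidia-smi topo -m`, which you can run on your own machine. The minimal sketch below simply invokes that command and prints the raw matrix; it assumes `nvidia-smi` is on the PATH of a host with NVIDIA drivers installed, and the label meanings (NV#, NODE, SYS) are the ones described in the list above.

```python
import subprocess

# Query the GPU/NIC connectivity matrix. "nvidia-smi topo -m" prints one row per
# device, with labels such as NV8 (8 NVLink connections), NODE (same NUMA node,
# crossing PCIe switch chips) and SYS (crossing the NUMA boundary).
result = subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```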

IDC GPU Fabric

The following figure shows the GPU node interconnection architecture:

GPU node interconnection architecture

Compute Network

The compute network is primarily used to connect GPU nodes, supporting the collaboration of parallel computing tasks. This includes transferring data between multiple GPUs, sharing computational results, and coordinating the execution of large-scale parallel computing tasks.

Storage Network

The storage network is used to connect GPU nodes and storage systems, supporting the read and write operations of large-scale data. This includes loading data from storage systems into GPU memory and writing computation results back to the storage system.

To meet the high-performance demands of AI workloads, RDMA (Remote Direct Memory Access) is essential on both the compute and storage networks. The choice between the two RDMA technologies, RoCEv2 and InfiniBand, is a trade-off between cost-effectiveness and peak performance, with each option catering to specific use cases and budget considerations.

Public cloud service providers often use RoCEv2 for these networks, for example 8-GPU instances configured with 8 * 100Gbps ConnectX (CX) NICs. Compared to other options, RoCEv2 is relatively cost-effective, provided that it meets the performance requirements.

Regarding the difference between RoCE and InfiniBand, you can refer to the article InfiniBand vs. RoCE: How to choose a network for AI data center? from the FS community.

Bandwidth Bottlenecks in Data Link Connections

Single-machine 8-card A100 GPU host bandwidth bottleneck analysis

The diagram highlights the bandwidth specifications of key links:

  • Communication between GPUs on the same host: Utilizing NVLink, the bidirectional bandwidth achieves 600GB/s, while unidirectional bandwidth reaches 300GB/s.

  • Communication between GPUs and their respective Network Interface Cards (NICs) on the same host: Using PCIe Gen4 switch chips, the bidirectional bandwidth is 64GB/s, and the unidirectional bandwidth is 32GB/s.

  • Communication between GPUs across different hosts: Data transmission relies on NICs, and the bandwidth is dependent on the specific NIC used. Currently, commonly used NICs in China for A100/A800 models offer a mainstream bandwidth of 100Gbps (12.5GB/s) in one direction. Therefore, inter-host communication experiences a significant decrease in performance compared to intra-host communication.

200Gbps (25GB/s) is close to the unidirectional bandwidth of PCIe Gen4.

400Gbps (50GB/s) surpasses the unidirectional bandwidth of PCIe Gen4.

Hence, using a 400Gbps NIC in this type of configuration does not yield significant benefits, as it requires PCIe Gen5 performance to fully utilize the 400Gbps bandwidth.
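
To make the comparison concrete, the sketch below converts the NIC line rates discussed in this section from Gbps to GB/s and checks them against the unidirectional bandwidth of a PCIe Gen4 x16 link (32GB/s, per the figures above):

```python
# Compare NIC line rates against the unidirectional bandwidth of the
# PCIe Gen4 x16 link connecting each GPU/NIC pair (32 GB/s, per the text above).

PCIE_GEN4_X16_UNIDIR_GBYTES = 32  # GB/s

def gbps_to_gbytes(gbps: float) -> float:
    """Convert a line rate in Gbps to GB/s (8 bits per byte)."""
    return gbps / 8

for nic_gbps in (100, 200, 400):
    nic_gbytes = gbps_to_gbytes(nic_gbps)
    verdict = "fits within" if nic_gbytes <= PCIE_GEN4_X16_UNIDIR_GBYTES else "exceeds"
    print(f"{nic_gbps} Gbps NIC = {nic_gbytes:.1f} GB/s -> {verdict} PCIe Gen4 x16 unidirectional bandwidth")

# 100 Gbps = 12.5 GB/s -> fits within
# 200 Gbps = 25.0 GB/s -> fits within (close to the limit)
# 400 Gbps = 50.0 GB/s -> exceeds, so PCIe Gen5 is needed to use it fully
```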

8x NVIDIA H100/8x NVIDIA H800 Host

Hardware Topology Within H100 Host

The overall hardware topology of the H100 host is quite similar to the A100 8-card machine, with some differences, mainly observed in the number of NVSwitch chips and bandwidth upgrades.

Hardware Topology Within H100 Host

  • Within each H100 host, there are four NVSwitch chips, two fewer than in the A100 configuration.

  • The H100 chip is manufactured using a 4-nanometer process, with 18 Gen4 NVLink connections along the bottom row, providing a bidirectional total bandwidth of 900GB/s.

H100 GPU Chip

Single-chip H100 GPU internal logical layout

  • The chip is manufactured using a cutting-edge 4-nanometer process.

  • The bottom row of the chip comprises 18 Gen4 NVLink connections, providing a bidirectional total bandwidth of 18 lanes * 25GB/s/lane * 2 = 900GB/s (see the short calculation after this list).

  • The central blue area represents the L2 cache, a high-speed cache used for storing temporary data.

  • The left and right sides of the chip house HBM chips, which function as graphics memory, storing data required for graphics processing.
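
For comparison with the A100 figures earlier, here is the same style of calculation for the H100, using only the lane count and per-lane rate stated above:

```python
# H100 NVLink4 aggregate bandwidth per GPU, following the figures above.
H100_NVLINK_LANES = 18
NVLINK4_UNIDIR_PER_LANE_GBYTES = 25  # GB/s per lane, per direction

unidirectional = H100_NVLINK_LANES * NVLINK4_UNIDIR_PER_LANE_GBYTES  # 450 GB/s
bidirectional = unidirectional * 2                                   # 900 GB/s
print(f"H100 NVLink: {unidirectional} GB/s unidirectional, {bidirectional} GB/s bidirectional")
```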

Networking

In terms of networking, the H100 is similar to the A100, with the only difference being that the standard configuration includes the 400Gbps CX7 network card.
