
An Overview of NVIDIA NVLink

Posted on Jan 29, 2024

NVIDIA NVLink has emerged as a crucial technology in the fields of high-performance computing (HPC) and artificial intelligence (AI). This article delves into the intricacies of NVLink, covering NVSwitch chips, NVLink servers, and NVLink switches, and sheds light on its significance in the ever-evolving landscape of advanced computing.

What Is NVIDIA NVLink?

NVLink is an interconnect protocol that addresses the communication limitations between GPUs within a server. Unlike traditional PCIe switches, which have limited bandwidth, NVLink enables high-speed direct interconnection between GPUs inside the server. The fourth-generation NVLink offers significantly higher bandwidth, 112 Gbps per lane, roughly three times the per-lane rate of a PCIe Gen5 lane.
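
As a rough sanity check, the per-lane comparison and the commonly cited aggregate figure for the H100 can be reproduced with back-of-envelope arithmetic. The sketch below assumes 18 NVLink4 links per H100 and two lanes per link at roughly 100 Gbps of effective data rate each (112 Gbps raw PAM4 signaling); these per-link assumptions are illustrative and not taken from this article.

```python
# Back-of-envelope NVLink vs PCIe Gen5 comparison (illustrative assumptions).
NVLINK4_LANE_RAW_GBPS = 112   # raw PAM4 signaling rate per NVLink4 lane
PCIE_GEN5_LANE_GBPS = 32      # PCIe Gen5: 32 GT/s per lane

print(f"Per-lane ratio: {NVLINK4_LANE_RAW_GBPS / PCIE_GEN5_LANE_GBPS:.1f}x")  # ~3.5x

# Assumed H100 topology: 18 NVLink4 links, 2 lanes per link, ~100 Gbps effective per lane.
LINKS_PER_GPU = 18
LANES_PER_LINK = 2
EFFECTIVE_LANE_GBPS = 100

one_direction_gbs = LINKS_PER_GPU * LANES_PER_LINK * EFFECTIVE_LANE_GBPS / 8  # GB/s
print(f"Per GPU: {one_direction_gbs:.0f} GB/s per direction, "
      f"{2 * one_direction_gbs:.0f} GB/s bidirectional")  # ~450 / ~900 GB/s
```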


NVLink aims to offer a streamlined, high-speed, point-to-point network for direct GPU interconnection, minimizing overhead compared with traditional networks. By supporting CUDA acceleration across different layers, NVLink reduces communication-related network overhead. NVLink has evolved alongside GPU architecture, progressing from NVLink1 for the P100 to NVLink4 for the H100. The key differences among NVLink 1.0, NVLink 2.0, NVLink 3.0, and NVLink 4.0 lie in the connection method, bandwidth, and performance.

NVSwitch Chip

The NVSwitch chip is a physical chip, similar to a switch ASIC, that connects multiple GPUs through high-speed NVLink interfaces, improving communication and bandwidth within a server. The third generation of NVIDIA NVSwitch has been introduced and can interconnect each pair of GPUs at a staggering 900 GB/s.


The latest NVSwitch3 chip, with 64 NVLink4 ports, offers a total of 12.8 Tbps of unidirectional bandwidth, or 3.2 TB/s of bidirectional bandwidth. What sets the NVSwitch3 chip apart is its integration of the SHARP function, which aggregates and updates computation results across multiple GPUs during all-reduce operations, reducing network traffic and enhancing computational performance.
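
The all-reduce collective that SHARP accelerates is the same operation that multi-GPU training frameworks issue constantly through NCCL, which automatically routes traffic over NVLink and NVSwitch when they are present. Below is a minimal sketch of such an all-reduce using PyTorch's NCCL backend; it assumes a multi-GPU node launched via torchrun, and illustrates only the collective operation itself, not NVIDIA's internal SHARP implementation.

```python
# Minimal NCCL all-reduce sketch.
# Launch with: torchrun --nproc_per_node=<num_gpus> allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # NCCL transparently uses NVLink/NVSwitch paths between GPUs when available.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; all-reduce sums them across all GPUs.
    x = torch.full((1024,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    # After the call, every rank holds the same summed result.
    print(f"rank {dist.get_rank()}: sum of ranks = {x[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```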


NVLink Server

NVLink servers incorporate NVLink and NVSwitch technologies to connect GPUs and are typically found in NVIDIA's DGX series servers or OEM HGX servers with similar architectures. These servers deliver exceptional GPU interconnectivity, scalability, and HPC capabilities. In 2022, NVIDIA announced the fourth-generation NVIDIA® DGX™ system, the DGX H100, the world's first AI platform built with the new NVIDIA H100 Tensor Core GPUs.
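
On such a server, one way to confirm that the GPUs are actually wired together over NVLink is to query the link state through NVML. The following is a minimal sketch using the nvidia-ml-py (pynvml) bindings; the loop bound of 18 links is an assumption matching the H100's NVLink4 link count, since older GPUs expose fewer links.

```python
# Query NVLink link states on an NVLink server via NVML (requires nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):
            name = name.decode()
        active = 0
        for link in range(18):  # assumed upper bound; H100 exposes up to 18 NVLink4 links
            try:
                if pynvml.nvmlDeviceGetNvLinkState(handle, link) == pynvml.NVML_FEATURE_ENABLED:
                    active += 1
            except pynvml.NVMLError:
                break  # no more links on this GPU
        print(f"GPU {i} ({name}): {active} active NVLink links")
finally:
    pynvml.nvmlShutdown()
```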


Consequently, NVLink servers have become indispensable in crucial domains such as scientific computing, AI, big data processing, and data centers. By providing robust computing power and efficient data processing, NVLink servers not only meet the demanding requirements of these fields but also drive advancements and foster innovations within them.


NVLink Switch

In 2022, NVIDIA took the NVSwitch chip out of the server and built it into a standalone switch, the NVLink Switch, which connects GPU devices across hosts. It adopts a 1U design with 32 OSFP ports; each OSFP port comprises 8 lanes of 112G PAM4, and each switch contains 2 NVSwitch3 chips.
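
A quick calculation, using the port and lane figures quoted above, shows that the front-panel capacity lines up with the two NVSwitch3 chips inside the box. This is an illustrative cross-check, and it assumes roughly 100 Gbps of effective data rate per 112G PAM4 lane (i.e., ~800 Gbps per OSFP port per direction), which is not a figure stated in this article.

```python
# Cross-check NVLink Switch front-panel capacity against its two NVSwitch3 chips.
OSFP_PORTS = 32
EFFECTIVE_GBPS_PER_OSFP = 8 * 100  # 8 lanes x ~100 Gbps effective (112 Gbps raw PAM4)
NVSWITCH3_BIDIR_TBS = 3.2          # TB/s bidirectional per NVSwitch3 chip

front_panel_tbps = OSFP_PORTS * EFFECTIVE_GBPS_PER_OSFP / 1000  # per direction
# 2 chips, TB/s -> Tbps (x8), bidirectional -> per direction (/2)
chips_tbps_per_direction = 2 * NVSWITCH3_BIDIR_TBS * 8 / 2

print(f"Front panel: {front_panel_tbps:.1f} Tbps per direction")              # 25.6 Tbps
print(f"NVSwitch3 chips: {chips_tbps_per_direction:.1f} Tbps per direction")  # 25.6 Tbps
```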


NVLink Network

The NVSwitch physical switch connects multiple NVLink GPU servers into one large fabric, the NVLink network, solving the high-speed communication bandwidth and efficiency problems between GPUs. Each server has its own independent address space, providing data transmission, isolation, and security protection for GPUs in the NVLink network. When the system starts, the NVLink network automatically establishes connections through a software API, and addresses can be changed during operation.


The table below compares the NVLink network with a traditional Ethernet network, showing how NVLink creates a network that is independent of IP Ethernet and dedicated to GPU traffic.

Concept | Traditional Example | NVLink Network
Physical Layer | 400G electrical/optical media | Custom-FW OSFP
Data Link Layer | Ethernet | NVLink custom on-chip HW and FW
Network Layer | IP | New NVLink Network Addressing and Management Protocols
Transport Layer | TCP | NVLink custom on-chip HW and FW
Session Layer | Sockets | SHARP groups; CUDA export of network addresses of data structures
Presentation Layer | TLS/SSL | Library abstractions (e.g., NCCL, NVSHMEM)
Application Layer | HTTP/FTP | AI frameworks or user apps
NIC | PCIe NIC (card or chip) | Functions embedded in GPU and NVSwitch
RDMA Off-Load | NIC off-load engine | GPU-internal copy engine
Collectives Off-Load | NIC/switch off-load engine | NVSwitch-internal SHARP engines
Security Off-Load | NIC security features | GPU-internal encryption and "TLB" firewalls
Media Control | NIC cable adaptation | NVSwitch-internal OSFP-cable controllers

Table: Traditional networking concepts mapped to their counterparts in the NVLink Switch System

InfiniBand Network vs. NVLink Network

InfiniBand Network and NVLink Network are two different networking technologies used in high-performance computing and data center applications. They have the following differences:

Architecture and Design: InfiniBand Network is an open-standard networking technology that utilizes multi-channel, high-speed serial connections, supporting point-to-point and multicast communication. NVLink Network is a proprietary technology by NVIDIA, designed for high-speed direct connections between GPUs.

Application: InfiniBand Network is widely used in HPC clusters and large-scale data centers. NVLink Network is primarily used in large-scale GPU clusters, HPC, AI and other fields.

Bandwidth and Latency: InfiniBand offers high-bandwidth, low-latency communication for general cluster traffic, while the NVLink network delivers even higher bandwidth and lower latency between GPUs to support fast data exchange and collaborative computing. The figure below compares the bandwidth of the H100 using the NVLink network with that of the A100 using an InfiniBand network.

Figure: Bandwidth comparison between the H100 (NVLink network) and the A100 (InfiniBand network)

Also check: Getting to Know About InfiniBand.

Conclusion

NVIDIA NVLink stands as a groundbreaking technology that has revolutionized the fields of HPC and AI. Its ability to enhance GPU communication, improve performance, and enable seamless parallel processing has made it an indispensable component in numerous HPC and AI applications. As the landscape of advanced computing continues to evolve, NVLink's significance and impact are set to expand, driving innovation and pushing the boundaries of what is possible.
