How Much Do You Know About InfiniBand In-Network Computing?

Posted on Dec 30, 2023

InfiniBand plays a crucial role in high-performance computing (HPC) and artificial intelligence (AI) applications, providing the high-speed, low-latency network communication needed for large-scale data transfer and complex computational tasks. Its role is now expanding into In-Network Computing: by executing computational tasks inside the network itself, InfiniBand further reduces latency and improves overall system efficiency for HPC and AI workloads.

InfiniBand In-Network Computing: What Is It?

InfiniBand In-Network Computing (INC) is an extension of InfiniBand technology designed to enhance system performance by introducing computational capabilities into the network. It addresses the collective communication and point-to-point communication bottlenecks found in AI and HPC applications, and offers a new approach to scaling data centers.

The philosophy of In-Network Computing involves integrating computational capabilities into the switches and InfiniBand adapters of the InfiniBand network. This enables the execution of simple computing tasks concurrently with data transmission, eliminating the need to transfer data to terminal nodes such as servers for processing.

InfiniBand In-Network Computing in Data Center

In recent years, modern data centers have evolved toward a distributed parallel processing architecture, driven by cloud computing, big data, high-performance computing, and artificial intelligence. Resources such as CPU, memory, and storage are dispersed throughout the data center and interconnected via high-speed networking technologies like InfiniBand, Ethernet, Fibre Channel, and Omni-Path. These resources are designed to work together and divide the work of data processing among themselves, creating a balanced system architecture centered around business data.

InfiniBand In-Network Computing executes computational tasks within the network, transferring data processing responsibilities from the CPU to the network to reduce latency and enhance system performance. Through key technologies such as network protocol offloading, RDMA, GPUDirect, and SHARP, InfiniBand delivers in-network computation, lower communication latency, and more efficient data transfer. This provides effective support for high-performance computing and artificial intelligence applications.

Key Technologies of InfiniBand In-Network Computing

Network Protocol Offloading

Network protocol offloading involves relieving the CPU from the burden of processing network-related protocols by moving these tasks to dedicated hardware.

InfiniBand network adapters and InfiniBand switches handle the processing of the entire network communication protocol stack, including the physical layer, link layer, network layer, and transport layer. This offloading eliminates the need for additional software and CPU processing resources during data transmission, significantly improving communication performance.

RDMA

Remote Direct Memory Access (RDMA) technology was developed to address server-side data processing latency in network transmission. RDMA enables direct data transmission from the memory of one computer to the memory of another without involving either CPU in the data path, reducing data processing latency and improving network transmission efficiency.

RDMA allows data to be transferred directly from a user application's memory buffer to the network adapter, which then delivers it across the network into the memory of the remote system. This eliminates the repeated data copying and context switching that a conventional transfer requires, resulting in a significant reduction in CPU load.
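The difference can be illustrated with a toy Python model. This is only a sketch under simplifying assumptions, not real RDMA code: the function names, the `TransferStats` counters, and the assumption of one CPU copy and one kernel transition per side in the conventional path are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TransferStats:
    cpu_copies: int = 0          # copies performed by the CPU
    kernel_transitions: int = 0  # user/kernel boundary crossings


def socket_style_transfer(payload: bytes, stats: TransferStats) -> bytearray:
    """Toy model of a conventional transfer through kernel buffers."""
    # Sender: a system call enters the kernel, which copies the user buffer.
    stats.kernel_transitions += 1
    kernel_tx = bytearray(payload)
    stats.cpu_copies += 1
    # The adapter moves kernel_tx over the wire (hardware DMA, no CPU copy).
    kernel_rx = kernel_tx
    # Receiver: another system call copies the kernel buffer to the user buffer.
    stats.kernel_transitions += 1
    user_rx = bytearray(kernel_rx)
    stats.cpu_copies += 1
    return user_rx


def rdma_style_transfer(payload: bytes, remote_region: bytearray,
                        stats: TransferStats) -> bytearray:
    """Toy model of RDMA: the adapter writes straight into remote memory."""
    # This assignment stands in for adapter DMA into a pre-registered
    # remote buffer, so no CPU copy or kernel transition is counted.
    remote_region[:len(payload)] = payload
    return remote_region


message = b"gradient-block"
socket_stats, rdma_stats = TransferStats(), TransferStats()
received_socket = socket_style_transfer(message, socket_stats)
received_rdma = rdma_style_transfer(message, bytearray(len(message)), rdma_stats)
```

In this model both paths deliver the same bytes, but only the conventional path accumulates CPU copies and kernel transitions. Real systems differ in the exact counts; the point is which side does the work.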

GPUDirect RDMA

GPUDirect RDMA is a technology that leverages RDMA capability to facilitate direct communication between GPU nodes, enhancing communication efficiency in GPU clusters.

In scenarios where two GPU processes on different nodes within a cluster need to communicate, GPUDirect RDMA enables the RDMA network adapter to directly transfer data between the GPU memories of the two nodes. This eliminates the need for CPU involvement in data copying, reduces accesses to the PCIe bus, minimizes unnecessary data copying, and significantly enhances communication performance.
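The data path described above can be written down as a small Python sketch. The stage names and the helper `host_staging_copies` are illustrative assumptions rather than part of any GPUDirect API; the staged path assumes the common fallback in which data is bounced through host memory on each node.

```python
# Path without GPUDirect RDMA (assumed fallback): data is staged in
# host memory on both the sending and the receiving node.
STAGED_PATH = [
    "GPU memory (sender)",
    "host memory (sender)",
    "network adapter (sender)",
    "network adapter (receiver)",
    "host memory (receiver)",
    "GPU memory (receiver)",
]

# Path with GPUDirect RDMA: the adapter reads and writes GPU memory directly,
# so the host-memory stages drop out.
DIRECT_PATH = [s for s in STAGED_PATH if not s.startswith("host memory")]


def host_staging_copies(path: list) -> int:
    """Count how many times the data is staged in host memory."""
    return sum(1 for stage in path if stage.startswith("host memory"))


staged_copies = host_staging_copies(STAGED_PATH)
direct_copies = host_staging_copies(DIRECT_PATH)
```

Each host-memory stage removed is one fewer copy and one fewer pass over the PCIe bus for the same payload.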

SHARP

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) is a collective communication network offloading technology designed to optimize efficiency in high-performance computing and artificial intelligence applications that involve collective communications.

SHARP integrates a compute engine into the InfiniBand switch chip, supporting a range of fixed-point and floating-point calculations. In a cluster with multiple switches, SHARP establishes a logical tree over the physical topology, and the switches in that tree process collective communication operations in a parallel and distributed manner. Because data is aggregated as it moves up the tree, this approach significantly reduces the latency of collective communication, lowers network congestion, and improves the scalability of the cluster. The protocol supports operations such as Barrier, Reduce, and All-Reduce.
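The aggregation-tree idea can be sketched in a few lines of Python. This is a conceptual simulation only, not the SHARP protocol or its API: the `Host` and `Switch` classes, the two-level tree, and the choice of element-wise sum as the reduction are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Host:
    data: List[float]                                  # local contribution
    result: List[float] = field(default_factory=list)  # filled by all-reduce


@dataclass
class Switch:
    children: List[Union["Switch", Host]]


def reduce_up(node: Union[Switch, Host]) -> List[float]:
    """Each switch sums the vectors arriving from its children."""
    if isinstance(node, Host):
        return list(node.data)
    partials = [reduce_up(child) for child in node.children]
    return [sum(values) for values in zip(*partials)]


def broadcast_down(node: Union[Switch, Host], total: List[float]) -> None:
    """The root's result is sent back down the same tree to every host."""
    if isinstance(node, Host):
        node.result = list(total)
        return
    for child in node.children:
        broadcast_down(child, total)


def tree_allreduce(root: Switch) -> List[float]:
    total = reduce_up(root)
    broadcast_down(root, total)
    return total


# Four hosts under two leaf switches, joined by one root switch.
hosts = [Host([float(i), float(i) * 2]) for i in range(1, 5)]
root = Switch([Switch(hosts[:2]), Switch(hosts[2:])])
total = tree_allreduce(root)  # [10.0, 20.0]
```

Each leaf switch forwards a single partial sum upward instead of every host's vector, which is why aggregating inside the network reduces the traffic reaching the upper levels of the tree.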

InfiniBand In-Network Computing Applications: HPC & AI

InfiniBand In-Network Computing finds prominent applications in HPC and AI due to its ability to enhance overall system performance and efficiency.

InfiniBand In-Network Computing in HPC

In HPC, where compute-intensive tasks predominate, InfiniBand helps mitigate contention for CPU and GPU resources by moving communication work off the processors. HPC tasks are also communication-intensive, involving both point-to-point and collective communications, so efficient communication protocols are essential. For this reason, protocol offloading, RDMA, GPUDirect, and SHARP are widely employed to optimize computing performance.

InfiniBand In-Network Computing in AI

AI relies heavily on InfiniBand In-Network Computing to speed up training and to obtain highly accurate models. Today, GPUs or dedicated AI chips serve as the computational core of AI training platforms, and these platforms use InfiniBand to accelerate training, which is a compute-intensive process. Offloading application communication protocols is crucial for reducing latency during AI training, and GPUDirect RDMA is used to increase communication bandwidth between GPU clusters and reduce communication delays.

Conclusion

InfiniBand In-Network Computing provides efficient and reliable computational support for HPC and AI, and it is likely to remain an important driver in the evolution of network computing technology. FS can provide AI solution-related InfiniBand products, such as IB switches, IB network cards, and IB module cables, which are available for purchase on FS.com.
