Exploring the Ideal Switches for Artificial Intelligence
As artificial intelligence (AI) develops rapidly and sees ever wider application, the high demands AI places on network performance have become a crucial challenge in today's technological advancement. Choosing switches suited to AI applications is therefore essential. This article discusses the challenges AI poses to network performance and introduces switch solutions suitable for artificial intelligence.
Challenges of AI in Network Performance
AI applications require exceptional network performance. The main challenges are outlined below.
Throughput and Latency
Firstly, high throughput and low latency are fundamental requirements for AI tasks: training and inference move very large volumes of data, so fast, low-latency transmission directly affects job completion time. Secondly, AI applications demand reliability and stability in data delivery, making these qualities crucial considerations in network design.
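To see why link speed matters so much, a simple serialization model (time = latency + payload / bandwidth) is enough. The payload size and latency figures below are illustrative assumptions, not measurements of any particular system:

```python
def transfer_time_s(payload_bytes: float, bandwidth_gbps: float, latency_us: float) -> float:
    """Simple serialization model: time = link latency + payload / bandwidth."""
    return latency_us * 1e-6 + payload_bytes * 8 / (bandwidth_gbps * 1e9)

# Example: synchronizing 1 GB of gradients every training step (assumed figure).
payload = 1e9  # bytes
t_100g = transfer_time_s(payload, 100, 10)  # 100G link, 10 us latency
t_400g = transfer_time_s(payload, 400, 10)  # 400G link, 10 us latency
print(f"100G link: {t_100g * 1e3:.1f} ms, 400G link: {t_400g * 1e3:.1f} ms")
```

Because this transfer happens on every step, even a few tens of milliseconds saved per synchronization compounds into a large reduction in total training time.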
Limitations of Traditional Network Protocols
Traditional TCP/IP protocols have certain limitations when faced with the demands of AI applications. Firstly, TCP/IP introduces significant delays in data transmission due to multiple context switches and CPU involvement in packet encapsulation. Secondly, TCP/IP networking places a heavy load on host CPUs, since CPU utilization rises roughly in proportion to network bandwidth as the CPU copies and processes every packet. Additionally, the traditional three-layer network architecture suffers from bandwidth wastage and limitations in large-scale data transmission and processing, necessitating alternative solutions better suited to AI applications.
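The CPU cost of the TCP/IP path can be approximated with a back-of-the-envelope model: each message is copied between user and kernel buffers and pays a per-message syscall/context-switch cost. The copy count, memory bandwidth, and syscall cost below are illustrative assumptions, not measurements:

```python
def tcp_cpu_time_us(payload_bytes: float, msg_rate: float,
                    copies: int = 2, mem_bw_gbs: float = 20.0,
                    syscall_us: float = 2.0) -> float:
    """Approximate CPU time (in us) spent per second on TCP I/O:
    buffer copies (user <-> kernel) plus per-message syscall overhead.
    All constants are illustrative assumptions."""
    copy_us = payload_bytes * copies / (mem_bw_gbs * 1e9) * 1e6
    return msg_rate * (copy_us + syscall_us)

# At 100k messages/s of 64 KB each, the host CPU spends a large fraction
# of every second just copying buffers and handling syscalls.
busy_us = tcp_cpu_time_us(64 * 1024, 100_000)
print(f"CPU busy ~{busy_us / 1e6:.0%} of each second")
```

However rough, the model captures the key point made above: CPU load scales with bandwidth, so the faster the link, the more host cycles TCP/IP consumes.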
Data Center Architecture
The traditional three-layer network architecture (access layer, aggregation layer, and core layer) has certain drawbacks and limitations when it comes to AI applications, and the growth of cloud computing has made these shortcomings more prominent: wasted bandwidth, large fault domains, and high latency.
To optimize network performance, the leaf-spine architecture has emerged as a superior choice. In a leaf-spine fabric, every leaf switch connects to every spine switch, so traffic reaches the target device over a short, predictable path, reducing bandwidth wastage while providing lower latency and better scalability. Optimizing the network architecture in this way can meet the high demands AI applications place on network performance and improve their efficiency.
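A key design parameter in a leaf-spine fabric is the oversubscription ratio: total downlink (host-facing) capacity versus total uplink (spine-facing) capacity on each leaf. A quick sketch of the arithmetic, using an assumed port profile of 48x 25G host ports and 8x 100G uplinks:

```python
def oversubscription_ratio(downlinks: int, downlink_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Ratio of leaf downlink capacity to uplink capacity (1:1 is non-blocking)."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# Example leaf: 48x 25G host ports, 8x 100G spine uplinks (assumed profile).
ratio = oversubscription_ratio(48, 25, 8, 100)
print(f"{ratio:.1f}:1 oversubscription")
```

AI training traffic is bursty and all-to-all, so fabrics for AI clusters are typically designed close to 1:1 (non-blocking), whereas general-purpose data centers often tolerate higher ratios.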
Application of RDMA Technology in AI
Remote Direct Memory Access (RDMA) technology has emerged to meet the network performance demands of AI applications. RDMA enables direct data transfer between host memory and network devices, bypassing the CPU and thereby reducing latency and alleviating CPU load. Three RDMA implementations have become prominent choices: InfiniBand, RoCE, and iWARP. InfiniBand is purpose-built for RDMA and ensures reliable transmission at the hardware level; it is technically advanced, but the cost is high. RoCE and iWARP bring RDMA to Ethernet: RoCE runs directly over Ethernet (ideally a lossless fabric), while iWARP layers RDMA over TCP. These technologies support high throughput, low latency, and reliable transmission, providing more efficient network performance for AI applications. For more information about RDMA, please refer to A Quick Look at the Differences: RDMA vs TCP/IP.
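The trade-off between the three transports can be distilled into a rough rule of thumb. This helper is only a sketch of the reasoning in the text, not a substitute for a real network design exercise, and real deployments weigh many more factors (cost, scale, existing fabric, operational expertise):

```python
def pick_rdma_transport(ethernet_fabric: bool, lossless_supported: bool) -> str:
    """Rule-of-thumb RDMA transport choice (illustrative only).
    - InfiniBand: purpose-built RDMA fabric, hardware-level reliability, highest cost.
    - RoCE: RDMA over Ethernet; works best on a lossless fabric (PFC/ECN).
    - iWARP: RDMA over TCP, so it tolerates ordinary lossy Ethernet.
    """
    if not ethernet_fabric:
        return "InfiniBand"
    return "RoCE" if lossless_supported else "iWARP"

print(pick_rdma_transport(ethernet_fabric=True, lossless_supported=True))
```

For a dedicated AI cluster built from scratch, InfiniBand is the common premium choice; for Ethernet-based data centers, RoCE is typical where lossless Ethernet can be configured.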
Ideal Switches for Artificial Intelligence
Selecting switches suitable for AI requires weighing multiple factors. Firstly, the switches should support RDMA technology to meet the high-throughput, low-latency requirements. Secondly, switches should offer the scalability and flexibility to accommodate growing AI workloads. Various options are available on the market, including purpose-built AI switch solutions from manufacturers such as NVIDIA.
NVIDIA offers both Ethernet and InfiniBand switches: the Spectrum platform for Ethernet and the Quantum platform for InfiniBand. The two platforms target different application scenarios. Spectrum-X is designed for generative AI and addresses the limitations of traditional Ethernet switches. In NVIDIA's vision, AI application scenarios can be roughly divided into the AI cloud and the AI factory. In the AI cloud, traditional Ethernet switches and Spectrum-X Ethernet can be used, while the AI factory calls for the NVLink + InfiniBand solution. For more information about NVLink, please refer to An Overview of NVIDIA NVLink.
The following table lists the original NVIDIA switches available from FS.
| Type | Product | Features |
|---|---|---|
| Ethernet | MSN2700-CS2RC | 32x 100Gb QSFP28, Spine Switch, MLAG, PTP |
| Ethernet | MSN4410-WS2FC | 24x 100Gb QSFP28-DD, 8x 400Gb QSFP-DD, Spine Switch, RoCE, PTP |
| Ethernet | MSN4410-WS2RC | 24x 100Gb QSFP28-DD, 8x 400Gb QSFP-DD, Spine Switch, RoCE, PTP |
| Ethernet | MSN4700-WS2FC | 32x 400Gb QSFP-DD, Spine Switch, RoCE, PTP |
| Ethernet | MSN4700-WS2RC | 32x 400Gb QSFP-DD, Spine Switch, MLAG, PTP |
| Ethernet | MSN2410-CB2FC | 48x 25Gb SFP28, 8x 100Gb QSFP28, Leaf Switch, MLAG, PTP |
| Ethernet | MSN2700-CS2FC | 32x 100Gb QSFP28, Spine Switch, MLAG, PTP |
| InfiniBand | MQM9790-NS2F | 64x NDR 400G, 32 OSFP Ports, HPC/AI, Quantum-2, Unmanaged |
| InfiniBand | MQM8790-HS2F | 40x HDR QSFP56, HPC/AI, Quantum, Unmanaged |
| InfiniBand | MQM8700-HS2F | 40x HDR QSFP56, HPC/AI, Quantum, Managed |
| InfiniBand | MQM9700-NS2F | 64x NDR 400G, 32 OSFP Ports, HPC/AI, Quantum-2, Managed |
Conclusion
AI applications pose high demands on network performance, and switches, as core components of the network, are crucial for meeting these demands. This article has discussed the challenges AI presents to network performance and introduced switch solutions suitable for artificial intelligence. By adopting RDMA technology and optimizing network architecture, high throughput, low latency, and reliable transmission can be achieved, meeting the requirements of AI applications. Choosing switches that are suitable for artificial intelligence is a critical step in enhancing AI network performance and efficiency. In the future, as AI technology continues to evolve, innovative network devices and architectures will drive further advancements in AI applications.