Networking and Hardware Trends under the AIGC Butterfly Effect
In 2023, artificial intelligence-generated content (AIGC) technology, exemplified by ChatGPT, is thriving and making significant strides across areas such as text generation, code development, and poetry creation, reshaping the industry landscape.
A Deloitte report predicts that by 2027, the artificial intelligence infrastructure services market, driven by AIGC, will grow to $13 billion to $16 billion.
AIGC leverages natural language processing (NLP) and machine learning (ML) technologies to accomplish content generation across text, images, audio, and video. This achievement is made possible by robust computational power, storage, and high-speed communication support.
The Key to Empowering Computing Power Lies in Networking - Tremendous Potential in InfiniBand
The remarkable advancement of artificial intelligence (AI) is inseparable from the three pillars of data, algorithms and computing power. Especially for the large-scale and complex AIGC model, a strong computing power infrastructure is crucial.
Taking ChatGPT as an example, its underlying model was trained on a cluster of roughly 10,000 NVIDIA V100 GPUs linked by a high-bandwidth network, with a single training run estimated to consume approximately 3,640 PF-days of computational power.
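As a rough sanity check on that figure, one petaFLOP-day is 10^15 floating-point operations per second sustained for 86,400 seconds. The sketch below divides a widely cited estimate of GPT-3's total training compute (an assumption, not an official OpenAI number) by that unit:

```python
# Sanity check of the ~3,640 PF-day figure quoted above.
# total_flops is an assumed, widely cited estimate for GPT-3-scale
# training compute, not an official number.
PFLOP_DAY = 1e15 * 86400          # FLOPs in one petaFLOP-day: 8.64e19
total_flops = 3.14e23             # assumed GPT-3 training-compute estimate
pf_days = total_flops / PFLOP_DAY
print(round(pf_days))             # lands in the ~3,600 PF-day range
```

The result falls within a few percent of the quoted value, which is as close as a back-of-envelope estimate can be expected to get.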
However, the most significant factor impacting GPU utilization is the network, particularly in computing clusters composed of tens of thousands of GPUs, where substantial bandwidth for efficient data interchange is indispensable. The absence of robust networking support can lead to GPUs waiting for data, decreased utilization, prolonged training times, increased costs, and diminished user experience. Therefore, the importance of networking cannot be overstated.
In essence, without efficient networking, the application of large-scale models is severely constrained.
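The effect of network stalls on utilization can be captured with a toy model: a GPU is only doing useful work during compute time, so any communication that cannot be overlapped with computation directly lowers utilization. All numbers below are illustrative assumptions, not measurements from any real cluster:

```python
# Toy model of GPU utilization under network stalls.
# All timing values are illustrative assumptions.

def utilization(compute_s: float, comm_s: float, overlap: float = 0.0) -> float:
    """Fraction of wall-clock time the GPU spends on useful compute.

    overlap: fraction of communication time hidden behind computation
    (0.0 = fully exposed, 1.0 = fully hidden).
    """
    exposed_comm = comm_s * (1.0 - overlap)
    return compute_s / (compute_s + exposed_comm)

# A hypothetical training step: 100 ms of compute, 50 ms of gradient exchange.
print(utilization(0.100, 0.050))               # slow network, no overlap: ~0.67
print(utilization(0.100, 0.050, overlap=0.8))  # fast network, good overlap: ~0.91
```

The model makes the article's point concrete: the same GPUs deliver very different effective throughput depending on how much communication the network can hide.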
To support the operation of AIGC, a high-performance network infrastructure is essential. The industry has put forward three main network solutions to meet the demands of AI cluster computing: InfiniBand, RDMA, and fabric switches.
Among these, RDMA (Remote Direct Memory Access) is a kernel-bypass communication mechanism that significantly enhances data throughput while reducing latency. In this solution it is implemented over Ethernet using the RoCE v2 protocol.
Fabric switch solutions are suitable for small-scale AI computing cluster deployments. They employ specific chips and technologies to meet the demands of high-performance networking. However, they face challenges such as limited scalability, high device power consumption, and a large fault domain.
InfiniBand networks, with extremely high bandwidth, lossless credit-based flow control that avoids congestion, and low latency, appear to be the optimal choice for building high-performance networks for now. Although the cost is relatively high, it is the approach adopted for models such as ChatGPT and GPT-4. NVIDIA, pairing InfiniBand with its GPUs, has established dominance in AI infrastructure, capturing approximately 80% of the market share. Taking the NVIDIA DGX SuperPOD with NVIDIA DGX H100 systems as an example, it scales from 31 to 127 DGX H100 systems, reaching up to 1,016 NVIDIA Hopper GPUs at full scale. This configuration delivers outstanding AI computing performance.
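The quoted GPU total follows directly from the node architecture: each DGX H100 system houses 8 H100 GPUs, so a brief arithmetic check confirms the 127-system configuration accounts for the 1,016-GPU figure:

```python
# Scale check for the DGX SuperPOD figures quoted above:
# each DGX H100 node contains 8 H100 GPUs.
GPUS_PER_NODE = 8

for systems in (31, 127):
    print(f"{systems} systems -> {systems * GPUS_PER_NODE} GPUs")
# The 127-system maximum yields 127 * 8 = 1,016 GPUs,
# matching the total given for the full-scale SuperPOD.
```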
Trends in AIGC Networking and Computing Power Driving Core Products
Servers - The Heart of AI Computing Power
The rapid development of AIGC is propelling the demand for high-performance AI servers. The global AI server market is experiencing substantial growth, with IDC data projecting a market size of $31.79 billion by 2025, with a compound annual growth rate of 19%.
AI servers differ from traditional servers in that they are typically equipped with high-performance GPU or TPU accelerators to speed up deep learning and machine learning workloads. This has led to increased demands for larger memory, faster storage, higher core-count processors, and additional PCIe devices.
Escalating High-Performance Demands: AI workloads typically require substantial computing power, driving the need for high-performance servers. This includes servers equipped with high-performance GPUs, TPUs, and fast storage.
Specific Hardware Requirements: AI servers necessitate specific hardware configurations, such as GPU accelerators, PCIe slots, and high-speed network interfaces to meet the operational needs of AI algorithms.
Innovative Server Designs: To cater to the demands of large-scale AI models, new server designs like NVIDIA's DGX GH200 have emerged, offering greater throughput and scalability.
Switches - Urgent Demand for 400G/800G
Switches, serving as the central hub of the data center computing network, are gradually evolving to meet the ever-increasing demands for high-speed data transmission. They also play a pivotal role in providing the necessary support and solutions for the rapid growth of AI and data centers.
High-Speed Network Demands: AI workloads generate a significant need for data transmission, driving the demand for high-speed network switches, with a transition from 10G/40G to 400G/800G.
Bandwidth Loss Reduction: AI servers and data centers require higher-performance switches to mitigate bandwidth losses during data transmission, leading to more intricate switch designs and PCB requirements.
Data Center Expansion: The growth of AI is propelling the expansion of data centers, increasing the demand for switches. According to Dell'Oro's report, by 2027, switches with speeds of 400Gbps and higher will capture nearly 70% of the market share.
Optical Modules - Robust Growth and Emerging Technology Trends
Optical modules, used with switches or network cards for data transmission, are an indispensable component in the AI wave, especially at the 400Gbps and 800Gbps levels. With the rapid expansion of AI and data centers, the optical module market is witnessing a strong growth trend.
Additionally, as network speeds continue to increase, traditional pluggable optical devices may reach their physical limits. New optical module solutions, such as Co-Packaged Optics (CPO), are emerging to meet the demand for higher speeds and denser packaging in high-speed data transmission.
Trends in Other Products Driven by AIGC
In addition to the previously mentioned servers, switches, and optical modules, the entire network infrastructure requires a broader range of products, and their growth is also influenced by the expansion of AI-driven solutions, including:
Power Management: Components such as power switches, power filters, and voltage regulators that ensure stable and reliable power distribution throughout the network.
Control and Management: Components such as management chips, clock chips, and BIOS chips within servers, which are essential for overseeing and coordinating network operations.
Thermal Management: Products like CPU heatsinks and fans are crucial for effective and dependable cooling in AI-driven systems, particularly in data center environments.
The Ongoing Ripple Effect of AIGC
AIGC has sparked a technological revolution. From a hardware perspective, there is a continuous growth in demand for high-performance servers, network switches, and optical modules, especially in the high-performance computing and data center domains. Furthermore, AIGC's rapid development has given rise to new hardware design trends, such as larger-scale GPU clusters and a pressing need for high-speed networks.
On the software and services side, the scope of AIGC technology's applications is continually expanding, encompassing fields such as text composition, code development, and poetry creation. This has opened up new opportunities in software development and cloud computing services. The butterfly effect of AIGC is spreading continuously and is expected to persist.