In the AI Era: Fueling Growth in the Optical Transceiver Market
The advent of Artificial Intelligence (AI) has been a catalyst for transformative change across various industries. One domain undergoing a paradigm shift is the optical transceiver market. This article examines how AI, and in particular the wave sparked by models like ChatGPT, is reshaping data center networks and fueling demand for high-performance optical transceivers, with a focus on the anticipated surge in 800G optical transceivers in 2024.
The AI Wave Sparked by ChatGPT
The development and deployment of AI models like ChatGPT have ushered in a new era of possibilities. These models, powered by advanced deep learning techniques, can comprehend and generate human-like text. ChatGPT, as a representative of this AI wave, has demonstrated how natural language processing can make human-machine interaction more efficient. Training and serving such models, however, requires large clusters of GPU servers that exchange enormous volumes of data. The AI wave has therefore become a driving force behind the need for faster, more reliable, and higher-capacity optical transceivers to interconnect them.
The operation of ChatGPT requires robust cloud computing resources. The first GPT model, released by OpenAI in 2018, had 117 million parameters and was trained on approximately 5GB of pre-training data. In contrast, GPT-3 has 175 billion parameters and was trained on 45TB of data. The training phase of GPT-3 alone consumed approximately 3,640 PF-days (petaflop/s-days) of computing power, with the cost of a single training run estimated at around $12 million. Serving the model to users consumes even more. It is estimated that meeting the query load of ChatGPT's current users requires an initial investment of around $3-4 billion in computing infrastructure, primarily GPU servers.
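To put the 3,640 PF-days figure in perspective, the short sketch below converts it into total floating-point operations and compares it with the widely used "6 × parameters × training tokens" rule of thumb for transformer training compute. The rule of thumb and the figure of roughly 300 billion training tokens (taken from the GPT-3 paper) are assumptions that do not appear in this article; the calculation is a sanity check, not a reproduction of OpenAI's accounting.

```python
# Back-of-envelope check of the "3,640 PF-days" GPT-3 training-compute figure.
# Assumptions (not from this article): the 6 * N * D rule of thumb for
# transformer training FLOPs, and ~300 billion training tokens as reported
# in the GPT-3 paper.

SECONDS_PER_DAY = 86_400
PETAFLOP = 1e15  # floating-point operations per second


def pf_days_to_flops(pf_days: float) -> float:
    """Convert petaflop/s-days into a total count of floating-point operations."""
    return pf_days * PETAFLOP * SECONDS_PER_DAY


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6 * params * tokens


params = 175e9   # GPT-3 parameter count (from the text)
tokens = 300e9   # assumed number of training tokens (GPT-3 paper, not this article)

estimate = training_flops(params, tokens)
quoted = pf_days_to_flops(3640)

print(f"6*N*D estimate: {estimate:.2e} FLOPs "
      f"(~{estimate / pf_days_to_flops(1):.0f} PF-days)")
print(f"quoted figure : {quoted:.2e} FLOPs")
```

The two numbers agree to within about 0.2%, which suggests the quoted figure counts training FLOPs only, before any serving load is added.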
How AI Reshapes Data Center Networks
The integration of AI into data centers has redefined the landscape of data transmission. Traditional data centers, designed for conventional computing workloads, are undergoing a metamorphosis to meet the demands of AI-driven applications. The key differentiator lies in the way data is processed and transmitted.
Traditional Data Center vs. AI Data Center
In a traditional data center, data flows through a hierarchical network architecture, with each layer introducing latency and potential bottlenecks. Initially, data centers adopted the traditional three-tier model, comprising the access layer, aggregation layer, and core layer. The access layer connected compute nodes to the network through top-of-rack (cabinet) switches, the aggregation layer interconnected the access-layer switches, and the core layer linked the aggregation layers to each other and to external networks.
However, as east-west (server-to-server) traffic within data centers grew rapidly, the core and aggregation layers of the three-tier architecture had to carry ever more traffic at ever higher performance, which drove equipment costs up significantly. This led to the leaf-spine architecture, a flatter design optimized for east-west traffic. In it, leaf switches connect directly to compute nodes, while spine switches take the place of core switches. Every leaf connects to every spine, and traffic between leaves is spread across the resulting equal-cost paths using Equal-Cost Multipath (ECMP) routing.
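How ECMP spreads traffic can be illustrated with a minimal sketch: a leaf switch hashes each flow's 5-tuple and uses the result to pick one of its equal-cost spine uplinks. The four spine names, the CRC32 hash, and the sample addresses are all hypothetical; real switch ASICs use vendor-specific hash functions and configurable field selections.

```python
# Minimal sketch of hash-based ECMP path selection on a leaf switch.
# Assumptions: four equal-cost uplinks (one per spine) and a CRC32 hash
# over the flow 5-tuple; real ASICs use vendor-specific hashing.
import zlib
from collections import Counter

SPINES = ["spine1", "spine2", "spine3", "spine4"]  # hypothetical uplinks


def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, paths=SPINES):
    """Pick one equal-cost uplink by hashing the flow's 5-tuple.

    Every packet of the same flow hashes to the same uplink, which keeps
    packets in order, while different flows spread across all spines.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return paths[zlib.crc32(key) % len(paths)]


if __name__ == "__main__":
    # Hash 1,000 synthetic TCP flows (varying source port) and count per spine.
    load = Counter(
        ecmp_next_hop("10.0.1.5", "10.0.2.9", 6, port, 443)
        for port in range(49152, 50152)
    )
    print(load)
```

Hashing per flow rather than per packet is the usual design choice: it avoids reordering within a flow, at the cost of occasional imbalance when a few very large flows land on the same spine.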
The leaf-spine architecture offers several advantages, including high bandwidth utilization, excellent scalability, predictable latency (traffic between any two leaf switches always crosses exactly one spine), and enhanced security. These features make it widely applicable across a variety of data center scenarios.
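A common way to quantify a leaf switch's bandwidth is its oversubscription ratio: total downlink capacity toward servers divided by total uplink capacity toward the spines. The port counts and speeds below are hypothetical examples rather than specific products. A ratio of 1:1 means the leaf can forward all server traffic to the spines at line rate, which is the non-blocking property that AI fabrics aim for.

```python
# Minimal sketch: oversubscription ratio of a leaf switch.
# The port counts and speeds are hypothetical examples, not specific products.
# A ratio of 1.0 (1:1) means the leaf is non-blocking toward the spines.


def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    """Return total downlink capacity divided by total uplink capacity."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)


# Hypothetical general-purpose leaf: 48 x 25G server ports, 6 x 100G uplinks.
cloud_leaf = oversubscription(48, 25, 6, 100)   # 1200 / 600 Gb/s
# Hypothetical AI-fabric leaf: half the ports down, half up, all at 200G.
ai_leaf = oversubscription(20, 200, 20, 200)    # 4000 / 4000 Gb/s

print(f"general-purpose leaf {cloud_leaf:.1f}:1, AI leaf {ai_leaf:.1f}:1")
```

A 2:1 leaf is often acceptable for conventional workloads where not every server transmits at full rate simultaneously. Synchronized GPU traffic tends to stress all links at once, which motivates the 1:1 design.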
AI data centers, on the other hand, rely on parallel processing, distributed computing, and high-speed interconnects to keep data flowing with minimal latency. Because of the very large volume of internal traffic, a non-blocking fat-tree network architecture has become essential. NVIDIA's AI data centers, for example, use a fat-tree architecture to achieve non-blocking operation.
The fundamental idea is to build a large non-blocking network from many identical switches with a moderate port count, rather than from a few high-end core switches. The design ensures that, for any communication pattern, there are enough paths for every node to communicate at the full bandwidth of its network interface card (NIC). The fat-tree architecture is widely used in data centers with demanding network requirements, particularly high-performance computing (HPC) centers and AI data centers.
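How far identical switches can scale is captured by the textbook k-ary fat tree (the construction by Al-Fares et al.), sketched below. Each edge and aggregation switch splits its k ports evenly between downlinks and uplinks, which is what provides full NIC bandwidth for any traffic pattern. This is the classic three-layer topology, used here as an assumption for illustration; NVIDIA's production fabrics follow the same principle but differ in their exact layout.

```python
# Sizing of a classic three-layer k-ary fat tree built from identical
# k-port switches (the textbook Al-Fares et al. construction).
# Assumption: textbook topology only; production fabrics differ in detail.


def fat_tree(k: int) -> dict:
    """Return host and switch counts for a k-ary fat tree (k must be even)."""
    if k % 2:
        raise ValueError("k must be even")
    edge = aggregation = k * k // 2   # k pods, k/2 switches per layer per pod
    core = (k // 2) ** 2
    return {
        "hosts": k ** 3 // 4,         # k/2 hosts on each of the k^2/2 edge switches
        "edge": edge,
        "aggregation": aggregation,
        "core": core,
        "switches": edge + aggregation + core,  # equals 5k^2/4
    }


print(fat_tree(4))    # smallest textbook example
print(fat_tree(40))   # 40-port switches, the port count of the QM8790
```

With 40-port switches, the textbook construction reaches 16,000 hosts using 2,000 identical switches. This illustrates why the approach scales with commodity hardware rather than with ever-larger core switches.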
Take NVIDIA's DGX A100 SuperPOD AI data center system as an example: the switches in all three tiers are NVIDIA Quantum QM8790 40-port switches. The first-tier switches connect to 1,120 Mellanox HDR 200G InfiniBand NICs. The second-tier switches' downlink ports connect to the first-tier switches, and their uplink ports connect to the third-tier switches. The third-tier switches have only downlink ports, all of which connect to the second-tier switches.
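The topology described above can be turned into a rough switch and link count. The sketch below assumes that first- and second-tier switches split their 40 ports evenly (20 down, 20 up) for a non-blocking fabric, that third-tier switches use all 40 ports as downlinks, and that every port is populated. It is an idealized estimate for illustration, not NVIDIA's published bill of materials, and the transceiver count applies only if every link were optical at both ends.

```python
# Idealized switch/link count for a non-blocking three-tier fabric with
# 1,120 NICs and 40-port switches. Assumptions: tiers 1 and 2 split ports
# 20 down / 20 up, tier 3 uses all 40 ports as downlinks, all ports are
# populated. Not NVIDIA's published bill of materials.
import math


def three_tier(nics: int, ports: int = 40) -> dict:
    half = ports // 2
    leaf = math.ceil(nics / half)              # tier 1: 20 NICs per switch
    leaf_uplinks = leaf * half                 # 20 uplinks per tier-1 switch
    spine = math.ceil(leaf_uplinks / half)     # tier 2: 20 downlinks per switch
    spine_uplinks = spine * half
    core = math.ceil(spine_uplinks / ports)    # tier 3: all 40 ports face down
    links = {"nic-tier1": nics, "tier1-tier2": leaf_uplinks,
             "tier2-tier3": spine_uplinks}
    return {
        "tier1": leaf, "tier2": spine, "tier3": core,
        "switches": leaf + spine + core,
        "links": links,
        # Two transceivers per link, *if* every link were optical at both ends.
        "transceivers_if_all_optical": 2 * sum(links.values()),
    }


fabric = three_tier(1120)
print(fabric)
```

Under these assumptions the compute fabric needs 140 switches and 3,360 links, or up to six transceiver positions per NIC, before the separate storage network is counted.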
Furthermore, the storage side of the system uses its own network architecture, kept separate from the compute side, which requires its own switches and optical transceivers. As a result, AI data centers need substantially more switches and optical transceivers than conventional data centers.
800G Optical Transceivers Play a Pivotal Role
800G optical transceivers play a pivotal role in this transformation. On the optical side, a single 800G transceiver can replace two 400G transceivers. On the electrical side, it integrates 8 SerDes channels, matching its 8 optical channels of 100G each. This doubles the bandwidth per switch port, increasing channel density while reducing the physical space needed for a given capacity.
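The density gain is simple port arithmetic: the number of front-panel modules needed to expose a switch's full capacity halves when each module carries 800G instead of 400G. The 51.2 Tb/s switch capacity below is a hypothetical example figure; the 8 × 100G lane structure follows the description above.

```python
# Illustrative port arithmetic for 400G vs 800G modules.
# Assumptions: a hypothetical 51.2 Tb/s switch, and 800G modules built from
# 8 lanes of 100G each, as described in the text.

LANE_GBPS = 100


def modules_needed(switch_gbps: int, module_gbps: int) -> int:
    """Front-panel modules required to expose the full switch capacity."""
    return -(-switch_gbps // module_gbps)   # ceiling division on integers


def lanes(module_gbps: int, lane_gbps: int = LANE_GBPS) -> int:
    """Number of lanes of the given speed needed for one module."""
    return module_gbps // lane_gbps


capacity = 51_200   # Gb/s, hypothetical switch capacity
for rate in (400, 800):
    print(f"{rate}G modules needed: {modules_needed(capacity, rate)}")
print(f"800G module: {lanes(800)} x {LANE_GBPS}G lanes")
```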
The optical transceiver rate is determined by the network card, and the network card's speed is in turn limited by the PCIe link between it and the GPU. In NVIDIA's DGX A100 servers, GPUs are interconnected internally through NVLink3 with a unidirectional bandwidth of 300GB/s. However, each A100 GPU connects to its ConnectX-6 network card over 16 PCIe 4.0 lanes, which provide roughly 256 Gb/s of raw bandwidth per direction: enough for a 200G network card, but not for a 400G one. Consequently, a 200G optical transceiver or DAC cable is used to match the network card's 200G bandwidth.
In DGX H100 servers, GPUs are interconnected through NVLink4 with a unidirectional bandwidth of 450GB/s. Each H100 GPU connects to a ConnectX-7 network card over 16 PCIe 5.0 lanes, which provide roughly 512 Gb/s of raw bandwidth per direction, enough for a single 400G network card. Once again, it is the PCIe link between the GPU and the network card, not NVLink, that sets the optical transceiver speed.
If the PCIe link between the GPU and the network card were upgraded to PCIe 6.0, whose x16 links offer roughly 1 Tb/s of raw bandwidth per direction, network cards with 800G bandwidth, and therefore 800G optical transceivers, could be deployed for each GPU. This would double the network bandwidth available to each GPU and could significantly improve the computational efficiency of the system.
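The relationship between PCIe generation and NIC speed can be checked with a few lines of arithmetic. The sketch assumes raw per-lane rates of 16, 32, and 64 GT/s for PCIe 4.0, 5.0, and 6.0 (one bit per transfer per lane) and ignores encoding and protocol overhead, so real usable throughput is somewhat lower. The list of NIC speeds is an illustrative assumption.

```python
# Rough check of which NIC port speed a x16 PCIe link can feed.
# Assumptions: raw per-lane rates of 16/32/64 GT/s for PCIe 4.0/5.0/6.0
# (one bit per transfer per lane); encoding and protocol overhead ignored.

PCIE_GT_PER_LANE = {"4.0": 16, "5.0": 32, "6.0": 64}
NIC_SPEEDS_GBPS = (100, 200, 400, 800)   # illustrative standard port speeds


def pcie_raw_gbps(gen: str, lanes: int = 16) -> int:
    """Raw one-direction bandwidth of a PCIe link in Gb/s."""
    return PCIE_GT_PER_LANE[gen] * lanes


def fastest_nic(gen: str) -> int:
    """Largest listed NIC speed that does not exceed the raw PCIe bandwidth."""
    raw = pcie_raw_gbps(gen)
    return max(s for s in NIC_SPEEDS_GBPS if s <= raw)


for gen in PCIE_GT_PER_LANE:
    print(f"PCIe {gen} x16: ~{pcie_raw_gbps(gen)} Gb/s raw -> {fastest_nic(gen)}G NIC")
```

The result matches the pattern described above: PCIe 4.0 pairs with 200G (A100), PCIe 5.0 with 400G (H100), and PCIe 6.0 would be the first generation able to feed an 800G NIC from a single x16 link.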
2024 — The Year of 800G Optical Transceivers
Looking ahead, 2024 is poised to be a significant year for the optical transceiver market, with the spotlight on 800G solutions. Around 2019, when the market was transitioning to 100G optical transceivers, there were two upgrade paths: 200G and 400G. The next generation of high-speed optical transceivers, however, is converging on 800G. Combined with the escalating demand for computing power and the competition driven by AI and AIGC (AI-generated content), this is expected to lead major cloud providers and technology giants in North America to make substantial purchases of 800G optical transceivers in 2024.
Amidst this transformative landscape, having a reliable and innovative partner becomes crucial. FS, as a reliable provider of networking solutions, offers a complete 800G portfolio designed for ultra-large-scale cloud data centers worldwide. In 2023, we unveiled a new series of 800G NDR InfiniBand solutions. Our product range encompasses both 800G OSFP and 800G QSFP-DD optical transceiver types. FS also extends its product line to include 800G AOCs and DACs. This helps broaden our support for customers across various industries, ensuring a continuous supply of top-notch and reliable optical network products and solutions.
In conclusion, the confluence of AI advancements and the optical transceiver market heralds a new era of high-speed and efficient data transmission. The transformative impact of AI on data center networks underscores the pivotal role of optical transceivers. As we anticipate 2024, the year of 800G optical transceivers, businesses can always rely on FS to navigate the complexities of the AI era and build resilient, high-performance networks that pave the way for a future of limitless possibilities.