Enhancing Storage Network Efficiency with NVIDIA Spectrum Ethernet

Posted on Jun 19, 2024

In today’s data-driven world, the demand for efficient, high-performance storage networks is escalating. NVIDIA Spectrum Ethernet is designed to meet this need by providing top-notch performance and scalability for storage networks, ensuring seamless data handling and processing.

Scale-out Storage Needs a Robust Network

Regardless of your business, you likely handle massive and growing amounts of data that must be stored and analyzed. The traditional scale-up approach with larger storage filers has been replaced by scale-out storage. This method uses multiple smaller nodes connected as one logical unit, allowing a single file to be distributed across many nodes.

As demand grows, additional nodes can be added to boost capacity and performance, whether using traditional enterprise solutions or software-defined storage. While distributed storage offers flexible scaling and cost efficiency, it requires a high-performance network to connect the nodes.

How Storage Traffic Differs from Traditional Traffic

Traditional network traffic is often consistent and homogeneous, so conventional Ethernet handles it adequately. Storage traffic, however, introduces unique challenges.

1. Network Stress

Modern storage solutions use faster SSDs and interfaces like NVMe and PCIe Gen 4 for higher performance. This increased speed puts more stress on networks.

2. Congestion

When storage networks become saturated, congestion is inevitable, similar to highway traffic. Scale-out storage requires fast data delivery from each node. Congestion can cause fairness issues in data center switches, slowing some nodes more than others. Since data is spread across many nodes, slowing one node affects the entire cluster.
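
A minimal sketch of why one slow node holds back the whole cluster, assuming a file striped evenly across nodes and read in parallel. The function name, file size, and per-node rates are hypothetical values chosen for illustration, not measurements from any product.

```python
def striped_read_time(file_size_mb, node_rates_mb_per_s):
    """Seconds to read a file striped evenly across nodes.

    The read finishes only when the slowest node has delivered its chunk,
    so the result is the maximum per-node transfer time.
    """
    chunk_mb = file_size_mb / len(node_rates_mb_per_s)
    return max(chunk_mb / rate for rate in node_rates_mb_per_s)


# Hypothetical 8,000 MB file striped across four nodes.
balanced = striped_read_time(8000, [1000, 1000, 1000, 1000])
one_slow = striped_read_time(8000, [1000, 1000, 1000, 250])

print(f"all nodes at 1000 MB/s: {balanced:.1f} s")
print(f"one node throttled to 250 MB/s: {one_slow:.1f} s")
```

In this simplified model, throttling a single node to a quarter of its rate makes the whole read take four times as long, even though the other three nodes are unaffected.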

3. Bursty Traffic

Storage workloads are often bursty, needing large bandwidth for short periods. Network switches must buffer these bursts to prevent packet loss and avoid performance deterioration.
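
As a rough illustration of how much buffering a burst requires, the sketch below assumes traffic arrives at a constant rate faster than the egress port can drain it for a fixed duration. The function name and the link speeds are assumptions made for the example. Real bursts are not constant-rate.

```python
def buffer_needed_bytes(arrival_gbps, drain_gbps, burst_us):
    """Bytes of buffer needed to absorb a burst without dropping packets.

    Only the excess of the arrival rate over the drain rate accumulates.
    """
    excess_gbps = max(arrival_gbps - drain_gbps, 0.0)
    # Gb/s times microseconds gives kilobits (1e9 * 1e-6 = 1e3 bits).
    excess_bits = excess_gbps * burst_us * 1e3
    return excess_bits / 8


# Hypothetical case: two 100 Gb/s senders converge on one 100 Gb/s port
# for 100 microseconds.
needed = buffer_needed_bytes(arrival_gbps=200, drain_gbps=100, burst_us=100)
print(f"buffer needed: {needed / 1e6:.2f} MB")
```

Under these assumptions a 100-microsecond two-to-one incast already needs 1.25 MB of buffer on a single port, which is why the buffer available to each port matters.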

4. Storage Jumbo Frames

Traditional data center traffic uses a 1.5 KB MTU, but scale-out storage nodes perform better with 9 KB "jumbo frames," which increase throughput and reduce CPU overhead.
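
The throughput gain from jumbo frames can be estimated from per-frame overhead. The sketch below assumes TCP over IPv4 without options and standard Ethernet framing (14-byte header, 4-byte FCS, 8-byte preamble, 12-byte inter-frame gap). It ignores VLAN tags and any storage-protocol headers, so the numbers are approximate.

```python
ETH_OVERHEAD = 14 + 4 + 8 + 12   # header, FCS, preamble, inter-frame gap
IP_TCP_HEADERS = 20 + 20         # IPv4 + TCP, no options (assumption)


def wire_efficiency(mtu):
    """Fraction of on-the-wire bytes that carry application payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


def frames_for(payload_bytes, mtu):
    """Number of frames needed to carry payload_bytes (ceiling division)."""
    per_frame = mtu - IP_TCP_HEADERS
    return -(-payload_bytes // per_frame)


GIB = 2**30
for mtu in (1500, 9000):
    print(f"MTU {mtu}: efficiency {wire_efficiency(mtu):.1%}, "
          f"{frames_for(GIB, mtu):,} frames per GiB")
```

The efficiency difference is a few percentage points, but the frame count drops about sixfold. Fewer frames means fewer per-packet operations on the host, which is where the reduced CPU overhead comes from.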

5. Low Latency

Flash-based media significantly reduce read/write latency, but network-induced latency can negate these improvements. High latency, especially from excessive buffering, can hinder storage performance.

6. Data Requirements for Training and Inference

Both training and inference require high-speed data access to keep GPUs fully utilized. Lower storage latency allows GPUs to perform compute tasks more efficiently.

Why Commodity Switch ASICs Fall Short for Storage Traffic

Most data center switches use commodity ASICs, optimized for traditional traffic patterns and packet sizes to minimize costs. These ASICs often employ a split buffer architecture, sacrificing fairness to meet bandwidth targets.

Switch buffers absorb traffic bursts and prevent packet loss during congestion. Typically, buffers are shared across many ports, but not all shared buffers are equal. Commodity switches generally use either ingress-shared or egress-shared buffers.

In an ingress-shared buffer, incoming ports are statically mapped to specific memory slices, limiting their buffer use even if the rest of the buffer is available. Similarly, an egress-shared buffer maps outgoing ports to specific buffer slices, restricting their access to the full buffer.

Flows within the same memory slice perform differently from those traveling between slices. Ports sharing the same buffer slice experience higher latency and lower throughput, while others enjoy better performance. This variability affects storage traffic, leading to issues with fairness, predictability, and microburst absorption in switches using split buffers.
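
The effect of static slicing on burst absorption can be sketched with a deliberately simple model: a burst may use only the slice its port is mapped to, and anything beyond that is dropped. The buffer size, slice count, and burst size below are hypothetical and do not describe any specific ASIC. The model also ignores draining during the burst and dynamic thresholds.

```python
MIB = 2**20


def dropped_bytes(burst_bytes, total_buffer_bytes, slices):
    """Bytes dropped when a burst can use only one of `slices` equal parts.

    slices=1 models a buffer with no static partitioning.
    """
    available = total_buffer_bytes // slices
    return max(burst_bytes - available, 0)


TOTAL_BUFFER = 64 * MIB   # hypothetical total packet buffer
BURST = 24 * MIB          # hypothetical burst landing on one slice

split_drop = dropped_bytes(BURST, TOTAL_BUFFER, slices=4)
unsliced_drop = dropped_bytes(BURST, TOTAL_BUFFER, slices=1)

print(f"4 static slices: {split_drop // MIB} MiB dropped")
print(f"no static slices: {unsliced_drop // MIB} MiB dropped")
```

With four slices the burst sees only 16 MiB and loses 8 MiB, even though 48 MiB of buffer sits idle in the other slices. Without slicing, the same burst fits entirely.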

NVIDIA Spectrum Switches: Optimized for Storage

Commodity switch ASICs can cause inconsistent performance due to their split buffer architecture. In contrast, NVIDIA Spectrum switches use a fully shared buffer, ensuring all flows behave consistently. This architecture maximizes burst absorption capacity and provides fair, predictable performance. All traffic flows receive equal treatment, ensuring uniformly high performance regardless of the ports used.

Benchmarking Deep-Buffer Switches vs. NVIDIA Spectrum Switches

In the first test, the team used the FIO tool to benchmark WRITE operations from two initiators to one target under background traffic. The FIO job completed in 87 seconds with the deep-buffer switch but in only 51 seconds with the NVIDIA Spectrum switch, a reduction of roughly 40% in completion time.

NVIDIA Spectrum switches complete writes in roughly 40% less time than deep-buffer switches
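
The two completion times from the first test can be compared in two ways, as the short calculation below shows. Only the 87-second and 51-second figures come from the test. The variable names are illustrative.

```python
deep_buffer_s = 87   # FIO job time with the deep-buffer switch
spectrum_s = 51      # FIO job time with the Spectrum switch

# Fraction of completion time saved relative to the deep-buffer result.
time_saved = (deep_buffer_s - spectrum_s) / deep_buffer_s

# Ratio of effective write rates for the same amount of work.
speedup = deep_buffer_s / spectrum_s

print(f"time saved: {time_saved:.1%}")
print(f"throughput ratio: {speedup:.2f}x")
```

The "40%" figure corresponds to the reduction in completion time (about 41%). Expressed as throughput, the same result is roughly a 1.7x improvement.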

In the second test, the team measured latency through the deep-buffer switch. Deep buffers significantly increase latency, which slows storage and reduces application performance. Testing showed latency of 2-19 milliseconds on the deep-buffer switch versus 300 nanoseconds on the Spectrum switch, a difference reported as 50,000 times. Buffer occupancy is also directly correlated with latency: the fuller the buffer, the longer each packet waits.

Accordingly, the accompanying graph shows the maximum latency for each deep-buffer ASIC (such as Ramon, Jericho 1, and Jericho 2). Fast storage systems in particular, and data center applications in general, cannot tolerate latency this high.

Actual and anticipated delay in relation to buffer size and occupancy
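
The link between buffer occupancy and delay follows from simple arithmetic: a packet arriving behind a queue must wait for that queue to drain. The sketch below assumes a single first-in, first-out queue draining at line rate. The 25 MB occupancy and 100 Gb/s rate are hypothetical, while the 2 ms, 19 ms, and 300 ns figures are the ones quoted in the second test.

```python
def queuing_delay_s(occupancy_bytes, drain_rate_gbps):
    """Seconds a newly arrived packet waits behind occupancy_bytes of queue."""
    return occupancy_bytes * 8 / (drain_rate_gbps * 1e9)


# Hypothetical: 25 MB queued ahead of a packet on a 100 Gb/s port.
delay = queuing_delay_s(25_000_000, 100)
print(f"queuing delay: {delay * 1e3:.1f} ms")

# Ratios implied by the latencies quoted in the test.
spectrum_latency_s = 300e-9
low_ratio = 2e-3 / spectrum_latency_s
high_ratio = 19e-3 / spectrum_latency_s
print(f"latency ratio range: {low_ratio:,.0f}x to {high_ratio:,.0f}x")
```

In this model, a few tens of megabytes of queued data is enough to produce millisecond-scale delay at 100 Gb/s, which is the order of magnitude observed on the deep-buffer switch.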

In the third test, two Windows machines copied a file to the same target storage. The deep-buffer switch distributed bandwidth unevenly (830 MBps vs. 290 MBps), whereas the Spectrum switch gave each machine an equal share (584 MBps each).

The deep-buffer switch offers unfair bandwidth per node (left), while the NVIDIA Spectrum switch offers equal bandwidth (right)
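
One common way to put a number on this kind of unevenness is Jain's fairness index, which equals 1.0 when every flow gets the same rate and approaches 1/n when a single flow takes everything. The test report itself does not use this metric. It is applied here only as an illustration, using the bandwidth figures from the third test.

```python
def jain_index(rates):
    """Jain's fairness index: (sum of rates)^2 / (n * sum of squared rates)."""
    total = sum(rates)
    return total * total / (len(rates) * sum(r * r for r in rates))


deep_buffer = jain_index([830, 290])   # MBps per machine, deep-buffer switch
spectrum = jain_index([584, 584])      # MBps per machine, Spectrum switch

print(f"deep-buffer switch: {deep_buffer:.3f}")
print(f"Spectrum switch:    {spectrum:.3f}")
```

For two flows the index can range from 0.5 to 1.0, so the deep-buffer result of about 0.81 sits well below the perfectly even split measured on the Spectrum switch.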

These tests reveal that deep-buffer switches are not ideal for data center applications due to high latency and poor performance under scaled workloads. NVIDIA Spectrum switches, however, deliver consistent, high performance, making them suitable for AI/ML and storage workloads.

Conclusion

NVIDIA Spectrum Ethernet is a robust solution for optimizing storage network performance. Its high throughput, low latency, scalability, reliability, energy efficiency, and advanced management capabilities make it a strong choice across various industries. As data demands continue to rise, investing in reliable networking solutions like NVIDIA Spectrum Ethernet is crucial for maintaining competitive advantage and operational efficiency. As an official NVIDIA authorized partner, FS offers NVIDIA Spectrum switches with adequate US inventory and same-day shipping, so you can experience superior network performance now.