DPFR

Posted on Mar 29, 2024

 58

What Is DPFR?

Data plane fast recovery (DPFR) is a fault recovery technology that operates at the sub-millisecond level. It uses the data plane to detect port faults quickly and combines three mechanisms (local fast fault convergence, remote fault advertisement, and remote fast fault convergence) to rectify faults without interrupting services. This document explains why DPFR is needed, compares it with conventional fault convergence technologies, describes how it works, and presents a typical application.

The Significance of DPFR

Conventional fault convergence technologies rely on control-plane operations, where dynamic routing protocols like OSPF and BGP exchange information and perform path recomputation. Although Bidirectional Forwarding Detection (BFD) helps accelerate fault detection, the overall route convergence process still takes hundreds of milliseconds or even seconds in a large-scale data center network (DCN).

However, for online transaction applications that demand high performance and reliability, a delay of hundreds of milliseconds to restore service transmission after a link fault is unacceptable. Continuous packet loss can lead to transaction failures or connection timeouts, significantly degrading application performance.

DPFR was developed to address this challenge. It shifts fault convergence from the control plane to the data plane: the data plane itself detects the fault, advertises it to remote devices, and switches paths. This achieves fault convergence at the sub-millisecond level and minimizes the impact on service performance. DPFR improves reliability and stability, particularly for critical applications such as high-performance databases, storage systems, and supercomputing environments.
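To put the difference in convergence time into perspective, the following sketch estimates how much traffic a fully loaded link would drop while a fault remains unrepaired. The link rate (100 Gbit/s) and the two outage durations (200 ms and 1 ms) are illustrative assumptions chosen to match the "hundreds of milliseconds" and "sub-millisecond" ranges mentioned above; they are not measured values.

```python
def bytes_lost(link_gbps: float, outage_ms: float) -> float:
    """Upper bound on bytes dropped if a fully loaded link
    blackholes traffic for outage_ms milliseconds."""
    bits = link_gbps * 1e9 * (outage_ms / 1e3)
    return bits / 8


# Assumed values for illustration only.
LINK_GBPS = 100.0
SCENARIOS = [
    ("control-plane convergence (assumed 200 ms)", 200.0),
    ("DPFR (assumed 1 ms upper bound)", 1.0),
]

for label, outage_ms in SCENARIOS:
    lost_mb = bytes_lost(LINK_GBPS, outage_ms) / 1e6
    print(f"{label}: up to {lost_mb:.1f} MB lost")
```

Under these assumptions, the worst-case loss is about 2.5 GB for the 200 ms outage and about 12.5 MB for the 1 ms outage, a factor of 200.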

Comparing Conventional Fault Convergence Technologies with DPFR

On a large-scale DCN, DPFR and conventional fault convergence technologies are contrasted in the following table.

Operational Mechanism of DPFR

DPFR operates through the collaboration of three essential roles, each performing specific functions:

1. Fault detection node:

Fast fault detection: The data plane quickly identifies faults, such as faulty optical modules or incorrectly connected transmission optical cables. It samples outgoing traffic on the faulty port, gathers information about the faulty flow, and generates a corresponding fault table.
Local fast fault convergence: If the fault detection node has an alternative redundant path available, it performs rapid path switching for data packets before control plane fault convergence occurs. In this scenario, the fault detection node acts as a path switching node.
Remote fault advertisement: In cases where no redundant path is available, the data plane generates an advertisement packet containing information about the faulty flow and sends it to the upstream device.

2. Forwarding node:

Remote fault advertisement reception: The forwarding node records the port through which the fault advertisement packet is received, determines information about the faulty flow, and generates the corresponding fault table.
Remote fault relay: If the forwarding node has no redundant path for the faulty flow, it samples the faulty flow on the port, generates a fault advertisement packet, and sends it to the next upstream device.

3. Path switching node:

Remote fault advertisement reception: Like the forwarding node, the path switching node records the port through which the fault advertisement packet is received, determines information about the faulty flow, and generates the corresponding fault table.
Remote fast fault convergence: Because the path switching node has a redundant path for the faulty flow, it switches the flow to that path after receiving the fault advertisement packet, which keeps data transmission running.
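The three roles above share one decision: record the faulty flow, then either switch to a redundant path or advertise the fault upstream. The sketch below models that decision in software. It is only an illustration of the logic described here, not the actual data-plane implementation; the `Node` class, the `recover` helper, and all node, port, and flow names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Node:
    name: str
    # flow id -> candidate egress ports (hypothetical routing model)
    routes: Dict[str, List[str]]
    # fault table: flow id -> egress ports known to be faulty for that flow
    fault_table: Dict[str, Set[str]] = field(default_factory=dict)

    def usable_ports(self, flow: str) -> List[str]:
        bad = self.fault_table.get(flow, set())
        return [p for p in self.routes.get(flow, []) if p not in bad]

    def handle_fault(self, flow: str, port: str) -> Tuple[str, Optional[str]]:
        """Handle a locally detected port fault, or a fault
        advertisement that arrived on `port`, for `flow`."""
        self.fault_table.setdefault(flow, set()).add(port)
        alternatives = self.usable_ports(flow)
        if alternatives:
            # Redundant path available: act as the path switching node.
            return ("switch", alternatives[0])
        # No redundant path: advertise the faulty flow upstream.
        return ("advertise_upstream", None)


def recover(path: List[Tuple[Node, str]], flow: str):
    """Walk upstream from the fault detection node until a node can switch paths."""
    log = []
    for node, port in path:
        action, new_port = node.handle_fault(flow, port)
        log.append((node.name, action, new_port))
        if action == "switch":
            break
    return log


# Hypothetical three-node chain, ordered from the fault upstream.
detector = Node("detector", {"flowA": ["to_dest"]})
forwarder = Node("forwarder", {"flowA": ["to_detector"]})
switcher = Node("switcher", {"flowA": ["to_forwarder", "to_backup"]})

log = recover(
    [(detector, "to_dest"), (forwarder, "to_detector"), (switcher, "to_forwarder")],
    "flowA",
)
for entry in log:
    print(entry)
```

In this example the detector and the forwarder each have a single path for `flowA`, so both advertise upstream; the switcher has a second port and moves the flow to it.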

The fault tables in each node are established based on the information about the faulty flow. These tables have a specified aging period to maintain consistency between data plane behavior and control plane route convergence results.
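One way to picture the aging period is as a time-to-live on each fault table entry: while the entry is live, the data plane steers the flow away from the faulty port, and once it expires, forwarding again follows whatever routes the control plane has installed by then. The sketch below assumes this interpretation. The `FaultTable` class and the 2-second aging value are hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
from typing import Dict, Tuple


class FaultTable:
    """Fault entries keyed by (flow, port) that expire after an aging period."""

    def __init__(self, aging_seconds: float):
        self.aging_seconds = aging_seconds
        self._expiry: Dict[Tuple[str, str], float] = {}

    def add(self, flow: str, port: str, now: float) -> None:
        # (Re)start the aging timer for this entry.
        self._expiry[(flow, port)] = now + self.aging_seconds

    def is_faulty(self, flow: str, port: str, now: float) -> bool:
        expiry = self._expiry.get((flow, port))
        if expiry is None:
            return False
        if now >= expiry:
            # Entry has aged out: drop it and defer to the control plane.
            del self._expiry[(flow, port)]
            return False
        return True


# Assumed aging period of 2 seconds, for illustration only.
table = FaultTable(aging_seconds=2.0)
table.add("flowA", "port1", now=0.0)

print(table.is_faulty("flowA", "port1", now=1.0))  # entry still live
print(table.is_faulty("flowA", "port1", now=2.5))  # entry aged out
```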

Typical Application of DPFR

In the traditional Layer 3 networking depicted in the diagram below, servers are interconnected using separate IP addresses. Leaf switches function as independent Layer 3 gateways, responsible for forwarding both Layer 2 and Layer 3 traffic. Spine switches serve as standalone Layer 3 devices and are connected to the leaf switches to enable Equal-Cost Multipath (ECMP) load balancing.

This networking model is commonly used in lossless environments such as high-performance computing (HPC), artificial intelligence (AI), and storage scenarios. In HPC applications, for instance, a link fault can cause a large number of packets to be lost. As a result, distributed computing tasks cannot complete and must be restarted, and overall application performance suffers. DPFR addresses this issue by shortening the period of packet loss, which improves reliability for performance-critical applications such as AI, machine learning, and HPC.

In a network where DPFR is deployed on all devices, two cases arise. If a leaf switch detects a fault on a spine switch or on its own uplink to a spine switch, the leaf switch promptly redirects traffic to another ECMP member link (local fast fault convergence). If a spine switch detects a fault on its link to a leaf switch and has no alternative path to that leaf switch, the spine switch sends a fault advertisement to the remote leaf switch, which then redirects traffic to another available ECMP member link (remote fast fault convergence).
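The redirection step on the leaf switch can be sketched as ECMP member selection that skips members marked faulty. This is a simplified software model, not the switch's actual hashing scheme; the `pick_ecmp_member` function, the CRC32-based hash, and the spine and flow names are all assumptions made for illustration.

```python
import zlib
from typing import List, Optional, Set


def pick_ecmp_member(flow_key: str, members: List[str], failed: Set[str]) -> Optional[str]:
    """Hash a flow onto one of the healthy ECMP members.

    Returns None when every member is marked faulty."""
    healthy = [m for m in members if m not in failed]
    if not healthy:
        return None
    index = zlib.crc32(flow_key.encode("utf-8")) % len(healthy)
    return healthy[index]


# Hypothetical leaf switch with four uplinks, one per spine switch.
uplinks = ["spine1", "spine2", "spine3", "spine4"]
flow = "10.0.0.1->10.0.1.1:4791"

before = pick_ecmp_member(flow, uplinks, failed=set())
after = pick_ecmp_member(flow, uplinks, failed={before})
print(f"before fault: {before}, after fault on {before}: {after}")
```

Note that this simple modulo scheme may also remap flows that were not using the faulty member. Whether a given device avoids that, for example through some form of resilient hashing, is outside what this document describes.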