Google Cloud Networking under the hood: How Protective ReRoute increases resilience

Cloud infrastructure reliability is foundational, yet even the most sophisticated global networks can suffer from a critical issue: slow or failed recovery from routing outages. In planetary-scale networks like Google’s, router failures or complex, hidden conditions can prevent traditional routing protocols from restoring service quickly, or sometimes at all. These brief but costly outages, which we call slow convergence or convergence failure, disrupt real-time applications with low tolerance to packet loss and, most acutely, today’s massive, sensitive AI/ML training jobs, where a brief network hiccup can waste millions of dollars in compute time.

To solve this problem, we pioneered Protective ReRoute (PRR), a radical shift that moves the responsibility for rapid failure recovery from the centralized network core to the distributed endpoints themselves. Since we put it into production over five years ago, this host-based mechanism has dramatically increased the resilience of Google’s network, proving effective in recovering from up to 84%¹ of inter-data-center outages that would have been caused by slow convergence events. Google Cloud customers with workloads that are sensitive to packet loss can also enable it in their environments. Read on to learn more.

The limits of in-network recovery

Traditional routing protocols are essential for network operation, but they are often not fast enough to meet the demands of modern, real-time workloads. When a router or link fails, the network must recalculate all affected routes, which is known as reconvergence. In a network the size of Google’s, this process can be complicated by the scale of the topology, leading to delays that range from many seconds to minutes. For distributed AI training jobs with their wide, fan-out communication patterns, even a few seconds of packet loss can lead to application failure and costly restarts. The problem is a matter of scale: as the network grows, the likelihood of these complex failure scenarios increases.

Protective ReRoute: A host-based solution

Protective ReRoute is a simple, effective concept: empower the communicating endpoints (the hosts) to detect a failure and intelligently re-steer traffic to a healthy, parallel path. Instead of waiting for a global network update, PRR capitalizes on the rich path diversity built into our network. The host detects packet loss or high latency on its current path, and then immediately initiates a path change by modifying carefully chosen packet header fields, which tells the network to use an alternate, pre-existing path.

This architecture represents a fundamental shift in network reliability thinking. Traditional networks rely on a combination of parallel and series reliability. Placing components in series tends to reduce the reliability of a system: in a large-diameter network with multiple forwarding stages, every stage affects the whole system, so reliability degrades as the diameter increases. Even a stage designed with parallel reliability has a serial impact on the overall network while that stage reconverges. By adding PRR at the edges, we treat the network as a highly parallel system of paths that appears as a single stage. Overall reliability then increases with the number of available paths, and that number grows exponentially with the network diameter, which effectively circumvents the serialization effects of slow convergence in a large-diameter network. The following diagram contrasts the system reliability model for a PRR-enabled network with that of a traditional network. Traditional network reliability is in inverse proportion to the number of forwarding stages; with PRR, the reliability of the same network is in direct proportion to the number of composite paths, which grows exponentially with the network diameter.
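To make the series-versus-parallel contrast concrete, here is a minimal sketch using the textbook reliability formulas. It is an illustration only, not Google's model: it assumes every stage or path fails independently with the same probability, and the function names and numbers are hypothetical.

```python
def series_availability(stage_availability: float, stages: int) -> float:
    """Series model: traffic gets through only if every stage works,
    so the per-stage availabilities multiply."""
    return stage_availability ** stages


def parallel_availability(path_availability: float, paths: int) -> float:
    """Parallel model: traffic gets through if at least one path works,
    so the per-path failure probabilities multiply."""
    return 1.0 - (1.0 - path_availability) ** paths


if __name__ == "__main__":
    per_stage = 0.999  # assumed availability of a single forwarding stage

    # More stages in series: end-to-end availability drops.
    for stages in (1, 5, 10):
        print(f"{stages:2d} stages in series : {series_availability(per_stage, stages):.6f}")

    # More independent end-to-end paths to choose from: availability rises.
    one_path = series_availability(per_stage, 10)
    for paths in (1, 2, 4):
        print(f"{paths:2d} parallel paths   : {parallel_availability(one_path, paths):.8f}")
```

Under these assumptions, ten stages in series at 0.999 each give an end-to-end availability of roughly 0.990, while two such independent paths in parallel give roughly 0.9999. Real paths share components, so real gains are smaller than the independent-failure model suggests.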

How Protective ReRoute works

The PRR mechanism has three core functional components:

End-to-end failure detection: Communicating hosts continuously monitor path health. On Linux systems, the standard mechanism uses TCP retransmission timeout (RTO) to signal a potential failure. The time to detect a failure is generally a single-digit multiple of the network’s round-trip time (RTT). There are also other methods for end-to-end failure detection that have varying speed and cost.
Packet-header modification at the host: Once a failure is detected, the transmitting host modifies a packet-header field to influence the forwarding path. To achieve this, Google pioneered and contributed the mechanism that modifies the IPv6 flow label in the Linux kernel (version 4.20+). Crucially, the Google software-defined network (SDN) layer also protects IPv4 traffic and non-Linux hosts by performing the detection and repathing on the outer headers of the network overlay.
PRR-aware forwarding: Routers and switches in the multipath network respect this header modification and forward the packet onto a different, available path that bypasses the failed component.
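The three components above can be sketched as a toy simulation. This is a simplified model, not the Linux kernel or Google's forwarding implementation: the SHA-256 path hash stands in for whatever hash real multipath routers use, each loop iteration stands in for one retransmission timeout, and all names and addresses are hypothetical.

```python
import hashlib
import random

FLOW_LABEL_BITS = 20  # The IPv6 flow label is a 20-bit header field.


def pick_path(src, dst, sport, dport, flow_label, n_paths):
    """Toy multipath forwarding: hash the flow identifiers plus the
    flow label onto one of n_paths parallel paths."""
    key = f"{src}|{dst}|{sport}|{dport}|{flow_label}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_paths


def send_with_repath(failed_paths, n_paths, max_attempts=16, rng=None):
    """Try flow labels until the connection lands on a healthy path.

    Each unsuccessful attempt models a timeout on the current path,
    after which the host draws a new flow label (the repath step).
    Returns (flow_label, path, attempts), or None if no attempt succeeded.
    """
    rng = rng or random.Random()
    # Hypothetical connection 4-tuple; it never changes during repathing.
    flow = ("2001:db8::1", "2001:db8::2", 40000, 443)
    for attempt in range(1, max_attempts + 1):
        label = rng.getrandbits(FLOW_LABEL_BITS)
        path = pick_path(*flow, label, n_paths)
        if path not in failed_paths:
            return label, path, attempt
    return None


if __name__ == "__main__":
    result = send_with_repath(failed_paths={0, 3}, n_paths=8)
    print("label, path, attempts:", result)
```

The point of the sketch is that the connection identifiers stay fixed while only the flow label changes, so the endpoints keep their TCP connection and the network still forwards the packets over a different pre-existing path.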

Proof of impact

PRR is not theoretical; it is a continuously deployed, 24×7 system that protects production traffic worldwide. Its impact is compelling: PRR has been shown to reduce network downtime caused by slow convergence and convergence failures by up to 84%, the figure cited above. This means that up to 8 out of every 10 network outages that would have been caused by a router failure or slow network-level recovery are now avoided by the host. Furthermore, host-initiated recovery is extremely fast, often resolving the problem in a single-digit multiple of the RTT, which is vastly faster than traditional network reconvergence times.

Key use cases for ultra-reliable networking

The need for PRR is growing, driven by modern application requirements:

AI/ML training and inference: Large-scale workloads, particularly those distributed across many accelerators (GPUs/TPUs), are uniquely sensitive to network reliability. PRR provides the ultra-reliable data distribution necessary to keep these high-value compute jobs running without disruption.
Data integrity and storage: Significant numbers of dropped packets can result in data corruption and data loss, not just reduced throughput. By reducing the outage window, PRR improves application performance and helps guarantee data integrity.
Real-time applications: Applications like gaming and services like video conferencing and voice calls are intolerant of even brief connectivity outages. PRR reduces the recovery time for network failures to meet these strict real-time requirements.
Frequent short-lived connections: Applications that rely on a large number of very frequent short-lived connections can fail when the network is unavailable for even a short time. By reducing the expected outage window, PRR helps these applications reliably complete their required connections.

Activating Protective ReRoute for your applications

Host-based reliability is accessible to Google Cloud customers. The core mechanism is open and part of the mainline Linux kernel (version 4.20 and later).

You can benefit from PRR in two primary ways:

Hypervisor mode: PRR automatically protects traffic running across Google data centers without requiring any guest OS changes. Hypervisor mode provides recovery in single-digit seconds for traffic of moderate fan-out in specific areas of the network.
Guest mode: For critical, performance-sensitive applications with high fan-out and in any segment of the network, you can opt into guest-mode PRR, which enables the fastest possible recovery time and greatest control. This is the optimal setting for demanding mission-critical applications, AI/ML jobs, and other latency-sensitive services.

To activate guest-mode PRR for critical applications, follow the guidance in the documentation and be ready to ensure the following:

Your VM runs a modern Linux kernel (4.20+).
Your applications use TCP.
The application traffic uses IPv6. For IPv4 protection, the VM needs to use the gVNIC driver.
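As a rough aid for the first prerequisite, the sketch below checks whether a VM's kernel release is at least 4.20. It also reads the standard Linux sysctl `net.ipv6.auto_flowlabels` for information only; this post does not say which value, if any, guest-mode PRR requires, so treat that part as an assumption and follow the documentation. The helper names are hypothetical, and the script does not check the TCP or gVNIC requirements.

```python
import platform
import re
from pathlib import Path
from typing import Optional, Tuple

MIN_KERNEL = (4, 20)  # minimum kernel version named in this post
AUTO_FLOWLABELS = Path("/proc/sys/net/ipv6/auto_flowlabels")


def parse_kernel_release(release: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def kernel_is_new_enough(release: str, minimum: Tuple[int, int] = MIN_KERNEL) -> bool:
    """Return True if the release string parses to a version >= minimum."""
    version = parse_kernel_release(release)
    return version is not None and version >= minimum


def read_auto_flowlabels(path: Path = AUTO_FLOWLABELS) -> Optional[str]:
    """Return the sysctl value as text, or None if it cannot be read
    (for example on a non-Linux host)."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    release = platform.release()
    print(f"kernel release: {release}")
    print(f"kernel >= 4.20: {kernel_is_new_enough(release)}")
    print(f"net.ipv6.auto_flowlabels: {read_auto_flowlabels()}")
```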

Get started

The availability of Protective ReRoute has profound implications for a variety of Google and Google Cloud users.

For cloud customers with critical workloads: Evaluate and enable guest-mode PRR for applications that are sensitive to packet loss and that require the fastest recovery time, such as large-scale AI/ML jobs or real-time services.
For network architects: Re-evaluate your network reliability architectures. Consider the benefits of designing for rich path diversity and empowering endpoints to intelligently route around failures, shifting your model from series to parallel reliability.
For the open-source community: Recognize the power of host-level networking innovations. Contribute to and advocate for similar reliability features across all major operating systems to create a more resilient internet for everyone.

1. https://dl.acm.org/doi/10.1145/3603269.3604867
