Facebook | 2021-10-04T00:00:00Z

Facebook 2021 Outage: BGP, DNS, and Backbone Recovery

Facebook's October 4, 2021 outage was caused by a backbone configuration change that disconnected its data centers and led its DNS servers to withdraw their BGP route advertisements, making Facebook, Instagram, WhatsApp, Messenger, and internal tools unreachable.

Incident answer

Impact: Facebook, Instagram, WhatsApp, Messenger, and internal tooling were unavailable or severely degraded for roughly six hours.

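Root cause: A command issued during routine backbone maintenance, intended to assess available backbone capacity, unintentionally took down the backbone connections between Facebook's data centers. With the data centers unreachable, Facebook's DNS servers withdrew their BGP advertisements, which cut the services off from the wider internet.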

Lesson: Network automation needs blast-radius controls, out-of-band access, and recovery paths that do not depend on the broken network.

Quick Summary

On October 4, 2021, Facebook and several Meta-owned services experienced a major global outage. Meta's engineering writeup, "More details about the October 4 outage," explains that a backbone configuration change caused Facebook's DNS servers to become unreachable from the internet.

This incident became a famous example of how network control-plane failures can also break the tools engineers need to recover.

Why It Mattered

The outage affected Facebook, Instagram, WhatsApp, Messenger, and internal operational systems. For many users, it looked like the company had disappeared from the internet.

The deeper lesson is that internal dependencies matter during recovery. If access control, chat, deployment, and observability depend on the same damaged network path, incident response slows down sharply.
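
One way to reason about this is to model systems as a dependency graph and ask which recovery tools still work after a given component fails. The sketch below is illustrative only: the system names and the DEPENDS_ON map are hypothetical and do not describe Facebook's actual architecture.

```python
from collections import deque

# Hypothetical dependency map: each system lists what it needs in order to work.
DEPENDS_ON = {
    "chat": ["internal_dns"],
    "deploy_tool": ["internal_dns", "auth"],
    "dashboards": ["internal_dns"],
    "auth": ["backbone"],
    "internal_dns": ["backbone"],
    "oob_console": ["oob_network"],
    "backbone": [],
    "oob_network": [],
}


def affected_by(failed, depends_on):
    """Return every system that transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {name: [] for name in depends_on}
    for system, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(system)

    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def surviving_tools(failed, tools, depends_on):
    """Return the recovery tools that do not depend on the failed component."""
    broken = affected_by(failed, depends_on) | {failed}
    return sorted(tool for tool in tools if tool not in broken)


recovery_tools = ["chat", "deploy_tool", "dashboards", "oob_console"]
print(surviving_tools("backbone", recovery_tools, DEPENDS_ON))
```

In this toy map, only the tool that sits on a separate out-of-band network survives a backbone failure, which is the property the recovery path needs.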

Root Cause Pattern

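The pattern was network automation with a large blast radius. A command intended for backbone maintenance took down the backbone links between data centers. Facebook's DNS servers, which stop advertising their BGP routes when they cannot reach the data centers, then withdrew those routes, so the rest of the internet lost the path to critical DNS infrastructure.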
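
The interaction between a health check and route advertisement can be shown with a toy model. This is a simplified sketch, not Facebook's implementation: the DnsSite class, the site names, and the prefixes (taken from the reserved documentation address ranges) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DnsSite:
    """Toy model of one DNS location announcing a prefix over BGP."""
    name: str
    prefix: str
    can_reach_datacenters: bool = True

    def should_advertise(self) -> bool:
        # Health rule: announce the prefix only while the backend is reachable.
        return self.can_reach_datacenters


def advertised_prefixes(sites):
    """Prefixes the outside world can still route to."""
    return {site.prefix for site in sites if site.should_advertise()}


def take_backbone_down(sites):
    """Simulate a change that cuts every site off from the data centers."""
    for site in sites:
        site.can_reach_datacenters = False


sites = [
    DnsSite("site-a", "192.0.2.0/24"),
    DnsSite("site-b", "198.51.100.0/24"),
]
before = advertised_prefixes(sites)
take_backbone_down(sites)
after = advertised_prefixes(sites)

print(sorted(before))
print(sorted(after))
```

The health rule is sensible for a single unhealthy site, but when every site fails the same check at once, all advertisements are withdrawn together and nothing remains routable.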

Key investigation clues:

DNS resolution fails or times out because the authoritative name servers are unreachable.
BGP routes for the affected prefixes disappear suddenly from public routing tables.
Multiple apps and internal tools fail together.
Recovery requires physical or out-of-band access.
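
The first and third clues can be checked together with a small script: resolve several product hostnames and see whether they fail as a group. This is a minimal sketch using only the Python standard library; the classify function, its verdict strings, and the injectable resolver parameter are illustrative choices, not an established tool.

```python
import socket


def resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(resolver(hostname, 443)) > 0
    except OSError:
        # socket.gaierror is a subclass of OSError, so lookup failures land here.
        return False


def classify(hostnames, resolver=socket.getaddrinfo):
    """Summarize resolution results across several hostnames."""
    results = {host: resolves(host, resolver) for host in hostnames}
    failed = [host for host, ok in results.items() if not ok]
    if not failed:
        verdict = "dns-ok"
    elif len(failed) == len(results):
        # Every name failing at once points at a shared dependency.
        verdict = "suspect-shared-dependency"
    else:
        verdict = "partial-failure"
    return verdict, failed
```

Calling classify with a list of real hostnames and the default resolver performs live lookups. Passing a fake resolver makes the logic testable without network access. A result where every name fails is only a hint: it should be confirmed from a second vantage point before concluding that the problem is upstream rather than local.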

Remediation Themes

Important reliability lessons:

Add guardrails to network automation.
Keep out-of-band access independent from the primary network.
Treat DNS as a critical production dependency.
Drill recovery paths where internal tools are partially unavailable.
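
A guardrail for network automation can be as simple as a pre-flight check that refuses any change touching more than a set fraction of links. The sketch below is a hypothetical illustration: the Change type, the link names, and the 20% default threshold are assumptions, and a production audit would need to model topology and redundancy rather than just count links.

```python
from dataclasses import dataclass


class BlastRadiusExceeded(Exception):
    """Raised when a change would touch too much of the network at once."""


@dataclass(frozen=True)
class Change:
    description: str
    links_affected: frozenset


def check_blast_radius(change, all_links, max_fraction=0.2):
    """Return the affected fraction, or raise if it exceeds max_fraction."""
    if not all_links:
        raise ValueError("link inventory is empty; refusing to evaluate change")
    unknown = change.links_affected - all_links
    if unknown:
        # Fail closed: a change naming links we do not know about is not safe.
        raise ValueError(f"unknown links in change: {sorted(unknown)}")
    fraction = len(change.links_affected) / len(all_links)
    if fraction > max_fraction:
        raise BlastRadiusExceeded(
            f"{change.description!r} affects {fraction:.0%} of links "
            f"(limit {max_fraction:.0%})"
        )
    return fraction


links = frozenset(f"link-{i}" for i in range(10))
small = Change("drain one link", frozenset({"link-0"}))
print(check_blast_radius(small, links))
```

The check fails closed on an empty inventory or unknown links, because a guardrail that silently passes when its inputs are wrong provides no protection.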

What Engineers Should Practice

When a large platform disappears, check DNS, BGP, and identity paths before diving into individual services. A multi-product outage is often a shared dependency failure.
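
That triage order can be encoded as an ordered list of checks that stops at the first failing layer. This is a rough sketch: the layer names and the stand-in check functions are placeholders, and real checks would query resolvers, route collectors, and the identity provider.

```python
def first_failing_layer(checks):
    """Run (name, check) pairs in order and return the first failing name.

    Each check is a zero-argument callable returning True when healthy.
    Returns None if every layer passes.
    """
    for name, check in checks:
        try:
            healthy = bool(check())
        except Exception:
            # A check that cannot run is treated as a failure of that layer.
            healthy = False
        if not healthy:
            return name
    return None


# Stand-in checks for illustration; replace with real probes.
triage = [
    ("dns", lambda: True),
    ("bgp", lambda: False),
    ("identity", lambda: True),
]
print(first_failing_layer(triage))
```

Ordering the shared layers first keeps responders from spending time inside individual services while the common dependency underneath them is down.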

The practical takeaway: the recovery system must survive the failure it is supposed to repair.

External References

Meta Engineering: More details about the October 4 outage