Famous Production Incidents

Cloudflare | 2025-11-18T00:00:00Z

Cloudflare 2025 Outage: Bot Management Feature File Failure

A root cause analysis of Cloudflare's November 2025 outage, where a database permissions change produced an oversized Bot Management feature file and disrupted core traffic handling.

AWS | 2025-10-19T00:00:00Z

AWS 2025 Outage: DynamoDB DNS and US-EAST-1 Cascades

A production incident case study of the October 2025 AWS US-EAST-1 outage, where a DynamoDB DNS automation race condition cascaded into EC2, NLB, Lambda, and other services.

Google Cloud | 2025-06-12T00:00:00Z

Google Cloud 2025 Outage: Service Control and Global API Failures

A root cause analysis of Google Cloud's June 2025 outage, where a Service Control quota policy path crashed globally after invalid policy data propagated within seconds.

CrowdStrike | 2024-07-19T00:00:00Z

CrowdStrike 2024 Outage: Channel File 291 and Global Windows Crashes

A production incident case study of CrowdStrike's July 2024 Falcon content update, which caused widespread Windows crashes and long manual recovery for affected machines.

Atlassian | 2022-04-05T00:00:00Z

Atlassian 2022 Outage: Site Deletion and Long-Tail Recovery

A production incident case study of Atlassian's 2022 outage, where a maintenance script deleted customer sites and recovery took days.

Roblox | 2021-10-28T00:00:00Z

Roblox 2021 Outage: Consul, Discovery, and Multi-Day Recovery

A case study of the Roblox 2021 outage, where Consul cluster degradation broke service discovery and led to a multi-day recovery.

Facebook | 2021-10-04T00:00:00Z

Facebook 2021 Outage: BGP, DNS, and Backbone Recovery

A root cause analysis of Facebook's October 2021 outage, covering BGP route withdrawal, DNS failure, and operational recovery complexity.

Fastly | 2021-06-08T00:00:00Z

Fastly 2021 Outage: CDN Configuration and Global Blast Radius

A production incident case study of Fastly's 2021 CDN outage, where a customer configuration change triggered a latent software bug.

Slack | 2021-01-04T00:00:00Z

Slack 2021 Outage: Capacity, Dependency Recovery, and Incident Response

A practical case study of Slack's January 2021 outage, focusing on capacity, dependency recovery, and customer-facing degradation.

Cloudflare | 2019-07-02T00:00:00Z

Cloudflare 2019 WAF Outage: Regex, CPU, and Global Blast Radius

A production incident case study on Cloudflare's 2019 WAF outage, where a rule change caused CPU exhaustion at the edge.

Google Cloud | 2019-06-02T00:00:00Z

Google Cloud 2019 Outage: Network Control Plane Failure

A concise incident analysis of the Google Cloud 2019 outage, where a configuration change caused network congestion and broad service impact.

GitHub | 2018-10-21T00:00:00Z

GitHub 2018 MySQL Incident: A Root Cause Analysis

A clear explanation of GitHub's 2018 MySQL incident, including the failover failure mode, user impact, and engineering lessons.

Azure DevOps | 2018-09-04T00:00:00Z

Azure DevOps 2018 Outage: Regional Failure and Recovery Lessons

A practical root cause analysis of the Azure DevOps 2018 outage, including regional dependency failure, recovery gaps, and reliability lessons.

AWS | 2017-02-28T00:00:00Z

AWS S3 US-EAST-1 Outage: What Happened in 2017?

A concise root cause analysis of the AWS S3 2017 outage, its impact, remediation, and reliability lessons for production engineers.