Atlassian | 2022-04-05T00:00:00Z
Atlassian 2022 Outage: Site Deletion and Long-Tail Recovery
A production incident case study of Atlassian's April 2022 outage, where a maintenance script deleted customer sites and full recovery took up to two weeks.
Real outage case studies with direct answers, root causes, impact, and engineering lessons.
Roblox | 2021-10-28T00:00:00Z
A case study of the Roblox 2021 outage, where service discovery and Consul cluster issues contributed to a long recovery.
Facebook | 2021-10-04T00:00:00Z
A root cause analysis of Facebook's October 2021 outage, covering BGP route withdrawal, DNS failure, and operational recovery complexity.
Fastly | 2021-06-08T00:00:00Z
A production incident case study of Fastly's 2021 CDN outage, where a latent software bug was triggered by customer configuration.
Slack | 2021-01-04T00:00:00Z
A practical case study of Slack's January 2021 outage, focusing on capacity, dependency recovery, and customer-facing degradation.
Cloudflare | 2019-07-02T00:00:00Z
A production incident case study on Cloudflare's 2019 WAF outage, where a rule change caused CPU exhaustion at the edge.
Google Cloud | 2019-06-02T00:00:00Z
A concise incident analysis of the Google Cloud 2019 outage, where a configuration change caused network congestion and broad service impact.
GitHub | 2018-10-21T00:00:00Z
A clear explanation of GitHub's 2018 MySQL incident, including the failover failure mode, user impact, and engineering lessons.
Azure DevOps | 2018-09-04T00:00:00Z
A practical root cause analysis of the Azure DevOps 2018 outage, including regional dependency failure, recovery gaps, and reliability lessons.
AWS | 2017-02-28T00:00:00Z
A concise root cause analysis of the AWS S3 2017 outage, its impact, remediation, and reliability lessons for production engineers.