GitHub | 2018-10-21T00:00:00Z

GitHub 2018 MySQL Incident: A Root Cause Analysis

In GitHub's October 2018 incident, a brief network partition triggered a MySQL failover that left the database topology in an inconsistent state. The result was a backlog of queued work and prolonged degraded service while engineers put data integrity ahead of fast recovery.

Incident answer

Impact: GitHub users saw delayed webhooks, stale or inconsistent data, temporarily unavailable features, and degraded repository workflows.

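Root cause: A brief network partition between data centers triggered an automated MySQL failover. It left writes on both sides of the partition that had to be reconciled before replication could be restored.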

Lesson: Automated failover must be designed around data integrity, topology awareness, and tested recovery from partial partitions.

Quick Summary

On October 21, 2018, GitHub had a major production incident after a network event affected connectivity between data centers. The result was not a simple "database down" story. It was a distributed systems recovery problem involving MySQL topology, replication, queued work, and the need to protect data integrity. GitHub published a detailed October 21 post-incident analysis that walks through the database failover sequence.

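The incident is a classic case study because user-visible recovery took far longer than the network problem that started it: by GitHub's account, connectivity was restored in 43 seconds, while degraded service lasted just over 24 hours. Once a system enters a risky data state, the right response is often careful and slow.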

Why It Mattered

GitHub is a coordination system for software teams. When it degrades, it affects code review, deployment, CI/CD, webhooks, automation, and release work across many companies.

The incident also shows why incident response is not only about restoring green dashboards. The team had to reason about correctness, stale reads, queued writes, and whether fast remediation could create worse damage.

Root Cause Pattern

The pattern was an automated failover path that met a partial network partition. In distributed systems, a component can be alive yet unreachable from the vantage point that makes failover decisions. That is more dangerous than a clean crash, because automation may act on incomplete information and promote a new primary while the old one still holds writes that were never replicated.
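
To make the "alive but unreachable" problem concrete, here is a minimal sketch of a failover decision that refuses to act on a one-sided view. It is an illustrative policy only, not a description of how GitHub's tooling behaved. The site names, the `Observation` type, and the `decide_failover` function are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    NO_ACTION = "no_action"
    HOLD_FOR_HUMAN = "hold_for_human"
    FAILOVER = "failover"


@dataclass(frozen=True)
class Observation:
    observer_site: str       # site where the health checker runs (hypothetical names)
    primary_reachable: bool  # could this observer reach the primary?


def decide_failover(primary_site: str, observations: list[Observation]) -> Decision:
    """Classify the primary's state from several vantage points."""
    if not observations:
        # No information is not evidence of failure.
        return Decision.HOLD_FOR_HUMAN

    unreachable = [o for o in observations if not o.primary_reachable]
    if not unreachable:
        return Decision.NO_ACTION

    local = [o for o in observations if o.observer_site == primary_site]
    if not local:
        # Nobody inside the primary's own site reported. The primary may be
        # healthy and still taking writes behind a partition.
        return Decision.HOLD_FOR_HUMAN

    if len(unreachable) == len(observations):
        # Every vantage point, including a co-located one, agrees.
        return Decision.FAILOVER

    # Mixed views: a partial partition is the likeliest explanation.
    return Decision.HOLD_FOR_HUMAN
```

The design choice is that silence from the primary's own site counts as a reason to pause rather than a reason to promote. That trades some availability for a lower risk of ending up with two primaries.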

The recovery work had to answer questions like:

Which database node had the authoritative writes?
Which replicas were safe to promote?
Which background jobs could replay safely?
Which user-facing features should remain paused until consistency was restored?
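
The first two questions above can be sketched as set comparisons over each node's applied transactions. This is a simplified model: plain Python sets of made-up transaction IDs stand in for MySQL GTID sets, and both function names are hypothetical. On a real server, the built-in `GTID_SUBSET()` and `GTID_SUBTRACT()` SQL functions do the equivalent comparison.

```python
def compare_histories(a: set[str], b: set[str]) -> str:
    """Describe how two nodes' applied-transaction sets relate."""
    only_a = a - b
    only_b = b - a
    if not only_a and not only_b:
        return "identical"
    if not only_b:
        return "a_ahead"
    if not only_a:
        return "b_ahead"
    # Each side has writes the other lacks: neither is authoritative as-is.
    return "diverged"


def safe_promotion_candidates(
    old_primary: set[str], replicas: dict[str, set[str]]
) -> list[str]:
    """Return replicas that have applied everything the old primary had."""
    return sorted(
        name for name, applied in replicas.items() if old_primary <= applied
    )


if __name__ == "__main__":
    east = {"east:1", "east:2", "east:3"}
    west = {"east:1", "east:2", "west:1"}
    print(compare_histories(east, west))
    print(safe_promotion_candidates(east, {"r1": set(east), "r2": {"east:1"}}))
```

A "diverged" result is the case that forces slow recovery: no single node can be promoted without either losing writes or reconciling them by hand.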

Remediation Themes

The big lessons for production engineers:

Treat failover as a product surface, not only an infrastructure feature.
Run failure drills that include network partitions and stale topology views.
Make replication health and queue backlog visible to incident command.
Prefer explicit data-safety checkpoints over optimistic recovery.
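
One way to picture an explicit data-safety checkpoint is a gate that lists what still blocks resuming paused work, so incident command sees reasons instead of a bare yes or no. The sketch below is hypothetical throughout: the `RecoverySnapshot` fields, the 5-second lag threshold, and the function names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoverySnapshot:
    max_replica_lag_seconds: float  # worst replication lag across replicas
    unreconciled_writes: int        # writes not yet reconciled between sites
    queued_jobs: int                # backlog waiting to be drained


def blocking_checkpoints(
    snapshot: RecoverySnapshot, max_lag_seconds: float = 5.0
) -> list[str]:
    """Return human-readable reasons that paused work must stay paused."""
    blockers = []
    if snapshot.unreconciled_writes > 0:
        blockers.append(f"{snapshot.unreconciled_writes} unreconciled writes remain")
    if snapshot.max_replica_lag_seconds > max_lag_seconds:
        blockers.append(
            f"replica lag {snapshot.max_replica_lag_seconds:.0f}s exceeds {max_lag_seconds:.0f}s"
        )
    return blockers


def can_resume_queue(snapshot: RecoverySnapshot) -> bool:
    """The backlog itself never blocks resumption; unsafe data state does."""
    return not blocking_checkpoints(snapshot)
```

Note that `queued_jobs` is reported but never gates the decision. In this sketch a large backlog is a cost of waiting, not a reason to resume early.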

What Engineers Should Practice

When debugging a database incident, separate availability from correctness. A service can respond while still returning stale or unsafe state. Good incident response asks both "can users use it?" and "can we trust what it is doing?"

That distinction is why database incidents often need slower, more deliberate recovery than stateless service outages.
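
The availability versus correctness split can be sketched as a health report with two independent fields instead of a single up/down flag. The names and the 2-second staleness budget below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthReport:
    available: bool    # "can users use it?"
    trustworthy: bool  # "can we trust what it is doing?"


def assess(
    responds: bool, replica_lag_seconds: float, max_staleness_seconds: float = 2.0
) -> HealthReport:
    """Score availability and correctness separately."""
    fresh = replica_lag_seconds <= max_staleness_seconds
    return HealthReport(available=responds, trustworthy=responds and fresh)


def user_facing_state(report: HealthReport) -> str:
    """Map the two signals to a status a dashboard could show."""
    if not report.available:
        return "down"
    if not report.trustworthy:
        return "degraded: possibly stale data"
    return "healthy"
```

A service that answers requests from a replica that is minutes behind would show as "degraded" here, where a single liveness probe would report it as green.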

External References

GitHub Blog: October 21 post-incident analysis