AWS | 2025-10-19T00:00:00Z

AWS 2025 Outage: DynamoDB DNS and US-EAST-1 Cascades

AWS's October 2025 outage began with a latent race condition in DynamoDB DNS automation that removed records for the US-EAST-1 regional endpoint and cascaded into dependent services.

Incident answer

Impact: DynamoDB, EC2 launches, Network Load Balancer, Lambda, ECS, EKS, Fargate, STS, Support, and other US-EAST-1 services saw errors, delays, or partial outages.

Root cause: A race condition between DynamoDB DNS automation components produced an incorrect empty DNS record for the regional endpoint and left automation unable to self-repair.

Lesson: Core service discovery automation needs concurrency safety, stale-plan protection, dependency-aware recovery, and manual break-glass procedures.

Quick Summary

On October 19 and 20, 2025, AWS had a major service disruption in the Northern Virginia, US-EAST-1, region. AWS's DynamoDB post-event summary explains that the first impact period involved increased DynamoDB API errors because clients could not resolve or connect to the regional DynamoDB endpoint.

The technical root cause was a latent race condition in DynamoDB's automated DNS management system. An incorrect empty DNS record was created for the regional endpoint, and the automation could not repair the inconsistent state without manual operator intervention.

Why It Mattered

US-EAST-1 is a heavily used AWS region, and DynamoDB is a dependency for many AWS services. Once the regional DynamoDB endpoint failed, other control planes and service subsystems began to degrade.

The incident is useful because it shows that "DNS issue" is rarely just DNS. Service discovery, leases, launch workflows, health checks, and throttling can all become part of the recovery story.

Root Cause Pattern

The pattern was a race condition in critical automation managing service discovery. A delayed DNS Enactor applied an old plan while another enactor had already advanced and cleaned up old state, removing the active regional endpoint records.

Investigation clues:

A core dependency fails before many downstream services.
Existing workloads may continue while new launches, new connections, or control-plane APIs fail.
Health checks become misleading because dependent network state is stale.
Recovery requires manual repair of automation state before dependent systems can drain backlogs.

Remediation Themes

Practical lessons:

Make automation concurrency-safe under delayed workers and stale plans.
Add velocity controls so health checks cannot remove too much capacity at once.
Test recovery workflows for lease expiration, backlog processing, and congestive collapse.
Keep operational runbooks for repairing automation state when self-healing fails.

What Engineers Should Practice

During a cloud dependency incident, separate running workloads from control-plane operations. "Existing instances are fine, but new launches fail" is an important clue that leases, provisioning, or metadata systems are involved.

The lesson is simple but uncomfortable: automation that repairs the platform can also become the thing that needs repair.