Google Cloud | 2025-06-12T00:00:00Z

Google Cloud 2025 Outage: Service Control and Global API Failures

Google Cloud's 2025 outage happened when invalid quota policy data exercised an unprotected Service Control code path, causing regional binaries to crash and many APIs to return 503s.

Incident answer

Impact: Many Google Cloud, Google Workspace, and Google Security Operations products saw increased 503 errors and API access issues globally.

Root cause: A Service Control feature lacked proper error handling and feature flag protection; unintended blank policy fields triggered a null pointer crash loop after global replication.

Lesson: Globally replicated control-plane data needs staged propagation, defensive parsing, feature flags, and fail-open behavior for noncritical checks.

Quick Summary

On June 12, 2025, Google Cloud experienced a broad service disruption across many products. Google's Service Health incident report says the outage involved Service Control, the policy and quota-checking system used by Google API management and control planes.

A new Service Control feature had shipped earlier, but the faulty code path ran only when a later policy change inserted data with unintended blank fields. That metadata replicated globally within seconds, sending regional Service Control deployments into crash loops.

Why It Mattered

This was a control-plane incident, so the visible symptoms appeared across many otherwise separate services. Customers saw 503s and access failures in products that shared the affected API management path.

The incident is a sharp example of why control-plane dependencies can create global failures even when individual data-plane systems are healthy.

Root Cause Pattern

The pattern was invalid global metadata meeting an insufficiently defended control-plane binary. The new quota-policy code lacked appropriate error handling and was not protected behind a feature flag that could have limited exposure.
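
The missing defenses described above can be illustrated with a minimal Python sketch. It is not Google's code: the `QuotaPolicy` shape, the `check_quota` function, and the `new_path_enabled` flag are hypothetical names chosen for illustration, and the sketch assumes the quota check is noncritical enough to fail open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    # Hypothetical policy shape, not the real Service Control schema.
    project: str
    limit: Optional[int] = None  # None models an unintended blank field


def check_quota(policy: Optional[QuotaPolicy], usage: int,
                new_path_enabled: bool) -> bool:
    """Return True if the request should be allowed."""
    if not new_path_enabled:
        # Feature flag off: the new check is skipped entirely, so a bad
        # policy cannot reach the new code path.
        return True
    if policy is None or policy.limit is None:
        # Defensive parsing: treat a missing or blank policy as invalid
        # input and fail open instead of dereferencing it and crashing.
        return True
    return usage < policy.limit
```

With the flag on, a blank policy such as `QuotaPolicy("demo")` is allowed through rather than raising, while a populated policy is still enforced. Whether failing open is acceptable depends on the check; an authorization decision would usually need the opposite default.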

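Investigation clues in this class: 503s that appear across otherwise unrelated products at the same moment, regional binaries crash-looping together, and a policy or metadata change that replicated globally shortly before the errors began.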

Remediation Themes

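Important reliability lessons: stage the propagation of globally replicated control-plane data, parse that data defensively, guard new code paths behind feature flags, and fail open on noncritical checks.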
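
Staged propagation is the theme that most directly limits blast radius, and a rough Python sketch may make it concrete. Everything here is an assumption for illustration: `staged_rollout`, the stage lists, and the `apply`, `is_healthy`, and `rollback` callbacks are hypothetical, and a real system would also wait for a bake period before judging each stage.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    stages: Sequence[Sequence[str]],
    apply: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Tuple[bool, List[str]]:
    """Apply a change stage by stage, undoing it at the first unhealthy stage.

    Returns (succeeded, regions that received the change at some point).
    """
    touched: List[str] = []
    for stage in stages:
        for region in stage:
            apply(region)
            touched.append(region)
        # Check health before moving on, so a bad value stops here
        # instead of reaching every remaining region.
        if not all(is_healthy(region) for region in stage):
            for region in reversed(touched):
                rollback(region)
            return False, touched
    return True, touched
```

For example, with stages `[["canary"], ["us", "eu"], ["asia"]]` and a health check that fails for `"eu"`, the change is rolled back from `canary`, `us`, and `eu`, and `asia` never receives it.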

What Engineers Should Practice

When an outage cuts across product boundaries, map the shared control-plane path before debugging individual services. Authorization, quota, policy, and configuration systems often explain failures that look unrelated at first.
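
One way to practice this mapping is to intersect the dependency sets of the failing services. The sketch below assumes a dependency map is available as a plain dictionary; the function names and the service names in the usage note are hypothetical.

```python
from typing import Dict, Iterable, Set


def transitive_deps(deps: Dict[str, Set[str]], service: str) -> Set[str]:
    """Collect every direct and indirect dependency of one service."""
    seen: Set[str] = set()
    stack = list(deps.get(service, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(deps.get(dep, ()))
    return seen


def shared_dependencies(deps: Dict[str, Set[str]],
                        failing: Iterable[str]) -> Set[str]:
    """Return the dependencies common to all failing services."""
    closures = [transitive_deps(deps, service) for service in failing]
    return set.intersection(*closures) if closures else set()
```

Given a map where `storage-api` and `compute-api` both depend on `api-gateway`, which in turn depends on `policy-check`, calling `shared_dependencies(deps, ["storage-api", "compute-api"])` returns `api-gateway` and `policy-check`: the candidates to inspect before debugging either product on its own.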

The practical takeaway: global consistency is powerful, but a bad value can become globally bad very quickly.
