2025-08-22 API and Registry Disruption

On August 22, while the team was performing routine maintenance (a Kubernetes upgrade) on a cluster, api.foundries.io and hub.foundries.io experienced unexpected outages.

Timeline of Events

14:54 UTC - api.foundries.io becomes unreachable
14:58 UTC - api.foundries.io is restored
18:30 UTC - Docker registry (hub.foundries.io) is not responding to requests for private containers
19:43 UTC - Docker registry is restored

Root Cause

api.foundries.io

A critical component of api.foundries.io was configured to use a container image that we had removed. The service kept working because the image was still cached locally on its host. However, the Kubernetes upgrade moved the service to a new host where the image was not cached, so the image could not be pulled and the component could not start. We quickly restored the image and the service came back.
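One way to catch this class of problem before a node is replaced is to compare the images that workloads reference against the images that still exist in the registry. The following is a minimal sketch, not the tooling we actually use: the function name, workload names, and image references are all hypothetical, and both inputs are assumed to have been collected elsewhere (for example from the cluster and the registry).

```python
from typing import Dict, Iterable, List, Set


def find_missing_images(
    workload_images: Dict[str, Iterable[str]],
    registry_images: Set[str],
) -> Dict[str, List[str]]:
    """Return, per workload, the referenced images absent from the registry.

    A workload that only runs because a node still has the image cached
    shows up here even though it currently looks healthy.
    """
    missing: Dict[str, List[str]] = {}
    for workload, images in workload_images.items():
        absent = sorted(set(images) - registry_images)
        if absent:
            missing[workload] = absent
    return missing


# Hypothetical inventory: one workload still references a removed image.
workloads = {
    "api-component": ["registry.example.com/api-component:1.4"],
    "worker": ["registry.example.com/worker:2.0"],
}
available = {"registry.example.com/worker:2.0"}

print(find_missing_images(workloads, available))
```

Running a check like this before an upgrade would flag any workload that could not start on a fresh host.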

hub.foundries.io

The node pool used for a component of the registry service was updated. The new Kubernetes components use more RAM than the previous versions, which left less memory available for workloads. One service had a resource constraint that could no longer be satisfied, so its Pod became unschedulable after the upgrade. This affected customers' private containers. Our monitoring only tested operations on public containers, so it did not alert. The issue went unnoticed for about an hour before we spotted the problem and fixed it.
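The monitoring gap can be illustrated with a small sketch of a probe set that is required to cover both public and private repositories. This is an assumption-laden illustration rather than our actual monitoring: `ProbeTarget`, the repository names, and the simulated check are hypothetical, and a real check would perform an authenticated registry operation instead.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ProbeTarget:
    repository: str  # hypothetical repository name
    private: bool    # True if access requires authentication


def covers_both_visibilities(targets: List[ProbeTarget]) -> bool:
    """True only if the probe set contains public and private targets."""
    return {t.private for t in targets} == {True, False}


def failed_probes(
    targets: List[ProbeTarget],
    check: Callable[[ProbeTarget], bool],
) -> List[ProbeTarget]:
    """Run the check against every target and return the ones that failed."""
    return [t for t in targets if not check(t)]


TARGETS = [
    ProbeTarget("example/public-app", private=False),
    ProbeTarget("example/private-app", private=True),
]


def simulated_check(target: ProbeTarget) -> bool:
    # Stand-in for a real registry request: mimics this incident,
    # where public operations succeeded and private ones failed.
    return not target.private


print(covers_both_visibilities(TARGETS))
print([t.repository for t in failed_probes(TARGETS, simulated_check)])
```

With only the public target in the list, the same simulated failure would return no failed probes, which matches what happened during this incident.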