2024-10-09 api.foundries.io outage

We have a service called “ota-lite” that is the user-facing portion of our device management platform.

The service was unavailable from 20:13 UTC to 20:35 UTC. During this time, actions like publishing TUF Targets and managing devices (through user-facing APIs) were not possible. The device-gateway remained up during this outage.

What Happened

On the previous day, 2024-10-08, we performed our periodic rotation of the secure keys and tokens used by our backend. After rotating the keys, we restarted each container affected by the rotation to make sure it continued to function properly. We also deleted some old service accounts and tokens no longer used by our system.

The next day, while we were trying to deploy a minor API change, the service failed to start. An attempt to roll back the change couldn’t run either.

The failure logs showed that we were referencing a key that didn’t exist.

The missing key was for a minor part of the system where we keep backups of each version of Targets.json published by our customers. It’s not strictly required, so the quick temporary fix was to comment out this code to give us breathing room.

The root cause wound up being a bad/naive grep expression combined with bad variable naming. As a result, we effectively rotated not only a secret’s content but also its name, while the API K8s specs were still pointing to the old secret name. Because the API was not restarted during this change, it continued to work as if nothing had changed. When a new API version was rolled out, it started failing due to the missing required secret.
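A check along the following lines could catch this kind of mismatch before a rollout. It is only a sketch, not the tooling we actually run: it walks a Deployment-style manifest (already parsed into a Python dict) and reports any secret name the spec references that is not in the set of secrets known to exist. The function names, the manifest, and the secret names are all hypothetical, and references marked optional are not treated specially.

```python
from typing import Iterable, Set


def referenced_secrets(deployment: dict) -> Set[str]:
    """Collect every Secret name a Deployment-style manifest refers to."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    names: Set[str] = set()

    # Secrets mounted as volumes.
    for volume in pod_spec.get("volumes", []):
        secret = volume.get("secret")
        if secret and "secretName" in secret:
            names.add(secret["secretName"])

    # Secrets injected as environment variables.
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        for env in container.get("env", []):
            ref = env.get("valueFrom", {}).get("secretKeyRef")
            if ref and "name" in ref:
                names.add(ref["name"])
        for source in container.get("envFrom", []):
            ref = source.get("secretRef")
            if ref and "name" in ref:
                names.add(ref["name"])
    return names


def missing_secrets(deployment: dict, existing: Iterable[str]) -> Set[str]:
    """Return referenced Secret names that are absent from `existing`."""
    return referenced_secrets(deployment) - set(existing)


# Hypothetical manifest and secret names, for illustration only.
api_deployment = {
    "spec": {"template": {"spec": {
        "volumes": [
            {"name": "backup-creds",
             "secret": {"secretName": "targets-backup-key-old"}},
        ],
        "containers": [{
            "name": "api",
            "env": [
                {"name": "API_TOKEN",
                 "valueFrom": {"secretKeyRef": {"name": "api-token", "key": "token"}}},
            ],
        }],
    }}},
}

# After the rotation only the renamed secret exists, so the old name is flagged.
missing = missing_secrets(api_deployment, existing={"api-token", "targets-backup-key-new"})
print(sorted(missing))  # ['targets-backup-key-old']
```

Run right after a key rotation, a check like this would flag the stale name immediately instead of at the next deployment.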

Timeline of Events

20:12 UTC - We rolled out a minor change to OTA Lite.
20:13 UTC - We noticed the deployment had failed.
20:18 UTC - We had a first workaround. However, GCP Cloud Build had a huge cache miss, and the build took minutes rather than seconds.

20:30 UTC - We had a second workaround to get the system up.
20:35 UTC - The workaround was built and deployed. Things were functional again.
21:00 UTC - The real fix was deployed.

Lessons Learned

We need to fix the GCP Cloud Build caching problem by changing how we build our container. In a pinch, it was our bottleneck.
We need better container health checks. The service was passing a check that only required it to be up for about a second. However, it was crashing shortly after that.
Kubernetes secrets, GCP service-accounts, and keys need 2-way links so that things can be looked up from either direction.
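The 2-way link idea in the last point can be sketched as a small index that is updated whenever a key is created or rotated. This is an illustration under assumptions, not our implementation: the class, its method names, and the secret and service-account names are hypothetical, and in practice the same links might instead be stored as annotations on the Kubernetes secrets themselves.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


class CredentialIndex:
    """Track which Kubernetes secret holds a key for which service account."""

    def __init__(self) -> None:
        self._secrets_by_account: Dict[str, Set[str]] = defaultdict(set)
        self._account_by_secret: Dict[str, str] = {}

    def link(self, secret_name: str, service_account: str) -> None:
        """Record the link in both directions, replacing any stale one."""
        previous = self._account_by_secret.get(secret_name)
        if previous is not None and previous != service_account:
            self._secrets_by_account[previous].discard(secret_name)
        self._account_by_secret[secret_name] = service_account
        self._secrets_by_account[service_account].add(secret_name)

    def account_for(self, secret_name: str) -> Optional[str]:
        """Look up the service account behind a secret, if any."""
        return self._account_by_secret.get(secret_name)

    def secrets_for(self, service_account: str) -> Set[str]:
        """Look up every secret that holds a key for a service account."""
        return set(self._secrets_by_account.get(service_account, set()))


# Hypothetical names, for illustration only.
index = CredentialIndex()
account = "targets-backup@example-project.iam.gserviceaccount.com"
index.link("targets-backup-key", account)
index.link("targets-backup-key-staging", account)

print(index.account_for("targets-backup-key"))
print(sorted(index.secrets_for(account)))
```

With both directions available, deleting a service account can start from `secrets_for` to see which secrets, and therefore which specs, still depend on it.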