Status of Foundries.io web services
We have a service referred to as “ota-lite” that is the user facing portion of our device management platform.
The service was unavailable from 20:13UTC to 20:35UTC. During this time, actions like publishing TUF Targets and managing devices (from user facing APIs) were not possible. The device-gateway remained up during this outage.
On the previous day, 2024-10-08, we performed a task where we periodically rotate secure keys and tokens used by our backend. After rotating keys, we restarted each container affected by the key rotation to make sure they continued to function properly. We also deleted some old service accounts and tokens no longer used by our system.
The next day while trying to deploy a minor API change, the service was failing to start. An attempt to rollback the change couldn’t run either.
The failure logs pointed at the fact we were using a key that didn’t exist.
The missing key was for a minor part of the system where we keep backups of each version of Targets.json published by our customers. It’s not strictly required, so the quick temporary fix was to comment out this code to give us breathing room.
The root cause wound up being a combination of a bad/naive Grep expression combined with a bad variable naming. As a result, we de facto rotated not only a secret content, but also a secret name —while the API K8s specs were still pointing to the old secret name. As the API was not restarted during this change, it continued to work as if nothing changed. But at the time of the roll out of a new API version, it started failing due to the missing required secret.
The service was unavailable from 20:13UTC to 20:35UTC. During this time, actions like publishing TUF Targets and managing devices (from customer facing APIs) were not possible. The device-gateway remained up during this outage.