CI Failures in AWS

Over the course of April 7th, our us-east-2 connection issues began to improve. By the morning of April 8th, we had observed no CI failures caused by AWS networking issues.

Background

Our main service is hosted in Google Kubernetes Engine. However, we run a multi-cloud setup. The primary reason is not that we love pain and complexity:

Over the past year, we have continued to increase our use of AWS for CI builds. The price/performance has improved, and AWS also lets us be much more dynamic. We are at the point where almost all new customers have their CI jobs running in AWS's us-east-2 region.

Code comes into our backend (GCP); a CI job (AWS) runs and interacts with that backend.
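
In other words, every CI run depends on a series of TCP connections from AWS back into GCP. The following is only a rough sketch of that interaction, with hypothetical endpoints and payloads (the real API is not shown here):

```python
import json
import urllib.request

# Hypothetical backend URL; our real API endpoints are not part of this sketch.
BACKEND = "https://backend.example.com/api"

def fetch_job(worker_id: str) -> dict:
    """A CI worker in AWS asks the GCP-hosted backend for work."""
    url = f"{BACKEND}/jobs/next?worker={worker_id}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def report_status(job_id: str, status: str) -> None:
    """Every status update is another TCP connection back into GCP."""
    body = json.dumps({"status": status}).encode()
    req = urllib.request.Request(
        f"{BACKEND}/jobs/{job_id}/status",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=10).close()
```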

What Happened

On April 5th, things went bad. After some strace debugging, we observed that a high rate of connect system calls were hanging when hitting GCP services. It took us a while to prove this was not our own misconfiguration or mistake, but eventually we could see that AWS us-east-2 could not make reliable TCP connections to several of our services inside GCP.
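
For illustration, a plain TCP connect probe is enough to surface this kind of failure. This is a minimal sketch, not our actual tooling, with a hypothetical host standing in for one of the GCP services:

```python
import socket
import time

# Hypothetical endpoint standing in for one of our GCP-hosted services;
# the real hosts and ports are not shown here.
HOST, PORT = "api.example.com", 443
ATTEMPTS = 50
TIMEOUT = 5.0  # seconds before we treat a connect() as hung

failures = 0
for i in range(ATTEMPTS):
    start = time.monotonic()
    try:
        # A plain TCP connect -- roughly the syscall strace showed hanging.
        with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
            print(f"attempt {i}: connected in {time.monotonic() - start:.2f}s")
    except OSError as exc:
        failures += 1
        print(f"attempt {i}: failed after {time.monotonic() - start:.2f}s ({exc})")
    time.sleep(1)

print(f"{failures}/{ATTEMPTS} connect attempts failed")
```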

At this point, it was a scramble. Our container builds are the easiest ones to manage, so we were able to move those into a new AWS region, us-west-1. This fixed container builds, but our Yocto Project builds, while less frequent, were still having issues.

Yocto builds are more complex for us to host. A Yocto build produces an entire operating system from scratch, and without a correctly managed cache, builds become painfully slow and expensive. That cache is handled via a set of sharded NFS servers, so moving a customer to a new data center means we also need to help set up NFS and caching for them.
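
To give a sense of why that matters: each customer's builds need to land near the NFS shard that holds their cache. Below is a toy sketch of that kind of mapping, with hypothetical shard hostnames (not our actual layout):

```python
import hashlib

# Hypothetical NFS cache shards within a region; the real hosts are not shown here.
SHARDS = [
    "nfs-cache-0.example.com",
    "nfs-cache-1.example.com",
    "nfs-cache-2.example.com",
]

def cache_shard(customer_id: str) -> str:
    """Pin a customer to a stable NFS shard so their Yocto cache stays warm."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Moving a customer to another region means standing up equivalent shards there
# and re-seeding (or rebuilding) the cache before their builds are fast again.
print(cache_shard("customer-42"))
```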

We were able to move some workloads over to a different data center, and in the meantime AWS gradually resolved their networking issue(s).

By yesterday afternoon, I was able to breathe again. There was still a focus on monitoring every single CI job for anomalies, but things were looking stable. As of now, we've gone about 24 hours without seeing a CI job inside us-east-2 fail due to infrastructure issues.

What We Learned