CI Failures in AWS

Over the course of April 7th, our us-east-2 connection issues began to improve. By the morning of April 8th, we had observed no CI failures caused by AWS networking issues.

Background

Our main service is hosted in Google Kubernetes Engine. However, we run a multi-cloud setup. The primary reason is not that we love pain and complexity:

Over the past year, we have continued to increase our use of AWS for CI builds. The price/performance has improved, and AWS also lets us be much more dynamic. We are at the point where almost all new customers have their CI jobs running in AWS's us-east-2 region.

Code comes into our backend (GCP); a CI job (AWS) runs and interacts with that backend.
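
In other words, every CI run depends on a series of TCP connections from AWS back into GCP. The following is only a rough sketch of that interaction, with hypothetical endpoints and payloads (the real API is not shown here):

```python
import json
import urllib.request

# Hypothetical backend URL; our real API endpoints are not part of this sketch.
BACKEND = "https://backend.example.com/api"

def fetch_job(worker_id: str) -> dict:
    """A CI worker in AWS asks the GCP-hosted backend for work."""
    url = f"{BACKEND}/jobs/next?worker={worker_id}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def report_status(job_id: str, status: str) -> None:
    """Every status update is another TCP connection back into GCP."""
    body = json.dumps({"status": status}).encode()
    req = urllib.request.Request(
        f"{BACKEND}/jobs/{job_id}/status",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=10).close()
```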

What Happened

On April 5th, things went bad. After some strace debugging, we observed that a high rate of connect system calls were hanging when hitting GCP services. It took us a while to prove this was not our own misconfiguration or mistake, but eventually we could see that AWS us-east-2 could not make reliable TCP connections to several of our services inside GCP.
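
For illustration, a plain TCP connect probe is enough to surface this kind of failure. This is a minimal sketch, not our actual tooling, with a hypothetical host standing in for one of the GCP services:

```python
import socket
import time

# Hypothetical endpoint standing in for one of our GCP-hosted services;
# the real hosts and ports are not shown here.
HOST, PORT = "api.example.com", 443
ATTEMPTS = 50
TIMEOUT = 5.0  # seconds before we treat a connect() as hung

failures = 0
for i in range(ATTEMPTS):
    start = time.monotonic()
    try:
        # A plain TCP connect -- roughly the syscall strace showed hanging.
        with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
            print(f"attempt {i}: connected in {time.monotonic() - start:.2f}s")
    except OSError as exc:
        failures += 1
        print(f"attempt {i}: failed after {time.monotonic() - start:.2f}s ({exc})")
    time.sleep(1)

print(f"{failures}/{ATTEMPTS} connect attempts failed")
```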

At this point, it was a scramble. Our container builds are the easiest ones to manage, so we were able to move those into a new AWS region, us-west-1. This fixed container builds, but our Yocto Project builds, while less frequent, were still having issues.

Yocto builds are more complex for us to host. A Yocto build produces an entire operating system from scratch, and without a correctly managed cache, builds become painfully slow and expensive. That cache is handled via a set of sharded NFS servers, so moving a customer to a new data center means we also need to help set up NFS and caching for them.
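
To give a sense of why that matters: each customer's builds need to land near the NFS shard that holds their cache. Below is a toy sketch of that kind of mapping, with hypothetical shard hostnames (not our actual layout):

```python
import hashlib

# Hypothetical NFS cache shards within a region; the real hosts are not shown here.
SHARDS = [
    "nfs-cache-0.example.com",
    "nfs-cache-1.example.com",
    "nfs-cache-2.example.com",
]

def cache_shard(customer_id: str) -> str:
    """Pin a customer to a stable NFS shard so their Yocto cache stays warm."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Moving a customer to another region means standing up equivalent shards there
# and re-seeding (or rebuilding) the cache before their builds are fast again.
print(cache_shard("customer-42"))
```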

We were able to move some workloads over to a different data center, and in the meantime AWS gradually resolved their networking issue(s).

By yesterday afternoon, I was able to breathe again. There was still a focus on monitoring every single CI job for anomalies, but things were looking stable. As of now, we've gone about 24 hours without seeing a CI job inside us-east-2 fail due to infrastructure issues.

What We Learned