Monday May 9th 2022, 15:00 - 17:00 UTC
Tests running in our US-East-1 data center (Headless Cloud) were failing to start.
The cause of this issue was a bug in our user service which returned an error message when trying to allocate a Sauce Connect tunnel for new tests. This caused a cascading situation where new tests were unable to start and resources to start further tests were exhausted.
We scaled down services and removed all running tests and cleaned the queue of waiting tests, in order to free up resources.
We plan on changing the in order in which we reserve resources within the system in order to prevent the cascading error condition that can starve resources when tunnels cannot be allocated.