Friday December 16th 2022, 20:51 - 21:31 UTC
All tests waiting to be assigned to a virtual device during the incident were marked as failed. The error rate was ~70-80%, but primarily impacted customers requesting specific test devices.
Demand for macOS increased to a point where it passed what was available. As this happened, new tests began to queue up which triggered the clearing of the new jobs queue resulting in all of the tests waiting in this queue being marked as failed. While the underlying issue was with macOS and iOS capacity it impacted all test types as the new test queue (which is shared) was backed up.
Clearing new tests queue restored the system from starvation; no additional action is usually required. In some cases, even after removing the new tests queue, the starvation comes back, and another clearing is performed.
We are looking at ways to increase our capacity specifically for macOS and iOS. We are also looking into ways that jobs can be cleared from queues by image name or platform, rather than clear the whole queue.