Friday October 7th 2022, 08:30 - 14:54 UTC
Friday October 7th 2022, 20:30 - 23:04 UTC
Customers attempting to start Windows 10 tests were affected by high wait times and elevated error rates.
After initial remedial action, which involved aggressive limiting of new tests, we temporarily saw an improvement in wait times. However, this elevated the error rate even further, and wait times eventually degraded back to the initially reported times.
A number of Windows devices that are part of our virtual device pool were inadvertently disabled from launching tests but were still marked as online and available. The virtual device pool represents the collection of virtual machines available to run customer tests. As test requests came in, a portion of those requests were assigned to devices that could never start a job because their launch setting was disabled, which led to the elevated error rate.
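The mismatch described above, devices reported as online while unable to launch, can be illustrated with a minimal sketch. The `Device` fields and function names here are hypothetical and are not taken from our actual systems; the sketch only shows the idea of treating a device as assignable when it is both online and launch-enabled.

```python
from dataclasses import dataclass


@dataclass
class Device:
    # Hypothetical fields, for illustration only.
    device_id: str
    online: bool
    launch_enabled: bool


def assignable_devices(pool):
    """Return devices that are online AND allowed to launch tests."""
    return [d for d in pool if d.online and d.launch_enabled]


def misreported_devices(pool):
    """Return devices that look available but can never start a job."""
    return [d for d in pool if d.online and not d.launch_enabled]


pool = [
    Device("win10-a", online=True, launch_enabled=True),
    Device("win10-b", online=True, launch_enabled=False),
    Device("win10-c", online=False, launch_enabled=True),
]
usable = assignable_devices(pool)
hidden = misreported_devices(pool)
```

In this toy pool, an allocator that checks only `online` would count two devices as available, while only one of them could actually start a test.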
As the error rate increased, failed tests triggered retries, which were again assigned to the impacted virtual devices. The high volume of new test requests caused the services that process and allocate tests to devices to become overloaded. As that allocation process slowed down, device availability status wasn’t updated fast enough, which caused the system to conclude that the virtual device pool was sized appropriately and that it didn’t need to boot more devices.
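To give a rough sense of how retries amplify load, the sketch below uses a simplified model that is an assumption for illustration, not a measurement from this incident: each attempt independently lands on a disabled device with probability `p_bad`, and a test is retried at most `max_retries` times. Under that model, the expected number of attempts per test is the partial geometric sum 1 + p + p^2 + ... + p^max_retries.

```python
def expected_attempts(p_bad, max_retries):
    """Expected allocation attempts per test in a simplified retry model.

    Assumes (hypothetically) that every attempt fails independently with
    probability p_bad and that a failed attempt is retried until
    max_retries retries have been used.
    """
    if not 0.0 <= p_bad <= 1.0:
        raise ValueError("p_bad must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # Attempt k (0-based) happens only if the previous k attempts all failed.
    return sum(p_bad ** k for k in range(max_retries + 1))


healthy = expected_attempts(0.0, 3)   # no disabled devices
degraded = expected_attempts(0.5, 1)  # half of assignments fail, one retry
```

Even this simple model shows the feedback effect: the larger the share of assignments that land on unusable devices, the more allocation work each customer test generates.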
The restoration involved the following steps:
The team reviewed how the system processes new test requests, since this was a critical contributing factor to the incident. We specifically looked at how many new tests were picked up in our processing loop and found that decreasing the number of tests processed during a single loop led to better performance. We achieved this through a configuration change, which we vetted in our staging environment and then promoted to production. The incident was exacerbated by the influx of new test requests, and this configuration change directly addresses that condition by controlling the processing more tightly.
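The idea of capping how many tests a single loop iteration picks up can be sketched as follows. The function name, the `max_batch_size` parameter, and the `allocate` callback are hypothetical; the actual configuration setting and its value are not described here.

```python
from collections import deque


def process_one_iteration(queue, max_batch_size, allocate):
    """Process at most max_batch_size pending requests in one loop pass.

    queue is a deque of pending test requests; allocate is a callback
    that assigns a single request to a device (both are illustrative).
    """
    batch_count = min(max_batch_size, len(queue))
    processed = []
    for _ in range(batch_count):
        request = queue.popleft()
        allocate(request)
        processed.append(request)
    return processed


pending = deque(range(10))
allocated = []
first_pass = process_one_iteration(pending, 3, allocated.append)
```

A smaller cap keeps each pass short, so state such as device availability can be refreshed between passes instead of waiting behind one large batch.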
We are also looking at how we manage the virtual device pools. The two main areas of focus and improvement are: