2022-October-07 Service Incident (2)
Incident Report for Sauce Labs
Postmortem

Dates:

Friday October 7th 2022, 08:30 - 14:54 UTC

Friday October 7th 2022, 20:30 - 23:04 UTC

What happened:

Customers attempting to start Windows 10 tests were affected by high wait times and elevated error rates.

After initial remedial action, which involved aggressively limiting new tests, we temporarily saw an improvement in wait times. However, this raised the error rate even further, and wait times eventually degraded back to the initially reported levels.

Why it happened:

A number of Windows devices that are part of our virtual device pool were inadvertently disabled from launching tests but were still marked as online and available. The virtual device pool represents the collection of virtual machines available to run customer tests. As test requests came in, a portion of those requests were assigned to devices that could never start a job because their launch setting was disabled, which led to the elevated error rate.
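As an illustration of the failure mode, here is a minimal sketch assuming a hypothetical allocator and a VirtualDevice record with online and launch_enabled flags (these names are illustrative, not Sauce Labs internals). Filtering candidates only on the online flag keeps handing tests to devices whose launch setting is disabled:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VirtualDevice:
    device_id: str
    online: bool          # reported to the scheduler as available
    launch_enabled: bool  # whether the device may actually start a test


def assign_test(pool: list[VirtualDevice]) -> Optional[VirtualDevice]:
    """Pick a device for a new test request.

    Filtering only on `online` reproduces the incident: devices that are
    online but launch-disabled are still selected, and every test assigned
    to them fails.
    """
    candidates = [d for d in pool if d.online]  # missing: `and d.launch_enabled`
    return candidates[0] if candidates else None
```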

As the error rate increased, failed tests were retried, and those retries were again assigned to the impacted virtual devices. The high volume of new test requests caused the services that process and allocate tests to devices to become overloaded. As that allocation process slowed down, device availability status wasn't updated fast enough, which led the system to conclude that the virtual device pool was sized appropriately and that it didn't need to boot more devices.
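To make the stale-data effect concrete, here is a minimal sketch, assuming a scaler that sizes the pool from the last reported count of free devices (the function and parameter names are hypothetical):

```python
import time


def devices_to_boot(target_free: int, reported_free: int,
                    last_update: float, max_staleness_s: float = 30.0) -> int:
    """Decide how many additional devices to boot.

    If `reported_free` is stale because the allocation loop is overloaded,
    the scaler keeps trusting an outdated number and concludes the pool is
    already large enough -- the behaviour seen during this incident. The
    staleness check forces a conservative scale-up instead.
    """
    if time.time() - last_update > max_staleness_s:
        return target_free  # data is stale: assume nothing is free and boot up to the target
    return max(0, target_free - reported_free)
```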

How we fixed it:

The restoration involved the following steps:

  • Limiting the queue size to drop the job backlog
  • Increasing CPU resources on internal services
  • Restarting resource-starved pods
  • Off-lining launch-disabled Windows devices (sketched below)
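Continuing the earlier illustration (again with hypothetical names, not our internal tooling), the last step amounts to marking every launch-disabled device offline so the allocator stops considering it:

```python
from dataclasses import dataclass


@dataclass
class VirtualDevice:
    device_id: str
    online: bool
    launch_enabled: bool


def offline_launch_disabled(pool: list[VirtualDevice]) -> int:
    """Mark launch-disabled devices as offline so the allocator stops
    assigning tests (and retries) to machines that can never start a job.
    Returns the number of devices taken out of rotation."""
    taken_offline = 0
    for device in pool:
        if device.online and not device.launch_enabled:
            device.online = False
            taken_offline += 1
    return taken_offline
```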

What we are doing to prevent it from happening again:

The team reviewed how the system processes new test requests, since this was a critical contributing factor to the incident. We looked specifically at how many new tests are picked up in each processing loop and found that decreasing the number of tests processed in a single pass led to better performance. This is achieved through a configuration change, which we vetted in our staging environment and then promoted to production. The incident was exacerbated by the influx of new test requests, and this change directly addresses that condition by controlling the processing rate more tightly.
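As a hedged illustration of that change (the loop and parameter below are hypothetical, not our actual scheduler), capping how many requests are pulled per pass keeps each pass short, so availability data is refreshed more often:

```python
import queue


def next_batch(pending: "queue.Queue[str]", batch_size: int = 50) -> list[str]:
    """Drain at most `batch_size` test requests in one scheduling pass.

    A smaller batch keeps each pass short, so device availability can be
    refreshed between passes instead of drifting stale under a flood of
    retries. In practice `batch_size` would come from configuration so it
    can be tuned without a redeploy.
    """
    batch: list[str] = []
    while len(batch) < batch_size:
        try:
            batch.append(pending.get_nowait())
        except queue.Empty:
            break
    return batch
```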

We are also reviewing how we manage the virtual device pools. The two main areas of focus and improvement are:

  1. Improving how we scale the services responsible for device allocation.
  2. Making the process responsible for prebooting devices more responsive to rapid changes in request rate (a rough sketch follows the list).
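As a rough sketch of the second item (hypothetical names and numbers; the real prebooting service is more involved), a prebooter that scales from a short-window request rate rather than a long-running average reacts faster to a sudden spike such as a retry storm:

```python
import time
from collections import deque
from typing import Deque, Optional


class PrebootPlanner:
    """Estimate how many devices to preboot from the recent request rate.

    A short sliding window (rather than a long-running average) makes the
    estimate react quickly when the request rate spikes.
    """

    def __init__(self, window_s: float = 60.0, secs_per_test: float = 90.0):
        self.window_s = window_s            # how far back to look at arrivals
        self.secs_per_test = secs_per_test  # rough time a device is busy per test
        self.arrivals: Deque[float] = deque()

    def record_request(self, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.arrivals.append(now)
        # Drop arrivals that fall outside the window.
        while self.arrivals and now - self.arrivals[0] > self.window_s:
            self.arrivals.popleft()

    def devices_needed(self) -> int:
        # Requests per second over the window, times how long a device stays busy.
        rate = len(self.arrivals) / self.window_s
        return int(rate * self.secs_per_test) + 1
```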
Posted Oct 19, 2022 - 16:31 UTC

Resolved
Testing has been stable for the last 30 minutes in our US-West-1 data center. All testing is running as expected, and all services are fully operational.
Posted Oct 07, 2022 - 23:06 UTC
Monitoring
After remedial action was taken, Windows tests are starting as expected in our US-West-1 data center. We are monitoring.
Posted Oct 07, 2022 - 21:54 UTC
Investigating
We are seeing abnormally high demand for Windows tests in our US-West-1 data center, causing tests to start slowly or not start at all. We are investigating and taking remedial action.
Posted Oct 07, 2022 - 21:06 UTC
This incident affected: Automated Browser Testing (US-West) and Live Browser Testing (US-West).