2023-February-3 Service Incident 3

Incident Report for Sauce Labs

Postmortem

Dates:

Tuesday January 31st 2023, 20:00 UTC - Tuesday February 7th 2023, 23:00 UTC

What happened:

Tests running on desktop and virtual mobile devices experienced long wait times and elevated error rates upon session start. This primarily impacted macOS/iOS-based tests, but at times it also impacted Windows and Android tests because of some of the remediation steps taken.

Why it happened:

Unlike our other device clouds, our macOS/iOS cloud has finite capacity, and we can’t burst extra capacity in it. Some of our larger customers have high macOS/iOS test concurrency, and when they run large suites of tests simultaneously, they essentially use up all our macOS/iOS capacity, resulting in long wait times and elevated error rates for new tests.

How we fixed it:

In order to restore service, we took the following actions:

We reduced macOS/iOS test concurrency for customers requesting high numbers of new tests.
We reduced the job queue size, which throttled new test requests across all customers.
When the request rate got to the point that tests were erroring out, we cleared the job queue to provide some relief for the tests already being processed. This was done for all test types, as this queue is shared across Windows, macOS, iOS, and Android.
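
As an illustration of the first action above, reducing concurrency for specific customers can be pictured as a per-customer cap on running sessions. The sketch below is a minimal, hypothetical model and not our actual scheduler: the class name `ConcurrencyLimiter`, its methods, and the customer names are all assumptions made for the example.

```python
from collections import defaultdict


class ConcurrencyLimiter:
    """Hypothetical per-customer cap on concurrently running test sessions."""

    def __init__(self, default_limit):
        self.default_limit = default_limit
        self.limits = {}                 # customer -> overridden (e.g. reduced) cap
        self.running = defaultdict(int)  # customer -> sessions currently running

    def set_limit(self, customer, limit):
        # Lower (or raise) the cap for a single customer.
        self.limits[customer] = limit

    def try_start(self, customer):
        # Admit a new session only if the customer is below their cap.
        limit = self.limits.get(customer, self.default_limit)
        if self.running[customer] >= limit:
            return False
        self.running[customer] += 1
        return True

    def finish(self, customer):
        # Release a slot when a session ends.
        if self.running[customer] > 0:
            self.running[customer] -= 1


limiter = ConcurrencyLimiter(default_limit=3)
limiter.set_limit("large-customer", 1)  # reduced cap during the incident
first = limiter.try_start("large-customer")
second = limiter.try_start("large-customer")
other = limiter.try_start("small-customer")
```

In this model, the second request from `large-customer` is refused while other customers continue to start sessions up to the default cap.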

What we are doing to prevent it from happening again:

We are taking several actions to prevent this scenario in the future:

We are currently working on expanding our macOS/iOS cloud with the ability to burst capacity as the rate of new test requests increases.
We have put per-cloud limits in place on the job queue to lessen the impact across different test types during these incidents.
We have added checks that run before a test resource is assigned to a test, which will reduce the number of allocated resources that ultimately fail due to an invalid request.
We are working on ways to detect abandoned jobs so that we can stop them and free up their resources.
We have resized our hypervisors to make more efficient use of system resources, reducing time spent in waste states between VM availability. This should provide an effective increase in usable capacity.
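
To illustrate the per-cloud job queue limits mentioned above, the sketch below models a queue with a separate bound for each cloud, so that a backlog in one cloud cannot crowd out requests for another. It is a simplified, hypothetical model: the class name `PerCloudJobQueue`, the cloud names, and the limit values are assumptions for the example and do not describe our production implementation.

```python
from collections import deque


class PerCloudJobQueue:
    """Hypothetical job queue with an independent size limit per cloud."""

    def __init__(self, limits):
        self.limits = dict(limits)  # cloud -> maximum number of queued jobs
        self.queues = {cloud: deque() for cloud in self.limits}

    def enqueue(self, cloud, job):
        # Reject the job if this cloud's queue is full; other clouds are unaffected.
        queue = self.queues[cloud]
        if len(queue) >= self.limits[cloud]:
            return False
        queue.append(job)
        return True

    def dequeue(self, cloud):
        # Return the oldest queued job for this cloud, or None if empty.
        queue = self.queues[cloud]
        return queue.popleft() if queue else None

    def clear(self, cloud):
        # Drop queued jobs for one cloud only; return how many were dropped.
        dropped = len(self.queues[cloud])
        self.queues[cloud].clear()
        return dropped


jobs = PerCloudJobQueue({"macos": 2, "windows": 2})
accepted = [jobs.enqueue("macos", f"mac-job-{i}") for i in range(3)]
windows_ok = jobs.enqueue("windows", "win-job-0")
dropped = jobs.clear("macos")
```

Here the third macOS job is rejected once the macOS limit is reached, the Windows job is still accepted, and clearing the macOS queue leaves the Windows queue untouched.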

Posted Feb 15, 2023 - 09:44 UTC

Resolved

After taking remedial action, we are seeing normal wait times and the error rates during session creation have subsided for desktop and virtual mobile device tests. This incident has been resolved.

Posted Feb 03, 2023 - 15:42 UTC

Monitoring

After taking remedial action, we are seeing normal wait times and the error rates during session creation have subsided for desktop and virtual mobile device tests. We are currently monitoring.

Posted Feb 03, 2023 - 15:00 UTC

Investigating

Desktop browser and virtual mobile device tests are again experiencing long wait times and a high rate of errors upon session start in our US-West-1 datacenter. This includes macOS desktop, Windows desktop, Android emulator, and iOS simulator tests. We are investigating.

Posted Feb 03, 2023 - 13:22 UTC

This incident affected: Automated Browser Testing (US-West), Automated Virtual Mobile Device Testing (US-West), Live Browser Testing (US-West), and Live Virtual Mobile Device Testing (US-West).