2017-May-8 Service Incident
Incident Report for Sauce Labs Inc

Date: May 8, 2017
Time: 1:49pm - 4:01pm PDT

What Happened
Wait times for VMs exceeded 30 seconds.

Why it Happened
Our service keeps wait times for new tests low (below one second) by pre-booting VMs so that they are available to run tests immediately. One complication of this strategy is deciding which types of VMs to boot in advance: Sauce supports many different types of VMs, and customers can run any type of test at any time. Our service decides which types of VMs to pre-boot based on historical usage and makes adjustments based on current usage.
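
To make the idea concrete, here is a minimal sketch of how a pool of pre-booted VMs could be split across image types by blending historical and current usage. It is an illustration only, not the actual Sauce Labs implementation: the function name, the `alpha` blending weight, the image names, and the rates are all assumptions.

```python
def preboot_targets(historical, current, pool_size, alpha=0.5):
    """Split a fixed pool of pre-booted VMs across image types.

    historical, current: dicts mapping image name -> test starts per minute.
    alpha: assumed weight (0..1) given to current usage over historical usage.
    """
    images = set(historical) | set(current)
    # Blend the long-term rate with the live rate for each image type.
    blended = {
        img: (1 - alpha) * historical.get(img, 0.0) + alpha * current.get(img, 0.0)
        for img in images
    }
    total = sum(blended.values())
    if total == 0:
        return {img: 0 for img in images}
    # Give each image a share of the pool proportional to its blended rate.
    return {img: round(pool_size * rate / total) for img, rate in blended.items()}


# Hypothetical usage figures, for illustration only.
targets = preboot_targets(
    historical={"windows7": 60, "linux": 40},
    current={"windows7": 80, "linux": 20},
    pool_size=10,
)
print(targets)
```

Because each share is rounded independently, the targets may not add up exactly to `pool_size`. Note also that in this sketch an image with no historical entry is weighted by its current usage alone, so it receives a smaller share than an established image with the same live demand.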

On Monday, Sauce Labs replaced a heavily used VM image (Windows 7) that had previously been pre-booted at a high rate. Because the new image had no historical usage data, our service did not trigger enough pre-boots for it. As customers ran tests on the new image, the supply of pre-booted VMs was quickly exhausted. New tests had to wait for VMs to be rebooted, which led to high wait times. The problem was exacerbated by the fact that the new Windows 7 image boots more slowly than the previous version.
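
The effect of an undersized pool combined with a slow boot can be seen in a small simulation. This is a simplified model under stated assumptions, not a reconstruction of the incident: tests arrive at a fixed interval, a replacement VM starts booting the moment a test arrives, and the pool sizes, arrival interval, and boot time are made-up numbers.

```python
import heapq


def simulate_waits(prebooted, interval_s, boot_s, n_tests):
    """Return the wait in seconds seen by each of n_tests arriving tests.

    prebooted: number of VMs ready at time 0 (the pre-boot target).
    interval_s: seconds between test arrivals (assumed constant).
    boot_s: seconds needed to boot one VM.
    """
    ready = [0.0] * prebooted  # times at which each pooled VM becomes ready
    heapq.heapify(ready)
    waits = []
    for k in range(n_tests):
        now = k * interval_s
        if not ready:
            # Target pool size is zero: every test boots its VM on demand.
            waits.append(float(boot_s))
            continue
        ready_at = heapq.heappop(ready)  # take the VM that is ready soonest
        waits.append(max(0.0, ready_at - now))
        # A replacement VM starts booting as soon as this test arrives.
        heapq.heappush(ready, now + boot_s)
    return waits


# Hypothetical numbers: one test per second, 45-second boot.
well_sized = simulate_waits(prebooted=60, interval_s=1, boot_s=45, n_tests=200)
undersized = simulate_waits(prebooted=10, interval_s=1, boot_s=45, n_tests=200)
print(max(well_sized), max(undersized))
```

In this model, once the initial pool is used up each test waits roughly `boot_s - prebooted * interval_s` seconds, so both a smaller pool and a slower-booting image push wait times up.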

How We Fixed It
We rolled back the default to the previous Windows 7 image.

What We Are Doing to Prevent It from Happening Again
- Deploy new images gradually rather than all at once
- Implement a way to preserve the historical usage patterns of a VM image and associate them with its replacement, so that the new image gets a similar pre-booting strategy ("weighting")
- Continue investigating additional strategies to reduce the boot time of our most popular images
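
The first two items can be sketched together: when an image is replaced, a fraction of the old image's historical usage is moved to the new image, and that fraction grows as the rollout proceeds. This is a minimal sketch under assumptions, not the actual implementation; `seed_history`, `rollout_fraction`, and the image names are hypothetical.

```python
def seed_history(history, old_image, new_image, rollout_fraction):
    """Move a share of old_image's historical usage to new_image.

    history: dict mapping image name -> historical test starts per minute.
    rollout_fraction: share of traffic (0..1) currently sent to new_image.
    Returns a new dict; the input is left unchanged.
    """
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be between 0 and 1")
    seeded = dict(history)
    old_rate = seeded.get(old_image, 0.0)
    moved = old_rate * rollout_fraction
    # The old image keeps the remainder; the new image inherits the rest.
    seeded[old_image] = old_rate - moved
    seeded[new_image] = seeded.get(new_image, 0.0) + moved
    return seeded


# Hypothetical example: 25% of Windows 7 traffic moved to the new image.
history = {"windows7-old": 60.0, "linux": 40.0}
seeded = seed_history(history, "windows7-old", "windows7-new", 0.25)
print(seeded)
```

With a history seeded this way, the new image starts with a pre-boot weighting proportional to the traffic it is expected to receive rather than starting from zero.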

Posted May 15, 2017 - 14:15 PDT

Resolved
Wait times have returned to normal levels. All services are fully operational.
Posted May 08, 2017 - 16:22 PDT
Update
We identified the root cause and took remedial action. Sauce Connect tunnels are working as expected. Wait times have improved and we’re monitoring performance closely.
Posted May 08, 2017 - 15:21 PDT
Update
Our Mac and PC clouds are still experiencing high wait times, and Sauce Connect tunnels are intermittently failing to start. We are taking remedial action.
Posted May 08, 2017 - 14:36 PDT
Investigating
Our system monitoring has detected a problem. Our PC and Mac clouds are affected by high wait times.
Posted May 08, 2017 - 14:18 PDT
This incident affected: Sauce Automated, Sauce Manual, Sauce Connect, REST API, Storage REST API, Web Application, saucelabs.com, wiki.saucelabs.com, and Web Application — Archives Page.