2017-May-8 Service Incident
Incident Report for Sauce Labs Inc

Date: May 8, 2017
Time: 1:49pm - 4:01pm PDT

What Happened
Wait times for VMs exceeded 30 seconds.

Why it Happened
Our service keeps wait times for new tests low (below one second) by pre-booting VMs so that they are available to run tests immediately. One complication of this strategy is deciding which types of VM to boot in advance: Sauce supports many different types of VMs, and customers can run any type of test at any time. Our service decides which types of VMs to pre-boot based on historical usage and adjusts that choice based on current usage.
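As a rough illustration of the idea described above, the sketch below splits a fixed pool of pre-booted VMs across image types in proportion to a blend of historical and current usage. This is not our production scheduler: the function name `plan_preboots`, the `current_weight` blend factor, and the image names are hypothetical and chosen only for this example.

```python
def plan_preboots(historical, current, pool_size, current_weight=0.3):
    """Split pool_size pre-booted VMs across image types.

    historical, current: mapping of image name -> number of tests started.
    current_weight: hypothetical blend factor (0 = history only, 1 = live demand only).
    """
    images = set(historical) | set(current)
    hist_total = sum(historical.values()) or 1
    cur_total = sum(current.values()) or 1

    # Score each image by its share of historical and current demand.
    scores = {
        img: (1 - current_weight) * historical.get(img, 0) / hist_total
        + current_weight * current.get(img, 0) / cur_total
        for img in images
    }
    total = sum(scores.values())
    if total == 0:
        return {img: 0 for img in images}

    # Largest-remainder rounding so the plan adds up to exactly pool_size.
    raw = {img: pool_size * score / total for img, score in scores.items()}
    plan = {img: int(value) for img, value in raw.items()}
    leftover = pool_size - sum(plan.values())
    by_remainder = sorted(images, key=lambda i: (raw[i] - plan[i], i), reverse=True)
    for img in by_remainder[:leftover]:
        plan[img] += 1
    return plan


if __name__ == "__main__":
    # Hypothetical usage numbers, for illustration only.
    history = {"windows7": 600, "windows10": 300, "mac": 100}
    live = {"windows7": 60, "windows10": 30, "mac": 10}
    print(plan_preboots(history, live, pool_size=100))
```

Under this kind of scheme, an image with no historical usage only earns pre-boots through the current-usage term, so it starts with a much smaller share than the image it replaces.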

On Monday, Sauce Labs replaced a heavily used VM image (Windows 7) that had previously been pre-booted at a high rate. Because there was no historical usage data for the new image, our service did not have enough information to trigger adequate pre-boots. As customers ran tests on the new image, the supply of pre-booted VMs was quickly exhausted. New tests had to wait for VMs to be rebooted, which led to high wait times. The problem was exacerbated by the fact that the new Windows 7 image boots more slowly than the previous version.

How We Fixed It
We rolled back the default to the previous Windows 7 image.

What We Are Doing to Prevent It from Happening Again
- Deploy new images gradually rather than all at once
- Implement the means to preserve the historical usage patterns of a VM image and associate it with a new image, which will ensure a similar pre-booting strategy ("weighting") for the new image
- Continue investigating additional strategies to reduce the boot time of our most popular images
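The first two items above can be sketched together: carry the old image's historical usage over to its replacement, and move it in steps rather than all at once. This is a minimal sketch under stated assumptions, not the actual implementation; `inherit_usage`, `rollout_fraction`, and the image names are hypothetical.

```python
def inherit_usage(historical, old_image, new_image, rollout_fraction=1.0):
    """Return a copy of historical with part of old_image's usage moved to new_image.

    rollout_fraction: hypothetical share of traffic switched to the new image
    (0.0 = nothing moved, 1.0 = full replacement).
    """
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be between 0.0 and 1.0")

    updated = dict(historical)  # leave the caller's data untouched
    old_usage = updated.get(old_image, 0)
    moved = old_usage * rollout_fraction

    # The new image inherits the moved share, so pre-boot weighting follows it.
    updated[old_image] = old_usage - moved
    updated[new_image] = updated.get(new_image, 0) + moved
    return updated


if __name__ == "__main__":
    # Hypothetical usage numbers, for illustration only.
    history = {"windows7-v1": 1000, "mac": 200}
    for fraction in (0.25, 0.5, 1.0):
        print(fraction, inherit_usage(history, "windows7-v1", "windows7-v2", fraction))
```

Feeding the adjusted usage data into the pre-boot decision would give the new image a weighting that matches the share of traffic it is expected to receive at each rollout step.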

Posted May 15, 2017 - 14:15 PDT

Wait times have returned to normal levels. All services are fully operational.
Posted May 08, 2017 - 16:22 PDT
We identified the root cause and took remedial action. Sauce Connect tunnels are working as expected. Wait times have improved and we’re monitoring performance closely.
Posted May 08, 2017 - 15:21 PDT
Our Mac and PC clouds are still experiencing high wait times, and Sauce Connect tunnels are intermittently failing to start. We are taking remedial actions.
Posted May 08, 2017 - 14:36 PDT
Our system monitoring has detected a problem. Our PC and Mac clouds are affected by high wait times.
Posted May 08, 2017 - 14:18 PDT
This incident affected: Manual Testing (Manual VM Testing), Sauce Connect (Sauce Connect VM), Automated VM Testing (Automated PC Testing), REST API (REST API VMs), and Web Interface (Sauce UI, saucelabs.com, Documentation Wiki).