Date: May 8, 2017
Time: 1:49pm - 4:01pm PDT
Wait times for VMs exceeded 30 seconds.
Why it Happened
Our service keeps wait times for new tests low (below one second) by pre-booting VMs so that they are available to run tests immediately. One complication of this strategy is deciding which types of VM to boot in advance. Sauce supports many different types of VMs, and customers can choose to run any type of test at any time. Our service decides which types of VMs to pre-boot based on historical usage and adjusts those decisions based on current usage.
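To make the idea concrete, here is a minimal sketch of how a pre-boot pool might be split across image types by blending historical and current usage. It is not our production scheduler: the function name `preboot_targets`, the `alpha` blending weight, and the image names are all hypothetical and chosen only for illustration.

```python
def preboot_targets(historical, current, pool_size, alpha=0.7):
    """Split pool_size pre-booted VMs across image types.

    historical, current: dicts mapping image name -> number of tests run.
    alpha: hypothetical weight given to historical usage (0..1); the
           remainder goes to current usage.
    """
    images = set(historical) | set(current)
    # Avoid division by zero when one of the usage tables is empty.
    hist_total = sum(historical.values()) or 1
    cur_total = sum(current.values()) or 1

    # Blend each image's share of historical and current demand.
    scores = {
        img: alpha * historical.get(img, 0) / hist_total
        + (1 - alpha) * current.get(img, 0) / cur_total
        for img in images
    }
    total = sum(scores.values())
    if total == 0:
        return {img: 0 for img in images}
    # Allocate the pool in proportion to the blended score.
    return {img: round(pool_size * s / total) for img, s in scores.items()}


# A new image with no history gets a small share even under heavy demand.
targets = preboot_targets(
    historical={"win7-old": 80, "linux": 20},
    current={"win7-new": 80, "linux": 20},
    pool_size=100,
)
```

With these made-up numbers, an image with no usage history receives a smaller share of the pool than the image it replaced, even though it carries all of the current demand for that platform.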
On Monday, Sauce Labs replaced a heavily used VM image (Windows 7) that, in the past, had been pre-booted at a high rate. Because the new image had no historical usage data, our service did not trigger enough pre-boots for it. As customers ran tests on the new image, the supply of pre-booted VMs was quickly exhausted. New tests had to wait for VMs to be rebooted, which led to high wait times. The problem was exacerbated by the fact that the new Windows 7 image boots more slowly than the previous version.
How We Fixed It
We rolled back the default to the previous Windows 7 image.
What We Are Doing to Prevent It from Happening Again
- Deploy new images gradually rather than all at once.
- Implement the means to preserve the historical usage patterns of a VM image and associate them with its replacement, which will ensure a similar pre-booting strategy ("weighting") for the new image.
- Continue investigating additional strategies to reduce the boot time of our most popular images.
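
As a rough illustration of the first two items, the sketch below moves a fraction of an old image's usage history to its replacement, so the new image inherits a proportional pre-boot weighting as the rollout progresses. The function `transfer_history`, its `fraction` parameter, and the image names are hypothetical, not part of our actual service.

```python
def transfer_history(historical, old_image, new_image, fraction=1.0):
    """Return a copy of historical with part of old_image's usage
    reassigned to new_image.

    fraction: share of the old image's history to move (0..1), e.g. the
              share of traffic currently routed to the new image.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    updated = dict(historical)  # leave the caller's data untouched
    moved = updated.get(old_image, 0) * fraction
    updated[old_image] = updated.get(old_image, 0) - moved
    updated[new_image] = updated.get(new_image, 0) + moved
    return updated


history = {"win7-old": 80, "linux": 20}
# Gradual rollout: 25% of Windows 7 tests go to the new image first.
partial = transfer_history(history, "win7-old", "win7-new", fraction=0.25)
# Full cutover: the new image inherits all of the old image's history.
full = transfer_history(history, "win7-old", "win7-new")
```

Feeding the adjusted history into the pre-boot decision would let a replacement image start with a weighting close to that of its predecessor instead of starting from zero.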