Date: April 5, 2017 Time: 6:40am PDT - 7:57am PDT
What Happened: Wait times for VMs exceeded 30 seconds, which can cause some tests to error out or fail to start.
Why it Happened: One of the VMs running two core Sauce cloud management services went down. Redundant VMs with a failover mechanism are in place, but an inefficiency in the component that starts and stops VMs led to a reduction in the number of VMs available to customers.
What we did to fix it: We deployed a change to the component that starts and stops customer VMs so that it better handles increased load.
What we are doing to prevent this from happening again:
- Investigate whether it’s possible to detect this problem sooner, which could prevent this situation from causing an outage.
- Continue to investigate the clustering system (and failover mechanism) to look for opportunities for further improvements.
- Make other performance enhancements to related services.