2017-March-22 Service Incident
Incident Report for Sauce Labs Inc
Postmortem

Date of Incident: March 22, 2017
Time of Incident: 1:40pm - 4:40pm PDT

What Happened:
Wait times for VMs exceeded 30 seconds, which can cause some tests to error out or fail to start.

Why it Happened:
We determined that the service responsible for creating and halting VMs was recovering more slowly than usual after a restart. Such restarts happen whenever Sauce Labs deploys changes.

What we did to fix it:
We increased the CPU and RAM allocated to the service responsible for creating and halting VMs. We also substantially refactored how that service reads from and writes to its data store, which made it considerably faster.

We thoroughly tested these changes under production conditions, and we are confident we have addressed the issue.

What we are doing to prevent this from happening again:
We added extensive instrumentation and analytics to that service so that we can diagnose similar issues quickly and proactively in the future.

Posted Mar 30, 2017 - 19:47 PDT

Resolved
We’ve taken steps to bring wait times back to normal levels. All services are fully operational.
Posted Mar 22, 2017 - 17:30 PDT
Update
Wait times are still high. We’re continuing to take remedial action.
Posted Mar 22, 2017 - 16:00 PDT
Update
Wait times continue to be high. We are still investigating.
Posted Mar 22, 2017 - 15:03 PDT
Investigating
We’ve detected a problem with our service and are experiencing high wait times. We are investigating.
Posted Mar 22, 2017 - 14:34 PDT
This incident affected: Manual Testing (Manual VM Testing), Sauce Connect (Sauce Connect VM), Automated VM Testing (Automated PC Testing), REST API (REST API VMs), and Web Interface (Sauce UI, saucelabs.com, Documentation Wiki).