2018-October-17 Service Incident
Incident Report for Sauce Labs Inc

Date: October 17, 2018
Time: 7:52 am - 2:51 pm PDT

What happened:
Customers were unable to use the Virtual Cloud or access the web application.

Why it happened:
Our primary Kubernetes cluster, which runs many of our critical services, including job routing and scheduling, suffered a Byzantine failure: the Kubernetes master nodes held inconsistent views of the configuration. This prevented us from booting new containers and left key services crippled, although they still supported a reduced number of tests.
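To illustrate what "inconsistent views" means in practice, here is a minimal Python sketch that compares the configuration each master node reports and groups nodes by what they see. It is not the tooling we used during the incident; the node names, the configuration keys, and the functions `config_digest`, `group_by_view`, and `is_consistent` are all hypothetical.

```python
import hashlib
import json
from collections import defaultdict


def config_digest(config):
    """Return a stable SHA-256 digest of a JSON-serializable config."""
    # Sorting keys makes the digest independent of key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_by_view(views):
    """Group node names by the digest of the config each node reports."""
    groups = defaultdict(list)
    for node, config in views.items():
        groups[config_digest(config)].append(node)
    return dict(groups)


def is_consistent(views):
    """True when every node reports the same config (or there are no nodes)."""
    return len(group_by_view(views)) <= 1


# Hypothetical data: master-2 disagrees with the other two nodes.
views = {
    "master-1": {"scheduler_replicas": 3, "router_version": "a"},
    "master-2": {"scheduler_replicas": 2, "router_version": "a"},
    "master-3": {"router_version": "a", "scheduler_replicas": 3},
}

print("consistent:", is_consistent(views))
for digest, nodes in group_by_view(views).items():
    print(digest[:8], sorted(nodes))
```

In this made-up example, master-1 and master-3 land in one group and master-2 in another, so the check reports an inconsistency. A real cluster would need to read the state from each node directly, which this sketch does not attempt.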

How we fixed it:
After attempting to repair the cluster without interrupting the running customer workloads, we decided the only sure path back to health was a full reboot.  We shut down the system and then manually rebooted key infrastructure services one at a time, ensuring all replicas were consistent before proceeding.
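The one-at-a-time restart described above can be sketched roughly as follows. This is an illustrative Python outline under stated assumptions, not our actual runbook: `restart_in_order`, the two callbacks, the timeout values, and the service names are all hypothetical.

```python
import time


def restart_in_order(services, restart, replicas_consistent,
                     timeout_s=300.0, poll_s=5.0):
    """Restart services one at a time, waiting for each to become consistent.

    restart(name) starts a service; replicas_consistent(name) returns True
    once all of that service's replicas agree. Both are supplied by the
    caller. Raises TimeoutError if a service does not converge in time.
    """
    started = []
    for name in services:
        restart(name)
        deadline = time.monotonic() + timeout_s
        # Do not move on to the next service until this one is consistent.
        while not replicas_consistent(name):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{name} did not become consistent")
            time.sleep(poll_s)
        started.append(name)
    return started


# Hypothetical stand-ins for real restart and health-check calls.
calls = []
checks = {}


def fake_restart(name):
    calls.append(name)


def fake_consistent(name):
    # Pretend each service reports consistent replicas on its second poll.
    checks[name] = checks.get(name, 0) + 1
    return checks[name] >= 2


order = restart_in_order(
    ["service-a", "service-b", "service-c"],
    fake_restart,
    fake_consistent,
    poll_s=0,
)
print(order)
```

The point of gating each step on a consistency check is that a service which fails to converge stops the sequence instead of letting later services start on top of a bad state.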

What we are doing to prevent it from happening again:
We've engaged our Kubernetes support provider to help us analyze the issue and have implemented their first round of recommendations. We're building a new cluster to resolve the other identified issues. We're also adding monitoring to preempt similar issues and have modified our deploy process. Finally, we have greatly improved our processes for recovering the cluster when it gets into an unhealthy state.

Posted 4 months ago. Oct 24, 2018 - 15:44 PDT

Our service has recovered entirely. All systems are fully operational.
Posted 4 months ago. Oct 17, 2018 - 14:53 PDT
All services appear to be functioning. We are in the final phase of monitoring.
Posted 4 months ago. Oct 17, 2018 - 14:34 PDT
Our service is showing signs of recovery. We are monitoring closely.
Posted 4 months ago. Oct 17, 2018 - 13:52 PDT
Our service continues to have problems running tests, starting Sauce Connect, and powering our web UI. We've isolated the problem and are looking for solutions to free resources, manually restarting services, and taking other advanced actions now.
Posted 4 months ago. Oct 17, 2018 - 11:56 PDT
We are continuing to work on a fix for this issue.
Posted 4 months ago. Oct 17, 2018 - 10:36 PDT
We have identified an issue and we are working to implement a fix.
Posted 4 months ago. Oct 17, 2018 - 08:46 PDT
Automated and manual VM, emulator, and simulator testing is unavailable, Sauce Connect (VM) tunnels are not starting, and the Sauce Labs website and dashboard are unavailable. We are investigating.
Posted 4 months ago. Oct 17, 2018 - 08:11 PDT
This incident affected: Sauce Connect (Sauce Connect VM), REST API (REST API VMs), Manual Testing (Manual VM Testing), Web Interface (Sauce UI, Analytics, saucelabs.com), and Automated VM Testing (Automated PC Testing, Automated Mac Testing, Automated iOS Simulator Testing, Automated Android Emulator Testing).