2018-October-17 Service Incident
Incident Report for Sauce Labs Inc
Postmortem

Date: October 17, 2018
Time: 7:52 am - 2:51 pm PDT

What happened:
Customers were unable to use the Virtual Cloud or access the web application.

Why it happened:
Our primary Kubernetes cluster, which runs many of our critical services including job routing and scheduling, suffered a Byzantine failure: the Kubernetes master nodes held inconsistent views of configuration. This prevented us from booting new containers and left key services crippled, although they still supported a reduced number of tests.

How we fixed it:
After attempting to repair the cluster without interrupting the running customer workloads, we decided the only sure path back to health was a full reboot. We shut down the system and then manually rebooted key infrastructure services one at a time, ensuring all replicas were consistent before proceeding.
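
As a rough illustration of the "one at a time, verify, then proceed" approach, here is a minimal Python sketch. It is not the tooling used during the incident. The function names, the idea of comparing a per-replica configuration fingerprint, and the `start` and `get_views` callbacks are all assumptions made for the example.

```python
from typing import Callable, Iterable, List, Mapping


def replicas_consistent(views: Mapping[str, str]) -> bool:
    """Return True when every replica reports the same config fingerprint.

    `views` maps a replica name to a fingerprint (for example a hash of the
    configuration that replica currently sees). An empty mapping is treated
    as inconsistent, because nothing has been verified.
    """
    return len(views) > 0 and len(set(views.values())) == 1


def staged_restart(
    services: Iterable[str],
    start: Callable[[str], None],
    get_views: Callable[[str], Mapping[str, str]],
) -> List[str]:
    """Start services in order, halting at the first one whose replicas disagree.

    `start` and `get_views` are hypothetical hooks supplied by the caller;
    they stand in for whatever actually boots a service and inspects it.
    """
    started: List[str] = []
    for name in services:
        start(name)
        if not replicas_consistent(get_views(name)):
            raise RuntimeError(
                f"replicas of {name!r} disagree; halting before the next service"
            )
        started.append(name)
    return started
```

Halting on the first disagreement, rather than continuing and reporting at the end, keeps an inconsistent service from becoming a dependency of the services started after it.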

What we are doing to prevent it from happening again:
We've engaged our Kubernetes support provider to help us analyze the issue and have implemented their first round of recommendations. We're building a new cluster to resolve the other identified issues. We're also adding monitoring to preempt similar issues and have modified our deploy process. Finally, we have greatly improved our recovery processes for when the cluster gets into an unhealthy state.

Posted about 2 months ago. Oct 24, 2018 - 15:44 PDT

Resolved
Our service has recovered entirely. All systems are fully operational.
Posted about 2 months ago. Oct 17, 2018 - 14:53 PDT
Update
All services appear to be functioning. We are in the final phase of monitoring.
Posted about 2 months ago. Oct 17, 2018 - 14:34 PDT
Update
Our service is showing signs of recovery. We are monitoring closely.
Posted about 2 months ago. Oct 17, 2018 - 13:52 PDT
Update
Our service continues to have problems running tests, starting Sauce Connect, and powering our web UI.
We've isolated the problem and are now looking for ways to free resources, manually restarting services,
and taking other advanced actions.
Posted about 2 months ago. Oct 17, 2018 - 11:56 PDT
Update
We are continuing to work on a fix for this issue.
Posted about 2 months ago. Oct 17, 2018 - 10:36 PDT
Identified
We have identified an issue and we are working to implement a fix.
Posted about 2 months ago. Oct 17, 2018 - 08:46 PDT
Investigating
Automated and manual VM/emu/sim testing is unavailable, Sauce Connect (VM) tunnels are not starting, and the Sauce Labs website and dashboard are unavailable. We are investigating.
Posted about 2 months ago. Oct 17, 2018 - 08:11 PDT
This incident affected: Automated VM Testing (Automated PC Testing, Automated Mac Testing, Automated iOS Simulator Testing, Automated Android Emulator Testing), REST API (REST API VMs), Manual Testing (Manual VM Testing), Web Interface (Sauce UI, Analytics, saucelabs.com), and Sauce Connect (Sauce Connect VM).