2018-October-23 Service Incident
Incident Report for Sauce Labs Inc
Postmortem

Date: October 23, 2018
Time: 9:54am - 11:37am PDT

What happened:
Customers were unable to use the Virtual Cloud or access the web application.

Why it happened:
Our primary Kubernetes cluster, which runs many of our critical services including job routing and scheduling, had a Byzantine failure (i.e., the Kubernetes master nodes had inconsistent views of the configuration). This prevented us from booting new containers and left key services crippled (although still able to support a reduced number of tests).
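
To illustrate what "inconsistent views of the configuration" means in practice, here is a minimal Python sketch that groups nodes by a digest of the configuration each one reports. It is not the tooling used during this incident; the node names, the shape of the configuration, and the functions `config_digest`, `group_by_view`, and `is_consistent` are all hypothetical and exist only for illustration.

```python
import hashlib
import json
from collections import defaultdict


def config_digest(config: dict) -> str:
    """Return a stable SHA-256 digest of a configuration mapping."""
    # Sorting keys makes the digest independent of key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_by_view(views: dict) -> list:
    """Group node names by the configuration each node reports.

    `views` maps a (hypothetical) node name to the config it returned.
    Groups are ordered largest first, then alphabetically.
    """
    groups = defaultdict(list)
    for node, config in views.items():
        groups[config_digest(config)].append(node)
    return sorted(
        (sorted(nodes) for nodes in groups.values()),
        key=lambda g: (-len(g), g),
    )


def is_consistent(views: dict) -> bool:
    """True when every node reports the same configuration."""
    return len(group_by_view(views)) <= 1


if __name__ == "__main__":
    # Hypothetical data: the third node disagrees with the other two.
    views = {
        "master-1": {"replicas": 3, "image": "router:1.4"},
        "master-2": {"image": "router:1.4", "replicas": 3},
        "master-3": {"replicas": 2, "image": "router:1.4"},
    }
    print(group_by_view(views))
    print(is_consistent(views))
```

A check of this kind only detects disagreement; deciding which view is correct still requires a source of truth outside the disagreeing nodes.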

How we fixed it:
We shut down the system and then manually rebooted key infrastructure services one at a time, ensuring all replicas were consistent before proceeding.
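
The recovery procedure described above can be sketched as a loop that restarts one service at a time and does not move on until that service's replicas agree. This is a hedged Python outline, not the actual runbook: `restart_in_order`, the `restart` and `replicas_consistent` callbacks, the service names, and the timeout values are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, List


def restart_in_order(
    services: Iterable[str],
    restart: Callable[[str], None],
    replicas_consistent: Callable[[str], bool],
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Restart services one at a time, gating on replica consistency.

    `restart` and `replicas_consistent` are caller-supplied (hypothetical)
    hooks; this function only enforces the ordering and the gate.
    Raises TimeoutError if a service does not converge in time.
    """
    done: List[str] = []
    for name in services:
        restart(name)
        deadline = clock() + timeout_s
        # Do not proceed to the next service until this one is consistent.
        while not replicas_consistent(name):
            if clock() >= deadline:
                raise TimeoutError(
                    f"{name} replicas did not converge; restarted so far: {done}"
                )
            sleep(poll_s)
        done.append(name)
    return done


if __name__ == "__main__":
    # Hypothetical stand-ins: each service becomes consistent on the
    # second check after its restart.
    checks = {}

    def fake_restart(name: str) -> None:
        checks[name] = 0

    def fake_consistent(name: str) -> bool:
        checks[name] += 1
        return checks[name] >= 2

    order = restart_in_order(
        ["service-a", "service-b"],
        fake_restart,
        fake_consistent,
        poll_s=0.0,
    )
    print(order)
```

Stopping at the first service that fails to converge, rather than continuing, keeps a bad state from spreading to services that depend on it.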

What we are doing to prevent it from happening again:
We've engaged our Kubernetes support provider to help us analyze the issue and have implemented their first round of recommendations. We're building a new cluster to resolve the other identified issues. We're also adding monitoring to preempt similar issues and have modified our deploy process. Finally, we have greatly improved our recovery processes for when the cluster gets into an unhealthy state.

Posted about 2 months ago. Oct 26, 2018 - 09:25 PDT

Resolved
Our service has fully recovered and all components are fully operational.
Posted about 2 months ago. Oct 23, 2018 - 12:55 PDT
Monitoring
We identified and addressed the problems. Our services are recovering. We are closely monitoring the situation.
Posted about 2 months ago. Oct 23, 2018 - 11:42 PDT
Update
The majority of our services continue to be unavailable. We are investigating and working on remediation.
Posted about 2 months ago. Oct 23, 2018 - 11:14 PDT
Update
We are continuing to investigate this issue.
Posted about 2 months ago. Oct 23, 2018 - 10:08 PDT
Update
We are continuing to investigate this issue.
Posted about 2 months ago. Oct 23, 2018 - 10:06 PDT
Investigating
Most of our service is not operational, including running tests, starting Sauce Connect tunnels, and accessing our web UI. We are investigating.
Posted about 2 months ago. Oct 23, 2018 - 10:05 PDT
This incident affected: Automated VM Testing (Automated PC Testing, Automated Mac Testing, Automated iOS Simulator Testing, Automated Android Emulator Testing), REST API (REST API VMs), Manual Testing (Manual VM Testing), Web Interface (Sauce UI, Real Device Cloud UI, Analytics, saucelabs.com), and Sauce Connect (Sauce Connect VM).