2018-October-23 Service Incident
Incident Report for Sauce Labs Inc

Date: October 23, 2018
Time: 9:54am - 11:37am PDT

What happened:
Customers were unable to use the Virtual Cloud or access the web application.

Why it happened:
Our primary Kubernetes cluster, which runs many of our critical services, including job routing and scheduling, suffered a Byzantine failure (i.e., the Kubernetes master nodes had inconsistent views of the configuration). This prevented us from booting new containers and left key services crippled, although they still supported a reduced number of tests.

How we fixed it:
We shut down the system and then manually rebooted key infrastructure services one at a time, ensuring all replicas were consistent before proceeding.
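The recovery procedure above can be sketched in a few lines. This is only an illustration of "restart one service at a time and verify replica consistency before proceeding", not the tooling we actually used: the `restart` and `replica_views` callables and the function names are hypothetical placeholders for whatever a given environment provides.

```python
from typing import Callable, Iterable, List


def replicas_consistent(views: Iterable[str]) -> bool:
    """Return True only when at least one replica reports and all agree."""
    distinct = set(views)
    # An empty set (no replica reporting) is treated as inconsistent.
    return len(distinct) == 1


def restart_in_order(
    services: List[str],
    restart: Callable[[str], None],              # hypothetical: restarts one service
    replica_views: Callable[[str], List[str]],   # hypothetical: one view per replica
) -> List[str]:
    """Restart services sequentially, halting at the first inconsistent one.

    Returns the services that were restarted and verified.
    """
    verified: List[str] = []
    for name in services:
        restart(name)
        if not replicas_consistent(replica_views(name)):
            # Stop here so a disagreeing service is never built upon.
            raise RuntimeError(f"replicas of {name!r} disagree; not proceeding")
        verified.append(name)
    return verified
```

The point of the sketch is the ordering: a later service is never restarted until every replica of the previous one reports the same view, so an inconsistency stops the sequence instead of propagating.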

What we are doing to prevent it from happening again:
We've engaged our Kubernetes support provider to help us analyze the issue and have implemented their first round of recommendations. We're building a new cluster to resolve the other identified issues. We're also adding monitoring to preempt similar issues and have modified our deploy process. Finally, we have greatly improved our recovery processes for when the cluster gets into an unhealthy state.

Posted Oct 26, 2018 - 09:25 PDT

Our service has fully recovered and all components are fully operational.
Posted Oct 23, 2018 - 12:55 PDT
We identified and addressed the problems. Our services are recovering. We are closely monitoring the situation.
Posted Oct 23, 2018 - 11:42 PDT
The majority of our services continue to be unavailable. We are investigating and working on remediation.
Posted Oct 23, 2018 - 11:14 PDT
We are continuing to investigate this issue.
Posted Oct 23, 2018 - 10:08 PDT
We are continuing to investigate this issue.
Posted Oct 23, 2018 - 10:06 PDT
Most of our service is not operational, including running tests, starting Sauce Connect tunnels, and accessing our web UI. We are investigating.
Posted Oct 23, 2018 - 10:05 PDT
This incident affected: Sauce Connect (Sauce Connect VM), REST API (REST API VMs), Manual Testing (Manual VM Testing), Web Interface (Sauce UI, Real Device Cloud UI, Analytics, saucelabs.com), and Automated VM Testing (Automated PC Testing, Automated Mac Testing, Automated iOS Simulator Testing, Automated Android Emulator Testing).