Date: October 17, 2018
Time: 7:52 am - 2:51 pm PDT
What happened:
Customers were unable to use the Virtual Cloud or access the web application.
Why it happened:
Our primary Kubernetes cluster, which runs many of our critical services including job routing and scheduling, suffered a Byzantine failure: the Kubernetes master nodes held inconsistent views of the cluster's configuration. This prevented us from booting new containers and left key services crippled, although they continued to support a reduced number of tests.
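For illustration, a minimal sketch of the kind of check that exposes this failure mode, assuming direct access to each API server: ask every master for the same object and compare the answers. The master addresses, token path, and object path below are hypothetical.

```python
# Minimal sketch: detect masters serving divergent views of the same object.
# Master addresses, token path, and object path are hypothetical.
import requests

MASTERS = [
    "https://master-1:6443",
    "https://master-2:6443",
    "https://master-3:6443",
]
TOKEN = open("/var/run/secrets/kubernetes.io/serviceaccount/token").read().strip()
OBJECT = "/apis/apps/v1/namespaces/kube-system/deployments/kube-dns"

views = {}
for master in MASTERS:
    resp = requests.get(
        master + OBJECT,
        headers={"Authorization": "Bearer " + TOKEN},
        verify=False,  # sketch only; pass the cluster CA bundle in practice
        timeout=5,
    )
    resp.raise_for_status()
    # Masters backed by the same healthy store should converge on one
    # resourceVersion; persistent disagreement is the smoking gun.
    views[master] = resp.json()["metadata"]["resourceVersion"]

if len(set(views.values())) > 1:
    print("DIVERGENT:", views)
else:
    print("consistent:", views)
```

Because resourceVersion advances on every write, a single snapshot can disagree transiently; what matters is divergence that persists across repeated samples.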
How we fixed it:
After attempting to repair the cluster without interrupting running customer workloads, we decided the only sure path back to health was a full reboot. We shut down the system and then manually rebooted key infrastructure services one at a time, verifying that all replicas were consistent before proceeding to the next.
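As a rough sketch of that sequencing, assuming the services are ordinary Deployments managed with kubectl (the service names, namespace, and boot order here are hypothetical):

```python
# Sketch: bring infrastructure services back one at a time, gating each
# step on the previous service reporting all replicas ready.
# Service names, namespace, and boot order are hypothetical.
import subprocess

# Dependencies first: later services assume the earlier ones are healthy.
BOOT_ORDER = ["job-router", "scheduler", "web-frontend"]

def run(*args):
    print("+", " ".join(args))
    subprocess.run(args, check=True)

for name in BOOT_ORDER:
    # Recreate this service's pods...
    run("kubectl", "-n", "infra", "delete", "pod", "-l", "app=" + name)
    # ...then block until the deployment reports every replica available
    # before touching the next service in the order.
    run("kubectl", "-n", "infra", "rollout", "status", "deployment/" + name)
```

Gating on readiness between steps is what keeps a cold start from cascading: no service comes up against a dependency that is itself still recovering.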
What we are doing to prevent it from happening again:
We've engaged our Kubernetes support provider to help us analyze the issue and have implemented their first round of recommendations. We're in the process of building a new cluster to resolve the other identified issues. We're also adding monitoring to catch similar issues before they affect customers, and we have modified our deploy process. Finally, we have greatly improved our recovery processes for when the cluster gets into an unhealthy state.
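One concrete shape that monitoring can take, sketched under the assumption that the masters' inconsistency is visible at the etcd layer: periodically compare raft state across the etcd members and alert when they disagree. The endpoint list is hypothetical; `etcdctl endpoint status -w json` is a standard etcd v3 command.

```python
# Sketch of a consistency probe: compare raft state across etcd members
# and alert on a split view before it cripples the control plane.
# The endpoint list is hypothetical.
import json
import os
import subprocess

ENDPOINTS = "https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379"

out = subprocess.run(
    # Secured clusters will also need --cacert/--cert/--key flags.
    ["etcdctl", "--endpoints", ENDPOINTS, "endpoint", "status", "-w", "json"],
    env={**os.environ, "ETCDCTL_API": "3"},
    capture_output=True, check=True, text=True,
).stdout

statuses = json.loads(out)
leaders = {s["Status"]["leader"] for s in statuses}
terms = {s["Status"]["raftTerm"] for s in statuses}

if len(leaders) > 1 or len(terms) > 1:
    # Members disagree on who leads or what term it is: page someone.
    print("ALERT: divergent raft state", leaders, terms)
else:
    print("ok: leader", leaders.pop(), "term", terms.pop())
```

Run from cron or wrapped as an exporter, a probe like this turns "masters have inconsistent views" from a post-outage diagnosis into an alert.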