2019-January-30 Service Incident
Incident Report for Sauce Labs Inc
Postmortem

Dates:

January 30th, 2019, 9:54am - 5:01pm PST

What happened:

An ingress point to our service failed, causing the system that boots tunnel VMs to boot a large number of unusable tunnels. This reduced effective capacity to zero and left customers unable to start tunnels.

Why it happened:

We mistakenly injected a configuration fault into the ingress that handles tunnel start requests, which triggered a runaway feedback loop that brought down our tunnel capacity.
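
To make the failure mode concrete, the following is a minimal, hypothetical Python sketch of how a feedback loop of this shape can empty a VM pool: a misconfigured ingress causes every newly booted tunnel VM to come up unusable, so a naive capacity loop keeps booting replacements until the pool is full of dead VMs. The names, numbers, and readiness logic here are illustrative assumptions only, not our production systems.

# Illustrative only: a misconfigured ingress makes every new tunnel VM unusable,
# so the capacity loop keeps booting replacements until the pool is exhausted.

TARGET_HEALTHY = 1000   # healthy tunnels the pool tries to maintain (assumed)
POOL_LIMIT = 5000       # hard cap on total VMs (assumed)

def ingress_is_misconfigured() -> bool:
    return True  # the injected configuration fault: readiness never succeeds

def boot_tunnel_vm(pool: list) -> None:
    # A VM booted behind the broken ingress never becomes usable.
    pool.append("unusable" if ingress_is_misconfigured() else "healthy")

def reconcile(pool: list) -> None:
    # One pass of a naive capacity loop: boot VMs until enough are healthy.
    while sum(1 for vm in pool if vm == "healthy") < TARGET_HEALTHY:
        if len(pool) >= POOL_LIMIT:
            break  # pool is now full of unusable VMs; customers cannot start tunnels
        boot_tunnel_vm(pool)

pool = []
reconcile(pool)
print(len(pool), "VMs booted;", pool.count("healthy"), "are healthy")  # 5000 booted; 0 healthy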

How we fixed it:

The configuration fault was corrected quickly, but by then it had left behind a pool of unusable tunnels and a large number of customers retrying against the tunnel service all at once. We resolved this by tuning the mechanism that reaps unusable tunnels and by throttling customer tunnel requests until tunnel capacity had recovered.
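
For readers curious what these two levers look like in practice, here is a rough, hypothetical Python sketch of a reaper deadline and a token-bucket throttle. The thresholds, rates, and names (REAP_AFTER_SECONDS, TokenBucket, and so on) are illustrative assumptions, not our production tooling.

# Illustrative only: reap tunnels that never became usable within a deadline,
# and rate-limit incoming tunnel start requests while capacity recovers.
import time

REAP_AFTER_SECONDS = 120      # assumed: reap tunnels not usable within 2 minutes
REQUESTS_PER_SECOND = 5.0     # assumed throttle rate during recovery
BURST = 20                    # assumed burst allowance

def reap_unusable(tunnels: dict, now: float) -> list:
    # Return IDs of tunnels that never became usable within the deadline.
    return [tid for tid, t in tunnels.items()
            if not t["usable"] and now - t["booted_at"] > REAP_AFTER_SECONDS]

class TokenBucket:
    # Throttle tunnel start requests while the pool refills.
    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

throttle = TokenBucket(REQUESTS_PER_SECOND, BURST)
if throttle.allow():
    pass  # forward the tunnel start request to the tunnel service
else:
    pass  # reject with a retry-after hint until capacity has recovered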

What we are doing to prevent it from happening again:

We have hardened our configuration change process so that it is no longer possible to inject configuration faults of this kind. We are also adding tunnel capacity and upgrading our hypervisors to resolve an issue where VMs boot more slowly when many boot at once. The tunnel request throttling tooling and operating procedures developed during this incident are now standard practice.

Posted Feb 10, 2019 - 10:56 PST

Resolved
Posted Jan 30, 2019 - 09:41 PST