January 30th, 2019 9:54am - 5:01pm
An ingress point to our service failed, causing the system that boots tunnel VMs to launch a large number of unusable tunnels. This reduced effective capacity to zero and left customers unable to start tunnels.
We mistakenly injected a configuration fault into the ingress that handles tunnel start requests, which triggered a runaway feedback loop that exhausted our tunnel capacity.
The configuration fault was corrected quickly, but it left behind a pool of unusable tunnels and a surge of customers hitting the tunnel service all at once. We resolved this by tuning the mechanism that reaps unusable tunnels and by throttling customer tunnel requests until tunnel capacity had recovered.
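The report does not describe the throttling mechanism itself; a minimal sketch of one common approach, a token bucket, is below. All names here are hypothetical and illustrative, not the actual tooling used during the incident.

```python
import time

class TokenBucket:
    """Hypothetical request throttle: admits at most `rate` requests per
    second on average, with bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed interval, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        # Caller rejects or delays the tunnel start request.
        return False
```

During recovery, an operator could set `rate` well below normal demand and raise it as tunnel capacity comes back, smoothing the thundering herd of retrying customers.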
We have hardened our configuration change process so that this class of configuration fault can no longer be injected. We are also adding tunnel capacity and upgrading the hypervisors to resolve an issue where VMs boot more slowly when many boot at once. The tooling and operating procedures for tunnel request throttling developed during this incident are now part of our standard procedures.