Date: February 28, 2018
Time: 7:00am - 11:15am PST
Users experienced a 20-40% error rate when accessing our service's web application or using its API.
Why did it happen:
An early implementation of a new feature was allowed to remain generally available alongside a more performant version of the same feature. Increased usage of the less performant version put unexpected pressure on resources, which degraded the feature as well as other related services.
What did we do to fix it:
We disabled the early implementation of the feature and directed user traffic to the more performant solution.
What we are doing to prevent this from happening again:
In addition to migrating user traffic to the more performant solution, we’ll be implementing more proactive feature migration processes and deepening our existing application performance monitoring to ensure a faster response to any future issues.