Date of Incident: June 20, 2017
Time of Incident: 12:05 pm PDT to 1:55 pm PDT
What Happened: REST API calls began returning HTTP 5xx errors at a high rate. The error rate started at 15% and climbed to 99% over the next hour. Services that depend on successful REST API calls, such as our Web Application and the Sauce Connect client, failed at a similar rate.
Why It Happened: Our database experienced an unusually high volume of traffic, the origin of which is still being investigated. Resource usage rose to critically high levels. As query response times increased, API calls and downstream services that depend on database queries, such as our Web Application and Sauce Connect, started to time out and fail.
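To illustrate the mechanism described above, here is a minimal sketch of how a fixed caller timeout can turn rising query latency into a rising failure rate. Everything in it is an assumption made for illustration: the function name `failure_rate`, the 1000 ms timeout, and the latency samples are hypothetical and are not measurements from this incident.

```python
def failure_rate(latencies_ms, timeout_ms):
    """Return the fraction of calls whose query latency exceeds the timeout."""
    if not latencies_ms:
        return 0.0
    failed = sum(1 for latency in latencies_ms if latency > timeout_ms)
    return failed / len(latencies_ms)


# Hypothetical caller timeout and query latencies (milliseconds).
TIMEOUT_MS = 1000
normal = [50, 80, 120, 200, 90, 60, 150, 110, 70, 95]

# Assumed slowdown factors standing in for growing resource contention.
degraded = [latency * 8 for latency in normal]
saturated = [latency * 40 for latency in normal]

for label, sample in (("normal", normal),
                      ("degraded", degraded),
                      ("saturated", saturated)):
    print(f"{label}: {failure_rate(sample, TIMEOUT_MS):.0%} of calls time out")
```

In this toy model no calls fail at normal latency, a minority fail once the slowest queries cross the timeout, and nearly all fail once every query is slower than the timeout. That is the same general shape as an error rate that starts low and climbs as contention grows.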
What We Did To Fix It: We terminated all database processes and released all the SQL handles on the buffer pool. This stopped the contention for memory and allowed query response times to return to normal.
What We Are Doing To Prevent This From Happening Again: We are performing an exhaustive analysis of our existing queries, focusing on eliminating potential problem sources and improving their overall safety and efficiency.