Monday May 2nd 2022, 06:42 - 07:42 UTC
Tests running in our US-East-1 datacenter could not be loaded in our UI.
Requests to the jobs REST API endpoint were failing with HTTP 504 status codes. The cause was a hung internal DB query, which triggered a retry mechanism. The retries increased load on the DB and led to a cascading failure of subsequent queries to the jobs endpoint.
We canceled the hanging queries.
We are going to put in place a safety mechanism to prevent long-running queries from overwhelming the DB, and improve alerting so that we are notified earlier if a similar situation occurs.
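As an illustration only, one common form of such a safety mechanism is a per-query deadline that aborts any statement running past a time limit. The sketch below is not our production implementation. It uses Python's built-in sqlite3 module as a stand-in database, and the helper name run_with_deadline and the timeout_s parameter are hypothetical.

```python
import sqlite3
import time


def run_with_deadline(conn, sql, params=(), timeout_s=1.0):
    """Run a query and abort it if it runs longer than timeout_s seconds.

    Hypothetical helper for illustration; sqlite3 stands in for the real DB.
    """
    deadline = time.monotonic() + timeout_s

    def check_deadline():
        # A non-zero return value tells SQLite to abort the running statement.
        return 1 if time.monotonic() > deadline else 0

    # Call check_deadline every 1000 SQLite virtual machine instructions.
    conn.set_progress_handler(check_deadline, 1000)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        # Always remove the handler so later queries are not affected.
        conn.set_progress_handler(None, 0)
```

An aborted query raises sqlite3.OperationalError, which the caller can report as a failure instead of retrying immediately. Most server databases offer an equivalent server-side statement timeout setting, which is usually the better place to enforce the limit.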