Friday November 1st 2023, 10:15 - 18:00 UTC
A single customer continuously ran a large number of misconfigured tests, which raised the overall error rate in the US-West region and triggered our alerting system. Although the error rate exceeded the alerting threshold, no other customers' tests were affected.
Because of the customer's misconfiguration, every test they ran failed to start. Our test validation service did not detect the misconfiguration, so VMs were still allocated to tests that could not start. The large number of failing tests tripped our alerting system, though only that customer was affected.
We worked with the individual customer to stop and reconfigure their tests.
We are enhancing our test configuration validation to inform customers of these issues before a VM is allocated to a test that will ultimately fail. We are also improving our visibility into test errors so that we can better distinguish and monitor individual customer issues versus overall system issues.
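As a rough illustration of separating individual customer issues from overall system issues, the sketch below recomputes a region's error rate after setting aside the customer with the most failures. It is not our production alerting logic: the function name `classify_error_spike`, the `(customer_id, failed)` input shape, and the 5% threshold are all assumptions made for the example.

```python
from collections import Counter


def classify_error_spike(results, threshold=0.05):
    """Classify one region/time window of test results.

    `results` is an iterable of (customer_id, failed) pairs.
    The input shape and the default threshold are hypothetical.
    """
    results = list(results)
    if not results:
        return "ok"

    # Count failures per customer and compute the overall error rate.
    failures = Counter(cid for cid, failed in results if failed)
    overall_rate = sum(failures.values()) / len(results)
    if overall_rate <= threshold:
        return "ok"

    # Set aside the customer with the most failures and recompute.
    top_customer, _ = failures.most_common(1)[0]
    rest = [failed for cid, failed in results if cid != top_customer]
    if not rest:
        return "single_customer"

    rest_rate = sum(1 for failed in rest if failed) / len(rest)
    if rest_rate <= threshold:
        # The spike disappears without the top customer.
        return "single_customer"
    return "system_wide"
```

A window in which one customer accounts for every failure would be classified as `single_customer`, which could be routed to customer outreach rather than paged as a regional incident.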