Friday May 5th 2023, 00:06 - Monday May 8th 2023, 16:52 UTC
Customers using real device testing intermittently could not see available devices, start Appium tests, or fetch test results. Each occurrence lasted about 10 minutes and recurred roughly every 4 hours.
During the incident, we saw a periodic spike in requests that eventually consumed all resources in our connection pool. When that happened, our service liveness probes failed, which kicked off a graceful shutdown of nodes in our service. This shutdown and subsequent restart made devices temporarily unavailable until the service had fully come back up.
To quickly restore service, we increased the resources allocated to the impacted service. To address the issue in the long term, we enhanced the way we throttle these requests coming into the system.
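For illustration only, one common way to throttle incoming requests is a token bucket, which admits a steady rate of requests while allowing short bursts and rejects the excess before it can exhaust a shared resource such as a connection pool. The sketch below is a generic example, not the throttling deployed for this incident; the class name `TokenBucket` and the `rate` and `capacity` values are hypothetical.

```python
import time


class TokenBucket:
    """Admit about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second (assumed value)
        self.capacity = capacity      # maximum burst size (assumed value)
        self.clock = clock            # injectable clock, useful for testing
        self.tokens = float(capacity) # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=2, capacity=3)
    # A burst of five back-to-back requests: the first three are admitted,
    # and later ones are rejected until the bucket refills.
    for i in range(5):
        print(i, "admitted" if bucket.allow() else "rejected")
```

Rejecting excess requests at the edge keeps a spike from holding every pooled connection, so health checks such as liveness probes can still be served while the spike passes.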
Beyond the fixes already deployed during the restoration efforts, we are also looking at ways to detect these spikes better and handle them when they occur.