2023-May-5 Service Incident 2

Incident Report for Sauce Labs

Postmortem

Dates:

Friday May 5th 2023, 00:06 - Monday May 8th 2023, 16:52 UTC

What happened:

Customers using real device testing were intermittently unable to see available devices, start Appium tests, or fetch test results. The impact recurred for about 10 minutes every 4 hours.

Why it happened:

During the incident, we saw a periodic spike in requests that eventually consumed all resources in our connection pool. When that happened, our service liveness probes failed, which kicked off a graceful shutdown of nodes in our service. This shutdown and subsequent restart caused the temporary unavailability of devices until the service fully came back up.
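
The failure chain described above can be illustrated with a small sketch. It is not our production code: the ConnectionPool class, the pool size of 3, and the assumption that the liveness probe must borrow a connection to succeed are all hypothetical and chosen only to show how an exhausted pool can make a healthy process look dead.

```python
import queue


class ConnectionPool:
    """Hypothetical bounded pool; each slot stands in for one connection."""

    def __init__(self, size):
        self._slots = queue.Queue(maxsize=size)
        for slot in range(size):
            self._slots.put(slot)

    def acquire(self):
        """Return a free slot, or None if every connection is in use."""
        try:
            return self._slots.get_nowait()
        except queue.Empty:
            return None

    def release(self, slot):
        self._slots.put(slot)


def liveness_probe(pool):
    """Assumed probe: reports healthy only if a connection can be borrowed."""
    slot = pool.acquire()
    if slot is None:
        return False
    pool.release(slot)
    return True


pool = ConnectionPool(size=3)
healthy_before = liveness_probe(pool)

# A request spike borrows every connection and holds on to it.
held = [pool.acquire() for _ in range(3)]
healthy_during = liveness_probe(pool)

# Once the spike drains, the probe succeeds again.
for slot in held:
    pool.release(slot)
healthy_after = liveness_probe(pool)

print(healthy_before, healthy_during, healthy_after)  # True False True
```

In this sketch the process itself never crashes; the probe fails only because it competes with request traffic for the same limited resource, which is enough for an orchestrator to restart the node.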

How we fixed it:

To quickly restore service, we increased the resources for the impacted service. To address this issue in the long term, we improved how we throttle these incoming requests.
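
As an illustration of request throttling in general, the sketch below uses a token bucket. This report does not specify which throttling mechanism we use, so the algorithm, the TokenBucket class, and the rate and capacity values are assumptions made for the example only.

```python
class TokenBucket:
    """Hypothetical token-bucket throttle; time is passed in explicitly."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate              # tokens added per second (assumed value)
        self.capacity = capacity      # maximum burst size (assumed value)
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=2.0, capacity=5)

# Eight requests arrive at the same instant: five pass, three are rejected.
burst = [bucket.allow(now=0.0) for _ in range(8)]

# One second later the bucket has refilled by two tokens.
later = bucket.allow(now=1.0)

print(burst.count(True), burst.count(False), later)  # 5 3 True
```

Rejecting the excess requests early keeps a burst from holding every pooled connection at once, at the cost of some callers needing to retry.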

What we are doing to prevent it from happening again:

Beyond the fixes already deployed during the restoration efforts, we are also looking at ways to detect these spikes better and handle them when they occur.
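
One simple way to flag a request spike is to compare each interval's request count against a rolling baseline. The sketch below is only an example of that idea, not a description of our monitoring: the SpikeDetector class, the history length, the 3x factor, and the sample counts are all hypothetical.

```python
from collections import deque


class SpikeDetector:
    """Hypothetical detector: flags a count well above the recent average."""

    def __init__(self, history=6, factor=3.0):
        self.counts = deque(maxlen=history)  # recent non-spike counts
        self.factor = factor                 # assumed spike threshold

    def observe(self, count):
        baseline = sum(self.counts) / len(self.counts) if self.counts else None
        is_spike = (
            baseline is not None
            and baseline > 0
            and count > self.factor * baseline
        )
        # Keep spikes out of the baseline so they do not mask later ones.
        if not is_spike:
            self.counts.append(count)
        return is_spike


detector = SpikeDetector()
samples = [100, 110, 95, 105, 900, 100]  # made-up requests per interval
flags = [detector.observe(count) for count in samples]

print(flags)  # [False, False, False, False, True, False]
```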

Posted Jun 02, 2023 - 17:50 UTC

Resolved

After taking remedial action, RDC tests are consistently starting successfully in the US-West-1 Data Center. All services are fully operational.

Posted May 05, 2023 - 18:12 UTC

Update

After making some changes, we are seeing a decrease in RDC start failures. We are still investigating further.

Posted May 05, 2023 - 16:25 UTC

Investigating

Android and iOS Real Device tests are intermittently failing to start in our US-West-1 Data Center. We are investigating.

Posted May 05, 2023 - 15:27 UTC

This incident affected: Automated Real Device Testing (US-West) and Live Real Device Testing (US-West).