2023-January-05 Service Incident 2

Incident Report for Sauce Labs

Postmortem

Dates:

Thursday January 5th 2023, 11:25 - 14:45 UTC

What happened:

Roughly 30 minutes after an RDC deployment, our internal device availability dashboards showed a sharp drop in available iOS devices. We were alerted to the drop and found that iOS device availability was down to ~50%. During this time, customers would have seen the affected devices as “busy” or otherwise unavailable, but existing sessions (started before the deployment) were not affected.

Why it happened:

iOS devices gradually became unavailable after an RDC deployment because the service responsible for device cleaning failed, and devices that have not been cleaned cannot be made available to users. Device cleaning failed because a spike in requests to the app resigner service caused the installation of internal iOS apps during device cleaning to fail. This happened because these internal iOS apps were not properly cached after the deployment.

How we fixed it:

We immediately rolled back to the previous RDC version, before identifying the culprit, since the issue was correlated with a deployment. It is unclear whether the rollback helped, and it is unlikely to have fixed the issue entirely, as the resigner service was already in an unrecoverable state.

Disk space on the resigner cache server was 100% utilized, and the web server was returning HTTP 500 error responses. We cleared the disk to free up space and restarted the web server. After this, the volume of requests to the resigner service was still high, but requests were gradually being processed and device availability was recovering.

What we are doing to prevent it from happening again:

We are looking at better monitoring and management of the cache server and resigner service.
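As an illustration of the kind of check such monitoring could include, here is a minimal sketch of a disk-utilization probe in Python. It is not our actual tooling: the function names, the 80% and 95% thresholds, and the monitored path are all assumptions chosen for the example.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the percentage of disk space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def classify_usage(percent: float, warn_at: float = 80.0, critical_at: float = 95.0) -> str:
    """Map a utilization percentage to a severity level.

    The thresholds are hypothetical defaults, not values taken from the incident.
    """
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # "/" is a placeholder; a real check would point at the cache volume.
    pct = disk_usage_percent("/")
    print(f"{classify_usage(pct)}: {pct:.1f}% used")
```

A check like this, run on a schedule and wired to alerting, would flag a filling cache disk before it reaches 100% utilization.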

Posted Jan 27, 2023 - 09:30 UTC

Resolved

After taking remedial action, availability of our iOS real devices has returned to normal levels. This issue is now resolved and all services are fully operational.

Posted Jan 05, 2023 - 14:10 UTC

Monitoring

After taking remedial action, we are seeing improved availability of our iOS real devices in our US-West-1 & EU-Central-1 datacenters. We are currently monitoring.

Posted Jan 05, 2023 - 13:43 UTC

Investigating

We are currently experiencing decreased availability of our iOS real devices in both our US-West-1 & EU-Central-1 datacenters. We are investigating.

Posted Jan 05, 2023 - 12:43 UTC

This incident affected: Automated Real Device Testing (US-West, EU-Central) and Live Real Device Testing (US-West, EU-Central).