Customers may experience intermittent errors during automated browser and virtual mobile device tests in our US-West-1 datacenter. We are closely monitoring and investigating the affected services.
2022-August-11 Resolved Service Incident
Incident Report for Sauce Labs
Postmortem

Dates:

Thursday August 11th 2022, 00:24 - 14:50 UTC

What happened:

During a period of high request volume, the service that handles incoming Selenium WebDriver commands for both new and active sessions intermittently returned HTTP 503 errors: “No server is available to handle this request”.

Why it happened:

The main cause of the increased error rate was CPU throttling on the nodes that run the primary service handling these command requests. The throttling caused the service to scale up and down throughout the incident, during which we saw:

  • The graceful shutdown of replicas taking upwards of 30 minutes
  • Uneven load balancing across the replicas for the service

Together, these conditions limited the capacity available for command requests, which led to the HTTP 503 errors.
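
As a rough illustration of how slow-draining replicas and uneven load balancing can compound, the toy model below estimates how many requests per second would go unserved. It is a sketch, not the production service: the function name, the replica capacities, the traffic weights, and the assumption that a draining replica takes no new requests are all hypothetical.

```python
def rejected_requests(incoming_rps, replicas):
    """Return the requests/sec that cannot be served in this toy model.

    replicas: list of (capacity_rps, traffic_weight, draining) tuples.
    Assumption: a draining replica accepts no new requests, so traffic is
    split across the remaining replicas in proportion to their weights.
    """
    active = [(cap, weight) for cap, weight, draining in replicas if not draining]
    if not active:
        return incoming_rps  # nothing can serve: every request is rejected
    total_weight = sum(weight for _, weight in active)
    rejected = 0.0
    for cap, weight in active:
        assigned = incoming_rps * weight / total_weight
        # Anything above a replica's capacity is counted as a 503.
        rejected += max(0.0, assigned - cap)
    return rejected


# Four replicas of 100 rps each, 360 rps of incoming traffic (made-up numbers).
even = [(100, 1, False)] * 4
uneven = [(100, 4, False), (100, 3, False), (100, 2, False), (100, 1, False)]
draining = [(100, 1, False), (100, 1, False), (100, 1, True), (100, 1, True)]

print(rejected_requests(360, even))      # 0.0
print(rejected_requests(360, uneven))    # 52.0
print(rejected_requests(360, draining))  # 160.0
```

In this model the total nominal capacity (400 rps) exceeds the load in all three cases, yet requests are still rejected once traffic is skewed or part of the fleet is draining.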

How we fixed it:

The team responded by terminating the replicas that were taking a long time to shut down during the period of high request volume, which increased available capacity.
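
The report does not describe how the slow replicas were identified. One way this kind of selection could be expressed is sketched below, assuming a hypothetical inventory of replicas with a `drain_started` timestamp and an illustrative five-minute threshold; neither comes from the incident itself.

```python
from datetime import datetime, timedelta, timezone


def replicas_to_terminate(replicas, now, max_drain=timedelta(minutes=5)):
    """Return names of replicas that have been draining longer than max_drain.

    replicas: list of dicts with hypothetical keys "name" and
    "drain_started" (a timezone-aware datetime, or None if not draining).
    """
    return [
        r["name"]
        for r in replicas
        if r["drain_started"] is not None and now - r["drain_started"] > max_drain
    ]


# Example inventory (made-up names and times).
now = datetime(2022, 8, 11, 6, 0, tzinfo=timezone.utc)
inventory = [
    {"name": "replica-a", "drain_started": now - timedelta(minutes=30)},
    {"name": "replica-b", "drain_started": None},
    {"name": "replica-c", "drain_started": now - timedelta(minutes=2)},
]
print(replicas_to_terminate(inventory, now))  # ['replica-a']
```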

What we are doing to prevent it from happening again:

As part of the stabilization effort, we have disabled auto-scaling until:

  • Load balancing is improved for this service
  • The impact that the long graceful shutdown has on the service’s availability is reduced

Additionally, we identified steps to improve future detection and response times in similar cases.

Posted Sep 22, 2022 - 14:40 UTC

Resolved
Between 05:00 UTC and 13:15 UTC, we experienced an elevated 5xx error rate on all automated tests (Virtual and Real Device Cloud) in our US-West-1 data center. This issue is now resolved, and all services are fully operational.
Posted Aug 11, 2022 - 13:00 UTC