2022-August-11 Service Incident

Incident Report for Sauce Labs

Postmortem

Dates:

Thursday August 11th 2022, 00:24 - 14:50 UTC

What happened:

During a period of high request volume, the service that handles incoming Selenium WebDriver commands for both new and active sessions intermittently returned HTTP 503 errors: “No server is available to handle this request”.

Why it happened:

The main cause of the increased error rate was CPU throttling on the nodes that run the primary service handling these command requests. The throttling caused the service to scale up and down throughout the incident, during which we saw:

The graceful shutdown of replicas taking upwards of 30 minutes
Uneven load balancing across the replicas for the service

Together, these conditions limited the capacity available for command requests, which resulted in the HTTP 503 errors.
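
As a rough illustration of why uneven load balancing can produce errors even when total capacity looks sufficient, consider the minimal Python sketch below. The replica count, per-replica limit, and traffic shares are hypothetical numbers chosen for the example, not measurements from this incident, and `rejected_requests` is an illustrative helper rather than part of any Sauce Labs system.

```python
def rejected_requests(total_requests, weights, per_replica_limit):
    """Count requests that exceed replica limits under a weighted traffic split.

    weights: fraction of traffic each replica receives (must sum to 1).
    per_replica_limit: requests one replica can serve in the interval
    (a hypothetical figure for this sketch).
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    rejected = 0
    for weight in weights:
        assigned = round(total_requests * weight)
        # Anything above the replica's limit cannot be served by that replica.
        rejected += max(0, assigned - per_replica_limit)
    return rejected


# Hypothetical setup: 4 replicas, each able to serve 100 requests per interval,
# so 360 requests fit comfortably within the aggregate capacity of 400.
even = rejected_requests(360, [0.25, 0.25, 0.25, 0.25], 100)
skewed = rejected_requests(360, [0.40, 0.30, 0.20, 0.10], 100)
print(even, skewed)
```

With an even split every replica stays under its limit, while the skewed split overloads the two busiest replicas even though the other two still have spare capacity.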

How we fixed it:

The team responded by terminating the replicas that were taking a long time to shut down during the period of high request volume, which increased available capacity.

What we are doing to prevent it from happening again:

As part of the stabilization effort, we have disabled auto-scaling until:

Load balancing is improved for this service
The impact that the long graceful shutdown has on the service’s availability is reduced
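
One common way to limit the impact of a long graceful shutdown is to cap the drain time and force termination once the cap is reached. The Python sketch below shows that general idea only; it is an assumption about what such a policy could look like, not a description of the change made to this service. The `drain_replica` function and the `active_sessions` callable are hypothetical names introduced for the example.

```python
import time


def drain_replica(active_sessions, deadline_seconds, poll_seconds=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Wait for in-flight sessions to finish, but never past the deadline.

    active_sessions: hypothetical callable returning the number of sessions
    still running on the replica.
    Returns "drained" if the replica emptied in time, "forced" otherwise.
    """
    deadline = clock() + deadline_seconds
    while active_sessions() > 0:
        if clock() >= deadline:
            # Deadline reached with sessions still open: stop waiting so the
            # replica can be terminated and replaced.
            return "forced"
        sleep(poll_seconds)
    return "drained"
```

The `clock` and `sleep` parameters exist so the function can be exercised with a fake clock in tests; in normal use the defaults apply. A shorter deadline frees capacity sooner at the cost of interrupting sessions that are still running, so the value is a trade-off rather than a fixed recommendation.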

Additionally, we identified steps to improve future detection and response times in similar cases.

Posted Aug 26, 2022 - 13:30 UTC

Resolved

Error rates have stabilized, and whilst we continue to monitor the situation, this incident is resolved.

Posted Aug 16, 2022 - 15:14 UTC

Update

We have seen continued improvement in error rates over the weekend and continue to monitor the situation.

Posted Aug 15, 2022 - 12:28 UTC

Monitoring

We have taken remedial action and have seen an improvement across all testing over the past few hours. We are continuing to monitor.

Posted Aug 13, 2022 - 07:03 UTC

Update

We are continuing to investigate this issue.

Posted Aug 12, 2022 - 09:46 UTC

Investigating

Mac Desktop and Mobile Tests in the US-West-1 data center have elevated error rates and may fail to start.
We are investigating.

Posted Aug 11, 2022 - 21:35 UTC

This incident affected: Automated Browser Testing (US-West), Automated Virtual Mobile Device Testing (US-West), Automated Real Device Testing (US-West), Live Browser Testing (US-West), Live Virtual Mobile Device Testing (US-West), Live Real Device Testing (US-West), Visual Testing (Visual Testing Hub), and Native Framework Mobile App Testing (US-West).