Thursday August 11th 2022, 00:24 - 14:50 UTC
During a period of high request volume, the service that handles incoming Selenium WebDriver commands for both new and active sessions returned intermittent HTTP 503 errors: “No server is available to handle this request”.
The main driver of the increased error rate was CPU throttling on the nodes running the primary service that handles these command requests. The throttling caused the service to scale up and down throughout the incident, during which we saw:
These conditions ultimately limited the capacity available for command requests, which resulted in the HTTP 503 errors.
The team responded by terminating the replicas that were taking a long time to shut down during the period of high request volume, which increased available capacity.
As part of the stabilization effort, we have disabled auto-scaling until:
Additionally, we identified steps to improve future detection and response times in similar cases.
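As an illustration only, and not a description of the detection steps referred to above, the following Python sketch shows one common way CPU throttling of this kind can be surfaced: comparing two samples of a Linux cgroup `cpu.stat` file and computing the fraction of scheduler periods in which the container was throttled. The `nr_periods` and `nr_throttled` fields are standard in cgroup `cpu.stat` output. The function names and the 0.25 alert threshold are hypothetical and chosen purely for the example.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup cpu.stat file into a dict."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name integer" lines; skip anything else.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(before: dict[str, int], after: dict[str, int]) -> float:
    """Fraction of enforcement periods that were throttled between two samples."""
    periods = after.get("nr_periods", 0) - before.get("nr_periods", 0)
    throttled = after.get("nr_throttled", 0) - before.get("nr_throttled", 0)
    if periods <= 0:
        # No periods elapsed (or counters reset): report no throttling.
        return 0.0
    return throttled / periods


# Hypothetical threshold: alert when a quarter or more of periods are throttled.
ALERT_THRESHOLD = 0.25


def is_throttled(before: dict[str, int], after: dict[str, int],
                 threshold: float = ALERT_THRESHOLD) -> bool:
    """Return True when the throttled fraction meets or exceeds the threshold."""
    return throttle_ratio(before, after) >= threshold
```

In practice the two samples would be read from the container's `cpu.stat` file a fixed interval apart. Working from counter deltas rather than absolute values keeps the signal tied to recent behavior instead of the container's whole lifetime.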