Friday August 26th 2022, 14:58 - 16:07 UTC
In our US-West-1 region, we had a brief interruption during which Sauce Connect tunnels failed to start.
The user authentication service that Sauce Connect relies on was unresponsive for approximately 5 minutes. The API gateway that serves this service was not distributing requests evenly, so some pods received the bulk of the requests and became CPU-bound. This caused a service within Sauce Connect to go into a crash loop, and the pods left in the pool were unable to serve all the requests.
We restarted the authentication service to distribute the load evenly. Once that happened, the affected service within Sauce Connect became healthy again.
We have added monitoring and alerting to inform us when the authentication service gets into this state. We are also looking at ways to distribute requests more evenly and to ensure that the right level of resources is allocated for both Sauce Connect and the authentication service.
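
For readers who want a concrete picture of the failure mode, the following is a minimal sketch of how uneven request distribution across pods could be detected. It is an illustration only and does not describe our actual monitoring: the function name `skewed_pods`, the pod names, the request counts, and the 2x threshold are all hypothetical.

```python
from statistics import mean


def skewed_pods(requests_per_pod, ratio_threshold=2.0):
    """Return the pods whose request count exceeds ratio_threshold times the mean.

    requests_per_pod maps a pod name to the number of requests it served
    in some observation window. Both the input shape and the threshold
    are assumptions made for this example.
    """
    if not requests_per_pod:
        return []
    average = mean(requests_per_pod.values())
    if average == 0:
        return []
    return sorted(
        pod
        for pod, count in requests_per_pod.items()
        if count > ratio_threshold * average
    )


# Hypothetical window in which one pod receives most of the traffic.
sample = {"auth-0": 900, "auth-1": 40, "auth-2": 30, "auth-3": 30}
print(skewed_pods(sample))  # prints ['auth-0']
```

A check of this kind flags a pod that is taking far more than its share of traffic before it becomes CPU-bound. A production alert would more likely be built on gateway and pod metrics over a rolling window.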