2023-November-22 Service Incident

Incident Report for Sauce Labs

Postmortem

Dates:

Thursday November 16th 2023, 16:15 UTC - Thursday December 7th 2023, 21:46 UTC

What happened:

Customers encountered sporadic test failures on the Virtual Device Cloud (VDC) and Real Device Cloud (RDC) throughout the incident. Workarounds implemented within the first two weeks alleviated the customer-visible symptoms. The incident was fully resolved on December 7th, when a fix provided by a third party was applied to all our regions.

Why it happened:

During a routine upgrade of our Google-managed Kubernetes clusters (from version 1.25 to 1.26), an undocumented change was introduced to the Container Network Interface plugin (Cilium). The change caused port conflicts with Kubernetes NodePorts during source NAT (SNAT). As a result, SYN-ACK packets were occasionally dropped, and services experienced communication failures at random times.
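
To make the port conflict concrete, here is a minimal sketch. It assumes one plausible reading that this report does not spell out: SNAT picks a source port that falls inside the NodePort range, so the returning SYN-ACK can be mistaken for NodePort traffic. The NodePort range shown is the Kubernetes default (30000-32767); the SNAT port ranges and all function names are illustrative assumptions, not values taken from the affected clusters.

```python
# Default Kubernetes NodePort range (30000-32767 inclusive).
NODEPORT_RANGE = range(30000, 32768)


def snat_port_conflicts(source_port: int, nodeport_range: range = NODEPORT_RANGE) -> bool:
    """Return True if a SNAT-selected source port lies inside the NodePort range."""
    return source_port in nodeport_range


def conflicting_fraction(snat_range: range, nodeport_range: range = NODEPORT_RANGE) -> float:
    """Fraction of the SNAT source-port range that overlaps the NodePort range."""
    overlap = range(
        max(snat_range.start, nodeport_range.start),
        min(snat_range.stop, nodeport_range.stop),
    )
    # len() of an empty (inverted) range is 0, so disjoint ranges give 0.0.
    return len(overlap) / len(snat_range)


if __name__ == "__main__":
    # Hypothetical SNAT range covering all non-privileged ports.
    wide = range(1024, 65536)
    # Hypothetical SNAT range that starts above the NodePort range.
    narrow = range(32768, 61000)
    print(f"wide range overlap:   {conflicting_fraction(wide):.2%}")
    print(f"narrow range overlap: {conflicting_fraction(narrow):.2%}")
```

Under these assumptions only a small share of source ports can collide, which is consistent with failures that appear occasionally and at random rather than on every connection.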

How we fixed it:

Teams restored functionality by adding retry logic to the services experiencing connection issues. This restored service for customers, but connections to services we could not directly manage were still being dropped in the background. To address this, we worked with Google to identify a corrective action: removing the Google-managed component, ip-masq-agent. Once it was removed, we replaced it with a version that we manage ourselves, which allowed us to omit the configuration flag causing the issue.
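
The retry workaround can be sketched roughly as follows. This is a generic illustration of connection retries with exponential backoff and jitter, not the code that was deployed; the function name, parameters, and default values are all assumptions made for the example.

```python
import random
import socket
import time


def connect_with_retry(host, port, attempts=5, base_delay=0.2, timeout=3.0,
                       connect=socket.create_connection, sleep=time.sleep):
    """Open a TCP connection, retrying on failure with exponential backoff.

    `connect` and `sleep` are injectable so the retry loop can be exercised
    without a real network (hypothetical design choice for this sketch).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            # A dropped SYN-ACK surfaces here as a timeout (an OSError subclass).
            return connect((host, port), timeout=timeout)
        except OSError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Back off 0.2s, 0.4s, 0.8s, ... plus jitter to avoid retry bursts.
                sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise ConnectionError(
        f"could not connect to {host}:{port} after {attempts} attempts"
    ) from last_error
```

Retries of this kind mask intermittent handshake failures from callers, but they do not remove the underlying packet loss, which is why a fix at the network layer was still needed.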

What we are doing to prevent it from happening again:

Although taking over management of the Google-managed component resolved the incident, we expect an official fix from Google, which is slated for completion by the end of January. In parallel, we are working with Google to improve our joint ability to detect such issues in the future, especially as new versions of these managed components are rolled out.

Posted Dec 18, 2023 - 16:48 UTC

Resolved

The Real Device and Emulator/Simulator errors have been resolved in our US-West and EU-Central Data Centers. All services are fully operational.

Posted Nov 22, 2023 - 17:48 UTC

Monitoring

After taking remedial action, we are seeing errors decrease for Real Device and Emulator/Simulator tests in our US-West and EU-Central Data Centers. We are monitoring.

Posted Nov 22, 2023 - 17:11 UTC

Investigating

We are aware of an ongoing issue impacting session creation for some Real Device and Emulator/Simulator automated tests on our US-West and EU-Central Data Centers. We are investigating.

Posted Nov 22, 2023 - 14:28 UTC

This incident affected: Automated Virtual Mobile Device Testing (US-West, EU-Central) and Automated Real Device Testing (US-West, EU-Central).