Thursday November 16th 2023, 16:15 - Thursday December 7th 21:46 UTC
Customers encountered sporadic VDC and RDC test failures throughout the incident. Workarounds alleviated the customer-visible symptoms within the first two weeks. The incident was fully resolved on December 7th, when a fix provided by a third party was applied to all our regions.
During a routine upgrade of our Google-managed Kubernetes clusters (from version 1.25 to 1.26), an undocumented change was introduced to the Container Network Interface (CNI) plugin, Cilium, which caused port conflicts with Kubernetes NodePorts during source NAT (SNAT). As a result, SYN-ACK packets were occasionally dropped, and communication with services failed at random times.
Teams restored functionality by adding retry logic to the services experiencing connection issues. This restored service for customers, but connections were still being dropped in the background for services we could not directly manage. To address these background issues, we worked with Google to identify a corrective action: removing the Google-managed component, ip-masq-agent. Once it was removed, we replaced it with a version that we manage ourselves and that omits the configuration flag causing the issue.
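The retry workaround described above can be illustrated with a minimal sketch. This is not the code our teams deployed. The function name call_with_retry, its parameters, and the backoff values are all hypothetical and chosen only for illustration. It assumes that a dropped SYN-ACK would typically surface to the caller as a connection timeout or reset, both of which Python reports as OSError subclasses.

```python
import random
import time


def call_with_retry(operation, attempts=4, base_delay=0.2, max_delay=5.0,
                    retry_on=(OSError,), sleep=time.sleep):
    """Run operation(), retrying when it raises a transient connection error.

    Hypothetical helper for illustration. ConnectionError and TimeoutError
    are subclasses of OSError, so the default retry_on covers both.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

A caller might wrap a connection attempt as call_with_retry(lambda: socket.create_connection((host, port), timeout=3)), where host and port stand in for the affected service. Retries of this kind only mask the symptom for idempotent operations. They do not remove the underlying packet loss, which is why the ip-masq-agent change was still required.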
Taking over management of the Google-managed component resolved the incident, but we still expect an official fix from Google, slated for completion by the end of January. In parallel, we are working with Google to improve our joint ability to identify such issues in the future, especially as new versions of these managed components are rolled out.