Wednesday February 22nd 2023, 14:20 - 16:13 UTC
Some sessions running on iOS devices were disconnected, and some iOS devices appeared as “busy” for the duration of the incident, preventing new sessions from being created on them.
We were gradually upgrading the cluster that runs our production device pools, starting with EU iOS. After ~50 pools were upgraded, we started receiving alerts about pool containers being down, prompting us to stop the upgrade. Once it was halted, we noticed that the just-upgraded pools were in an “Error” state and were throwing errors referencing network connectivity.
With the upgrade stopped, we immediately rolled back to the previous version, first on a single pool as a test and then on all affected pools. This returned the pools to a “Running” state, but network connectivity errors were still being thrown from inside them. After further troubleshooting, we restarted the affected servers, which restored network connectivity.
This upgrade had run successfully in our staging environment, with no degradation in device availability or network connectivity, so we were confident in performing it in production. Going forward, we will roll out these upgrades more slowly and gradually to better expose any issues they may introduce.