Thursday, September 7th, 2023, 22:36 - 22:55 UTC
Users experienced high wait times, followed by elevated error rates both as jobs timed out and after restoration as we handled the backlog of requests. This impacted Windows, Mac, iOS Simulator, and Android Emulator tests in the US West region.
An internal service responsible for maintaining virtual device state across our platforms (the cloud state service) crashed when MemoryStore failed over during routine GCP maintenance. The service did not recover gracefully, which caused a second internal service, responsible for pre-launching virtual devices based on demand, to have connectivity issues with the cloud state service. As a result, devices were not pre-launched successfully for the duration of the incident, leading to long device wait times.
We restarted the cloud state service and then the pre-launching service. After these restarts, all connections were re-established and both services resumed normal operation.
We are enhancing the cloud state service to handle a MemoryStore outage more gracefully, specifically by adding retry logic for re-establishing connectivity.
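As an illustration of the kind of retry logic described above, here is a minimal sketch in Python. It is not the actual implementation in the cloud state service. The function name connect_with_retry, its parameters, and the assumption that a failed connection raises ConnectionError are all hypothetical, chosen only to show reconnection with capped exponential backoff and jitter.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, backing off between failures.

    `connect` is a hypothetical callable that opens a connection to the
    backing store and raises ConnectionError when the store is unreachable.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not reconnect in lockstep after a failover.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    raise ValueError("max_attempts must be at least 1")
```

A caller would wrap whatever function opens the connection, for example connect_with_retry(open_store_connection), where open_store_connection is again a placeholder. The jitter matters in a failover scenario like this one: if every dependent service retries on the same schedule, the recovering store can be hit by synchronized reconnection bursts.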