Thursday January 5th 2023, 11:25 - 14:45 UTC
Roughly 30 minutes after a RDC deployment, our internal device availability dashboards started showing a sharp drop in available iOS devices. We were alerted to this drop and noticed that our iOS device availability was down to ~50%. During this time, customers would have seen the affected devices as “busy” or otherwise unavailable, but existing sessions (started before the deployment) were not affected.
iOS devices gradually became unavailable after an RDC deployment because the service responsible for device cleaning failed, and non-clean devices could not be made available to users. The device cleaning failed due to a spike in requests to the app resigner service causing internal iOS apps installed during device cleaning to fail. This was because these internal iOS apps were not properly cached after the deployment.
We immediately rolled back to the previous RDC version (before identifying the culprit) since the issue was correlated with a deployment. It is unclear (but also unlikely) to have fixed the issue entirely, as the resigner service was already in an unrecoverable state.
Disk space on the resigner cache server was 100% utilized, and the web server was throwing HTTP 500 error responses. This was cleared to make more space, and the web server was restarted. After this, requests to resigner service were still high but were gradually being processed, and device availability was recovering.
We are looking at better monitoring and management of the cache server and resigner service.