Thursday February 2nd 2023, 21:20 - 23:15 UTC
The number of devices reported as unavailable in one of our US data centers reached 5% for iOS devices and 20%. Our data center engineers were alerted when device availability dropped below 80%, and the drop in available devices was sustained.
A service called device inspector queries our device pools and reports back the set of available devices. This service runs on two nodes. One of them failed to query the device pool and, as a result, marked devices as unavailable.
We restarted the affected device inspector node, and the number of offline devices returned to normal levels.
We are looking at ways to improve the monitoring of the device inspection process, as well as the logic it uses, so that it handles these situations better.
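One way to express this kind of logic change is to treat a failed pool query as "unknown" rather than "unavailable". The Python sketch below is illustrative only and is not the device inspector's actual code. The names `DeviceState`, `PoolQueryError`, `inspect_pool`, and `query_pool` are hypothetical and were chosen for this example.

```python
from enum import Enum
from typing import Callable, Dict, Iterable, Set


class DeviceState(Enum):
    AVAILABLE = "available"
    UNAVAILABLE = "unavailable"
    UNKNOWN = "unknown"  # the pool could not be queried, so nothing is known


class PoolQueryError(Exception):
    """Hypothetical error raised when a device pool cannot be queried."""


def inspect_pool(
    known_devices: Iterable[str],
    query_pool: Callable[[], Set[str]],
) -> Dict[str, DeviceState]:
    """Classify each known device using a single pool query.

    query_pool is a hypothetical callable that returns the IDs of the
    devices the pool currently reports as available.
    """
    devices = list(known_devices)
    try:
        available = query_pool()
    except PoolQueryError:
        # A failed query says nothing about the devices themselves,
        # so report UNKNOWN instead of marking them UNAVAILABLE.
        return {device: DeviceState.UNKNOWN for device in devices}
    return {
        device: DeviceState.AVAILABLE if device in available else DeviceState.UNAVAILABLE
        for device in devices
    }
```

With this shape, a node that cannot reach the pool reports every device as unknown, and a consumer could fall back to the last known state or to the other node's result instead of taking devices offline.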