Dates:
Thursday June 6th - Tuesday June 27th
What happened:
Between June 6th and June 27th, customers intermittently experienced a higher-than-normal error rate when running tests in Sauce Labs’ Mac (including iOS simulator) and Windows Clouds.
Why it happened:
The underlying cause of the incident was a kernel bug introduced on May 25th. This bug affected the networking configuration of our VMs, gradually reducing our effective capacity by ~10%.
A number of contributing factors exacerbated the impact and obscured the underlying root cause:
- Devices continued running after tests were abandoned, due to a bug in our internal session manager service (fix deployed on June 9th)
- ~30% of overall VM capacity became unavailable due to a change intended to remediate the kernel bug (change deployed on June 20th)
How we fixed it:
We addressed the symptoms and ultimate root cause through several actions:
- Increased the capacity of Sauce Labs’ Mac Cloud with additional hardware from our Mac hardware vendors
- Improved iOS simulator and app setup load times
- Rolled back to a previous kernel version to address the underlying kernel bug (June 24th)
- Rolled back the initial fix that attempted to address the symptoms of the kernel bug (June 27th)
What we are doing to prevent it from happening again:
We are breaking down our corrective and preventative actions into the following areas:
Increasing Observability
- Expand metrics to better understand VM capacity states, specifically how long VMs remain in each state
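As an illustration only, a state-duration metric of this kind could be derived from VM state-transition events roughly as sketched below. The event shape, the state names (booting, available, in_use), and the function names are assumptions made for this example and do not describe Sauce Labs’ internal tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (vm_id, timestamp_in_seconds, new_state).
Transition = Tuple[str, float, str]


def state_durations(transitions: Iterable[Transition], now: float) -> Dict[str, List[float]]:
    """Return, for each state, the durations (in seconds) that VMs spent in it.

    The most recent state of each VM is still open, so it is measured up to `now`.
    """
    by_vm = defaultdict(list)
    for vm_id, ts, state in transitions:
        by_vm[vm_id].append((ts, state))

    durations = defaultdict(list)
    for events in by_vm.values():
        events.sort()  # order each VM's transitions by timestamp
        # Pair every transition with the one that follows it; the last one ends at `now`.
        for (start, state), (end, _) in zip(events, events[1:] + [(now, "")]):
            durations[state].append(end - start)
    return dict(durations)


def current_state_age(transitions: Iterable[Transition], now: float) -> Dict[str, Tuple[str, float]]:
    """Return, for each VM, its current state and how long it has been in that state."""
    latest: Dict[str, Tuple[float, str]] = {}
    for vm_id, ts, state in transitions:
        if vm_id not in latest or ts >= latest[vm_id][0]:
            latest[vm_id] = (ts, state)
    return {vm: (state, now - ts) for vm, (ts, state) in latest.items()}


if __name__ == "__main__":
    events = [
        ("vm-1", 0.0, "booting"),
        ("vm-1", 30.0, "available"),
        ("vm-1", 90.0, "in_use"),
        ("vm-2", 0.0, "booting"),
    ]
    print(state_durations(events, now=100.0))
    print(current_state_age(events, now=100.0))
```

In a sketch like this, a VM that stays in one state far longer than its peers (vm-2 in the sample data, which never leaves booting) shows up directly in the per-VM ages, which is the kind of signal a gradual capacity loss would otherwise hide.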
Improving iOS Simulator Performance
- Implement support for ARM-based instances
- Increase capacity of instant boot simulators
- Continue researching ways to optimize the iOS simulator start time (starting simulators earlier, starting app download earlier, optimizing code, etc.)