2023-June-24 Service Incident
Incident Report for Sauce Labs
Postmortem

Dates:

Tuesday, June 6th - Tuesday, June 27th

What happened:

Customers intermittently experienced a higher-than-normal error rate when running tests in Sauce Labs’ Mac Cloud (including iOS simulators) and Windows Cloud between June 6th and June 27th.

Why it happened:

The underlying cause of the incident was a kernel bug introduced on May 25th. The bug affected the networking configuration of our VMs and gradually reduced our effective capacity by ~10%, which led to higher-than-normal error rates in Sauce Labs’ Mac and Windows Clouds.

A number of contributing factors exacerbated the impact and obscured the underlying root cause:

  1. Devices kept running after their tests had been abandoned, due to a bug in our internal session manager service (fix deployed on June 9th)
  2. ~30% of overall VM capacity became unavailable due to a change intended to remediate the kernel bug (change deployed on June 20th)
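
To illustrate why a capacity reduction of this size can turn into test errors, here is a minimal sketch of the arithmetic. Only the ~10% and ~30% figures come from this report; the fleet size, the peak demand, and the function names are hypothetical, and the report does not state how the two reductions overlapped, so they are shown separately.

```python
def effective_capacity(fleet_size: int, unavailable_pct: int) -> int:
    """VMs still able to take sessions when unavailable_pct percent are out."""
    return fleet_size * (100 - unavailable_pct) // 100


def shortfall(fleet_size: int, unavailable_pct: int, peak_demand: int) -> int:
    """Sessions that cannot be placed at peak (0 if capacity is sufficient)."""
    return max(0, peak_demand - effective_capacity(fleet_size, unavailable_pct))


# Hypothetical fleet and demand; only the percentages come from the report.
FLEET_SIZE = 1000
PEAK_DEMAND = 850

for label, pct in [("kernel bug (~10%)", 10), ("June 20th change (~30%)", 30)]:
    capacity = effective_capacity(FLEET_SIZE, pct)
    missing = shortfall(FLEET_SIZE, pct, PEAK_DEMAND)
    print(f"{label}: capacity={capacity}, unplaced sessions at peak={missing}")
```

With these assumed numbers, a ~10% reduction still leaves headroom at peak, while a ~30% reduction does not, which is one way a remediation change could make the symptoms worse than the original bug.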

How we fixed it:

We addressed the symptoms and ultimate root cause through several actions:

  1. Added capacity to Sauce Labs’ Mac Cloud from our various Mac hardware vendors and improved iOS simulator and app setup load times
  2. Rolled back to a previous kernel version to address the underlying kernel bug (June 24th)
  3. Rolled back the initial fix that attempted to address the symptoms of the kernel bug (June 27th)

What we are doing to prevent it from happening again:

We are breaking down our corrective and preventative actions into the following areas:

  1. Increase Observability
    1. Expand metrics to better understand VM capacity states, specifically state durations
  2. Improve iOS Simulator Performance
    1. Implement support for ARM-based instances
    2. Increase capacity of instant boot simulators
    3. Continue researching ways to optimize the iOS simulator start time (starting simulators earlier, starting app download earlier, optimizing code, etc.)
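
As an illustration of the state-duration metric mentioned under observability, the sketch below sums how long a single VM spends in each state, given a log of state transitions. It is not Sauce Labs' implementation: the state names (`booting`, `ready`, `in_use`, `cleanup`), the timestamps, and the function name are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def state_durations(transitions, end_time):
    """Sum the time spent in each state.

    transitions: list of (timestamp, state) tuples sorted by timestamp,
                 each marking the moment the VM entered that state.
    end_time:    end of the observation window, closing the last state.
    """
    totals = defaultdict(timedelta)
    # Pair each transition with the next one; the last pairs with end_time.
    following = transitions[1:] + [(end_time, None)]
    for (start, state), (next_start, _) in zip(transitions, following):
        totals[state] += next_start - start
    return dict(totals)


# Hypothetical transition log for one VM.
t0 = datetime(2023, 6, 20, 12, 0, 0)
log = [
    (t0, "booting"),
    (t0 + timedelta(minutes=2), "ready"),
    (t0 + timedelta(minutes=3), "in_use"),
    (t0 + timedelta(minutes=13), "cleanup"),
    (t0 + timedelta(minutes=43), "booting"),
]
durations = state_durations(log, end_time=t0 + timedelta(minutes=45))

for state, spent in sorted(durations.items()):
    print(f"{state}: {spent}")
```

In this made-up log the VM spends 30 minutes in `cleanup` for a 10-minute test. A per-state duration metric would surface that kind of imbalance, whereas a simple count of VMs per state might not.
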
Posted Jul 11, 2023 - 10:25 UTC

Resolved
This incident has been resolved.
Posted Jun 29, 2023 - 11:52 UTC
Monitoring
We have taken steps to reduce error rates in our US-West-1 data center and we are seeing improvements. We are continuing to monitor the situation.
Posted Jun 25, 2023 - 14:22 UTC
Investigating
macOS tests are seeing elevated error rates in our US-West-1 data center. We are investigating.
Posted Jun 24, 2023 - 02:13 UTC
This incident affected: Automated Browser Testing (US-West) and Live Browser Testing (US-West).