2019-January-29 Service Incident
Incident Report for Sauce Labs Inc
Postmortem

Dates:

January 23rd, 2019 to January 29th, 2019

What happened:

The system that manages the demand for our cloud, as well as the cloud itself, became overwhelmed and unable to process a high volume of backlogged jobs, resulting in high wait times, slow page loads, and customers not being able to run tests. This issue occurred intermittently over the course of several days.

Why it happened:

This was the result of a runaway, amplifying feedback loop between our demand management system and our cloud. As more load was placed on our cloud, it began to slow. The demand management system responded by requesting more capacity, thus increasing the load and further amplifying the cycle until we were no longer able to provide capacity to our customers.
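The dynamics described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not a model of our production systems: the function name `simulate`, the parameters `base_throughput`, `slowdown`, and `gain`, and all numeric values are hypothetical and chosen purely for illustration.

```python
def simulate(arrivals, ticks=20, base_throughput=100.0, slowdown=0.01, gain=0.5):
    """Toy model of a demand manager coupled to a cloud that slows under load.

    All parameters are illustrative assumptions, not measured values.
    Returns a list of (backlog, throughput) pairs, one per tick.
    """
    backlog = 0.0
    load = arrivals  # work placed on the cloud during the current tick
    history = []
    for _ in range(ticks):
        # Assumption: the cloud slows down as the load placed on it grows.
        throughput = base_throughput / (1.0 + slowdown * load)
        # Jobs that cannot be served this tick join the backlog.
        backlog = max(0.0, backlog + arrivals - throughput)
        # Assumption: the demand manager reacts to backlog by requesting
        # more capacity, which adds load on the next tick.
        load = arrivals + gain * backlog
        history.append((backlog, throughput))
    return history


stable = simulate(arrivals=60)   # below the tipping point: no backlog forms
runaway = simulate(arrivals=70)  # above it: backlog and slowdown feed each other

print("stable final backlog:", stable[-1][0])
print("runaway backlog by tick:", [round(b, 1) for b, _ in runaway[:5]])
print("runaway throughput by tick:", [round(t, 1) for _, t in runaway[:5]])
```

In this toy model a modest increase in arrivals is enough to cross the tipping point. Once a backlog forms, each extra capacity request lowers throughput, which grows the backlog further.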

How we fixed it:

We corrected this by interrupting the cycle, and then decomposing and tuning the demand subsystems.
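One generic way to interrupt a cycle of this kind is to stop escalating capacity requests while the cloud is already slow, and to cap how much can be requested at once. The sketch below shows that general idea only; it is not a description of the actual change we made. The function `next_capacity_request`, its parameters, and the thresholds are all hypothetical.

```python
def next_capacity_request(backlog, cloud_latency_s, previous_request,
                          latency_limit_s=5.0, gain=0.5, max_request=200.0):
    """Decide how much capacity to request on the next cycle.

    Hypothetical controller: every name and threshold here is an
    illustrative assumption.
    """
    if cloud_latency_s > latency_limit_s:
        # The cloud is already struggling: back off instead of adding load,
        # which breaks the amplifying loop.
        return previous_request * 0.5
    # Otherwise grow the request in proportion to the backlog, up to a hard cap.
    return min(max_request, previous_request + gain * backlog)


# Usage: a slow cloud causes a back-off; a healthy cloud allows bounded growth.
print(next_capacity_request(backlog=100, cloud_latency_s=10.0, previous_request=80))
print(next_capacity_request(backlog=100, cloud_latency_s=1.0, previous_request=80))
print(next_capacity_request(backlog=1000, cloud_latency_s=1.0, previous_request=80))
```

The design choice here is that the controller treats cloud slowness as a signal to reduce pressure rather than as a reason to request more capacity.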

What we are doing to prevent it from happening again:

In the short term, we are continuing to tune the system and are expanding and upgrading our cloud to improve its performance. We are also working on a more substantial redesign of our demand management subsystem to handle the growth we anticipate in 2019 and beyond.

Posted 20 days ago. Jan 31, 2019 - 11:17 PST

Resolved
All systems are fully operational.
Posted 21 days ago. Jan 30, 2019 - 17:01 PST
Update
All services -- including automated tests, Sauce Connect tunnels, and our REST API -- are behaving normally. We are continuing to monitor but expect a full recovery soon.
Posted 21 days ago. Jan 30, 2019 - 16:08 PST
Update
Sauce Connect tunnels are starting. We are monitoring our service closely.
Posted 21 days ago. Jan 30, 2019 - 12:52 PST
Update
Error rates for automated tests have returned to normal levels. We are continuing to take remedial actions to restore our Sauce Connect service.
Posted 21 days ago. Jan 30, 2019 - 12:16 PST
Update
Dear Valued Customers:

In recent months, we have experienced ongoing issues with the stability and responsiveness of our cloud platform. A number of those issues have been particularly pronounced this week. Before I go into detail, I want to personally apologize for any difficulties these issues have caused. We realize the uninterrupted availability of our platform is imperative to your business, and is one of the primary reasons you put your trust in Sauce Labs. We consider it our responsibility to be 100 percent transparent on all issues related to the performance of our platform. With that in mind, I want to share the root cause of these issues, and outline how we’re addressing them. In addition, as always, we will continue to provide real-time updates on ongoing issues via our publicly available status page.

In October of 2018, we experienced a significant spike in the overall job volume running on our platform. This spike exposed a number of bottlenecks that directly impacted the performance and availability of our cloud. To address those bottlenecks and ensure long-term scalability, our engineers have been working to expand or upgrade virtually all of our components, including our Kubernetes cluster, ingress clusters, database clusters, VM cloud, and NAT hardware.

Unfortunately, in the course of making these necessary refinements, we experienced some availability and performance degradation issues. To remediate them and create a long-term path to optimal stability, our engineering team is working around the clock not only to resolve ongoing issues, but also to design controls that will limit the impact of future spikes in usage and deliver greater overall stability to our customers.

As we communicated in November, we are also in the process of building a new pair of Kubernetes clusters in our new San Jose data center. This will provide us with greater capacity and resolve many of the stability errors we’re currently experiencing. We are likewise in the process of reengineering our VM clouds to significantly improve hypervisor boot time, which will further improve the stability and performance of our cloud when usage spikes.

It is our number one priority to address these issues and permanently return our platform to the levels of stability and responsiveness you have come to expect of Sauce Labs. We are expending every available resource to make it happen. On behalf of everyone at Sauce Labs, I thank you for your patience as we work through these issues, and for continuing to trust us with your business.

Regards,

Charles
Posted 21 days ago. Jan 30, 2019 - 11:42 PST
Update
Error rates for automated tests are elevated again and many tests are unable to start.
Posted 21 days ago. Jan 30, 2019 - 11:35 PST
Update
Error rates for automated tests have returned to normal and tests are running. Sauce Connect tunnels are still not starting.
Posted 21 days ago. Jan 30, 2019 - 11:07 PST
Update
The error rate for automated tests is high and many tests are failing to start. Sauce Connect tunnels are still not starting. We are investigating.
Posted 21 days ago. Jan 30, 2019 - 10:34 PST
Update
Sauce Connect tunnels are not starting. We are taking remedial action.
Posted 21 days ago. Jan 30, 2019 - 09:54 PST
Update
Sauce Connect tunnels are intermittently not starting and some tunnels are disconnecting. We’re taking remedial action.
Posted 21 days ago. Jan 30, 2019 - 09:27 PST
Update
app.saucelabs.com is available. We are experiencing intermittent REST API slowness as well as intermittent high wait times in all our Clouds. Our engineers are actively investigating this and working on a fix.
Posted 21 days ago. Jan 30, 2019 - 02:47 PST
Update
app.saucelabs.com is unavailable and we are taking action.
Posted 21 days ago. Jan 30, 2019 - 02:28 PST
Investigating
We are experiencing intermittent REST API and UI slowness as well as intermittent high wait times in all our Clouds. Our engineers are actively investigating this and working on a fix.
Posted 22 days ago. Jan 29, 2019 - 16:32 PST
This incident affected: Sauce Connect (Sauce Connect VM), REST API (REST API VMs), Manual Testing (Manual VM Testing), Web Interface (Sauce UI, Analytics), and Automated VM Testing (Automated PC Testing, Automated Mac Testing, Automated iOS Simulator Testing, Automated Android Emulator Testing).