January 23rd, 2019 to January 29th, 2019
The system that manages the demand for our cloud, as well as the cloud itself, became overwhelmed and unable to process a high volume of backlogged jobs, resulting in high wait times, slow page loads, and customers not being able to run tests. This issue occurred intermittently over the course of several days.
This was the result of a runaway, amplifying feedback loop between our demand management system and our cloud. As more load was placed on our cloud, it began to slow. The demand management system responded by requesting more capacity, thus increasing the load and further amplifying the cycle until we were no longer able to provide capacity to our customers.
We corrected this by interrupting the cycle and then decomposing and tuning the demand management subsystems.
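The feedback loop described above, and the effect of interrupting it, can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions and not a model of the actual systems: the function `simulate_backlog`, its parameters (`gain`, `slowdown`, `arrivals`, `base_rate`, `initial_backlog`), and every number are hypothetical and chosen only to make the dynamic visible.

```python
def simulate_backlog(steps, gain, initial_backlog=50.0,
                     arrivals=10.0, base_rate=12.0, slowdown=0.05):
    """Toy model of a demand manager coupled to a cloud (all values hypothetical).

    gain:     how much extra load the demand manager places on the cloud
              per backlogged job (0 means the loop is interrupted).
    slowdown: how strongly that extra load reduces the cloud's service rate.
    Returns the backlog after each step.
    """
    backlog = initial_backlog
    history = []
    for _ in range(steps):
        # The demand manager reacts to backlog by requesting more capacity;
        # those requests are themselves load on the cloud.
        control_load = gain * backlog
        # The cloud slows down as load increases.
        service_rate = base_rate / (1.0 + slowdown * control_load)
        # Jobs arrive at a fixed rate and leave at the degraded service rate.
        backlog = max(0.0, backlog + arrivals - service_rate)
        history.append(backlog)
    return history


# Loop active: backlog drives more requests, which slow the cloud further.
runaway = simulate_backlog(steps=40, gain=1.0)

# Loop interrupted: the same starting backlog drains at the full service rate.
recovered = simulate_backlog(steps=40, gain=0.0)
```

In this sketch the only difference between the two runs is `gain`. With the loop active the backlog grows at every step, and with the loop interrupted the same starting backlog drains away, which mirrors the first part of the fix described above.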
In the short term, we are continuing to tune the system and are expanding and upgrading our cloud to improve its performance. We have also begun a more substantial redesign of our demand management subsystem to handle the growth we anticipate in 2019 and beyond.