Articulate 360: Articulate Review is unavailable
Incident Report for Articulate
Postmortem

On August 14 and 15, Articulate Review was unavailable for approximately three hours due to two separate incidents. We'd like to talk about what happened in each case, and what we're doing to improve the resiliency of our services.

Between 16:00 and 16:22 UTC on August 14, Articulate Review users were unable to publish or review their team content. This incident started as a simple investigation into a problem with our data pipeline, which we use to continuously improve and ensure our customers are getting the most out of Articulate 360. We take great care with production data, so digging into the data issue on a customer-facing system was not a risk we were willing to take. As an alternative, we decided to take a snapshot of production and restore it to a temporary database where we could do an in-depth analysis of the data pipeline failure. While we perform these snapshots routinely and automatically during non-peak traffic hours, this process put excessive load on the database and, coupled with a high traffic load, caused a performance degradation in Articulate Review. After several minutes, the database performance became so poor that our health checks (automated checks that determine the health of the service) deemed the servers to be unstable and removed their ability to serve production traffic. We were immediately alerted to the issue, and we started looking into the high database load. The engineer who triggered the database snapshot began investigating if it had any impact on the outage. While we were looking into it, the automated process finished taking the snapshot, and then our servers began to recover. At that point, we were able to restore Articulate Review to normal operation.

To prevent this from happening again, we are going to increase our available IOPS (the amount of work that a server can perform per second) on our database servers and put some additional safeguards around snapshots of production data.

The next day, August 15, between 20:40 and 23:41 UTC, Articulate Review users were again unable to publish or review their team content. We were immediately alerted to an issue with the service and began our investigation. We discovered that during a routine code change, a third party system (which we rely on for automation and orchestration of a complex series of steps) encountered an error that it could not recover from which left us in a state where we could not rollback to the previous version. Typically, when we are ready to push a new feature or a small bug fix to Articulate 360, we perform a series of automated health and stability checks against the new code before it reaches our customers. During this code change, one of those health checks failed, and the system attempted to roll back to the previous version automatically. Unfortunately, due to a bug in the system which caused the environment configuration to become corrupt, the code could not properly be reverted. We attempted to manually intervene to get Articulate Review back online, but the system rejected our attempts to correct the configuration. After discussing options with the team, and working closely with Amazon Web Service's Support Team, we were able to recover to a temporary environment and restore service to Articulate Review. Once we were back online, we discovered the source of the configuration problem to be a bug in the third party service. With a better understanding of the system error, we carefully swapped the environments back to a normal state with no additional downtime for customers.

We know how crucial uptime and performance is for our customers, and when things don't go well, either due to our error or from a third party service, we take a close look at what we can do better. In this case, we've identified significant limitations in the current service we are using to power Articulate Review and our engineering team is prioritizing a migration to a more stable and flexible platform to prevent this sort of outage from happening again.

Posted Sep 05, 2018 - 21:27 UTC

Resolved
We have confirmed Articulate Review is operating as expected again.

Once we complete an internal review of this incident, we will publish a more detailed postmortem of what went wrong, along with the steps we will be taking to prevent this from happening in the future.
Posted Aug 16, 2018 - 00:04 UTC
Update
We're continuing to investigate the cause of this outage, and we'll keep you posted on our progress.
Posted Aug 15, 2018 - 22:30 UTC
Update
We're still investigating the cause of this outage, and we'll let you know as soon as we have more info.
Posted Aug 15, 2018 - 21:37 UTC
Investigating
Articulate Review is currently unavailable and we're looking into what went wrong. We're so sorry for the trouble!
Posted Aug 15, 2018 - 20:47 UTC
This incident affected: Articulate 360 (Review 360).