Less than two weeks ago, Articulate 360 experienced a series of outages due to failures in the infrastructure systems that support our platform. After detailed analysis, we're now sharing what went wrong, how we responded, and the steps we've taken to improve our systems and processes going forward.
Summary
Three incidents caused outages from Tuesday, March 5, 2024, to Thursday, March 7, 2024. The first two incidents were isolated and caused a combined interruption of about 1.5 hours, while the third incident caused a 5-hour outage Wednesday afternoon that recurred for another 4 hours Thursday morning.
Even though there were three separate incidents, we recognize that this felt like one long outage. We are committed to doing better. Key takeaways and process enhancements include improvements to monitoring and testing, an update to our DNS infrastructure, and a commitment to make infrastructure changes only after a full incident assessment. Details on each incident are included below along with remediations and lessons learned. Additional information about lessons learned and next steps is summarized at the end of this report.
Details
This report shares the factors that contributed to each incident and the remediations we put in place, as well as the lessons learned and go-forward improvements intended both to reduce time to restoration and to help identify and prevent incidents in the future.
Incident 1: Tuesday, March 5 (18 minutes)
From approximately 15:06 to 15:24 ET, Rise 360 and account management were unavailable.
Root Cause: A missing environment variable prevented the Rise authentication service from initializing.
Identification: Our team was paged within minutes and identified that the issue was caused by an incorrect environment variable change that had propagated to application services.
Remediation: The issue was resolved immediately by rolling back the change.
Learnings: After the incident, we added safeguards to our infrastructure to ensure environment variables are fully validated before application services come online, preventing this issue from happening again. The sketch below illustrates the fail-fast pattern.
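To make the idea concrete, here is a minimal sketch of fail-fast startup validation, assuming hypothetical variable names rather than our actual configuration:

```python
import os
import sys

# Variables the service cannot run without (names are illustrative).
REQUIRED_VARS = ["AUTH_SERVICE_URL", "DATABASE_URL", "SESSION_SECRET"]

def validate_environment() -> None:
    """Refuse to start, before accepting any traffic, if configuration is incomplete."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        print(f"Refusing to start: missing environment variables: {', '.join(missing)}",
              file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    validate_environment()
    # Only now initialize the application and begin serving requests.
```

Failing at startup keeps a misconfigured instance out of the load balancer entirely, rather than letting it come online and fail on live requests.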
Incident 2: Wednesday, March 6 (1 hour 10 minutes)
From approximately 08:18 to 09:28 ET on Wednesday, Rise 360 was unavailable.
Root Cause: This downtime was caused by a Web Application Firewall (WAF) configuration change that blocked our services when they scaled up with morning traffic.
Identification: Once notified, our team quickly identified that auto-scaled services were blocked due to the WAF configuration change made the night before. After that was fixed, the team immediately identified that the subscription service database was not keeping up with the high load generated during service recovery.
Remediation: Updating the WAF configuration, adding an additional subscription database read replica, and increasing the auto-scaling capacity of our subscription service resolved Wednesday morning's outage.
Learnings: Moving forward, we are adding requirements to our pre-release test process: simulating real-world traffic patterns, building tighter integration tests, and improving coordination on infrastructure rollouts. We are also adding monitoring and firewall safeguards and addressing WAF documentation and tooling gaps. Finally, we have increased the auto-scaling capacity of the subscription service and will conduct enhanced load testing to keep it from being overwhelmed in the future; the sketch below shows the kind of ramped test we have in mind.
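As a rough illustration of what "simulating real-world traffic patterns" means in practice, here is a minimal ramped load test. The endpoint and concurrency numbers are placeholders, not our production values:

```python
import concurrent.futures
import time
import urllib.request

# Placeholder staging endpoint; the concurrency numbers are illustrative.
TARGET_URL = "https://staging.example.com/health"

def hit_endpoint(_: int) -> int:
    """Return the HTTP status, counting any failure (e.g., a WAF block) as 599."""
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=5) as resp:
            return resp.status
    except Exception:
        return 599

def ramped_load(start: int = 5, stop: int = 100, step: int = 5, hold_seconds: int = 30) -> None:
    """Step concurrency upward the way morning traffic does, checking error rates at each level."""
    for workers in range(start, stop + 1, step):
        deadline = time.monotonic() + hold_seconds
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
            while time.monotonic() < deadline:
                statuses = list(pool.map(hit_endpoint, range(workers)))
                errors = sum(1 for status in statuses if status >= 400)
                print(f"{workers} workers: {errors}/{len(statuses)} errors")

if __name__ == "__main__":
    ramped_load()
```

The point of the ramp is that a rule or quota that passes at a fixed, low request rate can still reject traffic once auto-scaling kicks in, which is exactly the failure mode this incident exposed.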
Incident 3: Wednesday, March 6 (~5 hours) and Thursday, March 7 (4 hours)
From approximately 14:51 through 19:45 ET on Wednesday, all of Articulate 360 was unavailable including Rise 360, Review 360, and account management.
On Thursday, the same issue recurred from 08:42 through 12:42 ET before a full resolution was put in place.
Root Cause: The downtime occurred after we rolled out a change to our Kubernetes cluster to support faster service scaling. The configuration change resulted in system and network instability due to an undetected cascading DNS failure that prevented our pods from functioning. A key factor in the length of the outage was the absence of monitoring that would have shown our DNS services were overloaded.
Identification: The team identified system instability in the Kubernetes cluster approximately an hour after rolling out a core infrastructure change. In response, we made further configuration changes to support less aggressive pod scaling. By Wednesday evening, we mistakenly believed the updated cluster configuration had resolved the incident. In reality, our services had stabilized because peak traffic had passed, not because of our configuration change, which gave us a false sense of confidence until the instability returned with high traffic the next morning. The team then traced the issue to the core DNS service failing under critically high load.
Remediation: Updating our DNS configuration to reduce the number of search domains, along with rolling back the core infrastructure change, resolved the outage that occurred on March 6 and 7.
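For readers unfamiliar with why search domains matter: in a Kubernetes cluster, a pod's resolver typically tries a short name against every configured search domain before trying it as an absolute name, and each attempt usually issues both an A and an AAAA query. The sketch below shows the resulting multiplier; the domains and ndots value are illustrative defaults, not our production configuration:

```python
# Illustrative only: how Kubernetes-style search domains multiply DNS query load.
search_domains = [
    "app.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "internal.example.com",
]
ndots = 5  # a common Kubernetes default

def queries_for(name: str) -> int:
    """Estimate DNS queries for one lookup (assumes the absolute lookup succeeds)."""
    if name.count(".") < ndots:
        # Tried against every search domain, then as an absolute name.
        attempts = len(search_domains) + 1
    else:
        attempts = 1
    return attempts * 2  # roughly one A plus one AAAA query per attempt

for name in ["subscriptions", "api.example.com"]:
    print(f"{name}: up to {queries_for(name)} DNS queries")
```

Trimming the search list directly shrinks that multiplier, which is why it reduced load on the DNS service.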
Learnings: We have improved monitoring and alerting to detect overload of the DNS service; a sketch of the kind of check we mean follows below. We are enhancing our monitoring dashboards to fast-track infrastructure issue triage and strengthening our pre-release testing procedures to simulate 10x production-level traffic. We are conducting a full audit of our services for scalability and for single points of failure so we can harden our infrastructure against extreme traffic loads or service failures. We are also revisiting our upgrade and rollout procedures: we will not make infrastructure changes immediately following an incident, or before a full assessment of that incident is complete. Finally, we are revamping our escalation and communication protocols to provide clearer and more frequent updates.
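As one concrete example of DNS-overload alerting, here is a minimal watcher against a Prometheus-style metrics endpoint. CoreDNS exposes counters along these lines, but the URL, metric name, and threshold here are assumptions to verify against your own deployment:

```python
import time
import urllib.request

# Assumed values; confirm the endpoint and exact metric name for your CoreDNS version.
METRICS_URL = "http://coredns.example.internal:9153/metrics"
METRIC = "coredns_dns_requests_total"
QPS_THRESHOLD = 5000  # alert when sustained queries/sec exceed this

def total_requests() -> float:
    """Sum the request counter across all labeled series in the exposition output."""
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        body = resp.read().decode()
    return sum(
        float(line.rsplit(" ", 1)[1])
        for line in body.splitlines()
        if line.startswith(METRIC)
    )

def watch(interval_seconds: int = 15) -> None:
    """Convert the monotonically increasing counter into a rate and flag sustained overload."""
    previous = total_requests()
    while True:
        time.sleep(interval_seconds)
        current = total_requests()
        qps = (current - previous) / interval_seconds
        if qps > QPS_THRESHOLD:
            print(f"ALERT: DNS load at {qps:.0f} qps exceeds {QPS_THRESHOLD} qps")
        previous = current

if __name__ == "__main__":
    watch()
```

In practice this check belongs in an alerting system rather than a standalone script, but the core signal, query rate derived from a counter, is the one we were missing during the incident.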
Lessons Learned and Next Steps
In response to the incidents, we are committed to not only addressing the immediate issues but also implementing robust measures to safeguard against similar events in the future.
We identified key areas for improvement across our operational framework. Our planned actions fall into three areas, each aimed at enhancing our system's resilience, streamlining incident management, and fostering better coordination and monitoring to ensure the highest level of service reliability and customer satisfaction:
System Resilience Enhancements
Coordination and Monitoring Improvements
Incident Management and Response
Articulate’s Commitment to Excellence
Every day, you entrust Articulate with the crucial mission of training your customers, employees, and teams—a responsibility we take very seriously and value deeply.
We’re sorry for the trouble these incidents caused and for the disruption to the smooth, efficient service we constantly aim to deliver.
We are fully committed to learning from this experience and strengthening our pledge to support your ongoing success and satisfaction.