Less than two weeks ago, Articulate 360 experienced a series of outages due to failures in the infrastructure systems that support our platform. After detailed analysis, we're now sharing what went wrong, how we responded, and the steps we've taken to improve our systems and processes going forward.
Summary
Three incidents caused outages from Tuesday, March 5, 2024, to Thursday, March 7, 2024. The first two incidents were isolated and caused a combined interruption of about 1.5 hours, while the third incident caused a 5-hour outage Wednesday afternoon that recurred for another 4 hours Thursday morning.
Even though there were three separate incidents, we recognize that this felt like one long outage. We are committed to doing better. Key takeaways and process enhancements include improvements to monitoring and testing, an update to our DNS infrastructure, and a commitment to make infrastructure changes only after a full incident assessment. Details on each incident are included below along with remediations and lessons learned. Additional information about lessons learned and next steps is summarized at the end of this report.
Details
This report shares the factors that contributed to each incident and the remediations we put in place, as well as the lessons learned and go-forward improvements intended both to reduce time to restoration and to help identify and prevent incidents in the future.
Incident 1: Tuesday, March 5 (18 minutes)
From approximately 15:06 to 15:24 ET, Rise 360 and account management were unavailable.
Root Cause: A missing environment variable prevented the Rise authentication service from initializing.
Identification: Our team was paged within minutes and identified that the issue was caused by an incorrect environment variable change that had propagated to application services.
Remediation: The issue was resolved immediately by rolling back the change.
Learnings: After the incident, we added safeguards to our infrastructure to ensure environment variables are fully validated before application services come online, preventing this issue from happening again. The sketch below illustrates the fail-fast pattern.
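To make the idea concrete, here is a minimal sketch of fail-fast startup validation, assuming hypothetical variable names rather than our actual configuration:

```python
import os
import sys

# Variables the service cannot run without (names are illustrative).
REQUIRED_VARS = ["AUTH_SERVICE_URL", "DATABASE_URL", "SESSION_SECRET"]

def validate_environment() -> None:
    """Refuse to start, before accepting any traffic, if configuration is incomplete."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        print(f"Refusing to start: missing environment variables: {', '.join(missing)}",
              file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    validate_environment()
    # Only now initialize the application and begin serving requests.
```

Failing at startup keeps a misconfigured instance out of the load balancer entirely, rather than letting it come online and fail on live requests.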
Incident 2: Wednesday, March 6 (1 hour 10 minutes)
From approximately 08:18 to 09:28 ET on Wednesday, Rise 360 was unavailable.
Root Cause: This downtime was caused by a Web Application Firewall (WAF) configuration change that blocked our services when they scaled up with morning traffic.
Identification: Once notified, our team quickly identified that auto-scaled services were blocked due to the WAF configuration change made the night before. After that was fixed, the team immediately identified that the subscription service database was not keeping up with the high load generated during service recovery.
Remediation: Updating the WAF configuration, adding an additional subscription database read replica, and increasing the auto-scaling capacity of our subscription service resolved Wednesday morning's outage.
Learnings: Moving forward, we are adding requirements to our pre-release test process: simulating real-world traffic patterns, building tighter integration tests, and improving coordination on infrastructure rollouts. We are also adding monitoring and firewall safeguards and addressing WAF documentation and tooling gaps. Finally, we have increased the auto-scaling capacity of the subscription service and will conduct enhanced load testing to keep it from being overwhelmed in the future; the sketch below shows the kind of ramped test we have in mind.
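As a rough illustration of what "simulating real-world traffic patterns" means in practice, here is a minimal ramped load test. The endpoint and concurrency numbers are placeholders, not our production values:

```python
import concurrent.futures
import time
import urllib.request

# Placeholder staging endpoint; the concurrency numbers are illustrative.
TARGET_URL = "https://staging.example.com/health"

def hit_endpoint(_: int) -> int:
    """Return the HTTP status, counting any failure (e.g., a WAF block) as 599."""
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=5) as resp:
            return resp.status
    except Exception:
        return 599

def ramped_load(start: int = 5, stop: int = 100, step: int = 5, hold_seconds: int = 30) -> None:
    """Step concurrency upward the way morning traffic does, checking error rates at each level."""
    for workers in range(start, stop + 1, step):
        deadline = time.monotonic() + hold_seconds
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
            while time.monotonic() < deadline:
                statuses = list(pool.map(hit_endpoint, range(workers)))
                errors = sum(1 for status in statuses if status >= 400)
                print(f"{workers} workers: {errors}/{len(statuses)} errors")

if __name__ == "__main__":
    ramped_load()
```

The point of the ramp is that a rule or quota that passes at a fixed, low request rate can still reject traffic once auto-scaling kicks in, which is exactly the failure mode this incident exposed.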
Incident 3: Wednesday, March 6 (~5 hours) and Thursday, March 7 (4 hours)
From approximately 14:51 through 19:45 ET on Wednesday, all of Articulate 360 was unavailable including Rise 360, Review 360, and account management.
On Thursday, the same issue recurred from 08:42 through 12:42 ET before a full resolution was put in place.
Root Cause: The downtime occurred after we rolled out a change to our Kubernetes cluster to support faster service scaling. The configuration change resulted in system and network instability due to an undetected cascading DNS failure that prevented our pods from functioning. A key factor in the length of the outage was the absence of monitoring that would have shown our DNS services were overloaded.
Identification: The team identified system instability in the Kubernetes cluster approximately an hour after rolling out a core infrastructure change. In response, we made further configuration changes to support less aggressive pod scaling. By Wednesday evening, we mistakenly believed the updated cluster configuration had resolved the incident. In reality, our services had stabilized because peak traffic had passed, not because of our configuration change, which gave us a false sense of confidence until the instability returned with high traffic the next morning. The team then traced the issue to the core DNS service failing under critically high load.
Remediation: Updating our DNS configuration to reduce the number of search domains, along with rolling back the core infrastructure change, resolved the outage that occurred on March 6 and 7.
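For readers unfamiliar with why search domains matter: in a Kubernetes cluster, a pod's resolver typically tries a short name against every configured search domain before trying it as an absolute name, and each attempt usually issues both an A and an AAAA query. The sketch below shows the resulting multiplier; the domains and ndots value are illustrative defaults, not our production configuration:

```python
# Illustrative only: how Kubernetes-style search domains multiply DNS query load.
search_domains = [
    "app.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "internal.example.com",
]
ndots = 5  # a common Kubernetes default

def queries_for(name: str) -> int:
    """Estimate DNS queries for one lookup (assumes the absolute lookup succeeds)."""
    if name.count(".") < ndots:
        # Tried against every search domain, then as an absolute name.
        attempts = len(search_domains) + 1
    else:
        attempts = 1
    return attempts * 2  # roughly one A plus one AAAA query per attempt

for name in ["subscriptions", "api.example.com"]:
    print(f"{name}: up to {queries_for(name)} DNS queries")
```

Trimming the search list directly shrinks that multiplier, which is why it reduced load on the DNS service.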
Learnings: We have improved monitoring and alerting to detect overload of the DNS service; a sketch of the kind of check we mean follows below. We are enhancing our monitoring dashboards to fast-track infrastructure issue triage and strengthening our pre-release testing procedures to simulate 10x production-level traffic. We are conducting a full audit of our services for scalability and for single points of failure so we can harden our infrastructure against extreme traffic loads or service failures. We are also revisiting our upgrade and rollout procedures: we will not make infrastructure changes immediately following an incident, or before a full assessment of that incident is complete. Finally, we are revamping our escalation and communication protocols to provide clearer and more frequent updates.
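As one concrete example of DNS-overload alerting, here is a minimal watcher against a Prometheus-style metrics endpoint. CoreDNS exposes counters along these lines, but the URL, metric name, and threshold here are assumptions to verify against your own deployment:

```python
import time
import urllib.request

# Assumed values; confirm the endpoint and exact metric name for your CoreDNS version.
METRICS_URL = "http://coredns.example.internal:9153/metrics"
METRIC = "coredns_dns_requests_total"
QPS_THRESHOLD = 5000  # alert when sustained queries/sec exceed this

def total_requests() -> float:
    """Sum the request counter across all labeled series in the exposition output."""
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        body = resp.read().decode()
    return sum(
        float(line.rsplit(" ", 1)[1])
        for line in body.splitlines()
        if line.startswith(METRIC)
    )

def watch(interval_seconds: int = 15) -> None:
    """Convert the monotonically increasing counter into a rate and flag sustained overload."""
    previous = total_requests()
    while True:
        time.sleep(interval_seconds)
        current = total_requests()
        qps = (current - previous) / interval_seconds
        if qps > QPS_THRESHOLD:
            print(f"ALERT: DNS load at {qps:.0f} qps exceeds {QPS_THRESHOLD} qps")
        previous = current

if __name__ == "__main__":
    watch()
```

In practice this check belongs in an alerting system rather than a standalone script, but the core signal, query rate derived from a counter, is the one we were missing during the incident.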
Lessons Learned and Next Steps
In response to the incidents, we are committed to not only addressing the immediate issues but also implementing robust measures to safeguard against similar events in the future.
We identified key areas for improvement across our operational framework. Our planned actions fall into three areas, each aimed at enhancing our system's resilience, streamlining incident management, and fostering better coordination and monitoring to ensure the highest level of service reliability and customer satisfaction:
System Resilience Enhancements
Coordination and Monitoring Improvements
Incident Management and Response
Articulate’s Commitment to Excellence
Every day, you entrust Articulate with the crucial mission of training your customers, employees, and teams—a responsibility we take very seriously and value deeply.
We’re sorry for the trouble these incidents caused and for the disruption to the smooth, efficient service we constantly aim to deliver.
We are fully committed to learning from this experience and strengthening our pledge to support your ongoing success and satisfaction.