Concurrent Load Testing at Cengage
By: Chris M., Manager, Software Quality Performance Engineering
A summary of Concurrent Load Testing and how it got started at Cengage.
The development of new technology is always exciting, but its success depends on how it performs under heavy load from many users. The coolest, most sensational developments can be complete failures if they do not work as they were designed to. That is why new applications and platforms undergo load and stress testing before going live. Load testing is the practice of applying realistic stress to software or application platforms to see whether they will function as designed with many users on the system at the same time.
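To make the idea concrete, here is a minimal sketch of what a load test script can look like using the open-source Locust framework. This is an illustration only, not Cengage's actual test suite; the endpoint paths, task weights, and user counts are hypothetical.

```python
# Minimal load test sketch using Locust (https://locust.io).
# Endpoints and weights are hypothetical placeholders, not Cengage's real APIs.
from locust import HttpUser, task, between

class SimulatedStudent(HttpUser):
    # Each simulated user pauses 1-5 seconds between actions,
    # roughly mimicking real reading and clicking behavior.
    wait_time = between(1, 5)

    @task(3)
    def view_dashboard(self):
        # Weighted 3x: most simulated users spend their time here.
        self.client.get("/dashboard")

    @task(1)
    def open_assignment(self):
        self.client.get("/assignments/123")
```

A script like this could be run headless with something like `locust -f loadtest.py --headless --users 10000 --spawn-rate 50 --host https://perf.example.com` to ramp thousands of simulated users against a test environment rather than production.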
Getting Started
Concurrent Load Testing began at Cengage as a testing methodology for validating a highly integrated cloud architecture before it went live in the fall of 2015. The original test plan called for 4 days of coordinated testing among the platform teams to validate the integrations and the architecture; the testing ultimately lasted another 3 weeks. The findings from that testing created a better experience for Cengage product users that fall.
Now, concurrent load testing days are practiced every month. These exercises are commonly called ‘Chaos Testing’, a reference to Netflix's ‘Chaos Monkey’ program, which injects unexpected failure conditions into an environment. The testing is a focused effort by multiple teams who work cooperatively to achieve team and company goals. These exercises validate the stability, performance, redundancy, and failover capabilities of our main application platforms.
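To give a flavor of what ‘throwing in’ an unexpected failure can look like, the snippet below is a simplified sketch, not Cengage's tooling: it stops a randomly chosen service container during a test window. The container names are hypothetical, and real chaos tools such as Chaos Monkey add scheduling, scoping, and guardrails on top of this basic idea.

```python
# Simplified chaos sketch: stop one randomly chosen service container
# during a test window. Container names are hypothetical placeholders.
import random
import subprocess

CANDIDATES = ["sso-service", "api-gateway", "content-service"]

def inject_failure() -> str:
    victim = random.choice(CANDIDATES)
    print(f"Chaos: stopping {victim}")
    # 'docker stop' asks the container to shut down, simulating an
    # unexpected loss of a service instance mid-test.
    subprocess.run(["docker", "stop", victim], check=True)
    return victim

if __name__ == "__main__":
    inject_failure()
```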
The Full Story
Concurrent load testing started as an experiment in team coordination during the migration of application platforms from our data centers to the cloud. All of the integrations, network connections, and platforms had been thoroughly tested before the fall of 2015, but not in a coordinated fashion. The company needed to know that the cloud architecture worked as designed under load and stress. With new network pathways created to integrate cloud platforms with the existing data center-based applications, it was essential to ensure stability at 10,000 concurrent users, the load expected during the upcoming semester. A testing plan was designed to coordinate application platform testing across all of the teams that use the Single Sign-On platform for authentication and authorization. The resulting testing sessions identified over 90 issues, including network, configuration, and application code bugs, which were corrected before most users began to use the cloud-based systems. These problems, which teams had been unable to see during development, came to light as a direct result of testing at high volumes on the new infrastructure.
Smaller, planned concurrent testing events continued throughout the next year as new application platforms were migrated to the cloud. These testing sessions would last two or three days, with metered changes to environments and known configuration settings. The unpredictable nature of these sessions quickly earned them the nickname ‘chaos’ testing, and the name has stuck ever since. In addition to testing for environment stability and performance, our teams also validate team processes for troubleshooting and issue resolution. The validation of alerts, monitoring capabilities, and team responsiveness has been a key tenet of Chaos Load Testing.
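As a sketch of how alert validation can be wired into such a session, the check below compares load-test metrics against the same thresholds the alerts use and fails the run if they are breached. This is illustrative only; the metrics endpoint, field names, and thresholds are hypothetical placeholders.

```python
# Sketch: verify a chaos/load run against alerting thresholds.
# The metrics URL, field names, and thresholds are hypothetical.
import sys
import requests

METRICS_URL = "https://perf-metrics.example.com/api/summary"
P95_LATENCY_MS_MAX = 800   # alert threshold for 95th-percentile latency
ERROR_RATE_MAX = 0.01      # alert threshold: 1% failed requests

def check_run(run_id: str) -> bool:
    stats = requests.get(METRICS_URL, params={"run": run_id}, timeout=10).json()
    ok = True
    if stats["p95_latency_ms"] > P95_LATENCY_MS_MAX:
        print(f"FAIL: p95 latency {stats['p95_latency_ms']} ms over threshold")
        ok = False
    if stats["error_rate"] > ERROR_RATE_MAX:
        print(f"FAIL: error rate {stats['error_rate']:.2%} over threshold")
        ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check_run(sys.argv[1]) else 1)
```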
Change and Growth
As our business grows its adoption in Higher Education communities, our platform depth and application offerings grow too. These platforms are more widely integrated and connected than ever before, so maintaining stability and uptime is critical. The combined integration and interoperability of Cengage products allows us to continually provide quality products to our growing customer base.
Maintaining snappy applications for millions of users is our pledge to Cengage customers. To deliver on this promise, we invested in a dedicated performance environment. This environment lets us increase the frequency of system-wide load testing without compromising teams’ development workflows, and it allows us to make environment changes quickly during concurrent load testing, reducing test cycle time.
These targeted experiments in the performance environment allow teams to find bottlenecks and potential failure modes that would otherwise go undetected in production until it was too late. The goal-based testing sessions uncover failure modes and breakdowns in the performance environment well before the issues would affect end users. By focusing on specific performance goals during concurrent load testing, the teams avoid fire-drill scenarios in production that could leave developers scrambling for a fix. When issues are found as part of testing, corrections can be applied without the added duress of a production outage.
The Future of Performance Testing
Coordinated load and performance testing continues to show positive, measurable results on both the server and client side. Issues identified during Chaos Testing events in an isolated performance environment can be used to correct production settings and configurations before they cause actual problems for our customers. The teamwork involved in these exercises also strengthens support team processes and decreases response time to customer-facing issues. Concurrent testing will remain a necessary part of validating our application platforms, our infrastructure, and our support processes as we prepare for another record-setting fall semester.