
Load and Stress Testing: A Case Study (Ukrainian app "Diia" - Eurovision fail)

Once upon a digital frontier, Ukraine embraced the future with Diia, a groundbreaking blend of a mobile app, a web portal, and an e-governance platform. Launched in 2020, Diia bestowed upon its citizens the power to wield digital documents on their smartphones for identification and seamless access to a vast array of government services (around 150!).



I genuinely adore the Diia app for its transformative impact on our daily lives. Gone are the days of carrying around cumbersome identification documents. The convenience of not needing to have my ID or driving licence physically on hand at all times is a game-changer. Whether it's accessing essential services, COVID certificates, signing petitions, voting on a new name for the railway system, or participating in events like the Eurovision Song Contest, having my identification securely stored on my phone simplifies and streamlines the way I engage with the digital world. Ukraine is about democracy, and Diia has proved it.



However...


As the calendar flipped to the year 2024, the nation eagerly awaited the results of the Eurovision Song Contest (ESC). The Diia app, entrusted with facilitating the voting process, found itself at the epicentre of a technological maelstrom. Originally scheduled for February 3, the announcement of the national entry for ESC 2024 faced an unexpected hurdle.


A colossal surge of citizens, fueled by the desire to cast their votes digitally, inundated the Diia application. The system, unprepared for the deluge, encountered a technical failure. Instead of a swift announcement, the process extended to the next day, allowing for a full day of voting to accommodate the enthusiastic participants.



The Minister of Digital Transformation revealed the staggering numbers: a record-breaking 15,000 requests per second and, as a consequence, a formidable queue. Capacity had been increased fivefold compared to the previous year, yet the number of requests grew roughly twentyfold. It was a testament to the app's popularity and to the challenges that accompany rapid adoption on such a scale.


What could have been done to prevent it?


In the fast-paced world of web applications and services, ensuring your system can handle the anticipated load and stress is crucial for maintaining optimal performance and user satisfaction. Load testing and stress testing are indispensable tools in a developer's and tester's arsenal, and in this article, I want to explore their implementation using the open-source tool, k6.



Grafana k6 is an open-source load testing tool that makes performance testing easy and productive for engineering teams. k6 is free, developer-centric, and extensible. Using k6, you can test the reliability and performance of your systems and catch performance regressions and problems earlier. 


I will write a JS script to simulate user behaviour. In this case, a single GET request is made to the website https://jollywise.co.uk/, followed by a 1-second sleep, so each virtual user issues roughly one request per second. This script serves as a foundation for subsequent load-testing scenarios.
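A minimal sketch of such a script (standard k6 imports; the file name in the final comment is just a placeholder) could look like this:

```javascript
import http from 'k6/http';
import { sleep } from 'k6';

// One virtual user: a single GET request followed by a 1-second pause,
// so each VU issues roughly one request per second.
export default function () {
  http.get('https://jollywise.co.uk/');
  sleep(1);
}

// Run with the default settings (1 VU, 1 iteration): k6 run script.js
```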


The initial test, executed with default settings, provides valuable insights into the system's performance. The test duration is a brief 1.2 seconds, and various metrics shed light on the request and response times. Notably, the average iteration duration is 1.17 seconds, with a modest 0.854 requests per second. The system handles a single virtual user seamlessly, with no failed requests.



Gradual Increase:


The load-testing journey progresses with a step-by-step escalation. The virtual user count is increased to 100 (roughly 100 requests per second, given the 1-second sleep), ramping up to 150 over a 60-second stage.
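One way to express this ramp in k6 is a ramping-vus scenario. The executor and option names below are real k6 configuration, but the exact setup used in the original test is my assumption:

```javascript
export const options = {
  scenarios: {
    gradual_increase: {
      executor: 'ramping-vus',
      startVUs: 100,                      // begin with 100 virtual users
      stages: [
        { duration: '60s', target: 150 }, // ramp up to 150 VUs over 60 seconds
      ],
    },
  },
};
```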



The system maintains stability, with no failed requests and a consistent response rate.



Pushing the Limits:


The real test of a system's robustness lies in stress testing. I increase the user count significantly, to 1,000 virtual users ramping up to 1,500. The results are eye-opening, revealing a 21.31% failure rate: of roughly 30,000 requests, 23,655 succeed while 6,406 fail. The server struggles to respond within the expected time frame under this immense load.
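The same scenario block can simply be scaled up. A sketch of the stress configuration, again with assumed values matching the description above:

```javascript
export const options = {
  scenarios: {
    stress_test: {
      executor: 'ramping-vus',
      startVUs: 1000,                      // start far above the normal load
      stages: [
        { duration: '60s', target: 1500 }, // push towards 1500 VUs
      ],
    },
  },
};
```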



Identifying Bottlenecks:


Examining the breakdown of request and response times makes it evident that the server faces challenges at multiple stages: from the waiting time before the request is sent to the TLS/SSL handshake time, each phase contributes to the overall stress on the system.


There are many more things you can include in your script.


Thresholds:


Thresholds in k6 allow you to set performance criteria for different metrics during a test. They help in defining acceptable levels of performance and trigger warnings or failures if those levels are not met. Thresholds are crucial for identifying potential performance issues and ensuring that the system performs within acceptable bounds. Examples of thresholds:
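For instance (the metric names below are built-in k6 metrics; the limits mirror the description that follows):

```javascript
export const options = {
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95% of requests must complete in under 500 ms
    http_req_failed: ['rate<0.001'],   // fewer than 0.1% of requests may fail
  },
};
```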



In this example, the threshold for the 95th percentile of response time is set to be less than 500 milliseconds, and the threshold for the failure rate is set to be less than 0.1%.


Stages:


You can also control the duration, ramp-up, and stabilization of your test using the stages option in the test script. The stages option allows you to define a sequence of load stages, each with its own duration, target number of virtual users (VUs), and other parameters. This enables you to simulate realistic scenarios where the load gradually increases, stabilizes, and then possibly decreases.

Here's an example of how you can structure the stages option in the k6 test script:
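Based on the explanation that follows, the stages option would be structured roughly like this:

```javascript
export const options = {
  stages: [
    { duration: '2m', target: 50 }, // ramp up from 0 to 50 VUs over 2 minutes
    { duration: '5m', target: 50 }, // hold a constant load of 50 VUs for 5 minutes
    { duration: '2m', target: 0 },  // ramp back down to 0 VUs over 2 minutes
  ],
};
```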



In this example:


  • The first stage has a duration of 2 minutes and a target of 50 VUs, meaning that k6 will gradually ramp up from 0 to 50 VUs over the course of 2 minutes.

  • The second stage has a duration of 5 minutes and a target of 50 VUs, simulating a stable period where the system is subjected to a constant load of 50 VUs.

  • The third stage has a duration of 2 minutes and a target of 0 VUs, indicating a ramp-down period where k6 will gradually reduce the number of virtual users to 0 over 2 minutes.


You can customise the duration and target values in each stage based on your specific testing requirements. This approach helps you mimic real-world scenarios where systems experience varying levels of load over time, allowing you to identify performance issues during ramp-up, measure stability under a constant load, and observe behaviour during ramp-down. Adjust these values according to the characteristics of your application and the load patterns you want to simulate.
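Putting these pieces together, a complete test script might look like the sketch below. It simply combines the request, stages, and thresholds discussed above; the target URL and limits are the ones used earlier in this article, not a recommendation for your system:

```javascript
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 50 }, // ramp-up
    { duration: '5m', target: 50 }, // steady load
    { duration: '2m', target: 0 },  // ramp-down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95th percentile under 500 ms
    http_req_failed: ['rate<0.001'],  // failure rate below 0.1%
  },
};

export default function () {
  http.get('https://jollywise.co.uk/');
  sleep(1); // each VU issues roughly one request per second
}
```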


Conclusion:


In the ever-evolving landscape of digital services, where user expectations are high, embracing tools like k6 (or whichever tool you are most comfortable with) for load and stress testing is paramount.


These tools empower us to proactively identify bottlenecks, optimise performance, and ensure that systems can handle the anticipated loads, ultimately delivering a seamless and reliable user experience. The lessons learned from Diia's experience underscore the significance of comprehensive performance testing on the digital frontier.

 
 
 


© 2025 by Valerie Zabashta
