Scaling in the time of COVID: How we run load tests at Learnosity

by Olivier Mehani, Senior Software Engineer

In the second of a two-part article on load testing, Senior Software Engineer Olivier Mehani goes into the details of running a load test to keep Learnosity’s system performant in the face of dramatic user growth.

In this two-part series, I look at how we load test our platform to ensure it stays stable during periods of heavy user traffic. For Learnosity, that’s typically during the back-to-school period. This year was different though, as COVID caused a dramatic global pivot to online assessment in education. Here is what the result of that looked like in terms of traffic:

We expect major growth every year but that kind of hockey stick curve is not something you can easily predict. But, because scalability is one of the cornerstones of our product offering, we were well-equipped to handle it.

This article series reveals how we prepared for that.

In part one (which was, incidentally, pre-COVID), I detailed how we actually created the load by writing a script using Locust. In this one, I’ll go through the process of running the load test. I’ll also look at some system bottlenecks it helped us find.

Let’s kick things off by looking at some important things a good load-testing method should do. Namely, it should:

1) Apply a realistic load, starting from known-supported levels.
2) Determine whether the behavior under load matches the requirements.
3) If the behavior is not as desired, you need to identify errors and fix them. These could be:
- In the load-test code (not realistic enough)
- In the load-test environment (unable to generate enough load)
- In the system parameters
- In the application code
4) If the behavior is as desired, then ramp up the load exponentially.

We used two separate tools for steps one and two above (as described in the first part of this series) and tracked the outcomes of step three in a spreadsheet.

TL;DR

We used Locust to create the load, and a custom application to verify correct behavior.
We found a number of configuration-level issues, mainly around limits on file descriptors and open connections.
Stuff we learned along the way:
- Record all parameters and their values, change one at a time
- Be conscious of system limits, particularly on the allowed number of open files and sockets

Running load tests and collecting data

It’s crucially important to keep good records when corralling data from the thundering herd into useful information.

There are essentially two classes of data to keep track of here:

1) static parameters, which describe all the relevant configurations of the system under load; and

2) dynamic behavior, which covers any relevant metrics collected during the load test.

Those two sets tend to grow over the duration of the load-test campaign, as more relevant parameters and more performance metrics are identified.

Static parameters

Static parameters are under our direct control, and we can change them between runs. They include the environment and instance sizes, as well as a growing number of configuration options discovered through successive rounds of load tests. The numbers driving the load generation are also important here.

Dynamic behavior

The dynamic behavior metrics are things we can observe as the load test runs.

This category covers both intrinsic performance metrics of the hosts, such as CPU load or memory used, and the actual target metrics that allow us to call whether a load-test run was successful or not.
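One way to keep both classes of data side by side is a single record per run that can be exported to CSV and imported into a spreadsheet. The following is a minimal sketch rather than our actual tooling; every field name (instance_type, users, spawn_rate, and so on) is hypothetical and only there for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class LoadTestRun:
    # Static parameters: chosen before the run (all names are hypothetical).
    run_id: int
    instance_type: str
    instance_count: int
    users: int
    spawn_rate: int
    # Dynamic behavior: observed while the run is in progress.
    peak_cpu_percent: float
    peak_memory_mb: float
    p95_delay_ms: float
    outcome: str  # e.g. "success", "system failure", "load-generator failure"
    notes: str = ""


def runs_to_csv(runs):
    """Return CSV text with a header row and one row per recorded run."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(LoadTestRun)])
    writer.writeheader()
    for run in runs:
        writer.writerow(asdict(run))
    return out.getvalue()
```

Keeping the outcome and free-form notes in the same record as the parameters makes it easier to see, later on, which change led to which result.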

Color-coding helps in making the data more legible. As I mentioned before, there can be different types of failure, either of the system under load or of the load-testing system itself. Since we have two different components in our load test, we have a number of different colors for failures of the different components. Observations made during the run are important too, as they are generally a good indicator of what to do next.

We also have some automatic data reporting in place for the correctness test application.

It generates a simple CSV output, which can easily be imported into the same spreadsheet for each run. This allows us to quickly get an idea of how well the system performed in terms of the target criteria (in this case, the end-to-end delay) by calculating and plotting some summary statistics.
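As an illustration of the kind of summary statistics involved, here is a minimal sketch in Python. It assumes the CSV has one row per delivered message and a column named delay_ms; that column name and the choice of statistics are assumptions made for the example, not a description of the actual output of our correctness test application.

```python
import csv
import io
import math
import statistics


def summarize_delays(csv_text, column="delay_ms"):
    """Return summary statistics for one numeric column of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    delays = sorted(float(row[column]) for row in reader)
    if not delays:
        raise ValueError("no samples found")
    # Nearest-rank 95th percentile: the smallest sample that has at least
    # 95% of all samples at or below it.
    rank = max(1, math.ceil(0.95 * len(delays)))
    return {
        "count": len(delays),
        "mean": statistics.fmean(delays),
        "median": statistics.median(delays),
        "p95": delays[rank - 1],
        "max": delays[-1],
    }
```

Looking at a high percentile and the maximum alongside the mean helps spot runs where most messages arrive quickly but a small fraction are badly delayed.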

At this stage, load testing becomes a relatively simple process, driven by the data-collection spreadsheet:

1) Run a load test with the current parameters
2) If the outcome is:
- a failure: understand and fix the cause of failure, or
- a success: increase the parameters defining the load.

When updating the parameters to generate more load, it’s generally better to change one parameter at a time so you can more easily track down issues and create a fuller matrix of the combination of parameters.

Unless you have a clear number available, it’s good practice to double the previous value as it lets you exponentially explore the space (and hit a bottleneck faster), even if you’ve started with reasonably low values.
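The process can be sketched as a small loop that doubles a single parameter after each successful run and stops at the first failure. This is an illustrative sketch only: run_test stands in for whatever launches the load test and reports success, and the parameter names used in the usage note are hypothetical.

```python
def run_campaign(run_test, params, to_double, max_runs=10):
    """Double one load parameter after each successful run.

    run_test is a caller-supplied callable that takes the parameter dict
    and returns True on success. The loop stops at the first failure, or
    after max_runs runs, and returns a list of (params, success) records.
    """
    history = []
    current = dict(params)
    for _ in range(max_runs):
        success = run_test(current)
        # Record a copy so later doubling does not alter past records.
        history.append((dict(current), success))
        if not success:
            break  # Understand and fix the cause before going further.
        current[to_double] *= 2
    return history
```

For example, starting from {"users": 1000, "publishers": 10} and doubling only "users" explores 1,000, 2,000, 4,000 and so on, while "publishers" stays fixed so that any failure can be attributed to the one value that changed.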

Configuration bottlenecks

The load-test setup described above helped us identify a number of bottlenecks in our environment. Those bottlenecks were also often present in the load-generating infrastructure, so we had to run most tests twice before we could increase the load and hit a new bottleneck.

We found that at every level of the stack, the defaults were too low. Sometimes, parameters that remained in our configuration management system from legacy deployments were set incorrectly.

It is interesting to note that there are quite a few parameters that control the number of connections that can be successfully established to the app. We decided to create a funnel so the kernel limits would be the lowest. Any connection that managed to make it through those would be virtually guaranteed to make it all the way to the application and be serviced.

System-wide limits

First, we disabled an ancient entry that set the fs.file-max sysctl to 20,480. This dated from age-old deployments where we were actually increasing the default limit. The limit is now automatically calculated based on the memory of the host, and our hardcoded value had ended up lower than the automatic default (in one random check I just did, the default was 395,576, one order of magnitude higher). The stale value led to errors such as:

“Too many open files in system”

from various components of the system.
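To see how close a host is to this system-wide limit, the kernel exposes the current counters in /proc/sys/fs/file-nr. The sketch below parses that file; it is an illustrative helper rather than something we ran during the campaign, and the function names are made up for the example.

```python
def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr.

    The file holds three numbers: allocated file handles, allocated but
    unused handles, and the system-wide maximum (fs.file-max).
    """
    allocated, unused, maximum = (int(field) for field in text.split())
    return {
        "allocated": allocated,
        "unused": unused,
        "max": maximum,
        "usage": allocated / maximum,
    }


def read_file_nr(path="/proc/sys/fs/file-nr"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return parse_file_nr(handle.read())
```

A usage value approaching 1.0 during a run suggests the system-wide limit, rather than the application, is about to become the bottleneck.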

Per-process limits

Next was the configuration of nginx, which proxies public requests to our app. After a while, it complained that:

[alert] 3122#0: *1520270 2048 worker_connections are not enough while connecting to upstream, client: 192.2.0.2, server: eventbus.loadtest.learnosity.com, request: "POST /latest/publish HTTP/1.1", upstream: "http://127.0.0.1:9092/latest/publish", host: "eventbus.loadtest.learnosity.com"

This was simply fixed by adjusting this parameter in the nginx.conf file.

worker_connections 32000;

This led to the obvious next error:

[alert] 25367#0: *5200246 socket() failed (24: Too many open files), client: 192.2.0.2, server: eventbus.loadtest.learnosity.com, request: "POST /latest/publish HTTP/1.1", upstream: "http://127.0.0.1:9092/latest/publish", host: "eventbus.loadtest.learnosity.com"

Our default systemd configuration sets the maximum number of files the process can have open to 2048. We installed a manual override in /etc/systemd/system/nginx.service.d/nofile.conf, so as not to manually change system files.

[Service]
LimitNOFILE=65536

Note that the number of allowed open files is around twice that of worker_connections. This is because each accepted connection will lead the worker to establish a second connection to the backend, which also needs to be accounted for.

Once nginx was happy, we hit the same problem one level deeper, from our Golang http-based app.

http: Accept error: accept tcp [::]:9092: accept4: too many open files; retrying in 5ms

The fix was the same: adding a manual override to the systemd service for our Event bus.

[Service]
LimitNOFILE=32768
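After a change like this, it can be worth confirming that the running process actually picked up the new limit. On Linux, the effective limits of a process are listed in /proc/<pid>/limits. The following sketch extracts the "Max open files" entry; the helper names are hypothetical, and the sample in the usage note assumes the usual column layout of that file.

```python
def _to_limit(value):
    """Convert a limit column to an int, or None if it reads 'unlimited'."""
    return None if value == "unlimited" else int(value)


def parse_max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits text."""
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return _to_limit(soft), _to_limit(hard)
    raise ValueError("no 'Max open files' line found")


def read_max_open_files(pid):
    """Read the limits of a live process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_max_open_files(handle.read())
```

With the override above in place and the service restarted, a line such as "Max open files 32768 32768 files" would be parsed as (32768, 32768).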

Network tuning

Along the way, we found a few other occasional errors. Most revolved around TCP connections not being cleared up fast enough. We already had some configuration in place (prior to the load test) to maximize the number of allowed connections and limit their lingering after closure:

net.core.somaxconn = 63536
net.ipv4.tcp_fin_timeout = 15

We tweaked a couple more to expand the range of ports that could be used for outgoing connections (from nginx to the app), as well as allow the reuse of existing TCP sockets in the TIME_WAIT state.

net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
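A rough back-of-the-envelope calculation shows why the port range matters. Each outgoing connection from nginx to the app needs its own local port, and a closed connection keeps that port busy while it sits in TIME_WAIT. The sketch below assumes the 60-second TIME_WAIT duration used by Linux and ignores socket reuse, so treat the result as a conservative illustration rather than a measured limit.

```python
TIME_WAIT_SECONDS = 60  # Assumption: fixed TIME_WAIT duration on Linux.


def parse_port_range(text):
    """Parse the two numbers found in net.ipv4.ip_local_port_range."""
    low, high = (int(field) for field in text.split())
    return low, high


def ephemeral_ports(low, high):
    """Number of local ports available for outgoing connections."""
    return high - low + 1


def max_new_connections_per_second(low, high, time_wait=TIME_WAIT_SECONDS):
    """Upper bound on sustained new connections per second to one
    destination address and port when TIME_WAIT sockets are not reused."""
    return ephemeral_ports(low, high) / time_wait
```

With the range set to 1024 65535, that gives 64,512 ports, or roughly 1,075 new connections per second to a single backend address and port. Allowing reuse of sockets in TIME_WAIT relaxes this bound further.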

We also had to adjust some firewall settings. We had kernel connection tracking enabled by default. This created issues when the table got full, trying to keep track of too many connections.

nf_conntrack: table full, dropping packet

We first thought about tweaking timeouts (net.netfilter.nf_conntrack_tcp_timeout_fin_wait and net.netfilter.nf_conntrack_tcp_timeout_time_wait) so the table entries would clear faster. However, we realized that connection tracking made little sense on a worker node. Simply disabling it in firewalld (with --notrack) in rules for connections to the backend app fixed that issue.
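On hosts where connection tracking stays enabled, the fill level of the table can be watched during a run by comparing the nf_conntrack_count and nf_conntrack_max sysctls. This is a minimal sketch with hypothetical helper names; the /proc paths are only present when the conntrack module is loaded.

```python
def conntrack_usage(count_text, max_text):
    """Return the fraction of the connection-tracking table in use."""
    count = int(count_text.strip())
    maximum = int(max_text.strip())
    return count / maximum


def read_conntrack_usage(base="/proc/sys/net/netfilter"):
    """Read the live counters (Linux only, conntrack module loaded)."""
    with open(f"{base}/nf_conntrack_count") as count_file:
        count_text = count_file.read()
    with open(f"{base}/nf_conntrack_max") as max_file:
        max_text = max_file.read()
    return conntrack_usage(count_text, max_text)
```

A value close to 1.0 means new connections are about to be dropped with the "table full" message shown earlier.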

With all these changes, our Event bus was finally constrained only by the performance of the machines! We ran some dimensioning experiments to evaluate the impact of vertical (instance size) and horizontal (instance number) scaling, so we could make informed decisions on this final bit of system configuration for production.

Know thyself – understand your tech stack

We’re at a time when hundreds of clients are relying on our infrastructure to perform with total reliability as millions of learners make the switch to digital assessments.

Had we only doubled, or even tripled, our capacity relative to the maximum traffic we saw in 2019, we would have come unstuck because of the unforeseen impact of COVID. But we always aim to stretch our capacity by way more than that.

Due to the publish/subscribe nature of the Event bus application, we had to write a more complex Locust script than usual, keeping some state between publishers and subscribers. We also had to write a separate application to measure the message-delivery times, which were our target pass/fail metric.

Thanks to careful bookkeeping of all the parameters and results, we identified weak points in our system and optimized them. Note that most of the issues came from misconfigurations of the system, and that we did not even get into the application code.

This is an important reminder that a successful load-testing campaign requires a good understanding of the entire tech stack under consideration.

After further load tests, we were confident that the event-passing system would be able to sustain at least seven times the maximum load we’d previously seen. That put our system in good stead to withstand the shock waves of even the most challenging circumstances.
