This blog post was written by Axel Örn Sigurðsson, SenseOn’s Staff Software Engineer.
SenseOn’s mission is to provide our customers with a first-class threat detection and response product. This means conducting incident response and forensic investigations on endpoint devices in a secure, stable, and effective way.
The end result of this process is that we’ve developed a solution that allows rapid response and investigation from directly within our platform via a persistent bi-directional link between endpoint devices and our platform.
However, making this kind of high-quality rapid remote incident response capability possible meant making some critical and novel technical decisions.
To start with, we had to choose a communication protocol.
After assessing several potential communication protocols, we ultimately decided to use WebSockets. The reasoning for this was twofold:
Then, we needed to pick a programming language for the project.
We decided to use Python. This might seem like a less-than-obvious choice for anyone familiar with WebSocket implementation, given that Python has a reputation for being slow and difficult to scale compared to some other languages.
Indeed, our initial exploration of whether Python was feasible for the project didn’t seem promising. Our team reviewed a few articles comparing different languages for WebSocket implementation. Most put Python last.
But we weren’t convinced.
Instead, we decided to explore why Python was performing so badly against programming languages such as C++, Go, Java, NodeJS, PHP and Rust and see if it could be made faster.
The first thing we needed to do was get some baseline statistics for Python’s performance as a WebSocket server.
Luckily, the author of one article comparing WebSocket servers in different programming languages had published all of their code online. This meant we could get baseline stats on local hardware for NodeJS (the best results in their article) and C++ (which we also use within SenseOn). In that article, the test server and clients ran on the same machine, and we kept the same setup to get an initial idea of the results.
At this point, we had identified three candidates for which Python libraries to use. These were aiohttp, autobahn and websockets.
To put them to the test, we did our own round of benchmarking.
The existing benchmarks consisted of clients connecting to the system, retrieving the current timestamp, and measuring the total time for the connection and data exchange.
Our tests were conducted in increments of 100 connections, starting with 100 connections and increasing up to 10,000 connections in the final round. We then tracked how long each round took to complete overall.
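The shape of such a benchmark can be sketched as follows. This is an illustration rather than the code we ran: it uses plain TCP through asyncio streams as a stand-in for a WebSocket library so that it runs with only the standard library, and the names handle_client, one_client, run_round and benchmark are made up for this sketch.

```python
import asyncio
import time


async def handle_client(reader, writer):
    """Stand-in server: answer one request line with the current timestamp."""
    await reader.readline()
    writer.write(f"{time.time()}\n".encode())
    await writer.drain()
    writer.close()


async def one_client(host, port):
    """Connect, ask for the timestamp, read the reply, disconnect."""
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(b"time\n")
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return float(reply)


async def run_round(host, port, num_clients):
    """Run one round of concurrent clients and return the elapsed seconds."""
    started = time.perf_counter()
    await asyncio.gather(*(one_client(host, port) for _ in range(num_clients)))
    return time.perf_counter() - started


async def benchmark(client_counts):
    """Start the stand-in server on a free port and time one round per count."""
    server = await asyncio.start_server(
        handle_client, "127.0.0.1", 0, backlog=1024
    )
    host, port = server.sockets[0].getsockname()[:2]
    results = {}
    async with server:
        for num_clients in client_counts:
            results[num_clients] = await run_round(host, port, num_clients)
    return results


if __name__ == "__main__":
    # The rounds in the post went from 100 to 10,000 clients in steps of 100.
    # A much shorter range keeps this sketch quick and below common
    # open-file limits, since server and clients share one process here.
    for count, seconds in asyncio.run(benchmark(range(100, 301, 100))).items():
        print(f"{count} clients: {seconds:.3f}s")
```

Plotting the returned elapsed times against the client counts gives the kind of graph discussed in this post.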
We graphed the results of this test to get the following output, with the total elapsed time on the y-axis and the number of clients on the x-axis:
The graph shows that all the libraries seem to be scaling linearly with the number of connected clients. NodeJS took around 5 seconds to finish the test with 10,000 clients, with C++ taking twice as long. By contrast, Python was four to five times slower than NodeJS, taking 20 to 25 seconds.
We terminated the Autobahn benchmark at around 7,000 clients as it was the slowest option at that point.
To understand the poor performance of Python compared to the other languages in our test, we decided to examine the setup of the test servers.
The servers were standard implementations of the libraries without any additional adjustments.
Notably, the Python test servers all depended on asyncio. This is a library to write concurrent code in Python using the async/await syntax. Asyncio has an event loop in the background to run asynchronous tasks, which it lets you replace with a different implementation.
One such event loop is uvloop. Replacing the default event loop in asyncio with uvloop is as simple as running the following two lines as early as possible:
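A minimal sketch of that swap, assuming uvloop has been installed from PyPI. The try/except fallback and the main() helper are additions for illustration, so that the snippet still runs where uvloop is missing; the swap itself is just the import and the set_event_loop_policy call.

```python
import asyncio

try:
    import uvloop  # third-party package: pip install uvloop
except ImportError:  # illustration only: keep the sketch runnable without it
    uvloop = None

if uvloop is not None:
    # Every event loop created from here on will be a uvloop loop.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())


async def main():
    # Report which module provides the event loop that is actually running.
    return type(asyncio.get_running_loop()).__module__


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The policy has to be set before the event loop is created, which is why it belongs at the very start of the program.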
After doing this, we could continue using asynchronous tasks as normal.
Re-running the test with this single modification dramatically altered the results: Python’s performance increased significantly.
We show this in the following graph, with the previous results represented by dotted lines:
After this modification, aiohttp and websockets needed 16 seconds to finish the same test we put them through before, compared to 20 to 26 seconds previously.
However, we knew we could improve Python’s performance further. Even after replacing the default event loop, Python was three times slower than the NodeJS version and almost twice as slow as the C++ version.
To make Python even faster, we had an idea. Why not distribute its compute requirements over multiple cores?
Up to this point, our test servers were running in a single Python process. Therefore, they were limited to a single CPU core. We figured out that, given that each request doesn’t share any state with other requests, there was no reason why we couldn’t run the server with multiple workers.
Our team knew that using tools like supervisord+nginx or gunicorn for multiple workers allows a program to spread the compute requirements of a Python task between different CPU cores.
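As an illustration of the gunicorn route, a configuration file along these lines starts one worker process per CPU core. It is a sketch under assumptions, not our production configuration: the bind address, the file name gunicorn.conf.py and the module path server:app are hypothetical, and the worker class is the uvloop-based worker that aiohttp ships for gunicorn.

```python
# gunicorn.conf.py (hypothetical file name)
import multiprocessing

# Address to listen on; an assumption for local testing.
bind = "127.0.0.1:8080"

# One worker process per CPU core, so the load is no longer tied to one core.
workers = multiprocessing.cpu_count()

# aiohttp's gunicorn worker that runs on uvloop
# (requires both aiohttp and uvloop to be installed).
worker_class = "aiohttp.GunicornUVLoopWebWorker"

# Started with, for example (server:app is a hypothetical module:variable
# pointing at an aiohttp web.Application):
#   gunicorn -c gunicorn.conf.py server:app
```

Because each worker is a separate process, this only works cleanly when requests share no in-process state, which was the case for our test servers.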
Note: At this point, we had also chosen aiohttp as the primary candidate to work with, so all further tests from this point focus on using aiohttp. Re-running the test with aiohttp, uvloop and gunicorn for multiple workers gives us the following results:
With the work spread across multiple cores, our Python test server performed at speeds similar to the NodeJS server, and both finished in half the time the C++ server took.
Although NodeJS has a slight edge due to its built-in multi-threading for network requests, our modifications levelled the playing field between it and Python.
Note: In this test, we focused on improving the Python server implementations. There might be more performance to be gained from improving the C++ and NodeJS servers further as well.
We made a Python WebSocket server dramatically faster. This was a great result, but it left another question unanswered: could our faster Python server become scalable too?
This was a critical question. After all, our original goal was to explore the feasibility of using Python for a WebSocket server. This meant that the main WebSocket server also needed to be able to scale by utilising multiple workers on each node and multiple nodes as well.
Note: Communication between workers and storage of state was handled using Redis.
Until now, we had been working with a basic WebSocket server that would echo back messages to the clients. This was ideal for proving that Python could be sped up, but we needed a more realistic environment for proof of concept.
To do this, we implemented a proof of concept server that:
With this proof of concept in place, we set out to do further benchmarking to push this proof of concept to its limit.
We ran the server and test clients on different machines for this proof of concept test setup. We used a single gunicorn server with multiple workers running on a single machine and test clients on four Raspberry Pis.
Our proof of concept test used a test runner that scheduled a total of 2,000 clients across the Raspberry Pis to connect in each round.
Here’s how it worked:
The initial results were very promising.
Everything ran successfully, with the session duration tracked throughout the test staying fairly stable. That was until we got to over 100,000 connections. At this point, the server was beginning to struggle.
When we looked into what was failing on the server, we noticed that the number of connections each gunicorn worker was handling wasn’t spread evenly between the workers.
We saw that two workers out of seventeen were responsible for almost two-thirds of the connections. This told us there was a problem regarding how the connections were being load balanced. To try and solve this, we read the gunicorn documentation. That’s where we stumbled on this specific line:
“Gunicorn relies on the operating system to provide all of the load balancing when handling requests.”
This told us that gunicorn offers no way to define a custom load-balancing strategy, so it’s up to the operating system to distribute connections between the worker processes. In this case, it tries to hand over connections to the first worker and if it’s too busy, then the next one and so on.
This generally works fine for short-lived connections. But for WebSocket connections, the first worker will keep taking on new connections while its existing idle ones remain in place. The idle connections may stay open for a long time, with the load over the connection coming much later.
It’s worth keeping in mind that even if we evenly spread out the connections between the workers, we could still have a spike of load on multiple connections that are all connected to the same worker.
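The effect of that hand-off rule can be illustrated with a toy simulation. It is a deliberately simplified model, not a description of how any operating system really schedules accepts: one connection arrives per tick, a worker counts as busy for busy_ticks ticks after accepting, and the function names are made up for this sketch. A plain rotation over the workers is included as a reference point.

```python
from itertools import cycle


def first_available(num_workers, num_connections, busy_ticks=1):
    """One connection per tick; the lowest-numbered idle worker takes it."""
    free_at = [0] * num_workers  # tick at which each worker is idle again
    counts = [0] * num_workers
    for tick in range(num_connections):
        idle = [w for w in range(num_workers) if free_at[w] <= tick]
        if idle:
            worker = idle[0]
        else:
            # Every worker is busy: fall back to the one that frees up first.
            worker = min(range(num_workers), key=free_at.__getitem__)
        free_at[worker] = tick + 1 + busy_ticks
        counts[worker] += 1
    return counts


def round_robin(num_workers, num_connections):
    """Each new connection goes to the next worker in a fixed rotation."""
    counts = [0] * num_workers
    for _, worker in zip(range(num_connections), cycle(range(num_workers))):
        counts[worker] += 1
    return counts


if __name__ == "__main__":
    print("first available:", first_available(17, 1000))
    print("round robin:    ", round_robin(17, 1000))
```

Under these assumptions the first two workers end up holding every connection, because idle WebSocket connections cost a worker nothing at accept time. The skew we measured was less extreme, but the mechanism is the same.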
The solution: Using supervisord and nginx for load balancing, with the default round-robin strategy, we were able to achieve a more balanced distribution of connections in the test.
This allowed us to handle a total of 160,000 connections with the same test setup without any issues.
What stopped us at this point is that the Raspberry Pis started failing completely (long after I thought they would).
Our initial goal was to easily handle over 100,000 connections on a single machine, which we managed to do in our previous tests before hitting hardware limits. We wanted to push further to see which software limits we would have to deal with. This meant that Raspberry Pis were not an adequate solution.
We moved on to using Google Cloud instead. Here we could spin up machines as we needed them for test runners, all while using a single powerful server to run the prototype.
And the results?
The test scaled further, up to 260,000 idling and responding connections, only ending because nginx ran out of capacity.
The memory usage and CPU load of both nginx and the WebSocket server scaled almost exactly linearly, which was an important metric for us to keep an eye on. At this point, we can scale further horizontally with multiple server nodes, but our goal was to push a single node to the limit.
One additional key element of this last test was that we also monitored nginx handling TLS termination. nginx was using slightly more RAM than the WebSocket server at the end of the test, but that was without tweaking any nginx settings, leaving some room for further optimisation.
After proving Python servers could be made faster and more scalable, we had full confidence in using Python for a WebSocket server. We did not have that confidence at the beginning of our research.
Despite Python often being perceived as slow and less efficient in comparison to other languages, we showed that it could be extremely effective when used correctly. Being able to work with a language that we are well-versed in allowed us to progress quickly and minimise uncertainty during the implementation phase of the project.
Python has now been running in production for a while and continues to perform exactly within our expectations.
Conducting load testing early on to assess its capabilities and limitations allowed us to design a distributed WebSocket server that can scale to the load we require. In addition, the server is flexible enough to scale both vertically and horizontally, depending on specific circumstances.