Cloud Container Platform sped up by Redis change

Ever since the launch back in March 2016, we’ve kept working on Cloud Containers. It’s a robust hosting platform and a work in progress, all at the same time.

Ideas for enhancements come from a few different places. Sometimes it’s a simple case of our customers getting what they ask for, like when we introduced container cloning. Sometimes we spy technical opportunities, as with the Python + Miniconda image launch a few weeks ago. And sometimes we see the platform not performing as well as we know it should.

Growing pains

Recently we noticed that the Cloud Container platform was looking a bit sluggish. Actions like creating or rebooting containers were taking much longer than we expected, and longer than they used to. We hadn’t made any changes that could explain the slowness, or which coincided with things getting worse. In fact, there wasn’t even a clear point at which the problem started. It was more like a gradual deterioration over the years. (On reflection, that explains why we didn’t pick it up sooner.)

When we looked into it, we discovered that the effect was worse in our two busiest regions, AKL01 and AKL02, which suggested that the issue grows as the platform grows. The large majority of our Cloud Container servers are in one of these two data centres.

But there’s a difference between knowing that something could be better and knowing how to fix it. We needed more information. It was clear how long jobs were taking (26.4 seconds to restart a Container, for example) but not exactly where each millisecond was going. There was a chunk of time that we had no good explanation for, and no good insight into.

Using Grafana Tempo to collect telemetry data solved our visibility problem, and the traces pointed to Redis as the time sink. The only catch was that Redis is extremely fast. That’s one of its big selling points, so something had to be wrong. Something about the way we were using Redis.
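
To give a feel for what that kind of visibility means, here is a minimal sketch in plain Python. It is not Grafana Tempo and not our actual instrumentation; the JobTrace class and the step names are hypothetical, and the sleeps simply stand in for real work. The idea is the same though: wrap each step of a job in a timed span, then look at which span is eating the time.

```python
import time
from contextlib import contextmanager


class JobTrace:
    """Collects (step name, seconds) pairs for one job. Hypothetical helper."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name):
        # Time whatever runs inside the `with` block, even if it raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans.append((name, time.perf_counter() - start))

    def slowest(self):
        # The span with the largest recorded duration.
        return max(self.spans, key=lambda item: item[1])


# Simulated restart job; step names and durations are made up for illustration.
trace = JobTrace()
with trace.span("validate request"):
    time.sleep(0.001)
with trace.span("redis lookup"):
    time.sleep(0.05)
with trace.span("restart process"):
    time.sleep(0.005)

for name, seconds in trace.spans:
    print(f"{name}: {seconds * 1000:.1f} ms")
print("slowest step:", trace.slowest()[0])
```

With every step timed like this, an unexplained chunk of a job’s duration stops being a mystery and starts being a line item.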

Racing Redis

It’s funny how big problems — like many accumulated hours of unnecessary waiting for Containers to start or stop — can have tiny causes.

As it turned out, we were using SCAN across entire tables of keys rather than using GET to, well, get a specific piece of data. Anyone who has worked with any form of database knows this isn’t optimal when working with large amounts of data. For the nerds: the time taken by a full SCAN grows linearly with the number of keys it has to walk (O(n)), while GET is a constant-time operation regardless (O(1)).
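
Here is a rough sketch of the difference, not our production code. The key format container:<id>, the FakeKeyspace class and the two find_* functions are all assumptions made up for illustration. FakeKeyspace is a tiny in-memory stand-in that mimics just two methods of the redis-py client, scan_iter and get, so the example runs without a Redis server and can count how many keys each approach touches.

```python
from fnmatch import fnmatchcase


class FakeKeyspace:
    """In-memory stand-in for a Redis client (hypothetical, for illustration)."""

    def __init__(self, data):
        self._data = dict(data)
        self.keys_examined = 0  # how many keys the "server" had to touch

    def scan_iter(self, match=None):
        # Like a full SCAN iteration: every key is visited, matching or not.
        # fnmatchcase only approximates Redis glob-style patterns.
        for key in list(self._data):
            self.keys_examined += 1
            if match is None or fnmatchcase(key, match):
                yield key

    def get(self, name):
        # Like GET: one direct lookup, independent of keyspace size.
        self.keys_examined += 1
        return self._data.get(name)


def find_by_scan(client, container_id):
    """Slow path: walk all keys to find the one we want, then read it."""
    wanted = f"container:{container_id}"
    matches = list(client.scan_iter(match=wanted))
    return client.get(matches[0]) if matches else None


def find_by_get(client, container_id):
    """Fast path: build the key and ask for it directly."""
    return client.get(f"container:{container_id}")


data = {f"container:{i}": f"state-{i}" for i in range(10_000)}
slow, fast = FakeKeyspace(data), FakeKeyspace(data)

assert find_by_scan(slow, 9999) == find_by_get(fast, 9999) == "state-9999"
print("keys touched via SCAN:", slow.keys_examined)
print("keys touched via GET: ", fast.keys_examined)
```

In this toy model the SCAN route touches 10,001 keys to answer one question, while the GET route touches exactly one. Add more containers and the first number keeps climbing; the second never moves.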

It appears this behaviour was written into very early prototypes and wasn’t noticed while the performance difference was imperceptible. It’s just been lurking in the background and slowly reducing job performance ever since. As more Containers were added (many thousands of them, after seven years), more jobs and actions took place each day, each SCAN took longer, and a GET would have saved more time.

So we looked over our Redis implementation, paying close attention to how we both store and retrieve the data that we need, and made what in hindsight was the obvious change: moving to a GET-based lookup. We also made a few other improvements while we were in the area, including introducing pipelining. Together, these changes delivered big performance improvements.
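
For anyone who hasn’t met pipelining before: instead of sending one command and waiting for the reply before sending the next, the client queues up several commands and sends them in one go, paying for a single network round trip. The sketch below illustrates the idea under stated assumptions. RoundTripClient and FakePipeline are hypothetical stand-ins that copy the shape of redis-py’s pipeline(), get() and execute() calls so the example runs without a server; the key names are made up too.

```python
class FakePipeline:
    """Buffers commands locally until execute() is called (illustrative only)."""

    def __init__(self, client):
        self._client = client
        self._queued = []

    def get(self, name):
        self._queued.append(name)  # nothing is sent yet
        return self

    def execute(self):
        # The whole batch travels to the server and back in one round trip.
        self._client.round_trips += 1
        results = [self._client.data.get(name) for name in self._queued]
        self._queued = []
        return results


class RoundTripClient:
    """In-memory stand-in for a Redis client that counts network round trips."""

    def __init__(self, data):
        self.data = dict(data)
        self.round_trips = 0

    def get(self, name):
        self.round_trips += 1  # each un-pipelined GET is its own round trip
        return self.data.get(name)

    def pipeline(self):
        return FakePipeline(self)


def fetch_one_by_one(client, keys):
    return [client.get(key) for key in keys]


def fetch_pipelined(client, keys):
    pipe = client.pipeline()
    for key in keys:
        pipe.get(key)
    return pipe.execute()


data = {f"container:{i}": f"state-{i}" for i in range(50)}
keys = list(data)
plain, piped = RoundTripClient(data), RoundTripClient(data)

assert fetch_one_by_one(plain, keys) == fetch_pipelined(piped, keys)
print("round trips without pipelining:", plain.round_trips)
print("round trips with pipelining:   ", piped.round_trips)
```

Same 50 answers either way, but 50 round trips in one case and a single round trip in the other. When a job needs several values at once, that waiting time adds up.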

The numbers that tell the story

If your Cloud Container server is in either of our two main data centres, AKL01 or AKL02, you have probably already noticed that jobs are happening noticeably faster. How much faster, you ask? Here are a few examples.

Starting a Container

Time taken (before): 24.2s
Time taken (after): 3.44s
Improvement: 20.76s (86%)

Restarting a Container

Time taken (before): 26.4s
Time taken (after): 5.32s
Improvement: 21.08s (80%)

Stopping a Container

Time taken (before): 43.9s
Time taken (after): 4.46s
Improvement: 39.44s (90%)
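
If you’d like to check our maths, the improvement figures above are just the seconds saved, expressed as a share of the original time. This short snippet reproduces them from the before and after timings; the dictionary and function names are ours for illustration, and the speed-up factor is an extra way of reading the same numbers.

```python
# (before, after) timings in seconds, taken from the figures above.
timings = {
    "start": (24.2, 3.44),
    "restart": (26.4, 5.32),
    "stop": (43.9, 4.46),
}


def improvement(before, after):
    """Return (seconds saved, percentage saved, speed-up factor)."""
    saved = before - after
    return round(saved, 2), round(100 * saved / before), round(before / after, 1)


for action, (before, after) in timings.items():
    saved, percent, factor = improvement(before, after)
    print(f"{action}: saved {saved}s ({percent}%), about {factor}x faster")
```

Read as a speed-up, starting a Container is about 7 times faster than it was, restarting about 5 times, and stopping nearly 10 times.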

Twenty or thirty seconds might not sound like a long time, but when you see those percentage gains you can appreciate what we’ve achieved. It’s yet another step in our forever-mission of keeping Cloud Containers as slick as possible, and a good reminder to check assumptions you made long ago!

Have you noticed how much faster the Cloud Container platform is?