Splitting Light: Season 2 - Episode 22



Too many logs


I’ve rarely seen people talk about request amplification. It is an effect that can overwhelm your system. We had to deal with it.

The object storage, at least OpenIO, was a collection of distributed services. You might call them microservices if you want. That had implications. From the user’s perspective, an incoming request is a single request. Behind the scenes, it fans out into a collection of sub-requests: one request generates a dozen or more of them. The request flow is amplified.

One sub-request for authentication. Several to check the metadata. Several for the elements of each different feature. And then some more to fetch the actual data. Each of these generates one or more logs. Logs that we had to process and store.

When you have a low number of requests per second, let’s say one or two, you’ll need to handle maybe a hundred or so logs per second at most. But what if you raise the number of requests to what was, for us, an acceptable performance criterion?

Well, you start pushing 1,000 requests per second, which results in maybe 100,000 logs per second. We had to double or even triple that to reach our minimum performance criterion.
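To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The two factors below (25 sub-requests per request, 4 logs per sub-request) are assumptions picked only to land on the orders of magnitude quoted above. They are not measured values from the cluster.

```python
def logs_per_second(requests_per_second, sub_requests_per_request, logs_per_sub_request):
    """Estimate the log volume produced by request amplification."""
    return requests_per_second * sub_requests_per_request * logs_per_sub_request


# Assumed, illustrative factors ("a dozen or more" sub-requests, "one or more" logs each).
SUB_REQUESTS_PER_REQUEST = 25
LOGS_PER_SUB_REQUEST = 4

# A quiet system, the first load target, and triple that target.
for rps in (1, 1_000, 3_000):
    volume = logs_per_second(rps, SUB_REQUESTS_PER_REQUEST, LOGS_PER_SUB_REQUEST)
    print(f"{rps} requests/s -> about {volume} logs/s")
```

The point is that the log pipeline has to be sized for the amplified number, not for the request rate the user sees.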

Except most tools could not handle this. Fluentd, the software we used to gather and process the logs, is written in Ruby. A similar language to Python. Performance is not the objective. It’s made to be convenient. It could not handle the load. So we looked for a drop-in alternative.

Luckily, one person had created fluent-bit. A replacement written in C. So I looked into it. It was missing a critical feature: UDP support. So I implemented it (1) and plugged it in. After swapping the software, we went from an overwhelmed system to a system that was bored. We later found a few bugs, but it was that simple.
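For readers who have never shipped logs over UDP, the idea looks roughly like the sketch below: the sender fires one datagram per log line and does not wait for an answer, while the collector reads datagrams off a socket. This is only an illustration using Python’s standard `socket` module on the loopback interface. The log line and the function names are made up, and it says nothing about how fluent-bit implements its input or what format our services actually emitted.

```python
import socket


def open_udp_listener(host="127.0.0.1", port=0):
    """Bind a UDP socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(2.0)  # avoid blocking forever if a datagram is lost
    return sock


def send_log(line, address):
    """Send one log line as a single datagram, without waiting for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        sender.sendto(line.encode("utf-8"), address)


def receive_log(sock, max_size=65535):
    """Read one datagram and decode it as a log line."""
    data, _peer = sock.recvfrom(max_size)
    return data.decode("utf-8")


listener = open_udp_listener()
send_log("GET /bucket/object 200", listener.getsockname())  # hypothetical log line
received = receive_log(listener)
listener.close()
print(received)
```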

Now that we could process the logs, the issue shifted to storing them. We needed to tweak a few knobs in Elasticsearch to make things work.

I added a mechanism of hot and cold logs. Logs would get stored, the index would rotate, and after a few days it would be moved to the “cold” storage. In other words, spinning disks instead of fast NVMe drives. From that basic premise, Folays wrote software that interacted with the Elasticsearch API to move the indexes according to free space rather than on a fixed schedule.
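The free-space approach can be sketched as follows. This is not Folays’ software, just a minimal illustration of the decision it had to make: when the hot tier runs low on space, pick the oldest indexes to move until enough space is freed. The `Index` type, the function names, and the `box_type` node attribute are assumptions for the example. The settings key follows Elasticsearch’s index-level shard allocation filtering (`index.routing.allocation.require.<attribute>`), which would be applied through the index settings API.

```python
from dataclasses import dataclass


@dataclass
class Index:
    name: str
    size_bytes: int
    created_at: int  # Unix timestamp; smaller means older


def pick_indexes_to_cool(hot_indexes, free_bytes, min_free_bytes):
    """Return the oldest hot indexes to move until free space reaches the target."""
    to_move = []
    for index in sorted(hot_indexes, key=lambda i: i.created_at):
        if free_bytes >= min_free_bytes:
            break
        to_move.append(index)
        free_bytes += index.size_bytes  # space reclaimed once the index has moved
    return to_move


def cold_allocation_settings(attribute="box_type", value="cold"):
    """Build the settings body that pins an index to nodes tagged as cold.

    The attribute name is an assumption; it must match however the nodes are tagged.
    """
    return {f"index.routing.allocation.require.{attribute}": value}


hot = [
    Index("logs-2018.06.03", 30, created_at=3),
    Index("logs-2018.06.01", 30, created_at=1),
    Index("logs-2018.06.02", 30, created_at=2),
]
moved = pick_indexes_to_cool(hot, free_bytes=10, min_free_bytes=50)
print([index.name for index in moved], cold_allocation_settings())
```

Driving the move from free space instead of age means a sudden burst of logs cools indexes sooner, while a quiet week keeps them on the fast drives longer.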

This worked nicely. With both the metrics and the logs, we now had much better insight. We could now go harder on the cluster one last time.

(1) Merge request on fluent-bit to get UDP support https://github.com/fluent/fluent-bit/pull/767


To pair with:

  • Way I fell - Atu
  • The Clan Corporate by Charles Stross

Vincent Auclair


Symbol Sled

Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
