Splitting Light: Season 2 - Episode 21



All nighter


As we were moving forward, in mid-June 2018, we hit a point where we needed to check the logs of the cluster as a whole. Until then, we had done it by connecting to the machines manually and opening the right files to look inside. This was no longer viable.

Scaleway’s monitoring team had built a metrics stack which we already used heavily, but their logging stack wasn’t ready yet. So, one day, we decided to build our own. I looked at the existing solutions and quickly went for Elasticsearch plus Fluentd. I spent the day working on the automation and getting the configuration right. As the day came to an end, I stayed at the office, determined to finish this task and deliver the “feature” we had been waiting on for so long.

By early evening, I was ingesting my first logs. By the middle of the night, I was accurately extracting the different elements of the logs. By early morning we had a basic dashboard and logs streaming into a single location where we could query and display information.
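The episode does not show what those logs actually looked like, so as a purely illustrative sketch, here is the kind of field extraction that was happening that night, written in Python against a made-up access-log line. In the real stack this lived in the Fluentd parser configuration rather than in code.

    import re

    # Made-up access-log line; the real log format is not described here,
    # this only illustrates pulling named fields out of raw text.
    LINE = '203.0.113.7 - - [14/Jun/2018:02:13:45 +0200] "GET /bucket/key HTTP/1.1" 200 5120 0.042'

    PATTERN = re.compile(
        r'(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" '
        r'(?P<status>\d{3}) (?P<size>\d+) (?P<latency>[\d.]+)'
    )

    def parse(line):
        """Turn one raw log line into a dict of named fields."""
        match = PATTERN.match(line)
        return match.groupdict() if match else {}

    print(parse(LINE))
    # {'client': '203.0.113.7', 'time': '14/Jun/2018:02:13:45 +0200',
    #  'method': 'GET', 'path': '/bucket/key', 'status': '200',
    #  'size': '5120', 'latency': '0.042'}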

Then came the knob tweaking. The most important element of the product was S3 itself: if a user could not store or fetch data, the product was useless to them. How you connect elements together matters. Had I decided to send logs over TCP, congestion on the logging side could have had a negative impact on the product. If log ingestion stopped working correctly, it would break other elements. The systems would be tightly coupled.

But if the monitoring part did not work, for whatever reason, did I really want the product to be impacted? Did I really want requests not going through because the logs were broken? Not in our context. For me, for us, we would rather have a functional product and lose logs than break the product and still have logs.

There is a simple way of decoupling those elements: send the logs over UDP. They are emitted from the source, and if they cannot be handled on the receiving side, they are simply dropped. Lost. No extra software. No queues. No added complexity. Just a specific transport protocol. This fixed the issue.
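To make that fire-and-forget behaviour concrete, here is a minimal Python sketch. The collector address and port are placeholders, and the real pipeline shipped logs through Fluentd rather than application code, but the transport property is the same: a UDP datagram is sent and forgotten, so a dead collector never blocks or fails a request.

    import logging
    import logging.handlers

    # SysLogHandler speaks UDP by default: each record becomes one datagram.
    # If nothing is listening on the collector side, the datagram is simply
    # dropped and the sender never notices, let alone blocks.
    handler = logging.handlers.SysLogHandler(address=("127.0.0.1", 5140))
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))

    log = logging.getLogger("s3-gateway")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    # Emitted whether or not a collector is running; logs can be lost,
    # the request path cannot be hurt by it.
    log.info('GET /bucket/key 200 5120 0.042')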

I switched the knobs to the right settings, added the right log configuration elements, and that was it. We could now fix more issues because, simply, we could see them. We could also search for issues and see logs that had previously been spread over several machines, now aggregated in a single location.
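As an illustration of what querying from a single location looks like, here is a hedged sketch of an Elasticsearch search. The index pattern, field names and local endpoint are assumptions, not our actual layout.

    import json
    import urllib.request

    # Find recent server errors across every machine at once, something
    # that previously meant opening files host by host.
    query = {
        "query": {
            "bool": {
                "must": [
                    {"range": {"status": {"gte": 500}}},
                    {"range": {"@timestamp": {"gte": "now-1h"}}},
                ]
            }
        },
        "size": 20,
    }

    req = urllib.request.Request(
        "http://localhost:9200/logs-*/_search",
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for hit in json.load(resp)["hits"]["hits"]:
            print(hit["_source"])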

Then we wanted to back up the logs, so I looked at an Elasticsearch feature for backing up to S3. At the time, the implementation was missing a few knobs to work with a non-AWS S3. I dove into the code, added the right logic to the backup plugin, and rebuilt it. I plugged it in, and we started backing up logs to our own S3.
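For context, recent versions of the repository-s3 plugin expose the endpoint knob that had to be patched in by hand back then. A rough sketch of what the end state looks like today, with placeholder bucket, repository and endpoint names:

    import json
    import urllib.request

    # On each node, elasticsearch.yml points the S3 client at the non-AWS
    # endpoint (placeholder shown):
    #   s3.client.default.endpoint: s3.internal.example
    #
    # Then the snapshot repository is registered through the REST API.
    repo = {
        "type": "s3",
        "settings": {"bucket": "es-log-snapshots", "client": "default"},
    }

    req = urllib.request.Request(
        "http://localhost:9200/_snapshot/log_backups",
        data=json.dumps(repo).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))  # {"acknowledged": true} once accepted

Once the repository is registered, snapshots are just further calls against the same _snapshot endpoint.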

Very quickly, though, we were getting too many logs. It was overwhelming the system.

Photo by Quentin


To pair with:

  • Chella Ride - Dog Blood
  • Energiya-Buran: The Soviet Space Shuttle by Bart Hendrickx, Bert Vis

Vincent Auclair

