Splitting Light: Season 2 - Episode 21



All nighter


As we were moving forward, in mid-June 2018, we hit a point where we needed to check the logs of the cluster as a whole. Until then, we had done it by connecting to the machines manually and opening the right files to look inside. This was no longer viable.

Scaleway’s monitoring team had built a metrics stack which we already used heavily, but their logging stack wasn’t ready yet. So, one day, we decided to build our own. I looked at the existing solutions and quickly went for Elasticsearch plus Fluentd. I spent the day working on the automation and getting the configuration right. As the day came to an end, I stayed at the office, determined to finish this task and deliver the “feature” we had been waiting on for so long.

By early evening, I was ingesting my first logs. By the middle of the night, I was accurately extracting the different elements of the logs. By early morning we had a basic dashboard and logs streaming into a single location where we could query and display information.
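The episode does not show what those logs actually looked like, so as a purely illustrative sketch, here is the kind of field extraction that was happening that night, written in Python against a made-up access-log line. In the real stack this lived in the Fluentd parser configuration rather than in code.

    import re

    # Made-up access-log line; the real log format is not described here,
    # this only illustrates pulling named fields out of raw text.
    LINE = '203.0.113.7 - - [14/Jun/2018:02:13:45 +0200] "GET /bucket/key HTTP/1.1" 200 5120 0.042'

    PATTERN = re.compile(
        r'(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" '
        r'(?P<status>\d{3}) (?P<size>\d+) (?P<latency>[\d.]+)'
    )

    def parse(line):
        """Turn one raw log line into a dict of named fields."""
        match = PATTERN.match(line)
        return match.groupdict() if match else {}

    print(parse(LINE))
    # {'client': '203.0.113.7', 'time': '14/Jun/2018:02:13:45 +0200',
    #  'method': 'GET', 'path': '/bucket/key', 'status': '200',
    #  'size': '5120', 'latency': '0.042'}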

Then came the knob tweaking. The most important element of the product was S3 itself: if a user could not store or fetch data, the product was useless to them. How you connect elements together matters. Had I decided to send logs over TCP, congestion on the logging side could have had a negative impact on the product. If log ingestion stopped working correctly, it would break other elements. The systems would be tightly coupled.

But if the monitoring part did not work, for whatever reason, did I really want the product to be impacted? Did I really want requests not going through because the logs were broken? Not in our context. For me, for us, we would rather have a functional product and lose logs than break the product and still have logs.

There is a simple way of decoupling those elements: send the logs over UDP. They are emitted from the source, and if they cannot be handled on the receiving side, they are simply dropped. Lost. No extra software. No queues. No added complexity. Just a specific transport protocol. This fixed the issue.
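To make that fire-and-forget behaviour concrete, here is a minimal Python sketch. The collector address and port are placeholders, and the real pipeline shipped logs through Fluentd rather than application code, but the transport property is the same: a UDP datagram is sent and forgotten, so a dead collector never blocks or fails a request.

    import logging
    import logging.handlers

    # SysLogHandler speaks UDP by default: each record becomes one datagram.
    # If nothing is listening on the collector side, the datagram is simply
    # dropped and the sender never notices, let alone blocks.
    handler = logging.handlers.SysLogHandler(address=("127.0.0.1", 5140))
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))

    log = logging.getLogger("s3-gateway")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    # Emitted whether or not a collector is running; logs can be lost,
    # the request path cannot be hurt by it.
    log.info('GET /bucket/key 200 5120 0.042')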

I switched the knobs to the right settings, added the right log configuration elements, and that was it. We could now fix more issues because, simply, we could see them. We could also search for issues and see logs that had previously been spread over several machines, now aggregated in a single location.
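As an illustration of what querying from a single location looks like, here is a hedged sketch of an Elasticsearch search. The index pattern, field names and local endpoint are assumptions, not our actual layout.

    import json
    import urllib.request

    # Find recent server errors across every machine at once, something
    # that previously meant opening files host by host.
    query = {
        "query": {
            "bool": {
                "must": [
                    {"range": {"status": {"gte": 500}}},
                    {"range": {"@timestamp": {"gte": "now-1h"}}},
                ]
            }
        },
        "size": 20,
    }

    req = urllib.request.Request(
        "http://localhost:9200/logs-*/_search",
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for hit in json.load(resp)["hits"]["hits"]:
            print(hit["_source"])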

Then we wanted to back up the logs, so I looked at an Elasticsearch feature for backing up to S3. At the time, the implementation was missing a few knobs to work with a non-AWS S3. I dove into the code, added the right logic to the backup plugin, and rebuilt it. I plugged it in, and we started backing up logs to our own S3.
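For context, recent versions of the repository-s3 plugin expose the endpoint knob that had to be patched in by hand back then. A rough sketch of what the end state looks like today, with placeholder bucket, repository and endpoint names:

    import json
    import urllib.request

    # On each node, elasticsearch.yml points the S3 client at the non-AWS
    # endpoint (placeholder shown):
    #   s3.client.default.endpoint: s3.internal.example
    #
    # Then the snapshot repository is registered through the REST API.
    repo = {
        "type": "s3",
        "settings": {"bucket": "es-log-snapshots", "client": "default"},
    }

    req = urllib.request.Request(
        "http://localhost:9200/_snapshot/log_backups",
        data=json.dumps(repo).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))  # {"acknowledged": true} once accepted

Once the repository is registered, snapshots are just further calls against the same _snapshot endpoint.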

Very quickly, though, we were getting too many logs. It was overwhelming the system.

Photo by Quentin


To pair with:

  • Chella Ride - Dog Blood
  • Energiya-Buran: The Soviet Space Shuttle by Bart Hendrickx, Bert Vis

Vincent Auclair

