Splitting Light: Season 2 - Episode 19



Bandwidth waves

If you are no longer interested in the newsletter, please unsubscribe

At every step we would test the performance. Crude methods at first. Stitching scripts together let us get more kick out of the performance testing.

The more performance we wanted to extract, the harder the tests were to run. At first one powerful machine was enough to generate the requests and traffic. Then we needed two of them. Then twenty… Then a hundred.

We started to hit weird edge cases. But our biggest issue was that we didn’t really have visibility into what was happening. We could only monitor things by hand and watch logs scroll. Luckily for us, at that point, the monitoring team had just gotten a working internal platform that we could use.

We were one of its first users. It was built on a then-newish software stack, Prometheus based. We hooked ourselves in. We added exporters for every software component and started looking at the dashboards. I don’t remember who created the first version of the cluster dashboard, but it became the go-to place to observe.
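For readers who have never written one: an exporter is just a small HTTP endpoint that serves metrics in the Prometheus text format. As a rough sketch with the official prometheus_client library (the metric name and port are made up for illustration, this is not one of our actual exporters):

    # Minimal exporter sketch: expose a counter that Prometheus can scrape.
    from prometheus_client import start_http_server, Counter
    import time

    REQUESTS = Counter("loadgen_requests_total",
                       "Requests issued by the load generator")

    start_http_server(8000)   # serves http://host:8000/metrics

    while True:
        REQUESTS.inc()        # call this wherever a request is actually sent
        time.sleep(1)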

The great thing about that system was that many people had contributed open source programs to bridge existing software and Prometheus. We would monitor requests per second as well as packets sent and received. We continued to push performance. Suddenly a strange pattern emerged.

The bandwidth would rise, then suddenly drop sharply, almost flatlining for a minute. Then it would rise again. That pattern repeated endlessly.

I had seen this pattern before, at my previous job before joining Scaleway. It was a sign of file descriptor exhaustion. We dove in and checked our operating system settings. We tweaked a few knobs, but the real fault was inside the code base.
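The operating system knob in question is mostly the per-process open file limit. As a rough sketch of what checking and raising it looks like from Python (illustrative only; in practice this is set through ulimit, limits.conf or the service manager):

    # Inspect and raise the per-process open file limit (RLIMIT_NOFILE).
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"file descriptors: soft={soft} hard={hard}")

    # An unprivileged process can raise its soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))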

The object storage code, both the OpenStack code and the OpenIO code, did not make use of connection pools everywhere. Nor did connections indicate that they could be reused.
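To illustrate what the change amounts to, here is a sketch with the requests library and a made-up URL (not the actual OpenStack or OpenIO patches): opening a fresh connection per call burns a socket and a file descriptor every time, while a shared session keeps a pool of keep-alive connections that get reused.

    import requests

    URL = "http://storage.internal/v1/bucket/object"   # hypothetical endpoint

    # Without pooling, each call opens and tears down its own TCP connection:
    # requests.get(URL)

    # With pooling, the Session keeps a urllib3 connection pool and reuses
    # keep-alive connections across calls.
    session = requests.Session()
    for _ in range(1000):
        resp = session.get(URL)
        resp.raise_for_status()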

We patched all the occurrences in the Python code, sent the patches upstream, and tried again. This time it worked better. The performance did not collapse like it had previously. However, when running basic tests after a big performance test, we hit another issue.

Sometimes we’d get a permission denied in the middle of an operation, with successful requests before and after. The odd thing was that the failing request did not even reach the authentication service. After deep debugging, we found the culprit. Because we used connection pools, connections would linger. The load balancer, on its side, would terminate a connection after a set idle time. Everything would close cleanly at the operating system level on both ends. In the Python code, however, the connection would sit there, untouched, until the code flow used it again. That request would naturally fail, but the error handling was not accurate: the failure would cascade back up as a permission denied.

After tracking the issue down in the code, I checked the upstream library. A fix had been proposed but refused by the maintainers. Their point of view was that the calling code should handle the error. They weren’t wrong, but that was not an option for us. We decided to patch the library and use the patched version instead of the upstream one, initiating a tradition of eventually pulling every dependency in house because we needed to fix issues.
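For what it is worth, the handling the maintainers had in mind would look roughly like this in the calling code (a sketch under my assumptions, not our actual patch): treat a connection-level failure on a pooled connection as a stale socket and retry once on a fresh one, instead of letting the error bubble up as something unrelated.

    import requests

    session = requests.Session()

    def get_with_retry(url):
        # A pooled connection may have been closed by the load balancer while
        # it sat idle. On a connection-level failure, retry once: the pool
        # drops the dead socket and opens a fresh connection, so a stale
        # keep-alive does not surface as a spurious application error.
        try:
            return session.get(url)
        except requests.exceptions.ConnectionError:
            return session.get(url)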

Funnily enough, long after this, we hit the exact same issue in our Golang code. Even funnier, this bug would most likely not happen in C or C++, because control flow is handled differently there.

This pushed our sense of purpose.


If you have missed it, you can read the previous episode here

To pair with:

  • Howling at the moon - Phantogram
  • Brazen: Rebel Ladies Who Rocked the World (Culottées in French) by Pénélope Bagieu

Vincent Auclair

Connect with me on your favorite network!

