Splitting Light: Season 2 - Episode 19



Bandwidth waves

If you are no longer interested in the newsletter, please unsubscribe

At every step we would test the performance. Crude methods at first. Stitching scripts together let us get more kick out of the performance testing.

The more performance we wanted to extract, the harder the tests were to run. At first one powerful machine was enough to generate the requests and traffic. Then we needed two of them. Then twenty… Then a hundred.

We started to hit weird edge cases. But our biggest issue was that we didn’t really have visibility into what was happening. We could only monitor things by hand and watch logs scroll. Luckily for us, at that point, the monitoring team had just gotten a working internal platform that we could use.

We were one of its first users. It was built on a then-newish software stack, Prometheus based. We hooked ourselves in. We added exporters for every software component and started looking at the dashboards. I don’t remember who created the first version of the cluster dashboard, but it became the go-to place to observe.
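For readers who have never written one: an exporter is just a small HTTP endpoint that serves metrics in the Prometheus text format. As a rough sketch with the official prometheus_client library (the metric name and port are made up for illustration, this is not one of our actual exporters):

    # Minimal exporter sketch: expose a counter that Prometheus can scrape.
    from prometheus_client import start_http_server, Counter
    import time

    REQUESTS = Counter("loadgen_requests_total",
                       "Requests issued by the load generator")

    start_http_server(8000)   # serves http://host:8000/metrics

    while True:
        REQUESTS.inc()        # call this wherever a request is actually sent
        time.sleep(1)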

The great thing about that system was that many people had contributed open source programs to bridge existing software and Prometheus. We would monitor requests per second as well as packets sent and received. We continued to push performance. Suddenly a strange pattern emerged.

The bandwidth would rise, then suddenly drop sharply, almost flatlining for a minute. Then it would rise again. That pattern repeated endlessly.

I had seen this pattern before, at my previous job before joining Scaleway. It was a sign of file descriptor exhaustion. We dove in and checked our operating system settings. We tweaked a few knobs, but the real fault was inside the code base.
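The operating system knob in question is mostly the per-process open file limit. As a rough sketch of what checking and raising it looks like from Python (illustrative only; in practice this is set through ulimit, limits.conf or the service manager):

    # Inspect and raise the per-process open file limit (RLIMIT_NOFILE).
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"file descriptors: soft={soft} hard={hard}")

    # An unprivileged process can raise its soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))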

The object storage code, both the OpenStack code and the OpenIO code, did not make use of connection pools everywhere. Nor did connections indicate that they could be reused.
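To illustrate what the change amounts to, here is a sketch with the requests library and a made-up URL (not the actual OpenStack or OpenIO patches): opening a fresh connection per call burns a socket and a file descriptor every time, while a shared session keeps a pool of keep-alive connections that get reused.

    import requests

    URL = "http://storage.internal/v1/bucket/object"   # hypothetical endpoint

    # Without pooling, each call opens and tears down its own TCP connection:
    # requests.get(URL)

    # With pooling, the Session keeps a urllib3 connection pool and reuses
    # keep-alive connections across calls.
    session = requests.Session()
    for _ in range(1000):
        resp = session.get(URL)
        resp.raise_for_status()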

We patched all the occurrences in the Python code, sent the patches upstream, and tried again. This time it worked better. The performance did not collapse like it had previously. However, when running basic tests after a big performance test, we hit another issue.

Sometimes we’d get a permission denied in the middle of an operation, with successful requests before and after. The odd thing was that the failing request did not even reach the authentication service. After deep debugging, we found the culprit. Because we used connection pools, connections would linger. The load balancer, on its side, would terminate a connection after a set idle time. Everything would close cleanly at the operating system level on both ends. In the Python code, however, the connection would sit there, untouched, until the code flow used it again. That request would naturally fail, but the error handling was not accurate: the failure would cascade back up as a permission denied.

After tracking the issue down in the code, I checked the upstream library. A fix had been proposed but refused by the maintainers. Their point of view was that the calling code should handle the error. They weren’t wrong, but that was not an option for us. We decided to patch the library and use the patched version instead of the upstream one, initiating a tradition of eventually pulling every dependency in house because we needed to fix issues.
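For what it is worth, the handling the maintainers had in mind would look roughly like this in the calling code (a sketch under my assumptions, not our actual patch): treat a connection-level failure on a pooled connection as a stale socket and retry once on a fresh one, instead of letting the error bubble up as something unrelated.

    import requests

    session = requests.Session()

    def get_with_retry(url):
        # A pooled connection may have been closed by the load balancer while
        # it sat idle. On a connection-level failure, retry once: the pool
        # drops the dead socket and opens a fresh connection, so a stale
        # keep-alive does not surface as a spurious application error.
        try:
            return session.get(url)
        except requests.exceptions.ConnectionError:
            return session.get(url)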

Funnily enough, long after this, we hit the exact same issue in our Golang code. Even funnier, this bug would most likely not happen in C or C++, because control flow is handled differently there.

This pushed our sense of purpose.


If you have missed it, you can read the previous episode here

To pair with:

  • Howling at the moon - Phantogram
  • Brazen: Rebel Ladies Who Rocked the World (Culottées in French) by Pénélope Bagieu

Vincent Auclair

Connect with me on your favorite network!

