Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
Splitting Light: Season 2 - Episode 19
Published about 2 months ago • 3 min read
Bandwidth waves
At every step we would test the performance. Crude methods at first. Sewing scripts together would enable us to get more kick out of the performance testing.
The more performance we wanted to extract, the harder it was to do the tests. At first one powerful machine was enough to generate the requests and traffic. Then we needed two of them. Then twenty… Then a hundred.
We started to hit weird edge cases. But our biggest issue was that we didn’t really have visibility into what was happening. We could only monitor things by hand and watch logs scroll. Luckily for us, at that point, the monitoring team started to have a working internal platform we could use.
Prometheus-style architecture for metrics
We were one of the first users. It was based on a then-newish software stack. It was Prometheus-based. We hooked ourselves in. We added exporters for every software component and started looking at the dashboards. I don’t remember who created the first version of the cluster dashboard but it became the go-to place to observe the cluster.
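For readers who have never written one, an exporter mostly boils down to publishing counters in the plain-text format that Prometheus scrapes. The sketch below is only an illustration of that format, not one of the exporters we used: the metric name `demo_requests_total`, its label, and both helper functions are hypothetical. It also shows how a per-second rate can be derived from two successive readings of a counter.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        # Only emit the braces when there is at least one label.
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


def per_second(previous, current, interval_seconds):
    """Average rate between two readings of an ever-increasing counter."""
    return (current - previous) / interval_seconds


if __name__ == "__main__":
    # Hypothetical metric name and label, for illustration only.
    print(render_counter(
        "demo_requests_total",
        "Requests served.",
        {(("method", "GET"),): 42},
    ), end="")
    print(per_second(100, 400, 15))
```

Keeping the exporter down to raw counters and leaving the rate computation to the monitoring side is what makes it cheap to add one per software component.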
The great thing about that system was that many people had contributed open source programs to bridge existing software and Prometheus. We would monitor requests per second as well as packets sent and received. We continued to push performance. Suddenly a strange pattern emerged.
The bandwidth would rise, then suddenly it would sharply drop. Almost flatlining for a minute. Then it would rise again. That pattern repeated endlessly.
Prometheus-style architecture for metrics
I had seen this pattern previously. I’d seen it at my previous job before working at Scaleway. It was a sign of file descriptor exhaustion. We dived in, checked our operating system settings. We tweaked a few knobs, but the real fault was inside the code base.
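A quick way to see how close a process is to that wall is to compare its open descriptors with its limits. This is a minimal sketch, assuming a Linux-like system where `/proc/self/fd` (or `/dev/fd`) is available. The function name `fd_usage` is made up for the example and this is not the tooling we used at the time.

```python
import os
import resource


def fd_usage():
    """Return (open_descriptors, soft_limit, hard_limit) for this process.

    open_descriptors is None when no per-process fd directory is found.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = None
    for path in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(path):
            # One entry per open descriptor. Listing the directory briefly
            # opens a descriptor itself, so the count may be off by one.
            open_fds = len(os.listdir(path))
            break
    return open_fds, soft, hard


if __name__ == "__main__":
    used, soft, hard = fd_usage()
    print(f"open descriptors: {used}, soft limit: {soft}, hard limit: {hard}")
```

Raising the soft limit only buys time. If the code opens a new socket for every request, the count climbs back up under load.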
The code for the object storage, the OpenStack code as well as the OpenIO code, did not make use of connection pools everywhere. Nor did connections indicate that they could be reused.
Wave pattern
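To make the difference concrete, here is a small self-contained sketch using only the Python standard library. It is not the OpenStack or OpenIO code: `CountingHandler` and `run_requests` are hypothetical names, and the local test server merely counts how many TCP connections it accepts when the client reuses one connection versus opening a new one per request.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets the server keep connections alive
    connections = 0                # TCP connections accepted so far
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted TCP connection, not per request.
        with CountingHandler.lock:
            CountingHandler.connections += 1
        super().setup()

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def run_requests(n, reuse):
    """Send n GET requests; return how many TCP connections the server saw."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        conn = http.client.HTTPConnection(host, port, timeout=5)
        for _ in range(n):
            if not reuse:
                # One socket, and so one file descriptor, per request.
                conn.close()
                conn = http.client.HTTPConnection(host, port, timeout=5)
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body so the socket is reusable
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


if __name__ == "__main__":
    print("with reuse:", run_requests(5, reuse=True))
    print("without reuse:", run_requests(5, reuse=False))
```

With a hundred load generators, the non-reusing variant multiplies sockets on both ends, which is the kind of pressure that ends in descriptor exhaustion.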
We patched all the occurrences in the Python code, sent the patches upstream and tried again. This time it worked better. The performance did not collapse like it had previously. However, when doing basic tests after a big performance test we hit another issue.
Sometimes we’d get a permission denied in the middle of an operation, with successful requests before and after. The odd thing was that the request did not even reach the authentication service. After deep debugging, we found the culprit. Because we used connection pools, the connection would linger. The load balancer, on its side, would terminate the connection after a set time. Everything would close nicely on both operating systems. However, in the Python code, the connection object would stay there, untouched, until it was used again by the code flow. The request would naturally fail but the error handling was not accurate. It would error out and the failure would cascade back as a permission denied.
Old screenshot of one of the dashboards used during the testing phase
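The stale connection problem can be reproduced in a few lines. The sketch below is an assumption-laden illustration, not the patch we wrote: `DroppingHandler` is a hypothetical stand-in for the load balancer (it answers as keep-alive, then closes the socket anyway) and `ReusingClient` is a made-up one-connection "pool" that treats a connection error on a reused socket as a signal to reconnect and retry once.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class DroppingHandler(BaseHTTPRequestHandler):
    """Answers as keep-alive, then silently closes the TCP connection."""

    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
        # No "Connection: close" header was sent, so the client believes
        # the connection is reusable. Close it anyway.
        self.close_connection = True

    def log_message(self, *args):
        pass  # keep the example quiet


class ReusingClient:
    """Keeps one connection around and retries once if it turns out dead."""

    def __init__(self, host, port):
        self.conn = http.client.HTTPConnection(host, port, timeout=5)
        self.retries = 0

    def _get_once(self, path):
        self.conn.request("GET", path)
        response = self.conn.getresponse()
        return response.status, response.read()

    def get(self, path):
        try:
            return self._get_once(path)
        except (ConnectionError, http.client.HTTPException):
            # The pooled socket was closed by the other side. Drop it;
            # the next request() opens a fresh connection automatically.
            self.conn.close()
            self.retries += 1
            return self._get_once(path)


def demo(requests=3):
    server = ThreadingHTTPServer(("127.0.0.1", 0), DroppingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    client = ReusingClient(host, port)
    try:
        statuses = [client.get("/")[0] for _ in range(requests)]
    finally:
        client.conn.close()
        server.shutdown()
        server.server_close()
    return statuses, client.retries


if __name__ == "__main__":
    print(demo())
```

A blind retry like this is only safe for requests that can be repeated without side effects. Without the `except` branch, the connection error surfaces to whatever generic handler sits above, which is how a dead socket can end up reported as something unrelated.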
After tracking the issue in the code, I checked the upstream library. A fix had been proposed but refused by the maintainers. Their point of view was that the calling code should handle the error. They weren’t wrong but that was not an option for us. We decided to patch the library and use the patched version instead of the upstream one. This initiated a tradition of eventually pulling every dependency in-house because we needed to fix issues.
Funnily enough, long after this, we encountered the same issue in our Golang code. Even funnier, this bug would most likely not happen in C or C++ because control flow is done differently.
This pushed our sense of purpose.
If you have missed it, you can read the previous episode here
To pair with:
Howling at the moon - Phantogram
Brazen: Rebel Ladies Who Rocked the World (Culottées in French) by Pénélope Bagieu