Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
Splitting Light: Season 2 - Episode 23
Published about 2 months ago • 3 min read
Beat the cluster to a pulp
With proper observability in place, we could now push the cluster even further. This was the final round of tests before wiping everything, rebuilding the setup, and going to beta.
We huddled and concocted a strategy, picked up our tools, and took to the field to beat the cluster to a pulp one last time. Our goal was explicit: overwhelm the cluster as much as we could.
We started rounding up available servers across the Scaleway customer fleet, careful to spread out our selection. Why? Because we didn’t want to overwhelm the customer network. Had we not been careful, we would have saturated parts of Scaleway’s internal network, triggering issues and leaving customers unhappy.
Network saturation can happen in any of the big red arrows
Théo had found a distributed testing tool that spoke the S3 protocol. He and a few other team members would tune it for specific test frequencies. The rest of the team and I would monitor the cluster and watch for local saturation on the network paths.
With a better look at the insides of the system, we could now understand it better. In parallel with the performance tests, we also ran our own testing scripts to make sure everything worked correctly from a customer’s perspective.
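The tool we used is not named here, but the pattern is a common one: a pool of workers hammering the endpoint in a closed loop while counting successes and failures. A minimal sketch, with a stub `request_fn` standing in for a real S3 PUT or GET call:

```python
import queue
import threading
import time

def run_load(request_fn, workers=8, duration_s=2.0):
    """Drive request_fn from `workers` threads for duration_s seconds.

    request_fn stands in for a single S3 operation (PUT/GET);
    swap in a real client call. Returns (ok_count, error_count).
    """
    counts = queue.Queue()
    stop_at = time.monotonic() + duration_s

    def worker():
        ok = err = 0
        while time.monotonic() < stop_at:
            try:
                request_fn()
                ok += 1
            except Exception:
                err += 1
        counts.put((ok, err))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    total_ok = total_err = 0
    while not counts.empty():
        ok, err = counts.get()
        total_ok += ok
        total_err += err
    return total_ok, total_err
```

Raising `workers` (or running the same loop from many machines, as we did) raises the offered load; the error counter is what tells you when the cluster, or something in front of it, starts pushing back.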
As we raised the throughput and the requests per second, everything suddenly stopped. We checked metrics, logs, the network: nothing seemed wrong. We hailed the network team. It turned out we had triggered the DDoS prevention tool, whose automation had ordered all traffic to the cluster’s public IP to be dropped.
We had a good laugh, almost celebrating that we had triggered the DDoS protection measures. The network team adjusted some settings and we pushed traffic again. We retriggered the protection several times, but now we knew how to read the signs.
Sample dashboard portion from that time, not actually a sign of anything specific (1)
We closely monitored a few things: network bottlenecks and saturation, CPU saturation and latency, disk latency, and memory pressure. But one component got extra attention. Because we only advertised HTTPS, we needed to make sure the load balancer had the performance capacity to handle encryption correctly.
It had been a criterion when we chose the CPU type for the load balancers: our case required hardware-accelerated encryption and decryption. But having the right components in the CPU doesn’t mean the operating system and the software can use them. We had to test and make sure the settings were right, the right knobs turned on in the right places, the software configured correctly. We found that the CPUs were spending time waiting for input/output: they were waiting for data. So we had to dig even further and assign the network cards to the right CPUs and cores, depending on how they were physically cabled on the motherboard.
Physical connectivity on the motherboard matters. Networks 1 & 2 have low latency to CPU1 but high latency to CPU2
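Two of the checks above can be sketched in a few lines, assuming a Linux host. On Linux, hardware AES support shows up as the `aes` flag in `/proc/cpuinfo`, the NIC’s NUMA node is readable at `/sys/class/net/<iface>/device/numa_node`, and pinning its interrupts means writing a CPU bitmask to `/proc/irq/<n>/smp_affinity`. The helpers below (hypothetical names, not our actual tooling) check the flag and compute that bitmask:

```python
def has_aes_ni(cpuinfo_text):
    """Check for the 'aes' CPU flag (AES-NI) in /proc/cpuinfo content."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "aes" in line.split(":", 1)[1].split()
    return False

def cpu_affinity_mask(cpus):
    """Build the hex bitmask that /proc/irq/<n>/smp_affinity expects,
    from the list of CPU ids local to the NIC's NUMA node
    (those listed in /sys/devices/system/node/node<N>/cpulist)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")
```

For example, if the NIC sits on the NUMA node owning CPUs 8–15, `cpu_affinity_mask(range(8, 16))` yields `ff00`, and writing that mask to the NIC queues’ IRQ entries keeps interrupt handling on the CPU that is physically closest to the card, which is exactly the wait-for-data problem we were chasing.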
It was a lot of work. We acquired a lot of knowledge. We had a lot of fun.
In addition to performance, we also tested reliability and safety measures. What would happen if we erased some data? Or corrupted some? What happened if we killed or suspended parts of the software? If we rebooted a few servers?… Disconnected network cables?…
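The shape of those tests is a simple loop: inject a fault, probe the cluster from the outside, restore, record. A minimal sketch, with hypothetical `inject`/`restore` hooks standing in for the real actions (killing daemons, rebooting servers, pulling cables):

```python
def run_chaos(scenarios, health_check):
    """Run each fault-injection scenario, then verify the cluster
    still answers correctly from a client's point of view.

    Each scenario is a (name, inject, restore) triple of callables.
    Returns {name: passed} so surviving and failing faults are
    visible side by side.
    """
    results = {}
    for name, inject, restore in scenarios:
        inject()
        try:
            results[name] = bool(health_check())
        finally:
            restore()  # always undo the fault, even if the probe dies
    return results
```

The interesting outputs are the `False` entries: each one is either a real gap in redundancy or a missing alert, and both are worth knowing before customers find them.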
These tests helped us build a better mental map of the system. They helped us create the right alerts, and they let us verify that the cluster responded correctly to issues.
Time stretched a bit. DC5, where we were supposed to rack the production hardware and launch, needed more time than expected to metamorphose into a datacenter. So, to buy ourselves some time and get customers testing, we decided to run a student hackathon.