Splitting light

Season 2 Episode 23

Beat the cluster to a pulp


With proper observability in place, we could now push the cluster even further. This was the final set of tests we would perform before wiping everything, setting it all up anew, and going to beta.

We huddled and concocted a strategy. Picked up our tools and went into the field to beat the cluster to a pulp one last time. Our goal was explicitly to overwhelm the cluster as much as we could.

We started rounding up available servers in the Scaleway customer fleet. We were careful to spread out our selection. Why? Because we didn’t want to overwhelm the customer network. Had we not been careful, we would have saturated parts of Scaleway’s internal network, triggering issues and creating dissatisfied customers.

Théo had found a distributed testing tool that spoke the S3 protocol. He and a few other team members would tune it for specific test frequencies. The rest of the team and I would monitor the cluster and watch for local saturation on the network paths.
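To make “tuning it for specific test frequencies” concrete, here is a rough sketch of a rate-tuned S3 load generator. It is not the tool Théo found: it assumes Python with boto3, the endpoint, bucket and numbers are made up, and it skips the distribution across many client servers that the real tool had.

    import threading
    import time

    import boto3

    ENDPOINT = "https://s3.example.cloud"   # made-up endpoint
    BUCKET = "loadtest-bucket"              # made-up bucket
    REQUESTS_PER_SECOND = 50                # the "test frequency" knob
    WORKERS = 8
    PAYLOAD = b"x" * 4096                   # small fixed-size object

    s3 = boto3.client("s3", endpoint_url=ENDPOINT)

    def worker(rate: float) -> None:
        # Each worker sends its share of the target rate, sleeping off the
        # remainder of each interval to hold a steady pace.
        interval = 1.0 / rate
        i = 0
        while True:
            start = time.monotonic()
            s3.put_object(Bucket=BUCKET,
                          Key=f"obj-{threading.get_ident()}-{i}",
                          Body=PAYLOAD)
            i += 1
            time.sleep(max(0.0, interval - (time.monotonic() - start)))

    for _ in range(WORKERS):
        threading.Thread(target=worker,
                         args=(REQUESTS_PER_SECOND / WORKERS,),
                         daemon=True).start()

    time.sleep(60)  # let it run for a minute, then exit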

With a better look into the insides of the system, we could finally understand what it was doing. In parallel with the performance tests, we would also run our own testing scripts. We wanted to make sure everything worked correctly from a customer perspective.
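I won’t reproduce our actual scripts here, but the spirit of that customer-perspective check was something like this minimal sketch, again assuming Python with boto3 and a made-up endpoint and bucket: write an object, read it back, and verify the bytes survive the round trip.

    import os

    import boto3

    s3 = boto3.client("s3", endpoint_url="https://s3.example.cloud")  # made-up endpoint
    bucket, key = "sanity-check-bucket", "sanity-check-object"        # made-up names
    original = os.urandom(1024 * 1024)  # 1 MiB of random data

    # Upload, download, and compare.
    s3.put_object(Bucket=bucket, Key=key, Body=original)
    returned = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    assert returned == original, "object did not round-trip correctly"

    s3.delete_object(Bucket=bucket, Key=key)
    print("round-trip OK")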

As we raised the throughput and the requests per second, everything suddenly stopped. We checked metrics, logs, the network: nothing seemed wrong. We hailed the network team. It turned out we had triggered the DDoS prevention tool. The automation had ordered all traffic to the cluster’s public IP to be dropped.

We had a good laugh, almost celebrating that we had triggered the DDoS protection measures. The network team adjusted some settings and we pushed traffic again. We retriggered it several times, but now we knew how to read the signs.

We closely monitored a few things: network bottlenecks and saturation, CPU saturation and latency, disk latency and memory pressure. But there was one component we paid extra attention to. Because we only advertised HTTPS, we needed to make sure the load balancer had the performance capacity required to handle encryption correctly.

It had been a criterion when we chose the CPU type of the load balancers: our case required hardware-accelerated encryption and decryption. But having the right components in the CPU doesn’t mean the operating system and the software can use them. We had to test and make sure we had applied the right settings, the right knobs turned on in the right places, with the software set up correctly. We found that the CPUs were spending time waiting for input/output. They were waiting for data. So we had to dig even further and assign the network cards to the right CPUs and cores, depending on how they were physically cabled on the motherboard.
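To give an idea, checking those two things on Linux looks roughly like the sketch below. This is not our actual tooling, and the interface names are just examples: it checks whether the CPU advertises the AES instructions, and which NUMA node each network card is attached to, so its interrupts can then be pinned to cores on the same node.

    from pathlib import Path

    def cpu_has_aes() -> bool:
        # /proc/cpuinfo lists the instruction-set flags; "aes" means AES-NI is exposed.
        flags = Path("/proc/cpuinfo").read_text()
        return any(" aes " in line + " "
                   for line in flags.splitlines() if line.startswith("flags"))

    def nic_numa_node(interface: str) -> int:
        # The NUMA node the card is physically attached to (-1 if the platform
        # doesn't report one). Interrupt pinning then targets cores on that node,
        # typically by writing masks under /proc/irq/<N>/.
        return int(Path(f"/sys/class/net/{interface}/device/numa_node")
                   .read_text().strip())

    if __name__ == "__main__":
        print("AES-NI available:", cpu_has_aes())
        for iface in ("eth0", "eth1"):  # example interface names
            try:
                print(iface, "-> NUMA node", nic_numa_node(iface))
            except FileNotFoundError:
                print(iface, ": no such interface or no PCI info")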

It was a lot of work. We acquired a lot of knowledge. We had a lot of fun.

In addition to performance, we also tested reliability and safety measures. What would happen if we erased some data? Or corrupted some? If we killed or suspended parts of the software? If we rebooted a few servers? Disconnected network cables?
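In spirit, one of the gentler of those tests looked something like this sketch, assuming the services run as ordinary Linux processes; the process name and the timing are made up.

    import os
    import random
    import signal
    import subprocess
    import time

    SERVICE = "some-openio-service"   # made-up process name
    SUSPEND_SECONDS = 30

    def pids_of(name: str) -> list[int]:
        # Find matching processes with pgrep.
        out = subprocess.run(["pgrep", "-f", name], capture_output=True, text=True)
        return [int(p) for p in out.stdout.split()]

    def suspend_one(name: str) -> None:
        # Freeze one random instance, watch whether the cluster routes around it,
        # then let it resume and rejoin.
        pids = pids_of(name)
        if not pids:
            return
        victim = random.choice(pids)
        os.kill(victim, signal.SIGSTOP)
        time.sleep(SUSPEND_SECONDS)
        os.kill(victim, signal.SIGCONT)

    if __name__ == "__main__":
        suspend_one(SERVICE)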

These tests helped us build a better mental map of the system. It helped us create the right alerts. It let us verify that the cluster responded correctly to issues.

Time stretched a bit. DC5, where we were supposed to rack the hardware for production and launch, needed more time than expected to metamorphose into a datacenter. So to buy us some time and get customers testing, we decided to run a student hackathon.

(1) Screenshot from Loic

If you have missed it, you can read the previous episode here

To pair with:

  • Stop - Reef
  • The image of her (Les belles Images) by Simone de Beauvoir

Vincent Auclair

Connect with me on your favorite network!

