Splitting light

Season 2 Episode 23

Beat the cluster to a pulp


With proper observability in place, we could now push the cluster even further. This was the final set of tests we would perform before wiping everything, setting it all up anew, and going to beta.

We huddled and concocted a strategy. Picked up our tools and went into the field to beat the cluster to a pulp one last time. Our goal was explicitly to overwhelm the cluster as much as we could.

We started rounding up available servers in the Scaleway customer fleet. We were careful to spread out our selection. Why? Because we didn’t want to overwhelm the customer network. Had we not been careful, we would have saturated parts of Scaleway’s internal network, triggering issues and creating dissatisfied customers.

Théo had found a distributed testing tool that spoke the S3 protocol. He and a few other team members would tune it for specific test frequencies. The rest of the team and I would monitor the cluster and watch for local saturation on the network paths.
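To make “tuning it for specific test frequencies” concrete, here is a rough sketch of a rate-tuned S3 load generator. It is not the tool Théo found: it assumes Python with boto3, the endpoint, bucket and numbers are made up, and it skips the distribution across many client servers that the real tool had.

    import threading
    import time

    import boto3

    ENDPOINT = "https://s3.example.cloud"   # made-up endpoint
    BUCKET = "loadtest-bucket"              # made-up bucket
    REQUESTS_PER_SECOND = 50                # the "test frequency" knob
    WORKERS = 8
    PAYLOAD = b"x" * 4096                   # small fixed-size object

    s3 = boto3.client("s3", endpoint_url=ENDPOINT)

    def worker(rate: float) -> None:
        # Each worker sends its share of the target rate, sleeping off the
        # remainder of each interval to hold a steady pace.
        interval = 1.0 / rate
        i = 0
        while True:
            start = time.monotonic()
            s3.put_object(Bucket=BUCKET,
                          Key=f"obj-{threading.get_ident()}-{i}",
                          Body=PAYLOAD)
            i += 1
            time.sleep(max(0.0, interval - (time.monotonic() - start)))

    for _ in range(WORKERS):
        threading.Thread(target=worker,
                         args=(REQUESTS_PER_SECOND / WORKERS,),
                         daemon=True).start()

    time.sleep(60)  # let it run for a minute, then exit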

With a better look into the insides of the system, we could finally understand what it was doing. In parallel with the performance tests, we would also run our own testing scripts. We wanted to make sure everything worked correctly from a customer perspective.
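I won’t reproduce our actual scripts here, but the spirit of that customer-perspective check was something like this minimal sketch, again assuming Python with boto3 and a made-up endpoint and bucket: write an object, read it back, and verify the bytes survive the round trip.

    import os

    import boto3

    s3 = boto3.client("s3", endpoint_url="https://s3.example.cloud")  # made-up endpoint
    bucket, key = "sanity-check-bucket", "sanity-check-object"        # made-up names
    original = os.urandom(1024 * 1024)  # 1 MiB of random data

    # Upload, download, and compare.
    s3.put_object(Bucket=bucket, Key=key, Body=original)
    returned = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    assert returned == original, "object did not round-trip correctly"

    s3.delete_object(Bucket=bucket, Key=key)
    print("round-trip OK")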

As we raised the throughput and the requests per second, everything suddenly stopped. We checked metrics, logs, the network: nothing seemed wrong. We hailed the network team. It turned out we had triggered the DDoS prevention tool. The automation had ordered all traffic to the cluster’s public IP to be dropped.

We had a good laugh, almost celebrating that we had triggered the DDoS protection measures. The network team adjusted some settings and we pushed traffic again. We retriggered it several times, but now we knew how to read the signs.

We closely monitored a few things: network bottlenecks and saturation, CPU saturation and latency, disk latency and memory pressure. But there was one component we paid extra attention to. Because we only advertised HTTPS, we needed to make sure the load balancer had the performance capacity required to handle encryption correctly.

It had been a criterion when we chose the CPU type of the load balancers: our case required hardware-accelerated encryption and decryption. But having the right components in the CPU doesn’t mean the operating system and the software can use them. We had to test and make sure we had applied the right settings, the right knobs turned on in the right places, with the software set up correctly. We found that the CPUs were spending time waiting for input/output. They were waiting for data. So we had to dig even further and assign the network cards to the right CPUs and cores, depending on how they were physically cabled on the motherboard.
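To give an idea, checking those two things on Linux looks roughly like the sketch below. This is not our actual tooling, and the interface names are just examples: it checks whether the CPU advertises the AES instructions, and which NUMA node each network card is attached to, so its interrupts can then be pinned to cores on the same node.

    from pathlib import Path

    def cpu_has_aes() -> bool:
        # /proc/cpuinfo lists the instruction-set flags; "aes" means AES-NI is exposed.
        flags = Path("/proc/cpuinfo").read_text()
        return any(" aes " in line + " "
                   for line in flags.splitlines() if line.startswith("flags"))

    def nic_numa_node(interface: str) -> int:
        # The NUMA node the card is physically attached to (-1 if the platform
        # doesn't report one). Interrupt pinning then targets cores on that node,
        # typically by writing masks under /proc/irq/<N>/.
        return int(Path(f"/sys/class/net/{interface}/device/numa_node")
                   .read_text().strip())

    if __name__ == "__main__":
        print("AES-NI available:", cpu_has_aes())
        for iface in ("eth0", "eth1"):  # example interface names
            try:
                print(iface, "-> NUMA node", nic_numa_node(iface))
            except FileNotFoundError:
                print(iface, ": no such interface or no PCI info")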

It was a lot of work. We acquired a lot of knowledge. We had a lot of fun.

In addition to performance, we also tested reliability and safety measures. What would happen if we erased some data? Or corrupted some? If we killed or suspended parts of the software? If we rebooted a few servers? Disconnected network cables?
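In spirit, one of the gentler of those tests looked something like this sketch, assuming the services run as ordinary Linux processes; the process name and the timing are made up.

    import os
    import random
    import signal
    import subprocess
    import time

    SERVICE = "some-openio-service"   # made-up process name
    SUSPEND_SECONDS = 30

    def pids_of(name: str) -> list[int]:
        # Find matching processes with pgrep.
        out = subprocess.run(["pgrep", "-f", name], capture_output=True, text=True)
        return [int(p) for p in out.stdout.split()]

    def suspend_one(name: str) -> None:
        # Freeze one random instance, watch whether the cluster routes around it,
        # then let it resume and rejoin.
        pids = pids_of(name)
        if not pids:
            return
        victim = random.choice(pids)
        os.kill(victim, signal.SIGSTOP)
        time.sleep(SUSPEND_SECONDS)
        os.kill(victim, signal.SIGCONT)

    if __name__ == "__main__":
        suspend_one(SERVICE)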

These tests helped us build a better mental map of the system. It helped us create the right alerts. It let us verify that the cluster responded correctly to issues.

Time stretched a bit. DC5, where we were supposed to rack the hardware for production and launch, needed more time than expected to metamorphose into a datacenter. So to buy us some time and get customers testing, we decided to run a student hackathon.

(1) Screenshot from Loic

If you have missed it, you can read the previous episode here

To pair with:

  • Stop - Reef
  • The image of her (Les belles Images) by Simone de Beauvoir

Vincent Auclair

Connect with me on your favorite network!

