Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
Splitting Light: Season 2 - Episode 13
Pouring the foundation slab
If you are no longer interested in the newsletter, please unsubscribe​
We had the green light to continue. Now it was time to use those plans and experiments to build a product. We started pouring the foundation slabs. We wrote SaltStack deployment code, reusing the same tool we had used for provisioning Carbon14 in the bunker.
The casts room, where the storage team 💾 worked a lot (1)
We went all in with diskless boots. The advantage of being able to reboot straight into a known system outweighed the costs. With just a few tweaks to our existing tooling, the setup worked reliably. The goal was to make sure the code running on the machines was always consistent. If we installed the system locally, issues would inevitably creep in. Inconsistent libraries. Files growing without rotation. It was much easier for us to have an image and simply reboot into it if something went wrong.
I remember writing the code to set up the storage disks. It had to be stateless. We did not want a special initialisation code path. The code checked whether a disk was already configured; if it wasn't, it formatted it. All automatically. We had to make sure it worked flawlessly: if it didn't, we risked data loss. The first prototype was built against virtual machines on our laptops.
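To give a feel for that logic, here is a minimal Python sketch of the idempotent check, not our actual code; the filesystem label, the device path and the ext4 choice are illustrative assumptions on my part:

```python
# Minimal sketch of the idea. The "scw-object-store" label, the use of blkid
# and the ext4 filesystem are illustrative assumptions, not the real stack.
import subprocess
from typing import Optional

STORAGE_LABEL = "scw-object-store"  # hypothetical marker meaning "already configured"


def filesystem_label(device: str) -> Optional[str]:
    """Return the filesystem label of `device`, or None if it has none."""
    result = subprocess.run(
        ["blkid", "-o", "value", "-s", "LABEL", device],
        capture_output=True,
        text=True,
    )
    label = result.stdout.strip()
    return label or None


def ensure_storage_disk(device: str) -> None:
    """Idempotent setup: only format a disk that has never been configured."""
    if filesystem_label(device) == STORAGE_LABEL:
        return  # already configured, leave the data alone
    # Never seen before: format it. Getting this check wrong means data loss,
    # which is why it had to work flawlessly.
    subprocess.run(["mkfs.ext4", "-L", STORAGE_LABEL, device], check=True)


if __name__ == "__main__":
    ensure_storage_disk("/dev/sdb")  # placeholder device path
```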
Details of the casts room (1)
Another part I remember well is introducing a rate limit on the load balancer. Because we were building a publicly available product, where all clients shared the same endpoint and we could not dynamically scale up or down, we needed to make sure a single customer could not overwhelm the system.
At first we looked at limiting the bandwidth per customer. But limiting at that level would hinder what customers needed S3 for. We settled on rate limiting the number of requests per bucket. We didn’t need any external database or mapping: we got the bucket from the URI and counted the number of requests per second. Remember, we didn’t need this to be 100% accurate. We only needed it as an extra failsafe.
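As a rough illustration of that failsafe (the limit value and the path-style bucket extraction are assumptions, not the production design), the core of such a counter fits in a few lines of Python:

```python
# Approximate per-bucket rate limiting, sketched in Python. The 1000 req/s
# limit and the path-style bucket parsing are illustrative assumptions.
import time
from collections import defaultdict
from urllib.parse import urlparse

MAX_REQUESTS_PER_SECOND = 1000  # hypothetical per-bucket ceiling

# (bucket, unix second) -> request count; a fixed one-second window is only
# approximate, which is fine for an extra failsafe.
_counters = defaultdict(int)


def bucket_from_uri(uri: str) -> str:
    """With path-style requests, the bucket is the first path segment."""
    path = urlparse(uri).path
    return path.lstrip("/").split("/", 1)[0]


def allow_request(uri: str) -> bool:
    """Return False when a bucket exceeds its per-second request budget."""
    key = (bucket_from_uri(uri), int(time.time()))
    _counters[key] += 1
    return _counters[key] <= MAX_REQUESTS_PER_SECOND


# Usage: the load balancer would call allow_request() for every incoming
# request and reject with an error when it returns False.
```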
We realized that in addition to the load balancer and the storage servers, we needed quite a few internal services. We started requesting some from Scaleway’s common infrastructure.
Yellow, orange, then red: the required order.
By early May 2018, we had a robust, fully automatic deployment. Starting first on virtual machines, then moving to physical machines. Slowly, step by step, we emulated the production environment more and more, using Scaleway’s available customer machines. First we ran everything on the same machine. Then we introduced multiple virtual machines, then multiple physical machines, then we added the right network setup…
The goal was to “slowly” increase our knowledge and automation. By doing this, the bug surface area at each step was constrained. We would acquire the necessary knowledge and skills at each step. We would eliminate bugs at each step. By doing the R&D this way, bugs would not stay hidden behind knowledge we did not yet possess.
All through that process, we also had to learn how to use S3. In 2018, it was not as widely known as it is now. How to plug in each CLI, what the requirements were to make them compatible… We found out the hard way that many tools failed silently. Most importantly, many of them retried failed requests silently.
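As an illustration of the kind of client-side behaviour we had to pin down (the endpoint and credentials below are placeholders, and boto3 is just one of the many S3 tools around), retries can at least be made explicit instead of silent:

```python
# Illustrative boto3 configuration; endpoint URL and credentials are placeholders.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.com",      # an S3-compatible endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    # One attempt only: a failed request surfaces immediately instead of
    # being retried silently behind our back while we test the platform.
    config=Config(retries={"max_attempts": 1, "mode": "standard"}),
)

print(s3.list_buckets()["Buckets"])
```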
Until then, hardcoded credentials had been enough. Now we needed to plug into Scaleway’s information system to get an access key.
(1) Photos by the launch event photographer - if you know who it was, please tell me
If you missed it, you can read the previous episode here.