Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
Splitting Light: Season 2 - Episode 13
Pouring the foundation slab
If you are no longer interested in the newsletter, please unsubscribe​
We had the green light to continue. Now it was time to use those plans and experiments to build a product. We started pouring the foundation slabs. We wrote SaltStack deployment code, reusing the same tool we had used for provisioning Carbon14 in the bunker.
The casts room, where the storage team 💾 worked a lot (1)
We went all in with diskless boots. The advantage of being able to reboot straight into a known system outweighed the costs. With just a few tweaks to our existing tooling, the setup worked reliably. The goal was to make sure the code running on the machines was always consistent. If we installed the system locally, issues would inevitably creep in. Inconsistent libraries. Files growing without rotation. It was much easier for us to have an image and simply reboot into it if something went wrong.
I remember writing the code to set up the storage disks. It had to be stateless. We did not want a special initialisation code path. The code checked whether a disk was already configured; if it wasn't, it formatted it. All automatically. We had to make sure it worked flawlessly: if it didn't, we risked data loss. The first prototype was built against virtual machines on our laptops.
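To give a feel for that logic, here is a minimal Python sketch of the idempotent check, not our actual code; the filesystem label, the device path and the ext4 choice are illustrative assumptions on my part:

```python
# Minimal sketch of the idea. The "scw-object-store" label, the use of blkid
# and the ext4 filesystem are illustrative assumptions, not the real stack.
import subprocess
from typing import Optional

STORAGE_LABEL = "scw-object-store"  # hypothetical marker meaning "already configured"


def filesystem_label(device: str) -> Optional[str]:
    """Return the filesystem label of `device`, or None if it has none."""
    result = subprocess.run(
        ["blkid", "-o", "value", "-s", "LABEL", device],
        capture_output=True,
        text=True,
    )
    label = result.stdout.strip()
    return label or None


def ensure_storage_disk(device: str) -> None:
    """Idempotent setup: only format a disk that has never been configured."""
    if filesystem_label(device) == STORAGE_LABEL:
        return  # already configured, leave the data alone
    # Never seen before: format it. Getting this check wrong means data loss,
    # which is why it had to work flawlessly.
    subprocess.run(["mkfs.ext4", "-L", STORAGE_LABEL, device], check=True)


if __name__ == "__main__":
    ensure_storage_disk("/dev/sdb")  # placeholder device path
```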
Details of the casts room (1)
Another part I remember well is introducing a rate limit on the load balancer. Because we were building a publicly available product, where all clients shared the same endpoint and we could not dynamically scale up or down, we needed to make sure a single customer could not overwhelm the system.
At first we looked at limiting the bandwidth per customer. But limiting at that level would hinder what customers needed S3 for. We settled on rate limiting the number of requests per bucket. We didn’t need any external database or mapping: we got the bucket from the URI and counted the number of requests per second. Remember, we didn’t need this to be 100% accurate. We only needed it as an extra failsafe.
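As a rough illustration of that failsafe (the limit value and the path-style bucket extraction are assumptions, not the production design), the core of such a counter fits in a few lines of Python:

```python
# Approximate per-bucket rate limiting, sketched in Python. The 1000 req/s
# limit and the path-style bucket parsing are illustrative assumptions.
import time
from collections import defaultdict
from urllib.parse import urlparse

MAX_REQUESTS_PER_SECOND = 1000  # hypothetical per-bucket ceiling

# (bucket, unix second) -> request count; a fixed one-second window is only
# approximate, which is fine for an extra failsafe.
_counters = defaultdict(int)


def bucket_from_uri(uri: str) -> str:
    """With path-style requests, the bucket is the first path segment."""
    path = urlparse(uri).path
    return path.lstrip("/").split("/", 1)[0]


def allow_request(uri: str) -> bool:
    """Return False when a bucket exceeds its per-second request budget."""
    key = (bucket_from_uri(uri), int(time.time()))
    _counters[key] += 1
    return _counters[key] <= MAX_REQUESTS_PER_SECOND


# Usage: the load balancer would call allow_request() for every incoming
# request and reject with an error when it returns False.
```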
We realized that in addition to the load balancer and the storage servers, we needed quite a few internal services. We started requesting some from Scaleway’s common infrastructure.
Yellow, orange, then red: the required order.
By early May 2018, we had a robust, fully automatic deployment. Starting first on virtual machines, then moving to physical machines. Slowly, step by step, we emulated the production environment more and more, using Scaleway’s available customer machines. First we ran everything on the same machine. Then we introduced multiple virtual machines, then multiple physical machines, then we added the right network setup…
The goal was to “slowly” increase our knowledge and automation. By doing this, the bug surface area at each step was constrained. We would acquire the necessary knowledge and skills at each step. We would eliminate bugs at each step. By doing the R&D this way, bugs would not stay hidden behind knowledge we did not yet possess.
All through that process, we also had to learn how to use S3. In 2018, it was not as widely known as it is now. How to plug in each CLI, what the requirements were to make them compatible… We found out the hard way that many tools failed silently. Most importantly, many of them retried failed requests silently.
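As an illustration of the kind of client-side behaviour we had to pin down (the endpoint and credentials below are placeholders, and boto3 is just one of the many S3 tools around), retries can at least be made explicit instead of silent:

```python
# Illustrative boto3 configuration; endpoint URL and credentials are placeholders.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.com",      # an S3-compatible endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    # One attempt only: a failed request surfaces immediately instead of
    # being retried silently behind our back while we test the platform.
    config=Config(retries={"max_attempts": 1, "mode": "standard"}),
)

print(s3.list_buckets()["Buckets"])
```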
Until then, hardcoded credentials had been enough. Now we needed to plug into Scaleway’s information system to get an access key.
(1) Photos by the launch event photographer - if you know who it was, please tell me
If you missed it, you can read the previous episode here.