Splitting Light: Season 1 - Episode 28



Juggling code


Three years into my tenure in the lab, we had launched many products, from the revolutionary C1, where we had 900 nodes per rack, to the cold storage product. We had two compute devices in production, one network device and one storage device. That did not count one project that had been terminated or the SCADA devices, which I did not work on. There were also two upcoming devices: the C3 and another network device with which we would punch above the 40 Gb link speed.

For both the C2 and the C3, we had multiple configurations of the individual nodes and two customers. The first was the cloud team, which provided resources billed by the hour and had been formed when I was recruited. The second was the “dedicated” team, which rented out hardware billed by the month and was the oldest team in the company. The two teams worked independently and, most importantly, worked differently. Between them sat the network team, which handled the network equipment.

After we handed over a product, we had to keep providing support and fixing the bugs that the teams could not handle. This took time away from us. Yet we were still building more products, each more complex than the one before. We reached for more compute out of every watt and more compute out of every cm³.

It was a tough job. Jumping from product to product, team to team, handling Carbon14, learning new concepts and soldering boards… I had to juggle all this and learn to prioritise even better than I had before. It was a never-ending push forward.

This was possible because of how the work was organized. The sequenced dance of steps to design, test, code, manufacture and add features was complex but well mastered. The code bases were structured identically and files were named identically across devices, which made it easier to switch in and out of each device.

Another component was the gradual ramp-up in skill and knowledge. Debugging with an oscilloscope was hard and complicated to set up the first time, but after five or six times it was “just” a procedure I would apply. The same went for patching the Linux kernel or interacting with the ASIC’s memory.

Even with all this structure and the growing ease that came with skill, it was still hard. I liken this to a mental burden. Each part had its own constraints and specificities. Each time I had to jump into something, I had to load those parts into my brain and focus back on the details, on the code, on the wires in the PCB, because those were the important details. Those had to be loaded and unloaded again and again, each time costing concentration and mental energy. I did not fully understand the mechanism until a few years later, but I grasped it intuitively at the time. I would group similar tasks together to make the switching easier.

Of course, no matter how well you organize things, when there are issues in production, they take priority. Over time, the operators could use more and more diagnostic features and tools, but they needed help to understand how and when to use them. To do that, documentation was needed, and how it was written mattered.


To pair with:

  • Reborn ice horn - 1991
  • The Mythical Man-Month: Essays on Software Engineering by Frederick P. Brooks Jr.

Vincent Auclair

