Splitting Light: Season 2 - Episode 09


Redundancy is key

Around February 2018, having validated the individual hardware pieces as we moved forward, we had to bring everything together. This is where my experience in the lab had a lot of impact, coupled with the experience we had gathered maintaining the existing storage products.

What most software engineers fail to realize when working with hardware is that time flows differently. Let me explain. When you have a software bug, it can take time to understand it, but there are techniques to either disable the feature that caused the bug or roll back to the previous version. Pushing a fix rarely takes more than a few hours in most systems.

When dealing with hardware, if a component fails, you know right away that this particular component failed. However, fixing the issue will most likely take at least a day. If it’s a part you have spares for, you can submit a job to a datacenter technician to have it swapped. Sometimes replacing the part requires total downtime. Sometimes it can take weeks just to have the part shipped.

With software, you design for post-incident recovery speed. With hardware, you design for post-incident reliability: you want your systems to keep working reliably after a failure, for as long as you need and can afford. The magic is knowing which parts are critical and which are less so.

To accomplish this, you draw the data flows over your design and ask yourself: if this piece fails, what happens? Does it cut off the customer? If yes, how can I prevent that? Everything has to be thought of beforehand. You can’t suddenly decide on a whim that you want to double attach everything across multiple racks. You have to take into account the failures that you know can happen. Part of that knowledge comes from experience, part of it from the stories and tech lore you find on the internet, such as the 500-mile email.
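The "if this piece fails, what happens?" exercise can be sketched in a few lines of code. This is only an illustration, not the tooling we used: the topology, the component names (`router`, `switch-a`, `switch-b`, `storage`) and the helper functions are all hypothetical. The idea is to remove each component in turn and check whether the customer can still reach the storage.

```python
from collections import deque


def undirected(pairs):
    """Build an adjacency map from a list of (a, b) links."""
    links = {}
    for a, b in pairs:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


def reachable(links, source, target, failed=None):
    """Breadth-first search from source to target, skipping the failed component."""
    if failed in (source, target):
        return False
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbour in links.get(node, ()):
            if neighbour != failed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(links, source, target):
    """List every component whose failure alone cuts source off from target."""
    components = sorted(set(links) - {source, target})
    return [c for c in components
            if not reachable(links, source, target, failed=c)]


# Hypothetical design: the storage server is double attached to two
# switches, but both switches hang off a single router.
one_router = undirected([
    ("customer", "router"),
    ("router", "switch-a"),
    ("router", "switch-b"),
    ("switch-a", "storage"),
    ("switch-b", "storage"),
])

# Same design with a second router, so every hop has an alternative.
two_routers = undirected([
    ("customer", "router-a"),
    ("customer", "router-b"),
    ("router-a", "switch-a"),
    ("router-b", "switch-b"),
    ("switch-a", "storage"),
    ("switch-b", "storage"),
])

print(single_points_of_failure(one_router, "customer", "storage"))
print(single_points_of_failure(two_routers, "customer", "storage"))
```

In the first design, the router is the only component whose loss cuts off the customer, even though the server itself is double attached. A real review also has to cover what a graph like this leaves out: power feeds, racks, and failures that take out several components at once.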

While going over that process, you also try to use techniques and patterns that are known in the industry to be rock solid, or at least that don’t rely on software and people being too smart. We ended up double attaching every server and piece of network equipment, except for the BMCs, which were on a separate and distinct network. I didn’t trust BMCs on a shared network. We used well-known features: LACP bonding and ECMP routing.

Most importantly at the time, we refused to chase the single-hardware story popularized by the big players. Instead, we had distinct hardware for each functional part of the product. We did not have the manpower, the budget or the time to go down the single-hardware road. Hacker News was abuzz with what the big players did; I didn’t care about them. I cared about what worked for us in our context.

By then, we had clocked almost five months since the pivot, and I was about to meet someone dear to me.


(1) Photos available here by Emmanuel Caillé

To pair with:

  • Proust ist mein Leben (Proust est ma vie, et ça m'ennuie profondément) - Christian Rottler, Catherine Rogister, Rüde Hagelstein remix
  • Failure is Not an Option: Mission Control From Mercury to Apollo 13 and Beyond by Gene Kranz

Vincent Auclair
