Splitting Light: Season 2 - Episode 09


Splitting light

Season 2 Episode 09

Redundancy is key

If you are no longer interested in the newsletter, please unsubscribe

Around February 2018, as we moved forward and validated individual hardware pieces, we now had to bring everything together. This is where my experience in the lab had a lot of impact, coupled with the experience we had gathered maintaining the existing storage products.

What most software engineers fail to realize when working with hardware is that the time flows differently. Let me explain. When you have a software bug, it can take time to understand the bug but there are techniques to either disable the feature that caused the bug or rollback to the previous version. Pushing a fix in most systems rarely takes more than a few hours.

When dealing with hardware, if a component fails you know right away that this particular component failed. However, fixing the issue will most likely take at least a day. If it’s a piece you have spares for, you can submit a job to a datacenter technician to have the part swapped. Sometimes to replace the part a total downtime is required. Sometimes it can take weeks to even have the part shipped.

With software you design for post-incident recovery speed. With hardware you design for post-incident reliability. Meaning you want to design your systems to still work reliably after a failure and for it to work for as long as you need and can afford. The magic is knowing which part is critical and which one is less.

To accomplish this you draw the data flows over your design and ask yourself : If this piece fails, what happens? Does it cut the customer? If yes, how can I prevent that? Everything has to be thought of beforehand. You can’t suddenly decide you want to double attach everything in multiple racks on a whim. You have to take into account the failure that you know can happen. Part of it comes from experience, part of it from stories, tech lore, you find on the internet. Stores such as the 500 miles email and others.

When you are going over that process, you also try to use techniques and patterns that are known in the industry to be rock solid or at least that don’t rely on software and people to be too smart. We ended up double attaching every server and network equipment, except for the BMC which were on a separate and distinct network. I didn’t trust the shared network BMC. We use well known features, LACP bonding and ECMP routing.

Most importantly at the time, we refused to chase the single hardware story that was popularized by the big ones. We had a distinct hardware per functional part in the product. We did not have the manpower, the budget or the time to go down that road. Hacker news was all buzzed up about what the big ones did, I didn’t care about them. I cared about what worked for us in our context.

By then, we had almost clocked in five months after the pivot, I was about to meet someone dear to me.


(1) Photos available here by Emmanuel Caillé

If you have missed it, you can read the previous episode here

To pair with :

  • Proust ist mein Leben (Proust est ma vie, et ça m'ennuie profondément) - Christian Rottler, Catherine Rogister, Rüde Hagelstein remix
  • Failure is Not an Option: Mission Control From Mercury to Apollo 13 and Beyond by Gene Kranz

Vincent Auclair

Connect with me on your favorite network!

Oud metha, Dubai, Dubai 00000
Unsubscribe · Preferences

Symbol Sled

Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.

Read more from Symbol Sled

Splitting light Season 2 Episode 10 Finding someone If you are no longer interested in the newsletter, please unsubscribe It’s hard to explain everything that led to meeting this person. From my side you could sum it up to: I let go. I let myself feel the tide instead of trying to control it. As it rose and fell back, I would meet people. I was more comfortable with myself. I enjoyed myself more. This was the change that, for me, made this happen and work. I met Djazia in February 2018....

Splitting light Season 2 Episode 08 Compiling knowledge If you are no longer interested in the newsletter, please unsubscribe To be able to use OpenIO and offer it as a public facing product we had to amass quite a large amount of knowledge. We had to understand how it worked in detail. We had to understand the hardware requirements as well as how we wanted to make it filled and cabled. We had to understand how Scaleway’s information system worked and how we would connect to it. Skunk Works:...

Splitting light Season 2 Episode 07 Future growth If you are no longer interested in the newsletter, please unsubscribe To bring object storage the fastest and safest way would be to use existing software. At the time we reevaluated the solutions that had been selected beforehand. There were three of them. Ceph, an open source industry standard, OpenIO a provider of an open source object storage, and Scality a provider of a closed source product. There were multiple criteria to take into...