Splitting Light: Season 2 - Episode 03


Remote hands

Before we could do anything more than look at what object and block storage were, we had to ensure the continuity of the existing products. That meant getting the people who used to take care of them to hand over everything they knew, and everything we needed, to keep running them.

One thing helped us a great deal: both Folays and Loic had worked on some of the products, so we could learn how they worked more easily. I had worked on the carbon14 product, and Théo on its interface. Florent also had some experience with the existing object storage product.

Having knowledge inside the team made it easier but it still wasn’t easy. We had to find where and how they were monitored, what the alerts meant, and how to fix issues that were raised by the support team. These were live products with live customers.

After documenting internally what we could, we had one last problem, the one everyone who has worked with hardware in a distant datacenter runs into: replacing parts. Hardware fails. It’s inevitable. Parts fail at different rates, but eventually something fails. For us, the most frequent failure was hard drives. We had what was, at least for me at the time, quite a large fleet of machines, each equipped with a number of disks. It’s been a long time so the number is imprecise, but it was probably over a thousand drives. Many of them were several years old, and they failed from time to time. Each machine could tolerate two, maybe three dead disks before data loss.

Having them replaced was, how to say, painful at first. We had to find out how the drives were configured, what types of drives they were, and how to tell a datacenter technician which drives to swap… Many things had to be learned on live products.

Blinking the disks worked sometimes, but most of the time it did not. Getting the defective disk’s serial number worked until the disk was dead, which was exactly when we needed it… Using a slightly different disk model worked or not depending on the RAID card and hardware model… Sometimes changing the faulty drive triggered a failure in another drive while the data was being rebuilt. Sometimes this happened several times in a very short span.
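
One way around the dead-disk serial number problem, sketched here as an assumption and not as what we actually ran, is to record which serial sits in which slot while every disk still answers, then compare that snapshot with what the machine reports later. The serials, slot labels, and the `missing_disks` helper below are all hypothetical.

```python
from typing import Dict, List


def missing_disks(snapshot: Dict[str, str], visible_serials: List[str]) -> Dict[str, str]:
    """Return {serial: slot} for disks in the snapshot that no longer show up."""
    visible = set(visible_serials)
    return {serial: slot for serial, slot in snapshot.items() if serial not in visible}


# Hypothetical inventory, recorded while all disks were still healthy.
snapshot = {
    "SN-A001": "enclosure 0, slot 0",
    "SN-A002": "enclosure 0, slot 1",
    "SN-A003": "enclosure 0, slot 2",
}

# Hypothetical list of serials the machine still reports today.
visible_now = ["SN-A001", "SN-A003"]

# The disk that disappeared, with the slot to hand to the technician.
to_replace = missing_disks(snapshot, visible_now)
print(to_replace)
```

The snapshot only helps if it is refreshed after every swap, which is one more step for the handbook.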

It is safe to say that this vaccinated me against ever using hardware RAID again. Combined with my prior experience in building hardware, it cemented my understanding that any piece of hardware has to be sufficiently redundant to hold until an intervention can be scheduled.

We all eventually learned a lot and documented it. We had handbook procedures with to-do lists: what steps, how, which interface, which tool… Each time something new happened, it was supposed to be written down.

Eventually, we managed to take care of things in a timely manner. While this was painful, we learned many lessons, and team members who had never worked with hardware gained experience. It gave us a better understanding of the products, which would lead us to make some hard decisions…

To pair with:

  • Acid Tracks - Phuture
  • The Caves of Steel by Isaac Asimov

Vincent Auclair

Symbol Sled

Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
