Splitting Light: Season 1 - Episode 22



My very first storage product


As I was qualifying the storage hardware, one person from the cloud team was assigned to work on the product part. Just as I was about to hand it over, he decided that he wanted to spend more time with his family, who lived halfway across the globe, so he resigned. As we were between cycles, the C2 manufacturing order had been sent and the C3 wasn’t “physical” yet, so I took over the product and started working on it.

At the time, there was a cold storage product in AWS called Glacier: not the S3 storage class, but a distinct product that I think has since been sunsetted by AWS. Because of the hardware’s specificity, the only sensible product to design was a similar one. I studied how it operated and started designing different aspects of how ours would work. I experimented with several technologies to get more abstraction, but each piece of software I tried wasn’t up to the task. Ceph could not be used because of hardware constraints. Network block device (NBD) didn’t seem ready for production after I triggered multiple weird bugs. ATA over Ethernet (ATAoE) wasn’t flexible enough for us. I was stumped.

I took a step back and put things into perspective. I had hardware with very high density in GB per cm³ per watt. The goal was to archive data for customers. Access latency did not need to be very short, but durability was paramount. The constraints were that not all the drives could be powered up at once and that random access on them was suboptimal.

After a few rounds of thinking and brainstorming, I eventually decided that I would not be afraid of handling data myself. I put on a brave mask and started designing the data storage system in its entirety. I separated the product into three isolated parts:

  1. The hardware itself and the small software to manage the drives;
  2. The data pipeline, where I would process the data stream and control the integrity of the data;
  3. The data intake, which customers would interact with to send the data.

The first part was done at the handover. I dived into the second one. The idea was that customers could send a bunch of data and “seal” it into an archive. We did not care about what the data was, and we set a maximum size of 10 terabytes. Handling that amount of data meant that we could not process it in situ; we had to stream it. To do that, I built a glorified shell pipe.

Let me explain for the people who didn’t follow that last sentence. Data would arrive through the network in a single connection and be processed by a Python script. The script would count the number of bytes and compute a very fast checksum on the data; I needed integrity checksums, not tamper-proof checksums. Then the data would be sliced and I would apply Reed-Solomon encoding to it using a library that was fast. Reed-Solomon encoding is a mathematical method of adding some extra data so that if you lose part of it you are still able to recover the whole. It’s used on space probes and in many things you use in your day-to-day life.
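To make the shape of that pipe concrete, here is a minimal sketch under assumptions of my own: the original code isn’t shown here, the fast checksum is a plain CRC32, the chunk size is arbitrary, and the reedsolo library stands in for the unnamed, faster Reed-Solomon implementation.

```python
import sys
import zlib

from reedsolo import RSCodec  # stand-in for the unnamed, faster library

CHUNK_SIZE = 1024 * 1024      # 1 MiB slices; arbitrary for this sketch
rs = RSCodec(32)              # 32 parity bytes per 255-byte block (assumed ratio)

def archive_stream(stream_in, shard_out):
    """Stream the incoming data, count bytes, checksum it, and RS-encode each slice."""
    total_bytes = 0
    crc = 0
    while True:
        chunk = stream_in.read(CHUNK_SIZE)
        if not chunk:
            break
        total_bytes += len(chunk)
        crc = zlib.crc32(chunk, crc)       # fast integrity check, not tamper-proof
        shard_out.write(rs.encode(chunk))  # parity added so partial loss stays recoverable
    return total_bytes, crc

if __name__ == "__main__":
    size, crc = archive_stream(sys.stdin.buffer, sys.stdout.buffer)
    print(f"archived {size} bytes, crc32={crc:08x}", file=sys.stderr)
```

Used like a shell pipe, something along the lines of `cat archive.tar | python pipeline.py > encoded.bin`, which is more or less what “glorified shell pipe” means here.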

After prototyping everything, I built the real pipeline with the associated metadata database. Then came the testing phase: performance, reliability and integrity. Performance was important because moving terabytes of data takes time, and if you process it slowly it takes even more time. Reliability was important because if the system was down, customers could not access their data. Integrity was the most important of the three: whatever happened to the storage media, we had to give the data back in its original form.

I tested the code on data archives from Wikipedia, on home movies, on random data and even on empty data. I checked performance, then I started to corrupt the data bit by bit, then byte by byte, continuing until I had corrupted it past the limit of our system in order to check the failure-mode code. I killed some of the processes during an archival or a retrieval. I cut the data stream in the middle or suspended it. Every edge case had to be ironed out. Each time, I compared the data that had gone in with the data that came out using checksums from multiple algorithms.
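As a flavour of what such a corruption round-trip can look like, here is a hedged sketch built on the same assumed reedsolo stand-in: flip a growing number of random bits in the encoded data, decode, and compare checksums from several algorithms before and after. The names and parameters are mine, not the original test suite’s.

```python
import hashlib
import os
import random

from reedsolo import RSCodec, ReedSolomonError

rs = RSCodec(32)  # must match the encoding side; the 32-byte parity is an assumption

def checksums(data: bytes) -> dict:
    """Hash the same data with several algorithms, as in the before/after comparisons."""
    return {name: hashlib.new(name, data).hexdigest() for name in ("md5", "sha256")}

def flip_bits(data: bytes, n_bits: int, seed: int = 0) -> bytes:
    """Simulate bit rot by flipping n random bits."""
    rng = random.Random(seed)
    buf = bytearray(data)
    for _ in range(n_bits):
        buf[rng.randrange(len(buf))] ^= 1 << rng.randrange(8)
    return bytes(buf)

def corruption_round_trip(payload: bytes, n_bits: int) -> bool:
    """Encode, corrupt, decode, and check that every checksum still matches."""
    damaged = flip_bits(bytes(rs.encode(payload)), n_bits)
    try:
        recovered = bytes(rs.decode(damaged)[0])  # recent reedsolo versions return a tuple
    except ReedSolomonError:
        return False                              # corruption beyond what the parity can fix
    return checksums(payload) == checksums(recovered)

if __name__ == "__main__":
    data = os.urandom(4096)
    for bits in (1, 8, 64, 512):
        print(bits, corruption_round_trip(data, bits))
```

Pushing the bit count up until the decode gives up is the “to the limit and further” part: the interesting behaviour is not the success path but how the system reports the unrecoverable case.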

Why did I do all these tests? When you are storing data over a long period of time, many things can happen, ranging from a bit flip due to cosmic rays, to a mechanical disk failure, to a silicon failure in a chip, to a software failure. Each of these cases had to be handled as well as we could.

After I was confident that the code could handle multiple failures, I built an additional safety mechanism. All the metadata was stored in a database. If that database died, even though we would still have the data, we would not be able to reconstruct it. At every step, whether it be hardware discovery, data archival or data extraction, I emitted log lines with every detail of the action. With those logs, I was able to rebuild the database. It was similar to a write-ahead log, except that it was more of a write-after log. The logs would be stored elsewhere; it was an additional failsafe mechanism. As with the rest, I tested it by removing lines from the database, removing more and more data until I was sure that everything worked and that the database dumps before and after showed no differences except auto-generated timestamps.
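As an illustration only (the actions and field names below are invented, not the real schema), a write-after log of this kind boils down to appending one structured line per completed action and replaying those lines to rebuild the metadata store:

```python
import json
import time

LOG_PATH = "actions.log"  # hypothetical path; the real log lines were shipped elsewhere

def emit(action: str, **details):
    """Append one structured line after the action has completed (hence write-after)."""
    record = {"ts": time.time(), "action": action, **details}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def rebuild(db: dict) -> dict:
    """Replay the log from the beginning to reconstruct the metadata database."""
    with open(LOG_PATH) as f:
        for line in f:
            record = json.loads(line)
            if record["action"] == "archive_sealed":
                db[record["archive_id"]] = {
                    "size": record["size"],
                    "crc32": record["crc32"],
                    "shards": record["shards"],
                }
            elif record["action"] == "archive_deleted":
                db.pop(record["archive_id"], None)
    return db

# e.g. emit("archive_sealed", archive_id="a-42", size=123, crc32="deadbeef", shards=["d1", "d2"])
# and later rebuild({}) gives back the same records, minus auto-generated timestamps.
```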

While I was doing all of this, two people from the dedicated team were busy building the customer-facing part, which was soon to be launched.

To pair with:

  • Kuj Yato - Clap! Clap!
  • Heartsnatcher (L'Arrache-coeur) by Boris Vian

If you have missed it, you can read the previous episode here


Vincent Auclair

