Did you think that last episode was anticlimactic? You are right to think so. At the time of the launch, I was not thinking of it as a product launch. I was “just” doing my work. I had been given a problem, so I found a solution. I did not realize that I was doing much more than my job title implied. That realization came almost a decade later, while writing this very content. I had been involved in previous launches: I had been part of the C1 & C2 launch and we had successfully put the custom routers in production, but on those I was only a contributor, not the coordinator. This was my first product launch. I did not do everything myself, but I was involved in every component. After the launch I followed up with customers and fixed the few issues that surfaced.
The first big one came from how I had designed part of the API on the hardware. I did not have enough hardware to trigger the bug during my tests, because it only appeared under moderate load, when multiple competing workers were archiving data. It happened because I had isolated the components too much. I had two kinds of actions: one to switch to the appropriate disk, and a second set to read/write/erase data. That meant there was a potential race condition: a worker could switch the disk while another worker was reading from or writing to it. I hastily changed that mechanism to make the operation atomic. You now specified the target disk in the request itself, and the API guaranteed the data was written to the correct disk.
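To make the race concrete, here is a minimal sketch, not the actual firmware API: the names `switch_disk`, `write_block` and the classes are hypothetical, and the real hardware interface certainly looked different. The point is only the shape of the fix: the disk identifier moves into the write request, so another worker can no longer change the active disk between the two calls.

```python
import threading

class NaiveArchiveAPI:
    """Two-step API: switch the active disk, then write.
    Two workers interleaving these calls can write to the wrong disk."""

    def __init__(self):
        self._active_disk = None
        self._disks = {}  # disk_id -> list of blocks

    def switch_disk(self, disk_id):
        self._active_disk = disk_id
        self._disks.setdefault(disk_id, [])

    def write_block(self, data):
        # Race: another worker may have called switch_disk() in between,
        # so this block can land on a disk the caller never intended.
        self._disks[self._active_disk].append(data)


class AtomicArchiveAPI:
    """Single-step API: the target disk travels with the request,
    so the write is guaranteed to land on the intended disk."""

    def __init__(self):
        self._lock = threading.Lock()
        self._disks = {}

    def write_block(self, disk_id, data):
        with self._lock:
            self._disks.setdefault(disk_id, []).append(data)
```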
The second major issue was really a list of issues that could all be traced back to Docker. We had used it for the client data reception component, and it was difficult to run in production. It was, and still is, a great tool to develop, test and reproduce code, but in my view it is not a good tool for handling production loads. Between the magic it does with firewall rules and all its other non-standard behaviour, this is where I spent most of my time debugging production issues. One particular Docker bug, still open, triggered many cascading issues. After this, I could no longer trust Docker as a production-ready tool.
These were technical issues, engineering problems. They could be fixed, or compensated for with extra checks. But there were other issues that we could not fix. Not easily, at least.
When I had designed the pipeline, I had focused on finding a solution from a technical point of view, not from a user point of view. I had made sure that data ingestion was fast, because faster ingestion meant fewer machines to store the buffered data. But I had missed a fundamental element. When customers used this product, they did not care how fast the data was ingested. They deeply cared about getting it back as fast as possible when they needed it. It was a disaster recovery product: when the archived data was needed, it was imperative to get it back quickly. I could tweak the scheduling algorithm a bit, but I could not change the underlying mechanism without rewriting much of the product.
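As an illustration of the kind of tweak that was still possible, here is a minimal sketch, not the actual scheduler: a single priority queue where restore jobs always jump ahead of pending archive jobs. The priority levels and job names are assumptions; the real pipeline's scheduling was more involved, and this only changes ordering, not the underlying ingest-optimised mechanism.

```python
import heapq
import itertools

# Lower number = higher priority. Restores jump ahead of archives,
# but nothing here changes how either job is actually executed.
RESTORE, ARCHIVE = 0, 1

class JobQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # keeps FIFO order within a priority level

    def submit(self, priority, job):
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def next_job(self):
        if not self._heap:
            return None
        _, _, job = heapq.heappop(self._heap)
        return job

queue = JobQueue()
queue.submit(ARCHIVE, "archive bundle 42")
queue.submit(RESTORE, "restore customer dataset")  # served before both archives
queue.submit(ARCHIVE, "archive bundle 43")

while (job := queue.next_job()) is not None:
    print(job)
```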
The second thing was a big missed opportunity. After we had a working product, we looked into getting the ISO 14641-1 (NF-461) certification for electronic archiving. This certification was required for many administrative documents, and it was an emerging market. However, the conditions for certification were not compatible with our product. We stored bundles of data instead of individual files. We used erasure coding instead of multiple copies. We used fast hashing instead of slow cryptographic hashes. The product had not been very expensive in hardware or human time to build precisely because I had made those choices. We could not easily modify the product to fit the standard; we would have had to massively reengineer it. It could have been a very lucrative market, but it was too far away from our technical solution. I knew my solution to the problem was, in some respects, more efficient. But it did not matter: we could not abide by the certification conditions.
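To make the efficiency argument concrete, here is a back-of-the-envelope sketch. The erasure-coding parameters (6 data + 3 parity shards) and the three-copy baseline are assumptions for illustration, not the product's actual settings; they only show why erasure coding is cheaper per useful byte than keeping full copies.

```python
def replication_overhead(copies):
    """Raw bytes stored per useful byte when keeping full copies."""
    return copies

def erasure_coding_overhead(data_shards, parity_shards):
    """Raw bytes stored per useful byte with erasure coding."""
    return (data_shards + parity_shards) / data_shards

# Assumed parameters, for illustration only.
print(replication_overhead(3))        # 3.0 -> 3x raw storage for 3 copies
print(erasure_coding_overhead(6, 3))  # 1.5 -> half the raw storage,
                                      # while still tolerating 3 lost shards
```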
I learned to think from a customer-centric point of view because I ran into these issues. I learned that sometimes technical efficiency or prowess is not enough. Worse, sometimes it can even be detrimental.
Nevertheless, the product was used. The customers seemed happy. The company opened a position to continue the work on the product. We wanted to add more endpoints and more features. I was to fold back into the hardware team to work on the third generation compute hardware.
We are recruiting a storage devops (M/F/P) to work on C14 (a real thing) with Newton with many organic petabytes to scale, please RT