Did you think that last episode was anticlimactic? You are right to think so. At the time of the launch, I was not thinking of it as a product launch. I was “just” doing my work. I had been given a problem, so I found a solution. I did not realize that I was doing much more than my job title implied. That realization came almost a decade later, while writing this very content. I had been involved in previous launches: I had been part of the C1 & C2 launch and we had successfully put the custom routers in production, but on those I was only a contributor, not the coordinator. This was my first product launch. I did not do everything myself, but I was involved in every component. After the launch I followed up with customers and fixed the few issues that surfaced.
The first big one came from how I had designed part of the API on the hardware. I did not have enough hardware to trigger the bug during my tests, because it only appeared under moderate load, when multiple competing workers were archiving data. It happened because I had isolated the components too much. I had two kinds of actions: one to switch to the appropriate disk, and a second set to read/write/erase data. That meant there was a potential race condition: a worker could switch the disk while another worker was reading from or writing to it. I hastily changed that mechanism to make the operation atomic. You now specified the target disk in the request itself, and the API guaranteed the data was written to the correct disk.
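To make the race concrete, here is a minimal sketch, not the actual firmware API: the names `switch_disk`, `write_block` and the classes are hypothetical, and the real hardware interface certainly looked different. The point is only the shape of the fix: the disk identifier moves into the write request, so another worker can no longer change the active disk between the two calls.

```python
import threading

class NaiveArchiveAPI:
    """Two-step API: switch the active disk, then write.
    Two workers interleaving these calls can write to the wrong disk."""

    def __init__(self):
        self._active_disk = None
        self._disks = {}  # disk_id -> list of blocks

    def switch_disk(self, disk_id):
        self._active_disk = disk_id
        self._disks.setdefault(disk_id, [])

    def write_block(self, data):
        # Race: another worker may have called switch_disk() in between,
        # so this block can land on a disk the caller never intended.
        self._disks[self._active_disk].append(data)


class AtomicArchiveAPI:
    """Single-step API: the target disk travels with the request,
    so the write is guaranteed to land on the intended disk."""

    def __init__(self):
        self._lock = threading.Lock()
        self._disks = {}

    def write_block(self, disk_id, data):
        with self._lock:
            self._disks.setdefault(disk_id, []).append(data)
```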
The second major issue was really a list of issues that could all be traced back to Docker. We had used it for the client data reception component, and it was difficult to run in production. It was, and still is, a great tool to develop, test and reproduce code, but in my view it is not a good tool for handling production loads. Between the magic it does with firewall rules and all its other non-standard behaviour, this is where I spent most of my time debugging production issues. One particular Docker bug, still open, triggered many cascading issues. After this, I could no longer trust Docker as a production-ready tool.
These were technical issues, engineering problems. They could be fixed, or compensated for with extra checks. But there were other issues that we could not fix. Not easily, at least.
When I had designed the pipeline, I had focused on finding a solution from a technical point of view, not from a user point of view. I had made sure that data ingestion was fast, because faster ingestion meant fewer machines to store the buffered data. But I had missed a fundamental element. When customers used this product, they did not care how fast the data was ingested. They deeply cared about getting it back as fast as possible when they needed it. It was a disaster recovery product: when the archived data was needed, it was imperative to get it back quickly. I could tweak the scheduling algorithm a bit, but I could not change the underlying mechanism without rewriting much of the product.
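As an illustration of the kind of tweak that was still possible, here is a minimal sketch, not the actual scheduler: a single priority queue where restore jobs always jump ahead of pending archive jobs. The priority levels and job names are assumptions; the real pipeline's scheduling was more involved, and this only changes ordering, not the underlying ingest-optimised mechanism.

```python
import heapq
import itertools

# Lower number = higher priority. Restores jump ahead of archives,
# but nothing here changes how either job is actually executed.
RESTORE, ARCHIVE = 0, 1

class JobQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # keeps FIFO order within a priority level

    def submit(self, priority, job):
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def next_job(self):
        if not self._heap:
            return None
        _, _, job = heapq.heappop(self._heap)
        return job

queue = JobQueue()
queue.submit(ARCHIVE, "archive bundle 42")
queue.submit(RESTORE, "restore customer dataset")  # served before both archives
queue.submit(ARCHIVE, "archive bundle 43")

while (job := queue.next_job()) is not None:
    print(job)
```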
The second thing was a big missed opportunity. After we had a working product, we looked into getting the ISO 14641-1 (NF-461) certification for electronic archiving. This certification was required for many administrative documents, and it was an emerging market. However, the conditions for certification were not compatible with our product. We stored bundles of data instead of individual files. We used erasure coding instead of multiple copies. We used fast hashing instead of slow cryptographic hashes. The product had not been very expensive in hardware or human time to build precisely because I had made those choices. We could not easily modify the product to fit the standard; we would have had to massively reengineer it. It could have been a very lucrative market, but it was too far away from our technical solution. I knew my solution to the problem was, in some respects, more efficient. But it did not matter: we could not abide by the certification conditions.
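To make the efficiency argument concrete, here is a back-of-the-envelope sketch. The erasure-coding parameters (6 data + 3 parity shards) and the three-copy baseline are assumptions for illustration, not the product's actual settings; they only show why erasure coding is cheaper per useful byte than keeping full copies.

```python
def replication_overhead(copies):
    """Raw bytes stored per useful byte when keeping full copies."""
    return copies

def erasure_coding_overhead(data_shards, parity_shards):
    """Raw bytes stored per useful byte with erasure coding."""
    return (data_shards + parity_shards) / data_shards

# Assumed parameters, for illustration only.
print(replication_overhead(3))        # 3.0 -> 3x raw storage for 3 copies
print(erasure_coding_overhead(6, 3))  # 1.5 -> half the raw storage,
                                      # while still tolerating 3 lost shards
```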
I learned to think from a customer-centric point of view because I ran into these issues. I learned that sometimes technical efficiency or prowess is not enough. Worse, sometimes it can even be detrimental.
Nevertheless, the product was used. The customers seemed happy. The company opened a position to continue the work on the product. We wanted to add more endpoints and more features. I was to fold back into the hardware team to work on the third generation compute hardware.
We are recruiting a storage devops (M/F/P) to work on C14 (a real thing) with Newton with many organic petabytes to scale, please RT