Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
Share
Splitting Light: Season 1 - Episode 26
Published 9 months ago • 4 min read
Splitting light
Season 1 Episode 26v
Racks and racks of devices
If you are no longer interested in the newsletter, please unsubscribe
An interesting thing that I became actually aware of at the time was the law of large numbers. I knew it existed, but witnessing the effects was new.
As we produced thousands of devices, we would sometimes have issues that were triggered by defects or tolerance margins. I called them ghost bugs. We tried as much as we could to catch as many bugs as we could before sending the manufacturing order but some would pass through the cracks.
Those bugs could be very hard to track down. I remember one vividly. It was an issue with the router we had built. Sometimes one of the 1gb network ports would malfunction. Either it would have a lot of errors or it would only link up at 100mb. This issue appeared in one or two devices out of a hundred. We only detected it with devices that were already installed and wired in the datacenter. Detecting the issue just before the servers they were connected to were about to be rented out. The network team would ping me and I would have to diagnose the issue. However in this case, the steps were not easy. The signal was transformed to ethernet via a PHY, a specific chip, which talked in a specific way to the routing ASIC, another chip. We were unable to reproduce this issue with the devices we had in the lab. The only way would have been to bring the device back to the lab. Sometimes we resorted to that, but it was not always possible. Software to obtain diagnostic information from both chips had to be written. It had to be checked in the lab then packaged, then the software on the device had to be upgraded to test live for this problem. Diagnosing and trying to fix issues such as this one could take several days. If the team was in a hurry or it impacted a live customer, they would swap the port to bypass the issue and I would continue to diagnose the issue. I still had to be very careful to not impact live customers during my testing.
These were hard to fix because finding the original cause was hard. Sometimes it was finding the right data in the thousands of pages of data that would hint at the issue. Then we had to write the code specific to that issue, update the artifacts, test and validate the bug fix. Sometimes the underlying issue would stay out of our grasp. As we weeded out the easy bugs, the only bugs left were harder bugs. They would become increasingly hard to deal with. The bug fix could be tracing a line slightly differently, or adding a component or replacing a chip by one slightly different. Sometimes it was fixing the sequence of data interactions or the speed of the interaction. We had to dig through more and more layers of electronics and software and human interactions.
The lab's 48 1gb + 2 40gb network router, enclosed in case
We initially hardcoded many things in the code. It was easier and faster to have the configuration inside the code. Slowly, over time, we extracted that code into files that could be changed without our input. We would add helper commands that eased the tasks of the operator to add information in these configuration files. The structure of the code changed and we allowed for more flexibility to the operators to do some of the diagnostics themselves. That required more documentation and more configuration files but it freed time for us and empowered them to do more themselves. It made the feedback loop tighter.
The router with components exposed (yes, red means it's a prototype)
The combinatorial nature of the work made it that we had to go in iterations. Bare minimum build first. Then add small features, show it to operators, refactor to suit needs better. Send manufacturing orders and push to production. Continue adding features, refactor, listen again to operator feedback… It could only work this way. The experience we gained at each step whether fixing an issue or adding a new feature made it easier for the next. We had to learn to build on the bleeding edge of our knowledge. Not only designing and building the hardware but how the hardware would be used and how to find solutions to bugs that were increasingly complicated.
Every single thing we understood compounded up for the next iteration of hardware. Each thing we learned made it easier to do more complex hardware or software. My next assignment would be something that I would physically remember. It started with a soldering iron.
If you have missed it, you can read the previous episode here
To pair with :
Red herring (original mix) - Union jack
The Abyss (L'oeuvre au noir) by Marguerite Yourcenar
Splitting light Season 2 Episode 23 Beat the cluster to a pulp If you are no longer interested in the newsletter, please unsubscribe With proper observability we could now push the cluster even further. This was the final set of tests that we would perform before wiping everything and going to beta after a new setup. We huddled and concocted a strategy. Picked up our tools and went on the field to beat the cluster to a pulp one last time. Our goal was explicitly to overwhelm the cluster as...
Splitting light Season 2 Episode 22 Too many logs If you are no longer interested in the newsletter, please unsubscribe I’ve rarely seen people talk about this effect. The effect being the amplification of requests. This effect can overwhelm your system. We had to deal with it. The object storage, at least OpenIO, was a collection of distributed services. You might call them micro services if you want. That had implications. When a request comes in, from the user perspective, it’s a single...
Splitting light Season 2 Episode 21 All nighter If you are no longer interested in the newsletter, please unsubscribe As we were moving forward, in mid June 2018, we hit a point where we needed to be able to check the logs of the cluster as a whole. The way we had done it until then was manually connecting to the machines and opening the right files to look inside. This was no longer viable. One of the main office rooms (1) Scaleway’s monitoring team had done a metric stack which we already...