Splitting light

Season 1 Episode 26v

Racks and racks of devices


One interesting thing I became truly aware of at the time was the law of large numbers. I knew it existed, but witnessing its effects was new.

As we produced thousands of devices, we would sometimes hit issues triggered by defects or tolerance margins. I called them ghost bugs. We tried to catch as many bugs as we could before sending the manufacturing order, but some would slip through the cracks.

Not C1 nor C2 but in Scaleway's DC3 (now OpCore)

Those bugs could be very hard to track down. I remember one vividly. It was an issue with the router we had built. Sometimes one of the 1 Gb network ports would malfunction: either it would show a lot of errors or it would only link up at 100 Mb. This issue appeared in one or two devices out of a hundred. We only detected it on devices that were already installed and wired in the datacenter, usually just before the servers they were connected to were about to be rented out.

The network team would ping me and I would have to diagnose the issue. In this case, however, the steps were not easy. The signal was converted to Ethernet by a PHY, a dedicated chip, which talked in a specific way to the routing ASIC, another chip. We were unable to reproduce the issue with the devices we had in the lab. The only way would have been to bring the affected device back to the lab. Sometimes we resorted to that, but it was not always possible. So software to obtain diagnostic information from both chips had to be written. It had to be checked in the lab and packaged, then the software on the device had to be upgraded to test live for the problem.

Diagnosing and trying to fix issues such as this one could take several days. If the team was in a hurry or the issue impacted a live customer, they would swap the port to bypass it and I would continue to diagnose. I still had to be very careful not to impact live customers during my testing.
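To put numbers on why a bug seen in one or two devices out of a hundred is so hard to reproduce in a lab, here is a minimal Python sketch. The fleet size (5,000) and lab size (10 devices) are illustrative assumptions, not figures from the project; only the 1 to 2 percent defect rate comes from the story above.

```python
def expected_affected(defect_rate: float, fleet_size: int) -> float:
    """Expected number of affected devices in a fleet of the given size."""
    return defect_rate * fleet_size


def prob_sample_is_clean(defect_rate: float, sample_size: int) -> float:
    """Probability that none of `sample_size` independent devices is affected."""
    return (1.0 - defect_rate) ** sample_size


if __name__ == "__main__":
    FLEET_SIZE = 5000  # assumption: "thousands of devices"
    LAB_SIZE = 10      # assumption: a handful of units kept in the lab

    for rate in (0.01, 0.02):  # one or two devices out of a hundred
        affected = expected_affected(rate, FLEET_SIZE)
        clean = prob_sample_is_clean(rate, LAB_SIZE)
        print(f"rate={rate:.0%}: ~{affected:.0f} affected in the fleet, "
              f"{clean:.0%} chance the lab has no affected unit")
```

Under these assumptions, the fleet would contain dozens of affected devices, while a ten-unit lab would have roughly an 80 to 90 percent chance of containing none. That is the law of large numbers at work: a defect that is nearly invisible in a small sample shows up reliably at scale.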

Server with dual network attachment, photographed in DC3

These bugs were hard to fix because finding the original cause was hard. Sometimes it meant finding the right piece of information, the one that would hint at the issue, in thousands of pages of data. Then we had to write the code specific to that issue, update the artifacts, then test and validate the bug fix. Sometimes the underlying issue would stay out of our grasp. As we weeded out the easy bugs, the only ones left were the harder ones, and they became increasingly hard to deal with. The fix could be routing a trace slightly differently, adding a component, or replacing a chip with a slightly different one. Sometimes it was fixing the sequence of data interactions or the speed of the interaction. We had to dig through more and more layers of electronics, software and human interactions.

The lab's 48 × 1 Gb + 2 × 40 Gb network router, enclosed in its case

We initially hardcoded many things. It was easier and faster to keep the configuration inside the code. Slowly, over time, we extracted that configuration into files that could be changed without our input. We added helper commands that made it easier for operators to add information to these configuration files. The structure of the code changed, and we gave operators more flexibility to do some of the diagnostics themselves. That required more documentation and more configuration files, but it freed up time for us and empowered them to do more on their own. It made the feedback loop tighter.
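To illustrate the idea of moving hardcoded values into operator-editable files with a helper command, here is a minimal Python sketch. Everything in it is hypothetical: the JSON format, the `ports` and `speed_mbps` fields, and the function names are assumptions for illustration, not the code that ran on the routers.

```python
import copy
import json
from pathlib import Path

# Hypothetical defaults: the values that would once have been hardcoded.
DEFAULTS = {"ports": {}}

# Hypothetical set of link speeds the helper accepts, in Mb/s.
ALLOWED_SPEEDS_MBPS = (100, 1000, 40000)


def load_config(path):
    """Return the operator-editable config, or the defaults if no file exists."""
    path = Path(path)
    if not path.exists():
        return copy.deepcopy(DEFAULTS)
    with path.open() as f:
        return json.load(f)


def set_port_speed(path, port, speed_mbps):
    """Helper command: record the expected link speed of a port in the file."""
    if speed_mbps not in ALLOWED_SPEEDS_MBPS:
        raise ValueError(f"unsupported speed: {speed_mbps} Mb/s")
    config = load_config(path)
    config.setdefault("ports", {})[port] = {"speed_mbps": speed_mbps}
    Path(path).write_text(json.dumps(config, indent=2))
    return config
```

An operator could then call something like `set_port_speed("router.json", "port-12", 1000)` instead of asking a developer to change the code. The helper validates the input before touching the file, which is what lets the file be changed safely without developer involvement.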

The router with components exposed (yes, red means it's a prototype)

The combinatorial nature of the work meant that we had to proceed in iterations. Bare minimum build first. Then add small features, show them to operators, refactor to suit their needs better. Send manufacturing orders and push to production. Continue adding features, refactor, listen again to operator feedback… It could only work this way. The experience we gained at each step, whether fixing an issue or adding a new feature, made the next one easier. We had to learn to build on the bleeding edge of our knowledge: not only designing and building the hardware, but also understanding how it would be used and how to find solutions to bugs that were increasingly complicated.

Everything we understood compounded into the next iteration of hardware. Each thing we learned made it easier to build more complex hardware or software. My next assignment would be something that I would physically remember. It started with a soldering iron.

