Splitting Light: Season 1 - Episode 26


Splitting light

Season 1 Episode 26

Racks and racks of devices


An interesting thing that I truly became aware of at the time was the law of large numbers. I knew it existed, but witnessing its effects was new.

As we produced thousands of devices, we would sometimes have issues that were triggered by defects or tolerance margins. I called them ghost bugs. We tried to catch as many bugs as we could before sending the manufacturing order, but some would slip through the cracks.
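To make the effect of scale concrete, here is a small sketch. The numbers are assumptions chosen purely for illustration: a defect that hits 1 device in 100, a lab bench of 10 units and a fleet of 2000 units. It computes the chance that at least one unit in a group shows the defect.

```python
def chance_of_seeing_defect(defect_rate: float, devices: int) -> float:
    """Probability that at least one of `devices` units shows the defect.

    Assumes each unit is affected independently with the same probability.
    """
    return 1.0 - (1.0 - defect_rate) ** devices


# Hypothetical numbers: a 1-in-100 defect, a 10-unit bench, a 2000-unit fleet.
for count in (10, 2000):
    print(count, round(chance_of_seeing_defect(0.01, count), 3))
```

Under these assumptions the bench has roughly a one in ten chance of containing a single affected unit, while the fleet is almost certain to contain some. A rare defect can stay invisible in the lab and still show up regularly in production.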

Those bugs could be very hard to track down. I remember one vividly. It was an issue with the router we had built. Sometimes one of the 1 Gb network ports would malfunction: either it would report a lot of errors or it would only link up at 100 Mb. This issue appeared in one or two devices out of a hundred. We only detected it on devices that were already installed and wired in the datacenter, just before the servers they were connected to were about to be rented out. The network team would ping me and I would have to diagnose the issue. In this case, however, the steps were not easy. The signal was converted to Ethernet by a PHY, a dedicated chip, which talked in a specific way to the routing ASIC, another chip.

We were unable to reproduce the issue with the devices we had in the lab. The only way would have been to bring the faulty device back to the lab. Sometimes we resorted to that, but it was not always possible. Software to obtain diagnostic information from both chips had to be written. It had to be checked in the lab, then packaged, and then the software on the device had to be upgraded before we could test for the problem live. Diagnosing and trying to fix an issue like this one could take several days. If the team was in a hurry or the issue impacted a live customer, they would swap the port to bypass it and I would continue the diagnosis. I still had to be very careful not to impact live customers during my testing.
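The two symptoms described above, a port linking below its rated speed and a port reporting many errors, lend themselves to an automated check. What follows is only a sketch of such a check, not the tool we wrote. The `PortStatus` fields, the `diagnose` function and the thresholds are all hypothetical, and in practice the values would come from the PHY and the routing ASIC.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PortStatus:
    """Hypothetical snapshot of one port, as a diagnostic tool might report it."""
    name: str
    link_up: bool
    speed_mbps: int   # negotiated link speed
    rx_errors: int    # receive error counter
    rx_packets: int   # total received packets


def diagnose(port: PortStatus, expected_mbps: int = 1000,
             max_error_ratio: float = 0.001) -> List[str]:
    """Return human-readable findings; an empty list means nothing suspicious."""
    if not port.link_up:
        return ["link down"]
    findings = []
    if port.speed_mbps < expected_mbps:
        findings.append(
            f"linked at {port.speed_mbps} Mb/s instead of {expected_mbps} Mb/s")
    if port.rx_packets > 0 and port.rx_errors / port.rx_packets > max_error_ratio:
        findings.append("error ratio above threshold")
    return findings


# Illustrative values only: a port that negotiated 100 Mb/s on a 1 Gb link.
print(diagnose(PortStatus("port7", True, 100, 12, 1_000_000)))
```

A check like this only flags the symptom. Finding out why the PHY and the ASIC ended up in that state was the part that took days.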

These bugs were hard to fix because finding the root cause was hard. Sometimes it meant finding, in thousands of pages of data, the one detail that would hint at the issue. Then we had to write code specific to that issue, update the artifacts, and test and validate the fix. Sometimes the underlying issue stayed out of our grasp. As we weeded out the easy bugs, the ones left were the harder ones, and they became increasingly hard to deal with. The fix could be tracing a line slightly differently, adding a component, or replacing a chip with a slightly different one. Sometimes it was fixing the sequence of data interactions or the speed of the interaction. We had to dig through more and more layers of electronics, software and human interactions.

We initially hardcoded many things. It was easier and faster to keep the configuration inside the code. Slowly, over time, we extracted that configuration into files that could be changed without our input. We added helper commands that made it easier for operators to add information to these configuration files. The structure of the code changed, and we gave operators more flexibility to do some of the diagnostics themselves. That required more documentation and more configuration files, but it freed up time for us and empowered them to do more on their own. It made the feedback loop tighter.
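The move from hardcoded values to operator-editable files can be sketched as follows. This is a minimal illustration under assumptions, not our actual code: the JSON format, the file name and the setting names are all hypothetical. The defaults stay in the code, and a file that operators control overrides them.

```python
import json
from pathlib import Path

# Values that used to be hardcoded; they now only serve as fallbacks.
DEFAULTS = {"expected_speed_mbps": 1000, "max_error_ratio": 0.001}


def load_config(path: str) -> dict:
    """Return the defaults, overridden by whatever the operator's file defines."""
    config = dict(DEFAULTS)  # copy, so the defaults are never mutated
    config_file = Path(path)
    if config_file.exists():
        config.update(json.loads(config_file.read_text()))
    return config


# Hypothetical file name; with no file present, the defaults are returned.
print(load_config("router-diagnostics.json"))
```

Keeping the defaults in the code means a missing or partial file still yields a working configuration, which matters when the people editing the file are not the people who wrote the code.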

The combinatorial nature of the work meant that we had to proceed in iterations. Bare minimum build first. Then add small features, show them to operators, and refactor to better suit their needs. Send manufacturing orders and push to production. Continue adding features, refactor, listen again to operator feedback… It could only work this way. The experience we gained at each step, whether fixing an issue or adding a new feature, made the next step easier. We had to learn to build at the bleeding edge of our knowledge: not only designing and building the hardware, but also understanding how the hardware would be used and how to find solutions to bugs that were increasingly complicated.

Every single thing we understood compounded for the next iteration of hardware. Each thing we learned made it easier to build more complex hardware or software. My next assignment would be something that I would physically remember. It started with a soldering iron.

If you have missed it, you can read the previous episode here

To pair with:

  • Red Herring (Original Mix) - Union Jack
  • The Abyss (L'Œuvre au noir) by Marguerite Yourcenar

Vincent Auclair


Symbol Sled

Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
