Splitting Light: Season 1 - Episode 21


1+ meter track length

This next assignment would become the premise of a career pivot, even though I could not know it at the time. Greg has these interesting ideas that I can only fully appreciate now that I am more experienced and more battle-tested. He did the hardware designs in a modular way. There was a project to reuse the C1 node and package it as a Raspberry Pi, but it fizzled out because handling mass-market hardware is a very different world from handling datacenter hardware. From that project, on which I had done a bit of software qualification, a storage board was born. A new type of hard drive had just come out: SMR drives. They had a higher capacity and were 25% less expensive per GB. However, they had one flaw, or constraint, depending on how you saw it: their random read/write performance was very bad.

What if you could hedge this? Greg spun a new design where you plugged 56 3.5-inch drives vertically into a large, squarish PCB, with a C1 node slotted into one corner. A maze of lanes and small components routed and switched the SATA data lanes to the node.

Once the voltage tests were done, I was handed the board and started to qualify the 56 slots. I wrote a bit of Python to help me, but 90% of the process was manual. I would plug an SSD into the slot, power it up from the terminal, check that the drive linked up with the operating system, check for data errors while transferring some data back and forth, then power down the drive and move on to the next slot.
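The per-slot routine above can be sketched in Python. This is only an illustration and not the script I wrote at the time: `FakeRig` and all of its methods are hypothetical stand-ins that simulate a board in memory, since the real power and link commands are not described here.

```python
import hashlib
import os

SLOT_COUNT = 56  # from the text: 56 drive slots on the board


class FakeRig:
    """Hypothetical in-memory stand-in for the board (not the real tooling)."""

    def __init__(self, dead_slots=(), corrupting_slots=()):
        self.dead_slots = set(dead_slots)              # slots that never link up
        self.corrupting_slots = set(corrupting_slots)  # slots that corrupt data
        self.powered = set()
        self.storage = {}

    def power_on(self, slot):
        self.powered.add(slot)

    def power_off(self, slot):
        self.powered.discard(slot)

    def link_is_up(self, slot):
        return slot in self.powered and slot not in self.dead_slots

    def write(self, slot, data):
        self.storage[slot] = data

    def read(self, slot):
        data = self.storage[slot]
        if slot in self.corrupting_slots:
            # Flip one bit of the first byte to simulate a transfer error.
            data = bytes([data[0] ^ 0x01]) + data[1:]
        return data


def qualify_slot(rig, slot, payload_size=4096):
    """Return 'ok', 'no_link' or 'data_error' for a single slot."""
    rig.power_on(slot)
    try:
        if not rig.link_is_up(slot):
            return "no_link"
        payload = os.urandom(payload_size)
        rig.write(slot, payload)
        readback = rig.read(slot)
        # Compare checksums of what was written and what came back.
        if hashlib.sha256(readback).digest() != hashlib.sha256(payload).digest():
            return "data_error"
        return "ok"
    finally:
        rig.power_off(slot)  # always power down before the next slot


def qualify_board(rig, slot_count=SLOT_COUNT):
    """Run the routine on every slot and collect a report."""
    return {slot: qualify_slot(rig, slot) for slot in range(slot_count)}
```

Calling `qualify_board(FakeRig(dead_slots={3}))` returns a dictionary with one status per slot, which is roughly the kind of report I was building by hand.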

Right away, I found that half of the slots were not working. After checking the tracks and the schematics, we found the issue. For simplicity, one of the SATA buses had some of its SerDes signal pairs inverted. This meant that the system on chip (SoC) was receiving a positive voltage on the wire where it expected a negative one, and vice versa on the other wire. It was documented that we could invert the lanes by configuring the SATA PHY, but the specific configuration was nowhere to be found in our documents. I sent an email to our chip support engineer for assistance.
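To make the inversion concrete, here is a toy model in Python. It assumes a receiver that decides each bit from the sign of the voltage difference between the two wires of a pair. The function names and the 0.25 V swing are made up for illustration, and the `invert_polarity` flag only mimics the idea of a PHY polarity setting, not the actual register.

```python
def drive(bits, swing=0.25):
    """Encode bits as (v_pos, v_neg) voltages on a differential pair."""
    return [(swing, -swing) if bit else (-swing, swing) for bit in bits]


def swap_wires(pairs):
    """Model a board where the two wires of the pair are swapped."""
    return [(v_neg, v_pos) for v_pos, v_neg in pairs]


def receive(pairs, invert_polarity=False):
    """Decode bits from the sign of the differential voltage."""
    bits = []
    for v_pos, v_neg in pairs:
        diff = v_pos - v_neg
        if invert_polarity:
            diff = -diff  # undo a swapped pair in the receiver
        bits.append(1 if diff > 0 else 0)
    return bits


sent = [1, 0, 1, 1, 0]
print(receive(drive(sent)))                                     # normal pair
print(receive(swap_wires(drive(sent))))                         # every bit flipped
print(receive(swap_wires(drive(sent)), invert_polarity=True))   # corrected
```

With the wires swapped, every bit arrives flipped, so the receiver never sees valid data until its polarity is inverted to match.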

Continuing on the working SATA bus, as I was powering up a new slot I heard a big mechanical snap. A few seconds later I smelled burnt plastic and my device was not responding anymore. I turned right away to the test bench. The differential circuit breaker had done its job in preventing a fire. We slowly disconnected everything and started inspecting the board.

We turned it around, looked everywhere, and looked at the schematics, but couldn't find anything. Eventually Greg found the issue. To understand it, you have to understand how a SATA connector is seated. It’s a sort of bridge that is seated or soldered onto the board, but the place where the connector meets the line pads on the board is open and visible. In the SATA slot I had just turned on, underneath the connector, were several solder bubbles. They were the cause of the short circuit. We had been trying out a less expensive manufacturer, and they had not caught the defect. Neither had we, until it burned a few components. I cleaned up the bubbles with desoldering braid, replaced the burnt components and continued the tests.

Our support engineer had responded with the memory register configuration, and after patching the operating system a bit we had the second SATA bus working. Continuing the tests, I found that some of the slots were not reachable. The components that controlled them did not respond to my commands. The digital oscilloscope came to the rescue. My first probes at the output of the C1 node didn’t show anything unusual. Greg suggested I probe the component directly. I did and, lo and behold, there was something unusual. One of the things the presenter in “Indistinguishable from magic” had said was that digital is analog. The resistance of the copper over the length of the track had weakened the signal enough that it was out of spec for the components. We increased the power of the signal and it became very digital (square) again.
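For a rough feel of the numbers, here is a DC-only estimate in Python. Every dimension is an assumption for illustration (a 1 m track, 0.15 mm wide, in 35 µm copper, driving a hypothetical 100 ohm load at 3.3 V); none of these values come from the actual board, and the model ignores capacitance and other AC effects.

```python
COPPER_RESISTIVITY_OHM_M = 1.68e-8  # copper at roughly 20 °C


def trace_resistance_ohms(length_m, width_m, thickness_m):
    """Resistance of a rectangular track: R = rho * L / (w * t)."""
    return COPPER_RESISTIVITY_OHM_M * length_m / (width_m * thickness_m)


def voltage_at_load(v_source, r_trace_ohms, r_load_ohms):
    """Voltage divider: the share of the source voltage that reaches the load."""
    return v_source * r_load_ohms / (r_load_ohms + r_trace_ohms)


# Assumed values, for illustration only.
r_trace = trace_resistance_ohms(length_m=1.0, width_m=0.15e-3, thickness_m=35e-6)
v_load = voltage_at_load(v_source=3.3, r_trace_ohms=r_trace, r_load_ohms=100.0)
print(f"track resistance: {r_trace:.2f} ohm, voltage at load: {v_load:.2f} V")
```

The longer and thinner the track, the more of the signal is lost before it reaches the component, which is why a stronger drive helped.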

After testing every single slot, I wrote some Python code to make the board more manageable, as well as documentation for the hardware. I was ready to hand it over. But that would not happen…

To pair with:

  • Megumi the Milkyway Above - Connan Mockasin
  • Magician: Apprentice by Raymond E. Feist


Vincent Auclair
