Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
Splitting Light: Season 1 - Episode 21
Published 6 months ago • 3 min read
1+ meter track length
This next assignment would become the starting point of a career pivot, even though I could not know it at the time. Greg had interesting ideas that I can only fully appreciate now that I am much more experienced and battle-tested. He did the hardware designs in a modular way. There was a project to reuse the C1 node and package it like a Raspberry Pi, but it fizzled out because handling mass-market hardware is a very different world from handling datacenter hardware. From that project, on which I had done a bit of software qualification, a storage board was born. A new type of hard drive had just come out: SMR drives. They had higher capacity and were 25% less expensive per GB. However, they had one flaw, or constraint, depending on how you saw it: performance for random reads and writes was very bad.
What if you could hedge this? Greg spun a new design where you plugged 56 3.5-inch drives vertically into a large, squarish PCB, with a C1 node slotted into one corner. A maze of traces and small components routed and switched the SATA data lanes to the node.
The storage board
Once the voltage tests were done, I was handed the board and started to qualify the 56 slots. I wrote a bit of Python to help me, but 90% of the process was manual. I would plug an SSD into the slot, power it up from the terminal, check that the drive linked up with the operating system, check for data errors while transferring some data back and forth, then power down the drive and move on to the next slot.
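The per-slot routine above can be sketched in Python. This is a minimal illustration and not the script I actually wrote: `SimulatedBoard` and all of its methods are hypothetical stand-ins for the real power control, link detection and data transfer, so that the loop can run without any hardware.

```python
import hashlib
import os


class SimulatedBoard:
    """Hypothetical stand-in for the storage board; not a real API."""

    def __init__(self, slot_count=56, dead_slots=()):
        self.slot_count = slot_count
        self.dead_slots = set(dead_slots)  # slots whose SATA link never comes up
        self.powered = set()
        self.storage = {}

    def power_on(self, slot):
        self.powered.add(slot)

    def power_off(self, slot):
        self.powered.discard(slot)

    def link_is_up(self, slot):
        return slot in self.powered and slot not in self.dead_slots

    def write(self, slot, data):
        self.storage[slot] = data

    def read(self, slot):
        return self.storage[slot]


def qualify_slot(board, slot, payload_size=4096):
    """Power up one slot, check the link, round-trip data, power down."""
    board.power_on(slot)
    try:
        if not board.link_is_up(slot):
            return "no link"
        payload = os.urandom(payload_size)
        board.write(slot, payload)
        # Compare checksums of what was written and what was read back.
        written = hashlib.sha256(payload).digest()
        read_back = hashlib.sha256(board.read(slot)).digest()
        return "ok" if written == read_back else "data error"
    finally:
        # Always power the slot down, even if a check raised an exception.
        board.power_off(slot)


def qualify_board(board):
    """Run the routine on every slot and return a {slot: result} report."""
    return {slot: qualify_slot(board, slot) for slot in range(board.slot_count)}


if __name__ == "__main__":
    # Assumption for the demo: the second half of the slots has no link.
    board = SimulatedBoard(dead_slots=range(28, 56))
    report = qualify_board(board)
    failed = [slot for slot, result in report.items() if result != "ok"]
    print(f"{len(failed)} of {board.slot_count} slots failed")
```

The `try`/`finally` mirrors the manual habit of always powering a drive down before moving on, whatever the result of the checks.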
Right away, I found that half of the slots were not working. After checking the tracks and the schematics, we found the issue. For simplicity, some of the SerDes signal pairs on one of the SATA buses had been inverted, which meant that the system on chip (SoC) was expecting a negative voltage where it was receiving a positive one, and vice versa on the other wire. It was documented that we could invert the lanes by configuring the SATA PHY, but the specific configuration was nowhere to be found in our documents. I sent an email to our chip support engineer for assistance.
Differential signaling pair
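To make the polarity problem concrete, here is a toy Python model of a differential pair. It is only a sketch: the voltage levels and the `invert_polarity` flag are assumptions for illustration and have nothing to do with the actual PHY register our support engineer later gave us.

```python
def drive(bits, high=0.25, low=-0.25):
    """Encode bits as (positive_wire, negative_wire) voltages.

    The +/-0.25 V levels are assumed values, chosen only for the example.
    """
    return [(high, low) if bit else (low, high) for bit in bits]


def swap_pair(signal):
    """Model a board where the two wires of the pair are crossed."""
    return [(negative, positive) for positive, negative in signal]


def receive(signal, invert_polarity=False):
    """Decode each bit from the sign of the voltage difference."""
    bits = [1 if positive - negative > 0 else 0 for positive, negative in signal]
    if invert_polarity:
        # What a polarity-inversion setting in the receiver would undo.
        bits = [1 - bit for bit in bits]
    return bits


if __name__ == "__main__":
    sent = [1, 0, 1, 1, 0]
    crossed = swap_pair(drive(sent))
    print("sent:                   ", sent)
    print("received, default:      ", receive(crossed))
    print("received, inverted lane:", receive(crossed, invert_polarity=True))
```

With the wires crossed, every bit comes out flipped unless the receiver is told to invert the lane, which is the kind of setting we were missing.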
Continuing on the working SATA bus, as I was powering up a new slot I heard a big mechanical snap. A few seconds later I smelled burnt plastic and my device was not responding anymore. I turned right away to the test bench. The differential circuit breaker had done its job in preventing a fire. We slowly disconnected everything and started inspecting the board.
We turned it around, looked everywhere, and looked at the schematics, but couldn't find anything. Eventually Greg found the issue. To understand it, you have to understand how a SATA connector is seated. It's actually a sort of bridge that is seated or soldered, but the place where the connector meets the line pads on the board is open and visible. In the SATA slot I had just turned on, underneath the connector, were several solder bubbles. They were the cause of the short circuit. We had been trying out a less expensive manufacturer; they had not caught the defect, and neither had we until it burned a few components. I cleaned off the bubbles with desoldering braid, replaced the burnt components and continued the tests.
Our support engineer had responded with the memory register configuration, and after patching the operating system a bit we had the second SATA bus working. Continuing the tests, I found that some of the slots were not reachable. The components that controlled them did not respond to my commands. The digital oscilloscope came to the rescue. My first probes at the output of the C1 node didn't show anything unusual. Greg suggested I probe the component directly. I did and, lo and behold, there was something unusual. One of the things the presenter in “Indistinguishable from magic” had said was that digital is analog. The resistance of the copper along the length of the track had attenuated the signal enough that it was out of spec for the components. We increased the power of the signal and it became very digital (square) again.
After testing every single slot, I wrote some Python code to make the board more manageable, as well as documentation for the hardware. I was ready to hand it over. But that would not happen…
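For a rough feel of how much resistance a long track adds, here is a back-of-the-envelope Python sketch. Only the resistivity of copper is a known constant; the track width, copper thickness, source voltage and load resistance are assumed values picked for illustration, not measurements from the board, and the model is a plain DC voltage divider that ignores every high-frequency effect.

```python
COPPER_RESISTIVITY = 1.68e-8  # ohm metres, at roughly 20 degrees Celsius


def track_resistance(length_m, width_m, thickness_m):
    """DC resistance of a rectangular copper track: R = rho * L / (w * t)."""
    return COPPER_RESISTIVITY * length_m / (width_m * thickness_m)


def voltage_at_load(source_v, track_ohm, load_ohm):
    """Voltage seen by the load in a simple series (divider) model."""
    return source_v * load_ohm / (track_ohm + load_ohm)


if __name__ == "__main__":
    # Assumed geometry: 1 m long, 0.15 mm wide, 35 micrometres thick.
    r = track_resistance(length_m=1.0, width_m=0.15e-3, thickness_m=35e-6)
    print(f"track resistance: {r:.2f} ohm")

    # Assumed 3.3 V source driving a hypothetical 50 ohm load.
    v = voltage_at_load(source_v=3.3, track_ohm=r, load_ohm=50.0)
    print(f"voltage at the load: {v:.2f} V")
```

Under these assumptions a metre of thin track adds a few ohms, which is enough to shave a visible slice off the signal before any other loss is counted.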
To pair with:
Megumi the Milkyway Above - Connan Mockasin
Magician: Apprentice by Raymond E. Feist
If you have missed it, you can read the previous episode here