Splitting light
Season 1 Episode 32
Excel sheets to configure
Assignment of fixed amount of TCAM memory to subsystems
One of the particularities I thought the lab's team had was that we didn't try to be too smart. We used simple, battle-tested tools and common systems. The complexity was not in the tools we used but in hiding as much of that complexity as we could from the final output.
We used Excel for multiple things. We used it to work out how much memory to allocate depending on the features we wanted. Or to work out how much memory to allocate to the PHY depending on the number of links we had and their respective maximum speeds, and then to split that up further depending on the MTU sizes we would allow.
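To give a feel for the kind of arithmetic those sheets held, here is a minimal sketch in Python. Every link name, speed, and constant below is hypothetical; the real sheets worked from the chips' datasheet figures.

```python
# Minimal sketch of the kind of arithmetic the Excel sheets encoded.
# All names and numbers here are hypothetical, not the lab's actual figures.

LINKS = [
    # (link name, max speed in Gbit/s)
    ("uplink-0", 40),
    ("uplink-1", 40),
    ("downlink-0", 10),
    ("downlink-1", 10),
]

MTU_BYTES = 9216          # assumed jumbo-frame MTU
BUFFER_PER_GBIT = 4096    # hypothetical bytes of PHY buffer per Gbit/s of line rate

def phy_memory_budget(links, mtu, per_gbit):
    """Allocate PHY buffer memory proportionally to each link's max speed,
    rounded up to whole MTU-sized slots so no frame straddles a boundary."""
    budget = {}
    for name, gbit in links:
        raw = gbit * per_gbit
        slots = -(-raw // mtu)        # ceiling division to whole MTU slots
        budget[name] = slots * mtu
    return budget

budget = phy_memory_budget(LINKS, MTU_BYTES, BUFFER_PER_GBIT)
for name, size in budget.items():
    print(f"{name}: {size} bytes")
print(f"total: {sum(budget.values())} bytes")
```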
We used Excel in many places. Our purpose, the reason for making hardware in the first place, was to get a higher density of compute in a single rack at a better price. Just as the manufacturing constraints of third-party components required us to carefully allocate memory between subsystems, we were constrained by the physics of where our devices would live.
At the time, circa 2013-2017, racks in our datacenters, as in most available datacenters, had a capacity of 4 000 watts. Nowadays, a decade later, a single AI card in a server can require 4 000 watts. We had to cram as much as we could into a limited physical space, with a limited number of watts and a limited cooling capacity. We had to find a sweet spot that let us fit as many rentable compute devices as possible. The goal was to make them as dense and as cheap as we could. Everything had to fit within these constraints. More electricity or cooling could not suddenly, magically appear in the datacenter room.
Only 4 000 watts available for that volume
The process started with Excel sheets summing the power envelopes of the compute chips, adding overhead for the internal network chips, and refining as we went. Each chassis, once built and loaded with software, had to be checked to confirm it did not draw more than the power budget we had allocated to it at full load.
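As an illustration, here is a minimal sketch of what such a sheet computed, with made-up component names, wattages, and margins rather than our actual figures.

```python
# Minimal sketch of the power-budget sheet: sum the worst-case envelopes,
# add an overhead factor, and compare against the rack's 4 000 W ceiling.
# Component names and wattages are invented for illustration.

RACK_BUDGET_W = 4000

chassis_components_w = {
    "compute-cards": 12 * 55,     # hypothetical: 12 cards at 55 W each
    "internal-network": 80,
    "management-soc": 15,
    "fans": 60,
}

OVERHEAD = 1.10                   # assumed 10 % margin for conversion losses

def chassis_envelope(components, overhead):
    """Worst-case chassis power draw including the safety margin."""
    return sum(components.values()) * overhead

per_chassis = chassis_envelope(chassis_components_w, OVERHEAD)
chassis_per_rack = int(RACK_BUDGET_W // per_chassis)
print(f"per chassis: {per_chassis:.0f} W, fits {chassis_per_rack} per rack")
```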
Initially it was always over budget, but over time we were able to lower it. We would disable non-essential functions in every chip, from disabling parts of the Intel SoC to disabling unused subsystems in the power-control microcontrollers. Each milliwatt saved on a card would add up: watts at the card level, dozens of watts at the chassis level, and eventually hundreds of watts at the rack level. The individual optimizations did not matter by themselves; it was the aggregate of all these small savings that had an impact.
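A hypothetical worked example of how those savings aggregate; none of these numbers are the lab's actual figures, only plausible orders of magnitude.

```python
# Hypothetical worked example of how per-card savings aggregate.
saving_per_card_w = 3.0   # assumed watts recovered per card by disabling unused subsystems
cards_per_chassis = 12    # hypothetical chassis layout
chassis_per_rack = 10     # hypothetical rack layout

per_chassis_w = saving_per_card_w * cards_per_chassis
per_rack_w = per_chassis_w * chassis_per_rack
print(f"{per_chassis_w:.0f} W saved per chassis, {per_rack_w:.0f} W per rack")
# -> 36 W saved per chassis, 360 W per rack
```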
Herds of racks
These tests also had to confirm that the heat emitted by all these chips and components could be evacuated. For that we had special cards full of very simple components that just emitted heat as electricity passed through them, fancy electric radiators in a sense. We would power everything up and check that the fans responded accurately to the temperature and that the copper tracks in the circuit board could carry the required power.
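As a sketch of what "responding accurately to the temperature" means in practice, here is a hypothetical fan-curve check. The breakpoints and readings are invented; a real curve comes out of the thermal design.

```python
# Minimal sketch of a fan-curve check: map measured temperature to an
# expected fan duty cycle and flag readings where the fans lag behind.
# Breakpoints and tolerances are hypothetical.

FAN_CURVE = [
    # (temperature in degrees C, expected duty cycle in %)
    (30, 20),
    (50, 40),
    (70, 70),
    (85, 100),
]

def expected_duty(temp_c):
    """Linear interpolation over the fan curve, clamped at both ends."""
    if temp_c <= FAN_CURVE[0][0]:
        return FAN_CURVE[0][1]
    for (t0, d0), (t1, d1) in zip(FAN_CURVE, FAN_CURVE[1:]):
        if temp_c <= t1:
            return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)
    return FAN_CURVE[-1][1]

# Example readings from a heat-load test run (hypothetical values)
for temp, measured in [(45, 38), (72, 55)]:
    want = expected_duty(temp)
    status = "ok" if measured >= want - 5 else "FAN LAGGING"
    print(f"{temp} C: expected ~{want:.0f} %, measured {measured} % -> {status}")
```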
Even power-supply efficiency had to be taken into account. The more efficient the supply, the more expensive it was. But the less efficient it was, the less power we had for the chips, since part of it was lost in the supply itself. All these tradeoffs had a purpose; each one brought or cost something, and we had to make choices. Everything was documented and committed to the source code repository: all these Excel sheets, high-level designs, test results, and test procedures. It was the only way we could make it work.
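The supply tradeoff itself is simple arithmetic. A minimal sketch, with efficiencies and prices that are entirely hypothetical:

```python
# Minimal sketch of the power-supply tradeoff: a cheaper, less efficient
# supply leaves fewer watts for the chips. All figures are invented.

RACK_BUDGET_W = 4000

supplies = [
    # (name, efficiency at typical load, unit price in arbitrary units)
    ("standard", 0.88, 90),
    ("premium", 0.94, 160),
]

for name, eff, price in supplies:
    usable_w = RACK_BUDGET_W * eff   # watts left for chips after supply losses
    lost_w = RACK_BUDGET_W - usable_w
    print(f"{name}: {usable_w:.0f} W usable, {lost_w:.0f} W lost, price {price}")
```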
Datacenter corridor and heat flow
Excel was sufficient for us. There were six of us in the lab, six people working on different subsystems. We didn't need a collaborative tool that half worked; we needed a stable tool that did what it was asked within its design limits. We pushed our own limits instead. My next step was to power up a complex board, pushing the envelope even more.
If you have missed it, you can read the previous episode here
To pair with:
00:26 (Ryan Davis Remix) - Ólafur Arnalds, Nils Frahm