The C1 server could not be used by the bare metal team: the nodes had little memory and no local disk. The C2 did not have these constraints, and from the beginning it was designed to be used by both the cloud and bare metal teams. If both teams used the devices, we would produce more of them, and the R&D cost could be spread across more devices.
The cloud team had decided to write code from scratch for several reasons, but the bare metal team wanted to minimize additional code. They had an existing infrastructure that efficiently handled tens of thousands of servers and relied on standard tools to manage them.
Because we had previously used existing software to implement BGP, we wondered whether we could simply interface existing software to fill the remaining feature gaps. So we talked to the people who managed the hardware and the people who ran the network to gather more knowledge and understand the missing features.
To understand some of the requirements, you have to understand how a server works. A server runs multiple pieces of software, and to power it on, restart it, power it off, and monitor its different components, there is a dedicated controller commonly called a BMC. Each manufacturer has its own variants in naming and features, but boiled down they are all the same. There is a mostly common protocol to talk to them over a network, called IPMI, so implementing that protocol was one of our requirements. Unfortunately, the BMC for our server was not wired to any IP network; we talked to it over an I2C bus. Changing the design would have been too expensive; however, we could sidecar some software to make it work.
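To give an idea of what that sidecar has to do, here is a minimal sketch of sending a single IPMI command to a BMC over an I2C bus from Linux userspace, using IPMB-style framing. The bus path, slave addresses and framing details are illustrative assumptions, not the code we actually shipped.

```c
/* Minimal sketch: sending an IPMI "Get Device ID" request to a BMC over
 * an I2C bus from Linux userspace (IPMB-style framing). The bus path,
 * slave addresses and framing details are illustrative assumptions. */
#include <fcntl.h>
#include <linux/i2c-dev.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define BMC_I2C_BUS   "/dev/i2c-1"  /* assumed bus the BMC hangs off */
#define BMC_I2C_ADDR  0x20          /* assumed 7-bit slave address   */

/* IPMB checksum: two's complement so that all covered bytes sum to 0. */
static uint8_t ipmb_checksum(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return (uint8_t)(-sum);
}

int main(void)
{
    int fd = open(BMC_I2C_BUS, O_RDWR);
    if (fd < 0 || ioctl(fd, I2C_SLAVE, BMC_I2C_ADDR) < 0) {
        perror("i2c setup");
        return 1;
    }

    /* IPMB frame for Get Device ID (NetFn=App=0x06, Cmd=0x01).
     * The responder address is implied by the I2C addressing, but it
     * is still covered by the first checksum. */
    uint8_t rs_sa  = BMC_I2C_ADDR << 1;          /* 8-bit responder addr  */
    uint8_t rq_sa  = 0x81;                       /* assumed requester addr */
    uint8_t hdr[2] = { 0x06 << 2, 0 };           /* netFn/LUN, checksum1  */
    hdr[1] = ipmb_checksum((uint8_t[]){ rs_sa, hdr[0] }, 2);

    uint8_t body[4] = { rq_sa, 0x00 /* rqSeq/LUN */, 0x01 /* cmd */, 0 };
    body[3] = ipmb_checksum(body, 3);

    uint8_t frame[6];
    memcpy(frame, hdr, 2);
    memcpy(frame + 2, body, 4);

    if (write(fd, frame, sizeof(frame)) != (ssize_t)sizeof(frame)) {
        perror("write");
        return 1;
    }

    /* A real sidecar would now read the response frame and verify its
     * checksums before handing the payload back to the IPMI layer. */
    close(fd);
    return 0;
}
```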
We took the open source ipmitool software and patched parts of the code to support a new kind of manufacturer. The software spoke standard IPMI, and we translated the instructions on the fly into I2C commands sent to the correct BMC. Once it worked, we added the temperature readings, power consumption in watts, and other commonly used information.
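As an aside on the sensor part: IPMI does not hand you degrees or watts directly. The sensor returns a raw byte that has to be converted using factors stored in its SDR record. A rough sketch of that conversion, assuming an unsigned linear sensor and with made-up factor values:

```c
/* Sketch: converting a raw IPMI sensor reading into a usable value.
 * For linear sensors the IPMI spec defines:
 *   value = (M * raw + B * 10^Bexp) * 10^Rexp
 * The factors come from the sensor's SDR record; the ones below are
 * made-up examples, not values from our hardware. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

struct sdr_factors {
    int16_t m;      /* multiplier      */
    int16_t b;      /* offset          */
    int8_t  b_exp;  /* offset exponent */
    int8_t  r_exp;  /* result exponent */
};

static double sensor_to_value(uint8_t raw, const struct sdr_factors *f)
{
    return (f->m * raw + f->b * pow(10, f->b_exp)) * pow(10, f->r_exp);
}

int main(void)
{
    struct sdr_factors cpu_temp = { .m = 1, .b = 0, .b_exp = 0, .r_exp = 0 };
    uint8_t raw = 0x2A; /* pretend this came back from Get Sensor Reading */
    printf("CPU temp: %.1f degC\n", sensor_to_value(raw, &cpu_temp));
    return 0;
}
```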
Two other pieces of software were required for the network management side. The first one was to discover the devices connected to the network, using a specific protocol called LLDP. The protocol works the following way: each element in the network mesh periodically emits a specific packet loaded with information, including its MAC addresses, the model of the machine, and the specific capabilities of the device that is plugged in. If you combine this with a default operating system that runs the software, you can auto-discover your network.
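For the curious, that information sits in a list of TLVs right after the Ethernet header. Here is a minimal sketch of walking those TLVs in a captured LLDP frame; the hand-built frame in the example is only there to make it runnable.

```c
/* Sketch: walking the TLVs of an LLDP frame (EtherType 0x88CC).
 * Assumes the frame has already been captured; real code would get it
 * from a raw socket or from the switch chip's CPU port. */
#include <stdint.h>
#include <stdio.h>

#define ETH_HDR_LEN 14

static void parse_lldp(const uint8_t *frame, size_t len)
{
    size_t off = ETH_HDR_LEN;

    while (off + 2 <= len) {
        int      type = frame[off] >> 1;                       /* 7-bit TLV type   */
        uint16_t tlen = ((frame[off] & 0x01) << 8) | frame[off + 1]; /* 9-bit length */
        const uint8_t *val = frame + off + 2;

        if (type == 0)                      /* End of LLDPDU */
            break;
        if (off + 2 + tlen > len)
            break;

        switch (type) {
        case 1:  printf("Chassis ID (%d bytes)\n", tlen);              break;
        case 2:  printf("Port ID (%d bytes)\n", tlen);                 break;
        case 3:  printf("TTL: %d s\n", (val[0] << 8) | val[1]);        break;
        case 5:  printf("System name: %.*s\n", tlen, val);             break;
        default: printf("TLV type %d, %d bytes\n", type, tlen);        break;
        }
        off += 2 + tlen;
    }
}

int main(void)
{
    /* Hand-built frame: 14-byte Ethernet header, then a TTL TLV (120 s)
     * and an End-of-LLDPDU TLV. */
    uint8_t frame[] = {
        0x01, 0x80, 0xC2, 0x00, 0x00, 0x0E,   /* LLDP multicast dest */
        0x02, 0x00, 0x00, 0x00, 0x00, 0x01,   /* source MAC          */
        0x88, 0xCC,                           /* EtherType: LLDP     */
        0x06, 0x02, 0x00, 0x78,               /* TTL TLV, 120 s      */
        0x00, 0x00                            /* End of LLDPDU       */
    };
    parse_lldp(frame, sizeof(frame));
    return 0;
}
```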
To handle these packets, we had to configure the chip to capture them and process them in software. There was existing open source software that handled the protocol, so we did the same job as we did for ipmitool or the BGP: I set up the different bits of configuration to relay the captured packets to lldpd with the minimum amount of customization, and added the sidecar.
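The sidecar part boils down to something like this: grab the LLDP frames the chip punts to the CPU and re-emit them where lldpd can see them, without touching lldpd itself. The interface names below are placeholders, and the real setup was mostly configuration rather than code.

```c
/* Sketch: a tiny "relay" that captures LLDP frames (EtherType 0x88CC)
 * on one interface and re-emits them untouched on another, which is
 * roughly the sidecar job described above. Interface names are
 * placeholders for this sketch. */
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#ifndef ETH_P_LLDP
#define ETH_P_LLDP 0x88CC
#endif

static int open_lldp_socket(const char *ifname)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_LLDP));
    if (fd < 0)
        return -1;

    struct sockaddr_ll sll = {
        .sll_family   = AF_PACKET,
        .sll_protocol = htons(ETH_P_LLDP),
        .sll_ifindex  = if_nametoindex(ifname),
    };
    if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    /* "swp1" is the port facing the server, "lldp0" the interface lldpd
     * listens on -- both names are assumptions for this sketch. */
    int in_fd  = open_lldp_socket("swp1");
    int out_fd = open_lldp_socket("lldp0");
    if (in_fd < 0 || out_fd < 0) {
        perror("socket");
        return 1;
    }

    uint8_t buf[1514];
    for (;;) {
        ssize_t n = recv(in_fd, buf, sizeof(buf), 0);
        if (n <= 0)
            continue;
        /* Re-emit the frame as-is on the interface lldpd watches. */
        send(out_fd, buf, (size_t)n, 0);
    }
}
```

Keeping the frames untouched is what lets the unmodified lldpd do all the protocol work.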
Lastly, in a similar fashion, because we were using L3 networks to communicate with the servers, and because of how the internal network was laid out at Scaleway, we wanted to be able to forward DHCP requests to a distant host. Most network equipment vendors offer this as a feature called relay, which is part of the DHCP protocol itself. What did we do? Like the rest, we interfaced it: just by configuring a few bits and forwarding some packets in a specific way, we had it working.
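The core of a DHCP relay fits in a few lines: take the broadcast request, stamp your own address into the giaddr field so the server knows which network it came from, and unicast it to the distant server. A hedged sketch, with placeholder addresses and the server-to-client direction left out:

```c
/* Sketch: the request-forwarding half of a DHCP relay. It listens for
 * broadcast DHCP requests on UDP port 67, stamps the relay's address
 * into the giaddr field of the BOOTP header, and unicasts the packet
 * to a distant DHCP server. Addresses here are placeholders, and
 * forwarding the server's replies back to clients is left out. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

#define DHCP_SERVER_PORT 67
#define GIADDR_OFFSET    24   /* giaddr field in the BOOTP header */

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in local = {
        .sin_family = AF_INET,
        .sin_port   = htons(DHCP_SERVER_PORT),
        .sin_addr   = { .s_addr = INADDR_ANY },
    };
    if (fd < 0 || bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        perror("bind");
        return 1;
    }

    /* Placeholder addresses: the relay's own IP on the server-facing
     * network and the distant DHCP server to forward to. */
    struct in_addr relay_ip;
    inet_pton(AF_INET, "10.0.0.1", &relay_ip);
    struct sockaddr_in server = { .sin_family = AF_INET,
                                  .sin_port   = htons(DHCP_SERVER_PORT) };
    inet_pton(AF_INET, "10.255.0.10", &server.sin_addr);

    uint8_t pkt[1500];
    for (;;) {
        ssize_t n = recv(fd, pkt, sizeof(pkt), 0);
        if (n < GIADDR_OFFSET + 4)
            continue;
        /* Only fill giaddr if no other relay set it already. */
        if (memcmp(pkt + GIADDR_OFFSET, "\0\0\0\0", 4) == 0)
            memcpy(pkt + GIADDR_OFFSET, &relay_ip, 4);
        sendto(fd, pkt, (size_t)n, 0,
               (struct sockaddr *)&server, sizeof(server));
    }
}
```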
Using open source software enabled us to ship the features more quickly and, more importantly, to stay compliant with the standards.
If you have missed it, you can read the previous episode here
To pair with:
- Cool like me - Fryars
- Nausicaä of the Valley of Wind by Hayao Miyazaki