Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
Splitting Light: Season 2 - Episode 36
Published about 6 hours ago • 4 min read
Splitting light
Season 2 Episode 36
Before we miss the Thalys
If you are no longer interested in the newsletter, please unsubscribe
Around April 2019
Our Amsterdam cluster did not follow our optimal rack design, and that was a problem. There was a reason for it: it was our first batch of hardware, shipped there in a hurry and made to work so we could launch the Object Storage. Since then we had received additional hardware, and we now needed to bring the cluster into compliance.
Théo (a) did one of his signature moves: he had the hardware shipped, booked train tickets for part of the team, and suddenly we were on a high-speed Thalys train to Amsterdam. During the trip, we prepared the configuration changes and the steps of the maintenance. What we could plan for, we did.
Thalys TGV in Hoofddorp, Netherlands (1)
The first day on site, we toured the datacenter room, the spare parts room and the few cages Scaleway had around. We went to the hotel and made final adjustments for the next day.
We came back the next morning to carry out the procedure. We had to do several things while keeping the cluster running with no interruption. We needed to move the four server bays of the infrastructure, plus the two cold storage log servers, to a new rack. Then we had to add six new servers to the existing rack.
Moving storage infrastructure to a second rack and filling the space with storage servers
We started by moving what had no impact, disabling services that were not critical. The slow log servers went first.
We pinged the network team to tell them we were on site and needed assistance right away. They had not known we were going. They were not happy that we had not informed them beforehand, but they made the necessary configuration changes.
After the logs, we moved the infrastructure server by server. For each one, we drained the load balancers, deployed configuration, unplugged, moved, replugged, booted, checked everything and applied configuration again. Server by server. Service by service. The manual allocation of services, designed to survive a crashed server or a crashed server bay, really paid off. It was the critical element that let us avoid scheduling any downtime.
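The per-server routine above can be sketched as a simple ordered plan, one server at a time so the cluster never loses more than one member. This is a minimal illustration, not the real tooling; the step and server names are invented:

```python
# Hypothetical sketch of the move procedure: the step names mirror the
# text (drain, deploy, unplug, move, replug, boot, check, reapply), but
# the real Scaleway tooling is not shown in the original.

MOVE_STEPS = [
    "drain_load_balancers",
    "deploy_configuration",
    "unplug",
    "move_to_new_rack",
    "replug",
    "boot",
    "run_health_checks",
    "reapply_configuration",
]

def plan_moves(servers):
    """Flat, ordered list of (server, step) actions.

    Each server completes all its steps before the next one starts,
    so at most one cluster member is ever out of service.
    """
    return [(srv, step) for srv in servers for step in MOVE_STEPS]

plan = plan_moves(["infra-1", "infra-2"])  # hypothetical server names
```

The point of the strict ordering is the same as in the story: because services were spread so that losing one server had no customer impact, working strictly one server at a time needs no maintenance window.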
Services were spread out, we could lose a server or bay and have no customer impact
As soon as an infrastructure server was moved, we racked a storage server in its place, cabling both network and power. After booting them for final checks, we stopped just short of adding them to the cluster. While we were there, we also swapped a few parts that had died.
The cluster was well taken care of. As we executed each step, it stayed smooth. We were not wasting a minute. We had one last infrastructure server. It ran critical services, namely DHCP, TFTP and a few other important ones. We did the last checks, turned it off, moved it, and racked a storage server in the slot it had freed.
Once the infrastructure server was plugged back in, it would not boot into its operating system. The first real difficulty of the operation had appeared. Folays (b) dug into it, going through the BIOS settings and the UEFI command line. While he worked on it, we bundled cables together and did some other small operations on network devices for the network team, to be half forgiven.
We had a fixed deadline: a train to catch back to Paris. Folays finally got the server to boot and regained remote access over SSH. We packed our things as fast as possible and booked a taxi. By the time we would reach Amsterdam central station, the train would already have left. But it also stopped at Schiphol airport, which was closer to our location. We told the taxi to go there and arrived just in time to catch the train.
However, the work wasn't finished. While we were on the train, Folays was racing against time to bring the services back. The DHCP server had not come back online. As a safety measure we had set long lease times, but the leases were now reaching the end of their life. At some point, having gone through the DHCP state machine, the servers would release their IPs and everything would break down instantly. He worked fast. Finally, he made it work: DHCP was up again, the leases renewed, and after a few checks, we could breathe again. The laptops were closed and the team rested for the rest of the journey.
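The long leases bought time because of how the DHCP lease timeline works (RFC 2131): a client tries to renew at T1 (50% of the lease), rebinds at T2 (87.5%), and only gives up its address when the lease fully expires. A small sketch with an illustrative lease length (the real values are not in the text):

```python
# DHCP lease timeline per RFC 2131. The 24 h lease below is a made-up
# example; the text only says the leases were "long".

def lease_deadlines(lease_seconds):
    t1 = lease_seconds * 0.5       # try to renew with the original server
    t2 = lease_seconds * 0.875     # rebind: broadcast to any DHCP server
    return t1, t2, lease_seconds   # at expiry the client drops its IP

t1, t2, expiry = lease_deadlines(24 * 3600)
# With the DHCP server down, a client that last renewed just before the
# outage keeps its address until `expiry` - the window Folays was racing.
```

This is why the failure would have been instant and total once reached: clients that cross expiry without an answer must stop using the address, not merely log a warning.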
The next day, we had the new storage servers join the cluster. After cleaning up our changes, we made one more: the deployment automation now obtained IPs dynamically at boot but then set them as static once the server was up. We had learned our lesson.