Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
Splitting Light: Season 2 - Episode 36
Published about 6 hours ago • 4 min read
Splitting light
Season 2 Episode 36
Before we miss the Thalys
If you are no longer interested in the newsletter, please unsubscribe
Around April 2019
Our Amsterdam cluster did not follow our optimal rack design, and that was a problem. There was a reason for it: it was our first batch of hardware, shipped there in a hurry and made to work so we could launch the Object Storage. Since then we had received additional hardware, and we now needed to bring the cluster into compliance.
Théo (a) did one of his signature moves: he had the hardware shipped, booked train tickets for part of the team, and suddenly we were on a high-speed Thalys train to Amsterdam. During the trip, we prepared the configuration changes and the steps of the maintenance. What we could plan for, we did.
Thalys TGV in Hoofddorp, Netherlands (1)
The first day on site, we toured the datacenter room, the spare parts room and the few cages Scaleway had around. We went to the hotel and made final adjustments for the next day.
We came back the next morning to carry out the procedure. We had to do several things while keeping the cluster running with no interruption. We needed to move the four server bays of the infrastructure, plus the two cold storage log servers, to a new rack. Then we had to add six new servers to the existing rack.
Moving storage infrastructure to a second rack and filling the space with storage servers
We started by moving what had no impact, disabling services that were not critical. The slow log servers went first.
We pinged the network team to tell them we were on site and needed assistance right away. They had not known we were going. They were not happy that we had not informed them beforehand, but they made the necessary configuration changes.
After the logs, we moved the infrastructure server by server. For each one, we drained the load balancers, deployed configuration, unplugged, moved, replugged, booted, checked everything and applied configuration again. Server by server. Service by service. The manual allocation of services, designed to survive a crashed server or a crashed server bay, really paid off. It was the critical element that let us avoid scheduling any downtime.
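The per-server routine above can be sketched as a simple ordered plan, one server at a time so the cluster never loses more than one member. This is a minimal illustration, not the real tooling; the step and server names are invented:

```python
# Hypothetical sketch of the move procedure: the step names mirror the
# text (drain, deploy, unplug, move, replug, boot, check, reapply), but
# the real Scaleway tooling is not shown in the original.

MOVE_STEPS = [
    "drain_load_balancers",
    "deploy_configuration",
    "unplug",
    "move_to_new_rack",
    "replug",
    "boot",
    "run_health_checks",
    "reapply_configuration",
]

def plan_moves(servers):
    """Flat, ordered list of (server, step) actions.

    Each server completes all its steps before the next one starts,
    so at most one cluster member is ever out of service.
    """
    return [(srv, step) for srv in servers for step in MOVE_STEPS]

plan = plan_moves(["infra-1", "infra-2"])  # hypothetical server names
```

The point of the strict ordering is the same as in the story: because services were spread so that losing one server had no customer impact, working strictly one server at a time needs no maintenance window.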
Services were spread out, we could lose a server or bay and have no customer impact
As soon as an infrastructure server was moved, we racked a storage server in its place, cabling both network and power. After booting them for final checks, we stopped just short of adding them to the cluster. While we were there, we also swapped a few parts that had died.
The cluster was well taken care of. As we executed each step, it stayed smooth. We were not wasting a minute. We had one last infrastructure server. It ran critical services, namely DHCP, TFTP and a few other important ones. We did the last checks, turned it off, moved it, and racked a storage server in the slot it had freed.
Once the infrastructure server was plugged back in, it would not boot into its operating system. The first real difficulty of the operation had appeared. Folays (b) dug into it, going through the BIOS settings and the UEFI command line. While he worked on it, we bundled cables together and did some other small operations on network devices for the network team, to be half forgiven.
We had a fixed deadline: a train to catch back to Paris. Folays finally got the server to boot and regained remote access over SSH. We packed our things as fast as possible and booked a taxi. By the time we would reach Amsterdam central station, the train would already have left. But it also stopped at Schiphol airport, which was closer to our location. We told the taxi to go there and arrived just in time to catch the train.
However, the work wasn't finished. While we were on the train, Folays was racing against time to bring the services back. The DHCP server had not come back online. As a safety measure we had set long lease times, but the leases were now reaching the end of their life. At some point, having gone through the DHCP state machine, the servers would release their IPs and everything would break down instantly. He worked fast. Finally, he made it work: DHCP was up again, the leases renewed, and after a few checks, we could breathe again. The laptops were closed and the team rested for the rest of the journey.
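The long leases bought time because of how the DHCP lease timeline works (RFC 2131): a client tries to renew at T1 (50% of the lease), rebinds at T2 (87.5%), and only gives up its address when the lease fully expires. A small sketch with an illustrative lease length (the real values are not in the text):

```python
# DHCP lease timeline per RFC 2131. The 24 h lease below is a made-up
# example; the text only says the leases were "long".

def lease_deadlines(lease_seconds):
    t1 = lease_seconds * 0.5       # try to renew with the original server
    t2 = lease_seconds * 0.875     # rebind: broadcast to any DHCP server
    return t1, t2, lease_seconds   # at expiry the client drops its IP

t1, t2, expiry = lease_deadlines(24 * 3600)
# With the DHCP server down, a client that last renewed just before the
# outage keeps its address until `expiry` - the window Folays was racing.
```

This is why the failure would have been instant and total once reached: clients that cross expiry without an answer must stop using the address, not merely log a warning.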
The next day, we had the new storage servers join the cluster. After cleaning up our changes, we made one more: the deployment automation now obtained IPs dynamically at boot but then set them as static once the server was up. We had learned our lesson.