Splitting Light: Season 2 - Episode 36



Before we miss the Thalys


Around April 2019

Our Amsterdam cluster was different from the optimal design. It did not follow the rack design. That was a problem, but there was a reason for it. It was the first batch of hardware, and we had shipped it there quickly to get it working for the launch of Object Storage. Since then we had received additional hardware. We now needed to make the cluster compliant.

Théo (a) did one of his signature moves: he had the hardware shipped, booked train tickets for part of the team, and suddenly we were on a high-speed Thalys train to Amsterdam. During the trip, we prepared the configuration changes and the steps of the maintenance. What we could plan for, we did.

On the first day on site, we toured the datacenter room, the spare parts room and the few cages Scaleway had there. Then we went to the hotel and made final adjustments for the next day.

We came back the next morning to carry out the procedure. We had several things to do while keeping the cluster running without interruption. We needed to move the four server bays of the infrastructure, plus the two cold storage log servers, to a new rack. Then we had to add six new servers to the existing rack.

We started by moving things that had no impact, disabling services that were not critical. The slow log servers went first.

We pinged the network team to tell them we were on site and needed assistance right away. They didn’t know we had gone there, and they were not happy that we had not informed them beforehand. But they did the necessary configurations.

After the logs, we moved the infrastructure server by server. For each one, we drained the load balancers, deployed configuration, unplugged, moved, replugged, booted, checked everything and applied configuration again. Server by server. Service by service. Having allocated the services by hand, so that the cluster could survive the loss of a server or of a whole server bay, proved really handy. It was the critical element that let us avoid scheduling downtime.
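
That kind of allocation can be sanity-checked mechanically before unplugging anything. The following is a minimal sketch, not what the team actually ran: the server names, bay names and services are all invented for illustration. It lists, for each bay, the services that would have no instance left if that bay went down.

```python
from collections import defaultdict


def services_at_risk(placement, bay_of):
    """Map each bay to the services that would lose every instance if it went down.

    placement: service name -> set of servers running it (hypothetical layout)
    bay_of:    server name  -> bay that hosts it (hypothetical layout)
    """
    at_risk = defaultdict(list)
    for bay in sorted(set(bay_of.values())):
        for service, servers in sorted(placement.items()):
            # Instances that keep running when this bay is powered off.
            survivors = [s for s in servers if bay_of[s] != bay]
            if not survivors:
                at_risk[bay].append(service)
    return dict(at_risk)


# Invented example: "lb" is spread over two bays, "dhcp" lives on one server.
bay_of = {"infra-1": "bay-a", "infra-2": "bay-a", "infra-3": "bay-b"}
placement = {"lb": {"infra-1", "infra-3"}, "dhcp": {"infra-1"}}
report = services_at_risk(placement, bay_of)
```

An empty report means any single bay can be moved without downtime. A non-empty one names the services that need special care when their bay is touched.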

As soon as an infrastructure server was moved, we racked a storage server in its location, cabling both network and power. We booted the new servers for final checks and stopped just short of adding them to the cluster. While we were there, we also swapped a few parts that had died.

The cluster was well taken care of. As we executed each step, it stayed smooth. We were not wasting a minute. We had one last infrastructure server. That server ran critical services, namely DHCP, TFTP and a few other important ones. We did the last checks, turned it off, moved it and racked a storage server in its new position.

Once the infrastructure server was plugged back in, it would not boot into the operating system. The first real difficulty of the operation had appeared. Folays (b) dug into it, going through the BIOS settings and the UEFI command line. While he worked on it, we bundled cables together and did some other small operations on network devices for the network team. To be half forgiven.

We had a fixed deadline. We had to catch a train back to Paris. Folays finally got the server to boot and was able to get remote access over SSH. We packed our things as fast as possible and booked a taxi. By the time we reached Amsterdam Centraal station, the train would have left. But it also stopped at Schiphol airport, which was closer to our location. We told the taxi to go there and arrived just in time to catch the train.

However, the work wasn’t finished. While we were on the train, Folays was racing against time to make the services work again. The DHCP server had not come back online. To be safe, we had set long lease times, but the leases were now approaching expiry. At some point, having gone through the DHCP state machine, the servers would give up their IP addresses and everything would break down instantly. He was working fast. Finally, he made it work. DHCP worked again, the leases renewed and, after a few checks, we could breathe again. The laptops were closed and the team rested for the remainder of the journey.
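
To make the deadline concrete, here is a minimal sketch of the lease clock a DHCP client follows. It assumes the default timers from RFC 2131 (renewal attempts start at 50% of the lease, rebinding at 87.5%). The seven-day lease and the dates are invented for illustration, since the actual values we used are not given here.

```python
from datetime import datetime, timedelta


def lease_timeline(acquired, lease_seconds):
    """Return when a DHCP client changes state, using RFC 2131 default timers."""
    t1 = lease_seconds // 2      # T1: start renewing with the original server
    t2 = lease_seconds * 7 // 8  # T2: start rebinding with any server
    return {
        "renewing": acquired + timedelta(seconds=t1),
        "rebinding": acquired + timedelta(seconds=t2),
        "expired": acquired + timedelta(seconds=lease_seconds),
    }


def lease_state(now, acquired, lease_seconds):
    """Name the state a client is in at `now` if no server ever answers."""
    timeline = lease_timeline(acquired, lease_seconds)
    if now >= timeline["expired"]:
        return "expired"  # the address must no longer be used
    if now >= timeline["rebinding"]:
        return "rebinding"
    if now >= timeline["renewing"]:
        return "renewing"
    return "bound"


# Hypothetical lease: seven days, acquired on 1 April 2019 at noon.
acquired = datetime(2019, 4, 1, 12, 0)
week = 7 * 24 * 3600
timeline = lease_timeline(acquired, week)
```

With a dead DHCP server, a client keeps its address through the renewing and rebinding states and only loses it at expiry, which is why long leases buy time but not safety.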

The next day, we made the new storage servers join the cluster. After cleaning up our changes, we made one more. The deployment automation was changed so that servers still obtained their IP addresses dynamically at boot, but then set them as static once booted. We had learned our lesson.

But soon, we would all be shocked.

(1) Rob Dammers - Hoofddorp Thalys TGV-PBA 4534 trein 9340 Paris-Nord

(a) Théotime Rivière, Storage Product Manager then, now Founder of Freedom From Scratch

(b) Eric (Folays) Gouyer, Storage DevOps at the time, still at Scaleway


To pair with:

  • Cruising with you - Heartstreets
  • Le Cirque: Journal d'un dompteur de chaises (Not translated to English) by Iléana Surducan

Vincent Auclair
