Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
Splitting Light: Season 2 - Episode 14
Access key
We had reached the stage where we could no longer live with hardcoded credentials. We needed to plug into Scaleway’s authentication database.
Historically, Scaleway, the cloud computing division, had decided to design its own API. It used a mechanism called JSON Web Token (JWT) to authenticate, which only required a secret credential. On our side, S3 required two things: a secret key and an access key. The equivalent of a username and a password instead of just a password.
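To make the difference concrete, here is a minimal Go sketch of the AWS Signature Version 4 signing key derivation that S3 clients perform: the access key identifies the caller and travels with the request, while the secret key never leaves the client and only feeds the signature. The key values, date, region and service below are placeholders for illustration, not actual Scaleway values.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hmacSHA256 computes HMAC-SHA256 of data with the given key.
func hmacSHA256(key, data []byte) []byte {
	h := hmac.New(sha256.New, key)
	h.Write(data)
	return h.Sum(nil)
}

// signingKey derives the AWS Signature Version 4 signing key from the
// secret key and the request scope (date, region, service).
func signingKey(secretKey, date, region, service string) []byte {
	kDate := hmacSHA256([]byte("AWS4"+secretKey), []byte(date))
	kRegion := hmacSHA256(kDate, []byte(region))
	kService := hmacSHA256(kRegion, []byte(service))
	return hmacSHA256(kService, []byte("aws4_request"))
}

func main() {
	// The access key identifies who is signing (the "user"); the secret
	// key never travels on the wire, only the signature derived from it.
	accessKey := "SCWXXXXXXXXXXXXXXXXX" // hypothetical placeholder
	secretKey := "example-secret-key"  // hypothetical placeholder

	key := signingKey(secretKey, "20180615", "fr-par", "s3")
	fmt.Println("access key:", accessKey)
	fmt.Println("signing key:", hex.EncodeToString(key))
}
```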
Photo of the cats room with Florent and Théo (1)
The “user” equivalent did not exist in Scaleway’s information system. Some co-workers suggested that we change the S3 protocol to accommodate that; we promptly shut that down. We were S3 compatible, not an S3 lookalike. Loic and Florent started the process of specifying what we required and how it could be done. They were careful to look into many details. They interacted with the IAM & Billing team, led by Kevin, to have this introduced. It required a change in the database structure, so they had to plan it carefully. Any mistake would impact customers, autonomous workers and APIs; it had to be flawless.
Loic and Florent proposed introducing an access key with an identifiable component. Why? With an identifiable prefix, we could offer automated tools to check whether your credentials had been leaked.
Screenshot of an access key creation and the prefix (2)
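As an illustration of what such a leak scanner boils down to, the sketch below flags anything in a blob of text (a public repository file, a paste, a log dump) that matches the key format. The prefix and key length here are hypothetical; the real format is only hinted at by the screenshot above.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// accessKeyPattern is a hypothetical pattern: "SCW" followed by 17
// uppercase alphanumeric characters is only an illustration of an
// identifiable prefix, not the documented Scaleway format.
var accessKeyPattern = regexp.MustCompile(`\bSCW[0-9A-Z]{17}\b`)

// findLeakedKeys scans text line by line and returns every substring that
// looks like one of our access keys.
func findLeakedKeys(text string) []string {
	var hits []string
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		hits = append(hits, accessKeyPattern.FindAllString(scanner.Text(), -1)...)
	}
	return hits
}

func main() {
	sample := "config:\n  key: SCWABCDEFGHIJKLMNOPQ\n"
	fmt.Println(findLeakedKeys(sample))
}
```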
From then on, an existing piece of Python software was patched to support the access key and secret key combination. That code had been created for the first attempt at public Object Storage, five years back. That product had already flatlined. The purpose of this service was to bridge the authentication mechanism of OpenStack and the internal Scaleway information structure. We modified it to use the new combination. With an anonymized database dump, we could now test with different users and keys.
We first made modifications to ensure we had a functional service. The first pass was about getting the logic right: making sure we mapped a Scaleway organisation correctly to a user token. Then we started running light performance tests. The Python code got overwhelmed very quickly. Poor little Python, not able to keep up.
Left to right, Quentin, Nicolas and Maxime; Portrait of Xavier behind (3)
We decided to have it reimplemented in a more performant language. Quentin worked on porting the logic to Golang. I gave specific instructions not to use an Object-Relational Mapper (ORM), only plain SQL. I knew the usual performance impact, but more importantly the added complexity of an ORM; it wasn’t justified to onboard such code for a single query.
He reverse-engineered the Python code and came up with one big SQL query. We had it validated by the IAM team and plugged the code in. The Python version did several back-and-forth requests with the database, whereas our Golang version did a single request. It was a big query, but it was faster. After some tests, we swapped the implementations. We also added AWS Signature Version 4 (AWSv4) authentication support at that time.
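To give a rough idea of the shape of that service, here is a minimal sketch of a single-round-trip lookup using Go's plain database/sql, with no ORM. The table and column names are invented for illustration; the real IAM schema and query were obviously different and much bigger.

```go
package auth

import (
	"context"
	"database/sql"

	_ "github.com/lib/pq" // PostgreSQL driver; any database/sql driver works
)

// Credentials is what the S3 gateway needs to authenticate a request:
// the secret key to verify the signature and the organization that owns it.
type Credentials struct {
	OrganizationID string
	SecretKey      string
}

// LookupAccessKey resolves an access key in a single round trip to the
// database, instead of the several back-and-forth requests the Python
// version made.
func LookupAccessKey(ctx context.Context, db *sql.DB, accessKey string) (*Credentials, error) {
	const query = `
		SELECT o.id, k.secret_key
		FROM api_keys AS k
		JOIN organizations AS o ON o.id = k.organization_id
		WHERE k.access_key = $1
		  AND k.deleted_at IS NULL`

	var c Credentials
	if err := db.QueryRowContext(ctx, query, accessKey).Scan(&c.OrganizationID, &c.SecretKey); err != nil {
		return nil, err
	}
	return &c, nil
}
```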
Photos of Quentin and Florent (4)
A few weeks later, the IAM team rolled the database change out to production. From then on, we looked at quotas. OpenStack had a neat plugin system that we used heavily. Even though S3 was “infinite”, we relied on a finite amount of physical hardware. If a customer used up all the capacity, by mistake or maliciously, we needed to protect ourselves. Quentin replicated the authentication service into a quota service. Same constraints: simple service, big SQL query, Golang speed.
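The quota gate itself then comes down to comparing the usage numbers fetched by that kind of single query against the limits. A minimal sketch, with hypothetical fields and the convention that a zero limit means unlimited:

```go
package main

import "fmt"

// Quota mirrors what the quota service would fetch for an organization in
// one SQL query, like the authentication lookup. Field names are hypothetical.
type Quota struct {
	UsedBytes   int64
	MaxBytes    int64
	UsedObjects int64
	MaxObjects  int64
}

// exceeds reports whether accepting an incoming object of the given size
// would push the organization over one of its limits.
func (q Quota) exceeds(incomingBytes int64) bool {
	if q.MaxBytes > 0 && q.UsedBytes+incomingBytes > q.MaxBytes {
		return true
	}
	if q.MaxObjects > 0 && q.UsedObjects+1 > q.MaxObjects {
		return true
	}
	return false
}

func main() {
	q := Quota{UsedBytes: 90, MaxBytes: 100, UsedObjects: 3, MaxObjects: 10}
	fmt.Println(q.exceeds(20)) // true: 90+20 exceeds the byte limit
	fmt.Println(q.exceeds(5))  // false: still within both limits
}
```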
Nicolas, on his side, after switching from the AI team, worked on a small service to handle trust and safety requests from the support team. It had to interact carefully with the OpenIO and OpenStack code. We were the first of the new products to implement such a system.
Myself, I made sure everything was orchestrated correctly and that the architecture of the system was just right. Everything, software and humans, needed to work together.
We plugged all three into our deployment. We continued our testing. Bugs started to creep in. We had to identify them.