Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
Splitting Light: Season 2 - Episode 18
Controlling latency
We didn't increase our scope in iteration phases only to reduce risk or go faster. We also did it for customer-facing metrics. One in particular required some tradeoffs: latency. To be more precise, time to first byte.

Object storage is a generic way of storing and fetching data. At the time, the maximum you could store in a single object was 5 terabytes, while the minimum was just a few bytes. So a single request can legitimately take minutes, or even hours, to complete. That's a completely normal pattern. Measuring the total time of the request makes no sense as a latency metric. You have to work around that.

In the object storage context, what matters from a user perspective? It's when the customer starts receiving bytes. That metric is called "time to first byte" (TTFB).
Time to first byte sequence
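If you want to make that metric concrete, here is a minimal sketch of how a client could measure it, using Python's requests library; the URL is just a placeholder, not one of our endpoints.

```python
import time
import requests

# Placeholder object URL; any HTTP endpoint that streams a body will do.
URL = "https://objects.example.test/bucket/some-key"

start = time.monotonic()
# stream=True returns as soon as the response headers arrive,
# without downloading the body.
resp = requests.get(URL, stream=True, timeout=30)
# Pulling the first chunk marks the moment the first body byte lands.
first_byte = next(resp.iter_content(chunk_size=1))
ttfb = time.monotonic() - start
resp.close()

print(f"time to first byte: {ttfb * 1000:.1f} ms")
```

If you'd rather measure from the command line, curl exposes the same idea through its %{time_starttransfer} write-out variable.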
Anything on the software path that must run before the first byte is returned is in the critical path. Any time spent there increases TTFB and degrades the user experience. The thing is, before you can return a single byte, there are many steps: you need to route the request and check authentication, and if it's a write request, the quota and other elements too.
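To picture that, here is a toy handler in Python; the steps and timings are invented stand-ins, not our actual code, but everything above the first yield is exactly what sits in the TTFB critical path.

```python
import time

# Invented stand-ins: each sleep represents work that must finish
# before the first byte can leave.
def route(request):        time.sleep(0.0002); return "backend-1"
def authenticate(request): time.sleep(0.0020); return "user-42"  # remote credential lookup
def check_acl(user, req):  time.sleep(0.0004)

def handle_get(request):
    backend = route(request)          # critical path
    user = authenticate(request)      # critical path
    check_acl(user, request)          # critical path
    yield b"HTTP/1.1 200 OK\r\n\r\n"  # first bytes leave here: the TTFB clock stops
    for _ in range(3):                # streaming the body only affects total time
        yield b"chunk of object data"

start = time.monotonic()
next(handle_get({"bucket": "demo", "key": "hello"}))  # drive it up to the first byte
print(f"simulated TTFB: {(time.monotonic() - start) * 1000:.1f} ms")
```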
If we put the authentication software in physically distant racks, that alone adds time just for the request to travel the wires. You can't beat the speed of light in a wire.

Authentication sits behind a load balancer, or the request crosses several components; each of them adds precious time, so the whole path has to be examined. If the service is colocated but the credential database is two milliseconds away, those two milliseconds end up added to the TTFB.
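On paper, the budget is just an addition: every hop on the critical path contributes its share, and colocating a component only removes its own line. A back-of-the-envelope version, with placeholder numbers rather than our measurements:

```python
# Placeholder latency budget for the critical path before the first byte.
critical_path_ms = {
    "load balancer hop": 0.3,
    "request routing": 0.2,
    "auth service hop": 0.5,
    "credential DB round trip": 2.0,   # the "two milliseconds away" case
    "ACL / quota check": 0.4,
    "open object, read metadata": 1.5,
}

print(f"TTFB floor: {sum(critical_path_ms.values()):.1f} ms")

# Colocating the credential DB only shrinks its own contribution.
critical_path_ms["credential DB round trip"] = 0.2
print(f"after colocating the DB: {sum(critical_path_ms.values()):.1f} ms")
```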
View from the rooftop terrace (1)
So you unfold and unroll everything: every request path, every network path, every check. What is critical? What isn't? What has to be made as short as possible? What has to be redundant so the service doesn't stop working?
This is partly why we started building our own infrastructure: we could then co-host our services physically closer. But we had to make sure we were redundant enough. I remember hand-allocating the services so we would survive the loss of a server or a server bay. Which came in very handy a year later.
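That hand allocation boils down to one rule: no single server, and no single bay, may hold every instance of a service. A toy check of that rule, with made-up placements, could look like this:

```python
from collections import defaultdict

# Made-up placements: service -> list of (bay, server) it runs on.
placement = {
    "auth":     [("bay-1", "srv-03"), ("bay-2", "srv-11"), ("bay-3", "srv-07")],
    "gateway":  [("bay-1", "srv-04"), ("bay-1", "srv-05"), ("bay-1", "srv-06")],
    "metadata": [("bay-2", "srv-10"), ("bay-3", "srv-08"), ("bay-3", "srv-09")],
}

def survives(instances, domain):
    """True if losing any single failure domain still leaves an instance running."""
    counts = defaultdict(int)
    for bay, server in instances:
        counts[bay if domain == "bay" else server] += 1
    return len(counts) > 1   # more than one domain hosts the service

for service, instances in placement.items():
    print(f"{service}: survives server loss={survives(instances, 'server')}, "
          f"bay loss={survives(instances, 'bay')}")
```

Here the made-up "gateway" placement survives the loss of any one server but not the loss of bay-1, which is exactly the kind of mistake the allocation pass had to catch.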
The thing with latency is that this isn't enough. We ran our tests from the same datacenter, within the same network. Why? Why not run them from the office? Or from a customer location?

Because internet latency is not stable. Paths depend on peering rules, on the quality of the internet operator's network and its connectivity. The capacity. The bottlenecks. The fiber cuts… The internet saturates locally. All of that creates too much chaos outside of your own network to take accurate measurements. It's a good thing for the network team to monitor, but not stable enough to use as a performance test.
View of a few of the servers in the test cluster (2)
Anyhow, that was the network team's turf. We were out of our league once traffic left the internal network.

But the next checks would require us to understand the network better.