Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
Share
Splitting Light: Season 2 - Episode 18
Published 3 days ago • 3 min read
Splitting light
Season 2 Episode 18
Controlling latency
If you are no longer interested in the newsletter, please unsubscribe
We didn't only increase our scope in iteration phases to reduce risk or go faster. We also did it for customer facing metrics. One specifically required some tradeoffs; it was latency. To be more precise, time to first byte.
Object storage is a generic way of storing and fetching data. The maximum data you could store in a single object at the time was 5 terabytes but the minimum was just a few bytes. So when you are looking at the time it takes to complete a request, it can take up to a few minutes or hours to complete. It’s a completely normal pattern. Measuring the total time of the request makes no sense. You have to trick around that.
In the object storage context, what is important from a user perspective? It’s starting to receive data. What’s important is when the customer starts receiving bytes. That metric is called “time to first byte” (TTFB).
Time to first byte sequence
Anything on the software path that needs to be run before the first byte that is returned is in the critical path. Any time spent in that part results in an increased TTFB and lower user experience. The thing is, before you are able to return any byte, you have many steps. You need to route the request and check the authentication. If it’s a write request, the quota and other elements.
If we put the authentication software in racks physically distant, that will incur additional time just for the request to physically travel the wires. You can’t beat the speed of light in wires.
The authentication is behind a load balancer or crossing several components. Each of them adds precious time. The path has to be checked all along. If the service is colocated but the credential database is two milliseconds away, that two milliseconds ends up adding in the TTFB.
View from the rooftop terrace (1)
So you unfold and you unroll everything. Every request path, every network path and check. What is critical? What isn’t? What has to be made as short as possible? What has to be redundant otherwise the service will stop working.
This is partly why we started building our own infrastructure. We could then co-host our services physically closer. But we had to make sure we were redundant enough. I remember hand allocating the services to survive a server or a server bay loss. Which came very handy a year later.
The thing with latency, is that this isn’t enough. We did our tests from the same datacenter. Within the same network. Why? Why not do them from the office? Or from a customer location?
Because internet network latency is not stable. Paths depends on peering rules, on the quality of the internet operator network and it’s connectivity. The capacity. The bottlenecks. The fiber cuts… The internet saturates locally. This creates too much chaos outside of your own network to do accurate measures. It’s a good thing to monitor for the network team but not stable enough to use a performance test.
View of a few of the servers in the test cluster (2)
Anyhow, that was the network team’s turf. We were out of our league once it left the internal network.
But the next checks would require us to understand better the network.
Splitting light Season 2 Episode 17 Increasing scope If you are no longer interested in the newsletter, please unsubscribe Just as we were iterating to reduce risk and gain experience. We were also increasing the scope of our work as the iterations went. We didn’t expect to have to handle the hardware. Yet we learned how to choose it. How many servers we wanted per rack and why. Measuring the actual power load at full usage. With the help of the network team, choosing the right routers and...
Splitting light Season 2 Episode 16 Iterative process If you are no longer interested in the newsletter, please unsubscribe All throughout these first few months, from December 2017 to May 2018, we did iterations on object storage. This was both necessary to pick the skills and also reduce risks. The work to bring up an object storage is a subset to making a public facing object storage product. In the first case you control everything, in the second, you control much less. Quentin working in...
Splitting light Season 2 Episode 15 Internal identification If you are no longer interested in the newsletter, please unsubscribe End of May 2018 had arrived. I had an appointment. After doing a tattoo on my arm, I continued to look into tattoo art. Looking at different things and identifying more symbolism that I wanted. I remember reading Revenger by Alastair Reynolds, a mix between new space opera and pirates of the caribbean. Spaceships with light sales. I finally contacted the tattoo...