Business, tech, and life by a nerd. New every Tuesday: Splitting Light: The Prism of Growth and Discovery.
Splitting Light: Season 2 - Episode 18
Controlling latency
We didn't increase our scope in iteration phases only to reduce risk or go faster. We also did it for customer-facing metrics. One in particular required some tradeoffs: latency. To be more precise, time to first byte.

Object storage is a generic way of storing and fetching data. At the time, the maximum you could store in a single object was 5 terabytes, while the minimum was just a few bytes. So a single request can legitimately take minutes, or even hours, to complete. That's a completely normal pattern. Measuring the total time of the request makes no sense as a latency metric. You have to work around that.

In the object storage context, what matters from a user perspective? It's when the customer starts receiving bytes. That metric is called "time to first byte" (TTFB).
Time to first byte sequence
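If you want to make that metric concrete, here is a minimal sketch of how a client could measure it, using Python's requests library; the URL is just a placeholder, not one of our endpoints.

```python
import time
import requests

# Placeholder object URL; any HTTP endpoint that streams a body will do.
URL = "https://objects.example.test/bucket/some-key"

start = time.monotonic()
# stream=True returns as soon as the response headers arrive,
# without downloading the body.
resp = requests.get(URL, stream=True, timeout=30)
# Pulling the first chunk marks the moment the first body byte lands.
first_byte = next(resp.iter_content(chunk_size=1))
ttfb = time.monotonic() - start
resp.close()

print(f"time to first byte: {ttfb * 1000:.1f} ms")
```

If you'd rather measure from the command line, curl exposes the same idea through its %{time_starttransfer} write-out variable.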
Anything on the software path that must run before the first byte is returned is in the critical path. Any time spent there increases TTFB and degrades the user experience. The thing is, before you can return a single byte, there are many steps: you need to route the request and check authentication, and if it's a write request, the quota and other elements too.
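To picture that, here is a toy handler in Python; the steps and timings are invented stand-ins, not our actual code, but everything above the first yield is exactly what sits in the TTFB critical path.

```python
import time

# Invented stand-ins: each sleep represents work that must finish
# before the first byte can leave.
def route(request):        time.sleep(0.0002); return "backend-1"
def authenticate(request): time.sleep(0.0020); return "user-42"  # remote credential lookup
def check_acl(user, req):  time.sleep(0.0004)

def handle_get(request):
    backend = route(request)          # critical path
    user = authenticate(request)      # critical path
    check_acl(user, request)          # critical path
    yield b"HTTP/1.1 200 OK\r\n\r\n"  # first bytes leave here: the TTFB clock stops
    for _ in range(3):                # streaming the body only affects total time
        yield b"chunk of object data"

start = time.monotonic()
next(handle_get({"bucket": "demo", "key": "hello"}))  # drive it up to the first byte
print(f"simulated TTFB: {(time.monotonic() - start) * 1000:.1f} ms")
```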
If we put the authentication software in physically distant racks, that alone adds time just for the request to travel the wires. You can't beat the speed of light in a wire.

Authentication sits behind a load balancer, or the request crosses several components; each of them adds precious time, so the whole path has to be examined. If the service is colocated but the credential database is two milliseconds away, those two milliseconds end up added to the TTFB.
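On paper, the budget is just an addition: every hop on the critical path contributes its share, and colocating a component only removes its own line. A back-of-the-envelope version, with placeholder numbers rather than our measurements:

```python
# Placeholder latency budget for the critical path before the first byte.
critical_path_ms = {
    "load balancer hop": 0.3,
    "request routing": 0.2,
    "auth service hop": 0.5,
    "credential DB round trip": 2.0,   # the "two milliseconds away" case
    "ACL / quota check": 0.4,
    "open object, read metadata": 1.5,
}

print(f"TTFB floor: {sum(critical_path_ms.values()):.1f} ms")

# Colocating the credential DB only shrinks its own contribution.
critical_path_ms["credential DB round trip"] = 0.2
print(f"after colocating the DB: {sum(critical_path_ms.values()):.1f} ms")
```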
View from the rooftop terrace (1)
So you unfold and unroll everything: every request path, every network path, every check. What is critical? What isn't? What has to be made as short as possible? What has to be redundant so the service doesn't stop working?
This is partly why we started building our own infrastructure: we could then co-host our services physically closer. But we had to make sure we were redundant enough. I remember hand-allocating the services so we would survive the loss of a server or a server bay. Which came in very handy a year later.
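That hand allocation boils down to one rule: no single server, and no single bay, may hold every instance of a service. A toy check of that rule, with made-up placements, could look like this:

```python
from collections import defaultdict

# Made-up placements: service -> list of (bay, server) it runs on.
placement = {
    "auth":     [("bay-1", "srv-03"), ("bay-2", "srv-11"), ("bay-3", "srv-07")],
    "gateway":  [("bay-1", "srv-04"), ("bay-1", "srv-05"), ("bay-1", "srv-06")],
    "metadata": [("bay-2", "srv-10"), ("bay-3", "srv-08"), ("bay-3", "srv-09")],
}

def survives(instances, domain):
    """True if losing any single failure domain still leaves an instance running."""
    counts = defaultdict(int)
    for bay, server in instances:
        counts[bay if domain == "bay" else server] += 1
    return len(counts) > 1   # more than one domain hosts the service

for service, instances in placement.items():
    print(f"{service}: survives server loss={survives(instances, 'server')}, "
          f"bay loss={survives(instances, 'bay')}")
```

Here the made-up "gateway" placement survives the loss of any one server but not the loss of bay-1, which is exactly the kind of mistake the allocation pass had to catch.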
The thing with latency is that this isn't enough. We ran our tests from the same datacenter, within the same network. Why? Why not run them from the office? Or from a customer location?

Because internet latency is not stable. Paths depend on peering rules, on the quality of the internet operator's network and its connectivity. The capacity. The bottlenecks. The fiber cuts… The internet saturates locally. All of that creates too much chaos outside of your own network to take accurate measurements. It's a good thing for the network team to monitor, but not stable enough to use as a performance test.
View of a few of the servers in the test cluster (2)
Anyhow, that was the network team's turf. We were out of our league once traffic left the internal network.

But the next checks would require us to understand the network better.