Benchmarks are hard
Benchmarking generally means producing measurements from a specific program or workload, typically to understand and compare its performance characteristics.
Benchmarks can be useful for:
- Optimizing costly workloads.
- Understanding how the underlying platform impacts the cost.
Benchmarking can give you insights about your workload along various dimensions, such as the CPU cycles spent or the memory allocated for a given task. These measurements (even though they might come from idealized environments) can hint at the cost and help you pick the right algorithm or optimize an overwhelmingly expensive workload.
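As a minimal sketch of what such a measurement looks like, a Go microbenchmark (the package and workload names below are made up for illustration) can report both CPU time per operation and allocations per operation:

```go
package work

import (
	"strings"
	"testing"
)

// BenchmarkJoin measures the cost of a hypothetical task: joining a
// fixed set of strings. The testing framework reports ns/op, and
// ReportAllocs adds allocs/op and B/op to the output.
func BenchmarkJoin(b *testing.B) {
	parts := []string{"a", "b", "c", "d"}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = strings.Join(parts, ",")
	}
}
```

Running it with `go test -bench=Join -benchmem` prints the per-operation time and allocation figures.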
Benchmarking can also give you insights about the underlying runtime, operating system, and hardware, and how each of these layers affects your performance even if you don't change the code. For example, you might run the same suite of benchmarks on new hardware to estimate its impact on certain calls.
Benchmarking requires a deep understanding of all layers you depend on. Consider CPU benchmarking. Any aspect below can significantly impact your results:
- Whether the data is available in the CPU cache or not.
- Whether you hit a garbage collection or not.
- Whether the compiler has optimized some cases or not.
- Whether there is concurrency or not.
- Whether you are sharing the cores with anything else.
- Whether your cores are virtual or physical.
- Whether the branch predictor will behave the same way or not.
- Whether the code you are benchmarking has any side effects or not. Compilers are really good at optimizing away code that has no impact on the global state (see the sketch after this list).
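For example (a sketch, not taken from any particular codebase), a benchmark body whose result is never used may be partially or entirely optimized away, so a common defensive pattern is to sink the result into a package-level variable:

```go
package work

import "testing"

// sum is a package-level sink. Writing the result here gives the loop
// an observable side effect, which discourages the compiler from
// eliminating the work being measured.
var sum int

func add(a, b int) int { return a + b }

// BenchmarkAddDiscarded risks measuring nothing: the result is never
// used, so the compiler is free to optimize the call away,
// especially after inlining.
func BenchmarkAddDiscarded(b *testing.B) {
	for i := 0; i < b.N; i++ {
		add(i, i)
	}
}

// BenchmarkAddSunk stores the result, keeping the measured work alive.
func BenchmarkAddSunk(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r = add(i, i)
	}
	sum = r
}
```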
This is why, when we talk about benchmarking, it is not good practice to limit ourselves to the user-space code. Designing and evaluating benchmarks can turn into a detective's job, and complicated, long-running workloads are harder to benchmark reliably. It becomes harder to do the right accounting and figure out whether the cost comes from contention, context switches, caching, the garbage collector, compiler optimizations, or the user-space code.
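As one hedged example of that accounting work, a Go benchmark can read runtime statistics before and after the run and attach them as custom metrics, so a regression can at least be attributed to the garbage collector rather than the user-space code (the workload function here is a hypothetical stand-in):

```go
package work

import (
	"runtime"
	"testing"
)

// workload is a stand-in for the code under test; it allocates to
// make the garbage collector visible in the measurements.
func workload() []byte { return make([]byte, 1<<20) }

// BenchmarkWorkloadGC reports how many GC cycles ran per operation
// while the benchmark loop executed, alongside the usual ns/op figures.
func BenchmarkWorkloadGC(b *testing.B) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = workload()
	}
	b.StopTimer()
	runtime.ReadMemStats(&after)
	b.ReportMetric(float64(after.NumGC-before.NumGC)/float64(b.N), "GCs/op")
}
```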
Even though microbenchmarks can give some insights, they cannot replicate how the workload is going to perform in production. Replicating an average production environment for microbenchmarking purposes is almost an impossible task. This is why microbenchmarks don't always serve as a good starting point if you want to evaluate your production performance.
To conclude, replicating production-like situations is not trivial, and understanding the impact of the underlying stack requires a lot of experience and expertise. Designing and evaluating benchmarks is not easy. Examining production services, understanding their call patterns, and pointing out the hot calls may provide better insights about your production performance. In future articles, I will cover how we think about performance evaluation in production.