# Benchmarking `llvm-libc`'s memory functions
## Foreword
Microbenchmarks are valuable tools to assess and compare the performance of
isolated pieces of code. However, they don't capture all interactions of complex
systems, so other metrics can be equally important:
- **code size** (to reduce instruction cache pressure),
- **Profile Guided Optimization** friendliness,
- **hyperthreading / multithreading** friendliness.
## Rationale
The goal here is to satisfy the [Benchmarking
Principles](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Benchmarking_Principles).
1. **Relevance**: Benchmarks should measure relatively vital features.
2. **Representativeness**: Benchmark performance metrics should be broadly
accepted by industry and academia.
3. **Equity**: All systems should be fairly compared.
4. **Repeatability**: Benchmark results can be verified.
5. **Cost-effectiveness**: Benchmark tests are economical.
6. **Scalability**: Benchmark tests should measure from single server to
multiple servers.
7. **Transparency**: Benchmark metrics should be easy to understand.
Benchmarking is a [subtle
art](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Challenges) and
benchmarking memory functions is no exception. Here we'll dive into
peculiarities of designing good microbenchmarks for `llvm-libc` memory
functions.
## Challenges
As seen in the [README.md](README.md#benchmarking-regimes), the microbenchmarking
facility should focus on measuring **low latency code**. If copying a few bytes
takes on the order of a few cycles, the benchmark should be able to **measure
accurately down to the cycle**.
### Measuring instruments
There are different sources of time in a computer (ordered from high to low resolution):
- [Performance
Counters](https://en.wikipedia.org/wiki/Hardware_performance_counter): used to
introspect the internals of the CPU,
- [High Precision Event
Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer): used to
trigger short lived actions,
- [Real-Time Clocks (RTC)](https://en.wikipedia.org/wiki/Real-time_clock): used
to keep track of the computer's time.
In theory **Performance Counters** provide cycle accurate measurement via the
`cpu cycles` event. But as we'll see, they are not really practical in this
context.
### Performance counters and modern processor architecture
Modern CPUs are [out of
order](https://en.wikipedia.org/wiki/Out-of-order_execution) and
[superscalar](https://en.wikipedia.org/wiki/Superscalar_processor). As a
consequence, it is [hard to know what is included when the counter is
read](https://en.wikipedia.org/wiki/Hardware_performance_counter#Instruction_based_sampling):
some instructions may still be **in flight**, and others may be executing
[**speculatively**](https://en.wikipedia.org/wiki/Speculative_execution). As a
matter of fact, **measuring the same piece of code twice on the same machine can
yield different results.**
### Performance counters semantics inconsistencies and availability
Although they have the same name, the exact semantics of performance counters
are micro-architecture dependent: **it is generally not possible to compare two
micro-architectures exposing the same performance counters.**
Each vendor decides which performance counters to implement and their exact
meaning. Although we want to benchmark `llvm-libc` memory functions for all
available [target
triples](https://clang.llvm.org/docs/CrossCompilation.html#target-triple), there
are **no guarantees that the counter we're interested in is available.**
### Additional imprecisions
- Reading performance counters is done through Kernel [System
calls](https://en.wikipedia.org/wiki/System_call). The System call itself
is costly (hundreds of cycles) and will perturb the counter's value.
- [Interrupts](https://en.wikipedia.org/wiki/Interrupt#Processor_response)
can occur during measurement.
- If the system is already under monitoring (virtual machines or system wide
profiling) the kernel can decide to multiplex the performance counters
leading to lower precision or even completely missing the measurement.
- The Kernel can decide to [migrate the
process](https://en.wikipedia.org/wiki/Process_migration) to a different
core.
- [Dynamic frequency
scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling) can kick
in during the measurement and change the ticking duration. **Ultimately we
care about the amount of work over a period of time**. This removes some
legitimacy of measuring cycles rather than **raw time**.
### Cycle accuracy conclusion
We have seen that performance counters are: not widely available, semantically
inconsistent across micro-architectures and imprecise on modern CPUs for small
snippets of code.
## Design decisions
To achieve the needed precision we have to resort to more widely
available counters and derive the time from a high number of runs: going from a
single deterministic measure to a probabilistic one.
**To get a good signal to noise ratio we need the running time of the piece of
code to be orders of magnitude greater than the measurement precision.**
For instance, if the measurement precision is 10 cycles, the function
runtime must exceed 1000 cycles to keep the measurement error below 1% (a
[signal-to-noise ratio](https://en.wikipedia.org/wiki/Signal-to-noise_ratio) of 100).
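
The arithmetic above can be checked with a few lines of C++. This is only an illustrative sketch: `relative_error` and `min_runtime_cycles` are hypothetical helper names, not part of the benchmarking framework.

```cpp
#include <cassert>

// Hypothetical helper: fraction of the measured runtime that may be noise,
// given the precision of the measuring instrument (both in cycles).
inline double relative_error(double precision_cycles, double runtime_cycles) {
  assert(runtime_cycles > 0.0);
  return precision_cycles / runtime_cycles;
}

// Hypothetical helper: smallest runtime (in cycles) that keeps the relative
// error at or below `max_relative_error` for the given precision.
inline double min_runtime_cycles(double precision_cycles,
                                 double max_relative_error) {
  assert(max_relative_error > 0.0);
  return precision_cycles / max_relative_error;
}
```

With a precision of 10 cycles, `min_runtime_cycles(10.0, 0.01)` is about 1000 cycles, which matches the example in the text.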
### Repeating code N-times until precision is sufficient
The algorithm is as follows:
- We measure the time it takes to run the code _N_ times (Initially _N_ is 10
for instance)
- We deduce an approximation of the runtime of one iteration (= _runtime_ /
_N_).
- We increase _N_ by _X%_ and repeat the measurement (geometric progression).
- We keep track of the _one iteration runtime approximation_ and build a
weighted mean of all the samples so far (weight is proportional to _N_)
- We stop the process when the difference between the weighted mean and the
last estimation is smaller than _ε_ or when other stopping conditions are
met (total runtime, maximum iterations or maximum sample count).
This method allows us to be as precise as needed provided that the measured
runtime is proportional to _N_. Longer run times also smooth out imprecision
related to _interrupts_ and _context switches_.
Note: When measuring longer runtimes (e.g. copying several megabytes of data)
the above assumption doesn't hold anymore and the _ε_ precision cannot be
reached by increasing iterations. The whole benchmarking process becomes
prohibitively slow. In this case the algorithm is limited to a single sample and
repeated several times to get a decent 95% confidence interval.
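
The iterative procedure described in this section can be sketched as follows. This is a minimal illustration rather than the framework's actual implementation: the `RepeatOptions` struct, its default values, the relative interpretation of _ε_, and the injected `time_n_runs` callable are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>

// Hypothetical knobs; names and defaults are illustrative only.
struct RepeatOptions {
  std::size_t initial_iterations = 10;   // initial N
  double growth = 1.4;                   // N grows by 40% per round
  double epsilon = 0.01;                 // relative stopping threshold
  std::size_t max_samples = 1000;        // stop after this many samples
  std::size_t max_iterations = 10000000; // upper bound on N
};

// `time_n_runs(N)` must return the total elapsed time of N back-to-back runs
// of the code under test. Returns the weighted mean of the per-iteration
// estimates, where each sample is weighted by its N.
inline double estimate_runtime_per_iteration(
    const std::function<double(std::size_t)> &time_n_runs,
    const RepeatOptions &options = RepeatOptions()) {
  double weighted_sum = 0.0; // sum of estimate * N
  double total_weight = 0.0; // sum of N
  double mean = 0.0;
  std::size_t n = options.initial_iterations;
  for (std::size_t sample = 0;
       sample < options.max_samples && n <= options.max_iterations;
       ++sample) {
    const double weight = static_cast<double>(n);
    const double estimate = time_n_runs(n) / weight;
    weighted_sum += estimate * weight;
    total_weight += weight;
    mean = weighted_sum / total_weight;
    // Stop once the latest estimate agrees with the weighted mean.
    // At least two samples are required for the comparison to be meaningful.
    if (sample > 0 && std::fabs(mean - estimate) <= options.epsilon * mean)
      break;
    // Geometric progression, always advancing by at least one iteration.
    const std::size_t grown =
        static_cast<std::size_t>(weight * options.growth);
    n = std::max(n + 1, grown);
  }
  return mean;
}
```

In practice `time_n_runs` would wrap a clock read around a loop that calls the function under test _N_ times. Injecting it as a callable keeps the sketch independent of any particular time source.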
### Effect of branch prediction
When measuring code with branches, repeating the same call again and again will
allow the processor to learn the branching patterns and perfectly predict all
the branches, leading to unrealistic results.
**Decision: When benchmarking small buffer sizes, the function parameters should
be randomized between calls to prevent perfect branch predictions.**
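
One way to apply this decision is to draw the sizes ahead of time and replay them inside the measured loop. The sketch below assumes `memcpy` as the function under test and uses a fixed seed for repeatability. `random_sizes` and `copy_with_random_sizes` are hypothetical names chosen for the example.

```cpp
#include <cstddef>
#include <cstring>
#include <random>
#include <vector>

// Draws `count` sizes uniformly from [min_size, max_size]. The seed is fixed
// so that two runs of the benchmark replay the same sequence.
inline std::vector<std::size_t> random_sizes(std::size_t min_size,
                                             std::size_t max_size,
                                             std::size_t count,
                                             unsigned seed = 42) {
  std::mt19937 generator(seed);
  std::uniform_int_distribution<std::size_t> distribution(min_size, max_size);
  std::vector<std::size_t> sizes(count);
  for (std::size_t &size : sizes)
    size = distribution(generator);
  return sizes;
}

// Calls memcpy once per pre-generated size. The sizes are drawn before the
// loop so that the cost of the random number generator stays out of the
// measured region. A real benchmark must also keep the compiler from eliding
// the copies.
inline void copy_with_random_sizes(char *dst, const char *src,
                                   const std::vector<std::size_t> &sizes) {
  for (std::size_t size : sizes)
    std::memcpy(dst, src, size);
}
```

Because consecutive calls see different sizes, the size-dependent branches inside the function are harder for the processor to predict perfectly.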
### Effect of the memory subsystem
The CPU is tightly coupled to the memory subsystem. It is common to see `L1`,
`L2` and `L3` data caches.
We may be tempted to randomize data accesses widely to exercise all the caching
layers down to RAM but the [cost of accessing lower layers of
memory](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html)
completely dominates the runtime for small sizes.
So to respect **Equity** and **Repeatability** principles we should make sure we
**do not** depend on the memory subsystem.
**Decision: When benchmarking small buffer sizes, the data accessed by the
function should stay in `L1`.**
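
A minimal sketch of how accesses can be randomized while staying inside an `L1`-sized working set is shown below. The 32 KiB cache size is an assumption for illustration and should be queried on the actual target. `random_offsets` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Assumption: a 32 KiB L1 data cache. Query the real value on the target.
constexpr std::size_t kAssumedL1DataCacheBytes = 32 * 1024;

// Returns `count` random start offsets such that an access of up to
// `max_access_size` bytes starting at any of them stays inside a buffer of
// `buffer_size` bytes.
inline std::vector<std::size_t> random_offsets(std::size_t buffer_size,
                                               std::size_t max_access_size,
                                               std::size_t count,
                                               unsigned seed = 42) {
  assert(max_access_size <= buffer_size);
  std::mt19937 generator(seed);
  std::uniform_int_distribution<std::size_t> distribution(
      0, buffer_size - max_access_size);
  std::vector<std::size_t> offsets(count);
  for (std::size_t &offset : offsets)
    offset = distribution(generator);
  return offsets;
}
```

For a function with a source and a destination, each buffer could for instance be given at most `kAssumedL1DataCacheBytes / 2` bytes so that both fit in the cache together.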
### Effect of prefetching
For small buffer sizes,
[prefetching](https://en.wikipedia.org/wiki/Cache_prefetching) should not kick
in, but for large buffers it may introduce a bias.
**Decision: When benchmarking large buffer sizes, the data should be accessed in
a random fashion to lower the impact of prefetching between calls.**
### Effect of dynamic frequency scaling
Modern processors implement [dynamic frequency
scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling). In so-called
`performance` mode the CPU will increase its frequency and run faster than usual
within [some limits](https://en.wikipedia.org/wiki/Intel_Turbo_Boost): _"The
increased clock rate is limited by the processor's power, current, and thermal
limits, the number of cores currently in use, and the maximum frequency of the
active cores."_
**Decision: When benchmarking we want to make sure the dynamic frequency scaling
is always set to `performance`. We also want to make sure that the time based
events are not impacted by frequency scaling.**
See [README.md](README.md) on how to set this up.
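
As an illustration, a benchmark could refuse to run when the governor is not `performance`. The sketch below is Linux-specific and assumes the `cpufreq` sysfs interface is present. The function names are hypothetical and this is not how the framework itself performs the check.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Returns true when the first whitespace-delimited token of
// `sysfs_contents` is "performance".
inline bool is_performance_governor(const std::string &sysfs_contents) {
  std::istringstream stream(sysfs_contents);
  std::string governor;
  return (stream >> governor) && governor == "performance";
}

// Linux-specific: reads the cpufreq governor of one logical CPU from sysfs.
// Returns false when the file is missing (e.g. no cpufreq driver) or when
// another governor is active.
inline bool cpu_uses_performance_governor(unsigned cpu) {
  std::ifstream file("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                     "/cpufreq/scaling_governor");
  std::string contents;
  return std::getline(file, contents) && is_performance_governor(contents);
}
```

Parsing is kept separate from file access so that the check can be exercised without depending on the machine it runs on.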
### Reserved and pinned cores
Some operating systems allow [core
reservation](https://stackoverflow.com/questions/13583146/whole-one-core-dedicated-to-single-process).
It removes a set of perturbation sources such as process migration, context
switches and interrupts. When a core is hyperthreaded, both of its logical
cores (hardware threads) should be reserved.
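
Pinning can be done from within the benchmark process. The sketch below is Linux-specific and relies on the glibc `sched_setaffinity` wrapper. `pin_calling_thread_to_core` is a hypothetical helper name, and reserving the core so that nothing else is scheduled on it remains a separate, OS-level configuration step.

```cpp
#include <sched.h>

// Linux-specific sketch: restricts the calling thread to a single logical
// core. Returns false if the core index is out of range or the kernel
// rejects the request.
inline bool pin_calling_thread_to_core(int core) {
  if (core < 0 || core >= CPU_SETSIZE)
    return false;
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  // A pid of 0 designates the calling thread.
  return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

Calling this once at startup, before any measurement, prevents the kernel from migrating the benchmark to another core mid-run.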
## Microbenchmarks limitations
As stated in the Foreword section, a number of effects play a role in
production but are not directly measurable through microbenchmarks. The code
size of the benchmark is (much) smaller than the hot code of real applications
and **doesn't exhibit instruction cache pressure as much**.
### iCache pressure
Fundamental functions that are called frequently will occupy the L1 iCache
([illustration](https://en.wikipedia.org/wiki/CPU_cache#Example:_the_K8)). If
they are too big they will prevent other hot code from staying in the cache and
incur [stalls](https://en.wikipedia.org/wiki/CPU_cache#CPU_stalls). So the memory
functions should be as small as possible.
### iTLB pressure
The same reasoning applies to the instruction Translation Lookaside Buffer
([iTLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)): code that
is too big spans more pages and incurs
[TLB
misses](https://en.wikipedia.org/wiki/Translation_lookaside_buffer#TLB-miss_handling).
## FAQ
1. Why don't you use Google Benchmark directly?

   We reuse some parts of Google Benchmark (detection of frequency scaling, CPU
   cache hierarchy information) but when it comes to measuring memory
   functions Google Benchmark has a few issues:

   - Google Benchmark favors code-based configuration via macros and
     builders. It is typically done in a static manner. In our case the
     parameters we need to set up are a mix of what's usually controlled by
     the framework (number of trials, maximum number of iterations, size
     ranges) and parameters that are more tied to the function under test
     (randomization strategies, custom values). Achieving this with Google
     Benchmark is cumbersome as it involves templated benchmarks and
     duplicated code. In the end, the configuration would be spread across
     command line flags (via the framework's options or custom flags) and code
     constants.
   - Output of the measurements is done through a `BenchmarkReporter` class,
     which makes it hard to access the parameters discussed above.