# Benchmarking `llvm-libc`'s memory functions

## Foreword

Microbenchmarks are valuable tools to assess and compare the performance of
isolated pieces of code. However, they don't capture all interactions of complex
systems, so other metrics can be equally important:

- **code size** (to reduce instruction cache pressure),
- **Profile Guided Optimization** friendliness,
- **hyperthreading / multithreading** friendliness.

## Rationale

The goal here is to satisfy the [Benchmarking
Principles](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Benchmarking_Principles).

1. **Relevance**: Benchmarks should measure relatively vital features.
2. **Representativeness**: Benchmark performance metrics should be broadly
   accepted by industry and academia.
3. **Equity**: All systems should be fairly compared.
4. **Repeatability**: Benchmark results can be verified.
5. **Cost-effectiveness**: Benchmark tests are economical.
6. **Scalability**: Benchmark tests should measure from single server to
   multiple servers.
7. **Transparency**: Benchmark metrics should be easy to understand.

Benchmarking is a [subtle
art](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Challenges) and
benchmarking memory functions is no exception. Here we'll dive into
peculiarities of designing good microbenchmarks for `llvm-libc` memory
functions.

## Challenges

As seen in the [README.md](README.md#benchmarking-regimes), the microbenchmarking
facility should focus on measuring **low latency code**. If copying a few bytes
takes on the order of a few cycles, the benchmark should be able to **measure
accurately down to the cycle**.

### Measuring instruments

There are different sources of time in a computer (ordered from high to low
resolution):

- [Performance
  Counters](https://en.wikipedia.org/wiki/Hardware_performance_counter): used to
  introspect the internals of the CPU,
- [High Precision Event
  Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer): used to
  trigger short lived actions,
- [Real-Time Clocks (RTC)](https://en.wikipedia.org/wiki/Real-time_clock): used
  to keep track of the computer's time.

In theory, **Performance Counters** provide cycle accurate measurement via the
`cpu cycles` event. But as we'll see, they are not really practical in this
context.

### Performance counters and modern processor architecture

Modern CPUs are [out of
order](https://en.wikipedia.org/wiki/Out-of-order_execution) and
[superscalar](https://en.wikipedia.org/wiki/Superscalar_processor). As a
consequence, it is [hard to know what is included when the counter is
read](https://en.wikipedia.org/wiki/Hardware_performance_counter#Instruction_based_sampling):
some instructions may still be **in flight**, some others may be executing
[**speculatively**](https://en.wikipedia.org/wiki/Speculative_execution). As a
matter of fact, **on the same machine, measuring the same piece of code twice
will yield different results.**

### Performance counters semantics inconsistencies and availability

Although they have the same name, the exact semantics of performance counters
are micro-architecture dependent: **it is generally not possible to compare two
micro-architectures exposing the same performance counters.**

Each vendor decides which performance counters to implement and their exact
meaning. Although we want to benchmark `llvm-libc` memory functions for all
available [target
triples](https://clang.llvm.org/docs/CrossCompilation.html#target-triple), there
are **no guarantees that the counter we're interested in is available.**

### Additional imprecisions

- Reading performance counters is done through kernel [system
  calls](https://en.wikipedia.org/wiki/System_call). The system call itself
  is costly (hundreds of cycles) and will perturb the counter's value.
- [Interrupts](https://en.wikipedia.org/wiki/Interrupt#Processor_response)
  can occur during measurement.
- If the system is already under monitoring (virtual machines or system wide
  profiling), the kernel can decide to multiplex the performance counters,
  leading to lower precision or even completely missing the measurement.
- The kernel can decide to [migrate the
  process](https://en.wikipedia.org/wiki/Process_migration) to a different
  core.
- [Dynamic frequency
  scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling) can kick
  in during the measurement and change the duration of a cycle. **Ultimately we
  care about the amount of work over a period of time**. This weakens the case
  for measuring cycles rather than **raw time**.

### Cycle accuracy conclusion

We have seen that performance counters are not widely available, are
semantically inconsistent across micro-architectures, and are imprecise on
modern CPUs for small snippets of code.

## Design decisions

In order to achieve the needed precision we would need to resort to more widely
available counters and derive the time from a high number of runs: going from a
single deterministic measure to a probabilistic one.

**To get a good signal to noise ratio we need the running time of the piece of
code to be orders of magnitude greater than the measurement precision.**

For instance, if the measurement precision is 10 cycles, the function runtime
must exceed 1000 cycles to keep the measurement error below 1% of the signal
(see [SNR](https://en.wikipedia.org/wiki/Signal-to-noise_ratio)).

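The arithmetic behind this rule of thumb can be written down as a one-line
helper. This is only an illustrative sketch: the name `min_runtime_cycles` is
hypothetical and is not part of the benchmarking code.

```cpp
#include <cassert>

// Hypothetical helper (not part of the benchmark framework): returns the
// smallest runtime, in cycles, for which a measurement error of
// `precision_cycles` stays within `max_relative_error` of the measured value.
double min_runtime_cycles(double precision_cycles, double max_relative_error) {
  assert(precision_cycles >= 0.0 && max_relative_error > 0.0);
  return precision_cycles / max_relative_error;
}
```

With a precision of 10 cycles and a 1% target, `min_runtime_cycles(10.0, 0.01)`
gives roughly 1000 cycles, matching the example above.
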
### Repeating code N-times until precision is sufficient

The algorithm is as follows:

- We measure the time it takes to run the code _N_ times (initially _N_ is 10,
  for instance).
- We deduce an approximation of the runtime of one iteration (= _runtime_ /
  _N_).
- We increase _N_ by _X%_ and repeat the measurement (geometric progression).
- We keep track of the _one iteration runtime approximation_ and build a
  weighted mean of all the samples so far (weight is proportional to _N_).
- We stop the process when the difference between the weighted mean and the
  last estimation is smaller than _ε_ or when other stopping conditions are
  met (total runtime, maximum iterations or maximum sample count).

This method allows us to be as precise as needed provided that the measured
runtime is proportional to _N_. Longer run times also smooth out imprecision
related to _interrupts_ and _context switches_.

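The steps above can be sketched as follows. This is a minimal illustration and
not the framework's actual implementation: the names `Options`,
`estimate_seconds_per_run` and `time_with_steady_clock` are hypothetical, the
50% growth factor is an arbitrary choice for _X_, and _ε_ is assumed to be a
threshold relative to the weighted mean. The timing source is passed in as a
callable so that the stopping logic can be exercised independently of any real
clock.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <functional>

// Hypothetical knobs; names and defaults are illustrative only.
struct Options {
  std::size_t initial_iterations = 10; // initial N
  double growth = 1.5;                 // N grows by 50% after each sample
  double epsilon = 0.01;               // stopping threshold, relative to the mean
  std::size_t max_samples = 1000;      // safety stop
};

// `time_n_runs(n)` must return the total time, in seconds, of n back-to-back
// runs of the code under test. Returns the weighted mean of the per-run
// estimates, each sample being weighted by its iteration count.
double estimate_seconds_per_run(
    const std::function<double(std::size_t)> &time_n_runs,
    const Options &opts = Options()) {
  assert(opts.initial_iterations > 0 && opts.growth > 1.0);
  double weighted_sum = 0.0; // sum of N_i * estimate_i
  double total_weight = 0.0; // sum of N_i
  std::size_t n = opts.initial_iterations;
  for (std::size_t sample = 0; sample < opts.max_samples; ++sample) {
    const double estimate = time_n_runs(n) / static_cast<double>(n);
    weighted_sum += estimate * static_cast<double>(n);
    total_weight += static_cast<double>(n);
    const double mean = weighted_sum / total_weight;
    // Stop once the latest estimate agrees with the running weighted mean.
    // The first sample is skipped because it trivially equals the mean.
    if (sample > 0 && std::abs(mean - estimate) <= opts.epsilon * mean)
      break;
    n = static_cast<std::size_t>(std::ceil(n * opts.growth));
  }
  return weighted_sum / total_weight;
}

// One possible timing source: runs `fn` n times and measures the elapsed
// time with a monotonic clock.
template <typename Fn> double time_with_steady_clock(Fn &&fn, std::size_t n) {
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < n; ++i)
    fn();
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count();
}
```

A caller would combine the two pieces with something like
`estimate_seconds_per_run([&](std::size_t n) { return time_with_steady_clock(work, n); })`,
where `work` stands for the code under test. A fixed per-sample overhead is
amortized as _N_ grows, which is why the estimate converges.
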
Note: When measuring longer runtimes (e.g. copying several megabytes of data)
the above assumption doesn't hold anymore and the _ε_ precision cannot be
reached by increasing iterations. The whole benchmarking process becomes
prohibitively slow. In this case the algorithm is limited to a single sample and
repeated several times to get a decent 95% confidence interval.

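As an illustration of how such a confidence interval might be derived from
repeated single-sample measurements, here is a sketch that assumes a normal
approximation (the 1.96 factor). The document does not say which estimator is
actually used, and the names `Interval` and `confidence_interval_95` are
hypothetical. For a small number of repetitions, a Student's t quantile would
give a wider and more honest interval.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Interval {
  double low;
  double high;
};

// 95% confidence interval of the mean, assuming approximately normally
// distributed samples (normal approximation, z = 1.96).
Interval confidence_interval_95(const std::vector<double> &samples) {
  assert(samples.size() >= 2);
  const double n = static_cast<double>(samples.size());
  double sum = 0.0;
  for (double s : samples)
    sum += s;
  const double mean = sum / n;
  double squared_deviation = 0.0;
  for (double s : samples)
    squared_deviation += (s - mean) * (s - mean);
  // Sample standard deviation (Bessel's correction: divide by n - 1).
  const double stddev = std::sqrt(squared_deviation / (n - 1.0));
  const double half_width = 1.96 * stddev / std::sqrt(n);
  return {mean - half_width, mean + half_width};
}
```
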
### Effect of branch prediction

When measuring code with branches, repeating the same call again and again will
allow the processor to learn the branching patterns and perfectly predict all
the branches, leading to unrealistic results.

**Decision: When benchmarking small buffer sizes, the function parameters should
be randomized between calls to prevent perfect branch predictions.**

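One way to apply this decision is to precompute the randomized parameters
outside of the timed region, so that the cost of the random number generator is
not measured. The sketch below only illustrates the idea; `make_random_sizes`
is a hypothetical name and the uniform distribution is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: precomputes `count` sizes drawn uniformly from
// [min_size, max_size]. A fixed seed keeps successive runs repeatable.
std::vector<std::size_t> make_random_sizes(std::size_t count,
                                           std::size_t min_size,
                                           std::size_t max_size,
                                           unsigned seed) {
  assert(min_size <= max_size);
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> dist(min_size, max_size);
  std::vector<std::size_t> sizes(count);
  for (std::size_t &size : sizes)
    size = dist(rng);
  return sizes;
}
```

The timed loop would then pick `sizes[i % sizes.size()]` for the _i_-th call,
so that consecutive calls see different sizes and the branch predictor cannot
lock onto a single pattern.
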
### Effect of the memory subsystem

The CPU is tightly coupled to the memory subsystem. It is common to see `L1`,
`L2` and `L3` data caches.

We may be tempted to randomize data accesses widely to exercise all the caching
layers down to RAM but the [cost of accessing lower layers of
memory](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html)
completely dominates the runtime for small sizes.

So to respect **Equity** and **Repeatability** principles we should make sure we
**do not** depend on the memory subsystem.

**Decision: When benchmarking small buffer sizes, the data accessed by the
function should stay in `L1`.**

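A simple way to approximate this is to keep the total working set no larger
than the `L1` data cache and to randomize accesses only within it. The sketch
below assumes a 32 KiB `L1` data cache purely for illustration; a real harness
would query the actual size. The names `kAssumedL1DataCacheBytes`,
`max_buffer_size_for_l1` and `make_offsets_within_buffer` are hypothetical.
Staying under the cache size does not strictly guarantee residency
(associativity and unrelated data also matter), but it removes most of the
dependency on lower memory layers.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Assumed L1 data cache size, for illustration only.
constexpr std::size_t kAssumedL1DataCacheBytes = 32 * 1024;

// Largest per-buffer size when `buffer_count` buffers (e.g. the source and
// destination of a copy) must share the L1 data cache. `buffer_count` must be
// non-zero.
constexpr std::size_t max_buffer_size_for_l1(std::size_t buffer_count) {
  return kAssumedL1DataCacheBytes / buffer_count;
}

// Picks `count` random offsets such that [offset, offset + access_size)
// always lies inside a buffer of `buffer_size` bytes.
std::vector<std::size_t> make_offsets_within_buffer(std::size_t count,
                                                    std::size_t buffer_size,
                                                    std::size_t access_size,
                                                    unsigned seed) {
  assert(access_size <= buffer_size);
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> dist(0, buffer_size - access_size);
  std::vector<std::size_t> offsets(count);
  for (std::size_t &offset : offsets)
    offset = dist(rng);
  return offsets;
}
```
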
### Effect of prefetching

In case of small buffer sizes,
[prefetching](https://en.wikipedia.org/wiki/Cache_prefetching) should not kick
in but in case of large buffers it may introduce a bias.

**Decision: When benchmarking large buffer sizes, the data should be accessed in
a random fashion to lower the impact of prefetching between calls.**

### Effect of dynamic frequency scaling

Modern processors implement [dynamic frequency
scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling). In so-called
`performance` mode the CPU will increase its frequency and run faster than usual
within [some limits](https://en.wikipedia.org/wiki/Intel_Turbo_Boost): _"The
increased clock rate is limited by the processor's power, current, and thermal
limits, the number of cores currently in use, and the maximum frequency of the
active cores."_

**Decision: When benchmarking we want to make sure the dynamic frequency scaling
is always set to `performance`. We also want to make sure that the time based
events are not impacted by frequency scaling.**

See [README.md](README.md) on how to set this up.

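As a sanity check before running benchmarks, the current governor can be read
back. The following sketch is Linux-specific and assumes the `cpufreq` sysfs
interface is exposed at
`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`. The function names are
hypothetical and this is not how the framework itself performs the detection.
The parsing step takes a stream so that it can be exercised with a
`std::istringstream` instead of a real sysfs file.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns true if the first whitespace-delimited token read from `in` is
// "performance".
bool is_performance_governor(std::istream &in) {
  std::string governor;
  return static_cast<bool>(in >> governor) && governor == "performance";
}

// Linux-specific: checks the cpufreq governor of logical CPU `cpu` through
// sysfs. Returns false when the file is missing (e.g. no cpufreq support).
bool cpu_uses_performance_governor(unsigned cpu) {
  std::ifstream file("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                     "/cpufreq/scaling_governor");
  return file.is_open() && is_performance_governor(file);
}
```
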
### Reserved and pinned cores

Some operating systems allow [core
reservation](https://stackoverflow.com/questions/13583146/whole-one-core-dedicated-to-single-process).
It removes a set of perturbation sources like process migration, context
switches and interrupts. When a core is hyperthreaded, both of its logical cores
should be reserved.

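Reserving a core is done at the operating system level, but the benchmark
process still has to run on that core. On Linux, one way to pin the calling
thread is `sched_setaffinity`. The sketch below is Linux-specific, the name
`pin_current_thread_to_cpu` is hypothetical, and it does not reserve the core
by itself.

```cpp
#include <cassert>
#include <sched.h>

// Linux-specific sketch: restricts the calling thread to the logical CPU
// `cpu`. Returns false if `cpu` is out of range or the kernel rejects the
// request (e.g. the CPU is not in the set allowed for this process).
bool pin_current_thread_to_cpu(int cpu) {
  if (cpu < 0 || cpu >= CPU_SETSIZE)
    return false;
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  // A pid of 0 designates the calling thread.
  return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```
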
## Microbenchmarks limitations

As stated in the Foreword section a number of effects do play a role in
production but are not directly measurable through microbenchmarks. The code
size of the benchmark is (much) smaller than the hot code of real applications
and **doesn't exhibit instruction cache pressure as much**.

### iCache pressure

Fundamental functions that are called frequently will occupy the L1 iCache
([illustration](https://en.wikipedia.org/wiki/CPU_cache#Example:_the_K8)). If
they are too big they will prevent other hot code from staying in the cache and
incur [stalls](https://en.wikipedia.org/wiki/CPU_cache#CPU_stalls). So the
memory functions should be as small as possible.

### iTLB pressure

The same reasoning applies to the instruction Translation Lookaside Buffer
([iTLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)): code that
is too big can incur [TLB
misses](https://en.wikipedia.org/wiki/Translation_lookaside_buffer#TLB-miss_handling).

## FAQ

1. Why don't you use Google Benchmark directly?

   We reuse some parts of Google Benchmark (detection of frequency scaling, CPU
   cache hierarchy information) but when it comes to measuring memory
   functions Google Benchmark has a few issues:

   - Google Benchmark favors code based configuration via macros and
     builders. It is typically done in a static manner. In our case the
     parameters we need to set up are a mix of what's usually controlled by
     the framework (number of trials, maximum number of iterations, size
     ranges) and parameters that are more tied to the function under test
     (randomization strategies, custom values). Achieving this with Google
     Benchmark is cumbersome as it involves templated benchmarks and
     duplicated code. In the end, the configuration would be spread across
     command line flags (via framework's option or custom flags), and code
     constants.
   - Output of the measurements is done through a `BenchmarkReporter` class,
     that makes it hard to access the parameters discussed above.