Summary:
Shrink wrapping is an expensive part of frame optimizations if
performed on all functions. This diff makes it run in parallel,
reducing wall time.
(cherry picked from FBD16092651)
Summary:
This diff parallelize the construction of call graph during disassembly.
The diff includes a change to parallel-utilities where another interface
is added, that support running tasks on binaryFunctions that involves
adding instruction annotations. This pattern is common in different places,
e.g. frame optimizations. And such, pattern justify creating an interface,
that abstract out all the messy details.
(cherry picked from FBD16232809)
Summary:
This diff change reorderBasicBlocks pass to run in parallel,
it does so by adding locks to the fix branches function,
and creating temporary MCCodeEmitters when estimating basic block code size.
(cherry picked from FBD16161149)
Summary:
If two indirect branches use the same jump table, we need to
detect this and duplicate dump tables so we can modify this CFG
correctly. This is necessary for instrumentation and shrink wrapping.
For the latter, we only detect this and bail, fixing this old known
issue with shrink wrapping.
Other minor changes to support better instrumentation: add an option
to instrument only hot functions, add LOCK prefix to instrumentation
increment instruction, speed up splitting critical edges by avoiding
calling recomputeLandingPads() unnecessarily.
(cherry picked from FBD16101312)
Summary:
Heuristic that creates a jump table for every memory access,
including those we do not match against a pattern in an indirect jump,
is too permissive and has false positives. Guard this logic under
strict mode until we figure out a better strategy.
(cherry picked from FBD16192205)
Summary:
Each time we run some work in parallel over the list of functions in bolt, we manage a thread pool, task scheduling and perform some work to manage the granularity of the tasks based on the type of the work we do.
In this task, I am creating an interface where all those details are abstracted out, the user provides the function that will run on each function, and some policy parameters that setup the scheduling and granularity configurations.
This will make it easier to implement parallel tasks, and eliminate redundant coding efforts.
(cherry picked from FBD16116077)
Summary:
This diff parallelize the STPClean() function reducing its runtime from 5 seconds to 0.4 on HHVM,
Making the runtime for the frame optimizer goes down to 33 seconds on HHVM.
(cherry picked from FBD15914371)
Summary:
This diff includes two main changes:
1) When creating an annotation, a dedicated annotation allocator can be used, instead of the default allocator. This allows some annotation to be deallocated right after the end of their usage completely. Furthermore, having the ability to use dedicated allocators allows running SPT in parallel where each task uses a different allocator.
2) SPT is parallelized.
(cherry picked from FBD15913492)
Summary: We select the top hot targets for indirect call promotion. But since we only have frequency for targets, not for actual jump table indices, we have to merge indices that share the same actual target. In order to do that we sort targets by pointer of target symbol before merging, which introduces instability. Later we stable sort merged targets by frequency. Due to the instability of sorting pointers, and depending on how many indices each merged target has, we could end up with unstable ICP. This commit changes the 2nd pass sorting to prioritize targets with fewer indices, and higher mispredicts, in addition to higher frequency. It improves stability of ICP, and also exposes more ICP because targets with fewer indices has better chance of getting promoted.
(cherry picked from FBD16099701)
Summary:
Check that a symbol address is less than the next function
address before considering it for a secondary entry.
(cherry picked from FBD16056468)
Summary:
In strict relocation mode we rely on relocations to represent all
possible entry points into a function. Most of the code generated by
tested compilers (gcc and clang) will result in relocations against
any internal labels for jump tables and for computed goto tables.
In situations where we cannot properly reconstruct a jump table, or when
we cannot determine a table that guides an indirect jump, e.g. when
multiple computed goto tables are used, we conservatively assume that
the indirect jump can end up at any possible basic block referenced by
relocations.
In strict mode, simple functions may include the aforementioned
instructions with unknown control flow with a conservative list of
destinations added to the containing basic block. This allows us to
expand coverage of simple functions and to enable code reordering
optimizations for more functions.
The strict mode is recommended when BOLT is used with a well-formed
code generated by a compiler.
To use the strict mode, add "-strict" on the command line.
Another effect of this diff, is that with relocations, we will always
replace the immediate operand of an instruction with a symbol if the
relocation exists against this operand.
Also this diff fixes issues with Clang compiled with -fpic.
(cherry picked from FBD15872849)
Summary:
A relocation can have an addend that makes it look as the relocated
value is in a different section from the symbol being relocated.
E.g., a relocation against a variable in .rodata could have a negative
offset that will make it look like it is against a symbol in .text
(a section that typically precedes .rodata).
Unless the relocation is against a section symbol, we know
exactly the symbol that is being relocated and there is no issue.
However, when the linker leaves only a section relocation (i.e. a
relocation against a section symbol when a temporary original symbol
gets deleted), we have to guess the relocated symbol, and can falsely
detect a function reference in the case described above.
The fix is to keep a section relocation if the corresponding
relocated value falls into a different section, and to detect and
ignore false function reference.
(cherry picked from FBD16030791)
Summary: BOLT operates in relocation mode by default when .reloc is in the binary. This changes disables relocation mode for heatmap generation so we can use that for more cases. There's a small separate change that ignores zero-sized symbol in zero-sized code section during function discovery.
(cherry picked from FBD16009610)
Summary:
An instrumentation pass that modifies the input binary to
generate a profile after execution finishes. It modifies branches to
increment counters stored in the process memory and injects a new
function that dumps this data to an fdata file, readable by BOLT.
This instrumentation is experimental and currently uses a naive
approach where every branch is instrumented. This is not ideal for
runtime performance, but should be good enough for us to
evaluate/debug LBR profile quality against instrumentation.
Does not support instrumenting indirect calls yet, only direct
calls, direct branches and indirect local branches.
(cherry picked from FBD15998096)
Summary:
Make BOLT ignore empty functions (those containing no instructions,
despite having some space allocated to it filled with zeroes).
(cherry picked from FBD15981683)
Summary:
Profile bias may happen depending on the hardware counter used
to trigger LBR sampling, on the hardware implementation and as an
intrinsic characteristic of relying on LBRs. Since we infer fall-through
execution and these non-taken branches take zero hardware resources to
be represented, LBR-based profile likely overrepresents paths with fall
throughs and underrepresents paths with many taken branches. This patch
adds an option to print statistics about profile bias so we can better
understand these biases.
The goal is to analyze differences in the sum of the frequency of all
incoming edges in a basic block versus the sum of all outgoing. In an
ideally sampled profile, these differences should be close to zero. With
this option, the user gets the mean of these differences in flow as a
percentage of the input flow. For example, if this number is 15%, it
means, on average, a block observed 15% more or less flow going out of
it in comparison with the flow going in. We also print the standard
deviation so we can have an idea of how spread apart are different
measurements of flow differences. If variance is low, it means the
average bias is happening across all blocks, which is compatible with
using LBRs. If the variance is high, it means some blocks in the profile
have a much higher bias than others, which is compatible with using a
biased event such as cycles to sample LBRs because it overrepresents
paths that end in an expensive instruction.
(cherry picked from FBD15790517)
Summary:
ICF consumes 10-15% of bolt runtime, for HHVM that is around 45 seconds.
this diff perform some parallelization for the pass to make it faster.
A 60% reduction in the ICF runtime is measured on the parallel version for HHVM.
(cherry picked from FBD15589515)
Summary:
Now that we populate jump tables after all functions are disassembled,
we can check for instruction boundaries corresponding to jump table
entries. No need to delegate this task to postProcessJumpTables().
(cherry picked from FBD15814762)
Summary:
During the initial disassembly pass, only identify jump tables
without populating the contents. Later, after all functions have been
disassembled, we have a better idea of jump table boundaries and can do
a better job of populating their entries.
As a result, we no longer have embedded jump tables (i.e. a jump table
that is parter of another jump table). If we ever need to keep
sequential jump tables inseparable during the output, we can always
add such functionality later.
Fixesfacebookincubator/BOLT#56.
(cherry picked from FBD15800427)
Summary:
During frame analysis, the functions do not change, and stack pointer tracking
does not need to be performed more than one time.
The current implementation performs the SPT analysis multiple times per
function during the frame analysis, we ca eliminate such computation redundancy.
On HHVM with -frame-opts=hot, this save around a minute which is 40% of the
frame optimization runtime. (129s to 76s).
fdata should be passed for a reasonable evaluation (we need the call graph).
However, this comes at a memory cost, around 2G to the peak when only -frame-opt=hot only is used but,
When all the usual flags are passed, the effect is to the peak is only 200K (measured from one test).
Update:
When jemalloc is used the base became way better and the following runtime are observed:
[jemalloc]
hhvm 85 --> 72.
clang 27 --> 23.
[malloc]
hhvm 129 --> 76.
clang 34 --> 27.
(cherry picked from FBD15707003)
Summary:
Add an option to get extra profile trace using the recorded event PC.
The trace goes from the latest LBR record destination to the event PC.
(cherry picked from FBD15711804)
Summary:
We used to handle PC-relative address references differently from direct
address references. As a result, some cases, such as escaped function
label address, were not handled when dealing with absolute (non-PIC)
code. This diff moves processing of an address reference into
BinaryContext::handleAddressRef() which is called for both PIC and
non-PIC code.
(cherry picked from FBD15643535)
Summary:
Compile Bolt using std 14.
We want that to be able to use some threading the locking tools that do not exists in std 11.
(cherry picked from FBD15671736)
Summary:
Similarly to how the compiler relies on DWARF to map samples, so
it is possible to collect profile data in binaries optimized by PGO
techniques and retrofit data to be used in a representation of the program
that was not optimized by PGO, this diff implements an option in BOLT to
encode a table in the output binary that allows us to map data collected
in optimized binaries back to the address space used in the input binary
(where the profile is useful, since we do not support running BOLT on a
binary already optimized by BOLT). The goal is to offer an option to
support BOLT in scenarios where it is not easy to run a special deployment of
the binary with a version that was not optimized by BOLT just for data
collection.
This feature is enabled with the -enable-bat flag. BAT stands for BOLT
Address Translation, which refers to the process of mapping output to
input addresses.
(cherry picked from FBD15531860)
Summary:
Options such as `-print-only`, `-skip-funcs`, etc. now take regular
expressions. Internally, the option is converted to '^funcname$' form
prior to regex matching. This ensures that names without special
symbols will match exactly, i.e. "foo" will not match "foo123".
(cherry picked from FBD15551930)
Summary:
Run analyzeIndirectBranch() using basic block boundaries instead of
running ad-hoc validation of the jump table assumptions.
(cherry picked from FBD15465034)
Summary:
Move handling of interprocedural references to BinaryContext.
Post-process indirect branches immediately after the CFG is built.
This is almost NFC. Since indirect branches are now post-processed
before the profile data is processed it interferes with the way the
profile data in YAML format is handled.
(cherry picked from FBD15456003)
Summary:
Add an option:
-memcpy1-spec=func1,func2:cs1,func3:cs1:cs2,...
to specialize calls to memcpy() in listed functions (the name could be
supplied in regex) for size 1. The optimization will dynamically check
if the size argument equals to 1 and execute a one byte copy, otherwise
it will call memcpy() as usual. Specific call sites could be indicated
after ":" using their numeric count from the start of the function.
(cherry picked from FBD15428936)
Summary:
SDT markers that appears as nops in the assembly, are preserved and not eliminated.
Functions with SDT markers are also flagged. Inlining and folding are disabled for
functions that have SDT markers.
(cherry picked from FBD15379799)
Summary:
Parse statically defined tracepoints(SDT) markers from the ELF file, and store them.
Add an option to print SDTs (-print-sdt).
Add test case for parsing and printing SDTs.
(cherry picked from FBD15366712)
Summary:
Previously, ICP worked with a budget of N targets to convert to
direct calls. As long as the frequency of up to N of the hottest targets
surpassed a given fraction (threshold) of the total frequency, say, 90%,
then the optimization would convert a number of targets (up to N) to
direct calls. Otherwise, it would completely abort processing this call
site. The intent was to convert a given fraction of the indirect call
site frequency to use direct calls instead, but this ends up being a
"all or nothing" strategy.
In this patch we change this to operate with the same strategy seem in
LLVM's ICP, with two thresholds. The idea is that the hottest target of
an indirect call site will be compared against these two thresholds: one
checks its frequency relative to the total frequency of the original
indirect call site, and the other checks its frequency relative to the
remaining, unconverted targets (excluding the hottest targets that were
already converted to direct calls). The remaining threshold is typically
set higher than the total threshold. This allows us more control over
ICP.
I expose two pairs of knobs, one for jump tables and another for
indirect calls.
To improve the promotion of hot jump table indices when we have memory
profile, I also fix a bug that could cause us to promote extra indices
besides the hottest ones as seen in the memory profile. When we have the
memory profile, I reapply the dual threshold checks to the memory
profile which specifies exactly which indices are hot. I then update N,
the number of targets to be promoted, based on this new information, and
update frequency information.
To allow us to work with smaller profiles, I also created an option in
perf2bolt to filter out memory samples outside the statically allocated
area of the binary (heap/stack). This option is on by default.
(cherry picked from FBD15187832)
Summary:
Make BinaryContext responsible for creation and management of
JumpTables. This will be used for detection and resolution of jump table
conflicts across functions.
(cherry picked from FBD15196017)
Summary:
perf tool requires the input data to be owned by the current user or
root, otherwise it rejects the input. Use `-f` option to override this
behavior.
(cherry picked from FBD15160678)
Summary:
While checking for a size of a jump table, we've used containing
section as a boundary. This worked for most cases as typically jump
tables are not marked with symbol table entries. However, the compiler
may generate objects for indirect goto's.
(cherry picked from FBD15158905)
Summary:
We used to ignore debug sections by default, but we kept them in the
binary which led to invalid debug information in the output. It's better
to strip debug info and print a warning to the user.
Note: we are not updating debug info by default due to high memory
requirements for large applications.
(cherry picked from FBD15128947)
Summary:
Commit "Update symbols for secondary entry points" introduced
a bug by using getBinaryFunctionContainingAddress() instead of
getBinaryFunctionAtAddress() regarding ICF'd functions. Only the latter
would fetch the correct BinaryFunction object for addresses of functions
that were ICF'd. As a result of this bug, the dynamic symbol table was
not updated for function symbols that were folded by ICF.
(cherry picked from FBD15112941)
Summary:
In non-relocation mode we may execute multiple re-write passes either
because we need to split large functions or update debug information for
large functions (in this context large functions are functions that do
not fit into the original function boundaries after optimizations).
When we execute another pass, we reset RewriteInstance and run most of
the steps such as disassembly and profile matching for the 2nd or 3rd
time. However, when we match a profile, we check `Used` flag, and don't
use the profile for the 2nd time. Since we didn't reset the flag while
resetting the rest of the states, we ignored profile for all functions.
Resetting the flag in-between rewrite passes solves the problem.
(cherry picked from FBD15110959)
Summary:
For pre-aggregated profile, we were using the number of records in the
profile for `NumTraces` ignoring the counts per record. As a result,
the reported percentage of mismatched traces was bogus.
(cherry picked from FBD15093123)
Summary:
Enable -hot-text by default if reordering functions.
Also fail immediately if function reordering is specified on the command
line in non-relocation mode.
(cherry picked from FBD15095178)
Summary: BOLT works as a series of patches rebased onto upstream LLVM at revision `f137ed238db`. Some of these patches introduce unnecessary whitespace changes or includes. Remove these to minimize the diff with upstream LLVM.
(cherry picked from FBD15064122)
Summary:
Update the output ELF symbol table for symbols representing
secondary entry points for functions. Previously, those were left
unchanged in the symtab.
(cherry picked from FBD15010517)
Summary:
`size_t` is platform-dependent, and on macOS it is defined as
`unsigned long long`. This is not the same type as is used in many calls
to templated functions that expect the same type. As a result, on macOS,
calls to `std::max` fail because a template function that takes
`uint64_t, unsigned long long` cannot be found.
To work around the issue:
* Specify explicit `std::max` and `std::min` functions where necessary,
to work around the compiler trying (and failing) to find a suitable
instantiation.
* For lambda return types, specify an explicit return type where necessary.
* For `operator ==()` calls, use an explicit cast where necessary.
(cherry picked from FBD15030283)
Summary:
When attempting to build llvm-bolt with `-DLLVM_ENABLE_TARGETS="X86"`, I
encountered an error:
```
CMake Error at cmake/modules/AddLLVM.cmake:559 (add_dependencies):
The dependency target "AArch64CommonTableGen" of target
"LLVMBOLTTargetAArch64" does not exist.
Call Stack (most recent call first):
cmake/modules/AddLLVM.cmake:607 (llvm_add_library)
tools/llvm-bolt/src/Target/AArch64/CMakeLists.txt:1 (add_llvm_library)
```
The issue is that the `llvm-bolt/src/Target/AArch64` subdirectory is
added by CMake unconditionally. The LLVM project, on the other hand,
only adds the subdirectories that are enabled, by using a `foreach` loop
over `LLVM_TARGETS_TO_BUILD`. Copying that same loop, from
`llvm/lib/Target/CMakeLists.txt`, to this project avoids the error.
(cherry picked from FBD15030236)