Summary:
When the perf tool reports a mapping address for a binary, it is not
always the address of the first loadable segment, which is what we were
checking against. As a result, perf2bolt was not working properly for
binaries where the first segment was not executable.
The fix is to check if the address reported by the mmap event matches any
of the loadable segments. Note that the segment alignment has to be
applied to get the real load address of the segment.
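A minimal sketch of the matching logic, with illustrative types and names
(not BOLT's actual code), assuming `p_vaddr` and `p_align` come from the
ELF program headers:

```cpp
#include <cstdint>
#include <vector>

struct LoadSegment {
  uint64_t Address;   // p_vaddr from the program header
  uint64_t Alignment; // p_align, a power of two
};

// An mmap event matches the binary if its address equals the aligned
// start of *any* PT_LOAD segment, not just the first one.
bool matchesAnyLoadSegment(uint64_t MMapAddress,
                           const std::vector<LoadSegment> &Segments) {
  for (const LoadSegment &Seg : Segments) {
    // Round the segment address down to its alignment to get the address
    // the kernel actually maps it at.
    const uint64_t AlignedAddr = Seg.Address & ~(Seg.Alignment - 1);
    if (MMapAddress == AlignedAddr)
      return true;
  }
  return false;
}
```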
Fixes facebookincubator/BOLT#65
(cherry picked from FBD19146419)
Summary:
Add full instrumentation support (branches, direct and
indirect calls). Add output statistics to show how many hot bytes
were split from cold ones in functions. Add -cold-threshold option
to allow splitting warm code (non-zero count). Add option in
bolt-diff to report missing functions in profile 2.
In instrumentation, fini hooks are fixed to run proper finalization
code after the program finishes. Hooks for startup are added to set up
the runtime structures that need initialization, such as indirect-call
hash tables.
Add support for automatically dumping profile data every N seconds by
forking a watcher process during runtime.
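A minimal sketch of the watcher idea, with hypothetical names (the actual
runtime library code differs):

```cpp
#include <unistd.h>

// Assumed hook that writes the current counters to disk; provided
// elsewhere by the instrumentation runtime in this sketch.
extern "C" void dumpProfileData();

// Fork a watcher process that dumps profile data every IntervalSecs
// seconds while the parent continues running the instrumented program.
void spawnProfileWatcher(unsigned IntervalSecs) {
  if (fork() != 0)
    return; // parent: resume normal execution
  for (;;) {
    sleep(IntervalSecs);
    dumpProfileData();
  }
}
```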
(cherry picked from FBD17644396)
Summary:
-icp-top-callsites selects candidates for optimization until a
threshold is met. Currently, this parameter is set to 99% of calls by
default. The order of functions evaluated changes in parallel mode,
thus the functions that may be included to satisfy 99% of all calls may
change, leading to different optimization decisions when running in
parallel versus sequential mode.
Fix this by enabling optimizations for all branches with the same
frequency once we reach our budget, instead of cutting off immediately
after the budget is satisfied. That way, the order of functions has no
impact on which functions are optimized.
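A minimal sketch of the order-independent cutoff, with illustrative types
(not BOLT's actual code), assuming call sites are already sorted by
descending count:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct CallSite {
  uint64_t Count; // execution count of the call site
};

// Accept call sites until the coverage budget (e.g. 0.99 of all calls) is
// met, then keep accepting only those that tie the last accepted count, so
// the cutoff falls on a frequency boundary and traversal order is moot.
size_t selectTopCallSites(const std::vector<CallSite> &Sorted,
                          uint64_t TotalCalls, double Budget) {
  uint64_t Covered = 0;
  uint64_t LastCount = 0;
  size_t NumSelected = 0;
  for (const CallSite &CS : Sorted) {
    const bool BudgetMet = Covered >= Budget * TotalCalls;
    if (BudgetMet && CS.Count != LastCount)
      break; // stop only when the frequency changes
    Covered += CS.Count;
    LastCount = CS.Count;
    ++NumSelected;
  }
  return NumSelected;
}
```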
(cherry picked from FBD18902239)
Summary:
Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done.
The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end.
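A minimal sketch of the pending-patch bookkeeping, with hypothetical types
(the actual DebugLocWriter interface differs): each thread records location
lists at CU-local offsets, and the final offsets are resolved once all
per-CU buffers are concatenated in a fixed order.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Each patch remembers which per-CU buffer a location list lives in, its
// offset inside that buffer, and where in .debug_info the final
// .debug_loc offset must be written.
struct LocListPatch {
  size_t CUIndex;
  uint64_t LocalOffset;
  uint64_t PatchSite;
};

// Once all threads finish, per-CU buffer sizes are fixed, so the final
// offset is the sum of the preceding buffers plus the local offset.
uint64_t finalOffset(const std::vector<std::string> &CUBuffers,
                     const LocListPatch &P) {
  uint64_t Base = 0;
  for (size_t I = 0; I < P.CUIndex; ++I)
    Base += CUBuffers[I].size();
  return Base + P.LocalOffset;
}
```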
(cherry picked from FBD18153069)
Summary:
Some processes can mmap the main binary for the purpose of
introspection. We should ignore such mmap events for fixed-address
binaries. For PIC binaries, we record the mapping and do the address
filtering later for all sample events.
(cherry picked from FBD18844314)
Summary:
The condition `DebugRangesOffset == -1U` can never happen since DebugRangesOffset has type `uint64_t` and the value always comes from `RangesSectionWriter->addRanges`, which gets its value from `DebugRangesSectionWriter.SectionOffset`, which has type `uint32_t`. The condition seems to be left over from a time when something was using `-1` as an error value.
I'm removing that check so I can use `-1` as a tag to refer to the empty range that will be at the beginning of the ranges section.
(cherry picked from FBD18153119)
Summary: The `.debug_aranges` section is already deterministic and logically separate from the `.debug_ranges` section, so split the writers into separate classes to make it easier to make DebugRangesSectionsWriter deterministic.
(cherry picked from FBD18153057)
Summary:
This fixes a bug that causes the debug_info and debug_loc sections to be unreadable by readelf/objdump.
Basically, we were using 12 bytes of a ULEB128 value to fill in space, but readelf can't read more than 9 bytes of ULEB128. Thus, we replace that value with a string of 'a' characters instead.
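A sketch of the constraint, with an illustrative helper (not the actual
patch): LLVM's encodeULEB128 can pad an encoding to a fixed width, but a
12-byte padded value trips readelf's 9-byte limit, so the filler is written
as bytes that are never parsed as ULEB128.

```cpp
#include "llvm/Support/LEB128.h"
#include "llvm/Support/raw_ostream.h"

// Hypothetical filler routine: write NumBytes of 'a' instead of a
// ULEB128 value padded past readelf's 9-byte limit.
void writeFiller(llvm::raw_ostream &OS, unsigned NumBytes) {
  // Not: llvm::encodeULEB128(0, OS, /*PadTo=*/12); -- unreadable by readelf.
  for (unsigned I = 0; I < NumBytes; ++I)
    OS << 'a';
}
```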
(cherry picked from FBD18097728)
Summary:
Avoid asserting on inputs that are shared libraries with
R_X86_64_64 static relocs and RELATIVE dynamic relocations matching
those. Our relocation checking mechanism would expect the result of
the static relocation to be encoded in the binary, but the linker
instead puts it as an addend in the RELATIVE dyn reloc.
Also fix aggregation for .so if the executable segment is not the
first one in the binary.
(cherry picked from FBD18651868)
Summary:
When combining icp=calls and shrink wrapping, the former may
generate empty BBs that trigger a bug in the shrink-wrapping
restore placement strategy. The restore is wrongly pushed to the BB's
successor instead of being added to the current block. Add a pass to
go over the CFG and fix empty blocks by adding a temporary NOP
instruction that is going to be deleted later. Empty BBs are not
supported by one of the analyses done in this pass.
(cherry picked from FBD18717994)
Summary:
If -trap-avx512 option is not set, verify that we correctly encode
AVX-512 instructions and treat them as ordinary instructions.
(cherry picked from FBD18666427)
Summary:
There is no need to support the existing functionality of adding entry
points after the CFG is built, as the function is only called in the empty
or disassembled state. Previously we used to run disassemble+buildCFG per
function, but now these phases are decoupled.
Also, remove a couple of redundant checks.
(cherry picked from FBD18622822)
Summary:
We only use the locations of PC-relative relocations and ignore the rest
of the data. There's no need to store the type and value.
(cherry picked from FBD18623280)
Summary:
Speeding up cache+/ext-tsp block reordering algorithm.
On a high-level, the speedup is achieved by:
- precomputing and memoizing all jumps between a pair of chains
(instead of extracting them on every merge iteration);
- using a cache of size O(|E|) instead of O(|V|^2) as in the previous version.
The final output is identical to the previous one, subject to a new
deterministic comparison of double values.
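A minimal sketch of the memoization, with illustrative structures (the real
algorithm lives in the cache+/ext-tsp pass): jumps between each pair of
chains are collected once up front, so a merge iteration looks them up
instead of re-scanning all edges; the cache stores one entry per CFG edge,
i.e. O(|E|) space.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Jump {
  uint64_t SrcBlock, DstBlock, ExecCount;
};

using ChainPair = std::pair<uint64_t, uint64_t>; // (ChainId, ChainId)

// Filled once before merging starts; every CFG edge appears in exactly
// one bucket, so total size is O(|E|).
std::map<ChainPair, std::vector<Jump>> JumpCache;

const std::vector<Jump> &jumpsBetween(uint64_t ChainA, uint64_t ChainB) {
  return JumpCache[{ChainA, ChainB}]; // lookup only, no edge re-extraction
}
```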
(cherry picked from FBD18380870)
Summary:
When we disassemble functions, we add discovered jump tables to a global
container in BinaryContext. Later, we analyze and verify all jump
tables. However, analysis for non-simple functions might fail for numerous
reasons, e.g., there might be no instruction at a destination. Since we
are not overwriting non-simple functions, it is not a critical error.
Thus, we can safely skip jump table analysis for non-simple functions.
(cherry picked from FBD18422997)
Summary:
Free more lists in BinaryFunction::releaseCFG().
Release BinaryFunction::Relocations after disassembly.
Do not populate BinaryFunction::MoveRelocations as we are not using them
currently.
Also remove PCRelativeRelocationOffsets that weren't used.
(cherry picked from FBD18413256)
Summary:
Once we emit function code, we no longer need the CFG for the next phases
that use basic blocks for address translation and symbol update
purposes. We free memory used by CFG and instructions. The freed
memory gets reused by later phases resulting in overall memory usage
reduction.
We can probably improve memory consumption even further by replacing
BinaryBasicBlocks with more compact data structures.
(cherry picked from FBD18408954)
Summary:
We used to emit special annotations to update SDT markers. However,
we can just use "Offset" annotations for the same purpose. Unlike BAT,
we have to generate "reverse" address translation tables.
This approach eliminates reliance on instructions after code emission.
(cherry picked from FBD18318660)
Summary:
Use BinaryBasicBlock::OffsetTranslationTable for BAT. This removes
dependency on instructions after the code emission.
(cherry picked from FBD18283965)
Summary:
By default, we strip debug sections from the binary. Even though we did
not write the sections, we allocated space for them in the output binary
by mistake.
(cherry picked from FBD18218708)
Summary:
BOLT creates MCInst for every instruction from the input. For large
binaries, this means we are creating tens if not hundreds of millions of
instructions. If the number of operands for the average instruction is much
less than 8, we benefit from changing the type of Operands from
SmallVector<MCOperand, 8> to SmallVector<MCOperand, 2>. That seems
to be the optimal type for X86-64 on average.
The size of MCInst goes down from 176 to 80 which often reduces BOLT
memory consumption by gigabytes.
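A sketch of the change (the real edit is in MCInst itself); with 16-byte
MCOperands, shrinking the inline capacity from 8 to 2 saves 6 * 16 = 96
bytes per instruction, matching the 176 -> 80 byte reduction quoted above:

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/MC/MCInst.h"

using OperandsBefore = llvm::SmallVector<llvm::MCOperand, 8>; // 8 inline
using OperandsAfter  = llvm::SmallVector<llvm::MCOperand, 2>; // 2 inline

// Operands beyond the inline capacity spill to the heap; since the average
// X86-64 instruction has far fewer than 8 operands, the smaller buffer
// shrinks every MCInst without a measurable slowdown.
```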
(cherry picked from FBD18218924)
Summary:
We do not support optimizing functions with jump tables in
AArch64, but we do need to detect them. This idiom is slightly different
from the ones we've seen before: it encodes jump table entries as offsets
relative to the jump table itself instead of relative to the indirect
branch (BR) instruction.
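A minimal sketch of the address computation for this idiom, with
illustrative types (entry width varies in practice):

```cpp
#include <cstdint>

// Target of the indirect branch for the table-relative idiom: each entry
// is a signed offset from the jump table's own address, so the target is
// TableAddress + Entry[Index], not BRAddress + Entry[Index] as in the
// previously supported pattern.
uint64_t jumpTargetTableRelative(uint64_t TableAddress,
                                 const int32_t *Entries, unsigned Index) {
  return TableAddress + static_cast<int64_t>(Entries[Index]);
}
```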
(cherry picked from FBD18191100)
Summary:
For functions with unknown control flow, do not populate TakenBranches
with an entry pointing to the end of the function.
(cherry picked from FBD18034019)
Summary:
When collecting data on Intel Skylake machines, we may face a
bug where LBR0 or LBR1 may be duplicated w.r.t. the next entry. This
makes perf2bolt interpret it as an invalid trace, which ordinarily we
discard during aggregation. However, in BAT, since we do not disassemble
the binary where the collection happened but rely only on the
translation table, it is not possible to detect bad traces and discard
them. This gets into the fdata file, and the invalid trace ends up
invalidating the profile for the whole function (by being treated as
stale by BOLT).
In this patch, we detect Skylake by looking for LBRs with 32 entries,
and discard the first 2 entries to avoid running into this problem.
It also fixes an issue with collision in the translation map by
prioritizing the last basic block when more than one shares the same
output address.
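A minimal sketch of the workaround, using illustrative types (not the
aggregator's actual data structures):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct LBREntry {
  uint64_t From, To;
};

// Skylake exposes 32 LBR entries, and LBR0/LBR1 may duplicate their
// neighbors, so samples carrying exactly 32 entries have their first two
// entries dropped before traces are formed.
void filterSkylakeLBR(std::vector<LBREntry> &Sample) {
  constexpr size_t SkylakeLBRSize = 32;
  if (Sample.size() == SkylakeLBRSize)
    Sample.erase(Sample.begin(), Sample.begin() + 2);
}
```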
(cherry picked from FBD17996791)
Summary:
When reading debug info in parallel, CUs for functions were populated in
parallel and the order was non-deterministic. We used the first CU from
the non-deterministically-ordered list to set the line number resulting
in different outputs.
The fix is to sort the list after it's been created and before assigning
the line table unit.
(cherry picked from FBD17946889)
Summary:
merge-fdata for the legacy format was simply appending all input
strings to the output, but the real format supports some header strings
that can't simply be concatenated. Check for the header
string used by BAT before merging fdata to avoid creating an output
file with invalid lines (a header in the middle of the fdata file).
For heatmap, avoid reading BAT tables, since they won't be used.
(cherry picked from FBD17943131)
Summary:
I noticed when setting up a new repository for bolt that bolt tests
would fail unexpectedly when running `ninja check-bolt` and
`ninja check-llvm`. This turns out to be because dependencies for bolt
binaries were not specified in the CMake configuration, so they were not
built before running the tests. This diff adds the dependencies to the
CMake configuration for check-bolt and check-llvm so that bolt binaries
are built before running tests.
(cherry picked from FBD17919505)
Summary:
C++14 "sized deallocation" introduces a 2-argument `delete` where the new 2nd argument is the original allocated size. It's useful for allocators like jemalloc to be "reminded" of the original allocation size, else they incur the cost of an address to size lookup. Jemalloc has provided this for a while as `sdallocx`, and recently it got wired up to the new 2-arg `delete`.
Here I introduce typedefs for the SmallVectors so the "16" is consistent, which seems to fix the issue.
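A sketch of the idea behind the fix, with a hypothetical typedef name:
every allocation and deallocation site must agree on the SmallVector's
inline element count, because under C++14 sized deallocation the compiler
passes the size it computed from the static type to `operator delete`, and
jemalloc's `sdallocx` trusts that size.

```cpp
#include "llvm/ADT/SmallVector.h"
#include <cstdint>

// One typedef used everywhere keeps the "16" consistent between the code
// that news the vector and the code that deletes it; a mismatch would make
// sized delete report a size different from the one actually allocated.
using AddressVector = llvm::SmallVector<uint64_t, 16>;
```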
(cherry picked from FBD17618981)
Summary:
Change our CMake config for the standalone runtime instrumentation
library to check for the elf.h header before using it, so the build
doesn't break on systems lacking it. Also fix a SmallPtrSet usage where
its elements are not really pointers, but uint64_t, breaking the build
in Apple's Clang.
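A sketch of the container fix (the variable name is illustrative):
SmallPtrSet requires pointer-like element types, which Apple's Clang
enforces, so a set of raw addresses can use SmallSet instead.

```cpp
#include "llvm/ADT/SmallSet.h"
#include <cstdint>

// SmallPtrSet<uint64_t, 4> fails to build where the pointer-type
// requirement is enforced; SmallSet accepts integral keys.
llvm::SmallSet<uint64_t, 4> IndirectCallTargets;
```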
(cherry picked from FBD17505759)
Summary:
With the word "missed", the previous message about opportunities for
macro-op fusion optimization could be misleading.
(cherry picked from FBD17464603)
Summary:
The existing check for compiler de-virtualization bug was not working
when the relocation reference did not fall on a function boundary.
As a result, we were falsely detecting "unmarked object in code".
When running the check, the address could be arbitrary, except it
shouldn't match any existing function. Additionally, check that there's
a proper reference to the de-virtualized callee to avoid false
positives.
(cherry picked from FBD17433887)
Summary:
This diff applies to non-relocation mode mostly. In this mode, we are
limited by original function boundaries, i.e. if a function becomes
larger after optimizations (e.g. because of the newly introduced
branches) then we might not be able to write the optimized version,
unless we split the function. At the same time, we do not benefit from
function splitting as we do in the relocation mode since we are not
moving functions/fragments, and the hot code does not become more
compact.
For the reasons described above, we used to execute multiple re-write
attempts to optimize the binary and we would only split functions that
were too large to fit into their original space.
After the first attempt, we would know functions that did not fit
into their original space. Then we would re-run all our passes again
feeding back the function information and forcefully splitting
such functions. Some functions still wouldn't fit even after the
splitting (mostly because of the branch relaxation for conditional tail
calls that does not happen in non-relocation mode). Yet we emitted
debug info as if they had been successfully overwritten. That's why we had
one more stage to write the functions again, marking failed-to-emit
functions non-simple. Sadly, there was a bug in the way the 2nd and 3rd
attempts interacted: we were not splitting the functions correctly,
and as a result we were emitting less optimized code.
One of the reasons we had the multi-pass rewrite scheme in place was
that we did not have the ability to precisely estimate the code size
before the actual code emission. Recently, BinaryContext obtained such
functionality, and now we can use it instead of relying on the
multi-pass rewrite. This eliminates redundant work of re-running
the same function passes multiple times.
Because function splitting runs before a number of optimization passes
that run on post-CFG state (those rely on the splitting pass), we
cannot estimate the non-split code size with 100% accuracy. However,
it is good enough for over 99% of the cases to extract most of the
performance gains for the binary.
As a result of eliminating the multi-pass rewrite, the processing time
in non-relocation mode with `-split-functions=2` is greatly reduced.
With debug info update, it is less than half of what it used to be.
New semantics for `-split-functions=<n>`:
-split-functions - split functions into hot and cold regions
=0 - do not split any function
=1 - in non-relocation mode only split functions too large to fit
into original code space
=2 - same as 1 (backwards compatibility)
=3 - split all functions
(cherry picked from FBD17362607)
Summary: `perf2bolt` accepts an executable name, and the tool will find all the PIDs associated with that executable. When different versions of an executable are running at the same time, the name alone may not be sufficient, as we will get samples from different versions of the binary aggregated together. The resulting fdata may look stale to BOLT, which makes BOLT bail out of optimizing those functions. This change adds a `-pid` switch that lets the user specify a process ID in addition to the executable name so BOLT can target a specific process.
(cherry picked from FBD17178898)
Summary: This change adds a switch (`ignore-interrupt-lbr`) to ignore LBR entries in the perf input that are the result of kernel interrupts. These asynchronous user/kernel transitions make BOLT think that the profile is stale, causing it to bail out of optimizing the affected functions. Ideally, a user-mode filter should be set for `perf record` so we don't get asynchronous LBRs in the first place. However, these entries are identifiable since the kernel address space is known, so we can ignore any LBRs that come from or go into kernel addresses during aggregation. This is under a switch and off by default in case we need to BOLT a kernel module.
(cherry picked from FBD17170107)
Summary:
Change our edge profiling technique so that instrumentation
does not need to cover every edge. Instead, build a spanning tree
for the CFG and omit instrumentation for edges in the spanning tree.
Infer the edge count for these edges when writing the profile during
run time. The inference works with a bottom-up traversal of the spanning
tree and establishes the value of the edge connecting to the parent based
on a simple flow equation involving output and input edges, where the
only unknown variable is the parent edge.
This requires some engineering in the runtime lib to support dynamic
allocation for building these graphs at runtime.
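A minimal sketch of the inference step, with illustrative structures and
assuming every tree edge flows from parent to child: non-tree edges carry
counters, so after the recursion returns, the flow equation in = out at
each node has the parent edge as its only unknown.

```cpp
#include <cstdint>
#include <vector>

struct TreeNode {
  uint64_t KnownIn = 0;  // instrumented (non-tree) incoming counts
  uint64_t KnownOut = 0; // instrumented (non-tree) outgoing counts
  std::vector<int> Children;
  int Parent = -1;
};

// Bottom-up traversal: child parent-edges are resolved first and added to
// this node's outgoing flow; then KnownIn + parent == KnownOut gives the
// count of the uninstrumented edge to the parent.
void inferParentEdge(std::vector<TreeNode> &Nodes, int Cur,
                     std::vector<uint64_t> &ParentCount) {
  TreeNode &N = Nodes[Cur];
  for (int Child : N.Children) {
    inferParentEdge(Nodes, Child, ParentCount);
    N.KnownOut += ParentCount[Child]; // tree edge Cur -> Child, now known
  }
  if (N.Parent != -1)
    ParentCount[Cur] = N.KnownOut - N.KnownIn;
}
```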
(cherry picked from FBD17062773)
Summary:
We start a thread to preprocess the profile while the main
thread continues to disassemble the input binary. We should not
disassemble it in BAT mode, however, the test to check whether we have
BAT in the input binary depends on the preprocessing thread, so there
is a race where we may start disassembling functions just because the
preprocessing thread hasn't yet concluded that we are in BAT mode. Fix this and
make the main thread check for BAT without depending on the
preprocessing thread.
(cherry picked from FBD17124370)
Summary:
We decode the regular .plt section and we are able to perform
optimizations on it with -plt=hot or -plt=all, however, .plt.got
sections are not decoded by BOLT. This patch teaches BOLT how to read
them. They are created by the bfd linker whenever there is no need for
the dynamic linker to lazy-bind the symbol (when they are eagerly
resolved at binary load time). These entries are 8 bytes in size instead
of 16 bytes like regular PLT entries, and contain a single 7-byte
indirect call instruction followed by a nop.
(cherry picked from FBD17060515)
Summary:
We should not rely on split function detection while aggregating
data, but only look up the original function names in the symbol table.
Split function detection should be done by BOLT and not perf2bolt while
writing the profile. Then, BOLT, when reading it, will take care of
combining functions if necessary.
Relying on split-function detection caused a bug in bolted data collection
where samples in the cold part of a function were falsely attributed to
the hot part instead of the cold part, causing incorrect translation of
addresses.
(cherry picked from FBD16993065)
Summary:
We were too permissive by allowing more jump tables during the
preliminary scan of memory. This allowed jump tables to be falsely
detected, and since we didn't have a way to backtrack jump table
creation, we had to assert.
This diff refactors the code that analyzes jump table contents.
Preliminary and final passes share the same code. The only difference
should be the detection of instruction boundaries that are available
during the final pass.
This should affect strict relocation mode only.
(cherry picked from FBD16923335)
Summary:
BOLT prints "spawning thread to pre-process profile" message even when
it is not running in the aggregation mode. Fix that.
(cherry picked from FBD16908596)
Summary:
Avoid directly allocating string and description tables in
binary's static data region, since they are not needed during runtime
except when writing the profile at exit. Change the runtime library to
open the tables on disk and read only when necessary.
(cherry picked from FBD16626030)
Summary:
To allow the development of future instrumentation work, this
patch adds support in BOLT for linking arbitrary libraries into the
binary processed by BOLT. We use the ORC relocation handling mechanism for
that. With this support, this patch also moves code programmatically
generated in X86 assembly language by X86MCPlusBuilder to C code written
in a new library called bolt_rt. Change CMake to support this library as
an external project in the same way as clang does with compiler_rt. This
library is installed in the lib/ folder relative to the BOLT root
installation, and by default instrumentation will look for the library
at that location to finish processing the binary with instrumentation.
(cherry picked from FBD16572013)