Summary:
Shrink wrapping has a mode where it will directly move push
pop pairs, instead of replacing them with stores/loads. This is an
ambitious mode that is triggered sometimes, but whenever matching with
a push, it would operate with the assumption that the restoring
instruction was a pop, not a load, otherwise it would assert. Fix this
assertion to bail nicely back to non-pushpop mode (use regular store and
load instructions).
(cherry picked from FBD20085905)
Summary:
Change our X86 target to use long nops by default. In general,
BOLT does not put nops into the instruction stream that is going to be
executed, since it doesn't align basic blocks, only functions. Since we
rebased BOLT, our relationship with MCAssembler changed because it
stopped using multibyte nops and we never needed to revisit that. But it
makes a difference if we want to mitigate perf issues with the Intel
JCC erratum, since the nops inserted are going to be decoded and
executed. To make MCAssembler emit long nops again, we need to explictly
set mattr (Features) of the X86 target.
(cherry picked from FBD19987277)
Summary: Start adding initial bits for MachO, this diff contains some small preparations for finding functions inside a MachO binary, this will be done in the next diff. The concept of a section in the MachO world is quite different from ELF, nevertheless, for functions for now it more or less fits into the current picture (in BOLT), but things will diverge more significantly a bit later.
(cherry picked from FBD19648161)
Summary:
There are two peephole subpasses, remove-double-jumps and
remove-useless-conditional-branches, that operates by reading branches
directly, which makes them tricky to run before fix-branches. In the
case of remove-double-jumps, it will even lead to suboptimal code if
the patched branch was going to be removed by fix-branches when the
target is the fall-through. If the final target is a tail call, it will
lead to a broken CFG in the worst case. Fix this by moving these passes
after SCTC, which already produces CFGs with conditional tail calls.
(cherry picked from FBD18795592)
Summary:
This diff ports reviews.llvm.org/D70157 to our LLVM tree, which
makes the integrated assembler able to align X86 control-flow changing
instructions in a way to reduce the performance impact of the ucode
update on Intel processors that implement the JCC erratum mitigation.
See white paper "Mitigations for Jump Conditional Code Erratum" by Intel
published November 2019.
To port this patch, I changed classifySecondInstInMacroFusion to analyze
instruction opcodes directly instead of analyzing the CondCond operand
(in more recent versions of LLVM, all conditional branches share the
same opcode, but with a different conditional operand). I also pulled to
our tree Alignment.h as a dependency, and the macroop analyzing helpers.
x86-align-branch-boundary and -x86-align-branch are the two flags that
control nop insertion to avoid disabling the decoder cache, following
the original patch. In BOLT, I added the flag
x86-align-branch-boundary-hot-only to request the alignment to only be
applied to hot code, which is turned on by default. The reason is
because such alignment is expensive to perform on large modules, but if
we limit it to hot code, the relaxation pass runtime becomes tolerable.
(cherry picked from FBD19828850)
Summary:
In this diff we refactor the code around getting the original binary encoding of function's body.
The main changes are: remove BinaryContext::getFunctionData, remove the parameter of the method BinaryFunction::disassemble, introduce BinaryFunction::getData.
(cherry picked from FBD19824368)
Summary:
In strict mode, a jump table with targets generated by
builtin_unreachable (located at the very end of the function) was
asserting when being recreated by postProcessIndirectBranches. Fix
this.
(cherry picked from FBD19614981)
Summary:
BinaryFunction used to have a list of Names associated with its main
entry point. However, the function is primarily identified by its
corresponding symbol or symbols, and these symbols are available as we
are creating them for a corresponding BinaryData object.
There's also no reason to emit symbols for alternative function names
(aliases), so change the code to only emit needed symbols.
When we emit a cold fragment for a function, only emit one cold symbol
for the fragment instead of one per every main entry symbol/name.
When we match a symbol to an entry point in the function, with this
change we can first go through the list of main entry symbols (now that
they are available).
(cherry picked from FBD19426709)
Summary:
1. Move createBinaryContext to BinaryContext.
1. Add support for nonlinux triples in createBinaryContext.
2. Remove unnecessary std::move in DWARFRewriter.cpp.
(cherry picked from FBD19421314)
Summary:
Call postProcessEntryPoints only after all functions have been
disassembled and all interprocedural references have been processed,
when all possible entry points have been accounted for. This makes our
detection of bad entries more robust as it does not depend on the order
of the functions any more.
(cherry picked from FBD19404767)
Summary:
When the validation of instruction encoding fails but we are able to
continue processing the binary, do no report an error. Report encoding
format only under `-v=1`.
(cherry picked from FBD19376531)
Summary:
For BinaryData, we used to maintain a vector of StringRef names and also
a vector of pointers to MCSymbol's associated with the data. There was
an unnecessary duplication of information and an associated overhead of
keeping it in sync. Fix it by removing Names and using Symbols wherever
Names were used.
Also merge two variants of registerNameAtAddress() and remove
unreachable/dead code in the process.
(cherry picked from FBD19359123)
Summary:
"Fix symbol table entries for secondary entries" diff broke the inliner.
Fix the breakage and make the discovery of secondary entry points more
accurate.
Add ability to BinaryContext::getFunctionForSymbol() to return an entry
point discriminator and use it instead of calling getEntryForSymbol()
and isSecondaryEntry(). This is the preferred way since
getFunctionForSymbol() is thread-safe.
(cherry picked from FBD19295983)
Summary:
Commit "Support full instrumentation" changed the map
SymbolToFunction in BinaryContext to map secondary entries of functions
too. This introduced unexpected behavior in our symbol table rewriting
logic, which caused it to mistakenly write them with the address of the
original function. Fix the behavior of getBinaryFunctionAtAddress to
correct this. Also fix other users of SymbolToFunction to ensure they
are not accidentally using secondary entries when they shouldn't.
(cherry picked from FBD19168319)
Summary:
Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done.
The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end.
(cherry picked from FBD18153069)
Summary:
When perf tool reports a mapping address of a binary, it is not always
the address of the first loadable segment we were checking against.
As a result, perf2botl was not working properly for binaries where the
first segment was not executable.
The fix is to check if the address reported by mmap event matches any
of the loadable segments. Note that the segment alignment has to be
applied to get real loadable address of the segment.
Fixesfacebookincubator/BOLT#65
(cherry picked from FBD19146419)
Summary:
Add full instrumentation support (branches, direct and
indirect calls). Add output statistics to show how many hot bytes
were split from cold ones in functions. Add -cold-threshold option
to allow splitting warm code (non-zero count). Add option in
bolt-diff to report missing functions in profile 2.
In instrumentation, fini hooks are fixed to run proper finalization
code after program finishes. Hooks for startup are added to setup
the runtime structures that needs initilization, such as indirect call
hash tables.
Add support for automatically dumping profile data every N seconds by
forking a watcher process during runtime.
(cherry picked from FBD17644396)
Summary:
-icp-top-callsites selects candidates for optimization until a
threshold is met. Currently, this parameter is set to 99% of calls by
default. The order of functions evaluated changes in parallel mode,
thus the functions that may be included to satisfy 99% of all calls may
change, leading to different optimization decisions when running in
parallel versus sequential.
Fix this by enabling optimizations for all branches with the same
frequency once we reach our budget instead of cutting off immediatelly
after our budget is satisfied. In that way, order of functions has no
impact on which functions are optimized.
(cherry picked from FBD18902239)
Summary:
Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done.
The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end.
(cherry picked from FBD18153069)
Summary:
Some processes can mmap the main binary for the purpose of
introspection. We should ignore such mmap events for fixed-address
binaries. For PIC binaries, we record the mapping and do the address
filtering later for all sample events.
(cherry picked from FBD18844314)
Summary:
The condition `DebugRangesOffset == -1U` can never happen since DebugRangesOffset has type `uint64_t` and the value always comes from `RangesSectionWriter->addRanges` which gets its value from `DebugRangesSectionWriter.SectionOffset` which has type `uint32_t`. The condition seems to be left over from a time where something was using `-1` as an error value.
I'm removing that check so I can use `-1` as a tag to refer to the empty range that will be at the beginning of the ranges section.
(cherry picked from FBD18153119)
Summary: The `.debug_aranges` section is already deterministic and is logically separate from the `.debug_ranges` section so separate them into separate classes so that it will be easier to make DebugRangesSectionsWriter deterministic
(cherry picked from FBD18153057)
Summary:
This fixes a bug which causes the debug_info and debug_loc sections to be unreadable by readelf/objdump.
Basically, we're using 12 bytes of a ULEB128 value to fill in space, but readelf can't read more than 9 bytes of ULEB128. Thus, we replace that value with a string of 'a' instead.
(cherry picked from FBD18097728)
Summary:
Avoid asserting on inputs that are shared libraries with
R_X86_64_64 static relocs and RELATIVE dynamic relocations matching
those. Our relocation checking mechanism would expect the result of
the static relocation to be encoded in the binary, but the linker
instead puts it as an addend in the RELATIVE dyn reloc.
Also fix aggregation for .so if the executable segment is not the
first one in the binary.
(cherry picked from FBD18651868)
Summary:
When combining icp=calls and shrink wrapping, the former may
generate empty BBs that are going to trigger a bug in shrink wraping
restore placement strategy. The restore is wrongly pushed to the BB
successor instead of being added to the current block. Add a pass to
go over the CFG to fix empty blocks by adding a temporary NOP
instruction that is going to be deleted later. Empty BBs are not
supported by one of the analysis done at this pass.
(cherry picked from FBD18717994)
Summary:
If -trap-avx512 option is not set, verify that we correctly encode
AVX-512 instructions and treat them as ordinary instructions.
(cherry picked from FBD18666427)
Summary:
There is no need to support existing functionality of adding entry
points after the CFG is built as the function is only called in empty or
disassembled state. Previously we used to run disassemble+buildCFG per
function, but now these phases are decoupled.
Also, remove a couple of redundant checks.
(cherry picked from FBD18622822)
Summary:
We only use locations of PC relocations and ignore the rest of the data.
There's no need to store type and value.
(cherry picked from FBD18623280)
Summary:
Speeding up cache+/ext-tsp block reordering algorithm.
On a high-level, the speedup is achieved by:
- precomputing and memorizing all jumps between a pair of chains
(instead of extracting them on every merge iteration);
- using a cache of size O(|E|) instead of O(|V|^2) as in previous version.
The final output is identical to previous one subject to a new deterministic
comparison of double values.
(cherry picked from FBD18380870)
Summary:
When we disassemble functions, we add discovered jump tables to a global
container in BinaryContext. Later, we analyze and verify all jump
tables. However, analysis for non-simple functions might fail for numerous
reasons, e.g. there would be no instruction at a destination. Since we
are not overwriting non-simple functions, it is not a critical error.
Thus, we can safely skip jump table analysis for non-simple functions.
(cherry picked from FBD18422997)
Summary:
Free more lists in BinaryFunction::releaseCFG().
Release BinaryFunction::Relocations after disassembly.
Do not populate BinaryFunction::MoveRelocations as we are not using them
currently.
Also remove PCRelativeRelocationOffsets that weren't used.
(cherry picked from FBD18413256)
Summary:
Once we emit function code, we no longer need CFG for next phases
that use basic blocks for address-translation and symbol update
purposes. We free memory used by CFG and instructions. The freed
memory gets reused by later phases resulting in overall memory usage
reduction.
We can probably improve memory consumption even further by replacing
BinaryBasicBlocks with more compact data structures.
(cherry picked from FBD18408954)
Summary:
We've used to emit special annotations to update SDT markers. However,
we can just use "Offset" annotations for the same purpose. Unlike BAT,
we have to generate "reverse" address translation tables.
This approach eliminates reliance on instructions after code emission.
(cherry picked from FBD18318660)
Summary:
Use BinaryBasicBlock::OffsetTranslationTable for BAT. This removes
dependency on instructions after the code emission.
(cherry picked from FBD18283965)
Summary:
Be default, we strip debug sections from the binary. Even though we did
not write the sections, we allocated space for them in the output binary
by mistake.
(cherry picked from FBD18218708)
Summary:
BOLT creates MCInst for every instruction from the input. For large
binaries, this means we are creating tens if not hundreds of millions of
instructions. If the number of operands for average instruction is much
less than 8, we benefit from changing the type of Operands from
SmallVector<MCOperand, 8> to SmallVector<MCOperand, 2>. That seems
to be the optimal type for X86-64 on average.
The size of MCInst goes down from 176 to 80 which often reduces BOLT
memory consumption by gigabytes.
(cherry picked from FBD18218924)
Summary:
We do not support optimizing functions with jump tables in
AArch64, but we do need to detect them. This idiom is slightly different
from the ones we've seen before. It encode jump table entries as
relative to the jump table itself instead of relative to the indirect
branch (BR) instruction.
(cherry picked from FBD18191100)
Summary:
For functions with unknown control flow, do not populate TakenBranches
with an entry pointing to the end of the function.
(cherry picked from FBD18034019)
Summary:
If collecting data in Intel Skylake machines, we may face a
bug where LBR0 or LBR1 may be duplicated w.r.t. the next entry. This
makes perf2bolt interpret it as an invalid trace, which ordinarily we
discard during aggregation. However, in BAT, since we do not disassemble
the binary where the collection happened but rely only on the
translation table, it is not possible to detect bad traces and discard
them. This gets to the fdata file, and this invalid trace ends up
invalidating the profile for the whole function (by being treated as
stale by BOLT).
In this patch, we detect Skylake by looking for LBRs with 32 entries,
and discard the first 2 entries to avoid running into this problem.
It also fixes an issue with collision in the translation map by
prioritizing the last basic block when more than one share the same
output address.
(cherry picked from FBD17996791)
Summary:
When reading debug info in parallel, CUs for functions were populated in
parallel and the order was non-deterministic. We used the first CU from
the non-deterministically-ordered list to set the line number resulting
in different outputs.
The fix is to sort the list after it's been created and before assigning
the line table unit.
(cherry picked from FBD17946889)
Summary:
merge-fdata for legacy format was simply appending all input
strings to output, but the real format supports some header strings
that can't be simply concatanated to output. Check for the header
string used by BAT before merging fdata to avoid creating an output
file with invalid lines (header in the middle of the fdata file).
For heatmap, avoid reading BAT tables, since they won't be used.
(cherry picked from FBD17943131)
Summary:
I noticed when setting up a new repository for bolt that bolt tests
would fail unexpectedly when running `ninja check-bolt` and
`ninja check-llvm`. This turns out to be because dependencies for bolt
binaries were not specified in the CMake configuration so they were not
built before running the tests. This diff adds the dependencies to the
CMake configuration for check-bolt and check-llvm so that bolt binaries
are built before running tests.
(cherry picked from FBD17919505)
Summary:
C++14 "sized deallocation" introduces a 2-argument `delete` where the new 2nd argument is the original allocated size. It's useful for allocators like jemalloc to be "reminded" of the original allocation size, else they incur the cost of an address to size lookup. Jemalloc has provided this for a while as `sdallocx`, and recently it got wired up to the new 2-arg `delete`.
Here I introduce typedefs for the SmallVectors so the "16" is consistent, which seems to fix the issue.
(cherry picked from FBD17618981)
Summary:
Change our CMake config for the standalone runtime instrumentation
library to check for the elf.h header before using it, so the build
doesn't break on systems lacking it. Also fix a SmallPtrSet usage where
its elements are not really pointers, but uint64_t, breaking the build
in Apple's Clang.
(cherry picked from FBD17505759)
Summary:
With the word "missed", the previous message about opportunities for
macro-op fusion optimization could be misleading.
(cherry picked from FBD17464603)
Summary:
The existing check for compiler de-virtualization bug was not working
when the relocation reference did not fall on a function boundary.
As a result, we were falsely detecting "unmarked object in code".
When running the check, the address could be arbitrary, except it
shouldn't match any existing function. Additionally, check that there's
a proper reference to the de-virtualized callee to avoid false
positives.
(cherry picked from FBD17433887)
Summary:
This diff applies to non-relocation mode mostly. In this mode, we are
limited by original function boundaries, i.e. if a function becomes
larger after optimizations (e.g. because of the newly introduced
branches) then we might not be able to write the optimized version,
unless we split the function. At the same time, we do not benefit from
function splitting as we do in the relocation mode since we are not
moving functions/fragments, and the hot code does not become more
compact.
For the reasons described above, we used to execute multiple re-write
attempts to optimize the binary and we would only split functions that
were too large to fit into their original space.
After the first attempt, we would know functions that did not fit
into their original space. Then we would re-run all our passes again
feeding back the function information and forcefully splitting
such functions. Some functions still wouldn't fit even after the
splitting (mostly because of the branch relaxation for conditional tail
calls that does not happen in non-relocation mode). Yet we have emitted
debug info as if they were successfully overwritten. That's why we had
one more stage to write the functions again, marking failed-to-emit
functions non-simple. Sadly, there was a bug in the way 2nd and 3rd
attempts interacted, and we were not splitting the functions correctly
and as a result we were emitting less optimized code.
One of the reasons we had the multi-pass rewrite scheme in place, was
that we did not have an ability to precisely estimate the code size
before the actual code emission. Recently, BinaryContext obtained such
functionality, and now we can use it instead of relying on the
multi-pass rewrite. This eliminates redundant work of re-running
the same function passes multiple times.
Because function splitting runs before a number of optimization passes
that run on post-CFG state (those rely on the splitting pass), we
cannot estimate the non-split code size with 100% accuracy. However,
it is good enough for over 99% of the cases to extract most of the
performance gains for the binary.
As a result of eliminating the multi-pass rewrite, the processing time
in non-relocation mode with `-split-functions=2` is greatly reduced.
With debug info update, it is less than half of what it used to be.
New semantics for `-split-functions=<n>`:
-split-functions - split functions into hot and cold regions
=0 - do not split any function
=1 - in non-relocation mode only split functions too large to fit
into original code space
=2 - same as 1 (backwards compatibility)
=3 - split all functions
(cherry picked from FBD17362607)
Summary: `perf2bolt` accepts executable name, and the tool will find all the PIDs associated with that executable. When different versions of an executable are running at the same time, name alone may not be sufficient as we will get samples from different versions of the binary aggregated together. The resulting fdata may look stale to BOLT, which makes BOLT bailout optimization for functions. This change adds a `-pid` switch that lets user specify process ID in addition to executable name so BOLT can target a specific process.
(cherry picked from FBD17178898)
Summary: This change adds a switch (`ignore-interrupt-lbr`) to ignores LBR from perf input that is result of kernel interrupts. These asynchronous flow of user/kernel transition will make BOLT think that profile is stale, thus bailout optimization for functions. Ideally, user mode filter need to be set for `perf record` so we don't have asynchronous LBRs. However these are identifiable as kernel address space is known, so we can ignore any LBRs that come from or go into kernel addresses during aggregation. This is under a switch and off by default in case we need to BOLT kernel module.
(cherry picked from FBD17170107)
Summary:
Change our edge profiling technique when using instrumentation
to do not instrument every edge. Instead, build the spanning tree
for the CFG and omit instrumentation for edges in the spanning tree.
Infer the edge count for these edges when writing the profile during
run time. The inference works with a bottom-up traversal of the spanning
tree and establishes the value of the edge connecting to the parent based
on a simple flow equation involving output and input edges, where the
only unknown variable is the parent edge.
This requires some engineering in the runtime lib to support dynamic
allocation for building these graphs at runtime.
(cherry picked from FBD17062773)
Summary:
We start a thread to preprocess the profile while the main
thread continues to disassemble the input binary. We should not
disassemble it in BAT mode, however, the test to check whether we have
BAT in the input binary depends on the preprocessing thread, so there
is a race where we may start disassembling functions just because the
preprocessing thread didn't conclude we are in BAT mode. Fix this and
make the main thread check for BAT without depending on the
preprocessing thread.
(cherry picked from FBD17124370)
Summary:
We decode the regular .plt section and we are able to perform
optimizations on it with -plt=hot or -plt=all, however, .plt.got
sections are not decoded by BOLT. This patch teaches BOLT how to read
them. They are created by the bfd linker whenever there is no need for
the dynamic linker to lazy-bind the symbol (when they are eagerly
resolved at binary load time). These entries are 8-byte sized instead of
16-byte sized like the regular PLT, and contain a single indirect call
instruction with 7 bytes and a nop.
(cherry picked from FBD17060515)
Summary:
We should not rely on split function detection while aggregating
data, but only look up the original function names in the symbol table.
Split function detection should be done by BOLT and not perf2bolt while
writing the profile. Then, BOLT, when reading it, will take care of
combining functions if necessary.
This caused a bug in bolted data collection where samples in cold parts
of a function were being falsely attributed to the hot part of a function
instead of being attributed to the cold part, causing incorrect translation of
addresses.
(cherry picked from FBD16993065)
Summary:
We were too permissive by allowing more jump tables during the
preliminary scan of memory. This allowed for jump tables to be
falsely detected. And since we didn't have a way to backtrack
the jump table creation, we had to assert.
This diff refactors the code that analyzes jump table contents.
Preliminary and final passes share the same code. The only difference
should be the detection of instruction boundaries that are available
during the final pass.
This should affect strict relocation mode only.
(cherry picked from FBD16923335)
Summary:
BOLT prints "spawning thread to pre-process profile" message even when
it is not running in the aggregation mode. Fix that.
(cherry picked from FBD16908596)
Summary:
Avoid directly allocating string and description tables in
binary's static data region, since they are not needed during runtime
except when writing the profile at exit. Change the runtime library to
open the tables on disk and read only when necessary.
(cherry picked from FBD16626030)
Summary:
To allow the development of future instrumentation work, this
patch adds support in BOLT for linking arbitrary libraries into the
binary processed by BOLT. We use orc relocation handling mechanism for
that. With this support, this patch also moves code programatically
generated in X86 assembly language by X86MCPlusBuilder to C code written
in a new library called bolt_rt. Change CMake to support this library as
an external project in the same way as clang does with compiler_rt. This
library is installed in the lib/ folder relative to BOLT root
installation and by default instrumentation will look for the library
at that location to finish processing the binary with instrumentation.
(cherry picked from FBD16572013)
Summary:
Add a flag that disable writing botl-info section
and add a test that run bolt with two modes parallel
and sequential and assert that the resulting binaries
are the same.
(cherry picked from FBD16575440)
Summary:
Add option `-check-encoding` to verify if the input to LLVM disassembler
matches the output of the assembler. When set, the verification runs on
every instruction in processed functions.
I'm not enabling the option by default as it could be quite noisy on x86
where instruction encoding is ambiguous and can include redundant
prefixes.
(cherry picked from FBD16595415)
Summary:
In strict relocation mode, we get better function coverage. However, if
the profile used for optimization was converted using non-strict mode,
then it wouldn't match functions exclusive to strict mode. Hence,
we have to enforce strict relocation mode for profile conversion, so it
can be used for either mode.
I'm also adding parallel profile pre-processing unless `--no-threads` is
specified. This masks the runtime overhead of function disassembly on
multi-core machines.
(cherry picked from FBD16587855)
Summary:
switch to sequential execution when print-all is passed.
Since the function getDynoStats have an unsafe access
to the annotation allocators.
(cherry picked from FBD16503502)
Summary:
hfsort+ performs an expensive analysis to determine the
new order of the functions. 99% of the time during hfsort+
is spent in the function runPassTwo. This diff runs the body
of the hot loop in runPassTwo in parallel speeding up the
total runtime of reorder-functions pass by up to 4x
(cherry picked from FBD16450780)
Summary:
In non-relocation mode, we allow data objects to be embedded in the
code. Such objects could be unmarked, and could occupy an area between
functions, the area which is considered to be code padding.
When we disassemble code, we detect references into the padding area
and adjust it, so that it is not overwritten during the code emission.
We assume the reference to be pointing to the beginning of the object.
However, assembly-written functions may reference the middle of an
object and use negative offsets to reference data fields. Thus,
conservatively, we reduce the possibly-overwritten padding area to
a minimum if the object reference was detected.
Since we also allow functions with unknown code in non-relocation mode,
it is possible that we miss references to some objects in code.
To cover such cases, we need to verify the padding area before we
allow to overwrite it.
(cherry picked from FBD16477787)
Summary:
While reading debug info the function findSubprograms
runs on each compilation unit. This diff parallelize that loop
reducing its runtime duration by 70%.
(cherry picked from FBD16362867)
Summary:
BOLT spends a decent amount of time creating patches to update
debug information when -update-debug-sections is passed.
In updateDebugInfo patches are created to update .debug_info
and .debug_abbrev sections while .debug_loc and .debug_ranges
contents are populated. This this diff uses a lock-based approach to
parallelize updateDebugInfo functions and reduces the runtime of the
function by around 30%.
(cherry picked from FBD16352261)
Summary:
After stripping annotations, conditional tail calls no longer can be
identified by their corresponding tag. We can check the number of basic
block successors instead.
Fixesfacebookincubator/BOLT#58.
(cherry picked from FBD16444718)
Summary:
Shrink wrapping is an expensive part of frame optimizations if
performed on all functions. This diff makes it run in parallel,
reducing wall time.
(cherry picked from FBD16092651)
Summary:
This diff parallelize the construction of call graph during disassembly.
The diff includes a change to parallel-utilities where another interface
is added, that support running tasks on binaryFunctions that involves
adding instruction annotations. This pattern is common in different places,
e.g. frame optimizations. And such, pattern justify creating an interface,
that abstract out all the messy details.
(cherry picked from FBD16232809)
Summary:
This diff change reorderBasicBlocks pass to run in parallel,
it does so by adding locks to the fix branches function,
and creating temporary MCCodeEmitters when estimating basic block code size.
(cherry picked from FBD16161149)
Summary:
If two indirect branches use the same jump table, we need to
detect this and duplicate dump tables so we can modify this CFG
correctly. This is necessary for instrumentation and shrink wrapping.
For the latter, we only detect this and bail, fixing this old known
issue with shrink wrapping.
Other minor changes to support better instrumentation: add an option
to instrument only hot functions, add LOCK prefix to instrumentation
increment instruction, speed up splitting critical edges by avoiding
calling recomputeLandingPads() unnecessarily.
(cherry picked from FBD16101312)
Summary:
Heuristic that creates a jump table for every memory access,
including those we do not match against a pattern in an indirect jump,
is too permissive and has false positives. Guard this logic under
strict mode until we figure out a better strategy.
(cherry picked from FBD16192205)
Summary:
Each time we run some work in parallel over the list of functions in bolt, we manage a thread pool, task scheduling and perform some work to manage the granularity of the tasks based on the type of the work we do.
In this task, I am creating an interface where all those details are abstracted out, the user provides the function that will run on each function, and some policy parameters that setup the scheduling and granularity configurations.
This will make it easier to implement parallel tasks, and eliminate redundant coding efforts.
(cherry picked from FBD16116077)
Summary:
This diff parallelize the STPClean() function reducing its runtime from 5 seconds to 0.4 on HHVM,
Making the runtime for the frame optimizer goes down to 33 seconds on HHVM.
(cherry picked from FBD15914371)
Summary:
This diff includes two main changes:
1) When creating an annotation, a dedicated annotation allocator can be used, instead of the default allocator. This allows some annotation to be deallocated right after the end of their usage completely. Furthermore, having the ability to use dedicated allocators allows running SPT in parallel where each task uses a different allocator.
2) SPT is parallelized.
(cherry picked from FBD15913492)
Summary: We select the top hot targets for indirect call promotion. But since we only have frequency for targets, not for actual jump table indices, we have to merge indices that share the same actual target. In order to do that we sort targets by pointer of target symbol before merging, which introduces instability. Later we stable sort merged targets by frequency. Due to the instability of sorting pointers, and depending on how many indices each merged target has, we could end up with unstable ICP. This commit changes the 2nd pass sorting to prioritize targets with fewer indices, and higher mispredicts, in addition to higher frequency. It improves stability of ICP, and also exposes more ICP because targets with fewer indices has better chance of getting promoted.
(cherry picked from FBD16099701)
Summary:
Check that a symbol address is less than the next function
address before considering it for a secondary entry.
(cherry picked from FBD16056468)
Summary:
In strict relocation mode we rely on relocations to represent all
possible entry points into a function. Most of the code generated by
tested compilers (gcc and clang) will result in relocations against
any internal labels for jump tables and for computed goto tables.
In situations where we cannot properly reconstruct a jump table, or when
we cannot determine a table that guides an indirect jump, e.g. when
multiple computed goto tables are used, we conservatively assume that
the indirect jump can end up at any possible basic block referenced by
relocations.
In strict mode, simple functions may include the aforementioned
instructions with unknown control flow with a conservative list of
destinations added to the containing basic block. This allows us to
expand coverage of simple functions and to enable code reordering
optimizations for more functions.
The strict mode is recommended when BOLT is used with a well-formed
code generated by a compiler.
To use the strict mode, add "-strict" on the command line.
Another effect of this diff, is that with relocations, we will always
replace the immediate operand of an instruction with a symbol if the
relocation exists against this operand.
Also this diff fixes issues with Clang compiled with -fpic.
(cherry picked from FBD15872849)
Summary:
A relocation can have an addend that makes it look as the relocated
value is in a different section from the symbol being relocated.
E.g., a relocation against a variable in .rodata could have a negative
offset that will make it look like it is against a symbol in .text
(a section that typically precedes .rodata).
Unless the relocation is against a section symbol, we know
exactly the symbol that is being relocated and there is no issue.
However, when the linker leaves only a section relocation (i.e. a
relocation against a section symbol when a temporary original symbol
gets deleted), we have to guess the relocated symbol, and can falsely
detect a function reference in the case described above.
The fix is to keep a section relocation if the corresponding
relocated value falls into a different section, and to detect and
ignore false function reference.
(cherry picked from FBD16030791)
Summary: BOLT operates in relocation mode by default when .reloc is in the binary. This changes disables relocation mode for heatmap generation so we can use that for more cases. There's a small separate change that ignores zero-sized symbol in zero-sized code section during function discovery.
(cherry picked from FBD16009610)
Summary:
An instrumentation pass that modifies the input binary to
generate a profile after execution finishes. It modifies branches to
increment counters stored in the process memory and injects a new
function that dumps this data to an fdata file, readable by BOLT.
This instrumentation is experimental and currently uses a naive
approach where every branch is instrumented. This is not ideal for
runtime performance, but should be good enough for us to
evaluate/debug LBR profile quality against instrumentation.
Does not support instrumenting indirect calls yet, only direct
calls, direct branches and indirect local branches.
(cherry picked from FBD15998096)
Summary:
Make BOLT ignore empty functions (those containing no instructions,
despite having some space allocated to it filled with zeroes).
(cherry picked from FBD15981683)
Summary:
Profile bias may happen depending on the hardware counter used
to trigger LBR sampling, on the hardware implementation and as an
intrinsic characteristic of relying on LBRs. Since we infer fall-through
execution and these non-taken branches take zero hardware resources to
be represented, LBR-based profile likely overrepresents paths with fall
throughs and underrepresents paths with many taken branches. This patch
adds an option to print statistics about profile bias so we can better
understand these biases.
The goal is to analyze differences in the sum of the frequency of all
incoming edges in a basic block versus the sum of all outgoing. In an
ideally sampled profile, these differences should be close to zero. With
this option, the user gets the mean of these differences in flow as a
percentage of the input flow. For example, if this number is 15%, it
means, on average, a block observed 15% more or less flow going out of
it in comparison with the flow going in. We also print the standard
deviation so we can have an idea of how spread apart are different
measurements of flow differences. If variance is low, it means the
average bias is happening across all blocks, which is compatible with
using LBRs. If the variance is high, it means some blocks in the profile
have a much higher bias than others, which is compatible with using a
biased event such as cycles to sample LBRs because it overrepresents
paths that end in an expensive instruction.
(cherry picked from FBD15790517)
Summary:
ICF consumes 10-15% of bolt runtime, for HHVM that is around 45 seconds.
this diff perform some parallelization for the pass to make it faster.
A 60% reduction in the ICF runtime is measured on the parallel version for HHVM.
(cherry picked from FBD15589515)
Summary:
Now that we populate jump tables after all functions are disassembled,
we can check for instruction boundaries corresponding to jump table
entries. No need to delegate this task to postProcessJumpTables().
(cherry picked from FBD15814762)
Summary:
During the initial disassembly pass, only identify jump tables
without populating the contents. Later, after all functions have been
disassembled, we have a better idea of jump table boundaries and can do
a better job of populating their entries.
As a result, we no longer have embedded jump tables (i.e. a jump table
that is parter of another jump table). If we ever need to keep
sequential jump tables inseparable during the output, we can always
add such functionality later.
Fixesfacebookincubator/BOLT#56.
(cherry picked from FBD15800427)
Summary:
During frame analysis, the functions do not change, and stack pointer tracking
does not need to be performed more than one time.
The current implementation performs the SPT analysis multiple times per
function during the frame analysis, we ca eliminate such computation redundancy.
On HHVM with -frame-opts=hot, this save around a minute which is 40% of the
frame optimization runtime. (129s to 76s).
fdata should be passed for a reasonable evaluation (we need the call graph).
However, this comes at a memory cost, around 2G to the peak when only -frame-opt=hot only is used but,
When all the usual flags are passed, the effect is to the peak is only 200K (measured from one test).
Update:
When jemalloc is used the base became way better and the following runtime are observed:
[jemalloc]
hhvm 85 --> 72.
clang 27 --> 23.
[malloc]
hhvm 129 --> 76.
clang 34 --> 27.
(cherry picked from FBD15707003)
Summary:
Add an option to get extra profile trace using the recorded event PC.
The trace goes from the latest LBR record destination to the event PC.
(cherry picked from FBD15711804)
Summary:
We used to handle PC-relative address references differently from direct
address references. As a result, some cases, such as escaped function
label address, were not handled when dealing with absolute (non-PIC)
code. This diff moves processing of an address reference into
BinaryContext::handleAddressRef() which is called for both PIC and
non-PIC code.
(cherry picked from FBD15643535)
Summary:
Compile Bolt using std 14.
We want that to be able to use some threading the locking tools that do not exists in std 11.
(cherry picked from FBD15671736)
Summary:
Similarly to how the compiler relies on DWARF to map samples, so
it is possible to collect profile data in binaries optimized by PGO
techniques and retrofit data to be used in a representation of the program
that was not optimized by PGO, this diff implements an option in BOLT to
encode a table in the output binary that allows us to map data collected
in optimized binaries back to the address space used in the input binary
(where the profile is useful, since we do not support running BOLT on a
binary already optimized by BOLT). The goal is to offer an option to
support BOLT in scenarios where it is not easy to run a special deployment of
the binary with a version that was not optimized by BOLT just for data
collection.
This feature is enabled with the -enable-bat flag. BAT stands for BOLT
Address Translation, which refers to the process of mapping output to
input addresses.
(cherry picked from FBD15531860)
Summary:
Options such as `-print-only`, `-skip-funcs`, etc. now take regular
expressions. Internally, the option is converted to '^funcname$' form
prior to regex matching. This ensures that names without special
symbols will match exactly, i.e. "foo" will not match "foo123".
(cherry picked from FBD15551930)
Summary:
Run analyzeIndirectBranch() using basic block boundaries instead of
running ad-hoc validation of the jump table assumptions.
(cherry picked from FBD15465034)
Summary:
Move handling of interprocedural references to BinaryContext.
Post-process indirect branches immediately after the CFG is built.
This is almost NFC. Since indirect branches are now post-processed
before the profile data is processed it interferes with the way the
profile data in YAML format is handled.
(cherry picked from FBD15456003)
Summary:
Add an option:
-memcpy1-spec=func1,func2:cs1,func3:cs1:cs2,...
to specialize calls to memcpy() in listed functions (the name could be
supplied in regex) for size 1. The optimization will dynamically check
if the size argument equals to 1 and execute a one byte copy, otherwise
it will call memcpy() as usual. Specific call sites could be indicated
after ":" using their numeric count from the start of the function.
(cherry picked from FBD15428936)
Summary:
SDT markers that appears as nops in the assembly, are preserved and not eliminated.
Functions with SDT markers are also flagged. Inlining and folding are disabled for
functions that have SDT markers.
(cherry picked from FBD15379799)
Summary:
Parse statically defined tracepoints(SDT) markers from the ELF file, and store them.
Add an option to print SDTs (-print-sdt).
Add test case for parsing and printing SDTs.
(cherry picked from FBD15366712)
Summary:
Previously, ICP worked with a budget of N targets to convert to
direct calls. As long as the frequency of up to N of the hottest targets
surpassed a given fraction (threshold) of the total frequency, say, 90%,
then the optimization would convert a number of targets (up to N) to
direct calls. Otherwise, it would completely abort processing this call
site. The intent was to convert a given fraction of the indirect call
site frequency to use direct calls instead, but this ends up being a
"all or nothing" strategy.
In this patch we change this to operate with the same strategy seem in
LLVM's ICP, with two thresholds. The idea is that the hottest target of
an indirect call site will be compared against these two thresholds: one
checks its frequency relative to the total frequency of the original
indirect call site, and the other checks its frequency relative to the
remaining, unconverted targets (excluding the hottest targets that were
already converted to direct calls). The remaining threshold is typically
set higher than the total threshold. This allows us more control over
ICP.
I expose two pairs of knobs, one for jump tables and another for
indirect calls.
To improve the promotion of hot jump table indices when we have memory
profile, I also fix a bug that could cause us to promote extra indices
besides the hottest ones as seen in the memory profile. When we have the
memory profile, I reapply the dual threshold checks to the memory
profile which specifies exactly which indices are hot. I then update N,
the number of targets to be promoted, based on this new information, and
update frequency information.
To allow us to work with smaller profiles, I also created an option in
perf2bolt to filter out memory samples outside the statically allocated
area of the binary (heap/stack). This option is on by default.
(cherry picked from FBD15187832)
Summary:
Make BinaryContext responsible for creation and management of
JumpTables. This will be used for detection and resolution of jump table
conflicts across functions.
(cherry picked from FBD15196017)
Summary:
perf tool requires the input data to be owned by the current user or
root, otherwise it rejects the input. Use `-f` option to override this
behavior.
(cherry picked from FBD15160678)
Summary:
While checking for a size of a jump table, we've used containing
section as a boundary. This worked for most cases as typically jump
tables are not marked with symbol table entries. However, the compiler
may generate objects for indirect goto's.
(cherry picked from FBD15158905)
Summary:
We used to ignore debug sections by default, but we kept them in the
binary which led to invalid debug information in the output. It's better
to strip debug info and print a warning to the user.
Note: we are not updating debug info by default due to high memory
requirements for large applications.
(cherry picked from FBD15128947)
Summary:
Commit "Update symbols for secondary entry points" introduced
a bug by using getBinaryFunctionContainingAddress() instead of
getBinaryFunctionAtAddress() regarding ICF'd functions. Only the latter
would fetch the correct BinaryFunction object for addresses of functions
that were ICF'd. As a result of this bug, the dynamic symbol table was
not updated for function symbols that were folded by ICF.
(cherry picked from FBD15112941)
Summary:
In non-relocation mode we may execute multiple re-write passes either
because we need to split large functions or update debug information for
large functions (in this context large functions are functions that do
not fit into the original function boundaries after optimizations).
When we execute another pass, we reset RewriteInstance and run most of
the steps such as disassembly and profile matching for the 2nd or 3rd
time. However, when we match a profile, we check `Used` flag, and don't
use the profile for the 2nd time. Since we didn't reset the flag while
resetting the rest of the states, we ignored profile for all functions.
Resetting the flag in-between rewrite passes solves the problem.
(cherry picked from FBD15110959)
Summary:
For pre-aggregated profile, we were using the number of records in the
profile for `NumTraces` ignoring the counts per record. As a result,
the reported percentage of mismatched traces was bogus.
(cherry picked from FBD15093123)
Summary:
Enable -hot-text by default if reordering functions.
Also fail immediately if function reordering is specified on the command
line in non-relocation mode.
(cherry picked from FBD15095178)
Summary: BOLT works as a series of patches rebased onto upstream LLVM at revision `f137ed238db`. Some of these patches introduce unnecessary whitespace changes or includes. Remove these to minimize the diff with upstream LLVM.
(cherry picked from FBD15064122)
Summary:
Update the output ELF symbol table for symbols representing
secondary entry points for functions. Previously, those were left
unchanged in the symtab.
(cherry picked from FBD15010517)
Summary:
`size_t` is platform-dependent, and on macOS it is defined as
`unsigned long long`. This is not the same type as is used in many calls
to templated functions that expect the same type. As a result, on macOS,
calls to `std::max` fail because a template function that takes
`uint64_t, unsigned long long` cannot be found.
To work around the issue:
* Specify explicit `std::max` and `std::min` functions where necessary,
to work around the compiler trying (and failing) to find a suitable
instantiation.
* For lambda return types, specify an explicit return type where necessary.
* For `operator ==()` calls, use an explicit cast where necessary.
(cherry picked from FBD15030283)
Summary:
When attempting to build llvm-bolt with `-DLLVM_ENABLE_TARGETS="X86"`, I
encountered an error:
```
CMake Error at cmake/modules/AddLLVM.cmake:559 (add_dependencies):
The dependency target "AArch64CommonTableGen" of target
"LLVMBOLTTargetAArch64" does not exist.
Call Stack (most recent call first):
cmake/modules/AddLLVM.cmake:607 (llvm_add_library)
tools/llvm-bolt/src/Target/AArch64/CMakeLists.txt:1 (add_llvm_library)
```
The issue is that the `llvm-bolt/src/Target/AArch64` subdirectory is
added by CMake unconditionally. The LLVM project, on the other hand,
only adds the subdirectories that are enabled, by using a `foreach` loop
over `LLVM_TARGETS_TO_BUILD`. Copying that same loop, from
`llvm/lib/Target/CMakeLists.txt`, to this project avoids the error.
(cherry picked from FBD15030236)
Summary:
Iterating over SmallPtrSet is non-deterministic. Change it to
SmallSetVector. Similarly, do not sort a vector of ProgramPoint when
computing the dominance frontier, as ProgramPoint uses the pointer value
to determine order. Use a SmallSetVector there too to avoid duplicates
instead of sorting + uniqueing.
(cherry picked from FBD14992085)
Summary:
If a function size indicated in FDE is different from the one in the
symbol table, we can keep processing the function as we are using the
max size for internal purposes. Typically this happens for
assembly-written functions with padding at the end. This padding is not
included in FDE, but it is in the symbol table.
(cherry picked from FBD14987653)
Summary:
This adds very basic and limited support for split functions.
In non-relocation mode, split functions are ignored, while their debug
info is properly updated. No support in the relocation mode yet.
Split functions consist of a main body and one or more fragments.
For fragments, the main part is called their parent. Any fragment
could only be entered via its parent or another fragment.
The short-term goal is to correctly update debug information for split
functions, while the long-term goal is to have a complete support
including full optimization. Note that if we don't detect split
bodies, we would have to add multiple entry points via tail calls,
which we would rather avoid.
Parent functions and fragments are represented by a `BinaryFunction`
and are marked accordingly. For now they are marked as non-simple, and
thus only supported in non-relocation mode. Once we start building a
CFG, it should be a common graph (i.e. the one that includes all
fragments) in the parent function.
The function discovery is unchanged, except for the detection of
`\.cold\.` pattern in the function name, which automatically marks the
function as a fragment of another function.
Because of the local function name ambiguity, we cannot rely on the
function name to establish child fragment and parent relationship.
Instead we rely on disassembly processing.
`BinaryContext::getBinaryFunctionContainingAddress()` now returns a
parent function if an address from its fragment is passed.
There's no jump table support at the moment. Jump tables can have
source and destinations in both fragment and parent.
Parent functions that enter their fragments via C++ exception handling
mechanism are not yet supported.
(cherry picked from FBD14970569)
Summary:
On some platforms
`llvm::make_error_code(std::errc::no_such_process) == std::errc::no_such_process`
evaluates to false.
(cherry picked from FBD14944405)
Summary:
It's possible to pass a profile in invalid format to BOLT, and we
silently ignore it. This could cause a regression as such scenario can
go undetected. We should abort processing if no valid data was seen in
the profile and issue a warning if it was partially invalid.
(cherry picked from FBD14941211)
Summary:
If a function was already marked as non-simple, there's no reason to
issue a warning that it has a reference in the middle of an
instruction. Besides, sometimes there wouldn't be instructions
disassembled at a given entry, and the warning would be incorrect.
(cherry picked from FBD14938227)
Summary:
In binutils 2.30 a bfd linker accidentally started modifying some
relocations on output under `-q/--emit-relocs` by turning on
R_X86_64_converted_reloc_bit. As a result, BOLT ignored such
relocations and failed to correctly update the binary.
This diff filters out R_X86_64_converted_reloc_bit from the relocation
type.
(cherry picked from FBD14907832)
Summary:
For easier analysis of the hottest targets of jump tables it helps to
have basic block successors sorted based on the taken frequency.
(cherry picked from FBD14856640)
Summary:
If processing the perf.data in LBR mode but the data was
collected without -j, currently we confusingly report all samples
to mismatch the input binary, even though the samples match but
lack LBR info. Change perf2bolt to detect this scenario and print
a helpful message instructing the user to collect data with LBR.
(cherry picked from FBD14817732)
Summary:
While updating DWARF, we used to convert address ranges for functions
into DW_AT_ranges format, even if the ranges were not split and still
had a simple [low, high) form. We had to do this because functions with
contiguous ranges could be sharing an abbrev with non-contiguous range
function, and we had to convert the abbrev.
It turns out, that the excessive usage of DW_AT_ranges may lead to
internal core dumps in gdb in the presence of .gdb_index.
I still don't know the root cause of it, but reducing the number
DW_AT_ranges used by DW_TAG_subprogram DIEs does alleviate the
issue.
We can keep a simple range for DIEs that are guaranteed not to
share an abbrev with any non-contiguous function. Hence we have to
postpone the update of function ranges until we've seen all DIEs.
Note that DIEs from different compilation units could share the same
abbrev, and hence we have to process DIEs from all compilation units.
(cherry picked from FBD14814043)
Summary:
Some instructions in assembly-written functions could reference 8-byte
constants from another instructions using 4-byte offsets, presumably to
save a couple of bytes.
Detect such cases, and skip processing such functions until we teach
BOLT how to handle references into a middle of instruction.
(cherry picked from FBD14768212)
Summary:
A long due refactoring that makes interfaces cleaner and less awkward.
Mainly makes the future work way easier.
(cherry picked from FBD14766284)
Summary:
When we patch .debug_abbrev we issue many duplicate patches. Instead of
storing these patches as a vector, use a hash map. This saves some
processing time and memory.
(cherry picked from FBD14691292)
Summary:
In non-relocation mode we were accidentally emitting section headers for
every single jump table. This happened with default
`-jump-tables=basic`.
(cherry picked from FBD14653282)
Summary:
While using "-hot-text" option, we might not get enough cold text to
fill up the last huge page, and we can get data allocated on this page
producing undesirable effects. To prevent this from happening, always
make sure to allocate enough space past __hot_end.
(cherry picked from FBD14575100)
Summary:
While removing redundant local symbols, we used new section index to
lookup the corresponding section in the old section table. As a result,
we used to either not remove the correct symbols, or remove the wrong
ones.
(cherry picked from FBD14552047)
Summary:
We used to use existing symbol binding while duplicating and renaming
cold fragment symbols. As a result, some of those were emitted with
global binding. This confuses gdb, and it starts treating those symbols
as additional entry points.
The fix is to always emit such symbols with a local binding. This also
means that we have to sort static symbol table before emission to make
sure local symbols precede all others.
(cherry picked from FBD14529265)
Summary:
Create a separate pass for assigning functions to sections. Detect
functions originating from special sections (by default .stub and
.mover) and place them into ".text.mover" if "-hot-text" options is
specified.
Cold functions are isolated from hot functions even when no function
re-ordering is specified.
(cherry picked from FBD14512628)
Summary:
GDB does not like if the first entry in the line info table after
end_sequence entry is not marked with is_stmt. If this happens, it will
not print the correct line number information for such address. Note
that everything works fine starting with the first address marked
with is_stmt.
This could happen if the first instruction in the cold section wasn't
marked with is_stmt.
The fix is to always emit debug line info for the first instruction
in any function fragment with is_stmt flag.
(cherry picked from FBD14516629)
Summary:
This refactoring makes it easier to create new code sections and control
code placement. As an example, cold code is being placed into
".text.cold" which is emitted independently from ".text", and the final
address assignment becomes more flexible.
Previously, in non-relocation mode we used to emit temporary section
name into .shstrtab. This resulted in unnecessary bloat of this section.
There was unnecessary padding emitted at the end of text section. After
fixing this, the output binary becomes smaller.
I had to change the way exception handling tables are re-written
as the current infra does not support cross-section label difference.
This means we have to emit absolute landing pad addresses, which might
not work for PIE binaries. I'm going to address this once I investigate
the current exception handling issues in PIEs.
This diff temporarily disables "-hot-functions-at-end" option.
(cherry picked from FBD14475693)
Summary: As part of our heuristics to decode an indirect branch, if we
suspect the branch is an indirect tail call, we add its probable target
to the BC::InterproceduralReferences vector to detect functions with
more than one entry point. However, if this probable target is not in an
allocatable section, we were asserting. Remove this assertion and
change the code to conditionally store to InterproceduralReferences
instead. The probable target could be garbage at this point because
of analyzeIndirectBranch failing to identify the load instruction that
has the memory address of the target, so we should tolerate this.
(cherry picked from FBD14432821)
Summary:
Add heatmap subcommand to produce heatmaps based on perf.data with LBR.
The output is produced in colored ASCII format.
llvm-bolt heatmap -p perf.data <executable>
-block-size=<uint> - size of a heat map block in bytes (default 64)
-line-size=<uint> - number of entries per line (default 256)
-max-address=<uint> - maximum address considered valid for heatmap
(default 4GB)
-o=<string> - heatmap output file (default stdout)
(cherry picked from FBD13969992)
Summary:
For non-simple function we can miss a reference to a jump table or
to an indirect goto table. If we move the jump table, the missed
reference will not get updated, and the corresponding indirect jump
will end up in the old (wrong) location. Updating the original jump
table in-place should take care of the issue.
(cherry picked from FBD13849776)
Summary:
While converting perf profile, we only need CFG for functions that were
profiled and can skip building CFG for the rest. This saves us some
processing time and memory.
Breakdown processing of perf.data into two steps. The first
step parses the data, saves it in intermediate format, and marks
functions with the profile. The second step attributes the profile to
functions with CFG. When we disassemble and build CFG for functions in
aggregate-only mode, we skip functions without the profile.
(cherry picked from FBD13706697)
Summary:
Improve tracking of forked processes.
If a process corresponding to the input binary has forked/started
before 'perf record' was initiated, then the full name of the binary
will be recorded in a corresponding MMAP2 event. We've being handling
such cases well so far.
However, if the process was forked after 'perf record' has started, and
execve(2) wasn't called afterwards, then there will be no MMAP2 event
recorded corresponding to the mapping of the main binary (unrelated
MMAP2 events could still be recorded).
To track such cases, we need to parse 'perf script --show-task-events'
command output, and to scan for PERF_RECORD_FORK events, and then add
forked process PIDs to the list associated with the input binary. If
the fork event was followed by an exec event (PERF_RECORD_COMM exec)
of a different binary, then the forked PID should be ignored. If the
exec event was associated with our input binary, then the correct MMAP2
event was recorded and parsed.
To track if the event occurred before or after 'perf record', we parse
event's time. This helps us to differentiate some events. E.g. the exec
event is only registered correctly if it happened after perf recording
has started (otherwise the "exec" part is missing), and thus we only
record forks with non-zero time stamps.
(cherry picked from FBD13250904)
Summary:
Use newly added function size estimation to measure the effectiveness
and guide function splitting. Two new tuning options are added:
-split-threshold=<uint>
split function only if its main size is reduced by more than given
amount of bytes. Default value: 0, i.e. split iff the size is reduced.
Note that on some architectures the size can increase after splitting.
-split-align-threshold=<uint>
when deciding to split a function, apply this alignment while doing
the size comparison (see -split-threshold). Default value: 2.
(cherry picked from FBD13136352)
Summary:
Add BinaryContext::calculateEmittedSize() that ephemerally emits code
to allow precise estimation of the function size. Relaxation and
macro-op alignment adjustments are taken into account.
(cherry picked from FBD13092139)
Summary:
On x86 the difference between long and short jump instructions could be
either 4 or 3 bytes, depending if it's a conditional jump or not.
For a basic block with 2 jump instructions, if we know that one of
the successors is in a different code region, then we can make it
a target of an unconditional jump instruction. This will save 1 byte
in case the conditional jump happens to be a short one.
(cherry picked from FBD13078139)
Summary:
When Clang is boot-strapped with (Thin)LTO, it may produce a code
fragment similar to below:
.LFT663334 (6 instructions, align : 1)
Predecessors: .LFT663333
00000538: movb $0x1, %al
0000053a: movl %eax, -0x2c(%rbp)
0000053d: movl $"_ZN5clang6Parser12ConsumeParenEv/1", %ecx
00000542: testb $0x1, %cl
00000545: movq -0x40(%rbp), %r14
00000549: je .Ltmp1071462
Successors: .Ltmp1071462, .LFT663335
.LFT663335 (2 instructions, align : 1)
Predecessors: .LFT663334
0000054b: movq (%r12), %rax
0000054f: movq .Ltmp0(%rax), %rcx
Successors: .Ltmp1071462
.Ltmp1071462 (7 instructions, align : 1)
Predecessors: .LFT663334, .LFT663335
00000556: movq %r12, %rdi
00000559: callq *%rcx
.......
The code above is making a call by dereferencing a pointer to a member
function. A pointer to a member function could either be a regular
function, or a virtual function. To differentiate between the two, AMD64
ABI (originated from Itanium ABI) uses the last bit of the pointer. The
call instruction sequence varies depending if the function is virtual or
not, and the pointer's last bit is checked. If it's "1" then the value
of the pointer (minus 1) is used as an offset in the object vtable to
get the address of the function, otherwise the pointer is used directly
as a function address.
In this specific case, a de-virtualization is taking place, but it's not
complete. Compiler knows that the member function pointer is actually a
non-virtual function _ZN5clang6Parser12ConsumeParenEv (aka
"clang::Parser::ConsumeParen()"). However, it keeps the (dead) code that
checks the last bit of _ZN5clang6Parser12ConsumeParenEv, and furthermore
keeps the code (unreachable/dead) to make a virtual call while using
(_ZN5clang6Parser12ConsumeParenEv - 1) as an offset into the vtable.
This is obviously wrong, but since the code is unreachable, it will
never affect the runtime correctness.
The value "_ZN5clang6Parser12ConsumeParenEv - 1" falls into a last byte
of a function preceding _ZN5clang6Parser12ConsumeParenEv, and BOLT
creates a label ".Ltmp0" pointing to this last byte that is referenced
in by the instruction sequence above. It just happens that the last byte
is also in the middle of the last instruction, and as a result, BOLT
never emits the label, hence resulting in the error message "Undefined
temporary symbol".
The workaround is to detect non-pc-relative relocations from code
pointing to some (fptr - 1). Note that this is not completely
error-prone, but non-pc-relative references from code into a middle of
a function are quite rare, and chances that in a normal situation they
will point to a byte preceding some function address are virtually zero.
(cherry picked from FBD13030310)
Summary:
Special case GOT relocs to ignore addend subtracting
logic in analyzeRelocation, since the addend does not refer to the
target of the instruction being analyzed. Also make the code honor
the comments in the special case about zeroed out ExtractValue but
non-zero addend.
Fixfacebookincubator/BOLT#40
(cherry picked from FBD10355019)
Summary:
This pull request fixes two compiler warnings:
- missing `break;` in a switch-case statement in RegAnalysis.cpp (-Wimplicit-fallthrough warning)
- misleading indentation in BinaryContext.cpp (-Wmisleading-indentation warning)
Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/39
GitHub Author: Andreas Ziegler <andreas.ziegler@fau.de>
(cherry picked from FBD10202092)
Summary:
lld may generate relocations without associated symbols. Instead of
rejecting binaries with such relocations, we can re-create the symbol
the relocation is against based on the extracted value.
(cherry picked from FBD10054576)
Summary:
Previously, we were expanding eligible branches with stubs. After
expansion, we were computing which stubs were unnecessary and removing them,
assuming ranges were shortening as code is removed. The problem with this
approach is that for branches that refer to code that is not managed by
BOLT, the distance to that location can increase and we can end up with an
out-of-range branch.
This rewrites the pass to be simpler, only increasing size and expanding code
with stubs as needed after each iteration, stopping when code stops increasing.
Besides this rewrite, the stub-insertion pass now supports stubs grouping
similar to what the linker does, allowing different functions to share the
same veneer that jumps to a common callee. It also fixes a bug in the previous
implementation that, in very large functions that use TBZ/TBNZ (+-32KB range),
it would mistakenly try to reuse a local stub BB that is out of range.
This includes a change to allow hot functions to be put at the end of the
.text section, closer to the heap, requiring no veneers to jump to JITted
code. And finally it enables eliminate veneers pass by default.
(cherry picked from FBD10023158)
Summary:
If we reuse text section under `-use-old-text` option, then there's no
need to rename it. Tools, such as perf, seem to not like binaries
without `.text`.
Additionally, check if the code fits into `.text` using the page
alignment, otherwise we were skipping the alignment relying on the user
detecting the warning message. This could have resulted in unexpected
performance drops.
Also add `-no-huge-pages` option to use regular page size for code
alignment purposes (i.e. 4KiB instead of 2MiB).
(cherry picked from FBD10024670)
Summary:
While creating BinaryData objects we used to process all symbol table
entries. However, some symbols could belong to non-allocatable sections,
and thus we have to ignore them for the purpose of analyzing in-memory
data.
(cherry picked from FBD9666511)
Summary:
For jump tables ICP was using profile from the jump table itself which
doesn't work correct if the jump table is re-used at different code
locations.
(cherry picked from FBD9618774)