Summary:
Shrink wrapping has a mode where it will directly move push
pop pairs, instead of replacing them with stores/loads. This is an
ambitious mode that is triggered sometimes, but whenever matching with
a push, it would operate with the assumption that the restoring
instruction was a pop, not a load, otherwise it would assert. Fix this
assertion to bail nicely back to non-pushpop mode (use regular store and
load instructions).
(cherry picked from FBD20085905)
Summary:
Change our X86 target to use long nops by default. In general,
BOLT does not put nops into the instruction stream that is going to be
executed, since it doesn't align basic blocks, only functions. Since we
rebased BOLT, our relationship with MCAssembler changed because it
stopped using multibyte nops and we never needed to revisit that. But it
makes a difference if we want to mitigate perf issues with the Intel
JCC erratum, since the nops inserted are going to be decoded and
executed. To make MCAssembler emit long nops again, we need to explictly
set mattr (Features) of the X86 target.
(cherry picked from FBD19987277)
Summary: Start adding initial bits for MachO, this diff contains some small preparations for finding functions inside a MachO binary, this will be done in the next diff. The concept of a section in the MachO world is quite different from ELF, nevertheless, for functions for now it more or less fits into the current picture (in BOLT), but things will diverge more significantly a bit later.
(cherry picked from FBD19648161)
Summary:
There are two peephole subpasses, remove-double-jumps and
remove-useless-conditional-branches, that operates by reading branches
directly, which makes them tricky to run before fix-branches. In the
case of remove-double-jumps, it will even lead to suboptimal code if
the patched branch was going to be removed by fix-branches when the
target is the fall-through. If the final target is a tail call, it will
lead to a broken CFG in the worst case. Fix this by moving these passes
after SCTC, which already produces CFGs with conditional tail calls.
(cherry picked from FBD18795592)
Summary:
This diff ports reviews.llvm.org/D70157 to our LLVM tree, which
makes the integrated assembler able to align X86 control-flow changing
instructions in a way to reduce the performance impact of the ucode
update on Intel processors that implement the JCC erratum mitigation.
See white paper "Mitigations for Jump Conditional Code Erratum" by Intel
published November 2019.
To port this patch, I changed classifySecondInstInMacroFusion to analyze
instruction opcodes directly instead of analyzing the CondCond operand
(in more recent versions of LLVM, all conditional branches share the
same opcode, but with a different conditional operand). I also pulled to
our tree Alignment.h as a dependency, and the macroop analyzing helpers.
x86-align-branch-boundary and -x86-align-branch are the two flags that
control nop insertion to avoid disabling the decoder cache, following
the original patch. In BOLT, I added the flag
x86-align-branch-boundary-hot-only to request the alignment to only be
applied to hot code, which is turned on by default. The reason is
because such alignment is expensive to perform on large modules, but if
we limit it to hot code, the relaxation pass runtime becomes tolerable.
(cherry picked from FBD19828850)
Summary:
In this diff we refactor the code around getting the original binary encoding of function's body.
The main changes are: remove BinaryContext::getFunctionData, remove the parameter of the method BinaryFunction::disassemble, introduce BinaryFunction::getData.
(cherry picked from FBD19824368)
Summary:
In strict mode, a jump table with targets generated by
builtin_unreachable (located at the very end of the function) was
asserting when being recreated by postProcessIndirectBranches. Fix
this.
(cherry picked from FBD19614981)
Summary:
BinaryFunction used to have a list of Names associated with its main
entry point. However, the function is primarily identified by its
corresponding symbol or symbols, and these symbols are available as we
are creating them for a corresponding BinaryData object.
There's also no reason to emit symbols for alternative function names
(aliases), so change the code to only emit needed symbols.
When we emit a cold fragment for a function, only emit one cold symbol
for the fragment instead of one per every main entry symbol/name.
When we match a symbol to an entry point in the function, with this
change we can first go through the list of main entry symbols (now that
they are available).
(cherry picked from FBD19426709)
Summary:
1. Move createBinaryContext to BinaryContext.
1. Add support for nonlinux triples in createBinaryContext.
2. Remove unnecessary std::move in DWARFRewriter.cpp.
(cherry picked from FBD19421314)
Summary:
Call postProcessEntryPoints only after all functions have been
disassembled and all interprocedural references have been processed,
when all possible entry points have been accounted for. This makes our
detection of bad entries more robust as it does not depend on the order
of the functions any more.
(cherry picked from FBD19404767)
Summary:
When the validation of instruction encoding fails but we are able to
continue processing the binary, do no report an error. Report encoding
format only under `-v=1`.
(cherry picked from FBD19376531)
Summary:
For BinaryData, we used to maintain a vector of StringRef names and also
a vector of pointers to MCSymbol's associated with the data. There was
an unnecessary duplication of information and an associated overhead of
keeping it in sync. Fix it by removing Names and using Symbols wherever
Names were used.
Also merge two variants of registerNameAtAddress() and remove
unreachable/dead code in the process.
(cherry picked from FBD19359123)
Summary:
"Fix symbol table entries for secondary entries" diff broke the inliner.
Fix the breakage and make the discovery of secondary entry points more
accurate.
Add ability to BinaryContext::getFunctionForSymbol() to return an entry
point discriminator and use it instead of calling getEntryForSymbol()
and isSecondaryEntry(). This is the preferred way since
getFunctionForSymbol() is thread-safe.
(cherry picked from FBD19295983)
Summary:
Commit "Support full instrumentation" changed the map
SymbolToFunction in BinaryContext to map secondary entries of functions
too. This introduced unexpected behavior in our symbol table rewriting
logic, which caused it to mistakenly write them with the address of the
original function. Fix the behavior of getBinaryFunctionAtAddress to
correct this. Also fix other users of SymbolToFunction to ensure they
are not accidentally using secondary entries when they shouldn't.
(cherry picked from FBD19168319)
Summary:
Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done.
The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end.
(cherry picked from FBD18153069)
Summary:
When perf tool reports a mapping address of a binary, it is not always
the address of the first loadable segment we were checking against.
As a result, perf2botl was not working properly for binaries where the
first segment was not executable.
The fix is to check if the address reported by mmap event matches any
of the loadable segments. Note that the segment alignment has to be
applied to get real loadable address of the segment.
Fixesfacebookincubator/BOLT#65
(cherry picked from FBD19146419)
Summary:
Add full instrumentation support (branches, direct and
indirect calls). Add output statistics to show how many hot bytes
were split from cold ones in functions. Add -cold-threshold option
to allow splitting warm code (non-zero count). Add option in
bolt-diff to report missing functions in profile 2.
In instrumentation, fini hooks are fixed to run proper finalization
code after program finishes. Hooks for startup are added to setup
the runtime structures that needs initilization, such as indirect call
hash tables.
Add support for automatically dumping profile data every N seconds by
forking a watcher process during runtime.
(cherry picked from FBD17644396)
Summary:
-icp-top-callsites selects candidates for optimization until a
threshold is met. Currently, this parameter is set to 99% of calls by
default. The order of functions evaluated changes in parallel mode,
thus the functions that may be included to satisfy 99% of all calls may
change, leading to different optimization decisions when running in
parallel versus sequential.
Fix this by enabling optimizations for all branches with the same
frequency once we reach our budget instead of cutting off immediatelly
after our budget is satisfied. In that way, order of functions has no
impact on which functions are optimized.
(cherry picked from FBD18902239)
Summary:
Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done.
The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end.
(cherry picked from FBD18153069)
Summary:
Some processes can mmap the main binary for the purpose of
introspection. We should ignore such mmap events for fixed-address
binaries. For PIC binaries, we record the mapping and do the address
filtering later for all sample events.
(cherry picked from FBD18844314)
Summary:
The condition `DebugRangesOffset == -1U` can never happen since DebugRangesOffset has type `uint64_t` and the value always comes from `RangesSectionWriter->addRanges` which gets its value from `DebugRangesSectionWriter.SectionOffset` which has type `uint32_t`. The condition seems to be left over from a time where something was using `-1` as an error value.
I'm removing that check so I can use `-1` as a tag to refer to the empty range that will be at the beginning of the ranges section.
(cherry picked from FBD18153119)
Summary: The `.debug_aranges` section is already deterministic and is logically separate from the `.debug_ranges` section so separate them into separate classes so that it will be easier to make DebugRangesSectionsWriter deterministic
(cherry picked from FBD18153057)
Summary:
This fixes a bug which causes the debug_info and debug_loc sections to be unreadable by readelf/objdump.
Basically, we're using 12 bytes of a ULEB128 value to fill in space, but readelf can't read more than 9 bytes of ULEB128. Thus, we replace that value with a string of 'a' instead.
(cherry picked from FBD18097728)
Summary:
Avoid asserting on inputs that are shared libraries with
R_X86_64_64 static relocs and RELATIVE dynamic relocations matching
those. Our relocation checking mechanism would expect the result of
the static relocation to be encoded in the binary, but the linker
instead puts it as an addend in the RELATIVE dyn reloc.
Also fix aggregation for .so if the executable segment is not the
first one in the binary.
(cherry picked from FBD18651868)
Summary:
When combining icp=calls and shrink wrapping, the former may
generate empty BBs that are going to trigger a bug in shrink wraping
restore placement strategy. The restore is wrongly pushed to the BB
successor instead of being added to the current block. Add a pass to
go over the CFG to fix empty blocks by adding a temporary NOP
instruction that is going to be deleted later. Empty BBs are not
supported by one of the analysis done at this pass.
(cherry picked from FBD18717994)
Summary:
If -trap-avx512 option is not set, verify that we correctly encode
AVX-512 instructions and treat them as ordinary instructions.
(cherry picked from FBD18666427)
Summary:
There is no need to support existing functionality of adding entry
points after the CFG is built as the function is only called in empty or
disassembled state. Previously we used to run disassemble+buildCFG per
function, but now these phases are decoupled.
Also, remove a couple of redundant checks.
(cherry picked from FBD18622822)
Summary:
We only use locations of PC relocations and ignore the rest of the data.
There's no need to store type and value.
(cherry picked from FBD18623280)