Commit Graph

1011 Commits

Author SHA1 Message Date
Rafael Auler c82e7fd1cc [BOLT] Decoder cache friendly alignment wrt Intel JCC Erratum
Summary:
This diff ports reviews.llvm.org/D70157 to our LLVM tree, which
makes the integrated assembler able to align X86 control-flow changing
instructions in a way to reduce the performance impact of the ucode
update on Intel processors that implement the JCC erratum mitigation.

See white paper "Mitigations for Jump Conditional Code Erratum" by Intel
published November 2019.

To port this patch, I changed classifySecondInstInMacroFusion to analyze
instruction opcodes directly instead of analyzing the CondCond operand
(in more recent versions of LLVM, all conditional branches share the
same opcode, but with a different conditional operand). I also pulled to
our tree Alignment.h as a dependency, and the macroop analyzing helpers.

x86-align-branch-boundary and -x86-align-branch are the two flags that
control nop insertion to avoid disabling the decoder cache, following
the original patch. In BOLT, I added the flag
x86-align-branch-boundary-hot-only to request the alignment to only be
applied to hot code, which is turned on by default. The reason is
because such alignment  is expensive to perform on large modules, but if
we limit it to hot code, the relaxation pass runtime becomes tolerable.

(cherry picked from FBD19828850)
2020-02-10 18:50:53 -08:00
Alexander Shaposhnikov d5b8fc8fbe [BOLT] Make the methods isText/isData more robust
Summary: Make the methods isText/isData work for MachO.

(cherry picked from FBD19849460)
2020-02-11 17:54:48 -08:00
Alexander Shaposhnikov c3c4b15a2e [BOLT] Remove BinaryContext::getFunctionData
Summary:
In this diff we refactor the code around getting the original binary encoding of function's body.
The main changes are: remove BinaryContext::getFunctionData, remove the parameter of the method BinaryFunction::disassemble, introduce BinaryFunction::getData.

(cherry picked from FBD19824368)
2020-02-10 15:35:11 -08:00
Maksim Panchenko 41de03b8e9 [BOLT] Fix section names under `-generate-link-sections`
Summary:
Use proper function while printing modified function name to file.

(cherry picked from FBD19791847)
2020-02-07 09:39:38 -08:00
Rafael Auler 0080d74506 [BOLT] Fix issue with strict and builtin_unreachable
Summary:
In strict mode, a jump table with targets generated by
builtin_unreachable (located at the very end of the function) was
asserting when being recreated by postProcessIndirectBranches. Fix
this.

(cherry picked from FBD19614981)
2020-01-28 18:38:10 -08:00
Maksim Panchenko d57513e4ab [BOLT] Fix symbol table issue with ICF
Summary: Not all symbol table entries were updated after ICF.

(cherry picked from FBD19319685)
2020-01-08 13:32:59 -08:00
Maksim Panchenko ac697b7d3a [BOLT] Replace list of Names with Symbols for BinaryFunction
Summary:
BinaryFunction used to have a list of Names associated with its main
entry point. However, the function is primarily identified by its
corresponding symbol or symbols, and these symbols are available as we
are creating them for a corresponding BinaryData object.

There's also no reason to emit symbols for alternative function names
(aliases), so change the code to only emit needed symbols.

When we emit a cold fragment for a function, only emit one cold symbol
for the fragment instead of one per every main entry symbol/name.

When we match a symbol to an entry point in the function, with this
change we can first go through the list of main entry symbols (now that
they are available).

(cherry picked from FBD19426709)
2020-01-13 11:56:59 -08:00
Alexander Shaposhnikov 7a59783d7a [BOLT] Move createBinaryContext to BinaryContext
Summary:
1. Move createBinaryContext to BinaryContext.
1. Add support for nonlinux triples in createBinaryContext.
2. Remove unnecessary std::move in DWARFRewriter.cpp.

(cherry picked from FBD19421314)
2020-01-15 15:23:45 -08:00
Rafael Auler 961d3d02d8 [BOLT] Move postProcessEntryPoints after disassembly
Summary:
Call postProcessEntryPoints only after all functions have been
disassembled and all interprocedural references have been processed,
when all possible entry points have been accounted for. This makes our
detection of bad entries more robust as it does not depend on the order
of the functions any more.

(cherry picked from FBD19404767)
2020-01-14 17:12:03 -08:00
Maksim Panchenko 0283271f29 [BOLT] Do no report error on mismatched instruction encoding
Summary:
When the validation of instruction encoding fails but we are able to
continue processing the binary, do no report an error. Report encoding
format only under `-v=1`.

(cherry picked from FBD19376531)
2020-01-13 11:24:10 -08:00
Maksim Panchenko 45b27d7b44 [BOLT] Get rid of Names in BinaryData
Summary:
For BinaryData, we used to maintain a vector of StringRef names and also
a vector of pointers to MCSymbol's associated with the data. There was
an unnecessary duplication of information and an associated overhead of
keeping it in sync. Fix it by removing Names and using Symbols wherever
Names were used.

Also merge two variants of registerNameAtAddress() and remove
unreachable/dead code in the process.

(cherry picked from FBD19359123)
2020-01-10 16:17:47 -08:00
Maksim Panchenko 088e3c032a [BOLT] Improve handling of secondary function entry points
Summary:
"Fix symbol table entries for secondary entries" diff broke the inliner.

Fix the breakage and make the discovery of secondary entry points more
accurate.

Add ability to BinaryContext::getFunctionForSymbol() to return an entry
point discriminator and use it instead of calling getEntryForSymbol()
and isSecondaryEntry(). This is the preferred way since
getFunctionForSymbol() is thread-safe.

(cherry picked from FBD19295983)
2020-01-06 14:57:15 -08:00
Alexander Shaposhnikov 8c7f524afb [BOLT] Fix build of the runtime on OSX
Summary: Fix the compilation error on OSX

(cherry picked from FBD19269806)
2020-01-02 16:20:13 -08:00
Rafael Auler de284bc510 [BOLT] Fix symbol table entries for secondary entries
Summary:
Commit "Support full instrumentation" changed the map
SymbolToFunction in BinaryContext to map secondary entries of functions
too. This introduced unexpected behavior in our symbol table rewriting
logic, which caused it to mistakenly write them with the address of the
original function. Fix the behavior of getBinaryFunctionAtAddress to
correct this. Also fix other users of SymbolToFunction to ensure they
are not accidentally using secondary entries when they shouldn't.

(cherry picked from FBD19168319)
2019-12-18 12:14:42 -08:00
Xin-Xin Wang 9aa276d349 [BOLT] Make .debug_loc update deterministic
Summary:
Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done.

The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end.

(cherry picked from FBD18153069)
2019-10-25 11:47:51 -07:00
Maksim Panchenko d414acfbb6 [perf2bolt] Better mmap event matching
Summary:
When perf tool reports a mapping address of a binary, it is not always
the address of the first loadable segment we were checking against.
As a result, perf2botl was not working properly for binaries where the
first segment was not executable.

The fix is to check if the address reported by mmap event matches any
of the loadable segments. Note that the segment alignment has to be
applied to get real loadable address of the segment.

Fixes facebookincubator/BOLT#65

(cherry picked from FBD19146419)
2019-12-17 11:17:31 -08:00
Rafael Auler 16a497c627 [BOLT] Support full instrumentation
Summary:
Add full instrumentation support (branches, direct and
indirect calls). Add output statistics to show how many hot bytes
were split from cold ones in functions. Add -cold-threshold option
to allow splitting warm code (non-zero count). Add option in
bolt-diff to report missing functions in profile 2.

In instrumentation, fini hooks are fixed to run proper finalization
code after program finishes. Hooks for startup are added to setup
the runtime structures that needs initilization, such as indirect call
hash tables.

Add support for automatically dumping profile data every N seconds by
forking a watcher process during runtime.

(cherry picked from FBD17644396)
2019-12-13 17:27:03 -08:00
Rafael Auler e46d52de5b [BOLT] Fix non-determinism in ICP with threads
Summary:
-icp-top-callsites selects candidates for optimization until a
threshold is met. Currently, this parameter is set to 99% of calls by
default. The order of functions evaluated changes in parallel mode,
thus the functions that may be included to satisfy 99% of all calls may
change, leading to different optimization decisions when running in
parallel versus sequential.

Fix this by enabling optimizations for all branches with the same
frequency once we reach our budget instead of cutting off immediatelly
after our budget is satisfied. In that way, order of functions has no
impact on which functions are optimized.

(cherry picked from FBD18902239)
2019-12-13 16:46:00 -08:00
Xin-Xin Wang bdb60857e8 [BOLT] Make .debug_loc update deterministic
Summary:
Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done.

The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end.

(cherry picked from FBD18153069)
2019-10-25 11:47:51 -07:00
Maksim Panchenko e5d1334ad5 [perf2bolt] Ignore mmap events unrelated to execution
Summary:
Some processes can mmap the main binary for the purpose of
introspection. We should ignore such mmap events for fixed-address
binaries. For PIC binaries, we record the mapping and do the address
filtering later for all sample events.

(cherry picked from FBD18844314)
2019-12-05 16:52:15 -08:00
Xin-Xin Wang 6f93d53bf5 [BOLT] Remove test for impossible debug ranges condition
Summary:
The condition `DebugRangesOffset == -1U` can never happen since DebugRangesOffset has type `uint64_t` and the value always comes from `RangesSectionWriter->addRanges` which gets its value from `DebugRangesSectionWriter.SectionOffset` which has type `uint32_t`. The condition seems to be left over from a time where something was using `-1` as an error value.

I'm removing that check so I can use `-1` as a tag to refer to the empty range that will be at the beginning of the ranges section.

(cherry picked from FBD18153119)
2019-10-25 15:18:37 -07:00
Xin-Xin Wang 112c4251f5 [BOLT] Separate DebugRangesSectionsWriter into Ranges and ARanges
Summary: The `.debug_aranges` section is already deterministic and is logically separate from the `.debug_ranges` section so separate them into separate classes so that it will be easier to make DebugRangesSectionsWriter deterministic

(cherry picked from FBD18153057)
2019-10-25 11:24:49 -07:00
Xin-Xin Wang 8e2d3f7c30 [BOLT] Fix invalid abbrev error when reading debug_info section with readelf
Summary:
This fixes a bug which causes the debug_info and debug_loc sections to be unreadable by readelf/objdump.

Basically, we're using 12 bytes of a ULEB128 value to fill in space, but readelf can't read more than 9 bytes of ULEB128. Thus, we replace that value with a string of 'a' instead.

(cherry picked from FBD18097728)
2019-10-23 15:19:49 -07:00
Rafael Auler 28f91871b3 [PERF2BOLT/BOLT] Improve support for .so
Summary:
Avoid asserting on inputs that are shared libraries with
R_X86_64_64 static relocs and RELATIVE dynamic relocations matching
those. Our relocation checking mechanism would expect the result of
the static relocation to be encoded in the binary, but the linker
instead puts it as an addend in the RELATIVE dyn reloc.

Also fix aggregation for .so if the executable segment is not the
first one in the binary.

(cherry picked from FBD18651868)
2019-11-14 16:07:11 -08:00
Rafael Auler 4bcc53a408 [BOLT] Fix shrink wrapping empty BB issue
Summary:
When combining icp=calls and shrink wrapping, the former may
generate empty BBs that are going to trigger a bug in shrink wraping
restore placement strategy. The restore is wrongly pushed to the BB
successor instead of being added to the current block. Add a pass to
go over the CFG to fix empty blocks by adding a temporary NOP
instruction that is going to be deleted later. Empty BBs are not
supported by one of the analysis done at this pass.

(cherry picked from FBD18717994)
2019-11-26 15:09:40 -08:00
Maksim Panchenko 3cc4fc267b [BOLT] Proper support for -trap-avx512 option
Summary:
If -trap-avx512 option is not set, verify that we correctly encode
AVX-512 instructions and treat them as ordinary instructions.

(cherry picked from FBD18666427)
2019-11-22 14:53:20 -08:00
Maksim Panchenko 7350d40404 [BOLT][NFC] Refactor BinaryFunction::addEntryPoint()
Summary:
There is no need to support existing functionality of adding entry
points after the CFG is built as the function is only called in empty or
disassembled state. Previously we used to run disassemble+buildCFG per
function, but now these phases are decoupled.

Also, remove a couple of redundant checks.

(cherry picked from FBD18622822)
2019-11-11 17:02:37 -08:00
Maksim Panchenko a09659fd54 [BOLT] Refactor markAmbiguousRelocations()
Summary:
Refactor markAmbiguousRelocations() code and move it to BinaryContext.

Also remove a redundant check.

(cherry picked from FBD18623815)
2019-11-18 14:08:17 -08:00
Maksim Panchenko 658f270417 [BOLT] Refactor data PC relocations in BinaryContext
Summary:
We only use locations of PC relocations and ignore the rest of the data.
There's no need to store type and value.

(cherry picked from FBD18623280)
2019-11-19 18:52:08 -08:00
Maksim Panchenko b07e870d78 [BOLT] Add BinarySection::flushPendingRelocations()
(cherry picked from FBD18623527)
2019-11-20 00:16:19 -08:00
Maksim Panchenko 3b1b9916dd [BOLT][NFC] Refactor data section emission code
Summary: RewriteInstance::emitDataSection() -> BinarySection::emitAsData()

(cherry picked from FBD18623050)
2019-11-19 14:47:49 -08:00
spupyrev 95a1c7f553 speeding up ext-tsp
Summary:
Speeding up cache+/ext-tsp block reordering algorithm.
On a high-level, the speedup is achieved by:
- precomputing and memorizing all jumps between a pair of chains
(instead of extracting them on every merge iteration);
- using a cache of size O(|E|) instead of O(|V|^2) as in previous version.

The final output is identical to previous one subject to a new deterministic
comparison of double values.

(cherry picked from FBD18380870)
2019-10-31 13:32:25 -07:00
Maksim Panchenko 6796b7216b [BOLT] Fix jump table analysis for non-simple functions
Summary:
When we disassemble functions, we add discovered jump tables to a global
container in BinaryContext. Later, we analyze and verify all jump
tables. However, analysis for non-simple functions might fail for numerous
reasons, e.g. there would be no instruction at a destination. Since we
are not overwriting non-simple functions, it is not a critical error.
Thus, we can safely skip jump table analysis for non-simple functions.

(cherry picked from FBD18422997)
2019-11-10 21:09:01 -08:00
Maksim Panchenko 72b52edcbb [BOLT] Free more memory in BinaryFunction::releaseCFG()
Summary:
Free more lists in BinaryFunction::releaseCFG().

Release BinaryFunction::Relocations after disassembly.

Do not populate BinaryFunction::MoveRelocations as we are not using them
currently.

Also remove PCRelativeRelocationOffsets that weren't used.

(cherry picked from FBD18413256)
2019-11-08 14:41:31 -08:00
Maksim Panchenko d5ddb320ef [BOLT] Free memory for CFG after emission
Summary:
Once we emit function code, we no longer need CFG for next phases
that use basic blocks for address-translation and symbol update
purposes. We free memory used by CFG and instructions. The freed
memory gets reused by later phases resulting in overall memory usage
reduction.

We can probably improve memory consumption even further by replacing
BinaryBasicBlocks with more compact data structures.

(cherry picked from FBD18408954)
2019-10-31 16:54:48 -07:00
Maksim Panchenko f2b257bec8 [BOLT] Update SDTs based on translation tables
Summary:
We've used to emit special annotations to update SDT markers. However,
we can just use "Offset" annotations for the same purpose. Unlike BAT,
we have to generate "reverse" address translation tables.
This approach eliminates reliance on instructions after code emission.

(cherry picked from FBD18318660)
2019-11-03 21:57:15 -08:00
Maksim Panchenko 98e63610b1 [BOLT] Create OffsetTranslationTable for basic blocks
Summary:
Use BinaryBasicBlock::OffsetTranslationTable for BAT. This removes
dependency on instructions after the code emission.

(cherry picked from FBD18283965)
2019-11-01 16:19:45 -07:00
Maksim Panchenko a1388308f0 [BOLT] Use NameResolver class for local symbols
Summary: NameResolver class is used to assign unique names to local symbols.

(cherry picked from FBD18277131)
2019-11-01 12:31:17 -07:00
Maksim Panchenko 1ed3ac17ff [BOLT] Fix section offsets after debug stripping
Summary:
Be default, we strip debug sections from the binary. Even though we did
not write the sections, we allocated space for them in the output binary
by mistake.

(cherry picked from FBD18218708)
2019-10-29 14:49:49 -07:00
Maksim Panchenko ed8be23e73 [BOLT][llvm] Reduce memory used by MCInst
Summary:
BOLT creates MCInst for every instruction from the input. For large
binaries, this means we are creating tens if not hundreds of millions of
instructions. If the number of operands for average instruction is much
less than 8, we benefit from changing the type of Operands from
SmallVector<MCOperand, 8> to SmallVector<MCOperand, 2>. That seems
to be the optimal type for X86-64 on average.

The size of MCInst goes down from 176 to 80 which often reduces BOLT
memory consumption by gigabytes.

(cherry picked from FBD18218924)
2019-10-28 17:40:18 -07:00
Rafael Auler a3295715e4 [AArch64] Recognize one extra br idiom
Summary:
We do not support optimizing functions with jump tables in
AArch64, but we do need to detect them. This idiom is slightly different
from the ones we've seen before. It encode jump table entries as
relative to the jump table itself instead of relative to the indirect
branch (BR) instruction.

(cherry picked from FBD18191100)
2019-10-28 16:16:35 -07:00
Maksim Panchenko 8fb6512a23 [BOLT][Docs] Instructions for linking with jemalloc/tcmalloc
(cherry picked from FBD18050722)
2019-10-21 15:57:36 -07:00
Maksim Panchenko 12aca4005c [BOLT] Ignore __builtin_unreachable destination
Summary:
For functions with unknown control flow, do not populate TakenBranches
with an entry pointing to the end of the function.

(cherry picked from FBD18034019)
2019-10-20 20:46:32 -07:00
Rafael Auler b807641e2a [BOLT] Fix stale functions when using BAT
Summary:
If collecting data in Intel Skylake machines, we may face a
bug where LBR0 or LBR1 may be duplicated w.r.t. the next entry. This
makes perf2bolt interpret it as an invalid trace, which ordinarily we
discard during aggregation. However, in BAT, since we do not disassemble
the binary where the collection happened but rely only on the
translation table, it is not possible to detect bad traces and discard
them. This gets to the fdata file, and this invalid trace ends up
invalidating the profile for the whole function (by being treated as
stale by BOLT).

In this patch, we detect Skylake by looking for LBRs with 32 entries,
and discard the first 2 entries to avoid running into this problem.

It also fixes an issue with collision in the translation map by
prioritizing the last basic block when more than one share the same
output address.

(cherry picked from FBD17996791)
2019-10-17 16:35:57 -07:00
Maksim Panchenko 103b0a77cc [BOLT] Fix non-determinism while reading debug info
Summary:
When reading debug info in parallel, CUs for functions were populated in
parallel and the order was non-deterministic. We used the first CU from
the non-deterministically-ordered list to set the line number resulting
in different outputs.

The fix is to sort the list after it's been created and before assigning
the line table unit.

(cherry picked from FBD17946889)
2019-10-14 17:57:36 -07:00
Rafael Auler 698a4684ac [BOLT] Fix merge-fdata and heatmap in BAT
Summary:
merge-fdata for legacy format was simply appending all input
strings to output, but the real format supports some header strings
that can't be simply concatanated to output. Check for the header
string used by BAT before merging fdata to avoid creating an output
file with invalid lines (header in the middle of the fdata file).

For heatmap, avoid reading BAT tables, since they won't be used.

(cherry picked from FBD17943131)
2019-10-11 13:32:14 -07:00
Xin-Xin Wang d87f95065a [BOLT] Add missing CMake test dependencies
Summary:
I noticed when setting up a new repository for bolt that bolt tests
would fail unexpectedly when running `ninja check-bolt` and
`ninja check-llvm`. This turns out to be because dependencies for bolt
binaries were not specified in the CMake configuration so they were not
built before running the tests. This diff adds the dependencies to the
CMake configuration for check-bolt and check-llvm so that bolt binaries
are built before running tests.

(cherry picked from FBD17919505)
2019-10-14 16:03:54 -07:00
Maksim Panchenko 8c6ea8540a [BOLT] Improve object discovery runtime
Summary:

(cherry picked from FBD17872824)
2019-10-08 11:03:33 -07:00
Rafael Auler 13948f376d [BOLT] Do not emit BAT for non-simple in nonreloc
Summary: Doing so cause corrupt entries to be emitted.

(cherry picked from FBD17774505)
2019-10-04 16:28:03 -07:00
Mark Santaniello c9f4bbdc22 [llvm-bolt] Bugfix jemalloc sized deallocation segfault
Summary:
C++14 "sized deallocation" introduces a 2-argument `delete` where the new 2nd argument is the original allocated size.  It's useful for allocators like jemalloc to be "reminded" of the original allocation size, else they incur the cost of an address to size lookup.  Jemalloc has provided this for a while as `sdallocx`, and recently it got wired up to the new 2-arg `delete`.

Here I introduce typedefs for the SmallVectors so the "16" is consistent, which seems to fix the issue.

(cherry picked from FBD17618981)
2019-09-26 16:51:22 -07:00