Commit Graph

997 Commits

Author SHA1 Message Date
Xin-Xin Wang 9aa276d349 [BOLT] Make .debug_loc update deterministic
Summary:
Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done.

The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end.

(cherry picked from FBD18153069)
2019-10-25 11:47:51 -07:00
Maksim Panchenko d414acfbb6 [perf2bolt] Better mmap event matching
Summary:
When perf tool reports a mapping address of a binary, it is not always
the address of the first loadable segment we were checking against.
As a result, perf2botl was not working properly for binaries where the
first segment was not executable.

The fix is to check if the address reported by mmap event matches any
of the loadable segments. Note that the segment alignment has to be
applied to get real loadable address of the segment.

Fixes facebookincubator/BOLT#65

(cherry picked from FBD19146419)
2019-12-17 11:17:31 -08:00
Rafael Auler 16a497c627 [BOLT] Support full instrumentation
Summary:
Add full instrumentation support (branches, direct and
indirect calls). Add output statistics to show how many hot bytes
were split from cold ones in functions. Add -cold-threshold option
to allow splitting warm code (non-zero count). Add option in
bolt-diff to report missing functions in profile 2.

In instrumentation, fini hooks are fixed to run proper finalization
code after program finishes. Hooks for startup are added to setup
the runtime structures that needs initilization, such as indirect call
hash tables.

Add support for automatically dumping profile data every N seconds by
forking a watcher process during runtime.

(cherry picked from FBD17644396)
2019-12-13 17:27:03 -08:00
Rafael Auler e46d52de5b [BOLT] Fix non-determinism in ICP with threads
Summary:
-icp-top-callsites selects candidates for optimization until a
threshold is met. Currently, this parameter is set to 99% of calls by
default. The order of functions evaluated changes in parallel mode,
thus the functions that may be included to satisfy 99% of all calls may
change, leading to different optimization decisions when running in
parallel versus sequential.

Fix this by enabling optimizations for all branches with the same
frequency once we reach our budget instead of cutting off immediatelly
after our budget is satisfied. In that way, order of functions has no
impact on which functions are optimized.

(cherry picked from FBD18902239)
2019-12-13 16:46:00 -08:00
Xin-Xin Wang bdb60857e8 [BOLT] Make .debug_loc update deterministic
Summary:
Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done.

The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end.

(cherry picked from FBD18153069)
2019-10-25 11:47:51 -07:00
Maksim Panchenko e5d1334ad5 [perf2bolt] Ignore mmap events unrelated to execution
Summary:
Some processes can mmap the main binary for the purpose of
introspection. We should ignore such mmap events for fixed-address
binaries. For PIC binaries, we record the mapping and do the address
filtering later for all sample events.

(cherry picked from FBD18844314)
2019-12-05 16:52:15 -08:00
Xin-Xin Wang 6f93d53bf5 [BOLT] Remove test for impossible debug ranges condition
Summary:
The condition `DebugRangesOffset == -1U` can never happen since DebugRangesOffset has type `uint64_t` and the value always comes from `RangesSectionWriter->addRanges` which gets its value from `DebugRangesSectionWriter.SectionOffset` which has type `uint32_t`. The condition seems to be left over from a time where something was using `-1` as an error value.

I'm removing that check so I can use `-1` as a tag to refer to the empty range that will be at the beginning of the ranges section.

(cherry picked from FBD18153119)
2019-10-25 15:18:37 -07:00
Xin-Xin Wang 112c4251f5 [BOLT] Separate DebugRangesSectionsWriter into Ranges and ARanges
Summary: The `.debug_aranges` section is already deterministic and is logically separate from the `.debug_ranges` section so separate them into separate classes so that it will be easier to make DebugRangesSectionsWriter deterministic

(cherry picked from FBD18153057)
2019-10-25 11:24:49 -07:00
Xin-Xin Wang 8e2d3f7c30 [BOLT] Fix invalid abbrev error when reading debug_info section with readelf
Summary:
This fixes a bug which causes the debug_info and debug_loc sections to be unreadable by readelf/objdump.

Basically, we're using 12 bytes of a ULEB128 value to fill in space, but readelf can't read more than 9 bytes of ULEB128. Thus, we replace that value with a string of 'a' instead.

(cherry picked from FBD18097728)
2019-10-23 15:19:49 -07:00
Rafael Auler 28f91871b3 [PERF2BOLT/BOLT] Improve support for .so
Summary:
Avoid asserting on inputs that are shared libraries with
R_X86_64_64 static relocs and RELATIVE dynamic relocations matching
those. Our relocation checking mechanism would expect the result of
the static relocation to be encoded in the binary, but the linker
instead puts it as an addend in the RELATIVE dyn reloc.

Also fix aggregation for .so if the executable segment is not the
first one in the binary.

(cherry picked from FBD18651868)
2019-11-14 16:07:11 -08:00
Rafael Auler 4bcc53a408 [BOLT] Fix shrink wrapping empty BB issue
Summary:
When combining icp=calls and shrink wrapping, the former may
generate empty BBs that are going to trigger a bug in shrink wraping
restore placement strategy. The restore is wrongly pushed to the BB
successor instead of being added to the current block. Add a pass to
go over the CFG to fix empty blocks by adding a temporary NOP
instruction that is going to be deleted later. Empty BBs are not
supported by one of the analysis done at this pass.

(cherry picked from FBD18717994)
2019-11-26 15:09:40 -08:00
Maksim Panchenko 3cc4fc267b [BOLT] Proper support for -trap-avx512 option
Summary:
If -trap-avx512 option is not set, verify that we correctly encode
AVX-512 instructions and treat them as ordinary instructions.

(cherry picked from FBD18666427)
2019-11-22 14:53:20 -08:00
Maksim Panchenko 7350d40404 [BOLT][NFC] Refactor BinaryFunction::addEntryPoint()
Summary:
There is no need to support existing functionality of adding entry
points after the CFG is built as the function is only called in empty or
disassembled state. Previously we used to run disassemble+buildCFG per
function, but now these phases are decoupled.

Also, remove a couple of redundant checks.

(cherry picked from FBD18622822)
2019-11-11 17:02:37 -08:00
Maksim Panchenko a09659fd54 [BOLT] Refactor markAmbiguousRelocations()
Summary:
Refactor markAmbiguousRelocations() code and move it to BinaryContext.

Also remove a redundant check.

(cherry picked from FBD18623815)
2019-11-18 14:08:17 -08:00
Maksim Panchenko 658f270417 [BOLT] Refactor data PC relocations in BinaryContext
Summary:
We only use locations of PC relocations and ignore the rest of the data.
There's no need to store type and value.

(cherry picked from FBD18623280)
2019-11-19 18:52:08 -08:00
Maksim Panchenko b07e870d78 [BOLT] Add BinarySection::flushPendingRelocations()
(cherry picked from FBD18623527)
2019-11-20 00:16:19 -08:00
Maksim Panchenko 3b1b9916dd [BOLT][NFC] Refactor data section emission code
Summary: RewriteInstance::emitDataSection() -> BinarySection::emitAsData()

(cherry picked from FBD18623050)
2019-11-19 14:47:49 -08:00
spupyrev 95a1c7f553 speeding up ext-tsp
Summary:
Speeding up cache+/ext-tsp block reordering algorithm.
On a high-level, the speedup is achieved by:
- precomputing and memorizing all jumps between a pair of chains
(instead of extracting them on every merge iteration);
- using a cache of size O(|E|) instead of O(|V|^2) as in previous version.

The final output is identical to previous one subject to a new deterministic
comparison of double values.

(cherry picked from FBD18380870)
2019-10-31 13:32:25 -07:00
Maksim Panchenko 6796b7216b [BOLT] Fix jump table analysis for non-simple functions
Summary:
When we disassemble functions, we add discovered jump tables to a global
container in BinaryContext. Later, we analyze and verify all jump
tables. However, analysis for non-simple functions might fail for numerous
reasons, e.g. there would be no instruction at a destination. Since we
are not overwriting non-simple functions, it is not a critical error.
Thus, we can safely skip jump table analysis for non-simple functions.

(cherry picked from FBD18422997)
2019-11-10 21:09:01 -08:00
Maksim Panchenko 72b52edcbb [BOLT] Free more memory in BinaryFunction::releaseCFG()
Summary:
Free more lists in BinaryFunction::releaseCFG().

Release BinaryFunction::Relocations after disassembly.

Do not populate BinaryFunction::MoveRelocations as we are not using them
currently.

Also remove PCRelativeRelocationOffsets that weren't used.

(cherry picked from FBD18413256)
2019-11-08 14:41:31 -08:00
Maksim Panchenko d5ddb320ef [BOLT] Free memory for CFG after emission
Summary:
Once we emit function code, we no longer need CFG for next phases
that use basic blocks for address-translation and symbol update
purposes. We free memory used by CFG and instructions. The freed
memory gets reused by later phases resulting in overall memory usage
reduction.

We can probably improve memory consumption even further by replacing
BinaryBasicBlocks with more compact data structures.

(cherry picked from FBD18408954)
2019-10-31 16:54:48 -07:00
Maksim Panchenko f2b257bec8 [BOLT] Update SDTs based on translation tables
Summary:
We've used to emit special annotations to update SDT markers. However,
we can just use "Offset" annotations for the same purpose. Unlike BAT,
we have to generate "reverse" address translation tables.
This approach eliminates reliance on instructions after code emission.

(cherry picked from FBD18318660)
2019-11-03 21:57:15 -08:00
Maksim Panchenko 98e63610b1 [BOLT] Create OffsetTranslationTable for basic blocks
Summary:
Use BinaryBasicBlock::OffsetTranslationTable for BAT. This removes
dependency on instructions after the code emission.

(cherry picked from FBD18283965)
2019-11-01 16:19:45 -07:00
Maksim Panchenko a1388308f0 [BOLT] Use NameResolver class for local symbols
Summary: NameResolver class is used to assign unique names to local symbols.

(cherry picked from FBD18277131)
2019-11-01 12:31:17 -07:00
Maksim Panchenko 1ed3ac17ff [BOLT] Fix section offsets after debug stripping
Summary:
Be default, we strip debug sections from the binary. Even though we did
not write the sections, we allocated space for them in the output binary
by mistake.

(cherry picked from FBD18218708)
2019-10-29 14:49:49 -07:00
Maksim Panchenko ed8be23e73 [BOLT][llvm] Reduce memory used by MCInst
Summary:
BOLT creates MCInst for every instruction from the input. For large
binaries, this means we are creating tens if not hundreds of millions of
instructions. If the number of operands for average instruction is much
less than 8, we benefit from changing the type of Operands from
SmallVector<MCOperand, 8> to SmallVector<MCOperand, 2>. That seems
to be the optimal type for X86-64 on average.

The size of MCInst goes down from 176 to 80 which often reduces BOLT
memory consumption by gigabytes.

(cherry picked from FBD18218924)
2019-10-28 17:40:18 -07:00
Rafael Auler a3295715e4 [AArch64] Recognize one extra br idiom
Summary:
We do not support optimizing functions with jump tables in
AArch64, but we do need to detect them. This idiom is slightly different
from the ones we've seen before. It encode jump table entries as
relative to the jump table itself instead of relative to the indirect
branch (BR) instruction.

(cherry picked from FBD18191100)
2019-10-28 16:16:35 -07:00
Maksim Panchenko 8fb6512a23 [BOLT][Docs] Instructions for linking with jemalloc/tcmalloc
(cherry picked from FBD18050722)
2019-10-21 15:57:36 -07:00
Maksim Panchenko 12aca4005c [BOLT] Ignore __builtin_unreachable destination
Summary:
For functions with unknown control flow, do not populate TakenBranches
with an entry pointing to the end of the function.

(cherry picked from FBD18034019)
2019-10-20 20:46:32 -07:00
Rafael Auler b807641e2a [BOLT] Fix stale functions when using BAT
Summary:
If collecting data in Intel Skylake machines, we may face a
bug where LBR0 or LBR1 may be duplicated w.r.t. the next entry. This
makes perf2bolt interpret it as an invalid trace, which ordinarily we
discard during aggregation. However, in BAT, since we do not disassemble
the binary where the collection happened but rely only on the
translation table, it is not possible to detect bad traces and discard
them. This gets to the fdata file, and this invalid trace ends up
invalidating the profile for the whole function (by being treated as
stale by BOLT).

In this patch, we detect Skylake by looking for LBRs with 32 entries,
and discard the first 2 entries to avoid running into this problem.

It also fixes an issue with collision in the translation map by
prioritizing the last basic block when more than one share the same
output address.

(cherry picked from FBD17996791)
2019-10-17 16:35:57 -07:00
Maksim Panchenko 103b0a77cc [BOLT] Fix non-determinism while reading debug info
Summary:
When reading debug info in parallel, CUs for functions were populated in
parallel and the order was non-deterministic. We used the first CU from
the non-deterministically-ordered list to set the line number resulting
in different outputs.

The fix is to sort the list after it's been created and before assigning
the line table unit.

(cherry picked from FBD17946889)
2019-10-14 17:57:36 -07:00
Rafael Auler 698a4684ac [BOLT] Fix merge-fdata and heatmap in BAT
Summary:
merge-fdata for legacy format was simply appending all input
strings to output, but the real format supports some header strings
that can't be simply concatanated to output. Check for the header
string used by BAT before merging fdata to avoid creating an output
file with invalid lines (header in the middle of the fdata file).

For heatmap, avoid reading BAT tables, since they won't be used.

(cherry picked from FBD17943131)
2019-10-11 13:32:14 -07:00
Xin-Xin Wang d87f95065a [BOLT] Add missing CMake test dependencies
Summary:
I noticed when setting up a new repository for bolt that bolt tests
would fail unexpectedly when running `ninja check-bolt` and
`ninja check-llvm`. This turns out to be because dependencies for bolt
binaries were not specified in the CMake configuration so they were not
built before running the tests. This diff adds the dependencies to the
CMake configuration for check-bolt and check-llvm so that bolt binaries
are built before running tests.

(cherry picked from FBD17919505)
2019-10-14 16:03:54 -07:00
Maksim Panchenko 8c6ea8540a [BOLT] Improve object discovery runtime
Summary:

(cherry picked from FBD17872824)
2019-10-08 11:03:33 -07:00
Rafael Auler 13948f376d [BOLT] Do not emit BAT for non-simple in nonreloc
Summary: Doing so cause corrupt entries to be emitted.

(cherry picked from FBD17774505)
2019-10-04 16:28:03 -07:00
Mark Santaniello c9f4bbdc22 [llvm-bolt] Bugfix jemalloc sized deallocation segfault
Summary:
C++14 "sized deallocation" introduces a 2-argument `delete` where the new 2nd argument is the original allocated size.  It's useful for allocators like jemalloc to be "reminded" of the original allocation size, else they incur the cost of an address to size lookup.  Jemalloc has provided this for a while as `sdallocx`, and recently it got wired up to the new 2-arg `delete`.

Here I introduce typedefs for the SmallVectors so the "16" is consistent, which seems to fix the issue.

(cherry picked from FBD17618981)
2019-09-26 16:51:22 -07:00
Rafael Auler ba31344fa9 [BOLT] Fix build for Mac
Summary:
Change our CMake config for the standalone runtime instrumentation
library to check for the elf.h header before using it, so the build
doesn't break on systems lacking it. Also fix a SmallPtrSet usage where
its elements are not really pointers, but uint64_t, breaking the build
in Apple's Clang.

(cherry picked from FBD17505759)
2019-09-20 11:29:35 -07:00
Maksim Panchenko 5e6d246b9c [BOLT] Reword message for macro-op fusion optimization
Summary:
With the word "missed", the previous message about opportunities for
macro-op fusion optimization could be misleading.

(cherry picked from FBD17464603)
2019-09-18 15:33:03 -07:00
Maksim Panchenko c823220116 [BOLT] Better check for compiler de-virtualization bug
Summary:
The existing check for compiler de-virtualization bug was not working
when the relocation reference did not fall on a function boundary.
As a result, we were falsely detecting "unmarked object in code".

When running the check, the address could be arbitrary, except it
shouldn't match any existing function. Additionally, check that there's
a proper reference to the de-virtualized callee to avoid false
positives.

(cherry picked from FBD17433887)
2019-09-17 14:24:31 -07:00
Maksim Panchenko e9c6c73bb8 [BOLT][non-reloc] Change function splitting in non-relocation mode
Summary:
This diff applies to non-relocation mode mostly. In this mode, we are
limited by original function boundaries, i.e. if a function becomes
larger after optimizations (e.g. because of the newly introduced
branches) then we might not be able to write the optimized version,
unless we split the function. At the same time, we do not benefit from
function splitting as we do in the relocation mode since we are not
moving functions/fragments, and the hot code does not become more
compact.

For the reasons described above, we used to execute multiple re-write
attempts to optimize the binary and we would only split functions that
were too large to fit into their original space.

After the first attempt, we would know functions that did not fit
into their original space. Then we would re-run all our passes again
feeding back the function information and forcefully splitting
such functions. Some functions still wouldn't fit even after the
splitting (mostly because of the branch relaxation for conditional tail
calls that does not happen in non-relocation mode). Yet we have emitted
debug info as if they were successfully overwritten. That's why we had
one more stage to write the functions again, marking failed-to-emit
functions non-simple. Sadly, there was a bug in the way 2nd and 3rd
attempts interacted, and we were not splitting the functions correctly
and as a result we were emitting less optimized code.

One of the reasons we had the multi-pass rewrite scheme in place, was
that we did not have an ability to precisely estimate the code size
before the actual code emission. Recently, BinaryContext obtained such
functionality, and now we can use it instead of relying on the
multi-pass rewrite. This eliminates redundant work of re-running
the same function passes multiple times.

Because function splitting runs before a number of optimization passes
that run on post-CFG state (those rely on the splitting pass), we
cannot estimate the non-split code size with 100% accuracy. However,
it is good enough for over 99% of the cases to extract most of the
performance gains for the binary.

As a result of eliminating the multi-pass rewrite, the processing time
in non-relocation mode with `-split-functions=2` is greatly reduced.
With debug info update, it is less than half of what it used to be.

New semantics for `-split-functions=<n>`:

  -split-functions - split functions into hot and cold regions
    =0 -   do not split any function
    =1 -   in non-relocation mode only split functions too large to fit
           into original code space
    =2 -   same as 1 (backwards compatibility)
    =3 -   split all functions

(cherry picked from FBD17362607)
2019-09-11 15:42:22 -07:00
Wenlei He 615a318b60 [BOLT] Filter perf samples by PID
Summary: `perf2bolt` accepts executable name, and the tool will find all the PIDs associated with that executable. When different versions of an executable are running at the same time, name alone may not be sufficient as we will get samples from different versions of the binary aggregated together. The resulting fdata may look stale to BOLT, which makes BOLT bailout optimization for functions. This change adds a `-pid` switch that lets user specify process ID in addition to executable name so BOLT can target a specific process.

(cherry picked from FBD17178898)
2019-09-03 22:24:06 -07:00
Wenlei He 8cd1ba599b [BOLT] Ignore LBR from kernel interrupts
Summary: This change adds a switch (`ignore-interrupt-lbr`) to ignores LBR from perf input that is result of kernel interrupts. These asynchronous flow of user/kernel transition will make BOLT think that profile is stale, thus bailout optimization for functions. Ideally, user mode filter need to be set for `perf record` so we don't have asynchronous LBRs. However these are identifiable as kernel address space is known, so we can ignore any LBRs that come from or go into kernel addresses during aggregation. This is under a switch and off by default in case we need to BOLT kernel module.

(cherry picked from FBD17170107)
2019-09-03 10:01:26 -07:00
Rafael Auler cc4b2fb614 [BOLT] Efficient edge profiling in instrumented mode
Summary:
Change our edge profiling technique when using instrumentation
to do not instrument every edge. Instead, build the spanning tree
for the CFG and omit instrumentation for edges in the spanning tree.
Infer the edge count for these edges when writing the profile during
run time. The inference works with a bottom-up traversal of the spanning
tree and establishes the value of the edge connecting to the parent based
on a simple flow equation involving output and input edges, where the
only unknown variable is the parent edge.

This requires some engineering in the runtime lib to support dynamic
allocation for building these graphs at runtime.

(cherry picked from FBD17062773)
2019-08-07 16:09:50 -07:00
Rafael Auler 52786928ff [BOLT] Fix perf2bolt race in BAT mode
Summary:
We start a thread to preprocess the profile while the main
thread continues to disassemble the input binary. We should not
disassemble it in BAT mode, however, the test to check whether we have
BAT in the input binary depends on the preprocessing thread, so there
is a race where we may start disassembling functions just because the
preprocessing thread didn't conclude we are in BAT mode. Fix this and
make the main thread check for BAT without depending on the
preprocessing thread.

(cherry picked from FBD17124370)
2019-08-29 16:18:43 -07:00
Rafael Auler 1f6564f117 [BOLT] Support .plt.got section
Summary:
We decode the regular .plt section and we are able to perform
optimizations on it with -plt=hot or -plt=all, however, .plt.got
sections are not decoded by BOLT. This patch teaches BOLT how to read
them. They are created by the bfd linker whenever there is no need for
the dynamic linker to lazy-bind the symbol (when they are eagerly
resolved at binary load time). These entries are 8-byte sized instead of
16-byte sized like the regular PLT, and contain a single indirect call
instruction with 7 bytes and a nop.

(cherry picked from FBD17060515)
2019-08-26 15:03:38 -07:00
Rafael Auler 243507db99 [BOLT] Fix aggregator w.r.t. split functions
Summary:
We should not rely on split function detection while aggregating
data, but only look up the original function names in the symbol table.
Split function detection should be done by BOLT and not perf2bolt while
writing the profile. Then, BOLT, when reading it, will take care of
combining functions if necessary.

This caused a bug in bolted data collection where samples in cold parts
of a function were being falsely attributed to the hot part of a function
instead of being attributed to the cold part, causing incorrect translation of
addresses.

(cherry picked from FBD16993065)
2019-08-23 12:18:31 -07:00
Maksim Panchenko f588d7a6ea [BOLT] Tighter control of jump table detection
Summary:
We were too permissive by allowing more jump tables during the
preliminary scan of memory. This allowed for jump tables to be
falsely detected. And since we didn't have a way to backtrack
the jump table creation, we had to assert.

This diff refactors the code that analyzes jump table contents.
Preliminary and final passes share the same code. The only difference
should be the detection of instruction boundaries that are available
during the final pass.

This should affect strict relocation mode only.

(cherry picked from FBD16923335)
2019-08-19 14:06:36 -07:00
Maksim Panchenko bf030f336a [BOLT] Fix misleading output
Summary:
BOLT prints "spawning thread to pre-process profile" message even when
it is not running in the aggregation mode. Fix that.

(cherry picked from FBD16908596)
2019-08-19 17:11:42 -07:00
Rafael Auler 821480d27f [BOLT] Encode instrumentation tables in file
Summary:
Avoid directly allocating string and description tables in
binary's static data region, since they are not needed during runtime
except when writing the profile at exit. Change the runtime library to
open the tables on disk and read only when necessary.

(cherry picked from FBD16626030)
2019-08-02 11:20:13 -07:00
Rafael Auler 62aa74f836 [BOLT] Support instrumentation via runtime library
Summary:
To allow the development of future instrumentation work, this
patch adds support in BOLT for linking arbitrary libraries into the
binary processed by BOLT. We use orc relocation handling mechanism for
that. With this support, this patch also moves code programatically
generated in X86 assembly language by X86MCPlusBuilder to C code written
in a new library called bolt_rt. Change CMake to support this library as
an external project in the same way as clang does with compiler_rt. This
library is installed in the lib/ folder relative to BOLT root
installation and by default instrumentation will look for the library
at that location to finish processing the binary with instrumentation.

(cherry picked from FBD16572013)
2019-07-24 14:03:43 -07:00