Commit Graph

832 Commits

Author SHA1 Message Date
laith sakka fde5a2b470 Run shrink wrapping in parallel
Summary:
Shrink wrapping is an expensive part of frame optimizations if
performed on all functions. This diff makes it run in parallel,
reducing wall time.

(cherry picked from FBD16092651)
2019-07-02 10:48:43 -07:00
laith sakka 7d42835418 Run buildCFG in disassembly in parallel
Summary:
This diff  parallelize the construction of call graph during disassembly.
The diff includes a change to  parallel-utilities where another interface
is added, that support running tasks on binaryFunctions that involves
adding instruction annotations. This pattern is common in different places,
e.g. frame optimizations. And such, pattern justify creating an interface,
that abstract out all the messy details.

(cherry picked from FBD16232809)
2019-07-12 07:25:50 -07:00
laith sakka f4ab6e6924 run finalize functions in parallel
Summary:

(cherry picked from FBD16188733)
2019-07-10 10:59:56 -07:00
laith sakka 98539b0966 run aligner pass in parallel
Summary: this diff parallelize the aligner pass

(cherry picked from FBD16176327)
2019-07-09 17:59:41 -07:00
laith sakka 9977b03fea Run reorder blocks in parallel
Summary:
This diff change reorderBasicBlocks pass to run in parallel,
it does so by adding locks to the fix branches function,
and creating temporary MCCodeEmitters when estimating basic block code size.

(cherry picked from FBD16161149)
2019-07-08 12:32:58 -07:00
Rafael Auler 1169f1fdd8 [BOLT] Support duplicating jump tables
Summary:
If two indirect branches use the same jump table, we need to
detect this and duplicate dump tables so we can modify this CFG
correctly. This is necessary for instrumentation and shrink wrapping.
For the latter, we only detect this and bail, fixing this old known
issue with shrink wrapping.

Other minor changes to support better instrumentation: add an option
to instrument only hot functions, add LOCK prefix to instrumentation
increment instruction, speed up splitting critical edges by avoiding
calling recomputeLandingPads() unnecessarily.

(cherry picked from FBD16101312)
2019-07-02 16:56:41 -07:00
Rafael Auler 8880969ced [BOLT] Restrict creation of jump tables
Summary:
Heuristic that creates a jump table for every memory access,
including those we do not match against a pattern in an indirect jump,
is too permissive and has false positives. Guard this logic under
strict mode until we figure out a better strategy.

(cherry picked from FBD16192205)
2019-07-10 15:41:34 -07:00
laith sakka 3cfc76cdbf Create a general interface to implement parallel tasks easily and apply it to run EliminateUnreachableBlocks in parallel.
Summary:
Each time we run some work in parallel over the list of functions in bolt, we manage a thread pool, task scheduling and perform some work to manage the granularity of the tasks based on the type of the work we do.

In this task, I am creating an interface where all those details are abstracted out, the user provides the function that will run on each  function, and some policy parameters that setup the scheduling and granularity configurations.

This will make it easier to implement parallel tasks, and eliminate redundant coding efforts.

(cherry picked from FBD16116077)
2019-07-03 17:23:19 -07:00
laith sakka f10d1fe0f3 Run cleanAnnotations within frame analysis in parallel
Summary: This diff parallelize the function FrameAnalysis::cleanAnnotations()

(cherry picked from FBD16096711)
2019-07-02 13:42:17 -07:00
laith sakka 00c252f6d8 Clean SPTMap in frame anaylsis in parallel
Summary:
This diff parallelize the STPClean() function reducing its runtime from 5 seconds to 0.4 on HHVM,
Making the runtime for the frame optimizer goes down to 33 seconds on HHVM.

(cherry picked from FBD15914371)
2019-06-19 18:01:00 -07:00
laith sakka 86b529bd54 run SPT in parallel, and split annotation allocator
Summary:
This diff includes two main changes:
1) When creating an annotation, a dedicated annotation allocator can be used, instead of the default allocator. This allows some annotation to be deallocated  right after the end of their usage completely. Furthermore, having the ability to use dedicated allocators allows running SPT in parallel where each task uses a different allocator.

2) SPT is parallelized.

(cherry picked from FBD15913492)
2019-06-14 19:56:11 -07:00
Wenlei He 4e90fc1e3b [BOLT] Prioritize Jump Table ICP target by frequency and indice count
Summary: We select the top hot targets for indirect call promotion. But since we only have frequency for targets, not for actual jump table indices, we have to merge indices that share the same actual target. In order to do that we sort targets by pointer of target symbol before merging, which introduces instability. Later we stable sort merged targets by frequency. Due to the instability of sorting pointers, and depending on how many indices each merged target has, we could end up with unstable ICP. This commit changes the 2nd pass sorting to prioritize targets with fewer indices, and higher mispredicts, in addition to higher frequency. It improves stability of ICP, and also exposes more ICP because targets with fewer indices has better chance of getting promoted.

(cherry picked from FBD16099701)
2019-07-02 15:51:20 -07:00
Maksim Panchenko 078ece1691 [BOLT] Fix out-of-bounds entry points
Summary:
Check that a symbol address is less than the next function
address before considering it for a secondary entry.

(cherry picked from FBD16056468)
2019-06-28 11:53:34 -07:00
Maksim Panchenko e89ad0db4b [BOLT] Introduce strict relocation mode
Summary:
In strict relocation mode we rely on relocations to represent all
possible entry points into a function. Most of the code generated by
tested compilers (gcc and clang) will result in relocations against
any internal labels for jump tables and for computed goto tables.

In situations where we cannot properly reconstruct a jump table, or when
we cannot determine a table that guides an indirect jump, e.g. when
multiple computed goto tables are used, we conservatively assume that
the indirect jump can end up at any possible basic block referenced by
relocations.

In strict mode, simple functions may include the aforementioned
instructions with unknown control flow with a conservative list of
destinations added to the containing basic block. This allows us to
expand coverage of simple functions and to enable code reordering
optimizations for more functions.

The strict mode is recommended when BOLT is used with a well-formed
code generated by a compiler.

To use the strict mode, add "-strict" on the command line.

Another effect of this diff, is that with relocations, we will always
replace the immediate operand of an instruction with a symbol if the
relocation exists against this operand.

Also this diff fixes issues with Clang compiled with -fpic.

(cherry picked from FBD15872849)
2019-06-28 09:21:27 -07:00
Maksim Panchenko 06e7a1e059 [BOLT] Ignore false function references
Summary:
A relocation can have an addend that makes it look as the relocated
value is in a different section from the symbol being relocated.
E.g., a relocation against a variable in .rodata could have a negative
offset that will make it look like it is against a symbol in .text
(a section that typically precedes .rodata).

Unless the relocation is against a section symbol, we know
exactly the symbol that is being relocated and there is no issue.
However, when the linker leaves only a section relocation (i.e. a
relocation against a section symbol when a temporary original symbol
gets deleted), we have to guess the relocated symbol, and can falsely
detect a function reference in the case described above.

The fix is to keep a section relocation if the corresponding
relocated value falls into a different section, and to detect and
ignore false function reference.

(cherry picked from FBD16030791)
2019-06-27 03:20:17 -07:00
Wenlei He 459add2827 [BOLT] Force non-relocation mode for heatmap generation
Summary: BOLT operates in relocation mode by default when .reloc is in the binary. This changes disables relocation mode for heatmap generation so we can use that for more cases. There's a small separate change that ignores zero-sized symbol in zero-sized code section during function discovery.

(cherry picked from FBD16009610)
2019-06-26 11:06:46 -07:00
Rafael Auler 0d23cbaa52 [BOLT] Initial experimental instrumentation pass
Summary:
An instrumentation pass that modifies the input binary to
generate a profile after execution finishes. It modifies branches to
increment counters stored in the process memory and injects a new
function that dumps this data to an fdata file, readable by BOLT.

This instrumentation is experimental and currently uses a naive
approach where every branch is instrumented. This is not ideal for
runtime performance, but should be good enough for us to
evaluate/debug LBR profile quality against instrumentation.

Does not support instrumenting indirect calls yet, only direct
calls, direct branches and indirect local branches.

(cherry picked from FBD15998096)
2019-06-19 20:10:49 -07:00
Rafael Auler db02a1a142 [BOLT] Ignore empty funcs in relocation mode
Summary:
Make BOLT ignore empty functions (those containing no instructions,
despite having some space allocated to it filled with zeroes).

(cherry picked from FBD15981683)
2019-06-24 20:23:22 -07:00
Rafael Auler bda13b7dd8 [BOLT] Add option to print profile bias stats
Summary:
Profile bias may happen depending on the hardware counter used
to trigger LBR sampling, on the hardware implementation and as an
intrinsic characteristic of relying on LBRs. Since we infer fall-through
execution and these non-taken branches take zero hardware resources to
be represented, LBR-based profile likely overrepresents paths with fall
throughs and underrepresents paths with many taken branches. This patch
adds an option to print statistics about profile bias so we can better
understand these biases.

The goal is to analyze differences in the sum of the frequency of all
incoming edges in a basic block versus the sum of all outgoing. In an
ideally sampled profile, these differences should be close to zero. With
this option, the user gets the mean of these differences in flow as a
percentage of the input flow. For example, if this number is 15%, it
means, on average, a block observed 15% more or less flow going out of
it in comparison with the flow going in. We also print the standard
deviation so we can have an idea of how spread apart are different
measurements of flow differences. If variance is low, it means the
average bias is happening across all blocks, which is compatible with
using LBRs. If the variance is high, it means some blocks in the profile
have a much higher bias than others, which is compatible with using a
biased event such as cycles to sample LBRs because it overrepresents
paths that end in an expensive instruction.

(cherry picked from FBD15790517)
2019-06-10 17:26:48 -07:00
laith sakka 1ec091e6f5 Parallelize ICF Pass
Summary:
ICF consumes 10-15% of bolt runtime, for HHVM that is around 45 seconds.
this diff perform some parallelization for the pass to make it faster.
A 60% reduction in the ICF runtime  is measured on the parallel version for HHVM.

(cherry picked from FBD15589515)
2019-05-31 16:45:31 -07:00
Maksim Panchenko 9894de0094 [BOLT] Check instruction boundaries while populating jump tables
Summary:
Now that we populate jump tables after all functions are disassembled,
we can check for instruction boundaries corresponding to jump table
entries. No need to delegate this task to postProcessJumpTables().

(cherry picked from FBD15814762)
2019-06-13 15:31:30 -07:00
Maksim Panchenko 9e2ad3f593 [BOLT] Delay populating jump tables
Summary:
During the initial disassembly pass, only identify jump tables
without populating the contents. Later, after all functions have been
disassembled, we have a better idea of jump table boundaries and can do
a better job of populating their entries.

As a result, we no longer have embedded jump tables (i.e. a jump table
that is parter of another jump table). If we ever need to keep
sequential jump tables inseparable during the output, we can always
add such functionality later.

Fixes facebookincubator/BOLT#56.

(cherry picked from FBD15800427)
2019-06-12 18:21:02 -07:00
laith sakka 66cf16208f Use singleton instances for SPT (stack pointer tracking) in FrameAnalysis.
Summary:
During frame analysis, the functions do not change, and stack pointer tracking
does not need to be performed more than one time.

The current implementation performs the SPT analysis multiple times per
function during the frame analysis, we ca eliminate such computation redundancy.

On HHVM with -frame-opts=hot, this save around a minute which is 40% of the
frame optimization runtime. (129s to 76s).
fdata should be passed for a reasonable evaluation (we need the call graph).

However, this comes at a memory cost, around 2G to the peak when only -frame-opt=hot only is used but,
When all the usual flags are passed, the effect is to the peak is only 200K (measured from one test).

Update:
When jemalloc is used the base became way better and the following runtime are observed:

[jemalloc]
hhvm  85 -->  72.
clang  27 --> 23.

[malloc]
hhvm 129 -->  76.
clang  34   --> 27.

(cherry picked from FBD15707003)
2019-06-06 12:58:14 -07:00
Maksim Panchenko 9df5063c0e [perf2bolt] Option to use event PC with LBR stack
Summary:
Add an option to get extra profile trace using the recorded event PC.
The trace goes from the latest LBR record destination to the event PC.

(cherry picked from FBD15711804)
2019-06-06 19:38:06 -07:00
Maksim Panchenko fac6a89c23 [BOLT] Better handling of address references
Summary:
We used to handle PC-relative address references differently from direct
address references. As a result, some cases, such as escaped function
label address, were not handled when dealing with absolute (non-PIC)
code. This diff moves processing of an address reference into
BinaryContext::handleAddressRef() which is called for both PIC and
non-PIC code.

(cherry picked from FBD15643535)
2019-06-04 15:30:22 -07:00
laith sakka d3c1821f5f Compile Bolt using std 14.
Summary:
Compile Bolt using std 14.
We want that to be able to use some threading the locking tools that do not exists in std 11.

(cherry picked from FBD15671736)
2019-06-05 10:32:29 -07:00
Rafael Auler 21f4303bfd Support data collection in bolted binaries
Summary:
Similarly to how the compiler relies on DWARF to map samples, so
it is possible to collect profile data in binaries optimized by PGO
techniques and retrofit data to be used in a representation of the program
that was not optimized by PGO, this diff implements an option in BOLT to
encode a table in the output binary that allows us to map data collected
in optimized binaries back to the address space used in the input binary
(where the profile is useful, since we do not support running BOLT on a
binary already optimized by BOLT). The goal is to offer an option to
support BOLT in scenarios where it is not easy to run a special deployment of
the binary with a version that was not optimized by BOLT just for data
collection.

This feature is enabled with the -enable-bat flag. BAT stands for BOLT
Address Translation, which refers to the process of mapping output to
input addresses.

(cherry picked from FBD15531860)
2019-04-12 17:33:46 -07:00
Laith Sakka 3df2c9ea1f Update SDT locations after bolt reordering
Summary: Update SDT locations in .note section to match the new location after bolt reorder the code.

(cherry picked from FBD15427779)
2019-05-17 07:58:27 -07:00
Maksim Panchenko 9ef9a7b1be [BOLT] Use regex matching for function names passed on command line
Summary:
Options such as `-print-only`, `-skip-funcs`, etc. now take regular
expressions. Internally, the option is converted to '^funcname$' form
prior to regex matching. This ensures that names without special
symbols will match exactly, i.e. "foo" will not match "foo123".

(cherry picked from FBD15551930)
2019-05-29 18:33:09 -07:00
Laith Sakka c8038da36e Minor-fix: remove duplicate definition of SPT optimization timer
Summary:

(cherry picked from FBD28111560)
2019-05-22 15:03:42 -07:00
Maksim Panchenko e5b1d9cd8c [BOLT][NFC] Fix white space
(cherry picked from FBD15485688)
2019-05-23 15:49:36 -07:00
Maksim Panchenko f57d3c00fc [BOLT] Better verification of jump tables
Summary:
Run analyzeIndirectBranch() using basic block boundaries instead of
running ad-hoc validation of the jump table assumptions.

(cherry picked from FBD15465034)
2019-05-22 18:14:34 -07:00
Maksim Panchenko be344c8de7 [BOLT] Refactor handling of interproc refs
Summary:
Move handling of interprocedural references to BinaryContext.

Post-process indirect branches immediately after the CFG is built.

This is almost NFC. Since indirect branches are now post-processed
before the profile data is processed it interferes with the way the
profile data in YAML format is handled.

(cherry picked from FBD15456003)
2019-05-22 11:26:58 -07:00
Maksim Panchenko d047df12c5 [BOLT] Add an option to specialize memcpy() for 1 byte copy
Summary:
Add an option:

  -memcpy1-spec=func1,func2:cs1,func3:cs1:cs2,...

to specialize calls to memcpy() in listed functions (the name could be
supplied in regex) for size 1. The optimization will dynamically check
if the size argument equals to 1 and execute a one byte copy, otherwise
it will call memcpy() as usual. Specific call sites could be indicated
after ":" using their numeric count from the start of the function.

(cherry picked from FBD15428936)
2019-05-20 20:11:40 -07:00
Laith Saed Sakka ca659e4336 Preserve nops that are SDT markers in binaries and disable SDT conflicting optimizations
Summary:
SDT markers that appears as nops in the assembly, are preserved and not eliminated.
Functions with SDT markers are also flagged. Inlining and folding are disabled for
functions that have SDT markers.

(cherry picked from FBD15379799)
2019-05-16 12:46:32 -07:00
Laith Saed Sakka 4755825447 Parse statically defined tracepoint markers from .note.stapsdt section
Summary:
    Parse statically defined tracepoints(SDT) markers from the ELF file, and store them.
    Add an option to print SDTs (-print-sdt).
    Add test case for parsing and printing SDTs.

(cherry picked from FBD15366712)
2019-05-15 17:19:18 -07:00
Rafael Auler f1fde44154 [BOLT] Improve ICP activation policy and hot jt processing
Summary:
Previously, ICP worked with a budget of N targets to convert to
direct calls. As long as the frequency of up to N of the hottest targets
surpassed a given fraction (threshold) of the total frequency, say, 90%,
then the optimization would convert a number of targets (up to N) to
direct calls. Otherwise, it would completely abort processing this call
site. The intent was to convert a given fraction of the indirect call
site frequency to use direct calls instead, but this ends up being a
"all or nothing" strategy.

In this patch we change this to operate with the same strategy seem in
LLVM's ICP, with two thresholds. The idea is that the hottest target of
an indirect call site will be compared against these two thresholds: one
checks its frequency relative to the total frequency of the original
indirect call site, and the other checks its frequency relative to the
remaining, unconverted targets (excluding the hottest targets that were
already converted to direct calls). The remaining threshold is typically
set higher than the total threshold. This allows us more control over
ICP.

I expose two pairs of knobs, one for jump tables and another for
indirect calls.

To improve the promotion of hot jump table indices when we have memory
profile, I also fix a bug that could cause us to promote extra indices
besides the hottest ones as seen in the memory profile. When we have the
memory profile, I reapply the dual threshold checks to the memory
profile which specifies exactly which indices are hot. I then update N,
the number of targets to be promoted, based on this new information, and
update frequency information.

To allow us to work with smaller profiles, I also created an option in
perf2bolt to filter out memory samples outside the statically allocated
area of the binary (heap/stack). This option is on by default.

(cherry picked from FBD15187832)
2019-05-02 12:28:34 -07:00
Maksim Panchenko fee61231ef [BOLT] Move JumpTable management to BinaryContext
Summary:
Make BinaryContext responsible for creation and management of
JumpTables. This will be used for detection and resolution of jump table
conflicts across functions.

(cherry picked from FBD15196017)
2019-05-02 17:42:06 -07:00
Maksim Panchenko 4b55967d9e [perf2bot] Pass `-f` flag to perf
Summary:
perf tool requires the input data to be owned by the current user or
root, otherwise it rejects the input. Use `-f` option to override this
behavior.

(cherry picked from FBD15160678)
2019-04-30 17:08:22 -07:00
Maksim Panchenko 310b32fbe5 [BOLT] Limit jump table size by containing object
Summary:
While checking for a size of a jump table, we've used containing
section as a boundary. This worked for most cases as typically jump
tables are not marked with symbol table entries. However, the compiler
may generate objects for indirect goto's.

(cherry picked from FBD15158905)
2019-04-30 15:47:10 -07:00
Maksim Panchenko f1dfd38dec [BOLT][NFC] Move DynoStats out of BinaryFunction
Summary: Move DynoStats into separate source files.

(cherry picked from FBD15138883)
2019-04-29 12:51:10 -07:00
Maksim Panchenko 2b1523362e [BOLT] Strip debug sections by default
Summary:
We used to ignore debug sections by default, but we kept them in the
binary which led to invalid debug information in the output. It's better
to strip debug info and print a warning to the user.

Note: we are not updating debug info by default due to high memory
requirements for large applications.

(cherry picked from FBD15128947)
2019-04-26 15:30:12 -07:00
Rafael Auler 21ee0e98c7 [BOLT] Fix symboltable update bug
Summary:
Commit "Update symbols for secondary entry points" introduced
a bug by using getBinaryFunctionContainingAddress() instead of
getBinaryFunctionAtAddress() regarding ICF'd functions. Only the latter
would fetch the correct BinaryFunction object for addresses of functions
that were ICF'd. As a result of this bug, the dynamic symbol table was
not updated for function symbols that were folded by ICF.

(cherry picked from FBD15112941)
2019-04-26 19:52:36 -07:00
Maksim Panchenko caa0fafa18 [BOLT] Fix profile reading in non-reloc mode
Summary:
In non-relocation mode we may execute multiple re-write passes either
because we need to split large functions or update debug information for
large functions (in this context large functions are functions that do
not fit into the original function boundaries after optimizations).

When we execute another pass, we reset RewriteInstance and run most of
the steps such as disassembly and profile matching for the 2nd or 3rd
time. However, when we match a profile, we check `Used` flag, and don't
use the profile for the 2nd time. Since we didn't reset the flag while
resetting the rest of the states, we ignored profile for all functions.
Resetting the flag in-between rewrite passes solves the problem.

(cherry picked from FBD15110959)
2019-04-26 16:32:28 -07:00
Maksim Panchenko 5717b0c427 [perf2bolt] Fix print report for pre-aggregated profile
Summary:
For pre-aggregated profile, we were using the number of records in the
profile for `NumTraces` ignoring the counts per record. As a result,
the reported percentage of mismatched traces was bogus.

(cherry picked from FBD15093123)
2019-04-25 16:34:50 -07:00
Maksim Panchenko 492e4a515e [BOLT] Automatically enable -hot-text
Summary:
Enable -hot-text by default if reordering functions.

Also fail immediately if function reordering is specified on the command
line in non-relocation mode.

(cherry picked from FBD15095178)
2019-04-25 17:00:05 -07:00
Brian Gesiak 91b2de3c23 [BOLT] Minimize BOLT's diff with LLVM by removing trivial changes (NFC)
Summary: BOLT works as a series of patches rebased onto upstream LLVM at revision `f137ed238db`. Some of these patches introduce unnecessary whitespace changes or includes. Remove these to minimize the diff with upstream LLVM.

(cherry picked from FBD15064122)
2019-04-24 11:24:15 -04:00
Rafael Auler 4e4d39c21c [BOLT] Update symbols for secondary entry points
Summary:
Update the output ELF symbol table for symbols representing
secondary entry points for functions. Previously, those were left
unchanged in the symtab.

(cherry picked from FBD15010517)
2019-04-18 16:32:22 -07:00
Brian Gesiak eba1a67730 Fix casting issues on macOS
Summary:
`size_t` is platform-dependent, and on macOS it is defined as
`unsigned long long`. This is not the same type as is used in many calls
to templated functions that expect the same type. As a result, on macOS,
calls to `std::max` fail because a template function that takes
`uint64_t, unsigned long long` cannot be found.

To work around the issue:

* Specify explicit `std::max` and `std::min` functions where necessary,
  to work around the compiler trying (and failing) to find a suitable
  instantiation.
* For lambda return types, specify an explicit return type where necessary.
* For `operator ==()` calls, use an explicit cast where necessary.

(cherry picked from FBD15030283)
2019-04-22 11:27:50 -04:00
Brian Gesiak d9f1bd42fd [cmake] Only build enabled targets
Summary:
When attempting to build llvm-bolt with `-DLLVM_ENABLE_TARGETS="X86"`, I
encountered an error:

```
CMake Error at cmake/modules/AddLLVM.cmake:559 (add_dependencies):
  The dependency target "AArch64CommonTableGen" of target
  "LLVMBOLTTargetAArch64" does not exist.
Call Stack (most recent call first):
  cmake/modules/AddLLVM.cmake:607 (llvm_add_library)
  tools/llvm-bolt/src/Target/AArch64/CMakeLists.txt:1 (add_llvm_library)
```

The issue is that the `llvm-bolt/src/Target/AArch64` subdirectory is
added by CMake unconditionally. The LLVM project, on the other hand,
only adds the subdirectories that are enabled, by using a `foreach` loop
over `LLVM_TARGETS_TO_BUILD`. Copying that same loop, from
`llvm/lib/Target/CMakeLists.txt`, to this project avoids the error.

(cherry picked from FBD15030236)
2019-04-22 11:19:02 -04:00