Commit Graph

832 Commits

Author SHA1 Message Date
Maksim Panchenko b07e870d78 [BOLT] Add BinarySection::flushPendingRelocations()
(cherry picked from FBD18623527)
2019-11-20 00:16:19 -08:00
Maksim Panchenko 3b1b9916dd [BOLT][NFC] Refactor data section emission code
Summary: RewriteInstance::emitDataSection() -> BinarySection::emitAsData()

(cherry picked from FBD18623050)
2019-11-19 14:47:49 -08:00
spupyrev 95a1c7f553 speeding up ext-tsp
Summary:
Speeding up cache+/ext-tsp block reordering algorithm.
At a high level, the speedup is achieved by:
- precomputing and memoizing all jumps between a pair of chains
(instead of extracting them on every merge iteration);
- using a cache of size O(|E|) instead of O(|V|^2) as in previous version.

The final output is identical to previous one subject to a new deterministic
comparison of double values.

(cherry picked from FBD18380870)
2019-10-31 13:32:25 -07:00
Maksim Panchenko 6796b7216b [BOLT] Fix jump table analysis for non-simple functions
Summary:
When we disassemble functions, we add discovered jump tables to a global
container in BinaryContext. Later, we analyze and verify all jump
tables. However, analysis for non-simple functions might fail for numerous
reasons, e.g. there would be no instruction at a destination. Since we
are not overwriting non-simple functions, it is not a critical error.
Thus, we can safely skip jump table analysis for non-simple functions.

(cherry picked from FBD18422997)
2019-11-10 21:09:01 -08:00
Maksim Panchenko 72b52edcbb [BOLT] Free more memory in BinaryFunction::releaseCFG()
Summary:
Free more lists in BinaryFunction::releaseCFG().

Release BinaryFunction::Relocations after disassembly.

Do not populate BinaryFunction::MoveRelocations as we are not using them
currently.

Also remove PCRelativeRelocationOffsets that weren't used.

(cherry picked from FBD18413256)
2019-11-08 14:41:31 -08:00
Maksim Panchenko d5ddb320ef [BOLT] Free memory for CFG after emission
Summary:
Once we emit function code, we no longer need CFG for next phases
that use basic blocks for address-translation and symbol update
purposes. We free memory used by CFG and instructions. The freed
memory gets reused by later phases resulting in overall memory usage
reduction.

We can probably improve memory consumption even further by replacing
BinaryBasicBlocks with more compact data structures.

(cherry picked from FBD18408954)
2019-10-31 16:54:48 -07:00
Maksim Panchenko f2b257bec8 [BOLT] Update SDTs based on translation tables
Summary:
We used to emit special annotations to update SDT markers. However,
we can just use "Offset" annotations for the same purpose. Unlike BAT,
we have to generate "reverse" address translation tables.
This approach eliminates reliance on instructions after code emission.

(cherry picked from FBD18318660)
2019-11-03 21:57:15 -08:00
Maksim Panchenko 98e63610b1 [BOLT] Create OffsetTranslationTable for basic blocks
Summary:
Use BinaryBasicBlock::OffsetTranslationTable for BAT. This removes
dependency on instructions after the code emission.

(cherry picked from FBD18283965)
2019-11-01 16:19:45 -07:00
Maksim Panchenko a1388308f0 [BOLT] Use NameResolver class for local symbols
Summary: NameResolver class is used to assign unique names to local symbols.

(cherry picked from FBD18277131)
2019-11-01 12:31:17 -07:00
Maksim Panchenko 1ed3ac17ff [BOLT] Fix section offsets after debug stripping
Summary:
By default, we strip debug sections from the binary. Even though we did
not write the sections, we allocated space for them in the output binary
by mistake.

(cherry picked from FBD18218708)
2019-10-29 14:49:49 -07:00
Maksim Panchenko ed8be23e73 [BOLT][llvm] Reduce memory used by MCInst
Summary:
BOLT creates MCInst for every instruction from the input. For large
binaries, this means we are creating tens if not hundreds of millions of
instructions. If the number of operands for average instruction is much
less than 8, we benefit from changing the type of Operands from
SmallVector<MCOperand, 8> to SmallVector<MCOperand, 2>. That seems
to be the optimal type for X86-64 on average.

The size of MCInst goes down from 176 to 80 which often reduces BOLT
memory consumption by gigabytes.

(cherry picked from FBD18218924)
2019-10-28 17:40:18 -07:00
Rafael Auler a3295715e4 [AArch64] Recognize one extra br idiom
Summary:
We do not support optimizing functions with jump tables in
AArch64, but we do need to detect them. This idiom is slightly different
from the ones we've seen before. It encodes jump table entries as
relative to the jump table itself instead of relative to the indirect
branch (BR) instruction.

(cherry picked from FBD18191100)
2019-10-28 16:16:35 -07:00
Maksim Panchenko 8fb6512a23 [BOLT][Docs] Instructions for linking with jemalloc/tcmalloc
(cherry picked from FBD18050722)
2019-10-21 15:57:36 -07:00
Maksim Panchenko 12aca4005c [BOLT] Ignore __builtin_unreachable destination
Summary:
For functions with unknown control flow, do not populate TakenBranches
with an entry pointing to the end of the function.

(cherry picked from FBD18034019)
2019-10-20 20:46:32 -07:00
Rafael Auler b807641e2a [BOLT] Fix stale functions when using BAT
Summary:
When collecting data on Intel Skylake machines, we may face a
bug where LBR0 or LBR1 may be duplicated w.r.t. the next entry. This
makes perf2bolt interpret it as an invalid trace, which ordinarily we
discard during aggregation. However, in BAT, since we do not disassemble
the binary where the collection happened but rely only on the
translation table, it is not possible to detect bad traces and discard
them. This gets to the fdata file, and this invalid trace ends up
invalidating the profile for the whole function (by being treated as
stale by BOLT).

In this patch, we detect Skylake by looking for LBRs with 32 entries,
and discard the first 2 entries to avoid running into this problem.
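A minimal sketch of the workaround described above, assuming each LBR sample is a list of (from, to) address pairs with the most recent entry first. The names `trim_sample`, `SKYLAKE_LBR_DEPTH`, and `ENTRIES_TO_DISCARD` are illustrative and are not taken from perf2bolt.

```python
from typing import List, Tuple

# One LBR entry as a (from_address, to_address) pair; this layout is an assumption.
LBREntry = Tuple[int, int]

SKYLAKE_LBR_DEPTH = 32  # per the commit message: Skylake is recognized by 32-entry LBRs
ENTRIES_TO_DISCARD = 2  # LBR0 and LBR1 may be duplicated, so both are dropped


def trim_sample(lbr: List[LBREntry]) -> List[LBREntry]:
    """Drop the first two entries of a 32-entry LBR stack; leave other samples alone."""
    if len(lbr) == SKYLAKE_LBR_DEPTH:
        return lbr[ENTRIES_TO_DISCARD:]
    return lbr
```

Samples with a different depth pass through unchanged, so only traces that look like Skylake traces lose their two newest entries.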

It also fixes an issue with collision in the translation map by
prioritizing the last basic block when more than one share the same
output address.

(cherry picked from FBD17996791)
2019-10-17 16:35:57 -07:00
Maksim Panchenko 103b0a77cc [BOLT] Fix non-determinism while reading debug info
Summary:
When reading debug info in parallel, CUs for functions were populated in
parallel and the order was non-deterministic. We used the first CU from
the non-deterministically-ordered list to set the line number resulting
in different outputs.

The fix is to sort the list after it's been created and before assigning
the line table unit.

(cherry picked from FBD17946889)
2019-10-14 17:57:36 -07:00
Rafael Auler 698a4684ac [BOLT] Fix merge-fdata and heatmap in BAT
Summary:
merge-fdata for legacy format was simply appending all input
strings to output, but the real format supports some header strings
that can't be simply concatenated to output. Check for the header
string used by BAT before merging fdata to avoid creating an output
file with invalid lines (header in the middle of the fdata file).
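The idea can be sketched as follows. This is an illustration only: `BAT_HEADER` is a hypothetical placeholder (the commit message does not give the actual header string), and `merge_fdata` is not the real tool's function.

```python
from typing import Iterable, List

# Hypothetical placeholder; the real header string used by BAT is not stated above.
BAT_HEADER = "example-bat-header"


def merge_fdata(inputs: Iterable[List[str]], header: str = BAT_HEADER) -> List[str]:
    """Merge profiles line by line, emitting the header at most once, at the top."""
    body: List[str] = []
    saw_header = False
    for lines in inputs:
        for line in lines:
            if line.strip() == header:
                saw_header = True  # remember it, but do not copy it into the body
            elif line.strip():
                body.append(line)
    return ([header] if saw_header else []) + body
```

Plain concatenation would leave the second file's header in the middle of the output, which is the invalid case the commit describes.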

For heatmap, avoid reading BAT tables, since they won't be used.

(cherry picked from FBD17943131)
2019-10-11 13:32:14 -07:00
Xin-Xin Wang d87f95065a [BOLT] Add missing CMake test dependencies
Summary:
I noticed when setting up a new repository for bolt that bolt tests
would fail unexpectedly when running `ninja check-bolt` and
`ninja check-llvm`. This turns out to be because dependencies for bolt
binaries were not specified in the CMake configuration so they were not
built before running the tests. This diff adds the dependencies to the
CMake configuration for check-bolt and check-llvm so that bolt binaries
are built before running tests.

(cherry picked from FBD17919505)
2019-10-14 16:03:54 -07:00
Maksim Panchenko 8c6ea8540a [BOLT] Improve object discovery runtime
Summary:

(cherry picked from FBD17872824)
2019-10-08 11:03:33 -07:00
Rafael Auler 13948f376d [BOLT] Do not emit BAT for non-simple in nonreloc
Summary: Doing so causes corrupt entries to be emitted.

(cherry picked from FBD17774505)
2019-10-04 16:28:03 -07:00
Mark Santaniello c9f4bbdc22 [llvm-bolt] Bugfix jemalloc sized deallocation segfault
Summary:
C++14 "sized deallocation" introduces a 2-argument `delete` where the new 2nd argument is the original allocated size.  It's useful for allocators like jemalloc to be "reminded" of the original allocation size, else they incur the cost of an address to size lookup.  Jemalloc has provided this for a while as `sdallocx`, and recently it got wired up to the new 2-arg `delete`.

Here I introduce typedefs for the SmallVectors so the "16" is consistent, which seems to fix the issue.

(cherry picked from FBD17618981)
2019-09-26 16:51:22 -07:00
Rafael Auler ba31344fa9 [BOLT] Fix build for Mac
Summary:
Change our CMake config for the standalone runtime instrumentation
library to check for the elf.h header before using it, so the build
doesn't break on systems lacking it. Also fix a SmallPtrSet usage where
its elements are not really pointers, but uint64_t, breaking the build
in Apple's Clang.

(cherry picked from FBD17505759)
2019-09-20 11:29:35 -07:00
Maksim Panchenko 5e6d246b9c [BOLT] Reword message for macro-op fusion optimization
Summary:
With the word "missed", the previous message about opportunities for
macro-op fusion optimization could be misleading.

(cherry picked from FBD17464603)
2019-09-18 15:33:03 -07:00
Maksim Panchenko c823220116 [BOLT] Better check for compiler de-virtualization bug
Summary:
The existing check for compiler de-virtualization bug was not working
when the relocation reference did not fall on a function boundary.
As a result, we were falsely detecting "unmarked object in code".

When running the check, the address could be arbitrary, except it
shouldn't match any existing function. Additionally, check that there's
a proper reference to the de-virtualized callee to avoid false
positives.

(cherry picked from FBD17433887)
2019-09-17 14:24:31 -07:00
Maksim Panchenko e9c6c73bb8 [BOLT][non-reloc] Change function splitting in non-relocation mode
Summary:
This diff applies to non-relocation mode mostly. In this mode, we are
limited by original function boundaries, i.e. if a function becomes
larger after optimizations (e.g. because of the newly introduced
branches) then we might not be able to write the optimized version,
unless we split the function. At the same time, we do not benefit from
function splitting as we do in the relocation mode since we are not
moving functions/fragments, and the hot code does not become more
compact.

For the reasons described above, we used to execute multiple re-write
attempts to optimize the binary and we would only split functions that
were too large to fit into their original space.

After the first attempt, we would know functions that did not fit
into their original space. Then we would re-run all our passes again
feeding back the function information and forcefully splitting
such functions. Some functions still wouldn't fit even after the
splitting (mostly because of the branch relaxation for conditional tail
calls that does not happen in non-relocation mode). Yet we have emitted
debug info as if they were successfully overwritten. That's why we had
one more stage to write the functions again, marking failed-to-emit
functions non-simple. Sadly, there was a bug in the way 2nd and 3rd
attempts interacted, and we were not splitting the functions correctly
and as a result we were emitting less optimized code.

One of the reasons we had the multi-pass rewrite scheme in place was
that we did not have an ability to precisely estimate the code size
before the actual code emission. Recently, BinaryContext obtained such
functionality, and now we can use it instead of relying on the
multi-pass rewrite. This eliminates redundant work of re-running
the same function passes multiple times.

Because function splitting runs before a number of optimization passes
that run on post-CFG state (those rely on the splitting pass), we
cannot estimate the non-split code size with 100% accuracy. However,
it is good enough for over 99% of the cases to extract most of the
performance gains for the binary.

As a result of eliminating the multi-pass rewrite, the processing time
in non-relocation mode with `-split-functions=2` is greatly reduced.
With debug info update, it is less than half of what it used to be.

New semantics for `-split-functions=<n>`:

  -split-functions - split functions into hot and cold regions
    =0 -   do not split any function
    =1 -   in non-relocation mode only split functions too large to fit
           into original code space
    =2 -   same as 1 (backwards compatibility)
    =3 -   split all functions
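The option values above can be read as a small decision function. This is a sketch, not BOLT code. The name `should_split` is made up, and the relocation-mode behavior for values 1 and 2 is an assumption, since the help text only defines them for non-relocation mode.

```python
def should_split(level: int, relocation_mode: bool, fits_original_space: bool) -> bool:
    """Decide whether to split one function under -split-functions=<level>."""
    if level not in (0, 1, 2, 3):
        raise ValueError(f"unsupported -split-functions value: {level}")
    if level == 0:
        return False  # never split
    if level == 3:
        return True  # split all functions
    # Levels 1 and 2 behave the same (2 is kept for backwards compatibility).
    if relocation_mode:
        # ASSUMPTION: the help text does not cover relocation mode for 1 and 2;
        # this sketch treats any non-zero value as "split" there.
        return True
    # Non-relocation mode: split only functions too large for their original space.
    return not fits_original_space
```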

(cherry picked from FBD17362607)
2019-09-11 15:42:22 -07:00
Wenlei He 615a318b60 [BOLT] Filter perf samples by PID
Summary: `perf2bolt` accepts an executable name, and the tool finds all the PIDs associated with that executable. When different versions of an executable are running at the same time, the name alone may not be sufficient, as samples from different versions of the binary get aggregated together. The resulting fdata may look stale to BOLT, which makes BOLT bail out of optimizing the affected functions. This change adds a `-pid` switch that lets the user specify a process ID in addition to the executable name so BOLT can target a specific process.

(cherry picked from FBD17178898)
2019-09-03 22:24:06 -07:00
Wenlei He 8cd1ba599b [BOLT] Ignore LBR from kernel interrupts
Summary: This change adds a switch (`ignore-interrupt-lbr`) to ignore LBR entries from perf input that result from kernel interrupts. These asynchronous user/kernel transitions make BOLT think that the profile is stale, so it bails out of optimizing the affected functions. Ideally, a user-mode filter should be set for `perf record` so we don't get asynchronous LBRs. However, these entries are identifiable because the kernel address space is known, so we can ignore any LBRs that come from or go into kernel addresses during aggregation. This is under a switch and off by default in case we need to BOLT a kernel module.
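A rough sketch of the filtering rule, assuming LBR entries are (from, to) address pairs. `KERNEL_SPACE_START` is an assumed boundary for x86-64 Linux and `drop_kernel_lbrs` is an illustrative name. Neither comes from the commit.

```python
from typing import List, Tuple

# ASSUMPTION: on x86-64 Linux, kernel addresses live in the upper canonical half.
KERNEL_SPACE_START = 0xFFFF800000000000


def drop_kernel_lbrs(
    lbr: List[Tuple[int, int]], ignore_interrupt_lbr: bool = False
) -> List[Tuple[int, int]]:
    """Remove entries that come from or go into kernel addresses.

    The switch is off by default, matching the behavior described above.
    """
    if not ignore_interrupt_lbr:
        return list(lbr)
    return [
        (src, dst)
        for src, dst in lbr
        if src < KERNEL_SPACE_START and dst < KERNEL_SPACE_START
    ]
```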

(cherry picked from FBD17170107)
2019-09-03 10:01:26 -07:00
Rafael Auler cc4b2fb614 [BOLT] Efficient edge profiling in instrumented mode
Summary:
Change our edge profiling technique when using instrumentation
so that not every edge is instrumented. Instead, build the spanning tree
for the CFG and omit instrumentation for edges in the spanning tree.
Infer the edge count for these edges when writing the profile during
run time. The inference works with a bottom-up traversal of the spanning
tree and establishes the value of the edge connecting to the parent based
on a simple flow equation involving output and input edges, where the
only unknown variable is the parent edge.
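The inference step can be illustrated with a small sketch. It is not the runtime library's implementation. It assumes edges are unique (source, destination) pairs, that self-loops are always instrumented, and that the CFG has been closed with a virtual exit-to-entry edge so that inflow equals outflow at every block. The function name `infer_tree_edge_counts` is hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (source block, destination block)


def infer_tree_edge_counts(edges: List[Edge], measured: Dict[Edge, int]) -> Dict[Edge, int]:
    """Recover counts of uninstrumented (spanning-tree) edges via flow conservation.

    `measured` holds counts for instrumented edges; every other edge in `edges`
    is assumed to belong to the spanning tree.
    """
    counts = dict(measured)
    nodes = {n for e in edges for n in e}
    unknown = {n: set() for n in nodes}  # node -> incident edges without a count
    for e in edges:
        if e not in counts:
            unknown[e[0]].add(e)
            unknown[e[1]].add(e)

    # Start from tree leaves: nodes with exactly one unknown incident edge.
    ready = deque(n for n in nodes if len(unknown[n]) == 1)
    while ready:
        node = ready.popleft()
        if len(unknown[node]) != 1:
            continue  # already resolved from the other endpoint
        (edge,) = unknown[node]
        inflow = sum(counts[e] for e in edges if e[1] == node and e in counts)
        outflow = sum(counts[e] for e in edges if e[0] == node and e in counts)
        # The single unknown edge must balance the node's flow equation.
        counts[edge] = inflow - outflow if edge[0] == node else outflow - inflow
        for endpoint in edge:
            unknown[endpoint].discard(edge)
            if len(unknown[endpoint]) == 1:
                ready.append(endpoint)
    return counts
```

For a diamond-shaped CFG with five edges (including the virtual back edge), only two edges need counters; the other three are recovered from the flow equations.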

This requires some engineering in the runtime lib to support dynamic
allocation for building these graphs at runtime.

(cherry picked from FBD17062773)
2019-08-07 16:09:50 -07:00
Rafael Auler 52786928ff [BOLT] Fix perf2bolt race in BAT mode
Summary:
We start a thread to preprocess the profile while the main
thread continues to disassemble the input binary. We should not
disassemble it in BAT mode, however, the test to check whether we have
BAT in the input binary depends on the preprocessing thread, so there
is a race where we may start disassembling functions just because the
preprocessing thread didn't conclude we are in BAT mode. Fix this and
make the main thread check for BAT without depending on the
preprocessing thread.

(cherry picked from FBD17124370)
2019-08-29 16:18:43 -07:00
Rafael Auler 1f6564f117 [BOLT] Support .plt.got section
Summary:
We decode the regular .plt section and we are able to perform
optimizations on it with -plt=hot or -plt=all, however, .plt.got
sections are not decoded by BOLT. This patch teaches BOLT how to read
them. They are created by the bfd linker whenever there is no need for
the dynamic linker to lazy-bind the symbol (when they are eagerly
resolved at binary load time). These entries are 8-byte sized instead of
16-byte sized like the regular PLT, and contain a single indirect call
instruction with 7 bytes and a nop.

(cherry picked from FBD17060515)
2019-08-26 15:03:38 -07:00
Rafael Auler 243507db99 [BOLT] Fix aggregator w.r.t. split functions
Summary:
We should not rely on split function detection while aggregating
data, but only look up the original function names in the symbol table.
Split function detection should be done by BOLT and not perf2bolt while
writing the profile. Then, BOLT, when reading it, will take care of
combining functions if necessary.

This caused a bug in bolted data collection where samples in cold parts
of a function were being falsely attributed to the hot part of a function
instead of being attributed to the cold part, causing incorrect translation of
addresses.

(cherry picked from FBD16993065)
2019-08-23 12:18:31 -07:00
Maksim Panchenko f588d7a6ea [BOLT] Tighter control of jump table detection
Summary:
We were too permissive by allowing more jump tables during the
preliminary scan of memory. This allowed for jump tables to be
falsely detected. And since we didn't have a way to backtrack
the jump table creation, we had to assert.

This diff refactors the code that analyzes jump table contents.
Preliminary and final passes share the same code. The only difference
should be the detection of instruction boundaries that are available
during the final pass.

This should affect strict relocation mode only.

(cherry picked from FBD16923335)
2019-08-19 14:06:36 -07:00
Maksim Panchenko bf030f336a [BOLT] Fix misleading output
Summary:
BOLT prints "spawning thread to pre-process profile" message even when
it is not running in the aggregation mode. Fix that.

(cherry picked from FBD16908596)
2019-08-19 17:11:42 -07:00
Rafael Auler 821480d27f [BOLT] Encode instrumentation tables in file
Summary:
Avoid directly allocating string and description tables in
binary's static data region, since they are not needed during runtime
except when writing the profile at exit. Change the runtime library to
open the tables on disk and read only when necessary.

(cherry picked from FBD16626030)
2019-08-02 11:20:13 -07:00
Rafael Auler 62aa74f836 [BOLT] Support instrumentation via runtime library
Summary:
To allow the development of future instrumentation work, this
patch adds support in BOLT for linking arbitrary libraries into the
binary processed by BOLT. We use orc relocation handling mechanism for
that. With this support, this patch also moves code programmatically
generated in X86 assembly language by X86MCPlusBuilder to C code written
in a new library called bolt_rt. Change CMake to support this library as
an external project in the same way as clang does with compiler_rt. This
library is installed in the lib/ folder relative to BOLT root
installation and by default instrumentation will look for the library
at that location to finish processing the binary with instrumentation.

(cherry picked from FBD16572013)
2019-07-24 14:03:43 -07:00
laith sakka f77cccf681 Rename option
(cherry picked from FBD16655093)
2019-08-05 13:56:48 -07:00
laith sakka c1564a1026 Add test for parallel mode
Summary:
Add a flag that disables writing the bolt-info section
and add a test that runs bolt in two modes, parallel
and sequential, and asserts that the resulting binaries
are the same.

(cherry picked from FBD16575440)
2019-07-30 17:55:27 -07:00
laith sakka cc8415406c Rewrite frame analysis using parallel utilities
Summary: Rewrite frame analysis using parallel utilities

(cherry picked from FBD16499130)
2019-07-25 11:57:08 -07:00
laith sakka 5084534699 Rewrite ICF using parallel utilities
Summary: Rewrite ICF using parallel utilities

(cherry picked from FBD16472975)
2019-07-24 17:13:15 -07:00
Maksim Panchenko 8d5854ef09 [BOLT] Add option to verify instruction encoder/decoder
Summary:
Add option `-check-encoding` to verify if the input to LLVM disassembler
matches the output of the assembler. When set, the verification runs on
every instruction in processed functions.

I'm not enabling the option by default as it could be quite noisy on x86
where instruction encoding is ambiguous and can include redundant
prefixes.

(cherry picked from FBD16595415)
2019-07-31 16:03:49 -07:00
Maksim Panchenko 79ff4ec1cb [perf2bolt] Enforce strict mode for perf2bolt
Summary:
In strict relocation mode, we get better function coverage. However, if
the profile used for optimization was converted using non-strict mode,
then it wouldn't match functions exclusive to strict mode. Hence,
we have to enforce strict relocation mode for profile conversion, so it
can be used for either mode.

I'm also adding parallel profile pre-processing unless `--no-threads` is
specified. This masks the runtime overhead of function disassembly on
multi-core machines.

(cherry picked from FBD16587855)
2019-06-11 13:24:10 -07:00
laith sakka 1bce256e67 Fix race condition in buildCFG
Summary:
Switch to sequential execution when print-all is passed,
since the function getDynoStats has an unsafe access
to the annotation allocators.

(cherry picked from FBD16503502)
2019-07-25 14:41:57 -07:00
laith sakka 6443c46b9d Run hfsort+ in parallel
Summary:
hfsort+ performs an expensive analysis to determine the
new order of the functions. 99% of the time during hfsort+
is spent in the function runPassTwo. This diff runs the body
of the hot loop in runPassTwo in parallel speeding up the
total runtime of reorder-functions pass by up to 4x

(cherry picked from FBD16450780)
2019-07-23 15:49:02 -07:00
Maksim Panchenko a9b9aa1e02 [BOLT] Add code padding verification
Summary:
In non-relocation mode, we allow data objects to be embedded in the
code. Such objects could be unmarked, and could occupy an area between
functions, the area which is considered to be code padding.

When we disassemble code, we detect references into the padding area
and adjust it, so that it is not overwritten during the code emission.
We assume the reference to be pointing to the beginning of the object.

However, assembly-written functions may reference the middle of an
object and use negative offsets to reference data fields. Thus,
conservatively, we reduce the possibly-overwritten padding area to
a minimum if the object reference was detected.

Since we also allow functions with unknown code in non-relocation mode,
it is possible that we miss references to some objects in code.
To cover such cases, we need to verify the padding area before we
allow it to be overwritten.

(cherry picked from FBD16477787)
2019-07-23 20:48:41 -07:00
Maksim Panchenko 6722875047 [BOLT] Fix processing PLT without relocs
Summary:
Some binaries may not have a relocation section corresponding to PLT.
Handle them properly.

(cherry picked from FBD16477841)
2019-07-24 22:08:36 -07:00
Maksim Panchenko 98fdba2cc7 [BOLT][NFC] Fix white space
(cherry picked from FBD16473918)
2019-07-24 17:54:14 -07:00
laith sakka 744a2417dd Run findSubprograms in preprocessDebugInfo in parallel
Summary:
While reading debug info the function findSubprograms
runs on each compilation unit. This diff parallelizes that loop
reducing its runtime duration by 70%.

(cherry picked from FBD16362867)
2019-07-17 20:54:53 -07:00
laith sakka b50500893d Lock-based parallelization for updateDebugInfo
Summary:
BOLT spends a decent amount of time creating patches to update
debug information when -update-debug-sections is passed.
In updateDebugInfo patches are created to update .debug_info
and .debug_abbrev sections while .debug_loc and .debug_ranges
contents are populated. This diff uses a lock-based approach to
parallelize updateDebugInfo functions and reduces the runtime of the
function by around 30%.

(cherry picked from FBD16352261)
2019-07-17 14:58:17 -07:00
Facebook Github Bot 86800abc81 [BOLT][PR] Target compilation based on LLVM CMake configuration
Summary:

Minimalist implementation of target configurable compilation.

Fixes https://github.com/facebookincubator/BOLT/issues/59
Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/60
GitHub Author: Pierre RAMOIN <pierre.ramoin@amadeus.com>

(cherry picked from FBD16461879)
2019-07-24 11:05:08 -07:00
Maksim Panchenko 2c9c6b164b [BOLT] Fix issue printing CTCs without annotations
Summary:
After stripping annotations, conditional tail calls no longer can be
identified by their corresponding tag. We can check the number of basic
block successors instead.

Fixes facebookincubator/BOLT#58.

(cherry picked from FBD16444718)
2019-07-22 20:57:19 -07:00
laith sakka fde5a2b470 Run shrink wrapping in parallel
Summary:
Shrink wrapping is an expensive part of frame optimizations if
performed on all functions. This diff makes it run in parallel,
reducing wall time.

(cherry picked from FBD16092651)
2019-07-02 10:48:43 -07:00
laith sakka 7d42835418 Run buildCFG in disassembly in parallel
Summary:
This diff parallelizes the construction of the call graph during disassembly.
The diff includes a change to parallel-utilities where another interface
is added that supports running tasks on binaryFunctions that involve
adding instruction annotations. This pattern is common in different places,
e.g. frame optimizations, and as such it justifies creating an interface
that abstracts out all the messy details.

(cherry picked from FBD16232809)
2019-07-12 07:25:50 -07:00
laith sakka f4ab6e6924 run finalize functions in parallel
Summary:

(cherry picked from FBD16188733)
2019-07-10 10:59:56 -07:00
laith sakka 98539b0966 run aligner pass in parallel
Summary: This diff parallelizes the aligner pass.

(cherry picked from FBD16176327)
2019-07-09 17:59:41 -07:00
laith sakka 9977b03fea Run reorder blocks in parallel
Summary:
This diff changes the reorderBasicBlocks pass to run in parallel.
It does so by adding locks to the fix branches function
and creating temporary MCCodeEmitters when estimating basic block code size.

(cherry picked from FBD16161149)
2019-07-08 12:32:58 -07:00
Rafael Auler 1169f1fdd8 [BOLT] Support duplicating jump tables
Summary:
If two indirect branches use the same jump table, we need to
detect this and duplicate jump tables so we can modify this CFG
correctly. This is necessary for instrumentation and shrink wrapping.
For the latter, we only detect this and bail, fixing this old known
issue with shrink wrapping.

Other minor changes to support better instrumentation: add an option
to instrument only hot functions, add LOCK prefix to instrumentation
increment instruction, speed up splitting critical edges by avoiding
calling recomputeLandingPads() unnecessarily.

(cherry picked from FBD16101312)
2019-07-02 16:56:41 -07:00
Rafael Auler 8880969ced [BOLT] Restrict creation of jump tables
Summary:
The heuristic that creates a jump table for every memory access,
including those we do not match against a pattern in an indirect jump,
is too permissive and has false positives. Guard this logic under
strict mode until we figure out a better strategy.

(cherry picked from FBD16192205)
2019-07-10 15:41:34 -07:00
laith sakka 3cfc76cdbf Create a general interface to implement parallel tasks easily and apply it to run EliminateUnreachableBlocks in parallel.
Summary:
Each time we run some work in parallel over the list of functions in bolt, we manage a thread pool, task scheduling and perform some work to manage the granularity of the tasks based on the type of the work we do.

In this task, I am creating an interface where all those details are abstracted out: the user provides the function that will run on each function, and some policy parameters that set up the scheduling and granularity configurations.

This will make it easier to implement parallel tasks, and eliminate redundant coding efforts.
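The shape of such an interface can be sketched as below. This illustrates the idea only and is not BOLT's parallel-utilities API. The name `run_on_each_function` and the `threads`, `chunk_size`, and `force_sequential` parameters are hypothetical stand-ins for the policy parameters mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

F = TypeVar("F")


def run_on_each_function(
    functions: Iterable[F],
    work: Callable[[F], None],
    *,
    threads: int = 4,
    chunk_size: int = 8,
    force_sequential: bool = False,
) -> None:
    """Apply `work` to every function, hiding pool setup and task granularity."""
    items: List[F] = list(functions)
    if force_sequential or threads <= 1:
        for item in items:
            work(item)
        return

    # Group functions into chunks so each task is coarse enough to be worth scheduling.
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

    def run_chunk(chunk: List[F]) -> None:
        for item in chunk:
            work(item)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Consuming the iterator re-raises any exception thrown by a task.
        list(pool.map(run_chunk, chunks))
```

A caller supplies only the per-function callback; any state shared between callbacks still has to be synchronized by the caller.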

(cherry picked from FBD16116077)
2019-07-03 17:23:19 -07:00
laith sakka f10d1fe0f3 Run cleanAnnotations within frame analysis in parallel
Summary: This diff parallelizes the function FrameAnalysis::cleanAnnotations()

(cherry picked from FBD16096711)
2019-07-02 13:42:17 -07:00
laith sakka 00c252f6d8 Clean SPTMap in frame analysis in parallel
Summary:
This diff parallelizes the STPClean() function, reducing its runtime from 5 seconds to 0.4 seconds on HHVM
and bringing the runtime of the frame optimizer down to 33 seconds on HHVM.

(cherry picked from FBD15914371)
2019-06-19 18:01:00 -07:00
laith sakka 86b529bd54 run SPT in parallel, and split annotation allocator
Summary:
This diff includes two main changes:
1) When creating an annotation, a dedicated annotation allocator can be used instead of the default allocator. This allows some annotations to be deallocated completely right after the end of their usage. Furthermore, having the ability to use dedicated allocators allows running SPT in parallel where each task uses a different allocator.

2) SPT is parallelized.

(cherry picked from FBD15913492)
2019-06-14 19:56:11 -07:00
Wenlei He 4e90fc1e3b [BOLT] Prioritize Jump Table ICP target by frequency and indice count
Summary: We select the top hot targets for indirect call promotion. But since we only have frequency for targets, not for actual jump table indices, we have to merge indices that share the same actual target. In order to do that, we sort targets by the pointer of the target symbol before merging, which introduces instability. Later we stable sort merged targets by frequency. Due to the instability of sorting pointers, and depending on how many indices each merged target has, we could end up with unstable ICP. This commit changes the 2nd pass sorting to prioritize targets with fewer indices, and higher mispredicts, in addition to higher frequency. It improves stability of ICP, and also exposes more ICP because targets with fewer indices have a better chance of getting promoted.
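A sketch of a second-pass ordering along these lines. The `Target` record and `rank_targets` are illustrative, and the precedence of the three criteria (frequency first, then fewer indices, then mispredicts) is an assumption, since the summary lists the criteria without ranking them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Target:
    name: str
    frequency: int    # how often the merged target was taken
    mispredicts: int  # mispredicted branches attributed to the target
    num_indices: int  # number of jump table indices merged into it


def rank_targets(targets: List[Target]) -> List[Target]:
    """Order merged targets for promotion without depending on pointer values.

    ASSUMED precedence: higher frequency, then fewer indices, then more mispredicts.
    """
    return sorted(
        targets,
        key=lambda t: (-t.frequency, t.num_indices, -t.mispredicts),
    )
```

Because the key uses only profile data and index counts, two runs over the same profile produce the same order wherever the three values differ.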

(cherry picked from FBD16099701)
2019-07-02 15:51:20 -07:00
Maksim Panchenko 078ece1691 [BOLT] Fix out-of-bounds entry points
Summary:
Check that a symbol address is less than the next function
address before considering it for a secondary entry.

(cherry picked from FBD16056468)
2019-06-28 11:53:34 -07:00
Maksim Panchenko e89ad0db4b [BOLT] Introduce strict relocation mode
Summary:
In strict relocation mode we rely on relocations to represent all
possible entry points into a function. Most of the code generated by
tested compilers (gcc and clang) will result in relocations against
any internal labels for jump tables and for computed goto tables.

In situations where we cannot properly reconstruct a jump table, or when
we cannot determine a table that guides an indirect jump, e.g. when
multiple computed goto tables are used, we conservatively assume that
the indirect jump can end up at any possible basic block referenced by
relocations.

In strict mode, simple functions may include the aforementioned
instructions with unknown control flow with a conservative list of
destinations added to the containing basic block. This allows us to
expand coverage of simple functions and to enable code reordering
optimizations for more functions.

The strict mode is recommended when BOLT is used with a well-formed
code generated by a compiler.

To use the strict mode, add "-strict" on the command line.

Another effect of this diff is that with relocations, we will always
replace the immediate operand of an instruction with a symbol if the
relocation exists against this operand.

Also this diff fixes issues with Clang compiled with -fpic.

(cherry picked from FBD15872849)
2019-06-28 09:21:27 -07:00
Maksim Panchenko 06e7a1e059 [BOLT] Ignore false function references
Summary:
A relocation can have an addend that makes it look as if the relocated
value is in a different section from the symbol being relocated.
E.g., a relocation against a variable in .rodata could have a negative
offset that will make it look like it is against a symbol in .text
(a section that typically precedes .rodata).

Unless the relocation is against a section symbol, we know
exactly the symbol that is being relocated and there is no issue.
However, when the linker leaves only a section relocation (i.e. a
relocation against a section symbol when a temporary original symbol
gets deleted), we have to guess the relocated symbol, and can falsely
detect a function reference in the case described above.

The fix is to keep a section relocation if the corresponding
relocated value falls into a different section, and to detect and
ignore false function references.

(cherry picked from FBD16030791)
2019-06-27 03:20:17 -07:00
Wenlei He 459add2827 [BOLT] Force non-relocation mode for heatmap generation
Summary: BOLT operates in relocation mode by default when .reloc is in the binary. This change disables relocation mode for heatmap generation so we can use that for more cases. There's a small separate change that ignores zero-sized symbols in zero-sized code sections during function discovery.

(cherry picked from FBD16009610)
2019-06-26 11:06:46 -07:00
Rafael Auler 0d23cbaa52 [BOLT] Initial experimental instrumentation pass
Summary:
An instrumentation pass that modifies the input binary to
generate a profile after execution finishes. It modifies branches to
increment counters stored in the process memory and injects a new
function that dumps this data to an fdata file, readable by BOLT.

This instrumentation is experimental and currently uses a naive
approach where every branch is instrumented. This is not ideal for
runtime performance, but should be good enough for us to
evaluate/debug LBR profile quality against instrumentation.

Does not support instrumenting indirect calls yet, only direct
calls, direct branches and indirect local branches.

(cherry picked from FBD15998096)
2019-06-19 20:10:49 -07:00
Rafael Auler db02a1a142 [BOLT] Ignore empty funcs in relocation mode
Summary:
Make BOLT ignore empty functions (those containing no instructions,
despite having some space allocated to them filled with zeroes).

(cherry picked from FBD15981683)
2019-06-24 20:23:22 -07:00
Rafael Auler bda13b7dd8 [BOLT] Add option to print profile bias stats
Summary:
Profile bias may happen depending on the hardware counter used
to trigger LBR sampling, on the hardware implementation and as an
intrinsic characteristic of relying on LBRs. Since we infer fall-through
execution and these non-taken branches take zero hardware resources to
be represented, LBR-based profile likely overrepresents paths with fall
throughs and underrepresents paths with many taken branches. This patch
adds an option to print statistics about profile bias so we can better
understand these biases.

The goal is to analyze differences in the sum of the frequency of all
incoming edges in a basic block versus the sum of all outgoing. In an
ideally sampled profile, these differences should be close to zero. With
this option, the user gets the mean of these differences in flow as a
percentage of the input flow. For example, if this number is 15%, it
means, on average, a block observed 15% more or less flow going out of
it in comparison with the flow going in. We also print the standard
deviation so we can have an idea of how spread apart the different
measurements of flow differences are. If variance is low, it means the
average bias is happening across all blocks, which is compatible with
using LBRs. If the variance is high, it means some blocks in the profile
have a much higher bias than others, which is compatible with using a
biased event such as cycles to sample LBRs because it overrepresents
paths that end in an expensive instruction.
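
The statistic described above can be sketched as follows. This is a minimal illustration, not the code from this patch: the `BlockFlow` record and `flowBias` helper are hypothetical names, and skipping blocks with zero incoming flow is an assumption.

```cpp
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical per-block record: summed incoming and outgoing edge counts.
struct BlockFlow {
  uint64_t In;
  uint64_t Out;
};

// Returns {mean, standard deviation} of |Out - In| / In over all blocks
// with non-zero incoming flow. Both values are fractions (0.15 == 15%).
std::pair<double, double> flowBias(const std::vector<BlockFlow> &Blocks) {
  std::vector<double> Diffs;
  for (const BlockFlow &B : Blocks) {
    if (B.In == 0)
      continue; // assumption: blocks without recorded input flow are skipped
    const double In = static_cast<double>(B.In);
    const double Out = static_cast<double>(B.Out);
    Diffs.push_back(std::fabs(Out - In) / In);
  }
  if (Diffs.empty())
    return {0.0, 0.0};

  double Sum = 0.0;
  for (double D : Diffs)
    Sum += D;
  const double Mean = Sum / Diffs.size();

  double SqSum = 0.0;
  for (double D : Diffs)
    SqSum += (D - Mean) * (D - Mean);
  const double StdDev = std::sqrt(SqSum / Diffs.size());
  return {Mean, StdDev};
}
```

A low standard deviation next to a non-zero mean would correspond to the "bias spread across all blocks" case, while a high one would point at a few heavily biased blocks.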

(cherry picked from FBD15790517)
2019-06-10 17:26:48 -07:00
laith sakka 1ec091e6f5 Parallelize ICF Pass
Summary:
ICF consumes 10-15% of BOLT runtime; for HHVM that is around 45 seconds.
This diff parallelizes parts of the pass to make it faster.
A 60% reduction in ICF runtime is measured with the parallel version on HHVM.

(cherry picked from FBD15589515)
2019-05-31 16:45:31 -07:00
Maksim Panchenko 9894de0094 [BOLT] Check instruction boundaries while populating jump tables
Summary:
Now that we populate jump tables after all functions are disassembled,
we can check for instruction boundaries corresponding to jump table
entries. No need to delegate this task to postProcessJumpTables().

(cherry picked from FBD15814762)
2019-06-13 15:31:30 -07:00
Maksim Panchenko 9e2ad3f593 [BOLT] Delay populating jump tables
Summary:
During the initial disassembly pass, only identify jump tables
without populating the contents. Later, after all functions have been
disassembled, we have a better idea of jump table boundaries and can do
a better job of populating their entries.

As a result, we no longer have embedded jump tables (i.e. a jump table
that is part of another jump table). If we ever need to keep
sequential jump tables inseparable during the output, we can always
add such functionality later.

Fixes facebookincubator/BOLT#56.

(cherry picked from FBD15800427)
2019-06-12 18:21:02 -07:00
laith sakka 66cf16208f Use singleton instances for SPT (stack pointer tracking) in FrameAnalysis.
Summary:
During frame analysis, the functions do not change, and stack pointer tracking
does not need to be performed more than one time.

The current implementation performs the SPT analysis multiple times per
function during the frame analysis; we can eliminate this redundant computation.
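
The idea of computing an analysis once per function and reusing it can be sketched as a small cache. This is a generic illustration under assumed names (`AnalysisCache`, `ComputeFn`); it is not the class used by FrameAnalysis.

```cpp
#include <functional>
#include <memory>
#include <unordered_map>
#include <utility>

// Caches one analysis result per function object so that repeated queries
// reuse the first result instead of recomputing it.
template <typename FunctionT, typename AnalysisT> class AnalysisCache {
public:
  using ComputeFn =
      std::function<std::unique_ptr<AnalysisT>(const FunctionT &)>;

  explicit AnalysisCache(ComputeFn Fn) : Compute(std::move(Fn)) {}

  // Runs the analysis on first use; later calls return the stored result.
  const AnalysisT &get(const FunctionT &F) {
    std::unique_ptr<AnalysisT> &Slot = Cache[&F];
    if (!Slot)
      Slot = Compute(F);
    return *Slot;
  }

private:
  ComputeFn Compute;
  std::unordered_map<const FunctionT *, std::unique_ptr<AnalysisT>> Cache;
};
```

Every cached result stays alive for the lifetime of the cache, which is the kind of runtime-for-memory trade-off this commit reports.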

On HHVM with -frame-opt=hot, this saves around a minute, which is 40% of the
frame optimization runtime (129s to 76s).
fdata should be passed for a reasonable evaluation (we need the call graph).

However, this comes at a memory cost: around 2G is added to the peak when only -frame-opt=hot is used.
When all the usual flags are passed, the effect on the peak is only 200K (measured from one test).

Update:
When jemalloc is used, the baseline is much better. The following runtimes (in seconds) are observed:

[jemalloc]
hhvm   85 --> 72
clang  27 --> 23

[malloc]
hhvm  129 --> 76
clang  34 --> 27

(cherry picked from FBD15707003)
2019-06-06 12:58:14 -07:00
Maksim Panchenko 9df5063c0e [perf2bolt] Option to use event PC with LBR stack
Summary:
Add an option to get extra profile trace using the recorded event PC.
The trace goes from the latest LBR record destination to the event PC.

(cherry picked from FBD15711804)
2019-06-06 19:38:06 -07:00
Maksim Panchenko fac6a89c23 [BOLT] Better handling of address references
Summary:
We used to handle PC-relative address references differently from direct
address references. As a result, some cases, such as escaped function
label address, were not handled when dealing with absolute (non-PIC)
code. This diff moves processing of an address reference into
BinaryContext::handleAddressRef() which is called for both PIC and
non-PIC code.

(cherry picked from FBD15643535)
2019-06-04 15:30:22 -07:00
laith sakka d3c1821f5f Compile Bolt using std 14.
Summary:
Compile Bolt using std 14.
We want to be able to use some threading and locking tools that do not exist in std 11.

(cherry picked from FBD15671736)
2019-06-05 10:32:29 -07:00
Rafael Auler 21f4303bfd Support data collection in bolted binaries
Summary:
The compiler relies on DWARF to map samples, which makes it possible to
collect profile data in binaries optimized by PGO techniques and retrofit
that data onto a representation of the program that was not optimized by
PGO. Similarly, this diff implements an option in BOLT to encode a table in
the output binary that allows us to map data collected in optimized
binaries back to the address space used in the input binary (where the
profile is useful, since we do not support running BOLT on a binary
already optimized by BOLT). The goal is to support BOLT in scenarios where
it is not easy to run a special deployment of the binary, with a version
that was not optimized by BOLT, just for data collection.

This feature is enabled with the -enable-bat flag. BAT stands for BOLT
Address Translation, which refers to the process of mapping output to
input addresses.

(cherry picked from FBD15531860)
2019-04-12 17:33:46 -07:00
Laith Sakka 3df2c9ea1f Update SDT locations after bolt reordering
Summary: Update SDT locations in the .note section to match the new locations after BOLT reorders the code.

(cherry picked from FBD15427779)
2019-05-17 07:58:27 -07:00
Maksim Panchenko 9ef9a7b1be [BOLT] Use regex matching for function names passed on command line
Summary:
Options such as `-print-only`, `-skip-funcs`, etc. now take regular
expressions. Internally, the option is converted to '^funcname$' form
prior to regex matching. This ensures that names without special
symbols will match exactly, i.e. "foo" will not match "foo123".
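
A minimal sketch of the anchoring described above, using `std::regex` as a stand-in for whatever regex class BOLT uses internally. The helper name `matchesFunctionName` is hypothetical.

```cpp
#include <regex>
#include <string>

// Wraps the user-supplied pattern in ^...$ so that it has to match the
// whole function name rather than any substring of it.
bool matchesFunctionName(const std::string &Pattern, const std::string &Name) {
  const std::regex Anchored("^" + Pattern + "$");
  return std::regex_search(Name, Anchored);
}
```

With this sketch, `matchesFunctionName("foo", "foo123")` is false, while `matchesFunctionName("foo.*", "foo123")` is true.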

(cherry picked from FBD15551930)
2019-05-29 18:33:09 -07:00
Laith Sakka c8038da36e Minor-fix: remove duplicate definition of SPT optimization timer
Summary:

(cherry picked from FBD28111560)
2019-05-22 15:03:42 -07:00
Maksim Panchenko e5b1d9cd8c [BOLT][NFC] Fix white space
(cherry picked from FBD15485688)
2019-05-23 15:49:36 -07:00
Maksim Panchenko f57d3c00fc [BOLT] Better verification of jump tables
Summary:
Run analyzeIndirectBranch() using basic block boundaries instead of
running ad-hoc validation of the jump table assumptions.

(cherry picked from FBD15465034)
2019-05-22 18:14:34 -07:00
Maksim Panchenko be344c8de7 [BOLT] Refactor handling of interproc refs
Summary:
Move handling of interprocedural references to BinaryContext.

Post-process indirect branches immediately after the CFG is built.

This is almost NFC. Since indirect branches are now post-processed
before the profile data is processed it interferes with the way the
profile data in YAML format is handled.

(cherry picked from FBD15456003)
2019-05-22 11:26:58 -07:00
Maksim Panchenko d047df12c5 [BOLT] Add an option to specialize memcpy() for 1 byte copy
Summary:
Add an option:

  -memcpy1-spec=func1,func2:cs1,func3:cs1:cs2,...

to specialize calls to memcpy() in listed functions (the name could be
supplied as a regex) for size 1. The optimization will dynamically check
if the size argument equals 1 and execute a one-byte copy, otherwise
it will call memcpy() as usual. Specific call sites could be indicated
after ":" using their numeric count from the start of the function.
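
At the source level, the effect on a specialized call site is roughly equivalent to the following. BOLT performs the transformation on machine code; this C++ function (hypothetical name `memcpySpecializedFor1`) only illustrates the run-time check.

```cpp
#include <cstddef>
#include <cstring>

// Source-level analogue of a specialized call site: check the size at run
// time, copy one byte inline when it is 1, otherwise call memcpy() as usual.
inline void *memcpySpecializedFor1(void *Dst, const void *Src,
                                   std::size_t Size) {
  if (Size == 1) {
    *static_cast<unsigned char *>(Dst) =
        *static_cast<const unsigned char *>(Src);
    return Dst;
  }
  return std::memcpy(Dst, Src, Size);
}
```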

(cherry picked from FBD15428936)
2019-05-20 20:11:40 -07:00
Laith Saed Sakka ca659e4336 Preserve nops that are SDT markers in binaries and disable SDT conflicting optimizations
Summary:
SDT markers that appear as nops in the assembly are preserved and not eliminated.
Functions with SDT markers are also flagged. Inlining and folding are disabled for
functions that have SDT markers.

(cherry picked from FBD15379799)
2019-05-16 12:46:32 -07:00
Laith Saed Sakka 4755825447 Parse statically defined tracepoint markers from .note.stapsdt section
Summary:
    Parse statically defined tracepoint (SDT) markers from the ELF file, and store them.
    Add an option to print SDTs (-print-sdt).
    Add test case for parsing and printing SDTs.

(cherry picked from FBD15366712)
2019-05-15 17:19:18 -07:00
Rafael Auler f1fde44154 [BOLT] Improve ICP activation policy and hot jt processing
Summary:
Previously, ICP worked with a budget of N targets to convert to
direct calls. As long as the frequency of up to N of the hottest targets
surpassed a given fraction (threshold) of the total frequency, say, 90%,
then the optimization would convert a number of targets (up to N) to
direct calls. Otherwise, it would completely abort processing this call
site. The intent was to convert a given fraction of the indirect call
site frequency to use direct calls instead, but this ends up being an
"all or nothing" strategy.

In this patch we change this to operate with the same strategy seen in
LLVM's ICP, with two thresholds. The idea is that the hottest target of
an indirect call site will be compared against these two thresholds: one
checks its frequency relative to the total frequency of the original
indirect call site, and the other checks its frequency relative to the
remaining, unconverted targets (excluding the hottest targets that were
already converted to direct calls). The remaining threshold is typically
set higher than the total threshold. This allows us more control over
ICP.
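
The two-threshold check can be sketched as below. The struct, the function name and the example threshold values are illustrative assumptions, not the knobs added by this patch. Here "remaining" is the frequency not yet converted, including the candidate being tested.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical thresholds, expressed in percent.
struct IcpThresholds {
  unsigned TotalPercent;     // candidate vs. total call site frequency
  unsigned RemainingPercent; // candidate vs. not-yet-promoted frequency
};

// Counts must be sorted hottest first. Returns how many leading targets
// pass both checks, never more than MaxTargets.
std::size_t countPromotableTargets(const std::vector<uint64_t> &Counts,
                                   IcpThresholds T, std::size_t MaxTargets) {
  uint64_t Total = 0;
  for (uint64_t C : Counts)
    Total += C;

  uint64_t Remaining = Total;
  std::size_t N = 0;
  for (uint64_t C : Counts) {
    if (N == MaxTargets)
      break;
    if (C * 100 < static_cast<uint64_t>(T.TotalPercent) * Total)
      break; // too cold relative to the whole call site
    if (C * 100 < static_cast<uint64_t>(T.RemainingPercent) * Remaining)
      break; // too cold relative to what is still unconverted
    Remaining -= C;
    ++N;
  }
  return N;
}
```

Unlike the old budget-based policy, a call site whose hottest target passes both checks still gets that target promoted even if the colder ones fail.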

I expose two pairs of knobs, one for jump tables and another for
indirect calls.

To improve the promotion of hot jump table indices when we have memory
profile, I also fix a bug that could cause us to promote extra indices
besides the hottest ones as seen in the memory profile. When we have the
memory profile, I reapply the dual threshold checks to the memory
profile which specifies exactly which indices are hot. I then update N,
the number of targets to be promoted, based on this new information, and
update frequency information.

To allow us to work with smaller profiles, I also created an option in
perf2bolt to filter out memory samples outside the statically allocated
area of the binary (heap/stack). This option is on by default.

(cherry picked from FBD15187832)
2019-05-02 12:28:34 -07:00
Maksim Panchenko fee61231ef [BOLT] Move JumpTable management to BinaryContext
Summary:
Make BinaryContext responsible for creation and management of
JumpTables. This will be used for detection and resolution of jump table
conflicts across functions.

(cherry picked from FBD15196017)
2019-05-02 17:42:06 -07:00
Maksim Panchenko 4b55967d9e [perf2bolt] Pass `-f` flag to perf
Summary:
perf tool requires the input data to be owned by the current user or
root, otherwise it rejects the input. Use `-f` option to override this
behavior.

(cherry picked from FBD15160678)
2019-04-30 17:08:22 -07:00
Maksim Panchenko 310b32fbe5 [BOLT] Limit jump table size by containing object
Summary:
While checking the size of a jump table, we used the containing
section as a boundary. This worked for most cases as typically jump
tables are not marked with symbol table entries. However, the compiler
may generate objects for indirect gotos.

(cherry picked from FBD15158905)
2019-04-30 15:47:10 -07:00
Maksim Panchenko f1dfd38dec [BOLT][NFC] Move DynoStats out of BinaryFunction
Summary: Move DynoStats into separate source files.

(cherry picked from FBD15138883)
2019-04-29 12:51:10 -07:00
Maksim Panchenko 2b1523362e [BOLT] Strip debug sections by default
Summary:
We used to ignore debug sections by default, but we kept them in the
binary which led to invalid debug information in the output. It's better
to strip debug info and print a warning to the user.

Note: we are not updating debug info by default due to high memory
requirements for large applications.

(cherry picked from FBD15128947)
2019-04-26 15:30:12 -07:00
Rafael Auler 21ee0e98c7 [BOLT] Fix symboltable update bug
Summary:
Commit "Update symbols for secondary entry points" introduced
a bug by using getBinaryFunctionContainingAddress() instead of
getBinaryFunctionAtAddress() regarding ICF'd functions. Only the latter
would fetch the correct BinaryFunction object for addresses of functions
that were ICF'd. As a result of this bug, the dynamic symbol table was
not updated for function symbols that were folded by ICF.

(cherry picked from FBD15112941)
2019-04-26 19:52:36 -07:00
Maksim Panchenko caa0fafa18 [BOLT] Fix profile reading in non-reloc mode
Summary:
In non-relocation mode we may execute multiple re-write passes either
because we need to split large functions or update debug information for
large functions (in this context large functions are functions that do
not fit into the original function boundaries after optimizations).

When we execute another pass, we reset RewriteInstance and run most of
the steps such as disassembly and profile matching for the 2nd or 3rd
time. However, when we match a profile, we check `Used` flag, and don't
use the profile for the 2nd time. Since we didn't reset the flag while
resetting the rest of the states, we ignored profile for all functions.
Resetting the flag in-between rewrite passes solves the problem.

(cherry picked from FBD15110959)
2019-04-26 16:32:28 -07:00
Maksim Panchenko 5717b0c427 [perf2bolt] Fix print report for pre-aggregated profile
Summary:
For pre-aggregated profile, we were using the number of records in the
profile for `NumTraces` ignoring the counts per record. As a result,
the reported percentage of mismatched traces was bogus.
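
The difference between counting records and counting traces can be illustrated as follows. The `AggregatedRecord` layout and the helper name are hypothetical and only keep the fields needed for the report.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical pre-aggregated record: one trace and how often it was seen.
struct AggregatedRecord {
  uint64_t Count;
  bool Mismatched;
};

// Percentage of mismatched traces, weighting every record by its count
// instead of counting each record once.
double mismatchedPercent(const std::vector<AggregatedRecord> &Records) {
  uint64_t NumTraces = 0;
  uint64_t NumMismatched = 0;
  for (const AggregatedRecord &R : Records) {
    NumTraces += R.Count;
    if (R.Mismatched)
      NumMismatched += R.Count;
  }
  if (NumTraces == 0)
    return 0.0;
  return 100.0 * static_cast<double>(NumMismatched) /
         static_cast<double>(NumTraces);
}
```

For two records with counts 90 (matched) and 10 (mismatched), this reports 10%, whereas counting records alone would report 50%.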

(cherry picked from FBD15093123)
2019-04-25 16:34:50 -07:00
Maksim Panchenko 492e4a515e [BOLT] Automatically enable -hot-text
Summary:
Enable -hot-text by default if reordering functions.

Also fail immediately if function reordering is specified on the command
line in non-relocation mode.

(cherry picked from FBD15095178)
2019-04-25 17:00:05 -07:00
Brian Gesiak 91b2de3c23 [BOLT] Minimize BOLT's diff with LLVM by removing trivial changes (NFC)
Summary: BOLT works as a series of patches rebased onto upstream LLVM at revision `f137ed238db`. Some of these patches introduce unnecessary whitespace changes or includes. Remove these to minimize the diff with upstream LLVM.

(cherry picked from FBD15064122)
2019-04-24 11:24:15 -04:00
Rafael Auler 4e4d39c21c [BOLT] Update symbols for secondary entry points
Summary:
Update the output ELF symbol table for symbols representing
secondary entry points for functions. Previously, those were left
unchanged in the symtab.

(cherry picked from FBD15010517)
2019-04-18 16:32:22 -07:00
Brian Gesiak eba1a67730 Fix casting issues on macOS
Summary:
`size_t` is platform-dependent, and on macOS it is defined as
`unsigned long long`. This is not the same type as is used in many calls
to templated functions that expect the same type. As a result, on macOS,
calls to `std::max` fail because a template function that takes
`uint64_t, unsigned long long` cannot be found.

To work around the issue:

* Specify explicit `std::max` and `std::min` functions where necessary,
  to work around the compiler trying (and failing) to find a suitable
  instantiation.
* For lambda return types, specify an explicit return type where necessary.
* For `operator ==()` calls, use an explicit cast where necessary.
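
A small example of the first workaround. The wrapper name `maxSize` is made up for illustration; the point is the explicit template argument.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// std::max(A, B) fails to compile when A and B have distinct integer types,
// because its single template parameter cannot be deduced. Naming the type
// explicitly converts both arguments to it.
inline uint64_t maxSize(uint64_t A, std::size_t B) {
  return std::max<uint64_t>(A, B);
}
```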

(cherry picked from FBD15030283)
2019-04-22 11:27:50 -04:00
Brian Gesiak d9f1bd42fd [cmake] Only build enabled targets
Summary:
When attempting to build llvm-bolt with `-DLLVM_ENABLE_TARGETS="X86"`, I
encountered an error:

```
CMake Error at cmake/modules/AddLLVM.cmake:559 (add_dependencies):
  The dependency target "AArch64CommonTableGen" of target
  "LLVMBOLTTargetAArch64" does not exist.
Call Stack (most recent call first):
  cmake/modules/AddLLVM.cmake:607 (llvm_add_library)
  tools/llvm-bolt/src/Target/AArch64/CMakeLists.txt:1 (add_llvm_library)
```

The issue is that the `llvm-bolt/src/Target/AArch64` subdirectory is
added by CMake unconditionally. The LLVM project, on the other hand,
only adds the subdirectories that are enabled, by using a `foreach` loop
over `LLVM_TARGETS_TO_BUILD`. Copying that same loop, from
`llvm/lib/Target/CMakeLists.txt`, to this project avoids the error.

(cherry picked from FBD15030236)
2019-04-22 11:19:02 -04:00
Rafael Auler 3b422eafd0 [BOLT] Fix non-determinism in shrink wrapping
Summary:
Iterating over SmallPtrSet is non-deterministic. Change it to
SmallSetVector. Similarly, do not sort a vector of ProgramPoint when
computing the dominance frontier, as ProgramPoint uses the pointer value
to determine order. Use a SmallSetVector there too to avoid duplicates
instead of sorting + uniqueing.
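
The property relied on here is that a set-vector iterates in insertion order and rejects duplicates, so the result does not depend on pointer values. A standard-library-only sketch of that idea (a stand-in, not `llvm::SmallSetVector` itself):

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Insertion-ordered set: the vector fixes the iteration order, the hash set
// rejects duplicates.
template <typename T> class OrderedSet {
public:
  // Returns true if the value was newly inserted.
  bool insert(const T &V) {
    if (!Seen.insert(V).second)
      return false;
    Order.push_back(V);
    return true;
  }

  typename std::vector<T>::const_iterator begin() const {
    return Order.begin();
  }
  typename std::vector<T>::const_iterator end() const { return Order.end(); }
  std::size_t size() const { return Order.size(); }

private:
  std::vector<T> Order;
  std::unordered_set<T> Seen;
};
```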

(cherry picked from FBD14992085)
2019-04-17 18:20:56 -07:00
Maksim Panchenko 433f3e3e02 [BOLT] Process CFIs for functions with FDE size mismatch
Summary:
If a function size indicated in FDE is different from the one in the
symbol table, we can keep processing the function as we are using the
max size for internal purposes. Typically this happens for
assembly-written functions with padding at the end. This padding is not
included in FDE, but it is in the symbol table.

(cherry picked from FBD14987653)
2019-04-17 15:17:55 -07:00
Maksim Panchenko 99ef4c90c1 [BOLT] Basic support for split functions
Summary:
This adds very basic and limited support for split functions.
In non-relocation mode, split functions are ignored, while their debug
info is properly updated. No support in the relocation mode yet.

Split functions consist of a main body and one or more fragments.
For fragments, the main part is called their parent. Any fragment
could only be entered via its parent or another fragment.

The short-term goal is to correctly update debug information for split
functions, while the long-term goal is to have complete support
including full optimization. Note that if we don't detect split
bodies, we would have to add multiple entry points via tail calls,
which we would rather avoid.

Parent functions and fragments are represented by a `BinaryFunction`
and are marked accordingly. For now they are marked as non-simple, and
thus only supported in non-relocation mode. Once we start building a
CFG, it should be a common graph (i.e. the one that includes all
fragments) in the parent function.

The function discovery is unchanged, except for the detection of
`\.cold\.` pattern in the function name, which automatically marks the
function as a fragment of another function.

Because of the local function name ambiguity, we cannot rely on the
function name to establish child fragment and parent relationship.
Instead we rely on disassembly processing.

`BinaryContext::getBinaryFunctionContainingAddress()` now returns a
parent function if an address from its fragment is passed.

There's no jump table support at the moment. Jump tables can have
source and destinations in both fragment and parent.

Parent functions that enter their fragments via C++ exception handling
mechanism are not yet supported.

(cherry picked from FBD14970569)
2019-04-16 10:24:34 -07:00
Maksim Panchenko ffae5e73f3 [BOLT] Fix an issue with std::errc
Summary:
On some platforms
`llvm::make_error_code(std::errc::no_such_process) == std::errc::no_such_process`
evaluates to false.

(cherry picked from FBD14944405)
2019-04-15 16:42:49 -07:00
Rafael Auler 31fc56b313 [BOLT] Fix adjustFunctionBoundaries w.r.t. entry points
Summary:
Don't consider symbols in another section when processing
additional entry points for a function.

(cherry picked from FBD14962853)
2019-04-16 14:35:29 -07:00
Maksim Panchenko 22ba3dc816 [BOLT] Add another section to the list of hot text movers
Summary:

(cherry picked from FBD14954472)
2019-04-16 10:39:05 -07:00
Maksim Panchenko 27dcec9717 [BOLT] Abort processing if the profile has no valid data
Summary:
It's possible to pass a profile in invalid format to BOLT, and we
silently ignore it. This could cause a regression as such a scenario can
go undetected. We should abort processing if no valid data was seen in
the profile and issue a warning if it was partially invalid.

(cherry picked from FBD14941211)
2019-04-15 14:03:01 -07:00
Maksim Panchenko 8f98268518 [BOLT] Reduce warnings for non-simple functions
Summary:
If a function was already marked as non-simple, there's no reason to
issue a warning that it has a reference in the middle of an
instruction. Besides, sometimes there wouldn't be instructions
disassembled at a given entry, and the warning would be incorrect.

(cherry picked from FBD14938227)
2019-04-15 11:56:55 -07:00
Maksim Panchenko e50e89be9e [BOLT] Handle R_X86_64_converted_reloc_bit
Summary:
In binutils 2.30 a bfd linker accidentally started modifying some
relocations on output under `-q/--emit-relocs` by turning on
R_X86_64_converted_reloc_bit. As a result, BOLT ignored such
relocations and failed to correctly update the binary.

This diff filters out R_X86_64_converted_reloc_bit from the relocation
type.
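
Filtering the bit amounts to masking it out before the relocation type is interpreted. In this sketch the bit value (1 << 7) is an assumption based on the binutils definition and should be checked against the headers in use. The constant and helper names are hypothetical.

```cpp
#include <cstdint>

// Assumed value of R_X86_64_converted_reloc_bit (1 << 7 in binutils);
// verify against your binutils sources before relying on it.
constexpr uint32_t ConvertedRelocBit = 1u << 7;

// Clears the marker bit so the result compares equal to the plain
// relocation type.
constexpr uint32_t stripConvertedBit(uint32_t RelType) {
  return RelType & ~ConvertedRelocBit;
}
```

For example, a type read as 0x82 would be treated as type 2 (R_X86_64_PC32) after filtering.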

(cherry picked from FBD14907832)
2019-04-11 17:11:08 -07:00
Maksim Panchenko 315ae74de3 [BOLT] Include <numeric> for std::iota
Summary: Some compilers require <numeric> header.

(cherry picked from FBD14868132)
2019-04-09 21:22:41 -07:00
Maksim Panchenko 88375d311e [BOLT] Sort basic block successors for printing
Summary:
For easier analysis of the hottest targets of jump tables it helps to
have basic block successors sorted based on the taken frequency.

(cherry picked from FBD14856640)
2019-04-09 11:27:23 -07:00
Maksim Panchenko a8e05d067d [BOLT] Add interface to extract values from static addresses
(cherry picked from FBD14858028)
2019-04-09 12:29:40 -07:00
Maksim Panchenko 7d89b113d8 [BOLT][NFC] Indentation fix
(cherry picked from FBD14856700)
2019-04-09 11:31:45 -07:00
Rafael Auler 90996eb54b [PERF2BOLT] Print a better message if perf.data lacks LBR
Summary:
If processing the perf.data in LBR mode but the data was
collected without -j, currently we confusingly report all samples
as mismatching the input binary, even though the samples match but
lack LBR info. Change perf2bolt to detect this scenario and print
a helpful message instructing the user to collect data with LBR.

(cherry picked from FBD14817732)
2019-04-05 17:27:25 -07:00
Maksim Panchenko 624a0e810d [DWARF][BOLT] Convert DW_AT_(low|high)_pc to DW_AT_ranges only if necessary
Summary:
While updating DWARF, we used to convert address ranges for functions
into DW_AT_ranges format, even if the ranges were not split and still
had a simple [low, high) form. We had to do this because functions with
contiguous ranges could be sharing an abbrev with non-contiguous range
function, and we had to convert the abbrev.

It turns out that excessive usage of DW_AT_ranges may lead to
internal core dumps in gdb in the presence of .gdb_index.
I still don't know the root cause of it, but reducing the number of
DW_AT_ranges used by DW_TAG_subprogram DIEs does alleviate the
issue.

We can keep a simple range for DIEs that are guaranteed not to
share an abbrev with any non-contiguous function. Hence we have to
postpone the update of function ranges until we've seen all DIEs.
Note that DIEs from different compilation units could share the same
abbrev, and hence we have to process DIEs from all compilation units.

(cherry picked from FBD14814043)
2019-04-01 20:26:41 -07:00
Maksim Panchenko c8a927696c [BOLT] Detect internal references into a middle of instruction
Summary:
Some instructions in assembly-written functions could reference 8-byte
constants from other instructions using 4-byte offsets, presumably to
save a couple of bytes.

Detect such cases, and skip processing such functions until we teach
BOLT how to handle references into the middle of an instruction.

(cherry picked from FBD14768212)
2019-04-03 22:31:12 -07:00
Maksim Panchenko 7fd487066f [BOLT] Move BinaryFunctions into a BinaryContext and more
Summary:
A long-overdue refactoring that makes interfaces cleaner and less awkward.
Mainly makes future work much easier.

(cherry picked from FBD14766284)
2019-04-03 15:52:01 -07:00
Maksim Panchenko 8894853f42 [BOLT][DWARF] Dedup .debug_abbrev section patches
Summary:
When we patch .debug_abbrev we issue many duplicate patches. Instead of
storing these patches as a vector, use a hash map. This saves some
processing time and memory.
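
A minimal sketch of the dedup idea, assuming patches are keyed by section offset. `AbbrevPatcher` and `addBytePatch` are illustrative names, not the BOLT interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Patch store keyed by section offset: adding a patch for an offset that
// already has one overwrites it instead of appending a duplicate.
class AbbrevPatcher {
public:
  void addBytePatch(uint32_t Offset, uint8_t Value) {
    Patches[Offset] = Value;
  }

  std::size_t size() const { return Patches.size(); }

private:
  std::unordered_map<uint32_t, uint8_t> Patches;
};
```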

(cherry picked from FBD14691292)
2019-03-29 14:22:54 -07:00
Maksim Panchenko 297d1a4e1a [BOLT] Do not write jump table section headers
Summary:
In non-relocation mode we were accidentally emitting section headers for
every single jump table. This happened with default
`-jump-tables=basic`.

(cherry picked from FBD14653282)
2019-03-27 13:58:31 -07:00
Maksim Panchenko d1b76f2ac2 [BOLT] Allocate enough space past __hot_end for huge pages
Summary:
While using "-hot-text" option, we might not get enough cold text to
fill up the last huge page, and we can get data allocated on this page
producing undesirable effects. To prevent this from happening, always
make sure to allocate enough space past __hot_end.

(cherry picked from FBD14575100)
2019-03-21 21:13:45 -07:00
Maksim Panchenko 69faf61372 [BOLT] Fix section lookup while deleting symbols
Summary:
While removing redundant local symbols, we used new section index to
lookup the corresponding section in the old section table. As a result,
we used to either not remove the correct symbols, or remove the wrong
ones.

(cherry picked from FBD14552047)
2019-03-20 16:13:09 -07:00
Maksim Panchenko b8d3dc40ea [BOLT] Use local binding for cold fragment symbols
Summary:
We used to use existing symbol binding while duplicating and renaming
cold fragment symbols. As a result, some of those were emitted with
global binding. This confuses gdb, and it starts treating those symbols
as additional entry points.

The fix is to always emit such symbols with a local binding. This also
means that we have to sort static symbol table before emission to make
sure local symbols precede all others.

(cherry picked from FBD14529265)
2019-03-19 13:46:21 -07:00
Maksim Panchenko 6bcb3389dd [BOLT] Place hot text mover functions into a separate section
Summary:
Create a separate pass for assigning functions to sections. Detect
functions originating from special sections (by default .stub and
.mover) and place them into ".text.mover" if the "-hot-text" option is
specified.

Cold functions are isolated from hot functions even when no function
re-ordering is specified.

(cherry picked from FBD14512628)
2019-03-15 13:43:36 -07:00
Maksim Panchenko 17cd2034f3 [BOLT] Fix debug line info emission
Summary:
GDB does not like it if the first entry in the line info table after
end_sequence entry is not marked with is_stmt. If this happens, it will
not print the correct line number information for such address. Note
that everything works fine starting with the first address marked
with is_stmt.

This could happen if the first instruction in the cold section wasn't
marked with is_stmt.

The fix is to always emit debug line info for the first instruction
in any function fragment with is_stmt flag.

(cherry picked from FBD14516629)
2019-03-18 19:22:26 -07:00
Maksim Panchenko 61ea19edf8 [BOLT][NFC] Fix compilation warnings
Summary: Get rid of warnings while building with Clang.

(cherry picked from FBD14495620)
2019-03-15 15:06:41 -07:00
Maksim Panchenko 0a55001a0e [BOLT] Fix -hot-functions-at-end option
Summary: Make "-hot-functions-at-end" option work again.

(cherry picked from FBD14476242)
2019-03-14 20:32:04 -07:00
Maksim Panchenko 163adbec9f [BOLT] Refactor allocatable sections rewrite part
Summary:
This refactoring makes it easier to create new code sections and control
code placement. As an example, cold code is being placed into
".text.cold" which is emitted independently from ".text", and the final
address assignment becomes more flexible.

Previously, in non-relocation mode we used to emit temporary section
name into .shstrtab. This resulted in unnecessary bloat of this section.

There was unnecessary padding emitted at the end of text section. After
fixing this, the output binary becomes smaller.

I had to change the way exception handling tables are re-written
as the current infra does not support cross-section label difference.
This means we have to emit absolute landing pad addresses, which might
not work for PIE binaries. I'm going to address this once I investigate
the current exception handling issues in PIEs.

This diff temporarily disables "-hot-functions-at-end" option.

(cherry picked from FBD14475693)
2019-03-14 18:51:05 -07:00
Maksim Panchenko a9e64947c5 [NFC][BOLT] Move ExecutableFileMemoryManager into its own file
(cherry picked from FBD14474800)
2019-03-14 18:49:40 -07:00
Rafael Auler c593563d1f Do not assert on addresses read from processIndirectBranch
Summary: As part of our heuristics to decode an indirect branch, if we
suspect the branch is an indirect tail call, we add its probable target
to the BC::InterproceduralReferences vector to detect functions with
more than one entry point. However, if this probable target is not in an
allocatable section, we were asserting. Remove this assertion and
change the code to conditionally store to InterproceduralReferences
instead. The probable target could be garbage at this point because
of analyzeIndirectBranch failing to identify the load instruction that
has the memory address of the target, so we should tolerate this.

(cherry picked from FBD14432821)
2019-03-12 16:36:35 -07:00
Maksim Panchenko 0c704eb75a [BOLT-HEATMAP] Initial heat map implementation
Summary:
Add heatmap subcommand to produce heatmaps based on perf.data with LBR.
The output is produced in colored ASCII format.

  llvm-bolt heatmap -p perf.data <executable>

    -block-size=<uint> - size of a heat map block in bytes (default 64)
    -line-size=<uint>  - number of entries per line (default 256)
    -max-address=<uint> - maximum address considered valid for heatmap
                          (default 4GB)
    -o=<string>        - heatmap output file (default stdout)
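
A minimal sketch of the bucketing implied by the options above, not the
actual implementation. The function name and the use of a Counter are
assumptions for illustration; only the default values come from the
option list.

```python
from collections import Counter

def bucket_samples(addresses, block_size=64, max_address=4 * 1024**3):
    """Count samples per heat map block (hypothetical helper).

    Addresses above max_address are treated as invalid and skipped,
    mirroring the intent of -max-address.
    """
    heat = Counter()
    for address in addresses:
        if address > max_address:
            continue
        # Every block_size bytes of the address space share one bucket.
        heat[address // block_size] += 1
    return heat
```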

(cherry picked from FBD13969992)
2019-02-05 15:28:19 -08:00
Maksim Panchenko ff6e21290f [BOLT] New inliner implementation
Summary:
Addresses correctness issues related to inlining.
Inlining heuristics are not part of this diff.

(cherry picked from FBD13796888)
2019-01-31 11:23:02 -08:00
Maksim Panchenko 365bd1f1c8 [BOLT] For non-simple functions always update jump tables in-place
Summary:
For a non-simple function we can miss a reference to a jump table or
to an indirect goto table. If we move the jump table, the missed
reference will not get updated, and the corresponding indirect jump
will end up in the old (wrong) location. Updating the original jump
table in-place should take care of the issue.

(cherry picked from FBD13849776)
2019-01-28 13:46:18 -08:00
Rafael Auler af81c7ff80 [perf2bolt] Add support for generating autofdo input
Summary:
Autofdo tools support.

(cherry picked from FBD13779026)
2019-01-22 17:21:45 -08:00
Maksim Panchenko c6ce2abb7d [perf2bolt] Optimize memory usage in perf2bolt
Summary:
While converting perf profile, we only need CFG for functions that were
profiled and can skip building CFG for the rest. This saves us some
processing time and memory.

Break down processing of perf.data into two steps. The first
step parses the data, saves it in intermediate format, and marks
functions with the profile. The second step attributes the profile to
functions with CFG. When we disassemble and build CFG for functions in
aggregate-only mode, we skip functions without the profile.

(cherry picked from FBD13706697)
2019-01-15 23:43:40 -08:00
Maksim Panchenko 2fe0c38d6b [perf2bolt] Better tracking of process forking
Summary:
Improve tracking of forked processes.

If a process corresponding to the input binary has forked/started
before 'perf record' was initiated, then the full name of the binary
will be recorded in a corresponding MMAP2 event. We've been handling
such cases well so far.

However, if the process was forked after 'perf record' has started, and
execve(2) wasn't called afterwards, then there will be no MMAP2 event
recorded corresponding to the mapping of the main binary (unrelated
MMAP2 events could still be recorded).

To track such cases, we need to parse 'perf script --show-task-events'
command output, and to scan for PERF_RECORD_FORK events, and then add
forked process PIDs to the list associated with the input binary. If
the fork event was followed by an exec event (PERF_RECORD_COMM exec)
of a different binary, then the forked PID should be ignored. If the
exec event was associated with our input binary, then the correct MMAP2
event was recorded and parsed.

To track if the event occurred before or after 'perf record', we parse
the event's time. This helps us to differentiate some events. E.g. the exec
event is only registered correctly if it happened after perf recording
has started (otherwise the "exec" part is missing), and thus we only
record forks with non-zero time stamps.

(cherry picked from FBD13250904)
2018-11-21 20:04:00 -08:00
Maksim Panchenko 067a385000 [BOLT] Add thresholds for function splitting
Summary:
Use newly added function size estimation to measure the effectiveness
and guide function splitting. Two new tuning options are added:

  -split-threshold=<uint>
    split function only if its main size is reduced by more than given
    amount of bytes. Default value: 0, i.e. split iff the size is reduced.
    Note that on some architectures the size can increase after splitting.
  -split-align-threshold=<uint>
    when deciding to split a function, apply this alignment while doing
    the size comparison (see -split-threshold). Default value: 2.
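
One way to read the two options together, as a sketch only. The function
names are hypothetical, and the exact comparison is an assumption based
on the option descriptions above.

```python
def align_up(size: int, alignment: int) -> int:
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment

def should_split(original_size: int, hot_size: int,
                 split_threshold: int = 0,
                 split_align_threshold: int = 2) -> bool:
    """Keep the split only if the aligned main (hot) part shrinks by more
    than split_threshold bytes compared to the aligned original size."""
    aligned_original = align_up(original_size, split_align_threshold)
    aligned_hot = align_up(hot_size, split_align_threshold)
    return aligned_original - aligned_hot > split_threshold
```

With the defaults, a function shrinking from 100 to 99 bytes is not split,
because both sizes align to 100.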

(cherry picked from FBD13136352)
2018-11-15 16:03:34 -08:00
Maksim Panchenko b0f7fddd35 [BOLT] Add method for better function size estimation
Summary:
Add BinaryContext::calculateEmittedSize() that ephemerally emits code
to allow precise estimation of the function size. Relaxation and
macro-op alignment adjustments are taken into account.

(cherry picked from FBD13092139)
2018-11-15 16:02:16 -08:00
Maksim Panchenko e1b8fade7f [BOLT] Add branch priority policy for blocks with 2 successors
Summary:
On x86 the difference between long and short jump instructions could be
either 4 or 3 bytes, depending if it's a conditional jump or not.
For a basic block with 2 jump instructions, if we know that one of
the successors is in a different code region, then we can make it
a target of an unconditional jump instruction. This will save 1 byte
in case the conditional jump happens to be a short one.
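
The arithmetic can be checked with the standard x86 encodings (2 bytes
for short jcc/jmp, 6 bytes for near jcc, 5 bytes for near jmp). The
helper below is a hypothetical illustration, not code from this diff.

```python
# Instruction sizes in bytes for x86 relative branches.
SIZES = {
    ("jcc", "short"): 2,  # 7x cb
    ("jcc", "long"): 6,   # 0F 8x cd
    ("jmp", "short"): 2,  # EB cb
    ("jmp", "long"): 5,   # E9 cd
}

def block_branch_bytes(far_branch: str) -> int:
    """Total size of the jcc + jmp pair when exactly one successor is in a
    different code region and is reached by far_branch ("jcc" or "jmp")."""
    near_branch = "jmp" if far_branch == "jcc" else "jcc"
    return SIZES[(far_branch, "long")] + SIZES[(near_branch, "short")]
```

Reaching the far successor with the unconditional jump costs 5 + 2 = 7
bytes, versus 6 + 2 = 8 bytes when the conditional jump is the long one.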

(cherry picked from FBD13078139)
2018-11-14 14:43:59 -08:00
Maksim Panchenko 40d9fcfdca [BOLT] Workaround for Clang de-virtualization bug
Summary:
When Clang is boot-strapped with (Thin)LTO, it may produce a code
fragment similar to below:

  .LFT663334 (6 instructions, align : 1)
    Predecessors: .LFT663333
      00000538:   movb    $0x1, %al
      0000053a:   movl    %eax, -0x2c(%rbp)
      0000053d:   movl    $"_ZN5clang6Parser12ConsumeParenEv/1", %ecx
      00000542:   testb   $0x1, %cl
      00000545:   movq    -0x40(%rbp), %r14
      00000549:   je      .Ltmp1071462
    Successors: .Ltmp1071462, .LFT663335

  .LFT663335 (2 instructions, align : 1)
    Predecessors: .LFT663334
      0000054b:   movq    (%r12), %rax
      0000054f:   movq    .Ltmp0(%rax), %rcx
    Successors: .Ltmp1071462

  .Ltmp1071462 (7 instructions, align : 1)
    Predecessors: .LFT663334, .LFT663335
      00000556:   movq    %r12, %rdi
      00000559:   callq   *%rcx
      .......

The code above is making a call by dereferencing a pointer to a member
function. A pointer to a member function could either be a regular
function, or a virtual function. To differentiate between the two, AMD64
ABI (originated from Itanium ABI) uses the last bit of the pointer. The
call instruction sequence varies depending if the function is virtual or
not, and the pointer's last bit is checked. If it's "1" then the value
of the pointer (minus 1) is used as an offset in the object vtable to
get the address of the function, otherwise the pointer is used directly
as a function address.
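
The dispatch described above can be modeled as follows. This is a sketch:
the vtable is represented as a dict from byte offset to function address,
which is an assumption made purely for illustration.

```python
def resolve_member_pointer(ptr: int, vtable: dict) -> int:
    """Return the call target for a pointer to a member function."""
    if ptr & 1:
        # Virtual: (ptr - 1) is an offset into the object's vtable.
        return vtable[ptr - 1]
    # Non-virtual: the pointer is the function address itself.
    return ptr
```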

In this specific case, a de-virtualization is taking place, but it's not
complete. Compiler knows that the member function pointer is actually a
non-virtual function _ZN5clang6Parser12ConsumeParenEv (aka
"clang::Parser::ConsumeParen()"). However, it keeps the (dead) code that
checks the last bit of _ZN5clang6Parser12ConsumeParenEv, and furthermore
keeps the code (unreachable/dead) to make a virtual call while using
(_ZN5clang6Parser12ConsumeParenEv - 1) as an offset into the vtable.
This is obviously wrong, but since the code is unreachable, it will
never affect the runtime correctness.

The value "_ZN5clang6Parser12ConsumeParenEv - 1" falls into a last byte
of a function preceding _ZN5clang6Parser12ConsumeParenEv, and BOLT
creates a label ".Ltmp0" pointing to this last byte that is referenced
by the instruction sequence above. It just happens that the last byte
is also in the middle of the last instruction, and as a result, BOLT
never emits the label, hence resulting in the error message "Undefined
temporary symbol".

The workaround is to detect non-pc-relative relocations from code
pointing to some (fptr - 1). Note that this is not completely
error-proof, but non-pc-relative references from code into the middle of
a function are quite rare, and chances that in a normal situation they
will point to a byte preceding some function address are virtually zero.

(cherry picked from FBD13030310)
2018-11-12 12:38:50 -08:00
Maksim Panchenko 30fd960951 [BOLT] Update local symbol count in symbol table
Summary:
Fix sh_info entry for symbol table section to reflect updated number of
local symbols.

(cherry picked from FBD10503216)
2018-10-22 18:48:12 -07:00
Maksim Panchenko a76b13d48e [perf2bolt] Pre-aggregate LBR samples
Summary: Pre-aggregating LBR data cuts perf2bolt processing times in half.

(cherry picked from FBD10420286)
2018-10-02 17:16:26 -07:00
Rafael Auler 74a71c6812 Fix bug in analyzeRelocation for GOT entries
Summary:
Special case GOT relocs to ignore addend subtracting
logic in analyzeRelocation, since the addend does not refer to the
target of the instruction being analyzed. Also make the code honor
the comments in the special case about zeroed out ExtractValue but
non-zero addend.
Fix facebookincubator/BOLT#40

(cherry picked from FBD10355019)
2018-10-11 18:12:09 -07:00
Facebook Github Bot b166ccbea8 [BOLT][PR] Fix compiler warnings in BinaryContext and RegAnalysis
Summary:
This pull request fixes two compiler warnings:

- missing `break;` in a switch-case statement in RegAnalysis.cpp (-Wimplicit-fallthrough warning)
- misleading indentation in BinaryContext.cpp (-Wmisleading-indentation warning)
Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/39
GitHub Author: Andreas Ziegler <andreas.ziegler@fau.de>

(cherry picked from FBD10202092)
2018-10-04 10:46:16 -07:00
Igor Sugak c3c80822a3 [BOLT] Capitalize i
Summary: as titled

(cherry picked from FBD10136655)
2018-10-01 16:22:46 -07:00
Igor Sugak cc2276d3f1 [BOLT] fix build with gcc-4.8.5
Summary: These are two minor changes to make it compatible with gcc-4.8.5

(cherry picked from FBD9884971)
2018-09-17 12:17:33 -07:00
Maksim Panchenko ce508b58c6 [BOLT] Support relocations without symbols
Summary:
lld may generate relocations without associated symbols. Instead of
rejecting binaries with such relocations, we can re-create the symbol
the relocation is against based on the extracted value.

(cherry picked from FBD10054576)
2018-09-21 12:00:20 -07:00
Rafael Auler bd0b99c45d [BOLT] Change stub-insertion pass for AArch64
Summary:
Previously, we were expanding eligible branches with stubs. After
expansion, we were computing which stubs were unnecessary and removing them,
assuming ranges were shortening as code is removed. The problem with this
approach is that for branches that refer to code that is not managed by
BOLT, the distance to that location can increase and we can end up with an
out-of-range branch.

This rewrites the pass to be simpler, only increasing size and expanding code
with stubs as needed after each iteration, stopping when code stops increasing.
Besides this rewrite, the stub-insertion pass now supports stubs grouping
similar to what the linker does, allowing different functions to share the
same veneer that jumps to a common callee. It also fixes a bug in the previous
implementation that, in very large functions that use TBZ/TBNZ (+-32KB range),
would mistakenly try to reuse a local stub BB that is out of range.

This includes a change to allow hot functions to be put at the end of the
.text section, closer to the heap, requiring no veneers to jump to JITted
code. And finally it enables eliminate veneers pass by default.

(cherry picked from FBD10023158)
2018-09-17 13:36:59 -07:00
Maksim Panchenko 1387a9d761 [BOLT] Keep .text section in file when using old text
Summary:
If we reuse text section under `-use-old-text` option, then there's no
need to rename it. Tools, such as perf, seem to not like binaries
without `.text`.

Additionally, check if the code fits into `.text` using the page
alignment, otherwise we were skipping the alignment relying on the user
detecting the warning message. This could have resulted in unexpected
performance drops.

Also add `-no-huge-pages` option to use regular page size for code
alignment purposes (i.e. 4KiB instead of 2MiB).

(cherry picked from FBD10024670)
2018-09-24 20:58:31 -07:00
Maksim Panchenko 53b72d0f2e [BOLT] Ignore symbols from non-allocatable sections
Summary:
While creating BinaryData objects we used to process all symbol table
entries. However, some symbols could belong to non-allocatable sections,
and thus we have to ignore them for the purpose of analyzing in-memory
data.

(cherry picked from FBD9666511)
2018-09-05 14:36:52 -07:00
Maksim Panchenko 8026760ac0 [BOLT] Fix another issue with profile after ICP
Summary:
For jump tables ICP was using profile from the jump table itself which
doesn't work correctly if the jump table is re-used at different code
locations.

(cherry picked from FBD9618774)
2018-08-30 13:21:50 -07:00
spupyrev 41ed5431a0 [BOLT] turning on the compact aligner by default
Summary: Making UseCompactAligner true by default

(cherry picked from FBD9325158)
2018-08-14 14:49:10 -07:00
Maksim Panchenko cd19f718b4 [BOLT] Merge jump table profile data
Summary:
While running ICF pass we have skipped merging profile data for jump
tables. We were only updating profile in the CFG. Fix that.

(cherry picked from FBD9595523)
2018-08-30 13:21:29 -07:00
Maksim Panchenko 69e6004a42 [perf2bolt] Fix processing of binaries with names over 15 chars long
Summary:
Do not truncate the binary name for comparison purposes as the binary
name we are getting from "perf script" is no longer truncated.

(cherry picked from FBD9596409)
2018-08-30 14:51:10 -07:00
Rafael Auler d0a80b0870 [BOLT] Change ForceRelocation behavior
Summary:
Only record address as addend if the target of the relocation
is the pseudo-symbol Zero.

(cherry picked from FBD9551543)
2018-08-28 18:15:13 -07:00
Maksim Panchenko 708a550084 [BOLT] Fix profile after ICP
Summary:
After optimizing a target of a jump table, ICP was not updating edge
counts corresponding to that target. As a result the edge could be left
hot and negatively influence the code layout.

(cherry picked from FBD9524396)
2018-08-23 22:47:46 -07:00
Maksim Panchenko 2511b09985 [BOLT][DWARF] Fix line info for empty CU DIEs
Summary:
In some rare cases a compiler may generate DWARF that contains an empty
CU DIE that references a debug line fragment. That fragment will contain
no file name information, and we fail to register it. Then, as a result,
DW_AT_stmt_list is not updated for the CU. This may cause some
DWARF-processing tools to segfault.

As a solution/workaround, we register "<unknown>" file name for such
debug line tables.

(cherry picked from FBD9526705)
2018-08-27 20:12:59 -07:00
Rafael Auler a7e0704be6 [BOLT] Reduce AArch64 target feature flags
Summary:
Eliminate some flags that are not recognized and
are currently printing warnings when BOLT runs on AArch64.

(cherry picked from FBD9499971)
2018-08-24 10:42:00 -07:00
Rafael Auler af1177d99f [BOLT] Add mattr options to AArch64 target
Summary:
Make the AArch64 subtarget enable all features, so the disassembler
won't choke on extension instructions.

(cherry picked from FBD9477066)
2018-08-22 18:47:39 -07:00
Rafael Auler 9c4fcafa37 [BOLT] Add update-build-id option, on by default
Summary:
The build-id is used by tools to uniquely identify binaries. Update
the output binary build-id with a different number to make it
distinguishable from the input binary. This implementation just flips
the last build-id bit.
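
A sketch of the bit flip, assuming "last bit" means the least significant
bit of the final build-id byte. The helper name is hypothetical.

```python
def flip_last_build_id_bit(build_id: bytes) -> bytes:
    """Return a copy of build_id with the lowest bit of its last byte flipped."""
    if not build_id:
        raise ValueError("empty build-id")
    return build_id[:-1] + bytes([build_id[-1] ^ 1])
```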

(cherry picked from FBD9235336)
2018-08-08 17:55:24 -07:00
Rafael Auler 510a8c4bbe [BOLT] Fix shrink-wrapping CFI update
Summary:
When updating CFI for a function that was optimized by
shrink-wrapping, if the function had no frame pointers, the CFI update
algorithm was incorrect.

(cherry picked from FBD9328658)
2018-08-14 17:32:06 -07:00
Maksim Panchenko 88bb145164 [BOLT] Update allocatable relocation sections
Summary:
Position-independent binaries may have runtime relocations of type
R_X86_64_RELATIVE that need an update if they were pointing to one of
the functions that we have relocated.

(cherry picked from FBD9374164)
2018-08-16 16:53:14 -07:00
Maksim Panchenko 87788ca926 [perf2bolt] Support profiling of PIEs and .so's
Summary:
Processing profile data for binaries with flexible load address (such as
position-independent executables and shared objects) requires adjusting
binary addresses depending on the base load address.

For every PID the mapping will be more or less unique when executing
with ASLR enabled, thus we have to keep the mapping record for all PIDs
associated with the binary. Then we adjust the addresses based on those
mappings.
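
A sketch of the per-PID adjustment, under the assumption that each PID
maps to a single (base, size) pair for the binary. The function and
parameter names are hypothetical.

```python
def to_binary_address(pid, runtime_address, mappings, link_base=0):
    """Translate a sampled runtime address into an address in the binary.

    mappings: dict of pid -> (load_base, mapping_size) collected from the
    profile. Returns None if the sample does not belong to the binary.
    """
    mapping = mappings.get(pid)
    if mapping is None:
        return None
    load_base, size = mapping
    if not (load_base <= runtime_address < load_base + size):
        return None
    # Undo the per-process load offset chosen under ASLR.
    return runtime_address - load_base + link_base
```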

(cherry picked from FBD9368566)
2018-08-14 13:24:44 -07:00
Maksim Panchenko 560c23411a [perf2bolt] Use mmap events for PID collection
Summary:
Switch from using `perf script --show-task-events` to
`perf script --show-mmap-events` for associating a binary with PIDs in
perf.data. The output of the former command does not provide enough
information for PIE/.so processing.

(cherry picked from FBD9346586)
2018-08-14 13:24:44 -07:00
Rafael Auler b10d4724c3 [BOLT] Fix pseudo calculation in BinaryBasicBlock
Summary:
A recent commit broke our tests because it was depending on
getNumNonPseudos() at a very late stage of our optimization pipeline.
The problem was in an instruction deletion member function in
BinaryBasicBlock that was not updating the number of pseudos after
deletion. Fix this.

(cherry picked from FBD9305972)
2018-08-13 14:36:38 -07:00
Laith Saed Sakka b2382dc552 retpoline insertion : further updates.
Summary:
Couple of updates:

1) Handle address pattern with segment register.
2) Assume R11 available for PLT calls always.
3) Add CFI state to each BB.
4) early exit getMacroOpFusionPair if Instruction.size() <2.

(cherry picked from FBD9172426)
2018-08-03 16:36:06 -07:00
Maksim Panchenko c35dc2a386 [BOLT] Detect and handle fixed indirect branches
Summary:
Sometimes GCC can generate code where one of jump table entries
is being used by an indirect branch with a fixed memory reference,
such as "jmp *(JT+8)". If we don't convert such branches to direct ones
and move jump tables, then the indirect branch will reference the old
table value and will end up at the non-updated destination, possibly
causing a runtime crash.

This fix converts such indirect branches into direct ones.

For now we mark functions containing indirect branches with fixed
destination as non-simple to prevent unreachable code elimination
problem triggered by related dead/unreachable jump table.

(cherry picked from FBD9192363)
2018-08-06 11:22:45 -07:00
Laith Saed Sakka 06e1554158 Retpoline Insertion Pass
Summary:
Retpoline insertion implemented for reloc mode.

(cherry picked from FBD8832838)
2018-07-25 19:07:41 -07:00
Maksim Panchenko 39f6fcc947 [BOLT] Add support for IFUNC
Summary:
Relocation value verification was failing for IFUNC as the real value
used for relocation wasn't the symbol value, but a corresponding PLT
entry.

Relax the verification and skip any symbols of ST_Other type.

(cherry picked from FBD9123741)
2018-07-30 10:29:47 -07:00
Maksim Panchenko df94786119 [BOLT] Fix range checks
Summary:
containsRange() functions were incorrectly checking for an empty range
at the end of containing object. I.e. [a,b) was reporting true for
containing [b,b).
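
The intended semantics can be sketched as below. Treating an empty range
as contained only when its address lies inside the object is an
assumption, and the function name is illustrative.

```python
def contains_range(start: int, end: int, address: int, size: int) -> bool:
    """Check whether [address, address + size) lies inside [start, end)."""
    if size == 0:
        # An empty range must still start inside the object, so the
        # one-past-the-end range [end, end) is rejected.
        return start <= address < end
    return start <= address and address + size <= end
```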

(cherry picked from FBD9074643)
2018-07-30 16:30:18 -07:00
Maksim Panchenko fe9f8219fa [BOLT] Fix TBSS-related issue
Summary:
The TLS segment provides a template for initializing thread-local storage
for every new thread. It consists of initialized and uninitialized
parts. The uninitialized part of TLS, .tbss, is completely meaningless
from a binary analysis perspective. It doesn't take any space in the
file, or in memory. Note that this is different from a regular .bss
section that takes space in memory.

We should not place .tbss into a list of allocatable sections, otherwise
it may cause conflicts with objects contained in the next section.

(cherry picked from FBD9074056)
2018-07-30 16:30:18 -07:00
Maksim Panchenko 771d976543 [BOLT][NFC] Minor code refactoring
(cherry picked from FBD8882632)
2018-07-12 10:13:03 -07:00
Maksim Panchenko 49920a8fad [BOLT] Add R_X86_64_PC64 relocation support
(cherry picked from FBD8980691)
2018-07-24 14:30:16 -07:00
spupyrev 631da736b0 [BOLT] further speeding up cache+
Summary:
For large binaries, cache+ algorithm adds a noticeable overhead in
comparison with cache. This modification restricts search space of the
optimization, which makes cache+ as fast as cache for all tested binaries.

There is a tiny (in the order of 0.01%) regression in cache-related metrics,
but this is not noticeable in practice.

(cherry picked from FBD8369968)
2018-05-17 18:27:13 -07:00
Rafael Auler ddfcf4f266 [BOLT] Add parser for pre-aggregated perf data
Summary:
The regular perf2bolt aggregation job is to read perf output directly.
However, if the data is coming from a database instead of perf, one
could write a query to produce a pre-aggregated file. This function
deals with this case.

The pre-aggregated file contains aggregated LBR data, but without binary
knowledge. BOLT will parse it and, using information from the
disassembled binary, augment it with fall-through edge frequency
information. After this step is finished, this data can be either
written to disk to be consumed by BOLT later, or can be used by BOLT
immediately if kept in memory.

File format syntax:
{B|F|f} [<start_id>:]<start_offset> [<end_id>:]<end_offset> <count>
[<mispred_count>]

B - indicates an aggregated branch
F - an aggregated fall-through (trace)
f - an aggregated fall-through with external origin - used to disambiguate
between a return hitting a basic block head and a regular internal
jump to the block

<start_id> - build id of the object containing the start address. We can
skip it for the main binary and use "X" for an unknown object. This will
save some space and facilitate human parsing.

<start_offset> - hex offset from the object base load address (0 for the
main executable unless it's PIE) to the start address.

<end_id>, <end_offset> - same for the end address.

<count> - total aggregated count of the branch or a fall-through.

<mispred_count> - the number of times the branch was mispredicted.
Omitted for fall-throughs.

Example
F 41be50 41be50 3
F 41be90 41be90 4
f 41be90 41be90 7
B 4b1942 39b57f0 3 0
B 4b196f 4b19e0 2 0
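
A minimal parser sketch for the syntax above. The type and function
names are hypothetical, and reading the counts as decimal is an
assumption (the format only states that offsets are hex).

```python
from typing import NamedTuple, Optional

class Location(NamedTuple):
    build_id: Optional[str]  # None when the id is omitted (main binary)
    offset: int

class Record(NamedTuple):
    kind: str                # "B", "F" or "f"
    start: Location
    end: Location
    count: int
    mispreds: Optional[int]  # None when omitted

def parse_location(text: str) -> Location:
    """Parse [<id>:]<hex_offset>."""
    build_id, separator, offset = text.rpartition(":")
    return Location(build_id if separator else None, int(offset, 16))

def parse_record(line: str) -> Record:
    fields = line.split()
    if len(fields) not in (4, 5) or fields[0] not in ("B", "F", "f"):
        raise ValueError(f"malformed record: {line!r}")
    kind = fields[0]
    if kind != "B" and len(fields) == 5:
        raise ValueError(f"mispredictions given for a fall-through: {line!r}")
    mispreds = int(fields[4]) if len(fields) == 5 else None
    return Record(kind, parse_location(fields[1]), parse_location(fields[2]),
                  int(fields[3]), mispreds)
```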

(cherry picked from FBD8887182)
2018-07-17 18:31:46 -07:00
Laith Saed Sakka 27f3032447 Add initial function injection support
Summary:
This diff has the API needed to inject functions using BOLT.
In relocation mode, injected functions are emitted between the cold and the hot functions.
In non-reloc mode, injected functions are emitted in the next text section.

(cherry picked from FBD8715965)
2018-07-08 12:14:08 -07:00
Maksim Panchenko 6e45f5aeec [perf2bolt] Enforce file matching in perf2bolt
Summary:
If the input binary does not have a build-id and the name does not match
any file names in perf.data, then reject the binary, and issue an error
message suggesting to rename it to one of the listed names from
perf.data.

(cherry picked from FBD8846181)
2018-07-13 15:26:41 -07:00
Maksim Panchenko f2f164f474 [perf2bolt] Fix perf build-id matching
Summary:
Recent compiler tool chains can produce build-ids that are less than 40
characters long. Linux perf, however, always outputs 40 characters,
expanding the string with 0's as needed. Fix the matching by only
checking the string prefix.
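
A sketch of the prefix check. The function name is hypothetical, and
rejecting an empty build-id is an assumption.

```python
def build_ids_match(binary_build_id: str, perf_build_id: str) -> bool:
    """Compare a binary's build-id with the 40-character string from perf.

    perf pads short ids with trailing 0's, so only the prefix is compared.
    """
    if not binary_build_id:
        return False
    return perf_build_id.lower().startswith(binary_build_id.lower())
```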

(cherry picked from FBD8839452)
2018-07-13 10:49:41 -07:00
Rafael Auler 7aee0adbf9 [BOLT-AArch64] Create cold symbols on demand
Summary:
Rework the logic we use for managing references to constant
islands. Defer the creation of the cold versions to when we split the
function and will need them.

(cherry picked from FBD8228803)
2018-05-31 10:33:53 -07:00
Maksim Panchenko 44a36937f8 [BOLT] Fix llvm-dwarfdump issues
Summary:
llvm-dwarfdump is relying on getRelocatedSection() to return
section_end() for ELF files of types other than relocatable objects.
We've changed the function to return relocatable section for other
types of ELF files. As a result, llvm-dwarfdump started re-processing
relocations for sections that already had relocations applied, e.g. in
executable files, and this resulted in wrong values reported.

As a workaround/solution, we make this function return relocated section
for executable (and any non-relocatable objects) files only if the
section is allocatable.

(cherry picked from FBD8760175)
2018-07-06 21:30:23 -07:00
Maksim Panchenko 66e0313d15 [perf2bolt] Accept `-` as a valid misprediction symbol
Summary:
As reported in GH-28 `perf` can produce `-` symbol for misprediction bit
if the bit is not supported by the kernel/HW. In this case we can ignore
the bit.

(cherry picked from FBD8786827)
2018-07-10 10:25:55 -07:00
Rafael Auler 12380b8b06 Fix assembly after adding entry points
Summary:
When a given function B, located after function A, references
one of A's basic blocks, it registers a new global symbol at the
reference address and updates A's Labels vector via
BinaryFunction::addEntryPoint(). However, we don't update A's branch
targets at this point. So we end up with an inconsistent CFG, where the
basic block names are global symbols, but the internal branch operands
are still referencing the old local name of the corresponding blocks
that got promoted to an entry point. This patch fixes this by detecting
this situation in addEntryPoint and iterating over all instructions,
looking for references to the old symbol and replacing them to use the
new global symbol (since this is now an entry point).

Fixes facebookincubator/BOLT#26

(cherry picked from FBD8728407)
2018-07-03 11:57:46 -07:00
Rafael Auler 544d1577c1 Avoid removing BBs referenced by JTs
Summary:
While removing unreachable blocks, we may decide to remove a
block that is listed as a target in a jump table entry. If we do that,
this label will be then undefined and LLVM assembler will crash.
Mitigate this for now by not removing such blocks, as we don't support
removing unnecessary jump tables yet.

Fixes facebookincubator/BOLT#20

(cherry picked from FBD8730269)
2018-07-03 17:02:33 -07:00
Laith Saed Sakka b6c4d8e924 -- Adding Veneer elimination pass and Veneer count to dyno stats.
Summary: Create a pass that performs veneer elimination.

(cherry picked from FBD8359299)
2018-06-07 11:10:37 -07:00
Maksim Panchenko 207ac19c63 Revert "[LongJumpPass] X86 enablement. First attempt."
This reverts commit 010b0f7603fc9fa209c6dc95ce4b9c08e7b70d75.

(cherry picked from FBD28111178)
2018-07-06 14:54:53 -07:00
Puyan Lotfi 64c429da89 [LongJumpPass] X86 enablement. First attempt.
(cherry picked from FBD8753328)
2018-07-06 12:31:36 -07:00
Maksim Panchenko b447979b8c [BOLT] Fix diagnostics printing in data aggregator
Summary: Print correct part of the string while reporting an error.

(cherry picked from FBD8745329)
2018-07-05 20:47:38 -07:00
Maksim Panchenko d7b2474f83 [DebugInfo] Change default value of FDEPointerEncoding
Summary:
If the encoding is not specified in CIE augmentation string, then it
should be DW_EH_PE_absptr instead of DW_EH_PE_omit.

(cherry picked from FBD8740274)
2018-07-05 14:21:49 -07:00
Maksim Panchenko 365613b404 [BOLT] Fix no-assertions build
Summary:
In release build without assertions MCInst::dump() is undefined and
causes link time failure.

Fixes facebookincubator/BOLT#27.

(cherry picked from FBD8732905)
2018-07-04 10:33:26 -07:00
Maksim Panchenko a6a37995d9 [BOLT] Reject processing of PIE binaries
Summary:
Check the input binary ELF type. Reject any binary not of
ET_EXEC type, including position-independent executables (PIEs).

Also print the first function containing PIC jump table.

(cherry picked from FBD8707274)
2018-06-29 21:12:55 -07:00
Maksim Panchenko edc0cb1121 [LLVM] Accept `S` in augmentation strings in CIE
Summary:
Ignore 'S' in augmentation string on input. It just marks a signal
frame. All we have to do is propagate it.

Fixes facebookincubator/BOLT#21

This was already in LLVM trunk rL331738. Update llvm.patch.

(cherry picked from FBD8707222)
2018-06-29 20:30:36 -07:00
Maksim Panchenko 6802948028 [BOLT] Allow jump tables with 2 entries
Summary:
GCC 8 can generate jump tables with just 2 entries. Modify our heuristic
to accept it. We still assert that there's more than one entry.

(cherry picked from FBD8709416)
2018-06-30 13:30:47 -07:00
Rafael Auler 8835f90d1e [X86] Support a subset of internal calls
Summary:
Add support for functions with internal calls, necessary for
handling Intel MKL library and some code observed in google core dumper
library.

This is not optimizing these functions, but only identifying them,
running analyses to assure we will not break those functions if we move
them, and then "freezing" these functions (marking as not simple so Bolt
will not try to reorder it or touch it in any way).

(cherry picked from FBD8364381)
2018-06-11 13:18:44 -07:00
Facebook Github Bot 07353e9590 [BOLT][PR] In some cases DB could be nullptr
Summary:
When processing binary with -debug mode in some cases, BD could be nullptr. It will be better to fail later on assert than here with segfault.
Closes https://github.com/facebookincubator/BOLT/pull/18
GitHub Author: Alexander Gryanko <xpahos@gmail.com>

(cherry picked from FBD8650719)
2018-06-26 17:02:00 -07:00
Rafael Auler 72ecd12f2f Disable -split-eh in non-relocation mode
Summary:
This option only works in relocation mode. In non-relocation
mode, it generates invalid references that cause MCStreamer to fail.
Disable this flag if the user requested it and print a warning.

(cherry picked from FBD8625990)
2018-06-25 14:55:48 -07:00
Rafael Auler 5b2eab6538 [BOLT] Fix call to evaluateX86MemOperands
Summary:
There was a call site not providing a displacement immediate
value. This assertion is firing in opensource.

(cherry picked from FBD8576033)
2018-06-21 11:03:57 -07:00
Rafael Auler 8f717dd25e [BOLT] Add initial bolt-only test infra
Summary:
Create folders and setup to make LIT run BOLT-only tests. Add
a test example. This will add a new make/ninja rule "check-bolt" that
the user can invoke to run LIT on this folder.

(cherry picked from FBD8595786)
2018-06-22 13:50:07 -07:00
Maksim Panchenko 1baa2529ea [merge-fdata] Support legacy/non-YAML profile format
Summary: Concatenate profile contents if they are not in YAML format.

(cherry picked from FBD8579955)
2018-06-21 14:45:38 -07:00
Maksim Panchenko 3ab2929b36 [BOLT] Fix support for PIC jump tables
Summary:
BOLT heuristics failed to work if false PIC jump table entries were
accepted when they were pointing inside a function, but not at
an instruction boundary.

This fix checks if the destination falls at instruction boundary, and
if it does not, it truncates the jump table. This, of course, still does not
guarantee that the entry corresponds to a real destination, and we can
have "false positive" entry(ies). However, it shouldn't affect
correctness of the function, but the CFG may have edges that are never
taken. We may update an incorrect jump table entry, corresponding to an
unrelated data, and for that reason we force moving of jump tables if a
PIC jump table was detected.
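
The truncation rule can be sketched as follows. The names are
hypothetical, and modeling instruction boundaries as a set of offsets is
an assumption for illustration.

```python
def truncate_jump_table(entries, instruction_offsets):
    """Keep jump table entries up to the first one that does not land on
    an instruction boundary; everything from that entry on is dropped."""
    kept = []
    for target in entries:
        if target not in instruction_offsets:
            break
        kept.append(target)
    return kept
```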

(cherry picked from FBD8559588)
2018-06-20 21:43:22 -07:00
Rafael Auler 35c09dc4dd [BOLT] Add a user friendly error reporting message
Summary:
In case we fail to disassemble or to build the CFG for a
function, print instructions on bug reporting.

(cherry picked from FBD8549737)
2018-06-20 12:03:24 -07:00
Maksim Panchenko 221107c5fb [BOLT] Update llvm.patch
Summary:

(cherry picked from FBD8475998)
2018-06-17 22:29:27 -07:00
Maksim Panchenko a7d025139f Revert "[Bolt][NFC] Change capitalization s/BOLT/Bolt/g"
Summary:

(cherry picked from FBD8431879)
2018-06-14 14:27:20 -07:00
Maksim Panchenko 789162276d [Bolt][NFC] Change capitalization s/BOLT/Bolt/g
(cherry picked from FBD8373789)
2018-06-11 19:46:40 -07:00
Maksim Panchenko 232046f9b2 [Bolt] Reduce verbosity while reporting hash collisions
Summary:
Don't report all data objects with hash collisions by default. Only
report the summary, and use -v=1 for providing the full list.

(cherry picked from FBD8372241)
2018-06-11 17:17:25 -07:00
Bill Nell 706abb6c95 [BOLT] Hash anonymous symbol names
Summary:
This diff replaces the addresses in all the {SYMBOLat,HOLEat,DATAat} symbols with hash values based on the data contained in the symbol.  It should make the profiling data for anonymous symbols robust to address changes.

The only small problem with this approach is that the hashed names for padding symbols of the same size collide frequently.  This shouldn't be a big deal since it would be weird if those symbols were hot.

On a test run with hhvm there were 26 collisions (out of ~338k symbols).  Most of the collisions were from small (2,4,8 byte) objects.
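
The naming scheme can be sketched as follows. The hash function (FNV-1a), the 16-digit hex suffix, and the helper names are assumptions for illustration; the commit does not say which hash or name format BOLT uses.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// FNV-1a over the symbol contents (assumed hash, chosen for brevity).
uint64_t hashContents(const std::vector<uint8_t> &Data) {
  uint64_t Hash = 0xcbf29ce484222325ULL; // FNV-1a 64-bit offset basis
  for (uint8_t Byte : Data) {
    Hash ^= Byte;
    Hash *= 0x100000001b3ULL; // FNV-1a 64-bit prime
  }
  return Hash;
}

// Build a name such as "DATAat<hash>" in place of "DATAat<address>".
std::string anonymousName(const std::string &Prefix,
                          const std::vector<uint8_t> &Data) {
  assert(!Prefix.empty());
  char Buf[17];
  std::snprintf(Buf, sizeof(Buf), "%016llx",
                static_cast<unsigned long long>(hashContents(Data)));
  return Prefix + Buf;
}
```

Two padding objects of the same size hold the same bytes, so they receive the same name under any content-based hash, which matches the collisions reported above.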

(cherry picked from FBD7134261)
2018-06-06 03:17:32 -07:00
spupyrev 779541283a [BOLT] merging cold basic blocks to reduce #jumps
Summary:
This diff introduces a modification of cache+ block ordering algorithm,
which reorders and merges cold blocks in a function with the goal of reducing
the number of (non-fallthrough) jumps, and thus, the code size.

(cherry picked from FBD8044978)
2018-05-17 11:14:15 -07:00
Maksim Panchenko b4dbd35d6c [BOLT] Initial support for memcpy() inlining
Summary:
Add "-inline-memcpy" option to inline calls to memcpy() using
"rep movsb" instruction. The pass is X86-specific.

Calls to _memcpy8 are optimized too using a special return value
(dest+size).

The implementation is very primitive in that it does not track liveness
of %rax after return and performs no %rcx substitution. This is going to
get improved if we find the optimization to be useful.

(cherry picked from FBD8211890)
2018-05-26 12:40:51 -07:00
Rafael Auler 42e6512241 [BOLT-AArch64] Detect linker stubs and address them
Summary:
In AArch64, when the binary gets large, the linker inserts
stubs with 3 instructions: ADRP to load the PC-relative address of
a page; ADD to add the offset within the page; and a branch instruction
to do an indirect jump to the contents of X16 (the linker-reserved
reg). The problem is that the linker does not issue a relocation for
this (since this is not code coming from the assembler), so BOLT has
no idea what the real target is, unless it recognizes these instructions
and extracts the target by combining the operands of the instructions
from the stub. This diff does exactly that.
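
As an illustration of combining the stub operands, the sketch below decodes the three raw instruction words directly. It is not BOLT's implementation (which works on MCInst operands); the function name `stubTarget` is hypothetical, and the stub is assumed to use an unshifted 64-bit ADD immediate.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Decode "ADRP x16, page; ADD x16, x16, #off; BR x16" and return the
// absolute target, or std::nullopt if the words do not match that pattern.
// `PC` is the address of the ADRP instruction.
std::optional<uint64_t> stubTarget(uint64_t PC, uint32_t Adrp, uint32_t Add,
                                   uint32_t Br) {
  assert(PC % 4 == 0 && "AArch64 instructions are 4-byte aligned");
  // ADRP x16: bit 31 set, bits 28:24 = 0b10000, Rd = 16.
  if ((Adrp & 0x9F00001Fu) != 0x90000010u)
    return std::nullopt;
  // ADD x16, x16, #imm12 (64-bit, no shift).
  if ((Add & 0xFFC003FFu) != 0x91000210u)
    return std::nullopt;
  // BR x16.
  if (Br != 0xD61F0200u)
    return std::nullopt;

  // imm21 = immhi (bits 23:5) : immlo (bits 30:29), sign-extended.
  int64_t Imm = (((Adrp >> 5) & 0x7FFFF) << 2) | ((Adrp >> 29) & 0x3);
  if (Imm & (1 << 20))
    Imm -= (1 << 21);
  // ADRP yields the 4 KiB page of PC plus imm21 pages.
  uint64_t Page = (PC & ~0xFFFULL) + static_cast<uint64_t>(Imm * 4096);
  uint64_t Offset = (Add >> 10) & 0xFFF; // imm12 of the ADD
  return Page + Offset;
}
```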

(cherry picked from FBD7882653)
2018-04-30 14:47:32 -07:00
Maksim Panchenko 929b0908f7 [BOLT][NFC] Move ICF pass into a separate file
Summary:
Consolidate code used by identical code folding under
Passes/IdenticalCodeFolding.cpp.

(cherry picked from FBD8109916)
2018-05-22 15:52:21 -07:00
Maksim Panchenko 6302e18f94 [PERF2BOLT] Improve file matching
Summary:
If the input binary for perf2bolt has a build-id and perf data has
recorded build-ids, then try to match them. Adjust the file name if
build-ids match to cover cases where the binary was renamed after data
collection. If there's no matching build-id report an error and exit.

While scanning task events, truncate the name to 15 characters prior to
matching, since that's how names are reported by perf.

(cherry picked from FBD8034436)
2018-05-16 13:31:13 -07:00
Maksim Panchenko 13968f7fa9 [BOLT] Add option to print functions with bad layout
Summary:
Option `-report-bad-layout=N` prints top N functions with layouts
that have cold blocks placed in the middle of hot blocks. The sorting is
based on execution_count / number_of_basic_blocks formula.

(cherry picked from FBD8051950)
2018-05-17 16:58:29 -07:00
Maksim Panchenko 3af3537383 [BOLT] Properly handle non-standard function refs
Summary:
Application code can reference functions in a non-standard way, e.g.
using arithmetic and bitmask operations on them. One example is if a
program checks if a function is below a certain address or within
a certain address range to perform a low-level optimization or generate
proper code (JIT).

Instead of relying on a relocation value (symbol+addend), we use only
the symbol value, and then check if the value is inside the function.
If it is, we treat it as a code reference against location within the
function, otherwise we handle it as a non-standard function reference
and issue a warning.

(cherry picked from FBD7996274)
2018-05-14 11:10:26 -07:00
Maksim Panchenko 1750fee2ac [BOLT] Add option to ignore function hash in profile
Summary:
When we make changes to MCInst opcodes (or get changes from upstream),
a hash value for BinaryFunction changes. As a result, we are unable
to match profile generated by a previous version of BOLT.

Add option `-profile-ignore-hash` to match profile while ignoring
function hash value. With this option we match functions with common
names using the number of basic blocks.

(cherry picked from FBD7983269)
2018-05-11 18:30:47 -07:00
Maksim Panchenko 56b38a14c5 [BOLT] Fix dyno-stats for PLT calls
Summary:
To accurately account for PLT optimization, each PLT call should be
counted as an extra indirect call instruction, which in turn is
a load, a call, an indirect call, and an instruction entry in dyno stats.

(cherry picked from FBD7978980)
2018-05-11 15:30:56 -07:00
spupyrev e4f39bda51 adjusting cache stats for non-simple functions
Summary:
While working with a binary in non-relocations mode, I realized
some cache metrics are not computed correctly. Hence, this fix.
In addition, logging the number of functions with modified ordering of
basic blocks, which is helpful for analysis.

(cherry picked from FBD7975392)
2018-05-11 12:03:19 -07:00
Bill Nell 729da2da22 [BOLT] Static data reordering pass.
Summary:
Enable BOLT to reorder data sections in a binary based on memory
profiling data.

This diff adds a new pass to BOLT that can reorder data sections for
better locality based on memory profiling data.  For now, the algorithm
to order data is primitive and just relies on the frequency of loads to
order the contents of a section.  We could probably do a lot better by
looking at what functions use the hot data and grouping together hot
data that is used by a single function (or cluster of functions).
Block ordering might give some hints on how to order the data better as
well.

The new pass has two basic modes: inplace and split (when inplace is
false).  The default is split since inplace hasn't really been tested
much.  When splitting is on, the cold data is copied to a "cold" version
of the section while the hot data is kept in the original section, e.g.
for .rodata, .rodata will contain the hot data and .bolt.org.rodata will
contain the cold bits.  In inplace mode, the section contents are
reordered inplace.  In either mode, all relocations to data within that
section are updated to reflect new data locations.

Things to improve:
- The current algorithm is really dumb and doesn't seem to lead to any
  wins.  It certainly could use some improvement.
- Private symbols can have data that leaks over to an adjacent symbol,
  e.g. a string that has a common suffix can start in one symbol and
  leak over (with the common suffix) into the next.  For now, we punt on
  adjacent private symbols.
- Handle ambiguous relocations better.  Section relocations that point
  to the boundary of two symbols will prevent the adjacent symbols from
  being moved because we can't tell which symbol the relocation is for.
- Handle jump tables.  Right now jump table support must be basic if
  data reordering is enabled.
- Being able to handle TLS.  A good amount of data access in some
  binaries is happening in TLS. It would be worthwhile to be able to
  reorder any TLS sections too.
- Handle sections with writeable data.  This hasn't been tested so
  probably won't work.  We could try to prevent false sharing in
  writeable sections as well.
- A pie in the sky goal would be to use DWARF info to reorder types.

(cherry picked from FBD6792876)
2018-04-20 20:03:31 -07:00
Maksim Panchenko bdf21f7617 [BOLT] Align basic blocks based on execution count
Summary:
The default is not changing, i.e. we are not aligning code within a
function by default.

New meaning of options for aligning basic blocks:

  -align-blocks
      triggers basic block alignment based on profile

  -preserve-blocks-alignment
      tries to preserve basic block alignment seen on input

Tuning options for "-align-blocks":
  -align-blocks-min-size=<uint>
      blocks smaller than the specified size wouldn't be aligned

  -align-blocks-threshold=<uint>
      align only blocks with frequency larger than containing function
      execution frequency specified in percent. E.g. 1000 means aligning
      blocks that are 10 times more frequently executed than the containing
      function.
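
One way to read the two tuning options is sketched below. The function name and parameter names are hypothetical; only the percent semantics (1000 means ten times the function's execution count) and the minimum-size cutoff come from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Decide whether a block qualifies for alignment under -align-blocks.
// ThresholdPercent mirrors -align-blocks-threshold and MinSize mirrors
// -align-blocks-min-size.
bool shouldAlignBlock(uint64_t BlockCount, uint64_t FunctionCount,
                      uint64_t BlockSize, uint64_t MinSize,
                      uint64_t ThresholdPercent) {
  constexpr uint64_t Max = std::numeric_limits<uint64_t>::max();
  // This sketch does not handle counts large enough to overflow.
  assert(BlockCount <= Max / 100);
  assert(ThresholdPercent == 0 || FunctionCount <= Max / ThresholdPercent);
  if (BlockSize < MinSize)
    return false; // too small to be worth aligning
  // BlockCount / FunctionCount > ThresholdPercent / 100, without division.
  return BlockCount * 100 > FunctionCount * ThresholdPercent;
}
```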

(cherry picked from FBD7921980)
2017-11-07 15:42:28 -08:00
Maksim Panchenko 9c6f965616 [BOLT] Getting open-source ready
Summary:
BOLT sources are being moved under tools/llvm-bolt/src
and tools/llvm-bolt will contain more files such as LICENSE.txt,
README.txt, etc.

Remove trailing white spaces from our sources.

Create llvm.patch by running

  > git diff f137ed238db11440f03083b1c88b7ffc0f4af65e include lib > \
    tools/llvm-bolt/llvm.patch

README.txt has instructions on checking out sources and applying the
patch.

(cherry picked from FBD7878380)
2018-05-04 10:10:41 -07:00
Maksim Panchenko caad4bcf3a [BOLT] Fix crash while writing new profile
Summary:
New profile writer was crashing as functions were lacking profile
flags. Fix it by requiring flags when marking function as profiled.

Generate new profile for clang. The new profile has more coverage and
results in better overall improvement from BOLT. It was generated by
merging multiple runs of:

% perf record -e cycles:u -j any,u -F32000 -- \
    ./clang bf.cpp -O2 -std=c++11 -c -o /tmp/bf.o

(cherry picked from FBD7798580)
2018-04-27 14:16:42 -07:00
Rafael Auler d6003e94eb [BOLT-AArch64] Fix -icf, -use-old-text and -update-debug-sections
Summary:
Refactor MCInst comparison code to support target-dependent
functionality. This was necessary because AArch64 uses MCTargetExprs
that only the AArch64 backend knows how to unpack and compare. Also
fix a bug where a relocation against a constant island would make BOLT
create a fixed reference against a code location in a similar way to
read-only data, so when we asked to -use-old-text, the code would break
for this particular HHVM function
(_ZN5folly2io4zlib18defaultZlibOptionsEv) because the reference now
contains invalid data, since the original .text was overwritten. Finally,
fix a bug with -update-debug-sections on AArch64 where the update
loop wasn't expecting a function with zero basic blocks, which can
happen on AArch64 because some functions contain just a constant
island.

(cherry picked from FBD7679244)
2018-04-12 10:07:11 -07:00
spupyrev aa91281ac3 [BOLT] improving cache metrics
Summary: Modifying parameters of block reordering algorithm that result in better performance. Additionally extending some cache-related metrics.

(cherry picked from FBD7578336)
2018-03-28 09:10:25 -07:00
Rafael Auler db949fc1f5 [PERF2BOLT] Add support for non-LBR aggregation
Summary:
Previously, we depended on the python script perf2bolt.py whenever
operating with non-LBR data.

(cherry picked from FBD7620125)
2018-04-13 11:18:46 -07:00
Rafael Auler a30fff6e36 [BOLT-AArch64] Fix BOLT build on AArch64
Summary:
Whenever building BOLT in an AArch64 box, we need to make sure
we do not run tests that are exclusive to x86. This diff also adds a tag
for expensive tests, so the user can disable them, which is useful when
using a memory-constrained machine to run BOLT tests. It also removes
ifdefs that caused BOLT to behave differently when running in a non-x86
host. Finally, it changes a case where we depended on updated libstdc++
implementation for insert to make the codebase more friendly with boxes
that do not have the newer version of the lib.

(cherry picked from FBD7625715)
2018-04-13 15:34:09 -07:00
Maksim Panchenko 120d26727a [BOLT] Restore macro-fusion optimization
Summary:
Restore the optimization with some modifications:
  * Only enabled in relocation mode.
  * Covers instructions other than TEST/CMP.
  * Prints missed macro-fusion opportunities for input.
  * By default enabled for all hot code.
  * Without profile enabled for all code.

The new command-line option:
  -align-macro-fusion - fix instruction alignment for macro-fusion (x86 relocation mode)
      =none   - do not insert alignment no-ops for macro-fusion
      =hot    - only insert alignment no-ops on hot execution paths (default)
      =all    - always align instructions to allow macro-fusion

(cherry picked from FBD7644042)
2018-04-13 15:46:19 -07:00
Maksim Panchenko c13cd9084d [BOLT] Fix tests
Summary:
During a rebase function hashes changed and new profile
stopped matching functions.

(cherry picked from FBD7618919)
2018-04-13 10:09:55 -07:00
Maksim Panchenko dc12911fea [BOLT] Report when operating in relocation mode
Summary:
Since BOLT can use relocations in the binary automatically, it's not
always clear if we are operating in relocation mode or not. This diff
adds a "BOLT-INFO" message indicating whether relocation mode is ON.

(cherry picked from FBD7557492)
2018-04-09 13:47:43 -07:00
Maksim Panchenko 8b049d3c7f [BOLT] Support for non-LBR profile in YAML
Summary:
Expanded YAML profile format to support different kinds of profile
including LBR and non-LBR (and memevents in the future).

The profile now starts with a header that includes the profile
description. "profile-flags" field includes either "lbr" or "sample",
but not both at the same time. It could also include "memevent" in
addition to other flags.

For now, the only way to generate non-LBR YAML profile is through
conversion. Once task is done, it should be possible to use
perf2bolt for it.

(cherry picked from FBD7595693)
2018-04-09 19:10:19 -07:00
Maksim Panchenko 4878770072 [BOLT][Cleanup] Remove branch history
Summary:
We are not using branch histories and don't have plans to.
Clean up the code.

(cherry picked from FBD7588644)
2018-04-11 11:23:14 -07:00
Maksim Panchenko 190693059a [merge-fdata] Rewrite merge-fdata to use YAML format
Summary:
merge-fdata now operates on .fdata files in YAML format. The previous
format is not supported, which means that non-LBR data could not be
merged and memory data has to be merged with "cat" command.

(cherry picked from FBD7544031)
2018-04-05 13:03:05 -07:00
Rafael Auler 7df6a6d5c6 [BOLT-AArch64] Fix AArch64 port - make it work with hhvm
Summary:
This diff has 3 fixes. The first fixes the way relocations are read
and interpreted for AArch64, so the references are preserved correctly.
Second, it fixes constant islands to be able to live in the very first
address of a function (which means there is no code, but this function
contains just a constant island).
Third, it fixes function splitting to not outline entry points for
AArch64. This was done because some functions may load pointers to their
internal basic blocks, issuing a short-range ADR instruction to do so
without its pair ADRP (since the size of the function is supposed to
be small). But when we move this block to a cold region, that is not
the case anymore. Since blocks with a reference are marked as entry
points, we conservatively disable outlining for them in AArch64.

(cherry picked from FBD7505067)
2018-03-20 14:34:58 -07:00
Maksim Panchenko 489e514530 [BOLT] Improve annotations format and processing
Summary:
Change the way annotations are stored and processed.

Embed annotation type/index into immediate value stored as an operand.
This limits the effective range of values that could be stored as
annotations to 56 bits, which is still plenty for most integer types
that we use and for pointers on real systems. High 8 bits are reserved
for storing annotation type/index.

Expand the interface for general annotations to include reference to
annotations by index. The main purpose of this interface is to improve
performance of annotations that are used by heavy (>O(N)) algorithms,
such as data flow analysis.

For -frame-opt pass, new memory usage and processing times are slightly
better compared to those before refactoring.
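
The packing described above can be sketched like this. The function names are hypothetical, and treating the 56-bit payload as a sign-extended integer is an assumption; the commit only states the 8/56-bit split.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned ValueBits = 56;
constexpr uint64_t ValueMask = (1ULL << ValueBits) - 1;

// Pack an 8-bit annotation index and a 56-bit signed value into a single
// 64-bit immediate, index in the high byte.
int64_t encodeAnnotation(uint8_t Index, int64_t Value) {
  // The value must survive truncation to 56 bits.
  assert(Value >= -(1LL << 55) && Value < (1LL << 55));
  uint64_t Raw = (static_cast<uint64_t>(Index) << ValueBits) |
                 (static_cast<uint64_t>(Value) & ValueMask);
  return static_cast<int64_t>(Raw);
}

// Recover the annotation index from the high 8 bits.
uint8_t annotationIndex(int64_t Imm) {
  return static_cast<uint8_t>(static_cast<uint64_t>(Imm) >> ValueBits);
}

// Recover the value from the low 56 bits, sign-extending from bit 55.
int64_t annotationValue(int64_t Imm) {
  uint64_t Raw = static_cast<uint64_t>(Imm) & ValueMask;
  if (Raw & (1ULL << (ValueBits - 1)))
    Raw |= ~ValueMask;
  return static_cast<int64_t>(Raw);
}
```

User-space pointers on current 64-bit systems fit well within 56 bits, which is why the reduced range is acceptable.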

(cherry picked from FBD7492017)
2018-03-29 18:42:06 -07:00
Maksim Panchenko d8cf08b243 [BOLT] Use MCPlus::getNumPrimeOperands()
Summary:
Use MCPlus::getNumPrimeOperands() to get the real number of operands
on MCInst. Alternatively, use MCInstrDesc::getNumOperands().

(cherry picked from FBD7507666)
2018-04-04 15:00:00 -07:00
Maksim Panchenko 7956da0fe8 [BOLT] Fix CFG in BinaryFunction::eraseInvalidBBs()
Summary:
When we erase invalid/unreachable basic blocks, we have to remove them
from a list of predecessors of regular blocks, otherwise the CFG will be
left in a broken state containing references to removed basic blocks.
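
A simplified sketch of the fix, using a stand-in `Block` struct rather than BOLT's BinaryBasicBlock; all names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for a basic block: a validity flag plus non-owning
// predecessor pointers.
struct Block {
  bool Valid = true;
  std::vector<Block *> Preds;
};

// Remove invalid blocks from the layout and from the predecessor lists of
// the surviving blocks, so no dangling references remain in the CFG.
void eraseInvalidBlocks(std::vector<Block *> &Layout) {
  for (Block *BB : Layout) {
    assert(BB && "layout must not contain null blocks");
    if (!BB->Valid)
      continue;
    auto &P = BB->Preds;
    P.erase(std::remove_if(P.begin(), P.end(),
                           [](const Block *Pred) { return !Pred->Valid; }),
            P.end());
  }
  Layout.erase(std::remove_if(Layout.begin(), Layout.end(),
                              [](const Block *BB) { return !BB->Valid; }),
               Layout.end());
}
```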

(cherry picked from FBD7464292)
2018-03-30 17:44:14 -07:00
Maksim Panchenko 0d729f218b [BOLT] Fix relocation verification
Summary:
We verify that relocation information matches a value stored in a
binary, i.e. "ExtractedValue == SymbolValue + Addend". However, because
of the size of the relocation, and the fact that an addend is always
of type int64_t, we have to sign-extend the extracted value, and then we
might get mismatch in higher bits in certain scenarios. Hence, we should
only compare values that are truncated to a relocation size.
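
A small sketch of the truncated comparison; the helper names are hypothetical and the numbers in the usage note are made up to show the high-bit mismatch.

```cpp
#include <cassert>
#include <cstdint>

// Keep only the low SizeInBytes bytes of Value.
uint64_t truncateToSize(uint64_t Value, unsigned SizeInBytes) {
  assert(SizeInBytes >= 1 && SizeInBytes <= 8);
  if (SizeInBytes == 8)
    return Value;
  return Value & ((1ULL << (SizeInBytes * 8)) - 1);
}

// Check "ExtractedValue == SymbolValue + Addend" on the bits that the
// relocation can actually encode.
bool relocationMatches(int64_t ExtractedValue, uint64_t SymbolValue,
                       int64_t Addend, unsigned SizeInBytes) {
  uint64_t Expected = SymbolValue + static_cast<uint64_t>(Addend);
  return truncateToSize(static_cast<uint64_t>(ExtractedValue), SizeInBytes) ==
         truncateToSize(Expected, SizeInBytes);
}
```

For instance, a 4-byte field holding 0xFFFFFFF0 sign-extends to -16, while a symbol at 0x100000000 with addend -16 gives 0xFFFFFFF0: the full 64-bit values differ, but their low 32 bits agree.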

Discovered while processing hhvm binary with modified compiler flags.

(cherry picked from FBD7462559)
2018-03-30 15:49:34 -07:00
Maksim Panchenko 77f35bd0e9 [BOLT] Fix iterator issue
Summary:
Getting a forward iterator from reverse iterator was implemented
incorrectly. For some reason erase worked on it, but it's clearly wrong
and printing the instruction (before the deletion) results in an error.
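
For standard-library iterators, the usual correct conversion looks like the sketch below. Whether BOLT's own iterator types were involved is not stated in the commit, so `std::vector` and the helper name are assumptions.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Erase the element a reverse iterator designates. `RIt.base()` refers to
// the element *after* the one `RIt` designates, so the matching forward
// iterator is `std::next(RIt).base()`.
template <typename T>
void eraseAtReverse(std::vector<T> &V,
                    typename std::vector<T>::reverse_iterator RIt) {
  assert(RIt != V.rend());
  V.erase(std::next(RIt).base());
}
```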

(cherry picked from FBD7457457)
2018-03-30 10:54:42 -07:00
Maksim Panchenko a62f4fda46 [BOLT][Refactoring] Isolate changes to MC layer
Summary:
Changes that we made to MCInst, MCOperand, MCExpr, etc. are now all
moved into tools/llvm-bolt. That required a change to the way we handle
annotations and any extra operands for MCInst.

Any MCPlus information is now attached via an extra operand of type
MCInst with an opcode ANNOTATION_LABEL. Since this operand is MCInst, we
attach extra info as operands to this instruction. For first-level
annotations use functions to access the information, such as
getConditionalTailCall() or getEHInfo(), etc. For the rest, optional or
second-class annotations, use a general named-annotation interface such
as getAnnotationAs<uint64_t>(Inst, "Count").

I did a test on HHVM binary, and memory consumption went down a little
bit while the runtime remained the same.

(cherry picked from FBD7405412)
2018-03-19 18:32:12 -07:00
spupyrev 0dea33737a [BOLT] improvements for CFG construction
Summary:
Some improvements for CFG construction:
- getting rid of fallthrough inference, as this is already
done by DataAggregator;
- adjusting block counts for blocks with non-zero outgoing edges
to make sure they're not outlined;
- making sure that all functions (including non-simple ones) are
reordered and placed in the hot section.

The main goal of the diff is to make sure that constructed CFG graphs
exactly correspond to the input profile data.

(cherry picked from FBD7323205)
2018-03-22 09:48:59 -07:00
spupyrev 3458e92285 removing compact-mode
Summary: this is not needed but makes code harder to read; hence, removing

(cherry picked from FBD7257937)
2018-03-14 09:05:26 -07:00
Bill Nell faacdf6080 [BOLT] Fix assertion when building test binary
Summary:
The binary had some unexpected overlapping symbols:

.str.34.llvm.2944770977690351622/1 address = 0x48e9ec7, next address =
   0x48e9ed2, size = 21
PG.LC135/1 address = 0x48e9ed2, next address = 0x48e9eef, size = 29

BOLT wasn't expecting this type of overlap when generating HOLE symbols,
so it was asserting.  I've changed the code to deal with this case.

I'll need to change the reordering pass to mark these types of symbols
as unmoveable for now.

(cherry picked from FBD7304195)
2018-03-16 09:03:12 -07:00
Bill Nell 598a346abf [BOLT] Fix assertion when setting size of jump table symbol
Summary: This assertion was making sure that when we patched up symbol sizes that we wouldn't modify the size of a symbol that has already had its size set.  The issue here is that private symbols are sometimes composed of multiple objects internally (e.g. jump tables).  In this particular case a jump table started at the same address as the private data blob it was contained in.  Currently, there isn't any good way of differentiating symbols that start at the same address (except possibly using multimaps for certain data structures).  I'm hacking around it by modifying the assertion to ignore jump tables and skip setting the size when it has already been set.  This shouldn't affect any existing optimizations since the only thing that depended on sizes is data reordering and that currently ignores jump tables and private data blobs.

(cherry picked from FBD7269207)
2018-03-13 18:59:22 -07:00
Maksim Panchenko 48ae32a33b [BOLT] Introduce MCPlus layer
Summary:
Refactor architecture-specific code out of llvm into llvm-bolt.

Introduce MCPlusBuilder, a class that is taking over MCInstrAnalysis
responsibilities, i.e. creating, analyzing, and modifying instructions.
To access the builder use BC->MIB, i.e. substitute MIA with MIB.
MIB is an acronym for MCInstBuilder, that's what MCPlusBuilder used
to be. The name stuck, and I find it better than MPB.

Instructions are still MCInst, and a bunch of BOLT-specific code still
lives in LLVM, but the stuff under Target/* is significantly reduced.

(cherry picked from FBD7300101)
2018-03-09 09:45:13 -08:00
Maksim Panchenko 8c16594f2e [BOLT] Fix ORC to properly update symbols
Summary:
In new ORC, the sequence of how sections are allocated and loaded is
changed. Now everything is delayed until emitAndFinalize() is called,
and all actions are supposed to happen via notification functors.
There are two functors that we pass to new ObjectLinkingLayer object.
One is used to notify when objects are loaded, and the other - once they
are finalized. We use the first one to remap sections to proper
addresses, and that's the earliest place where we can do it. However,
ORC decides to update symbols right before that, and as a result they
are updated with non-mapped values.

There are two possible fixes for that. This diff postpones the update to
the symbol table until the notifier is called. I don't know what other
tools depend on the existing sequence, and the proper fix may involve
creating a third notifier to be called before the symbol table update.

(cherry picked from FBD7280973)
2018-03-14 15:07:16 -07:00
Rafael Auler 2fe37b4435 [BOLT] Fix remove-unused-stores in rebased bolt
Summary:
Rebased version revealed a mistake when computing the dataflow
for the "remove-unused-stores" optimization. This is disabled in prod but
it doesn't hurt to fix it, so the tests for the rebased bolt go green
again.

(cherry picked from FBD7253418)
2018-03-12 20:24:01 -07:00
Rafael Auler 6644548c74 [BOLTDIFF] Add a tool to audit performance differences
Summary:
This is a simple bolt-based tool that instantiates two
RewriteInstances objects and compares them. Add a method to
RewriteInstance to enable us to compare two objects. Include a mechanism
to match functions from binary 1 to binary 2 and finally print the
largest differences in profiling data from one binary to another.

(cherry picked from FBD6517076)
2017-12-07 15:00:41 -08:00
Maksim Panchenko d660f8b1fe [BOLT] Disassemble all functions before building CFGs
Summary:
This makes it possible to do adjustments to all functions based on
information gained during disassembly. E.g. if we detect an entry point
after the CFG for a function is constructed, we have to take a
conservative approach, and mark such function as non-simple. Now we have
this information before building the CFG. This could also be used to do
other processing/post-processing on disassembled functions that might
affect CFG construction of other functions (e.g. early detection of
functions that never return).

The drawback of this approach is that we lose cache locality and some
processing performance as a result. I've measured 5 second difference
on HHVM binary.

(cherry picked from FBD7258466)
2018-02-14 12:06:17 -08:00
Bill Nell 0e4d86bf19 [BOLT] Refactor global symbol handling code.
Summary:
This is preparation work for static data reordering.

I've created a new class called BinaryData which represents a symbol
contained in a section.  It records almost all the information relevant
for dealing with data, e.g. names, address, size, alignment, profiling
data, etc.  BinaryContext still stores and manages BinaryData objects
similar to how it managed symbols and global addresses before.  The
interfaces are not changed too drastically from before either.  There is
a bit of overlap between BinaryData and BinaryFunction.  I would have
liked to do some more refactoring to make a BinaryFunctionFragment that
subclassed from BinaryData and then have BinaryFunction be composed or
associated with BinaryFunctionFragments.

I've also attempted to use (symbol + offset) for when addresses are
pointing into the middle of symbols with known sizes.  This changes the
simplify rodata loads optimization slightly since the expression on an
instruction can now also be a (symbol + offset) rather than just a symbol.

One of the overall goals for this refactoring is to make sure every
relocation is associated with a BinaryData object.  This requires adding
"hole" BinaryData's wherever there are gaps in a section's address space.
Most of the holes seem to be data that has no associated symbol info. In
this case we can't do any better than lumping all the adjacent hole
symbols into one big symbol (there may be more than one actual data
object that contributes to a hole). At least the combined holes should
be moveable.

Jump tables have similar issues. They appear to mostly be sub-objects
for top level local symbols. The main problem is that we can't recognize
jump tables at the time we scan the symbol table, we have to wait until
disassembly. When a jump table is discovered we add it as a sub-object
to the existing local symbol. If there are one or more existing
BinaryData's that appear in the address range of a newly created jump
table, those are added as sub-objects as well.

(cherry picked from FBD6362544)
2017-11-14 20:05:11 -08:00
Rafael Auler 32b332ad2d [BOLT] Fix ShrinkWrapping bugs and enable testing
Summary:
Fix a few ShrinkWrapping bugs:

 - Using push-pop mode in a function that required aligned stack
 - Correctly update the edges in jump tables after splitting critical
   edges
 - Fix stack pointer restores based on RBP + offset, when we change the
   stack layout in push-pop mode.

(cherry picked from FBD6755232)
2017-12-14 17:26:19 -08:00
Rafael Auler 6d0401ccfb [BOLT/LSDA] Fix alignment
Summary:
Fix a bug introduced by rebasing with respect to aligned ULEBs.
This wasn't breaking anything but it is good to keep LSDA aligned.

(cherry picked from FBD7094742)
2018-02-26 20:09:14 -08:00
Bill Nell ddefc770b0 [BOLT] Refactoring of section handling code
Summary:
This is a big refactoring of the section handling code.  I've removed the SectionInfoMap and NoteSectionInfo and stored all the associated info about sections in BinaryContext and BinarySection classes.  BinarySections should now hold all the info we care about for each section.  They can be initialized from SectionRefs but don't necessarily require one to be created.  There are only one or two spots that needed access to the original SectionRef to work properly.

The trickiest part was making sure RewriteInstance.cpp iterated over the proper sets of sections for each of its different types of processing.  The different sets are broken down roughly as allocatable and non-allocatable and "registered" (I couldn't think up a better name).  "Registered" means that the section has been updated to include output information, i.e. contents, file offset/address, new size, etc.  It may help to have special iterators on BinaryContext to iterate over the different classes to make things easier.  I can do that if you guys think it is worthwhile.

I found pointee_iterator in the llvm ADT code.  Use that for iterating over BBs in BinaryFunction rather than the custom iterator class.

(cherry picked from FBD6879086)
2018-02-01 16:33:43 -08:00
Maksim Panchenko 6744f0dbeb [BOLT] Fix jump table placement for non-simple functions
Summary:
When we move a jump table to either hot or cold new section
(-jump-tables=move), we rely on a number of taken branches from the table
to decide if it's hot or cold. However, if the function is non-simple, we
always get 0 count, and always move the table to the cold section.
Instead, we should make a conservative decision based on the execution
count of the function.
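
The decision can be sketched as follows. The names are hypothetical, and treating any non-zero function execution count as "hot" is an assumption about what the conservative decision means.

```cpp
#include <cassert>
#include <cstdint>

enum class Placement { Hot, Cold };

// Choose the output section for a jump table under -jump-tables=move.
Placement placeJumpTable(bool IsSimple, uint64_t TableBranchCount,
                         uint64_t FunctionExecCount) {
  // Non-simple functions never have per-table branch counts.
  assert(IsSimple || TableBranchCount == 0);
  // Fall back to the function's execution count when no branch counts exist.
  uint64_t Count = IsSimple ? TableBranchCount : FunctionExecCount;
  return Count > 0 ? Placement::Hot : Placement::Cold;
}
```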

(cherry picked from FBD7058127)
2018-02-22 11:20:46 -08:00
Andy Newell e15623058e Cache+ speed, reduce mallocs
Summary:
Speed up cache+ by skipping mallocs on vectors.

Although this change speeds up the algorithm by 2x, this is still not
enough for some binaries where some functions have ~2500 hot basic
blocks. Hence, introduce a threshold for expensive optimizations in
CachePlusReorderAlgorithm. If the number of hot basic blocks exceeds
the threshold (2048 by default), we use a cheaper version, which is
quite fast.

(cherry picked from FBD6928075)
2018-02-09 09:58:19 -08:00