Commit Graph

616 Commits

Author SHA1 Message Date
Maksim Panchenko 7fd487066f [BOLT] Move BinaryFunctions into a BinaryContext and more
Summary:
A long overdue refactoring that makes interfaces cleaner and less awkward.
Mainly, it makes future work much easier.

(cherry picked from FBD14766284)
2019-04-03 15:52:01 -07:00
Maksim Panchenko 8894853f42 [BOLT][DWARF] Dedup .debug_abbrev section patches
Summary:
When we patch .debug_abbrev we issue many duplicate patches. Instead of
storing these patches as a vector, use a hash map. This saves some
processing time and memory.
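
The effect of switching from a vector to a hash map can be sketched as follows. This is only an illustration, not BOLT's code: the function name `dedup_patches` and the `(offset, value)` tuple layout are assumptions made for the example.

```python
def dedup_patches(patches):
    """Collapse duplicate (offset, new_value) patches into a dict keyed by offset.

    Hypothetical layout: each patch is an (offset, new_value) tuple.
    """
    unique = {}
    for offset, value in patches:
        # A repeated patch for the same offset lands in the same slot,
        # so duplicates cost no extra memory.
        unique[offset] = value
    return unique


patches = [(0x10, 0x55), (0x24, 0x17), (0x10, 0x55), (0x24, 0x17)]
unique = dedup_patches(patches)
print(len(patches), "patches issued,", len(unique), "stored")
```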

(cherry picked from FBD14691292)
2019-03-29 14:22:54 -07:00
Maksim Panchenko 297d1a4e1a [BOLT] Do not write jump table section headers
Summary:
In non-relocation mode we were accidentally emitting section headers for
every single jump table. This happened with default
`-jump-tables=basic`.

(cherry picked from FBD14653282)
2019-03-27 13:58:31 -07:00
Maksim Panchenko d1b76f2ac2 [BOLT] Allocate enough space past __hot_end for huge pages
Summary:
While using "-hot-text" option, we might not get enough cold text to
fill up the last huge page, and we can get data allocated on this page
producing undesirable effects. To prevent this from happening, always
make sure to allocate enough space past __hot_end.
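
A small sketch of the arithmetic involved, assuming 2MiB huge pages and assuming that whatever cold text exists is laid out right after __hot_end. The helper names are hypothetical and the real allocation logic may differ.

```python
HUGE_PAGE_SIZE = 2 * 1024 * 1024  # assumed huge page size (2MiB)


def align_up(address, alignment):
    """Round address up to the next multiple of alignment."""
    return (address + alignment - 1) // alignment * alignment


def padding_past_hot_end(hot_end, cold_text_size):
    """Bytes of padding needed so that no data lands on the last hot huge page.

    Assumption: cold text follows __hot_end and fills part of the gap.
    """
    gap = align_up(hot_end, HUGE_PAGE_SIZE) - hot_end
    return max(0, gap - cold_text_size)


print(hex(padding_past_hot_end(0x601000, 0x800)))
```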

(cherry picked from FBD14575100)
2019-03-21 21:13:45 -07:00
Maksim Panchenko 69faf61372 [BOLT] Fix section lookup while deleting symbols
Summary:
While removing redundant local symbols, we used new section index to
lookup the corresponding section in the old section table. As a result,
we used to either not remove the correct symbols, or remove the wrong
ones.

(cherry picked from FBD14552047)
2019-03-20 16:13:09 -07:00
Maksim Panchenko b8d3dc40ea [BOLT] Use local binding for cold fragment symbols
Summary:
We used to use existing symbol binding while duplicating and renaming
cold fragment symbols. As a result, some of those were emitted with
global binding. This confuses gdb, and it starts treating those symbols
as additional entry points.

The fix is to always emit such symbols with a local binding. This also
means that we have to sort static symbol table before emission to make
sure local symbols precede all others.
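
The ordering requirement can be illustrated with a stable sort that moves local symbols to the front. This is a sketch only: the `(name, binding)` tuples and the symbol names are made up for the example, while the binding values follow the ELF specification.

```python
STB_LOCAL, STB_GLOBAL = 0, 1  # symbol binding values from the ELF specification


def sort_locals_first(symbols):
    """Return symbols with all local ones first, otherwise keeping their order."""
    # sorted() is stable, so relative order inside each group is preserved.
    return sorted(symbols, key=lambda sym: sym[1] != STB_LOCAL)


# Hypothetical symbol table contents: (name, binding).
symbols = [
    ("main", STB_GLOBAL),
    ("foo.cold.1", STB_LOCAL),
    ("bar", STB_GLOBAL),
    ("baz.cold.1", STB_LOCAL),
]
for name, binding in sort_locals_first(symbols):
    print(name, "LOCAL" if binding == STB_LOCAL else "GLOBAL")
```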

(cherry picked from FBD14529265)
2019-03-19 13:46:21 -07:00
Maksim Panchenko 6bcb3389dd [BOLT] Place hot text mover functions into a separate section
Summary:
Create a separate pass for assigning functions to sections. Detect
functions originating from special sections (by default .stub and
.mover) and place them into ".text.mover" if the "-hot-text" option is
specified.

Cold functions are isolated from hot functions even when no function
re-ordering is specified.

(cherry picked from FBD14512628)
2019-03-15 13:43:36 -07:00
Maksim Panchenko 17cd2034f3 [BOLT] Fix debug line info emission
Summary:
GDB does not like it if the first entry in the line info table after an
end_sequence entry is not marked with is_stmt. If this happens, it will
not print the correct line number information for such address. Note
that everything works fine starting with the first address marked
with is_stmt.

This could happen if the first instruction in the cold section wasn't
marked with is_stmt.

The fix is to always emit debug line info for the first instruction
in any function fragment with is_stmt flag.

(cherry picked from FBD14516629)
2019-03-18 19:22:26 -07:00
Maksim Panchenko 61ea19edf8 [BOLT][NFC] Fix compilation warnings
Summary: Get rid of warnings while building with Clang.

(cherry picked from FBD14495620)
2019-03-15 15:06:41 -07:00
Maksim Panchenko 0a55001a0e [BOLT] Fix -hot-functions-at-end option
Summary: Make "-hot-functions-at-end" option work again.

(cherry picked from FBD14476242)
2019-03-14 20:32:04 -07:00
Maksim Panchenko 163adbec9f [BOLT] Refactor allocatable sections rewrite part
Summary:
This refactoring makes it easier to create new code sections and control
code placement. As an example, cold code is being placed into
".text.cold" which is emitted independently from ".text", and the final
address assignment becomes more flexible.

Previously, in non-relocation mode we used to emit temporary section
name into .shstrtab. This resulted in unnecessary bloat of this section.

There was unnecessary padding emitted at the end of text section. After
fixing this, the output binary becomes smaller.

I had to change the way exception handling tables are re-written
as the current infra does not support cross-section label difference.
This means we have to emit absolute landing pad addresses, which might
not work for PIE binaries. I'm going to address this once I investigate
the current exception handling issues in PIEs.

This diff temporarily disables "-hot-functions-at-end" option.

(cherry picked from FBD14475693)
2019-03-14 18:51:05 -07:00
Maksim Panchenko a9e64947c5 [NFC][BOLT] Move ExecutableFileMemoryManager into its own file
(cherry picked from FBD14474800)
2019-03-14 18:49:40 -07:00
Rafael Auler c593563d1f Do not assert on addresses read from processIndirectBranch
Summary: As part of our heuristics to decode an indirect branch, if we
suspect the branch is an indirect tail call, we add its probable target
to the BC::InterproceduralReferences vector to detect functions with
more than one entry point. However, if this probable target is not in an
allocatable section, we were asserting. Remove this assertion and
change the code to conditionally store to InterproceduralReferences
instead. The probable target could be garbage at this point because
of analyzeIndirectBranch failing to identify the load instruction that
has the memory address of the target, so we should tolerate this.

(cherry picked from FBD14432821)
2019-03-12 16:36:35 -07:00
Maksim Panchenko 0c704eb75a [BOLT-HEATMAP] Initial heat map implementation
Summary:
Add heatmap subcommand to produce heatmaps based on perf.data with LBR.
The output is produced in colored ASCII format.

  llvm-bolt heatmap -p perf.data <executable>

    -block-size=<uint> - size of a heat map block in bytes (default 64)
    -line-size=<uint>  - number of entries per line (default 256)
    -max-address=<uint> - maximum address considered valid for heatmap
                          (default 4GB)
    -o=<string>        - heatmap output file (default stdout)
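
How the options above interact can be sketched as a simple bucketing step. This is an illustration under assumptions, not the heatmap implementation: the function names are hypothetical and the colored ASCII rendering is left out.

```python
from collections import Counter


def bucket_addresses(addresses, block_size=64, max_address=4 * 1024**3):
    """Count samples per block, skipping addresses above max_address."""
    counts = Counter()
    for address in addresses:
        if address > max_address:
            continue  # not considered valid for the heat map
        counts[address // block_size] += 1
    return counts


def block_position(block_index, line_size=256):
    """Return (line, column) of a block when line_size blocks are shown per line."""
    return divmod(block_index, line_size)


counts = bucket_addresses([0x0, 0x3F, 0x40, 5 * 1024**3])
print(counts, block_position(257))
```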

(cherry picked from FBD13969992)
2019-02-05 15:28:19 -08:00
Maksim Panchenko ff6e21290f [BOLT] New inliner implementation
Summary:
Addresses correctness issues related to inlining.
Inlining heuristics are not part of this diff.

(cherry picked from FBD13796888)
2019-01-31 11:23:02 -08:00
Maksim Panchenko 365bd1f1c8 [BOLT] For non-simple functions always update jump tables in-place
Summary:
For a non-simple function we can miss a reference to a jump table or
to an indirect goto table. If we move the jump table, the missed
reference will not get updated, and the corresponding indirect jump
will end up in the old (wrong) location. Updating the original jump
table in-place should take care of the issue.

(cherry picked from FBD13849776)
2019-01-28 13:46:18 -08:00
Rafael Auler af81c7ff80 [perf2bolt] Add support for generating autofdo input
Summary:
Autofdo tools support.

(cherry picked from FBD13779026)
2019-01-22 17:21:45 -08:00
Maksim Panchenko c6ce2abb7d [perf2bolt] Optimize memory usage in perf2bolt
Summary:
While converting perf profile, we only need CFG for functions that were
profiled and can skip building CFG for the rest. This saves us some
processing time and memory.

Break down processing of perf.data into two steps. The first
step parses the data, saves it in intermediate format, and marks
functions with the profile. The second step attributes the profile to
functions with CFG. When we disassemble and build CFG for functions in
aggregate-only mode, we skip functions without the profile.

(cherry picked from FBD13706697)
2019-01-15 23:43:40 -08:00
Maksim Panchenko 2fe0c38d6b [perf2bolt] Better tracking of process forking
Summary:
Improve tracking of forked processes.

If a process corresponding to the input binary has forked/started
before 'perf record' was initiated, then the full name of the binary
will be recorded in a corresponding MMAP2 event. We've been handling
such cases well so far.

However, if the process was forked after 'perf record' has started, and
execve(2) wasn't called afterwards, then there will be no MMAP2 event
recorded corresponding to the mapping of the main binary (unrelated
MMAP2 events could still be recorded).

To track such cases, we need to parse 'perf script --show-task-events'
command output, and to scan for PERF_RECORD_FORK events, and then add
forked process PIDs to the list associated with the input binary. If
the fork event was followed by an exec event (PERF_RECORD_COMM exec)
of a different binary, then the forked PID should be ignored. If the
exec event was associated with our input binary, then the correct MMAP2
event was recorded and parsed.

To track if the event occurred before or after 'perf record', we parse
event's time. This helps us to differentiate some events. E.g. the exec
event is only registered correctly if it happened after perf recording
has started (otherwise the "exec" part is missing), and thus we only
record forks with non-zero time stamps.

(cherry picked from FBD13250904)
2018-11-21 20:04:00 -08:00
Maksim Panchenko 067a385000 [BOLT] Add thresholds for function splitting
Summary:
Use newly added function size estimation to measure the effectiveness
and guide function splitting. Two new tuning options are added:

  -split-threshold=<uint>
    split function only if its main size is reduced by more than given
    amount of bytes. Default value: 0, i.e. split iff the size is reduced.
    Note that on some architectures the size can increase after splitting.
  -split-align-threshold=<uint>
    when deciding to split a function, apply this alignment while doing
    the size comparison (see -split-threshold). Default value: 2.
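
One possible reading of the two options, written out as a sketch. The way the alignment is applied to both sizes is an assumption, and `should_split` is a hypothetical name, not a BOLT function.

```python
def align_up(size, alignment):
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment


def should_split(original_size, hot_size, split_threshold=0, split_align_threshold=2):
    """Split only if the aligned main size shrinks by more than split_threshold bytes."""
    saved = (align_up(original_size, split_align_threshold)
             - align_up(hot_size, split_align_threshold))
    return saved > split_threshold


print(should_split(101, 100))  # aligned sizes are 102 and 100
print(should_split(100, 99))   # both sizes align to 100
```

With the defaults, a one-byte reduction that disappears after alignment does not trigger a split, and a function whose main part grows is never split.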

(cherry picked from FBD13136352)
2018-11-15 16:03:34 -08:00
Maksim Panchenko b0f7fddd35 [BOLT] Add method for better function size estimation
Summary:
Add BinaryContext::calculateEmittedSize() that ephemerally emits code
to allow precise estimation of the function size. Relaxation and
macro-op alignment adjustments are taken into account.

(cherry picked from FBD13092139)
2018-11-15 16:02:16 -08:00
Maksim Panchenko e1b8fade7f [BOLT] Add branch priority policy for blocks with 2 successors
Summary:
On x86 the difference between long and short jump instructions could be
either 4 or 3 bytes, depending on whether it's a conditional jump or not.
For a basic block with 2 jump instructions, if we know that one of
the successors is in a different code region, then we can make it
a target of an unconditional jump instruction. This will save 1 byte
in case the conditional jump happens to be a short one.
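
The byte count can be checked with the standard x86 encodings (short forms use a rel8 operand, long forms a rel32 operand). The sketch assumes the other successor is close enough for a short jump; the function name is hypothetical.

```python
JMP_SHORT, JMP_LONG = 2, 5  # jmp rel8 / jmp rel32
JCC_SHORT, JCC_LONG = 2, 6  # jcc rel8 / jcc rel32 (two-byte opcode)


def branch_bytes(far_target_is_conditional):
    """Total size of the two jumps ending a block with one far successor."""
    if far_target_is_conditional:
        return JCC_LONG + JMP_SHORT
    return JCC_SHORT + JMP_LONG


print(branch_bytes(True), branch_bytes(False))
```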

(cherry picked from FBD13078139)
2018-11-14 14:43:59 -08:00
Maksim Panchenko 40d9fcfdca [BOLT] Workaround for Clang de-virtualization bug
Summary:
When Clang is boot-strapped with (Thin)LTO, it may produce a code
fragment similar to below:

  .LFT663334 (6 instructions, align : 1)
    Predecessors: .LFT663333
      00000538:   movb    $0x1, %al
      0000053a:   movl    %eax, -0x2c(%rbp)
      0000053d:   movl    $"_ZN5clang6Parser12ConsumeParenEv/1", %ecx
      00000542:   testb   $0x1, %cl
      00000545:   movq    -0x40(%rbp), %r14
      00000549:   je      .Ltmp1071462
    Successors: .Ltmp1071462, .LFT663335

  .LFT663335 (2 instructions, align : 1)
    Predecessors: .LFT663334
      0000054b:   movq    (%r12), %rax
      0000054f:   movq    .Ltmp0(%rax), %rcx
    Successors: .Ltmp1071462

  .Ltmp1071462 (7 instructions, align : 1)
    Predecessors: .LFT663334, .LFT663335
      00000556:   movq    %r12, %rdi
      00000559:   callq   *%rcx
      .......

The code above is making a call by dereferencing a pointer to a member
function. A pointer to a member function could either be a regular
function, or a virtual function. To differentiate between the two, AMD64
ABI (originated from Itanium ABI) uses the last bit of the pointer. The
call instruction sequence varies depending on whether the function is virtual or
not, and the pointer's last bit is checked. If it's "1" then the value
of the pointer (minus 1) is used as an offset in the object vtable to
get the address of the function, otherwise the pointer is used directly
as a function address.
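
The dispatch rule just described can be modeled in a few lines. This is a simplified sketch: `vtable` is a hypothetical dict mapping byte offsets to function addresses, and the `this` adjustment that a real member function pointer also carries is ignored.

```python
def resolve_member_pointer(ptr, vtable):
    """Return the call target for a pointer-to-member-function value."""
    if ptr & 1:
        # Virtual: the value minus 1 is an offset into the object's vtable.
        return vtable[ptr - 1]
    # Non-virtual: the value is the function address itself.
    return ptr


vtable = {0: 0x401000, 8: 0x401040}  # made-up addresses
print(hex(resolve_member_pointer(0x9, vtable)))
print(hex(resolve_member_pointer(0x402000, vtable)))
```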

In this specific case, a de-virtualization is taking place, but it's not
complete. The compiler knows that the member function pointer is actually a
non-virtual function _ZN5clang6Parser12ConsumeParenEv (aka
"clang::Parser::ConsumeParen()"). However, it keeps the (dead) code that
checks the last bit of _ZN5clang6Parser12ConsumeParenEv, and furthermore
keeps the code (unreachable/dead) to make a virtual call while using
(_ZN5clang6Parser12ConsumeParenEv - 1) as an offset into the vtable.
This is obviously wrong, but since the code is unreachable, it will
never affect the runtime correctness.

The value "_ZN5clang6Parser12ConsumeParenEv - 1" falls into the last byte
of the function preceding _ZN5clang6Parser12ConsumeParenEv, and BOLT
creates a label ".Ltmp0" pointing to this last byte that is referenced
by the instruction sequence above. It just happens that the last byte
is also in the middle of the last instruction, and as a result, BOLT
never emits the label, hence resulting in the error message "Undefined
temporary symbol".

The workaround is to detect non-pc-relative relocations from code
pointing to some (fptr - 1). Note that this is not completely
error-proof, but non-pc-relative references from code into the middle of
a function are quite rare, and chances that in a normal situation they
will point to a byte preceding some function address are virtually zero.
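
The detection heuristic can be sketched as a simple membership test. The function name and the set of function start addresses are assumptions for the example; the actual check in BOLT may take more context into account.

```python
def looks_like_fptr_minus_one(reloc_value, is_pc_relative, function_addresses):
    """True if a non-pc-relative relocation points one byte before a function."""
    if is_pc_relative:
        return False
    return reloc_value + 1 in function_addresses


function_addresses = {0x401000, 0x401200}  # made-up function start addresses
print(looks_like_fptr_minus_one(0x4011FF, False, function_addresses))
```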

(cherry picked from FBD13030310)
2018-11-12 12:38:50 -08:00
Maksim Panchenko 30fd960951 [BOLT] Update local symbol count in symbol table
Summary:
Fix sh_info entry for symbol table section to reflect updated number of
local symbols.

(cherry picked from FBD10503216)
2018-10-22 18:48:12 -07:00
Maksim Panchenko a76b13d48e [perf2bolt] Pre-aggregate LBR samples
Summary: Pre-aggregating LBR data cuts perf2bolt processing times in half.

(cherry picked from FBD10420286)
2018-10-02 17:16:26 -07:00
Rafael Auler 74a71c6812 Fix bug in analyzeRelocation for GOT entries
Summary:
Special case GOT relocs to ignore addend subtracting
logic in analyzeRelocation, since the addend does not refer to the
target of the instruction being analyzed. Also make the code honor
the comments in the special case about zeroed out ExtractValue but
non-zero addend.
Fix facebookincubator/BOLT#40

(cherry picked from FBD10355019)
2018-10-11 18:12:09 -07:00
Facebook Github Bot b166ccbea8 [BOLT][PR] Fix compiler warnings in BinaryContext and RegAnalysis
Summary:
This pull request fixes two compiler warnings:

- missing `break;` in a switch-case statement in RegAnalysis.cpp (-Wimplicit-fallthrough warning)
- misleading indentation in BinaryContext.cpp (-Wmisleading-indentation warning)
Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/39
GitHub Author: Andreas Ziegler <andreas.ziegler@fau.de>

(cherry picked from FBD10202092)
2018-10-04 10:46:16 -07:00
Igor Sugak c3c80822a3 [BOLT] Capitalize i
Summary: as titled

(cherry picked from FBD10136655)
2018-10-01 16:22:46 -07:00
Igor Sugak cc2276d3f1 [BOLT] fix build with gcc-4.8.5
Summary: These are two minor changes to make it compatible with gcc-4.8.5

(cherry picked from FBD9884971)
2018-09-17 12:17:33 -07:00
Maksim Panchenko ce508b58c6 [BOLT] Support relocations without symbols
Summary:
lld may generate relocations without associated symbols. Instead of
rejecting binaries with such relocations, we can re-create the symbol
the relocation is against based on the extracted value.

(cherry picked from FBD10054576)
2018-09-21 12:00:20 -07:00
Rafael Auler bd0b99c45d [BOLT] Change stub-insertion pass for AArch64
Summary:
Previously, we were expanding eligible branches with stubs. After
expansion, we were computing which stubs were unnecessary and removing them,
assuming ranges were shortening as code is removed. The problem with this
approach is that for branches that refer to code that is not managed by
BOLT, the distance to that location can increase and we can end up with an
out-of-range branch.

This rewrites the pass to be simpler, only increasing size and expanding code
with stubs as needed after each iteration, stopping when code stops increasing.
Besides this rewrite, the stub-insertion pass now supports stubs grouping
similar to what the linker does, allowing different functions to share the
same veneer that jumps to a common callee. It also fixes a bug in the previous
implementation that, in very large functions that use TBZ/TBNZ (+-32KB range),
it would mistakenly try to reuse a local stub BB that is out of range.
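
The range limit mentioned for TBZ/TBNZ follows from the 14-bit signed immediate, which is scaled by 4. A minimal reachability check, with a hypothetical function name, could look like this:

```python
TBZ_IMM_BITS = 14  # signed imm14, counted in 4-byte instructions


def in_tbz_range(branch_address, target_address):
    """True if a TBZ/TBNZ at branch_address can reach target_address directly."""
    offset = target_address - branch_address
    limit = 1 << (TBZ_IMM_BITS + 1)  # 32768 bytes
    return offset % 4 == 0 and -limit <= offset < limit


print(in_tbz_range(0x10000, 0x10000 + 32764))
print(in_tbz_range(0x10000, 0x10000 + 32768))
```

A local stub is only reusable by such a branch when this check holds for the stub's address. Otherwise a new stub within range is needed.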

This includes a change to allow hot functions to be put at the end of the
.text section, closer to the heap, requiring no veneers to jump to JITted
code. And finally it enables eliminate veneers pass by default.

(cherry picked from FBD10023158)
2018-09-17 13:36:59 -07:00
Maksim Panchenko 1387a9d761 [BOLT] Keep .text section in file when using old text
Summary:
If we reuse text section under `-use-old-text` option, then there's no
need to rename it. Tools, such as perf, seem to not like binaries
without `.text`.

Additionally, check if the code fits into `.text` using the page
alignment, otherwise we were skipping the alignment relying on the user
detecting the warning message. This could have resulted in unexpected
performance drops.

Also add `-no-huge-pages` option to use regular page size for code
alignment purposes (i.e. 4KiB instead of 2MiB).

(cherry picked from FBD10024670)
2018-09-24 20:58:31 -07:00
Maksim Panchenko 53b72d0f2e [BOLT] Ignore symbols from non-allocatable sections
Summary:
While creating BinaryData objects we used to process all symbol table
entries. However, some symbols could belong to non-allocatable sections,
and thus we have to ignore them for the purpose of analyzing in-memory
data.

(cherry picked from FBD9666511)
2018-09-05 14:36:52 -07:00
Maksim Panchenko 8026760ac0 [BOLT] Fix another issue with profile after ICP
Summary:
For jump tables ICP was using profile from the jump table itself which
doesn't work correctly if the jump table is re-used at different code
locations.

(cherry picked from FBD9618774)
2018-08-30 13:21:50 -07:00
spupyrev 41ed5431a0 [BOLT] turning on the compact aligner by default
Summary: Making UseCompactAligner true by default

(cherry picked from FBD9325158)
2018-08-14 14:49:10 -07:00
Maksim Panchenko cd19f718b4 [BOLT] Merge jump table profile data
Summary:
While running ICF pass we have skipped merging profile data for jump
tables. We were only updating profile in the CFG. Fix that.

(cherry picked from FBD9595523)
2018-08-30 13:21:29 -07:00
Maksim Panchenko 69e6004a42 [perf2bolt] Fix processing of binaries with names over 15 chars long
Summary:
Do not truncate the binary name for comparison purposes as the binary
name we are getting from "perf script" is no longer truncated.

(cherry picked from FBD9596409)
2018-08-30 14:51:10 -07:00
Rafael Auler d0a80b0870 [BOLT] Change ForceRelocation behavior
Summary:
Only record address as addend if the target of the relocation
is the pseudo-symbol Zero.

(cherry picked from FBD9551543)
2018-08-28 18:15:13 -07:00
Maksim Panchenko 708a550084 [BOLT] Fix profile after ICP
Summary:
After optimizing a target of a jump table, ICP was not updating edge
counts corresponding to that target. As a result the edge could be left
hot and negatively influence the code layout.

(cherry picked from FBD9524396)
2018-08-23 22:47:46 -07:00
Maksim Panchenko 2511b09985 [BOLT][DWARF] Fix line info for empty CU DIEs
Summary:
In some rare cases a compiler may generate DWARF that contains an empty
CU DIE that references a debug line fragment. That fragment will contain
no file name information, and we fail to register it. Then, as a result,
DW_AT_stmt_list is not updated for the CU. This may cause some
DWARF-processing tools to segfault.

As a solution/workaround, we register "<unknown>" file name for such
debug line tables.

(cherry picked from FBD9526705)
2018-08-27 20:12:59 -07:00
Rafael Auler a7e0704be6 [BOLT] Reduce AArch64 target feature flags
Summary:
Eliminate some flags that are not recognized and
are currently printing warnings when BOLT runs on AArch64.

(cherry picked from FBD9499971)
2018-08-24 10:42:00 -07:00
Rafael Auler af1177d99f [BOLT] Add mattr options to AArch64 target
Summary:
Make the AArch64 subtarget enable all features, so the disassembler
won't choke on extension instructions.

(cherry picked from FBD9477066)
2018-08-22 18:47:39 -07:00
Rafael Auler 9c4fcafa37 [BOLT] Add update-build-id option, on by default
Summary:
The build-id is used by tools to uniquely identify binaries. Update
the output binary build-id with a different number to make it
distinguishable from the input binary. This implementation just flips
the last build-id bit.
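
A sketch of the bit flip, assuming "last bit" means the least significant bit of the final build-id byte. The function name and the sample build-id are made up for the example.

```python
def flip_last_bit(build_id: bytes) -> bytes:
    """Return a copy of build_id with the lowest bit of its final byte inverted."""
    if not build_id:
        return build_id
    return build_id[:-1] + bytes([build_id[-1] ^ 1])


original = bytes.fromhex("a1b2c3d4")  # made-up build-id
print(original.hex(), "->", flip_last_bit(original).hex())
```

Applying the flip twice restores the original value, so the change is cheap to undo.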

(cherry picked from FBD9235336)
2018-08-08 17:55:24 -07:00
Rafael Auler 510a8c4bbe [BOLT] Fix shrink-wrapping CFI update
Summary:
When updating CFI for a function that was optimized by
shrink-wrapping, if the function had no frame pointers, the CFI update
algorithm was incorrect.

(cherry picked from FBD9328658)
2018-08-14 17:32:06 -07:00
Maksim Panchenko 88bb145164 [BOLT] Update allocatable relocation sections
Summary:
Position-independent binaries may have runtime relocations of type
R_X86_64_RELATIVE that need an update if they were pointing to one of
the functions that we have relocated.

(cherry picked from FBD9374164)
2018-08-16 16:53:14 -07:00
Maksim Panchenko 87788ca926 [perf2bolt] Support profiling of PIEs and .so's
Summary:
Processing profile data for binaries with flexible load address (such as
position-independent executables and shared objects) requires adjusting
binary addresses depending on the base load address.

For every PID the mapping will be more or less unique when executing
with ASLR enabled, thus we have to keep the mapping record for all PIDs
associated with the binary. Then we adjust the addresses based on those
mappings.
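
A simplified model of the per-PID adjustment, assuming each PID maps the binary at a single base address. The names are hypothetical, and a real implementation also has to account for segment offsets and link-time addresses.

```python
def to_binary_address(pid, runtime_address, load_bases):
    """Translate a sampled runtime address into a binary-relative address."""
    base = load_bases.get(pid)
    if base is None or runtime_address < base:
        return None  # unknown PID, or address below this PID's mapping
    return runtime_address - base


# With ASLR every PID may see a different base (made-up values).
load_bases = {100: 0x555555554000, 101: 0x7F0000000000}
print(hex(to_binary_address(100, 0x555555555234, load_bases)))
print(hex(to_binary_address(101, 0x7F0000001234, load_bases)))
```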

(cherry picked from FBD9368566)
2018-08-14 13:24:44 -07:00
Maksim Panchenko 560c23411a [perf2bolt] Use mmap events for PID collection
Summary:
Switch from using `perf script --show-task-events` to
`perf script --show-mmap-events` for associating a binary with PIDs in
perf.data. The output of the former command does not provide enough
information for PIE/.so processing.

(cherry picked from FBD9346586)
2018-08-14 13:24:44 -07:00
Rafael Auler b10d4724c3 [BOLT] Fix pseudo calculation in BinaryBasicBlock
Summary:
A recent commit broke our tests because it was depending on
getNumNonPseudos() at a very late stage of our optimization pipeline.
The problem was in an instruction deletion member function in
BinaryBasicBlock that was not updating the number of pseudos after
deletion. Fix this.

(cherry picked from FBD9305972)
2018-08-13 14:36:38 -07:00
Laith Saed Sakka b2382dc552 retpoline insertion : further updates.
Summary:
A couple of updates:

1) Handle address pattern with segment register.
2) Assume R11 available for PLT calls always.
3) Add CFI state to each BB.
4) Exit early from getMacroOpFusionPair if Instruction.size() < 2.

(cherry picked from FBD9172426)
2018-08-03 16:36:06 -07:00
Maksim Panchenko c35dc2a386 [BOLT] Detect and handle fixed indirect branches
Summary:
Sometimes GCC can generate code where one of jump table entries
is being used by an indirect branch with a fixed memory reference,
such as "jmp *(JT+8)". If we don't convert such branches to direct ones
and move jump tables, then the indirect branch will reference the old
table value and will end up at the non-updated destination, possibly
causing a runtime crash.

This fix converts such indirect branches into direct ones.
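
The conversion amounts to looking up the jump table entry that the fixed memory reference selects. A sketch under assumptions: 8-byte absolute entries, a hypothetical function name, and made-up addresses.

```python
ENTRY_SIZE = 8  # assumed size of one jump table entry in bytes


def resolve_fixed_branch(jt_address, entries, mem_address):
    """Return the direct target for "jmp *(mem_address)", or None if it is not a table entry."""
    offset = mem_address - jt_address
    if offset < 0 or offset % ENTRY_SIZE != 0:
        return None
    index = offset // ENTRY_SIZE
    if index >= len(entries):
        return None
    return entries[index]


entries = [0x401100, 0x401180, 0x401200]
print(hex(resolve_fixed_branch(0x600000, entries, 0x600000 + 8)))  # jmp *(JT+8)
```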

For now we mark functions containing indirect branches with fixed
destination as non-simple to prevent unreachable code elimination
problem triggered by related dead/unreachable jump table.

(cherry picked from FBD9192363)
2018-08-06 11:22:45 -07:00