Commit Graph

616 Commits

Author SHA1 Message Date
Maksim Panchenko 82965b963f [BOLT] Emit short tail calls in relocation mode.
Summary:
To minimize size of the output code we should emit tail calls
that are as short as possible. For this we have to convert a synthetic
TAILJMPd into JMP_1 instruction. This should be one of the last passes
as most of analysis passes could break since tail calls will no longer
be marked as such.

The total size of the code is smaller, but not by much - hot text was
reduced by 192 bytes.

(cherry picked from FBD4557804)
2017-02-13 23:05:12 -08:00
Maksim Panchenko 734a7a5437 [BOLT] Skip disassembly of padding at function end.
Summary:
Some functions coming from assembly may not have been marked
with size. We assume the size to include all bytes up to
the next function/object in the file. As a result,
function body will include any padding inserted by the linker.
If linker inserts 0-value bytes this could be misinterpreted
as invalid instruction and BOLT will bail out on such functions
in non-relocation mode, and give up on a binary in relocation
mode.

This diff detects zero-padding, ignores it, and continues processing
as normal.

(cherry picked from FBD4528893)
2017-02-08 09:14:10 -08:00
Maksim Panchenko 6b0b5bbae7 [BOLT] Reject sanitized binaries.
Summary:
Whenever input binary is suspected to have been sanitized we print an error
message and exit. I've checked that "__asan_init*" symbol
presence is the most conservative way to detect "sanitization".

(cherry picked from FBD4525478)
2017-02-07 15:56:00 -08:00
Maksim Panchenko c89821cee3 [BOLT] Detect and prevent re-optimization attempts.
Summary:
Whenever we try to re-optimize a binary with BOLT we should
issue an error and exit.

(cherry picked from FBD4525228)
2017-02-07 15:31:14 -08:00
Maksim Panchenko e212805ea6 [BOLT] Update section names in output file.
Summary:
Re-write section header string table to reflect new names
given to sections. Old sections get ".bolt.org" prefix.

E.g. when we write ".eh_frame" section, we keep the old copy
but rename it to ".bolt.org.eh_frame".

Note: the new code section is named ".bolt.text" - it contains split
function bodies, while original ".text" name is left unchanged.

(cherry picked from FBD4524935)
2017-02-07 12:20:46 -08:00
Bill Nell d74997c3cc Indirect call promotion optimization.
Summary:
Perform indirect call promotion optimization in BOLT.

The code scans the instructions during CFG creation for all
indirect calls.  Right now indirect tail calls are not handled
since the functions are marked not simple.  The offsets of the
indirect calls are stored for later use by the ICP pass.

The indirect call promotion pass visits each indirect call and
examines the BranchData for each.  If the most frequent targets
from that callsite exceed the specified threshold (default 90%),
the call is promoted.  Otherwise, it is ignored.  By default,
only one target is considered at each callsite.

When an candiate callsite is processed, we modify the callsite
to test for the most common call targets before calling through
the original generic call mechanism.

The CFG and layout are modified by ICP.

A few new command line options have been added:
-indirect-call-promotion
-indirect-call-promotion-threshold=<percentage>
-indirect-call-promotion-topn=<int>

The threshold is the minimum frequency of a call target needed
before ICP is triggered.

The topn option controls the number of targets to consider for
each callsite, e.g. ICP is triggered if topn=2 and the total
requency of the top two call targets exceeds the threshold.

Example of ICP:

C++ code:

  int B_count = 0;
  int C_count = 0;

  struct A { virtual void foo() = 0; }
  struct B : public A { virtual void foo() { ++B_count; }; };
  struct C : public A { virtual void foo() { ++C_count; }; };

  A* a = ...
  a->foo();
  ...

original:
  400863:	49 8b 07             	mov    (%r15),%rax
  400866:	4c 89 ff             	mov    %r15,%rdi
  400869:	ff 10                	callq  *(%rax)
  40086b:	41 83 e6 01          	and    $0x1,%r14d
  40086f:	4d 89 e6             	mov    %r12,%r14
  400872:	4c 0f 44 f5          	cmove  %rbp,%r14
  400876:	4c 89 f7             	mov    %r14,%rdi
  ...

after ICP:
  40085e:	49 8b 07             	mov    (%r15),%rax
  400861:	4c 89 ff             	mov    %r15,%rdi
  400864:	49 ba e0 0b 40 00 00 	movabs $0x400be0,%r10
  40086b:	00 00 00
  40086e:	4c 3b 10             	cmp    (%rax),%r10
  400871:	75 29                	jne    40089c <main+0x9c>
  400873:	41 ff d2             	callq  *%r10
  400876:	41 83 e6 01          	and    $0x1,%r14d
  40087a:	4d 89 e6             	mov    %r12,%r14
  40087d:	4c 0f 44 f5          	cmove  %rbp,%r14
  400881:	4c 89 f7             	mov    %r14,%rdi
  ...

  40089c:	ff 10                	callq  *(%rax)
  40089e:	eb d6                	jmp    400876 <main+0x76>

(cherry picked from FBD3612218)
2016-09-07 18:59:23 -07:00
Maksim Panchenko 6ff1795d96 [BOLT] Support overwriting jump tables in-place.
Summary:
Add an option to overwrite jump tables without moving and make it a
default:

  -jump-tables   - jump tables support (default=basic)
    =none        -   do not optimize functions with jump tables
    =basic       -   optimize functions with jump tables
    =move        -   move jump tables to a separate section
    =split       - split jump tables section into hot and cold based on
                   function execution frequency
    =aggressive  - aggressively split jump tables section based on usage of
                   the tables

(cherry picked from FBD4448499)
2017-01-17 15:49:59 -08:00
Rafael Auler 6dfd16cb4c Cover RSP-indexed accesses in frame optimization
Summary:
Add a new dataflow analysis to recover the value of RSP at a
given point of the program. This value is expressed as an offset from
the CFA. Use this information to detect redundant load in memory
accesses performed via RSP as well, not only RBP as done previously.
Bail when RSP value (as an offset of the CFA) can't be reliably
determined with a simple dataflow analysis.

(cherry picked from FBD4372261)
2016-12-28 17:09:52 -08:00
Maksim Panchenko 503c741d43 [BOLT] Report stale functions' percentage wrt all profiled functions.
Summary:
Report stale functions percentage with respect to all profiled
functions instead of all simple functions in the binary.
The new reporting format should make it more apparent if the
profile is out-of-date. Compare:

  BOLT-INFO: 341 (16.7% of all profiled) functions have invalid (possibly
stale) profile.

vs old:

  BOLT-INFO: 341 (0.3%)  functions have invalid (possibly stale) profile.

(cherry picked from FBD4451746)
2017-01-23 13:08:40 -08:00
Maksim Panchenko 19859377f8 [BOLT] Fix debug info update for zero-length ranges.
Summary:
Due to a clowntown on my part we were generating wrong ranges
when an empty range was seen on input. We were basically expanding
the range to include all basic blocks following such range and setting
wrong sizes at the same time.

Add "-dump-cu" option to llvm-dwarfdump that allows to look at debug
info of a single compile unit only. Saves time if we are only interested
in a subset of information.

(cherry picked from FBD4430989)
2017-01-18 10:09:54 -08:00
Maksim Panchenko 0894905373 [ICF] Don't re-fold functions in non-relocation mode.
Summary:
In-non relocation mode, when we run ICF the second time,
we fold the same functions again since they were not
removed from the function set. This diff marks them as
folded and ignores them during ICF optimization. Note
that we still want to optimize such functions since they
are potentially called from the code not covered by BOLT
in non-relocation mode.

Folded functions are also excluded from dyno stats with
this diff

Also print the number of times folded functions were called.
When 2 functions -  f1() and f2() are folded, that number
would be min(call_frequency(f1), call_frequency(f2)).

(cherry picked from FBD4399993)
2017-01-10 11:20:56 -08:00
Maksim Panchenko bc8a456309 ICF improvements.
Summary:
Re-worked the way ICF operates. The pass now checks for more than just
call instructions, but also for all references including function
pointers. Jump tables are handled too.

(cherry picked from FBD4372491)
2016-12-21 17:13:56 -08:00
Maksim Panchenko 55fc5417f8 Relocations support for BOLT.
Summary: Read relocation from linker and relocate all functions.

(cherry picked from FBD4223901)
2016-09-27 19:09:38 -07:00
Rafael Auler a75bbfc640 Add a frame optimization pass
Summary:
This is a first attempt to perform data flow analyses on bolt
and try to rebuild the stack frame for functions. The goal of the frame
optimization pass is to detect instructions that are accessing stack and,
if loading values, evaluate whether this load is redundant and we can
substitute the memory operation for a register load or immediate load.
To find opportunities, this pass also builds a map of clobbered registers
by function, so we use this in our analysis at call sites. If a call site
is found out to not clobber a caller-saved register but the caller is
spilling it anyway to the stack (to comply with the ABI), we should
detect these cases and remove this unnecessary move.

(cherry picked from FBD4337238)
2016-12-05 11:47:08 -08:00
Bill Nell 3a3dfc3dc2 BOLT: Use profiling info to control branch simplification optimization.
Summary:
An optimization to simplify conditional tail calls by removing unnecessary branches.  It adds the following two command line options:

  -simplify-conditional-tail-calls  - simplify conditional tail calls by removing unnecessary jumps
  -sctc-mode                        - mode for simplify conditional tail calls
    =always                         -   always perform sctc
    =preserve                       -   only perform sctc when branch direction is preserved
    =heuristic                      -   use branch prediction data to control sctc

This optimization considers both of the following cases:

  foo: ...
       jcc L1   original
       ...
  L1:  jmp bar  # TAILJMP

->

  foo: ...
       jcc bar  iff jcc L1 is expected
       ...

  L1 is unreachable

OR

  foo: ...
       jcc  L2
  L1:  jmp  dest  # TAILJMP
  L2:  ...

->

  foo: jncc dest  # TAILJMP
  L2:  ...

  L1 is unreachable

For this particular case, the first basic block ends with a conditional branch and has two successors, one fall-through and one for when the condition is true.  The target of the conditional is a basic block with a single unconditional branch (i.e. tail call) to another function.  We don't care about the contents of the fall-through block.

(cherry picked from FBD3719617)
2016-09-22 18:08:20 -07:00
Rafael Auler 06caefdb1d Fix typo in time passes
Summary:
Previously NamedRegionTimer's constructor was being called
with no local variable associated with it owing to a typo. We need a
local variable to keep track of the time spent in the scope. At the
end of the scope, the destructor will be called an then the timer will
stop.

(cherry picked from FBD4301844)
2016-12-08 13:34:56 -08:00
Rafael Auler c570038d31 Add option to time passes
Summary:
As we begin to work on optimization passes for bolt, it is important to
keep track of the time spent in each of these to measure their
contribution to the time bolt takes to finish rewriting a program.

(cherry picked from FBD4301136)
2016-12-08 12:15:20 -08:00
Rafael Auler 3888c5604f Remove unused private var in CFIReaderWriter (NFC)
Summary: This member variable is dead.

(cherry picked from FBD4255342)
2016-11-30 16:03:53 -08:00
Rafael Auler 5c0e4b6a57 Fix undefined behavior in DebugInfo
Summary:
The CFI instructions parser in libDebugInfo was relying on
undefined behavior to parse operands by assuming the order function
parameters are evaluated in a function call site is defined (it is
not). This patch fix this and makes our clang and gcc tests agree.
It also fixes wrong LIT tests in our codebase with respect to the
order of DW_CFA_def_cfa operands.

(cherry picked from FBD4255227)
2016-11-30 15:52:24 -08:00
Rafael Auler a331fa396b Fix memory leak in DWARFRewriter
Summary:
Clang's Address Sanitizer caught this leak where MCAsmBackend
and MCObjectWriter instances were being created but not freed. Fix this.

(cherry picked from FBD4249941)
2016-11-29 20:11:32 -08:00
Rafael Auler 5cc9c58410 Avoid const_iterator on std::vector::emplace
Summary:
This is part of a series of clean-up patches to make bolt
cleanly compile with clang 4.0. This patch fixes an error where clang
will fail to compile because it does not support passing a
const_iterator to std::vector<T>::emplace(Iter, ...).

(cherry picked from FBD4242546)
2016-11-28 17:45:25 -08:00
Rafael Auler b21bc02ac4 Remove pessimizing std::move
Summary:
This is part of a series of clean-up patches to make bolt
cleanly compile with clang 4.0. This patch fixes the following warning:
moving a temporary object prevents copy elision

(cherry picked from FBD4242236)
2016-11-28 17:25:17 -08:00
Rafael Auler 7115706d02 Fix clang warning about switch covering all enums
Summary:
This is part of a series of clean-up patches to make bolt
cleanly compile with clang 4.0. This patch fixes the following warning:
default label in switch which covers all enumeration values

(cherry picked from FBD4242168)
2016-11-28 17:17:14 -08:00
Maksim Panchenko ac2621fbf4 Add stats for "-optimize-bodyless-functions".
Summary: Print the number of calls eliminated.

(cherry picked from FBD4010698)
2016-10-12 13:08:52 -07:00
Rafael Auler 8609ad51e5 Detect default CFI frame instructions for the target
Summary:
Make BOLT resilient to changes in the LLVM's X86 target library
by not hardwiring the list of default CIE instructions, but detecting it
at run time.

(cherry picked from FBD4200982)
2016-11-17 14:56:42 -08:00
Maksim Panchenko a7fb610eba Relocate old .eh_frame section next to the new one.
Summary:
In order to improve gdb experience with BOLT we have to make
sure the output file has a single .eh_frame section. Otherwise
gdb will use either old or new section for unwinding purposes.

This diff relocates the original .eh_frame section next to
the new one generated by LLVM. Later we merge two sections
into one and make sure only the newly created section has
.eh_frame name.

(cherry picked from FBD4203943)
2016-11-11 14:33:34 -08:00
Maksim Panchenko 809c28f585 Generate .eh_frame_hdr based on contents of .eh_frame's.
Summary:
We used to patch an existing .eh_frame_hdr and append contents
for split functions at the end. However, this approach does not
work in relocation mode since function addresses change and split
functions will not necessarily be at the end.

Instead of patching and appending we generate the new .eh_frame_hdr
based on contents of old and new .eh_frame sections.

(cherry picked from FBD4180756)
2016-11-14 16:39:55 -08:00
Maksim Panchenko 055dfe48e7 Another EH fix for cold fragments of functions that we fail to write.
Summary:
In a prev diff I disabled inclusion of FDEs for cold fragments that
we fail to write. The side effect of it was that we failed to
write FDE for the next function with a cold fragment since it
had the same assigned address that we had put in FailedAddresses.

The correct fix is to assign zero address to failed cold fragments
and ignore them when we write .eh_frame_hdr.

(cherry picked from FBD4156740)
2016-11-09 11:19:02 -08:00
Rafael Auler 355dbd769e Fix DW_CFA_def_cfa CFI duping in output binary
Summary:
CFI instructions may live in CIEs or FDEs. CIEs hold common
instructions used across many FDEs. When replaying CFIs to the output
binary, llvm-bolt needs to replay both instructions from CIE and the
corresponding FDE for the function. However, some instructions need not
to be replayed because MCStreamer/MCDwarf and friends will write them
by default in the output CIE. This patch fix the code that tried to
recognize one of these default instructions but was failing, resulting
in an extra CFI instruction in each FDE we outputted. With this patch,
the output binary should be a bit smaller.

(cherry picked from FBD4194753)
2016-11-16 17:47:31 -08:00
Rafael Auler bc8cb088c0 Support DWARF expressions in CFI instructions
Summary:
Modify the MC layer (MCDwarf.h|cpp) to understand CFI
instructions dealing with DWARF expressions. Add code to emit DWARF
expressions in MCDwarf. Change llvm-bolt to pass these CFI instructions
to streamer instead of bailing on them. Change -dump-eh-frame option in
llvm-bolt to dump the EH frame of the rewritten binary in addition to
the one in the original binary, allowing us to proper test this patch.

(cherry picked from FBD4194452)
2016-11-15 10:40:00 -08:00
Maksim Panchenko 99dce7d05e Disable processing of functions with EVEX-encoded instructions (AVX-512).
Summary:
AVX-512 disassembler support in LLVM is not quite ready yet.
Before we feel more comfortable about it we disable processing
of all functions that use any EVEX-encoded instructions.

(cherry picked from FBD4028706)
2016-10-16 18:56:56 -07:00
Maksim Panchenko 0eb2559fee Fix EH for cold fragments that we fail to write.
Summary:
When we fail to write functions that are too big, we have to
effectively cancel their effect on exception handling by ignoring
their FDE entries in .eh_frame while writing .eh_frame_hdr.

This can happen to functions that we split too. In such cases
the cold part has its own FDE and we have to ignore that one too.
This doesn't happen very often - I've only seen one case on
hhvm binary, however it is a potential issue. The fix is to
add the cold part address to the list of failed-to-write
addresses.

(cherry picked from FBD3987984)
2016-10-07 09:34:16 -07:00
Maksim Panchenko e241e9c156 New function discovery and support for multiple entries.
Summary:
Modified function discovery process to tolerate more functions and
symbols coming from assembly. The processing order now matches
the memory order of the functions (input symbol table is unsorted).

Added basic support for functions with multiple entries. When
a function references its internal address other than with
a branch instruction, that address could potentially escape.
We mark such addresses as entry points and make sure they
are treated as roots by unreachable code elimination.

Without relocations we have to mark multiple-entry functions
as non-simple.

(cherry picked from FBD3950243)
2016-09-29 11:19:06 -07:00
Maksim Panchenko 9cf5d74ffb Support for PIC-style jump tables.
Summary:
Added support for jump tables in code compiled with "-fpic".
Code pattern generated for position-independent jump tables
is quite different, as is the format of the tables.
More details in comments.

Coverage increased slightly for a test, mostly due to the code
coming from external lib that was compiled with "-fpic".

(cherry picked from FBD3940771)
2016-09-27 19:09:38 -07:00
Bill Nell 4a0c494bc1 BOLT: Remove restrictions on unreachable code elimination
Summary:
Allow UCE when blocks have EH info.  Since UCE may remove blocks
that are referenced from debugging info data structures, we don't
actually delete them.  We just mark them with an "invalid" index
and store them in a different vector to be cleaned up later once
the BinaryFunction is destroyed.  The debugging code just skips
any BBs that have an invalid index.

Eliminating blocks may also expose useless jmp instructions, i.e.
a jmp around a dead block could just be a fallthrough.  I've added
a new routine to cleanup these jmps.  Although, @maks is working on
changing fixBranches() so that it can be used instead.

(cherry picked from FBD3793259)
2016-09-07 18:59:23 -07:00
Maksim Panchenko 4464861a02 Support for splitting jump tables.
Summary:
Add level for "-jump-tables=<n>" option:
  1 - all jump tables are output in the same section (default).
  2 - basic splitting, if the table is used it is output to hot section
      otherwise to cold one.
  3 - aggressively split compound jump tables and collect profile for
      all entries.

Option "-print-jump-tables" outputs all jump tables for debugging
and/or analyzing purposes. Use with "-jump-tables=3" to get profile
values for every entry in a jump table.

(cherry picked from FBD3912119)
2016-09-16 15:54:32 -07:00
Bill Nell ecc4b9e713 BOLT: Add ud2 after indirect tailcalls.
Summary:
Insert ud2 instructions after indirect tailcalls to prevent the CPU from
decoding instructions following the callsite.

A simple counter in the peephole pass shows 3260 tail call traps inserted.

(cherry picked from FBD3859737)
2016-09-13 15:16:11 -07:00
Bill Nell 2f1341b51d BOLT: Refactoring BinaryFunction interface.
Summary:
Get rid of all uses of getIndex/getLayoutIndex/getOffset outside of BinaryFunction.
Also made some other offset related methods private.

(cherry picked from FBD3861968)
2016-09-13 20:32:12 -07:00
Bill Nell 510f227cbd BOLT: Add feature to sort functions by dyno stats.
Summary:
Add -print-sorted-by and -print-sorted-by-order command line options.
The first option takes a list of dyno stats keys used to sort functions
that are printed at the end of all optimization passes.  Only the top
100 functions are printed.  The -print-sorted-by-order option can be
either ascending or descending (descending is the default).

(cherry picked from FBD3898818)
2016-09-20 20:55:49 -07:00
Maksim Panchenko 62bff426c3 Do no collect dyno stats on functions with stale profile.
Summary:
Dyno stats collected on functions with invalid profile may appear
completely bogus. Skip them.

(cherry picked from FBD3879371)
2016-09-16 13:13:16 -07:00
Maksim Panchenko 2c9bf9afd6 Add PLT dyno stats.
Summary: Get PLT call stats.

(cherry picked from FBD3874799)
2016-09-15 15:47:10 -07:00
Maksim Panchenko c4e36c1dd6 Fix issue with zero-size duplicate function symbols.
Summary:
While working on PLT dyno stats I've noticed that we were missing
BinaryFunctions for some symbols that were not PLT. Upon closer inspection
turned out that those symbols were marked as zero-sized functions in
symbol table, but they had duplicates with non-zero size. Since the
zero-size symbols were preceding other duplicates, we were not creating
BinaryFunction for them and they were not added as duplicates.

The 2 most prominent functions that were missing for a test were free() and
malloc().  There's not much to optimize in these functions, but they were
contributing quite significantly to dyno stats.

As a result dyno stats for this test needed an adjustment.

Also several assembly functions (e.g. _init()) had zero size, and now we
set the size to the max size and start processing those. It's good for
coverage but will not affect the performance.

(cherry picked from FBD3874622)
2016-09-15 15:47:10 -07:00
Maksim Panchenko 8dbf0e2b3d Add dyno stats for jump tables.
Summary: Add dyno stats for jump tables.

(cherry picked from FBD3871035)
2016-09-15 10:24:22 -07:00
Maksim Panchenko 2f3a859772 Add experimental jump table support.
Summary:
Option "-jump-tables=1" enables experimental support for jump tables.

The option hasn't been tested with optimizations other than block
re-ordering.

Only non-PIC jump tables are supported at the moment.

(cherry picked from FBD3867849)
2016-09-14 16:45:40 -07:00
Bill Nell 7483cd0fa6 BOLT: Clean up interface between BinaryFunction and BinaryBasicBlock.
Summary:
This is just a bit of refactoring to make sure that BinaryFunction goes
through methods to get at the state in BinaryBasicBlock.  I did this so
that changing the way Index/LayoutIndex/Valid works will be easier.

(cherry picked from FBD3860899)
2016-09-13 17:12:00 -07:00
Maksim Panchenko b0f4031db3 Add cluster randomization layout algorithm.
Summary:
Add "-reorder-blocks=cluster-shuffle" for performance experiments.
Use "-bolt-seed=<N>" to set a randomization seed.

(cherry picked from FBD3851035)
2016-09-11 14:33:58 -07:00
Maksim Panchenko 52bfc3f92f Fix switch table detection. Disassemble all instructions in non-simple functions.
Summary:
Switch table can contain __builtin_unreachable(). As a result,
a compiler may place an entry into a jump table that contains
an address immediately past the last instruction in the function.
Sometimes it may coincide with a start of the next function in
the binary. Thus when we check for switch tables in such cases
we have to check more than a single entry until we see either
an address inside containing function or some address outside
different from the address past the last instruction.

Additonally, don't stop disassembly after discovering that the
function was not simple. We need to detect all outside
references whenever possible.

(cherry picked from FBD3850825)
2016-09-12 10:12:31 -07:00
Bill Nell 861d5a1586 BOLT: Remove double jumps peephole.
Summary:
Replace jumps to other unconditional jumps with the final
destination, e.g.

  B0: ...
      jmp B1  (or jcc B1)

  B1: jmp B2

  ->

  B0: ...
      jmp B2  (or jcc B1)

This peephole removes 8928 double jumps from a test binary.

Note: after filtering out double jumps found in EH code and infinite
loops, the number of double jumps patched is 49 (24 for a clang
compiled test).  The 24 in the clang build are all from external
libraries which have probably been compiled with gcc.  This peephole
is still useful for cleaning up after ICP though.

(cherry picked from FBD3815420)
2016-09-02 18:09:07 -07:00
Maksim Panchenko 617c6a13b7 Use BB.getNumNonPseudos() in more places.
Summary:
Use BB.getNumNonPseudos() in more places.

Fix analyze_potential script to pass the new parameter.

(cherry picked from FBD3844416)
2016-09-09 14:42:35 -07:00
Bill Nell 71be567969 BOLT: Add per pass dyno stats + factor out post pass printing.
Summary:
I've added dyno stats printing per pass so we can see the results
of each optimization pass on the stats.  I've also factored out the
post pass function printing code since it was pretty much the same
after each pass.

(cherry picked from FBD3843587)
2016-09-09 12:37:37 -07:00