Commit Graph

1191 Commits

Author SHA1 Message Date
Maksim Panchenko a99005397f [BOLT] Fix branch count in removeDuplicateConditionalSuccessor().
Summary:
When we merge the original branch counts we have to make sure
both of them have a profile. Otherwise set the count to COUNT_NO_PROFILE.

The misprediction count should be 0.

(cherry picked from FBD4837774)
2017-04-05 13:00:20 -07:00
Bill Nell 6c5c65e3a3 [BOLT] Fix double jump peephole, remove useless conditional branches.
Summary:
I split some of this out from the jumptable diff since it fixes the
double jump peephole.

I've changed the pass manager so that UCE and peepholes are not called
after SCTC.  I've incorporated a call to the double jump fixer to SCTC
since it is needed to fix things up afterwards.

While working on fixing the double jump peephole I discovered a few
useless conditional branches that could be removed as well.  I highly
doubt that removing them will improve perf at all but it does seem
odd to leave in useless conditional branches.

There are also some minor logging improvements.

(cherry picked from FBD4751875)
2017-03-20 22:44:25 -07:00
Maksim Panchenko f7d32f7e7d [BOLT] Detect and reject binaries built for coverage.
Summary: Don't attempt to optimize binaries built with coverage support.

(cherry picked from FBD4810330)
2017-03-31 07:51:30 -07:00
Maksim Panchenko c166a8c1a7 [BOLT] Fix debug info update for inlining.
Summary:
When inlining, if a callee has debug info and a caller does not
(i.e. a containing compilation unit was compiled without "-g"), we try
to update a nonexistent compilation unit. Instead we should skip
updating debug info in such cases.

Minor refactoring of line number emitting code.

(cherry picked from FBD4823982)
2017-04-03 16:24:26 -07:00
Maksim Panchenko 0bde796e50 [BOLT] Organize options in categories for pretty printing (near NFC).
Summary:
Each BOLT-specific option now belongs to BoltCategory or BoltOptCategory.

Use alphabetical order for options in source code (does not affect
output).

The result is a cleaner output of "llvm-bolt -help" which does not
include any unrelated llvm options and is close to the following:

.....

BOLT generic options:

  -data=<string>                                       - <data file>
  -dyno-stats                                          - print execution info based on profile
  -hot-text                                            - hot text symbols support (relocation mode)
  -o=<string>                                          - <output file>
  -relocs                                              - relocation mode - use relocations to move functions in the binary
  -update-debug-sections                               - update DWARF debug sections of the executable
  -use-gnu-stack                                       - use GNU_STACK program header for new segment (workaround for issues with strip/objcopy)
  -use-old-text                                        - re-use space in old .text if possible (relocation mode)
  -v=<uint>                                            - set verbosity level for diagnostic output

BOLT optimization options:

  -align-blocks                                        - try to align BBs inserting nops
  -align-functions=<uint>                              - align functions at a given value (relocation mode)
  -align-functions-max-bytes=<uint>                    - maximum number of bytes to use to align functions
  -boost-macroops                                      - try to boost macro-op fusions by avoiding the cache-line boundary
  -eliminate-unreachable                               - eliminate unreachable code
  -frame-opt                                           - optimize stack frame accesses
  ......

(cherry picked from FBD4793684)
2017-03-28 14:40:20 -07:00
Maksim Panchenko d5a0264a9e [BOLT] Issue error in relocs mode if input is lacking relocations.
Summary:
If we specify "-relocs" flag and an input has no relocations we
proceed with assumptions that relocations were there and break the
binary.

Detect the condition above, and reject the input.

(cherry picked from FBD4761239)
2017-03-22 22:05:50 -07:00
Rafael Auler ad81bd6779 Change dynostats dynamic instruction count policy
Summary:
Also add LOAD/STORE counters.

(cherry picked from FBD4732284)
2017-03-17 10:32:56 -07:00
Bill Nell b1ef186ca9 [BOLT] Don't allow non-symbol targets in ICP
Summary:
ICP was letting through call targets that weren't symbols.  This diff
filters out the non-symbol targets before running ICP.

(cherry picked from FBD4735358)
2017-03-18 11:55:45 -07:00
Maksim Panchenko e6f96de4d0 [BOLT] Add option to print only specific functions.
Summary:
Add option '-print-only=func1,func2,...' to print only functions
of interest. The rest of the functions are still processed and
optimized (e.g. inlined), but only the ones on the list are printed.

(cherry picked from FBD4734610)
2017-03-17 19:05:11 -07:00
Maksim Panchenko 6cfd7ac2d5 [BOLT] Do not overwrite starting address in non-relocation mode.
Summary:
In non-relocation mode we shouldn't attemtp to change ELF
entry point.

What made matters worse - it broke '-max-funcs=' and '-funcs=' options
since an entry function more often than not was excluded from the list
of processed functions, and we were setting entry point to 0.

(cherry picked from FBD4720044)
2017-03-15 19:31:20 -07:00
Maksim Panchenko 559a57a181 [BOLT] Improve dynostats output.
Summary:
Reduce verbosity of dynostats to make them more readable.

  * Don't print "before" dynostats twice.
  * Detect if dynostats have changed after optimization and print
    before/after only if at least one metric have changed. Otherwise
    just print dynostats once and indicate "no change".
  * If any given metric hasn't changed, then print the difference as
    "(=)" as opposed to (+0.0%).

(cherry picked from FBD4705920)
2017-03-14 09:03:23 -07:00
Maksim Panchenko 351af0c895 [BOLT] Do not process empty functions.
Summary:
While running on a recent test binary BOLT failed with an error. We were
trying to process '__hot_end' (which is not really a function), and asserted
that it had no basic blocks.

This diff marks functions with empty basic blocks list as non-simple since
there's no need to process them.

(cherry picked from FBD4696517)
2017-03-12 11:30:05 -07:00
Bill Nell 2e5c2e689f Fix hfsort callgraph stats, add hfsort test.
Summary:
The stats for call sites that are not included in the call graph were broken.
The intention is to count the total number of call sites vs. the number of call sites that are ignored because they have targets that are not BinaryFunctions.

Also add a new test for hfsort.

(cherry picked from FBD4668631)
2017-03-07 11:45:07 -08:00
Maksim Panchenko f4825ea417 [BOLT] Fix gcc5 build.
Summary: A <numeric> include is required for gcc5 build.

(cherry picked from FBD4671953)
2017-03-07 18:09:09 -08:00
Maksim Panchenko 98737b34bb [BOLT] Fix verbose output.
Summary:
Inadvertently, output of BOLT became way too verbose. Discovered while
building HHVM on master.

(cherry picked from FBD4669881)
2017-03-07 14:22:15 -08:00
Bill Nell fed0980139 [BOLT] Update tests
Summary:
Fix validateCFG to handle BBs that were generated from code that used
_builtin_unreachable().
Add -verify-cfg option to run CFG validation after every optimization
pass.

(cherry picked from FBD4641174)
2017-02-27 21:44:38 -08:00
Maksim Panchenko 0acba2bcf0 [BOLT] Detect unmarked data in text.
Summary:
Sometimes a code written in assembly will have unmarked data (such as
constants) embedded into text.

Typically such data falls into a "padding" address space of a function.

This diffs detects such references, and adjusts the padding space to
prevent overwriting of code in data.

Note that in relocation mode we prefer to overwrite the original code
(-use-old-text) and thus cannot simply ignore data in text.

(cherry picked from FBD4662780)
2017-02-21 14:18:09 -08:00
Maksim Panchenko f241e252fc [BOLT] Detect and handle __builtin_unreachable().
Summary:
Calls to __builtin_unreachable() can result in a inconsistent CFG.
It was possible for basic block to end with a conditional branche
and have a single successor. Or there could exist non-terminated
basic block without successors.

We also often treated conditional jumps with destination past the end
of a function as conditional tail calls. This can be prevented
reliably at least when the byte past the end of the function does
not belong to the next function.

This diff includes several changes:
  * At disassembly stage jumps past the end of a function are converted
    into 'nops'. This is done only for cases when we can guarantee that
    the jump is not a tail call. Conversion to nop is required since the
    instruction could be referenced either by exception handling
    tables and/or debug info. Nops are later removed.
  * In CFG insert 'ret' into non-terminated basic blocks without
    successors (this almost never happens).
  * Conditional jumps at the end of the function are removed from
    CFG. The block will still have a single successor.
  * Cases where a destination of a jump instruction is the start
    of the next function, are still conservatively handled as
    (conditional) tail calls.

(cherry picked from FBD4655046)
2017-03-03 11:35:41 -08:00
Maksim Panchenko 6dc2351505 [BOLT] New CFI handling policy.
Summary:
The new interface for handling Call Frame Information:

  * CFI state at any point in a function (in CFG state) is defined by
    CFI state at basic block entry and CFI instructions inside the
    block. The state is independent of basic blocks layout order
    (this is implied by CFG state but wasn't always true in the past).
  * Use BinaryBasicBlock::getCFIStateAtInstr(const MCInst *Inst) to
    get CFI state at any given instruction in the program.
  * No need to call fixCFIState() after any given pass. fixCFIState()
    is called only once during function finalization, and any function
    transformations after that point are prohibited.
  * When introducing new basic blocks, make sure CFI state at entry
    is set correctly and matches CFI instructions in the basic block
    (if any).
  * When splitting basic blocks, use getCFIStateAtInstr() to get
    a state at the split point, and set the new basic block's CFI
    state to this value.

Introduce CFG_Finalized state to indicate that no further optimizations
are allowed on the function. This state is reached after we have synced
CFI instructions and updated EH info.

Rename "-print-after-fixup" option to "-print-finalized".

This diffs fixes CFI for cases when we split conditional tail calls,
and for indirect call promotion optimization.

(cherry picked from FBD4629307)
2017-02-24 21:59:33 -08:00
Rafael Auler 965a373dc4 Fix warnings when compiling with clang (NFC)
Summary:
Fix inconsistent override keyword usages and initializes a
missing field of a Relocation object when using braced initializers.

(cherry picked from FBD4622856)
2017-02-27 13:09:27 -08:00
Maksim Panchenko 2029458f34 [BOLT] Strip 'repz' prefix from 'repz retq'.
Summary:
Add pass to strip 'repz' prefix from 'repz retq' sequence. The prefix
is not used in Intel CPUs afaik. The pass is on by default.

(cherry picked from FBD4610329)
2017-02-23 18:09:10 -08:00
Maksim Panchenko 88a461014b [BOLT] Don't set code skew in relocations mode.
Summary:
We use code skew in non-relocation mode since functions have fixed
addresses, and internal alignment has to be adjusted wrt the skew.
However in relocation mode it interferes with effective code
alignment, and has to be disabled. I missed it when was re-basing
the relocation diff.

(cherry picked from FBD4599670)
2017-02-22 11:29:52 -08:00
Maksim Panchenko d3e33b6edc [BOLT] Fix -jump-tables=basic in relocation mode.
Summary:
In a prev diff I added an option to update jump tables in-place (on by default)
and accidentally broke the default handling of jump tables in relocation
mode. The update should be happening semi-automatically, but because
we ignore relocations for jump tables it wasn't happening (derp).

Since we mostly use '-jump-tables=move' this hasn't been noticed for
some time.

This diff gets rid of IgnoredRelocations and removes relocations
from a relocation set when they are no longer needed. If relocations
are created later for jump tables they are no longer ignored.

(cherry picked from FBD4595159)
2017-02-21 16:15:15 -08:00
Maksim Panchenko 88244a10bb [BOLT] Move BOLT passes under Passes subdirectory (NFC).
Summary:
Move passes under Passes subdirectory.

Move inlining passes under Passes/Inliner.*

(cherry picked from FBD4575832)
2017-02-16 14:57:57 -08:00
Maksim Panchenko f06a1455ea [BOLT] Add support for *GOTPCRELX relocation type.
Summary:
gcc5 can generate new types of relocations that give linker a freedom
to substitute instructions. These relocations are PC-relative, and
since we manually process such relocations they don't present
much of a problem.

Additionally, detect non-pc-relative access from code into a middle of
a function. Occasionally I've seen such code, but don't know exactly
how to trigger its generation. Just issue a warning for now.

(cherry picked from FBD4566473)
2017-02-14 22:55:10 -08:00
Maksim Panchenko 82965b963f [BOLT] Emit short tail calls in relocation mode.
Summary:
To minimize size of the output code we should emit tail calls
that are as short as possible. For this we have to convert a synthetic
TAILJMPd into JMP_1 instruction. This should be one of the last passes
as most of analysis passes could break since tail calls will no longer
be marked as such.

The total size of the code is smaller, but not by much - hot text was
reduced by 192 bytes.

(cherry picked from FBD4557804)
2017-02-13 23:05:12 -08:00
Maksim Panchenko 734a7a5437 [BOLT] Skip disassembly of padding at function end.
Summary:
Some functions coming from assembly may not have been marked
with size. We assume the size to include all bytes up to
the next function/object in the file. As a result,
function body will include any padding inserted by the linker.
If linker inserts 0-value bytes this could be misinterpreted
as invalid instruction and BOLT will bail out on such functions
in non-relocation mode, and give up on a binary in relocation
mode.

This diff detects zero-padding, ignores it, and continues processing
as normal.

(cherry picked from FBD4528893)
2017-02-08 09:14:10 -08:00
Maksim Panchenko 6b0b5bbae7 [BOLT] Reject sanitized binaries.
Summary:
Whenever input binary is suspected to have been sanitized we print an error
message and exit. I've checked that "__asan_init*" symbol
presence is the most conservative way to detect "sanitization".

(cherry picked from FBD4525478)
2017-02-07 15:56:00 -08:00
Maksim Panchenko c89821cee3 [BOLT] Detect and prevent re-optimization attempts.
Summary:
Whenever we try to re-optimize a binary with BOLT we should
issue an error and exit.

(cherry picked from FBD4525228)
2017-02-07 15:31:14 -08:00
Maksim Panchenko e212805ea6 [BOLT] Update section names in output file.
Summary:
Re-write section header string table to reflect new names
given to sections. Old sections get ".bolt.org" prefix.

E.g. when we write ".eh_frame" section, we keep the old copy
but rename it to ".bolt.org.eh_frame".

Note: the new code section is named ".bolt.text" - it contains split
function bodies, while original ".text" name is left unchanged.

(cherry picked from FBD4524935)
2017-02-07 12:20:46 -08:00
Bill Nell d74997c3cc Indirect call promotion optimization.
Summary:
Perform indirect call promotion optimization in BOLT.

The code scans the instructions during CFG creation for all
indirect calls.  Right now indirect tail calls are not handled
since the functions are marked not simple.  The offsets of the
indirect calls are stored for later use by the ICP pass.

The indirect call promotion pass visits each indirect call and
examines the BranchData for each.  If the most frequent targets
from that callsite exceed the specified threshold (default 90%),
the call is promoted.  Otherwise, it is ignored.  By default,
only one target is considered at each callsite.

When an candiate callsite is processed, we modify the callsite
to test for the most common call targets before calling through
the original generic call mechanism.

The CFG and layout are modified by ICP.

A few new command line options have been added:
-indirect-call-promotion
-indirect-call-promotion-threshold=<percentage>
-indirect-call-promotion-topn=<int>

The threshold is the minimum frequency of a call target needed
before ICP is triggered.

The topn option controls the number of targets to consider for
each callsite, e.g. ICP is triggered if topn=2 and the total
requency of the top two call targets exceeds the threshold.

Example of ICP:

C++ code:

  int B_count = 0;
  int C_count = 0;

  struct A { virtual void foo() = 0; }
  struct B : public A { virtual void foo() { ++B_count; }; };
  struct C : public A { virtual void foo() { ++C_count; }; };

  A* a = ...
  a->foo();
  ...

original:
  400863:	49 8b 07             	mov    (%r15),%rax
  400866:	4c 89 ff             	mov    %r15,%rdi
  400869:	ff 10                	callq  *(%rax)
  40086b:	41 83 e6 01          	and    $0x1,%r14d
  40086f:	4d 89 e6             	mov    %r12,%r14
  400872:	4c 0f 44 f5          	cmove  %rbp,%r14
  400876:	4c 89 f7             	mov    %r14,%rdi
  ...

after ICP:
  40085e:	49 8b 07             	mov    (%r15),%rax
  400861:	4c 89 ff             	mov    %r15,%rdi
  400864:	49 ba e0 0b 40 00 00 	movabs $0x400be0,%r10
  40086b:	00 00 00
  40086e:	4c 3b 10             	cmp    (%rax),%r10
  400871:	75 29                	jne    40089c <main+0x9c>
  400873:	41 ff d2             	callq  *%r10
  400876:	41 83 e6 01          	and    $0x1,%r14d
  40087a:	4d 89 e6             	mov    %r12,%r14
  40087d:	4c 0f 44 f5          	cmove  %rbp,%r14
  400881:	4c 89 f7             	mov    %r14,%rdi
  ...

  40089c:	ff 10                	callq  *(%rax)
  40089e:	eb d6                	jmp    400876 <main+0x76>

(cherry picked from FBD3612218)
2016-09-07 18:59:23 -07:00
Maksim Panchenko 6ff1795d96 [BOLT] Support overwriting jump tables in-place.
Summary:
Add an option to overwrite jump tables without moving and make it a
default:

  -jump-tables   - jump tables support (default=basic)
    =none        -   do not optimize functions with jump tables
    =basic       -   optimize functions with jump tables
    =move        -   move jump tables to a separate section
    =split       - split jump tables section into hot and cold based on
                   function execution frequency
    =aggressive  - aggressively split jump tables section based on usage of
                   the tables

(cherry picked from FBD4448499)
2017-01-17 15:49:59 -08:00
Rafael Auler 6dfd16cb4c Cover RSP-indexed accesses in frame optimization
Summary:
Add a new dataflow analysis to recover the value of RSP at a
given point of the program. This value is expressed as an offset from
the CFA. Use this information to detect redundant load in memory
accesses performed via RSP as well, not only RBP as done previously.
Bail when RSP value (as an offset of the CFA) can't be reliably
determined with a simple dataflow analysis.

(cherry picked from FBD4372261)
2016-12-28 17:09:52 -08:00
Maksim Panchenko 503c741d43 [BOLT] Report stale functions' percentage wrt all profiled functions.
Summary:
Report stale functions percentage with respect to all profiled
functions instead of all simple functions in the binary.
The new reporting format should make it more apparent if the
profile is out-of-date. Compare:

  BOLT-INFO: 341 (16.7% of all profiled) functions have invalid (possibly
stale) profile.

vs old:

  BOLT-INFO: 341 (0.3%)  functions have invalid (possibly stale) profile.

(cherry picked from FBD4451746)
2017-01-23 13:08:40 -08:00
Maksim Panchenko 19859377f8 [BOLT] Fix debug info update for zero-length ranges.
Summary:
Due to a clowntown on my part we were generating wrong ranges
when an empty range was seen on input. We were basically expanding
the range to include all basic blocks following such range and setting
wrong sizes at the same time.

Add "-dump-cu" option to llvm-dwarfdump that allows to look at debug
info of a single compile unit only. Saves time if we are only interested
in a subset of information.

(cherry picked from FBD4430989)
2017-01-18 10:09:54 -08:00
Maksim Panchenko 0894905373 [ICF] Don't re-fold functions in non-relocation mode.
Summary:
In-non relocation mode, when we run ICF the second time,
we fold the same functions again since they were not
removed from the function set. This diff marks them as
folded and ignores them during ICF optimization. Note
that we still want to optimize such functions since they
are potentially called from the code not covered by BOLT
in non-relocation mode.

Folded functions are also excluded from dyno stats with
this diff

Also print the number of times folded functions were called.
When 2 functions -  f1() and f2() are folded, that number
would be min(call_frequency(f1), call_frequency(f2)).

(cherry picked from FBD4399993)
2017-01-10 11:20:56 -08:00
Maksim Panchenko bc8a456309 ICF improvements.
Summary:
Re-worked the way ICF operates. The pass now checks for more than just
call instructions, but also for all references including function
pointers. Jump tables are handled too.

(cherry picked from FBD4372491)
2016-12-21 17:13:56 -08:00
Maksim Panchenko 55fc5417f8 Relocations support for BOLT.
Summary: Read relocation from linker and relocate all functions.

(cherry picked from FBD4223901)
2016-09-27 19:09:38 -07:00
Rafael Auler a75bbfc640 Add a frame optimization pass
Summary:
This is a first attempt to perform data flow analyses on bolt
and try to rebuild the stack frame for functions. The goal of the frame
optimization pass is to detect instructions that are accessing stack and,
if loading values, evaluate whether this load is redundant and we can
substitute the memory operation for a register load or immediate load.
To find opportunities, this pass also builds a map of clobbered registers
by function, so we use this in our analysis at call sites. If a call site
is found out to not clobber a caller-saved register but the caller is
spilling it anyway to the stack (to comply with the ABI), we should
detect these cases and remove this unnecessary move.

(cherry picked from FBD4337238)
2016-12-05 11:47:08 -08:00
Bill Nell 3a3dfc3dc2 BOLT: Use profiling info to control branch simplification optimization.
Summary:
An optimization to simplify conditional tail calls by removing unnecessary branches.  It adds the following two command line options:

  -simplify-conditional-tail-calls  - simplify conditional tail calls by removing unnecessary jumps
  -sctc-mode                        - mode for simplify conditional tail calls
    =always                         -   always perform sctc
    =preserve                       -   only perform sctc when branch direction is preserved
    =heuristic                      -   use branch prediction data to control sctc

This optimization considers both of the following cases:

  foo: ...
       jcc L1   original
       ...
  L1:  jmp bar  # TAILJMP

->

  foo: ...
       jcc bar  iff jcc L1 is expected
       ...

  L1 is unreachable

OR

  foo: ...
       jcc  L2
  L1:  jmp  dest  # TAILJMP
  L2:  ...

->

  foo: jncc dest  # TAILJMP
  L2:  ...

  L1 is unreachable

For this particular case, the first basic block ends with a conditional branch and has two successors, one fall-through and one for when the condition is true.  The target of the conditional is a basic block with a single unconditional branch (i.e. tail call) to another function.  We don't care about the contents of the fall-through block.

(cherry picked from FBD3719617)
2016-09-22 18:08:20 -07:00
Rafael Auler 06caefdb1d Fix typo in time passes
Summary:
Previously NamedRegionTimer's constructor was being called
with no local variable associated with it owing to a typo. We need a
local variable to keep track of the time spent in the scope. At the
end of the scope, the destructor will be called an then the timer will
stop.

(cherry picked from FBD4301844)
2016-12-08 13:34:56 -08:00
Rafael Auler c570038d31 Add option to time passes
Summary:
As we begin to work on optimization passes for bolt, it is important to
keep track of the time spent in each of these to measure their
contribution to the time bolt takes to finish rewriting a program.

(cherry picked from FBD4301136)
2016-12-08 12:15:20 -08:00
Rafael Auler 3888c5604f Remove unused private var in CFIReaderWriter (NFC)
Summary: This member variable is dead.

(cherry picked from FBD4255342)
2016-11-30 16:03:53 -08:00
Rafael Auler 5c0e4b6a57 Fix undefined behavior in DebugInfo
Summary:
The CFI instructions parser in libDebugInfo was relying on
undefined behavior to parse operands by assuming the order function
parameters are evaluated in a function call site is defined (it is
not). This patch fix this and makes our clang and gcc tests agree.
It also fixes wrong LIT tests in our codebase with respect to the
order of DW_CFA_def_cfa operands.

(cherry picked from FBD4255227)
2016-11-30 15:52:24 -08:00
Rafael Auler a331fa396b Fix memory leak in DWARFRewriter
Summary:
Clang's Address Sanitizer caught this leak where MCAsmBackend
and MCObjectWriter instances were being created but not freed. Fix this.

(cherry picked from FBD4249941)
2016-11-29 20:11:32 -08:00
Rafael Auler 5cc9c58410 Avoid const_iterator on std::vector::emplace
Summary:
This is part of a series of clean-up patches to make bolt
cleanly compile with clang 4.0. This patch fixes an error where clang
will fail to compile because it does not support passing a
const_iterator to std::vector<T>::emplace(Iter, ...).

(cherry picked from FBD4242546)
2016-11-28 17:45:25 -08:00
Rafael Auler b21bc02ac4 Remove pessimizing std::move
Summary:
This is part of a series of clean-up patches to make bolt
cleanly compile with clang 4.0. This patch fixes the following warning:
moving a temporary object prevents copy elision

(cherry picked from FBD4242236)
2016-11-28 17:25:17 -08:00
Rafael Auler 7115706d02 Fix clang warning about switch covering all enums
Summary:
This is part of a series of clean-up patches to make bolt
cleanly compile with clang 4.0. This patch fixes the following warning:
default label in switch which covers all enumeration values

(cherry picked from FBD4242168)
2016-11-28 17:17:14 -08:00
Maksim Panchenko ac2621fbf4 Add stats for "-optimize-bodyless-functions".
Summary: Print the number of calls eliminated.

(cherry picked from FBD4010698)
2016-10-12 13:08:52 -07:00
Rafael Auler 8609ad51e5 Detect default CFI frame instructions for the target
Summary:
Make BOLT resilient to changes in the LLVM's X86 target library
by not hardwiring the list of default CIE instructions, but detecting it
at run time.

(cherry picked from FBD4200982)
2016-11-17 14:56:42 -08:00
Maksim Panchenko a7fb610eba Relocate old .eh_frame section next to the new one.
Summary:
In order to improve gdb experience with BOLT we have to make
sure the output file has a single .eh_frame section. Otherwise
gdb will use either old or new section for unwinding purposes.

This diff relocates the original .eh_frame section next to
the new one generated by LLVM. Later we merge two sections
into one and make sure only the newly created section has
.eh_frame name.

(cherry picked from FBD4203943)
2016-11-11 14:33:34 -08:00
Maksim Panchenko 809c28f585 Generate .eh_frame_hdr based on contents of .eh_frame's.
Summary:
We used to patch an existing .eh_frame_hdr and append contents
for split functions at the end. However, this approach does not
work in relocation mode since function addresses change and split
functions will not necessarily be at the end.

Instead of patching and appending we generate the new .eh_frame_hdr
based on contents of old and new .eh_frame sections.

(cherry picked from FBD4180756)
2016-11-14 16:39:55 -08:00
Maksim Panchenko 055dfe48e7 Another EH fix for cold fragments of functions that we fail to write.
Summary:
In a prev diff I disabled inclusion of FDEs for cold fragments that
we fail to write. The side effect of it was that we failed to
write FDE for the next function with a cold fragment since it
had the same assigned address that we had put in FailedAddresses.

The correct fix is to assign zero address to failed cold fragments
and ignore them when we write .eh_frame_hdr.

(cherry picked from FBD4156740)
2016-11-09 11:19:02 -08:00
Rafael Auler 355dbd769e Fix DW_CFA_def_cfa CFI duping in output binary
Summary:
CFI instructions may live in CIEs or FDEs. CIEs hold common
instructions used across many FDEs. When replaying CFIs to the output
binary, llvm-bolt needs to replay both instructions from CIE and the
corresponding FDE for the function. However, some instructions need not
to be replayed because MCStreamer/MCDwarf and friends will write them
by default in the output CIE. This patch fix the code that tried to
recognize one of these default instructions but was failing, resulting
in an extra CFI instruction in each FDE we outputted. With this patch,
the output binary should be a bit smaller.

(cherry picked from FBD4194753)
2016-11-16 17:47:31 -08:00
Rafael Auler bc8cb088c0 Support DWARF expressions in CFI instructions
Summary:
Modify the MC layer (MCDwarf.h|cpp) to understand CFI
instructions dealing with DWARF expressions. Add code to emit DWARF
expressions in MCDwarf. Change llvm-bolt to pass these CFI instructions
to streamer instead of bailing on them. Change -dump-eh-frame option in
llvm-bolt to dump the EH frame of the rewritten binary in addition to
the one in the original binary, allowing us to proper test this patch.

(cherry picked from FBD4194452)
2016-11-15 10:40:00 -08:00
Maksim Panchenko 99dce7d05e Disable processing of functions with EVEX-encoded instructions (AVX-512).
Summary:
AVX-512 disassembler support in LLVM is not quite ready yet.
Before we feel more comfortable about it we disable processing
of all functions that use any EVEX-encoded instructions.

(cherry picked from FBD4028706)
2016-10-16 18:56:56 -07:00
Maksim Panchenko 0eb2559fee Fix EH for cold fragments that we fail to write.
Summary:
When we fail to write functions that are too big, we have to
effectively cancel their effect on exception handling by ignoring
their FDE entries in .eh_frame while writing .eh_frame_hdr.

This can happen to functions that we split too. In such cases
the cold part has its own FDE and we have to ignore that one too.
This doesn't happen very often - I've only seen one case on
hhvm binary, however it is a potential issue. The fix is to
add the cold part address to the list of failed-to-write
addresses.

(cherry picked from FBD3987984)
2016-10-07 09:34:16 -07:00
Maksim Panchenko e241e9c156 New function discovery and support for multiple entries.
Summary:
Modified function discovery process to tolerate more functions and
symbols coming from assembly. The processing order now matches
the memory order of the functions (input symbol table is unsorted).

Added basic support for functions with multiple entries. When
a function references its internal address other than with
a branch instruction, that address could potentially escape.
We mark such addresses as entry points and make sure they
are treated as roots by unreachable code elimination.

Without relocations we have to mark multiple-entry functions
as non-simple.

(cherry picked from FBD3950243)
2016-09-29 11:19:06 -07:00
Maksim Panchenko 9cf5d74ffb Support for PIC-style jump tables.
Summary:
Added support for jump tables in code compiled with "-fpic".
Code pattern generated for position-independent jump tables
is quite different, as is the format of the tables.
More details in comments.

Coverage increased slightly for a test, mostly due to the code
coming from external lib that was compiled with "-fpic".

(cherry picked from FBD3940771)
2016-09-27 19:09:38 -07:00
Bill Nell 4a0c494bc1 BOLT: Remove restrictions on unreachable code elimination
Summary:
Allow UCE when blocks have EH info.  Since UCE may remove blocks
that are referenced from debugging info data structures, we don't
actually delete them.  We just mark them with an "invalid" index
and store them in a different vector to be cleaned up later once
the BinaryFunction is destroyed.  The debugging code just skips
any BBs that have an invalid index.

Eliminating blocks may also expose useless jmp instructions, i.e.
a jmp around a dead block could just be a fallthrough.  I've added
a new routine to cleanup these jmps.  Although, @maks is working on
changing fixBranches() so that it can be used instead.

(cherry picked from FBD3793259)
2016-09-07 18:59:23 -07:00
Maksim Panchenko 4464861a02 Support for splitting jump tables.
Summary:
Add level for "-jump-tables=<n>" option:
  1 - all jump tables are output in the same section (default).
  2 - basic splitting, if the table is used it is output to hot section
      otherwise to cold one.
  3 - aggressively split compound jump tables and collect profile for
      all entries.

Option "-print-jump-tables" outputs all jump tables for debugging
and/or analyzing purposes. Use with "-jump-tables=3" to get profile
values for every entry in a jump table.

(cherry picked from FBD3912119)
2016-09-16 15:54:32 -07:00
Bill Nell ecc4b9e713 BOLT: Add ud2 after indirect tailcalls.
Summary:
Insert ud2 instructions after indirect tailcalls to prevent the CPU from
decoding instructions following the callsite.

A simple counter in the peephole pass shows 3260 tail call traps inserted.

(cherry picked from FBD3859737)
2016-09-13 15:16:11 -07:00
Bill Nell 2f1341b51d BOLT: Refactoring BinaryFunction interface.
Summary:
Get rid of all uses of getIndex/getLayoutIndex/getOffset outside of BinaryFunction.
Also made some other offset related methods private.

(cherry picked from FBD3861968)
2016-09-13 20:32:12 -07:00
Bill Nell 510f227cbd BOLT: Add feature to sort functions by dyno stats.
Summary:
Add -print-sorted-by and -print-sorted-by-order command line options.
The first option takes a list of dyno stats keys used to sort functions
that are printed at the end of all optimization passes.  Only the top
100 functions are printed.  The -print-sorted-by-order option can be
either ascending or descending (descending is the default).

(cherry picked from FBD3898818)
2016-09-20 20:55:49 -07:00
Maksim Panchenko 62bff426c3 Do no collect dyno stats on functions with stale profile.
Summary:
Dyno stats collected on functions with invalid profile may appear
completely bogus. Skip them.

(cherry picked from FBD3879371)
2016-09-16 13:13:16 -07:00
Maksim Panchenko 2c9bf9afd6 Add PLT dyno stats.
Summary: Get PLT call stats.

(cherry picked from FBD3874799)
2016-09-15 15:47:10 -07:00
Maksim Panchenko c4e36c1dd6 Fix issue with zero-size duplicate function symbols.
Summary:
While working on PLT dyno stats I've noticed that we were missing
BinaryFunctions for some symbols that were not PLT. Upon closer inspection
turned out that those symbols were marked as zero-sized functions in
symbol table, but they had duplicates with non-zero size. Since the
zero-size symbols were preceding other duplicates, we were not creating
BinaryFunction for them and they were not added as duplicates.

The 2 most prominent functions that were missing for a test were free() and
malloc().  There's not much to optimize in these functions, but they were
contributing quite significantly to dyno stats.

As a result dyno stats for this test needed an adjustment.

Also several assembly functions (e.g. _init()) had zero size, and now we
set the size to the max size and start processing those. It's good for
coverage but will not affect the performance.

(cherry picked from FBD3874622)
2016-09-15 15:47:10 -07:00
Maksim Panchenko 8dbf0e2b3d Add dyno stats for jump tables.
Summary: Add dyno stats for jump tables.

(cherry picked from FBD3871035)
2016-09-15 10:24:22 -07:00
Maksim Panchenko 2f3a859772 Add experimental jump table support.
Summary:
Option "-jump-tables=1" enables experimental support for jump tables.

The option hasn't been tested with optimizations other than block
re-ordering.

Only non-PIC jump tables are supported at the moment.

(cherry picked from FBD3867849)
2016-09-14 16:45:40 -07:00
Bill Nell 7483cd0fa6 BOLT: Clean up interface between BinaryFunction and BinaryBasicBlock.
Summary:
This is just a bit of refactoring to make sure that BinaryFunction goes
through methods to get at the state in BinaryBasicBlock.  I did this so
that changing the way Index/LayoutIndex/Valid works will be easier.

(cherry picked from FBD3860899)
2016-09-13 17:12:00 -07:00
Maksim Panchenko b0f4031db3 Add cluster randomization layout algorithm.
Summary:
Add "-reorder-blocks=cluster-shuffle" for performance experiments.
Use "-bolt-seed=<N>" to set a randomization seed.

(cherry picked from FBD3851035)
2016-09-11 14:33:58 -07:00
Maksim Panchenko 52bfc3f92f Fix switch table detection. Disassemble all instructions in non-simple functions.
Summary:
Switch table can contain __builtin_unreachable(). As a result,
a compiler may place an entry into a jump table that contains
an address immediately past the last instruction in the function.
Sometimes it may coincide with a start of the next function in
the binary. Thus when we check for switch tables in such cases
we have to check more than a single entry until we see either
an address inside containing function or some address outside
different from the address past the last instruction.

Additonally, don't stop disassembly after discovering that the
function was not simple. We need to detect all outside
references whenever possible.

(cherry picked from FBD3850825)
2016-09-12 10:12:31 -07:00
Bill Nell 861d5a1586 BOLT: Remove double jumps peephole.
Summary:
Replace jumps to other unconditional jumps with the final
destination, e.g.

  B0: ...
      jmp B1  (or jcc B1)

  B1: jmp B2

  ->

  B0: ...
      jmp B2  (or jcc B1)

This peephole removes 8928 double jumps from a test binary.

Note: after filtering out double jumps found in EH code and infinite
loops, the number of double jumps patched is 49 (24 for a clang
compiled test).  The 24 in the clang build are all from external
libraries which have probably been compiled with gcc.  This peephole
is still useful for cleaning up after ICP though.

(cherry picked from FBD3815420)
2016-09-02 18:09:07 -07:00
Maksim Panchenko 617c6a13b7 Use BB.getNumNonPseudos() in more places.
Summary:
Use BB.getNumNonPseudos() in more places.

Fix analyze_potential script to pass the new parameter.

(cherry picked from FBD3844416)
2016-09-09 14:42:35 -07:00
Bill Nell 71be567969 BOLT: Add per pass dyno stats + factor out post pass printing.
Summary:
I've added dyno stats printing per pass so we can see the results
of each optimization pass on the stats.  I've also factored out the
post pass function printing code since it was pretty much the same
after each pass.

(cherry picked from FBD3843587)
2016-09-09 12:37:37 -07:00
Maksim Panchenko c4c518ee9d Rewrite SCTC pass to do UCE and make it the last optimization pass.
Summary:
For now we make SCTC a special pass that runs at the end of all
optimizations and transformations right after fixupBranches().

Since it's the last pass, it has to do its own UCE.

(cherry picked from FBD3838051)
2016-09-08 14:52:26 -07:00
Maksim Panchenko 6bef336cc2 Add dyno stats to BOLT.
Summary:
Add "-dyno-stats" option that prints instruction stats based on
the execution profile similar to below:

BOLT-INFO: program-wide dynostats after optimizations:
  executed forward branches : 109706407 (+8.1%)
  taken forward branches : 13769074 (-55.5%)
  executed backward branches : 24517582 (-25.0%)
  taken backward branches : 15330256 (-27.2%)
  executed unconditional branches : 6009826 (-35.5%)
  function calls : 17192114 (+0.0%)
  executed instructions : 837733057 (-0.4%)
  total branches : 140233815 (-2.3%)
  taken branches : 35109156 (-42.8%)

Also fixed pseudo instruction discrepancies and added assertions
for BinaryBasicBlock::getNumPseudos() to make sure the number is
synchronized with real number of pseudo instructions.

(cherry picked from FBD3826995)
2016-08-29 21:11:22 -07:00
Maksim Panchenko 17e691915b Make BinaryFunction::fixBranches() more flexible and support CFG updates.
Summary:
The CFG represents "the ultimate source of truth". Transformations
on functions and blocks have to update the CFG and fixBranches() would
make sure the correct branch instructions are inserted at the end of
basic blocks (or removed when necessary).

We do require a conditional branch at the end of the basic block if
the block has 2 successors as CFG currently lacks the conditional
code support (it will probably stay that way). We only use this
branch instruction for its conditional code, the destination is
determined by CFG - first successor representing true/taken branch,
while the second successor - false/fall-through branch.

When we reverse the branch condition, the CFG is updated accordingly.

The previous version used to insert jumps after some terminating
instructions sometimes resulting in a larger code than needed. As a
result with the new version 1 extra function becomes overwritten for
HHVM binary.

With this diff we also convert conditional branches with one successor
(result of code from __builtin_unreachable()) into unconditional
jumps.

(cherry picked from FBD3802062)
2016-08-29 21:11:22 -07:00
Bill Nell 48b55300e0 BOLT: Make most command line options ZeroOrMore.
Summary:
This will make it easier to run experiments with the same baseline
BOLT binary but different command line options.

(cherry picked from FBD3831978)
2016-09-07 14:41:56 -07:00
Bill Nell dcaffe64d3 Inlining fixes/enhancements
Summary:
A number of fixes/enhancements to inline-small-functions
- Fixed size estimateHotSize to use computeCodeSize instead of the original layout offsets.
- Added -print-inline option to dump CFGs for functions that have been modified by inlining.
- Added flag to force consideration of functions without any profiling info (mostly for testing)
- Updated debug line info for inlined functions.
- Ignore the number of pseudo instructions when checking for candidates of suitable size.

Misc changes
- Moved most print flags to BinaryPasses.cpp

(cherry picked from FBD3812658)
2016-09-02 11:58:53 -07:00
Maksim Panchenko 1cf200107e Fix tail call conversion and test cases.
Summary:
A previous diff accidentally disabled tail call conversion.

Additionally some test cases relied on output of "-v=2". Fix those.

(cherry picked from FBD3823760)
2016-09-06 13:19:26 -07:00
Bill Nell c27a6a5c63 Add verbosity level and clean up stream usage.
Summary:
I've added a verbosity level to help keep the BOLT spewage to a minimum.
The default level is pretty terse now, level 1 is closer to the original,
I've saved level 2 for the noisiest of messages.  Error messages should
never be suppressed by the verbosity level only warnings and info messages.

The rational behind stream usage is as follows:
outs() for info and debugging controlled by command line flags.
errs() for errors and warnings.
dbgs() for output within DEBUG().

With the exception of a few of the level 2 messages I don't have any strong feelings about the others.

(cherry picked from FBD3814259)
2016-09-02 14:15:29 -07:00
Maksim Panchenko 43acb6a28a Emit remember_state CFI in the same code region as restore_state.
Summary:
While creating remember_state/restore_state CFI sequences, we
were always placing remember_state instruction into the first
basic block. However, when we have hot-cold splitting, the cold
part has and independent FDE entry in .eh_frame, and thus the
restore_state instruction was missing its counter part.

The fix is to adjust the basic block that is used for placing
remember_state instruction whenever we see the hot-cold split
boundary.

(cherry picked from FBD3767102)
2016-08-24 14:25:33 -07:00
Maksim Panchenko 97f598fd17 Handling for indirect tail calls.
Summary:
Analyze indirect branches and convert them into indirect
tail calls when possible. We analyze the memory contents
when the address could be calculated statically and also
detect epilogue code.

(cherry picked from FBD3754395)
2016-08-22 14:24:09 -07:00
Maksim Panchenko 42c5894fe2 Write padding for .eh_frame_hdr to a file.
Summary:
We were applying padding to the calculated address but were never
writing it to a file triggering an assertion for cases when
.gcc_except_table size wasn't multiple of 4.

(cherry picked from FBD3744638)
2016-08-19 13:54:35 -07:00
Maksim Panchenko a10fb73ab3 Compute ClusterEdges only when necessary.
Summary:
We only need ClusterEdges in reordering algorithm optimized for
branches and the computation is quite resource-hungry, thus it
makes sense to only do it when needed.

Some refactoring too.

(cherry picked from FBD3721107)
2016-08-15 15:37:00 -07:00
Bill Nell c1d1c2e7cd Check if operands are immediates before trying shortening.
Summary:
Operands in the initial instruction stream should all have immediate operands
for instructions that can be shortened.  But if a BOLT optimization pass adds
one of these instructions with a symbolic operand, the shortening operation
will assert.  This diff adds checks to make sure that the operands are
immediate.

I've also disabled shortening pass by default since it won't really be needed
until ICP is submitted.  It will still run at CFG creation time.

(cherry picked from FBD3610646)
2016-07-22 20:52:57 -07:00
Bill Nell 406aa62083 Add additional info to BOLT graphviz CFG dumps.
Summary:
Add the following info the graphviz CFG dump:
- Edges are labeled with the jmp instruction that leads to that edge.
- Edges include the count and misprediction count.
- Nodes have (offset, BB index, BB layout index)
- Nodes optionally have tooltips which contain the code of the basic block.
  (enabled with -dot-tooltip-code)
- Added dashed edges to landing pads.

(cherry picked from FBD3646568)
2016-07-29 19:18:37 -07:00
Maksim Panchenko 003d106c0b More refactoring work.
Summary:
Avoid referring to BinaryFunction's by name.

Functions could be found by MCSymbol using
BinaryContext::getFunctionForSymbol().

(cherry picked from FBD3707685)
2016-08-11 14:23:54 -07:00
Maksim Panchenko 36df6057b0 Refactoring. Mainly NFC.
Summary:
Eliminated BinaryFunction::getName(). The function was confusing since
the name is ambigous. Instead we have BinaryFunction::getPrintName()
used for printing and whenever unique string identifier is needed
one can use getSymbol()->getName(). In the next diff I'll have
a map from MCSymbol to BinaryFunction in BinaryContext to facilitate
function lookup from instruction operand expressions.

There's one bug fixed where the function was called only under assert()
in ICF::foldFunction().

For output we update all symbols associated with the function. At the
moment it has no effect on the generated binary but in the future we
would like to have all symbols in the symbol table updated.

(cherry picked from FBD3704790)
2016-08-07 12:35:23 -07:00
Theodoros Kasampalis 32739247eb More aggressive inlining pass
Summary:
This adds functionality for a more aggressive inlining pass, that can
inline tail calls and functions with more than one basic block.

(cherry picked from FBD3677856)
2016-07-29 14:17:06 -07:00
Bill Nell 82d76ae18b Add MCInst annotation mechanism to MCInstrAnalysis class.
Summary:
Add three new MCOperand types: Annotation, LandingPad and GnuArgsSize.

Annotation is used for associating random data with MCInsts.  Clients can
construct their own annotation types (subclassed from MCAnnotation) and
associate them with instructions.  Annotations are looked up by string keys.

Annotations can be added, removed and queried using an instance of the
MCInstrAnalysis class.

The LandingPad operand is a MCSymbol, uint64_t pair used to encode exception
handling information for call instructions.

GnuArgsSize is used to annotate calls with the DW_CFA_GNU_args_size attribute.

(cherry picked from FBD3597877)
2016-07-28 10:34:50 -07:00
Theodoros Kasampalis 713e361f36 Fix for correct disassembling of conditional tail calls.
Summary:
BOLT attempts to convert jumps that serve as tail calls to dedicated tail call
instructions, but this is impossible when the jump is conditional because there is
no corresponding tail call instruction. This was causing the creation of a duplicate
fall-through edge for basic blocks terminated with a conditional jump serving as
a tail call when there is profile data available for the non-taken branch. In this
case, the first fall-through edge had a count taken from the profile data, while
the second has a count computed (incorrectly) by
BinaryFunction::inferFallThroughCounts.

(cherry picked from FBD3560504)
2016-07-13 18:57:40 -07:00
Maksim Panchenko 486ab273c7 Add printing support for indirect tail calls.
Summary:
LLVM was missing assembler print string for indirect tail
calls which are synthetic instructions created by us.

(cherry picked from FBD3640197)
2016-07-28 18:49:48 -07:00
Bill Nell 50e011f4e5 CFG editing functions
Summary:
This diff adds a number of methods to BinaryFunction that can be used to edit the CFG after it is created.

The basic public functions are:
  - createBasicBlock - create a new block that is not inserted into the CFG.
  - insertBasicBlocks - insert a range of blocks (made with createBasicBlock) into the CFG.
  - updateLayout - update the CFG layout (either by inserting new blocks at a certain point or recomputing the entire layout).
  - fixFallthroughBranch - add a direct jump to the fallthrough successor for a given block.

There are a number of private helper functions used to implement the above.

This was split off the ICP diff to simplify it a bit.

(cherry picked from FBD3611313)
2016-07-23 12:50:34 -07:00
Theodoros Kasampalis ab599fe71a Basic block clustering algorithm for minimizing branches.
Summary:
This algorithm is similar to our main clustering algorithm but uses
a different heuristic for selecting edges to become fall-throughs.
The weight of an edge is calculated as the win in branches if we choose
to layout this edge as a fall-through. For example, the edges A -> B with
execution count 100 and A -> C with execution count 500 (where B and C
are the only successors of A) have weights -400 and +400 respectively.

(cherry picked from FBD3606591)
2016-07-15 16:11:30 -07:00
Theodoros Kasampalis a9bb3320ad Identical Code Folding (ICF) pass
Summary:
Added an ICF pass to BOLT, that can recognize identical functions
and replace references to these functions with references to just one
representative.

(cherry picked from FBD3460297)
2016-06-09 11:36:55 -07:00
Bill Nell 82401630a2 Factor out instruction printing and size computation.
Summary:
I've factored out the instruction printing and size computation routines to
methods on BinaryContext.  I've also added some more debug print functions.

This was split off the ICP diff to simplify it a bit.

(cherry picked from FBD3610690)
2016-07-23 08:01:53 -07:00
Theodoros Kasampalis 156a55209c Simplification of loads from read-only data sections.
Summary:
Instructions that load data from the a read-only data section and their
target address can be computed statically (e.g. RIP-relative addressing)
are modified to corresponding instructions that use immediate operands.
We apply the transformation only when the resulting instruction will have
smaller or equal size.

(cherry picked from FBD3397112)
2016-06-03 00:58:11 -07:00
Theodoros Kasampalis 17b846586c Loop detection for BOLT's CFG.
Summary:
Loop detection for the CFG data structure. Added a GraphTraits
specialization for BOLT's CFG that allows us to use LLVM's loop
detection interface.

(cherry picked from FBD3604837)
2016-05-26 10:58:01 -07:00
Bill Nell ea53cffb2d Add movabs -> mov shortening optimization. Add peephole optimization pass that does instruction shortening.
Summary:
Shorten when a mov instruction has a 64-bit immediate that can be repesented as
a sign extended 32-bit number, use the smaller mov instruction (MOV64ri -> MOV64ri32).

Add peephole optimization pass that does instruction shortening.

(cherry picked from FBD3603099)
2016-07-21 16:40:06 -07:00
Maksim Panchenko c6d0c568d4 Add BinaryContext::getSectionForAddress()
Summary: Interface for accessing section from BinaryContext.

(cherry picked from FBD3600854)
2016-07-21 12:45:35 -07:00
Maksim Panchenko f2d82919d0 Move debug-handling code into DWARFRewriter (NFC).
Summary: RewriteInstance.cpp is getting too big. Split the code.

(cherry picked from FBD3596103)
2016-05-31 19:12:26 -07:00
Maksim Panchenko bf46263eed Shorten instructions if possible.
Summary:
Generate short versions of branch instructions by default and rely on
relaxation to produce longer versions when needed.

Also produce short versions of arithmetic instructions if immediate
fits into one byte. This was only triggered once on HHVM binary.

(cherry picked from FBD3591466)
2016-07-19 11:19:18 -07:00
Bill Nell 674dbcc0de Fix crash in patchELFPHDRTable when no functions are modified.
Summary:
patchELFPHDRTable was asserting that it could not find an entry
for .eh_frame_hdr in SectionMapInfo when no functions were modified
by BOLT.

This just changes code to skip modifying GNU_EH_FRAME program headers
hen SectionMapInfo is empty.  The existing header is copied and written
instead.

(cherry picked from FBD3557481)
2016-07-12 16:43:53 -07:00
Maksim Panchenko 84b5b9e462 Create alternative name for local symbols.
Summary:
If a profile data was collected on a stripped binary but an input
to BOLT is unstripped, we would use a different mangling scheme for
local functions and ignore their profiles. To solve the issue this
diff adds alternative name for all local functions such that one
of the names would match the name in the profile.

If the input binary was stripped, we reject it, unless "-allow-stripped"
option was passed. It's more complicated to do a matching in this case
since we have less information than at the time of profile collection.
It's also not that simple to tell if the profile was gathered on a
stripped binary (in which case we would have no issue matching data).

(cherry picked from FBD3548012)
2016-07-11 18:51:13 -07:00
Bill Nell bdd4af2134 Store index inside BinaryBasicBlock instead of in map on BinaryFunction.
Summary:
Store the basic block index inside the BinaryBasicBlock instead of a map in BinaryFunction.
This cut another 15-20 sec. from the processing time for hhvm.

(cherry picked from FBD3533606)
2016-07-07 21:43:43 -07:00
Bill Nell 90c9323511 Use unordered_map instead of map in ReorderAlgorithm and BinaryFunction::BasicBlockIndices.
Summary:
Use unordered_map instead of map in ReorderAlgorithm and BinaryFunction::BasicBlockIndices.
Cuts about 30sec off the processing time for the hhvm binary. (~8.5 min to ~8min)

(cherry picked from FBD3530910)
2016-07-07 11:48:50 -07:00
Theodoros Kasampalis c20506c570 Fix in inferFallthroughCounts
Summary:
This fixes the initialization of basic block execution counts, where
we should skip edges to the first basic block but we were not
skipping the corresponding profile info.

Also, I removed a check that was done twice.

(cherry picked from FBD3519265)
2016-07-03 21:30:35 -07:00
Bill Nell 260f6fbdb6 Add option to dump CFGs in (simple) graphviz format during all passes.
Summary:
I noticed the BinaryFunction::viewGraph() method that hadn't been implemented
and decided I could use a simple DOT dumper for CFGs while working on the indirect
call optimization.

I've implemented the bare minimum for the dumper.  It's just nodes+BB labels with
dges. We can add more detailed information as needed/desired.

(cherry picked from FBD3509326)
2016-07-01 08:40:56 -07:00
Theodoros Kasampalis 6eb4e5b687 perf2bolt can extract branch records with histories
Summary:
Added perf2bolt functionality for extracting branch records
with histories of previous branches. The length of the histories
is user defined, and the default is 0 (previous functionality). Also,
DataReader can parse perf2bolt output with histories.
Note: creating profile data with long histories can increase their
size significantly (2x for history of length 1, 3x for length 2 etc).

(cherry picked from FBD3473983)
2016-06-21 18:44:42 -07:00
Theodoros Kasampalis 287fa51324 Fix for ignoring fall-through profile data when jump is followed by no-op
Summary:
When a conditional jump is followed by one or more no-ops, the
destination of fall-through branch was recorded as the first no-op in
FuncBranchInfo. However the fall-through basic block after the jump
starts after the no-ops, so the profile data could not match the CFG
and was ignored.

(cherry picked from FBD3496084)
2016-06-27 14:51:38 -07:00
Theodoros Kasampalis d09b00ebff Refactoring of the reordering algorithms
Summary:
The various reorder and clustering algorithms have been refactored
into separate classes, so that it is easier to add new algorithms and/or
change the logic of algorithm selection.

(cherry picked from FBD3473656)
2016-06-16 18:47:57 -07:00
Maksim Panchenko f1192a7118 Support for multiple function names.
Summary:
With ICF optimization in the linker we were getting mismatches of
function names in .fdata and BinaryFunction name. This diff adds
support for multiple function names for BinaryFunction and
does a match against all possible names for the profile.

(cherry picked from FBD3466215)
2016-06-10 17:13:05 -07:00
Maksim Panchenko 70f82d9371 Reject profile data for functions that do not match.
Summary:
Verify profile data for a function and reject if there are branches
that don't correspond to any branches in the function CFG. Note that
we have to ignore branches resulting from recursive calls.

Fix printing instruction offsets in disassembled state.

Allow function to have non-zero execution count even if we don't
have branch information.

(cherry picked from FBD3451596)
2016-06-15 18:36:16 -07:00
Maksim Panchenko 88ac5d9d0e [merge-fdata] Add option to print function list.
Summary:
Print total number of functions/objects that have profile
and add new options:

  -print      - print the list of objects with count to stderr
    =none     -   do not print objects/functions
    =exec     -   print functions sorted by execution count
    =branches -   print functions sorted by total branch count
  -q          - do not print merged data to stdout

(cherry picked from FBD3442288)
2016-06-09 17:45:15 -07:00
Bill Nell 980a06265a Revert "Indirect call optimization."
This reverts commit 33966090e18545b64013614e7929ff1bdcdf10d5.

(cherry picked from FBD28110782)
2016-06-08 17:38:13 -07:00
Bill Nell 8bcfd9a392 Indirect call optimization.
(cherry picked from FBD28110629)
2016-06-07 16:27:52 -07:00
Bill Nell 45e2219ae4 Allocate BinaryBasicBlocks with new rather than storing them in the BasicBlocks vector.
Summary: This will help optimization passes that need to modify the CFG after it is constructed.  Otherwise, the BinaryBasicBlock pointers stored in the layout, successors and predecessors would need to be modified every time a new basic block is created.

(cherry picked from FBD3403372)
2016-06-07 16:27:52 -07:00
Maksim Panchenko 6da0d95326 Fix large functions debug info by default.
Summary:
Turn on -fix-debuginfo-large-functions by default.

In the process of testing I've discovered that we output cold code
for functions that were too large to be emitted. Fixed that.

(cherry picked from FBD3372697)
2016-05-31 19:29:34 -07:00
Maksim Panchenko 4460da0d81 Improvements for debug info.
Summary:
Assembly functions could have no corresponding DW_AT_subprogram
entries, yet they are represented in module ranges (and .debug_aranges)
and will have line number information. Make sure we update those.

Eliminated unnecessary data structures and optimized some passes.

For .debug_loc unused location entries are no longer processed
resulting in smaller output files.

Overall it's a small processing time improvement and memory imporement.

(cherry picked from FBD3362540)
2016-05-27 20:19:19 -07:00
Theodoros Kasampalis 65ac8bbdf2 Better edge counts for fall through blocks in presence of C++ exceptions.
Summary: The inference algorithm for counts of fall through edges takes possible jumps to landing pad blocks into account. Also, the landing pad block execution counts are updated using profile data.

(cherry picked from FBD3350727)
2016-05-26 15:10:09 -07:00
Theodoros Kasampalis 485f9220b7 Taking LP counts into account for FT count inference
(cherry picked from FBD28110493)
2016-05-24 09:26:25 -07:00
Theodoros Kasampalis fb5f18b2dc Correctly updating landing pad exec counts.
(cherry picked from FBD28110316)
2016-05-23 16:16:25 -07:00
Maksim Panchenko 06b9c5b342 Better .debug_line for non-simple functions.
Summary:
Generate .debug_line info for non-simple functions in a way
that if preferrable by 'objdump -S'.

(cherry picked from FBD3345485)
2016-05-24 20:50:36 -07:00
Maksim Panchenko 7b97793b94 Fix for clang .debug_info.
Summary:
Clang uses different attribute for high_pc which
was incompatible with the way we were updating
ranges. This diff fixes it.

(cherry picked from FBD3345537)
2016-05-24 14:54:23 -07:00
Maksim Panchenko cfa5d753eb Miscellaneous fixes for debug info.
Summary:
* Fix several cases for handling debug info:
  - properly update CU DW_AT_ranges for function with folded body
    due to ICF optimization
  - convert ranges to DW_AT_ranges from hi/low PC for all DIEs
  - add support for [a, a) range
  - update CU ranges even when there are no functions registered
* Overwrite .debug_ranges section instead of appending.
* Convert assertions in debug info handling part into warnings.

(cherry picked from FBD3339383)
2016-05-23 19:36:38 -07:00
Maksim Panchenko 7ab3db129b Create DW_AT_ranges for compile units.
Summary:
Some compile unit DIEs might be missing DW_AT_ranges because they were
compiled without "-ffunction-sections" option. This diff adds the
attribute to all compile units.

If the section is not present, we need to create it. Will do it in a
separate diff.

(cherry picked from FBD3314984)
2016-05-17 18:10:14 -07:00
Maksim Panchenko f047b9d43a Overwrite contents of .debug_line section.
Summary:
Overwrite contents of .debug_line section since we don't reference
the original contents anymore. This saves ~100MB of HHVM binary.

(cherry picked from FBD3314917)
2016-05-16 17:02:17 -07:00
Bill Nell e63984f325 Patch forward jumping tail calls to prevent branch mispredictions.
Summary:
A simple optimization to prevent branch misprediction for tail calls.
Convert the sequence:

        j<cc> L1
        ...
    L1: jmp foo # tail call

into:

        j<cc> foo

but only if 'j<cc> foo' turns out to be a forward branch.

(cherry picked from FBD3234207)
2016-05-02 12:47:18 -07:00
Maksim Panchenko b445f5eb7b Fix issue with garbage address in .debug_line.
Summary:
While emitting debug lines for a function we don't overwrite, we
don't have a code section context that is needed by default
writing routine. Hence we have to emit end_sequence after the
last address, not at the end of section.

(cherry picked from FBD3291533)
2016-05-11 19:13:38 -07:00
Bill Nell f7e7e25b88 Put all optimization passes under the pass manager.
Summary:
Move eliminate unreachable code, block reordering, and CFI/exception fixup
into official optimization passes.

(cherry picked from FBD3248991)
2016-05-02 12:47:18 -07:00
Gabriel Poesia 5fa128e748 Inlining of small functions.
Summary:
Added an optimization pass of inlining calls to small functions (with only one
basic block). Inlining is done in a very simple way, inserting instructions to
simulate the changes to the stack pointer that call/ret would make before/after the
inlined function executes. Also, the heuristic prefers to inline calls that happen
in the hottest blocks (by looking at their execution count). Calls in cold blocks are
ignored.

(cherry picked from FBD3233516)
2016-04-25 14:25:58 -07:00
Gabriel Poesia d1f525499e Optimize calls to functions that are a single unconditional jump
Summary:
Many functions (around 600) in the HHVM binary are simply
a single unconditional jump instruction to another function. These can
be trivially optimized by modifying the call sites to directly call the
branch target instead (because it also happens with more than one jump
in sequence, we do it iteratively).

This diff also adds a very simple analysis/optimization pass system in
which this pass is the first one to be implemented. A follow-up to this
could be to move the current optimizations to other passes.

(cherry picked from FBD3211138)
2016-04-15 15:59:52 -07:00
Gabriel Poesia e6acc7bb53 Optimize calls to functions that are a single unconditional jump
Summary:
Many functions (around 600) in the HHVM binary are simply
a single unconditional jump instruction to another function. These can
be trivially optimized by modifying the call sites to directly call the
branch target instead (because it also happens with more than one jump
in sequence, we do it iteratively).

This diff also adds a very simple analysis/optimization pass system in
which this pass is the first one to be implemented. A follow-up to this
could be to move the current optimizations to other passes.

(cherry picked from FBD3211138)
2016-04-15 15:59:52 -07:00
Gabriel Poesia 459eb8c230 Fix "Cannot update ranges for DIE at offset" error messages.
Summary:
Fix the error message by not printing it :)

Explanation: a previous diff accidentally removed this error message from within
the DEBUG macro, and it's expected that we'll have a bunch of them since a lot
of the DIEs we try to update are empty or meaningless. For instance (and mainly), there
is a huge number of lexical block DIEs with no attributes in .debug_info.
In the first phase of collecting debugging info, we store the offsets of all
these DIEs, only later to realize that we cannot update their address
ranges because they have none.

A better fix would be to check this earlier and not store offsets of DIEs
we cannot update to begin with.

(cherry picked from FBD3236923)
2016-04-28 12:55:35 -07:00
Maksim Panchenko de95a5b6a4 Make merge-fdata generate smaller .fdata files.
Summary:
A lot of the space in the merged .fdata is taken by branches
to and from [heap], which is jitted code. On different machines,
or during different runs, jitted addresses are all different.
We don't use these addresses, but we need branch info to get
accurate function call counts.

This diff treats all [heap] addresses the same, resulting in a
simplified merged file. The size of the compressed file decreased
from 70MB to 8MB.

(cherry picked from FBD3233943)
2016-04-27 18:06:18 -07:00
Maksim Panchenko 1258903b54 Fix for functions in different segments.
Summary:
In a test binary some functions are placed in a segment
preceding the segment containing .text section. As a result,
we were miscalculating maximum function size as the calculation
was based on addresses only.

This diff fixes the calculation by checking if symbol after function
belongs to the same section.  If it does not, then we set the maximum
function size based on the size of the containing section and not
on the address distance to the next symbol.

(cherry picked from FBD3229205)
2016-04-26 23:42:39 -07:00
Maksim Panchenko 3811673a0c Option to break in given functions.
Summary:
Added option "-break-funcs=func1,func2,...." to coredump in any
given function by introducing ud2 sequence at the beginning of the
function. Useful for debugging and validating stack traces.

Also renamed options containing "_" to use "-" instead.

Also run hhvm test with "-update-debug-sections".

(cherry picked from FBD3210248)
2016-04-21 09:54:33 -07:00
Maksim Panchenko 87a90ae133 Fix ninja install-* for BOLT utilities.
Summary:
Make sure we can install all tools needed for processing
BOLT .fdata files such as perf2bolt, merge-fdata, etc.

(cherry picked from FBD3223477)
2016-04-25 22:13:12 -07:00
Maksim Panchenko ff68b34553 Tool to merge .fdata files.
Summary:
merge-fdata tool takes multiple .fdata files and outputs to stdout
combined fdata. Takes about 2 seconds per each additional .fdata
file with hhvm production data.

(cherry picked from FBD3216430)
2016-04-08 12:18:06 -07:00
Maksim Panchenko 43bc4a09ad Changed splitting options and fixed sorting.
Summary:
Splitting option now has different meanings/values. Since landing pads
are mostly always cold/frozen, we should split them before anything
else (we still check the execution count is 0). That's value '1'.
Everything else goes on top of that and has increased value (2 - large
functions, 3 - everything).

Sorting was non-deterministic and somewhat broken for functions
with EH ranges. Fixed that and added '-split-all-cold' option to
outline all 0-count blocks.

Fixed compilation of test cases. After my last commit the binaries
were linked to wrong source files (i.e. debug info). Had to rebuild
the binaries from updated sources.

(cherry picked from FBD3209369)
2016-04-20 15:31:11 -07:00
Maksim Panchenko 4f44d60947 Special handling for GNU_args_size call frame instruction.
Summary:
GNU_args_size is a special kind of CFI that tells runtime to adjust
%rsp when control is passed to a landing pad. It is used for annotating
call instructions that pass (extra) parameters on the stack and there's
a corresponding landing pad.

It is also special in a way that its value is not handled by
DW_CFA_remember_state/DW_CFA_restore_state instruction sequence
that we utilize to restore the state after block re-ordering.

This diff adds association of call instructions with GNU_args_size value
when it's used. If the function does not use GNU_args_size, there is
no overhead. Otherwise, we regenerate GNU_args_size instruction during
code emission, i.e. after all optimizations and block-reordering.

(cherry picked from FBD3201322)
2016-04-19 22:00:29 -07:00
Gabriel Poesia ad344c4387 Group debugging info representation and serialization code.
Summary:
Moved the classes related to representing and serializing DWARF entities into a single
header, DebugData.h.

(cherry picked from FBD3153279)
2016-04-07 15:06:43 -07:00
Gabriel Poesia f6c8929799 Fix debugging info for simple functions that we fail to rewrite.
Summary:
Simple functions which we fail to rewrite after optimizations were
having wrong debugging information because the latter would reflect the optimized
version of the function.

There are only 48 functions (at this time) in this situation in the HHVM binary.

The simple fix is to add another full pass. Another more complicated path, which will
be more efficient, is to reset only the BinaryContext and emit again, but then we need
to recreate all symbols in the new MCContext and update the pointers. I started
taking this path but it started getting too complicated for only those 48 functions
(needed to create a new map of global symbols, recreate landing pads - which needed
to have the internal intermediate labels in the functions kept to be updated too, etc).

Because the overhead is quite large (another full emission pass - around 4m30s here)
and the impact is small I put this behind a new
command-line flag which is off by default: -fix-debuginfo-large-functions.

(cherry picked from FBD3166576)
2016-04-11 17:46:18 -07:00
Gabriel Poesia 0e77c53b89 Update address ranges of inlined functions and try/catch blocks.
Summary:
Update address ranges of inlined functions and try/catch blocks.
This was missing and lead gdb to show weird information in a core dump we inspected
because of the several nestings of inline in the call stack.

This is very similar to Lexical Blocks, so the change is to basically generalize that
code to do the same for DW_AT_try_block, DW_AT_catch_block and DW_AT_inlined_subroutine.

(cherry picked from FBD3169417)
2016-04-12 11:41:03 -07:00
Maksim Panchenko e16b5d8b78 Option to pass a file with list of functions to skip.
Summary:
Take "-skip_funcs_file=<file>" option and don't process any function
listed in the <file>.

(cherry picked from FBD3160226)
2016-04-08 19:30:27 -07:00
Gabriel Poesia 2694e58fa2 Update unmatched and nested subprogram DIEs.
Summary:
readelf was showing some errors because we weren't updating DIEs that were not shallow
in the DIE tree, or DIEs of functions with addresses we don't recognize (mostly functions with
address 0, which could have been removed by the Linker Script but still have debugging information
there). These DIEs need to be updated because their abbreviations are patched.

(cherry picked from FBD3159335)
2016-04-08 16:24:38 -07:00
Gabriel Poesia 665b03a464 Fix behavior with multiple functions with same address.
Summary:
We were updating only one DIE per function, but because the Linker Script may map
multiple functions to the same address this would cause us to generate invalid debug info
(as some DIEs weren't updated but their abbreviations were changed).

(cherry picked from FBD3157263)
2016-04-08 11:55:42 -07:00
Gabriel Poesia 784f6a8773 Emit debug line information for non-simple functions.
Summary:
Non-simple functions aren't emitted, and thus didn't have line number information
emitted. This diff emits it for those functions by extending LLVM's generation of the line number program to allow for absolute addresses (it is wholly symbolic), then iterating over the relevant line tables from the input and appending entries with absolute addresses to the line tables to be emited.

This still leaves the simple but not overwritten functions unhandled (there were 48 in HHVM in
my last run). However, I think that to fix them we'd need another pass, since by the time we
realize a simple function wont't fit, debug line info was already written to the output.

(cherry picked from FBD3148468)
2016-04-05 19:35:45 -07:00
Maksim Panchenko e513bfd86d Only set output ranges when updating dbg info.
Summary: Save processing time by setting output ranges when needed.

(cherry picked from FBD3148791)
2016-04-06 18:03:44 -07:00
Gabriel Poesia 4b4db40174 Update DWARF location lists after optimization.
Summary:

    Summary: Update DWARF location lists in .debug_loc and pointers to
    them in .debug_info so that gdb can print variables which change
    location during their lifetime.

    The following changes were made:

    - Refactored BasicBlockOffsetRanges to allow ranges to be tied to binary information (so that we can reuse it for location lists)
    - Implemented range compression optimization in BasicBlockOffsetRanges (needed otherwise too much data was being generated).
    - Added representation for location lists (LocationList.h, BinaryContext.h)
    - Implemented .debug_loc serializer that keeps the updated offsets (DebugLocWriter.{h,cpp})
    - After disassembly, traverse entries in .debug_loc and save them in context (BinaryContext.cpp)
    - After optimizations, serialize .debug_loc and update pointers in .debug_info (RewriteInstance.cpp)

(cherry picked from FBD3130682)
2016-04-01 11:37:28 -07:00
Maksim Panchenko 4349b63144 Re-enable conditional function spitting under an option.
Summary:
Add a parameter value to "-split-functions=" option to allow splitting
only when the function is too large to fit:
  0 - never split
  1 - split if too large to fit
  2 - always split
We may use this option when the profile data is not very precise.
In that case excessive splitting may increase iTLB misses.

(cherry picked from FBD3137700)
2016-03-31 16:38:49 -07:00
Gabriel Poesia 0a07d9bf88 Don't skip non-simple functions on function address ranges update.
Summary:
This fixes a problem in which bolt was generating a malformed .debug_info
section on the bzip2 binary. The bug was the following:

- A simple and a non-simple function shared an abbreviation
- The abbreviation was patched to contain DW_AT_ranges because of the simple function
- The non-simple function's data was not updated, but then it didn't match the
  layout expected by the abbreviation anymore

And because we were already creating an address ranges list in .debug_ranges even
for non-simple functions, it doesn't make sense not to use it anyway.

(cherry picked from FBD3129219)
2016-04-01 15:09:34 -07:00
Gabriel Poesia ffa9641e16 Update DWARF lexical blocks address ranges.
Summary:
Updates DWARF lexical blocks address ranges in the output binary after optimizations.
This is similar to updating function address ranges except that the ranges representation needs
to be more general, since address ranges can begin or end in the middle of a basic block.

The following changes were made:

- Added a data structure for iterating over the basic blocks that intersect an address range: BasicBlockTable.h
- Added some more bookkeeping in BinaryBasicBlock. Basically, I needed to keep track of the block's size in the input binary as well as its address in the output binary. This information is mostly set by BinaryFunction after disassembly.
- Added a representation for address ranges relative to basic blocks (BasicBlockOffsetRanges.h). Will also serve for location lists.
- Added a representation for Lexical Blocks (LexicalBlock.h)
- Small refactorings in DebugArangesWriter:
-- Renamed to DebugRangesSectionsWriter since it also writes .debug_ranges
-- Refactored it not to depend on BinaryFunction but instead on anything that can be assined an aoffset in .debug_ranges (added an interface for that)
- Iterate over the DIE tree during initialization to find lexical blocks in .debug_info (BinaryContext.cpp)
- Added patches to .debug_abbrev and .debug_info in RewriteInstance to update lexical blocks attributes (in fact, this part is very similar to what was done to function address ranges and I just refactored/reused that code)
- Added small test case (lexical_blocks_address_ranges_debug.test)

(cherry picked from FBD3113181)
2016-03-28 17:45:22 -07:00
Maksim Panchenko e8ef8a5619 Speedup section remapping.
Summary:
Before this diff LLVM used to iterate over all sections to find the
one with an address we want to remap. Since we have extremely
large number of section this process is highly inefficient.
Instead we add a new interface to remap a section with a given ID
(which effectively is an index into an array of sections), and
pass the ID instead of the address.

This cuts down the processing time of hhvm binary by 10 seconds,
and brings the total processing time to a little under 2 minutes.

(cherry picked from FBD3110015)
2016-03-28 22:39:48 -07:00
Maksim Panchenko 595d0885d9 Populate function execution count while parsing fdata.
Summary:
Populate function execution count while parsing fdata. Before
we used a quadratic algorithm to populate the execution count
(had to iterate over *all* branches for every single function).

Ignore non-symbol to non-symbol branches while parsing fdata.

These changes combined drop HHVM processing time from
4 minutes 53 seconds down to 2 minutes 9 seconds on my devserver.

Test case had to be modified since it contained irrelevant
branches from PLT to libc.

(cherry picked from FBD3106263)
2016-03-28 11:06:28 -07:00
Gabriel Poesia 466cbae866 Update subroutine address ranges in binary.
Summary:
[WIP] Update DWARF info for function address ranges.

This diff currently does not work for unknown reasons,
but I'm describing here what's the current state.
According to both llvm-dwarf and readelf our output seems correct,
but GDB does not interpret it as expected. All details go below in
hope I missed something.

I couldn't actually track the whole change that introduced support for
what we need in gdb yet, but I think I can get to it
(2007-12-04: Support
lexical bocks and function bodies that occupy non-contiguous address ranges). I have reasons to believe gdb at least at some
nges).

The set of introduced changes was basically this:

- After disassembly, iterate over the DIEs in .debug_info and find the
ones that correspond to each BinaryFunction.
- Refactor DebugArangesWriter to also write addresses of functions to
.debug_ranges and track the offsets of function address ranges there
- Add some infrastructure to facilitate patching the binary in
simple ways (BinaryPatcher.h)
- In RewriteInstance, after writing .debug_ranges already with
function address ranges, for each function do:
-- Find the abbreviation corresponding to the function
-- Patch .debug_abbrev to replace DW_AT_low_pc with DW_AT_ranges and
DW_AT_high_pc with DW_AT_producer (I'll explain this hack below).
Also patch the corresponding forms to DW_FORM_sec_offset and
DW_FORM_string (null-terminated in-place string).
-- Patch debug_info with the .debug_ranges offset in place of
the first 4 bytes of DW_AT_low_pc (DW_AT_ranges only occupies 4
bytes whereas low_pc occupies 8), and write an arbitrary string
in-place in the other 12 bytes that were the 4 MSB of low_pc
and the 8 bytes of high_pc before the patch. This depends on
low_pc and high_pc being put consecutively by the compiler, but
it serves to validate the idea. I tried another way of doing it
that does not rely on this but it didn't work either and I believe
the reason for either not working is the same (and still unknown,
but unrelated to them. I might be wrong though, and if I find yet
another way of doing it I may try it). The other way was to
use a form of DW_FORM_data8 for the section offset. This is
disallowed by the specification, but I doubt gdb validates this,
as it's just easier to store it as 64-bit anyway as this is even
necessary to support 64-bit DWARF (which is not what gcc generates
by default apparently).

I still need to make changes to the diff to make it production-ready,
but first I want to figure out why it doesn't work as expected.

By looking at the output of llvm-dwarfdump or readelf, all of
.debug_ranges, .debug_abbrev and .debug_info seem to have been
correctly updated. However, gdb seems to have serious problems with
what we write.

(In fact, readelf --debug-dump=Ranges shows some funny warning messages
of the form ("Warning: There is a hole [0x100 - 0x120] in .debug_ranges"),
but I played around with this and it seems it's just because no
compile unit was using these ranges. Changing .debug_info apparently
changes these warnings, so they seem to be unrelated to the section
itself. Also looking at the hex dump of the section doesn't help,
as everything seems fine. llvm-dwarfdump doesn't say anything.
So I think .debug_ranges is fine.)

The result is that gdb not only doesn't show the function name as we
wanted, but it also stops showing line number information.
Apparently it's not reading/interpreting the address ranges at all,
and so the functions now have no associated address ranges, only the
symbol value which allows one to put a breakpoint in the function,
but not to show source code.

As this left me without more ideas of what to try to feed gdb with,
I believe the most promising next trial is to try to debug gdb itself,
unless someone spots anything I missed.
I found where the interesting part of the code lies for this
case (gdb/dwarf2read.c and some other related files, but mainly that one).
It seems in some parts gdb uses DW_AT_ranges for only getting
its lowest and highest addresses and setting that as low_pc and
high_pc (see dwarf2_get_pc_bounds in gdb's code and where it's called).
I really hope this is not actually the case for
function address ranges. I'll investigate this further. Otherwise
I don't think any changes we make will make it work as initially
intended, as we'll simply need gdb to support it and in that case it
doesn't.

(cherry picked from FBD3073641)
2016-03-16 18:08:29 -07:00
Gabriel Poesia 9cdb7bdb55 Write only minimal .debug_line information.
Summary:
We used to output .debug_line information for every instruction, but because of the way
gdb (and probably lldb as of llvm::DWARFDebugLine::LineTable::findAddress) queries the
line table it's not necessary to output information for two instructions if they follow
each other and map to the same source line. By not repeating this information we generate
a bit less .debug_line data.

(cherry picked from FBD3056402)
2016-03-15 16:22:04 -07:00
Maksim Panchenko a60914427c Update DW_AT_ranges for CU when it exists.
Summary:
If CU has DW_AT_ranges update the value.
Note that it does not create DW_AT_ranges attribute.

(cherry picked from FBD3051904)
2016-03-14 19:04:23 -07:00
Maksim Panchenko d01172ffa8 Refactor existing debugging code.
Summary: Almost NFC. Isolate code for updating debug info.

(cherry picked from FBD3051536)
2016-03-14 18:48:05 -07:00
Gabriel Poesia dc7cc1fb18 Fix default line number information for instructions.
Summary:
The line number information generated from a null pointer
was actually valid, which caused new instructions without the line number
information set to have a valid and wrong line number reference. This diff
fixes this by making the null pointer be assigned to an invalid line number
row.

(cherry picked from FBD3048453)
2016-03-14 11:40:52 -07:00
Gabriel Poesia 80ea31b24e Write updated .debug_aranges section after optimizations.
Summary:
Write the .debug_aranges section after optimizations to the output binary.
Each function generates at least one range and at most two (one extra for its cold part).
The writing is done manually because LLVM's implementation is tied to the output of
.debug_info (see EmitGenDwarfInfo and EmitGenDwarfARanges in lib/MC/MCDwarf.cpp),
which we don't want to trigger right now.

(cherry picked from FBD3043108)
2016-03-11 11:30:30 -08:00
Maksim Panchenko e7e9e15b90 Check function data in symbol table against data in .eh_frame.
Summary:
At the moment we rely solely on the symbol table information to discover
function boundaries. However, similar information is contained in
.eh_frame. Verify that the information from these two sources is
consistent, and if it's not, then skip processing the functions with
conflicting information.

(cherry picked from FBD3043800)
2016-03-11 11:09:34 -08:00
Maksim Panchenko f2df1a8d97 Update stmt_list value to point to new .debug_line offset.
Summary:
After we add new line number information we have to update stmt_list
offsets in .debug_info. For this I had to add a primitive relocations
support for non-allocatable sections we are copying from input file.

Also enabled functionality to process relocations in non-allocatable
sections that LLVM is generating, such as .debug_line. I thought
we already had it, but apparently it didn't work, at least not
for ELF binaries.

(cherry picked from FBD3037903)
2016-03-09 16:06:41 -08:00
Maksim Panchenko 9212a9ad69 Proper skipping of unsupported CFI instructions.
Summary:
Skip DW_CFA_expression and DW_CFA_val_expression instructions
properly, according to DWARF spec.

If CFI range does not match function range skip that function.

(cherry picked from FBD3040502)
2016-03-10 23:03:17 -08:00
Gabriel Poesia 73c9f0abe3 Write updated .debug_line information to temp file
Summary:
Writes .debug_line section by setting the state
in MCContext that LLVM needs to produce and output the
line tables. This basically consists of setting the
current location and compile unit offset. This makes LLVM
output .debug_line in the temporary file, but not yet in
the generated ELF file.

Also computes the line table offsets for each compile unit
and saves them into BinaryContext. Added an option to
print these offsets.

(cherry picked from FBD3004554)
2016-03-02 18:40:10 -08:00
Maksim Panchenko d68b1c7b16 Extending support for non-allocatable sections.
Summary:
The is a set of changes that allow modification of non-allocatable
sections in ELF binary. Primarily for the purpose of updating debug
info.

Extend LLVM interface to allow processing relocations in non-allocatable
sections. This allows to produce .debug* sections with resolved
relocations against generated code.

Extend BOLT rewriting framework to allow appending contents to
non-allocatable sections in the binary.

Re-worked ELF binary rewriting to support the above and to allow future
extensions (e.g. new section names).

(cherry picked from FBD3023403)
2016-03-03 10:13:11 -08:00
Gabriel Poesia 77a6b72842 BOLT: Read and tie .debug_line info to IR.
Summary:
Reads information in the DWARF .debug_line section using LLVM and
tie every MCInst to one line of a line table from the input binary. Subsequent
diffs will update this information to match the final binary layout and
output updated line tables.

(cherry picked from FBD2989813)
2016-02-25 16:57:07 -08:00
Maksim Panchenko 62da18d32a Always split functions under '-split-functions=1' option.
Summary:
Force the splitting of the function into hot/cold even when
the function fits into original slot.

This reduces BOLT optimization time by 50% without affecting
hhvm performance.

(cherry picked from FBD2973773)
2016-02-22 16:49:26 -08:00
Maksim Panchenko 73e9afe99c Don't abort on unknown CFI instructions.
Summary:
If we see an unknown CFI instruction, skip processing the function
containing it instead of aborting execution.

(cherry picked from FBD2964557)
2016-02-22 18:25:43 -08:00
Maksim Panchenko 7f7d4af7e0 Add an option to use PT_GNU_STACK for new segment.
Summary:
Added an option to reuse existing program header entry.
This option allows for bfd tools like strip and objcopy
to operate on the optimized binary without destroying it.

Also, all new sections are now properly marked in ELF.

(cherry picked from FBD2943339)
2016-02-12 19:01:53 -08:00
Maksim Panchenko 50c895ad0c Drop requirement for __flo_storage in the input binary.
Summary:
We used to require pre-allocated space in the input binary so that
we can write extra sections in there (.eh_frame, .eh_frame_hdr,
.gcc_except_table, etc.). With this diff there's no further
need for pre-allocated storage as we create a new segment and
can use as much space as needed.

There are certain limitations on where the new segment could
be allocated, and as a result the size of the file may increase.

There's currently a limitation if the binary size is close to 4GB
we cannot allocate new segment prior to that and as a result
we require debug info to be stripped to reduce the file size.
The fix is in progress.

(cherry picked from FBD2916029)
2016-02-08 10:02:48 -08:00
Maksim Panchenko e1a61e1eed Keep intermediate .o file only under -keep-tmp option.
Summary:
We use intermediate .o file for debugging purposes, but there's no
reason to generate it by default. Only do it if "-keep-tmp" is
specified.

(cherry picked from FBD2912098)
2016-02-08 10:08:28 -08:00
Maksim Panchenko d1526083fc Rename binary optimizer to BOLT.
Summary:
BOLT - Binary Optimization and Layout Tool replaces FLO.
I'm keeping .fdata extension for "feedback data".

(cherry picked from FBD2908028)
2016-02-05 14:42:04 -08:00
Maksim Panchenko 628d06b1e5 Preserve layout of basic blocks with 0 profile counts.
Summary:
Preserve original layout for basic blocks that have 0 execution
count. Since we don't optimize for size, it's better to rely on
the original input order.

(cherry picked from FBD2875335)
2016-01-21 14:18:30 -08:00
Maksim Panchenko b91d1f1299 Enable REPNZ prefix support.
Summary:
I didn't see a case where REPNZ were not disassembled/reassembled
properly.

(cherry picked from FBD2869229)
2016-01-26 17:53:08 -08:00
Maksim Panchenko 218c5f0916 Fix a bug with outlining first basic block.
Summary:
We should never outline the first basic block.
Also add an option to accept a file with the list of
functions to optimize.

(cherry picked from FBD2868184)
2016-01-26 16:03:58 -08:00
Maksim Panchenko 89578e2314 Allow to partially split functions with exceptions.
Summary:
We could split functions with exceptions even without creating
a new exception handling table. This limits us to only move
basic blocks that never throw, and are not a start of a
landing pad.

(cherry picked from FBD2862937)
2016-01-22 16:45:39 -08:00
Maksim Panchenko bbb745efa9 Don't create empty basic blocks. Fix CFI bug.
Summary:
Some basic blocks were created empty because they only contained
alignment nop's. Ignore such nop's before basic block gets created.

Fixed intermittent aborts related to CFI update.

(cherry picked from FBD2844465)
2016-01-19 00:20:06 -08:00
Maksim Panchenko 4a44d187c6 Handle more CFI cases and some.
Summary:
  * Update CFI state for larger range of functions to increase coverage.
  * Issue more warnings indicating reasons for skipping functions.
  * Print top called functions in the binary.

(cherry picked from FBD2839734)
2016-01-16 14:58:22 -08:00
Maksim Panchenko d9536e6092 Added an option to reverse original basic blocks order.
Summary:
Modified processing of "-reorder-blocks=" option and added an option
to reverse original basic blocks order for testing purposes.

(cherry picked from FBD2829862)
2016-01-13 17:19:40 -08:00
Maksim Panchenko c9b7e3e09e Write updated LSDA's.
Summary: Write new exception ranges tables (LSDA's) into the output file.

(cherry picked from FBD2828312)
2015-12-18 17:00:46 -08:00
Maksim Panchenko b42c72cbf6 Fix issues with some CFI instructions with gcc 4.9.
Summary:
Fixes some issues discovered after hhvm switched to gcc 4.9.
Add support for DW_CFA_GNU_args_size instruction.
Allow CFI instruction after the last instruction in a function.
Reverse conditions of assert for DW_CFA_set_loc.

(cherry picked from FBD28110096)
2015-12-18 20:26:44 -08:00
Maksim Panchenko a6efd11c05 Code/comments cleanup.
Summary:
Consolidate cold function info under cold FragmentInfo.
Minor code and comment mods to LSDA handling.

(cherry picked from FBD28109981)
2015-12-17 12:59:15 -08:00
Maksim Panchenko e2fcb371a8 Ignore functions referencing symbol at 0x0.
Summary:
Binary code could be weird. It could include calls to address 0 and
reference data at 0 (e.g. with lea on x86). LLVM JIT fatals
while resolving relocations against symbols at address 0x0. For now
we will stop emitting such code, i.e. we'll skip functions.

(cherry picked from FBD28109837)
2015-12-16 17:56:49 -08:00
Maksim Panchenko f7d7a85a24 Turn EH ranges support back on.
Summary:
Changed the way EH info is stored/extracted from call instruction. Make
sure indirect calls work.

(cherry picked from FBD28109629)
2015-12-15 17:06:27 -08:00
Rafael Auler fb6e8c5d0b Don't touch functions whose internal BBs are targets of interprocedural branches
Summary:
In a test binary, we found 8 cases where code in a function A would jump to the
middle of another function B. In this case, we cannot reorder function B because
this would change instruction offsets and break the program. This is pretty rare
but can happen in code written in assembly.

(cherry picked from FBD2719850)
2015-12-03 13:29:52 -08:00
Rafael Auler 9a73a8c446 Turns off basic block alignment by default
Summary:
We found out that the insertion of extra nops to preserve alignment of
some loop bodies do not pay off the increased function size, since this extra
size may inhibit us from rewriting a reordered version of this function.

(cherry picked from FBD2718466)
2015-12-03 09:45:18 -08:00
Rafael Auler 04c80af012 Don't choke on DW_CFA_def_cfa_expression and friends
Summary:
Our CFI parser in the LLVM library was giving up on parsing all CFI
instructions when finding a single instruction with expression operands. Yet,
all gcc-4.9 binaries seem to have at least one CFI instruction with expression
operands (DW_CFA_def_cfa_expression). This patch fixes this and makes DebugInfo
continue to parse other instructions, even though it does not completely parse
DWARF expressions yet. However, this seems to be enough to allow llvm-flo to
process gcc-4.9 binaries because the FDEs with DWARF expressions are linked to
the PLT region, and not to functions that we process.

If we ever try to read a function whose CFI depends on DWARF expression, which
is unlikely, llvm-flo will assert.

(cherry picked from FBD2693088)
2015-11-24 13:55:44 -08:00
Rafael Auler d6f01452d1 Change function splitting to be a two-pass process
Summary:
This patch builds upon the previous patch to create a two-pass process
to function splitting. We first perform the full rewriting pipeline to discover
which functions need splitting. Afterwards, we restart the pipeline with those
functions annotated to be split.

(cherry picked from FBD2691709)
2015-11-24 09:29:41 -08:00
Rafael Auler c67a753e3c Refactoring llvm-flo.cpp into a new class RewriteInstance, NFC.
Summary:
Previously, llvm-flo.cpp contained a long function doing lots of
different tasks. This patch refactors this logic into a separate class with
different member functions, exposing the relationship between each step of
the rewritting process and making it easier to coordinate/change it.

(cherry picked from FBD2691674)
2015-11-23 17:54:18 -08:00
Rafael Auler ccbbb8f8b9 Teach llvm-flo how to split functions into hot and cold regions
Summary:
After basic block reordering, it may be possible that the reordered
function is now larger than the original because of the following reasons:

- jump offsets may change, forcing some jump instructions to use 4-byte
immediate operand instead of the 1-byte, shorter version.
- fall-throughs change, forcing us to emit an extra jump instruction to jump
to the original fall-through at the end of a basic block.

Since we currently do not change function addresses, we need to rewrite the
function back in the binary in the original location. If it doesn't fit, we were
dropping the function.

This patch adds a flag -split-functions that tells llvm-flo to split hot
functions into hot and cold separate regions. The hot region is written back
in the original function location, while the cold region is written in a
separate, far-away region reserved to flo via a linker script.

This patch also adds the logic to create and extra FDE to supply unwinding
information to the cold part of the function. Owing to this, we now need to
rewrite .eh_frame_hdr to another location and patch the EH_FRAME ELF segment
to point to this new .eh_frame_hdr.

(cherry picked from FBD2677996)
2015-11-19 17:59:41 -08:00
Rafael Auler 38dac03e6b Make llvm-flo print dynamic coverage of rewritten functions
Summary:
This is an attempt at determining the hotness of functions we are
rewriting and help detect if we are discarding hot functions. This patch
introduces logic to estimate the number of instructions executed in each
function by using the profile data for branches. It sums the products of
BB frequency and size. Since we can only do this for functions we have
successfully disassembled, created the CFG and annotated with profiling
data, all complex functions that were not disassembled are left out from
this analysis.

(cherry picked from FBD2654985)
2015-11-13 15:27:59 -08:00
Rafael Auler 75798a891b Do not bail on functions with indirect calls
Summary:
Previously, we were marking functions with indirect calls as too
complex to be disassembled, but this was unnecessarily conservative. This patch
removes this restriction.

(cherry picked from FBD2669627)
2015-11-02 09:46:50 -08:00
Rafael Auler 7886f4e81a Ignore LSDA information for now
Summary:
Teach llvm-flo to drop on function with LSDA information until we know
how to update them after block reordering.

(cherry picked from FBD2640806)
2015-11-10 17:21:42 -08:00
Rafael Auler 1d248ec51b Write .eh_frame and .eh_frame_hdr after reordering BBs
Summary:
This patch adds logic to detect when the binary has extra space
reserved for us via the __flo_storage symbol. If this symbol is present,
it means we have extra space in the binary to write extraneous information.
When we write a new .eh_frame, we cannot discard the old .eh_frame because
it may still contain relevant information for functions we do not reorder.
Thus, we write the new .eh_frame into __flo_storage and patch the current
.eh_frame_hdr to point to the new .eh_frame only for the functions we touched,
generating a binary that works with a bi-.eh_frame model.

(cherry picked from FBD2639326)
2015-11-10 15:20:50 -08:00
Rafael Auler 70db5677fb Write updated CFI to temporary object file
Summary:
This patch is an intermediary step towards updating the CFI in the
optimized binary. It adds the logic necessary to output our CFI annotations to
a new .eh_frame in the temporary object file we create to hold rewritten
functions. The next step will be to fully integrate this new .eh_frame into the
optimized binary.

(cherry picked from FBD2633728)
2015-11-09 11:08:02 -08:00
Rafael Auler 6c851dc2e3 Attempts to fix CFI state after reordering
Summary:
This patch introduces logic to check how the CFI instructions define a
table to help during stack unwinding at exception run time and attempts to fix
any problem in this table that may have been introduced by reordering the basic
blocks. If it fails to fix this problem, the function is marked as not simple
and not eligible for rewriting.

(cherry picked from FBD2633696)
2015-11-08 12:23:54 -08:00
Maksim Panchenko bc9d6e3b6c Regenerate exception handling information after optimizations.
Summary:
Regenerate exception handling information after optimizations.
Use '-print-eh-ranges' to see CFG with updated ranges.

(cherry picked from FBD2660982)
2015-11-13 14:18:45 -08:00
Maksim Panchenko 56cca2fb5b Fix LSDA reading issues.
Summary:
There were two issues: we were trying to process non-simple functions,
i.e. function that we don't fully understand, and then we failed to stop
iterating if EH closing label was after the last instruction in a
function.

(cherry picked from FBD2664460)
2015-11-17 11:02:04 -08:00
Maksim Panchenko be2a19523c Add exception handling information to CFG.
Summary:
Read .gcc_except_table and add information to CFG. Calls have extra operands
indicating there's a possible handler for exceptions and an action. Landing
pad information is recorded in BinaryFunction.

Also convert JMP instructions that are calls into tail calls pseudo
instructions so that they don't miss call instruction analysis.

(cherry picked from FBD2652775)
2015-11-12 18:56:58 -08:00
Rafael Auler 2117362a09 Revert 45fc13b as it breaks HHVM rewriting
Summary: Reverting this commit until we better investigate why
it is necessary to change local symbol names with a prefix.

(cherry picked from FBD28109521)
2015-11-12 10:41:46 -08:00
Rafael Auler 1df130ae17 Remove add PG prefix from symbols that are already local
Summary: After discussion with Maksim, we decided to drop the lines
that add the PG prefix if the symbol is already local, since they
wouldn't be impacted by the way LLVM handles these symbols.

(cherry picked from FBD28109400)
2015-11-12 10:02:12 -08:00
Rafael Auler e80d11f27a Fix bug in local symbol name disambiguation algorithm
Summary:
This bug would cause llvm-flo to fail to disambiguate two local symbols
with the same file name, causing two different addresses to compete in the
symbol table for the resolution of a given name, causing unpredicted behavior in
the linker.

(cherry picked from FBD2646626)
2015-11-11 23:56:24 -08:00
Rafael Auler a30d04c3e2 Annotate BinaryFunctions with MCCFIInstructions encoding CFI
Summary:
In order to represent CFI information in our BinaryFunction class, this
patch adds a map of Offsets to CFI instructions. In this way, we make it easy to
check exactly where DWARF CFI information is annotated in the disassembled
function.

(cherry picked from FBD2619216)
2015-11-04 16:48:47 -08:00
Maksim Panchenko de46e6fc07 Parse whole contents of .gcc_except_table even if we are not printing.
Summary:
We need to parse the whole contents of .gcc_except_table even if we are
not printing exceptions. Otherwise we are missing type index table and
miscalculate the size of the current table.

(cherry picked from FBD2632965)
2015-11-09 12:27:13 -08:00
Rafael Auler 2088875656 Teach llvm-flo how to read .eh_frame information from binaries
Summary: In order to reorder binaries with C++ exceptions, we first need to
read DWARF CFI (call frame info) from binaries in a table in the .eh_frame
ELF section. This table contains unwinding information we need to be aware of
when reordering basic blocks, so as to avoid corrupting it. This patch also
cleans up some code from Exceptions.cpp due to a refactoring where we moved
some functions to the LLVM's libSupport.

(cherry picked from FBD2614464)
2015-11-05 13:37:30 -08:00
Maksim Panchenko 7d592d0975 Verbose printing of actions from .gcc_except_table
Summary:
Print actions for exception ranges from .gcc_except_table.
Types are printed as names if the name is available from symbol table.

(cherry picked from FBD2612631)
2015-11-03 14:26:33 -08:00
Maksim Panchenko 21cc191ea8 Added function to parse and dump .gcc_except_table
Summary: Use '-print-exceptions' option to dump contents of .gcc_except_table.

(cherry picked from FBD2609925)
2015-11-02 11:50:53 -07:00
Rafael Auler 0e8998713c Extract non-taken branch frequencies from LBR
Summary:
Previously, we inferred all non-taken branch frequencies with the
information we had for taken branches. This patch teaches perf2flo and llvm-flo
how to read and incorporate non-taken branch frequencies directly from the
traces available in LBR data and by disassembling the binary. It still leaves
the inference engine untouched in case we need it to fill out other
fall-throughs.

(cherry picked from FBD2589212)
2015-10-26 15:00:56 -07:00
Rafael Auler 13a520ab30 Implement two cluster layout heuristics
Summary:
Pettis' paper on block layout (PLDI'90) suggests we should order
clusters (or chains, using the paper terminology) using a specific criterion.
This patch implements two distinct ideas for cluster layout that can be
activated using different command-line flags. The first one reflects Pettis'
ideas on minimizing branch mispredictions and the second one is targeted at
reducing I-cache misses, described in the Ispike paper (CGO'04).

(cherry picked from FBD2588693)
2015-10-23 09:38:26 -07:00
Rafael Auler 2539539bde Fixes priority queue ordering in llvm-flo block reordering
Summary:
Fixes a bug which caused the block reordering heuristic to put in the
same cluster hot basic blocks and cold basic blocks, increasing I-cache misses.

(cherry picked from FBD2588203)
2015-10-27 03:04:58 -07:00
Maksim Panchenko d4d773458c More control over function printing.
Summary:
Can use '-print-*' option to print function at specific stage.
Use '-print-all' to print at every stage.

(cherry picked from FBD2578196)
2015-10-23 15:52:59 -07:00
Maksim Panchenko 7f44331773 Issue warning when relaxed tail call is seen on input.
Summary:
Issue warning when we see a 2-byte tail call. Currently we
will increase the size of these instructions.

(cherry picked from FBD2575520)
2015-10-20 10:51:17 -07:00
Rafael Auler 546c4e6e84 Fix bug in BinaryFunction::fixBranches() in llvm-flo
Summary:
When the ignore-nops patch landed, it exposed a bug in fixBranches()
where it ignored empty BBs. However, we cannot ignore empty BBs when it is
reordered and its fall-through changes. We must update it with a jump to the
original fall-through. This patch fixes this.

(cherry picked from FBD2568244)
2015-10-21 16:25:16 -07:00
Rafael Auler dc848b5376 Fix entry BB execution count in llvm-flo
Summary:
When we have tailcalls, the execution count for the entry point is
wrongly computed. Fix this.

(cherry picked from FBD2563112)
2015-10-20 16:48:54 -07:00
Rafael Auler ab63ca9afb Implement unreachable BB elimination in llvm-flo
Summary:
It is important to remove dead blocks to free up space in functions
and allow us to reorder blocks or align branch targets with more
freedom. This patch implements a simple algorithm to delete all basic
blocks that are not reachable from the entry point. Note that C++
exceptions may create "unreachable" blocks, so this option must be
used with care.

(cherry picked from FBD2562637)
2015-10-20 12:47:37 -07:00
Rafael Auler 9f41a0d263 Do not schedule BBs before the entry point
Summary:
SPEC CPU2006 perlbench triggered a bug in our heuristic block
reordering algorithm where a hot edge that targets the entry point (as in a
recursive tail call) would make us try to allocate the call site before the
function entry point. Since we don't update function addresses yet, moving the
entry point will corrupt the program. This patch fixes this.

(cherry picked from FBD2562528)
2015-10-20 12:30:22 -07:00
Rafael Auler b0115a4536 Teach llvm-flo how to handle two back-to-back JMPs
Summary:
If we have two consecutive JMP instructions and no branches to the
second one, the second one is dead code, but llvm-flo does not handle these
cases properly and put two JMPs in the same BB. This patch fixes this, putting
the extraneous JMP in a separate block, making it easy for us to detect it is
dead code and remove it later in a separate step.

(cherry picked from FBD2562465)
2015-10-20 10:17:38 -07:00
Maksim Panchenko 85b99eb7b7 Eliminate nop instruction in input and derive alignment.
Summary:
Nop instructions are primarily used for alignment purposes on the input.
We remove all nops when we build CFG and derive alignment of basic blocks
based on existing alignment and a presence of nops before it. This
will not always work as some basic blocks will be naturally aligned
without necessity for nops. However, it's better than random alignment.
We would also add heuristics for BB alignment based on execution profile.

(cherry picked from FBD2561740)
2015-10-20 10:51:17 -07:00
Rafael Auler cd6250d1e3 Fixes branches after reordering basic blocks in a binary function
Summary:
Adds logic in BinaryFunction to be able to fix branches (invert
its condition, delete or add a branch), making the new function work with the
new layout proposed by the layout pass. All the architecture-specific content
was designed to live in the LLVM Target library, in the MCInstrAnalysis pass.
For now, we only introduce such logic to the X86 backend.

(cherry picked from FBD2551479)
2015-10-16 09:49:04 -07:00
Rafael Auler ef059af3d1 Fix bug in block reorder heuristic
Summary:
Tests with SPEC CPU2006 400.perlbench exposed a bug in the block reordering
heuristic that happened when two blocks are both successor and predecessor of
each other. This patch fixes this.

(cherry picked from FBD2555835)
2015-10-19 10:43:54 -07:00
Rafael Auler 31e6bd1226 Fix missing sanity check in BinaryFunction::optimizeLayout()
Summary:
SPEC CPU2006 perlbench exposed a bug in BinaryFunction::optimizeLayout()
where it would try to optimize the layout even though the function had zero
basic blocks. This patch simply checks if the function has zero basic blocks and
bails out.

(cherry picked from FBD2556831)
2015-10-19 13:23:03 -07:00
Maksim Panchenko b4ed5cc942 Make FLO work on hhvm binary.
Summary: Fixes several issues that prevented us from running hhvm binary.

(cherry picked from FBD2543057)
2015-10-14 15:35:14 -07:00
Rafael Auler ec22caff1e Fix comments. NFC.
Summary: Updated comments in BinaryFunction class.

(cherry picked from FBD28108888)
2015-10-16 17:15:00 -07:00
Rafael Auler 9a8d357d0b Fix DataReader to work with new local sym perf2flo format
Summary:
In a recent commit, we changed local symbols to be specially tagged
with the number 2 (local sym) instead of 1 (sym). This patch modifies the reader
to don't choke when seeing a 2 in the symbol id field.

(cherry picked from FBD2552776)
2015-10-16 17:00:36 -07:00
Rafael Auler f9ed45893b Teach llvm-flo how to reorder blocks in an optimal way
Summary:
This patch implements a dynamic programming approach to solve reorder
basic blocks with profiling information in an optimal way. Since this is
analogous to TSP, it is NP-hard and the algorithm is exponential in time and
memory consumption. Therefore, we only use the optimal algorithm to decide the
layout of small functions (with less than 11 basic blocks).

(cherry picked from FBD2544124)
2015-10-14 16:58:55 -07:00
Rafael Auler 34f7085503 Teach llvm-flo how to reorder basic blocks with a heuristic
Summary:
This patch introduces a first approach to reorder basic blocks based on
profiling data that gives us the execution frequency for each edge. Our strategy
is to layout basic blocks in a order that maximizes the weight (hotness) of
branches that will be deleted. We can delete branches when src comes right
before dst in the new layout order. This can be reduced to the TSP problem. This
patch uses a greedy heuristic to solve the problem: we start with a graph with
no edges and progressively add edges by choosing the hottest edges first,
building a layout order that attempts to put BBs with hot edges together.

(cherry picked from FBD2544076)
2015-10-13 12:18:54 -07:00
Rafael Auler 9b58b2e64b Make llvm-flo infer branch count data for fall-through edges
Summary:
The LBR only has information about taken branches and does not record
information when a branch is not taken. In our CFG, we call these edges
"fall-through" edges. This patch teaches llvm-flo how to infer fall-through
edge frequencies.

(cherry picked from FBD2536633)
2015-10-13 10:25:45 -07:00
Maksim Panchenko f79f6302c1 Converted local offsets from uint64_t to uint32_t. Refactoring.
(cherry picked from FBD2543557)
2015-10-14 16:46:59 -07:00
Rafael Auler 4c1da22ae9 Add branch count information to binary CFG
Summary:
Changes DataReader to organize branch perf data per function name and
sets up logistics to bring this data to BinaryFunction::buildCFG(). To do this,
we expand BinaryContext with a const reference to DataReader. This patch also
adds the "-dump-functions" flag to force llvm-flo to dump the current state of
BinaryFunctions once they are disassembled and their CFG built, allowing us to
test whether the builder is sane with LLVM LIT tests.

(cherry picked from FBD2534675)
2015-10-12 12:30:47 -07:00
Maksim Panchenko d30423f872 Don't bail out if there's no input data file specified.
Summary:
Don't attempt to read data file if it was not
specified by the user.

(cherry picked from FBD2533440)
2015-10-12 14:46:18 -07:00
Maksim Panchenko ffcc2be7fa FLO: added support for rip-relative operands.
Summary: Detect and replace rip-relative operands with relocations.

(cherry picked from FBD2529818)
2015-10-09 21:47:18 -07:00
Maksim Panchenko f166c4ab2b Fix CFG building issue.
Summary:
Fixed getBasicBlockContainingOffset() to return correct basic
block.

(cherry picked from FBD2532514)
2015-10-12 12:12:16 -07:00
Rafael Auler e1a539b0ec Add initial implementation of DataReader
Summary:
This patch introduces DataReader, a module responsible for
parsing llvm flo data files into in-memory data structures.

(cherry picked from FBD2515754)
2015-10-05 18:31:25 -07:00
Maksim Panchenko 9a2fe7ebe4 Commit FLO with control flow graph.
Summary:
llvm-flo disassembles, builds control flow graph, and re-writes
simple functions.

(cherry picked from FBD2524024)
2015-10-09 17:21:14 -07:00
Maksim Panchenko 7927c14ff5 Fixed cmake.
(cherry picked from FBD28108725)
2015-10-02 12:38:07 -07:00
Maksim Panchenko a89c417357 Removed remote .arcconfig + comment change.
(cherry picked from FBD2503821)
2015-10-02 12:06:31 -07:00
Maksim Panchenko 575b24d719 Initial FLO commit.
Summary:
Directory created.

(cherry picked from FBD28105260)
2015-10-02 11:55:15 -07:00
Maksim Panchenko 25b976aa12 BOLT root commit 2022-01-10 17:58:05 -08:00