Summary:
To minimize the size of the output code we should emit tail calls
that are as short as possible. For this we have to convert a synthetic
TAILJMPd into a JMP_1 instruction. This should be one of the last passes,
as most analysis passes could break since tail calls will no longer
be marked as such.
The total size of the code is smaller, but not by much - hot text was
reduced by 192 bytes.
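A minimal sketch of the conversion at the MCInst level; in BOLT this would run
over every basic block of every function, with the opcodes being X86::TAILJMPd
and X86::JMP_1 (everything else here is illustrative):
#include "llvm/MC/MCInst.h"
#include <vector>

// Rewrite every synthetic tail-jump opcode into the short-jump opcode. The
// operands (the branch target) stay the same; only the opcode changes. Since
// the tail-call marking is lost, this must be one of the last passes.
void shortenTailJumps(std::vector<llvm::MCInst> &Insts, unsigned TailJmpOpcode,
                      unsigned ShortJmpOpcode) {
  for (llvm::MCInst &Inst : Insts)
    if (Inst.getOpcode() == TailJmpOpcode)
      Inst.setOpcode(ShortJmpOpcode);
}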
(cherry picked from FBD4557804)
Summary:
Some functions coming from assembly may not have been marked
with a size. We assume the size to include all bytes up to
the next function/object in the file. As a result, the
function body will include any padding inserted by the linker.
If the linker inserts zero-value bytes, this could be misinterpreted
as an invalid instruction, and BOLT will bail out on such functions
in non-relocation mode, and give up on the binary in relocation
mode.
This diff detects zero-padding, ignores it, and continues processing
as normal.
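A minimal sketch of the padding check, assuming the function body spans the
bytes up to the next symbol and the disassembler stopped at Offset on bytes it
could not decode (names are illustrative, not BOLT's actual API):
#include "llvm/ADT/ArrayRef.h"
#include <algorithm>
#include <cstdint>

bool isZeroPaddingToEnd(llvm::ArrayRef<uint8_t> FunctionData, uint64_t Offset) {
  // If every byte from the failure point to the presumed end of the function
  // is zero, treat it as linker-inserted padding rather than an invalid
  // instruction, and simply stop disassembling here.
  return std::all_of(FunctionData.begin() + Offset, FunctionData.end(),
                     [](uint8_t Byte) { return Byte == 0; });
}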
(cherry picked from FBD4528893)
Summary:
Whenever the input binary is suspected to have been sanitized, we print an error
message and exit. I've checked that the presence of an "__asan_init*" symbol
is the most conservative way to detect "sanitization".
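A minimal sketch of the detection, assuming llvm::object-style symbol iteration
(the actual plumbing in BOLT differs):
#include "llvm/Object/ObjectFile.h"
#include "llvm/Support/Error.h"

bool isSanitizedBinary(const llvm::object::ObjectFile &File) {
  for (const llvm::object::SymbolRef &Symbol : File.symbols()) {
    llvm::Expected<llvm::StringRef> NameOrErr = Symbol.getName();
    if (!NameOrErr) {
      llvm::consumeError(NameOrErr.takeError());
      continue;
    }
    // Presence of any "__asan_init*" symbol is treated as evidence that the
    // binary was built with AddressSanitizer instrumentation.
    if (NameOrErr->startswith("__asan_init"))
      return true;
  }
  return false;
}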
(cherry picked from FBD4525478)
Summary:
Rewrite the section header string table to reflect new names
given to sections. Old sections get a ".bolt.org" prefix.
E.g., when we write the ".eh_frame" section, we keep the old copy
but rename it to ".bolt.org.eh_frame".
Note: the new code section is named ".bolt.text" - it contains split
function bodies, while original ".text" name is left unchanged.
(cherry picked from FBD4524935)
Summary:
Perform indirect call promotion optimization in BOLT.
The code scans the instructions during CFG creation for all
indirect calls. Right now indirect tail calls are not handled
since the functions are marked not simple. The offsets of the
indirect calls are stored for later use by the ICP pass.
The indirect call promotion pass visits each indirect call and
examines the BranchData for each. If the most frequent targets
from that callsite exceed the specified threshold (default 90%),
the call is promoted. Otherwise, it is ignored. By default,
only one target is considered at each callsite.
When a candidate callsite is processed, we modify the callsite
to test for the most common call targets before calling through
the original generic call mechanism.
The CFG and layout are modified by ICP.
A few new command line options have been added:
-indirect-call-promotion
-indirect-call-promotion-threshold=<percentage>
-indirect-call-promotion-topn=<int>
The threshold is the minimum frequency of a call target needed
before ICP is triggered.
The topn option controls the number of targets to consider for
each callsite, e.g. ICP is triggered if topn=2 and the total
frequency of the top two call targets exceeds the threshold.
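A minimal sketch of the trigger condition, assuming per-target execution counts
for one call site sorted in descending order (names are illustrative):
#include <cstdint>
#include <vector>

bool shouldPromote(const std::vector<uint64_t> &SortedTargetCounts,
                   uint64_t TotalCount, unsigned TopN = 1,
                   unsigned ThresholdPercent = 90) {
  uint64_t TopCount = 0;
  for (unsigned I = 0; I < TopN && I < SortedTargetCounts.size(); ++I)
    TopCount += SortedTargetCounts[I];
  // Promote only if the top-N targets cover at least the threshold
  // percentage of all calls from this site.
  return TotalCount > 0 && TopCount * 100 >= TotalCount * ThresholdPercent;
}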
Example of ICP:
C++ code:
int B_count = 0;
int C_count = 0;
struct A { virtual void foo() = 0; };
struct B : public A { virtual void foo() { ++B_count; }; };
struct C : public A { virtual void foo() { ++C_count; }; };
A* a = ...
a->foo();
...
original:
400863: 49 8b 07 mov (%r15),%rax
400866: 4c 89 ff mov %r15,%rdi
400869: ff 10 callq *(%rax)
40086b: 41 83 e6 01 and $0x1,%r14d
40086f: 4d 89 e6 mov %r12,%r14
400872: 4c 0f 44 f5 cmove %rbp,%r14
400876: 4c 89 f7 mov %r14,%rdi
...
after ICP:
40085e: 49 8b 07 mov (%r15),%rax
400861: 4c 89 ff mov %r15,%rdi
400864: 49 ba e0 0b 40 00 00 movabs $0x400be0,%r10
40086b: 00 00 00
40086e: 4c 3b 10 cmp (%rax),%r10
400871: 75 29 jne 40089c <main+0x9c>
400873: 41 ff d2 callq *%r10
400876: 41 83 e6 01 and $0x1,%r14d
40087a: 4d 89 e6 mov %r12,%r14
40087d: 4c 0f 44 f5 cmove %rbp,%r14
400881: 4c 89 f7 mov %r14,%rdi
...
40089c: ff 10 callq *(%rax)
40089e: eb d6 jmp 400876 <main+0x76>
(cherry picked from FBD3612218)
Summary:
Add an option to overwrite jump tables in place, without moving them, and make
it the default:
-jump-tables - jump tables support (default=basic)
=none - do not optimize functions with jump tables
=basic - optimize functions with jump tables
=move - move jump tables to a separate section
=split - split jump tables section into hot and cold based on
function execution frequency
=aggressive - aggressively split jump tables section based on usage of
the tables
(cherry picked from FBD4448499)
Summary:
Add a new dataflow analysis to recover the value of RSP at a
given point of the program. This value is expressed as an offset from
the CFA. Use this information to detect redundant loads in memory
accesses performed via RSP as well, not only via RBP as done previously.
Bail out when the RSP value (as an offset from the CFA) can't be reliably
determined with a simple dataflow analysis.
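A minimal sketch of the per-instruction transfer function, using a simplified
instruction model; the state tracks "RSP = CFA + Offset", with std::nullopt
meaning the offset cannot be determined reliably (names are illustrative):
#include <cstdint>
#include <optional>

// Only the stack adjustments the analysis understands are modeled here.
enum class StackOp { PushReg, PopReg, SubRspImm, AddRspImm, Other };

struct StackInst {
  StackOp Op;
  int64_t Imm; // immediate for sub/add $imm, %rsp
};

// State: RSP == CFA + Offset. At function entry on x86-64 the offset is -8,
// since the call has pushed the return address below the CFA.
using RspState = std::optional<int64_t>;

RspState transfer(RspState In, const StackInst &Inst) {
  if (!In)
    return std::nullopt;                          // once unknown, stay unknown
  switch (Inst.Op) {
  case StackOp::PushReg:   return *In - 8;        // push decrements RSP by 8
  case StackOp::PopReg:    return *In + 8;        // pop increments RSP by 8
  case StackOp::SubRspImm: return *In - Inst.Imm;
  case StackOp::AddRspImm: return *In + Inst.Imm;
  case StackOp::Other:     return std::nullopt;   // bail out conservatively
  }
  return std::nullopt;
}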
(cherry picked from FBD4372261)
Summary:
Report stale functions percentage with respect to all profiled
functions instead of all simple functions in the binary.
The new reporting format should make it more apparent if the
profile is out-of-date. Compare:
BOLT-INFO: 341 (16.7% of all profiled) functions have invalid (possibly
stale) profile.
vs old:
BOLT-INFO: 341 (0.3%) functions have invalid (possibly stale) profile.
(cherry picked from FBD4451746)
Summary:
Due to a clowntown on my part we were generating wrong ranges
when an empty range was seen on input. We were basically expanding
the range to include all basic blocks following such a range and setting
wrong sizes at the same time.
Add "-dump-cu" option to llvm-dwarfdump that allows to look at debug
info of a single compile unit only. Saves time if we are only interested
in a subset of information.
(cherry picked from FBD4430989)
Summary:
In non-relocation mode, when we run ICF the second time,
we fold the same functions again since they were not
removed from the function set. This diff marks them as
folded and ignores them during ICF optimization. Note
that we still want to optimize such functions since they
are potentially called from the code not covered by BOLT
in non-relocation mode.
Folded functions are also excluded from dyno stats with
this diff.
Also print the number of times folded functions were called.
When two functions, f1() and f2(), are folded, that number
would be min(call_frequency(f1), call_frequency(f2)).
(cherry picked from FBD4399993)
Summary:
Re-worked the way ICF operates. The pass now checks for more than just
call instructions, but also for all references including function
pointers. Jump tables are handled too.
(cherry picked from FBD4372491)
Summary:
This is a first attempt to perform dataflow analyses in BOLT
and to rebuild the stack frames of functions. The goal of the frame
optimization pass is to detect instructions that access the stack and,
for loads, evaluate whether the load is redundant and whether we can
replace the memory operation with a register or immediate load.
To find opportunities, this pass also builds a per-function map of
clobbered registers, which we use in our analysis at call sites. If a call site
is found not to clobber a caller-saved register, but the caller is
spilling it to the stack anyway (to comply with the ABI), we
detect this case and remove the unnecessary spill.
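A minimal sketch of the call-site check, assuming a precomputed map from callee
name to the set of registers the callee clobbers (names are illustrative, not
BOLT's actual API):
#include "llvm/ADT/BitVector.h"
#include <map>
#include <string>

using RegSet = llvm::BitVector;                   // indexed by register number
using ClobberMap = std::map<std::string, RegSet>; // callee -> clobbered registers

// A caller-saved spill/reload of Reg around this call site is unnecessary
// if the callee provably never clobbers Reg.
bool isSpillRedundantAtCall(const ClobberMap &Clobbers,
                            const std::string &CalleeName, unsigned Reg) {
  auto It = Clobbers.find(CalleeName);
  if (It == Clobbers.end())
    return false;              // unknown callee: assume it clobbers everything
  return !It->second.test(Reg);
}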
(cherry picked from FBD4337238)
Summary:
An optimization to simplify conditional tail calls by removing unnecessary branches. It adds the following two command line options:
-simplify-conditional-tail-calls - simplify conditional tail calls by removing unnecessary jumps
-sctc-mode - mode for simplify conditional tail calls
=always - always perform sctc
=preserve - only perform sctc when branch direction is preserved
=heuristic - use branch prediction data to control sctc
This optimization considers both of the following cases:
foo: ...
jcc L1 original
...
L1: jmp bar # TAILJMP
->
foo: ...
jcc bar iff jcc L1 is expected
...
L1 is unreachable
OR
foo: ...
jcc L2
L1: jmp dest # TAILJMP
L2: ...
->
foo: jncc dest # TAILJMP
L2: ...
L1 is unreachable
For this particular case, the first basic block ends with a conditional branch and has two successors, one fall-through and one for when the condition is true. The target of the conditional is a basic block with a single unconditional branch (i.e. tail call) to another function. We don't care about the contents of the fall-through block.
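A minimal sketch of the candidate test for these patterns, using a simplified
block model rather than BOLT's real classes:
#include <vector>

struct Inst {
  bool IsCondBranch;
  bool IsUncondTailCall;
};

struct Block {
  std::vector<Inst> Insts;
  const Block *TakenSuccessor; // target of the conditional branch, if any
};

// A block qualifies for SCTC when it ends in a conditional jump whose taken
// successor contains nothing but an unconditional tail call to another
// function. The fall-through block's contents do not matter.
bool isSCTCCandidate(const Block &BB) {
  if (BB.Insts.empty() || !BB.Insts.back().IsCondBranch || !BB.TakenSuccessor)
    return false;
  const Block &Target = *BB.TakenSuccessor;
  return Target.Insts.size() == 1 && Target.Insts.front().IsUncondTailCall;
}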
(cherry picked from FBD3719617)
Summary:
Previously NamedRegionTimer's constructor was being called
with no local variable associated with it owing to a typo. We need a
local variable to keep track of the time spent in the scope. At the
end of the scope, the destructor will be called and then the timer will
stop.
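A schematic illustration of the fix; the exact NamedRegionTimer argument list
differs between LLVM versions, so treat the constructor calls below as
illustrative:
#include "llvm/Support/Timer.h"

void runPass() {
  // Bug: constructs and immediately destroys a temporary, so nothing in the
  // rest of the scope is timed.
  //   llvm::NamedRegionTimer("pass name", "timer group");

  // Fix: bind the timer to a local variable; its destructor runs at the end
  // of the scope and stops the timer there.
  llvm::NamedRegionTimer T("pass name", "timer group");

  // ... body of the pass ...
}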
(cherry picked from FBD4301844)
Summary:
As we begin to work on optimization passes for bolt, it is important to
keep track of the time spent in each of these to measure their
contribution to the time bolt takes to finish rewriting a program.
(cherry picked from FBD4301136)
Summary:
The CFI instruction parser in libDebugInfo was relying on
undefined behavior to parse operands by assuming that the order in which
function parameters are evaluated at a call site is defined (it is
not). This patch fixes this and makes our clang and gcc tests agree.
It also fixes incorrect LIT tests in our codebase with respect to the
order of DW_CFA_def_cfa operands.
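A reduced, hypothetical illustration of the bug and the fix (readOperand and
addInstruction are stand-ins, not the actual libDebugInfo code):
#include <cstdint>
#include <cstdio>

static uint64_t readOperand(const uint8_t *Data, uint64_t &Offset) {
  return Data[Offset++]; // each call consumes one byte and advances Offset
}

static void addInstruction(uint8_t Opcode, uint64_t Op1, uint64_t Op2) {
  std::printf("opcode=%u operands=%llu,%llu\n", (unsigned)Opcode,
              (unsigned long long)Op1, (unsigned long long)Op2);
}

void parseCFI(const uint8_t *Data, uint64_t &Offset, uint8_t Opcode) {
  // Buggy form: the evaluation order of the two arguments is unspecified,
  // so clang and gcc may legally read the operands in opposite orders.
  //   addInstruction(Opcode, readOperand(Data, Offset), readOperand(Data, Offset));

  // Fixed form: named locals force the first operand to be read first.
  uint64_t Op1 = readOperand(Data, Offset);
  uint64_t Op2 = readOperand(Data, Offset);
  addInstruction(Opcode, Op1, Op2);
}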
(cherry picked from FBD4255227)
Summary:
Clang's Address Sanitizer caught this leak where MCAsmBackend
and MCObjectWriter instances were being created but not freed. Fix this.
(cherry picked from FBD4249941)
Summary:
This is part of a series of clean-up patches to make bolt
cleanly compile with clang 4.0. This patch fixes an error where clang
will fail to compile because it does not support passing a
const_iterator to std::vector<T>::emplace(Iter, ...).
(cherry picked from FBD4242546)
Summary:
This is part of a series of clean-up patches to make bolt
cleanly compile with clang 4.0. This patch fixes the following warning:
moving a temporary object prevents copy elision
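A schematic example of the pattern behind the warning and the preferred form
(not the exact BOLT code):
#include <utility>
#include <vector>

std::vector<int> makeVector() { return std::vector<int>(16, 0); }

void use() {
  // Warning: "moving a temporary object prevents copy elision" - the result
  // of makeVector() is already a temporary, so std::move only pessimizes.
  //   std::vector<int> V = std::move(makeVector());

  // Fix: initialize directly from the temporary; the copy/move is elided.
  std::vector<int> V = makeVector();
  (void)V;
}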
(cherry picked from FBD4242236)
Summary:
This is part of a series of clean-up patches to make bolt
cleanly compile with clang 4.0. This patch fixes the following warning:
default label in switch which covers all enumeration values
(cherry picked from FBD4242168)
Summary:
Make BOLT resilient to changes in LLVM's X86 target library
by not hardwiring the list of default CIE instructions, but detecting it
at run time.
(cherry picked from FBD4200982)
Summary:
In order to improve gdb experience with BOLT we have to make
sure the output file has a single .eh_frame section. Otherwise
gdb will use either the old or the new section for unwinding purposes.
This diff relocates the original .eh_frame section next to
the new one generated by LLVM. Later we merge the two sections
into one and make sure only the newly created section has
the .eh_frame name.
(cherry picked from FBD4203943)
Summary:
We used to patch an existing .eh_frame_hdr and append contents
for split functions at the end. However, this approach does not
work in relocation mode since function addresses change and split
functions will not necessarily be at the end.
Instead of patching and appending we generate the new .eh_frame_hdr
based on contents of old and new .eh_frame sections.
(cherry picked from FBD4180756)
Summary:
In a previous diff I disabled inclusion of FDEs for cold fragments that
we fail to write. A side effect of this was that we failed to
write the FDE for the next function with a cold fragment, since it
had the same assigned address that we had put in FailedAddresses.
The correct fix is to assign a zero address to failed cold fragments
and ignore them when we write .eh_frame_hdr.
(cherry picked from FBD4156740)
Summary:
CFI instructions may live in CIEs or FDEs. CIEs hold common
instructions used across many FDEs. When replaying CFIs to the output
binary, llvm-bolt needs to replay both instructions from CIE and the
corresponding FDE for the function. However, some instructions need not
be replayed because MCStreamer/MCDwarf and friends will write them
by default in the output CIE. This patch fixes the code that tried to
recognize one of these default instructions but was failing, resulting
in an extra CFI instruction in each FDE we emitted. With this patch,
the output binary should be a bit smaller.
(cherry picked from FBD4194753)
Summary:
Modify the MC layer (MCDwarf.h|cpp) to understand CFI
instructions dealing with DWARF expressions. Add code to emit DWARF
expressions in MCDwarf. Change llvm-bolt to pass these CFI instructions
to the streamer instead of bailing on them. Change the -dump-eh-frame option in
llvm-bolt to dump the EH frame of the rewritten binary in addition to
the one in the original binary, allowing us to properly test this patch.
(cherry picked from FBD4194452)
Summary:
AVX-512 disassembler support in LLVM is not quite ready yet.
Until we feel more comfortable about it, we disable processing
of all functions that use any EVEX-encoded instructions.
(cherry picked from FBD4028706)
Summary:
When we fail to write functions that are too big, we have to
effectively cancel their effect on exception handling by ignoring
their FDE entries in .eh_frame while writing .eh_frame_hdr.
This can happen to functions that we split too. In such cases
the cold part has its own FDE and we have to ignore that one too.
This doesn't happen very often - I've only seen one case in the
hhvm binary; however, it is a potential issue. The fix is to
add the cold part address to the list of failed-to-write
addresses.
(cherry picked from FBD3987984)
Summary:
Modified the function discovery process to tolerate more functions and
symbols coming from assembly. The processing order now matches
the memory order of the functions (input symbol table is unsorted).
Added basic support for functions with multiple entries. When
a function references its internal address other than with
a branch instruction, that address could potentially escape.
We mark such addresses as entry points and make sure they
are treated as roots by unreachable code elimination.
Without relocations we have to mark multiple-entry functions
as non-simple.
(cherry picked from FBD3950243)
Summary:
Added support for jump tables in code compiled with "-fpic".
Code pattern generated for position-independent jump tables
is quite different, as is the format of the tables.
More details in comments.
Coverage increased slightly for a test, mostly due to the code
coming from an external lib that was compiled with "-fpic".
(cherry picked from FBD3940771)
Summary:
Allow UCE when blocks have EH info. Since UCE may remove blocks
that are referenced from debugging info data structures, we don't
actually delete them. We just mark them with an "invalid" index
and store them in a different vector to be cleaned up later once
the BinaryFunction is destroyed. The debugging code just skips
any BBs that have an invalid index.
Eliminating blocks may also expose useless jmp instructions, i.e.
a jmp around a dead block could just be a fallthrough. I've added
a new routine to clean up these jmps, although @maks is working on
changing fixBranches() so that it can be used instead.
(cherry picked from FBD3793259)
Summary:
Add level for "-jump-tables=<n>" option:
1 - all jump tables are output in the same section (default).
2 - basic splitting, if the table is used it is output to hot section
otherwise to cold one.
3 - aggressively split compound jump tables and collect profile for
all entries.
Option "-print-jump-tables" outputs all jump tables for debugging
and/or analyzing purposes. Use with "-jump-tables=3" to get profile
values for every entry in a jump table.
(cherry picked from FBD3912119)
Summary:
Insert ud2 instructions after indirect tailcalls to prevent the CPU from
decoding instructions following the callsite.
A simple counter in the peephole pass shows 3260 tail call traps inserted.
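A minimal sketch of the peephole, using a simplified instruction model instead
of BOLT's MCInst plumbing:
#include <iterator>
#include <list>

struct Inst {
  enum Kind { IndirectTailCall, Trap, Other } K;
};

// After every indirect tail call, append a ud2 (trap) so the CPU front end
// does not decode and speculate into whatever bytes follow the call site.
unsigned insertTailCallTraps(std::list<Inst> &Insts) {
  unsigned Inserted = 0;
  for (auto It = Insts.begin(); It != Insts.end(); ++It) {
    if (It->K == Inst::IndirectTailCall) {
      It = Insts.insert(std::next(It), Inst{Inst::Trap});
      ++Inserted;              // count traps, as reported by the pass
    }
  }
  return Inserted;
}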
(cherry picked from FBD3859737)
Summary:
Get rid of all uses of getIndex/getLayoutIndex/getOffset outside of BinaryFunction.
Also made some other offset related methods private.
(cherry picked from FBD3861968)
Summary:
Add -print-sorted-by and -print-sorted-by-order command line options.
The first option takes a list of dyno stats keys used to sort functions
that are printed at the end of all optimization passes. Only the top
100 functions are printed. The -print-sorted-by-order option can be
either ascending or descending (descending is the default).
(cherry picked from FBD3898818)
Summary:
While working on PLT dyno stats I've noticed that we were missing
BinaryFunctions for some symbols that were not PLT. Upon closer inspection
it turned out that those symbols were marked as zero-sized functions in
the symbol table, but they had duplicates with non-zero size. Since the
zero-sized symbols were preceding the other duplicates, we were not creating
a BinaryFunction for them and they were not added as duplicates.
The two most prominent functions that were missing for a test were free() and
malloc(). There's not much to optimize in these functions, but they were
contributing quite significantly to dyno stats.
As a result dyno stats for this test needed an adjustment.
Also, several assembly functions (e.g. _init()) had zero size; now we
set the size to the max size and start processing those. It's good for
coverage but will not affect performance.
(cherry picked from FBD3874622)
Summary:
Option "-jump-tables=1" enables experimental support for jump tables.
The option hasn't been tested with optimizations other than block
re-ordering.
Only non-PIC jump tables are supported at the moment.
(cherry picked from FBD3867849)
Summary:
This is just a bit of refactoring to make sure that BinaryFunction goes
through methods to get at the state in BinaryBasicBlock. I did this so
that changing the way Index/LayoutIndex/Valid works will be easier.
(cherry picked from FBD3860899)
Summary:
Add "-reorder-blocks=cluster-shuffle" for performance experiments.
Use "-bolt-seed=<N>" to set a randomization seed.
(cherry picked from FBD3851035)
Summary:
A switch table can contain __builtin_unreachable(). As a result,
a compiler may place an entry into a jump table that contains
an address immediately past the last instruction in the function.
Sometimes it may coincide with the start of the next function in
the binary. Thus when we check for switch tables in such cases
we have to check more than a single entry until we see either
an address inside the containing function or some outside address
different from the address past the last instruction.
Additionally, don't stop disassembly after discovering that the
function is not simple. We need to detect all outside
references whenever possible.
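A minimal sketch of the relaxed entry check, assuming the candidate table
entries have already been read from the binary (names are illustrative):
#include <cstdint>
#include <vector>

// An entry equal to the address just past the function's last instruction
// typically comes from a __builtin_unreachable() switch case, so it must not,
// by itself, terminate jump table detection.
bool looksLikeJumpTable(const std::vector<uint64_t> &Entries,
                        uint64_t FunctionStart, uint64_t FunctionEnd) {
  bool SawInsideEntry = false;
  for (uint64_t Entry : Entries) {
    if (Entry >= FunctionStart && Entry < FunctionEnd) {
      SawInsideEntry = true;   // a real target inside the containing function
      continue;
    }
    if (Entry == FunctionEnd)
      continue;                // possibly __builtin_unreachable(); keep scanning
    break;                     // any other outside address ends the table
  }
  return SawInsideEntry;
}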
(cherry picked from FBD3850825)
Summary:
Replace jumps to other unconditional jumps with the final
destination, e.g.
B0: ...
jmp B1 (or jcc B1)
B1: jmp B2
->
B0: ...
jmp B2 (or jcc B1)
This peephole removes 8928 double jumps from a test binary.
Note: after filtering out double jumps found in EH code and infinite
loops, the number of double jumps patched is 49 (24 for a clang
compiled test). The 24 in the clang build are all from external
libraries which have probably been compiled with gcc. This peephole
is still useful for cleaning up after ICP though.
(cherry picked from FBD3815420)
Summary:
I've added dyno stats printing per pass so we can see the effect
of each optimization pass on the stats. I've also factored out the
post-pass function printing code since it was pretty much the same
after each pass.
(cherry picked from FBD3843587)