Summary:
When we merge the original branch counts we have to make sure
both of them have a profile. Otherwise set the count to COUNT_NO_PROFILE.
The misprediction count should be 0.
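A minimal sketch of the merge rule described above. The sentinel value and the names `BranchCounts` and `mergeBranchCounts` are illustrative assumptions, not BOLT's actual definitions. Resetting the misprediction count to 0 in both cases is one reading of the summary.

```cpp
#include <cstdint>
#include <limits>

// Assumed sentinel meaning "no profile available"; the real value may differ.
constexpr uint64_t COUNT_NO_PROFILE = std::numeric_limits<uint64_t>::max();

// Hypothetical pair of counters attached to a branch.
struct BranchCounts {
  uint64_t Count;
  uint64_t Mispreds;
};

// Merge the counts of two original branches. The sum is only meaningful if
// both inputs carry a profile; otherwise the result has no profile either.
BranchCounts mergeBranchCounts(const BranchCounts &A, const BranchCounts &B) {
  if (A.Count == COUNT_NO_PROFILE || B.Count == COUNT_NO_PROFILE)
    return {COUNT_NO_PROFILE, 0};
  return {A.Count + B.Count, 0};
}
```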
(cherry picked from FBD4837774)
Summary:
I split some of this out from the jumptable diff since it fixes the
double jump peephole.
I've changed the pass manager so that UCE and peepholes are not called
after SCTC. I've incorporated a call to the double jump fixer to SCTC
since it is needed to fix things up afterwards.
While working on fixing the double jump peephole I discovered a few
useless conditional branches that could be removed as well. I highly
doubt that removing them will improve perf at all but it does seem
odd to leave in useless conditional branches.
There are also some minor logging improvements.
(cherry picked from FBD4751875)
Summary:
When inlining, if a callee has debug info and a caller does not
(i.e. a containing compilation unit was compiled without "-g"), we try
to update a nonexistent compilation unit. Instead we should skip
updating debug info in such cases.
Minor refactoring of line number emitting code.
(cherry picked from FBD4823982)
Summary:
Each BOLT-specific option now belongs to BoltCategory or BoltOptCategory.
Use alphabetical order for options in source code (does not affect
output).
The result is a cleaner output of "llvm-bolt -help" which does not
include any unrelated llvm options and is close to the following:
.....
BOLT generic options:
-data=<string> - <data file>
-dyno-stats - print execution info based on profile
-hot-text - hot text symbols support (relocation mode)
-o=<string> - <output file>
-relocs - relocation mode - use relocations to move functions in the binary
-update-debug-sections - update DWARF debug sections of the executable
-use-gnu-stack - use GNU_STACK program header for new segment (workaround for issues with strip/objcopy)
-use-old-text - re-use space in old .text if possible (relocation mode)
-v=<uint> - set verbosity level for diagnostic output
BOLT optimization options:
-align-blocks - try to align BBs inserting nops
-align-functions=<uint> - align functions at a given value (relocation mode)
-align-functions-max-bytes=<uint> - maximum number of bytes to use to align functions
-boost-macroops - try to boost macro-op fusions by avoiding the cache-line boundary
-eliminate-unreachable - eliminate unreachable code
-frame-opt - optimize stack frame accesses
......
(cherry picked from FBD4793684)
Summary:
If we specify "-relocs" flag and an input has no relocations we
proceed with assumptions that relocations were there and break the
binary.
Detect the condition above, and reject the input.
(cherry picked from FBD4761239)
Summary:
ICP was letting through call targets that weren't symbols. This diff
filters out the non-symbol targets before running ICP.
(cherry picked from FBD4735358)
Summary:
Add option '-print-only=func1,func2,...' to print only functions
of interest. The rest of the functions are still processed and
optimized (e.g. inlined), but only the ones on the list are printed.
(cherry picked from FBD4734610)
Summary:
In non-relocation mode we shouldn't attempt to change the ELF
entry point.
What made matters worse - it broke '-max-funcs=' and '-funcs=' options
since an entry function more often than not was excluded from the list
of processed functions, and we were setting entry point to 0.
(cherry picked from FBD4720044)
Summary:
Reduce verbosity of dynostats to make them more readable.
* Don't print "before" dynostats twice.
* Detect if dynostats have changed after optimization and print
before/after only if at least one metric has changed. Otherwise
just print dynostats once and indicate "no change".
* If any given metric hasn't changed, then print the difference as
"(=)" as opposed to (+0.0%).
(cherry picked from FBD4705920)
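The dynostats reporting rules in the summary above can be sketched as follows. The function names are hypothetical, and the handling of a zero "before" value is an assumption that the summary does not cover.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// True if at least one metric differs between the two snapshots.
bool statsChanged(const std::vector<uint64_t> &Before,
                  const std::vector<uint64_t> &After) {
  return Before != After;
}

// Format the change of a single metric. Unchanged metrics print "(=)".
std::string formatChange(uint64_t Before, uint64_t After) {
  if (Before == After)
    return "(=)";
  if (Before == 0)
    return "(n/a)"; // assumption: no percentage can be computed from zero
  const double Pct =
      100.0 * (static_cast<double>(After) - static_cast<double>(Before)) /
      static_cast<double>(Before);
  char Buf[32];
  std::snprintf(Buf, sizeof(Buf), "(%+.1f%%)", Pct);
  return Buf;
}
```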
Summary:
While running on a recent test binary BOLT failed with an error. We were
trying to process '__hot_end' (which is not really a function), and asserted
that it had no basic blocks.
This diff marks functions with empty basic blocks list as non-simple since
there's no need to process them.
(cherry picked from FBD4696517)
Summary:
The stats for call sites that are not included in the call graph were broken.
The intention is to count the total number of call sites vs. the number of call sites that are ignored because they have targets that are not BinaryFunctions.
Also add a new test for hfsort.
(cherry picked from FBD4668631)
Summary:
Fix validateCFG to handle BBs that were generated from code that used
__builtin_unreachable().
Add -verify-cfg option to run CFG validation after every optimization
pass.
(cherry picked from FBD4641174)
Summary:
Sometimes code written in assembly will have unmarked data (such as
constants) embedded into text.
Typically such data falls into a "padding" address space of a function.
This diff detects such references, and adjusts the padding space to
prevent overwriting of code in data.
Note that in relocation mode we prefer to overwrite the original code
(-use-old-text) and thus cannot simply ignore data in text.
(cherry picked from FBD4662780)
Summary:
Calls to __builtin_unreachable() can result in an inconsistent CFG.
It was possible for a basic block to end with a conditional branch
and have a single successor. Or there could exist a non-terminated
basic block without successors.
We also often treated conditional jumps with destination past the end
of a function as conditional tail calls. This can be prevented
reliably at least when the byte past the end of the function does
not belong to the next function.
This diff includes several changes:
* At disassembly stage jumps past the end of a function are converted
into 'nops'. This is done only for cases when we can guarantee that
the jump is not a tail call. Conversion to nop is required since the
instruction could be referenced either by exception handling
tables and/or debug info. Nops are later removed.
* In CFG insert 'ret' into non-terminated basic blocks without
successors (this almost never happens).
* Conditional jumps at the end of the function are removed from
CFG. The block will still have a single successor.
* Cases where a destination of a jump instruction is the start
of the next function, are still conservatively handled as
(conditional) tail calls.
(cherry picked from FBD4655046)
Summary:
The new interface for handling Call Frame Information:
* CFI state at any point in a function (in CFG state) is defined by
CFI state at basic block entry and CFI instructions inside the
block. The state is independent of basic blocks layout order
(this is implied by CFG state but wasn't always true in the past).
* Use BinaryBasicBlock::getCFIStateAtInstr(const MCInst *Inst) to
get CFI state at any given instruction in the program.
* No need to call fixCFIState() after any given pass. fixCFIState()
is called only once during function finalization, and any function
transformations after that point are prohibited.
* When introducing new basic blocks, make sure CFI state at entry
is set correctly and matches CFI instructions in the basic block
(if any).
* When splitting basic blocks, use getCFIStateAtInstr() to get
a state at the split point, and set the new basic block's CFI
state to this value.
Introduce CFG_Finalized state to indicate that no further optimizations
are allowed on the function. This state is reached after we have synced
CFI instructions and updated EH info.
Rename "-print-after-fixup" option to "-print-finalized".
This diff fixes CFI for cases when we split conditional tail calls,
and for indirect call promotion optimization.
(cherry picked from FBD4629307)
Summary:
Fix inconsistent override keyword usages and initialize a
missing field of a Relocation object when using braced initializers.
(cherry picked from FBD4622856)
Summary:
Add pass to strip 'repz' prefix from 'repz retq' sequence. The prefix
is not used in Intel CPUs afaik. The pass is on by default.
(cherry picked from FBD4610329)
Summary:
We use code skew in non-relocation mode since functions have fixed
addresses, and internal alignment has to be adjusted wrt the skew.
However in relocation mode it interferes with effective code
alignment, and has to be disabled. I missed it when I was rebasing
the relocation diff.
(cherry picked from FBD4599670)
Summary:
In a prev diff I added an option to update jump tables in-place (on by default)
and accidentally broke the default handling of jump tables in relocation
mode. The update should be happening semi-automatically, but because
we ignore relocations for jump tables it wasn't happening (derp).
Since we mostly use '-jump-tables=move' this hasn't been noticed for
some time.
This diff gets rid of IgnoredRelocations and removes relocations
from a relocation set when they are no longer needed. If relocations
are created later for jump tables they are no longer ignored.
(cherry picked from FBD4595159)
Summary:
gcc5 can generate new types of relocations that give the linker freedom
to substitute instructions. These relocations are PC-relative, and
since we manually process such relocations they don't present
much of a problem.
Additionally, detect non-pc-relative access from code into a middle of
a function. Occasionally I've seen such code, but don't know exactly
how to trigger its generation. Just issue a warning for now.
(cherry picked from FBD4566473)
Summary:
To minimize size of the output code we should emit tail calls
that are as short as possible. For this we have to convert a synthetic
TAILJMPd into JMP_1 instruction. This should be one of the last passes
as most analysis passes could break since tail calls will no longer
be marked as such.
The total size of the code is smaller, but not by much - hot text was
reduced by 192 bytes.
(cherry picked from FBD4557804)
Summary:
Some functions coming from assembly may not have been marked
with size. We assume the size to include all bytes up to
the next function/object in the file. As a result,
function body will include any padding inserted by the linker.
If linker inserts 0-value bytes this could be misinterpreted
as invalid instruction and BOLT will bail out on such functions
in non-relocation mode, and give up on a binary in relocation
mode.
This diff detects zero-padding, ignores it, and continues processing
as normal.
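A rough sketch of the zero-padding check, assuming the function body is available as a byte vector and that `Offset` is the position where decoding stopped. The name `isZeroPadding` is hypothetical and this is not BOLT's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if every byte from Offset to the end of the function body is
// zero, i.e. the tail looks like linker-inserted padding rather than code.
// An empty tail is treated as padding.
bool isZeroPadding(const std::vector<uint8_t> &Bytes, std::size_t Offset) {
  if (Offset > Bytes.size())
    return false;
  return std::all_of(Bytes.begin() + static_cast<std::ptrdiff_t>(Offset),
                     Bytes.end(), [](uint8_t B) { return B == 0; });
}
```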
(cherry picked from FBD4528893)
Summary:
Whenever input binary is suspected to have been sanitized we print an error
message and exit. I've checked that "__asan_init*" symbol
presence is the most conservative way to detect "sanitization".
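The symbol check could look roughly like this, assuming the symbol names have already been read into a vector. `looksSanitized` is a hypothetical helper, not a BOLT function.

```cpp
#include <string>
#include <vector>

// Returns true if any symbol name starts with "__asan_init", which the
// summary above uses as the indicator of a sanitized binary.
bool looksSanitized(const std::vector<std::string> &SymbolNames) {
  const std::string Prefix = "__asan_init";
  for (const std::string &Name : SymbolNames)
    if (Name.compare(0, Prefix.size(), Prefix) == 0)
      return true;
  return false;
}
```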
(cherry picked from FBD4525478)
Summary:
Re-write section header string table to reflect new names
given to sections. Old sections get ".bolt.org" prefix.
E.g. when we write ".eh_frame" section, we keep the old copy
but rename it to ".bolt.org.eh_frame".
Note: the new code section is named ".bolt.text" - it contains split
function bodies, while original ".text" name is left unchanged.
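As an illustration of the renaming scheme, the sketch below prefixes an old section name and lays out a string table in the ELF style (a leading NUL byte followed by NUL-terminated names). The helper names and the `StringTable` struct are assumptions made for this example.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Name given to the preserved original copy of a rewritten section.
std::string getOrgSectionName(const std::string &Name) {
  return ".bolt.org" + Name;
}

// Hypothetical result type: raw table bytes plus the offset of each name.
struct StringTable {
  std::string Data;
  std::vector<uint32_t> Offsets;
};

// Build a string table: index 0 is a NUL byte, then each name followed by NUL.
StringTable buildStringTable(const std::vector<std::string> &Names) {
  StringTable Table;
  Table.Data.push_back('\0');
  for (const std::string &Name : Names) {
    Table.Offsets.push_back(static_cast<uint32_t>(Table.Data.size()));
    Table.Data += Name;
    Table.Data.push_back('\0');
  }
  return Table;
}
```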
(cherry picked from FBD4524935)
Summary:
Perform indirect call promotion optimization in BOLT.
The code scans the instructions during CFG creation for all
indirect calls. Right now indirect tail calls are not handled
since the functions are marked not simple. The offsets of the
indirect calls are stored for later use by the ICP pass.
The indirect call promotion pass visits each indirect call and
examines the BranchData for each. If the most frequent targets
from that callsite exceed the specified threshold (default 90%),
the call is promoted. Otherwise, it is ignored. By default,
only one target is considered at each callsite.
When a candidate callsite is processed, we modify the callsite
to test for the most common call targets before calling through
the original generic call mechanism.
The CFG and layout are modified by ICP.
A few new command line options have been added:
-indirect-call-promotion
-indirect-call-promotion-threshold=<percentage>
-indirect-call-promotion-topn=<int>
The threshold is the minimum frequency of a call target needed
before ICP is triggered.
The topn option controls the number of targets to consider for
each callsite, e.g. ICP is triggered if topn=2 and the total
frequency of the top two call targets exceeds the threshold.
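The threshold and topn rule can be sketched as below. The function name and the use of a plain count vector are assumptions; whether the comparison is strict at exactly the threshold is not stated in the summary, so this sketch uses `>=`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Decide whether a callsite qualifies for promotion: the combined frequency
// of the TopN hottest targets must reach ThresholdPct percent of all calls.
bool shouldPromote(std::vector<uint64_t> TargetCounts,
                   double ThresholdPct = 90.0, std::size_t TopN = 1) {
  const uint64_t Total =
      std::accumulate(TargetCounts.begin(), TargetCounts.end(), uint64_t{0});
  if (Total == 0 || TopN == 0)
    return false;
  // Hottest targets first.
  std::sort(TargetCounts.begin(), TargetCounts.end(),
            std::greater<uint64_t>());
  const std::size_t N = std::min(TopN, TargetCounts.size());
  const uint64_t Top = std::accumulate(
      TargetCounts.begin(),
      TargetCounts.begin() + static_cast<std::ptrdiff_t>(N), uint64_t{0});
  return 100.0 * static_cast<double>(Top) / static_cast<double>(Total) >=
         ThresholdPct;
}
```

With the defaults, a callsite whose hottest target receives 95 of 100 calls qualifies, while a 60/35/5 split only qualifies once topn is raised to 2.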
Example of ICP:
C++ code:
int B_count = 0;
int C_count = 0;
struct A { virtual void foo() = 0; };
struct B : public A { virtual void foo() { ++B_count; }; };
struct C : public A { virtual void foo() { ++C_count; }; };
A* a = ...
a->foo();
...
original:
400863: 49 8b 07 mov (%r15),%rax
400866: 4c 89 ff mov %r15,%rdi
400869: ff 10 callq *(%rax)
40086b: 41 83 e6 01 and $0x1,%r14d
40086f: 4d 89 e6 mov %r12,%r14
400872: 4c 0f 44 f5 cmove %rbp,%r14
400876: 4c 89 f7 mov %r14,%rdi
...
after ICP:
40085e: 49 8b 07 mov (%r15),%rax
400861: 4c 89 ff mov %r15,%rdi
400864: 49 ba e0 0b 40 00 00 movabs $0x400be0,%r10
40086b: 00 00 00
40086e: 4c 3b 10 cmp (%rax),%r10
400871: 75 29 jne 40089c <main+0x9c>
400873: 41 ff d2 callq *%r10
400876: 41 83 e6 01 and $0x1,%r14d
40087a: 4d 89 e6 mov %r12,%r14
40087d: 4c 0f 44 f5 cmove %rbp,%r14
400881: 4c 89 f7 mov %r14,%rdi
...
40089c: ff 10 callq *(%rax)
40089e: eb d6 jmp 400876 <main+0x76>
(cherry picked from FBD3612218)
Summary:
Add an option to overwrite jump tables without moving and make it a
default:
-jump-tables - jump tables support (default=basic)
=none - do not optimize functions with jump tables
=basic - optimize functions with jump tables
=move - move jump tables to a separate section
=split - split jump tables section into hot and cold based on
function execution frequency
=aggressive - aggressively split jump tables section based on usage of
the tables
(cherry picked from FBD4448499)
Summary:
Add a new dataflow analysis to recover the value of RSP at a
given point of the program. This value is expressed as an offset from
the CFA. Use this information to detect redundant load in memory
accesses performed via RSP as well, not only RBP as done previously.
Bail when RSP value (as an offset of the CFA) can't be reliably
determined with a simple dataflow analysis.
(cherry picked from FBD4372261)
Summary:
Report stale functions percentage with respect to all profiled
functions instead of all simple functions in the binary.
The new reporting format should make it more apparent if the
profile is out-of-date. Compare:
BOLT-INFO: 341 (16.7% of all profiled) functions have invalid (possibly
stale) profile.
vs old:
BOLT-INFO: 341 (0.3%) functions have invalid (possibly stale) profile.
(cherry picked from FBD4451746)
Summary:
Due to a clowntown on my part we were generating wrong ranges
when an empty range was seen on input. We were basically expanding
the range to include all basic blocks following such range and setting
wrong sizes at the same time.
Add "-dump-cu" option to llvm-dwarfdump that allows to look at debug
info of a single compile unit only. Saves time if we are only interested
in a subset of information.
(cherry picked from FBD4430989)
Summary:
In non-relocation mode, when we run ICF the second time,
we fold the same functions again since they were not
removed from the function set. This diff marks them as
folded and ignores them during ICF optimization. Note
that we still want to optimize such functions since they
are potentially called from the code not covered by BOLT
in non-relocation mode.
Folded functions are also excluded from dyno stats with
this diff.
Also print the number of times folded functions were called.
When 2 functions - f1() and f2() are folded, that number
would be min(call_frequency(f1), call_frequency(f2)).
(cherry picked from FBD4399993)
Summary:
Re-worked the way ICF operates. The pass now checks for more than just
call instructions, but also for all references including function
pointers. Jump tables are handled too.
(cherry picked from FBD4372491)
Summary:
This is a first attempt to perform data flow analyses on bolt
and try to rebuild the stack frame for functions. The goal of the frame
optimization pass is to detect instructions that are accessing stack and,
if loading values, evaluate whether this load is redundant and we can
substitute the memory operation for a register load or immediate load.
To find opportunities, this pass also builds a map of clobbered registers
by function, so we use this in our analysis at call sites. If a call site
is found out to not clobber a caller-saved register but the caller is
spilling it anyway to the stack (to comply with the ABI), we should
detect these cases and remove this unnecessary move.
(cherry picked from FBD4337238)
Summary:
An optimization to simplify conditional tail calls by removing unnecessary branches. It adds the following two command line options:
-simplify-conditional-tail-calls - simplify conditional tail calls by removing unnecessary jumps
-sctc-mode - mode for simplify conditional tail calls
=always - always perform sctc
=preserve - only perform sctc when branch direction is preserved
=heuristic - use branch prediction data to control sctc
This optimization considers both of the following cases:
foo: ...
jcc L1 original
...
L1: jmp bar # TAILJMP
->
foo: ...
jcc bar iff jcc L1 is expected
...
L1 is unreachable
OR
foo: ...
jcc L2
L1: jmp dest # TAILJMP
L2: ...
->
foo: jncc dest # TAILJMP
L2: ...
L1 is unreachable
For this particular case, the first basic block ends with a conditional branch and has two successors, one fall-through and one for when the condition is true. The target of the conditional is a basic block with a single unconditional branch (i.e. tail call) to another function. We don't care about the contents of the fall-through block.
(cherry picked from FBD3719617)
Summary:
Previously NamedRegionTimer's constructor was being called
with no local variable associated with it owing to a typo. We need a
local variable to keep track of the time spent in the scope. At the
end of the scope, the destructor will be called and then the timer will
stop.
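The difference between an unnamed temporary and a named local can be demonstrated with a stand-in type. `ScopedMarker` is hypothetical and only mimics the start/stop behavior of a scoped timer; it is not llvm::NamedRegionTimer.

```cpp
#include <string>
#include <vector>

// Records "start" on construction and "stop" on destruction.
class ScopedMarker {
  std::vector<std::string> &Log;

public:
  explicit ScopedMarker(std::vector<std::string> &L) : Log(L) {
    Log.push_back("start");
  }
  ~ScopedMarker() { Log.push_back("stop"); }
};

// The bug: an unnamed temporary is destroyed at the end of its own
// statement, so the "timer" stops before the work begins.
std::vector<std::string> unnamedTemporary() {
  std::vector<std::string> Log;
  {
    ScopedMarker{Log};
    Log.push_back("work");
  }
  return Log;
}

// The fix: a named local lives until the closing brace of the scope.
std::vector<std::string> namedLocal() {
  std::vector<std::string> Log;
  {
    ScopedMarker Marker(Log);
    Log.push_back("work");
  }
  return Log;
}
```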
(cherry picked from FBD4301844)
Summary:
As we begin to work on optimization passes for bolt, it is important to
keep track of the time spent in each of these to measure their
contribution to the time bolt takes to finish rewriting a program.
(cherry picked from FBD4301136)
Summary:
The CFI instructions parser in libDebugInfo was relying on
undefined behavior to parse operands by assuming the order function
parameters are evaluated in a function call site is defined (it is
not). This patch fixes this and makes our clang and gcc tests agree.
It also fixes wrong LIT tests in our codebase with respect to the
order of DW_CFA_def_cfa operands.
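A small sketch of the pattern behind this fix, using a hypothetical `OperandReader` in place of the real parser. Reading each operand into its own local fixes the order, whereas two reads placed directly in an argument list may be evaluated in either order depending on the compiler.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical cursor over already-decoded operand values.
struct OperandReader {
  const std::vector<uint64_t> &Data;
  std::size_t Pos;
  explicit OperandReader(const std::vector<uint64_t> &D) : Data(D), Pos(0) {}
  uint64_t next() { return Data[Pos++]; }
};

// Unreliable: the two next() calls are arguments of one call, so the
// compiler may evaluate them in either order.
//   return std::make_pair(R.next(), R.next());

// Reliable: each read is its own statement, so the register operand is
// always read before the offset operand.
std::pair<uint64_t, uint64_t> readRegisterAndOffset(OperandReader &R) {
  const uint64_t Register = R.next();
  const uint64_t Offset = R.next();
  return {Register, Offset};
}
```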
(cherry picked from FBD4255227)
Summary:
Clang's Address Sanitizer caught this leak where MCAsmBackend
and MCObjectWriter instances were being created but not freed. Fix this.
(cherry picked from FBD4249941)
Summary:
This is part of a series of clean-up patches to make bolt
cleanly compile with clang 4.0. This patch fixes an error where clang
will fail to compile because it does not support passing a
const_iterator to std::vector<T>::emplace(Iter, ...).
(cherry picked from FBD4242546)
Summary:
This is part of a series of clean-up patches to make bolt
cleanly compile with clang 4.0. This patch fixes the following warning:
moving a temporary object prevents copy elision
(cherry picked from FBD4242236)
Summary:
This is part of a series of clean-up patches to make bolt
cleanly compile with clang 4.0. This patch fixes the following warning:
default label in switch which covers all enumeration values
(cherry picked from FBD4242168)
Summary:
Make BOLT resilient to changes in the LLVM's X86 target library
by not hardwiring the list of default CIE instructions, but detecting it
at run time.
(cherry picked from FBD4200982)
Summary:
In order to improve gdb experience with BOLT we have to make
sure the output file has a single .eh_frame section. Otherwise
gdb will use either old or new section for unwinding purposes.
This diff relocates the original .eh_frame section next to
the new one generated by LLVM. Later we merge two sections
into one and make sure only the newly created section has
.eh_frame name.
(cherry picked from FBD4203943)
Summary:
We used to patch an existing .eh_frame_hdr and append contents
for split functions at the end. However, this approach does not
work in relocation mode since function addresses change and split
functions will not necessarily be at the end.
Instead of patching and appending we generate the new .eh_frame_hdr
based on contents of old and new .eh_frame sections.
(cherry picked from FBD4180756)
Summary:
In a prev diff I disabled inclusion of FDEs for cold fragments that
we fail to write. The side effect of it was that we failed to
write FDE for the next function with a cold fragment since it
had the same assigned address that we had put in FailedAddresses.
The correct fix is to assign zero address to failed cold fragments
and ignore them when we write .eh_frame_hdr.
(cherry picked from FBD4156740)
Summary:
CFI instructions may live in CIEs or FDEs. CIEs hold common
instructions used across many FDEs. When replaying CFIs to the output
binary, llvm-bolt needs to replay both instructions from CIE and the
corresponding FDE for the function. However, some instructions need not
to be replayed because MCStreamer/MCDwarf and friends will write them
by default in the output CIE. This patch fixes the code that tried to
recognize one of these default instructions but was failing, resulting
in an extra CFI instruction in each FDE we outputted. With this patch,
the output binary should be a bit smaller.
(cherry picked from FBD4194753)
Summary:
Modify the MC layer (MCDwarf.h|cpp) to understand CFI
instructions dealing with DWARF expressions. Add code to emit DWARF
expressions in MCDwarf. Change llvm-bolt to pass these CFI instructions
to streamer instead of bailing on them. Change -dump-eh-frame option in
llvm-bolt to dump the EH frame of the rewritten binary in addition to
the one in the original binary, allowing us to properly test this patch.
(cherry picked from FBD4194452)
Summary:
AVX-512 disassembler support in LLVM is not quite ready yet.
Before we feel more comfortable about it we disable processing
of all functions that use any EVEX-encoded instructions.
(cherry picked from FBD4028706)
Summary:
When we fail to write functions that are too big, we have to
effectively cancel their effect on exception handling by ignoring
their FDE entries in .eh_frame while writing .eh_frame_hdr.
This can happen to functions that we split too. In such cases
the cold part has its own FDE and we have to ignore that one too.
This doesn't happen very often - I've only seen one case on
hhvm binary, however it is a potential issue. The fix is to
add the cold part address to the list of failed-to-write
addresses.
(cherry picked from FBD3987984)
Summary:
Modified function discovery process to tolerate more functions and
symbols coming from assembly. The processing order now matches
the memory order of the functions (input symbol table is unsorted).
Added basic support for functions with multiple entries. When
a function references its internal address other than with
a branch instruction, that address could potentially escape.
We mark such addresses as entry points and make sure they
are treated as roots by unreachable code elimination.
Without relocations we have to mark multiple-entry functions
as non-simple.
(cherry picked from FBD3950243)
Summary:
Added support for jump tables in code compiled with "-fpic".
Code pattern generated for position-independent jump tables
is quite different, as is the format of the tables.
More details in comments.
Coverage increased slightly for a test, mostly due to the code
coming from external lib that was compiled with "-fpic".
(cherry picked from FBD3940771)
Summary:
Allow UCE when blocks have EH info. Since UCE may remove blocks
that are referenced from debugging info data structures, we don't
actually delete them. We just mark them with an "invalid" index
and store them in a different vector to be cleaned up later once
the BinaryFunction is destroyed. The debugging code just skips
any BBs that have an invalid index.
Eliminating blocks may also expose useless jmp instructions, i.e.
a jmp around a dead block could just be a fallthrough. I've added
a new routine to cleanup these jmps. Although, @maks is working on
changing fixBranches() so that it can be used instead.
(cherry picked from FBD3793259)
Summary:
Add level for "-jump-tables=<n>" option:
1 - all jump tables are output in the same section (default).
2 - basic splitting, if the table is used it is output to hot section
otherwise to cold one.
3 - aggressively split compound jump tables and collect profile for
all entries.
Option "-print-jump-tables" outputs all jump tables for debugging
and/or analyzing purposes. Use with "-jump-tables=3" to get profile
values for every entry in a jump table.
(cherry picked from FBD3912119)
Summary:
Insert ud2 instructions after indirect tailcalls to prevent the CPU from
decoding instructions following the callsite.
A simple counter in the peephole pass shows 3260 tail call traps inserted.
(cherry picked from FBD3859737)
Summary:
Get rid of all uses of getIndex/getLayoutIndex/getOffset outside of BinaryFunction.
Also made some other offset related methods private.
(cherry picked from FBD3861968)
Summary:
Add -print-sorted-by and -print-sorted-by-order command line options.
The first option takes a list of dyno stats keys used to sort functions
that are printed at the end of all optimization passes. Only the top
100 functions are printed. The -print-sorted-by-order option can be
either ascending or descending (descending is the default).
(cherry picked from FBD3898818)
Summary:
While working on PLT dyno stats I've noticed that we were missing
BinaryFunctions for some symbols that were not PLT. Upon closer inspection
it turned out that those symbols were marked as zero-sized functions in
symbol table, but they had duplicates with non-zero size. Since the
zero-size symbols were preceding other duplicates, we were not creating
BinaryFunction for them and they were not added as duplicates.
The 2 most prominent functions that were missing for a test were free() and
malloc(). There's not much to optimize in these functions, but they were
contributing quite significantly to dyno stats.
As a result dyno stats for this test needed an adjustment.
Also several assembly functions (e.g. _init()) had zero size, and now we
set the size to the max size and start processing those. It's good for
coverage but will not affect the performance.
(cherry picked from FBD3874622)
Summary:
Option "-jump-tables=1" enables experimental support for jump tables.
The option hasn't been tested with optimizations other than block
re-ordering.
Only non-PIC jump tables are supported at the moment.
(cherry picked from FBD3867849)
Summary:
This is just a bit of refactoring to make sure that BinaryFunction goes
through methods to get at the state in BinaryBasicBlock. I did this so
that changing the way Index/LayoutIndex/Valid works will be easier.
(cherry picked from FBD3860899)
Summary:
Add "-reorder-blocks=cluster-shuffle" for performance experiments.
Use "-bolt-seed=<N>" to set a randomization seed.
(cherry picked from FBD3851035)
Summary:
Switch table can contain __builtin_unreachable(). As a result,
a compiler may place an entry into a jump table that contains
an address immediately past the last instruction in the function.
Sometimes it may coincide with a start of the next function in
the binary. Thus when we check for switch tables in such cases
we have to check more than a single entry until we see either
an address inside containing function or some address outside
different from the address past the last instruction.
Additionally, don't stop disassembly after discovering that the
function was not simple. We need to detect all outside
references whenever possible.
(cherry picked from FBD3850825)
Summary:
Replace jumps to other unconditional jumps with the final
destination, e.g.
B0: ...
jmp B1 (or jcc B1)
B1: jmp B2
->
B0: ...
jmp B2 (or jcc B2)
This peephole removes 8928 double jumps from a test binary.
Note: after filtering out double jumps found in EH code and infinite
loops, the number of double jumps patched is 49 (24 for a clang
compiled test). The 24 in the clang build are all from external
libraries which have probably been compiled with gcc. This peephole
is still useful for cleaning up after ICP though.
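The retargeting step can be sketched on a toy model where each block that consists solely of an unconditional jump is recorded in a map. The map-based representation and the name `resolveFinalTarget` are assumptions for illustration. Jumps that lead into a cycle are left untouched, mirroring the infinite-loop filtering mentioned above.

```cpp
#include <map>
#include <set>
#include <string>

// JumpOnly maps a block containing only "jmp X" to its destination X.
// Returns the block a jump to Target should be redirected to.
std::string resolveFinalTarget(
    const std::map<std::string, std::string> &JumpOnly, std::string Target) {
  const std::string Original = Target;
  std::set<std::string> Seen;
  while (Seen.insert(Target).second) {
    auto It = JumpOnly.find(Target);
    if (It == JumpOnly.end())
      return Target; // reached a block that does real work
    Target = It->second;
  }
  return Original; // chain loops back on itself: keep the jump as it was
}
```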
(cherry picked from FBD3815420)
Summary:
I've added dyno stats printing per pass so we can see the results
of each optimization pass on the stats. I've also factored out the
post pass function printing code since it was pretty much the same
after each pass.
(cherry picked from FBD3843587)
Summary:
For now we make SCTC a special pass that runs at the end of all
optimizations and transformations right after fixupBranches().
Since it's the last pass, it has to do its own UCE.
(cherry picked from FBD3838051)
Summary:
Add "-dyno-stats" option that prints instruction stats based on
the execution profile similar to below:
BOLT-INFO: program-wide dynostats after optimizations:
executed forward branches : 109706407 (+8.1%)
taken forward branches : 13769074 (-55.5%)
executed backward branches : 24517582 (-25.0%)
taken backward branches : 15330256 (-27.2%)
executed unconditional branches : 6009826 (-35.5%)
function calls : 17192114 (+0.0%)
executed instructions : 837733057 (-0.4%)
total branches : 140233815 (-2.3%)
taken branches : 35109156 (-42.8%)
Also fixed pseudo instruction discrepancies and added assertions
for BinaryBasicBlock::getNumPseudos() to make sure the number is
synchronized with real number of pseudo instructions.
(cherry picked from FBD3826995)
Summary:
The CFG represents "the ultimate source of truth". Transformations
on functions and blocks have to update the CFG and fixBranches() would
make sure the correct branch instructions are inserted at the end of
basic blocks (or removed when necessary).
We do require a conditional branch at the end of the basic block if
the block has 2 successors as CFG currently lacks the conditional
code support (it will probably stay that way). We only use this
branch instruction for its conditional code, the destination is
determined by CFG - first successor representing true/taken branch,
while the second successor - false/fall-through branch.
When we reverse the branch condition, the CFG is updated accordingly.
The previous version used to insert jumps after some terminating
instructions sometimes resulting in a larger code than needed. As a
result with the new version 1 extra function becomes overwritten for
HHVM binary.
With this diff we also convert conditional branches with one successor
(result of code from __builtin_unreachable()) into unconditional
jumps.
(cherry picked from FBD3802062)
Summary:
This will make it easier to run experiments with the same baseline
BOLT binary but different command line options.
(cherry picked from FBD3831978)
Summary:
A number of fixes/enhancements to inline-small-functions
- Fixed estimateHotSize to use computeCodeSize instead of the original layout offsets.
- Added -print-inline option to dump CFGs for functions that have been modified by inlining.
- Added flag to force consideration of functions without any profiling info (mostly for testing)
- Updated debug line info for inlined functions.
- Ignore the number of pseudo instructions when checking for candidates of suitable size.
Misc changes
- Moved most print flags to BinaryPasses.cpp
(cherry picked from FBD3812658)
Summary:
A previous diff accidentally disabled tail call conversion.
Additionally some test cases relied on output of "-v=2". Fix those.
(cherry picked from FBD3823760)
Summary:
I've added a verbosity level to help keep the BOLT spewage to a minimum.
The default level is pretty terse now, level 1 is closer to the original,
I've saved level 2 for the noisiest of messages. Error messages should
never be suppressed by the verbosity level, only warnings and info messages.
The rationale behind stream usage is as follows:
outs() for info and debugging controlled by command line flags.
errs() for errors and warnings.
dbgs() for output within DEBUG().
With the exception of a few of the level 2 messages I don't have any strong feelings about the others.
(cherry picked from FBD3814259)
Summary:
While creating remember_state/restore_state CFI sequences, we
were always placing remember_state instruction into the first
basic block. However, when we have hot-cold splitting, the cold
part has an independent FDE entry in .eh_frame, and thus the
restore_state instruction was missing its counterpart.
The fix is to adjust the basic block that is used for placing
remember_state instruction whenever we see the hot-cold split
boundary.
(cherry picked from FBD3767102)
Summary:
Analyze indirect branches and convert them into indirect
tail calls when possible. We analyze the memory contents
when the address could be calculated statically and also
detect epilogue code.
(cherry picked from FBD3754395)
Summary:
We were applying padding to the calculated address but were never
writing it to the file, triggering an assertion in cases where the
.gcc_except_table size wasn't a multiple of 4.
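As a rough illustration of keeping the written bytes in sync with the padded address, here is a small Python sketch. The function name and the use of zero bytes as padding are assumptions made for the example.

```python
def pad_to_multiple(data: bytes, alignment: int = 4) -> bytes:
    """Append padding bytes so the length is a multiple of `alignment`.

    Zero is used as the padding byte here; that choice is an assumption.
    """
    padding = (-len(data)) % alignment
    return data + b"\x00" * padding
```

The point is that the padding must be emitted into the output, not only added to the computed address; otherwise the two disagree whenever the size is not already aligned.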
(cherry picked from FBD3744638)
Summary:
We only need ClusterEdges in reordering algorithm optimized for
branches and the computation is quite resource-hungry, thus it
makes sense to only do it when needed.
Some refactoring too.
(cherry picked from FBD3721107)
Summary:
Operands in the initial instruction stream should all have immediate operands
for instructions that can be shortened. But if a BOLT optimization pass adds
one of these instructions with a symbolic operand, the shortening operation
will assert. This diff adds checks to make sure that the operands are
immediate.
I've also disabled shortening pass by default since it won't really be needed
until ICP is submitted. It will still run at CFG creation time.
(cherry picked from FBD3610646)
Summary:
Add the following info to the graphviz CFG dump:
- Edges are labeled with the jmp instruction that leads to that edge.
- Edges include the count and misprediction count.
- Nodes have (offset, BB index, BB layout index)
- Nodes optionally have tooltips which contain the code of the basic block.
(enabled with -dot-tooltip-code)
- Added dashed edges to landing pads.
(cherry picked from FBD3646568)
Summary:
Avoid referring to BinaryFunction's by name.
Functions could be found by MCSymbol using
BinaryContext::getFunctionForSymbol().
(cherry picked from FBD3707685)
Summary:
Eliminated BinaryFunction::getName(). The function was confusing since
the name is ambiguous. Instead we have BinaryFunction::getPrintName()
used for printing and whenever unique string identifier is needed
one can use getSymbol()->getName(). In the next diff I'll have
a map from MCSymbol to BinaryFunction in BinaryContext to facilitate
function lookup from instruction operand expressions.
There's one bug fixed where the function was called only under assert()
in ICF::foldFunction().
For output we update all symbols associated with the function. At the
moment it has no effect on the generated binary but in the future we
would like to have all symbols in the symbol table updated.
(cherry picked from FBD3704790)
Summary:
This adds functionality for a more aggressive inlining pass, that can
inline tail calls and functions with more than one basic block.
(cherry picked from FBD3677856)
Summary:
Add three new MCOperand types: Annotation, LandingPad and GnuArgsSize.
Annotation is used for associating random data with MCInsts. Clients can
construct their own annotation types (subclassed from MCAnnotation) and
associate them with instructions. Annotations are looked up by string keys.
Annotations can be added, removed and queried using an instance of the
MCInstrAnalysis class.
The LandingPad operand is a MCSymbol, uint64_t pair used to encode exception
handling information for call instructions.
GnuArgsSize is used to annotate calls with the DW_CFA_GNU_args_size attribute.
(cherry picked from FBD3597877)
Summary:
BOLT attempts to convert jumps that serve as tail calls to dedicated tail call
instructions, but this is impossible when the jump is conditional because there is
no corresponding tail call instruction. This was causing the creation of a duplicate
fall-through edge for basic blocks terminated with a conditional jump serving as
a tail call when there is profile data available for the non-taken branch. In this
case, the first fall-through edge had a count taken from the profile data, while
the second has a count computed (incorrectly) by
BinaryFunction::inferFallThroughCounts.
(cherry picked from FBD3560504)
Summary:
LLVM was missing assembler print string for indirect tail
calls which are synthetic instructions created by us.
(cherry picked from FBD3640197)
Summary:
This diff adds a number of methods to BinaryFunction that can be used to edit the CFG after it is created.
The basic public functions are:
- createBasicBlock - create a new block that is not inserted into the CFG.
- insertBasicBlocks - insert a range of blocks (made with createBasicBlock) into the CFG.
- updateLayout - update the CFG layout (either by inserting new blocks at a certain point or recomputing the entire layout).
- fixFallthroughBranch - add a direct jump to the fallthrough successor for a given block.
There are a number of private helper functions used to implement the above.
This was split off the ICP diff to simplify it a bit.
(cherry picked from FBD3611313)
Summary:
This algorithm is similar to our main clustering algorithm but uses
a different heuristic for selecting edges to become fall-throughs.
The weight of an edge is calculated as the win in branches if we choose
to layout this edge as a fall-through. For example, the edges A -> B with
execution count 100 and A -> C with execution count 500 (where B and C
are the only successors of A) have weights -400 and +400 respectively.
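The example above can be reproduced with a short Python sketch. The function name is hypothetical, and the generalization to more than two successors (own count minus the sum of the sibling counts) is an assumption; the text only gives the two-successor case.

```python
from typing import Dict


def fallthrough_weights(succ_counts: Dict[str, int]) -> Dict[str, int]:
    """Weight of each outgoing edge if it were laid out as the fall-through.

    Picking an edge as fall-through avoids `count` taken branches but leaves
    all sibling edges as taken branches, hence count - (total - count).
    """
    total = sum(succ_counts.values())
    return {succ: count - (total - count) for succ, count in succ_counts.items()}


weights = fallthrough_weights({"B": 100, "C": 500})
```

For block A with successors B (count 100) and C (count 500), this yields -400 for A -> B and +400 for A -> C.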
(cherry picked from FBD3606591)
Summary:
Added an ICF pass to BOLT, that can recognize identical functions
and replace references to these functions with references to just one
representative.
(cherry picked from FBD3460297)
Summary:
I've factored out the instruction printing and size computation routines to
methods on BinaryContext. I've also added some more debug print functions.
This was split off the ICP diff to simplify it a bit.
(cherry picked from FBD3610690)
Summary:
Instructions that load data from a read-only data section and whose
target address can be computed statically (e.g. RIP-relative addressing)
are converted to corresponding instructions that use immediate operands.
We apply the transformation only when the resulting instruction will have
smaller or equal size.
(cherry picked from FBD3397112)
Summary:
Loop detection for the CFG data structure. Added a GraphTraits
specialization for BOLT's CFG that allows us to use LLVM's loop
detection interface.
(cherry picked from FBD3604837)
Summary:
When a mov instruction has a 64-bit immediate that can be represented as
a sign-extended 32-bit number, use the smaller mov instruction (MOV64ri -> MOV64ri32).
Add peephole optimization pass that does instruction shortening.
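A minimal Python sketch of the range check behind this shortening follows. The helper names are hypothetical; only the opcode names MOV64ri and MOV64ri32 come from the text, and opcodes are modeled as plain strings.

```python
def to_signed64(raw: int) -> int:
    """Interpret the low 64 bits of `raw` as a two's-complement value."""
    raw &= (1 << 64) - 1
    return raw - (1 << 64) if raw >> 63 else raw


def fits_signed(value: int, bits: int) -> bool:
    """True if `value` is representable as a signed integer of `bits` bits."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def shorten_mov(opcode: str, imm: int) -> str:
    """Pick the shorter opcode when the immediate sign-extends from 32 bits."""
    if opcode == "MOV64ri" and fits_signed(to_signed64(imm), 32):
        return "MOV64ri32"
    return opcode
```

Note that 0x80000000 does not qualify: sign-extending its 32-bit form would produce a negative 64-bit value rather than the original immediate.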
(cherry picked from FBD3603099)
Summary:
Generate short versions of branch instructions by default and rely on
relaxation to produce longer versions when needed.
Also produce short versions of arithmetic instructions if immediate
fits into one byte. This was only triggered once on HHVM binary.
(cherry picked from FBD3591466)
Summary:
patchELFPHDRTable was asserting that it could not find an entry
for .eh_frame_hdr in SectionMapInfo when no functions were modified
by BOLT.
This just changes code to skip modifying GNU_EH_FRAME program headers
when SectionMapInfo is empty. The existing header is copied and written
instead.
(cherry picked from FBD3557481)
Summary:
If a profile data was collected on a stripped binary but an input
to BOLT is unstripped, we would use a different mangling scheme for
local functions and ignore their profiles. To solve the issue this
diff adds alternative name for all local functions such that one
of the names would match the name in the profile.
If the input binary was stripped, we reject it, unless "-allow-stripped"
option was passed. It's more complicated to do a matching in this case
since we have less information than at the time of profile collection.
It's also not that simple to tell if the profile was gathered on a
stripped binary (in which case we would have no issue matching data).
(cherry picked from FBD3548012)
Summary:
Store the basic block index inside the BinaryBasicBlock instead of a map in BinaryFunction.
This cut another 15-20 sec. from the processing time for hhvm.
(cherry picked from FBD3533606)
Summary:
Use unordered_map instead of map in ReorderAlgorithm and BinaryFunction::BasicBlockIndices.
Cuts about 30 sec off the processing time for the hhvm binary (~8.5 min to ~8 min).
(cherry picked from FBD3530910)
Summary:
This fixes the initialization of basic block execution counts, where
we should skip edges to the first basic block but we were not
skipping the corresponding profile info.
Also, I removed a check that was done twice.
(cherry picked from FBD3519265)
Summary:
I noticed the BinaryFunction::viewGraph() method that hadn't been implemented
and decided I could use a simple DOT dumper for CFGs while working on the indirect
call optimization.
I've implemented the bare minimum for the dumper. It's just nodes+BB labels with
edges. We can add more detailed information as needed/desired.
(cherry picked from FBD3509326)
Summary:
Added perf2bolt functionality for extracting branch records
with histories of previous branches. The length of the histories
is user defined, and the default is 0 (previous functionality). Also,
DataReader can parse perf2bolt output with histories.
Note: creating profile data with long histories can increase their
size significantly (2x for history of length 1, 3x for length 2 etc).
(cherry picked from FBD3473983)
Summary:
When a conditional jump is followed by one or more no-ops, the
destination of fall-through branch was recorded as the first no-op in
FuncBranchInfo. However the fall-through basic block after the jump
starts after the no-ops, so the profile data could not match the CFG
and was ignored.
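The mismatch can be pictured with a small Python sketch that looks for the real fall-through destination by skipping no-ops. The instruction list layout (offset, mnemonic) and the function name are hypothetical.

```python
from typing import List, Tuple


def fallthrough_target(instrs: List[Tuple[int, str]], jump_index: int) -> int:
    """Offset of the first non-nop instruction after the conditional jump."""
    i = jump_index + 1
    while i < len(instrs) and instrs[i][1] == "nop":
        i += 1
    if i == len(instrs):
        raise ValueError("no fall-through instruction after the jump")
    return instrs[i][0]


code = [(0x0, "cmp"), (0x3, "jne"), (0x5, "nop"), (0x6, "nop"), (0x8, "mov")]
```

Here the profile should record offset 0x8 as the fall-through destination, not 0x5, so that it matches the start of the next basic block in the CFG.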
(cherry picked from FBD3496084)
Summary:
The various reorder and clustering algorithms have been refactored
into separate classes, so that it is easier to add new algorithms and/or
change the logic of algorithm selection.
(cherry picked from FBD3473656)
Summary:
With ICF optimization in the linker we were getting mismatches of
function names in .fdata and BinaryFunction name. This diff adds
support for multiple function names for BinaryFunction and
does a match against all possible names for the profile.
(cherry picked from FBD3466215)
Summary:
Verify profile data for a function and reject if there are branches
that don't correspond to any branches in the function CFG. Note that
we have to ignore branches resulting from recursive calls.
Fix printing instruction offsets in disassembled state.
Allow function to have non-zero execution count even if we don't
have branch information.
(cherry picked from FBD3451596)
Summary:
Print total number of functions/objects that have profile
and add new options:
-print - print the list of objects with count to stderr
=none - do not print objects/functions
=exec - print functions sorted by execution count
=branches - print functions sorted by total branch count
-q - do not print merged data to stdout
(cherry picked from FBD3442288)
Summary: This will help optimization passes that need to modify the CFG after it is constructed. Otherwise, the BinaryBasicBlock pointers stored in the layout, successors and predecessors would need to be modified every time a new basic block is created.
(cherry picked from FBD3403372)
Summary:
Turn on -fix-debuginfo-large-functions by default.
In the process of testing I've discovered that we output cold code
for functions that were too large to be emitted. Fixed that.
(cherry picked from FBD3372697)
Summary:
Assembly functions could have no corresponding DW_TAG_subprogram
entries, yet they are represented in module ranges (and .debug_aranges)
and will have line number information. Make sure we update those.
Eliminated unnecessary data structures and optimized some passes.
For .debug_loc unused location entries are no longer processed
resulting in smaller output files.
Overall it's a small improvement in processing time and memory usage.
(cherry picked from FBD3362540)
Summary: The inference algorithm for counts of fall through edges takes possible jumps to landing pad blocks into account. Also, the landing pad block execution counts are updated using profile data.
(cherry picked from FBD3350727)
Summary:
Clang uses different attribute for high_pc which
was incompatible with the way we were updating
ranges. This diff fixes it.
(cherry picked from FBD3345537)
Summary:
* Fix several cases for handling debug info:
- properly update CU DW_AT_ranges for function with folded body
due to ICF optimization
- convert ranges to DW_AT_ranges from hi/low PC for all DIEs
- add support for [a, a) range
- update CU ranges even when there are no functions registered
* Overwrite .debug_ranges section instead of appending.
* Convert assertions in debug info handling part into warnings.
(cherry picked from FBD3339383)
Summary:
Some compile unit DIEs might be missing DW_AT_ranges because they were
compiled without "-ffunction-sections" option. This diff adds the
attribute to all compile units.
If the section is not present, we need to create it. Will do it in a
separate diff.
(cherry picked from FBD3314984)
Summary:
Overwrite contents of .debug_line section since we don't reference
the original contents anymore. This saves ~100MB of HHVM binary.
(cherry picked from FBD3314917)
Summary:
A simple optimization to prevent branch misprediction for tail calls.
Convert the sequence:
j<cc> L1
...
L1: jmp foo # tail call
into:
j<cc> foo
but only if 'j<cc> foo' turns out to be a forward branch.
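A minimal Python sketch of the forward-branch condition follows. The function name and the symbol-address map are hypothetical, and "forward" is taken to mean that the callee's address is higher than the branch address, which is an assumption about how the check is made.

```python
from typing import Dict, Optional


def fold_conditional_tail_call(
    branch_addr: int, cond: str, callee: str, symbol_addrs: Dict[str, int]
) -> Optional[str]:
    """Return the folded 'j<cc> callee' text, or None if it would go backward."""
    if symbol_addrs[callee] > branch_addr:
        return f"j{cond} {callee}"
    return None
```

For a branch at 0x1000 and foo at 0x2000 the sequence is folded into a single conditional branch; if foo were at 0x500 the original two-jump sequence would be kept.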
(cherry picked from FBD3234207)
Summary:
While emitting debug lines for a function we don't overwrite, we
don't have a code section context that is needed by default
writing routine. Hence we have to emit end_sequence after the
last address, not at the end of section.
(cherry picked from FBD3291533)
Summary:
Added an optimization pass of inlining calls to small functions (with only one
basic block). Inlining is done in a very simple way, inserting instructions to
simulate the changes to the stack pointer that call/ret would make before/after the
inlined function executes. Also, the heuristic prefers to inline calls that happen
in the hottest blocks (by looking at their execution count). Calls in cold blocks are
ignored.
(cherry picked from FBD3233516)
Summary:
Many functions (around 600) in the HHVM binary are simply
a single unconditional jump instruction to another function. These can
be trivially optimized by modifying the call sites to directly call the
branch target instead (because it also happens with more than one jump
in sequence, we do it iteratively).
This diff also adds a very simple analysis/optimization pass system in
which this pass is the first one to be implemented. A follow-up to this
could be to move the current optimizations to other passes.
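The iterative part can be sketched in Python as following a chain of jump-only functions until a real function is reached. The map layout and function name are hypothetical, and the cycle guard is an addition for safety in the sketch rather than something the text describes.

```python
from typing import Dict


def resolve_call_target(callee: str, jump_only: Dict[str, str]) -> str:
    """Follow single-jump functions until reaching a function with a real body.

    `jump_only` maps a function that is just 'jmp target' to that target.
    """
    seen = set()
    while callee in jump_only and callee not in seen:
        seen.add(callee)
        callee = jump_only[callee]
    return callee


thunks = {"a": "b", "b": "c"}
```

A call site that targets a would be rewritten to call c directly, skipping both intermediate jumps.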
(cherry picked from FBD3211138)
Summary:
Fix the error message by not printing it :)
Explanation: a previous diff accidentally removed this error message from within
the DEBUG macro, and it's expected that we'll have a bunch of them since a lot
of the DIEs we try to update are empty or meaningless. For instance (and mainly), there
is a huge number of lexical block DIEs with no attributes in .debug_info.
In the first phase of collecting debugging info, we store the offsets of all
these DIEs, only later to realize that we cannot update their address
ranges because they have none.
A better fix would be to check this earlier and not store offsets of DIEs
we cannot update to begin with.
(cherry picked from FBD3236923)
Summary:
A lot of the space in the merged .fdata is taken by branches
to and from [heap], which is jitted code. On different machines,
or during different runs, jitted addresses are all different.
We don't use these addresses, but we need branch info to get
accurate function call counts.
This diff treats all [heap] addresses the same, resulting in a
simplified merged file. The size of the compressed file decreased
from 70MB to 8MB.
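The effect of collapsing [heap] addresses can be sketched in Python as below. The record layout (from symbol, from offset, to symbol, to offset, count) is a hypothetical stand-in and not the actual .fdata format.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical record: (from_sym, from_offset, to_sym, to_offset, count)
Branch = Tuple[str, int, str, int, int]


def normalize(sym: str, offset: int) -> Tuple[str, int]:
    """Collapse every [heap] location into one canonical location."""
    return ("[heap]", 0) if sym == "[heap]" else (sym, offset)


def merge_branches(records: Iterable[Branch]) -> Counter:
    """Sum counts of records that are identical after normalization."""
    merged: Counter = Counter()
    for from_sym, from_off, to_sym, to_off, count in records:
        key = normalize(from_sym, from_off) + normalize(to_sym, to_off)
        merged[key] += count
    return merged


records = [
    ("[heap]", 0x10, "main", 0, 5),
    ("[heap]", 0x99, "main", 0, 7),
    ("main", 4, "foo", 0, 1),
]
merged = merge_branches(records)
```

The two [heap] records differ only in the jitted address, so they merge into one entry while their counts are preserved for call-count accuracy.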
(cherry picked from FBD3233943)
Summary:
In a test binary some functions are placed in a segment
preceding the segment containing .text section. As a result,
we were miscalculating maximum function size as the calculation
was based on addresses only.
This diff fixes the calculation by checking if symbol after function
belongs to the same section. If it does not, then we set the maximum
function size based on the size of the containing section and not
on the address distance to the next symbol.
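A minimal Python sketch of the corrected calculation is shown below. The Symbol and Section classes are hypothetical, and "based on the size of the containing section" is interpreted here as the distance from the function to the end of its section, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Symbol:
    address: int
    section: str


@dataclass
class Section:
    name: str
    address: int
    size: int


def max_function_size(func: Symbol, next_sym: Optional[Symbol], section: Section) -> int:
    """Upper bound on the function size, never crossing a section boundary."""
    if next_sym is not None and next_sym.section == func.section:
        return next_sym.address - func.address
    # Next symbol is absent or lives in another section: stop at section end.
    return section.address + section.size - func.address


text = Section(".text", 0x1000, 0x100)
func = Symbol(0x10C0, ".text")
```

With the next symbol in another section, the bound becomes 0x40 (the rest of .text) rather than the much larger address distance to that symbol.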
(cherry picked from FBD3229205)
Summary:
Added option "-break-funcs=func1,func2,...." to coredump in any
given function by introducing ud2 sequence at the beginning of the
function. Useful for debugging and validating stack traces.
Also renamed options containing "_" to use "-" instead.
Also run hhvm test with "-update-debug-sections".
(cherry picked from FBD3210248)
Summary:
Make sure we can install all tools needed for processing
BOLT .fdata files such as perf2bolt, merge-fdata, etc.
(cherry picked from FBD3223477)
Summary:
merge-fdata tool takes multiple .fdata files and outputs to stdout
combined fdata. Takes about 2 seconds for each additional .fdata
file with hhvm production data.
(cherry picked from FBD3216430)
Summary:
Splitting option now has different meanings/values. Since landing pads
are almost always cold/frozen, we should split them before anything
else (we still check the execution count is 0). That's value '1'.
Everything else goes on top of that and has increased value (2 - large
functions, 3 - everything).
Sorting was non-deterministic and somewhat broken for functions
with EH ranges. Fixed that and added '-split-all-cold' option to
outline all 0-count blocks.
Fixed compilation of test cases. After my last commit the binaries
were linked to wrong source files (i.e. debug info). Had to rebuild
the binaries from updated sources.
(cherry picked from FBD3209369)
Summary:
GNU_args_size is a special kind of CFI that tells runtime to adjust
%rsp when control is passed to a landing pad. It is used for annotating
call instructions that pass (extra) parameters on the stack and there's
a corresponding landing pad.
It is also special in a way that its value is not handled by
DW_CFA_remember_state/DW_CFA_restore_state instruction sequence
that we utilize to restore the state after block re-ordering.
This diff adds association of call instructions with GNU_args_size value
when it's used. If the function does not use GNU_args_size, there is
no overhead. Otherwise, we regenerate GNU_args_size instruction during
code emission, i.e. after all optimizations and block-reordering.
(cherry picked from FBD3201322)
Summary:
Simple functions which we fail to rewrite after optimizations were
having wrong debugging information because the latter would reflect the optimized
version of the function.
There are only 48 functions (at this time) in this situation in the HHVM binary.
The simple fix is to add another full pass. Another more complicated path, which will
be more efficient, is to reset only the BinaryContext and emit again, but then we need
to recreate all symbols in the new MCContext and update the pointers. I started
taking this path but it started getting too complicated for only those 48 functions
(needed to create a new map of global symbols, recreate landing pads - which needed
to have the internal intermediate labels in the functions kept to be updated too, etc).
Because the overhead is quite large (another full emission pass - around 4m30s here)
and the impact is small I put this behind a new
command-line flag which is off by default: -fix-debuginfo-large-functions.
(cherry picked from FBD3166576)
Summary:
Update address ranges of inlined functions and try/catch blocks.
This was missing and led gdb to show weird information in a core dump we inspected
because of the several nestings of inline in the call stack.
This is very similar to Lexical Blocks, so the change is to basically generalize that
code to do the same for DW_TAG_try_block, DW_TAG_catch_block and DW_TAG_inlined_subroutine.
(cherry picked from FBD3169417)
Summary:
readelf was showing some errors because we weren't updating DIEs that were not shallow
in the DIE tree, or DIEs of functions with addresses we don't recognize (mostly functions with
address 0, which could have been removed by the Linker Script but still have debugging information
there). These DIEs need to be updated because their abbreviations are patched.
(cherry picked from FBD3159335)
Summary:
We were updating only one DIE per function, but because the Linker Script may map
multiple functions to the same address this would cause us to generate invalid debug info
(as some DIEs weren't updated but their abbreviations were changed).
(cherry picked from FBD3157263)
Summary:
Non-simple functions aren't emitted, and thus didn't have line number information
emitted. This diff emits it for those functions by extending LLVM's generation of the line number program to allow for absolute addresses (it is wholly symbolic), then iterating over the relevant line tables from the input and appending entries with absolute addresses to the line tables to be emitted.
This still leaves the simple but not overwritten functions unhandled (there were 48 in HHVM in
my last run). However, I think that to fix them we'd need another pass, since by the time we
realize a simple function won't fit, debug line info was already written to the output.
(cherry picked from FBD3148468)
Summary:
Update DWARF location lists in .debug_loc and pointers to
them in .debug_info so that gdb can print variables which change
location during their lifetime.
The following changes were made:
- Refactored BasicBlockOffsetRanges to allow ranges to be tied to binary information (so that we can reuse it for location lists)
- Implemented range compression optimization in BasicBlockOffsetRanges (needed otherwise too much data was being generated).
- Added representation for location lists (LocationList.h, BinaryContext.h)
- Implemented .debug_loc serializer that keeps the updated offsets (DebugLocWriter.{h,cpp})
- After disassembly, traverse entries in .debug_loc and save them in context (BinaryContext.cpp)
- After optimizations, serialize .debug_loc and update pointers in .debug_info (RewriteInstance.cpp)
(cherry picked from FBD3130682)
Summary:
Add a parameter value to "-split-functions=" option to allow splitting
only when the function is too large to fit:
0 - never split
1 - split if too large to fit
2 - always split
We may use this option when the profile data is not very precise.
In that case excessive splitting may increase iTLB misses.
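The three option values map to a simple decision, sketched here in Python. The function and parameter names are hypothetical; only the meanings of 0, 1 and 2 come from the text.

```python
def should_split(mode: int, too_large_to_fit: bool) -> bool:
    """Decide whether to split a function for a given -split-functions value."""
    if mode == 0:
        return False             # never split
    if mode == 1:
        return too_large_to_fit  # split only if the function does not fit
    if mode == 2:
        return True              # always split
    raise ValueError(f"unsupported split mode: {mode}")
```

Mode 1 is the conservative choice for imprecise profiles, since it splits only the functions that could not otherwise be rewritten in place.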
(cherry picked from FBD3137700)
Summary:
This fixes a problem in which bolt was generating a malformed .debug_info
section on the bzip2 binary. The bug was the following:
- A simple and a non-simple function shared an abbreviation
- The abbreviation was patched to contain DW_AT_ranges because of the simple function
- The non-simple function's data was not updated, but then it didn't match the
layout expected by the abbreviation anymore
And because we were already creating an address ranges list in .debug_ranges even
for non-simple functions, it doesn't make sense not to use it anyway.
(cherry picked from FBD3129219)
Summary:
Updates DWARF lexical blocks address ranges in the output binary after optimizations.
This is similar to updating function address ranges except that the ranges representation needs
to be more general, since address ranges can begin or end in the middle of a basic block.
The following changes were made:
- Added a data structure for iterating over the basic blocks that intersect an address range: BasicBlockTable.h
- Added some more bookkeeping in BinaryBasicBlock. Basically, I needed to keep track of the block's size in the input binary as well as its address in the output binary. This information is mostly set by BinaryFunction after disassembly.
- Added a representation for address ranges relative to basic blocks (BasicBlockOffsetRanges.h). Will also serve for location lists.
- Added a representation for Lexical Blocks (LexicalBlock.h)
- Small refactorings in DebugArangesWriter:
-- Renamed to DebugRangesSectionsWriter since it also writes .debug_ranges
-- Refactored it not to depend on BinaryFunction but instead on anything that can be assigned an offset in .debug_ranges (added an interface for that)
- Iterate over the DIE tree during initialization to find lexical blocks in .debug_info (BinaryContext.cpp)
- Added patches to .debug_abbrev and .debug_info in RewriteInstance to update lexical blocks attributes (in fact, this part is very similar to what was done to function address ranges and I just refactored/reused that code)
- Added small test case (lexical_blocks_address_ranges_debug.test)
(cherry picked from FBD3113181)
Summary:
Before this diff LLVM used to iterate over all sections to find the
one with an address we want to remap. Since we have an extremely
large number of sections, this process is highly inefficient.
Instead we add a new interface to remap a section with a given ID
(which effectively is an index into an array of sections), and
pass the ID instead of the address.
This cuts down the processing time of hhvm binary by 10 seconds,
and brings the total processing time to a little under 2 minutes.
(cherry picked from FBD3110015)
Summary:
Populate function execution count while parsing fdata. Before
we used a quadratic algorithm to populate the execution count
(had to iterate over *all* branches for every single function).
Ignore non-symbol to non-symbol branches while parsing fdata.
These changes combined drop HHVM processing time from
4 minutes 53 seconds down to 2 minutes 9 seconds on my devserver.
Test case had to be modified since it contained irrelevant
branches from PLT to libc.
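A rough Python sketch of the single-pass approach follows. The record layout (from symbol, to symbol, to offset, count) is hypothetical, and treating a branch into offset 0 of a function as one execution of that function is an assumption made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical record: (from_sym, to_sym, to_offset, count); None = no symbol.
Record = Tuple[Optional[str], Optional[str], int, int]


def parse_branches(records: Iterable[Record]) -> Tuple[List[Record], Dict[str, int]]:
    """Filter records and accumulate execution counts in one linear pass."""
    exec_count: Dict[str, int] = defaultdict(int)
    kept: List[Record] = []
    for record in records:
        from_sym, to_sym, to_offset, count = record
        if from_sym is None and to_sym is None:
            continue  # non-symbol to non-symbol branches are ignored
        kept.append(record)
        if to_sym is not None and to_offset == 0:
            exec_count[to_sym] += count  # assumed: entry at offset 0 is a call
    return kept, dict(exec_count)


kept, counts = parse_branches(
    [("main", "foo", 0, 10), (None, "foo", 0, 5), (None, None, 0, 99), ("foo", "foo", 8, 3)]
)
```

Because the counts are accumulated as each record is read, no later pass over all branches per function is needed.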
(cherry picked from FBD3106263)
Summary:
[WIP] Update DWARF info for function address ranges.
This diff currently does not work for unknown reasons,
but I'm describing here what's the current state.
According to both llvm-dwarfdump and readelf our output seems correct,
but GDB does not interpret it as expected. All details go below in
hope I missed something.
I couldn't actually track the whole change that introduced support for
what we need in gdb yet, but I think I can get to it
(2007-12-04: Support lexical blocks and function bodies that occupy
non-contiguous address ranges). I have reasons to believe gdb at least at some
The set of introduced changes was basically this:
- After disassembly, iterate over the DIEs in .debug_info and find the
ones that correspond to each BinaryFunction.
- Refactor DebugArangesWriter to also write addresses of functions to
.debug_ranges and track the offsets of function address ranges there
- Add some infrastructure to facilitate patching the binary in
simple ways (BinaryPatcher.h)
- In RewriteInstance, after writing .debug_ranges already with
function address ranges, for each function do:
-- Find the abbreviation corresponding to the function
-- Patch .debug_abbrev to replace DW_AT_low_pc with DW_AT_ranges and
DW_AT_high_pc with DW_AT_producer (I'll explain this hack below).
Also patch the corresponding forms to DW_FORM_sec_offset and
DW_FORM_string (null-terminated in-place string).
-- Patch debug_info with the .debug_ranges offset in place of
the first 4 bytes of DW_AT_low_pc (DW_AT_ranges only occupies 4
bytes whereas low_pc occupies 8), and write an arbitrary string
in-place in the other 12 bytes that were the 4 MSB of low_pc
and the 8 bytes of high_pc before the patch. This depends on
low_pc and high_pc being put consecutively by the compiler, but
it serves to validate the idea. I tried another way of doing it
that does not rely on this but it didn't work either and I believe
the reason for either not working is the same (and still unknown,
but unrelated to them. I might be wrong though, and if I find yet
another way of doing it I may try it). The other way was to
use a form of DW_FORM_data8 for the section offset. This is
disallowed by the specification, but I doubt gdb validates this,
as it's just easier to store it as 64-bit anyway as this is even
necessary to support 64-bit DWARF (which is not what gcc generates
by default apparently).
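The byte layout of the in-place patch described above can be checked with a small Python sketch. The function name, the filler character and the little-endian encoding are assumptions; the 4 + 12 byte split over the original 16 bytes of low_pc and high_pc comes from the text.

```python
import struct


def patch_pc_pair(ranges_offset: int, filler: bytes = b"x") -> bytes:
    """Build 16 bytes that replace consecutive 8-byte low_pc and high_pc.

    Layout: 4-byte .debug_ranges offset (DW_FORM_sec_offset, little-endian
    assumed) followed by a 12-byte in-place string (DW_FORM_string), i.e.
    11 filler characters plus the terminating NUL.
    """
    assert len(filler) == 1, "filler must be a single byte"
    offset_field = struct.pack("<I", ranges_offset)
    string_field = filler * 11 + b"\x00"
    patched = offset_field + string_field
    assert len(patched) == 16
    return patched
```

The total size is unchanged, so no other offsets in .debug_info need to move, which is what makes this hack patchable in place.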
I still need to make changes to the diff to make it production-ready,
but first I want to figure out why it doesn't work as expected.
By looking at the output of llvm-dwarfdump or readelf, all of
.debug_ranges, .debug_abbrev and .debug_info seem to have been
correctly updated. However, gdb seems to have serious problems with
what we write.
(In fact, readelf --debug-dump=Ranges shows some funny warning messages
of the form ("Warning: There is a hole [0x100 - 0x120] in .debug_ranges"),
but I played around with this and it seems it's just because no
compile unit was using these ranges. Changing .debug_info apparently
changes these warnings, so they seem to be unrelated to the section
itself. Also looking at the hex dump of the section doesn't help,
as everything seems fine. llvm-dwarfdump doesn't say anything.
So I think .debug_ranges is fine.)
The result is that gdb not only doesn't show the function name as we
wanted, but it also stops showing line number information.
Apparently it's not reading/interpreting the address ranges at all,
and so the functions now have no associated address ranges, only the
symbol value which allows one to put a breakpoint in the function,
but not to show source code.
As this left me without more ideas of what to try to feed gdb with,
I believe the most promising next trial is to try to debug gdb itself,
unless someone spots anything I missed.
I found where the interesting part of the code lies for this
case (gdb/dwarf2read.c and some other related files, but mainly that one).
It seems in some parts gdb uses DW_AT_ranges for only getting
its lowest and highest addresses and setting that as low_pc and
high_pc (see dwarf2_get_pc_bounds in gdb's code and where it's called).
I really hope this is not actually the case for
function address ranges. I'll investigate this further. Otherwise
I don't think any changes we make will make it work as initially
intended, as we'll simply need gdb to support it and in that case it
doesn't.
(cherry picked from FBD3073641)
Summary:
We used to output .debug_line information for every instruction, but because of the way
gdb (and probably lldb as of llvm::DWARFDebugLine::LineTable::findAddress) queries the
line table it's not necessary to output information for two instructions if they follow
each other and map to the same source line. By not repeating this information we generate
a bit less .debug_line data.
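The idea can be sketched in Python as dropping a row when it maps to the same source line as the row just before it. The row layout (address, file, line) is a simplification and an assumption; real line table rows carry more fields.

```python
from typing import List, Tuple

Row = Tuple[int, str, int]  # (address, file, line), assumed sorted by address


def compress_rows(rows: List[Row]) -> List[Row]:
    """Keep a row only if its (file, line) differs from the previous kept row."""
    out: List[Row] = []
    for addr, file, line in rows:
        if out and out[-1][1:] == (file, line):
            continue  # same source line as the preceding instruction
        out.append((addr, file, line))
    return out


rows = [(0, "a.c", 1), (4, "a.c", 1), (8, "a.c", 2), (12, "a.c", 1)]
compressed = compress_rows(rows)
```

Only adjacent repeats are dropped: the row at address 12 is kept even though line 1 appeared earlier, because the row before it maps to line 2.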
(cherry picked from FBD3056402)
Summary:
The line number information generated from a null pointer
was actually valid, which caused new instructions without the line number
information set to have a valid and wrong line number reference. This diff
fixes this by making the null pointer be assigned to an invalid line number
row.
(cherry picked from FBD3048453)
Summary:
Write the .debug_aranges section after optimizations to the output binary.
Each function generates at least one range and at most two (one extra for its cold part).
The writing is done manually because LLVM's implementation is tied to the output of
.debug_info (see EmitGenDwarfInfo and EmitGenDwarfARanges in lib/MC/MCDwarf.cpp),
which we don't want to trigger right now.
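
As an illustration of the "one range, plus one for the cold part" rule, here is a small sketch. The dictionary keys (address, size, cold_address, cold_size) are hypothetical names chosen for this example, and the sketch only collects the ranges; it does not encode the section.

```python
def function_aranges(functions):
    """Collect (address, length) pairs: one per function, plus one per cold part."""
    ranges = []
    for func in functions:
        ranges.append((func["address"], func["size"]))
        # A split function contributes a second range for its cold part.
        if func.get("cold_size"):
            ranges.append((func["cold_address"], func["cold_size"]))
    return ranges


functions = [
    {"address": 0x1000, "size": 0x40},
    {"address": 0x2000, "size": 0x30, "cold_address": 0x9000, "cold_size": 0x10},
]
print(function_aranges(functions))
```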
(cherry picked from FBD3043108)
Summary:
At the moment we rely solely on the symbol table information to discover
function boundaries. However, similar information is contained in
.eh_frame. Verify that the information from these two sources is
consistent, and if it's not, then skip processing the functions with
conflicting information.
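
The consistency check could look roughly like the sketch below. The two input mappings are hypothetical stand-ins for the symbol table and the FDE ranges, and keeping functions that have no FDE at all is an assumption of this example, not something the summary states.

```python
def consistent_functions(symtab, fde_ranges):
    """Return names of functions whose symbol-table size agrees with .eh_frame.

    symtab:     hypothetical dict, name -> (start, size) from the symbol table.
    fde_ranges: hypothetical dict, start -> size taken from FDEs in .eh_frame.
    Assumption: a function without any FDE is kept; only conflicts are skipped.
    """
    keep = []
    for name, (start, size) in symtab.items():
        fde_size = fde_ranges.get(start)
        if fde_size is not None and fde_size != size:
            continue  # conflicting boundaries: skip this function
        keep.append(name)
    return keep


symtab = {"a": (0x100, 0x20), "b": (0x200, 0x10), "c": (0x300, 0x08)}
fde_ranges = {0x100: 0x20, 0x200: 0x18}
print(consistent_functions(symtab, fde_ranges))  # ['a', 'c']
```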
(cherry picked from FBD3043800)
Summary:
After we add new line number information we have to update stmt_list
offsets in .debug_info. For this I had to add primitive relocation
support for non-allocatable sections we are copying from the input file.
Also enabled functionality to process relocations in non-allocatable
sections that LLVM is generating, such as .debug_line. I thought
we already had it, but apparently it didn't work, at least not
for ELF binaries.
(cherry picked from FBD3037903)
Summary:
Skip DW_CFA_expression and DW_CFA_val_expression instructions
properly, according to DWARF spec.
If the CFI range does not match the function range, skip that function.
(cherry picked from FBD3040502)
Summary:
Writes .debug_line section by setting the state
in MCContext that LLVM needs to produce and output the
line tables. This basically consists of setting the
current location and compile unit offset. This makes LLVM
output .debug_line in the temporary file, but not yet in
the generated ELF file.
Also computes the line table offsets for each compile unit
and saves them into BinaryContext. Added an option to
print these offsets.
(cherry picked from FBD3004554)
Summary:
This is a set of changes that allow modification of non-allocatable
sections in an ELF binary, primarily for the purpose of updating debug
info.
Extend LLVM interface to allow processing relocations in non-allocatable
sections. This allows producing .debug* sections with resolved
relocations against generated code.
Extend BOLT rewriting framework to allow appending contents to
non-allocatable sections in the binary.
Re-worked ELF binary rewriting to support the above and to allow future
extensions (e.g. new section names).
(cherry picked from FBD3023403)
Summary:
Reads information in the DWARF .debug_line section using LLVM and
ties every MCInst to one line of a line table from the input binary. Subsequent
diffs will update this information to match the final binary layout and
output updated line tables.
(cherry picked from FBD2989813)
Summary:
Force the splitting of the function into hot/cold even when
the function fits into its original slot.
This reduces BOLT optimization time by 50% without affecting
hhvm performance.
(cherry picked from FBD2973773)
Summary:
If we see an unknown CFI instruction, skip processing the function
containing it instead of aborting execution.
(cherry picked from FBD2964557)
Summary:
Added an option to reuse an existing program header entry.
This option allows for bfd tools like strip and objcopy
to operate on the optimized binary without destroying it.
Also, all new sections are now properly marked in ELF.
(cherry picked from FBD2943339)
Summary:
We used to require pre-allocated space in the input binary so that
we can write extra sections in there (.eh_frame, .eh_frame_hdr,
.gcc_except_table, etc.). With this diff there's no further
need for pre-allocated storage as we create a new segment and
can use as much space as needed.
There are certain limitations on where the new segment can
be allocated, and as a result the size of the file may increase.
Currently, if the binary size is close to 4GB, we cannot allocate
the new segment before that mark, and as a result
we require debug info to be stripped to reduce the file size.
The fix is in progress.
(cherry picked from FBD2916029)
Summary:
We use intermediate .o file for debugging purposes, but there's no
reason to generate it by default. Only do it if "-keep-tmp" is
specified.
(cherry picked from FBD2912098)
Summary:
Preserve original layout for basic blocks that have 0 execution
count. Since we don't optimize for size, it's better to rely on
the original input order.
(cherry picked from FBD2875335)
Summary:
We should never outline the first basic block.
Also add an option to accept a file with the list of
functions to optimize.
(cherry picked from FBD2868184)
Summary:
We can split functions with exceptions even without creating
a new exception handling table. This limits us to moving only
basic blocks that never throw and are not the start of a
landing pad.
(cherry picked from FBD2862937)
Summary:
Some basic blocks were created empty because they only contained
alignment nop's. Ignore such nop's before basic block gets created.
Fixed intermittent aborts related to CFI update.
(cherry picked from FBD2844465)
Summary:
* Update CFI state for larger range of functions to increase coverage.
* Issue more warnings indicating reasons for skipping functions.
* Print top called functions in the binary.
(cherry picked from FBD2839734)
Summary:
Modified processing of "-reorder-blocks=" option and added an option
to reverse original basic blocks order for testing purposes.
(cherry picked from FBD2829862)
Summary:
Fixes some issues discovered after hhvm switched to gcc 4.9.
Add support for DW_CFA_GNU_args_size instruction.
Allow CFI instruction after the last instruction in a function.
Reverse conditions of assert for DW_CFA_set_loc.
(cherry picked from FBD28110096)
Summary:
Binary code could be weird. It could include calls to address 0 and
reference data at 0 (e.g. with lea on x86). LLVM JIT fatals
while resolving relocations against symbols at address 0x0. For now
we will stop emitting such code, i.e. we'll skip functions.
(cherry picked from FBD28109837)
Summary:
In a test binary, we found 8 cases where code in a function A would jump to the
middle of another function B. In this case, we cannot reorder function B because
this would change instruction offsets and break the program. This is pretty rare
but can happen in code written in assembly.
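
A sketch of how such cross-function jumps into the middle of a function might be detected. The (start, size, name) tuples and the (source, target) branch list are hypothetical inputs made up for this example; they do not reflect the tool's actual data structures.

```python
import bisect


def find_function(functions, starts, addr):
    """Return the (start, size, name) tuple containing addr, or None."""
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, _name = functions[i]
        if addr < start + size:
            return functions[i]
    return None


def functions_with_interior_entries(functions, branches):
    """Names of functions that another function jumps into past the entry.

    functions: hypothetical list of (start, size, name), sorted by start.
    branches:  hypothetical list of (source_address, target_address).
    """
    starts = [f[0] for f in functions]
    flagged = set()
    for src, dst in branches:
        target = find_function(functions, starts, dst)
        source = find_function(functions, starts, src)
        if target is None or target is source:
            continue  # unknown target or an ordinary intra-function branch
        if dst != target[0]:
            flagged.add(target[2])  # lands in the middle of another function
    return flagged


functions = [(0x100, 0x40, "A"), (0x200, 0x80, "B"), (0x300, 0x20, "C")]
branches = [(0x110, 0x120), (0x118, 0x230), (0x210, 0x300)]
print(functions_with_interior_entries(functions, branches))  # {'B'}
```

Only B is flagged: the branch to 0x300 targets the entry of C, which is harmless for reordering C's blocks.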
(cherry picked from FBD2719850)
Summary:
We found out that the insertion of extra nops to preserve alignment of
some loop bodies does not pay off the increased function size, since this extra
size may inhibit us from rewriting a reordered version of this function.
(cherry picked from FBD2718466)
Summary:
Our CFI parser in the LLVM library was giving up on parsing all CFI
instructions when finding a single instruction with expression operands. Yet,
all gcc-4.9 binaries seem to have at least one CFI instruction with expression
operands (DW_CFA_def_cfa_expression). This patch fixes this and makes DebugInfo
continue to parse other instructions, even though it does not completely parse
DWARF expressions yet. However, this seems to be enough to allow llvm-flo to
process gcc-4.9 binaries because the FDEs with DWARF expressions are linked to
the PLT region, and not to functions that we process.
If we ever try to read a function whose CFI depends on DWARF expression, which
is unlikely, llvm-flo will assert.
(cherry picked from FBD2693088)
Summary:
This patch builds upon the previous patch to create a two-pass process
to function splitting. We first perform the full rewriting pipeline to discover
which functions need splitting. Afterwards, we restart the pipeline with those
functions annotated to be split.
(cherry picked from FBD2691709)
Summary:
Previously, llvm-flo.cpp contained a long function doing lots of
different tasks. This patch refactors this logic into a separate class with
different member functions, exposing the relationship between each step of
the rewriting process and making it easier to coordinate/change it.
(cherry picked from FBD2691674)
Summary:
After basic block reordering, it may be possible that the reordered
function is now larger than the original because of the following reasons:
- jump offsets may change, forcing some jump instructions to use 4-byte
immediate operand instead of the 1-byte, shorter version.
- fall-throughs change, forcing us to emit an extra jump instruction to jump
to the original fall-through at the end of a basic block.
Since we currently do not change function addresses, we need to rewrite the
function back in the binary in the original location. If it doesn't fit, we were
dropping the function.
This patch adds a flag -split-functions that tells llvm-flo to split hot
functions into hot and cold separate regions. The hot region is written back
in the original function location, while the cold region is written in a
separate, far-away region reserved to flo via a linker script.
This patch also adds the logic to create an extra FDE to supply unwinding
information to the cold part of the function. Owing to this, we now need to
rewrite .eh_frame_hdr to another location and patch the EH_FRAME ELF segment
to point to this new .eh_frame_hdr.
(cherry picked from FBD2677996)
Summary:
This is an attempt at determining the hotness of functions we are
rewriting and at helping detect whether we are discarding hot functions. This patch
introduces logic to estimate the number of instructions executed in each
function by using the profile data for branches. It sums the products of
BB frequency and size. Since we can only do this for functions that we have
successfully disassembled, built a CFG for, and annotated with profiling
data, all complex functions that were not disassembled are left out from
this analysis.
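
The estimate amounts to a sum of products, sketched below. Measuring block size in instructions (rather than bytes) is an assumption of this example, and the (frequency, size) pair layout is hypothetical.

```python
def estimate_executed_instructions(blocks):
    """Sum of frequency * size over the basic blocks of one function.

    blocks: hypothetical list of (execution_count, num_instructions) pairs.
    """
    return sum(count * size for count, size in blocks)


# Two executed blocks and one block that never runs.
print(estimate_executed_instructions([(100, 4), (10, 7), (0, 20)]))  # 470
```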
(cherry picked from FBD2654985)
Summary:
Previously, we were marking functions with indirect calls as too
complex to be disassembled, but this was unnecessarily conservative. This patch
removes this restriction.
(cherry picked from FBD2669627)
Summary:
Teach llvm-flo to drop functions with LSDA information until we know
how to update that information after block reordering.
(cherry picked from FBD2640806)
Summary:
This patch adds logic to detect when the binary has extra space
reserved for us via the __flo_storage symbol. If this symbol is present,
it means we have extra space in the binary to write extraneous information.
When we write a new .eh_frame, we cannot discard the old .eh_frame because
it may still contain relevant information for functions we do not reorder.
Thus, we write the new .eh_frame into __flo_storage and patch the current
.eh_frame_hdr to point to the new .eh_frame only for the functions we touched,
generating a binary that works with a bi-.eh_frame model.
(cherry picked from FBD2639326)
Summary:
This patch is an intermediary step towards updating the CFI in the
optimized binary. It adds the logic necessary to output our CFI annotations to
a new .eh_frame in the temporary object file we create to hold rewritten
functions. The next step will be to fully integrate this new .eh_frame into the
optimized binary.
(cherry picked from FBD2633728)
Summary:
This patch introduces logic to check how the CFI instructions define a
table to help during stack unwinding at exception run time and attempts to fix
any problem in this table that may have been introduced by reordering the basic
blocks. If it fails to fix this problem, the function is marked as not simple
and not eligible for rewriting.
(cherry picked from FBD2633696)
Summary:
Regenerate exception handling information after optimizations.
Use '-print-eh-ranges' to see CFG with updated ranges.
(cherry picked from FBD2660982)
Summary:
There were two issues: we were trying to process non-simple functions,
i.e. functions that we don't fully understand, and then we failed to stop
iterating if the EH closing label was after the last instruction in a
function.
(cherry picked from FBD2664460)
Summary:
Read .gcc_except_table and add information to CFG. Calls have extra operands
indicating there's a possible handler for exceptions and an action. Landing
pad information is recorded in BinaryFunction.
Also convert JMP instructions that are calls into tail calls pseudo
instructions so that they don't miss call instruction analysis.
(cherry picked from FBD2652775)
Summary: Reverting this commit until we better investigate why
it is necessary to change local symbol names with a prefix.
(cherry picked from FBD28109521)
Summary: After discussion with Maksim, we decided to drop the lines
that add the PG prefix if the symbol is already local, since they
wouldn't be impacted by the way LLVM handles these symbols.
(cherry picked from FBD28109400)
Summary:
This bug would cause llvm-flo to fail to disambiguate two local symbols
with the same file name, so that two different addresses competed in the
symbol table for the resolution of a given name, causing unpredictable behavior in
the linker.
(cherry picked from FBD2646626)
Summary:
In order to represent CFI information in our BinaryFunction class, this
patch adds a map of Offsets to CFI instructions. In this way, we make it easy to
check exactly where DWARF CFI information is annotated in the disassembled
function.
(cherry picked from FBD2619216)
Summary:
We need to parse the whole contents of .gcc_except_table even if we are
not printing exceptions. Otherwise we are missing type index table and
miscalculate the size of the current table.
(cherry picked from FBD2632965)
Summary: In order to reorder binaries with C++ exceptions, we first need to
read DWARF CFI (call frame info) from binaries in a table in the .eh_frame
ELF section. This table contains unwinding information we need to be aware of
when reordering basic blocks, so as to avoid corrupting it. This patch also
cleans up some code from Exceptions.cpp due to a refactoring where we moved
some functions to the LLVM's libSupport.
(cherry picked from FBD2614464)
Summary:
Print actions for exception ranges from .gcc_except_table.
Types are printed as names if the name is available from symbol table.
(cherry picked from FBD2612631)
Summary:
Previously, we inferred all non-taken branch frequencies with the
information we had for taken branches. This patch teaches perf2flo and llvm-flo
how to read and incorporate non-taken branch frequencies directly from the
traces available in LBR data and by disassembling the binary. It still leaves
the inference engine untouched in case we need it to fill out other
fall-throughs.
(cherry picked from FBD2589212)
Summary:
Pettis' paper on block layout (PLDI'90) suggests we should order
clusters (or chains, using the paper's terminology) using a specific criterion.
This patch implements two distinct ideas for cluster layout that can be
activated using different command-line flags. The first one reflects Pettis'
ideas on minimizing branch mispredictions and the second one is targeted at
reducing I-cache misses, described in the Ispike paper (CGO'04).
(cherry picked from FBD2588693)
Summary:
Fixes a bug which caused the block reordering heuristic to put in the
same cluster hot basic blocks and cold basic blocks, increasing I-cache misses.
(cherry picked from FBD2588203)
Summary:
When the ignore-nops patch landed, it exposed a bug in fixBranches()
where it ignored empty BBs. However, we cannot ignore an empty BB when it is
reordered and its fall-through changes. We must update it with a jump to the
original fall-through. This patch fixes this.
(cherry picked from FBD2568244)
Summary:
It is important to remove dead blocks to free up space in functions
and allow us to reorder blocks or align branch targets with more
freedom. This patch implements a simple algorithm to delete all basic
blocks that are not reachable from the entry point. Note that C++
exceptions may create "unreachable" blocks, so this option must be
used with care.
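
A minimal sketch of the reachability pass, assuming the CFG is given as a hypothetical dict from block name to successor list. Exception edges are deliberately absent from this model, which shows why landing pads can look unreachable.

```python
def remove_unreachable(successors, entry):
    """Keep only blocks reachable from `entry` via the listed successor edges.

    successors: hypothetical dict, block name -> list of successor names.
    """
    reachable = set()
    worklist = [entry]
    while worklist:
        block = worklist.pop()
        if block in reachable:
            continue
        reachable.add(block)
        worklist.extend(successors.get(block, ()))
    return {b: s for b, s in successors.items() if b in reachable}


cfg = {
    "entry": ["a", "b"],
    "a": ["exit"],
    "b": ["exit"],
    "exit": [],
    "dead": ["exit"],
    "landing_pad": ["exit"],  # reached only through an exception edge
}
print(sorted(remove_unreachable(cfg, "entry")))  # ['a', 'b', 'entry', 'exit']
```

The landing pad is dropped along with the truly dead block because nothing in this simplified CFG points at it, which is the hazard the summary warns about.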
(cherry picked from FBD2562637)
Summary:
SPEC CPU2006 perlbench triggered a bug in our heuristic block
reordering algorithm where a hot edge that targets the entry point (as in a
recursive tail call) would make us try to allocate the call site before the
function entry point. Since we don't update function addresses yet, moving the
entry point will corrupt the program. This patch fixes this.
(cherry picked from FBD2562528)
Summary:
If we have two consecutive JMP instructions and no branches to the
second one, the second one is dead code, but llvm-flo does not handle these
cases properly and puts two JMPs in the same BB. This patch fixes this, putting
the extraneous JMP in a separate block, making it easy for us to detect it is
dead code and remove it later in a separate step.
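
The block-splitting rule can be sketched as follows: a block ends after every unconditional jump, so a second consecutive JMP starts its own block. The (offset, mnemonic) instruction tuples and the treatment of only "jmp" as a terminator are simplifying assumptions of this example.

```python
def split_into_blocks(instructions, branch_targets):
    """Split a linear instruction list into basic blocks.

    instructions:   hypothetical list of (offset, mnemonic) pairs.
    branch_targets: set of offsets that some branch jumps to.
    Simplification: only "jmp" is treated as a block terminator here.
    """
    blocks, current = [], []
    for offset, mnemonic in instructions:
        if current and offset in branch_targets:
            blocks.append(current)  # a branch target starts a new block
            current = []
        current.append((offset, mnemonic))
        if mnemonic == "jmp":
            blocks.append(current)  # nothing may follow a JMP in its block
            current = []
    if current:
        blocks.append(current)
    return blocks


code = [(0, "cmp"), (3, "jmp"), (5, "jmp"), (7, "ret")]
for block in split_into_blocks(code, branch_targets={7}):
    print(block)
```

The JMP at offset 5 ends up alone in a block with no predecessors, so a later reachability pass can discard it.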
(cherry picked from FBD2562465)
Summary:
Nop instructions are primarily used for alignment purposes on the input.
We remove all nops when we build CFG and derive alignment of basic blocks
based on existing alignment and the presence of nops before it. This
will not always work as some basic blocks will be naturally aligned
without necessity for nops. However, it's better than random alignment.
We could also add heuristics for BB alignment based on execution profile.
(cherry picked from FBD2561740)
Summary:
Adds logic in BinaryFunction to be able to fix branches (invert
its condition, delete or add a branch), making the new function work with the
new layout proposed by the layout pass. All the architecture-specific content
was designed to live in the LLVM Target library, in the MCInstrAnalysis pass.
For now, we only introduce such logic to the X86 backend.
(cherry picked from FBD2551479)
Summary:
Tests with SPEC CPU2006 400.perlbench exposed a bug in the block reordering
heuristic that happened when two blocks are both successor and predecessor of
each other. This patch fixes this.
(cherry picked from FBD2555835)
Summary:
SPEC CPU2006 perlbench exposed a bug in BinaryFunction::optimizeLayout()
where it would try to optimize the layout even though the function had zero
basic blocks. This patch simply checks if the function has zero basic blocks and
bails out.
(cherry picked from FBD2556831)
Summary:
In a recent commit, we changed local symbols to be specially tagged
with the number 2 (local sym) instead of 1 (sym). This patch modifies the reader
to not choke when seeing a 2 in the symbol id field.
(cherry picked from FBD2552776)
Summary:
This patch implements a dynamic programming approach to reorder
basic blocks with profiling information in an optimal way. Since this is
analogous to TSP, it is NP-hard and the algorithm is exponential in time and
memory consumption. Therefore, we only use the optimal algorithm to decide the
layout of small functions (with fewer than 11 basic blocks).
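
For illustration, a subset dynamic program in the Held-Karp style is sketched below. It is not the patch's code: the objective (maximize the total weight of edges whose destination directly follows their source), the fixed entry block at index 0, and the input format are assumptions of this example.

```python
def optimal_layout(num_blocks, weights):
    """Find the block order maximizing the weight of adjacent (src, dst) pairs.

    weights: hypothetical dict, (src, dst) -> non-negative edge count.
    Block 0 is the entry and always stays first.
    Time and memory grow as O(2^n * n^2), hence only usable for small n.
    """
    full = (1 << num_blocks) - 1
    # best[(mask, last)] = (score, previous_last) for layouts that place
    # exactly the blocks in `mask`, start at block 0, and end at `last`.
    best = {(1, 0): (0, None)}
    for mask in range(1, full + 1):
        for last in range(num_blocks):
            if (mask, last) not in best:
                continue
            score = best[(mask, last)][0]
            for nxt in range(num_blocks):
                if mask & (1 << nxt):
                    continue  # block already placed
                key = (mask | (1 << nxt), nxt)
                new_score = score + weights.get((last, nxt), 0)
                if key not in best or new_score > best[key][0]:
                    best[key] = (new_score, last)
    # Choose the best final block, then walk the predecessor links backwards.
    last = max(range(num_blocks), key=lambda b: best.get((full, b), (-1, None))[0])
    total = best[(full, last)][0]
    order, mask = [], full
    while last is not None:
        order.append(last)
        prev = best[(mask, last)][1]
        mask ^= 1 << last
        last = prev
    order.reverse()
    return order, total


weights = {(0, 1): 10, (0, 2): 90, (2, 3): 80, (1, 3): 5, (3, 1): 1}
print(optimal_layout(4, weights))  # ([0, 2, 3, 1], 171)
```

With 10 blocks the table holds at most 2^10 * 10 states, which is why a size cutoff like the one in the summary keeps the exact search affordable.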
(cherry picked from FBD2544124)
Summary:
This patch introduces a first approach to reorder basic blocks based on
profiling data that gives us the execution frequency for each edge. Our strategy
is to lay out basic blocks in an order that maximizes the weight (hotness) of
branches that will be deleted. We can delete branches when src comes right
before dst in the new layout order. This can be reduced to TSP. This
patch uses a greedy heuristic to solve the problem: we start with a graph with
no edges and progressively add edges by choosing the hottest edges first,
building a layout order that attempts to put BBs with hot edges together.
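
A sketch of one way to realize the hottest-edge-first idea, using chain merging. The rule that an edge is accepted only when its source ends one chain and its destination starts another, and the final ordering of leftover chains by original position, are assumptions of this example rather than details taken from the patch.

```python
def greedy_layout(blocks, edges):
    """Build a layout by repeatedly accepting the hottest usable edge.

    blocks: block names in original order; blocks[0] is the entry.
    edges:  hypothetical dict, (src, dst) -> execution count.
    """
    chain_of = {b: [b] for b in blocks}
    for (src, dst), _count in sorted(edges.items(), key=lambda e: -e[1]):
        head, tail = chain_of[src], chain_of[dst]
        # Accept the edge only if dst can be placed right after src:
        # src must end its chain, dst must start a different chain,
        # and the entry block must never be moved behind another block.
        if head is tail or head[-1] != src or tail[0] != dst or dst == blocks[0]:
            continue
        head.extend(tail)
        for blk in tail:
            chain_of[blk] = head
    layout, placed = [], set()
    for blk in blocks:  # entry chain first, then the rest in original order
        if blk not in placed:
            layout.extend(chain_of[blk])
            placed.update(chain_of[blk])
    return layout


edges = {("A", "B"): 10, ("A", "C"): 90, ("C", "D"): 80, ("B", "D"): 20, ("D", "A"): 70}
print(greedy_layout(["A", "B", "C", "D"], edges))  # ['A', 'C', 'D', 'B']
```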
(cherry picked from FBD2544076)
Summary:
The LBR only has information about taken branches and does not record
information when a branch is not taken. In our CFG, we call these edges
"fall-through" edges. This patch teaches llvm-flo how to infer fall-through
edge frequencies.
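
The summary does not spell out the inference rule. One plausible rule, shown here purely as an assumption, is flow conservation: whatever enters a block and does not leave through a taken branch must have fallen through.

```python
def infer_fallthrough(exec_count, taken_counts):
    """Estimate a block's fall-through count from flow conservation.

    exec_count:   how many times the block executed (hypothetical input).
    taken_counts: LBR counts of taken branches leaving the block.
    Assumption: fall-through = executions - taken exits, clamped at zero
    because sampled profiles can be slightly inconsistent.
    """
    return max(exec_count - sum(taken_counts), 0)


print(infer_fallthrough(100, [30]))  # 70
```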
(cherry picked from FBD2536633)
Summary:
Changes DataReader to organize branch perf data per function name and
sets up logistics to bring this data to BinaryFunction::buildCFG(). To do this,
we expand BinaryContext with a const reference to DataReader. This patch also
adds the "-dump-functions" flag to force llvm-flo to dump the current state of
BinaryFunctions once they are disassembled and their CFG built, allowing us to
test whether the builder is sane with LLVM LIT tests.
(cherry picked from FBD2534675)
Summary:
This patch introduces DataReader, a module responsible for
parsing llvm flo data files into in-memory data structures.
(cherry picked from FBD2515754)