Summary:
This adds functionality for a more aggressive inlining pass that can
inline tail calls and functions with more than one basic block.
(cherry picked from FBD3677856)
Summary:
Add three new MCOperand types: Annotation, LandingPad and GnuArgsSize.
Annotation is used for associating arbitrary data with MCInsts. Clients can
construct their own annotation types (subclassed from MCAnnotation) and
associate them with instructions. Annotations are looked up by string keys.
Annotations can be added, removed and queried using an instance of the
MCInstrAnalysis class.
The LandingPad operand is a MCSymbol, uint64_t pair used to encode exception
handling information for call instructions.
GnuArgsSize is used to annotate calls with the DW_CFA_GNU_args_size attribute.
(cherry picked from FBD3597877)
Summary:
BOLT attempts to convert jumps that serve as tail calls to dedicated tail call
instructions, but this is impossible when the jump is conditional because there is
no corresponding tail call instruction. This was causing the creation of a duplicate
fall-through edge for basic blocks terminated with a conditional jump serving as
a tail call when there is profile data available for the non-taken branch. In this
case, the first fall-through edge had a count taken from the profile data, while
the second has a count computed (incorrectly) by
BinaryFunction::inferFallThroughCounts.
(cherry picked from FBD3560504)
Summary:
LLVM was missing the assembler print string for indirect tail
calls, which are synthetic instructions created by us.
(cherry picked from FBD3640197)
Summary:
This diff adds a number of methods to BinaryFunction that can be used to edit the CFG after it is created.
The basic public functions are:
- createBasicBlock - create a new block that is not inserted into the CFG.
- insertBasicBlocks - insert a range of blocks (made with createBasicBlock) into the CFG.
- updateLayout - update the CFG layout (either by inserting new blocks at a certain point or recomputing the entire layout).
- fixFallthroughBranch - add a direct jump to the fallthrough successor for a given block.
There are a number of private helper functions used to implement the above.
This was split off the ICP diff to simplify it a bit.
(cherry picked from FBD3611313)
Summary:
This algorithm is similar to our main clustering algorithm but uses
a different heuristic for selecting edges to become fall-throughs.
The weight of an edge is calculated as the win in branches if we choose
to lay out this edge as a fall-through. For example, the edges A -> B with
execution count 100 and A -> C with execution count 500 (where B and C
are the only successors of A) have weights -400 and +400 respectively.
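The weight computation in the example above can be sketched as follows. This is
a minimal illustration, not the code from the diff: the function name is
hypothetical, and treating the weight as "own count minus the counts of all
other successors" for blocks with more than two successors is an assumption.

```python
def fallthrough_weights(succ_counts):
    """Map each successor to its count minus the counts of all other successors.

    succ_counts: dict of successor name -> execution count of the edge.
    (Hypothetical helper; the generalization beyond two successors is assumed.)
    """
    total = sum(succ_counts.values())
    # count - (total - count) == 2 * count - total
    return {succ: 2 * count - total for succ, count in succ_counts.items()}


weights = fallthrough_weights({"B": 100, "C": 500})
print(weights)  # {'B': -400, 'C': 400}
```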
(cherry picked from FBD3606591)
Summary:
Added an ICF pass to BOLT that can recognize identical functions
and replace references to these functions with references to just one
representative.
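The idea of folding can be sketched on a toy model. The sketch assumes a
function body is just a tuple of instruction strings and that "identical" means
equal tuples; the actual pass has to compare real instructions and operands,
which is not shown here. All names are hypothetical.

```python
def fold_identical_functions(bodies):
    """Return a dict mapping every function name to its representative.

    bodies: dict of function name -> tuple of instruction strings (toy model).
    The first function seen with a given body becomes the representative.
    """
    first_with_body = {}
    representative = {}
    for name, body in bodies.items():
        representative[name] = first_with_body.setdefault(body, name)
    return representative


reps = fold_identical_functions({
    "f": ("xor eax, eax", "ret"),
    "g": ("xor eax, eax", "ret"),
    "h": ("nop", "ret"),
})
print(reps)  # {'f': 'f', 'g': 'f', 'h': 'h'}
```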
(cherry picked from FBD3460297)
Summary:
I've factored out the instruction printing and size computation routines to
methods on BinaryContext. I've also added some more debug print functions.
This was split off the ICP diff to simplify it a bit.
(cherry picked from FBD3610690)
Summary:
Instructions that load data from a read-only data section and whose
target address can be computed statically (e.g. RIP-relative addressing)
are converted to corresponding instructions that use immediate operands.
We apply the transformation only when the resulting instruction will have
smaller or equal size.
(cherry picked from FBD3397112)
Summary:
Loop detection for the CFG data structure. Added a GraphTraits
specialization for BOLT's CFG that allows us to use LLVM's loop
detection interface.
(cherry picked from FBD3604837)
Summary:
When a mov instruction has a 64-bit immediate that can be represented as
a sign-extended 32-bit number, use the smaller mov instruction (MOV64ri -> MOV64ri32).
Add a peephole optimization pass that does instruction shortening.
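The condition for shortening can be sketched as a check that the 64-bit
immediate equals the sign extension of its own low 32 bits. The helper name is
hypothetical and the sketch only models the arithmetic, not the pass itself.

```python
def fits_in_signed_32(imm64):
    """Return True if a 64-bit immediate is the sign extension of its low 32 bits."""
    imm64 &= 0xFFFFFFFFFFFFFFFF           # view the value as a raw 64-bit pattern
    low32 = imm64 & 0xFFFFFFFF
    if low32 & 0x80000000:                # bit 31 set: upper 32 bits must be all ones
        extended = low32 | 0xFFFFFFFF00000000
    else:                                 # bit 31 clear: upper 32 bits must be zero
        extended = low32
    return extended == imm64


print(fits_in_signed_32(0x7FFFFFFF))  # True
print(fits_in_signed_32(0x80000000))  # False
print(fits_in_signed_32(-1))          # True
```

Note that 0x80000000 does not qualify: sign-extending its low 32 bits yields
0xFFFFFFFF80000000, which is a different 64-bit value.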
(cherry picked from FBD3603099)
Summary:
Generate short versions of branch instructions by default and rely on
relaxation to produce longer versions when needed.
Also produce short versions of arithmetic instructions if immediate
fits into one byte. This was only triggered once on HHVM binary.
(cherry picked from FBD3591466)
Summary:
patchELFPHDRTable was asserting that it could not find an entry
for .eh_frame_hdr in SectionMapInfo when no functions were modified
by BOLT.
This just changes code to skip modifying GNU_EH_FRAME program headers
when SectionMapInfo is empty. The existing header is copied and written
instead.
(cherry picked from FBD3557481)
Summary:
If profile data was collected on a stripped binary but the input
to BOLT is unstripped, we would use a different mangling scheme for
local functions and ignore their profiles. To solve the issue, this
diff adds an alternative name for every local function such that one
of the names matches the name in the profile.
If the input binary is stripped, we reject it unless the "-allow-stripped"
option was passed. It's more complicated to do the matching in this case
since we have less information than at the time of profile collection.
It's also not that simple to tell if the profile was gathered on a
stripped binary (in which case we would have no issue matching data).
(cherry picked from FBD3548012)
Summary:
Store the basic block index inside the BinaryBasicBlock instead of a map in BinaryFunction.
This cut another 15-20 sec. from the processing time for hhvm.
(cherry picked from FBD3533606)
Summary:
Use unordered_map instead of map in ReorderAlgorithm and BinaryFunction::BasicBlockIndices.
Cuts about 30 sec. off the processing time for the hhvm binary (~8.5 min to ~8 min).
(cherry picked from FBD3530910)
Summary:
This fixes the initialization of basic block execution counts, where
we should skip edges to the first basic block but we were not
skipping the corresponding profile info.
Also, I removed a check that was done twice.
(cherry picked from FBD3519265)
Summary:
I noticed the BinaryFunction::viewGraph() method that hadn't been implemented
and decided I could use a simple DOT dumper for CFGs while working on the indirect
call optimization.
I've implemented the bare minimum for the dumper. It's just nodes+BB labels with
edges. We can add more detailed information as needed/desired.
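For illustration, a dumper of this kind (nodes labeled with basic block names,
plus edges) can be sketched as below. This is not the viewGraph()
implementation; the function name and the dict-based CFG representation are
assumptions made for the example.

```python
def cfg_to_dot(name, successors):
    """Render a CFG as DOT text.

    successors: dict of basic block label -> list of successor labels
    (hypothetical representation of the CFG).
    """
    lines = [f'digraph "{name}" {{']
    for block in successors:
        lines.append(f'  "{block}" [label="{block}"];')   # one node per block
    for block, succs in successors.items():
        for succ in succs:
            lines.append(f'  "{block}" -> "{succ}";')     # one edge per successor
    lines.append("}")
    return "\n".join(lines)


print(cfg_to_dot("foo", {"BB0": ["BB1", "BB2"], "BB1": ["BB2"], "BB2": []}))
```

The resulting text can be fed to Graphviz, e.g. "dot -Tpng".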
(cherry picked from FBD3509326)
Summary:
Added perf2bolt functionality for extracting branch records
with histories of previous branches. The length of the histories
is user defined, and the default is 0 (previous functionality). Also,
DataReader can parse perf2bolt output with histories.
Note: creating profile data with long histories can increase their
size significantly (2x for history of length 1, 3x for length 2 etc).
(cherry picked from FBD3473983)
Summary:
When a conditional jump is followed by one or more no-ops, the
destination of fall-through branch was recorded as the first no-op in
FuncBranchInfo. However the fall-through basic block after the jump
starts after the no-ops, so the profile data could not match the CFG
and was ignored.
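The mismatch can be illustrated with a toy instruction list: the fall-through
target that matches the CFG is the first instruction after the jump that is not
a no-op. The representation and the helper name are hypothetical, and the
sketch shows the intended target only, not how the diff implements the fix.

```python
def fallthrough_offset(instructions, jump_index):
    """Return the offset of the first non-nop instruction after the jump.

    instructions: list of (offset, mnemonic) pairs in address order (toy model).
    Returns None if only no-ops follow the jump.
    """
    for offset, mnemonic in instructions[jump_index + 1:]:
        if mnemonic != "nop":
            return offset
    return None


insts = [(0, "cmp"), (3, "jne"), (5, "nop"), (6, "nop"), (7, "mov")]
print(fallthrough_offset(insts, 1))  # 7, not 5 (the first no-op)
```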
(cherry picked from FBD3496084)
Summary:
The various reorder and clustering algorithms have been refactored
into separate classes, so that it is easier to add new algorithms and/or
change the logic of algorithm selection.
(cherry picked from FBD3473656)
Summary:
With ICF optimization in the linker we were getting mismatches of
function names in .fdata and BinaryFunction name. This diff adds
support for multiple function names for BinaryFunction and
does a match against all possible names for the profile.
(cherry picked from FBD3466215)
Summary:
Verify profile data for a function and reject if there are branches
that don't correspond to any branches in the function CFG. Note that
we have to ignore branches resulting from recursive calls.
Fix printing instruction offsets in disassembled state.
Allow function to have non-zero execution count even if we don't
have branch information.
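The profile verification described above can be sketched as a check of every
profiled branch against the set of CFG edges. How recursive calls are
recognized is an assumption here (a branch that lands on the function entry
offset); the names and the data layout are hypothetical.

```python
def profile_matches_cfg(branches, cfg_edges, entry_offset=0):
    """Return True if every profiled branch corresponds to a CFG edge.

    branches:  iterable of (source offset, destination offset) pairs.
    cfg_edges: set of (source offset, destination offset) pairs from the CFG.
    Branches to entry_offset are assumed to be recursive calls and are skipped.
    """
    for src, dst in branches:
        if dst == entry_offset:        # assumed marker of a recursive call
            continue
        if (src, dst) not in cfg_edges:
            return False               # branch has no counterpart in the CFG
    return True


edges = {(4, 16), (4, 6), (20, 8)}
print(profile_matches_cfg([(4, 16), (12, 0)], edges))  # True
print(profile_matches_cfg([(4, 18)], edges))           # False
```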
(cherry picked from FBD3451596)
Summary:
Print total number of functions/objects that have profile
and add new options:
  -print      - print the list of objects with count to stderr
    =none     - do not print objects/functions
    =exec     - print functions sorted by execution count
    =branches - print functions sorted by total branch count
  -q          - do not print merged data to stdout
(cherry picked from FBD3442288)
Summary: This will help optimization passes that need to modify the CFG after it is constructed. Otherwise, the BinaryBasicBlock pointers stored in the layout, successors and predecessors would need to be modified every time a new basic block is created.
(cherry picked from FBD3403372)
Summary:
Turn on -fix-debuginfo-large-functions by default.
In the process of testing I've discovered that we output cold code
for functions that were too large to be emitted. Fixed that.
(cherry picked from FBD3372697)
Summary:
Assembly functions could have no corresponding DW_TAG_subprogram
entries, yet they are represented in module ranges (and .debug_aranges)
and will have line number information. Make sure we update those.
Eliminated unnecessary data structures and optimized some passes.
For .debug_loc unused location entries are no longer processed
resulting in smaller output files.
Overall it's a small processing time and memory improvement.
(cherry picked from FBD3362540)
Summary: The inference algorithm for counts of fall through edges takes possible jumps to landing pad blocks into account. Also, the landing pad block execution counts are updated using profile data.
(cherry picked from FBD3350727)
Summary:
Clang uses a different attribute for high_pc which
was incompatible with the way we were updating
ranges. This diff fixes it.
(cherry picked from FBD3345537)
Summary:
* Fix several cases for handling debug info:
- properly update CU DW_AT_ranges for function with folded body
due to ICF optimization
- convert ranges to DW_AT_ranges from hi/low PC for all DIEs
- add support for [a, a) range
- update CU ranges even when there are no functions registered
* Overwrite .debug_ranges section instead of appending.
* Convert assertions in debug info handling part into warnings.
(cherry picked from FBD3339383)
Summary:
Some compile unit DIEs might be missing DW_AT_ranges because they were
compiled without "-ffunction-sections" option. This diff adds the
attribute to all compile units.
If the section is not present, we need to create it. Will do it in a
separate diff.
(cherry picked from FBD3314984)
Summary:
Overwrite contents of .debug_line section since we don't reference
the original contents anymore. This saves ~100MB of HHVM binary.
(cherry picked from FBD3314917)
Summary:
A simple optimization to prevent branch misprediction for tail calls.
Convert the sequence:
      j<cc> L1
      ...
  L1: jmp foo # tail call
into:
      j<cc> foo
but only if 'j<cc> foo' turns out to be a forward branch.
(cherry picked from FBD3234207)
Summary:
While emitting debug lines for a function we don't overwrite, we
don't have a code section context that is needed by default
writing routine. Hence we have to emit end_sequence after the
last address, not at the end of section.
(cherry picked from FBD3291533)
Summary:
Added an optimization pass of inlining calls to small functions (with only one
basic block). Inlining is done in a very simple way, inserting instructions to
simulate the changes to the stack pointer that call/ret would make before/after the
inlined function executes. Also, the heuristic prefers to inline calls that happen
in the hottest blocks (by looking at their execution count). Calls in cold blocks are
ignored.
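The selection heuristic can be sketched as follows. The sketch assumes that
"cold" means an execution count of zero and that call sites are available as
(count, callee) pairs; both are assumptions for the example, and the names are
hypothetical. The stack pointer adjustments are not modeled.

```python
def pick_inline_candidates(call_sites, num_blocks):
    """Return call sites worth inlining, hottest first.

    call_sites: list of (caller block execution count, callee name) pairs.
    num_blocks: dict of function name -> number of basic blocks.
    """
    hot = [
        (count, callee)
        for count, callee in call_sites
        # Skip calls in cold blocks (assumed: count == 0) and callees that
        # do not consist of exactly one basic block.
        if count > 0 and num_blocks.get(callee) == 1
    ]
    return sorted(hot, key=lambda site: site[0], reverse=True)


sites = [(0, "f"), (50, "g"), (900, "f"), (10, "h")]
print(pick_inline_candidates(sites, {"f": 1, "g": 1, "h": 3}))
# [(900, 'f'), (50, 'g')]
```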
(cherry picked from FBD3233516)
Summary:
Many functions (around 600) in the HHVM binary are simply
a single unconditional jump instruction to another function. These can
be trivially optimized by modifying the call sites to directly call the
branch target instead (because it also happens with more than one jump
in sequence, we do it iteratively).
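Following a sequence of such single-jump functions can be sketched as below.
The dict of jump-only functions and the helper name are hypothetical, and the
cycle guard is an addition for the example rather than something the diff
describes.

```python
def resolve_call_target(target, jump_only):
    """Follow jump-only functions until a function with a real body is reached.

    jump_only: dict of function name -> the function its single jmp targets.
    """
    seen = set()
    while target in jump_only and target not in seen:
        seen.add(target)              # guard against a cycle of jumps
        target = jump_only[target]
    return target


jump_only = {"a": "b", "b": "c"}
print(resolve_call_target("a", jump_only))  # c
print(resolve_call_target("c", jump_only))  # c
```

A call site that originally called "a" would then be rewritten to call "c"
directly.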
This diff also adds a very simple analysis/optimization pass system in
which this pass is the first one to be implemented. A follow-up to this
could be to move the current optimizations to other passes.
(cherry picked from FBD3211138)
Summary:
Fix the error message by not printing it :)
Explanation: a previous diff accidentally removed this error message from within
the DEBUG macro, and it's expected that we'll have a bunch of them since a lot
of the DIEs we try to update are empty or meaningless. For instance (and mainly), there
is a huge number of lexical block DIEs with no attributes in .debug_info.
In the first phase of collecting debugging info, we store the offsets of all
these DIEs, only later to realize that we cannot update their address
ranges because they have none.
A better fix would be to check this earlier and not store offsets of DIEs
we cannot update to begin with.
(cherry picked from FBD3236923)
Summary:
A lot of the space in the merged .fdata is taken by branches
to and from [heap], which is jitted code. On different machines,
or during different runs, jitted addresses are all different.
We don't use these addresses, but we need branch info to get
accurate function call counts.
This diff treats all [heap] addresses the same, resulting in a
simplified merged file. The size of the compressed file decreased
from 70MB to 8MB.
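The effect on merging can be sketched as follows. The record layout is a
hypothetical stand-in, not the actual .fdata format, and mapping [heap]
addresses to 0 is an assumption chosen for the example.

```python
from collections import Counter


def merge_branches(records):
    """Sum branch counts, treating every [heap] address as the same location.

    records: iterable of (from_obj, from_addr, to_obj, to_addr, count) tuples
    (hypothetical layout).
    """
    merged = Counter()
    for from_obj, from_addr, to_obj, to_addr, count in records:
        if from_obj == "[heap]":
            from_addr = 0             # jitted addresses differ between runs
        if to_obj == "[heap]":
            to_addr = 0
        merged[(from_obj, from_addr, to_obj, to_addr)] += count
    return merged


merged = merge_branches([
    ("[heap]", 0x7F001000, "foo", 0x0, 10),
    ("[heap]", 0x7F2A4000, "foo", 0x0, 5),
])
print(dict(merged))  # {('[heap]', 0, 'foo', 0): 15}
```

Two records that differ only in the jitted address collapse into one entry
while the call count for foo is preserved.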
(cherry picked from FBD3233943)
Summary:
In a test binary some functions are placed in a segment
preceding the segment containing .text section. As a result,
we were miscalculating maximum function size as the calculation
was based on addresses only.
This diff fixes the calculation by checking if symbol after function
belongs to the same section. If it does not, then we set the maximum
function size based on the size of the containing section and not
on the address distance to the next symbol.
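The corrected calculation can be sketched as below. The Section class and the
function signature are hypothetical; the sketch only models the decision
described above.

```python
from dataclasses import dataclass


@dataclass
class Section:
    start: int
    size: int

    @property
    def end(self):
        return self.start + self.size


def max_function_size(func_addr, func_section, next_addr, next_section):
    """Upper bound on the size of the function starting at func_addr.

    next_addr / next_section describe the symbol that follows the function,
    or are None if there is no such symbol.
    """
    if next_addr is not None and next_section is func_section:
        return next_addr - func_addr          # bounded by the next symbol
    return func_section.end - func_addr       # bounded by the containing section


init = Section(start=0x1000, size=0x100)
text = Section(start=0x4000, size=0x800)
print(hex(max_function_size(0x1080, init, 0x4000, text)))  # 0x80
print(hex(max_function_size(0x4000, text, 0x4040, text)))  # 0x40
```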
(cherry picked from FBD3229205)
Summary:
Added option "-break-funcs=func1,func2,...." to coredump in any
given function by introducing a ud2 sequence at the beginning of the
function. Useful for debugging and validating stack traces.
Also renamed options containing "_" to use "-" instead.
Also run hhvm test with "-update-debug-sections".
(cherry picked from FBD3210248)
Summary:
Make sure we can install all tools needed for processing
BOLT .fdata files such as perf2bolt, merge-fdata, etc.
(cherry picked from FBD3223477)