Summary:
A lot of the space in the merged .fdata is taken by branches
to and from [heap], which is jitted code. On different machines,
or during different runs, jitted addresses are all different.
We don't use these addresses, but we need branch info to get
accurate function call counts.
This diff treats all [heap] addresses the same, resulting in a
simplified merged file. The size of the compressed file decreased
from 70MB to 8MB.
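A minimal sketch of the idea, assuming a hypothetical in-memory record of the form (source, destination, count) where each endpoint is a (name, offset) pair. This is not the real .fdata layout, only an illustration of collapsing [heap] addresses before summing counts.

```python
from collections import Counter

def normalize(location):
    """Collapse every [heap] address to offset 0; leave other locations alone."""
    name, offset = location
    return (name, 0) if name == "[heap]" else (name, offset)

def merge_branches(profiles):
    """Sum branch counts across profiles after normalizing [heap] endpoints."""
    merged = Counter()
    for profile in profiles:
        for src, dst, count in profile:
            merged[(normalize(src), normalize(dst))] += count
    return merged

# Two runs where the jitted caller lives at different addresses.
run_a = [(("[heap]", 0x7F10), ("main", 0), 5)]
run_b = [(("[heap]", 0x7F98), ("main", 0), 7)]
merged = merge_branches([run_a, run_b])
```

Both records fold into a single entry, which is why the merged file shrinks while call counts into real functions are preserved.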
(cherry picked from FBD3233943)
Summary:
In a test binary some functions are placed in a segment
preceding the segment containing .text section. As a result,
we were miscalculating maximum function size as the calculation
was based on addresses only.
This diff fixes the calculation by checking if symbol after function
belongs to the same section. If it does not, then we set the maximum
function size based on the size of the containing section and not
on the address distance to the next symbol.
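The rule can be sketched as follows. The tuple shapes for the next symbol and the section are assumptions made for illustration, not the actual BOLT data structures.

```python
def max_function_size(func_addr, next_sym, section):
    """Upper bound on the size of the function starting at func_addr.

    next_sym: (address, section_name) of the following symbol, or None.
    section:  (name, start_address, size) of the section holding the function.
    """
    name, start, size = section
    if next_sym is not None and next_sym[1] == name:
        # Next symbol is in the same section: address distance is meaningful.
        return next_sym[0] - func_addr
    # Otherwise the function can extend at most to the end of its section.
    return start + size - func_addr
```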
(cherry picked from FBD3229205)
Summary:
Added option "-break-funcs=func1,func2,...." to coredump in any
given function by introducing ud2 sequence at the beginning of the
function. Useful for debugging and validating stack traces.
Also renamed options containing "_" to use "-" instead.
Also run hhvm test with "-update-debug-sections".
(cherry picked from FBD3210248)
Summary:
Make sure we can install all tools needed for processing
BOLT .fdata files such as perf2bolt, merge-fdata, etc.
(cherry picked from FBD3223477)
Summary:
merge-fdata tool takes multiple .fdata files and outputs to stdout
combined fdata. Takes about 2 seconds per additional .fdata
file with hhvm production data.
(cherry picked from FBD3216430)
Summary:
Splitting option now has different meanings/values. Since landing pads
are almost always cold/frozen, we should split them before anything
else (we still check the execution count is 0). That's value '1'.
Everything else goes on top of that and has increased value (2 - large
functions, 3 - everything).
Sorting was non-deterministic and somewhat broken for functions
with EH ranges. Fixed that and added '-split-all-cold' option to
outline all 0-count blocks.
Fixed compilation of test cases. After my last commit the binaries
were linked to wrong source files (i.e. debug info). Had to rebuild
the binaries from updated sources.
(cherry picked from FBD3209369)
Summary:
GNU_args_size is a special kind of CFI that tells runtime to adjust
%rsp when control is passed to a landing pad. It is used for annotating
call instructions that pass (extra) parameters on the stack and there's
a corresponding landing pad.
It is also special in a way that its value is not handled by
DW_CFA_remember_state/DW_CFA_restore_state instruction sequence
that we utilize to restore the state after block re-ordering.
This diff adds association of call instructions with GNU_args_size value
when it's used. If the function does not use GNU_args_size, there is
no overhead. Otherwise, we regenerate GNU_args_size instruction during
code emission, i.e. after all optimizations and block-reordering.
(cherry picked from FBD3201322)
Summary:
Simple functions which we fail to rewrite after optimizations were
having wrong debugging information because the latter would reflect the optimized
version of the function.
There are only 48 functions (at this time) in this situation in the HHVM binary.
The simple fix is to add another full pass. Another more complicated path, which will
be more efficient, is to reset only the BinaryContext and emit again, but then we need
to recreate all symbols in the new MCContext and update the pointers. I started
taking this path but it started getting too complicated for only those 48 functions
(needed to create a new map of global symbols, recreate landing pads - which needed
to have the internal intermediate labels in the functions kept to be updated too, etc).
Because the overhead is quite large (another full emission pass - around 4m30s here)
and the impact is small I put this behind a new
command-line flag which is off by default: -fix-debuginfo-large-functions.
(cherry picked from FBD3166576)
Summary:
Update address ranges of inlined functions and try/catch blocks.
This was missing and led gdb to show weird information in a core dump we inspected
because of the several nestings of inline in the call stack.
This is very similar to Lexical Blocks, so the change is to basically generalize that
code to do the same for DW_AT_try_block, DW_AT_catch_block and DW_AT_inlined_subroutine.
(cherry picked from FBD3169417)
Summary:
readelf was showing some errors because we weren't updating DIEs that were not shallow
in the DIE tree, or DIEs of functions with addresses we don't recognize (mostly functions with
address 0, which could have been removed by the Linker Script but still have debugging information
there). These DIEs need to be updated because their abbreviations are patched.
(cherry picked from FBD3159335)
Summary:
We were updating only one DIE per function, but because the Linker Script may map
multiple functions to the same address this would cause us to generate invalid debug info
(as some DIEs weren't updated but their abbreviations were changed).
(cherry picked from FBD3157263)
Summary:
Non-simple functions aren't emitted, and thus didn't have line number information
emitted. This diff emits it for those functions by extending LLVM's generation of the line number program to allow for absolute addresses (it is wholly symbolic), then iterating over the relevant line tables from the input and appending entries with absolute addresses to the line tables to be emitted.
This still leaves the simple but not overwritten functions unhandled (there were 48 in HHVM in
my last run). However, I think that to fix them we'd need another pass, since by the time we
realize a simple function won't fit, debug line info was already written to the output.
(cherry picked from FBD3148468)
Summary:
Update DWARF location lists in .debug_loc and pointers to
them in .debug_info so that gdb can print variables which change
location during their lifetime.
The following changes were made:
- Refactored BasicBlockOffsetRanges to allow ranges to be tied to binary information (so that we can reuse it for location lists)
- Implemented range compression optimization in BasicBlockOffsetRanges (needed otherwise too much data was being generated).
- Added representation for location lists (LocationList.h, BinaryContext.h)
- Implemented .debug_loc serializer that keeps the updated offsets (DebugLocWriter.{h,cpp})
- After disassembly, traverse entries in .debug_loc and save them in context (BinaryContext.cpp)
- After optimizations, serialize .debug_loc and update pointers in .debug_info (RewriteInstance.cpp)
(cherry picked from FBD3130682)
Summary:
Add a parameter value to "-split-functions=" option to allow splitting
only when the function is too large to fit:
0 - never split
1 - split if too large to fit
2 - always split
We may use this option when the profile data is not very precise.
In that case excessive splitting may increase iTLB misses.
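The three values amount to the following decision, shown here as a small sketch. The function and parameter names are made up for illustration.

```python
def should_split(mode, fits_in_original_slot):
    """Decide whether to split a function for a given -split-functions value."""
    if mode == 0:
        return False                      # never split
    if mode == 1:
        return not fits_in_original_slot  # split only if too large to fit
    if mode == 2:
        return True                       # always split
    raise ValueError(f"unknown -split-functions value: {mode}")
```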
(cherry picked from FBD3137700)
Summary:
This fixes a problem in which bolt was generating a malformed .debug_info
section on the bzip2 binary. The bug was the following:
- A simple and a non-simple function shared an abbreviation
- The abbreviation was patched to contain DW_AT_ranges because of the simple function
- The non-simple function's data was not updated, but then it didn't match the
layout expected by the abbreviation anymore
And because we were already creating an address ranges list in .debug_ranges even
for non-simple functions, it doesn't make sense not to use it anyway.
(cherry picked from FBD3129219)
Summary:
Updates DWARF lexical blocks address ranges in the output binary after optimizations.
This is similar to updating function address ranges except that the ranges representation needs
to be more general, since address ranges can begin or end in the middle of a basic block.
The following changes were made:
- Added a data structure for iterating over the basic blocks that intersect an address range: BasicBlockTable.h
- Added some more bookkeeping in BinaryBasicBlock. Basically, I needed to keep track of the block's size in the input binary as well as its address in the output binary. This information is mostly set by BinaryFunction after disassembly.
- Added a representation for address ranges relative to basic blocks (BasicBlockOffsetRanges.h). Will also serve for location lists.
- Added a representation for Lexical Blocks (LexicalBlock.h)
- Small refactorings in DebugArangesWriter:
-- Renamed to DebugRangesSectionsWriter since it also writes .debug_ranges
-- Refactored it not to depend on BinaryFunction but instead on anything that can be assigned an offset in .debug_ranges (added an interface for that)
- Iterate over the DIE tree during initialization to find lexical blocks in .debug_info (BinaryContext.cpp)
- Added patches to .debug_abbrev and .debug_info in RewriteInstance to update lexical blocks attributes (in fact, this part is very similar to what was done to function address ranges and I just refactored/reused that code)
- Added small test case (lexical_blocks_address_ranges_debug.test)
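To illustrate why ranges need to be expressed relative to basic blocks, here is a rough sketch. The tuple layout and function names are assumptions and do not mirror BasicBlockOffsetRanges.h.

```python
def range_to_block_offsets(blocks, low, high):
    """Split input range [low, high) into per-block offset ranges.

    blocks: list of (name, input_address, size).
    Returns (name, begin_offset, end_offset) for every intersected block.
    """
    pieces = []
    for name, addr, size in blocks:
        begin = max(low, addr)
        end = min(high, addr + size)
        if begin < end:
            pieces.append((name, begin - addr, end - addr))
    return pieces

def to_output_ranges(pieces, output_address):
    """Translate block-relative pieces to addresses in the output binary."""
    return [(output_address[n] + b, output_address[n] + e) for n, b, e in pieces]

blocks = [("A", 0x100, 0x10), ("B", 0x110, 0x20), ("C", 0x130, 0x10)]
pieces = range_to_block_offsets(blocks, 0x108, 0x134)
# After reordering, B has been moved far away from A and C.
ranges = to_output_ranges(pieces, {"A": 0x400, "B": 0x800, "C": 0x410})
```

A range that was contiguous in the input starts in the middle of A and ends in the middle of C, so it becomes several ranges once the blocks move.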
(cherry picked from FBD3113181)
Summary:
Before this diff LLVM used to iterate over all sections to find the
one with an address we want to remap. Since we have an extremely
large number of sections, this process is highly inefficient.
Instead we add a new interface to remap a section with a given ID
(which effectively is an index into an array of sections), and
pass the ID instead of the address.
This cuts down the processing time of hhvm binary by 10 seconds,
and brings the total processing time to a little under 2 minutes.
(cherry picked from FBD3110015)
Summary:
Populate function execution count while parsing fdata. Before
we used a quadratic algorithm to populate the execution count
(had to iterate over *all* branches for every single function).
Ignore non-symbol to non-symbol branches while parsing fdata.
These changes combined drop HHVM processing time from
4 minutes 53 seconds down to 2 minutes 9 seconds on my devserver.
Test case had to be modified since it contained irrelevant
branches from PLT to libc.
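A sketch of the single-pass approach, under two assumptions that are not taken from the actual parser: a simplified whitespace-separated record "src src_off dst dst_off count" with "?" standing for a non-symbol, and the execution count of a function being the sum of branches that land on its entry (offset 0).

```python
from collections import defaultdict

def parse_counts(lines):
    """Accumulate per-function execution counts while reading branch records."""
    exec_count = defaultdict(int)
    for line in lines:
        src, _src_off, dst, dst_off, count = line.split()
        if src == "?" and dst == "?":
            continue  # non-symbol to non-symbol branch: ignored
        if int(dst_off, 16) == 0:
            exec_count[dst] += int(count)  # branch lands on the function entry
    return dict(exec_count)
```

Each record is visited once, instead of rescanning all branches for every function.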
(cherry picked from FBD3106263)
Summary:
[WIP] Update DWARF info for function address ranges.
This diff currently does not work for unknown reasons,
but I'm describing here what's the current state.
According to both llvm-dwarfdump and readelf our output seems correct,
but GDB does not interpret it as expected. All details go below in
hope I missed something.
I couldn't actually track the whole change that introduced support for
what we need in gdb yet, but I think I can get to it
(2007-12-04: Support lexical blocks and function bodies that occupy
non-contiguous address ranges). I have reasons to believe gdb at least at some
The set of introduced changes was basically this:
- After disassembly, iterate over the DIEs in .debug_info and find the
ones that correspond to each BinaryFunction.
- Refactor DebugArangesWriter to also write addresses of functions to
.debug_ranges and track the offsets of function address ranges there
- Add some infrastructure to facilitate patching the binary in
simple ways (BinaryPatcher.h)
- In RewriteInstance, after writing .debug_ranges already with
function address ranges, for each function do:
-- Find the abbreviation corresponding to the function
-- Patch .debug_abbrev to replace DW_AT_low_pc with DW_AT_ranges and
DW_AT_high_pc with DW_AT_producer (I'll explain this hack below).
Also patch the corresponding forms to DW_FORM_sec_offset and
DW_FORM_string (null-terminated in-place string).
-- Patch debug_info with the .debug_ranges offset in place of
the first 4 bytes of DW_AT_low_pc (DW_AT_ranges only occupies 4
bytes whereas low_pc occupies 8), and write an arbitrary string
in-place in the other 12 bytes that were the 4 MSB of low_pc
and the 8 bytes of high_pc before the patch. This depends on
low_pc and high_pc being put consecutively by the compiler, but
it serves to validate the idea. I tried another way of doing it
that does not rely on this but it didn't work either and I believe
the reason for either not working is the same (and still unknown,
but unrelated to them. I might be wrong though, and if I find yet
another way of doing it I may try it). The other way was to
use a form of DW_FORM_data8 for the section offset. This is
disallowed by the specification, but I doubt gdb validates this,
as it's just easier to store it as 64-bit anyway as this is even
necessary to support 64-bit DWARF (which is not what gcc generates
by default apparently).
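The byte accounting of the in-place patch can be checked with a short sketch. Little-endian encoding and the filler text are assumptions; the helper name is hypothetical.

```python
import struct

def patch_low_high_pc(ranges_offset, filler=b"bolt"):
    """Build the 16 bytes that overwrite an 8-byte low_pc plus 8-byte high_pc.

    Layout: 4-byte DW_FORM_sec_offset, then a 12-byte DW_FORM_string
    (11 characters plus the terminating NUL).
    """
    text = filler[:11].ljust(11, b"_") + b"\0"
    patched = struct.pack("<I", ranges_offset) + text
    assert len(patched) == 16
    return patched

patched = patch_low_high_pc(0x120)
```

The total size is unchanged, so no other offsets in .debug_info need to move.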
I still need to make changes to the diff to make it production-ready,
but first I want to figure out why it doesn't work as expected.
By looking at the output of llvm-dwarfdump or readelf, all of
.debug_ranges, .debug_abbrev and .debug_info seem to have been
correctly updated. However, gdb seems to have serious problems with
what we write.
(In fact, readelf --debug-dump=Ranges shows some funny warning messages
of the form ("Warning: There is a hole [0x100 - 0x120] in .debug_ranges"),
but I played around with this and it seems it's just because no
compile unit was using these ranges. Changing .debug_info apparently
changes these warnings, so they seem to be unrelated to the section
itself. Also looking at the hex dump of the section doesn't help,
as everything seems fine. llvm-dwarfdump doesn't say anything.
So I think .debug_ranges is fine.)
The result is that gdb not only doesn't show the function name as we
wanted, but it also stops showing line number information.
Apparently it's not reading/interpreting the address ranges at all,
and so the functions now have no associated address ranges, only the
symbol value which allows one to put a breakpoint in the function,
but not to show source code.
As this left me without more ideas of what to try to feed gdb with,
I believe the most promising next trial is to try to debug gdb itself,
unless someone spots anything I missed.
I found where the interesting part of the code lies for this
case (gdb/dwarf2read.c and some other related files, but mainly that one).
It seems in some parts gdb uses DW_AT_ranges for only getting
its lowest and highest addresses and setting that as low_pc and
high_pc (see dwarf2_get_pc_bounds in gdb's code and where it's called).
I really hope this is not actually the case for
function address ranges. I'll investigate this further. Otherwise
I don't think any changes we make will make it work as initially
intended, as we'll simply need gdb to support it and in that case it
doesn't.
(cherry picked from FBD3073641)
Summary:
We used to output .debug_line information for every instruction, but because of the way
gdb (and probably lldb as of llvm::DWARFDebugLine::LineTable::findAddress) queries the
line table it's not necessary to output information for two instructions if they follow
each other and map to the same source line. By not repeating this information we generate
a bit less .debug_line data.
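In sketch form, with rows given as hypothetical (address, file, line) tuples sorted by address:

```python
def compress_rows(rows):
    """Drop a row when it maps to the same source location as the previous kept row."""
    kept = []
    for addr, file, line in rows:
        if kept and kept[-1][1:] == (file, line):
            continue  # a lookup by address still resolves to the previous row
        kept.append((addr, file, line))
    return kept
```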
(cherry picked from FBD3056402)
Summary:
The line number information generated from a null pointer
was actually valid, which caused new instructions without the line number
information set to have a valid and wrong line number reference. This diff
fixes this by making the null pointer be assigned to an invalid line number
row.
(cherry picked from FBD3048453)
Summary:
Write the .debug_aranges section after optimizations to the output binary.
Each function generates at least one range and at most two (one extra for its cold part).
The writing is done manually because LLVM's implementation is tied to the output of
.debug_info (see EmitGenDwarfInfo and EmitGenDwarfARanges in lib/MC/MCDwarf.cpp),
which we don't want to trigger right now.
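The "one or two ranges per function" rule looks roughly like this. The dictionary keys are invented for the example.

```python
def function_aranges(func):
    """Return the (address, size) ranges a function contributes to .debug_aranges."""
    ranges = [(func["address"], func["size"])]
    cold = func.get("cold")  # (address, size) of the cold part, if split
    if cold is not None:
        ranges.append(cold)
    return ranges
```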
(cherry picked from FBD3043108)
Summary:
At the moment we rely solely on the symbol table information to discover
function boundaries. However, similar information is contained in
.eh_frame. Verify that the information from these two sources is
consistent, and if it's not, then skip processing the functions with
conflicting information.
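A sketch of such a check, assuming both sources are reduced to an address-to-size mapping and that a function with no FDE at all is not treated as a conflict. Both points are assumptions for illustration.

```python
def partition_by_consistency(symtab, fdes):
    """Split functions into (keep, skip) based on agreement with .eh_frame.

    symtab, fdes: dict mapping start address -> size.
    """
    keep, skip = [], []
    for addr, size in sorted(symtab.items()):
        fde_size = fdes.get(addr)
        if fde_size is not None and fde_size != size:
            skip.append(addr)  # the two sources disagree on the boundaries
        else:
            keep.append(addr)
    return keep, skip
```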
(cherry picked from FBD3043800)
Summary:
After we add new line number information we have to update stmt_list
offsets in .debug_info. For this I had to add a primitive relocations
support for non-allocatable sections we are copying from input file.
Also enabled functionality to process relocations in non-allocatable
sections that LLVM is generating, such as .debug_line. I thought
we already had it, but apparently it didn't work, at least not
for ELF binaries.
(cherry picked from FBD3037903)
Summary:
Skip DW_CFA_expression and DW_CFA_val_expression instructions
properly, according to DWARF spec.
If CFI range does not match function range skip that function.
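For reference, both instructions carry a ULEB128 register followed by a block (ULEB128 length plus that many bytes), so skipping them means decoding two ULEB128 values. The sketch below shows the skipping logic only; the function names are ours and it is not the LLVM parser.

```python
DW_CFA_expression = 0x10
DW_CFA_val_expression = 0x16

def read_uleb128(data, pos):
    """Decode a ULEB128 value at data[pos]; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def skip_expression(opcode, data, pos):
    """Given pos just past the opcode byte, return the offset of the next instruction."""
    if opcode not in (DW_CFA_expression, DW_CFA_val_expression):
        raise ValueError("not an expression CFI instruction")
    _register, pos = read_uleb128(data, pos)
    length, pos = read_uleb128(data, pos)
    return pos + length  # step over the DWARF expression bytes
```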
(cherry picked from FBD3040502)
Summary:
Writes .debug_line section by setting the state
in MCContext that LLVM needs to produce and output the
line tables. This basically consists of setting the
current location and compile unit offset. This makes LLVM
output .debug_line in the temporary file, but not yet in
the generated ELF file.
Also computes the line table offsets for each compile unit
and saves them into BinaryContext. Added an option to
print these offsets.
(cherry picked from FBD3004554)
Summary:
This is a set of changes that allow modification of non-allocatable
sections in ELF binary. Primarily for the purpose of updating debug
info.
Extend LLVM interface to allow processing relocations in non-allocatable
sections. This allows producing .debug* sections with resolved
relocations against generated code.
Extend BOLT rewriting framework to allow appending contents to
non-allocatable sections in the binary.
Re-worked ELF binary rewriting to support the above and to allow future
extensions (e.g. new section names).
(cherry picked from FBD3023403)
Summary:
Reads information in the DWARF .debug_line section using LLVM and
ties every MCInst to one line of a line table from the input binary. Subsequent
diffs will update this information to match the final binary layout and
output updated line tables.
(cherry picked from FBD2989813)
Summary:
Force the splitting of the function into hot/cold even when
the function fits into original slot.
This reduces BOLT optimization time by 50% without affecting
hhvm performance.
(cherry picked from FBD2973773)
Summary:
If we see an unknown CFI instruction, skip processing the function
containing it instead of aborting execution.
(cherry picked from FBD2964557)
Summary:
Added an option to reuse existing program header entry.
This option allows for bfd tools like strip and objcopy
to operate on the optimized binary without destroying it.
Also, all new sections are now properly marked in ELF.
(cherry picked from FBD2943339)
Summary:
We used to require pre-allocated space in the input binary so that
we can write extra sections in there (.eh_frame, .eh_frame_hdr,
.gcc_except_table, etc.). With this diff there's no further
need for pre-allocated storage as we create a new segment and
can use as much space as needed.
There are certain limitations on where the new segment could
be allocated, and as a result the size of the file may increase.
There's currently a limitation: if the binary size is close to 4GB,
we cannot allocate the new segment prior to that, and as a result
we require debug info to be stripped to reduce the file size.
The fix is in progress.
(cherry picked from FBD2916029)
Summary:
We use intermediate .o file for debugging purposes, but there's no
reason to generate it by default. Only do it if "-keep-tmp" is
specified.
(cherry picked from FBD2912098)
Summary:
Preserve original layout for basic blocks that have 0 execution
count. Since we don't optimize for size, it's better to rely on
the original input order.
(cherry picked from FBD2875335)
Summary:
We should never outline the first basic block.
Also add an option to accept a file with the list of
functions to optimize.
(cherry picked from FBD2868184)
Summary:
We could split functions with exceptions even without creating
a new exception handling table. This limits us to only move
basic blocks that never throw, and are not a start of a
landing pad.
(cherry picked from FBD2862937)
Summary:
Some basic blocks were created empty because they only contained
alignment nop's. Ignore such nop's before basic block gets created.
Fixed intermittent aborts related to CFI update.
(cherry picked from FBD2844465)
Summary:
* Update CFI state for larger range of functions to increase coverage.
* Issue more warnings indicating reasons for skipping functions.
* Print top called functions in the binary.
(cherry picked from FBD2839734)
Summary:
Modified processing of "-reorder-blocks=" option and added an option
to reverse original basic blocks order for testing purposes.
(cherry picked from FBD2829862)
Summary:
Fixes some issues discovered after hhvm switched to gcc 4.9.
Add support for DW_CFA_GNU_args_size instruction.
Allow CFI instruction after the last instruction in a function.
Reverse conditions of assert for DW_CFA_set_loc.
(cherry picked from FBD28110096)
Summary:
Binary code could be weird. It could include calls to address 0 and
reference data at 0 (e.g. with lea on x86). LLVM JIT fatals
while resolving relocations against symbols at address 0x0. For now
we will stop emitting such code, i.e. we'll skip functions.
(cherry picked from FBD28109837)
Summary:
In a test binary, we found 8 cases where code in a function A would jump to the
middle of another function B. In this case, we cannot reorder function B because
this would change instruction offsets and break the program. This is pretty rare
but can happen in code written in assembly.
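Detecting this case can be sketched as below. The input shapes are assumptions: functions as name -> (start, size) and branches as (source function, target address).

```python
def functions_with_interior_entries(functions, branches):
    """Return functions that are targeted in the middle by another function."""
    pinned = set()
    for src_func, target in branches:
        for name, (start, size) in functions.items():
            # Strictly inside another function: not its entry, not past its end.
            if name != src_func and start < target < start + size:
                pinned.add(name)
    return pinned
```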
(cherry picked from FBD2719850)
Summary:
We found out that the insertion of extra nops to preserve alignment of
some loop bodies does not pay off the increased function size, since this extra
size may inhibit us from rewriting a reordered version of this function.
(cherry picked from FBD2718466)
Summary:
Our CFI parser in the LLVM library was giving up on parsing all CFI
instructions when finding a single instruction with expression operands. Yet,
all gcc-4.9 binaries seem to have at least one CFI instruction with expression
operands (DW_CFA_def_cfa_expression). This patch fixes this and makes DebugInfo
continue to parse other instructions, even though it does not completely parse
DWARF expressions yet. However, this seems to be enough to allow llvm-flo to
process gcc-4.9 binaries because the FDEs with DWARF expressions are linked to
the PLT region, and not to functions that we process.
If we ever try to read a function whose CFI depends on DWARF expression, which
is unlikely, llvm-flo will assert.
(cherry picked from FBD2693088)
Summary:
This patch builds upon the previous patch to create a two-pass process
to function splitting. We first perform the full rewriting pipeline to discover
which functions need splitting. Afterwards, we restart the pipeline with those
functions annotated to be split.
(cherry picked from FBD2691709)
Summary:
Previously, llvm-flo.cpp contained a long function doing lots of
different tasks. This patch refactors this logic into a separate class with
different member functions, exposing the relationship between each step of
the rewriting process and making it easier to coordinate/change it.
(cherry picked from FBD2691674)
Summary:
After basic block reordering, it may be possible that the reordered
function is now larger than the original because of the following reasons:
- jump offsets may change, forcing some jump instructions to use 4-byte
immediate operand instead of the 1-byte, shorter version.
- fall-throughs change, forcing us to emit an extra jump instruction to jump
to the original fall-through at the end of a basic block.
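To make the first point concrete: on x86-64 a jump with an 8-bit displacement takes 2 bytes, while the 32-bit form takes 5 bytes (unconditional) or 6 bytes (conditional). A small sketch with a helper name of our own:

```python
def jump_size(displacement, conditional):
    """Encoded length in bytes of an x86-64 direct jump."""
    if -128 <= displacement <= 127:
        return 2                    # opcode + rel8
    return 6 if conditional else 5  # 0F 8x + rel32, or E9 + rel32
```

Moving a target just out of the 8-bit range therefore grows a conditional jump by 4 bytes.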
Since we currently do not change function addresses, we need to rewrite the
function back in the binary in the original location. If it doesn't fit, we were
dropping the function.
This patch adds a flag -split-functions that tells llvm-flo to split hot
functions into hot and cold separate regions. The hot region is written back
in the original function location, while the cold region is written in a
separate, far-away region reserved to flo via a linker script.
This patch also adds the logic to create an extra FDE to supply unwinding
information to the cold part of the function. Owing to this, we now need to
rewrite .eh_frame_hdr to another location and patch the EH_FRAME ELF segment
to point to this new .eh_frame_hdr.
(cherry picked from FBD2677996)
Summary:
This is an attempt at determining the hotness of functions we are
rewriting and help detect if we are discarding hot functions. This patch
introduces logic to estimate the number of instructions executed in each
function by using the profile data for branches. It sums the products of
BB frequency and size. Since we can only do this for functions we have
successfully disassembled, created the CFG and annotated with profiling
data, all complex functions that were not disassembled are left out from
this analysis.
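The estimate itself is a plain sum of products. In this sketch each block is a hypothetical (frequency, size) pair:

```python
def estimate_hotness(blocks):
    """Sum of frequency * size over the basic blocks of one function."""
    return sum(freq * size for freq, size in blocks)
```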
(cherry picked from FBD2654985)
Summary:
Previously, we were marking functions with indirect calls as too
complex to be disassembled, but this was unnecessarily conservative. This patch
removes this restriction.
(cherry picked from FBD2669627)
Summary:
Teach llvm-flo to drop functions with LSDA information until we know
how to update them after block reordering.
(cherry picked from FBD2640806)
Summary:
This patch adds logic to detect when the binary has extra space
reserved for us via the __flo_storage symbol. If this symbol is present,
it means we have extra space in the binary to write extraneous information.
When we write a new .eh_frame, we cannot discard the old .eh_frame because
it may still contain relevant information for functions we do not reorder.
Thus, we write the new .eh_frame into __flo_storage and patch the current
.eh_frame_hdr to point to the new .eh_frame only for the functions we touched,
generating a binary that works with a bi-.eh_frame model.
(cherry picked from FBD2639326)
Summary:
This patch is an intermediary step towards updating the CFI in the
optimized binary. It adds the logic necessary to output our CFI annotations to
a new .eh_frame in the temporary object file we create to hold rewritten
functions. The next step will be to fully integrate this new .eh_frame into the
optimized binary.
(cherry picked from FBD2633728)
Summary:
This patch introduces logic to check how the CFI instructions define a
table to help during stack unwinding at exception run time and attempts to fix
any problem in this table that may have been introduced by reordering the basic
blocks. If it fails to fix this problem, the function is marked as not simple
and not eligible for rewriting.
(cherry picked from FBD2633696)
Summary:
Regenerate exception handling information after optimizations.
Use '-print-eh-ranges' to see CFG with updated ranges.
(cherry picked from FBD2660982)
Summary:
There were two issues: we were trying to process non-simple functions,
i.e. functions that we don't fully understand, and then we failed to stop
iterating if EH closing label was after the last instruction in a
function.
(cherry picked from FBD2664460)
Summary:
Read .gcc_except_table and add information to CFG. Calls have extra operands
indicating there's a possible handler for exceptions and an action. Landing
pad information is recorded in BinaryFunction.
Also convert JMP instructions that are calls into tail call pseudo-instructions
so that they don't miss call instruction analysis.
(cherry picked from FBD2652775)
Summary: Reverting this commit until we better investigate why
it is necessary to change local symbol names with a prefix.
(cherry picked from FBD28109521)
Summary: After discussion with Maksim, we decided to drop the lines
that add the PG prefix if the symbol is already local, since they
wouldn't be impacted by the way LLVM handles these symbols.
(cherry picked from FBD28109400)
Summary:
This bug would cause llvm-flo to fail to disambiguate two local symbols
with the same file name, causing two different addresses to compete in the
symbol table for the resolution of a given name, causing unpredictable behavior in
the linker.
(cherry picked from FBD2646626)
Summary:
In order to represent CFI information in our BinaryFunction class, this
patch adds a map of Offsets to CFI instructions. In this way, we make it easy to
check exactly where DWARF CFI information is annotated in the disassembled
function.
(cherry picked from FBD2619216)
Summary:
We need to parse the whole contents of .gcc_except_table even if we are
not printing exceptions. Otherwise we are missing type index table and
miscalculate the size of the current table.
(cherry picked from FBD2632965)
Summary: In order to reorder binaries with C++ exceptions, we first need to
read DWARF CFI (call frame info) from binaries in a table in the .eh_frame
ELF section. This table contains unwinding information we need to be aware of
when reordering basic blocks, so as to avoid corrupting it. This patch also
cleans up some code from Exceptions.cpp due to a refactoring where we moved
some functions to the LLVM's libSupport.
(cherry picked from FBD2614464)
Summary:
Print actions for exception ranges from .gcc_except_table.
Types are printed as names if the name is available from symbol table.
(cherry picked from FBD2612631)
Summary:
Previously, we inferred all non-taken branch frequencies with the
information we had for taken branches. This patch teaches perf2flo and llvm-flo
how to read and incorporate non-taken branch frequencies directly from the
traces available in LBR data and by disassembling the binary. It still leaves
the inference engine untouched in case we need it to fill out other
fall-throughs.
(cherry picked from FBD2589212)
Summary:
Pettis' paper on block layout (PLDI'90) suggests we should order
clusters (or chains, using the paper terminology) using a specific criterion.
This patch implements two distinct ideas for cluster layout that can be
activated using different command-line flags. The first one reflects Pettis'
ideas on minimizing branch mispredictions and the second one is targeted at
reducing I-cache misses, described in the Ispike paper (CGO'04).
(cherry picked from FBD2588693)
Summary:
Fixes a bug which caused the block reordering heuristic to put in the
same cluster hot basic blocks and cold basic blocks, increasing I-cache misses.
(cherry picked from FBD2588203)
Summary:
When the ignore-nops patch landed, it exposed a bug in fixBranches()
where it ignored empty BBs. However, we cannot ignore an empty BB when it is
reordered and its fall-through changes. We must update it with a jump to the
original fall-through. This patch fixes this.
(cherry picked from FBD2568244)
Summary:
It is important to remove dead blocks to free up space in functions
and allow us to reorder blocks or align branch targets with more
freedom. This patch implements a simple algorithm to delete all basic
blocks that are not reachable from the entry point. Note that C++
exceptions may create "unreachable" blocks, so this option must be
used with care.
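A minimal sketch of the reachability pass, with the CFG given as a hypothetical dict of block name to successor names. Landing pads are not modeled here, which is exactly the caveat above.

```python
def remove_unreachable(successors, entry):
    """Keep only the blocks reachable from the entry point."""
    reachable, worklist = {entry}, [entry]
    while worklist:
        block = worklist.pop()
        for succ in successors.get(block, ()):
            if succ not in reachable:
                reachable.add(succ)
                worklist.append(succ)
    return {b: s for b, s in successors.items() if b in reachable}
```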
(cherry picked from FBD2562637)
Summary:
SPEC CPU2006 perlbench triggered a bug in our heuristic block
reordering algorithm where a hot edge that targets the entry point (as in a
recursive tail call) would make us try to allocate the call site before the
function entry point. Since we don't update function addresses yet, moving the
entry point will corrupt the program. This patch fixes this.
(cherry picked from FBD2562528)
Summary:
If we have two consecutive JMP instructions and no branches to the
second one, the second one is dead code, but llvm-flo does not handle these
cases properly and puts two JMPs in the same BB. This patch fixes this, putting
the extraneous JMP in a separate block, making it easy for us to detect it is
dead code and remove it later in a separate step.
(cherry picked from FBD2562465)
Summary:
Nop instructions are primarily used for alignment purposes on the input.
We remove all nops when we build CFG and derive alignment of basic blocks
based on existing alignment and a presence of nops before it. This
will not always work as some basic blocks will be naturally aligned
without necessity for nops. However, it's better than random alignment.
We would also add heuristics for BB alignment based on execution profile.
(cherry picked from FBD2561740)
Summary:
Adds logic in BinaryFunction to be able to fix branches (invert
its condition, delete or add a branch), making the new function work with the
new layout proposed by the layout pass. All the architecture-specific content
was designed to live in the LLVM Target library, in the MCInstrAnalysis pass.
For now, we only introduce such logic to the X86 backend.
(cherry picked from FBD2551479)
Summary:
Tests with SPEC CPU2006 400.perlbench exposed a bug in the block reordering
heuristic that happened when two blocks are both successor and predecessor of
each other. This patch fixes this.
(cherry picked from FBD2555835)
Summary:
SPEC CPU2006 perlbench exposed a bug in BinaryFunction::optimizeLayout()
where it would try to optimize the layout even though the function had zero
basic blocks. This patch simply checks if the function has zero basic blocks and
bails out.
(cherry picked from FBD2556831)
Summary:
In a recent commit, we changed local symbols to be specially tagged
with the number 2 (local sym) instead of 1 (sym). This patch modifies the reader
so that it doesn't choke when seeing a 2 in the symbol id field.
(cherry picked from FBD2552776)
Summary:
This patch implements a dynamic programming approach to reorder
basic blocks with profiling information in an optimal way. Since this is
analogous to TSP, it is NP-hard and the algorithm is exponential in time and
memory consumption. Therefore, we only use the optimal algorithm to decide the
layout of small functions (with less than 11 basic blocks).
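A sketch of what such a dynamic program can look like: a Held-Karp style search over (set of placed blocks, last block) that maximizes the total weight of edges turned into fall-throughs. Keeping block 0 first as the entry and the exact objective are assumptions on our side, not a description of the committed code.

```python
from functools import lru_cache

def optimal_layout(num_blocks, weight):
    """Return (gain, layout). weight[(a, b)] is the frequency of edge a -> b."""
    if num_blocks >= 11:
        raise ValueError("exponential algorithm: small functions only")
    full = (1 << num_blocks) - 1

    @lru_cache(maxsize=None)
    def best(placed, last):
        # Best gain and ordering for the blocks not yet in the bitmask `placed`.
        if placed == full:
            return 0, ()
        result = None
        for nxt in range(num_blocks):
            if placed & (1 << nxt):
                continue
            gain, tail = best(placed | (1 << nxt), nxt)
            gain += weight.get((last, nxt), 0)  # edge last -> nxt falls through
            if result is None or gain > result[0]:
                result = (gain, (nxt,) + tail)
        return result

    gain, tail = best(1, 0)  # block 0 is the entry and stays first
    return gain, [0, *tail]
```

The table has on the order of 2^n * n states, which is why the block-count limit matters.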
(cherry picked from FBD2544124)
Summary:
This patch introduces a first approach to reorder basic blocks based on
profiling data that gives us the execution frequency for each edge. Our strategy
is to lay out basic blocks in an order that maximizes the weight (hotness) of
branches that will be deleted. We can delete branches when src comes right
before dst in the new layout order. This can be reduced to TSP. This
patch uses a greedy heuristic to solve the problem: we start with a graph with
no edges and progressively add edges by choosing the hottest edges first,
building a layout order that attempts to put BBs with hot edges together.
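The greedy step can be sketched as chain merging. Details such as emitting leftover chains in original order and never placing a block in front of the entry are choices made for this illustration.

```python
def greedy_layout(blocks, edges, entry):
    """blocks: names in original order; edges: (src, dst, weight) tuples."""
    chain_of = {b: [b] for b in blocks}
    for src, dst, _ in sorted(edges, key=lambda e: -e[2]):  # hottest first
        a, b = chain_of[src], chain_of[dst]
        # Merge only if src ends its chain, dst starts a different chain,
        # and the entry block would not lose its first position.
        if a is b or a[-1] != src or b[0] != dst or dst == entry:
            continue
        a.extend(b)
        for block in b:
            chain_of[block] = a
    layout, seen = [], set()
    for block in [entry, *blocks]:  # entry chain first, then original order
        chain = chain_of[block]
        if id(chain) not in seen:
            seen.add(id(chain))
            layout.extend(chain)
    return layout
```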
(cherry picked from FBD2544076)
Summary:
The LBR only has information about taken branches and does not record
information when a branch is not taken. In our CFG, we call these edges
"fall-through" edges. This patch teaches llvm-flo how to infer fall-through
edge frequencies.
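One plausible way to infer such a frequency is flow conservation: whatever enters a block and does not leave through a taken branch must have fallen through. This is only a sketch under that assumption, not a description of the inference llvm-flo actually performs.

```python
def infer_fallthrough(block_count, taken_out):
    """Estimate the fall-through count of a block.

    block_count: how many times the block was executed.
    taken_out:   counts of the taken branches leaving the block (from LBR).
    """
    return max(block_count - sum(taken_out), 0)  # clamp noisy profiles at zero
```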
(cherry picked from FBD2536633)
Summary:
Changes DataReader to organize branch perf data per function name and
sets up logistics to bring this data to BinaryFunction::buildCFG(). To do this,
we expand BinaryContext with a const reference to DataReader. This patch also
adds the "-dump-functions" flag to force llvm-flo to dump the current state of
BinaryFunctions once they are disassembled and their CFG built, allowing us to
test whether the builder is sane with LLVM LIT tests.
(cherry picked from FBD2534675)
Summary:
This patch introduces DataReader, a module responsible for
parsing llvm flo data files into in-memory data structures.
(cherry picked from FBD2515754)