Summary: Register all sections with BinaryContext. Store all sections in a set ordered by (address, size, name). Add two separate maps to lookup sections by address or by name. Non-allocatable sections are not stored in the address->section map since they all "start" at 0.
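A rough sketch of the containers this adds (the comparator and map types are illustrative, not the exact BOLT declarations):
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <tuple>

struct BinarySection {
  uint64_t Address;
  uint64_t Size;
  std::string Name;
  bool IsAllocatable;
};

// All sections, ordered by (address, size, name).
struct SectionCompare {
  bool operator()(const BinarySection *A, const BinarySection *B) const {
    return std::tie(A->Address, A->Size, A->Name) <
           std::tie(B->Address, B->Size, B->Name);
  }
};

std::set<BinarySection *, SectionCompare> Sections;
std::multimap<uint64_t, BinarySection *> AddressToSection;
std::multimap<std::string, BinarySection *> NameToSection;

void registerSection(BinarySection *Sec) {
  Sections.insert(Sec);
  NameToSection.emplace(Sec->Name, Sec);
  if (Sec->IsAllocatable)   // non-allocatable sections all "start" at 0
    AddressToSection.emplace(Sec->Address, Sec);
}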
(cherry picked from FBD6862973)
Summary:
Refactor the relocation analysis code. It should be a little better at validating
that the relocation value matches up with the symbol address + addend stored in the
relocation (except on AArch64). It is also a little better at finding the symbol
address used to do the lookup in BinaryContext, rather than just using symbol
address + addend.
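A rough sketch of the extra validation (names are illustrative, not the actual BOLT API):
#include <cassert>
#include <cstdint>

void checkRelocation(uint64_t ExtractedValue,  // value read at the relocated location
                     uint64_t SymbolAddress,   // address of the referenced symbol
                     uint64_t Addend,
                     bool IsAArch64) {
  // On most targets the stored value must equal symbol address + addend;
  // AArch64 relocations often encode only a fragment of the address.
  if (!IsAArch64)
    assert(ExtractedValue == SymbolAddress + Addend &&
           "relocation value does not match symbol address + addend");
  // The subsequent BinaryContext lookup uses SymbolAddress itself,
  // not SymbolAddress + Addend.
}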
(cherry picked from FBD6814702)
Summary: Add BinarySection class that is a wrapper around SectionRef. This is refactoring work for static data reordering.
(cherry picked from FBD6792785)
Summary:
Rewrite how data/code markers are interpreted, so the code
can have constant islands essentially anywhere. This is necessary to
accommodate custom AArch64 assembly code coming from mozjpeg. Allow
any function to refer to the constant island owned by any other
function. When this happens, we pull the constant island from the
referenced function and emit it as our own, so it lives near the
code that refers to it, allowing us to freely reorder functions
and code pieces. Make BOLT stricter about not changing anything
in non-simple ARM functions, since we need to preserve offsets in
functions whose jump tables we don't interpret (currently any
ARM function with jump tables is non-simple and is left
untouched).
(cherry picked from FBD6402324)
Summary:
Add a few new relocation types to support a wider variety of
binaries, add support for constant island duplication (so we can split
functions in large binaries) and make the LongJmp pass really precise with
respect to layout, so we don't miss stub insertions at the correct
places in really large binaries. In LongJmp, introduce "freeze"
annotations so fixBranches won't disturb the jumps we carefully
determined need a stub.
(cherry picked from FBD6294390)
Summary:
The pass was previously copying data that would change after layout
because it had a relocation at the copied address.
(cherry picked from FBD6541334)
Summary:
Profile reading was tightly coupled with building CFG. Since I plan
to move to a new profile format that will be associated with CFG
it is critical to decouple the two phases.
We now read the profile right after the CFG is constructed, but
before it is "canonicalized", i.e. CTCs will still be there.
After reading the profile, we run a post-processing pass that fixes
the CFG and does some post-processing for debug info, such as
inference of fall-throughs, which is still required with the current
format.
Another good reason for decoupling is that we can use profile with
CFG to more accurately record fall-through branches during
aggregation.
At the moment we use "Offset" annotations to facilitate location
of instructions corresponding to the profile. This might not be
super efficient. However, once we switch to the new profile format
the offsets will no longer be needed. We might keep them for
the aggregator, but if we have to trust LBR data that might
not be strictly necessary.
I've tried to make changes while keeping backwards compatibility. This makes
it easier to verify correctness of the changes, but that also means
that we lose accuracy of the profile.
Some refactoring is included.
Flag "-prof-compat-mode" (on by default) is used for bug-level
backwards compatibility. Disable it for more accurate tracing.
(cherry picked from FBD6506156)
Summary:
If relocations are available in the binary, use them by default.
If "-relocs" is specified, then require relocations for further
processing. Use "-relocs=0" to forcefully ignore relocations.
Instead of `opts::Relocs` use `BinaryContext::HasRelocations` to check
for the presence of the relocations.
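A sketch of the resulting behavior (only HasRelocations is named above; the other identifiers are illustrative):
#include <cstdio>
#include <cstdlib>

bool decideRelocationMode(bool InputHasRelocations,  // relocations present in the binary
                          bool RelocsFlagGiven,      // "-relocs" passed explicitly
                          bool RelocsFlagValue) {    // its value ("-relocs=0" -> false)
  if (!RelocsFlagGiven)
    return InputHasRelocations;          // new default: use relocations if available
  if (RelocsFlagValue && !InputHasRelocations) {
    fprintf(stderr, "BOLT-ERROR: relocations requested but not available\n");
    exit(1);                             // "-relocs" makes them mandatory
  }
  return RelocsFlagValue;                // "-relocs=0" forcefully ignores them
}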
(cherry picked from FBD6530023)
Summary:
Use value profiling data to remove the method pointer loads from vtables when doing ICP at virtual function and jump table callsites.
The basic process is the following:
1. Work backwards from the callsite to find the most recent def of the call register.
2. Work back from the call register def to find the instruction where the vtable is loaded.
3. Find out if there is any value profiling data associated with the vtable load. If so, record all these addresses as potential vtables + method offsets.
4. Since the addresses extracted by #3 will be vtable + method offset, we need to figure out the method offset in order to determine the actual vtable base address. At this point I virtually execute all the instructions that occur between #3 and #2 that touch the method pointer register. The result of this execution should be the method offset.
5. Fetch the actual method address from the appropriate data section containing the vtable using the computed method offset. Make sure that this address maps to an actual function symbol.
6. Try to associate a vtable pointer with each target address in SymTargets. If every target has a vtable, then this is almost certainly a virtual method callsite.
7. Use the vtable address when generating the promoted call code. It's basically the same as regular ICP code except that the compare is against the vtable and not the method pointer. Additionally, the instructions to load up the method are dumped into the cold call block.
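At the source level, the promoted virtual callsite roughly takes the shape below (illustrative C++ model, not BOLT output; B_vtable/B_foo stand in for the hot target):
using Method = void (*)(void *);
struct VTable { Method foo; };

extern VTable B_vtable;          // vtable of the hot target class
void B_foo(void *Obj);           // hot target method

void promotedCall(void *Obj, const VTable *VPtr) {
  if (VPtr == &B_vtable)         // compare against the vtable, not the method pointer
    B_foo(Obj);                  // hot path: direct call, no method-pointer load
  else
    VPtr->foo(Obj);              // cold path: method load + original indirect call
}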
For jump tables, the basic idea is the same. I use the memory profiling data to find the hottest slots in the jumptable and then use that information to compute the indices of the hottest entries. We can then compare the index register to the hot index values and avoid the load from the jump table.
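The jump-table variant has a similar shape (again illustrative):
void promotedJumpTableDispatch(unsigned Index,
                               void (*const *Table)(),  // original jump table
                               unsigned HotIndex,       // hottest index from memory profile
                               void (*HotTarget)()) {   // entry stored at that index
  if (Index == HotIndex)
    HotTarget();                 // hot path avoids the load from the jump table
  else
    Table[Index]();              // cold path still goes through the table
}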
Note: I'm assuming the whole call is in a single BB. According to @rafaelauler, this isn't always the case on ARM. This also isn't always the case on X86 either. If there are non-trivial arguments that are passed by value, there could be branches in between the setup and the call. I'm going to leave fixing this until later since it makes things a bit more complicated.
I've also fixed a bug where ICP was introducing a conditional tail call. I made sure that SCTC fixes these up afterwards. I have no idea why I made it introduce a CTC in the first place.
(cherry picked from FBD6120768)
Summary:
Enhance the basic infrastructure for relocation mode for
AArch64 to make a reasonably large program work after reordering (gcc).
Detect jump table patterns and skip optimizing functions with jump
tables in AArch64, as those will require extra future effort to fully
decode. To make these work in relocation mode, we skip changing
the function body and introduce a mode to preserve even the original
nops. By not changing any local offsets in the function, the input
original jump tables should just work.
Functions with no jump tables are optimized with BB reordering. No other
optimizations have been tested.
(cherry picked from FBD6130117)
Summary:
As we deal with incomplete addresses in address-computing
sequences of code on AArch64, we found they are easier to handle in
relocation mode, where relocations are available.
Incomplete addresses may mislead BOLT into thinking there are
instructions referring to a basic block when, in fact, this may be the
base address of a data reference. If the relocation is present, we can
easily spot such cases.
This diff contains extensions in relocation mode to understand and
deal with AArch64 relocations. It also adds code to process data inside
functions as marked by AArch64 ABI (symbol table entries named "$d").
In our code, this is called constant islands handling. Last, it extends
bughunter with a "cross" mode, in which the host generates the binaries
and the user tests them (uploading them to the target), which is useful
when debugging on AArch64.
(cherry picked from FBD6024570)
Summary:
Add functionality to support reordering bzip2 compiled to
AArch64, with function splitting but without relocations:
* Expand the AArch64 backend to support inverting branches and
analyzing branches so BOLT reordering machinery is able to shuffle
blocks and fix branches correctly;
* Add a new pass named LongJmp to add stubs whenever code needs to
jump to the cold area, when using function splitting, because of the
limited target encoding capability in AArch64 (as a RISC architecture).
(cherry picked from FBD5748184)
Summary:
Add support for reading value profiling info from perf data. This diff adds support in DataReader/DataAggregator for value profiling data. Each event is recorded as two Locations (a PC and an address/value) and a count.
For now, I'm assuming that the value profiling data is in the same file as the usual BOLT profiling data. Collecting both at the same time seems to work.
(cherry picked from FBD6076877)
Summary:
Exception tables for PIC may contain indirect type references
that are also encoded using relative addresses.
This diff adds support for such encodings. We read PIC-style
type info table, and write it using new encoding.
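A sketch of how such an entry is resolved (illustrative helper; the point is that the table stores a self-relative offset to a slot holding the type info pointer rather than the pointer itself):
#include <cstdint>

uint64_t readPointerAt(uint64_t Address);   // hypothetical memory accessor

uint64_t resolvePICTypeEntry(uint64_t EntryAddress, int64_t RelativeValue) {
  uint64_t SlotAddress = EntryAddress + RelativeValue;  // pc-relative offset
  return readPointerAt(SlotAddress);                    // indirect: load the actual pointer
}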
(cherry picked from FBD5716060)
Summary:
Rewrote the guts of buildCallGraph. There are two new options to control how the CG is created. UsePerfData controls whether we use the perf data directly to construct the CG for functions with a stale profile. IgnoreRecursiveCalls omits recursive calls from the CG since they might be skewing results unfairly for heavily recursive functions.
I've changed the way BinaryFunction::estimateHotSize() works. If the function is marked as split, I count the size of all the non-cold blocks. This gives a different but more accurate answer than the old method.
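Roughly, for the split case (illustrative types, not the actual BinaryFunction interface):
#include <cstdint>
#include <vector>

struct Block { uint64_t EstimatedSize; bool IsCold; };

uint64_t estimateHotSizeWhenSplit(const std::vector<Block> &Blocks) {
  uint64_t Size = 0;
  for (const Block &B : Blocks)
    if (!B.IsCold)
      Size += B.EstimatedSize;   // count every non-cold block of a split function
  return Size;
}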
I've improved and updated the CG build stats with extra information.
(cherry picked from FBD5224183)
Summary:
Add an implementation for shrink wrapping, a frame optimization
that moves callee-saved register spills from hot prologues to cold
successors.
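A tiny illustration of the pattern this targets (illustrative C++; heavy() is a hypothetical out-of-line callee):
int heavy(int X);                // hypothetical callee that forces register pressure

int maybeHeavy(int X) {
  if (X == 0)                    // hot early exit: no callee-saved registers needed
    return 0;
  return heavy(X);               // cold successor: spills/restores are sunk here
}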
(cherry picked from FBD4983706)
Summary:
Multiple improvements to debug info handling:
* Add support for relocation mode.
* Speed-up processing.
* Reduce memory consumption.
* Bug fixes.
The high-level idea behind the new debug handling is that we don't save
intermediate state for ranges and location lists. Instead we depend
on function and basic block address transformations to update the info
as a final post-processing step.
For HHVM in non-relocation mode the peak memory went down from 55GB to 35GB. Processing time went from over 6 minutes to under 5 minutes.
(cherry picked from FBD5113431)
Summary:
Each BOLT-specific option now belongs to BoltCategory or BoltOptCategory.
Use alphabetical order for options in source code (does not affect
output).
The result is a cleaner output of "llvm-bolt -help" which does not
include any unrelated llvm options and is close to the following:
.....
BOLT generic options:
-data=<string> - <data file>
-dyno-stats - print execution info based on profile
-hot-text - hot text symbols support (relocation mode)
-o=<string> - <output file>
-relocs - relocation mode - use relocations to move functions in the binary
-update-debug-sections - update DWARF debug sections of the executable
-use-gnu-stack - use GNU_STACK program header for new segment (workaround for issues with strip/objcopy)
-use-old-text - re-use space in old .text if possible (relocation mode)
-v=<uint> - set verbosity level for diagnostic output
BOLT optimization options:
-align-blocks - try to align BBs inserting nops
-align-functions=<uint> - align functions at a given value (relocation mode)
-align-functions-max-bytes=<uint> - maximum number of bytes to use to align functions
-boost-macroops - try to boost macro-op fusions by avoiding the cache-line boundary
-eliminate-unreachable - eliminate unreachable code
-frame-opt - optimize stack frame accesses
......
(cherry picked from FBD4793684)
Summary:
Fix inconsistent override keyword usage and initialize a
missing field of a Relocation object when using braced initializers.
(cherry picked from FBD4622856)
Summary:
In a previous diff I added an option to update jump tables in-place (on by default)
and accidentally broke the default handling of jump tables in relocation
mode. The update should be happening semi-automatically, but because
we ignore relocations for jump tables it wasn't happening (derp).
Since we mostly use '-jump-tables=move' this hasn't been noticed for
some time.
This diff gets rid of IgnoredRelocations and removes relocations
from a relocation set when they are no longer needed. If relocations
are created later for jump tables they are no longer ignored.
(cherry picked from FBD4595159)
Summary:
Perform indirect call promotion optimization in BOLT.
The code scans the instructions during CFG creation for all
indirect calls. Right now indirect tail calls are not handled
since the functions are marked not simple. The offsets of the
indirect calls are stored for later use by the ICP pass.
The indirect call promotion pass visits each indirect call and
examines the BranchData for each. If the most frequent targets
from that callsite exceed the specified threshold (default 90%),
the call is promoted. Otherwise, it is ignored. By default,
only one target is considered at each callsite.
When a candidate callsite is processed, we modify the callsite
to test for the most common call targets before calling through
the original generic call mechanism.
The CFG and layout are modified by ICP.
A few new command line options have been added:
-indirect-call-promotion
-indirect-call-promotion-threshold=<percentage>
-indirect-call-promotion-topn=<int>
The threshold is the minimum frequency of a call target needed
before ICP is triggered.
The topn option controls the number of targets to consider for
each callsite, e.g. ICP is triggered if topn=2 and the total
frequency of the top two call targets exceeds the threshold.
Example of ICP:
C++ code:
int B_count = 0;
int C_count = 0;
struct A { virtual void foo() = 0; };
struct B : public A { virtual void foo() { ++B_count; }; };
struct C : public A { virtual void foo() { ++C_count; }; };
A* a = ...
a->foo();
...
original:
400863: 49 8b 07 mov (%r15),%rax
400866: 4c 89 ff mov %r15,%rdi
400869: ff 10 callq *(%rax)
40086b: 41 83 e6 01 and $0x1,%r14d
40086f: 4d 89 e6 mov %r12,%r14
400872: 4c 0f 44 f5 cmove %rbp,%r14
400876: 4c 89 f7 mov %r14,%rdi
...
after ICP:
40085e: 49 8b 07 mov (%r15),%rax
400861: 4c 89 ff mov %r15,%rdi
400864: 49 ba e0 0b 40 00 00 movabs $0x400be0,%r10
40086b: 00 00 00
40086e: 4c 3b 10 cmp (%rax),%r10
400871: 75 29 jne 40089c <main+0x9c>
400873: 41 ff d2 callq *%r10
400876: 41 83 e6 01 and $0x1,%r14d
40087a: 4d 89 e6 mov %r12,%r14
40087d: 4c 0f 44 f5 cmove %rbp,%r14
400881: 4c 89 f7 mov %r14,%rdi
...
40089c: ff 10 callq *(%rax)
40089e: eb d6 jmp 400876 <main+0x76>
(cherry picked from FBD3612218)
Summary:
In non-relocation mode, when we run ICF the second time,
we fold the same functions again since they were not
removed from the function set. This diff marks them as
folded and ignores them during ICF optimization. Note
that we still want to optimize such functions since they
are potentially called from the code not covered by BOLT
in non-relocation mode.
Folded functions are also excluded from dyno stats with
this diff.
Also print the number of times folded functions were called.
When 2 functions - f1() and f2() are folded, that number
would be min(call_frequency(f1), call_frequency(f2)).
(cherry picked from FBD4399993)
Summary:
Re-worked the way ICF operates. The pass now checks for more than just
call instructions, but also for all references including function
pointers. Jump tables are handled too.
(cherry picked from FBD4372491)
Summary:
Modified function discovery process to tolerate more functions and
symbols coming from assembly. The processing order now matches
the memory order of the functions (input symbol table is unsorted).
Added basic support for functions with multiple entries. When
a function references its internal address other than with
a branch instruction, that address could potentially escape.
We mark such addresses as entry points and make sure they
are treated as roots by unreachable code elimination.
Without relocations we have to mark multiple-entry functions
as non-simple.
(cherry picked from FBD3950243)
Summary:
Add level for "-jump-tables=<n>" option:
1 - all jump tables are output in the same section (default).
2 - basic splitting: if a table is used, it is output to the hot
section, otherwise to the cold one.
3 - aggressively split compound jump tables and collect profile for
all entries.
Option "-print-jump-tables" outputs all jump tables for debugging
and/or analyzing purposes. Use with "-jump-tables=3" to get profile
values for every entry in a jump table.
(cherry picked from FBD3912119)
Summary:
Option "-jump-tables=1" enables experimental support for jump tables.
The option hasn't been tested with optimizations other than block
re-ordering.
Only non-PIC jump tables are supported at the moment.
(cherry picked from FBD3867849)
Summary:
For now we make SCTC a special pass that runs at the end of all
optimizations and transformations right after fixupBranches().
Since it's the last pass, it has to do its own UCE.
(cherry picked from FBD3838051)
Summary:
A number of fixes/enhancements to inline-small-functions
- Fixed estimateHotSize to use computeCodeSize instead of the original layout offsets.
- Added -print-inline option to dump CFGs for functions that have been modified by inlining.
- Added a flag to force consideration of functions without any profiling info (mostly for testing).
- Updated debug line info for inlined functions.
- Ignore the number of pseudo instructions when checking for candidates of suitable size.
Misc changes
- Moved most print flags to BinaryPasses.cpp
(cherry picked from FBD3812658)
Summary:
Analyze indirect branches and convert them into indirect
tail calls when possible. We analyze the memory contents
when the address could be calculated statically and also
detect epilogue code.
(cherry picked from FBD3754395)
Summary:
LLVM was missing an assembler print string for indirect tail
calls, which are synthetic instructions created by us.
(cherry picked from FBD3640197)
Summary:
I've factored out the instruction printing and size computation routines to
methods on BinaryContext. I've also added some more debug print functions.
This was split off the ICP diff to simplify it a bit.
(cherry picked from FBD3610690)
Summary:
Instructions that load data from a read-only data section, where the
target address can be computed statically (e.g. RIP-relative addressing),
are modified into corresponding instructions that use immediate operands.
We apply the transformation only when the resulting instruction will have
smaller or equal size.
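A sketch of the guard on this rewrite (illustrative types, not BOLT's internal representation):
#include <cstdint>
#include <optional>

struct Inst { unsigned EncodedSize; /* opcode, operands, ... */ };

// Fold the load into a mov-immediate only when the loaded value is
// statically known and the new encoding does not grow the code.
std::optional<Inst> tryFoldLoadToImm(const Inst &Load,
                                     std::optional<uint64_t> KnownValue,
                                     unsigned ImmEncodedSize) {
  if (!KnownValue)
    return std::nullopt;                 // target bytes not statically computable
  if (ImmEncodedSize > Load.EncodedSize)
    return std::nullopt;                 // resulting instruction would be larger
  return Inst{ImmEncodedSize};           // build the immediate-operand form
}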
(cherry picked from FBD3397112)
Summary:
Assembly functions could have no corresponding DW_AT_subprogram
entries, yet they are represented in module ranges (and .debug_aranges)
and will have line number information. Make sure we update those.
Eliminated unnecessary data structures and optimized some passes.
For .debug_loc unused location entries are no longer processed
resulting in smaller output files.
Overall it's a small processing-time and memory improvement.
(cherry picked from FBD3362540)
Summary:
* Fix several cases for handling debug info:
- properly update CU DW_AT_ranges for functions whose bodies were folded
due to the ICF optimization
- convert ranges to DW_AT_ranges from hi/low PC for all DIEs
- add support for [a, a) range
- update CU ranges even when there are no functions registered
* Overwrite .debug_ranges section instead of appending.
* Convert assertions in debug info handling part into warnings.
(cherry picked from FBD3339383)
Summary:
Update address ranges of inlined functions and try/catch blocks.
This was missing and led gdb to show weird information in a core dump we inspected,
because of several nested inline frames in the call stack.
This is very similar to Lexical Blocks, so the change is to basically generalize that
code to do the same for DW_AT_try_block, DW_AT_catch_block and DW_AT_inlined_subroutine.
(cherry picked from FBD3169417)
Summary:
readelf was showing some errors because we weren't updating DIEs that were not shallow
in the DIE tree, or DIEs of functions with addresses we don't recognize (mostly functions with
address 0, which could have been removed by the Linker Script but still have debugging information
there). These DIEs need to be updated because their abbreviations are patched.
(cherry picked from FBD3159335)
Summary:
We were updating only one DIE per function, but because the Linker Script may map
multiple functions to the same address this would cause us to generate invalid debug info
(as some DIEs weren't updated but their abbreviations were changed).
(cherry picked from FBD3157263)
Summary:
Update DWARF location lists in .debug_loc and pointers to
them in .debug_info so that gdb can print variables which change
location during their lifetime.
The following changes were made:
- Refactored BasicBlockOffsetRanges to allow ranges to be tied to binary information (so that we can reuse it for location lists)
- Implemented range compression optimization in BasicBlockOffsetRanges (needed otherwise too much data was being generated).
- Added representation for location lists (LocationList.h, BinaryContext.h)
- Implemented .debug_loc serializer that keeps the updated offsets (DebugLocWriter.{h,cpp})
- After disassembly, traverse entries in .debug_loc and save them in context (BinaryContext.cpp)
- After optimizations, serialize .debug_loc and update pointers in .debug_info (RewriteInstance.cpp)
(cherry picked from FBD3130682)
Summary:
Updates DWARF lexical blocks address ranges in the output binary after optimizations.
This is similar to updating function address ranges except that the ranges representation needs
to be more general, since address ranges can begin or end in the middle of a basic block.
The following changes were made:
- Added a data structure for iterating over the basic blocks that intersect an address range: BasicBlockTable.h
- Added some more bookkeeping in BinaryBasicBlock. Basically, I needed to keep track of the block's size in the input binary as well as its address in the output binary. This information is mostly set by BinaryFunction after disassembly.
- Added a representation for address ranges relative to basic blocks (BasicBlockOffsetRanges.h). Will also serve for location lists.
- Added a representation for Lexical Blocks (LexicalBlock.h)
- Small refactorings in DebugArangesWriter:
-- Renamed to DebugRangesSectionsWriter since it also writes .debug_ranges
-- Refactored it not to depend on BinaryFunction but instead on anything that can be assigned an offset in .debug_ranges (added an interface for that)
- Iterate over the DIE tree during initialization to find lexical blocks in .debug_info (BinaryContext.cpp)
- Added patches to .debug_abbrev and .debug_info in RewriteInstance to update lexical blocks attributes (in fact, this part is very similar to what was done to function address ranges and I just refactored/reused that code)
- Added small test case (lexical_blocks_address_ranges_debug.test)
(cherry picked from FBD3113181)
Summary:
[WIP] Update DWARF info for function address ranges.
This diff currently does not work for unknown reasons,
but I'm describing here what's the current state.
According to both llvm-dwarf and readelf our output seems correct,
but GDB does not interpret it as expected. All details go below in
hope I missed something.
I couldn't actually track the whole change that introduced support for
what we need in gdb yet, but I think I can get to it
(2007-12-04: Support
lexical blocks and function bodies that occupy non-contiguous address
ranges). I have reasons to believe gdb supported this at some point.
The set of introduced changes was basically this:
- After disassembly, iterate over the DIEs in .debug_info and find the
ones that correspond to each BinaryFunction.
- Refactor DebugArangesWriter to also write addresses of functions to
.debug_ranges and track the offsets of function address ranges there
- Add some infrastructure to facilitate patching the binary in
simple ways (BinaryPatcher.h)
- In RewriteInstance, after writing .debug_ranges already with
function address ranges, for each function do:
-- Find the abbreviation corresponding to the function
-- Patch .debug_abbrev to replace DW_AT_low_pc with DW_AT_ranges and
DW_AT_high_pc with DW_AT_producer (I'll explain this hack below).
Also patch the corresponding forms to DW_FORM_sec_offset and
DW_FORM_string (null-terminated in-place string).
-- Patch debug_info with the .debug_ranges offset in place of
the first 4 bytes of DW_AT_low_pc (DW_AT_ranges only occupies 4
bytes whereas low_pc occupies 8), and write an arbitrary string
in-place in the other 12 bytes that were the 4 MSB of low_pc
and the 8 bytes of high_pc before the patch. This depends on
low_pc and high_pc being put consecutively by the compiler, but
it serves to validate the idea. I tried another way of doing it
that does not rely on this but it didn't work either and I believe
the reason for either not working is the same (and still unknown,
but unrelated to them. I might be wrong though, and if I find yet
another way of doing it I may try it). The other way was to
use a form of DW_FORM_data8 for the section offset. This is
disallowed by the specification, but I doubt gdb validates this,
as it's just easier to store it as 64-bit anyway as this is even
necessary to support 64-bit DWARF (which is not what gcc generates
by default apparently).
I still need to make changes to the diff to make it production-ready,
but first I want to figure out why it doesn't work as expected.
By looking at the output of llvm-dwarfdump or readelf, all of
.debug_ranges, .debug_abbrev and .debug_info seem to have been
correctly updated. However, gdb seems to have serious problems with
what we write.
(In fact, readelf --debug-dump=Ranges shows some funny warning messages
of the form ("Warning: There is a hole [0x100 - 0x120] in .debug_ranges"),
but I played around with this and it seems it's just because no
compile unit was using these ranges. Changing .debug_info apparently
changes these warnings, so they seem to be unrelated to the section
itself. Also looking at the hex dump of the section doesn't help,
as everything seems fine. llvm-dwarfdump doesn't say anything.
So I think .debug_ranges is fine.)
The result is that gdb not only doesn't show the function name as we
wanted, but it also stops showing line number information.
Apparently it's not reading/interpreting the address ranges at all,
and so the functions now have no associated address ranges, only the
symbol value which allows one to put a breakpoint in the function,
but not to show source code.
As this left me without more ideas of what to try to feed gdb with,
I believe the most promising next trial is to try to debug gdb itself,
unless someone spots anything I missed.
I found where the interesting part of the code lies for this
case (gdb/dwarf2read.c and some other related files, but mainly that one).
It seems in some parts gdb uses DW_AT_ranges for only getting
its lowest and highest addresses and setting that as low_pc and
high_pc (see dwarf2_get_pc_bounds in gdb's code and where it's called).
I really hope this is not actually the case for
function address ranges. I'll investigate this further. Otherwise
I don't think any changes we make will make it work as initially
intended, as we'll simply need gdb to support it and in that case it
doesn't.
(cherry picked from FBD3073641)
Summary:
Read information from the DWARF .debug_line section using LLVM and
tie every MCInst to one line of a line table from the input binary. Subsequent
diffs will update this information to match the final binary layout and
output updated line tables.
(cherry picked from FBD2989813)