Summary:
SCTC was incorrectly swapping BranchInfo when reversing the branch condition. This was wrong because when we remove the successor BB later, the BranchInfo for that BB is removed with it, and in this case the successor would be the BB holding the stats we had just swapped.
Instead, leave BranchInfo as it is and read the branch count from the false or true branch, depending on whether we reverse or replace the branch, respectively. The later call to removeSuccessor will remove the unused BranchInfo we no longer care about.
(cherry picked from FBD6876799)
Summary: Register all sections with BinaryContext. Store all sections in a set ordered by (address, size, name). Add two separate maps to look up sections by address or by name. Non-allocatable sections are not stored in the address->section map since they all "start" at 0.
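For illustration, a minimal sketch of this registration scheme, using hypothetical stand-in types rather than the actual BinaryContext/BinarySection API:

#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <tuple>

// Hypothetical stand-in for the registered section data.
struct SectionInfo {
  uint64_t Address;
  uint64_t Size;
  std::string Name;
  bool IsAllocatable;
};

// Order sections by (address, size, name).
struct SectionLess {
  bool operator()(const SectionInfo &A, const SectionInfo &B) const {
    return std::tie(A.Address, A.Size, A.Name) <
           std::tie(B.Address, B.Size, B.Name);
  }
};

struct SectionRegistry {
  std::set<SectionInfo, SectionLess> Sections;
  std::map<uint64_t, const SectionInfo *> AddressToSection;
  std::map<std::string, const SectionInfo *> NameToSection;

  void registerSection(const SectionInfo &SI) {
    const SectionInfo &Stored = *Sections.insert(SI).first;
    // Non-allocatable sections all "start" at 0, so they are only
    // reachable through the name map.
    if (Stored.IsAllocatable)
      AddressToSection[Stored.Address] = &Stored;
    NameToSection[Stored.Name] = &Stored;
  }
};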
(cherry picked from FBD6862973)
Summary:
Handle types CU list in `updateGdbIndexSection`.
It looks like the types part of `.gdb_index` isn't empty when `-fdebug-types-section` is used. So instead of aborting, we copy that part to the new `.gdb_index` section.
(cherry picked from FBD6770460)
Summary:
When we read profile for functions, we initialize counts for entry
blocks first, and then populate counts for all blocks based
on incoming edges.
During the second phase we ignore the entry blocks because we expect
them to be already initialized. For the primary entry at offset 0 it's
the correct thing to do, since we treat all incoming branches as calls
or tail calls. However, for secondary entries we only consider external
edges to be from calls and don't increase entry count if an edge
originates from inside the function. Thus we need to update the
secondary entry basic block counts with internal edges too.
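A minimal sketch of the counting rule described above, with simplified stand-in types (not the actual BinaryFunction data structures):

#include <cstdint>
#include <vector>

struct Block {
  bool IsPrimaryEntry = false;   // entry at offset 0
  bool IsSecondaryEntry = false; // other entry points
  uint64_t Count = 0;            // execution count being populated
};

struct Edge {
  Block *From;   // nullptr marks an external edge (call or tail call)
  Block *To;
  uint64_t Count;
};

void populateCounts(const std::vector<Edge> &Edges) {
  for (const Edge &E : Edges) {
    if (E.To->IsPrimaryEntry)
      continue; // already initialized from calls and tail calls
    if (E.To->IsSecondaryEntry && E.From == nullptr)
      continue; // external edges were accounted for during initialization
    // Regular blocks and secondary entries accumulate internal edges.
    E.To->Count += E.Count;
  }
}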
(cherry picked from FBD6836817)
Summary:
A test was triggering an assertion on impossible addresses coming from
perf.data instead of having them reported as bad data. Fix this behavior.
(cherry picked from FBD6835590)
Summary:
Speeding up cache+ algorithm.
The idea is to find and merge "fallthrough" successors before the main
optimization. For a pair of blocks, A and B, block B is the fallthrough
successor of A if (i) all jumps (based on the profile) from A go to B
and (ii) all jumps to B are from A.
Such blocks should be adjacent in an optimal ordering and should
not be considered for splitting. (This is what gives the speed-up.)
The runtime gap between cache and cache+ was reduced from ~2m to ~1m.
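A minimal sketch of the fallthrough test described above, with illustrative data structures (not the actual implementation):

#include <cstdint>
#include <map>
#include <utility>

using BlockID = unsigned;
using JumpCounts = std::map<std::pair<BlockID, BlockID>, uint64_t>;
using BlockCounts = std::map<BlockID, uint64_t>;

// B is the fallthrough successor of A if all profiled jumps out of A go to B
// and all profiled jumps into B come from A; such a pair is merged up front
// and never split.
bool isFallthroughPair(BlockID A, BlockID B, const JumpCounts &Jumps,
                       const BlockCounts &TotalOut, const BlockCounts &TotalIn) {
  auto It = Jumps.find({A, B});
  if (It == Jumps.end() || It->second == 0)
    return false;
  return It->second == TotalOut.at(A) && It->second == TotalIn.at(B);
}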
(cherry picked from FBD6799900)
Summary:
Refactor the relocation analysis code. It should be a little better at validating
that the relocation value matches up with the symbol address + addend stored in the
relocation (except on AArch64). It is also a little better at finding the symbol
address used to do the lookup in BinaryContext, rather than just using symbol
address + addend.
(cherry picked from FBD6814702)
Summary: Add BinarySection class that is a wrapper around SectionRef. This is refactoring work for static data reordering.
(cherry picked from FBD6792785)
Summary:
Rewrite how data/code markers are interpreted, so the code
can have constant islands essentially anywhere. This is necessary to
accommodate custom AArch64 assembly code coming from mozjpeg. Allow
any function to refer to the constant island owned by any other
function. When this happens, we pull the constant island from the
referred function and emit it as our own, so it will live near
the code that refers to it, allowing us to freely reorder functions
and code pieces. Make BOLT stricter about not changing anything
in non-simple ARM functions, as we need to preserve offsets in
functions whose jump tables we don't interpret (currently
any ARM function with jump tables is non-simple and is left
untouched).
(cherry picked from FBD6402324)
Summary:
A new profile that is more resilient to minor binary modifications.
BranchData is eliminated. For calls, the data is converted into instruction
annotations if the profile matches a function. If a profile cannot be matched,
AllCallSites data should have call site profiles.
The new profile format is YAML, which is quite verbose. It still takes
less space than the older format because we avoid function name repetition.
The plan is to get rid of the old profile format eventually.
merge-fdata does not work with the new format yet.
(cherry picked from FBD6753747)
Summary:
Add a few new relocation types to support a wider variety of
binaries, add support for constant island duplication (so we can split
functions in large binaries) and make LongJmp pass really precise with
respect to layout, so we don't miss stubs insertions at the correct
places for really large binaries. In LongJmp, introduce "freeze"
annotations so fixBranches won't mess up the jumps we carefully
determined needed a stub.
(cherry picked from FBD6294390)
Summary:
A new block reordering algorithm, cache+, that is designed to optimize
i-cache performance.
On a high level, this algorithm is a greedy heuristic that merges
clusters (ordered sequences) of basic blocks, similarly to how it is
done in OptimizeCacheReorderAlgorithm. There are two important
differences: (a) the metric that is optimized in the procedure, and
(b) how two clusters are merged together.
Initially all clusters are isolated basic blocks. On every iteration,
we pick a pair of clusters whose merging yields the biggest increase
in the ExtTSP metric (see CacheMetrics.cpp for exact implementation),
which models how i-cache "friendly" a specific cluster is. A pair of
clusters giving the maximum gain is merged into a new cluster. The
procedure stops when there is only one cluster left, or when merging
does not increase ExtTSP. In the latter case, the remaining clusters
are sorted by density.
An important aspect is the way two clusters are merged. Unlike earlier
algorithms (e.g., OptimizeCacheReorderAlgorithm or Pettis-Hansen), two
clusters, X and Y, are first split into three, X1, X2, and Y. Then we
consider all possible ways of gluing the three clusters (e.g., X1YX2,
X1X2Y, X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the
largest score. This improves the quality of the final result (the
search space is larger) while keeping the implementation sufficiently
fast.
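For illustration, a sketch of the merge step, assuming the ExtTSP score is supplied as a callback (the real computation lives in CacheMetrics.cpp):

#include <cstddef>
#include <functional>
#include <vector>

// A cluster is an ordered sequence of basic block ids.
using Cluster = std::vector<unsigned>;
using ScoreFn = std::function<double(const Cluster &)>;

// Try every split of X into X1/X2 and every ordering of {X1, X2, Y}, and
// return the concatenation with the highest score.
Cluster bestMerge(const Cluster &X, const Cluster &Y, const ScoreFn &Score) {
  Cluster Best;
  double BestScore = -1.0;
  for (std::size_t Split = 0; Split <= X.size(); ++Split) {
    const Cluster X1(X.begin(), X.begin() + Split);
    const Cluster X2(X.begin() + Split, X.end());
    const std::vector<std::vector<const Cluster *>> Orders = {
        {&X1, &Y, &X2}, {&X1, &X2, &Y}, {&X2, &X1, &Y},
        {&X2, &Y, &X1}, {&Y, &X1, &X2}, {&Y, &X2, &X1}};
    for (const auto &Order : Orders) {
      Cluster Candidate;
      for (const Cluster *Part : Order)
        Candidate.insert(Candidate.end(), Part->begin(), Part->end());
      const double CandidateScore = Score(Candidate);
      if (CandidateScore > BestScore) {
        BestScore = CandidateScore;
        Best = Candidate;
      }
    }
  }
  return Best;
}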
(cherry picked from FBD6466264)
Summary:
Do not assign a LP to tail calls. They are not calls in the
view of an unwinder, they are just regular branches. We were hitting an
assertion in BinaryFunction::removeConditionalTailCalls() complaining
about landing pads in a CTC; however, it was in fact a
__builtin_unreachable() call being conservatively treated as a CTC.
(cherry picked from FBD6564957)
Summary:
The pass was previously copying data that would change after layout
because it had a relocation at the copied address.
(cherry picked from FBD6541334)
Summary:
Profile reading was tightly coupled with building CFG. Since I plan
to move to a new profile format that will be associated with CFG
it is critical to decouple the two phases.
We now read the profile right after the CFG is constructed, but
before it is "canonicalized", i.e. CTCs will still be there.
After reading the profile, we do a post-processing pass that fixes
the CFG and does some post-processing for debug info, such as
inference of fall-throughs, which is still required with the current
format.
Another good reason for decoupling is that we can use profile with
CFG to more accurately record fall-through branches during
aggregation.
At the moment we use "Offset" annotations to facilitate location
of instructions corresponding to the profile. This might not be
super efficient. However, once we switch to the new profile format
the offsets will no longer be needed. We might keep them for
the aggregator, but if we have to trust LBR data that might
not be strictly necessary.
I've tried to make changes while keeping backwards compatibility. This
makes it easier to verify correctness of the changes, but it also means
that we lose accuracy of the profile.
Some refactoring is included.
Flag "-prof-compat-mode" (on by default) is used for bug-level
backwards compatibility. Disable it for more accurate tracing.
(cherry picked from FBD6506156)
Summary:
If relocations are available in the binary, use them by default.
If "-relocs" is specified, then require relocations for further
processing. Use "-relocs=0" to forcefully ignore relocations.
Instead of `opts::Relocs` use `BinaryContext::HasRelocations` to check
for the presence of the relocations.
(cherry picked from FBD6530023)
Summary:
The list of landing pads in BinaryBasicBlock was sorted by their address
in memory. As a result, the DFS order was not always deterministic.
The change is to store landing pads in the order they appear in invoke
instructions while keeping them unique.
Also, add Throwers verification to validateCFG().
(cherry picked from FBD6529032)
Summary:
Some helpful options:
-print-dyno-stats-only
    while printing functions, output dyno-stats and skip instructions
-report-stale
    print a list of functions with a stale profile
(cherry picked from FBD6505141)
Summary:
Add a pass to rebalance the usage of REX prefixes, moving them
from the hot code path to the cold path whenever possible. To do this, we
rank the usage frequency of each register and exchange an X86 classic reg
with an extended one (which requires a REX prefix) whenever the classic
register is used fewer times than the extended one. There are two
versions of this pass: the regular one only considers RBX as classic and
R12-R15 as extended registers because those are callee-saved, which means
their scope is local to the function and they can therefore be easily
interchanged within the function without further consequences. The
aggressive version relies on liveness analysis to detect if the value of
a register is being used as a caller-saved value (written to without
being read first), which is also eligible for reallocation. However, it
showed limited results and is not the default option because it is
expensive.
Currently, this pass does not update debug info. This means that if a
substitution is made, the AT_LOCATION of a variable inside a function may
be outdated and GDB will display the wrong value if you ask it to print
the value of the affected variable. Updating DWARF involves a painful
task of writing a new DWARF expression parser/writer similar to the one
we already have for CFI expressions. I'll defer the task of writing this
until we determine whether this optimization will be enabled in production.
For now, it is experimental, to be combined with other optimizations to help
us find a new set of optimizations that is beneficial.
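A toy sketch of the ranking idea behind the regular version, assuming a simple register-name -> hot-use-count map (not the actual pass, which works on MCInsts and physical register numbers):

#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Pick a profitable RBX <-> R12-R15 swap, or an empty pair if none exists.
std::pair<std::string, std::string>
pickSwap(const std::map<std::string, uint64_t> &HotUseCount) {
  auto countOf = [&](const std::string &Reg) -> uint64_t {
    auto It = HotUseCount.find(Reg);
    return It == HotUseCount.end() ? 0 : It->second;
  };
  std::string Hottest;
  uint64_t HottestCount = 0;
  for (const char *Reg : {"R12", "R13", "R14", "R15"}) {
    if (countOf(Reg) > HottestCount) {
      HottestCount = countOf(Reg);
      Hottest = Reg;
    }
  }
  // Only swap when the extended register is hotter than the classic one:
  // its hot uses lose the REX prefix, while RBX's colder uses gain it.
  if (!Hottest.empty() && HottestCount > countOf("RBX"))
    return {"RBX", Hottest};
  return {};
}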
(cherry picked from FBD6476659)
Summary: Load elimination for ICP wasn't handling nested jump tables correctly. It wasn't offsetting the indices by the range of the nested table. I also wasn't computing some of the ICP stats correctly in all cases, which was leading to weird results in the stats.
(cherry picked from FBD6453693)
Summary:
The diff introduces two measures for i-cache performance: a TSP measure (currently used for optimization) and an "extended" TSP measure that takes into account jumps between non-consecutive basic blocks. The two measures are computed for the estimated addresses/sizes of basic blocks and for the actually emitted addresses/sizes.
Intuitively, the Extended-TSP metric quantifies the expected number of i-cache misses for a given ordering of basic blocks. It has 5 parameters:
- FallthroughWeight is the impact of fallthrough jumps on the score
- ForwardWeight is the impact of forward (but not fallthrough) jumps
- BackwardWeight is the impact of backward jumps
- ForwardDistance is the max distance of a forward jump affecting the score
- BackwardDistance is the max distance of a backward jump affecting the score
We're still learning the "best" values for the options, but the default values look reasonable so far.
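To illustrate how the five parameters could enter the score, here is a sketch of the contribution of a single jump, assuming a simple linear decay with distance; the exact formula and the default values live in CacheMetrics.cpp:

#include <cstdint>

struct ExtTSPParams {
  double FallthroughWeight;
  double ForwardWeight;
  double BackwardWeight;
  uint64_t ForwardDistance;  // max forward jump distance that still scores
  uint64_t BackwardDistance; // max backward jump distance that still scores
};

// Contribution of a single jump taken Count times, from a block at SrcAddr
// (of size SrcSize) to a block at DstAddr in the current layout.
double jumpScore(uint64_t SrcAddr, uint64_t SrcSize, uint64_t DstAddr,
                 uint64_t Count, const ExtTSPParams &P) {
  const uint64_t SrcEnd = SrcAddr + SrcSize;
  if (DstAddr == SrcEnd) // fallthrough: destination immediately follows
    return P.FallthroughWeight * Count;
  if (DstAddr > SrcEnd) { // forward jump, decays with distance
    const uint64_t Dist = DstAddr - SrcEnd;
    return Dist <= P.ForwardDistance
               ? P.ForwardWeight * Count * (1.0 - double(Dist) / P.ForwardDistance)
               : 0.0;
  }
  const uint64_t Dist = SrcEnd - DstAddr; // backward jump, decays with distance
  return Dist <= P.BackwardDistance
             ? P.BackwardWeight * Count * (1.0 - double(Dist) / P.BackwardDistance)
             : 0.0;
}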
(cherry picked from FBD6331418)
Summary:
Add a pass to identify indirect jumps to jump tables and reduce
their entries size from 8 to 4 bytes. For PIC jump tables, it will
convert the PIC code to non-PIC (since BOLT only processes static code,
it makes no sense to use expensive PIC-style jumps in static code). Add
corresponding improvements to the register scavenging pass and add MCInst
matcher machinery.
(cherry picked from FBD6421582)
Summary: The arithmetic shortening code on x86 was broken. It would sometimes shorten instructions with immediate operands that wouldn't fit into 8 bits.
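A minimal sketch of the legality check, assuming the shortened forms take a sign-extended 8-bit immediate:

#include <cstdint>

// Shortened arithmetic forms (the imm8 variants) sign-extend an 8-bit
// immediate, so shortening is only legal when the immediate fits that
// range. For example, 127 fits but 128 does not.
bool fitsInSigned8Bits(int64_t Imm) {
  return Imm >= INT8_MIN && Imm <= INT8_MAX;
}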
(cherry picked from FBD6444699)
Summary: The icp-top-callsites option was using basic block counts to pick the top callsites, while the ICP main loop was using branch info from the targets of each call. These numbers do not exactly match up, so there was a discrepancy in computing the top calls. I've switched top callsites over to use the same stats as the main loop. The icp-always-on option was redundant with -icp-top-callsites=100, so I removed it.
(cherry picked from FBD6370977)
Summary: Add timers for non-optimization related phases. There are two new options, -time-build for disassembling functions and building CFGs, and -time-rewrite for phases in executeRewritePass().
(cherry picked from FBD6422006)
Summary:
Previously the perf2bolt aggregator was rejecting traces
finishing with REP RET (return instruction with REP prefix) as a
result of the migration from objdump output to LLVM disassembler,
which decodes REP as a separate instruction. Add code to detect
REP RET and treat it as a single return instruction.
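For reference, the pattern is just the two-byte sequence F3 C3 (a REP prefix followed by a near RET); a byte-level illustration, not the code in this patch:

#include <cstddef>
#include <cstdint>

// A disassembler that treats the prefix as a standalone instruction reports
// two instructions for "REP RET", so the aggregator can recognize the
// pattern at the byte level and treat it as a single return.
bool isRepRet(const uint8_t *Bytes, size_t Size) {
  return Size >= 2 && Bytes[0] == 0xF3 && Bytes[1] == 0xC3;
}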
(cherry picked from FBD6417496)
Summary:
Here's an implementation of an abstract instruction iterator for the branch/call
analysis code in MCInstrAnalysis. I'm posting it up to see what you guys think.
It's a bit sloppy with constness and probably needs more tidying up.
(cherry picked from FBD6244012)
Summary:
Use value profiling data to remove the method pointer loads from vtables when doing ICP at virtual function and jump table callsites.
The basic process is the following:
1. Work backwards from the callsite to find the most recent def of the call register.
2. Work back from the call register def to find the instruction where the vtable is loaded.
3. Find out if there is any value profiling data associated with the vtable load. If so, record all these addresses as potential vtables + method offsets.
4. Since the addresses extracted by #3 will be vtable + method offset, we need to figure out the method offset in order to determine the actual vtable base address. At this point I virtually execute all the instructions that occur between #3 and #2 that touch the method pointer register. The result of this execution should be the method offset.
5. Fetch the actual method address from the appropriate data section containing the vtable using the computed method offset. Make sure that this address maps to an actual function symbol.
6. Try to associate a vtable pointer with each target address in SymTargets. If every target has a vtable, then this is almost certainly a virtual method callsite.
7. Use the vtable address when generating the promoted call code. It's basically the same as regular ICP code except that the compare is against the vtable and not the method pointer. Additionally, the instructions to load up the method are dumped into the cold call block.
For jump tables, the basic idea is the same. I use the memory profiling data to find the hottest slots in the jumptable and then use that information to compute the indices of the hottest entries. We can then compare the index register to the hot index values and avoid the load from the jump table.
Note: I'm assuming the whole call is in a single BB. According to @rafaelauler, this isn't always the case on ARM. This also isn't always the case on X86 either. If there are non-trivial arguments that are passed by value, there could be branches in between the setup and the call. I'm going to leave fixing this until later since it makes things a bit more complicated.
I've also fixed a bug where ICP was introducing a conditional tail call. I made sure that SCTC fixes these up afterwards. I have no idea why I made it introduce a CTC in the first place.
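A source-level illustration of the promoted-call shape described in step 7, using made-up class names; the actual transformation is done on MCInsts, and reading the vtable pointer this way relies on the common C++ ABI layout:

#include <cstdio>

struct Base {
  virtual void run() { std::puts("base"); }
  virtual ~Base() = default;
};
struct HotImpl : Base {
  void run() override { std::puts("hot"); }
};

// Vtable address the profile identified as the dominant target. Reading the
// vtable pointer like this is ABI-dependent and for illustration only.
static void *hotVtable() {
  static HotImpl Instance;
  return *reinterpret_cast<void **>(&Instance);
}

void promotedCall(Base *Obj) {
  if (*reinterpret_cast<void **>(Obj) == hotVtable()) {
    // Hot path: compare against the vtable, then call the method directly;
    // no method-pointer load on this path.
    static_cast<HotImpl *>(Obj)->HotImpl::run();
  } else {
    // Cold path: fall back to the original indirect call (method load here).
    Obj->run();
  }
}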
(cherry picked from FBD6120768)
Summary:
When running hfsort+, we invalidate too many cache entries, which leads to inefficiencies. It seems we only need to invalidate cache for pairs of clusters (Into, X) and (X, Into) when modifying cluster Into (for all clusters X).
With the modification, we do not really need ShortCache, since it is computed only once per pair of clusters.
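A sketch of the narrowed invalidation, assuming a gain cache keyed by ordered pairs of cluster ids (illustrative types only):

#include <cstdint>
#include <map>
#include <utility>

using ClusterID = unsigned;
using GainCache = std::map<std::pair<ClusterID, ClusterID>, double>;

// After merging into cluster `Into`, only entries of the form (Into, X) or
// (X, Into) need to be dropped; everything else stays valid.
void invalidateCacheFor(GainCache &Cache, ClusterID Into) {
  for (auto It = Cache.begin(); It != Cache.end();) {
    if (It->first.first == Into || It->first.second == Into)
      It = Cache.erase(It);
    else
      ++It;
  }
}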
(cherry picked from FBD6341039)
Summary:
When a RememberState CFI happens to be the last CFI in a basic block, we
used to set the state of the next basic block to the CFI state prior to
executing the RememberState instruction. This contradicts comments in
the annotateCFIState() function and also differs from the behaviour of
getCFIStateAtInstr(). As a result we were getting code like the
following:
.LBB0121166 (21 instructions, align : 1)
CFI State : 0
....
0000001a: !CFI $1 ; OpOffset Reg6 -16
0000001a: !CFI $2 ; OpRememberState
....
Successors: .Ltmp4167600, .Ltmp4167601
CFI State: 3
.Ltmp4167601 (13 instructions, align : 1)
CFI State : 2
....
Notice that the state at the entry of the 2nd basic block is less than
the state at the exit of the previous basic block.
In practice we have never seen basic blocks where RememberState was the
last CFI instruction in the basic block, and hence we've never run into
this issue before.
The fix is a synchronization of handling of last RememberState
instruction by annotateCFIState() and getCFIStateAtInstr().
In the example above, the CFI state at the entry to the second BB will
be 3 after this diff.
(cherry picked from FBD6314916)
Summary: Add selective control over peephole options. This makes it easier to test which ones might have a positive effect.
(cherry picked from FBD6289659)
Summary:
The logic to append an unconditional branch at the end of a block that had
the condition flipped on its conditional tail call was broken. It should have
been looking at the successor to PredBB instead of BB. It also wasn't skipping
invalid blocks when finding the fallthrough block.
This fixes the SCTC bug uncovered by @spupyrev's work on block reordering.
(cherry picked from FBD6269493)
Summary:
With "-debug" flag we are using a dump in intermediate state when
basic block's list is initialized, but layout is not. In new isSplit()
funciton we were checking the size() which uses basic block list,
and then we were accessing the (uninitiazed) layout.
Instead of checking size() we should be checking layout_size().
(cherry picked from FBD6277770)
Summary:
A new 'compact' function aligner that takes function sizes in consideration. The approach is based on the following assumptions:
-- It is not desirable to introduce a large offset when aligning short functions, as it leads to a lot of "wasted" address space.
-- For longer functions, the offset can be larger than the default 32 bytes; however, using 64 bytes for the offset still worsens performance, as again a lot of address space is wasted.
-- Cold parts of functions can still use the default max-32 offset.
The algorithm is switched on/off by flag 'use-compact-aligner' and is controlled by parameters align-functions-max-bytes and align-cold-functions-max-bytes described above. In my tests the best performance is produced with '-use-compact-aligner=true -align-functions-max-bytes=48 -align-cold-functions-max-bytes=32'.
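One plausible reading of the size-aware cap, sketched with illustrative names and the values mentioned above; not the actual aligner code:

#include <algorithm>
#include <cstdint>

// FragmentSize is the size of the (hot or cold) code being aligned; the caps
// correspond to -align-functions-max-bytes / -align-cold-functions-max-bytes.
uint64_t maxAlignmentBytes(uint64_t FragmentSize, bool IsCold,
                           uint64_t HotMaxBytes = 48,
                           uint64_t ColdMaxBytes = 32) {
  const uint64_t Cap = IsCold ? ColdMaxBytes : HotMaxBytes;
  // Never spend more padding than the code being aligned is worth, so short
  // functions do not waste address space on large offsets.
  return std::min(Cap, FragmentSize);
}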
(cherry picked from FBD6194092)
Summary:
Enhance the basic infrastructure for relocation mode for
AArch64 to make a reasonably large program work after reordering (gcc).
Detect jump table patterns and skip optimizing functions with jump
tables in AArch64, as those will require extra future effort to fully
decode. To make these work in relocation mode, we skip changing
the function body and introduce a mode to preserve even the original
nops. By not changing any local offsets in the function, the input
original jump tables should just work.
Functions with no jump tables are optimized with BB reordering. No other
optimizations have been tested.
(cherry picked from FBD6130117)
Summary:
Fix a bug in reconstruction of an optimal path. When calculating the
best path we need to take into account a path from the new "last" node
to the current last node.
Add "-tsp-threshold" (defaults to 10) to control when the TSP
algorithm should be used.
(cherry picked from FBD6253461)
Summary:
As we deal with incomplete addresses in address-computing
sequences of code in AArch64, we found it is easier to handle them in
relocation mode in the presence of relocations.
Incomplete addresses may mislead BOLT into thinking there are
instructions referring to a basic block when, in fact, this may be the
base address of a data reference. If the relocation is present, we can
easily spot such cases.
This diff contains extensions in relocation mode to understand and
deal with AArch64 relocations. It also adds code to process data inside
functions as marked by AArch64 ABI (symbol table entries named "$d").
In our code, this is called constant islands handling. Last, it extends
bughunter with a "cross" mode, in which the host generates the binaries
and the user tests them (uploading to the target), useful when debugging
in AArch64.
(cherry picked from FBD6024570)
Summary:
Add functionality to support reordering bzip2 compiled to
AArch64, with function splitting but without relocations:
* Expand the AArch64 backend to support inverting branches and
analyzing branches so BOLT reordering machinery is able to shuffle
blocks and fix branches correctly;
* Add a new pass named LongJmp to add stubs whenever code needs to
jump to the cold area, when using function splitting, because of the
limited target encoding capability in AArch64 (as a RISC architecture).
(cherry picked from FBD5748184)
Summary:
Add basic AArch64 read/write capability to be able to
disassemble bzip2 for AArch64 compiled with gcc 5.4.0 and write
it back after going through the basic BOLT pipeline with no block
reordering (NOPs/unreachable blocks get removed).
This is not for relocation mode.
(cherry picked from FBD5701994)
Summary:
A few improvements for hfsort+ algorithm. The goal of the diff is (i) to make the resulting function order more i-cache "friendly" and (ii) fix a bug with incorrect input edge weights. A specific list of changes is as follows:
- The "samples" field of CallGraph.Node should be at least the sum of incoming edge weights. Fixed with a new method CallGraph::adjustArcWeights()
- A new optimization pass for hfsort+ in which pairs of functions that call each other with very high probability (>=0.99) are always merged. This improves i-cache utilization but may worsen i-TLB performance. See a new method HFSortPlus::runPassOne()
- Adjusted optimization goal to make the resulting ordering more i-cache "friendly", see HFSortPlus::expectedCalls and HFSortPlus::mergeGain
- Functions w/o samples are now reordered too (they're placed at the end of the list of hot functions). These functions do appear in the call graph, as some of their basic blocks have samples in the LBR dataset. See HFSortPlus::initializeClusters
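A sketch of the arc-weight adjustment from the first bullet, with simplified stand-ins for CallGraph::Node and its arcs:

#include <algorithm>
#include <cstdint>
#include <vector>

struct Node {
  uint64_t Samples = 0;
  std::vector<uint64_t> IncomingArcWeights;
};

// Ensure each node's sample count is at least the sum of its incoming
// edge weights.
void adjustArcWeights(std::vector<Node> &Nodes) {
  for (Node &N : Nodes) {
    uint64_t InWeight = 0;
    for (uint64_t W : N.IncomingArcWeights)
      InWeight += W;
    N.Samples = std::max(N.Samples, InWeight);
  }
}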
(cherry picked from FBD6248850)
Summary:
If you attempted to use a function filter on a binary in relocation mode, the resulting binary would probably crash. This is because we weren't calling fixBranches on all functions. This was breaking bughunter.sh.
I also strengthened the validation of basic blocks. The conditional branch should always be non-null when there are two successors.
(cherry picked from FBD6261930)
Summary:
Refactor basic block reordering code out of the BinaryFunction.
BinaryFunction::isSplit() is now checking if the first and the last
blocks in the layout belong to the same fragment. As a result,
it no longer returns true for functions that have their cold part
optimized away.
Change type for returned "size" from unsigned to size_t.
Fix lines over 80 characters long.
(cherry picked from FBD6250825)
Summary:
Move the indirect branch analysis code from BinaryFunction to MCInstrAnalysis/X86MCTargetDesc.cpp.
In the process of doing this, I've added an MCRegInfo to MCInstrAnalysis which allowed me to remove a bunch of extra method parameters. I've also had to refactor how BinaryFunction held on to instructions/offsets so that it would be easy to pass a sequence of instructions to the analysis code (rather than a map keyed by offset).
Note: I think there are a bunch of MCInstrAnalysis methods that have a BitVector output parameter that could be changed to a return value since the size of the vector is based on the number of registers, i.e. from MCRegisterInfo. I haven't done this in order to keep the diff a more manageable size.
(cherry picked from FBD6213556)
Summary:
Add support for reading value profiling info from perf data. This diff adds support in DataReader/DataAggregator for value profiling data. Each event is recorded as two Locations (a PC and an address/value) and a count.
For now, I'm assuming that the value profiling data is in the same file as the usual BOLT profiling data. Collecting both at the same time seems to work.
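A sketch of how such an event could be represented, with illustrative field names rather than the exact DataReader/DataAggregator types:

#include <cstdint>
#include <string>

struct Location {
  std::string SymbolName; // symbol the address falls in (may be empty)
  uint64_t Offset;        // offset within the symbol, or the raw value
};

// Two Locations plus a count, as described above.
struct ValueProfileRecord {
  Location InstrPC; // the instruction the event was sampled at
  Location Value;   // the address/value observed at that instruction
  uint64_t Count;   // number of samples with this (PC, value) pair
};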
(cherry picked from FBD6076877)