Summary:
Relocation value verification was failing for IFUNC as the real value
used for relocation wasn't the symbol value, but a corresponding PLT
entry.
Relax the verification and skip any symbols of ST_Other type.
(cherry picked from FBD9123741)
Summary:
containsRange() functions were incorrectly checking for an empty range
at the end of the containing object. I.e. [a,b) was reporting true for
containing [b,b).
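A minimal sketch of the corrected check for half-open ranges, written in Python rather than BOLT's C++. The function name and signature are illustrative assumptions, not the actual BOLT API.

```python
def contains_range(start, end, other_start, other_end):
    """Return True if half-open [start, end) contains [other_start, other_end).

    The contained range must begin strictly inside the container, so an
    empty range [end, end) sitting at the very end is rejected.
    """
    return start <= other_start < end and other_end <= end
```

With this check, `contains_range(0, 10, 10, 10)` is false, while an empty range that starts inside the container, such as `[5, 5)`, is still accepted.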
(cherry picked from FBD9074643)
Summary:
The TLS segment provides a template for initializing thread-local storage
for every new thread. It consists of initialized and uninitialized
parts. The uninitialized part of TLS, .tbss, is completely meaningless
from a binary analysis perspective. It doesn't take any space in the
file, or in memory. Note that this is different from a regular .bss
section that takes space in memory.
We should not place .tbss into a list of allocatable sections, otherwise
it may cause conflicts with objects contained in the next section.
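A minimal sketch of the filtering idea, in Python. The constants are the standard ELF values; the helper names and the tuple layout of a section are assumptions made for illustration only.

```python
# Standard ELF constants.
SHT_NOBITS = 8     # section occupies no space in the file
SHF_ALLOC = 0x2    # section occupies memory during execution
SHF_TLS = 0x400    # section holds thread-local storage


def is_tbss(sh_type, sh_flags):
    """A .tbss-like section: TLS and without file contents."""
    return sh_type == SHT_NOBITS and bool(sh_flags & SHF_TLS)


def allocatable_sections(sections):
    """Filter (name, sh_type, sh_flags) tuples (hypothetical layout).

    Keeps sections flagged SHF_ALLOC but skips .tbss, which would
    otherwise overlap with whatever follows it.
    """
    return [name for name, sh_type, sh_flags in sections
            if sh_flags & SHF_ALLOC and not is_tbss(sh_type, sh_flags)]
```

A regular .bss (SHT_NOBITS without SHF_TLS) stays in the list, matching the distinction drawn above.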
(cherry picked from FBD9074056)
Summary:
For large binaries, cache+ algorithm adds a noticeable overhead in
comparison with cache. This modification restricts search space of the
optimization, which makes cache+ as fast as cache for all tested binaries.
There is a tiny (in the order of 0.01%) regression in cache-related metrics,
but this is not noticeable in practice.
(cherry picked from FBD8369968)
Summary:
The regular perf2bolt aggregation job is to read perf output directly.
However, if the data is coming from a database instead of perf, one
could write a query to produce a pre-aggregated file. This function
deals with this case.
The pre-aggregated file contains aggregated LBR data, but without binary
knowledge. BOLT will parse it and, using information from the
disassembled binary, augment it with fall-through edge frequency
information. After this step is finished, this data can be either
written to disk to be consumed by BOLT later, or can be used by BOLT
immediately if kept in memory.
File format syntax:
{B|F|f} [<start_id>:]<start_offset> [<end_id>:]<end_offset> <count> [<mispred_count>]
B - indicates an aggregated branch
F - an aggregated fall-through (trace)
f - an aggregated fall-through with external origin - used to disambiguate
between a return hitting a basic block head and a regular internal
jump to the block
<start_id> - build id of the object containing the start address. We can
skip it for the main binary and use "X" for an unknown object. This will
save some space and facilitate human parsing.
<start_offset> - hex offset from the object base load address (0 for the
main executable unless it's PIE) to the start address.
<end_id>, <end_offset> - same for the end address.
<count> - total aggregated count of the branch or a fall-through.
<mispred_count> - the number of times the branch was mispredicted.
Omitted for fall-throughs.
Example
F 41be50 41be50 3
F 41be90 41be90 4
f 41be90 41be90 7
B 4b1942 39b57f0 3 0
B 4b196f 4b19e0 2 0
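The format above can be read with a few lines of code. The following is a minimal Python sketch, not BOLT's parser; the type and function names are hypothetical, and it assumes that counts are decimal while offsets are hex without a prefix.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    build_id: Optional[str]  # None stands for the main binary
    offset: int


class Record(NamedTuple):
    kind: str                # "B", "F" or "f"
    start: Location
    end: Location
    count: int
    mispreds: Optional[int]  # None when the field is omitted


def parse_location(text):
    """Parse '[<id>:]<hex offset>'."""
    build_id, sep, offset = text.rpartition(":")
    return Location(build_id if sep else None, int(offset, 16))


def parse_line(line):
    """Parse one line of the pre-aggregated file."""
    fields = line.split()
    if len(fields) not in (4, 5) or fields[0] not in ("B", "F", "f"):
        raise ValueError("malformed record: " + line)
    mispreds = int(fields[4]) if len(fields) == 5 else None
    return Record(fields[0], parse_location(fields[1]),
                  parse_location(fields[2]), int(fields[3]), mispreds)
```

For instance, `parse_line("B 4b1942 39b57f0 3 0")` yields a branch record with count 3 and zero mispredictions.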
(cherry picked from FBD8887182)
Summary:
This diff adds the API needed to inject functions using BOLT.
In relocation mode, injected functions are emitted between the cold and the hot functions.
In non-relocation mode, injected functions are emitted in the next text section.
(cherry picked from FBD8715965)
Summary:
If the input binary does not have a build-id and the name does not match
any file names in perf.data, then reject the binary, and issue an error
message suggesting to rename it to one of the listed names from
perf.data.
(cherry picked from FBD8846181)
Summary:
Recent compiler tool chains can produce build-ids that are less than 40
characters long. Linux perf, however, always outputs 40 characters,
expanding the string with 0's as needed. Fix the matching by only
checking the string prefix.
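A minimal sketch of the prefix comparison, in Python. The function name is hypothetical, and rejecting an empty build-id is an added assumption.

```python
def build_ids_match(binary_build_id, perf_build_id):
    """Compare a binary's build-id with the one reported by perf.

    perf pads its output to 40 characters with '0', so only the prefix
    covered by the binary's (possibly shorter) build-id is compared.
    """
    return bool(binary_build_id) and perf_build_id.startswith(binary_build_id)
```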
(cherry picked from FBD8839452)
Summary:
Rework the logic we use for managing references to constant
islands. Defer the creation of the cold versions to when we split the
function and will need them.
(cherry picked from FBD8228803)
Summary:
llvm-dwarfdump is relying on getRelocatedSection() to return
section_end() for ELF files of types other than relocatable objects.
We've changed the function to return the relocated section for other
types of ELF files. As a result, llvm-dwarfdump started re-processing
relocations for sections that already had relocations applied, e.g. in
executable files, and this resulted in wrong values being reported.
As a workaround/solution, we make this function return the relocated
section for executable files (and any other non-relocatable objects)
only if the section is allocatable.
(cherry picked from FBD8760175)
Summary:
As reported in GH-28, `perf` can produce a `-` symbol for the misprediction
bit if the bit is not supported by the kernel/HW. In this case we can ignore
the bit.
(cherry picked from FBD8786827)
Summary:
When a given function B, located after function A, references
one of A's basic blocks, it registers a new global symbol at the
reference address and updates A's Labels vector via
BinaryFunction::addEntryPoint(). However, we don't update A's branch
targets at this point. So we end up with an inconsistent CFG, where the
basic block names are global symbols, but the internal branch operands
are still referencing the old local name of the corresponding blocks
that got promoted to an entry point. This patch fixes this by detecting
this situation in addEntryPoint and iterating over all instructions,
looking for references to the old symbol and replacing them with the
new global symbol (since this is now an entry point).
Fixes facebookincubator/BOLT#26
(cherry picked from FBD8728407)
Summary:
While removing unreachable blocks, we may decide to remove a
block that is listed as a target in a jump table entry. If we do that,
this label will then be undefined and the LLVM assembler will crash.
Mitigate this for now by not removing such blocks, as we don't support
removing unnecessary jump tables yet.
Fixes facebookincubator/BOLT#20
(cherry picked from FBD8730269)
Summary:
If the encoding is not specified in CIE augmentation string, then it
should be DW_EH_PE_absptr instead of DW_EH_PE_omit.
(cherry picked from FBD8740274)
Summary:
In a release build without assertions, MCInst::dump() is undefined and
causes a link-time failure.
Fixes facebookincubator/BOLT#27.
(cherry picked from FBD8732905)
Summary:
Check the ELF type of the input binary. Reject any binary not of
ET_EXEC type, including position-independent executables (PIEs).
Also print the first function containing a PIC jump table.
(cherry picked from FBD8707274)
Summary:
Ignore 'S' in augmentation string on input. It just marks a signal
frame. All we have to do is propagate it.
Fixes facebookincubator/BOLT#21
This was already in LLVM trunk rL331738. Update llvm.patch.
(cherry picked from FBD8707222)
Summary:
GCC 8 can generate jump tables with just 2 entries. Modify our heuristic
to accept it. We still assert that there's more than one entry.
(cherry picked from FBD8709416)
Summary:
Add support for functions with internal calls, necessary for
handling the Intel MKL library and some code observed in the Google core
dumper library.
This is not optimizing these functions, but only identifying them,
running analyses to ensure we will not break those functions if we move
them, and then "freezing" these functions (marking them as not simple so
BOLT will not try to reorder them or touch them in any way).
(cherry picked from FBD8364381)
Summary:
When processing a binary in -debug mode, BD could be nullptr in some
cases. It is better to fail later on an assert than here with a segfault.
Closes https://github.com/facebookincubator/BOLT/pull/18
GitHub Author: Alexander Gryanko <xpahos@gmail.com>
(cherry picked from FBD8650719)
Summary:
This option only works in relocation mode. In non-relocation
mode, it generates invalid references that cause MCStreamer to fail.
If the user requested this flag in non-relocation mode, disable it and
print a warning.
(cherry picked from FBD8625990)
Summary:
Create folders and setup to make LIT run BOLT-only tests. Add
a test example. This will add a new make/ninja rule "check-bolt" that
the user can invoke to run LIT on this folder.
(cherry picked from FBD8595786)
Summary:
BOLT heuristics failed to work if false PIC jump table entries were
accepted when they were pointing inside a function, but not at
an instruction boundary.
This fix checks if the destination falls at instruction boundary, and
if it does not, it truncates the jump table. This, of course, still does not
guarantee that the entry corresponds to a real destination, and we can
have "false positive" entry(ies). However, it shouldn't affect
correctness of the function, but the CFG may have edges that are never
taken. We may update an incorrect jump table entry, corresponding to
unrelated data, and for that reason we force moving of jump tables if a
PIC jump table was detected.
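A minimal sketch of the truncation step, in Python. The function name, its parameters, and the representation of instruction boundaries as a set of offsets are assumptions for illustration, not BOLT's data structures.

```python
def truncate_jump_table(entries, function_start, function_size,
                        instruction_offsets):
    """Keep jump table entries up to the first implausible one.

    entries: candidate target addresses, in table order.
    instruction_offsets: offsets (from function_start) at which a
    disassembled instruction begins.
    An entry is accepted only if it lands inside the function and on an
    instruction boundary; the table is cut at the first entry that does not.
    """
    accepted = []
    for target in entries:
        offset = target - function_start
        if not 0 <= offset < function_size:
            break
        if offset not in instruction_offsets:
            break
        accepted.append(target)
    return accepted
```

As noted above, an accepted entry may still be a false positive; the check only rules out targets that cannot be real destinations.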
(cherry picked from FBD8559588)
Summary:
Don't report all data objects with hash collisions by default. Only
report the summary, and use -v=1 for providing the full list.
(cherry picked from FBD8372241)
Summary:
This diff replaces the addresses in all the {SYMBOLat,HOLEat,DATAat} symbols with hash values based on the data contained in the symbol. It should make the profiling data for anonymous symbols robust to address changes.
The only small problem with this approach is that the hashed names for padding symbols of the same size collide frequently. This shouldn't be a big deal since it would be weird if those symbols were hot.
On a test run with hhvm there were 26 collisions (out of ~338k symbols). Most of the collisions were from small (2,4,8 byte) objects.
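A minimal sketch of content-based naming, in Python. The hash function (SHA-1 truncated to 16 hex digits) and the exact name layout are assumptions chosen for illustration; the diff does not state which hash it uses.

```python
import hashlib


def anonymous_symbol_name(prefix, contents):
    """Name an anonymous symbol after its contents instead of its address.

    prefix: one of "SYMBOLat", "HOLEat", "DATAat".
    contents: the bytes covered by the symbol.
    """
    digest = hashlib.sha1(contents).hexdigest()[:16]
    return f"{prefix}0x{digest}"
```

Two padding symbols of the same size have identical contents and therefore get the same name, which is the collision case described above.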
(cherry picked from FBD7134261)
Summary:
This diff introduces a modification of the cache+ block ordering algorithm,
which reorders and merges cold blocks in a function with the goal of reducing
the number of (non-fallthrough) jumps, and thus, the code size.
(cherry picked from FBD8044978)
Summary:
Add "-inline-memcpy" option to inline calls to memcpy() using
"rep movsb" instruction. The pass is X86-specific.
Calls to _memcpy8 are optimized too using a special return value
(dest+size).
The implementation is very primitive in that it does not track liveness
of %rax after return, and does not do %rcx substitution. This is going to
get improved if we find the optimization to be useful.
(cherry picked from FBD8211890)
Summary:
In AArch64, when the binary gets large, the linker inserts
stubs with 3 instructions: ADRP to load the PC-relative address of
a page; ADD to add the offset within the page; and a branch instruction
to do an indirect jump to the contents of X16 (the linker-reserved
reg). The problem is that the linker does not issue a relocation for
this (since this is not code coming from the assembler), so BOLT has
no idea what the real target is, unless it recognizes these instructions
and extracts the target by combining the operands of the instructions
from the stub. This diff does exactly that.
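A minimal sketch of how the operands combine, in Python. It assumes the immediates were already decoded from the instructions (the ADRP immediate as a signed page count, the ADD immediate as a byte offset); the function name is hypothetical.

```python
def stub_target(adrp_address, adrp_imm, add_imm):
    """Compute the branch target of an ADRP + ADD + BR X16 stub.

    ADRP produces the 4 KiB page of its own address plus a signed number
    of pages; ADD then supplies the offset within that page.
    """
    page = (adrp_address & ~0xFFF) + (adrp_imm << 12)
    return page + add_imm
```

For a stub at 0x401234 with an ADRP immediate of 2 and an ADD immediate of 0x345, the target is 0x401000 + 0x2000 + 0x345 = 0x403345.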
(cherry picked from FBD7882653)
Summary:
If the input binary for perf2bolt has a build-id and perf data has
recorded build-ids, then try to match them. Adjust the file name if
build-ids match to cover cases where the binary was renamed after data
collection. If there's no matching build-id, report an error and exit.
While scanning task events, truncate the name to 15 characters prior to
matching, since that's how names are reported by perf.
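A minimal sketch of the name comparison, in Python. The function and constant names are hypothetical.

```python
PERF_COMM_LENGTH = 15  # perf reports task names truncated to 15 characters


def task_name_matches(binary_name, perf_comm):
    """Compare a binary file name with a task name from perf events."""
    return binary_name[:PERF_COMM_LENGTH] == perf_comm
```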
(cherry picked from FBD8034436)
Summary:
Option `-report-bad-layout=N` prints top N functions with layouts
that have cold blocks placed in the middle of hot blocks. The sorting is
based on execution_count / number_of_basic_blocks formula.
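A minimal sketch of the report, in Python. The tuple layout for a function and the definition of a cold block as one with a zero count are assumptions for illustration; only the ranking formula comes from the description above.

```python
def has_cold_in_middle(block_counts):
    """True if a zero-count block sits between two executed blocks."""
    hot = [i for i, count in enumerate(block_counts) if count > 0]
    if not hot:
        return False
    return any(count == 0 for count in block_counts[hot[0]:hot[-1] + 1])


def report_bad_layout(functions, n):
    """Return names of the top n functions with a bad layout.

    functions: (name, execution_count, block_counts_in_layout_order)
    tuples. Ranking uses execution_count / number_of_basic_blocks.
    """
    bad = [f for f in functions if has_cold_in_middle(f[2])]
    bad.sort(key=lambda f: f[1] / len(f[2]), reverse=True)
    return [f[0] for f in bad[:n]]
```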
(cherry picked from FBD8051950)
Summary:
Application code can reference functions in a non-standard way, e.g.
using arithmetic and bitmask operations on them. One example is if a
program checks if a function is below a certain address or within
a certain address range to perform a low-level optimization or generate
a proper code (JIT).
Instead of relying on a relocation value (symbol+addend), we use only
the symbol value, and then check if the value is inside the function.
If it is, we treat it as a code reference against location within the
function, otherwise we handle it as a non-standard function reference
and issue a warning.
(cherry picked from FBD7996274)
Summary:
When we make changes to MCInst opcodes (or get changes from upstream),
a hash value for BinaryFunction changes. As a result, we are unable
to match profile generated by a previous version of BOLT.
Add option `-profile-ignore-hash` to match profile while ignoring
function hash value. With this option we match functions with common
names using the number of basic blocks.
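A minimal sketch of the relaxed matching, in Python. The dictionary layout and function name are assumptions for illustration; BOLT's profile reader is more involved.

```python
def match_profile(functions, profile, ignore_hash=False):
    """Return names of functions whose profile entry is accepted.

    functions and profile both map a function name to a
    (hash, number_of_basic_blocks) pair (hypothetical layout).
    By default the hashes must agree; with ignore_hash set, functions
    with a common name are matched by the number of basic blocks.
    """
    matched = []
    for name, (fn_hash, fn_blocks) in functions.items():
        if name not in profile:
            continue
        prof_hash, prof_blocks = profile[name]
        if fn_hash == prof_hash or (ignore_hash and fn_blocks == prof_blocks):
            matched.append(name)
    return matched
```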
(cherry picked from FBD7983269)
Summary:
To accurately account for PLT optimization, each PLT call should be
counted as an extra indirect call instruction, which in turn counts as
a load, a call, an indirect call, and an instruction entry in dyno stats.
(cherry picked from FBD7978980)
Summary:
While working with a binary in non-relocation mode, I realized
some cache metrics are not computed correctly. Hence, this fix.
In addition, log the number of functions with modified ordering of
basic blocks, which is helpful for analysis.
(cherry picked from FBD7975392)
Summary:
Enable BOLT to reorder data sections in a binary based on memory
profiling data.
This diff adds a new pass to BOLT that can reorder data sections for
better locality based on memory profiling data. For now, the algorithm
to order data is primitive and just relies on the frequency of loads to
order the contents of a section. We could probably do a lot better by
looking at what functions use the hot data and grouping together hot
data that is used by a single function (or cluster of functions).
Block ordering might give some hints on how to order the data better as
well.
The new pass has two basic modes: inplace and split (when inplace is
false). The default is split since inplace hasn't really been tested
much. When splitting is on, the cold data is copied to a "cold" version
of the section while the hot data is kept in the original section, e.g.
for .rodata, .rodata will contain the hot data and .bolt.org.rodata will
contain the cold bits. In inplace mode, the section contents are
reordered inplace. In either mode, all relocations to data within that
section are updated to reflect new data locations.
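A minimal sketch of the frequency-based ordering and the two modes, in Python. The function name, the use of symbol names as stand-ins for data objects, and treating any object with at least one recorded load as hot are assumptions for illustration.

```python
def reorder_section(section_name, objects, load_counts, inplace=False):
    """Order data objects of one section by load frequency.

    objects: object names in their original order.
    load_counts: name -> number of profiled loads (missing means 0).
    Returns a mapping from output section name to ordered object names.
    """
    def loads(name):
        return load_counts.get(name, 0)

    # Stable sort: objects with equal counts keep their original order.
    ranked = sorted(objects, key=loads, reverse=True)
    if inplace:
        return {section_name: ranked}
    hot = [name for name in ranked if loads(name) > 0]
    cold = [name for name in objects if loads(name) == 0]
    # Hot data stays in the original section, cold data moves out.
    return {section_name: hot, ".bolt.org" + section_name: cold}
```

For `.rodata` in split mode, the result has a `.rodata` entry with the hot objects and a `.bolt.org.rodata` entry with the cold ones.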
Things to improve:
- The current algorithm is really dumb and doesn't seem to lead to any
wins. It certainly could use some improvement.
- Private symbols can have data that leaks over to an adjacent symbol,
e.g. a string that has a common suffix can start in one symbol and
leak over (with the common suffix) into the next. For now, we punt on
adjacent private symbols.
- Handle ambiguous relocations better. Section relocations that point
to the boundary of two symbols will prevent the adjacent symbols from
being moved because we can't tell which symbol the relocation is for.
- Handle jump tables. Right now jump table support must be basic if
data reordering is enabled.
- Handle TLS. A good amount of data accesses in some binaries happen
in TLS. It would be worthwhile to be able to reorder any TLS sections
too.
- Handle sections with writeable data. This hasn't been tested so
probably won't work. We could try to prevent false sharing in
writeable sections as well.
- A pie in the sky goal would be to use DWARF info to reorder types.
(cherry picked from FBD6792876)
Summary:
The default is not changing, i.e. we are not aligning code within a
function by default.
New meaning of options for aligning basic blocks:
-align-blocks
triggers basic block alignment based on profile
-preserve-blocks-alignment
tries to preserve basic block alignment seen on input
Tuning options for "-align-blocks":
-align-blocks-min-size=<uint>
blocks smaller than the specified size wouldn't be aligned
-align-blocks-threshold=<uint>
align only blocks whose execution frequency exceeds the given
percentage of the containing function's execution frequency. E.g. 1000
means aligning blocks that are 10 times more frequently executed than
the containing function.
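A minimal sketch of how the two tuning options could combine, in Python. The function and parameter names are hypothetical, and whether the boundary case (exactly the threshold) aligns is an assumption.

```python
def should_align(block_size, block_count, function_count,
                 min_size, threshold_percent):
    """Decide whether a basic block qualifies for alignment.

    min_size mirrors -align-blocks-min-size and threshold_percent mirrors
    -align-blocks-threshold. Integer arithmetic avoids rounding issues.
    """
    if block_size < min_size:
        return False
    return block_count * 100 > function_count * threshold_percent
```

With a threshold of 1000, a block executed 1500 times in a function executed 100 times qualifies, while one executed 500 times does not.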
(cherry picked from FBD7921980)