Summary:
A long due refactoring that makes interfaces cleaner and less awkward.
Mainly makes the future work way easier.
(cherry picked from FBD14766284)
Summary:
When we patch .debug_abbrev we issue many duplicate patches. Instead of
storing these patches as a vector, use a hash map. This saves some
processing time and memory.
(cherry picked from FBD14691292)
Summary:
In non-relocation mode we were accidentally emitting section headers for
every single jump table. This happened with default
`-jump-tables=basic`.
(cherry picked from FBD14653282)
Summary:
While using "-hot-text" option, we might not get enough cold text to
fill up the last huge page, and we can get data allocated on this page
producing undesirable effects. To prevent this from happening, always
make sure to allocate enough space past __hot_end.
(cherry picked from FBD14575100)
Summary:
While removing redundant local symbols, we used new section index to
lookup the corresponding section in the old section table. As a result,
we used to either not remove the correct symbols, or remove the wrong
ones.
(cherry picked from FBD14552047)
Summary:
We used to use existing symbol binding while duplicating and renaming
cold fragment symbols. As a result, some of those were emitted with
global binding. This confuses gdb, and it starts treating those symbols
as additional entry points.
The fix is to always emit such symbols with a local binding. This also
means that we have to sort static symbol table before emission to make
sure local symbols precede all others.
(cherry picked from FBD14529265)
Summary:
Create a separate pass for assigning functions to sections. Detect
functions originating from special sections (by default .stub and
.mover) and place them into ".text.mover" if "-hot-text" options is
specified.
Cold functions are isolated from hot functions even when no function
re-ordering is specified.
(cherry picked from FBD14512628)
Summary:
GDB does not like if the first entry in the line info table after
end_sequence entry is not marked with is_stmt. If this happens, it will
not print the correct line number information for such address. Note
that everything works fine starting with the first address marked
with is_stmt.
This could happen if the first instruction in the cold section wasn't
marked with is_stmt.
The fix is to always emit debug line info for the first instruction
in any function fragment with is_stmt flag.
(cherry picked from FBD14516629)
Summary:
This refactoring makes it easier to create new code sections and control
code placement. As an example, cold code is being placed into
".text.cold" which is emitted independently from ".text", and the final
address assignment becomes more flexible.
Previously, in non-relocation mode we used to emit temporary section
name into .shstrtab. This resulted in unnecessary bloat of this section.
There was unnecessary padding emitted at the end of text section. After
fixing this, the output binary becomes smaller.
I had to change the way exception handling tables are re-written
as the current infra does not support cross-section label difference.
This means we have to emit absolute landing pad addresses, which might
not work for PIE binaries. I'm going to address this once I investigate
the current exception handling issues in PIEs.
This diff temporarily disables "-hot-functions-at-end" option.
(cherry picked from FBD14475693)
Summary: As part of our heuristics to decode an indirect branch, if we
suspect the branch is an indirect tail call, we add its probable target
to the BC::InterproceduralReferences vector to detect functions with
more than one entry point. However, if this probable target is not in an
allocatable section, we were asserting. Remove this assertion and
change the code to conditionally store to InterproceduralReferences
instead. The probable target could be garbage at this point because
of analyzeIndirectBranch failing to identify the load instruction that
has the memory address of the target, so we should tolerate this.
(cherry picked from FBD14432821)
Summary:
Add heatmap subcommand to produce heatmaps based on perf.data with LBR.
The output is produced in colored ASCII format.
llvm-bolt heatmap -p perf.data <executable>
-block-size=<uint> - size of a heat map block in bytes (default 64)
-line-size=<uint> - number of entries per line (default 256)
-max-address=<uint> - maximum address considered valid for heatmap
(default 4GB)
-o=<string> - heatmap output file (default stdout)
(cherry picked from FBD13969992)
Summary:
For non-simple function we can miss a reference to a jump table or
to an indirect goto table. If we move the jump table, the missed
reference will not get updated, and the corresponding indirect jump
will end up in the old (wrong) location. Updating the original jump
table in-place should take care of the issue.
(cherry picked from FBD13849776)
Summary:
While converting perf profile, we only need CFG for functions that were
profiled and can skip building CFG for the rest. This saves us some
processing time and memory.
Breakdown processing of perf.data into two steps. The first
step parses the data, saves it in intermediate format, and marks
functions with the profile. The second step attributes the profile to
functions with CFG. When we disassemble and build CFG for functions in
aggregate-only mode, we skip functions without the profile.
(cherry picked from FBD13706697)
Summary:
Improve tracking of forked processes.
If a process corresponding to the input binary has forked/started
before 'perf record' was initiated, then the full name of the binary
will be recorded in a corresponding MMAP2 event. We've being handling
such cases well so far.
However, if the process was forked after 'perf record' has started, and
execve(2) wasn't called afterwards, then there will be no MMAP2 event
recorded corresponding to the mapping of the main binary (unrelated
MMAP2 events could still be recorded).
To track such cases, we need to parse 'perf script --show-task-events'
command output, and to scan for PERF_RECORD_FORK events, and then add
forked process PIDs to the list associated with the input binary. If
the fork event was followed by an exec event (PERF_RECORD_COMM exec)
of a different binary, then the forked PID should be ignored. If the
exec event was associated with our input binary, then the correct MMAP2
event was recorded and parsed.
To track if the event occurred before or after 'perf record', we parse
event's time. This helps us to differentiate some events. E.g. the exec
event is only registered correctly if it happened after perf recording
has started (otherwise the "exec" part is missing), and thus we only
record forks with non-zero time stamps.
(cherry picked from FBD13250904)
Summary:
Use newly added function size estimation to measure the effectiveness
and guide function splitting. Two new tuning options are added:
-split-threshold=<uint>
split function only if its main size is reduced by more than given
amount of bytes. Default value: 0, i.e. split iff the size is reduced.
Note that on some architectures the size can increase after splitting.
-split-align-threshold=<uint>
when deciding to split a function, apply this alignment while doing
the size comparison (see -split-threshold). Default value: 2.
(cherry picked from FBD13136352)
Summary:
Add BinaryContext::calculateEmittedSize() that ephemerally emits code
to allow precise estimation of the function size. Relaxation and
macro-op alignment adjustments are taken into account.
(cherry picked from FBD13092139)
Summary:
On x86 the difference between long and short jump instructions could be
either 4 or 3 bytes, depending if it's a conditional jump or not.
For a basic block with 2 jump instructions, if we know that one of
the successors is in a different code region, then we can make it
a target of an unconditional jump instruction. This will save 1 byte
in case the conditional jump happens to be a short one.
(cherry picked from FBD13078139)
Summary:
When Clang is boot-strapped with (Thin)LTO, it may produce a code
fragment similar to below:
.LFT663334 (6 instructions, align : 1)
Predecessors: .LFT663333
00000538: movb $0x1, %al
0000053a: movl %eax, -0x2c(%rbp)
0000053d: movl $"_ZN5clang6Parser12ConsumeParenEv/1", %ecx
00000542: testb $0x1, %cl
00000545: movq -0x40(%rbp), %r14
00000549: je .Ltmp1071462
Successors: .Ltmp1071462, .LFT663335
.LFT663335 (2 instructions, align : 1)
Predecessors: .LFT663334
0000054b: movq (%r12), %rax
0000054f: movq .Ltmp0(%rax), %rcx
Successors: .Ltmp1071462
.Ltmp1071462 (7 instructions, align : 1)
Predecessors: .LFT663334, .LFT663335
00000556: movq %r12, %rdi
00000559: callq *%rcx
.......
The code above is making a call by dereferencing a pointer to a member
function. A pointer to a member function could either be a regular
function, or a virtual function. To differentiate between the two, AMD64
ABI (originated from Itanium ABI) uses the last bit of the pointer. The
call instruction sequence varies depending if the function is virtual or
not, and the pointer's last bit is checked. If it's "1" then the value
of the pointer (minus 1) is used as an offset in the object vtable to
get the address of the function, otherwise the pointer is used directly
as a function address.
In this specific case, a de-virtualization is taking place, but it's not
complete. Compiler knows that the member function pointer is actually a
non-virtual function _ZN5clang6Parser12ConsumeParenEv (aka
"clang::Parser::ConsumeParen()"). However, it keeps the (dead) code that
checks the last bit of _ZN5clang6Parser12ConsumeParenEv, and furthermore
keeps the code (unreachable/dead) to make a virtual call while using
(_ZN5clang6Parser12ConsumeParenEv - 1) as an offset into the vtable.
This is obviously wrong, but since the code is unreachable, it will
never affect the runtime correctness.
The value "_ZN5clang6Parser12ConsumeParenEv - 1" falls into a last byte
of a function preceding _ZN5clang6Parser12ConsumeParenEv, and BOLT
creates a label ".Ltmp0" pointing to this last byte that is referenced
in by the instruction sequence above. It just happens that the last byte
is also in the middle of the last instruction, and as a result, BOLT
never emits the label, hence resulting in the error message "Undefined
temporary symbol".
The workaround is to detect non-pc-relative relocations from code
pointing to some (fptr - 1). Note that this is not completely
error-prone, but non-pc-relative references from code into a middle of
a function are quite rare, and chances that in a normal situation they
will point to a byte preceding some function address are virtually zero.
(cherry picked from FBD13030310)
Summary:
Special case GOT relocs to ignore addend subtracting
logic in analyzeRelocation, since the addend does not refer to the
target of the instruction being analyzed. Also make the code honor
the comments in the special case about zeroed out ExtractValue but
non-zero addend.
Fixfacebookincubator/BOLT#40
(cherry picked from FBD10355019)
Summary:
This pull request fixes two compiler warnings:
- missing `break;` in a switch-case statement in RegAnalysis.cpp (-Wimplicit-fallthrough warning)
- misleading indentation in BinaryContext.cpp (-Wmisleading-indentation warning)
Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/39
GitHub Author: Andreas Ziegler <andreas.ziegler@fau.de>
(cherry picked from FBD10202092)
Summary:
lld may generate relocations without associated symbols. Instead of
rejecting binaries with such relocations, we can re-create the symbol
the relocation is against based on the extracted value.
(cherry picked from FBD10054576)
Summary:
Previously, we were expanding eligible branches with stubs. After
expansion, we were computing which stubs were unnecessary and removing them,
assuming ranges were shortening as code is removed. The problem with this
approach is that for branches that refer to code that is not managed by
BOLT, the distance to that location can increase and we can end up with an
out-of-range branch.
This rewrites the pass to be simpler, only increasing size and expanding code
with stubs as needed after each iteration, stopping when code stops increasing.
Besides this rewrite, the stub-insertion pass now supports stubs grouping
similar to what the linker does, allowing different functions to share the
same veneer that jumps to a common callee. It also fixes a bug in the previous
implementation that, in very large functions that use TBZ/TBNZ (+-32KB range),
it would mistakenly try to reuse a local stub BB that is out of range.
This includes a change to allow hot functions to be put at the end of the
.text section, closer to the heap, requiring no veneers to jump to JITted
code. And finally it enables eliminate veneers pass by default.
(cherry picked from FBD10023158)
Summary:
If we reuse text section under `-use-old-text` option, then there's no
need to rename it. Tools, such as perf, seem to not like binaries
without `.text`.
Additionally, check if the code fits into `.text` using the page
alignment, otherwise we were skipping the alignment relying on the user
detecting the warning message. This could have resulted in unexpected
performance drops.
Also add `-no-huge-pages` option to use regular page size for code
alignment purposes (i.e. 4KiB instead of 2MiB).
(cherry picked from FBD10024670)
Summary:
While creating BinaryData objects we used to process all symbol table
entries. However, some symbols could belong to non-allocatable sections,
and thus we have to ignore them for the purpose of analyzing in-memory
data.
(cherry picked from FBD9666511)
Summary:
For jump tables ICP was using profile from the jump table itself which
doesn't work correct if the jump table is re-used at different code
locations.
(cherry picked from FBD9618774)
Summary:
While running ICF pass we have skipped merging profile data for jump
tables. We were only updating profile in the CFG. Fix that.
(cherry picked from FBD9595523)
Summary:
Do not truncate the binary name for comparison purposes as the binary
name we are getting from "perf script" is no longer truncated.
(cherry picked from FBD9596409)
Summary:
After optimizing a target of a jump table, ICP was not updating edge
counts corresponding to that target. As a result the edge could be left
hot and negatively influence the code layout.
(cherry picked from FBD9524396)
Summary:
In some rare cases a compiler may generate DWARF that contains an empty
CU DIE that references a debug line fragment. That fragment will contain
no file name information, and we fail to register it. Then, as a result,
DW_AT_stmt_list is not updated for the CU. This may cause some
DWARF-processing tools to segfault.
As a solution/workaround, we register "<unknown>" file name for such
debug line tables.
(cherry picked from FBD9526705)
Summary:
The build-id is used by tools to uniquely identify binaries. Update
the output binary build-id with a different number to make it
distinguishable from the input binary. This implementation just flips
the last build-id bit.
(cherry picked from FBD9235336)
Summary:
When updating CFI for a function that was optimized by
shrink-wrapping, if the function had no frame pointers, the CFI update
algorithm was incorrect.
(cherry picked from FBD9328658)
Summary:
Position-independent binaries may have runtime relocations of type
R_X86_64_RELATIVE that need an update if they were pointing to one of
the functions that we have relocated.
(cherry picked from FBD9374164)
Summary:
Processing profile data for binaries with flexible load address (such as
position-independent executables and shared objects) requires adjusting
binary addresses depending on the base load address.
For every PID the mapping will be more or less unique when executing
with ASLR enabled, thus we have to keep the mapping record for all PIDs
associated with the binary. Then we adjust the addresses based on those
mappings.
(cherry picked from FBD9368566)
Summary:
Switch from using `perf script --show-task-events` to
`perf script --show-mmap-events` for associating a binary with PIDs in
perf.data. The output of the former command does not provide enough
information for PIE/.so processing.
(cherry picked from FBD9346586)
Summary:
A recent commit broke our tests because it was depending on
getNumNonPseudos() at a very late stage of our optimization pipeline.
The problem was in a instruction deletion member function in
BinaryBasicBlock that was not updating the number of pseudos after
deletion. Fix this.
(cherry picked from FBD9305972)
Summary:
Couple of updates:
1) Handle address pattern with segment register.
2) Assume R11 available for PLT calls always.
3) Add CFI state to each BB.
4) early exit getMacroOpFusionPair if Instruction.size() <2.
(cherry picked from FBD9172426)
Summary:
Sometimes GCC can generate code where one of jump table entries
is being used by an indirect branch with a fixed memory reference,
such as "jmp *(JT+8)". If we don't convert such branches to direct ones
and move jump tables, then the indirect branch will reference the old
table value and will end up at the non-updated destination, possibly
causing a runtime crash.
This fix converts such indirect branches into direct ones.
For now we mark functions containing indirect branches with fixed
destination as non-simple to prevent unreachable code elimination
problem triggered by related dead/unreachable jump table.
(cherry picked from FBD9192363)