Summary:
This PR introduces 2 new instrumentation options:
1. instrumentation-no-counters-clear: Discussed at https://github.com/facebookincubator/BOLT/issues/121
2. instrumentation-wait-forks: Since the instrumentation counters are mapped as MAP_SHARED it will be nice to add ability to wait until all forks of the parent process will die using tracking of process group.
The last patch is just emitBinary code refactor.
Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/125
GitHub Author: Vladislav Khmelevskyi <Vladislav.Khmelevskyi@huawei.com>
(cherry picked from FBD26919011)
Summary:
TBSS section is a "virtual" section that does not take memory or file
space. Ignore it completely while adjusting section sizes.
(cherry picked from FBD26824484)
Summary:
There is no real link between CU and TU, so relying on fact
that address are the same, and we are updating all of them.
(cherry picked from FBD28112114)
Summary:
This commit is the first step in rebasing all of BOLT
history in the LLVM monorepo. It also solves trivial build issues
by updating BOLT codebase to use current LLVM. There is still work
left in rebasing some BOLT features and in making sure everything
is working as intended.
History has been rewritten to put BOLT in the /bolt folder, as
opposed to /tools/llvm-bolt.
(cherry picked from FBD33289252)
Summary:
1. Add support for __literal16 section in the instrumentation runtime library for MacOS.
2. Fix emitting __counters section.
(cherry picked from FBD25746342)
Summary:
a few minor updates in block reordering:
- some refactoring to improve readability;
- optimized chain splitting strategy to improve quality of layout and performance of the algorithm.
(cherry picked from FBD25126220)
Summary:
When looking at perf.data's available binaries and their
respective mmap'ed segments, match them with the input binary by
looking at both aligned and non-aligned addresses. If we suppose
the alignment is the mmap'ed page size, we may miss some cases and
perf2bolt will refuse to proceed because it failed to match the
input binary with a process recorded in perf.data.
(cherry picked from FBD25732673)
Summary:
Add options for trading processing speed for binary performance.
-lite-threshold-pct=<uint>
Threshold (in percent) for selecting functions to process in lite
mode. Higher threshold means fewer functions to process.
E.g threshold of 90 means only top 10 percent of functions with
profile will be processed.
-lite-threshold-count=<uint>
Similar to '-lite-threshold-pct' but specify threshold using
absolute function call count. I.e. limit processing to functions
executed at least the specified number of times.
-no-scan
Do not scan cold functions for external references (may result in
slower binary).
(cherry picked from FBD24739092)
Summary:
This fixes a bug with shrink wrapping when trying to move
push-pops in a function where we are not allowed to modify the
stack layout for alignment reasons. In this bug, we failed to
propagate alignment requirement upwards in the call graph from
function A to B when: (1) there is a cycle in the call graph and
(2) the distance from A to B is greater than 1 in the call graph
and (3) there is a node in the path from A to B, not including
A or B, that does not access parameters in the stack.
(cherry picked from FBD25315977)
Summary:
This diff is a preparation for dumping the profile generated by BOLT's instrumenation on MachO.
1/ Function "bolt_instr_fini" is placed into the predefined section "__fini"
2/ In the instrumentation pass we create a symbol "bolt_instr_fini" and
replace the last global destructor with it.
This is a temporary solution, in the future we need to register bolt_instr_fini in addition to the existing destructors without dropping the last one.
(cherry picked from FBD25071864)
Summary:
Fix corner case of insertion of updated CFI with unset `PrevBB`.
Handle it in the same way as inserting past hot-cold split point.
(cherry picked from FBD24943911)
Summary:
In BinaryContext::calculateEmittedSize(), after the temporary code
emission, we have to perform a cleanup and mark all symbols used
during the emission as undefined and unregistered (so that we can emit
them again later). The cleanup is happening even for symbols that were
referenced and not defined by emitted code.
If all emitted symbols are local, there is no risk that one thread will
define a symbol while some other thread will undefine it in its cleanup
code. Such behavior is expected as local symbols can only be referenced
within the containing function and each function is processed in one
thread. However, secondary entry points have associated global symbols
and if we emit them, then it is possible for a thread to undefine
a symbol while the other thread had defined it and was in the process of
emitting the fragment with it. In such case, a data race may happen and
the thread that contains the definition of the symbol may define it
twice causing a redefinition error.
To avoid the data race, we skip the emission of secondary entry global
symbols when emitting code used only for the size estimation.
(cherry picked from FBD24986007)
Summary:
A faster and better version of function reordering:
- fixed a bug when some computed probabilities were negative;
- changed an O(n^2) loop to a priority queue to find a candidate of chains to merge
(cherry picked from FBD24571208)
Summary:
Support jump tables belonging to split fragments with entries
pointing back to parent functions.
While skipping such families of functions, make sure to use the
topmost fragment to ignore its fragments.
(cherry picked from FBD24907438)
Summary:
In a jump table identification, register an invalid offset for jump table
entries pointing to function fragments.
These invalid offsets have no effect other than padding the jump
table size, calculated as `max(OffsetEntries, Entries)`.
Correct jump table size is required in strict mode (enabled by default
in aggregation mode by `perf2bolt`) in accounting of all PC-relative
relocations in data.
Functions containing these jump tables with invalid offsets are
marked to be ignored immediately afterwards in
`populateJumpTables`.
(cherry picked from FBD24897464)
Summary:
Introduce new BinaryFunction flag `IsCanonicalCFG`, which gets
unset by SCTC pass. Make DynoStats collection conditional on this
new flag.
SCTC leaves CFG in a state where branch counters of BBs with tail
calls/conditional tail calls are not available (except via annotations,
which get stripped by `lower-annotations`). Without branch
counters, DynoStats are invalid.
(cherry picked from FBD24558050)
Summary:
Fix cold fragment name matching regex by replacing existing
regexes `.*\.cold\..*` and `.*\.cold`
and combining them into `.*\.cold(\.\d)?`,
applied to restored name (with BOLT-added suffixes stripped)
This allows matching names like "execute_stack_op.cold/1", which
previously weren't recognized.
(cherry picked from FBD24804880)
Summary:
- Allow jump table entries to point to locations inside the function and its fragments.
Reasoning behind this is that jump table identification has the logic of stopping at entry which belongs to a function different from the one originally referencing jump table. This assumption is invalid for jump tables with entries pointing to both parent function and cold fragments, leading to "unclaimed PC-relative relocations" assertion.
- Add fragment identification heuristic based on function name regex and contiguous jump table entries.
Currently, parent-to-fragment relationship is set up based on interprocedural references – direct references from the parent function. These references don't include references through jump table.
Additionally, some fragments are only reachable through jump table. In that case, in order to fully consume jump table, add parent-to-fragment relationship during `analyzeJumpTable` using the following heuristics:
1. Fragment is identified as such based on name (contains `.cold.` part), but
2. Parent function is not set – no direct interprocedural references to that fragment, and
3. Fragment has the name of the form <parent>.cold(.\d+)
* For split functions with jump table entries spanning parent and fragments, mark parent and all fragments as ignored.
(cherry picked from FBD24456904)
Summary:
For interprocedural references to fragments, record them as
fragment entry points. Not registering these entry points leads to
UCE removing the blocks and "Undefined temporary symbol"
assertion.
(cherry picked from FBD24511281)
Summary:
Some of the TLS relocatios like R_AARCH64_TLSDESC_ADR_PAGE21 must be
handled by bolt and should not be skipped by the removed condition. Some
of the TLS relocations like R_AARCH64_TLS_TPREL64 could really be skipped
here, but AFAIU this condition was added as part of BOLT its self optimization, so
to prevent future problems here my suggestion is not to add another condition
like "isTLS(RType) && isTLSRelocatable(RType)", but just remove it since
absense of this condition should not broke any other TLS relocation.
Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/103
GitHub Author: Vladislav Khmelevsky <Vladislav.Khmelevskyi@huawei.com>
(cherry picked from FBD24745928)
Summary:
Fix several issues to make C++ exceptions work in shared objects:
* Set MCObjectFileInfo PIC type based on the input binary type.
* Support indirect (DW_EH_PE_indirect) encoding while writing
exception Type Table.
* Use different LPStart value and landing pad encoding for .so's.
* Disable splitting of exception-handling code for .so's because of
the new encoding.
(cherry picked from FBD24698765)
Summary:
EliminateUnreachableBlocks has a data race because it depends
on BinaryContext::computeCodeSize. computeCodeSize supports independent
Emitters, enabling a lock-free execution. Unfortunately, that is almost
as expensive as the lock. Removing the boilerplate code for
parallellization of this pass turned out to be the best alternative: no
races and slightly better execution time for HHVM.
(cherry picked from FBD24716250)
Summary:
In BinaryContext, we had StringRef holding a reference to
an r-value std::string. This triggers clang's address sanitizer
warnings. In MCPlusBuilder we had a left shift overflowing a type,
which is undefined behavior. Similarly, in CallGraph, we had a hash
function shifting a negative value, which is also UB. The last two
triggers the UB sanitizer.
(cherry picked from FBD24661045)
Summary:
Some symbols in .dynsym will be erroneously marked as belonging to a
non-allocatable section that BOLT can remove. In that case, keep the
original invalid index for such symbols instead of setting the UNDEF
index.
(cherry picked from FBD24488677)
Summary:
Change .dot dumps filename format from
<function>-<passname>.dot
to
<function>-<passidx>_<passname>.dot
This change helps navigate dumps by making the pass order explicit.
Example:
execute_stack_op.cold.6-1(*2)-00_build-cfg.dot
execute_stack_op.cold.6-1(*2)-01_validate-internal-calls.dot
execute_stack_op.cold.6-1(*2)-02_strip-rep-ret.dot
...
(cherry picked from FBD24452903)
Summary:
While refactoring the pass, I removed the important transactional
property of the patching process. Restore it.
(cherry picked from FBD24440214)
Summary:
When -hot-text is on, do not read __hot_start and __hot_end
from input (inserted by a linker script with the intent of ordering
functions). This can confuse BOLT into creating a function with this
name depending on which address the symbol lands and we will assert
when trying to emit our own __hot_start/__hot_end with symbol
redefinition.
(cherry picked from FBD24366636)
Summary:
This diff is a preparation for loading the runtime on MachO.
The proposed schema is the following:
1/ Function "bolt_instr_setup" is placed into the predefined section "setup" (in the final setting this function will be coming from the instrumentation runtime but we still will be placing it into this section).
2/ In the instrumentation pass we create a symbol "bolt_instr_setup" and inject the corresponding call into the beginning of the function representing the entry point of the binary.
(cherry picked from FBD24329530)
Summary:
Do not store processed DWARF DIEs, but instead process them while
reading one at a time.
Reduces memory consumption when updating debug info by 10%-25%.
(cherry picked from FBD24327029)