Summary:
We extended DynoStats to dump the histogram per instruction opcode. By
default the dump is turned off. Use '-print-dyno-opcode-stats' to enable
the dump.
BOLT also dumps for each instruction opcode the maximum execution count and
corresponding function name and basic block offsets where the instruction
occurs. Below is a sample of the dump:
Opcode, Execution Count, Max Exec Count, Function Name:Offset
SHR8rCL, 232, 232, _ZNK5folly14AsyncSSLSocket4goodEv:53
VPADDDYrr, 13956, 388, chacha20_encrypt_bytes.part.0/3:736
PMOVSXBWrr, 4, 2, ares_expand_name/1:264
VMOVAPSmr, 1082, 43, chacha20_encrypt_bytes.part.0/3:2864
VPSHUFBrr, 9540, 1667, chacha20_encrypt_bytes.part.0/3:4416
VPUNPCKLDQYrr, 1102, 188, jsimd_ycc_rgb_convert_avx2/1:125
VPBROADCASTQYrm, 39, 39, chacha20_encrypt_bytes.part.0/3:400
PMOVSXWDrr, 8, 2, ares_expand_name/1:264
VPORrr, 817, 129, jsimd_idct_islow_avx2/1:41
PSLLDri, 8690752, 65644, blockmix_salsa8_xor/1:1424
(cherry picked from FBD28859624)
Summary:
A binary can contain multiple PLT sections with different name and
attributes (such as an entry size). Extend the support to .plt.sec and
refactor the code to make future extensions simpler.
(cherry picked from FBD29502107)
Summary:
clang-12 now compiles bolt without warnings.
Some warnings were fixed if possible while others were suppressed by
doing (void)variable for unused variable warnings or moving code inside
assert statements of LLVM_DEBUG blocks.
(cherry picked from FBD29469054)
Summary:
Add code to read more dynamic relocations (DT_JMPREL) and enforce strict
checks that corresponding sections sizes match .dynamic entry
description.
(cherry picked from FBD29502109)
Summary:
The code for writing out dwo files wasn't handling case where DWP is an input.
Because all the sections are part of the same binary.
One note with current implementation. .debug-str.dwo will have strings for all the dwo objects.
This is because llvm-dwp de-duplicates strings and combines them in to one section. It then re-writes .debug-str-offsets.dwo to point to new .debug-str.dwo section.
(cherry picked from FBD29244835)
Summary:
Our YAML objects contain references to dynamic relocations via .dynamic,
but there are no corresponding relocation sections. Change .dynamic
contents to specify no dynamic relocations.
(cherry picked from FBD29502108)
Summary:
Move the code that handles true external references (non-unreachable)
out of a for-loop in `BinaryFunction::disassemble`.
(cherry picked from FBD29411345)
Summary:
Handle R_X86_64_64 the same way as R_X86_64_32;
`getSizeForType` takes care of the size:
```x86_64 ABI relocation types
Name Value Field Calculation
R_X86_64_64 1 word64 S + A
R_X86_64_32 10 word32 S + A
```
(cherry picked from FBD29370417)
Summary:
When we fold a function in relocation mode, make sure to clear its state
to avoid emitting relocations against undefined symbols.
(cherry picked from FBD29245320)
Summary:
Dived more in to DWARF APIs and llvm-symbolizer this is a more streamline way of doing it, and address base gets set properly.
Writing out dwo files with dwp input will be separate patch.
(cherry picked from FBD31361529)
Summary:
When indirect call is instrmented it locks SimpleHashTable's mutex on get() call.
If while locked we we receive a signal and signal handler also will call
indirect function we will end up with deadlock.
PR facebookincubator/BOLT#167
Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
(cherry picked from FBD28909921)
Summary:
Suppresses the warning
```
src/DebugData.h:338:20: warning: 'addList' overrides a member function but is not marked 'override' [-Wsuggest-override]
```
(cherry picked from FBD28858201)
Summary:
Make bolt decode pseudo probe section in binary
For more detail of pseudo probe, check https://reviews.llvm.org/D86490.
(cherry picked from FBD28856316)
Summary:
While printing debug info for instructions, we should use line tables
from the corresponding DWARF CU which could be different from the
containing function CU in case of inlined instructions.
(cherry picked from FBD28908324)
Summary:
FBD55943 changed the way ProcessAllSections works in RuntimeDyld. After
the change, all sections, including symbol table, section table, etc.
are loaded into memory whenever ProcessAllSections is enabled.
In BOLT we rely on RuntimeDyld for processing sections with relocations.
These include most allocatable sections and additionally .debug_line.
The latter is skipped by RuntimeDyld without ProcessAllSections flag.
If we enable ProcessAllSections, we will have to deal with allocating
memory for more sections than we need (see above) and later to filter
them out.
The alternative is to mark all sections that we actually plan to use as
"required for execution" (using RuntimeDyld terminology). For
.debug_line section on ELF it means adding SHF_ALLOC flag. On MachO,
RuntimeDyld currently treats all sections as required.
(cherry picked from FBD28729398)
Summary:
This patch introduces LoopInversionPass. Its main purpose is to ensure
that the loop layout is optimal depending on the profile information. So
if profile information shows that the loop is used, the unconditional
jump instruction must be executed only once and vice-versa. Please take
a look to the pass header file and test for more details.
Also change link_fdata script a bit, to be able to change FDATA prefix,
like FileCheck does.
Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
PR facebookincubator/BOLT#153
(cherry picked from FBD28391811)
Summary:
Implemented support for Debug Fission.
For the most part it doesn't impact Monolithic execution path.
One area that was changed is the DW_AT_low_pc/DW_AT_high_pc conversion. Before it was to DW_AT_ranges/DW_AT_low_pc, now DW_AT_low_pc is kept in same place.
Another more visible impact is in Skeleton CU the DW_AT_low_pc is replaced with DW_AT_ranges_base if it's not originally present and bolt converted ranges conversion inside the dwo units.
Output of this are multiple .dwo files with updated debug information.
(cherry picked from FBD29569788)
Summary:
Remove relocations against internal function labels, e.g. jump table
relocations, only when overwriting them.
While reading an input file with relocations, we create internal
relocations against code references (we skip PIC relocations).
Later, when we discover jump tables, we remove corresponding relocations
with the assumption that original relocations will either be ignored or
replaced by new relocations. However, it is possible to miss some
references to the jump table, in which case the original entries will
not be ignored. While such situation is abnormal, it is still a
better/safer approach to preserve relocations if we are not replacing
them with new ones.
(cherry picked from FBD28406628)
Summary:
Explicit assignment operator can be replaced with an implicit one.
Remove it to allow an implicit copy constructor:
```
bolt/src/Passes/DataflowAnalysis.h:74:8: warning: definition of
implicit copy constructor for 'ProgramPoint' is deprecated because it
has a user-declared copy assignment operator [-Wdeprecated-copy]
void operator=(const ProgramPoint &PP) {
^
bolt/src/Passes/DataflowAnalysis.h:62:14: note: in implicit copy
constructor for 'llvm::bolt::ProgramPoint' first required here
return ProgramPoint(&*Last);
```
(cherry picked from FBD28335138)
Summary:
Since gcc/ld could produce and expect PIE files we need to pass -no-pie option to avoid linking errors for tests.
Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
(cherry picked from FBD28360045)
Summary:
Reorder-blocks optimization pass doesn't take into account that
available offset for legacy Jcc instructions (for example,
JRCXZ - operand 8 bits) has to be less than 255 bytes.
It's rare case and to exclude such functions with unsupported
instructions from optimization passes added extra checking
Alexey Moksyakov
Advanced Software Technology Lab, Huawei
(cherry picked from FBD28264117)
Summary:
Ran iwyu multiple times, manually picked header remove lines.
Reached fixed point wrt removal: iwyu doesn't automatically remove
any more headers or forward declarations.
(cherry picked from FBD29569221)
Summary:
Previously, we used p_align value of the code segment to predict the
mapping of the segment at runtime. However, at times the reported
value is not aligned and at other times the actual aligned value will
be different because of the different page size used.
All we know is that the page size used at runtime should not exceed
p_align value. Adjust our segment address matching accordingly.
(cherry picked from FBD28133066)
Summary:
Addressing comments from the review for "Expand auto types".
Use const reference in MCPlusBuilder for MCInstrDesc where the copy
is not necessary.
(cherry picked from FBD27844344)
Summary:
We may have a CU with empty ranges, so accept errors coming
from DWARFDie::getAddressRanges(). This happens when using tools that
selectively strip debuginfo from the binary.
(cherry picked from FBD27602731)
Summary:
Refactor SectionPatches to avoid the use of extra map and a cast
from StringRef to std::string.
cherry-picked from FBD26756560
(cherry picked from FBD27490641)
Summary:
The user may wish to run BOLT for printing statistics only
(i.e. to check that the profile is valid). Add an option to run BOLT
without writing any output file, similar to a dry run. This option
is triggered by supplying -o with "/dev/null".
(cherry picked from FBD29568632)
Summary:
During the initial indirect jump analysis, we used to assert that the
discovered jump table type matched the pattern of the corresponding
instruction sequence. E.g., for PIC jump table memory we expected the
PIC jump table instruction sequence. The assertions were too
conservative, as in the case of a mismatch we can mark the indirect jump
as having an unknown control flow. That should be sufficient to either
skip the function processing or rely on relocation information for
possible recovery of the control flow.
(cherry picked from FBD27255816)
Summary:
Fix a bug with instrumentation when trying to instrument
functions that share a jump table with multiple indirect
jumps. Usually, each indirect jump that uses a JT will have its own
copy of it. When this does not happen, we need to duplicate the jump
table safely, so we can split the edges correctly (each copy of the
jump table may have different split edges). For this to happen, we
need to correctly match the sequence of instructions that perform the
indirect jump to identify the base address of the jump table and patch
it to point to the new cloned JT. It was reported to us a case in
which the compiler generated suboptimal code to do an indirect jump
which our matcher failed to identify.
Fixesfacebookincubator/BOLT#126
(cherry picked from FBD27065579)
Summary:
Whenever BOLT encounters a data reference in code, it tries to convert
it into <Object+Offset> form. The primary reason behind this approach is
to support read-only data-reordering optimization. However, with the
current level of the linker and compiler support we don't have enough
information to always correctly restore the original <Object+Offset>.
E.g. with zero-sized symbols we have to speculate that the actual size
of the underlying object extends to the next symbol. Most of the time,
there will be an object pointed by a zero-sized symbol and even
if we are guessing incorrectly, there will be no harm in creating
references of such form.
The problem happens when there's no object corresponding to the original
symbol and the next object is an (unmarked) jump table:
A: # <- zero-sized object
.LJUMP_TABLE:
.long <entry1>
.long <entry2>
....
.LB:
.long 21
.LC:
.long 42
The jump table will be moved and all references past it (up to the next
named object) will be incorrectly updated.
We should not speculate about the size of A in a case like that and
treat all discovered data objects (and thus references) independently.
(cherry picked from FBD27005660)
Summary:
This PR introduces 2 new instrumentation options:
1. instrumentation-no-counters-clear: Discussed at https://github.com/facebookincubator/BOLT/issues/121
2. instrumentation-wait-forks: Since the instrumentation counters are mapped as MAP_SHARED it will be nice to add ability to wait until all forks of the parent process will die using tracking of process group.
The last patch is just emitBinary code refactor.
Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/125
GitHub Author: Vladislav Khmelevskyi <Vladislav.Khmelevskyi@huawei.com>
(cherry picked from FBD26919011)
Summary:
TBSS section is a "virtual" section that does not take memory or file
space. Ignore it completely while adjusting section sizes.
(cherry picked from FBD26824484)
Summary:
There is no real link between CU and TU, so relying on fact
that address are the same, and we are updating all of them.
(cherry picked from FBD28112114)
Summary:
This commit is the first step in rebasing all of BOLT
history in the LLVM monorepo. It also solves trivial build issues
by updating BOLT codebase to use current LLVM. There is still work
left in rebasing some BOLT features and in making sure everything
is working as intended.
History has been rewritten to put BOLT in the /bolt folder, as
opposed to /tools/llvm-bolt.
(cherry picked from FBD33289252)
Summary:
1. Add support for __literal16 section in the instrumentation runtime library for MacOS.
2. Fix emitting __counters section.
(cherry picked from FBD25746342)
Summary:
a few minor updates in block reordering:
- some refactoring to improve readability;
- optimized chain splitting strategy to improve quality of layout and performance of the algorithm.
(cherry picked from FBD25126220)
Summary:
When looking at perf.data's available binaries and their
respective mmap'ed segments, match them with the input binary by
looking at both aligned and non-aligned addresses. If we suppose
the alignment is the mmap'ed page size, we may miss some cases and
perf2bolt will refuse to proceed because it failed to match the
input binary with a process recorded in perf.data.
(cherry picked from FBD25732673)
Summary:
Add options for trading processing speed for binary performance.
-lite-threshold-pct=<uint>
Threshold (in percent) for selecting functions to process in lite
mode. Higher threshold means fewer functions to process.
E.g threshold of 90 means only top 10 percent of functions with
profile will be processed.
-lite-threshold-count=<uint>
Similar to '-lite-threshold-pct' but specify threshold using
absolute function call count. I.e. limit processing to functions
executed at least the specified number of times.
-no-scan
Do not scan cold functions for external references (may result in
slower binary).
(cherry picked from FBD24739092)
Summary:
This fixes a bug with shrink wrapping when trying to move
push-pops in a function where we are not allowed to modify the
stack layout for alignment reasons. In this bug, we failed to
propagate alignment requirement upwards in the call graph from
function A to B when: (1) there is a cycle in the call graph and
(2) the distance from A to B is greater than 1 in the call graph
and (3) there is a node in the path from A to B, not including
A or B, that does not access parameters in the stack.
(cherry picked from FBD25315977)
Summary:
This diff is a preparation for dumping the profile generated by BOLT's instrumenation on MachO.
1/ Function "bolt_instr_fini" is placed into the predefined section "__fini"
2/ In the instrumentation pass we create a symbol "bolt_instr_fini" and
replace the last global destructor with it.
This is a temporary solution, in the future we need to register bolt_instr_fini in addition to the existing destructors without dropping the last one.
(cherry picked from FBD25071864)
Summary:
Fix corner case of insertion of updated CFI with unset `PrevBB`.
Handle it in the same way as inserting past hot-cold split point.
(cherry picked from FBD24943911)
Summary:
In BinaryContext::calculateEmittedSize(), after the temporary code
emission, we have to perform a cleanup and mark all symbols used
during the emission as undefined and unregistered (so that we can emit
them again later). The cleanup is happening even for symbols that were
referenced and not defined by emitted code.
If all emitted symbols are local, there is no risk that one thread will
define a symbol while some other thread will undefine it in its cleanup
code. Such behavior is expected as local symbols can only be referenced
within the containing function and each function is processed in one
thread. However, secondary entry points have associated global symbols
and if we emit them, then it is possible for a thread to undefine
a symbol while the other thread had defined it and was in the process of
emitting the fragment with it. In such case, a data race may happen and
the thread that contains the definition of the symbol may define it
twice causing a redefinition error.
To avoid the data race, we skip the emission of secondary entry global
symbols when emitting code used only for the size estimation.
(cherry picked from FBD24986007)
Summary:
A faster and better version of function reordering:
- fixed a bug when some computed probabilities were negative;
- changed an O(n^2) loop to a priority queue to find a candidate of chains to merge
(cherry picked from FBD24571208)
Summary:
Support jump tables belonging to split fragments with entries
pointing back to parent functions.
While skipping such families of functions, make sure to use the
topmost fragment to ignore its fragments.
(cherry picked from FBD24907438)
Summary:
In a jump table identification, register an invalid offset for jump table
entries pointing to function fragments.
These invalid offsets have no effect other than padding the jump
table size, calculated as `max(OffsetEntries, Entries)`.
Correct jump table size is required in strict mode (enabled by default
in aggregation mode by `perf2bolt`) in accounting of all PC-relative
relocations in data.
Functions containing these jump tables with invalid offsets are
marked to be ignored immediately afterwards in
`populateJumpTables`.
(cherry picked from FBD24897464)
Summary:
Introduce new BinaryFunction flag `IsCanonicalCFG`, which gets
unset by SCTC pass. Make DynoStats collection conditional on this
new flag.
SCTC leaves CFG in a state where branch counters of BBs with tail
calls/conditional tail calls are not available (except via annotations,
which get stripped by `lower-annotations`). Without branch
counters, DynoStats are invalid.
(cherry picked from FBD24558050)
Summary:
Fix cold fragment name matching regex by replacing existing
regexes `.*\.cold\..*` and `.*\.cold`
and combining them into `.*\.cold(\.\d)?`,
applied to restored name (with BOLT-added suffixes stripped)
This allows matching names like "execute_stack_op.cold/1", which
previously weren't recognized.
(cherry picked from FBD24804880)
Summary:
- Allow jump table entries to point to locations inside the function and its fragments.
Reasoning behind this is that jump table identification has the logic of stopping at entry which belongs to a function different from the one originally referencing jump table. This assumption is invalid for jump tables with entries pointing to both parent function and cold fragments, leading to "unclaimed PC-relative relocations" assertion.
- Add fragment identification heuristic based on function name regex and contiguous jump table entries.
Currently, parent-to-fragment relationship is set up based on interprocedural references – direct references from the parent function. These references don't include references through jump table.
Additionally, some fragments are only reachable through jump table. In that case, in order to fully consume jump table, add parent-to-fragment relationship during `analyzeJumpTable` using the following heuristics:
1. Fragment is identified as such based on name (contains `.cold.` part), but
2. Parent function is not set – no direct interprocedural references to that fragment, and
3. Fragment has the name of the form <parent>.cold(.\d+)
* For split functions with jump table entries spanning parent and fragments, mark parent and all fragments as ignored.
(cherry picked from FBD24456904)
Summary:
For interprocedural references to fragments, record them as
fragment entry points. Not registering these entry points leads to
UCE removing the blocks and "Undefined temporary symbol"
assertion.
(cherry picked from FBD24511281)
Summary:
Some of the TLS relocatios like R_AARCH64_TLSDESC_ADR_PAGE21 must be
handled by bolt and should not be skipped by the removed condition. Some
of the TLS relocations like R_AARCH64_TLS_TPREL64 could really be skipped
here, but AFAIU this condition was added as part of BOLT its self optimization, so
to prevent future problems here my suggestion is not to add another condition
like "isTLS(RType) && isTLSRelocatable(RType)", but just remove it since
absense of this condition should not broke any other TLS relocation.
Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/103
GitHub Author: Vladislav Khmelevsky <Vladislav.Khmelevskyi@huawei.com>
(cherry picked from FBD24745928)
Summary:
Fix several issues to make C++ exceptions work in shared objects:
* Set MCObjectFileInfo PIC type based on the input binary type.
* Support indirect (DW_EH_PE_indirect) encoding while writing
exception Type Table.
* Use different LPStart value and landing pad encoding for .so's.
* Disable splitting of exception-handling code for .so's because of
the new encoding.
(cherry picked from FBD24698765)
Summary:
EliminateUnreachableBlocks has a data race because it depends
on BinaryContext::computeCodeSize. computeCodeSize supports independent
Emitters, enabling a lock-free execution. Unfortunately, that is almost
as expensive as the lock. Removing the boilerplate code for
parallellization of this pass turned out to be the best alternative: no
races and slightly better execution time for HHVM.
(cherry picked from FBD24716250)
Summary:
In BinaryContext, we had StringRef holding a reference to
an r-value std::string. This triggers clang's address sanitizer
warnings. In MCPlusBuilder we had a left shift overflowing a type,
which is undefined behavior. Similarly, in CallGraph, we had a hash
function shifting a negative value, which is also UB. The last two
triggers the UB sanitizer.
(cherry picked from FBD24661045)
Summary:
Some symbols in .dynsym will be erroneously marked as belonging to a
non-allocatable section that BOLT can remove. In that case, keep the
original invalid index for such symbols instead of setting the UNDEF
index.
(cherry picked from FBD24488677)
Summary:
Change .dot dumps filename format from
<function>-<passname>.dot
to
<function>-<passidx>_<passname>.dot
This change helps navigate dumps by making the pass order explicit.
Example:
execute_stack_op.cold.6-1(*2)-00_build-cfg.dot
execute_stack_op.cold.6-1(*2)-01_validate-internal-calls.dot
execute_stack_op.cold.6-1(*2)-02_strip-rep-ret.dot
...
(cherry picked from FBD24452903)
Summary:
While refactoring the pass, I removed the important transactional
property of the patching process. Restore it.
(cherry picked from FBD24440214)
Summary:
When -hot-text is on, do not read __hot_start and __hot_end
from input (inserted by a linker script with the intent of ordering
functions). This can confuse BOLT into creating a function with this
name depending on which address the symbol lands and we will assert
when trying to emit our own __hot_start/__hot_end with symbol
redefinition.
(cherry picked from FBD24366636)
Summary:
This diff is a preparation for loading the runtime on MachO.
The proposed schema is the following:
1/ Function "bolt_instr_setup" is placed into the predefined section "setup" (in the final setting this function will be coming from the instrumentation runtime but we still will be placing it into this section).
2/ In the instrumentation pass we create a symbol "bolt_instr_setup" and inject the corresponding call into the beginning of the function representing the entry point of the binary.
(cherry picked from FBD24329530)
Summary:
Do not store processed DWARF DIEs, but instead process them while
reading one at a time.
Reduces memory consumption when updating debug info by 10%-25%.
(cherry picked from FBD24327029)
Summary:
When placing restore instructions in the shrink wrapping pass,
we typically put them right before the last instruction of a block at
the dominance frontier. If this instruction happened to have a prefix,
because the MC lib separates prefix into separate MCInsts, we would
accidentally put a load between a prefix and another instruction. Fix
this.
(cherry picked from FBD24295324)
Summary:
Add first bits to support emitting instrumented code on MachO.
This diff enables us to instrument branches / emit counters.
(cherry picked from FBD24255164)
Summary:
On targets that support it, emit size of the emitted function symbol.
At the moment there's no use for the size except that it is visible in a
temporary .o file symbol table.
(cherry picked from FBD24246177)
Summary: _end is "defined" but its address doesn't belong to any section. This diff adds special handling for this symbol.
(cherry picked from FBD24249120)
Summary:
Append ".cold.0" suffix to the original part of the name, such that
"foo/1" becomes "foo.cold.0/1" instead of "foo/1.cold.0".
(cherry picked from FBD24246112)
Summary:
At the moment we are not using PatchEntries pass in non-relocation mode
on ELF. However, we will use it on MachO.
(cherry picked from FBD24235271)
Summary: Add ToolPath field to MachORewriteInstance. This will enable us to locate the runtime library relative to the tool's location.
(cherry picked from FBD24183448)
Summary:
Do not mix relocation codes from different archs. Even though
they do not intersect at the moment, this could easily introduce bugs
once new relocations are supported (for example, ILP32 for AArch64).
(cherry picked from FBD24169425)
Summary:
This diff adds a command line option to disable the check of overlapping elements in Mach-O parsing. This check in its current form is prohibitively expensive for large binaries.
A long-term fix would be to reimplement the check in a more efficient manner (and contribute it to the upstream).
(cherry picked from FBD24109468)
Summary:
In analyzeRelocations, we extract the result of the relocation
from binary code to recreate the target of it in a few special cases.
For R_X86_64_32S relocations, however, we were neglecting the
possibility of the encoded value in the instruction to be negative.
(cherry picked from FBD24096347)
Summary:
This patch fixes the assertion failure during instrumentation.
The assertion is raised by `getInstructionAtOffset` , which expects `CurrentState` to be either `Disassembled` or `CFG`.
The function is called from `postProcessEntryPoints`, which goes over Labels and performs a series of checks. The checks call BinaryFunction methods `setSimple(false)` or `setIgnored()`.
However, if `setIgnored` is invoked, it resets the state to `Empty`. Thus subsequent call to `getInstructionAtOffset` will fail.
(cherry picked from FBD24005197)
Summary:
Enable initial support for reading and patching special Linux kernel sections.
Author: Tanvir Ahmed Khan <takh@fb.com>
GitHub Author: takhandipu
(cherry picked from FBD22998869)
Summary:
Whenever we search for a function based on its address in the input
binary, we now always return a corresponding fragment for split
functions. If the user needs an access to the main fragment, they can
call getTopmostFragment().
(cherry picked from FBD23670311)
Summary:
Sections that do not originate from the input binary will have an
input address set to zero and thus do not have to be mapped.
Mapping such sections caused a build time regression in non-relocation
mode.
(cherry picked from FBD23670334)
Summary:
Fix issue with splitting critical edges originating at
the same BB in ShrinkWrapping::splitFrontierCritEdges.
Splitting of critical edges originating at the same FromBB
wasn't handled correctly as the Frontier at index corresponding
to FromBB was overwritten with basic blocks created for
multiple DestinationBBs.
(cherry picked from FBD23232398)
Summary:
Right now, if activity is recorded in cold parts, we write to
the .fdata file the ".cold" name instead of the correct name of the
function. Fix this.
(cherry picked from FBD23148705)
Summary:
When the input file is processed by BOLT, we cannot save profile in YAML
format as it requires CFG representation of functions.
(cherry picked from FBD22941794)
Summary:
Add first bits to support emitting more than 255 sections.
Update llvm.patch to include the changes in MachOObjectFile.cpp.
(cherry picked from FBD22655053)
Summary:
Some LBR traces contain less than 16/32 entries. When the first trace is
less than the standard length, we fail to enable SKL bug workaround with
BAT mode enabled. The result is a bad profile from perf2bolt. The fix
is to check the length of every trace, not just the first one.
Issue facebookincubator/BOLT#94
(cherry picked from FBD22917971)
Summary:
Added execution count threshold option (execution-count-threshold) controlling the optimizations that are sensitive to the accuracy of the profiling data:
- BB reordering
- function splitting
- frame opts
- shrink wrapping
- indirect call promotion
(cherry picked from FBD22682171)
Summary:
Right now, the SAVE_ALL sequence executed upon entry of both
of our runtime libs (hugify and instrumentation) will cause the stack to
not be aligned at a 16B boundary because it saves 15 8-byte regs. Change
the code sequence to adjust for that. The compiler may generate code
that assumes the stack is aligned by using movaps instructions, which
will crash.
(cherry picked from FBD22744307)
Summary:
If no profile data is provided, but only a user-provided order
file for functions, fix the placement of the __hot_end symbol.
(cherry picked from FBD22713265)
Summary: Added handling of intra-function calls (internal control transfer by call instruction) to instrumentOneTarget
(cherry picked from FBD22606033)
Summary:
Do not fail/assert when trying to reorder blocks that terminate
with JRCXZ/JECXZ/LOOP instructions. We cannot invert the condition of
these instructions, so just treat them accordingly in fixBranches().
(cherry picked from FBD22487107)
Summary:
We've accidentally registered TBSS section address with a BinaryContext
resulting in addresses being attributed to it when
getSectionForAddress() was called.
(cherry picked from FBD22369323)
Summary:
Reading through the LLVM coding standard again, realized a few places where I didn't follow the standard when coding. Addressing them:
1. prefer static functions over functions in unnamed namespace.
2. #include as little as possible in headers
3. Have vtable anchors.
(cherry picked from FBD22353046)
Summary:
R_X86_64_PLT32 relocations recorded by the linker may point to the PLT
section instead of being resolved to the symbol reported by the
relocation. Sometimes they could point to the symbol too. Disable
internal verification for this type of relocation.
Include a fix for symbol address calculation when it is based on the
extracted value. The truncation to the relocation size is needed if
the results overflows.
(cherry picked from FBD22317952)
Summary:
Re-add tests removed because they used to depend on yaml2obj.
Rewrite them with an assembler (llvm-mc) and use the system linker to
produce a valid ELF as input to BOLT.
(cherry picked from FBD22323449)
Summary:
When adding symbols for patched functions, we may end up emitting
multiple symbols per function if the function has multiple names (e.g.
after identical code folding by the linker).
(cherry picked from FBD22294112)
Summary:
Accept binaries without dynamic section/segment as a valid input.
Modify the check for invalid debug info "executables" that are result of
running "objcopy --only-keep-debug". Instead of checking for an empty
dynamic segment, check that ".text" is mapped into a valid segment.
Move SegmentMapInfo inside BinaryContext.
Fixesfacebookincubator/BOLT#91
Temporarily removing issue*.test tests that use yaml2obj and operate on
fake binaries.
(cherry picked from FBD22271481)
Summary:
While aggregating perf.data events, even in strict mode, there is no
need to process all functions since we are not generating an output
binary. However, it's still important to convert data for as many
functions as possible, even for ones with unknown internal control flow.
(cherry picked from FBD22248390)
Summary:
lld linker may emit static relocations against addresses that also have
dynamic relocations associated with them. When this happens, BOLT fails
to validate the extracted value at the address.
Read dynamic relocations in the binary and ignore static relocations at
addresses that have a duplicate dynamic relocation.
(cherry picked from FBD22192345)
Summary:
getNewValueForSymbol() uses orc::RTDyldObjectLinkingLayer::findSymbol()
to resolve symbol values. The latter will always return JITSymbol,
even if there was no symbol defined. The address for the undefined
symbol will be zero, but some symbols could legally be resolved to zero
too.
We need to distinguish between real zero-valued symbols and symbols that
were not emitted and are not visible by orc::RTDyldObjectLinkingLayer.
If zero address is returned by ORC, check for a binary data with the
same name and use its address for the symbol resolution.
(cherry picked from FBD22175269)
Summary:
If we detect an internal function reference from code outside of the
function, then create an entry point at that location.
(cherry picked from FBD22169337)
Summary:
In my previous commit, I've accidentally reverted the condition while
evaluating a branch target.
Also, do not emit instruction for relocation purposes in
scanExternalRefs() if there was no TargetSymbol set and we have not
produced new relocations.
(cherry picked from FBD22169317)
Summary:
We may end up with a secondary entry symbol set to zero if there was no
symbol in the input file at the entry point address, and if we skipped
the function emission, e.g. if it was ignored. In that case, the symbol
should be properly initialized with a proper address.
(cherry picked from FBD22169167)
Summary:
Add '-lite' support for relocations for improved processing time,
memory consumption, and more resilient processing of binaries with
embedded assembly code.
In lite relocation mode, BOLT will skip full processing of functions
without a profile. It will run scanExternalRefs() on such functions
to discover external references and to create internal relocations
to update references to optimized functions.
Note that we could have relied on the compiler/linker to provide
relocations for function references. However, there's no assurance
that all such references are reported. E.g., the compiler can resolve
inter-procedural references internally, leaving no relocations
for the linker.
The scan process takes about <10 seconds per 100MB of code on modern
hardware. It's a reasonable overhead to live with considering the
flexibility it provides.
If BOLT fails to scan or disassemble a function, .e.g., due to a data
object embedded in code, or an unsupported instruction, it enables a
patching mode to guarantee that the failed function will call
optimized/moved versions of functions. The patching happens at original
function entry points.
'-skip=<func1,func2,...>' option now can be used to skip processing of
arbitrary functions in the relocation mode.
With '-use-old-text' or '-strict' we require all functions to be
processed. As such, it is incompatible with '-lite' option,
and '-skip' option will only disable optimizations of listed
functions, not their disassembly and emission.
(cherry picked from FBD22040717)
Summary:
In some cases we install bolt binary into one level deeper in bin/, such as bin/install/, we need to go back one more level to find lib directory.
(cherry picked from FBD22070974)
Summary:
This diff handles several challenges related to heatmap generation for Linux kernel (vmlinux elf file):
- If the input binary elf file contains the section `__ksymtab`, this diff assumes that this is the linux kernel `vmlinux` file and enables an extra flag `LinuxKernelMode`
- In `LinuxKernelMode`, we only support heat map generation right now, therefore it ensures that current BOLT mode is heat map generation. Otherwise, it exits with error.
- For some Linux symbol and section combinations, BOLT may not be able to find section for symbol (specially symbols that specifies the end of some section). For such cases, we show an warning message without exiting which was the previous behavior.
- Linux kernel elf file does not contain dynamic section, therefore, we don't exit when no dynamic section is found for linux kernel binary.
- Current `ParseMMap` logic does not work with linux kernel. MMap entries for linux kernel uses `PERF_RECORD_MMAP` format instead of typical `PERF_RECORD_MMAP2` format. Since linux kernel address mapping is absolute (same as specified in the ELF file), we avoid calling `ParseMMap` in linux kernel mode.
- Linux kernel entries are registered with PID -1, therefore `BinaryMMapInfo` lookup is not required for linux kernel entries. Similarly, `adjustLBR` is also not required.
- Default max address in linux kernel mode is highest unsigned 64-bit integer instead of current 4GBs.
- Added another new parameter for heatmap, `MinAddress`, in case of Linux kernel mode which is `KernelBaseAddress`, otherwise, it is 0. While registering Heatmap sample counts from LBR entries, any address lower than this `MinAddress` is ignored.
- `IgnoreInterruptLBR` is disabled in linux kernel mode to ensure that kernel entries are processed
Currently, linux kernel heat map also include heat map for Linux kernel modules that are not part of vmlinux elf file. This is intentional to identify other potential optimization opportunities. If reviewers think, those modules should be omitted, I will disable those modules based on highest end address of a vmlinux elf section.
(cherry picked from FBD21992765)
Summary:
Under some conditions, e.g. while running in lite mode or when a
function is non-simple, BOLT may decide not to emit function code and
hence there's no need to update the symbol. However, since we change
section table, the corresponding section index may need an update.
Also, update section index for ICF symbols.
(cherry picked from FBD21970017)
Summary:
This patch enables automated hugify for Bolt.
When running Bolt against a binary with -hugify specified, Bolt will inject a call to a runtime library function at the entry of the binary. The runtime library calls madvise to map the hot code region into a 2M huge page. We support both new kernel with THP support and old kernels. For kernels with THP support we simply make a madvise call, while for old kernels, we first copy the code out, remap the memory with huge page, and then copy the code back.
With this change, we no longer need to manually call into hugify_self and precompile it with --hot-text. Instead, we could simply combine --hugify option with existing optimizations, and at runtime it will automatically move hot code into 2M pages.
Some details around the changes made:
1. Add an command line option to support --hugify. --hugify will automatically turn on --hot-text to get the proper hot code symbols. However, running with both --hugify and --hot-text is not allowed, since --hot-text is used on binaries that has precompiled call to hugify_self, which contradicts with the purpose of --hugify.
2. Moved the common utility functions out of instr.cpp to common.h, which will also be used by hugify.cpp. Added a few new system calls definitions.
3. Added a new class that inherits RuntimeLibrary, and implemented the necessary emit and link logic for hugify.
4. Added a simple test for hugify.
(cherry picked from FBD21384529)
Summary:
As we are adding more types of runtime libraries, it would be better to move the runtime library out of RewriteInstance so that it could grow separately. This also requires splitting the current implementation of Instrumentation.cpp to two separate pieces, one as normal Pass, one as the runtime library. The Instrumentation Pass would pass over the generated data to the runtime library, which will use to emit binary and perform linking.
This patch does the following:
1. Turn Instrumentation class into an optimization pass. Register the pass in the pass manager instead of in RewriteInstance.
2. Split all the data that are generated by Instrumentation that's needed by runtime library into a separate data structure called InstrumentationSummary. At the creation of Instrumentation pass, we create an instance of such data structure, which will be moved over to the runtime at the end of the pass.
3. Added a runtime library member to BinaryContext. Set the member at the end of Instrumentation pass.
4. In BinaryEmitter, make BinaryContext to also emit runtime library binary.
5. Created a base class RuntimeLibrary, that defines the interface of a runtime library, along with a few common helper functions.
6. Created InstrumentationRuntimeLibrary which inherits from RuntimeLibrary, that does all the work (mostly copied over) for emit and linking.
7. Added a new directory called RuntimeLibs, and put all the runtime library related files into it.
(cherry picked from FBD21694762)
Summary:
take_front() is a const member of StringRef. Calling it does nothing.
This suggests that this line of code is useless, deleting it.
But it's good to double check, what was the original intention here?
(cherry picked from FBD21697637)
Summary:
This diff handles several issues related to profile reading and
handling:
* Unifies interface used by 3 profile readers in ProfileReaderBase.
* Adds automatic detection of the profile file contents.
* Removes reader-specific fields from BinaryFunction and BinaryData.
All the information is stored in instruction annotations.
* Removes implicit memory dependencies in annotations on profile
reader instance.
* Adds lite mode support to YAML reader.
* Moves profile reading code out of BinaryFunction.
(cherry picked from FBD21601411)
Summary:
IndirectCallProfile was holding to a StringRef from a profile reader
providing an implicit dependency on the reader.
(cherry picked from FBD21587101)
Summary:
Add a dummy option in BOLT to allow us to write any string in
the bolt info section. This is accomplished since we record the complete
argv vector. This string used to tag this binary with any ID that can
later be associated with a specific BOLT invocation.
(cherry picked from FBD21441902)
Summary:
We use a special routine to emit line info for functions that we do not
overwrite. The resulting DWARF was not quite efficient as we were
advancing addresses using a larger than needed opcodes. Since there were
only a few functions that we didn't emit/overwrite, it was not a big
issue.
However, in lite mode the majority of functions are not overwritten and
as a result, the inefficiency in debug line encoding got exposed and
binaries were getting larger than expected .debug_line sections.
Fix it by using more conventional line table opcodes for address
advancing.
(cherry picked from FBD21423074)
Summary: We only support linking ELF runtime library right now. If the library is an archiver, we check that each individual library inside the archiver is an ELF library.
(cherry picked from FBD21388672)
Summary:
When optimizing a binary without relocations, we can skip processing
functions without profile (cold functions). By skipping processing of
cold functions, we reduce the processing time and memory. We call
such mode a lite mode, and it is enabled by default.
Some processing is still done for functions without profile even in lite
mode. scanExternalRefs() function is used to detect secondary entry
points to functions that are not marked in the symbol table.
Note that the no-relocation requirement is a temporary limitation
of the lite mode.
(cherry picked from FBD21366567)
Summary:
Whenever a function is not meant for processing, e.g. when the user
requests to optimize only a subset of functions, mark the function as
ignored. Use this attribute instead of opts::shouldProcess().
(cherry picked from FBD21374806)
Summary:
The commit that fixed ICF determinism in non-relocation mode disabled
profile merging for functions. Dyno stats output needs to be updated to
reflect the lack of merge.
(cherry picked from FBD21366046)
Summary:
In non-relocation mode, the code for marking a function non-simple was
decoupled from the code that added new entry points. Fix that.
(cherry picked from FBD21264247)
Summary:
Some functions could be called at an address inside their function body.
Typically, these functions are written in assembly as C/C++ does not
have a multi-entry function concept. The addresses inside a function
body that could be referenced from outside are called secondary entry
points.
In BOLT we support processing functions with secondary/multiple entry
points. We used to mark basic blocks representing those entry points
with a special flag. There was only one problem - each basic block has
exactly one MCSymbol associated with it, and for the most efficient
processing we prefer that symbol to be local/temporary. However, in
certain scenarios, e.g. when running in non-relocation mode, we need
the entry symbol to be global/non-temporary.
We could create global symbols for secondary points ahead of time when
the entry point is marked in the symbol table. But not all such entries
are properly marked. This means that potentially we could discover an
entry point only after disassembling the code that references it, and
it could happen after a local label was already created at the same
location together with all its references. Replacing the local symbol
and updating the references turned out to be an error-prone process.
This diff takes a different approach. All basic blocks are created with
permanently local symbols. Whenever there's a need to add a secondary
entry point, we create an extra global symbol or use an existing one
at that location. Containing BinaryFunction maps a local symbol of a
basic block to the global symbol representing a secondary entry point.
This way we can tell if the basic block is a secondary entry point,
and we emit both symbols for all secondary entry points. Since secondary
entry points are quite rare, the overhead of this approach is minimal.
Note that the same location could be referenced via local symbol from
inside a function and via global entry point symbol from outside.
This is true for both primary and secondary entry points.
(cherry picked from FBD21150193)
Summary:
Add an option to fail processing of the input binary if the profile
is not accurate:
-stale-threshold=<uint>
- maximum percentage of stale functions to tolerate (default: 100)
Default (100) means never to fail.
A function profile is considered stale if any branch in its profile
has invalid source or destination.
Use `-stale-threshold=0` to fail if any staleness is detected in the
profile.
(cherry picked from FBD21189036)
Summary:
In relocation mode, there is no use for old .eh_frame section. Moreover,
it can interfere with new EH frames via .eh_frame_hdr when the original
.text is reused.
(cherry picked from FBD21120070)
Summary:
In non-relocation mode, make sure we emit extra symbols for a folded
function even if the function was not overwritten due to its large
size.
(cherry picked from FBD21080467)
Summary:
In a rare case, we may fold a function and fail to emit it in
non-relocation mode due to a function size increase. At the same time,
the function that the original function was folded into could have been
successfully emitted, e.g. because it was split in the presence of a
profile information.
Later, because the function was not emitted, we have to use its original
.eh_frame entry in the preserved .eh_frame section. However, that entry
is no longer referencing the original function, but the function that
the original was folded into. This happens since the original symbol gets
emitted at the other function location. As a result, .eh_frame entry for
the folded function is missing.
To prevent incorrect update of the original .eh_frame, create
relocations against absolute values. This guarantees preservation of the
section contents while updating pc-relative references.
(cherry picked from FBD21061130)
Summary:
RuntimeDyldImpl::resolveExternalSymbols() some time ago used to call
getSymbolAddress() while in the second loop. That call could have
modified the contents of ExternalSymbolRelocations that the loop was
iterating over. Thus the code was written in a way that erased the
processed entry on every loop iteration and reset the map
iterator. With large number of entries in ExternalSymbolRelocations the
loop code becomes a performance bottleneck.
Since getSymbolAddress() is no longer used, the
ExternalSymbolRelocations could be iterated in a straightforward way and
the map cleared before the function exit.
(cherry picked from FBD21057058)
Summary:
Indirect calls that use RSP to compute the target address would
break in instrumentation mode because we were adding instructions that
changed the stack pointer. Fix this.
(cherry picked from FBD20883791)
Summary:
Further speedup ICF by applying stricter rules for congruent functions.
While checking symbolic operands in congruent functions, consider
operands congruent only if they are equal or reference functions
with identical hashes, i.e. potentially foldable functions.
Note that jump table operands are handled as a special case.
(cherry picked from FBD20912054)
Summary:
Too many hash collisions may cause ICF to run slowly.
We used to hash BinaryFunction only looking at instruction opcodes,
ignoring instruction operands. With many almost identical functions,
such approach may lead to long ICF processing time. By including
operands into the hash, we reduce the number of collisions and
improve the runtime often by a factor of 2 or more.
(cherry picked from FBD20888957)
Summary:
ICF may fold functions in arbitrary order when running multi-threaded.
This is fine in relocation mode as we end up with just one function
holding all function symbols.
However, in non-relocation mode we keep all function bodies, and if we
keep merging profiles in non-deterministic order, we end up with
functions with non deterministic profiles. The fix for non-relocation
mode is to not merge profiles as the factual new profile could be
different from the merged one since both function instances are
potentially callable.
Additionally, emit extra symbols for ICF functions in non-relocation
mode to make it possible to track the folding.
(cherry picked from FBD20889866)
Summary:
Some functions may have exactly the same code and exception handlers.
However, their action tables could be different leading to mismatching
semantics. We should verify their equivalence while running ICF.
(cherry picked from FBD20889035)
Summary:
The version of LLVM that we are based on lacks the support for base
address in DWARF location lists. Add the missing pieces.
(cherry picked from FBD20640784)
Summary:
Make ELF symbol table rewriting code more structured. While at it,
remove symbols from non-allocatable sections.
(cherry picked from FBD20243386)
Summary:
Consolidate code and data emission code in ELF-independent
BinaryEmitter. The high-level interface includes only two
functions emitBinaryContext() and emitFunctionBody() used
by RewriteInstance and BinaryContext respectively.
(cherry picked from FBD20332901)
Summary:
This is a prerequisite for larger emitter refactoring.
Since .dynamic is read unconditionally, add an error message if the
section is missing, or the size of the section is zero.
(cherry picked from FBD20331735)
Summary:
Shrink wrapping has a mode where it will directly move push
pop pairs, instead of replacing them with stores/loads. This is an
ambitious mode that is triggered sometimes, but whenever matching with
a push, it would operate with the assumption that the restoring
instruction was a pop, not a load, otherwise it would assert. Fix this
assertion to bail nicely back to non-pushpop mode (use regular store and
load instructions).
(cherry picked from FBD20085905)
Summary:
Change our X86 target to use long nops by default. In general,
BOLT does not put nops into the instruction stream that is going to be
executed, since it doesn't align basic blocks, only functions. Since we
rebased BOLT, our relationship with MCAssembler changed because it
stopped using multibyte nops and we never needed to revisit that. But it
makes a difference if we want to mitigate perf issues with the Intel
JCC erratum, since the nops inserted are going to be decoded and
executed. To make MCAssembler emit long nops again, we need to explictly
set mattr (Features) of the X86 target.
(cherry picked from FBD19987277)
Summary: Start adding initial bits for MachO, this diff contains some small preparations for finding functions inside a MachO binary, this will be done in the next diff. The concept of a section in the MachO world is quite different from ELF, nevertheless, for functions for now it more or less fits into the current picture (in BOLT), but things will diverge more significantly a bit later.
(cherry picked from FBD19648161)
Summary:
There are two peephole subpasses, remove-double-jumps and
remove-useless-conditional-branches, that operates by reading branches
directly, which makes them tricky to run before fix-branches. In the
case of remove-double-jumps, it will even lead to suboptimal code if
the patched branch was going to be removed by fix-branches when the
target is the fall-through. If the final target is a tail call, it will
lead to a broken CFG in the worst case. Fix this by moving these passes
after SCTC, which already produces CFGs with conditional tail calls.
(cherry picked from FBD18795592)
Summary:
This diff ports reviews.llvm.org/D70157 to our LLVM tree, which
makes the integrated assembler able to align X86 control-flow changing
instructions in a way to reduce the performance impact of the ucode
update on Intel processors that implement the JCC erratum mitigation.
See white paper "Mitigations for Jump Conditional Code Erratum" by Intel
published November 2019.
To port this patch, I changed classifySecondInstInMacroFusion to analyze
instruction opcodes directly instead of analyzing the CondCond operand
(in more recent versions of LLVM, all conditional branches share the
same opcode, but with a different conditional operand). I also pulled to
our tree Alignment.h as a dependency, and the macroop analyzing helpers.
x86-align-branch-boundary and -x86-align-branch are the two flags that
control nop insertion to avoid disabling the decoder cache, following
the original patch. In BOLT, I added the flag
x86-align-branch-boundary-hot-only to request the alignment to only be
applied to hot code, which is turned on by default. The reason is
because such alignment is expensive to perform on large modules, but if
we limit it to hot code, the relaxation pass runtime becomes tolerable.
(cherry picked from FBD19828850)
Summary:
In this diff we refactor the code around getting the original binary encoding of function's body.
The main changes are: remove BinaryContext::getFunctionData, remove the parameter of the method BinaryFunction::disassemble, introduce BinaryFunction::getData.
(cherry picked from FBD19824368)
Summary:
In strict mode, a jump table with targets generated by
builtin_unreachable (located at the very end of the function) was
asserting when being recreated by postProcessIndirectBranches. Fix
this.
(cherry picked from FBD19614981)
Summary:
BinaryFunction used to have a list of Names associated with its main
entry point. However, the function is primarily identified by its
corresponding symbol or symbols, and these symbols are available as we
are creating them for a corresponding BinaryData object.
There's also no reason to emit symbols for alternative function names
(aliases), so change the code to only emit needed symbols.
When we emit a cold fragment for a function, only emit one cold symbol
for the fragment instead of one per every main entry symbol/name.
When we match a symbol to an entry point in the function, with this
change we can first go through the list of main entry symbols (now that
they are available).
(cherry picked from FBD19426709)
Summary:
1. Move createBinaryContext to BinaryContext.
1. Add support for nonlinux triples in createBinaryContext.
2. Remove unnecessary std::move in DWARFRewriter.cpp.
(cherry picked from FBD19421314)
Summary:
Call postProcessEntryPoints only after all functions have been
disassembled and all interprocedural references have been processed,
when all possible entry points have been accounted for. This makes our
detection of bad entries more robust as it does not depend on the order
of the functions any more.
(cherry picked from FBD19404767)
Summary:
When the validation of instruction encoding fails but we are able to
continue processing the binary, do no report an error. Report encoding
format only under `-v=1`.
(cherry picked from FBD19376531)
Summary:
For BinaryData, we used to maintain a vector of StringRef names and also
a vector of pointers to MCSymbol's associated with the data. There was
an unnecessary duplication of information and an associated overhead of
keeping it in sync. Fix it by removing Names and using Symbols wherever
Names were used.
Also merge two variants of registerNameAtAddress() and remove
unreachable/dead code in the process.
(cherry picked from FBD19359123)
Summary:
"Fix symbol table entries for secondary entries" diff broke the inliner.
Fix the breakage and make the discovery of secondary entry points more
accurate.
Add ability to BinaryContext::getFunctionForSymbol() to return an entry
point discriminator and use it instead of calling getEntryForSymbol()
and isSecondaryEntry(). This is the preferred way since
getFunctionForSymbol() is thread-safe.
(cherry picked from FBD19295983)
Summary:
Commit "Support full instrumentation" changed the map
SymbolToFunction in BinaryContext to map secondary entries of functions
too. This introduced unexpected behavior in our symbol table rewriting
logic, which caused it to mistakenly write them with the address of the
original function. Fix the behavior of getBinaryFunctionAtAddress to
correct this. Also fix other users of SymbolToFunction to ensure they
are not accidentally using secondary entries when they shouldn't.
(cherry picked from FBD19168319)
Summary:
Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done.
The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end.
(cherry picked from FBD18153069)
Summary:
When perf tool reports a mapping address of a binary, it is not always
the address of the first loadable segment we were checking against.
As a result, perf2botl was not working properly for binaries where the
first segment was not executable.
The fix is to check if the address reported by mmap event matches any
of the loadable segments. Note that the segment alignment has to be
applied to get real loadable address of the segment.
Fixesfacebookincubator/BOLT#65
(cherry picked from FBD19146419)
Summary:
Add full instrumentation support (branches, direct and
indirect calls). Add output statistics to show how many hot bytes
were split from cold ones in functions. Add -cold-threshold option
to allow splitting warm code (non-zero count). Add option in
bolt-diff to report missing functions in profile 2.
In instrumentation, fini hooks are fixed to run proper finalization
code after program finishes. Hooks for startup are added to setup
the runtime structures that needs initilization, such as indirect call
hash tables.
Add support for automatically dumping profile data every N seconds by
forking a watcher process during runtime.
(cherry picked from FBD17644396)
Summary:
-icp-top-callsites selects candidates for optimization until a
threshold is met. Currently, this parameter is set to 99% of calls by
default. The order of functions evaluated changes in parallel mode,
thus the functions that may be included to satisfy 99% of all calls may
change, leading to different optimization decisions when running in
parallel versus sequential.
Fix this by enabling optimizations for all branches with the same
frequency once we reach our budget instead of cutting off immediatelly
after our budget is satisfied. In that way, order of functions has no
impact on which functions are optimized.
(cherry picked from FBD18902239)
Summary:
Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done.
The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end.
(cherry picked from FBD18153069)
Summary:
Some processes can mmap the main binary for the purpose of
introspection. We should ignore such mmap events for fixed-address
binaries. For PIC binaries, we record the mapping and do the address
filtering later for all sample events.
(cherry picked from FBD18844314)
Summary:
The condition `DebugRangesOffset == -1U` can never happen since DebugRangesOffset has type `uint64_t` and the value always comes from `RangesSectionWriter->addRanges` which gets its value from `DebugRangesSectionWriter.SectionOffset` which has type `uint32_t`. The condition seems to be left over from a time where something was using `-1` as an error value.
I'm removing that check so I can use `-1` as a tag to refer to the empty range that will be at the beginning of the ranges section.
(cherry picked from FBD18153119)
Summary: The `.debug_aranges` section is already deterministic and is logically separate from the `.debug_ranges` section so separate them into separate classes so that it will be easier to make DebugRangesSectionsWriter deterministic
(cherry picked from FBD18153057)
Summary:
This fixes a bug which causes the debug_info and debug_loc sections to be unreadable by readelf/objdump.
Basically, we're using 12 bytes of a ULEB128 value to fill in space, but readelf can't read more than 9 bytes of ULEB128. Thus, we replace that value with a string of 'a' instead.
(cherry picked from FBD18097728)
Summary:
Avoid asserting on inputs that are shared libraries with
R_X86_64_64 static relocs and RELATIVE dynamic relocations matching
those. Our relocation checking mechanism would expect the result of
the static relocation to be encoded in the binary, but the linker
instead puts it as an addend in the RELATIVE dyn reloc.
Also fix aggregation for .so if the executable segment is not the
first one in the binary.
(cherry picked from FBD18651868)
Summary:
When combining icp=calls and shrink wrapping, the former may
generate empty BBs that are going to trigger a bug in shrink wraping
restore placement strategy. The restore is wrongly pushed to the BB
successor instead of being added to the current block. Add a pass to
go over the CFG to fix empty blocks by adding a temporary NOP
instruction that is going to be deleted later. Empty BBs are not
supported by one of the analysis done at this pass.
(cherry picked from FBD18717994)
Summary:
If -trap-avx512 option is not set, verify that we correctly encode
AVX-512 instructions and treat them as ordinary instructions.
(cherry picked from FBD18666427)
Summary:
There is no need to support existing functionality of adding entry
points after the CFG is built as the function is only called in empty or
disassembled state. Previously we used to run disassemble+buildCFG per
function, but now these phases are decoupled.
Also, remove a couple of redundant checks.
(cherry picked from FBD18622822)
Summary:
We only use locations of PC relocations and ignore the rest of the data.
There's no need to store type and value.
(cherry picked from FBD18623280)
Summary:
Speeding up cache+/ext-tsp block reordering algorithm.
On a high-level, the speedup is achieved by:
- precomputing and memorizing all jumps between a pair of chains
(instead of extracting them on every merge iteration);
- using a cache of size O(|E|) instead of O(|V|^2) as in previous version.
The final output is identical to previous one subject to a new deterministic
comparison of double values.
(cherry picked from FBD18380870)
Summary:
When we disassemble functions, we add discovered jump tables to a global
container in BinaryContext. Later, we analyze and verify all jump
tables. However, analysis for non-simple functions might fail for numerous
reasons, e.g. there would be no instruction at a destination. Since we
are not overwriting non-simple functions, it is not a critical error.
Thus, we can safely skip jump table analysis for non-simple functions.
(cherry picked from FBD18422997)
Summary:
Free more lists in BinaryFunction::releaseCFG().
Release BinaryFunction::Relocations after disassembly.
Do not populate BinaryFunction::MoveRelocations as we are not using them
currently.
Also remove PCRelativeRelocationOffsets that weren't used.
(cherry picked from FBD18413256)
Summary:
Once we emit function code, we no longer need CFG for next phases
that use basic blocks for address-translation and symbol update
purposes. We free memory used by CFG and instructions. The freed
memory gets reused by later phases resulting in overall memory usage
reduction.
We can probably improve memory consumption even further by replacing
BinaryBasicBlocks with more compact data structures.
(cherry picked from FBD18408954)
Summary:
We've used to emit special annotations to update SDT markers. However,
we can just use "Offset" annotations for the same purpose. Unlike BAT,
we have to generate "reverse" address translation tables.
This approach eliminates reliance on instructions after code emission.
(cherry picked from FBD18318660)
Summary:
Use BinaryBasicBlock::OffsetTranslationTable for BAT. This removes
dependency on instructions after the code emission.
(cherry picked from FBD18283965)
Summary:
Be default, we strip debug sections from the binary. Even though we did
not write the sections, we allocated space for them in the output binary
by mistake.
(cherry picked from FBD18218708)
Summary:
BOLT creates MCInst for every instruction from the input. For large
binaries, this means we are creating tens if not hundreds of millions of
instructions. If the number of operands for average instruction is much
less than 8, we benefit from changing the type of Operands from
SmallVector<MCOperand, 8> to SmallVector<MCOperand, 2>. That seems
to be the optimal type for X86-64 on average.
The size of MCInst goes down from 176 to 80 which often reduces BOLT
memory consumption by gigabytes.
(cherry picked from FBD18218924)
Summary:
We do not support optimizing functions with jump tables in
AArch64, but we do need to detect them. This idiom is slightly different
from the ones we've seen before. It encode jump table entries as
relative to the jump table itself instead of relative to the indirect
branch (BR) instruction.
(cherry picked from FBD18191100)
Summary:
For functions with unknown control flow, do not populate TakenBranches
with an entry pointing to the end of the function.
(cherry picked from FBD18034019)
Summary:
If collecting data in Intel Skylake machines, we may face a
bug where LBR0 or LBR1 may be duplicated w.r.t. the next entry. This
makes perf2bolt interpret it as an invalid trace, which ordinarily we
discard during aggregation. However, in BAT, since we do not disassemble
the binary where the collection happened but rely only on the
translation table, it is not possible to detect bad traces and discard
them. This gets to the fdata file, and this invalid trace ends up
invalidating the profile for the whole function (by being treated as
stale by BOLT).
In this patch, we detect Skylake by looking for LBRs with 32 entries,
and discard the first 2 entries to avoid running into this problem.
It also fixes an issue with collision in the translation map by
prioritizing the last basic block when more than one share the same
output address.
(cherry picked from FBD17996791)
Summary:
When reading debug info in parallel, CUs for functions were populated in
parallel and the order was non-deterministic. We used the first CU from
the non-deterministically-ordered list to set the line number resulting
in different outputs.
The fix is to sort the list after it's been created and before assigning
the line table unit.
(cherry picked from FBD17946889)
Summary:
merge-fdata for legacy format was simply appending all input
strings to output, but the real format supports some header strings
that can't be simply concatanated to output. Check for the header
string used by BAT before merging fdata to avoid creating an output
file with invalid lines (header in the middle of the fdata file).
For heatmap, avoid reading BAT tables, since they won't be used.
(cherry picked from FBD17943131)
Summary:
I noticed when setting up a new repository for bolt that bolt tests
would fail unexpectedly when running `ninja check-bolt` and
`ninja check-llvm`. This turns out to be because dependencies for bolt
binaries were not specified in the CMake configuration so they were not
built before running the tests. This diff adds the dependencies to the
CMake configuration for check-bolt and check-llvm so that bolt binaries
are built before running tests.
(cherry picked from FBD17919505)
Summary:
C++14 "sized deallocation" introduces a 2-argument `delete` where the new 2nd argument is the original allocated size. It's useful for allocators like jemalloc to be "reminded" of the original allocation size, else they incur the cost of an address to size lookup. Jemalloc has provided this for a while as `sdallocx`, and recently it got wired up to the new 2-arg `delete`.
Here I introduce typedefs for the SmallVectors so the "16" is consistent, which seems to fix the issue.
(cherry picked from FBD17618981)
Summary:
Change our CMake config for the standalone runtime instrumentation
library to check for the elf.h header before using it, so the build
doesn't break on systems lacking it. Also fix a SmallPtrSet usage where
its elements are not really pointers, but uint64_t, breaking the build
in Apple's Clang.
(cherry picked from FBD17505759)
Summary:
With the word "missed", the previous message about opportunities for
macro-op fusion optimization could be misleading.
(cherry picked from FBD17464603)
Summary:
The existing check for compiler de-virtualization bug was not working
when the relocation reference did not fall on a function boundary.
As a result, we were falsely detecting "unmarked object in code".
When running the check, the address could be arbitrary, except it
shouldn't match any existing function. Additionally, check that there's
a proper reference to the de-virtualized callee to avoid false
positives.
(cherry picked from FBD17433887)
Summary:
This diff applies to non-relocation mode mostly. In this mode, we are
limited by original function boundaries, i.e. if a function becomes
larger after optimizations (e.g. because of the newly introduced
branches) then we might not be able to write the optimized version,
unless we split the function. At the same time, we do not benefit from
function splitting as we do in the relocation mode since we are not
moving functions/fragments, and the hot code does not become more
compact.
For the reasons described above, we used to execute multiple re-write
attempts to optimize the binary and we would only split functions that
were too large to fit into their original space.
After the first attempt, we would know functions that did not fit
into their original space. Then we would re-run all our passes again
feeding back the function information and forcefully splitting
such functions. Some functions still wouldn't fit even after the
splitting (mostly because of the branch relaxation for conditional tail
calls that does not happen in non-relocation mode). Yet we have emitted
debug info as if they were successfully overwritten. That's why we had
one more stage to write the functions again, marking failed-to-emit
functions non-simple. Sadly, there was a bug in the way 2nd and 3rd
attempts interacted, and we were not splitting the functions correctly
and as a result we were emitting less optimized code.
One of the reasons we had the multi-pass rewrite scheme in place, was
that we did not have an ability to precisely estimate the code size
before the actual code emission. Recently, BinaryContext obtained such
functionality, and now we can use it instead of relying on the
multi-pass rewrite. This eliminates redundant work of re-running
the same function passes multiple times.
Because function splitting runs before a number of optimization passes
that run on post-CFG state (those rely on the splitting pass), we
cannot estimate the non-split code size with 100% accuracy. However,
it is good enough for over 99% of the cases to extract most of the
performance gains for the binary.
As a result of eliminating the multi-pass rewrite, the processing time
in non-relocation mode with `-split-functions=2` is greatly reduced.
With debug info update, it is less than half of what it used to be.
New semantics for `-split-functions=<n>`:
-split-functions - split functions into hot and cold regions
=0 - do not split any function
=1 - in non-relocation mode only split functions too large to fit
into original code space
=2 - same as 1 (backwards compatibility)
=3 - split all functions
(cherry picked from FBD17362607)
Summary: `perf2bolt` accepts executable name, and the tool will find all the PIDs associated with that executable. When different versions of an executable are running at the same time, name alone may not be sufficient as we will get samples from different versions of the binary aggregated together. The resulting fdata may look stale to BOLT, which makes BOLT bailout optimization for functions. This change adds a `-pid` switch that lets user specify process ID in addition to executable name so BOLT can target a specific process.
(cherry picked from FBD17178898)
Summary: This change adds a switch (`ignore-interrupt-lbr`) to ignores LBR from perf input that is result of kernel interrupts. These asynchronous flow of user/kernel transition will make BOLT think that profile is stale, thus bailout optimization for functions. Ideally, user mode filter need to be set for `perf record` so we don't have asynchronous LBRs. However these are identifiable as kernel address space is known, so we can ignore any LBRs that come from or go into kernel addresses during aggregation. This is under a switch and off by default in case we need to BOLT kernel module.
(cherry picked from FBD17170107)
Summary:
Change our edge profiling technique when using instrumentation
to do not instrument every edge. Instead, build the spanning tree
for the CFG and omit instrumentation for edges in the spanning tree.
Infer the edge count for these edges when writing the profile during
run time. The inference works with a bottom-up traversal of the spanning
tree and establishes the value of the edge connecting to the parent based
on a simple flow equation involving output and input edges, where the
only unknown variable is the parent edge.
This requires some engineering in the runtime lib to support dynamic
allocation for building these graphs at runtime.
(cherry picked from FBD17062773)
Summary:
We start a thread to preprocess the profile while the main
thread continues to disassemble the input binary. We should not
disassemble it in BAT mode, however, the test to check whether we have
BAT in the input binary depends on the preprocessing thread, so there
is a race where we may start disassembling functions just because the
preprocessing thread didn't conclude we are in BAT mode. Fix this and
make the main thread check for BAT without depending on the
preprocessing thread.
(cherry picked from FBD17124370)
Summary:
We decode the regular .plt section and we are able to perform
optimizations on it with -plt=hot or -plt=all, however, .plt.got
sections are not decoded by BOLT. This patch teaches BOLT how to read
them. They are created by the bfd linker whenever there is no need for
the dynamic linker to lazy-bind the symbol (when they are eagerly
resolved at binary load time). These entries are 8-byte sized instead of
16-byte sized like the regular PLT, and contain a single indirect call
instruction with 7 bytes and a nop.
(cherry picked from FBD17060515)
Summary:
We should not rely on split function detection while aggregating
data, but only look up the original function names in the symbol table.
Split function detection should be done by BOLT and not perf2bolt while
writing the profile. Then, BOLT, when reading it, will take care of
combining functions if necessary.
This caused a bug in bolted data collection where samples in cold parts
of a function were being falsely attributed to the hot part of a function
instead of being attributed to the cold part, causing incorrect translation of
addresses.
(cherry picked from FBD16993065)
Summary:
We were too permissive by allowing more jump tables during the
preliminary scan of memory. This allowed for jump tables to be
falsely detected. And since we didn't have a way to backtrack
the jump table creation, we had to assert.
This diff refactors the code that analyzes jump table contents.
Preliminary and final passes share the same code. The only difference
should be the detection of instruction boundaries that are available
during the final pass.
This should affect strict relocation mode only.
(cherry picked from FBD16923335)
Summary:
BOLT prints "spawning thread to pre-process profile" message even when
it is not running in the aggregation mode. Fix that.
(cherry picked from FBD16908596)
Summary:
Avoid directly allocating string and description tables in
binary's static data region, since they are not needed during runtime
except when writing the profile at exit. Change the runtime library to
open the tables on disk and read only when necessary.
(cherry picked from FBD16626030)
Summary:
To allow the development of future instrumentation work, this
patch adds support in BOLT for linking arbitrary libraries into the
binary processed by BOLT. We use orc relocation handling mechanism for
that. With this support, this patch also moves code programatically
generated in X86 assembly language by X86MCPlusBuilder to C code written
in a new library called bolt_rt. Change CMake to support this library as
an external project in the same way as clang does with compiler_rt. This
library is installed in the lib/ folder relative to BOLT root
installation and by default instrumentation will look for the library
at that location to finish processing the binary with instrumentation.
(cherry picked from FBD16572013)
Summary:
Add a flag that disable writing botl-info section
and add a test that run bolt with two modes parallel
and sequential and assert that the resulting binaries
are the same.
(cherry picked from FBD16575440)
Summary:
Add option `-check-encoding` to verify if the input to LLVM disassembler
matches the output of the assembler. When set, the verification runs on
every instruction in processed functions.
I'm not enabling the option by default as it could be quite noisy on x86
where instruction encoding is ambiguous and can include redundant
prefixes.
(cherry picked from FBD16595415)
Summary:
In strict relocation mode, we get better function coverage. However, if
the profile used for optimization was converted using non-strict mode,
then it wouldn't match functions exclusive to strict mode. Hence,
we have to enforce strict relocation mode for profile conversion, so it
can be used for either mode.
I'm also adding parallel profile pre-processing unless `--no-threads` is
specified. This masks the runtime overhead of function disassembly on
multi-core machines.
(cherry picked from FBD16587855)
Summary:
switch to sequential execution when print-all is passed.
Since the function getDynoStats have an unsafe access
to the annotation allocators.
(cherry picked from FBD16503502)
Summary:
hfsort+ performs an expensive analysis to determine the
new order of the functions. 99% of the time during hfsort+
is spent in the function runPassTwo. This diff runs the body
of the hot loop in runPassTwo in parallel speeding up the
total runtime of reorder-functions pass by up to 4x
(cherry picked from FBD16450780)
Summary:
In non-relocation mode, we allow data objects to be embedded in the
code. Such objects could be unmarked, and could occupy an area between
functions, the area which is considered to be code padding.
When we disassemble code, we detect references into the padding area
and adjust it, so that it is not overwritten during the code emission.
We assume the reference to be pointing to the beginning of the object.
However, assembly-written functions may reference the middle of an
object and use negative offsets to reference data fields. Thus,
conservatively, we reduce the possibly-overwritten padding area to
a minimum if the object reference was detected.
Since we also allow functions with unknown code in non-relocation mode,
it is possible that we miss references to some objects in code.
To cover such cases, we need to verify the padding area before we
allow to overwrite it.
(cherry picked from FBD16477787)
Summary:
While reading debug info the function findSubprograms
runs on each compilation unit. This diff parallelize that loop
reducing its runtime duration by 70%.
(cherry picked from FBD16362867)
Summary:
BOLT spends a decent amount of time creating patches to update
debug information when -update-debug-sections is passed.
In updateDebugInfo patches are created to update .debug_info
and .debug_abbrev sections while .debug_loc and .debug_ranges
contents are populated. This this diff uses a lock-based approach to
parallelize updateDebugInfo functions and reduces the runtime of the
function by around 30%.
(cherry picked from FBD16352261)
Summary:
After stripping annotations, conditional tail calls no longer can be
identified by their corresponding tag. We can check the number of basic
block successors instead.
Fixesfacebookincubator/BOLT#58.
(cherry picked from FBD16444718)
Summary:
Shrink wrapping is an expensive part of frame optimizations if
performed on all functions. This diff makes it run in parallel,
reducing wall time.
(cherry picked from FBD16092651)
Summary:
This diff parallelize the construction of call graph during disassembly.
The diff includes a change to parallel-utilities where another interface
is added, that support running tasks on binaryFunctions that involves
adding instruction annotations. This pattern is common in different places,
e.g. frame optimizations. And such, pattern justify creating an interface,
that abstract out all the messy details.
(cherry picked from FBD16232809)
Summary:
This diff change reorderBasicBlocks pass to run in parallel,
it does so by adding locks to the fix branches function,
and creating temporary MCCodeEmitters when estimating basic block code size.
(cherry picked from FBD16161149)
Summary:
If two indirect branches use the same jump table, we need to
detect this and duplicate dump tables so we can modify this CFG
correctly. This is necessary for instrumentation and shrink wrapping.
For the latter, we only detect this and bail, fixing this old known
issue with shrink wrapping.
Other minor changes to support better instrumentation: add an option
to instrument only hot functions, add LOCK prefix to instrumentation
increment instruction, speed up splitting critical edges by avoiding
calling recomputeLandingPads() unnecessarily.
(cherry picked from FBD16101312)
Summary:
Heuristic that creates a jump table for every memory access,
including those we do not match against a pattern in an indirect jump,
is too permissive and has false positives. Guard this logic under
strict mode until we figure out a better strategy.
(cherry picked from FBD16192205)
Summary:
Each time we run some work in parallel over the list of functions in bolt, we manage a thread pool, task scheduling and perform some work to manage the granularity of the tasks based on the type of the work we do.
In this task, I am creating an interface where all those details are abstracted out, the user provides the function that will run on each function, and some policy parameters that setup the scheduling and granularity configurations.
This will make it easier to implement parallel tasks, and eliminate redundant coding efforts.
(cherry picked from FBD16116077)
Summary:
This diff parallelize the STPClean() function reducing its runtime from 5 seconds to 0.4 on HHVM,
Making the runtime for the frame optimizer goes down to 33 seconds on HHVM.
(cherry picked from FBD15914371)
Summary:
This diff includes two main changes:
1) When creating an annotation, a dedicated annotation allocator can be used, instead of the default allocator. This allows some annotation to be deallocated right after the end of their usage completely. Furthermore, having the ability to use dedicated allocators allows running SPT in parallel where each task uses a different allocator.
2) SPT is parallelized.
(cherry picked from FBD15913492)
Summary: We select the top hot targets for indirect call promotion. But since we only have frequency for targets, not for actual jump table indices, we have to merge indices that share the same actual target. In order to do that we sort targets by pointer of target symbol before merging, which introduces instability. Later we stable sort merged targets by frequency. Due to the instability of sorting pointers, and depending on how many indices each merged target has, we could end up with unstable ICP. This commit changes the 2nd pass sorting to prioritize targets with fewer indices, and higher mispredicts, in addition to higher frequency. It improves stability of ICP, and also exposes more ICP because targets with fewer indices has better chance of getting promoted.
(cherry picked from FBD16099701)
Summary:
Check that a symbol address is less than the next function
address before considering it for a secondary entry.
(cherry picked from FBD16056468)
Summary:
In strict relocation mode we rely on relocations to represent all
possible entry points into a function. Most of the code generated by
tested compilers (gcc and clang) will result in relocations against
any internal labels for jump tables and for computed goto tables.
In situations where we cannot properly reconstruct a jump table, or when
we cannot determine a table that guides an indirect jump, e.g. when
multiple computed goto tables are used, we conservatively assume that
the indirect jump can end up at any possible basic block referenced by
relocations.
In strict mode, simple functions may include the aforementioned
instructions with unknown control flow with a conservative list of
destinations added to the containing basic block. This allows us to
expand coverage of simple functions and to enable code reordering
optimizations for more functions.
The strict mode is recommended when BOLT is used with a well-formed
code generated by a compiler.
To use the strict mode, add "-strict" on the command line.
Another effect of this diff, is that with relocations, we will always
replace the immediate operand of an instruction with a symbol if the
relocation exists against this operand.
Also this diff fixes issues with Clang compiled with -fpic.
(cherry picked from FBD15872849)
Summary:
A relocation can have an addend that makes it look as the relocated
value is in a different section from the symbol being relocated.
E.g., a relocation against a variable in .rodata could have a negative
offset that will make it look like it is against a symbol in .text
(a section that typically precedes .rodata).
Unless the relocation is against a section symbol, we know
exactly the symbol that is being relocated and there is no issue.
However, when the linker leaves only a section relocation (i.e. a
relocation against a section symbol when a temporary original symbol
gets deleted), we have to guess the relocated symbol, and can falsely
detect a function reference in the case described above.
The fix is to keep a section relocation if the corresponding
relocated value falls into a different section, and to detect and
ignore false function reference.
(cherry picked from FBD16030791)
Summary: BOLT operates in relocation mode by default when .reloc is in the binary. This changes disables relocation mode for heatmap generation so we can use that for more cases. There's a small separate change that ignores zero-sized symbol in zero-sized code section during function discovery.
(cherry picked from FBD16009610)
Summary:
An instrumentation pass that modifies the input binary to
generate a profile after execution finishes. It modifies branches to
increment counters stored in the process memory and injects a new
function that dumps this data to an fdata file, readable by BOLT.
This instrumentation is experimental and currently uses a naive
approach where every branch is instrumented. This is not ideal for
runtime performance, but should be good enough for us to
evaluate/debug LBR profile quality against instrumentation.
Does not support instrumenting indirect calls yet, only direct
calls, direct branches and indirect local branches.
(cherry picked from FBD15998096)
Summary:
Make BOLT ignore empty functions (those containing no instructions,
despite having some space allocated to it filled with zeroes).
(cherry picked from FBD15981683)
Summary:
Profile bias may happen depending on the hardware counter used
to trigger LBR sampling, on the hardware implementation and as an
intrinsic characteristic of relying on LBRs. Since we infer fall-through
execution and these non-taken branches take zero hardware resources to
be represented, LBR-based profile likely overrepresents paths with fall
throughs and underrepresents paths with many taken branches. This patch
adds an option to print statistics about profile bias so we can better
understand these biases.
The goal is to analyze differences in the sum of the frequency of all
incoming edges in a basic block versus the sum of all outgoing. In an
ideally sampled profile, these differences should be close to zero. With
this option, the user gets the mean of these differences in flow as a
percentage of the input flow. For example, if this number is 15%, it
means, on average, a block observed 15% more or less flow going out of
it in comparison with the flow going in. We also print the standard
deviation so we can have an idea of how spread apart are different
measurements of flow differences. If variance is low, it means the
average bias is happening across all blocks, which is compatible with
using LBRs. If the variance is high, it means some blocks in the profile
have a much higher bias than others, which is compatible with using a
biased event such as cycles to sample LBRs because it overrepresents
paths that end in an expensive instruction.
(cherry picked from FBD15790517)
Summary:
ICF consumes 10-15% of bolt runtime, for HHVM that is around 45 seconds.
this diff perform some parallelization for the pass to make it faster.
A 60% reduction in the ICF runtime is measured on the parallel version for HHVM.
(cherry picked from FBD15589515)
Summary:
Now that we populate jump tables after all functions are disassembled,
we can check for instruction boundaries corresponding to jump table
entries. No need to delegate this task to postProcessJumpTables().
(cherry picked from FBD15814762)
Summary:
During the initial disassembly pass, only identify jump tables
without populating the contents. Later, after all functions have been
disassembled, we have a better idea of jump table boundaries and can do
a better job of populating their entries.
As a result, we no longer have embedded jump tables (i.e. a jump table
that is parter of another jump table). If we ever need to keep
sequential jump tables inseparable during the output, we can always
add such functionality later.
Fixesfacebookincubator/BOLT#56.
(cherry picked from FBD15800427)
Summary:
During frame analysis, the functions do not change, and stack pointer tracking
does not need to be performed more than one time.
The current implementation performs the SPT analysis multiple times per
function during the frame analysis, we ca eliminate such computation redundancy.
On HHVM with -frame-opts=hot, this save around a minute which is 40% of the
frame optimization runtime. (129s to 76s).
fdata should be passed for a reasonable evaluation (we need the call graph).
However, this comes at a memory cost, around 2G to the peak when only -frame-opt=hot only is used but,
When all the usual flags are passed, the effect is to the peak is only 200K (measured from one test).
Update:
When jemalloc is used the base became way better and the following runtime are observed:
[jemalloc]
hhvm 85 --> 72.
clang 27 --> 23.
[malloc]
hhvm 129 --> 76.
clang 34 --> 27.
(cherry picked from FBD15707003)
Summary:
Add an option to get extra profile trace using the recorded event PC.
The trace goes from the latest LBR record destination to the event PC.
(cherry picked from FBD15711804)
Summary:
We used to handle PC-relative address references differently from direct
address references. As a result, some cases, such as escaped function
label address, were not handled when dealing with absolute (non-PIC)
code. This diff moves processing of an address reference into
BinaryContext::handleAddressRef() which is called for both PIC and
non-PIC code.
(cherry picked from FBD15643535)
Summary:
Compile Bolt using std 14.
We want that to be able to use some threading the locking tools that do not exists in std 11.
(cherry picked from FBD15671736)
Summary:
Similarly to how the compiler relies on DWARF to map samples, so
it is possible to collect profile data in binaries optimized by PGO
techniques and retrofit data to be used in a representation of the program
that was not optimized by PGO, this diff implements an option in BOLT to
encode a table in the output binary that allows us to map data collected
in optimized binaries back to the address space used in the input binary
(where the profile is useful, since we do not support running BOLT on a
binary already optimized by BOLT). The goal is to offer an option to
support BOLT in scenarios where it is not easy to run a special deployment of
the binary with a version that was not optimized by BOLT just for data
collection.
This feature is enabled with the -enable-bat flag. BAT stands for BOLT
Address Translation, which refers to the process of mapping output to
input addresses.
(cherry picked from FBD15531860)
Summary:
Options such as `-print-only`, `-skip-funcs`, etc. now take regular
expressions. Internally, the option is converted to '^funcname$' form
prior to regex matching. This ensures that names without special
symbols will match exactly, i.e. "foo" will not match "foo123".
(cherry picked from FBD15551930)
Summary:
Run analyzeIndirectBranch() using basic block boundaries instead of
running ad-hoc validation of the jump table assumptions.
(cherry picked from FBD15465034)
Summary:
Move handling of interprocedural references to BinaryContext.
Post-process indirect branches immediately after the CFG is built.
This is almost NFC. Since indirect branches are now post-processed
before the profile data is processed it interferes with the way the
profile data in YAML format is handled.
(cherry picked from FBD15456003)
Summary:
Add an option:
-memcpy1-spec=func1,func2:cs1,func3:cs1:cs2,...
to specialize calls to memcpy() in listed functions (the name could be
supplied in regex) for size 1. The optimization will dynamically check
if the size argument equals to 1 and execute a one byte copy, otherwise
it will call memcpy() as usual. Specific call sites could be indicated
after ":" using their numeric count from the start of the function.
(cherry picked from FBD15428936)
Summary:
SDT markers that appears as nops in the assembly, are preserved and not eliminated.
Functions with SDT markers are also flagged. Inlining and folding are disabled for
functions that have SDT markers.
(cherry picked from FBD15379799)
Summary:
Parse statically defined tracepoints(SDT) markers from the ELF file, and store them.
Add an option to print SDTs (-print-sdt).
Add test case for parsing and printing SDTs.
(cherry picked from FBD15366712)
Summary:
Previously, ICP worked with a budget of N targets to convert to
direct calls. As long as the frequency of up to N of the hottest targets
surpassed a given fraction (threshold) of the total frequency, say, 90%,
then the optimization would convert a number of targets (up to N) to
direct calls. Otherwise, it would completely abort processing this call
site. The intent was to convert a given fraction of the indirect call
site frequency to use direct calls instead, but this ends up being a
"all or nothing" strategy.
In this patch we change this to operate with the same strategy seem in
LLVM's ICP, with two thresholds. The idea is that the hottest target of
an indirect call site will be compared against these two thresholds: one
checks its frequency relative to the total frequency of the original
indirect call site, and the other checks its frequency relative to the
remaining, unconverted targets (excluding the hottest targets that were
already converted to direct calls). The remaining threshold is typically
set higher than the total threshold. This allows us more control over
ICP.
I expose two pairs of knobs, one for jump tables and another for
indirect calls.
To improve the promotion of hot jump table indices when we have memory
profile, I also fix a bug that could cause us to promote extra indices
besides the hottest ones as seen in the memory profile. When we have the
memory profile, I reapply the dual threshold checks to the memory
profile which specifies exactly which indices are hot. I then update N,
the number of targets to be promoted, based on this new information, and
update frequency information.
To allow us to work with smaller profiles, I also created an option in
perf2bolt to filter out memory samples outside the statically allocated
area of the binary (heap/stack). This option is on by default.
(cherry picked from FBD15187832)
Summary:
Make BinaryContext responsible for creation and management of
JumpTables. This will be used for detection and resolution of jump table
conflicts across functions.
(cherry picked from FBD15196017)
Summary:
perf tool requires the input data to be owned by the current user or
root, otherwise it rejects the input. Use `-f` option to override this
behavior.
(cherry picked from FBD15160678)
Summary:
While checking for a size of a jump table, we've used containing
section as a boundary. This worked for most cases as typically jump
tables are not marked with symbol table entries. However, the compiler
may generate objects for indirect goto's.
(cherry picked from FBD15158905)