Summary:
While printing debug info for instructions, we should use line tables
from the corresponding DWARF CU which could be different from the
containing function CU in case of inlined instructions.
(cherry picked from FBD28908324)
Summary:
FBD55943 changed the way ProcessAllSections works in RuntimeDyld. After
the change, all sections, including symbol table, section table, etc.
are loaded into memory whenever ProcessAllSections is enabled.
In BOLT we rely on RuntimeDyld for processing sections with relocations.
These include most allocatable sections and additionally .debug_line.
The latter is skipped by RuntimeDyld without ProcessAllSections flag.
If we enable ProcessAllSections, we will have to deal with allocating
memory for more sections than we need (see above) and later to filter
them out.
The alternative is to mark all sections that we actually plan to use as
"required for execution" (using RuntimeDyld terminology). For
.debug_line section on ELF it means adding SHF_ALLOC flag. On MachO,
RuntimeDyld currently treats all sections as required.
(cherry picked from FBD28729398)
Summary:
This patch introduces LoopInversionPass. Its main purpose is to ensure
that the loop layout is optimal depending on the profile information. So
if profile information shows that the loop is used, the unconditional
jump instruction must be executed only once and vice-versa. Please take
a look to the pass header file and test for more details.
Also change link_fdata script a bit, to be able to change FDATA prefix,
like FileCheck does.
Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
PR facebookincubator/BOLT#153
(cherry picked from FBD28391811)
Summary:
Implemented support for Debug Fission.
For the most part it doesn't impact Monolithic execution path.
One area that was changed is the DW_AT_low_pc/DW_AT_high_pc conversion. Before it was to DW_AT_ranges/DW_AT_low_pc, now DW_AT_low_pc is kept in same place.
Another more visible impact is in Skeleton CU the DW_AT_low_pc is replaced with DW_AT_ranges_base if it's not originally present and bolt converted ranges conversion inside the dwo units.
Output of this are multiple .dwo files with updated debug information.
(cherry picked from FBD29569788)
Summary:
Remove relocations against internal function labels, e.g. jump table
relocations, only when overwriting them.
While reading an input file with relocations, we create internal
relocations against code references (we skip PIC relocations).
Later, when we discover jump tables, we remove corresponding relocations
with the assumption that original relocations will either be ignored or
replaced by new relocations. However, it is possible to miss some
references to the jump table, in which case the original entries will
not be ignored. While such situation is abnormal, it is still a
better/safer approach to preserve relocations if we are not replacing
them with new ones.
(cherry picked from FBD28406628)
Summary:
Explicit assignment operator can be replaced with an implicit one.
Remove it to allow an implicit copy constructor:
```
bolt/src/Passes/DataflowAnalysis.h:74:8: warning: definition of
implicit copy constructor for 'ProgramPoint' is deprecated because it
has a user-declared copy assignment operator [-Wdeprecated-copy]
void operator=(const ProgramPoint &PP) {
^
bolt/src/Passes/DataflowAnalysis.h:62:14: note: in implicit copy
constructor for 'llvm::bolt::ProgramPoint' first required here
return ProgramPoint(&*Last);
```
(cherry picked from FBD28335138)
Summary:
Since gcc/ld could produce and expect PIE files we need to pass -no-pie option to avoid linking errors for tests.
Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
(cherry picked from FBD28360045)
Summary:
Reorder-blocks optimization pass doesn't take into account that
available offset for legacy Jcc instructions (for example,
JRCXZ - operand 8 bits) has to be less than 255 bytes.
It's rare case and to exclude such functions with unsupported
instructions from optimization passes added extra checking
Alexey Moksyakov
Advanced Software Technology Lab, Huawei
(cherry picked from FBD28264117)
Summary:
Ran iwyu multiple times, manually picked header remove lines.
Reached fixed point wrt removal: iwyu doesn't automatically remove
any more headers or forward declarations.
(cherry picked from FBD29569221)
Summary:
Previously, we used p_align value of the code segment to predict the
mapping of the segment at runtime. However, at times the reported
value is not aligned and at other times the actual aligned value will
be different because of the different page size used.
All we know is that the page size used at runtime should not exceed
p_align value. Adjust our segment address matching accordingly.
(cherry picked from FBD28133066)
Summary:
Addressing comments from the review for "Expand auto types".
Use const reference in MCPlusBuilder for MCInstrDesc where the copy
is not necessary.
(cherry picked from FBD27844344)
Summary:
We may have a CU with empty ranges, so accept errors coming
from DWARFDie::getAddressRanges(). This happens when using tools that
selectively strip debuginfo from the binary.
(cherry picked from FBD27602731)
Summary:
Refactor SectionPatches to avoid the use of extra map and a cast
from StringRef to std::string.
cherry-picked from FBD26756560
(cherry picked from FBD27490641)
Summary:
The user may wish to run BOLT for printing statistics only
(i.e. to check that the profile is valid). Add an option to run BOLT
without writing any output file, similar to a dry run. This option
is triggered by supplying -o with "/dev/null".
(cherry picked from FBD29568632)
Summary:
During the initial indirect jump analysis, we used to assert that the
discovered jump table type matched the pattern of the corresponding
instruction sequence. E.g., for PIC jump table memory we expected the
PIC jump table instruction sequence. The assertions were too
conservative, as in the case of a mismatch we can mark the indirect jump
as having an unknown control flow. That should be sufficient to either
skip the function processing or rely on relocation information for
possible recovery of the control flow.
(cherry picked from FBD27255816)
Summary:
Fix a bug with instrumentation when trying to instrument
functions that share a jump table with multiple indirect
jumps. Usually, each indirect jump that uses a JT will have its own
copy of it. When this does not happen, we need to duplicate the jump
table safely, so we can split the edges correctly (each copy of the
jump table may have different split edges). For this to happen, we
need to correctly match the sequence of instructions that perform the
indirect jump to identify the base address of the jump table and patch
it to point to the new cloned JT. It was reported to us a case in
which the compiler generated suboptimal code to do an indirect jump
which our matcher failed to identify.
Fixesfacebookincubator/BOLT#126
(cherry picked from FBD27065579)
Summary:
Whenever BOLT encounters a data reference in code, it tries to convert
it into <Object+Offset> form. The primary reason behind this approach is
to support read-only data-reordering optimization. However, with the
current level of the linker and compiler support we don't have enough
information to always correctly restore the original <Object+Offset>.
E.g. with zero-sized symbols we have to speculate that the actual size
of the underlying object extends to the next symbol. Most of the time,
there will be an object pointed by a zero-sized symbol and even
if we are guessing incorrectly, there will be no harm in creating
references of such form.
The problem happens when there's no object corresponding to the original
symbol and the next object is an (unmarked) jump table:
A: # <- zero-sized object
.LJUMP_TABLE:
.long <entry1>
.long <entry2>
....
.LB:
.long 21
.LC:
.long 42
The jump table will be moved and all references past it (up to the next
named object) will be incorrectly updated.
We should not speculate about the size of A in a case like that and
treat all discovered data objects (and thus references) independently.
(cherry picked from FBD27005660)
Summary:
This PR introduces 2 new instrumentation options:
1. instrumentation-no-counters-clear: Discussed at https://github.com/facebookincubator/BOLT/issues/121
2. instrumentation-wait-forks: Since the instrumentation counters are mapped as MAP_SHARED it will be nice to add ability to wait until all forks of the parent process will die using tracking of process group.
The last patch is just emitBinary code refactor.
Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/125
GitHub Author: Vladislav Khmelevskyi <Vladislav.Khmelevskyi@huawei.com>
(cherry picked from FBD26919011)
Summary:
TBSS section is a "virtual" section that does not take memory or file
space. Ignore it completely while adjusting section sizes.
(cherry picked from FBD26824484)
Summary:
There is no real link between CU and TU, so relying on fact
that address are the same, and we are updating all of them.
(cherry picked from FBD28112114)
Summary:
This commit is the first step in rebasing all of BOLT
history in the LLVM monorepo. It also solves trivial build issues
by updating BOLT codebase to use current LLVM. There is still work
left in rebasing some BOLT features and in making sure everything
is working as intended.
History has been rewritten to put BOLT in the /bolt folder, as
opposed to /tools/llvm-bolt.
(cherry picked from FBD33289252)
Summary:
1. Add support for __literal16 section in the instrumentation runtime library for MacOS.
2. Fix emitting __counters section.
(cherry picked from FBD25746342)
Summary:
a few minor updates in block reordering:
- some refactoring to improve readability;
- optimized chain splitting strategy to improve quality of layout and performance of the algorithm.
(cherry picked from FBD25126220)
Summary:
When looking at perf.data's available binaries and their
respective mmap'ed segments, match them with the input binary by
looking at both aligned and non-aligned addresses. If we suppose
the alignment is the mmap'ed page size, we may miss some cases and
perf2bolt will refuse to proceed because it failed to match the
input binary with a process recorded in perf.data.
(cherry picked from FBD25732673)
Summary:
Add options for trading processing speed for binary performance.
-lite-threshold-pct=<uint>
Threshold (in percent) for selecting functions to process in lite
mode. Higher threshold means fewer functions to process.
E.g threshold of 90 means only top 10 percent of functions with
profile will be processed.
-lite-threshold-count=<uint>
Similar to '-lite-threshold-pct' but specify threshold using
absolute function call count. I.e. limit processing to functions
executed at least the specified number of times.
-no-scan
Do not scan cold functions for external references (may result in
slower binary).
(cherry picked from FBD24739092)