Summary:
Some LBR traces contain less than 16/32 entries. When the first trace is
less than the standard length, we fail to enable SKL bug workaround with
BAT mode enabled. The result is a bad profile from perf2bolt. The fix
is to check the length of every trace, not just the first one.
Issue facebookincubator/BOLT#94
(cherry picked from FBD22917971)
Summary:
Added execution count threshold option (execution-count-threshold) controlling the optimizations that are sensitive to the accuracy of the profiling data:
- BB reordering
- function splitting
- frame opts
- shrink wrapping
- indirect call promotion
(cherry picked from FBD22682171)
Summary:
Right now, the SAVE_ALL sequence executed upon entry of both
of our runtime libs (hugify and instrumentation) will cause the stack to
not be aligned at a 16B boundary because it saves 15 8-byte regs. Change
the code sequence to adjust for that. The compiler may generate code
that assumes the stack is aligned by using movaps instructions, which
will crash.
(cherry picked from FBD22744307)
Summary:
If no profile data is provided, but only a user-provided order
file for functions, fix the placement of the __hot_end symbol.
(cherry picked from FBD22713265)
Summary: Added handling of intra-function calls (internal control transfer by call instruction) to instrumentOneTarget
(cherry picked from FBD22606033)
Summary:
Do not fail/assert when trying to reorder blocks that terminate
with JRCXZ/JECXZ/LOOP instructions. We cannot invert the condition of
these instructions, so just treat them accordingly in fixBranches().
(cherry picked from FBD22487107)
Summary:
We've accidentally registered TBSS section address with a BinaryContext
resulting in addresses being attributed to it when
getSectionForAddress() was called.
(cherry picked from FBD22369323)
Summary:
Reading through the LLVM coding standard again, realized a few places where I didn't follow the standard when coding. Addressing them:
1. prefer static functions over functions in unnamed namespace.
2. #include as little as possible in headers
3. Have vtable anchors.
(cherry picked from FBD22353046)
Summary:
R_X86_64_PLT32 relocations recorded by the linker may point to the PLT
section instead of being resolved to the symbol reported by the
relocation. Sometimes they could point to the symbol too. Disable
internal verification for this type of relocation.
Include a fix for symbol address calculation when it is based on the
extracted value. The truncation to the relocation size is needed if
the results overflows.
(cherry picked from FBD22317952)
Summary:
Re-add tests removed because they used to depend on yaml2obj.
Rewrite them with an assembler (llvm-mc) and use the system linker to
produce a valid ELF as input to BOLT.
(cherry picked from FBD22323449)
Summary:
When adding symbols for patched functions, we may end up emitting
multiple symbols per function if the function has multiple names (e.g.
after identical code folding by the linker).
(cherry picked from FBD22294112)
Summary:
Accept binaries without dynamic section/segment as a valid input.
Modify the check for invalid debug info "executables" that are result of
running "objcopy --only-keep-debug". Instead of checking for an empty
dynamic segment, check that ".text" is mapped into a valid segment.
Move SegmentMapInfo inside BinaryContext.
Fixesfacebookincubator/BOLT#91
Temporarily removing issue*.test tests that use yaml2obj and operate on
fake binaries.
(cherry picked from FBD22271481)
Summary:
While aggregating perf.data events, even in strict mode, there is no
need to process all functions since we are not generating an output
binary. However, it's still important to convert data for as many
functions as possible, even for ones with unknown internal control flow.
(cherry picked from FBD22248390)
Summary:
lld linker may emit static relocations against addresses that also have
dynamic relocations associated with them. When this happens, BOLT fails
to validate the extracted value at the address.
Read dynamic relocations in the binary and ignore static relocations at
addresses that have a duplicate dynamic relocation.
(cherry picked from FBD22192345)
Summary:
getNewValueForSymbol() uses orc::RTDyldObjectLinkingLayer::findSymbol()
to resolve symbol values. The latter will always return JITSymbol,
even if there was no symbol defined. The address for the undefined
symbol will be zero, but some symbols could legally be resolved to zero
too.
We need to distinguish between real zero-valued symbols and symbols that
were not emitted and are not visible by orc::RTDyldObjectLinkingLayer.
If zero address is returned by ORC, check for a binary data with the
same name and use its address for the symbol resolution.
(cherry picked from FBD22175269)
Summary:
If we detect an internal function reference from code outside of the
function, then create an entry point at that location.
(cherry picked from FBD22169337)
Summary:
In my previous commit, I've accidentally reverted the condition while
evaluating a branch target.
Also, do not emit instruction for relocation purposes in
scanExternalRefs() if there was no TargetSymbol set and we have not
produced new relocations.
(cherry picked from FBD22169317)
Summary:
We may end up with a secondary entry symbol set to zero if there was no
symbol in the input file at the entry point address, and if we skipped
the function emission, e.g. if it was ignored. In that case, the symbol
should be properly initialized with a proper address.
(cherry picked from FBD22169167)
Summary:
Add '-lite' support for relocations for improved processing time,
memory consumption, and more resilient processing of binaries with
embedded assembly code.
In lite relocation mode, BOLT will skip full processing of functions
without a profile. It will run scanExternalRefs() on such functions
to discover external references and to create internal relocations
to update references to optimized functions.
Note that we could have relied on the compiler/linker to provide
relocations for function references. However, there's no assurance
that all such references are reported. E.g., the compiler can resolve
inter-procedural references internally, leaving no relocations
for the linker.
The scan process takes about <10 seconds per 100MB of code on modern
hardware. It's a reasonable overhead to live with considering the
flexibility it provides.
If BOLT fails to scan or disassemble a function, .e.g., due to a data
object embedded in code, or an unsupported instruction, it enables a
patching mode to guarantee that the failed function will call
optimized/moved versions of functions. The patching happens at original
function entry points.
'-skip=<func1,func2,...>' option now can be used to skip processing of
arbitrary functions in the relocation mode.
With '-use-old-text' or '-strict' we require all functions to be
processed. As such, it is incompatible with '-lite' option,
and '-skip' option will only disable optimizations of listed
functions, not their disassembly and emission.
(cherry picked from FBD22040717)
Summary:
In some cases we install bolt binary into one level deeper in bin/, such as bin/install/, we need to go back one more level to find lib directory.
(cherry picked from FBD22070974)
Summary:
This diff handles several challenges related to heatmap generation for Linux kernel (vmlinux elf file):
- If the input binary elf file contains the section `__ksymtab`, this diff assumes that this is the linux kernel `vmlinux` file and enables an extra flag `LinuxKernelMode`
- In `LinuxKernelMode`, we only support heat map generation right now, therefore it ensures that current BOLT mode is heat map generation. Otherwise, it exits with error.
- For some Linux symbol and section combinations, BOLT may not be able to find section for symbol (specially symbols that specifies the end of some section). For such cases, we show an warning message without exiting which was the previous behavior.
- Linux kernel elf file does not contain dynamic section, therefore, we don't exit when no dynamic section is found for linux kernel binary.
- Current `ParseMMap` logic does not work with linux kernel. MMap entries for linux kernel uses `PERF_RECORD_MMAP` format instead of typical `PERF_RECORD_MMAP2` format. Since linux kernel address mapping is absolute (same as specified in the ELF file), we avoid calling `ParseMMap` in linux kernel mode.
- Linux kernel entries are registered with PID -1, therefore `BinaryMMapInfo` lookup is not required for linux kernel entries. Similarly, `adjustLBR` is also not required.
- Default max address in linux kernel mode is highest unsigned 64-bit integer instead of current 4GBs.
- Added another new parameter for heatmap, `MinAddress`, in case of Linux kernel mode which is `KernelBaseAddress`, otherwise, it is 0. While registering Heatmap sample counts from LBR entries, any address lower than this `MinAddress` is ignored.
- `IgnoreInterruptLBR` is disabled in linux kernel mode to ensure that kernel entries are processed
Currently, linux kernel heat map also include heat map for Linux kernel modules that are not part of vmlinux elf file. This is intentional to identify other potential optimization opportunities. If reviewers think, those modules should be omitted, I will disable those modules based on highest end address of a vmlinux elf section.
(cherry picked from FBD21992765)
Summary:
Under some conditions, e.g. while running in lite mode or when a
function is non-simple, BOLT may decide not to emit function code and
hence there's no need to update the symbol. However, since we change
section table, the corresponding section index may need an update.
Also, update section index for ICF symbols.
(cherry picked from FBD21970017)
Summary:
This patch enables automated hugify for Bolt.
When running Bolt against a binary with -hugify specified, Bolt will inject a call to a runtime library function at the entry of the binary. The runtime library calls madvise to map the hot code region into a 2M huge page. We support both new kernel with THP support and old kernels. For kernels with THP support we simply make a madvise call, while for old kernels, we first copy the code out, remap the memory with huge page, and then copy the code back.
With this change, we no longer need to manually call into hugify_self and precompile it with --hot-text. Instead, we could simply combine --hugify option with existing optimizations, and at runtime it will automatically move hot code into 2M pages.
Some details around the changes made:
1. Add an command line option to support --hugify. --hugify will automatically turn on --hot-text to get the proper hot code symbols. However, running with both --hugify and --hot-text is not allowed, since --hot-text is used on binaries that has precompiled call to hugify_self, which contradicts with the purpose of --hugify.
2. Moved the common utility functions out of instr.cpp to common.h, which will also be used by hugify.cpp. Added a few new system calls definitions.
3. Added a new class that inherits RuntimeLibrary, and implemented the necessary emit and link logic for hugify.
4. Added a simple test for hugify.
(cherry picked from FBD21384529)
Summary:
As we are adding more types of runtime libraries, it would be better to move the runtime library out of RewriteInstance so that it could grow separately. This also requires splitting the current implementation of Instrumentation.cpp to two separate pieces, one as normal Pass, one as the runtime library. The Instrumentation Pass would pass over the generated data to the runtime library, which will use to emit binary and perform linking.
This patch does the following:
1. Turn Instrumentation class into an optimization pass. Register the pass in the pass manager instead of in RewriteInstance.
2. Split all the data that are generated by Instrumentation that's needed by runtime library into a separate data structure called InstrumentationSummary. At the creation of Instrumentation pass, we create an instance of such data structure, which will be moved over to the runtime at the end of the pass.
3. Added a runtime library member to BinaryContext. Set the member at the end of Instrumentation pass.
4. In BinaryEmitter, make BinaryContext to also emit runtime library binary.
5. Created a base class RuntimeLibrary, that defines the interface of a runtime library, along with a few common helper functions.
6. Created InstrumentationRuntimeLibrary which inherits from RuntimeLibrary, that does all the work (mostly copied over) for emit and linking.
7. Added a new directory called RuntimeLibs, and put all the runtime library related files into it.
(cherry picked from FBD21694762)
Summary:
take_front() is a const member of StringRef. Calling it does nothing.
This suggests that this line of code is useless, deleting it.
But it's good to double check, what was the original intention here?
(cherry picked from FBD21697637)
Summary:
This diff handles several issues related to profile reading and
handling:
* Unifies interface used by 3 profile readers in ProfileReaderBase.
* Adds automatic detection of the profile file contents.
* Removes reader-specific fields from BinaryFunction and BinaryData.
All the information is stored in instruction annotations.
* Removes implicit memory dependencies in annotations on profile
reader instance.
* Adds lite mode support to YAML reader.
* Moves profile reading code out of BinaryFunction.
(cherry picked from FBD21601411)
Summary:
IndirectCallProfile was holding to a StringRef from a profile reader
providing an implicit dependency on the reader.
(cherry picked from FBD21587101)
Summary:
Add a dummy option in BOLT to allow us to write any string in
the bolt info section. This is accomplished since we record the complete
argv vector. This string used to tag this binary with any ID that can
later be associated with a specific BOLT invocation.
(cherry picked from FBD21441902)
Summary:
We use a special routine to emit line info for functions that we do not
overwrite. The resulting DWARF was not quite efficient as we were
advancing addresses using a larger than needed opcodes. Since there were
only a few functions that we didn't emit/overwrite, it was not a big
issue.
However, in lite mode the majority of functions are not overwritten and
as a result, the inefficiency in debug line encoding got exposed and
binaries were getting larger than expected .debug_line sections.
Fix it by using more conventional line table opcodes for address
advancing.
(cherry picked from FBD21423074)
Summary: We only support linking ELF runtime library right now. If the library is an archiver, we check that each individual library inside the archiver is an ELF library.
(cherry picked from FBD21388672)
Summary:
When optimizing a binary without relocations, we can skip processing
functions without profile (cold functions). By skipping processing of
cold functions, we reduce the processing time and memory. We call
such mode a lite mode, and it is enabled by default.
Some processing is still done for functions without profile even in lite
mode. scanExternalRefs() function is used to detect secondary entry
points to functions that are not marked in the symbol table.
Note that the no-relocation requirement is a temporary limitation
of the lite mode.
(cherry picked from FBD21366567)