Commit Graph

616 Commits

Author SHA1 Message Date
Laith Saed Sakka 06e1554158 Retpoline Insertion Pass
Summary:
Retpoline insertion is implemented for reloc mode.

(cherry picked from FBD8832838)
2018-07-25 19:07:41 -07:00
Maksim Panchenko 39f6fcc947 [BOLT] Add support for IFUNC
Summary:
Relocation value verification was failing for IFUNC as the real value
used for relocation wasn't the symbol value, but a corresponding PLT
entry.

Relax the verification and skip any symbols of ST_Other type.

(cherry picked from FBD9123741)
2018-07-30 10:29:47 -07:00
Maksim Panchenko df94786119 [BOLT] Fix range checks
Summary:
containsRange() functions were incorrectly checking for an empty range
at the end of the containing object, i.e. [a,b) was reporting true for
containing [b,b).
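
A minimal sketch of the corrected check, assuming half-open integer
ranges. The function name and signature are illustrative and are not
BOLT's actual containsRange() API:

```python
def contains_range(start, end, sub_start, sub_end):
    """Return True if [sub_start, sub_end) lies inside [start, end).

    Hypothetical helper for illustration only.
    """
    if sub_start > sub_end:
        return False  # malformed range
    if sub_start == sub_end:
        # An empty range is contained only if its address is itself
        # inside [start, end); [b,b) is therefore NOT inside [a,b).
        return start <= sub_start < end
    return start <= sub_start and sub_end <= end
```

With this reading, contains_range(a, b, b, b) is false, while a
non-empty range ending exactly at b is still accepted.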

(cherry picked from FBD9074643)
2018-07-30 16:30:18 -07:00
Maksim Panchenko fe9f8219fa [BOLT] Fix TBSS-related issue
Summary:
The TLS segment provides a template for initializing thread-local storage
for every new thread. It consists of initialized and uninitialized
parts. The uninitialized part of TLS, .tbss, is completely meaningless
from a binary analysis perspective. It doesn't take any space in the
file, or in memory. Note that this is different from a regular .bss
section that takes space in memory.

We should not place .tbss into a list of allocatable sections, otherwise
it may cause conflicts with objects contained in the next section.

(cherry picked from FBD9074056)
2018-07-30 16:30:18 -07:00
Maksim Panchenko 771d976543 [BOLT][NFC] Minor code refactoring
(cherry picked from FBD8882632)
2018-07-12 10:13:03 -07:00
Maksim Panchenko 49920a8fad [BOLT] Add R_X86_64_PC64 relocation support
(cherry picked from FBD8980691)
2018-07-24 14:30:16 -07:00
spupyrev 631da736b0 [BOLT] further speeding up cache+
Summary:
For large binaries, cache+ algorithm adds a noticeable overhead in
comparison with cache. This modification restricts search space of the
optimization, which makes cache+ as fast as cache for all tested binaries.

There is a tiny (in the order of 0.01%) regression in cache-related metrics,
but this is not noticeable in practice.

(cherry picked from FBD8369968)
2018-05-17 18:27:13 -07:00
Rafael Auler ddfcf4f266 [BOLT] Add parser for pre-aggregated perf data
Summary:
The regular perf2bolt aggregation job is to read perf output directly.
However, if the data is coming from a database instead of perf, one
could write a query to produce a pre-aggregated file. This function
deals with this case.

The pre-aggregated file contains aggregated LBR data, but without binary
knowledge. BOLT will parse it and, using information from the
disassembled binary, augment it with fall-through edge frequency
information. After this step is finished, this data can be either
written to disk to be consumed by BOLT later, or can be used by BOLT
immediately if kept in memory.

File format syntax:
{B|F|f} [<start_id>:]<start_offset> [<end_id>:]<end_offset> <count> [<mispred_count>]

B - indicates an aggregated branch
F - an aggregated fall-through (trace)
f - an aggregated fall-through with external origin - used to disambiguate
between a return hitting a basic block head and a regular internal
jump to the block

<start_id> - build id of the object containing the start address. We can
skip it for the main binary and use "X" for an unknown object. This will
save some space and facilitate human parsing.

<start_offset> - hex offset from the object base load address (0 for the
main executable unless it's PIE) to the start address.

<end_id>, <end_offset> - same for the end address.

<count> - total aggregated count of the branch or a fall-through.

<mispred_count> - the number of times the branch was mispredicted.
Omitted for fall-throughs.

Example
F 41be50 41be50 3
F 41be90 41be90 4
f 41be90 41be90 7
B 4b1942 39b57f0 3 0
B 4b196f 4b19e0 2 0
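
A rough sketch of how one line of this format could be parsed. This is
not the parser added by this commit; the type and function names are
hypothetical, and the counts are assumed to be decimal (the description
only states that offsets are hex):

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    build_id: Optional[str]  # None when the id is omitted (main binary)
    offset: int              # hex offset from the object base address


class Record(NamedTuple):
    kind: str                # "B", "F" or "f"
    start: Location
    end: Location
    count: int
    mispreds: Optional[int]  # None when the field is omitted


def parse_location(text):
    # "[<id>:]<offset>": the id part is optional.
    build_id, sep, offset = text.rpartition(":")
    return Location(build_id if sep else None, int(offset, 16))


def parse_line(line):
    fields = line.split()
    if len(fields) not in (4, 5) or fields[0] not in ("B", "F", "f"):
        raise ValueError("malformed pre-aggregated record: %r" % line)
    mispreds = int(fields[4]) if len(fields) == 5 else None
    return Record(fields[0], parse_location(fields[1]),
                  parse_location(fields[2]), int(fields[3]), mispreds)
```

For instance, parse_line("B 4b1942 39b57f0 3 0") yields a branch record
with count 3 and zero mispredictions, and a fall-through line such as
"F 41be50 41be50 3" yields a record whose mispreds field is None.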

(cherry picked from FBD8887182)
2018-07-17 18:31:46 -07:00
Laith Saed Sakka 27f3032447 Add initial function injection support
Summary:
This diff has the API needed to inject functions using BOLT.
In relocation mode, injected functions are emitted between the cold and the hot functions.
In non-reloc mode, injected functions are emitted in the next text section.

(cherry picked from FBD8715965)
2018-07-08 12:14:08 -07:00
Maksim Panchenko 6e45f5aeec [perf2bolt] Enforce file matching in perf2bolt
Summary:
If the input binary does not have a build-id and the name does not match
any file names in perf.data, then reject the binary, and issue an error
message suggesting to rename it to one of the listed names from
perf.data.

(cherry picked from FBD8846181)
2018-07-13 15:26:41 -07:00
Maksim Panchenko f2f164f474 [perf2bolt] Fix perf build-id matching
Summary:
Recent compiler tool chains can produce build-ids that are less than 40
characters long. Linux perf, however, always outputs 40 characters,
expanding the string with 0's as needed. Fix the matching by only
checking the string prefix.
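
The prefix comparison could look roughly like the following sketch.
The function name is made up for illustration, and lower-casing both
strings is an extra assumption that the commit does not mention:

```python
def build_ids_match(binary_build_id, perf_build_id):
    """Compare a possibly short build-id against perf's 40-char form.

    Hypothetical helper: perf pads shorter ids with 0's, so only the
    prefix of the perf string is compared against the binary's id.
    """
    binary_build_id = binary_build_id.lower()
    perf_build_id = perf_build_id.lower()
    if not binary_build_id:
        return False  # no build-id in the binary, nothing to match
    return perf_build_id.startswith(binary_build_id)
```

A 32-character id from the binary then matches the same id followed by
eight 0's in the perf output.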

(cherry picked from FBD8839452)
2018-07-13 10:49:41 -07:00
Rafael Auler 7aee0adbf9 [BOLT-AArch64] Create cold symbols on demand
Summary:
Rework the logic we use for managing references to constant
islands. Defer the creation of the cold versions to when we split the
function and will need them.

(cherry picked from FBD8228803)
2018-05-31 10:33:53 -07:00
Maksim Panchenko 44a36937f8 [BOLT] Fix llvm-dwarfdump issues
Summary:
llvm-dwarfdump is relying on getRelocatedSection() to return
section_end() for ELF files of types other than relocatable objects.
We've changed the function to return relocatable section for other
types of ELF files. As a result, llvm-dwarfdump started re-processing
relocations for sections that already had relocations applied, e.g. in
executable files, and this resulted in wrong values reported.

As a workaround/solution, we make this function return relocated section
for executable (and any non-relocatable objects) files only if the
section is allocatable.

(cherry picked from FBD8760175)
2018-07-06 21:30:23 -07:00
Maksim Panchenko 66e0313d15 [perf2bolt] Accept `-` as a valid misprediction symbol
Summary:
As reported in GH-28, `perf` can produce a `-` symbol for the
misprediction bit if the bit is not supported by the kernel/HW. In this
case we can ignore the bit.
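
As an illustration only, assuming the misprediction flag has already
been extracted from the perf output as a single character, and assuming
"M" and "P" denote mispredicted and predicted branches. The function
name is hypothetical:

```python
def is_mispredicted(flag):
    """Interpret a misprediction flag taken from perf output."""
    if flag == "M":
        return True
    if flag in ("P", "-"):
        # "-" means the bit is unsupported; treat it as "no information"
        # and do not count a misprediction.
        return False
    raise ValueError("unexpected misprediction flag: %r" % flag)
```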

(cherry picked from FBD8786827)
2018-07-10 10:25:55 -07:00
Rafael Auler 12380b8b06 Fix assembly after adding entry points
Summary:
When a given function B, located after function A, references
one of A's basic blocks, it registers a new global symbol at the
reference address and updates A's Labels vector via
BinaryFunction::addEntryPoint(). However, we don't update A's branch
targets at this point. So we end up with an inconsistent CFG, where the
basic block names are global symbols, but the internal branch operands
are still referencing the old local name of the corresponding blocks
that got promoted to an entry point. This patch fixes this by detecting
this situation in addEntryPoint and iterating over all instructions,
looking for references to the old symbol and replacing them to use the
new global symbol (since this is now an entry point).

Fixes facebookincubator/BOLT#26

(cherry picked from FBD8728407)
2018-07-03 11:57:46 -07:00
Rafael Auler 544d1577c1 Avoid removing BBs referenced by JTs
Summary:
While removing unreachable blocks, we may decide to remove a
block that is listed as a target in a jump table entry. If we do that,
this label will be then undefined and LLVM assembler will crash.
Mitigate this for now by not removing such blocks, as we don't support
removing unnecessary jump tables yet.

Fixes facebookincubator/BOLT#20

(cherry picked from FBD8730269)
2018-07-03 17:02:33 -07:00
Laith Saed Sakka b6c4d8e924 -- Adding Veneer elimination pass and Veneer count to dyno stats.
Summary: Create a pass that performs veneer elimination.

(cherry picked from FBD8359299)
2018-06-07 11:10:37 -07:00
Maksim Panchenko 207ac19c63 Revert "[LongJumpPass] X86 enablement. First attempt."
This reverts commit 010b0f7603fc9fa209c6dc95ce4b9c08e7b70d75.

(cherry picked from FBD28111178)
2018-07-06 14:54:53 -07:00
Puyan Lotfi 64c429da89 [LongJumpPass] X86 enablement. First attempt.
(cherry picked from FBD8753328)
2018-07-06 12:31:36 -07:00
Maksim Panchenko b447979b8c [BOLT] Fix diagnostics printing in data aggregator
Summary: Print correct part of the string while reporting an error.

(cherry picked from FBD8745329)
2018-07-05 20:47:38 -07:00
Maksim Panchenko d7b2474f83 [DebugInfo] Change default value of FDEPointerEncoding
Summary:
If the encoding is not specified in CIE augmentation string, then it
should be DW_EH_PE_absptr instead of DW_EH_PE_omit.

(cherry picked from FBD8740274)
2018-07-05 14:21:49 -07:00
Maksim Panchenko 365613b404 [BOLT] Fix no-assertions build
Summary:
In release build without assertions MCInst::dump() is undefined and
causes link time failure.

Fixes facebookincubator/BOLT#27.

(cherry picked from FBD8732905)
2018-07-04 10:33:26 -07:00
Maksim Panchenko a6a37995d9 [BOLT] Reject processing of PIE binaries
Summary:
Check the ELF type of the input binary. Reject any binary not of
ET_EXEC type, including position-independent executables (PIEs).

Also print the first function containing PIC jump table.

(cherry picked from FBD8707274)
2018-06-29 21:12:55 -07:00
Maksim Panchenko edc0cb1121 [LLVM] Accept `S` in augmentation strings in CIE
Summary:
Ignore 'S' in augmentation string on input. It just marks a signal
frame. All we have to do is propagate it.

Fixes facebookincubator/BOLT#21

This was already in LLVM trunk rL331738. Update llvm.patch.

(cherry picked from FBD8707222)
2018-06-29 20:30:36 -07:00
Maksim Panchenko 6802948028 [BOLT] Allow jump tables with 2 entries
Summary:
GCC 8 can generate jump tables with just 2 entries. Modify our heuristic
to accept it. We still assert that there's more than one entry.

(cherry picked from FBD8709416)
2018-06-30 13:30:47 -07:00
Rafael Auler 8835f90d1e [X86] Support a subset of internal calls
Summary:
Add support for functions with internal calls, necessary for
handling Intel MKL library and some code observed in google core dumper
library.

This is not optimizing these functions, but only identifying them,
running analyses to assure we will not break those functions if we move
them, and then "freezing" these functions (marking as not simple so Bolt
will not try to reorder it or touch it in any way).

(cherry picked from FBD8364381)
2018-06-11 13:18:44 -07:00
Facebook Github Bot 07353e9590 [BOLT][PR] In some cases DB could be nullptr
Summary:
When processing a binary in -debug mode, BD could be nullptr in some cases. It is better to fail later on an assert than here with a segfault.
Closes https://github.com/facebookincubator/BOLT/pull/18
GitHub Author: Alexander Gryanko <xpahos@gmail.com>

(cherry picked from FBD8650719)
2018-06-26 17:02:00 -07:00
Rafael Auler 72ecd12f2f Disable -split-eh in non-relocation mode
Summary:
This option only works in relocation mode. In non-relocation
mode, it generates invalid references that cause MCStreamer to fail.
If the user requested this flag, disable it and print a warning.

(cherry picked from FBD8625990)
2018-06-25 14:55:48 -07:00
Rafael Auler 5b2eab6538 [BOLT] Fix call to evaluateX86MemOperands
Summary:
There was a call site not providing a displacement immediate
value. This assertion is firing in opensource.

(cherry picked from FBD8576033)
2018-06-21 11:03:57 -07:00
Rafael Auler 8f717dd25e [BOLT] Add initial bolt-only test infra
Summary:
Create folders and setup to make LIT run BOLT-only tests. Add
a test example. This will add a new make/ninja rule "check-bolt" that
the user can invoke to run LIT on this folder.

(cherry picked from FBD8595786)
2018-06-22 13:50:07 -07:00
Maksim Panchenko 1baa2529ea [merge-fdata] Support legacy/non-YAML profile format
Summary: Concatenate profile contents if they are not in YAML format.

(cherry picked from FBD8579955)
2018-06-21 14:45:38 -07:00
Maksim Panchenko 3ab2929b36 [BOLT] Fix support for PIC jump tables
Summary:
BOLT heuristics failed to work if false PIC jump table entries were
accepted when they were pointing inside a function, but not at
an instruction boundary.

This fix checks if the destination falls at instruction boundary, and
if it does not, it truncates the jump table. This, of course, still does not
guarantee that the entry corresponds to a real destination, and we can
have "false positive" entry(ies). However, it shouldn't affect
correctness of the function, but the CFG may have edges that are never
taken. We may update an incorrect jump table entry corresponding to
unrelated data, and for that reason we force moving of jump tables if a
PIC jump table was detected.
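
One way to picture the truncation step, as a sketch under assumptions:
the table is cut at the first entry that does not land on an
instruction boundary. The names below are hypothetical and this is not
BOLT's code:

```python
def truncate_jump_table(entries, instruction_addresses):
    """Keep leading entries that point at instruction boundaries.

    entries: candidate jump table targets, in table order.
    instruction_addresses: set of addresses where instructions start.
    """
    accepted = []
    for target in entries:
        if target not in instruction_addresses:
            break  # first bad entry ends the table
        accepted.append(target)
    return accepted
```

Entries that pass this check may still be false positives, which is
why the commit forces jump tables to be moved when a PIC jump table is
detected.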

(cherry picked from FBD8559588)
2018-06-20 21:43:22 -07:00
Rafael Auler 35c09dc4dd [BOLT] Add a user friendly error reporting message
Summary:
In case we fail to disassemble or to build the CFG for a
function, print instructions on bug reporting.

(cherry picked from FBD8549737)
2018-06-20 12:03:24 -07:00
Maksim Panchenko 221107c5fb [BOLT] Update llvm.patch
Summary:

(cherry picked from FBD8475998)
2018-06-17 22:29:27 -07:00
Maksim Panchenko a7d025139f Revert "[Bolt][NFC] Change capitalization s/BOLT/Bolt/g"
Summary:

(cherry picked from FBD8431879)
2018-06-14 14:27:20 -07:00
Maksim Panchenko 789162276d [Bolt][NFC] Change capitalization s/BOLT/Bolt/g
(cherry picked from FBD8373789)
2018-06-11 19:46:40 -07:00
Maksim Panchenko 232046f9b2 [Bolt] Reduce verbosity while reporting hash collisions
Summary:
Don't report all data objects with hash collisions by default. Only
report the summary, and use -v=1 for providing the full list.

(cherry picked from FBD8372241)
2018-06-11 17:17:25 -07:00
Bill Nell 706abb6c95 [BOLT] Hash anonymous symbol names
Summary:
This diff replaces the addresses in all the {SYMBOLat,HOLEat,DATAat} symbols with hash values based on the data contained in the symbol.  It should make the profiling data for anonymous symbols robust to address changes.

The only small problem with this approach is that the hashed names for padding symbols of the same size collide frequently.  This shouldn't be a big deal since it would be weird if those symbols were hot.

On a test run with hhvm there were 26 collisions (out of ~338k symbols).  Most of the collisions were from small (2,4,8 byte) objects.
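
To illustrate the idea and the collision behavior, here is a small
sketch. The hash function (SHA-1 truncated to 16 hex digits) and the
helper name are arbitrary choices for this example; the commit does not
say which hash BOLT uses:

```python
import hashlib


def hashed_symbol_name(prefix, contents):
    """Build a name such as "DATAat<hash>" from the symbol's bytes.

    Hypothetical helper: the name depends only on the data, not on the
    address, so it survives address changes between builds.
    """
    digest = hashlib.sha1(contents).hexdigest()[:16]
    return prefix + digest


# Two zero-filled padding symbols of the same size get the same name.
pad_a = hashed_symbol_name("HOLEat", b"\x00" * 8)
pad_b = hashed_symbol_name("HOLEat", b"\x00" * 8)
```

Here pad_a and pad_b are equal, which is the collision case described
above for padding symbols of the same size.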

(cherry picked from FBD7134261)
2018-06-06 03:17:32 -07:00
spupyrev 779541283a [BOLT] merging cold basic blocks to reduce #jumps
Summary:
This diff introduces a modification of cache+ block ordering algorithm,
which reorders and merges cold blocks in a function with the goal of reducing
the number of (non-fallthrough) jumps, and thus, the code size.

(cherry picked from FBD8044978)
2018-05-17 11:14:15 -07:00
Maksim Panchenko b4dbd35d6c [BOLT] Initial support for memcpy() inlining
Summary:
Add "-inline-memcpy" option to inline calls to memcpy() using
"rep movsb" instruction. The pass is X86-specific.

Calls to _memcpy8 are optimized too using a special return value
(dest+size).

The implementation is very primitive in that it does not track liveness
of %rax after return and performs no %rcx substitution. This is going to
get improved if we find the optimization to be useful.

(cherry picked from FBD8211890)
2018-05-26 12:40:51 -07:00
Rafael Auler 42e6512241 [BOLT-AArch64] Detect linker stubs and address them
Summary:
In AArch64, when the binary gets large, the linker inserts
stubs with 3 instructions: ADRP to load the PC-relative address of
a page; ADD to add the offset of the page; and a branch instruction
to do an indirect jump to the contents of X16 (the linker-reserved
reg). The problem is that the linker does not issue a relocation for
this (since this is not code coming from the assembler), so BOLT has
no idea what the real target is unless it recognizes these instructions
and extracts the target by combining the operands of the instructions
from the stub. This diff does exactly that.
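
The address arithmetic behind such a stub can be sketched as follows.
ADRP produces the 4KB page of its own address plus a signed page count,
and ADD supplies the offset within that page. The function and
parameter names are hypothetical, and the operands are assumed to be
already decoded:

```python
def stub_target(adrp_address, adrp_page_delta, add_immediate):
    """Combine ADRP and ADD operands into the stub's branch target.

    adrp_address: address of the ADRP instruction itself.
    adrp_page_delta: signed number of 4KB pages encoded in ADRP.
    add_immediate: low 12-bit offset supplied by the ADD.
    """
    page_base = (adrp_address & ~0xFFF) + (adrp_page_delta << 12)
    return page_base + add_immediate
```

For example, an ADRP at 0x401234 with a page delta of 2 followed by an
ADD of 0x10 resolves to 0x403010.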

(cherry picked from FBD7882653)
2018-04-30 14:47:32 -07:00
Maksim Panchenko 929b0908f7 [BOLT][NFC] Move ICF pass into a separate file
Summary:
Consolidate code used by identical code folding under
Passes/IdenticalCodeFolding.cpp.

(cherry picked from FBD8109916)
2018-05-22 15:52:21 -07:00
Maksim Panchenko 6302e18f94 [PERF2BOLT] Improve file matching
Summary:
If the input binary for perf2bolt has a build-id and perf data has
recorded build-ids, then try to match them. Adjust the file name if
build-ids match to cover cases where the binary was renamed after data
collection. If there's no matching build-id, report an error and exit.

While scanning task events, truncate the name to 15 characters prior to
matching, since that's how names are reported by perf.

(cherry picked from FBD8034436)
2018-05-16 13:31:13 -07:00
Maksim Panchenko 13968f7fa9 [BOLT] Add option to print functions with bad layout
Summary:
Option `-report-bad-layout=N` prints top N functions with layouts
that have cold blocks placed in the middle of hot blocks. The sorting is
based on execution_count / number_of_basic_blocks formula.
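
A sketch of the selection and ranking, under the assumption that a
"cold" block is one with a zero execution count. The data layout and
function names are invented for this example:

```python
def has_bad_layout(block_counts):
    """True if a cold block sits between hot blocks in layout order."""
    hot = [i for i, count in enumerate(block_counts) if count > 0]
    if not hot:
        return False
    return any(count == 0 for count in block_counts[hot[0]:hot[-1] + 1])


def top_bad_layout(functions, n):
    """Return names of the top n functions with bad layouts.

    functions: list of dicts with "name", "execution_count" and
    "block_counts" (per-block counts in layout order).
    """
    bad = [f for f in functions if has_bad_layout(f["block_counts"])]
    # Score: execution_count / number_of_basic_blocks, highest first.
    bad.sort(key=lambda f: f["execution_count"] / len(f["block_counts"]),
             reverse=True)
    return [f["name"] for f in bad[:n]]
```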

(cherry picked from FBD8051950)
2018-05-17 16:58:29 -07:00
Maksim Panchenko 3af3537383 [BOLT] Properly handle non-standard function refs
Summary:
Application code can reference functions in a non-standard way, e.g.
using arithmetic and bitmask operations on them. One example is if a
program checks if a function is below a certain address or within
a certain address range to perform a low-level optimization or generate
proper code (JIT).

Instead of relying on a relocation value (symbol+addend), we use only
the symbol value, and then check if the value is inside the function.
If it is, we treat it as a code reference against location within the
function, otherwise we handle it as a non-standard function reference
and issue a warning.

(cherry picked from FBD7996274)
2018-05-14 11:10:26 -07:00
Maksim Panchenko 1750fee2ac [BOLT] Add option to ignore function hash in profile
Summary:
When we make changes to MCInst opcodes (or get changes from upstream),
a hash value for BinaryFunction changes. As a result, we are unable
to match profile generated by a previous version of BOLT.

Add option `-profile-ignore-hash` to match profile while ignoring
function hash value. With this option we match functions with common
names using the number of basic blocks.

(cherry picked from FBD7983269)
2018-05-11 18:30:47 -07:00
Maksim Panchenko 56b38a14c5 [BOLT] Fix dyno-stats for PLT calls
Summary:
To accurately account for PLT optimization, each PLT call should be
counted as an extra indirect call instruction, which in turn is
a load, a call, an indirect call, and an instruction entry in dyno stats.

(cherry picked from FBD7978980)
2018-05-11 15:30:56 -07:00
spupyrev e4f39bda51 adjusting cache stats for non-simple functions
Summary:
While working with a binary in non-relocation mode, I realized
some cache metrics are not computed correctly. This diff fixes that.
In addition, it logs the number of functions with modified ordering of
basic blocks, which is helpful for analysis.

(cherry picked from FBD7975392)
2018-05-11 12:03:19 -07:00
Bill Nell 729da2da22 [BOLT] Static data reordering pass.
Summary:
Enable BOLT to reorder data sections in a binary based on memory
profiling data.

This diff adds a new pass to BOLT that can reorder data sections for
better locality based on memory profiling data.  For now, the algorithm
to order data is primitive and just relies on the frequency of loads to
order the contents of a section.  We could probably do a lot better by
looking at what functions use the hot data and grouping together hot
data that is used by a single function (or cluster of functions).
Block ordering might give some hints on how to order the data better as
well.

The new pass has two basic modes: inplace and split (when inplace is
false).  The default is split since inplace hasn't really been tested
much.  When splitting is on, the cold data is copied to a "cold" version
of the section while the hot data is kept in the original section, e.g.
for .rodata, .rodata will contain the hot data and .bolt.org.rodata will
contain the cold bits.  In inplace mode, the section contents are
reordered inplace.  In either mode, all relocations to data within that
section are updated to reflect new data locations.

Things to improve:
- The current algorithm is really dumb and doesn't seem to lead to any
  wins.  It certainly could use some improvement.
- Private symbols can have data that leaks over to an adjacent symbol,
  e.g. a string that has a common suffix can start in one symbol and
  leak over (with the common suffix) into the next.  For now, we punt on
  adjacent private symbols.
- Handle ambiguous relocations better.  Section relocations that point
  to the boundary of two symbols will prevent the adjacent symbols from
  being moved because we can't tell which symbol the relocation is for.
- Handle jump tables.  Right now jump table support must be basic if
  data reordering is enabled.
- Handle TLS.  A good amount of data access in some binaries
  happens in TLS.  It would be worthwhile to be able to
  reorder any TLS sections too.
- Handle sections with writeable data.  This hasn't been tested so
  probably won't work.  We could try to prevent false sharing in
  writeable sections as well.
- A pie in the sky goal would be to use DWARF info to reorder types.

(cherry picked from FBD6792876)
2018-04-20 20:03:31 -07:00
Maksim Panchenko bdf21f7617 [BOLT] Align basic blocks based on execution count
Summary:
The default is not changing, i.e. we are not aligning code within a
function by default.

New meaning of options for aligning basic blocks:

  -align-blocks
      triggers basic block alignment based on profile

  -preserve-blocks-alignment
      tries to preserve basic block alignment seen on input

Tuning options for "-align-blocks":
  -align-blocks-min-size=<uint>
      blocks smaller than the specified size wouldn't be aligned

  -align-blocks-threshold=<uint>
      align only blocks whose execution frequency exceeds the given
      percentage of the containing function's execution frequency.
      E.g. 1000 means aligning blocks that are executed more than 10 times
      as frequently as the containing function.
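
The two tuning options can be read as the following decision, shown
here as a sketch. The function and parameter names are hypothetical,
and the strict comparison is an assumption based on the wording
"larger than":

```python
def should_align_block(block_count, function_count, block_size,
                       threshold_percent, min_size):
    """Decide whether a block qualifies for alignment.

    threshold_percent mirrors -align-blocks-threshold and min_size
    mirrors -align-blocks-min-size (names assumed for illustration).
    """
    if block_size < min_size:
        return False  # too small to be worth aligning
    # block_count / function_count > threshold_percent / 100,
    # rewritten without division to stay in integers.
    return block_count * 100 > function_count * threshold_percent
```

With a threshold of 1000 and a function executed 100 times, a block
executed 1001 times qualifies, while one executed exactly 1000 times
does not.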

(cherry picked from FBD7921980)
2017-11-07 15:42:28 -08:00