Commit Graph

597 Commits

Author SHA1 Message Date
Rafael Auler 21c48f7d78 Fix profiling for functions with multiple entry points
Summary:
Fix issue in memcpy where one of its entry points was getting
no profiling data and was wrongly considered cold, being put in the cold
region.

(cherry picked from FBD5569156)
2017-08-02 18:14:01 -07:00
Rafael Auler b81ff8a8fc [BOLT] Fix SCTC issue with hot-cold split
Summary:
SCTC was deleting an unconditional branch to a block in the
cold area because it was the next block in the layout vector. Fix the
condition to only delete such branches when source and target are in
the same allocation area (either both hot or both cold).
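
The corrected condition can be sketched roughly as follows. This is an
illustrative Python model, not BOLT's C++ code; `Block`, `is_cold`, and
`can_drop_branch` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    is_cold: bool  # hypothetical flag: True if the block is placed in the cold area


def can_drop_branch(layout, src_index, target):
    """Return True if an unconditional branch from layout[src_index] to
    `target` is redundant: the target must be the next block in the layout
    AND live in the same allocation area (both hot or both cold)."""
    if src_index + 1 >= len(layout):
        return False
    src = layout[src_index]
    next_block = layout[src_index + 1]
    return next_block is target and src.is_cold == target.is_cold


hot = Block("hot", False)
cold = Block("cold", True)
cold2 = Block("cold2", True)
layout = [hot, cold, cold2]
```

With this layout, the branch from `hot` to `cold` is kept even though `cold`
is next in the layout vector, while the branch from `cold` to `cold2` can go.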

(cherry picked from FBD5570300)
2017-08-04 20:14:24 -07:00
Maksim Panchenko e4290d083f [BOLT] Disable last basic block assertion.
Summary:
While converting code from __builtin_unreachable() we were asserting
that a basic block with a conditional jump and a single CFG successor
was the last one before converting the jump to an unconditional one.

However, if that code was executed after a conditional tail call
conversion in the same function, the original last basic block
will no longer be the last one in the post-conversion layout.

I'm disabling the assertion since it doesn't seem worth it to add
extra checks for the basic block that used to be the last one.

(cherry picked from FBD5570298)
2017-08-04 19:39:45 -07:00
Maksim Panchenko ae409f0b27 [BOLT] Better match LTO functions profile.
Summary:
* Improve profile matching for LTO binaries that don't match 100%.
* Fix profile matching for '.LTHUNK*' functions.
* Add external outgoing branches (calls) for profile validation.

There's an improvement for 100% match profile and for stale LTO
profile. However, we are still not fully closing the gap with
stale profile when LTO is enabled.

(NOTE: I haven't updated all test cases yet)

(cherry picked from FBD5529293)
2017-07-17 11:22:22 -07:00
Maksim Panchenko d27b31ee07 [BOLT] Fix reading LSDA address for PIC code
Summary:
Fix a bug while reading LSDA address in PIC format. The base address was
wrong for PC-relative value. There's more work involved in making PIC
code with C++ exceptions work.

(cherry picked from FBD5538755)
2017-08-01 11:19:01 -07:00
Yue Zhao eb64d03b73 Reformat the register strings in the output so Stoke can parse without preprocessing.
Summary:
Minor change. Reformat the def-in, live-out register strings so that Stoke can parse
them without preprocessing.

(cherry picked from FBD5537421)
2017-07-27 12:52:56 -07:00
Bohan Ren 87481cb494 [BOLT] Improve Jump-Distance Metric -- Consider Function Execution Count
Summary:
Function execution count is very important. When calculating the metric,
we should care more about functions that are known to be executed.

The correlation between this metric and CPU time is slightly improved,
to close to 96%, and the correlation between this metric and Cache Miss
remains the same at 96%.

Thanks to Sergey for the suggestion!

(cherry picked from FBD5494720)
2017-07-25 16:27:00 -07:00
Rafael Auler 787db1cf3e Recognize AArch64 as a valid input
Summary:
BOLT needs to be configured with the LLVM
AArch64 backend. If the backend is linked into the LLVM
library, start processing AArch64 binaries.

(cherry picked from FBD5489369)
2017-07-25 09:11:42 -07:00
Yue Zhao 70bad8d34d add: get function score to find hot functions refine the dumped csv format
Summary: minor modification of the bolt stoke pass

(cherry picked from FBD5471011)
2017-07-13 15:02:52 -07:00
Yue Zhao 6d845719ce get analysis information of functions
Summary:
complete the StokeInfo pass,
ignore previous arc diff

(cherry picked from FBD5306863)
2017-06-13 17:24:27 -07:00
Rafael Auler 4e29afeb18 [BOLT] Add cold symbols to the symbol table
Summary:
Create new .symtab and .strtab sections, so we can change their
sizes and not only patch them. Remove local symbols and add symbols to
identify the cold part of split functions.

(cherry picked from FBD5345460)
2017-06-27 16:25:59 -07:00
Bohan Ren 4d34471eeb [BOLT] Improved Jump-Distance Metric
Summary:
The existing Jump-Distance Metric (previously named Call-Distance) ignores some traversals.
This modified version adds those missing traversals back.

The correlation remains the same: around 97% correlation with CPU and
Cache Miss (which implies that even though some traversals are ignored,
it doesn't affect correlation that much.)

(cherry picked from FBD5369653)
2017-07-04 15:59:29 -07:00
Rafael Auler 4ecd3856e9 [BOLT] Fix shrink-wrapping bugs
Summary:
Make shrink-wrapping more stable. Changes:

* Correctly detect landing pads at the dominance frontier, bailing
  on such cases because we are not prepared to split LPs that are the
  target of a critical edge.
* Disable FOP's store removal by default - this is experimental and
  shouldn't go to prod because removing a store that we failed to detect
  as actually necessary is disastrous. This pass currently doesn't
  have a great impact on the number of stores reduced, so it is not a
  problem. Most stores reduced are due to shrink wrapping anyway.
* Fix stack access identification - correctly estimate memory length of
  weird instructions, bail if we don't know.
* Make rules for shrink-wrapping more strict: cancel shrink wrapping in
  a number of cases when we are not 100% sure that we are dealing with a
  regular callee-saved register.
* Add basic block folding to SW. Sometimes when splitting critical edges
  we create a lot of redundant BBs with the same instructions, same
  successor but different predecessor. Fold all identical BBs created by
  splitting critical edges.
* Change defaults: now the threshold used to determine when to perform
  SW is more conservative, to be sure we are moving a spill to a colder
  area. This effort, along with BB folding, helps us to avoid hurting
  icache performance by indiscriminately increasing code size.
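
The basic block folding step in the list above can be pictured with a small
sketch. It is only a model under stated assumptions: blocks are represented
as (instructions, successor) pairs, and `fold_identical_blocks` is a
hypothetical helper, not a function from the BOLT sources.

```python
def fold_identical_blocks(blocks):
    """blocks: dict mapping block name -> (instructions, successor name).

    Returns a dict mapping every block name to the representative block
    that predecessors should be redirected to. Two blocks fold together
    when they have the same instructions and the same successor."""
    seen = {}            # (instructions, successor) -> first block with that shape
    representative = {}
    for name, (instructions, successor) in blocks.items():
        key = (tuple(instructions), successor)
        representative[name] = seen.setdefault(key, name)
    return representative


# Three blocks created by splitting critical edges (made-up contents).
blocks = {
    "split1": (["push %rbx"], "body"),
    "split2": (["push %rbx"], "body"),
    "split3": (["push %r12"], "body"),
}
folded = fold_identical_blocks(blocks)
```

Here `split2` folds into `split1`, while `split3` stays separate because its
instructions differ.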

(cherry picked from FBD5315086)
2017-06-22 16:34:01 -07:00
Bohan Ren ec304396c3 [BOLT] Call Distance Metric
Summary:
Designed a new metric, which shows 93.46% correlation with Cache Miss
and 86% correlation with CPU Time.

Definition:

One can get all the traversal paths for each function, and for each
traversal we define a distance. The distance represents how far apart
two connected basic blocks are. Therefore, for each traversal, I go
through the basic blocks one by one until the end of the traversal and
sum up the distances between neighboring basic blocks.
The distance between two connected basic blocks is the distance between
the centers of the two blocks in the binary file.
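
A minimal sketch of the per-traversal distance, assuming each basic block is
given as a (start address, size) pair and that "distance" means the absolute
difference between block centers. The function names and the sample
addresses are made up for illustration.

```python
def block_center(start, size):
    """Address of the middle of a basic block."""
    return start + size / 2


def traversal_distance(traversal):
    """traversal: list of (start_address, size) in the order visited.

    Sums the center-to-center distance of each pair of neighboring
    blocks along the traversal."""
    centers = [block_center(start, size) for start, size in traversal]
    return sum(abs(b - a) for a, b in zip(centers, centers[1:]))


# Hypothetical traversal over three blocks.
example = [(0x1000, 16), (0x1010, 32), (0x1100, 8)]
```

For `example` the centers are 0x1008, 0x1020 and 0x1104, so the traversal
distance is 24 + 228 = 252 bytes.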

(cherry picked from FBD5242526)
2017-06-13 16:29:39 -07:00
Rafael Auler 3469396269 [BOLT] Set local symbols in relocation mode to zero
Summary:
Strobelight is getting confused by local symbols that we do not
update in relocation mode. These symbols were preserved by the linker in
relocation mode in order to support emitting relocations against local
labels, but they are unused.

Issue a quick fix to this by detecting such symbols and setting their
value to zero.

This patch also fixes an issue with the symbol table that was assigning
the wrong section index to symbols associated with the .text section.

(cherry picked from FBD5271277)
2017-06-16 20:04:43 -07:00
Bill Nell 59e90f0f43 [BOLT] Make function reordering more robust with stale data.
Summary:
Rewrote the guts of buildCallGraph.  There are two new options to control how the CG is created.  UsePerfData controls whether we use the perf data directly to construct the CG for functions with a stale profile.  IgnoreRecursiveCalls omits recursive calls from the CG since they might be skewing results unfairly for heavily recursive functions.
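
As a rough illustration of what omitting recursive calls means for the call
graph, here is a small sketch. It assumes the input is a list of
(caller, callee, count) records; `build_call_graph` and its parameter are
hypothetical and do not mirror the real buildCallGraph signature.

```python
from collections import defaultdict


def build_call_graph(calls, ignore_recursive_calls=True):
    """calls: iterable of (caller, callee, count) records.

    Returns a dict mapping (caller, callee) -> accumulated weight.
    Self-edges are skipped when ignore_recursive_calls is set, so a
    heavily recursive function does not dominate the edge weights."""
    edges = defaultdict(int)
    for caller, callee, count in calls:
        if ignore_recursive_calls and caller == callee:
            continue
        edges[(caller, callee)] += count
    return dict(edges)


calls = [("f", "g", 10), ("g", "g", 1000), ("f", "g", 5)]
graph = build_call_graph(calls)
```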

I've changed the way BinaryFunction::estimateHotSize() works.  If the function is marked as split, I count the size of all the non-cold blocks.  This gives a different but more accurate answer than the old method.
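
The split case can be sketched like this, assuming blocks are given as
(size, is_cold) pairs. The `fallback_size` parameter is a placeholder for the
non-split case, which this summary does not describe.

```python
def estimate_hot_size(blocks, is_split, fallback_size):
    """blocks: list of (size_in_bytes, is_cold) pairs.

    For a split function, count only the non-cold blocks. For a function
    that is not split, return the caller-supplied fallback (assumption:
    that case is handled elsewhere and is not modeled here)."""
    if not is_split:
        return fallback_size
    return sum(size for size, is_cold in blocks if not is_cold)


blocks = [(32, False), (48, True), (16, False)]
```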

I've improved and updated the CG build stats with extra information.

(cherry picked from FBD5224183)
2017-06-09 13:17:36 -07:00
Rafael Auler 8233c7d204 [BOLT] Bail frame analysis on PUSHes escaping vars
Summary:
Some PUSH instructions may contain memory addresses pushed to
the stack. If this memory address is from an object in the stack, cancel
further frame analysis for this function since it may be escaping a
variable.

This fixes a bug with deleting used stores (in frameopt) in hhvm trunk.

(cherry picked from FBD5270590)
2017-06-16 15:02:26 -07:00
Yue Zhao 37d0f81df5 BinaryFunction.h: Clarify comment for getSize(), add getNumNonPseudos()
Summary: Minor fix and add new function

(cherry picked from FBD5270376)
2017-06-16 17:06:13 -07:00
Bill Nell dc4dd64800 [BOLT] More HFSort+ refactoring
Summary: Move most of hfsort+ into a class so the state can more easily be shared.

(cherry picked from FBD5216206)
2017-06-08 10:55:28 -07:00
Bohan Ren f819f53d27 Normalize Clusters Twice
Summary:
This one will normalize clusters twice, leaving edges connecting two
basic blocks untouched.

(cherry picked from FBD5207416)
2017-06-07 20:25:30 -07:00
Rafael Auler eeea415dd2 [BOLT] Fix SCTC execution count assertion
Summary:
SCTC is currently asserting (my fault :-) when running in
combination with hot jump table entries optimization. This optimization
sets the frequency for edges connecting basic blocks it creates and jump
table targets based on the execution count of the original BB containing
the indirect jump.

This is OK as an estimation, but it breaks our assumption that the sum of
the frequencies of a BB's predecessor edges equals the BB frequency. This
happens because the frequency of the BB is rarely equal to the frequency
of its outgoing edges.

SCTC, in turn, was updating the execution count for BBs with tail calls
by subtracting the frequency count of predecessor edges. Because hot
jump table entries optimization broke the BB exec count = sum(preds freq)
invariant, SCTC was asserting.

To trigger this, the input program must have a jump table where each
entry contains a tail call. This happens in the HHVM binary for func
_ZN4HPHP11collections5issetEPNS_10ObjectDataEPKNS_10TypedValueE.

(cherry picked from FBD5222504)
2017-06-09 15:52:50 -07:00
Bohan Ren eb63a0b295 [BOLT] Expand BOLT report for basic block ordering
Summary:
Add a new positional option onto bolt: "-print-function-statistics=<uint64>"
which prints information about block ordering for requested number of functions.

(cherry picked from FBD5105323)
2017-05-22 11:04:01 -07:00
Bill Nell ea53066287 [BOLT] Fix hfsort+ caching mechanism
Summary:
There's good news and bad news.

The good news is that this fixes the caching mechanism used by hfsort+ so that we always get the correct end results, i.e. the order is the same whether the cache is enabled or not.
The bad news is that it takes about the same amount of time as the original to run. (~6min)
The good news is that I can make some improvements on this implementation which I'll put up in another diff.

The problem with the old caching mechanism is that it was caching values that were dependent on adjacent sets of clusters.  It only invalidated the clusters being merged and none of other clusters that might have been affected.  This version computes the adjacency information up front and updates it after every merge, rather than recomputing it for each iteration.  It uses the adjacency data to properly invalidate any cached values.
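
The invalidation idea can be modeled with a toy cache. This is a sketch under
assumptions, not the hfsort+ code: clusters are plain strings, adjacency is
symmetric, and a merge invalidates every cached pair that touches the merged
clusters or any of their neighbors. `MergeCache` and `gain` are hypothetical
names.

```python
class MergeCache:
    def __init__(self, adjacency):
        # Adjacency is computed once up front: cluster -> set of neighbors.
        self.adjacent = {c: set(n) for c, n in adjacency.items()}
        self.values = {}

    def lookup(self, a, b, compute):
        """Return the cached value for (a, b), computing it on a miss."""
        key = (a, b)
        if key not in self.values:
            self.values[key] = compute(a, b)
        return self.values[key]

    def merge(self, into, other):
        """Merge `other` into `into`, invalidating affected cache entries
        and updating adjacency in place instead of recomputing it."""
        affected = {into, other} | self.adjacent[into] | self.adjacent[other]
        self.values = {k: v for k, v in self.values.items()
                       if k[0] not in affected and k[1] not in affected}
        neighbors = (self.adjacent[into] | self.adjacent.pop(other)) - {into, other}
        self.adjacent[into] = neighbors
        for n in neighbors:
            self.adjacent[n].discard(other)
            self.adjacent[n].add(into)


def gain(a, b):
    return len(a) + len(b)  # placeholder for the real merge gain


cache = MergeCache({"A": {"B"}, "B": {"A", "C"}, "C": {"B"},
                    "D": {"E"}, "E": {"D"}})
for pair in [("A", "B"), ("B", "C"), ("D", "E")]:
    cache.lookup(pair[0], pair[1], gain)
cache.merge("A", "B")
```

After the merge only the unrelated ("D", "E") entry survives, and "C" is now
adjacent to the merged cluster "A".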

(cherry picked from FBD5203023)
2017-06-06 17:43:45 -07:00
Rafael Auler 583790ee22 Fix dynostats for conditional tail calls
Summary:
Don't treat conditional tail calls as branches for dynostats. Count
taken conditional tail calls as calls. Change SCTC to report dynamic
numbers after it is done.

(cherry picked from FBD5203708)
2017-06-07 14:20:39 -07:00
Rafael Auler 2baa4c7a2c [BOLT] Only print stats when requested
Summary:
Make LLVM timers only output numbers when the -time-opts option
is used.

(cherry picked from FBD5212221)
2017-06-08 13:46:17 -07:00
Bill Nell 8eaa2fdd9f [BOLT] Fix hfsort+ crash when no perf data is present.
Summary: hfsort+ was trying to access the back() of an empty vector when no perf data is present.  Just add a guard around that code.

(cherry picked from FBD5206962)
2017-06-07 18:31:06 -07:00
Maksim Panchenko f9436bc903 [BOLT] Fix ELF inter-section references
Summary:
Since we are stripping non-allocatable relocation sections from
the binary and adding new sections, section indices in the binary
change. Some sections refer to other sections by their index,
which is stored in the sh_link or sh_info field. Hence we need to update
these fields.

In the past, the update of indices was done ad hoc, and as we started
adding more complex updates to the section header table, the update
mechanism became broken in some cases. As a result, we were putting
wrong indices into sh_link/sh_info.

The broken case was discovered while investigating a problem with
a stripped BOLTed binary. In the BOLTed binary, .rela.plt was incorrectly
pointing to one of the debug sections, and the strip command removed
the debug section together with the .rela section that was referencing it.

The new update mechanism computes a complete old-to-new section index
mapping and updates the sh_link/sh_info fields based on the mapping
before writing section header entries into an output file.
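
The mapping approach might look like the following sketch. It makes several
assumptions that are not from the BOLT sources: sections are matched by
(unique) name, headers are plain dicts, a hypothetical `info_is_section` flag
says whether sh_info holds a section index, and references to removed
sections fall back to index 0.

```python
def build_index_map(old_names, new_names):
    """Map each old section index to its index in the new section table.
    Sections that were removed get no entry."""
    new_index = {name: i for i, name in enumerate(new_names)}
    return {old: new_index[name]
            for old, name in enumerate(old_names) if name in new_index}


def remap_header(header, index_map):
    """Return a copy of `header` with sh_link (and sh_info, when it is a
    section index) translated through the old-to-new mapping."""
    updated = dict(header)
    updated["sh_link"] = index_map.get(header["sh_link"], 0)
    if header.get("info_is_section"):
        updated["sh_info"] = index_map.get(header["sh_info"], 0)
    return updated


# Made-up section tables: .rela.text is stripped and a new section is added.
old_names = ["", ".dynsym", ".rela.plt", ".rela.text", ".plt", ".symtab"]
new_names = ["", ".dynsym", ".rela.plt", ".plt", ".bolt.new", ".symtab"]
index_map = build_index_map(old_names, new_names)

rela_plt = {"name": ".rela.plt", "sh_link": 1, "sh_info": 4,
            "info_is_section": True}
fixed = remap_header(rela_plt, index_map)
```

Computing the whole map first means every header is rewritten from the same
table, rather than patching individual fields case by case.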

(cherry picked from FBD5207378)
2017-06-07 20:06:29 -07:00
Rafael Auler 2c23094299 Split FrameAnalysis and improve LivenessAnalysis
Summary:
Split FrameAnalysis into FrameAnalysis and RegAnalysis, since
some optimizations only require register information about functions,
not frame information. Refactor callgraph walking code into the
CallGraphWalker class, allowing any analysis that depends on the call
graph to easily traverse it via a visitor pattern. Also fix
LivenessAnalysis, which was broken because it was not considering
registers read into callees and incorporating this into the caller.

(cherry picked from FBD5177901)
2017-06-02 16:57:22 -07:00
Rafael Auler d850ca3622 [BOLT] Add shrink wrapping pass
Summary:
Add an implementation for shrink wrapping, a frame optimization
that moves callee-saved register spills from hot prologues to cold
successors.

(cherry picked from FBD4983706)
2017-05-01 16:52:54 -07:00
Maksim Panchenko 4b485f4167 [BOLT] Fix misc issues in relocation mode.
Summary:
Fix issues discovered while testing LTO mode with bfd linker:

  * Correctly update absolute function references from code
    with addend.
  * Support .got.plt section generated by bfd linker.
  * Support quirks of .tbss section.
  * Don't ignore functions if the size in FDE doesn't match the
    size in the symbol table. Instead keep processing using the
    maximum indicated size.

(cherry picked from FBD5178831)
2017-06-02 18:41:31 -07:00
Bill Nell 382c660ee5 [BOLT] Make hfsort+ deterministic and add test case
Summary:
Make hfsort+ algorithm deterministic.
We only had a test for hfsort.  Since hfsort+ is going to be the default, I've added a test for that too.

(cherry picked from FBD5143143)
2017-05-26 17:42:39 -07:00
Bill Nell 5feee9f1d8 [BOLT] More CG refactoring
Summary:
Do some additional refactoring of the CallGraph class.  Add a BinaryFunctionCallGraph class that has the BOLT specific bits.  This is in preparation to moving the generic CallGraph class into a library that both BOLT and HHVM can use.

Make data members of CallGraph private and add the appropriate accessor methods.

(cherry picked from FBD5143468)
2017-05-26 15:46:46 -07:00
Maksim Panchenko 95ab659fe4 [BOLT] Do not assert on an empty location list.
Summary:
Clang generates an empty debug location list, which doesn't make sense,
but we probably shouldn't assert on it and instead issue a warning
in verbosity mode. There is only a single empty location list in the
whole llvm binary.

(cherry picked from FBD5166666)
2017-06-01 12:30:52 -07:00
Bill Nell 733e8c464f HFSort/call graph refactoring
Summary:
I've factored out the call graph code from dataflow and function reordering code and done a few small renames/cleanups.  I've also moved the function reordering pass into a separate file because it was starting to get big.

I've got more refactoring planned for hfsort/call graph but this is a start.

(cherry picked from FBD5140771)
2017-05-26 12:53:21 -07:00
Bill Nell 9b190cc74b [BOLT] Fix SCTC again again.
Summary: I put the const_cast<BinaryFunction *>(this) on the wrong version of getBasicBlockAfter().  It's on the right one now.

(cherry picked from FBD5159127)
2017-05-31 14:23:37 -07:00
Maksim Panchenko 6c32079d57 [BOLT] Update addresses for DW_TAG_GNU_call_site and DW_TAG_label.
Summary:
Some DWARF tags (such as GNU_call_site and label) reference instruction
addresses in the input binary. When we update debug info we need to
update these tags too with new addresses.

Also fix base address used for calculation of output addresses in
relocation mode.

(cherry picked from FBD5155814)
2017-05-31 09:36:49 -07:00
Bill Nell 35d2530a40 [BOLT] Fix SCTC again.
Summary: Respect hot/cold boundaries when using BinaryFunction::getBasicBlockAfter().

(cherry picked from FBD5153379)
2017-05-30 19:06:22 -07:00
Maksim Panchenko 2e744e6867 [BOLT] Emit sorted DWARF ranges and location lists.
Summary:
When producing address ranges and location lists for debug info
add a post-processing step that sorts them and merges adjacent
entries.
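
A minimal sketch of such a post-processing step, assuming ranges are
half-open (low, high) address pairs. `sort_and_merge` is a hypothetical name,
and this version also merges overlapping ranges, which the summary does not
mention.

```python
def sort_and_merge(ranges):
    """Sort (low, high) address ranges and merge entries that touch.

    A range is merged into the previous one when its low address is not
    above the previous high address (adjacent or overlapping)."""
    merged = []
    for low, high in sorted(ranges):
        if merged and low <= merged[-1][1]:
            prev_low, prev_high = merged[-1]
            merged[-1] = (prev_low, max(prev_high, high))
        else:
            merged.append((low, high))
    return merged


ranges = [(0x30, 0x40), (0x10, 0x20), (0x20, 0x30), (0x50, 0x60)]
result = sort_and_merge(ranges)
```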

Fix a memory allocation/free issue for .debug_ranges section.

(cherry picked from FBD5130583)
2017-05-24 15:20:27 -07:00
Bill Nell 96943d2f4b Add option to generate function order file.
Summary: Add -generate-function-order=<filename> option to write the computed function order to a file.  We can read this order in later rather than recomputing each time we process a binary with BOLT.

(cherry picked from FBD5127915)
2017-05-24 18:40:29 -07:00
Maksim Panchenko 2428567f7d [BOLT] Fix no-assertions build.
(cherry picked from FBD5130285)
2017-05-25 10:29:38 -07:00
Maksim Panchenko 174e3a825b [BOLT] Fix C++ ABI function alignment.
Summary: C++ functions have to be aligned at 2-bytes minimum on x86-64.

(cherry picked from FBD5128185)
2017-05-24 21:59:01 -07:00
Bill Nell 5cd58961a9 Add .bolt_info notes section containing BOLT revision and command line args.
Summary:
Optionally add a .bolt_info notes section containing BOLT revision and command line args.
The new section is controlled by the -add-bolt-info flag which is on by default.

(cherry picked from FBD5125890)
2017-05-24 14:14:16 -07:00
Rafael Auler 2ee4bbd3c1 [BOLT] Optimize jump tables with hot entries
Summary:
This diff is similar to Bill's diff for optimizing jump tables
(and is built on top of it), but it differs in the strategy used to
optimize the jump table. The previous approach loads the target address
from the jump table and compares it to check if it is a hot target. This
accomplishes branch misprediction reduction by promoting the indirect jmp
to a (more predictable) direct jmp.

  load  %r10, JMPTABLE
  cmp   %r10, HOTTARGET
  je    HOTTARGET
  ijmp  [JMPTABLE + %index * scale]

The idea in this diff is instead to make dcache better by avoiding the
load of the jump table, leaving branch mispredictions as a secondary
target. To do this we compare the index used in the indirect jmp and, if
it matches a known hot entry, perform a direct jump to the target.

  cmp  %index, HOTINDEX
  je   CORRESPONDING_TARGET
  ijmp [JMPTABLE + %index * scale]

The downside of this approach is that we may have multiple indices
associated with a single target, but we only have profiling to show
which targets are hot and we have no clue about which indices are hot.

  INDEX    TARGET
  0        4004f8
  8        4004f8
  10       4003d0
  18       4004f8

  Profiling data:
  TARGET   COUNT
  4004f8   10020
  4003d0   17

In this example, we know 4004f8 is hot, but to make a direct call to it
we need to check for indices 0, 8 and 18 -- 3 comparisons instead of 1.

Therefore, once we know a target is hot, we must generate code to
compare against all possible indices associated with this target because
we don't know which index is the hot one (IF there's a hotter index).

  cmp %index, 0
  je  4004f8
  cmp %index, 8
  je  4004f8
  cmp %index, 18
  je  4004f8
  (... up to N comparisons as in --indirect-call-promotion-topn=N )
  ijmp [JMPTABLE + %index * scale]
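
The selection of comparisons can be sketched as below. This is only a model:
`hot_index_checks` is a hypothetical helper, the table values are read as
hexadecimal, and capping the list at `top_n` entries is one interpretation of
the "up to N comparisons" note.

```python
from collections import defaultdict


def hot_index_checks(jump_table, profile, top_n):
    """jump_table: list of (index, target); profile: dict target -> count.

    Returns the (index, target) pairs to compare against before the
    indirect jump, hottest target first, at most top_n pairs."""
    indices = defaultdict(list)
    for index, target in jump_table:
        indices[target].append(index)
    checks = []
    for target in sorted(profile, key=profile.get, reverse=True):
        # Every index of a hot target is checked, since the profile does
        # not say which index is the hot one.
        for index in indices.get(target, []):
            if len(checks) == top_n:
                return checks
            checks.append((index, target))
    return checks


jump_table = [(0x0, 0x4004f8), (0x8, 0x4004f8),
              (0x10, 0x4003d0), (0x18, 0x4004f8)]
profile = {0x4004f8: 10020, 0x4003d0: 17}
checks = hot_index_checks(jump_table, profile, top_n=3)
```

With `top_n=3` the three checks are exactly the indices 0, 8 and 18 of the
hot target 4004f8, matching the cmp/je sequence above.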

(cherry picked from FBD5005620)
2017-05-01 14:04:40 -07:00
Bill Nell 3a3bcd767e Don't add useless uncond branch to fallthroughs when running SCTC.
Summary:
SCTC was sometimes adding unconditional branches to fallthrough blocks.
This diff checks to see if the unconditional branch is really necessary, i.e.
it's not to a fallthrough block.

(cherry picked from FBD5098493)
2017-05-19 14:45:46 -07:00
Maksim Panchenko 96adec51eb [BOLT] Rework debug info processing.
Summary:
Multiple improvements to debug info handling:
  * Add support for relocation mode.
  * Speed-up processing.
  * Reduce memory consumption.
  * Bug fixes.

The high-level idea behind the new debug handling is that we don't save
intermediate state for ranges and location lists. Instead we depend
on function and basic block address transformations to update the info
as a final post-processing step.

For HHVM in non-relocation mode the peak memory went down from 55GB to 35GB. Processing time went from over 6 minutes to under 5 minutes.

(cherry picked from FBD5113431)
2017-05-16 09:27:34 -07:00
Rafael Auler 511a1c78b2 [BOLT] Add dataflow infrastructure
Summary:
This diff introduces a common infrastructure for performing
dataflow analyses in BinaryFunctions as well as a few analyses that are
useful in a variety of scenarios. The largest user of this
infrastructure so far is shrink wrapping, which will be added in a
separate diff.

(cherry picked from FBD4983671)
2017-05-01 16:51:27 -07:00
Maksim Panchenko 457b7f14b9 [BOLT] Fix debug info for input with continuous range.
Summary:
When we see a compilation unit with continuous range on input,
it has two attributes: DW_AT_low_pc and DW_AT_high_pc. We convert the
range to a non-continuous one and change the attributes to
DW_AT_ranges and DW_AT_producer. However, gdb seems to expect
every compilation unit to have a base address specified via
DW_AT_low_pc, even when its value is always 0. Otherwise gdb will
not show proper debug info for such modules.

With this diff we produce DW_AT_ranges followed by DW_AT_low_pc.
The problem is that the first attribute takes DW_FORM_sec_offset
which is exactly 4 bytes, and in many cases we are left with
12 bytes to fill in. We used to fill this space with DW_AT_producer,
which took an arbitrary-length field. For DW_AT_low_pc we can
use a trick of using DW_FORM_udata (unsigned ULEB128 encoded
integer) which can take up to 12 bytes, even when the value is 0.
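
The padding trick relies on ULEB128 allowing redundant continuation bytes. A
small sketch, with hypothetical helper names, of encoding a value into a
fixed number of bytes and decoding it back:

```python
def encode_uleb128_padded(value, width):
    """Encode a non-negative integer as ULEB128 using exactly `width` bytes.

    Every byte except the last carries the continuation bit (0x80), so a
    small value such as 0 can still fill the whole field."""
    if value < 0 or width < 1:
        raise ValueError("value must be >= 0 and width >= 1")
    out = bytearray()
    for i in range(width):
        byte = value & 0x7F
        value >>= 7
        if i < width - 1:
            byte |= 0x80  # more bytes follow
        out.append(byte)
    if value:
        raise ValueError("value does not fit in the requested width")
    return bytes(out)


def decode_uleb128(data):
    """Decode a ULEB128 value from the start of `data`."""
    result = 0
    for position, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * position)
        if not byte & 0x80:
            break
    return result


padded_zero = encode_uleb128_padded(0, 12)
```

`padded_zero` is eleven 0x80 bytes followed by a final 0x00, and it still
decodes to 0.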

(cherry picked from FBD5109798)
2017-05-22 17:17:04 -07:00
Bill Nell 4806b13835 [BOLT] Add jump table support to ICP
Summary:
Add jump table support to ICP.  The optimization is basically the same
as ICP for tail calls.  The big difference is that the profiling data
comes from the jump table and the targets are local symbols rather than
global.

I've removed an instruction from ICP for tail calls.  The code used to
have a conditional jump to a block with a direct jump to the target, i.e.

  B1: cmp foo,(%rax)
      jne B3
  B2: jmp foo
  B3: ...

this code is now:

  B1: cmp foo,(%rax)
      je  foo
  B2: ...

The other changes in this diff:
- Move ICP + new jump table support to separate file in Passes.
- Improve the CFG validation to handle jump tables.
- Fix the double jump peephole so that the successor of the modified
  block is updated properly.  Also make sure that any existing branches
  in the block are modified to properly reflect the new CFG.
- Add an invocation of the double jump peephole to SCTC.  This allows
  us to remove a call to peepholes/UCE occurring after fixBranches() in
  the pass manager.
- Miscellaneous cleanups to BOLT output.

(cherry picked from FBD4727757)
2017-03-08 19:58:33 -08:00
Maksim Panchenko c789d5137b [BOLT] Add option to keep/generate .debug_aranges.
Summary:
The GOLD linker removes .debug_aranges while generating .gdb_index.
Some tools however rely on the presence of this section.
Add an option to generate .debug_aranges if it was removed,
or keep it in the file if it was present.

Generally speaking .debug_aranges duplicates information present
in .gdb_index addresses table.

(cherry picked from FBD5084808)
2017-05-17 18:35:00 -07:00
Maksim Panchenko 69b586326c [BOLT] Support adding new non-allocatable sections.
Summary:
We had the ability to add allocatable sections before. This diff
expands this capability to non-allocatable sections.

(cherry picked from FBD5082018)
2017-05-16 17:29:31 -07:00