Commit Graph

272 Commits

Author SHA1 Message Date
Wei Mi 7fb400112f [SampleFDO] Add a cutoff flag to control how many symbols will be included
into profile symbol list.

When test is unrepresentative to production behavior, sample profile
collected from production can cause unexpected performance behavior
in test. To triage such issue, it is useful to have a cutoff flag
to control how many symbols will be included into profile symbol list
in order to do binary search.

Differential Revision: https://reviews.llvm.org/D97623
2021-02-27 23:15:31 -08:00
Wei Mi 5fb65c02ca [SampleFDO] Stop repeated indirect call promotion for the same target.
Found a problem in indirect call promotion in sample loader pass. Currently
if an indirect call is promoted for a target, and if the parent function is
inlined into some other function, the indirect call can be promoted for the
same target again. That is redundent which can harm performance and can cause
excessive compile time in some extreme case.

The patch fixes the issue. If a target is promoted for an indirect call, the
patch will write ICP metadata with the target call count being set to 0.
In the later ICP in sample profile loader, if it sees a target has 0 count
for an indirect call, it knows the target has been promoted and won't do
indirect call promotion for the indirect call.

The fix brings 0.1~0.2% performance on our search benchmark.

Differential Revision: https://reviews.llvm.org/D96806
2021-02-18 17:01:32 -08:00
Hongtao Yu 0eed2b1a3c Remove test code that cause MSAN failure.
Summary:
The negative test (with the feature being added disabled) caused MSAN failure and that's the added feature is supposed to fix. Therefore the negative test code is being removed.
2021-02-11 14:51:55 -08:00
Hongtao Yu de40f6d623 [CSSPGO] Process functions in a top-down order on a dynamic call graph.
Functions are currently processed by the sample profiler loader in a top-down order defined by the static call graph. The order is being adjusted to be a top-down order based on the input context-sensitive profile. One benefit is that the processing order of caller and callee in one SCC would follow the context order in the profile to favor more inlining. Another benefit is that the processing order of caller and callee through an indirect call (which is not on the static call graph) can be honored which in turn allows for more inlining.

The profile top-down order for SCC is also extended to support non-CS profiles.

Two switches `-mllvm -use-profile-indirect-call-edges` and `-mllvm -use-profile-top-down-order` are being introduced.

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D95988
2021-02-11 12:36:59 -08:00
Hongtao Yu 3a5f8a3ea3 [CSSPGO] Restrict pseudo probe tests to x86_64 only. 2021-02-10 14:41:10 -08:00
Hongtao Yu 1cb47a063e [CSSPGO] Unblock optimizations with pseudo probe instrumentation.
The IR/MIR pseudo probe intrinsics don't get materialized into real machine instructions and therefore they don't incur runtime cost directly. However, they come with indirect cost by blocking certain optimizations. Some of the blocking are intentional (such as blocking code merge) for better counts quality while the others are accidental. This change unblocks perf-critical optimizations that do not affect counts quality. They include:

1. IR InstCombine, sinking load operation to shorten lifetimes.
2. MIR LiveRangeShrink, similar to #1
3. MIR TwoAddressInstructionPass, i.e, opeq transform
4. MIR function argument copy elision
5. IR stack protection. (though not perf-critical but nice to have).

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D95982
2021-02-10 12:43:17 -08:00
Wenlei He 801d9cc7b9 [CSSPGO] Use merged base profile for hot threshold calculation
Context-sensitive profile effectively split a function profile into many copies each representing the CFG profile of a particular calling context. That makes the count distribution looks more flat as we now have more function profiles each with lower counts, which in turn leads to lower hot thresholds. Now we tells threshold computation to merge context profile first before calculating percentile based cutoffs to compensate for seemingly flat context profile. This can be controlled by swtich `sample-profile-contextless-threshold`.

Earlier measurement showed ~0.4% perf boost with this tuning on spec2k6 for CSSPGO (with pseudo-probe and new inliner).

Differential Revision: https://reviews.llvm.org/D95980
2021-02-05 17:51:00 -08:00
Hongtao Yu 3d89b3cbec [CSSPGO] Introducing distribution factor for pseudo probe.
Sample re-annotation is required in LTO time to achieve a reasonable post-inline profile quality. However, we have seen that such LTO-time re-annotation degrades profile quality. This is mainly caused by preLTO code duplication that is done by passes such as loop unrolling, jump threading, indirect call promotion etc, where samples corresponding to a source location are aggregated multiple times due to the duplicates. In this change we are introducing a concept of distribution factor for pseudo probes so that samples can be distributed for duplicated probes scaled by a factor. We hope that optimizations duplicating code well-maintain the branch frequency information (BFI) based on which probe distribution factors are calculated. Distribution factors are updated at the end of preLTO pipeline to reflect an estimated portion of the real execution count.

This change also introduces a pseudo probe verifier that can be run after each IR passes to detect duplicated pseudo probes.

A saturated distribution factor stands for 1.0. A pesudo probe will carry a factor with the value ranged from 0.0 to 1.0. A 64-bit integral distribution factor field that represents [0.0, 1.0] is associated to each block probe. Unfortunately this cannot be done for callsite probes due to the size limitation of a 32-bit Dwarf discriminator. A 7-bit distribution factor is used instead.

Changes are also needed to the sample profile inliner to deal with prorated callsite counts. Call sites duplicated by PreLTO passes, when later on inlined in LTO time, should have the callees’s probe prorated based on the Prelink-computed distribution factors. The distribution factors should also be taken into account when computing hotness for inline candidates. Also, Indirect call promotion results in multiple callisites. The original samples should be distributed across them. This is fixed by adjusting the callisites' distribution factors.

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D93264
2021-02-02 11:55:01 -08:00
Wenlei He 1645f465be [CSSPGO] Factor out common part for CSSPGO inline and AFDO inline
Refactoring SampleProfileLoader::inlineHotFunctions to use helpers from CSSPGO inlining and reduce similar code in the inlining loop, plus minor cleanup for AFDO path.

This is resubmit of D95024, with build break and overtighten assertion fixed.

Test Plan:
2021-02-02 07:55:08 -08:00
Adrian Kuegel 48ca6da9d2 Revert "[CSSPGO] Factor out common part for CSSPGO inline and AFDO inline"
This reverts commit 9a03058d63.
2021-02-02 11:51:04 +01:00
Wenlei He 9a03058d63 [CSSPGO] Factor out common part for CSSPGO inline and AFDO inline
Refactoring SampleProfileLoader::inlineHotFunctions to use helpers from CSSPGO inlining and reduce similar code in the inlining loop, plus minor cleanup for AFDO path.

Test Plan:

Differential Revision: https://reviews.llvm.org/D95024
2021-02-02 00:34:06 -08:00
Wenlei He 6bae5973c4 [CSSPGO] Call site prioritized inlining for sample PGO
This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`).

Motivation

With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO.
 - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat.
 - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately.
 - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO.

New FDO Inliner

Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed.
 - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655.
 - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check.
 - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites.
 - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path.

Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO).

Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO.

Tunings and knobs

I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO.

Results

Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it).

Differential Revision: https://reviews.llvm.org/D94001
2021-02-01 23:46:34 -08:00
Hongtao Yu 224fee8219 [CSSPGO] Tweaking inlining with pseudo probes.
Fixing up a couple places where `getCallSiteIdentifier` is needed to support pseudo-probe-based callsites.

Also fixing an issue in the extbinary profile reader where the metadata section is not fully scanned based on the number of profiles loaded only for the current module.

Reviewed By: wmi, wenlei

Differential Revision: https://reviews.llvm.org/D95791
2021-02-01 13:56:40 -08:00
Hongtao Yu 7e99bddfea [CSSPGO] Support of CS profiles in extended binary format.
This change brings up support of context-sensitive profiles in the format of extended binary. Existing sample profile reader/writer/merger code is being tweaked to reflect the fact of bracketed input contexts, like (`[...]`). The paired brackets are also needed in extbinary profiles because we don't yet have an otherwise good way to tell calling contexts apart from regular function names since the context delimiter `@` can somehow serve as a part of the C++ mangled names.

Reviewed By: wmi, wenlei

Differential Revision: https://reviews.llvm.org/D95547
2021-01-27 21:29:46 -08:00
Wei Mi c9cd9a0066 [SampleFDO] Report error when reading a bad/incompatible profile instead of
turning off SampleFDO silently.

Currently sample loader pass turns off SampleFDO optimization silently when
it sees error in reading the profile. This behavior will defeat the tests
which could have caught those bad/incompatible profile problems. This patch
change the behavior to report error.

Differential Revision: https://reviews.llvm.org/D95269
2021-01-25 10:28:23 -08:00
Wei Mi 21b1ad0340 [SampleFDO] Add the support to split the function profiles with context into
separate sections.

For ThinLTO, all the function profiles without context has been annotated to
outline functions if possible in prelink phase. In postlink phase, profile
annotation in postlink phase is only meaningful for function profile with
context. If the profile is large, it is better to split the profile into two
parts, one with context and one without, so the profile reading in postlink
phase only has to read the part with context. To have the profile splitting,
we extend the ExtBinary format to support different section arrangement. It
will be flexible to add other section layout in the future without the need
to create new class inheriting from ExtBinary class.

Differential Revision: https://reviews.llvm.org/D94435
2021-01-19 15:16:19 -08:00
modimo 2a49b7c64a [Inliner] Change inline remark format and update ReplayInlineAdvisor to use it
This change modifies the source location formatting from:
LineNumber.Discriminator
to:
LineNumber:ColumnNumber.Discriminator

The motivation here is to enhance location information for inline replay that currently exists for the SampleProfile inliner. This will be leveraged further in inline replay for the CGSCC inliner in the related diff.

The ReplayInlineAdvisor is also modified to read the new format and now takes into account the callee for greater accuracy.

Testing:
ninja check-llvm

Reviewed By: mtrofin

Differential Revision: https://reviews.llvm.org/D94333
2021-01-12 13:43:48 -08:00
Hongtao Yu ac068e014b [CSSPGO] Consume pseudo-probe-based AutoFDO profile
This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`.

Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites).

**Adjustment to profile format **

A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like
```
!CFGChecksum: 12345
```
is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums.

Differential Revision: https://reviews.llvm.org/D92347
2020-12-16 15:57:18 -08:00
Hongtao Yu 705a4c149d [CSSPGO] Pseudo probe encoding and emission.
This change implements pseudo probe encoding and emission for CSSPGO. Please see RFC here for more context: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s

Pseudo probes are in the form of intrinsic calls on IR/MIR but they do not turn into any machine instructions. Instead they are emitted into the binary as a piece of data in standalone sections.  The probe-specific sections are not needed to be loaded into memory at execution time, thus they do not incur a runtime overhead. 

**ELF object emission**

The binary data to emit are organized as two ELF sections, i.e, the `.pseudo_probe_desc` section and the `.pseudo_probe` section. The `.pseudo_probe_desc` section stores a function descriptor for each function and the `.pseudo_probe` section stores the actual probes, each fo which corresponds to an IR basic block or an IR function callsite. A function descriptor is stored as a module-level metadata during the compilation and is serialized into the object file during object emission.

Both the probe descriptors and pseudo probes can be emitted into a separate ELF section per function to leverage the linker for deduplication.  A `.pseudo_probe` section shares the same COMDAT group with the function code so that when the function is dead, the probes are dead and disposed too. On the contrary, a `.pseudo_probe_desc` section has its own COMDAT group. This is because even if a function is dead, its probes may be inlined into other functions and its descriptor is still needed by the profile generation tool.

The format of `.pseudo_probe_desc` section looks like:

```
.section   .pseudo_probe_desc,"",@progbits
.quad   6309742469962978389  // Func GUID
.quad   4294967295           // Func Hash
.byte   9                    // Length of func name
.ascii  "_Z5funcAi"          // Func name
.quad   7102633082150537521
.quad   138828622701
.byte   12
.ascii  "_Z8funcLeafi"
.quad   446061515086924981
.quad   4294967295
.byte   9
.ascii  "_Z5funcBi"
.quad   -2016976694713209516
.quad   72617220756
.byte   7
.ascii  "_Z3fibi"
```

For each `.pseudoprobe` section, the encoded binary data consists of a single function record corresponding to an outlined function (i.e, a function with a code entry in the `.text` section). A function record has the following format :

```
FUNCTION BODY (one for each outlined function present in the text section)
    GUID (uint64)
        GUID of the function
    NPROBES (ULEB128)
        Number of probes originating from this function.
    NUM_INLINED_FUNCTIONS (ULEB128)
        Number of callees inlined into this function, aka number of
        first-level inlinees
    PROBE RECORDS
        A list of NPROBES entries. Each entry contains:
          INDEX (ULEB128)
          TYPE (uint4)
            0 - block probe, 1 - indirect call, 2 - direct call
          ATTRIBUTE (uint3)
            reserved
          ADDRESS_TYPE (uint1)
            0 - code address, 1 - address delta
          CODE_ADDRESS (uint64 or ULEB128)
            code address or address delta, depending on ADDRESS_TYPE
    INLINED FUNCTION RECORDS
        A list of NUM_INLINED_FUNCTIONS entries describing each of the inlined
        callees.  Each record contains:
          INLINE SITE
            GUID of the inlinee (uint64)
            ID of the callsite probe (ULEB128)
          FUNCTION BODY
            A FUNCTION BODY entry describing the inlined function.
```

To support building a context-sensitive profile, probes from inlinees are grouped by their inline contexts. An inline context is logically a call path through which a callee function lands in a caller function. The probe emitter builds an inline tree based on the debug metadata for each outlined function in the form of a trie tree. A tree root is the outlined function. Each tree edge stands for a callsite where inlining happens. Pseudo probes originating from an inlinee function are stored in a tree node and the tree path starting from the root all the way down to the tree node is the inline context of the probes. The emission happens on the whole tree top-down recursively. Probes of a tree node will be emitted altogether with their direct parent edge. Since a pseudo probe corresponds to a real code address, for size savings, the address is encoded as a delta from the previous probe except for the first probe. Variant-sized integer encoding, aka LEB128, is used for address delta and probe index.

**Assembling**

Pseudo probes can be printed as assembly directives alternatively. This allows for good assembly code readability and also provides a view of how optimizations and pseudo probes affect each other, especially helpful for diff time assembly analysis.

A pseudo probe directive has the following operands in order: function GUID, probe index, probe type, probe attributes and inline context. The directive is generated by the compiler and can be parsed by the assembler to form an encoded `.pseudoprobe` section in the object file.

A example assembly looks like:

```
foo2: # @foo2
# %bb.0: # %bb0
pushq %rax
testl %edi, %edi
.pseudoprobe 837061429793323041 1 0 0
je .LBB1_1
# %bb.2: # %bb2
.pseudoprobe 837061429793323041 6 2 0
callq foo
.pseudoprobe 837061429793323041 3 0 0
.pseudoprobe 837061429793323041 4 0 0
popq %rax
retq
.LBB1_1: # %bb1
.pseudoprobe 837061429793323041 5 1 0
callq *%rsi
.pseudoprobe 837061429793323041 2 0 0
.pseudoprobe 837061429793323041 4 0 0
popq %rax
retq
# -- End function
.section .pseudo_probe_desc,"",@progbits
.quad 6699318081062747564
.quad 72617220756
.byte 3
.ascii "foo"
.quad 837061429793323041
.quad 281547593931412
.byte 4
.ascii "foo2"
```

With inlining turned on, the assembly may look different around %bb2 with an inlined probe:

```
# %bb.2:                                # %bb2
.pseudoprobe    837061429793323041 3 0
.pseudoprobe    6699318081062747564 1 0 @ 837061429793323041:6
.pseudoprobe    837061429793323041 4 0
popq    %rax
retq
```

**Disassembling**

We have a disassembling tool (llvm-profgen) that can display disassembly alongside with pseudo probes. So far it only supports ELF executable file.

An example disassembly looks like:

```
00000000002011a0 <foo2>:
  2011a0: 50                    push   rax
  2011a1: 85 ff                 test   edi,edi
  [Probe]:  FUNC: foo2  Index: 1  Type: Block
  2011a3: 74 02                 je     2011a7 <foo2+0x7>
  [Probe]:  FUNC: foo2  Index: 3  Type: Block
  [Probe]:  FUNC: foo2  Index: 4  Type: Block
  [Probe]:  FUNC: foo   Index: 1  Type: Block  Inlined: @ foo2:6
  2011a5: 58                    pop    rax
  2011a6: c3                    ret
  [Probe]:  FUNC: foo2  Index: 2  Type: Block
  2011a7: bf 01 00 00 00        mov    edi,0x1
  [Probe]:  FUNC: foo2  Index: 5  Type: IndirectCall
  2011ac: ff d6                 call   rsi
  [Probe]:  FUNC: foo2  Index: 4  Type: Block
  2011ae: 58                    pop    rax
  2011af: c3                    ret
```

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D91878
2020-12-10 17:29:28 -08:00
Mitch Phillips 7ead5f5aa3 Revert "[CSSPGO] Pseudo probe encoding and emission."
This reverts commit b035513c06.

Reason: Broke the ASan buildbots:
  http://lab.llvm.org:8011/#/builders/5/builds/2269
2020-12-10 15:53:39 -08:00
Hongtao Yu b035513c06 [CSSPGO] Pseudo probe encoding and emission.
This change implements pseudo probe encoding and emission for CSSPGO. Please see RFC here for more context: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s

Pseudo probes are in the form of intrinsic calls on IR/MIR but they do not turn into any machine instructions. Instead they are emitted into the binary as a piece of data in standalone sections.  The probe-specific sections are not needed to be loaded into memory at execution time, thus they do not incur a runtime overhead. 

**ELF object emission**

The binary data to emit are organized as two ELF sections, i.e, the `.pseudo_probe_desc` section and the `.pseudo_probe` section. The `.pseudo_probe_desc` section stores a function descriptor for each function and the `.pseudo_probe` section stores the actual probes, each fo which corresponds to an IR basic block or an IR function callsite. A function descriptor is stored as a module-level metadata during the compilation and is serialized into the object file during object emission.

Both the probe descriptors and pseudo probes can be emitted into a separate ELF section per function to leverage the linker for deduplication.  A `.pseudo_probe` section shares the same COMDAT group with the function code so that when the function is dead, the probes are dead and disposed too. On the contrary, a `.pseudo_probe_desc` section has its own COMDAT group. This is because even if a function is dead, its probes may be inlined into other functions and its descriptor is still needed by the profile generation tool.

The format of `.pseudo_probe_desc` section looks like:

```
.section   .pseudo_probe_desc,"",@progbits
.quad   6309742469962978389  // Func GUID
.quad   4294967295           // Func Hash
.byte   9                    // Length of func name
.ascii  "_Z5funcAi"          // Func name
.quad   7102633082150537521
.quad   138828622701
.byte   12
.ascii  "_Z8funcLeafi"
.quad   446061515086924981
.quad   4294967295
.byte   9
.ascii  "_Z5funcBi"
.quad   -2016976694713209516
.quad   72617220756
.byte   7
.ascii  "_Z3fibi"
```

For each `.pseudoprobe` section, the encoded binary data consists of a single function record corresponding to an outlined function (i.e, a function with a code entry in the `.text` section). A function record has the following format :

```
FUNCTION BODY (one for each outlined function present in the text section)
    GUID (uint64)
        GUID of the function
    NPROBES (ULEB128)
        Number of probes originating from this function.
    NUM_INLINED_FUNCTIONS (ULEB128)
        Number of callees inlined into this function, aka number of
        first-level inlinees
    PROBE RECORDS
        A list of NPROBES entries. Each entry contains:
          INDEX (ULEB128)
          TYPE (uint4)
            0 - block probe, 1 - indirect call, 2 - direct call
          ATTRIBUTE (uint3)
            reserved
          ADDRESS_TYPE (uint1)
            0 - code address, 1 - address delta
          CODE_ADDRESS (uint64 or ULEB128)
            code address or address delta, depending on ADDRESS_TYPE
    INLINED FUNCTION RECORDS
        A list of NUM_INLINED_FUNCTIONS entries describing each of the inlined
        callees.  Each record contains:
          INLINE SITE
            GUID of the inlinee (uint64)
            ID of the callsite probe (ULEB128)
          FUNCTION BODY
            A FUNCTION BODY entry describing the inlined function.
```

To support building a context-sensitive profile, probes from inlinees are grouped by their inline contexts. An inline context is logically a call path through which a callee function lands in a caller function. The probe emitter builds an inline tree based on the debug metadata for each outlined function in the form of a trie tree. A tree root is the outlined function. Each tree edge stands for a callsite where inlining happens. Pseudo probes originating from an inlinee function are stored in a tree node and the tree path starting from the root all the way down to the tree node is the inline context of the probes. The emission happens on the whole tree top-down recursively. Probes of a tree node will be emitted altogether with their direct parent edge. Since a pseudo probe corresponds to a real code address, for size savings, the address is encoded as a delta from the previous probe except for the first probe. Variant-sized integer encoding, aka LEB128, is used for address delta and probe index.

**Assembling**

Pseudo probes can be printed as assembly directives alternatively. This allows for good assembly code readability and also provides a view of how optimizations and pseudo probes affect each other, especially helpful for diff time assembly analysis.

A pseudo probe directive has the following operands in order: function GUID, probe index, probe type, probe attributes and inline context. The directive is generated by the compiler and can be parsed by the assembler to form an encoded `.pseudoprobe` section in the object file.

A example assembly looks like:

```
foo2: # @foo2
# %bb.0: # %bb0
pushq %rax
testl %edi, %edi
.pseudoprobe 837061429793323041 1 0 0
je .LBB1_1
# %bb.2: # %bb2
.pseudoprobe 837061429793323041 6 2 0
callq foo
.pseudoprobe 837061429793323041 3 0 0
.pseudoprobe 837061429793323041 4 0 0
popq %rax
retq
.LBB1_1: # %bb1
.pseudoprobe 837061429793323041 5 1 0
callq *%rsi
.pseudoprobe 837061429793323041 2 0 0
.pseudoprobe 837061429793323041 4 0 0
popq %rax
retq
# -- End function
.section .pseudo_probe_desc,"",@progbits
.quad 6699318081062747564
.quad 72617220756
.byte 3
.ascii "foo"
.quad 837061429793323041
.quad 281547593931412
.byte 4
.ascii "foo2"
```

With inlining turned on, the assembly may look different around %bb2 with an inlined probe:

```
# %bb.2:                                # %bb2
.pseudoprobe    837061429793323041 3 0
.pseudoprobe    6699318081062747564 1 0 @ 837061429793323041:6
.pseudoprobe    837061429793323041 4 0
popq    %rax
retq
```

**Disassembling**

We have a disassembling tool (llvm-profgen) that can display disassembly alongside with pseudo probes. So far it only supports ELF executable file.

An example disassembly looks like:

```
00000000002011a0 <foo2>:
  2011a0: 50                    push   rax
  2011a1: 85 ff                 test   edi,edi
  [Probe]:  FUNC: foo2  Index: 1  Type: Block
  2011a3: 74 02                 je     2011a7 <foo2+0x7>
  [Probe]:  FUNC: foo2  Index: 3  Type: Block
  [Probe]:  FUNC: foo2  Index: 4  Type: Block
  [Probe]:  FUNC: foo   Index: 1  Type: Block  Inlined: @ foo2:6
  2011a5: 58                    pop    rax
  2011a6: c3                    ret
  [Probe]:  FUNC: foo2  Index: 2  Type: Block
  2011a7: bf 01 00 00 00        mov    edi,0x1
  [Probe]:  FUNC: foo2  Index: 5  Type: IndirectCall
  2011ac: ff d6                 call   rsi
  [Probe]:  FUNC: foo2  Index: 4  Type: Block
  2011ae: 58                    pop    rax
  2011af: c3                    ret
```

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D91878
2020-12-10 09:50:08 -08:00
Wei Mi 64e7685368 [SampleFDO] Store fixed length MD5 in NameTable instead of using ULEB128 if
MD5 is used.

Currently during sample profile loading, NameTable has to be loaded entirely
up front before any name string is retrieved. That is because NameTable is
stored using ULEB128 encoding and cannot be directly accessed like an array.
However, if MD5 is used to represent name in the NameTable, it has fixed
length. If MD5 names are stored in uint64_t type instead of ULEB128, NameTable
can be accessed like an array then in many cases only part of the NameTable
has to be read. This is helpful for reducing compile time especially when
small source file is compiled. We find that after this change, the elapsed
time to build a large application distributively is reduced by 5% and the
accumulative cpu time used for building is also reduced by 5%. The size of
the profile is slightly reduced with this change by ~0.2%, and that also
indicates encoding MD5 in ULEB128 doesn't save the storage space.

Differential Revision: https://reviews.llvm.org/D92621
2020-12-08 16:21:01 -08:00
Pan, Tao 7af802994e [CodeGen] Add text section prefix for COFF object file
Text section prefix is created in CodeGenPrepare, it's file format independent implementation,  text section name is written into object file in TargetLoweringObjectFile, it's file format dependent implementation, port code of adding text section prefix to text section name from ELF to COFF.
Different with ELF that use '.' as concatenation character, COFF use '$' as concatenation character. That is, concatenation character is variable, so split concatenation character from text section prefix.
Text section prefix is existing feature of ELF, it can help to reduce icache and itlb misses, it's also make possible aggregate other compilers e.g. v8 created same prefix sections. Furthermore, the recent feature Machine Function Splitter (basic block level text prefix section) is based on text section prefix.

Reviewed By: pengfei, rnk

Differential Revision: https://reviews.llvm.org/D92073
2020-12-08 18:56:21 +08:00
Wenlei He 6b989a1710 [CSSPGO] Infrastructure for context-sensitive Sample PGO and Inlining
This change adds the context-senstive sample PGO infracture described in CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduced an abstraction between input profile and profile loader that queries input profile for functions. Specifically, there's now the notion of base profile and context profile, and they are managed by the new SampleContextTracker for adjusting and merging profiles based on inline decisions. It works with top-down profiled guided inliner in profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to CGSCC inliner in order for it to take advantage of context-sensitive profile. This change is the consumption part of context-sensitive profile (The generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Pacthes for integration with Pseudo-probe coming up soon.

Currently the new infrastructure kick in when input profile contains the new context-sensitive profile; otherwise it's no-op and does not affect existing AutoFDO.

**Interface**

There're two sets of interfaces for query and tracking respectively exposed from SampleContextTracker. For query, now instead of simply getting a profile from input for a function, we can explicitly query base profile or context profile for given call path of a function. For tracking, there're separate APIs for marking context profile as inlined, or promoting and merging not inlined context profile.

- Query base profile (`getBaseSamplesFor`)
Base profile is the merged synthetic profile for function's CFG profile from any outstanding (not inlined) context. We can query base profile by function.

- Query context profile (`getContextSamplesFor`)
Context profile is a function's CFG profile for a given calling context. We can query context profile by context string.

- Track inlined context profile (`markContextSamplesInlined`)
When a function is inlined for given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing base profile for that inlined function.

- Track not-inlined context profile (`promoteMergeContextSamplesTree`)
When a function is not inlined for given calling context, we need to promote the context profile tree so the not inlined context becomes top-level context. This preserve the sub-context under that function so later inline decision for that not inlined function will still have context profile for its call tree. Note that profile will be merged if needed when promoting a context profile tree if any of the node already exists at its promoted destination.

**Implementation**

Implementation-wise, `SampleContext` is created as abstraction for context. Currently it's a string for call path, and we can later optimize it to something more efficient, e.g. context id. Each `SampleContext` also has a `ContextState` indicating whether it's raw context profile from input, whether it's inlined or merged, whether it's synthetic profile created by compiler. Each `FunctionSamples` now has a `SampleContext` that tells whether it's base profile or context profile, and for context profile what is the context and state.

On top of the above context representation, a custom trie tree is implemented to track and manager context profiles. Specifically, `SampleContextTracker` is implemented that encapsulates a trie tree with `ContextTireNode` as node. Each node of the trie tree represents a frame in calling context, thus the path from root to a node represents a valid calling context. We also track `FunctionSamples` for each node, so this trie tree can serve efficient query for context profile. Accordingly, context profile tree promotion now becomes moving a subtree to be under the root of entire tree, and merge nodes for subtree if this move encounters existing nodes.

**Integration**

`SampleContextTracker` is now also integrated with AutoFDO, `SampleProfileReader` and `SampleProfileLoader`. When we detected input profile contains context-sensitive profile, `SampleContextTracker` will be used to track profiles, and all profile query will go to `SampleContextTracker` instead of `SampleProfileReader` automatically. Tracking APIs are called automatically for each inline decision from `SampleProfileLoader`.

Differential Revision: https://reviews.llvm.org/D90125
2020-12-06 11:49:18 -08:00
Justin Bogner 203748e76d Revert "[test] Add a triple to an x86 test"
This fails on some of the bots. Reverting to investigate.

This reverts commit d5268ebe19.
2020-12-03 14:21:37 -08:00
Justin Bogner d5268ebe19 [test] Add a triple to an x86 test
This already "REQUIRE"d an x86 target, but it was also implicitly
depending on a default triple being set.
2020-12-03 14:05:51 -08:00
Hongtao Yu 24d4291ca7 [CSSPGO] Pseudo probes for function calls.
An indirect call site needs to be probed for its potential call targets. With CSSPGO a direct call also needs a probe so that a calling context can be represented by a stack of callsite probes. Unlike pseudo probes for basic blocks that are in form of standalone intrinsic call instructions, pseudo probes for callsites have to be attached to the call instruction, thus a separate instruction would not work.

One possible way of attaching a probe to a call instruction is to use a special metadata that carries information about the probe. The special metadata will have to make its way through the optimization pipeline down to object emission. This requires additional efforts to maintain the metadata in various places. Given that the `!dbg` metadata is a first-class metadata and has all essential support in place , leveraging the `!dbg` metadata as a channel to encode pseudo probe information is probably the easiest solution.

With the requirement of not inflating `!dbg` metadata that is allocated for almost every instruction, we found that the 32-bit DWARF discriminator field which mainly serves AutoFDO can be reused for pseudo probes. DWARF discriminators distinguish identical source locations between instructions and with pseudo probes such support is not required. In this change we are using the discriminator field to encode the ID and type of a callsite probe and the encoded value will be unpacked and consumed right before object emission. When a callsite is inlined, the callsite discriminator field will go with the inlined instructions. The `!dbg` metadata of an inlined instruction is in form of a scope stack. The top of the stack is the instruction's original `!dbg` metadata and the bottom of the stack is for the original callsite of the top-level inliner. Except for the top of the stack, all other elements of the stack actually refer to the nested inlined callsites whose discriminator field (which actually represents a calliste probe) can be used together to represent the inline context of an inlined PseudoProbeInst or CallInst.

To avoid collision with the baseline AutoFDO in various places that handles dwarf discriminators where a check against  the `-pseudo-probe-for-profiling` switch is not available, a special encoding scheme is used to tell apart a pseudo probe discriminator from a regular discriminator. For the regular discriminator, if all lowest 3 bits are non-zero, it means the discriminator is basically empty and all higher 29 bits can be reversed for pseudo probe use.

Callsite pseudo probes are inserted in `SampleProfileProbePass` and a target-independent MIR pass `PseudoProbeInserter` is added to unpack the probe ID/type from `!dbg`.

Note that with this work the switch -debug-info-for-profiling will not work with -pseudo-probe-for-profiling anymore. They cannot be used at the same time.

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D91756
2020-12-02 13:45:20 -08:00
Wei Wang 93dc1b5b8c [Remarks][2/2] Expand remarks hotness threshold option support in more tools
This is the #2 of 2 changes that make remarks hotness threshold option
available in more tools. The changes also allow the threshold to sync with
hotness threshold from profile summary with special value 'auto'.

This change expands remarks hotness threshold option
-fdiagnostics-hotness-threshold in clang and *-remarks-hotness-threshold in
other tools to utilize hotness threshold from profile summary.

Remarks hotness filtering relies on several driver options. Table below lists
how different options are correlated and affect final remarks outputs:

| profile | hotness | threshold | remarks printed |
|---------|---------|-----------|-----------------|
| No      | No      | No        | All             |
| No      | No      | Yes       | None            |
| No      | Yes     | No        | All             |
| No      | Yes     | Yes       | None            |
| Yes     | No      | No        | All             |
| Yes     | No      | Yes       | None            |
| Yes     | Yes     | No        | All             |
| Yes     | Yes     | Yes       | >=threshold     |

In the presence of profile summary, it is often more desirable to directly use
the hotness threshold from profile summary. The new argument value 'auto'
indicates threshold will be synced with hotness threshold from profile summary
during compilation. The "auto" threshold relies on the availability of profile
summary. In case of missing such information, no remarks will be generated.

Differential Revision: https://reviews.llvm.org/D85808
2020-11-30 21:55:50 -08:00
Hongtao Yu 750049d78b [CSSPGO] Disabling a pseudo probe test on non-x86 platforms.
Disabling a pseudo probe test on non-x86 platforms since it's not fully tested there.
2020-11-30 11:19:45 -08:00
Hongtao Yu 64fa8cce22 [CSSPGO] Pseudo probe instrumentation pass
This change introduces a pseudo probe instrumentation pass for block instrumentation. Please refer to https://reviews.llvm.org/D86193 for the whole story.

Given the following LLVM IR:

```
define internal void @foo2(i32 %x, void (i32)* %f) !dbg !4 {
bb0:
  %cmp = icmp eq i32 %x, 0
   br i1 %cmp, label %bb1, label %bb2
bb1:
   br label %bb3
bb2:
   br label %bb3
bb3:
   ret void
}
```

The instrumented IR will look like below. Note that each llvm.pseudoprobe intrinsic call represents a pseudo probe at a block, of which the first parameter is the GUID of the probe’s owner function and the second parameter is the probe’s ID.

```
define internal void @foo2(i32 %x, void (i32)* %f) !dbg !4 {
bb0:
   %cmp = icmp eq i32 %x, 0
   call void @llvm.pseudoprobe(i64 837061429793323041, i64 1)
   br i1 %cmp, label %bb1, label %bb2
bb1:
   call void @llvm.pseudoprobe(i64 837061429793323041, i64 2)
   br label %bb3
bb2:
   call void @llvm.pseudoprobe(i64 837061429793323041, i64 3)
   br label %bb3
bb3:
   call void @llvm.pseudoprobe(i64 837061429793323041, i64 4)
   ret void
}
```

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D86499
2020-11-30 10:16:54 -08:00
Arthur Eubanks 3d1149c6fe Make CallInst::updateProfWeight emit i32 weights instead of i64
Typically branch_weights are i32, not i64.
This fixes entry_counts_cold.ll under NPM.

Reviewed By: asbirlea

Differential Revision: https://reviews.llvm.org/D90539
2020-11-24 18:13:59 -08:00
Hongtao Yu d0e42037bf [CSSPGO] MIR target-independent pseudo instruction for pseudo-probe intrinsic
This change introduces a MIR target-independent pseudo instruction corresponding to the IR intrinsic llvm.pseudoprobe for pseudo-probe block instrumentation. Please refer to https://reviews.llvm.org/D86193 for the whole story.

An `llvm.pseudoprobe` intrinsic call will be lowered into a target-independent operation named `PSEUDO_PROBE`. Given the following instrumented IR,

```
define internal void @foo2(i32 %x, void (i32)* %f) !dbg !4 {
bb0:
   %cmp = icmp eq i32 %x, 0
   call void @llvm.pseudoprobe(i64 837061429793323041, i64 1)
   br i1 %cmp, label %bb1, label %bb2
bb1:
   call void @llvm.pseudoprobe(i64 837061429793323041, i64 2)
   br label %bb3
bb2:
   call void @llvm.pseudoprobe(i64 837061429793323041, i64 3)
   br label %bb3
bb3:
   call void @llvm.pseudoprobe(i64 837061429793323041, i64 4)
   ret void
}
```
the corresponding MIR is shown below. Note that block `bb3` is duplicated into `bb1` and `bb2` where its probe is duplicated too. This allows for an accurate execution count to be collected for `bb3`, which is basically the sum of the counts of `bb1` and `bb2`.

```
bb.0.bb0:
   frame-setup PUSH64r undef $rax, implicit-def $rsp, implicit $rsp
   TEST32rr killed renamable $edi, renamable $edi, implicit-def $eflags
   PSEUDO_PROBE 837061429793323041, 1, 0
   $edi = MOV32ri 1, debug-location !13; test.c:0
   JCC_1 %bb.1, 4, implicit $eflags

bb.2.bb2:
   PSEUDO_PROBE 837061429793323041, 3, 0
   PSEUDO_PROBE 837061429793323041, 4, 0
   $rax = frame-destroy POP64r implicit-def $rsp, implicit $rsp
   RETQ

bb.1.bb1:
   PSEUDO_PROBE 837061429793323041, 2, 0
   PSEUDO_PROBE 837061429793323041, 4, 0
   $rax = frame-destroy POP64r implicit-def $rsp, implicit $rsp
   RETQ
```

The target op PSEUDO_PROBE will be converted into a piece of binary data by the object emitter with no machine instructions generated. This is done in a different patch.

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D86495
2020-11-20 10:52:43 -08:00
Arthur Eubanks 5c31b8b94f Revert "Use uint64_t for branch weights instead of uint32_t"
This reverts commit 10f2a0d662.

More uint64_t overflows.
2020-10-31 00:25:32 -07:00
Arthur Eubanks 10f2a0d662 Use uint64_t for branch weights instead of uint32_t
CallInst::updateProfWeight() creates branch_weights with i64 instead of i32.
To be more consistent everywhere and remove lots of casts from uint64_t
to uint32_t, use i64 for branch_weights.

Reviewed By: davidxl

Differential Revision: https://reviews.llvm.org/D88609
2020-10-30 10:03:46 -07:00
Nico Weber 2a4e704c92 Revert "Use uint64_t for branch weights instead of uint32_t"
This reverts commit e5766f25c6.
Makes clang assert when building Chromium, see https://crbug.com/1142813
for a repro.
2020-10-27 09:26:21 -04:00
Arthur Eubanks e5766f25c6 Use uint64_t for branch weights instead of uint32_t
CallInst::updateProfWeight() creates branch_weights with i64 instead of i32.
To be more consistent everywhere and remove lots of casts from uint64_t
to uint32_t, use i64 for branch_weights.

Reviewed By: davidxl

Differential Revision: https://reviews.llvm.org/D88609
2020-10-26 20:24:04 -07:00
Arthur Eubanks 2d761a368c [test][NewPM][SampleProfile] Fix more tests under NPM
These all have separate legacy and new PM RUN lines.
2020-09-30 11:50:41 -07:00
Arthur Eubanks 2ab8770223 [test][SampleProfile][NewPM] Fix some tests under NPM 2020-09-30 09:44:29 -07:00
Wei Mi c67ccf5faf [SampleFDO] Enhance profile remapping support for searching inline instance
and indirect call promotion candidate.

Profile remapping is a feature to match a function in the module with its
profile in sample profile if the function name and the name in profile look
different but are equivalent using given remapping rules. This is a useful
feature to keep the performance stable by specifying some remapping rules
when sampleFDO targets are going through some large scale function signature
change.

However, currently profile remapping support is only valid for outline
function profile in SampleFDO. It cannot match a callee with an inline
instance profile if they have different but equivalent names. We found
that without the support for inline instance profile, remapping is less
effective for some large scale change.

To add that support, before any remapping lookup happens, all the names
in the profile will be inserted into remapper and the Key to the name
mapping will be recorded in a map called NameMap in the remapper. During
name lookup, a Key will be returned for the given name and it will be used
to extract an equivalent name in the profile from NameMap. So with the help
of the NameMap, we can translate any given name to an equivalent name in
the profile if it exists. Whenever we try to match a name in the module to
a name in the profile, we will try the match with the original name first,
and if it doesn't match, we will use the equivalent name got from remapper
to try the match for another time. In this way, the patch can enhance the
profile remapping support for searching inline instance and searching
indirect call promotion candidate.

In a planned large scale change of int64 type (long long) to int64_t (long),
we found the performance of a google internal benchmark degraded by 2% if
nothing was done. If existing profile remapping was enabled, the performance
degradation dropped to 1.2%. If the profile remapping with the current patch
was enabled, the performance degradation further dropped to 0.14% (Note the
experiment was done before searching indirect call promotion candidate was
added. We hope with the remapping support of searching indirect call promotion
candidate, the degradation can drop to 0% in the end. It will be evaluated
post commit).

Differential Revision: https://reviews.llvm.org/D86332
2020-08-26 11:07:35 -07:00
Wenlei He 577e58bcc7 [InlineAdvisor] New inliner advisor to replay inlining from optimization remarks
This change added a new inline advisor that takes optimization remarks from previous inlining as input, and provides the decision as advice so current inlining can replay inline decisions of a different compilation. Dwarf inline stack with line and discriminator is used as anchor for call sites including call context. The change can be useful for Inliner tuning as it provides a channel to allow external input for tweaking inline decisions. Existing alternatives like alwaysinline attribute is per-function, not per-callsite. Per-callsite inline intrinsic can be another solution (not yet existing), but it's intrusive to implement and also does not differentiate call context.

A switch -sample-profile-inline-replay=<inline_remarks_file> is added to hook up the new inline advisor with SampleProfileLoader's inline decision for replay. Since SampleProfileLoader does top-down inlining, inline decision can be specialized for each call context, hence we should be able to replay inlining accurately. However with a bottom-up inliner like CGSCC inlining, the replay can be limited due to lack of specialization for different call context. Apart from that limitation, the new inline advisor can still be used by regular CGSCC inliner later if needed for tuning purpose.

This is a resubmit of https://reviews.llvm.org/D83743
2020-08-15 20:17:21 -07:00
Wei Mi 4cd8e9b169 [SampleFDO] Stop letting findCalleeFunctionSamples return unrelated profiles
for invoke instructions.

We see a warning of "No debug information found in function foo: Function
profile not used" in a case. The function foo is called by an invoke
instruction. It has no debug information because it has attribute((nodebug))
in the definition. It shouldn't have profile instance in the sample profile
but compiler thinks it does, that turns out to be a compiler bug in
findCalleeFunctionSamples. The bug is exposed when sample-profile-merge-inlinee
is enabled recently.

Currently in findCalleeFunctionSamples, CalleeName is unset and is empty for
invoke instruction. For empty CalleeName, findFunctionSamplesAt will treat
the call as an indirect call and will return any inline instance profile at
the same location as the instruction. That leads to a wrong profile being
returned to function foo.

The patch set CalleeName when the instruction is an invoke.

Differential Revision: https://reviews.llvm.org/D85664
2020-08-10 12:41:09 -07:00
Hongtao Yu d23c1d6a8d [AutoFDO] Avoid merging inlinee samples multiple times
A function call can be replicated by optimizations like loop unroll and jump threading and the replicates end up sharing the sample nested callee profile. Therefore when it comes to merging samples for uninlined callees in the sample profile inliner, a callee profile can be merged multiple times which will cause an assert to fire.

This change avoids merging same callee profile for duplicate callsites by filtering out callee profiles with a non-zero head sample count.

Reviewed By: wenlei, wmi

Differential Revision: https://reviews.llvm.org/D84997
2020-07-31 09:30:05 -07:00
Wei Mi 836991d367 Fix a crash when the sample profile uses md5 and -sample-profile-merge-inlinee
is enabled.

When -sample-profile-merge-inlinee is enabled, new FunctionSamples may be
created during profile merge without GUIDToFuncNameMap being initialized.
That will occasionally cause compiler crash. The patch fixes it.

Differential Revision: https://reviews.llvm.org/D84994
2020-07-30 21:21:06 -07:00
Wenlei He d41d952be9 Revert "[InlineAdvisor] New inliner advisor to replay inlining from optimization remarks"
This reverts commit 2d6ecfa168.
2020-07-19 08:49:04 -07:00
Wenlei He 2d6ecfa168 [InlineAdvisor] New inliner advisor to replay inlining from optimization remarks
Summary:
This change added a new inline advisor that takes optimization remarks from previous inlining as input, and provides the decision as advice so current inlining can replay inline decisions of a different compilation. Dwarf inline stack with line and discriminator is used as anchor for call sites including call context. The change can be useful for Inliner tuning as it provides a channel to allow external input for tweaking inline decisions. Existing alternatives like alwaysinline attribute is per-function, not per-callsite. Per-callsite inline intrinsic can be another solution (not yet existing), but it's intrusive to implement and also does not differentiate call context.

A switch -sample-profile-inline-replay=<inline_remarks_file> is added to hook up the new inline advisor with SampleProfileLoader's inline decision for replay. Since SampleProfileLoader does top-down inlining, inline decision can be specialized for each call context, hence we should be able to replay inlining accurately. However with a bottom-up inliner like CGSCC inlining, the replay can be limited due to lack of specialization for different call context. Apart from that limitation, the new inline advisor can still be used by regular CGSCC inliner later if needed for tuning purpose.

Subscribers: mgorny, aprantl, hiraditya, llvm-commits

Tags: #llvm

Resubmit for https://reviews.llvm.org/D84086
2020-07-19 08:21:05 -07:00
Eric Christopher ae08dbc673 Temporarily Revert "[InlineAdvisor] New inliner advisor to replay inlining from optimization remarks"
as it is failing the inline-replay.ll test as well as sanitizers/Werror
from returning a stack local variable.

This reverts commit 029946b112.
2020-07-17 14:58:01 -07:00
Wenlei He 029946b112 [InlineAdvisor] New inliner advisor to replay inlining from optimization remarks
Summary:
This change added a new inline advisor that takes optimization remarks for previous inlining as input, and provide the decision as advice so current inlining can replay inline decision of a different compilation. Dwarf inline stack with line and discriminator is used as anchor for call sites. The change can be useful for Inliner tuning.
A switch -sample-profile-inline-replay=<inline_remarks_file> is added to hook up the new inliner advisor with SampleProfileLoader's inline decision for replay. The new inline advisor can also be used by regular CGSCC inliner later if needed.

Reviewers: davidxl, mtrofin, wmi, hoy

Subscribers: aprantl, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D83743
2020-07-17 13:30:47 -07:00
Wei Mi e32469a140 [SampleFDO] Enable sample-profile-top-down-load and sample-profile-merge-inlinee
by default.

sample-profile-top-down-load is an internal option which can enable top-down
order of inlining and profile annotation in sample profile load pass. It was
found to be beneficial for better profile annotation.

Recently we found it could also solve some build time issue. Suppose function
A has many callsites in function B. In the last release binary where sample
profile was collected, the outline copy of A is large because there are many
other functions inlined into A. However although all the callsites calling A
in B are inlined, but every inlined body is small (A was inlined into B
before other functions are inlined into A), there is no build time issue in
last release.

In an optimized build using the sample profile collected from last release,
without top-down inlining, we saw a case that A got very large because of
inlining, and then multiple callsites of A got inlined into B, and that led
to a huge B which caused significant build time issue besides profile
annotation issue.

To solve that problem, the patch enables the flag
sample-profile-top-down-load by default. sample-profile-top-down-load can
have better performance when it is enabled together with
sample-profile-merge-inlinee so in this patch we also enable
sample-profile-merge-inlinee by default.

Differential Revision: https://reviews.llvm.org/D82919
2020-07-08 09:23:18 -07:00
Wenlei He 7c8a6936bf [Remarks] Add callsite locations to inline remarks
Summary:
Add call site location info into inline remarks so we can differentiate inline sites.
This can be useful for inliner tuning. We can also reconstruct full hierarchical inline
tree from parsing such remarks. The messege of inline remark is also tweaked so we can
differentiate SampleProfileLoader inline from CGSCC inline.

Reviewers: wmi, davidxl, hoy

Subscribers: hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D82213
2020-06-20 23:32:10 -07:00
Fangrui Song c1ea672ddb [gcov] Delete XFAIL: host-byteorder-big-endian
Add two to function-numbering.ll and exit-block.ll
2020-06-03 19:58:28 -07:00
Wei Mi 7a6c89427c [SampleFDO] Add use-sample-profile function attribute.
When sampleFDO is enabled, people may expect they can use
-fno-profile-sample-use to opt-out using sample profile for a certain file.
That could be either for debugging purpose or for performance tuning purpose.
However, when thinlto is enabled, if a function in file A compiled with
-fno-profile-sample-use is imported to another file B compiled with
-fprofile-sample-use, the inlined copy of the function in file B may still
get its profile annotated.

The inconsistency may even introduce profile unused warning because if the
target is not compiled with explicit debug information flag, the function
in file A won't have its debug information enabled (debug information will
be enabled implicitly only when -fprofile-sample-use is used). After it is
imported into file B which is compiled with -fprofile-sample-use, profile
annotation for the outline copy of the function will fail because the
function has no debug information, and that will trigger  profile unused
warning.

We add a new attribute use-sample-profile to control whether a function
will use its sample profile no matter for its outline or inline copies.
That will make the behavior of -fno-profile-sample-use consistent.

Differential Revision: https://reviews.llvm.org/D79959
2020-06-02 17:23:17 -07:00
Hongtao Yu 47c1f2741f Properly add out-of-module functions to the import list
This patch addresses two issues related to adding inline functions to the import list while recursively going through the profiling data.
1. For callsite samples, only add an inlined function to the import list if it's from outside of the module (i.e. only has a declaration inside the module).
2. For body samples, add each target function to the import list if it's from outside of the module (i.e. only has a declaration inside the module). Previously we were using getSubProgram() to check whether it has dbg info, which is inaccurate. This fix properly add imports and could improve the quality of the pass.

Added a few changes to the test to catch these cases.

Differential Revision: https://reviews.llvm.org/D79379
2020-05-11 10:00:14 -07:00
Wei Mi aa2ddfc73d [SampleFDO] For functions without profiles, provide an option to put
them in a special text section.

For sampleFDO, because the optimized build uses profile generated from
previous release, previously we couldn't tell a function without profile
was truely cold or just newly created so we had to treat them conservatively
and put them in .text section instead of .text.unlikely. The result was when
we persuing the best performance by locking .text.hot and .text in memory,
we wasted a lot of memory to keep cold functions inside.

In https://reviews.llvm.org/D66374, we introduced profile symbol list to
discriminate functions being cold versus functions being newly added.
This mechanism works quite well for regular use cases in AutoFDO. However,
in some case, we can only have a partial profile when optimizing a target.
The partial profile may be an aggregated profile collected from many targets.
The profile symbol list method used for regular sampleFDO profile is not
applicable to partial profile use case because it may be too large and
introduce many false positives.

To solve the problem for partial profile use case, we provide an option called
--profile-unknown-in-special-section. For functions without profile, we will
still treat them conservatively in compiler optimizations -- for example,
treat them as warm instead of cold in inliner. When we use profile info to
add section prefix for functions, we will discriminate functions known to be
not cold versus functions without profile (being unknown), and we will put
functions being unknown in a special text section called .text.unknown.
Runtime system will have the flexibility to decide where to put the special
section in order to achieve a balance between performance and memory saving.

Differential Revision: https://reviews.llvm.org/D62540
2020-05-08 11:18:09 -07:00
Wei Mi ebad678857 [SampleFDO] Port MD5 name table support to extbinary format.
Compbinary format uses MD5 to represent strings in name table. That gives smaller profile without the need of compression/decompression when writing/reading the profile. The patch adds the support in extbinary format. It is off by default but user can choose to enable it.

Note the feature of using MD5 in name table can bring very small chance of name conflict leading to profile mismatch. Besides, profile using the feature won't have the profile remapping support.

Differential Revision: https://reviews.llvm.org/D76255
2020-03-30 22:07:08 -07:00
Wenlei He 05c3907b88 Fix test for profile remapper
Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74799
2020-02-18 17:58:32 -08:00
Fangrui Song bf70494b94 [test] More tests to target specific directories after CodeGenPrepare requires TargetPassConfig (D73754) 2020-02-02 10:43:02 -08:00
Wei Mi 154cd6de51 [SampleFDO] Fix invalid branch profile generated by indirect call promotion.
Suppose an inline instance has hot total sample count but 0 entry count, and
it is an indirect call target. If the indirect call has no other call target
and inline instance associated with it and it is promoted, currently the
conditional branch generated by indirect call promotion will have invalid
branch profile which is !{!"branch_weights", i32 0, i32 0} -- because the
entry count of the promoted target is 0 and the total entry count of all
targets is also 0. This caused a SEGV in Control Height Reduction and may
cause problem in other passes.

Function entry count of an inline instance is computed by a heuristic --
using either the sample of the starting line or starting inner inline
instance. The patch changes the heuristic a little bit so that when total
sample count is larger than 0, the computed entry count will be at least 1.
Then the new branch profile will be !{!"branch_weights", i32 1, i32 0}.

Differential Revision: https://reviews.llvm.org/D72790
2020-01-15 18:36:06 -08:00
Fangrui Song a36ddf0aa9 Migrate function attribute "no-frame-pointer-elim"="false" to "frame-pointer"="none" as cleanups after D56351 2019-12-24 16:27:51 -08:00
Fangrui Song 502a77f125 Migrate function attribute "no-frame-pointer-elim" to "frame-pointer"="all" as cleanups after D56351 2019-12-24 15:57:33 -08:00
Wenlei He d275a06487 [AutoFDO] Statistic for context sensitive profile guided inlining
Summary: AutoFDO compilation has two places that do inlining - the sample profile loader that does inlining with context sensitive profile, and the regular inliner as CGSCC pass. Ideally we want most inlining to come from sample profile loader as that is driven by context sensitive profile and also retains context sensitivity after inlining. However the reality is most of the inlining actually happens during regular inliner. To track the number of inline instances from sample profile loader and help move more inlining to sample profile loader, I'm adding statistics and optimization remarks for sample profile loader's inlining.

Reviewers: wmi, davidxl

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D70584
2019-12-11 21:37:21 -08:00
Wenlei He 7b61ae68ec [AutoFDO] Inline replay for cold/small callees from sample profile loader
Summary:
Sample profile loader of AutoFDO tries to replay previous inlining using context sensitive profile. The replay only repeats inlining if the call site block is hot. As a result it punts inlining of small functions, some of which can be beneficial for size, and will still be inlined by CSGCC inliner later. The oscillation between sample profile loader's inlining and regular CGSSC inlining cause unnecessary loss of context-sensitive profile. It doesn't have much impact for inline decision itself, but it negatively affects post-inline profile quality as CGSCC inliner have to scale counts which is not as accurate as the original context sensitive profile, and bad post-inline profile can misguide code layout.

This change added regular Inline Cost calculation for sample profile loader, so we can inline small functions upfront under switch -sample-profile-inline-size. In addition -sample-profile-cold-inline-threshold is added so we can tune the separate size threshold - currently the default is chosen to be the same as regular inliner's cold call-site threshold.

Reviewers: wmi, davidxl

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D70750
2019-12-06 11:44:45 -08:00
Wenlei He 532196d811 [AutoFDO] Top-down Inlining for specialization with context-sensitive profile
Summary:
AutoFDO's sample profile loader processes function in arbitrary source code order, so if I change the order of two functions in source code, the inline decision can change. This also prevented the use of context-sensitive profile to do specialization while inlining. This commit enforces SCC top-down order for sample profile loader. With this change, we can now do specialization, as illustrated by the added test case:

Say if we have A->B->C and D->B->C call path, we want to inline C into B when root inliner is B, but not when root inliner is A or D, this is not possible without enforcing top-down order. E.g. Once C is inlined into B, A and D can only choose to inline (B->C) as a whole or nothing, but what we want is only inline B into A and D, not its recursive callee C. If we process functions in top-down order, this is no longer a problem, which is what this commit is doing.

This change is guarded with a new switch "-sample-profile-top-down-load" for tuning, and it depends on D70653. Eventually, top-down can be the default order for sample profile loader.

Reviewers: wmi, davidxl

Subscribers: hiraditya, llvm-commits, tejohnson

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D70655
2019-12-05 16:07:01 -08:00
Wenlei He e503fd85d3 [AutoFDO] Properly merge context-sensitive profile of inlinee back to outlined function
Summary:
When sample profile loader decides not to inline a previously inlined call-site, we adjust the profile of outlined function simply by scaling up its profile counts by call-site count. This means the context-sensitive profile of that inlined instance will be thrown away. This commit try to keep context-sensitive profile for such cases:

 - Instead of scaling outlined function's profile, we now properly merge the FunctionSamples of inlined instance into outlined function, including all recursively inlined profile.
 - Instead of adjusting the profile for negative inline decision at the end of the sample profile loader pass, we do the profile merge right after processing each function. This change paired with top-down ordering of annotation/inline-replay (a separate diff) will make sure we recursively merge profile back before the profile is used for annotation and inline replay.

A new switch -sample-profile-merge-inlinee is added to enable the new profile merge for tuning. It should be the default behavior eventually.

Reviewers: wmi, davidxl

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D70653
2019-12-05 15:57:55 -08:00
Wenlei He ba1dfae054 Keep import function list for inlinee profile update
Summary:
When adjusting function entry counts after inlining, Funciton::setEntryCount is called without providing an import function list. The side effect of that is the previously set import function list will be dropped. The import function list is used by ThinLTO to help import hot cross module callee for LTO inlining, so dropping that during ThinLTO pre-link may adversely affect LTO inlining. The fix is to keep the list while updating entry counts for inlining.

Reviewers: wmi, davidxl, tejohnson

Subscribers: mehdi_amini, hiraditya, dexonsmith, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D69736
2019-11-06 18:36:00 -08:00
Wei Mi 8c8ec1f686 [SampleFDO] Add profile remapping support for profile on-demand loading used
by ExtBinary format profile

Profile on-demand loading was added for ExtBinary format profile in rL374233,
but currently profile on-demand loading doesn't work well with profile
remapping. The patch adds the support.

Suppose a function in the current module has outline instance in the profile.
The function name in the module is different from the name of the outline
instance, but remapper knows the two names are equal. When loading profile
on-demand, the outline instance has to be loaded with remapper's help.

At the same time SampleProfileReaderItaniumRemapper is changed from a proxy
of SampleProfileReader to a helper member in SampleProfileReader.

Differential Revision: https://reviews.llvm.org/D68901

llvm-svn: 375295
2019-10-18 22:35:20 +00:00
Wei Mi 09dcfe6805 [SampleFDO] Add indexing for function profiles so they can be loaded on demand
in ExtBinary format

Currently for Text, Binary and ExtBinary format profiles, when we compile a
module with samplefdo, even if there is no function showing up in the profile,
we have to load all the function profiles from the profile input. That is a
waste of compile time.

CompactBinary format profile has already had the support of loading function
profiles on demand. In this patch, we add the support to load profile on
demand for ExtBinary format. It will work no matter the sections in ExtBinary
format profile are compressed or not. Experiment shows it reduces the time to
compile a server benchmark by 30%.

When profile remapping and loading function profiles on demand are both used,
extra work needs to be done so that the loading on demand process will take
the name remapping into consideration. It will be addressed in a follow-up
patch.

Differential Revision: https://reviews.llvm.org/D68601

llvm-svn: 374233
2019-10-09 21:36:03 +00:00
Wei Mi b523790ae1 [SampleFDO] Add compression support for any section in ExtBinary profile format
Previously ExtBinary profile format only supports compression using zlib for
profile symbol list. In this patch, we extend the compression support to any
section. User can select some or all of the sections to compress. In an
experiment, for a 45M profile in ExtBinary format, compressing name table
reduced its size to 24M, and compressing all the sections reduced its size
to 11M.

Differential Revision: https://reviews.llvm.org/D68253

llvm-svn: 373914
2019-10-07 16:12:37 +00:00
Wei Mi f0c4e70e95 [SampleFDO] Create a separate flag profile-accurate-for-symsinlist to handle
profile symbol list.

Currently many existing users using profile-sample-accurate want to reduce
code size as much as possible. Their use cases are different from the scenario
profile symbol list tries to handle -- the major motivation of adding profile
symbol list is to get the major memory/code size saving without introduce
performance regression. So to keep the behavior of profile-sample-accurate
unchanged, we think decoupling these two things and using a new flag to
control the handling of profile symbol list may be better.

When profile-sample-accurate and the new flag profile-accurate-for-symsinlist
are both present, since profile-sample-accurate is a user assertion we let it
have a higher precedence.

Differential Revision: https://reviews.llvm.org/D68047

llvm-svn: 373133
2019-09-27 22:33:59 +00:00
Wei Mi 22fd88530b [SampleFDO] Treat names in profile as not cold only when profile symbol list
is available

In rL372232, we treated names showing up in profile as not cold when
profile-sample-accurate is enabled. This caused 70k size regression in
Chrome/Android. The patch put a guard and only enable the change when
profile symbol list is available, i.e., keep the old behavior when profile
symbol list is not available.

Differential Revision: https://reviews.llvm.org/D67931

llvm-svn: 372665
2019-09-23 22:11:35 +00:00
Wei Mi 5f7e822dc7 [SampleFDO] Minimize performance impact when profile-sample-accurate
is enabled.

We can save memory and reduce binary size significantly by enabling
ProfileSampleAccurate. However when ProfileSampleAccurate is true,
function without sample will be regarded as cold and this could
potentially cause performance regression.

To minimize the potential negative performance impact, we want to be
a little conservative here saying if a function shows up in the profile,
no matter as outline instance, inline instance or call targets, treat
the function as not being cold. This will handle the cases such as most
callsites of a function are inlined in sampled binary (thus outline copy
don't get any sample) but not inlined in current build (because of source
code drift, imprecise debug information, or the callsites are all cold
individually but not cold accumulatively...), so that the outline function
showing up as cold in sampled binary will actually not be cold after current
build. After the change, such function will be treated as not cold even
profile-sample-accurate is enabled.

At the same time we lower the hot criteria of callsiteIsHot check when
profile-sample-accurate is enabled. callsiteIsHot is used to determined
whether a callsite is hot and qualified for early inlining. When
profile-sample-accurate is enabled, functions without profile will be
regarded as cold and much less inlining will happen in CGSCC inlining pass,
so we can worry less about size increase and be aggressive to allow more
early inlining to happen for warm callsites and it is helpful for performance
overall.

Differential Revision: https://reviews.llvm.org/D67561

llvm-svn: 372232
2019-09-18 16:06:28 +00:00
Wei Mi 798e59b81f [SampleFDO] Add profile symbol list section to discriminate function being
cold versus function being newly added.

This is the second half of https://reviews.llvm.org/D66374.

Profile symbol list is the collection of function symbols showing up in
the binary which generates the current profile. It is used to discriminate
function being cold versus function being newly added. Profile symbol list
is only added for profile with ExtBinary format.

During profile use compilation, when profile-sample-accurate is enabled,
a function without profile will be regarded as cold only when it is
contained in that list.

Differential Revision: https://reviews.llvm.org/D66766

llvm-svn: 370563
2019-08-31 02:27:26 +00:00
Wei Mi be9073249e [SampleFDO] Add ExtBinary format to support extension of binary profile.
This is a patch split from https://reviews.llvm.org/D66374. It tries to add
a new format of profile called ExtBinary. The format adds a section header
table to the profile and organize the profile in sections, so the future
extension like adding a new section or extending an existing section will be
easier while keeping backward compatiblity feasible.

Differential Revision: https://reviews.llvm.org/D66513

llvm-svn: 369798
2019-08-23 19:05:30 +00:00
Yevgeny Rouban bf6df042a5 Fix prof branch_weights in entry_counts_missing_dbginfo.ll test
Removed extra parameter from !prof branch_weights metadata of
a call instruction according to the spec.

Differential Revision: https://reviews.llvm.org/D61932

llvm-svn: 360843
2019-05-16 03:39:09 +00:00
Mircea Trofin 0a753938db [llvm] Avoid div by 0 when updating profile weights.
Reviewers: davidxl

Reviewed By: davidxl

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61661

llvm-svn: 360223
2019-05-08 03:57:25 +00:00
Hubert Tong 02d055a269 [tests] Add host-byteorder-*-endian; update XFAILs of big-endian triples
Summary:
Triple components in `XFAIL` lines are tested against the target triple.
Various tests that are expected to fail on big-endian hosts are marked
as being `XFAIL` for big-endian targets. This patch corrects these tests
by having them test against a new `host-byteorder-big-endian` feature.

Reviewers: xingxue, sfertile, jasonliu

Reviewed By: xingxue

Subscribers: jvesely, nhaehnle, fedor.sergeev, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60551

llvm-svn: 359689
2019-05-01 15:36:18 +00:00
Eric Christopher cee313d288 Revert "Temporarily Revert "Add basic loop fusion pass.""
The reversion apparently deleted the test/Transforms directory.

Will be re-reverting again.

llvm-svn: 358552
2019-04-17 04:52:47 +00:00
Eric Christopher a863435128 Temporarily Revert "Add basic loop fusion pass."
As it's causing some bot failures (and per request from kbarton).

This reverts commit r358543/ab70da07286e618016e78247e4a24fcb84077fda.

llvm-svn: 358546
2019-04-17 02:12:23 +00:00
Taewook Oh a960f89962 [ProfileSummary] Count callsite samples when computing total samples.
Summary: Currently ProfileSummaryBuilder doesn't count into callsite samples when computing total samples. Considering that ProfileSummaryInfo is used to checked the hotness of not only body samples but also callsite samples (from SampleProfileLoader), I think the callsite sample counts should be considered when computing total samples.

Reviewers: eraman, danielcdh, wmi

Subscribers: hiraditya, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59835

llvm-svn: 357627
2019-04-03 19:54:43 +00:00
Taewook Oh 6a27c48be2 [SampleProfile] Repeat indirect call promotion only when the target is actually hot.
Summary: It is possible that multiple indirect call targets have been promoted for a single callsite from the profiled binary. Current implementation repeats promotion for all these targets as far as the callsite itself is hot (the callsite is assumed to be hot if any one of these targets was "hot" during the profiling). However, even when one of the ICPed target is hot other targets may not, and we should not repeat promotion for "cold" targets.

Reviewers: danielcdh, wmi

Subscribers: hiraditya, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59940

llvm-svn: 357484
2019-04-02 15:48:21 +00:00
David Callahan d2eeb2516d Update entry count for cold calls
Summary:
Profile sample files include the number of times each entry or inlined
call site is sampled. This is translated into the entry count metadta
on functions.

When sample data is being read, if a call site that was inlined
in the sample program is considered cold and not inlined, then
the entry count of the out-of-line functions does not reflect
the current compilation.

In this patch, we note call sites where the function was not inlined
and as a last action of the sample profile loading, we update the
called function's entry count to reflect the calls from these
call sites which are not included in the profile file.

Reviewers: danielcdh, wmi, Kader, modocache

Reviewed By: wmi

Subscribers: davidxl, eraman, llvm-commits

Differential Revision: https://reviews.llvm.org/D52845

llvm-svn: 352001
2019-01-24 00:55:23 +00:00
Wei Mi 3bcccdfe38 [SampleFDO] Skip profile reading when flattened profile used in ThinLTO postlink
If the sample profile has no inlining hierachy information included, we call
the sample profile is flattened. For flattened profile, in ThinLTO postlink
phase, SampleProfileLoader's hot function inlining and profile annotation will
do nothing, so it is better to save the effort to read in the profile and run
the sample profile loader pass. It is helpful for reducing compile time when
the flattened profile is huge.

Differential Revision: https://reviews.llvm.org/D54819

llvm-svn: 351476
2019-01-17 20:48:34 +00:00
Wei Mi c876e3d42b [PGO] Make pgo related options in opt more consistent.
Currently we have pgo options defined in PassManagerBuilder.cpp only for
instrument pgo, but not for sample pgo. We also have pgo options defined
in NewPMDriver.cpp in opt only for new pass manager and for all kinds of
pgo. They have some inconsistency.

To make the options more consistent and make tests writing easier, the
patch let old pass manager to share the same pgo options with new pass
manager in opt, and removes the options in PassManagerBuilder.cpp.

Differential Revision: https://reviews.llvm.org/D56749

llvm-svn: 351392
2019-01-16 23:19:02 +00:00
Wei Mi 66c6c5abea [SampleFDO] handle ProfileSampleAccurate when initializing function entry count
ProfileSampleAccurate is used to indicate the profile has exact match to the
code to be optimized.

Previously ProfileSampleAccurate is handled in ProfileSummaryInfo::isColdCallSite
and ProfileSummaryInfo::isColdBlock. A better solution is to initialize function
entry count to 0 when ProfileSampleAccurate is true, so we don't have to handle
ProfileSampleAccurate in multiple places.

Differential Revision: https://reviews.llvm.org/D55660

llvm-svn: 349088
2018-12-13 21:51:42 +00:00
Richard Smith 6c67662816 Add a flag to remap manglings when reading profile data information.
This can be used to preserve profiling information across codebase
changes that have widespread impact on mangled names, but across which
most profiling data should still be usable. For example, when switching
from libstdc++ to libc++, or from the old libstdc++ ABI to the new ABI,
or even from a 32-bit to a 64-bit build.

The user can provide a remapping file specifying parts of mangled names
that should be treated as equivalent (eg, std::__1 should be treated as
equivalent to std::__cxx11), and profile data will be treated as
applying to a particular function if its name is equivalent to the name
of a function in the profile data under the provided equivalences. See
the documentation change for a description of how this is configured.

Remapping is supported for both sample-based profiling and instruction
profiling. We do not support remapping indirect branch target
information, but all other profile data should be remapped
appropriately.

Support is only added for the new pass manager. If someone wants to also
add support for this for the old pass manager, doing so should be
straightforward.

This is the LLVM side of Clang r344199.

Reviewers: davidxl, tejohnson, dlj, erik.pilkington

Subscribers: mehdi_amini, steven_wu, dexonsmith, llvm-commits

Differential Revision: https://reviews.llvm.org/D51249

llvm-svn: 344200
2018-10-10 23:13:47 +00:00
Wei Mi 6a14325dff [SampleFDO] Add FunctionOffsetTable in compact binary format profile.
The patch saves a function offset table which maps function name index to the
offset of its function profile to the start of the binary profile. By using
the function offset table, for those function profiles which will not be used
when compiling a module, the profile reader does't have to read them. For
profile size around 10~20M, it saves ~10% compile time.

Differential Revision: https://reviews.llvm.org/D51863

llvm-svn: 342283
2018-09-14 20:52:59 +00:00
Wei Mi 94d44c97bc [SampleFDO] Make sample profile loader unaware of compact format change.
The patch tries to make sample profile loader independent of profile format
change. It moves compact format related code into FunctionSamples and
SampleProfileReader classes, and sample profile loader only has to interact
with those two classes and will be unaware of profile format changes.

The cleanup also contain some fixes to further remove the difference between
compactbinary format and binary format. After the cleanup using different
formats originated from the same profile will generate the same binaries,
which we verified by compiling two large server benchmarks w/wo thinlto.

Differential Revision: https://reviews.llvm.org/D51643

llvm-svn: 341591
2018-09-06 22:03:37 +00:00
Wei Mi b1ef2cc53d [SampleFDO] Fix a bug in getOrCompHotCountThreshold/getOrCompColdCountThreshold
getOrCompHotCountThreshold/getOrCompColdCountThreshold introduced in
https://reviews.llvm.org/D45377 contain a bad mistake and will only return 1 or 0
instead of the true hot/cold cutoff value. The patch fixes the mistake. But the
mistake seems not causing big performance difference according to internal server
benchmarks testing.

Differential Revision: https://reviews.llvm.org/D50370

llvm-svn: 339162
2018-08-07 18:13:10 +00:00
Wei Mi a0c0857e7a [SampleFDO] Add a new compact binary format for sample profile.
Name table occupies a big chunk of size in current binary format sample profile.
In order to reduce its size, the patch changes the sample writer/reader to
save/restore MD5Hash of names in the name table. Sample annotation phase will
also use MD5Hash of name to query samples accordingly.

Experiment shows compact binary format can reduce the size of sample profile by
2/3 compared with binary format generally.

Differential Revision: https://reviews.llvm.org/D47955

llvm-svn: 334447
2018-06-11 22:40:43 +00:00
Wei Mi 0c2f6be662 [SampleFDO] Don't treat warm callsite with inline instance in the profile as cold
We found current sampleFDO had a performance issue when triaging a regression.
For a callsite with inline instance in the profile, even if hot callsite inliner
cannot inline it, it may still execute enough times and should not be treated as
cold in regular inliner later. However, currently if such callsite is not inlined
by hot callsite inliner, and the BB where the callsite locates doesn't get
samples from other instructions inside of it, the callsite will have no profile
metadata annotated. In regular inliner cost analysis, if the callsite has no
profile annotated and its caller has profile information, it will be treated as
cold.

The fix changes the isCallsiteHot check and chooses to compare
CallsiteTotalSamples with hot cutoff value computed by ProfileSummaryInfo.

Differential Revision: https://reviews.llvm.org/D45377

llvm-svn: 332058
2018-05-10 23:02:27 +00:00
Shiva Chen 2c864551df [DebugInfo] Add DILabel metadata and intrinsic llvm.dbg.label.
In order to set breakpoints on labels and list source code around
labels, we need collect debug information for labels, i.e., label
name, the function label belong, line number in the file, and the
address label located. In order to keep these information in LLVM
IR and to allow backend to generate debug information correctly.
We create a new kind of metadata for labels, DILabel. The format
of DILabel is

!DILabel(scope: !1, name: "foo", file: !2, line: 3)

We hope to keep debug information as much as possible even the
code is optimized. So, we create a new kind of intrinsic for label
metadata to avoid the metadata is eliminated with basic block.
The intrinsic will keep existing if we keep it from optimized out.
The format of the intrinsic is

llvm.dbg.label(metadata !1)

It has only one argument, that is the DILabel metadata. The
intrinsic will follow the label immediately. Backend could get the
label metadata through the intrinsic's parameter.

We also create DIBuilder API for labels to be used by Frontend.
Frontend could use createLabel() to allocate DILabel objects, and use
insertLabel() to insert llvm.dbg.label intrinsic in LLVM IR.

Differential Revision: https://reviews.llvm.org/D45024

Patch by Hsiangkai Wang.

llvm-svn: 331841
2018-05-09 02:40:45 +00:00
Hubert Tong 55662a8e9f Profiling tests: Endianess XFAIL for powerpc- (32-bit)
Add powerpc- (32-bit) as XFAIL for tests that are documented either in-
line or via commit messages as expected to fail on big-endian systems.

Tests not documented in-line are documented in commit messages as
follows:
r211172 - test/tools/llvm-cov/llvm-cov.test
r247920 - test/Transforms/SampleProfile/gcc-simple.ll

llvm-svn: 322114
2018-01-09 20:09:23 +00:00
Teresa Johnson 915897e21b [PGO] Fix handling of cold entry count for instrumented PGO
Summary:
In r277849, getEntryCount was changed to return None when the entry
count was 0, specifically for SamplePGO where it means no samples were
recorded. However, for instrumentation PGO a 0 entry count should be
returned directly, since it does mean that the function was completely
cold. Otherwise we end up treating these functions conservatively
in isFunctionEntryCold() and isColdBB().

Instead, for SamplePGO use -1 when there are no samples, and change
getEntryCount to return None when the value is -1.

Reviewers: danielcdh, davidxl

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41307

llvm-svn: 321018
2017-12-18 20:02:43 +00:00
Dehao Chen 5d2a1a5045 Include already promoted counts when computing SUM for VP.
Summary: When computing the SUM for indirect call promotion, if the callsite is already promoted in the profile, it will be promoted before ICP. In the current implementation, ICP only sees remaining counts in SUM. This may cause extra indirect call targets being promoted. This patch updates the SUM to include the counts already promoted earlier. This way we do not end up promoting too many indirect call targets.

Reviewers: tejohnson

Reviewed By: tejohnson

Subscribers: llvm-commits, sanjoy

Differential Revision: https://reviews.llvm.org/D38763

llvm-svn: 317502
2017-11-06 19:52:49 +00:00
Dehao Chen c6c051f2ea Include GUIDs from the same module when computing GUIDs that needs to be imported.
Summary: In the compile phase of SamplePGO+ThinLTO, ICP is not invoked. Instead, indirect call targets will be included as function metadata for ThinIndex to buidl the call graph. This should not only include functions defined in other modules, but also functions defined in the same module, otherwise ThinIndex may find the callee dead and eliminate it, while ICP in backend will revive the symbol, which leads to undefined symbol.

Reviewers: tejohnson

Reviewed By: tejohnson

Subscribers: sanjoy, llvm-commits, mehdi_amini

Differential Revision: https://reviews.llvm.org/D39480

llvm-svn: 317118
2017-11-01 20:26:47 +00:00
Dehao Chen 3f56a05ae5 Use the first instruction's count to estimate the funciton's entry frequency.
Summary: In the current implementation, we only have accurate profile count for standalone symbols. For inlined functions, we do not have entry count data because it's not available in LBR. In this patch, we use the first instruction's frequency to estimiate the function's entry count, especially for inlined functions. This may be inaccurate due to debug info in optimized code. However, this is a better estimate than the static 80/20 estimation we have in the current implementation.

Reviewers: tejohnson, davidxl

Reviewed By: tejohnson

Subscribers: sanjoy, llvm-commits, aprantl

Differential Revision: https://reviews.llvm.org/D38478

llvm-svn: 315369
2017-10-10 21:13:50 +00:00
Dehao Chen 9bd60429e2 Directly return promoted direct call instead of rely on stripPointerCast.
Summary: stripPointerCast is not reliably returning the value that's being type-casted. Instead it may look further at function attributes to further propagate the value. Instead of relying on stripPOintercast, the more reliable solution is to directly use the pointer to the promoted direct call.

Reviewers: tejohnson, davidxl

Reviewed By: tejohnson

Subscribers: llvm-commits, sanjoy

Differential Revision: https://reviews.llvm.org/D38603

llvm-svn: 315077
2017-10-06 17:04:55 +00:00
Dehao Chen 16f01fb1db Annotate VP prof on indirect call if it is ICPed in the profiled binary.
Summary: In SamplePGO, when an indirect call is promoted in the profiled binary, before profile annotation, it will be promoted and inlined. For the original indirect call, the current implementation will not mark VP profile on it. This is an issue when profile becomes stale. This patch annotates VP prof on indirect calls during annotation.

Reviewers: tejohnson

Reviewed By: tejohnson

Subscribers: sanjoy, llvm-commits

Differential Revision: https://reviews.llvm.org/D38477

llvm-svn: 315016
2017-10-05 20:15:29 +00:00
Dehao Chen d26dae0d34 Separate the logic when handling indirect calls in SamplePGO ThinLTO compile phase and other phases.
Summary: In SamplePGO ThinLTO compile phase, we will not invoke ICP as it may introduce confusion to the 2nd annotation. This patch extracted that logic and makes it clearer before profile annotation. In the mean time, we need to make function importing process both inlined callsites as well as not promoted indirect callsites.

Reviewers: tejohnson

Reviewed By: tejohnson

Subscribers: sanjoy, mehdi_amini, llvm-commits, inglorion

Differential Revision: https://reviews.llvm.org/D38094

llvm-svn: 314619
2017-10-01 05:24:51 +00:00
Dehao Chen 62b9c33e1e Import all inlined indirect call targets for SamplePGO.
Summary: In the ThinLTO compilation, if a function is inlined in the profiling binary, we need to inline it before annotation. If the callee is not available in the primary module, a first step is needed to import that callee function. For the current implementation, if the call is an indirect call, which has been promoted to >1 targets and inlined, SamplePGO will only import one target with the largest sample count. This patch fixed the bug to import all targets instead.

Reviewers: tejohnson, davidxl

Reviewed By: tejohnson

Subscribers: sanjoy, llvm-commits, mehdi_amini

Differential Revision: https://reviews.llvm.org/D36637

llvm-svn: 313678
2017-09-19 21:18:14 +00:00
Dehao Chen b6e60c8b80 Handle profile mismatch correctly for SamplePGO.
Summary: Fix the bug when promoted call return type mismatches with the promoted function, we should not try to inline it. Otherwise it may lead to compiler crash.

Reviewers: davidxl, tejohnson, eraman

Reviewed By: tejohnson

Subscribers: llvm-commits, sanjoy

Differential Revision: https://reviews.llvm.org/D38018

llvm-svn: 313658
2017-09-19 18:26:54 +00:00