Commit Graph

67 Commits

Author SHA1 Message Date
Hongtao Yu 9f732af583 [llvm-profgen] Filter out oversized LBR ranges.
As a follow up to {D123271}, LBR ranges that are too big should also be considered as invalid.

For example, the last two pairs in the following trace form a range [0x0d7b02b0, 0x368ba706] that covers a ton of functions in the binary. Such oversized range should also be ignored.

   0x0c74505f/0x368b99a0 **0x368ba706**/0x0c745040  0x0d7b1c3f/**0x0d7b02b0**

Add a defensive check to filter out those ranges based that the valid range should not cross the unconditional branch(Call, return, unconditional jmp).

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D125448
2022-05-12 10:58:50 -07:00
Hongtao Yu e36786d15f [CSSPGO] Rename ProfileIsCSNested and ProfileIsCSFlat
To be more clear and definitive, I'm renaming `ProfileIsCSFlat` back to `ProfileIsCS` which stands for full context-sensitive flat profiles.  `ProfileIsCSNested` is now renamed to `ProfileIsPreInlined` and is extended to be applicable for CS flat profiles too. More specifically, `ProfileIsPreInlined` is for any kind of profiles (flat or nested) that contain 'ShouldBeInlined' contexts. The flag is encoded in the profile summary section for extbinary profiles and is computed on-the-fly for text profiles.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D122602
2022-04-29 17:03:52 -07:00
wlei bfcb2c1119 [llvm-profgen] Decouple artificial branch from LBR parser and fix external address related issues
This patch is fixing two issues for both CS and non-CS.
1) For external-call-internal, the head samples of the the internal function should be recorded.
2) avoid ignoring LBR after meeting the interrupt branch for CS profile

LBR parser is shared between CS and non-CS, we found it's error-prone while dealing with artificial branch inside LBR parser. Since artificial branch is mainly used for CS profile unwinding, this patch tries to simplify LBR parser by decoupling artificial branch code from it, the concept of artificial branch is removed and split into two transitional branches(internal-to-external, external-to-internal). Then we leave all the processing of external branch to unwinder.

Specifically for unwinder, remembering that we introduce external frame in https://reviews.llvm.org/D115550. We can just take external address as a regular address and reuse current unwind function(unwindCall, unwindReturn). For a normal case, the external frame will match an external LBR, and it will be filtered out by `unwindLinear` without losing any context.

The data also shows that the interrupt or standalone LBR pattern(unpaired case) does exist, we choose to handle it by clearing the call stack and keeping unwinding. Here we leverage checking in `unwindLinear`, because a standalone LBR, no matter its type, since it doesn’t have other part to pair, it will eventually cause a wrong linear range, like [external, internal], [internal, external]. Then set the state to invalid there.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D118177
2022-04-28 16:07:28 -07:00
Wenlei He 17f6cba30d [llvm-profgen] Add process filter for perf reader
For profile generation, we need to filter raw perf samples for binary of interest. Sometimes binary name along isn't enough as we can have binary of the same name running in the system. This change adds a process id filter to allow users to further disambiguiate the input raw samples.

Differential Revision: https://reviews.llvm.org/D123869
2022-04-18 09:50:16 -07:00
Hongtao Yu 8a0406dcc8 [llvm-profgen] Filter out invalid LBR ranges.
The profiler can sometimes give us a LBR trace that implicates bogus code ranges. For example,

    0xc5acb56/0xc66c6c0 0xc628195/0xf31fbb0 0xc611261/0xc628130 0xc5c1a21/0xc6111c0 0x1f7edfd3/0xc5c3a50 0xc5c154f/0x1f7edec0 0xe8eed07/0xc5c11e0

, note that the first two pairs are supposed to form a linear execution range, in this case, it is [0xf31fbb0, 0xc5acb56] , which doesn't make sense.

Such bogus ranges should be ruled out to avoid generating a bad profile. I'm fixing this for both CS and non-CS cases.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D123271
2022-04-07 21:42:01 -07:00
Hongtao Yu 3f97016857 [llvm-profgen] Decoding pseudo probe for profiled function only.
Complete pseudo probes decoding can result in large memory usage. In practice only a small porting of the decoded probes are used in profile generation. I'm changing the full decoding mode to be decoding for profiled functions only, though we still do a full scan of the .pseudoprobe section due to a missing table-of-content but we don't have to build the in-memory data structure for functions not sampled.

To build the in-memory data structure for profiled functions only, I'm rewriting the previous non-recursive probe decoding logic to be recursive. This is easy to read and maintain.

I also have to change the previous representation of unsymbolized context from probe-based stack to address-based stack since the profiled functions are unknown yet by the time of virtual unwinding. The address-based stack will be converted to probe-based stack after virtual unwinding and on-demand probe decoding.

I'm seeing 20GB memory is saved for one of our internal large service.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D121643
2022-03-23 14:15:11 -07:00
serge-sans-paille db29f4374d Cleanup include: DebugInfo/Symbolize
Estimation of the impact on preprocessor output
after: 1067349756
before:1067487786

Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D120433
2022-02-24 13:25:11 +01:00
Hongtao Yu 67db31115d [llvm-profgen] Clean up unnecessary memory reservations between phases.
Cleaning up data structures that are not used after a certain point. This further brings down peak memory usage by 15% for a large benchmark.

Before:
   note: Before parsePerfTraces
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: Before parseAndAggregateTrace
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: After parseAndAggregateTrace
   note: VM: 88.93 GB   RSS: 87.97 GB
   note: Before generateUnsymbolizedProfile
   note: VM: 88.95 GB   RSS: 87.99 GB
   note: After generateUnsymbolizedProfile
   note: VM: 93.50 GB   RSS: 92.53 GB
   note: After computeSizeForProfiledFunctions
   note: VM: 101.13 GB   RSS: 99.36 GB
   note: After generateProbeBasedProfile
   note: VM: 215.61 GB   RSS: 210.88 GB
   note: After postProcessProfiles
   note: VM: 237.48 GB   RSS: 212.50 GB

After:
   note: Before parsePerfTraces
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: Before parseAndAggregateTrace
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: After parseAndAggregateTrace
   note: VM: 88.93 GB   RSS: 87.96 GB
   note: Before generateUnsymbolizedProfile
   note: VM: 88.95 GB   RSS: 87.97 GB
   note: After generateUnsymbolizedProfile
   note: VM: 93.50 GB   RSS: 92.51 GB
   note: After computeSizeForProfiledFunctions
   note: VM: 93.50 GB   RSS: 92.53 GB
   note: After generateProbeBasedProfile
   note: VM: 164.87 GB   RSS: 163.55 GB
   note: After postProcessProfiles
   note: VM: 182.28 GB   RSS: 179.43 GB

Reviewed By: wenlei, wlei

Differential Revision: https://reviews.llvm.org/D118677
2022-02-01 16:27:54 -08:00
Hongtao Yu fec57e5b17 Revert "[llvm-profgen] Clean up unnecessary memory reservations between phases."
This reverts commit 057e784b09.
2022-02-01 14:44:48 -08:00
Hongtao Yu 057e784b09 [llvm-profgen] Clean up unnecessary memory reservations between phases.
Cleaning up data structures that are not used after a certain point. This further brings down peak memory usage by 15% for a large benchmark.

Before:
   note: Before parsePerfTraces
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: Before parseAndAggregateTrace
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: After parseAndAggregateTrace
   note: VM: 88.93 GB   RSS: 87.97 GB
   note: Before generateUnsymbolizedProfile
   note: VM: 88.95 GB   RSS: 87.99 GB
   note: After generateUnsymbolizedProfile
   note: VM: 93.50 GB   RSS: 92.53 GB
   note: After computeSizeForProfiledFunctions
   note: VM: 101.13 GB   RSS: 99.36 GB
   note: After generateProbeBasedProfile
   note: VM: 215.61 GB   RSS: 210.88 GB
   note: After postProcessProfiles
   note: VM: 237.48 GB   RSS: 212.50 GB

After:
   note: Before parsePerfTraces
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: Before parseAndAggregateTrace
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: After parseAndAggregateTrace
   note: VM: 88.93 GB   RSS: 87.96 GB
   note: Before generateUnsymbolizedProfile
   note: VM: 88.95 GB   RSS: 87.97 GB
   note: After generateUnsymbolizedProfile
   note: VM: 93.50 GB   RSS: 92.51 GB
   note: After computeSizeForProfiledFunctions
   note: VM: 93.50 GB   RSS: 92.53 GB
   note: After generateProbeBasedProfile
   note: VM: 164.87 GB   RSS: 163.55 GB
   note: After postProcessProfiles
   note: VM: 182.28 GB   RSS: 179.43 GB

Reviewed By: wenlei, wlei

Differential Revision: https://reviews.llvm.org/D118677
2022-02-01 12:48:08 -08:00
wlei b239b2b0db [llvm-profgen] Fix warning of enumerated and non-enumerated type in conditional expression
Differential Revision: https://reviews.llvm.org/D115842
2021-12-16 19:28:55 -08:00
wlei 0f53df864e [CSSPGO][llvm-profgen] Fix external address issues of perf reader (return to external addr part)
Before we have an issue with artificial LBR whose source is a return, recalling that "an internal code(A) can return to external address, then from the external address call a new internal code(B), making an artificial branch that looks like a return from A to B can confuse the unwinder". We just ignore the LBRs after this artificial LBR which can miss some samples. This change aims at fixing this by correctly unwinding them instead of ignoring them.

List some typical scenarios covered by this change.

1)  multiple sequential call back happen in external address, e.g.

```
[ext, call, foo] [foo, return, ext] [ext, call, bar]
```
Unwinder should avoid having foo return from bar. Wrong call stack is like [foo, bar]

2) the call stack before and after external call should be correctly unwinded.
```
 {call stack1}                                            {call stack2}
 [foo, call, ext]  [ext, call, bar]  [bar, return, ext]  [ext, return, foo ]
```
call stack 1 should be the same to call stack2. Both shouldn't be truncated

3) call stack should be truncated after call into external code since we can't do inlining with external code.

```
 [foo, call, ext]  [ext, call, bar]  [bar, call, baz] [baz, return, bar ] [bar, return, ext]
```
the call stack of code in baz should not include foo.

### Implementation:

We leverage artificial frame to fix #2 and #3: when we got a return artificial LBR, push an extra artificial frame to the stack. when we pop frame, check if the parent is an artificial frame to pop(fix #2). Therefore, call/ return artificial LBR is just the same as regular LBR which can keep the call stack.

While recording context on the trie, artificial frame is used as a tag indicating that we should truncate the call stack(fix #3).

To differentiate #1 and #2, we leverage `getCallAddrFromFrameAddr`.  Normally the target of the return should be the next inst of a call inst and `getCallAddrFromFrameAddr` will return the address of call inst. Otherwise, getCallAddrFromFrameAddr will return to 0 which is the case of #1.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115550
2021-12-14 16:40:54 -08:00
wlei 30c3aba998 [llvm-profgen] Fix to use getUntrackedCallsites outside the loop
Unwinder is hoisted out in https://reviews.llvm.org/D115550, so fix the useage of getUntrackedCallsites.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115760
2021-12-14 16:40:53 -08:00
wlei 3dcb60db9a [CSSPGO][llvm-profgen] Fix external address issues of perf reader (leading external LBR part)
We can have the sampling just hit into the external addresses, in that case, both the top stack frame and the latest LBR target are external addresses. For example:
```
	        ffffffff
 0x4006c8/0xffffffff/P/-/-/0  0x40069b/0x400670/M/-/-/0

 	          ffffffff
	          40067e
0xffffffff/0xffffffff/P/-/-/0  0x4006c8/0xffffffff/P/-/-/0  0x40069b/0x400670/M/-/-/0
```
Before we will ignore the entire samples. However, we found there exists some internal LBRs in the remaining part of sample, the range between them is still a valid range, we will lose some valid LBRs. Those LBRs will be unwinded based on a empty(context-less) call stack.

This change tries to fix it, instead of ignoring the entire sample, we only ignore the leading external addresses.

Note that the first outgoing LBR is useful since there is a valid range between it's source and next LBR's target.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115538
2021-12-14 16:40:53 -08:00
Hongtao Yu 5740bb801a [CSSPGO] Use nested context-sensitive profile.
CSSPGO currently employs a flat profile format for context-sensitive profiles. Such a flat profile allows for precisely manipulating contexts that is either inlined or not inlined. This is a benefit over the nested profile format used by non-CS AutoFDO. A downside of this is the longer build time due to parsing the indexing the full CS contexts.

For a CS flat profile, though only the context profiles relevant to a module are loaded when that module is compiled, the cost to figure out what profiles are relevant is noticeably high when there're many contexts,  since the sample reader will need to scan all context strings anyway. On the contrary, a nested function profile has its related inline subcontexts isolated from other unrelated contexts. Therefore when compiling a set of functions, unrelated contexts will never need to be scanned.

In this change we are exploring using nested profile format for CSSPGO. This is expected to work based on an assumption that with a preinliner-computed profile all contexts are precomputed and expected to be inlined by the compiler. Contexts not expected to be inlined will be cut off and returned to corresponding base profiles (for top-level outlined functions). This naturally forms a nested profile where all nested contexts are expected to be inlined. The compiler will less likely optimize on derived contexts that are not precomputed.

A CS-nested profile will look exactly the same with regular nested profile except that each nested profile can come with an attributes. With pseudo probes,  a nested profile shown as below can also have a CFG checksum.

```

main:1968679:12
 2: 24
 3: 28 _Z5funcAi:18
 3.1: 28 _Z5funcBi:30
 3: _Z5funcAi:1467398
  0: 10
  1: 10 _Z8funcLeafi:11
  3: 24
  1: _Z8funcLeafi:1467299
   0: 6
   1: 6
   3: 287884
   4: 287864 _Z3fibi:315608
   15: 23
   !CFGChecksum: 138828622701
   !Attributes: 2
  !CFGChecksum: 281479271677951
  !Attributes: 2
```

Specific work included in this change:
- A recursive profile converter to convert CS flat profile to nested profile.
- Extend function checksum and attribute metadata to be stored in nested way for text profile and extbinary profile.
- Unifiy sample loader inliner path for CS and preinlined nested profile.
 - Changes in the sample loader to support probe-based nested profile.

I've seen promising results regarding build time. A nested profile can result in a 20% shorter build time than a CS flat profile while keep an on-par performance. This is with -duplicate-contexts-into-base=1.

Test Plan:

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D115205
2021-12-14 14:40:25 -08:00
wlei 41a681ce09 [FS-AFDO][llvm-profgen] Generate profile with FS-AFDO discriminator
In order to support generating profile  with FS discriminator, three kind of changes are done in llvm-profgen:

1) Dissassemble .rodata section to check if FS discriminator var ('"__llvm_fs_discriminator__"') exists and set the corresponding flag in the binary.

2) Change the discriminator decoding in `getBaseDiscriminator` and `getDuplicationFactor`.

3) set true for `FunctionSamples::ProfileIsFS` to enable FS functionality in ProfileData.

Reviewed By: xur, hoy, wenlei

Differential Revision: https://reviews.llvm.org/D113296
2021-11-30 15:57:59 -08:00
Wenlei He f7976edc1e [llvm-profgen] Add switch to allow use of first loadable segment for calculating offset
Adding `-use-loadable-segment-as-base` to allow use of first loadable segment for calculating offset. By default first executable segment is used for calculating offset. The switch helps compatibility with unsymbolized profile generated from older tools.

Differential Revision: https://reviews.llvm.org/D113727
2021-11-15 19:00:27 -08:00
wlei aab1810006 [llvm-profgen] Fix bug of setting function entry
Previously we set `isFuncEntry` flag  to true when the funcName from DWARF is equal to the name in symbol table and we use this flag to ignore reporting callsite sample that's from an intra func branch. However, in HHVM, it appears that the symbol table name is inconsistent with the dwarf info func name, it's likely due to `OptimizeGlobalAliases`.

This change is a workaround in llvm-profgen side to mark the only one range as the function entry and add warnings for the remaining inconsistence.

This also fixed a missing `getCanonicalFnName` for symbol name which caused the mismatching as well.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D113492
2021-11-12 12:18:43 -08:00
wlei dc9f037955 [llvm-profgen] Refactor the code of getHashCode
Refactor to generate hash code lazily. Tested on clang self build, no observable generating time regression.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D113059
2021-11-02 19:56:20 -07:00
wlei 138202a8c3 [llvm-profgen] Warn on invalid range and show warning summary
Two things in this diff:

1) Warn on the invalid range, currently three types of checking, see the detailed message in the code.

2) In some situation, llvm-profgen gives lots of warnings on the truncated stacks which is noisy. This change provides a switch to `--show-detailed-warning` to skip the warnings. Alternatively, we use a summary for those warning and show the percentage of cases with those issues.

Example of warning summary.
```
warning: 0.05%(1120/2428958) cases with issue: Profile context truncated due to missing probe for call instruction.
warning: 0.00%(2/178637) cases with issue: Range does not belong to any functions, likely from external function.
```

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D111902
2021-11-02 19:55:55 -07:00
wlei a5f411b7f8 [llvm-profgen] Allow unsymbolized profile as perf input
This change allows the unsymbolized profile as input. The unsymbolized profile is created by `llvm-profgen` with `--skip-symbolization` and it's after the sample aggregation but before symbolization , so it has much small file size. It can be used for sample merging and trimming,  also is useful for debugging or adding test cases. A switch `--unsymbolized-profile=file-patch` is added for this.

Format of unsymbolized profile:
```

   [context stack1]    # If it's a CS profile
      number of entries in RangeCounter
      from_1-to_1:count_1
      from_2-to_2:count_2
      ......
      from_n-to_n:count_n
      number of entries in BranchCounter
      src_1->dst_1:count_1
      src_2->dst_2:count_2
      ......
      src_n->dst_n:count_n
    [context stack2]
      ......
```

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D111750
2021-10-25 23:58:08 -07:00
Kazu Hirata 4e3eebc6bd [tools, utils] Use StringRef::contains (NFC) 2021-10-22 17:22:13 -07:00
Wenlei He a316343e19 [llvm-profgen] Allow generating AutoFDO profile from CSSPGO binary
Add `-use-dwarf-correlation` switch to allow llvm-profgen to generate AutoFDO profile for binaries built with CSSPGO (pseudo-probe).

Differential Revision: https://reviews.llvm.org/D111776
2021-10-14 09:11:56 -07:00
wlei 30ca33eab0 [llvm-profgen] Ignore the whole trace with the leading external branch
The first LBR entry can be an external branch, we should ignore the whole trace.

```
     7f7448e889e4 0x7f7448e889e4/0x7f7448e88826/P/-/-/1  0x7f7448e8899f/0x7f7448e889d8/P/-/-/4  ...
```

Reviewed By: wenlei, hoy

Differential Revision: https://reviews.llvm.org/D111749
2021-10-13 16:52:29 -07:00
wlei ab5d65e685 [llvm-profgen] Ignore stack samples before aggregation
With `ignore-stack-samples`, We can ignore the call stack before the samples aggregation which could reduce some redundant computations.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D111577
2021-10-13 16:52:29 -07:00
Wenlei He da4e5fc861 [llvm-profgen] Deduplicate PID when processing perf input
When parsing mmap to retrieve PID, deduplicate them before passing PID list to perf script. Perf script would error out when there's duplicated PID in the input, however raw perf data may main duplicated PID for large binary where more than one mmap is needed to load executable segment.

Differential Revision: https://reviews.llvm.org/D111384
2021-10-10 13:30:17 -07:00
Wenlei He 1f0bc617bd [llvm-porfgen] Allow perf data as input
This change enables llvm-profgen to take raw perf data as alternative input format. Sometimes we need to retrieve evenets for processes with matching binary. Using perf data as input allows us to retrieve process Ids from mmap events for matching binary, then filter by process id during perf script generation.

Differential Revision: https://reviews.llvm.org/D110793
2021-09-29 22:57:35 -07:00
Wenlei He 941191aae4 [llvm-profgen] Refactor and better diagnostics
This change contains diagnostics improvments, refactoring and preparation for consuming perf data directly.

Diagnostics:
 - We now have more detailed diagnostics when no mmap is found.
 - We also print warning for abnormal transition to external code.

Refactoring:
 - Simplify input perf trace processing to only allow a single input file. This is because 1) using multiple input perf trace (perf script) is error prone because we may miss key mmap events. 2) the functionality is not really being used anyways.
 - Make more functions private for Readers, move non-trivial definitions out of header. Cleanup some inconsistency.
 - Prepare for consuming perf data as input directly.

Differential Revision: https://reviews.llvm.org/D110729
2021-09-29 22:55:50 -07:00
wlei a03cf331e1 [llvm-profgen] Strip context to support non-CS profile generation for hybrid sample
Differential Revision: https://reviews.llvm.org/D109769
2021-09-28 12:20:23 -07:00
wlei 1422fa5fab [llvm-profgen] Unify output format of different unsymbolized profiles
Differential Revision: https://reviews.llvm.org/D110080
2021-09-24 14:18:00 -07:00
wlei a7cdcf25c1 [llvm-profgen] Ignore invalid perf line in LBR record
Similar to https://reviews.llvm.org/D109637, there is a whole invalid line of message in perfscript.

```
warning: Invalid address in LBR record at line 14118674: Processed 14138923 events and lost 1 chunks!
warning: Invalid address in LBR record at line 14118676: Check IO/CPU overload!
```

This only happened for LBR only perfscript, hybridperfscript have a check of " 0x" to make sure it's the LBR perf line.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D110424
2021-09-24 13:44:57 -07:00
wlei 686cc00067 [llvm-profgen] Fix an out-of-range error during unwinding
It happened that the LBR entry target can be the first address of text section which causes an out-of-range crash. So here add a boundary check.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D110271
2021-09-22 18:33:27 -07:00
Hongtao Yu 0057c7185d [CSSPGO][llvm-profgen] Truncate stack samples with invalid return address.
Invalid frame addresses exist in call stack samples due to bad unwinding. This could happen to frame-pointer-based unwinding and the callee functions that do not have the frame pointer chain set up. It isn't common when the program is built with the frame pointer omission disabled, but can still happen with third-party static libs built with frame pointer omitted.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D109638
2021-09-14 21:56:22 -07:00
Hongtao Yu 8cbbd7e0b2 [llvm-profgen] Ignore broken LBR samples
Perf script can sometimes give disordered LBR samples like below.

```
          b022500
          32de0044
          3386e1d1
      7f118e05720c
      7f118df2d81f
 0x2a0b9622/0x2a0b9610/P/-/-/1  0x2a0b79ff/0x2a0b9618/P/-/-/2  0x2a0b7a4a/0x2a0b79e8/P/-/-/1  0x2a0b7a33/0x2a0b7a46/P/-/-/1  0x2a0b7a42/0x2a0b7a23/P/-/-/1  0x2a0b7a21/0x2a0b7a37/P/-/-/2  0x2a0b79e6/0x2a0b7a07/P/-/-/1  0x2a0b79d4/0x2a0b79dc/P/-/-/2  0x2a0b7a03/0x2a0b79aa/P/-/-/1  0x2a0b79a8/0x2a0b7a00/P/-/-/234  0x2a0b9613/0x2a0b7930/P/-/-/1  0x2a0b9622/0x2a0b9610/P/-/-/1  0x2a0b79ff/0x2a0b9618/P/-/-/2  0x2a0b7a4a/0x2aWarning:
Processed 10263226 events and lost 1 chunks!

```
 Note that the last LBR record `0x2a0b7a4a/0x2aWarning:` . Currently llvm-profgen does not detect that and as a result an uninitialized branch target value will be used. The uninitialized value can cause creepy instruction ranges created which which in turn will result in a completely wrong profile. An example is like

```

 .... @ _ZN5folly13loadUnalignedIsEET_PKv]:18446744073709551615:18446744073709551615
 1: 18446744073709551615
 !CFGChecksum: 4294967295
 !Attributes: 0
```

Reviewed By: wenlei, wlei

Differential Revision: https://reviews.llvm.org/D109637
2021-09-14 12:11:17 -07:00
Wenlei He 6eca242e09 [llvm-profgen] Deduplicate and improve warning for truncated context
This change improves the warning for truncated context by: 1) deduplicate them as one call without probe can appear in many different context leading to duplicated warnings , 2) rephrase the message to make it easier to understand. The term "untracked frame" can be confusing.

Differential Revision: https://reviews.llvm.org/D109115
2021-09-02 09:15:38 -07:00
wlei 964053d56f [llvm-profgen] Support LBR only perf script
This change aims at supporting LBR only sample perf script which is used for regular(Non-CS) profile generation.  A LBR perf script includes a batch of LBR sample which starts with a frame pointer and a group of 32 LBR entries is followed. The FROM/TO LBR pair and the range between two consecutive entries (the former entry's TO and the latter entry's FROM) will be used to infer function profile info.

An example of LBR perf script(created by `perf script -F ip,brstack -i perf.data`)
```
           40062f 0x40062f/0x4005b0/P/-/-/9  0x400645/0x4005ff/P/-/-/1  0x400637/0x400645/P/-/-/1 ...
           4005d7 0x4005d7/0x4005e5/P/-/-/8  0x40062f/0x4005b0/P/-/-/6  0x400645/0x4005ff/P/-/-/1 ...
           ...
```

For implementation:
 - Extended a new child class `LBRPerfReader` for the sample parsing, reused all the functionalities in `extractLBRStack` except for an extension to parsing leading instruction pointer.
 - `HybridSample` is reused(just leave the call stack empty) and the parsed samples is still aggregated in `AggregatedSamples`. After that, range samples, branch sample, address samples are computed and recorded.
 - Reused `ContextSampleCounterMap` to store the raw profile, since it's no need to aggregation by context, here it just registered one sample counter with a fake context key.
 - Unified to use `show-raw-profile` instead of `show-unwinder-output` to dump the intermediate raw profile, see the comments of the format of the raw profile. For CS profile, it remains to output the unwinder output.

Profile generation part will come soon.

Differential Revision: https://reviews.llvm.org/D108153
2021-08-31 13:28:17 -07:00
Hongtao Yu b9db70369b [CSSPGO] Split context string to deduplicate function name used in the context.
Currently context strings contain a lot of duplicated function names and that significantly increase the profile size. This change split the context into a series of {name, offset, discriminator} tuples so function names used in the context can be replaced by the index into the name table and that significantly reduce the size consumed by context.

A follow-up improvement made in the compiler and profiling tools is to avoid reconstructing full context strings which is  time- and memory- consuming. Instead a context vector of `StringRef` is adopted to represent the full context in all scenarios. As a result, the previous prevalent profile map which was implemented as a `StringRef` is now engineered as an unordered map keyed by `SampleContext`. `SampleContext` is reshaped to using an `ArrayRef` to represent a full context for CS profile. For non-CS profile, it falls back to use `StringRef` to represent a contextless function name. Both the `ArrayRef` and `StringRef` objects are underpinned by real array and string objects that are stored in producer buffers. For compiler, they are maintained by the sample reader. For llvm-profgen, they are maintained in `ProfiledBinary` and `ProfileGenerator`. Full context strings can be generated only in those cases of debugging and printing.

When it comes to profile format, nothing has changed to the text format, though internally CS context is implemented as a vector. Extbinary format is only changed for CS profile, with an additional `SecCSNameTable` section which stores all full contexts logically in the form of `vector<int>`, which each element as an offset points to `SecNameTable`. All occurrences of contexts elsewhere are redirected to using the offset of `SecCSNameTable`.

Testing
This is no-diff change in terms of code quality and profile content (for text profile).

For our internal large service (aka ads), the profile generation is cut to half, with a 20x smaller string-based extbinary format generated.

The compile time of ads is dropped by 25%.

Differential Revision: https://reviews.llvm.org/D107299
2021-08-30 20:09:29 -07:00
wlei 9af46710fe [llvm-profgen] Move profiled binary loading out of PerfReader
Change to use unique pointer of profiled binary to unblock asan.

At same time, I realized we can decouple to move the profiled binary loading out of PerfReader, so I made some other related refactors.

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D108254
2021-08-17 17:28:01 -07:00
wlei f812c19253 [llvm-profgen] Clean up code dealing with multiple binaries
As we decided to support only one binary each time, this patch cleans up the related code dealing with multiple binaries. We can use `llvm-profdata` to merge profile from multiple binaries.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D108002
2021-08-17 12:16:07 -07:00
wlei 856a6a5041 [CSSPGO][llvm-profgen] Trim and merge context beforehand to reduce memory usage
Currently we use a centralized string map(StringMap<FunctionSamples> ProfileMap) to store the profile while populating the sample, which might cause the memory usage bottleneck. I saw in an extreme case, there are thousands of samples whose context stack depth is >= 100. The memory consumption can be greater than 100GB.

As here the context is used for inlining, we can assume we won't have so many of inlinees keeping inlined at the same root function, so this change tried to cap the context stack and merge the samples for peak memory reduction and this is done after recursion compression.

The default value is -1 meaning no depth limit, in the future we can tune to a smaller one.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D107800
2021-08-11 16:02:35 -07:00
jamesluox ee7d20e846 [CSSPGO] Migrate and refactor the decoder of Pseudo Probe
Migrate pseudo probe decoding logic in llvm-profgen to MC, so other LLVM-base program could reuse existing codes. Redesign object layout of encoded and decoded pseudo probes.

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D106861
2021-08-04 09:21:34 -07:00
wlei f1affe8dc8 [llvm-profgen][CSSPGO] Support count based aggregated type of hybrid perf script
This change tried to integrate a new count based aggregated type of perf script. The only difference of the format is that an aggregated count is added at the head of the original sample which means the same samples are repeated to the given count times. This is used to reduce the perf script size.
e.g.
```
2
	          4005dc
	          400634
	          400684
	    7f68c5788793
 0x4005c8/0x4005dc/P/-/-/0  ....
```
Implemented by a dedicated PerfReader `AggregatedHybridPerfReader`.

Differential Revision: https://reviews.llvm.org/D107192
2021-08-03 17:56:35 -07:00
wlei fe3ba90830 [llvm-profgen] Support perf script without parsing MMap events
This change supports to run without parsing MMap binary loading events instead it always assumes binary is loaded at the preferred address. This is used when we have assured no binary load address changes or we have pre-processed the addresses resolution. Warn if there's interior mmap event but without leading mmap events.

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D107097
2021-08-03 10:01:07 -07:00
wlei 6da9241aab [llvm-profgen] Refactor PerfReader to allow different types of perf scripts
In order to support different types of perf scripts, this change tried to refactor `PerfReader` by adding the base class `PerfReaderBase` and current HybridPerfReader is derived from it for CS profile generation. Common functions like, passMM2PEvents, extract_lbrs, extract_callstack, etc. can be reused.

Next step is to add LBR only reader(for non-CS profile) and aggregated perf scripts reader(do a pre-aggregation of scripts).

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D107014
2021-08-02 17:18:47 -07:00
Hongtao Yu cda2394d97 [NFC][CSSPGO] Rename the name of an enum value. 2021-07-13 18:30:16 -07:00
Hongtao Yu 0712038458 [CSSPGO][llvm-profgen] Allow multiple executable load segments.
The linker or post-link optimizer can create an ELF image with multiple executable segments each of which will be loaded separately at run time. This breaks the assumption of llvm-profgen that currently only supports one base load address. What it ends up with is that the subsequent mmap events will be treated as an overwrite of the first mmap event which will in turn screw up address mapping. While it is non-trivial to support multiple separate load addresses and given that on x64 those segments will always be loaded at consecutive addresses (though via separate mmap
sys calls), I'm adding an error checking logic to bail out if that's violated and keep using a single load address which is the address of the first executable segment.

Also changing the disassembly output from printing section offset to printing the virtual address instead, which matches the behavior of objdump.

Differential Revision: https://reviews.llvm.org/D103178
2021-07-13 18:22:24 -07:00
Hongtao Yu 5c8659801a [CSSPGO][llvm-profgen] Handle return to external transition.
In a callback case, a return from internal code, say A, to external runtime can happen. The external runtime can then call back to another internal routine, say B. Making an artificial branch that looks like a return from A to B can confuse the unwinder to treat the instruction before B as the call instruction.

Reviewed By: wenlei, wmi

Differential Revision: https://reviews.llvm.org/D104546
2021-06-22 16:24:59 -07:00
Hongtao Yu 8c2c97287e [CSSPGO][llvm-profgen] Ignore LBR records after interrupt transition
If we have seen an inwards transition from external code to internal code, but not a following outwards transition, the inwards transition is likely due to interrupt which is usually unpaired. Ignore current  and subsequent entries since they are likely from an unrelated pre-interrupt context.

LBR records from different interrupt context are unrelated and they should not be mixed together. Currenlty the OS does this for task-scheduling interrupt but not for all interrupts.

Reviewed By: wenlei, wlei

Differential Revision: https://reviews.llvm.org/D104276
2021-06-18 12:13:53 -07:00
Hongtao Yu 3b51b51877 [CSSPGO][llvm-profgen] Report samples for untrackable frames.
Fixing an issue where samples collected for an untrackable frame is not reported. An untrackable frame refers to a frame whose caller is untrackable due to missing debug info or pseudo probe. Though the frame is connected to its parent frame through the frame pointer chain at runtime, the compiler cannot build the connection without debug info or pseudo probe. In such case we just need to report the untrackable frame as the base frame and all of its child frames.

With more samples reported I'm seeing this improves the performance of an internal benchmark by 2.5%.

Reviewed By: wenlei, wlei

Differential Revision: https://reviews.llvm.org/D102961
2021-05-24 12:39:12 -07:00
Nico Weber ba7a92c01e [Support] Don't include VirtualFileSystem.h in CommandLine.h
CommandLine.h is indirectly included in ~50% of TUs when building
clang, and VirtualFileSystem.h is large.

(Already remarked by jhenderson on D70769.)

No behavior change.

Differential Revision: https://reviews.llvm.org/D100957
2021-04-21 10:19:01 -04:00