llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

2546 lines
98 KiB
C++
Raw Normal View History

SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
//===- SampleProfile.cpp - Incorporate sample profiles into the IR --------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
//
//===----------------------------------------------------------------------===//
//
// This file implements the SampleProfileLoader transformation. This pass
// reads a profile file generated by a sampling profiler (e.g. Linux Perf -
// http://perf.wiki.kernel.org/) and generates IR metadata to reflect the
// profile information in the given profile.
//
// This pass generates branch weight annotations on the IR:
//
// - prof: Represents branch weights. This annotation is added to branches
// to indicate the weights of each edge coming out of the branch.
// The weight of each edge is the weight of the target block for
// that edge. The weight of a block B is computed as the maximum
// number of samples found in B.
//
//===----------------------------------------------------------------------===//
#include "llvm/Transforms/IPO/SampleProfile.h"
#include "llvm/ADT/ArrayRef.h"
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/None.h"
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
#include "llvm/ADT/PriorityQueue.h"
#include "llvm/ADT/SCCIterator.h"
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/Statistic.h"
#include "llvm/ADT/StringMap.h"
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/Twine.h"
#include "llvm/Analysis/AssumptionCache.h"
#include "llvm/Analysis/CallGraph.h"
#include "llvm/Analysis/CallGraphSCCPass.h"
#include "llvm/Analysis/InlineAdvisor.h"
#include "llvm/Analysis/InlineCost.h"
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/OptimizationRemarkEmitter.h"
#include "llvm/Analysis/PostDominators.h"
#include "llvm/Analysis/ProfileSummaryInfo.h"
#include "llvm/Analysis/ReplayInlineAdvisor.h"
#include "llvm/Analysis/TargetLibraryInfo.h"
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/DebugInfoMetadata.h"
#include "llvm/IR/DebugLoc.h"
#include "llvm/IR/DiagnosticInfo.h"
#include "llvm/IR/Dominators.h"
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalValue.h"
#include "llvm/IR/InstrTypes.h"
#include "llvm/IR/Instruction.h"
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/MDBuilder.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/IR/ValueSymbolTable.h"
#include "llvm/InitializePasses.h"
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
#include "llvm/Pass.h"
#include "llvm/ProfileData/InstrProf.h"
#include "llvm/ProfileData/SampleProf.h"
#include "llvm/ProfileData/SampleProfReader.h"
#include "llvm/Support/Casting.h"
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"
#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/ErrorOr.h"
#include "llvm/Support/GenericDomTree.h"
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
#include "llvm/Support/raw_ostream.h"
#include "llvm/Transforms/IPO.h"
[CSSPGO] Infrastructure for context-sensitive Sample PGO and Inlining This change adds the context-senstive sample PGO infracture described in CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduced an abstraction between input profile and profile loader that queries input profile for functions. Specifically, there's now the notion of base profile and context profile, and they are managed by the new SampleContextTracker for adjusting and merging profiles based on inline decisions. It works with top-down profiled guided inliner in profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to CGSCC inliner in order for it to take advantage of context-sensitive profile. This change is the consumption part of context-sensitive profile (The generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Pacthes for integration with Pseudo-probe coming up soon. Currently the new infrastructure kick in when input profile contains the new context-sensitive profile; otherwise it's no-op and does not affect existing AutoFDO. **Interface** There're two sets of interfaces for query and tracking respectively exposed from SampleContextTracker. For query, now instead of simply getting a profile from input for a function, we can explicitly query base profile or context profile for given call path of a function. For tracking, there're separate APIs for marking context profile as inlined, or promoting and merging not inlined context profile. - Query base profile (`getBaseSamplesFor`) Base profile is the merged synthetic profile for function's CFG profile from any outstanding (not inlined) context. We can query base profile by function. - Query context profile (`getContextSamplesFor`) Context profile is a function's CFG profile for a given calling context. We can query context profile by context string. - Track inlined context profile (`markContextSamplesInlined`) When a function is inlined for given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing base profile for that inlined function. - Track not-inlined context profile (`promoteMergeContextSamplesTree`) When a function is not inlined for given calling context, we need to promote the context profile tree so the not inlined context becomes top-level context. This preserve the sub-context under that function so later inline decision for that not inlined function will still have context profile for its call tree. Note that profile will be merged if needed when promoting a context profile tree if any of the node already exists at its promoted destination. **Implementation** Implementation-wise, `SampleContext` is created as abstraction for context. Currently it's a string for call path, and we can later optimize it to something more efficient, e.g. context id. Each `SampleContext` also has a `ContextState` indicating whether it's raw context profile from input, whether it's inlined or merged, whether it's synthetic profile created by compiler. Each `FunctionSamples` now has a `SampleContext` that tells whether it's base profile or context profile, and for context profile what is the context and state. On top of the above context representation, a custom trie tree is implemented to track and manager context profiles. Specifically, `SampleContextTracker` is implemented that encapsulates a trie tree with `ContextTireNode` as node. Each node of the trie tree represents a frame in calling context, thus the path from root to a node represents a valid calling context. We also track `FunctionSamples` for each node, so this trie tree can serve efficient query for context profile. Accordingly, context profile tree promotion now becomes moving a subtree to be under the root of entire tree, and merge nodes for subtree if this move encounters existing nodes. **Integration** `SampleContextTracker` is now also integrated with AutoFDO, `SampleProfileReader` and `SampleProfileLoader`. When we detected input profile contains context-sensitive profile, `SampleContextTracker` will be used to track profiles, and all profile query will go to `SampleContextTracker` instead of `SampleProfileReader` automatically. Tracking APIs are called automatically for each inline decision from `SampleProfileLoader`. Differential Revision: https://reviews.llvm.org/D90125
2020-03-24 14:50:41 +08:00
#include "llvm/Transforms/IPO/SampleContextTracker.h"
[CSSPGO] Consume pseudo-probe-based AutoFDO profile This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`. Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites). **Adjustment to profile format ** A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like ``` !CFGChecksum: 12345 ``` is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums. Differential Revision: https://reviews.llvm.org/D92347
2020-12-17 04:54:50 +08:00
#include "llvm/Transforms/IPO/SampleProfileProbe.h"
#include "llvm/Transforms/Instrumentation.h"
#include "llvm/Transforms/Utils/CallPromotionUtils.h"
#include "llvm/Transforms/Utils/Cloning.h"
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <limits>
#include <map>
#include <memory>
#include <queue>
#include <string>
#include <system_error>
#include <utility>
#include <vector>
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
using namespace llvm;
using namespace sampleprof;
using ProfileCount = Function::ProfileCount;
#define DEBUG_TYPE "sample-profile"
#define CSINLINE_DEBUG DEBUG_TYPE "-inline"
STATISTIC(NumCSInlined,
"Number of functions inlined with context sensitive profile");
STATISTIC(NumCSNotInlined,
"Number of functions not inlined with context sensitive profile");
[CSSPGO] Consume pseudo-probe-based AutoFDO profile This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`. Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites). **Adjustment to profile format ** A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like ``` !CFGChecksum: 12345 ``` is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums. Differential Revision: https://reviews.llvm.org/D92347
2020-12-17 04:54:50 +08:00
STATISTIC(NumMismatchedProfile,
"Number of functions with CFG mismatched profile");
STATISTIC(NumMatchedProfile, "Number of functions with CFG matched profile");
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
STATISTIC(NumCSInlinedHitMinLimit,
"Number of functions with FDO inline stopped due to min size limit");
STATISTIC(NumCSInlinedHitMaxLimit,
"Number of functions with FDO inline stopped due to max size limit");
STATISTIC(
NumCSInlinedHitGrowthLimit,
"Number of functions with FDO inline stopped due to growth size limit");
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
// Command line option to specify the file to read samples from. This is
// mainly used for debugging.
static cl::opt<std::string> SampleProfileFile(
"sample-profile-file", cl::init(""), cl::value_desc("filename"),
cl::desc("Profile file loaded by -sample-profile"), cl::Hidden);
// The named file contains a set of transformations that may have been applied
// to the symbol names between the program from which the sample data was
// collected and the current program's symbols.
static cl::opt<std::string> SampleProfileRemappingFile(
"sample-profile-remapping-file", cl::init(""), cl::value_desc("filename"),
cl::desc("Profile remapping file loaded by -sample-profile"), cl::Hidden);
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
static cl::opt<unsigned> SampleProfileMaxPropagateIterations(
"sample-profile-max-propagate-iterations", cl::init(100),
cl::desc("Maximum number of iterations to go through when propagating "
"sample block/edge weights through the CFG."));
static cl::opt<unsigned> SampleProfileRecordCoverage(
"sample-profile-check-record-coverage", cl::init(0), cl::value_desc("N"),
cl::desc("Emit a warning if less than N% of records in the input profile "
"are matched to the IR."));
static cl::opt<unsigned> SampleProfileSampleCoverage(
"sample-profile-check-sample-coverage", cl::init(0), cl::value_desc("N"),
cl::desc("Emit a warning if less than N% of samples in the input profile "
"are matched to the IR."));
static cl::opt<bool> NoWarnSampleUnused(
"no-warn-sample-unused", cl::init(false), cl::Hidden,
cl::desc("Use this option to turn off/on warnings about function with "
"samples but without debug information to use those samples. "));
static cl::opt<bool> ProfileSampleAccurate(
"profile-sample-accurate", cl::Hidden, cl::init(false),
cl::desc("If the sample profile is accurate, we will mark all un-sampled "
"callsite and function as having 0 samples. Otherwise, treat "
"un-sampled callsites and functions conservatively as unknown. "));
static cl::opt<bool> ProfileAccurateForSymsInList(
"profile-accurate-for-symsinlist", cl::Hidden, cl::ZeroOrMore,
cl::init(true),
cl::desc("For symbols in profile symbol list, regard their profiles to "
"be accurate. It may be overriden by profile-sample-accurate. "));
static cl::opt<bool> ProfileMergeInlinee(
"sample-profile-merge-inlinee", cl::Hidden, cl::init(true),
cl::desc("Merge past inlinee's profile to outline version if sample "
"profile loader decided not to inline a call site. It will "
"only be enabled when top-down order of profile loading is "
"enabled. "));
static cl::opt<bool> ProfileTopDownLoad(
"sample-profile-top-down-load", cl::Hidden, cl::init(true),
cl::desc("Do profile annotation and inlining for functions in top-down "
"order of call graph during sample profile loading. It only "
"works for new pass manager. "));
static cl::opt<bool> ProfileSizeInline(
"sample-profile-inline-size", cl::Hidden, cl::init(false),
cl::desc("Inline cold call sites in profile loader if it's beneficial "
"for code size."));
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
static cl::opt<int> ProfileInlineGrowthLimit(
"sample-profile-inline-growth-limit", cl::Hidden, cl::init(12),
cl::desc("The size growth ratio limit for proirity-based sample profile "
"loader inlining."));
static cl::opt<int> ProfileInlineLimitMin(
"sample-profile-inline-limit-min", cl::Hidden, cl::init(100),
cl::desc("The lower bound of size growth limit for "
"proirity-based sample profile loader inlining."));
static cl::opt<int> ProfileInlineLimitMax(
"sample-profile-inline-limit-max", cl::Hidden, cl::init(10000),
cl::desc("The upper bound of size growth limit for "
"proirity-based sample profile loader inlining."));
static cl::opt<int> ProfileICPThreshold(
"sample-profile-icp-threshold", cl::Hidden, cl::init(5),
cl::desc(
"Relative hotness threshold for indirect "
"call promotion in proirity-based sample profile loader inlining."));
static cl::opt<int> SampleHotCallSiteThreshold(
"sample-profile-hot-inline-threshold", cl::Hidden, cl::init(3000),
cl::desc("Hot callsite threshold for proirity-based sample profile loader "
"inlining."));
static cl::opt<bool> CallsitePrioritizedInline(
"sample-profile-prioritized-inline", cl::Hidden, cl::ZeroOrMore,
cl::init(false),
cl::desc("Use call site prioritized inlining for sample profile loader."
"Currently only CSSPGO is supported."));
static cl::opt<int> SampleColdCallSiteThreshold(
"sample-profile-cold-inline-threshold", cl::Hidden, cl::init(45),
cl::desc("Threshold for inlining cold callsites"));
static cl::opt<std::string> ProfileInlineReplayFile(
"sample-profile-inline-replay", cl::init(""), cl::value_desc("filename"),
cl::desc(
"Optimization remarks file containing inline remarks to be replayed "
"by inlining from sample profile loader."),
cl::Hidden);
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
namespace {
using BlockWeightMap = DenseMap<const BasicBlock *, uint64_t>;
using EquivalenceClassMap = DenseMap<const BasicBlock *, const BasicBlock *>;
using Edge = std::pair<const BasicBlock *, const BasicBlock *>;
using EdgeWeightMap = DenseMap<Edge, uint64_t>;
using BlockEdgeMap =
DenseMap<const BasicBlock *, SmallVector<const BasicBlock *, 8>>;
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
class SampleProfileLoader;
class SampleCoverageTracker {
public:
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
SampleCoverageTracker(SampleProfileLoader &SPL) : SPLoader(SPL){};
bool markSamplesUsed(const FunctionSamples *FS, uint32_t LineOffset,
uint32_t Discriminator, uint64_t Samples);
unsigned computeCoverage(unsigned Used, unsigned Total) const;
unsigned countUsedRecords(const FunctionSamples *FS,
ProfileSummaryInfo *PSI) const;
unsigned countBodyRecords(const FunctionSamples *FS,
ProfileSummaryInfo *PSI) const;
uint64_t getTotalUsedSamples() const { return TotalUsedSamples; }
uint64_t countBodySamples(const FunctionSamples *FS,
ProfileSummaryInfo *PSI) const;
void clear() {
SampleCoverage.clear();
TotalUsedSamples = 0;
}
private:
using BodySampleCoverageMap = std::map<LineLocation, unsigned>;
using FunctionSamplesCoverageMap =
DenseMap<const FunctionSamples *, BodySampleCoverageMap>;
/// Coverage map for sampling records.
///
/// This map keeps a record of sampling records that have been matched to
/// an IR instruction. This is used to detect some form of staleness in
/// profiles (see flag -sample-profile-check-coverage).
///
/// Each entry in the map corresponds to a FunctionSamples instance. This is
/// another map that counts how many times the sample record at the
/// given location has been used.
FunctionSamplesCoverageMap SampleCoverage;
/// Number of samples used from the profile.
///
/// When a sampling record is used for the first time, the samples from
/// that record are added to this accumulator. Coverage is later computed
/// based on the total number of samples available in this function and
/// its callsites.
///
/// Note that this accumulator tracks samples used from a single function
/// and all the inlined callsites. Strictly, we should have a map of counters
/// keyed by FunctionSamples pointers, but these stats are cleared after
/// every function, so we just need to keep a single counter.
uint64_t TotalUsedSamples = 0;
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
SampleProfileLoader &SPLoader;
};
class GUIDToFuncNameMapper {
public:
GUIDToFuncNameMapper(Module &M, SampleProfileReader &Reader,
DenseMap<uint64_t, StringRef> &GUIDToFuncNameMap)
: CurrentReader(Reader), CurrentModule(M),
CurrentGUIDToFuncNameMap(GUIDToFuncNameMap) {
if (!CurrentReader.useMD5())
return;
for (const auto &F : CurrentModule) {
StringRef OrigName = F.getName();
CurrentGUIDToFuncNameMap.insert(
{Function::getGUID(OrigName), OrigName});
// Local to global var promotion used by optimization like thinlto
// will rename the var and add suffix like ".llvm.xxx" to the
// original local name. In sample profile, the suffixes of function
// names are all stripped. Since it is possible that the mapper is
// built in post-thin-link phase and var promotion has been done,
// we need to add the substring of function name without the suffix
// into the GUIDToFuncNameMap.
StringRef CanonName = FunctionSamples::getCanonicalFnName(F);
if (CanonName != OrigName)
CurrentGUIDToFuncNameMap.insert(
{Function::getGUID(CanonName), CanonName});
}
// Update GUIDToFuncNameMap for each function including inlinees.
SetGUIDToFuncNameMapForAll(&CurrentGUIDToFuncNameMap);
}
~GUIDToFuncNameMapper() {
if (!CurrentReader.useMD5())
return;
CurrentGUIDToFuncNameMap.clear();
// Reset GUIDToFuncNameMap for of each function as they're no
// longer valid at this point.
SetGUIDToFuncNameMapForAll(nullptr);
}
private:
void SetGUIDToFuncNameMapForAll(DenseMap<uint64_t, StringRef> *Map) {
std::queue<FunctionSamples *> FSToUpdate;
for (auto &IFS : CurrentReader.getProfiles()) {
FSToUpdate.push(&IFS.second);
}
while (!FSToUpdate.empty()) {
FunctionSamples *FS = FSToUpdate.front();
FSToUpdate.pop();
FS->GUIDToFuncNameMap = Map;
for (const auto &ICS : FS->getCallsiteSamples()) {
const FunctionSamplesMap &FSMap = ICS.second;
for (auto &IFS : FSMap) {
FunctionSamples &FS = const_cast<FunctionSamples &>(IFS.second);
FSToUpdate.push(&FS);
}
}
}
}
SampleProfileReader &CurrentReader;
Module &CurrentModule;
DenseMap<uint64_t, StringRef> &CurrentGUIDToFuncNameMap;
};
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
// Inline candidate used by iterative callsite prioritized inliner
struct InlineCandidate {
CallBase *CallInstr;
const FunctionSamples *CalleeSamples;
uint64_t CallsiteCount;
};
// Inline candidate comparer using call site weight
struct CandidateComparer {
bool operator()(const InlineCandidate &LHS, const InlineCandidate &RHS) {
if (LHS.CallsiteCount != RHS.CallsiteCount)
return LHS.CallsiteCount < RHS.CallsiteCount;
// Tie breaker using GUID so we have stable/deterministic inlining order
assert(LHS.CalleeSamples && RHS.CalleeSamples &&
"Expect non-null FunctionSamples");
return LHS.CalleeSamples->getGUID(LHS.CalleeSamples->getName()) <
RHS.CalleeSamples->getGUID(RHS.CalleeSamples->getName());
}
};
using CandidateQueue =
PriorityQueue<InlineCandidate, std::vector<InlineCandidate>,
CandidateComparer>;
/// Sample profile pass.
///
/// This pass reads profile data from the file specified by
/// -sample-profile-file and annotates every affected function with the
/// profile information found in that file.
class SampleProfileLoader {
public:
SampleProfileLoader(
StringRef Name, StringRef RemapName, ThinOrFullLTOPhase LTOPhase,
std::function<AssumptionCache &(Function &)> GetAssumptionCache,
std::function<TargetTransformInfo &(Function &)> GetTargetTransformInfo,
std::function<const TargetLibraryInfo &(Function &)> GetTLI)
: GetAC(std::move(GetAssumptionCache)),
GetTTI(std::move(GetTargetTransformInfo)), GetTLI(std::move(GetTLI)),
CoverageTracker(*this), Filename(std::string(Name)),
RemappingFilename(std::string(RemapName)), LTOPhase(LTOPhase) {}
bool doInitialization(Module &M, FunctionAnalysisManager *FAM = nullptr);
bool runOnModule(Module &M, ModuleAnalysisManager *AM,
ProfileSummaryInfo *_PSI, CallGraph *CG);
void dump() { Reader->dump(); }
protected:
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
friend class SampleCoverageTracker;
bool runOnFunction(Function &F, ModuleAnalysisManager *AM);
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
unsigned getFunctionLoc(Function &F);
bool emitAnnotations(Function &F);
ErrorOr<uint64_t> getInstWeight(const Instruction &I);
[CSSPGO] Consume pseudo-probe-based AutoFDO profile This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`. Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites). **Adjustment to profile format ** A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like ``` !CFGChecksum: 12345 ``` is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums. Differential Revision: https://reviews.llvm.org/D92347
2020-12-17 04:54:50 +08:00
ErrorOr<uint64_t> getProbeWeight(const Instruction &I);
ErrorOr<uint64_t> getBlockWeight(const BasicBlock *BB);
const FunctionSamples *findCalleeFunctionSamples(const CallBase &I) const;
std::vector<const FunctionSamples *>
findIndirectCallFunctionSamples(const Instruction &I, uint64_t &Sum) const;
mutable DenseMap<const DILocation *, const FunctionSamples *> DILocation2SampleMap;
const FunctionSamples *findFunctionSamples(const Instruction &I) const;
CallBase *tryPromoteIndirectCall(Function &F, StringRef CalleeName,
uint64_t &Sum, uint64_t Count, CallBase *I,
const char *&Reason);
bool inlineCallInstruction(CallBase &CB,
const FunctionSamples *CalleeSamples);
bool inlineHotFunctions(Function &F,
DenseSet<GlobalValue::GUID> &InlinedGUIDs);
// Helper functions call-site prioritized BFS inliner
// Will change the main FDO inliner to be work list based directly in
// upstream, then merge this change with that and remove the duplication.
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
InlineCost shouldInlineCandidate(InlineCandidate &Candidate);
bool getInlineCandidate(InlineCandidate *NewCandidate, CallBase *CB);
bool tryInlineCandidate(InlineCandidate &Candidate,
SmallVector<CallBase *, 8> &InlinedCallSites);
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
bool
inlineHotFunctionsWithPriority(Function &F,
DenseSet<GlobalValue::GUID> &InlinedGUIDs);
// Inline cold/small functions in addition to hot ones
bool shouldInlineColdCallee(CallBase &CallInst);
void emitOptimizationRemarksForInlineCandidates(
const SmallVectorImpl<CallBase *> &Candidates, const Function &F,
bool Hot);
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
void printEdgeWeight(raw_ostream &OS, Edge E);
void printBlockWeight(raw_ostream &OS, const BasicBlock *BB) const;
void printBlockEquivalence(raw_ostream &OS, const BasicBlock *BB);
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
bool computeBlockWeights(Function &F);
void findEquivalenceClasses(Function &F);
template <bool IsPostDom>
void findEquivalencesFor(BasicBlock *BB1, ArrayRef<BasicBlock *> Descendants,
DominatorTreeBase<BasicBlock, IsPostDom> *DomTree);
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
void propagateWeights(Function &F);
uint64_t visitEdge(Edge E, unsigned *NumUnknownEdges, Edge *UnknownEdge);
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
void buildEdges(Function &F);
std::vector<Function *> buildFunctionOrder(Module &M, CallGraph *CG);
bool propagateThroughEdges(Function &F, bool UpdateBlockCount);
void computeDominanceAndLoopInfo(Function &F);
void clearFunctionData();
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
bool callsiteIsHot(const FunctionSamples *CallsiteFS,
ProfileSummaryInfo *PSI);
/// Map basic blocks to their computed weights.
///
/// The weight of a basic block is defined to be the maximum
/// of all the instruction weights in that block.
BlockWeightMap BlockWeights;
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
/// Map edges to their computed weights.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// Edge weights are computed by propagating basic block weights in
/// SampleProfile::propagateWeights.
EdgeWeightMap EdgeWeights;
/// Set of visited blocks during propagation.
SmallPtrSet<const BasicBlock *, 32> VisitedBlocks;
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
/// Set of visited edges during propagation.
SmallSet<Edge, 32> VisitedEdges;
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
/// Equivalence classes for block weights.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// Two blocks BB1 and BB2 are in the same equivalence class if they
/// dominate and post-dominate each other, and they are in the same loop
/// nest. When this happens, the two blocks are guaranteed to execute
/// the same number of times.
EquivalenceClassMap EquivalenceClass;
/// Map from function name to Function *. Used to find the function from
/// the function name. If the function name contains suffix, additional
/// entry is added to map from the stripped name to the function if there
/// is one-to-one mapping.
StringMap<Function *> SymbolMap;
/// Dominance, post-dominance and loop information.
std::unique_ptr<DominatorTree> DT;
std::unique_ptr<PostDominatorTree> PDT;
std::unique_ptr<LoopInfo> LI;
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
std::function<AssumptionCache &(Function &)> GetAC;
std::function<TargetTransformInfo &(Function &)> GetTTI;
std::function<const TargetLibraryInfo &(Function &)> GetTLI;
/// Predecessors for each basic block in the CFG.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
BlockEdgeMap Predecessors;
/// Successors for each basic block in the CFG.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
BlockEdgeMap Successors;
SampleCoverageTracker CoverageTracker;
/// Profile reader object.
std::unique_ptr<SampleProfileReader> Reader;
[CSSPGO] Infrastructure for context-sensitive Sample PGO and Inlining This change adds the context-senstive sample PGO infracture described in CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduced an abstraction between input profile and profile loader that queries input profile for functions. Specifically, there's now the notion of base profile and context profile, and they are managed by the new SampleContextTracker for adjusting and merging profiles based on inline decisions. It works with top-down profiled guided inliner in profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to CGSCC inliner in order for it to take advantage of context-sensitive profile. This change is the consumption part of context-sensitive profile (The generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Pacthes for integration with Pseudo-probe coming up soon. Currently the new infrastructure kick in when input profile contains the new context-sensitive profile; otherwise it's no-op and does not affect existing AutoFDO. **Interface** There're two sets of interfaces for query and tracking respectively exposed from SampleContextTracker. For query, now instead of simply getting a profile from input for a function, we can explicitly query base profile or context profile for given call path of a function. For tracking, there're separate APIs for marking context profile as inlined, or promoting and merging not inlined context profile. - Query base profile (`getBaseSamplesFor`) Base profile is the merged synthetic profile for function's CFG profile from any outstanding (not inlined) context. We can query base profile by function. - Query context profile (`getContextSamplesFor`) Context profile is a function's CFG profile for a given calling context. We can query context profile by context string. - Track inlined context profile (`markContextSamplesInlined`) When a function is inlined for given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing base profile for that inlined function. - Track not-inlined context profile (`promoteMergeContextSamplesTree`) When a function is not inlined for given calling context, we need to promote the context profile tree so the not inlined context becomes top-level context. This preserve the sub-context under that function so later inline decision for that not inlined function will still have context profile for its call tree. Note that profile will be merged if needed when promoting a context profile tree if any of the node already exists at its promoted destination. **Implementation** Implementation-wise, `SampleContext` is created as abstraction for context. Currently it's a string for call path, and we can later optimize it to something more efficient, e.g. context id. Each `SampleContext` also has a `ContextState` indicating whether it's raw context profile from input, whether it's inlined or merged, whether it's synthetic profile created by compiler. Each `FunctionSamples` now has a `SampleContext` that tells whether it's base profile or context profile, and for context profile what is the context and state. On top of the above context representation, a custom trie tree is implemented to track and manager context profiles. Specifically, `SampleContextTracker` is implemented that encapsulates a trie tree with `ContextTireNode` as node. Each node of the trie tree represents a frame in calling context, thus the path from root to a node represents a valid calling context. We also track `FunctionSamples` for each node, so this trie tree can serve efficient query for context profile. Accordingly, context profile tree promotion now becomes moving a subtree to be under the root of entire tree, and merge nodes for subtree if this move encounters existing nodes. **Integration** `SampleContextTracker` is now also integrated with AutoFDO, `SampleProfileReader` and `SampleProfileLoader`. When we detected input profile contains context-sensitive profile, `SampleContextTracker` will be used to track profiles, and all profile query will go to `SampleContextTracker` instead of `SampleProfileReader` automatically. Tracking APIs are called automatically for each inline decision from `SampleProfileLoader`. Differential Revision: https://reviews.llvm.org/D90125
2020-03-24 14:50:41 +08:00
/// Profile tracker for different context.
std::unique_ptr<SampleContextTracker> ContextTracker;
/// Samples collected for the body of this function.
FunctionSamples *Samples = nullptr;
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
/// Name of the profile file to load.
std::string Filename;
/// Name of the profile remapping file to load.
std::string RemappingFilename;
/// Flag indicating whether the profile input loaded successfully.
bool ProfileIsValid = false;
[CSSPGO] Infrastructure for context-sensitive Sample PGO and Inlining This change adds the context-senstive sample PGO infracture described in CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduced an abstraction between input profile and profile loader that queries input profile for functions. Specifically, there's now the notion of base profile and context profile, and they are managed by the new SampleContextTracker for adjusting and merging profiles based on inline decisions. It works with top-down profiled guided inliner in profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to CGSCC inliner in order for it to take advantage of context-sensitive profile. This change is the consumption part of context-sensitive profile (The generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Pacthes for integration with Pseudo-probe coming up soon. Currently the new infrastructure kick in when input profile contains the new context-sensitive profile; otherwise it's no-op and does not affect existing AutoFDO. **Interface** There're two sets of interfaces for query and tracking respectively exposed from SampleContextTracker. For query, now instead of simply getting a profile from input for a function, we can explicitly query base profile or context profile for given call path of a function. For tracking, there're separate APIs for marking context profile as inlined, or promoting and merging not inlined context profile. - Query base profile (`getBaseSamplesFor`) Base profile is the merged synthetic profile for function's CFG profile from any outstanding (not inlined) context. We can query base profile by function. - Query context profile (`getContextSamplesFor`) Context profile is a function's CFG profile for a given calling context. We can query context profile by context string. - Track inlined context profile (`markContextSamplesInlined`) When a function is inlined for given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing base profile for that inlined function. - Track not-inlined context profile (`promoteMergeContextSamplesTree`) When a function is not inlined for given calling context, we need to promote the context profile tree so the not inlined context becomes top-level context. This preserve the sub-context under that function so later inline decision for that not inlined function will still have context profile for its call tree. Note that profile will be merged if needed when promoting a context profile tree if any of the node already exists at its promoted destination. **Implementation** Implementation-wise, `SampleContext` is created as abstraction for context. Currently it's a string for call path, and we can later optimize it to something more efficient, e.g. context id. Each `SampleContext` also has a `ContextState` indicating whether it's raw context profile from input, whether it's inlined or merged, whether it's synthetic profile created by compiler. Each `FunctionSamples` now has a `SampleContext` that tells whether it's base profile or context profile, and for context profile what is the context and state. On top of the above context representation, a custom trie tree is implemented to track and manager context profiles. Specifically, `SampleContextTracker` is implemented that encapsulates a trie tree with `ContextTireNode` as node. Each node of the trie tree represents a frame in calling context, thus the path from root to a node represents a valid calling context. We also track `FunctionSamples` for each node, so this trie tree can serve efficient query for context profile. Accordingly, context profile tree promotion now becomes moving a subtree to be under the root of entire tree, and merge nodes for subtree if this move encounters existing nodes. **Integration** `SampleContextTracker` is now also integrated with AutoFDO, `SampleProfileReader` and `SampleProfileLoader`. When we detected input profile contains context-sensitive profile, `SampleContextTracker` will be used to track profiles, and all profile query will go to `SampleContextTracker` instead of `SampleProfileReader` automatically. Tracking APIs are called automatically for each inline decision from `SampleProfileLoader`. Differential Revision: https://reviews.llvm.org/D90125
2020-03-24 14:50:41 +08:00
/// Flag indicating whether input profile is context-sensitive
bool ProfileIsCS = false;
/// Flag indicating which LTO/ThinLTO phase the pass is invoked in.
///
/// We need to know the LTO phase because for example in ThinLTOPrelink
/// phase, in annotation, we should not promote indirect calls. Instead,
/// we will mark GUIDs that needs to be annotated to the function.
ThinOrFullLTOPhase LTOPhase;
/// Profile Summary Info computed from sample profile.
ProfileSummaryInfo *PSI = nullptr;
/// Profle Symbol list tells whether a function name appears in the binary
/// used to generate the current profile.
std::unique_ptr<ProfileSymbolList> PSL;
/// Total number of samples collected in this profile.
///
/// This is the sum of all the samples collected in all the functions executed
/// at runtime.
uint64_t TotalCollectedSamples = 0;
/// Optimization Remark Emitter used to emit diagnostic remarks.
OptimizationRemarkEmitter *ORE = nullptr;
// Information recorded when we declined to inline a call site
// because we have determined it is too cold is accumulated for
// each callee function. Initially this is just the entry count.
struct NotInlinedProfileInfo {
uint64_t entryCount;
};
DenseMap<Function *, NotInlinedProfileInfo> notInlinedCallInfo;
// GUIDToFuncNameMap saves the mapping from GUID to the symbol name, for
// all the function symbols defined or declared in current module.
DenseMap<uint64_t, StringRef> GUIDToFuncNameMap;
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
// All the Names used in FunctionSamples including outline function
// names, inline instance names and call target names.
StringSet<> NamesInProfile;
// For symbol in profile symbol list, whether to regard their profiles
// to be accurate. It is mainly decided by existance of profile symbol
// list and -profile-accurate-for-symsinlist flag, but it can be
// overriden by -profile-sample-accurate or profile-sample-accurate
// attribute.
bool ProfAccForSymsInList;
// External inline advisor used to replay inline decision from remarks.
std::unique_ptr<ReplayInlineAdvisor> ExternalInlineAdvisor;
[CSSPGO] Consume pseudo-probe-based AutoFDO profile This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`. Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites). **Adjustment to profile format ** A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like ``` !CFGChecksum: 12345 ``` is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums. Differential Revision: https://reviews.llvm.org/D92347
2020-12-17 04:54:50 +08:00
// A pseudo probe helper to correlate the imported sample counts.
std::unique_ptr<PseudoProbeManager> ProbeManager;
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
};
class SampleProfileLoaderLegacyPass : public ModulePass {
public:
// Class identification, replacement for typeinfo
static char ID;
SampleProfileLoaderLegacyPass(
StringRef Name = SampleProfileFile,
ThinOrFullLTOPhase LTOPhase = ThinOrFullLTOPhase::None)
: ModulePass(ID), SampleLoader(
Name, SampleProfileRemappingFile, LTOPhase,
[&](Function &F) -> AssumptionCache & {
return ACT->getAssumptionCache(F);
},
[&](Function &F) -> TargetTransformInfo & {
return TTIWP->getTTI(F);
},
[&](Function &F) -> TargetLibraryInfo & {
return TLIWP->getTLI(F);
}) {
initializeSampleProfileLoaderLegacyPassPass(
*PassRegistry::getPassRegistry());
}
void dump() { SampleLoader.dump(); }
bool doInitialization(Module &M) override {
return SampleLoader.doInitialization(M);
}
StringRef getPassName() const override { return "Sample profile pass"; }
bool runOnModule(Module &M) override;
void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<AssumptionCacheTracker>();
AU.addRequired<TargetTransformInfoWrapperPass>();
AU.addRequired<TargetLibraryInfoWrapperPass>();
AU.addRequired<ProfileSummaryInfoWrapperPass>();
}
private:
SampleProfileLoader SampleLoader;
AssumptionCacheTracker *ACT = nullptr;
TargetTransformInfoWrapperPass *TTIWP = nullptr;
TargetLibraryInfoWrapperPass *TLIWP = nullptr;
};
} // end anonymous namespace
/// Return true if the given callsite is hot wrt to hot cutoff threshold.
///
/// Functions that were inlined in the original binary will be represented
/// in the inline stack in the sample profile. If the profile shows that
/// the original inline decision was "good" (i.e., the callsite is executed
/// frequently), then we will recreate the inline decision and apply the
/// profile from the inlined callsite.
///
/// To decide whether an inlined callsite is hot, we compare the callsite
/// sample count with the hot cutoff computed by ProfileSummaryInfo, it is
/// regarded as hot if the count is above the cutoff value.
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
///
/// When ProfileAccurateForSymsInList is enabled and profile symbol list
/// is present, functions in the profile symbol list but without profile will
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
/// be regarded as cold and much less inlining will happen in CGSCC inlining
/// pass, so we tend to lower the hot criteria here to allow more early
/// inlining to happen for warm callsites and it is helpful for performance.
bool SampleProfileLoader::callsiteIsHot(const FunctionSamples *CallsiteFS,
ProfileSummaryInfo *PSI) {
if (!CallsiteFS)
return false; // The callsite was not inlined in the original binary.
assert(PSI && "PSI is expected to be non null");
uint64_t CallsiteTotalSamples = CallsiteFS->getTotalSamples();
if (ProfAccForSymsInList)
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
return !PSI->isColdCount(CallsiteTotalSamples);
else
return PSI->isHotCount(CallsiteTotalSamples);
}
/// Mark as used the sample record for the given function samples at
/// (LineOffset, Discriminator).
///
/// \returns true if this is the first time we mark the given record.
bool SampleCoverageTracker::markSamplesUsed(const FunctionSamples *FS,
uint32_t LineOffset,
uint32_t Discriminator,
uint64_t Samples) {
LineLocation Loc(LineOffset, Discriminator);
unsigned &Count = SampleCoverage[FS][Loc];
bool FirstTime = (++Count == 1);
if (FirstTime)
TotalUsedSamples += Samples;
return FirstTime;
}
/// Return the number of sample records that were applied from this profile.
///
/// This count does not include records from cold inlined callsites.
unsigned
SampleCoverageTracker::countUsedRecords(const FunctionSamples *FS,
ProfileSummaryInfo *PSI) const {
auto I = SampleCoverage.find(FS);
// The size of the coverage map for FS represents the number of records
// that were marked used at least once.
unsigned Count = (I != SampleCoverage.end()) ? I->second.size() : 0;
// If there are inlined callsites in this function, count the samples found
// in the respective bodies. However, do not bother counting callees with 0
// total samples, these are callees that were never invoked at runtime.
for (const auto &I : FS->getCallsiteSamples())
for (const auto &J : I.second) {
const FunctionSamples *CalleeSamples = &J.second;
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
if (SPLoader.callsiteIsHot(CalleeSamples, PSI))
Count += countUsedRecords(CalleeSamples, PSI);
}
return Count;
}
/// Return the number of sample records in the body of this profile.
///
/// This count does not include records from cold inlined callsites.
unsigned
SampleCoverageTracker::countBodyRecords(const FunctionSamples *FS,
ProfileSummaryInfo *PSI) const {
unsigned Count = FS->getBodySamples().size();
// Only count records in hot callsites.
for (const auto &I : FS->getCallsiteSamples())
for (const auto &J : I.second) {
const FunctionSamples *CalleeSamples = &J.second;
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
if (SPLoader.callsiteIsHot(CalleeSamples, PSI))
Count += countBodyRecords(CalleeSamples, PSI);
}
return Count;
}
/// Return the number of samples collected in the body of this profile.
///
/// This count does not include samples from cold inlined callsites.
uint64_t
SampleCoverageTracker::countBodySamples(const FunctionSamples *FS,
ProfileSummaryInfo *PSI) const {
uint64_t Total = 0;
for (const auto &I : FS->getBodySamples())
Total += I.second.getSamples();
// Only count samples in hot callsites.
for (const auto &I : FS->getCallsiteSamples())
for (const auto &J : I.second) {
const FunctionSamples *CalleeSamples = &J.second;
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
if (SPLoader.callsiteIsHot(CalleeSamples, PSI))
Total += countBodySamples(CalleeSamples, PSI);
}
return Total;
}
/// Return the fraction of sample records used in this profile.
///
/// The returned value is an unsigned integer in the range 0-100 indicating
/// the percentage of sample records that were used while applying this
/// profile to the associated function.
unsigned SampleCoverageTracker::computeCoverage(unsigned Used,
unsigned Total) const {
assert(Used <= Total &&
"number of used records cannot exceed the total number of records");
return Total > 0 ? Used * 100 / Total : 100;
}
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
/// Clear all the per-function data used to load samples and propagate weights.
void SampleProfileLoader::clearFunctionData() {
BlockWeights.clear();
EdgeWeights.clear();
VisitedBlocks.clear();
VisitedEdges.clear();
EquivalenceClass.clear();
DT = nullptr;
PDT = nullptr;
LI = nullptr;
Predecessors.clear();
Successors.clear();
CoverageTracker.clear();
}
#ifndef NDEBUG
/// Print the weight of edge \p E on stream \p OS.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// \param OS Stream to emit the output to.
/// \param E Edge to print.
void SampleProfileLoader::printEdgeWeight(raw_ostream &OS, Edge E) {
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
OS << "weight[" << E.first->getName() << "->" << E.second->getName()
<< "]: " << EdgeWeights[E] << "\n";
}
/// Print the equivalence class of block \p BB on stream \p OS.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// \param OS Stream to emit the output to.
/// \param BB Block to print.
void SampleProfileLoader::printBlockEquivalence(raw_ostream &OS,
const BasicBlock *BB) {
const BasicBlock *Equiv = EquivalenceClass[BB];
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
OS << "equivalence[" << BB->getName()
<< "]: " << ((Equiv) ? EquivalenceClass[BB]->getName() : "NONE") << "\n";
}
/// Print the weight of block \p BB on stream \p OS.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// \param OS Stream to emit the output to.
/// \param BB Block to print.
void SampleProfileLoader::printBlockWeight(raw_ostream &OS,
const BasicBlock *BB) const {
const auto &I = BlockWeights.find(BB);
uint64_t W = (I == BlockWeights.end() ? 0 : I->second);
OS << "weight[" << BB->getName() << "]: " << W << "\n";
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
#endif
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
/// Get the weight for an instruction.
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
///
/// The "weight" of an instruction \p Inst is the number of samples
/// collected on that instruction at runtime. To retrieve it, we
/// need to compute the line number of \p Inst relative to the start of its
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
/// function. We use HeaderLineno to compute the offset. We then
/// look up the samples collected for \p Inst using BodySamples.
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
///
/// \param Inst Instruction to query.
///
/// \returns the weight of \p Inst.
ErrorOr<uint64_t> SampleProfileLoader::getInstWeight(const Instruction &Inst) {
[CSSPGO] Consume pseudo-probe-based AutoFDO profile This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`. Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites). **Adjustment to profile format ** A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like ``` !CFGChecksum: 12345 ``` is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums. Differential Revision: https://reviews.llvm.org/D92347
2020-12-17 04:54:50 +08:00
if (FunctionSamples::ProfileIsProbeBased)
return getProbeWeight(Inst);
const DebugLoc &DLoc = Inst.getDebugLoc();
if (!DLoc)
return std::error_code();
const FunctionSamples *FS = findFunctionSamples(Inst);
if (!FS)
return std::error_code();
// Ignore all intrinsics, phinodes and branch instructions.
// Branch and phinodes instruction usually contains debug info from sources outside of
// the residing basic block, thus we ignore them during annotation.
if (isa<BranchInst>(Inst) || isa<IntrinsicInst>(Inst) || isa<PHINode>(Inst))
return std::error_code();
// If a direct call/invoke instruction is inlined in profile
// (findCalleeFunctionSamples returns non-empty result), but not inlined here,
// it means that the inlined callsite has no sample, thus the call
// instruction should have 0 count.
[CSSPGO] Infrastructure for context-sensitive Sample PGO and Inlining This change adds the context-senstive sample PGO infracture described in CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduced an abstraction between input profile and profile loader that queries input profile for functions. Specifically, there's now the notion of base profile and context profile, and they are managed by the new SampleContextTracker for adjusting and merging profiles based on inline decisions. It works with top-down profiled guided inliner in profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to CGSCC inliner in order for it to take advantage of context-sensitive profile. This change is the consumption part of context-sensitive profile (The generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Pacthes for integration with Pseudo-probe coming up soon. Currently the new infrastructure kick in when input profile contains the new context-sensitive profile; otherwise it's no-op and does not affect existing AutoFDO. **Interface** There're two sets of interfaces for query and tracking respectively exposed from SampleContextTracker. For query, now instead of simply getting a profile from input for a function, we can explicitly query base profile or context profile for given call path of a function. For tracking, there're separate APIs for marking context profile as inlined, or promoting and merging not inlined context profile. - Query base profile (`getBaseSamplesFor`) Base profile is the merged synthetic profile for function's CFG profile from any outstanding (not inlined) context. We can query base profile by function. - Query context profile (`getContextSamplesFor`) Context profile is a function's CFG profile for a given calling context. We can query context profile by context string. - Track inlined context profile (`markContextSamplesInlined`) When a function is inlined for given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing base profile for that inlined function. - Track not-inlined context profile (`promoteMergeContextSamplesTree`) When a function is not inlined for given calling context, we need to promote the context profile tree so the not inlined context becomes top-level context. This preserve the sub-context under that function so later inline decision for that not inlined function will still have context profile for its call tree. Note that profile will be merged if needed when promoting a context profile tree if any of the node already exists at its promoted destination. **Implementation** Implementation-wise, `SampleContext` is created as abstraction for context. Currently it's a string for call path, and we can later optimize it to something more efficient, e.g. context id. Each `SampleContext` also has a `ContextState` indicating whether it's raw context profile from input, whether it's inlined or merged, whether it's synthetic profile created by compiler. Each `FunctionSamples` now has a `SampleContext` that tells whether it's base profile or context profile, and for context profile what is the context and state. On top of the above context representation, a custom trie tree is implemented to track and manager context profiles. Specifically, `SampleContextTracker` is implemented that encapsulates a trie tree with `ContextTireNode` as node. Each node of the trie tree represents a frame in calling context, thus the path from root to a node represents a valid calling context. We also track `FunctionSamples` for each node, so this trie tree can serve efficient query for context profile. Accordingly, context profile tree promotion now becomes moving a subtree to be under the root of entire tree, and merge nodes for subtree if this move encounters existing nodes. **Integration** `SampleContextTracker` is now also integrated with AutoFDO, `SampleProfileReader` and `SampleProfileLoader`. When we detected input profile contains context-sensitive profile, `SampleContextTracker` will be used to track profiles, and all profile query will go to `SampleContextTracker` instead of `SampleProfileReader` automatically. Tracking APIs are called automatically for each inline decision from `SampleProfileLoader`. Differential Revision: https://reviews.llvm.org/D90125
2020-03-24 14:50:41 +08:00
if (!ProfileIsCS)
if (const auto *CB = dyn_cast<CallBase>(&Inst))
if (!CB->isIndirectCall() && findCalleeFunctionSamples(*CB))
return 0;
const DILocation *DIL = DLoc;
uint32_t LineOffset = FunctionSamples::getOffset(DIL);
uint32_t Discriminator = DIL->getBaseDiscriminator();
ErrorOr<uint64_t> R = FS->findSamplesAt(LineOffset, Discriminator);
if (R) {
bool FirstMark =
CoverageTracker.markSamplesUsed(FS, LineOffset, Discriminator, R.get());
if (FirstMark) {
ORE->emit([&]() {
OptimizationRemarkAnalysis Remark(DEBUG_TYPE, "AppliedSamples", &Inst);
Remark << "Applied " << ore::NV("NumSamples", *R);
Remark << " samples from profile (offset: ";
Remark << ore::NV("LineOffset", LineOffset);
if (Discriminator) {
Remark << ".";
Remark << ore::NV("Discriminator", Discriminator);
}
Remark << ")";
return Remark;
});
}
LLVM_DEBUG(dbgs() << " " << DLoc.getLine() << "."
<< DIL->getBaseDiscriminator() << ":" << Inst
<< " (line offset: " << LineOffset << "."
<< DIL->getBaseDiscriminator() << " - weight: " << R.get()
<< ")\n");
}
return R;
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
}
[CSSPGO] Consume pseudo-probe-based AutoFDO profile This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`. Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites). **Adjustment to profile format ** A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like ``` !CFGChecksum: 12345 ``` is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums. Differential Revision: https://reviews.llvm.org/D92347
2020-12-17 04:54:50 +08:00
ErrorOr<uint64_t> SampleProfileLoader::getProbeWeight(const Instruction &Inst) {
assert(FunctionSamples::ProfileIsProbeBased &&
"Profile is not pseudo probe based");
Optional<PseudoProbe> Probe = extractProbe(Inst);
if (!Probe)
return std::error_code();
const FunctionSamples *FS = findFunctionSamples(Inst);
if (!FS)
return std::error_code();
// If a direct call/invoke instruction is inlined in profile
// (findCalleeFunctionSamples returns non-empty result), but not inlined here,
// it means that the inlined callsite has no sample, thus the call
// instruction should have 0 count.
if (const auto *CB = dyn_cast<CallBase>(&Inst))
if (!CB->isIndirectCall() && findCalleeFunctionSamples(*CB))
return 0;
const ErrorOr<uint64_t> &R = FS->findSamplesAt(Probe->Id, 0);
if (R) {
uint64_t Samples = R.get();
bool FirstMark = CoverageTracker.markSamplesUsed(FS, Probe->Id, 0, Samples);
if (FirstMark) {
ORE->emit([&]() {
OptimizationRemarkAnalysis Remark(DEBUG_TYPE, "AppliedSamples", &Inst);
Remark << "Applied " << ore::NV("NumSamples", Samples);
Remark << " samples from profile (ProbeId=";
Remark << ore::NV("ProbeId", Probe->Id);
Remark << ")";
return Remark;
});
}
LLVM_DEBUG(dbgs() << " " << Probe->Id << ":" << Inst
<< " - weight: " << R.get() << ")\n");
return Samples;
}
return R;
}
/// Compute the weight of a basic block.
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
///
/// The weight of basic block \p BB is the maximum weight of all the
/// instructions in BB.
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
///
/// \param BB The basic block to query.
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
///
/// \returns the weight for \p BB.
ErrorOr<uint64_t> SampleProfileLoader::getBlockWeight(const BasicBlock *BB) {
uint64_t Max = 0;
bool HasWeight = false;
for (auto &I : BB->getInstList()) {
const ErrorOr<uint64_t> &R = getInstWeight(I);
if (R) {
Max = std::max(Max, R.get());
HasWeight = true;
}
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
}
return HasWeight ? ErrorOr<uint64_t>(Max) : std::error_code();
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
}
/// Compute and store the weights of every basic block.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// This populates the BlockWeights map by computing
/// the weights of every basic block in the CFG.
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
///
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
/// \param F The function to query.
bool SampleProfileLoader::computeBlockWeights(Function &F) {
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
bool Changed = false;
LLVM_DEBUG(dbgs() << "Block weights\n");
for (const auto &BB : F) {
ErrorOr<uint64_t> Weight = getBlockWeight(&BB);
if (Weight) {
BlockWeights[&BB] = Weight.get();
VisitedBlocks.insert(&BB);
Changed = true;
}
LLVM_DEBUG(printBlockWeight(dbgs(), &BB));
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
return Changed;
}
/// Get the FunctionSamples for a call instruction.
///
/// The FunctionSamples of a call/invoke instruction \p Inst is the inlined
/// instance in which that call instruction is calling to. It contains
/// all samples that resides in the inlined instance. We first find the
/// inlined instance in which the call instruction is from, then we
/// traverse its children to find the callsite with the matching
/// location.
///
/// \param Inst Call/Invoke instruction to query.
///
/// \returns The FunctionSamples pointer to the inlined instance.
const FunctionSamples *
SampleProfileLoader::findCalleeFunctionSamples(const CallBase &Inst) const {
const DILocation *DIL = Inst.getDebugLoc();
if (!DIL) {
return nullptr;
}
StringRef CalleeName;
if (Function *Callee = Inst.getCalledFunction())
[CSSPGO] Infrastructure for context-sensitive Sample PGO and Inlining This change adds the context-senstive sample PGO infracture described in CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduced an abstraction between input profile and profile loader that queries input profile for functions. Specifically, there's now the notion of base profile and context profile, and they are managed by the new SampleContextTracker for adjusting and merging profiles based on inline decisions. It works with top-down profiled guided inliner in profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to CGSCC inliner in order for it to take advantage of context-sensitive profile. This change is the consumption part of context-sensitive profile (The generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Pacthes for integration with Pseudo-probe coming up soon. Currently the new infrastructure kick in when input profile contains the new context-sensitive profile; otherwise it's no-op and does not affect existing AutoFDO. **Interface** There're two sets of interfaces for query and tracking respectively exposed from SampleContextTracker. For query, now instead of simply getting a profile from input for a function, we can explicitly query base profile or context profile for given call path of a function. For tracking, there're separate APIs for marking context profile as inlined, or promoting and merging not inlined context profile. - Query base profile (`getBaseSamplesFor`) Base profile is the merged synthetic profile for function's CFG profile from any outstanding (not inlined) context. We can query base profile by function. - Query context profile (`getContextSamplesFor`) Context profile is a function's CFG profile for a given calling context. We can query context profile by context string. - Track inlined context profile (`markContextSamplesInlined`) When a function is inlined for given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing base profile for that inlined function. - Track not-inlined context profile (`promoteMergeContextSamplesTree`) When a function is not inlined for given calling context, we need to promote the context profile tree so the not inlined context becomes top-level context. This preserve the sub-context under that function so later inline decision for that not inlined function will still have context profile for its call tree. Note that profile will be merged if needed when promoting a context profile tree if any of the node already exists at its promoted destination. **Implementation** Implementation-wise, `SampleContext` is created as abstraction for context. Currently it's a string for call path, and we can later optimize it to something more efficient, e.g. context id. Each `SampleContext` also has a `ContextState` indicating whether it's raw context profile from input, whether it's inlined or merged, whether it's synthetic profile created by compiler. Each `FunctionSamples` now has a `SampleContext` that tells whether it's base profile or context profile, and for context profile what is the context and state. On top of the above context representation, a custom trie tree is implemented to track and manager context profiles. Specifically, `SampleContextTracker` is implemented that encapsulates a trie tree with `ContextTireNode` as node. Each node of the trie tree represents a frame in calling context, thus the path from root to a node represents a valid calling context. We also track `FunctionSamples` for each node, so this trie tree can serve efficient query for context profile. Accordingly, context profile tree promotion now becomes moving a subtree to be under the root of entire tree, and merge nodes for subtree if this move encounters existing nodes. **Integration** `SampleContextTracker` is now also integrated with AutoFDO, `SampleProfileReader` and `SampleProfileLoader`. When we detected input profile contains context-sensitive profile, `SampleContextTracker` will be used to track profiles, and all profile query will go to `SampleContextTracker` instead of `SampleProfileReader` automatically. Tracking APIs are called automatically for each inline decision from `SampleProfileLoader`. Differential Revision: https://reviews.llvm.org/D90125
2020-03-24 14:50:41 +08:00
CalleeName = FunctionSamples::getCanonicalFnName(*Callee);
if (ProfileIsCS)
return ContextTracker->getCalleeContextSamplesFor(Inst, CalleeName);
const FunctionSamples *FS = findFunctionSamples(Inst);
if (FS == nullptr)
return nullptr;
[CSSPGO] Consume pseudo-probe-based AutoFDO profile This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`. Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites). **Adjustment to profile format ** A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like ``` !CFGChecksum: 12345 ``` is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums. Differential Revision: https://reviews.llvm.org/D92347
2020-12-17 04:54:50 +08:00
return FS->findFunctionSamplesAt(FunctionSamples::getCallSiteIdentifier(DIL),
[SampleFDO] Enhance profile remapping support for searching inline instance and indirect call promotion candidate. Profile remapping is a feature to match a function in the module with its profile in sample profile if the function name and the name in profile look different but are equivalent using given remapping rules. This is a useful feature to keep the performance stable by specifying some remapping rules when sampleFDO targets are going through some large scale function signature change. However, currently profile remapping support is only valid for outline function profile in SampleFDO. It cannot match a callee with an inline instance profile if they have different but equivalent names. We found that without the support for inline instance profile, remapping is less effective for some large scale change. To add that support, before any remapping lookup happens, all the names in the profile will be inserted into remapper and the Key to the name mapping will be recorded in a map called NameMap in the remapper. During name lookup, a Key will be returned for the given name and it will be used to extract an equivalent name in the profile from NameMap. So with the help of the NameMap, we can translate any given name to an equivalent name in the profile if it exists. Whenever we try to match a name in the module to a name in the profile, we will try the match with the original name first, and if it doesn't match, we will use the equivalent name got from remapper to try the match for another time. In this way, the patch can enhance the profile remapping support for searching inline instance and searching indirect call promotion candidate. In a planned large scale change of int64 type (long long) to int64_t (long), we found the performance of a google internal benchmark degraded by 2% if nothing was done. If existing profile remapping was enabled, the performance degradation dropped to 1.2%. If the profile remapping with the current patch was enabled, the performance degradation further dropped to 0.14% (Note the experiment was done before searching indirect call promotion candidate was added. We hope with the remapping support of searching indirect call promotion candidate, the degradation can drop to 0% in the end. It will be evaluated post commit). Differential Revision: https://reviews.llvm.org/D86332
2020-08-25 13:59:20 +08:00
CalleeName, Reader->getRemapper());
}
/// Returns a vector of FunctionSamples that are the indirect call targets
/// of \p Inst. The vector is sorted by the total number of samples. Stores
/// the total call count of the indirect call in \p Sum.
std::vector<const FunctionSamples *>
SampleProfileLoader::findIndirectCallFunctionSamples(
const Instruction &Inst, uint64_t &Sum) const {
const DILocation *DIL = Inst.getDebugLoc();
std::vector<const FunctionSamples *> R;
if (!DIL) {
return R;
}
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
auto FSCompare = [](const FunctionSamples *L, const FunctionSamples *R) {
assert(L && R && "Expect non-null FunctionSamples");
if (L->getEntrySamples() != R->getEntrySamples())
return L->getEntrySamples() > R->getEntrySamples();
return FunctionSamples::getGUID(L->getName()) <
FunctionSamples::getGUID(R->getName());
};
if (ProfileIsCS) {
auto CalleeSamples =
ContextTracker->getIndirectCalleeContextSamplesFor(DIL);
if (CalleeSamples.empty())
return R;
// For CSSPGO, we only use target context profile's entry count
// as that already includes both inlined callee and non-inlined ones..
Sum = 0;
for (const auto *const FS : CalleeSamples) {
Sum += FS->getEntrySamples();
R.push_back(FS);
}
llvm::sort(R, FSCompare);
return R;
}
const FunctionSamples *FS = findFunctionSamples(Inst);
if (FS == nullptr)
return R;
[CSSPGO] Consume pseudo-probe-based AutoFDO profile This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`. Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites). **Adjustment to profile format ** A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like ``` !CFGChecksum: 12345 ``` is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums. Differential Revision: https://reviews.llvm.org/D92347
2020-12-17 04:54:50 +08:00
auto CallSite = FunctionSamples::getCallSiteIdentifier(DIL);
auto T = FS->findCallTargetMapAt(CallSite);
Sum = 0;
if (T)
for (const auto &T_C : T.get())
Sum += T_C.second;
[CSSPGO] Consume pseudo-probe-based AutoFDO profile This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`. Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites). **Adjustment to profile format ** A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like ``` !CFGChecksum: 12345 ``` is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums. Differential Revision: https://reviews.llvm.org/D92347
2020-12-17 04:54:50 +08:00
if (const FunctionSamplesMap *M = FS->findFunctionSamplesMapAt(CallSite)) {
if (M->empty())
return R;
for (const auto &NameFS : *M) {
Sum += NameFS.second.getEntrySamples();
R.push_back(&NameFS.second);
}
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
llvm::sort(R, FSCompare);
}
return R;
}
/// Get the FunctionSamples for an instruction.
///
/// The FunctionSamples of an instruction \p Inst is the inlined instance
/// in which that instruction is coming from. We traverse the inline stack
/// of that instruction, and match it with the tree nodes in the profile.
///
/// \param Inst Instruction to query.
///
/// \returns the FunctionSamples pointer to the inlined instance.
const FunctionSamples *
SampleProfileLoader::findFunctionSamples(const Instruction &Inst) const {
[CSSPGO] Consume pseudo-probe-based AutoFDO profile This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`. Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites). **Adjustment to profile format ** A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like ``` !CFGChecksum: 12345 ``` is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums. Differential Revision: https://reviews.llvm.org/D92347
2020-12-17 04:54:50 +08:00
if (FunctionSamples::ProfileIsProbeBased) {
Optional<PseudoProbe> Probe = extractProbe(Inst);
if (!Probe)
return nullptr;
}
const DILocation *DIL = Inst.getDebugLoc();
if (!DIL)
return Samples;
auto it = DILocation2SampleMap.try_emplace(DIL,nullptr);
[CSSPGO] Infrastructure for context-sensitive Sample PGO and Inlining This change adds the context-senstive sample PGO infracture described in CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduced an abstraction between input profile and profile loader that queries input profile for functions. Specifically, there's now the notion of base profile and context profile, and they are managed by the new SampleContextTracker for adjusting and merging profiles based on inline decisions. It works with top-down profiled guided inliner in profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to CGSCC inliner in order for it to take advantage of context-sensitive profile. This change is the consumption part of context-sensitive profile (The generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Pacthes for integration with Pseudo-probe coming up soon. Currently the new infrastructure kick in when input profile contains the new context-sensitive profile; otherwise it's no-op and does not affect existing AutoFDO. **Interface** There're two sets of interfaces for query and tracking respectively exposed from SampleContextTracker. For query, now instead of simply getting a profile from input for a function, we can explicitly query base profile or context profile for given call path of a function. For tracking, there're separate APIs for marking context profile as inlined, or promoting and merging not inlined context profile. - Query base profile (`getBaseSamplesFor`) Base profile is the merged synthetic profile for function's CFG profile from any outstanding (not inlined) context. We can query base profile by function. - Query context profile (`getContextSamplesFor`) Context profile is a function's CFG profile for a given calling context. We can query context profile by context string. - Track inlined context profile (`markContextSamplesInlined`) When a function is inlined for given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing base profile for that inlined function. - Track not-inlined context profile (`promoteMergeContextSamplesTree`) When a function is not inlined for given calling context, we need to promote the context profile tree so the not inlined context becomes top-level context. This preserve the sub-context under that function so later inline decision for that not inlined function will still have context profile for its call tree. Note that profile will be merged if needed when promoting a context profile tree if any of the node already exists at its promoted destination. **Implementation** Implementation-wise, `SampleContext` is created as abstraction for context. Currently it's a string for call path, and we can later optimize it to something more efficient, e.g. context id. Each `SampleContext` also has a `ContextState` indicating whether it's raw context profile from input, whether it's inlined or merged, whether it's synthetic profile created by compiler. Each `FunctionSamples` now has a `SampleContext` that tells whether it's base profile or context profile, and for context profile what is the context and state. On top of the above context representation, a custom trie tree is implemented to track and manager context profiles. Specifically, `SampleContextTracker` is implemented that encapsulates a trie tree with `ContextTireNode` as node. Each node of the trie tree represents a frame in calling context, thus the path from root to a node represents a valid calling context. We also track `FunctionSamples` for each node, so this trie tree can serve efficient query for context profile. Accordingly, context profile tree promotion now becomes moving a subtree to be under the root of entire tree, and merge nodes for subtree if this move encounters existing nodes. **Integration** `SampleContextTracker` is now also integrated with AutoFDO, `SampleProfileReader` and `SampleProfileLoader`. When we detected input profile contains context-sensitive profile, `SampleContextTracker` will be used to track profiles, and all profile query will go to `SampleContextTracker` instead of `SampleProfileReader` automatically. Tracking APIs are called automatically for each inline decision from `SampleProfileLoader`. Differential Revision: https://reviews.llvm.org/D90125
2020-03-24 14:50:41 +08:00
if (it.second) {
if (ProfileIsCS)
it.first->second = ContextTracker->getContextSamplesFor(DIL);
else
it.first->second =
Samples->findFunctionSamples(DIL, Reader->getRemapper());
}
return it.first->second;
}
CallBase *
SampleProfileLoader::tryPromoteIndirectCall(Function &F, StringRef CalleeName,
uint64_t &Sum, uint64_t Count,
CallBase *I, const char *&Reason) {
Reason = "Callee function not available";
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
// R->getValue() != &F is to prevent promoting a recursive call.
// If it is a recursive call, we do not inline it as it could bloat
// the code exponentially. There is way to better handle this, e.g.
// clone the caller first, and inline the cloned caller if it is
// recursive. As llvm does not inline recursive calls, we will
// simply ignore it instead of handling it explicitly.
auto R = SymbolMap.find(CalleeName);
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
if (R != SymbolMap.end() && R->getValue() &&
!R->getValue()->isDeclaration() && R->getValue()->getSubprogram() &&
R->getValue()->hasFnAttribute("use-sample-profile") &&
R->getValue() != &F && isLegalToPromote(*I, R->getValue(), &Reason)) {
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
auto *DI =
&pgo::promoteIndirectCall(*I, R->getValue(), Count, Sum, false, ORE);
Sum -= Count;
return DI;
}
return nullptr;
}
bool SampleProfileLoader::inlineCallInstruction(
CallBase &CB, const FunctionSamples *CalleeSamples) {
if (ExternalInlineAdvisor) {
auto Advice = ExternalInlineAdvisor->getAdvice(CB);
if (!Advice->isInliningRecommended()) {
Advice->recordUnattemptedInlining();
return false;
}
// Dummy record, we don't use it for replay.
Advice->recordInlining();
}
Function *CalledFunction = CB.getCalledFunction();
assert(CalledFunction);
DebugLoc DLoc = CB.getDebugLoc();
BasicBlock *BB = CB.getParent();
InlineParams Params = getInlineParams();
Params.ComputeFullInlineCost = true;
// Checks if there is anything in the reachable portion of the callee at
// this callsite that makes this inlining potentially illegal. Need to
// set ComputeFullInlineCost, otherwise getInlineCost may return early
// when cost exceeds threshold without checking all IRs in the callee.
// The acutal cost does not matter because we only checks isNever() to
// see if it is legal to inline the callsite.
InlineCost Cost =
getInlineCost(CB, Params, GetTTI(*CalledFunction), GetAC, GetTLI);
if (Cost.isNever()) {
ORE->emit(OptimizationRemarkAnalysis(CSINLINE_DEBUG, "InlineFail", DLoc, BB)
<< "incompatible inlining");
return false;
}
InlineFunctionInfo IFI(nullptr, GetAC);
if (InlineFunction(CB, IFI).isSuccess()) {
// The call to InlineFunction erases I, so we can't pass it here.
emitInlinedInto(*ORE, DLoc, BB, *CalledFunction, *BB->getParent(), Cost,
true, CSINLINE_DEBUG);
if (ProfileIsCS)
ContextTracker->markContextSamplesInlined(CalleeSamples);
++NumCSInlined;
return true;
}
return false;
}
bool SampleProfileLoader::shouldInlineColdCallee(CallBase &CallInst) {
if (!ProfileSizeInline)
return false;
Function *Callee = CallInst.getCalledFunction();
if (Callee == nullptr)
return false;
InlineCost Cost = getInlineCost(CallInst, getInlineParams(), GetTTI(*Callee),
GetAC, GetTLI);
[CSSPGO] Infrastructure for context-sensitive Sample PGO and Inlining This change adds the context-senstive sample PGO infracture described in CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduced an abstraction between input profile and profile loader that queries input profile for functions. Specifically, there's now the notion of base profile and context profile, and they are managed by the new SampleContextTracker for adjusting and merging profiles based on inline decisions. It works with top-down profiled guided inliner in profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to CGSCC inliner in order for it to take advantage of context-sensitive profile. This change is the consumption part of context-sensitive profile (The generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Pacthes for integration with Pseudo-probe coming up soon. Currently the new infrastructure kick in when input profile contains the new context-sensitive profile; otherwise it's no-op and does not affect existing AutoFDO. **Interface** There're two sets of interfaces for query and tracking respectively exposed from SampleContextTracker. For query, now instead of simply getting a profile from input for a function, we can explicitly query base profile or context profile for given call path of a function. For tracking, there're separate APIs for marking context profile as inlined, or promoting and merging not inlined context profile. - Query base profile (`getBaseSamplesFor`) Base profile is the merged synthetic profile for function's CFG profile from any outstanding (not inlined) context. We can query base profile by function. - Query context profile (`getContextSamplesFor`) Context profile is a function's CFG profile for a given calling context. We can query context profile by context string. - Track inlined context profile (`markContextSamplesInlined`) When a function is inlined for given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing base profile for that inlined function. - Track not-inlined context profile (`promoteMergeContextSamplesTree`) When a function is not inlined for given calling context, we need to promote the context profile tree so the not inlined context becomes top-level context. This preserve the sub-context under that function so later inline decision for that not inlined function will still have context profile for its call tree. Note that profile will be merged if needed when promoting a context profile tree if any of the node already exists at its promoted destination. **Implementation** Implementation-wise, `SampleContext` is created as abstraction for context. Currently it's a string for call path, and we can later optimize it to something more efficient, e.g. context id. Each `SampleContext` also has a `ContextState` indicating whether it's raw context profile from input, whether it's inlined or merged, whether it's synthetic profile created by compiler. Each `FunctionSamples` now has a `SampleContext` that tells whether it's base profile or context profile, and for context profile what is the context and state. On top of the above context representation, a custom trie tree is implemented to track and manager context profiles. Specifically, `SampleContextTracker` is implemented that encapsulates a trie tree with `ContextTireNode` as node. Each node of the trie tree represents a frame in calling context, thus the path from root to a node represents a valid calling context. We also track `FunctionSamples` for each node, so this trie tree can serve efficient query for context profile. Accordingly, context profile tree promotion now becomes moving a subtree to be under the root of entire tree, and merge nodes for subtree if this move encounters existing nodes. **Integration** `SampleContextTracker` is now also integrated with AutoFDO, `SampleProfileReader` and `SampleProfileLoader`. When we detected input profile contains context-sensitive profile, `SampleContextTracker` will be used to track profiles, and all profile query will go to `SampleContextTracker` instead of `SampleProfileReader` automatically. Tracking APIs are called automatically for each inline decision from `SampleProfileLoader`. Differential Revision: https://reviews.llvm.org/D90125
2020-03-24 14:50:41 +08:00
if (Cost.isNever())
return false;
if (Cost.isAlways())
return true;
return Cost.getCost() <= SampleColdCallSiteThreshold;
}
void SampleProfileLoader::emitOptimizationRemarksForInlineCandidates(
const SmallVectorImpl<CallBase *> &Candidates, const Function &F,
bool Hot) {
for (auto I : Candidates) {
Function *CalledFunction = I->getCalledFunction();
if (CalledFunction) {
ORE->emit(OptimizationRemarkAnalysis(CSINLINE_DEBUG, "InlineAttempt",
I->getDebugLoc(), I->getParent())
<< "previous inlining reattempted for "
<< (Hot ? "hotness: '" : "size: '")
<< ore::NV("Callee", CalledFunction) << "' into '"
<< ore::NV("Caller", &F) << "'");
}
}
}
/// Iteratively inline hot callsites of a function.
///
/// Iteratively traverse all callsites of the function \p F, and find if
/// the corresponding inlined instance exists and is hot in profile. If
/// it is hot enough, inline the callsites and adds new callsites of the
/// callee into the caller. If the call is an indirect call, first promote
/// it to direct call. Each indirect call is limited with a single target.
///
/// \param F function to perform iterative inlining.
/// \param InlinedGUIDs a set to be updated to include all GUIDs that are
/// inlined in the profiled binary.
///
/// \returns True if there is any inline happened.
bool SampleProfileLoader::inlineHotFunctions(
Function &F, DenseSet<GlobalValue::GUID> &InlinedGUIDs) {
DenseSet<Instruction *> PromotedInsns;
// ProfAccForSymsInList is used in callsiteIsHot. The assertion makes sure
// Profile symbol list is ignored when profile-sample-accurate is on.
assert((!ProfAccForSymsInList ||
(!ProfileSampleAccurate &&
!F.hasFnAttribute("profile-sample-accurate"))) &&
"ProfAccForSymsInList should be false when profile-sample-accurate "
"is enabled");
DenseMap<CallBase *, const FunctionSamples *> localNotInlinedCallSites;
bool Changed = false;
while (true) {
bool LocalChanged = false;
SmallVector<CallBase *, 10> CIS;
for (auto &BB : F) {
bool Hot = false;
SmallVector<CallBase *, 10> AllCandidates;
SmallVector<CallBase *, 10> ColdCandidates;
for (auto &I : BB.getInstList()) {
const FunctionSamples *FS = nullptr;
if (auto *CB = dyn_cast<CallBase>(&I)) {
if (!isa<IntrinsicInst>(I) && (FS = findCalleeFunctionSamples(*CB))) {
assert((!FunctionSamples::UseMD5 || FS->GUIDToFuncNameMap) &&
"GUIDToFuncNameMap has to be populated");
AllCandidates.push_back(CB);
[CSSPGO] Infrastructure for context-sensitive Sample PGO and Inlining This change adds the context-senstive sample PGO infracture described in CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduced an abstraction between input profile and profile loader that queries input profile for functions. Specifically, there's now the notion of base profile and context profile, and they are managed by the new SampleContextTracker for adjusting and merging profiles based on inline decisions. It works with top-down profiled guided inliner in profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to CGSCC inliner in order for it to take advantage of context-sensitive profile. This change is the consumption part of context-sensitive profile (The generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Pacthes for integration with Pseudo-probe coming up soon. Currently the new infrastructure kick in when input profile contains the new context-sensitive profile; otherwise it's no-op and does not affect existing AutoFDO. **Interface** There're two sets of interfaces for query and tracking respectively exposed from SampleContextTracker. For query, now instead of simply getting a profile from input for a function, we can explicitly query base profile or context profile for given call path of a function. For tracking, there're separate APIs for marking context profile as inlined, or promoting and merging not inlined context profile. - Query base profile (`getBaseSamplesFor`) Base profile is the merged synthetic profile for function's CFG profile from any outstanding (not inlined) context. We can query base profile by function. - Query context profile (`getContextSamplesFor`) Context profile is a function's CFG profile for a given calling context. We can query context profile by context string. - Track inlined context profile (`markContextSamplesInlined`) When a function is inlined for given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing base profile for that inlined function. - Track not-inlined context profile (`promoteMergeContextSamplesTree`) When a function is not inlined for given calling context, we need to promote the context profile tree so the not inlined context becomes top-level context. This preserve the sub-context under that function so later inline decision for that not inlined function will still have context profile for its call tree. Note that profile will be merged if needed when promoting a context profile tree if any of the node already exists at its promoted destination. **Implementation** Implementation-wise, `SampleContext` is created as abstraction for context. Currently it's a string for call path, and we can later optimize it to something more efficient, e.g. context id. Each `SampleContext` also has a `ContextState` indicating whether it's raw context profile from input, whether it's inlined or merged, whether it's synthetic profile created by compiler. Each `FunctionSamples` now has a `SampleContext` that tells whether it's base profile or context profile, and for context profile what is the context and state. On top of the above context representation, a custom trie tree is implemented to track and manager context profiles. Specifically, `SampleContextTracker` is implemented that encapsulates a trie tree with `ContextTireNode` as node. Each node of the trie tree represents a frame in calling context, thus the path from root to a node represents a valid calling context. We also track `FunctionSamples` for each node, so this trie tree can serve efficient query for context profile. Accordingly, context profile tree promotion now becomes moving a subtree to be under the root of entire tree, and merge nodes for subtree if this move encounters existing nodes. **Integration** `SampleContextTracker` is now also integrated with AutoFDO, `SampleProfileReader` and `SampleProfileLoader`. When we detected input profile contains context-sensitive profile, `SampleContextTracker` will be used to track profiles, and all profile query will go to `SampleContextTracker` instead of `SampleProfileReader` automatically. Tracking APIs are called automatically for each inline decision from `SampleProfileLoader`. Differential Revision: https://reviews.llvm.org/D90125
2020-03-24 14:50:41 +08:00
if (FS->getEntrySamples() > 0 || ProfileIsCS)
localNotInlinedCallSites.try_emplace(CB, FS);
if (callsiteIsHot(FS, PSI))
Hot = true;
else if (shouldInlineColdCallee(*CB))
ColdCandidates.push_back(CB);
}
}
}
if (Hot || ExternalInlineAdvisor) {
CIS.insert(CIS.begin(), AllCandidates.begin(), AllCandidates.end());
emitOptimizationRemarksForInlineCandidates(AllCandidates, F, true);
} else {
CIS.insert(CIS.begin(), ColdCandidates.begin(), ColdCandidates.end());
emitOptimizationRemarksForInlineCandidates(ColdCandidates, F, false);
}
}
for (CallBase *I : CIS) {
Function *CalledFunction = I->getCalledFunction();
// Do not inline recursive calls.
if (CalledFunction == &F)
continue;
if (I->isIndirectCall()) {
if (PromotedInsns.count(I))
continue;
uint64_t Sum;
for (const auto *FS : findIndirectCallFunctionSamples(*I, Sum)) {
if (LTOPhase == ThinOrFullLTOPhase::ThinLTOPreLink) {
FS->findInlinedFunctions(InlinedGUIDs, F.getParent(),
PSI->getOrCompHotCountThreshold());
continue;
}
[SampleFDO] Enhance profile remapping support for searching inline instance and indirect call promotion candidate. Profile remapping is a feature to match a function in the module with its profile in sample profile if the function name and the name in profile look different but are equivalent using given remapping rules. This is a useful feature to keep the performance stable by specifying some remapping rules when sampleFDO targets are going through some large scale function signature change. However, currently profile remapping support is only valid for outline function profile in SampleFDO. It cannot match a callee with an inline instance profile if they have different but equivalent names. We found that without the support for inline instance profile, remapping is less effective for some large scale change. To add that support, before any remapping lookup happens, all the names in the profile will be inserted into remapper and the Key to the name mapping will be recorded in a map called NameMap in the remapper. During name lookup, a Key will be returned for the given name and it will be used to extract an equivalent name in the profile from NameMap. So with the help of the NameMap, we can translate any given name to an equivalent name in the profile if it exists. Whenever we try to match a name in the module to a name in the profile, we will try the match with the original name first, and if it doesn't match, we will use the equivalent name got from remapper to try the match for another time. In this way, the patch can enhance the profile remapping support for searching inline instance and searching indirect call promotion candidate. In a planned large scale change of int64 type (long long) to int64_t (long), we found the performance of a google internal benchmark degraded by 2% if nothing was done. If existing profile remapping was enabled, the performance degradation dropped to 1.2%. If the profile remapping with the current patch was enabled, the performance degradation further dropped to 0.14% (Note the experiment was done before searching indirect call promotion candidate was added. We hope with the remapping support of searching indirect call promotion candidate, the degradation can drop to 0% in the end. It will be evaluated post commit). Differential Revision: https://reviews.llvm.org/D86332
2020-08-25 13:59:20 +08:00
if (!callsiteIsHot(FS, PSI))
continue;
const char *Reason = nullptr;
auto CalleeFunctionName = FS->getFuncName();
if (CallBase *DI =
tryPromoteIndirectCall(F, CalleeFunctionName, Sum,
FS->getEntrySamples(), I, Reason)) {
PromotedInsns.insert(I);
// If profile mismatches, we should not attempt to inline DI.
if ((isa<CallInst>(DI) || isa<InvokeInst>(DI)) &&
inlineCallInstruction(cast<CallBase>(*DI), FS)) {
localNotInlinedCallSites.erase(I);
LocalChanged = true;
}
} else {
LLVM_DEBUG(dbgs()
<< "\nFailed to promote indirect call to "
<< CalleeFunctionName << " because " << Reason << "\n");
}
}
} else if (CalledFunction && CalledFunction->getSubprogram() &&
!CalledFunction->isDeclaration()) {
if (inlineCallInstruction(*I, localNotInlinedCallSites.count(I)
? localNotInlinedCallSites[I]
: nullptr)) {
localNotInlinedCallSites.erase(I);
LocalChanged = true;
}
} else if (LTOPhase == ThinOrFullLTOPhase::ThinLTOPreLink) {
findCalleeFunctionSamples(*I)->findInlinedFunctions(
InlinedGUIDs, F.getParent(), PSI->getOrCompHotCountThreshold());
}
}
if (LocalChanged) {
Changed = true;
} else {
break;
}
}
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
// For CS profile, profile for not inlined context will be merged when
// base profile is being trieved
if (ProfileIsCS)
return Changed;
// Accumulate not inlined callsite information into notInlinedSamples
for (const auto &Pair : localNotInlinedCallSites) {
CallBase *I = Pair.getFirst();
Function *Callee = I->getCalledFunction();
if (!Callee || Callee->isDeclaration())
continue;
ORE->emit(OptimizationRemarkAnalysis(CSINLINE_DEBUG, "NotInline",
I->getDebugLoc(), I->getParent())
<< "previous inlining not repeated: '"
<< ore::NV("Callee", Callee) << "' into '"
<< ore::NV("Caller", &F) << "'");
++NumCSNotInlined;
const FunctionSamples *FS = Pair.getSecond();
if (FS->getTotalSamples() == 0 && FS->getEntrySamples() == 0) {
continue;
}
if (ProfileMergeInlinee) {
// A function call can be replicated by optimizations like callsite
// splitting or jump threading and the replicates end up sharing the
// sample nested callee profile instead of slicing the original inlinee's
// profile. We want to do merge exactly once by filtering out callee
// profiles with a non-zero head sample count.
if (FS->getHeadSamples() == 0) {
// Use entry samples as head samples during the merge, as inlinees
// don't have head samples.
const_cast<FunctionSamples *>(FS)->addHeadSamples(
FS->getEntrySamples());
// Note that we have to do the merge right after processing function.
// This allows OutlineFS's profile to be used for annotation during
// top-down processing of functions' annotation.
FunctionSamples *OutlineFS = Reader->getOrCreateSamplesFor(*Callee);
OutlineFS->merge(*FS);
[AutoFDO] Remove a broken assert in merging inlinee samples Duplicated callsites share the same callee profile if the original callsite was inlined. The sharing also causes the profile of callee's callee to be shared. This breaks the assert introduced ealier by D84997 in a tricky way. To illustrate, I'm using an abstract example. Say we have three functions `A`, `B` and `C`. A calls B twice and B calls C once. Some optimize performed prior to the sample profile loader duplicates first callsite to `B` and the program may look like ``` A() { B(); // with nested profile B1 and C1 B(); // duplicated, with nested profile B1 and C1 B(); // with nested profile B2 and C2 } ``` For some reason, the sample profile loader inliner then decides to only inline the first callsite in `A` and transforms `A` into ``` A() { C(); // with nested profile C1 B(); // duplicated, with nested profile B1 and C1 B(); // with nested profile B2 and C2. } ``` Here is what happens next: 1. Failing to inline the callsite `C()` results in `C1`'s samples returned to `C`'s base (outlined) profile. In the meantime, `C1`'s head samples are updated to `C1`'s entry sample. This also affects the profile of the middle callsite which shares `C1` with the first callsite. 2. Failing to inline the middle callsite results in `B1` returned to `B`'s base profile, which in turn will cause `C1` merged into `B`'s base profile. Note that the nest `C` profile in `B`'s base has a non-zero head sample count now. The value actually equals to `C1`'s entry count. 3. Failing to inline last callsite results in `B2` returned to `B`'s base profile. Note that the nested `C` profile in `B`'s base now has an entry count equal to the sum of that of `C1` and `C2`, with the head count equal to that of `C1`. This will trigger the assert later on. 4. Compiling `B` using `B`'s base profile. Failing to inline `C` there triggers the returning of the nested `C` profile. Since the nested `C` profile has a non-zero head count, the returning doesn't go through. Instead, the assert goes off. It's good that `C1` is only returned once, based on using a non-zero head count to ensure an inline profile is only returned once. However C2 is never returned. While it seems hard to solve this perfectly within the current framework, I'm just removing the broken assert. This should be reasonably fixed by the upcoming CSSPGO work where counts returning is based on context-sensitivity and a distribution factor for callsite probes. The simple example is extracted from one of our internal services. In reality, why the original callsite `B()` and duplicate one having different inline behavior is a magic. It has to do with imperfect counts in profile and extra complicated inlining that makes the hotness for them different. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D90056
2020-10-23 14:17:45 +08:00
}
} else {
auto pair =
notInlinedCallInfo.try_emplace(Callee, NotInlinedProfileInfo{0});
pair.first->second.entryCount += FS->getEntrySamples();
}
}
return Changed;
}
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
bool SampleProfileLoader::tryInlineCandidate(
InlineCandidate &Candidate, SmallVector<CallBase *, 8> &InlinedCallSites) {
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
CallBase &CB = *Candidate.CallInstr;
Function *CalledFunction = CB.getCalledFunction();
assert(CalledFunction && "Expect a callee with definition");
DebugLoc DLoc = CB.getDebugLoc();
BasicBlock *BB = CB.getParent();
InlineCost Cost = shouldInlineCandidate(Candidate);
if (Cost.isNever()) {
ORE->emit(OptimizationRemarkAnalysis(CSINLINE_DEBUG, "InlineFail", DLoc, BB)
<< "incompatible inlining");
return false;
}
if (!Cost)
return false;
InlineFunctionInfo IFI(nullptr, GetAC);
if (InlineFunction(CB, IFI).isSuccess()) {
// The call to InlineFunction erases I, so we can't pass it here.
emitInlinedInto(*ORE, DLoc, BB, *CalledFunction, *BB->getParent(), Cost,
true, CSINLINE_DEBUG);
// Now populate the list of newly exposed call sites.
InlinedCallSites.clear();
for (auto &I : IFI.InlinedCallSites)
InlinedCallSites.push_back(I);
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
if (ProfileIsCS)
ContextTracker->markContextSamplesInlined(Candidate.CalleeSamples);
++NumCSInlined;
return true;
}
return false;
}
bool SampleProfileLoader::getInlineCandidate(InlineCandidate *NewCandidate,
CallBase *CB) {
assert(CB && "Expect non-null call instruction");
if (isa<IntrinsicInst>(CB))
return false;
// Find the callee's profile. For indirect call, find hottest target profile.
const FunctionSamples *CalleeSamples = findCalleeFunctionSamples(*CB);
if (!CalleeSamples)
return false;
uint64_t CallsiteCount = 0;
ErrorOr<uint64_t> Weight = getBlockWeight(CB->getParent());
if (Weight)
CallsiteCount = Weight.get();
if (CalleeSamples)
CallsiteCount = std::max(CallsiteCount, CalleeSamples->getEntrySamples());
*NewCandidate = {CB, CalleeSamples, CallsiteCount};
return true;
}
InlineCost
SampleProfileLoader::shouldInlineCandidate(InlineCandidate &Candidate) {
assert(ProfileIsCS && "Prioritiy based inliner only works with CSSPGO now");
std::unique_ptr<InlineAdvice> Advice = nullptr;
if (ExternalInlineAdvisor) {
Advice = ExternalInlineAdvisor->getAdvice(*Candidate.CallInstr);
if (!Advice->isInliningRecommended()) {
Advice->recordUnattemptedInlining();
return InlineCost::getNever("not previously inlined");
}
Advice->recordInlining();
return InlineCost::getAlways("previously inlined");
}
// Adjust threshold based on call site hotness, only do this for callsite
// prioritized inliner because otherwise cost-benefit check is done earlier.
int SampleThreshold = SampleColdCallSiteThreshold;
if (CallsitePrioritizedInline) {
if (Candidate.CallsiteCount > PSI->getHotCountThreshold())
SampleThreshold = SampleHotCallSiteThreshold;
else if (!ProfileSizeInline)
return InlineCost::getNever("cold callsite");
}
Function *Callee = Candidate.CallInstr->getCalledFunction();
assert(Callee && "Expect a definition for inline candidate of direct call");
InlineParams Params = getInlineParams();
Params.ComputeFullInlineCost = true;
// Checks if there is anything in the reachable portion of the callee at
// this callsite that makes this inlining potentially illegal. Need to
// set ComputeFullInlineCost, otherwise getInlineCost may return early
// when cost exceeds threshold without checking all IRs in the callee.
// The acutal cost does not matter because we only checks isNever() to
// see if it is legal to inline the callsite.
InlineCost Cost = getInlineCost(*Candidate.CallInstr, Callee, Params,
GetTTI(*Callee), GetAC, GetTLI);
// For old FDO inliner, we inline the call site as long as cost is not
// "Never". The cost-benefit check is done earlier.
if (!CallsitePrioritizedInline) {
if (Cost.isNever())
return Cost;
return InlineCost::getAlways("hot callsite previously inlined");
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
}
// Honor always inline and never inline from call analyzer
if (Cost.isNever() || Cost.isAlways())
return Cost;
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
// Otherwise only use the cost from call analyzer, but overwite threshold with
// Sample PGO threshold.
return InlineCost::get(Cost.getCost(), SampleThreshold);
}
bool SampleProfileLoader::inlineHotFunctionsWithPriority(
Function &F, DenseSet<GlobalValue::GUID> &InlinedGUIDs) {
DenseSet<Instruction *> PromotedInsns;
assert(ProfileIsCS && "Prioritiy based inliner only works with CSSPGO now");
// ProfAccForSymsInList is used in callsiteIsHot. The assertion makes sure
// Profile symbol list is ignored when profile-sample-accurate is on.
assert((!ProfAccForSymsInList ||
(!ProfileSampleAccurate &&
!F.hasFnAttribute("profile-sample-accurate"))) &&
"ProfAccForSymsInList should be false when profile-sample-accurate "
"is enabled");
// Populating worklist with initial call sites from root inliner, along
// with call site weights.
CandidateQueue CQueue;
InlineCandidate NewCandidate;
for (auto &BB : F) {
for (auto &I : BB.getInstList()) {
auto *CB = dyn_cast<CallBase>(&I);
if (!CB)
continue;
if (getInlineCandidate(&NewCandidate, CB))
CQueue.push(NewCandidate);
}
}
// Cap the size growth from profile guided inlining. This is needed even
// though cost of each inline candidate already accounts for callee size,
// because with top-down inlining, we can grow inliner size significantly
// with large number of smaller inlinees each pass the cost check.
assert(ProfileInlineLimitMax >= ProfileInlineLimitMin &&
"Max inline size limit should not be smaller than min inline size "
"limit.");
unsigned SizeLimit = F.getInstructionCount() * ProfileInlineGrowthLimit;
SizeLimit = std::min(SizeLimit, (unsigned)ProfileInlineLimitMax);
SizeLimit = std::max(SizeLimit, (unsigned)ProfileInlineLimitMin);
if (ExternalInlineAdvisor)
SizeLimit = std::numeric_limits<unsigned>::max();
// Perform iterative BFS call site prioritized inlining
bool Changed = false;
while (!CQueue.empty() && F.getInstructionCount() < SizeLimit) {
InlineCandidate Candidate = CQueue.top();
CQueue.pop();
CallBase *I = Candidate.CallInstr;
Function *CalledFunction = I->getCalledFunction();
if (CalledFunction == &F)
continue;
if (I->isIndirectCall()) {
if (PromotedInsns.count(I))
continue;
uint64_t Sum;
auto CalleeSamples = findIndirectCallFunctionSamples(*I, Sum);
uint64_t SumOrigin = Sum;
for (const auto *FS : CalleeSamples) {
// TODO: Consider disable pre-lTO ICP for MonoLTO as well
if (LTOPhase == ThinOrFullLTOPhase::ThinLTOPreLink) {
FS->findInlinedFunctions(InlinedGUIDs, F.getParent(),
PSI->getOrCompHotCountThreshold());
continue;
}
uint64_t EntryCountDistributed = FS->getEntrySamples();
// In addition to regular inline cost check, we also need to make sure
// ICP isn't introducing excessive speculative checks even if individual
// target looks beneficial to promote and inline. That means we should
// only do ICP when there's a small number dominant targets.
if (EntryCountDistributed < SumOrigin / ProfileICPThreshold)
break;
// TODO: Fix CallAnalyzer to handle all indirect calls.
// For indirect call, we don't run CallAnalyzer to get InlineCost
// before actual inlining. This is because we could see two different
// types from the same definition, which makes CallAnalyzer choke as
// it's expecting matching parameter type on both caller and callee
// side. See example from PR18962 for the triggering cases (the bug was
// fixed, but we generate different types).
if (!PSI->isHotCount(EntryCountDistributed))
break;
const char *Reason = nullptr;
auto CalleeFunctionName = FS->getFuncName();
if (CallBase *DI = tryPromoteIndirectCall(
F, CalleeFunctionName, Sum, EntryCountDistributed, I, Reason)) {
// Attach function profile for promoted indirect callee, and update
// call site count for the promoted inline candidate too.
Candidate = {DI, FS, EntryCountDistributed};
PromotedInsns.insert(I);
SmallVector<CallBase *, 8> InlinedCallSites;
// If profile mismatches, we should not attempt to inline DI.
if ((isa<CallInst>(DI) || isa<InvokeInst>(DI)) &&
tryInlineCandidate(Candidate, InlinedCallSites)) {
for (auto *CB : InlinedCallSites) {
if (getInlineCandidate(&NewCandidate, CB))
CQueue.emplace(NewCandidate);
}
Changed = true;
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
}
} else {
LLVM_DEBUG(dbgs()
<< "\nFailed to promote indirect call to "
<< CalleeFunctionName << " because " << Reason << "\n");
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
}
}
} else if (CalledFunction && CalledFunction->getSubprogram() &&
!CalledFunction->isDeclaration()) {
SmallVector<CallBase *, 8> InlinedCallSites;
if (tryInlineCandidate(Candidate, InlinedCallSites)) {
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
for (auto *CB : InlinedCallSites) {
if (getInlineCandidate(&NewCandidate, CB))
CQueue.emplace(NewCandidate);
}
Changed = true;
}
} else if (LTOPhase == ThinOrFullLTOPhase::ThinLTOPreLink) {
findCalleeFunctionSamples(*I)->findInlinedFunctions(
InlinedGUIDs, F.getParent(), PSI->getOrCompHotCountThreshold());
}
}
if (!CQueue.empty()) {
if (SizeLimit == (unsigned)ProfileInlineLimitMax)
++NumCSInlinedHitMaxLimit;
else if (SizeLimit == (unsigned)ProfileInlineLimitMin)
++NumCSInlinedHitMinLimit;
else
++NumCSInlinedHitGrowthLimit;
}
return Changed;
}
/// Find equivalence classes for the given block.
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
///
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
/// This finds all the blocks that are guaranteed to execute the same
/// number of times as \p BB1. To do this, it traverses all the
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
/// descendants of \p BB1 in the dominator or post-dominator tree.
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
///
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
/// A block BB2 will be in the same equivalence class as \p BB1 if
/// the following holds:
///
/// 1- \p BB1 is a descendant of BB2 in the opposite tree. So, if BB2
/// is a descendant of \p BB1 in the dominator tree, then BB2 should
/// dominate BB1 in the post-dominator tree.
///
/// 2- Both BB2 and \p BB1 must be in the same loop.
///
/// For every block BB2 that meets those two requirements, we set BB2's
/// equivalence class to \p BB1.
///
/// \param BB1 Block to check.
/// \param Descendants Descendants of \p BB1 in either the dom or pdom tree.
/// \param DomTree Opposite dominator tree. If \p Descendants is filled
/// with blocks from \p BB1's dominator tree, then
/// this is the post-dominator tree, and vice versa.
template <bool IsPostDom>
void SampleProfileLoader::findEquivalencesFor(
BasicBlock *BB1, ArrayRef<BasicBlock *> Descendants,
DominatorTreeBase<BasicBlock, IsPostDom> *DomTree) {
const BasicBlock *EC = EquivalenceClass[BB1];
uint64_t Weight = BlockWeights[EC];
for (const auto *BB2 : Descendants) {
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
bool IsDomParent = DomTree->dominates(BB2, BB1);
bool IsInSameLoop = LI->getLoopFor(BB1) == LI->getLoopFor(BB2);
if (BB1 != BB2 && IsDomParent && IsInSameLoop) {
EquivalenceClass[BB2] = EC;
// If BB2 is visited, then the entire EC should be marked as visited.
if (VisitedBlocks.count(BB2)) {
VisitedBlocks.insert(EC);
}
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
// If BB2 is heavier than BB1, make BB2 have the same weight
// as BB1.
//
// Note that we don't worry about the opposite situation here
// (when BB2 is lighter than BB1). We will deal with this
// during the propagation phase. Right now, we just want to
// make sure that BB1 has the largest weight of all the
// members of its equivalence set.
Weight = std::max(Weight, BlockWeights[BB2]);
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
}
if (EC == &EC->getParent()->getEntryBlock()) {
BlockWeights[EC] = Samples->getHeadSamples() + 1;
} else {
BlockWeights[EC] = Weight;
}
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
/// Find equivalence classes.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// Since samples may be missing from blocks, we can fill in the gaps by setting
/// the weights of all the blocks in the same equivalence class to the same
/// weight. To compute the concept of equivalence, we use dominance and loop
/// information. Two blocks B1 and B2 are in the same equivalence class if B1
/// dominates B2, B2 post-dominates B1 and both are in the same loop.
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
///
/// \param F The function to query.
void SampleProfileLoader::findEquivalenceClasses(Function &F) {
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
SmallVector<BasicBlock *, 8> DominatedBBs;
LLVM_DEBUG(dbgs() << "\nBlock equivalence classes\n");
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
// Find equivalence sets based on dominance and post-dominance information.
for (auto &BB : F) {
BasicBlock *BB1 = &BB;
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
// Compute BB1's equivalence class once.
if (EquivalenceClass.count(BB1)) {
LLVM_DEBUG(printBlockEquivalence(dbgs(), BB1));
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
continue;
}
// By default, blocks are in their own equivalence class.
EquivalenceClass[BB1] = BB1;
// Traverse all the blocks dominated by BB1. We are looking for
// every basic block BB2 such that:
//
// 1- BB1 dominates BB2.
// 2- BB2 post-dominates BB1.
// 3- BB1 and BB2 are in the same loop nest.
//
// If all those conditions hold, it means that BB2 is executed
// as many times as BB1, so they are placed in the same equivalence
// class by making BB2's equivalence class be BB1.
DominatedBBs.clear();
DT->getDescendants(BB1, DominatedBBs);
findEquivalencesFor(BB1, DominatedBBs, PDT.get());
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
LLVM_DEBUG(printBlockEquivalence(dbgs(), BB1));
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
// Assign weights to equivalence classes.
//
// All the basic blocks in the same equivalence class will execute
// the same number of times. Since we know that the head block in
// each equivalence class has the largest weight, assign that weight
// to all the blocks in that equivalence class.
LLVM_DEBUG(
dbgs() << "\nAssign the same weight to all blocks in the same class\n");
for (auto &BI : F) {
const BasicBlock *BB = &BI;
const BasicBlock *EquivBB = EquivalenceClass[BB];
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
if (BB != EquivBB)
BlockWeights[BB] = BlockWeights[EquivBB];
LLVM_DEBUG(printBlockWeight(dbgs(), BB));
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
}
/// Visit the given edge to decide if it has a valid weight.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// If \p E has not been visited before, we copy to \p UnknownEdge
/// and increment the count of unknown edges.
///
/// \param E Edge to visit.
/// \param NumUnknownEdges Current number of unknown edges.
/// \param UnknownEdge Set if E has not been visited before.
///
/// \returns E's weight, if known. Otherwise, return 0.
uint64_t SampleProfileLoader::visitEdge(Edge E, unsigned *NumUnknownEdges,
Edge *UnknownEdge) {
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
if (!VisitedEdges.count(E)) {
(*NumUnknownEdges)++;
*UnknownEdge = E;
return 0;
}
return EdgeWeights[E];
}
/// Propagate weights through incoming/outgoing edges.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// If the weight of a basic block is known, and there is only one edge
/// with an unknown weight, we can calculate the weight of that edge.
///
/// Similarly, if all the edges have a known count, we can calculate the
/// count of the basic block, if needed.
///
/// \param F Function to process.
/// \param UpdateBlockCount Whether we should update basic block counts that
/// has already been annotated.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// \returns True if new weights were assigned to edges or blocks.
bool SampleProfileLoader::propagateThroughEdges(Function &F,
bool UpdateBlockCount) {
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
bool Changed = false;
LLVM_DEBUG(dbgs() << "\nPropagation through edges\n");
for (const auto &BI : F) {
const BasicBlock *BB = &BI;
const BasicBlock *EC = EquivalenceClass[BB];
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
// Visit all the predecessor and successor edges to determine
// which ones have a weight assigned already. Note that it doesn't
// matter that we only keep track of a single unknown edge. The
// only case we are interested in handling is when only a single
// edge is unknown (see setEdgeOrBlockWeight).
for (unsigned i = 0; i < 2; i++) {
uint64_t TotalWeight = 0;
unsigned NumUnknownEdges = 0, NumTotalEdges = 0;
Edge UnknownEdge, SelfReferentialEdge, SingleEdge;
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
if (i == 0) {
// First, visit all predecessor edges.
NumTotalEdges = Predecessors[BB].size();
for (auto *Pred : Predecessors[BB]) {
Edge E = std::make_pair(Pred, BB);
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
TotalWeight += visitEdge(E, &NumUnknownEdges, &UnknownEdge);
if (E.first == E.second)
SelfReferentialEdge = E;
}
if (NumTotalEdges == 1) {
SingleEdge = std::make_pair(Predecessors[BB][0], BB);
}
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
} else {
// On the second round, visit all successor edges.
NumTotalEdges = Successors[BB].size();
for (auto *Succ : Successors[BB]) {
Edge E = std::make_pair(BB, Succ);
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
TotalWeight += visitEdge(E, &NumUnknownEdges, &UnknownEdge);
}
if (NumTotalEdges == 1) {
SingleEdge = std::make_pair(BB, Successors[BB][0]);
}
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
// After visiting all the edges, there are three cases that we
// can handle immediately:
//
// - All the edge weights are known (i.e., NumUnknownEdges == 0).
// In this case, we simply check that the sum of all the edges
// is the same as BB's weight. If not, we change BB's weight
// to match. Additionally, if BB had not been visited before,
// we mark it visited.
//
// - Only one edge is unknown and BB has already been visited.
// In this case, we can compute the weight of the edge by
// subtracting the total block weight from all the known
// edge weights. If the edges weight more than BB, then the
// edge of the last remaining edge is set to zero.
//
// - There exists a self-referential edge and the weight of BB is
// known. In this case, this edge can be based on BB's weight.
// We add up all the other known edges and set the weight on
// the self-referential edge as we did in the previous case.
//
// In any other case, we must continue iterating. Eventually,
// all edges will get a weight, or iteration will stop when
// it reaches SampleProfileMaxPropagateIterations.
if (NumUnknownEdges <= 1) {
uint64_t &BBWeight = BlockWeights[EC];
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
if (NumUnknownEdges == 0) {
if (!VisitedBlocks.count(EC)) {
// If we already know the weight of all edges, the weight of the
// basic block can be computed. It should be no larger than the sum
// of all edge weights.
if (TotalWeight > BBWeight) {
BBWeight = TotalWeight;
Changed = true;
LLVM_DEBUG(dbgs() << "All edge weights for " << BB->getName()
<< " known. Set weight for block: ";
printBlockWeight(dbgs(), BB););
}
} else if (NumTotalEdges == 1 &&
EdgeWeights[SingleEdge] < BlockWeights[EC]) {
// If there is only one edge for the visited basic block, use the
// block weight to adjust edge weight if edge weight is smaller.
EdgeWeights[SingleEdge] = BlockWeights[EC];
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
Changed = true;
}
} else if (NumUnknownEdges == 1 && VisitedBlocks.count(EC)) {
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
// If there is a single unknown edge and the block has been
// visited, then we can compute E's weight.
if (BBWeight >= TotalWeight)
EdgeWeights[UnknownEdge] = BBWeight - TotalWeight;
else
EdgeWeights[UnknownEdge] = 0;
const BasicBlock *OtherEC;
if (i == 0)
OtherEC = EquivalenceClass[UnknownEdge.first];
else
OtherEC = EquivalenceClass[UnknownEdge.second];
// Edge weights should never exceed the BB weights it connects.
if (VisitedBlocks.count(OtherEC) &&
EdgeWeights[UnknownEdge] > BlockWeights[OtherEC])
EdgeWeights[UnknownEdge] = BlockWeights[OtherEC];
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
VisitedEdges.insert(UnknownEdge);
Changed = true;
LLVM_DEBUG(dbgs() << "Set weight for edge: ";
printEdgeWeight(dbgs(), UnknownEdge));
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
} else if (VisitedBlocks.count(EC) && BlockWeights[EC] == 0) {
// If a block Weights 0, all its in/out edges should weight 0.
if (i == 0) {
for (auto *Pred : Predecessors[BB]) {
Edge E = std::make_pair(Pred, BB);
EdgeWeights[E] = 0;
VisitedEdges.insert(E);
}
} else {
for (auto *Succ : Successors[BB]) {
Edge E = std::make_pair(BB, Succ);
EdgeWeights[E] = 0;
VisitedEdges.insert(E);
}
}
} else if (SelfReferentialEdge.first && VisitedBlocks.count(EC)) {
uint64_t &BBWeight = BlockWeights[BB];
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
// We have a self-referential edge and the weight of BB is known.
if (BBWeight >= TotalWeight)
EdgeWeights[SelfReferentialEdge] = BBWeight - TotalWeight;
else
EdgeWeights[SelfReferentialEdge] = 0;
VisitedEdges.insert(SelfReferentialEdge);
Changed = true;
LLVM_DEBUG(dbgs() << "Set self-referential edge weight to: ";
printEdgeWeight(dbgs(), SelfReferentialEdge));
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
if (UpdateBlockCount && !VisitedBlocks.count(EC) && TotalWeight > 0) {
BlockWeights[EC] = TotalWeight;
VisitedBlocks.insert(EC);
Changed = true;
}
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
}
return Changed;
}
/// Build in/out edge lists for each basic block in the CFG.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// We are interested in unique edges. If a block B1 has multiple
/// edges to another block B2, we only add a single B1->B2 edge.
void SampleProfileLoader::buildEdges(Function &F) {
for (auto &BI : F) {
BasicBlock *B1 = &BI;
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
// Add predecessors for B1.
SmallPtrSet<BasicBlock *, 16> Visited;
if (!Predecessors[B1].empty())
llvm_unreachable("Found a stale predecessors list in a basic block.");
for (pred_iterator PI = pred_begin(B1), PE = pred_end(B1); PI != PE; ++PI) {
BasicBlock *B2 = *PI;
if (Visited.insert(B2).second)
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
Predecessors[B1].push_back(B2);
}
// Add successors for B1.
Visited.clear();
if (!Successors[B1].empty())
llvm_unreachable("Found a stale successors list in a basic block.");
for (succ_iterator SI = succ_begin(B1), SE = succ_end(B1); SI != SE; ++SI) {
BasicBlock *B2 = *SI;
if (Visited.insert(B2).second)
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
Successors[B1].push_back(B2);
}
}
}
/// Returns the sorted CallTargetMap \p M by count in descending order.
static SmallVector<InstrProfValueData, 2> GetSortedValueDataFromCallTargets(
const SampleRecord::CallTargetMap & M) {
SmallVector<InstrProfValueData, 2> R;
for (const auto &I : SampleRecord::SortCallTargets(M)) {
R.emplace_back(InstrProfValueData{FunctionSamples::getGUID(I.first), I.second});
}
return R;
}
/// Propagate weights into edges
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// The following rules are applied to every block BB in the CFG:
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// - If BB has a single predecessor/successor, then the weight
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
/// of that edge is the weight of the block.
///
/// - If all incoming or outgoing edges are known except one, and the
/// weight of the block is already known, the weight of the unknown
/// edge will be the weight of the block minus the sum of all the known
/// edges. If the sum of all the known edges is larger than BB's weight,
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
/// we set the unknown edge weight to zero.
///
/// - If there is a self-referential edge, and the weight of the block is
/// known, the weight for that edge is set to the weight of the block
/// minus the weight of the other incoming edges to that block (if
/// known).
void SampleProfileLoader::propagateWeights(Function &F) {
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
bool Changed = true;
unsigned I = 0;
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
// If BB weight is larger than its corresponding loop's header BB weight,
// use the BB weight to replace the loop header BB weight.
for (auto &BI : F) {
BasicBlock *BB = &BI;
Loop *L = LI->getLoopFor(BB);
if (!L) {
continue;
}
BasicBlock *Header = L->getHeader();
if (Header && BlockWeights[BB] > BlockWeights[Header]) {
BlockWeights[Header] = BlockWeights[BB];
}
}
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
// Before propagation starts, build, for each block, a list of
// unique predecessors and successors. This is necessary to handle
// identical edges in multiway branches. Since we visit all blocks and all
// edges of the CFG, it is cleaner to build these lists once at the start
// of the pass.
buildEdges(F);
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
// Propagate until we converge or we go past the iteration limit.
while (Changed && I++ < SampleProfileMaxPropagateIterations) {
Changed = propagateThroughEdges(F, false);
}
// The first propagation propagates BB counts from annotated BBs to unknown
// BBs. The 2nd propagation pass resets edges weights, and use all BB weights
// to propagate edge weights.
VisitedEdges.clear();
Changed = true;
while (Changed && I++ < SampleProfileMaxPropagateIterations) {
Changed = propagateThroughEdges(F, false);
}
// The 3rd propagation pass allows adjust annotated BB weights that are
// obviously wrong.
Changed = true;
while (Changed && I++ < SampleProfileMaxPropagateIterations) {
Changed = propagateThroughEdges(F, true);
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
// Generate MD_prof metadata for every branch instruction using the
// edge weights computed during propagation.
LLVM_DEBUG(dbgs() << "\nPropagation complete. Setting branch weights\n");
LLVMContext &Ctx = F.getContext();
MDBuilder MDB(Ctx);
for (auto &BI : F) {
BasicBlock *BB = &BI;
if (BlockWeights[BB]) {
for (auto &I : BB->getInstList()) {
if (!isa<CallInst>(I) && !isa<InvokeInst>(I))
continue;
if (!cast<CallBase>(I).getCalledFunction()) {
const DebugLoc &DLoc = I.getDebugLoc();
if (!DLoc)
continue;
const DILocation *DIL = DLoc;
const FunctionSamples *FS = findFunctionSamples(I);
if (!FS)
continue;
[CSSPGO] Consume pseudo-probe-based AutoFDO profile This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`. Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites). **Adjustment to profile format ** A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like ``` !CFGChecksum: 12345 ``` is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums. Differential Revision: https://reviews.llvm.org/D92347
2020-12-17 04:54:50 +08:00
auto CallSite = FunctionSamples::getCallSiteIdentifier(DIL);
auto T = FS->findCallTargetMapAt(CallSite);
if (!T || T.get().empty())
continue;
SmallVector<InstrProfValueData, 2> SortedCallTargets =
GetSortedValueDataFromCallTargets(T.get());
uint64_t Sum;
findIndirectCallFunctionSamples(I, Sum);
annotateValueSite(*I.getParent()->getParent()->getParent(), I,
SortedCallTargets, Sum, IPVK_IndirectCallTarget,
SortedCallTargets.size());
} else if (!isa<IntrinsicInst>(&I)) {
I.setMetadata(LLVMContext::MD_prof,
MDB.createBranchWeights(
{static_cast<uint32_t>(BlockWeights[BB])}));
}
}
}
Instruction *TI = BB->getTerminator();
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
if (TI->getNumSuccessors() == 1)
continue;
if (!isa<BranchInst>(TI) && !isa<SwitchInst>(TI))
continue;
DebugLoc BranchLoc = TI->getDebugLoc();
LLVM_DEBUG(dbgs() << "\nGetting weights for branch at line "
<< ((BranchLoc) ? Twine(BranchLoc.getLine())
: Twine("<UNKNOWN LOCATION>"))
<< ".\n");
SmallVector<uint32_t, 4> Weights;
uint32_t MaxWeight = 0;
Instruction *MaxDestInst;
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
for (unsigned I = 0; I < TI->getNumSuccessors(); ++I) {
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
BasicBlock *Succ = TI->getSuccessor(I);
Edge E = std::make_pair(BB, Succ);
uint64_t Weight = EdgeWeights[E];
LLVM_DEBUG(dbgs() << "\t"; printEdgeWeight(dbgs(), E));
// Use uint32_t saturated arithmetic to adjust the incoming weights,
// if needed. Sample counts in profiles are 64-bit unsigned values,
// but internally branch weights are expressed as 32-bit values.
if (Weight > std::numeric_limits<uint32_t>::max()) {
LLVM_DEBUG(dbgs() << " (saturated due to uint32_t overflow)");
Weight = std::numeric_limits<uint32_t>::max();
}
// Weight is added by one to avoid propagation errors introduced by
// 0 weights.
Weights.push_back(static_cast<uint32_t>(Weight + 1));
if (Weight != 0) {
if (Weight > MaxWeight) {
MaxWeight = Weight;
MaxDestInst = Succ->getFirstNonPHIOrDbgOrLifetime();
}
}
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
uint64_t TempWeight;
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
// Only set weights if there is at least one non-zero weight.
// In any other case, let the analyzer set weights.
// Do not set weights if the weights are present. In ThinLTO, the profile
// annotation is done twice. If the first annotation already set the
// weights, the second pass does not need to set it.
if (MaxWeight > 0 && !TI->extractProfTotalWeight(TempWeight)) {
LLVM_DEBUG(dbgs() << "SUCCESS. Found non-zero weights.\n");
TI->setMetadata(LLVMContext::MD_prof,
MDB.createBranchWeights(Weights));
ORE->emit([&]() {
return OptimizationRemark(DEBUG_TYPE, "PopularDest", MaxDestInst)
<< "most popular destination for conditional branches at "
<< ore::NV("CondBranchesLoc", BranchLoc);
});
} else {
LLVM_DEBUG(dbgs() << "SKIPPED. All branch weights are zero.\n");
}
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
}
/// Get the line number for the function header.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// This looks up function \p F in the current compilation unit and
/// retrieves the line number where the function is defined. This is
/// line 0 for all the samples read from the profile file. Every line
/// number is relative to this line.
///
/// \param F Function object to query.
///
/// \returns the line number where \p F is defined. If it returns 0,
/// it means that there is no debug information available for \p F.
unsigned SampleProfileLoader::getFunctionLoc(Function &F) {
if (DISubprogram *S = F.getSubprogram())
return S->getLine();
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
if (NoWarnSampleUnused)
return 0;
2015-10-28 02:41:46 +08:00
// If the start of \p F is missing, emit a diagnostic to inform the user
// about the missed opportunity.
F.getContext().diagnose(DiagnosticInfoSampleProfile(
"No debug information found in function " + F.getName() +
": Function profile not used",
DS_Warning));
return 0;
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
void SampleProfileLoader::computeDominanceAndLoopInfo(Function &F) {
DT.reset(new DominatorTree);
DT->recalculate(F);
PDT.reset(new PostDominatorTree(F));
LI.reset(new LoopInfo);
LI->analyze(*DT);
}
/// Generate branch weight metadata for all branches in \p F.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// Branch weights are computed out of instruction samples using a
/// propagation heuristic. Propagation proceeds in 3 phases:
///
/// 1- Assignment of block weights. All the basic blocks in the function
/// are initial assigned the same weight as their most frequently
/// executed instruction.
///
/// 2- Creation of equivalence classes. Since samples may be missing from
/// blocks, we can fill in the gaps by setting the weights of all the
/// blocks in the same equivalence class to the same weight. To compute
/// the concept of equivalence, we use dominance and loop information.
/// Two blocks B1 and B2 are in the same equivalence class if B1
/// dominates B2, B2 post-dominates B1 and both are in the same loop.
///
/// 3- Propagation of block weights into edges. This uses a simple
/// propagation heuristic. The following rules are applied to every
/// block BB in the CFG:
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// - If BB has a single predecessor/successor, then the weight
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
/// of that edge is the weight of the block.
///
/// - If all the edges are known except one, and the weight of the
/// block is already known, the weight of the unknown edge will
/// be the weight of the block minus the sum of all the known
/// edges. If the sum of all the known edges is larger than BB's weight,
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
/// we set the unknown edge weight to zero.
///
/// - If there is a self-referential edge, and the weight of the block is
/// known, the weight for that edge is set to the weight of the block
/// minus the weight of the other incoming edges to that block (if
/// known).
///
/// Since this propagation is not guaranteed to finalize for every CFG, we
/// only allow it to proceed for a limited number of iterations (controlled
/// by -sample-profile-max-propagate-iterations).
///
/// FIXME: Try to replace this propagation heuristic with a scheme
/// that is guaranteed to finalize. A work-list approach similar to
/// the standard value propagation algorithm used by SSA-CCP might
/// work here.
///
/// Once all the branch weights are computed, we emit the MD_prof
/// metadata on BB using the computed values for each of its branches.
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
///
/// \param F The function to query.
///
/// \returns true if \p F was modified. Returns false, otherwise.
bool SampleProfileLoader::emitAnnotations(Function &F) {
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
bool Changed = false;
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
[CSSPGO] Consume pseudo-probe-based AutoFDO profile This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`. Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites). **Adjustment to profile format ** A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like ``` !CFGChecksum: 12345 ``` is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums. Differential Revision: https://reviews.llvm.org/D92347
2020-12-17 04:54:50 +08:00
if (FunctionSamples::ProfileIsProbeBased) {
if (!ProbeManager->profileIsValid(F, *Samples)) {
LLVM_DEBUG(
dbgs() << "Profile is invalid due to CFG mismatch for Function "
<< F.getName());
++NumMismatchedProfile;
return false;
}
++NumMatchedProfile;
} else {
if (getFunctionLoc(F) == 0)
return false;
[CSSPGO] Consume pseudo-probe-based AutoFDO profile This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`. Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites). **Adjustment to profile format ** A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like ``` !CFGChecksum: 12345 ``` is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums. Differential Revision: https://reviews.llvm.org/D92347
2020-12-17 04:54:50 +08:00
LLVM_DEBUG(dbgs() << "Line number for the first instruction in "
<< F.getName() << ": " << getFunctionLoc(F) << "\n");
}
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
DenseSet<GlobalValue::GUID> InlinedGUIDs;
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
if (ProfileIsCS && CallsitePrioritizedInline)
Changed |= inlineHotFunctionsWithPriority(F, InlinedGUIDs);
else
Changed |= inlineHotFunctions(F, InlinedGUIDs);
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
// Compute basic block weights.
Changed |= computeBlockWeights(F);
if (Changed) {
// Add an entry count to the function using the samples gathered at the
// function entry.
// Sets the GUIDs that are inlined in the profiled binary. This is used
// for ThinLink to make correct liveness analysis, and also make the IR
// match the profiled binary before annotation.
F.setEntryCount(
ProfileCount(Samples->getHeadSamples() + 1, Function::PCT_Real),
&InlinedGUIDs);
// Compute dominance and loop info needed for propagation.
computeDominanceAndLoopInfo(F);
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
// Find equivalence classes.
findEquivalenceClasses(F);
// Propagate weights to all edges.
propagateWeights(F);
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
}
// If coverage checking was requested, compute it now.
if (SampleProfileRecordCoverage) {
unsigned Used = CoverageTracker.countUsedRecords(Samples, PSI);
unsigned Total = CoverageTracker.countBodyRecords(Samples, PSI);
unsigned Coverage = CoverageTracker.computeCoverage(Used, Total);
if (Coverage < SampleProfileRecordCoverage) {
F.getContext().diagnose(DiagnosticInfoSampleProfile(
F.getSubprogram()->getFilename(), getFunctionLoc(F),
Twine(Used) + " of " + Twine(Total) + " available profile records (" +
Twine(Coverage) + "%) were applied",
DS_Warning));
}
}
if (SampleProfileSampleCoverage) {
uint64_t Used = CoverageTracker.getTotalUsedSamples();
uint64_t Total = CoverageTracker.countBodySamples(Samples, PSI);
unsigned Coverage = CoverageTracker.computeCoverage(Used, Total);
if (Coverage < SampleProfileSampleCoverage) {
F.getContext().diagnose(DiagnosticInfoSampleProfile(
F.getSubprogram()->getFilename(), getFunctionLoc(F),
Twine(Used) + " of " + Twine(Total) + " available profile samples (" +
Twine(Coverage) + "%) were applied",
DS_Warning));
}
}
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
return Changed;
}
char SampleProfileLoaderLegacyPass::ID = 0;
INITIALIZE_PASS_BEGIN(SampleProfileLoaderLegacyPass, "sample-profile",
"Sample Profile loader", false, false)
INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
INITIALIZE_PASS_DEPENDENCY(TargetLibraryInfoWrapperPass)
INITIALIZE_PASS_DEPENDENCY(ProfileSummaryInfoWrapperPass)
INITIALIZE_PASS_END(SampleProfileLoaderLegacyPass, "sample-profile",
"Sample Profile loader", false, false)
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
std::vector<Function *>
SampleProfileLoader::buildFunctionOrder(Module &M, CallGraph *CG) {
std::vector<Function *> FunctionOrderList;
FunctionOrderList.reserve(M.size());
if (!ProfileTopDownLoad || CG == nullptr) {
if (ProfileMergeInlinee) {
// Disable ProfileMergeInlinee if profile is not loaded in top down order,
// because the profile for a function may be used for the profile
// annotation of its outline copy before the profile merging of its
// non-inlined inline instances, and that is not the way how
// ProfileMergeInlinee is supposed to work.
ProfileMergeInlinee = false;
}
for (Function &F : M)
if (!F.isDeclaration() && F.hasFnAttribute("use-sample-profile"))
FunctionOrderList.push_back(&F);
return FunctionOrderList;
}
assert(&CG->getModule() == &M);
scc_iterator<CallGraph *> CGI = scc_begin(CG);
while (!CGI.isAtEnd()) {
for (CallGraphNode *node : *CGI) {
auto F = node->getFunction();
if (F && !F->isDeclaration() && F->hasFnAttribute("use-sample-profile"))
FunctionOrderList.push_back(F);
}
++CGI;
}
std::reverse(FunctionOrderList.begin(), FunctionOrderList.end());
return FunctionOrderList;
}
bool SampleProfileLoader::doInitialization(Module &M,
FunctionAnalysisManager *FAM) {
auto &Ctx = M.getContext();
auto ReaderOrErr =
SampleProfileReader::create(Filename, Ctx, RemappingFilename);
if (std::error_code EC = ReaderOrErr.getError()) {
std::string Msg = "Could not open profile: " + EC.message();
Ctx.diagnose(DiagnosticInfoSampleProfile(Filename, Msg));
return false;
}
Reader = std::move(ReaderOrErr.get());
Reader->setSkipFlatProf(LTOPhase == ThinOrFullLTOPhase::ThinLTOPostLink);
Reader->collectFuncsFrom(M);
if (std::error_code EC = Reader->read()) {
std::string Msg = "profile reading failed: " + EC.message();
Ctx.diagnose(DiagnosticInfoSampleProfile(Filename, Msg));
return false;
}
PSL = Reader->getProfileSymbolList();
// While profile-sample-accurate is on, ignore symbol list.
ProfAccForSymsInList =
ProfileAccurateForSymsInList && PSL && !ProfileSampleAccurate;
if (ProfAccForSymsInList) {
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
NamesInProfile.clear();
if (auto NameTable = Reader->getNameTable())
NamesInProfile.insert(NameTable->begin(), NameTable->end());
}
if (FAM && !ProfileInlineReplayFile.empty()) {
ExternalInlineAdvisor = std::make_unique<ReplayInlineAdvisor>(
[InlineAdvisor] Allow replay of inline decisions for the CGSCC inliner from optimization remarks This change leverages the work done in D83743 to replay in the SampleProfile inliner to also be used in the CGSCC inliner. NOTE: currently restricted to non-ML advisors only. The added switch `-cgscc-inline-replay=<remarks file>` will replay the inlining decisions in that file where the remarks file is generated via `-Rpass=inline`. The aim here is to make it easier to analyze changes that would modify inlining heuristics to be separated from this behavior. Doing so allows easier examination of assembly and runtime behavior compared to the baseline rather than trying to dig through the large churn caused by inlining. In LTO compilation, since inlining is done twice you can separately specify replay by passing the flag to the FE (`-cgscc-inline-replay=`) and to the linker (`-Wl,cgscc-inline-replay=`) with the remarks generated from their respective places. Testing on mysqld by comparing the inline decisions between base (generates remarks.txt) and diff (replay using identical input/tools with remarks.txt) and examining the inlining sites with `diff` shows 14,000 mismatches out of 247,341 for a ~94% replay accuracy. I believe this gap can be narrowed further though for the general case we may never achieve full accuracy. For my personal use, this is close enough to be representative: I set the baseline as the one generated by the replay on identical input/toolset and compare that to my modified input/toolset using the same replay. Testing: ninja check-llvm newly added test correctly replays CGSCC inlining decisions Reviewed By: mtrofin, wenlei Differential Revision: https://reviews.llvm.org/D94334
2021-01-26 07:25:39 +08:00
M, *FAM, Ctx, /*OriginalAdvisor=*/nullptr, ProfileInlineReplayFile,
/*EmitRemarks=*/false);
if (!ExternalInlineAdvisor->areReplayRemarksLoaded())
ExternalInlineAdvisor.reset();
}
[CSSPGO] Infrastructure for context-sensitive Sample PGO and Inlining This change adds the context-senstive sample PGO infracture described in CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduced an abstraction between input profile and profile loader that queries input profile for functions. Specifically, there's now the notion of base profile and context profile, and they are managed by the new SampleContextTracker for adjusting and merging profiles based on inline decisions. It works with top-down profiled guided inliner in profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to CGSCC inliner in order for it to take advantage of context-sensitive profile. This change is the consumption part of context-sensitive profile (The generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Pacthes for integration with Pseudo-probe coming up soon. Currently the new infrastructure kick in when input profile contains the new context-sensitive profile; otherwise it's no-op and does not affect existing AutoFDO. **Interface** There're two sets of interfaces for query and tracking respectively exposed from SampleContextTracker. For query, now instead of simply getting a profile from input for a function, we can explicitly query base profile or context profile for given call path of a function. For tracking, there're separate APIs for marking context profile as inlined, or promoting and merging not inlined context profile. - Query base profile (`getBaseSamplesFor`) Base profile is the merged synthetic profile for function's CFG profile from any outstanding (not inlined) context. We can query base profile by function. - Query context profile (`getContextSamplesFor`) Context profile is a function's CFG profile for a given calling context. We can query context profile by context string. - Track inlined context profile (`markContextSamplesInlined`) When a function is inlined for given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing base profile for that inlined function. - Track not-inlined context profile (`promoteMergeContextSamplesTree`) When a function is not inlined for given calling context, we need to promote the context profile tree so the not inlined context becomes top-level context. This preserve the sub-context under that function so later inline decision for that not inlined function will still have context profile for its call tree. Note that profile will be merged if needed when promoting a context profile tree if any of the node already exists at its promoted destination. **Implementation** Implementation-wise, `SampleContext` is created as abstraction for context. Currently it's a string for call path, and we can later optimize it to something more efficient, e.g. context id. Each `SampleContext` also has a `ContextState` indicating whether it's raw context profile from input, whether it's inlined or merged, whether it's synthetic profile created by compiler. Each `FunctionSamples` now has a `SampleContext` that tells whether it's base profile or context profile, and for context profile what is the context and state. On top of the above context representation, a custom trie tree is implemented to track and manager context profiles. Specifically, `SampleContextTracker` is implemented that encapsulates a trie tree with `ContextTireNode` as node. Each node of the trie tree represents a frame in calling context, thus the path from root to a node represents a valid calling context. We also track `FunctionSamples` for each node, so this trie tree can serve efficient query for context profile. Accordingly, context profile tree promotion now becomes moving a subtree to be under the root of entire tree, and merge nodes for subtree if this move encounters existing nodes. **Integration** `SampleContextTracker` is now also integrated with AutoFDO, `SampleProfileReader` and `SampleProfileLoader`. When we detected input profile contains context-sensitive profile, `SampleContextTracker` will be used to track profiles, and all profile query will go to `SampleContextTracker` instead of `SampleProfileReader` automatically. Tracking APIs are called automatically for each inline decision from `SampleProfileLoader`. Differential Revision: https://reviews.llvm.org/D90125
2020-03-24 14:50:41 +08:00
// Apply tweaks if context-sensitive profile is available.
if (Reader->profileIsCS()) {
ProfileIsCS = true;
FunctionSamples::ProfileIsCS = true;
[CSSPGO] Call site prioritized inlining for sample PGO This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`). Motivation With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO. - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat. - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately. - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO. New FDO Inliner Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed. - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655. - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check. - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites. - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path. Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO). Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO. Tunings and knobs I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO. Results Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it). Differential Revision: https://reviews.llvm.org/D94001
2021-01-04 08:43:06 +08:00
// Enable priority-base inliner and size inline by default for CSSPGO.
if (!ProfileSizeInline.getNumOccurrences())
ProfileSizeInline = true;
if (!CallsitePrioritizedInline.getNumOccurrences())
CallsitePrioritizedInline = true;
[CSSPGO] Infrastructure for context-sensitive Sample PGO and Inlining This change adds the context-senstive sample PGO infracture described in CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduced an abstraction between input profile and profile loader that queries input profile for functions. Specifically, there's now the notion of base profile and context profile, and they are managed by the new SampleContextTracker for adjusting and merging profiles based on inline decisions. It works with top-down profiled guided inliner in profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to CGSCC inliner in order for it to take advantage of context-sensitive profile. This change is the consumption part of context-sensitive profile (The generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Pacthes for integration with Pseudo-probe coming up soon. Currently the new infrastructure kick in when input profile contains the new context-sensitive profile; otherwise it's no-op and does not affect existing AutoFDO. **Interface** There're two sets of interfaces for query and tracking respectively exposed from SampleContextTracker. For query, now instead of simply getting a profile from input for a function, we can explicitly query base profile or context profile for given call path of a function. For tracking, there're separate APIs for marking context profile as inlined, or promoting and merging not inlined context profile. - Query base profile (`getBaseSamplesFor`) Base profile is the merged synthetic profile for function's CFG profile from any outstanding (not inlined) context. We can query base profile by function. - Query context profile (`getContextSamplesFor`) Context profile is a function's CFG profile for a given calling context. We can query context profile by context string. - Track inlined context profile (`markContextSamplesInlined`) When a function is inlined for given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing base profile for that inlined function. - Track not-inlined context profile (`promoteMergeContextSamplesTree`) When a function is not inlined for given calling context, we need to promote the context profile tree so the not inlined context becomes top-level context. This preserve the sub-context under that function so later inline decision for that not inlined function will still have context profile for its call tree. Note that profile will be merged if needed when promoting a context profile tree if any of the node already exists at its promoted destination. **Implementation** Implementation-wise, `SampleContext` is created as abstraction for context. Currently it's a string for call path, and we can later optimize it to something more efficient, e.g. context id. Each `SampleContext` also has a `ContextState` indicating whether it's raw context profile from input, whether it's inlined or merged, whether it's synthetic profile created by compiler. Each `FunctionSamples` now has a `SampleContext` that tells whether it's base profile or context profile, and for context profile what is the context and state. On top of the above context representation, a custom trie tree is implemented to track and manager context profiles. Specifically, `SampleContextTracker` is implemented that encapsulates a trie tree with `ContextTireNode` as node. Each node of the trie tree represents a frame in calling context, thus the path from root to a node represents a valid calling context. We also track `FunctionSamples` for each node, so this trie tree can serve efficient query for context profile. Accordingly, context profile tree promotion now becomes moving a subtree to be under the root of entire tree, and merge nodes for subtree if this move encounters existing nodes. **Integration** `SampleContextTracker` is now also integrated with AutoFDO, `SampleProfileReader` and `SampleProfileLoader`. When we detected input profile contains context-sensitive profile, `SampleContextTracker` will be used to track profiles, and all profile query will go to `SampleContextTracker` instead of `SampleProfileReader` automatically. Tracking APIs are called automatically for each inline decision from `SampleProfileLoader`. Differential Revision: https://reviews.llvm.org/D90125
2020-03-24 14:50:41 +08:00
// Tracker for profiles under different context
ContextTracker =
std::make_unique<SampleContextTracker>(Reader->getProfiles());
}
[CSSPGO] Consume pseudo-probe-based AutoFDO profile This change enables pseudo-probe-based sample counts to be consumed by the sample profile loader under the regular `-fprofile-sample-use` switch with minimal adjustments to the existing sample file formats. After the counts are imported, a probe helper, aka, a `PseudoProbeManager` object, is automatically launched to verify the CFG checksum of every function in the current compilation against the corresponding checksum from the profile. Mismatched checksums will cause a function profile to be slipped. A `SampleProfileProber` pass is scheduled before any of the `SampleProfileLoader` instances so that the CFG checksums as well as probe mappings are available during the profile loading time. The `PseudoProbeManager` object is set up right after the profile reading is done. In the future a CFG-based fuzzy matching could be done in `PseudoProbeManager`. Samples will be applied only to pseudo probe instructions as well as probed callsites once the checksum verification goes through. Those instructions are processed in the same way that regular instructions would be processed in the line-number-based scenario. In other words, a function is processed in a regular way as if it was reduced to just containing pseudo probes (block probes and callsites). **Adjustment to profile format ** A CFG checksum field is being added to the existing AutoFDO profile formats. So far only the text format and the extended binary format are supported. For the text format, a new line like ``` !CFGChecksum: 12345 ``` is added to the end of the body sample lines. For the extended binary profile format, we introduce a metadata section to store the checksum map from function names to their CFG checksums. Differential Revision: https://reviews.llvm.org/D92347
2020-12-17 04:54:50 +08:00
// Load pseudo probe descriptors for probe-based function samples.
if (Reader->profileIsProbeBased()) {
ProbeManager = std::make_unique<PseudoProbeManager>(M);
if (!ProbeManager->moduleIsProbed(M)) {
const char *Msg =
"Pseudo-probe-based profile requires SampleProfileProbePass";
Ctx.diagnose(DiagnosticInfoSampleProfile(Filename, Msg));
return false;
}
}
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
return true;
}
ModulePass *llvm::createSampleProfileLoaderPass() {
return new SampleProfileLoaderLegacyPass();
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
ModulePass *llvm::createSampleProfileLoaderPass(StringRef Name) {
return new SampleProfileLoaderLegacyPass(Name);
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
}
bool SampleProfileLoader::runOnModule(Module &M, ModuleAnalysisManager *AM,
ProfileSummaryInfo *_PSI, CallGraph *CG) {
GUIDToFuncNameMapper Mapper(M, *Reader, GUIDToFuncNameMap);
PSI = _PSI;
if (M.getProfileSummary(/* IsCS */ false) == nullptr) {
M.setProfileSummary(Reader->getSummary().getMD(M.getContext()),
ProfileSummary::PSK_Sample);
PSI->refresh();
}
// Compute the total number of samples collected in this profile.
for (const auto &I : Reader->getProfiles())
TotalCollectedSamples += I.second.getTotalSamples();
[SampleFDO] Enhance profile remapping support for searching inline instance and indirect call promotion candidate. Profile remapping is a feature to match a function in the module with its profile in sample profile if the function name and the name in profile look different but are equivalent using given remapping rules. This is a useful feature to keep the performance stable by specifying some remapping rules when sampleFDO targets are going through some large scale function signature change. However, currently profile remapping support is only valid for outline function profile in SampleFDO. It cannot match a callee with an inline instance profile if they have different but equivalent names. We found that without the support for inline instance profile, remapping is less effective for some large scale change. To add that support, before any remapping lookup happens, all the names in the profile will be inserted into remapper and the Key to the name mapping will be recorded in a map called NameMap in the remapper. During name lookup, a Key will be returned for the given name and it will be used to extract an equivalent name in the profile from NameMap. So with the help of the NameMap, we can translate any given name to an equivalent name in the profile if it exists. Whenever we try to match a name in the module to a name in the profile, we will try the match with the original name first, and if it doesn't match, we will use the equivalent name got from remapper to try the match for another time. In this way, the patch can enhance the profile remapping support for searching inline instance and searching indirect call promotion candidate. In a planned large scale change of int64 type (long long) to int64_t (long), we found the performance of a google internal benchmark degraded by 2% if nothing was done. If existing profile remapping was enabled, the performance degradation dropped to 1.2%. If the profile remapping with the current patch was enabled, the performance degradation further dropped to 0.14% (Note the experiment was done before searching indirect call promotion candidate was added. We hope with the remapping support of searching indirect call promotion candidate, the degradation can drop to 0% in the end. It will be evaluated post commit). Differential Revision: https://reviews.llvm.org/D86332
2020-08-25 13:59:20 +08:00
auto Remapper = Reader->getRemapper();
// Populate the symbol map.
for (const auto &N_F : M.getValueSymbolTable()) {
StringRef OrigName = N_F.getKey();
Function *F = dyn_cast<Function>(N_F.getValue());
if (F == nullptr)
continue;
SymbolMap[OrigName] = F;
auto pos = OrigName.find('.');
if (pos != StringRef::npos) {
StringRef NewName = OrigName.substr(0, pos);
auto r = SymbolMap.insert(std::make_pair(NewName, F));
// Failiing to insert means there is already an entry in SymbolMap,
// thus there are multiple functions that are mapped to the same
// stripped name. In this case of name conflicting, set the value
// to nullptr to avoid confusion.
if (!r.second)
r.first->second = nullptr;
[SampleFDO] Enhance profile remapping support for searching inline instance and indirect call promotion candidate. Profile remapping is a feature to match a function in the module with its profile in sample profile if the function name and the name in profile look different but are equivalent using given remapping rules. This is a useful feature to keep the performance stable by specifying some remapping rules when sampleFDO targets are going through some large scale function signature change. However, currently profile remapping support is only valid for outline function profile in SampleFDO. It cannot match a callee with an inline instance profile if they have different but equivalent names. We found that without the support for inline instance profile, remapping is less effective for some large scale change. To add that support, before any remapping lookup happens, all the names in the profile will be inserted into remapper and the Key to the name mapping will be recorded in a map called NameMap in the remapper. During name lookup, a Key will be returned for the given name and it will be used to extract an equivalent name in the profile from NameMap. So with the help of the NameMap, we can translate any given name to an equivalent name in the profile if it exists. Whenever we try to match a name in the module to a name in the profile, we will try the match with the original name first, and if it doesn't match, we will use the equivalent name got from remapper to try the match for another time. In this way, the patch can enhance the profile remapping support for searching inline instance and searching indirect call promotion candidate. In a planned large scale change of int64 type (long long) to int64_t (long), we found the performance of a google internal benchmark degraded by 2% if nothing was done. If existing profile remapping was enabled, the performance degradation dropped to 1.2%. If the profile remapping with the current patch was enabled, the performance degradation further dropped to 0.14% (Note the experiment was done before searching indirect call promotion candidate was added. We hope with the remapping support of searching indirect call promotion candidate, the degradation can drop to 0% in the end. It will be evaluated post commit). Differential Revision: https://reviews.llvm.org/D86332
2020-08-25 13:59:20 +08:00
OrigName = NewName;
}
// Insert the remapped names into SymbolMap.
if (Remapper) {
if (auto MapName = Remapper->lookUpNameInProfile(OrigName)) {
if (*MapName == OrigName)
continue;
SymbolMap.insert(std::make_pair(*MapName, F));
}
}
}
bool retval = false;
for (auto F : buildFunctionOrder(M, CG)) {
assert(!F->isDeclaration());
clearFunctionData();
retval |= runOnFunction(*F, AM);
}
// Account for cold calls not inlined....
[CSSPGO] Infrastructure for context-sensitive Sample PGO and Inlining This change adds the context-senstive sample PGO infracture described in CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduced an abstraction between input profile and profile loader that queries input profile for functions. Specifically, there's now the notion of base profile and context profile, and they are managed by the new SampleContextTracker for adjusting and merging profiles based on inline decisions. It works with top-down profiled guided inliner in profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to CGSCC inliner in order for it to take advantage of context-sensitive profile. This change is the consumption part of context-sensitive profile (The generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Pacthes for integration with Pseudo-probe coming up soon. Currently the new infrastructure kick in when input profile contains the new context-sensitive profile; otherwise it's no-op and does not affect existing AutoFDO. **Interface** There're two sets of interfaces for query and tracking respectively exposed from SampleContextTracker. For query, now instead of simply getting a profile from input for a function, we can explicitly query base profile or context profile for given call path of a function. For tracking, there're separate APIs for marking context profile as inlined, or promoting and merging not inlined context profile. - Query base profile (`getBaseSamplesFor`) Base profile is the merged synthetic profile for function's CFG profile from any outstanding (not inlined) context. We can query base profile by function. - Query context profile (`getContextSamplesFor`) Context profile is a function's CFG profile for a given calling context. We can query context profile by context string. - Track inlined context profile (`markContextSamplesInlined`) When a function is inlined for given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing base profile for that inlined function. - Track not-inlined context profile (`promoteMergeContextSamplesTree`) When a function is not inlined for given calling context, we need to promote the context profile tree so the not inlined context becomes top-level context. This preserve the sub-context under that function so later inline decision for that not inlined function will still have context profile for its call tree. Note that profile will be merged if needed when promoting a context profile tree if any of the node already exists at its promoted destination. **Implementation** Implementation-wise, `SampleContext` is created as abstraction for context. Currently it's a string for call path, and we can later optimize it to something more efficient, e.g. context id. Each `SampleContext` also has a `ContextState` indicating whether it's raw context profile from input, whether it's inlined or merged, whether it's synthetic profile created by compiler. Each `FunctionSamples` now has a `SampleContext` that tells whether it's base profile or context profile, and for context profile what is the context and state. On top of the above context representation, a custom trie tree is implemented to track and manager context profiles. Specifically, `SampleContextTracker` is implemented that encapsulates a trie tree with `ContextTireNode` as node. Each node of the trie tree represents a frame in calling context, thus the path from root to a node represents a valid calling context. We also track `FunctionSamples` for each node, so this trie tree can serve efficient query for context profile. Accordingly, context profile tree promotion now becomes moving a subtree to be under the root of entire tree, and merge nodes for subtree if this move encounters existing nodes. **Integration** `SampleContextTracker` is now also integrated with AutoFDO, `SampleProfileReader` and `SampleProfileLoader`. When we detected input profile contains context-sensitive profile, `SampleContextTracker` will be used to track profiles, and all profile query will go to `SampleContextTracker` instead of `SampleProfileReader` automatically. Tracking APIs are called automatically for each inline decision from `SampleProfileLoader`. Differential Revision: https://reviews.llvm.org/D90125
2020-03-24 14:50:41 +08:00
if (!ProfileIsCS)
for (const std::pair<Function *, NotInlinedProfileInfo> &pair :
notInlinedCallInfo)
updateProfileCallee(pair.first, pair.second.entryCount);
return retval;
}
bool SampleProfileLoaderLegacyPass::runOnModule(Module &M) {
ACT = &getAnalysis<AssumptionCacheTracker>();
TTIWP = &getAnalysis<TargetTransformInfoWrapperPass>();
TLIWP = &getAnalysis<TargetLibraryInfoWrapperPass>();
ProfileSummaryInfo *PSI =
&getAnalysis<ProfileSummaryInfoWrapperPass>().getPSI();
return SampleLoader.runOnModule(M, nullptr, PSI, nullptr);
}
bool SampleProfileLoader::runOnFunction(Function &F, ModuleAnalysisManager *AM) {
DILocation2SampleMap.clear();
// By default the entry count is initialized to -1, which will be treated
// conservatively by getEntryCount as the same as unknown (None). This is
// to avoid newly added code to be treated as cold. If we have samples
// this will be overwritten in emitAnnotations.
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
uint64_t initialEntryCount = -1;
ProfAccForSymsInList = ProfileAccurateForSymsInList && PSL;
if (ProfileSampleAccurate || F.hasFnAttribute("profile-sample-accurate")) {
// initialize all the function entry counts to 0. It means all the
// functions without profile will be regarded as cold.
initialEntryCount = 0;
// profile-sample-accurate is a user assertion which has a higher precedence
// than symbol list. When profile-sample-accurate is on, ignore symbol list.
ProfAccForSymsInList = false;
}
// PSL -- profile symbol list include all the symbols in sampled binary.
// If ProfileAccurateForSymsInList is enabled, PSL is used to treat
// old functions without samples being cold, without having to worry
// about new and hot functions being mistakenly treated as cold.
if (ProfAccForSymsInList) {
// Initialize the entry count to 0 for functions in the list.
if (PSL->contains(F.getName()))
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
initialEntryCount = 0;
// Function in the symbol list but without sample will be regarded as
// cold. To minimize the potential negative performance impact it could
// have, we want to be a little conservative here saying if a function
// shows up in the profile, no matter as outline function, inline instance
// or call targets, treat the function as not being cold. This will handle
// the cases such as most callsites of a function are inlined in sampled
// binary but not inlined in current build (because of source code drift,
// imprecise debug information, or the callsites are all cold individually
// but not cold accumulatively...), so the outline function showing up as
// cold in sampled binary will actually not be cold after current build.
StringRef CanonName = FunctionSamples::getCanonicalFnName(F);
if (NamesInProfile.count(CanonName))
initialEntryCount = -1;
[SampleFDO] Minimize performance impact when profile-sample-accurate is enabled. We can save memory and reduce binary size significantly by enabling ProfileSampleAccurate. However when ProfileSampleAccurate is true, function without sample will be regarded as cold and this could potentially cause performance regression. To minimize the potential negative performance impact, we want to be a little conservative here saying if a function shows up in the profile, no matter as outline instance, inline instance or call targets, treat the function as not being cold. This will handle the cases such as most callsites of a function are inlined in sampled binary (thus outline copy don't get any sample) but not inlined in current build (because of source code drift, imprecise debug information, or the callsites are all cold individually but not cold accumulatively...), so that the outline function showing up as cold in sampled binary will actually not be cold after current build. After the change, such function will be treated as not cold even profile-sample-accurate is enabled. At the same time we lower the hot criteria of callsiteIsHot check when profile-sample-accurate is enabled. callsiteIsHot is used to determined whether a callsite is hot and qualified for early inlining. When profile-sample-accurate is enabled, functions without profile will be regarded as cold and much less inlining will happen in CGSCC inlining pass, so we can worry less about size increase and be aggressive to allow more early inlining to happen for warm callsites and it is helpful for performance overall. Differential Revision: https://reviews.llvm.org/D67561 llvm-svn: 372232
2019-09-19 00:06:28 +08:00
}
// Initialize entry count when the function has no existing entry
// count value.
if (!F.getEntryCount().hasValue())
F.setEntryCount(ProfileCount(initialEntryCount, Function::PCT_Real));
std::unique_ptr<OptimizationRemarkEmitter> OwnedORE;
if (AM) {
auto &FAM =
AM->getResult<FunctionAnalysisManagerModuleProxy>(*F.getParent())
.getManager();
ORE = &FAM.getResult<OptimizationRemarkEmitterAnalysis>(F);
} else {
OwnedORE = std::make_unique<OptimizationRemarkEmitter>(&F);
ORE = OwnedORE.get();
}
[CSSPGO] Infrastructure for context-sensitive Sample PGO and Inlining This change adds the context-senstive sample PGO infracture described in CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduced an abstraction between input profile and profile loader that queries input profile for functions. Specifically, there's now the notion of base profile and context profile, and they are managed by the new SampleContextTracker for adjusting and merging profiles based on inline decisions. It works with top-down profiled guided inliner in profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to CGSCC inliner in order for it to take advantage of context-sensitive profile. This change is the consumption part of context-sensitive profile (The generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Pacthes for integration with Pseudo-probe coming up soon. Currently the new infrastructure kick in when input profile contains the new context-sensitive profile; otherwise it's no-op and does not affect existing AutoFDO. **Interface** There're two sets of interfaces for query and tracking respectively exposed from SampleContextTracker. For query, now instead of simply getting a profile from input for a function, we can explicitly query base profile or context profile for given call path of a function. For tracking, there're separate APIs for marking context profile as inlined, or promoting and merging not inlined context profile. - Query base profile (`getBaseSamplesFor`) Base profile is the merged synthetic profile for function's CFG profile from any outstanding (not inlined) context. We can query base profile by function. - Query context profile (`getContextSamplesFor`) Context profile is a function's CFG profile for a given calling context. We can query context profile by context string. - Track inlined context profile (`markContextSamplesInlined`) When a function is inlined for given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing base profile for that inlined function. - Track not-inlined context profile (`promoteMergeContextSamplesTree`) When a function is not inlined for given calling context, we need to promote the context profile tree so the not inlined context becomes top-level context. This preserve the sub-context under that function so later inline decision for that not inlined function will still have context profile for its call tree. Note that profile will be merged if needed when promoting a context profile tree if any of the node already exists at its promoted destination. **Implementation** Implementation-wise, `SampleContext` is created as abstraction for context. Currently it's a string for call path, and we can later optimize it to something more efficient, e.g. context id. Each `SampleContext` also has a `ContextState` indicating whether it's raw context profile from input, whether it's inlined or merged, whether it's synthetic profile created by compiler. Each `FunctionSamples` now has a `SampleContext` that tells whether it's base profile or context profile, and for context profile what is the context and state. On top of the above context representation, a custom trie tree is implemented to track and manager context profiles. Specifically, `SampleContextTracker` is implemented that encapsulates a trie tree with `ContextTireNode` as node. Each node of the trie tree represents a frame in calling context, thus the path from root to a node represents a valid calling context. We also track `FunctionSamples` for each node, so this trie tree can serve efficient query for context profile. Accordingly, context profile tree promotion now becomes moving a subtree to be under the root of entire tree, and merge nodes for subtree if this move encounters existing nodes. **Integration** `SampleContextTracker` is now also integrated with AutoFDO, `SampleProfileReader` and `SampleProfileLoader`. When we detected input profile contains context-sensitive profile, `SampleContextTracker` will be used to track profiles, and all profile query will go to `SampleContextTracker` instead of `SampleProfileReader` automatically. Tracking APIs are called automatically for each inline decision from `SampleProfileLoader`. Differential Revision: https://reviews.llvm.org/D90125
2020-03-24 14:50:41 +08:00
if (ProfileIsCS)
Samples = ContextTracker->getBaseSamplesFor(F);
else
Samples = Reader->getSamplesFor(F);
SamplePGO ThinLTO ICP fix for local functions. Summary: In SamplePGO, if the profile is collected from non-LTO binary, and used to drive ThinLTO, the indirect call promotion may fail because ThinLTO adjusts local function names to avoid conflicts. There are two places of where the mismatch can happen: 1. thin-link prepends SourceFileName to front of FuncName to build the GUID (GlobalValue::getGlobalIdentifier). Unlike instrumentation FDO, SamplePGO does not use the PGOFuncName scheme and therefore the indirect call target profile data contains a hash of the OriginalName. 2. backend compiler promotes some local functions to global and appends .llvm.{$ModuleHash} to the end of the FuncName to derive PromotedFunctionName This patch tries at the best effort to find the GUID from the original local function name (in profile), and use that in ICP promotion, and in SamplePGO matching that happens in the backend after importing/inlining: 1. in thin-link, it builds the map from OriginalName to GUID so that when thin-link reads in indirect call target profile (represented by OriginalName), it knows which GUID to import. 2. in backend compiler, if sample profile reader cannot find a profile match for PromotedFunctionName, it will try to find if there is a match for OriginalFunctionName. 3. in backend compiler, we build symbol table entry for OriginalFunctionName and pointer to the same symbol of PromotedFunctionName, so that ICP can find the correct target to promote. Reviewers: mehdi_amini, tejohnson Reviewed By: tejohnson Subscribers: llvm-commits, Prazek Differential Revision: https://reviews.llvm.org/D30754 llvm-svn: 297757
2017-03-15 01:33:01 +08:00
if (Samples && !Samples->empty())
return emitAnnotations(F);
Propagation of profile samples through the CFG. This adds a propagation heuristic to convert instruction samples into branch weights. It implements a similar heuristic to the one implemented by Dehao Chen on GCC. The propagation proceeds in 3 phases: 1- Assignment of block weights. All the basic blocks in the function are initial assigned the same weight as their most frequently executed instruction. 2- Creation of equivalence classes. Since samples may be missing from blocks, we can fill in the gaps by setting the weights of all the blocks in the same equivalence class to the same weight. To compute the concept of equivalence, we use dominance and loop information. Two blocks B1 and B2 are in the same equivalence class if B1 dominates B2, B2 post-dominates B1 and both are in the same loop. 3- Propagation of block weights into edges. This uses a simple propagation heuristic. The following rules are applied to every block B in the CFG: - If B has a single predecessor/successor, then the weight of that edge is the weight of the block. - If all the edges are known except one, and the weight of the block is already known, the weight of the unknown edge will be the weight of the block minus the sum of all the known edges. If the sum of all the known edges is larger than B's weight, we set the unknown edge weight to zero. - If there is a self-referential edge, and the weight of the block is known, the weight for that edge is set to the weight of the block minus the weight of the other incoming edges to that block (if known). Since this propagation is not guaranteed to finalize for every CFG, we only allow it to proceed for a limited number of iterations (controlled by -sample-profile-max-propagate-iterations). It currently uses the same GCC default of 100. Before propagation starts, the pass builds (for each block) a list of unique predecessors and successors. This is necessary to handle identical edges in multiway branches. Since we visit all blocks and all edges of the CFG, it is cleaner to build these lists once at the start of the pass. Finally, the patch fixes the computation of relative line locations. The profiler emits lines relative to the function header. To discover it, we traverse the compilation unit looking for the subprogram corresponding to the function. The line number of that subprogram is the line where the function begins. That becomes line zero for all the relative locations. llvm-svn: 198972
2014-01-11 07:23:46 +08:00
return false;
SampleProfileLoader pass. Initial setup. This adds a new scalar pass that reads a file with samples generated by 'perf' during runtime. The samples read from the profile are incorporated and emmited as IR metadata reflecting that profile. The profile file is assumed to have been generated by an external profile source. The profile information is converted into IR metadata, which is later used by the analysis routines to estimate block frequencies, edge weights and other related data. External profile information files have no fixed format, each profiler is free to define its own. This includes both the on-disk representation of the profile and the kind of profile information stored in the file. A common kind of profile is based on sampling (e.g., perf), which essentially counts how many times each line of the program has been executed during the run. The SampleProfileLoader pass is organized as a scalar transformation. On startup, it reads the file given in -sample-profile-file to determine what kind of profile it contains. This file is assumed to contain profile information for the whole application. The profile data in the file is read and incorporated into the internal state of the corresponding profiler. To facilitate testing, I've organized the profilers to support two file formats: text and native. The native format is whatever on-disk representation the profiler wants to support, I think this will mostly be bitcode files, but it could be anything the profiler wants to support. To do this, every profiler must implement the SampleProfile::loadNative() function. The text format is mostly meant for debugging. Records are separated by newlines, but each profiler is free to interpret records as it sees fit. Profilers must implement the SampleProfile::loadText() function. Finally, the pass will call SampleProfile::emitAnnotations() for each function in the current translation unit. This function needs to translate the loaded profile into IR metadata, which the analyzer will later be able to use. This patch implements the first steps towards the above design. I've implemented a sample-based flat profiler. The format of the profile is fairly simplistic. Each sampled function contains a list of relative line locations (from the start of the function) together with a count representing how many samples were collected at that line during execution. I generate this profile using perf and a separate converter tool. Currently, I have only implemented a text format for these profiles. I am interested in initial feedback to the whole approach before I send the other parts of the implementation for review. This patch implements: - The SampleProfileLoader pass. - The base ExternalProfile class with the core interface. - A SampleProfile sub-class using the above interface. The profiler generates branch weight metadata on every branch instructions that matches the profiles. - A text loader class to assist the implementation of SampleProfile::loadText(). - Basic unit tests for the pass. Additionally, the patch uses profile information to compute branch weights based on instruction samples. This patch converts instruction samples into branch weights. It does a fairly simplistic conversion: Given a multi-way branch instruction, it calculates the weight of each branch based on the maximum sample count gathered from each target basic block. Note that this assignment of branch weights is somewhat lossy and can be misleading. If a basic block has more than one incoming branch, all the incoming branches will get the same weight. In reality, it may be that only one of them is the most heavily taken branch. I will adjust this assignment in subsequent patches. llvm-svn: 194566
2013-11-13 20:22:21 +08:00
}
PreservedAnalyses SampleProfileLoaderPass::run(Module &M,
ModuleAnalysisManager &AM) {
FunctionAnalysisManager &FAM =
AM.getResult<FunctionAnalysisManagerModuleProxy>(M).getManager();
auto GetAssumptionCache = [&](Function &F) -> AssumptionCache & {
return FAM.getResult<AssumptionAnalysis>(F);
};
auto GetTTI = [&](Function &F) -> TargetTransformInfo & {
return FAM.getResult<TargetIRAnalysis>(F);
};
auto GetTLI = [&](Function &F) -> const TargetLibraryInfo & {
return FAM.getResult<TargetLibraryAnalysis>(F);
};
SampleProfileLoader SampleLoader(
ProfileFileName.empty() ? SampleProfileFile : ProfileFileName,
ProfileRemappingFileName.empty() ? SampleProfileRemappingFile
: ProfileRemappingFileName,
LTOPhase, GetAssumptionCache, GetTTI, GetTLI);
if (!SampleLoader.doInitialization(M, &FAM))
return PreservedAnalyses::all();
ProfileSummaryInfo *PSI = &AM.getResult<ProfileSummaryAnalysis>(M);
CallGraph &CG = AM.getResult<CallGraphAnalysis>(M);
if (!SampleLoader.runOnModule(M, &AM, PSI, &CG))
return PreservedAnalyses::all();
return PreservedAnalyses::none();
}