[CSSPGO][llvm-profgen] Context-sensitive global pre-inliner
This change sets up a framework in llvm-profgen to estimate inline decisions and adjust context-sensitive profiles based on them. We call it a global pre-inliner in llvm-profgen.
It serves two purposes:
1) Since the context profile of a context that is not inlined will be merged into the base profile, if we estimate that a context will not be inlined, we can merge its context profile in the output to save profile size.
2) For ThinLTO, when a context involving functions from different modules is not inlined, we can't merge function profiles across modules, leading to suboptimal post-inline count quality. By estimating some inline decisions, we are able to adjust/merge context profiles beforehand as a mitigation.
The compiler's inline heuristic uses inline cost, which is not available in llvm-profgen. But since inline cost is closely related to size, we can get an estimate through function size from debug info. Because the size we have in llvm-profgen is the final size, it could even be more accurate than the inline cost estimation in the compiler.
This change only contains the framework, with a few TODOs left for follow-up patches toward a complete implementation:
1) We need to retrieve the size of a function/inlinee from debug info for inlining estimation. Currently we use the number of samples in a profile as a placeholder for size estimation.
2) Currently the thresholds reuse the values of the sample loader inliner. They need to be tuned, since the size here is fully optimized machine code size instead of inline cost based on not-yet-fully-optimized IR.
Differential Revision: https://reviews.llvm.org/D99146
2021-03-05 23:50:36 +08:00
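
A minimal usage sketch (the preInlineContextProfiles driver below is hypothetical, not part of this patch; it assumes only the CSPreInliner constructor and run() entry point introduced here):

#include "CSPreInliner.h"
#include <cstdint>

// Run the global pre-inliner over a context profile map. CSPreInliner
// mutates the map in place: contexts estimated to be inlined keep their
// context profiles; the rest are merged into base profiles and trimmed
// from the output.
static void preInlineContextProfiles(
    llvm::StringMap<llvm::sampleprof::FunctionSamples> &ProfileMap,
    uint64_t HotCountThreshold, uint64_t ColdCountThreshold) {
  llvm::sampleprof::CSPreInliner(ProfileMap, HotCountThreshold,
                                 ColdCountThreshold)
      .run();
}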

//===-- CSPreInliner.cpp - Profile guided preinliner -------------- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#include "CSPreInliner.h"
#include "llvm/ADT/SCCIterator.h"
#include <cstdint>
#include <queue>

#define DEBUG_TYPE "cs-preinliner"

using namespace llvm;
using namespace sampleprof;

// These switches specify the inline thresholds used in SampleProfileLoader
// inlining.
// TODO: The actual thresholds need to be tuned here, because the size here is
// based on machine code, not LLVM IR.
extern cl::opt<int> SampleHotCallSiteThreshold;
extern cl::opt<int> SampleColdCallSiteThreshold;
extern cl::opt<int> ProfileInlineGrowthLimit;
extern cl::opt<int> ProfileInlineLimitMin;
extern cl::opt<int> ProfileInlineLimitMax;

static cl::opt<bool> SamplePreInlineReplay(
    "csspgo-replay-preinline", cl::Hidden, cl::init(false),
    cl::desc(
        "Replay previous inlining and adjust context profile accordingly"));

CSPreInliner::CSPreInliner(StringMap<FunctionSamples> &Profiles,
                           uint64_t HotThreshold, uint64_t ColdThreshold)
    : ContextTracker(Profiles), ProfileMap(Profiles),
      HotCountThreshold(HotThreshold), ColdCountThreshold(ColdThreshold) {}

std::vector<StringRef> CSPreInliner::buildTopDownOrder() {
  std::vector<StringRef> Order;

[CSSPGO] Top-down processing order based on full profile.
Use profiled call edges to augment the top-down order. There are cases where the top-down order computed from the static call graph doesn't reflect the real execution order. For example:
1. Incomplete static call graph due to unknown indirect call targets. Adjusting the order by considering indirect call edges from the profile can enable the inlining of indirect call targets, by allowing the caller to be processed before them.
2. Mutual call edges in an SCC. The static processing order computed for an SCC may not reflect the call contexts in the context-sensitive profile, and thus may cause potential inlining to be overlooked. The function order within an SCC is adjusted to a top-down order based on the profile, to favor more inlining.
3. Transitive indirect call edges due to inlining. When a callee function is inlined into a caller function in LTO prelink, every call edge originating from the callee is transferred to the caller. If any of the transferred edges is indirect, the original profiled indirect edge, even if considered, would not enforce a top-down order from the caller to the potential indirect call target in LTO postlink, since the inlined callee is gone from the static call graph.
4. #3 can happen even for direct call targets, due to functions defined in header files. Header functions, when included into source files, are defined multiple times, but only one definition survives due to ODR. Therefore, the LTO prelink inlining done on those dropped definitions can be useless from a local file-scope point of view. More importantly, the inlinee, once fully inlined into a to-be-dropped inliner, will have no profile to consume when its outlined version is compiled. This can lead to a profile-less prelink compilation for the outlined version of the inlinee function, which may be called from external modules. While this isn't easy to fix, we rely on the postlink AutoFDO pipeline to optimize the inlinee. Since the surviving copy of the inliner (defined in headers) can be inlined in its local scope in prelink, it may not exist in the merged IR in postlink, and we'll need the profiled call edges to enforce a top-down order for the rest of the functions.
Considering those cases, a profiled call graph completely independent of the static call graph is constructed based on profile data, where function objects are not even needed to handle case #3 and case #4.
I'm seeing an average 0.4% perf win on SPEC2017. For certain benchmarks such as Xalancbmk and GCC, the win is bigger, above 2%.
The change is an enhancement to https://reviews.llvm.org/D95988.
Reviewed By: wmi, wenlei
Differential Revision: https://reviews.llvm.org/D99351
2021-03-30 01:21:31 +08:00
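
  // Build the call graph purely from profiled call edges (see the notes
  // above): nodes are keyed by function name, with no need for IR function
  // objects.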
  ProfiledCallGraph ProfiledCG(ContextTracker);

  // Now that we have a profiled call graph, construct top-down order
  // by building up SCC and reversing SCC order.
  scc_iterator<ProfiledCallGraph *> I = scc_begin(&ProfiledCG);
  while (!I.isAtEnd()) {
    for (ProfiledCallGraphNode *Node : *I) {
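      // Skip the synthetic entry node that roots the profiled call graph;
      // it doesn't correspond to a real function.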
      if (Node != ProfiledCG.getEntryNode())
        Order.push_back(Node->Name);
    }
    ++I;
  }
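  // scc_iterator visits SCCs bottom-up (callees before callers), so reverse
  // the node list to obtain a top-down order.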
  std::reverse(Order.begin(), Order.end());

  return Order;
}

bool CSPreInliner::getInlineCandidates(ProfiledCandidateQueue &CQueue,
                                       const FunctionSamples *CallerSamples) {
  assert(CallerSamples && "Expect non-null caller samples");

  // Ideally we want to consider everything a function calls, but as far as
  // context profile is concerned, only those frames that are children of the
  // current one in the trie are relevant. So we walk the trie instead of the
  // call targets from the function profile.
  ContextTrieNode *CallerNode =
      ContextTracker.getContextFor(CallerSamples->getContext());

  bool HasNewCandidate = false;
  for (auto &Child : CallerNode->getAllChildContext()) {
    ContextTrieNode *CalleeNode = &Child.second;
    FunctionSamples *CalleeSamples = CalleeNode->getFunctionSamples();
    if (!CalleeSamples)
      continue;

    // The call site count is more reliable, so we look up the corresponding
    // call target profile in the caller's context profile to retrieve the
    // call site count.
    uint64_t CalleeEntryCount = CalleeSamples->getEntrySamples();
    uint64_t CallsiteCount = 0;
    LineLocation Callsite = CalleeNode->getCallSiteLoc();
    if (auto CallTargets = CallerSamples->findCallTargetMapAt(Callsite)) {
      SampleRecord::CallTargetMap &TargetCounts = CallTargets.get();
      auto It = TargetCounts.find(CalleeSamples->getName());
      if (It != TargetCounts.end())
        CallsiteCount = It->second;
    }

    // TODO: Call site count and callee entry count should be mostly
    // consistent; add a check for that.
    HasNewCandidate = true;
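    // Score the candidate with the larger of the two counts; when they
    // disagree, this errs on the side of treating the call site as hotter.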
    CQueue.emplace(CalleeSamples, std::max(CallsiteCount, CalleeEntryCount));
  }

  return HasNewCandidate;
}

bool CSPreInliner::shouldInline(ProfiledInlineCandidate &Candidate) {
  // If replay inline is requested, simply follow the inline decision of the
  // profiled binary.
  if (SamplePreInlineReplay)
    return Candidate.CalleeSamples->getContext().hasAttribute(
        ContextWasInlined);

  // Adjust the threshold based on call site hotness; we only do this for the
  // call site prioritized inliner, because otherwise the cost-benefit check
  // is done earlier.
  unsigned int SampleThreshold = SampleColdCallSiteThreshold;
  if (Candidate.CallsiteCount > HotCountThreshold)
    SampleThreshold = SampleHotCallSiteThreshold;

  // TODO: Small cold functions may still get inlined by the compiler; we
  // need to keep the context profile for them accordingly.
  if (Candidate.CallsiteCount < ColdCountThreshold)
    SampleThreshold = SampleColdCallSiteThreshold;
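
  // For scale: with the sample loader's defaults (assumed here to be a hot
  // call site threshold of 3000 and a cold one of 45), a hot call site
  // admits callees up to size cost 3000; all others must stay under 45.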
  return (Candidate.SizeCost < SampleThreshold);
}

void CSPreInliner::processFunction(const StringRef Name) {
  LLVM_DEBUG(dbgs() << "Process " << Name
                    << " for context-sensitive pre-inlining\n");

  FunctionSamples *FSamples = ContextTracker.getBaseSamplesFor(Name);
  if (!FSamples)
    return;

  // Use the number of lines/probes as a proxy for function size for now.
  // TODO: Retrieve an accurate size from DWARF or the binary instead.
  unsigned FuncSize = FSamples->getBodySamples().size();
  unsigned FuncFinalSize = FuncSize;
  unsigned SizeLimit = FuncSize * ProfileInlineGrowthLimit;
  SizeLimit = std::min(SizeLimit, (unsigned)ProfileInlineLimitMax);
  SizeLimit = std::max(SizeLimit, (unsigned)ProfileInlineLimitMin);
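  // Worked example, assuming the sample loader defaults (growth limit 12,
  // limit min 100, limit max 10000): a function with 10 profiled lines gets
  // SizeLimit = 10 * 12 = 120, clamped to [100, 10000], i.e. 120.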

  ProfiledCandidateQueue CQueue;
  getInlineCandidates(CQueue, FSamples);

  while (!CQueue.empty() && FuncFinalSize < SizeLimit) {
    ProfiledInlineCandidate Candidate = CQueue.top();
    CQueue.pop();
    bool ShouldInline = false;
    if ((ShouldInline = shouldInline(Candidate))) {
      // Mark the context as inlined, so that the corresponding context
      // profile won't be merged into that function's base profile.
      ContextTracker.markContextSamplesInlined(Candidate.CalleeSamples);
      Candidate.CalleeSamples->getContext().setAttribute(
          ContextShouldBeInlined);
      FuncFinalSize += Candidate.SizeCost;
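      // Inlining the callee exposes its child contexts as new candidates.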
      getInlineCandidates(CQueue, Candidate.CalleeSamples);
    }
    LLVM_DEBUG(dbgs() << (ShouldInline ? " Inlined" : " Outlined")
                      << " context profile for: "
                      << Candidate.CalleeSamples->getNameWithContext()
                      << " (callee size: " << Candidate.SizeCost
                      << ", call count: " << Candidate.CallsiteCount << ")\n");
  }

  LLVM_DEBUG({
    if (!CQueue.empty())
      dbgs() << " Inline candidates ignored due to size limit (inliner "
                "original size: "
             << FuncSize << ", inliner final size: " << FuncFinalSize
             << ", size limit: " << SizeLimit << ")\n";

    while (!CQueue.empty()) {
      ProfiledInlineCandidate Candidate = CQueue.top();
      CQueue.pop();
      bool WasInlined =
          Candidate.CalleeSamples->getContext().hasAttribute(ContextWasInlined);
      dbgs() << " " << Candidate.CalleeSamples->getNameWithContext()
             << " (candidate size: " << Candidate.SizeCost
             << ", call count: " << Candidate.CallsiteCount << ", previously "
             << (WasInlined ? "inlined)\n" : "not inlined)\n");
    }
  });
}

void CSPreInliner::run() {
#ifndef NDEBUG
  auto printProfileNames = [](StringMap<FunctionSamples> &Profiles,
                              bool IsInput) {
    dbgs() << (IsInput ? "Input" : "Output") << " context-sensitive profiles ("
           << Profiles.size() << " total):\n";
    for (auto &It : Profiles) {
      const FunctionSamples &Samples = It.second;
      dbgs() << " [" << Samples.getNameWithContext() << "] "
             << Samples.getTotalSamples() << ":" << Samples.getHeadSamples()
             << "\n";
    }
  };
#endif

  LLVM_DEBUG(printProfileNames(ProfileMap, true));

  // Execute the global pre-inliner to estimate a global top-down inline
  // decision and merge profiles accordingly. This helps with profile merging
  // for ThinLTO, where otherwise we wouldn't be able to merge profiles back
  // into the base profile across module/thin-backend boundaries.
  // It also helps to better compress context profiles and control profile
  // size, as we then only need context profiles for functions that are going
  // to be inlined.
  for (StringRef FuncName : buildTopDownOrder()) {
    processFunction(FuncName);
  }

  // Not-inlined context profiles have been merged into their base profiles,
  // so we can trim such profiles from the output.
  std::vector<StringRef> ProfilesToBeRemoved;
  for (auto &It : ProfileMap) {
    SampleContext Context = It.second.getContext();
    if (!Context.isBaseContext() && !Context.hasState(InlinedContext)) {
      assert(Context.hasState(MergedContext) &&
             "Not inlined context profile should be merged already");
      ProfilesToBeRemoved.push_back(It.first());
    }
  }

  for (StringRef ContextName : ProfilesToBeRemoved) {
    ProfileMap.erase(ContextName);
  }

  // Make sure ProfileMap's key is consistent with FunctionSamples' name.
  SampleContextTrimmer(ProfileMap).canonicalizeContextProfiles();

  LLVM_DEBUG(printProfileNames(ProfileMap, false));
}