[InlineCost] Enable the new switch cost heuristic

Summary:
This enables the new switch inline cost heuristic (r301649) by removing the
old heuristic as well as the flag itself.
In my experiments with the LLVM test suite and spec2000/2006, a +17.82%
performance improvement and an 8% code size reduction were observed in
spec2000/vertex with O3 LTO on AArch64.
No significant code size or performance regressions were found at O3/O2/Os. No
significant complaints were reported on the llvm-dev thread.
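
For reference, the old heuristic removed by this change charged one instruction cost per distinct successor block beyond the first. The following is a minimal standalone sketch of that accounting, not LLVM code: `oldSwitchCost`, `kInstrCost`, and the use of integer block IDs are hypothetical names chosen for illustration, and the value 5 is an assumed stand-in for `InlineConstants::InstrCost`.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Assumed stand-in for InlineConstants::InstrCost (hypothetical value).
constexpr int kInstrCost = 5;

// Sketch of the removed model: blocks are identified by plain integer IDs
// (an illustration-only simplification of BasicBlock pointers).
int oldSwitchCost(int defaultDest, const std::vector<int> &caseDests) {
  std::set<int> successors;
  successors.insert(defaultDest);
  for (int dest : caseDests)
    successors.insert(dest);
  // The first destination is modeled as free because of fallthrough.
  return static_cast<int>(successors.size() - 1) * kInstrCost;
}
```

Under this model the cost depends only on the number of distinct destinations, not on how many cases the switch has: a switch with many cases that all branch to one block is charged the same as a switch with a single case.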

Reviewers: hans, chandlerc, eraman, haicheng, mcrosier, bmakam, eastig, ddibyend, echristo

Reviewed By: echristo

Subscribers: javed.absar, kristof.beyls, echristo, aemerson, rengolin, mehdi_amini

Differential Revision: https://reviews.llvm.org/D32653

llvm-svn: 304594
Jun Bum Lim 2017-06-02 20:42:54 +00:00
parent 2c08fde9e5
commit 2960d41e68
2 changed files with 60 additions and 80 deletions


@ -54,11 +54,6 @@ static cl::opt<int>
cl::init(45),
cl::desc("Threshold for inlining cold callsites"));
static cl::opt<bool>
EnableGenericSwitchCost("inline-generic-switch-cost", cl::Hidden,
cl::init(false),
cl::desc("Enable generic switch cost model"));
// We introduce this threshold to help performance of instrumentation based
// PGO before we actually hook up inliner with analysis passes such as BPI and
// BFI.
@ -1015,13 +1010,17 @@ bool CallAnalyzer::visitSwitchInst(SwitchInst &SI) {
if (isa<ConstantInt>(V))
return true;
if (EnableGenericSwitchCost) {
// Assume the most general case where the switch is lowered into
// either a jump table, bit test, or a balanced binary tree consisting of
// case clusters without merging adjacent clusters with the same
// destination. We do not consider the switches that are lowered with a mix
// of jump table/bit test/binary search tree. The cost of the switch is
// proportional to the size of the tree or the size of jump table range.
//
// NB: We convert large switches which are just used to initialize large phi
// nodes to lookup tables instead in simplify-cfg, so this shouldn't prevent
// inlining those. It will prevent inlining in cases where the optimization
// does not (yet) fire.
// Exit early for a large switch, assuming one case needs at least one
// instruction.
@ -1076,25 +1075,6 @@ bool CallAnalyzer::visitSwitchInst(SwitchInst &SI) {
return false;
}
// Use a simple switch cost model where we accumulate a cost proportional to
// the number of distinct successor blocks. This fan-out in the CFG cannot
// be represented for free even if we can represent the core switch as a
// jumptable that takes a single instruction.
///
// NB: We convert large switches which are just used to initialize large phi
// nodes to lookup tables instead in simplify-cfg, so this shouldn't prevent
// inlining those. It will prevent inlining in cases where the optimization
// does not (yet) fire.
SmallPtrSet<BasicBlock *, 8> SuccessorBlocks;
SuccessorBlocks.insert(SI.getDefaultDest());
for (auto Case : SI.cases())
SuccessorBlocks.insert(Case.getCaseSuccessor());
// Add cost corresponding to the number of distinct destinations. The first
// we model as free because of fallthrough.
Cost += (SuccessorBlocks.size() - 1) * InlineConstants::InstrCost;
return false;
}
bool CallAnalyzer::visitIndirectBrInst(IndirectBrInst &IBI) {
// We never want to inline functions that contain an indirectbr. This is
// incorrect because all the blockaddress's (in static global initializers


@ -1,5 +1,5 @@
; RUN: opt < %s -inline -inline-threshold=20 -S -mtriple=aarch64-none-linux -inline-generic-switch-cost=true | FileCheck %s
; RUN: opt < %s -passes='cgscc(inline)' -inline-threshold=20 -S -mtriple=aarch64-none-linux -inline-generic-switch-cost=true | FileCheck %s
; RUN: opt < %s -inline -inline-threshold=20 -S -mtriple=aarch64-none-linux | FileCheck %s
; RUN: opt < %s -passes='cgscc(inline)' -inline-threshold=20 -S -mtriple=aarch64-none-linux | FileCheck %s
define i32 @callee_range(i32 %a, i32* %P) {
switch i32 %a, label %sw.default [