llvm-project/llvm/lib/Analysis
David Green 39db5753f9 [LV][ARM] Inloop reduction cost modelling
This adds cost modelling for the inloop vectorization added in
745bf6cf44. Up until now they have been modelled as the original
underlying instruction, usually an add. This happens to works OK for MVE
with instructions that are reducing into the same type as they are
working on. But MVE's instructions can perform the equivalent of an
extended MLA as a single instruction:

  %sa = sext <16 x i8> A to <16 x i32>
  %sb = sext <16 x i8> B to <16 x i32>
  %m = mul <16 x i32> %sa, %sb
  %r = vecreduce.add(%m)
  ->
  R = VMLADAV A, B

There are other instructions for performing add reductions of
v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64
(VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV).
The i64 are particularly interesting as there are no native i64 add/mul
instructions, leading to the i64 add and mul naturally getting very
high costs.

Also worth mentioning, under NEON there is the concept of a sdot/udot
instruction which performs a partial reduction from a v16i8 to a v4i32.
They extend and mul/sum the first four elements from the inputs into the
first element of the output, repeating for each of the four output
lanes. They could possibly be represented in the same way as above in
llvm, so long as a vecreduce.add could perform a partial reduction. The
vectorizer would then produce a combination of in and outer loop
reductions to efficiently use the sdot and udot instructions. Although
this patch does not do that yet, it does suggest that separating the
input reduction type from the produced result type is a useful concept
to model. It also shows that a MLA reduction as a single instruction is
fairly common.

This patch attempt to improve the costmodelling of in-loop reductions
by:
 - Adding some pattern matching in the loop vectorizer cost model to
   match extended reduction patterns that are optionally extended and/or
   MLA patterns. This marks the cost of the reduction instruction correctly
   and the sext/zext/mul leading up to it as free, which is otherwise
   difficult to tell and may get a very high cost. (In the long run this
   can hopefully be replaced by vplan producing a single node and costing
   it correctly, but that is not yet something that vplan can do).
 - getExtendedAddReductionCost is added to query the cost of these
   extended reduction patterns.
 - Expanded the ARM costs to account for these expanded sizes, which is a
   fairly simple change in itself.
 - Some minor alterations to allow inloop reduction larger than the highest
   vector width and i64 MVE reductions.
 - An extra InLoopReductionImmediateChains map was added to the vectorizer
   for it to efficiently detect which instructions are reductions in the
   cost model.
 - The tests have some updates to show what I believe is optimal
   vectorization and where we are now.

Put together this can greatly improve performance for reduction loop
under MVE.

Differential Revision: https://reviews.llvm.org/D93476
2021-01-21 21:03:41 +00:00
..
models/inliner [MLInliner] Simplify TFUTILS_SUPPORTED_TYPES 2020-08-25 14:19:39 -07:00
AliasAnalysis.cpp Require chained analyses in BasicAA and AAResults to be transitive 2021-01-11 11:50:07 +01:00
AliasAnalysisEvaluator.cpp [AA] Split up LocationSize::unknown() 2020-11-26 18:39:55 +01:00
AliasAnalysisSummary.cpp
AliasAnalysisSummary.h
AliasSetTracker.cpp [noalias.decl] Look through llvm.experimental.noalias.scope.decl 2021-01-19 20:09:42 +01:00
Analysis.cpp [NPM] Port module-debuginfo pass to the new pass manager 2020-10-19 14:31:17 -07:00
AssumeBundleQueries.cpp Reland [AssumeBundles] Use operand bundles to encode alignment assumptions 2020-09-12 15:36:06 +02:00
AssumptionCache.cpp [llvm] Call *(Set|Map)::erase directly (NFC) 2021-01-03 09:57:47 -08:00
BasicAliasAnalysis.cpp Reapply [BasicAA] Handle recursive queries more efficiently 2021-01-17 10:34:35 +01:00
BlockFrequencyInfo.cpp
BlockFrequencyInfoImpl.cpp
BranchProbabilityInfo.cpp [BPI] Improve static heuristics for "cold" paths. 2020-12-23 22:47:36 +07:00
CFG.cpp [Analysis] Use is_contained (NFC) 2020-12-11 21:19:31 -08:00
CFGPrinter.cpp [llvm] Use llvm::all_of (NFC) 2021-01-06 18:27:36 -08:00
CFLAndersAliasAnalysis.cpp [MemLoc] Use hasValue() method (NFC) 2020-11-19 21:53:50 +01:00
CFLGraph.h
CFLSteensAliasAnalysis.cpp
CGSCCPassManager.cpp [llvm] Use *::empty (NFC) 2021-01-16 09:40:55 -08:00
CMakeLists.txt [NFC] Move ImportedFunctionsInliningStatistics to Analysis 2021-01-20 13:18:03 -08:00
CallGraph.cpp [Analysis] Remove spliceFunction (NFC) 2020-12-23 21:57:25 -08:00
CallGraphSCCPass.cpp [NewPM] Support --print-before/after in NPM 2020-12-03 16:52:14 -08:00
CallPrinter.cpp [llvm] Remove redundant string initialization (NFC) 2021-01-12 21:43:46 -08:00
CaptureTracking.cpp [CaptureTracking] Add statistics (NFC) 2020-11-07 12:57:00 +01:00
CmpInstAnalysis.cpp
CodeMetrics.cpp [LoopRotate] Calls not lowered to calls should not block rotation. 2021-01-19 14:37:36 +00:00
ConstantFolding.cpp [ConstantFold] Fold fptoi.sat intrinsics 2021-01-10 17:37:27 +01:00
ConstraintSystem.cpp [llvm] Remove redundant string initialization (NFC) 2021-01-12 21:43:46 -08:00
CostModel.cpp [Support] Introduce a new InstructionCost class 2020-12-11 08:12:54 +00:00
DDG.cpp [Analysis] Use llvm::append_range (NFC) 2020-12-29 19:23:21 -08:00
DDGPrinter.cpp [DDG] Data Dependence Graph - DOT printer - recommit 2020-12-16 12:37:36 -05:00
Delinearization.cpp [Delinearization][NewPM] Port delinearization to NPM 2020-09-21 17:59:08 -07:00
DemandedBits.cpp [DemandedBits][BDCE] Add support for min/max intrinsics 2020-09-10 22:13:31 +02:00
DependenceAnalysis.cpp [AA] Split up LocationSize::unknown() 2020-11-26 18:39:55 +01:00
DependenceGraphBuilder.cpp [DDG] Fix duplicate edge removal during pi-block formation 2021-01-07 10:31:11 -05:00
DevelopmentModeInlineAdvisor.cpp [NewPM][Inliner] Move the 'always inliner' case in the same CGSCC pass as 'regular' inliner 2021-01-15 17:59:38 -08:00
DivergenceAnalysis.cpp [Analysis, CodeGen, IR] Use contains (NFC) 2020-12-18 19:08:17 -08:00
DomPrinter.cpp
DomTreeUpdater.cpp [Target, Transforms] Use *Set::contains (NFC) 2021-01-08 18:39:54 -08:00
DominanceFrontier.cpp
EHPersonalities.cpp [XCOFF][AIX] Generate LSDA data and compact unwind section on AIX 2020-12-02 18:42:44 +00:00
FunctionPropertiesAnalysis.cpp [llvm] Ensure newlines at the end of files (NFC) 2021-01-10 09:24:57 -08:00
GlobalsModRef.cpp Reapply [BasicAA] Handle recursive queries more efficiently 2021-01-17 10:34:35 +01:00
GuardUtils.cpp
HeatUtils.cpp [CallPrinter] Remove static constructor. 2020-06-17 13:02:58 +02:00
IRSimilarityIdentifier.cpp [llvm] Use llvm::all_of (NFC) 2021-01-19 20:19:17 -08:00
IVDescriptors.cpp [LoopUtils] remove redundant opcode parameter; NFC 2021-01-04 17:05:28 -05:00
IVUsers.cpp
ImportedFunctionsInliningStatistics.cpp Reland "[NPM][Inliner] Factor ImportedFunctionStats in the InlineAdvisor" 2021-01-20 13:33:43 -08:00
IndirectCallPromotionAnalysis.cpp [Analysis] Remove unused system header includes 2020-11-22 10:32:37 +00:00
InlineAdvisor.cpp Reland "[NPM][Inliner] Factor ImportedFunctionStats in the InlineAdvisor" 2021-01-20 13:33:43 -08:00
InlineCost.cpp [Inliner] Compute the full cost for the cost benefit analsysis 2021-01-05 12:48:49 -08:00
InlineSizeEstimatorAnalysis.cpp [MLGO] Fix build break as result of new InstructionCost (D91174) 2020-12-11 20:28:39 -08:00
InstCount.cpp [NFC] Port InstCount pass to new pass manager 2020-08-21 12:39:42 +03:00
InstructionPrecedenceTracking.cpp
InstructionSimplify.cpp [InstCombine,InstSimplify] Optimize select followed by and/or/xor 2021-01-19 09:14:17 +09:00
Interval.cpp [Analysis/Interval] Remove isLoop (NFC) 2020-12-12 10:09:35 -08:00
IntervalPartition.cpp
LazyBlockFrequencyInfo.cpp
LazyBranchProbabilityInfo.cpp
LazyCallGraph.cpp [llvm] Use llvm::drop_begin (NFC) 2021-01-14 20:30:33 -08:00
LazyValueInfo.cpp [LVI] Handle unions of conditions 2021-01-01 16:46:21 +01:00
LegacyDivergenceAnalysis.cpp [DivergenceAnalysis] Use addRequiredTransitive 2020-11-13 14:40:00 +01:00
Lint.cpp [AA] Split up LocationSize::unknown() 2020-11-26 18:39:55 +01:00
Loads.cpp Use deref facts derived from minimum object size of allocations 2020-12-03 15:01:14 -08:00
LoopAccessAnalysis.cpp [llvm] Drop unnecessary make_range (NFC) 2021-01-09 09:25:00 -08:00
LoopAnalysisManager.cpp [NFC] Reduce include files dependency. 2020-12-03 18:25:05 +03:00
LoopCacheAnalysis.cpp [LoopInfo] empty() -> isInnermost(), add isOutermost() 2020-09-22 23:28:51 +03:00
LoopInfo.cpp [llvm] Use the default value of drop_begin (NFC) 2021-01-18 10:16:36 -08:00
LoopNestAnalysis.cpp [LoopNest] Remove unused include. 2021-01-05 20:05:31 +00:00
LoopPass.cpp [Analysis] Use llvm::erase_value (NFC) 2020-12-14 22:40:13 -08:00
LoopUnrollAnalyzer.cpp ScalarEvolution.h - reduce LoopInfo.h include to forward declarations. NFC. 2020-06-17 15:48:23 +01:00
MLInlineAdvisor.cpp Reland "[NPM][Inliner] Factor ImportedFunctionStats in the InlineAdvisor" 2021-01-20 13:33:43 -08:00
MemDepPrinter.cpp [Analysis] Remove dead function getInstTypePair (NFC) 2020-12-19 10:57:35 -08:00
MemDerefPrinter.cpp Port -print-memderefs to NPM 2020-11-23 11:56:22 -08:00
MemoryBuiltins.cpp [STLExtras] Use return type from operator* of the wrapped iter. 2021-01-10 14:41:13 +00:00
MemoryDependenceAnalysis.cpp [llvm] Use llvm::lower_bound and llvm::upper_bound (NFC) 2021-01-05 21:15:59 -08:00
MemoryLocation.cpp [MemLoc] Fix debug print for LocationSize 2020-12-20 17:52:48 +01:00
MemorySSA.cpp [noalias.decl] Look through llvm.experimental.noalias.scope.decl 2021-01-19 20:09:42 +01:00
MemorySSAUpdater.cpp [DominatorTree] Add support for mixed pre/post CFG views. 2021-01-06 14:53:09 -08:00
ModuleDebugInfoPrinter.cpp [NPM] Port module-debuginfo pass to the new pass manager 2020-10-19 14:31:17 -07:00
ModuleSummaryAnalysis.cpp [llvm] Use the default value of drop_begin (NFC) 2021-01-18 10:16:36 -08:00
MustExecute.cpp Port print-must-be-executed-contexts and print-mustexecute to NPM 2020-11-03 21:06:46 -08:00
ObjCARCAliasAnalysis.cpp [AA] Split up LocationSize::unknown() 2020-11-26 18:39:55 +01:00
ObjCARCAnalysisUtils.cpp [NFC] Reduce include files dependency. 2020-12-03 18:25:05 +03:00
ObjCARCInstKind.cpp [Analysis/Transforms/Sanitizers] As part of using inclusive language 2020-06-20 00:42:26 -07:00
OptimizationRemarkEmitter.cpp [BPI] Improve static heuristics for "cold" paths. 2020-12-23 22:47:36 +07:00
PHITransAddr.cpp
PhiValues.cpp [PhiValues] Use SetVector to avoid non-determinism 2020-10-23 20:14:02 +02:00
PostDominators.cpp
ProfileSummaryInfo.cpp [NFC] Change getEntryForPercentile to be a static function in ProfileSummaryBuilder. 2020-07-09 16:38:19 -07:00
PtrUseVisitor.cpp
README.txt
RegionInfo.cpp RegionInfo.cpp - remove duplicate includes that already exist in RegionInfo.h. NFC. 2020-07-23 17:50:22 +01:00
RegionPass.cpp [NFC] Clean up always false variables 2020-10-21 10:54:55 -07:00
RegionPrinter.cpp
ReleaseModeModelRunner.cpp static const char *const foo => const char foo[] 2020-12-01 10:33:18 -08:00
ReplayInlineAdvisor.cpp Reland "[NPM][Inliner] Factor ImportedFunctionStats in the InlineAdvisor" 2021-01-20 13:33:43 -08:00
ScalarEvolution.cpp [llvm] Construct SmallVector with iterator ranges (NFC) 2021-01-20 21:35:52 -08:00
ScalarEvolutionAliasAnalysis.cpp [AA] Split up LocationSize::unknown() 2020-11-26 18:39:55 +01:00
ScalarEvolutionDivision.cpp [SCEV] Generalize SCEVParameterRewriter to accept SCEV expression as target. 2020-09-18 10:05:02 +01:00
ScalarEvolutionNormalization.cpp
ScopedNoAliasAA.cpp [NFC] Reduce include files dependency and AA header cleanup (part 2). 2020-12-17 14:04:48 +03:00
StackLifetime.cpp When dumping results of StackLifetime, it will print the following 2020-09-07 11:43:16 +08:00
StackSafetyAnalysis.cpp [llvm] Drop unnecessary make_range (NFC) 2021-01-09 09:25:00 -08:00
StratifiedSets.h
SyncDependenceAnalysis.cpp [SDA] Fix -Wunused-function in -DLLVM_ENABLE_ASSERTIONS=off builds 2020-10-04 12:17:16 -07:00
SyntheticCountsUtils.cpp
TFUtils.cpp [NFC][TFUtils] also include output specs lookup logic in loadOutputSpecs 2020-11-18 21:20:21 -08:00
TargetLibraryInfo.cpp [Analysis] Use llvm::append_range (NFC) 2020-12-29 19:23:21 -08:00
TargetTransformInfo.cpp [LV][ARM] Inloop reduction cost modelling 2021-01-21 21:03:41 +00:00
Trace.cpp
TypeBasedAliasAnalysis.cpp [NFC] Reduce include files dependency and AA header cleanup (part 2). 2020-12-17 14:04:48 +03:00
TypeMetadataUtils.cpp TypeMetadataUtils.h - reduce Instructions.h include to forward declaration. NFC. 2020-06-05 17:40:33 +01:00
VFABIDemangling.cpp [llvm] Use the default value of drop_begin (NFC) 2021-01-18 10:16:36 -08:00
ValueLattice.cpp
ValueLatticeUtils.cpp [ValueLattice] Simplify canTrackGlobalVariableInterprocedurally (NFC). 2020-07-09 18:33:09 +01:00
ValueTracking.cpp Allow nonnull/align attribute to accept poison 2021-01-20 11:31:23 +09:00
VectorUtils.cpp [noalias.decl] Look through llvm.experimental.noalias.scope.decl 2021-01-19 20:09:42 +01:00

README.txt

Analysis Opportunities:

//===---------------------------------------------------------------------===//

In test/Transforms/LoopStrengthReduce/quadradic-exit-value.ll, the
ScalarEvolution expression for %r is this:

  {1,+,3,+,2}<loop>

Outside the loop, this could be evaluated simply as (%n * %n), however
ScalarEvolution currently evaluates it as

  (-2 + (2 * (trunc i65 (((zext i64 (-2 + %n) to i65) * (zext i64 (-1 + %n) to i65)) /u 2) to i64)) + (3 * %n))

In addition to being much more complicated, it involves i65 arithmetic,
which is very inefficient when expanded into code.

//===---------------------------------------------------------------------===//

In formatValue in test/CodeGen/X86/lsr-delayed-fold.ll,

ScalarEvolution is forming this expression:

((trunc i64 (-1 * %arg5) to i32) + (trunc i64 %arg5 to i32) + (-1 * (trunc i64 undef to i32)))

This could be folded to

(-1 * (trunc i64 undef to i32))

//===---------------------------------------------------------------------===//