//===- LoopRotation.cpp - Loop Rotation Pass ------------------------------===//
//
//                     The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This file implements Loop Rotation Pass.
//
//===----------------------------------------------------------------------===//
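
// Loop rotation moves a loop's exiting test from the header to the bottom of
// the loop, turning a "while"-style loop into a "do-while"-style loop guarded
// by a copy of the test in the preheader. See rotateLoop() below.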

#include "llvm/Transforms/Scalar/LoopRotation.h"
#include "llvm/ADT/Statistic.h"
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/AssumptionCache.h"
#include "llvm/Analysis/BasicAliasAnalysis.h"
#include "llvm/Analysis/CodeMetrics.h"
#include "llvm/Analysis/GlobalsModRef.h"
#include "llvm/Analysis/InstructionSimplify.h"
#include "llvm/Analysis/LoopPass.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionAliasAnalysis.h"
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Transforms/Scalar.h"
#include "llvm/Transforms/Scalar/LoopPassManager.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"
#include "llvm/Transforms/Utils/Local.h"
#include "llvm/Transforms/Utils/LoopUtils.h"
#include "llvm/Transforms/Utils/SSAUpdater.h"
#include "llvm/Transforms/Utils/ValueMapper.h"
using namespace llvm;

#define DEBUG_TYPE "loop-rotate"

static cl::opt<unsigned> DefaultRotationThreshold(
    "rotation-max-header-size", cl::init(16), cl::Hidden,
    cl::desc("The default maximum header size for automatic loop rotation"));
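
// Note: the header-size cap bounds how many instructions rotation is willing
// to duplicate from the loop header into the preheader; it is checked against
// CodeMetrics::NumInsts in rotateLoop() below.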

STATISTIC(NumRotated, "Number of loops rotated");

namespace {
/// A simple loop rotation transformation.
class LoopRotate {
  const unsigned MaxHeaderSize;
  LoopInfo *LI;
  const TargetTransformInfo *TTI;
  AssumptionCache *AC;
  DominatorTree *DT;
  ScalarEvolution *SE;
  const SimplifyQuery &SQ;

public:
  LoopRotate(unsigned MaxHeaderSize, LoopInfo *LI,
             const TargetTransformInfo *TTI, AssumptionCache *AC,
             DominatorTree *DT, ScalarEvolution *SE, const SimplifyQuery &SQ)
      : MaxHeaderSize(MaxHeaderSize), LI(LI), TTI(TTI), AC(AC), DT(DT), SE(SE),
        SQ(SQ) {}
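
  /// Run loop rotation on \p L; the return value reports whether the loop was
  /// modified.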
  bool processLoop(Loop *L);

private:
  bool rotateLoop(Loop *L, bool SimplifiedLatch);
  bool simplifyLoopLatch(Loop *L);
};
} // end anonymous namespace

/// RewriteUsesOfClonedInstructions - We just cloned the instructions from the
/// old header into the preheader. If there were uses of the values produced by
/// these instructions that were outside of the loop, we have to insert PHI
/// nodes to merge the two values. Do this now.
static void RewriteUsesOfClonedInstructions(BasicBlock *OrigHeader,
                                            BasicBlock *OrigPreheader,
                                            ValueToValueMapTy &ValueMap,
                                SmallVectorImpl<PHINode*> *InsertedPHIs) {
  // Remove PHI node entries that are no longer live.
  BasicBlock::iterator I, E = OrigHeader->end();
  for (I = OrigHeader->begin(); PHINode *PN = dyn_cast<PHINode>(I); ++I)
    PN->removeIncomingValue(PN->getBasicBlockIndex(OrigPreheader));

  // Now fix up users of the instructions in OrigHeader, inserting PHI nodes
  // as necessary.
  SSAUpdater SSA(InsertedPHIs);
  for (I = OrigHeader->begin(); I != E; ++I) {
    Value *OrigHeaderVal = &*I;

    // If there are no uses of the value (e.g. because it returns void), there
    // is nothing to rewrite.
    if (OrigHeaderVal->use_empty())
      continue;

    Value *OrigPreHeaderVal = ValueMap.lookup(OrigHeaderVal);

    // The value now exists in two versions: the initial value in the preheader
    // and the loop "next" value in the original header.
    SSA.Initialize(OrigHeaderVal->getType(), OrigHeaderVal->getName());
    SSA.AddAvailableValue(OrigHeader, OrigHeaderVal);
    SSA.AddAvailableValue(OrigPreheader, OrigPreHeaderVal);

    // Visit each use of the OrigHeader instruction.
    for (Value::use_iterator UI = OrigHeaderVal->use_begin(),
                             UE = OrigHeaderVal->use_end();
         UI != UE;) {
      // Grab the use before incrementing the iterator.
      Use &U = *UI;

      // Increment the iterator before removing the use from the list.
      ++UI;

      // SSAUpdater can't handle a non-PHI use in the same block as an
      // earlier def. We can easily handle those cases manually.
      Instruction *UserInst = cast<Instruction>(U.getUser());
      if (!isa<PHINode>(UserInst)) {
        BasicBlock *UserBB = UserInst->getParent();

        // The original users in the OrigHeader are already using the
        // original definitions.
        if (UserBB == OrigHeader)
          continue;

        // Users in the OrigPreHeader need to use the value to which the
        // original definitions are mapped.
        if (UserBB == OrigPreheader) {
          U = OrigPreHeaderVal;
          continue;
        }
      }

      // Anything else can be handled by SSAUpdater.
      SSA.RewriteUse(U);
    }

    // Replace MetadataAsValue(ValueAsMetadata(OrigHeaderVal)) uses in debug
    // intrinsics.
    LLVMContext &C = OrigHeader->getContext();
    if (auto *VAM = ValueAsMetadata::getIfExists(OrigHeaderVal)) {
      if (auto *MAV = MetadataAsValue::getIfExists(C, VAM)) {
        for (auto UI = MAV->use_begin(), E = MAV->use_end(); UI != E;) {
          // Grab the use before incrementing the iterator. Otherwise, altering
          // the Use will invalidate the iterator.
          Use &U = *UI++;
          DbgInfoIntrinsic *UserInst = dyn_cast<DbgInfoIntrinsic>(U.getUser());
          if (!UserInst)
            continue;

          // The original users in the OrigHeader are already using the
          // original definitions.
          BasicBlock *UserBB = UserInst->getParent();
          if (UserBB == OrigHeader)
            continue;

          // Users in the OrigPreHeader need to use the value to which the
          // original definitions are mapped and anything else can be handled
          // by the SSAUpdater. To avoid adding PHINodes, check if the value is
          // available in UserBB; if not, substitute undef.
          Value *NewVal;
          if (UserBB == OrigPreheader)
            NewVal = OrigPreHeaderVal;
          else if (SSA.HasValueForBlock(UserBB))
            NewVal = SSA.GetValueInMiddleOfBlock(UserBB);
          else
            NewVal = UndefValue::get(OrigHeaderVal->getType());
          U = MetadataAsValue::get(C, ValueAsMetadata::get(NewVal));
        }
      }
    }
  }
}

/// Propagate dbg.value intrinsics through the newly inserted Phis.
static void insertDebugValues(BasicBlock *OrigHeader,
                              SmallVectorImpl<PHINode*> &InsertedPHIs) {
  ValueToValueMapTy DbgValueMap;

  // Map existing PHI nodes to their dbg.values.
  for (auto &I : *OrigHeader) {
    if (auto DbgII = dyn_cast<DbgInfoIntrinsic>(&I)) {
      if (auto *Loc = dyn_cast_or_null<PHINode>(DbgII->getVariableLocation()))
        DbgValueMap.insert({Loc, DbgII});
    }
  }

  // Then iterate through the new PHIs and look to see if they use one of the
  // previously mapped PHIs. If so, insert a new dbg.value intrinsic that will
  // propagate the info through the new PHI.
  LLVMContext &C = OrigHeader->getContext();
  for (auto PHI : InsertedPHIs) {
    for (auto VI : PHI->operand_values()) {
      auto V = DbgValueMap.find(VI);
      if (V != DbgValueMap.end()) {
        auto *DbgII = cast<DbgInfoIntrinsic>(V->second);
        Instruction *NewDbgII = DbgII->clone();
        auto PhiMAV = MetadataAsValue::get(C, ValueAsMetadata::get(PHI));
        NewDbgII->setOperand(0, PhiMAV);
        BasicBlock *Parent = PHI->getParent();
        NewDbgII->insertBefore(Parent->getFirstNonPHIOrDbgOrLifetime());
      }
    }
  }
}

/// Rotate loop LP. Return true if the loop is rotated.
///
/// \param SimplifiedLatch is true if the latch was just folded into the final
/// loop exit. In this case we may want to rotate even though the new latch is
/// now an exiting branch. This rotation would have happened had the latch not
/// been simplified. However, if SimplifiedLatch is false, then we avoid
/// rotating loops in which the latch exits to avoid excessive or endless
/// rotation. LoopRotate should be repeatable and converge to a canonical
/// form. This property is satisfied because simplifying the loop latch can
/// only happen once across multiple invocations of the LoopRotate pass.
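///
/// Conceptually, rotation rewrites a loop of the form
///   while (cond) { body; }
/// into
///   if (cond) { do { body; } while (cond); }
/// by cloning the header's exit test into the preheader so that the original
/// header's test ends up at the bottom of the loop.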
bool LoopRotate::rotateLoop(Loop *L, bool SimplifiedLatch) {
  // If the loop has only one block then there is not much to rotate.
  if (L->getBlocks().size() == 1)
    return false;

  BasicBlock *OrigHeader = L->getHeader();
  BasicBlock *OrigLatch = L->getLoopLatch();

  BranchInst *BI = dyn_cast<BranchInst>(OrigHeader->getTerminator());
  if (!BI || BI->isUnconditional())
    return false;

  // If the loop header is not one of the loop exiting blocks then
  // either this loop is already rotated or it is not
  // suitable for loop rotation transformations.
  if (!L->isLoopExiting(OrigHeader))
    return false;

  // If the loop latch already contains a branch that leaves the loop then the
  // loop is already rotated.
  if (!OrigLatch)
    return false;

  // Rotate if either the loop latch does *not* exit the loop, or if the loop
  // latch was just simplified.
  if (L->isLoopExiting(OrigLatch) && !SimplifiedLatch)
    return false;

  // Check the size of the original header and reject the loop if it is very
  // big or we can't duplicate blocks inside it.
  {
    SmallPtrSet<const Value *, 32> EphValues;
    CodeMetrics::collectEphemeralValues(L, AC, EphValues);

    CodeMetrics Metrics;
    Metrics.analyzeBasicBlock(OrigHeader, *TTI, EphValues);
    if (Metrics.notDuplicatable) {
      DEBUG(dbgs() << "LoopRotation: NOT rotating - contains non-duplicatable"
                   << " instructions: ";
            L->dump());
      return false;
    }
    if (Metrics.convergent) {
      DEBUG(dbgs() << "LoopRotation: NOT rotating - contains convergent "
                      "instructions: ";
            L->dump());
      return false;
    }
    if (Metrics.NumInsts > MaxHeaderSize)
      return false;
  }

  // Now, this loop is suitable for rotation.
  BasicBlock *OrigPreheader = L->getLoopPreheader();

  // If the loop could not be converted to canonical form, it must have an
  // indirectbr in it, just give up.
  if (!OrigPreheader)
    return false;

  // Anything ScalarEvolution may know about this loop or the PHI nodes
  // in its header will soon be invalidated.
  if (SE)
    SE->forgetLoop(L);

  DEBUG(dbgs() << "LoopRotation: rotating "; L->dump());

  // Find the new loop header. NewHeader is the Header's one and only successor
  // that is inside the loop. The Header's other successor is outside the
  // loop. Otherwise the loop is not suitable for rotation.
  BasicBlock *Exit = BI->getSuccessor(0);
  BasicBlock *NewHeader = BI->getSuccessor(1);
  if (L->contains(Exit))
    std::swap(Exit, NewHeader);
  assert(NewHeader && "Unable to determine new loop header");
  assert(L->contains(NewHeader) && !L->contains(Exit) &&
         "Unable to determine loop header and exit blocks");

  // This code assumes that the new header has exactly one predecessor.
  // Remove any single-entry PHI nodes in it.
  assert(NewHeader->getSinglePredecessor() &&
         "New header doesn't have one pred!");
  FoldSingleEntryPHINodes(NewHeader);

  // Begin by walking OrigHeader and populating ValueMap with an entry for
  // each Instruction.
  BasicBlock::iterator I = OrigHeader->begin(), E = OrigHeader->end();
  ValueToValueMapTy ValueMap;

  // For PHI nodes, the value available in OldPreHeader is just the
  // incoming value from OldPreHeader.
  for (; PHINode *PN = dyn_cast<PHINode>(I); ++I)
    ValueMap[PN] = PN->getIncomingValueForBlock(OrigPreheader);

  // For the rest of the instructions, either hoist to the OrigPreheader if
  // possible or create a clone in the OldPreHeader if not.
  TerminatorInst *LoopEntryBranch = OrigPreheader->getTerminator();
  while (I != E) {
    Instruction *Inst = &*I++;

    // If the instruction's operands are invariant and it doesn't read or write
    // memory, then it is safe to hoist. Doing this doesn't change the order of
    // execution in the preheader, but does prevent the instruction from
    // executing in each iteration of the loop. This means it is safe to hoist
    // something that might trap, but isn't safe to hoist something that reads
    // memory (without proving that the loop doesn't write).
    if (L->hasLoopInvariantOperands(Inst) && !Inst->mayReadFromMemory() &&
        !Inst->mayWriteToMemory() && !isa<TerminatorInst>(Inst) &&
        !isa<DbgInfoIntrinsic>(Inst) && !isa<AllocaInst>(Inst)) {
      Inst->moveBefore(LoopEntryBranch);
      continue;
    }

    // Otherwise, create a duplicate of the instruction.
    Instruction *C = Inst->clone();

    // Eagerly remap the operands of the instruction.
    RemapInstruction(C, ValueMap,
                     RF_NoModuleLevelChanges | RF_IgnoreMissingLocals);

    // With the operands remapped, see if the instruction constant folds or is
    // otherwise simplifiable. This commonly occurs because the entry from PHI
    // nodes allows icmps and other instructions to fold.
    Value *V = SimplifyInstruction(C, SQ);
    if (V && LI->replacementPreservesLCSSAForm(C, V)) {
      // If so, then delete the temporary instruction and stick the folded
      // value in the map.
      ValueMap[Inst] = V;
      if (!C->mayHaveSideEffects()) {
        C->deleteValue();
        C = nullptr;
      }
    } else {
      ValueMap[Inst] = C;
    }
    if (C) {
      // Otherwise, stick the new instruction into the new block!
      C->setName(Inst->getName());
      C->insertBefore(LoopEntryBranch);

      if (auto *II = dyn_cast<IntrinsicInst>(C))
        if (II->getIntrinsicID() == Intrinsic::assume)
          AC->registerAssumption(II);
    }
  }

  // Along with all the other instructions, we just cloned OrigHeader's
  // terminator into OrigPreHeader. Fix up the PHI nodes in each of
  // OrigHeader's successors by duplicating their incoming values for
  // OrigHeader.
  TerminatorInst *TI = OrigHeader->getTerminator();
  for (BasicBlock *SuccBB : TI->successors())
    for (BasicBlock::iterator BI = SuccBB->begin();
         PHINode *PN = dyn_cast<PHINode>(BI); ++BI)
      PN->addIncoming(PN->getIncomingValueForBlock(OrigHeader), OrigPreheader);

  // Now that OrigPreHeader has a clone of OrigHeader's terminator, remove
  // OrigPreHeader's old terminator (the original branch into the loop), and
  // remove the corresponding incoming values from the PHI nodes in OrigHeader.
  LoopEntryBranch->eraseFromParent();

  SmallVector<PHINode*, 2> InsertedPHIs;
  // If there were any uses of instructions in the duplicated block outside the
  // loop, update them, inserting PHI nodes as required.
  RewriteUsesOfClonedInstructions(OrigHeader, OrigPreheader, ValueMap,
                                  &InsertedPHIs);

  // Attach dbg.value intrinsics to the new phis if that phi uses a value that
  // previously had debug metadata attached. This keeps the debug info
  // up-to-date in the loop body.
  if (!InsertedPHIs.empty())
    insertDebugValues(OrigHeader, InsertedPHIs);

  // NewHeader is now the header of the loop.
  L->moveToHeader(NewHeader);
  assert(L->getHeader() == NewHeader && "Latch block is our new header");

  // Inform DT about changes to the CFG.
  if (DT) {
    // The OrigPreheader branches to the NewHeader and Exit now. Inform the DT
    // about the edge to the OrigHeader that got removed.
    SmallVector<DominatorTree::UpdateType, 3> Updates;
    Updates.push_back({DominatorTree::Insert, OrigPreheader, Exit});
    Updates.push_back({DominatorTree::Insert, OrigPreheader, NewHeader});
    Updates.push_back({DominatorTree::Delete, OrigPreheader, OrigHeader});
    DT->applyUpdates(Updates);
  }

  // At this point, we've finished our major CFG changes. As part of cloning
  // the loop into the preheader we've simplified instructions and the
  // duplicated conditional branch may now be branching on a constant. If it is
  // branching on a constant and if that constant means that we enter the loop,
  // then we fold away the cond branch to an uncond branch. This simplifies the
  // loop in cases important for nested loops, and it also means we don't have
  // to split as many edges.
  BranchInst *PHBI = cast<BranchInst>(OrigPreheader->getTerminator());
  assert(PHBI->isConditional() && "Should be clone of BI condbr!");
  if (!isa<ConstantInt>(PHBI->getCondition()) ||
      PHBI->getSuccessor(cast<ConstantInt>(PHBI->getCondition())->isZero()) !=
          NewHeader) {
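    // (getSuccessor(isZero()) is the successor the constant branch would
    // actually take; if that target is not NewHeader, the cloned branch does
    // not unconditionally enter the loop and cannot be folded away.)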
    // The conditional branch can't be folded, handle the general case.
    // Split edges as necessary to preserve LoopSimplify form.
    // Right now OrigPreHeader has two successors, NewHeader and ExitBlock, and
    // thus is not a preheader anymore.
    // Split the edge to form a real preheader.
    BasicBlock *NewPH = SplitCriticalEdge(
        OrigPreheader, NewHeader,
        CriticalEdgeSplittingOptions(DT, LI).setPreserveLCSSA());
    NewPH->setName(NewHeader->getName() + ".lr.ph");

    // Preserve canonical loop form, which means that 'Exit' should have only
    // one predecessor. Note that Exit could be an exit block for multiple
    // nested loops, causing both of the edges to now be critical and need to
    // be split.
    SmallVector<BasicBlock *, 4> ExitPreds(pred_begin(Exit), pred_end(Exit));
    bool SplitLatchEdge = false;
    for (BasicBlock *ExitPred : ExitPreds) {
      // We only need to split loop exit edges.
      Loop *PredLoop = LI->getLoopFor(ExitPred);
      if (!PredLoop || PredLoop->contains(Exit))
        continue;
      if (isa<IndirectBrInst>(ExitPred->getTerminator()))
        continue;
      SplitLatchEdge |= L->getLoopLatch() == ExitPred;
      BasicBlock *ExitSplit = SplitCriticalEdge(
          ExitPred, Exit,
          CriticalEdgeSplittingOptions(DT, LI).setPreserveLCSSA());
|
2014-01-29 21:16:53 +08:00
|
|
|
ExitSplit->moveBefore(Exit);
|
|
|
|
}
|
|
|
|
assert(SplitLatchEdge &&
|
|
|
|
"Despite splitting all preds, failed to split latch exit?");
|
When loop rotation happens, it is *very* common for the duplicated condbr
to be foldable into an uncond branch. When this happens, we can make a
much simpler CFG for the loop, which is important for nested loop cases
where we want the outer loop to be aggressively optimized.
Handle this case more aggressively. For example, previously on
phi-duplicate.ll we would get this:
define void @test(i32 %N, double* %G) nounwind ssp {
entry:
%cmp1 = icmp slt i64 1, 1000
br i1 %cmp1, label %bb.nph, label %for.end
bb.nph: ; preds = %entry
br label %for.body
for.body: ; preds = %bb.nph, %for.cond
%j.02 = phi i64 [ 1, %bb.nph ], [ %inc, %for.cond ]
%arrayidx = getelementptr inbounds double* %G, i64 %j.02
%tmp3 = load double* %arrayidx
%sub = sub i64 %j.02, 1
%arrayidx6 = getelementptr inbounds double* %G, i64 %sub
%tmp7 = load double* %arrayidx6
%add = fadd double %tmp3, %tmp7
%arrayidx10 = getelementptr inbounds double* %G, i64 %j.02
store double %add, double* %arrayidx10
%inc = add nsw i64 %j.02, 1
br label %for.cond
for.cond: ; preds = %for.body
%cmp = icmp slt i64 %inc, 1000
br i1 %cmp, label %for.body, label %for.cond.for.end_crit_edge
for.cond.for.end_crit_edge: ; preds = %for.cond
br label %for.end
for.end: ; preds = %for.cond.for.end_crit_edge, %entry
ret void
}
Now we get the much nicer:
define void @test(i32 %N, double* %G) nounwind ssp {
entry:
br label %for.body
for.body: ; preds = %entry, %for.body
%j.01 = phi i64 [ 1, %entry ], [ %inc, %for.body ]
%arrayidx = getelementptr inbounds double* %G, i64 %j.01
%tmp3 = load double* %arrayidx
%sub = sub i64 %j.01, 1
%arrayidx6 = getelementptr inbounds double* %G, i64 %sub
%tmp7 = load double* %arrayidx6
%add = fadd double %tmp3, %tmp7
%arrayidx10 = getelementptr inbounds double* %G, i64 %j.01
store double %add, double* %arrayidx10
%inc = add nsw i64 %j.01, 1
%cmp = icmp slt i64 %inc, 1000
br i1 %cmp, label %for.body, label %for.end
for.end: ; preds = %for.body
ret void
}
With all of these recent changes, we are now able to compile:
void foo(char *X) {
for (int i = 0; i != 100; ++i)
for (int j = 0; j != 100; ++j)
X[j+i*100] = 0;
}
into a single memset of 10000 bytes. This series of changes
should also be helpful for other nested loop scenarios.
llvm-svn: 123079
2011-01-09 03:59:06 +08:00
|
|
|
} else {
|
|
|
|
// We can fold the conditional branch in the preheader; this makes things
|
|
|
|
// simpler. The first step is to remove the extra edge to the Exit block.
|
|
|
|
Exit->removePredecessor(OrigPreheader, true /*preserve LCSSA*/);
|
2011-04-30 04:38:55 +08:00
|
|
|
BranchInst *NewBI = BranchInst::Create(NewHeader, PHBI);
|
|
|
|
NewBI->setDebugLoc(PHBI->getDebugLoc());
|
2011-01-09 03:59:06 +08:00
|
|
|
PHBI->eraseFromParent();
|
2012-02-14 08:00:19 +08:00
|
|
|
|
2011-01-09 03:59:06 +08:00
|
|
|
// With our CFG finalized, update DomTree if it is available.
|
2017-08-18 05:48:19 +08:00
|
|
|
if (DT) DT->deleteEdge(OrigPreheader, Exit);
|
2007-07-12 07:47:28 +08:00
|
|
|
}
|
2012-02-14 08:00:19 +08:00
|
|
|
|
2011-01-09 03:59:06 +08:00
|
|
|
assert(L->getLoopPreheader() && "Invalid loop preheader after loop rotation");
|
2011-01-09 02:52:51 +08:00
|
|
|
assert(L->getLoopLatch() && "Invalid loop latch after loop rotation");
|
2011-01-09 02:55:50 +08:00
|
|
|
|
2011-01-11 15:47:59 +08:00
|
|
|
// Now that the CFG and DomTree are in a consistent state again, try to merge
|
|
|
|
// the OrigHeader block into OrigLatch. This will succeed if they are
|
|
|
|
// connected by an unconditional branch. This is just a cleanup so the
|
|
|
|
// emitted code isn't too gross in this common case.
|
2015-01-18 10:11:23 +08:00
|
|
|
MergeBlockIntoPredecessor(OrigHeader, DT, LI);
|
2012-02-14 08:00:19 +08:00
|
|
|
|
2012-08-30 23:39:42 +08:00
|
|
|
DEBUG(dbgs() << "LoopRotation: into "; L->dump());
|
|
|
|
|
2011-01-09 02:55:50 +08:00
|
|
|
++NumRotated;
|
|
|
|
return true;
|
2007-04-10 04:19:46 +08:00
|
|
|
}
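As an aside for readers of the PR18643 fix discussed above, the shared-exit shape it guards against can be pictured with a small, purely illustrative C++ function (hypothetical code, not from the LLVM tree): both loops funnel into a single exit block, so once rotation clones the inner latch, every edge into that exit coming from a more deeply nested loop must be split to keep both loops in LoopSimplify form.
// Hedged illustration only: a hypothetical source shape, not part of this pass.
// The early return gives the inner and outer loop a single shared exit block.
static void findFirstNegative(const int *A, int N, int *OutI, int *OutJ) {
  for (int i = 0; i < N; ++i)          // outer loop
    for (int j = 0; j < N; ++j)        // inner loop
      if (A[i * N + j] < 0) {
        *OutI = i;
        *OutJ = j;
        return;                        // exit block shared by both loops
      }
  *OutI = *OutJ = -1;
}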
|
2015-12-15 07:22:44 +08:00
|
|
|
|
|
|
|
/// Determine whether the instructions in this range may be safely and cheaply
|
|
|
|
/// speculated. This is not an important enough situation to develop complex
|
|
|
|
/// heuristics. We handle a single arithmetic instruction along with any type
|
|
|
|
/// conversions.
|
|
|
|
static bool shouldSpeculateInstrs(BasicBlock::iterator Begin,
|
|
|
|
BasicBlock::iterator End, Loop *L) {
|
|
|
|
bool seenIncrement = false;
|
|
|
|
bool MultiExitLoop = false;
|
|
|
|
|
|
|
|
if (!L->getExitingBlock())
|
|
|
|
MultiExitLoop = true;
|
|
|
|
|
|
|
|
for (BasicBlock::iterator I = Begin; I != End; ++I) {
|
|
|
|
|
|
|
|
if (!isSafeToSpeculativelyExecute(&*I))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (isa<DbgInfoIntrinsic>(I))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
switch (I->getOpcode()) {
|
|
|
|
default:
|
|
|
|
return false;
|
|
|
|
case Instruction::GetElementPtr:
|
|
|
|
// GEPs are cheap if all indices are constant.
|
|
|
|
if (!cast<GEPOperator>(I)->hasAllConstantIndices())
|
|
|
|
return false;
|
2016-08-17 13:10:15 +08:00
|
|
|
// fall-thru to increment case
|
|
|
|
LLVM_FALLTHROUGH;
|
2015-12-15 07:22:44 +08:00
|
|
|
case Instruction::Add:
|
|
|
|
case Instruction::Sub:
|
|
|
|
case Instruction::And:
|
|
|
|
case Instruction::Or:
|
|
|
|
case Instruction::Xor:
|
|
|
|
case Instruction::Shl:
|
|
|
|
case Instruction::LShr:
|
|
|
|
case Instruction::AShr: {
|
2016-06-14 22:44:05 +08:00
|
|
|
Value *IVOpnd =
|
|
|
|
!isa<Constant>(I->getOperand(0))
|
|
|
|
? I->getOperand(0)
|
|
|
|
: !isa<Constant>(I->getOperand(1)) ? I->getOperand(1) : nullptr;
|
2015-12-15 07:22:44 +08:00
|
|
|
if (!IVOpnd)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
// If the increment operand is used outside of the loop, this speculation
|
|
|
|
// could cause extra live range interference.
|
|
|
|
if (MultiExitLoop) {
|
|
|
|
for (User *UseI : IVOpnd->users()) {
|
|
|
|
auto *UserInst = cast<Instruction>(UseI);
|
|
|
|
if (!L->contains(UserInst))
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (seenIncrement)
|
|
|
|
return false;
|
|
|
|
seenIncrement = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case Instruction::Trunc:
|
|
|
|
case Instruction::ZExt:
|
|
|
|
case Instruction::SExt:
|
|
|
|
// ignore type conversions
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
}
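To make the heuristic above concrete, here is a sketch (hypothetical code under stated assumptions, not taken from the test suite) of a loop whose latch, once the IR has been canonicalized, usually holds nothing beyond the induction-variable increment and perhaps a cast, which is exactly the single cheap instruction shouldSpeculateInstrs is willing to accept.
// Hedged illustration only: hypothetical source, not part of this file.
// The latch block is expected to contain little more than "i = i + 1",
// so hoisting it into the exiting block is safe and cheap.
static long sumUntilSentinel(const int *A) {
  long Sum = 0;
  for (int i = 0; A[i] != 0; ++i)      // latch: a single post-increment
    Sum += A[i];
  return Sum;
}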
|
|
|
|
|
|
|
|
/// Fold the loop tail into the loop exit by speculating the loop tail
|
|
|
|
/// instructions. Typically, this is a single post-increment. In the case of a
|
|
|
|
/// simple 2-block loop, hoisting the increment can be much better than
|
|
|
|
/// duplicating the entire loop header. In the case of loops with early exits,
|
|
|
|
/// rotation will not work anyway, but simplifyLoopLatch will put the loop in
|
|
|
|
/// canonical form so downstream passes can handle it.
|
|
|
|
///
|
|
|
|
/// I don't believe this invalidates SCEV.
|
2016-06-14 22:44:05 +08:00
|
|
|
bool LoopRotate::simplifyLoopLatch(Loop *L) {
|
2015-12-15 07:22:44 +08:00
|
|
|
BasicBlock *Latch = L->getLoopLatch();
|
|
|
|
if (!Latch || Latch->hasAddressTaken())
|
|
|
|
return false;
|
|
|
|
|
|
|
|
BranchInst *Jmp = dyn_cast<BranchInst>(Latch->getTerminator());
|
|
|
|
if (!Jmp || !Jmp->isUnconditional())
|
|
|
|
return false;
|
|
|
|
|
|
|
|
BasicBlock *LastExit = Latch->getSinglePredecessor();
|
|
|
|
if (!LastExit || !L->isLoopExiting(LastExit))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
BranchInst *BI = dyn_cast<BranchInst>(LastExit->getTerminator());
|
|
|
|
if (!BI)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (!shouldSpeculateInstrs(Latch->begin(), Jmp->getIterator(), L))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
DEBUG(dbgs() << "Folding loop latch " << Latch->getName() << " into "
|
2016-06-14 22:44:05 +08:00
|
|
|
<< LastExit->getName() << "\n");
|
2015-12-15 07:22:44 +08:00
|
|
|
|
|
|
|
// Hoist the instructions from Latch into LastExit.
|
|
|
|
LastExit->getInstList().splice(BI->getIterator(), Latch->getInstList(),
|
|
|
|
Latch->begin(), Jmp->getIterator());
|
|
|
|
|
|
|
|
unsigned FallThruPath = BI->getSuccessor(0) == Latch ? 0 : 1;
|
|
|
|
BasicBlock *Header = Jmp->getSuccessor(0);
|
|
|
|
assert(Header == L->getHeader() && "expected a backward branch");
|
|
|
|
|
|
|
|
// Remove Latch from the CFG so that LastExit becomes the new Latch.
|
|
|
|
BI->setSuccessor(FallThruPath, Header);
|
|
|
|
Latch->replaceSuccessorsPhiUsesWith(LastExit);
|
|
|
|
Jmp->eraseFromParent();
|
|
|
|
|
|
|
|
// Nuke the Latch block.
|
|
|
|
assert(Latch->empty() && "unable to evacuate Latch");
|
|
|
|
LI->removeBlock(Latch);
|
|
|
|
if (DT)
|
|
|
|
DT->eraseNode(Latch);
|
|
|
|
Latch->eraseFromParent();
|
|
|
|
return true;
|
|
|
|
}
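The doc comment for simplifyLoopLatch mentions loops with early exits; a sketch of that shape (hypothetical code, purely illustrative) follows. Rotation generally cannot help such a loop, but folding a bare-increment latch into the exiting block still leaves the loop in a canonical two-block form for downstream passes.
// Hedged illustration only: hypothetical source, not part of this file.
// A search loop with an early exit out of the loop body.
static int findIndex(const int *A, int N, int Key) {
  for (int i = 0; i < N; ++i)
    if (A[i] == Key)
      return i;                        // early exit
  return -1;
}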
|
|
|
|
|
2016-06-11 06:03:56 +08:00
|
|
|
/// Rotate \c L, and return true if any modification was made.
|
2016-06-14 22:44:05 +08:00
|
|
|
bool LoopRotate::processLoop(Loop *L) {
|
2015-12-15 07:22:44 +08:00
|
|
|
// Save the loop metadata.
|
|
|
|
MDNode *LoopMD = L->getLoopID();
|
|
|
|
|
|
|
|
// Simplify the loop latch before attempting to rotate the header
|
|
|
|
// upward. Rotation may not be needed if the loop tail can be folded into the
|
|
|
|
// loop exit.
|
2016-06-14 22:44:05 +08:00
|
|
|
bool SimplifiedLatch = simplifyLoopLatch(L);
|
2015-12-15 07:22:44 +08:00
|
|
|
|
2016-06-14 22:44:05 +08:00
|
|
|
bool MadeChange = rotateLoop(L, SimplifiedLatch);
|
2016-06-11 06:03:56 +08:00
|
|
|
assert((!MadeChange || L->isLoopExiting(L->getLoopLatch())) &&
|
|
|
|
"Loop latch should be exiting after loop-rotate.");
|
2015-12-15 07:22:44 +08:00
|
|
|
|
|
|
|
// Restore the loop metadata.
|
|
|
|
// NB! We presume LoopRotation DOESN'T ADD its own metadata.
|
|
|
|
if ((MadeChange || SimplifiedLatch) && LoopMD)
|
|
|
|
L->setLoopID(LoopMD);
|
|
|
|
|
|
|
|
return MadeChange;
|
|
|
|
}
|
2015-12-15 07:22:48 +08:00
|
|
|
|
2016-12-22 14:59:15 +08:00
|
|
|
LoopRotatePass::LoopRotatePass(bool EnableHeaderDuplication)
|
|
|
|
: EnableHeaderDuplication(EnableHeaderDuplication) {}
|
2016-05-04 06:02:31 +08:00
|
|
|
|
2017-01-11 14:23:21 +08:00
|
|
|
PreservedAnalyses LoopRotatePass::run(Loop &L, LoopAnalysisManager &AM,
|
|
|
|
LoopStandardAnalysisResults &AR,
|
|
|
|
LPMUpdater &) {
|
2016-12-22 14:59:15 +08:00
|
|
|
int Threshold = EnableHeaderDuplication ? DefaultRotationThreshold : 0;
|
2017-04-26 21:52:18 +08:00
|
|
|
const DataLayout &DL = L.getHeader()->getModule()->getDataLayout();
|
2017-04-29 06:05:55 +08:00
|
|
|
const SimplifyQuery SQ = getBestSimplifyQuery(AR, DL);
|
2017-04-29 03:55:38 +08:00
|
|
|
LoopRotate LR(Threshold, &AR.LI, &AR.TTI, &AR.AC, &AR.DT, &AR.SE,
|
2017-04-29 06:05:55 +08:00
|
|
|
SQ);
|
2016-05-04 06:02:31 +08:00
|
|
|
|
2016-06-14 22:44:05 +08:00
|
|
|
bool Changed = LR.processLoop(&L);
|
2016-05-04 06:02:31 +08:00
|
|
|
if (!Changed)
|
|
|
|
return PreservedAnalyses::all();
|
2017-01-15 14:32:49 +08:00
|
|
|
|
2016-05-04 06:02:31 +08:00
|
|
|
return getLoopPassPreservedAnalyses();
|
|
|
|
}
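For context, a minimal sketch of how this pass might be scheduled with the new pass manager is shown below. This is an assumption-laden usage example, not part of this file: it presumes the LoopPassManager and adaptor are available under the header names shown and that the usual analysis registration has been done elsewhere.
// Hedged usage sketch for the new pass manager; helper names such as
// buildLoopRotatePipeline are placeholders invented for this example.
#include "llvm/IR/PassManager.h"
#include "llvm/Transforms/Scalar/LoopPassManager.h"
#include "llvm/Transforms/Scalar/LoopRotation.h"
using namespace llvm;

static FunctionPassManager buildLoopRotatePipeline() {
  LoopPassManager LPM;
  LPM.addPass(LoopRotatePass());   // header duplication enabled by default
  FunctionPassManager FPM;
  FPM.addPass(createFunctionToLoopPassAdaptor(std::move(LPM)));
  return FPM;
}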
|
|
|
|
|
2015-12-15 07:22:48 +08:00
|
|
|
namespace {
|
|
|
|
|
2016-05-04 06:02:31 +08:00
|
|
|
class LoopRotateLegacyPass : public LoopPass {
|
2015-12-15 07:22:48 +08:00
|
|
|
unsigned MaxHeaderSize;
|
|
|
|
|
|
|
|
public:
|
|
|
|
static char ID; // Pass ID, replacement for typeid
|
2016-05-04 06:02:31 +08:00
|
|
|
LoopRotateLegacyPass(int SpecifiedMaxHeaderSize = -1) : LoopPass(ID) {
|
|
|
|
initializeLoopRotateLegacyPassPass(*PassRegistry::getPassRegistry());
|
2015-12-15 07:22:48 +08:00
|
|
|
if (SpecifiedMaxHeaderSize == -1)
|
|
|
|
MaxHeaderSize = DefaultRotationThreshold;
|
|
|
|
else
|
|
|
|
MaxHeaderSize = unsigned(SpecifiedMaxHeaderSize);
|
|
|
|
}
|
|
|
|
|
|
|
|
// LCSSA form makes instruction renaming easier.
|
|
|
|
void getAnalysisUsage(AnalysisUsage &AU) const override {
|
2016-12-19 16:22:17 +08:00
|
|
|
AU.addRequired<AssumptionCacheTracker>();
|
2015-12-15 07:22:48 +08:00
|
|
|
AU.addRequired<TargetTransformInfoWrapperPass>();
|
[LPM] Factor all of the loop analysis usage updates into a common helper
routine.
We were getting this wrong in small ways and generally being very
inconsistent about it across loop passes. Instead, let's have a common
place where we do this. One minor downside is that this will require
some analyses like SCEV in more places than they are strictly needed.
However, this seems benign as these analyses are complete no-ops, and
without this consistency we can in many cases end up with the legacy
pass manager scheduling deciding to split up a loop pass pipeline in
order to run the function analysis half-way through. It is very, very
annoying to fix these without just being very pedantic across the board.
The only loop passes I've not updated here are ones that use
AU.setPreservesAll() such as IVUsers (an analysis) and the pass printer.
They seemed less relevant.
With this patch, almost all of the problems in PR24804 around loop pass
pipelines are fixed. The one remaining issue is that we run simplify-cfg
and instcombine in the middle of the loop pass pipeline. We've recently
added some loop variants of these passes that would seem substantially
cleaner to use, but this at least gets us much closer to the previous
state. Notably, the seven loop pass managers are down to three.
I've not updated the loop passes using LoopAccessAnalysis because that
analysis hasn't been fully wired into LoopSimplify/LCSSA, and it isn't
clear that those transforms want to support those forms anyway. They
all run late anyway, so this is harmless. Similarly, LSR is left alone
because it already carefully manages its forms and doesn't need to get
fused into a single loop pass manager with a bunch of other loop passes.
LoopReroll didn't use loop simplified form previously, and I've updated
the test case to match the trivially different output.
Finally, I've also factored all the pass initialization for the passes
that use this technique as well, so that should be done regularly and
reliably.
Thanks to James for the help reviewing and thinking about this stuff,
and Ben for help thinking about it as well!
Differential Revision: http://reviews.llvm.org/D17435
llvm-svn: 261316
2016-02-19 18:45:18 +08:00
|
|
|
getLoopAnalysisUsage(AU);
|
2015-12-15 07:22:48 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
bool runOnLoop(Loop *L, LPPassManager &LPM) override {
|
2016-04-23 06:06:11 +08:00
|
|
|
if (skipLoop(L))
|
2015-12-15 07:22:48 +08:00
|
|
|
return false;
|
|
|
|
Function &F = *L->getHeader()->getParent();
|
|
|
|
|
|
|
|
auto *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
|
|
|
|
const auto *TTI = &getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
|
2016-12-19 16:22:17 +08:00
|
|
|
auto *AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
|
2015-12-15 07:22:48 +08:00
|
|
|
auto *DTWP = getAnalysisIfAvailable<DominatorTreeWrapperPass>();
|
|
|
|
auto *DT = DTWP ? &DTWP->getDomTree() : nullptr;
|
|
|
|
auto *SEWP = getAnalysisIfAvailable<ScalarEvolutionWrapperPass>();
|
|
|
|
auto *SE = SEWP ? &SEWP->getSE() : nullptr;
|
2017-04-29 06:05:55 +08:00
|
|
|
const SimplifyQuery SQ = getBestSimplifyQuery(*this, F);
|
|
|
|
LoopRotate LR(MaxHeaderSize, LI, TTI, AC, DT, SE, SQ);
|
2016-06-14 22:44:05 +08:00
|
|
|
return LR.processLoop(L);
|
2015-12-15 07:22:48 +08:00
|
|
|
}
|
|
|
|
};
|
|
|
|
}
|
|
|
|
|
2016-05-04 06:02:31 +08:00
|
|
|
char LoopRotateLegacyPass::ID = 0;
|
|
|
|
INITIALIZE_PASS_BEGIN(LoopRotateLegacyPass, "loop-rotate", "Rotate Loops",
|
|
|
|
false, false)
|
2016-12-19 16:22:17 +08:00
|
|
|
INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
|
2016-02-19 18:45:18 +08:00
|
|
|
INITIALIZE_PASS_DEPENDENCY(LoopPass)
|
|
|
|
INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
|
2016-06-14 22:44:05 +08:00
|
|
|
INITIALIZE_PASS_END(LoopRotateLegacyPass, "loop-rotate", "Rotate Loops", false,
|
|
|
|
false)
|
2015-12-15 07:22:48 +08:00
|
|
|
|
|
|
|
Pass *llvm::createLoopRotatePass(int MaxHeaderSize) {
|
2016-05-04 06:02:31 +08:00
|
|
|
return new LoopRotateLegacyPass(MaxHeaderSize);
|
2015-12-15 07:22:48 +08:00
|
|
|
}
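Finally, a corresponding sketch for the legacy pass manager, again offered as an illustrative assumption rather than something in this file: createLoopRotatePass is the factory defined just above, and M stands for whatever Module the caller already has.
// Hedged usage sketch for the legacy pass manager.
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Scalar.h"

void rotateAllLoops(llvm::Module &M) {
  llvm::legacy::PassManager PM;
  PM.add(llvm::createLoopRotatePass());   // -1 selects DefaultRotationThreshold
  PM.run(M);
}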
|