//===- ARMTargetTransformInfo.cpp - ARM specific TTI ----------------------===//
Switch TargetTransformInfo from an immutable analysis pass that requires
a TargetMachine to construct (and thus isn't always available), to an
analysis group that supports layered implementations much like
AliasAnalysis does. This is a pretty massive change, with a few parts
that I was unable to easily separate (sorry), so I'll walk through it.

The first step of this conversion was to make TargetTransformInfo an
analysis group, and to sink the nonce implementations in
ScalarTargetTransformInfo and VectorTargetTransformInfo into
a NoTargetTransformInfo pass. This allows other passes to add a hard
requirement on TTI, and assume they will always get at least one
implementation.

The TargetTransformInfo analysis group leverages the delegation chaining
trick that AliasAnalysis uses, where the base class for the analysis
group delegates to the previous analysis *pass*, allowing all but the
NoFoo analysis passes to only implement the parts of the interfaces they
support. It also introduces a new trick where each pass in the group
retains a pointer to the top-most pass that has been initialized. This
allows passes to implement one API in terms of another API and benefit
when some other pass above them in the stack has more precise results
for the second API.

The second step of this conversion is to create a pass that implements
the TargetTransformInfo analysis using the target-independent
abstractions in the code generator. This replaces the
ScalarTargetTransformImpl and VectorTargetTransformImpl classes in
lib/Target with a single pass in lib/CodeGen called
BasicTargetTransformInfo. This class actually provides most of the TTI
functionality, basing it upon the TargetLowering abstraction and other
information in the target independent code generator.

The third step of the conversion adds support to all TargetMachines to
register custom analysis passes. This allows building those passes with
access to TargetLowering or other target-specific classes, and it also
allows each target to customize the set of analysis passes desired in
the pass manager. The baseline LLVMTargetMachine implements this
interface to add the BasicTTI pass to the pass manager, and all of the
tools that want to support target-aware TTI passes call this routine on
whatever target machine they end up with to add the appropriate passes.

The fourth step of the conversion created target-specific TTI analysis
passes for the X86 and ARM backends. These passes contain the custom
logic that was previously in their extensions of the
ScalarTargetTransformInfo and VectorTargetTransformInfo interfaces.
I separated them into their own file, as now all of the interface bits
are private and they just expose a function to create the pass itself.
Then I extended these target machines to set up a custom set of analysis
passes, first adding BasicTTI as a fallback, and then adding their
customized TTI implementations.

The fourth step required logic that was shared between the target
independent layer and the specific targets to move to a different
interface, as they no longer derive from each other. As a consequence,
helper functions were added to TargetLowering representing the common
logic needed both in the target implementation and the codegen
implementation of the TTI pass. While technically this is the only
change that could have been committed separately, it would have been
a nightmare to extract.

The final step of the conversion was just to delete all the old
boilerplate. This got rid of the ScalarTargetTransformInfo and
VectorTargetTransformInfo classes, all of the support in all of the
targets for producing instances of them, and all of the support in the
tools for manually constructing a pass based around them.

Now that TTI is a relatively normal analysis group, two things become
straightforward. First, we can sink it into lib/Analysis which is a more
natural layer for it to live. Second, clients of this interface can
depend on it *always* being available which will simplify their code and
behavior. These (and other) simplifications will follow in subsequent
commits; this one is clearly big enough.

Finally, I'm very aware that much of the comments and documentation
needs to be updated. As soon as I had this working, and plausibly well
commented, I wanted to get it committed and in front of the build bots.
I'll be doing a few passes over documentation later if it sticks.

Commits to update DragonEgg and Clang will be made presently.

llvm-svn: 171681
2013-01-07 09:37:14 +08:00
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#include "ARMTargetTransformInfo.h"
#include "ARMSubtarget.h"
#include "MCTargetDesc/ARMAddressingModes.h"
#include "llvm/ADT/APInt.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/CodeGen/CostTable.h"
#include "llvm/CodeGen/ISDOpcodes.h"
#include "llvm/CodeGen/ValueTypes.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CallSite.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Type.h"
#include "llvm/MC/SubtargetFeature.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/MachineValueType.h"
#include "llvm/Target/TargetMachine.h"
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>

using namespace llvm;

#define DEBUG_TYPE "armtti"

static cl::opt<bool> DisableLowOverheadLoops(
    "disable-arm-loloops", cl::Hidden, cl::init(true),
    cl::desc("Disable the generation of low-overhead loops"));

[ARM] Inline callee if its target-features are a subset of the caller
Summary:
Similar to X86, it should be safe to inline callees if their
target-features are a subset of the caller. As some subtarget features
provide different instructions depending on whether they are set or
unset (e.g. ThumbMode and ModeSoftFloat), we use a whitelist of
target-features describing hardware capabilities only.
Reviewers: kristof.beyls, rengolin, t.p.northover, SjoerdMeijer, peter.smith, silviu.baranga, efriedma
Reviewed By: SjoerdMeijer, efriedma
Subscribers: dschuff, efriedma, aemerson, sdardis, javed.absar, arichardson, eraman, llvm-commits
Differential Revision: https://reviews.llvm.org/D34697
llvm-svn: 307889
2017-07-13 16:26:17 +08:00
bool ARMTTIImpl::areInlineCompatible(const Function *Caller,
                                     const Function *Callee) const {
  const TargetMachine &TM = getTLI()->getTargetMachine();
  const FeatureBitset &CallerBits =
      TM.getSubtargetImpl(*Caller)->getFeatureBits();
  const FeatureBitset &CalleeBits =
      TM.getSubtargetImpl(*Callee)->getFeatureBits();

  // To inline a callee, all features not in the whitelist must match exactly.
  bool MatchExact = (CallerBits & ~InlineFeatureWhitelist) ==
                    (CalleeBits & ~InlineFeatureWhitelist);
  // For features in the whitelist, the callee's features must be a subset of
  // the callers'.
  bool MatchSubset = ((CallerBits & CalleeBits) & InlineFeatureWhitelist) ==
                     (CalleeBits & InlineFeatureWhitelist);
  return MatchExact && MatchSubset;
}

int ARMTTIImpl::getIntImmCost(const APInt &Imm, Type *Ty) {
  assert(Ty->isIntegerTy());

  unsigned Bits = Ty->getPrimitiveSizeInBits();
  if (Bits == 0 || Imm.getActiveBits() >= 64)
    return 4;

  int64_t SImmVal = Imm.getSExtValue();
  uint64_t ZImmVal = Imm.getZExtValue();
  if (!ST->isThumb()) {
    if ((SImmVal >= 0 && SImmVal < 65536) ||
        (ARM_AM::getSOImmVal(ZImmVal) != -1) ||
        (ARM_AM::getSOImmVal(~ZImmVal) != -1))
      return 1;
    return ST->hasV6T2Ops() ? 2 : 3;
  }
  if (ST->isThumb2()) {
|
2013-01-07 09:37:14 +08:00
|
|
|
if ((SImmVal >= 0 && SImmVal < 65536) ||
|
|
|
|
(ARM_AM::getT2SOImmVal(ZImmVal) != -1) ||
|
|
|
|
(ARM_AM::getT2SOImmVal(~ZImmVal) != -1))
|
|
|
|
return 1;
|
|
|
|
return ST->hasV6T2Ops() ? 2 : 3;
|
|
|
|
}
|
2018-09-25 00:15:23 +08:00
|
|
|
// Thumb1, any i8 imm cost 1.
|
|
|
|
if (Bits == 8 || (SImmVal >= 0 && SImmVal < 256))
|
2014-03-08 23:15:42 +08:00
|
|
|
return 1;
|
2016-09-08 20:58:04 +08:00
|
|
|
if ((~SImmVal < 256) || ARM_AM::isThumbImmShiftedVal(ZImmVal))
|
2014-03-08 23:15:42 +08:00
|
|
|
return 2;
|
|
|
|
// Load from constantpool.
|
|
|
|
return 3;
|
2013-01-07 09:37:14 +08:00
|
|
|
}
|
2013-01-30 07:31:38 +08:00
|
|
|
|
2016-07-14 15:44:20 +08:00
|
|
|
// Constants smaller than 256 fit in the immediate field of
|
|
|
|
// Thumb1 instructions so we return a zero cost and 1 otherwise.
|
|
|
|
int ARMTTIImpl::getIntImmCodeSizeCost(unsigned Opcode, unsigned Idx,
|
|
|
|
const APInt &Imm, Type *Ty) {
|
|
|
|
if (Imm.isNonNegative() && Imm.getLimitedValue() < 256)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2016-04-16 02:17:18 +08:00
|
|
|
int ARMTTIImpl::getIntImmCost(unsigned Opcode, unsigned Idx, const APInt &Imm,
|
|
|
|
Type *Ty) {
|
|
|
|
// Division by a constant can be turned into multiplication, but only if we
|
|
|
|
// know it's constant. So it's not so much that the immediate is cheap (it's
|
|
|
|
// not), but that the alternative is worse.
|
|
|
|
// FIXME: this is probably unneeded with GlobalISel.
|
|
|
|
if ((Opcode == Instruction::SDiv || Opcode == Instruction::UDiv ||
|
|
|
|
Opcode == Instruction::SRem || Opcode == Instruction::URem) &&
|
|
|
|
Idx == 1)
|
|
|
|
return 0;
|
|
|
|
|
2019-02-04 19:58:48 +08:00
|
|
|
if (Opcode == Instruction::And) {
|
|
|
|
// UXTB/UXTH
|
|
|
|
if (Imm == 255 || Imm == 65535)
|
|
|
|
return 0;
|
|
|
|
// Conversion to BIC is free, and means we can use ~Imm instead.
|
|
|
|
return std::min(getIntImmCost(Imm, Ty), getIntImmCost(~Imm, Ty));
|
|
|
|
}
|
2016-09-08 20:58:12 +08:00
|
|
|
|
2016-09-09 21:35:36 +08:00
|
|
|
if (Opcode == Instruction::Add)
|
|
|
|
// Conversion to SUB is free, and means we can use -Imm instead.
|
|
|
|
return std::min(getIntImmCost(Imm, Ty), getIntImmCost(-Imm, Ty));
|
|
|
|
|
2016-09-09 21:35:28 +08:00
|
|
|
if (Opcode == Instruction::ICmp && Imm.isNegative() &&
|
|
|
|
Ty->getIntegerBitWidth() == 32) {
|
|
|
|
int64_t NegImm = -Imm.getSExtValue();
|
|
|
|
if (ST->isThumb2() && NegImm < 1<<12)
|
|
|
|
// icmp X, #-C -> cmn X, #C
|
|
|
|
return 0;
|
|
|
|
if (ST->isThumb() && NegImm < 1<<8)
|
|
|
|
// icmp X, #-C -> adds X, #C
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-02-20 19:07:35 +08:00
|
|
|
// xor a, -1 can always be folded to MVN
|
2018-02-22 17:38:57 +08:00
|
|
|
if (Opcode == Instruction::Xor && Imm.isAllOnesValue())
|
|
|
|
return 0;
|
2018-02-20 19:07:35 +08:00
|
|
|
|
2016-04-16 02:17:18 +08:00
|
|
|
return getIntImmCost(Imm, Ty);
|
|
|
|
}
|
|
|
|
|
2017-04-12 19:49:08 +08:00
|
|
|
int ARMTTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
|
|
|
|
const Instruction *I) {
|
2013-01-30 07:31:38 +08:00
|
|
|
int ISD = TLI->InstructionOpcodeToISD(Opcode);
|
|
|
|
assert(ISD && "Invalid opcode");
|
|
|
|
|
2013-03-15 23:10:47 +08:00
|
|
|
// Single to/from double precision conversions.
|
2015-10-28 12:02:12 +08:00
|
|
|
static const CostTblEntry NEONFltDblTbl[] = {
|
2013-03-15 23:10:47 +08:00
|
|
|
// Vector fptrunc/fpext conversions.
|
|
|
|
{ ISD::FP_ROUND, MVT::v2f64, 2 },
|
|
|
|
{ ISD::FP_EXTEND, MVT::v2f32, 2 },
|
|
|
|
{ ISD::FP_EXTEND, MVT::v4f32, 4 }
|
|
|
|
};
|
|
|
|
|
|
|
|
if (Src->isVectorTy() && ST->hasNEON() && (ISD == ISD::FP_ROUND ||
|
|
|
|
ISD == ISD::FP_EXTEND)) {
|
2015-08-06 02:08:10 +08:00
|
|
|
std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Src);
|
2015-10-27 12:14:24 +08:00
|
|
|
if (const auto *Entry = CostTableLookup(NEONFltDblTbl, ISD, LT.second))
|
|
|
|
return LT.first * Entry->Cost;
|
2013-03-15 23:10:47 +08:00
|
|
|
}
|
|
|
|
|
2015-07-09 10:09:04 +08:00
|
|
|
EVT SrcTy = TLI->getValueType(DL, Src);
|
|
|
|
EVT DstTy = TLI->getValueType(DL, Dst);
|
2013-01-30 07:31:38 +08:00
|
|
|
|
|
|
|
if (!SrcTy.isSimple() || !DstTy.isSimple())
|
[PM] Change the core design of the TTI analysis to use a polymorphic
type erased interface and a single analysis pass rather than an
extremely complex analysis group.
The end result is that the TTI analysis can contain a type erased
implementation that supports the polymorphic TTI interface. We can build
one from a target-specific implementation or from a dummy one in the IR.
I've also factored all of the code into "mix-in"-able base classes,
including CRTP base classes to facilitate calling back up to the most
specialized form when delegating horizontally across the surface. These
aren't as clean as I would like and I'm planning to work on cleaning
some of this up, but I wanted to start by putting it into the right form.
There are a number of reasons for this change, and this particular
design. The first and foremost reason is that an analysis group is
complete overkill, and the chaining delegation strategy was so opaque,
confusing, and high overhead that TTI was suffering greatly for it.
Several of the TTI functions had failed to be implemented in all places
because of the chaining-based delegation making there be no checking of
this. A few other functions were implemented with incorrect delegation.
The message to me was very clear working on this -- the delegation and
analysis group structure was too confusing to be useful here.
The other reason of course is that this is a *much* more natural fit for
the new pass manager. This will lay the groundwork for a type-erased
per-function info object that can look up the correct subtarget and even
cache it.
Yet another benefit is that this will significantly simplify the
interaction of the pass managers and the TargetMachine. See the future
work below.
The downside of this change is that it is very, very verbose. I'm going
to work to improve that, but it is somewhat an implementation necessity
in C++ to do type erasure. =/ I discussed this design really extensively
with Eric and Hal prior to going down this path, and afterward showed
them the result. No one was really thrilled with it, but there doesn't
seem to be a substantially better alternative. Using a base class and
virtual method dispatch would make the code much shorter, but as
discussed in the update to the programmer's manual and elsewhere,
a polymorphic interface feels like the more principled approach even if
this is perhaps the least compelling example of it. ;]
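
The design described above (a value-semantic object that hides the concrete
implementation behind an internal type-erasure layer, plus CRTP base classes
for calling back into the most specialized form) can be sketched roughly as
follows. This is a minimal illustration, not the actual TTI code: the names
CostInfo, CostBase, DummyImpl, ArmLikeImpl, getCastCost and getTwoCastsCost
are all hypothetical, and the cost values are made up.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical CRTP base: supplies default answers and "horizontal"
// delegation that calls back into the most derived implementation.
template <typename DerivedT> struct CostBase {
  // Conservative default used when a target does not override it.
  int getCastCost(int /*Opcode*/) const { return 1; }

  // Implemented in terms of getCastCost; the static_cast makes sure the
  // most specialized getCastCost is the one that gets called.
  int getTwoCastsCost(int Opcode) const {
    return 2 * static_cast<const DerivedT *>(this)->getCastCost(Opcode);
  }
};

// Hypothetical "dummy" implementation: inherits every default.
struct DummyImpl : CostBase<DummyImpl> {};

// Hypothetical target implementation: overrides a single query.
struct ArmLikeImpl : CostBase<ArmLikeImpl> {
  int getCastCost(int Opcode) const { return Opcode == 0 ? 0 : 3; }
};

// Hypothetical type-erased wrapper: clients see only this class and never
// the concrete implementation type stored inside it.
class CostInfo {
  struct Concept {
    virtual ~Concept() = default;
    virtual int getCastCost(int Opcode) const = 0;
    virtual int getTwoCastsCost(int Opcode) const = 0;
  };

  template <typename ImplT> struct Model final : Concept {
    ImplT Impl;
    explicit Model(ImplT I) : Impl(std::move(I)) {}
    int getCastCost(int Opcode) const override {
      return Impl.getCastCost(Opcode);
    }
    int getTwoCastsCost(int Opcode) const override {
      return Impl.getTwoCastsCost(Opcode);
    }
  };

  std::unique_ptr<Concept> Self;

public:
  template <typename ImplT>
  explicit CostInfo(ImplT Impl)
      : Self(std::make_unique<Model<ImplT>>(std::move(Impl))) {}

  int getCastCost(int Opcode) const { return Self->getCastCost(Opcode); }
  int getTwoCastsCost(int Opcode) const {
    return Self->getTwoCastsCost(Opcode);
  }
};
```

Every query has to be spelled three times in this sketch (in Concept, in
Model, and in the public wrapper), which is the verbosity the message
complains about. In exchange, a query missing from an implementation is a
compile error when Model is instantiated rather than a silent fall-through.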
Ultimately, there is still a lot more to be done here, but this was the
huge chunk that I couldn't really split things out of because this was
the interface change to TTI. I've tried to minimize all the other parts
of this. The follow up work should include at least:
1) Improving the TargetMachine interface by having it directly return
a TTI object. Because we have a non-pass object with value semantics
and an internal type erasure mechanism, we can narrow the interface
of the TargetMachine to *just* do what we need: build and return
a TTI object that we can then insert into the pass pipeline.
2) Make the TTI object be fully specialized for a particular function.
This will include splitting off a minimal form of it which is
sufficient for the inliner and the old pass manager.
3) Add a new pass manager analysis which produces TTI objects from the
target machine for each function. This may actually be done as part
of #2 in order to use the new analysis to implement #2.
4) Work on narrowing the API between TTI and the targets so that it is
easier to understand and less verbose to type erase.
5) Work on narrowing the API between TTI and its clients so that it is
easier to understand and less verbose to forward.
6) Try to improve the CRTP-based delegation. I feel like this code is
just a bit messy and exacerbating the complexity of implementing
the TTI in each target.
Many thanks to Eric and Hal for their help here. I ended up blocked on
this somewhat more abruptly than I expected, and so I appreciate getting
it sorted out very quickly.
Differential Revision: http://reviews.llvm.org/D7293
llvm-svn: 227669
2015-01-31 11:43:40 +08:00
|
|
|
return BaseT::getCastInstrCost(Opcode, Dst, Src);
|
2013-01-30 07:31:38 +08:00
|
|
|
|
|
|
|
// Some arithmetic, load and store operations have specific instructions
|
2013-02-05 22:05:55 +08:00
|
|
|
// to cast up/down their types automatically at no extra cost.
|
|
|
|
// TODO: Get these tables to know at least what the related operations are.
|
2015-10-28 12:02:12 +08:00
|
|
|
static const TypeConversionCostTblEntry NEONVectorConversionTbl[] = {
|
2013-01-30 07:31:38 +08:00
|
|
|
{ ISD::SIGN_EXTEND, MVT::v4i32, MVT::v4i16, 0 },
|
|
|
|
{ ISD::ZERO_EXTEND, MVT::v4i32, MVT::v4i16, 0 },
|
|
|
|
{ ISD::SIGN_EXTEND, MVT::v2i64, MVT::v2i32, 1 },
|
|
|
|
{ ISD::ZERO_EXTEND, MVT::v2i64, MVT::v2i32, 1 },
|
|
|
|
{ ISD::TRUNCATE, MVT::v4i32, MVT::v4i64, 0 },
|
|
|
|
{ ISD::TRUNCATE, MVT::v4i16, MVT::v4i32, 1 },
|
2013-02-05 22:05:55 +08:00
|
|
|
|
2013-03-19 16:15:38 +08:00
|
|
|
// The number of vmovl instructions for the extension.
|
|
|
|
{ ISD::SIGN_EXTEND, MVT::v4i64, MVT::v4i16, 3 },
|
|
|
|
{ ISD::ZERO_EXTEND, MVT::v4i64, MVT::v4i16, 3 },
|
|
|
|
{ ISD::SIGN_EXTEND, MVT::v8i32, MVT::v8i8, 3 },
|
|
|
|
{ ISD::ZERO_EXTEND, MVT::v8i32, MVT::v8i8, 3 },
|
|
|
|
{ ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i8, 7 },
|
|
|
|
{ ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i8, 7 },
|
|
|
|
{ ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i16, 6 },
|
|
|
|
{ ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i16, 6 },
|
|
|
|
{ ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i8, 6 },
|
|
|
|
{ ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i8, 6 },
|
|
|
|
|
Legalize vector truncates by parts rather than just splitting.
Rather than just splitting the input type and hoping for the best, apply
a bit more cleverness. Just splitting the types until the source is
legal often leads to an illegal result type, which is then widened and a
scalarization step is introduced which leads to truly horrible code
generation. With the loop vectorizer, these sorts of operations are much
more common, and so it's worth extra effort to do them well.
Add a legalization hook for the operands of a TRUNCATE node, which will
be encountered after the result type has been legalized, but if the
operand type is still illegal. If simple splitting of both types
ends up with the result type of each half still being legal, just
do that (v16i16 -> v16i8 on ARM, for example). If, however, that would
result in an illegal result type (v8i32 -> v8i8 on ARM, for example),
we can get more clever with power-two vectors. Specifically,
split the input type, but also widen the result element size, then
concatenate the halves and truncate again. For example on ARM,
To perform a "%res = v8i8 trunc v8i32 %in" we transform to:
%inlo = v4i32 extract_subvector %in, 0
%inhi = v4i32 extract_subvector %in, 4
%lo16 = v4i16 trunc v4i32 %inlo
%hi16 = v4i16 trunc v4i32 %inhi
%in16 = v8i16 concat_vectors v4i16 %lo16, v4i16 %hi16
%res = v8i8 trunc v8i16 %in16
This allows instruction selection to generate three VMOVN instructions
instead of a sequence of moves, stores and loads.
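
As a rough scalar model of the transform above, the sketch below performs the
v8i32 -> v8i8 truncate as two half-width narrowings, a concatenation, and one
final narrowing, and counts the narrowing steps. It only illustrates the data
flow; the function names are hypothetical, plain arrays stand in for vector
registers, and nothing here is SelectionDAG code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for one "v4i16 trunc v4i32" (a single VMOVN.i32 in the message).
std::array<uint16_t, 4> narrow32to16(const std::array<uint32_t, 4> &In) {
  std::array<uint16_t, 4> Out{};
  for (std::size_t I = 0; I < 4; ++I)
    Out[I] = static_cast<uint16_t>(In[I]); // keep the low 16 bits
  return Out;
}

// Stand-in for "v8i8 trunc v8i16" (a single VMOVN.i16 in the message).
std::array<uint8_t, 8> narrow16to8(const std::array<uint16_t, 8> &In) {
  std::array<uint8_t, 8> Out{};
  for (std::size_t I = 0; I < 8; ++I)
    Out[I] = static_cast<uint8_t>(In[I]); // keep the low 8 bits
  return Out;
}

// Truncate eight 32-bit lanes to eight 8-bit lanes by parts. If NumNarrowOps
// is non-null it receives the number of narrowing steps performed.
std::array<uint8_t, 8> truncV8i32ToV8i8(const std::array<uint32_t, 8> &In,
                                        int *NumNarrowOps = nullptr) {
  // extract_subvector %in, 0 and extract_subvector %in, 4
  std::array<uint32_t, 4> Lo{}, Hi{};
  for (std::size_t I = 0; I < 4; ++I) {
    Lo[I] = In[I];
    Hi[I] = In[I + 4];
  }

  // Two half-width truncates: v4i32 -> v4i16 each.
  const std::array<uint16_t, 4> Lo16 = narrow32to16(Lo);
  const std::array<uint16_t, 4> Hi16 = narrow32to16(Hi);

  // concat_vectors of the two v4i16 halves into one v8i16.
  std::array<uint16_t, 8> In16{};
  for (std::size_t I = 0; I < 4; ++I) {
    In16[I] = Lo16[I];
    In16[I + 4] = Hi16[I];
  }

  if (NumNarrowOps)
    *NumNarrowOps = 3; // two 32->16 narrowings plus one 16->8 narrowing

  // Final truncate: v8i16 -> v8i8.
  return narrow16to8(In16);
}
```

Truncating in two stages keeps the same low 8 bits of each lane as a direct
truncate would, so the result is unchanged. The three narrowing steps
correspond to the three VMOVN instructions mentioned above.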
Update the ARMTargetTransformInfo to take this improved legalization
into account.
Consider the simplified IR:
define <16 x i8> @test1(<16 x i32>* %ap) {
%a = load <16 x i32>* %ap
%tmp = trunc <16 x i32> %a to <16 x i8>
ret <16 x i8> %tmp
}
define <8 x i8> @test2(<8 x i32>* %ap) {
%a = load <8 x i32>* %ap
%tmp = trunc <8 x i32> %a to <8 x i8>
ret <8 x i8> %tmp
}
Previously, we would generate the truly hideous:
.syntax unified
.section __TEXT,__text,regular,pure_instructions
.globl _test1
.align 2
_test1: @ @test1
@ BB#0:
push {r7}
mov r7, sp
sub sp, sp, #20
bic sp, sp, #7
add r1, r0, #48
add r2, r0, #32
vld1.64 {d24, d25}, [r0:128]
vld1.64 {d16, d17}, [r1:128]
vld1.64 {d18, d19}, [r2:128]
add r1, r0, #16
vmovn.i32 d22, q8
vld1.64 {d16, d17}, [r1:128]
vmovn.i32 d20, q9
vmovn.i32 d18, q12
vmov.u16 r0, d22[3]
strb r0, [sp, #15]
vmov.u16 r0, d22[2]
strb r0, [sp, #14]
vmov.u16 r0, d22[1]
strb r0, [sp, #13]
vmov.u16 r0, d22[0]
vmovn.i32 d16, q8
strb r0, [sp, #12]
vmov.u16 r0, d20[3]
strb r0, [sp, #11]
vmov.u16 r0, d20[2]
strb r0, [sp, #10]
vmov.u16 r0, d20[1]
strb r0, [sp, #9]
vmov.u16 r0, d20[0]
strb r0, [sp, #8]
vmov.u16 r0, d18[3]
strb r0, [sp, #3]
vmov.u16 r0, d18[2]
strb r0, [sp, #2]
vmov.u16 r0, d18[1]
strb r0, [sp, #1]
vmov.u16 r0, d18[0]
strb r0, [sp]
vmov.u16 r0, d16[3]
strb r0, [sp, #7]
vmov.u16 r0, d16[2]
strb r0, [sp, #6]
vmov.u16 r0, d16[1]
strb r0, [sp, #5]
vmov.u16 r0, d16[0]
strb r0, [sp, #4]
vldmia sp, {d16, d17}
vmov r0, r1, d16
vmov r2, r3, d17
mov sp, r7
pop {r7}
bx lr
.globl _test2
.align 2
_test2: @ @test2
@ BB#0:
push {r7}
mov r7, sp
sub sp, sp, #12
bic sp, sp, #7
vld1.64 {d16, d17}, [r0:128]
add r0, r0, #16
vld1.64 {d20, d21}, [r0:128]
vmovn.i32 d18, q8
vmov.u16 r0, d18[3]
vmovn.i32 d16, q10
strb r0, [sp, #3]
vmov.u16 r0, d18[2]
strb r0, [sp, #2]
vmov.u16 r0, d18[1]
strb r0, [sp, #1]
vmov.u16 r0, d18[0]
strb r0, [sp]
vmov.u16 r0, d16[3]
strb r0, [sp, #7]
vmov.u16 r0, d16[2]
strb r0, [sp, #6]
vmov.u16 r0, d16[1]
strb r0, [sp, #5]
vmov.u16 r0, d16[0]
strb r0, [sp, #4]
ldm sp, {r0, r1}
mov sp, r7
pop {r7}
bx lr
Now, however, we generate the much more straightforward:
.syntax unified
.section __TEXT,__text,regular,pure_instructions
.globl _test1
.align 2
_test1: @ @test1
@ BB#0:
add r1, r0, #48
add r2, r0, #32
vld1.64 {d20, d21}, [r0:128]
vld1.64 {d16, d17}, [r1:128]
add r1, r0, #16
vld1.64 {d18, d19}, [r2:128]
vld1.64 {d22, d23}, [r1:128]
vmovn.i32 d17, q8
vmovn.i32 d16, q9
vmovn.i32 d18, q10
vmovn.i32 d19, q11
vmovn.i16 d17, q8
vmovn.i16 d16, q9
vmov r0, r1, d16
vmov r2, r3, d17
bx lr
.globl _test2
.align 2
_test2: @ @test2
@ BB#0:
vld1.64 {d16, d17}, [r0:128]
add r0, r0, #16
vld1.64 {d18, d19}, [r0:128]
vmovn.i32 d16, q8
vmovn.i32 d17, q9
vmovn.i16 d16, q8
vmov r0, r1, d16
bx lr
llvm-svn: 179989
2013-04-22 07:47:41 +08:00
|
|
|
// Operations that we legalize using splitting.
|
|
|
|
{ ISD::TRUNCATE, MVT::v16i8, MVT::v16i32, 6 },
|
|
|
|
{ ISD::TRUNCATE, MVT::v8i8, MVT::v8i32, 3 },
|
2013-03-13 05:19:22 +08:00
|
|
|
|
2013-02-05 22:05:55 +08:00
|
|
|
// Vector float <-> i32 conversions.
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v4f32, MVT::v4i32, 1 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v4f32, MVT::v4i32, 1 },
|
2013-03-19 06:47:09 +08:00
|
|
|
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v2f32, MVT::v2i8, 3 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v2f32, MVT::v2i8, 3 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v2f32, MVT::v2i16, 2 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v2f32, MVT::v2i16, 2 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v2f32, MVT::v2i32, 1 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v2f32, MVT::v2i32, 1 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v4f32, MVT::v4i1, 3 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v4f32, MVT::v4i1, 3 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v4f32, MVT::v4i8, 3 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v4f32, MVT::v4i8, 3 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v4f32, MVT::v4i16, 2 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v4f32, MVT::v4i16, 2 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v8f32, MVT::v8i16, 4 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v8f32, MVT::v8i16, 4 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v8f32, MVT::v8i32, 2 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v8f32, MVT::v8i32, 2 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i16, 8 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i16, 8 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i32, 4 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i32, 4 },
|
|
|
|
|
2013-02-05 22:05:55 +08:00
|
|
|
{ ISD::FP_TO_SINT, MVT::v4i32, MVT::v4f32, 1 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::v4i32, MVT::v4f32, 1 },
|
2013-03-19 06:47:06 +08:00
|
|
|
{ ISD::FP_TO_SINT, MVT::v4i8, MVT::v4f32, 3 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::v4i8, MVT::v4f32, 3 },
|
|
|
|
{ ISD::FP_TO_SINT, MVT::v4i16, MVT::v4f32, 2 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::v4i16, MVT::v4f32, 2 },
|
2013-02-05 22:05:55 +08:00
|
|
|
|
|
|
|
// Vector double <-> i32 conversions.
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v2f64, MVT::v2i32, 2 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v2f64, MVT::v2i32, 2 },
|
2013-03-19 06:47:09 +08:00
|
|
|
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v2f64, MVT::v2i8, 4 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v2f64, MVT::v2i8, 4 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v2f64, MVT::v2i16, 3 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v2f64, MVT::v2i16, 3 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::v2f64, MVT::v2i32, 2 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::v2f64, MVT::v2i32, 2 },
|
|
|
|
|
2013-02-05 22:05:55 +08:00
|
|
|
{ ISD::FP_TO_SINT, MVT::v2i32, MVT::v2f64, 2 },
|
2013-03-19 06:47:06 +08:00
|
|
|
{ ISD::FP_TO_UINT, MVT::v2i32, MVT::v2f64, 2 },
|
|
|
|
{ ISD::FP_TO_SINT, MVT::v8i16, MVT::v8f32, 4 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::v8i16, MVT::v8f32, 4 },
|
|
|
|
{ ISD::FP_TO_SINT, MVT::v16i16, MVT::v16f32, 8 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::v16i16, MVT::v16f32, 8 }
|
2013-01-30 07:31:38 +08:00
|
|
|
};
|
|
|
|
|
2013-02-05 22:05:55 +08:00
|
|
|
if (SrcTy.isVector() && ST->hasNEON()) {
|
2015-10-27 12:14:24 +08:00
|
|
|
if (const auto *Entry = ConvertCostTableLookup(NEONVectorConversionTbl, ISD,
|
|
|
|
DstTy.getSimpleVT(),
|
|
|
|
SrcTy.getSimpleVT()))
|
|
|
|
return Entry->Cost;
|
2013-02-05 22:05:55 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
// Scalar float to integer conversions.
|
2015-10-28 12:02:12 +08:00
|
|
|
static const TypeConversionCostTblEntry NEONFloatConversionTbl[] = {
|
2013-02-05 22:05:55 +08:00
|
|
|
{ ISD::FP_TO_SINT, MVT::i1, MVT::f32, 2 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::i1, MVT::f32, 2 },
|
|
|
|
{ ISD::FP_TO_SINT, MVT::i1, MVT::f64, 2 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::i1, MVT::f64, 2 },
|
|
|
|
{ ISD::FP_TO_SINT, MVT::i8, MVT::f32, 2 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::i8, MVT::f32, 2 },
|
|
|
|
{ ISD::FP_TO_SINT, MVT::i8, MVT::f64, 2 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::i8, MVT::f64, 2 },
|
|
|
|
{ ISD::FP_TO_SINT, MVT::i16, MVT::f32, 2 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::i16, MVT::f32, 2 },
|
|
|
|
{ ISD::FP_TO_SINT, MVT::i16, MVT::f64, 2 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::i16, MVT::f64, 2 },
|
|
|
|
{ ISD::FP_TO_SINT, MVT::i32, MVT::f32, 2 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::i32, MVT::f32, 2 },
|
|
|
|
{ ISD::FP_TO_SINT, MVT::i32, MVT::f64, 2 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::i32, MVT::f64, 2 },
|
|
|
|
{ ISD::FP_TO_SINT, MVT::i64, MVT::f32, 10 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::i64, MVT::f32, 10 },
|
|
|
|
{ ISD::FP_TO_SINT, MVT::i64, MVT::f64, 10 },
|
|
|
|
{ ISD::FP_TO_UINT, MVT::i64, MVT::f64, 10 }
|
|
|
|
};
|
|
|
|
if (SrcTy.isFloatingPoint() && ST->hasNEON()) {
|
2015-10-27 12:14:24 +08:00
|
|
|
if (const auto *Entry = ConvertCostTableLookup(NEONFloatConversionTbl, ISD,
|
|
|
|
DstTy.getSimpleVT(),
|
|
|
|
SrcTy.getSimpleVT()))
|
|
|
|
return Entry->Cost;
|
2013-02-05 22:05:55 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
// Scalar integer to float conversions.
|
2015-10-28 12:02:12 +08:00
|
|
|
static const TypeConversionCostTblEntry NEONIntegerConversionTbl[] = {
|
2013-02-05 22:05:55 +08:00
|
|
|
{ ISD::SINT_TO_FP, MVT::f32, MVT::i1, 2 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::f32, MVT::i1, 2 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::f64, MVT::i1, 2 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::f64, MVT::i1, 2 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::f32, MVT::i8, 2 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::f32, MVT::i8, 2 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::f64, MVT::i8, 2 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::f64, MVT::i8, 2 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::f32, MVT::i16, 2 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::f32, MVT::i16, 2 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::f64, MVT::i16, 2 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::f64, MVT::i16, 2 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::f32, MVT::i32, 2 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::f32, MVT::i32, 2 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::f64, MVT::i32, 2 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::f64, MVT::i32, 2 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::f32, MVT::i64, 10 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::f32, MVT::i64, 10 },
|
|
|
|
{ ISD::SINT_TO_FP, MVT::f64, MVT::i64, 10 },
|
|
|
|
{ ISD::UINT_TO_FP, MVT::f64, MVT::i64, 10 }
|
|
|
|
};
|
|
|
|
|
|
|
|
if (SrcTy.isInteger() && ST->hasNEON()) {
|
2015-10-27 12:14:24 +08:00
|
|
|
if (const auto *Entry = ConvertCostTableLookup(NEONIntegerConversionTbl,
|
|
|
|
ISD, DstTy.getSimpleVT(),
|
|
|
|
SrcTy.getSimpleVT()))
|
|
|
|
return Entry->Cost;
|
2013-01-30 07:31:38 +08:00
|
|
|
}
|
|
|
|
|
2013-02-05 22:05:55 +08:00
|
|
|
// Scalar integer conversion costs.
|
2015-10-28 12:02:12 +08:00
|
|
|
static const TypeConversionCostTblEntry ARMIntegerConversionTbl[] = {
|
2013-02-05 22:05:55 +08:00
|
|
|
// i16 -> i64 requires two dependent operations.
|
|
|
|
{ ISD::SIGN_EXTEND, MVT::i64, MVT::i16, 2 },
|
|
|
|
|
|
|
|
// Truncates on i64 are assumed to be free.
|
|
|
|
{ ISD::TRUNCATE, MVT::i32, MVT::i64, 0 },
|
|
|
|
{ ISD::TRUNCATE, MVT::i16, MVT::i64, 0 },
|
|
|
|
{ ISD::TRUNCATE, MVT::i8, MVT::i64, 0 },
|
|
|
|
{ ISD::TRUNCATE, MVT::i1, MVT::i64, 0 }
|
|
|
|
};
|
|
|
|
|
|
|
|
if (SrcTy.isInteger()) {
|
2015-10-27 12:14:24 +08:00
|
|
|
if (const auto *Entry = ConvertCostTableLookup(ARMIntegerConversionTbl, ISD,
|
|
|
|
DstTy.getSimpleVT(),
|
|
|
|
SrcTy.getSimpleVT()))
|
|
|
|
return Entry->Cost;
|
2013-02-05 22:05:55 +08:00
|
|
|
}
|
|
|
|
|
2015-01-31 11:43:40 +08:00
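As a rough illustration of the design described in this commit message, the sketch below pairs a CRTP base class with a type-erased wrapper that has value semantics. All names here (`CostModelBase`, `DefaultCostModel`, `SlowScalarCostModel`, `CostInfo`) are hypothetical and are not the LLVM classes; the real TTI interface is far larger.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical CRTP base: provides defaults and calls back into the most
// derived class when one hook is implemented in terms of another.
template <typename Derived> class CostModelBase {
public:
  int getScalarCost() const { return 1; }

  // Delegates "horizontally": uses Derived's getScalarCost, not the base's.
  int getVectorCost(int NumElts) const {
    assert(NumElts > 0 && "vector must have at least one element");
    return NumElts * static_cast<const Derived *>(this)->getScalarCost();
  }
};

// Uses every default from the base.
struct DefaultCostModel : CostModelBase<DefaultCostModel> {};

// Overrides one hook; getVectorCost picks the override up through CRTP.
struct SlowScalarCostModel : CostModelBase<SlowScalarCostModel> {
  int getScalarCost() const { return 3; }
};

// Hypothetical type-erased wrapper with value semantics. Clients hold a
// CostInfo by value and never see the concrete implementation type.
class CostInfo {
  struct Concept {
    virtual ~Concept() = default;
    virtual std::unique_ptr<Concept> clone() const = 0;
    virtual int getScalarCost() const = 0;
    virtual int getVectorCost(int NumElts) const = 0;
  };

  // Every interface function must be forwarded here by hand, which is the
  // verbosity the commit message complains about.
  template <typename T> struct Model final : Concept {
    T Impl;
    explicit Model(T Impl) : Impl(std::move(Impl)) {}
    std::unique_ptr<Concept> clone() const override {
      return std::make_unique<Model>(Impl);
    }
    int getScalarCost() const override { return Impl.getScalarCost(); }
    int getVectorCost(int NumElts) const override {
      return Impl.getVectorCost(NumElts);
    }
  };

  std::unique_ptr<Concept> Self;

public:
  template <typename T>
  CostInfo(T Impl) : Self(std::make_unique<Model<T>>(std::move(Impl))) {}
  CostInfo(const CostInfo &Other) : Self(Other.Self->clone()) {}
  CostInfo(CostInfo &&) noexcept = default;
  CostInfo &operator=(CostInfo Other) {
    Self = std::move(Other.Self);
    return *this;
  }

  int getScalarCost() const { return Self->getScalarCost(); }
  int getVectorCost(int NumElts) const { return Self->getVectorCost(NumElts); }
};
```

In this sketch a `CostInfo` built from `SlowScalarCostModel` reports a four-element vector cost of 12, because the base class's `getVectorCost` calls back into the derived `getScalarCost`, while one built from `DefaultCostModel` reports 4.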
|
|
|
return BaseT::getCastInstrCost(Opcode, Dst, Src);
|
2013-01-30 07:31:38 +08:00
|
|
|
}
|
2013-02-04 10:52:05 +08:00
|
|
|
|
2015-08-06 02:08:10 +08:00
|
|
|
int ARMTTIImpl::getVectorInstrCost(unsigned Opcode, Type *ValTy,
|
|
|
|
unsigned Index) {
|
2013-02-08 22:50:48 +08:00
|
|
|
// Penalize inserting into a D-subregister. We end up with a three times
|
|
|
|
// lower estimated throughput on swift.
|
2016-07-06 17:22:23 +08:00
|
|
|
if (ST->hasSlowLoadDSubregister() && Opcode == Instruction::InsertElement &&
|
|
|
|
ValTy->isVectorTy() && ValTy->getScalarSizeInBits() <= 32)
|
2013-02-08 22:50:48 +08:00
|
|
|
return 3;
|
2013-02-04 10:52:05 +08:00
|
|
|
|
2014-09-12 21:29:40 +08:00
|
|
|
if ((Opcode == Instruction::InsertElement ||
|
2015-08-17 23:57:05 +08:00
|
|
|
Opcode == Instruction::ExtractElement)) {
|
|
|
|
// Cross-class copies are expensive on many microarchitectures,
|
|
|
|
// so assume they are expensive by default.
|
|
|
|
if (ValTy->getVectorElementType()->isIntegerTy())
|
|
|
|
return 3;
|
|
|
|
|
|
|
|
// Even if it's not a cross class copy, this likely leads to mixing
|
|
|
|
// of NEON and VFP code and should be therefore penalized.
|
|
|
|
if (ValTy->isVectorTy() &&
|
|
|
|
ValTy->getScalarSizeInBits() <= 32)
|
|
|
|
return std::max(BaseT::getVectorInstrCost(Opcode, ValTy, Index), 2U);
|
|
|
|
}
|
2014-09-12 21:29:40 +08:00
|
|
|
|
2015-01-31 11:43:40 +08:00
|
|
|
return BaseT::getVectorInstrCost(Opcode, ValTy, Index);
|
2013-02-04 10:52:05 +08:00
|
|
|
}
|
2013-02-08 00:10:15 +08:00
|
|
|
|
2017-04-12 19:49:08 +08:00
|
|
|
int ARMTTIImpl::getCmpSelInstrCost(unsigned Opcode, Type *ValTy, Type *CondTy,
|
|
|
|
const Instruction *I) {
|
2013-02-08 00:10:15 +08:00
|
|
|
int ISD = TLI->InstructionOpcodeToISD(Opcode);
|
2018-02-22 15:48:29 +08:00
|
|
|
// On NEON a vector select gets lowered to vbsl.
|
2013-02-08 00:10:15 +08:00
|
|
|
if (ST->hasNEON() && ValTy->isVectorTy() && ISD == ISD::SELECT) {
|
2013-03-15 03:17:02 +08:00
|
|
|
// Lowering of some vector selects is currently far from perfect.
|
2015-10-28 12:02:12 +08:00
|
|
|
static const TypeConversionCostTblEntry NEONVectorSelectTbl[] = {
|
2013-03-15 03:17:02 +08:00
|
|
|
{ ISD::SELECT, MVT::v4i1, MVT::v4i64, 4*4 + 1*2 + 1 },
|
|
|
|
{ ISD::SELECT, MVT::v8i1, MVT::v8i64, 50 },
|
|
|
|
{ ISD::SELECT, MVT::v16i1, MVT::v16i64, 100 }
|
|
|
|
};
|
|
|
|
|
2015-07-09 10:09:04 +08:00
|
|
|
EVT SelCondTy = TLI->getValueType(DL, CondTy);
|
|
|
|
EVT SelValTy = TLI->getValueType(DL, ValTy);
|
2013-08-03 01:10:04 +08:00
|
|
|
if (SelCondTy.isSimple() && SelValTy.isSimple()) {
|
2015-10-27 12:14:24 +08:00
|
|
|
if (const auto *Entry = ConvertCostTableLookup(NEONVectorSelectTbl, ISD,
|
|
|
|
SelCondTy.getSimpleVT(),
|
|
|
|
SelValTy.getSimpleVT()))
|
|
|
|
return Entry->Cost;
|
2013-08-03 01:10:04 +08:00
|
|
|
}
|
2013-03-15 03:17:02 +08:00
|
|
|
|
2015-08-06 02:08:10 +08:00
|
|
|
std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, ValTy);
|
2013-02-08 00:10:15 +08:00
|
|
|
return LT.first;
|
|
|
|
}
|
|
|
|
|
2017-04-12 19:49:08 +08:00
|
|
|
return BaseT::getCmpSelInstrCost(Opcode, ValTy, CondTy, I);
|
2013-02-08 00:10:15 +08:00
|
|
|
}
|
2013-02-08 22:50:48 +08:00
|
|
|
|
2017-01-05 22:03:41 +08:00
|
|
|
int ARMTTIImpl::getAddressComputationCost(Type *Ty, ScalarEvolution *SE,
|
|
|
|
const SCEV *Ptr) {
|
2013-07-13 03:16:04 +08:00
|
|
|
// Address computations in vectorized code with non-consecutive addresses will
|
|
|
|
// likely result in more instructions compared to scalar code where the
|
|
|
|
// computation can more often be merged into the index mode. The resulting
|
|
|
|
// extra micro-ops can significantly decrease throughput.
|
|
|
|
unsigned NumVectorInstToHideOverhead = 10;
|
2017-01-05 22:03:41 +08:00
|
|
|
int MaxMergeDistance = 64;
|
2013-07-13 03:16:04 +08:00
|
|
|
|
2018-07-31 03:41:25 +08:00
|
|
|
if (Ty->isVectorTy() && SE &&
|
2017-01-05 22:03:41 +08:00
|
|
|
!BaseT::isConstantStridedAccessLessThan(SE, Ptr, MaxMergeDistance + 1))
|
2013-07-13 03:16:04 +08:00
|
|
|
return NumVectorInstToHideOverhead;
|
|
|
|
|
2013-02-08 22:50:48 +08:00
|
|
|
// In many cases the address computation is not merged into the instruction
|
|
|
|
// addressing mode.
|
|
|
|
return 1;
|
|
|
|
}
|
2013-02-12 10:40:39 +08:00
|
|
|
|
2019-04-30 18:28:50 +08:00
|
|
|
int ARMTTIImpl::getMemcpyCost(const Instruction *I) {
|
|
|
|
const MemCpyInst *MI = dyn_cast<MemCpyInst>(I);
|
|
|
|
assert(MI && "MemcpyInst expected");
|
|
|
|
ConstantInt *C = dyn_cast<ConstantInt>(MI->getLength());
|
|
|
|
|
|
|
|
// To model the cost of a library call, we assume 1 for the call, and
|
|
|
|
// 3 for the argument setup.
|
|
|
|
const unsigned LibCallCost = 4;
|
|
|
|
|
|
|
|
// If 'size' is not a constant, a library call will be generated.
|
|
|
|
if (!C)
|
|
|
|
return LibCallCost;
|
|
|
|
|
|
|
|
const unsigned Size = C->getValue().getZExtValue();
|
|
|
|
const unsigned DstAlign = MI->getDestAlignment();
|
|
|
|
const unsigned SrcAlign = MI->getSourceAlignment();
|
|
|
|
const Function *F = I->getParent()->getParent();
|
|
|
|
const unsigned Limit = TLI->getMaxStoresPerMemmove(F->hasMinSize());
|
|
|
|
std::vector<EVT> MemOps;
|
|
|
|
|
|
|
|
// MemOps will be populated with a list of data types that need to be
|
|
|
|
// loaded and stored. That's why we multiply the number of elements by 2 to
|
|
|
|
// get the cost for this memcpy.
|
|
|
|
if (getTLI()->findOptimalMemOpLowering(
|
|
|
|
MemOps, Limit, Size, DstAlign, SrcAlign, false /*IsMemset*/,
|
|
|
|
false /*ZeroMemset*/, false /*MemcpyStrSrc*/, false /*AllowOverlap*/,
|
|
|
|
MI->getDestAddressSpace(), MI->getSourceAddressSpace(),
|
|
|
|
F->getAttributes()))
|
|
|
|
return MemOps.size() * 2;
|
|
|
|
|
|
|
|
// If we can't find an optimal memop lowering, return the default cost
|
|
|
|
return LibCallCost;
|
|
|
|
}
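The cost rule in getMemcpyCost above can be sketched in isolation as follows. This is a simplified model, not the LLVM code: `findMemOps` is a hypothetical stand-in for `findOptimalMemOpLowering` that greedily covers the copy with the widest access the alignment allows (capped at 8 bytes, an assumption), and `memcpyCost` takes the inputs as plain integers.

```cpp
#include <cassert>
#include <vector>

// 1 for the call plus 3 for argument setup, as in the comment above.
const unsigned LibCallCost = 4;

// Hypothetical lowering query: fills MemOps with the byte width of each
// load/store pair needed to copy Size bytes. Returns false when more than
// Limit operations would be required.
bool findMemOps(std::vector<unsigned> &MemOps, unsigned Limit, unsigned Size,
                unsigned Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 &&
         "alignment must be a power of two");
  MemOps.clear();
  unsigned Width = Align < 8 ? Align : 8; // assumed widest access: 8 bytes
  while (Size != 0) {
    while (Width > Size) // shrink the access to fit the remaining tail
      Width /= 2;
    MemOps.push_back(Width);
    Size -= Width;
    if (MemOps.size() > Limit)
      return false;
  }
  return true;
}

// Each entry in MemOps is both loaded and stored, hence the factor of 2.
unsigned memcpyCost(bool SizeIsConstant, unsigned Size, unsigned Align,
                    unsigned Limit) {
  if (!SizeIsConstant)
    return LibCallCost; // a library call will be generated
  std::vector<unsigned> MemOps;
  if (findMemOps(MemOps, Limit, Size, Align))
    return static_cast<unsigned>(MemOps.size()) * 2;
  return LibCallCost; // no inline lowering found
}
```

Under these assumptions a 16-byte copy with 4-byte alignment and a limit of 4 operations costs 8 (four loads and four stores), while a copy of unknown size costs 4.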
|
|
|
|
|
2015-08-06 02:08:10 +08:00
|
|
|
int ARMTTIImpl::getShuffleCost(TTI::ShuffleKind Kind, Type *Tp, int Index,
|
|
|
|
Type *SubTp) {
|
2018-10-25 18:52:36 +08:00
|
|
|
if (Kind == TTI::SK_Broadcast) {
|
|
|
|
static const CostTblEntry NEONDupTbl[] = {
|
|
|
|
// VDUP handles these cases.
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v2i32, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v2f32, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v2i64, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v2f64, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v4i16, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v8i8, 1},
|
|
|
|
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v4i32, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v4f32, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v8i16, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v16i8, 1}};
|
|
|
|
|
|
|
|
std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Tp);
|
2013-02-12 10:40:39 +08:00
|
|
|
|
2018-10-25 18:52:36 +08:00
|
|
|
if (const auto *Entry = CostTableLookup(NEONDupTbl, ISD::VECTOR_SHUFFLE,
|
|
|
|
LT.second))
|
|
|
|
return LT.first * Entry->Cost;
|
|
|
|
|
|
|
|
return BaseT::getShuffleCost(Kind, Tp, Index, SubTp);
|
|
|
|
}
|
2015-01-31 11:43:40 +08:00
|
|
|
if (Kind == TTI::SK_Reverse) {
|
2015-10-28 12:02:12 +08:00
|
|
|
static const CostTblEntry NEONShuffleTbl[] = {
|
2014-06-20 12:32:48 +08:00
|
|
|
// A reverse shuffle costs one instruction if we are shuffling within a
|
|
|
|
// double word (vrev) or two if we shuffle a quad word (vrev, vext).
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v2i32, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v2f32, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v2i64, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v2f64, 1},
|
2018-10-23 17:42:10 +08:00
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v4i16, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v8i8, 1},
|
2013-02-12 10:40:39 +08:00
|
|
|
|
2014-06-20 12:32:48 +08:00
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v4i32, 2},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v4f32, 2},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v8i16, 2},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v16i8, 2}};
|
2013-02-12 10:40:39 +08:00
|
|
|
|
2015-08-06 02:08:10 +08:00
|
|
|
std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Tp);
|
2014-06-20 12:32:48 +08:00
|
|
|
|
2015-10-27 12:14:24 +08:00
|
|
|
if (const auto *Entry = CostTableLookup(NEONShuffleTbl, ISD::VECTOR_SHUFFLE,
|
|
|
|
LT.second))
|
|
|
|
return LT.first * Entry->Cost;
|
2013-02-12 10:40:39 +08:00
|
|
|
|
2015-10-27 12:14:24 +08:00
|
|
|
return BaseT::getShuffleCost(Kind, Tp, Index, SubTp);
|
2014-06-20 12:32:48 +08:00
|
|
|
}
|
[CostModel] Replace ShuffleKind::SK_Alternate with ShuffleKind::SK_Select (PR33744)
As discussed on PR33744, this patch relaxes ShuffleKind::SK_Alternate which requires shuffle masks to only match an alternating pattern from its 2 sources:
e.g. v4f32: <0,5,2,7> or <4,1,6,3>
This seems far too restrictive, as most SIMD hardware will implement it using a general blend/bit-select instruction, so this patch replaces it with SK_Select, permitting elements from either source as long as they are inline:
e.g. v4f32: <0,5,2,7>, <4,1,6,3>, <0,1,6,7>, <4,1,2,3> etc.
This initial patch just updates the name and cost model shuffle mask analysis, later patch reviews will update SLP to better utilise this - it still limits itself to SK_Alternate style patterns.
Differential Revision: https://reviews.llvm.org/D47985
llvm-svn: 334513
2018-06-13 00:12:29 +08:00
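The difference between the two mask kinds described in this commit message can be sketched with two small predicates. These are hypothetical helpers written for illustration, not LLVM's own mask analysis, and they assume every mask element is defined (undef elements are not modelled).

```cpp
#include <cassert>
#include <vector>

// A select mask takes lane I from lane I of either source: the mask value
// is I (first source) or I + N (second source), where N is the lane count.
bool isSelectMask(const std::vector<int> &Mask) {
  const int N = static_cast<int>(Mask.size());
  for (int I = 0; I < N; ++I) {
    assert(Mask[I] >= 0 && Mask[I] < 2 * N && "mask index out of range");
    if (Mask[I] != I && Mask[I] != I + N)
      return false;
  }
  return true;
}

// An alternate mask is a select mask whose source strictly alternates
// from one lane to the next.
bool isAlternateMask(const std::vector<int> &Mask) {
  if (!isSelectMask(Mask))
    return false;
  const int N = static_cast<int>(Mask.size());
  for (int I = 1; I < N; ++I) {
    const bool PrevFromSecond = Mask[I - 1] >= N;
    const bool CurFromSecond = Mask[I] >= N;
    if (PrevFromSecond == CurFromSecond)
      return false;
  }
  return true;
}
```

With these definitions, <0,5,2,7> and <4,1,6,3> satisfy both predicates, whereas <0,1,6,7> and <4,1,2,3> are select masks but not alternate masks.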
|
|
|
if (Kind == TTI::SK_Select) {
|
|
|
|
static const CostTblEntry NEONSelShuffleTbl[] = {
|
|
|
|
// Select shuffle cost table for ARM. Cost is the number of instructions
|
2014-06-20 12:32:48 +08:00
|
|
|
// required to create the shuffled vector.
|
|
|
|
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v2f32, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v2i64, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v2f64, 1},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v2i32, 1},
|
|
|
|
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v4i32, 2},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v4f32, 2},
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v4i16, 2},
|
|
|
|
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v8i16, 16},
|
|
|
|
|
|
|
|
{ISD::VECTOR_SHUFFLE, MVT::v16i8, 32}};
|
|
|
|
|
2015-08-06 02:08:10 +08:00
|
|
|
std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Tp);
|
2018-06-13 00:12:29 +08:00
|
|
|
if (const auto *Entry = CostTableLookup(NEONSelShuffleTbl,
|
2015-10-27 12:14:24 +08:00
|
|
|
ISD::VECTOR_SHUFFLE, LT.second))
|
|
|
|
return LT.first * Entry->Cost;
|
|
|
|
return BaseT::getShuffleCost(Kind, Tp, Index, SubTp);
|
2014-06-20 12:32:48 +08:00
|
|
|
}
|
2015-01-31 11:43:40 +08:00
|
|
|
return BaseT::getShuffleCost(Kind, Tp, Index, SubTp);
|
2013-02-12 10:40:39 +08:00
|
|
|
}
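Each branch of getShuffleCost above follows the same pattern: legalize the type, look the legal type up in a static table, and scale the table cost by the number of legal parts. A minimal sketch of that pattern, using hypothetical stand-ins (`Op`, `VT`, `CostEntry`, `lookupCost`, `legalize`) rather than the LLVM types and helpers:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

enum class Op { Shuffle, SDiv };
enum class VT { v1i64, v2i32, v4i32, v8i32 };

struct CostEntry {
  Op Opcode;
  VT Type;
  int Cost;
};

// Linear search over a static table; returns nullptr when no entry matches.
template <std::size_t N>
const CostEntry *lookupCost(const CostEntry (&Table)[N], Op Opcode, VT Type) {
  for (const CostEntry &E : Table)
    if (E.Opcode == Opcode && E.Type == Type)
      return &E;
  return nullptr;
}

// Toy legalizer (an assumption): v8i32 is split into two v4i32 halves,
// every other type is already legal.
std::pair<int, VT> legalize(VT Type) {
  if (Type == VT::v8i32)
    return {2, VT::v4i32};
  return {1, Type};
}

// Cost of a reverse shuffle: table cost times the number of legal parts,
// or the caller-supplied fallback when the table has no entry.
int reverseShuffleCost(VT Type, int FallbackCost) {
  static const CostEntry ReverseTbl[] = {
      {Op::Shuffle, VT::v2i32, 1}, // one instruction within a double word
      {Op::Shuffle, VT::v4i32, 2}, // two instructions for a quad word
  };
  const std::pair<int, VT> LT = legalize(Type);
  if (const CostEntry *E = lookupCost(ReverseTbl, Op::Shuffle, LT.second))
    return LT.first * E->Cost;
  return FallbackCost;
}
```

In this toy model a v8i32 reverse costs 4: it legalizes to two v4i32 parts at a table cost of 2 each. A type with no table entry falls through to the fallback, which plays the role of the `BaseT::getShuffleCost` call.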
|
2013-04-26 05:16:18 +08:00
|
|
|
|
2015-08-06 02:08:10 +08:00
|
|
|
int ARMTTIImpl::getArithmeticInstrCost(
|
2015-01-31 11:43:40 +08:00
|
|
|
unsigned Opcode, Type *Ty, TTI::OperandValueKind Op1Info,
|
|
|
|
TTI::OperandValueKind Op2Info, TTI::OperandValueProperties Opd1PropInfo,
|
[X86] Update TTI costs for arithmetic instructions on the X86\SLM arch.
Updated instructions:
pmulld, pmullw, pmulhw, mulsd, mulps, mulpd, divss, divps, divsd, divpd, addpd and subpd.
Adds a special optimization case that replaces pmulld with a pmullw\pmulhw\pshuf sequence
when the real operand bitwidth is <= 16.
Differential Revision: https://reviews.llvm.org/D28104
llvm-svn: 291657
2017-01-11 16:23:37 +08:00
|
|
|
TTI::OperandValueProperties Opd2PropInfo,
|
|
|
|
ArrayRef<const Value *> Args) {
|
2013-04-26 05:16:18 +08:00
|
|
|
int ISDOpcode = TLI->InstructionOpcodeToISD(Opcode);
|
2015-08-06 02:08:10 +08:00
|
|
|
std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Ty);
|
2013-04-26 05:16:18 +08:00
|
|
|
|
|
|
|
const unsigned FunctionCallDivCost = 20;
|
|
|
|
const unsigned ReciprocalDivCost = 10;
|
2015-10-28 12:02:12 +08:00
|
|
|
static const CostTblEntry CostTbl[] = {
|
2013-04-26 05:16:18 +08:00
|
|
|
// Division.
|
|
|
|
// These costs are somewhat random. Choose a cost of 20 to indicate that
|
|
|
|
// vectorizing division (added function call) is going to be very expensive.
|
|
|
|
// Double registers types.
|
|
|
|
{ ISD::SDIV, MVT::v1i64, 1 * FunctionCallDivCost},
|
|
|
|
{ ISD::UDIV, MVT::v1i64, 1 * FunctionCallDivCost},
|
|
|
|
{ ISD::SREM, MVT::v1i64, 1 * FunctionCallDivCost},
|
|
|
|
{ ISD::UREM, MVT::v1i64, 1 * FunctionCallDivCost},
|
|
|
|
{ ISD::SDIV, MVT::v2i32, 2 * FunctionCallDivCost},
|
|
|
|
{ ISD::UDIV, MVT::v2i32, 2 * FunctionCallDivCost},
|
|
|
|
{ ISD::SREM, MVT::v2i32, 2 * FunctionCallDivCost},
|
|
|
|
{ ISD::UREM, MVT::v2i32, 2 * FunctionCallDivCost},
|
|
|
|
{ ISD::SDIV, MVT::v4i16, ReciprocalDivCost},
|
|
|
|
{ ISD::UDIV, MVT::v4i16, ReciprocalDivCost},
|
|
|
|
{ ISD::SREM, MVT::v4i16, 4 * FunctionCallDivCost},
|
|
|
|
{ ISD::UREM, MVT::v4i16, 4 * FunctionCallDivCost},
|
|
|
|
{ ISD::SDIV, MVT::v8i8, ReciprocalDivCost},
|
|
|
|
{ ISD::UDIV, MVT::v8i8, ReciprocalDivCost},
|
|
|
|
{ ISD::SREM, MVT::v8i8, 8 * FunctionCallDivCost},
|
|
|
|
{ ISD::UREM, MVT::v8i8, 8 * FunctionCallDivCost},
|
|
|
|
// Quad register types.
|
|
|
|
{ ISD::SDIV, MVT::v2i64, 2 * FunctionCallDivCost},
|
|
|
|
{ ISD::UDIV, MVT::v2i64, 2 * FunctionCallDivCost},
|
|
|
|
{ ISD::SREM, MVT::v2i64, 2 * FunctionCallDivCost},
|
|
|
|
{ ISD::UREM, MVT::v2i64, 2 * FunctionCallDivCost},
|
|
|
|
{ ISD::SDIV, MVT::v4i32, 4 * FunctionCallDivCost},
|
|
|
|
{ ISD::UDIV, MVT::v4i32, 4 * FunctionCallDivCost},
|
|
|
|
{ ISD::SREM, MVT::v4i32, 4 * FunctionCallDivCost},
|
|
|
|
{ ISD::UREM, MVT::v4i32, 4 * FunctionCallDivCost},
|
|
|
|
{ ISD::SDIV, MVT::v8i16, 8 * FunctionCallDivCost},
|
|
|
|
{ ISD::UDIV, MVT::v8i16, 8 * FunctionCallDivCost},
|
|
|
|
{ ISD::SREM, MVT::v8i16, 8 * FunctionCallDivCost},
|
|
|
|
{ ISD::UREM, MVT::v8i16, 8 * FunctionCallDivCost},
|
|
|
|
{ ISD::SDIV, MVT::v16i8, 16 * FunctionCallDivCost},
|
|
|
|
{ ISD::UDIV, MVT::v16i8, 16 * FunctionCallDivCost},
|
|
|
|
{ ISD::SREM, MVT::v16i8, 16 * FunctionCallDivCost},
|
|
|
|
{ ISD::UREM, MVT::v16i8, 16 * FunctionCallDivCost},
|
|
|
|
// Multiplication.
|
|
|
|
};
|
|
|
|
|
|
|
|
if (ST->hasNEON())
|
2015-10-27 12:14:24 +08:00
|
|
|
if (const auto *Entry = CostTableLookup(CostTbl, ISDOpcode, LT.second))
|
|
|
|
return LT.first * Entry->Cost;
|
2013-04-26 05:16:18 +08:00
|
|
|
|
2015-08-06 02:08:10 +08:00
|
|
|
int Cost = BaseT::getArithmeticInstrCost(Opcode, Ty, Op1Info, Op2Info,
|
|
|
|
Opd1PropInfo, Opd2PropInfo);
|
2013-10-29 09:33:53 +08:00
|
|
|
|
|
|
|
// This is somewhat of a hack. The problem that we are facing is that SROA
|
|
|
|
// creates a sequence of shift, and, or instructions to construct values.
|
|
|
|
// These sequences are recognized by the ISel and have zero-cost. Not so for
|
|
|
|
// the vectorized code. Because we have support for v2i64 but not i64 those
|
2014-01-25 01:20:08 +08:00
|
|
|
// sequences look particularly beneficial to vectorize.
|
2013-10-29 09:33:53 +08:00
|
|
|
// To work around this we increase the cost of v2i64 operations to make them
|
|
|
|
// seem less beneficial.
|
|
|
|
if (LT.second == MVT::v2i64 &&
|
|
|
|
Op2Info == TargetTransformInfo::OK_UniformConstantValue)
|
|
|
|
Cost += 4;
|
|
|
|
|
|
|
|
return Cost;
|
2013-04-26 05:16:18 +08:00
|
|
|
}
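The table-driven lookup in getArithmeticInstrCost can be sketched outside LLVM as follows. This is a minimal illustration, not the LLVM implementation: `Op`, `VT`, `CostEntry`, `lookupCost` and `arithmeticCost` are hypothetical stand-ins for `ISD` opcodes, `MVT`, `CostTblEntry`, `CostTableLookup` and the member function, and the table holds only three rows taken from the one above.

```cpp
#include <cassert>

// Hypothetical stand-ins for ISD opcodes and MVT value types.
enum class Op { SDiv, UDiv, SRem, URem };
enum class VT { v4i16, v4i32 };

struct CostEntry {
  Op Opcode;
  VT Type;
  unsigned Cost;
};

const unsigned FunctionCallDivCost = 20;
const unsigned ReciprocalDivCost = 10;

// A three-row excerpt of the division cost table.
static const CostEntry Table[] = {
    {Op::SDiv, VT::v4i16, ReciprocalDivCost},
    {Op::SRem, VT::v4i16, 4 * FunctionCallDivCost},
    {Op::SDiv, VT::v4i32, 4 * FunctionCallDivCost},
};

// Linear search; returns nullptr when no row matches.
inline const CostEntry *lookupCost(Op O, VT T) {
  for (const CostEntry &E : Table)
    if (E.Opcode == O && E.Type == T)
      return &E;
  return nullptr;
}

// NumParts plays the role of LT.first (how many legal-typed pieces the
// operation splits into); Fallback stands in for the BaseT cost.
inline unsigned arithmeticCost(Op O, VT T, unsigned NumParts,
                               unsigned Fallback) {
  if (const CostEntry *E = lookupCost(O, T))
    return NumParts * E->Cost;
  return Fallback;
}
```

Under these assumptions, a v4i16 SREM that legalizes into two parts is costed at 2 * 4 * 20, while an opcode/type pair missing from the table falls through to the fallback value.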
|
|
|
|
|
2015-08-06 02:08:10 +08:00
|
|
|
int ARMTTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
|
2017-04-12 19:49:08 +08:00
|
|
|
unsigned AddressSpace, const Instruction *I) {
|
2015-08-06 02:08:10 +08:00
|
|
|
std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Src);
|
2013-10-29 09:33:57 +08:00
|
|
|
|
|
|
|
if (Src->isVectorTy() && Alignment != 16 &&
|
|
|
|
Src->getVectorElementType()->isDoubleTy()) {
|
|
|
|
// Unaligned loads/stores are extremely inefficient.
|
|
|
|
// We need 4 uops for vst.1/vld.1 vs 1 uop for vldr/vstr.
|
|
|
|
return LT.first * 4;
|
|
|
|
}
|
|
|
|
return LT.first;
|
|
|
|
}
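The rule in getMemoryOpCost reduces to a single predicate, sketched here without LLVM types. The function name `memoryOpCost` and its flattened parameters are hypothetical. `NumParts` stands in for `LT.first`.

```cpp
#include <cassert>

// Cost of a load/store split into NumParts legal-typed accesses.
// A vector of doubles that is not 16-byte aligned is charged 4x per part,
// matching the "4 uops vs 1 uop" comment in the original code.
inline int memoryOpCost(int NumParts, bool IsVector, bool EltIsDouble,
                        unsigned Alignment) {
  if (IsVector && Alignment != 16 && EltIsDouble)
    return NumParts * 4;
  return NumParts;
}
```

Note that only the exact alignment value 16 avoids the penalty in this sketch, mirroring the `Alignment != 16` test in the source.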
|
[ARM] Lower interleaved memory accesses to vldN/vstN intrinsics.
This patch also adds a function to calculate the cost of interleaved memory accesses.
E.g. Lower an interleaved load:
%wide.vec = load <8 x i32>, <8 x i32>* %ptr, align 4
%v0 = shuffle %wide.vec, undef, <0, 2, 4, 6>
%v1 = shuffle %wide.vec, undef, <1, 3, 5, 7>
into:
%vld2 = { <4 x i32>, <4 x i32> } call llvm.arm.neon.vld2(%ptr, 4)
%vec0 = extractelement { <4 x i32>, <4 x i32> } %vld2, i32 0
%vec1 = extractelement { <4 x i32>, <4 x i32> } %vld2, i32 1
E.g. Lower an interleaved store:
%i.vec = shuffle <8 x i32> %v0, <8 x i32> %v1, <0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11>
store <12 x i32> %i.vec, <12 x i32>* %ptr, align 4
into:
%sub.v0 = shuffle <8 x i32> %v0, <8 x i32> %v1, <0, 1, 2, 3>
%sub.v1 = shuffle <8 x i32> %v0, <8 x i32> %v1, <4, 5, 6, 7>
%sub.v2 = shuffle <8 x i32> %v0, <8 x i32> %v1, <8, 9, 10, 11>
call void llvm.arm.neon.vst3(%ptr, %sub.v0, %sub.v1, %sub.v2, 4)
Differential Revision: http://reviews.llvm.org/D10533
llvm-svn: 240755
2015-06-26 10:45:36 +08:00
|
|
|
|
2015-08-06 02:08:10 +08:00
|
|
|
int ARMTTIImpl::getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy,
|
|
|
|
unsigned Factor,
|
|
|
|
ArrayRef<unsigned> Indices,
|
|
|
|
unsigned Alignment,
|
2018-10-14 16:50:06 +08:00
|
|
|
unsigned AddressSpace,
|
2018-10-31 17:57:56 +08:00
|
|
|
bool UseMaskForCond,
|
|
|
|
bool UseMaskForGaps) {
|
2015-06-26 10:45:36 +08:00
|
|
|
assert(Factor >= 2 && "Invalid interleave factor");
|
|
|
|
assert(isa<VectorType>(VecTy) && "Expect a vector type");
|
|
|
|
|
|
|
|
// vldN/vstN do not support vector types with i64/f64 elements.
|
[AArch64][ARM] Don't base interleaved op legality on type alloc size.
Otherwise, we think that most types that look like they'd fit in a
legal vector type are legal (so, basically, *any* vector type with a
size between 33 and 128 bits, I think, since we use pow2 alignment;
e.g., v2i25, v3f32, ...).
DataLayout::getTypeAllocSize rounds up based on alignment.
When checking for target intrinsic legality, that's not what we want:
if rounding makes a difference, the type isn't legal, and the
target intrinsics shouldn't be used, as they are always assumed legal.
One could make the argument that alloc size is ultimately the most
relevant here, since we're dealing with LD/ST intrinsics. That's only
true if we did legalize them though; that's a problem for another day.
Use DataLayout::getTypeSizeInBits instead of getTypeAllocSizeInBits.
Type::getSizeInBits can't be used because that'd gratuitously break
pointer vector support.
Some of these uses are currently fine, because we only hit them when
the type is already known legal (e.g., r114454). Update them for
consistency. It's faster to avoid the rounding anyway!
llvm-svn: 255089
2015-12-09 09:19:50 +08:00
|
|
|
bool EltIs64Bits = DL.getTypeSizeInBits(VecTy->getScalarType()) == 64;
|
2015-06-26 10:45:36 +08:00
|
|
|
|
2018-10-14 16:50:06 +08:00
|
|
|
if (Factor <= TLI->getMaxSupportedInterleaveFactor() && !EltIs64Bits &&
|
2018-10-31 17:57:56 +08:00
|
|
|
!UseMaskForCond && !UseMaskForGaps) {
|
2015-06-26 10:45:36 +08:00
|
|
|
unsigned NumElts = VecTy->getVectorNumElements();
|
2017-04-11 02:34:37 +08:00
|
|
|
auto *SubVecTy = VectorType::get(VecTy->getScalarType(), NumElts / Factor);
|
2015-06-26 10:45:36 +08:00
|
|
|
|
|
|
|
// vldN/vstN only support legal vector types that are 64 or 128 bits wide.
|
2017-03-02 23:15:35 +08:00
|
|
|
// Accesses having vector types that are a multiple of 128 bits can be
|
|
|
|
// matched to more than one vldN/vstN instruction.
|
2017-04-11 02:34:37 +08:00
|
|
|
if (NumElts % Factor == 0 &&
|
|
|
|
TLI->isLegalInterleavedAccessType(SubVecTy, DL))
|
|
|
|
return Factor * TLI->getNumInterleavedAccesses(SubVecTy, DL);
|
2015-06-26 10:45:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
|
2018-10-31 17:57:56 +08:00
|
|
|
Alignment, AddressSpace,
|
|
|
|
UseMaskForCond, UseMaskForGaps);
|
2015-06-26 10:45:36 +08:00
|
|
|
}
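The interleaved-access cost above can be approximated in a standalone sketch. Several details here are assumptions rather than facts taken from this file: the maximum supported factor of 4, the legality rule (at least two elements, 8/16/32-bit elements, sub-vector of 64 bits or a multiple of 128 bits), and the access count of ceil(bits / 128) are my reading of what `getMaxSupportedInterleaveFactor`, `isLegalInterleavedAccessType` and `getNumInterleavedAccesses` do. The masking flags are omitted, and a return value of -1 is a hypothetical marker for "defer to the base implementation".

```cpp
#include <cassert>

// Returns Factor * (number of vldN/vstN instructions), or -1 when the
// access should be costed by the generic fallback instead.
inline int interleavedCost(unsigned Factor, unsigned NumElts,
                           unsigned EltBits) {
  const unsigned MaxFactor = 4; // assumed target limit
  if (Factor < 2 || Factor > MaxFactor || EltBits == 64)
    return -1;
  if (NumElts % Factor != 0)
    return -1;

  // Each of the Factor de-interleaved vectors has this shape.
  unsigned SubElts = NumElts / Factor;
  unsigned SubBits = SubElts * EltBits;

  // Assumed legality rule for the sub-vector type.
  bool EltOk = EltBits == 8 || EltBits == 16 || EltBits == 32;
  bool SizeOk = SubBits == 64 || SubBits % 128 == 0;
  if (SubElts < 2 || !EltOk || !SizeOk)
    return -1;

  // Sub-vectors wider than 128 bits need several instructions.
  unsigned NumAccesses = (SubBits + 127) / 128;
  return static_cast<int>(Factor * NumAccesses);
}
```

With these assumptions, the `<8 x i32>` factor-2 load from the commit message costs 2 and the `<12 x i32>` factor-3 store costs 3.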
|
2017-07-25 16:51:30 +08:00
|
|
|
|
2019-06-12 20:00:42 +08:00
|
|
|
bool ARMTTIImpl::isLoweredToCall(const Function *F) {
|
|
|
|
if (!F->isIntrinsic())
|
|
|
|
return BaseT::isLoweredToCall(F);
|
|
|
|
|
|
|
|
// Assume all Arm-specific intrinsics map to an instruction.
|
|
|
|
if (F->getName().startswith("llvm.arm"))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
switch (F->getIntrinsicID()) {
|
|
|
|
default: break;
|
|
|
|
case Intrinsic::powi:
|
|
|
|
case Intrinsic::sin:
|
|
|
|
case Intrinsic::cos:
|
|
|
|
case Intrinsic::pow:
|
|
|
|
case Intrinsic::log:
|
|
|
|
case Intrinsic::log10:
|
|
|
|
case Intrinsic::log2:
|
|
|
|
case Intrinsic::exp:
|
|
|
|
case Intrinsic::exp2:
|
|
|
|
return true;
|
|
|
|
case Intrinsic::sqrt:
|
|
|
|
case Intrinsic::fabs:
|
|
|
|
case Intrinsic::copysign:
|
|
|
|
case Intrinsic::floor:
|
|
|
|
case Intrinsic::ceil:
|
|
|
|
case Intrinsic::trunc:
|
|
|
|
case Intrinsic::rint:
|
|
|
|
case Intrinsic::nearbyint:
|
|
|
|
case Intrinsic::round:
|
|
|
|
case Intrinsic::canonicalize:
|
|
|
|
case Intrinsic::lround:
|
|
|
|
case Intrinsic::llround:
|
|
|
|
case Intrinsic::lrint:
|
|
|
|
case Intrinsic::llrint:
|
|
|
|
if (F->getReturnType()->isDoubleTy() && !ST->hasFP64())
|
|
|
|
return true;
|
|
|
|
if (F->getReturnType()->isHalfTy() && !ST->hasFullFP16())
|
|
|
|
return true;
|
|
|
|
// Some operations can be handled by vector instructions and assume
|
|
|
|
// unsupported vectors will be expanded into supported scalar ones.
|
|
|
|
// TODO: Handle scalar operations properly.
|
|
|
|
return !ST->hasFPARMv8Base() && !ST->hasVFP2Base();
|
|
|
|
case Intrinsic::masked_store:
|
|
|
|
case Intrinsic::masked_load:
|
|
|
|
case Intrinsic::masked_gather:
|
|
|
|
case Intrinsic::masked_scatter:
|
|
|
|
return !ST->hasMVEIntegerOps();
|
|
|
|
case Intrinsic::sadd_with_overflow:
|
|
|
|
case Intrinsic::uadd_with_overflow:
|
|
|
|
case Intrinsic::ssub_with_overflow:
|
|
|
|
case Intrinsic::usub_with_overflow:
|
|
|
|
case Intrinsic::sadd_sat:
|
|
|
|
case Intrinsic::uadd_sat:
|
|
|
|
case Intrinsic::ssub_sat:
|
|
|
|
case Intrinsic::usub_sat:
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
return BaseT::isLoweredToCall(F);
|
|
|
|
}
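The decision isLoweredToCall makes for the sqrt/fabs/floor/... group of intrinsics can be sketched as a small predicate. `FPType`, `FPFeatures` and `roundingIntrinsicIsCall` are hypothetical names. The four flags mirror the subtarget queries `hasFP64`, `hasFullFP16`, `hasFPARMv8Base` and `hasVFP2Base` used in the source.

```cpp
#include <cassert>

enum class FPType { Half, Float, Double };

// Hypothetical flattened view of the subtarget features consulted above.
struct FPFeatures {
  bool HasFP64;
  bool HasFullFP16;
  bool HasFPARMv8Base;
  bool HasVFP2Base;
};

// True when the intrinsic is expected to become a library call.
inline bool roundingIntrinsicIsCall(FPType RetTy, const FPFeatures &ST) {
  // Doubles need a double-precision FPU.
  if (RetTy == FPType::Double && !ST.HasFP64)
    return true;
  // Halves need full FP16 arithmetic support.
  if (RetTy == FPType::Half && !ST.HasFullFP16)
    return true;
  // Otherwise only a target with no FP unit at all falls back to a call.
  return !ST.HasFPARMv8Base && !ST.HasVFP2Base;
}
```

For example, a single-precision-only FPU with VFP2 keeps a float operation inline but turns the double-returning variant into a call.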
|
|
|
|
|
|
|
|
bool ARMTTIImpl::isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
|
|
|
|
AssumptionCache &AC,
|
|
|
|
TargetLibraryInfo *LibInfo,
|
2019-06-19 09:26:31 +08:00
|
|
|
HardwareLoopInfo &HWLoopInfo) {
|
2019-06-12 20:00:42 +08:00
|
|
|
// Low-overhead branches are only supported in the 'low-overhead branch'
|
|
|
|
// extension of v8.1-m.
|
|
|
|
if (!ST->hasLOB() || DisableLowOverheadLoops)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (!SE.hasLoopInvariantBackedgeTakenCount(L))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
const SCEV *BackedgeTakenCount = SE.getBackedgeTakenCount(L);
|
|
|
|
if (isa<SCEVCouldNotCompute>(BackedgeTakenCount))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
const SCEV *TripCountSCEV =
|
|
|
|
SE.getAddExpr(BackedgeTakenCount,
|
|
|
|
SE.getOne(BackedgeTakenCount->getType()));
|
|
|
|
|
|
|
|
// We need to store the trip count in LR, a 32-bit register.
|
|
|
|
if (SE.getUnsignedRangeMax(TripCountSCEV).getBitWidth() > 32)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
// Making a call will trash LR and clear LO_BRANCH_INFO, so there's little
|
|
|
|
// point in generating a hardware loop if that's going to happen.
|
|
|
|
auto MaybeCall = [this](Instruction &I) {
|
|
|
|
const ARMTargetLowering *TLI = getTLI();
|
|
|
|
unsigned ISD = TLI->InstructionOpcodeToISD(I.getOpcode());
|
|
|
|
EVT VT = TLI->getValueType(DL, I.getType(), true);
|
|
|
|
if (TLI->getOperationAction(ISD, VT) == TargetLowering::LibCall)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
// Check if an intrinsic will be lowered to a call and assume that any
|
|
|
|
// other CallInst will generate a bl.
|
|
|
|
if (auto *Call = dyn_cast<CallInst>(&I)) {
|
|
|
|
if (isa<IntrinsicInst>(Call)) {
|
|
|
|
if (const Function *F = Call->getCalledFunction())
|
|
|
|
return isLoweredToCall(F);
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
// FPv5 provides conversions between integer, double-precision,
|
|
|
|
// single-precision, and half-precision formats.
|
|
|
|
switch (I.getOpcode()) {
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
case Instruction::FPToSI:
|
|
|
|
case Instruction::FPToUI:
|
|
|
|
case Instruction::SIToFP:
|
|
|
|
case Instruction::UIToFP:
|
|
|
|
case Instruction::FPTrunc:
|
|
|
|
case Instruction::FPExt:
|
|
|
|
return !ST->hasFPARMv8Base();
|
|
|
|
}
|
|
|
|
|
|
|
|
// FIXME: Unfortunately the approach of checking the Operation Action does
|
|
|
|
// not catch all cases of Legalization that use library calls. Our
|
|
|
|
// Legalization step categorizes some transformations into library calls as
|
|
|
|
// Custom, Expand or even Legal when doing type legalization. So for now
|
|
|
|
// we have to special case for instance the SDIV of 64bit integers and the
|
|
|
|
// use of floating point emulation.
|
|
|
|
if (VT.isInteger() && VT.getSizeInBits() >= 64) {
|
|
|
|
switch (ISD) {
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
case ISD::SDIV:
|
|
|
|
case ISD::UDIV:
|
|
|
|
case ISD::SREM:
|
|
|
|
case ISD::UREM:
|
|
|
|
case ISD::SDIVREM:
|
|
|
|
case ISD::UDIVREM:
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Assume all other non-float operations are supported.
|
|
|
|
if (!VT.isFloatingPoint())
|
|
|
|
return false;
|
|
|
|
|
|
|
|
// We'll need a library call to handle most floats when using soft-float.
|
|
|
|
if (TLI->useSoftFloat()) {
|
|
|
|
switch (I.getOpcode()) {
|
|
|
|
default:
|
|
|
|
return true;
|
|
|
|
case Instruction::Alloca:
|
|
|
|
case Instruction::Load:
|
|
|
|
case Instruction::Store:
|
|
|
|
case Instruction::Select:
|
|
|
|
case Instruction::PHI:
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// We'll need a libcall to perform double precision operations on a single
|
|
|
|
// precision only FPU.
|
|
|
|
if (I.getType()->isDoubleTy() && !ST->hasFP64())
|
|
|
|
return true;
|
|
|
|
|
|
|
|
// Likewise for half precision arithmetic.
|
|
|
|
if (I.getType()->isHalfTy() && !ST->hasFullFP16())
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
};
|
|
|
|
|
2019-06-13 16:28:46 +08:00
|
|
|
auto IsHardwareLoopIntrinsic = [](Instruction &I) {
|
|
|
|
if (auto *Call = dyn_cast<IntrinsicInst>(&I)) {
|
2019-06-13 16:32:56 +08:00
|
|
|
switch (Call->getIntrinsicID()) {
|
2019-06-13 16:28:46 +08:00
|
|
|
default:
|
|
|
|
break;
|
|
|
|
case Intrinsic::set_loop_iterations:
|
|
|
|
case Intrinsic::loop_decrement:
|
|
|
|
case Intrinsic::loop_decrement_reg:
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
};
|
|
|
|
|
2019-06-12 20:00:42 +08:00
|
|
|
// Scan the instructions to see if there are any that we know will turn into a
|
2019-06-13 16:28:46 +08:00
|
|
|
// call or if this loop is already a low-overhead loop.
|
|
|
|
auto ScanLoop = [&](Loop *L) {
|
|
|
|
for (auto *BB : L->getBlocks()) {
|
|
|
|
for (auto &I : *BB) {
|
|
|
|
if (MaybeCall(I) || IsHardwareLoopIntrinsic(I))
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
};
|
|
|
|
|
|
|
|
// Visit inner loops.
|
|
|
|
for (auto Inner : *L)
|
|
|
|
if (!ScanLoop(Inner))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (!ScanLoop(L))
|
|
|
|
return false;
|
2019-06-12 20:00:42 +08:00
|
|
|
|
|
|
|
// TODO: Check whether the trip count calculation is expensive. If L is the
|
|
|
|
// inner loop but we know it has a low trip count, calculating that trip
|
|
|
|
// count (in the parent loop) may be detrimental.
|
|
|
|
|
|
|
|
LLVMContext &C = L->getHeader()->getContext();
|
|
|
|
HWLoopInfo.CounterInReg = true;
|
2019-06-13 16:28:46 +08:00
|
|
|
HWLoopInfo.IsNestingLegal = false;
|
2019-06-12 20:00:42 +08:00
|
|
|
HWLoopInfo.CountType = Type::getInt32Ty(C);
|
|
|
|
HWLoopInfo.LoopDecrement = ConstantInt::get(HWLoopInfo.CountType, 1);
|
|
|
|
return true;
|
|
|
|
}
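The scanning step of isHardwareLoopProfitable, which rejects a loop as soon as any instruction may become a call or is already a hardware-loop intrinsic, can be sketched as below. The types are hypothetical simplifications: each instruction is reduced to a pre-classified `Kind`, a loop to a list of blocks, and inner loops are passed as a separate list.

```cpp
#include <cassert>
#include <vector>

// Hypothetical pre-classified instruction kinds.
enum class Kind { Plain, MaybeCall, HardwareLoopIntrinsic };

using Block = std::vector<Kind>;
using LoopBody = std::vector<Block>;

// Mirrors ScanLoop: false as soon as one disqualifying instruction is seen.
inline bool scanLoop(const LoopBody &L) {
  for (const Block &BB : L)
    for (Kind K : BB)
      if (K == Kind::MaybeCall || K == Kind::HardwareLoopIntrinsic)
        return false;
  return true;
}

// Inner loops are visited first, then the loop itself.
inline bool loopsAreCallFree(const LoopBody &Outer,
                             const std::vector<LoopBody> &Inner) {
  for (const LoopBody &In : Inner)
    if (!scanLoop(In))
      return false;
  return scanLoop(Outer);
}
```

A call in any inner loop therefore disqualifies the outer loop as well, which matches the comment that a call would trash LR.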
|
|
|
|
|
2017-07-25 16:51:30 +08:00
|
|
|
void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
|
|
|
|
TTI::UnrollingPreferences &UP) {
|
|
|
|
// Only currently enable these preferences for M-Class cores.
|
2017-08-16 15:42:44 +08:00
|
|
|
if (!ST->isMClass())
|
2017-07-25 16:51:30 +08:00
|
|
|
return BasicTTIImplBase::getUnrollingPreferences(L, SE, UP);
|
|
|
|
|
|
|
|
// Disable loop unrolling for Oz and Os.
|
|
|
|
UP.OptSizeThreshold = 0;
|
|
|
|
UP.PartialOptSizeThreshold = 0;
|
2019-04-05 06:40:06 +08:00
|
|
|
if (L->getHeader()->getParent()->hasOptSize())
|
2017-10-23 16:05:14 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
// Only enable on Thumb-2 targets.
|
|
|
|
if (!ST->isThumb2())
|
|
|
|
return;
|
|
|
|
|
|
|
|
SmallVector<BasicBlock*, 4> ExitingBlocks;
|
|
|
|
L->getExitingBlocks(ExitingBlocks);
|
2018-05-14 20:53:11 +08:00
|
|
|
LLVM_DEBUG(dbgs() << "Loop has:\n"
|
|
|
|
<< "Blocks: " << L->getNumBlocks() << "\n"
|
|
|
|
<< "Exit blocks: " << ExitingBlocks.size() << "\n");
|
2017-10-23 16:05:14 +08:00
|
|
|
|
|
|
|
// Only allow one exit other than the latch. This acts as an early exit
|
|
|
|
// as it mirrors the profitability calculation of the runtime unroller.
|
|
|
|
if (ExitingBlocks.size() > 2)
|
|
|
|
return;
|
|
|
|
|
|
|
|
// Limit the CFG of the loop body for targets with a branch predictor.
|
|
|
|
// Allowing 4 blocks permits if-then-else diamonds in the body.
|
|
|
|
if (ST->hasBranchPredictor() && L->getNumBlocks() > 4)
|
2017-08-16 15:42:44 +08:00
|
|
|
return;
|
2017-07-25 16:51:30 +08:00
|
|
|
|
|
|
|
// Scan the loop: don't unroll loops with calls as this could prevent
|
|
|
|
// inlining.
|
2017-08-16 15:42:44 +08:00
|
|
|
unsigned Cost = 0;
|
2017-10-23 16:05:14 +08:00
|
|
|
for (auto *BB : L->getBlocks()) {
|
|
|
|
for (auto &I : *BB) {
|
|
|
|
if (isa<CallInst>(I) || isa<InvokeInst>(I)) {
|
|
|
|
ImmutableCallSite CS(&I);
|
|
|
|
if (const Function *F = CS.getCalledFunction()) {
|
|
|
|
if (!isLoweredToCall(F))
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
return;
|
2017-07-25 16:51:30 +08:00
|
|
|
}
|
2017-10-23 16:05:14 +08:00
|
|
|
SmallVector<const Value*, 4> Operands(I.value_op_begin(),
|
|
|
|
I.value_op_end());
|
|
|
|
Cost += getUserCost(&I, Operands);
|
2017-07-25 16:51:30 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-05-14 20:53:11 +08:00
|
|
|
LLVM_DEBUG(dbgs() << "Cost of loop: " << Cost << "\n");
|
2017-10-23 16:05:14 +08:00
|
|
|
|
2017-07-25 16:51:30 +08:00
|
|
|
UP.Partial = true;
|
|
|
|
UP.Runtime = true;
|
2019-06-10 18:22:14 +08:00
|
|
|
UP.UpperBound = true;
|
2017-08-16 15:42:44 +08:00
|
|
|
UP.UnrollRemainder = true;
|
|
|
|
UP.DefaultUnrollRuntimeCount = 4;
|
2018-07-01 20:47:30 +08:00
|
|
|
UP.UnrollAndJam = true;
|
|
|
|
UP.UnrollAndJamInnerLoopThreshold = 60;
|
2017-08-16 15:42:44 +08:00
|
|
|
|
|
|
|
// Force-unrolling small loops can be very useful because of the branch
|
|
|
|
// taken cost of the backedge.
|
|
|
|
if (Cost < 12)
|
|
|
|
UP.Force = true;
|
2017-07-25 16:51:30 +08:00
|
|
|
}
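The filtering logic at the end of getUnrollingPreferences can be condensed into a small decision function. This is a sketch under simplifying assumptions: the M-class, optimize-for-size and Thumb-2 checks are taken as already passed, and `LoopShape`, `UnrollDecision` and `decideUnroll` are hypothetical names. The thresholds 2, 4 and 12 come from the source.

```cpp
#include <cassert>

// Hypothetical summary of the loop properties the source inspects.
struct LoopShape {
  unsigned NumExitingBlocks;
  unsigned NumBlocks;
  bool HasRealCall; // a call that is actually lowered to a call
  unsigned Cost;    // summed per-instruction user cost
};

struct UnrollDecision {
  bool Enable; // partial/runtime unrolling preferences are set
  bool Force;  // UP.Force would be set
};

inline UnrollDecision decideUnroll(const LoopShape &L,
                                   bool HasBranchPredictor) {
  UnrollDecision D{false, false};
  if (L.NumExitingBlocks > 2)
    return D; // more than one exit besides the latch
  if (HasBranchPredictor && L.NumBlocks > 4)
    return D; // body CFG larger than an if-then-else diamond
  if (L.HasRealCall)
    return D; // unrolling could prevent inlining
  D.Enable = true;
  D.Force = L.Cost < 12; // small loops: the backedge branch dominates
  return D;
}
```

For instance, a single-block loop of cost 8 with no calls is both enabled and forced, while the same loop at cost 20 is enabled but not forced.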
|