This will be needed in the loop-vectorizer where the minimum VF
requested may be a scalable VF. getMinimumVF now takes an additional
operand 'IsScalableVF' that indicates whether a scalable VF is required.
Reviewed By: kparzysz, rampitec
Differential Revision: https://reviews.llvm.org/D96020
This reverts commit 502a67dd7f.
This expose a failure in test-suite build on PowerPC,
revert to unblock buildbot first,
Dave will re-commit in https://reviews.llvm.org/D96287.
Thanks Dave.
getIntrinsicInstrCost takes a IntrinsicCostAttributes holding various
parameters of the intrinsic being costed. It can either be called with a
scalar intrinsic (RetTy==Scalar, VF==1), with a vector instruction
(RetTy==Vector, VF==1) or from the vectorizer with a scalar type and
vector width (RetTy==Scalar, VF>1). A RetTy==Vector, VF>1 is considered
an error. Both of the vector modes are expected to be treated the same,
but because this is confusing many backends end up getting it wrong.
Instead of trying work with those two values separately this removes the
VF parameter, widening the RetTy/ArgTys by VF used called from the
vectorizer. This keeps things simpler, but does require some other
modifications to keep things consistent.
Most backends look like this will be an improvement (or were not using
getIntrinsicInstrCost). AMDGPU needed the most changes to keep the code
from c230965ccf working. ARM removed the fix in
dfac521da1, webassembly happens to get a fixup for an SLP cost
issue and both X86 and AArch64 seem to now be using better costs from
the vectorizer.
Differential Revision: https://reviews.llvm.org/D95291
Having a custom inliner doesn't really fit in with the new PM's
pipeline. It's also extra technical debt.
amdgpu-inline only does a couple of custom things compared to the normal
inliner:
1) It disables inlining if the number of BBs in a function would exceed
some limit
2) It increases the threshold if there are pointers to private arrays(?)
These can all be handled as TTI inliner hooks.
There already exists a hook for backends to multiply the inlining
threshold.
This way we can remove the custom amdgpu-inline pass.
This caused inline-hint.ll to fail, and after some investigation, it
looks like getInliningThresholdMultiplier() was previously getting
applied twice in amdgpu-inline (https://reviews.llvm.org/D62707 fixed it
not applying at all, so some later inliner change must have fixed
something), so I had to change the threshold in the test.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D94153
This adds cost modelling for the inloop vectorization added in
745bf6cf44. Up until now they have been modelled as the original
underlying instruction, usually an add. This happens to works OK for MVE
with instructions that are reducing into the same type as they are
working on. But MVE's instructions can perform the equivalent of an
extended MLA as a single instruction:
%sa = sext <16 x i8> A to <16 x i32>
%sb = sext <16 x i8> B to <16 x i32>
%m = mul <16 x i32> %sa, %sb
%r = vecreduce.add(%m)
->
R = VMLADAV A, B
There are other instructions for performing add reductions of
v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64
(VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV).
The i64 are particularly interesting as there are no native i64 add/mul
instructions, leading to the i64 add and mul naturally getting very
high costs.
Also worth mentioning, under NEON there is the concept of a sdot/udot
instruction which performs a partial reduction from a v16i8 to a v4i32.
They extend and mul/sum the first four elements from the inputs into the
first element of the output, repeating for each of the four output
lanes. They could possibly be represented in the same way as above in
llvm, so long as a vecreduce.add could perform a partial reduction. The
vectorizer would then produce a combination of in and outer loop
reductions to efficiently use the sdot and udot instructions. Although
this patch does not do that yet, it does suggest that separating the
input reduction type from the produced result type is a useful concept
to model. It also shows that a MLA reduction as a single instruction is
fairly common.
This patch attempt to improve the costmodelling of in-loop reductions
by:
- Adding some pattern matching in the loop vectorizer cost model to
match extended reduction patterns that are optionally extended and/or
MLA patterns. This marks the cost of the reduction instruction correctly
and the sext/zext/mul leading up to it as free, which is otherwise
difficult to tell and may get a very high cost. (In the long run this
can hopefully be replaced by vplan producing a single node and costing
it correctly, but that is not yet something that vplan can do).
- getExtendedAddReductionCost is added to query the cost of these
extended reduction patterns.
- Expanded the ARM costs to account for these expanded sizes, which is a
fairly simple change in itself.
- Some minor alterations to allow inloop reduction larger than the highest
vector width and i64 MVE reductions.
- An extra InLoopReductionImmediateChains map was added to the vectorizer
for it to efficiently detect which instructions are reductions in the
cost model.
- The tests have some updates to show what I believe is optimal
vectorization and where we are now.
Put together this can greatly improve performance for reduction loop
under MVE.
Differential Revision: https://reviews.llvm.org/D93476
A new TTI interface has been added 'Optional <unsigned>getMaxVScale' that
returns the maximum vscale for a given target.
When known getMaxVScale is used to compute the cost of masked gather scatter
for scalable vector.
Depends on D92094
Differential Revision: https://reviews.llvm.org/D93030
This is split off from D91718 and adds a new target hook
supportsScalableVectors that can be queried to check if scalable vectors
are supported by the backend. For AArch64 this returns true if SVE is
enabled.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D93060
D82227 has added a proper check to limit PHI vectorization to the
maximum vector register size. That unfortunately resulted in at
least a couple of regressions on SystemZ and x86.
This change reverts PHI handling from D82227 and replaces it with
a more general check in SLPVectorizerPass::tryToVectorizeList().
Moved to tryToVectorizeList() it allows to restart vectorization
if initial chunk fails.
However, this function is more general and handles not only PHI
but everything which SLP handles. If vectorization factor would
be limited to maximum vector register size it would limit much
more vectorization than before leading to further regressions.
Therefore a new TTI callback getMaximumVF() is added with the
default 0 to preserve current behavior and limit nothing. Then
targets can decide what is better for them.
The callback gets ElementSize just like a similar getMinimumVF()
function and the main opcode of the chain. The latter is to avoid
regressions at least on the AMDGPU. We can have loads and stores
up to 128 bit wide, and <2 x 16> bit vector math on some
subtargets, where the rest shall not be vectorized. I.e. we need
to differentiate based on the element size and operation itself.
Differential Revision: https://reviews.llvm.org/D92059
This patch replaces the attribute `unsigned VF` in the class
IntrinsicCostAttributes by `ElementCount VF`.
This is a non-functional change to help upcoming patches to compute the cost
model for scalable vector inside this class.
Differential Revision: https://reviews.llvm.org/D91532
Putting the +1 before the zero-extend will allow scalar evolution to fold the expression in some cases such as the one shown in PowerPC's `shrink-wrap.ll` test.
Reviewed By: samparker
Differential Revision: https://reviews.llvm.org/D91724
- In certain cases, a generic pointer could be assumed as a pointer to
the global memory space or other spaces. With a dedicated target hook
to query that address space from a given value, infer-address-space
pass could infer and propagate that to all its users.
Differential Revision: https://reviews.llvm.org/D91121
This reverts commits:
* [LoopVectorizer] NFCI: Calculate register usage based on TLI.getTypeLegalizationCost.
b873aba394.
* [LoopVectorizer] Silence warning in GetRegUsage.
9ff701100a.
This is more accurate than dividing the bitwidth based on the element count by the
maximum register size, as it can just reuse whatever has been calculated for
legalization of these types.
This change is also necessary when calculating register usage for scalable vectors, where
the legalization of these types cannot be done based on the widest register size, because
that does not take the 'vscale' component into account.
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D91059
This reverts the revert commit 408c4408fa.
This version of the patch includes a fix for a crash caused by
treating ICmp/FCmp constant expressions as instructions.
Original message:
On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.
This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.
This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.
I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.
On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.
This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.
This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.
I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.
Reviewed By: dmgreen, RKSimon
Differential Revision: https://reviews.llvm.org/D90070
This appears to be an error of code duplication - instead of
one constructor variant calling another, we have N similar
but not identical versions.
I think this is 'NFC' based on the current callers, but it's
hard to tell or guess the intent in all cases.
Changes TTI function getIntImmCostInst to take an additional Instruction parameter,
which enables us to be able to check it is part of a min(max())/max(min()) pattern that will match SSAT.
We can then mark the constant used as free to prevent it being hoisted so SSAT can still be generated.
Required minor changes in some non-ARM backends to allow for the optional parameter to be included.
Differential Revision: https://reviews.llvm.org/D87457
This allows the backend to tell the vectorizer to produce inloop
reductions through a TTI hook.
For the moment on ARM under MVE this means allowing integer add
reductions of the correct size. In the future this can include integer
min/max too, under -Os.
Differential Revision: https://reviews.llvm.org/D75512
As part of D84741, this adds a target hook for the
preferPredicatedReductionSelect option and makes use
of it under MVE, allowing us to tail predicate most
reduction loops.
Differential Revision: https://reviews.llvm.org/D85980
Currently, getCastInstrCost has limited information about the cast it's
rating, often just the opcode and types. Sometimes there is a context
instruction as well, but it isn't trustworthy: for instance, when the
vectorizer is rating a plan, it calls getCastInstrCost with the old
instructions when, in fact, it's trying to evaluate the cost of the
instruction post-vectorization. Thus, the current system can get the
cost of certain casts incorrect as the correct cost can vary greatly
based on the context in which it's used.
For example, if the vectorizer queries getCastInstrCost to evaluate the
cost of a sext(load) with tail predication enabled, getCastInstrCost
will think it's free most of the time, but it's not always free. On ARM
MVE, a VLD2 group cannot be extended like a normal VLDR can. Similar
situations can come up with how masked loads can be extended when being
split.
To fix that, this path adds a new parameter to getCastInstrCost to give
it a hint about the context of the cast. It adds a CastContextHint enum
which contains the type of the load/store being created by the
vectorizer - one for each of the types it can produce.
Original patch by Pierre van Houtryve
Differential Revision: https://reviews.llvm.org/D79162
For a long time, the InstCombine pass handled target specific
intrinsics. Having target specific code in general passes was noted as
an area for improvement for a long time.
D81728 moves most target specific code out of the InstCombine pass.
Applying the target specific combinations in an extra pass would
probably result in inferior optimizations compared to the current
fixed-point iteration, therefore the InstCombine pass resorts to newly
introduced functions in the TargetTransformInfo when it encounters
unknown intrinsics.
The patch should not have any effect on generated code (under the
assumption that code never uses intrinsics from a foreign target).
This introduces three new functions:
TargetTransformInfo::instCombineIntrinsic
TargetTransformInfo::simplifyDemandedUseBitsIntrinsic
TargetTransformInfo::simplifyDemandedVectorEltsIntrinsic
A few target specific parts are left in the InstCombine folder, where
it makes sense to share code. The largest left-over part in
InstCombineCalls.cpp is the code shared between arm and aarch64.
This allows to move about 3000 lines out from InstCombine to the targets.
Differential Revision: https://reviews.llvm.org/D81728
Summary:
This patch separates the peeling specific parameters from the UnrollingPreferences,
and creates a new struct called PeelingPreferences. Functions which used the
UnrollingPreferences struct for peeling have been updated to use the PeelingPreferences struct.
Author: sidbav (Sidharth Baveja)
Reviewers: Whitney (Whitney Tsang), Meinersbur (Michael Kruse), skatkov (Serguei Katkov), ashlykov (Arkady Shlykov), bogner (Justin Bogner), hfinkel (Hal Finkel), anhtuyen (Anh Tuyen Tran), nikic (Nikita Popov)
Reviewed By: Meinersbur (Michael Kruse)
Subscribers: fhahn (Florian Hahn), hiraditya (Aditya Kumar), llvm-commits, LLVM
Tag: LLVM
Differential Revision: https://reviews.llvm.org/D80580
Summary:
Get back `const` partially lost in one of recent changes.
Additionally specify explicit qualifiers in few places.
Reviewers: samparker
Reviewed By: samparker
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D82383
Have TTI::getInstructionThroughput call getUserCost for Br, Ret and
PHI. This now means that eveything in getInstructionThroughput is
handled by getUserCost.
Differential Revision: https://reviews.llvm.org/D79849
Enable TTIImpl::getUserCost to handle FNeg so that
getInstructionThroughput can call that instead. This means we can
remove the code in the AMDGPU backend too.
Differential Revision: https://reviews.llvm.org/D81635
Move the cost modelling, with the reduction pattern matching, from
getInstructionThroughput into generic TTIImpl::getUserCost. The
modelling in the AMDGPU backend can now be removed.
Differential Revision: https://reviews.llvm.org/D81643
Extract the existing code from getInstructionThroughput into
TTImpl::getUserCost. The duplicated code in the AMDGPU backend has
also been removed.
Differential Revision: https://reviews.llvm.org/D81448
Add the remaining arithmetic opcodes into the generic implementation
of getUserCost and then call this from getInstructionThroughput. Most
of the backends have been modified to return the base implementation
for cost kinds other RecipThroughput. The outlier here is AMDGPU
which already uses getArithmeticInstrCost for all the cost kinds.
This change means that most of the opcodes can be removed from that
backends implementation of getUserCost.
Differential Revision: https://reviews.llvm.org/D80992
Add cases for icmp, fcmp and select into the switch statement of the
generic getUserCost implementation with getInstructionThroughput then
calling into it. The BasicTTI and backend implementations have be set
to return a default value (1) when a cost other than throughput is
being queried.
Differential Revision: https://reviews.llvm.org/D80550
Use getMemoryOpCost from the generic implementation of getUserCost
and have getInstructionThroughput return the result of that for loads
and stores.
This also means that the X86 implementation of getUserCost can be
removed with the functionality folded into its getMemoryOpCost.
Differential Revision: https://reviews.llvm.org/D80984
This is split off from D79100 and adds a new target hook emitGetActiveLaneMask
that can be queried to check if the intrinsic @llvm.get.active.lane.mask() is
supported by the backend and if it should be emitted for a given loop.
See also commit rG7fb8a40e5220 and its commit message for more details/context
on this new intrinsic.
Differential Revision: https://reviews.llvm.org/D80597
This one is slightly odd since it counts as an address expression,
which previously could never fail. Allow the existing TTI hook to
return the value to use, and re-use it for handling how to handle
ptrmask.
Handles the no-op addrspacecasts for AMDGPU. We could probably do
something better based on analysis of the mask value based on the
address space, but leave that for now.
Last part of recommitting 'Unify Intrinsic Costs'
259eb619ff. This patch now uses
getUserCost from getInstructionThroughput.
Differential Revision: https://reviews.llvm.org/D80012
Add the remaining cast instruction opcodes to the base implementation
of getUserCost and directly return the result. This allows
getInstructionThroughput to return getUserCost for the casts. This
has required changes to PPC and SystemZ because they implement
getUserCost and/or getCastInstrCost with adjustments for vector
operations. Adjusts have also been made in the remaining backends
that implement the method so that they still produce a cost of zero
or one for cost kinds other than throughput.
Differential Revision: https://reviews.llvm.org/D79848
Recommitting most of the remaining changes from
259eb619ff, but excluding the call to
getUserCost from getInstructionThroughput. Though there's still no
test changes, I doubt that this is an NFC...
With the two getIntrinsicInstrCosts folded into one, now fold in the
scalar/code-size orientated getIntrinsicCost. The remaining scalar
intrinsics were memcpy, cttz and ctlz which now have special handling
in the BasicTTI implementation.
This had required a change in the AMDGPU backend for fabs as it
should always be 'free'. I've also changed the X86 backend to return
the BaseT implementation when the CostKind isn't RecipThroughput.
Differential Revision: https://reviews.llvm.org/D80012
With the two getIntrinsicInstrCosts folded into one, now fold in the
scalar/code-size orientated getIntrinsicCost. This involved sinking
cost of the TTIImpl into the base implementation, as it performs no
target checks. The opcodes remaining were memcpy, cttz and ctlz which
now have special handling in the BasicTTI implementation.
getInstructionThroughput can now directly return the result of
getUserCost.
This had required a change in the AMDGPU backend for fabs and its
always 'free'. I've also changed the X86 backend to return '1' for
any intrinsic when the CostKind isn't RecipThroughput.
Though this intended to be a non-functional change, there are many
paths being combined here so I would be very surprised if this didn't
have an effect.
Differential Revision: https://reviews.llvm.org/D80012
This has not been implemented by any backends which appear to cover
the functionality through getCastInstrCost. Sink what there is in the
default implementation into BasicTTI.
Differential Revision: https://reviews.llvm.org/D78922
Combine the two API calls into one by introducing a structure to hold
the relevant data. This has the added benefit of moving the boiler
plate code for arguments and flags, into the constructors. This is
intended to be a non-functional change, but the complicated web of
logic involved here makes it very hard to guarantee.
Differential Revision: https://reviews.llvm.org/D79941
This patch adds a new TTI hook to allow targets to tell LSR that
a chain including some instruction is already profitable and
should not be optimized. This patch also adds an implementation
of this TTI hook for ARM so LSR doesn't optimize chains that include
the VCTP intrinsic.
Differential Revision: https://reviews.llvm.org/D79418
getScalarizationOverhead is only ever called with vectors (and we already had a load of cast<VectorType> calls immediately inside the functions).
Followup to D78357
Reviewed By: @samparker
Differential Revision: https://reviews.llvm.org/D79341
Make the kind of cost explicit throughout the cost model which,
apart from making the cost clear, will allow the generic parts to
calculate better costs. It will also allow some backends to
approximate and correlate the different costs if they wish. Another
benefit is that it will also help simplify the cost model around
immediate and intrinsic costs, where we currently have multiple APIs.
RFC thread:
http://lists.llvm.org/pipermail/llvm-dev/2020-April/141263.html
Differential Revision: https://reviews.llvm.org/D79002
The improvements to the x86 vector insert/extract element costs in D74976 resulted in the estimated costs for vector initialization and scalarization increasing higher than should be expected. This is particularly noticeable on pre-SSE4 targets where the available of legal INSERT_VECTOR_ELT ops is more limited.
This patch does 2 things:
1 - it implements X86TTIImpl::getScalarizationOverhead to more accurately represent the typical costs of a ISD::BUILD_VECTOR pattern.
2 - it adds a DemandedElts mask to getScalarizationOverhead to permit the SLP's BoUpSLP::getGatherCost to be rewritten to use it directly instead of accumulating raw vector insertion costs.
This fixes PR45418 where a v4i8 (zext'd to v4i32) was no longer vectorizing.
A future patch should extend X86TTIImpl::getScalarizationOverhead to tweak the EXTRACT_VECTOR_ELT scalarization costs as well.
Reviewed By: @craig.topper
Differential Revision: https://reviews.llvm.org/D78216
There are several different types of cost that TTI tries to provide
explicit information for: throughput, latency, code size along with
a vague 'intersection of code-size cost and execution cost'.
The vectorizer is a keen user of RecipThroughput and there's at least
'getInstructionThroughput' and 'getArithmeticInstrCost' designed to
help with this cost. The latency cost has a single use and a single
implementation. The intersection cost appears to cover most of the
rest of the API.
getUserCost is explicitly called from within TTI when the user has
been explicit in wanting the code size (also only one use) as well
as a few passes which are concerned with a mixture of size and/or
a relative cost. In many cases these costs are closely related, such
as when multiple instructions are required, but one evident diverging
cost in this function is for div/rem.
This patch adds an argument so that the cost required is explicit,
so that we can make the important distinction when necessary.
Differential Revision: https://reviews.llvm.org/D78635
This API call has been used recently with, a very valid, expectation
that it would do something useful but it doesn't actually query any
backend information. So, remove this method and merge its
functionality into getUserCost. As well as that, also use
getCastInstrCost to get a proper cost from the backend for the
concerned instructions though we only currently return the answer if
it's considered free. The default implementation now also checks
int/ptr conversions too, as well as truncs and bitcasts.
Differential Revision: https://reviews.llvm.org/D76124
The API for shuffles and reductions uses generic Type parameters,
instead of VectorType, and so assertions and casts are used a lot.
This patch makes those types explicit, which means that the clients
can't be lazy, but results in less ambiguity, and that can only be a
good thing.
Bugzilla: https://bugs.llvm.org/show_bug.cgi?id=45562
Differential Revision: https://reviews.llvm.org/D78357
Summary:
Remove usages of asserting vector getters in Type in preparation for the
VectorType refactor. The existence of these functions complicates the
refactor while adding little value.
Reviewers: sunfish, sdesmalen, efriedma
Reviewed By: efriedma
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77273
This patch adds
- New arguments to getMinPrefetchStride() to let the target decide on a
per-loop basis if software prefetching should be done even with a stride
within the limit of the hw prefetcher.
- New TTI hook enableWritePrefetching() to let a target do write prefetching
by default (defaults to false).
- In LoopDataPrefetch:
- A search through the whole loop to gather information before emitting any
prefetches. This way the target can get information via new arguments to
getMinPrefetchStride() and emit prefetches more selectively. Collected
information includes: Does the loop have a call, how many memory
accesses, how many of them are strided, how many prefetches will cover
them. This is NFC to before as long as the target does not change its
definition of getMinPrefetchStride().
- If a previous access to the same exact address was 'read', and the
current one is 'write', make it a 'write' prefetch.
- If two accesses that are covered by the same prefetch do not dominate
each other, put the prefetch in a block that dominates both of them.
- If a ConstantMaxTripCount is less than ItersAhead, then skip the loop.
- A SystemZ implementation of getMinPrefetchStride().
Review: Ulrich Weigand, Michael Kruse
Differential Revision: https://reviews.llvm.org/D70228
getCallCost is only used within the different layers of TTI, with no
backend implementing it so fold the base implementation into
getUserCost. I think this is an NFC.
Differential Revision: https://reviews.llvm.org/D77050
Instead, represent the mask as out-of-line data in the instruction. This
should be more efficient in the places that currently use
getShuffleVector(), and paves the way for further changes to add new
shuffles for scalable vectors.
This doesn't change the syntax in textual IR. And I don't currently plan
to change the bitcode encoding in this patch, although we'll probably
need to do something once we extend shufflevector for scalable types.
I expect that once this is finished, we can then replace the raw "mask"
with something more appropriate for scalable vectors. Not sure exactly
what this looks like at the moment, but there are a few different ways
we could handle it. Maybe we could try to describe specific shuffles.
Or maybe we could define it in terms of a function to convert a fixed-length
array into an appropriate scalable vector, using a "step", or something
like that.
Differential Revision: https://reviews.llvm.org/D72467
Refines the gather/scatter cost model, but also changes the TTI
function getIntrinsicInstrCost to accept an additional parameter
which is needed for the gather/scatter cost evaluation.
This did require trivial changes in some non-ARM backends to
adopt the new parameter.
Extending gathers and truncating scatters are now priced cheaper.
Differential Revision: https://reviews.llvm.org/D75525
Summary:
Enable the new diveregence analysis by default for AMDGPU.
Resubmit with test updates since GPUDA was causing failures on Windows.
Reviewers: rampitec, nhaehnle, arsenm, thakis
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73315
Add an extra parameter so alignment can be taken under
consideration in gather/scatter legalization.
Differential Revision: https://reviews.llvm.org/D71610
Soon Intrinsic::ID will be a plain integer, so this overload will not be
possible.
Rename both overloads to ensure that downstream targets observe this as
a build failure instead of a runtime failure.
Split off from D71320
Reviewers: efriedma
Differential Revision: https://reviews.llvm.org/D71381
This attempts to teach the cost model in Arm that code such as:
%s = shl i32 %a, 3
%a = and i32 %s, %b
Can under Arm or Thumb2 become:
and r0, r1, r2, lsl #3
So the cost of the shift can essentially be free. To do this without
trying to artificially adjust the cost of the "and" instruction, it
needs to get the users of the shl and check if they are a type of
instruction that the shift can be folded into. And so it needs to have
access to the actual instruction in getArithmeticInstrCost, which if
available is added as an extra parameter much like getCastInstrCost.
We otherwise limit it to shifts with a single user, which should
hopefully handle most of the cases. The list of instruction that the
shift can be folded into include ADC, ADD, AND, BIC, CMP, EOR, MVN, ORR,
ORN, RSB, SBC and SUB. This translates to Add, Sub, And, Or, Xor and
ICmp.
Differential Revision: https://reviews.llvm.org/D70966
This file lists every pass in LLVM, and is included by Pass.h, which is
very popular. Every time we add, remove, or rename a pass in LLVM, it
caused lots of recompilation.
I found this fact by looking at this table, which is sorted by the
number of times a file was changed over the last 100,000 git commits
multiplied by the number of object files that depend on it in the
current checkout:
recompiles touches affected_files header
342380 95 3604 llvm/include/llvm/ADT/STLExtras.h
314730 234 1345 llvm/include/llvm/InitializePasses.h
307036 118 2602 llvm/include/llvm/ADT/APInt.h
213049 59 3611 llvm/include/llvm/Support/MathExtras.h
170422 47 3626 llvm/include/llvm/Support/Compiler.h
162225 45 3605 llvm/include/llvm/ADT/Optional.h
158319 63 2513 llvm/include/llvm/ADT/Triple.h
140322 39 3598 llvm/include/llvm/ADT/StringRef.h
137647 59 2333 llvm/include/llvm/Support/Error.h
131619 73 1803 llvm/include/llvm/Support/FileSystem.h
Before this change, touching InitializePasses.h would cause 1345 files
to recompile. After this change, touching it only causes 550 compiles in
an incremental rebuild.
Reviewers: bkramer, asbirlea, bollu, jdoerfert
Differential Revision: https://reviews.llvm.org/D70211
We have two ways to steer creating a predicated vector body over creating a
scalar epilogue. To force this, we have 1) a command line option and 2) a
pragma available. This adds a third: a target hook to TargetTransformInfo that
can be queried whether predication is preferred or not, which allows the
vectoriser to make the decision without forcing it.
While this change behaves as a non-functional change for now, it shows the
required TTI plumbing, usage of this new hook in the vectoriser, and the
beginning of an ARM MVE implementation. I will follow up on this with:
- a complete MVE implementation, see D69845.
- a patch to disable this, i.e. we should respect "vector_predicate(disable)"
and its corresponding loophint.
Differential Revision: https://reviews.llvm.org/D69040
Add an extra parameter so the backend can take the alignment into
consideration.
Differential Revision: https://reviews.llvm.org/D68400
llvm-svn: 374763
In loop-vectorize, interleave count and vector factor depend on target register number. Currently, it does not
estimate different register pressure for different register class separately(especially for scalar type,
float type should not be on the same position with int type), so it's not accurate. Specifically,
it causes too many times interleaving/unrolling, result in too many register spills in loop body and hurting performance.
So we need classify the register classes in IR level, and importantly these are abstract register classes,
and are not the target register class of backend provided in td file. It's used to establish the mapping between
the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types.
For example, POWER target, register num is special when VSX is enabled. When VSX is enabled, the number of int scalar register is 32(GPR),
float is 64(VSR), but for int and float vector register both are 64(VSR). So there should be 2 kinds of register class when vsx is enabled,
and 3 kinds of register class when VSX is NOT enabled.
It runs on POWER target, it makes big(+~30%) performance improvement in one specific bmk(503.bwaves_r) of spec2017 and no other obvious degressions.
Differential revision: https://reviews.llvm.org/D67148
llvm-svn: 374634
Move the default implementations of cache and prefetch queries to
TargetTransformInfoImplBase and delete them from NoTIIImpl. This brings these
interfaces in line with how other TTI interfaces work.
Differential Revision: https://reviews.llvm.org/D68804
llvm-svn: 374446
Re-apply 9fdfb045ae8b/r365676 with fixes for PPC and Hexagon. This involved
moving defaults from TargetTransformInfoImplBase to MCSubtargetInfo.
Rework the TTI cache and software prefetching APIs to prepare for the
introduction of a general system model. Changes include:
- Marking existing interfaces const and/or override as appropriate
- Adding comments
- Adding BasicTTIImpl interfaces that delegate to a subtarget
implementation
- Moving the default TargetTransformInfoImplBase implementation to a default
MCSubtarget implementation
Only a handful of targets use these interfaces currently: AArch64, Hexagon, PPC
and SystemZ. AArch64 already has a custom subtarget implementation, so its
custom TTI implementation is migrated to use the new facilities in BasicTTIImpl
to invoke its custom subtarget implementation. The custom TTI implementations
continue to exist for the other targets with this change. They are not moved
over to subtarget-based implementations.
The end goal is to have the default subtarget implementation defer to the system
model defined by the target. With this change, the default MCSubtargetInfo
implementation essentially returns the defaults TargetTransformInfoImplBase used
to return. Existing users of TTI defaults will hit the defaults now in
MCSubtargetInfo. Targets that define their own custom TTI implementations won't
use the BasicTTIImpl implementations that route to the subtarget.
Once system models are in place for the targets that use these interfaces, their
custom TTI implementations can be removed.
Differential Revision: https://reviews.llvm.org/D63614
llvm-svn: 374205
Also Revert "[LoopVectorize] Fix non-debug builds after rL374017"
This reverts commit 9f41deccc0.
This reverts commit 18b6fe07bc.
The patch is breaking PowerPC internal build, checked with author, reverting
on behalf of him for now due to timezone.
llvm-svn: 374091
In loop-vectorize, interleave count and vector factor depend on target register number. Currently, it does not
estimate different register pressure for different register class separately(especially for scalar type,
float type should not be on the same position with int type), so it's not accurate. Specifically,
it causes too many times interleaving/unrolling, result in too many register spills in loop body and hurting performance.
So we need classify the register classes in IR level, and importantly these are abstract register classes,
and are not the target register class of backend provided in td file. It's used to establish the mapping between
the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types.
For example, POWER target, register num is special when VSX is enabled. When VSX is enabled, the number of int scalar register is 32(GPR),
float is 64(VSR), but for int and float vector register both are 64(VSR). So there should be 2 kinds of register class when vsx is enabled,
and 3 kinds of register class when VSX is NOT enabled.
It runs on POWER target, it makes big(+~30%) performance improvement in one specific bmk(503.bwaves_r) of spec2017 and no other obvious degressions.
Differential revision: https://reviews.llvm.org/D67148
llvm-svn: 374017
This reverts SVN r373833, as it caused a failed assert "Non-zero loop
cost expected" on building numerous projects, see PR43582 for details
and reproduction samples.
llvm-svn: 373882
I don't see an ideal solution to these 2 related, potentially large, perf regressions:
https://bugs.llvm.org/show_bug.cgi?id=42708https://bugs.llvm.org/show_bug.cgi?id=43146
We decided that load combining was unsuitable for IR because it could obscure other
optimizations in IR. So we removed the LoadCombiner pass and deferred to the backend.
Therefore, preventing SLP from destroying load combine opportunities requires that it
recognizes patterns that could be combined later, but not do the optimization itself (
it's not a vector combine anyway, so it's probably out-of-scope for SLP).
Here, we add a scalar cost model adjustment with a conservative pattern match and cost
summation for a multi-instruction sequence that can probably be reduced later.
This should prevent SLP from creating a vector reduction unless that sequence is
extremely cheap.
In the x86 tests shown (and discussed in more detail in the bug reports), SDAG combining
will produce a single instruction on these tests like:
movbe rax, qword ptr [rdi]
or:
mov rax, qword ptr [rdi]
Not some (half) vector monstrosity as we currently do using SLP:
vpmovzxbq ymm0, dword ptr [rdi + 1] # ymm0 = mem[0],zero,zero,..
vpsllvq ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
movzx eax, byte ptr [rdi]
movzx ecx, byte ptr [rdi + 5]
shl rcx, 40
movzx edx, byte ptr [rdi + 6]
shl rdx, 48
or rdx, rcx
movzx ecx, byte ptr [rdi + 7]
shl rcx, 56
or rcx, rdx
or rcx, rax
vextracti128 xmm1, ymm0, 1
vpor xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 78 # xmm1 = xmm0[2,3,0,1]
vpor xmm0, xmm0, xmm1
vmovq rax, xmm0
or rax, rcx
vzeroupper
ret
Differential Revision: https://reviews.llvm.org/D67841
llvm-svn: 373833
I'm planning on handling intrinsics that will benefit from checking
the address space enums. Don't bother moving the address collection
for now, since those won't need th enums.
llvm-svn: 368895
Summary:
Since the target has no significant advantage of vectorization,
vector instructions bous threshold bonus should be optional.
amdgpu-inline-arg-alloca-cost parameter default value and the target
InliningThresholdMultiplier value tuned then respectively.
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, eraman, hiraditya, haicheng, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D64642
llvm-svn: 366348