This reverts the revert commit 408c4408fa.
This version of the patch includes a fix for a crash caused by
treating ICmp/FCmp constant expressions as instructions.
Original message:
On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.
This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.
This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.
I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.
On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.
This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.
This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.
I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.
Reviewed By: dmgreen, RKSimon
Differential Revision: https://reviews.llvm.org/D90070
Changes TTI function getIntImmCostInst to take an additional Instruction parameter,
which enables us to be able to check it is part of a min(max())/max(min()) pattern that will match SSAT.
We can then mark the constant used as free to prevent it being hoisted so SSAT can still be generated.
Required minor changes in some non-ARM backends to allow for the optional parameter to be included.
Differential Revision: https://reviews.llvm.org/D87457
Currently, getCastInstrCost has limited information about the cast it's
rating, often just the opcode and types. Sometimes there is a context
instruction as well, but it isn't trustworthy: for instance, when the
vectorizer is rating a plan, it calls getCastInstrCost with the old
instructions when, in fact, it's trying to evaluate the cost of the
instruction post-vectorization. Thus, the current system can get the
cost of certain casts incorrect as the correct cost can vary greatly
based on the context in which it's used.
For example, if the vectorizer queries getCastInstrCost to evaluate the
cost of a sext(load) with tail predication enabled, getCastInstrCost
will think it's free most of the time, but it's not always free. On ARM
MVE, a VLD2 group cannot be extended like a normal VLDR can. Similar
situations can come up with how masked loads can be extended when being
split.
To fix that, this path adds a new parameter to getCastInstrCost to give
it a hint about the context of the cast. It adds a CastContextHint enum
which contains the type of the load/store being created by the
vectorizer - one for each of the types it can produce.
Original patch by Pierre van Houtryve
Differential Revision: https://reviews.llvm.org/D79162
For a long time, the InstCombine pass handled target specific
intrinsics. Having target specific code in general passes was noted as
an area for improvement for a long time.
D81728 moves most target specific code out of the InstCombine pass.
Applying the target specific combinations in an extra pass would
probably result in inferior optimizations compared to the current
fixed-point iteration, therefore the InstCombine pass resorts to newly
introduced functions in the TargetTransformInfo when it encounters
unknown intrinsics.
The patch should not have any effect on generated code (under the
assumption that code never uses intrinsics from a foreign target).
This introduces three new functions:
TargetTransformInfo::instCombineIntrinsic
TargetTransformInfo::simplifyDemandedUseBitsIntrinsic
TargetTransformInfo::simplifyDemandedVectorEltsIntrinsic
A few target specific parts are left in the InstCombine folder, where
it makes sense to share code. The largest left-over part in
InstCombineCalls.cpp is the code shared between arm and aarch64.
This allows to move about 3000 lines out from InstCombine to the targets.
Differential Revision: https://reviews.llvm.org/D81728
Summary:
This patch separates the peeling specific parameters from the UnrollingPreferences,
and creates a new struct called PeelingPreferences. Functions which used the
UnrollingPreferences struct for peeling have been updated to use the PeelingPreferences struct.
Author: sidbav (Sidharth Baveja)
Reviewers: Whitney (Whitney Tsang), Meinersbur (Michael Kruse), skatkov (Serguei Katkov), ashlykov (Arkady Shlykov), bogner (Justin Bogner), hfinkel (Hal Finkel), anhtuyen (Anh Tuyen Tran), nikic (Nikita Popov)
Reviewed By: Meinersbur (Michael Kruse)
Subscribers: fhahn (Florian Hahn), hiraditya (Aditya Kumar), llvm-commits, LLVM
Tag: LLVM
Differential Revision: https://reviews.llvm.org/D80580
Have BasicTTI call the base implementation so that both agree on the
default behaviour, which the default being a cost of '1'. This has
required an X86 specific implementation as it seems to be very
reliant on those instructions being free. Changes are also made to
AMDGPU so that their implementations distinguish between cost kinds,
so that the unrolling isn't affected. PowerPC also has its own
implementation to prevent changes to the reg-usage vectorizer test.
The cost model test changes now reflect that ret instructions are not
generally free.
Differential Revision: https://reviews.llvm.org/D79164
Combine the two API calls into one by introducing a structure to hold
the relevant data. This has the added benefit of moving the boiler
plate code for arguments and flags, into the constructors. This is
intended to be a non-functional change, but the complicated web of
logic involved here makes it very hard to guarantee.
Differential Revision: https://reviews.llvm.org/D79941
Make the kind of cost explicit throughout the cost model which,
apart from making the cost clear, will allow the generic parts to
calculate better costs. It will also allow some backends to
approximate and correlate the different costs if they wish. Another
benefit is that it will also help simplify the cost model around
immediate and intrinsic costs, where we currently have multiple APIs.
RFC thread:
http://lists.llvm.org/pipermail/llvm-dev/2020-April/141263.html
Differential Revision: https://reviews.llvm.org/D79002
There are several different types of cost that TTI tries to provide
explicit information for: throughput, latency, code size along with
a vague 'intersection of code-size cost and execution cost'.
The vectorizer is a keen user of RecipThroughput and there's at least
'getInstructionThroughput' and 'getArithmeticInstrCost' designed to
help with this cost. The latency cost has a single use and a single
implementation. The intersection cost appears to cover most of the
rest of the API.
getUserCost is explicitly called from within TTI when the user has
been explicit in wanting the code size (also only one use) as well
as a few passes which are concerned with a mixture of size and/or
a relative cost. In many cases these costs are closely related, such
as when multiple instructions are required, but one evident diverging
cost in this function is for div/rem.
This patch adds an argument so that the cost required is explicit,
so that we can make the important distinction when necessary.
Differential Revision: https://reviews.llvm.org/D78635
Refines the gather/scatter cost model, but also changes the TTI
function getIntrinsicInstrCost to accept an additional parameter
which is needed for the gather/scatter cost evaluation.
This did require trivial changes in some non-ARM backends to
adopt the new parameter.
Extending gathers and truncating scatters are now priced cheaper.
Differential Revision: https://reviews.llvm.org/D75525
On Powerpc, set instruction count as lsr first priority of lsr by default.
Add an option ppc-lsr-no-insns-cost to return back to default lsr cost model.
Reviewed By: steven.zhang, jsji
Differential Revision: https://reviews.llvm.org/D72683
We somehow missed doing this when we were working on Power9 exploitation.
This just adds the missing legalization and cost for producing the vector
intrinsics.
Differential revision: https://reviews.llvm.org/D70436
Soon Intrinsic::ID will be a plain integer, so this overload will not be
possible.
Rename both overloads to ensure that downstream targets observe this as
a build failure instead of a runtime failure.
Split off from D71320
Reviewers: efriedma
Differential Revision: https://reviews.llvm.org/D71381
This attempts to teach the cost model in Arm that code such as:
%s = shl i32 %a, 3
%a = and i32 %s, %b
Can under Arm or Thumb2 become:
and r0, r1, r2, lsl #3
So the cost of the shift can essentially be free. To do this without
trying to artificially adjust the cost of the "and" instruction, it
needs to get the users of the shl and check if they are a type of
instruction that the shift can be folded into. And so it needs to have
access to the actual instruction in getArithmeticInstrCost, which if
available is added as an extra parameter much like getCastInstrCost.
We otherwise limit it to shifts with a single user, which should
hopefully handle most of the cases. The list of instruction that the
shift can be folded into include ADC, ADD, AND, BIC, CMP, EOR, MVN, ORR,
ORN, RSB, SBC and SUB. This translates to Add, Sub, And, Or, Xor and
ICmp.
Differential Revision: https://reviews.llvm.org/D70966
In loop-vectorize, interleave count and vector factor depend on target register number. Currently, it does not
estimate different register pressure for different register class separately(especially for scalar type,
float type should not be on the same position with int type), so it's not accurate. Specifically,
it causes too many times interleaving/unrolling, result in too many register spills in loop body and hurting performance.
So we need classify the register classes in IR level, and importantly these are abstract register classes,
and are not the target register class of backend provided in td file. It's used to establish the mapping between
the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types.
For example, POWER target, register num is special when VSX is enabled. When VSX is enabled, the number of int scalar register is 32(GPR),
float is 64(VSR), but for int and float vector register both are 64(VSR). So there should be 2 kinds of register class when vsx is enabled,
and 3 kinds of register class when VSX is NOT enabled.
It runs on POWER target, it makes big(+~30%) performance improvement in one specific bmk(503.bwaves_r) of spec2017 and no other obvious degressions.
Differential revision: https://reviews.llvm.org/D67148
llvm-svn: 374634
Re-apply 9fdfb045ae8b/r365676 with fixes for PPC and Hexagon. This involved
moving defaults from TargetTransformInfoImplBase to MCSubtargetInfo.
Rework the TTI cache and software prefetching APIs to prepare for the
introduction of a general system model. Changes include:
- Marking existing interfaces const and/or override as appropriate
- Adding comments
- Adding BasicTTIImpl interfaces that delegate to a subtarget
implementation
- Moving the default TargetTransformInfoImplBase implementation to a default
MCSubtarget implementation
Only a handful of targets use these interfaces currently: AArch64, Hexagon, PPC
and SystemZ. AArch64 already has a custom subtarget implementation, so its
custom TTI implementation is migrated to use the new facilities in BasicTTIImpl
to invoke its custom subtarget implementation. The custom TTI implementations
continue to exist for the other targets with this change. They are not moved
over to subtarget-based implementations.
The end goal is to have the default subtarget implementation defer to the system
model defined by the target. With this change, the default MCSubtargetInfo
implementation essentially returns the defaults TargetTransformInfoImplBase used
to return. Existing users of TTI defaults will hit the defaults now in
MCSubtargetInfo. Targets that define their own custom TTI implementations won't
use the BasicTTIImpl implementations that route to the subtarget.
Once system models are in place for the targets that use these interfaces, their
custom TTI implementations can be removed.
Differential Revision: https://reviews.llvm.org/D63614
llvm-svn: 374205
Also Revert "[LoopVectorize] Fix non-debug builds after rL374017"
This reverts commit 9f41deccc0.
This reverts commit 18b6fe07bc.
The patch is breaking PowerPC internal build, checked with author, reverting
on behalf of him for now due to timezone.
llvm-svn: 374091
In loop-vectorize, interleave count and vector factor depend on target register number. Currently, it does not
estimate different register pressure for different register class separately(especially for scalar type,
float type should not be on the same position with int type), so it's not accurate. Specifically,
it causes too many times interleaving/unrolling, result in too many register spills in loop body and hurting performance.
So we need classify the register classes in IR level, and importantly these are abstract register classes,
and are not the target register class of backend provided in td file. It's used to establish the mapping between
the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types.
For example, POWER target, register num is special when VSX is enabled. When VSX is enabled, the number of int scalar register is 32(GPR),
float is 64(VSR), but for int and float vector register both are 64(VSR). So there should be 2 kinds of register class when vsx is enabled,
and 3 kinds of register class when VSX is NOT enabled.
It runs on POWER target, it makes big(+~30%) performance improvement in one specific bmk(503.bwaves_r) of spec2017 and no other obvious degressions.
Differential revision: https://reviews.llvm.org/D67148
llvm-svn: 374017
Rework the TTI cache and software prefetching APIs to prepare for the
introduction of a general system model. Changes include:
- Marking existing interfaces const and/or override as appropriate
- Adding comments
- Adding BasicTTIImpl interfaces that delegate to a subtarget
implementation
- Adding a default "no information" subtarget implementation
Only a handful of targets use these interfaces currently: AArch64,
Hexagon, PPC and SystemZ. AArch64 already has a custom subtarget
implementation, so its custom TTI implementation is migrated to use
the new facilities in BasicTTIImpl to invoke its custom subtarget
implementation. The custom TTI implementations continue to exist for
the other targets with this change. They are not moved over to
subtarget-based implementations.
The end goal is to have the default subtarget implementation defer to
the system model defined by the target. With this change, the default
subtarget implementation essentially returns "no information" for
these interfaces. None of the existing users of TTI will hit that
implementation because they define their own custom TTI
implementations and won't use the BasicTTIImpl implementations.
Once system models are in place for the targets that use these
interfaces, their custom TTI implementations can be removed.
Differential Revision: https://reviews.llvm.org/D63614
llvm-svn: 365676
Patch which introduces a target-independent framework for generating
hardware loops at the IR level. Most of the code has been taken from
PowerPC CTRLoops and PowerPC has been ported over to use this generic
pass. The target dependent parts have been moved into
TargetTransformInfo, via isHardwareLoopProfitable, with
HardwareLoopInfo introduced to transfer information from the backend.
Three generic intrinsics have been introduced:
- void @llvm.set_loop_iterations
Takes as a single operand, the number of iterations to be executed.
- i1 @llvm.loop_decrement(anyint)
Takes the maximum number of elements processed in an iteration of
the loop body and subtracts this from the total count. Returns
false when the loop should exit.
- anyint @llvm.loop_decrement_reg(anyint, anyint)
Takes the number of elements remaining to be processed as well as
the maximum numbe of elements processed in an iteration of the loop
body. Returns the updated number of elements remaining.
llvm-svn: 362774
For the power9 CPU, vector operations consume a pair of execution units rather
than one execution unit like a scalar operation. Update the target transform
cost functions to reflect the higher cost of vector operations when targeting
Power9.
Patch by RolandF.
Differential revision: https://reviews.llvm.org/D55461
llvm-svn: 352261
to reflect the new license.
We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.
Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.
llvm-svn: 351636
optsize using masked wide loads
Under Opt for Size, the vectorizer does not vectorize interleave-groups that
have gaps at the end of the group (such as a loop that reads only the even
elements: a[2*i]) because that implies that we'll require a scalar epilogue
(which is not allowed under Opt for Size). This patch extends the support for
masked-interleave-groups (introduced by D53011 for conditional accesses) to
also cover the case of gaps in a group of loads; Targets that enable the
masked-interleave-group feature don't have to invalidate interleave-groups of
loads with gaps; they could now use masked wide-loads and shuffles (if that's
what the cost model selects).
Reviewers: Ayal, hsaito, dcaballe, fhahn
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D53668
llvm-svn: 345705
interleave-group
The vectorizer currently does not attempt to create interleave-groups that
contain predicated loads/stores; predicated strided accesses can currently be
vectorized only using masked gather/scatter or scalarization. This patch makes
predicated loads/stores candidates for forming interleave-groups during the
Loop-Vectorizer's analysis, and adds the proper support for masked-interleave-
groups to the Loop-Vectorizer's planning and transformation stages. The patch
also extends the TTI API to allow querying the cost of masked interleave groups
(which each target can control); Targets that support masked vector loads/
stores may choose to enable this feature and allow vectorizing predicated
strided loads/stores using masked wide loads/stores and shuffles.
Reviewers: Ayal, hsaito, dcaballe, fhahn, javed.absar
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D53011
llvm-svn: 344472
The purpose of this patch is to have LSR generate better code on Power.
This is done by overriding isLSRCostLess.
Differential Revision: https://reviews.llvm.org/D40855
llvm-svn: 326906
candidates with coldcc attribute.
This recommits r322721 reverted due to sanitizer memory leak build bot failures.
Original commit message:
This patch adds support for the coldcc calling convention for Power.
This changes the set of non-volatile registers. It includes a pass to stress
test the implementation by marking all static directly called functions with
the coldcc attribute through the option -enable-coldcc-stress-test. It also
includes an option, -ppc-enable-coldcc, to add the coldcc attribute to
functions which are cold at all call sites based on BlockFrequencyInfo when
the containing function does not call any non cold functions.
Differential Revision: https://reviews.llvm.org/D38413
llvm-svn: 323778
candidates with coldcc attribute.
This patch adds support for the coldcc calling convention for Power.
This changes the set of non-volatile registers. It includes a pass to stress
test the implementation by marking all static directly called functions with
the coldcc attribute through the option -enable-coldcc-stress-test. It also
includes an option, -ppc-enable-coldcc, to add the coldcc attribute to
functions which are cold at all call sites based on BlockFrequencyInfo when
the containing function does not call any non cold functions.
Differential Revision: https://reviews.llvm.org/D38413
llvm-svn: 322721
All these headers already depend on CodeGen headers so moving them into
CodeGen fixes the layering (since CodeGen depends on Target, not the
other way around).
llvm-svn: 318490
- Targets that want to support memcmp expansions now return the list of
supported load sizes.
- Expansion codegen does not assume that all power-of-two load sizes
smaller than the max load size are valid. For examples, this is not the
case for x86(32bit)+sse2.
Fixes PR34887.
llvm-svn: 316905