Add specific scalar costs for CTLZ instructions, we can't discriminate between CTLZ and CTLZ_ZERO_UNDEF so we have to assume the worst. Given how BSR is often a microcoded nightmare on some older targets we might still be underestimating it.
For targets supporting LZCNT (Intel Haswell+ or AMD Fam10+), we provide overrides that assume 1cy costs.
llvm-svn: 374786
Add specific scalar costs for ctpop instructions, these are based on the llvm-mca's SLM throughput numbers (the oldest model we have).
For targets supporting POPCNT, we provide overrides that assume 1cy costs.
llvm-svn: 374775
I can't see any notable differences in costs between SSE2 and SSE42 arches for FADD/ADD reduction, so I've lowered the target to just SSE2.
I've also added vXi8 sum reduction costs in line with the PSADBW codegen and discussions on PR42674.
llvm-svn: 374655
SLM is 2 x slower for <2 x i64> comparison ops than other vector types, we should account for this like we do for SLM <2 x i64> add/sub/mul costs.
This should remove some of the SLM codegen diffs in D43582
llvm-svn: 372954
We are missing costs for a lot of truncation cases, I'm hoping to address all the 'zero cost' cases in trunc.ll
I thought this was a vector widening side effect, but even before this we had some interesting LV decisions (notably over indvars) being made due to these zero costs.
llvm-svn: 372498
The recently announced IBM z15 processor implements the architecture
already supported as "arch13" in LLVM. This patch adds support for
"z15" as an alternate architecture name for arch13.
The patch also uses z15 in a number of places where we used arch13
as long as the official name was not yet announced.
llvm-svn: 372435
I don't really understand the costs we're using for fp_to_sint,
but prior to widening legalization we used 20 as the cost for this
via the v2i64->v2f64 entry. That number seems better than the 40
we got with widening legalization. So now we need either a
v2i32->v2f64 entry or a v4i32->v2f64 entry depending on whether
AVX is enabled or not since we skip the first SSE2 table look up
under AVX.
llvm-svn: 369628
This adds some sext costs for MVE, taken from the length of assembly sequences
that we currently generate.
Differential Revision: https://reviews.llvm.org/D66010
llvm-svn: 369244
MVE also has some sext of loads, which will be free just as scalar
instructions are.
Differential Revision: https://reviews.llvm.org/D66008
llvm-svn: 369118
Now that we're using widening legalization. We need to improve our extract_subvector cost model for these types. This patch begins by modeling these as a subvector extract followed by a permute. I've left FIXMEs in the code for future improvements.
Differential Revision: https://reviews.llvm.org/D65892
llvm-svn: 369022
Now that we legalize by widening, the element types here won't change. Previously these were modeled as the elements being widened and then the instruction might become an AND or SHL/ASHR pair. But now they'll become something like a ZERO_EXTEND_VECTOR_INREG/SIGN_EXTEND_VECTOR_INREG.
For AVX2, when the destination type is legal its clear the cost should be 1 since we have extend instructions that can produce 256 bit vectors from less than 128 bit vectors. I'm a little less sure about AVX1 costs, but I think the ones I changed were definitely too high, but they might still be too high.
Differential Revision: https://reviews.llvm.org/D66169
llvm-svn: 368858
The MVE architecture has the idea of "beats", where a vector instruction can be
executed over several ticks of the architecture. This adds a similar system
into the Arm backend cost model, multiplying the cost of all vector
instructions by a factor.
This factor essentially becomes the expected difference between scalar code
and vector code, on average. MVE Vector instructions can also overlap so the a
true cost of them is often lower. But equally scalar instructions can in some
situations be dual issued, or have other optimisations such as unrolling or
make use of dsp instructions. The default is chosen as 2. This should not
prevent vectorisation is a most cases (as the vector instructions will still be
doing at least 4 times the work), but it will help prevent over vectorising in
cases where the benefits are less likely.
This adds things so far to the obvious places in ARMTargetTransformInfo, and
updates a few related costs like not treating float instructions as cost 2 just
because they are floats.
Differential Revision: https://reviews.llvm.org/D66005
llvm-svn: 368733
This teaches the cost model that the sext or zext of a load is going to be
free.
Differential Revision: https://reviews.llvm.org/D66006
llvm-svn: 368593
A VDUP will perform a vector broadcast in a single instruction. Update the cost
model for MVE accordingly.
Code originally by David Sherwood.
Differential Revision: https://reviews.llvm.org/D63448
llvm-svn: 368589
This puts some of the calls in ARMTargetTransformInfo.cpp behind hasNeon()
checks, now that we have MVE, and updates all the tests accordingly.
Differential Revision: https://reviews.llvm.org/D63447
llvm-svn: 368587
This adds a number of cost model tests for ARM, useful for MVE. It also re-jigs
some of the existing tests to make them easier to update and read.
llvm-svn: 368586
The assert that caused this to be reverted should be fixed now.
Original commit message:
This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.
This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.
Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.
llvm-svn: 368183
This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.
This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.
Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.
llvm-svn: 367901
This patch series adds support for the next-generation arch13
CPU architecture to the SystemZ backend.
This includes:
- Basic support for the new processor and its features.
- Assembler/disassembler support for new instructions.
- CodeGen for new instructions, including new LLVM intrinsics.
- Scheduler description for the new processor.
- Detection of arch13 as host processor.
Note: No currently available Z system supports the arch13
architecture. Once new systems become available, the
official system name will be added as supported -march name.
llvm-svn: 365932
This patch uses the mechanism from D62995 to strengthen the
definitions of the reduction intrinsics by letting the scalar
result/accumulator type be overloaded from the vector element type.
For example:
; The LLVM LangRef specifies that the scalar result must equal the
; vector element type, but this is not checked/enforced by LLVM.
declare i32 @llvm.experimental.vector.reduce.or.i32.v4i32(<4 x i32> %a)
This patch changes that into:
declare i32 @llvm.experimental.vector.reduce.or.v4i32(<4 x i32> %a)
Which has the type-constraint more explicit and causes LLVM to check
the result type with the vector element type.
Reviewers: RKSimon, arsenm, rnk, greened, aemerson
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D62996
llvm-svn: 363240
Types such as float and i64's do not have legal loads in Thumb1, but will still
be loaded with a LDR (or potentially multiple LDR's). As such we can treat the
cost of addressing mode calculations the same as an i32 and get some optimisation
benefits.
Differential Revision: https://reviews.llvm.org/D62968
llvm-svn: 362874
Now with MVE being added, we can add the vector addressing mode costs for it.
These are generally imm7 multiplied by the size of the type being loaded /
stored.
Differential Revision: https://reviews.llvm.org/D62967
llvm-svn: 362873
The fp16 version of VLDR takes a imm8 multiplied by 2. This updates the costs
to account for those, and adds extra testing. It is dependant upon hasFPRegs16
as this is what the load/store instructions require.
Differential Revision: https://reviews.llvm.org/D62966
llvm-svn: 362872
For some reason multiple places need to do this, and the variant the
loop unroller and inliner use was not handling it.
Also, introduce a new wrapper to be slightly more precise, since on
AMDGPU some addrspacecasts are free, but not no-ops.
llvm-svn: 362436
Summary:
This reuses the getArithmeticInstrCost, but passes dummy values of the second
operand flags.
The X86 costs are wrong and can be improved in a follow up. I just wanted to
stop it from reporting an unknown cost first.
Reviewers: RKSimon, spatel, andrew.w.kaylor, cameron.mcinally
Reviewed By: spatel
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D62444
llvm-svn: 361788