This patch adds the ability to use a PALIGNR to rotate a pair of inputs to select a range containing all the referenced elements, followed by a single-input permute to put them in the right locations.
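A rough standalone sketch of the idea at element granularity (the real PALIGNR shifts bytes within 128-bit lanes; all names here are illustrative):
  #include <array>
  #include <cstdint>
  #include <cstdio>
  // Simulates palignr(Hi, Lo, Shift): concatenate Hi:Lo and take four lanes
  // starting at lane 'Shift' (element granularity for clarity).
  static std::array<uint32_t, 4> palignr(std::array<uint32_t, 4> Hi,
                                         std::array<uint32_t, 4> Lo,
                                         unsigned Shift) {
    std::array<uint32_t, 8> Cat;
    for (unsigned i = 0; i < 4; ++i) { Cat[i] = Lo[i]; Cat[i + 4] = Hi[i]; }
    std::array<uint32_t, 4> R;
    for (unsigned i = 0; i < 4; ++i) R[i] = Cat[i + Shift];
    return R;
  }
  int main() {
    std::array<uint32_t, 4> V1 = {0, 1, 2, 3}, V2 = {4, 5, 6, 7};
    // Two-input mask <3,2,5,4>: all referenced elements lie in [2..5], so
    // rotate them into one register, then apply a single-input permute.
    auto Rot = palignr(V2, V1, 2);                     // {2, 3, 4, 5}
    std::array<uint32_t, 4> Res = {Rot[1], Rot[0], Rot[3], Rot[2]}; // pshufd
    for (uint32_t E : Res) printf("%u ", E);           // prints: 3 2 5 4
  }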
Differential Revision: https://reviews.llvm.org/D54267
llvm-svn: 346706
Truncate and shuffle lowering are already capable of matching to PACKUS using known bits analysis.
This features one test change where we now prefer to extend v16i16->v16i32 then trunc v16i32->v16i8 over extract_subvector+packus when avx512f is available, but avx512bw is not.
llvm-svn: 346697
When we repeat the two shift operands, this is a bit rotation. Annoyingly, this has to be done in a different getIntrinsicInstrCost overload than most intrinsics use, as we need to check that the operands are the same.
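A hedged sketch of the check, assuming the Value-based overload and a hypothetical getRotateCost helper (the actual patch may be structured differently):
  // Most intrinsics are costed from types alone; a rotate can only be
  // recognized in the overload that sees the actual operand Values.
  int X86TTIImpl::getIntrinsicInstrCost(Intrinsic::ID IID, Type *RetTy,
                                        ArrayRef<Value *> Args,
                                        FastMathFlags FMF, unsigned VF) {
    if ((IID == Intrinsic::fshl || IID == Intrinsic::fshr) &&
        Args[0] == Args[1])
      return getRotateCost(IID, RetTy); // hypothetical helper
    return BaseT::getIntrinsicInstrCost(IID, RetTy, Args, FMF, VF);
  }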
llvm-svn: 346688
getConstant will create a BUILD_VECTOR for us and use a legal type if necessary. So just create the simple node and let BUILD_VECTOR legalization do the canonicalization.
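For example (sketch, not the exact call site):
  // getConstant splats to a BUILD_VECTOR itself, using a legal element
  // type if the requested one isn't, so this is all that's needed:
  SDValue Zero = DAG.getConstant(0, DL, VT);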
llvm-svn: 346603
There are two AGU units, and per 1cy, there can be either two loads,
or a load and a store; but not two stores, or two loads and a store.
Additionally, loads shouldn't affect the store scheduler and vice versa.
(but *should* affect the PdEX scheduler.)
Required rL346545.
Fixes https://bugs.llvm.org/show_bug.cgi?id=39465
llvm-svn: 346587
The sdivrem will emit its own MOVSX to move %ah to the low byte of a register. Using a MOVSX for the any_extend as well allows a post-isel peephole to merge them.
llvm-svn: 346581
This gives shuffle lowering the freedom to use zero_extend_vector_inreg for the unpckl shuffle. Shuffle combining usually makes this swap later, but not when AVX512 is enabled it seems.
While there, also use DAG.getConstant to create a zero vector instead of using the helper that forces a specific BUILD_VECTOR. I don't think that helper is usually needed. We're basically free to create a constant build_vector anytime and it will be legalized on its own.
llvm-svn: 346574
With avx512f but not avx512bw we need to extend to v16i32 then truncate that to v16i8. Previously we emitted both nodes during lowering, but I'm trying to switch to using target independent nodes, and with that switch the extend+truncate would be handled differently.
This patch changes the implementation to what will be necessary with that patch which helps minimize test diffs.
llvm-svn: 346552
This makes X86ISD::VSEXT more similar to ISD::SIGN_EXTEND and ISD::ZERO_EXTEND.
I'm hoping to replace X86ISD::VSEXT/VZEXT with target independent nodes. Making the target specific nodes similar to the target independent nodes helps minimize test diffs in that patch.
llvm-svn: 346539
I noticed that we weren't generating broadcasts as much as I thought we would with
D54271, and this is part of the problem.
Widening the shuffle elements means adding bitcasts and hiding the relationship
between a splatted scalar and the vector. If we can form a broadcast, do that
before going through the rest of the shuffle lowering because broadcasts should
be cheap and can often be load-folded.
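A minimal sketch of the reordering, assuming the usual lowering helper names (illustrative, not the literal diff):
  // Try to form a broadcast from a splatted scalar *before* widening the
  // shuffle elements, since widening inserts bitcasts that hide the splat.
  if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(DL, VT, V1, V2, Mask,
                                                        Subtarget, DAG))
    return Broadcast;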
Differential Revision: https://reviews.llvm.org/D54280
llvm-svn: 346498
Summary:
This simplifies the code and moves everything to tablegen for consistency. This
also prepares the ground for adding issue counters.
Reviewers: gchatelet, john.brawn, jsji
Subscribers: nemanjai, mgorny, javed.absar, kbarton, tschuett, llvm-commits
Differential Revision: https://reviews.llvm.org/D54297
llvm-svn: 346489
As discussed in D54073, we have a potential regression from more aggressive vector narrowing here, so let's try to avoid that by changing build-vector lowering slightly.
Insert-vector-element lowering always does this since there's no "pinsr" for ymm/zmm:
// If the vector is wider than 128 bits, extract the 128-bit subvector, insert
// into that, and then insert the subvector back into the result.
...but we can sometimes do better for insert-into-constant-vector by using shuffle lowering.
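A hedged sketch of that alternative for, say, inserting a scalar into lane 5 of a constant <8 x i32> (names illustrative):
  SDValue Splat = DAG.getSplatBuildVector(VT, DL, Scalar);
  int Mask[] = {0, 1, 2, 3, 4, 13, 6, 7};   // lane 5 comes from the splat
  SDValue Res = DAG.getVectorShuffle(VT, DL, ConstVec, Splat, Mask);
  // Shuffle lowering can turn this into a broadcast + blend instead of an
  // extract/insert of a 128-bit subvector.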
Differential Revision: https://reviews.llvm.org/D54271
llvm-svn: 346433
Summary:
The conditional branch created to support -fsplit-stack for X86 is
left unbiased/unhinted, resulting in less than ideal block placement:
the __morestack call block is kept on the main hot path. Bias the
branch to ensure that the stack allocation block is treated as a
"cold" block during machine basic block placement.
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D54123
llvm-svn: 346336
Change the type in a couple of lists and sets that only store physical
registers from unsigned to MCPhysReg. The latter is only 16 bits and
saves us a bit of memory.
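For example (sketch):
  // MCPhysReg is a uint16_t, so this halves the per-element footprint
  // relative to storing unsigned:
  SmallVector<MCPhysReg, 16> SavedRegs;   // was SmallVector<unsigned, 16>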
llvm-svn: 346254
The main caller of this already has an MVT and several targets called getSimpleVT inside without checking isSimple. This makes the simpleness explicit.
llvm-svn: 346180
SimplifyDemandedBits can turn a sign_extend back into an any_extend and trigger an infinite loop. So instead legalize it the same way as a sign_extend, but preserve the opcode. Then just pattern match it the same as sign_extend during isel.
I don't have a reduced test case for such an infinite loop yet.
llvm-svn: 346170
v2i8/v2i16/v2i32 are promoted to v2i64. pmuludq takes a v2i64 input and produces a v2i64 output. Since we don't care about the upper bits of the type-legalized multiply, we can use pmuludq to produce the multiply result for the bits we do care about.
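A standalone demonstration of why the garbage upper bits don't matter (scalar stand-in for one pmuludq lane):
  #include <cstdint>
  #include <cstdio>
  int main() {
    // pmuludq multiplies the low 32 bits of each 64-bit lane; the low 32
    // bits of that product equal the i32 multiply no matter what the high
    // bits of the promoted lanes held.
    uint32_t A = 0xDEADBEEF, B = 0x12345678;
    uint64_t LaneA = (0xAAAAull << 32) | A;   // garbage in the high bits
    uint64_t LaneB = (0x5555ull << 32) | B;
    uint64_t Prod = (LaneA & 0xFFFFFFFF) * (LaneB & 0xFFFFFFFF); // one lane
    printf("%d\n", (uint32_t)Prod == A * B);  // prints 1
  }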
llvm-svn: 346115
Summary: This also enables some constant folding from KnownBits propagation. This helps some vXi64 cases in 32-bit mode where constant vectors appear as vXi32 constants plus a bitcast. This can prevent getNode from constant folding sra/shl/srl.
Reviewers: RKSimon, spatel
Reviewed By: spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D54069
llvm-svn: 346102
These methods were just wrappers around getNode with additional asserts (identical and repeated 3 times). But getNode already has a switch that can be used to hold these asserts that allows them to be shared for all 3 opcodes. This also enables checking on the places that create these nodes without using the wrappers.
The rest of the patch is just changing all callers to use getNode directly.
llvm-svn: 346087
Use MachineFrameInfo's OffsetAdjustment field to pass this information
from the target to CodeViewDebug.cpp. The X86 backend doesn't use it for
any other purpose.
This fixes PR38857 in the case where there is a non-aligned quantity of
CSRs and a non-aligned quantity of locals.
llvm-svn: 346062
The majority of the changes are because the rest of shuffle lowering/combining prefers to replace the undef input with the other operand. Using UNPCKL directly seemed to avoid this and just grabbed a randomish register for the undef which can create false dependencies.
llvm-svn: 346050
We already have custom lowering for the AVX case in LegalizeVectorOps. So it's better to keep the regular extend op around as long as possible.
I had to qualify one place in DAG combine that created illegal vector extending load operations. This change by itself had no effect on any tests, which is why it's included here.
I've made a few cleanups to the custom lowering. The sign extend code no longer creates an identity shuffle with undef elements. The zero extend code now emits a zero_extend_vector_inreg instead of an unpckl with a zero vector.
For the high half of the custom lowering of zero_extend/any_extend, we're now using an unpckh with a zero vector or undef. Previously we used a pshufd to move the upper 64-bits to the lower 64-bits and then used a zero_extend_vector_inreg. I think the zero vector should require fewer execution resources and be smaller code size.
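A hedged DAG-level sketch of the new high-half form for a v16i8 source (for any_extend the second operand can simply be undef):
  SDValue Zero = DAG.getConstant(0, DL, MVT::v16i8);
  SDValue Hi = DAG.getNode(X86ISD::UNPCKH, DL, MVT::v16i8, Src, Zero);
  // unpckhbw interleaves the high 8 bytes with zeros: src[8],0,src[9],0,...
  // which, read as v8i16, is exactly the zero-extension of those bytes.
  Hi = DAG.getBitcast(MVT::v8i16, Hi);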
Differential Revision: https://reviews.llvm.org/D54024
llvm-svn: 346043
This patch should not introduce any behavior changes. It consists mostly of
one of two changes:
1. Replacing fall through comments with the LLVM_FALLTHROUGH macro
2. Inserting 'break' before falling through into a case block consisting
of only 'break'.
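Both kinds of change look like this (illustrative names):
  switch (Opc) {
  case X86::FOO:
    doFoo();
    LLVM_FALLTHROUGH;   // change 1: replaces a "// fall through" comment
  case X86::BAR:
    doBar();
    break;              // change 2: added; the next case is only a 'break'
  default:
    break;
  }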
We were already using this warning with GCC, but its warning behaves
slightly differently. In this patch, the following differences are
relevant:
1. GCC recognizes comments that say "fall through" as annotations, clang
doesn't
2. GCC doesn't warn on "case N: foo(); default: break;", clang does
3. GCC doesn't warn when the case contains a switch, but falls through
the outer case.
I will enable the warning separately in a follow-up patch so that it can
be cleanly reverted if necessary.
Reviewers: alexfh, rsmith, lattner, rtrieu, EricWF, bollu
Differential Revision: https://reviews.llvm.org/D53950
llvm-svn: 345882
This patch adds support for expanding vector CTPOP instructions and removes the x86 'bitmath' lowering which replicates the same expansion.
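For reference, the shared expansion is the classic bitmath popcount, shown here per element as scalar i32 (a self-contained illustration, not the DAG code):
  #include <cstdint>
  #include <cstdio>
  static uint32_t ctpop32(uint32_t V) {
    V = V - ((V >> 1) & 0x55555555);                  // 2-bit partial sums
    V = (V & 0x33333333) + ((V >> 2) & 0x33333333);   // 4-bit partial sums
    V = (V + (V >> 4)) & 0x0F0F0F0F;                  // 8-bit partial sums
    return (V * 0x01010101) >> 24;                    // horizontal add
  }
  int main() { printf("%u\n", ctpop32(0xF00F)); }     // prints 8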
Differential Revision: https://reviews.llvm.org/D53258
llvm-svn: 345869
Reapplying an updated version of rL345395 (reverted in rL345451), now that the issues noticed in PR39483 have been fixed.
This patch allows resolveTargetShuffleInputs to remove UNDEF inputs from cases where we have more than 2 inputs.
llvm-svn: 345824
Before this patch, class PredicateExpander only knew how to expand simple
predicates that performed checks on instruction operands.
In particular, the new scheduling predicate syntax was not rich enough to
express checks like this one:
Foo(MI->getOperand(0).getImm()) == ExpectedVal;
Here, the immediate operand value at index zero is passed in input to function
Foo, and ExpectedVal is compared against the value returned by function Foo.
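With the new syntax, the expanded C++ might look like this (hedged sketch reusing the placeholder names above):
  // Generated predicate body; Foo and ExpectedVal are the placeholders
  // from the example.
  bool isFooPredicate(const MachineInstr &MI) {
    return Foo(MI.getOperand(0).getImm()) == ExpectedVal;
  }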
While this predicate pattern doesn't show up in any X86 model, it shows up in
other upstream targets. So, being able to support those predicates is
fundamental if we want to be able to modernize all the scheduling models
upstream.
With this patch, we allow users to specify if a register/immediate operand value
needs to be passed in input to a function as part of the predicate check. Now,
register/immediate operand checks all derive from base class CheckOperandBase.
This patch also changes where TIIPredicate definitions are expanded by the
instruction info emitter. Before, definitions were expanded in class
XXXGenInstrInfo (where XXX is a target name).
With the introduction of this new syntax, we may want to have TIIPredicates
expanded directly in XXXInstrInfo. That is because functions used by the new
operand predicates may only exist in the derived class (i.e. XXXInstrInfo).
This patch is a non functional change for the existing scheduling models.
In future, we will be able to use this richer syntax to better describe complex
scheduling predicates, and expose them to llvm-mca.
Differential Revision: https://reviews.llvm.org/D53880
llvm-svn: 345714
Support vectorization of interleave-groups that require an epilog under
optsize using masked wide loads
Under Opt for Size, the vectorizer does not vectorize interleave-groups that
have gaps at the end of the group (such as a loop that reads only the even
elements: a[2*i]) because that implies that we'll require a scalar epilogue
(which is not allowed under Opt for Size). This patch extends the support for
masked-interleave-groups (introduced by D53011 for conditional accesses) to
also cover the case of gaps in a group of loads; targets that enable the
masked-interleave-group feature don't have to invalidate interleave-groups of
loads with gaps; they could now use masked wide-loads and shuffles (if that's
what the cost model selects).
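A hedged IRBuilder sketch of the idea for a loop reading only a[2*i] with VF=4 (exact builder signatures vary across LLVM versions; Ptr and the variable names are illustrative):
  // Load 8 contiguous i32 elements but mask off the odd (gap) lanes, then
  // shuffle out the even lanes; no scalar epilogue is needed.
  Type *Int1Ty = Builder.getInt1Ty();
  SmallVector<Constant *, 8> MaskElts;
  for (unsigned i = 0; i < 8; ++i)
    MaskElts.push_back(ConstantInt::get(Int1Ty, i % 2 == 0));
  Value *GapMask = ConstantVector::get(MaskElts);
  Type *WideTy = VectorType::get(Builder.getInt32Ty(), 8);
  Value *PassThru = UndefValue::get(WideTy);
  Value *Wide = Builder.CreateMaskedLoad(Ptr, 4 /*Align*/, GapMask, PassThru);
  uint32_t Evens[] = {0, 2, 4, 6};
  Value *Group =
      Builder.CreateShuffleVector(Wide, UndefValue::get(WideTy), Evens);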
Reviewers: Ayal, hsaito, dcaballe, fhahn
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D53668
llvm-svn: 345705
The CONCAT_VECTORS case was using the original mask element count to determine how to adjust the broadcast index. But if we look through a bitcast, the original mask size doesn't tell us anything about the concat_vectors.
This patch switches to using the concat_vectors input element count directly instead.
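A hedged sketch of the fixed computation (illustrative variable names):
  // Use the element count of the concat_vectors *inputs*, which may differ
  // from the original mask size once a bitcast has been looked through.
  unsigned OpNumElts =
      V.getOperand(0).getValueType().getVectorNumElements();
  SDValue Src = V.getOperand(BroadcastIdx / OpNumElts); // pick the input
  BroadcastIdx %= OpNumElts;                            // index within it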
Differential Revision: https://reviews.llvm.org/D53823
llvm-svn: 345626
Summary:
The final pattern.
There are no test changes:
* We are looking for the pattern with one-use of its mask,
* If the mask is one-use, D48768 will unfold it into pattern d.
* Thus, the tests have extra-use on the mask.
* Thus, only the BMI2 BZHI can be tested, and it already worked.
* So there is no BMI1 test coverage, we just assume it works since it uses the same codepath.
Reviewers: craig.topper, RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D53575
llvm-svn: 345584
Summary: Previously, if we had a bitcast with a vector output type that needs promotion and a vector input type that needs widening, we would just do a stack store and load to handle the conversion.
We can do a little better if we can widen the bitcast to a legal vector type the same size as the widened input type. Then we can do the bitcast between this widened type and the widened input type. Afterwards, we can extract_subvector back to the original output type and any_extend that. Type legalization will then circle back and handle promotion of the extract_subvector, and the any_extend will just be removed.
This avoids going through the stack and allows us to remove a custom version of this legalization from X86.
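A worked example with hypothetical types (assuming v8i16 and v16i8 are legal, v2i16 widens to v8i16, and v4i8 promotes to v4i32):
  // bitcast v2i16 -> v4i8 (32 bits on both sides):
  //   1. widen the input:                 v2i16 -> v8i16   (128 bits)
  //   2. bitcast at the widened size:     v8i16 -> v16i8   (same 128 bits)
  //   3. extract the original bits:       extract_subvector v16i8 -> v4i8
  //   4. any_extend to the promoted type: v4i8 -> v4i32
  // Legalization then promotes the extract_subvector and drops the
  // any_extend; no stack temporary is involved.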
Reviewers: efriedma, RKSimon
Reviewed By: efriedma
Subscribers: javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D53229
llvm-svn: 345567
Use SelectionDAG::EVTToAPFloatSemantics. Make the LogicVT calculation in LowerFABSorFNEG similar to LowerFCOPYSIGN. Use APInt::getSignedMaxValue instead of ~APInt::getSignMask.
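For example (an equivalence, not new behavior):
  // Both forms produce the all-ones-except-sign-bit abs mask; the named
  // helper states the intent directly:
  APInt MaskVal = APInt::getSignedMaxValue(EltBits);
  //            == ~APInt::getSignMask(EltBits)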
llvm-svn: 345565