SLP vectorizer supports horizontal reductions for Add/FAdd binary
operations. Patch adds support for horizontal min/max reductions.
Function getReductionCost() is split to getArithmeticReductionCost() for
binary operation reductions and getMinMaxReductionCost() for min/max
reductions.
Patch fixes PR26956.
Differential revision: https://reviews.llvm.org/D27846
llvm-svn: 312791
cover the bitwise operators.
Nothing really exciting here, this just stamps out the rest of the core
operations that can RMW memory and set flags.
Still not implemented here: ADC, SBB. Those will require more
interesting logic to channel the flags *in*, and I'm not currently
planning to try to tackle that. It might be interesting for someone who
wants to improve our code generation for bignum implementations.
Differential Revision: https://reviews.llvm.org/D37141
llvm-svn: 312768
operands and used flags to support matching immediate operands.
This is a bit trickier than register operands, and we still want to fall
back on a register operands even for things that appear to be
"immediates" when they won't actually select into the operation's
immediate operand. This also requires us to handle things like selecting
`sub` vs. `add` to minimize the number of bits needed to represent the
immediate, and picking the shortest immediate encoding. In order to
that, we in turn need to scan to make sure that CF isn't used as it will
get inverted.
The end result seems very nice though, and we're now generating
optimal instruction sequences for these patterns IMO.
A follow-up patch will further expand this to other operations with RMW
memory operands. But handing `add` and `sub` are useful starting points
to flesh out the machinery and make sure interesting and complex cases
can be handled.
Thanks to Craig Topper who provided a few fixes and improvements to this
patch in addition to the review!
Differential Revision: https://reviews.llvm.org/D37139
llvm-svn: 312764
This patch expands the support of lowerInterleavedload to {8|16|32}x8i stride 3.
LLVM creates suboptimal shuffle code-gen for AVX2. In overall, this patch is a specific fix for the pattern (Strid=3 VF={8|16|32}) and we plan to include the store (deinterleved side).
The patch goal is to optimize the following sequence:
a0 b0 c0 a1 b1 c1 a2 b2
c2 a3 b3 c3 a4 b4 c4 a5
b5 c5 a6 b6 c6 a7 b7 c7
into
a0 a1 a2 a3 a4 a5 a6 a7
b0 b1 b2 b3 b4 b5 b6 b7
c0 c1 c2 c3 c4 c5 c6 c7
Reviewers
1. zvi
2. igor
3. guyblank
4. dorit
5. Ayal
llvm-svn: 312722
Summary:
Add patterns for
fptoui <16 x float> to <16 x i8>
fptoui <16 x float> to <16 x i16>
Reviewers: igorb, delena, craig.topper
Reviewed By: craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D37505
llvm-svn: 312704
Intrinsic handling is still creating these nodes with 32-bit elements as well. But at least this gets rid of 8 and 16.
Ideally, someday we'll convert the intrinsics to generic vector shuffles and remove the intrinsics.
llvm-svn: 312702
I don't think we ever generate these. If we did, I would expect we would also be able to generate v16f32 and v8f64, but we don't have those patterns.
llvm-svn: 312694
This patch moves some of similar non-instruction patterns from X86InstrSSE.td and X86InstrAVX512.td to a common file.
This is intended as a starting point. There are many other optimization patterns that exist in both files that we could move here.
Differential Revision: https://reviews.llvm.org/D37455
llvm-svn: 312649
As noted in PR34080, a lot of x87 instructions alter the FPSW status register (or leave it in an undefined state) but aren't tagged as such in the tablegen.
This patch tags the control word, stack, wait and math instructions as altering FPSW, which matches what the AMD APMs suggests happens.
Differential Revision: https://reviews.llvm.org/D36414
llvm-svn: 312629
performing a zext of a register.
On the PR there is discussion of how to more effectively handle this,
but this patch prevents us from miscompiling code.
Differential Revision: https://reviews.llvm.org/D37504
llvm-svn: 312620
This matches what we already do for AVX512. The peephole pass makes up for this in most if not all cases. But this makes isel behavior for these consistent with every other instruction.
llvm-svn: 312613
Summary:
Most instructions in AVX work “in-lane”, that is, each source element is applied only to other
elements of the same lane, thus a cross lane permutation is costly and needs more than one instrution.
AVX2 includes instructions to perform any-to-any permutation of words over a 256-bit register
and vectorized table lookup.
This should also Fix PR34369
Differential Revision: https://reviews.llvm.org/D37388
llvm-svn: 312608
We had already disabled the pattern for SSE4.1 and SSE4.2. But it got re-enabled for AVX and AVX512.
With SSE41 we rely on a separate (v4f32 (X86vzmovl VR128)) pattern to select blendps with a xorps to create zeroess. And a separate (v4f32 (scalar_to_vector FR32X)) to select a COPY_TO_REG_CLASS to move FR32 to VR128
The same thing can happen for AVX with vblendps and those separate patterns already exist.
For AVX512, (v4f32 (X86vzmov VR128)) will select a VMOVSS instruction instead of VBLENDPS due to their not being a EVEX VBLENDPS. This is what we were getting out of the larger pattern anyway. So the larger pattern is unneeded for AVX512 too.
For SSE1-SSSE3 we can rely on (v4f32 (X86vzmov VR128)) selecting a MOVSS similar to AVX512. Again this is what the larger pattern did too.
So the only real change here is that AVX1/2 now properly outputs a VBLENDPS during isel instead of a VMOVSS to match SSE41. Most tests didn't notice because the two address instruction pass knows how to turn VMOVSS into VBLENDPS to get an independent destination register.
llvm-svn: 312564
This reorders some patterns to get tablegen to detect them as duplicates. Tablegen only detects duplicates when creating variants for commutable operations. It does not detect duplicates between the patterns as written in the td file. So we need to ensure all the FMA patterns in the td file are unique.
This also uses null_frag to remove some other unneeded patterns.
llvm-svn: 312470
This uses the capability introduced in r312464 to make SDNode patterns commutable on the first two operands.
This allows us to remove some of the extra FMA patterns that have to put loads and mask operands in different places to cover all cases. This even includes patterns that were missing to support match a load in the first operand with FMA4. Non-broadcast loads with masking for AVX512.
I believe this is causing us to generate some duplicate patterns because tablegen's isomorphism checks don't catch isomorphism between the patterns as written in the td. It only detects isomorphism in the commuted variants it tries to create. The the unmasked 231 and 132 memory forms are isomorphic as written in the td file so we end up keeping both. I think we precommute the 132 pattern to fix this.
We also need a follow up patch to go back to the legacy FMA3 instructions and add patterns to the 231 and 132 forms which we currently don't have.
llvm-svn: 312469
Summary:
This is a re-roll of D36615 which uses PLT relocations in the back-end
to the call to __xray_CustomEvent() when building in -fPIC and
-fxray-instrument mode.
Reviewers: pcc, djasper, bkramer
Subscribers: sdardis, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D37373
llvm-svn: 312466
Ideally we'd be able to emit the SUBREG_TO_REG without the explicit register->register move, but we'd need to be sure the producing operation would select something that guaranteed the upper bits were already zeroed.
llvm-svn: 312450
In a future patch, I plan to teach isel to use a small vector move with implicit zeroing of the upper elements when it sees the (insert_subvector zero, X, 0) pattern.
llvm-svn: 312448
Summary:
ZExt and SExt from i8 to i16 aren't implemented in the autogenerated fast isel table because normal isel does a zext/sext to 32-bits and a subreg extract to avoid a partial register write or false dependency on the upper bits of the destination. This means without handling in fast isel we end up triggering a fast isel abort.
We had no custom sign extend handling at all so while I was there I went ahead and implemented sext i1->i8/i16/i32/i64 which was also missing. This generates an i1->i8 sign extend using a mask with 1, then an 8-bit negate, then continues with a sext from i8. A better sequence would be a wider and/negate, but would require more custom code.
Fast isel tests are a mess and I couldn't find a good home for the tests so I created a new one.
The test pr34381.ll had to have fast-isel removed because it was relying on a fast isel abort to hit the bug. The test case still seems valid with fast-isel disabled though some of the instructions changed.
Reviewers: spatel, zvi, igorb, guyblank, RKSimon
Reviewed By: guyblank
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D37320
llvm-svn: 312422
This is limited to a set of patterns based on the example in PR34111:
https://bugs.llvm.org/show_bug.cgi?id=34111
...but as I was investigating this, I see that horizontal patterns can go wrong in many,
many other ways that would not be handled by this patch. Each data type may even go
different in the DAG after starting with the same basic IR pattern, so even proper IR
canonicalization won't fix it all.
Differential Revision: https://reviews.llvm.org/D37357
llvm-svn: 312379
Previously we generated a register only pattern for each of the 3 instruction forms, but they are all identical as far as isel is concerned. So drop the others and just keep the 213 version.
This removes 2968 bytes from the isel table.
llvm-svn: 312313
Prior to this patch we had a DAG combine that tried to bypass an X86ISD::ADD with -1 being added to the carry flag of some previous operation. We would then pass the carry flag directly to user.
But this is only safe if the user is looking for the carry flag and not the zero flag.
So we need to only do this combine in a context where we know what flag the consumer is using.
Fixes PR34381.
Differential Revision: https://reviews.llvm.org/D37317
llvm-svn: 312285
Summary:
This patch enables the following:
1) Regex based Instruction itineraries for integer instructions.
2) The instructions are grouped as per the nature of the instructions
(move, arithmetic, logic, Misc, Control Transfer).
3) FP instructions and their itineraries are added which includes values
for SSE4A, BMI, BMI2 and SHA instructions.
Patch by Ganesh Gopalasubramanian
Reviewers: RKSimon, craig.topper
Subscribers: vprasad, shivaram, ddibyend, andreadb, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D36617
llvm-svn: 312237
Selecting 32-bit element logical ops without a select or broadcast requires matching a bitconvert on the inputs to the and. But that's a weird thing to rely on. It's entirely possible that one of the inputs doesn't have a bitcast and one does.
Since there's no functional difference, just remove the extra patterns and save some isel table size.
Differential Revision: https://reviews.llvm.org/D36854
llvm-svn: 312138