Commit Graph

56 Commits

Author SHA1 Message Date
Sanjay Patel 8cefc37be5 [DAGCombine] visitEXTRACT_SUBVECTOR - 'little to big' extract_subvector(bitcast()) support
This moves the X86 specific transform from rL364407
into DAGCombiner to generically handle 'little to big' cases
(for example: extract_subvector(v2i64 bitcast(v16i8))). This
allows us to remove both the x86 implementation and the aarch64
bitcast(extract_subvector(bitcast())) combine.

Earlier patches that dealt with regressions initially exposed
by this patch:
rG5e5e99c041e4
rG0b38af89e2c0

Patch by: @RKSimon (Simon Pilgrim)

Differential Revision: https://reviews.llvm.org/D63815
2019-12-23 10:11:45 -05:00
Craig Topper f65493a83e [X86] Teach X86MCInstLower to swap operands of commutable instructions to enable 2-byte VEX encoding.
Summary:
For commutable instructions, the two source operands are encoded in the
VEX.VVVV field and in the r/m field of the ModRM byte plus the VEX.B
field.

The VEX.B field is missing from the 2-byte VEX encoding. If the
VEX.VVVV source is 0-7 and the other register is 8-15 we can
swap them to avoid needing the VEX.B field. This works as long as
the VEX.W, VEX.mmmmm, and VEX.X fields are also not needed.

Fixes PR36706.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D68550
2019-11-04 22:07:46 -08:00
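
A minimal sketch (not part of the commit) of the swap test described above, with hypothetical helper names; the real logic lives in X86MCInstLower. The 2-byte VEX prefix has no VEX.B bit, so the r/m operand must be register 0-7, and the swap only pays off when VEX.W, VEX.mmmmm, and VEX.X are not otherwise needed.

#include <stdbool.h>

/* Hypothetical encoder-side helpers; register numbers 0-15 stand for xmm0-xmm15. */
static bool rm_needs_vex_b(unsigned rm_reg) { return rm_reg >= 8; } /* r/m regs 8-15 need VEX.B */
static bool fits_2byte_rm(unsigned reg)     { return reg < 8; }     /* 2-byte VEX r/m can only encode 0-7 */

/* Swap the commutable sources when the current VEX.VVVV operand (0-7) can
   move into r/m and the current r/m operand (8-15) can move into VEX.VVVV,
   removing the need for VEX.B and enabling the 2-byte VEX form. */
static bool should_swap_for_2byte_vex(unsigned vvvv_reg, unsigned rm_reg) {
  return fits_2byte_rm(vvvv_reg) && rm_needs_vex_b(rm_reg);
}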
Craig Topper 18e8d02e8c [X86] Pass v32i16/v64i8 in zmm registers on KNL target.
gcc and icc pass these types in zmm registers.

This patch implements a quick hack to override the register
type before calling convention handling to one that is legal.
Longer term we might want to do something similar to what we do for
256-bit integer vectors on AVX1, where we just split all the operations.

Fixes PR42957

Differential Revision: https://reviews.llvm.org/D66708

llvm-svn: 370495
2019-08-30 17:35:08 +00:00
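
For illustration only (not from the commit), a v64i8 value written with the GCC/Clang vector extension; compiled for a KNL target (AVX-512F without BWI), such an argument is now passed and returned in zmm registers, matching gcc and icc:

typedef char v64i8 __attribute__((vector_size(64)));

/* v64i8 is not a legal type before AVX512BW, but the calling convention
   still passes a and b in zmm0/zmm1 and returns the sum in zmm0. */
v64i8 add_bytes(v64i8 a, v64i8 b) {
  return a + b;
}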
Amaury Sechet 8365e42010 [DAGCombiner] (insert_vector_elt (vector_shuffle X, Y), (extract_vector_elt X, N), IdxC) -> (vector_shuffle X, Y)
Summary: This is beneficial when the shuffle is only used once; the pattern ends up being generated in a few places when some node is combined into a shuffle.

Reviewers: craig.topper, efriedma, RKSimon, lebedev.ri

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D66718

llvm-svn: 370326
2019-08-29 10:35:51 +00:00
Craig Topper 846429de74 [DAGCombiner][X86] Teach SimplifyVBinOp to fold VBinOp (concat X, undef/constant), (concat Y, undef/constant) -> concat (VBinOp X, Y), VecC
This improves the combine I included in D66504 to handle constants in the upper operands of the concat. If we can constant fold them away we can pull the concat after the bin op. This helps with chains of madd reductions on X86 from loop unrolling. The loop madd reduction pattern creates pmaddwd with half the width of the add that follows it using zeroes to fill the upper bits. If we have two of these added together we can pull the zeroes through the accumulating add and then shrink it.

Differential Revision: https://reviews.llvm.org/D66680

llvm-svn: 369937
2019-08-26 17:59:11 +00:00
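
As context for the loop madd reduction pattern mentioned above, here is the kind of C source (my example, not taken from the commit) that the vectorizer plus combineLoopMAddPattern turn into pmaddwd feeding an accumulating add; unrolling such a loop produces the chained pattern this combine targets:

#include <stdint.h>

/* A 16-bit dot product: widening multiplies feeding an i32 add reduction.
   Vectorized, each pair of adjacent products maps onto one pmaddwd lane,
   so the loop body becomes pmaddwd followed by an accumulating paddd. */
int32_t dot16(const int16_t *a, const int16_t *b, int n) {
  int32_t sum = 0;
  for (int i = 0; i < n; ++i)
    sum += (int32_t)a[i] * (int32_t)b[i];
  return sum;
}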
Craig Topper 85a968e9d5 [X86] Add a further unrolled madd reduction test case that shows several deficiencies.
The AVX2 check lines show two issues. First, an ADD became an OR
because we knew the inputs were disjoint, but one input was really
zero, so we should have just removed the ADD/OR altogether.

Relatedly, we use 128-bit VPMADDWD instructions followed by
256-bit VPADDD operations. We should be able to narrow these
VPADDDs.

llvm-svn: 369736
2019-08-23 07:38:25 +00:00
Craig Topper 8b5f2ab2a4 Recommit r367901 "[X86] Enable -x86-experimental-vector-widening-legalization by default."
The assert that caused this to be reverted should be fixed now.

Original commit message:

This patch changes our default legalization behavior for 16-, 32-, and
64-bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the element widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.

This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.

Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.

llvm-svn: 368183
2019-08-07 16:24:26 +00:00
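
To make the widening concrete, a small illustration of my own using the GCC/Clang vector extension: a 64-bit v8i8 value is now legalized by widening to v16i8 with undef upper lanes rather than by promotion to v8i16.

typedef unsigned char v8i8 __attribute__((vector_size(8)));

/* Under the old default this add was promoted to a v8i16 operation;
   with widening it becomes a v16i8 add whose upper 8 lanes are undef. */
v8i8 add_v8i8(v8i8 a, v8i8 b) {
  return a + b;
}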
Mitch Phillips bd0d97e1c4 Revert "[X86] Enable -x86-experimental-vector-widening-legalization by default."
This reverts commit 3de33245d2.

This commit broke the MSan buildbots. See
https://reviews.llvm.org/rL367901 for more information.

llvm-svn: 368107
2019-08-06 23:00:43 +00:00
Craig Topper 3de33245d2 [X86] Enable -x86-experimental-vector-widening-legalization by default.
This patch changes our default legalization behavior for 16-, 32-, and
64-bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the element widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.

This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.

Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.

llvm-svn: 367901
2019-08-05 18:25:36 +00:00
Craig Topper eb1beabad9 [X86] Don't use PMADDWD for vector add reductions of multiplies if the mul inputs have an additional user.
The pmaddwd inserts a truncate; if that truncate would end up
creating additional instructions instead of making a zext
narrower, then we shouldn't do it.

I've restricted this to only sse4.1 targets since on prior
targets the zext will be done in stages. So the truncate will
probably not create additional instructions. Might need some
more investigation of mul shrinking and the other pmaddwd
transform to be sure this is the right decision.

There might be a slight regression on AVX1 targets due to add
splitting. Hard to say for sure. Maybe we need to look into
using the vector reduction flag to use 2 narrow loads and a
blend instead of extracting and inserting.

llvm-svn: 367198
2019-07-29 01:36:58 +00:00
Craig Topper ac9d0f4150 [X86] Add test cases to show missing one use check in combineLoopMAddPattern.
llvm-svn: 367197
2019-07-29 01:36:54 +00:00
Craig Topper 894916cac9 [X86] In combineLoopMAddPattern and combineLoopSADPattern, preserve the vector reduction flag on the final add. Handle unrolled loops by letting DAG combine revisit.
This reverts r340478 and r340631 and replaces them with a simpler
method of just letting DAG combine revisit the nodes to handle
the other operand.

llvm-svn: 367195
2019-07-28 18:45:42 +00:00
Sanjay Patel a09e686821 [DAGCombiner] try to move bitcast after extract_subvector
I noticed that we were failing to narrow an x86 ymm math op in a case similar
to the 'madd' test diff. That is because a bitcast is sitting between the math
and the extract subvector and thwarting our pattern matching for narrowing:

       t56: v8i32 = add t59, t58
      t68: v4i64 = bitcast t56
    t73: v2i64 = extract_subvector t68, Constant:i64<2>
  t96: v4i32 = bitcast t73

There are a few wins and neutral diffs in the other tests.

Differential Revision: https://reviews.llvm.org/D61806

llvm-svn: 360541
2019-05-12 14:43:20 +00:00
Sanjay Patel a61d586f74 [DAGCombiner] fold extract_subvector of extract_subvector
This is the sibling fold for insert-of-insert that was added with D56604.

Now that we have x86 shuffle narrowing (D57156), this change shows improvements for 
lots of AVX512 reduction code (not sure that we would ever expect extract-of-extract otherwise).

There's a small regression in some of the partial-permute tests (extracting followed by splat).
That is tracked by PR40500:
https://bugs.llvm.org/show_bug.cgi?id=40500

Differential Revision: https://reviews.llvm.org/D57336

llvm-svn: 352528
2019-01-29 19:13:39 +00:00
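
An extract-of-extract written with Clang's __builtin_shufflevector, purely to illustrate the DAG shape the fold targets (my own example; the names are hypothetical): the nested extracts collapse into a single extract_subvector from the original 512-bit value.

typedef int v16i32 __attribute__((vector_size(64)));
typedef int v8i32  __attribute__((vector_size(32)));
typedef int v4i32  __attribute__((vector_size(16)));

/* Elements 12-15 of x, taken via two nested subvector extracts; the new
   fold turns extract_subvector(extract_subvector(x, 8), 4) into
   extract_subvector(x, 12). */
v4i32 extract_of_extract(v16i32 x) {
  v8i32 hi = __builtin_shufflevector(x, x, 8, 9, 10, 11, 12, 13, 14, 15);
  return __builtin_shufflevector(hi, hi, 4, 5, 6, 7);
}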
Sanjay Patel 21aa6ddc14 [x86] narrow a shuffle that doesn't use or set any high elements
This isn't the final fix for our reduction/horizontal codegen, but it takes care 
of a lot of the problems. After we narrow the shuffle, existing combines for 
insert/extract and binops kick in, and we end up with cheaper 128-bit ops.

The avg and mul reduction tests show an existing shuffle lowering hole for 
AVX2/AVX512. I think in its most minimal form this is:
https://bugs.llvm.org/show_bug.cgi?id=40434
...but we might need multiple fixes to get it right.

Differential Revision: https://reviews.llvm.org/D57156

llvm-svn: 352209
2019-01-25 15:37:42 +00:00
Craig Topper a32e353afa [X86] Don't mark SEXTLOAD from v4i8/v4i16/v8i8 as Custom on pre-sse4.1.
This seems to be getting in the way more than it's helping. This does mean we stop scalarizing some cases, but I'm not convinced the scalarization was really better.

Some of the changes to vsel-cmp-load.ll are a regression but D56156 should fix it.

llvm-svn: 350159
2018-12-30 03:05:07 +00:00
Dinar Temirbulatov 8c8724dd0d [CodeGen] Enhance machine PHIs optimization
Summary:
Make the machine PHIs optimization work for a single-value register taken from
several different copies. This is the first step to fix PR38917. This change
allows us to get rid of redundant PHIs (see the opt_phis2.mir test) to make
the subsequent optimizations (like CSE) possible and simpler.

For instance, before this patch, code like this:

%b = COPY %z
...
%a = PHI %bb1, %a; %bb2, %b

could be optimized to:

%a = %b

but code like this:

%c = COPY %z
...
%b = COPY %z
...
%a = PHI %bb1, %a; %bb2, %b; %bb3, %c

would remain unchanged. With this patch the latter case will be optimized to:

%a = %z

Committed on behalf of: Anton Afanasyev anton.a.afanasyev@gmail.com

Reviewers: RKSimon, MatzeB

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D54839

llvm-svn: 349271
2018-12-15 14:37:01 +00:00
Craig Topper 17fa42a69b [X86] Preserve undef information when creating a punpckl/hbw from a v16i8 where all the even or odd elements are undef.
Previously if V2 was unused we ended up using V1 for both inputs as part of the code that follows the new code. By using lowerVectorShuffleWithUNPCK we keep the undef nature of V2 in the output.

As near as I can tell this makes v16i8 behavior consistent with every other VT now.

This does mean that we give the register allocator freedom to fill in random registers now and create false dependencies. But like I said we're already doing that for other types.

llvm-svn: 347296
2018-11-20 09:04:01 +00:00
Sanjay Patel 0a515595a7 [x86] allow vector load narrowing with multi-use values
This is a long-awaited follow-up suggested in D33578. Since then, we've picked up even more
opportunities for vector narrowing from changes like D53784, so there are a lot of test diffs.
Apart from 2-3 strange cases, these are all wins.

I've structured this to be no-functional-change-intended for any target except for x86
because I couldn't tell if AArch64, ARM, and AMDGPU would improve or not. All of those
targets have existing regression tests (4, 4, 10 files respectively) that would be
affected. Also, Hexagon overrides the shouldReduceLoadWidth() hook, but doesn't show
any regression test diffs. The trade-off is deciding if an extra vector load is better
than a single wide load + extract_subvector.

For x86, this is almost always better (on paper at least) because we often can fold
loads into subsequent ops and not increase the official instruction count. There's also
some unknown -- but potentially large -- benefit from using narrower vector ops if wide
ops are implemented with multiple uops and/or frequency throttling is avoided.

Differential Revision: https://reviews.llvm.org/D54073

llvm-svn: 346595
2018-11-10 20:05:31 +00:00
Craig Topper f7108aef14 [X86] In LowerEXTEND_VECTOR_INREG, emit a vector shuffle instead of directly using X86ISD::UNPCKL
The majority of the changes are because the rest of shuffle lowering/combining prefers to replace the undef input with the other operand. Using UNPCKL directly seemed to avoid this and just grabbed a randomish register for the undef which can create false dependencies.

llvm-svn: 346050
2018-11-02 22:48:02 +00:00
Craig Topper 60c202a494 [X86] Don't emit *_extend_vector_inreg nodes when both the input and output types are legal with AVX1
We already have custom lowering for the AVX case in LegalizeVectorOps. So it's better to keep the regular extend op around as long as possible.

I had to qualify one place in DAG combine that created illegal vector extending load operations. This change by itself had no effect on any tests, which is why it's included here.

I've made a few cleanups to the custom lowering. The sign extend code no longer creates an identity shuffle with undef elements. The zero extend code now emits a zero_extend_vector_inreg instead of an unpckl with a zero vector.

For the high half of the custom lowering of zero_extend/any_extend, we're now using an unpckh with a zero vector or undef. Previously we used a pshufd to move the upper 64 bits to the lower 64 bits and then used a zero_extend_vector_inreg. I think the zero vector should require fewer execution resources and give smaller code size.

Differential Revision: https://reviews.llvm.org/D54024

llvm-svn: 346043
2018-11-02 21:09:49 +00:00
Craig Topper 6c3f1692c8 Revert r345165 "[X86] Bring back the MOV64r0 pseudo instruction"
Google is reporting regressions on some benchmarks.

llvm-svn: 345785
2018-10-31 21:53:24 +00:00
Sanjay Patel 8b207defea [DAGCombiner] narrow vector binops when extraction is cheap
Narrowing vector binops came up in the demanded bits discussion in D52912.

I don't think we're going to be able to do this transform in IR as a canonicalization 
because of the risk of creating unsupported widths for vector ops, but we already have 
a DAG TLI hook to allow what I was hoping for: isExtractSubvectorCheap(). This is 
currently enabled for x86, ARM, and AArch64 (although only x86 has existing regression 
test diffs).

This is artificially limited to not look through bitcasts because there are so many 
test diffs already, but that's marked with a TODO and is a small follow-up.

Differential Revision: https://reviews.llvm.org/D53784

llvm-svn: 345602
2018-10-30 14:14:34 +00:00
Craig Topper 2417273255 [X86] Bring back the MOV64r0 pseudo instruction
This patch brings back the MOV64r0 pseudo instruction for zeroing a 64-bit register. This replaces the SUBREG_TO_REG MOV32r0 sequence we use today. Post register allocation we will rewrite the MOV64r0 to a 32-bit xor with an implicit def of the 64-bit register similar to what we do for the various XMM/YMM/ZMM zeroing pseudos.

My main motivation is to enable the spill optimization in foldMemoryOperandImpl, as we were seeing some code that repeatedly did "xor eax, eax; store eax;" to spill several registers with a new xor for each store. With this optimization enabled we get a store of a 0 immediate instead of an xor, though I admit the ideal solution would be one xor where there are multiple spills. I don't believe we have a test case that shows this optimization in here. I'll see if I can try to reduce one from the code we're looking at.

There are definitely some other machine CSE (and maybe other passes) behavior changes exposed by this patch. So it seems like there might be some other deficiencies in SUBREG_TO_REG handling.

Differential Revision: https://reviews.llvm.org/D52757

llvm-svn: 345165
2018-10-24 17:32:09 +00:00
Sanjay Patel e28c8ecd72 [x86] add and use fast horizontal vector math subtarget feature
This is the planned follow-up to D52997. Here we are reducing horizontal vector math codegen 
by default. AMD Jaguar (btver2) should have no difference with this patch because it has 
fast-hops. (If we want to set that bit for other CPUs, let me know.)

The code changes are small, but there are many test diffs. For files that are specifically 
testing for hops, I added RUNs to distinguish fast/slow, so we can see the consequences 
side-by-side. For files that are primarily concerned with codegen other than hops, I just 
updated the CHECK lines to reflect the new default codegen.

To recap the recent horizontal op story:

1. Before rL343727, we were producing hops for all subtargets for a variety of patterns. 
   Hops were likely not optimal for all targets though.
2. The IR improvement in r343727 exposed a hole in the backend hop pattern matching, so 
   we reduced hop codegen for all subtargets. That was bad for Jaguar (PR39195).
3. We restored the hop codegen for all targets with rL344141. Good for Jaguar, but 
   probably bad for other CPUs.
4. This patch allows us to distinguish when we want to produce hops, so everyone can be 
   happy. I'm not sure if we have the best predicate here, but the intent is to undo the 
   extra hop-iness that was enabled by r344141.

Differential Revision: https://reviews.llvm.org/D53095

llvm-svn: 344361
2018-10-12 16:41:02 +00:00
Simon Pilgrim 2d0f20cc04 [X86] Handle COPYs of physregs better (regalloc hints)
Enable enableMultipleCopyHints() on X86.

Original Patch by @jonpa:

While enabling the mischeduler for SystemZ, it was discovered that for some reason a test needed one extra seemingly needless COPY (test/CodeGen/SystemZ/call-03.ll). The handling for that resulted in this patch, which improves register coalescing by providing not just one copy hint, but a sorted list of copy hints. On SystemZ, this gives ~12500 fewer register moves on SPEC, as well as marginally less spilling.

Instead of improving just the SystemZ backend, the improvement has been implemented in common code (calculateSpillWeightAndHint()). This gives a lot of test failures, but since this should be a general improvement I hope that the involved targets will help and review the test updates.

Differential Revision: https://reviews.llvm.org/D38128

llvm-svn: 342578
2018-09-19 18:59:08 +00:00
Craig Topper 4058e29e7d [X86] Teach combineLoopMAddPattern to handle cases where there is no loop and the add has two multiply inputs
Differential Revision: https://reviews.llvm.org/D50868

llvm-svn: 340631
2018-08-24 18:05:04 +00:00
Craig Topper 3c78622d64 [X86] Add test case for D50868. NFC
llvm-svn: 340630
2018-08-24 18:05:02 +00:00
Simon Pilgrim 01ae462fef [X86][SSE] Combine (some) target shuffles with multiple uses
As discussed on D41794, we have many cases where we fail to combine shuffles as the input operands have other uses.

This patch permits these shuffles to be combined as long as they don't introduce additional variable shuffle masks, which should reduce instruction dependencies and allow the total number of shuffles to still drop without increasing the constant pool.

However, this may mean that some memory folds may no longer occur, and on pre-AVX require the occasional extra register move.

This also exposes some poor PMULDQ/PMULUDQ codegen which was doing unnecessary upper/lower calculations which will in fact fold to zero/undef - the fix will be added in a followup commit.

Differential Revision: https://reviews.llvm.org/D50328

llvm-svn: 339335
2018-08-09 12:30:02 +00:00
Craig Topper e364baa88b [X86] Add matching for another pattern of PMADDWD.
Summary:
This is the pattern you get from the loop vectorizer for something like this

int16_t A[1024];
int16_t B[1024];
int32_t C[512];

void pmaddwd() {
  for (int i = 0; i != 512; ++i)
    C[i] = (A[2*i]*B[2*i]) + (A[2*i+1]*B[2*i+1]);
}

In this case we will have (add (mul (build_vector), (build_vector)), (mul (build_vector), (build_vector))). This is different than the pattern we currently match which has the build_vectors between an add and a single multiply. I'm not sure what C code would get you that pattern.

Reviewers: RKSimon, spatel, zvi

Reviewed By: zvi

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D49636

llvm-svn: 338097
2018-07-27 04:29:10 +00:00
Craig Topper b2a626b52e [X86] Remove the max vector width restriction from combineLoopMAddPattern and rely on splitOpsAndApply to handle splitting.
This seems to be a net improvement. There's still an issue under avx512f where we have a 512-bit vpaddd but not vpmaddwd, so we end up doing two 256-bit vpmaddwds and inserting the results before a 512-bit vpaddd. It might be better to do two 512-bit paddds with zeros in the upper half. Same number of instructions, but breaks a dependency.

llvm-svn: 337656
2018-07-22 19:44:35 +00:00
Craig Topper d8f80e90ce [X86] Add more MADD recurrence test cases with larger and narrower vector widths.
llvm-svn: 337650
2018-07-22 05:16:47 +00:00
Simon Pilgrim c7132031a2 [X86][SSE] Use SplitOpsAndApply to improve HADD/HSUB lowering
Improve AVX1 256-bit vector HADD/HSUB matching by using SplitOpsAndApply to split into 128-bit instructions.

llvm-svn: 337568
2018-07-20 16:20:45 +00:00
Vedant Kumar cc7b2a55c2 [DAGCombiner] Change the SDLoc on split extloads (2/N)
In DAGCombiner, we try to simplify this pattern:

  ([s|z]ext (load ...))

Conceptually, a new extload which is created while splitting the load
should have the same debug location as the load.

Making this change affects the IROrder of the new load, causing some
test case churn.

In practice, the new location is never different from the location of
the [s|z]ext, at least not during check-llvm or a stage2 build.

Part of: llvm.org/PR37262

Differential Revision: https://reviews.llvm.org/D46156

llvm-svn: 331301
2018-05-01 19:29:15 +00:00
Simon Pilgrim ba43ec8702 [X86][AVX] combineLoopMAddPattern - support 256-bit cases on AVX1 via SplitBinaryOpsAndApply
llvm-svn: 326189
2018-02-27 12:20:37 +00:00
Craig Topper f340a0e8c0 [X86] Add avx1 command line to madd.ll to show splitting and concatenating 256-bit operations.
llvm-svn: 326068
2018-02-26 07:48:17 +00:00
Craig Topper f98baa7065 [X86] Add more madd reduction tests with wider vectors.
We had no test case exercising 512-bit vpmaddwd usage.

llvm-svn: 323840
2018-01-31 00:30:32 +00:00
Zvi Rackover 652f9a1896 X86: Add pattern matching for PMADDWD
In addition to the existing match as part of a loop-reduction, add a
straightforward pattern match for DAG-contained patterns.

Reviewers: RKSimon, craig.topper

Subscribers: llvm-commits

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D41811

llvm-svn: 322446
2018-01-13 17:42:19 +00:00
Zvi Rackover 63f1f322c9 X86 Tests: add more pmaddwd cases. NFC
Improve coverage of D41811

llvm-svn: 322434
2018-01-13 08:21:29 +00:00
Zvi Rackover 93b8bd4955 X86 Tests: Add Tests for PMADDWD selection. NFC.
Support for ISel to be added.

llvm-svn: 321970
2018-01-07 20:21:10 +00:00
Francis Visoiu Mistrih 25528d6de7 [CodeGen] Unify MBB reference format in both MIR and debug output
As part of the unification of the debug format and the MIR format, print
MBB references as '%bb.5'.

The MIR printer prints the IR name of a MBB only for block definitions.

* find . \( -name "*.mir" -o -name "*.cpp" -o -name "*.h" -o -name "*.ll" \) -type f -print0 | xargs -0 sed -i '' -E 's/BB#" << ([a-zA-Z0-9_]+)->getNumber\(\)/" << printMBBReference(*\1)/g'
* find . \( -name "*.mir" -o -name "*.cpp" -o -name "*.h" -o -name "*.ll" \) -type f -print0 | xargs -0 sed -i '' -E 's/BB#" << ([a-zA-Z0-9_]+)\.getNumber\(\)/" << printMBBReference(\1)/g'
* find . \( -name "*.txt" -o -name "*.s" -o -name "*.mir" -o -name "*.cpp" -o -name "*.h" -o -name "*.ll" \) -type f -print0 | xargs -0 sed -i '' -E 's/BB#([0-9]+)/%bb.\1/g'
* grep -nr 'BB#' and fix

Differential Revision: https://reviews.llvm.org/D40422

llvm-svn: 319665
2017-12-04 17:18:51 +00:00
Craig Topper 143797eb89 [X86] Add isel pattern infrastructure to begin recognizing when we're inserting 0s into the upper portions of a vector register and the producing instruction has already produced the zeros.
Currently if we're inserting 0s into the upper elements of a vector register we insert an explicit move of the smaller register to implicitly zero the upper bits. But if we can prove that they are already zero we can skip that. This is based on a similar idea of what we do to avoid emitting explicit zero extends for GR32->GR64.

Unfortunately, this is harder for vector registers because there are several opcodes that don't have VEX equivalent instructions, but can write to XMM registers. Among these are SHA instructions and a MMX->XMM move. Bitcasts can also get in the way.

So for now I'm starting with explicitly allowing only VPMADDWD, because we emit zeros in combineLoopMAddPattern and that is currently placing an extra instruction into the reduction loop.

I'd like to allow PSADBW as well after D37453, but that's currently blocked by a bitcast. We either need to peek through bitcasts or canonicalize insert_subvectors with zeros to remove bitcasts on the value being inserted.

Longer term we should probably have a cleanup pass that removes superfluous zeroing moves even when the producer is in another basic block which is something these isel tricks can't do. See PR32544.

Differential Revision: https://reviews.llvm.org/D37653

llvm-svn: 313365
2017-09-15 17:09:00 +00:00
Craig Topper 8ee36ffb54 [X86] Add patterns to turn an insert into lower subvector of a zero vector into a move instruction which will implicitly zero the upper elements.
Ideally we'd be able to emit the SUBREG_TO_REG without the explicit register->register move, but we'd need to be sure the producing operation would select something that guaranteed the upper bits were already zeroed.

llvm-svn: 312450
2017-09-03 17:52:25 +00:00
Craig Topper bb6506d251 [X86] Canonicalize (concat_vectors X, zero) -> (insert_subvector zero, X, 0).
In a future patch, I plan to teach isel to use a small vector move with implicit zeroing of the upper elements when it sees the (insert_subvector zero, X, 0) pattern.

llvm-svn: 312448
2017-09-03 17:52:19 +00:00
Craig Topper 0f30fe9634 [x86] Enable some support for lowerVectorShuffleWithUndefHalf with AVX-512
Summary:
This teaches 512-bit shuffles to detect unused halfs in order to reduce shuffle size.

We may need to refine the 512-bit exit point. I couldn't remember if we had good cross lane shuffles for 8/16 bit with AVX-512 or not.

I believe this is a step towards being able to handle D36454 without a special case.

From here we need to improve our ability to combine extract_subvector with insert_subvector and other extract_subvectors. And we need to support narrowing binary operations where we don't demand all elements. This may be improvements to DAGCombiner::narrowExtractedVectorBinOp(by recognizing an insert_subvector in addition to concat) or we may need a target specific combiner.

Reviewers: RKSimon, zvi, delena, jbhateja

Reviewed By: RKSimon, jbhateja

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D36601

llvm-svn: 310724
2017-08-11 16:20:05 +00:00
Evgeny Stupachenko c675290680 Reapply fix PR23384 (part 3 of 3) r304824 (was reverted in r305720).
The root cause of reverting was fixed - PR33514.

Summary:
The patch makes instruction count the highest priority for
 LSR solution for X86 (previously registers had highest priority).

Reviewers: qcolombet

Differential Revision: http://reviews.llvm.org/D30562

From: Evgeny Stupachenko <evstupac@gmail.com>
                         <evgeny.v.stupachenko@intel.com>
llvm-svn: 310289
2017-08-07 19:56:34 +00:00
Dinar Temirbulatov a0beedef1c [X86] SET0 to use XMM registers where possible PR26018 PR32862
Differential Revision: https://reviews.llvm.org/D35965

llvm-svn: 309926
2017-08-03 08:50:18 +00:00
Dinar Temirbulatov aead31a36f [X86] SET0 to use XMM registers where possible PR26018 PR32862
Differential Revision: https://reviews.llvm.org/D35839

llvm-svn: 309298
2017-07-27 17:47:01 +00:00
Hans Wennborg ca69fc1cb7 Revert r304824 "Fix PR23384 (part 3 of 3)"
This seems to be interacting badly with ASan somehow, causing false reports of
heap-buffer overflows: PR33514.

> Summary:
> The patch makes instruction count the highest priority for
> LSR solution for X86 (previously registers had highest priority).
>
> Reviewers: qcolombet
>
> Differential Revision: http://reviews.llvm.org/D30562
>
> From: Evgeny Stupachenko <evstupac@gmail.com>

llvm-svn: 305720
2017-06-19 17:57:15 +00:00
Evgeny Stupachenko 3b88291581 Fix PR23384 (part 3 of 3)
Summary:
The patch makes instruction count the highest priority for
 LSR solution for X86 (previously registers had highest priority).

Reviewers: qcolombet

Differential Revision: http://reviews.llvm.org/D30562

From: Evgeny Stupachenko <evstupac@gmail.com>
llvm-svn: 304824
2017-06-06 20:04:16 +00:00