Commit Graph

61 Commits

Author SHA1 Message Date
Gadi Haber 19c4fc5e62 This is a large patch for X86 AVX-512 of an optimization for reducing code size by encoding EVEX AVX-512 instructions using the shorter VEX encoding when possible.
There are cases of AVX-512 instructions that have two possible encodings. This is the case with instructions that use vector registers with low indexes of 0 - 15 and do not use the zmm registers or the mask k registers.
The EVEX encoding prefix requires 4 bytes whereas the VEX prefix can take only up to 3 bytes. Consequently, using the VEX encoding for these instructions results in a code size reduction of ~2 bytes even though it is compiled with the AVX-512 features enabled.

Reviewers: Craig Topper, Zvi Rackoover, Elena Demikhovsky 
Differential Revision: https://reviews.llvm.org/D27901

llvm-svn: 290663
2016-12-28 10:12:48 +00:00
Craig Topper 8aca90507f [AVX-512] Add VLX command lines to 128 and 256-bit shufffle tests.
llvm-svn: 283014
2016-10-01 06:01:18 +00:00
Simon Pilgrim 941bd6bbae [X86][SSE] Add support for combining VZEXT_MOVL target shuffles
Includes adding more general support for the pattern: VZEXT_MOVL(VZEXT_LOAD(ptr)) -> VZEXT_LOAD(ptr)

This has unearthed a couple of latent poor codegen issues (MINSS/MAXSS scalar load folding and MOVDDUP/BROADCAST load folding patterns), which will be fixed shortly.

Its also reduced a couple of tests so that they no longer reach the instruction threshold necessary to be combined to PSHUFB (see PR26183).

llvm-svn: 279646
2016-08-24 18:07:53 +00:00
Simon Pilgrim 82e54871d0 [DAGCombiner] Fold xor/and/or (bitcast(A), bitcast(B)) -> bitcast(op (A,B)) anytime before LegalizeVectorOprs
xor/and/or (bitcast(A), bitcast(B)) -> bitcast(op (A,B)) was only being combined at the AfterLegalizeTypes stage, this patch permits the combine to occur anytime before then as well.

The main aim with this to improve the ability to recognise bitmasks that can be converted to shuffles.

I had to modify a number of AVX512 mask tests as the basic bitcast to/from scalar pattern was being stripped out, preventing testing of the mmask bitops. By replacing the bitcasts with loads we can get almost the same result.

Differential Revision: http://reviews.llvm.org/D18944

llvm-svn: 265998
2016-04-11 21:10:33 +00:00
Simon Pilgrim fba9352f31 [X86][SSE] Added bitmask pattern shuffle tests
Based on OR(AND(MASK,V0),AND(~MASK,V1)) style patterns

llvm-svn: 265697
2016-04-07 17:23:55 +00:00
Simon Pilgrim 5ba1c127fc [X86][SSE] Improve i16 splatting shuffles
Better handling of the annoying pshuflw/pshufhw ops which only shuffle lower/upper halves of a vector.

Added vXi16 unary shuffle support for cases where i16 elements (from the same half of the source) are being splatted to the whole of one of the halves. This avoids the general lowering case which must shuffle the 32-bit elements first - meaning that we used to end up with unnecessary duplicate pshuflw/pshufhw shuffles.

Note this has the side effect of a lot of SSSE3 test cases no longer needing to use PSHUFB, as it falls below the 3 op combine threshold for when PSHUFB is typically worth it. I've raised PR26183 to discuss if the threshold should be changed and whether we need to make it more specific to the target CPU.

Differential Revision: http://reviews.llvm.org/D14901

llvm-svn: 258440
2016-01-21 22:07:41 +00:00
James Y Knight 7c905063c5 Make utils/update_llc_test_checks.py note that the assertions are
autogenerated.

Also update existing test cases which appear to be generated by it and
weren't modified (other than addition of the header) by rerunning it.

llvm-svn: 253917
2015-11-23 21:33:58 +00:00
Ahmed Bougacha b49eb3ab4b [X86] Fold (trunc (i32 (zextload i16))) into vbroadcast.
When matching non-LSB-extracting truncating broadcasts, we now insert
the necessary SRL. If the scalar resulted from a load, the SRL will be
folded into it, creating a narrower, offset, load.

However, i16 loads aren't Desirable, so we get i16->i32 zextloads.
We already catch i16 aextloads; catch these as well.

llvm-svn: 252363
2015-11-06 23:16:48 +00:00
Ahmed Bougacha 05a0514b12 [X86] SRL non-LSB extracts when folding to truncating broadcasts.
Now that we recognize this, we can support it instead of bailing out.
That is, we can fold:
  (v8i16 (shufflevector
    (v8i16 (bitcast (v4i32 (build_vector X, Y, ...)))),
    <1,1,...,1>))
into:
  (v8i16 (vbroadcast (i16 (trunc (srl Y, 16)))))

llvm-svn: 252362
2015-11-06 23:16:43 +00:00
Ahmed Bougacha 68614a36d1 [X86] Don't fold non-LSB extracts into truncating broadcasts.
We used to incorrectly assume that the offset we're extracting from
was a multiple of the element size. So, we'd fold:
  (v8i16 (shufflevector
    (v8i16 (bitcast (v4i32 (build_vector X, Y, ...)))),
    <1,1,...,1>))
into:
  (v8i16 (vbroadcast (i16 (trunc Y))))
whereas we should have extracted the higher bits from X.

Instead, bail out if the assumption doesn't hold.

llvm-svn: 252361
2015-11-06 23:16:38 +00:00
Ahmed Bougacha 0cdc7719f0 [X86] Look for scalar through one bitcast when lowering to VBROADCAST.
Fixes PR23464: one way to use the broadcast intrinsics is:

  _mm256_broadcastw_epi16(_mm_cvtsi32_si128(*(int*)src));

We don't currently fold this, but now that we use native IR for
the intrinsics (r245605), we can look through one bitcast to find
the broadcast scalar.

Differential Revision: http://reviews.llvm.org/D10557

llvm-svn: 245613
2015-08-20 21:02:39 +00:00
Ahmed Bougacha 69a17acb74 [X86] Add some broadcast-from-memory tests.
llvm-svn: 245612
2015-08-20 20:59:41 +00:00
Ahmed Bougacha 89ae9a1e28 [X86] update_llc_test_checks vector-shuffle-*. NFC.
Some of them had gone stale.

llvm-svn: 240485
2015-06-24 00:03:48 +00:00
Simon Pilgrim ed2ba33ba0 [DAGCombiner] Combine shuffles of BUILD_VECTOR and SCALAR_TO_VECTOR
This patch attempts to fold the shuffling of 'scalar source' inputs - BUILD_VECTOR and SCALAR_TO_VECTOR nodes - if the shuffle node is the only user. This folds away a lot of unnecessary shuffle nodes, and allows quite a bit of constant folding that was being missed.

Differential Revision: http://reviews.llvm.org/D8516

llvm-svn: 234004
2015-04-03 10:02:21 +00:00
Chandler Carruth eb206aa1ea [x86] Now that the new vector shuffle legality is enabled and everything
is going well, remove the flag and the code for the old legality tests.

This is the first step toward removing the entire old vector shuffle
lowering. *Much* more code to delete coming up next.

llvm-svn: 229963
2015-02-20 03:59:35 +00:00
Chandler Carruth 5d1a84b7b8 [x86] Delete still more piles of complex code now that we have a good
systematic lowering of v8i16.

This required a slight strategy shift to prefer unpack lowerings in more
places. While this isn't a cut-and-dry win in every case, it is in the
overwhelming majority. There are only a few places where the old
lowering would probably be a touch faster, and then only by a small
margin.

In some cases, this is yet another significant improvement.

llvm-svn: 229859
2015-02-19 15:21:57 +00:00
Chandler Carruth 0b39536390 [x86] Teach the unpack lowering how to lower with an initial unpack in
addition to lowering to trees rooted in an unpack.

This saves shuffles and or registers in many various ways, lets us
handle another class of v4i32 shuffles pre SSE4.1 without domain
crosses, etc.

llvm-svn: 229856
2015-02-19 15:06:13 +00:00
Chandler Carruth 352eba1c29 [x86] Dramatically improve v8i16 shuffle lowering by not using its
terribly complex partial blend logic.

This code path was one of the more complex and bug prone when it first
went in and it hasn't faired much better. Ultimately, with the simpler
basis for unpack lowering and support bit-math blending, this is
completely obsolete. In the worst case without this we generate
different but equivalent instructions. However, in many cases we
generate much better code. This is especially true when blends or pshufb
is available.

This does expose one (minor) weakness of the unpack lowering that I'll
try to address.

In case you were wondering, this is actually a big part of what I've
been trying to pull off in the recent string of commits.

llvm-svn: 229853
2015-02-19 14:08:24 +00:00
Chandler Carruth 2c0390ca4b [x86] Remove the final fallback in the v8i16 lowering that isn't really
needed, and significantly improve the SSSE3 path.

This makes the new strategy much more clear. If we can blend, we just go
with that. If we can't blend, we try to permute into an unpack so
that we handle cases where the unpack doing the blend also simplifies
the shuffle. If that fails and we've got SSSE3, we now call into
factored-out pshufb lowering code so that we leverage the fact that
pshufb can set up a blend for us while shuffling. This generates great
code, especially because we *know* we don't have a fast blend at this
point. Finally, we fall back on decomposing into permutes and blends
because we do at least have a bit-math-based blend if we need to use
that.

This pretty significantly improves some of the v8i16 code paths. We
never need to form pshufb for the single-input shuffles because we have
effective target-specific combines to form it there, but we were missing
its effectiveness in the blends.

llvm-svn: 229851
2015-02-19 13:56:49 +00:00
Chandler Carruth bcb6c5f62d [x86] Add support for bit-wise blending and use it in the v8 and v16
lowering paths. I'm going to be leveraging this to simplify a lot of the
overly complex lowering of v8 and v16 shuffles in pre-SSSE3 modes.

Sadly, this isn't profitable on v4i32 and v2i64. There, the float and
double blending instructions for pre-SSE4.1 are actually pretty good,
and we can't beat them with bit math. And once SSE4.1 comes around we
have direct blending support and this ceases to be relevant.

Also, some of the test cases look odd because the domain fixer
canonicalizes these to floating point domain. That's OK, it'll use the
integer domain when it matters and some day I may be able to update
enough of LLVM to canonicalize the other way.

This restores almost all of the regressions from teaching x86's vselect
lowering to always use vector shuffle lowering for blends. The remaining
problems are because the v16 lowering path is still doing crazy things.
I'll be re-arranging that strategy in more detail in subsequent commits
to finish recovering the performance here.

llvm-svn: 229836
2015-02-19 10:46:52 +00:00
Craig Topper 55ac42426e [X86] Add another test case for the bug fixed in r229642. With the bug a vpsrldq was emitted instead of pslldq.
llvm-svn: 229643
2015-02-18 07:45:43 +00:00
Chandler Carruth 55553f5299 [x86] Rewrite the byte shift detection to not use boolean variables to
track state.

I didn't like this in the code review because the pattern tends to be
error prone, but I didn't see a clear way to rewrite it. Turns out that
there were bugs here, I found them when fuzz testing our shuffle
lowering for correctness on x86.

The core of the problem is that we need to consistently test all our
preconditions for the same directionality of shift and the same input
vector. Instead, formulate this as two predicates (one doesn't depend on
the input in any way), pass things like the directionality and input
vector as inputs, and loop over the alternatives.

This fixes a pattern of very rare miscompiles coming out of this code.
Turned up roughly 4 out of every 1 million v8 shuffles in my fuzz
testing. The new code is over half a million test runs with no failures
yet. I've also fuzzed every other function in the lowering code with
over 3.5 million test cases and not discovered any other miscompiles.

llvm-svn: 229642
2015-02-18 07:13:48 +00:00
Chandler Carruth 55db07016e [x86] Teach the unpack lowering to try wider element unpacks.
This allows it to match still more places where previously we would have
to fall back on floating point shuffles or other more complex lowering
strategies.

I'm hoping to replace some of the hand-rolled unpack matching with this
routine is it gets more and more clever.

llvm-svn: 229463
2015-02-17 02:12:24 +00:00
Chandler Carruth 87e580a659 [x86] Teach the 128-bit vector shuffle lowering routines to take
advantage of the existence of a reasonable blend instruction.

The 256-bit vector shuffle lowering has leveraged the general technique
of decomposed shuffles and blends for quite some time, but this never
made it back into the 128-bit code, and there are a large number of
patterns where this is substantially better. For example, this removes
almost all domain crossing in vector shuffles that involve some blend
and some permutation with SSE4.1 and later. See the massive reduction
in 'shufps' for integer test cases in this commit.

This isn't perfect yet for a few reasons:

1) The v8i16 shuffle lowering continues to plague me. We don't always
   form an unpack-based blend when that would be better. But the wins
   pretty drastically outstrip the losses here.
2) The v16i8 shuffle lowering is just a disaster here. I never went and
   implemented blend support here for some terrible reason. I'll do
   that next probably. I've not updated it for now.

More variations on this technique are coming as well -- we don't
shuffle-into-unpack or shuffle-into-palignr, both of which would also be
profitable.

Note that some test cases grow significantly in the number of
instructions, but I expect to actually be faster. We use
pshufd+pshufd+blendw instead of a single shufps, but the pshufd's are
very likely to pipeline well (two ports on most modern intel chips) and
the blend is a *very* fast instruction. The domain switch penalty will
essentially always be more than a blend instruction, which is the only
increase in tree height.

llvm-svn: 229350
2015-02-16 01:52:02 +00:00
Chandler Carruth 1b5285dd57 [SDAG] Teach the SelectionDAG to canonicalize vector shuffles of splats
directly into blends of the splats.

These patterns show up even very late in the vector shuffle lowering
where we don't have any chance for DAG combining to kick in, and
blending is a tremendously simpler operation to model. By coercing the
shuffle into a blend we can much more easily match and lower shuffles of
splats.

Immediately with this change there are significantly more blends being
matched in the x86 vector shuffle lowering.

llvm-svn: 229308
2015-02-15 12:18:12 +00:00
Chandler Carruth fe69608839 [x86] Switch a collection of tests explicitly to the new vector shuffle
legality test (essentially, everything is legal).

I'm planning to make this the default shortly, but I'd like to fix
a collection of the bugs it exposes first, and this will let me easily
test them. It also showcases both the improvements and a few of the
regressions triggered by the change. The biggest improvements by far are
the significantly reduced shuffling and domain crossing in the combining
test case. The biggest regressions are missing some clever blending
patterns.

llvm-svn: 229284
2015-02-15 06:37:21 +00:00
Chandler Carruth 89a60770e0 [x86] Remove the now-default-on flag for the new vector shuffle lowering
strategy from a bunch of tests.

llvm-svn: 229283
2015-02-15 06:20:51 +00:00
Chandler Carruth 22b1525ae8 [x86] Include the destination register in the check-lines for AVX
instructions.

No actual change here.

llvm-svn: 228127
2015-02-04 09:18:27 +00:00
Chandler Carruth 18ba596609 [x86] Add some tests I missed in the prior commit to cover blends with
zero for v8i16 as well.

These exhibit the same domain badness, but also exhibit other weaknesses
in our blend lowering. More fixes to come.

llvm-svn: 228126
2015-02-04 09:15:46 +00:00
Simon Pilgrim 46cd4f7400 [X86][SSE] psrl(w/d/q) and psll(w/d/q) bit shifts for SSE2
Patch to match cases where shuffle masks can be reduced to bit shifts. Similar to byte shift shuffle matching from D5699.

Differential Revision: http://reviews.llvm.org/D6649

llvm-svn: 228047
2015-02-03 21:58:29 +00:00
Simon Pilgrim d9885856e6 [X86][SSE] Added general integer shuffle matching for MOVQ instruction
This patch adds general shuffle pattern matching for the MOVQ zero-extend instruction (copy lower 64bits, zero upper) for all 128-bit integer vectors, it is added as a fallback test in lowerVectorShuffleAsZeroOrAnyExtend.

llvm-svn: 228022
2015-02-03 20:09:18 +00:00
Simon Pilgrim 9c76b47469 [X86][SSE] Shuffle mask decode support for zero extend, scalar float/double moves and integer load instructions
This patch adds shuffle mask decodes for integer zero extends (pmovzx** and movq xmm,xmm) and scalar float/double loads/moves (movss/movsd).

Also adds shuffle mask decodes for integer loads (movd/movq).

Differential Revision: http://reviews.llvm.org/D7228

llvm-svn: 227688
2015-01-31 14:09:36 +00:00
Chandler Carruth 1d7d7aa1f5 [x86] Clean up the shift lowering vector shuffle tests a bit using my
script. Notably this folds all the SSE cases together into a single
FileCheck block. It also adds a vex prefix.

llvm-svn: 223610
2014-12-07 17:15:53 +00:00
Simon Pilgrim 371417db34 [X86][SSE] Improvements to byte shift shuffle matching
Since (v)pslldq / (v)psrldq instructions resolve to a single input argument it is useful to match it much earlier than we currently do - this prevents more complicated shuffles (notably insertion into a zero vector) matching before it.

Differential Revision: http://reviews.llvm.org/D6409

llvm-svn: 222796
2014-11-25 22:34:59 +00:00
Simon Pilgrim 3ac3b251a9 [X86][SSE] pslldq/psrldq byte shifts/rotation for SSE2
This patch builds on http://reviews.llvm.org/D5598 to perform byte rotation shuffles (lowerVectorShuffleAsByteRotate) on pre-SSSE3 (palignr) targets - pre-SSSE3 is only enabled on i8 and i16 vector targets where it is a more definite performance gain.

I've also added a separate byte shift shuffle (lowerVectorShuffleAsByteShift) that makes use of the ability of the SLLDQ/SRLDQ instructions to implicitly shift in zero bytes to avoid the need to create a zero register if we had used palignr.

Differential Revision: http://reviews.llvm.org/D5699

llvm-svn: 222340
2014-11-19 10:06:49 +00:00
Chandler Carruth 0c922fcec5 [x86] Start improving the matching of unpck instructions based on test
cases from Halide folks. This initial step was extracted from
a prototype change by Clay Wood to try and address regressions found
with Halide and the new vector shuffle lowering.

llvm-svn: 221779
2014-11-12 10:05:18 +00:00
Chandler Carruth ce6947d4cf [x86] Clean up a bunch of vector shuffle tests with my script. Notably,
removes windows line endings and other noise. This is in prelude to
making substantive changes to these tests.

llvm-svn: 221776
2014-11-12 09:17:15 +00:00
Simon Pilgrim a798e9ffdf [X86][SSE] pslldq/psrldq shuffle mask decodes
Patch to provide shuffle decodes and asm comments for the sse pslldq/psrldq SSE2/AVX2 byte shift instructions.

Differential Revision: http://reviews.llvm.org/D5598

llvm-svn: 219738
2014-10-14 22:31:34 +00:00
Chandler Carruth b9d3fa1e65 [x86] Teach the new vector shuffle lowering about VBROADCAST and
VPBROADCAST.

This has the somewhat expected pervasive impact. I don't know why
I forgot about this. Everything seems good with lots of significant
improvements in the tests.

llvm-svn: 218724
2014-10-01 00:41:21 +00:00
Chandler Carruth bebedbaf36 [x86] Add AVX1 and AVX2 testing to all of the 128-bit shuffle test
cases.

While clearly we don't need the AVX vector width, these ISA extensions
often cause us to select different instructions and we should cover them
even with the narrow vector width.

Also, while here, nuke the stress_test2 contents. There is no reason to
try to FileCheck this entire body when it is mostly a test for
successfully surviving the code generator.

llvm-svn: 218710
2014-09-30 22:16:23 +00:00
Chandler Carruth 6a62cd3538 [x86] Rework all of the 128-bit vector shuffle tests with my handy test
updating script so that they are more thorough and consistent.

Specific fixes here include:
- Actually test VEX-encoded AVX mnemonics.
- Actually use an SSE 4.1 run to test SSE 4.1 features!
- Correctly check instructions sequences from the start of the function.
- Elide the shuffle operands and comment designator in a consistent way.
- Test all of the architectures instead of just the ones I was motivated
  to manually author.

I've gone back through and fixed up any egregious issues I spotted. Let
me know if I missed something you really dislike.

One downside to this is that we're now not as diligently using FileCheck
variables for registers. I would be much more concerned with this if we
had larger register usage, but there just aren't that interesting of
register choices here and most of the registers are constrained by the
ABI. Ultimately, I don't think this is likely to be the maintenance
burden for these tests and updating them again should be staright
forward.

llvm-svn: 218707
2014-09-30 21:44:34 +00:00
Chandler Carruth 6578f9208b [x86] Fix a really silly bug that I introduced fixing another bug in the
new vector shuffle target DAG combines -- it helps to actually test for
the value you want rather than just using an integer in a boolean
context.

Have I mentioned that I loathe implicit conversions recently? :: sigh ::

llvm-svn: 218576
2014-09-28 06:11:04 +00:00
Chandler Carruth 0fc0c22fa9 [x86] Fully generalize the zext lowering in the new vector shuffle
lowering to support both anyext and zext and to custom lower for many
different microarchitectures.

Using this allows us to get *exactly* the right code for zext and anyext
shuffles in all the vector sizes. For v16i8, the improvement is *huge*.
The new SSE2 test case added I refused to add before this because it was
sooooo muny instructions.

llvm-svn: 218143
2014-09-19 20:00:32 +00:00
Chandler Carruth 398ba9a018 [x86] Add a dedicated lowering path for zext-compatible vector shuffles
to the new vector shuffle lowering code.

This allows us to emit PMOVZX variants consistently for patterns where
it is a viable lowering. This instruction is both fast and allows us to
fold loads into it. This only hooks the new lowering up for i16 and i8
element widths, mostly so I could manage the change to the tests. I'll
add the i32 one next, although it is significantly less interesting.

One thing to note is that we already had some tests for these patterns
but those tests had far less horrible instructions. The problem is that
those tests weren't checking the strict start and end of the instruction
sequence. =[ As a consequence something changed in the lowering making
us generate *TERRIBLE* code for these patterns in SSE2 through SSSE3.
I've consolidated all of the tests and spelled out the madness that we
currently emit for these shuffles. I'm going to try to figure out what
has gone wrong here.

llvm-svn: 218102
2014-09-19 06:07:49 +00:00
Chandler Carruth 9057fcaf82 [x86] Use PALIGNR for v4i32 and v2i64 blends when appropriate.
There is no purpose in using it for single-input shuffles as
pshufd is just as fast and doesn't tie the two operands. This removes
a substantial amount of wrong-domain blend operations in SSSE3 mode. It
also completes the usage of PALIGNR for integer shuffles and addresses
one of the test cases Quentin hit with the new vector shuffle lowering.

There is still the question of whether and when to use this for floating
point shuffles. It is faster than shufps or shufpd but in the integer
domain. I don't yet really have a good heuristic here for when to use
this instruction for floating point vectors.

llvm-svn: 218038
2014-09-18 09:00:25 +00:00
Chandler Carruth 867930aadf [x86] Initial step of teaching the new vector shuffle lowering about
PALIGNR. This just adds it to the v8i16 and v16i8 lowering steps where
it is completely unmatched. It also introduces the logic for detecting
rotation shuffle masks even in the presence of single input or blend
masks and arbitrarily undef lanes.

I've added fairly comprehensive tests for the matching logic in v8i16
because the tests at that size are much easier to write and manage.

I've not checked the SSE2 code generated for these tests because the
code is *horrible*. It is absolute madness. Testing it will just make
the test brittle without giving any interesting improvements in the
correctness confidence.

llvm-svn: 218013
2014-09-18 04:11:29 +00:00
Chandler Carruth 35e3b545d6 [x86] Undo a flawed transform I added to form UNPCK instructions when
AVX is available, and generally tidy up things surrounding UNPCK
formation.

Originally, I was thinking that the only advantage of PSHUFD over UNPCK
instruction variants was its free copy, and otherwise we should use the
shorter encoding UNPCK instructions. This isn't right though, there is
a larger advantage of being able to fold a load into the operand of
a PSHUFD. For UNPCK, the operand *must* be in a register so it can be
the second input.

This removes the UNPCK formation in the target-specific DAG combine for
v4i32 shuffles. It also lifts the v8 and v16 cases out of the
AVX-specific check as they are potentially replacing multiple
instructions with a single instruction and so should always be valuable.
The floating point checks are simplified accordingly.

This also adjusts the formation of PSHUFD instructions to attempt to
match the shuffle mask to one which would fit an UNPCK instruction
variant. This was originally motivated to allow it to match the UNPCK
instructions in the combiner, but clearly won't now.

Eventually, we should add a MachineCombiner pass that can form UNPCK
instructions post-RA when the operand is known to be in a register and
thus there is no loss.

llvm-svn: 217755
2014-09-15 10:35:41 +00:00
Chandler Carruth 44e64b5267 [x86] Teach the new vector shuffle lowering to use 'punpcklwd' and
'punpckhwd' instructions when suitable rather than falling back to the
generic algorithm.

While we could canonicalize to these patterns late in the process, that
wouldn't help when the freedom to use them is only visible during
initial lowering when undef lanes are well understood. This, it turns
out, is very important for matching the shuffle patterns that are used
to lower sign extension. Fixes a small but relevant regression in
gcc-loops with the new lowering.

When I changed this I noticed that several 'pshufd' lowerings became
unpck variants. This is bad because it removes the ability to freely
copy in the same instruction. I've adjusted the widening test to handle
undef lanes correctly and now those will correctly continue to use
'pshufd' to lower. However, this caused a bunch of churn in the test
cases. No functional change, just churn.

Both of these changes are part of addressing a general weakness in the
new lowering -- it doesn't sufficiently leverage undef lanes. I've at
least a couple of patches that will help there at least in an academic
sense.

llvm-svn: 217752
2014-09-15 09:02:37 +00:00
Chandler Carruth 19cbf0e2c4 [x86] Factor out the zero vector insertion logic in the new vector
shuffle lowering for integer vectors and share it from v4i32, v8i16, and
v16i8 code paths.

Ironically, the SSE2 v16i8 code for this is now better than the SSSE3!
=] Will have to fix the SSSE3 code next to just using a single pshufb.

llvm-svn: 217240
2014-09-05 10:36:31 +00:00
Chandler Carruth b7eda21bb0 [x86] Rewrite a core part of the new vector shuffle lowering to handle
one pesky test case correctly.

This test case caused the old code to infloop occilating between solving
the low-half and the high-half. The 'side balancing' part of
single-input v8 shuffle lowering didn't handle the one pattern which can
cause it to occilate. Fortunately the fuzz testing found this case.
Unfortuately it was *terrible* to handle. I'm really sorry for the
amount and density of the code here, I'd love suggestions on how to
simplify it. I feel like there *must* be a simpler form here, but after
a lot of days I've not found it. This is the only one I've found that
even works. I've added the one pesky test case along with some nice
comments explaining the core problem that we have to solve here.

So far this has survived approximately 32k test cases. More strenuous
fuzzing commencing.

llvm-svn: 215519
2014-08-13 01:25:45 +00:00