This patch adds shuffle mask decodes for integer zero extends (pmovzx** and movq xmm,xmm) and scalar float/double loads/moves (movss/movsd).
Also adds shuffle mask decodes for integer loads (movd/movq).
Differential Revision: http://reviews.llvm.org/D7228
llvm-svn: 227688
This patch adds shuffle matching for the SSE3 MOVDDUP, MOVSLDUP and MOVSHDUP instructions. The big use of these being that they avoid many single source shuffles from needing to use (pre-AVX) dual source instructions such as SHUFPD/SHUFPS: causing extra moves and preventing load folds.
Adding these instructions uncovered an issue in XFormVExtractWithShuffleIntoLoad which crashed on single operand shuffle instructions (now fixed). It also involved fixing getTargetShuffleMask to correctly identify theses instructions as unary shuffles.
Also adds a missing tablegen pattern for MOVDDUP.
Differential Revision: http://reviews.llvm.org/D7042
llvm-svn: 226716
has a remarkably unique and efficient lowering.
While we get this some of the time already, we miss a few cases and
there wasn't a principled reason we got it. We should at least test
this. v8 already has tests for this pattern.
llvm-svn: 222607
a bunch more improvements.
Non-lane-crossing is fine, the key is that lane merging only makes sense
for single-input shuffles. Not sure why I got so turned around here. The
code all works, I was just using the wrong model for it.
This only updates v4 and v8 lowering. The v16 and v32 lowering requires
restructuring the entire check sequence.
llvm-svn: 222537
lanes.
By special casing these we can often either reduce the total number of
shuffles significantly or reduce the number of (high latency on Haswell)
AVX2 shuffles that potentially cross 128-bit lanes. Even when these
don't actually cross lanes, they have much higher latency to support
that. Doing two of them and a blend is worse than doing a single insert
across the 128-bit lanes to blend and then doing a single interleaved
shuffle.
While this seems like a narrow case, it kept cropping up on me and the
difference is *huge* as you can see in many of the test cases. I first
hit this trying to perfectly fix the interleaving shuffle patterns used
by Halide for AVX2.
llvm-svn: 222533
merging 128-bit subvectors and also shuffling all the elements of those
subvectors. Currently we generate pretty bad code for many of these, but
I'm testing a patch that should dramatically improve this in addition to
making the shuffle lowering robust to other changes.
llvm-svn: 222525
Updated X86TargetLowering::isShuffleMaskLegal to match SHUFP masks with commuted inputs and PSHUFD masks that reference the second input.
As part of this I've refactored isPSHUFDMask to work in a more general manner and allow it to match against either the first or second input vector.
Differential Revision: http://reviews.llvm.org/D6287
llvm-svn: 222087
between splitting a vector into 128-bit lanes and recombining them vs.
decomposing things into single-input shuffles and a final blend.
This handles a large number of cases in AVX1 where the cross-lane
shuffles would be much more expensive to represent even though we end up
with a fast blend at the root. Instead, we can do a better job of
shuffling in a single lane and then inserting it into the other lanes.
This fixes the remaining bits of Halide's regression captured in PR21281
for AVX1. However, the bug persists in AVX2 because I've made this
change reasonably conservative. The cases where it makes sense in AVX2
to split into 128-bit lanes are much more rare because we can often do
full permutations across all elements of the 256-bit vector. However,
the particular test case in PR21281 is an example of one of the rare
cases where it is *always* better to work in a single 128-bit lane. I'm
going to try to teach the logic to detect and form the good code even in
AVX2 next, but it will need to use a separate heuristic.
Finally, there is one pesky regression here where we previously would
craftily use vpermilps in AVX1 to shuffle both high and low halves at
the same time. We no longer pull that off, and not for any really good
reason. Ultimately, I think this is just another missing nuance to the
selection heuristic that I'll try to add in afterward, but this change
already seems strictly worth doing considering the magnitude of the
improvements in common matrix math shuffle patterns.
As always, please let me know if this causes a surprising regression for
you.
llvm-svn: 221861
shuffles using AVX and AVX2 instructions. This fixes PR21138, one of the
few remaining regressions impacting benchmarks from the new vector
shuffle lowering.
You may note that it "regresses" many of the vperm2x128 test cases --
these were actually "improved" by the naive lowering that the new
shuffle lowering previously did. This regression gave me fits. I had
this patch ready-to-go about an hour after flipping the switch but
wasn't sure how to have the best of both worlds here and thought the
correct solution might be a completely different approach to lowering
these vector shuffles.
I'm now convinced this is the correct lowering and the missed
optimizations shown in vperm2x128 are actually due to missing
target-independent DAG combines. I've even written most of the needed
DAG combine and will submit it shortly, but this part is ready and
should help some real-world benchmarks out.
llvm-svn: 219079
the various ways in which blends can be used to do vector element
insertion for lowering with the scalar math instruction forms that
effectively re-blend with the high elements after performing the
operation.
This then allows me to bail on the element insertion lowering path when
we have SSE4.1 and are going to be doing a normal blend, which in turn
restores the last of the blends lost from the new vector shuffle
lowering when I got it to prioritize insertion in other cases (for
example when we don't *have* a blend instruction).
Without the patterns, using blends here would have regressed
sse-scalar-fp-arith.ll *completely* with the new vector shuffle
lowering. For completeness, I've added RUN-lines with the new lowering
here. This is somewhat superfluous as I'm about to flip the default, but
hey, it shows that this actually significantly changed behavior.
The patterns I've added are just ridiculously repetative. Suggestions on
making them better very much welcome. In particular, handling the
commuted form of the v2f64 patterns is somewhat obnoxious.
llvm-svn: 219033
perform a load to use blendps rather than movss when it is available.
For non-loads, blendps is *much* faster. It can execute on two ports in
Sandy Bridge and Ivy Bridge, and *three* ports on Haswell. This fixes
one of the "regressions" from aggressively taking the "insertion" path
in the new vector shuffle lowering.
This does highlight one problem with blendps -- it isn't commuted as
heavily as it should be. That's future work though.
llvm-svn: 219022
and MOVSD nodes for single element vector inserts.
This is particularly important because a number of patterns in the
backend detect these patterns and leverage them to simplify things. It
also fixes quite a few of the insertion bad code examples. However, it
regresses a specific area: when available, blendps and blendpd are
*dramatically* faster than movss and movsd respectively. But it doesn't
really work to form the blend logic first because the blends *aren't* as
crazy efficient when the data is coming from memory anyways, and thus
will have a movss or movsd regardless. Also, doing that would block
a bunch of the patterns that this is designed to hit.
So my plan is to go into the patterns for lowering MOVSS and MOVSD and
lower them via blends when available. However that's a pretty invasive
restructuring so it will need to be a follow-up patch.
I have already gone into the patterns to lower MOVSS and MOVSD from
memory using MOVLPD, etc. Without that, several of the test cases
I already have regress.
llvm-svn: 218985
lowering to match VZEXT_MOVL patterns.
I hadn't realized that these had sufficient pattern smarts in the
backend to lower zext-ing from the low element of a vector without it
being a scalar_to_vector node. They do, and this is how to match a bunch
of patterns for movq, movss, etc.
There is a weird propensity to end up using pshufd to place the element
afterward even though it means domain crossing (or rather, to use
xorps+movss to zext the element rather than movq) but that's an
orthogonal problem with VZEXT_MOVL that someone should probably look at.
llvm-svn: 218977
VPBROADCAST.
This has the somewhat expected pervasive impact. I don't know why
I forgot about this. Everything seems good with lots of significant
improvements in the tests.
llvm-svn: 218724
shuffle tests to match that used in the script I posted and now used
consistently in 128-bit tests.
Nothing interesting changing here, just using the label name as the
FileCheck label and a slightly more general comment marker consumption
strategy.
llvm-svn: 218709
AVX support.
New test cases included. Note that none of the existing test cases
covered these buggy code paths. =/ Also, it is clear from this that
SHUFPS and SHUFPD are the most bug prone shuffle instructions in x86. =[
These were all detected by fuzz-testing. (I <3 fuzz testing.)
llvm-svn: 218522
v4f64 and v8f32 shuffles when they are lane-crossing. We have fully
general lane-crossing permutation functions in AVX2 that make this easy.
Part of this also changes exactly when and how these vectors are split
up when we don't have AVX2. This isn't always a win but it usually is
a win, so on the balance I think its better. The primary regressions are
all things that just need to be fixed anyways such as modeling when
a blend can be completely accomplished via VINSERTF128, etc.
Also, this highlights one of the few remaining big features: we do
a really poor job of inserting elements into AVX registers efficiently.
This completes almost all of the big tricks I have in mind for AVX2. The
only things left that I plan to add:
1) element insertion smarts
2) palignr and other fairly specialized lowerings when they happen to
apply
llvm-svn: 218449
256-bit vectors with lane-crossing.
Rather than immediately decomposing to 128-bit vectors, try flipping the
256-bit vector lanes, shuffling them and blending them together. This
reduces our worst case shuffle by a pretty significant margin across the
board.
llvm-svn: 218446
detection. It was incorrectly handling undef lanes by actually treating
an undef lane in the first 128-bit lane as a *numeric* shuffle value.
Fortunately, this almost always DTRT and disabled detecting repeated
patterns. But not always. =/ This patch introduces a much more
principled approach and fixes the miscompiles I spotted by inspection
previously.
llvm-svn: 218346
shuffles using the AVX2 instructions. This is the first step of cutting
in real AVX2 support.
Note that I have spotted at least one bug in the test cases already, but
I suspect it was already present and just is getting surfaced. Will
investigate next.
llvm-svn: 218338
a more sane approach to AVX2 support.
Fundamentally, there is no useful way to lower integer vectors in AVX.
None. We always end up with a VINSERTF128 in the end, so we might as
well eagerly switch to the floating point domain and do everything
there. This cleans up lots of weird and unlikely to be correct
differences between integer and floating point shuffles when we only
have AVX1.
The other nice consequence is that by doing things this way we will make
it much easier to write the integer lowering routines as we won't need
to duplicate the logic to check for AVX vs. AVX2 in each one -- if we
actually try to lower a 256-bit vector as an integer vector, we have
AVX2 and can rely on it. I think this will make the code much simpler
and more comprehensible.
Currently, I've disabled *all* support for AVX2 so that we always fall
back to AVX. This keeps everything working rather than asserting. That
will go away with the subsequent series of patches that provide
a baseline AVX2 implementation.
Please note, I'm going to implement AVX2 *without access to hardware*.
That means I cannot correctness test this path. I will be relying on
those with access to AVX2 hardware to do correctness testing and fix
bugs here, but as a courtesy I'm trying to sketch out the framework for
the new-style vector shuffle lowering in the context of the AVX2 ISA.
llvm-svn: 218228
of a single element into a zero vector for v4f64 and v4i64 in AVX.
Ironically, there is less to see here because xor+blend is so crazy fast
that we can't really beat that to zero the high 128-bit lane.
llvm-svn: 218214
VBLENDPD over using VSHUFPD. While the 256-bit variant of VBLENDPD slows
down to the same speed as VSHUFPD on Sandy Bridge CPUs, it has twice the
reciprocal throughput on Ivy Bridge CPUs much like it does everywhere
for 128-bits. There isn't a downside, so just eagerly use this
instruction when it suffices.
llvm-svn: 218208
This expands the integer cases to cover the fact that AVX2 moves their
lane-crossing shuffles into the integer domain. It also adds proper
support for AVX2 run lines and the "ALL" group when it doesn't matter.
llvm-svn: 218206
single-input shuffles with doubles. This allows them to fold memory
operands into the shuffle, etc. This is just the analog to the v4f32
case in my prior commit.
llvm-svn: 218193
There is no purpose in using it for single-input shuffles as
pshufd is just as fast and doesn't tie the two operands. This removes
a substantial amount of wrong-domain blend operations in SSSE3 mode. It
also completes the usage of PALIGNR for integer shuffles and addresses
one of the test cases Quentin hit with the new vector shuffle lowering.
There is still the question of whether and when to use this for floating
point shuffles. It is faster than shufps or shufpd but in the integer
domain. I don't yet really have a good heuristic here for when to use
this instruction for floating point vectors.
llvm-svn: 218038
when SSE4.1 is available.
This removes a ton of domain crossing from blend code paths that were
ending up in the floating point code path.
This is just the tip of the iceberg though. The real switch is for
integer blend lowering to more actively rely on this instruction being
available so we don't hit shufps at all any longer. =] That will come in
a follow-up patch.
Another place where we need better support is for using PBLENDVB when
doing so avoids the need to have two complementary PSHUFB masks.
llvm-svn: 217767
AVX is available, and generally tidy up things surrounding UNPCK
formation.
Originally, I was thinking that the only advantage of PSHUFD over UNPCK
instruction variants was its free copy, and otherwise we should use the
shorter encoding UNPCK instructions. This isn't right though, there is
a larger advantage of being able to fold a load into the operand of
a PSHUFD. For UNPCK, the operand *must* be in a register so it can be
the second input.
This removes the UNPCK formation in the target-specific DAG combine for
v4i32 shuffles. It also lifts the v8 and v16 cases out of the
AVX-specific check as they are potentially replacing multiple
instructions with a single instruction and so should always be valuable.
The floating point checks are simplified accordingly.
This also adjusts the formation of PSHUFD instructions to attempt to
match the shuffle mask to one which would fit an UNPCK instruction
variant. This was originally motivated to allow it to match the UNPCK
instructions in the combiner, but clearly won't now.
Eventually, we should add a MachineCombiner pass that can form UNPCK
instructions post-RA when the operand is known to be in a register and
thus there is no loss.
llvm-svn: 217755
These are super simple. They even take precedence over crazy
instructions like INSERTPS because they have very high throughput on
modern x86 chips.
I still have to teach the integer shuffle variants about this to avoid
so many domain crossings. However, due to the particular instructions
available, that's a touch more complex and so a separate patch.
Also, the backend doesn't seem to realize it can commute blend
instructions by negating the mask. That would help remove a number of
copies here. Suggestions on how to do this welcome, it's an area I'm
less familiar with.
llvm-svn: 217744
support transforming the forms from the new vector shuffle lowering to
use 'movddup' when appropriate.
A bunch of the cases where we actually form 'movddup' don't actually
show up in the test results because something even later than DAG
legalization maps them back to 'unpcklpd'. If this shows back up as
a performance problem, I'll probably chase it down, but it is at least
an encoded size loss. =/
To make this work, also always do this canonicalizing step for floating
point vectors where the baseline shuffle instructions don't provide any
free copies of their inputs. This also causes us to canonicalize
unpck[hl]pd into mov{hl,lh}ps (resp.) which is a nice encoding space
win.
There is one test which is "regressed" by this: extractelement-load.
There, the test case where the optimization it is testing *fails*, the
exact instruction pattern which results is slightly different. This
should probably be fixed by having the appropriate extract formed
earlier in the DAG, but that would defeat the purpose of the test.... If
this test case is critically important for anyone, please let me know
and I'll try to work on it. The prior behavior was actually contrary to
the comment in the test case and seems likely to have been an accident.
llvm-svn: 217738
support for MOVDDUP which is really important for matrix multiply style
operations that do lots of non-vector-aligned load and splats.
The original motivation was to add support for MOVDDUP as the lack of it
regresses matmul_f64_4x4 by 5% or so. However, all of the rules here
were somewhat suspicious.
First, we should always be using the floating point domain shuffles,
regardless of how many copies we have to make as a movapd is *crazy*
faster than the domain switching cost on some chips. (Mostly because
movapd is crazy cheap.) Because SHUFPD can't do the copy-for-free trick
of the PSHUF instructions, there is no need to avoid canonicalizing on
UNPCK variants, so do that canonicalizing. This also ensures we have the
chance to form MOVDDUP. =]
Second, we assume SSE2 support when doing any vector lowering, and given
that we should just use UNPCKLPD and UNPCKHPD as they can operate on
registers or memory. If vectors get spilled or come from memory at all
this is going to allow the load to be folded into the operation. If we
want to optimize for encoding size (the only difference, and only
a 2 byte difference) it should be done *much* later, likely after RA.
llvm-svn: 217332
these DAG combines.
The DAG auto-CSE thing is truly terrible. Due to it, when RAUW-ing
a node with its operand, you can cause its uses to CSE to itself, which
then causes their uses to become your uses which causes them to be
picked up by the RAUW. For nodes that are determined to be "no-ops",
this is "fine". But if the RAUW is one of several steps to enact
a transformation, this causes the DAG to really silently eat an discard
nodes that you would never expect. It took days for me to actually
pinpoint a test case triggering this and a really frustrating amount of
time to even comprehend the bug because I never even thought about the
ability of RAUW to iteratively consume nodes due to CSE-ing them into
itself.
To fix this, we have to build up a brand-new chain of operations any
time we are combining across (potentially) intervening nodes. But once
the logic is added to do this, another issue surfaces: CombineTo eagerly
deletes the one node combined, *but no others*. This is... really
frustrating. If deleting it makes its operands become dead, those
operand nodes often won't go onto the worklist in the
order you would want -- they're already on it and not near the top. That
means things higher on the worklist will get combined prior to these
dead nodes being GCed out of the worklist, and if the chain is long, the
immediate users won't be enough to re-detect where the root of the chain
is that became single-use again after deleting the dead nodes. The
better way to do this is to never immediately delete nodes, and instead
to just enqueue them so we can recursively delete them. The
combined-from node is typically not on the worklist anyways by virtue of
having been popped off.... But that in turn breaks other tests that
*require* CombineTo to delete unused nodes. :: sigh ::
Fortunately, there is a better way. This whole routine should have been
returning the replacement rather than using CombineTo which is quite
hacky. Switch to that, and all the pieces fall together.
I suspect the same kind of miscompile is possible in the half-shuffle
folding code, and potentially the recursive folding code. I'll be
switching those over to a pattern more like this one for safety's sake
even though I don't immediately have any test cases for them. Note that
the only way I got a test case for this instance was with *heavily* DAG
combined 256-bit shuffle sequences generated by my fuzzer. ;]
llvm-svn: 216319
the new shuffle lowering and an implementation for v4 shuffles.
This allows us to handle non-half-crossing shuffles directly for v4
shuffles, both integer and floating point. This currently misses places
where we could perform the blend via UNPCK instructions, but otherwise
generates equally good or better code for the test cases included to the
existing vector shuffle lowering. There are a few cases that are
entertainingly better. ;]
llvm-svn: 215702
target-specific shuffl DAG combines.
We were recognizing the paired shuffles backwards. This code needs to be
replaced anyways as we have the same functionality elsewhere, but I'll
do the refactoring in a follow-up, this is the minimal fix to the
behavior.
In addition to fixing miscompiles with the new vector shuffle lowering,
it also causes the canonicalization to kick in much better, selecting
the smaller encoding variants in lots of places in the new AVX path.
This still isn't quite ideal as we don't need both the shufpd and the
punpck instructions, but that'll get fixed in a follow-up patch.
llvm-svn: 215690
lowering scheme.
Currently, this just directly bails to the fallback path of splitting
the 256-bit vector into two 128-bit vectors, operating there, and then
joining the results back together. While the results are far from
perfect, they are *shockingly* good for what we're doing here. I'll be
layering the rest of the functionality on top of this piece by piece and
updating tests as I go.
Note that 256-bit vectors in this mode are still somewhat WIP. While
I think the code paths that I'm adding here are clean and good-to-go,
there are still a lot of 128-bit assumptions that I'll need to stomp out
as I march through the functional spread here.
llvm-svn: 215637