Summary:
Added the FastVariableShuffle feature to cases that resembled processors
for which this fearure is on.
For AVX2 there are processors with and w/o this fearue enable.
For AVX512 only KNL does enable this feature so cases which only have
+avx512f were left without the FastVariableShuffle enabled.
Reviewers: RKSimon, craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D41851
llvm-svn: 322090
We try to prevent shuffle combining to value types that would stop the folding of masked operations, but by just returning early, we were failing to try different shuffle types.
The TODOs are all still relevant here to improve codegen but we're lacking test examples.
llvm-svn: 321085
As part of the unification of the debug format and the MIR format, print
MBB references as '%bb.5'.
The MIR printer prints the IR name of a MBB only for block definitions.
* find . \( -name "*.mir" -o -name "*.cpp" -o -name "*.h" -o -name "*.ll" \) -type f -print0 | xargs -0 sed -i '' -E 's/BB#" << ([a-zA-Z0-9_]+)->getNumber\(\)/" << printMBBReference(*\1)/g'
* find . \( -name "*.mir" -o -name "*.cpp" -o -name "*.h" -o -name "*.ll" \) -type f -print0 | xargs -0 sed -i '' -E 's/BB#" << ([a-zA-Z0-9_]+)\.getNumber\(\)/" << printMBBReference(\1)/g'
* find . \( -name "*.txt" -o -name "*.s" -o -name "*.mir" -o -name "*.cpp" -o -name "*.h" -o -name "*.ll" \) -type f -print0 | xargs -0 sed -i '' -E 's/BB#([0-9]+)/%bb.\1/g'
* grep -nr 'BB#' and fix
Differential Revision: https://reviews.llvm.org/D40422
llvm-svn: 319665
As part of the unification of the debug format and the MIR format,
always print registers as lowercase.
* Only debug printing is affected. It now follows MIR.
Differential Revision: https://reviews.llvm.org/D40417
llvm-svn: 319187
Similar to the existing code to lower to PACKSS, we can use PACKUS if the input vector's leading zero bits extend all the way to the packed/truncated value.
We have to account for pre-SSE41 targets not supporting PACKUSDW
llvm-svn: 317315
Similar to the existing code to lower to PACKSS, we can use PACKUS if the input vector's leading zero bits extend all the way to the packed/truncated value.
We have to account for pre-SSE41 targets not supporting PACKUSDW
llvm-svn: 317128
So far we've only been using PACKSS truncations with 'all-bits or zero-bits' patterns (vector comparison results etc.). When really we can safely use it for any case as long as the number of sign bits reach down to the last 16-bits (or 8-bits if we're truncating to bytes).
The next steps after this is add the equivalent support for PACKUS and to support packing to sub-128 bit vectors for truncating stores etc.
Differential Revision: https://reviews.llvm.org/D39476
llvm-svn: 317086
Recognise cases when we can merge the shuffles with their horizontal (HADD/HSUB/PACK) instruction inputs.
Replaces an older implementation which performed some of this during lowering, expanding an existing target shuffle combine stage instead.
Differential Revision: https://reviews.llvm.org/D38506
llvm-svn: 315150
This patch redefines the MOVSS/MOVSD instructions to take VR128 as its second input. This allows the MOVSS/SD->BLEND commute to work without requiring a COPY to be inserted.
This should fix PR33079
Overall this looks to be an improvement in the generated code. I haven't checked the EXPENSIVE_CHECKS build but I'll do that and update with results.
Differential Revision: https://reviews.llvm.org/D38449
llvm-svn: 314914
If the upper bits of a truncation shuffle patterns have at least the minimum number of sign/zero bits on their inputs then we can safely use PACKSS/PACKUS as shuffles.
Partial fix for https://bugs.llvm.org/show_bug.cgi?id=34773
Differential Revision: https://reviews.llvm.org/D38472
llvm-svn: 314788
These were taking priority over the aligned load instructions since there is no vmovda8/16. I don't think there is really a difference between aligned and unaligned on newer cpus so I don't think it matters which instructions we use.
But with this change we reduce the size of the isel table a little and we allow the aligned information to pass through to the evex->vec pass and produce the same output has avx/avx2 in some cases.
I also generally dislike patterns rooted in a bitcast which these were.
Differential Revision: https://reviews.llvm.org/D35977
llvm-svn: 309589
VZEROUPPER should not be issued on Knights Landing (KNL), but on Skylake-avx512 it should be.
Differential Revision: https://reviews.llvm.org/D29874
llvm-svn: 296859
Seems the execution dependency pass likes to use FP instructions when most of the consuming code is integer if a vextractf128 instruction produced the register. Without AVX2 we don't have the corresponding integer instruction available.
This patch suppresses the domain on these instructions to GenericDomain if AVX2 is not supported so that they are ignored by domain fixing. If AVX2 is supported we'll report the correct domain and allow them to switch between integer and fp.
Overall I think this produces better results in the modified test cases.
llvm-svn: 294824
Similar was already done for several other shuffles in this function.
The test changes are because the old code used explicity zeroing for elements that could have been undef.
While I was here I also changed other shuffle vectors in the same function to use the same input twice instead of creating UNDEF nodes. getVectorShuffle can create the UNDEF for us.
llvm-svn: 294130
There are cases of AVX-512 instructions that have two possible encodings. This is the case with instructions that use vector registers with low indexes of 0 - 15 and do not use the zmm registers or the mask k registers.
The EVEX encoding prefix requires 4 bytes whereas the VEX prefix can take only up to 3 bytes. Consequently, using the VEX encoding for these instructions results in a code size reduction of ~2 bytes even though it is compiled with the AVX-512 features enabled.
Reviewers: Craig Topper, Zvi Rackoover, Elena Demikhovsky
Differential Revision: https://reviews.llvm.org/D27901
llvm-svn: 290663
This is a tiny patch with a big pile of test changes.
This partially fixes PR27885:
https://llvm.org/bugs/show_bug.cgi?id=27885
My motivating case looks like this:
- vpshufd {{.*#+}} xmm1 = xmm1[0,1,0,2]
- vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
- vpblendw {{.*#+}} xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7]
+ vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
And this happens several times in the diffs. For chips with domain-crossing penalties,
the instruction count and size reduction should usually overcome any potential
domain-crossing penalty due to using an FP op in a sequence of int ops. For chips such
as recent Intel big cores and Atom, there is no domain-crossing penalty for shufps, so
using shufps is a pure win.
So the test case diffs all appear to be improvements except one test in
vector-shuffle-combining.ll where we miss an opportunity to use a shift to generate
zero elements and one test in combine-sra.ll where multiple uses prevent the expected
shuffle combining.
Differential Revision: https://reviews.llvm.org/D27692
llvm-svn: 289837
This generalizes the build_vector -> vector_shuffle combine to support any
number of inputs. The idea is to create a binary tree of shuffles, where
the first layer performs pairwise shuffles of the input vectors placing each
input element into the correct lane, and the rest of the tree blends these
shuffles together.
This doesn't try to be smart and create any sort of "optimal" shuffles.
The assumption is that even a "poor" shuffle sequence is better than extracting
and inserting the elements one by one.
Differential Revision: https://reviews.llvm.org/D24683
llvm-svn: 283480
As discussed on PR28136, lowerShuffleAsRepeatedMaskAndLanePermute was attempting to match repeated masks at the 128-bit level and then permute the resultant lanes at the 128-bit (AVX1) or 64-bit (AVX2) sub-lane level.
This change allows us to create the repeated masks at the sub-lane level (and then concat them together to create a 128-bit repeated mask) and then select which sub-lane to permute. This has no effect on the AVX1 codegen.
Fixes PR28136.
llvm-svn: 275543
An identity COPY like this:
%AL = COPY %AL, %EAX<imp-def>
has no semantic effect, but encodes liveness information: Further users
of %EAX only depend on this instruction even though it does not define
the full register.
Replace the COPY with a KILL instruction in those cases to maintain this
liveness information. (This reverts a small part of r238588 but this
time adds a comment explaining why a KILL instruction is useful).
llvm-svn: 274952
This patch attempts to represent a shuffle as a repeating shuffle (recognisable by is128BitLaneRepeatedShuffleMask) with the source input(s) in their original lanes, followed by a single permutation of the 128-bit lanes to their final destinations.
On AVX2 we can additionally attempt to match using 64-bit sub-lane permutation. AVX2 can also now match a similar 'broadcasted' repeating shuffle.
This patch has several benefits:
* Avoids prematurely matching with lowerVectorShuffleByMerging128BitLanes which can require both inputs to have their input lanes permuted before shuffling.
* Can replace PERMPS/PERMD instructions - although these are useful for cross-lane unary shuffling, they require their shuffle mask to be pre-loaded (and increase register pressure).
* Matching the repeating shuffle makes use of a lot of existing shuffle lowering.
There is an outstanding minor AVX1 regression (combine_unneeded_subvector1 in vector-shuffle-combining.ll) of a previously 128-bit shuffle + subvector splat being converted to a subvector splat + (2 instruction) 256-bit shuffle, I intend to fix this in a followup patch for review.
Differential Revision: http://reviews.llvm.org/D16537
llvm-svn: 260834
This patch transforms truncation between vectors of integers into
X86ISD::PACKUS/PACKSS operations during DAG combine. We don't do it in
lowering phase because after type legalization, the original truncation
will be turned into a BUILD_VECTOR with each element that is extracted
from a vector and then truncated, and from them it is difficult to do
this optimization. This greatly improves the performance of truncations
on some specific types.
Cost table is updated accordingly.
Differential revision: http://reviews.llvm.org/D14588
llvm-svn: 256194
autogenerated.
Also update existing test cases which appear to be generated by it and
weren't modified (other than addition of the header) by rerunning it.
llvm-svn: 253917
This patch adds supports for the vector constant folding of TRUNCATE and FP_EXTEND instructions and tidies up the SINT_TO_FP and UINT_TO_FP instructions to match.
It also moves the vector constant folding for the FNEG and FABS instructions to use the DAG.getNode() functionality like the other unary instructions.
Differential Revision: http://reviews.llvm.org/D8593
llvm-svn: 233224
systematic lowering of v8i16.
This required a slight strategy shift to prefer unpack lowerings in more
places. While this isn't a cut-and-dry win in every case, it is in the
overwhelming majority. There are only a few places where the old
lowering would probably be a touch faster, and then only by a small
margin.
In some cases, this is yet another significant improvement.
llvm-svn: 229859
v16i8 shuffles, and replace it with new facilities.
This uses precise patterns to match exact unpacks, and the new
generalized unpack lowering only when we detect a case where we will
have to shuffle both inputs anyways and they terminate in exactly
a blend.
This fixes all of the blend horrors that I uncovered by always lowering
blends through the vector shuffle lowering. It also removes *sooooo*
much of the crazy instruction sequences required for v16i8 lowering
previously. Much cleaner now.
The only "meh" aspect is that we sometimes use pshufb+pshufb+unpck when
it would be marginally nicer to use pshufb+pshufb+por. However, the
difference there is *tiny*. In many cases its a win because we re-use
the pshufb mask. In others, we get to avoid the pshufb entirely. I've
left a FIXME, but I'm dubious we can really do better than this. I'm
actually pretty happy with this lowering now.
For SSE2 this exposes some horrors that were really already there. Those
will have to fixed by changing a different path through the v16i8
lowering.
llvm-svn: 229846
This allows it to match still more places where previously we would have
to fall back on floating point shuffles or other more complex lowering
strategies.
I'm hoping to replace some of the hand-rolled unpack matching with this
routine is it gets more and more clever.
llvm-svn: 229463
advantage of the existence of a reasonable blend instruction.
The 256-bit vector shuffle lowering has leveraged the general technique
of decomposed shuffles and blends for quite some time, but this never
made it back into the 128-bit code, and there are a large number of
patterns where this is substantially better. For example, this removes
almost all domain crossing in vector shuffles that involve some blend
and some permutation with SSE4.1 and later. See the massive reduction
in 'shufps' for integer test cases in this commit.
This isn't perfect yet for a few reasons:
1) The v8i16 shuffle lowering continues to plague me. We don't always
form an unpack-based blend when that would be better. But the wins
pretty drastically outstrip the losses here.
2) The v16i8 shuffle lowering is just a disaster here. I never went and
implemented blend support here for some terrible reason. I'll do
that next probably. I've not updated it for now.
More variations on this technique are coming as well -- we don't
shuffle-into-unpack or shuffle-into-palignr, both of which would also be
profitable.
Note that some test cases grow significantly in the number of
instructions, but I expect to actually be faster. We use
pshufd+pshufd+blendw instead of a single shufps, but the pshufd's are
very likely to pipeline well (two ports on most modern intel chips) and
the blend is a *very* fast instruction. The domain switch penalty will
essentially always be more than a blend instruction, which is the only
increase in tree height.
llvm-svn: 229350