Added TESTM and TESTNM to the list of instructions that already zero the unused upper bits
and so do not need the redundant shift-left and shift-right instructions afterwards.
Added a pattern for TESTM and TESTNM in ISelLowering, so icmp(ne, and(X, Y), 0) now folds into TESTM
and icmp(eq, and(X, Y), 0) folds into TESTNM.
This commit is a preparation for lowering the test and testn X86 intrinsics to IR.
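For illustration, here is a minimal sketch (made-up function name; types chosen arbitrarily) of the IR shape that now selects to a single test instruction:

  ; Each lane computes (x & y) != 0. With AVX512 this can select directly to
  ; VPTESTMQ instead of a compare followed by shift-left/shift-right cleanup
  ; of the unused upper bits of the k register.
  define <8 x i1> @and_ne(<8 x i64> %x, <8 x i64> %y) {
    %a = and <8 x i64> %x, %y
    %c = icmp ne <8 x i64> %a, zeroinitializer
    ret <8 x i1> %c
  }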
Differential Revision: https://reviews.llvm.org/D38732
llvm-svn: 317465
The masked store instruction only cares about the sign bit of each mask element,
so the signed compare against zero (x s< 0) isn't needed.
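As a hedged sketch (not taken from the actual tests), the affected pattern looks like this:

  ; The mask comes from a sign test, so on AVX the sign bits of %x can feed the
  ; masked-store instruction (e.g. vmaskmovps) directly; the explicit compare
  ; against zero is redundant.
  define void @store_neg_lanes(<4 x i32> %x, <4 x float> %v, <4 x float>* %p) {
    %neg = icmp slt <4 x i32> %x, zeroinitializer
    call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %v, <4 x float>* %p, i32 4, <4 x i1> %neg)
    ret void
  }
  declare void @llvm.masked.store.v4f32.p0v4f32(<4 x float>, <4 x float>*, i32, <4 x i1>)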
As noted in PR11210:
https://bugs.llvm.org/show_bug.cgi?id=11210
...fixing this should allow us to eliminate x86-specific masked store intrinsics in IR.
(Although more testing will be needed to confirm that.)
I filed a bug to track improvements for AVX512:
https://bugs.llvm.org/show_bug.cgi?id=34584
Differential Revision: https://reviews.llvm.org/D37446
llvm-svn: 313089
As noted in PR11210:
https://bugs.llvm.org/show_bug.cgi?id=11210
...fixing this should allow us to eliminate x86-specific masked store intrinsics in IR.
(Although more testing will be needed to confirm that.)
llvm-svn: 312496
[X86][AVX512] Improve lowering of AVX512 compare intrinsics (remove redundant shift left+right instructions).
AVX512 compare instructions return v*i1 types.
In cases where the number of elements in the returned value is less than 8, clang adds zeroes to get a mask of v8i1 type.
Later on this is replaced with CONCAT_VECTORS, which is then lowered to many DAG nodes including insert/extract element and shift right/left nodes.
The fact that AVX512 compare instructions put the result in a k register and zero all its upper bits allows us to remove the extra nodes simply by copying the result to the required register class.
When lowering, identify these cases and transform them into an INSERT_SUBVECTOR node (marked legal), then catch this pattern in the instruction selection phase and transform it into a single AVX512 cmp instruction.
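As a rough sketch of the incoming IR (roughly what clang emits for a narrow compare intrinsic; names and types are illustrative):

  ; A <4 x i1> compare result is widened to <8 x i1> with zeroes and bitcast to i8.
  ; Since the k register already has its upper bits zeroed, the widening is free.
  define i8 @cmp_mask(<4 x i64> %a, <4 x i64> %b) {
    %c = icmp sgt <4 x i64> %a, %b
    %w = shufflevector <4 x i1> %c, <4 x i1> zeroinitializer,
                       <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
    %m = bitcast <8 x i1> %w to i8
    ret i8 %m
  }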
Differential Revision: https://reviews.llvm.org/D33188
llvm-svn: 306402
AVX512 compare instructions return v*i1 types.
In cases where the number of elements in the returned value is less than 8, clang adds zeroes to get a mask of v8i1 type.
Later on this is replaced with CONCAT_VECTORS, which is then lowered to many DAG nodes including insert/extract element and shift right/left nodes.
The fact that AVX512 compare instructions put the result in a k register and zero all its upper bits allows us to remove the extra nodes simply by copying the result to the required register class.
When lowering, identify these cases and transform them into an INSERT_SUBVECTOR node (marked legal), then catch this pattern in the instruction selection phase and transform it into a single AVX512 cmp instruction.
Differential Revision: https://reviews.llvm.org/D33188
llvm-svn: 305465
We've had several bugs (PR32256, PR32241) recently that resulted from usages of AH/BH/CH/DH either before or after a copy to/from a mask register.
This ultimately occurs because we create COPY_TO_REGCLASS with VK1 and GR8. Then in CopyToFromAsymmetricReg in X86InstrInfo we find a 32-bit super register for the GR8 to emit the KMOV with. But as these tests demonstrate, it's possible for the GR8 register to be a high register, and we end up doing an accidental extract or insert from bits 15:8.
I think the best way forward is to stop making copies directly between mask registers and GR8/GR16. Instead I think we should restrict to only copies between mask registers and GR32/GR64 and use EXTRACT_SUBREG/INSERT_SUBREG to handle the conversion from GR32 to GR16/GR8 or vice versa.
Unfortunately, this complicates fastisel a bit more, since it now has to create the subreg extracts where we used to create GR8 copies. We can probably make a helper function to bring down the repetition.
This does result in KMOVD being used for copies when BWI is available because we don't know the original mask register size. This caused a lot of deltas on tests because we have to split the checks for KMOVD vs KMOVW based on BWI.
Differential Revision: https://reviews.llvm.org/D30968
llvm-svn: 298928
VZEROUPPER should not be issued on Knights Landing (KNL), but on Skylake-avx512 it should be.
Differential Revision: https://reviews.llvm.org/D29874
llvm-svn: 296859
ISel now selects masked move instructions for vselect instead of blendm. But sometimes it is beneficial for register allocation to remove the tied-register constraint by using blendm instructions.
This also picks up cases where the masked move was created due to a masked load intrinsic.
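For reference, a minimal vselect that is affected (a sketch; the interesting part is the choice of machine instruction, not the IR):

  ; By default this selects to a masked move (e.g. vmovdqa32 with merge masking),
  ; whose destination is tied to the false operand; vpblendmd has no such tie.
  define <16 x i32> @sel(<16 x i1> %m, <16 x i32> %a, <16 x i32> %b) {
    %r = select <16 x i1> %m, <16 x i32> %a, <16 x i32> %b
    ret <16 x i32> %r
  }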
Differential Revision: https://reviews.llvm.org/D28454
llvm-svn: 292005
There are cases of AVX-512 instructions that have two possible encodings. This is the case for instructions that use vector registers with low indices (0-15) and do not use the zmm registers or the mask k registers.
The EVEX encoding prefix requires 4 bytes, whereas the VEX prefix can take only up to 3 bytes. Consequently, using the VEX encoding for these instructions results in a code size reduction of ~2 bytes, even when compiling with the AVX-512 features enabled.
Reviewers: Craig Topper, Zvi Rackover, Elena Demikhovsky
Differential Revision: https://reviews.llvm.org/D27901
llvm-svn: 290663
This is a tiny patch with a big pile of test changes.
This partially fixes PR27885:
https://llvm.org/bugs/show_bug.cgi?id=27885
My motivating case looks like this:
- vpshufd {{.*#+}} xmm1 = xmm1[0,1,0,2]
- vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
- vpblendw {{.*#+}} xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7]
+ vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
And this happens several times in the diffs. For chips with domain-crossing penalties,
the instruction count and size reduction should usually overcome any potential
domain-crossing penalty due to using an FP op in a sequence of int ops. For chips such
as recent Intel big cores and Atom, there is no domain-crossing penalty for shufps, so
using shufps is a pure win.
So the test case diffs all appear to be improvements except one test in
vector-shuffle-combining.ll where we miss an opportunity to use a shift to generate
zero elements and one test in combine-sra.ll where multiple uses prevent the expected
shuffle combining.
Differential Revision: https://reviews.llvm.org/D27692
llvm-svn: 289837
We have been inconsistent with these instructions (and all their variants), with a random mix of them using the default float domain.
Differential Revision: https://reviews.llvm.org/D27419
llvm-svn: 288902
Previously we were extending these copies to copies of the whole ZMM register. The register allocator shouldn't use XMM16-31 or YMM16-31 in this configuration, as the instructions to spill them aren't available.
llvm-svn: 280648
This reverts most of r274613 (AKA r274626) and its follow-ups (r276347, r277289),
due to miscompiles in the test suite. The FastISel change was left in, because
it apparently fixes an unrelated issue.
(Recommit of r279782 which was broken due to a bad merge.)
This fixes 4 out of the 5 test failures in PR29112.
llvm-svn: 279788
This reverts most of r274613 and its follow-ups (r276347, r277289), due to
miscompiles in the test suite. The FastISel change was left in, because it
apparently fixes an unrelated issue.
This fixes 4 out of the 5 test failures in PR29112.
llvm-svn: 279782
The previous (non-custom) implementation doesn't enforce zeroing of the upper bits. The assumption is that an i1 PRODUCER (truncate and extractelement) must zero all upper bits, so i1 CONSUMER instructions (test, zext, save, etc.) can be done without additional zeroing.
Make extractelement i1 lowering custom for all i1 vector types.
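A tiny example of the producer/consumer contract described above (illustrative only):

  ; The extractelement is an i1 PRODUCER: its lowering must leave the upper bits
  ; of the GPR zeroed, so the zext CONSUMER needs no additional masking.
  define i32 @get_bit(<16 x i1> %mask) {
    %b = extractelement <16 x i1> %mask, i32 3
    %z = zext i1 %b to i32
    ret i32 %z
  }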
Differential Revision: http://reviews.llvm.org/D23246
llvm-svn: 278328
Under normal circumstances we prefer the higher performance MOVD to extract the 0th element of a v8i16 vector instead of PEXTRW.
But as detailed in PR27265, this prevents the SSE41 implementation of PEXTRW from folding the store of the 0th element. Additionally, it prevents us from making use of the fact that the (SSE2) reg-reg version of PEXTRW implicitly zero-extends the i16 element to the i32/i64 destination register.
This patch only prefers lowering to MOVD if we will not be zero-extending the extracted i16 and will not prevent a store from being folded (on SSE41).
Fix for PR27265.
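For example, in a case like this (a made-up sketch), the SSE41 PEXTRW can fold the store, so preferring MOVD would be a pessimization:

  ; Extract element 0 and store it: with SSE41 this can be a single
  ; 'pextrw $0, %xmm0, (%rdi)' rather than a movd plus a scalar store.
  define void @store_elt0(<8 x i16> %v, i16* %p) {
    %e = extractelement <8 x i16> %v, i32 0
    store i16 %e, i16* %p, align 2
    ret void
  }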
Differential Revision: https://reviews.llvm.org/D22509
llvm-svn: 276289
An identity COPY like this:
%AL = COPY %AL, %EAX<imp-def>
has no semantic effect, but encodes liveness information: Further users
of %EAX only depend on this instruction even though it does not define
the full register.
Replace the COPY with a KILL instruction in those cases to maintain this
liveness information. (This reverts a small part of r238588 but this
time adds a comment explaining why a KILL instruction is useful).
llvm-svn: 274952
This is a resubmission of the r263158 change after fixing the existing problem with intrinsics mangling (see the "LTO and intrinsics mangling" llvm-dev thread for details).
This patch fixes the problem which occurs when loop-vectorize tries to use @llvm.masked.load/store intrinsic for a non-default addrspace pointer. It fails with "Calling a function with a bad signature!" assertion in CallInst constructor because it tries to pass a non-default addrspace pointer to the pointer argument which has default addrspace.
The fix is to add pointer type as another overloaded type to @llvm.masked.load/store intrinsics.
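With the pointer type as an extra overloaded type, a declaration for a non-default address space looks roughly like this (mangling shown for the typed-pointer IR of this era; the vector type is just an example):

  ; addrspace(1) is now encoded in the intrinsic's mangled name (p1v4i32).
  declare <4 x i32> @llvm.masked.load.v4i32.p1v4i32(<4 x i32> addrspace(1)*, i32, <4 x i1>, <4 x i32>)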
Reviewed By: reames
Differential Revision: http://reviews.llvm.org/D17270
llvm-svn: 274043
This is a resubmission of the r263158 change after fixing the existing problem with intrinsics mangling (see the "LTO and intrinsics mangling" llvm-dev thread for details).
This patch fixes the problem which occurs when loop-vectorize tries to use @llvm.masked.load/store intrinsic for a non-default addrspace pointer. It fails with "Calling a function with a bad signature!" assertion in CallInst constructor because it tries to pass a non-default addrspace pointer to the pointer argument which has default addrspace.
The fix is to add pointer type as another overloaded type to @llvm.masked.load/store intrinsics.
Reviewed By: reames
Differential Revision: http://reviews.llvm.org/D17270
llvm-svn: 273892
This is a resubmission of the r263158 change.
This patch fixes the problem which occurs when loop-vectorize tries to use @llvm.masked.load/store intrinsic for a non-default addrspace pointer. It fails with "Calling a function with a bad signature!" assertion in CallInst constructor because it tries to pass a non-default addrspace pointer to the pointer argument which has default addrspace.
The fix is to add pointer type as another overloaded type to @llvm.masked.load/store intrinsics.
Reviewed By: reames
Differential Revision: http://reviews.llvm.org/D17270
llvm-svn: 266086
Converting masked vector loads to regular vector loads for x86 AVX should always be a win.
I raised the legality issue of reading the extra memory bytes on llvm-dev. I did not see any
objections.
1. x86 already does this kind of optimization for multiple scalar loads -> vector load.
2. If other targets have the same flexibility, we could move this transform up to CGP or DAGCombiner.
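A hypothetical before/after shape of the transform, assuming a constant mask (the exact conditions are in the patch; the intrinsic mangling shown is the newer two-type form):

  define <4 x i32> @before(<4 x i32>* %p) {
    %x = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %p, i32 16,
             <4 x i1> <i1 true, i1 true, i1 true, i1 false>, <4 x i32> zeroinitializer)
    ret <4 x i32> %x
  }
  define <4 x i32> @after(<4 x i32>* %p) {
    ; the plain load also reads the bytes of the masked-off lane (this is the
    ; legality question raised on llvm-dev)
    %w = load <4 x i32>, <4 x i32>* %p, align 16
    %x = select <4 x i1> <i1 true, i1 true, i1 true, i1 false>, <4 x i32> %w, <4 x i32> zeroinitializer
    ret <4 x i32> %x
  }
  declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32, <4 x i1>, <4 x i32>)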
Differential Revision: http://reviews.llvm.org/D18094
llvm-svn: 263446
This patch fixes the problem which occurs when loop-vectorize tries to use @llvm.masked.load/store intrinsic for a non-default addrspace pointer. It fails with "Calling a function with a bad signature!" assertion in CallInst constructor because it tries to pass a non-default addrspace pointer to the pointer argument which has default addrspace.
The fix is to add pointer type as another overloaded type to @llvm.masked.load/store intrinsics.
Reviewed By: reames
Differential Revision: http://reviews.llvm.org/D17270
llvm-svn: 263158
Instead of a variable-blend instruction, form a blend with immediate because those are always cheaper.
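For illustration, the IR-level situation is a vector select with a constant condition (a sketch; the difference shows up only in the selected instruction):

  ; With a constant condition the backend can emit an immediate blend
  ; (e.g. blendps/vpblendd) instead of a variable blend (blendvps/pblendvb).
  define <4 x float> @blend_const(<4 x float> %a, <4 x float> %b) {
    %r = select <4 x i1> <i1 true, i1 false, i1 true, i1 false>, <4 x float> %a, <4 x float> %b
    ret <4 x float> %r
  }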
Differential Revision: http://reviews.llvm.org/D17899
llvm-svn: 263067