None of these have any reordering issues, and they still emit the same reduction intrinsics without any change in the existing test coverage:
llvm-project/clang/test/CodeGen/X86/avx512-reduceIntrin.c
llvm-project/clang/test/CodeGen/X86/avx512-reduceMinMaxIntrin.c
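For illustration, a representative check in the style of those tests (a sketch only; the exact FileCheck patterns live in the files above):
```
int test_mm512_reduce_add_epi32(__m512i a) {
// CHECK-LABEL: test_mm512_reduce_add_epi32
// CHECK: call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %{{.*}})
return _mm512_reduce_add_epi32(a);
}
```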
Differential Revision: https://reviews.llvm.org/D117881
D111985 added the generic `__builtin_elementwise_max` and `__builtin_elementwise_min` intrinsics with the same integer behaviour as the SSE/AVX instructions.
This patch removes the `__builtin_ia32_pmax/min` intrinsics and just uses `__builtin_elementwise_max/min` - the existing tests see no changes:
```
__m256i test_mm256_max_epu32(__m256i a, __m256i b) {
// CHECK-LABEL: test_mm256_max_epu32
// CHECK: call <8 x i32> @llvm.umax.v8i32(<8 x i32> %{{.*}}, <8 x i32> %{{.*}})
return _mm256_max_epu32(a, b);
}
```
This requires us to add a `__v64qs` explicitly signed char vector type (we already have `__v16qs` and `__v32qs`).
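As a hedged illustration of the header-side pattern for the 512-bit signed-byte case (attribute boilerplate omitted; `__v64qs_demo` and `sketch_mm512_max_epi8` are local stand-ins, not the real names):
```
#include <immintrin.h>

typedef signed char __v64qs_demo __attribute__((__vector_size__(64)));

// Sketch only: the real definition lives in avx512bwintrin.h with the usual
// always_inline/target/min_vector_width attributes.
static inline __m512i sketch_mm512_max_epi8(__m512i __A, __m512i __B) {
  return (__m512i)__builtin_elementwise_max((__v64qs_demo)__A,
                                            (__v64qs_demo)__B);
}
```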
Sibling patch to D117791
Differential Revision: https://reviews.llvm.org/D117798
D111986 added the generic `__builtin_elementwise_abs()` intrinsic with the same integer absolute behaviour as the SSE/AVX instructions (abs(INT_MIN) == INT_MIN).
This patch removes the `__builtin_ia32_pabs*` intrinsics and just uses `__builtin_elementwise_abs` - the existing tests see no changes:
```
__m256i test_mm256_abs_epi8(__m256i a) {
// CHECK-LABEL: test_mm256_abs_epi8
// CHECK: [[ABS:%.*]] = call <32 x i8> @llvm.abs.v32i8(<32 x i8> %{{.*}}, i1 false)
return _mm256_abs_epi8(a);
}
```
This requires us to add a `__v64qs` explicitly signed char vector type (we already have `__v16qs` and `__v32qs`).
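Similarly, a minimal sketch of the 512-bit case that needs the new type (boilerplate omitted; `__v64qs_demo` and `sketch_mm512_abs_epi8` are local stand-ins, not the real names):
```
#include <immintrin.h>

typedef signed char __v64qs_demo __attribute__((__vector_size__(64)));

// Sketch only: the real definition lives in avx512bwintrin.h with the usual
// attribute boilerplate.
static inline __m512i sketch_mm512_abs_epi8(__m512i __A) {
  return (__m512i)__builtin_elementwise_abs((__v64qs_demo)__A);
}
```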
Differential Revision: https://reviews.llvm.org/D117791
This is a follow-up to D92940.
We have successfully converted the fadd/fmul _mm_reduce_* intrinsics to
llvm.reduction + reassoc flag. We can take the same approach for fmin/fmax,
i.e. llvm.reduction + nnan flag.
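An illustrative check in the style of the existing reduction tests (a sketch; the exact patterns are in avx512-reduceMinMaxIntrin.c):
```
float test_mm512_reduce_max_ps(__m512 a) {
// CHECK-LABEL: test_mm512_reduce_max_ps
// CHECK: call nnan float @llvm.vector.reduce.fmax.v16f32(<16 x float> %{{.*}})
return _mm512_reduce_max_ps(a);
}
```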
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D93179
The *reduce_add/mul_ps/pd intrinsics assume that the elements in the
vector are reassociable, so we need to always attach the reassoc flag
when we lower the _mm_reduce_* intrinsics.
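Illustrative shape of the resulting IR check (a sketch; the identity operand is elided with a wildcard):
```
float test_mm512_reduce_add_ps(__m512 a) {
// CHECK-LABEL: test_mm512_reduce_add_ps
// CHECK: call reassoc float @llvm.vector.reduce.fadd.v16f32(float {{.*}}, <16 x float> %{{.*}})
return _mm512_reduce_add_ps(a);
}
```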
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D96231
Followup to D87604, having confirmed on PR47506 that we can use the llvm codegen expansion for fadd/fmul as well.
Differential Revision: https://reviews.llvm.org/D92940
Emit the equivalent integer reduction intrinsics in IR instead of expanding to shuffle+arithmetic sequences.
The fadd/fmul reductions might be trickier as they assume a similar bisection reduction while the generic intrinsics assume a sequential reduction (Intel docs are ambiguous on the correct approach) - I'm not sure if we want to always tag them with reassoc? Anyway, that issue can wait until a separate fp patch along with the fmin/fmax reductions.
Differential Revision: https://reviews.llvm.org/D87604
There are 65 shift intrinsics that take a scalar shift amount. Intel documentation shows 60 of them taking unsigned int. There are 5 versions of srli_epi16 that use int: the 512-bit maskz and the 128/256-bit mask/maskz versions.
Fixes PR45931
Differential Revision: https://reviews.llvm.org/D80251
Enabling `-Wcast-qual` identified many casts in various system headers
that were dropping the `const` qualifier. Fixing those missing
qualifiers pointed out that a few of the definitions of the builtins
did not properly identify their arguments as `const` pointers. This
commit fixes those builtin definitions, and the system header files
so that they no longer drop the qualifier.
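As a hedged illustration (not a test from the patch), this is the kind of user code that should now compile cleanly under `-Wcast-qual`, since the load intrinsic's pointer parameter is `const`-qualified:
```
#include <immintrin.h>

// Hypothetical helper: reading through a pointer-to-const no longer requires
// casting away the qualifier inside the header or in user code.
__m128i load_row(const int *p) {
  return _mm_loadu_si128((const __m128i *)p);
}
```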
Differential Revision: https://reviews.llvm.org/D71718
These intrinsics should always take an immediate for the rounding mode.
The base instruction comes from before EVEX embedded rounding. The
user should always provide the immediate rather than us assuming
CUR_DIRECTION.
Also make the 512-bit versions explicit aliases instead of copy-pasting
the code.
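For illustration (a sketch using _mm512_add_round_ps, which is not necessarily one of the intrinsics touched here), callers spell the rounding immediate explicitly:
```
#include <immintrin.h>

// The rounding argument must be a compile-time immediate rather than an
// implied _MM_FROUND_CUR_DIRECTION; compile with AVX-512 enabled or rely on
// the target attribute below.
__attribute__((target("avx512f")))
__m512 add_round_to_nearest(__m512 a, __m512 b) {
  return _mm512_add_round_ps(a, b, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}
```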
llvm-svn: 363961
Summary:
These all had somewhat custom file headers with different text from the
ones I searched for previously, and so I missed them. Thanks to Hal and
Kristina and others who prompted me to fix this, and sorry it took so
long.
Reviewers: hfinkel
Subscribers: mcrosier, javed.absar, cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D60406
llvm-svn: 357941
Summary:
With MSVC, #pragma pack is ignored when there is explicit alignment. This differs from gcc. Clang emulates this difference when compiling for Windows.
It appears that MSVC and its headers consider the __m128/__m128i/__m128d/etc. types to be explicitly aligned and ignore #pragma pack for them. Since we don't have explicit alignment on them in our headers, we don't match the MSVC behavior here.
This patch adds explicit alignment to match this behavior. I'm hoping this won't cause any problems when we're not emulating MSVC, but if someone knows of something that would be different we can switch to conditionally adding the alignment based on _MSC_VER.
I had to add explicitly unaligned types as well so we could use them in the loadu/storeu intrinsics, which use __attribute__((__packed__)). Using the now explicitly aligned types wouldn't produce align 1 accesses when targeting Windows.
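A minimal sketch of the shape of the change (not the verbatim header):
```
// The vector type carries explicit 16-byte alignment so the MSVC-style
// #pragma pack behavior matches, while the *_u variant used by the
// unaligned load/store intrinsics is explicitly align(1).
typedef float __m128   __attribute__((__vector_size__(16), __aligned__(16)));
typedef float __m128_u __attribute__((__vector_size__(16), __aligned__(1)));
```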
Reviewers: rnk, erichkeane, spatel, RKSimon
Subscribers: cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D57961
llvm-svn: 353555
This adds:
_cvtmask8_u32, _cvtmask16_u32, _cvtmask32_u32, _cvtmask64_u64
_cvtu32_mask8, _cvtu32_mask16, _cvtu32_mask32, _cvtu64_mask64
_load_mask8, _load_mask16, _load_mask32, _load_mask64
_store_mask8, _store_mask16, _store_mask32, _store_mask64
These are currently missing from the Intel Intrinsics Guide webpage.
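Hedged usage sketch for the 16-bit width (the helper name is hypothetical; the other widths follow the same pattern with the matching AVX-512 feature enabled):
```
#include <immintrin.h>

// Round-trip a 16-bit mask through a plain integer using the new intrinsics.
__attribute__((target("avx512f")))
unsigned int mask16_roundtrip(__mmask16 m) {
  unsigned int bits = _cvtmask16_u32(m);
  __mmask16 back = _cvtu32_mask16(bits);
  return _cvtmask16_u32(back);
}
```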
llvm-svn: 341251
This adds the following intrinsics:
_kshiftli_mask8
_kshiftli_mask16
_kshiftli_mask32
_kshiftli_mask64
_kshiftri_mask8
_kshiftri_mask16
_kshiftri_mask32
_kshiftri_mask64
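Hedged usage sketch (the helper name is hypothetical; the shift count must be an immediate):
```
#include <immintrin.h>

// Rotate a 16-bit mask by four positions using the new immediate shift
// intrinsics; compile with -mavx512f or rely on the target attribute.
__attribute__((target("avx512f")))
__mmask16 rotate_mask16_by4(__mmask16 m) {
  return (__mmask16)(_kshiftli_mask16(m, 4) | _kshiftri_mask16(m, 12));
}
```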
llvm-svn: 341234
This also adds a second intrinsic name for the 16-bit mask versions.
These intrinsics match gcc and icc. They just aren't published in the Intel Intrinsics Guide so I only recently found they existed.
llvm-svn: 340719
The rounding mode is checked in CGBuiltin.cpp to generate the correct intrinsic call.
Making this switch moves the masking to the 'bitcast i8 to <8 x i1> and extract i1' form of the IR for the mask. Previously we ended up with a scalar 'and' plus an icmp.
llvm-svn: 336637
This will convert the i8 mask argument to <8 x i1>, extract an i1, and then emit a select instruction. This replaces the '(__U & 1)' and ternary-operator pattern used in some of the intrinsics. The old sequence was lowered to a scalar 'and' plus a compare. The new sequence uses an i1 vector that will interoperate better with other mask intrinsics.
This removes the need to handle div_ss/sd specially in CGBuiltin.cpp. A follow-up patch will add the GCCBuiltin name back in llvm and remove the custom handling.
I made some adjustments to the legacy move_ss/sd intrinsics, which we reused here, to do a simpler extract and insert instead of two extracts and two inserts or a shuffle.
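A sketch of the FileCheck pattern this produces for a masked scalar intrinsic (illustrative, not the exact test lines):
```
__m128 test_mm_mask_add_ss(__m128 __W, __mmask8 __U, __m128 __A, __m128 __B) {
// CHECK-LABEL: test_mm_mask_add_ss
// CHECK: bitcast i8 %{{.*}} to <8 x i1>
// CHECK: extractelement <8 x i1> %{{.*}}, {{i32|i64}} 0
// CHECK: select i1 %{{.*}}, float %{{.*}}, float %{{.*}}
return _mm_mask_add_ss(__W, __U, __A, __B);
}
```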
llvm-svn: 336622
This is part of an ongoing attempt at making 512-bit vectors illegal in the X86 backend type legalizer due to CPU frequency penalties associated with wide vectors on Skylake Server CPUs. We want the loop vectorizer to be able to emit IR containing wide vectors as intermediate operations in vectorized code and allow these wide vectors to be legalized to 256 bits by the X86 backend even though we are targeting a CPU that supports 512-bit vectors. This is similar to what happens with an AVX2 CPU: the vectorizer can emit wide vectors and the backend will split them. We want this splitting behavior, but still be able to use the new Skylake instructions that work on 256-bit vectors and support things like masking and gather/scatter.
Of course if the user uses explicit vector code in their source code we need to not split those operations. Especially if they have used any of the 512-bit vector intrinsics from immintrin.h. And we need to make it so that merely using the intrinsics produces the expected code in order to be backwards compatible.
To support this goal, this patch adds a new IR function attribute "min-legal-vector-width" that can indicate the need for a minimum vector width to be legal in the backend. We need to ensure this attribute is set to the largest vector width needed by any intrinsics from immintrin.h that the function uses. The inliner will be responsible for merging this attribute when a function is inlined. We may also need a way to limit inlining, but we can discuss that in the future.
To make things more complicated, there are two different ways intrinsics are implemented in immintrin.h: either as an always_inline function containing calls to builtins (which can be target specific or target independent) or vector extension code, or as a macro wrapper around a target specific builtin. I believe I've removed all cases where the macro was around a target independent builtin.
To support the always_inline function case, this patch adds __attribute__((min_vector_width(128))) that can be used to tag these functions with their vector width. All x86 intrinsic functions that operate on vectors have been tagged with this attribute.
To support the macro case, all x86 specific builtins have also been tagged with the vector width that they require. Use of any builtin with this property will implicitly increase the min_vector_width of the function that calls it. I've done this as a new property in the attribute string for the builtin rather than basing it on the type string, so that we can opt into it on a per-builtin basis and avoid any impact to target independent builtins.
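A hedged sketch of the tagging this describes (the wrapper name and local typedef are hypothetical; the real headers apply the attributes through their shared attribute macros):
```
#include <immintrin.h>

typedef int __v16si_demo __attribute__((__vector_size__(64)));

// An always_inline 512-bit helper declares the minimum vector width it
// needs; the inliner merges this into "min-legal-vector-width" on callers.
static __inline__ __m512i
    __attribute__((__always_inline__, __nodebug__, __target__("avx512f"),
                   __min_vector_width__(512)))
add_epi32_demo(__m512i __A, __m512i __B) {
  return (__m512i)((__v16si_demo)__A + (__v16si_demo)__B);
}
```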
There will be future work to support vectors passed as function arguments and supporting inline assembly. And whatever else we can find that isn't covered by this patch.
Special thanks to Chandler who suggested this direction and reviewed a preview version of this patch. And thanks to Eric Christopher who has had many conversations with me about this issue.
Differential Revision: https://reviews.llvm.org/D48617
llvm-svn: 336583
I believe these have been broken since their introduction into clang.
I've enhanced the tests for these intrinsics to use a real rounding mode and to check all the intrinsic arguments instead of just the name.
llvm-svn: 336498
We had the mask versions of the rounding intrinsics, but not one without masking.
Also change the rounding tests to not use the CUR_DIRECTION rounding mode.
llvm-svn: 336470
This is similar to what was done to max/min recently.
These already reduced the vector width to 256 and 128 bits as we go, unlike the original max/min code.
Differential Revision: https://reviews.llvm.org/D48346
llvm-svn: 335253
We only need to use 512-bit vectors all the way through for the v8i64 reductions, since those max instructions are new to AVX512F and only available in 512 bits until SKX.
For v16i32 and floating point we have legacy 128/256-bit instructions we can use.
I've tried to use other intrinsics to reduce the verbosity of the code and avoid having to mention all the shuffles. I've also removed all the -1 shuffle indices so the output sequence is fully specified and not left to backend optimization.
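For example, the fully specified shuffle style looks roughly like this (hypothetical helper, not the header code):
```
typedef float __v4sf_demo __attribute__((__vector_size__(16)));

// Every lane of the result is named explicitly, so nothing is left as a -1
// (undef) index for the backend to fill in.
static inline __v4sf_demo swap_pairs_demo(__v4sf_demo v) {
  return __builtin_shufflevector(v, v, 2, 3, 0, 1);
}
```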
Differential Revision: https://reviews.llvm.org/D47401
llvm-svn: 335070