Commit Graph

269 Commits

Author SHA1 Message Date
Wang, Pengfei 61da20575d [X86] Convert fmin/fmax _mm_reduce_* intrinsics to emit llvm.reduction intrinsics (PR47506)
This is a follow up of D92940.

We have successfully converted fadd/fmul _mm_reduce_* intrinsics to
llvm.reduction + reassoc flag. We can do the same approach for fmin/fmax
too, i.e. llvm.reduction + nnan flag.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D93179
2021-02-15 08:52:06 +08:00
Wang, Pengfei dd2460ed5d [X86] Always assign reassoc flag for intrinsics *reduce_add/mul_ps/pd.
Intrinsics *reduce_add/mul_ps/pd have assumption that the elements in
the vector are reassociable. So we need to always assign the reassoc
flag when we call _mm_reduce_* intrinsics.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D96231
2021-02-09 21:14:06 +08:00
Simon Pilgrim 4855a1004d [X86] Convert fadd/fmul _mm_reduce_* intrinsics to emit llvm.reduction intrinsics (PR47506)
Followup to D87604, having confirmed on PR47506 that we can use the llvm codegen expansion for fadd/fmul as well.

Differential Revision: https://reviews.llvm.org/D92940
2020-12-13 15:37:35 +00:00
Simon Pilgrim 6c23cbc560 [X86] Convert integer _mm_reduce_* intrinsics to emit llvm.reduction intrinsics (PR47506)
Emit the equivalent integer reduction intrinsics in IR instead of expanding to shuffle+arithmetic sequences.

The fadd/fmul reductions might be trickier as they assume a similar bisection reduction while the generic intrinsics assume a sequential reduction (intel docs are ambiguous on the correct approach) - I'm not sure if we want to always tag them with reassoc? Anyway, that issue can wait until a separate fp patch along with the fmin/fmax reductions.

Differential Revision: https://reviews.llvm.org/D87604
2020-10-13 09:28:39 +01:00
Craig Topper 1b02db52b7 [X86] Update some av512 shift intrinsics to use "unsigned int" parameter instead of int to match Intel documentation
There are 65 that take a scalar shift amount. Intel documentation shows 60 of them taking unsigned int. There are 5 versions of srli_epi16 that use int, the 512-bit maskz and 128/256 mask/maskz.

Fixes PR45931

Differential Revision: https://reviews.llvm.org/D80251
2020-05-22 20:12:57 -07:00
Warren Ristow 7fcd9e3f70 [X86] Mark various pointer arguments in builtins as const
Enabling `-Wcast-qual` identified many casts in various system headers
that were dropping the `const` qualifier.  Fixing those missing
qualifiers pointed out that a few of the definitions of the builtins
did not properly identify their arguments as `const` pointers.  This
commit fixes those builtin definitions, and the system header files
so that they no longer drop the qualifier.

Differential Revision: https://reviews.llvm.org/D71718
2019-12-19 11:42:11 -08:00
Richard Smith 5b2ba5afa9 Fix reliance on -flax-vector-conversions in AVX intrinsics headers and
corresponding tests.

llvm-svn: 372063
2019-09-17 03:56:30 +00:00
Pengfei Wang dea9cad10e [x86] Fix bugs of some intrinsic functions in CLANG : _mm512_stream_ps, _mm512_stream_pd, _mm512_stream_si512
Reviewers: craig.topper, pengfei, LuoYuanke, RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Patch by Bing Yu (yubing)

Differential Revision: https://reviews.llvm.org/D66786

llvm-svn: 370691
2019-09-03 02:06:15 +00:00
Pengfei Wang caac097fbf [x86] Adding support for some missing intrinsics: _mm512_cvtsi512_si32
Summary:
Adding support for some missing intrinsics:
_mm512_cvtsi512_si32

Reviewers: craig.topper, pengfei, LuoYuanke, spatel, RKSimon

Reviewed By: craig.topper

Subscribers: llvm-commits

Patch by Bing Yu (yubing)

Differential Revision: https://reviews.llvm.org/D66785

llvm-svn: 370297
2019-08-29 06:18:34 +00:00
Craig Topper 6d9fb68c53 [X86] Make _mm_mask_cvtps_ph, _mm_maskz_cvtps_ph, _mm256_mask_cvtps_ph, and _mm256_maskz_cvtps_ph aliases for their corresponding cvt_roundps_ph intrinsic.
These intrinsics should always take an immediate for the rounding mode.
The base instruction comes from before EVEX embdedded rounding. The
user should always provide the immediate rather than us assuming
CUR_DIRECTION.

Make the 512-bit versions also explicit aliases instead of copy
pasting the code.

llvm-svn: 363961
2019-06-20 18:24:29 +00:00
Craig Topper 3d7ecc4618 [X86] Remove semicolons at the end of intrinsics implemented as macros so they can be used as arguments to other intrinsics.
Also fix one intrinsic that was using variable names without underscores.

Fixes PR41932

llvm-svn: 361109
2019-05-19 01:01:52 +00:00
Chandler Carruth 4cf5743b77 Move the builtin headers to use the new license file header.
Summary:
These all had somewhat custom file headers with different text from the
ones I searched for previously, and so I missed them. Thanks to Hal and
Kristina and others who prompted me to fix this, and sorry it took so
long.

Reviewers: hfinkel

Subscribers: mcrosier, javed.absar, cfe-commits

Tags: #clang

Differential Revision: https://reviews.llvm.org/D60406

llvm-svn: 357941
2019-04-08 20:51:30 +00:00
Craig Topper be4cbe8726 [X86] Add explicit alignment to __m128/__m128i/__m128d/etc. to allow matching of MSVC behavior with #pragma pack.
Summary:
With MSVC, #pragma pack is ignored when there is explicit alignment. This differs from gcc. Clang emulates this difference when compiling for Windows.

It appears that MSVC and its headers consider the __m128/__m128i/__m128d/etc. types to be explicitly aligned and ignores #pragma pack for them. Since we don't have explicit alignment on them in our headers, we don't match the MSVC behavior here.

This patch adds explicit alignment to match this behavior. I'm hoping this won't cause any problems when we're not emulating MSVC. But if someone knows of something that would be different we can swith to conditionally adding the alignment based on _MSC_VER.

I had to add explicitly unaligned types as well so we could use them in the loadu/storeu intrinsics which use __attribute__(__packed__). Using the now explicitly aligned types wouldn't produce align 1 accesses when targeting Windows.

Reviewers: rnk, erichkeane, spatel, RKSimon

Subscribers: cfe-commits

Tags: #clang

Differential Revision: https://reviews.llvm.org/D57961

llvm-svn: 353555
2019-02-08 19:45:08 +00:00
Craig Topper bdbe5c7dc7 [X86] Make the pointer arguments to avx512 gather/scatter intrinsics 'void*' to match gcc and Intel's documentation.
The avx2 gather intrinsics are documented to use 'int', 'long long', 'float', or 'double' *. So I'm leaving those. This matches gcc.

llvm-svn: 350696
2019-01-09 07:36:01 +00:00
Craig Topper 58508be3c0 [X86] Add missing intrinsics to match icc.
This adds
_mm_and_epi32, _mm_and_epi64
_mm_andnot_epi32, _mm_andnot_epi64
_mm_or_epi32, _mm_or_epi64
_mm_xor_epi32, _mm_xor_epi64
_mm256_and_epi32, _mm256_and_epi64
_mm256_andnot_epi32, _mm256_andnot_epi64
_mm256_or_epi32, _mm256_or_epi64
_mm256_xor_epi32, _mm256_xor_epi64
_mm_loadu_epi32, _mm_loadu_epi64
_mm_load_epi32, _mm_load_epi64
_mm256_loadu_epi32, _mm256_loadu_epi64
_mm256_load_epi32, _mm256_load_epi64
_mm512_loadu_epi32, _mm512_loadu_epi64
_mm512_load_epi32, _mm512_load_epi64
_mm_storeu_epi32, _mm_storeu_epi64
_mm_store_epi32, _mm_load_epi64
_mm256_storeu_epi32, _mm256_storeu_epi64
_mm256_store_epi32, _mm256_load_epi64
_mm512_storeu_epi32, _mm512_storeu_epi64
_mm512_store_epi32,V _mm512_load_epi64

llvm-svn: 344861
2018-10-20 19:28:50 +00:00
Craig Topper 42a4d0822e [X86] Add k-mask conversion and load/store instrinsics to match gcc and icc.
This adds:
_cvtmask8_u32, _cvtmask16_u32, _cvtmask32_u32, _cvtmask64_u64
_cvtu32_mask8, _cvtu32_mask16, _cvtu32_mask32, _cvtu64_mask64
_load_mask8, _load_mask16, _load_mask32, _load_mask64
_store_mask8, _store_mask16, _store_mask32, _store_mask64

These are currently missing from the Intel Intrinsics Guide webpage.

llvm-svn: 341251
2018-08-31 20:41:06 +00:00
Craig Topper 2aa8efc820 [X86] Add kshift intrinsics to match gcc and icc.
This adds the following intrinsics:
_kshiftli_mask8
_kshiftli_mask16
_kshiftli_mask32
_kshiftli_mask64
_kshiftri_mask8
_kshiftri_mask16
_kshiftri_mask32
_kshiftri_mask64

llvm-svn: 341234
2018-08-31 18:22:52 +00:00
Craig Topper cb5fd56c7f [X86] Add kortest intrinsics for 8, 32, and 64 bit masks. Add new intrinsic names for 16 bit masks.
This matches gcc and icc despite not being documented in the Intel Intrinsics Guide.

llvm-svn: 340798
2018-08-28 06:28:25 +00:00
Craig Topper c330ca8611 [X86] Add intrinsics for kand/kandn/knot/kor/kxnor/kxor with 8, 32, and 64-bit mask registers.
This also adds a second intrinsic name for the 16-bit mask versions.

These intrinsics match gcc and icc. They just aren't published in the Intel Intrinsics Guide so I only recently found they existed.

llvm-svn: 340719
2018-08-27 06:20:22 +00:00
Craig Topper 266b858705 [X86] Undef __DEFAULT_FN_ATTRS in avx512fintrin.h.
Fixes test failure after r340713

llvm-svn: 340714
2018-08-27 05:44:45 +00:00
Craig Topper f821f5314e [X86] Don't set min_vector_width to 512 on intrinsics that only operate on k registers.
llvm-svn: 340713
2018-08-27 05:27:15 +00:00
Fangrui Song 6907ce2f8f Remove trailing space
sed -Ei 's/[[:space:]]+$//' include/**/*.{def,h,td} lib/**/*.{cpp,h}

llvm-svn: 338291
2018-07-30 19:24:48 +00:00
Craig Topper 36ab775cc1 [X86] Use masked the masked scalar fma builtins to implement the default rounding version of the fma intrinsics.
The rounding mode is checked in CGBuiltin.cpp to generate the correct intrinsic call.

Making this switch switchs the masking to use the i8 bitcast to <8 x i1> and extract i1 version of the IR for the mask. Previously we ended up with a scalar 'and' plus an icmp.

llvm-svn: 336637
2018-07-10 04:38:29 +00:00
Craig Topper 638426fc36 [X86] Add __builtin_ia32_selectss_128 and __builtin_ia32_selectsd_128 that is suitable for use in scalar mask intrinsics.
This will convert the i8 mask argument to <8 x i1> and extract an i1 and then emit a select instruction. This replaces the '(__U & 1)" and ternary operator used in some of intrinsics. The old sequence was lowered to a scalar and and compare. The new sequence uses an i1 vector that will interoperate better with other mask intrinsics.

This removes the need to handle div_ss/sd specially in CGBuiltin.cpp. A follow up patch will add the GCCBuiltin name back in llvm and remove the custom handling.

I made some adjustments to legacy move_ss/sd intrinsics which we reused here to do a simpler extract and insert instead of 2 extracts and two inserts or a shuffle.

llvm-svn: 336622
2018-07-10 00:37:25 +00:00
Craig Topper 74c10e3236 [Builtins][Attributes][X86] Tag all X86 builtins with their required vector width. Add a min_vector_width function attribute and tag all x86 instrinsics with it
This is part of an ongoing attempt at making 512 bit vectors illegal in the X86 backend type legalizer due to CPU frequency penalties associated with wide vectors on Skylake Server CPUs. We want the loop vectorizer to be able to emit IR containing wide vectors as intermediate operations in vectorized code and allow these wide vectors to be legalized to 256 bits by the X86 backend even though we are targetting a CPU that supports 512 bit vectors. This is similar to what happens with an AVX2 CPU, the vectorizer can emit wide vectors and the backend will split them. We want this splitting behavior, but still be able to use new Skylake instructions that work on 256-bit vectors and support things like masking and gather/scatter.

Of course if the user uses explicit vector code in their source code we need to not split those operations. Especially if they have used any of the 512-bit vector intrinsics from immintrin.h. And we need to make it so that merely using the intrinsics produces the expected code in order to be backwards compatible.

To support this goal, this patch adds a new IR function attribute "min-legal-vector-width" that can indicate the need for a minimum vector width to be legal in the backend. We need to ensure this attribute is set to the largest vector width needed by any intrinsics from immintrin.h that the function uses. The inliner will be reponsible for merging this attribute when a function is inlined. We may also need a way to limit inlining in the future as well, but we can discuss that in the future.

To make things more complicated, there are two different ways intrinsics are implemented in immintrin.h. Either as an always_inline function containing calls to builtins(can be target specific or target independent) or vector extension code. Or as a macro wrapper around a taget specific builtin. I believe I've removed all cases where the macro was around a target independent builtin.

To support the always_inline function case this patch adds attribute((min_vector_width(128))) that can be used to tag these functions with their vector width. All x86 intrinsic functions that operate on vectors have been tagged with this attribute.

To support the macro case, all x86 specific builtins have also been tagged with the vector width that they require. Use of any builtin with this property will implicitly increase the min_vector_width of the function that calls it. I've done this as a new property in the attribute string for the builtin rather than basing it on the type string so that we can opt into it on a per builtin basis and avoid any impact to target independent builtins.

There will be future work to support vectors passed as function arguments and supporting inline assembly. And whatever else we can find that isn't covered by this patch.

Special thanks to Chandler who suggested this direction and reviewed a preview version of this patch. And thanks to Eric Christopher who has had many conversations with me about this issue.

Differential Revision: https://reviews.llvm.org/D48617

llvm-svn: 336583
2018-07-09 19:00:16 +00:00
Craig Topper 0a485d13a4 [X86] Remove some unnecessarily escaped new lines from avx512fintrin.h
llvm-svn: 336499
2018-07-07 22:03:19 +00:00
Craig Topper 3e720a302c [X86] Fix a few intrinsics that were ignoring their rounding mode argument and hardcoded _MM_FROUND_CUR_DIRECTION internally.
I believe these have been broken since their introduction into clang.

I've enhanced the tests for these intrinsics to using a real rounding mode and checking all the intrinsic arguments instead of just the name.

llvm-svn: 336498
2018-07-07 22:03:16 +00:00
Craig Topper 218da62091 [X86] Change _mm512_shuffle_pd and _mm512_shuffle_ps to use target specific shuffle builtins instead of generic __builtin_shufflevector.
I added the builtins for 128, 256, and 512 bits recently but looks like I failed to convert to using the 512 bit one.

llvm-svn: 336488
2018-07-07 17:03:34 +00:00
Craig Topper 5cbeeedd27 [X86] Fix various type mismatches in intrinsic headers and intrinsic tests that cause extra bitcasts to be emitted in the IR.
Found via imprecise grepping of the -O0 IR. There could still be more bugs out there.

llvm-svn: 336487
2018-07-07 17:03:32 +00:00
Craig Topper 10f20fc42b [X86] Add missing scalar fma intrinsics with rounding, but no mask.
We had the mask versions of the rounding intrinsics, but not one without masking.

Also change the rounding tests to not use the CUR_DIRECTION rounding mode.

llvm-svn: 336470
2018-07-06 22:08:43 +00:00
Craig Topper 0e9de769a0 [X86] Remove masking from the avx512 rotate builtins. Use a select builtin instead.
llvm-svn: 336036
2018-06-30 01:32:14 +00:00
Craig Topper 8bf793fb35 [X86] Remove masking from the avx512 packed sqrt builtins. Use select builtins instead.
llvm-svn: 335945
2018-06-29 05:43:33 +00:00
Craig Topper ddfe69cc99 [X86] Rewrite the add/mul/or/and reduction intrinsics to make better use of other intrinsics and remove undef shuffle indices.
Similar to what was done to max/min recently.

These already reduced the vector width to 256 and 128 bit as we go unlike the original max/min code.

Differential Revision: https://reviews.llvm.org/D48346

llvm-svn: 335253
2018-06-21 16:41:28 +00:00
Craig Topper 2da60bc231 [X86] Remove masking from the 512-bit floating point max/min builtins. Use select in IR instead.
llvm-svn: 335200
2018-06-21 05:01:01 +00:00
Craig Topper d334615c23 [X86] Undefine _mm512_mask_reduce_operator macro in avx512fintrin.h before redefining it.
llvm-svn: 335086
2018-06-20 00:31:39 +00:00
Craig Topper 61495b3650 Recommit r335070 "[X86] Rewrite the max and min reduction intrinsics to make better use of other functions and to reduce width to 256 and 128 bits were possible.""
Test has been updated to reflect the IRGen.

llvm-svn: 335075
2018-06-19 21:00:30 +00:00
Craig Topper 79b13a0348 Revert r335070 "[X86] Rewrite the max and min reduction intrinsics to make better use of other functions and to reduce width to 256 and 128 bits were possible."
The test changes are failing the buildbot and its going to take me some time to fix it.

llvm-svn: 335072
2018-06-19 19:37:07 +00:00
Craig Topper 873afd0262 [X86] Rewrite the max and min reduction intrinsics to make better use of other functions and to reduce width to 256 and 128 bits were possible.
We only need to use 512 bit vectors all the way through v8i64 reductions since those max instructions are new to avx512f and only available in 512 bits until SKX.

For v16i32 and floating point we have legacy 128/256 bit instructions we can use.

I've tried to use other intrinsics to reduce the verbosity of the code and avoid having to mention all the shuffles. I've also removed all the -1 shuffle indices so the output sequence is fully specified and not left to backend optimization.

Differential Revision: https://reviews.llvm.org/D47401

llvm-svn: 335070
2018-06-19 19:13:54 +00:00
Tomasz Krupa 82aa42af49 [X86] Lowering Mask Scalar intrinsics to native IR (Clang part)
Summary: Lowering add, sub, mul, and div mask scalar intrinsic calls
to native IR.

Reviewers: craig.topper, RKSimon, spatel, sroland

Reviewed By: craig.topper

Subscribers: cfe-commits

Differential Revision: https://reviews.llvm.org/D47979

llvm-svn: 334741
2018-06-14 17:36:23 +00:00
Craig Topper 3614b41a8e [X86] Remove masking from the 512-bit packed floating point add/sub/mul/div builtins. Use select in IR instead.
llvm-svn: 334359
2018-06-10 06:01:42 +00:00
Craig Topper 88097d9355 [X86] Add back some masked vector truncate builtins. Custom IRgen a a few others.
I'd like to make the select builtins require an avx512f, avx512bw, or avx512vl fature to match what is normally required to get masking. Truncate is special in that there are instructions with a 128/256-bit masked result even without avx512vl.

By using special buitlins we can emit a select without using the 128/256-bit select builtins.

llvm-svn: 334331
2018-06-08 21:50:08 +00:00
Craig Topper 5f50f33806 [X86] Fold masking into subvector extract builtins.
I'm looking into making the select builtins require avx512f, avx512bw, or avx512vl since masking operations generally require those features.

The extract builtins are funny because the 512-bit versions return a 128 or 256 bit vector with masking even when avx512vl is not supported.

llvm-svn: 334330
2018-06-08 21:50:07 +00:00
Craig Topper 03f4f04b91 [X86] Add builtins for vpermq/vpermpd instructions to enable target feature checking.
llvm-svn: 334311
2018-06-08 18:00:25 +00:00
Craig Topper 03de166ccd [X86] Add builtins for pshufd, pshuflw, and pshufhw to enable target feature and immediate range checking.
llvm-svn: 334265
2018-06-08 06:13:16 +00:00
Craig Topper 3428beeb2f [X86] Add subvector insert and extract builtins to enable target feature checking and immediate range checking.
Test changes are due to differences in how we generate undef elements now. We also changed the types used for extractf128_si256/insertf128_si256 to match the signature of the builtin that previously existed which this patch resurrects. This also matches gcc.

llvm-svn: 334261
2018-06-08 03:24:47 +00:00
Craig Topper acf5601961 [X86] Add builtins for vpermilps/pd instructions to enable target feature checking.
llvm-svn: 334256
2018-06-08 00:59:27 +00:00
Craig Topper 9392136414 [X86] Add builtins for shuff32x4/shuff64x2/shufi32x4/shuff64x2 to enable target feature checking and immediate range checking.
llvm-svn: 334244
2018-06-07 23:03:08 +00:00
Craig Topper e56819eb69 [X86] Add builtins for VALIGNQ/VALIGND to enable proper target feature checking.
We still emit shufflevector instructions we just do it from CGBuiltin.cpp now. This ensures the intrinsics that use this are only available on CPUs that support the feature.

I also added range checking to the immediate, but only checked it is 8 bits or smaller. We should maybe be stricter since we never use all 8 bits, but gcc doesn't seem to do that.

llvm-svn: 334237
2018-06-07 21:27:41 +00:00
Craig Topper b92c77d176 [X86] Add back _mask, _maskz, and _mask3 builtins for some 512-bit fmadd/fmsub/fmaddsub/fmsubadd builtins.
Summary:
We recently switch to using a selects in the intrinsics header files for FMA instructions. But the 512-bit versions support flavors with rounding mode which must be an Integer Constant Expression. This has forced those intrinsics to be implemented as macros. As it stands now the mask and mask3 intrinsics evaluate one of their macro arguments twice. If that argument itself is another intrinsic macro, we can end up over expanding macros. Or if its something we can CSE later it would show up multiple times when it shouldn't.

I tried adding __extension__ around the macro and making it an expression statement and declaring a local variable. But whatever name you choose for the local variable can never be used as the name of an input to the macro in user code. If that happens you would end up with the same name on the LHS and RHS of an assignment after expansion. We might be safe if we use __ in front of the variable names because those names are reserved and user code shouldn't use that, but I wasn't sure I wanted to make that claim.

The other option which I've chosen here, is to add back _mask, _maskz, and _mask3 flavors of the builtin which we will expand in CGBuiltin.cpp to replicate the argument as needed and insert any fneg needed on the third operand to make a subtract. The _maskz isn't truly necessary if we have an unmasked version or if we use the masked version with a -1 mask and wrap a select around it. But I've chosen to make things more uniform.

I separated out the scalar builtin handling to avoid too many things going on in EmitX86FMAExpr. It was different enough due to the extract and insert that the minor duplication of the CreateCall was probably worth it.

Reviewers: tkrupa, RKSimon, spatel, GBuella

Reviewed By: tkrupa

Subscribers: cfe-commits

Differential Revision: https://reviews.llvm.org/D47724

llvm-svn: 334159
2018-06-07 02:46:02 +00:00
Reid Kleckner 89fbd55145 Revert r333791 "Cap "voluntary" vector alignment at 16 for all Darwin platforms."
Adding __attribute__((aligned(32))) to __m256 breaks the implementation
of _mm256_loadu_ps on Windows. On Windows, alignment attributes have
higher precedence than packing attributes.

We also might want to carefully consider the consequences of changing
our vector typedefs, since many users copy them and invent their own
new, non-Intel specific vector type names.

llvm-svn: 333958
2018-06-04 21:39:20 +00:00