I'd like to make the select builtins require an avx512f, avx512bw, or avx512vl feature to match what is normally required to get masking. Truncate is special in that there are instructions with a 128/256-bit masked result even without avx512vl.
By using special builtins we can emit the select without using the 128/256-bit select builtins.
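For context, a minimal sketch of the usual masking pattern in the headers (macro name is hypothetical, assuming <immintrin.h>): the full result is computed and then blended with the passthru through a select builtin, which is what ties masking to the avx512f/avx512bw/avx512vl select builtins.

#include <immintrin.h>

/* Hypothetical macro: blends a computed 128-bit result with the passthru
   under the mask using the avx512vl-gated select builtin. */
#define SKETCH_MASK_BLEND_EPI32_128(U, RES, W)                          \
  ((__m128i)__builtin_ia32_selectd_128((__mmask8)(U),                   \
                                       (__v4si)(__m128i)(RES),          \
                                       (__v4si)(__m128i)(W)))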
llvm-svn: 334331
I'm looking into making the select builtins require avx512f, avx512bw, or avx512vl since masking operations generally require those features.
The extract builtins are funny because the 512-bit versions return a 128- or 256-bit vector with masking even when avx512vl is not supported.
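A hedged sketch of the shape involved (the exact builtin signature may differ; assuming <immintrin.h>): the 512-bit extract builtin carries the mask and passthru itself, so the 128-bit result can be masked without any 128-bit select builtin.

#include <immintrin.h>

/* Sketch only: the 512-bit masked extract produces a masked __m128i
   directly, without going through a 128-bit select builtin. */
#define SKETCH_MASK_EXTRACTI32X4(W, U, A, I)                             \
  ((__m128i)__builtin_ia32_extracti32x4_mask((__v16si)(__m512i)(A),      \
                                             (int)(I),                   \
                                             (__v4si)(__m128i)(W),       \
                                             (__mmask8)(U)))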
llvm-svn: 334330
Test changes are due to differences in how we generate undef elements now. We also changed the types used for extractf128_si256/insertf128_si256 to match the signature of the previously existing builtin, which this patch resurrects. This also matches gcc.
llvm-svn: 334261
We still emit shufflevector instructions; we just do it from CGBuiltin.cpp now. This ensures the intrinsics that use this are only available on CPUs that support the feature.
I also added range checking to the immediate, but only checked that it is 8 bits or smaller. We should maybe be stricter since we never use all 8 bits, but gcc doesn't seem to do that.
llvm-svn: 334237
Summary:
We recently switched to using selects in the intrinsic header files for FMA instructions. But the 512-bit versions support flavors with a rounding mode, which must be an Integer Constant Expression. This has forced those intrinsics to be implemented as macros. As it stands now, the mask and mask3 intrinsics evaluate one of their macro arguments twice. If that argument itself is another intrinsic macro, we can end up over-expanding macros. Or if it's something we could CSE later, it would show up multiple times when it shouldn't.
I tried adding __extension__ around the macro and making it a statement expression that declares a local variable. But whatever name you choose for the local variable can never be used as the name of an input to the macro in user code. If that happens you would end up with the same name on the LHS and RHS of an assignment after expansion. We might be safe if we use __ in front of the variable names, because those names are reserved and user code shouldn't use them, but I wasn't sure I wanted to make that claim.
The other option which I've chosen here, is to add back _mask, _maskz, and _mask3 flavors of the builtin which we will expand in CGBuiltin.cpp to replicate the argument as needed and insert any fneg needed on the third operand to make a subtract. The _maskz isn't truly necessary if we have an unmasked version or if we use the masked version with a -1 mask and wrap a select around it. But I've chosen to make things more uniform.
I separated out the scalar builtin handling to avoid too many things going on in EmitX86FMAExpr. It was different enough due to the extract and insert that the minor duplication of the CreateCall was probably worth it.
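A simplified sketch of the double-expansion problem described above (macro name is hypothetical, assuming <immintrin.h>): the select-wrapped rounding-mode macro has to mention its (A) argument twice, once as an FMA operand and once as the blend passthru, so (A) gets macro-expanded twice.

#include <immintrin.h>

#define SKETCH_MASK_FMADD_ROUND_PD(A, U, B, C, R)                        \
  ((__m512d)__builtin_ia32_selectpd_512(                                 \
      (__mmask8)(U),                                                     \
      (__v8df)_mm512_fmadd_round_pd((A), (B), (C), (R)),                 \
      (__v8df)(__m512d)(A)))          /* (A) is expanded a second time */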
Reviewers: tkrupa, RKSimon, spatel, GBuella
Reviewed By: tkrupa
Subscribers: cfe-commits
Differential Revision: https://reviews.llvm.org/D47724
llvm-svn: 334159
Adding __attribute__((aligned(32))) to __m256 breaks the implementation
of _mm256_loadu_ps on Windows. On Windows, alignment attributes have
higher precedence than packing attributes.
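A hedged sketch of why that matters (close to, but not necessarily identical to, the header; helper name is illustrative): _mm256_loadu_ps goes through a packed struct so the load is emitted with alignment 1, and an aligned(32) attribute on __m256 would win over the packing on Windows and re-align the member.

#include <immintrin.h>

static __inline__ __m256 __attribute__((__always_inline__, __target__("avx")))
sketch_loadu_ps(float const *__p)
{
  /* The packed struct is what makes this an unaligned (align 1) load. */
  struct __loadu_ps {
    __m256 __v;
  } __attribute__((__packed__, __may_alias__));
  return ((const struct __loadu_ps *)__p)->__v;
}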
We also might want to carefully consider the consequences of changing
our vector typedefs, since many users copy them and invent their own
new, non-Intel specific vector type names.
llvm-svn: 333958
This is more consistent with other usages of __builtin_shufflevector. Later optimization passes or codegen will detect the duplicate vector and replace it with undef. Using _mm_undefined just puts a zeroinitializer that still needs to be optimized out later.
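An illustrative helper (name and shuffle are only an example, assuming <immintrin.h>) showing the pattern: pass the same source vector for both shufflevector operands instead of an _mm_undefined_*() value.

#include <immintrin.h>

static __inline__ __m128 __attribute__((__always_inline__, __target__("avx")))
sketch_lower_half_ps(__m256 __a)
{
  /* Both operands are __a; the unreferenced upper half becomes undef
     during later optimization instead of a leftover zeroinitializer. */
  return (__m128)__builtin_shufflevector((__v8sf)__a, (__v8sf)__a, 0, 1, 2, 3);
}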
llvm-svn: 333944
One of the arguments was being expanded even though the passthru argument is unused due to the mask being all 1s. In that case the actual value doesn't matter, so we should use undef instead to avoid expanding the macro argument unnecessarily.
llvm-svn: 333865
This fixes two major problems:
- We were not capping vector alignment as desired on 32-bit ARM.
- We were using different alignments based on the AVX settings on
Intel, so we did not have a consistent ABI.
This is an ABI break, but we think we can get away with it because
vectors tend to be used mostly in inline code (which is why not having
a consistent ABI has not proven disastrous on Intel).
Intel's AVX types are specified as having 32-byte / 64-byte alignment,
so align them explicitly instead of relying on the base ABI rule.
Note that this sort of attribute is stripped from template arguments
in template substitution, so there's a possibility that code templated
over vectors will produce inadequately-aligned objects. The right
long-term solution for this is for alignment attributes to be
interpreted as true qualifiers and thus preserved in the canonical type.
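A sketch of the explicit alignment being described (typedef names are illustrative, not the in-tree definitions):

/* Spell the alignment out explicitly rather than relying on the base ABI
   rule for vector types. */
typedef float  sketch_m256  __attribute__((__vector_size__(32), __aligned__(32)));
typedef double sketch_m512d __attribute__((__vector_size__(64), __aligned__(64)));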
llvm-svn: 333791
The majority of the cases were correct. This fixes the few that weren't.
I also removed some superfluous parentheses in non-macros that confused my attempts at grepping for missing casts.
llvm-svn: 333615
I think this is a holdover from when we used to declare variables inside the macros, and it has been copied and pasted forward for years every time a new macro intrinsic gets added.
Interestingly, this caused some IRGen tests to be slightly more optimized. We now return a zeroinitializer directly instead of going through a store+load.
It also removed a bogus error message on another test.
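A hedged before/after sketch of the macro shape this describes (names are hypothetical, assuming <immintrin.h>): a statement expression with a local variable forces a store and reload of the zero vector, while a plain expression returns the zeroinitializer directly.

#include <immintrin.h>

/* Old style: local variable inside a statement expression. */
#define SKETCH_OLD_MASKZ_MOV(U, A) __extension__ ({                          \
  __m512i __W = _mm512_setzero_si512();                                      \
  (__m512i)__builtin_ia32_selectd_512((__mmask16)(U), (__v16si)(__m512i)(A), \
                                      (__v16si)__W); })

/* New style: a single expression, no local variable. */
#define SKETCH_NEW_MASKZ_MOV(U, A)                                           \
  ((__m512i)__builtin_ia32_selectd_512((__mmask16)(U), (__v16si)(__m512i)(A), \
                                       (__v16si)_mm512_setzero_si512()))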
llvm-svn: 333613
Most of the original comments used C style /* */ comments, but some C++ // comments had snuck in over time.
Still need to convert all the doxygen comments, which is much harder to do.
llvm-svn: 333603
We had quite a few for different element sizes of integers sometimes with strange target features attached to them.
We only need a single version for each of __m128i, __m256i, and __m512i, with the target feature that first introduced those types.
llvm-svn: 333568
This patch replaces all packed (and scalar without rounding
mode) fused intrinsics with fmadd/fmaddsub variations.
Then fmadd/fmaddsub are lowered to native IR.
Patch by tkrupa
Reviewers: craig.topper, sroland, spatel, RKSimon
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D47444
llvm-svn: 333555
Summary:
We only need to use 512-bit vectors all the way through for the v8i64 reductions, since those max instructions are new to avx512f and are only available in 512 bits until SKX.
For v16i32 and floating point we have legacy 128/256-bit instructions we can use.
I've tried to use other intrinsics to reduce the verbosity of the code and avoid having to mention all the shuffles. I've also removed all the -1 shuffle indices so the output sequence is fully specified and not left to backend optimization.
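A hedged sketch of the float max case (helper name is illustrative, not the header code, assuming <immintrin.h>): split with fully specified __builtin_shufflevector indices and finish with the legacy 256/128-bit max instructions rather than staying in 512 bits.

#include <immintrin.h>

static __inline__ float __attribute__((__always_inline__, __target__("avx512f")))
sketch_reduce_max_ps(__m512 __v)
{
  __m256 __lo = (__m256)__builtin_shufflevector((__v16sf)__v, (__v16sf)__v,
                                                0, 1, 2, 3, 4, 5, 6, 7);
  __m256 __hi = (__m256)__builtin_shufflevector((__v16sf)__v, (__v16sf)__v,
                                                8, 9, 10, 11, 12, 13, 14, 15);
  __m256 __t8 = _mm256_max_ps(__lo, __hi);                    /* 16 -> 8 lanes */
  __m128 __t4 = _mm_max_ps(_mm256_castps256_ps128(__t8),
                           _mm256_extractf128_ps(__t8, 1));   /* 8 -> 4 lanes */
  __m128 __t2 = _mm_max_ps(__t4, (__m128)__builtin_shufflevector(__t4, __t4, 2, 3, 0, 1));
  __m128 __t1 = _mm_max_ps(__t2, (__m128)__builtin_shufflevector(__t2, __t2, 1, 0, 3, 2));
  return __t1[0];
}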
Reviewers: RKSimon, spatel, GBuella
Subscribers: cfe-commits
Differential Revision: https://reviews.llvm.org/D47401
llvm-svn: 333347
Previously we negated the whole vector after splatting infinity. But it's better to negate the infinity before splatting. This generates IR with the negate already folded into the infinity constant.
llvm-svn: 333062
I believe this is safe assuming the default FP environment. The conversion might be inexact, but it can never overflow the FP type, so this shouldn't be undefined behavior for the uitofp/sitofp instructions.
We already do something similar for scalar conversions.
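A minimal sketch of the same idea at the vector level (typedef and function names are illustrative): __builtin_convertvector from 8 x i32 to 8 x double emits a plain sitofp on the whole vector.

typedef int    sketch_v8si __attribute__((__vector_size__(32)));
typedef double sketch_v8df __attribute__((__vector_size__(64)));

/* Becomes a single vector sitofp instruction in IR. */
static __inline__ sketch_v8df sketch_cvtepi32_pd(sketch_v8si __a)
{
  return __builtin_convertvector(__a, sketch_v8df);
}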
Differential Revision: https://reviews.llvm.org/D46863
llvm-svn: 332882
As long as the destination type is a 256- or 128-bit vector with the same number of elements, we can use __builtin_convertvector to directly generate a trunc IR instruction, which will be handled natively by the backend.
Differential Revision: https://reviews.llvm.org/D46742
llvm-svn: 332266
If we're using the default rounding mode we can let __builtin_convertvector generate an fpext. This matches the 128-bit and 256-bit versions.
If we're using the version that takes an explicit rounding mode argument we would need to look at the immediate to see if it's CUR_DIRECTION.
llvm-svn: 332210
We can use direct C code for these that will use uitofp and insertelement instructions.
For the versions that take an explicit rounding mode we can't do this.
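A hedged sketch of that "direct C" pattern (helper name is illustrative, assuming <immintrin.h>): writing the scalar into lane 0 turns into uitofp plus insertelement in IR.

#include <immintrin.h>

static __inline__ __m128 __attribute__((__always_inline__, __target__("avx512f")))
sketch_cvtu32_ss(__m128 __A, unsigned int __B)
{
  __A[0] = __B;   /* uitofp of __B, then insertelement into lane 0 */
  return __A;
}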
llvm-svn: 332203
This is unnecessary for AVX512VL supporting CPUs like SKX. We can just emit a 128-bit masked load/store here no matter what. The backend will widen it to 512-bits on KNL CPUs.
Fixes the frontend portion of PR37386. Need to fix the backend to optimize the new sequences well.
llvm-svn: 331958
This is similar to the LLVM change https://reviews.llvm.org/D46290.
We've been running doxygen with the autobrief option for a couple of
years now. This makes the \brief markers into our comments
redundant. Since they are a visual distraction and we don't want to
encourage more \brief markers in new code either, this patch removes
them all.
Patch produced by
for i in $(git grep -l '\@brief'); do perl -pi -e 's/\@brief //g' $i & done
for i in $(git grep -l '\\brief'); do perl -pi -e 's/\\brief //g' $i & done
Differential Revision: https://reviews.llvm.org/D46320
llvm-svn: 331834
On AVX512F targets we'll produce an emulated sequence using 3 pmuludqs with shifts and adds. On AVX512DQ targets we'll use vpmullq.
Fixes PR37140.
llvm-svn: 330923
The unmasked versions already didn't have this restriction. I don't think gcc or icc limit these to 64-bit mode, so we shouldn't either.
llvm-svn: 330681
Summary:
kunpck intrinsics were removed in favor of native IR a few months ago. The implementation lowered them by operating on the integer types passed to the intrinsic and then just shifting, masking, and oring them together. A special X86 DAG combine was added to recognize this pattern and turn it into a concat_vector operation.
I think it makes more sense to keep the IR implementation closer to vector operations on vXi1, given that we expect these builtins to be used around other builtins that operate on k-registers, which we try to represent in IR with vXi1. InstCombine should be able to get rid of the bitcasts between integers and vXi1, leaving only the vector operations.
Reviewers: RKSimon, spatel, zvi, jina.nahias
Reviewed By: RKSimon
Subscribers: cfe-commits
Differential Revision: https://reviews.llvm.org/D42016
llvm-svn: 322461
This patch, together with a matching llvm patch (https://reviews.llvm.org/D39720), implements the lowering of X86 kunpack intrinsics to IR.
Differential Revision: https://reviews.llvm.org/D39719
Change-Id: Id5d3cb394ad33b98be79a6783d1d15569e2b798d
llvm-svn: 319777
Change the intrinsic header files to lower the test and testn intrinsics to IR code.
Removed the test and testn builtins from clang.
Differential Revision: https://reviews.llvm.org/D38737
llvm-svn: 318035
This patch, together with a matching llvm patch (https://reviews.llvm.org/D38671), implements the lowering of X86 shuffle i/f intrinsics to IR.
Differential Revision: https://reviews.llvm.org/D38672
Change-Id: I9b3c2f2b34323bd9ccb21d0c1832f848b88ec047
llvm-svn: 318025
The __builtin_ia32_pbroadcastq512_mem_mask we were previously trying to use in 32-bit mode is not implemented in the x86 backend and causes isel to fail in release builds. In debug builds it fails even earlier during legalization with an llvm_unreachable.
While there, add the missing test case for this intrinsic for 64-bit mode.
This fixes PR34631. D37668 should be able to recover this for 32-bit mode soon. But I wanted to fix the crash ahead of that.
llvm-svn: 313392
Based off the Intel Intrinsics guide, we should expect a void const* argument.
Prevents 'passing 'const void *' to parameter of type 'void *' discards qualifiers' warnings.
Differential Revision: https://reviews.llvm.org/D37449
llvm-svn: 312523
Clang specifies a max type alignment of 16 bytes on darwin targets (annoyingly in the driver not via cc1), meaning that the builtin nontemporal stores don't correctly align the loads/stores to 32 or 64 bytes when required, resulting in lowering to temporal unaligned loads/stores.
This patch casts the vectors to explicitly aligned types prior to the load/store to ensure that the required alignment is respected.
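A hedged sketch of the cast described (close to, but not necessarily identical to, the header change; helper name is illustrative, assuming <immintrin.h>): give the non-temporal store an explicitly aligned vector type so the 32-byte alignment survives the 16-byte max-type-alignment cap.

#include <immintrin.h>

static __inline__ void __attribute__((__always_inline__, __target__("avx")))
sketch_stream_ps(float *__p, __m256 __a)
{
  typedef __v8sf __v8sf_aligned __attribute__((__aligned__(32)));
  __builtin_nontemporal_store((__v8sf_aligned)__a, (__v8sf_aligned *)__p);
}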
Differential Revision: https://reviews.llvm.org/D35996
llvm-svn: 309488
MOVNTDQA non-temporal aligned vector loads can be correctly represented using generic builtin loads, allowing us to remove the existing x86 intrinsics.
LLVM companion patch: D31767.
Differential Revision: https://reviews.llvm.org/D31766
llvm-svn: 300326
This will allow the backend to constant fold these to generic shuffle vectors, like the 128-bit and 256-bit versions, without having to worry about handling masking.
llvm-svn: 289351
Both the (V)CVTDQ2PD (i32 to f64) and (V)CVTUDQ2PD (u32 to f64) conversion instructions are lossless and can be safely represented as generic __builtin_convertvector calls instead of x86 intrinsics without affecting final codegen.
This patch removes the clang builtins and their use in the headers - a future patch will deal with removing the llvm intrinsics.
This is an extension patch to D20528 which dealt with the equivalent sse/avx cases.
Differential Revision: https://reviews.llvm.org/D26686
llvm-svn: 287088
This is part of a set of changes to allow InstCombine in the backend to optimize variable shifts without having to know about masking.
llvm-svn: 286757
Summary: Inverting the mask argument does not reflect the intended semantics of the intrinsic.
Reviewers: igorb, delena
Subscribers: cfe-commits
Differential Revision: https://reviews.llvm.org/D26019
llvm-svn: 286733
Unfortunately, the backend currently doesn't fold masks into the instructions correctly when they come from these shufflevectors. I'll work on that in a future commit.
llvm-svn: 285667
Unfortunately, the backend currently doesn't fold masks into the instructions correctly when they come from these shufflevectors. I'll work on that in a future commit.
llvm-svn: 285540
After LGTM and Check-all
Vector-reduction arithmetic accepts vectors as inputs and produces
scalars as outputs. This class of vector operation forms the basis
of many scientific computations. In vector-reduction arithmetic,
the evaluation of f is independent of the order of the input elements of V.
Reviewers: craig.topper, igorb
Differential Revision: https://reviews.llvm.org/D25988
llvm-svn: 285493
Summary: The preserved input should be the first argument and the vector inputs should be in the same order as the intrinsics it is used to implement.
Reviewers: igorb, delena
Subscribers: cfe-commits
Differential Revision: https://reviews.llvm.org/D25902
llvm-svn: 285175
Committed after LGTM and check-all
Vector-reduction arithmetic accepts vectors as inputs and produces scalars as outputs.
This class of vector operation forms the basis of many scientific computations.
In vector-reduction arithmetic, the evaluation of f is independent of the order of the input elements of V.
Used the bisection method. At each step, we partition the vector from the previous
step in half, and the operation is performed on its two halves.
This takes log2(n) steps where n is the number of elements in the vector.
Reviewers: igorb, craig.topper
Differential Revision: https://reviews.llvm.org/D25527
llvm-svn: 285054
Committed after LGTM and check-all
Vector-reduction arithmetic accepts vectors as inputs and produces scalars as outputs.
This class of vector operation forms the basis of many scientific computations.
In vector-reduction arithmetic, the evaluation of f is independent of the order of the input elements of V.
Used the bisection method. At each step, we partition the vector from the previous
step in half, and the operation is performed on its two halves.
This takes log2(n) steps where n is the number of elements in the vector.
Differential Revision: https://reviews.llvm.org/D25527
llvm-svn: 284963
Add abs intrinsics that use native LLVM IR.
Change _mm512_mask[z]_and_epi{32|64} to use the select intrinsic.
Differential Revision: http://reviews.llvm.org/D21973
llvm-svn: 274542