Summary: The preserved input should be the first argument and the vector inputs should be in the same order as the intrinsics it is used to implement.
Reviewers: igorb, delena
Subscribers: cfe-commits
Differential Revision: https://reviews.llvm.org/D25902
llvm-svn: 285175
Committed after LGTM and check-all
Vector-reduction arithmetic accepts vectors as inputs and produces scalars as outputs.
This class of vector operation forms the basis of many scientific computations.
In vector-reduction arithmetic, the evaluation off is independent of the order of the input elements of V.
Used bisection method. At each step, we partition the vector with previous
step in half, and the operation is performed on its two halves.
This takes log2(n) steps where n is the number of elements in the vector.
Reviwer: 1. igorb
2. craig.topper
Differential Revision: https://reviews.llvm.org/D25527
llvm-svn: 285054
Committed after LGTM and check-all
Vector-reduction arithmetic accepts vectors as inputs and produces scalars as outputs.
This class of vector operation forms the basis of many scientific computations.
In vector-reduction arithmetic, the evaluation off is independent of the order of the input elements of V.
Used bisection method. At each step, we partition the vector with previous
step in half, and the operation is performed on its two halves.
This takes log2(n) steps where n is the number of elements in the vector.
Differential Revision: https://reviews.llvm.org/D25527
llvm-svn: 284963
With this patch, all intrinsics in this file (with an exception of a handful of a recently added ones) will be documented. I will send out a patch for 4 missining intrisics later.
The doxygen comments are automatically generated based on Sony's intrinsics document.
I got an OK from Eric Christopher to commit doxygen comments without prior code
review upstream. This patch was internally reviewed by Yunzhong Gao.
llvm-svn: 284934
With this patch, 75% of the intrinsics in this file will be documented now. The patches for the rest of the intrisics in this file will be send out later.
The doxygen comments are automatically generated based on Sony's intrinsics document.
I got an OK from Eric Christopher to commit doxygen comments without prior code review upstream. This patch was internally reviewed by Yunzhong Gao.
llvm-svn: 284754
Summary: We need `__stosb` to be an intrinsic, because SecureZeroMemory function uses it without including intrin.h. Implementing it as a volatile memset is not consistent with MSDN specification, but it gives us target-independent IR while keeping the most important properties of `__stosb`.
Reviewers: rnk, hans, thakis, majnemer
Subscribers: cfe-commits
Differential Revision: https://reviews.llvm.org/D25334
llvm-svn: 284253
Summary: Previously global 64-bit versions of _Interlocked functions broke buildbots on i386, so now I'm adding them as builtins for x86-64 and ARM only (should they be also on AArch64? I had problems with testing it for AArch64, so I left it)
Reviewers: hans, majnemer, mstorsjo, rnk
Subscribers: cfe-commits, aemerson
Differential Revision: https://reviews.llvm.org/D25576
llvm-svn: 284172
Summary: _BitScan intrinsics (and some others, for example _Interlocked and _bittest) are supposed to work on both ARM and x86. This is an attempt to isolate them, avoiding repeating their code or writing separate function for each builtin.
Reviewers: hans, thakis, rnk, majnemer
Subscribers: RKSimon, cfe-commits, aemerson
Differential Revision: https://reviews.llvm.org/D25264
llvm-svn: 284060
These were reverted in r283753 and r283747.
The first patch added a header to the root 'Headers' install directory,
instead of into 'Headers/cuda_wrappers'. This was fixed in the second
patch, but by then the damage was done: The bad header stayed in the
'Headers' directory, continuing to break the build.
We reverted both patches in an attempt to fix things, but that still
didn't get rid of the header, so the Windows boostrap build remained
broken.
It's probably worth fixing up our cmake logic to remove things from the
install dirs, but in the meantime, re-land these patches, since we
believe they no longer have this bug.
llvm-svn: 283907
Breaks bootstrap builds on (at least) Windows:
In file included from D:\buildslave\clang-x64-ninja-win7\llvm\lib\Support\Allocator.cpp:14:
In file included from D:\buildslave\clang-x64-ninja-win7\llvm\include\llvm/Support/Allocator.h:24:
In file included from D:\buildslave\clang-x64-ninja-win7\llvm\include\llvm/ADT/SmallVector.h:20:
In file included from D:\buildslave\clang-x64-ninja-win7\llvm\include\llvm/Support/MathExtras.h:19:
D:\buildslave\clang-x64-ninja-win7\stage1.install\bin\..\lib\clang\4.0.0\include\algorithm(63,8) :
error: unknown type name '__device__'
inline __device__ const __T &
llvm-svn: 283747
Summary:
We do this by wrapping <complex> and <algorithm>.
Tests are in the test-suite.
Reviewers: tra
Subscribers: jhen, beanz, cfe-commits, mgorny
Differential Revision: https://reviews.llvm.org/D24979
llvm-svn: 283680
Summary: This matches the idiom we use for our other CUDA wrapper headers.
Reviewers: tra
Subscribers: beanz, mgorny, cfe-commits
Differential Revision: https://reviews.llvm.org/D24978
llvm-svn: 283679
Summary:
Currently we declare our inline __device__ math functions in namespace
std. But libstdc++ and libc++ declare these functions in an inline
namespace inside namespace std. We need to match this because, in a
later patch, we want to get e.g. <complex> to use our device overloads,
and it only will if those overloads are in the right inline namespace.
Reviewers: tra
Subscribers: cfe-commits, jhen
Differential Revision: https://reviews.llvm.org/D24977
llvm-svn: 283678
Summary: We need x86-64-specific builtins if we want to implement some of the MS intrinsics - winnt.h contains definitions of some functions for i386, but not for x86-64 (for example _InterlockedOr64), which means that we cannot treat them as builtins for both i386 and x86-64, because then we have definitions of builtin functions in winnt.h on i386.
Reviewers: thakis, majnemer, hans, rnk
Subscribers: cfe-commits
Differential Revision: https://reviews.llvm.org/D24598
llvm-svn: 283264
This patch corresponds to review:
https://reviews.llvm.org/D24397
It adds the __POWER9_VECTOR__ macro and the -mpower9-vector option along with
a number of altivec.h functions (refer to the code review for a list).
llvm-svn: 282481
On ARM, there are multiple versions of each of the intrinsics, with
acquire/relaxed/release barrier semantics.
The newly added ones are provided as inline functions here instead of builtins,
since they should only be available on certain archs (arm/aarch64).
This is necessary in order to compile C++ code for ARM in MSVC mode.
Patch by Martin Storsjö!
llvm-svn: 282447
This patch adds the msa.h header file containing the shorter names for the
MSA instrinsics, e.g. msa_sll_b for builtin_msa_sll_b.
Reviewers: vkalintiris, zoran.jovanovic
Differential Review: https://reviews.llvm.org/D24674
llvm-svn: 281975
Summary:
We need to add a bunch more "using"s, which weren't necessary with
libstdc++.
Once this is in I can check in a test to the test-suite.
Reviewers: tra
Subscribers: cfe-commits
Differential Revision: https://reviews.llvm.org/D24588
llvm-svn: 281544
Summary: There was no definition for __nop function - added inline
assembly.
Patch by Albert Gutowski!
Reviewers: rnk, thakis
Subscribers: cfe-commits
Differential Revision: https://reviews.llvm.org/D24286
llvm-svn: 280826
This adds support for modules that require (non-)freestanding
environment, such as the compiler builtin mm_malloc submodule.
Differential Revision: https://reviews.llvm.org/D23871
llvm-svn: 280613
These will be reused when removing some builtins from avx512vldqintrin.h and this will make the tests for that change show a better number of vector elements.
llvm-svn: 280196
This adds support for modules that require (no-)gnu-inline-asm
environment, such as the compiler builtin cpuid submodule.
This is the gnu-inline-asm variant of https://reviews.llvm.org/D23871
Differential Revision: https://reviews.llvm.org/D23905
rdar://problem/26931199
llvm-svn: 280159
Summary: Make is_valid_event and create_user_event overloadable like other built-ins.
Patch by Evgeniy Tyurin.
Reviewers: bader, yaxunl
Subscribers: Anastasia, cfe-commits
Differential Revision: https://reviews.llvm.org/D23914
llvm-svn: 280097
Summary:
A bunch of related changes here to our CUDA math headers.
- The second arg to nexttoward is a double (well, technically, long
double, but we don't have that), not a float.
- Add a forward-declare of llround(float), which is defined in the CUDA
headers. We need this for the same reason we need most of the other
forward-declares: To prevent a constexpr function in our standard
library from becoming host+device.
- Add nexttowardf implementation.
- Pull "foobarf" functions defined by the CUDA headers in the global
namespace into namespace std. This lets you do e.g. std::sinf.
- Add overloads for math functions accepting integer types. This lets
you do e.g. std::sin(0) without having an ambiguity between the
overload that takes a float and the one that takes a double.
With these changes, we pass testcases derived from libc++ for cmath and
math.h. We can check these testcases in to the test-suite once support
for CUDA lands there.
Reviewers: tra
Subscribers: cfe-commits
Differential Revision: https://reviews.llvm.org/D23627
llvm-svn: 279140
Summary:
Previously these sort of worked because they didn't end up resulting in
calls at the ptx layer. But I'm adding stricter checks that break
placement new without these changes.
Reviewers: tra
Subscribers: cfe-commits
Differential Revision: https://reviews.llvm.org/D23239
llvm-svn: 278194
This fixes compiling with headers from the Windows SDK for ARM, where the
YieldProcessor function (in winnt.h) refers to _ARM_BARRIER_ISHST.
The actual MSVC armintr.h contains a lot more definitions, but this is enough to
build code that uses the Windows SDK but doesn't use ARM intrinsics directly.
An alternative would to just keep the addition to intrin.h (to include
armintr.h), but not actually ship armintr.h, instead having clang's intrin.h
include armintr.h from MSVC's include directory. (That one works fine with
clang, at least for building code that uses the Windows SDK.)
Patch by Martin Storsjö!
llvm-svn: 277928
There should be no native_ builtin functions with double type arguments.
Patch by Aaron En Ye Shi.
Differential Revision : https://reviews.llvm.org/D23071
llvm-svn: 277754
Summary:
Some cpuid bit defines are named slightly different from how gcc's
cpuid.h calls them.
Define a few more compatibility names to appease software built for gcc:
* `bit_PCLMUL` alias of `bit_PCLMULQDQ`
* `bit_SSE4_1` alias of `bit_SSE41`
* `bit_SSE4_2` alias of `bit_SSE42`
* `bit_AES` alias of `bit_AESNI`
* `bit_CMPXCHG8B` alias of `bit_CX8`
While here, add the misssing 29th bit, `bit_F16C` (which is how gcc
calls this bit).
Reviewers: joerg, rsmith
Subscribers: bruno, cfe-commits
Differential Revision: https://reviews.llvm.org/D22010
llvm-svn: 277307
Added CLK_ABGR definition for get_image_channel_order return value inside opencl-c.h file.
Patch by Aaron En Ye Shi.
Differential Revision: https://reviews.llvm.org/D22767
llvm-svn: 277179
Only around 50% of the intrinsics in this file are documented now. The patches for the rest of the intrisics in this file will be send out later.
The doxygen comments are automatically generated based on Sony's intrinsics docu
ment.
I got an OK from Eric Christopher to commit doxygen comments without prior code
review upstream. This patch was internally reviewed by Paul Robinson.
llvm-svn: 276499
D20859 and D20860 attempted to replace the SSE (V)CVTTPS2DQ and VCVTTPD2DQ truncating conversions with generic IR instead.
It turns out that the behaviour of these intrinsics is different enough from generic IR that this will cause problems, INF/NAN/out of range values are guaranteed to result in a 0x80000000 value - which plays havoc with constant folding which converts them to either zero or UNDEF. This is also an issue with the scalar implementations (which were already generic IR and what I was trying to match).
This patch changes both scalar and packed versions back to using x86-specific builtins.
It also deals with the other scalar conversion cases that are runtime rounding mode dependent and can have similar issues with constant folding.
Differential Revision: https://reviews.llvm.org/D22105
llvm-svn: 276102
add abs intrinsics that use native LLVM-IR.
change _mm512_mask[z]_and_epi{32|64} to use select intrinsic
Differential Revision: http://reviews.llvm.org/D21973
llvm-svn: 274542
- Added new Builtins: enqueue_kernel, get_kernel_work_group_size
and get_kernel_preferred_work_group_size_multiple.
These Builtins use custom check to diagnose parameters of the passed Blocks
i. e. variable number of 'local void*' type params, and check different
overloads specified in Table 6.31 of OpenCL v2.0.
- IR is generated as an internal library call for each OpenCL Builtin,
reusing ObjC Block implementation.
Review: http://reviews.llvm.org/D20249
llvm-svn: 274540
I deleted the extra mask parameter.
__m256i _mm256_permutexvar_epi64 (__m256i idx, __m256i a)
#include "immintrin.h"
Instruction: vpermq
CPUID Flags: AVX512VL + AVX512F
Description
Shuffle 64-bit integers in a across lanes using the corresponding index in idx, and store the results in dst.
Operation
FOR j := 0 to 3
i := j*64
id := idx[i+1:i]*64
dst[i+63:i] := a[id+63:id]
ENDFOR
dst[MAX:256] := 0
dst[MAX:256] := 0
(From: Intel intrinsics guide)
llvm-svn: 274539
Include opencl-c.h by default as a module to utilize the automatic AST caching mechanism of clang modules.
Add an option -finclude-default-header to enable default header for OpenCL, which is off by default.
Differential Revision: http://reviews.llvm.org/D20444
llvm-svn: 273191
Summary:
This patch adds support for the _MM_ALIGN16 attribute on non-windows targets. This aligns Clang with ICC which supports the attribute on all targets.
Fixes PR28056
Reviewers: aaboud, echristo, cfe-commits, mkuper
Subscribers: zvi, mehdi_amini
Projects: #clang-c
Differential Revision: http://reviews.llvm.org/D21173
llvm-svn: 273095
There is no need to use a target-specific intrinsic to implement
_bit_scan_forward or _bit_scan_reverse, reimplementing them using
generic intrinsics makes it more likely that the middle end will
understand what's going on.
llvm-svn: 272564
We can now use __builtin_nontemporal_store instead of target specific builtins for naturally aligned nontemporal stores which avoids the need for handling in CGBuiltin.cpp
The scalar integer nontemporal (unaligned) store builtins will have to wait as __builtin_nontemporal_store currently assumes natural alignment and doesn't accept the 'packed struct' trick that we use for normal unaligned load/stores.
The nontemporal loads require further backend support before we can safely convert them to __builtin_nontemporal_load
Differential Revision: http://reviews.llvm.org/D21272
llvm-svn: 272540
The doxygen comments are automatically generated based on Sony's intrinsics docu
ment.
I got an OK from Eric Christopher to commit doxygen comments without prior code
review upstream.
llvm-svn: 272350
Summary: Clang changes to make use of the LLVM intrinsics added in D21160.
Reviewers: tra
Subscribers: jholewinski, cfe-commits
Differential Revision: http://reviews.llvm.org/D21162
llvm-svn: 272299
Only half of the intrinsics in this file is documented here. The patch for the o
ther half will be sent out later.
The doxygen comments are automatically generated based on Sony's intrinsics docu
ment.
I got an OK from Eric Christopher to commit doxygen comments without prior code
review upstream.
llvm-svn: 272121
This is really only needed for addition, subtraction, and multiplication, but I did the bitwise ops too for overall consistency. Clang currently doesn't set NSW for signed vector operations so the undefined behavior shouldn't happen today.
llvm-svn: 271778
The 'cvtt' truncation (round to zero) conversions can be safely represented as generic __builtin_convertvector (fptosi) calls instead of x86 intrinsics. We already do this (implicitly) for the scalar equivalents.
Note: I looked at updating _mm_cvttpd_epi32 as well but this still requires a lot more backend work to correctly lower (both for debug and optimized builds).
Differential Revision: http://reviews.llvm.org/D20859
llvm-svn: 271436
According to the gcc headers, intel intrinsics docs and msdn codegen the _mm_store1_pd (and its _mm_store_pd1 equivalent) should use an aligned pointer - the clang headers are the only implementation I can find that assume non-aligned stores (by storing with _mm_storeu_pd).
Additionally, according to the intel intrinsics docs and msdn codegen the _mm_store1_ps (_mm_store_ps1) requires a similarly aligned pointer.
This patch raises the alignment requirements to match the other implementations by calling _mm_store_ps/_mm_store_pd instead.
I've also added the missing _mm_store_pd1 intrinsic (which maps to _mm_store1_pd like _mm_store_ps1 does to _mm_store1_ps).
As a followup I'll update the llvm fast-isel tests to match this codegen.
Differential Revision: http://reviews.llvm.org/D20617
llvm-svn: 271218
Summary: The order is [x, y, z, w], not [w, x, y, z].
Subscribers: cfe-commits, tra
Differential Revision: http://reviews.llvm.org/D20794
llvm-svn: 271215
OpenCL has large number of "builtin" functions ("builtin" in the sense of OpenCL spec) which are defined in header files. To compile OpenCL kernels using these builtin functions, a header file is needed.
This header file is based on the Khronos implementation (https://github.com/KhronosGroup/SPIR/blob/spirv-1.0/lib/Headers/opencl.h) with heavy refactoring.
Re-commit after fixing failures on ppc64/systemz etc.
Differential Revision: http://reviews.llvm.org/D18369
llvm-svn: 271197
As discussed on http://reviews.llvm.org/D20684, move the unsigned integer vector types used for zero extension to make them available for general use.
llvm-svn: 271187
OpenCL has large number of "builtin" functions ("builtin" in the sense of OpenCL spec) which are defined in header files. To compile OpenCL kernels using these builtin functions, a header file is needed.
This header file is based on the Khronos implementation (https://github.com/KhronosGroup/SPIR/blob/spirv-1.0/lib/Headers/opencl.h) with heavy refactoring.
Differential Revision: http://reviews.llvm.org/D18369
llvm-svn: 271136
The VPMOVSX and (V)PMOVZX sign/zero extension intrinsics can be safely represented as generic __builtin_convertvector calls instead of x86 intrinsics.
This patch removes the clang builtins and their use in the sse2/avx headers - a companion patch will remove/auto-upgrade the llvm intrinsics.
Note: We already did this for SSE41 PMOVSX sometime ago.
Differential Revision: http://reviews.llvm.org/D20684
llvm-svn: 271106
macros rather than functions.
Unfortunately couldn't come up with a simple testcase that didn't need
code generation to verify what was going on.
llvm-svn: 270625
Both the (V)CVTDQ2PD(Y) (i32 to f64) and (V)CVTPS2PD(Y) (f32 to f64) conversion instructions are lossless and can be safely represented as generic __builtin_convertvector calls instead of x86 intrinsics without affecting final codegen.
This patch removes the clang builtins and their use in the sse2/avx headers - a future patch will deal with removing the llvm intrinsics, but that will require a bit more work.
Differential Revision: http://reviews.llvm.org/D20528
llvm-svn: 270499
Ensure _mm256_extract_epi8 and _mm256_extract_epi16 zero extend their i8/i16 result to i32. This matches _mm_extract_epi8 and _mm_extract_epi16.
Fix for PR27594
Differential Revision: http://reviews.llvm.org/D20468
llvm-svn: 270330
Summary:
Previously it was implemented as inline asm in the CUDA headers.
This change allows us to use the [addr+imm] addressing mode when
executing ld.global.nc instructions. This translates into a 1.3x
speedup on some benchmarks that call this instruction from within an
unrolled loop.
Reviewers: tra, rsmith
Subscribers: jhen, cfe-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D19990
llvm-svn: 270150
Summary:
MONITORX/MWAITX instructions provide similar capability to the MONITOR/MWAIT
pair while adding a timer function, such that another termination of the MWAITX
instruction occurs when the timer expires. The presence of the MONITORX and
MWAITX instructions is indicated by CPUID 8000_0001, ECX, bit 29.
The MONITORX and MWAITX instructions are intercepted by the same bits that
intercept MONITOR and MWAIT. MONITORX instruction establishes a range to be
monitored. MWAITX instruction causes the processor to stop instruction
execution and enter an implementation-dependent optimized state until
occurrence of a class of events.
Opcode of MONITORX instruction is "0F 01 FA". Opcode of MWAITX instruction is
"0F 01 FB". These opcode information is used in adding tests for the
disassembler.
These instructions are enabled for AMD's bdver4 architecture.
Patch by Ganesh Gopalasubramanian!
Reviewers: echristo, craig.topper
Subscribers: RKSimon, joker.eph, llvm-commits, cfe-commits
Differential Revision: http://reviews.llvm.org/D19796
llvm-svn: 269907
This is a mostly mechanical change accomplished with a script. I tried to split out any changes to the typecasts that already existed into separate commits.
llvm-svn: 269746
This is a mostly mechanical change accomplished with a script. I tried to split out any changes to the typecasts that already existed into separate commits.
llvm-svn: 269745
This is a mostly mechanical change accomplished with a script. I tried to split out any changes to the typecasts that already existed into separate commits.
llvm-svn: 269744
This is a mostly mechanical change accomplished with a script. I tried to split out any changes to the typecasts that already existed into separate commits.
llvm-svn: 269743
This is a mostly mechanical change accomplished with a script. I tried to split out any changes to the typecasts that already existed into separate commits.
llvm-svn: 269742
This is a mostly mechanical change accomplished with a script. I tried to split out any changes to the typecasts that already existed into separate commits.
llvm-svn: 269741
This is a mostly mechanical change accomplished with a script. I tried to split out any changes to the typecasts that already existed into separate commits.
llvm-svn: 269740
This is a mostly mechanical change accomplished with a script. I tried to split out any changes to the typecasts that already existed into separate commits.
llvm-svn: 269739
Added doxygen comments to avxintrin.h's intrinsics. As of now, only around 50% of the intrinsics in this file are documented here. The patches for the other half will be sent out later.
Updated bmiintrin.h to fix an incorrect section name.
Updated f16cintrin.h to fix incorect parameter names.
The doxygen comments are automatically generated based on Sony's intrinsics document.
I got an OK from Eric Christopher to commit doxygen comments without prior code
review upstream.
llvm-svn: 269718
Visual Studio's C++ standard library headers include intrin.h, so the intrinsic
headers get included a lot more often in Microsoft mode than elsewhere. The
AVX512 intrinsics are a lot of code (0.7 MB, causing 30% compile time overhead
for small programs including e.g. <string> and 6% compile time overhead for
larger projects like e.g. v8). Since multiversioning can't be relied on in
Microsoft mode (cl.exe doesn't support it), having faster compiles seems like
the much better tradeoff until we have a better intrinsic story going forward
(which we'll need for e.g. PR19898).
Actually using intrinsics on Windows already requires the right /arch:
settings, so this patch should have no big behavior change.
See also thread "The intrinsics headers (especially avx512) are too big. What
to do about it?" on cfe-dev.
http://reviews.llvm.org/D20291
llvm-svn: 269675