Also copy/modify the unary_intrin.inc from math/ to make the
intrinsic declaration somewhat reusable.
Passes CL CTS integer_ops/test_integer_ops popcount tests for CL 1.2
Tested-by on GCN 1.0 (Pitcairn)
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Jan Vesely <jan.vesely@rutgers.edu>
llvm-svn: 312854
v2: add vload(half) as well
make helpers amdgpu specific (NVPTX uses different private AS numbering)
use clang builtin on clang >= 6
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
Reviewed-by: Tom Stellard <tstellar@redhat.com>
llvm-svn: 312839
Add missing undefs
Make helpers amdgpu specific (NVPTX uses different numbering for private AS)
Use clang builtins on clang >= 6
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
Reviewed-by: Tom Stellard <tstellar@redhat.com>
llvm-svn: 312838
This was added in CL 1.1
Tested with a Radeon HD 7850 (Pitcairn) using the CL CTS via:
test_conformance/relationals/test_relationals shuffle_built_in_dual_input
v2: Add half support to shuffle2
Move shuffle2 to misc/
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Jan Vesely <jan.vesely@rutgers.edu>
llvm-svn: 312404
This was added in CL 1.1
Tested with a Radeon HD 7850 (Pitcairn) using the CL CTS via:
test_conformance/relationals/test_relationals shuffle_built_in
v2: Add half-precision support to shuffle when available.
Move to misc/ and add section 6.12.12 to clc.h
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Jan Vesely <jan.vesely@rutgers.edu>
llvm-svn: 312403
Uses the same mechanism to enable fp16 as we use for fp64 when
processing clc.h
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Jan Vesely <jan.vesely@rutgers.edu>
llvm-svn: 312402
also consolidate macros into one file, and rename to clcmacros.h
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
Reviewed-by: Aaron Watry <awatry@gmail.com>
llvm-svn: 309358
Trivially define native_tan as a redirect to tan.
If there are any targets with a native implementation, we can deal with it later.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Matt Arsenault <arsenm2@gmail.com>
llvm-svn: 295920
Ported from the amd-builtins branch.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Matt Arsenault <Matthew.Arsenault@amd.com>
CC: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 292335
Ported from the amd-builtins branch.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Matt Arsenault <Matthew.Arsenault@amd.com>
CC: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 292334
Just use lgamma_r and ignore the value returned in the second argument
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 281565
Ported from the amd-builtins branch, which is itself based on the
Sun Microsystems implementation.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 281564
This macro is currently unused, but I plan to use it shortly.
The previous form did casts of pointers without an address space, which
doesn't work so well for CL 1.x.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 281563
clang (since r280553) allows pointer casts in function overloads,
so we need to disambiguate the second argument.
clang might be smarter about overloads in the future
see https://reviews.llvm.org/D24113, but let's be safe in libclc anyway.
llvm-svn: 280871
OpenCL 1.0: "Returns y if y < x, otherwise it returns x. If x *and* y
are infinite or NaN, the return values are undefined."
OpenCL 1.1+: "Returns y if y < x, otherwise it returns x. If x *or* y
are infinite or NaN, the return values are undefined."
The 1.0 version is stricter so use that one.
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
llvm-svn: 276704
Also fix get_global_id to consider offset
No idea how to add this for ptx, so they are stuck with the old get_global_id
implementation.
v2: split to a separate patch
v3: Switch R600 to use implictarg.ptr
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
llvm-svn: 276443
Fixes fdim piglit on Turks
v2: use CL fmax instead of __builtin
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
Reviewed-by: Tom Stellard <tom.stellard@amd.com>
llvm-svn: 269807
The scalar float/double function bodies are a direct copy/paste,
aside from the removed (optional) code in float function body that
requires subnormals.
reviewers: jvesely
Patch by: Vedran Miletić <rivanvx@gmail.com>
llvm-svn: 268766
Based on the amd-builtin, but explicitly vectorized for all sizes (not just
float4), and includes a vectorized double implementation.
Passes piglit (float) tests on pitcairn.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Jan Vesely <jan.vesely@rutgers.edu>
llvm-svn: 268708
The scalar float/double function bodies are a direct copy/paste
with usage of the CLC wrappers to vectorize them.
This commit also adds in the FP_ILOGB0 and FP_ILOGBNAN macros which are
equal to the results of ilogb(0.0f) and ilogb(float nan) respectively.
v2: Add FP_ILOGB0 and FP_ILOGBNAN definitions
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Jan Vesely <jan.vesely@rutgers.edu>
v1 Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 261639
The float implementation is almost a direct port from the amd-builtins,
but instead of just having a scalar and float4 implementation, it has
a scalar and arbitrary width vector implementation.
The double scalar is also a direct port from AMD's builtin release.
The double vector implementation copies the logic in the float vector
implementation using the values from the double scalar version.
Both have been tested in piglit using tests sent to that project's
mailing list.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Jan Vesely <jan.vesely@rutgers.edu>
llvm-svn: 260114
The spec says (section 6.12.3, CL version 1.2):
The macro names given in the following list must use the values
specified. The values shall all be constant expressions suitable
for use in #if preprocessing directives.
This commit addresses the second part of that statement.
Reviewed-by: Jan Vesely <jan.vesely@rutgers.edu>
Reviewed-by: Tom Stellard <tom@stellard.net>
CC: Moritz Pflanzer <moritz.pflanzer14@imperial.ac.uk>
CC: Serge Martin <edb+libclc@sigluy.net>
llvm-svn: 249445
The values for the char/short/integer/long minimums were declared with
their actual values, not the definitions from the CL spec (v1.1). As
a result, (-2147483648) was actually being treated as a long by the
compiler, not an int, which caused issues when trying to add/subtract
that value from a vector.
Update the definitions to use the values declared by the spec, and also
add explicit casts for the char/short/int minimums so that the compiler
actually treats them as shorts/chars. Without those casts, they
actually end up stored as integers, and the compiler may end up storing
the INT_MIN as a long.
The compiler can sign extend the values if it needs to convert the
char->short, short->int, or int->long
v2: Add explicit cast for INT_MIN and fix some type-o's and wrapping
in the commit message.
Reported-by: Moritz Pflanzer <moritz.pflanzer14@imperial.ac.uk>
CC: Moritz Pflanzer <moritz.pflanzer14@imperial.ac.uk>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
Signed-off-by: Aaron Watry <awatry@gmail.com>
llvm-svn: 247661
Use the implementation was ported from the AMD builtin library rather
than LLVM Intrinsics.
This has been tested with piglit, OpenCV, and the ocl conformance tests.
llvm-svn: 243131
Passing values less than 0 to the llvm.sqrt() intrinsic results in
undefined behavior, so we need to check the input and return NaN if
is is less than 0.
v2:
- Fix build failures.
llvm-svn: 241906
Using exp2(x * M_LOG2E_F) does not give us accurate enough results for
OpenCL. If you look at the new exp implementation you'll see that
it does multiply the input by M_LOG2E_F, but it still uses the original
input in part of the calculation.
This exp implementation was ported from the AMD builtin library
and has been tested with piglit, OpenCV, and the ocl conformance tests.
llvm-svn: 237229
Not all targets support the intrinsic, so it's better to have a
generic implementation which does not use it.
This exp2 implementation was ported from the AMD builtin library
and has been tested with piglit, OpenCV, and the ocl conformance tests.
llvm-svn: 237228
This implementation was ported from the AMD builtin library
and has been tested with piglit, OpenCV, and the ocl conformance tests.
v2:
- Remove f suffix from constant in double implementations.
- Consolidate implementations using the .cl/.inc approach.
v3:
- Use __CLC_FPSIZE instead of __CLC_FP{32,64}
v4 (Jan Vesely):
- Limit to single precision.
llvm-svn: 236920
This is a generic implementation which just calls rsqrt.
Targets should override this if they want a faster implementation.
v2:
- Alphabettize SOURCES
v3 (Jan Vesely):
Limit to single precision types.
llvm-svn: 236915
Ported from AMD builtin library, passes piglit on Turks.
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 236647
The new implementation was ported from the AMD builtin library
and has been tested with piglit, OpenCV, and the ocl conformance tests.
llvm-svn: 236608
This makes it possible for runtime implementations to disable
subnormal handling at runtime.
When this flag is enabled, decisions about how to handle subnormals
in the library will be controlled by an external variable called
__CLC_SUBNORMAL_DISABLE.
Function implementations should use these new helpers for querying subnormal
support:
__clc_fp16_subnormals_supported();
__clc_fp32_subnormals_supported();
__clc_fp64_subnormals_supported();
In order for the library to link correctly with this feature,
users will be required to either:
1. Insert this variable into the module (if using the LLVM/Clang C++/C APIs).
2. Pass either subnormal_disable.bc or subnormal_use_default.bc to the
linker. These files are distributed with liblclc and installed to
$(installdir). e.g.:
llvm-link -o kernel-out.bc kernel.bc builtins-nosubnormal.bc subnormal_disable.bc
or
llvm-link -o kernel-out.bc kernel.bc builtins-nosubnormal.bc subnormal_use_default.bc
If you do not supply the --enable-runtime-subnormal then the library
behaves the same as it did before this commit.
In addition to these changes, the patch adds helper functions that
should be used when implementing library functions that need
special handling for denormals:
__clc_fp16_subnormals_supported();
__clc_fp32_subnormals_supported();
__clc_fp64_subnormals_supported();
llvm-svn: 235329
This is a generic implementation which just calls sqrt. Targets should
override this if they want a faster implementation.
v2:
- Alphabetize SOURCES
llvm-svn: 232965
This implementation was ported from the AMD builtin library
and has been tested with piglit, OpenCV, and the ocl conformance tests.
v2:
- Remove unnecessary copyright.
llvm-svn: 232964
We need to reinterpret float/double types as uint/ulong in order to
perform the bitwise operations.
This has been tested with piglit, OpenCV, and the ocl conformance tests.
v2:
- Use vector operations rather than splitting vectors into scalar
components.
Reviewed-by: Aaron Watry <awatry@gmail.com>
llvm-svn: 231373
It has been part of the common functions since 1.0
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 231137
Ported from the libclc/amd-builtins branch
v2: Rename sincos_f_piby4 to __libclc__sincosf_piby4
Add cospi(double) implementation instead of using llvm.cos
Notes:
The sincosD_piby4.h file is mostly the same as the builtin implementation
released by AMD. The inline attribute declaration is changed, and M_PI is
used instead of a constant double. Otherwise, the only difference is that
the header explicitly enables the fp64 pragma.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Jeroen Ketema <j.ketema@imperial.ac.uk>
CC: Tom Stellard <tom@stellard.net>
CC: Matt Arsenault <Matthew.Arsenault@amd.com>
llvm-svn: 230641
Including a standard or system header isn't allowed in OpenCL.
The type "size_t" needs to be explicitely defined now.
v2: Use __SIZE_TYPE__ instead of unsigned int.
v3: Define ptrdiff_t and NULL.
Patch-by: Jean-Sébastien Pédron
Reviewed-by: Jeroen Ketema
Reviewed-by: Jan Vesely
llvm-svn: 222235
v2: Fix function declaration
Add range metadata to r600 implementation
v3: change prefix to AMDGPU
Reviewed-by: Tom Stellard <tom@stellard.net>
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
llvm-svn: 219793
This is a simple implementation which just copies data synchronously.
v2:
- Use size_t.
v3:
- Fix possible race condition by splitting the copy among multiple
work items.
llvm-svn: 219008
We were missing the local versions of the atom_* before
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 217911
Uses the algorithm:
tan(x) = sin(x) / sqrt(1-sin^2(x))
An alternative is:
tan(x) = sin(x) / cos(x)
Which produces more verbose bitcode and longer assembly.
Either way, the generated bitcode seems pretty nasty and a more optimized
but still precise-enough solution is welcome.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Jan Vesely <jan.vesely@rutgers.edu>
llvm-svn: 217511
Passes the tests that were submitted to the piglit list
Tested on R600 (Pitcairn)
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Jan Vesely <jan.vesely@rutgers.edu>
llvm-svn: 217509
This was previously implemented with a macro and we were using
__builtin_copysign(), which takes double inputs for the float
version of copysign().
Reviewed-and-Tested-by: Aaron Watry <awatry@gmail.com>
llvm-svn: 217045
These were missing and caused mad24/mul24 with int3/uint3 arg type to fail
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 216321
This generates bitcode which is indistinguishable from what was
hand-written for int32 types in v[load|store]_impl.ll.
v4: Use vec2+scalar for vec3 load/stores to prevent corruption (per Tom)
v3: Also remove unused generic/lib/shared/v[load|store]_impl.ll
v2: (Per Matt Arsenault) Fix alignment issues with vector load stores
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
CC: Matt Arsenault <Matthew.Arsenault@amd.com>
CC: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 216069
These were present in CL 1.0, just not implemented yet.
v2: Use hex values and fix commit message
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Jeroen Ketema <j.ketema@imperial.ac.uk>
CC: Matt Arsenault <Matthew.Arsenault@amd.com>
llvm-svn: 213321
Vector true is -1, not 1, which means we need to use the relational unary
macro instead of the normal unary builtin one.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 213316
relational.h includes relational macros for defining functions which need to
return 1 for scalar true and -1 for vector true.
I believe that this is the only place that this behavior is required, so the
macro is placed at its lowest useful level (same directory as it is used in).
This also creates re-usable unary/binary declaration and floatn includes which
should simplify relational builtin declarations.
Mostly patterned off of include/math/[binary_decl|unary_decl|floatn].inc
but with required changes for relational functions.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 213315
Otherwise the test evaluates to true on OpenCL 1.1 and earlier. Since we
therefore cannot use the CL_VERSION_?_? macros move them to the proper
position in the top-level header.
llvm-svn: 211787
The vector components were mistakenly using () instead of {}, which caused
all but the last vector component to be dropped on the floor.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Jeroen Ketema <j.ketema@imperial.ac.uk>
llvm-svn: 211733
v2 Changes:
- use __builtin_signbit instead of shifting by hand
- significantly improve vector shuffling
- Works correctly now for signbit(float16) on radeonsi
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 211696
These are apparently only defined in OpenCL 1.2.
HALF_MAX, HALF_MIN and HALF_EPSILON are currently omitted. Clang does
not seem to support the ‘h’ suffix for half float constants even with
the cl_khr_fp16 extension enabled.
Reviewed-by: Tom Sellard <tom@stellard.net>
llvm-svn: 211579
Add these out-of-order in clc.h so we can use these in other headers.
v2: Take into account the lack of a definition in OpenCL 1.0
Reviewed-by: Tom Stellard <tom@stellard.net>
llvm-svn: 211578
v2: - use quotes instead of <>
- add include to r600/lib/math/nextafter.c changed
Reviewed-by: Tom Stellard <tom@stellard.net>
Reviewed-by: Aaron Watry <awatry@gmail.com>
llvm-svn: 211576
v3: change __builtin_nanf() to __builtin_nanf("")
This doesn't work yet, but it was agreed to commit as-is with the logic
that "broken" is better than "completely missing" and this should be
fixed in clang.
v2: use __builtin_inff() and also add nan/huge_val definitions
Signed-off-by: Aaron Watry <awatry@gmail.com>
llvm-svn: 211065
Use separate implementations instead of a macro
to ensure the constant multiplied with is of
higher precision.
v2: Use the correct formula, spotted by Dan Liew <daniel.liew@imperial.ac.uk>
Reviewed-by: Aaron Warty <awatry@gmail.com>
Reviewed-by: Tom Stellard <tom@stellard.net>
llvm-svn: 210891
OpenCL C lang says that trunc rounds towards zero.
llvm.trunc.* intrinsic rounds to integer not larger in magnitude.
These definitions are equivalent.
Patch by: Jan Vesely
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
llvm-svn: 197769
Some function definitions were using _CLC_DECL, which meant that they
weren't being marked as always_inline.
Reviewed-by and Tested-by: Aaron Watry <awatry@gmail.com>
llvm-svn: 193754
There are two implementations of nextafter():
1. Using clang's __builtin_nextafter. Clang replaces this builtin with
a call to nextafter which is part of libm. Therefore, this
implementation will only work for targets with an implementation of
libm (e.g. most CPU targets).
2. The other implementation is written in OpenCL C. This function is
known internally as __clc_nextafter and can be used by targets that
don't have access to libm.
llvm-svn: 192383
We already have a working mul_hi, and the spec gives us the implementation as:
Returns mul_hi(a,b)+c.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 190211
Everything except long/ulong is handled by just casting to the next larger type,
doing the math and then shifting/casting the result.
For 64-bit types, we break the high/low parts of each operand apart, and do
a FOIL-based multiplication.
v2:
Discard the stack-overflow implementation due to copyright concerns.
- The implementation is still FOIL-based, but discards the previous code.
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 188684
rhadd = (x+y+1)>>1
Implemented as:
(x>>1) + (y>>1) + ((x&1)|(y&1))
This prevents us having to do assembly addition and overflow detection
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 188477
(x + y) >> 1 gets changed to:
(x>>1) + (y>>1) + (x&y&1)
Saves us having to do any llvm assembly and overflow checking in the addition.
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 188476
Not hooked up to R600 yet due to current lack of support, at least on EG.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 188181