This patch introduces support for %hix, %lox, %gdop_hix22, %gdop_lox10 and %gdop.
An extra test is introduced to make sure the fixups are correctly applied.
Reviewed By: dcederman
Differential Revision: https://reviews.llvm.org/D102575
On targets with __float128 available and distinct from long double,
use it to support more kind=16 entry points. This affects mostly
x86-64 targets. This means that more runtime entry points are
defined for lowering to call.
Delete Common/long-double.h and its LONG_DOUBLE macro in favor of
testing the standard macro LDBL_MANT_DIG.
Differential Revision: https://reviews.llvm.org/D127025
In some instances its advantageous to calculate edit distances without worrying about casing.
Currently to achieve this both strings need to be converted to the same case first, then edit distance can be calculated.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D126159
Instead of setting the successor to the exit using CFG.ExitBB, set it to
nullptr initially. The successor to the exit block is later set either
through createEmptyBasicBlock or after VPlan execution (because at the
moment, no block is created by VPlan for the exit block, the existing
one is reused).
This also enables BranchOnCond to be used as terminator for the exiting
block of the topmost vector region.
Depends on D126618.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D126679
Diagnose OPEN(FILE=f) when f is already connected by the same name to
a distinct external I/O unit.
Differential Revision: https://reviews.llvm.org/D127035
Add extra arguments and checks to the runtime support library so that
a call to the intrinsic functions MOD and MODULO with "denominator"
argument P of zero will cause a crash with a source location rather
than an uninformative floating-point error or integer division by
zero signal.
Additional work is required in lowering to (1) pass source file path and
source line number arguments and (2) actually call these runtime
library APIs instead of emitting inline code for MOD &/or MODULO.
Differential Revision: https://reviews.llvm.org/D127034
When an external I/O statement is in a recoverable error
state before any data transfers take place (for example,
an unformatted transfer with ERR=/IOSTAT=/IOMSG= attempted on
a formatted unit), ensure that the unit's mutex is still
released at the end of the statement.
Differential Revision: https://reviews.llvm.org/D127032
For example, FINDLOC(A,X) should convert both A and X to COMPLEX(8)
if the operands are REAL(8) and COMPLEX(4), so that comparisons
can be done without losing inforation. The current implementation
unconditionally converts X to the type of the array A.
Differential Revision: https://reviews.llvm.org/D127030
The initial size of the file was not being captured as the file position
on which the first output buffer should be framed.
Differential Revision: https://reviews.llvm.org/D127029
The "engineering" ENw.d output editing descriptor has some difficult
edge case behavior for values that might format into a bunch of 9's
or round up to a 1 for a given scale factor. Fix the algorithm,
and add tests to protect against regressions.
Differential Revision: https://reviews.llvm.org/D127028
Avoid calls to memcpy with zero byte counts if their address argument
calculations may not be valid expressions.
Differential Revision: https://reviews.llvm.org/D127027
After the program has survived its attempt to overflow the output buffer
with an internal WRITE using ERR=, IOSTAT=, &/or IOMSG=, don't crash
by accidentally blank-filling the next record that usually doesn't exist.
Differential Revision: https://reviews.llvm.org/D127024
When the current seed of the pseudo-random generator is queried
with CALL RANDOM_SEED(GET=n), that query should not change the
stream of pseudo-random numbers produced by CALL RANDOM_NUMBER().
Differential Revision: https://reviews.llvm.org/D127023
This is correct for all values, i.e. the same as promoting the division to fp32 in the NVPTX backend. But it is faster (~10% in average, sometimes more) because:
- it performs less Newton iterations
- it avoids the slow path for e.g. denormals
- it allows reuse of the reciprocal for multiple divisions by the same divisor
Test program:
```
#include <stdio.h>
#include "cuda_fp16.h"
// This is a variant of CUDA's own __hdiv which is fast than hdiv_promote below
// and doesn't suffer from the perf cliff of div.rn.fp32 with 'special' values.
__device__ half hdiv_newton(half a, half b) {
float fa = __half2float(a);
float fb = __half2float(b);
float rcp;
asm("{rcp.approx.ftz.f32 %0, %1;\n}" : "=f"(rcp) : "f"(fb));
float result = fa * rcp;
auto exponent = reinterpret_cast<const unsigned&>(result) & 0x7f800000;
if (exponent != 0 && exponent != 0x7f800000) {
float err = __fmaf_rn(-fb, result, fa);
result = __fmaf_rn(rcp, err, result);
}
return __float2half(result);
}
// Surprisingly, this is faster than CUDA's own __hdiv.
__device__ half hdiv_promote(half a, half b) {
return __float2half(__half2float(a) / __half2float(b));
}
// This is an approximation that is accurate up to 1 ulp.
__device__ half hdiv_approx(half a, half b) {
float fa = __half2float(a);
float fb = __half2float(b);
float result;
asm("{div.approx.ftz.f32 %0, %1, %2;\n}" : "=f"(result) : "f"(fa), "f"(fb));
return __float2half(result);
}
__global__ void CheckCorrectness() {
int i = threadIdx.x + blockIdx.x * blockDim.x;
half x = reinterpret_cast<const half&>(i);
for (int j = 0; j < 65536; ++j) {
half y = reinterpret_cast<const half&>(j);
half d1 = hdiv_newton(x, y);
half d2 = hdiv_promote(x, y);
auto s1 = reinterpret_cast<const short&>(d1);
auto s2 = reinterpret_cast<const short&>(d2);
if (s1 != s2) {
printf("%f (%u) / %f (%u), got %f (%hu), expected: %f (%hu)\n",
__half2float(x), i, __half2float(y), j, __half2float(d1), s1,
__half2float(d2), s2);
//__trap();
}
}
}
__device__ half dst;
__global__ void ProfileBuiltin(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = x / x;
}
dst = x;
}
__global__ void ProfilePromote(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = hdiv_promote(x, x);
}
dst = x;
}
__global__ void ProfileNewton(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = hdiv_newton(x, x);
}
dst = x;
}
__global__ void ProfileApprox(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = hdiv_approx(x, x);
}
dst = x;
}
int main() {
CheckCorrectness<<<256, 256>>>();
half one = __float2half(1.0f);
ProfileBuiltin<<<1, 1>>>(one); // 1.001s
ProfilePromote<<<1, 1>>>(one); // 0.560s
ProfileNewton<<<1, 1>>>(one); // 0.508s
ProfileApprox<<<1, 1>>>(one); // 0.304s
auto status = cudaDeviceSynchronize();
printf("%s\n", cudaGetErrorString(status));
}
```
Reviewed By: herhut
Differential Revision: https://reviews.llvm.org/D126158
Besides raising the IEEE floating-point overflow exception, treat
a floating-point overflow on input as an I/O error catchable with
ERR=, IOSTAT=, &/or IOMSG=.
Differential Revision: https://reviews.llvm.org/D127022
# Stash local changes before checkout.
# Print a message that the source repository revision has been changed, with
instructions to switch back.
# Make the script executable.
# Print sample instructions how to run bolt tests.
# Assume that llvm-bolt-wrapper script is in the same source directory.
Reviewed By: rafauler
Differential Revision: https://reviews.llvm.org/D126941
F18 preserves lower bounds of explicit-shape named constant arrays, but
failed to also do so for implicit-shape named constants. Fix.
Differential Revision: https://reviews.llvm.org/D127021
This reverts commit dcf3368e33.
It breaks -DLLVM_ENABLE_ASSERTIONS=on builds. In addition, the description is
incorrect about ld.lld behavior. For wasm, there should be justification to add
the new mode.
In the USE statements that f18 emits to module files, ensure that symbols
from intrinsic modules are marked as such on their USE statements. And
ensure that the current working directory (".") cannot override the intrinsic
module search path when trying to locate an intrinsic module.
Differential Revision: https://reviews.llvm.org/D127019
Some cl::ZeroOrMore were added to avoid the `may only occur zero or one times!`
error. More were added due to cargo cult. Since the error has been removed,
cl::ZeroOrMore is unneeded.
Also remove cl::init(false) while touching the lines.
As well as ELF linker does, retain all data segments named X referenced
through `__start_X` or `__stop_X`.
For example, `FOO_MD` should not be stripped in the below case, but it's currently mis-stripped
```llvm
@FOO_MD = global [4 x i8] c"bar\00", section "foo_md", align 1
@__start_foo_md = external constant i8*
@__stop_foo_md = external constant i8*
@llvm.used = appending global [1 x i8*] [i8* bitcast (i32 ()* @foo_md_size to i8*)], section "llvm.metadata"
define i32 @foo_md_size() {
entry:
ret i32 sub (
i32 ptrtoint (i8** @__stop_foo_md to i32),
i32 ptrtoint (i8** @__start_foo_md to i32)
)
}
```
This fixes https://github.com/llvm/llvm-project/issues/55839
Reviewed By: sbc100
Differential Revision: https://reviews.llvm.org/D126950
If a file being opened with no ACTION= is write-only then cope with
it rather than defaulting prematurely to treating it as read-only.
Differential Revision: https://reviews.llvm.org/D127015
This fixes an inconsistency between RV32 and RV64. Still considering
trying to do this peephole during isel, but wanted to fix the
inconsistency first.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126986