Introduce RoundOp in the math dialect. The operation rounds the operand to the
nearest integer value in floating-point format. RoundOp lowers to LLVM
intrinsics 'llvm.intr.round' or as a function call to libm (round or roundf).
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D127286
Find writability conflicts (writes to buffers that are not allowed to be written to) by checking SSA use-def chains. This is better than the current writability analysis, which is too conservative and finds false positives.
Differential Revision: https://reviews.llvm.org/D127256
This is required for the distribution system for installing the
mlir-libraries component. This is copied from clang's equivalent
feature.
Differential Revision: https://reviews.llvm.org/D126837
The `init` and `tensor` ops are renamed (and one moved to another dialect).
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D127169
Reverts commit 1469ebf838 (original commit)
Reverts commit a392a39f75 (build fix for above commit)
The commit broke tests in out-of-tree projects, indicating that some logical
error was made in the previous change but not covered by current tests.
This patch fixes a bug in PresburgeRelation::subtract that made it process the
inequality at index 0, multiple times. This was caused by allocating memory
instead of reserving memory in llvm::SmallVector.
Reviewed By: arjunp
Differential Revision: https://reviews.llvm.org/D127228
A few OpenMP tests were retaining the FIR operands even after running
the LLVM conversion pass. To fix these tests the legality checkes for
OpenMP conversion are made stricter to include operands and results.
The Flush, Single and Sections operations are added to conversions or
legality checks. The RegionLessOpConversion is appropriately renamed
to clarify that it works only for operations with Variable operands.
The operands of the flush operation are changed to match those of
Variable Operands.
Fix for an OpenMP issue mentioned in
https://github.com/llvm/llvm-project/issues/55210.
Reviewed By: shraiysh, peixin, awarzynski
Differential Revision: https://reviews.llvm.org/D127092
Four leading spaces are interpreted as a code block in markdown. Unless
used consistently in ODS op description, they cannot be stripped away by
the tablegen backend, which results in malformed markdown being
generated.
Add complex.conj op to calculate the complex conjugate which is widely used for the mathematical operation on the complex space.
Reviewed By: pifon2a
Differential Revision: https://reviews.llvm.org/D127181
These allow for displaying additional inline information,
such as the types of variables, names operands/results,
constraint/rewrite arguments, etc. This requires a bump in the
vscode extension to a newer version, as inlay hints are a new LSP feature.
Differential Revision: https://reviews.llvm.org/D126033
This is much more efficient over the full mode, as it only requires sending
smalls chunks of files. It also works around a weird command ordering
issue (full document updates are being sent after other commands like
code completion) in newer versions of vscode.
Differential Revision: https://reviews.llvm.org/D126032
This commit beefs up the documentation for MLIR language servers by
adding proper documentations/examples/etc for the provided TableGen
language server capabilities. Given that this documentation is also used
for the vscode extension, this commit also updates the user facing vscode
extension documentation.
Note that the images referenced in the new documentation are hosted on
the website, and will be commited to mlir-www shortly after this commit
lands.
Transpose operations on constant data were getting folded during the
canonicalization process. This has compile time cost proportional to
the constant size. Moving this to a separate pass to enable optionality
and flexibility of how such scenarios can be handled.
Reviewed By: rsuderman, jpienaar, stellaraccident
Differential Revision: https://reviews.llvm.org/D124685
Adds supprot for vector unroll transformations to unroll in different
orders. For example, the `vector.contract` can be unrolled into a
smaller set of contractions. There is a choice of how to unroll the
decomposition based on the traversal order of (dim0, dim1, dim2).
The choice of traversal order can now be specified by a callback which
given by the caller of the transform. For now, only the
`vector.contract`, `vector.transfer_read/transfer_write` operations
support the callback.
Differential Revision: https://reviews.llvm.org/D127004
This commit beefs up the documentation for MLIR language servers by
adding proper documentations/examples/etc for the provided PDLL
language server capabilities. Given that this documentation is also used
for the vscode extension, this commit also updates the user facing vscode
extension documentation.
Not that the images referenced in the new documentation are hosted on
the website, and will be commited to mlir-www shortly after this commit
lands.
Differential Revision: https://reviews.llvm.org/D125650
This operation should be supported as a named op because
when the operands are viewed as having canonical layouts
with decreasing strides, then the "reduction" dimensions
of the filter (h, w, and c) are contiguous relative to each
output channel. When lowered to a matrix multiplication,
this layout is the simplest to deal with, and thus future
transforms/vectorizations of `conv2d` may find using this
named op convenient.
Differential Revision: https://reviews.llvm.org/D126995
This change adds support for promoting `linalg` operation operands that
are produced by rank-reducing `memref.subview` ops.
Differential Revision: https://reviews.llvm.org/D127086
When building in debug mode, the link time of the standalone sample is excessive, taking upwards of a minute if using BFD. This at least allows lld to be used if the main invocation was configured that way. On my machine, this gets a standalone test that requires a relink to run in ~13s for Debug mode. This is still a lot, but better than it was. I think we may want to do something about this test: it adds a lot of latency to a normal compile/test cycle and requires a bunch of arg fiddling to exclude.
I think we may end up wanting a `check-mlir-heavy` target that can be used just prior to submit, and then make `check-mlir` just run unit/lite tests. More just thoughts for the future (none of that is done here).
Reviewed By: bondhugula, mehdi_amini
Differential Revision: https://reviews.llvm.org/D126585
This is correct for all values, i.e. the same as promoting the division to fp32 in the NVPTX backend. But it is faster (~10% in average, sometimes more) because:
- it performs less Newton iterations
- it avoids the slow path for e.g. denormals
- it allows reuse of the reciprocal for multiple divisions by the same divisor
Test program:
```
#include <stdio.h>
#include "cuda_fp16.h"
// This is a variant of CUDA's own __hdiv which is fast than hdiv_promote below
// and doesn't suffer from the perf cliff of div.rn.fp32 with 'special' values.
__device__ half hdiv_newton(half a, half b) {
float fa = __half2float(a);
float fb = __half2float(b);
float rcp;
asm("{rcp.approx.ftz.f32 %0, %1;\n}" : "=f"(rcp) : "f"(fb));
float result = fa * rcp;
auto exponent = reinterpret_cast<const unsigned&>(result) & 0x7f800000;
if (exponent != 0 && exponent != 0x7f800000) {
float err = __fmaf_rn(-fb, result, fa);
result = __fmaf_rn(rcp, err, result);
}
return __float2half(result);
}
// Surprisingly, this is faster than CUDA's own __hdiv.
__device__ half hdiv_promote(half a, half b) {
return __float2half(__half2float(a) / __half2float(b));
}
// This is an approximation that is accurate up to 1 ulp.
__device__ half hdiv_approx(half a, half b) {
float fa = __half2float(a);
float fb = __half2float(b);
float result;
asm("{div.approx.ftz.f32 %0, %1, %2;\n}" : "=f"(result) : "f"(fa), "f"(fb));
return __float2half(result);
}
__global__ void CheckCorrectness() {
int i = threadIdx.x + blockIdx.x * blockDim.x;
half x = reinterpret_cast<const half&>(i);
for (int j = 0; j < 65536; ++j) {
half y = reinterpret_cast<const half&>(j);
half d1 = hdiv_newton(x, y);
half d2 = hdiv_promote(x, y);
auto s1 = reinterpret_cast<const short&>(d1);
auto s2 = reinterpret_cast<const short&>(d2);
if (s1 != s2) {
printf("%f (%u) / %f (%u), got %f (%hu), expected: %f (%hu)\n",
__half2float(x), i, __half2float(y), j, __half2float(d1), s1,
__half2float(d2), s2);
//__trap();
}
}
}
__device__ half dst;
__global__ void ProfileBuiltin(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = x / x;
}
dst = x;
}
__global__ void ProfilePromote(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = hdiv_promote(x, x);
}
dst = x;
}
__global__ void ProfileNewton(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = hdiv_newton(x, x);
}
dst = x;
}
__global__ void ProfileApprox(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = hdiv_approx(x, x);
}
dst = x;
}
int main() {
CheckCorrectness<<<256, 256>>>();
half one = __float2half(1.0f);
ProfileBuiltin<<<1, 1>>>(one); // 1.001s
ProfilePromote<<<1, 1>>>(one); // 0.560s
ProfileNewton<<<1, 1>>>(one); // 0.508s
ProfileApprox<<<1, 1>>>(one); // 0.304s
auto status = cudaDeviceSynchronize();
printf("%s\n", cudaGetErrorString(status));
}
```
Reviewed By: herhut
Differential Revision: https://reviews.llvm.org/D126158
The current vectorization logic implicitly expects "elementwise"
linalg ops to have projected permutations for indexing maps, but
the precondition logic misses this check. This can result in a
crash when executing the generic vectorization transform on an op
with a non-projected permutation input indexing map. This change
fixes the logic and adds a test (which crashes without this fix).
Differential Revision: https://reviews.llvm.org/D127000
This patch adds the knobs to use peeling in the codegen strategy
infrastructure.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D126842
This patch adds tests for memory_order clause for atomic update and
capture operations. This patch also adds a check for making sure that
the operations inside and omp.atomic.capture region do not specify the
memory_order clause.
Reviewed By: kiranchandramohan, peixin
Differential Revision: https://reviews.llvm.org/D126195
`scf.foreach_thread` results alias with the underlying `scf.foreach_thread.parallel_insert_slice` destination operands
and they bufferize to equivalent buffers in the absence of other conflicts.
`scf.foreach_thread.parallel_insert_slice` conflict detection is similar to `tensor.insert_slice` conflict detection.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D126769
Add an option to predicate the epilogue within the kernel instead of
peeling the epilogue. This is a useful option to prevent generating
large amount of code for deep pipeline. This currently require a user
lamdba to implement operation predication.
Differential Revision: https://reviews.llvm.org/D126753
Similar to complex128/complex64, float16 has no direct support
in the ctypes implementation. This fixes the issue by using a
custom F16 type to change the view in and out of MLIR code
Reviewed By: wrengr
Differential Revision: https://reviews.llvm.org/D126928
Since version 0.7 we've added:
* Initial language support for TableGen
* Tweaked syntax highlighting for PDLL
* Added a new command to view intermediate PDLL output
This commit enables providing long-form documentation more seamlessly to the LSP
by revamping decl documentation. For ODS imported constructs, we now also import
descriptions and attach them to decls when possible. For PDLL constructs, the LSP will
now try to provide documentation by parsing the comments directly above the decls
location within the source file. This commit also adds a new parser flag
`enableDocumentation` that gates the import and attachment of ODS documentation,
which is unnecessary in the normal build process (i.e. it should only be used/consumed
by tools).
Differential Revision: https://reviews.llvm.org/D124881
This commit defines a dataflow analysis for integer ranges, which
uses a newly-added InferIntRangeInterface to compute the lower and
upper bounds on the results of an operation from the bounds on the
arguments. The range inference is a flow-insensitive dataflow analysis
that can be used to simplify code, such as by statically identifying
bounds checks that cannot fail in order to eliminate them.
The InferIntRangeInterface has one method, inferResultRanges(), which
takes a vector of inferred ranges for each argument to an op
implementing the interface and a callback allowing the implementation
to define the ranges for each result. These ranges are stored as
ConstantIntRanges, which hold the lower and upper bounds for a
value. Bounds are tracked separately for the signed and unsigned
interpretations of a value, which ensures that the impact of
arithmetic overflows is correctly tracked during the analysis.
The commit also adds a -test-int-range-inference pass to test the
analysis until it is integrated into SCCP or otherwise exposed.
Finally, this commit fixes some bugs relating to the handling of
region iteration arguments and terminators in the data flow analysis
framework.
Depends on D124020
Depends on D124021
Reviewed By: rriddle, Mogball
Differential Revision: https://reviews.llvm.org/D124023