Add lowering of vector.warp_execute_on_lane_0 into scf.if plus memory
transfers for the operands and yielded values.
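A rough sketch of the rewrite (op spellings, types, and the exact memory-transfer scheme below are simplified assumptions for illustration, not the verified output of the pattern):
```
// Input: conceptually only lane 0 executes the region; the yielded
// vector<32xf32> is distributed as vector<1xf32> per lane.
%r = vector.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
  %v = "some.computation"() : () -> vector<32xf32>
  vector.yield %v : vector<32xf32>
}
```
becomes, roughly:
```
// Lane 0 runs the body under scf.if; the yielded value travels through
// memory so that every lane can read its slice of the result.
%is_lane_0 = arith.cmpi eq, %laneid, %c0 : index
scf.if %is_lane_0 {
  %v = "some.computation"() : () -> vector<32xf32>
  vector.transfer_write %v, %buffer[%c0] : vector<32xf32>, memref<32xf32, 3>
}
// (some synchronization, e.g. gpu.barrier, is needed between write and reads)
%r = vector.transfer_read %buffer[%laneid], %cst : memref<32xf32, 3>, vector<1xf32>
```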
This also adds an integration test running on a GPU warp. The same tests can be
re-used later with different comment lines to test distribution
transformations.
This is mostly from @springerm's contribution.
Differential Revision: https://reviews.llvm.org/D125430
Complex nested in other types is perfectly fine; only nested structs
aren't supported. Instead of checking whether there's nesting, just check
whether the struct we're dealing with is a complex number.
Differential Revision: https://reviews.llvm.org/D125381
This change integrates the BufferResultsToOutParamsPass into One-Shot Module Bufferization. This improves memory management (deallocation) when buffers are returned from a function.
Note: This currently only works with statically-sized tensors. The generated code is not very efficient yet and there are opportunities for improvement (fewer copies). By default, this new functionality is deactivated.
Differential Revision: https://reviews.llvm.org/D125376
Bufferization has an optional filter to exclude certain ops from analysis+bufferization. There were a few remaining places in the codebase where the filter was not checked.
Differential Revision: https://reviews.llvm.org/D125356
When a custom operation is unknown and does not have a dialect prefix, we currently
emit an error using the name of the operation with the default dialect prefix. This
leads to a confusing error message, especially when operations get moved between dialects.
For example, `func` was recently moved out of `builtin` and to the `func` dialect. The current
error message we get is:
```
func @foo()
^ custom op 'builtin.func' is unknown
```
This could lead users to believe that there is supposed to be a `builtin.func`,
because there used to be. This commit adds a better error message that does
not assume that the operation is supposed to be in the default dialect:
```
func @foo()
^ custom op 'func' is unknown (tried 'builtin.func' as well)
```
Differential Revision: https://reviews.llvm.org/D125351
`linalg.generic` ops have canonicalizers that either remove arguments
not used in the payload, or redundant arguments. Combine these and
enhance the canonicalization to also remove results that have no use.
This is effectively dead code elimination for Linalg ops.
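For example (a sketch; the maps, types, and op names are illustrative only), in the op below the payload never uses %b and result #1 has no uses, so the canonicalizer can drop the second input, the second output, and the second result:
```
#id = affine_map<(d0) -> (d0)>
%0:2 = linalg.generic
    {indexing_maps = [#id, #id, #id, #id], iterator_types = ["parallel"]}
    ins(%in0, %in1 : tensor<8xf32>, tensor<8xf32>)
    outs(%out0, %out1 : tensor<8xf32>, tensor<8xf32>) {
^bb0(%a: f32, %b: f32, %o0: f32, %o1: f32):
  %1 = arith.addf %a, %a : f32
  linalg.yield %1, %o1 : f32, f32
} -> (tensor<8xf32>, tensor<8xf32>)
// Only %0#0 is used below, so %0#1, %in1, and %out1 can all be removed.
```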
Differential Revision: https://reviews.llvm.org/D123632
Using "replaceUsesOfWith" is incorrect because the same initializer value may appear multiple times.
For example, if the epilogue is needed when this loop is unrolled
```
%x:2 = scf.for ... iter_args(%arg1 = %c1, %arg2 = %c1) {
...
}
```
then both of the epilogue's arguments will be incorrectly rewritten to use the same result index (note #1 in both cases):
```
%x_unrolled:2 = scf.for ... iter_args(%arg1 = %c1, %arg2 = %c1) {
...
}
%x_epilogue:2 = scf.for ... iter_args(%arg1 = %x_unrolled#1, %arg2 = %x_unrolled#1) {
...
}
```
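The expected rewrite instead threads each result through its own position:
```
%x_epilogue:2 = scf.for ... iter_args(%arg1 = %x_unrolled#0, %arg2 = %x_unrolled#1) {
  ...
}
```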
This is a full audit of emitError calls. I took the opportunity
to remove extraneous parens and fix a couple of cases where we'd
generate multiple diagnostics for the same error.
Differential Revision: https://reviews.llvm.org/D125355
Change the parsing logic to use StringRef instead of lower-level
char* logic. Also, if emitting a diagnostic on the first token
in the file, we make sure to use that position instead of the
very start of the file.
Differential Revision: https://reviews.llvm.org/D125353
Move the async copy operations to NVGPU, as they only exist on NVIDIA targets and are
designed to match PTX semantics. This also allows us to add finer-grained
caching hint attributes to the op.
Add a hint to bypass L1 and hook it up to the NVVM op.
Differential Revision: https://reviews.llvm.org/D125244
The current implementation of `cloneWithNewYields` has a few issues
- It clones the loop body of the original loop to create a new
loop. This is very expensive.
- It performs `erase` operations which are incompatible when this
method is called from within a pattern rewrite. All erases need to
go through `PatternRewriter`.
To address these, a new utility method `replaceLoopWithNewYields` is added
which
- moves the operations from the original loop into the new loop,
- replaces all uses of the original loop with the corresponding
  results of the new loop,
- uses a callback to allow the caller to generate the new yield values, and
- modifies the original loop to just yield the basic block
  arguments corresponding to the iter_args of the loop (see the sketch
  after this list). This represents a no-op loop. The loop itself is dead
  (since all its uses are replaced) but is not removed; the caller is
  expected to erase the op. Consequently, this method can be called from
  within a `matchAndRewrite` method of a `PatternRewriter`.
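A sketch of the before/after IR (types, values, and the extra yielded value are made up for illustration; the real value comes from the caller-provided callback):
```
// Before: a loop whose single result is used below.
%res = scf.for %iv = %lb to %ub step %c1 iter_args(%a = %init) -> (tensor<10xf32>) {
  %u = tensor.insert %f into %a[%iv] : tensor<10xf32>
  scf.yield %u : tensor<10xf32>
}
"some.use"(%res) : (tensor<10xf32>) -> ()

// After replaceLoopWithNewYields with one additional yielded value: the body
// has been moved (not cloned) into the new loop, uses of %res now point at
// %new#0, and the original loop is reduced to a no-op awaiting erasure.
%res = scf.for %iv = %lb to %ub step %c1 iter_args(%a = %init) -> (tensor<10xf32>) {
  scf.yield %a : tensor<10xf32>
}
%new:2 = scf.for %iv = %lb to %ub step %c1
    iter_args(%a = %init, %b = %init2) -> (tensor<10xf32>, tensor<10xf32>) {
  %u = tensor.insert %f into %a[%iv] : tensor<10xf32>
  scf.yield %u, %u : tensor<10xf32>, tensor<10xf32>
}
"some.use"(%new#0) : (tensor<10xf32>) -> ()
```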
The `cloneWithNewYields` could be replaced with
`replaceLoopWithNewYields`, but that seems to trigger a failure during
walks, potentially due to the operations being moved. That is left as
a TODO.
Differential Revision: https://reviews.llvm.org/D125147
By analogy with the NVGPU dialect, introduce an AMDGPU dialect for
AMD-specific intrinsic wrappers.
The dialect initially includes wrappers around the raw buffer intrinsics.
On AMD GPUs, a memref can be converted to a "buffer descriptor" that
allows more precise control of memory access, such as by allowing for
out of bounds loads/stores to be replaced by 0/ignored without adding
additional conditional logic, which is important for performance.
The repository currently contains a limited conversion from
transfer_read/transfer_write to Mubuf intrinsics, which are older,
deprecated intrinsics for the same functionality.
The new amdgpu.raw_buffer_* ops allow these operations to be used
explicitly and allow including metadata such as whether the target
chipset is an RDNA chip or not (which impacts the interpretation of
some bits in the buffer descriptor), while still maintaining an
MLIR-like interface.
(This change also exposes the floating-point atomic add intrinsic.)
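A hypothetical usage sketch (the attribute names, index types, and assembly format here are assumptions for illustration, not quoted from the dialect definition):
```
// Out-of-bounds lanes read 0 instead of faulting, with no explicit
// conditional logic around the access.
%v = amdgpu.raw_buffer_load {boundsCheck = true, targetIsRDNA = false}
       %buf[%idx] : memref<128xf32>, i32 -> f32
```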
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D122765
A typical problem is a missing token at the end of a line. In that case
the error message gets reported at the start of the following
line (which is where the next, invalid token is), which can
be confusing.
Handle this by noticing this case and backing up to the end of
the previous line.
Differential Revision: https://reviews.llvm.org/D125295
Add an attribute to generate the intrinsic version of the async copy
that bypasses L1. This corresponds to
cp.async.cg.shared.global in PTX.
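A rough sketch of the intended use (the op spelling, operand layout, and attribute name are assumptions, not verified against the op definition):
```
// Asynchronously copy 8 elements from global to workgroup memory,
// requesting the L1-bypassing (cp.async.cg) variant.
%token = nvgpu.device_async_copy %src[%c0], %dst[%c0], 8 {bypassL1}
    : memref<128xf32> to memref<128xf32, 3>
```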
Differential Revision: https://reviews.llvm.org/D125241
There are a couple of issues with the python bindings on Windows:
- `create_symlink` requires special permissions on Windows - using `copy_if_different` instead allows the build to complete and then be usable
- the path to the `python_executable` is likely to contain spaces if python is installed in Program Files. llvm's python substitution adds extra quotes in order to account for this case, but mlir's own python substitution does not
- the location of the shared libraries is different on Windows
- if the type is not specified for numpy arrays, they appear to be treated as strings
I've implemented the smallest possible changes for each of these in the patch, but I would actually prefer a slightly more comprehensive fix for the python_executable and the shared libraries.
For the python substitution, I think it makes sense to leverage the existing %python instead of adding %PYTHON and instead add a new variable for the case when preloading is needed. This would also make it clearer which tests are which and should be skipped on platforms where the preloading won't work.
For the shared libraries, I think it would make sense to pass the correct path and extension (possibly even the names) to the python script since these are known by lit and don't have to be hardcoded in the test at all.
Reviewed By: stellaraccident
Differential Revision: https://reviews.llvm.org/D125122
This patch fixes the padding size calculation for Conv2d ops when the stride > 1. It contains the changes below:
- Use addBound to add a constraint for AffineApplyOp in getUpperBoundForIndex, so the result value can be mapped and retrieved later.
- Fix the bound from AffineMinOp by adding it as a closed bound. Originally the bound was added as an open upper bound, which results in incorrect bounds when we multiply the values. For example:
```
%0 = affine.min affine_map<()[s0] -> (4, -s0 + 11)>()[%iv0]
%1 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%0]
```
If we add the affine.min as an open bound, addBound will internally transform it into the closed bound "%0 <= 3". The following sliceBounds will then derive the bound of %1 as "%1 <= 6" and return the open bound "%1 < 7", while the correct bound should be "%1 <= 8".
- In addition to addBound, also change sliceBounds to support returning closed upper bounds, since for the size computation we usually care about the closed bounds.
- Change getUpperBoundForIndex to favor constant bounds when required. sliceBounds may return a tighter but non-constant bound, which can't be used for padding. The constantRequired option requires getUpperBoundForIndex to get the constant bounds when possible.
Reviewed By: hanchung
Differential Revision: https://reviews.llvm.org/D124821
This patch augments the `tensor-bufferize` pass by adding a conversion
rule to translate ReshapeOp from the `tensor` dialect to the `memref`
dialect, in addition to adding a unit test to validate the translation.
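Conceptually (a sketch with made-up types; the to_memref/to_tensor plumbing inserted by the conversion is omitted):
```
// Before bufferization:
%r = tensor.reshape %t(%shape)
    : (tensor<4x4xf32>, tensor<1xi64>) -> tensor<16xf32>
// After bufferization, the same reshape on buffers:
%rm = memref.reshape %m(%mshape)
    : (memref<4x4xf32>, memref<1xi64>) -> memref<16xf32>
```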
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D125031
libm doesn't have overloads for the small types, so promote them to a
bigger type and use the f32 function.
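For instance (a conceptual sketch; the chosen op and the exact conversion output are illustrative):
```
// Before: libm has no f16 overload for this op.
%y = math.atan %x : f16
// After: promote to f32, call the f32 libm function, truncate back.
%xf = arith.extf %x : f16 to f32
%yf = func.call @atanf(%xf) : (f32) -> f32
%y  = arith.truncf %yf : f32 to f16
```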
Differential Revision: https://reviews.llvm.org/D125093
Adds missing logic in the lowering from NvGPU to NVVM to support fp32
(in an accumulator operand) and tf32 (in a multiplicand operand) types.
Fixes logic in one of the helper functions for converting the result
of an mma.sync operation with multiple 8x256-bit output tiles, which is
the case for f32 outputs.
Differential Revision: https://reviews.llvm.org/D124533
This was left over from when the standard dialect was destroyed, and
when FuncOp moved to the func dialect. Now that these transitions
have settled a bit we can drop these.
Most updates were handled using a simple regex: replace `^( *)func` with `$1func.func`
Differential Revision: https://reviews.llvm.org/D124146
This follows the same implementation strategy as scf::ForOp and common functionality is extracted into helper functions.
This implementation works well in cases where each yielded value (from either body/condition region) is equivalent to the corresponding bbArg of the parent block. In that case, each OpResult of the loop may be aliasing with the corresponding OpOperand of the loop (and with no other OpOperand).
In the absence of said equivalence relationship, new buffer copies must be inserted, so that the aliasing OpOperand/OpResult contract of scf::WhileOp is honored. In essence, by yielding a newly allocated buffer, we can enforce the specified may-alias relationship. (Newly allocated buffers cannot alias with any OpOperands of the loop.)
Differential Revision: https://reviews.llvm.org/D124929
A large DenseElementsAttr of i1 could trigger a bug in the printer/parser roundtrip.
E.g., a DenseElementsAttr of i1 with 200 elements will print in hex format with length 400 before the fix. However, when parsing the printed text, an error will be triggered. After the fix, the printed length will be 50 (the i1 values are packed as bits, so 200 elements occupy 25 bytes, i.e. 50 hex characters).
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D122925
Although we now have semi-rings to deal with arbitrary ops,
it is still good to convey zero-preserving semantics of
ops to the sparse compiler.
Reviewed By: bixia
Differential Revision: https://reviews.llvm.org/D125043
The fallback attribute parse path is parsing a Type attribute, but this results
in a really unintuitive error message: `expected non-function type`, which
doesn't really hint at all that we were trying to parse an attribute. This
commit fixes this by trying to optionally parse a type, and on failure
emitting an error that we were expecting an attribute.
Differential Revision: https://reviews.llvm.org/D124870
The names of the functions that are supposed to be exported do not match the implementations. This is due in part to cac7aabbd8.
This change makes the implementations and declarations match and adds a couple missing declarations.
The new names follow the pattern of the existing `verify` functions where the prefix is maintained as `_mlir_ciface_` but the suffix follows the new naming convention.
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D124891
The NVVM dialect test coverage for all possible type/shape combinations
in the `nvvm.mma.sync` op is mostly complete. However, there were tests
missing for TF32 datatype support. This change adds tests for the one
relevant shape/type combination. This uncovered a small bug in the op
verifier, which this change also fixes.
Differential Revision: https://reviews.llvm.org/D124975
The previous error message was technically incorrect. We do not compare equivalence of YieldOp operands and ForOp operands.
Differential Revision: https://reviews.llvm.org/D124934
Inside processInstruction, we assign the translated mlir::Value to a
reference previously taken from the corresponding entry in instMap.
However, instMap (a DenseMap) might resize after the entry reference was
taken, rendering the assignment useless since it's assigning to a
dangling reference. Here is a (pseudo) snippet that shows the concept:
```
// inst has type llvm::Instruction *
Value &v = instMap[inst];
...
// op is one of the operands of inst, has type llvm::Value *
processValue(op);
// instMap resizes inside processValue
...
translatedValue = b.createOp<Foo>(...);
// v is already a dangling reference at this point!
// The following assignment is bogus.
v = translatedValue;
```
Nevertheless, after we stop caching llvm::Constant into instMap, there
is only one case that can cause processValue to resize instMap: if the
operand is an llvm::ConstantExpr, in which case we will insert the
derived llvm::Instruction into instMap.
Since instMap is a DenseMap, the resize threshold depends on the ratio
between the number of map entries and the number of (hash) buckets. More specifically,
it resizes if (# of map entries / # of buckets) >= 0.75.
In this case the # of map entries is equal to the # of LLVM instructions, and the # of
buckets is the power-of-two upper bound of the # of map entries. Thus, for
the attached test case (test/Target/LLVMIR/Import/incorrect-instmap-assignment.ll),
we picked 96 and 128 for the # of map entries and # of buckets, respectively.
(We can't pick numbers that are too small since DenseMap uses inline
storage for a small number of entries.) Therefore, the ConstantExpr in the
said test case (i.e. a GEP) is the 96th llvm::Value cached into the
instMap, triggering the issue we're discussing here on its enclosing
instruction (i.e. a load).
This patch fixes the issue by calling `operator[]` every time we need to
update an entry.
Differential Revision: https://reviews.llvm.org/D124627
Support int8, int16, and int32. Also fix the source code format in mlir_pytaco_utils.py.
Add tests.
Reviewed By: aartbik
Differential Revision: https://reviews.llvm.org/D124925
This commit relaxes the rules around ops that define a value but do not specify the tensor's contents. (The only such op at the moment is init_tensor.)
When such a tensor is written to inside a loop, it should not cause an out-of-place bufferization.
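A minimal sketch of the pattern this enables (names, shapes, and values are illustrative):
```
// %init's contents are undefined, so the loop may bufferize the writes
// in place instead of copying the iter_arg on every iteration.
%init = linalg.init_tensor [10] : tensor<10xf32>
%res = scf.for %iv = %c0 to %c10 step %c1 iter_args(%t = %init) -> (tensor<10xf32>) {
  %1 = tensor.insert %f into %t[%iv] : tensor<10xf32>
  scf.yield %1 : tensor<10xf32>
}
```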
Differential Revision: https://reviews.llvm.org/D124849
Adding lowering for Unary and Binary required several changes due to
their unique nature of containing custom code for different "regions"
of the sparse structure being operated on. Along with a Kind, a pointer
to the Operation is passed along to be merged once the lattice
structure is figured out.
The original operation is maintained, as it is required for subsequent
lattice decisions. However, sparse_tensor.binary has some branches that
are considered fully handled and are therefore marked as
kBinaryBranch to distinguish them.
A unique aspect of the custom code is that sometimes the desired result
is no result at all -- i.e. a user wants overlapping sparse entries to
become empty in the output. The solution to this is to return an
uninitialized Value(), which is checked and handled elsewhere in the
code and results in nothing being written to the output tensor for that
case.
Reviewed By: aartbik
Differential Revision: https://reviews.llvm.org/D123057
Add the mechanism for TransformState extensions to update the mapping between
Transform IR values and Payload IR operations held by the state. The mechanism
is intentionally restrictive, similarly to how results of the transform op are
handled.
Introduce test ops that exercise a simple extension that maintains information
across the application of multiple transform ops.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D124778
This patch adds support for translating FCmp and more kinds of FP
constants in addition to 32- & 64-bit ones. However, we can't express
ppc_fp128 constants right now because the semantics of its underlying
APFloat is `S_PPCDoubleDouble`, which mlir::FloatType doesn't currently
support.
Differential Revision: https://reviews.llvm.org/D124630
This patch restricts the value of the `if` clause expression to an I1 value.
It also restricts the value of the `num_threads` clause expression to an I32
value.
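For example (a sketch; the clause spellings and operand producers are assumed rather than quoted from the dialect):
```
// %cond must be i1 and %n must be i32 after this change.
omp.parallel if(%cond : i1) num_threads(%n : i32) {
  omp.terminator
}
```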
Reviewed By: kiranchandramohan
Differential Revision: https://reviews.llvm.org/D124142
The current implementation uses a discrete "pdl_interp.inferred_types"
operation, which acts as a "fake" handle to a type range. This op is
used as a signal to pdl_interp.create_operation that types should be
inferred. This is terribly awkward and clunky though:
* This op doesn't have a byte code representation, and its conversion
to bytecode kind of assumes that it is only used in a certain way. The
current lowering is also broken and seemingly untested.
* Given that this is a different operation, it gives off the assumption
that it can be used multiple times, or that after the first use
the value contains the inferred types. This isn't the case, though;
the resultant type range can never actually be used as a type range.
This commit refactors the representation by removing the discrete
InferredTypesOp, and instead adds a UnitAttr to
pdl_interp.CreateOperation that signals when the created operations
should infer their types. This leads to a much much cleaner abstraction,
a more optimal bytecode lowering, and also allows for better error
handling and diagnostics when a created operation doesn't actually
support type inference.
Differential Revision: https://reviews.llvm.org/D124587
MLIR has a common pattern for "arguments" that uses syntax
like `%x : i32 {attrs} loc("sourceloc")`, which is implemented
in ad-hoc ways throughout the codebase. The current approach is
verbose (because it is implemented with parallel arrays) and
inconsistent (e.g. lots of things drop source location info).
Solve this by introducing OpAsmParser::Argument and make addRegion
(which sets up BlockArguments for the region) take it. Convert the
world to propagating this down. This means that we correctly
capture and propagate source location information in a lot more
cases (e.g. see the affine.for testcase example), and it also
simplifies much code.
Differential Revision: https://reviews.llvm.org/D124649