This revision drops init_tensor arguments from Linalg on tensors and instead uniformizes the output buffers and output tensors to be consistent.
This significantly simplifies the usage of Linalg on tensors and is a stepping stone for
its evolution towards a mixed tensor and shape abstraction discussed in https://llvm.discourse.group/t/linalg-and-shapes/2421/19.
Differential Revision: https://reviews.llvm.org/D93469
The InlineAsmOp mirrors the underlying LLVM semantics with a notable
exception: the embedded `asm_string` is not allowed to define or reference
any symbol or any global variable: only the operands of the op may be read,
written, or referenced.
Attempting to define or reference any symbol or any global behavior is
considered undefined behavior at this time.
The asm dialect syntax is currently specified with an integer (0 [default] for the "att dialect", 1 for the intel dialect) to circumvent the ODS limitation on string enums.
Translation to LLVM is provided and raises the fact that the asm constraints string must be well-formed with respect to in/out operands. No check is performed on the asm_string.
An InlineAsm instruction in LLVM is a special call operation to a function that is constructed on the fly.
It does not fit the current model of MLIR calls with symbols.
As a consequence, the current implementation constructs the function type in ModuleTranslation.cpp.
This should be refactored in the future.
The mlir-cpu-runner is augmented with the global initialization of the X86 asm parser to allow proper execution in JIT mode. Previously, only the X86 asm printer was initialized.
Differential revision: https://reviews.llvm.org/D92166
This change is required so that bufferization can properly identify
the linalg.yield as a terminator with an associated parent op.
Differential Revision: https://reviews.llvm.org/D92173
Depends On D89963
**Automatic reference counting algorithm outline:**
1. `ReturnLike` operations forward the reference counted values without
modifying the reference count.
2. Use liveness analysis to find blocks in the CFG where the lifetime of
reference counted values ends, and insert `drop_ref` operations after
the last use of the value.
3. Insert `add_ref` before the `async.execute` operation capturing the
value, and pairing `drop_ref` before the async body region terminator,
to release the captured reference counted value when execution
completes.
4. If the reference counted value is passed only to some of the block
successors, insert `drop_ref` operations in the beginning of the blocks
that do not have reference coutned value uses.
Reviewed By: silvas
Differential Revision: https://reviews.llvm.org/D90716
Depends On D89958
1. Adds `async.group`/`async.awaitall` to group together multiple async tokens/values
2. Rewrite scf.parallel operation into multiple concurrent async.execute operations over non overlapping subranges of the original loop.
Example:
```
scf.for (%i, %j) = (%lbi, %lbj) to (%ubi, %ubj) step (%si, %sj) {
"do_some_compute"(%i, %j): () -> ()
}
```
Converted to:
```
%c0 = constant 0 : index
%c1 = constant 1 : index
// Compute blocks sizes for each induction variable.
%num_blocks_i = ... : index
%num_blocks_j = ... : index
%block_size_i = ... : index
%block_size_j = ... : index
// Create an async group to track async execute ops.
%group = async.create_group
scf.for %bi = %c0 to %num_blocks_i step %c1 {
%block_start_i = ... : index
%block_end_i = ... : index
scf.for %bj = %c0 t0 %num_blocks_j step %c1 {
%block_start_j = ... : index
%block_end_j = ... : index
// Execute the body of original parallel operation for the current
// block.
%token = async.execute {
scf.for %i = %block_start_i to %block_end_i step %si {
scf.for %j = %block_start_j to %block_end_j step %sj {
"do_some_compute"(%i, %j): () -> ()
}
}
}
// Add produced async token to the group.
async.add_to_group %token, %group
}
}
// Await completion of all async.execute operations.
async.await_all %group
```
In this example outer loop launches inner block level loops as separate async
execute operations which will be executed concurrently.
At the end it waits for the completiom of all async execute operations.
Reviewed By: ftynse, mehdi_amini
Differential Revision: https://reviews.llvm.org/D89963
We lower them to a std.global_memref (uniqued by constant value) + a
std.get_global_memref to produce the corresponding memref value.
This allows removing Linalg's somewhat hacky lowering of tensor
constants, now that std properly supports this.
Differential Revision: https://reviews.llvm.org/D91306
It was incorrect in the presence of a tensor argument with multiple
uses.
The bufferization of subtensor_insert was writing into a converted
memref operand, but there is no guarantee that the converted memref for
that operand is safe to write into. In this case, the same converted
memref is written to in-place by the subtensor_insert bufferization,
violating the tensor-level semantics.
I left some comments in a TODO about ways forward on this. I will be
working actively on this problem in the coming days.
Differential Revision: https://reviews.llvm.org/D91371
This patch converts elementwise ops on tensors to linalg.generic ops
with the same elementwise op in the payload (except rewritten to
operate on scalars, obviously). This is a great form for later fusion to
clean up.
E.g.
```
// Compute: %arg0 + %arg1 - %arg2
func @f(%arg0: tensor<?xf32>, %arg1: tensor<?xf32>, %arg2: tensor<?xf32>) -> tensor<?xf32> {
%0 = addf %arg0, %arg1 : tensor<?xf32>
%1 = subf %0, %arg2 : tensor<?xf32>
return %1 : tensor<?xf32>
}
```
Running this through
`mlir-opt -convert-std-to-linalg -linalg-fusion-for-tensor-ops` we get:
```
func @f(%arg0: tensor<?xf32>, %arg1: tensor<?xf32>, %arg2: tensor<?xf32>) -> tensor<?xf32> {
%0 = linalg.generic {indexing_maps = [#map0, #map0, #map0, #map0], iterator_types = ["parallel"]} ins(%arg0, %arg1, %arg2 : tensor<?xf32>, tensor<?xf32>, tensor<?xf32>) {
^bb0(%arg3: f32, %arg4: f32, %arg5: f32): // no predecessors
%1 = addf %arg3, %arg4 : f32
%2 = subf %1, %arg5 : f32
linalg.yield %2 : f32
} -> tensor<?xf32>
return %0 : tensor<?xf32>
}
```
So the elementwise ops on tensors have nicely collapsed into a single
linalg.generic, which is the form we want for further transformations.
Differential Revision: https://reviews.llvm.org/D90354
The pass combines patterns of ExpandAtomic, ExpandMemRefReshape,
StdExpandDivs passes. The pass is meant to legalize STD for conversion to LLVM.
Differential Revision: https://reviews.llvm.org/D91082
Previously, linalg-bufferize was a "finalizing" bufferization pass (it
did a "full" conversion). This wasn't great because it couldn't be used
composably with other bufferization passes like std-bufferize and
scf-bufferize.
This patch makes linalg-bufferize a composable bufferization pass.
Notice that the integration tests are switched over to using a pipeline
of std-bufferize, linalg-bufferize, and (to finalize the conversion)
func-bufferize. It all "just works" together.
While doing this transition, I ran into a nasty bug in the 1-use special
case logic for forwarding init tensors. That logic, while
well-intentioned, was fundamentally flawed, because it assumed that if
the original tensor value had one use, then the converted memref could
be mutated in place. That assumption is wrong in many cases. For
example:
```
%0 = some_tensor : tensor<4xf32>
br ^bb0(%0, %0: tensor<4xf32>, tensor<4xf32>)
^bb0(%bbarg0: tensor<4xf32>, %bbarg1: tensor<4xf32>)
// %bbarg0 is an alias of %bbarg1. We cannot safely write
// to it without analyzing uses of %bbarg1.
linalg.generic ... init(%bbarg0) {...}
```
A similar example can happen in many scenarios with function arguments.
Even more sinister, if the converted memref is produced by a
`std.get_global_memref` of a constant global memref, then we might
attempt to write into read-only statically allocated storage! Not all
memrefs are writable!
Clearly, this 1-use check is not a local transformation that we can do
on the fly in this pattern, so I removed it.
The test is now drastically shorter and I basically rewrote the CHECK
lines from scratch because:
- the new composable linalg-bufferize just doesn't do as much, so there
is less to test
- a lot of the tests were related to the 1-use check, which is now gone,
so there is less to test
- the `-buffer-hoisting -buffer-deallocation` is no longer mixed in, so
the checks related to that had to be rewritten
Differential Revision: https://reviews.llvm.org/D90657
Fix semantic in the distribute integration test based on offline feedback. This
exposed a bug in block distribution, we need to make sure the id is multiplied
by the stride of the vector. Fix the transformation and unit test.
Differential Revision: https://reviews.llvm.org/D89291
TensorConstantOp bufferization currently uses the vector dialect to store constant data into memory.
Due to natural vector size and alignment properties, this is problematic with n>1-D vectors whose most minor dimension is not naturally aligned.
Instead, this revision linearizes the constant and introduces a linalg.reshape to go back to the desired shape.
Still this is still to be considered a workaround and a better longer term solution will probably involve `llvm.global`.
Differential Revision: https://reviews.llvm.org/D89311
The TensorConstantOp bufferize conversion pattern has a bug that
makes it incorrect in the case of vectors whose alignment is not
the natural alignment. Circumvent it temporarily by using a power of 2.
Differential Revision: https://reviews.llvm.org/D89265
This revision introduces support for buffer allocation for any named linalg op.
To avoid template instantiating many ops, a new ConversionPattern is created to capture the LinalgOp interface.
Some APIs are updated to remain consistent with MLIR style:
`OwningRewritePatternList * -> OwningRewritePatternList &`
`BufferAssignmentTypeConverter * -> BufferAssignmentTypeConverter &`
Differential revision: https://reviews.llvm.org/D89226
This revision also inserts an end-to-end test that lowers tensors to buffers all the way to executable code on CPU.
Differential revision: https://reviews.llvm.org/D88998
Current setup for conv op vectorization does not enable user to specify tile
sizes as well as dimensions for vectorization. In this commit we change that by
adding tile sizes as pass arguments. Every dimension with corresponding tile
size > 1 is automatically vectorized.
Differential Revision: https://reviews.llvm.org/D88533
Recently, restrictions on vector reductions were made more relaxed by
accepting any width signless integer and floating-point. This CL relaxes
the restriction even more by including unsigned and signed integers.
Reviewed By: bkramer
Differential Revision: https://reviews.llvm.org/D88442
(1) simplify integer printing logic by always using 64-bit print
(2) add index support (since vector<16xindex> is planned to be added)
(3) adjust naming convention print_x -> printX
Reviewed By: bkramer
Differential Revision: https://reviews.llvm.org/D88436
This generalizes printing beyond just i1,i32,i64 and also accounts
for signed and unsigned interpretation in the output.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D88290
This revision allows representing a reduction at the level of linalg on tensors for named ops. When a structured op has a reduction and returns tensor(s), new conventions are added and documented.
As an illustration, the syntax for a `linalg.matmul` writing into a buffer is:
```
linalg.matmul ins(%a, %b : memref<?x?xf32>, tensor<?x?xf32>)
outs(%c : memref<?x?xf32>)
```
, whereas the syntax for a `linalg.matmul` returning a new tensor is:
```
%d = linalg.matmul ins(%a, %b : tensor<?x?xf32>, memref<?x?xf32>)
init(%c : memref<?x?xf32>)
-> tensor<?x?xf32>
```
Other parts of linalg will be extended accordingly to allow mixed buffer/tensor semantics in the presence of reductions.
ConvOp vectorization supports now only convolutions of static shapes with dimensions
of size either 3(vectorized) or 1(not) as underlying vectors have to be of static
shape as well. In this commit we add support for convolutions of any size as well as
dynamic shapes by leveraging existing matmul infrastructure for tiling of both input
and kernel to sizes accepted by the previous version of ConvOp vectorization.
In the future this pass can be extended to take "tiling mask" as a user input which
will enable vectorization of user specified dimensions.
Differential Revision: https://reviews.llvm.org/D87676
This commit introduces end-to-end integration tests for
convolutions that test multiple ways of ConvOps lowering.
Differential Revision: https://reviews.llvm.org/D87277
This replaces the select chain for edge-padding with an scf.if that
performs the memory operation when the index is in bounds and uses the
pad value when it's not. For transfer_write the same mechanism is used,
skipping the store when the index is out of bounds.
The integration test has a bunch of cases of how I believe this should
work.
Differential Revision: https://reviews.llvm.org/D87241
Vector to SCF conversion still had issues due to the interaction with the natural alignment derived by the LLVM data layout. One traditional workaround is to allocate aligned. However, this does not always work for vector sizes that are non-powers of 2.
This revision implements a more portable mechanism where the intermediate allocation is always a memref of elemental vector type. AllocOp is extended to use the natural LLVM DataLayout alignment for non-scalar types, when the alignment is not specified in the first place.
An integration test is added that exercises the transfer to scf.for + scalar lowering with a 5x5 transposition.
Differential Revision: https://reviews.llvm.org/D87150
The intrinsics were already supported and vector.transfer_read/write lowered
direclty into these operations. By providing them as individual ops, however,
clients can used them directly, and it opens up progressively lowering transfer
operations at higher levels (rather than direct lowering to LLVM IR as done now).
Reviewed By: bkramer
Differential Revision: https://reviews.llvm.org/D85357
Introduces the expand and compress operations to the Vector dialect
(important memory operations for sparse computations), together
with a first reference implementation that lowers to the LLVM IR
dialect to enable running on CPU (and other targets that support
the corresponding LLVM IR intrinsics).
Reviewed By: reidtatge
Differential Revision: https://reviews.llvm.org/D84888
A new first-party modeling for LLVM IR types in the LLVM dialect has been
developed in parallel to the existing modeling based on wrapping LLVM `Type *`
instances. It resolves the long-standing problem of modeling identified
structure types, including recursive structures, and enables future removal of
LLVMContext and related locking mechanisms from LLVMDialect.
This commit only switches the modeling by (a) renaming LLVMTypeNew to LLVMType,
(b) removing the old implementaiton of LLVMType, and (c) updating the tests. It
is intentionally minimal. Separate commits will remove the infrastructure built
for the transition and update API uses where appropriate.
Depends On D85020
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D85021
Added a "clone" of the 1D vector's test_transfer_read and added a second dimensionality. The test is not as generic as I would like it to be, because more generic versions appear to break the compiler or the runtime at this stage. As bug are fixed, I will be happy to add another more complete test.
Differential Revision: https://reviews.llvm.org/D83096
Integration test that illustrates the gather operation with a
real-world operation expressed in mostly the Vector dialect.
Uses jagged diagonal storage.
Reviewed By: bondhugula
Differential Revision: https://reviews.llvm.org/D84571
Introduces the scatter/gather operations to the Vector dialect
(important memory operations for sparse computations), together
with a first reference implementation that lowers to the LLVM IR
dialect to enable running on CPU (and other targets that support
the corresponding LLVM IR intrinsics).
The operations can be used directly where applicable, or can be used
during progressively lowering to bring other memory operations closer to
hardware ISA support for a gather/scatter. The semantics of the operation
closely correspond to those of the corresponding llvm intrinsics.
Note that the operation allows for a dynamic index vector (which is
important for sparse computations). However, this first reference
lowering implementation "serializes" the address computation when
base + index_vector is converted to a vector of pointers. Exploring
how to use SIMD properly during these step is TBD. More general
memrefs and idiomatic versions of striding are also TBD.
Reviewed By: arpith-jacob
Differential Revision: https://reviews.llvm.org/D84039
Summary: The native alignment may generally not be used when lowering a vector.transfer to the underlying load/store operation. This revision fixes the unmasked load/store alignment to match that of the masked path.
Differential Revision: https://reviews.llvm.org/D83684
This specialization allows sharing more code where an AXPY follows naturally
in cases where an OUTERPRODUCT on a scalar would be generated.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D83453
Summary:
Two integration tests focused on i1 vectors, which exposed omissions
in the llvm backend which have since then been fixed. Note that this also
exposed an inaccuracy for print_i1 which has been fixed in this CL:
for a pure C ABI, int should be used rather than bool.
Reviewers: nicolasvasilache, ftynse, reidtatge, andydavis1, bkramer
Reviewed By: bkramer
Subscribers: mehdi_amini, rriddle, jpienaar, shauheen, antiagainst, nicolasvasilache, arpith-jacob, mgester, lucyrfox, liufengdb, stephenneuendorffer, Joonsoo, grosul1, frgossen, Kayjukh, jurahul, msifontes
Tags: #mlir
Differential Revision: https://reviews.llvm.org/D81957