forked from OSchip/llvm-project
[mlir][vector] NFC change to improve doc of vector distribution op
Improve the doc based on the post-commit review of https://reviews.llvm.org/D123703.
Add more details on the op semantics and explicitly mention which parts are
parallel and which parts are serial.

Differential Revision: https://reviews.llvm.org/D125227
parent 9429b67b8e
commit c53ee73b48
@@ -2570,16 +2570,16 @@ def Vector_WarpExecuteOnLane0Op : Vector_Op<"warp_execute_on_lane_0",
     [DeclareOpInterfaceMethods<RegionBranchOpInterface, ["areTypesCompatible"]>,
      SingleBlockImplicitTerminator<"vector::YieldOp">,
      RecursiveSideEffects]> {
-  let summary = "Executes operations in the associated region on lane #0 of a"
-                "GPU SIMT warp";
+  let summary = "Executes operations in the associated region on thread #0 of a"
+                "SPMD program";
   let description = [{
     `warp_execute_on_lane_0` is an operation used to bridge the gap between
-    vector programming and GPU SIMT programming model. It allows to trivially
-    convert a region of vector code meant to run on a GPU warp into a valid SIMT
-    region and then allows incremental transformation to distribute vector
-    operations on the SIMT lane.
+    vector programming and SPMD programming model like GPU SIMT. It allows to
+    trivially convert a region of vector code meant to run on a multiple threads
+    into a valid SPMD region and then allows incremental transformation to
+    distribute vector operations on the threads.
 
-    Any code present in the region would only be executed on first lane
+    Any code present in the region would only be executed on first thread/lane
     based on the `laneid` operand. The `laneid` operand is an integer ID between
     [0, `warp_size`). The `warp_size` attribute indicates the number of lanes in
     a warp.
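For reference, a minimal sketch of how the documented `laneid` operand and `warp_size` attribute fit together in a SIMT kernel (illustration only, not part of this change; obtaining the lane id from `gpu.thread_id` and the warp size of 32 are assumptions made for the example):

```
// Illustrative only: how a %laneid value might be obtained. Only the
// vector.warp_execute_on_lane_0 form follows the documented syntax.
%tid = gpu.thread_id x
%c32 = arith.constant 32 : index
// Lane id within the warp, in [0, warp_size).
%laneid = arith.remui %tid, %c32 : index
// Runs in parallel on all 32 lanes of the warp.
vector.warp_execute_on_lane_0 (%laneid)[32] {
  // Runs only on the lane where %laneid == 0.
  ...
}
```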
@@ -2588,7 +2588,8 @@ def Vector_WarpExecuteOnLane0Op : Vector_Op<"warp_execute_on_lane_0",
     the single lane execution. The matching region argument is a vector of all
     the values of those lanes available to the single active lane. The
     distributed dimension is implicit based on the shape of the operand and
-    argument. In the future this may be described by an affine map.
+    argument. the properties of the distribution may be described by extra
+    attributes (e.g. affine map).
 
     Return values are distributed on all lanes using laneId as index. The
     vector is distributed based on the shape ratio between the vector type of
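To make the implicit distributed dimension concrete: with a warp size of 32, a `vector<4xi32>` operand contributed by each lane is seen by lane 0 as a `vector<128xi32>` region argument (4 x 32 = 128), and a yielded `vector<32xf32>` comes back as a `vector<1xf32>` per-lane result (32 / 32 = 1). The sketch below reuses the types from the example later in this diff; the body ops are placeholders chosen only to make the snippet self-contained:

```
// Each lane passes a vector<4xi32>; lane 0 sees every lane's data: 4 x 32 = 128.
%0 = vector.warp_execute_on_lane_0(%laneid)[32]
    args(%v0 : vector<4xi32>) -> (vector<1xf32>) {
  ^bb0(%arg0 : vector<128xi32>):
    // Placeholder serial work on the full 128-element vector.
    %f = arith.sitofp %arg0 : vector<128xi32> to vector<128xf32>
    %s = vector.extract_strided_slice %f
        {offsets = [0], sizes = [32], strides = [1]}
        : vector<128xf32> to vector<32xf32>
    // The 32 yielded elements are split across the 32 lanes: 1 element each.
    vector.yield %s : vector<32xf32>
}
```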
@@ -2600,6 +2601,8 @@ def Vector_WarpExecuteOnLane0Op : Vector_Op<"warp_execute_on_lane_0",
     Therefore the `warp_execute_on_lane_0` operations allow to implicitly copy
     between lane0 and the lanes of the warp. When distributing a vector
     from lane0 to all the lanes, the data are distributed in a block cyclic way.
+    For exemple `vector<64xf32>` gets distributed on 32 threads and map to
+    `vector<2xf32>` where thread 0 contains vector[0] and vector[1].
 
     During lowering values passed as operands and return value need to be
     visible to different lanes within the warp. This would usually be done by
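The block cyclic mapping works out to lane `k` owning elements `2*k` and `2*k+1` of the `vector<64xf32>` (64 elements over 32 lanes gives 2 contiguous elements per lane). A lowering-style sketch of the per-lane reload, in the spirit of the memory-based example further down; the `%buf` buffer and the constants are hypothetical:

```
// Hypothetical: %buf holds the 64 elements produced by lane 0.
// Each lane reloads its contiguous 2-element block at offset 2 * %laneid,
// so lane 0 gets elements [0, 1], lane 1 gets [2, 3], and so on.
%c2 = arith.constant 2 : index
%offset = arith.muli %laneid, %c2 : index
%piece = vector.load %buf[%offset] : memref<64xf32>, vector<2xf32>
```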
@@ -2611,43 +2614,62 @@ def Vector_WarpExecuteOnLane0Op : Vector_Op<"warp_execute_on_lane_0",
 
     Example:
     ```
+    // Execute in parallel on all threads/lanes.
     vector.warp_execute_on_lane_0 (%laneid)[32] {
+      // Serial code running only on thread/lane 0.
     ...
     }
+    // Execute in parallel on all threads/lanes.
     ```
 
     This may be lowered to an scf.if region as below:
     ```
+      // Execute in parallel on all threads/lanes.
       %cnd = arith.cmpi eq, %laneid, %c0 : index
       scf.if %cnd {
-      ...
+        // Serial code running only on thread/lane 0.
+        ...
       }
+      // Execute in parallel on all threads/lanes.
     ```
 
     When the region has operands and/or return values:
     ```
+    // Execute in parallel on all threads/lanes.
     %0 = vector.warp_execute_on_lane_0(%laneid)[32]
     args(%v0 : vector<4xi32>) -> (vector<1xf32>) {
     ^bb0(%arg0 : vector<128xi32>) :
+      // Serial code running only on thread/lane 0.
       ...
       vector.yield %1 : vector<32xf32>
     }
+    // Execute in parallel on all threads/lanes.
     ```
 
     values at the region boundary would go through memory:
     ```
-    %tmp0 = memreg.alloc() : memref<32xf32, 3>
-    %tmp1 = memreg.alloc() : memref<32xf32, 3>
+    // Execute in parallel on all threads/lanes.
+    ...
+    // Store the data from each thread into memory and Synchronization.
+    %tmp0 = memreg.alloc() : memref<128xf32>
+    %tmp1 = memreg.alloc() : memref<32xf32>
     %cnd = arith.cmpi eq, %laneid, %c0 : index
-    vector.store %v0, %tmp0[%laneid] : memref<32xf32>, vector<1xf32>
-    warp_sync
+    vector.store %v0, %tmp0[%laneid] : memref<128xf32>, vector<4xf32>
+    some_synchronization_primitive
     scf.if %cnd {
-      %arg0 = vector.load %tmp0[%c0] : memref<32xf32>, vector<32xf32>
+      // Serialized code running only on thread 0.
+      // Load the data from all the threads into a register from thread 0. This
+      // allow threads 0 to access data from all the threads.
+      %arg0 = vector.load %tmp0[%c0] : memref<128xf32>, vector<128xf32>
       ...
+      // Store the data from thread 0 into memory.
       vector.store %1, %tmp1[%c0] : memref<32xf32>, vector<32xf32>
     }
-    warp_sync
+    // Synchronization and load the data in a block cyclic way so that the
+    // vector is distributed on all threads.
+    some_synchronization_primitive
     %0 = vector.load %tmp1[%laneid] : memref<32xf32>, vector<32xf32>
+    // Execute in parallel on all threads/lanes.
     ```
 
   }];