[mlir][vector] NFC change to improve doc of vector distribution op

Improve doc based on post commit review from https://reviews.llvm.org/D123703
Add more details on the op semantics and explicitly mention which parts are
parallel and which parts are serial.

Differential Revision: https://reviews.llvm.org/D125227
Thomas Raoux 2022-07-22 17:13:22 +00:00
parent 9429b67b8e
commit c53ee73b48
1 changed files with 37 additions and 15 deletions

@ -2570,16 +2570,16 @@ def Vector_WarpExecuteOnLane0Op : Vector_Op<"warp_execute_on_lane_0",
[DeclareOpInterfaceMethods<RegionBranchOpInterface, ["areTypesCompatible"]>,
SingleBlockImplicitTerminator<"vector::YieldOp">,
RecursiveSideEffects]> {
let summary = "Executes operations in the associated region on lane #0 of a"
"GPU SIMT warp";
let summary = "Executes operations in the associated region on thread #0 of a"
"SPMD program";
let description = [{
`warp_execute_on_lane_0` is an operation used to bridge the gap between
vector programming and GPU SIMT programming model. It allows to trivially
convert a region of vector code meant to run on a GPU warp into a valid SIMT
region and then allows incremental transformation to distribute vector
operations on the SIMT lane.
vector programming and SPMD programming model like GPU SIMT. It allows to
trivially convert a region of vector code meant to run on multiple threads
into a valid SPMD region and then allows incremental transformation to
distribute vector operations on the threads.
Any code present in the region would only be executed on first lane
Any code present in the region would only be executed on the first thread/lane
based on the `laneid` operand. The `laneid` operand is an integer ID between
[0, `warp_size`). The `warp_size` attribute indicates the number of lanes in
a warp.
@ -2588,7 +2588,8 @@ def Vector_WarpExecuteOnLane0Op : Vector_Op<"warp_execute_on_lane_0",
the single lane execution. The matching region argument is a vector of all
the values of those lanes available to the single active lane. The
distributed dimension is implicit based on the shape of the operand and
argument. In the future this may be described by an affine map.
argument. The properties of the distribution may be described by extra
attributes (e.g. affine map).
Return values are distributed on all lanes using laneId as index. The
vector is distributed based on the shape ratio between the vector type of
@ -2600,6 +2601,8 @@ def Vector_WarpExecuteOnLane0Op : Vector_Op<"warp_execute_on_lane_0",
Therefore the `warp_execute_on_lane_0` operations allow to implicitly copy
between lane0 and the lanes of the warp. When distributing a vector
from lane0 to all the lanes, the data are distributed in a block cyclic way.
For example, `vector<64xf32>` gets distributed on 32 threads and maps to
`vector<2xf32>`, where thread 0 contains vector[0] and vector[1].
During lowering values passed as operands and return value need to be
visible to different lanes within the warp. This would usually be done by
@ -2611,43 +2614,62 @@ def Vector_WarpExecuteOnLane0Op : Vector_Op<"warp_execute_on_lane_0",
Example:
```
// Execute in parallel on all threads/lanes.
vector.warp_execute_on_lane_0 (%laneid)[32] {
// Serial code running only on thread/lane 0.
...
}
// Execute in parallel on all threads/lanes.
```
This may be lowered to an scf.if region as below:
```
// Execute in parallel on all threads/lanes.
%cnd = arith.cmpi eq, %laneid, %c0 : index
scf.if %cnd {
...
// Serial code running only on thread/lane 0.
...
}
// Execute in parallel on all threads/lanes.
```
When the region has operands and/or return values:
```
// Execute in parallel on all threads/lanes.
%0 = vector.warp_execute_on_lane_0(%laneid)[32]
args(%v0 : vector<4xi32>) -> (vector<1xf32>) {
^bb0(%arg0 : vector<128xi32>) :
// Serial code running only on thread/lane 0.
...
vector.yield %1 : vector<32xf32>
}
// Execute in parallel on all threads/lanes.
```
Values at the region boundary would go through memory:
```
%tmp0 = memref.alloc() : memref<32xf32, 3>
%tmp1 = memref.alloc() : memref<32xf32, 3>
// Execute in parallel on all threads/lanes.
...
// Store the data from each thread into memory, then synchronize.
%tmp0 = memref.alloc() : memref<128xf32>
%tmp1 = memref.alloc() : memref<32xf32>
%cnd = arith.cmpi eq, %laneid, %c0 : index
vector.store %v0, %tmp0[%laneid] : memref<32xf32>, vector<1xf32>
warp_sync
vector.store %v0, %tmp0[%laneid] : memref<128xf32>, vector<4xf32>
some_synchronization_primitive
scf.if %cnd {
%arg0 = vector.load %tmp0[%c0] : memref<32xf32>, vector<32xf32>
// Serialized code running only on thread 0.
// Load the data from all the threads into a register of thread 0. This
// allows thread 0 to access data from all the threads.
%arg0 = vector.load %tmp0[%c0] : memref<128xf32>, vector<128xf32>
...
// Store the data from thread 0 into memory.
vector.store %1, %tmp1[%c0] : memref<32xf32>, vector<32xf32>
}
warp_sync
// Synchronize, then load the data in a block cyclic way so that the
// vector is distributed on all threads.
some_synchronization_primitive
%0 = vector.load %tmp1[%laneid] : memref<32xf32>, vector<1xf32>
// Execute in parallel on all threads/lanes.
```
}];
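
The operand/result distribution described in the doc above can be modelled outside MLIR. The following is a minimal Python sketch, not part of the commit and not how the op is implemented: the function names `distribute`, `gather`, and `warp_execute_on_lane_0` are hypothetical, and the layout (lane `i` holds a contiguous chunk) is an assumption taken from the `vector<64xf32>` example, where thread 0 contains vector[0] and vector[1].

```python
def distribute(vec, warp_size):
    """Split the vector seen by lane 0 into one contiguous chunk per lane.

    Assumed layout: lane i holds elements [i * k, (i + 1) * k), where
    k = len(vec) // warp_size.
    """
    if len(vec) % warp_size != 0:
        raise ValueError("vector length must be a multiple of warp_size")
    per_lane = len(vec) // warp_size
    return [vec[lane * per_lane:(lane + 1) * per_lane]
            for lane in range(warp_size)]


def gather(chunks):
    """Inverse of distribute: concatenate the per-lane chunks in lane order."""
    return [x for chunk in chunks for x in chunk]


def warp_execute_on_lane_0(per_lane_args, body, warp_size):
    """Model of the op: gather operands, run the body once, distribute results."""
    if len(per_lane_args) != warp_size:
        raise ValueError("expected one operand chunk per lane")
    full_arg = gather(per_lane_args)      # what lane 0 sees as the region argument
    result = body(full_arg)               # serial code, executed once
    return distribute(result, warp_size)  # what each lane receives back


# vector<64xf32> over 32 lanes: every lane gets a 2-element chunk.
chunks = distribute(list(range(64)), 32)

# Mirrors the shapes of the doc example: 4 elements per lane in (128 total),
# 32 elements yielded by the region, 1 element per lane out.
lane_args = [[lane] * 4 for lane in range(32)]
out = warp_execute_on_lane_0(
    lane_args,
    lambda v: [sum(v[i * 4:(i + 1) * 4]) for i in range(32)],
    32,
)
```

Here `chunks[0]` corresponds to the `vector<2xf32>` held by thread 0, and `out[lane]` to the `vector<1xf32>` result each lane receives. The shared-memory buffers and synchronization of the actual lowering are deliberately left out of this model.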