Start GPU Dialect

Define a new dialect related to GPU kernels. Currently, it only contains a single operation for launching a kernel on a three-dimensional grid of thread blocks, following a model similar to that of CUDA. In particular, the body of the kernel contains operations executed by each thread and uses region arguments to accept thread and block identifiers (similar to how the loop body region accepts the induction value). -- PiperOrigin-RevId: 245713728
2019-04-29 03:00:25 -07:00 · 2019-04-29 03:00:25 -07:00 · aae8a7446e
parent 4c74f1bf38
commit aae8a7446e
11 changed files with 492 additions and 0 deletions
--- a/mlir/g3doc/Dialects/GPU.md
+++ b/mlir/g3doc/Dialects/GPU.md
@ -0,0 +1,69 @@
+# GPU Dialect
+
+Note: this dialect is more likely to change than others in the near future; use
+with caution.
+
+This dialect provides middle-level abstractions for launching GPU kernels
+following a programming model similar to that of CUDA or OpenCL. It provides
+abstractions for kernel invocations (and may eventually provide those for device
+management) that are not present at the lower level (e.g., as LLVM IR intrinsics
+for GPUs). Its goal is to abstract away device- and driver-specific
+manipulations to launch a GPU kernel and provide a simple path towards GPU
+execution from MLIR. It may be targeted, for example, by DSLs using MLIR. The
+dialect uses `gpu` as its canonical prefix.
+
+## Operations
+
+### `gpu.launch`
+
+Launch a kernel on the specified grid of thread blocks. The body of the kernel
+is defined by the single region that this operation contains. The operation
+takes at least six operands, with first three operands being grid sizes along
+x,y,z dimensions, the following three arguments being block sizes along x,y,z
+dimension, and the remaining operands are arguments of the kernel. When a
+lower-dimensional kernel is required, unused sizes must be explicitly set to
+`1`.
+
+The body region has at least _twelve_ arguments, grouped as follows:
+
+-   three arguments that contain block identifiers along x,y,z dimensions;
+-   three arguments that contain thread identifiers along x,y,z dimensions;
+-   operands of the `gpu.launch` operation as is, including six leading operands
+    for grid and block sizes.
+
+Operations inside the body region, and any operations in the nested regions, are
+_not_ allowed to use values defined outside the _body_ region, as if this region
+was a function. If necessary, values must be passed as kernel arguments into the
+body region. Nested regions inside the kernel body are allowed to use values
+defined in their ancestor regions as long as they don't cross the kernel body
+region boundary.
+
+Custom syntax for this operation is currently not available.
+
+Example:
+
+```mlir {.mlir}
+// Generic syntax explains how the pretty syntax maps to the IR structure.
+"gpu.launch"(%cst, %cst, %c1,  // Grid sizes.
+                    %cst, %c1, %c1,   // Block sizes.
+                    %arg0, %arg1)     // Actual arguments.
+    {/*attributes*/}
+    // All sizes and identifiers have "index" size.
+    : (index, index, index, index, index, index, f32, memref<?xf32, 1>) -> () {
+// The operation passes block and thread identifiers, followed by grid and block
+// sizes, followed by actual arguments to the entry block of the region.
+^bb0(%bx : index, %by : index, %bz : index,
+     %tx : index, %ty : index, %tz : index,
+     %num_bx : index, %num_by : index, %num_bz : index,
+     %num_tx : index, %num_ty : index, %num_tz : index,
+     %arg0 : f32, %arg1 : memref<?xf32, 1>):
+  "some_op"(%bx, %tx) : (index, index) -> ()
+  %3 = "std.load"(%arg1, %bx) : (memref<?xf32, 1>, index) -> f32
+}
+```
+
+Rationale: using operation/block arguments gives analyses a clear way of
+understanding that a value has additional semantics (e.g., we will need to know
+what value corresponds to threadIdx.x for coalescing). We can recover these
+properties by analyzing the operations producing values, but it is easier just
+to have that information by construction.
--- a/mlir/include/mlir/GPU/GPUDialect.h
+++ b/mlir/include/mlir/GPU/GPUDialect.h
@ -0,0 +1,72 @@
+//===- GPUDialect.h - MLIR Dialect for GPU Kernels --------------*- C++ -*-===//
+//
+// Copyright 2019 The MLIR Authors.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+// =============================================================================
+//
+// This file defines the GPU kernel-related operations and puts them in the
+// corresponding dialect.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef MLIR_GPU_GPUDIALECT_H
+#define MLIR_GPU_GPUDIALECT_H
+
+#include "mlir/IR/Dialect.h"
+#include "mlir/IR/OpDefinition.h"
+
+namespace mlir {
+
+/// The dialect containing GPU kernel launching operations and related
+/// facilities.
+class GPUDialect : public Dialect {
+public:
+  /// Create the dialect in the given `context`.
+  GPUDialect(MLIRContext *context);
+
+  /// Get the canonical string name of the dialect.
+  static StringRef getDialectName();
+};
+
+struct KernelDim3 {
+  Value *x;
+  Value *y;
+  Value *z;
+};
+
+class LaunchOp : public Op<LaunchOp, OpTrait::AtLeastNOperands<6>::Impl,
+                           OpTrait::ZeroResult,
+                           OpTrait::NthRegionIsIsolatedAbove<0>::Impl> {
+public:
+  using Op::Op;
+
+  static void build(Builder *builder, OperationState *result, Value *gridSizeX,
+                    Value *gridSizeY, Value *gridSizeZ, Value *blockSizeX,
+                    Value *blockSizeY, Value *blockSizeZ,
+                    ArrayRef<Value *> operands);
+
+  Region &getBody();
+  KernelDim3 getBlockIds();
+  KernelDim3 getThreadIds();
+  KernelDim3 getGridSize();
+  KernelDim3 getBlockSize();
+
+  LogicalResult verify();
+
+  static StringRef getOperationName() { return "gpu.launch"; }
+};
+
+} // end namespace mlir
+
+#endif // MLIR_GPUKERNEL_GPUDIALECT_H
--- a/mlir/include/mlir/IR/Block.h
+++ b/mlir/include/mlir/IR/Block.h
@ -395,6 +395,13 @@ public:
    blocks.splice(blocks.end(), other.getBlocks());
  }

+  /// Check that this does not use any value defined outside it.
+  /// Emit errors if `noteEmitter` is provided; this callback is used to point
+  /// to the operation containing the region, the actual error is reported at
+  /// the operation with an offending use.
+  bool
+  isIsolatedAbove(llvm::function_ref<void(const Twine &)> noteEmitter = {});
+
 private:
  RegionType blocks;

--- a/mlir/include/mlir/IR/OpDefinition.h
+++ b/mlir/include/mlir/IR/OpDefinition.h
@ -697,6 +697,22 @@ public:
  }
 };

+/// This verifiers that all operands used in N-th region of the given operation
+/// are defined within that region.
+template <unsigned RegionIdx> class NthRegionIsIsolatedAbove {
+public:
+  template <typename ConcreteType>
+  class Impl : public TraitBase<ConcreteType,
+                                NthRegionIsIsolatedAbove<RegionIdx>::Impl> {
+  public:
+    static LogicalResult verifyTrait(Operation *op) {
+      auto noteEmitter = [op](const Twine &message) { op->emitNote(message); };
+      return op->getRegion(RegionIdx).isIsolatedAbove(noteEmitter) ? success()
+                                                                   : failure();
+    }
+  };
+};
+
 } // end namespace OpTrait

 //===----------------------------------------------------------------------===//
--- a/mlir/lib/CMakeLists.txt
+++ b/mlir/lib/CMakeLists.txt
@ -4,6 +4,7 @@ add_subdirectory(Dialect)
 add_subdirectory(EDSC)
 add_subdirectory(ExecutionEngine)
 add_subdirectory(FxpMathOps)
+add_subdirectory(GPU)
 add_subdirectory(IR)
 add_subdirectory(LLVMIR)
 add_subdirectory(Linalg)
--- a/mlir/lib/GPU/CMakeLists.txt
+++ b/mlir/lib/GPU/CMakeLists.txt
@ -0,0 +1,9 @@
+add_llvm_library(MLIRGPU
+  IR/GPUDialect.cpp
+  IR/DialectRegistration.cpp
+
+  ADDITIONAL_HEADER_DIRS
+  ${MLIR_MAIN_INCLUDE_DIR}/mlir/GPU
+)
+add_dependencies(MLIRGPU MLIRIR LLVMSupport)
+target_link_libraries(MLIRGPU MLIRIR LLVMSupport)
--- a/mlir/lib/GPU/IR/DialectRegistration.cpp
+++ b/mlir/lib/GPU/IR/DialectRegistration.cpp
@ -0,0 +1,21 @@
+//===- DialectRegistration.cpp - MLIR GPU dialect registration ------------===//
+//
+// Copyright 2019 The MLIR Authors.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+// =============================================================================
+
+#include "mlir/GPU/GPUDialect.h"
+
+// Static initialization for GPU dialect registration.
+static mlir::DialectRegistration<mlir::GPUDialect> kernelDialect;
--- a/mlir/lib/GPU/IR/GPUDialect.cpp
+++ b/mlir/lib/GPU/IR/GPUDialect.cpp
@ -0,0 +1,99 @@
+//===- GPUDialect.cpp - MLIR Dialect for GPU Kernels implementation -------===//
+//
+// Copyright 2019 The MLIR Authors.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+// =============================================================================
+//
+// This file implements the GPU kernel-related dialect and its operations.
+//
+//===----------------------------------------------------------------------===//
+
+#include "mlir/GPU/GPUDialect.h"
+#include "mlir/IR/Builders.h"
+#include "mlir/IR/StandardTypes.h"
+
+using namespace mlir;
+
+StringRef GPUDialect::getDialectName() { return "gpu"; }
+
+GPUDialect::GPUDialect(MLIRContext *context)
+    : Dialect(getDialectName(), context) {
+  addOperations<LaunchOp>();
+}
+
+//===----------------------------------------------------------------------===//
+// LaunchOp
+//===----------------------------------------------------------------------===//
+
+static SmallVector<Type, 4> getValueTypes(ArrayRef<Value *> values) {
+  SmallVector<Type, 4> types;
+  types.reserve(values.size());
+  for (Value *v : values)
+    types.push_back(v->getType());
+  return types;
+}
+
+void LaunchOp::build(Builder *builder, OperationState *result, Value *gridSizeX,
+                     Value *gridSizeY, Value *gridSizeZ, Value *blockSizeX,
+                     Value *blockSizeY, Value *blockSizeZ,
+                     ArrayRef<Value *> operands) {
+  // Add grid and block sizes as op operands, followed by the data operands.
+  result->addOperands(
+      {gridSizeX, gridSizeY, gridSizeZ, blockSizeX, blockSizeY, blockSizeZ});
+  result->addOperands(operands);
+
+  // Create a kernel body region with 12 + N arguments, where the first 12
+  // arguments have `index` type and the rest have the same types as the data
+  // operands.
+  Region *kernelRegion = result->addRegion();
+  Block *body = new Block();
+  body->addArguments(std::vector<Type>(12, builder->getIndexType()));
+  body->addArguments(getValueTypes(operands));
+  kernelRegion->push_back(body);
+}
+
+Region &LaunchOp::getBody() { return getOperation()->getRegion(0); }
+
+KernelDim3 LaunchOp::getBlockIds() {
+  auto args = getBody().getBlocks().front().getArguments();
+  return KernelDim3{args[0], args[1], args[2]};
+}
+
+KernelDim3 LaunchOp::getThreadIds() {
+  auto args = getBody().getBlocks().front().getArguments();
+  return KernelDim3{args[3], args[4], args[5]};
+}
+
+KernelDim3 LaunchOp::getGridSize() {
+  auto args = getBody().getBlocks().front().getArguments();
+  return KernelDim3{args[6], args[7], args[8]};
+}
+
+KernelDim3 LaunchOp::getBlockSize() {
+  auto args = getBody().getBlocks().front().getArguments();
+  return KernelDim3{args[9], args[10], args[11]};
+}
+
+LogicalResult LaunchOp::verify() {
+  // Kernel launch takes 6 leading operands for grid/block sizes and transforms
+  // them into 12 region arguments for block/thread identifiers and grid/block
+  // sizes.
+  if (!getBody().empty()) {
+    Block &entryBlock = getBody().front();
+    if (entryBlock.getNumArguments() != 6 + getNumOperands())
+      return emitError("unexpected number of region arguments");
+  }
+
+  return success();
+}
--- a/mlir/lib/IR/Block.cpp
+++ b/mlir/lib/IR/Block.cpp
@ -353,6 +353,55 @@ void Region::cloneInto(Region *dest, BlockAndValueMapping &mapper,
    it->walk(remapOperands);
 }

+// Check that the given `region` does not use any value defined outside its
+// ancestor region `limit`.  That is, given `A{B{C{}}}` with limit `B`, `C` is
+// allowed to use values defined in `B` but not those defined in `A`.
+// Emit errors if `emitOpNote` is provided; this callback is used to point to
+// the operation containing the region, the actual error is reported at the
+// operation with an offending use.
+static bool
+isRegionIsolatedAbove(Region &region, Region &limit,
+                      llvm::function_ref<void(const Twine &)> emitOpNote = {}) {
+  assert(limit.isAncestor(&region) &&
+         "expected isolation limit to be an ancestor of the given region");
+
+  // List of regions to analyze.  Each region is processed independently, with
+  // respect to the common `limit` region, so we can look at them in any order.
+  // Therefore, use a simple vector and push/pop back the current region.
+  SmallVector<Region *, 8> pendingRegions;
+  pendingRegions.push_back(&region);
+
+  // Traverse all operations in the region.
+  while (!pendingRegions.empty()) {
+    for (Block &block : *pendingRegions.pop_back_val()) {
+      for (Operation &op : block) {
+        for (Value *operand : op.getOperands()) {
+          // Check that any value that is used by an operation is defined in the
+          // same region as either an operation result or a block argument.
+          if (operand->getContainingRegion()->isProperAncestor(&limit)) {
+            if (emitOpNote) {
+              op.emitOpError("using value defined outside the region");
+              emitOpNote("required by region isolation constraints");
+            }
+            return false;
+          }
+        }
+        // Schedule any regions the operations contain for further checking.
+        pendingRegions.reserve(pendingRegions.size() + op.getNumRegions());
+        for (Region &subRegion : op.getRegions())
+          pendingRegions.push_back(&subRegion);
+      }
+    }
+  }
+
+  return true;
+}
+
+bool Region::isIsolatedAbove(
+    llvm::function_ref<void(const Twine &)> noteEmitter) {
+  return isRegionIsolatedAbove(*this, *this, noteEmitter);
+}
+
 Region *llvm::ilist_traits<::mlir::Block>::getContainingRegion() {
  size_t Offset(
      size_t(&((Region *)nullptr->*Region::getSublistAccess(nullptr))));
--- a/mlir/test/GPU/invalid.mlir
+++ b/mlir/test/GPU/invalid.mlir
@ -0,0 +1,78 @@
+// RUN: mlir-opt -split-input-file -verify %s
+
+func @not_enough_sizes(%sz : index) {
+  // expected-error@+1 {{expected 6 or more operands}}
+  "gpu.launch"(%sz, %sz, %sz, %sz, %sz)
+      : (index, index, index, index, index) -> () {
+    return
+  }
+  return
+}
+
+// -----
+
+func @no_region_attrs(%sz : index) {
+  // expected-error@+1 {{unexpected number of region arguments}}
+ "gpu.launch"(%sz, %sz, %sz, %sz, %sz, %sz)
+      : (index, index, index, index, index, index) -> () {
+  ^bb1(%bx: index, %by: index, %bz: index,
+       %tx: index, %ty: index, %tz: index):
+    return
+  }
+  return
+}
+
+// -----
+
+func @isolation_arg(%sz : index) {
+ // expected-note@+1 {{required by region isolation constraints}}
+ "gpu.launch"(%sz, %sz, %sz, %sz, %sz, %sz)
+      : (index, index, index, index, index, index) -> () {
+  ^bb1(%bx: index, %by: index, %bz: index,
+       %tx: index, %ty: index, %tz: index,
+       %szbx: index, %szby: index, %szbz: index,
+       %sztx: index, %szty: index, %sztz: index):
+    // expected-error@+1 {{using value defined outside the region}}
+    "use"(%sz) : (index) -> ()
+    return
+  }
+  return
+}
+
+// -----
+
+func @isolation_op(%sz : index) {
+ %val = "produce"() : () -> (index)
+ // expected-note@+1 {{required by region isolation constraints}}
+ "gpu.launch"(%sz, %sz, %sz, %sz, %sz, %sz)
+      : (index, index, index, index, index, index) -> () {
+  ^bb1(%bx: index, %by: index, %bz: index,
+       %tx: index, %ty: index, %tz: index,
+       %szbx: index, %szby: index, %szbz: index,
+       %sztx: index, %szty: index, %sztz: index):
+    // expected-error@+1 {{using value defined outside the region}}
+    "use"(%val) : (index) -> ()
+    return
+  }
+  return
+}
+
+// -----
+
+func @nested_isolation(%sz : index) {
+  // expected-note@+1 {{required by region isolation constraints}}
+  "gpu.launch"(%sz, %sz, %sz, %sz, %sz, %sz)
+      : (index, index, index, index, index, index) -> () {
+  ^bb1(%bx: index, %by: index, %bz: index,
+       %tx: index, %ty: index, %tz: index,
+       %szbx: index, %szby: index, %szbz: index,
+       %sztx: index, %szty: index, %sztz: index):
+    "region"() : () -> () {
+      "region"() : () -> () {
+        // expected-error@+1 {{using value defined outside the region}}
+        "use"(%sz) : (index) -> ()
+      }
+    }
+  }
+  return
+}
--- a/mlir/test/GPU/ops.mlir
+++ b/mlir/test/GPU/ops.mlir
@ -0,0 +1,71 @@
+// RUN: mlir-opt %s | FileCheck %s
+
+// CHECK-LABEL:func @no_args(%arg0: index)
+func @no_args(%sz : index) {
+// CHECK:  "gpu.launch"(%arg0, %arg0, %arg0, %arg0, %arg0, %arg0) : (index, index, index, index, index, index) -> () {
+  "gpu.launch"(%sz, %sz, %sz, %sz, %sz, %sz)
+      : (index, index, index, index, index, index) -> () {
+  ^bb1(%bx: index, %by: index, %bz: index,
+       %tx: index, %ty: index, %tz: index,
+       %szbx: index, %szby: index, %szbz: index,
+       %sztx: index, %szty: index, %sztz: index):
+    return
+  }
+  return
+}
+
+// CHECK-LABEL:func @args(%arg0: index, %arg1: index, %arg2: f32, %arg3: memref<?xf32, 1>) {
+func @args(%blk : index, %thrd : index, %float : f32, %data : memref<?xf32,1>) {
+// CHECK:  "gpu.launch"(%arg0, %arg0, %arg0, %arg1, %arg1, %arg1, %arg2, %arg3) : (index, index, index, index, index, index, f32, memref<?xf32, 1>) -> () {
+  "gpu.launch"(%blk, %blk, %blk, %thrd, %thrd, %thrd, %float, %data)
+      : (index, index, index, index, index, index, f32, memref<?xf32,1>) -> () {
+  ^bb1(%bx: index, %by: index, %bz: index,
+       %tx: index, %ty: index, %tz: index,
+       %szbx: index, %szby: index, %szbz: index,
+       %sztx: index, %szty: index, %sztz: index,
+       %data0: f32, %data1: memref<?xf32,1>):
+    return
+  }
+  return
+}
+
+// It is possible to use values passed into the region as arguments.
+// CHECK-LABEL: func @passing_values
+func @passing_values(%blk : index, %thrd : index, %float : f32, %data : memref<?xf32,1>) {
+// CHECK:  "gpu.launch"(%arg0, %arg0, %arg0, %arg1, %arg1, %arg1, %arg2, %arg3) : (index, index, index, index, index, index, f32, memref<?xf32, 1>) -> () {
+  "gpu.launch"(%blk, %blk, %blk, %thrd, %thrd, %thrd, %float, %data)
+      : (index, index, index, index, index, index, f32, memref<?xf32,1>) -> () {
+// CHECK: ^bb1(%i0: index, %i1: index, %i2: index, %i3: index, %i4: index, %i5: index, %i6: index, %i7: index, %i8: index, %i9: index, %i10: index, %i11: index, %i12: f32, %i13: memref<?xf32, 1>)
+  ^bb1(%bx: index, %by: index, %bz: index,
+       %tx: index, %ty: index, %tz: index,
+       %szbx: index, %szby: index, %szbz: index,
+       %sztx: index, %szty: index, %sztz: index,
+       %data0: f32, %data1: memref<?xf32,1>):
+// CHECK: "use"(%i12)
+    "use"(%data0): (f32) -> ()
+    return
+  }
+  return
+}
+
+// It is possible to use values defined in nested regions as long as they don't
+// cross kernel launch region boundaries.
+// CHECK-LABEL: func @nested_isolation
+func @nested_isolation(%sz : index) {
+  "gpu.launch"(%sz, %sz, %sz, %sz, %sz, %sz)
+      : (index, index, index, index, index, index) -> () {
+  ^bb1(%bx: index, %by: index, %bz: index,
+       %tx: index, %ty: index, %tz: index,
+       %szbx: index, %szby: index, %szbz: index,
+       %sztx: index, %szty: index, %sztz: index):
+    "region"() : () -> () {
+// CHECK: %0 = "produce"()
+      %val = "produce"() : () -> (index)
+      "region"() : () -> () {
+// CHECK: "use"(%0)
+        "use"(%val) : (index) -> ()
+      }
+    }
+  }
+  return
+}