Commit Graph

866 Commits

Author SHA1 Message Date
Nicolas Vasilache f826ceef3c Extend vector.outerproduct with an optional 3rd argument
This CL adds an optional third argument to the vector.outerproduct instruction.
When such a third argument is specified, it is added to the result of the outerproduct and  is lowered to FMA intrinsic when the lowering supports it.

In the future, we can add an attribute on the `vector.outerproduct` instruction to modify the operations for which to emit code (e.g. "+/*", "max/+", "min/+", "log/exp" ...).

This CL additionally performs minor cleanups in the vector lowering and adds tests to improve coverage.

This has been independently verified to result in proper fma instructions for haswell as follows.

Input:
```
func @outerproduct_add(%arg0: vector<17xf32>, %arg1: vector<8xf32>, %arg2: vector<17x8xf32>) -> vector<17x8xf32> {
  %2 = vector.outerproduct %arg0, %arg1, %arg2 : vector<17xf32>, vector<8xf32>
  return %2 : vector<17x8xf32>
}
}
```

Command:
```
mlir-opt vector-to-llvm.mlir -vector-lower-to-llvm-dialect --disable-pass-threading | mlir-opt -lower-to-cfg -lower-to-llvm | mlir-translate --mlir-to-llvmir | opt -O3 | llc -O3 -march=x86-64 -mcpu=haswell -mattr=fma,avx2
```

Output:
```
outerproduct_add:                       # @outerproduct_add
# %bb.0:
        ...
        vmovaps 112(%rbp), %ymm8
        vbroadcastss    %xmm0, %ymm0
        ...
        vbroadcastss    64(%rbp), %ymm15
        vfmadd213ps     144(%rbp), %ymm8, %ymm0 # ymm0 = (ymm8 * ymm0) + mem
        ...
        vfmadd213ps     400(%rbp), %ymm8, %ymm9 # ymm9 = (ymm8 * ymm9) + mem
        ...
```
PiperOrigin-RevId: 263743359
2019-08-16 03:53:26 -07:00
Alex Zinenko 88de8b2a2b GenerateCubinAccessors: use LLVM dialect constants
The GenerateCubinAccessors was generating functions that fill
dynamically-allocated memory with the binary constant of a CUBIN attached as a
stirng attribute to the GPU kernel.  This approach was taken to circumvent the
missing support for global constants in the LLVM dialect (and MLIR in general).
Global constants were recently added to the LLVM dialect.  Change the
GenerateCubinAccessors pass to emit a global constant array of characters and a
function that returns a pointer to the first character in the array.

PiperOrigin-RevId: 263092052
2019-08-13 01:39:21 -07:00
Nicolas Vasilache 252ada4932 Add lowering of vector dialect to LLVM dialect.
This CL is step 3/n towards building a simple, programmable and portable vector abstraction in MLIR that can go all the way down to generating assembly vector code via LLVM's opt and llc tools.

This CL adds support for converting MLIR n-D vector types to (n-1)-D arrays of 1-D LLVM vectors and a conversion VectorToLLVM that lowers the `vector.extractelement` and `vector.outerproduct` instructions to the proper mix of `llvm.vectorshuffle`, `llvm.extractelement` and `llvm.mulf`.

This has been independently verified to produce proper avx2 code.

Input:
```
func @vec_1d(%arg0: vector<4xf32>, %arg1: vector<8xf32>) -> vector<8xf32> {
  %2 = vector.outerproduct %arg0, %arg1 : vector<4xf32>, vector<8xf32>
  %3 = vector.extractelement %2[0 : i32]: vector<4x8xf32>
  return %3 : vector<8xf32>
}
```

Command:
```
mlir-opt vector-to-llvm.mlir -vector-lower-to-llvm-dialect --disable-pass-threading | mlir-opt -lower-to-cfg -lower-to-llvm | mlir-translate --mlir-to-llvmir | opt -O3 | llc -O3 -march=x86-64 -mcpu=haswell -mattr=fma,avx2
```

Output:
```
vec_1d:                                 # @vec_1d
# %bb.0:
        vbroadcastss    %xmm0, %ymm0
        vmulps  %ymm1, %ymm0, %ymm0
        retq
```
PiperOrigin-RevId: 262895929
2019-08-12 04:08:57 -07:00
Mahesh Ravishankar ea56025f1e Initial implementation to translate kernel fn in GPU Dialect to SPIR-V Dialect
This CL adds an initial implementation for translation of kernel
function in GPU Dialect (used with a gpu.launch_kernel) op to a
spv.Module. The original function is translated into an entry
function.
Most of the heavy lifting is done by adding TypeConversion and other
utility functions/classes that provide most of the functionality to
translate from Standard Dialect to SPIR-V Dialect. These are intended
to be reusable in implementation of different dialect conversion
pipelines.
Note : Some of the files for have been renamed to be consistent with
the norm used by the other Conversion frameworks.
PiperOrigin-RevId: 260759165
2019-07-30 11:55:55 -07:00
Nicolas Vasilache e78ea03b24 Replace linalg.for by loop.for
With the introduction of the Loop dialect, uses of the `linalg.for` operation can now be subsumed 1-to-1 by `loop.for`.
This CL performs the replacement and tests are updated accordingly.

PiperOrigin-RevId: 258322565
2019-07-16 13:44:57 -07:00
Nicolas Vasilache cca53e8527 Extract std.for std.if and std.terminator in their own dialect
These ops should not belong to the std dialect.
This CL extracts them in their own dialect and updates the corresponding conversions and tests.

PiperOrigin-RevId: 258123853
2019-07-16 13:43:18 -07:00
Nicolas Vasilache cab671d166 Lower affine control flow to std control flow to LLVM dialect
This CL splits the lowering of affine to LLVM into 2 parts:
1. affine -> std
2. std -> LLVM

The conversions mostly consists of splitting concerns between the affine and non-affine worlds from existing conversions.
Short-circuiting of affine `if` conditions was never tested or exercised and is removed in the process, it can be reintroduced later if needed.

LoopParametricTiling.cpp is updated to reflect the newly added ForOp::build.

PiperOrigin-RevId: 257794436
2019-07-12 08:44:28 -07:00
River Riddle 89bc449cee Standardize the value numbering in the AsmPrinter.
Change the AsmPrinter to number values breadth-first so that values in adjacent regions can have the same name. This allows for ModuleOp to contain operations that produce results. This also standardizes the special name of region entry arguments to "arg[0-9+]" now that Functions are also operations.

PiperOrigin-RevId: 257225069
2019-07-09 10:41:00 -07:00
Alex Zinenko 80e2871087 Extend AffineToGPU to support Linalg loops
Extend the utility that converts affine loop nests to support other types of
loops by abstracting away common behavior through templates.  This also
slightly simplifies the existing Affine to GPU conversion by always passing in
the loop step as an additional kernel argument even though it is a known
constant.  If it is used, it will be propagated into the loop body by the
existing canonicalization pattern and can be further constant-folded, otherwise
it will be dropped by canonicalization.

This prepares for the common loop abstraction that will be used for converting
to GPU kernels, which is conceptually close to Linalg loops, while maintaining
the existing conversion operational.

PiperOrigin-RevId: 257172216
2019-07-09 05:26:50 -07:00
Stephan Herhut e8b21a75f8 Add an mlir-cuda-runner tool.
This tool allows to execute MLIR IR snippets written in the GPU dialect
on a CUDA capable GPU. For this to work, a working CUDA install is required
and the build has to be configured with MLIR_CUDA_RUNNER_ENABLED set to 1.

PiperOrigin-RevId: 256551415
2019-07-04 07:53:54 -07:00
Stephan Herhut 630119f84f Add a pass that inserts getters for all cubins found via nvvm.cubin
annotations.

Getters are required as there are currently no global constants in MLIR and this
is an easy way to unblock CUDA execution while waiting for those.

PiperOrigin-RevId: 255169002
2019-06-26 05:33:11 -07:00
Stephan Herhut c72c6c3907 Make GPU to CUDA transformations independent of CUDA runtime.
The actual transformation from PTX source to a CUDA binary is now factored out,
enabling compiling and testing the transformations independently of a CUDA
runtime.

MLIR has still to be built with NVPTX target support for the conversions to be
built and tested.

PiperOrigin-RevId: 255167139
2019-06-26 05:16:37 -07:00
River Riddle 679a3b4191 Change the attribute dictionary syntax to separate name and value with '='.
The current syntax separates the name and value with ':', but ':' is already overloaded by several other things(e.g. trailing types). This makes the syntax difficult to parse in some situtations:

Old:
  "foo: 10 : i32"

New:
  "foo = 10 : i32"
PiperOrigin-RevId: 255097928
2019-06-25 19:06:34 -07:00
Alex Zinenko 2628641b23 GPUtoNVVM: adjust integer bitwidth when lowering special register ops
GPU dialect operations (launch and launch_func) use `index` type for thread and
block index values inside the kernel, for compatibility with affine loops.
NVVM dialect operations, following the NVVM intrinsics, use `!llvm.i32` type,
which does not necessarily have the same bit width as the lowered `index` type.
Optionally sign-extend (indices are signed) or truncate the result of the NVVM
dialect operation to the bit width of the lowered `index` type before passing
it to other operations.  This behavior is consistent with `std.index_cast`.  We
cannot use the latter since we are targeting LLVM dialect types directly,
rather than standard integer types.

PiperOrigin-RevId: 254980868
2019-06-25 09:21:26 -07:00
Stephan Herhut a14eeacf2c Add lowering pass from GPU dialect operations to LLVM/NVVM intrinsics.
PiperOrigin-RevId: 253551452
2019-06-19 23:03:30 -07:00
Alex Zinenko ee6f84aebd Convert a nest affine loops to a GPU kernel
This converts entire loops into threads/blocks.  No check on the size of the
block or grid, or on the validity of parallelization is performed, it is under
the responsibility of the caller to strip-mine the loops and to perform the
dependence analysis before calling the conversion.

PiperOrigin-RevId: 253189268
2019-06-19 23:02:02 -07:00