# MLIR Rationale

This document is intended to capture some of the alternatives considered and
open debates in the design of MLIR, along with the rationale for certain
decisions we made. This is not intended to be a "finely groomed" document - we
prefer the ability to dump in interesting tidbits without worrying too much
about their consistency or readability.

[TOC]

## Abstract

MLIR is a compiler intermediate representation with similarities to traditional
three-address SSA representations (like
[LLVM IR](http://llvm.org/docs/LangRef.html) or
[SIL](https://github.com/apple/swift/blob/master/docs/SIL.rst)), but which
introduces notions from polyhedral loop optimization as first class
concepts. This hybrid design is optimized to represent, analyze, and transform
high level dataflow graphs as well as target-specific code generated for high
performance data parallel systems. Beyond its representational capabilities, its
single continuous design provides a framework to lower from dataflow graphs to
high performance target specific code.

MLIR stands for one of "Multi-Level IR" or "Multi-dimensional Loop IR" or
"Machine Learning IR" or "Mid Level IR"; we prefer the first. This document only
provides the rationale behind MLIR; its actual
[specification document](LangRef.md) and other content are hosted elsewhere.

## Introduction and Motivation

The Multi-Level Intermediate Representation (MLIR) is intended for easy
expression and optimization of computations involving deep loop nests and dense
matrices of high dimensionality. It is thus well-suited to deep learning
computations in particular. Yet it is general enough to also represent arbitrary
sequential computation. The representation allows high-level optimization and
parallelization for a wide range of parallel architectures including those with
deep memory hierarchies: general-purpose multicores, GPUs, and specialized
neural network accelerators.

MLIR uses ideas drawn from IRs of LLVM and Swift for lower level constructs
while combining them with ideas from the polyhedral abstraction to represent
loop nests, multi-dimensional data (tensors), and transformations on these
entities as first class concepts in the IR.

MLIR is a multi-level IR, i.e., it represents code at every level from a
domain-specific representation, such as HLO or TensorFlow graphs, all the way
down to the machine level. MLIR is able to represent arbitrary control flow and
arbitrary data accesses, and is general enough to represent nearly all
sequential computation. This is a key distinction from existing polyhedral
representation implementations (such as LLVM [Polly](https://polly.llvm.org/))
that are able to use the polyhedral abstraction in a way isolated from the LLVM
IR and only for affine loop nests, i.e., portions of the code where array
accesses, loop bounds, and conditionals are regular (involve linear functions of
loop iterators and constant symbols). The presence of statically unpredictable
data accesses or control flow does not preclude representation in MLIR, but only
limits to a certain extent the ability to reason about and apply transformations
using the polyhedral abstraction.

Maps, sets, and relations with affine constraints are the core structures
underlying a polyhedral representation of high-dimensional loop nests and
multi-dimensional arrays. These structures are represented as textual
expressions in a form close to their mathematical form. These structures are
used to capture loop nests, tensor data structures, and how they are reordered
and mapped for a target architecture. All structured or "conforming" loops are
captured as part of the polyhedral information, and so are tensor variables,
their layouts, and subscripted accesses to these tensors in memory.

The information captured in the IR allows a compact expression of all loop
transformations, data remappings, explicit copying necessary for explicitly
addressed memory in accelerators, mapping to pre-tuned expert written
primitives, and mapping to specialized vector instructions. Loop transformations
that can be easily implemented include the body of affine transformations: these
subsume all traditional loop transformations (unimodular and non-unimodular)
such as loop tiling, interchange, permutation, skewing, scaling, relative
shifting, reversal, fusion, and distribution/fission. Transformations on data
layout such as padding and transforming to blocked layouts are also captured.
The design of the IR allows a progressive lowering to target-specific forms.

Besides high-level transformations for loop nests and data layout that a typical
mid-level optimizer is expected to deal with, MLIR is also designed to perform
certain low-level scheduling and mapping decisions that a typical backend IR is
entrusted with: these include mapping to specialized vector instructions,
auto-vectorization, and software pipelining. The need to support these
transformations stems from the fact that neural network accelerators have
specialized units that deal with large chunks of data whose computation maps
back to chunks of more than one loop of the loop nests as viewed by a program at
a level closer to the original specification. Such specialized units or
instructions operate on multidimensional data chunks from a programmer's
viewpoint. This makes it hard or infeasible for a backend operating on a very
low-level IR close to assembly to lift and reconstruct loops and perform such a
mapping. This is in contrast to classic instruction selection and scheduling in
today's compilers, which primarily deal with the body of the innermost loop.
MLIR also facilitates automatic mapping to expert pre-tuned primitives or vendor
libraries operating on data at higher levels (or at the highest level) of the
memory hierarchy.

**Strengths**

*   MLIR is closed under the kind of transformations needed to lower to TPUs;
    MLIR can be used to represent both the input and output of emitters.
*   MLIR allows us to build modular and reusable target independent and target
    dependent passes, since each pass/emitter can read in another's output.

## Design Decisions

This section sheds light on some of the design decisions -- some of these are
indirectly implied by the specification document.

### Loads and stores

The 'load' and 'store' instructions are specifically crafted to fully resolve to
an element of a memref. These instructions take as arguments the memref and n
indices for a rank-n memref, one index per dimension. This disallows the
equivalent of pointer arithmetic or the ability to index into the same memref in
other ways (something which C arrays allow, for example). Furthermore, in an
affine construct, the compiler can follow use-def chains (e.g. through
[affine.apply instructions](Dialects/Affine.md#affineapply-operation)) to
precisely analyze references at compile-time using polyhedral techniques. This
is possible because of the
[restrictions on dimensions and symbols](Dialects/Affine.md#restrictions-on-dimensions-and-symbols).

A scalar of element-type (a primitive type or a vector type) that is stored in
memory is modeled as a 0-d memref. This is also necessary for scalars that are
live out of for loops and if conditionals in a function, for which we don't yet
have an SSA representation --
[an extension](#mlfunction-extensions-for-"escaping-scalars") to allow that is
described later in this doc.

### Symbols and types

The current MLIR disallows use of symbols in types. For example, when a tensor
or memref dimension is statically unknown, it is denoted in the type as '?'. An
SSA symbol is then bound to it when a memref is created. The actual value of the
unknown dimension can be queried using the "dim" builtin as shown below.

Example:

```mlir {.mlir}
func foo(...) {
  %A = alloc <8x?xf32, #lmap> (%N)
  ...
  call bar(%A) : (memref<8x?xf32, #lmap>)
}

func bar(%A : memref<8x?xf32, #lmap>) {
  // Type of %A indicates that %A has dynamic shape with 8 rows
  // and unknown number of columns. The number of columns is queried
  // dynamically using dim instruction.
  %N = dim %A, 1 : memref<8x?xf32, #lmap>

  affine.for %i = 0 to 8 {
    affine.for %j = 0 to %N {
      // A[i,j] += 1
      %s1 = load %A [%i, %j] : memref<8x?xf32, #lmap>
      %s2 = add %s1, 1
      store %s2 to %A [%i, %j] : memref<8x?xf32, #lmap>
    }
  }
  return
}

```

An alternative design is to embed the reference to symbols directly in the
type, e.g. `memref<8x%Nxf32>`. We went for the current approach in MLIR because
it simplifies the design: types remain immutable when the values of symbols
change.

### Block Arguments vs PHI nodes

MLIR Regions represent SSA using "[block arguments](LangRef.md#blocks)" rather
than [PHI instructions](http://llvm.org/docs/LangRef.html#i-phi) used in LLVM.
This choice is representationally identical (the same constructs can be
represented in either form) but block arguments have several advantages:

1.  LLVM PHI nodes always have to be kept at the top of a block, and
    transformations frequently have to manually skip over them. This is defined
    away with BB arguments.
1.  LLVM has a separate function Argument node. This is defined away with BB
    arguments, because the arguments to the entry block serve this purpose.
1.  Blocks of PHI nodes in LLVM execute atomically, which is surprising and
    super confusing to compiler engineers, and it is easy to introduce bugs with
    this (very related to the
    "[lost copy](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.524.5461&rep=rep1&type=pdf)"
    problem in SSA lowering literature). With the BB argument representation,
    this confusion is defined away.
1.  The entry list of PHI nodes in LLVM is unordered, and some blocks have
    thousands of predecessors (e.g. unwind blocks). This can cause long compile
    time problems because transformations have to linearly scan this list. This
    is defined away with BB argument representation.
1.  LLVM has no way to represent values that are available only in one successor
    but not the other, e.g. its invoke instruction cannot produce the exception
    value JUST on the exception edge. Instead, the
    [landingpad instruction](http://llvm.org/docs/LangRef.html#landingpad-instruction)
    is a hack used to represent this. MLIR doesn't make use of this capability,
    but SIL uses it extensively, e.g. in the
    [switch_enum instruction](https://github.com/apple/swift/blob/master/docs/SIL.rst#switch-enum).

For more context, block arguments were previously used in the Swift
[SIL Intermediate Representation](https://github.com/apple/swift/blob/master/docs/SIL.rst),
and described in
[a talk on YouTube](https://www.youtube.com/watch?v=Ntj8ab-5cvE). The section of
interest
[starts here](https://www.google.com/url?q=https://youtu.be/Ntj8ab-5cvE?t%3D596&sa=D&ust=1529450150971000&usg=AFQjCNFQHEWL7m8q3eO-1DiKw9zqC2v24Q).

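
The "PHI nodes execute atomically" point can be illustrated with a toy model. The
Python sketch below is not LLVM or MLIR code; the function names and the
dictionary-based environment are hypothetical and exist only for illustration.
It shows why a group of PHI nodes on a back edge that swaps two values gives a
different result when it is lowered naively as a sequence of copies.

```python
def run_phis_in_parallel(phis, env):
    """Atomic semantics: read every incoming value, then commit all at once."""
    incoming = {dest: env[src] for dest, src in phis}
    return {**env, **incoming}


def run_phis_one_by_one(phis, env):
    """Naive lowering: treat each PHI as an ordinary sequential copy."""
    env = dict(env)
    for dest, src in phis:
        env[dest] = env[src]
    return env


# A back edge that swaps two values: x takes the old y, y takes the old x.
swap = [("x", "y"), ("y", "x")]
start = {"x": 1, "y": 2}

parallel = run_phis_in_parallel(swap, start)    # values are swapped
sequential = run_phis_one_by_one(swap, start)   # the old x is overwritten first
```

With block arguments the same swap is simply a branch that passes `(y, x)` to the
destination block, so the "all operands are read before any result is written"
behavior follows from ordinary argument passing.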
### Index type disallowed in vector/tensor/memref types

Index types are not allowed as elements of `vector`, `tensor` or `memref` type.
Index types are intended to be used for platform-specific "size" values and may
appear in subscripts, sizes of aggregate types and affine expressions. They are
also tightly coupled with `affine.apply` and load/store operations; a value
must have `index` type to be accepted by these operations. While it may be
useful to have `memref<?xindex>` to express indirect accesses in MLFunctions,
e.g. sparse matrix manipulations or lookup tables, it creates problems MLIR is
not ready to address yet. MLIR needs to internally store constants of aggregate
types and emit code operating on values of those types, which are subject to
target-specific size and alignment constraints. Since MLIR does not have a
target description mechanism at the moment, it cannot reliably emit such code.
Moreover, some platforms may not support vectors of type equivalent to `index`.

Indirect access use cases can be alternatively supported by providing an
`index_cast` instruction that allows for conversion between `index` and
fixed-width integer types, at the SSA value level. It has an additional benefit
of supporting smaller integer types, e.g. `i8` or `i16`, for small indices
instead of the (presumably larger) `index` type.

### Bit width of non-primitive types and `index` is undefined

The bit width of a compound type is not defined by MLIR; it may be defined by a
specific lowering pass. In MLIR, bit width is a property of certain primitive
_types_, in particular integers and floats. It is equal to the number that
appears in the type definition, e.g. the bit width of `i32` is `32`, and so is
the bit width of `f32`. The bit width is not _necessarily_ related to the amount
of memory (in bytes) or the size of register (in bits) that is necessary to
store the value of the given type. These quantities are target and ABI-specific
and should be defined during the lowering process rather than imposed from
above. For example, `vector<3xi57>` is likely to be lowered to a vector of four
64-bit integers, so that its storage requirement is `4 x 64 / 8 = 32` bytes,
rather than `(3 x 57) ceildiv 8 = 22` bytes as can be naively computed from the
bit width. Individual components of MLIR that allocate space for storing values
may use the bit size as the baseline and query the target description when it is
introduced.

The bit width is not defined for dialect-specific types at MLIR level. Dialects
are free to define their own quantities for type sizes.

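
The `vector<3xi57>` arithmetic above can be checked with a few lines of Python.
This is only a worked calculation; the helper names are hypothetical, and the
"four 64-bit lanes" lowering is the example assumed in the text, not something a
target is required to choose.

```python
def ceildiv(a, b):
    """Integer division rounding up, for non-negative a and positive b."""
    return -(-a // b)


def packed_bytes(num_elements, element_bits):
    """Bytes needed if the elements were packed bit by bit with no padding."""
    return ceildiv(num_elements * element_bits, 8)


def lowered_bytes(num_lanes, lane_bits):
    """Bytes actually occupied after lowering to fixed-width lanes."""
    return num_lanes * lane_bits // 8


naive = packed_bytes(3, 57)      # (3 x 57) ceildiv 8
lowered = lowered_bytes(4, 64)   # 4 x 64 / 8
```

The gap between the two numbers is exactly the target-specific padding that MLIR
leaves to the lowering process.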
### Signless types

Integers in the builtin MLIR type system have a bitwidth (note that the `index`
type has a symbolic width equal to the machine word size), but they do not have
an intrinsic sign. This means that the "standard ops" operation set has things
like `addi` and `muli` which do two's complement arithmetic, but some other
operations get a sign, e.g. `divis` vs `diviu`.

LLVM uses the [same design](http://llvm.org/docs/LangRef.html#integer-type),
which was introduced in a revamp rolled out
[in the LLVM 2.0 integer type](http://releases.llvm.org/2.0/docs/LangRef.html#t_derived).
Prior to that, from
[LLVM 1.0](http://releases.llvm.org/1.0/docs/LangRef.html#t_classifications) to
[1.9](http://releases.llvm.org/1.9/docs/LangRef.html#t_classifications), LLVM
used signed types like "sbyte" and "ubyte". This shift was important and has
served LLVM well over the years. The reason this is important is that it is a
good thing for an intermediate representation to represent the same computation
with the same instruction. Signed types got in the way, because (e.g.) an "add
of an sbyte" does the same computation as an "add of a ubyte", but the type
system made them look artificially different. This split also required casts
like "cast from sbyte to ubyte" which do nothing at the machine level. Removing
signs from the type system eliminated these problems, making the compiler
simpler.

More information about this split is available in an old
[talk on YouTube](https://www.youtube.com/watch?v=VeRaLPupGks) about LLVM 2.0.

Note that this rationale only applies to the "standard ops" dialect in which we
can express an opinion about its design. Other dialects generally try to model
an external system, and should aim to reflect its design as closely as possible.

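
The following Python sketch models why addition needs no sign while division
does. It operates on raw bit patterns; the function names mirror the operation
names mentioned above but are only a model, and the assumption that signed
division truncates toward zero is mine rather than something this document
states.

```python
def wrap(value, bits):
    """Keep only the low `bits` bits, i.e. two's complement wraparound."""
    return value & ((1 << bits) - 1)


def as_signed(pattern, bits):
    """Reinterpret a bit pattern as a two's complement signed value."""
    sign_bit = 1 << (bits - 1)
    return (pattern ^ sign_bit) - sign_bit


def addi(a, b, bits):
    # One operation serves both interpretations of the operands.
    return wrap(a + b, bits)


def diviu(a, b, bits):
    return wrap(a // b, bits)


def divis(a, b, bits):
    # Assumed semantics: signed division truncating toward zero.
    sa, sb = as_signed(a, bits), as_signed(b, bits)
    quotient = abs(sa) // abs(sb)
    if (sa < 0) != (sb < 0):
        quotient = -quotient
    return wrap(quotient, bits)


a, b = 0xF8, 0x02            # i8: a is 248 unsigned, or -8 signed
sum_bits = addi(a, b, 8)     # same pattern whether read as 250 or as -6
unsigned_quot = diviu(a, b, 8)
signed_quot = divis(a, b, 8)
```

The sum is a single bit pattern that is correct under both readings, whereas the
two quotients are different bit patterns, which is why only the division needs a
signed and an unsigned variant.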
### Splitting floating point vs integer operations

The MLIR "standard" operation set splits many integer and floating point
operations into different categories, for example `addf` vs `addi` and `cmpf` vs
`cmpi`
([following the design of LLVM](http://llvm.org/docs/LangRef.html#binary-operations)).
These instructions _are_ polymorphic on the number of elements in the type
though, for example `addf` is used with scalar floats, vectors of floats, and
tensors of floats (LLVM does the same thing with its scalar/vector types).

This split is important because floating point and integer operations are quite
different in practice: for example, floating point values include NaNs, so
[integer comparisons](http://llvm.org/docs/LangRef.html#icmp-instruction) and
[floating point comparisons](http://llvm.org/docs/LangRef.html#fcmp-instruction)
should use different comparison opcodes. On the arithmetic side of things,
floating point operations support rounding modes, floating point contractions,
and ["fast math"](http://llvm.org/docs/LangRef.html#fadd-instruction), while
integers may want to have two's complement overflow behavior or be undefined on
[various forms of wrapping](http://llvm.org/docs/LangRef.html#add-instruction)
for performance.

We are a long way from this sort of thing being a priority to care about in
MLIR, but since we have experience and know the right way to do this, we'd
rather design it in from the beginning.

Note that this rationale only applies to the "standard ops" dialect in which we
can express an opinion about its design. Other dialects generally try to model
an external system, and should aim to reflect its design as closely as possible.

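
The NaN argument can be made concrete with a short Python sketch. The helper
names `cmpf_olt` and `cmpf_ult` are hypothetical; they borrow the
"ordered"/"unordered" naming used by LLVM's `fcmp` predicates and are not claimed
to be the predicate names of `cmpf`.

```python
import math


def unordered(a, b):
    """Two floats are unordered when at least one of them is NaN."""
    return math.isnan(a) or math.isnan(b)


def cmpf_olt(a, b):
    """Ordered less-than: false whenever a NaN is involved."""
    return not unordered(a, b) and a < b


def cmpf_ult(a, b):
    """Unordered less-than: true whenever a NaN is involved."""
    return unordered(a, b) or a < b


nan = math.nan
nan_equals_itself = nan == nan
ordered_result = cmpf_olt(nan, 1.0)
unordered_result = cmpf_ult(nan, 1.0)
```

For integers "less than" has a single meaning once the sign is fixed; for floats
the two predicates above agree on ordinary numbers and differ only on NaN, so the
predicate sets for the two kinds of comparison cannot be the same.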
### Specifying sign in integer comparison operations

Since integers are [signless](#signless-types), it is necessary to define the
sign for integer comparison operations. This sign indicates how to treat the
foremost bit of the integer: as sign bit or as most significant bit. For
example, comparing two `i4` values `0b1000` and `0b0010` yields different
results for unsigned (`8 > 2`) and signed (`-8 < 2`) interpretations. This
difference is only significant for _order_ comparisons, but not for _equality_
comparisons. Indeed, for the latter all bits must have the same value
independently of the sign. Since both arguments have exactly the same bit width
and cannot be padded by this operation, it is impossible to compare two values
whose bit representations would differ while the values are interpreted as
equal.

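
A small Python check of the `i4` case, written as a sketch with hypothetical
helper names: it evaluates both interpretations of `0b1000` and `0b0010` and then
confirms exhaustively that equality never depends on the sign.

```python
BITS = 4


def signed_value(pattern, bits=BITS):
    """Read a bit pattern as a two's complement signed integer."""
    if pattern & (1 << (bits - 1)):
        return pattern - (1 << bits)
    return pattern


lhs, rhs = 0b1000, 0b0010

# Unsigned reading: 8 and 2. Signed reading: -8 and 2.
unsigned_greater = lhs > rhs
signed_less = signed_value(lhs) < signed_value(rhs)

# Equality gives the same answer under both readings for every pair of i4
# values, because reinterpretation is a one-to-one mapping on bit patterns.
equality_is_sign_independent = all(
    (a == b) == (signed_value(a) == signed_value(b))
    for a in range(1 << BITS)
    for b in range(1 << BITS)
)
```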
### Specifying comparison kind as attribute

Unlike arithmetic, comparison operators share several common properties, e.g.
they cannot be considered associative. In practice, comparisons are sometimes
implemented by the same instruction or its variants so it makes sense to group
them together at the IR level.

An alternative would be introducing ten distinct operators for all currently
supported kinds of integer comparisons. These operators would have increased the
number of "reserved" names used by standard operations as well as the size of
the C++ API while their implementations would have been mostly identical.

The comparison kind is internally an integer attribute. However, for the sake of
readability by humans, custom assembly form accepts string literals that are
mapped to the underlying integer values: `cmpi "eq", %lhs, %rhs` better implies
integer equality comparison than `cmpi 0, %lhs, %rhs` where it is unclear what
gets compared to what else. This syntactic sugar is possible thanks to parser
logic redefinitions for custom assembly form of non-builtin operations.
Supporting it in the full notation would have required changing how the main
parsing algorithm works and may have unexpected repercussions. While it would
have been possible to store the predicate as a string attribute, doing so would
have made it impossible to implement switching logic based on the comparison
kind and would have made attribute validity checks (one out of ten possible
kinds) more complex.

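
The trade-off can be sketched in Python. The enum below is a model, not MLIR's
C++ API: only "`eq` maps to `0`" comes from the text above, and the remaining
numbering, the class name, and the helper names are assumptions made for
illustration.

```python
from enum import IntEnum


class CmpIPredicate(IntEnum):
    # Only EQ = 0 is taken from the text; the other values are assumed.
    EQ = 0
    NE = 1
    SLT = 2
    SLE = 3
    SGT = 4
    SGE = 5
    ULT = 6
    ULE = 7
    UGT = 8
    UGE = 9


# The custom assembly form accepts readable strings such as "eq".
_BY_NAME = {predicate.name.lower(): predicate for predicate in CmpIPredicate}


def parse_predicate(text):
    """Map the string literal of the custom form to the integer attribute."""
    try:
        return _BY_NAME[text]
    except KeyError:
        raise ValueError(f"unknown comparison predicate: {text!r}") from None


def is_valid_predicate(value):
    """With an integer attribute, validity is a single range check."""
    return 0 <= value < len(CmpIPredicate)


def is_signed_order(predicate):
    """Example of switching logic over the integer kind."""
    return CmpIPredicate.SLT <= predicate <= CmpIPredicate.SGE
```

For example, `parse_predicate("eq")` yields the value `0` that is stored in the
attribute, while passes can dispatch on the integer without comparing strings.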
### 'select' operation to implement min/max

Although `min` and `max` operations are likely to occur as a result of
transforming affine loops in ML functions, we did not make them first-class
operations. Instead, we provide the `select` operation that can be combined with
`cmpi` to implement the minimum and maximum computation. Although they now
require two operations, they are likely to be emitted automatically during the
transformation inside MLIR. On the other hand, there are multiple benefits of
introducing `select`: standalone min/max would concern themselves with the
signedness of the comparison, already taken into account by `cmpi`; `select` can
support floats transparently if used after a float-comparison operation; the
lower-level targets provide `select`-like instructions making the translation
trivial.

This operation could have been implemented with additional control flow:
`%r = select %cond, %t, %f` is equivalent to

```mlir
^bb0:
  cond_br %cond, ^bb1(%t), ^bb1(%f)
^bb1(%r):
```

However, this control flow granularity is not available in the ML functions
where min/max, and thus `select`, are likely to appear. In addition, simpler
control flow may be beneficial for optimization in general.

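
As a rough executable model of the "comparison plus `select`" idiom, the Python
sketch below builds signed, unsigned and floating point minimum from one
`select` helper. The functions imitate the operations named above on plain
Python values; they are hypothetical stand-ins and only the `"slt"` and `"ult"`
predicates are modeled.

```python
def to_signed(pattern, bits):
    """Read a bit pattern as a two's complement signed integer."""
    return pattern - (1 << bits) if pattern >> (bits - 1) else pattern


def cmpi(predicate, lhs, rhs, bits):
    """Model of an integer comparison on raw bit patterns."""
    if predicate == "ult":
        return lhs < rhs
    if predicate == "slt":
        return to_signed(lhs, bits) < to_signed(rhs, bits)
    raise ValueError(f"predicate not modeled in this sketch: {predicate!r}")


def select(cond, true_value, false_value):
    """Pick one of two already-computed values; no control flow involved."""
    return true_value if cond else false_value


def min_with(predicate, a, b, bits):
    # The signedness lives entirely in the comparison, not in select.
    return select(cmpi(predicate, a, b, bits), a, b)


def minf(a, b):
    # The same select works unchanged after a floating point comparison.
    return select(a < b, a, b)


signed_min = min_with("slt", 0xF8, 0x02, 8)     # -8 vs 2
unsigned_min = min_with("ult", 0xF8, 0x02, 8)   # 248 vs 2
float_min = minf(1.5, -2.0)
```

The two integer results differ only because a different predicate was passed to
the comparison, which is the division of labor described above.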
### Regions

#### Attributes of type 'Block'

We considered representing regions through `ArrayAttr`s containing a list of a
special type `IRBlockAttr`, which in turn would contain a list of operations.
All attributes in MLIR are uniqued within the context, which would make the IR
inside the regions immortal for no good reason.

#### Use "inlined" functions as regions

We considered attaching a "force-inline" attribute on a function and/or a
function `call` operation. Even the minimal region support (use cases in
affine.for and affine.if existing before the regions) requires access to the
values defined in the dominating block, which is not supported by functions.
Conceptually, function bodies are instances of regions rather than the inverse;
regions can also be device kernels, alternative sections, etc.

#### Dedicated `region` operation

This would mean we have a special kind of operation that is allowed to have
regions while other operations are not. Such a distinction is similar to the
Stmt/Op difference we have had and chose to remove to make the IR simpler and
more flexible. It would also require analyses and passes to consider the
interplay between operations (e.g., an `affine.for` operation must be followed
by a region operation). Finally, a region operation can be introduced using the
current implementation, among other operations and without being special in any
sense.

#### Explicit capture of the values used in a region

Being able to use values defined outside the region implies that use-def chains
may contain uses from different nested regions. Consequently, IR transformations
and analyses can pull the instruction defining the value across region
boundaries, for example in case of TableGen-defined canonicalization patterns.
This would not be the case if all used values had been passed as region
arguments. One of the motivations for introducing regions in the IR is precisely
to enable cross-region analyses and transformations that are simpler than
inter-procedural transformations. Having uses from different regions appear in
the same use-def chain, contrary to an additional data structure maintaining
correspondence between function call arguments as uses of the original
definitions and formal arguments as new definitions, enables such
simplification. Since individual operations now belong to blocks, which belong
to regions, it is always possible to check if the definition of the value
belongs to the same region as its particular use. The risk is that any IR
traversal will need to explicitly handle this situation and it is easy to forget
a check (or conversely it isn't easy to design the right check in a TableGen
pattern for example): traversing use-def chains potentially crosses implicitly
semantic barriers, making it possible to unknowingly break region semantics.
This is expected to be caught in the verifier after the transformation.

At the same time, one may choose to pass certain or all values as region
arguments to explicitly break the use-def chains in the current proposal. This
can be combined with an attribute-imposed semantic requirement disallowing the
body of the region to refer to any value from outside it.

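
The "is the definition in the same region as the use" check can be sketched with
a deliberately tiny Python model. Everything here is hypothetical: the `Region`
and `Op` classes are not MLIR's data structures, blocks and dominance are
omitted, and each operation stands for the single value it defines.

```python
class Region:
    def __init__(self, parent_op=None):
        self.parent_op = parent_op  # None for the outermost region


class Op:
    def __init__(self, name, region, operands=()):
        self.name = name
        self.region = region            # region that directly contains the op
        self.operands = tuple(operands)  # ops whose results are used


def enclosing_regions(region):
    """Yield the region itself, then each region that encloses it."""
    while region is not None:
        yield region
        region = region.parent_op.region if region.parent_op else None


def crosses_region_boundary(user, value):
    """True when following this use-def edge leaves the user's region."""
    return value.region is not user.region


def is_visible_from(user, value):
    """A value is usable if it is defined in the same or an enclosing region."""
    return any(r is value.region for r in enclosing_regions(user.region))


outer = Region()
bound = Op("constant", outer)
loop = Op("affine.for", outer)
body = Region(parent_op=loop)
add = Op("addf", body, operands=[bound, bound])
after = Op("mulf", outer, operands=[add])  # deliberately invalid use
```

A traversal that walks from `add` to `bound` has crossed a region boundary
without any explicit marker in the use-def chain, which is the situation the
paragraph above warns about; the use of `add` from `after` is the kind of
breakage a verifier would be expected to reject.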
### Quantized integer operations

We haven't designed integer quantized operations in MLIR, but experience from
TensorFlow suggests that it is better to put information about the quantization
range/scale into the type itself, rather than have a single type like "qint8"
and put these on attributes of the operation.

There are a few ways to do this with MLIR, including at least:

1.  We could do the same thing TensorFlow does - and we will _have_ to support
    that model to some extent for compatibility.
1.  We can encode the fp range of quantized integers directly into the types
    when they are constants. The best practice on this seems to be to encode the
    zero point as well as a scale factor. This ensures that 0.0 is always
    exactly representable, e.g. `qi8<-1.42, 31.23x>`.
1.  We could theoretically encode dynamically determined ranges into the types
    using something like `qi8<?,?>` with the bounds being determined through the
    SSA dataflow graph dynamically - similar to how dynamic shapes are handled.

We will definitely need to do #1 for compatibility, we probably want to do #2,
and we should investigate #3 over time. That said, our short term plan is to get
more implementation experience with the rest of the system first, then come back
to re-examine the representation for quantized arithmetic when we have that
experience. When we do, we should chat with benoitjacob@ and
[read the paper](https://arxiv.org/abs/1712.05877).

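
Why a zero point plus a scale makes 0.0 exactly representable can be shown with
a short Python sketch. This is a generic affine quantization model under my own
assumptions, not a design for MLIR types: the function names are hypothetical,
and the scale, zero point and range bounds are arbitrary illustrative numbers.

```python
def quantize(real, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to an 8-bit level: q = round(real / scale) + zero_point."""
    level = round(real / scale) + zero_point
    return max(qmin, min(qmax, level))


def dequantize(level, scale, zero_point):
    """Inverse mapping: real = scale * (q - zero_point)."""
    return scale * (level - zero_point)


# Zero point and scale: 0.0 lands exactly on the integer zero point.
scale, zero_point = 0.125, -117
zero_level = quantize(0.0, scale, zero_point)
zero_roundtrip = dequantize(zero_level, scale, zero_point)


# Plain (min, max) range split into 256 evenly spaced levels: 0.0 generally
# falls between two levels, so it is only approximated.
def nearest_level(real, lo, hi, levels=256):
    return round((real - lo) * (levels - 1) / (hi - lo))


def level_value(level, lo, hi, levels=256):
    return lo + level * (hi - lo) / (levels - 1)


lo, hi = -1.42, 31.23
range_level = nearest_level(0.0, lo, hi)
range_zero = level_value(range_level, lo, hi)
```

Exact zero matters in practice because padding and masked values are zeros; with
the zero-point form they survive a quantize/dequantize round trip unchanged.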
### Dialect type extensions

This section describes the design decisions that shaped the dialect-extensible
type system present in MLIR.

#### Reserving dialect type kinds

Dialects that wish to define type extensions must reserve a range of type kinds
within a '.def' file within the core IR library. This means that every dialect
wishing to define custom types must modify this file, but it guarantees that all
type casting checks are performed in O(1) time.

2019-04-05 23:19:42 +08:00
|
|
|
|
#### Interactions between dialects
|
2019-01-08 01:59:55 +08:00
|
|
|
|
|
|
|
|
|
There are two different interactions between dialects that are important to
|
|
|
|
|
understand. When types of a dialect are:
|
|
|
|
|
|
|
|
|
|
* In operations of other dialects

  - For standard/builtin operations, only standard/builtin types are
    allowed. This restriction allows for operations to clearly understand
    the invariants that they are working under.
  - Outside of standard/builtin operations, dialects are expected to verify
    the types allowed on each operation.

* In types of other dialects

  - For standard/builtin types, these types are allowed to contain types
    from other dialects. This simplifies the type system and removes the
    need for dialects to redefine all of the standard aggregate types, e.g.
    tensor, as well as the memref type. Dialects are expected to verify that
    a specific type is valid within a standard type, e.g. if a type can be
    an element of a tensor.
  - For dialect types, the dialect is expected to verify any type
    invariants, e.g. if the standard tensor type can contain a specific type
    of that dialect.
|
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
#### Separating builtin and standard types
|
2019-01-08 01:59:55 +08:00
|
|
|
|
|
|
|
|
|
Following the separation between the built-in and standard dialect, it makes
|
|
|
|
|
sense to separate built-in types and standard dialect types. Built-in types are
|
|
|
|
|
required for the validity of the IR itself, e.g. the function type (which
|
2019-01-24 03:26:56 +08:00
|
|
|
|
appears in function signatures and generic assembly forms of operations).
|
|
|
|
|
Integer, float, vector, memref and tensor types, while important, are not
|
|
|
|
|
necessary for IR validity.
|
2019-01-08 01:59:55 +08:00
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
#### Unregistered types
|
2019-01-08 01:59:55 +08:00
|
|
|
|
|
2019-01-24 03:26:56 +08:00
|
|
|
|
MLIR supports unregistered operations in generic assembly form. MLIR also
|
|
|
|
|
supports a similar concept for types. When parsing, if the dialect for a dialect
|
2019-04-04 07:49:01 +08:00
|
|
|
|
type has not been registered, the type is modeled as an 'OpaqueType'. This allows
|
|
|
|
|
for types to be round-tripped without needing to link in the dialect library
|
|
|
|
|
that defined them. No additional information about opaque types, outside of
|
|
|
|
|
parsing/printing, will be available.
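
The fallback can be sketched as follows. The registry `TYPE_PARSERS`, the function `parse_dialect_type`, and the Python `OpaqueType` class are illustrative assumptions and do not mirror the actual C++ implementation; only the `!dialect<"...">` spelling is taken from this document.

```python
import re


class OpaqueType:
    """Holds the dialect namespace and the raw type text, nothing more."""

    def __init__(self, dialect, data):
        self.dialect = dialect
        self.data = data

    def __str__(self):
        return '!%s<"%s">' % (self.dialect, self.data)


# Hypothetical registry: dialect namespace -> type parsing hook.
TYPE_PARSERS = {}

_DIALECT_TYPE = re.compile(r'!([A-Za-z_][\w.]*)<"(.*)">\Z')


def parse_dialect_type(text):
    """Delegate to the dialect hook if registered, else keep the type opaque."""
    match = _DIALECT_TYPE.match(text)
    if match is None:
        raise ValueError("not a dialect type: %r" % text)
    dialect, data = match.group(1), match.group(2)
    hook = TYPE_PARSERS.get(dialect)
    if hook is None:
        return OpaqueType(dialect, data)
    return hook(data)


# No 'llvm' hook is registered here, so the type round-trips as opaque text.
ty = parse_dialect_type('!llvm<"i32*">')
assert isinstance(ty, OpaqueType)
assert str(ty) == '!llvm<"i32*">'
```

Nothing beyond the namespace and the raw string is retained, which matches the statement above that no information outside of parsing and printing is available for opaque types.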
|
2019-01-08 01:59:55 +08:00
|
|
|
|
|
|
|
|
|
#### Dialect type syntax
|
|
|
|
|
|
|
|
|
|
Dialect extended types are represented as string literals wrapped inside of the
|
|
|
|
|
dialect namespace. This means that the parser delegates to the dialect for
|
|
|
|
|
parsing specific type instances. This differs from the representation of dialect
|
|
|
|
|
defined operations, which have an identifier name that the parser uses to
|
|
|
|
|
identify and parse them.
|
|
|
|
|
|
|
|
|
|
This representation was chosen for several reasons:
|
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
##### Dialects must provide custom type parsers
|
2019-01-08 01:59:55 +08:00
|
|
|
|
|
|
|
|
|
Dialect type parsing cannot plug into the existing parser infrastructure as
|
|
|
|
|
operations do with the OpAsmParser/Printer. Operations have a defined syntax
|
|
|
|
|
structure that is the same across all dialects. Types, on the other hand, may
|
|
|
|
|
have many different, and sometimes conflicting, parsing constraints that would
|
|
|
|
|
be difficult/unmaintainable to provide within a single interface.
|
|
|
|
|
|
|
|
|
|
This also has the added benefit of encouraging dialects to reuse existing
|
|
|
|
|
external type parsers. For example, an LLVM dialect may provide an MLIR LLVM
|
|
|
|
|
type that is simply a wrapper around LLVM types. The LLVM dialect would then use
|
|
|
|
|
the existing LLVM type parsing infrastructure.
|
|
|
|
|
|
|
|
|
|
Example:
|
|
|
|
|
|
|
|
|
|
```mlir {.mlir}
|
|
|
|
|
%s = "foo"() : () -> !llvm<"i32*">
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
##### Types do not always have canonical names
|
|
|
|
|
|
|
|
|
|
Unlike operations, types generally do not have a formal canonical name. For
|
|
|
|
|
example, function types have no defined keyword and integer types are defined by
|
|
|
|
|
a regular expression to support arbitrary bitwidth. Dialects with existing type
|
|
|
|
|
systems, e.g. LLVM, are likely to provide wrappers around their existing type
|
|
|
|
|
systems. For these wrapper types there is no simple canonical name; it's logical
|
2019-01-08 10:42:04 +08:00
|
|
|
|
to think of these types as existing within the namespace of the dialect. If a
|
|
|
|
|
dialect wishes to assign a canonical name to a type, it can be done via
|
|
|
|
|
[type aliases](LangRef.md#type-aliases).
|
2019-01-08 01:59:55 +08:00
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
### Tuple types
|
2019-03-20 01:59:02 +08:00
|
|
|
|
|
|
|
|
|
The MLIR type system provides first class support for defining
|
|
|
|
|
[tuple types](LangRef.md#tuple-type). This is due to the fact that `Tuple`
|
2019-03-30 00:18:45 +08:00
|
|
|
|
represents a universal concept that is likely to, and has already begun to,
|
|
|
|
|
present itself in many different dialects. Though this type is first class in
|
|
|
|
|
the type system, it merely serves to provide a common mechanism in which to
|
|
|
|
|
represent this concept in MLIR. As such, MLIR provides no standard operations
|
|
|
|
|
for interfacing with `tuple` types. It is up to dialect authors to provide
|
2019-03-20 01:59:02 +08:00
|
|
|
|
operations, e.g. extract_tuple_element, to interpret and manipulate them. When
|
|
|
|
|
possible, operations should prefer to use multiple results instead. These
|
|
|
|
|
provide a myriad of benefits, such as alleviating any need for tuple-extract
|
|
|
|
|
operations that merely get in the way of analysis and transformation.
|
|
|
|
|
|
2019-01-24 03:26:56 +08:00
|
|
|
|
### Assembly forms
|
|
|
|
|
|
|
|
|
|
MLIR supports both generic and custom assembly forms, based on the
|
|
|
|
|
following considerations:
|
|
|
|
|
|
|
|
|
|
MLIR is an open system; it is designed to support modular and pluggable
|
|
|
|
|
dialects. Depending on whether there exists a corresponding dialect and whether
|
|
|
|
|
the dialect is plugged in, operations may or may not be registered into the MLIR
|
|
|
|
|
system. Yet we still need a way to investigate these operations. So the generic
|
|
|
|
|
assembly form is mandated by this aspect of the MLIR system. It provides a default
|
|
|
|
|
textual form for operations.
|
|
|
|
|
|
|
|
|
|
On the other hand, an assembly form exists to help developers investigate
|
|
|
|
|
the IR. The generic form serves as a safe fallback but it can be too verbose for
|
|
|
|
|
certain ops. Therefore, MLIR gives each dialect the choice to define a custom
|
|
|
|
|
assembly form for each operation according to the operation's semantics and
|
|
|
|
|
specific needs. The custom assembly form can de-duplicate information from the
|
|
|
|
|
operation to derive a more concise form, thus better facilitating the
|
|
|
|
|
comprehension of the IR.
|
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
## Examples
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
2019-01-03 04:32:30 +08:00
|
|
|
|
This section describes a few very simple examples that help understand how MLIR
|
|
|
|
|
represents computation.
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
### Non-affine control flow
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
2018-11-14 23:58:42 +08:00
|
|
|
|
```c
// A simple linear search in every row of a matrix
for (i=0; i<N; i++) {
  for (j=0; j<N; j++) {
    // dynamic control flow
    if (a[i][j] == key) {
      s[i] = j;
      break;
    }
  }
}
```
|
|
|
|
|
|
2019-01-03 04:32:30 +08:00
|
|
|
|
The presence of dynamic control flow leads to an inner non-affine function
|
|
|
|
|
nested in an outer function that uses affine loops.
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
2018-11-14 23:58:42 +08:00
|
|
|
|
```mlir {.mlir}
func @search(%A: memref<?x?xi32>, %S: memref<?xi32>, %key: i32) {
  %ni = dim %A, 0 : memref<?x?xi32>
  // This loop can be parallelized
  affine.for %i = 0 to %ni {
    call @search_body (%A, %S, %key, %i) : (memref<?x?xi32>, memref<?xi32>, i32, i32) -> ()
  }
  return
}

func @search_body(%A: memref<?x?xi32>, %S: memref<?xi32>, %key: i32, %i: i32) {
  %nj = dim %A, 1 : memref<?x?xi32>
  br ^bb1(0)

^bb1(%j: i32):
  %p1 = cmpi "lt", %j, %nj : i32
  cond_br %p1, ^bb2, ^bb5

^bb2:
  %v = load %A[%i, %j] : memref<?x?xi32>
  %p2 = cmpi "eq", %v, %key : i32
  cond_br %p2, ^bb3(%j), ^bb4

^bb3(%j: i32):
  store %j, %S[%i] : memref<?xi32>
  br ^bb5

^bb4:
  %jinc = addi %j, 1 : i32
  br ^bb1(%jinc)

^bb5:
  return
}
```
|
|
|
|
|
|
|
|
|
|
As per the [MLIR spec](LangRef.md), the restrictions on dimensions and symbol
|
2019-03-30 04:15:06 +08:00
|
|
|
|
identifiers to be used with the `affine.apply` operation only apply to accesses
|
|
|
|
|
inside `affine.for` and `affine.if` operations. However, an analysis of accesses
|
|
|
|
|
inside the called function (`@search_body`) is necessary to determine if the
|
|
|
|
|
`%i` loop could be parallelized: such function access analysis is calling
|
2019-02-07 03:58:03 +08:00
|
|
|
|
context sensitive.
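
To make the parallelization argument concrete, the sketch below mirrors the control flow of `@search_body` in Python. It is only an illustration of the access pattern: each call reads row `i` of `a` and writes at most `s[i]`, so the outer iterations can run in any order. The `order` parameter is an assumption added for this demonstration.

```python
def search_body(a, s, i, key):
    """Scan row i and record the column of the first match, if any."""
    j = 0
    while j < len(a[i]):        # ^bb1: loop header
        if a[i][j] == key:      # ^bb2: compare the loaded element
            s[i] = j            # ^bb3: store and leave the loop
            break
        j += 1                  # ^bb4: increment and branch back
    # ^bb5: return


def search(a, s, key, order=None):
    """Run the outer loop, optionally in a caller-supplied iteration order."""
    rows = range(len(a)) if order is None else order
    for i in rows:
        search_body(a, s, i, key)


a = [[1, 2, 3], [3, 3, 1], [0, 0, 0]]
forward, shuffled = [-1] * 3, [-1] * 3
search(a, forward, 3)
search(a, shuffled, 3, order=[2, 0, 1])
assert forward == shuffled
```

A compiler has to establish the same fact (disjoint writes per iteration) from the body of the callee, which is why the analysis depends on the calling context.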
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
### Non-affine loop bounds
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
2019-01-03 04:32:30 +08:00
|
|
|
|
Loop bounds that are not affine lead to a nesting of functions as shown below.
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
|
|
|
|
```c
for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    // non-affine loop bound for k loop
    for (k=0; k<pow(2,j); k++)
      for (l=0; l<N; l++) {
        // block loop body
        ...
      }
```
|
|
|
|
|
|
2018-11-14 23:58:42 +08:00
|
|
|
|
```mlir {.mlir}
func @outer_nest(%n: i32) {
  affine.for %i = 0 to %n {
    affine.for %j = 0 to %n {
      call @inner_nest(%i, %j, %n)
    }
  }
  return
}

func @inner_nest(%i: i32, %j: i32, %n: i32) {
  %pow = call @pow(2, %j) : (f32, f32) -> f32
  // TODO(missing cast from f32 to i32)
  call @inner_nest2(%pow, %n)
  return
}

func @inner_nest2(%m: i32, %n: i32) {
  affine.for %k = 0 to %m {
    affine.for %l = 0 to %n {
      ...
    }
  }
  return
}
```
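
The outlining can be mirrored in Python to check that it preserves the iteration space. In this sketch the function names follow the IR above, while the `body` callback is an assumption standing in for the elided loop body.

```python
def inner_nest2(m, n, body):
    """Both bounds are plain arguments here, so the loops are affine in them."""
    for k in range(m):
        for l in range(n):
            body(k, l)


def inner_nest(i, j, n, body):
    """Evaluate the non-affine bound once, outside of any affine loop."""
    m = 2 ** j
    inner_nest2(m, n, lambda k, l: body(i, j, k, l))


def outer_nest(n, body):
    for i in range(n):
        for j in range(n):
            inner_nest(i, j, n, body)


visited = []
outer_nest(3, lambda *idx: visited.append(idx))
expected = [(i, j, k, l)
            for i in range(3) for j in range(3)
            for k in range(2 ** j) for l in range(3)]
assert visited == expected
```

The non-affine expression becomes an opaque value at the function boundary, so inside `inner_nest2` the bound `m` behaves like a symbol.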
|
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
### Reference 2D Convolution
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
|
|
|
|
The following example illustrates a reference implementation of a 2D
|
2019-01-03 04:32:30 +08:00
|
|
|
|
convolution, which uses an integer set `#domain` to represent valid input data
|
2018-10-25 00:48:12 +08:00
|
|
|
|
in a dilated convolution.
|
|
|
|
|
|
2018-11-14 23:58:42 +08:00
|
|
|
|
```mlir {.mlir}
// Dilation factors S0 and S1 can be constant folded if constant at compile time.
#domain = (d0, d1)[S0,S1,S2,S3]: (d0 % S0 == 0, d1 % S1 == 0, d0 >= 0, d1 >= 0,
                                   S2 - d0 - 1 >= 0, S3 - d1 - 1 >= 0)
// Identity map (shown here for illustration).
#map0 = (d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3, d4, d5, d6)

// Affine map from output to input coordinate space.
// d0 = output_h, d1 = output_w, d2 = kernel_h, d3 = kernel_w
// S0 = h_stride, S1 = w_stride, S2 = h_kernel_dilation, S3 = w_kernel_dilation
// S4 = h_pad_low, S5 = w_pad_low
//     %out0 = %oh * %h_stride + %kh * %h_kernel_dilation - %h_pad_low
//     %out1 = %ow * %w_stride + %kw * %w_kernel_dilation - %w_pad_low
#map1_0 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d0 * S0 + d2 * S2 - S4)
#map1_1 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d1 * S1 + d3 * S3 - S5)

// Semi-affine map to undilated input coordinate space.
// d0 = input_h, d1 = input_w, S0 = h_base_dilation, S1 = w_base_dilation.
#map2_0 = (d0, d1) [S0, S1] -> (d0 / S0)
#map2_1 = (d0, d1) [S0, S1] -> (d1 / S1)

// Conv2D shapes:
// input:   [batch, input_height, input_width, input_feature]
// kernel: [kernel_height, kernel_width, input_feature, output_feature]
// output: [batch, output_height, output_width, output_feature]
func @conv2d(%input: memref<16x1024x1024x3xf32, #lm0, vmem>,
             %kernel: memref<5x5x3x32xf32, #lm0, vmem>,
             %output: memref<16x512x512x32xf32, #lm0, vmem>) {
  affine.for %b = 0 to %batch {
    affine.for %oh = 0 to %output_height {
      affine.for %ow = 0 to %output_width {
        affine.for %of = 0 to %output_feature {
          affine.for %kh = 0 to %kernel_height {
            affine.for %kw = 0 to %kernel_width {
              affine.for %if = 0 to %input_feature {
                // Calculate input indices.
                %1_0 = affine.apply #map1_0 (%oh, %ow, %kh, %kw)
                  [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation,
                   %h_pad_low, %w_pad_low]
                %1_1 = affine.apply #map1_1 (%oh, %ow, %kh, %kw)
                  [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation,
                   %h_pad_low, %w_pad_low]

                // Check if access is not in padding.
                affine.if #domain(%1_0, %1_1)
                    [%h_base_dilation, %w_base_dilation, %h_bound, %w_bound] {
                  %2_0 = affine.apply #map2_0 (%1_0, %1_1)
                  %2_1 = affine.apply #map2_1 (%1_0, %1_1)
                  // Compute: output[output_indices] += input[input_indices] * kernel[kernel_indices]
                  call @multiply_accumulate(%input, %kernel, %output, %b, %oh, %ow, %of, %kh, %kw, %if, %2_0, %2_1)
                }
              }
            }
          }
        }
      }
    }
  }
  return
}
```
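
The index arithmetic in this example can be checked with a small Python model. This is a sketch under stated assumptions: the function names are made up, the last two symbols of `#domain` are taken to be the input bounds (as the `affine.if` operand list suggests), and the semi-affine division is modeled with floor division.

```python
def map1(oh, ow, kh, kw, h_stride, w_stride, h_kernel_dilation,
         w_kernel_dilation, h_pad_low, w_pad_low):
    """Model of #map1_0 and #map1_1: output and kernel indices to dilated input."""
    return (oh * h_stride + kh * h_kernel_dilation - h_pad_low,
            ow * w_stride + kw * w_kernel_dilation - w_pad_low)


def in_domain(d0, d1, h_base_dilation, w_base_dilation, h_bound, w_bound):
    """Model of #domain: on the dilation grid and inside the input bounds."""
    return (d0 % h_base_dilation == 0 and d1 % w_base_dilation == 0
            and d0 >= 0 and d1 >= 0
            and h_bound - d0 - 1 >= 0 and w_bound - d1 - 1 >= 0)


def map2(d0, d1, h_base_dilation, w_base_dilation):
    """Model of #map2_0 and #map2_1: back to undilated input coordinates."""
    return (d0 // h_base_dilation, d1 // w_base_dilation)


# Stride 2, kernel dilation 1, low padding 1 in both dimensions.
params = (2, 2, 1, 1, 1, 1)
assert map1(0, 0, 0, 0, *params) == (-1, -1)      # lands in the padding
assert not in_domain(-1, -1, 1, 1, 1024, 1024)
assert map1(1, 1, 1, 1, *params) == (2, 2)        # a valid input element
assert in_domain(2, 2, 2, 2, 8, 8)
assert map2(2, 2, 2, 2) == (1, 1)
```

Only accesses that pass the `in_domain` check reach the multiply-accumulate, which is how the integer set filters out padding and dilation holes.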
|
|
|
|
|
|
|
|
|
|
TODO (Add more examples showing the IR for a variety of interesting cases)
|
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
## Design alternatives and extensions
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
|
|
|
|
This is a list of some design alternatives and extensions that we discussed in
|
|
|
|
|
detail but either did not include in the spec or postponed for future
|
|
|
|
|
consideration on demand. We will revisit these discussions when we have more
|
|
|
|
|
implementation experience and learn more about the challenges and limitations of
|
|
|
|
|
our current design in practice.
|
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
### Polyhedral code representation alternatives: schedule lists vs schedule trees vs affine loop/if forms
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
|
|
|
|
The current MLIR uses a representation of polyhedral schedules using a tree of
|
|
|
|
|
if/for loops. We extensively debated the tradeoffs involved in the typical
|
2018-12-29 08:05:35 +08:00
|
|
|
|
unordered polyhedral instruction representation (where each instruction has
|
2018-10-25 00:48:12 +08:00
|
|
|
|
multi-dimensional schedule information), discussed the benefits of schedule tree
|
|
|
|
|
forms, and eventually decided to go with a syntactic tree of affine if/else
|
|
|
|
|
conditionals and affine for loops. Discussion of the tradeoff was captured in
|
|
|
|
|
this document:
|
2019-04-04 05:00:30 +08:00
|
|
|
|
[MLIR: The case for a simplified polyhedral form](RationaleSimplifiedPolyhedralForm.md).
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
|
|
|
|
At a high level, we have two alternatives here:
|
|
|
|
|
|
2019-01-03 04:32:30 +08:00
|
|
|
|
1. Schedule tree representation instead of an affine loop AST form: The current
   proposal uses an affine loop and conditional tree form, which is syntactic
   and does not separate domains as sets from schedules as multidimensional
   affine functions. A schedule tree form however makes polyhedral domains and
   schedules a first class concept in the IR allowing compact expression of
   transformations through the schedule tree without changing the domains of
   instructions. Such a representation also hides prologues, epilogues, partial
   tiles, complex loop bounds and conditionals making loop nests free of
   "syntax". Cost models instead look at domains and schedules. In addition, if
   necessary such a domain schedule representation can be normalized to
   explicitly propagate the schedule into domains and model all the cleanup
   code. An example and more detail on the schedule tree form are in the next
   section.
1. Having two different forms of MLFunctions: an affine loop tree form
   (AffineLoopTreeFunction) and a polyhedral schedule tree form. In effect,
   this means having four different forms for functions in MLIR instead of
   three: CFG Function, AffineLoopTreeFunction, Polyhedral Schedule Tree
   function, and external functions.
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
#### Schedule Tree Representation for MLFunctions
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
|
|
|
|
This representation is based on a simplified form of the domain/schedule
|
|
|
|
|
representation used by the polyhedral compiler community. Domains represent what
|
|
|
|
|
has to be executed while schedules represent the order in which domain elements
|
|
|
|
|
are interleaved. We model domains as non piece-wise convex integer sets, and
|
|
|
|
|
schedules as affine functions; however, the former can be disjunctive, and the
|
|
|
|
|
latter can be piece-wise affine relations. In the schedule tree representation,
|
2018-12-29 08:05:35 +08:00
|
|
|
|
domain and schedules for instructions are represented in a tree-like structure
|
2018-10-25 00:48:12 +08:00
|
|
|
|
which is called a schedule tree. Each non-leaf node of the tree is an abstract
|
|
|
|
|
polyhedral dimension corresponding to an abstract fused loop for each ML
|
|
|
|
|
instruction that appears in that branch. Each leaf node is an ML Instruction.
|
|
|
|
|
|
2018-11-14 23:58:42 +08:00
|
|
|
|
```mlir {.mlir}
// A tiled matmul code (128x128x128) represented in schedule tree form

// #map0 = (d0, d1, d2, d3, d4, d5) -> (128*d0 + d3, 128*d1 + d4, 128*d2 + d5)
#intset_ij = (i, j) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0, -j + N - 1 >= 0
#intset_ijk = (i, j, k) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0,
                                     -j + M - 1 >= 0, k >= 0, -k + N - 1 >= 0
func @matmul(%A, %B, %C, %M, %N, %K) : (...)  { // %M, N, K are symbols
  // t1, t2, t3, t4, t5, t6 are abstract polyhedral loops
  mldim %t1 : {S1,S2,S3,S4,S5} floordiv (i, 128) {
    mldim %t2 : {S1,S2,S3,S4,S5} floordiv (j, 128) {
      // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
      call dma_hbm_to_vmem(%C, %i, %j, %M, %N, %K)
          with #intset_ij(%i, %j) [%M, %N, %K]
      mldim %t3 : {S2,S3,S4,S5} floordiv (k, 128) {
        // (%i, %j, %k) = affine.apply (d0, d1, d2)
        //                          -> (128*d0, 128*d1, 128*d2) (%t1, %t2, %t3)
        call dma_hbm_to_vmem(%A, ...) with #intset_ijk (%i, %j, %k) [%M, %N, %K]
        // (%i, %j, %k) = affine.apply (d0, d1, d2)
        //                          -> (128*d0, 128*d1, 128*d2) (%t1, %t2, %t3)
        call dma_hbm_to_vmem(%B, ...) with #intset_ijk (%i, %j, %k) [%M, %N, %K]
        mldim %t4 : {S4} i mod 128 {
          mldim %t5 : {S4} j mod 128 {
            mldim %t6 : {S4} k mod 128 {
              // (%i, %j, %k) = affine.apply #map0 (%t1, %t2, %t3, %t4, %t5, %t6)
              call matmul_body(%A, %B, %C, %i, %j, %k, %M, %N, %K)
                  with #intset_ijk(%i, %j, %k) [%M, %N, %K]
            } // end mldim t6
          } // end mldim t5
        } // end mldim t4
      } // end mldim t3
      // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
      call dma_vmem_to_hbm_C ... with #intset_ij(%i, %j) [%M, %N, %K]
    } // end mldim t2
  } // end mldim t1
  return
}
```
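
The relationship between the abstract dimensions and the original iteration space can be sketched in Python. The functions below are illustrative only: `schedule` models the `floordiv`/`mod` annotations on the `mldim` nodes, `map0` models `#map0`, and the guard in `tiled_points` assumes simple bounds `i < M`, `j < N`, `k < K` rather than reproducing the integer sets of the example verbatim.

```python
def map0(t1, t2, t3, t4, t5, t6, tile=128):
    """Model of #map0: rebuild (i, j, k) from tile and intra-tile indices."""
    return (tile * t1 + t4, tile * t2 + t5, tile * t3 + t6)


def schedule(i, j, k, tile=128):
    """Model of the mldim annotations: floordiv for tiles, mod within a tile."""
    return (i // tile, j // tile, k // tile, i % tile, j % tile, k % tile)


def tiled_points(M, N, K, tile=128):
    """Enumerate the iteration domain in tiled order, guarding partial tiles."""
    def num_tiles(extent):
        return -(-extent // tile)  # ceiling division

    for t1 in range(num_tiles(M)):
        for t2 in range(num_tiles(N)):
            for t3 in range(num_tiles(K)):
                for t4 in range(tile):
                    for t5 in range(tile):
                        for t6 in range(tile):
                            i, j, k = map0(t1, t2, t3, t4, t5, t6, tile)
                            # Domain guard, playing the role of the integer set.
                            if i < M and j < N and k < K:
                                yield (i, j, k)


# The schedule and #map0 are inverses of each other.
assert map0(*schedule(300, 5, 129)) == (300, 5, 129)

# With a small tile size, every point of a 5x3x6 domain is visited exactly once.
points = list(tiled_points(5, 3, 6, tile=4))
assert len(points) == len(set(points)) == 5 * 3 * 6
```

Here the domain guard, not the loop bounds, handles partial tiles, which is the sense in which a schedule tree keeps loop nests free of "syntax".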
|
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
### Affine Relations
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
|
|
|
|
The current MLIR spec includes affine maps and integer sets, but not affine
|
|
|
|
|
relations. Affine relations are a natural way to model read and write access
|
|
|
|
|
information, which can be very useful to capture the behavior of opaque external
|
|
|
|
|
library calls, high-performance vendor libraries, or user-provided / user-tuned
|
|
|
|
|
routines.
|
|
|
|
|
|
|
|
|
|
An affine relation is a relation between input and output dimension identifiers
|
|
|
|
|
while being symbolic on a list of symbolic identifiers and with affine
|
|
|
|
|
constraints on the identifiers.
|
|
|
|
|
|
|
|
|
|
Syntax:
|
|
|
|
|
|
|
|
|
|
``` {.ebnf}
// Affine relation definition at the top of file
affine-rel-def ::= affine-rel-id `=` affine-relation-inline

affine-rel-id ::= `##` prefixed-id

affine-relation-inline ::=
       `(` input-dims `)` (`[` symbols `]`)? `->`
       `(` output-dims `)` `:` affine-constraint-conjunction

input-dims ::= bare-id-list
output-dims ::= bare-id-list
symbols ::= bare-id-list

affine-rel ::= affine-rel-id | affine-relation-inline

// Usage
affine-rel-spec ::= affine-rel dim-and-symbol-use-list
```
|
|
|
|
|
|
|
|
|
|
All identifiers appearing in input-dims, output-dims, and symbols are
|
|
|
|
|
pairwise distinct. All affine-constraint non-terminals in the above syntax are
|
|
|
|
|
allowed to contain identifiers only from input-dims, output-dims, and
|
|
|
|
|
symbols.
|
|
|
|
|
|
|
|
|
|
Affine relations are used to model read, write, may_read, and may_write sets of
|
|
|
|
|
functions in the IR. The output dimension identifiers correspond to the data
|
|
|
|
|
dimensions.
|
|
|
|
|
|
|
|
|
|
Example:
|
|
|
|
|
|
2018-11-14 23:58:42 +08:00
|
|
|
|
```mlir {.mlir}
// read relation: two elements ( d0 <= r0 <= d0+1 )
##aff_rel9 = (d0) -> (r0) : r0 - d0 >= 0, d0 - r0 + 1 >= 0

func @count (memref<128xf32, (d0) -> (d0)> %A, i32 %pos) -> f32
  reads: {%A ##aff_rel9 (%pos)}
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */ {
bb0 (%0, %1: memref<128xf32>, i32):
  %val = load %A [(d0) -> (d0) (%pos)]
  %val2 = load %A [(d0) -> (d0 + 1) (%pos)]
  %p = mulf %val, %val2 : f32
  return %p
}
```
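
To see which elements the relation `##aff_rel9` describes, it can be evaluated by brute force. This is a minimal sketch; the function names and the default `extent` of 128 (taken from the memref shape in the example) are assumptions of the illustration.

```python
def aff_rel9(d0, r0):
    """The two constraints of ##aff_rel9: d0 <= r0 <= d0 + 1."""
    return r0 - d0 >= 0 and d0 - r0 + 1 >= 0


def read_set(pos, extent=128):
    """All data indices related to `pos`, restricted to the memref extent."""
    return [r0 for r0 in range(extent) if aff_rel9(pos, r0)]


assert read_set(5) == [5, 6]
assert read_set(127) == [127]
```

For an interior position the relation yields exactly the two elements loaded by the function body; near the end of the memref the extent clips the set.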
|
|
|
|
|
|
2019-05-13 23:51:34 +08:00
|
|
|
|
### Regions
|
|
|
|
|
|
|
|
|
|
#### Making function definition an operation
|
|
|
|
|
|
|
|
|
|
MLIR supports values of a Function type. Instead of having a first-class IR
|
|
|
|
|
concept for functions, one could define an operation with a body region that
|
|
|
|
|
defines a function value. The particularity of functions is that their names are
|
|
|
|
|
globally visible and can be referred to before being defined, unlike SSA values
|
|
|
|
|
that must be defined first. Implementing a "function definition" operation would
|
|
|
|
|
require relaxing some of the SSA constraints in a region, and also making the IR
|
|
|
|
|
Module a region as well. It would also affect the core infrastructure (e.g.,
|
|
|
|
|
function passes) only for the sake of concept unification.
|
|
|
|
|
|
|
|
|
|
#### Having types on a region
|
|
|
|
|
|
|
|
|
|
Instead of inspecting the types of arguments of the first block, one could give
|
|
|
|
|
the region itself a type. This type would be redundant with block argument
|
|
|
|
|
types, which are required anyway because block arguments are values, and it would create room for type mismatches. While
|
|
|
|
|
functions do have types that are partly redundant with the arguments of the
|
|
|
|
|
first block in the function, this is necessary to support function declarations
|
|
|
|
|
that do not have a body which we can refer to in order to obtain the argument
|
|
|
|
|
types. A region is always contained in an operation or a function that can be
|
|
|
|
|
queried to obtain the “type” of the region if necessary.
|
|
|
|
|
|
|
|
|
|
A type on a region could be justified if regions were to be considered separately
|
|
|
|
|
from the enclosing entity (operation or function) and had their own semantics
|
|
|
|
|
that should be checked.
|
|
|
|
|
|
|
|
|
|
#### Attaching attributes to regions
|
|
|
|
|
|
|
|
|
|
Regions could be annotated with dialect attributes to use attribute verification
|
|
|
|
|
hooks. An operation could take multiple regions as arguments, and each of them
|
|
|
|
|
may require different attributes. However, there are currently very few
|
|
|
|
|
practical cases where this would be necessary. Instead, one could simulate
|
|
|
|
|
per-region attributes with array attributes attached to the entity containing
|
|
|
|
|
the region (operation or function). This decreases the overall complexity of the
|
|
|
|
|
IR and enables more concise and op-specific forms, e.g., when all regions of an
|
|
|
|
|
op have the same attribute that can be only mentioned once. Since the semantics
|
|
|
|
|
of the region is entirely defined by the enclosing entity, it also makes sense
|
|
|
|
|
to have attributes attached to that entity rather than to the region itself.
|
|
|
|
|
|
|
|
|
|
This can be reconsidered in the future if we see a non-negligible number of use
|
|
|
|
|
cases.
|
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
### Read/Write/May_Read/May_Write sets for External Functions
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
|
|
|
|
Having read, write, may_read, and may_write sets for external functions which
|
|
|
|
|
include opaque ones, high-performance vendor libraries such as cuDNN, CUB, MKL,
|
|
|
|
|
FFT libraries, user-provided/optimized functions, or data movement runtimes such
|
|
|
|
|
as DMA ones is a powerful feature. It allows the compiler to perform analysis,
|
|
|
|
|
composition/transformation in the presence of such calls and with loops around
|
|
|
|
|
such calls on sub-tensors. For user-provided or custom hand-tuned functions, the
|
|
|
|
|
read/write/may_read/may_write sets could be provided a priori by a user as part
|
|
|
|
|
of the external function signature or they could be part of a database.
|
|
|
|
|
|
2019-01-03 04:32:30 +08:00
|
|
|
|
TODO: Design this, and update to use function attribute syntax.
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
|
|
|
|
Example:
|
|
|
|
|
|
2018-11-14 23:58:42 +08:00
|
|
|
|
```mlir {.mlir}
##rel9 = ( ) [s0] -> (r0, r1) : 0 <= r0 <= 1023, 0 <= r1 <= s0 - 1

func @cblas_reduce_ffi(memref<1024 x ? x f32, #layout_map0, hbm> %M) -> f32 [
  reads: {%M, ##rel9() }
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */
]

func @dma_hbm_to_vmem(memref<1024 x f32, #layout_map0, hbm> %a,
    offset, memref<1024 x f32, #layout_map0, vmem> %b,
    memref<1024 x f32, #layout_map0> %c
) [
  reads: {%M, ##rel9() }
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */
]

```
|
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
### Memref Extensions
|
2018-10-25 00:48:12 +08:00
|
|
|
|
|
|
|
|
|
1. Arbitrary polyhedral shapes for tensors: e.g., triangular shapes in tensor
   dimensions where there is symmetry: use integer set (affine constraints) to
   model tensor data space (instead of just extents). Requires some changes to
   the IR and the in-memory form.
1. Layout maps

   1. Allow piece-wise affine maps for layouts: allows clean modeling of
      boundary cases for images/tensors through padding, wrapping, mirroring,
      padding where padded values are the results of computation as opposed to
      data, padding in the interior as opposed to just boundaries.
   1. Allow many-to-one layout maps: Index and layout maps in the current
      proposal are bijective. Extending them to many-to-one layout maps allows
      cleaner(?) modeling of broadcast/reduce style computations while reusing
      memory.
|
|
|
|
|
|
|
|
|
|
Proposal 2(a) requires non-trivial changes to the IR and the in-memory
|
|
|
|
|
representation. 2(b) requires no change, but impacts how cost models look at
|
|
|
|
|
index and layout maps.
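
As an illustration of proposal 2(b), the sketch below contrasts a bijective row-major layout with a many-to-one layout that broadcasts a single stored row. The function names and the 4x3 shape are assumptions chosen for the example.

```python
def row_major_layout(d0, d1, cols):
    """Bijective layout: every logical index gets its own address."""
    return d0 * cols + d1


def broadcast_row_layout(d0, d1):
    """Many-to-one layout: all rows alias the same stored row."""
    return d1


ROWS, COLS = 4, 3
row = [10.0, 20.0, 30.0]  # the only data actually stored

# Reading a logical 4x3 tensor through the many-to-one layout.
logical = [[row[broadcast_row_layout(i, j)] for j in range(COLS)]
           for i in range(ROWS)]
assert all(r == row for r in logical)

dense = {row_major_layout(i, j, COLS) for i in range(ROWS) for j in range(COLS)}
shared = {broadcast_row_layout(i, j) for i in range(ROWS) for j in range(COLS)}
assert len(dense) == ROWS * COLS
assert len(shared) == COLS
```

The logical tensor has twelve elements but touches only three addresses, which is why cost models would need to treat such maps differently from bijective ones.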
|
|
|
|
|
|
2019-04-05 23:19:42 +08:00
|
|
|
|
### `affine.if` and `affine.for` Extensions for "Escaping Scalars"

We considered providing a representation for SSA values that are live out of
`if/else` conditional bodies and loop-carried in `affine.for` loops. We
ultimately abandoned this approach due to its complexity. In the current design
of MLIR, scalar variables cannot escape for loops or if instructions. In
situations where escaping is necessary, we use zero-dimensional tensors and
memrefs instead of scalars.

**TODO**: This whole section is obsolete and should be updated to use block
arguments and a yield-like terminator in for/if instructions.

The abandoned design of supporting escaping scalars is as follows:

#### affine.for Instruction

Syntax:

``` {.ebnf}
[<out-var-list> =]
for %<index-variable-name> = <lower-bound> ... <upper-bound> step <step>
[with <in-var-list>] { <loop-instruction-list> }
```

out-var-list is a comma-separated list of SSA values defined in the loop body
and used outside the loop body. in-var-list is a comma-separated list of SSA
values used inside the loop body and their initializers. loop-instruction-list
is a list of instructions that may also include a yield instruction.

Example:

```mlir {.mlir}
// Return sum of elements in 1-dimensional memref A
func int32 @sum(%A : memref<?xi32>, %N : i32) -> (i32) {
   %init = 0
   %result = affine.for %i = 0 to %N with %tmp(%init) {
      %value = load %A[%i]
      %sum = %value + %tmp
      yield %sum
   }
   return %result
}
```
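
One way to read the abandoned loop form is as a fold: the in-var starts at its initializer, each iteration's `yield` supplies the value for the next iteration, and the out-var is the last yielded value. The C++ sketch below models that reading only. The helper `forWithYield` is hypothetical, and it assumes an exclusive upper bound and a positive step, neither of which the syntax above pins down.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Models `%out = for %i = lb ... ub step s with %tmp(%init) { ...; yield %v }`.
// `body(i, carried)` plays the role of the loop body and returns the yielded
// value. Assumptions: ub is exclusive and step > 0.
std::int64_t forWithYield(
    std::int64_t lb, std::int64_t ub, std::int64_t step, std::int64_t init,
    const std::function<std::int64_t(std::int64_t, std::int64_t)> &body) {
  std::int64_t carried = init;  // the in-var, e.g. %tmp
  for (std::int64_t i = lb; i < ub; i += step)
    carried = body(i, carried); // the yield feeds the next iteration
  return carried;               // the out-var, e.g. %result
}

// C++ analogue of the @sum example: sum of all elements of A.
std::int64_t sum(const std::vector<std::int64_t> &A) {
  const auto n = static_cast<std::int64_t>(A.size());
  return forWithYield(0, n, 1, /*init=*/0,
                      [&A](std::int64_t i, std::int64_t tmp) {
                        return A[static_cast<std::size_t>(i)] + tmp;
                      });
}
```

If the loop runs zero times, this model returns the initializer, which is one plausible answer to what the out-var should hold in that case.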

#### affine.if/else Instruction

Syntax:

``` {.ebnf}
<out-var-list> = affine.if (<cond-list>) {...} [else {...}]
```

Out-var-list is a list of SSA values defined by the if-instruction. The values
are arguments to the yield-instruction that occurs in both the then and else
clauses when an else clause is present. When the if instruction contains only a
then clause, the escaping value defined in the then clause should be merged with
the value the variable had before the if instruction. The design captured here
does not handle this situation.

Example:

```mlir {.mlir}
// Compute sum of half of the array
func int32 @sum_half(%A, %N) {
   %s0 = 0
   %s1 = affine.for %i = 1 ... %N step 1 with %s2 (%s0) {
       %s3 = if (%i >= %N / 2) {
          %v0 = load %A[%i]
          %s4 = %s2 + %v0
          yield %s4
       }
       yield %s3
   }
   return %s1
}
```
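
The example has a then clause but no else clause, which is exactly the case the design leaves open. The C++ sketch below makes the missing merge explicit. It assumes the fallback is the loop-carried value (`%s2` in the example) and an exclusive upper bound; both are assumptions for illustration, not rules stated by the design, and `sumUpperHalf` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// C++ analogue of @sum_half. The loop starts at 1, as in the example above.
std::int64_t sumUpperHalf(const std::vector<std::int64_t> &A) {
  const auto n = static_cast<std::int64_t>(A.size());
  std::int64_t s2 = 0; // loop-carried value, initialized from %s0
  for (std::int64_t i = 1; i < n; ++i) {
    // %s3 only has a value if the then clause actually ran.
    std::optional<std::int64_t> yielded;
    if (i >= n / 2)
      yielded = s2 + A[static_cast<std::size_t>(i)];
    // Assumed merge rule: fall back to the value carried into this iteration.
    s2 = yielded.value_or(s2);
  }
  return s2;
}
```

Spelling out the `value_or` step shows what an if-instruction without an else clause would have needed: an implicit second operand for the merge.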

### Multithreading the compiler

People want compilers to go fast, and one simple way to do that is to
multi-thread them. There are multiple strategies for this, but a simple one is
to optimize and compile separate functions in parallel. LLVM's original pass
manager anticipated this demand, and the CallGraphSCCPass manager is even
designed to support this as well, but unfortunately, a few early design
decisions in LLVM prevent this from ever happening. Instead, things like ThinLTO
are forced to split programs into separate LLVM modules/contexts and optimize
those chunks independently.

The problem is that LLVM has several objects in its IR that are globally uniqued
and also mutable: notably constants like `i32 0`. In LLVM, these constants are
`Value`s, which allows them to be used as operands to instructions and means
that they also have SSA use lists. Because these things are uniqued, every
`i32 0` in any function shares a use list. This means that optimizing multiple
functions in parallel won't work (at least without some sort of synchronization
on the use lists, which would be unbearably inefficient).

MLIR now supports a multithreaded pass manager. We do this through several
design choices:

1.  MLIR makes extensive use of uniqued immutable data structures (affine
    expressions, types, etc. are all immutable, uniqued, and immortal).
1.  Constants are defined in per-function pools, instead of being globally
    uniqued.
1.  Functions themselves are not SSA values either, so they don't have the same
    problem as constants.
1.  FunctionPasses are copied (through their copy ctor) into one instance per
    thread, avoiding sharing of local state across threads.

This allows MLIR function passes to support efficient multithreaded compilation
and code generation.
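
To make design choices 2 and 4 concrete, here is a minimal C++ sketch. None of these types are MLIR's real classes: `Constant`, `Function`, `MaterializeZeroPass`, and `runInParallel` are hypothetical stand-ins, and the static work partition is an assumption made to keep the example short. The point is only that per-function constant pools and per-thread pass copies leave nothing for the worker threads to share.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <thread>
#include <vector>

// A constant with a mutable use list (hypothetical, not MLIR's class).
struct Constant {
  std::int64_t value;
  std::vector<std::string> uses; // grows whenever the constant gains a user
};

// Each function owns its constants: `0` in one function and `0` in another
// are distinct objects with distinct use lists.
struct Function {
  std::string name;
  std::map<std::int64_t, Constant> constantPool; // uniqued per function only

  Constant &getConstant(std::int64_t v) {
    return constantPool.try_emplace(v, Constant{v, {}}).first->second;
  }
};

// A function pass with local state. Each worker gets its own copy.
struct MaterializeZeroPass {
  int functionsVisited = 0;

  void runOnFunction(Function &f) {
    Constant &zero = f.getConstant(0);
    zero.uses.push_back(f.name + ".use" + std::to_string(zero.uses.size()));
    ++functionsVisited;
  }
};

// Runs the pass over all functions using numThreads workers and returns the
// per-thread pass copies. Worker t handles functions t, t + numThreads, ...
// No locks are needed: every function and every pass copy is touched by
// exactly one thread.
std::vector<MaterializeZeroPass>
runInParallel(std::vector<Function> &functions,
              const MaterializeZeroPass &prototype, unsigned numThreads) {
  std::vector<MaterializeZeroPass> copies(numThreads, prototype); // copy ctor
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t) {
    workers.emplace_back([&functions, &copies, t, numThreads] {
      for (std::size_t i = t; i < functions.size(); i += numThreads)
        copies[t].runOnFunction(functions[i]);
    });
  }
  for (std::thread &w : workers)
    w.join();
  return copies;
}
```

With a globally uniqued `Constant` for `0`, every `runOnFunction` call would append to the same `uses` vector, and the loop above would need synchronization on it.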