[mlir][Linalg][doc] Add Design Document for the Linalg Dialect

Summary: This revision adds a Rationale for the Linalg Dialect

Reviewers: rriddle, mehdi_amini, ftynse, albertcohen

Reviewed By: albertcohen

Subscribers: merge_guards_bot, jfb, jpienaar, burmako, shauheen, antiagainst, arpith-jacob, mgester, lucyrfox, aartbik, liufengdb, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73595
Nicolas Vasilache 2020-02-02 10:21:34 -05:00
parent ff50c8dcef
commit 34cd354ea9
3 changed files with 1096 additions and 83 deletions


@ -1,8 +1,471 @@
# Linalg Dialect
[TOC]
To generate the documentation:
```sh
mlir-tblgen --gen-op-doc -I /path/to/mlir/include \
  /path/to/mlir/include/mlir/Dialect/Linalg/IR/LinalgDoc.td
```
# Rationale
<img width="90" align="left" alt="MLIR Codegen Flow" src="https://user-images.githubusercontent.com/10148468/73613629-c5586580-45c5-11ea-94b7-074aeea94c7b.png">
Linalg is designed to solve the High-level Hierarchical Optimization
(HHO box) in MLIR and to interoperate nicely within a
*Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).
The [Rationale Document](https://mlir.llvm.org/docs/RationaleLinalgDialect)
goes into significantly more design and architectural decision details.
# Set of Key Transformations<a name="key_transformations"></a>
The following key transformations have been central to driving the design of
Linalg. They are all implemented in terms of the properties of the
`linalg.generic` OpInterface and avoid the pitfall of relying on hardcoded
one-off op knowledge.
The textual form description of these transformations is left for future
work. Still, it is useful to at least list the key transformations that are
performed on the Linalg IR and that have influenced its design (a tiling
sketch follows the list):
1. Progressive Buffer Allocation.
1. Parametric Tiling.
1. Promotion to Temporary Buffer in Fast Memory.
1. Tiled Producer-Consumer Fusion with Parametric Tile-And-Fuse.
1. Map to Parallel and Reduction Loops and Hardware.
1. Vectorization: Rewrite in Vector Form.
1. Lower to Loops (Affine and/or Generic).
1. Lower to Library Calls or Special Instructions, Intrinsics or ISA.
1. Partially Lower to Iterations Over a Finer-Grained Linalg Op.
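For instance, a sketch of tiling, the transformation referenced above: the op
is rewritten into loops over tiles whose body remains a `linalg.matmul` on
views. The `loop.for` and `subview` syntax below is illustrative, not
normative:
```
// Before tiling: one op that retains its full semantics.
linalg.matmul(%A, %B, %C) : memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>
// After tiling by (4, 8, 16): loops iterate over tiles and the payload is
// still a linalg.matmul, now operating on subviews of the original buffers.
loop.for %i = %c0 to %M step %c4 {
  loop.for %j = %c0 to %N step %c8 {
    loop.for %k = %c0 to %K step %c16 {
      %sA = subview %A[%i, %k][%c4, %c16][%c1, %c1]
          : memref<?x?xf32> to memref<?x?xf32, offset: ?, strides: [?, 1]>
      %sB = subview %B[%k, %j][%c16, %c8][%c1, %c1]
          : memref<?x?xf32> to memref<?x?xf32, offset: ?, strides: [?, 1]>
      %sC = subview %C[%i, %j][%c4, %c8][%c1, %c1]
          : memref<?x?xf32> to memref<?x?xf32, offset: ?, strides: [?, 1]>
      linalg.matmul(%sA, %sB, %sC)
          : memref<?x?xf32, offset: ?, strides: [?, 1]>,
            memref<?x?xf32, offset: ?, strides: [?, 1]>,
            memref<?x?xf32, offset: ?, strides: [?, 1]>
    }
  }
}
```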
# High-Level Description of Linalg Ops<a name="linalg_ops"></a>
Linalg takes at least some inspiration from all previously [listed prior
art](#prior_art). The design enables the definition of ***CustomOps*** with
generic properties that enable [key transformations](#key_transformations),
including lowering to scalar load/store and other operations or to external
library calls and intrinsics.
These ops can have ***either tensor or buffer operands***.
## Payload-Carrying Ops<a name="payload_ops"></a>
Linalg defines two payload-carrying operations that implement the [structured ops](
https://docs.google.com/presentation/d/1P-j1GrH6Q5gLBjao0afQ-GfvcAeF-QU4GXXeSy0eJ9I/edit#slide=id.p
) abstraction on tensors and buffers. This is architected as two generic operations
`linalg.generic` (resp. `linalg.indexed_generic`) that can express custom
operations with *index-free semantics* (resp. *indexing semantics*).
The properties of these generic ops are the result of applying the
guiding principles described in the [Rationale Document](https://mlir.llvm.org/docs/RationaleLinalgDialect).
They are listed next, with a brief example and discussion for each.
### Property 1: Input and Output Operands Define The Iteration Space<a name="prop1"></a>
A `linalg.generic` op fully *derives* the specification of its iteration space
from its operands.
The property enforces that a localized IR element (the op) *has* all the information
needed to synthesize the control-flow required to iterate over its operands,
according to their type. This notion of IR localization bears some resemblance
to [URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf).
Consider the following, partially specified, `linalg.generic` example:
```
#attrs = {args_in: 1, args_out: 1}
func @example(%A: memref<?xf32, layout1>,
              %B: memref<?xvector<4xf32>, layout2>) {
  linalg.generic #attrs (%A, %B): memref<?xf32, layout1>,
                                  memref<?xvector<4xf32>, layout2>
  return
}
```
The property "*Input and Output Operands Define The Iteration Space*" is
materialized by a lowering into a form that will resemble:
```
func @example(%A: memref<?xf32, layout1>,
              %B: memref<?xvector<4xf32>, layout2>) {
  %M = "dim" %A, 0: index
  %N = "dim" %B, 0: index
  %eq = eq %M, %N: i1       // iteration space is consistent with data
  assert(%eq): (i1) -> ()
  for %i = 0 to %M {
    %a = load %A[%i]: memref<?xf32, layout1>
    %b = load %B[%i]: memref<?xvector<4xf32>, layout2>
    // compute arg types match elemental tensor types
    %c = "some_compute"(%a, %b): (f32, vector<4xf32>) -> (vector<4xf32>)
    store %c, %B[%i]: memref<?xvector<4xf32>, layout2>
  }
  return
}
```
The property participates in simplifying analyses and transformations. For
instance, it guarantees that no out-of-bounds access can occur by construction
(assuming dynamic operand dimensions agree with each other, which is the
purpose of the `assert` runtime check).
Before lowering to loop form, loop induction variables and iterators are *not yet
materialized*. This is a necessary property if we want an abstraction that
works on both tensor values and buffers because ***values don't escape
loops/nesting***.
The main implications are that:
1. The semantics of the ops are *restricted to operate on structured data
types*, on which we can define an iterator.
2. This does not model arbitrary code with side-effects.
We do not think these are serious limitations in practice because MLIR is all
about mixing different levels of abstractions in the same IR. As long as
Linalg can progressively lower to the next level of abstraction, it can also
be just bypassed for things that do not fit.
At the same time, conditioning op semantics on structured data types is a very
promising path towards extensibility to non-dense tensors as experience with
LIFT abstractions for
[sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf)
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf),
as well as [TACO](http://tensor-compiler.org/), has shown.
### Property 2: Reversible Mappings Between Control and Data Structures<a name="prop2"></a>
A `linalg.generic` *defines* the mapping between the iteration space (i.e. the
loops) and the data.
Consider the following, partially specified, `linalg.generic` example:
```
#indexing_maps = {
  (i, j) -> (j, i),
  (i, j) -> (j)
}
#attrs = {args_in: 1, args_out: 1, indexings: #indexing_maps}
func @example(%A: memref<8x?xf32, layout1>,
              %B: memref<?xvector<4xf32>, layout2>) {
  linalg.generic #attrs (%A, %B): memref<8x?xf32, layout1>,
                                  memref<?xvector<4xf32>, layout2>
  return
}
```
```
The property "*Reversible Mappings Between Control and Data Structures*" is
materialized by a lowering into a form that will resemble:
```
#attrs = {args_in: 1, args_out: 1, indexings: #indexing_maps}
func @example(%A: memref<8x?xf32, layout1>,
              %B: memref<?xvector<4xf32>, layout2>) {
  // loop bounds determined from data sizes by "inverting the map"
  %J = "dim" %A, 0: index
  %I = "dim" %A, 1: index
  %J2 = "dim" %B, 0: index
  // iteration space is consistent with data + mapping inference
  %eq = "eq" %J, %J2: i1
  "assert" %eq: (i1) -> ()
  for %i = 0 to %I {          // loop order is fully defined by indexing maps
    for %j = 0 to %J {        // arbitrary permutations are possible
      %a = "load" %A, %j, %i: memref<8x?xf32, layout1>
      %b = "load" %B, %j: memref<?xvector<4xf32>, layout2>
      %c = "some_compute"(%a, %b): (f32, vector<4xf32>) -> (vector<4xf32>)
      "store" %c, %B, %j: memref<?xvector<4xf32>, layout2>
    }
  }
  return
}
```
This mapping needs to be reversible because we want to be
able to go back and forth between the two and answer questions such as:
- Given a subset of the iteration space, what subset of data does it read and
write?
- Given a subset of data read or written, what subset of the iteration space
is responsible for this read or write?
Answering these two questions is one of the main analyses that Linalg uses to
implement transformations such as tiling, tiled producer-consumer fusion, and
promotion to temporary buffers in fast memory.
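As a concrete illustration (a sketch, not actual IR), take the indexing map
`(i, j) -> (j, i)` used for `%A` in the example above:
```
// Forward: an iteration subset maps to a data footprint.
//   iterations {(i, j) : i0 <= i < i0+4, j0 <= j < j0+8}
//   read A[j, i] for j in [j0, j0+8) and i in [i0, i0+4).
// Inverse: a data subset maps back to the responsible iterations.
//   element A[j, i] is accessed exactly by iteration (i, j).
// Tiling, fusion and promotion are all phrased as such applications and
// inversions of the indexing maps.
```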
In the current implementation, `linalg.generic` uses a list of AffineMaps.
This is a pragmatic short-term solution, but in the longer term this property
could even be evaluated dynamically, similarly to
inspector-executor algorithms.
### Property 3: The Type Of Iterators is Defined Explicitly<a name="prop3"></a>
A `linalg.generic` op fully *declares* the type of its iterators. This
information is used in transformations.
These properties are derived from established practice in the field and mirror
the properties from Ken Kennedy's [Optimizing Compilers for Modern Architectures](
https://www.elsevier.com/books/optimizing-compilers-for-modern-architectures/allen/978-0-08-051324-9).
The key idea of legality of loop transformations expressed by Kennedy is
that ***the lexicographic order of all dependence vectors must be
preserved***.
This can be better captured directly at the loop level thanks to specific
iterator types, among which:
*parallel*, *reduction*, *partition*, *permutable/monotonic*, *sequential*,
*dependence distance*, ...
These types are traditionally the result of complex dependence analyses and
have been referred to as "*bands*" in the polyhedral community (e.g. *parallel
bands*, *permutable bands*, etc, in
[ISL](https://en.wikipedia.org/wiki/Integer_set_library) schedule tree
parlance).
Specifying the information declaratively in a `linalg.generic` allows
conveying properties that may be hard (or even impossible) to derive from
lower-level information. These properties can be brought all the way to the
moment when they are useful for transformations, used and then discarded.
Additionally, these properties may also be viewed as a contract that the
frontend/user guarantees and that the compiler may take advantage of. A
common example is the use of data-dependent reduction semantics to
specify histogram computations. If the frontend has additional knowledge
that proper atomic operations are available, it may be better to specify
parallel semantics and use atomic operations in the computation region.
At this time, Linalg only has an explicit use for *parallel* and *reduction*
loops but previous experience shows that the abstraction generalizes.
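For illustration, a matrix-vector product could declare one *parallel* and one
*reduction* iterator along the following lines (hypothetical attribute syntax,
in the spirit of the partially specified examples above):
```
#attrs = {
  args_in: 2, args_out: 1,
  iterator_types: ["parallel", "reduction"],  // i is parallel, k is reduced
  indexings: [
    (i, k) -> (i, k),   // %A
    (i, k) -> (k),      // %x
    (i, k) -> (i)       // %y
  ]
}
linalg.generic #attrs (%A, %x, %y): memref<?x?xf32>, memref<?xf32>, memref<?xf32>
```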
### Property 4: The Compute Payload is Specified With a Region<a name="prop4"></a>
A `linalg.generic` op has a compute payload that is fully generic thanks to
the use of
[Regions](https://github.com/llvm/llvm-project/blob/58265ad42a90ae8905be6a447cb42e53529a54a0/mlir/docs/LangRef.md#regions).
The region takes as arguments the scalar elemental types of the tensor or
buffer operands of the `linalg.generic`. For flexibility and ability to match
library calls, additional special values may be passed. For instance, a
`linalg.fill` operation takes a buffer and an additional scalar value.
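For instance, a `linalg.fill` may be written along these lines (a sketch):
```
%zero = constant 0.0 : f32
// The scalar %zero is passed as an extra value in addition to the buffer.
linalg.fill(%buf, %zero) : memref<?x?xf32>, f32
```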
At this time there are no additional restrictions on the region
semantics. This is meant to allow the exploration of various design tradeoffs
at the intersection of regions and iterator types.
In particular, the frontend is responsible for the semantics of iterator types
to correspond to the operations inside the region: the region can capture
buffers arbitrarily and write into them. If this conflicts with some parallel
iterator requirement, this is undefined behavior.
Concretely, consider the following, partially specified, `linalg.generic`
example:
```
#indexing_maps = {
  (i, j) -> (i, j),
  (i, j) -> (i, j),
  (i, j) -> (i, j)
}
#attrs = {args_in: 2, args_out: 1, indexings: #indexing_maps}
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  linalg.generic #attrs (%A, %B, %C) {
    ^bb0(%a: f32, %b: f32):
      %c = addf %a, %b : f32
      return %c : f32
  }: memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>
  return
}
```
The property "*The Compute Payload is Specified With a Region*" is
materialized by a lowering into a form that will resemble:
```
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  %M = dim %A, 0: index
  %N = dim %B, 1: index
  for %i = 0 to %M {
    for %j = 0 to %N {
      %a = load %A[%i, %j]: memref<?x?xf32>
      %b = load %B[%i, %j]: memref<?x?xf32>
      %c = addf %a, %b : f32
      store %c, %C[%i, %j]: memref<?x?xf32>
    }
  }
  return
}
```
In the process of lowering to loops and lower-level constructs, similar
requirements are encountered, as discussed in the [inlined call op
proposal](https://llvm.discourse.group/t/introduce-std-inlined-call-op-proposal/282/2).
We expect to be able to reuse the common lower-level infrastructure provided
it evolves to support both region arguments and captures.
### Property 5: May Map To an External Library Call<a name="prop5"></a>
A `linalg.generic` op may map to an external library call by specifying a
`SymbolAttr`. At this level of abstraction, the important glue is the ability
to perform transformations that preserve the structure necessary to ***call
the external library after different transformations have been applied***.
This involves considerations related to preservation of op semantics
and integration at the ABI level. Regardless of whether one wants to use
external library calls or a custom ISA, the problem for codegen is similar:
preservation of a fixed granularity.
Consider the following, partially specified, `linalg.generic`
example:
```
#fun_attr = "pointwise_add"
#indexing_maps = {
  (i, j) -> (i, j),
  (i, j) -> (i, j),
  (i, j) -> (i, j)
}
#attrs = {args_in: 2, args_out: 1, indexings: #indexing_maps, fun: #fun_attr}
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  linalg.generic #attrs (%A, %B, %C) {
    ^bb0(%a: f32, %b: f32):
      %c = addf %a, %b : f32
      return %c : f32
  }: memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>
  return
}
```
```
The property "*Map To an External Library Call*" is
materialized by a lowering into a form that will resemble:
```
func @pointwise_add_sxsxf32_sxsxf32(memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  call @pointwise_add_sxsxf32_sxsxf32 (%A, %B, %C):
    (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()
  return
}
```
After lowering to LLVM, this resembles:
```
func @pointwise_add_sxsxf32_sxsxf32(!llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">,
                                    !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">,
                                    !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">) -> ()
func @example(%A: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">,
              %B: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">,
              %C: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">) {
  llvm.call @pointwise_add_sxsxf32_sxsxf32 (%A, %B, %C):
    (!llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">...) -> ()
  return
}
```
#### Convention For External Library Interoperability
The `linalg` dialect adopts a convention that is similar to `BLAS` when
offloading operations to fast library implementations: pass a non-owning
pointer to input and output data with additional metadata. This convention
is also found in libraries such as `MKL`, `OpenBLAS`, `BLIS`, `cuBLAS`,
`cuDNN`, etc., and more generally at interface points across language
boundaries (e.g. C++ / Python).
Generally, `linalg` passes non-owning pointers to View data structures
to pre-compiled library calls linked externally.
There is an [ongoing
discussion](https://llvm.discourse.group/t/lowering-optional-attributes-in-linalg-structuredops-to-standard-dialect/333/3)
on the topic of extending interoperability in the presence of key attributes.
### Property 6: Perfectly Nested Writes To The Whole Output Operands<a name="prop6"></a>
Perfectly nested loops form a particularly important class of structure that
enables key loop transformations such as tiling and mapping to library calls.
Unfortunately, this type of structure is easily broken by transformations such
as partial loop fusion. Tiling and mapping to library calls become more
challenging, or even infeasible. Linalg ops adopt perfect-nestedness
as a first-class property: the structure cannot be broken and is
transported in the IR by construction.
A `linalg.generic` op represents a perfectly nested loop nest that writes the
entire memory region. This is a structural constraint across regions and
loops that has proven to be key in simplifying transformations.
One particular point to mention is that converting imperfectly nested code
into perfectly nested code can often be done with enough loop distribution
and embedding of conditionals down to the innermost loop level.
Previous experience with Tensor Comprehensions gave us the intuition that
forcing innermost control-flow nesting is a lot like writing data-parallel
code with arrays of boolean values and predication.
This type of trick has also been used before in polyhedral compilers to
convert non-affine control into affine compute dependencies.
While it may be possible to automate such rewrites from generic IR,
`linalg.generic` just forces the semantics for now.
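As a sketch of what such deep predication may look like (pseudo-IR, in the
style of the lowered examples above), a guarded write becomes an
unconditional, perfectly nested write once the condition moves into the
innermost body:
```
// Imperfectly nested: the write is guarded at the loop level.
for %i = 0 to %M {
  if %i < %b {
    store %v, %O[%i]: memref<?xf32>
  }
}
// Perfectly nested equivalent: every iteration writes; the condition
// becomes a select on the value being stored.
for %i = 0 to %M {
  %p = cmpi "slt", %i, %b : index
  %old = load %O[%i]: memref<?xf32>
  %new = select %p, %v, %old : f32
  store %new, %O[%i]: memref<?xf32>
}
```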
The key implication is that this conversion to deep predication needs to be
undone once we are done with Linalg transformations.
After iterators and induction variables are materialized (i.e. after lowering
out of `linalg.generic` has occurred), the overall performance is greatly
influenced by the quality of canonicalizations, foldings and *Loop-Invariant
Code Motion* (LICM).
In the grander scheme, the reliance on late LICM was deemed a necessary risk.
### Putting it Together<a name="summary"></a>
As it stands, the six properties above define the semantics of a
`linalg.generic` op. It is an open question whether all of these semantics are
strictly necessary in practice and whether some should or could be derived
automatically while still maintaining the [core guiding
principles](#guiding_principles).
For the time being, we have settled on the combination of these properties
because of empirical evidence building and working on multiple high-level
compilers. As we lay those down and engage more with the community, we expect
multiple rounds of discussions and design changes to the original architecture.
## Data Representation: Views<a name="views"></a>
The current implementation uses the [Strided MemRef (a.k.a View)](
https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/MaL8m2nXuio)
abstraction. The name *View* is used interchangeably in `linalg` to signify
*Strided MemRef*.
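In MLIR syntax, such a view is a `memref` with an explicit strided layout,
e.g. (a sketch of the type notation):
```
// A 2-D view: dynamic sizes, dynamic offset into the underlying buffer,
// dynamic outer stride and a contiguous innermost dimension.
memref<?x?xf32, offset: ?, strides: [?, 1]>
```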
In the future we expect to use other structured data types and
support ragged, mixed-sparse and other types. We expect to draw on the
experience from existing LIFT abstractions for
[sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf)
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf).
## Metadata Ops<a name="metadata_ops"></a>
A set of ops that manipulate metadata but do not move memory. These ops take
`view` operands + extra attributes and return new `view`s. The returned
`view`s generally alias the operand `view`. At the moment the existing ops
are:
* `std.view`,
* `std.subview`,
* `linalg.range`,
* `linalg.slice`,
* `linalg.reshape`,
* `linalg.transpose`.
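For instance, a `std.subview` produces a new view that aliases a sub-rectangle
of its operand without moving any data (a sketch):
```
// %v aliases the region of %A starting at (%i, %j) with the given sizes and
// steps; only the descriptor (pointer, offset, sizes, strides) changes.
%v = subview %A[%i, %j][%sz0, %sz1][%c1, %c1]
    : memref<?x?xf32> to memref<?x?xf32, offset: ?, strides: [?, 1]>
```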
Future ops are added on a per-need basis but should include:
* `linalg.tile`,
* `linalg.intersection`,
* `linalg.convex_union`,
* `linalg.difference` (would need to work on a list of views).
These additional operations correspond to abstractions that have been known to
work in the field of large-scale distributed stencil computations.
In a longer-term future, the abstractions from [Legion data-centric
programming model](https://legion.stanford.edu/overview/) seem generally
appealing.
## Named Payload-Carrying Ops<a name="named_ops"></a>
Additionally, `linalg` provides a small set of commonly used named operations:
* `linalg.copy`,
* `linalg.fill`,
* `linalg.dot`,
* `linalg.matmul`,
* `linalg.conv`.
These named operations adhere to the `linalg.generic` op interface. Work is in
progress to define declarative mechanisms to automatically generate named ops
from a description in terms of only the generic op interface.
This is the main reason there are only a small number of ops today: we expect
them to be auto-generated from Tablegen soon.
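As an illustration, a named op can be thought of as sugar for a fixed
`linalg.generic` configuration (a sketch; the attribute notation mirrors the
hypothetical examples above):
```
// Named form:
linalg.matmul(%A, %B, %C) : memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>
// Roughly equivalent generic configuration:
//   indexings: [(i, j, k) -> (i, k), (i, j, k) -> (k, j), (i, j, k) -> (i, j)]
//   iterator_types: ["parallel", "parallel", "reduction"]
//   region: a scalar multiply-accumulate of the operand elements
```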
# Open Issues and Design Alternatives<a name="open_issues"></a>
Multiple open issues and design alternatives are in flight and it is time to
lay them out for the community to discuss and pick apart:
1. Should `linalg.generic` support nesting?
1. Should `linalg.generic` regions take views or only scalars?
1. Should we try to solve automatic differentiation at this level of
abstraction?
1. Are all the six properties really necessary?
1. Is this relying too much on declarative specification and would we be
better off relying more on analyses?
1. Is this general enough for the community's needs? If not how should this be
extended, if at all?
...
These key questions (and many more) should really be thought of in the general
context of MLIR, in which different levels of IR interoperate seamlessly. In
practice, it is not necessary (or beneficial) to try and solve all problems in the
same IR.


@ -0,0 +1,624 @@
# Linalg Dialect Rationale: The Case For Compiler-Friendly Custom Operations
[TOC]
# Introduction<a name="introduction"></a>
## Positioning
<img width="180" align="left" alt="MLIR Codegen Flow" src="https://user-images.githubusercontent.com/10148468/73613629-c5586580-45c5-11ea-94b7-074aeea94c7b.png">
This document describes the key design principles
that led to the existing implementation of Linalg and aims at exposing
the tradeoffs involved when building higher-level Intermediate
Representations (IR) and Dialects to facilitate code
generation. Consider the simplified schema describing codegen in MLIR.
Linalg is designed to solve the High-level Hierarchical Optimization
(HHO box) and to interoperate nicely within a
*Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).
This work is inspired by a wealth of [prior art](#prior_art) in
the field, from which it seeks to learn key lessons. This documentation
and introspection effort also comes in the context of the proposal for a
working group for discussing the [Development of high-level Tensor Compute
Primitives dialect(s) and
transformations](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/3).
We hope that the lessons from prior art, the design principles outlined in
this doc and the architecture of Linalg can help inform the community on a
path to defining these High-Level Tensor Compute Primitives.
## Inception
Linalg started as a pragmatic dialect to bootstrap code generation in MLIR, by
*defining away* complex code generation problems like precise dependence
analysis or polyhedral code generation and by introducing the ability to call
into fast library implementations when available. Linalg **defines ops and
transformations declaratively** and was originally restricted to ops with
*linear-algebra like* semantics (`pointwise`, `matmul`, `conv`...). This
approach enables building a high-level productivity-first codegen solution that
leverages *both* compiler optimizations *and* efficient library implementations
so as not to miss out on simple performance benefits. For example, if
one's favorite HPC library or ISA has a `matmul` primitive running at 95% of
the achievable peak performance, for operands stored in some memory, one should
be able to **use the primitive** when possible *and* generate code otherwise.
However, as the design of Linalg co-evolved with the design of MLIR, it became
apparent that it could extend to larger application domains than just machine
learning on dense tensors.
The design and evolution of Linalg follows a *codegen-friendly* approach where
the IR and the transformations evolve hand-in-hand.
The key idea is that op semantics *declare* and transport information that is
traditionally obtained by compiler analyses.
This information captures the legality and applicability of transformations and
is **not lost by lowering prematurely to loop or CFG form**. The key
transformations are designed so as to **preserve this information** as long as
necessary. For example, `linalg.matmul` remains `linalg.matmul` after tiling
and fusion.
Furthermore, Linalg decouples transformation validity from profitability
considerations and voluntarily leaves the latter aside in the first iteration
(see the [suitability for search](#suitability_for_search) guiding principle).
The first incarnation of these ideas was presented as an example at the
EuroLLVM 2019 developer's meeting as part of the
[Linalg section](https://llvm.org/devmtg/2019-04/slides/Tutorial-AminiVasilacheZinenko-MLIR.pdf)
of the first [MLIR Tutorial](https://www.youtube.com/watch?v=cyICUIZ56wQ).
## Evolution
Since the initial implementation, the design has evolved with, and partially
driven the evolution of the core MLIR infrastructure to use
[Regions](https://mlir.llvm.org/docs/LangRef/#regions),
[OpInterfaces](https://mlir.llvm.org/docs/Interfaces/),
[ODS](https://mlir.llvm.org/docs/OpDefinitions/) and
[Declarative Rewrite Rules](https://mlir.llvm.org/docs/DeclarativeRewrites/)
among others. The approach adopted by Linalg was extended to become
[StructuredOps abstractions](
https://drive.google.com/drive/u/0/folders/1sRAsgsd8Bvpm_IxREmZf2agsGU2KvrK-),
with Linalg becoming its incarnation on tensors and buffers.
It is complemented by the
[Vector dialect](https://mlir.llvm.org/docs/Dialects/Vector/),
which defines structured operations on vectors, following the same rationale and
design principles as Linalg. (The Vector dialect includes higher-level
operations on multi-dimensional vectors and abstracts away the lowering to
single-dimensional vectors).
The Linalg dialect itself grew beyond linear algebra-like operations to become
more expressive, in particular by providing an abstraction of a loop nest
supporting parallelism, reductions and sliding windows around arbitrary MLIR
[regions](https://mlir.llvm.org/docs/LangRef/#regions). It also has the
potential of growing beyond *dense* linear-algebra to support richer data
types, such as sparse and ragged tensors and buffers.
Linalg design remains open to evolution and cross-pollination with other
dialects and approaches. It has been successfully used as the staging ground
for code generation-related abstractions, spinning off the generalization of
the following:
- the `!linalg.view` type folded into the *"Strided MemRef"* type while
preserving structure to allow calling into external C++ libraries with
unsurprising ABI conventions;
- the `linalg.view` and `linalg.subview` ops evolved into the standard dialect;
- the `linalg.for`, `linalg.load` and `linalg.store` ops evolved into a prelude
to the *structured control flow* dialect (named `LoopOps`).
More components can be extracted, redesigned and generalized when new uses or
requirements arise.
Several [design questions](#open_issues) remain open in Linalg, which does not
claim to be a general solution to all compilation problems.
It does aim at driving thinking and implementations of domain-specific
abstractions where programmer's intent can be captured at a very high level,
directly in the IR.
Given the evolution of the scope, it becomes apparent that a better name than
"Linalg" could remove some of the confusion related to the dialect (and the
underlying approach), its goals and limitations.
# Prior Art<a name="prior_art"></a>
Linalg draws inspiration from decades of prior art to design a modern and
pragmatic solution. The following non-exhaustive list refers to some of the
projects that influenced Linalg design:
- [ONNX](https://onnx.ai/),
- [LIFT](https://www.lift-project.org/),
- [XLA](https://www.tensorflow.org/xla/architecture),
- [Halide](https://halide-lang.org/) and [TVM](https://tvm.apache.org/),
- [TACO](http://tensor-compiler.org/),
- [Darkroom](http://darkroom-lang.org/) and [Terra](http://terralang.org/),
- [Sigma-LL](http://spiral.ece.cmu.edu:8080/pub-spiral/pubfile/cgo16-preprint_248.pdf),
- [Tensor Comprehensions](https://arxiv.org/abs/1802.04730),
- [Polyhedral Compilers](https://en.wikipedia.org/wiki/Polytope_model),
- the [Affine dialect](https://mlir.llvm.org/docs/Dialects/Affine/) in MLIR,
- Generic Loop Transformations (see Ken Kennedy's
[Optimizing Compilers for Modern Architectures](
https://www.elsevier.com/books/optimizing-compilers-for-modern-architectures/allen/978-0-08-051324-9))
- Traditional compiler CFGs with SSA forms.
Additionally, experience with the following tools proved very valuable when
thinking holistically about how all these components interplay all the way
up to the user and down to the hardware:
- the [Torch](http://torch.ch/) machine-learning framework,
- the LLVM compiler, specifically in JIT mode,
- high-performance libraries (MKL, CUBLAS, FBFFT)
- the [PeachPy](https://www.cs.utexas.edu/users/flame/BLISRetreat/BLISRetreatTalks/PeachPy.pdf) assembler
- current and potentially upcoming hardware ISAs.
The novelty of MLIR's code base and its unprecedented support for defining and
mixing abstractions enable one to reflect on and integrate the key elements
of the prior art's successes as well as to avoid the common pitfalls in the area of
code generation. Thus, instead of diverging into a discussion about the
implications of adopting any of the existing solutions, Linalg had the
possibility to build on all of them and learn from their experience while
leveraging the benefit of hindsight.
The following reflections on prior art have influenced the design of Linalg.
The discussion is by no means exhaustive but should capture the key motivations
behind Linalg.
## Lessons from ONNX<a name="lessonsonnx"></a>
ONNX is a specification of operations that appear in Machine Learning
workloads. As such, it is predominantly driven by the expressiveness requirements
of ML, and less by the considerations of IR design for HPC code generation.
Similarly to ONNX, Linalg defines *"semantically charged" named ops*.
But it also considers *transformations on these ops* as a key component and
defines the IR to support the transformations, preferring transformations over
expressiveness if necessary.
Linalg hopes to additionally address the following:
- facilitate frontend-compiler co-design by taking into account compiler
transformations and lowerings in op definition;
- minimize the set of available ops by making them non-overlapping with each
other, thus simplifying the intermediate representation.
## Lessons from LIFT<a name="lessonslift"></a>
[LIFT](https://www.lift-project.org/) is a system to write computational
kernels based on functional abstractions. Transformations are
represented by additional nodes in the IR, whose semantics are at the
level of the algorithm (e.g. `partialReduce`).
LIFT applies and composes transformations by using [local rewrite
rules](https://www.lift-project.org/presentations/2015/ICFP-2015.pdf) that
embed these additional nodes directly in the functional abstraction.
Similarly to LIFT, Linalg uses local rewrite rules implemented with the MLIR
[Declarative Rewrite Rules](https://mlir.llvm.org/docs/DeclarativeRewrites/)
mechanisms.
Linalg builds on, and helps separate concerns in, the LIFT approach as follows:
- transformations are either separated from the representation or expressed as
composable attributes that are independent of the actual computation,
avoiding intricate effects on performance;
- abstractions are split into smaller components (e.g., control flow and data
structure abstractions) potentially reusable across different dialects in the
MLIR's open ecosystem.
LIFT is expected to further influence the design of Linalg as it evolves. In
particular, extending the data structure abstractions to support non-dense
tensors can use the experience of LIFT abstractions for
[sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf)
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf).
## Lessons from XLA<a name="lessonsxla"></a>
[XLA](https://www.tensorflow.org/xla/architecture) is one of the first
post-Theano ML compilers that was introduced as a pragmatic compilation
solution for TensorFlow. It shines on Google's xPU
hardware and is an important piece of the puzzle. It is particularly good at
(1) transforming code back and forth between the scalar and the vector
worlds, (2) passing function boundaries for handling both host and device
code, and (3) complying with stringent requirements imposed by energy-efficient
xPUs.
XLA followed a pragmatic design process where the compiler is given perfect
knowledge of each op's semantics, all starting from the mighty `conv` and
`matmul` ops. XLA transformations consist of writing emitters that compose as
C++ functions. Perfect knowledge of op semantics has two big benefits: (1)
transformations are correct by construction and (2) very strong performance
on difficult xPU targets.
Similarly, Linalg ops *"know their semantics"* and *"know how to transform and
lower themselves"*. The means by which this information is made available and
how it is used in MLIR are, however, very different.
Linalg hopes to additionally address the following:
- HLOs are expressive as a whole, but each op has very limited and fixed
semantics: ops are not configurable. As a consequence, HLOs have evolved into
a too large set of ops whose semantics intersect.
This echoes the ops proliferation problem also exhibited by ONNX.
- Reliance on perfect op knowledge leads to situations where transformations and
ops end up needing to know about each other's semantics (e.g. during fusion).
Since the transformations themselves are not simple local rewrite patterns
(unlike LIFT), code complexity grows quickly.
- XLA lacks an independent IR that can be inspected, unit tested and used
independently. This monolithic design makes the system not portable: xPU passes
and GPU passes do not share much code.
## Lessons from Halide and TVM<a name="lessonshalide"></a>
[Halide](https://halide-lang.org/) is a DSL embedded in C++ that provides a
way of metaprogramming the HalideIR and applying transformations declaratively
to let the expert user transform and optimize the program in tailored ways.
Halide initially targeted the SIGGRAPH community but is now more generally
applicable. [TVM](https://tvm.apache.org/) is an evolution of Halide into the
machine learning and deep-neural network space, based on HalideIR.
The Halide transformation methodology follows similar principles to the
[URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf)
and
[CHiLL](https://pdfs.semanticscholar.org/6a46/20589f63f3385707d2d590f7b7dc8ee4d74f.pdf)
compiler transformation frameworks, but without the strengths (and especially
complexity) of the polyhedral model.
Halide particularly shines at making the HPC transformation methodology
accessible to $\Omega$(10-100) users, at a time when polyhedral tools are
still only accessible to $\Omega$(1-10) users. Halide makes heavy usage of
canonicalization rules that are also very prevalent in MLIR.
Linalg hopes to additionally address the following:
- Halide scheduling is powerful and explores a large swath of possible
transformations. But it's still too hard for newcomers to use or extend. The
level of performance you get from Halide is very different depending on
whether one is a seasoned veteran or a newcomer. This is especially true as
the number of transformations grows.
- Halide raises rather than lowers in two ways, going counter-current to the
design goals we set for high-level codegen abstractions in MLIR. First,
canonical Halide front-end code uses explicit indexing and math on scalar
values, so to target BLAS/DNN libraries one needs to add pattern matching
which is similarly brittle as in the affine case. While Halide's performance
is on par with the libraries on programmable targets (CPU/GPU), that
approach doesn't work on mobile accelerators or on xPUs, where the framework
ingests whole-tensor operations.
Second, reductions and scans are expressed using serial iteration, again
requiring pattern matching before they can be transformed (e.g. to do a
reduction using atomics, or hierarchically). The lesson to draw is that we
should start with higher-level primitives than Halide.
## Lessons from Tensor Comprehensions<a name="lessonstc"></a>
[Tensor Comprehensions](https://arxiv.org/abs/1802.04730) is a
high-level language to express tensor computations with a syntax
generalizing the Einstein notation, coupled to an end-to-end
compilation flow capable of lowering to efficient GPU code. It was
integrated with two ML frameworks: Caffe2 and PyTorch.
<img width="600" alt="MLIR Codegen Flow"
src="https://user-images.githubusercontent.com/10148468/73613272-df904480-45c1-11ea-88f9-214dee7464cf.png">
The compilation flow combines [Halide](#lessonshalide) and a Polyhedral Compiler
derived from [ISL](https://en.wikipedia.org/wiki/Integer_set_library)
and uses both HalideIR and the ISL *schedule-tree* IR.
The compiler provides a collection of polyhedral compilation
algorithms to perform fusion and favor multi-level parallelism and
promotion to deeper levels of the memory hierarchy.
Tensor Comprehensions showed that fixing a few predefined strategies
with parametric transformations and tuning knobs can already provide
great results. In that previous work, a simple
genetic search combined with an autotuning framework was sufficient
to find good implementations in the ***non-compute-bound regime***.
This requires code versions obtainable by the
various transformations to encompass versions that get close to the
roofline limit.
The ultimate goal of Tensor Comprehensions was to concretely mix
Halide high-level transformations with polyhedral mid-level
transformations and build a pragmatic system that could take advantage
of both styles of compilation.
Linalg hopes to additionally address the following:
- Halide was never properly used in Tensor Comprehensions beyond shape
inference. Most of the investment went into simplifying polyhedral
transformations and building a usable end-to-end system. MLIR was
deemed a better infrastructure to mix these types of compilation.
- The early gains provided by reusing established infrastructures
(HalideIR and ISL schedule trees) turned into more impedance mismatch
problems than could be solved with a small tactical investment.
- Tensor Comprehensions emitted CUDA code which was then JIT-compiled
with NVCC from a textual representation. While this was a pragmatic
short-term solution, it made it hard to perform low-level rewrites that
would have helped with register reuse in the ***compute-bound regime***.
- The same reliance on emitting CUDA code made it difficult to
create cost models when the time came. This made it artificially harder
than necessary to prune out bad solutions. This resulted in excessive
runtime evaluation, as reported in the paper [Machine Learning Systems
are Stuck in a Rut](https://dl.acm.org/doi/10.1145/3317550.3321441).
Many of those issues are naturally addressed by implementing these ideas
in the MLIR infrastructure.
## Lessons from Polyhedral compilers<a name="lessonspolyhedral"></a>
The polyhedral model has been on the cutting edge of loop-level optimization for
decades, with several incarnations in production compilers such as
[GRAPHITE](https://gcc.gnu.org/wiki/Graphite) for GCC and
[Polly](https://polly.llvm.org) for LLVM. Although it has proved crucial to
generate efficient code from domain-specific languages such as
[PolyMage](http://mcl.csa.iisc.ac.in/polymage.html) and [Tensor
Comprehensions](https://dl.acm.org/doi/abs/10.1145/3355606), it has never been
fully integrated into mainstream general-purpose optimization pipelines. Detailed
analysis of the role of polyhedral transformations is provided in the
[simplified polyhedral
form](https://mlir.llvm.org/docs/RationaleSimplifiedPolyhedralForm/) document
dating back to the inception of MLIR.
In particular, polyhedral abstractions have proved challenging to integrate with
a more conventional compiler due to the following.
- The transformed code (or IR) quickly gets complex and thus hard to analyze and
understand.
- Code generation from the mathematical form used in the polyhedral model relies
on non-trivial exponentially complex algorithms.
- The mathematical form is rarely composable with the SSA representation and
related algorithms, on which most mainstream compilers are built today.
- Expressiveness limitations, although addressed in the scientific literature
through, e.g., summary functions, often remain present in actual
implementations.
The Affine dialect in MLIR was specifically designed to address the integration
problems mentioned above. In particular, it maintains the IR in the same form
(loops with additional constraints on how the bounds are expressed) throughout
the transformation, decreasing the need for one-shot conversion between
drastically different representations. It also embeds the polyhedral
representation into the SSA form by using MLIR regions and thus allows one to
combine polyhedral and SSA-based transformations.
## Lessons from the Affine dialect<a name="lessonsaffine"></a>
The Affine dialect in MLIR brings the polyhedral abstraction closer to the
conventional SSA representation. It addresses several long-standing integration
challenges as described above and is likely to be more suitable when compiling
from a C language-level abstraction.
MLIR makes it possible to start from a higher-level abstraction than C, for
example in machine learning workloads. In such cases, it may be possible to
avoid complex analyses (data-flow analysis across loop iterations is
exponentially complex) required for polyhedral transformation by leveraging the
information available at higher levels of abstractions, similarly to DSL
compilers. Linalg intends to use this information when available and ensure
*legality of transformations by construction*, by integrating legality
preconditions in the op semantics (for example, loop tiling can be applied to
the loop nest computing a matrix multiplication, no need to additionally rely on
affine dependence analysis to check this). This information is not readily
available in the Affine dialect, and can only be derived using potentially
expensive pattern-matching algorithms.
Informed by the practical experience in polyhedral compilation and with the
Affine dialect in particular, Linalg takes the following decisions.
- **Discourage loop skewing**: the loop skewing transformation, that is
sometimes used to enable parallelization, often has surprising (negative)
effects on performance. In particular, polyhedral auto-transformation can be
expressed in a simpler way without loop skewing; skewing often leads to
complex control flow hampering performance on accelerators such as GPUs.
Moreover, the problems loop skewing addresses can be better addressed by other
approaches, e.g., diamond tiling. In the more restricted case of ML workloads,
multi-for loops with induction variables independent of each other (referred
to as hyper-rectangular iteration domains in the literature) such as the
proposed
[affine.parallel](https://llvm.discourse.group/t/rfc-add-affine-parallel/350)
are sufficient in the majority of cases.
- **Declarative Tiling**: the *tiling* transformation is ubiquitous in HPC code
generation. It can be seen as a decomposition of either the iteration space or
the data space into smaller regular parts, referred to as tiles. Polyhedral
approaches, including the Affine dialect, mostly opt for iteration space
tiling, which introduces additional control flow and complex address
expressions. If the tile sizes are not known during the transformation
(so-called parametric tiling), the address expressions and conditions quickly
become non-affine, or require exponentially complex algorithms to reason about
them. Linalg focuses tiling on the data space instead, creating views into the
buffers that leverage MLIR's strided `memref` abstraction. These views compose,
and the complexity of access expressions remains predictable (see the sketch
after this list).
- **Preserve high-level information**: Linalg maintains the information provided
by the op semantics as long as necessary for transformations. For example, the
result of tiling a matrix multiplication is loops around a smaller matrix
multiplication. Even with pattern-matching on top of the Affine dialect, this
would have required another step of pattern-matching after the transformation.
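For illustration, a sketch of the contrast (hypothetical tile sizes; syntax in
the spirit of the examples in the Linalg document):
```
// Iteration-space tiling introduces min/mod address expressions, e.g.:
//   for %ii = 0 to %M step 32
//     for %i = %ii to min(%ii + 32, %M) { ... A[f(%i)] ... }
// Data-space tiling instead takes tiles through views; the accesses inside
// the smaller op remain simple:
loop.for %i = %c0 to %M step %c32 {
  loop.for %j = %c0 to %N step %c32 {
    %tile = subview %C[%i, %j][%c32, %c32][%c1, %c1]
        : memref<?x?xf32> to memref<?x?xf32, offset: ?, strides: [?, 1]>
    // ... compute on %tile with a smaller linalg op ...
  }
}
```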
Given these choices, Linalg intends to be a better fit for **high-level
compilation** where significantly more information is readily available in the
input representation and should be leveraged before lowering to other
abstractions. Affine remains a strong abstraction for mid-level transformation
and is used as a lowering target for Linalg, enabling further transformations
and combination of semantically-loaded and lower-level inputs. As such, Linalg
is intended to complement Affine rather than replace it.
# Core Guiding Principles<a name="guiding_principles"></a>
## Transformations and Simplicity First<a name="transformations_first"></a>
The purpose of the Linalg IR and its operations is primarily to:
- develop a set of key transformations, and
- make them correct by construction by carefully curating the set of
generic operation properties that drive applicability, and
- make them very simple to implement, apply, verify and especially
maintain.
The problem at hand is fundamentally driven by compilation of domain-specific
workloads for high-performance and parallel hardware architectures: **this is
an HPC compilation problem**.
The selection of relevant transformations follows a codesign approach and
involves considerations related to:
- concrete current and future needs of the application domain,
- concrete current and future hardware properties and ISAs,
- understanding of strengths and limitations of [existing approaches](#prior_art),
- taking advantage of the coexistence of multiple levels of IR in MLIR.
One needs to be methodical to avoid proliferation and redundancy. A given
transformation could exist at multiple levels of abstraction but **just
because one can write transformation X at level Y absolutely does not mean
one should**. This is where evaluation of existing
systems and acknowledgement of their strengths and weaknesses is crucial:
simplicity and maintainability aspects must be first-order concerns. Without
this additional effort of introspection, a design will not stand the test of
time. At the same time, complexity is very hard to ward off. It seems one needs
to suffer complexity to be prompted to take a step back and rethink
abstractions.
This is not merely a reimplementation of idea X in system Y: simplicity
**must be the outcome** of this introspection effort.
## Preservation of Information<a name="information_preservation"></a>
The last two decades have seen a proliferation of Domain-Specific Languages
(DSLs) that have been very successful at limited application domains.
The main commonality between these systems is their use of a significantly
richer structural information than CFGs or loops.
Still, another commonality of existing systems is that they lower to LLVM very
quickly and cross a wide abstraction gap in a single step. This process often
drops semantic information that needs to be reconstructed later,
when it is not irremediably lost.
These remarks, coupled with MLIR's suitability for defining IR at multiple
levels of abstraction, led to the following two principles.
### Declarative Specification: Avoid Raising<a name="declarative_specification"></a>
Compiler transformations need static structural information (e.g. loop-nests,
graphs of basic blocks, pure functions, etc.). When that structural information
is lost, it needs to be reconstructed.
A good illustration of this phenomenon is the notion of *raising* in polyhedral
compilers: multiple polyhedral tools start by raising from a simplified C
form or from SSA IR into a higher-level representation that is more amenable
to loop transformations.
In advanced polyhedral compilers, a second type of raising
may typically exist to detect particular patterns (often variations of
BLAS). Such patterns may be broken by transformations making their detection
very fragile or even just impossible (incorrect).
MLIR makes it easy to define op semantics declaratively thanks to the use of
regions and attributes. This is an ideal opportunity to define new abstractions
to convey user-intent directly into the proper abstraction.
### Progressive Lowering: Don't Lose Information too Quickly<a name="progressive_lowering"></a>
Lowering too quickly to affine, generic loops or CFG form reduces the
amount of structure available to derive transformations from. While
manipulating loops is a net gain compared to CFG form for a certain class of
transformations, important information is still lost (e.g. parallel loops, or
mapping of a loop nest to an external implementation).
This creates non-trivial phase ordering issues. For instance, loop fusion may
easily destroy the ability to detect a BLAS pattern. One possible alternative
is to perform loop fusion, tiling, intra-tile loop distribution and then hope to
detect the BLAS pattern. Such a scheme presents difficult phase-ordering
constraints that will likely interfere with other decisions and passes.
Instead, certain Linalg ops are designed to maintain high-level information
across transformations such as tiling and fusion.
MLIR is designed as an infrastructure for ***progressive lowering***.
Linalg fully embraces this notion and thinks of codegen in terms of
*reducing a potential function*. That potential function is loosely
defined in terms of the number of low-level instructions in a particular
Linalg op (i.e. how heavy or lightweight the Linalg op is).
Linalg-based codegen and transformations start from higher-level IR
ops and dialects. Then each transformation application reduces the
potential by introducing lower-level IR ops and *smaller* Linalg ops.
This gradually reduces the potential, all the way to Loops + VectorOps
and LLVMIR.
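Schematically, each step trades a heavier op for lower-level control flow plus
smaller ops (a rough sketch, not actual IR):
```
linalg.matmul                    // one heavy op: maximal potential
  --> loops { linalg.matmul }    // after tiling: loops + a smaller matmul
  --> loops { vector ops }       // after vectorization of the tiles
  --> LLVMIR                     // final lowering
```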
## Composable and Declarative Transformations<a name="declarative_transformations"></a>
Complex and impactful transformations need not be hard to manipulate, write or
maintain. Mixing XLA-style high-level op semantics knowledge with generic
properties to describe these semantics, directly in MLIR, is a promising way to:
- Design transformations that are correct by construction, easy to
write, easy to verify and easy to maintain.
- Provide a way to specify transformations and the units of IR they manipulate
declaratively. In turn this allows using local pattern rewrite rules in MLIR
(i.e. [DRR](https://mlir.llvm.org/docs/DeclarativeRewrites/)).
- Allow creating customizable passes declaratively by simply selecting rewrite
rules. This allows mixing transformations, canonicalizations, constant folding
and other enabling rewrites in a single pass. The result is a system where pass
fusion is very simple to obtain and gives hope to solving certain
[phase ordering issues](https://dl.acm.org/doi/10.1145/201059.201061).
## Suitability for Search and Machine Learning<a name="ml"></a>
Compiler heuristics are hand-crafted, human-engineered features: they are
ripe for disruption by machine-learning techniques.
To enable search, compiler transformations should be fine-grained,
[composable](#declarative_transformations) and expose tuning parameters that
can modify their behavior, guided by lessons from previous experience
with [Tensor Comprehensions](#lessonstc).
Of course, we are not advocating for using ML everywhere in the stack
immediately: low-level compilation and machine models are still quite performant
in LLVM. However, for the high-level and mid-level optimization problems,
models need to be conditioned (probabilistically) on the low-level
compiler, which acts as a blackbox. For these reasons we prioritize the
design of IR and transformations with search-friendly properties over
building cost models.
Still, this does not mean Linalg refuses cost models: instead we
prefer to invest in infrastructure that will enable [ML-based
techniques to automatically build cost
models](http://homepages.inf.ed.ac.uk/hleather/publications/2009_autofeatures_cgo.pdf).
## Extensibility and Future-Proofness<a name="future"></a>
MLIR allows defining IR for structured control flow and structured
data types. We choose to take advantage of these properties for the
reasons described above.
In particular, the `MemRefType` represents dense non-contiguous memory regions.
This structure should extend beyond simple dense data types and generalize to
ragged, sparse and mixed dense/sparse tensors as well as to trees, hash tables,
tables of records and maybe even graphs.
For such advanced data types, the control-flow required to traverse the
data structures, termination conditions, etc. are much less simple to analyze and
characterize statically. As a consequence, we need to also design solutions that
stand a chance of evolving into runtime-adaptive computations (e.g.
inspector-executor in which an *inspector* runs a cheap runtime
analysis on the data to configure how the *executor* should run).
While there is no concrete solution
today to solve these problems in MLIR, it is pretty clear that perfect
static knowledge and analyses will not be serious contenders for these problems.
# Key Observations<a name="keyobservation"></a>
The following key observations have influenced the design of Linalg and helped
reconcile [core guiding principles](#guiding_principles) with real-world
requirements when producing an implementation based on MLIR.
## Algorithms + Data Structures = Programs<a name="data_and_compute"></a>
This is a twist on Niklaus Wirth's formulation but captures the essence of the
design of Linalg: control-flow does not exist in a vacuum, independently of
data.
On the contrary, there is a very strong relationship between control-flow and
data structures: one cannot exist without the other. This has multiple
implications on the [semantics of Linalg Ops](#linalg_ops) and their
transformations. In particular, this observation influences whether
certain transformations are better done:
- as control flow or data structure manipulation,
- on Linalg ops attributes or on loops after some partial lowering
occurred,
- as extensions to the Linalg dialect in terms of new ops or attributes.
## The Dialect Need not be Closed Under Transformations<a name="dialect_not_closed"></a>
This is probably the most surprising and counter-intuitive
observation. When one designs IR for transformations, closed-ness is
often a non-negotiable property.
This is a key design principle of polyhedral IRs such as
[URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf)
and
[ISL-based IRs](https://en.wikipedia.org/wiki/Integer_set_library):
they are closed under affine transformations.
In MLIR, multiple dialects coexist and form a coherent whole. After
experimenting with different alternatives, it became clear that strict
dialect closed-ness wasn't necessary and could be relaxed. Previous
systems did not have simple and principled means of building new IR
and probably suffered from this limitation. We conjecture this is a
key reason they required the IR to be closed under transformations.
Despite the fact that Linalg ops only allow perfectly nested
semantics, once tiling and fusion kick in, imperfectly nested loops
are gradually introduced.
In other words, imperfectly nested control flow appears as ***the result of
applying key transformations***.
Considering the *potential* described during the discussion on
[Progressive Lowering](#progressive_lowering), closed-ness under
transformation would dictate that the potential remains constant.
In contrast, Linalg advocates for ***monotonicity*** under
transformations.
## Summary of Existing Alternatives in a Picture<a name="observationssummary"></a>
Lastly, we summarize our observations of lessons from [Prior
Art](#prior_art)---when viewed under the lens of our [Core Guiding
Principles](#guiding_principles)---with the following picture.
<img width="1200" alt="MLIR Codegen Flow"
src="https://user-images.githubusercontent.com/10148468/73613904-2f720a00-45c8-11ea-8265-1c856c02525b.png">
This figure is not meant to be perfectly accurate but a rough map of
how we view the distribution of structural information in existing
systems, from a codegen-friendly angle. Unsurprisingly, the
[Linalg Dialect](https://mlir.llvm.org/docs/Dialects/Linalg) and its
future evolutions aspire to a position in the top-right of this map.


@ -24,85 +24,11 @@ def Linalg_Dialect : Dialect {
can lower to scalar load/store and other operations or to more general
library calls.
The `linalg` dialect manipulates the following types and operations:
### Core data types and special ops.
The following abstractions are used by the `linalg` dialect:
#### Views
The current implementation uses the strided memref abstraction. In the
future, other abstractions besides strided memref will be used.
#### `!linalg.range`
This data type is currently just a triple (`min`, `max`, `step`) that does
not pass function boundaries.
#### `linalg.yield`
This op is used as a terminator within the appropriate `linalg` regions.
In the future, richer `view` and `range` representations are expected, in
particular to represent sparse traversals.
### Metadata Ops
A set of ops that manipulate metadata but do not move memory. These ops take
`view` operands + extra attributes and return new `view`s. The returned
`view`s generally alias the operand `view`. At the moment the existing ops
are:
* `std.view`,
* `std.subview`,
* `linalg.range`,
* `linalg.slice`,
* `linalg.transpose`.
Future ops are added on a per-need basis but should include:
* `linalg.reshape`,
* `linalg.tile`,
* `linalg.intersection`,
* `linalg.convex_union`,
* `linalg.difference` (would need to work on a list of views).
### Payload Ops
A set of payload carrying operations that implement the [structured ops](
https://docs.google.com/presentation/d/1P-j1GrH6Q5gLBjao0afQ-GfvcAeF-QU4GXXeSy0eJ9I/edit#slide=id.p
)
abstraction on tensors and buffers. `linalg` has two generic operations,
`linalg.generic` and `linalg.indexed_generic`, for expressing custom
operations.
This is subject to further evolution as transformations and analyses
continue to be developed.
Additionally, `linalg` provides some commonly named operations:
* `linalg.copy`,
* `linalg.fill`,
* `linalg.dot`,
* `linalg.matmul`,
* `linalg.conv`.
Future ops are added on a per-need basis but should include:
* `linalg.pad`.
In an ideal world, all the named ops would be automatically generated from
a description in terms of only the two generic ops. Unfortunately we do not
have such support yet (contributions are most welcome).
### Convention for external library interop
The `linalg` dialect adopts a convention that is similar to `BLAS` when
offloading operations to fast library implementations: pass a non-owning
pointer to input and output data with additional metadata. This convention
is also found in libraries such as `MKL`, `OpenBLAS`, `BLIS`, `cuBLAS`,
`cuDNN`, etc., and more generally at interface points across language
boundaries (e.g. C++ / Python).
Generally, `linalg` passes non-owning pointers to strided memref data
structures to precompiled library calls linked externally. The name `view`
is used interchangeably in `linalg` to signify strided memref discussed at
length in the [strided memref RFC](
https://groups.google.com/a/tensorflow.org/g/mlir/c/MaL8m2nXuio/m/a_v07o9yBwAJ).
Additional [Linalg Dialect
Documentation](https://mlir.llvm.org/docs/Dialects/Linalg) and a
[Rationale Document](https://mlir.llvm.org/docs/RationaleLinalgDialect)
are also available and should be read first before going into the details of
the op semantics.
}];
}