[mlir][Linalg][doc] Add Design Document for the Linalg Dialect

Summary: This revision adds a Rationale for the Linalg Dialect.

Reviewers: rriddle, mehdi_amini, ftynse, albertcohen
Reviewed By: albertcohen
Subscribers: merge_guards_bot, jfb, jpienaar, burmako, shauheen, antiagainst, arpith-jacob, mgester, lucyrfox, aartbik, liufengdb, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73595

# Linalg Dialect

[TOC]

To generate the documentation:

```sh
mlir-tblgen --gen-op-doc -I /path/to/mlir/include \
  /path/to/mlir/include/mlir/Dialect/Linalg/IR/LinalgDoc.td
```

# Rationale

<img width="90" align="left" alt="MLIR Codegen Flow" src="https://user-images.githubusercontent.com/10148468/73613629-c5586580-45c5-11ea-94b7-074aeea94c7b.png">

Linalg is designed to solve the High-level Hierarchical Optimization
(HHO box) in MLIR and to interoperate nicely within a
*Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).

The [Rationale Document](https://mlir.llvm.org/docs/RationaleLinalgDialect)
goes into significantly more design and architectural decision details.

# Set of Key Transformations<a name="key_transformations"></a>

The following key transformations have been central to driving the design of
Linalg. They are all implemented in terms of the properties of the
`linalg.generic` OpInterface and avoid the pitfall of relying on hardcoded
one-off op knowledge.

The textual form description of these transformations is left for future
work. Still, it is useful to at least list the key transformations that are
performed on the Linalg IR and that have influenced its design:
1. Progressive Buffer Allocation.
1. Parametric Tiling.
1. Promotion to Temporary Buffer in Fast Memory.
1. Tiled Producer-Consumer Fusion with Parametric Tile-And-Fuse.
1. Map to Parallel and Reduction Loops and Hardware.
1. Vectorization: Rewrite in Vector Form.
1. Lower to Loops (Affine and/or Generic).
1. Lower to Library Calls or Special Instructions, Intrinsics or ISA.
1. Partially Lower to Iterations Over a Finer-Grained Linalg Op.

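To make these concrete, here is a hedged sketch of what tiling (items 2 and 4
above) produces: the op is rewritten into a loop nest over tiles whose body
still contains a `linalg.matmul`, now operating on `subview`s. The syntax,
tile sizes and SSA names below are illustrative assumptions, not verbatim
dialect output:

```
// Sketch: C += A * B after tiling by 4x4x4. The payload op is preserved:
// it is still a linalg.matmul, now on smaller strided views.
loop.for %i = %c0 to %M step %c4 {
  loop.for %j = %c0 to %N step %c4 {
    loop.for %k = %c0 to %K step %c4 {
      %sA = subview %A[%i, %k][%c4, %c4][%c1, %c1]
        : memref<?x?xf32> to memref<?x?xf32, offset: ?, strides: [?, 1]>
      %sB = subview %B[%k, %j][%c4, %c4][%c1, %c1]
        : memref<?x?xf32> to memref<?x?xf32, offset: ?, strides: [?, 1]>
      %sC = subview %C[%i, %j][%c4, %c4][%c1, %c1]
        : memref<?x?xf32> to memref<?x?xf32, offset: ?, strides: [?, 1]>
      linalg.matmul(%sA, %sB, %sC)
        : memref<?x?xf32, offset: ?, strides: [?, 1]>,
          memref<?x?xf32, offset: ?, strides: [?, 1]>,
          memref<?x?xf32, offset: ?, strides: [?, 1]>
    }
  }
}
```
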
# High-Level Description of Linalg Ops<a name="linalg_ops"></a>
Linalg takes at least some inspiration from all previously [listed prior
art](#prior_art). The design enables the definition of ***CustomOps*** with
generic properties that enable [key transformations](#key_transformations),
including lowering to scalar load/store and other operations or to external
library calls and intrinsics.

These ops can have ***either tensor or buffer operands***.

## Payload-Carrying Ops<a name="payload_ops"></a>
Linalg defines two payload-carrying operations that implement the [structured ops](
https://docs.google.com/presentation/d/1P-j1GrH6Q5gLBjao0afQ-GfvcAeF-QU4GXXeSy0eJ9I/edit#slide=id.p
) abstraction on tensors and buffers. This is architected as two generic operations,
`linalg.generic` (resp. `linalg.indexed_generic`), that can express custom
operations with *index-free semantics* (resp. *indexing semantics*).
The properties of these generic ops are the result of applying the
guiding principles described in the [Rationale Document](https://mlir.llvm.org/docs/RationaleLinalgDialect).
They are listed next, with a brief example and discussion for each.

### Property 1: Input and Output Operands Define The Iteration Space<a name="prop1"></a>
A `linalg.generic` op fully *derives* the specification of its iteration space
from its operands.
The property enforces that a localized IR element (the op) *has* all the information
needed to synthesize the control-flow required to iterate over its operands,
according to their type. This notion of IR localization bears some resemblance
to [URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf).

Consider the following, partially specified, `linalg.generic` example:
```
#attrs = {args_in: 1, args_out: 1}
func @example(%A: memref<?xf32, layout1>,
              %B: memref<?xvector<4xf32>, layout2>) {
  linalg.generic #attrs (%A, %B): memref<?xf32, layout1>,
                                  memref<?xvector<4xf32>, layout2>
  return
}
```

The property "*Input and Output Operands Define The Iteration Space*" is
materialized by a lowering into a form that will resemble:
```
func @example(%A: memref<?xf32, layout1>,
              %B: memref<?xvector<4xf32>, layout2>) {
  %M = "dim" %A, 0: index
  %N = "dim" %B, 0: index
  %eq = eq %M, %N: i1          // iteration space is consistent with data
  assert(%eq): (i1) -> ()
  for %i = 0 to %M {
    %a = load %A[%i]: memref<?xf32, layout1>
    %b = load %B[%i]: memref<?xvector<4xf32>, layout2>
    // compute arg types match elemental tensor types
    %c = "some_compute"(%a, %b): (f32, vector<4xf32>) -> (vector<4xf32>)
    store %c, %B[%i]: memref<?xvector<4xf32>, layout2>
  }
  return
}
```

The property participates in simplifying analyses and transformations. For
instance, it guarantees no out-of-bounds access can occur by construction
(assuming dynamic operand dimensions agree with each other, which is the
purpose of the `assert` runtime check).

Before lowering to loop form, loop induction variables and iterators are *not yet
materialized*. This is a necessary property if we want an abstraction that
works on both tensor values and buffers because ***values don't escape
loops/nesting***.

The main implications are that:
1. The semantics of the ops are *restricted to operate on structured data
types*, on which we can define an iterator.
2. This does not model arbitrary code with side-effects.

We do not think these are serious limitations in practice because MLIR is all
about mixing different levels of abstraction in the same IR. As long as
Linalg can progressively lower to the next level of abstraction, it can also
be just bypassed for things that do not fit.

At the same time, conditioning op semantics on structured data types is a very
promising path towards extensibility to non-dense tensors as experience with
LIFT abstractions for
[sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf)
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf),
as well as [TACO](http://tensor-compiler.org/), has shown.

### Property 2: Reversible Mappings Between Control and Data Structures<a name="prop2"></a>
A `linalg.generic` *defines* the mapping between the iteration space (i.e. the
loops) and the data.

Consider the following, partially specified, `linalg.generic` example:
```
#indexing_maps = {
  (i, j) -> (j, i),
  (i, j) -> (j)
}
#attrs = {args_in: 1, args_out: 1, indexings: #indexing_maps}
func @example(%A: memref<8x?xf32, layout1>,
              %B: memref<?xvector<4xf32>, layout2>) {
  linalg.generic #attrs (%A, %B): memref<8x?xf32, layout1>,
                                  memref<?xvector<4xf32>, layout2>
  return
}
```

The property "*Reversible Mappings Between Control and Data Structures*" is
materialized by a lowering into a form that will resemble:
```
func @example(%A: memref<8x?xf32, layout1>,
              %B: memref<?xvector<4xf32>, layout2>) {
  // loop bounds determined from data sizes by "inverting the map"
  %J = "dim" %A, 0: index
  %I = "dim" %A, 1: index
  %J2 = "dim" %B, 0: index
  // iteration space is consistent with data + mapping inference
  %eq = "eq" %J, %J2: i1
  "assert" %eq: (i1) -> ()
  for %i = 0 to %I {           // loop order is fully defined by indexing maps
    for %j = 0 to %J {         // arbitrary permutations are possible
      %a = "load" %A, %j, %i: memref<8x?xf32, layout1>
      %b = "load" %B, %j: memref<?xvector<4xf32>, layout2>
      %c = "some_compute"(%a, %b): (f32, vector<4xf32>) -> (vector<4xf32>)
      "store" %c, %B, %j: memref<?xvector<4xf32>, layout2>
    }
  }
  return
}
```

This mapping needs to be reversible because we want to be
able to go back and forth between the two and answer questions such as:
- Given a subset of the iteration space, what subset of data does it read and
write?
- Given a subset of data read or written, what subset of the iteration space
is responsible for this read or write?

Answering these two questions is one of the main analyses that Linalg uses to
implement transformations such as tiling, tiled producer-consumer fusion, and
promotion to temporary buffers in fast memory.

In the current implementation, `linalg.generic` uses a list of
[AffineMap](https://mlir.llvm.org/docs/LangRef/) attributes.
This is a pragmatic short-term solution, but in the longer term note that
this property could even be evaluated dynamically, similarly to
inspector-executor algorithms.

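For instance (a sketch following the pseudo-syntax of the examples above), the
mapping for a matrix multiplication C(i, j) += A(i, k) * B(k, j) consists of
one affine map per operand over a common (i, j, k) iteration space, and each
map can be inverted to answer the two questions above:

```
#map_A = (i, j, k) -> (i, k)   // subset of A read by iteration (i, j, k)
#map_B = (i, j, k) -> (k, j)   // subset of B read by iteration (i, j, k)
#map_C = (i, j, k) -> (i, j)   // subset of C written by iteration (i, j, k)
```
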
### Property 3: The Type Of Iterators is Defined Explicitly<a name="prop3"></a>
A `linalg.generic` op fully *declares* the type of its iterators. This
information is used in transformations.

These properties are derived from established practice in the field and mirror
the properties from Ken Kennedy's [Optimizing Compilers for Modern Architectures](
https://www.elsevier.com/books/optimizing-compilers-for-modern-architectures/allen/978-0-08-051324-9).
The key idea of legality of loop transformations expressed by Kennedy is
that ***the lexicographic order of all dependence vectors must be
preserved***.

This can be better captured directly at the loop level thanks to specific
iterator types, among which:
*parallel*, *reduction*, *partition*, *permutable/monotonic*, *sequential*,
*dependence distance*, ...

These types are traditionally the result of complex dependence analyses and
have been referred to as "*bands*" in the polyhedral community (e.g. *parallel
bands*, *permutable bands*, etc., in
[ISL](https://en.wikipedia.org/wiki/Integer_set_library) schedule tree
parlance).

Specifying the information declaratively in a `linalg.generic` allows
conveying properties that may be hard (or even impossible) to derive from
lower-level information. These properties can be brought all the way to the
moment when they are useful for transformations, used and then discarded.

Additionally, these properties may also be viewed as a contract that the
frontend/user guarantees and that the compiler may take advantage of. The
common example is the use of data-dependent reduction semantics for
specifying histogram computations. If the frontend has additional knowledge
that proper atomic operations are available, it may be better to specify
parallel semantics and use those atomic operations in the computation region.

At this time, Linalg only has an explicit use for *parallel* and *reduction*
loops but previous experience shows that the abstraction generalizes.

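As a sketch (the attribute spelling is an assumption and has varied across
versions), the matrix multiplication from the previous example would declare
two *parallel* iterators and one *reduction* iterator:

```
// For C(i, j) += A(i, k) * B(k, j): i and j iterate in parallel,
// k carries the reduction.
iterator_types = ["parallel", "parallel", "reduction"]
```
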
### Property 4: The Compute Payload is Specified With a Region<a name="prop4"></a>
A `linalg.generic` op has a compute payload that is fully generic thanks to
the use of
[Regions](https://github.com/llvm/llvm-project/blob/58265ad42a90ae8905be6a447cb42e53529a54a0/mlir/docs/LangRef.md#regions).

The region takes as arguments the scalar elemental types of the tensor or
buffer operands of the `linalg.generic`. For flexibility and ability to match
library calls, additional special values may be passed. For instance, a
`linalg.fill` operation takes a buffer and an additional scalar value.

At this time there are no additional restrictions to the region
semantics. This is meant to allow the exploration of various design tradeoffs
at the intersection of regions and iterator types.
In particular, the frontend is responsible for ensuring that the iterator
types correspond to the operations inside the region: the region can capture
buffers arbitrarily and write into them. If this conflicts with some parallel
iterator requirement, this is undefined behavior.

Concretely, consider the following, partially specified, `linalg.generic`
example:
```
#indexing_maps = {
  (i, j) -> (i, j),
  (i, j) -> (i, j),
  (i, j) -> (i, j)
}
#attrs = {args_in: 2, args_out: 1, indexings: #indexing_maps}
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  linalg.generic #attrs (%A, %B, %C) {
    ^bb0(%a: f32, %b: f32):
      %c = addf %a, %b : f32
      return %c : f32
  }: memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>
  return
}
```

The property "*The Compute Payload is Specified With a Region*" is
materialized by a lowering into a form that will resemble:
```
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  %M = dim %A, 0: index
  %N = dim %B, 1: index
  for %i = 0 to %M {
    for %j = 0 to %N {
      %a = load %A[%i, %j]: memref<?x?xf32>
      %b = load %B[%i, %j]: memref<?x?xf32>
      %c = addf %a, %b : f32
      store %c, %C[%i, %j]: memref<?x?xf32>
    }
  }
  return
}
```

In the process of lowering to loops and lower-level constructs, similar
requirements are encountered, as discussed in the [inlined call op
proposal](https://llvm.discourse.group/t/introduce-std-inlined-call-op-proposal/282/2).
We expect to be able to reuse the common lower-level infrastructure provided
it evolves to support both region arguments and captures.

### Property 5: May Map To an External Library Call<a name="prop5"></a>
A `linalg.generic` op may map to an external library call by specifying a
`SymbolAttr`. At this level of abstraction, the important glue is the ability
to perform transformations that preserve the structure necessary to ***call
the external library after different transformations have been applied***.

This involves considerations related to preservation of op semantics
and integration at the ABI level. Regardless of whether one wants to use
external library calls or a custom ISA, the problem for codegen is similar:
preservation of a fixed granularity.

Consider the following, partially specified, `linalg.generic`
example:
```
#fun_attr = "pointwise_add"
#indexing_maps = {
  (i, j) -> (i, j),
  (i, j) -> (i, j),
  (i, j) -> (i, j)
}
#attrs = {args_in: 2, args_out: 1, indexings: #indexing_maps, fun: #fun_attr}
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  linalg.generic #attrs (%A, %B, %C) {
    ^bb0(%a: f32, %b: f32):
      %c = addf %a, %b : f32
      return %c : f32
  }: memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>
  return
}
```

The property "*May Map To an External Library Call*" is
materialized by a lowering into a form that will resemble:

```
func @pointwise_add_sxsxf32_sxsxf32(memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()

func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  call @pointwise_add_sxsxf32_sxsxf32 (%A, %B, %C):
    (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()
  return
}
```

Which, after lowering to LLVM resembles:
```
func @pointwise_add_sxsxf32_sxsxf32(!llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">,
                                    !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">,
                                    !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">) -> ()

func @example(%A: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">,
              %B: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">,
              %C: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">) {
  llvm.call @pointwise_add_sxsxf32_sxsxf32 (%A, %B, %C):
    (!llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">...) -> ()
  return
}
```

#### Convention For External Library Interoperability
The `linalg` dialect adopts a convention that is similar to `BLAS` when
offloading operations to fast library implementations: pass a non-owning
pointer to input and output data with additional metadata. This convention
is also found in libraries such as `MKL`, `OpenBLAS`, `BLIS`, `cuBLAS`,
`cuDNN`, etc., and more generally at interface points across language
boundaries (e.g. C++ / Python).

Generally, `linalg` passes non-owning pointers to View data structures
to pre-compiled library calls linked externally.

There is an [ongoing
discussion](https://llvm.discourse.group/t/lowering-optional-attributes-in-linalg-structuredops-to-standard-dialect/333/3)
on the topic of extending interoperability in the presence of key attributes.

### Property 6: Perfectly Nested Writes To The Whole Output Operands<a name="prop6"></a>
Perfectly nested loops form a particularly important class of structure that
enables key loop transformations such as tiling and mapping to library calls.
Unfortunately, this type of structure is easily broken by transformations such
as partial loop fusion. Tiling and mapping to library calls become more
challenging, or even infeasible. Linalg ops adopt perfect-nestedness
as a first-class property: the structure cannot be broken and is
transported in the IR by construction.

A `linalg.generic` op represents a perfectly nested loop nest that writes the
entire memory region. This is a structural constraint across regions and
loops that has proven to be key in simplifying transformations.

One particular point to mention is that converting imperfectly nested code
into perfectly nested code can often be done with enough loop distribution
and embedding of conditionals down to the innermost loop level.

Previous experience with Tensor Comprehensions gave us the intuition that
forcing innermost control-flow nesting is a lot like writing data-parallel
code with arrays of boolean values and predication.
This type of trick has also been used before in polyhedral compilers to
convert non-affine control into affine compute dependencies.

While it may be possible to automate such rewrites from generic IR,
`linalg.generic` just forces the semantics for now.

The key implication is that this conversion to deep predication needs to be
undone once we are done with Linalg transformations.
After iterators and induction variables are materialized (i.e. after lowering
out of `linalg.generic` has occurred), the overall performance will be greatly
influenced by the quality of canonicalizations, foldings and *Loop Invariant
Code Motion* (LICM).

In the grander scheme, the reliance on late LICM was deemed a necessary risk.

### Putting it Together<a name="summary"></a>
As it stands, the six properties above define the semantics of a
`linalg.generic` op. It is an open question whether all of these semantics are
strictly necessary in practice and whether some should or could be derived
automatically while still maintaining the [core guiding
principles](#guiding_principles).

For the time being, we have settled on the combination of these properties
because of empirical evidence gathered while building and working on multiple
high-level compilers. As we lay those down and engage more with the community,
we expect multiple rounds of discussions and design changes to the original
architecture.

## Data Representation: Views<a name="views"></a>
The current implementation uses the [Strided MemRef (a.k.a View)](
https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/MaL8m2nXuio)
abstraction. The name *View* is used interchangeably in `linalg` to signify
*Strided MemRef*.
In the future we expect to use other structured data types and
support ragged, mixed-sparse and other types. We expect to draw on the
experience from existing LIFT abstractions for
[sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf)
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf).

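For illustration, such a view is a strided `memref` whose layout carries an
offset plus per-dimension sizes and strides (syntax of the time; the concrete
shape below is an illustrative assumption):

```
// 2-D view: dynamic sizes, dynamic offset, dynamic outer stride and a
// contiguous innermost dimension.
memref<?x?xf32, offset: ?, strides: [?, 1]>
```
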
## Metadata Ops<a name="metadata_ops"></a>
A set of ops that manipulate metadata but do not move memory. These ops take
`view` operands + extra attributes and return new `view`s. The returned
`view`s generally alias the operand `view`. At the moment the existing ops
are:

* `std.view`,
* `std.subview`,
* `linalg.range`,
* `linalg.slice`,
* `linalg.transpose`,
* `linalg.reshape`.

Future ops are added on a per-need basis but should include:

* `linalg.tile`,
* `linalg.intersection`,
* `linalg.convex_union`,
* `linalg.difference` (would need to work on a list of views).

These additional operations correspond to abstractions that have been known to
work in the field of large-scale distributed stencil computations.

In a longer-term future, the abstractions from the [Legion data-centric
programming model](https://legion.stanford.edu/overview/) seem generally
appealing.

## Named Payload-Carrying Ops<a name="named_ops"></a>
Additionally, `linalg` provides a small subset of commonly named operations:

* `linalg.copy`,
* `linalg.fill`,
* `linalg.dot`,
* `linalg.matmul`,
* `linalg.conv`.

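For instance (a sketch in the syntax of the time), a named op composes with
views like any payload-carrying op and needs no explicit region or indexing
attributes: its semantics are carried by its name:

```
// C += A * B on buffers; semantically equivalent to a linalg.generic
// with matmul indexing maps and iterator types.
linalg.matmul(%A, %B, %C) : memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>
```
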
These named operations adhere to the `linalg.generic` op interface. Work is in
progress to define declarative mechanisms to automatically generate named ops
from a description in terms of only the generic op interface.

This is the main reason there are only a small number of ops today: we expect
them to be auto-generated from Tablegen soon.

# Open Issues and Design Alternatives<a name="open_issues"></a>
Multiple open issues and design alternatives are in flight and it is time to
lay them out for the community to discuss and pick apart:
1. Should `linalg.generic` support nesting?
1. Should `linalg.generic` regions take views or only scalars?
1. Should we try to solve automatic differentiation at this level of
abstraction?
1. Are all the six properties really necessary?
1. Is this relying too much on declarative specification and would we be
better off relying more on analyses?
1. Is this general enough for the community's needs? If not how should this be
extended, if at all?
...

These key questions (and many more) should really be thought of in the general
context of MLIR in which different levels of IR interoperate seamlessly. In
practice, it is not necessary (or beneficial) to try and solve all problems in
the same IR.

# Linalg Dialect Rationale: The Case For Compiler-Friendly Custom Operations

[TOC]

# Introduction<a name="introduction"></a>

## Positioning

<img width="180" align="left" alt="MLIR Codegen Flow" src="https://user-images.githubusercontent.com/10148468/73613629-c5586580-45c5-11ea-94b7-074aeea94c7b.png">

This document describes the key design principles
that led to the existing implementation of Linalg and aims at exposing
the tradeoffs involved when building higher-level Intermediate
Representations (IR) and Dialects to facilitate code
generation. Consider the simplified schema describing codegen in MLIR.
Linalg is designed to solve the High-level Hierarchical Optimization
(HHO box) and to interoperate nicely within a
*Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).
This work is inspired by a wealth of [prior art](#prior_art) in
the field, from which it seeks to learn key lessons. This documentation
and introspection effort also comes in the context of the proposal for a
working group for discussing the [Development of high-level Tensor Compute
Primitives dialect(s) and
transformations](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/3).
We hope that the lessons from prior art, the design principles outlined in
this doc and the architecture of Linalg can help inform the community on a
path to defining these High-Level Tensor Compute Primitives.

## Inception

Linalg started as a pragmatic dialect to bootstrap code generation in MLIR, by
*defining away* complex code generation problems like precise dependence
analysis or polyhedral code generation and by introducing the ability to call
into fast library implementations when available. Linalg **defines ops and
transformations declaratively** and was originally restricted to ops with
*linear-algebra like* semantics (`pointwise`, `matmul`, `conv`...). This
approach enables building a high-level productivity-first codegen solution that
leverages *both* compiler optimizations *and* efficient library implementations
so as not to miss out on simple performance benefits. For example, if
one's favorite HPC library or ISA has a `matmul` primitive running at 95% of
the achievable peak performance, for operands stored in some memory, one should
be able to **use the primitive** when possible *and* generate code otherwise.

However, as the design of Linalg co-evolved with the design of MLIR, it became
apparent that it could extend to larger application domains than just machine
learning on dense tensors.

The design and evolution of Linalg follow a *codegen-friendly* approach where
the IR and the transformations evolve hand-in-hand.
The key idea is that op semantics *declare* and transport information that is
traditionally obtained by compiler analyses.
This information captures the legality and applicability of transformations and
is **not lost by lowering prematurely to loop or CFG form**. The key
transformations are designed so as to **preserve this information** as long as
necessary. For example, `linalg.matmul` remains `linalg.matmul` after tiling
and fusion.

Furthermore, Linalg decouples transformation validity from profitability
considerations and voluntarily leaves the latter aside in the first iteration
(see the [suitability for search](#ml) guiding principle).

The first incarnation of these ideas was presented as an example at the
EuroLLVM 2019 developer's meeting as part of the
[Linalg section](https://llvm.org/devmtg/2019-04/slides/Tutorial-AminiVasilacheZinenko-MLIR.pdf)
of the first [MLIR Tutorial](https://www.youtube.com/watch?v=cyICUIZ56wQ).

## Evolution
Since the initial implementation, the design has evolved with, and partially
driven, the evolution of the core MLIR infrastructure to use
[Regions](https://mlir.llvm.org/docs/LangRef/#regions),
[OpInterfaces](https://mlir.llvm.org/docs/Interfaces/),
[ODS](https://mlir.llvm.org/docs/OpDefinitions/) and
[Declarative Rewrite Rules](https://mlir.llvm.org/docs/DeclarativeRewrites/)
among others. The approach adopted by Linalg was extended to become
[StructuredOps abstractions](
https://drive.google.com/drive/u/0/folders/1sRAsgsd8Bvpm_IxREmZf2agsGU2KvrK-),
with Linalg becoming its incarnation on tensors and buffers.
It is complemented by the
[Vector dialect](https://mlir.llvm.org/docs/Dialects/Vector/),
which defines structured operations on vectors, following the same rationale
and design principles as Linalg. (The Vector dialect includes the higher-level
operations on multi-dimensional vectors and abstracts away the lowering to
single-dimensional vectors).

The Linalg dialect itself grew beyond linear algebra-like operations to become
more expressive, in particular by providing an abstraction of a loop nest
supporting parallelism, reductions and sliding windows around arbitrary MLIR
[regions](https://mlir.llvm.org/docs/LangRef/#regions). It also has the
potential of growing beyond *dense* linear-algebra to support richer data
types, such as sparse and ragged tensors and buffers.

The Linalg design remains open to evolution and cross-pollination with other
dialects and approaches. It has been successfully used as the staging ground
for code generation-related abstractions, spinning off the generalization of
the following:
- the `!linalg.view` type folded into the *"Strided MemRef"* type while
preserving structure to allow calling into external C++ libraries with
unsurprising ABI conventions;
- the `linalg.view` and `linalg.subview` ops evolved into the standard dialect;
- the `linalg.for`, `linalg.load` and `linalg.store` ops evolved into a prelude
to the *structured control flow* dialect (named `LoopOps`).

More components can be extracted, redesigned and generalized when new uses or
requirements arise.

Several [design questions](#open_issues) remain open in Linalg, which does not
claim to be a general solution to all compilation problems.
It does aim at driving thinking and implementations of domain-specific
abstractions where programmer's intent can be captured at a very high level,
directly in the IR.

Given the evolution of the scope, it becomes apparent that a better name than
"Linalg" could remove some of the confusion related to the dialect (and the
underlying approach), its goals and limitations.

# Prior Art<a name="prior_art"></a>
Linalg draws inspiration from decades of prior art to design a modern,
pragmatic solution. The following non-exhaustive list refers to some of the
projects that influenced Linalg design:

- [ONNX](https://onnx.ai/),
- [LIFT](https://www.lift-project.org/),
- [XLA](https://www.tensorflow.org/xla/architecture),
- [Halide](https://halide-lang.org/) and [TVM](https://tvm.apache.org/),
- [TACO](http://tensor-compiler.org/),
- [Darkroom](http://darkroom-lang.org/) and [Terra](http://terralang.org/),
- [Sigma-LL](http://spiral.ece.cmu.edu:8080/pub-spiral/pubfile/cgo16-preprint_248.pdf),
- [Tensor Comprehensions](https://arxiv.org/abs/1802.04730),
- [Polyhedral Compilers](https://en.wikipedia.org/wiki/Polytope_model),
- the [Affine dialect](https://mlir.llvm.org/docs/Dialects/Affine/) in MLIR,
- Generic Loop Transformations (see Ken Kennedy's
[Optimizing Compilers for Modern Architectures](
https://www.elsevier.com/books/optimizing-compilers-for-modern-architectures/allen/978-0-08-051324-9)),
- Traditional compiler CFGs with SSA forms.

Additionally, experience with the following tools proved very valuable when
thinking holistically about how all these components interplay all the way
up to the user and down to the hardware:

- the [Torch](http://torch.ch/) machine-learning framework,
- the LLVM compiler, specifically in JIT mode,
- high-performance libraries (MKL, CUBLAS, FBFFT),
- the [PeachPy](https://www.cs.utexas.edu/users/flame/BLISRetreat/BLISRetreatTalks/PeachPy.pdf) assembler,
- current and potentially upcoming hardware ISAs.

The novelty of MLIR's code base and its unprecedented support for defining and
mixing abstractions enable one to reflect on and integrate the key elements
of the prior art's successes as well as avoid the common pitfalls in the area
of code generation. Thus, instead of diverging into a discussion about the
implications of adopting any of the existing solutions, Linalg had the
possibility to build on all of them and learn from their experience while
leveraging the benefit of hindsight.

The following reflections on prior art have influenced the design of Linalg.
The discussion is by no means exhaustive but should capture the key motivations
behind Linalg.

## Lessons from ONNX<a name="lessonsonnx"></a>
ONNX is a specification of operations that appear in Machine Learning
workloads. As such, it is predominantly driven by the expressiveness
requirements of ML, and less by the considerations of IR design for HPC code
generation.

Similarly to ONNX, Linalg defines *"semantically charged" named ops*.
But it also considers *transformations on these ops* as a key component and
defines the IR to support the transformations, preferring transformations over
expressiveness if necessary.

Linalg hopes to additionally address the following:
- facilitate frontend-compiler co-design by taking into account compiler
transformations and lowerings in op definition;
- minimize the set of available ops by making them non-overlapping with each
other, thus simplifying the intermediate representation.

## Lessons from LIFT<a name="lessonslift"></a>
[LIFT](https://www.lift-project.org/) is a system to write computational
kernels based on functional abstractions. Transformations are
represented by additional nodes in the IR, whose semantics are at the
level of the algorithm (e.g. `partialReduce`).
LIFT applies and composes transformations by using [local rewrite
rules](https://www.lift-project.org/presentations/2015/ICFP-2015.pdf) that
embed these additional nodes directly in the functional abstraction.

Similarly to LIFT, Linalg uses local rewrite rules implemented with the MLIR
[Declarative Rewrite Rules](https://mlir.llvm.org/docs/DeclarativeRewrites/)
mechanisms.

Linalg builds on, and helps separate concerns in the LIFT approach as follows:
- transformations are either separated from the representation or expressed as
composable attributes that are independent of the actual computation,
avoiding intricate effects on performance;
- abstractions are split into smaller components (e.g., control flow and data
structure abstractions) potentially reusable across different dialects in
MLIR's open ecosystem.

LIFT is expected to further influence the design of Linalg as it evolves. In
particular, extending the data structure abstractions to support non-dense
tensors can use the experience of LIFT abstractions for
[sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf)
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf).

## Lessons from XLA<a name="lessonsxla"></a>
[XLA](https://www.tensorflow.org/xla/architecture) is one of the first
post-Theano ML compilers and was introduced as a pragmatic compilation
solution for TensorFlow. It shines on Google's xPU
hardware and is an important piece of the puzzle. It is particularly good at
(1) transforming code back and forth between the scalar and the vector
worlds, (2) passing function boundaries for handling both host and device
code, and (3) complying with stringent requirements imposed by energy-efficient
xPUs.
XLA followed a pragmatic design process where the compiler is given perfect
knowledge of each op's semantics, all starting from the mighty `conv` and
`matmul` ops. XLA transformations consist of writing emitters that compose as
C++ functions. Perfect op semantics knowledge has two big benefits: (1)
transformations are correct by construction and (2) very strong performance on
difficult xPU targets.

Similarly, Linalg ops *"know their semantics"* and *"know how to transform and
lower themselves"*. The means by which this information is made available and
how it is used in MLIR are, however, very different.

Linalg hopes to additionally address the following:
- HLOs are expressive as a whole, but each op has very limited and fixed
semantics: ops are not configurable. As a consequence, HLOs have evolved into
a too large set of ops whose semantics intersect.
This echoes the ops proliferation problem also exhibited by ONNX.
- Reliance on perfect op knowledge leads to situations where transformations
and ops end up needing to know about each other's semantics (e.g. during
fusion). Since the transformations themselves are not simple local rewrite
patterns (unlike LIFT), code complexity grows quickly.
- XLA lacks an independent IR that can be inspected, unit tested and used
independently. This monolithic design makes the system not portable: xPU passes
and GPU passes do not share much code.

## Lessons from Halide and TVM<a name="lessonshalide"></a>
[Halide](https://halide-lang.org/) is a DSL embedded in C++ that provides a
way of metaprogramming the HalideIR and applying transformations declaratively
to let the expert user transform and optimize the program in tailored ways.
Halide initially targeted the SIGGRAPH community but is now more generally
applicable. [TVM](https://tvm.apache.org/) is an evolution of Halide into the
machine learning and deep-neural network space, based on HalideIR.

The Halide transformation methodology follows similar principles to the
[URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf)
and
[CHiLL](https://pdfs.semanticscholar.org/6a46/20589f63f3385707d2d590f7b7dc8ee4d74f.pdf)
compiler transformation frameworks, but without the strengths (and especially
complexity) of the polyhedral model.

Halide particularly shines at making the HPC transformation methodology
accessible to $\Omega$(10-100) users, at a time when polyhedral tools are
still only accessible to $\Omega$(1-10) users. Halide makes heavy usage of
canonicalization rules that are also very prevalent in MLIR.

Linalg hopes to additionally address the following:
- Halide scheduling is powerful and explores a large swath of possible
transformations. But it's still too hard for newcomers to use or extend. The
level of performance you get from Halide is very different depending on
whether one is a seasoned veteran or a newcomer. This is especially true as
the number of transformations grows.
- Halide raises rather than lowers in two ways, going counter-current to the
design goals we set for high-level codegen abstractions in MLIR. First,
canonical Halide front-end code uses explicit indexing and math on scalar
values, so to target BLAS/DNN libraries one needs to add pattern matching
which is similarly brittle as in the affine case. While Halide's performance
is on par with the libraries on programmable targets (CPU/GPU), that
approach doesn't work on mobile accelerators or on xPUs, where the framework
ingests whole-tensor operations.
Second, reductions and scans are expressed using serial iteration, again
requiring pattern matching before they can be transformed (e.g. to do a
reduction using atomics, or hierarchically). The lesson to draw is that we
should start with higher-level primitives than Halide.

## Lessons from Tensor Comprehensions<a name="lessonstc"></a>
[Tensor Comprehensions](https://arxiv.org/abs/1802.04730) is a
high-level language to express tensor computations with a syntax
generalizing the Einstein notation, coupled to an end-to-end
compilation flow capable of lowering to efficient GPU code. It was
integrated with two ML frameworks: Caffe2 and PyTorch.

<img width="600" alt="MLIR Codegen Flow"
src="https://user-images.githubusercontent.com/10148468/73613272-df904480-45c1-11ea-88f9-214dee7464cf.png">

The compilation flow combines [Halide](#lessonshalide) and a Polyhedral Compiler
derived from [ISL](https://en.wikipedia.org/wiki/Integer_set_library)
and uses both HalideIR and the ISL *schedule-tree* IR.
The compiler provides a collection of polyhedral compilation
algorithms to perform fusion and favor multi-level parallelism and
promotion to deeper levels of the memory hierarchy.
Tensor Comprehensions showed that fixing a few predefined strategies
with parametric transformations and tuning knobs can already provide
great results. In that previous work, simple
genetic search combined with an autotuning framework was sufficient
to find good implementations in the ***non-compute bound regime***.
This requires code versions obtainable by the
various transformations to encompass versions that get close to the
roofline limit.
The ultimate goal of Tensor Comprehensions was to concretely mix
Halide high-level transformations with polyhedral mid-level
transformations and build a pragmatic system that could take advantage
of both styles of compilation.

Linalg hopes to additionally address the following:
- Halide was never properly used in Tensor Comprehensions beyond shape
inference. Most of the investment went into simplifying polyhedral
transformations and building a usable end-to-end system. MLIR was
deemed a better infrastructure to mix these types of compilation.
- The early gains provided by reusing established infrastructures
(HalideIR and ISL schedule trees) turned into more impedance mismatch
problems than could be solved with a small tactical investment.
- Tensor Comprehensions emitted CUDA code which was then JIT compiled
with NVCC from a textual representation. While this was a pragmatic
short-term solution, it made it hard to perform low-level rewrites that
would have helped with register reuse in the ***compute-bound regime***.
- The same reliance on emitting CUDA code made it difficult to
create cost models when the time came. This made it artificially harder to
prune out bad solutions than necessary. This resulted in excessive
runtime evaluation, as reported in the paper [Machine Learning Systems
are Stuck in a Rut](https://dl.acm.org/doi/10.1145/3317550.3321441).

Many of those issues are naturally addressed by implementing these ideas
in the MLIR infrastructure.

## Lessons from Polyhedral compilers<a name="lessonspolyhedral"></a>
The polyhedral model has been on the cutting edge of loop-level optimization for
decades, with several incarnations in production compilers such as
[GRAPHITE](https://gcc.gnu.org/wiki/Graphite) for GCC and
[Polly](https://polly.llvm.org) for LLVM. Although it has proved crucial to
generate efficient code from domain-specific languages such as
[PolyMage](http://mcl.csa.iisc.ac.in/polymage.html) and [Tensor
Comprehensions](https://dl.acm.org/doi/abs/10.1145/3355606), it has never been
fully included into mainstream general-purpose optimization pipelines. Detailed
analysis of the role of polyhedral transformations is provided in the
[simplified polyhedral
form](https://mlir.llvm.org/docs/RationaleSimplifiedPolyhedralForm/) document
dating back to the inception of MLIR.

In particular, polyhedral abstractions have proved challenging to integrate
with a more conventional compiler for the following reasons:
- The transformed code (or IR) quickly gets complex and thus hard to analyze
and understand.
- Code generation from the mathematical form used in the polyhedral model
relies on non-trivial exponentially complex algorithms.
- The mathematical form is rarely composable with the SSA representation and
related algorithms, on which most mainstream compilers are built today.
- Expressiveness limitations, although addressed in the scientific literature
through, e.g., summary functions, often remain present in actual
implementations.

The Affine dialect in MLIR was specifically designed to address the integration
problems mentioned above. In particular, it maintains the IR in the same form
(loops with additional constraints on how the bounds are expressed) throughout
the transformation, decreasing the need for one-shot conversion between
drastically different representations. It also embeds the polyhedral
representation into the SSA form by using MLIR regions and thus allows one to
combine polyhedral and SSA-based transformations.

## Lessons from the Affine dialect<a name="lessonsaffine"></a>
The Affine dialect in MLIR brings the polyhedral abstraction closer to the
conventional SSA representation. It addresses several long-standing integration
challenges as described above and is likely to be more suitable when compiling
from a C language-level abstraction.

MLIR makes it possible to start from a higher-level abstraction than C, for
example in machine learning workloads. In such cases, it may be possible to
avoid complex analyses (data-flow analysis across loop iterations is
exponentially complex) required for polyhedral transformations by leveraging
the information available at higher levels of abstraction, similarly to DSL
compilers. Linalg intends to use this information when available and ensure
*legality of transformations by construction*, by integrating legality
preconditions in the op semantics (for example, loop tiling can be applied to
the loop nest computing a matrix multiplication, with no need to additionally
rely on affine dependence analysis to check this). This information is not
readily available in the Affine dialect, and can only be derived using
potentially expensive pattern-matching algorithms.

Informed by the practical experience in polyhedral compilation and with the
Affine dialects in particular, Linalg takes the following decisions.
- **Discourage loop skewing**: the loop skewing transformation, that is
sometimes used to enable parallelization, often has surprising (negative)
effects on performance. In particular, polyhedral auto-transformation can be
expressed in a simpler way without loop skewing; skewing often leads to
complex control flow hampering performance on accelerators such as GPUs.
Moreover, the problems loop skewing addresses can be better addressed by other
approaches, e.g., diamond tiling. In the more restricted case of ML workloads,
multi-for loops with induction variables independent of each other (referred
to as hyper-rectangular iteration domains in the literature) such as the
proposed
[affine.parallel](https://llvm.discourse.group/t/rfc-add-affine-parallel/350)
are sufficient in the majority of cases.
- **Declarative Tiling**: the *tiling* transformation is ubiquitous in HPC code
generation. It can be seen as a decomposition of either the iteration space or
the data space into smaller regular parts, referred to as tiles. Polyhedral
approaches, including the Affine dialect, mostly opt for iteration space
tiling, which introduces additional control flow and complex address
expressions. If the tile sizes are not known during the transformation
(so-called parametric tiling), the address expressions and conditions quickly
become non-affine or require exponentially complex algorithms to reason about
them. Linalg focuses tiling on the data space instead, creating views into the
buffers that leverage MLIR's strided `memref` abstraction. These views compose
and the complexity of access expressions remains predictable, as sketched
after this list.
- **Preserve high-level information**: Linalg maintains the information
provided by the op semantics as long as necessary for transformations. For
example, the result of tiling a matrix multiplication is loops around a smaller
matrix multiplication. Even with pattern-matching on top of the Affine dialect,
this would have required another step of pattern-matching after the
transformation.

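The following hedged sketch illustrates the data-space view mentioned in the
tiling decision above (the op and layout syntax approximate the `std.subview`
and strided `memref` forms of the time; names and tile sizes are illustrative):
a tile is just another strided view, so access expressions stay affine and
further subviews compose.

```
// Select a %c4 x %c4 tile of %A rooted at (%i, %j). The result is itself
// a strided memref: dynamic offset, dynamic outer stride, contiguous rows.
%tile = subview %A[%i, %j][%c4, %c4][%c1, %c1]
  : memref<?x?xf32> to memref<?x?xf32, offset: ?, strides: [?, 1]>
```
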
Given these choices, Linalg intends to be a better fit for **high-level
compilation** where significantly more information is readily available in the
input representation and should be leveraged before lowering to other
abstractions. Affine remains a strong abstraction for mid-level transformation
and is used as a lowering target for Linalg, enabling further transformations
and combination of semantically-loaded and lower-level inputs. As such, Linalg
is intended to complement Affine rather than replace it.

# Core Guiding Principles<a name="guiding_principles"></a>

## Transformations and Simplicity First<a name="transformations_first"></a>
The purpose of the Linalg IR and its operations is primarily to:
- develop a set of key transformations;
- make them correct by construction by carefully curating the set of
generic operation properties that drive applicability; and
- make them very simple to implement, apply, verify and especially
maintain.

The problem at hand is fundamentally driven by compilation of domain-specific
workloads for high-performance and parallel hardware architectures: **this is
an HPC compilation problem**.

The selection of relevant transformations follows a codesign approach and
involves considerations related to:
- concrete current and future needs of the application domain,
- concrete current and future hardware properties and ISAs,
- understanding of strengths and limitations of [existing approaches](#prior_art),
- taking advantage of the coexistence of multiple levels of IR in MLIR.

One needs to be methodical to avoid proliferation and redundancy. A given
transformation could exist at multiple levels of abstraction but **just
because one can write transformation X at level Y absolutely does not mean
one should**. This is where evaluation of existing
systems and acknowledgement of their strengths and weaknesses is crucial:
simplicity and maintainability aspects must be first-order concerns. Without
this additional effort of introspection, a design will not stand the test of
time. At the same time, complexity is very hard to ward off. It seems one needs
to suffer complexity to be prompted to take a step back and rethink
abstractions.

This is not merely a reimplementation of idea X in system Y: simplicity
**must be the outcome** of this introspection effort.

## Preservation of Information<a name="information_preservation"></a>
The last two decades have seen a proliferation of Domain-Specific Languages
(DSLs) that have been very successful at limited application domains.
The main commonality between these systems is their use of significantly
richer structural information than CFGs or loops.
Still, another commonality of existing systems is to lower to LLVM very
quickly, and cross a wide abstraction gap in a single step. This process often
drops semantic information that later needs to be reconstructed, when it is
not irremediably lost.

These remarks, coupled with MLIR's suitability for defining IR at multiple
levels of abstraction, led to the following two principles.

### Declarative Specification: Avoid Raising<a name="declarative_specification"></a>

Compiler transformations need static structural information (e.g. loop-nests,
graphs of basic blocks, pure functions, etc.). When that structural information
is lost, it needs to be reconstructed.

A good illustration of this phenomenon is the notion of *raising* in polyhedral
compilers: multiple polyhedral tools start by raising from a simplified C
form or from SSA IR into a higher-level representation that is more amenable
to loop transformations.

In advanced polyhedral compilers, a second type of raising
may typically exist to detect particular patterns (often variations of
BLAS). Such patterns may be broken by transformations making their detection
very fragile or even just impossible (incorrect).

MLIR makes it easy to define op semantics declaratively thanks to the use of
regions and attributes. This is an ideal opportunity to define new abstractions
to convey user-intent directly into the proper abstraction.

### Progressive Lowering: Don't Lose Information too Quickly<a name="progressive_lowering"></a>

Lowering too quickly to affine, generic loops or CFG form reduces the
amount of structure available to derive transformations from. While
manipulating loops is a net gain compared to CFG form for a certain class of
transformations, important information is still lost (e.g. parallel loops, or
mapping of a loop nest to an external implementation).

This creates non-trivial phase ordering issues. For instance, loop fusion may
easily destroy the ability to detect a BLAS pattern. One possible alternative
is to perform loop fusion, tiling, intra-tile loop distribution and then hope
to detect the BLAS pattern. Such a scheme presents difficult phase-ordering
constraints that will likely interfere with other decisions and passes.
Instead, certain Linalg ops are designed to maintain high-level information
across transformations such as tiling and fusion.

MLIR is designed as an infrastructure for ***progressive lowering***.
|
||||
Linalg fully embraces this notion and thinks of codegen in terms of
|
||||
*reducing a potential function*. That potential function is loosely
|
||||
defined in terms of number of low-level instructions in a particular
|
||||
Linalg ops (i.e. how heavy or lightweight the Linalg op is).
|
||||
Linalg-based codegen and transformations start from higher-level IR
|
||||
ops and dialects. Then each transformation application reduces the
|
||||
potential by introducing lower-level IR ops and *smaller* Linalg ops.
|
||||
This gradually reduces the potential, all the way to Loops + VectorOps
|
||||
and LLVMIR.
|
||||
|
||||
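
To make the potential-function view concrete, the following hedged sketch
(schematic MLIR of this era; the `loop.for` bounds, the tile sizes `%ts_*`,
the `#strided` layout alias and the `subview` operand details are
illustrative assumptions, not verbatim IR) shows how tiling introduces
lower-level loop ops around a *smaller* Linalg op:

```mlir
// Before tiling: one "heavy" linalg.matmul on full views.
linalg.matmul(%A, %B, %C) :
  memref<?x?xf32, #strided>, memref<?x?xf32, #strided>,
  memref<?x?xf32, #strided>

// After tiling: explicit loops appear and the remaining op is "smaller",
// so the potential has strictly decreased.
loop.for %i = %c0 to %M step %ts_i {
  loop.for %j = %c0 to %N step %ts_j {
    loop.for %k = %c0 to %K step %ts_k {
      // Tiles are metadata-only views that alias the original buffers;
      // subview operands and result types are elided for brevity.
      %sA = subview %A ...
      %sB = subview %B ...
      %sC = subview %C ...
      linalg.matmul(%sA, %sB, %sC) :
        memref<?x?xf32, #strided>, memref<?x?xf32, #strided>,
        memref<?x?xf32, #strided>
    }
  }
}
```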

## Composable and Declarative Transformations<a name="declarative_transformations"></a>
Complex and impactful transformations need not be hard to manipulate, write or
maintain. Mixing XLA-style high-level op semantics knowledge with generic
properties to describe these semantics, directly in MLIR, is a promising way to:
- Design transformations that are correct by construction, easy to
write, easy to verify and easy to maintain.
- Provide a way to specify transformations and the units of IR they manipulate
declaratively. In turn, this allows using local pattern rewrite rules in MLIR
(i.e. [DRR](https://mlir.llvm.org/docs/DeclarativeRewrites/)).
- Allow creating customizable passes declaratively by simply selecting rewrite
rules. This allows mixing transformations, canonicalizations, constant folding
and other enabling rewrites in a single pass. The result is a system where pass
fusion is very simple to obtain and gives hope of solving certain
[phase ordering issues](https://dl.acm.org/doi/10.1145/201059.201061).

## Suitability for Search and Machine Learning<a name="ml"></a>
Compiler heuristics are hand-crafted, human-engineered features: they are
ripe for disruption by machine-learning techniques.
To enable search, compiler transformations should be fine-grained,
[composable](#declarative_transformations) and expose tuning parameters that
can modify their behavior, guided by lessons from previous experience
with [Tensor Comprehensions](#lessonstc).

Of course, we are not advocating for using ML everywhere in the stack
immediately: low-level compilation and machine models are still quite
performant in LLVM. However, for the high-level and mid-level optimization
problems, models need to be conditioned (probabilistically) on the low-level
compiler, which acts as a black box. For these reasons we prioritize the
design of IR and transformations with search-friendly properties over
building cost models.
Still, this does not mean Linalg refuses cost models: instead we
prefer to invest in infrastructure that will enable [ML-based
techniques to automatically build cost
models](http://homepages.inf.ed.ac.uk/hleather/publications/2009_autofeatures_cgo.pdf).

## Extensibility and Future-Proofness<a name="future"></a>
MLIR allows defining IR for structured control flow and structured
data types. We choose to take advantage of these properties for the
reasons described above.
In particular, the `MemRefType` represents dense, possibly non-contiguous
memory regions.
This structure should extend beyond simple dense data types and generalize to
ragged, sparse and mixed dense/sparse tensors as well as to trees, hash
tables, tables of records and maybe even graphs.

For such more advanced data types, the control-flow required to traverse the
data structures, the termination conditions, etc. are much harder to analyze
and characterize statically. As a consequence, we need to also design
solutions that stand a chance of evolving into runtime-adaptive computations
(e.g. inspector-executor, in which an *inspector* runs a cheap runtime
analysis on the data to configure how the *executor* should run).
While there is no concrete solution today to solve these problems in MLIR, it
is pretty clear that perfect static knowledge and analyses will not be
serious contenders for these problems.

# Key Observations<a name="keyobservation"></a>
The following key observations have influenced the design of Linalg and helped
reconcile [core guiding principles](#guiding_principles) with real-world
requirements when producing an implementation based on MLIR.

## Algorithms + Data Structures = Programs<a name="data_and_compute"></a>
This is a twist on Niklaus Wirth's formulation, but it captures the essence of
the design of Linalg: control-flow does not exist in a vacuum, independently
of data.
On the contrary, there is a very strong relationship between control-flow and
data structures: one cannot exist without the other. This has multiple
implications for the [semantics of Linalg Ops](#linalg_ops) and their
transformations. In particular, this observation influences whether
certain transformations are better done:
- as control flow or data structure manipulation,
- on Linalg ops' attributes or on loops after some partial lowering
occurred,
- as extensions to the Linalg dialect in terms of new ops or attributes.

## The Dialect Need not be Closed Under Transformations<a name="dialect_not_closed"></a>
This is probably the most surprising and counter-intuitive
observation. When one designs IR for transformations, closedness is
often a non-negotiable property.
This is a key design principle of polyhedral IRs such as
[URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf)
and
[ISL-based IRs](https://en.wikipedia.org/wiki/Integer_set_library):
they are closed under affine transformations.
In MLIR, multiple dialects coexist and form a coherent whole. After
experimenting with different alternatives, it became clear that strict
dialect closedness wasn't necessary and could be relaxed. Previous
systems did not have simple and principled means of building new IR
and probably suffered from this limitation. We conjecture this is a
key reason they required the IR to be closed under transformations.

Despite the fact that Linalg ops only allow perfectly nested
semantics, once tiling and fusion kick in, imperfectly nested loops
are gradually introduced.
In other words, imperfectly nested control flow appears as ***the result of
applying key transformations***.

Considering the *potential* described during the discussion on
[Progressive Lowering](#progressive_lowering), closedness under
transformation would dictate that the potential remains constant.
In contrast, Linalg advocates for ***monotonicity*** under
transformations: each transformation should strictly decrease the potential.

## Summary of Existing Alternatives in a Picture<a name="observationssummary"></a>
Lastly, we summarize our observations of lessons from [Prior
Art](#prior_art)---when viewed under the lens of our [Core Guiding
Principles](#guiding_principles)---with the following picture.

<img width="1200" alt="MLIR Codegen Flow"
src="https://user-images.githubusercontent.com/10148468/73613904-2f720a00-45c8-11ea-8265-1c856c02525b.png">

This figure is not meant to be perfectly accurate but a rough map of
how we view the distribution of structural information in existing
systems, from a codegen-friendly angle. Unsurprisingly, the
[Linalg Dialect](https://mlir.llvm.org/docs/Dialects/Linalg) and its
future evolutions aspire to a position in the top-right of this map.

@ -24,85 +24,11 @@ def Linalg_Dialect : Dialect {
can lower to scalar load/store and other operations or to more general
library calls.

The `linalg` dialect manipulates the following types and operations:

### Core data types and special ops

The following abstractions are used by the `linalg` dialect:

#### Views
The current implementation uses the strided memref abstraction. In the
future, abstractions other than strided memref will be used.
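
For illustration, a 2-D strided memref type of the kind Linalg views are
built on might look as follows (a sketch; the authoritative syntax is defined
in the strided memref RFC linked below, not here):

```mlir
// Dynamic sizes, dynamic offset, dynamic outer stride, and a contiguous
// innermost dimension (unit stride).
memref<?x?xf32, offset: ?, strides: [?, 1]>
```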

#### `!linalg.range`
This data type is currently just a triple (`min`, `max`, `step`) whose
values may not cross function boundaries.
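
A sketch of creating a range from three `index` values (schematic syntax):

```mlir
// %min, %max and %step are index values defined earlier.
%r = linalg.range %min : %max : %step : !linalg.range
```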

#### `linalg.yield`
This op is used as a terminator within the appropriate `linalg` regions.
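
As a sketch (schematic `linalg.generic` syntax; the `#pointwise_trait`
attribute carrying the indexing maps and iterator types is an assumed
definition, elided here), `linalg.yield` terminates the region and specifies
the value stored into the output view for the current iteration:

```mlir
// Elementwise addition: C[i] = A[i] + B[i].
linalg.generic #pointwise_trait %A, %B, %C {
^bb0(%a: f32, %b: f32, %c: f32):
  %sum = addf %a, %b : f32
  // Terminator: yields the value written to the output view.
  linalg.yield %sum : f32
} : memref<?xf32>, memref<?xf32>, memref<?xf32>
```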

In the future, richer `view` and `range` representations are expected, in
particular to represent sparse traversals.

### Metadata Ops
A set of ops that manipulate metadata but do not move memory. These ops take
`view` operands plus extra attributes and return new `view`s that generally
alias the operand `view`. At the moment the existing ops are listed below
(see the usage sketch after the list):

* `std.view`,
* `std.subview`,
* `linalg.range`,
* `linalg.slice`,
* `linalg.transpose`.
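
As a sketch of the aliasing behavior (schematic syntax; the `#strided2D`
layout alias is an assumed definition), `linalg.slice` carves a sub-view out
of an existing view without moving any data:

```mlir
// Given ranges %r0 and %r1, the result %sub aliases %A: no data is copied,
// only the (offset, sizes, strides) metadata changes.
%sub = linalg.slice %A[%r0, %r1]
  : memref<?x?xf32, #strided2D>, !linalg.range, !linalg.range,
    memref<?x?xf32, #strided2D>
```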

Future ops are added on a per-need basis but should include:

* `linalg.reshape`,
* `linalg.tile`,
* `linalg.intersection`,
* `linalg.convex_union`,
* `linalg.difference` (would need to work on a list of views).

### Payload Ops
A set of payload-carrying operations that implement the [structured ops](
https://docs.google.com/presentation/d/1P-j1GrH6Q5gLBjao0afQ-GfvcAeF-QU4GXXeSy0eJ9I/edit#slide=id.p
) abstraction on tensors and buffers. `linalg` has two generic operations,
`linalg.generic` and `linalg.indexed_generic`, for expressing custom
operations.
This is subject to further evolution as transformations and analyses
continue to be developed.

Additionally, `linalg` provides some commonly used named operations:

* `linalg.copy`,
* `linalg.fill`,
* `linalg.dot`,
* `linalg.matmul`,
* `linalg.conv`.
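
For instance, a named op invocation is terse and carries no region (a hedged
sketch using schematic syntax; the `#strided2D` layout alias is an assumed
definition):

```mlir
// Fill the 2-D view %C with the f32 scalar %zero.
linalg.fill(%C, %zero) : memref<?x?xf32, #strided2D>, f32
```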

Future ops are added on a per-need basis but should include:

* `linalg.pad`.

In an ideal world, all the named ops would be automatically generated from
a description in terms of only the two generic ops. Unfortunately we do not
have such support yet (contributions are most welcome).

### Convention for external library interop
The `linalg` dialect adopts a convention that is similar to `BLAS` when
offloading operations to fast library implementations: pass a non-owning
pointer to input and output data with additional metadata. This convention
is also found in libraries such as `MKL`, `OpenBLAS`, `BLIS`, `cuBLAS`,
`cuDNN`, etc., and more generally at interface points across language
boundaries (e.g. C++ / Python).

Generally, `linalg` passes non-owning pointers to strided memref data
structures to precompiled library calls linked externally. The name `view`
is used interchangeably in `linalg` to signify the strided memref abstraction
discussed at length in the [strided memref RFC](
https://groups.google.com/a/tensorflow.org/g/mlir/c/MaL8m2nXuio/m/a_v07o9yBwAJ).
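
As a hedged sketch of this convention (the callee symbol name and signature
below are hypothetical; the actual name mangling is implementation-defined),
a named op lowers to a call that passes non-owning strided memref
descriptors:

```mlir
// Declaration of an external, precompiled matmul implementation;
// #strided2D is an assumed layout alias.
func @external_matmul(memref<?x?xf32, #strided2D>,
                      memref<?x?xf32, #strided2D>,
                      memref<?x?xf32, #strided2D>)

// The lowered linalg.matmul becomes a plain call; ownership of the
// underlying buffers stays with the caller.
call @external_matmul(%A, %B, %C)
  : (memref<?x?xf32, #strided2D>, memref<?x?xf32, #strided2D>,
     memref<?x?xf32, #strided2D>) -> ()
```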

Additional [Linalg Dialect
Documentation](https://mlir.llvm.org/docs/Dialects/Linalg) and a
[Rationale Document](https://mlir.llvm.org/docs/RationaleLinalgDialect)
are also available and should be read before going into the details of
the op semantics.
}];
}