llvm-project/mlir/docs/Rationale/RationaleGenericDAGRewriter.md

287 lines
15 KiB
Markdown

# Generic DAG Rewriter Infrastructure Rationale
This document details the rationale behind a general DAG-to-DAG rewrite
infrastructure for MLIR. For up-to-date documentation on the user facing API,
please look at the main [Pattern Rewriting document](../PatternRewriter.md).
## Introduction and Motivation
The goal of a compiler IR is to represent code - at various levels of
abstraction which pose different sets of tradeoffs in terms of representational
capabilities and ease of transformation. However, the ability to represent code
is not itself very useful - you also need to be able to implement those
transformations.
There are many different types of compiler transformations, but this document
focuses on a particularly important class of transformation that comes up
repeatedly at scale, and is important for the goals of MLIR: matching one DAG of
operations, and replacing with another. This is an integral part of many
compilers and necessary for peephole optimizations like "eliminate identity
nodes" or "replace x+0 with x", a generalized canonicalization framework (e.g.
Instruction Combiner in LLVM), as well as a useful abstraction to implement
optimization algorithms for optimization algorithms for IR at multiple levels.
A particular strength of MLIR (and a major difference vs other compiler
infrastructures like LLVM, GCC, XLA, TensorFlow, etc) is that it uses a single
compiler IR to represent code at multiple levels of abstraction: an MLIR
operation can be a "TensorFlow operation", an "XLA HLO", an Affine Loop Nest, an
LLVM IR instruction (transitively including X86, Lanai, PTX, and other target
specific instructions), or anything else that the MLIR operation system can
reasonably express. Given that MLIR spans such a wide range of different problem
scopes, a single infrastructure for performing graph-to-graph rewrites can help
solve many diverse domain challenges.
[Static single assignment](https://en.wikipedia.org/wiki/Static_single_assignment_form)
(SSA) representations like MLIR make it easy to access the operands and "users"
of an operation. As such, a natural abstraction for these graph-to-graph
rewrites is that of DAG pattern matching: clients define DAG tile patterns
(where a tile is a sequence of operations defining a subgraph of the DAG), and
each pattern includes a result DAG to produce and the cost of the result (or,
inversely, the benefit of doing the replacement). A common infrastructure
efficiently finds and performs the rewrites.
While this concept is simple, the details are more nuanced. This document
defines and explores a set of abstractions that can solve a wide range of
different problems, and be applied to many different sorts of problems that MLIR
is - and is expected to - face over time. We do this by separating the pattern
application algorithm from the "driver" of the computation loop, and make space
for the patterns to be defined declaratively.
### Constant folding
A degenerate but pervasive case of DAG-to-DAG pattern matching is constant
folding: an operation whose operands contain constants can often be folded to a
result constant value.
MLIR operations may override a
[`fold`](../Canonicalization.md/#canonicalizing-with-the-fold-method) routine, which
exposes a simpler API compared to a general DAG-to-DAG pattern matcher, and
allows for it to be applicable in cases that a generic matcher would not. For
example, a DAG-rewrite can remove arbitrary nodes in the current function, which
could invalidate iterators. Constant folding as an API does not remove any
nodes, it just provides a (list of) constant values and allows the clients to
update their data structures as necessary.
## Related Work
There is a huge amount of related work to consider, given that nearly every
compiler in existence has to solve this problem many times over. One unifying
problem is that all of these systems are designed to solve one particular, and
usually, narrow problem: MLIR on the other hand would like to solve many of
these problems within a single infrastructure. Here are a few related graph
rewrite systems, along with the pros and cons of their work (The most similar
design to the infrastructure present in MLIR is the LLVM DAG-to-DAG instruction
selection algorithm).
### AST-Level Pattern Matchers
The literature is full of source-to-source translators which transform
identities in order to improve performance (e.g. transforming `X*0` into `0`).
One large example is the GCC `fold` function, which performs
[many optimizations](https://github.com/gcc-mirror/gcc/blob/master/gcc/fold-const.c)
on ASTs. Clang has
[similar routines](https://clang.llvm.org/docs/InternalsManual.html#constant-folding-in-the-clang-ast)
for simple constant folding of expressions (as required by the C++ standard) but
doesn't perform general optimizations on its ASTs.
The primary downside of AST optimizers is that you can't see across operations
that have multiple uses. It is
[well known in literature](https://llvm.org/pubs/2008-06-LCTES-ISelUsingSSAGraphs.pdf)
that DAG pattern matching is more powerful than tree pattern matching, but on
the other hand, DAG pattern matching can lead to duplication of computation
which needs to be checked for.
### "Combiners" and other peephole optimizers
Compilers end up with a lot of peephole optimizers for various things, e.g. the
GCC
["combine" routines](https://github.com/gcc-mirror/gcc/blob/master/gcc/combine.c)
(which try to merge two machine instructions into a single one), the LLVM
[Inst Combine](https://github.com/llvm/llvm-project/tree/main/llvm/lib/Transforms/InstCombine)
[pass](https://llvm.org/docs/Passes.html#instcombine-combine-redundant-instructions),
LLVM's
[DAG Combiner](https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/DAGCombiner.cpp),
the Swift compiler's
[SIL Combiner](https://github.com/apple/swift/tree/main/lib/SILOptimizer/SILCombiner),
etc. These generally match one or more operations and produce zero or more
operations as a result. The LLVM
[Legalization](https://github.com/llvm/llvm-project/tree/main/llvm/lib/CodeGen/SelectionDAG)
infrastructure has a different outer loop but otherwise works the same way.
These passes have a lot of diversity, but also have a unifying structure: they
mostly have a worklist outer loop which visits operations. They then use a
visitor pattern (or equivalent) to switch over the class of operation and
dispatch to a method. That method contains a long list of hand-written C++ code
that pattern-matches various special cases. LLVM introduced a "match" function
that allows writing patterns in a somewhat more declarative style using template
metaprogramming (MLIR has similar facilities). Here's a simple example:
```c++
// Y - (X + 1) --> ~X + Y
if (match(Op1, m_OneUse(m_Add(m_Value(X), m_One()))))
return BinaryOperator::CreateAdd(Builder.CreateNot(X), Op0);
```
Here is a somewhat more complicated one (this is not the biggest or most
complicated :)
```c++
// C2 is ODD
// LHS = XOR(Y,C1), Y = AND(Z,C2), C1==(C2+1) => LHS == NEG(OR(Z, ~C2))
// ADD(LHS, RHS) == SUB(RHS, OR(Z, ~C2))
if (match(LHS, m_Xor(m_Value(Y), m_APInt(C1))))
if (C1->countTrailingZeros() == 0)
if (match(Y, m_And(m_Value(Z), m_APInt(C2))) && *C1 == (*C2 + 1)) {
Value NewOr = Builder.CreateOr(Z, ~(*C2));
return Builder.CreateSub(RHS, NewOr, "sub");
}
```
These systems are simple to set up, and pattern matching templates have some
advantages (they are extensible for new sorts of sub-patterns, look compact at
point of use). On the other hand, they have lots of well known problems, for
example:
* These patterns are very error prone to write, and contain lots of
redundancies.
* The IR being matched often has identities (e.g. when matching commutative
operators) and the C++ code has to handle it manually - take a look at
[the full code](https://github.com/llvm/llvm-project/blob/c0b5000bd848303320c03f80fbf84d71e74518c9/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp#L767)
for `checkForNegativeOperand` that defines the second pattern).
* The matching code compiles slowly, both because it generates tons of code
and because the templates instantiate slowly.
* Adding new patterns (e.g. for count leading zeros in the example above) is
awkward and doesn't often happen.
* The cost model for these patterns is not really defined - it is emergent
based on the order the patterns are matched in code.
* They are non-extensible without rebuilding the compiler.
* It isn't practical to apply theorem provers and other tools to these
patterns - they cannot be reused for other purposes.
In addition to structured "combiners" like these, there are lots of ad-hoc
systems like the
[LLVM Machine code peephole optimizer](http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/PeepholeOptimizer.cpp?view=markup)
which are related.
### LLVM's DAG-to-DAG Instruction Selection Infrastructure
The instruction selection subsystem in LLVM is the result of many years worth of
iteration and discovery, driven by the need for LLVM to support code generation
for lots of targets, the complexity of code generators for modern instruction
sets (e.g. X86), and the fanatical pursuit of reusing code across targets. Eli
Bendersky wrote a
[nice short overview](https://eli.thegreenplace.net/2013/02/25/a-deeper-look-into-the-llvm-code-generator-part-1)
of how this works, and the
[LLVM documentation](https://llvm.org/docs/CodeGenerator.html#select-instructions-from-dag)
describes it in more depth including its advantages and limitations. It allows
writing patterns like this.
```
def : Pat<(or GR64:$src, (not (add GR64:$src, 1))),
(BLCI64rr GR64:$src)>;
```
This example defines a matcher for the
["blci" instruction](https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#TBM_\(Trailing_Bit_Manipulation\))
in the
[X86 target description](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86InstrInfo.td),
there are many others in that file (look for `Pat<>` patterns, since they aren't
entangled in details of the compiler like assembler/disassembler generation
logic).
For the purposes of MLIR, there is much to like about this system, for example:
* It is defined in a declarative format.
* It is extensible to target-defined operations.
* It automates matching across identities, like commutative patterns.
* It allows custom abstractions and intense factoring of target-specific
commonalities.
* It generates compact code - it compiles into a state machine, which is
interpreted.
* It allows the instruction patterns to be defined and reused for multiple
purposes.
* The patterns are "type checked" at compile time, detecting lots of bugs
early and eliminating redundancy from the pattern specifications.
* It allows the use of general C++ code for weird/complex cases.
While there is a lot that is good here, there are also a few undesirable bits:
* The representation is specifically designed and only applicable for
instruction selection, meaning that the directly adjacent problems like the
DAGCombiner and Legalizer can't use it.
* This isn't extensible at compiler runtime, you have to rebuild the compiler
to extend it.
* The error messages when failing to match a pattern
[are not exactly optimal](https://www.google.com/search?q=llvm+cannot+select).
* It has lots of implementation problems and limitations (e.g. can't write a
pattern for a multi-result operation) as a result of working with the
awkward SelectionDAG representation and being designed and implemented on
demand.
* Organic growth over time has left lots of sharp edges.
### Summary
MLIR faces a wide range of pattern matching and graph rewrite problems, and one
of the major advantages of having a common representation for code at multiple
levels is that it allows for investing in - and highly leveraging - a single
infrastructure for doing this sort of work.
## Goals
We'd like the to encompass many problems in the MLIR space, including 1-to-N
expansions (e.g. such as in type legalization during instruction selection when
an add of one bit width may be split into multiple adds of a smaller bit width),
M-to-1 patterns (e.g. when converting a multiply+add into a single muladd
operation), as well as general M-to-N patterns (e.g. instruction selection for
target instructions). Patterns have a benefit associated with them, and the
common infrastructure should be responsible for sorting out the highest benefit
match for a given application.
We separate the task of picking a particular optimal pattern from a given root
node, the algorithm used to rewrite an entire graph given a particular set of
goals, and the definition of the patterns themselves. We do this because DAG
tile pattern matching is NP complete. Additionally, we would like to support
iterative rewrite algorithms that progressively transform the input program
through multiple steps. Furthermore, we would like to support many different
sorts of clients across the MLIR stack, and they may have different tolerances
for compile time cost, different demands for optimality, and other algorithmic
goals or constraints.
We aim for MLIR transformations to be easy to implement and reduce the
likelihood for compiler bugs. We expect there to be a very large number of
patterns that are defined over time, and we believe that these sorts of patterns
will have a very large number of legality/validity constraints - many of which
are difficult to reason about in a consistent way, may be target specific, and
whose implementation may be particularly bug-prone. As such, we aim to design
the API around pattern definition to be simple, resilient to programmer errors,
and allow separation of concerns between the legality of the nodes generated
from the idea of the pattern being defined.
Finally, error handling is a topmost concern, we want pattern match failures to
be diagnosable in a reasonable way. This is a difficult problem in general, as
the space of malfunction is too great to be fully enumerated and handled
optimally, but MLIR is already designed to represent the provenance of an
operation well. The aim of the pattern rewriting infrastructure is simply to
propagate that provenance information precisely, as well as diagnose pattern
match failures with the rationale for why a set of patterns do not apply.
### Non goals
The pattern infrastructure does not aim to solve all compiler problems, it is
simply a DAG-to-DAG pattern matching system. Compiler algorithms that require
global dataflow analysis (e.g. common subexpression elimination, conditional
constant propagation, and many many others) will not be directly solved by this
infrastructure.
This infrastructure is limited to DAG patterns, which (by definition) prevent
the patterns from seeing across cycles in a graph. In an SSA-based IR like MLIR,
this means that these patterns don't see across basic block arguments. We
consider this acceptable given the set of problems we are trying to solve - we
don't know of any other system that attempts to do so, and consider the payoff
of worrying about this to be low.
This design includes the ability for DAG patterns to have associated benefits,
but those benefits are defined in terms of magic numbers (typically equal to the
number of nodes being replaced). For any given application, the units of magic
numbers will have to be defined.