llvm-project/mlir/docs/Rationale/RationaleSimplifiedPolyhedr...

# MLIR: The case for a simplified polyhedral form

MLIR embraces polyhedral compiler techniques for their many advantages
representing and transforming dense numerical kernels, but it uses a form that
differs significantly from other polyhedral frameworks.

**Disclaimer / Warning**

This document is a very early design proposal (which has since been accepted)
that explored the tradeoffs of using this simplified form vs the traditional
polyhedral schedule list form. At some point, this document could be dusted off
and written as a proper academic paper, but until now, it is better to included
it in this crafty form than not to. Beware that this document uses archaic
syntax and should not be considered a canonical reference to modern MLIR.

## Introduction

This document discusses general goals of the project, introduces context and the
two alternatives, then talks about the tradeoffs of these designs. Written by
Chris Lattner.

## General goals of an IR, and goals of mlfunc's specifically

Our currently planned representation for MLIR consists of two kinds of
functions: an LLVM-like "CFG Function" and an "ML Function": a function
represented in multidimensional loop form. The idea is that a CFG function is
capable of full generality for expressing arbitrary computation, but is awkward
for loop transformations. In contrast, mlfunc's are limited (e.g. to control
flow involving loop nests over affine spaces) but these limitations make it much
easier to transform and analyze, particularly for the set of computations in a
machine learning kernel.

The design of an intermediate representations is an optimization problem, which
makes intentional tradeoffs that aim to make certain kinds of compiler
transformations simple. After all, it is "possible" to do almost any
transformation on any IR: we could theoretically do loop transformations on
assembly language. OTOH, such transformations would take too long to write,
would be fragile due to irrelevant changes, would be difficult to maintain, and
difficult to make target independent. Performing transformations on the "right
level" of IR makes it much easier to do analysis and transformation of code, and
can make them faster by reducing the size of the IR, and eliminating
possibilities that would have otherwise have to be considered.

This is the reason we're interested in adding polyhedral techniques to an IR in
the first place: though our base "CFG function" representation is fully capable
of expressing any computation, it is "too" expressive. The limitations imposed
by polyhedral techniques (e.g. on affine loop bounds and array subscripts)
define a closed algebra that can represent an interesting range of
transformations and their compositions, and because of their simplicity, we can
perform (e.g.) dependence analysis more efficiently and more reliably.

This raises an important question that this document examines: given we are
introducing a redundant and limited way to express code and transformations,
exactly what form is best to perform the analyses and transformations we want?

We explore two different design points that are capable of expressing the same
class of affine loop computations, but which use different representational
forms. These forms trade off verbosity, ease of transformation, and ease of
analysis in interesting ways.

## Context: Traditional Polyhedral Form

We started by discussing a representation that uses the traditional polyhedral
schedule set + domain representation, e.g. consider C-like code like:

```c
  void simple_example(...) {
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j) {
         float tmp = X[i,j]    // S1
         A[i,j] = tmp + 1      // S2
         B[i,j] = tmp * 42     // S3
       }
    }
  }
```

The polyhedral representation doesn't care about the actual computation, so we
will abstract them into S1/S2/S3 in the discussion below. Originally, we planned
to represent this with a classical form like (syntax details are not important
and probably slightly incorrect below):

```mlir
  mlfunc @simple_example(... %N) {
    %tmp = call @S1(%X, %i, %j)
      domain: (0 <= %i < %N), (0 <= %j < %N)
      schedule: (i, j, 0)

    call @S2(%tmp, %A, %i, %j)
      domain: (0 <= %i < %N), (0 <= %j < %N)
      schedule: (i, j, 1)

    call @S3(%tmp, %B, %i, %j)
      domain: (0 <= %i < %N), (0 <= %j < %N)
      schedule: (i, j, 2)
  }
```

In this design, an mlfunc is an unordered bag of instructions whose execution
order is fully controlled by their schedule.

However, we recently agreed that a more explicit schedule tree representation is
a better fit for our needs, because it exposes important structure that will
make analyses and optimizations more efficient, and also makes the scoping of
SSA values more explicit. This leads us to a representation along the lines of:

```mlir
  mlfunc @simple_example(... %N) {
    d0/d1 = mlspace
    for S1(d0), S2(d0), S3(d0) {
      for S1(d1), S2(d1), S3(d1) {

        %tmp = call @S1(%X, d0, d1)      ;; S1
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S2(%tmp, %A, d0, d1)      ;; S2
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S3(%tmp, %B, d0, d1)      ;; S3
          domain: (0 <= d0 < %N), (0 <= d1 < %N)
      }
    }
  }
```

This change makes the nesting structure of the loops an explicit part of the
representation, and makes lexical ordering within a loop significant
(eliminating the constant 0/1/2 of schedules).

It isn't obvious in the example above, but the representation allows for some
interesting features, including the ability for instructions within a loop nest
to have non-equal domains, like this - the second instruction ignores the outer
10 points inside the loop:

```mlir
  mlfunc @reduced_domain_example(... %N) {
    d0/d1 = mlspace
    for S1(d0), S2(d0) {
      for S1(d1), S2(d1) {
        %tmp = call @S1(%X, d0, d1)    ;; S1
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S2(%tmp, %A, d0, d1)      ;; S2
          domain: (10 <= d0 < %N-10), (10 <= d1 < %N-10)
      }
    }
  }
```

It also allows schedule remapping within the instruction, like this example that
introduces a diagonal skew through a simple change to the schedules of the two
instructions:

```mlir
  mlfunc @skewed_domain_example(... %N) {
    d0/d1 = mlspace
    for S1(d0), S2(d0+d1) {
      for S1(d0+d1), S2(d1) {
        %tmp = call @S1(%X, d0, d1)    ;; S1
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S2(%tmp, %A, d0, d1)      ;; S2
          domain: (0 <= d0 < %N), (0 <= d1 < %N)
      }
    }
  }
```

This form has great power, and the polyhedral code generator (which lowers from
an mlfunc to a cfgfunc representation) handles this power so things that
introduce loop transformations don't have to explicitly manipulate the looping
structure.

## Proposal: Simplified Polyhedral Form

This document proposes and explores the idea of going one step further, moving
all of the domain and schedule information into the "schedule tree". In this
form, we would have a representation where all instructions inside of a given
for-loop are known to have the same domain, which is maintained by the loop. In
the simplified form, we also have an "if" instruction that takes an affine
condition.

Our simple example above would be represented as:

```mlir
  mlfunc @simple_example(... %N) {
    affine.for %i = 0 ... %N step 1 {
      affine.for %j = 0 ... %N step 1 {
        // identity noop in this case, but can exist in general.
        %0,%1 = affine.apply #57(%i, %j)

        %tmp = call @S1(%X, %0, %1)

        call @S2(%tmp, %A, %0, %1)

        call @S3(%tmp, %B, %0, %1)
      }
    }
  }
```

The example with the reduced domain would be represented with an if instruction:

```mlir
  mlfunc @reduced_domain_example(... %N) {
    affine.for %i = 0 ... %N step 1 {
      affine.for %j = 0 ... %N step 1 {
        // identity noop in this case, but can exist in general.
        %0,%1 = affinecall #57(%i, %j)

        %tmp = call @S1(%X, %0, %1)

        if (10 <= %i < %N-10), (10 <= %j < %N-10) {

          %2,%3 = affine.apply(%i, %j)    // identity noop in this case

          call @S2(%tmp, %A, %2, %3)
        }
      }
    }
  }
```

These IRs represent exactly the same information, and use a similar information
density. The 'traditional' form introduces an extra level of abstraction
(schedules and domains) that make it easy to transform instructions at the
expense of making it difficult to reason about how those instructions will come
out after code generation. With the simplified form, transformations have to do
parts of code generation inline with their transformation: instead of simply
changing a schedule to **(i+j, j)** to get skewing, you'd have to generate this
code explicitly (potentially implemented by making polyhedral codegen a library
that transformations call into):

```mlir
mlfunc @skewed_domain_example(... %N) {
  affine.for %t1 = 0 ... 2*N-2 step 1 {
    affine.for %t2 = max(0, t1-N+1) ... min(N, t1) step 1 {
      (%i, %j) = (%t1-%t2, %t2)
      ...
    }
  }
}
```

## Evaluation

Both of these forms are capable of expressing the same class of computation:
multidimensional loop nests with affine loop bounds and affine memory
references. That said, they pose very different tradeoffs in other ways.

### Commonality: can express same computation

Both of these can express the same sorts of computation, e.g. kernels written in
one form are representable in the other form in all cases.

### Commonality: dependence analysis

These representations both use affine functions for data layout mapping and
access subscripts, and dependence analysis works the same way.

### Commonality: difficulty of determining optimal transformation series

One major challenge in performance of optimization of this sort of code is
choosing the ordering and behavior of various loop transformations that get
applied. There are non-local effects of every decision, and neither
representation helps solve this inherently hard problem.

### Commonality: compactness of IR

In the cases that are most relevant to us (hyper rectangular spaces) these forms
are directly equivalent: a traditional instruction with a limited domain (e.g.
the "reduced_domain_example" above) ends up having one level of ML 'if' inside
its loops. The simplified form pays for this by eliminating schedules and
domains from the IR. Both forms allow code duplication to reduce dynamic
branches in the IR: the traditional approach allows instruction splitting, the
simplified form supports instruction duplication.

It is important to point out that the traditional form wins on compactness in
the extreme cases: e.g. the loop skewing case. These cases will be rare in
practice for our workloads, and are exactly the cases that downstream
transformations want to be explicit about what they are doing.

### Simplicity of code generation

A key final stage of an mlfunc is its conversion to a CFG function, which is
required as part of lowering to the target machine. The simplified form has a
clear advantage here: the IR has a direct correspondence to the structure of the
generated code.

In contrast, the traditional form has significant complexity in the lowering
process to a CFG function, because the verbosity not imbued in the IR needs to
come out during code generation. Code generation from ISL shows that it is
possible to do this, but it is a non-trivial transformation.

### Ease of transformation

An advantage for the traditional form is that it is easier to perform certain
transformations on it: skewing and tiling are just transformations on the
schedule of the instructions in question, it doesn't require changing the loop
structure.

In practice, the simplified form requires moving the complexity of code
generation into the transformations themselves - this is sometimes trivial,
sometimes involved. The author believes that this should be possible by making
the code generation algorithms themselves be library functions that
transformations call into, instead of an opaque block that happens at the end of
the mlfunc processing.

Also, the sorts of transformations performed today by XLA (including tiling,
padding, unrolling, and other rectangular transformations) should be easy enough
to implement on either representation. The only cases that are a challenge are
more advanced cases like skewing, e.g. for DMA data movement generation.

### Ease of analysis: Cost models

The simplified form is much easier for analyses and transformations to build
cost models for (e.g. answering the question of "how much code bloat will be
caused by unrolling a loop at this level?"), because it is easier to predict
what target code will be generated. With the traditional form, these analyses
will have to anticipate what polyhedral codegen will do to a set of instructions
under consideration: something that is non-trivial in the interesting cases in
question (see "Cost of code generation").

### Cost of code generation

State of the art polyhedral code generation is
[expensive and complicated](https://lirias.kuleuven.be/bitstream/123456789/497238/1/toplas-astgen.pdf),
sometimes exponential time complexity. We expect that most machine learning
workloads will be hyper-rectangular, and thus it should be easy to specialize in
important cases. That said, the traditional polyhedral representation makes it
very easy to introduce complicated and expensive schedules, and provides no way
to understand and project a cost model for using them. All downstream clients of
the IR need to be prepared to handle the full generality of IR that may come to
them.

The simplified form defines this away: the concepts in the IR remain simple, and
the code much more directly reflects the cost model for lowering to CFG
functions and machine code. This is expected to be very important in the late
stages of a code generator for an accelerator.

### SSA in ML Functions

We agree already that values defined in an mlfunc can include scalar values and
they are defined based on traditional dominance. In the simplified form, this is
very simple: arguments and induction variables defined in for-loops are live
inside their lexical body, and linear series of instructions have the same "top
down" dominance relation that a basic block does.

In the traditional form though, this is not the case: it seems that a lot of
knowledge about how codegen will emit the code is necessary to determine if SSA
form is correct or not. For example, this is invalid code:

```mlir
  %tmp = call @S1(%X, %0, %1)
    domain: (10 <= %i < %N), (0 <= %j < %N)
    schedule: (i, j)

  call @S2(%tmp, %A, %0, %1)
    domain: (0 <= %i < %N), (0 <= %j < %N)
    schedule: (i, j)
```

Because `%tmp` isn't defined on some iterations of the %i loop.

This matters because it makes the verifier more complicated, but more
significantly, it means that load promotion and other optimizations that will
produce SSA form will need to be aware of this and be able to model what codegen
does.

An emergent property of this that we discussed recently is that PHI nodes in
mlfunc's (if we support them) will also have to have domains.

### Lack of redundancy in IR

The traditional form has multiple encodings for the same sorts of behavior: you
end up having bits on `affine.for` loops to specify whether codegen should use
"atomic/separate" policies, unroll loops, etc. Instructions can be split or can
generate multiple copies of their instruction because of overlapping domains,
etc.

This is a problem for analyses and cost models, because they each have to reason
about these additional forms in the IR.

### Suitability to purpose: lowering to machine code

One of the main drivers for this work is lowering to low-level accelerator code,
including two-dimensional vectorization, insertion of DMAs, and other
utilization of the matrix accelerator units. In the author's opinion, the extra
compactness of the traditional form is a negative for this purpose: reasoning
about the generated machine code will require understanding the mapping from
mlfunc to lowered code, which means that it must understand what code generation
will do.

In the simplified form, the effect of "code generation" is always obvious from
the IR itself, which should make it easier to perform vectorization to target
instructions and other analyses we need to perform.

## Third Alternative: two different levels of mlfunc

One hybrid alternative is to support both the traditional and simplified forms
of mlfunc in our IR.

The stages could look like this, for example:

1.  Early performance transformations could be done on the traditional form.
1.  Partial code generation lowers to the simplified form
1.  Target specific lowering phases for tiling, and vectorization and other 2D
    transforms that don't benefit much from the traditional form could be run.
1.  Final codegen to a cfg func can be done when all of the instructions are
    replaced with ones valid on the target.

While this is possible, it isn't clear what would justify the complexity of this
approach. Unless there is a super compelling reason for this, it would be nice
to not do this. **Update:** we discussed this as a design team and agreed that
this wouldn't be a good way to go.