From a603b2f5b2dc04813548d2f90821a8efe4635bb8 Mon Sep 17 00:00:00 2001
From: Chris Lattner
Date: Sat, 17 Nov 2018 22:16:48 -0800
Subject: [PATCH] Import the "MLIR: The case for a simplified polyhedral form"
 proposal doc from google docs into the codebase as a rationale doc, since
 this is an important aspect of our design.

PiperOrigin-RevId: 221957444
---
 .../RationaleSimplifiedPolyhedralForm.md      | 414 ++++++++++++++++++
 1 file changed, 414 insertions(+)
 create mode 100644 mlir/g3doc/RationaleSimplifiedPolyhedralForm.md

diff --git a/mlir/g3doc/RationaleSimplifiedPolyhedralForm.md b/mlir/g3doc/RationaleSimplifiedPolyhedralForm.md
new file mode 100644
index 000000000000..c2770e1e26df
--- /dev/null
+++ b/mlir/g3doc/RationaleSimplifiedPolyhedralForm.md
@@ -0,0 +1,414 @@

# MLIR: The case for a simplified polyhedral form

MLIR embraces polyhedral compiler techniques for their many advantages in
representing and transforming dense numerical kernels, but it uses a form that
differs significantly from other polyhedral frameworks.

**Disclaimer / Warning**

This document is a very early design proposal (which has since been accepted)
that explored the tradeoffs of using this simplified form vs. the traditional
polyhedral schedule list form. At some point, this document could be dusted off
and written as a proper academic paper, but until then, it is better to include
it in this crufty form than not at all. Beware that this document uses archaic
syntax and should not be considered a canonical reference to modern MLIR.

## Introduction

This document discusses the general goals of the project, introduces context
and the two alternatives, and then talks about the tradeoffs of these designs.
Written by Chris Lattner.

## General goals of an IR, and goals of mlfunc's specifically

Our currently planned representation for MLIR consists of two kinds of
functions: an LLVM-like "CFG Function" and an "ML Function": a function
represented in multidimensional loop form. The idea is that a CFG function is
capable of full generality for expressing arbitrary computation, but is awkward
for loop transformations. In contrast, mlfunc's are limited (e.g. to control
flow involving loop nests over affine spaces), but these limitations make them
much easier to transform and analyze, particularly for the set of computations
in a machine learning kernel.

The design of an intermediate representation is an optimization problem, which
makes intentional tradeoffs that aim to make certain kinds of compiler
transformations simple. After all, it is "possible" to do almost any
transformation on any IR: we could theoretically do loop transformations on
assembly language. OTOH, such transformations would take too long to write,
would be fragile due to irrelevant changes, would be difficult to maintain, and
difficult to make target independent. Performing transformations on the "right
level" of IR makes it much easier to do analysis and transformation of code,
and can make them faster by reducing the size of the IR, and eliminating
possibilities that would otherwise have to be considered.

This is the reason we're interested in adding polyhedral techniques to an IR in
the first place: though our base "CFG function" representation is fully capable
of expressing any computation, it is "too" expressive. The limitations imposed
by polyhedral techniques (e.g. on affine loop bounds and array subscripts)
define a closed algebra that can represent an interesting range of
transformations and their compositions, and because of their simplicity, we can
perform (e.g.) dependence analysis more efficiently and more reliably.
This raises an important question that this document examines: given that we
are introducing a redundant and limited way to express code and
transformations, exactly what form is best for performing the analyses and
transformations we want?

We explore two different design points that are capable of expressing the same
class of affine loop computations, but which use different representational
forms. These forms trade off verbosity, ease of transformation, and ease of
analysis in interesting ways.

## Context: Traditional Polyhedral Form

We started by discussing a representation that uses the traditional polyhedral
schedule set + domain form. For example, consider C-like code like:

```c
  void simple_example(...) {
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j) {
        float tmp = X[i][j];    // S1
        A[i][j] = tmp + 1;      // S2
        B[i][j] = tmp * 42;     // S3
      }
    }
  }
```

The polyhedral representation doesn't care about the actual computations, so we
will abstract them into S1/S2/S3 in the discussion below. Originally, we
planned to represent this with a classical form like (syntax details are not
important and probably slightly incorrect below):

```
  mlfunc @simple_example(... %N) {
    %tmp = call @S1(%X, %i, %j)
      domain: (0 <= %i < %N), (0 <= %j < %N)
      schedule: (i, j, 0)

    call @S2(%tmp, %A, %i, %j)
      domain: (0 <= %i < %N), (0 <= %j < %N)
      schedule: (i, j, 1)

    call @S3(%tmp, %B, %i, %j)
      domain: (0 <= %i < %N), (0 <= %j < %N)
      schedule: (i, j, 2)
  }
```

In this design, an mlfunc is an unordered bag of statements whose execution
order is fully controlled by their schedule.

However, we recently agreed that a more explicit schedule tree representation
is a better fit for our needs, because it exposes important structure that will
make analyses and optimizations more efficient, and also makes the scoping of
SSA values more explicit. This leads us to a representation along the lines of:

```
  mlfunc @simple_example(... %N) {
    d0/d1 = mlspace
    for S1(d0), S2(d0), S3(d0) {
      for S1(d1), S2(d1), S3(d1) {

        %tmp = call @S1(%X, d0, d1)      ;; S1
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S2(%tmp, %A, d0, d1)       ;; S2
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S3(%tmp, %B, d0, d1)       ;; S3
          domain: (0 <= d0 < %N), (0 <= d1 < %N)
      }
    }
  }
```

This change makes the nesting structure of the loops an explicit part of the
representation, and makes lexical ordering within a loop significant
(eliminating the constant 0/1/2 components of the schedules above).

It isn't obvious in the example above, but the representation allows for some
interesting features, including the ability for statements within a loop nest
to have non-equal domains, like this - the second statement ignores the outer
10 points on each side of the iteration space:
```
  mlfunc @reduced_domain_example(... %N) {
    d0/d1 = mlspace
    for S1(d0), S2(d0) {
      for S1(d1), S2(d1) {
        %tmp = call @S1(%X, d0, d1)      ;; S1
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S2(%tmp, %A, d0, d1)       ;; S2
          domain: (10 <= d0 < %N-10), (10 <= d1 < %N-10)
      }
    }
  }
```

It also allows schedule remapping within the statement, like this example that
introduces a diagonal skew through a simple change to the schedules of the two
statements:

```
  mlfunc @skewed_domain_example(... %N) {
    d0/d1 = mlspace
    for S1(d0), S2(d0+d1) {
      for S1(d0+d1), S2(d1) {
        %tmp = call @S1(%X, d0, d1)      ;; S1
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S2(%tmp, %A, d0, d1)       ;; S2
          domain: (0 <= d0 < %N), (0 <= d1 < %N)
      }
    }
  }
```

This form has great power, and the polyhedral code generator (which lowers from
an mlfunc to a cfgfunc representation) handles all of this power, so passes
that introduce loop transformations don't have to explicitly manipulate the
looping structure.

## Proposal: Simplified Polyhedral Form

This document proposes and explores the idea of going one step further, moving
all of the domain and schedule information into the "schedule tree". In this
form, we would have a representation where all statements inside of a given
for-loop are known to have the same domain, which is maintained by the loop. In
the simplified form, we also have an "if" statement that takes an affine
condition.

Our simple example above would be represented as:

```mlir
  mlfunc @simple_example(... %N) {
    for %i = 0 ... %N step 1 {
      for %j = 0 ... %N step 1 {
        // identity noop in this case, but can exist in general.
        %0,%1 = affine_apply #57(%i, %j)

        %tmp = call @S1(%X, %0, %1)

        call @S2(%tmp, %A, %0, %1)

        call @S3(%tmp, %B, %0, %1)
      }
    }
  }
```

The example with the reduced domain would be represented with an if statement:

```mlir
  mlfunc @reduced_domain_example(... %N) {
    for %i = 0 ... %N step 1 {
      for %j = 0 ... %N step 1 {
        // identity noop in this case, but can exist in general.
        %0,%1 = affine_apply #57(%i, %j)

        %tmp = call @S1(%X, %0, %1)

        if (10 <= %i < %N-10), (10 <= %j < %N-10) {

          %2,%3 = affine_apply(%i, %j)    // identity noop in this case

          call @S2(%tmp, %A, %2, %3)
        }
      }
    }
  }
```

These IRs represent exactly the same information, and use a similar information
density. The 'traditional' form introduces an extra level of abstraction
(schedules and domains) that makes it easy to transform statements at the
expense of making it difficult to reason about how those statements will come
out after code generation. With the simplified form, transformations have to do
parts of code generation inline with their transformation: instead of simply
changing a schedule to **(i+j, j)** to get skewing, you'd have to generate this
code explicitly (potentially implemented by making polyhedral codegen a library
that transformations call into):

```mlir
mlfunc @skewed_domain_example(... %N) {
  for %t1 = 0 ... 2*N-2 step 1 {
    for %t2 = max(0, t1-N+1) ... min(N, t1) step 1 {
      (%i, %j) = (%t1-%t2, %t2)
      ...
    }
  }
}
```

## Evaluation

Both of these forms are capable of expressing the same class of computation:
multidimensional loop nests with affine loop bounds and affine memory
references. That said, they pose very different tradeoffs in other ways.

### Commonality: can express same computation

Both of these can express the same sorts of computation, i.e. kernels written
in one form are representable in the other form in all cases.
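To make this concrete, here is a minimal C sketch (not part of the original
proposal) of the computation that `@reduced_domain_example` denotes in either
form. `S1` and `S2` are hypothetical stand-ins for the opaque statement bodies
used throughout this document:

```c
// Hypothetical statement bodies standing in for the abstract S1/S2 above.
float S1(const float *X, int i, int j);
void  S2(float tmp, float *A, int i, int j);

// One possible lowering of @reduced_domain_example: S1 runs over the full
// (i, j) space, while S2 only runs on the interior points. The traditional
// form (per-statement domains) and the simplified form (an affine "if" inside
// the loops) both describe exactly this iteration space.
void reduced_domain_example(int N, const float *X, float *A) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      float tmp = S1(X, i, j);
      if (i >= 10 && i < N - 10 && j >= 10 && j < N - 10)
        S2(tmp, A, i, j);
    }
  }
}
```

A traditional code generator could equally well split the loops so that the
interior runs without the branch; the point here is only that both
representations denote the same computation.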
### Commonality: dependence analysis

These representations both use affine functions for data layout mapping and
access subscripts, and dependence analysis works the same way.

### Commonality: difficulty of determining optimal transformation series

One major challenge in optimizing this sort of code for performance is choosing
the ordering and behavior of the various loop transformations that get applied.
There are non-local effects of every decision, and neither representation helps
solve this inherently hard problem.

### Commonality: compactness of IR

In the cases that are most relevant to us (hyper-rectangular spaces) these
forms are directly equivalent: a traditional statement with a limited domain
(e.g. the "reduced_domain_example" above) ends up having one level of ML 'if'
inside its loops. The simplified form pays for this by eliminating schedules
and domains from the IR. Both forms allow code duplication to reduce dynamic
branches in the IR: the traditional approach allows statement splitting, while
the simplified form supports statement duplication.

It is important to point out that the traditional form wins on compactness in
the extreme cases: e.g. the loop skewing case. These cases will be rare in
practice for our workloads, and are exactly the cases where downstream
transformations want to be explicit about what they are doing.

### Simplicity of code generation

A key final stage of processing an mlfunc is its conversion to a cfg function,
which is required as part of lowering to the target machine. The simplified
form has a clear advantage here: the IR has a direct correspondence to the
structure of the generated code.

In contrast, the traditional form has significant complexity in the lowering
process to a CFG function, because the detail that is not spelled out in the IR
needs to be materialized during code generation. Code generation from ISL shows
that it is possible to do this, but it is a non-trivial transformation.

### Ease of transformation

An advantage of the traditional form is that it is easier to perform certain
transformations on it: skewing and tiling are just transformations on the
schedules of the statements in question; they do not require changing the loop
structure.

In practice, the simplified form requires moving the complexity of code
generation into the transformations themselves - this is sometimes trivial,
sometimes involved. The author believes that this should be possible by making
the code generation algorithms themselves library functions that
transformations call into, instead of an opaque pass that runs at the end of
mlfunc processing.

Also, the sorts of transformations performed today by XLA (including tiling,
padding, unrolling, and other rectangular transformations) should be easy
enough to implement on either representation. The only challenging cases are
more advanced ones like skewing, e.g. for DMA data movement generation.

### Ease of analysis: Cost models

The simplified form is much easier for analyses and transformations to build
cost models for (e.g. answering the question of "how much code bloat will be
caused by unrolling a loop at this level?"), because it is easier to predict
what target code will be generated. With the traditional form, these analyses
will have to anticipate what polyhedral codegen will do to the set of
statements under consideration: something that is non-trivial in the
interesting cases in question (see "Cost of code generation").
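As an illustration of the kind of cost model this enables, here is a minimal C
sketch (not from the original proposal; the names and structure are
hypothetical) of a code-bloat estimate for unrolling a loop in the simplified
form, where the inputs can be read directly off the IR:

```c
// Hypothetical loop summary: in the simplified form these numbers come
// straight from counting the operations lexically nested inside the loop.
struct LoopInfo {
  int ops_in_body;  // static operation count of the loop body
  int trip_count;   // known (or bounding) trip count
};

// Rough static-size estimate after unrolling by `factor`: the body is cloned
// `factor` times, plus a remainder loop when the trip count doesn't divide
// evenly. With the traditional form, the same estimate would first have to
// predict what polyhedral code generation will emit for the statements.
int estimate_unrolled_size(struct LoopInfo loop, int factor) {
  int unrolled_body = loop.ops_in_body * factor;
  int remainder = (loop.trip_count % factor != 0) ? loop.ops_in_body : 0;
  return unrolled_body + remainder;
}
```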
### Cost of code generation

State of the art polyhedral code generation is
[expensive and complicated](https://lirias.kuleuven.be/bitstream/123456789/497238/1/toplas-astgen.pdf),
sometimes with exponential time complexity. We expect that most machine
learning workloads will be hyper-rectangular, and thus it should be easy to
specialize for the important cases. That said, the traditional polyhedral
representation makes it very easy to introduce complicated and expensive
schedules, and provides no way to understand and project a cost model for using
them. All downstream clients of the IR need to be prepared to handle the full
generality of IR that may come to them.

The simplified form defines this away: the concepts in the IR remain simple,
and the code much more directly reflects the cost model for lowering to CFG
functions and machine code. This is expected to be very important in the late
stages of a code generator for an accelerator.

### SSA in ML Functions

We already agree that values defined in an mlfunc can include scalar values and
that they are defined based on traditional dominance. In the simplified form,
this is very simple: arguments and induction variables defined in for-loops are
live inside their lexical body, and linear series of statements have the same
"top down" dominance relation that a basic block does.

In the traditional form though, this is not the case: it seems that a lot of
knowledge about how codegen will emit the code is necessary to determine if SSA
form is correct or not. For example, this is invalid code:

```
  %tmp = call @S1(%X, %0, %1)
    domain: (10 <= %i < %N), (0 <= %j < %N)
    schedule: (i, j)

  call @S2(%tmp, %A, %0, %1)
    domain: (0 <= %i < %N), (0 <= %j < %N)
    schedule: (i, j)
```

It is invalid because `%tmp` isn't defined on some of the iterations of the
`%i` loop on which `@S2` executes.

This matters because it makes the verifier more complicated, but more
significantly, it means that load promotion and other optimizations that
produce SSA form will need to be aware of this and be able to model what
codegen does.

An emergent property of this that we discussed recently is that PHI nodes in
mlfunc's (if we support them) will also have to have domains.

### Lack of redundancy in IR

The traditional form has multiple encodings for the same sorts of behavior: you
end up having bits on `for` loops to specify whether codegen should use
"atomic/separate" policies, unroll loops, etc. Statements can be split, or can
be emitted as multiple copies because of overlapping domains, etc.

This is a problem for analyses and cost models, because they each have to
reason about these additional forms in the IR.

### Suitability to purpose: lowering to machine code

One of the main drivers for this work is lowering to low-level accelerator
code, including two-dimensional vectorization, insertion of DMAs, and other
utilization of the matrix accelerator units. In the author's opinion, the extra
compactness of the traditional form is a negative for this purpose: reasoning
about the generated machine code requires understanding the mapping from mlfunc
to lowered code, which means that the analysis must understand what code
generation will do.

In the simplified form, the effect of "code generation" is always obvious from
the IR itself, which should make it easier to perform vectorization to target
instructions and other analyses we need to perform.
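Returning to the invalid example in the "SSA in ML Functions" section above,
here is a minimal C analogy (not from the original proposal; names are
illustrative) of why that traditional-form snippet is broken: the defining
statement covers only part of the iteration space that the use covers, so some
uses read an undefined value.

```c
// C analogy of the invalid SSA example: `tmp` plays the role of %tmp. It is
// assigned only where S1's domain holds (10 <= i < N), but it is read on all
// of S2's domain (0 <= i < N), so the first 10 iterations read an
// uninitialized value. A verifier for the traditional form has to model what
// codegen produces in order to catch this; in the simplified form the problem
// is visible directly as a use outside the guarding "if".
void invalid_ssa_analogy(int N, const float *X, float *A) {
  for (int i = 0; i < N; ++i) {
    float tmp;               // %tmp
    if (i >= 10)             // S1's domain
      tmp = X[i];            // definition
    A[i] = tmp + 1.0f;       // S2 runs on all of [0, N): undefined read
  }                          // when i < 10
}
```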
## Third Alternative: two different levels of mlfunc

One hybrid alternative is to support both the traditional and simplified forms
of mlfunc in our IR.

The stages could look like this, for example:

1.  Early performance transformations could be done on the traditional form.
1.  Partial code generation lowers to the simplified form.
1.  Target-specific lowering phases for tiling, vectorization, and other 2D
    transforms that don't benefit much from the traditional form could be run.
1.  Final codegen to a cfg func can be done when all of the instructions are
    replaced with ones valid on the target.

While this is possible, it isn't clear what would justify the complexity of
this approach. Unless there is a super compelling reason for this, it would be
nice to not do this. **Update:** we discussed this as a design team and agreed
that this wouldn't be a good way to go.