- Feature Name: N/A
- Start Date: 2015-07-14
- RFC PR: [rust-lang/rfcs#1211](https://github.com/rust-lang/rfcs/pull/1211)
- Rust Issue: [rust-lang/rust#27840](https://github.com/rust-lang/rust/issues/27840)

# Summary

Introduce a "mid-level IR" (MIR) into the compiler. The MIR desugars
most of Rust's surface representation, leaving a simpler form that is
well-suited to type-checking and translation.

# Motivation

The current compiler uses a single AST from the initial parse all the
way to the final generation of bitcode. While this has some advantages,
there are also a number of distinct downsides.

1. The complexity of the compiler is increased because all passes must
   be written against the full Rust language, rather than being able
   to consider a reduced subset. The MIR proposed here is *radically*
   simpler than the surface Rust syntax -- for example, it contains no
   "match" statements, and converts both `ref` bindings and `&`
   expressions into a single form.

   a. There are numerous examples of "desugaring" in Rust. In
      principle, desugaring one language feature into another should
      make the compiler *simpler*, but in our current implementation,
      it tends to make things more complex, because every phase must
      simulate the desugaring anew. The most prominent example is the
      closure expression (`|| ...`), which desugars to a fresh struct
      instance, but other examples abound: `for` loops, `if let` and
      `while let`, `box` expressions, overloaded operators (which
      desugar to method calls), and method calls (which desugar to
      UFCS notation).

   b. There are a number of features which are almost infeasible to
      implement today but which should be much easier given a MIR
      representation. Examples include box patterns and non-lexical
      lifetimes.

2. Reasoning about fine-grained control-flow in an AST is rather
   difficult. The right tool for this job is a control-flow graph
   (CFG). We currently construct a CFG that lives "on top" of the AST,
   which allows the borrow checking code to be flow sensitive, but it
   is awkward to work with. Worse, because this CFG is not used by
   trans, it is not necessarily the case that the control-flow as seen
   by the analyses corresponds to the code that will be generated.
   The MIR is based on a CFG, resolving this situation.

3. The reliability of safety analyses is reduced because the gap
   between what is being analyzed (the AST) and what is being executed
   (bitcode) is very wide. The MIR is very low-level and hence the
   translation to bitcode should be straightforward.

4. The reliability of safety proofs, when we have some, would be
   reduced because the formal language we are modeling is so far from
   the full compiler AST. The MIR is simple enough that it should be
   possible to (eventually) make safety proofs based on the MIR
   itself.

5. Rust-specific optimizations, and optimizing `trans` output, are very
   challenging. There are numerous cases where it would be nice to be
   able to do optimizations *before* translating to bitcode, or
   to take advantage of Rust-specific knowledge of which a backend may be
   unaware. Currently, we are forced to do these optimizations as part
   of lowering to bitcode, which can get quite complex. Having an
   intermediate form improves the situation because:

   a. In some cases, we can do the optimizations in the MIR itself before translation.

   b. In other cases, we can do analyses on the MIR to easily determine when the optimization
      would be safe.

   c. In all cases, whatever we can do on the MIR will be helpful for other
      targets beyond existing backends (see next bullet).

6. Migrating away from LLVM is nearly impossible, since so much of the
   semantics of Rust itself is embedded in the `trans` step which
   converts to LLVM IR. Under the MIR design, those semantics are
   instead described in the translation from AST to MIR, and the LLVM
   step itself simply applies optimizations.

Given the numerous benefits of a MIR, you may wonder why we have not
taken steps in this direction earlier. In fact, we have a number of
structures in the compiler that simulate the effect of a MIR:

1. Adjustments. Every expression can have various adjustments, like
   autoderefs and so forth. These are computed by the type-checker
   and then read by later analyses. This is a form of MIR, but not a particularly
   convenient one.
2. The CFG. The CFG tries to model the flow of execution as a graph
   rather than a tree, to help analyses in dealing with complex
   control-flow formed by things like loops, `break`, `continue`, etc.
   This CFG is however inferior to the MIR in that it is only an
   approximation of control-flow and does not include all the
   information one would need to actually execute the program (for
   example, for an `if` expression, the CFG would indicate that two
   branches are possible, but would not contain enough information to
   decide which branch to take).
3. `ExprUseVisitor`. The `ExprUseVisitor` is designed to work in
   conjunction with the CFG. It walks the AST and highlights actions
   of interest to later analyses, such as borrows or moves. For each
   such action, the analysis gets a callback indicating the point in
   the CFG where the action occurred along with what
   happened. Overloaded operators, method calls, and so forth are
   "desugared" into their more primitive operations. This is
   effectively a kind of MIR, but it is not complete enough to do
   translation, since it focuses purely on borrows, moves, and other
   things of interest to the safety checker.

Each of these things was added in order to try and cope with the
complexity of working directly on the AST. The CFG for example
consolidates knowledge about control-flow into one piece of code,
producing a data structure that can be easily interpreted. Similarly,
the `ExprUseVisitor` consolidates knowledge of how to walk and
interpret the current compiler representation.

### Goals

It is useful to think about what "knowledge" the MIR should
encapsulate. Here is a listing of the kinds of things that should be
explicit in the MIR and thus that downstream code won't have to
re-encode in the form of repeated logic:

- **Precise ordering of control-flow.** The CFG makes this very explicit,
  and the individual statements and nodes in the MIR are very small
  and detailed and hence nothing "interesting" happens in the middle
  of an individual node with respect to control-flow.
- **What needs to be dropped and when.** The set of data that needs to
  be dropped and when is a fairly complex thing to calculate: you have
  to know what's in scope, including temporary values and so forth.
  In the MIR, all drops are explicit, including those that result from
  panics and unwinding.
- **How matches are desugared.** Reasoning about matches has been a
  traditional source of complexity. Matches combine traversing types
  with borrows, moves, and all sorts of other things, depending on the
  precise patterns in use. This is all vastly simplified and explicit
  in MIR.

One thing the current MIR does not make as explicit as it
could is when something is *moved*. For by-value uses of a value, the
code must still consult the type of the value to decide if that is a
move or not. This could be made more explicit in the IR.

### Which analyses are well-suited to the MIR?

Some analyses are better suited to the AST than to a MIR. The
following is a list of work the compiler does that would benefit from
using a MIR:

- **liveness checking**: this is used to issue warnings about unused assignments
  and the like. The MIR is perfect for this sort of data-flow analysis.
- **borrow and move checking**: the borrow checker already uses a
  combination of the CFG and `ExprUseVisitor` to try and achieve a
  similarly low level of detail.
- **translation to IR**: the MIR is much closer than the AST to
  the desired bitcode end-product.

Some other passes would probably work equally well on the MIR or an
AST, but they will likely find the MIR somewhat easier to work with
than the current AST simply because it is, well, simpler:

- **rvalue checking**, which checks that things are `Sized` which need to be.
- **reachability** and **death checking**.

These items are likely ill-suited to the MIR as designed:

- **privacy checking**, since it relies on explicit knowledge of paths that is not
  necessarily present in the MIR.
- **lint checking**, since it is often dependent on the sort of surface details
  we are seeking to obscure.

For some passes, the impact is not entirely clear. In particular,
**match exhaustiveness checking** could easily be subsumed by the MIR
construction process, which must do a similar analysis during the
lowering process. However, once the MIR is built, the match is
completely desugared into more primitive switches and so forth, so we
will need to leave some markers in order to know where to check for
exhaustiveness and to reconstruct counter examples.

# Detailed design

### What is *really* being proposed here?

The rest of this section goes into detail on a particular MIR design.
However, the true purpose of this RFC is not to nail down every detail
of the MIR -- the details are expected to evolve and change over time anyway
-- but rather to establish some high-level principles which drive the
rest of the design:

1. We should indeed lower the representation from an AST to something
   else that will drive later analyses, and this representation should
   be based on a CFG, not a tree.
2. This representation should be explicitly minimal and not attempt to retain
   the original syntactic structure, though it should be possible to recover enough
   of it to make quality error messages.
3. This representation should encode drops, panics, and other
   scope-dependent items explicitly.
4. This representation does not have to be well-typed Rust, though it
   should be possible to type-check it using a tweaked variant on the
   Rust type system.

### Prototype

The MIR design being described can be found [here][crate]. In
particular, [this module][mir] defines the MIR representation,
and [this `build` module][build] contains the code to create a MIR
representation from an AST-like form.

For increased flexibility, as well as to make the code simpler, the
prototype is not coded directly against the compiler's AST, but rather
against an idealized representation defined by [the `HAIR` trait][hair].
Note that this HAIR trait is entirely independent from the HIR discussed by
nrc in [RFC 1191][1191] -- you can think of it as an abstract trait
that any high-level Rust IR could implement, including our current
AST. Moreover, it's just an implementation detail and not part of the
MIR being proposed here per se. Still, if you want to read the code,
you have to understand its design.

The `HAIR` trait contains a number of opaque associated types for the
various aspects of the compiler. For example, the type `H::Expr`
represents an expression. In order to find out what kind of expression
it is, the `mirror` method is called, which converts an `H::Expr` into
an `Expr<H>` mirror. This mirror then contains embedded `ExprRef<H>`
nodes to refer to further subexpressions; these may either be mirrors
themselves, or else they may be additional `H::Expr` nodes. This
allows the tree that is exported to differ in small ways from the
actual tree within the compiler; the primary intention is to use this
to model "adjustments" like autoderef. The code to convert from our
current AST to the HAIR is not yet complete, but it can be found
[here][hair-mod].
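
A minimal sketch of the shape such a trait might take (invented names,
not the prototype's actual definitions), just to make the "opaque
associated types plus mirror" pattern concrete:

```rust
// Sketch only: illustrates the pattern described above with made-up names;
// it is not the real trait from the prototype.
trait Hair {
    type Expr;

    // Convert the compiler's opaque expression into an inspectable mirror.
    fn mirror(&self, expr: &Self::Expr) -> Expr<Self>
    where
        Self: Sized;
}

// The mirror enumerates the expression kinds the MIR builder cares about.
enum Expr<H: Hair> {
    Literal(i64),
    Borrow(ExprRef<H>),
    // ... other kinds elided
}

// A subexpression is either still an opaque H::Expr or already a mirror.
enum ExprRef<H: Hair> {
    Hair(H::Expr),
    Mirror(Box<Expr<H>>),
}
```
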
Note that the HAIR mirroring system is an experiment and not really
part of the MIR itself. It does however present an interesting option
for (eventually) stabilizing access to the compiler's internals.

[crate]: https://github.com/rust-lang/rust/tree/2532ad7d0b2edac1909ef0ef346230331200790b/src/librustc/mir
[mir]: https://github.com/rust-lang/rust/blob/2532ad7d0b2edac1909ef0ef346230331200790b/src/librustc/mir/mod.rs
[build]: https://github.com/rust-lang/rust/tree/fc04eaacc5bd5760e98cd3aa390dcc3ae795d12f/src/librustc_mir/build
[hair]: https://github.com/rust-lang/rust/tree/fc04eaacc5bd5760e98cd3aa390dcc3ae795d12f/src/librustc_mir/hair
[1191]: https://github.com/rust-lang/rfcs/pull/1191
[hair-mod]: https://github.com/rust-lang/rust/blob/21c61336bb9e327b90f4cb8e87a948be40eeafe5/src/librustc_mir/hair/mod.rs

### Overview of the MIR

The proposed MIR always describes the execution of a single fn. At
the highest level it consists of a series of declarations regarding
the stack storage that will be required and then a set of basic
blocks:

    MIR = fn({TYPE}) -> TYPE {
        {let [mut] B: TYPE;}  // user-declared bindings and their types
        {let TEMP: TYPE;}     // compiler-introduced temporary
        {BASIC_BLOCK}         // control-flow graph
    };

The storage declarations are broken into two categories. User-declared
bindings have a 1-to-1 relationship with the variables specified in
the program. Temporaries are introduced by the compiler in various
cases. For example, borrowing an lvalue (e.g., `&foo()`) will
introduce a temporary to store the result of `foo()`. Similarly,
discarding a value `foo();` is translated to something like `let tmp =
foo(); drop(tmp);`. Temporaries are single-assignment, but because
they can be borrowed they may be mutated after this assignment and
hence they differ somewhat from variables in a pure SSA
representation.
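
As a rough illustration (a hypothetical lowering, written in the
notation above), a statement like `let v = &compute();`, where
`compute` is some function returning an `i32`, gives rise to one
user-declared binding and one compiler-introduced temporary:

    let v: &i32;     // user-declared binding
    let tmp0: i32;   // temporary holding the result of compute()

    // ... and then, in some basic block, roughly:
    // call(tmp0 = compute(), BB1, BB2)
    // v = &tmp0
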
The proposed MIR takes the form of a graph where each node is a *basic
block*. A basic block is a standard compiler term for a continuous
sequence of instructions with a single entry point. All interesting
control-flow happens between basic blocks. Each basic block has an id
`BB` and consists of a sequence of statements and a terminator:

    BASIC_BLOCK = BB: {STATEMENT} TERMINATOR

A `STATEMENT` can have one of two forms:

    STATEMENT = LVALUE "=" RVALUE        // assign rvalue into lvalue
              | Drop(DROP_KIND, LVALUE)  // drop value if needed
    DROP_KIND = SHALLOW                  // (see discussion below)
              | DEEP

The following sections dive into these various kinds of statements in
more detail.

The `TERMINATOR` for a basic block describes how it connects to
subsequent blocks:

    TERMINATOR = GOTO(BB)              // normal control-flow
               | PANIC(BB)             // initiate unwinding, branching to BB for cleanup
               | IF(LVALUE, BB0, BB1)  // test LVALUE and branch to BB0 if true, else BB1
               | SWITCH(LVALUE, BB...) // load discriminant from LVALUE (which must be an enum),
                                       // and branch to BB... depending on which variant it is
               | CALL(LVALUE0 = LVALUE1(LVALUE2...), BB0, BB1)
                                       // call LVALUE1 with LVALUE2... as arguments. Write
                                       // result into LVALUE0. Branch to BB0 if it returns
                                       // normally, BB1 if it is unwinding.
               | DIVERGE               // return to caller, unwinding
               | RETURN                // return to caller normally

Most of the terminators should be fairly obvious. The most interesting
part is the handling of unwinding. This aligns fairly closely with how
LLVM works: there is one terminator, PANIC, that initiates unwinding.
It immediately branches to a handler (BB) which will perform cleanup
and (eventually) reach a block that has a DIVERGE terminator. DIVERGE
causes unwinding to continue up the stack.
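
For example (a sketch in the notation above, using a hypothetical
local `x` that needs cleanup), a panic edge might look like:

    B0: {
        ...
        PANIC(B1)          // begin unwinding; jump to the cleanup block
    }

    B1: {
        Drop(DEEP, x);     // run x's destructor while unwinding
        DIVERGE            // continue unwinding into the caller
    }
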
Because calls to other functions can always (or almost always) panic,
calls are themselves a kind of terminator. If we can determine that
some function we are calling cannot unwind, we can always modify the
IR to make the second basic block optional. (We could also add an
`RVALUE` to represent calls, but it's probably easiest to keep the
call as a terminator unless the memory savings of consolidating basic
blocks are found to be worthwhile.)

It's worth pointing out that basic blocks are just a kind of
compile-time and memory-use optimization; there is no semantic
difference between a single block and two blocks joined by a GOTO
terminator.

### Assignments, values, and rvalues

The primary kind of statement is an assignment:

    LVALUE "=" RVALUE

The semantics of this operation are to first evaluate the RVALUE and
then store it into the LVALUE (which must represent a memory location
of suitable type).

An `LVALUE` represents a path to a memory location. This is the basic
"unit" analyzed by the borrow checker. It is always possible to
evaluate an `LVALUE` without triggering any side-effects (modulo
dereferences of unsafe pointers, which naturally can trigger arbitrary
behavior if the pointer is not valid).

    LVALUE = B                   // reference to a user-declared binding
           | TEMP                // a temporary introduced by the compiler
           | ARG                 // a formal argument of the fn
           | STATIC              // a reference to a static or static mut
           | RETURN              // the return pointer of the fn
           | LVALUE.f            // project a field or tuple field, like x.f or x.0
           | *LVALUE             // dereference a pointer
           | LVALUE[LVALUE]      // index into an array (see disc. below about bounds checks)
           | (LVALUE as VARIANT) // downcast to a specific variant of an enum,
                                 // see the section on desugaring matches below
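
To make the projection forms concrete (a sketch, not taken from the
prototype): if the fn's first formal argument is `p: &Point`, and
`Point` has a field `x`, then the Rust place `p.x` corresponds to an
lvalue built from two projections applied to that argument -- first a
dereference, then a field access:

    (*ARG).x    // deref the argument, then project out the field `x`
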
An `RVALUE` represents a computation that yields a result. This result
must be stored in memory somewhere to be accessible. The MIR does not
contain any kind of nested expressions: everything is flattened out,
going through lvalues as intermediaries.

    RVALUE = Use(LVALUE)                // just read an lvalue
           | [LVALUE; LVALUE]
           | &'REGION LVALUE
           | &'REGION mut LVALUE
           | LVALUE as TYPE
           | LVALUE <BINOP> LVALUE
           | <UNOP> LVALUE
           | Struct { f: LVALUE0, ... } // aggregates, see section below
           | (LVALUE...LVALUE)
           | [LVALUE...LVALUE]
           | CONSTANT
           | LEN(LVALUE)                // load length from a slice, see section below
           | BOX                        // malloc for builtin box, see section below
    BINOP = + | - | * | / | ...         // excluding && and ||
    UNOP = ! | -                        // note: no `*`, as that is part of LVALUE

One thing worth pointing out is that the binary and unary operators
are only the *builtin* form, operating on scalar values. Overloaded
operators will be desugared to trait calls. Moreover, all method calls
are desugared into normal calls via UFCS form.
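
As a small illustration in surface Rust (a sketch; the type `Meters`
and the function `demo` are made up for this example), an overloaded
`+` is not a MIR `BINOP` -- it becomes an explicit trait-method call,
written here in the UFCS form that the MIR-level call amounts to:

```rust
use std::ops::Add;

#[derive(Clone, Copy, PartialEq, Debug)]
struct Meters(f64);

impl Add for Meters {
    type Output = Meters;
    fn add(self, rhs: Meters) -> Meters { Meters(self.0 + rhs.0) }
}

fn demo(a: Meters, b: Meters) {
    let surface = a + b;                        // what the user writes
    let desugared = <Meters as Add>::add(a, b); // roughly the call the MIR performs
    assert_eq!(surface, desugared);
}
```
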
### Constants

Constants are a subset of rvalues that can be evaluated at compilation
time:

    CONSTANT = INT
             | UINT
             | FLOAT
             | BOOL
             | BYTES
             | STATIC_STRING
             | ITEM<SUBSTS>                // reference to an item or constant etc
             | <P0 as TRAIT<P1...Pn>>      // projection
             | CONSTANT(CONSTANT...)       //
             | CAST(CONSTANT, TY)          // foo as bar
             | Struct { (f: CONSTANT)... } // aggregates...
             | (CONSTANT...)               //
             | [CONSTANT...]               //

### Aggregates and further lowering

The set of rvalues includes "aggregate" expressions like `(x, y)` or
`Foo { f: x, g: y }`. This is a place where the MIR (somewhat) departs
from what will be generated at compilation time, since (often) an
expression like `f = (x, y, z)` will wind up desugared into a series
of piecewise assignments like:

    f.0 = x;
    f.1 = y;
    f.2 = z;

However, there are good reasons to include aggregates as first-class
rvalues. For one thing, if we break down each aggregate into the
specific assignments that would be used to construct the value, then
zero-sized types are *never* assigned, since there is no data to
actually move around at runtime. This means that the compiler couldn't
distinguish uninitialized variables from initialized ones. That is,
code like this:

```rust
let x: (); // note: never initialized
use(x)
```

and this:

```rust
let x: () = ();
use(x);
```

would desugar to the same MIR. That is a problem, particularly with
respect to destructors: imagine that instead of the type `()`, we used
a type like `struct Foo;` where `Foo` implements `Drop`.

Another advantage is that building aggregates in a two-step way
assures the proper execution order when unwinding occurs before the
complete value is constructed. In particular, we want to drop the
intermediate results in the order that they appear in the source, not
in the order in which the fields are specified in the struct
definition.
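
A small illustration in surface Rust (a sketch; the types `Loud` and
`Pair` are made up for this example): if the initializer for `a`
panics below, the value already computed for `b` must still be dropped
while unwinding, even though the complete `Pair` was never built.

```rust
struct Loud(&'static str);

impl Drop for Loud {
    fn drop(&mut self) { println!("dropping {}", self.0); }
}

struct Pair { a: Loud, b: Loud }

fn build() -> Pair {
    Pair {
        b: Loud("b"),           // evaluated first, in source order
        a: panic!("a failed"),  // unwinds; the intermediate value for `b` is dropped
    }
}
```
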
A final reason to include aggregates is that, at runtime, the
representation of an aggregate may indeed fit within a single word, in
which case making a temporary and writing the fields piecemeal may in
fact not be the correct representation.

In any case, after the move and correctness checking is done, it is
easy enough to remove these aggregate rvalues and replace them with
assignments. This could potentially be done during lowering, or
as a pre-pass that transforms MIR statements like:

    x = ...x;
    y = ...y;
    z = ...z;
    f = (x, y, z)

to:

    x = ...x;
    y = ...y;
    z = ...z;
    f.0 = x;
    f.1 = y;
    f.2 = z;

combined with another pass that removes temporaries that are only used
within a single assignment (and nowhere else):

    f.0 = ...x;
    f.1 = ...y;
    f.2 = ...z;

Going further, once type-checking is done, it is plausible to do
further lowering within the MIR purely for optimization purposes. For
example, we could introduce intermediate references to cache the
results of common lvalue computations and so forth.

### Bounds checking

Because bounds checks are fallible, it's important to encode them in
the MIR whenever we do indexing. Otherwise the `trans` code would have
to figure out on its own how to do unwinding at that point. Because
the MIR doesn't "desugar" fat pointers, we include a special rvalue
`LEN` that extracts the length from an array value whose type matches
`[T]` or `[T;n]` (in the latter case, it yields a constant). Using
this, we desugar an array reference like `y = arr[x]` as follows:

    let len: usize;
    let idx: usize;
    let lt: bool;

    B0: {
        len = len(arr);
        idx = x;
        lt = idx < len;
        if lt { B1 } else { B2 }
    }

    B1: {
        y = arr[idx]
        ...
    }

    B2: {
        <panic>
    }

The key point here is that we create a temporary (`idx`) capturing the
value that we bounds checked and we ensure that there is a comparison
against the length.

### Overflow checking

Similarly, since overflow checks can trigger a panic, they ought to be
exposed in the MIR as well. This is handled by having distinct binary
operators for "add with overflow" and so forth, analogous to the LLVM
intrinsics. These operators yield a tuple of (result, overflow), so
`result = left + right` might be translated like:

    let tmp: (u32, bool);

    B0: {
        tmp = left + right;
        if(tmp.1, B2, B1)
    }

    B1: {
        result = tmp.0
        ...
    }

    B2: {
        <panic>
    }
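
For comparison (a sketch in surface Rust, not the RFC's notation), the
same check written by hand using the standard library's
`overflowing_add`, which returns the same (result, overflow) pair that
the MIR-level "add with overflow" operator yields:

```rust
fn checked_sum(left: u32, right: u32) -> u32 {
    let (result, overflowed) = left.overflowing_add(right);
    if overflowed {
        // Corresponds to the branch to the panicking block B2 above.
        panic!("attempt to add with overflow");
    }
    result
}
```
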
### Matches

One of the goals of the MIR is to desugar matches into something much
more primitive, so that we are freed from reasoning about their
complexity. This is primarily achieved through a combination of SWITCH
terminators and downcasts. To get the idea, consider this simple match
statement:

```rust
match foo() {
    Some(ref v) => ...0,
    None => ...1
}
```

This would be converted into MIR as follows (leaving out the unwinding support):

    BB0 {
        call(tmp = foo(), BB1, ...);
    }

    BB1 {
        switch(tmp, BB2, BB3) // two branches, corresponding to the Some and None variants resp.
    }

    BB2 {
        v = &(tmp as Option::Some).0;
        ...0
    }

    BB3 {
        ...1
    }

There are some interesting cases that arise from matches that are
worth examining.

**Vector patterns.** Currently, (unstable) Rust supports vector
patterns which permit borrows that would not otherwise be legal:

```rust
let mut vec = [1, 2];
match vec {
    [ref mut p, ref mut q] => { ... }
}
```

If this code were written using `p = &mut vec[0], q = &mut vec[1]`,
the borrow checker would complain. This is because it does not attempt
to reason about indices being disjoint, even if they are constant
(this is a limitation we may wish to consider lifting at some point in
the future, however).

To accommodate these, we plan to desugar such matches into lvalues
using the special "constant index" form. The borrow checker would be
able to reason that two constant indices are disjoint but it could
consider "variable indices" to be (potentially) overlapping with all
constant indices. This is a fairly straightforward thing to do (and in
fact the borrow checker already includes similar logic, since the
`ExprUseVisitor` encounters a similar dilemma trying to resolve
borrows).
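
A sketch of what such a desugaring might produce (the "constant index"
notation here is hypothetical, purely to illustrate the idea): the two
borrows refer to lvalues whose indices are known, distinct constants,
so the borrow checker can see that they are disjoint:

    p = &mut vec[const 0];   // constant index 0 ...
    q = &mut vec[const 1];   // ... is disjoint from constant index 1
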
### Drops

The `Drop(DROP_KIND, LVALUE)` instruction is intended to represent
"automatic" compiler-inserted drops. The semantics of a `Drop` is that
it drops "if needed". This means that the compiler can insert it
everywhere that a `Drop` would make sense (due to scoping), and assume
that instrumentation will be done as needed to prevent double
drops. Currently, this signaling is done by zeroing out memory at
runtime, but we are in the process of introducing stack flags for this
purpose: the MIR offers the opportunity to reify those flags if we
wanted, and rewrite drops to be more narrow.

To illustrate how drop works, let's work through a simple
example. Imagine that we have a snippet of code like:

```rust
{
    let x = Box::new(22);
    send(x);
}
```

The compiler would generate a drop for `x` at the end of the block,
but the value `x` would also be moved as part of the call to `send`.
A later analysis could easily strip out this `Drop` since it is evident
that the value is always used on all paths that lead to `Drop`.
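
In MIR terms (a rough sketch in the notation above, with the unwind
path elided), the lowering contains both the move into the call and
the automatic end-of-scope drop; the drop is what a later pass can
delete:

    B0: {
        ...
        call(tmp0 = send(x), B1, B2)  // x is moved into the call here
    }

    B1: {
        drop(Deep, x);   // end-of-scope drop: dynamically a no-op because x
        ...              // was moved, and a later pass can remove it outright
    }
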
### Shallow drops and Box

The MIR includes the distinction between "shallow" and "deep"
drop. Deep drop is the normal thing, but shallow drop is used when
partially initializing boxes. This is tied to the `box` keyword.
For example, an assignment like the following:

    let x = box Foo::new();

would be translated to something like the following:

    let tmp: Box<Foo>;

    B0: {
        tmp = BOX;
        f = Foo::new; // constant reference
        call(*tmp, f, B1, B2);
    }

    B1: { // successful return of the call
        x = use(tmp); // move of tmp
        ...
    }

    B2: { // calling Foo::new() panic'd
        drop(Shallow, tmp);
        diverge;
    }

The interesting part here is the block B2, which indicates the case
that `Foo::new()` invoked unwinding. In that case, we have to free the
box that we allocated, but we only want to free the box itself, not
its contents (it is not yet initialized).

Note that having this kind of builtin box code is a legacy thing. The
more generalized protocol that [RFC 809][809] specifies works in
more-or-less exactly the same way: when that is adopted uniformly, the
need for shallow drop and the Box rvalue will go away.

[809]: https://github.com/rust-lang/rfcs/blob/master/text/0809-box-and-in-for-stdlib.md

### Phasing

Ideally, the translation to MIR would be done during type checking,
but before "region checking". This is because we would like to
implement non-lexical lifetimes eventually, and doing that well would
require access to a control-flow graph. Given that we do very limited
reasoning about regions at present, this should not be a problem.

### Representing scopes

Lexical scopes in Rust play a large role in terms of when destructors
run and how the reasoning about lifetimes works. However, they are
completely erased by the graph format. For the most part, this is not
an issue, since drops are encoded explicitly into the control-flow
where needed. However, one place that we still need to reason about
scopes (at least in the short term) is in region checking, because
currently regions are encoded in terms of scopes, and we have to be
able to map that to a region in the graph. The MIR therefore includes
extra information mapping every scope to a SEME region (single-entry,
multiple-exit). If/when we move to non-lexical lifetimes, regions
would be defined in terms of the graph itself, and the need to retain
scoping information should go away.

### Monomorphization

Currently, we do monomorphization at translation time. If we ever
chose to do it at a MIR level, that would be fine, but one thing to be
careful of is that we may be able to elide `Drop` nodes based on the
specific types.

### Unchecked assertions

There are various bits of the MIR that are not trivially type-checked.
In general, these are properties which are assured in Rust by
construction in the high-level syntax, and thus we must be careful not
to do any transformation that would endanger them after the fact.

- **Bounds-checking.** We introduce explicit bounds checks into the IR
  that guard all indexing lvalues, but there is no explicit connection
  between this check and the later accesses.
- **Downcasts to a specific variant.** We test variants with a SWITCH
  opcode but there is no explicit connection between this test and
  later downcasts (see the sketch below).
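
To see the second point (reusing the match lowering sketched earlier),
the downcast in BB2 is only valid because control reached it through
the SWITCH in BB1, yet nothing in the MIR itself ties the two together:

    BB1 {
        switch(tmp, BB2, BB3)         // tests which variant tmp holds
    }

    BB2 {
        v = &(tmp as Option::Some).0; // downcast, assumed here to be safe
        ...
    }
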
This need for unchecked operations results from trying to lower and
simplify the representation as much as possible, as well as trying to
represent all panics explicitly. We believe the tradeoff to be
worthwhile, particularly since:

1. the existing analyses can continue to generally assume that these
   properties hold (e.g., that all indices are in bounds and all
   downcasts are safe); and,
2. it would be trivial to implement a static dataflow analysis
   checking that bounds and downcasts only occur downstream of a relevant
   check.

# Drawbacks

**Converting from AST to a MIR will take some compilation time.**
Expectations are that constructing the MIR will be quite fast, and
that follow-on code (such as `trans` and `borrowck`) will execute faster,
because they will operate over a simpler and more compact
representation. However, this needs to be measured.

**More effort is required to make quality error messages.** Because
the representation the compiler is working with is now quite different
from what the user typed, we have to put in extra effort to make sure
that we bridge this gap when reporting errors. We have some precedent
for dealing with this, however. For example, the `ExprUseVisitor` (and
`mem_categorization`) includes extra annotations and hints to tell the
borrow checker when a reference was introduced as part of a closure
versus being explicit in the source code. The current prototype
doesn't have much in this direction, but it should be relatively
straightforward to add. Hints like those, in addition to spans, should
be enough to bridge the error message gap.

# Alternatives

**Use SSA.** In the proposed MIR, temporaries are single-assignment
but can be borrowed, making them more analogous to allocas than SSA
values. This is helpful to analyses like the borrow checker, because
it means that the program operates directly on paths through memory,
versus having the stack modeled as allocas. The current model is also
helpful for generating debuginfo.
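
A small illustration of the difference (a sketch in surface Rust): the
local below is assigned only once, but it is then mutated through a
borrow, which is why MIR temporaries behave like memory locations
(allocas) rather than SSA values.

```rust
fn example() -> i32 {
    let mut tmp = 1;    // single assignment here ...
    let r = &mut tmp;
    *r += 1;            // ... but mutated later through the borrow
    tmp
}
```
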
SSA representation can be helpful for more sophisticated backend
optimizations. However, it makes more sense to have the MIR be based on
lvalues. There are some cases where it might make sense to do analyses
on the MIR that would benefit from SSA, such as bounds check elision.
In those cases, we could either quickly identify those temporaries
that are not mutably borrowed (and which therefore act like SSA
variables); or further lower into a LIR (which would be an SSA
form); or else simply perform the analyses on the MIR using standard
techniques like def-use chains. (CSE and so forth are straightforward
both with and without SSA, honestly.)

**Exclude unwinding.** Excluding unwinding from the MIR would allow us
to elide annoying details like bounds and overflow checking. These are
not particularly interesting to `borrowck`, so that is somewhat
appealing. But that would mean that consumers of MIR would have to
reconstruct the order of drops and so forth on unwinding paths, which
would require them to reason about scopes and other rather complex
bits of information. Moreover, having all drops fully exposed in the
MIR is likely helpful for better handling of dynamic drop and also for
the rules collectively known as dropck, though all details there have
not been worked out.

**Expand the set of operands.** The proposed MIR forces all rvalue operands
to be lvalues. This means that integer constants and other "simple" things
will wind up introducing a temporary. For example, translating `x = 2+2`
will generate code like:

    tmp0 = 2
    tmp1 = 2
    x = tmp0 + tmp1

A more common case will be calls to statically known functions like `x = foo(3)`,
which desugars to a temporary and a constant reference:

    tmp0 = foo;
    tmp1 = 3
    x = tmp0(tmp1)

There is no particular *harm* in such constants: it would be very easy
to optimize them away when reducing to bitcode, and if we do not
do so, a backend may do it. However, we could also expand the scope of
operands to include both lvalues and some simple rvalues like
constants. The main advantage of this is that it would reduce the
total number of statements and hence might help with memory
consumption.

**Totally safe MIR.** This MIR includes operations whose safety is not
trivially type-checked (see the section on *unchecked assertions*
above). We might design a higher-level MIR where those properties held
by construction, or modify the MIR to thread "evidence" of some form
that makes it easier to check that the properties hold. The former
would make downstream code accommodate more complexity. The latter
remains an option in the future but doesn't seem to offer much
practical advantage.

# Unresolved questions

**What additional info is needed to provide for good error messages?**
Currently the implementation only has spans on statements, not lvalues
or rvalues. We'll have to experiment here. I expect we will probably
wind up placing "debug info" on all lvalues, which includes not only a
span but also a "translation" into terms the user understands. For
example, in a closure, a reference to a by-reference upvar `foo` will
be translated to something like `*self.foo`, and we would like that to
be displayed to the user as just `foo`.

**What additional info is needed for debuginfo?** It may be that to
generate good debuginfo we want to include additional information
about control-flow or scoping.

**Unsafe blocks.** Should we layer unsafe in the MIR so that effect
checking can be done on the CFG? It's not the most natural way to do
it, *but* it would make it fairly easy to support (e.g.) autoderef on
unsafe pointers, since all the implicit operations are made explicit
in the MIR. My hunch is that we can improve our HIR instead.