- Feature Name: N/A
- Start Date: 2015-07-14
- RFC PR: [rust-lang/rfcs#1211](https://github.com/rust-lang/rfcs/pull/1211)
- Rust Issue: [rust-lang/rust#27840](https://github.com/rust-lang/rust/issues/27840)

# Summary

Introduce a "mid-level IR" (MIR) into the compiler. The MIR desugars
most of Rust's surface representation, leaving a simpler form that is
well-suited to type-checking and translation.

# Motivation

The current compiler uses a single AST from the initial parse all the
way to the final generation of bitcode. While this has some advantages,
there are also a number of distinct downsides.

1. The complexity of the compiler is increased because all passes must
   be written against the full Rust language, rather than being able
   to consider a reduced subset. The MIR proposed here is *radically*
   simpler than the surface Rust syntax -- for example, it contains no
   "match" statements, and converts both `ref` bindings and `&`
   expressions into a single form.

   a. There are numerous examples of "desugaring" in Rust. In
      principle, desugaring one language feature into another should
      make the compiler *simpler*, but in our current implementation,
      it tends to make things more complex, because every phase must
      simulate the desugaring anew. The most prominent example is the
      closure expression (`|| ...`), which desugars to a fresh struct
      instance, but other examples abound: `for` loops, `if let` and
      `while let`, `box` expressions, overloaded operators (which
      desugar to method calls), and method calls (which desugar to
      UFCS notation).

   b. There are a number of features which are almost infeasible to
      implement today but which should be much easier given a MIR
      representation. Examples include box patterns and non-lexical
      lifetimes.

2. Reasoning about fine-grained control-flow in an AST is rather
   difficult. The right tool for this job is a control-flow graph
   (CFG). We currently construct a CFG that lives "on top" of the AST,
   which allows the borrow checking code to be flow sensitive, but it
   is awkward to work with. Worse, because this CFG is not used by
   trans, it is not necessarily the case that the control-flow as seen
   by the analyses corresponds to the code that will be generated.
   The MIR is based on a CFG, resolving this situation.

3. The reliability of safety analyses is reduced because the gap
   between what is being analyzed (the AST) and what is being executed
   (bitcode) is very wide. The MIR is very low-level and hence the
   translation to bitcode should be straightforward.

4. The reliability of safety proofs, when we have some, would be
   reduced because the formal language we are modeling is so far from
   the full compiler AST. The MIR is simple enough that it should be
   possible to (eventually) make safety proofs based on the MIR
   itself.

5. Rust-specific optimizations, and optimizing `trans` output, are very
   challenging. There are numerous cases where it would be nice to be
   able to do optimizations *before* translating to bitcode, or
   to take advantage of Rust-specific knowledge of which a backend may be
   unaware. Currently, we are forced to do these optimizations as part
   of lowering to bitcode, which can get quite complex. Having an
   intermediate form improves the situation because:

   a. In some cases, we can do the optimizations in the MIR itself before translation.

   b. In other cases, we can do analyses on the MIR to easily determine when the optimization
      would be safe.

   c. In all cases, whatever we can do on the MIR will be helpful for other
      targets beyond existing backends (see next bullet).

6. Migrating away from LLVM is nearly impossible, since so much of the
   semantics of Rust itself is embedded in the `trans` step which
   converts to LLVM IR. Under the MIR design, those semantics are
   instead described in the translation from AST to MIR, and the LLVM
   step itself simply applies optimizations.

Given the numerous benefits of a MIR, you may wonder why we have not
taken steps in this direction earlier. In fact, we have a number of
structures in the compiler that simulate the effect of a MIR:

1. Adjustments. Every expression can have various adjustments, like
   autoderefs and so forth. These are computed by the type-checker
   and then read by later analyses. This is a form of MIR, but not a particularly
   convenient one.
2. The CFG. The CFG tries to model the flow of execution as a graph
   rather than a tree, to help analyses in dealing with complex
   control-flow formed by things like loops, `break`, `continue`, etc.
   This CFG is however inferior to the MIR in that it is only an
   approximation of control-flow and does not include all the
   information one would need to actually execute the program (for
   example, for an `if` expression, the CFG would indicate that two
   branches are possible, but would not contain enough information to
   decide which branch to take).
3. `ExprUseVisitor`. The `ExprUseVisitor` is designed to work in
   conjunction with the CFG. It walks the AST and highlights actions
   of interest to later analyses, such as borrows or moves. For each
   such action, the analysis gets a callback indicating the point in
   the CFG where the action occurred along with what
   happened. Overloaded operators, method calls, and so forth are
   "desugared" into their more primitive operations. This is
   effectively a kind of MIR, but it is not complete enough to do
   translation, since it focuses purely on borrows, moves, and other
   things of interest to the safety checker.

Each of these things was added in order to try and cope with the
complexity of working directly on the AST. The CFG for example
consolidates knowledge about control-flow into one piece of code,
producing a data structure that can be easily interpreted. Similarly,
the `ExprUseVisitor` consolidates knowledge of how to walk and
interpret the current compiler representation.

### Goals

It is useful to think about what "knowledge" the MIR should
encapsulate. Here is a listing of the kinds of things that should be
explicit in the MIR and thus that downstream code won't have to
re-encode in the form of repeated logic:

- **Precise ordering of control-flow.** The CFG makes this very explicit,
  and the individual statements and nodes in the MIR are very small
  and detailed and hence nothing "interesting" happens in the middle
  of an individual node with respect to control-flow.
- **What needs to be dropped and when.** The set of data that needs to
  be dropped and when is a fairly complex thing to calculate: you have
  to know what's in scope, including temporary values and so forth.
  In the MIR, all drops are explicit, including those that result from
  panics and unwinding.
- **How matches are desugared.** Reasoning about matches has been a
  traditional source of complexity. Matches combine traversing types
  with borrows, moves, and all sorts of other things, depending on the
  precise patterns in use. This is all vastly simplified and explicit
  in MIR.

One thing the current MIR does not make as explicit as it
could is when something is *moved*. For by-value uses of a value, the
code must still consult the type of the value to decide if that is a
move or not. This could be made more explicit in the IR.

### Which analyses are well-suited to the MIR?

Some analyses are better suited to the AST than to a MIR. The
following is a list of work the compiler does that would benefit from
using a MIR:

- **liveness checking**: this is used to issue warnings about unused assignments
  and the like. The MIR is perfect for this sort of data-flow analysis.
- **borrow and move checking**: the borrow checker already uses a
  combination of the CFG and `ExprUseVisitor` to try and achieve a
  similarly low level of detail.
- **translation to IR**: the MIR is much closer than the AST to
  the desired bitcode end-product.

Some other passes would probably work equally well on the MIR or an
AST, but they will likely find the MIR somewhat easier to work with
than the current AST simply because it is, well, simpler:

- **rvalue checking**, which checks that things are `Sized` which need to be.
- **reachability** and **death checking**.

These items are likely ill-suited to the MIR as designed:

- **privacy checking**, since it relies on explicit knowledge of paths that is not
  necessarily present in the MIR.
- **lint checking**, since it is often dependent on the sort of surface details
  we are seeking to obscure.

For some passes, the impact is not entirely clear. In particular,
**match exhaustiveness checking** could easily be subsumed by the MIR
construction process, which must do a similar analysis during the
lowering process. However, once the MIR is built, the match is
completely desugared into more primitive switches and so forth, so we
will need to leave some markers in order to know where to check for
exhaustiveness and to reconstruct counter examples.

# Detailed design

### What is *really* being proposed here?

The rest of this section goes into detail on a particular MIR design.
However, the true purpose of this RFC is not to nail down every detail
of the MIR -- the details are expected to evolve and change over time anyway
-- but rather to establish some high-level principles which drive the
rest of the design:

1. We should indeed lower the representation from an AST to something
   else that will drive later analyses, and this representation should
   be based on a CFG, not a tree.
2. This representation should be explicitly minimal and not attempt to retain
   the original syntactic structure, though it should be possible to recover enough
   of it to make quality error messages.
3. This representation should encode drops, panics, and other
   scope-dependent items explicitly.
4. This representation does not have to be well-typed Rust, though it
   should be possible to type-check it using a tweaked variant on the
   Rust type system.

### Prototype

The MIR design being described can be found [here][crate]. In
particular, [this module][mir] defines the MIR representation,
and [this `build` module][build] contains the code to create a MIR
representation from an AST-like form.

For increased flexibility, as well as to make the code simpler, the
prototype is not coded directly against the compiler's AST, but rather
against an idealized representation defined by [the `HAIR` trait][hair].
Note that this HAIR trait is entirely independent from the HIR discussed by
nrc in [RFC 1191][1191] -- you can think of it as an abstract trait
that any high-level Rust IR could implement, including our current
AST. Moreover, it's just an implementation detail and not part of the
MIR being proposed here per se. Still, if you want to read the code,
you have to understand its design.

The `HAIR` trait contains a number of opaque associated types for the
various aspects of the compiler. For example, the type `H::Expr`
represents an expression. In order to find out what kind of expression
it is, the `mirror` method is called, which converts an `H::Expr` into
an `Expr<H>` mirror. This mirror then contains embedded `ExprRef<H>`
nodes to refer to further subexpressions; these may either be mirrors
themselves, or else they may be additional `H::Expr` nodes. This
allows the tree that is exported to differ in small ways from the
actual tree within the compiler; the primary intention is to use this
to model "adjustments" like autoderef. The code to convert from our
current AST to the HAIR is not yet complete, but it can be found
[here][hair-mod].
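
A minimal sketch of the shape such a trait might take (invented names,
not the prototype's actual definitions), just to make the "opaque
associated types plus mirror" pattern concrete:

```rust
// Sketch only: illustrates the pattern described above with made-up names;
// it is not the real trait from the prototype.
trait Hair {
    type Expr;

    // Convert the compiler's opaque expression into an inspectable mirror.
    fn mirror(&self, expr: &Self::Expr) -> Expr<Self>
    where
        Self: Sized;
}

// The mirror enumerates the expression kinds the MIR builder cares about.
enum Expr<H: Hair> {
    Literal(i64),
    Borrow(ExprRef<H>),
    // ... other kinds elided
}

// A subexpression is either still an opaque H::Expr or already a mirror.
enum ExprRef<H: Hair> {
    Hair(H::Expr),
    Mirror(Box<Expr<H>>),
}
```
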
Note that the HAIR mirroring system is an experiment and not really
part of the MIR itself. It does however present an interesting option
for (eventually) stabilizing access to the compiler's internals.

[crate]: https://github.com/rust-lang/rust/tree/2532ad7d0b2edac1909ef0ef346230331200790b/src/librustc/mir
[mir]: https://github.com/rust-lang/rust/blob/2532ad7d0b2edac1909ef0ef346230331200790b/src/librustc/mir/mod.rs
[build]: https://github.com/rust-lang/rust/tree/fc04eaacc5bd5760e98cd3aa390dcc3ae795d12f/src/librustc_mir/build
[hair]: https://github.com/rust-lang/rust/tree/fc04eaacc5bd5760e98cd3aa390dcc3ae795d12f/src/librustc_mir/hair
[1191]: https://github.com/rust-lang/rfcs/pull/1191
[hair-mod]: https://github.com/rust-lang/rust/blob/21c61336bb9e327b90f4cb8e87a948be40eeafe5/src/librustc_mir/hair/mod.rs

### Overview of the MIR

The proposed MIR always describes the execution of a single fn. At
the highest level it consists of a series of declarations regarding
the stack storage that will be required and then a set of basic
blocks:

    MIR = fn({TYPE}) -> TYPE {
        {let [mut] B: TYPE;}  // user-declared bindings and their types
        {let TEMP: TYPE;}     // compiler-introduced temporary
        {BASIC_BLOCK}         // control-flow graph
    };

The storage declarations are broken into two categories. User-declared
bindings have a 1-to-1 relationship with the variables specified in
the program. Temporaries are introduced by the compiler in various
cases. For example, borrowing an lvalue (e.g., `&foo()`) will
introduce a temporary to store the result of `foo()`. Similarly,
discarding a value `foo();` is translated to something like `let tmp =
foo(); drop(tmp);`. Temporaries are single-assignment, but because
they can be borrowed they may be mutated after this assignment and
hence they differ somewhat from variables in a pure SSA
representation.
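
As a rough illustration (a hypothetical lowering, written in the
notation above), a statement like `let v = &compute();`, where
`compute` is some function returning an `i32`, gives rise to one
user-declared binding and one compiler-introduced temporary:

    let v: &i32;     // user-declared binding
    let tmp0: i32;   // temporary holding the result of compute()

    // ... and then, in some basic block, roughly:
    // call(tmp0 = compute(), BB1, BB2)
    // v = &tmp0
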
The proposed MIR takes the form of a graph where each node is a *basic
block*. A basic block is a standard compiler term for a continuous
sequence of instructions with a single entry point. All interesting
control-flow happens between basic blocks. Each basic block has an id
`BB` and consists of a sequence of statements and a terminator:

    BASIC_BLOCK = BB: {STATEMENT} TERMINATOR

A `STATEMENT` can have one of two forms:

    STATEMENT = LVALUE "=" RVALUE        // assign rvalue into lvalue
              | Drop(DROP_KIND, LVALUE)  // drop value if needed
    DROP_KIND = SHALLOW                  // (see discussion below)
              | DEEP

The following sections dive into these various kinds of statements in
more detail.

The `TERMINATOR` for a basic block describes how it connects to
subsequent blocks:

    TERMINATOR = GOTO(BB)              // normal control-flow
               | PANIC(BB)             // initiate unwinding, branching to BB for cleanup
               | IF(LVALUE, BB0, BB1)  // test LVALUE and branch to BB0 if true, else BB1
               | SWITCH(LVALUE, BB...) // load discriminant from LVALUE (which must be an enum),
                                       // and branch to BB... depending on which variant it is
               | CALL(LVALUE0 = LVALUE1(LVALUE2...), BB0, BB1)
                                       // call LVALUE1 with LVALUE2... as arguments. Write
                                       // result into LVALUE0. Branch to BB0 if it returns
                                       // normally, BB1 if it is unwinding.
               | DIVERGE               // return to caller, unwinding
               | RETURN                // return to caller normally

Most of the terminators should be fairly obvious. The most interesting
part is the handling of unwinding. This aligns fairly closely with how
LLVM works: there is one terminator, PANIC, that initiates unwinding.
It immediately branches to a handler (BB) which will perform cleanup
and (eventually) reach a block that has a DIVERGE terminator. DIVERGE
causes unwinding to continue up the stack.
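
For example (a sketch in the notation above, using a hypothetical
local `x` that needs cleanup), a panic edge might look like:

    B0: {
        ...
        PANIC(B1)          // begin unwinding; jump to the cleanup block
    }

    B1: {
        Drop(DEEP, x);     // run x's destructor while unwinding
        DIVERGE            // continue unwinding into the caller
    }
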
Because calls to other functions can always (or almost always) panic,
calls are themselves a kind of terminator. If we can determine that
some function we are calling cannot unwind, we can always modify the
IR to make the second basic block optional. (We could also add an
`RVALUE` to represent calls, but it's probably easiest to keep the
call as a terminator unless the memory savings of consolidating basic
blocks are found to be worthwhile.)

It's worth pointing out that basic blocks are just a kind of
compile-time and memory-use optimization; there is no semantic
difference between a single block and two blocks joined by a GOTO
terminator.

### Assignments, values, and rvalues

The primary kind of statement is an assignment:

    LVALUE "=" RVALUE

The semantics of this operation are to first evaluate the RVALUE and
then store it into the LVALUE (which must represent a memory location
of suitable type).

An `LVALUE` represents a path to a memory location. This is the basic
"unit" analyzed by the borrow checker. It is always possible to
evaluate an `LVALUE` without triggering any side-effects (modulo
dereferences of unsafe pointers, which naturally can trigger arbitrary
behavior if the pointer is not valid).

    LVALUE = B                   // reference to a user-declared binding
           | TEMP                // a temporary introduced by the compiler
           | ARG                 // a formal argument of the fn
           | STATIC              // a reference to a static or static mut
           | RETURN              // the return pointer of the fn
           | LVALUE.f            // project a field or tuple field, like x.f or x.0
           | *LVALUE             // dereference a pointer
           | LVALUE[LVALUE]      // index into an array (see disc. below about bounds checks)
           | (LVALUE as VARIANT) // downcast to a specific variant of an enum,
                                 // see the section on desugaring matches below
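
To make the projection forms concrete (a sketch, not taken from the
prototype): if the fn's first formal argument is `p: &Point`, and
`Point` has a field `x`, then the Rust place `p.x` corresponds to an
lvalue built from two projections applied to that argument -- first a
dereference, then a field access:

    (*ARG).x    // deref the argument, then project out the field `x`
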
An `RVALUE` represents a computation that yields a result. This result
must be stored in memory somewhere to be accessible. The MIR does not
contain any kind of nested expressions: everything is flattened out,
going through lvalues as intermediaries.

    RVALUE = Use(LVALUE)                // just read an lvalue
           | [LVALUE; LVALUE]
           | &'REGION LVALUE
           | &'REGION mut LVALUE
           | LVALUE as TYPE
           | LVALUE <BINOP> LVALUE
           | <UNOP> LVALUE
           | Struct { f: LVALUE0, ... } // aggregates, see section below
           | (LVALUE...LVALUE)
           | [LVALUE...LVALUE]
           | CONSTANT
           | LEN(LVALUE)                // load length from a slice, see section below
           | BOX                        // malloc for builtin box, see section below
    BINOP = + | - | * | / | ...         // excluding && and ||
    UNOP = ! | -                        // note: no `*`, as that is part of LVALUE

One thing worth pointing out is that the binary and unary operators
are only the *builtin* form, operating on scalar values. Overloaded
operators will be desugared to trait calls. Moreover, all method calls
are desugared into normal calls via UFCS form.
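
As a small illustration in surface Rust (a sketch; the type `Meters`
and the function `demo` are made up for this example), an overloaded
`+` is not a MIR `BINOP` -- it becomes an explicit trait-method call,
written here in the UFCS form that the MIR-level call amounts to:

```rust
use std::ops::Add;

#[derive(Clone, Copy, PartialEq, Debug)]
struct Meters(f64);

impl Add for Meters {
    type Output = Meters;
    fn add(self, rhs: Meters) -> Meters { Meters(self.0 + rhs.0) }
}

fn demo(a: Meters, b: Meters) {
    let surface = a + b;                        // what the user writes
    let desugared = <Meters as Add>::add(a, b); // roughly the call the MIR performs
    assert_eq!(surface, desugared);
}
```
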
### Constants

Constants are a subset of rvalues that can be evaluated at compilation
time:

    CONSTANT = INT
             | UINT
             | FLOAT
             | BOOL
             | BYTES
             | STATIC_STRING
             | ITEM<SUBSTS>                // reference to an item or constant etc
             | <P0 as TRAIT<P1...Pn>>      // projection
             | CONSTANT(CONSTANT...)       //
             | CAST(CONSTANT, TY)          // foo as bar
             | Struct { (f: CONSTANT)... } // aggregates...
             | (CONSTANT...)               //
             | [CONSTANT...]               //

### Aggregates and further lowering

The set of rvalues includes "aggregate" expressions like `(x, y)` or
`Foo { f: x, g: y }`. This is a place where the MIR (somewhat) departs
from what will be generated at compilation time, since (often) an
expression like `f = (x, y, z)` will wind up desugared into a series
of piecewise assignments like:

    f.0 = x;
    f.1 = y;
    f.2 = z;

However, there are good reasons to include aggregates as first-class
rvalues. For one thing, if we break down each aggregate into the
specific assignments that would be used to construct the value, then
zero-sized types are *never* assigned, since there is no data to
actually move around at runtime. This means that the compiler couldn't
distinguish uninitialized variables from initialized ones. That is,
code like this:

```rust
let x: (); // note: never initialized
use(x)
```

and this:

```rust
let x: () = ();
use(x);
```

would desugar to the same MIR. That is a problem, particularly with
respect to destructors: imagine that instead of the type `()`, we used
a type like `struct Foo;` where `Foo` implements `Drop`.

Another advantage is that building aggregates in a two-step way
assures the proper execution order when unwinding occurs before the
complete value is constructed. In particular, we want to drop the
intermediate results in the order that they appear in the source, not
in the order in which the fields are specified in the struct
definition.
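
A small illustration in surface Rust (a sketch; the types `Loud` and
`Pair` are made up for this example): if the initializer for `a`
panics below, the value already computed for `b` must still be dropped
while unwinding, even though the complete `Pair` was never built.

```rust
struct Loud(&'static str);

impl Drop for Loud {
    fn drop(&mut self) { println!("dropping {}", self.0); }
}

struct Pair { a: Loud, b: Loud }

fn build() -> Pair {
    Pair {
        b: Loud("b"),           // evaluated first, in source order
        a: panic!("a failed"),  // unwinds; the intermediate value for `b` is dropped
    }
}
```
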
A final reason to include aggregates is that, at runtime, the
representation of an aggregate may indeed fit within a single word, in
which case making a temporary and writing the fields piecemeal may in
fact not be the correct representation.

In any case, after the move and correctness checking is done, it is
easy enough to remove these aggregate rvalues and replace them with
assignments. This could potentially be done during lowering, or
as a pre-pass that transforms MIR statements like:

    x = ...x;
    y = ...y;
    z = ...z;
    f = (x, y, z)

to:

    x = ...x;
    y = ...y;
    z = ...z;
    f.0 = x;
    f.1 = y;
    f.2 = z;

combined with another pass that removes temporaries that are only used
within a single assignment (and nowhere else):

    f.0 = ...x;
    f.1 = ...y;
    f.2 = ...z;

Going further, once type-checking is done, it is plausible to do
further lowering within the MIR purely for optimization purposes. For
example, we could introduce intermediate references to cache the
results of common lvalue computations and so forth.

### Bounds checking

Because bounds checks are fallible, it's important to encode them in
the MIR whenever we do indexing. Otherwise the `trans` code would have
to figure out on its own how to do unwinding at that point. Because
the MIR doesn't "desugar" fat pointers, we include a special rvalue
`LEN` that extracts the length from an array value whose type matches
`[T]` or `[T;n]` (in the latter case, it yields a constant). Using
this, we desugar an array reference like `y = arr[x]` as follows:

    let len: usize;
    let idx: usize;
    let lt: bool;

    B0: {
        len = len(arr);
        idx = x;
        lt = idx < len;
        if lt { B1 } else { B2 }
    }

    B1: {
        y = arr[idx]
        ...
    }

    B2: {
        <panic>
    }

The key point here is that we create a temporary (`idx`) capturing the
value that we bounds checked and we ensure that there is a comparison
against the length.

### Overflow checking

Similarly, since overflow checks can trigger a panic, they ought to be
exposed in the MIR as well. This is handled by having distinct binary
operators for "add with overflow" and so forth, analogous to the LLVM
intrinsics. These operators yield a tuple of (result, overflow), so
`result = left + right` might be translated like:

    let tmp: (u32, bool);

    B0: {
        tmp = left + right;
        if(tmp.1, B2, B1)
    }

    B1: {
        result = tmp.0
        ...
    }

    B2: {
        <panic>
    }
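
For comparison (a sketch in surface Rust, not the RFC's notation), the
same check written by hand using the standard library's
`overflowing_add`, which returns the same (result, overflow) pair that
the MIR-level "add with overflow" operator yields:

```rust
fn checked_sum(left: u32, right: u32) -> u32 {
    let (result, overflowed) = left.overflowing_add(right);
    if overflowed {
        // Corresponds to the branch to the panicking block B2 above.
        panic!("attempt to add with overflow");
    }
    result
}
```
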
### Matches

One of the goals of the MIR is to desugar matches into something much
more primitive, so that we are freed from reasoning about their
complexity. This is primarily achieved through a combination of SWITCH
terminators and downcasts. To get the idea, consider this simple match
statement:

```rust
match foo() {
    Some(ref v) => ...0,
    None => ...1
}
```

This would be converted into MIR as follows (leaving out the unwinding support):

    BB0 {
        call(tmp = foo(), BB1, ...);
    }

    BB1 {
        switch(tmp, BB2, BB3) // two branches, corresponding to the Some and None variants resp.
    }

    BB2 {
        v = &(tmp as Option::Some).0;
        ...0
    }

    BB3 {
        ...1
    }

There are some interesting cases that arise from matches that are
worth examining.

**Vector patterns.** Currently, (unstable) Rust supports vector
patterns which permit borrows that would not otherwise be legal:

```rust
let mut vec = [1, 2];
match vec {
    [ref mut p, ref mut q] => { ... }
}
```

If this code were written using `p = &mut vec[0], q = &mut vec[1]`,
the borrow checker would complain. This is because it does not attempt
to reason about indices being disjoint, even if they are constant
(this is a limitation we may wish to consider lifting at some point in
the future, however).

To accommodate these, we plan to desugar such matches into lvalues
using the special "constant index" form. The borrow checker would be
able to reason that two constant indices are disjoint but it could
consider "variable indices" to be (potentially) overlapping with all
constant indices. This is a fairly straightforward thing to do (and in
fact the borrow checker already includes similar logic, since the
`ExprUseVisitor` encounters a similar dilemma trying to resolve
borrows).
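
A sketch of what such a desugaring might produce (the "constant index"
notation here is hypothetical, purely to illustrate the idea): the two
borrows refer to lvalues whose indices are known, distinct constants,
so the borrow checker can see that they are disjoint:

    p = &mut vec[const 0];   // constant index 0 ...
    q = &mut vec[const 1];   // ... is disjoint from constant index 1
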
### Drops

The `Drop(DROP_KIND, LVALUE)` instruction is intended to represent
"automatic" compiler-inserted drops. The semantics of a `Drop` is that
it drops "if needed". This means that the compiler can insert it
everywhere that a `Drop` would make sense (due to scoping), and assume
that instrumentation will be done as needed to prevent double
drops. Currently, this signaling is done by zeroing out memory at
runtime, but we are in the process of introducing stack flags for this
purpose: the MIR offers the opportunity to reify those flags if we
wanted, and rewrite drops to be more narrow.

To illustrate how drop works, let's work through a simple
example. Imagine that we have a snippet of code like:

```rust
{
    let x = Box::new(22);
    send(x);
}
```

The compiler would generate a drop for `x` at the end of the block,
but the value `x` would also be moved as part of the call to `send`.
A later analysis could easily strip out this `Drop` since it is evident
that the value is always used on all paths that lead to `Drop`.
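
In MIR terms (a rough sketch in the notation above, with the unwind
path elided), the lowering contains both the move into the call and
the automatic end-of-scope drop; the drop is what a later pass can
delete:

    B0: {
        ...
        call(tmp0 = send(x), B1, B2)  // x is moved into the call here
    }

    B1: {
        drop(Deep, x);   // end-of-scope drop: dynamically a no-op because x
        ...              // was moved, and a later pass can remove it outright
    }
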
### Shallow drops and Box

The MIR includes the distinction between "shallow" and "deep"
drop. Deep drop is the normal thing, but shallow drop is used when
partially initializing boxes. This is tied to the `box` keyword.
For example, an assignment like the following:

    let x = box Foo::new();

would be translated to something like the following:

    let tmp: Box<Foo>;

    B0: {
        tmp = BOX;
        f = Foo::new; // constant reference
        call(*tmp, f, B1, B2);
    }

    B1: { // successful return of the call
        x = use(tmp); // move of tmp
        ...
    }

    B2: { // calling Foo::new() panic'd
        drop(Shallow, tmp);
        diverge;
    }

The interesting part here is the block B2, which indicates the case
that `Foo::new()` invoked unwinding. In that case, we have to free the
box that we allocated, but we only want to free the box itself, not
its contents (it is not yet initialized).

Note that having this kind of builtin box code is a legacy thing. The
more generalized protocol that [RFC 809][809] specifies works in
more-or-less exactly the same way: when that is adopted uniformly, the
need for shallow drop and the Box rvalue will go away.

[809]: https://github.com/rust-lang/rfcs/blob/master/text/0809-box-and-in-for-stdlib.md

### Phasing

Ideally, the translation to MIR would be done during type checking,
but before "region checking". This is because we would like to
implement non-lexical lifetimes eventually, and doing that well would
require access to a control-flow graph. Given that we do very limited
reasoning about regions at present, this should not be a problem.

### Representing scopes

Lexical scopes in Rust play a large role in terms of when destructors
run and how the reasoning about lifetimes works. However, they are
completely erased by the graph format. For the most part, this is not
an issue, since drops are encoded explicitly into the control-flow
where needed. However, one place that we still need to reason about
scopes (at least in the short term) is in region checking, because
currently regions are encoded in terms of scopes, and we have to be
able to map that to a region in the graph. The MIR therefore includes
extra information mapping every scope to a SEME region (single-entry,
multiple-exit). If/when we move to non-lexical lifetimes, regions
would be defined in terms of the graph itself, and the need to retain
scoping information should go away.

### Monomorphization

Currently, we do monomorphization at translation time. If we ever
chose to do it at a MIR level, that would be fine, but one thing to be
careful of is that we may be able to elide `Drop` nodes based on the
specific types.

### Unchecked assertions

There are various bits of the MIR that are not trivially type-checked.
In general, these are properties which are assured in Rust by
construction in the high-level syntax, and thus we must be careful not
to do any transformation that would endanger them after the fact.

- **Bounds-checking.** We introduce explicit bounds checks into the IR
  that guard all indexing lvalues, but there is no explicit connection
  between this check and the later accesses.
- **Downcasts to a specific variant.** We test variants with a SWITCH
  opcode but there is no explicit connection between this test and
  later downcasts (see the sketch below).
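
To see the second point (reusing the match lowering sketched earlier),
the downcast in BB2 is only valid because control reached it through
the SWITCH in BB1, yet nothing in the MIR itself ties the two together:

    BB1 {
        switch(tmp, BB2, BB3)         // tests which variant tmp holds
    }

    BB2 {
        v = &(tmp as Option::Some).0; // downcast, assumed here to be safe
        ...
    }
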
This need for unchecked operations results from trying to lower and
simplify the representation as much as possible, as well as trying to
represent all panics explicitly. We believe the tradeoff to be
worthwhile, particularly since:

1. the existing analyses can continue to generally assume that these
   properties hold (e.g., that all indices are in bounds and all
   downcasts are safe); and,
2. it would be trivial to implement a static dataflow analysis
   checking that bounds and downcasts only occur downstream of a relevant
   check.

# Drawbacks

**Converting from AST to a MIR will take some compilation time.**
Expectations are that constructing the MIR will be quite fast, and
that follow-on code (such as `trans` and `borrowck`) will execute faster,
because they will operate over a simpler and more compact
representation. However, this needs to be measured.

**More effort is required to make quality error messages.** Because
the representation the compiler is working with is now quite different
from what the user typed, we have to put in extra effort to make sure
that we bridge this gap when reporting errors. We have some precedent
for dealing with this, however. For example, the `ExprUseVisitor` (and
`mem_categorization`) includes extra annotations and hints to tell the
borrow checker when a reference was introduced as part of a closure
versus being explicit in the source code. The current prototype
doesn't have much in this direction, but it should be relatively
straightforward to add. Hints like those, in addition to spans, should
be enough to bridge the error message gap.

# Alternatives

**Use SSA.** In the proposed MIR, temporaries are single-assignment
but can be borrowed, making them more analogous to allocas than SSA
values. This is helpful to analyses like the borrow checker, because
it means that the program operates directly on paths through memory,
versus having the stack modeled as allocas. The current model is also
helpful for generating debuginfo.
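
A small illustration of the difference (a sketch in surface Rust): the
local below is assigned only once, but it is then mutated through a
borrow, which is why MIR temporaries behave like memory locations
(allocas) rather than SSA values.

```rust
fn example() -> i32 {
    let mut tmp = 1;    // single assignment here ...
    let r = &mut tmp;
    *r += 1;            // ... but mutated later through the borrow
    tmp
}
```
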
SSA representation can be helpful for more sophisticated backend
optimizations. However, it makes more sense to have the MIR be based on
lvalues. There are some cases where it might make sense to do analyses
on the MIR that would benefit from SSA, such as bounds check elision.
In those cases, we could either quickly identify those temporaries
that are not mutably borrowed (and which therefore act like SSA
variables); or further lower into a LIR (which would be an SSA
form); or else simply perform the analyses on the MIR using standard
techniques like def-use chains. (CSE and so forth are straightforward
both with and without SSA, honestly.)

**Exclude unwinding.** Excluding unwinding from the MIR would allow us
to elide annoying details like bounds and overflow checking. These are
not particularly interesting to `borrowck`, so that is somewhat
appealing. But that would mean that consumers of MIR would have to
reconstruct the order of drops and so forth on unwinding paths, which
would require them to reason about scopes and other rather complex
bits of information. Moreover, having all drops fully exposed in the
MIR is likely helpful for better handling of dynamic drop and also for
the rules collectively known as dropck, though all details there have
not been worked out.

**Expand the set of operands.** The proposed MIR forces all rvalue operands
to be lvalues. This means that integer constants and other "simple" things
will wind up introducing a temporary. For example, translating `x = 2+2`
will generate code like:

    tmp0 = 2
    tmp1 = 2
    x = tmp0 + tmp1

A more common case will be calls to statically known functions like `x = foo(3)`,
which desugars to a temporary and a constant reference:

    tmp0 = foo;
    tmp1 = 3
    x = tmp0(tmp1)

There is no particular *harm* in such constants: it would be very easy
to optimize them away when reducing to bitcode, and if we do not
do so, a backend may do it. However, we could also expand the scope of
operands to include both lvalues and some simple rvalues like
constants. The main advantage of this is that it would reduce the
total number of statements and hence might help with memory
consumption.

**Totally safe MIR.** This MIR includes operations whose safety is not
trivially type-checked (see the section on *unchecked assertions*
above). We might design a higher-level MIR where those properties held
by construction, or modify the MIR to thread "evidence" of some form
that makes it easier to check that the properties hold. The former
would make downstream code accommodate more complexity. The latter
remains an option in the future but doesn't seem to offer much
practical advantage.

# Unresolved questions

**What additional info is needed to provide for good error messages?**
Currently the implementation only has spans on statements, not lvalues
or rvalues. We'll have to experiment here. I expect we will probably
wind up placing "debug info" on all lvalues, which includes not only a
span but also a "translation" into terms the user understands. For
example, in a closure, a reference to a by-reference upvar `foo` will
be translated to something like `*self.foo`, and we would like that to
be displayed to the user as just `foo`.

**What additional info is needed for debuginfo?** It may be that to
generate good debuginfo we want to include additional information
about control-flow or scoping.

**Unsafe blocks.** Should we layer unsafe in the MIR so that effect
checking can be done on the CFG? It's not the most natural way to do
it, *but* it would make it fairly easy to support (e.g.) autoderef on
unsafe pointers, since all the implicit operations are made explicit
in the MIR. My hunch is that we can improve our HIR instead.