Add a bit about various type system concepts (#697)

* add a bit on dataflow analysis

* add a bit on quanitification

* add a bit on debruijn index

* add a bit on early and late bound params

* add missing link

* Typos

Co-authored-by: Tshepang Lekhonkhobe <tshepang@gmail.com>

* clarify dataflow example

* fix formatting

* fix typos

* Typos

Co-authored-by: Tshepang Lekhonkhobe <tshepang@gmail.com>

* fix errors in background

* remove dup material and make early/late intro short

* adjust intro

* Niko's intro

Co-authored-by: Niko Matsakis <niko@alum.mit.edu>

Co-authored-by: Tshepang Lekhonkhobe <tshepang@gmail.com>
Co-authored-by: Niko Matsakis <niko@alum.mit.edu>
This commit is contained in:
Who? Me?! 2020-06-05 11:11:09 -05:00 committed by GitHub
parent e74b4ab3a4
commit 98d000fb3c
4 changed files with 253 additions and 2 deletions

View File

@ -76,6 +76,7 @@
- [Generic arguments](./generic_arguments.md)
- [Type inference](./type-inference.md)
- [Trait solving](./traits/resolution.md)
- [Early and Late Bound Parameters](./early-late-bound.md)
- [Higher-ranked trait bounds](./traits/hrtb.md)
- [Caching subtleties](./traits/caching.md)
- [Specialization](./traits/specialization.md)

View File

@ -79,16 +79,158 @@ cycle.
[*Static Program Analysis*](https://cs.au.dk/~amoeller/spa/) by Anders Møller
and Michael I. Schwartzbach is an incredible resource!
*to be written*
_Dataflow analysis_ is a type of static analysis that is common in many
compilers. It describes a general technique, rather than a particular analysis.
The basic idea is that we can walk over a [CFG](#cfg) and keep track of what
some value could be. At the end of the walk, we might have shown that some
claim is true or not necessarily true (e.g. "this variable must be
initialized"). `rustc` tends to do dataflow analyses over the MIR, since that
is already a CFG.
For example, suppose we want to check that `x` is initialized before it is used
in this snippet:
```rust,ignore
fn foo() {
let mut x;
if some_cond {
x = 1;
}
dbg!(x);
}
```
A CFG for this code might look like this:
```txt
+------+
| Init | (A)
+------+
| \
| if some_cond
else \ +-------+
| \| x = 1 | (B)
| +-------+
| /
+---------+
| dbg!(x) | (C)
+---------+
```
We can do the dataflow analysis as follows: we will start off with a flag `init`
which indicates if we know `x` is initialized. As we walk the CFG, we will
update the flag. At the end, we can check its value.
So first, in block (A), the variable `x` is declared but not initialized, so
`init = false`. In block (B), we initialize the value, so we know that `x` is
initialized. So at the end of (B), `init = true`.
Block (C) is where things get interesting. Notice that there are two incoming
edges, one from (A) and one from (B), corresponding to whether `some_cond` is true or not.
But we cannot know that! It could be the case the `some_cond` is always true,
so that `x` is actually always initialized. It could also be the case that
`some_cond` depends on something random (e.g. the time), so `x` may not be
initialized. In general, we cannot know statically (due to [Rice's
Theorem][rice]). So what should the value of `init` be in block (C)?
[rice]: https://en.wikipedia.org/wiki/Rice%27s_theorem
Generally, in dataflow analyses, if a block has multiple parents (like (C) in
our example), its dataflow value will be some function of all its parents (and
of course, what happens in (C)). Which function we use depends on the analysis
we are doing.
In this case, we want to be able to prove definitively that `x` must be
initialized before use. This forces us to be conservative and assume that
`some_cond` might be false sometimes. So our "merging function" is "and". That
is, `init = true` in (C) if `init = true` in (A) _and_ in (B) (or if `x` is
initialized in (C)). But this is not the case; in particular, `init = false` in
(A), and `x` is not initialized in (C). Thus, `init = false` in (C); we can
report an error that "`x` may not be initialized before use".
There is definitely a lot more that can be said about dataflow analyses. There is an
extensive body of research literature on the topic, including a lot of theory.
We only discussed a forwards analysis, but backwards dataflow analysis is also
useful. For example, rather than starting from block (A) and moving forwards,
we might have started with the usage of `x` and moved backwards to try to find
its initialization.
<a name="quantified"></a>
## What is "universally quantified"? What about "existentially quantified"?
*to be written*
In math, a predicate may be _universally quantified_ or _existentially
quantified_:
- _Universal_ quantification:
- the predicate holds if it is true for all possible inputs.
- Traditional notation: ∀x: P(x). Read as "for all x, P(x) holds".
- _Existential_ quantification:
- the predicate holds if there is any input where it is true, i.e., there
only has to be a single input.
- Traditional notation: ∃x: P(x). Read as "there exists x such that P(x) holds".
In Rust, they come up in type checking and trait solving. For example,
```rust,ignore
fn foo<T>()
```
This function claims that the function is well-typed for all types `T`: `∀ T: well_typed(foo)`.
Another example:
```rust,ignore
fn foo<'a>(_: &'a usize)
```
This function claims that for any lifetime `'a` (determined by the
caller), it is well-typed: `∀ 'a: well_typed(foo)`.
Another example:
```rust,ignore
fn foo<F>()
where for<'a> F: Fn(&'a u8)
```
This function claims that it is well-typed for all types `F` such that for all
lifetimes `'a`, `F: Fn(&'a u8)`: `∀ F: ∀ 'a: (F: Fn(&'a u8)) => well_typed(foo)`.
One more example:
```rust,ignore
fn foo(_: dyn Debug)
```
This function claims that there exists some type `T` that implements `Debug`
such that the function is well-typed: `∃ T: (T: Debug) and well_typed(foo)`.
<a name="variance"></a>
## What is a DeBruijn Index?
DeBruijn indices are a way of representing which variables are bound in
which binders using only integers. They were [originally invented][wikideb] for
use in lambda calculus evaluation. In `rustc`, we use a similar idea for the
[representation of generic types][sub].
[wikideb]: https://en.wikipedia.org/wiki/De_Bruijn_index
[sub]: ../generics.md
Here is a basic example of how DeBruijn indices might be used for closures (we
don't actually do this in `rustc` though):
```rust,ignore
|x| {
f(x) // de Bruijn index of `x` is 1 because `x` is bound 1 level up
|y| {
g(x, y) // index of `x` is 2 because it is bound 2 levels up
// index of `y` is 1 because it is bound 1 level up
}
}
```
## What is co- and contra-variance?
Check out the subtyping chapter from the

View File

@ -15,6 +15,7 @@ CTFE <div id="ctfe"/> | Short for Compile-Time Function Eval
cx <div id="cx"/> | We tend to use "cx" as an abbreviation for context. See also `tcx`, `infcx`, etc.
DAG <div id="dag"/> | A directed acyclic graph is used during compilation to keep track of dependencies between queries. ([see more](../queries/incremental-compilation.html))
data-flow analysis <div id="data-flow"/> | A static analysis that figures out what properties are true at each point in the control-flow of a program; see [the background chapter for more](./background.html#dataflow).
DeBruijn Index <div id="debruijn"> | A technique for describing which binder a variable is bound by using only integers. It has the benefit that it is invariant under variable renaming. ([see more](./background.md#debruijn))
DefId <div id="def-id"/> | An index identifying a definition (see `librustc_middle/hir/def_id.rs`). Uniquely identifies a `DefPath`. See [the HIR chapter for more](../hir.html#identifiers-in-the-hir).
Double pointer <div id="double-ptr"/> | A pointer with additional metadata. See "fat pointer" for more.
drop glue <div id="drop-glue"/> | (internal) compiler-generated instructions that handle calling the destructors (`Drop`) for data types.

107
src/early-late-bound.md Normal file
View File

@ -0,0 +1,107 @@
# Early and Late Bound Variables
In Rust, item definitions (like `fn`) can often have generic parameters, which
are always [_universally_ quantified][quant]. That is, if you have a function
like
```rust
fn foo<T>(x: T) { }
```
this function is defined "for all T" (not "for some specific T", which would be
[_existentially_ quantified][quant]).
[quant]: ./appendix/background.md#quantified
While Rust *items* can be quantified over types, lifetimes, and constants, the
types of values in Rust are only ever quantified over lifetimes. So you can
have a type like `for<'a> fn(&'a u32)`, which represents a function pointer
that takes a reference with any lifetime, or `for<'a> dyn Trait<'a>`, which is
a `dyn` trait for a trait implemented for any lifetime; but we have no type
like `for<T> fn(T)`, which would be a function that takes a value of *any type*
as a parameter. This is a consequence of monomorphization -- to support a value
of type `for<T> fn(T)`, we would need a single function pointer that can be
used for a parameter of any type, but in Rust we generate customized code for
each parameter type.
One consequence of this asymmetry is a weird split in how we represesent some
generic types: _early-_ and _late-_ bound parameters.
Basically, if we cannot represent a type (e.g. a universally quantified type),
we have to bind it _early_ so that the unrepresentable type is never around.
Consider the following example:
```rust,ignore
fn foo<'a, 'b, T>(x: &'a u32, y: &'b T) where T: 'b { ... }
```
We cannot treat `'a`, `'b`, and `T` in the same way. Types in Rust can't have
`for<T> { .. }`, only `for<'a> {...}`, so whenever you reference `foo` the type
you get back can't be `for<'a, 'b, T> fn(&'a u32, y: &'b T)`. Instead, the `T`
must be substituted early. In particular, you have:
```rust,ignore
let x = foo; // T, 'b have to be substituted here
x(...); // 'a substituted here, at the point of call
x(...); // 'a substituted here with a different value
```
## Early-bound parameters
Early-bound parameters in rustc are identified by an index, stored in the
[`ParamTy`] struct for types or the [`EarlyBoundRegion`] struct for lifetimes.
The index counts from the outermost declaration in scope. This means that as you
add more binders inside, the index doesn't change.
For example,
```rust,ignore
trait Foo<T> {
type Bar<U> = (Self, T, U);
}
```
Here, the type `(Self, T, U)` would be `($0, $1, $2)`, where `$N` means a
[`ParamTy`] with the index of `N`.
In rustc, the [`Generics`] structure carries the this information. So the
[`Generics`] for `Bar` above would be just like for `U` and would indicate the
'parent' generics of `Foo`, which declares `Self` and `T`. You can read more
in [this chapter](./generics.md).
[`ParamTy`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.ParamTy.html
[`EarlyBoundRegion`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.EarlyBoundRegion.html
[`Generics`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.Generics.html
## Late-bound parameters
Late-bound parameters in `rustc` are handled quite differently (they are also
specialized to lifetimes since, right now, only late-bound lifetimes are
supported, though with GATs that has to change). We indicate their potential
presence by a [`Binder`] type. The [`Binder`] doesn't know how many variables
there are at that binding level. This can only be determined by walking the
type itself and collecting them. So a type like `for<'a, 'b> ('a, 'b)` would be
`for (^0.a, ^0.b)`. Here, we just write `for` because we don't know the names
of the things bound within.
Moreover, a reference to a late-bound lifetime is written `^0.a`:
- The `0` is the index; it identifies that this lifetime is bound in the
innermost binder (the `for`).
- The `a` is the "name"; late-bound lifetimes in rustc are identified by a
"name" -- the [`BoundRegion`] struct. This struct can contain a
[`DefId`][defid] or it might have various "anonymous" numbered names. The
latter arise from types like `fn(&u32, &u32)`, which are equivalent to
something like `for<'a, 'b> fn(&'a u32, &'b u32)`, but the names of those
lifetimes must be generated.
This setup of not knowing the full set of variables at a binding level has some
advantages and some disadvantages. The disadvantage is that you must walk the
type to find out what is bound at the given level and so forth. The advantage
is primarily that, when constructing types from Rust syntax, if we encounter
anonymous regions like in `fn(&u32)`, we just create a fresh index and don't have
to update the binder.
[`Binder`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.Binder.html
[`BoundRegion`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/enum.BoundRegion.html
[defid]: ./hir.html#identifiers-in-the-hir