From 98d000fb3c30287e18af6103e94a2f8d38717c01 Mon Sep 17 00:00:00 2001 From: Who? Me?! Date: Fri, 5 Jun 2020 11:11:09 -0500 Subject: [PATCH] Add a bit about various type system concepts (#697) * add a bit on dataflow analysis * add a bit on quanitification * add a bit on debruijn index * add a bit on early and late bound params * add missing link * Typos Co-authored-by: Tshepang Lekhonkhobe * clarify dataflow example * fix formatting * fix typos * Typos Co-authored-by: Tshepang Lekhonkhobe * fix errors in background * remove dup material and make early/late intro short * adjust intro * Niko's intro Co-authored-by: Niko Matsakis Co-authored-by: Tshepang Lekhonkhobe Co-authored-by: Niko Matsakis --- src/SUMMARY.md | 1 + src/appendix/background.md | 146 ++++++++++++++++++++++++++++++++++++- src/appendix/glossary.md | 1 + src/early-late-bound.md | 107 +++++++++++++++++++++++++++ 4 files changed, 253 insertions(+), 2 deletions(-) create mode 100644 src/early-late-bound.md diff --git a/src/SUMMARY.md b/src/SUMMARY.md index 0a9dc777..f45e8ba9 100644 --- a/src/SUMMARY.md +++ b/src/SUMMARY.md @@ -76,6 +76,7 @@ - [Generic arguments](./generic_arguments.md) - [Type inference](./type-inference.md) - [Trait solving](./traits/resolution.md) + - [Early and Late Bound Parameters](./early-late-bound.md) - [Higher-ranked trait bounds](./traits/hrtb.md) - [Caching subtleties](./traits/caching.md) - [Specialization](./traits/specialization.md) diff --git a/src/appendix/background.md b/src/appendix/background.md index b688d2c7..fe69f6bf 100644 --- a/src/appendix/background.md +++ b/src/appendix/background.md @@ -79,16 +79,158 @@ cycle. [*Static Program Analysis*](https://cs.au.dk/~amoeller/spa/) by Anders Møller and Michael I. Schwartzbach is an incredible resource! -*to be written* +_Dataflow analysis_ is a type of static analysis that is common in many +compilers. It describes a general technique, rather than a particular analysis. + +The basic idea is that we can walk over a [CFG](#cfg) and keep track of what +some value could be. At the end of the walk, we might have shown that some +claim is true or not necessarily true (e.g. "this variable must be +initialized"). `rustc` tends to do dataflow analyses over the MIR, since that +is already a CFG. + +For example, suppose we want to check that `x` is initialized before it is used +in this snippet: + +```rust,ignore +fn foo() { + let mut x; + + if some_cond { + x = 1; + } + + dbg!(x); +} +``` + +A CFG for this code might look like this: + +```txt + +------+ + | Init | (A) + +------+ + | \ + | if some_cond + else \ +-------+ + | \| x = 1 | (B) + | +-------+ + | / + +---------+ + | dbg!(x) | (C) + +---------+ +``` + +We can do the dataflow analysis as follows: we will start off with a flag `init` +which indicates if we know `x` is initialized. As we walk the CFG, we will +update the flag. At the end, we can check its value. + +So first, in block (A), the variable `x` is declared but not initialized, so +`init = false`. In block (B), we initialize the value, so we know that `x` is +initialized. So at the end of (B), `init = true`. + +Block (C) is where things get interesting. Notice that there are two incoming +edges, one from (A) and one from (B), corresponding to whether `some_cond` is true or not. +But we cannot know that! It could be the case the `some_cond` is always true, +so that `x` is actually always initialized. It could also be the case that +`some_cond` depends on something random (e.g. the time), so `x` may not be +initialized. In general, we cannot know statically (due to [Rice's +Theorem][rice]). So what should the value of `init` be in block (C)? + +[rice]: https://en.wikipedia.org/wiki/Rice%27s_theorem + +Generally, in dataflow analyses, if a block has multiple parents (like (C) in +our example), its dataflow value will be some function of all its parents (and +of course, what happens in (C)). Which function we use depends on the analysis +we are doing. + +In this case, we want to be able to prove definitively that `x` must be +initialized before use. This forces us to be conservative and assume that +`some_cond` might be false sometimes. So our "merging function" is "and". That +is, `init = true` in (C) if `init = true` in (A) _and_ in (B) (or if `x` is +initialized in (C)). But this is not the case; in particular, `init = false` in +(A), and `x` is not initialized in (C). Thus, `init = false` in (C); we can +report an error that "`x` may not be initialized before use". + +There is definitely a lot more that can be said about dataflow analyses. There is an +extensive body of research literature on the topic, including a lot of theory. +We only discussed a forwards analysis, but backwards dataflow analysis is also +useful. For example, rather than starting from block (A) and moving forwards, +we might have started with the usage of `x` and moved backwards to try to find +its initialization. ## What is "universally quantified"? What about "existentially quantified"? -*to be written* +In math, a predicate may be _universally quantified_ or _existentially +quantified_: + +- _Universal_ quantification: + - the predicate holds if it is true for all possible inputs. + - Traditional notation: ∀x: P(x). Read as "for all x, P(x) holds". +- _Existential_ quantification: + - the predicate holds if there is any input where it is true, i.e., there + only has to be a single input. + - Traditional notation: ∃x: P(x). Read as "there exists x such that P(x) holds". + +In Rust, they come up in type checking and trait solving. For example, + +```rust,ignore +fn foo() +``` +This function claims that the function is well-typed for all types `T`: `∀ T: well_typed(foo)`. + +Another example: + +```rust,ignore +fn foo<'a>(_: &'a usize) +``` +This function claims that for any lifetime `'a` (determined by the +caller), it is well-typed: `∀ 'a: well_typed(foo)`. + +Another example: + +```rust,ignore +fn foo() +where for<'a> F: Fn(&'a u8) +``` +This function claims that it is well-typed for all types `F` such that for all +lifetimes `'a`, `F: Fn(&'a u8)`: `∀ F: ∀ 'a: (F: Fn(&'a u8)) => well_typed(foo)`. + +One more example: + +```rust,ignore +fn foo(_: dyn Debug) +``` +This function claims that there exists some type `T` that implements `Debug` +such that the function is well-typed: `∃ T: (T: Debug) and well_typed(foo)`. +## What is a DeBruijn Index? + +DeBruijn indices are a way of representing which variables are bound in +which binders using only integers. They were [originally invented][wikideb] for +use in lambda calculus evaluation. In `rustc`, we use a similar idea for the +[representation of generic types][sub]. + +[wikideb]: https://en.wikipedia.org/wiki/De_Bruijn_index +[sub]: ../generics.md + +Here is a basic example of how DeBruijn indices might be used for closures (we +don't actually do this in `rustc` though): + +```rust,ignore +|x| { + f(x) // de Bruijn index of `x` is 1 because `x` is bound 1 level up + + |y| { + g(x, y) // index of `x` is 2 because it is bound 2 levels up + // index of `y` is 1 because it is bound 1 level up + } +} +``` + ## What is co- and contra-variance? Check out the subtyping chapter from the diff --git a/src/appendix/glossary.md b/src/appendix/glossary.md index dddf628f..f8042170 100644 --- a/src/appendix/glossary.md +++ b/src/appendix/glossary.md @@ -15,6 +15,7 @@ CTFE
| Short for Compile-Time Function Eval cx
| We tend to use "cx" as an abbreviation for context. See also `tcx`, `infcx`, etc. DAG
| A directed acyclic graph is used during compilation to keep track of dependencies between queries. ([see more](../queries/incremental-compilation.html)) data-flow analysis
| A static analysis that figures out what properties are true at each point in the control-flow of a program; see [the background chapter for more](./background.html#dataflow). +DeBruijn Index
| A technique for describing which binder a variable is bound by using only integers. It has the benefit that it is invariant under variable renaming. ([see more](./background.md#debruijn)) DefId
| An index identifying a definition (see `librustc_middle/hir/def_id.rs`). Uniquely identifies a `DefPath`. See [the HIR chapter for more](../hir.html#identifiers-in-the-hir). Double pointer
| A pointer with additional metadata. See "fat pointer" for more. drop glue
| (internal) compiler-generated instructions that handle calling the destructors (`Drop`) for data types. diff --git a/src/early-late-bound.md b/src/early-late-bound.md new file mode 100644 index 00000000..a703cbab --- /dev/null +++ b/src/early-late-bound.md @@ -0,0 +1,107 @@ +# Early and Late Bound Variables + +In Rust, item definitions (like `fn`) can often have generic parameters, which +are always [_universally_ quantified][quant]. That is, if you have a function +like + +```rust +fn foo(x: T) { } +``` + +this function is defined "for all T" (not "for some specific T", which would be +[_existentially_ quantified][quant]). + +[quant]: ./appendix/background.md#quantified + +While Rust *items* can be quantified over types, lifetimes, and constants, the +types of values in Rust are only ever quantified over lifetimes. So you can +have a type like `for<'a> fn(&'a u32)`, which represents a function pointer +that takes a reference with any lifetime, or `for<'a> dyn Trait<'a>`, which is +a `dyn` trait for a trait implemented for any lifetime; but we have no type +like `for fn(T)`, which would be a function that takes a value of *any type* +as a parameter. This is a consequence of monomorphization -- to support a value +of type `for fn(T)`, we would need a single function pointer that can be +used for a parameter of any type, but in Rust we generate customized code for +each parameter type. + +One consequence of this asymmetry is a weird split in how we represesent some +generic types: _early-_ and _late-_ bound parameters. +Basically, if we cannot represent a type (e.g. a universally quantified type), +we have to bind it _early_ so that the unrepresentable type is never around. + +Consider the following example: + +```rust,ignore +fn foo<'a, 'b, T>(x: &'a u32, y: &'b T) where T: 'b { ... } +``` + +We cannot treat `'a`, `'b`, and `T` in the same way. Types in Rust can't have +`for { .. }`, only `for<'a> {...}`, so whenever you reference `foo` the type +you get back can't be `for<'a, 'b, T> fn(&'a u32, y: &'b T)`. Instead, the `T` +must be substituted early. In particular, you have: + +```rust,ignore +let x = foo; // T, 'b have to be substituted here +x(...); // 'a substituted here, at the point of call +x(...); // 'a substituted here with a different value +``` + +## Early-bound parameters + +Early-bound parameters in rustc are identified by an index, stored in the +[`ParamTy`] struct for types or the [`EarlyBoundRegion`] struct for lifetimes. +The index counts from the outermost declaration in scope. This means that as you +add more binders inside, the index doesn't change. + +For example, + +```rust,ignore +trait Foo { + type Bar = (Self, T, U); +} +``` + +Here, the type `(Self, T, U)` would be `($0, $1, $2)`, where `$N` means a +[`ParamTy`] with the index of `N`. + +In rustc, the [`Generics`] structure carries the this information. So the +[`Generics`] for `Bar` above would be just like for `U` and would indicate the +'parent' generics of `Foo`, which declares `Self` and `T`. You can read more +in [this chapter](./generics.md). + +[`ParamTy`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.ParamTy.html +[`EarlyBoundRegion`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.EarlyBoundRegion.html +[`Generics`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.Generics.html + +## Late-bound parameters + +Late-bound parameters in `rustc` are handled quite differently (they are also +specialized to lifetimes since, right now, only late-bound lifetimes are +supported, though with GATs that has to change). We indicate their potential +presence by a [`Binder`] type. The [`Binder`] doesn't know how many variables +there are at that binding level. This can only be determined by walking the +type itself and collecting them. So a type like `for<'a, 'b> ('a, 'b)` would be +`for (^0.a, ^0.b)`. Here, we just write `for` because we don't know the names +of the things bound within. + +Moreover, a reference to a late-bound lifetime is written `^0.a`: + +- The `0` is the index; it identifies that this lifetime is bound in the + innermost binder (the `for`). +- The `a` is the "name"; late-bound lifetimes in rustc are identified by a + "name" -- the [`BoundRegion`] struct. This struct can contain a + [`DefId`][defid] or it might have various "anonymous" numbered names. The + latter arise from types like `fn(&u32, &u32)`, which are equivalent to + something like `for<'a, 'b> fn(&'a u32, &'b u32)`, but the names of those + lifetimes must be generated. + +This setup of not knowing the full set of variables at a binding level has some +advantages and some disadvantages. The disadvantage is that you must walk the +type to find out what is bound at the given level and so forth. The advantage +is primarily that, when constructing types from Rust syntax, if we encounter +anonymous regions like in `fn(&u32)`, we just create a fresh index and don't have +to update the binder. + +[`Binder`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.Binder.html +[`BoundRegion`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/enum.BoundRegion.html +[defid]: ./hir.html#identifiers-in-the-hir