Merge pull request #28 from nikomatsakis/master

add query + incremental section and restructure a bit
2018-01-29 10:27:18 -05:00 · 2018-01-29 10:27:18 -05:00 · ccc8ca961e
parent af83b8e8d4 cd055e97a4
commit ccc8ca961e
6 changed files with 478 additions and 15 deletions
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@ -5,16 +5,19 @@
 - [Using the compiler testing framework](./running-tests.md)
 - [Walkthrough: a typical contribution](./walkthrough.md)
 - [High-level overview of the compiler source](./high-level-overview.md)
 - [Queries: demand-driven compilation](./query.md)
    - [Incremental compilation](./incremental-compilation.md)
 - [The parser](./the-parser.md)
 - [Macro expansion](./macro-expansion.md)
 - [Name resolution](./name-resolution.md)
- [HIR lowering](./hir-lowering.md)
+- [The HIR (High-level IR)](./hir.md)
 - [The `ty` module: representing types](./ty.md)
 - [Type inference](./type-inference.md)
 - [Trait resolution](./trait-resolution.md)
 - [Type checking](./type-checking.md)
- [MIR construction](./mir-construction.md)
+- [The MIR (Mid-level IR)](./mir.md)
- [MIR borrowck](./mir-borrowck.md)
+    - [MIR construction](./mir-construction.md)
- [MIR optimizations](./mir-optimizations.md)
+    - [MIR borrowck](./mir-borrowck.md)
    - [MIR optimizations](./mir-optimizations.md)
 - [trans: generating LLVM IR](./trans.md)
 - [Glossary](./glossary.md)
--- a/src/glossary.md
+++ b/src/glossary.md
@ -9,23 +9,24 @@ AST                     |  the abstract syntax tree produced by the syntax crate
 codegen unit            |  when we produce LLVM IR, we group the Rust code into a number of codegen units. Each of these units is processed by LLVM independently from one another, enabling parallelism. They are also the unit of incremental re-use.
 cx                      |  we tend to use "cx" as an abbreviation for context. See also `tcx`, `infcx`, etc.
 DefId                   |  an index identifying a definition (see `librustc/hir/def_id.rs`). Uniquely identifies a `DefPath`.
-HIR                     |  the High-level IR, created by lowering and desugaring the AST. See `librustc/hir`.
+HIR                     |  the High-level IR, created by lowering and desugaring the AST ([see more](hir.html))
 HirId                   |  identifies a particular node in the HIR by combining a def-id with an "intra-definition offset".
-'gcx                    |  the lifetime of the global arena (see `librustc/ty`).
+'gcx                    |  the lifetime of the global arena ([see more](ty.html))
 generics                |  the set of generic type parameters defined on a type or item
 ICE                     |  internal compiler error. When the compiler crashes.
 infcx                   |  the inference context (see `librustc/infer`)
-MIR                     |  the Mid-level IR that is created after type-checking for use by borrowck and trans. Defined in the `src/librustc/mir/` module, but much of the code that manipulates it is found in `src/librustc_mir`.
+MIR                     |  the Mid-level IR that is created after type-checking for use by borrowck and trans ([see more](./mir.html))
-obligation              |  something that must be proven by the trait system; see `librustc/traits`.
+obligation              |  something that must be proven by the trait system ([see more](trait-resolution.html))
 local crate             |  the crate currently being compiled.
 node-id or NodeId       |  an index identifying a particular node in the AST or HIR; gradually being phased out and replaced with `HirId`.
-query                   |  perhaps some sub-computation during compilation; see `librustc/maps`.
+query                   |  perhaps some sub-computation during compilation ([see more](query.html))
-provider                |  the function that executes a query; see `librustc/maps`.
+provider                |  the function that executes a query ([see more](query.html))
 sess                    |  the compiler session, which stores global data used throughout compilation
 side tables             |  because the AST and HIR are immutable once created, we often carry extra information about them in the form of hashtables, indexed by the id of a particular node.
 span                    |  a location in the user's source code, used for error reporting primarily. These are like a file-name/line-number/column tuple on steroids: they carry a start/end point, and also track macro expansions and compiler desugaring. All while being packed into a few bytes (really, it's an index into a table). See the Span datatype for more.
 substs                  |  the substitutions for a given generic type or item (e.g., the `i32`, `u32` in `HashMap<i32, u32>`)
-tcx                     |  the "typing context", main data structure of the compiler (see `librustc/ty`).
+tcx                     |  the "typing context", main data structure of the compiler ([see more](ty.html))
 'tcx                    |  the lifetime of the currently active inference context ([see more](ty.html))
 trans                   |  the code to translate MIR into LLVM IR.
-trait reference         |  a trait and values for its type parameters (see `librustc/ty`).
+trait reference         |  a trait and values for its type parameters ([see more](ty.html)).
-ty                      |  the internal representation of a type (see `librustc/ty`).
+ty                      |  the internal representation of a type ([see more](ty.html)).
--- a/src/hir-lowering.md
+++ b/src/hir-lowering.md
@ -1,4 +1,4 @@
-# HIR lowering
+# The HIR
 The HIR -- "High-level IR" -- is the primary IR used in most of
 rustc. It is a desugared version of the "abstract syntax tree" (AST)
@ -116,4 +116,4 @@ associated with an **owner**, which is typically some kind of item
 (e.g., a `fn()` or `const`), but could also be a closure expression
 (e.g., `|x, y| x + y`). You can use the HIR map to find the body
 associated with a given def-id (`maybe_body_owned_by()`) or to find
-the owner of a body (`body_owner_def_id()`).
+the owner of a body (`body_owner_def_id()`).
--- a/src/incremental-compilation.md
+++ b/src/incremental-compilation.md
@ -0,0 +1,139 @@
 # Incremental compilation
 The incremental compilation scheme is, in essence, a surprisingly
 simple extension to the overall query system. We'll start by describing
 a slightly simplified variant of the real thing, the "basic algorithm", and then describe
 some possible improvements.
 ## The basic algorithm
 The basic algorithm is
 called the **red-green** algorithm[^salsa]. The high-level idea is
 that, after each run of the compiler, we will save the results of all
 the queries that we do, as well as the **query DAG**. The
 **query DAG** is a [DAG] that indicates which queries executed which
 other queries. So for example there would be an edge from a query Q1
 to another query Q2 if computing Q1 required computing Q2 (note that
 because queries cannot depend on themselves, this results in a DAG and
 not a general graph).
 [DAG]: https://en.wikipedia.org/wiki/Directed_acyclic_graph
 On the next run of the compiler, then, we can sometimes reuse these
 query results to avoid re-executing a query. We do this by assigning
 every query a **color**:
 - If a query is colored **red**, that means that its result during
  this compilation has **changed** from the previous compilation.
 - If a query is colored **green**, that means that its result is
  the **same** as the previous compilation.
 There are two key insights here:
 - First, if all the inputs to query Q are colored green, then the
  query Q **must** result in the same value as last time and hence
  need not be re-executed (or else the compiler is not deterministic).
 - Second, even if some inputs to a query change, it may be that it
  **still** produces the same result as the previous compilation. In
  particular, the query may only use part of its input.
  - Therefore, after executing a query, we always check whether it
    produced the same result as the previous time. **If it did,** we
    can still mark the query as green, and hence avoid re-executing
    dependent queries.
 ### The try-mark-green algorithm
 The core of the incremental compilation is an algorithm called
 "try-mark-green". It has the job of determining the color of a given
 query Q (which must not yet have been executed). In cases where Q has
 red inputs, determining Q's color may involve re-executing Q so that
 we can compare its output; but if all of Q's inputs are green, then we
 can determine that Q must be green without re-executing it or inspecting
 its value whatsoever. In the compiler, this allows us to avoid
 deserializing the result from disk when we don't need it, and -- in
 fact -- enables us to sometimes skip *serializing* the result as well
 (see the refinements section below).
 Try-mark-green works as follows:
 - First check if the query Q was executed during the previous
  compilation.
  - If not, we can just re-execute the query as normal, and assign it the
    color of red.
 - If yes, then we load the `reads(Q)` vector from the
  query DAG. The "reads" is the set of queries that Q executed during
  its execution.
  - For each query R in `reads(Q)`, we recursively demand the color
    of R using try-mark-green.
    - Note: it is important that we visit each node in `reads(Q)` in the same order
      as they occurred in the original compilation. See [the section on the query DAG below](#dag).
    - If **any** of the nodes in `reads(Q)` wind up colored **red**, then Q is dirty.
      - We re-execute Q and compare the hash of its result to the hash of the result
        from the previous compilation.
      - If the hash has not changed, we can mark Q as **green** and return.
      - Otherwise, the result has changed, so we mark Q as **red** and return.
    - Otherwise, **all** of the nodes in `reads(Q)` must be **green**. In that case,
      we can color Q as **green** and return.
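The steps above can be sketched on a toy graph. This is only an illustration under stated assumptions, not the code in `librustc/dep_graph`: `DepGraph`, `Saved`, `new_hashes` and `demo_graph` are hypothetical names, queries are identified by plain strings, "re-executing" a query is faked by looking up a pre-computed hash, and inputs are assumed to have been colored before the walk starts.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Color {
    Red,
    Green,
}

/// What the previous compilation recorded for one query (toy layout).
struct Saved {
    reads: Vec<&'static str>, // subqueries, in the order they were read
    hash: u64,                // hash of the previous result
}

struct DepGraph {
    prev: HashMap<&'static str, Saved>,
    colors: HashMap<&'static str, Color>,
    new_hashes: HashMap<&'static str, u64>, // stand-in for re-executing a query
    executed: Vec<&'static str>,            // log of queries that were re-executed
}

impl DepGraph {
    /// Pretend to re-execute `q`, returning the hash of its new result.
    fn execute(&mut self, q: &'static str) -> u64 {
        self.executed.push(q);
        self.new_hashes[q]
    }

    fn try_mark_green(&mut self, q: &'static str) -> Color {
        if let Some(&known) = self.colors.get(q) {
            return known;
        }
        let saved = self.prev.get(q).map(|s| (s.reads.clone(), s.hash));
        let color = match saved {
            // Not executed last time: run it and color it red.
            None => {
                self.execute(q);
                Color::Red
            }
            Some((reads, old_hash)) => {
                // Visit the reads in their original order; stop at the first red one.
                let any_red = reads
                    .into_iter()
                    .any(|r| self.try_mark_green(r) == Color::Red);
                if !any_red {
                    Color::Green // all inputs green: no need to re-execute
                } else if self.execute(q) == old_hash {
                    Color::Green // inputs changed, but the result did not
                } else {
                    Color::Red
                }
            }
        };
        self.colors.insert(q, color);
        color
    }
}

/// Toy scenario: the source changed, `parse` changes, `type_of` does not.
fn demo_graph() -> DepGraph {
    let mut prev = HashMap::new();
    prev.insert("parse", Saved { reads: vec!["src"], hash: 1 });
    prev.insert("type_of", Saved { reads: vec!["parse"], hash: 10 });
    prev.insert("codegen", Saved { reads: vec!["type_of"], hash: 20 });
    let mut colors = HashMap::new();
    colors.insert("src", Color::Red); // assumption: inputs are colored up front
    let mut new_hashes = HashMap::new();
    new_hashes.insert("parse", 2);
    new_hashes.insert("type_of", 10);
    new_hashes.insert("codegen", 20);
    DepGraph { prev, colors, new_hashes, executed: Vec::new() }
}
```

In this scenario, asking for the color of `codegen` re-executes `parse` and `type_of`, but `codegen` itself comes back green without being run, because `type_of` produced the same hash as before.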
 <a name="dag">
 ### The query DAG
 The query DAG code is stored in
 [`src/librustc/dep_graph`][dep_graph]. Construction of the DAG is done
 by instrumenting the query execution. 
 One key point is that the query DAG also tracks ordering; that is, for
 each query Q, we not only track the queries that Q reads, we track the
 **order** in which they were read.  This allows try-mark-green to walk
 those queries back in the same order. This is important because once a subquery comes back as red,
 we can no longer be sure that Q will continue along the same path as before.
 That is, imagine a query like this:
 ```rust,ignore
 fn main_query(tcx) {
    if tcx.subquery1() {
        tcx.subquery2()
    } else {
        tcx.subquery3()
    }
 }
 ```
 Now imagine that in the first compilation, `main_query` starts by
 executing `subquery1`, and this returns true. In that case, the next
 query `main_query` executes will be `subquery2`, and `subquery3` will
 not be executed at all.
 But now imagine that in the **next** compilation, the input has
 changed such that `subquery1` returns **false**. In this case, `subquery2` would never
 execute. If try-mark-green were to visit `reads(main_query)` out of order,
 however, it might have visited `subquery2` before `subquery1`, and hence executed it.
 This can lead to ICEs and other problems in the compiler.
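To make the ordering point concrete, here is a small self-contained sketch that records reads in the order they happen. `ToyTcx`, its `flag` field and `reads_for` are hypothetical stand-ins for the real `tcx` and for the compiler's dependency tracking.

```rust
/// Toy context that logs which subqueries ran, in order.
struct ToyTcx {
    flag: bool,                // controls what `subquery1` returns
    reads: Vec<&'static str>,  // ordered log, like `reads(main_query)`
}

impl ToyTcx {
    fn subquery1(&mut self) -> bool {
        self.reads.push("subquery1");
        self.flag
    }
    fn subquery2(&mut self) -> u32 {
        self.reads.push("subquery2");
        2
    }
    fn subquery3(&mut self) -> u32 {
        self.reads.push("subquery3");
        3
    }
}

fn main_query(tcx: &mut ToyTcx) -> u32 {
    if tcx.subquery1() {
        tcx.subquery2()
    } else {
        tcx.subquery3()
    }
}

/// Runs `main_query` and returns the ordered list of reads it performed.
fn reads_for(flag: bool) -> Vec<&'static str> {
    let mut tcx = ToyTcx { flag, reads: Vec::new() };
    main_query(&mut tcx);
    tcx.reads
}
```

`reads_for(true)` records `subquery1` followed by `subquery2`, while `reads_for(false)` records `subquery1` followed by `subquery3`. Replaying the first list in order means `subquery1` is always checked before anything that was only reached because of its old result.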
 [dep_graph]: https://github.com/rust-lang/rust/tree/master/src/librustc/dep_graph
 ## Improvements to the basic algorithm
 In the description of the basic algorithm, we said that at the end of
 compilation we would save the results of all the queries that were
 performed.  In practice, this can be quite wasteful -- many of those
 results are very cheap to recompute, and serializing + deserializing
 them is not a particular win. In practice, what we would do is to save
 **the hashes** of all the subqueries that we performed. Then, in select cases,
 we **also** save the results.
 This is why the incremental algorithm separates computing the
 **color** of a node, which often does not require its value, from
 computing the **result** of a node. Computing the result is done via a simple algorithm
 like so:
 - Check if a saved result for Q is available. If so, compute the color of Q.
  If Q is green, deserialize and return the saved result.
 - Otherwise, execute Q.
  - We can then compare the hash of the result and color Q as green if
    it did not change.
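A rough sketch of that separation between color and result follows. The names `Previous`, `fingerprint` and `get_result` are made up for illustration, results are plain `String`s, and the standard library's `DefaultHasher` stands in for whatever hashing the compiler actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Clone, Copy, Debug, PartialEq)]
enum Color {
    Red,
    Green,
}

/// Hash a value so that results can be compared without storing them.
fn fingerprint<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Data kept from the previous session: always the hash, only sometimes the value.
struct Previous {
    hash: u64,
    saved_result: Option<String>,
}

/// Returns the result of Q together with its color.
/// `color_of_q` stands in for try-mark-green; `execute` stands in for the provider.
fn get_result(
    prev: &Previous,
    color_of_q: impl FnOnce() -> Color,
    execute: impl FnOnce() -> String,
) -> (String, Color) {
    if let Some(saved) = &prev.saved_result {
        if color_of_q() == Color::Green {
            // Green and saved: reuse the old value without executing Q.
            return (saved.clone(), Color::Green);
        }
    }
    // Otherwise execute Q and color it by comparing hashes.
    let result = execute();
    let color = if fingerprint(&result) == prev.hash {
        Color::Green
    } else {
        Color::Red
    };
    (result, color)
}
```

Note that when no result was saved, the query is executed even if it would have been green; the hash comparison afterwards still lets dependents be skipped.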
 # Footnotes
 [^salsa]: I have long wanted to rename it to the Salsa algorithm, but it never caught on. -@nikomatsakis
--- a/src/mir.md
+++ b/src/mir.md
@ -0,0 +1,6 @@
 # The MIR (Mid-level IR)
 TODO
 Defined in the `src/librustc/mir/` module, but much of the code that
 manipulates it is found in `src/librustc_mir`.
--- a/src/query.md
+++ b/src/query.md
@ -0,0 +1,314 @@
 # Queries: demand-driven compilation
 As described in [the high-level overview of the compiler][hl], the
 Rust compiler is currently transitioning from a traditional "pass-based"
 setup to a "demand-driven" system. **The Compiler Query System is the
 key to our new demand-driven organization.** The idea is pretty
 simple. You have various queries that compute things about the input
 -- for example, there is a query called `type_of(def_id)` that, given
 the def-id of some item, will compute the type of that item and return
 it to you.
 [hl]: high-level-overview.html
 Query execution is **memoized** -- so the first time you invoke a
 query, it will go do the computation, but the next time, the result is
 returned from a hashtable. Moreover, query execution fits nicely into
 **incremental computation**; the idea is roughly that, when you do a
 query, the result **may** be returned to you by loading stored data
 from disk (but that's a separate topic we won't discuss further here).
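As a minimal sketch of the memoization idea (not the compiler's actual cache), the following toy context caches the result of a `type_of`-like query in a hashtable. `ToyTcx`, `provider_calls` and the use of `u32` as a def-id are assumptions made for the example.

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;

/// Toy stand-in for the real `tcx`; `u32` plays the role of a def-id.
struct ToyTcx {
    type_of_cache: RefCell<HashMap<u32, String>>,
    provider_calls: Cell<usize>, // counts how often the computation really ran
}

impl ToyTcx {
    fn new() -> Self {
        ToyTcx {
            type_of_cache: RefCell::new(HashMap::new()),
            provider_calls: Cell::new(0),
        }
    }

    fn type_of(&self, def_id: u32) -> String {
        // Cache hit: clone the stored value and return it.
        let cached = self.type_of_cache.borrow().get(&def_id).cloned();
        if let Some(ty) = cached {
            return ty;
        }
        // Cache miss: do the (pretend) computation and remember the result.
        self.provider_calls.set(self.provider_calls.get() + 1);
        let ty = format!("Ty#{}", def_id);
        self.type_of_cache.borrow_mut().insert(def_id, ty.clone());
        ty
    }
}
```

Calling `type_of(7)` twice on the same `ToyTcx` runs the computation once; the second call is served from the hashtable.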
 The overall vision is that, eventually, the entire compiler
 control-flow will be query driven. There will effectively be one
 top-level query ("compile") that will run compilation on a crate; this
 will in turn demand information about that crate, starting from the
 *end*.  For example:
 - This "compile" query might demand to get a list of codegen-units
  (i.e., modules that need to be compiled by LLVM).
 - But computing the list of codegen-units would invoke some subquery
  that returns the list of all modules defined in the Rust source.
 - That query in turn would invoke something asking for the HIR.
 - This keeps going further and further back until we wind up doing the
  actual parsing.
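The chain above can be pictured with a toy in which every query simply calls the query it depends on. The method names mirror the bullets, but the bodies, the `ToyTcx` type and the log are invented for illustration; no memoization is shown here.

```rust
use std::cell::RefCell;

/// Toy context that logs the order in which queries are entered.
struct ToyTcx {
    log: RefCell<Vec<&'static str>>,
}

impl ToyTcx {
    fn enter(&self, name: &'static str) {
        self.log.borrow_mut().push(name);
    }

    fn parse(&self) -> Vec<&'static str> {
        self.enter("parse");
        vec!["mod a", "mod b"]
    }

    fn hir(&self) -> Vec<String> {
        self.enter("hir");
        self.parse().iter().map(|m| format!("hir({})", m)).collect()
    }

    fn modules(&self) -> Vec<String> {
        self.enter("modules");
        self.hir()
    }

    fn codegen_units(&self) -> usize {
        self.enter("codegen_units");
        self.modules().len()
    }

    /// The single top-level query: everything else is demanded from here.
    fn compile(&self) -> usize {
        self.enter("compile");
        self.codegen_units()
    }
}
```

Invoking only `compile()` ends up entering `codegen_units`, `modules`, `hir` and `parse` in that order, which is the sense in which compilation starts from the end.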
 However, that vision is not fully realized. Still, big chunks of the
 compiler (for example, generating MIR) work exactly like this.
 ### Invoking queries
 Invoking a query is simple. The tcx ("typing context") offers a method
 for each defined query. So, for example, to invoke the `type_of`
 query, you would just do this:
 ```rust
 let ty = tcx.type_of(some_def_id);
 ```
 ### Cycles between queries
 Currently, cycles during query execution should always result in a
 compilation error. Typically, they arise because of illegal programs
 that contain cyclic references they shouldn't (though sometimes they
 arise because of compiler bugs, in which case we need to factor our
 queries in a more fine-grained fashion to avoid them).
 However, it is nonetheless often useful to *recover* from a cycle
 (after reporting an error, say) and try to soldier on, so as to give a
 better user experience. In order to recover from a cycle, you don't
 get to use the nice method-call-style syntax. Instead, you invoke
 using the `try_get` method, which looks roughly like this:
 ```rust
 use ty::maps::queries;
 ...
 match queries::type_of::try_get(tcx, DUMMY_SP, self.did) {
  Ok(result) => {
    // no cycle occurred! You can use `result`
  }
  Err(err) => {
    // A cycle occurred! The error value `err` is a `DiagnosticBuilder`,
    // meaning essentially an "in-progress", not-yet-reported error message.
    // See below for more details on what to do here.
  }
 }
 ```
 So, if you get back an `Err` from `try_get`, then a cycle *did* occur. This means that
 you must ensure that a compiler error message is reported. You can do that in two ways:
 The simplest is to invoke `err.emit()`. This will emit the cycle error to the user.
 However, often cycles happen because of an illegal program, and you
 know at that point that an error either already has been reported or
 will be reported due to this cycle by some other bit of code. In that
 case, you can invoke `err.cancel()` to not emit any error. It is
 traditional to then invoke:
 ```
 tcx.sess.delay_span_bug(some_span, "some message")
 ```
 `delay_span_bug()` is a helper that says: we expect a compilation
 error to have happened or to happen in the future; so, if compilation
 ultimately succeeds, make an ICE with the message `"some
 message"`. This is basically just a precaution in case you are wrong.
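For intuition about where the `Err` comes from, here is a toy engine that detects a cycle by keeping a stack of the queries currently being executed. It is a sketch under assumptions: `Engine`, `CycleError` and the `deps` table are hypothetical, keys are bare `u32`s, and the error is a plain struct rather than a `DiagnosticBuilder`.

```rust
use std::collections::HashMap;

/// Toy cycle error: the stack of keys that led back to an active query.
#[derive(Debug, PartialEq)]
struct CycleError {
    stack: Vec<u32>,
}

struct Engine {
    deps: HashMap<u32, Vec<u32>>, // key -> keys its provider would query
    active: Vec<u32>,             // queries currently being executed
}

impl Engine {
    fn new(edges: Vec<(u32, Vec<u32>)>) -> Self {
        Engine { deps: edges.into_iter().collect(), active: Vec::new() }
    }

    /// Toy query: counts `key` plus everything it transitively demands.
    fn try_get(&mut self, key: u32) -> Result<u32, CycleError> {
        if self.active.contains(&key) {
            // `key` is already running further up the stack: that is a cycle.
            let mut stack = self.active.clone();
            stack.push(key);
            return Err(CycleError { stack });
        }
        self.active.push(key);
        let deps = self.deps.get(&key).cloned().unwrap_or_default();
        let mut total = 1;
        let mut outcome = Ok(());
        for dep in deps {
            match self.try_get(dep) {
                Ok(n) => total += n,
                Err(err) => {
                    outcome = Err(err);
                    break;
                }
            }
        }
        self.active.pop(); // always unwind the stack, even on error
        outcome.map(|()| total)
    }
}
```

A caller would `match` on the result in the same shape as the `try_get` snippet above, deciding in the `Err` arm whether to report the error or suppress it.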
 ### How the compiler executes a query
 So you may be wondering what happens when you invoke a query
 method. The answer is that, for each query, the compiler maintains a
 cache -- if your query has already been executed, then, the answer is
 simple: we clone the return value out of the cache and return it
 (therefore, you should try to ensure that the return types of queries
 are cheaply cloneable; insert a `Rc` if necessary).
 #### Providers
 If, however, the query is *not* in the cache, then the compiler will
 try to find a suitable **provider**. A provider is a function that has
 been defined and linked into the compiler somewhere that contains the
 code to compute the result of the query.
 **Providers are defined per-crate.** The compiler maintains,
 internally, a table of providers for every crate, at least
 conceptually. Right now, there are really two sets: the providers for
 queries about the **local crate** (that is, the one being compiled)
 and providers for queries about **external crates** (that is,
 dependencies of the local crate). Note that what determines the crate
 that a query is targeting is not the *kind* of query, but the *key*.
 For example, when you invoke `tcx.type_of(def_id)`, that could be a
 local query or an external query, depending on what crate the `def_id`
 is referring to (see the `self::keys::Key` trait for more information
 on how that works).
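The idea that the *key* selects the provider can be sketched as follows. Everything here is a simplification: the `DefId` layout, the `Key` trait with a `crate_of` method, and the two provider functions are hypothetical and only loosely modeled on the description above.

```rust
/// Toy def-id: a crate number plus an index within that crate.
#[derive(Clone, Copy, Debug, PartialEq)]
struct DefId {
    krate: u32,
    index: u32,
}

const LOCAL_CRATE: u32 = 0;

/// Hypothetical stand-in for the real `Key` trait: maps a key to a crate.
trait Key {
    fn crate_of(&self) -> u32;
}

impl Key for DefId {
    fn crate_of(&self) -> u32 {
        self.krate
    }
}

fn local_type_of(id: DefId) -> String {
    format!("local#{}", id.index)
}

fn extern_type_of(id: DefId) -> String {
    format!("extern{}#{}", id.krate, id.index)
}

/// Same query, two providers: which one runs depends only on the key.
fn type_of(id: DefId) -> String {
    if id.crate_of() == LOCAL_CRATE {
        local_type_of(id)
    } else {
        extern_type_of(id)
    }
}
```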
 Providers always have the same signature:
 ```rust
 fn provider<'cx, 'tcx>(tcx: TyCtxt<'cx, 'tcx, 'tcx>,
                       key: QUERY_KEY)
                       -> QUERY_RESULT
 {
    ...
 }
 ```
 Providers take two arguments: the `tcx` and the query key. Note also
 that they take the *global* tcx (i.e., they use the `'tcx` lifetime
 twice), rather than taking a tcx with some active inference context.
 They return the result of the query.
 ####  How providers are set up
 When the tcx is created, it is given the providers by its creator using
 the `Providers` struct. This struct is generated by the macros here, but it
 is basically a big list of function pointers:
 ```rust
 struct Providers {
    type_of: for<'cx, 'tcx> fn(TyCtxt<'cx, 'tcx, 'tcx>, DefId) -> Ty<'tcx>,
    ...
 }
 ```
 At present, we have one copy of the struct for local crates, and one
 for external crates, though the plan is that we may eventually have
 one per crate.
 These `Providers` structs are ultimately created and populated by
 `librustc_driver`, but it does this by distributing the work
 throughout the other `rustc_*` crates. This is done by invoking
 various `provide` functions. These functions tend to look something
 like this:
 ```rust
 pub fn provide(providers: &mut Providers) {
    *providers = Providers {
        type_of,
        ..*providers
    };
 }
 ```
 That is, they take an `&mut Providers` and mutate it in place. Usually
 we use the formulation above just because it looks nice, but you could
 as well do `providers.type_of = type_of`, which would be equivalent.
 (Here, `type_of` would be a top-level function, defined as we saw
 before.) So, if we want to add a provider for some other query,
 let's call it `fubar`, into the crate above, we might modify the `provide()`
 function like so:
 ```rust
 pub fn provide(providers: &mut Providers) {
    *providers = Providers {
        type_of,
        fubar,
        ..*providers
    };
 }
 fn fubar<'cx, 'tcx>(tcx: TyCtxt<'cx, 'tcx, 'tcx>, key: DefId) -> Fubar<'tcx> { .. }
 ```
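The snippets above rely on compiler-internal types, so here is a stripped-down version of the same pattern that compiles on its own. The two-field `Providers` struct, the `u32` keys, the panicking defaults and the `typeck`/`other` modules are assumptions made for the sketch; only the struct-update idiom is taken from the text.

```rust
/// Toy provider table: one function pointer per query.
#[derive(Clone, Copy)]
struct Providers {
    type_of: fn(u32) -> String,
    fubar: fn(u32) -> usize,
}

fn unset_type_of(_: u32) -> String {
    panic!("no provider registered for `type_of`")
}

fn unset_fubar(_: u32) -> usize {
    panic!("no provider registered for `fubar`")
}

impl Default for Providers {
    fn default() -> Self {
        Providers { type_of: unset_type_of, fubar: unset_fubar }
    }
}

mod typeck {
    use super::Providers;

    fn type_of(def_id: u32) -> String {
        format!("Ty({})", def_id)
    }

    /// Overwrite only `type_of`; every other field is copied from the old table.
    pub fn provide(providers: &mut Providers) {
        *providers = Providers { type_of, ..*providers };
    }
}

mod other {
    use super::Providers;

    fn fubar(def_id: u32) -> usize {
        def_id as usize * 2
    }

    /// The equivalent field-assignment form.
    pub fn provide(providers: &mut Providers) {
        providers.fubar = fubar;
    }
}

/// Plays the role of the driver: asks each module to fill in its entries.
fn build_providers() -> Providers {
    let mut providers = Providers::default();
    typeck::provide(&mut providers);
    other::provide(&mut providers);
    providers
}
```

Because each `provide` function touches only its own fields, the order in which the driver calls them does not matter in this sketch.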
 NB. Most of the `rustc_*` crates only provide **local
 providers**. Almost all **extern providers** wind up going through the
 [`rustc_metadata` crate][rustc_metadata], which loads the information from the crate
 metadata.  But in some cases there are crates that provide queries for
 *both* local and external crates, in which case they define both a
 `provide` and a `provide_extern` function that `rustc_driver` can
 invoke.
 [rustc_metadata]: https://github.com/rust-lang/rust/tree/master/src/librustc_metadata
 ### Adding a new kind of query
 So suppose you want to add a new kind of query, how do you do so?
 Well, defining a query takes place in two steps:
 1. first, you have to specify the query name and arguments; and then,
 2. you have to supply query providers where needed.
 To specify the query name and arguments, you simply add an entry to
 the big macro invocation in
 [`src/librustc/ty/maps/mod.rs`][maps-mod]. This will probably have
 changed by the time you read this README, but at present it looks
 something like:
 [maps-mod]: https://github.com/rust-lang/rust/blob/master/src/librustc/ty/maps/mod.rs
 ```
 define_maps! { <'tcx>
    /// Records the type of every item.
    [] fn type_of: TypeOfItem(DefId) -> Ty<'tcx>,
    ...
 }
 ```
 Each line of the macro defines one query. The name is broken up like this:
 ```
 [] fn type_of: TypeOfItem(DefId) -> Ty<'tcx>,
 ^^    ^^^^^^^  ^^^^^^^^^^ ^^^^^     ^^^^^^^^
 |     |        |          |         |
 |     |        |          |         result type of query
 |     |        |          query key type
 |     |        dep-node constructor
 |     name of query
 query flags
 ```
 Let's go over them one by one:
 - **Query flags:** these are largely unused right now, but the intention
  is that we'll be able to customize various aspects of how the query is
  processed.
 - **Name of query:** the name of the query method
  (`tcx.type_of(..)`). Also used as the name of a struct
  (`ty::maps::queries::type_of`) that will be generated to represent
  this query.
 - **Dep-node constructor:** indicates the constructor function that
  connects this query to incremental compilation. Typically, this is a
  `DepNode` variant, which can be added by modifying the
  `define_dep_nodes!` macro invocation in
  [`librustc/dep_graph/dep_node.rs`][dep-node].
  - However, sometimes we use a custom function, in which case the
    name will be in snake case and the function will be defined at the
    bottom of the file. This is typically used when the query key is
    not a def-id, or just not the type that the dep-node expects.
 - **Query key type:** the type of the argument to this query.
  This type must implement the `ty::maps::keys::Key` trait, which
  defines (for example) how to map it to a crate, and so forth.
 - **Result type of query:** the type produced by this query. This type
  should (a) not use `RefCell` or other interior mutability and (b) be
  cheaply cloneable. Interning or using `Rc` or `Arc` is recommended for
  non-trivial data types.
  - The one exception to those rules is the `ty::steal::Steal` type,
    which is used to cheaply modify MIR in place. See the definition
    of `Steal` for more details. New uses of `Steal` should **not** be
    added without alerting `@rust-lang/compiler`.
 [dep-node]: https://github.com/rust-lang/rust/blob/master/src/librustc/dep_graph/dep_node.rs
 So, to add a query:
 - Add an entry to `define_maps!` using the format above.
 - Possibly add a corresponding entry to the dep-node macro.
 - Link the provider by modifying the appropriate `provide` method;
  or add a new one if needed and ensure that `rustc_driver` is invoking it.
 #### Query structs and descriptions
 For each kind, the `define_maps` macro will generate a "query struct"
 named after the query. This struct is a kind of a place-holder
 describing the query. Each such struct implements the
 `self::config::QueryConfig` trait, which has associated types for the
 key/value of that particular query. Basically the code generated looks something
 like this:
 ```rust
 // Dummy struct representing a particular kind of query:
 pub struct type_of<'tcx> { phantom: PhantomData<&'tcx ()> }
 impl<'tcx> QueryConfig for type_of<'tcx> {
  type Key = DefId;
  type Value = Ty<'tcx>;
 }
 ```
 There is an additional trait that you may wish to implement called
 `self::config::QueryDescription`. This trait is used during cycle
 errors to give a "human readable" name for the query, so that we can
 summarize what was happening when the cycle occurred. Implementing
 this trait is optional if the query key is `DefId`, but if you *don't*
 implement it, you get a pretty generic error ("processing `foo`...").
 You can put new impls into the `config` module. They look something like this:
 ```rust
 impl<'tcx> QueryDescription for queries::type_of<'tcx> {
    fn describe(tcx: TyCtxt, key: DefId) -> String {
        format!("computing the type of `{}`", tcx.item_path_str(key))
    }
 }
 ```
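For readers who want to experiment with the shape of these traits outside the compiler, the following self-contained sketch mimics them. The trait definitions are simplified guesses (the real `describe` also takes a `tcx`), and the `u32` key and `&str` value types are placeholders.

```rust
use std::marker::PhantomData;

/// Simplified stand-in for `QueryConfig`: ties a query to its key and value types.
trait QueryConfig {
    type Key;
    type Value;
}

/// Simplified stand-in for `QueryDescription` (no `tcx` parameter here).
trait QueryDescription: QueryConfig {
    fn describe(key: Self::Key) -> String;
}

/// Dummy struct representing one kind of query, named after the query.
#[allow(non_camel_case_types, dead_code)]
struct type_of<'tcx> {
    phantom: PhantomData<&'tcx ()>,
}

impl<'tcx> QueryConfig for type_of<'tcx> {
    type Key = u32;
    type Value = &'tcx str;
}

impl<'tcx> QueryDescription for type_of<'tcx> {
    fn describe(key: u32) -> String {
        format!("computing the type of `item#{}`", key)
    }
}

/// Generic code can name a query's key type through the trait.
fn describe_query<Q: QueryDescription>(key: Q::Key) -> String {
    Q::describe(key)
}
```

The struct is never constructed; it exists only so that generic code such as `describe_query::<type_of>(7)` has a type to hang the associated types and the description on.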