From b8af56c8ac7f60672642f85cc0ac212aded57bed Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Fri, 25 Jan 2019 16:50:22 +0100 Subject: [PATCH] Add a more detailed description of how incremental compilation works. --- src/SUMMARY.md | 5 +- src/appendix/glossary.md | 2 +- .../incremental-compilation-in-detail.md | 354 ++++++++++++++++++ src/{ => queries}/incremental-compilation.md | 0 .../query-evaluation-model-in-detail.md | 27 +- src/query.md | 5 +- src/variance.md | 2 +- 7 files changed, 376 insertions(+), 19 deletions(-) create mode 100644 src/queries/incremental-compilation-in-detail.md rename src/{ => queries}/incremental-compilation.md (100%) diff --git a/src/SUMMARY.md b/src/SUMMARY.md index 4d613670..cd3c9d33 100644 --- a/src/SUMMARY.md +++ b/src/SUMMARY.md @@ -20,8 +20,9 @@ - [The Rustc Driver](./rustc-driver.md) - [Rustdoc](./rustdoc.md) - [Queries: demand-driven compilation](./query.md) - - [The Query Evaluation Model in Detail](./query-evaluation-model-in-detail.md) - - [Incremental compilation](./incremental-compilation.md) + - [The Query Evaluation Model in Detail](./queries/query-evaluation-model-in-detail.md) + - [Incremental compilation](./queries/incremental-compilation.md) + - [Incremental compilation In Detail](./queries/incremental-compilation-in-detail.md) - [Debugging and Testing](./incrcomp-debugging.md) - [The parser](./the-parser.md) - [`#[test]` Implementation](./test-implementation.md) diff --git a/src/appendix/glossary.md b/src/appendix/glossary.md index 0e697aee..decfb442 100644 --- a/src/appendix/glossary.md +++ b/src/appendix/glossary.md @@ -15,7 +15,7 @@ completeness | completeness is a technical term in type theory. Comp control-flow graph | a representation of the control-flow of a program; see [the background chapter for more](./background.html#cfg) CTFE | Compile-Time Function Evaluation. This is the ability of the compiler to evaluate `const fn`s at compile time. This is part of the compiler's constant evaluation system. ([see more](../const-eval.html)) cx | we tend to use "cx" as an abbreviation for context. See also `tcx`, `infcx`, etc. -DAG | a directed acyclic graph is used during compilation to keep track of dependencies between queries. ([see more](../incremental-compilation.html)) +DAG | a directed acyclic graph is used during compilation to keep track of dependencies between queries. ([see more](../queries/incremental-compilation.html)) data-flow analysis | a static analysis that figures out what properties are true at each point in the control-flow of a program; see [the background chapter for more](./background.html#dataflow) DefId | an index identifying a definition (see `librustc/hir/def_id.rs`). Uniquely identifies a `DefPath`. Double pointer | a pointer with additional metadata. See "fat pointer" for more. diff --git a/src/queries/incremental-compilation-in-detail.md b/src/queries/incremental-compilation-in-detail.md new file mode 100644 index 00000000..fbe226e9 --- /dev/null +++ b/src/queries/incremental-compilation-in-detail.md @@ -0,0 +1,354 @@ +# Incremental Compilation In Detail + +The incremental compilation scheme is, in essence, a surprisingly +simple extension to the overall query system. It relies on the fact that: + + 1. queries are pure functions -- given the same inputs, a query will always + yield the same result, and + 2. the query model structures compilation in an acyclic graph that makes + dependencies between individual computations explicit. 
+
+This chapter explains how we can use these properties to make things
+incremental and then goes on to discuss various implementation issues.
+
+# A Basic Algorithm For Incremental Query Evaluation
+
+As explained in the [query evaluation model primer][query-model], query
+invocations form a directed acyclic graph. Here's the example from the
+previous chapter again:
+
+```ignore
+  list_of_all_hir_items <----------------------------- type_check_crate()
+                                                               |
+                                                               |
+  Hir(foo) <--- type_of(foo) <--- type_check_item(foo) <-------+
+                                      |                        |
+                    +-----------------+                        |
+                    |                                          |
+                    v                                          |
+  Hir(bar) <--- type_of(bar) <--- type_check_item(bar) <-------+
+```
+
+Since every access from one query to another has to go through the query
+context, we can record these accesses and thus actually build this dependency
+graph in memory. With dependency tracking enabled, when compilation is done,
+we know which queries were invoked (the nodes of the graph) and, for each
+invocation, which other queries or inputs went into computing the query's
+result (the edges of the graph).
+
+Now suppose we change the source code of our program so that the
+HIR of `bar` looks different than before. Our goal is to only recompute
+those queries that are actually affected by the change while just re-using
+the cached results of all the other queries. Given the dependency graph, we can
+do exactly that. For a given query invocation, the graph tells us exactly
+what data went into computing its result: we just have to follow the
+edges until we reach something that has changed. If we don't encounter
+anything that has changed, we know that the query would still evaluate to
+the same result we already have in our cache.
+
+Taking the `type_of(foo)` invocation from above as an example, we can check
+whether the cached result is still valid by following the edges to its
+inputs. The only edge leads to `Hir(foo)`, an input that has not been affected
+by the change. So we know that the cached result for `type_of(foo)` is still
+valid.
+
+The story is a bit different for `type_check_item(foo)`: we again walk the
+edges and already know that `type_of(foo)` is fine. Then we get to
+`type_of(bar)`, which we have not checked yet, so we walk the edges of
+`type_of(bar)` and encounter `Hir(bar)`, which *has* changed. Consequently,
+the result of `type_of(bar)` might be different from what we
+have in the cache and, transitively, the result of `type_check_item(foo)`
+might have changed too. We thus re-run `type_check_item(foo)`, which in
+turn will re-run `type_of(bar)`, which will yield an up-to-date result
+because it reads the up-to-date version of `Hir(bar)`.
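+
+To make this basic scheme a bit more concrete, here is a small sketch of the
+validity check described above. This is purely illustrative pseudo-code: the
+`dep_graph` methods used here are hypothetical stand-ins, not the compiler's
+actual interface.
+
+```rust,ignore
+// Sketch of the basic algorithm: a cached result may only be reused if
+// nothing it (transitively) depends on has changed.
+fn cached_result_is_still_valid(tcx, node) -> bool {
+    if tcx.dep_graph.is_input(node) {
+        // Inputs (like `Hir(foo)`) can be compared against the previous
+        // compilation session directly.
+        return !tcx.dep_graph.input_has_changed(node);
+    }
+
+    // For query nodes, follow the edges: the cached result is only valid
+    // if *all* of the things it read are themselves still unchanged.
+    tcx.dep_graph
+       .get_dependencies_of(node)
+       .iter()
+       .all(|dep| cached_result_is_still_valid(tcx, dep))
+}
+```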
+
+
+# The Problem With The Basic Algorithm: False Positives
+
+If you read the previous paragraph carefully, you'll notice that it says that
+`type_of(bar)` *might* have changed because one of its inputs has changed.
+There's also the possibility that it might still yield exactly the same
+result *even though* its input has changed. Consider an example with a
+simple query that just computes the sign of an integer:
+
+```ignore
+  IntValue(x) <---- sign_of(x) <--- some_other_query(x)
+```
+
+Let's say that `IntValue(x)` starts out as `1000` and then is set to `2000`.
+Even though `IntValue(x)` is different in the two cases, `sign_of(x)` yields
+the result `+` in both cases.
+
+If we follow the basic algorithm, however, `some_other_query(x)` would have to
+(unnecessarily) be re-evaluated because it transitively depends on a changed
+input. Change detection yields a "false positive" in this case because it has
+to conservatively assume that `some_other_query(x)` might be affected by that
+changed input.
+
+Unfortunately it turns out that the actual queries in the compiler are full
+of examples like this, and small changes to the input can potentially affect
+very large parts of the output binaries. As a consequence, we had to make the
+change detection system smarter and more accurate.
+
+# Improving Accuracy: The Red-Green Algorithm
+
+The "false positives" problem can be solved by interleaving change detection
+and query re-evaluation. Instead of walking the graph all the way to the
+inputs when trying to find out if some cached result is still valid, we can
+check if a result has *actually* changed after we were forced to re-evaluate
+it.
+
+We call this algorithm, for better or worse, the red-green algorithm because
+nodes in the dependency graph are assigned the color green if we were able to
+prove that their cached results are still valid, and the color red if a result
+has turned out to be different after re-evaluating it.
+
+The meat of red-green change tracking is implemented in the try-mark-green
+algorithm, which, as you may have guessed, tries to mark a given node as green:
+
+```rust,ignore
+fn try_mark_green(tcx, current_node) -> bool {
+
+    // Fetch the inputs to `current_node`, i.e. get the nodes that the direct
+    // edges from `current_node` lead to.
+    let dependencies = tcx.dep_graph.get_dependencies_of(current_node);
+
+    // Now check all the inputs for changes
+    for dependency in dependencies {
+
+        match tcx.dep_graph.get_node_color(dependency) {
+            Green => {
+                // This input has already been checked before and it has not
+                // changed; so we can go on to check the next one
+            }
+            Red => {
+                // We found an input that has changed. We cannot mark
+                // `current_node` as green without re-running the
+                // corresponding query.
+                return false
+            }
+            Unknown => {
+                // This is the first time we look at this node. Let's try
+                // to mark it green by calling try_mark_green() recursively.
+                if try_mark_green(tcx, dependency) {
+                    // We successfully marked the input as green, on to the
+                    // next.
+                } else {
+                    // We could *not* mark the input as green. This means we
+                    // don't know if its value has changed. In order to find
+                    // out, we re-run the corresponding query now!
+                    tcx.run_query_for(dependency);
+
+                    // Fetch and check the node color again. Running the query
+                    // has forced it to either red (if it yielded a different
+                    // result than we have in the cache) or green (if it
+                    // yielded the same result).
+                    match tcx.dep_graph.get_node_color(dependency) {
+                        Red => {
+                            // The input turned out to be red, so we cannot
+                            // mark `current_node` as green.
+                            return false
+                        }
+                        Green => {
+                            // Re-running the query paid off! The result is the
+                            // same as before, so this particular input does
+                            // not invalidate `current_node`.
+                        }
+                        Unknown => {
+                            // There is no way a node has no color after
+                            // re-running the query.
+                            panic!("unreachable")
+                        }
+                    }
+                }
+            }
+        }
+    }
+
+    // If we have gotten through the entire loop, it means that all inputs
+    // have turned out to be green. If all inputs are unchanged, it means
+    // that the query result corresponding to `current_node` cannot have
+    // changed either.
+    tcx.dep_graph.mark_green(current_node);
+
+    true
+}
+
+// Note: The actual implementation can be found in
+// src/librustc/dep_graph/graph.rs
+```
+
+By using red-green marking we can avoid the devastating cumulative effect of
+having false positives during change detection. Whenever a query is executed
+in incremental mode, we first check if it's already green. If not, we run
+`try_mark_green()` on it. If it still isn't green after that, then we actually
+invoke the query provider to re-compute the result.
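+
+In pseudo-code, the decision the query engine makes for each query invocation
+in incremental mode could be sketched as follows. The function names here are
+illustrative and do not correspond one-to-one to the real implementation:
+
+```rust,ignore
+fn force_query(tcx, query_node) -> QueryResult {
+    // If the node is already green, its cached result is known to be valid.
+    // If try-mark-green succeeds, we have just proven that it is valid.
+    if tcx.dep_graph.is_green(query_node) || try_mark_green(tcx, query_node) {
+        return tcx.load_cached_result(query_node);
+    }
+
+    // Otherwise some input really has changed (the node is red), so we have
+    // to run the query provider and record fresh dependency edges.
+    tcx.run_query_provider_and_track_deps(query_node)
+}
+```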
+
+
+# The Real World: How Persistence Makes Everything Complicated
+
+The sections above described the underlying algorithm for incremental
+compilation, but since the compiler process exits once it has finished,
+taking the query context and its result cache with it into oblivion, we have
+to persist data to disk so that the next compilation session can make use of
+it. This comes with a whole new set of implementation challenges:
+
+- The query result cache is stored to disk, so previous results are not
+  readily available for change comparison.
+- A subsequent compilation session will start off with a new version of the
+  code that has arbitrary changes applied to it. All kinds of IDs and indices
+  that are generated from a global, sequential counter (e.g. `NodeId`, `DefId`,
+  etc.) might have shifted, making the persisted results on disk not
+  immediately usable anymore, because the same numeric IDs and indices might
+  refer to completely new things in the new compilation session.
+- Persisting things to disk comes at a cost, so not every tiny piece of
+  information should actually be cached in between compilation sessions.
+  Fixed-size, plain-old data is preferred to complex things that need to run
+  branching code during (de-)serialization.
+
+The following sections describe how the compiler currently solves these issues.
+
+## A Question Of Stability: Bridging The Gap Between Compilation Sessions
+
+As noted before, various IDs (like `DefId`) are generated by the compiler in a
+way that depends on the contents of the source code being compiled. ID
+assignment is usually deterministic; that is, if the exact same code is
+compiled twice, the same things will end up with the same IDs. However, if
+something changes, e.g. a function is added in the middle of a file, there is
+no guarantee that anything will have the same ID as it had before.
+
+As a consequence, we cannot represent the data in our on-disk cache the same
+way it is represented in memory. For example, if we just stored a piece
+of type information like `TyKind::FnDef(DefId, &'tcx Substs<'tcx>)` (as we do
+in memory) and the contained `DefId` then pointed to a different function in
+a new compilation session, we'd be in trouble.
+
+The solution to this problem is to find "stable" forms for IDs which remain
+valid in between compilation sessions. For the most important case, `DefId`s,
+these are the so-called `DefPath`s. Each `DefId` has a
+corresponding `DefPath`, but in place of a numeric ID, a `DefPath` is based on
+the path to the identified item, e.g. `std::collections::HashMap`. The
+advantage of an ID like this is that it is not affected by unrelated changes.
+For example, one can add a new function to `std::collections` but
+`std::collections::HashMap` would still be `std::collections::HashMap`. A
+`DefPath` is "stable" across changes made to the source code while a `DefId`
+isn't.
+
+There is also the `DefPathHash`, which is just a 128-bit hash value of the
+`DefPath`. The two contain the same information and we mostly use the
+`DefPathHash` because it is simpler to handle, being `Copy` and self-contained.
+
+This principle of stable identifiers is used to make the data in the on-disk
+cache resilient to source code changes. Instead of storing a `DefId`, we store
+the `DefPathHash`, and when we deserialize something from the cache, we map the
+`DefPathHash` to the corresponding `DefId` in the *current* compilation session
+(which is just a simple hash table lookup).
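+
+As an illustration, the (de-)serialization of a `DefId` conceptually works
+like the sketch below. The function and field names are made up for this
+example and do not correspond directly to the compiler's actual interfaces:
+
+```rust,ignore
+// Conceptual sketch only.
+fn encode_def_id(tcx, def_id: DefId) -> DefPathHash {
+    // When writing to the on-disk cache, replace the session-specific
+    // numeric ID with its stable form.
+    tcx.def_path_hash(def_id)
+}
+
+fn decode_def_id(tcx, hash: DefPathHash) -> DefId {
+    // When reading from the cache in a later session, translate the stable
+    // form back to whatever numeric ID the item has *now*. This is just a
+    // hash table lookup.
+    tcx.def_path_hash_to_def_id[&hash]
+}
+```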
+
+The `HirId`, used for identifying HIR components that don't have their own
+`DefId`, is another such stable ID. It is (conceptually) a pair of a `DefPath`
+and a `LocalId`, where the `LocalId` identifies something (e.g. a `hir::Expr`)
+locally within its "owner" (e.g. a `hir::Item`). If the owner is moved around,
+the `LocalId`s within it are still the same.
+
+
+
+## Checking Query Results For Changes: StableHash And Fingerprints
+
+In order to do red-green marking we often need to check if the result of a
+query has changed compared to the result it had during the previous
+compilation session. There are two performance problems with this, though:
+
+- We'd like to avoid having to load the previous result from disk just for
+  doing the comparison. We already computed the new result and will use that.
+  Also, loading a result from disk would "pollute" the interners with data
+  that is unlikely to ever be used.
+- We don't want to store each and every result in the on-disk cache. For
+  example, it would be wasted effort to persist things to disk that are
+  already available in upstream crates.
+
+The compiler avoids these problems by using so-called `Fingerprint`s. Each time
+a new query result is computed, the query engine will compute a 128-bit hash
+value of the result. We call this hash value "the `Fingerprint` of the query
+result". The hashing is (and has to be) done "in a stable way". This means
+that whenever something is hashed that might change in between compilation
+sessions (e.g. a `DefId`), we instead hash its stable equivalent
+(e.g. the corresponding `DefPath`). That's what the whole `StableHash`
+infrastructure is for. This way `Fingerprint`s computed in two
+different compilation sessions are still comparable.
+
+The next step is to store these fingerprints along with the dependency graph.
+This is cheap since fingerprints are just bytes to be copied. It's also cheap to
+load the entire set of fingerprints together with the dependency graph.
+
+Now, when red-green marking reaches the point where it needs to check if a
+result has changed, it can just compare the (already loaded) previous
+fingerprint to the fingerprint of the new result.
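+
+As a small pseudo-code sketch (again with illustrative names rather than the
+compiler's actual API), the check that turns a re-evaluated node red or green
+looks roughly like this:
+
+```rust,ignore
+fn color_after_re_evaluation(tcx, node, new_result) -> Color {
+    // Hash the new result via the StableHash infrastructure so that the
+    // fingerprint is comparable across compilation sessions.
+    let new_fingerprint = stable_fingerprint_of(tcx, &new_result);
+
+    // The previous fingerprint was loaded together with the previous
+    // dependency graph, so no old query result needs to be deserialized.
+    if tcx.previous_dep_graph.fingerprint_of(node) == Some(new_fingerprint) {
+        Color::Green // same result as in the previous session
+    } else {
+        Color::Red // the result has changed
+    }
+}
+```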
+
+This approach works rather well but it's not without flaws:
+
+- There is a small possibility of hash collisions. That is, two different
+  results could have the same fingerprint and the system would erroneously
+  assume that the result hasn't changed, leading to a missed update.
+
+  We mitigate this risk by using a high-quality hash function and a 128-bit
+  wide hash value. Due to these measures the practical risk of a hash
+  collision is negligible.
+
+- Computing fingerprints is quite costly. It is the main reason why incremental
+  compilation can be slower than non-incremental compilation. We are forced to
+  use a good and thus expensive hash function, and we have to map things to
+  their stable equivalents while doing the hashing.
+
+In the future we might want to explore different approaches to this problem.
+For now it's `StableHash` and `Fingerprint`.
+
+
+
+## A Tale Of Two DepGraphs: The Old And The New
+
+The initial description of dependency tracking glosses over a few details
+that quickly become a head scratcher when actually trying to implement things.
+In particular, it's easy to overlook that we are actually dealing with *two*
+dependency graphs: the one we built during the previous compilation session and
+the one that we are building for the current compilation session.
+
+When a compilation session starts, the compiler loads the previous dependency
+graph into memory as an immutable piece of data. Then, when a query is invoked,
+it will first try to mark the corresponding node in the graph as green. More
+precisely, we are trying to mark the node in the *previous* dep-graph that
+corresponds to the query key in the *current* session as green. How do we
+do this mapping between current query key and previous `DepNode`? The answer
+is again `Fingerprint`s: nodes in the dependency graph are identified by a
+fingerprint of the query key. Since fingerprints are stable across compilation
+sessions, computing one in the current session allows us to find a node
+in the dependency graph from the previous session. If we don't find a node with
+the given fingerprint, it means that the query key refers to something that
+did not yet exist in the previous session.
+
+So, having found the dep-node in the previous dependency graph, we can look
+up its dependencies (also dep-nodes in the previous graph) and continue with
+the rest of the try-mark-green algorithm. The next interesting thing happens
+when we successfully mark a node as green. At that point we copy the node
+and the edges to its dependencies from the old graph into the new graph. We
+have to do this because the new dep-graph cannot acquire the
+node and edges via the regular dependency tracking. The tracking system can
+only record edges while actually running a query -- but running the query,
+although we have the result already cached, is exactly what we want to avoid.
+
+Once the compilation session has finished, all the unchanged parts have been
+copied over from the old into the new dependency graph, while the changed parts
+have been added to the new graph by the tracking system. At this point, the
+new graph is serialized out to disk, alongside the query result cache, and can
+act as the previous dep-graph in a subsequent compilation session.
+
+
+## Didn't You Forget Something?: Cache Promotion
+TODO
+
+
+# The Future: Shortcomings Of The Current System and Possible Solutions
+TODO
+
+
+[query-model]: ./query-evaluation-model-in-detail.html
diff --git a/src/incremental-compilation.md b/src/queries/incremental-compilation.md
similarity index 100%
rename from src/incremental-compilation.md
rename to src/queries/incremental-compilation.md
diff --git a/src/queries/query-evaluation-model-in-detail.md b/src/queries/query-evaluation-model-in-detail.md
index d2dc1047..41789637 100644
--- a/src/queries/query-evaluation-model-in-detail.md
+++ b/src/queries/query-evaluation-model-in-detail.md
@@ -123,23 +123,24 @@ fn type_check_crate_provider(tcx, _key: ()) {
 }
 ```
 
-We see that the `type_check_crate` query accesses input data (`tcx.hir_map`)
-and invokes other queries (`type_check_item`). The `type_check_item`
+We see that the `type_check_crate` query accesses input data
+(`tcx.hir_map.list_of_items()`) and invokes other queries
+(`type_check_item`).
The `type_check_item` invocations will themselves access input data and/or invoke other queries, so that in the end the DAG of query invocations will be built up backwards from the node that was initially executed: -``` - (1) - hir_map <--------------------------------------------------- type_check_crate() - ^ | - | (4) (3) (2) | - +-- Hir(foo) <--- type_of(foo) <--- type_check_item(foo) <-------+ - | | | - | +-----------------+ | - | | | - | (6) v (5) (7) | - +-- Hir(bar) <--- type_of(bar) <--- type_check_item(bar) <-------+ +```ignore + (2) (1) + list_of_all_hir_items <----------------------------- type_check_crate() + | + (5) (4) (3) | + Hir(foo) <--- type_of(foo) <--- type_check_item(foo) <-------+ + | | + +-----------------+ | + | | + (7) v (6) (8) | + Hir(bar) <--- type_of(bar) <--- type_check_item(bar) <-------+ // (x) denotes invocation order ``` diff --git a/src/query.md b/src/query.md index ee7d60e4..703c560e 100644 --- a/src/query.md +++ b/src/query.md @@ -37,8 +37,8 @@ compiler (for example, generating MIR) work exactly like this. ### The Query Evaluation Model in Detail -The [Query Evaluation Model in Detail](query-evaluation-model-in-detail.html) -chapter gives a more in-depth description of what queries are and how they work. +The [Query Evaluation Model in Detail][query-model] chapter gives a more +in-depth description of what queries are and how they work. If you intend to write a query of your own, this is a good read. ### Invoking queries @@ -267,3 +267,4 @@ impl<'tcx> QueryDescription for queries::type_of<'tcx> { } ``` +[query-model]: queries/query-evaluation-model-in-detail.html diff --git a/src/variance.md b/src/variance.md index 9fe98b4a..c6a1a320 100644 --- a/src/variance.md +++ b/src/variance.md @@ -139,7 +139,7 @@ crate (through `crate_variances`), but since most changes will not result in a change to the actual results from variance inference, the `variances_of` query will wind up being considered green after it is re-evaluated. -[rga]: ./incremental-compilation.html +[rga]: ./queries/incremental-compilation.html