Edit "What the compiler does to your code" (#1306)

* Edit overview.md

* Fix lexer crate

* Edit wording

Co-authored-by: pierwill <pierwill@users.noreply.github.com>
pierwill 2022-05-02 15:10:45 -05:00 committed by GitHub
parent a1340e010c
commit 105bc3d35d
1 changed files with 123 additions and 91 deletions


@ -17,93 +17,121 @@ So first, let's look at what the compiler does to your code. For now, we will
avoid mentioning how the compiler implements these steps except as needed; avoid mentioning how the compiler implements these steps except as needed;
we'll talk about that later. we'll talk about that later.
- The compile process begins when a user writes a Rust source program in text ### Invocation
Compilation begins when a user writes a Rust source program in text
and invokes the `rustc` compiler on it. The work that the compiler needs to and invokes the `rustc` compiler on it. The work that the compiler needs to
perform is defined by command-line options. For example, it is possible to perform is defined by command-line options. For example, it is possible to
enable nightly features (`-Z` flags), perform `check`-only builds, or emit enable nightly features (`-Z` flags), perform `check`-only builds, or emit
LLVM-IR rather than executable machine code. The `rustc` executable call may LLVM-IR rather than executable machine code. The `rustc` executable call may
be indirect through the use of `cargo`. be indirect through the use of `cargo`.
- Command line argument parsing occurs in the [`rustc_driver`]. This crate
Command line argument parsing occurs in the [`rustc_driver`]. This crate
defines the compile configuration that is requested by the user and passes it defines the compile configuration that is requested by the user and passes it
to the rest of the compilation process as a [`rustc_interface::Config`]. to the rest of the compilation process as a [`rustc_interface::Config`].
- The raw Rust source text is analyzed by a low-level lexer located in
### Lexing and parsing
The raw Rust source text is analyzed by a low-level *lexer* located in
[`rustc_lexer`]. At this stage, the source text is turned into a stream of [`rustc_lexer`]. At this stage, the source text is turned into a stream of
atomic source code units known as _tokens_. The lexer supports the atomic source code units known as _tokens_. The lexer supports the
Unicode character encoding. Unicode character encoding.
- The token stream passes through a higher-level lexer located in
The token stream passes through a higher-level lexer located in
[`rustc_parse`] to prepare for the next stage of the compile process. The [`rustc_parse`] to prepare for the next stage of the compile process. The
[`StringReader`] struct is used at this stage to perform a set of validations [`StringReader`] struct is used at this stage to perform a set of validations
and turn strings into interned symbols (_interning_ is discussed later). and turn strings into interned symbols (_interning_ is discussed later).
[String interning] is a way of storing only one immutable [String interning] is a way of storing only one immutable
copy of each distinct string value. copy of each distinct string value.
- The lexer has a small interface and doesn't depend directly on the The lexer has a small interface and doesn't depend directly on the
diagnostic infrastructure in `rustc`. Instead it provides diagnostics as plain diagnostic infrastructure in `rustc`. Instead it provides diagnostics as plain
data which are emitted in `rustc_parse::lexer::mod` as real diagnostics. data which are emitted in `rustc_parse::lexer` as real diagnostics.
- The lexer preserves full fidelity information for both IDEs and proc macros. The lexer preserves full fidelity information for both IDEs and proc macros.
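String interning, as described above, can be illustrated with a minimal sketch. This is not how `rustc` implements its interned symbols; `Sym` and `Interner` are hypothetical names invented for this example, and the sketch only shows the idea that each distinct string is assigned one handle that can be compared cheaply.

```rust
use std::collections::HashMap;

/// Hypothetical handle for an interned string (not rustc's own symbol type).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Sym(u32);

/// Hypothetical interner: maps each distinct string to exactly one `Sym`.
#[derive(Default)]
struct Interner {
    map: HashMap<String, Sym>,
    strings: Vec<String>,
}

impl Interner {
    /// Returns the existing handle for `s`, or allocates a new one.
    fn intern(&mut self, s: &str) -> Sym {
        if let Some(&sym) = self.map.get(s) {
            return sym;
        }
        let sym = Sym(self.strings.len() as u32);
        // For simplicity this sketch keeps the text twice (lookup key and
        // table entry); a real interner would avoid the second copy.
        self.strings.push(s.to_owned());
        self.map.insert(s.to_owned(), sym);
        sym
    }

    /// Looks up the text behind a handle.
    fn resolve(&self, sym: Sym) -> &str {
        &self.strings[sym.0 as usize]
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("foo");
    let b = interner.intern("bar");
    let c = interner.intern("foo");
    assert_eq!(a, c); // same string, same handle
    assert_ne!(a, b); // different strings, different handles
    assert_eq!(interner.resolve(b), "bar");
}
```

Comparing two `Sym` values is an integer comparison, which is the main reason a compiler front end would prefer handles over repeated string comparisons.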
- The parser [translates the token stream from the lexer into an Abstract Syntax
The *parser* [translates the token stream from the lexer into an Abstract Syntax
Tree (AST)][parser]. It uses a recursive descent (top-down) approach to syntax Tree (AST)][parser]. It uses a recursive descent (top-down) approach to syntax
analysis. The crate entry points for the parser are the analysis. The crate entry points for the parser are the
[`Parser::parse_crate_mod()`][parse_crate_mod] and [`Parser::parse_mod()`][parse_mod] [`Parser::parse_crate_mod()`][parse_crate_mod] and [`Parser::parse_mod()`][parse_mod]
methods found in [`rustc_parse::parser::Parser`]. The external module parsing methods found in [`rustc_parse::parser::Parser`]. The external module parsing
entry point is [`rustc_expand::module::parse_external_mod`][parse_external_mod]. entry point is [`rustc_expand::module::parse_external_mod`][parse_external_mod].
And the macro parser entry point is [`Parser::parse_nonterminal()`][parse_nonterminal]. And the macro parser entry point is [`Parser::parse_nonterminal()`][parse_nonterminal].
- Parsing is performed with a set of `Parser` utility methods including `fn bump`,
`fn check`, `fn eat`, `fn expect`, `fn look_ahead`. Parsing is performed with a set of `Parser` utility methods including `bump`,
- Parsing is organized by the semantic construct that is being parsed. Separate `check`, `eat`, `expect`, `look_ahead`.
`parse_*` methods can be found in [`rustc_parse` `parser`][rustc_parse_parser_dir]
Parsing is organized by semantic construct. Separate
`parse_*` methods can be found in the [`rustc_parse`][rustc_parse_parser_dir]
directory. The source file name follows the construct name. For example, the directory. The source file name follows the construct name. For example, the
following files are found in the parser: following files are found in the parser:
- `expr.rs` - `expr.rs`
- `pat.rs` - `pat.rs`
- `ty.rs` - `ty.rs`
- `stmt.rs` - `stmt.rs`
- This naming scheme is used across many compiler stages. You will find
This naming scheme is used across many compiler stages. You will find
either a file or directory with the same name across the parsing, lowering, either a file or directory with the same name across the parsing, lowering,
type checking, THIR lowering, and MIR building sources. type checking, THIR lowering, and MIR building sources.
- Macro expansion, AST validation, name resolution, and early linting takes place
during this stage of the compile process. Macro expansion, AST validation, name resolution, and early linting also take place
- The parser uses the standard `DiagnosticBuilder` API for error handling, but we during this stage.
The parser uses the standard `DiagnosticBuilder` API for error handling, but we
try to recover, parsing a superset of Rust's grammar, while also emitting an error. try to recover, parsing a superset of Rust's grammar, while also emitting an error.
- `rustc_ast::ast::{Crate, Mod, Expr, Pat, ...}` AST nodes are returned from the parser. `rustc_ast::ast::{Crate, Mod, Expr, Pat, ...}` AST nodes are returned from the parser.
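The recursive descent approach and the `bump`/`check`/`eat`/`expect`/`look_ahead` helpers named above can be sketched on a toy grammar. Everything here is an assumption made for illustration: the `Token` and `Expr` types, the grammar, and the helper signatures are hypothetical and much simpler than the real `rustc_parse` `Parser`; only the helper names are borrowed from the text.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Token { Num(i64), Plus, Star, LParen, RParen, Eof }

#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

struct Parser { tokens: Vec<Token>, pos: usize }

impl Parser {
    fn new(mut tokens: Vec<Token>) -> Self {
        tokens.push(Token::Eof); // sentinel so the stream is never empty
        Parser { tokens, pos: 0 }
    }
    /// Peeks `n` tokens ahead without consuming anything.
    fn look_ahead(&self, n: usize) -> Token {
        *self.tokens.get(self.pos + n).unwrap_or(&Token::Eof)
    }
    /// Advances to the next token, stopping at the `Eof` sentinel.
    fn bump(&mut self) {
        if self.pos < self.tokens.len() - 1 { self.pos += 1; }
    }
    /// Tests the current token without consuming it.
    fn check(&self, t: Token) -> bool { self.look_ahead(0) == t }
    /// Consumes the current token if it matches; reports whether it did.
    fn eat(&mut self, t: Token) -> bool {
        let hit = self.check(t);
        if hit { self.bump(); }
        hit
    }
    /// Like `eat`, but a mismatch is an error.
    fn expect(&mut self, t: Token) -> Result<(), String> {
        if self.eat(t) { Ok(()) }
        else { Err(format!("expected {:?}, found {:?}", t, self.look_ahead(0))) }
    }
    // expr := term ('+' term)*
    fn parse_expr(&mut self) -> Result<Expr, String> {
        let mut lhs = self.parse_term()?;
        while self.eat(Token::Plus) {
            let rhs = self.parse_term()?;
            lhs = Expr::Add(Box::new(lhs), Box::new(rhs));
        }
        Ok(lhs)
    }
    // term := atom ('*' atom)*
    fn parse_term(&mut self) -> Result<Expr, String> {
        let mut lhs = self.parse_atom()?;
        while self.eat(Token::Star) {
            let rhs = self.parse_atom()?;
            lhs = Expr::Mul(Box::new(lhs), Box::new(rhs));
        }
        Ok(lhs)
    }
    // atom := NUM | '(' expr ')'
    fn parse_atom(&mut self) -> Result<Expr, String> {
        match self.look_ahead(0) {
            Token::Num(n) => { self.bump(); Ok(Expr::Num(n)) }
            Token::LParen => {
                self.bump();
                let inner = self.parse_expr()?;
                self.expect(Token::RParen)?;
                Ok(inner)
            }
            other => Err(format!("unexpected token {:?}", other)),
        }
    }
}

fn main() {
    // Tokens for `1 + 2 * 3`
    let toks = vec![Token::Num(1), Token::Plus, Token::Num(2), Token::Star, Token::Num(3)];
    let ast = Parser::new(toks).parse_expr().unwrap();
    let expected = Expr::Add(
        Box::new(Expr::Num(1)),
        Box::new(Expr::Mul(Box::new(Expr::Num(2)), Box::new(Expr::Num(3)))),
    );
    assert_eq!(ast, expected);
}
```

Each grammar rule gets its own `parse_*` method, which mirrors the one-method-per-construct organization described above, and operator precedence falls out of which rule calls which.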
- We then take the AST and [convert it to High-Level Intermediate
Representation (HIR)][hir]. This is a compiler-friendly representation of the ### HIR lowering
AST. This involves a lot of desugaring of things like loops and `async fn`.
- We use the HIR to do [type inference] (the process of automatic We next take the AST and convert it to [High-Level Intermediate
detection of the type of an expression), [trait solving] (the process Representation (HIR)][hir], a more compiler-friendly representation of the
of pairing up an impl with each reference to a trait), and [type AST. This process is called "lowering". It involves a lot of desugaring of things
checking] (the process of converting the types found in the HIR like loops and `async fn`.
(`hir::Ty`), which represent the syntactic things that the user wrote,
into the internal representation used by the compiler (`Ty<'tcx>`), We then use the HIR to do [*type inference*] (the process of automatic
and using that information to verify the type safety, correctness and detection of the type of an expression), [*trait solving*] (the process
coherence of the types used in the program). of pairing up an impl with each reference to a trait), and [*type
- The HIR is then [lowered to Mid-Level Intermediate Representation (MIR)][mir]. checking*]. Type checking is the process of converting the types found in the HIR
- Along the way, we construct the THIR, which is an even more desugared HIR. ([`hir::Ty`]), which represent what the user wrote,
into the internal representation used by the compiler ([`Ty<'tcx>`]).
That information is used to verify the type safety, correctness and
coherence of the types used in the program.
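To give a feel for the loop desugaring mentioned in the HIR lowering text above, here is a hand-written approximation in surface Rust. It is only a sketch of the general shape (a `for` loop becoming a `loop` with a `match` on the iterator's `next()`); the actual HIR the compiler produces differs in detail, and both function names are made up for this example.

```rust
/// What the user writes.
fn sum_with_for(values: &[i32]) -> i32 {
    let mut total = 0;
    for v in values {
        total += *v;
    }
    total
}

/// Roughly what the `for` loop means once desugared by hand
/// (an approximation, not the exact compiler output).
fn sum_desugared(values: &[i32]) -> i32 {
    let mut total = 0;
    let mut iter = IntoIterator::into_iter(values);
    loop {
        match iter.next() {
            Some(v) => total += *v,
            None => break,
        }
    }
    total
}

fn main() {
    let data = [1, 2, 3, 4];
    assert_eq!(sum_with_for(&data), sum_desugared(&data));
}
```

Later stages then only need to understand `loop` and `match`, not every surface-level loop form.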
### MIR lowering
The HIR is then [lowered to Mid-level Intermediate Representation (MIR)][mir],
which is used for [borrow checking].
Along the way, we also construct the THIR, which is an even more desugared HIR.
THIR is used for pattern and exhaustiveness checking. It is also more THIR is used for pattern and exhaustiveness checking. It is also more
convenient to convert into MIR than HIR is. convenient to convert into MIR than HIR is.
- The MIR is used for [borrow checking].
- We (want to) do [many optimizations on the MIR][mir-opt] because it is still We do [many optimizations on the MIR][mir-opt] because it is still
generic and that improves the code we generate later, improving compilation generic and that improves the code we generate later, improving compilation
speed too. speed too.
- MIR is a higher level (and generic) representation, so it is easier to do MIR is a higher level (and generic) representation, so it is easier to do
some optimizations at MIR level than at LLVM-IR level. For example LLVM some optimizations at MIR level than at LLVM-IR level. For example LLVM
doesn't seem to be able to optimize the pattern the [`simplify_try`] mir doesn't seem to be able to optimize the pattern the [`simplify_try`] mir
opt looks for. opt looks for.
- Rust code is _monomorphized_, which means making copies of all the generic
Rust code is _monomorphized_, which means making copies of all the generic
code with the type parameters replaced by concrete types. To do code with the type parameters replaced by concrete types. To do
this, we need to collect a list of what concrete types to generate code for. this, we need to collect a list of what concrete types to generate code for.
This is called _monomorphization collection_. This is called _monomorphization collection_ and it happens at the MIR level.
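Monomorphization can be pictured with a small example. The generic function below is called with two concrete types, so the compiler needs two specialized copies; `largest_i32` and `largest_f64` are hand-written stand-ins for those copies, with hypothetical names (the compiler does not generate source-level functions named this way).

```rust
/// Generic source code: one definition, many possible instantiations.
/// Assumes `items` is non-empty.
fn largest<T: PartialOrd + Copy>(items: &[T]) -> T {
    let mut best = items[0];
    for &item in items {
        if item > best {
            best = item;
        }
    }
    best
}

/// Hand-written picture of the copy generated for `T = i32`.
fn largest_i32(items: &[i32]) -> i32 {
    let mut best = items[0];
    for &item in items {
        if item > best {
            best = item;
        }
    }
    best
}

/// Hand-written picture of the copy generated for `T = f64`.
fn largest_f64(items: &[f64]) -> f64 {
    let mut best = items[0];
    for &item in items {
        if item > best {
            best = item;
        }
    }
    best
}

fn main() {
    // These two calls are what a collection pass would record:
    // `largest::<i32>` and `largest::<f64>` are the instances needed.
    assert_eq!(largest(&[3, 7, 2]), largest_i32(&[3, 7, 2]));
    assert_eq!(largest(&[1.5, 0.25]), largest_f64(&[1.5, 0.25]));
}
```

Only the instantiations that are actually used get a copy, which is why the list of concrete types has to be collected first.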
- We then begin what is vaguely called _code generation_ or _codegen_.
- The [code generation stage (codegen)][codegen] is when higher level ### Code generation
We then begin what is vaguely called _code generation_ or _codegen_.
The [code generation stage][codegen] is when higher level
representations of source are turned into an executable binary. `rustc` representations of source are turned into an executable binary. `rustc`
uses LLVM for code generation. The first step is to convert the MIR uses LLVM for code generation. The first step is to convert the MIR
to LLVM Intermediate Representation (LLVM IR). This is where the MIR to LLVM Intermediate Representation (LLVM IR). This is where the MIR
is actually monomorphized, according to the list we created in the is actually monomorphized, according to the list we created in the
previous step. previous step.
- The LLVM IR is passed to LLVM, which does a lot more optimizations on it. The LLVM IR is passed to LLVM, which does a lot more optimizations on it.
It then emits machine code. It is basically assembly code with additional It then emits machine code. It is basically assembly code with additional
low-level types and annotations added. (e.g. an ELF object or wasm). low-level types and annotations added (e.g. an ELF object or WASM).
- The different libraries/binaries are linked together to produce the final The different libraries/binaries are then linked together to produce the final
binary. binary.
[String interning]: https://en.wikipedia.org/wiki/String_interning [String interning]: https://en.wikipedia.org/wiki/String_interning
@ -115,9 +143,9 @@ we'll talk about that later.
[`rustc_parse`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html [`rustc_parse`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html
[parser]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html [parser]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html
[hir]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_hir/index.html [hir]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_hir/index.html
[type inference]: https://rustc-dev-guide.rust-lang.org/type-inference.html [*type inference*]: https://rustc-dev-guide.rust-lang.org/type-inference.html
[trait solving]: https://rustc-dev-guide.rust-lang.org/traits/resolution.html [*trait solving*]: https://rustc-dev-guide.rust-lang.org/traits/resolution.html
[type checking]: https://rustc-dev-guide.rust-lang.org/type-checking.html [*type checking*]: https://rustc-dev-guide.rust-lang.org/type-checking.html
[mir]: https://rustc-dev-guide.rust-lang.org/mir/index.html [mir]: https://rustc-dev-guide.rust-lang.org/mir/index.html
[borrow checking]: https://rustc-dev-guide.rust-lang.org/borrow_check.html [borrow checking]: https://rustc-dev-guide.rust-lang.org/borrow_check.html
[mir-opt]: https://rustc-dev-guide.rust-lang.org/mir/optimizations.html [mir-opt]: https://rustc-dev-guide.rust-lang.org/mir/optimizations.html
@ -129,6 +157,8 @@ we'll talk about that later.
[`rustc_parse::parser::Parser`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/parser/struct.Parser.html [`rustc_parse::parser::Parser`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/parser/struct.Parser.html
[parse_external_mod]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_expand/module/fn.parse_external_mod.html [parse_external_mod]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_expand/module/fn.parse_external_mod.html
[rustc_parse_parser_dir]: https://github.com/rust-lang/rust/tree/master/compiler/rustc_parse/src/parser [rustc_parse_parser_dir]: https://github.com/rust-lang/rust/tree/master/compiler/rustc_parse/src/parser
[`hir::Ty`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_hir/hir/struct.Ty.html
[`Ty<'tcx>`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.Ty.html
## How it does it ## How it does it
@ -323,6 +353,7 @@ For more details on bootstrapping, see
[_bootstrapping_]: https://en.wikipedia.org/wiki/Bootstrapping_(compilers) [_bootstrapping_]: https://en.wikipedia.org/wiki/Bootstrapping_(compilers)
[rustc-bootstrap]: building/bootstrapping.md [rustc-bootstrap]: building/bootstrapping.md
<!--
# Unresolved Questions # Unresolved Questions
- Does LLVM ever do optimizations in debug builds? - Does LLVM ever do optimizations in debug builds?
@ -332,6 +363,7 @@ For more details on bootstrapping, see
- What is the main source entry point for `X`? - What is the main source entry point for `X`?
- Where do phases diverge for cross-compilation to machine code across - Where do phases diverge for cross-compilation to machine code across
different platforms? different platforms?
-->
# References # References