From da7894aa293638d6bae5b3edc2e49d224277c084 Mon Sep 17 00:00:00 2001
From: Mark Mansi
Date: Sat, 7 Mar 2020 17:26:30 -0600
Subject: [PATCH] update backend chapters from nagisa's notes

---
 src/appendix/glossary.md    |  2 +-
 src/backend/backend.md      | 49 ++++++++++++++++++++++-----
 src/backend/codegen.md      | 66 ++++++++++++++++++++++++-------------
 src/backend/lowering-mir.md | 57 +++++++++++++++++++++++++++++++-
 src/backend/monomorph.md    |  9 +++++
 5 files changed, 151 insertions(+), 32 deletions(-)

diff --git a/src/appendix/glossary.md b/src/appendix/glossary.md
index 715928d2..0218d1a8 100644
--- a/src/appendix/glossary.md
+++ b/src/appendix/glossary.md
@@ -11,7 +11,7 @@ AST | the abstract syntax tree produced by the `rustc_ast`
 binder | a "binder" is a place where a variable or type is declared; for example, the `<T>` is a binder for the generic type parameter `T` in `fn foo<T>(..)`, and \|`a`\|` ...` is a binder for the parameter `a`. See [the background chapter for more](./background.html#free-vs-bound)
 bound variable | a "bound variable" is one that is declared within an expression/term. For example, the variable `a` is bound within the closure expression \|`a`\|` a * 2`. See [the background chapter for more](./background.html#free-vs-bound)
 codegen | the code to translate MIR into LLVM IR.
-codegen unit | when we produce LLVM IR, we group the Rust code into a number of codegen units (sometimes abbreviated as CGUs). Each of these units is processed by LLVM independently from one another, enabling parallelism. They are also the unit of incremental re-use.
+codegen unit | when we produce LLVM IR, we group the Rust code into a number of codegen units (sometimes abbreviated as CGUs). Each of these units is processed by LLVM independently from one another, enabling parallelism. They are also the unit of incremental re-use. ([see more](../backend/codegen.md))
 completeness | completeness is a technical term in type theory. Completeness means that every type-safe program also type-checks. Having both soundness and completeness is very hard, and usually soundness is more important. (see "soundness").
 control-flow graph | a representation of the control-flow of a program; see [the background chapter for more](./background.html#cfg)
 CTFE | Compile-Time Function Evaluation. This is the ability of the compiler to evaluate `const fn`s at compile time. This is part of the compiler's constant evaluation system. ([see more](../const-eval.html))
diff --git a/src/backend/backend.md b/src/backend/backend.md
index cc477b3e..9832abb1 100644
--- a/src/backend/backend.md
+++ b/src/backend/backend.md
@@ -1,9 +1,19 @@
 # The Compiler Backend
 
-The _compiler backend_ refers to the parts of the compiler that turn rustc's
-MIR into actual executable code (e.g. an ELF or EXE binary) that can run on a
-processor. This is the last stage of compilation, and it has a few important
-parts:
+All of the preceding chapters of this guide have one thing in common: we never
+generated any executable machine code at all! With this chapter, all of that
+changes.
+
+It's often useful to think of compilers as being composed of a _frontend_ and a
+_backend_ (though in rustc, there's not a sharp line between frontend and
+backend). The _frontend_ is responsible for taking raw source code, checking it
+for correctness, and getting it into a format usable by the backend. For rustc,
+this format is the MIR. The _backend_ refers to the parts of the compiler that
+turn rustc's MIR into actual executable code (e.g. an ELF or EXE binary) that
+can run on a processor. All of the previous chapters deal with rustc's
+frontend.
+
+rustc's backend does the following:
 
 0. First, we need to collect the set of things to generate code for. In
    particular, we need to find out which concrete types to substitute for
@@ -11,7 +21,30 @@ parts:
    Generating code for the concrete types (i.e. emitting a copy of the code for
    each concrete type) is called _monomorphization_, so the process of
    collecting all the concrete types is called _monomorphization collection_.
-1. Next, we need to actually lower the MIR (which is generic) to a codegen IR
-   (usually LLVM IR; which is not generic) for each concrete type we collected.
-2. Finally, we need to invoke LLVM, which runs a bunch of optimization passes,
-   generates executable code, and links together an executable binary.
+1. Next, we need to actually lower the MIR to a codegen IR
+   (usually LLVM IR) for each concrete type we collected.
+2. Finally, we need to invoke the codegen backend (e.g. LLVM or Cranelift),
+   which runs a bunch of optimization passes, generates executable code, and
+   links together an executable binary.
+
+[codegen1]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/base/fn.codegen_crate.html
+
+The code for codegen is actually a bit complex due to a few factors:
+
+- Support for multiple backends (LLVM and Cranelift). We try to share as much
+  backend code between them as possible, so a lot of it is generic over the
+  codegen implementation. This means that there are often a lot of layers of
+  abstraction.
+- Codegen happens asynchronously in another thread for performance.
+- The actual codegen is done by a third-party library (either LLVM or Cranelift).
+
+Generally, the [`rustc_codegen_ssa`][ssa] crate contains backend-agnostic code
+(i.e. independent of LLVM or Cranelift), while the [`rustc_codegen_llvm`][llvm]
+crate contains code specific to LLVM codegen.
+
+[ssa]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/index.html
+[llvm]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_llvm/index.html
+
+At a very high level, the entry point is
+[`rustc_codegen_ssa::base::codegen_crate`][codegen1]. This function starts the
+process discussed in the rest of this chapter.
diff --git a/src/backend/codegen.md b/src/backend/codegen.md
index 02bb4d57..8b005525 100644
--- a/src/backend/codegen.md
+++ b/src/backend/codegen.md
@@ -1,7 +1,13 @@
 # Code generation
 
 Code generation or "codegen" is the part of the compiler that actually
-generates an executable binary. rustc uses LLVM for code generation.
+generates an executable binary. Usually, rustc uses LLVM for code generation;
+there is also support for [Cranelift]. The key is that rustc doesn't implement
+codegen itself. It's worth noting, though, that in the Rust source code, many
+parts of the backend have `codegen` in their names (there are no hard
+boundaries).
+
+[Cranelift]: https://github.com/bytecodealliance/wasmtime/tree/master/cranelift
 
 > NOTE: If you are looking for hints on how to debug code generation bugs,
 > please see [this section of the debugging chapter][debugging].
@@ -10,28 +16,16 @@ generates an executable binary. rustc uses LLVM for code generation.
 
 ## What is LLVM?
 
-All of the preceding chapters of this guide have one thing in common: we never
-generated any executable machine code at all! With this chapter, all of that
-changes.
+[LLVM](https://llvm.org) is "a collection of modular and reusable compiler and
+toolchain technologies". In particular, the LLVM project contains a pluggable
+compiler backend (also called "LLVM"), which is used by many compiler projects,
+including the `clang` C compiler and our beloved `rustc`.
 
-Like most compilers, rustc is composed of a "frontend" and a "backend". The
-"frontend" is responsible for taking raw source code, checking it for
-correctness, and getting it into a format `X` from which we can generate
-executable machine code. The "backend" then takes that format `X` and produces
-(possibly optimized) executable machine code for some platform. All of the
-previous chapters deal with rustc's frontend.
-
-rustc's backend is [LLVM](https://llvm.org), "a collection of modular and
-reusable compiler and toolchain technologies". In particular, the LLVM project
-contains a pluggable compiler backend (also called "LLVM"), which is used by
-many compiler projects, including the `clang` C compiler and our beloved
-`rustc`.
-
-LLVM's "format `X`" is called LLVM IR. It is basically assembly code with
+LLVM takes input in the form of LLVM IR. It is basically assembly code with
 additional low-level types and annotations added. These annotations are helpful
 for doing optimizations on the LLVM IR and outputted machine code. The end
-result of all this is (at long last) something executable (e.g. an ELF object
-or wasm).
+result of all this is (at long last) something executable (e.g. an ELF object,
+an EXE, or wasm).
 
 There are a few benefits to using LLVM:
 
@@ -49,6 +43,34 @@ There are a few benefits to using LLVM:
 
 [spectre]: https://meltdownattack.com/
 
-## Generating LLVM IR
+## Running LLVM, linking, and metadata generation
 
-TODO
+Once LLVM IR for all of the functions and statics, etc. is built, it is time to
+start running LLVM and its optimisation passes. LLVM IR is grouped into
+"modules". Multiple "modules" can be codegened at the same time to aid in
+multi-core utilisation. These "modules" are what we refer to as _codegen
+units_. These units were established way back during the monomorphisation
+collection phase.
+
+Once LLVM produces objects from these modules, these objects are passed to the
+linker along with, optionally, the metadata object and an archive or an
+executable is produced.
+
+It is not necessarily the codegen phase described above that runs the
+optimisations. With certain kinds of LTO, the optimisation might happen at the
+linking time instead. It is also possible for some optimisations to happen
+before objects are passed on to the linker and some to happen during the
+linking.
+
+This all happens towards the very end of compilation. The code for this can be
+found in [`librustc_codegen_ssa::back`][ssaback] and
+[`librustc_codegen_llvm::back`][llvmback]. Sadly, this piece of code is not
+really well-separated into LLVM-dependent code; the [`rustc_codegen_ssa`][ssa]
+contains a fair amount of code specific to the LLVM backend.
+
+Once these components are done with their work you end up with a number of
+files in your filesystem corresponding to the outputs you have requested.
+
+[ssa]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/index.html
+[ssaback]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/back/index.html
+[llvmback]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_llvm/back/index.html
diff --git a/src/backend/lowering-mir.md b/src/backend/lowering-mir.md
index e3c13795..42536b95 100644
--- a/src/backend/lowering-mir.md
+++ b/src/backend/lowering-mir.md
@@ -1,3 +1,58 @@
 # Lowering MIR to a Codegen IR
 
-TODO
+Now that we have a list of symbols to generate from the collector, we need to
+generate some sort of codegen IR. In this chapter, we will assume LLVM IR,
+since that's what rustc usually uses. The actual monomorphisation is performed
+as we go, while we do the translation.
+
+Recall that the backend is started by
+[`rustc_codegen_ssa::base::codegen_crate`][codegen1]. Eventually, this reaches
+[`rustc_codegen_ssa::mir::codegen_mir`][codegen2], which does the lowering from
+MIR to LLVM IR.
+
+[codegen1]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/base/fn.codegen_crate.html
+[codegen2]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/mir/fn.codegen_mir.html
+
+The code is split into modules which handle particular MIR primitives:
+
+- [`librustc_codegen_ssa::mir::block`][mirblk] will deal with translating
+  blocks and their terminators. The most complicated and also the most
+  interesting thing this module does is generating code for function calls,
+  including the necessary unwinding handling IR.
+- [`librustc_codegen_ssa::mir::statement`][mirst] translates MIR statements.
+- [`librustc_codegen_ssa::mir::operand`][mirop] translates MIR operands.
+- [`librustc_codegen_ssa::mir::place`][mirpl] translates MIR place references.
+- [`librustc_codegen_ssa::mir::rvalue`][mirrv] translates MIR r-values.
+
+[mirblk]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/mir/block/index.html
+[mirst]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/mir/statement/index.html
+[mirop]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/mir/operand/index.html
+[mirpl]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/mir/place/index.html
+[mirrv]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/mir/rvalue/index.html
+
+Before a function is translated a number of simple and primitive analysis
+passes will run to help us generate simpler and more efficient LLVM IR. An
+example of such an analysis pass would be figuring out which variables are
+SSA-like, so that we can translate them to SSA directly rather than relying on
+LLVM's `mem2reg` for those variables. The analyses can be found in
+[`rustc_codegen_ssa::mir::analyze`][mirana].
+
+[mirana]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/mir/analyze/index.html
+
+Usually a single MIR basic block will map to an LLVM basic block, with very few
+exceptions: intrinsic or function calls and less basic MIR statements like
+`assert` can result in multiple basic blocks. This is a perfect lede into the
+non-portable LLVM-specific part of the code generation. Intrinsic generation is
+fairly easy to understand as it involves very few abstraction levels in between
+and can be found in [`rustc_codegen_llvm::intrinsic`][llvmint].
+
+[llvmint]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_llvm/intrinsic/index.html
+
+Everything else will use the [builder interface][builder]; this is the code that gets
+called in the [`librustc_codegen_ssa::mir::*`][ssamir] modules that were discussed
+above.
+
+[builder]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_llvm/builder/index.html
+[ssamir]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/mir/index.html
+
+> TODO: discuss how constants are generated
diff --git a/src/backend/monomorph.md b/src/backend/monomorph.md
index e28eac8f..0d2de4d2 100644
--- a/src/backend/monomorph.md
+++ b/src/backend/monomorph.md
@@ -49,6 +49,15 @@ See [the collector rustdocs][collect] for more info.
 
 [collect]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir/monomorphize/collector/index.html
 
+The monomorphisation collector is run just before MIR lowering and codegen.
+[`rustc_codegen_ssa::base::codegen_crate`][codegen1] calls the
+[`collect_and_partition_mono_items`][mono] query, which does monomorphisation
+collection and then partitions them into [codegen
+units](../appendix/glossary.md).
+
+[mono]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir/monomorphize/partitioning/fn.collect_and_partition_mono_items.html
+[codegen1]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/base/fn.codegen_crate.html
+
 ## Polymorphization
 
 As mentioned above, monomorphisation produces fast code, but it comes at the