# Lexing and Parsing The very first thing the compiler does is take the program (in UTF-8 Unicode text) and turn it into a data format the compiler can work with more conveniently than strings. This happens in two stages: Lexing and Parsing. 1. _Lexing_ takes strings and turns them into streams of [tokens]. For example, `foo.bar + buz` would be turned into the tokens `foo`, `.`, `bar`, `+`, and `buz`. This is implemented in [`rustc_lexer`][lexer]. [tokens]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/token/index.html [lexer]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/index.html 2. _Parsing_ takes streams of tokens and turns them into a structured form which is easier for the compiler to work with, usually called an [*Abstract Syntax Tree* (AST)][ast] . ## The AST The AST mirrors the structure of a Rust program in memory, using a `Span` to link a particular AST node back to its source text. The AST is defined in [`rustc_ast`][rustc_ast], along with some definitions for tokens and token streams, data structures/traits for mutating ASTs, and shared definitions for other AST-related parts of the compiler (like the lexer and macro-expansion). Every node in the AST has its own [`NodeId`], including top-level items such as structs, but also individual statements and expressions. A [`NodeId`] is an identifier number that uniquely identifies an AST node within a crate. However, because they are absolute within a crate, adding or removing a single node in the AST causes all the subsequent [`NodeId`]s to change. This renders [`NodeId`]s pretty much useless for incremental compilation, where you want as few things as possible to change. [`NodeId`]s are used in all the `rustc` bits that operate directly on the AST, like macro expansion and name resolution (more on these over the next couple chapters). [`NodeId`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/node_id/struct.NodeId.html ## Parsing The parser is defined in [`rustc_parse`][rustc_parse], along with a high-level interface to the lexer and some validation routines that run after macro expansion. In particular, the [`rustc_parse::parser`][parser] contains the parser implementation. The main entrypoint to the parser is via the various `parse_*` functions and others in [rustc_parse][rustc_parse]. They let you do things like turn a [`SourceFile`][sourcefile] (e.g. the source in a single file) into a token stream, create a parser from the token stream, and then execute the parser to get a [`Crate`] (the root AST node). To minimize the amount of copying that is done, both [`Lexer`] and [`Parser`] have lifetimes which bind them to the parent [`ParseSess`]. This contains all the information needed while parsing, as well as the [`SourceMap`] itself. Note that while parsing, we may encounter macro definitions or invocations. We set these aside to be expanded (see [Macro Expansion](./macro-expansion.md)). Expansion itself may require parsing the output of a macro, which may reveal more macros to be expanded, and so on. ## More on Lexical Analysis Code for lexical analysis is split between two crates: - [`rustc_lexer`] crate is responsible for breaking a `&str` into chunks constituting tokens. Although it is popular to implement lexers as generated finite state machines, the lexer in [`rustc_lexer`] is hand-written. - [`Lexer`] integrates [`rustc_lexer`] with data structures specific to `rustc`. Specifically, it adds `Span` information to tokens returned by [`rustc_lexer`] and interns identifiers. [`Crate`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/ast/struct.Crate.html [`Parser`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/parser/struct.Parser.html [`ParseSess`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_session/parse/struct.ParseSess.html [`rustc_lexer`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/index.html [`SourceMap`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_span/source_map/struct.SourceMap.html [`Lexer`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/lexer/struct.Lexer.html [ast module]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/ast/index.html [ast]: ./ast-validation.md [parser]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/parser/index.html [rustc_ast]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/index.html [rustc_errors]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_errors/index.html [rustc_parse]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html [sourcefile]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_span/struct.SourceFile.html [visit module]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/visit/index.html