rustc-dev-guide/src/macro-expansion.md

24 KiB
Raw Blame History

Macro expansion

librustc_ast, librustc_expand, and librustc_builtin_macros are all undergoing refactoring, so some of the links in this chapter may be broken.

Rust has a very powerful macro system. In the previous chapter, we saw how the parser sets aside macros to be expanded. This chapter is about the process of expanding those macros iteratively until we have a complete AST for our crate with no unexpanded macros (or a compile error).

First, we will discuss the algorithm that expands and integrates macro output into ASTs. Next, we will take a look at how hygiene data is collected. Finally, we will look at the specifics of expanding different types of macros.

Expansion and AST Integration

TODO: expand these notes (har har)...

  • Expansion happens over a whole crate at once.
  • We run fully_expand_fragment on the crate
    • If fully_expand_fragment is run not on a whole crate, it means that we are performing eager expansion.

      • We do this for some built-ins that expect literals (not exposed to users).
      • It performs a subset of actions performed by non-eager expansion, so the discussion below focuses on eager expansion.
    • Original description here: https://github.com/rust-lang/rust/pull/53778#issuecomment-419224049

    • Algorithm: fully_expand_fragment works in iterations. We repeat until there are no unresolved macros left.

      • Resolve imports in our partially built crate as much as possible.
        • (link to name-resolution chapter) names resolved from "closer" scopes (e.g. current block) to further ones (e.g. prelude)
        • A resolution fails differently for different scopes, e.g. for a module scope it means no unexpanded macros and no unresolved glob imports in that module.
      • Collect as many macro invocations as possible from our partially built crate (fn-like, attributes, derives) from the crate and add them to the queue.
      • Take a macro from the queue, and attempt to resolve it.
      • If it's resolved - run its expander function that consumes tokens or AST and produces tokens or AST (depending on the macro kind). (If it's not resolved, then put it back into the queue.)
        • At this point, we know everything about the macro itself and can call set_expn_data to fill in its properties in the global data -- that is the hygiene data associated with ExpnId.
      • The macro's expander function returns a piece of AST (or tokens). We need to integrate that piece of AST into the big existing partially built AST.
        • If the macro produces tokens (e.g. a proc macro), we will have to parse into an AST, which may produce parse errors.
        • During expansion, we create SyntaxContexts (heirarchy 2).
        • This is essentially where the "token-like mass" becomes a proper set-in-stone AST with side-tables
        • These three passes happen one after another on every AST fragment freshly expanded from a macro
          • NodeIds are assigned by InvocationCollector
            • also collects new macro calls from this new AST piece and adds them to the queue
          • def_paths are created and DefIds are assigned to them by DefCollector
          • Names are put into modules (from the resolver's point of view) by BuildReducedGraphVisitor
      • After expanding a single macro and integrating its output continue to the next iteration of fully_expand_fragment.
      • If we make no progress in an iteration, then we have reached a compilation error (e.g. an undefined macro).
    • We attempt to recover from failures (unresolved macros or imports) for the sake of diagnostics

      • recovery can't cause compilation to suceed. We know that it will fail at this point.
      • we expand errors into ExprKind::Err or something like that for unresolved macros
      • this allows compilation to continue past the first error so that we can report more errors at a time

Relationship to name resolution

  • name resolution is done for macro and import names during expansion and integration into the AST, as discussed above
  • For all other names we certainly know whether a name is resolved successfully or not on the first attempt, because no new names can appear, due to hygiene
    • They are resolved in a later pass, see librustc_resolve/late.rs

Hygiene and Heirarchies

If you have ever used C/C++ preprocessor macros, you know that there are some annoying and hard-to-debug gotchas! For example, consider the following C code:

#define DEFINE_FOO struct Bar {int x;}; struct Foo {Bar bar;};

// Then, somewhere else
struct Bar {
    ...
};

DEFINE_FOO

Most people avoid writing C like this and for good reason: it doesn't compile. The struct Bar defined by the macro clashes names with the struct Bar defined in the code. Consider also the following example:

#define DO_FOO(x) {\
    int y = 0;\
    foo(x, y);\
    }

// Then elsewhere
int y = 22;
DO_FOO(y);

Do you see the problem? We wanted to generate a call foo(22, 0), but instead we got foo(0, 0) because the macro defined its own y!

These are both examples of macro hygiene issues. Hygiene relates to how to handle names defined within a macro. In particular, a hygienic macro system prevents errors due to names introduced within a macro. Rust macros are hygienic in that they do not allow one to write the sorts of bugs above.

At a high level, hygiene within the rust compiler is accomplished by keeping track of the context where a name is introduced and used. We can then disambiguate names based on that context. Future iterations of the macro system will allow greater control to the macro author to use that context. For example, a macro author may want to introduce a new name to the context where the macro was called. Alternately, the macro author may be defining a variable for use only within the macro (i.e. it should not be visible outside the macro).

This section is about how that context is tracked.

TODO: expand these notes

  • Expansion is lazy. We work from the outside of a macro invocation inward.

    • Ex: foo!(bar!(ident)) -> expand -> bar!(ident) -> expand -> ident
    • Eager expansion: https://github.com/rust-lang/rfcs/pull/2320.
      • Seems complicated to implemented
      • We have it hacked into some built-in macros, but not generally.
  • Many AST nodes have some sort of syntax context, especially nodes from macros.

  • When we ask what is the syntax context of a node, the answer actually differs by what we are trying to do. Thus, we don't just keep track of a single context. There are in fact 3 different types of context used for different things.

  • Each type of context is tracked by an "expansion heirarchy". As we expand macros, new macro calls or macro definitions may be generated, leading to some nesting. This nesting is where the heirarchies come from. Each heirarchy tracks some different aspect, though, as we will see.

  • There are 3 expansion heirarchies

    • All macros receive an integer ID assigned continuously starting from 0 as we discover new macro calls
      • This is used as the expn_id where needed.
    • All heirarchies start at ExpnId::root, which is its own parent
    • The context of a node consists of a chain of expansions leading to ExpnId::root. A non-macro-expanded node has syntax context 0 (SyntaxContext::empty()) which represents just the root node.
    • There are vectors in HygieneData that contain expansion info.
      • There are entries here for both SyntaxContext::empty() and ExpnId::root, but they aren't used much.
    1. Tracks expansion order: when a macro invocation is in the output of another macro. ... expn_id2 expn_id1 InternalExpnData::parent is the child->parent link. That is the expn_id1 points to expn_id2 points to ...

      Ex: macro_rules! foo { () => { println!(); } } fn main() { foo!(); }

       // Then AST nodes that are finally generated would have parent(expn_id_println) -> parent(expn_id_foo), right?
      
    2. Tracks macro definitions: when we are expanding one macro another macro definition is revealed in its output. ... SyntaxContext2 SyntaxContext1 SyntaxContextData::parent is the child->parent link here. SyntaxContext is the whole chain in this hierarchy, and SyntaxContextData::outer_expns are individual elements in the chain.

      • For built-in macros (e.g. line!()) or stable proc macros: tokens produced by the macro are given the context SyntaxContext::empty().apply_mark(expn_id)
        • Such macros are considered to have been defined at the root.
          • For proc macros this is because they are always cross-crate and we don't have cross-crate hygiene implemented.

      The second hierarchy has the context transplantation hack. See https://github.com/rust-lang/rust/pull/51762#issuecomment-401400732.

      If the token had context X before being produced by a macro then after being produced by the macro it has context X -> macro_id.

      Ex: rust macro m() { ident }

       Here `ident` originally has context SyntaxContext::root(). `ident` has context ROOT -> id(m) after it's produced by m.
       The "chaining operator" is `apply_mark` in compiler code.
      

      Ex:

       ```rust
       macro m() { macro n() { ident } }
       ```
       In this example the ident has context ROOT originally, then ROOT -> id(m), then ROOT -> id(m) -> id(n).
      

      Note that these chains are not entirely determined by their last element, in other words ExpnId is not isomorphic to SyntaxCtxt.

      Ex: ```rust macro m($i: ident) { macro n() { ($i, bar) } }

       m!(foo);
       ```
      
       After all expansions, foo has context ROOT -> id(n) and bar has context ROOT -> id(m) -> id(n)
      
    3. Call-site: tracks the location of the macro invocation. Ex: If foo!(bar!(ident)) expands into ident then hierarchy 1 is root -> foo -> bar -> ident but hierarchy 3 is root -> ident

      ExpnInfo::call_site is the child-parent link in this case.

  • Hygiene-related algorithms are entirely in hygiene.rs

    • Some hacks in resolve_crate_root, though.

Producing Macro Output

Above, we saw how the output of a macro is integrated into the AST for a crate, and we also saw how th e hygiene data for a crate is generated. But how do we actually produce the output of a macro? It depends on the type of macro.

There are two types of macros in Rust: macro_rules! macros (a.k.a. "Macros By Example" (MBE)) and procedural macros (or "proc macros"; including custom derives). During the parsing phase, the normal Rust parser will set aside the contents of macros and their invocations. Later, macros are expanded using these portions of the code.

Macros By Example

MBEs have their own parser distinct from the normal Rust parser. When macros are expanded, we may invoke the MBE parser to parse and expand a macro. The MBE parser, in turn, may call the normal Rust parser when it needs to bind a metavariable (e.g. $my_expr) while parsing the contents of a macro invocation. The code for macro expansion is in src/librustc_expand/mbe/.

Example

It's helpful to have an example to refer to. For the remainder of this chapter, whenever we refer to the "example definition", we mean the following:

macro_rules! printer {
    (print $mvar:ident) => {
        println!("{}", $mvar);
    };
    (print twice $mvar:ident) => {
        println!("{}", $mvar);
        println!("{}", $mvar);
    };
}

$mvar is called a metavariable. Unlike normal variables, rather than binding to a value in a computation, a metavariable binds at compile time to a tree of tokens. A token is a single "unit" of the grammar, such as an identifier (e.g. foo) or punctuation (e.g. =>). There are also other special tokens, such as EOF, which indicates that there are no more tokens. Token trees resulting from paired parentheses-like characters ((...), [...], and {...}) they include the open and close and all the tokens in between (we do require that parentheses-like characters be balanced). Having macro expansion operate on token streams rather than the raw bytes of a source file abstracts away a lot of complexity. The macro expander (and much of the rest of the compiler) doesn't really care that much about the exact line and column of some syntactic construct in the code; it cares about what constructs are used in the code. Using tokens allows us to care about what without worrying about where. For more information about tokens, see the Parsing chapter of this book.

Whenever we refer to the "example invocation", we mean the following snippet:

printer!(print foo); // Assume `foo` is a variable defined somewhere else...

The process of expanding the macro invocation into the syntax tree println!("{}", foo) and then expanding that into a call to Display::fmt is called macro expansion, and it is the topic of this chapter.

The MBE parser

There are two parts to MBE expansion: parsing the definition and parsing the invocations. Interestingly, both are done by the macro parser.

Basically, the MBE parser is like an NFA-based regex parser. It uses an algorithm similar in spirit to the Earley parsing algorithm. The macro parser is defined in src/librustc_expand/mbe/macro_parser.rs.

The interface of the macro parser is as follows (this is slightly simplified):

fn parse_tt(
    parser: &mut Cow<Parser>,
    ms: &[TokenTree],
) -> NamedParseResult

We use these items in macro parser:

  • parser is a reference to the state of a normal Rust parser, including the token stream and parsing session. The token stream is what we are about to ask the MBE parser to parse. We will consume the raw stream of tokens and output a binding of metavariables to corresponding token trees. The parsing session can be used to report parser errros.
  • ms a matcher. This is a sequence of token trees that we want to match the token stream against.

In the analogy of a regex parser, the token stream is the input and we are matching it against the pattern ms. Using our examples, the token stream could be the stream of tokens containing the inside of the example invocation print foo, while ms might be the sequence of token (trees) print $mvar:ident.

The output of the parser is a NamedParseResult, which indicates which of three cases has occurred:

  • Success: the token stream matches the given matcher ms, and we have produced a binding from metavariables to the corresponding token trees.
  • Failure: the token stream does not match ms. This results in an error message such as "No rule expected token blah".
  • Error: some fatal error has occurred in the parser. For example, this happens if there are more than one pattern match, since that indicates the macro is ambiguous.

The full interface is defined here.

The macro parser does pretty much exactly the same as a normal regex parser with one exception: in order to parse different types of metavariables, such as ident, block, expr, etc., the macro parser must sometimes call back to the normal Rust parser.

As mentioned above, both definitions and invocations of macros are parsed using the macro parser. This is extremely non-intuitive and self-referential. The code to parse macro definitions is in src/librustc_expand/mbe/macro_rules.rs. It defines the pattern for matching for a macro definition as $( $lhs:tt => $rhs:tt );+. In other words, a macro_rules definition should have in its body at least one occurrence of a token tree followed by => followed by another token tree. When the compiler comes to a macro_rules definition, it uses this pattern to match the two token trees per rule in the definition of the macro using the macro parser itself. In our example definition, the metavariable $lhs would match the patterns of both arms: (print $mvar:ident) and (print twice $mvar:ident). And $rhs would match the bodies of both arms: { println!("{}", $mvar); } and { println!("{}", $mvar); println!("{}", $mvar); }. The parser would keep this knowledge around for when it needs to expand a macro invocation.

When the compiler comes to a macro invocation, it parses that invocation using the same NFA-based macro parser that is described above. However, the matcher used is the first token tree ($lhs) extracted from the arms of the macro definition. Using our example, we would try to match the token stream print foo from the invocation against the matchers print $mvar:ident and print twice $mvar:ident that we previously extracted from the definition. The algorithm is exactly the same, but when the macro parser comes to a place in the current matcher where it needs to match a non-terminal (e.g. $mvar:ident), it calls back to the normal Rust parser to get the contents of that non-terminal. In this case, the Rust parser would look for an ident token, which it finds (foo) and returns to the macro parser. Then, the macro parser proceeds in parsing as normal. Also, note that exactly one of the matchers from the various arms should match the invocation; if there is more than one match, the parse is ambiguous, while if there are no matches at all, there is a syntax error.

For more information about the macro parser's implementation, see the comments in src/librustc_expand/mbe/macro_parser.rs.

macros and Macros 2.0

There is an old and mostly undocumented effort to improve the MBE system, give it more hygiene-related features, better scoping and visibility rules, etc. There hasn't been a lot of work on this recently, unfortunately. Internally, macro macros use the same machinery as today's MBEs; they just have additional syntactic sugar and are allowed to be in namespaces.

Procedural Macros

Precedural macros are also expanded during parsing, as mentioned above. However, they use a rather different mechanism. Rather than having a parser in the compiler, procedural macros are implemented as custom, third-party crates. The compiler will compile the proc macro crate and specially annotated functions in them (i.e. the proc macro itself), passing them a stream of tokens.

The proc macro can then transform the token stream and output a new token stream, which is synthesized into the AST.

It's worth noting that the token stream type used by proc macros is stable, so rustc does not use it internally (since our internal data structures are unstable).

TODO: more here.

Custom Derive

Custom derives are a special type of proc macro.

TODO: more?

Notes from petrochenkov discussion

TODO: sprinkle these links around the chapter...

Where to find the code:

  • librustc_span/hygiene.rs - structures related to hygiene and expansion that are kept in global data (can be accessed from any Ident without any context)
  • librustc_span/lib.rs - some secondary methods like macro backtrace using primary methods from hygiene.rs
  • librustc_builtin_macros - implementations of built-in macros (including macro attributes and derives) and some other early code generation facilities like injection of standard library imports or generation of test harness.
  • librustc_ast/config.rs - implementation of cfg/cfg_attr (they treated specially from other macros), should probably be moved into librustc_ast/ext.
  • librustc_ast/tokenstream.rs + librustc_ast/parse/token.rs - structures for compiler-side tokens, token trees, and token streams.
  • librustc_ast/ext - various expansion-related stuff
  • librustc_ast/ext/base.rs - basic structures used by expansion
  • librustc_ast/ext/expand.rs - some expansion structures and the bulk of expansion infrastructure code - collecting macro invocations, calling into resolve for them, calling their expanding functions, and integrating the results back into AST
  • librustc_ast/ext/placeholder.rs - the part of expand.rs responsible for "integrating the results back into AST" basicallly, "placeholder" is a temporary AST node replaced with macro expansion result nodes
  • librustc_ast/ext/builer.rs - helper functions for building AST for built-in macros in librustc_builtin_macros (and user-defined syntactic plugins previously), can probably be moved into librustc_builtin_macros these days
  • librustc_ast/ext/proc_macro.rs + librustc_ast/ext/proc_macro_server.rs - interfaces between the compiler and the stable proc_macro library, converting tokens and token streams between the two representations and sending them through C ABI
  • librustc_ast/ext/tt - implementation of macro_rules, turns macro_rules DSL into something with signature Fn(TokenStream) -> TokenStream that can eat and produce tokens, @mark-i-m knows more about this
  • librustc_resolve/macros.rs - resolving macro paths, validating those resolutions, reporting various "not found"/"found, but it's unstable"/"expected x, found y" errors
  • librustc_middle/hir/map/def_collector.rs + librustc_resolve/build_reduced_graph.rs - integrate an AST fragment freshly expanded from a macro into various parent/child structures like module hierarchy or "definition paths"

Primary structures:

  • HygieneData - global piece of data containing hygiene and expansion info that can be accessed from any Ident without any context
  • ExpnId - ID of a macro call or desugaring (and also expansion of that call/desugaring, depending on context)
  • ExpnInfo/InternalExpnData - a subset of properties from both macro definition and macro call available through global data
  • SyntaxContext - ID of a chain of nested macro definitions (identified by ExpnIds)
  • SyntaxContextData - data associated with the given SyntaxContext, mostly a cache for results of filtering that chain in different ways
  • Span - a code location + SyntaxContext
  • Ident - interned string (Symbol) + Span, i.e. a string with attached hygiene data
  • TokenStream - a collection of TokenTrees
  • TokenTree - a token (punctuation, identifier, or literal) or a delimited group (anything inside ()/[]/{})
  • SyntaxExtension - a lowered macro representation, contains its expander function transforming a tokenstream or AST into tokenstream or AST + some additional data like stability, or a list of unstable features allowed inside the macro.
  • SyntaxExtensionKind - expander functions may have several different signatures (take one token stream, or two, or a piece of AST, etc), this is an enum that lists them
  • ProcMacro/TTMacroExpander/AttrProcMacro/MultiItemModifier - traits representing the expander signatures (TODO: change and rename the signatures into something more consistent)
  • Resolver - a trait used to break crate dependencies (so resolver services can be used in librustc_ast, despite librustc_resolve and pretty much everything else depending on librustc_ast)
  • ExtCtxt/ExpansionData - various intermediate data kept and used by expansion infra in the process of its work
  • AstFragment - a piece of AST that can be produced by a macro (may include multiple homogeneous AST nodes, like e.g. a list of items)
  • Annotatable - a piece of AST that can be an attribute target, almost same thing as AstFragment except for types and patterns that can be produced by macros but cannot be annotated with attributes (TODO: Merge into AstFragment)
  • MacResult - a "polymorphic" AST fragment, something that can turn into a different AstFragment depending on its context (aka AstFragmentKind - item, or expression, or pattern etc.)
  • Invocation/InvocationKind - a structure describing a macro call, these structures are collected by the expansion infra (InvocationCollector), queued, resolved, expanded when resolved, etc.