Macro expansion
librustc_ast, librustc_expand, and librustc_builtin_macros are all undergoing refactoring, so some of the links in this chapter may be broken.
Rust has a very powerful macro system. There are two major types of macros:
macro_rules! macros (a.k.a. "Macros By Example" (MBE)) and procedural macros
("proc macros"; including custom derives). During the parsing phase, the normal
Rust parser will set aside the contents of macros and their invocations. Later,
before name resolution, macros are expanded using these portions of the code.
In this chapter, we will discuss MBEs, proc macros, and hygiene. Both types of
macros are expanded during parsing, but the expansion happens in different ways.
Macros By Example
MBEs have their own parser distinct from the normal Rust parser. When macros
are expanded, we may invoke the MBE parser to parse and expand a macro. The
MBE parser, in turn, may call the normal Rust parser when it needs to bind a
metavariable (e.g. $my_expr) while parsing the contents of a macro
invocation. The code for macro expansion is in
src/librustc_expand/mbe/.
Example
It's helpful to have an example to refer to. For the remainder of this chapter, whenever we refer to the "example definition", we mean the following:
macro_rules! printer {
    (print $mvar:ident) => {
        println!("{}", $mvar);
    };
    (print twice $mvar:ident) => {
        println!("{}", $mvar);
        println!("{}", $mvar);
    };
}
$mvar is called a metavariable. Unlike normal variables, rather than
binding to a value in a computation, a metavariable binds at compile time to
a tree of tokens. A token is a single "unit" of the grammar, such as an
identifier (e.g. foo) or punctuation (e.g. =>). There are also other
special tokens, such as EOF, which indicates that there are no more tokens.
Token trees result from pairing parentheses-like characters ((...),
[...], and {...}); a token tree includes the open and close delimiters and all the tokens
in between (we do require that parentheses-like characters be balanced). Having
macro expansion operate on token streams rather than the raw bytes of a source
file abstracts away a lot of complexity. The macro expander (and much of the
rest of the compiler) doesn't really care that much about the exact line and
column of some syntactic construct in the code; it cares about what constructs
are used in the code. Using tokens allows us to care about what without
worrying about where. For more information about tokens, see the
Parsing chapter of this book.
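To get a feel for what a token tree is in practice, here is a small, self-contained macro_rules! example (our own illustration, not compiler code) that counts how many token trees it is given:
// Counts the token trees in its input, one at a time.
macro_rules! count_tts {
    () => { 0usize };
    ($head:tt $($tail:tt)*) => { 1usize + count_tts!($($tail)*) };
}

fn main() {
    // `1 + 2` is three token trees: `1`, `+`, and `2`.
    assert_eq!(count_tts!(1 + 2), 3);
    // `(1 + 2)` is a single token tree: one parenthesized group containing
    // the tokens `1`, `+`, and `2`.
    assert_eq!(count_tts!((1 + 2)), 1);
}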
Whenever we refer to the "example invocation", we mean the following snippet:
printer!(print foo); // Assume `foo` is a variable defined somewhere else...
The process of expanding the macro invocation into the syntax tree
println!("{}", foo) and then expanding that into a call to Display::fmt is
called macro expansion, and it is the topic of this chapter.
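Putting the pieces together, a complete program using the example definition and the example invocation looks like this (the value 42 is arbitrary; the definition is repeated so the snippet is self-contained):
macro_rules! printer {
    (print $mvar:ident) => {
        println!("{}", $mvar);
    };
    (print twice $mvar:ident) => {
        println!("{}", $mvar);
        println!("{}", $mvar);
    };
}

fn main() {
    let foo = 42;
    printer!(print foo);       // expands to `println!("{}", foo);`
    printer!(print twice foo); // expands to two `println!("{}", foo);` calls
}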
The MBE parser
There are two parts to macro expansion: parsing the definition and parsing the invocations. Interestingly, both are done by the macro parser.
Basically, the macro parser is like an NFA-based regex parser. It uses an
algorithm similar in spirit to the Earley parsing
algorithm. The macro parser is
defined in src/librustc_expand/mbe/macro_parser.rs.
The interface of the macro parser is as follows (this is slightly simplified):
fn parse_tt(
    parser: &mut Cow<Parser>,
    ms: &[TokenTree],
) -> NamedParseResult
We use these items in the macro parser:
- parser is a reference to the state of a normal Rust parser, including the token stream and the parsing session. The token stream is what we are about to ask the MBE parser to parse. We will consume the raw stream of tokens and output a binding of metavariables to corresponding token trees. The parsing session can be used to report parser errors.
- ms is a matcher. This is a sequence of token trees that we want to match the token stream against.
In the analogy of a regex parser, the token stream is the input and we are matching it
against the pattern ms. Using our examples, the token stream could be the stream of
tokens containing the inside of the example invocation print foo, while ms
might be the sequence of token (trees) print $mvar:ident.
The output of the parser is a NamedParseResult, which indicates which of
three cases has occurred:
- Success: the token stream matches the given matcher ms, and we have produced a binding from metavariables to the corresponding token trees.
- Failure: the token stream does not match ms. This results in an error message such as "No rule expected token blah".
- Error: some fatal error has occurred in the parser. For example, this happens if there is more than one pattern match, since that indicates the macro is ambiguous.
The full interface is defined here.
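As a rough mental model only (the real types in macro_parser.rs are different and carry richer payloads such as spans), the three cases can be pictured like this:
use std::collections::HashMap;

// A simplified stand-in for NamedParseResult, purely for illustration.
#[allow(dead_code)] // only the Success case is constructed in this tiny demo
enum NamedParseResult {
    // The matcher matched: metavariable names map to the matched token trees
    // (represented here as plain strings).
    Success(HashMap<String, String>),
    // The token stream did not match the matcher ("no rules expected ...").
    Failure(String),
    // A fatal error, e.g. the invocation was ambiguous.
    Error(String),
}

fn main() {
    // What a successful parse of `print foo` against `print $mvar:ident`
    // might conceptually produce: `$mvar` bound to the token `foo`.
    let result = NamedParseResult::Success(
        [("mvar".to_string(), "foo".to_string())].into_iter().collect(),
    );
    match result {
        NamedParseResult::Success(bindings) => println!("matched: {:?}", bindings),
        NamedParseResult::Failure(msg) => println!("no match: {}", msg),
        NamedParseResult::Error(msg) => println!("error: {}", msg),
    }
}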
The macro parser does pretty much exactly the same thing as a normal regex parser with
one exception: in order to parse different types of metavariables, such as
ident, block, expr, etc., the macro parser must sometimes call back to the
normal Rust parser.
As mentioned above, both definitions and invocations of macros are parsed using
the macro parser. This is extremely non-intuitive and self-referential. The code
to parse macro definitions is in
src/librustc_expand/mbe/macro_rules.rs. It defines the pattern for
matching a macro definition as $( $lhs:tt => $rhs:tt );+. In other words,
a macro_rules definition should have in its body at least one occurrence of a
token tree followed by => followed by another token tree. When the compiler
comes to a macro_rules definition, it uses this pattern to match the two token
trees per rule in the definition of the macro using the macro parser itself.
In our example definition, the metavariable $lhs would match the patterns of
both arms: (print $mvar:ident) and (print twice $mvar:ident). And $rhs
would match the bodies of both arms: { println!("{}", $mvar); } and { println!("{}", $mvar); println!("{}", $mvar); }. The parser would keep this
knowledge around for when it needs to expand a macro invocation.
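To make the $( $lhs:tt => $rhs:tt );+ matcher more concrete, here is a user-level imitation (ours, not the compiler's code; the macro name count_rules is invented): a macro whose matcher has the same shape, applied to the two rules of the example definition.
// Counts how many `lhs => rhs` rules it is given, separated by `;`.
macro_rules! count_rules {
    ( $( $lhs:tt => $rhs:tt );+ ) => {
        [ $( stringify!($lhs) ),+ ].len()
    };
}

fn main() {
    // The two rules of the example `printer!` definition: each side of a
    // `=>` is a single delimited token tree.
    let n = count_rules!(
        (print $mvar:ident) => { println!("{}", $mvar); };
        (print twice $mvar:ident) => { println!("{}", $mvar); println!("{}", $mvar); }
    );
    assert_eq!(n, 2);
}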
When the compiler comes to a macro invocation, it parses that invocation using
the same NFA-based macro parser that is described above. However, the matcher
used is the first token tree ($lhs) extracted from the arms of the macro
definition. Using our example, we would try to match the token stream print foo from the invocation against the matchers print $mvar:ident and print twice $mvar:ident that we previously extracted from the definition. The
algorithm is exactly the same, but when the macro parser comes to a place in the
current matcher where it needs to match a non-terminal (e.g. $mvar:ident),
it calls back to the normal Rust parser to get the contents of that
non-terminal. In this case, the Rust parser would look for an ident token,
which it finds (foo) and returns to the macro parser. Then, the macro parser
proceeds in parsing as normal. Also, note that exactly one of the matchers from
the various arms should match the invocation; if there is more than one match,
the parse is ambiguous, while if there are no matches at all, there is a syntax
error.
For more information about the macro parser's implementation, see the comments
in src/librustc_expand/mbe/macro_parser.rs.
macros and Macros 2.0
There is an old and mostly undocumented effort to improve the MBE system by giving
it more hygiene-related features, better scoping and visibility rules, etc. Unfortunately, there
hasn't been a lot of work on this recently. Internally, macro macros (that is, macros
declared with the macro keyword) use the same machinery as today's MBEs; they just have additional
syntactic sugar and are allowed to be in namespaces.
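For reference, the surface syntax looks roughly like this (a sketch only; it requires a nightly compiler with the unstable decl_macro feature, and details may change):
// Nightly-only sketch of the `macro` (Macros 2.0) syntax.
#![feature(decl_macro)]

// Unlike `macro_rules!`, `macro` items are name-resolved like other items,
// so they can be `pub` and referred to by path.
pub macro printer($mvar:ident) {
    println!("{}", $mvar);
}

fn main() {
    let foo = 42;
    printer!(foo);
}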
Procedural Macros
Procedural macros are also expanded during parsing, as mentioned above. However, they use a rather different mechanism. Rather than having a parser in the compiler, procedural macros are implemented as custom, third-party crates. The compiler compiles the proc macro crate and calls the specially annotated functions in it (i.e. the proc macros themselves), passing them a stream of tokens.
The proc macro can then transform the token stream and output a new token stream, which is synthesized into the AST.
It's worth noting that the token stream type used by proc macros is stable,
so rustc does not use it internally (since our internal data structures are
unstable).
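As a minimal sketch of what such a crate looks like (the function name noop is made up; the crate must be built with proc-macro = true in its Cargo.toml):
// lib.rs of a hypothetical proc macro crate.
use proc_macro::TokenStream;

// The compiler calls this function with the tokens of each `noop!(...)`
// invocation and splices whatever it returns back into the caller's code.
#[proc_macro]
pub fn noop(input: TokenStream) -> TokenStream {
    // A real macro would inspect and transform `input`; this one returns it
    // unchanged.
    input
}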
TODO: more here.
Custom Derive
Custom derives are a special type of proc macro.
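A minimal sketch of a custom derive (again in its own proc macro crate; the derive name Hello and the function name are made up):
use proc_macro::TokenStream;

// Lets users write `#[derive(Hello)]` on a type. The compiler passes in the
// tokens of the annotated item; whatever we return is added alongside it.
#[proc_macro_derive(Hello)]
pub fn derive_hello(_item: TokenStream) -> TokenStream {
    // A real derive would parse `_item` and generate an impl for the type;
    // this one generates nothing.
    TokenStream::new()
}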
TODO: more?
Hygiene
If you have ever used C/C++ preprocessor macros, you know that there are some annoying and hard-to-debug gotchas! For example, consider the following C code:
#define DEFINE_FOO struct Bar {int x;}; struct Foo {Bar bar;};
// Then, somewhere else
struct Bar {
...
};
DEFINE_FOO
Most people avoid writing C like this – and for good reason: it doesn't
compile. The struct Bar defined by the macro clashes with the struct Bar defined in the code. Consider also the following example:
#define DO_FOO(x) {\
    int y = 0;\
    foo(x, y);\
}
// Then elsewhere
int y = 22;
DO_FOO(y);
Do you see the problem? We wanted to generate a call foo(22, 0), but instead
we got foo(0, 0) because the macro defined its own y!
These are both examples of macro hygiene issues. Hygiene relates to how to handle names defined within a macro. In particular, a hygienic macro system prevents errors due to names introduced within a macro. Rust macros are hygienic in that they do not allow one to write the sorts of bugs above.
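Here is the Rust analogue of the second C example above; the snippet is our own, but the behavior it demonstrates is exactly the hygiene guarantee just described:
macro_rules! do_foo {
    ($x:expr) => {{
        // This `y` lives in the macro's own syntax context, so it cannot
        // shadow or capture the caller's `y`.
        let y = 0;
        println!("{} {}", $x, y);
    }};
}

fn main() {
    let y = 22;
    do_foo!(y); // prints "22 0": the caller's `y` and the macro's `y` stay distinct
}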
At a high level, hygiene within the Rust compiler is accomplished by keeping track of the context where a name is introduced and used. We can then disambiguate names based on that context. Future iterations of the macro system will give the macro author greater control over that context. For example, a macro author may want to introduce a new name into the context where the macro was called. Alternately, the macro author may be defining a variable for use only within the macro (i.e. it should not be visible outside the macro).
This section is about how that context is tracked.
Notes from petrochenkov discussion
Where to find the code:
- librustc_span/hygiene.rs - structures related to hygiene and expansion that are kept in global data (can be accessed from any Ident without any context)
- librustc_span/lib.rs - some secondary methods like macro backtrace using primary methods from hygiene.rs
- librustc_builtin_macros - implementations of built-in macros (including macro attributes and derives) and some other early code generation facilities like injection of standard library imports or generation of test harness.
- librustc_ast/config.rs - implementation of cfg/cfg_attr (they are treated specially compared to other macros); should probably be moved into librustc_ast/ext
- librustc_ast/tokenstream.rs + librustc_ast/parse/token.rs - structures for compiler-side tokens, token trees, and token streams.
- librustc_ast/ext - various expansion-related stuff
- librustc_ast/ext/base.rs - basic structures used by expansion
- librustc_ast/ext/expand.rs - some expansion structures and the bulk of expansion infrastructure code - collecting macro invocations, calling into resolve for them, calling their expanding functions, and integrating the results back into AST
- librustc_ast/ext/placeholder.rs - the part of expand.rs responsible for "integrating the results back into AST"; basically, a "placeholder" is a temporary AST node that is replaced with the macro's expansion result nodes
- librustc_ast/ext/build.rs - helper functions for building AST for built-in macros in librustc_builtin_macros (and previously for user-defined syntactic plugins); can probably be moved into librustc_builtin_macros these days
- librustc_ast/ext/proc_macro.rs + librustc_ast/ext/proc_macro_server.rs - interfaces between the compiler and the stable proc_macro library, converting tokens and token streams between the two representations and sending them through C ABI
- librustc_ast/ext/tt - implementation of macro_rules, turns macro_rules DSL into something with signature Fn(TokenStream) -> TokenStream that can eat and produce tokens, @mark-i-m knows more about this
- librustc_resolve/macros.rs - resolving macro paths, validating those resolutions, reporting various "not found"/"found, but it's unstable"/"expected x, found y" errors
- librustc_middle/hir/map/def_collector.rs + librustc_resolve/build_reduced_graph.rs - integrate an AST fragment freshly expanded from a macro into various parent/child structures like module hierarchy or "definition paths"
Primary structures:
- HygieneData - global piece of data containing hygiene and expansion info that can be accessed from any Ident without any context
- ExpnId - ID of a macro call or desugaring (and also expansion of that call/desugaring, depending on context)
- ExpnInfo/InternalExpnData - a subset of properties from both macro definition and macro call available through global data
- SyntaxContext - ID of a chain of nested macro definitions (identified by ExpnIds)
- SyntaxContextData - data associated with the given SyntaxContext, mostly a cache for results of filtering that chain in different ways
- Span - a code location + SyntaxContext
- Ident - interned string (Symbol) + Span, i.e. a string with attached hygiene data
- TokenStream - a collection of TokenTrees
- TokenTree - a token (punctuation, identifier, or literal) or a delimited group (anything inside ()/[]/{})
- SyntaxExtension - a lowered macro representation, contains its expander function transforming a tokenstream or AST into tokenstream or AST + some additional data like stability, or a list of unstable features allowed inside the macro.
- SyntaxExtensionKind - expander functions may have several different signatures (take one token stream, or two, or a piece of AST, etc), this is an enum that lists them
- ProcMacro/TTMacroExpander/AttrProcMacro/MultiItemModifier - traits representing the expander signatures (TODO: change and rename the signatures into something more consistent)
- Resolver - a trait used to break crate dependencies (so resolver services can be used in librustc_ast, despite librustc_resolve and pretty much everything else depending on librustc_ast)
- ExtCtxt/ExpansionData - various intermediate data kept and used by expansion infra in the process of its work
- AstFragment - a piece of AST that can be produced by a macro (may include multiple homogeneous AST nodes, like e.g. a list of items)
- Annotatable - a piece of AST that can be an attribute target; almost the same thing as AstFragment, except for types and patterns, which can be produced by macros but cannot be annotated with attributes (TODO: Merge into AstFragment)
- MacResult - a "polymorphic" AST fragment, something that can turn into a different AstFragment depending on its context (aka AstFragmentKind - item, or expression, or pattern etc.)
- Invocation/InvocationKind - a structure describing a macro call, these structures are collected by the expansion infra (InvocationCollector), queued, resolved, expanded when resolved, etc.
TODO: how a crate transitions from the state "macros exist as written in source" to "all macros are expanded"
Expansion Hierarchies and Syntax Context
- Many AST nodes have some sort of syntax context, especially nodes from macros.
- When we ask what the syntax context of a node is, the answer actually differs by what we are trying to do. Thus, we don't just keep track of a single context. There are in fact 3 different types of context used for different things.
- Each type of context is tracked by an "expansion hierarchy". As we expand macros, new macro calls or macro definitions may be generated, leading to some nesting. This nesting is where the hierarchies come from. Each hierarchy tracks a different aspect, though, as we will see.
- There are 3 expansion hierarchies:
  - They all start at ExpnId::root, which is its own parent.
  - The context of a node consists of a chain of expansions leading to ExpnId::root. A non-macro-expanded node has syntax context 0 (SyntaxContext::empty()), which represents just the root node.
  - There are vectors in HygieneData that contain expansion info. There are entries in them for both SyntaxContext::empty() and ExpnId::root, but they aren't used much.
- The first hierarchy tracks expansion order: when a macro invocation is in the output of another macro. InternalExpnData::parent is the child->parent link: expn_id1 points to expn_id2, which points to ...
  Example: macro_rules! foo { () => { println!(); } } fn main() { foo!(); }
  Then the AST nodes that are finally generated would have parent(expn_id_println) -> parent(expn_id_foo), right?
- The second hierarchy tracks macro definitions: when we are expanding one macro, another macro definition is revealed in its output. SyntaxContextData::parent is the child->parent link here: SyntaxContext1 points to SyntaxContext2, which points to ... SyntaxContext is the whole chain in this hierarchy, and SyntaxContextData::outer_expns are the individual elements in the chain.
Discussion about hygiene
Vadim Petrochenkov: Pretty common construction (at least it was, before refactorings) is SyntaxContext::empty().apply_mark(expn_id), which means a token produced by a built-in macro (which is defined in the root effectively).
Vadim Petrochenkov: Or a stable proc macro, which are always considered to be defined in the root because they are always cross-crate, and we don't have the cross-crate hygiene implemented, ha-ha.
mark-i-m: Where does the expn_id come from?
Vadim Petrochenkov: ID of the built-in macro call like line!().
Vadim Petrochenkov: Assigned continuously from 0 to N as soon as we discover new macro calls.
mark-i-m: Sorry, I didn't quite understand. Do you mean that only built-in macros receive continuous IDs?
Vadim Petrochenkov: So, the second hierarchy has a catch - the context transplantation hack - https://github.com/rust-lang/rust/pull/51762#issuecomment-401400732.
Vadim Petrochenkov:
Do you mean that only built-in macros receive continuous IDs?
Vadim Petrochenkov: No, all macro calls receive ID.
Vadim Petrochenkov: Built-ins have the typical pattern SyntaxContext::empty().apply_mark(expn_id) for syntax contexts produced by them.
mark-i-m: I see, but this pattern is only used for built-ins, right?
Vadim Petrochenkov: And also all stable proc macros, see the comments above.
Vadim Petrochenkov: The third hierarchy is call-site hierarchy.
Vadim Petrochenkov: If foo!(bar!(ident)) expands into ident
Vadim Petrochenkov: then hierarchy 1 is root -> foo -> bar -> ident
Vadim Petrochenkov: but hierarchy 3 is root -> ident
Vadim Petrochenkov: ExpnInfo::call_site is the child-parent link in this case.
mark-i-m: When we expand, do we expand foo first or bar? Why is there a hierarchy 1 here? Is that foo expands first and it expands to something that contains bar!(ident)?
Vadim Petrochenkov: Ah, yes, let's assume both foo and bar are identity macros.
Vadim Petrochenkov: Then foo!(bar!(ident)) -> expand -> bar!(ident) -> expand -> ident
Vadim Petrochenkov: If bar were expanded first, that would be eager expansion - https://github.com/rust-lang/rfcs/pull/2320.
mark-i-m: And after we expand only foo! presumably whatever intermediate state has hierarchy 1 of root->foo->(bar_ident), right?
Vadim Petrochenkov: (We have it hacked into some built-in macros, but not generally.)
Vadim Petrochenkov:
And after we expand only foo! presumably whatever intermediate state has
hierarchy 1 of root->foo->(bar_ident), right?
Vadim Petrochenkov: Yes.
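For reference, a runnable version of the example under discussion, with foo and bar written as token-forwarding macro_rules! macros (our own formulation of the "identity macros" above):
// `foo` and `bar` just forward their input tokens unchanged.
macro_rules! bar { ($($t:tt)*) => { $($t)* } }
macro_rules! foo { ($($t:tt)*) => { $($t)* } }

fn main() {
    let ident = 0;
    // Expansion order: foo!(bar!(ident)) -> bar!(ident) -> ident.
    // Hierarchy 1 (expansion order) for the final token: root -> foo -> bar -> ident.
    // Hierarchy 3 (call sites): both calls are written directly in this
    // function body, so the chain is just root -> ident.
    let _x = foo!(bar!(ident));
}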
Vadim Petrochenkov: Ok, let's move from hygiene to expansion.
Vadim Petrochenkov: Especially given that I don't remember the specific hygiene
algorithms like adjust in detail.
Vadim Petrochenkov:
Given some piece of rust code, how do we get to the point where things are
expanded
So, first of all, the "some piece of rust code" is the whole crate.
mark-i-m: Just to confirm, the algorithms are well-encapsulated, right? Like a
function or a struct as opposed to a bunch of conventions distributed across
the codebase?
Vadim Petrochenkov: We run fully_expand_fragment on it.
Vadim Petrochenkov:
Just to confirm, the algorithms are well-encapsulated, right?
Yes, the algorithmic parts are entirely inside hygiene.rs.
Vadim Petrochenkov: Ok, some are in fn resolve_crate_root, but those are hacks.
Vadim Petrochenkov: (Continuing about expansion.) If fully_expand_fragment is
run not on a whole crate, it means that we are performing eager expansion.
Vadim Petrochenkov: Eager expansion is done for arguments of some built-in
macros that expect literals.
Vadim Petrochenkov: It generally performs a subset of actions performed by the
non-eager expansion.
Vadim Petrochenkov: So, I'll talk about non-eager expansion for now.
mark-i-m: Eager expansion is not exposed as a language feature, right? i.e. it
is not possible for me to write an eager macro?
Vadim Petrochenkov:
https://github.com/rust-lang/rust/pull/53778#issuecomment-419224049 (vvv The
link is explained below vvv )
Vadim Petrochenkov:
Eager expansion is not exposed as a language feature, right? i.e. it is not
possible for me to write an eager macro?
Yes, it's entirely an ability of some built-in macros.
Vadim Petrochenkov: Not exposed for general use.
Vadim Petrochenkov: fully_expand_fragment works in iterations.
Vadim Petrochenkov: Iterations look roughly like this:
- Resolve imports in our partially built crate as much as possible.
- Collect as many macro invocations as possible (fn-like, attributes, derives) from our partially built crate and add them to the queue.
Vadim Petrochenkov: Take a macro from the queue, and attempt to resolve it.
Vadim Petrochenkov: If it's resolved - run its expander function that
consumes tokens or AST and produces tokens or AST (depending on the macro
kind).
Vadim Petrochenkov: (If it's not resolved, then put it back into the
queue.)
Vadim Petrochenkov: ^^^ That's where we fill in the hygiene data associated
with ExpnIds.
mark-i-m: When we put it back in the queue?
mark-i-m: or do you mean the collect step in general?
Vadim Petrochenkov: Once we resolved the macro call to the macro definition we
know everything about the macro and can call set_expn_data to fill in its
properties in the global data.
Vadim Petrochenkov: I mean, immediately after successful resolution.
Vadim Petrochenkov: That's the first part of hygiene data, the second one is
associated with SyntaxContext rather than with ExpnId, it's filled in later
during expansion.
Vadim Petrochenkov: So, after we run the macro's expander function and got a
piece of AST (or got tokens and parsed them into a piece of AST) we need to
integrate that piece of AST into the big existing partially built AST.
Vadim Petrochenkov: This integration is a really important step where the next
things happen:
- NodeIds are assigned.
Vadim Petrochenkov: "def paths"s and their IDs (DefIds) are created
Vadim Petrochenkov: Names are put into modules from the resolver point of
view.
Vadim Petrochenkov: So, we are basically turning some vague token-like mass
into a proper, set-in-stone hierarchical AST and side tables.
Vadim Petrochenkov: Where exactly this happens - NodeIds are assigned by
InvocationCollector (which also collects new macro calls from this new AST
piece and adds them to the queue), DefIds are created by DefCollector, and
modules are filled by BuildReducedGraphVisitor.
Vadim Petrochenkov: These three passes run one after another on every AST
fragment freshly expanded from a macro.
Vadim Petrochenkov: After expanding a single macro and integrating its output
we again try to resolve all imports in the crate, and then return to the big
queue processing loop and pick up the next macro.
Vadim Petrochenkov: Repeat until there's no more macros. Vadim Petrochenkov:
mark-i-m: The integration step is where we would get parser errors too right?
mark-i-m: Also, when do we know definitively that resolution has failed for
particular ident?
Vadim Petrochenkov:
The integration step is where we would get parser errors too right?
Yes, if the macro produced tokens (rather than AST directly) and we had to
parse them.
Vadim Petrochenkov:
when do we know definitively that resolution has failed for particular
ident?
So, ident is looked up in a number of scopes during resolution. From closest
like the current block or module, to far away like preludes or built-in types.
Vadim Petrochenkov: If the lookup has certainly failed in all of the scopes, then
it's certainly failed.
mark-i-m: This is after all expansions and integrations are done, right?
Vadim Petrochenkov: "Certainly" is determined differently for different scopes,
e.g. for a module scope it means no unexpanded macros and no unresolved glob
imports in that module.
Vadim Petrochenkov:
This is after all expansions and integrations are done, right?
For macro and import names this happens during expansions and integrations.
Vadim Petrochenkov: For all other names we certainly know whether a name is
resolved successfully or not on the first attempt, because no new names can
appear.
Vadim Petrochenkov: (They are resolved in a later pass, see
librustc_resolve/late.rs.)
mark-i-m: And if at the end of the iteration, there are still things in the
queue that can't be resolved, this represents an error, right?
mark-i-m: i.e. an undefined macro?
Vadim Petrochenkov: Yes, if we make no progress during an iteration, then we
are stuck and that state represents an error.
Vadim Petrochenkov: We attempt to recover though, using dummies expanding into
nothing or ExprKind::Err or something like that for unresolved macros.
mark-i-m: This is for the purposes of diagnostics, though, right?
Vadim Petrochenkov: But if we are going through recovery, then compilation must
result in an error anyway.
Vadim Petrochenkov: Yes, that's for diagnostics; without recovery we would get
stuck at the first unresolved macro or import.
Vadim Petrochenkov: So, about the SyntaxContext hygiene...
Vadim Petrochenkov: New syntax contexts are created during macro expansion.
Vadim Petrochenkov: If the token had context X before being produced by a
macro, e.g. here ident has context SyntaxContext::root():
Vadim Petrochenkov: macro m() { ident }
Vadim Petrochenkov: then after being produced by the macro it has context X
-> macro_id.
Vadim Petrochenkov: I.e. our ident has context ROOT -> id(m) after it's
produced by m.
Vadim Petrochenkov: The "chaining operator" -> is apply_mark in compiler code.
Vadim Petrochenkov: macro m() { macro n() { ident } }
Vadim Petrochenkov: In this example the ident has context ROOT originally, then
ROOT -> id(m), then ROOT -> id(m) -> id(n).
Vadim Petrochenkov: Note that these chains are not entirely determined by their
last element, in other words ExpnId is not isomorphic to SyntaxContext.
Vadim Petrochenkov: Counterexample:
Vadim Petrochenkov: macro m($i: ident) { macro n() { ($i, bar) } }
m!(foo);
Vadim Petrochenkov: foo has context ROOT -> id(n) and bar has context ROOT ->
id(m) -> id(n) after all the expansions.