Macros in Rust pt4

Previously: intro, macro_rules, procedural macros, hygiene.

This post is a grab-bag of stuff that didn't fit elsewhere. First, I'll discuss some more macro_rules stuff around the import, export, and modularisation of macros. Then a few more implementation details. Next time, I'll get into what I think are issues that we should fix.

scoping and import/export

In Rust, the ordering of items is mostly irrelevant, i.e., you can refer to a function before it is declared. However, for macros, order is important: you can only refer to a macro after it is declared. This applies even where the declaration and/or use is inside a module. E.g.,

mod a; // can't name foo!
macro_rules! foo { ... }
mod b; // can name foo!

The module system works differently for macros: module structure does not impose a naming scheme on macros like it does on other items. Macros defined inside a module cannot be used outside it unless that module is annotated with #[macro_use]. That means you can define module-private macros. If you only want some of a module's macros to be usable outside it, you can list those macros in parentheses, e.g., #[macro_use(foo, bar)]. Macros defined before and outside a module can be used inside the module without being imported.

#[macro_use]
mod foo {
    pub fn bar() {}

    // Doesn't need to be `pub` (and can't be).
    macro_rules! baz ...
}

fn main() {
    ::foo::bar(); // bar cannot be named without importing it here or using a path.
    baz!();       // baz can - modules don't matter for naming.
}

Macros are encapsulated by crates; they must be explicitly imported and exported. When declaring a macro, if you want it to be visible in other crates, you must mark it with #[macro_export]. When importing a crate using extern crate, you must annotate the extern crate statement with #[macro_use]. You can also re-export macros from a crate (so somebody who pulls in your crate can use the macros) using #[macro_reexport(reexported)]; note that this attribute is feature gated.

As mentioned in the last post, you should nearly always refer to items and types in macro definitions using absolute paths. If a macro is exported and used in another crate, there is no way (in regular Rust code) to write an absolute path for the exporting crate which is valid in the importing crate. For example, in the crate foo we might refer to ::bar::baz but in the importing crate we would need foo::bar::baz. To solve this you can use the magic path element $crate, e.g., $crate::bar::baz. It will be replaced with the crate name when the macro is exported, but ignored (giving an absolute path) when the macro is used in the same crate in which it is declared.

Note that if you export a macro from crate A which uses items defined in crate B into crate C, then crate C will need an extern crate B;.

For more details and discussion, see RFC 453. Note that not everything in that RFC was actually implemented.

more implementation details

data structures

When discussing macros, there are three representations of the program which are important: the source text, token trees, and the AST.

The source text is the text as passed to rustc. It is stored (pretty much verbatim) in the codemap, a data structure representing an entire crate. Files loaded due to modules are included, but not inlined (a CodeMap has many FileMaps, one per file). It is immutable: it is not modified due to macro expansion, or for any other reason. It can be indexed by character indices, byte indices, or by line.

Once the source text has been lexed, it is not used very much by the compiler. The main use case is in error messages.

Lexing is the first stage of compilation, where source text is transformed into tokens. In Rust, we have token trees, rather than a token stream. There are tokens for each keyword, each piece of punctuation recognised by Rust, string literals, names, etc. Tokens are mostly defined in token.rs. There are some interesting edge cases in tokenisation; for example, in <X: Foo<Y>> the tokens are <, X, :, Foo, <, Y, and >>. The >> is split into two > tokens during parsing.

Token trees are trees because we match opening and closing brackets - (), {}, and [] (but not <>). We do this so that we can find the scope of macros from the token tree without having to parse. This does mean you can't use unbalanced brackets in your input to macros, but that is a minor limitation. TokenTree is defined in ast.rs.

Parsing a token tree gives an AST (abstract syntax tree). An AST is a tree of nodes representing the source text in a fairly concrete way. Rust's AST even includes parentheses and doc comments, although not regular comments. Each expression and sub-expression is a node in the AST; for example, x.foo(a, b, c) is an ExprMethodCall with four sub-expressions for x, a, b, and c, each of which is an ExprPath. Likewise, an item like a trait is an AST node and its functions are child nodes.

Many AST nodes have an id field which is a NodeId; this gives an easy way to identify a node in the AST. NodeIds are crate-relative; for referring to nodes in other crates, we use DefIds. Many nodes also have a Span, which is basically a pair of start and end indices into the codemap. Spans are used for highlighting sections of code in error messages, among other things (more on them later).

parsing process

At the coarsest level you can think of Rust compilation in three phases: parsing and expansion, analysis, and code generation. The first phase is implemented in libsyntax and consists of lexing, parsing, and macro expansion. Everything in this phase is purely syntactic - we don't do any analysis which requires assigning meaning to code. The analysis and code generation phases are implemented in the various librustc_* crates; analysis begins with name resolution, which we discussed briefly in the last post. Libsyntax is slightly more user-facing than the rest of the compiler. We convert the AST into a high-level intermediate representation (HIR) in the compiler, before name resolution, so that the compiler's internals are isolated from libsyntax and from tools or procedural macros which hook into it.

Although we talk about lexing, parsing, and macro expansion as distinct phases, that is not really the case. As far as the compiler is concerned there is just parsing, which takes a file or string and returns an AST.

The process begins by reading the source text of the first file into a FileMap, which is stored in a CodeMap. That FileMap is lexed to produce a token tree, which is wrapped in a Parser. We then start parsing by reading the tokens in order.

If we encounter a mod foo; item, then we read the corresponding file into a FileMap (and store that in our codemap), lex it, wrap it in a parser, and parse it. That will give an ItemMod AST node, which we insert into the outer module's AST.

If we encounter a macro use (which includes a use of macro_rules!, i.e., a macro_rules macro definition), then we do not parse its arguments; we save the token tree into the AST unparsed.

Once parsing is complete, then we have an AST for the whole crate which will include macro uses with un-parsed token trees as arguments. We can then start macro expansion.

As described last time, macro expansion has to walk the entire AST in order to properly implement macro hygiene. During this walk we build up a table of macro definitions. When we encounter a macro use on the walk, we look it up in this table and try to match the token tree argument with one of the patterns in the macro definition. If that succeeds then we perform substitution into the macro body, parse the result (which may even cause us to read more files and lex them), and insert it into the AST to replace the macro use. We then continue to walk down into the new AST to further expand macros.

Expanding procedural macros is similar, except that the procedural macro must explicitly do any parsing required.

At the end of this process we have an expanded AST. This contains a subset of AST nodes (no macro uses, although definitions are left in). Even at this stage there may still be unparsed token trees in the AST due to macro definitions; however, these are not used when compiling the crate, they are left in so that they can be referred to from other crates via the crate's metadata.

spans and expansion stacks

A span identifies a section of code in the source text. It is implemented as two indices into the codemap: a start and an end. This saves a lot of memory compared with saving text inline. For example, for the statement let foo = bar(a, b); we store spans for a, b, bar, bar(a, b), foo, and the whole statement.

The primary use of spans is in error messages. Say we couldn't find a variable called b; then we would use the span on b's AST node to highlight it in the error message, along with some context.

If an error occurs due to code that came from a macro expansion, then there is no single part of the source code to highlight. We want to highlight the erroneous code in the macro, and the macro use, since the error could be due to code in either place. Furthermore, if the macro expansion had many steps (due to nested macros), we want to record each stage in expansion.

To handle this scenario, spans are more complex than just a start and end index. (By the way, Spans are defined in codemap.rs.) They also hold an ExpnId (expansion id), which is an index into a table of expansion traces; this table is stored in the CodeMap. Each entry in the table is an ExpnInfo instance. If there is no expansion trace (i.e., the span is code written directly by the user with no macro expansion), then the expansion id is NO_EXPANSION.

An ExpnInfo contains information about the macro use and the macro definition (called call_site and callee respectively, although technically, macros are not called).

Consider

1 | fn main() {
2 |     macro_rules! foo {  
3 |         ($e: expr) => {
4 |             $e
5 |         }
6 |     }
7 |
8 |     macro_rules! bar {  
9 |         ($e: expr) => {
10|             foo!($e);
11|         }
12|     }
13|
14|
15|     bar!(42);
16| }

This will expand to 42. The span of that token (expressed as row, column) will be Span { lo: (15, 10), hi: (15, 12), expn_id: 0 } (since that is where the token originates). The ExpnInfo for ids 0 and 1 (which is referenced by 0) is:

0: ExpnInfo {
    call_site: Span { lo: (10, 13), hi: (10, 21), expn_id: 1 },
    callee: (foo, Span { lo: (2, 5), hi: (6, 6), expn_id: NO_EXPANSION }),
}
1: ExpnInfo {
    call_site: Span { lo: (15, 5), hi: (15, 14), expn_id: NO_EXPANSION },
    callee: (bar, Span { lo: (8, 5), hi: (12, 6), expn_id: NO_EXPANSION }),
}

The call_site always refers to a macro call somewhere in the source, the callee refers to the whole macro which is used. Since our example macros are not defined by macros, there is no further expansion info for either callee. However, in general expansion traces are trees of expansions.

If we think of the stages of expansion, the first step is to expand bar!(42) to foo!(42); this is given by ExpnInfo 1. Then we expand foo!(42) to 42, given by ExpnInfo 0. Note that the expansion trace starts with the fully expanded code and goes 'backwards' to the pre-expansion code. Note also that the span we have for 42 is for an argument to a macro, but the expansion trace is all about macros and their calls; we don't have information about how arguments are propagated.