Macros in Rust pt3

Previously I covered macro_rules and procedural macros. In this post I'll cover macro hygiene in a bit more detail. In upcoming posts, I'll cover modularisation for macros, some implementation details, some areas where I think we should improve the system, and some ideas for doing so.

macro hygiene in Rust

I explained the general concept of macro hygiene in this blog post.

The good news is that all macros in Rust are hygienic (caveat, see the bad news below). When you use macros in Rust, variable naming is hygienic, this is best illustrated with some examples:

macro_rules! foo {  
    () => {
        let x = 0;
    }
}

fn main() {
    let mut x = 42;
    foo!();
    println!("{}", x);
}

Here the xs defined in main and foo! are disjoint, so we print 42.

macro_rules! foo {  
    ($x: ident) => {
        $x = 0;
    }
}

fn main() {
    let mut x = 42;
    foo!(x);
    println!("{}", x);
}

Here, x is passed in to the macro so foo! can modify main's x. We print 0.

macro_rules! foo {  
    ($x: ident) => {
        let $x = 0;
    }
}

fn main() {
    let x = 42;
    foo!(x);
    println!("{}", x);
}

Again, we print 0. This version is (IMO) slightly unfortunate because looking at main it seems like x should not be mutable, but foo! redefines x. A limitation of Rust's hygiene system is that although it limits a macro's effects to objects 'passed' to the macro, it does not limit those effects to what you would expect from the rest of the language. One could imagine a stricter system where the user must explicitly state how the macro can affect the use context (in this case by adding a variable to the environment).

fn main() {
    let mut x = 42;

    macro_rules! foo {  
        () => {
            x = 0;
        }
    }

    foo!();
    println!("{}", x);
}

In this example, main's x is in scope for foo!, so we print 0.

fn main() {
    let mut x = 42;

    macro_rules! foo {  
        () => {
            x = 0;
        }
    }

    let mut x = 33;

    foo!();
    println!("{}", x);
}

Finally, in this example we print 33 because foo! can only 'see' the first x and modifies that one, not the second x which is printed. Note that you can't write code without macros to do what foo! does here - you can't name the first x because it is hidden by the second one, but macro hygiene means the two xs are different to the macro.

Rust is also hygienic with respect to labels, e.g.,

fn main() {
    macro_rules! foo {  
        ($e: expr) => {
            'foo: loop {
                $e;
            }
        }
    }
    

    'foo: loop {
        foo!{
            break 'foo
        }
    }
}

This example terminates after a single iteration, whereas with unhygienic macro expansion, it would loop forever. (Compiling this example does give an erroneous warning, but that is due to a bug).

The macro system is also hygienic with respect to Rust's system of feature gating and stability checking.

limitations

There is some bad news too: hygiene basically only works on expression variables and labels, there's a bunch of places we're poor with hygiene. We don't apply hygiene to lifetime or type variables or types themselves. We are also unhygienic with respect to privacy and safety (i.e., use of unsafe blocks).

When referring to items from a macro, we resolve names in the context of the use, without taking into account the definition site. That means that to be safe, all references to non-local names must use an absolute path, and such names must be visible anywhere the macro is to be used.

It is not exactly clear what hygienic expansion of items should look like. For example, if a macro defines a function (or any other item), should that function be nameable from outside the macro (assuming it's name wasn't passed in)? It currently behaves naively - it can be accessed and has the scope of the use, without applying any kind of hygiene.

implementation

For the purposes of this discussion, the phases of the Rust compiler basically look like this:

lexing -> parsing -> macro expansion -> name resolution -> analysis and translation

Note that lexing, parsing, and macro expansion are all sort of mixed up together to some extent, but that doesn't matter too much. The macro expansion phase is hygienic, we'll get back to exactly why later. Name resolution is the phase where we take syntactic names and resolve them to definitions. For example in

let foo = ...;
...
let x = foo;

when we resolve foo in the second statement, we find the definition of foo in the first statement and store a 'link' between the two. Name resolution is a bit complicated - it must take into account relative and absolute paths, use items including pub use, glob imports, imports which introduce an alias, lexical scoping, redefinitions using let, the type vs value namespacing, and so forth. At the end of all of that we have to be able to answer the question does foo == foo? In an un-hygienic system, this just comes down to string equality (well, modulo possible unicode normalisation). In Rust, the question is a little more complicated.

When we parse Rust code, identifiers are represented as Names, these are just strings, interned for the sake of efficiency. For name resolution, we consider Idents, an Ident is a Name and a syntax context. These syntax contexts are created during macro expansion. To check if two Idents are equal we resolve each Name under it's syntax context (note that this resolution is a completely separate concept from name resolution). The result of this resolution is a new Name, we then compare these resolved Names for equality (which is an integer comparison, since they are interned strings).

Syntax contexts are added to Idents during macro expansion. There is a distinguished context, EMPTY_CTXT, which is initially given to each Ident during parsing. Each Ident gets a new context during expansion.

Macro hygiene works in the same way for macro_rules and procedural macros. It depends only on the AST produced by the macro, not how it was produced. However, procedural macros can manually override this process by assigning syntax contexts to Idents themselves. If the syntax context is unknown to the hygiene algorithm, then it will be left alone. If it is known, then it will be treated the same way as if the algorithm has assigned it during a previous expansion.

Since we compare Names by comparing indices into a table, if we create two different entries in that table, then we will have two Names with the same string representation, but which are not equal when compared. This is sometimes useful for creating fresh variables with useful names which will not clash with other variables in the same scope. To get such a name use the gensym or gensym_ident functions in token.rs.

Note that if you use such identifiers in a syntax extension, then macro hygiene using syntax contexts is applied 'on top'. So two Idents will be incompatible if they are different entries in the interner table to start with or if they are renamed differently. In fact, such gensym'ed names are a key part of the hygiene algorithm.

The mtwt algorithm

The macro hygiene algorithm we use in Rust comes from a paper titled Macros That Work Together, thus we call it the mtwt algorithm. It is mostly implemented in mtwt.rs.

We use the mtwt algorithm during macro expansion. We walk the whole AST for the crate, expanding macro uses and applying the algorithm to all identifiers. Note that mtwt requires we touch the whole tree, not just those parts which are produced by macros.

In some ways, we apply hygiene lazily. We walk the AST during expansion computing syntax contexts, but we don't actually apply those syntax contexts until we resolve an Ident to a Name using its syntax context (I think this is only done in name resolution).

Mtwt has two key concepts - marking and renaming. Marking is applied when we expand a macro and renaming is done when we enter a new scope. Conceptually, a syntax context under mtwt consists of a series of marks and renames. However, syntax contexts are interned, and so the syntax context in an Ident is an integer index into the table of syntax contexts - two Idents with the same conceptual syntax context should have the same index.

marking

When we expand a macro, the newly created code is marked with a fresh mark (you can imagine just using integers for the marks). Arguments are not marked when they are expanded. So, if we have the following macro definition,

macro_rules! foo {  
    ($x: ident) => {
        let y = 42;
        let $x = y;
    }
}

We would replace foo!(a) with the code

let y = 42;
let a = y;

where the declaration and use of y would be marked with a fresh mark, but a would not.

renaming

Whenever we enter a scope during expansion we rename all identifiers. However, the meaning of scope here is somewhat interesting. In a traditional setting, a let expression looks something like (let var expr body) (please excuse the pseudo-code) which defines a new variable called var with value given by expr and scope body. This is roughly equivalent to { let var = expr; body } in Rust. Note that in the traditional syntax, the scope of the variable is very explicit.

In Rust, a scope from the hygiene algorithm's point of view includes functions, but not blocks (i.e., { ... }). It also includes an implicit scope which starts with any let statement (or other pattern introduction) and ends when the variables declared go out of scope. E.g.,

fn foo(w: Bar) {     //             | w
    let x = ...;     //         | x |
    {                //         |   |
        let y = ...; //     | y |   |
        ...          //     |   |   |
        let z = ...; // | z |   |   |
        ...          // |   |   |   |
    }                //         |   |
    ...              //         |   |
}                    //

Here, there are four scopes, shown on the right hand side.

Every identifier in a scope is 'renamed'. A renaming syntax context consists of a 'from' Ident and a 'to' Name. The 'from' Ident is the variable identifier that starts the scope (e.g., w or z in the above example). We use an Ident rather than a Name for the 'from' identifier because we may have applied the mtwt algorithm to it earlier, so it may already be marked and/or renamed.

Note that renaming is cumulative, so the syntax context for the ... under let z... in the above example will include renames for w, x, y, and z. The name we rename to is a fresh name which looks the same as the original name, but is not equal to it. We use the gensym approach described above to implement this. So we might rename from foo interned at index 31 to foo interned at 45234.

resolution

When we call mtwt::resolve on an Ident we compute how its syntax context affects the Name. We essentially walk the syntax context applying marks and renaming where relevant, ending up with a renamed Name. Marks are simple, we apply them all. So at any point of resolution, we have a name and a Vec of marks applied. Marks do not affect the end result of resolution (i.e., once we have fully resolved, we then ignore the marks), but they do affect renaming. When we come across a rename we compare the 'from' Ident with the current name being resolved, they are considered equal if the names are equal (which will take into account renaming already performed) and the list of marks on each are the same.

Let's go through an example. We'll denote names as foo[i] where foo is the name as written in the source code and i is the index into the interned string table. So, foo[1] and foo[2] will look the same in the source code, but are not equal. It is not possible to have different names with the same index.

A mark is denoted as mark(n) where n is an integer, this just marks all names with n.

A renaming is denoted as rename(from -> to) where from is an Ident, however, we will consider them pre-resolved to a name and a set of marks, e.g., foo[42](2, 5, 6) the name foo with index 42 marked with 2, 5, and 6.

Let's look at the following syntax context:

rename(x[1]() -> x[4]), mark(1), mark(2), rename(y[2]() -> y[5]), rename(y[2](1, 2) -> y[6])

Lets first apply this context to the name x[1]: we start with no marks: x[1](). The first rename matches, so we rename to x[4](), then we add the two marks: x[4](1, 2). Neither of the later two renames match, so we end up with the resolved name x[4].

Now consider y[2]. In this case the first rename does not match, but we still apply the marks, giving y[2](1, 2). The next rename matches the name, but not the marks, so is not applied, the next rename does match, so we get y[6](1, 2), so the resolved name is y[6].