
Sets of scopes macro hygiene

  • On the first day, God created programming languages, and he saw that they were good.
  • On the second day, he saw that there were expressivity problems and he created macros.
  • On the third day, he realised he'd been drinking pretty heavily on the second day and that he had unleashed a sack of footguns upon mankind. He decided that naming should depend on the scope where the macro is defined, not expanded. After a little contemplation, he decided this was harder than it sounds and went back to bed.
  • On the fourth, fifth, and sixth days God went about creating increasingly complex schemes for enforcing this macro hygiene thing. (Note that for the purposes of this analogy, one God day == 10 Lisp research years). Eventually, he saw that it was good enough.
  • On the seventh day, God rested while Matthew Flatt came up with the sets of scopes hygiene algorithm, which turns out to be much nicer.

Hygiene with scopes

The basic idea is very simple. Imagine we have some abstraction for a scope. Before expansion, we mark every identifier in the program (binding or otherwise) with the scope where the identifier is written. Then we do macro expansion. When we do name resolution, we use the scope we marked the identifier with, rather than the scope where the identifier ends up after expansion. In fact, name resolution becomes much simpler: we just have a global table of bindings mapping names to definitions, where those names include the scopes we marked them all with before expansion.

Of course, real life is not that simple: lexical scoping means that identifiers are visible in multiple scopes; before macros expand, it is hard to define what a scope means; and with procedural macros, the macro might want to mess around with the hygiene information.

Sets of scopes

The first innovation is to mark identifiers not with a single scope, but with a set of scopes. We match names not by requiring that they have the same scope, but by requiring that the binding's set of scopes is a subset of the scopes belonging to the use.

In normal name resolution, when dealing with lexical scoping, we don't just look up a name in a single scope; we go up a tree of scopes until we find the name we are interested in. In sets of scopes, this is made explicit: identifiers are marked with all scopes in which they are present.

When there are multiple bindings which have a subset of scopes, we take the binding with the largest subset (note that in some cases we might have two (or more) bindings whose scope sets are subsets of the use's where neither is larger than the other; in this case name resolution is ambiguous).

Example:

fn foo(x: ...) {
    let x = ...;
    let x = ...;
    x;
}

Here there are three bindings of x and one use. The use clearly refers to the last let, but let's look at the scope sets. First, let's make the scopes explicit and label them:

fn foo(x: ...) {           // 1
    {                      // 2
        let x = ...;
        {                  // 3
            let x = ...;
            x;
        }
    }
}

Now, let's annotate the variables:

fn foo(x /*{1}*/: ...) {           // 1
    {                              // 2
        let x /*{1, 2}*/ = ...;
        {                          // 3
            let x /*{1, 2, 3}*/ = ...;
            x /*{1, 2, 3}*/;
        }
    }
}

So, when resolving the use of x, we look up x{1, 2, 3} in a binding table:

x{1} -> ...
x{1, 2} -> ...
x{1, 2, 3} -> ...

Comparing the use with the entries in the table, each entry in the table has the right name and each set of scopes is a subset of the use's set of scopes; the final entry has the largest subset and so is the binding we pick. This matches our intuition.
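To make this lookup concrete, here is a minimal sketch of subset-based resolution in Rust. The names Binding, resolve, and scopes are mine, not from the paper, and scopes are modelled as plain sets of integers; real name resolution is of course far more involved.

```rust
use std::collections::BTreeSet;

type ScopeSet = BTreeSet<u32>;

// An entry in the global binding table: a name, the binder's scope set,
// and a label standing in for the definition it points at.
struct Binding {
    name: &'static str,
    scopes: ScopeSet,
    def: &'static str,
}

fn scopes(s: &[u32]) -> ScopeSet {
    s.iter().copied().collect()
}

// Resolve a use: keep entries with the right name whose scope set is a
// subset of the use's, then pick the one with the largest such set. Two
// incomparable maximal candidates make the reference ambiguous.
fn resolve(
    name: &str,
    use_scopes: &ScopeSet,
    table: &[Binding],
) -> Result<&'static str, &'static str> {
    let candidates: Vec<&Binding> = table
        .iter()
        .filter(|b| b.name == name && b.scopes.is_subset(use_scopes))
        .collect();
    let best = *candidates
        .iter()
        .max_by_key(|b| b.scopes.len())
        .ok_or("unresolved")?;
    // The winner must subsume every other candidate's scope set.
    if candidates.iter().any(|b| !b.scopes.is_subset(&best.scopes)) {
        return Err("ambiguous");
    }
    Ok(best.def)
}

fn main() {
    // The binding table from the example above: three bindings of x.
    let table = [
        Binding { name: "x", scopes: scopes(&[1]), def: "argument" },
        Binding { name: "x", scopes: scopes(&[1, 2]), def: "first let" },
        Binding { name: "x", scopes: scopes(&[1, 2, 3]), def: "second let" },
    ];
    // The use x{1, 2, 3}: all three entries are subsets; the largest wins.
    assert_eq!(resolve("x", &scopes(&[1, 2, 3]), &table), Ok("second let"));
    // A hypothetical use marked only {1, 2} cannot see the innermost binding.
    assert_eq!(resolve("x", &scopes(&[1, 2]), &table), Ok("first let"));
    println!("ok");
}
```

The ambiguity check also covers the case mentioned earlier: two candidate bindings with incomparable scope sets cause an error rather than an arbitrary choice.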

So many scopes

That is the intuition behind using sets of scopes, but what about macros? They are the whole point of this, right? We get into the nitty-gritty pretty quickly, and I expect this is why it took so long for an algorithm in this style to appear.

When we expand a macro we add a new scope to all identifiers from the macro expansion (the 'intro' scope). Identifiers which are passed to the macro do not get this scope, but do get a 'use-site' scope. Note that in all cases, identifiers will have the scopes due to the context where they are declared (or due to previous expansions).
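A minimal sketch of this marking step, assuming a toy representation where a macro's output is a flat list of tokens tagged by where they came from (Token, Ident, and expand are my names; real expansion works on full ASTs):

```rust
use std::collections::BTreeSet;

type ScopeSet = BTreeSet<u32>;

struct Ident {
    name: String,
    scopes: ScopeSet,
}

fn scopes(s: &[u32]) -> ScopeSet {
    s.iter().copied().collect()
}

// A token in the macro's output either comes from the macro body itself
// or was passed in as an argument at the use site.
enum Token {
    FromBody(Ident),
    FromArg(Ident),
}

// On expansion, identifiers from the body gain the fresh intro scope and
// identifiers from arguments gain the fresh use-site scope; both keep the
// scopes they already carried from the context where they were written.
fn expand(output: Vec<Token>, intro: u32, use_site: u32) -> Vec<Ident> {
    output
        .into_iter()
        .map(|tok| match tok {
            Token::FromBody(mut id) => {
                id.scopes.insert(intro);
                id
            }
            Token::FromArg(mut id) => {
                id.scopes.insert(use_site);
                id
            }
        })
        .collect()
}

fn main() {
    // x is written in the macro body (scopes from its definition site),
    // y is passed as an argument (scopes from the use site).
    let out = expand(
        vec![
            Token::FromBody(Ident { name: "x".into(), scopes: scopes(&[1]) }),
            Token::FromArg(Ident { name: "y".into(), scopes: scopes(&[2]) }),
        ],
        6, // fresh intro scope
        7, // fresh use-site scope
    );
    assert_eq!(out[0].scopes, scopes(&[1, 6]));
    assert_eq!(out[1].scopes, scopes(&[2, 7]));
    println!("ok");
}
```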

A definition context is a context where name bindings and uses are intermingled, rather than a name binding covering a given scope (in Rust, think of items in modules as opposed to let statements in function bodies). In a definition context, instead of having a single scope as in the simple let case, we have two scopes: an inside-edge scope and an outside-edge scope. Identifiers declared in the context get both scopes. Identifiers which expand into the context get the inside-edge scope only, in addition to either the intro or use-site scopes. However, when a macro expansion adds a binding to a definition context, we do not add a use-site scope to that binding (which we would if the bound name was passed to the macro).

Applying scope sets to Rust

An interesting aspect of how to apply sets of scopes to Rust is how the scheme interacts with name resolution, but I'll save that for another post. We'll ignore imports for now.

At first blush it seems that items have definition context (i.e., we should use inside- and outside-edge scopes, but omit use-site scopes on bindings) and local variables follow the scoped rules. However, I think all Rust names must be treated as being in definition contexts (see below for why), and therefore should get the inside/outside-edge treatment. Furthermore, since there are no bindings which are not treated like definition contexts, there is no need for use-site scopes.

Since we can't name items in parent modules without imports, we don't add scopes for outer modules. Function arguments and local variables do get treated as lexically scoping (up to the innermost function), where each let statement is considered to start a new scope. Macro expansion follows the algorithm as stated. Fields, methods, and associated types are not treated hygienically. I believe the prelude can be given the empty set of scopes, this means prelude names would be visible from anywhere, but have the lowest priority, which I think is the correct behaviour.

A function can have items and local variables declared in it. From the function body, we can name the items, but the items are declared before all variables (including the formal arguments). Therefore a function has two scopes - an outer one which includes the nested items and an inner one which includes the formal arguments and all variables.

There are a few wrinkles I foresee: the behaviour of scopes introduced by let, and the interaction between the opacity of modules and macro expansion. I expect the former to be unique to Rust, but I would have thought the latter might affect Racket too; however, I don't think such behaviour is covered in the paper.

The let problem is that, unlike in Scheme, the scope of a let is implicit: it ends at the end of the current block. If a macro introduces a let, the expectation is that its scope ends at the end of the current scope *post*-expansion. So, in

fn main() {
    macro foo() { let x = ...; }

    foo!();
    ...
}

The scope of let x, once foo! is expanded, should terminate at the end of the main function, after any statements in ....

This is why we must treat local variable contexts like definition contexts - we want the following to work:

fn main() {
    macro foo($x) { let $x = ...; }

    foo!(x);
    ...x...  // names the definition in the expansion of foo!
}

In mod a { fn f() {} mod b {} }, f cannot be named from inside b, and so identifiers in b should not be marked with a's scope (as opposed to in lexically nested let expressions).

The module problem is that if we have a macro inside a module c, we do not know whether we should mark identifiers in the macro with c's scope or not until after we have parsed the body of the macro.

These problems have related solutions. We deal with scopes for macro bodies after expansion. To do this, when we walk the AST assigning scopes, we stop at macro boundaries and record the scopes we would apply to the macro body as a pending scope set (note that if the macro is declared in a function, we only add the 'outer' scope as described earlier). When we expand, after parsing, we apply this pending set to the expanded AST, taking into account the usual rules. If, at the end of the macro body's AST, we have open scopes from let statements, we continue with any subsequent code until we find the end of a scope, adding the scopes to identifiers.
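The core of that pending-scope step can be sketched as follows, modelling an expanded body as a flat list of identifiers carrying sets of integer scopes (PendingExpansion and apply_pending are my names; the continuation into subsequent code for still-open let scopes is elided):

```rust
use std::collections::BTreeSet;

type ScopeSet = BTreeSet<u32>;

fn scopes(s: &[u32]) -> ScopeSet {
    s.iter().copied().collect()
}

// The scope-assignment walk stops at a macro use and records, as a
// pending set, the scopes it would otherwise have applied to the body.
struct PendingExpansion {
    pending: ScopeSet,
}

// After the macro is expanded and parsed, the pending set is applied to
// every identifier in the expanded code, on top of any scopes the
// identifiers already gained during expansion.
fn apply_pending(exp: &PendingExpansion, expanded: &mut [(String, ScopeSet)]) {
    for (_, s) in expanded.iter_mut() {
        s.extend(exp.pending.iter().copied());
    }
}

fn main() {
    // A macro used inside some module whose scopes we encode as {10, 11}.
    let exp = PendingExpansion { pending: scopes(&[10, 11]) };
    // The expanded body after parsing; 60 stands for the intro scope the
    // identifier picked up during expansion.
    let mut body = vec![("x".to_string(), scopes(&[60]))];
    apply_pending(&exp, &mut body);
    assert_eq!(body[0].1, scopes(&[10, 11, 60]));
    println!("ok");
}
```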

Since we want to be able to define variables in macros and refer to them from outside, we must add an inside-edge scope to macro expansions for each scope they expand into, including a function's inner scope. I.e., we treat all contexts as definition contexts. Thus, use-site scopes are redundant. This is convenient: because it is impossible to tell whether a variable use in Rust is a binding until name resolution, the treatment of use-site scopes would be awkward.

Examples

We'll start with scope 0 being the whole crate, I'll write the scopes due to a binder in square brackets and the scopes on a variable use in braces:

mod a { // a[1in, 1out] a1{0in, 0out}
    macro foo($x: ident) { // foo{1in, 1out} pending{1in, 1out}
        let x = 42;
        let $x = 0;
        bar(x);
    }

    fn bar(x: u32) { // bar{1in, 1out}
        println!("{}", x);
    }
}

fn main() { // main[3in, 3out, 4in, 4out] main{0in, 0out}
    let x = 100; // x[5in, 5out] x{0in, 0out, 3in, 3out, 4in, 4out, 5in, 5out}
    ::a1::foo!(x); // x{0in, 0out, 3in, 3out, 4in, 4out, 5in, 5out}
    ::a1::bar(x); // x{0in, 0out, 3in, 3out, 4in, 4out, 5in, 5out}
}

I'll wave my hands a bit about name resolution until a later blog post, but when we see an absolute path we look for the crate root, ignoring most of the scopes, then continue lookup using the scopes we find for the given paths. Anyway, let's assume we can look up the macros and we'll leave other name resolution out completely. After expansion we have:

...

fn main() { // main[3in, 3out, 4in, 4out] main{0in, 0out}
    let x = 100; // x[5in, 5out] x{0in, 0out, 3in, 3out, 4in, 4out, 5in, 5out}

    let x = 42; // x{1in, 1out, 4in, 5in, 6intro, 7in, 7out}
    let x = 0;  // x{0in, 0out, 3in, 3out, 4in, 4out, 5in, 5out, 7in, 7out, 8in, 8out}
    bar(x);     // bar{1in, 1out, ...} x{1in, 1out, 4in, 5in, 6intro, 7in, 7out, 8in, 8out}

    ::a1::bar(x); // x{0in, 0out, 3in, 3out, 4in, 4out, 5in, 5out, 7in, 7out, 8in, 8out}
}

We've applied the pending scopes 1in, 1out to the macro body and applied the 6intro scope. We've added inside-edge scopes 4 and 5 due to main's inner scope and the first let. We've also gone through the newly created scopes due to the lets and added scopes 7 and 8 (in and out).

Name resolution then follows these scope sets. Of interest is the first use of bar: in current Rust, this would fail. However, using sets of scopes for name resolution, we find the right definition.

Next, consider the x in that call to bar: here the only x with a subset of scopes is the second let (the first in the macro). The x in the second call to bar is matched by both the first and third lets, but the third has a larger subset.

Next, we'll consider macros in items, plus some library macros I've been considering:

mod a { // a[1in, 1out] a1{0in, 0out}
    fn c() { ... } // c{1in, 1out}

    ::b::foo!(d, c); // d{1in, 1out}, c{1in, 1out}
}

mod b { // b[2in, 2out] b{0in, 0out}
    fn c() { ... } // c{2in, 2out}

    macro foo($x: ident, $y: ident) { // pending{2in, 2out}
        fn $concat!($x, _1)() { c() }
        fn $concat!($x, _2)() { $y() }
        fn $define!(d_3)() {}
    }
}

Here the macro concat! takes an identifier and some strings and creates a new identifier with the hygiene of the input identifier (see this blog post for more details). The macro define! takes a string and makes an identifier out of it with the empty set of scopes. Note that both these macros are implemented as procedural macros and take the option of doing their own hygiene computation; they do not add intro scopes, nor do they add use-site scopes to their arguments.

The $m! syntax means expand the macro m! eagerly, i.e., before we parse the rest of the macro. More precisely, we expand m! in the context of the current macro expansion, rather than in its own context (this is basically local-expand from Scheme). Eagerly expanded macros do not pick up the intro scopes of the macro they are used in, but do make their own intro scopes; however, in these examples the macros used avoid making intro scopes.

Expansion starts with:

mod a {
    ...

    fn $concat!(d, _1)() { c() } // d{1in, 1out} c{}
    fn $concat!(d, _2)() { c() } // d{1in, 1out} c{1in, 1out}
    fn $define!(d_3)() {}
}

...

Expansion continues (before parsing):

mod a {
    ...

    fn d_1() { c() } // d_1{1in, 1out} c{}
    fn d_2() { c() } // d_2{1in, 1out} c{1in, 1out}
    fn d_3() {}      // d_3{1in}
}

...

All of the function names pick up a 1in scope due to expanding into a, but this only affects d_3.

We would complete expansion by parsing, adding the pending scopes from the macro, and any scopes due to expansion:

mod a {
    ...

    fn d_1() { c() } // d_1{1in, 1out} d_1[5in, 5out, 6in, 6out] c{2in, 2out, 5in, 5out, 6in, 6out}
    fn d_2() { c() } // d_2{1in, 1out} d_2[7in, 7out, 8in, 8out] c{1in, 1out, 7in, 7out, 8in, 8out}
    fn d_3() {}      // d_3{1in} d_3[9in, 9out, 10in, 10out]
}

...

After expansion the calls to c() will find the correct function based on their scopes. From outside a macro, anyone can call any of the d_ functions since they will be looking for a name with a subset of {1in, 1out, ...}.

If we wanted to call d_1 from inside the macro foo (say from the body of d_3), we would have to use concat!($x, _1) (which would give scopes {1in, 1out, 9in, 9out, 10in, 10out}), not the literal ident d_1 (scopes: {1in, 2in, 2out, 4intro, 9in, 9out, 10in, 10out}) or $define!(d_1) (scopes: {1in, 9in, 9out, 10in, 10out}).

Let's now consider a macro defined inside the same function where it is used:

fn main() { // main[1in, 1out, 2in, 2out] main{0in, 0out}
    fn baz() {}  // baz{0in, 0out, 1in, 1out}

    macro foo($x) { // foo{0in, 0out, 1in, 1out} pending{0in, 0out, 1in, 1out}
        let x = 0;
        let $x = 42;
        bar(x);
        baz();
    }

    foo!(x); // x{0in, 0out, 1in, 1out, 2in, 2out}
    bar(x); // x{0in, 0out, 1in, 1out, 2in, 2out}
}

Expands to

fn main() { // main[1in, 1out, 2in, 2out] main{0in, 0out}
    ...

    let x = 0;  //   x{0in, 0out, 1in, 1out, 2in, 3intro, 4in, 4out}
    let x = 42; //   x{0in, 0out, 1in, 1out, 2in, 2out, 4in, 4out, 5in, 5out}
    bar(x);     //   x{0in, 0out, 1in, 1out, 2in, 3intro, 4in, 4out, 5in, 5out}
    baz();      // baz{0in, 0out, 1in, 1out, 2in, 3intro, 4in, 4out, 5in, 5out}

    bar(x);     //   x{0in, 0out, 1in, 1out, 2in, 2out, 4in, 4out, 5in, 5out}
}

During expansion we apply the pending scopes, an introduction scope, an inside-edge scope for the scopes we are expanding into, and scopes due to the newly introduced lets. Examining the first use of x, only the first definition of x can apply because the second's scopes are not a subset - it includes 2out. The second use only matches the second definition because the first has 3intro. There is no change to the treatment of bar from the earlier example. The use of baz still corresponds to its definition.

One problem here is that we don't know whether the foo! use is the last item or the first statement, so it's not clear whether it should get main's inner scope or not. However, I think it is safe either way, since there can't be any scopes before it and macro definitions must be items.

Backwards compatibility

I would like to replace mtwt in Rust with sets of scopes. For new macros this will be no problem, but for existing macro_rules macros we must maintain backwards compatibility. Where we are currently hygienic, that should not be a problem; however, where we are currently unhygienic, we must emulate this lack of hygiene. For example,

fn a() -> u32 { 42 }

macro_rules! foo {
    () => { a() }
}

fn main() {
    fn a() -> u32 { 0 }
    
    println!("{}", foo!());
}

When executing this program, we currently print 0. However, under the scheme discussed above, we would print 42. We can't solve this by doing unhygienic comparison on types and other identifiers which are currently unhygienic, because by the time we come to name resolution, we don't know which kind of macro the identifier came through.

We should be able to emulate non-hygienic expansion with sets of scopes - just use the set of scopes for the expansion site, rather than the 'proper' sets of scopes. However, the question is whether we know at expansion time whether an identifier should be treated hygienically or unhygienically. I believe that at the end of expansion, after parsing, we do know that, so we can then go back and change scopes where necessary. This needs some more thought to be sure.
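A sketch of that emulation, modelling scope sets as sets of integers (make_unhygienic and the particular scope numbers are my invention): for an identifier found, post-parse, to need unhygienic treatment, we discard the scopes it carried through the macro and substitute the expansion site's scopes, so it resolves exactly as if it had been written at the use site.

```rust
use std::collections::BTreeSet;

type ScopeSet = BTreeSet<u32>;

fn scopes(s: &[u32]) -> ScopeSet {
    s.iter().copied().collect()
}

// Emulate non-hygienic expansion: throw away the scopes the identifier
// acquired through the macro and use the expansion site's scopes instead.
fn make_unhygienic(id_scopes: &mut ScopeSet, expansion_site: &ScopeSet) {
    id_scopes.clear();
    id_scopes.extend(expansion_site.iter().copied());
}

fn main() {
    // The a() call from the example: hygienic expansion would give it the
    // macro definition site's scopes, say {0}, resolving to the outer fn a.
    let mut a_use = scopes(&[0]);
    // The expansion site is inside main's body, say scopes {0, 1, 2}.
    let site = scopes(&[0, 1, 2]);
    make_unhygienic(&mut a_use, &site);
    // With the site's scopes, the inner fn a (bound with {0, 1, 2}) becomes
    // the largest-subset match, so the program prints 0 as it does today.
    assert_eq!(a_use, scopes(&[0, 1, 2]));
    println!("ok");
}
```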