Macros in Rust pt3
Previously I covered macro_rules and procedural macros. In this post I'll cover macro hygiene in a bit more detail. In upcoming posts, I'll cover modularisation for macros, some implementation details, some areas where I think we should improve the system, and some ideas for doing so.
macro hygiene in Rust
I explained the general concept of macro hygiene in this blog post.
The good news is that all macros in Rust are hygienic (caveat, see the bad news below). When you use macros in Rust, variable naming is hygienic, this is best illustrated with some examples:
macro_rules! foo {
() => {
let x = 0;
}
}
fn main() {
let mut x = 42;
foo!();
println!("{}", x);
}
Here the x
s defined in main
and foo!
are disjoint, so we print 42
.
macro_rules! foo {
($x: ident) => {
$x = 0;
}
}
fn main() {
let mut x = 42;
foo!(x);
println!("{}", x);
}
Here, x
is passed in to the macro so foo!
can modify main
's x
. We print 0
.
macro_rules! foo {
($x: ident) => {
let $x = 0;
}
}
fn main() {
let x = 42;
foo!(x);
println!("{}", x);
}
Again, we print 0
. This version is (IMO) slightly unfortunate because looking at main
it seems like x
should not be mutable, but foo!
redefines x
. A limitation of Rust's hygiene system is that although it limits a macro's effects to objects 'passed' to the macro, it does not limit those effects to what you would expect from the rest of the language. One could imagine a stricter system where the user must explicitly state how the macro can affect the use context (in this case by adding a variable to the environment).
fn main() {
let mut x = 42;
macro_rules! foo {
() => {
x = 0;
}
}
foo!();
println!("{}", x);
}
In this example, main
's x
is in scope for foo!
, so we print 0
.
fn main() {
let mut x = 42;
macro_rules! foo {
() => {
x = 0;
}
}
let mut x = 33;
foo!();
println!("{}", x);
}
Finally, in this example we print 33
because foo!
can only 'see' the first x
and modifies that one, not the second x
which is printed. Note that you can't write code without macros to do what foo!
does here - you can't name the first x
because it is hidden by the second one, but macro hygiene means the two x
s are different to the macro.
Rust is also hygienic with respect to labels, e.g.,
fn main() {
macro_rules! foo {
($e: expr) => {
'foo: loop {
$e;
}
}
}
'foo: loop {
foo!{
break 'foo
}
}
}
This example terminates after a single iteration, whereas with unhygienic macro expansion, it would loop forever. (Compiling this example does give an erroneous warning, but that is due to a bug).
The macro system is also hygienic with respect to Rust's system of feature gating and stability checking.
limitations
There is some bad news too: hygiene basically only works on expression variables and labels, there's a bunch of places we're poor with hygiene. We don't apply hygiene to lifetime or type variables or types themselves. We are also unhygienic with respect to privacy and safety (i.e., use of unsafe
blocks).
When referring to items from a macro, we resolve names in the context of the use, without taking into account the definition site. That means that to be safe, all references to non-local names must use an absolute path, and such names must be visible anywhere the macro is to be used.
It is not exactly clear what hygienic expansion of items should look like. For example, if a macro defines a function (or any other item), should that function be nameable from outside the macro (assuming it's name wasn't passed in)? It currently behaves naively - it can be accessed and has the scope of the use, without applying any kind of hygiene.
implementation
For the purposes of this discussion, the phases of the Rust compiler basically look like this:
lexing -> parsing -> macro expansion -> name resolution -> analysis and translation
Note that lexing, parsing, and macro expansion are all sort of mixed up together to some extent, but that doesn't matter too much. The macro expansion phase is hygienic, we'll get back to exactly why later. Name resolution is the phase where we take syntactic names and resolve them to definitions. For example in
let foo = ...;
...
let x = foo;
when we resolve foo
in the second statement, we find the definition of foo
in the first statement and store a 'link' between the two. Name resolution is a bit complicated - it must take into account relative and absolute paths, use
items including pub use
, glob imports, imports which introduce an alias, lexical scoping, redefinitions using let
, the type vs value namespacing, and so forth. At the end of all of that we have to be able to answer the question does foo == foo
? In an un-hygienic system, this just comes down to string equality (well, modulo possible unicode normalisation). In Rust, the question is a little more complicated.
When we parse Rust code, identifiers are represented as Name
s, these are just strings, interned for the sake of efficiency. For name resolution, we consider Ident
s, an Ident
is a Name
and a syntax context. These syntax contexts are created during macro expansion. To check if two Ident
s are equal we resolve each Name
under it's syntax context (note that this resolution is a completely separate concept from name resolution). The result of this resolution is a new Name
, we then compare these resolved Name
s for equality (which is an integer comparison, since they are interned strings).
Syntax contexts are added to Ident
s during macro expansion. There is a distinguished context, EMPTY_CTXT
, which is initially given to each Ident
during parsing. Each Ident
gets a new context during expansion.
Macro hygiene works in the same way for macro_rules and procedural macros. It depends only on the AST produced by the macro, not how it was produced. However, procedural macros can manually override this process by assigning syntax contexts to Ident
s themselves. If the syntax context is unknown to the hygiene algorithm, then it will be left alone. If it is known, then it will be treated the same way as if the algorithm has assigned it during a previous expansion.
Since we compare Name
s by comparing indices into a table, if we create two different entries in that table, then we will have two Name
s with the same string representation, but which are not equal when compared. This is sometimes useful for creating fresh variables with useful names which will not clash with other variables in the same scope. To get such a name use the gensym
or gensym_ident
functions in token.rs.
Note that if you use such identifiers in a syntax extension, then macro hygiene using syntax contexts is applied 'on top'. So two Ident
s will be incompatible if they are different entries in the interner table to start with or if they are renamed differently. In fact, such gensym'ed names are a key part of the hygiene algorithm.
The mtwt algorithm
The macro hygiene algorithm we use in Rust comes from a paper titled Macros That Work Together, thus we call it the mtwt algorithm. It is mostly implemented in mtwt.rs.
We use the mtwt algorithm during macro expansion. We walk the whole AST for the crate, expanding macro uses and applying the algorithm to all identifiers. Note that mtwt requires we touch the whole tree, not just those parts which are produced by macros.
In some ways, we apply hygiene lazily. We walk the AST during expansion computing syntax contexts, but we don't actually apply those syntax contexts until we resolve an Ident
to a Name
using its syntax context (I think this is only done in name resolution).
Mtwt has two key concepts - marking and renaming. Marking is applied when we expand a macro and renaming is done when we enter a new scope. Conceptually, a syntax context under mtwt consists of a series of marks and renames. However, syntax contexts are interned, and so the syntax context in an Ident
is an integer index into the table of syntax contexts - two Idents
with the same conceptual syntax context should have the same index.
marking
When we expand a macro, the newly created code is marked with a fresh mark (you can imagine just using integers for the marks). Arguments are not marked when they are expanded. So, if we have the following macro definition,
macro_rules! foo {
($x: ident) => {
let y = 42;
let $x = y;
}
}
We would replace foo!(a)
with the code
let y = 42;
let a = y;
where the declaration and use of y
would be marked with a fresh mark, but a
would not.
renaming
Whenever we enter a scope during expansion we rename all identifiers. However, the meaning of scope here is somewhat interesting. In a traditional setting, a let
expression looks something like (let var expr body)
(please excuse the pseudo-code) which defines a new variable called var
with value given by expr
and scope body
. This is roughly equivalent to { let var = expr; body }
in Rust. Note that in the traditional syntax, the scope of the variable is very explicit.
In Rust, a scope from the hygiene algorithm's point of view includes functions, but not blocks (i.e., { ... }
). It also includes an implicit scope which starts with any let
statement (or other pattern introduction) and ends when the variables declared go out of scope. E.g.,
fn foo(w: Bar) { // | w
let x = ...; // | x |
{ // | |
let y = ...; // | y | |
... // | | |
let z = ...; // | z | | |
... // | | | |
} // | |
... // | |
} //
Here, there are four scopes, shown on the right hand side.
Every identifier in a scope is 'renamed'. A renaming syntax context consists of a 'from' Ident
and a 'to' Name
. The 'from' Ident
is the variable identifier that starts the scope (e.g., w
or z
in the above example). We use an Ident
rather than a Name
for the 'from' identifier because we may have applied the mtwt algorithm to it earlier, so it may already be marked and/or renamed.
Note that renaming is cumulative, so the syntax context for the ...
under let z...
in the above example will include renames for w
, x
, y
, and z
. The name we rename to is a fresh name which looks the same as the original name, but is not equal to it. We use the gensym approach described above to implement this. So we might rename from foo
interned at index 31
to foo
interned at 45234
.
resolution
When we call mtwt::resolve
on an Ident
we compute how its syntax context affects the Name
. We essentially walk the syntax context applying marks and renaming where relevant, ending up with a renamed Name
. Marks are simple, we apply them all. So at any point of resolution, we have a name and a Vec
of marks applied. Marks do not affect the end result of resolution (i.e., once we have fully resolved, we then ignore the marks), but they do affect renaming. When we come across a rename we compare the 'from' Ident
with the current name being resolved, they are considered equal if the names are equal (which will take into account renaming already performed) and the list of marks on each are the same.
Let's go through an example. We'll denote names as foo[i]
where foo
is the name as written in the source code and i
is the index into the interned string table. So, foo[1]
and foo[2]
will look the same in the source code, but are not equal. It is not possible to have different names with the same index.
A mark is denoted as mark(n)
where n
is an integer, this just marks all names with n
.
A renaming is denoted as rename(from -> to)
where from
is an Ident
, however, we will consider them pre-resolved to a name and a set of marks, e.g., foo[42](2, 5, 6)
the name foo
with index 42
marked with 2
, 5
, and 6
.
Let's look at the following syntax context:
rename(x[1]() -> x[4]), mark(1), mark(2), rename(y[2]() -> y[5]), rename(y[2](1, 2) -> y[6])
Lets first apply this context to the name x[1]
: we start with no marks: x[1]()
. The first rename matches, so we rename to x[4]()
, then we add the two marks: x[4](1, 2)
. Neither of the later two renames match, so we end up with the resolved name x[4]
.
Now consider y[2]
. In this case the first rename does not match, but we still apply the marks, giving y[2](1, 2)
. The next rename matches the name, but not the marks, so is not applied, the next rename does match, so we get y[6](1, 2)
, so the resolved name is y[6]
.