Macros and name resolution
This should be the last blog post on the future of macros. The next step is to write some RFCs!
I've talked previously about how macros should be nameable just like other items, rather than having their own system of naming. That means we must be able to perform name resolution at the same time as macro expansion, which is not currently possible. In this blog post I want to spell out how we might make that happen and how it fits with the sets of scopes hygiene algorithm. Niko Matsakis has also been thinking about name resolution a lot; you can find his prototype algorithm in this repo. This blog post basically describes that algorithm, tied in with the sets of scopes hygiene algorithm.
Name resolution is the phase in the compiler where we match uses of a name with definitions. For example, in
```rust
fn foo() { ... }
fn bar() { foo(); }
```
we match up the use of foo in the body of bar with the definition of foo above.
This gets a bit more complicated because of shadowing, e.g., if we change bar:
```rust
fn bar() {
    fn foo() { ... }
    foo();
}
```
Now the use of foo refers to the nested foo function, not the one outside bar.
Imports also complicate resolution: to resolve a name we might need to resolve an import (which might be a glob import). The compiler also enforces rules about imports during name resolution; for example, we can't import the same name into a namespace twice.
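As a concrete illustration (the module names here are made up), importing two different items under the same name into one namespace is rejected:

```rust
mod a { pub fn foo() {} }
mod b { pub fn foo() {} }

// Two different definitions called `foo` imported into the same (value)
// namespace: the compiler rejects the second import.
use self::a::foo;
use self::b::foo;

fn main() {}
```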
Name resolution today
Name resolution takes place fairly early on in the compiler, just after the AST has been lowered to the HIR. There are three passes over the AST: the first builds the 'reduced graph', the second resolves names, and the third checks for unused names. The 'reduced graph' is a record of all definitions and imports in a program (without resolving anything); I believe it is necessary to allow forward references.
Goals
We would like to make it possible to resolve names during macro expansion so that macros can be modularised in the same way as other items. We would also like to relax the rules on duplicate imports a little: it should be allowed to import the same name into a scope twice if both imports refer to the same definition, or if that name is never used.
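For example, something like the following (again with made-up module names) would be accepted under the relaxed rule, since both imports name the same function:

```rust
mod a { pub fn foo() {} }
mod b { pub use super::a::foo; } // re-export of the same function

// Both imports refer to the same definition (a::foo), so under the
// relaxed rule described above this pair would be allowed.
use self::a::foo;
use self::b::foo;

fn main() { foo(); }
```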
There are some issues with this idea. In particular, a macro expansion can introduce new names. With no restrictions, a name could refer to different definitions at different points during expansion, or even be legal at some point and become illegal later due to an introduced duplicate name. As well as being confusing, this means the order of macro expansion becomes important, whereas we would like expansion order to be immaterial.
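To make that concrete, here is a sketch of the kind of situation in question (the macro and module names are invented, and exactly how the compiler treats it depends on the hygiene and duplicate-name rules in force):

```rust
mod helpers {
    pub fn log(msg: &str) { println!("helpers::log: {}", msg); }
}

// A macro whose expansion adds a new item named `log` to the enclosing module.
macro_rules! define_log {
    () => {
        fn log(msg: &str) { println!("generated log: {}", msg); }
    };
}

use self::helpers::log;

// Before this invocation is expanded, `log("hi")` below can only mean
// `helpers::log`. After expansion there is a second candidate for `log`
// in this module, so the earlier resolution is no longer obviously the
// right one; this is the expansion-order dependence we want to rule out.
define_log!();

fn main() {
    log("hi");
}
```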
Using sets of scopes for name resolution
Previously, I outlined how we could use the sets of scopes hygiene algorithm in Rust. Sets of scopes can also be used to do name resolution; in fact, that is how the algorithm is intended to be used. The traditional name resolution approach is, essentially, walking up a tree of scopes to find a binding. With sets of scopes, all bindings are treated as being in a global 'namespace'. A name lookup results in a set of bindings; we then compare the sets of scopes for each binding with the set of scopes for the name use and try to find an unambiguously best binding.
The plan
The plan is basically to split name resolution into two stages. The first stage will happen at the same time as macro expansion and will resolve imports to define (conceptually) a mapping from names in scope to definitions. The second stage is looking up a definition in this mapping for any given use of a name. This could happen at any time; to start with we'll probably leave it where it is. Later it could be merged with the AST to HIR lowering or performed lazily.
When we need to look up a macro during expansion, we use the bindings as they are currently known. We must ensure that the result of each such lookup remains unchanged throughout expansion.
In addition, the method of actually looking up a name changes. Rather than walking through nested scopes to find a name, we use the sets of scopes from the hygiene algorithm. We pretty much have a global table of names, and we use the name and its sets of scopes to look up the definition.
Details
Currently, parsing and macro expansion happen at the same time. With this proposal, we add import resolution to that mix too. The output of libsyntax is then an expanded AST with sets of scopes attached to each identifier, plus tables of name bindings which map a name and a set of scopes to a definition.
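A minimal sketch of what such a binding table might look like (these types are hypothetical stand-ins, not the compiler's own):

```rust
use std::collections::{BTreeSet, HashMap};

// Hypothetical stand-ins for the real compiler types.
type Scope = u32;
type ScopeSet = BTreeSet<Scope>;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct DefId(u32);

// One binding: a definition visible under a particular set of scopes.
struct Binding {
    scopes: ScopeSet,
    def: DefId,
}

// The global binding table produced alongside the expanded AST:
// literal name -> all bindings recorded for that name.
#[derive(Default)]
struct BindingTable {
    bindings: HashMap<String, Vec<Binding>>,
}

impl BindingTable {
    fn add(&mut self, name: &str, scopes: ScopeSet, def: DefId) {
        self.bindings
            .entry(name.to_string())
            .or_default()
            .push(Binding { scopes, def });
    }
}
```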
We rely on a monotonicity property in macro expansion - once an item exists in a certain place, it will always exist in that place. It will never disappear and never change. Note that for the purposes of this property, I do not consider code annotated with a macro to exist until it has been fully expanded.
A consequence of this is that if the compiler resolves a name, then does some expansion and resolves it again, the first resolution will still be valid. However, a better resolution may appear, so the resolution of a name may change as we expand. It can also change from a good resolution to an ambiguity. It is also possible to change from good to ambiguous to good again. There is even an edge case where we go from good to ambiguous to the same good resolution (but via a different route).
In order to keep macro expansion comprehensible to programmers, we must enforce that every macro use resolves to the same binding at the end of expansion as it did when it was first resolved.
We start by parsing as much of the program as we can (like today, we don't parse macros). We then walk the AST top-down computing sets of scopes. We add these to every identifier. When we find an item which binds a name, we add it to the binding table. When entering a module we also add self and super bindings. When we find imports and macro uses we add these to a work list. When we find a glob import, we have to record a 'back link', so that when a public name is added for the supplying module, we can add it for the importing module. At this stage a glob would not add any names for a module since we don't resolve the path prefix. For each macro definition, we add the pending set of scopes.
We then loop over the work list trying to resolve names. If a name has exactly one best resolution then we use it (and record it on a list of resolved names). If there are zero resolutions, or more than one, then we put the name back on the work list. When we reach a fixed point, i.e., the work list no longer changes, then we are done. If the work list is empty, then expansion/import resolution succeeded; otherwise there are unresolved or ambiguous names and we have failed.
If the name we are resolving is for a non-glob import, then we just record it in the binding table. If it is a glob import, we add bindings for every public name currently known.
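In code, the overall shape of that work-list loop might look something like the following sketch (all of the types and helpers here are hypothetical placeholders, not the real implementation):

```rust
// Hypothetical work-list items gathered during the initial top-down walk.
enum WorkItem {
    Import { glob: bool /* plus path, importing module, etc. */ },
    MacroUse { /* path, arguments, pending scopes, etc. */ },
}

enum Outcome {
    Resolved,   // exactly one best binding was found and recorded
    Unresolved, // no binding yet, or an ambiguity that may still resolve
}

// Placeholder for the per-item work: look the name up in the binding table;
// for a non-glob import record the single binding, for a glob add a binding
// for every public name currently known (plus a back link for names added
// later), and for a macro use expand it and push any new imports and macro
// uses onto the work list.
fn try_resolve(_item: &WorkItem) -> Outcome {
    Outcome::Unresolved
}

fn expand_and_resolve(mut work_list: Vec<WorkItem>) -> Result<(), Vec<WorkItem>> {
    loop {
        let mut progress = false;
        let mut still_pending = Vec::new();

        for item in work_list {
            match try_resolve(&item) {
                Outcome::Resolved => progress = true,
                Outcome::Unresolved => still_pending.push(item),
            }
        }
        work_list = still_pending;

        // Fixed point: nothing changed on this pass, so stop.
        if !progress {
            break;
        }
    }

    if work_list.is_empty() {
        Ok(()) // expansion and import resolution succeeded
    } else {
        Err(work_list) // unresolved or ambiguous names remain
    }
}
```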
If the resolved name is for a macro, then we expand that macro by parsing the arguments, pattern matching, and doing hygienic expansion. At the end of expansion, we apply the pending scopes and compute scopes due to the expanded macro. This may include applying more scopes to identifiers outside the expansion. We add new names to the binding table and push any new macro uses and imports to the work list to resolve.
If we add public names for a module which has back links, we must follow them and add these names to the importing module. When following these back links, we check for cycles, signalling an error if one is found.
If we succeed, then after running the fixed-point algorithm we check our record of name resolutions: we re-resolve each name and check that we get the same result. We also check for unused macros at this point.
The second part of name resolution happens later, when we resolve names of types and values. We could hopefully merge this with the lowering from AST to HIR. After resolving all names, we can check for unused imports and names.
Resolving names
To resolve a name, we look up the 'literal' name in the binding table. That gives us a list of bindings, each with its set of scopes. We discard any binding whose set of scopes is not a subset of the scopes on the name, then order the remainder by the subset relation and pick the largest (largest according to the subset ordering, not the set with the most elements). If there is no largest set, then name resolution is ambiguous. If no binding survives the subset check, then the name is unbound. Otherwise, we get the definition we're looking for.
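Here is a sketch of that lookup in code, reusing the same kind of hypothetical types as above (none of these names are the compiler's own):

```rust
use std::collections::{BTreeSet, HashMap};

type Scope = u32;
type ScopeSet = BTreeSet<Scope>;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct DefId(u32);

struct Binding {
    scopes: ScopeSet,
    def: DefId,
}

enum Resolution {
    Unbound,
    Ambiguous,
    Def(DefId),
}

// Resolve `name`, used with scope set `use_scopes`, against a global table
// of bindings (literal name -> bindings).
fn resolve_name(
    table: &HashMap<String, Vec<Binding>>,
    name: &str,
    use_scopes: &ScopeSet,
) -> Resolution {
    // Keep only bindings whose scope set is a subset of the use's scopes.
    let candidates: Vec<&Binding> = table
        .get(name)
        .map(|bs| bs.iter().filter(|b| b.scopes.is_subset(use_scopes)).collect())
        .unwrap_or_default();

    if candidates.is_empty() {
        return Resolution::Unbound;
    }

    // A 'largest' candidate is one whose scope set is a superset of every
    // other candidate's (largest in the subset ordering, not by element count).
    let largest: Vec<&Binding> = candidates
        .iter()
        .copied()
        .filter(|c| candidates.iter().all(|o| o.scopes.is_subset(&c.scopes)))
        .collect();

    match largest.as_slice() {
        [] => Resolution::Ambiguous, // incomparable scope sets, no best binding
        [best] => Resolution::Def(best.def),
        // Several bindings tie for largest: fine if they all name the same
        // definition (e.g. the same item imported twice), ambiguous otherwise.
        many => {
            let def = many[0].def;
            if many.iter().all(|b| b.def == def) {
                Resolution::Def(def)
            } else {
                Resolution::Ambiguous
            }
        }
    }
}
```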
To resolve a path, we look up the first segment like a lone name. We then look up each subsequent segment using the scope defined by the previous segment. Furthermore, in all lookups apart from the initial segment, we ignore outside edge scopes. For example, consider:
```rust
// scope 0
mod foo { // scope 1
    fn bar() {} // scope 2
}
fn main() { // scope 3
    use foo::bar;
}
```
The use item has the set of scopes {0, 3}; the foo name is thus bound by the module foo in scope {0}. We then look up bar with scopes {1} (the scope defined by foo) and find the bar function. By ignoring outside edge scopes, this will be the case even if bar is introduced by a macro (if the macro is expanded in the right way).
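A sketch of path resolution under this scheme might look like the following, where resolve_name stands in for the subset-based lookup described above and scope_of is a hypothetical helper returning the scope a module definition introduces:

```rust
use std::collections::BTreeSet;

type ScopeSet = BTreeSet<u32>;

#[derive(Clone, Copy)]
struct DefId(u32);

// Placeholder for the subset-based name lookup described above.
fn resolve_name(_name: &str, _scopes: &ScopeSet) -> Option<DefId> {
    None
}

// Placeholder: the single scope that a module definition introduces.
fn scope_of(_module: DefId) -> ScopeSet {
    ScopeSet::new()
}

// Resolve a path such as foo::bar, used with scope set `use_scopes`.
// Assumes a non-empty path.
fn resolve_path(segments: &[&str], use_scopes: &ScopeSet) -> Option<DefId> {
    let mut def = resolve_name(segments[0], use_scopes)?;
    for &seg in &segments[1..] {
        // Later segments are looked up only in the scope defined by the
        // previous segment; outside edge scopes would be ignored here.
        def = resolve_name(seg, &scope_of(def))?;
    }
    Some(def)
}
```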
Names which are looked up based on type are first looked up using the type; sets of scopes are then compared when checking the names. E.g., to check a field lookup e.f, we find the type of e, then get all fields on the struct with that type. We then try to find the one called f with the largest subset of the scopes found on f in e.f. The same applies for enum variants, methods, etc.
To be honest, I'm not sure if type based lookups should be hygienic. I can't think of strong use cases. However, the only consistent position seems to be hygiene everywhere or nowhere, including items (which sometimes causes problems in current Rust).
Data structures
The output of expansion is now more than just the AST. Each identifier in the AST should be annotated with its sets of scopes info. We pass the complete binding table to the compiler. Exported macros are still part of the crate and come with sets of pending scopes. I don't think we need any information about imports.
When a crate is packed into metadata, we must also include the binding table. We must include private entries due to macros that the crate might export. We don't need data for function bodies though. For functions which are serialised for inlining/monomorphisation, we should include local data (although it's probably better to serialise the HIR or MIR, in which case this should be unnecessary). We need pending scopes for any exported macros.
Due to the nature of functions, I think we can apply an optimisation of having local binding tables for function bodies. That makes preserving and discarding the right data easier, as well as making lookups faster because there won't be a million global entries for x. Any binding with a function's scope is put into that function's local table. Other than nested functions, I don't think an item can have scopes from multiple functions. I don't think there is a good reason to extend this to nested functions.
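A minimal sketch of that split, again using made-up types:

```rust
use std::collections::{BTreeSet, HashMap};

// Hypothetical types, as in the earlier sketches.
type ScopeSet = BTreeSet<u32>;
type BindingTable = HashMap<String, Vec<(ScopeSet, /* def id */ u32)>>;

#[derive(PartialEq, Eq, Hash)]
struct FnId(u32);

// Bindings whose scopes belong to a single function body go into that
// function's local table; everything else lives in the global table.
struct CrateBindings {
    global: BindingTable,
    per_fn: HashMap<FnId, BindingTable>,
}
```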
Questions
- Do we ever need to find an AST node from a scope? I hope not.
- How do we represent a definition in a binding? An id, or a reference to an AST node?
- See above, hygiene for 'type-based' lookups.