compiler (incremental)
Enable the compiler to cache incremental workproducts.
The goal of incremental compilation is, naturally, to improve build times when making small edits. Any reader who has never felt the need for such a feature is strongly encouraged to attempt hacking on the compiler or servo sometime (naturally, all readers are so encouraged, regardless of their opinion on the need for incremental compilation).
The basic usage will be that one enables incremental compilation using
a compiler flag like -C incremental-compilation=TMPDIR
. The TMPDIR
directory is intended to be an empty directory that the compiler can
use to store intermediate by-products; the compiler will automatically
"GC" this directory, deleting older files that are no longer relevant
and creating new ones.
The high-level idea is that we will track the following intermediate workproducts for every function (and, indeed, for other kinds of items as well, but functions are easiest to describe):
.o
files that must be relinked (anyone who has worked
in a substantial C++ project can attest, however, that linking can
take a non-trivial amount of time).Of course, the key to any incremental design is to determine what must be changed. This can be encoded in a dependency graph. This graph connects the various bits of the HIR to the external products (signatures, MIR, and object files). It is of the utmost importance that this dependency graph is complete: if edges are missing, the result will be obscure errors where changes are not fully propagated, yielding inexplicable behavior at runtime. This RFC proposes an automatic scheme based on encapsulation.
Although rustc does not yet support compiler plugins through a stable interface, we have long planned to allow for custom lints, syntax extensions, and other sorts of plugins. It would be nice therefore to be able to accommodate such plugins in the design, so that their inputs can be tracked and accounted for as well.
It is important to clarify, though, that this design does not attempt to enable full optimizing for incremental compilation; indeed the two are somewhat at odds with one another, as full optimization may perform inlining and inter-function analysis, which can cause small edits in one function to affect the generated code of another. This situation is further exacerbated by the fact that LLVM does not provide any way to track these sorts of dependencies (e.g., one cannot even determine what inlining took place, though @dotdash suggested a clever trick of using llvm lifetime hints). Strategies for handling this are discussed in the Optimization section below.
We begin with a high-level execution plan, followed by sections that explore aspects of the plan in more detail. The high-level summary includes links to each of the other sections.
Regardless of whether it is invoked in incremental compilation mode or not, the compiler will always parse and macro expand the entire crate, resulting in a HIR tree. Once we have a complete HIR tree, and if we are invoked in incremental compilation mode, the compiler will then try to determine which parts of the crate have changed since the last execution. For each item, we compute a (mostly) stable id based primarily on the item's name and containing module. We then compute a hash of its contents and compare that hash against the hash that the item had in the compilation (if any).
Once we know which items have changed, we consult a dependency graph to tell us which artifacts are still usable. These artifacts can take the form of serializing MIR graphs, LLVM IR, compiled object code, and so forth. The dependency graph tells us which bits of AST contributed to each artifact. It is constructed by dynamically monitoring what the compiler accesses during execution.
Finally, we can begin execution. The compiler is currently structured in a series of passes, each of which walks the entire AST. We do not need to change this structure to enable incremental compilation. Instead, we continue to do every pass as normal, but when we come to an item for which we have a pre-existing artifact (for example, if we are type-checking a fn that has not changed since the last execution), we can simply skip over that fn instead. Similar strategies can be used to enable lazy or parallel compilation at later times. (Eventually, though, it might be nice to restructure the compiler so that it operates in more of a demand driven style, rather than a series of sweeping passes.)
When we come to the final LLVM stages, we must separate the functions into distinct "codegen units" for the purpose of LLVM code generation. This will build on the existing "codegen-units" used for parallel code generation. LLVM may perform inlining or interprocedural analysis within a unit, but not across units, which limits the amount of reoptimization needed when one of those functions changes.
Finally, the RFC closes with a discussion of testing strategies we can use to help avoid bugs due to incremental compilation.
One important question is how to stage the incremental compilation work. That is, it'd be nice to start seeing some benefit as soon as possible. One possible plan is as follows:
The most notable staging point here is that we can begin by just saving object code, and then gradually add more artifacts that get saved. The effect of saving fewer things (such as only saving object code) will simply be to make incremental compilation somewhat less effective, since we will be forced to re-type-check and re-trans functions where we might have gotten away with only generating new object code. However, this is expected to be be a second order effect overall, particularly since LLVM optimization time can be a very large portion of compilation.
In order to correlate artifacts between compilations, we need some
stable way to name items across compilations (and across crates). The
compiler currently uses something called a DefId
to identify each
item. However, these ids today are based on a node-id, which is just
an index into the HIR and hence will change whenever anything
preceding it in the HIR changes. We need to make the DefId
for an
item independent of changes to other items.
Conceptually, the idea is to change DefId
into the pair of a crate
and a path:
DEF_ID = (CRATE, PATH)
CRATE = <crate identifier>
PATH = PATH_ELEM | PATH :: PATH_ELEM
PATH_ELEM = (PATH_ELEM_DATA, <disambiguating integer>)
PATH_ELEM_DATA = Crate(ID)
| Mod(ID)
| Item(ID)
| TypeParameter(ID)
| LifetimeParameter(ID)
| Member(ID)
| Impl
| ...
However, rather than actually store the path in the compiler, we will
instead intern the paths in the CStore
, and the DefId
will simply
store an integer. So effectively the node
field of DefId
, which
currently indexes into the HIR of the appropriate crate, becomes an
index into the crate's list of paths.
For the most part, these paths match up with user's intuitions. So a
struct Foo
declared in a module bar
would just have a path like
bar::Foo
. However, the paths are also able to express things for
which there is no syntax, such as an item declared within a function
body.
For the most part, paths should naturally be unique. However, there are some cases where a single parent may have multiple children with the same path. One case would be erroneous programs, where there are (e.g.) two structs declared with the same name in the same module. Another is that some items, such as impls, do not have a name, and hence we cannot easily distinguish them. Finally, it is possible to declare multiple functions with the same name within function bodies:
fn foo() {
{
fn bar() { }
}
{
fn bar() { }
}
}
All of these cases are handled by a simple disambiguation mechanism. The idea is that we will assign a path to each item as we traverse the HIR. If we find that a single parent has two children with the same name, such as two impls, then we simply assign them unique integers in the order that they appear in the program text. For example, the following program would use the paths shown (I've elided the disambiguating integer except where it is relevant):
mod foo { // Path: <root>::foo
pub struct Type { } // Path: <root>::foo::Type
impl Type { // Path: <root>::foo::(<impl>,0)
fn bar() {..} // Path: <root>::foo::(<impl>,0)::bar
}
impl Type { } // Path: <root>::foo::(<impl>,1)
}
Note that the impls were arbitrarily assigned indices based on the order in which they appear. This does mean that reordering impls may cause spurious recompilations. We can try to mitigate this somewhat by making the path entry for an impl include some sort of hash for its header or its contents, but that will be something we can add later.
Implementation note: Refactoring DefIds in this way is a large task. I've made several attempts at doing it, but my latest branch appears to be working out (it is not yet complete). As a side benefit, I've uncovered a few fishy cases where we using the node id from external crates to index into the local crate's HIR map, which is certainly incorrect. --nmatsakis
## Identifying and tracking dependenciesNaturally any form of incremental compilation requires a detailed
understanding of how each work item is dependent on other work items.
This is most readily visualized as a dependency graph; the
finer-grained the nodes and edges in this graph, the better. For example,
consider a function foo
that calls a function bar
:
fn foo() {
...
bar();
...
}
Now imagine that the body (but not the external signature) of bar
changes. Do we need to type-check foo
again? Of course not: foo
only cares about the signature of bar
, not its body. For the
compiler to understand this, though, we'll need to create distinct
graph nodes for the signature and body of each function.
(Note that our policy of making "external signatures" fully explicit
is helpful here. If we supported, e.g., return type inference, than it
would be harder to know whether a change to bar
means foo
must be
recompiled.)
This section gives a kind of "first draft" of the set of graph nodes/edges that we will use. It is expected that the full set of nodes/edges will evolve in the course of implementation (and of course over time as well). In particular, some parts of the graph as presented here are intentionally quite coarse and we envision that the graph will be gradually more fine-grained.
The nodes fall into the following categories:
SIG(X)
would represent the signature of some fn item
X
that the user wrote (i.e., the names of the types,
where-clauses, etc)BODY(X)
would be the body of some fn item X
ty
representation of a fn signature. These also frequently correspond
to a single entry in one of the various compiler hashmaps. These are
the outputs (and intermediate steps) of the compilation process
ITEM_TYPE(X)
-- entry in the obscurely named tcache
table
for X
(what is returned by the rather-more-clearly-named
lookup_item_type
)PREDICATES(X)
-- entry in the predicates
tableADT(X)
-- ADT node for a struct (this may want to be more
fine-grained, particularly to cover the ivars)MIR(X)
-- the MIR for the item X
LLVM(X)
-- the LLVM IR for the item X
OBJECT(X)
-- the object code generated by compiling some item
X
; the precise way that this is saved will depend on whether
we use .o
files that are linked together, or if we attempt to
amend the shared library in place.COLLECT(X)
-- the collect code executing on item X
WFCHECK(X)
-- the wfcheck code executing on item X
BORROWCK(X)
-- the borrowck code executing on item X
To see how this all fits together, let's consider the graph for a simple example:
fn foo() {
bar();
}
fn bar() {
}
This might generate a graph like the following (the following sections
will describe how this graph is constructed). Note that this is not a
complete graph, it only shows the data needed to produce MIR(foo)
.
BODY(foo) ----------------------------> TYPECK(foo) --> MIR(foo)
^ ^ ^ ^ |
SIG(foo) ----> COLLECT(foo) | | | | |
| | | | | v
+--> ITEM_TYPE(foo) -----+ | | | LLVM(foo)
+--> PREDICATES(foo) ------+ | | |
| | |
SIG(bar) ----> COLLECT(bar) | | v
| | | OBJECT(foo)
+--> ITEM_TYPE(bar) ---------+ |
+--> PREDICATES(bar) ----------+
As you can see, this graph indicates that if the signature of either
function changes, we will need to rebuild the MIR for foo
. But there
is no path from the body of bar
to the MIR for foo, so changes there
need not trigger a rebuild (we are assuming here that bar
is not
inlined into foo
; see the section on optimizations
for more details on how to handle those sorts of dependencies).
It is very important the dependency graph contain all edges. If any edges are missing, it will mean that we will get inconsistent builds, where something should have been rebuilt what was not. Hand-coding a graph like this, therefore, is probably not the best choice -- we might get it right at first, but it's easy to for such a setup to fall out of sync as the code is edited. (For example, if a new table is added, or a function starts reading data that it didn't before.)
Another consideration is compiler plugins. At present, of course, we don't have a stable API for such plugins, but eventually we'd like to support a rich family of them, and they may want to participate in the incremental compilation system as well. So we need to have an idea of what data a plugin accesses and modifies, and for what purpose.
The basic strategy then is to build the graph dynamically with an API that looks something like this:
push_procedure(procedure_node)
pop_procedure(procedure_node)
read_from(data_node)
write_to(data_node)
Here, the procedure_node
arguments are one of the procedure labels
above (like COLLECT(X)
), and the data_node
arguments are either
HIR or IR nodes (e.g., SIG(X)
, MIR(X)
).
The idea is that we maintain for each thread a stack of active
procedures. When push_procedure
is called, a new entry is pushed
onto that stack, and when pop_procedure
is called, an entry is
popped. When read_from(D)
is called, we add an edge from D
to the
top of the stack (it is an error if the stack is empty). Similarly,
write_to(D)
adds an edge from the top of the stack to D
.
Naturally it is easy to misuse the above methods: one might forget to push/pop a procedure at the right time, or fail to invoke read/write. There are a number of refactorings we can do on the compiler to make this scheme more robust.
Most of the compiler passes operate an item at a time. Nonetheless, they are largely encoded using the standard visitor, which walks all HIR nodes. We can refactor most of them to instead use an outer visitor, which walks items, and an inner visitor, which walks a particular item. (Many passes, such as borrowck, already work this way.) This outer visitor will be parameterized with the label for the pass, and will automatically push/pop procedure nodes as appropriate. This means that as long as you base your pass on the generic framework, you don't really have to worry.
In general, while I described the general case of a stack of procedure nodes, it may be desirable to try and maintain the invariant that there is only ever one procedure node on the stack at a time. Otherwise, failing to push/pop a procedure at the right time could result in edges being added to the wrong procedure. It is likely possible to refactor things to maintain this invariant, but that has to be determined as we go.
Adding edges to the IR nodes that represent the compiler's
intermediate byproducts can be done by leveraging privacy. The idea is
to enforce the use of accessors to the maps and so forth, rather than
allowing direct access. These accessors will call the read_from
and
write_to
methods as appropriate to add edges to/from the current
active procedure.
HIR nodes are a bit trickier to encapsulate. After all, the HIR map itself gives access to the root of the tree, which in turn gives access to everything else -- and encapsulation is harder to enforce here.
Some experimentation will be required here, but the rough plan is to:
DefId
.This will integrate with the "default visitor" described under procedure nodes. This visitor can hand off just an opaque id for each item, requiring the pass itself to go through the map to fetch the actual HIR, thus triggering a read edge (we might also bake this behavior into the visitor for convenience).
Once we've built the graph, we have to persist it, along with some associated information. The idea is that the compiler, when invoked, will be supplied with a directory. It will store temporary files in there. We could also consider extending the design to support use by multiple simultaneous compiler invocations, which could mean incremental compilation results even across branches, much like ccache (but this may require tweaks to the GC strategy).
Once we get to the point of persisting the graph, we don't need the full details of the graph. The process nodes, in particular, can be removed. They exist only to create links between the other nodes. To remove them, we first compute the transitive reachability relationship and then drop the process nodes out of the graph, leaving only the HIR nodes (inputs) and IR nodes (output). (In fact, we only care about the IR nodes that we intend to persist, which may be only a subset of the IR nodes, so we can drop those that we do not plan to persist.)
For each HIR node, we will hash the HIR and store that alongside the
node. This indicates precisely the state of the node at the time.
Note that we only need to hash the HIR itself; contextual information
(like use
statements) that are needed to interpret the text will be
part of a separate HIR node, and there should be edges from that node
to the relevant compiler data structures (such as the name resolution
tables).
For each IR node, we will serialize the relevant information from the table and store it. The following data will need to be serialized:
This list was gathered primarily by spelunking through the compiler. It is probably somewhat incomplete. The appendix below lists an exhaustive exploration.
The general procedure when the compiler starts up in incremental mode will be to parse and macro expand the input, create the corresponding set of HIR nodes, and compute their hashes. We can then load the previous dependency graph and reconcile it against the current state:
We then delete the transitive closure of nodes queued for deletion (that is, all the HIR nodes that have changed or been removed, and all nodes reachable from those HIR nodes). As part of the deletion process, we remove whatever on disk artifact that may have existed.
There are times when the precise span of an item is a significant part of its metadata. For example, debuginfo needs to identify line numbers and so forth. However, editing one fn will affect the line numbers for all subsequent fns in the same file, and it'd be best if we can avoid recompiling all of them. Our plan is to phase span support in incrementally:
There is an inherent tension between incremental compilation and full optimization. Full optimization may perform inlining and inter-function analysis, which can cause small edits in one function to affect the generated code of another. This situation is further exacerbated by the fact that LLVM does not provide any means to track when one function was inlined into another, or when some sort of interprocedural analysis took place (to the best of our knowledge, at least).
This RFC proposes a simple mechanism for permitting aggressive optimization, such as inlining, while also supporting reasonable incremental compilation. The idea is to create codegen units that compartmentalize closely related functions (for example, on a module boundary). This means that those compartmentalized functions may analyze one another, while treating functions from other compartments as opaque entities. This means that when a function in compartment X changes, we know that functions from other compartments are unaffected and their object code can be reused. Moreover, while the other functions in compartment X must be re-optimized, we can still reuse the existing LLVM IR. (These are the same codegen units as we use for parallel codegen, but setup differently.)
In terms of the dependency graph, we would create one IR node representing the codegen unit. This would have the object code as an associated artifact. We would also have edges from each component of the codegen unit. As today, generic or inlined functions would not belong to any codegen unit, but rather would be instantiated anew into each codegen unit in which they are (transitively) referenced.
There is an analogy here with C++, which naturally faces the same problems. In that setting, templates and inlineable functions are often placed into header files. Editing those header files naturally triggers more recompilation. The compiler could employ a similar strategy by replicating things that look like good candidates for inlining into each module; call graphs and profiling information may be a good input for such heuristics.
If we are not careful, incremental compilation has the potential to produce an infinite stream of irreproducible bug reports, so it's worth considering how we can best test this code.
The first and most obvious piece of infrastructure is something for reliable regression testing. The plan is simply to have a series of sources and patches. The source will have each patch applied in sequence, rebuilding (incrementally) at each point. We can then check that (a) we only rebuilt what we expected to rebuild and (b) compare the result with the result of a fresh build from scratch. This allows us to build up tests for specific scenarios or bug reports, but doesn't help with finding bugs in the first place.
The next step is to search across crates.io for consecutive
releases. For a given package, we can checkout version X.Y
and then
version X.(Y+1)
and check that incrementally building from one to
the other is successful and that all tests still yield the same
results as before (pass or fail).
A similar search can be performed across git history, where we identify pairs of consecutive commits. This has the advantage of being more fine-grained, but the disadvantage of being a MUCH larger search space.
The problem with replaying crates.io versions and even git commits is that they are probably much larger changes than the typical recompile. Another option is to use fuzzing, making "innocuous" changes that should trigger a recompile. Fuzzing is made easier here because we have an oracle -- that is, we can check that the results of recompiling incrementally match the results of compiling from scratch. It's also not necessary that the edits are valid Rust code, though we should test that too -- in particular, we want to test that the proper errors are reported when code is invalid, as well. @nrc also suggested a clever hybrid, where we use git commits as a source for the fuzzer's edits, gradually building up the commit.
The primary drawback is that incremental compilation may introduce a new vector for bugs. The design mitigates this concern by attempting to make the construction of the dependency graph as automated as possible. We also describe automated testing strategies.
This design is an evolution from RFC 594.
None.