How fast can I build Rust?
I've been collecting some data on the fastest way to build the Rust compiler. This is primarily for Rust developers to optimise their workflow, but it might also be of general interest.
TL;DR: the fastest way to build Rust (on a computer with lots of cores) is with -j6 and RUSTFLAGS=-Ccodegen-units=10.
I tested using a commit from 24th November 2016. I was using the make build system (though I would expect the same results using Rustbuild). The test machine is a dedicated build machine - it has 12 physical cores, lots of RAM, and an SSD. It wasn't used for anything else during the benchmarking, and doesn't run a windowing system. It was running Ubuntu 16.10 (Linux). I only did one run per set of variables. That is not ideal, but where I repeated runs, they were fairly consistent - usually within a second or two and never more than 10 seconds apart. I've rounded all results to the nearest 10 seconds, and I believe that level of precision is about right for the experiment.
I varied the number of jobs (-jn) and the number of codegen units (RUSTFLAGS=-Ccodegen-units=n). The command line looked something like RUSTFLAGS=-Ccodegen-units=10 make -j6. I measured the time to do a normal build of the whole compiler and libraries (make), to build the stage1 compiler (make rustc-stage1; this is the minimal amount of work required to get a compiler for testing), and to build and bootstrap the compiler and run all tests (make && make check; I didn't run a simple make check because adding -jn to that causes many tests to be skipped, and setting codegen-units > 1 causes some tests to fail).
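As a rough illustration of the set of runs described above, the following sketch generates one command line per combination of jobs and codegen units. It is not the script used for the benchmarks; the helper names (build_command, benchmark_matrix) are hypothetical, and it only prints the commands rather than running or timing them.

```python
from itertools import product

# Values tested in this post; not every combination was actually measured.
JOBS = [1, 2, 4, 6, 8]
CODEGEN_UNITS = [1, 2, 4, 6, 8, 10, 12]


def build_command(jobs, codegen_units, target=""):
    """Return the shell command line for one benchmark configuration.

    `target` is an optional make target such as "rustc-stage1".
    """
    make = f"make -j{jobs}"
    if target:
        make += f" {target}"
    return f"RUSTFLAGS=-Ccodegen-units={codegen_units} {make}"


def benchmark_matrix(target=""):
    """Return the command lines for every (jobs, codegen units) pair."""
    return [build_command(j, cg, target) for j, cg in product(JOBS, CODEGEN_UNITS)]


if __name__ == "__main__":
    for command in benchmark_matrix():
        print(command)
```

Each printed line could be prefixed with the shell's time builtin to collect one measurement per configuration.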
The jobs number is the number of tasks make can run in parallel. These tasks are self-contained instances of the compiler, i.e., this is parallelism outside the compiler. The amount of parallelism is limited by dependencies between crates in the compiler. Since the crates in the compiler are rather large and there are a lot of dependencies, the benefit of using a large number of jobs is much smaller than in a typical C or C++ program (e.g., LLVM). Note, however, that there is no real drawback to using a larger number of jobs; there just won't be any benefit.
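To see why dependencies cap the benefit of extra jobs, here is a small sketch that simulates a greedy scheduler over a crate graph. The graph shape and the build times in minutes are made up for illustration; they are assumptions, not measurements of the real compiler crates.

```python
import heapq

# Hypothetical crate graph: name -> (build minutes, dependencies).
# The numbers are invented purely to illustrate the effect.
CRATES = {
    "core": (4, []),
    "std": (3, ["core"]),
    "syntax": (6, ["std"]),
    "rustc": (10, ["syntax"]),
    "rustc_trans": (5, ["rustc"]),
    "rustc_typeck": (4, ["rustc"]),
    "rustc_driver": (2, ["rustc_trans", "rustc_typeck"]),
}


def makespan(crates, jobs):
    """Total build time when at most `jobs` crates build at once.

    Assumes the graph is acyclic and jobs >= 1.
    """
    remaining = {name: set(deps) for name, (_, deps) in crates.items()}
    ready = sorted(name for name, deps in remaining.items() if not deps)
    running = []  # min-heap of (finish_time, crate name)
    now = 0
    done = 0
    while done < len(crates):
        # Start as many ready crates as there are free job slots.
        while ready and len(running) < jobs:
            name = ready.pop(0)
            heapq.heappush(running, (now + crates[name][0], name))
        # Advance the clock to the next crate that finishes.
        now, finished = heapq.heappop(running)
        done += 1
        # Crates whose last outstanding dependency just finished become ready.
        for name, deps in remaining.items():
            if finished in deps:
                deps.remove(finished)
                if not deps:
                    ready.append(name)
    return now


if __name__ == "__main__":
    for jobs in (1, 2, 8):
        print(jobs, makespan(CRATES, jobs))
```

In this toy graph only two crates can ever build side by side, so going from one job to two saves a few minutes and any further jobs save nothing: the longest dependency chain sets the floor.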
Codegen units introduce parallelism within the compiler. First, some background. Compilation can be roughly split into two: first, code is analysed (parsing, type checking, etc.), then object code is generated from the products of analysis. The Rust compiler uses LLVM for the code generation part. Roughly half the time running an optimised build is spent in each of analysis and code generation. Nearly all optimisation is performed in the code generation part.
The compilation unit in Rust is a crate; that is, the Rust compiler analyses and compiles a single crate at a time. By default, code generation also works at the same level of granularity. However, by specifying the number of codegen units, we tell the compiler that once analysis is complete, it should break the crate into smaller units and run LLVM code generation on each unit in parallel. That means we get parallelism inside the compiler, albeit only for about half of the work. There is a disadvantage, however: using multiple codegen units means the program will not be optimised as well as if a single unit were used. This is analogous to turning off LTO in a C program. For this reason, you should not use multiple codegen units when building production software.
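A back-of-the-envelope way to think about "parallelism for about half of the work" is Amdahl's law. The sketch below applies it with a parallel fraction of 0.5, taken from the rough half-and-half split described above. It is an idealised model, not something measured here: it assumes the codegen work divides evenly between units and ignores the cost of the less optimised output.

```python
def amdahl_speedup(parallel_fraction, units):
    """Ideal speedup when `parallel_fraction` of the work is spread over `units`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / units)


if __name__ == "__main__":
    # Assumption: about half of the compile time is code generation.
    for units in (1, 2, 4, 8, 12):
        print(units, round(amdahl_speedup(0.5, units), 2))
```

Under these assumptions the speedup for a single compiler invocation can never reach 2x, however many codegen units are used, and most of the gain arrives with the first few units.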
So when building the compiler, if we use many codegen units we might expect the compilation to go faster, but when we run the new compiler, it will be slower. Since we use the new compiler to build at least the libraries and sometimes another compiler, this could be an important factor in the total time.
If you're interested in this kind of thing, we keep track of compiler performance at perf.r-l.o (although only single-threaded builds). Nicholas Nethercote has recently written a couple of blog posts on running and optimising the compiler.
make
This experiment ran a simple make build. It builds two versions of the compiler - one using the last beta, and the second using the first.
| jobs | cg1 | cg2 | cg4 | cg6 | cg8 | cg10 | cg12 |
|---|---|---|---|---|---|---|---|
| -j1 | 48m50s | 39m40s | 31m30s | 29m50s | 29m20s | | |
| -j2 | 34m10s | 27m40s | 21m40s | 20m30s | 20m10s | 19m30s | 19m20s |
| -j4 | 28m10s | 23m00s | 17m50s | 16m50s | 16m40s | 16m00s | 16m00s |
| -j6 | 27m40s | 22m40s | 17m20s | 16m20s | 16m10s | 15m40s | 15m50s |
| -j8 | 27m40s | 22m30s | 17m20s | 16m30s | 16m30s | 15m40s | 15m40s |
| -j10 | 27m40s | | | | | | |
| -j12 | 27m40s | | | | | | |
| -j14 | 27m50s | | | | | | |
| -j16 | 27m50s | | | | | | |
In general, we get better results using more jobs and more codegen units. Looking at the number of jobs, there is no improvement after 6. For codegen units, the improvements quickly diminish, but there is some improvement right up to using 10 (for all jobs > 2, 12 codegen units gave essentially the same result as 10). It is possible that 9 or 11 codegen units would be slightly better (I only tested even numbers), but probably not by enough to be significant, given the precision of the experiment.
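To put the table in terms of speedup rather than wall-clock time, here is a small sketch that parses a few of the cells above and compares them with the one-job, one-codegen-unit baseline. The helper names are hypothetical, and only a handful of cells are transcribed.

```python
import re


def to_seconds(text):
    """Convert a duration such as '48m50s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"not a duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


# Selected cells from the make table, keyed by (jobs, codegen units).
MAKE_TIMES = {
    (1, 1): "48m50s",
    (6, 1): "27m40s",
    (1, 8): "29m20s",
    (6, 10): "15m40s",
}


def speedup(baseline, config):
    """How many times faster `config` is than `baseline`."""
    return to_seconds(MAKE_TIMES[baseline]) / to_seconds(MAKE_TIMES[config])


if __name__ == "__main__":
    for config in [(6, 1), (1, 8), (6, 10)]:
        print(config, round(speedup((1, 1), config), 2))
```

With these figures, jobs alone give roughly 1.8x, codegen units alone roughly 1.7x, and the two together roughly 3.1x.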
make rustc-stage1
This experiment ran make rustc-stage1. That builds a single compiler and the libraries necessary to use that compiler. It is the minimal amount of work necessary to test modifications to the compiler. It is significantly quicker than make.
| jobs | cg1 | cg2 | cg4 | cg6 | cg8 | cg10 | cg12 |
|---|---|---|---|---|---|---|---|
| -j1 | 15m10s | 12m10s | 9m40s | 9m10s | 9m10s | 8m50s | 8m50s |
| -j2 | 11m00s | 8m50s | 6m50s | 6m20s | 6m20s | 6m00s | 6m00s |
| -j4 | 9m00s | 7m30s | 5m40s | 5m20s | 5m20s | 5m10s | 5m00s |
| -j6 | 9m00s | 7m10s | 5m30s | 5m10s | 5m00s | 5m00s | 5m00s |
I only tested jobs up to 6, since if more jobs were not profitable in the previous experiment, there seems no way for them to be profitable here. It turned out that 6 jobs was only marginally better than 4 in this case, I assume because there are more dependency bottlenecks relative to a full make.
I expected more codegen units to be more effective here (since we're using the resulting compiler for less), but I was wrong. This may just be due to the precision of the test (and the relatively shorter total time), but for all numbers of jobs, 6 codegen units were nearly as good as more (within 20 seconds of the best time in every row). So, for this kind of build, six jobs and six codegen units is close to optimal; however, using ten codegen units (as for make) is not harmful.
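One way to make "as good as more" concrete is to ask, for each row of the table, which is the smallest number of codegen units that gets within some tolerance of that row's best time. The sketch below does this; the 20-second tolerance is my assumption (twice the 10-second rounding), not a threshold used in the experiment.

```python
CODEGEN_UNITS = [1, 2, 4, 6, 8, 10, 12]

# Times in seconds, transcribed from the rustc-stage1 table, one row per -jN.
STAGE1_SECONDS = {
    1: [910, 730, 580, 550, 550, 530, 530],
    2: [660, 530, 410, 380, 380, 360, 360],
    4: [540, 450, 340, 320, 320, 310, 300],
    6: [540, 430, 330, 310, 300, 300, 300],
}


def smallest_good_enough(times, tolerance):
    """Smallest codegen-unit count whose time is within `tolerance` seconds of the best."""
    best = min(times)
    for units, seconds in zip(CODEGEN_UNITS, times):
        if seconds - best <= tolerance:
            return units


if __name__ == "__main__":
    for jobs, times in STAGE1_SECONDS.items():
        print(f"-j{jobs}", smallest_good_enough(times, tolerance=20))
```

With a 20-second tolerance every row settles on six codegen units; with a tolerance of zero the answer moves out to between eight and twelve, depending on the row.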
make -jn && make check
This experiment is the way to build all the compilers and libraries and run all tests. I measured the two parts separately. As you might expect, the first part corresponded exactly with the results of the make experiment. The second part (make check) took a fairly consistent amount of time; it is independent of the number of jobs since the test infrastructure does its own parallelisation. I would expect compilation of tests to be slower with a compiler compiled with a larger number of codegen units, but the effect was small: for one or two codegen units, make check took 12m40s, and for four to ten it took 12m50s, a marginal difference. That means that the optimal build used six jobs and ten codegen units (as for make), giving a total time of 28m30s (cf. 61m40s for one job and one codegen unit).
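For anyone who wants to check the totals, this sketch adds the build and test times for the two configurations mentioned above, using the rounded figures from this post. The helper names are hypothetical.

```python
def to_seconds(minutes, seconds):
    """Convert minutes and seconds to a number of seconds."""
    return minutes * 60 + seconds


def fmt(total_seconds):
    """Format a number of seconds in the 28m30s style used in this post."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}m{seconds:02d}s"


# Six jobs, ten codegen units: make (15m40s) plus make check (12m50s).
optimal = to_seconds(15, 40) + to_seconds(12, 50)

# One job, one codegen unit: make (48m50s) plus make check (12m40s).
baseline = to_seconds(48, 50) + to_seconds(12, 40)

if __name__ == "__main__":
    print("optimal:", fmt(optimal))
    print("baseline:", fmt(baseline))
    print("ratio:", round(baseline / optimal, 2))
```

The rounded baseline entries add up to 61m30s, which is within the 10-second rounding of the 61m40s quoted above; either way the optimal configuration is a little over twice as fast.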