And these are just "ar archives" (the ar stands for "archive" already; it's a bit of an "ATM machine" situation), at least on Linux here:
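The ar format isn't Rust-specific at all: any files can go into an archive. As a quick sketch (assuming binutils' `ar` is installed, with made-up file names):

```shell
# Any files can go into an ar archive, not just object files.
# (assumes binutils' ar is available; file names are made up)
echo "hello" > a.txt
echo "world" > b.txt
ar rcs demo.a a.txt b.txt   # r: insert members, c: create, s: write an index
ar t demo.a                 # t: list the archive's members
```

The `.rlib` files cargo produces are this same format, just filled with object files and rustc metadata.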
And these, in turn, contain symbols: variables, functions, etc., and their associated code, already compiled, and ready to be linked:
The notable difference here is that rustc is called once per crate. The object files are not "per source file", they're per "Rust codegen unit" (RCGU).
So:

- cargo invokes rustc once per crate
- rustc decides how many "codegen units" to compile in parallel, writes them out as .o files, then archives them into an .rlib
- cargo does one final rustc invocation with all the required .rlib files, which ends up calling the linker to make a binary
There's a bit of confusion around nomenclature here: something that has a Cargo.toml is a package. Packages are published on a package registry, like https://crates.io, and each package may have multiple crates: a build script crate, a lib crate, one or more bin crates, etc.
Thanks to @SimonSapin for clearing that up!
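As a sketch (with hypothetical names), a single Cargo.toml can declare several of those crates:

```toml
# Hypothetical package manifest: one *package*, several *crates*
[package]
name = "my-package"         # the package, as published on crates.io
version = "0.1.0"
edition = "2021"
build = "build.rs"          # the build script crate

[lib]                       # the lib crate (src/lib.rs by default)
name = "my_package"

[[bin]]                     # a bin crate (there can be several)
name = "my-tool"
path = "src/bin/my_tool.rs"
```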
If we use cargo's --verbose flag, we can see it making separate rustc invocations; let's try it on a smaller crate so the output is a more manageable size.
Shell session
$ cargo build --verbose
Blocking waiting for file lock on build directory
Fresh unicode-xid v0.2.2
Fresh cc v1.0.72
Compiling regex-syntax v0.6.25
Fresh proc-macro2 v1.0.35
Fresh quote v1.0.10
Compiling memchr v2.4.1
Running `rustc --crate-name regex_syntax --edition=2018 /home/amos/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-syntax-0.6.25/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 --cfg 'feature="default"' --cfg 'feature="unicode"' --cfg 'feature="unicode-age"' --cfg 'feature="unicode-bool"' --cfg 'feature="unicode-case"' --cfg 'feature="unicode-gencat"' --cfg 'feature="unicode-perl"' --cfg 'feature="unicode-script"' --cfg 'feature="unicode-segment"' -C metadata=5f99d85e7aa20c3c -C extra-filename=-5f99d85e7aa20c3c --out-dir /home/amos/bearcove/tree-sitter-collection/target/debug/deps -L dependency=/home/amos/bearcove/tree-sitter-collection/target/debug/deps --cap-lints allow -C link-arg=-fuse-ld=lld`
Fresh syn v1.0.84
Fresh thiserror-impl v1.0.30
Running `rustc --crate-name memchr --edition=2018 /home/amos/.cargo/registry/src/github.com-1ecc6299db9ec823/memchr-2.4.1/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 --cfg 'feature="default"' --cfg 'feature="std"' -C metadata=6061dad8797913cc -C extra-filename=-6061dad8797913cc --out-dir /home/amos/bearcove/tree-sitter-collection/target/debug/deps -L dependency=/home/amos/bearcove/tree-sitter-collection/target/debug/deps --cap-lints allow -C link-arg=-fuse-ld=lld --cfg memchr_runtime_simd --cfg memchr_runtime_sse2 --cfg memchr_runtime_sse42 --cfg memchr_runtime_avx`
Compiling thiserror v1.0.30
Running `rustc --crate-name thiserror --edition=2018 /home/amos/.cargo/registry/src/github.com-1ecc6299db9ec823/thiserror-1.0.30/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 -C metadata=e606936c310ae4d3 -C extra-filename=-e606936c310ae4d3 --out-dir /home/amos/bearcove/tree-sitter-collection/target/debug/deps -L dependency=/home/amos/bearcove/tree-sitter-collection/target/debug/deps --extern thiserror_impl=/home/amos/bearcove/tree-sitter-collection/target/debug/deps/libthiserror_impl-b40be9909dd03ab5.so --cap-lints allow -C link-arg=-fuse-ld=lld`
Compiling aho-corasick v0.7.18
Running `rustc --crate-name aho_corasick --edition=2018 /home/amos/.cargo/registry/src/github.com-1ecc6299db9ec823/aho-corasick-0.7.18/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 --cfg 'feature="default"' --cfg 'feature="std"' -C metadata=d54b0d9e8dd84d99 -C extra-filename=-d54b0d9e8dd84d99 --out-dir /home/amos/bearcove/tree-sitter-collection/target/debug/deps -L dependency=/home/amos/bearcove/tree-sitter-collection/target/debug/deps --extern memchr=/home/amos/bearcove/tree-sitter-collection/target/debug/deps/libmemchr-6061dad8797913cc.rmeta --cap-lints allow -C link-arg=-fuse-ld=lld`
Compiling regex v1.5.4
Running `rustc --crate-name regex --edition=2018 /home/amos/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-1.5.4/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 --cfg 'feature="aho-corasick"' --cfg 'feature="default"' --cfg 'feature="memchr"' --cfg 'feature="perf"' --cfg 'feature="perf-cache"' --cfg 'feature="perf-dfa"' --cfg 'feature="perf-inline"' --cfg 'feature="perf-literal"' --cfg 'feature="std"' --cfg 'feature="unicode"' --cfg 'feature="unicode-age"' --cfg 'feature="unicode-bool"' --cfg 'feature="unicode-case"' --cfg 'feature="unicode-gencat"' --cfg 'feature="unicode-perl"' --cfg 'feature="unicode-script"' --cfg 'feature="unicode-segment"' -C metadata=84dc590185c6397b -C extra-filename=-84dc590185c6397b --out-dir /home/amos/bearcove/tree-sitter-collection/target/debug/deps -L dependency=/home/amos/bearcove/tree-sitter-collection/target/debug/deps --extern aho_corasick=/home/amos/bearcove/tree-sitter-collection/target/debug/deps/libaho_corasick-d54b0d9e8dd84d99.rmeta --extern memchr=/home/amos/bearcove/tree-sitter-collection/target/debug/deps/libmemchr-6061dad8797913cc.rmeta --extern regex_syntax=/home/amos/bearcove/tree-sitter-collection/target/debug/deps/libregex_syntax-5f99d85e7aa20c3c.rmeta --cap-lints allow -C link-arg=-fuse-ld=lld`
Compiling tree-sitter v0.20.1
Running `rustc --crate-name tree_sitter --edition=2018 /home/amos/.cargo/registry/src/github.com-1ecc6299db9ec823/tree-sitter-0.20.1/binding_rust/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 -C metadata=434249d5421db9dc -C extra-filename=-434249d5421db9dc --out-dir /home/amos/bearcove/tree-sitter-collection/target/debug/deps -L dependency=/home/amos/bearcove/tree-sitter-collection/target/debug/deps --extern regex=/home/amos/bearcove/tree-sitter-collection/target/debug/deps/libregex-84dc590185c6397b.rmeta --cap-lints allow -C link-arg=-fuse-ld=lld -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-97060a66e399d154/out -l static=tree-sitter`
Compiling tree-sitter-highlight v0.20.1
Running `rustc --crate-name tree_sitter_highlight --edition=2018 /home/amos/.cargo/registry/src/github.com-1ecc6299db9ec823/tree-sitter-highlight-0.20.1/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi --crate-type lib --crate-type staticlib --emit=dep-info,link -C embed-bitcode=no -C debuginfo=2 -C metadata=f889da5f4ac6bce2 -C extra-filename=-f889da5f4ac6bce2 --out-dir /home/amos/bearcove/tree-sitter-collection/target/debug/deps -L dependency=/home/amos/bearcove/tree-sitter-collection/target/debug/deps --extern regex=/home/amos/bearcove/tree-sitter-collection/target/debug/deps/libregex-84dc590185c6397b.rlib --extern thiserror=/home/amos/bearcove/tree-sitter-collection/target/debug/deps/libthiserror-e606936c310ae4d3.rlib --extern tree_sitter=/home/amos/bearcove/tree-sitter-collection/target/debug/deps/libtree_sitter-434249d5421db9dc.rlib --cap-lints allow -C link-arg=-fuse-ld=lld -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-97060a66e399d154/out`
Compiling tree-sitter-collection v0.26.0 (/home/amos/bearcove/tree-sitter-collection)
Running `rustc --crate-name tree_sitter_collection --edition=2018 src/lib.rs --error-format=json --json=diagnostic-rendered-ansi --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 -C metadata=a567e6662873a172 -C extra-filename=-a567e6662873a172 --out-dir /home/amos/bearcove/tree-sitter-collection/target/debug/deps -C incremental=/home/amos/bearcove/tree-sitter-collection/target/debug/incremental -L dependency=/home/amos/bearcove/tree-sitter-collection/target/debug/deps --extern tree_sitter=/home/amos/bearcove/tree-sitter-collection/target/debug/deps/libtree_sitter-434249d5421db9dc.rmeta --extern tree_sitter_highlight=/home/amos/bearcove/tree-sitter-collection/target/debug/deps/libtree_sitter_highlight-f889da5f4ac6bce2.rlib -C link-arg=-fuse-ld=lld -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L 
native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-collection-c4ebc5be2f6a2943/out -l static=tree-sitter-go -l static=tree-sitter-c -l static=tree-sitter-rust -l static=tree-sitter-javascript -l static=tree-sitter-java -l static=tree-sitter-typescript -l static=tree-sitter-tsx -l static=tree-sitter-toml -l static=tree-sitter-bash-parser -l static=tree-sitter-bash-scanner -l stdc++ -l static=tree-sitter-html-parser -l static=tree-sitter-html-scanner -l stdc++ -l static=tree-sitter-python-parser -l static=tree-sitter-python-scanner -l stdc++ -l static=tree-sitter-ini-parser -l static=tree-sitter-meson-parser -l static=tree-sitter-yaml-parser -l static=tree-sitter-yaml-scanner -l stdc++ -l static=tree-sitter-dockerfile-parser -L native=/home/amos/bearcove/tree-sitter-collection/target/debug/build/tree-sitter-97060a66e399d154/out`
Finished dev [unoptimized + debuginfo] target(s) in 9.95s
Yay!
And if we don't trust cargo, instead of asking nicely, we can also just straight up spy on it, with strace. Let's try it on an even smaller crate, because that one's pretty noisy:
In the invocation below, we trace system calls (strace), including child processes (following forks, -f), for the execution of cargo build --quiet. We then redirect standard error (stderr, file descriptor 2) to standard output (stdout, file descriptor 1), and pipe into grep (from g/re/p: Globally search for a Regular Expression and Print matching lines), using "extended" regular expression syntax (-E), looking for lines that start with execve( and end with = 0 (which indicates success).
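That 2>&1 matters, by the way: strace writes to stderr, while the pipe only carries stdout. A tiny sketch of that redirection:

```shell
# strace logs to stderr, but a pipe only carries stdout.
# 2>&1 merges stderr into stdout so grep sees both streams.
( echo "on stdout"; echo "on stderr" >&2 ) 2>&1 | grep "stderr"
```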
Shell session
$ cargo clean && strace -f -e execve -- cargo build --quiet 2>&1 | grep -E 'execve\(.*= 0'
execve("/home/amos/.cargo/bin/cargo", ["cargo", "build", "--quiet"], 0x7fff74a9bec0 /* 51 vars */) = 0
execve("/home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/cargo", ["/home/amos/.rustup/toolchains/st"..., "build", "--quiet"], 0x5628508c5b60 /* 56 vars */) = 0
[pid 702423] execve("/home/amos/.cargo/bin/rustc", ["rustc", "-vV"], 0x5604183fdd60 /* 58 vars */) = 0
[pid 702423] execve("/home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc", ["/home/amos/.rustup/toolchains/st"..., "-vV"], 0x55d9605a1250 /* 58 vars */) = 0
[pid 702424] execve("/home/amos/.cargo/bin/rustc", ["rustc", "-", "--crate-name", "___", "--print=file-names", "--crate-type", "bin", "--crate-type", "rlib", "--crate-type", "dylib", "--crate-type", "cdylib", "--crate-type", "staticlib", "--crate-type", "proc-macro", "-Csplit-debuginfo=packed"], 0x5604184cac20 /* 58 vars */) = 0
[pid 702424] execve("/home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc", ["/home/amos/.rustup/toolchains/st"..., "-", "--crate-name", "___", "--print=file-names", "--crate-type", "bin", "--crate-type", "rlib", "--crate-type", "dylib", "--crate-type", "cdylib", "--crate-type", "staticlib", "--crate-type", "proc-macro", "-Csplit-debuginfo=packed"], 0x55eaa32ffb50 /* 58 vars */) = 0
[pid 702425] execve("/home/amos/.cargo/bin/rustc", ["rustc", "-", "--crate-name", "___", "--print=file-names", "--crate-type", "bin", "--crate-type", "rlib", "--crate-type", "dylib", "--crate-type", "cdylib", "--crate-type", "staticlib", "--crate-type", "proc-macro", "--force-warn=rust-2021-compatibi"...], 0x56041841ba80 /* 58 vars */) = 0
[pid 702425] execve("/home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc", ["/home/amos/.rustup/toolchains/st"..., "-", "--crate-name", "___", "--print=file-names", "--crate-type", "bin", "--crate-type", "rlib", "--crate-type", "dylib", "--crate-type", "cdylib", "--crate-type", "staticlib", "--crate-type", "proc-macro", "--force-warn=rust-2021-compatibi"...], 0x559d2d9dc7b0 /* 58 vars */) = 0
[pid 702427] execve("/home/amos/.cargo/bin/rustc", ["rustc", "-", "--crate-name", "___", "--print=file-names", "--crate-type", "bin", "--crate-type", "rlib", "--crate-type", "dylib", "--crate-type", "cdylib", "--crate-type", "staticlib", "--crate-type", "proc-macro", "--print=sysroot", "--print=cfg"], 0x5604184cac20 /* 58 vars */) = 0
[pid 702427] execve("/home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc", ["/home/amos/.rustup/toolchains/st"..., "-", "--crate-name", "___", "--print=file-names", "--crate-type", "bin", "--crate-type", "rlib", "--crate-type", "dylib", "--crate-type", "cdylib", "--crate-type", "staticlib", "--crate-type", "proc-macro", "--print=sysroot", "--print=cfg"], 0x55e0804f1d00 /* 58 vars */) = 0
[pid 702429] execve("/home/amos/.cargo/bin/rustc", ["rustc", "-vV"], 0x5604183fdd60 /* 58 vars */) = 0
[pid 702429] execve("/home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc", ["/home/amos/.rustup/toolchains/st"..., "-vV"], 0x55ffc26bb690 /* 58 vars */) = 0
[pid 702432] execve("/home/amos/.cargo/bin/rustc", ["rustc", "--crate-name", "hello_world", "--edition=2021", "src/main.rs", "--error-format=json", "--json=diagnostic-rendered-ansi", "--crate-type", "bin", "--emit=dep-info,link", "-C", "embed-bitcode=no", "-C", "debuginfo=2", "-C", "metadata=edf9f7e82579517e", "-C", "extra-filename=-edf9f7e82579517e", "--out-dir", "/home/amos/bearcove/hello-world/"..., "-C", "incremental=/home/amos/bearcove/"..., "-L", "dependency=/home/amos/bearcove/h"...], 0x7f39d8004c30 /* 76 vars */) = 0
[pid 702432] execve("/home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc", ["/home/amos/.rustup/toolchains/st"..., "--crate-name", "hello_world", "--edition=2021", "src/main.rs", "--error-format=json", "--json=diagnostic-rendered-ansi", "--crate-type", "bin", "--emit=dep-info,link", "-C", "embed-bitcode=no", "-C", "debuginfo=2", "-C", "metadata=edf9f7e82579517e", "-C", "extra-filename=-edf9f7e82579517e", "--out-dir", "/home/amos/bearcove/hello-world/"..., "-C", "incremental=/home/amos/bearcove/"..., "-L", "dependency=/home/amos/bearcove/h"...], 0x5578142c7ba0 /* 76 vars */) = 0
[pid 702446] execve("/usr/bin/cc", ["cc", "-m64", "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "-Wl,--as-needed", "-L", "/home/amos/bearcove/hello-world/"..., "-L", "/home/amos/.rustup/toolchains/st"..., "-Wl,--start-group", "-Wl,-Bstatic", "/home/amos/.rustup/toolchains/st"..., "/home/amos/.rustup/toolchains/st"..., "/home/amos/.rustup/toolchains/st"..., "/home/amos/.rustup/toolchains/st"..., "/home/amos/.rustup/toolchains/st"..., "/home/amos/.rustup/toolchains/st"..., "/home/amos/.rustup/toolchains/st"..., "/home/amos/.rustup/toolchains/st"..., "/home/amos/.rustup/toolchains/st"..., "/home/amos/.rustup/toolchains/st"..., "/home/amos/.rustup/toolchains/st"..., "/home/amos/.rustup/toolchains/st"..., "/home/amos/.rustup/toolchains/st"..., "/home/amos/.rustup/toolchains/st"..., ...], 0x7f712c3bcd00 /* 78 vars */) = 0
[pid 702447] execve("/usr/libexec/gcc/x86_64-redhat-linux/11/collect2", ["/usr/libexec/gcc/x86_64-redhat-l"..., "-plugin", "/usr/libexec/gcc/x86_64-redhat-l"..., "-plugin-opt=/usr/libexec/gcc/x86"..., "-plugin-opt=-fresolution=/tmp/cc"..., "--build-id", "--no-add-needed", "--eh-frame-hdr", "--hash-style=gnu", "-m", "elf_x86_64", "-dynamic-linker", "/lib64/ld-linux-x86-64.so.2", "-pie", "-o", "/home/amos/bearcove/hello-world/"..., "/usr/lib/gcc/x86_64-redhat-linux"..., "/usr/lib/gcc/x86_64-redhat-linux"..., "/usr/lib/gcc/x86_64-redhat-linux"..., "-L/home/amos/bearcove/hello-worl"..., "-L/home/amos/.rustup/toolchains/"..., "-L/home/amos/.rustup/toolchains/"..., "-L/usr/lib/gcc/x86_64-redhat-lin"..., "-L/usr/lib/gcc/x86_64-redhat-lin"..., "-L/lib/../lib64", "-L/usr/lib/../lib64", "-L/usr/lib/gcc/x86_64-redhat-lin"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., ...], 0x1ece9c0 /* 85 vars */) = 0
[pid 702448] execve("/usr/bin/ld", ["/usr/bin/ld", "-plugin", "/usr/libexec/gcc/x86_64-redhat-l"..., "-plugin-opt=/usr/libexec/gcc/x86"..., "-plugin-opt=-fresolution=/tmp/cc"..., "--build-id", "--no-add-needed", "--eh-frame-hdr", "--hash-style=gnu", "-m", "elf_x86_64", "-dynamic-linker", "/lib64/ld-linux-x86-64.so.2", "-pie", "-o", "/home/amos/bearcove/hello-world/"..., "/usr/lib/gcc/x86_64-redhat-linux"..., "/usr/lib/gcc/x86_64-redhat-linux"..., "/usr/lib/gcc/x86_64-redhat-linux"..., "-L/home/amos/bearcove/hello-worl"..., "-L/home/amos/.rustup/toolchains/"..., "-L/home/amos/.rustup/toolchains/"..., "-L/usr/lib/gcc/x86_64-redhat-lin"..., "-L/usr/lib/gcc/x86_64-redhat-lin"..., "-L/lib/../lib64", "-L/usr/lib/../lib64", "-L/usr/lib/gcc/x86_64-redhat-lin"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., ...], 0x7ffc58148430 /* 85 vars */) = 0
Yay! We see everything. It does indeed call rustc a bunch of times, and the final invocation calls cc, which calls collect2 and eventually ld.
And if we ask it to use lld, LLVM's linker, instead, it does that!
Shell session
$ cargo clean && RUSTFLAGS="-C link-args=-fuse-ld=lld" strace -f -e execve -- cargo build --quiet 2>&1 | grep -E 'execve\(.*= 0'
(cut)
[pid 705269] execve("/usr/libexec/gcc/x86_64-redhat-linux/11/collect2", ["/usr/libexec/gcc/x86_64-redhat-l"..., "-plugin", "/usr/libexec/gcc/x86_64-redhat-l"..., "-plugin-opt=/usr/libexec/gcc/x86"..., "-plugin-opt=-fresolution=/tmp/cc"..., "--build-id", "--no-add-needed", "--eh-frame-hdr", "--hash-style=gnu", "-m", "elf_x86_64", "-dynamic-linker", "/lib64/ld-linux-x86-64.so.2", "-pie", "-fuse-ld=lld", "-o", "/home/amos/bearcove/hello-world/"..., "/usr/lib/gcc/x86_64-redhat-linux"..., "/usr/lib/gcc/x86_64-redhat-linux"..., "/usr/lib/gcc/x86_64-redhat-linux"..., "-L/home/amos/bearcove/hello-worl"..., "-L/home/amos/.rustup/toolchains/"..., "-L/home/amos/.rustup/toolchains/"..., "-L/usr/lib/gcc/x86_64-redhat-lin"..., "-L/usr/lib/gcc/x86_64-redhat-lin"..., "-L/lib/../lib64", "-L/usr/lib/../lib64", "-L/usr/lib/gcc/x86_64-redhat-lin"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., ...], 0x6e2970 /* 86 vars */) = 0
👇
[pid 705270] execve("/usr/bin/ld.lld", ["/usr/bin/ld.lld", "-plugin", "/usr/libexec/gcc/x86_64-redhat-l"..., "-plugin-opt=/usr/libexec/gcc/x86"..., "-plugin-opt=-fresolution=/tmp/cc"..., "--build-id", "--no-add-needed", "--eh-frame-hdr", "--hash-style=gnu", "-m", "elf_x86_64", "-dynamic-linker", "/lib64/ld-linux-x86-64.so.2", "-pie", "-o", "/home/amos/bearcove/hello-world/"..., "/usr/lib/gcc/x86_64-redhat-linux"..., "/usr/lib/gcc/x86_64-redhat-linux"..., "/usr/lib/gcc/x86_64-redhat-linux"..., "-L/home/amos/bearcove/hello-worl"..., "-L/home/amos/.rustup/toolchains/"..., "-L/home/amos/.rustup/toolchains/"..., "-L/usr/lib/gcc/x86_64-redhat-lin"..., "-L/usr/lib/gcc/x86_64-redhat-lin"..., "-L/lib/../lib64", "-L/usr/lib/../lib64", "-L/usr/lib/gcc/x86_64-redhat-lin"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., "/home/amos/bearcove/hello-world/"..., ...], 0x7ffd9119d598 /* 86 vars */) = 0
That's all really nice, but... it doesn't really explain why our build was slow.
For starters, unless we run the build with --verbose and watch the terminal to monitor how long different parts are taking, we don't really know anything about timings.
And that's why cargo has a --timings option!
Shell session
$ cargo clean && cargo build --release --quiet --timings
Whoa, --quiet is really quiet.
That took forever, but now we have... a cargo-timing.html file!
It's interactive and stuff, so I can't really do it justice, but I still want to
show you what it looks like, so here goes, non-dark-mode and non-hidpi-friendly
screenshots and all.
First we have a nice summary:
"Fresh" units would be those that we don't need to recompile, but this is a
"cold" build, since we ran cargo clean
beforehand.
Then we have a timeline! That shows aaaaaall 420 compilation units:
Okay be real with me: did you add unnecessary dependencies just so the
total would be 420?
420 rows don't really fit on my screen, and if they did, they'd be too small to
be readable, but luckily, we can crank up that "Min unit time" slider to hide the
really quick ones:
What do the colors mean? I'm so glad you asked!
Orange means we're running a build script. Everything else has to do with pipelining: cargo always tries to build as many crates as possible in parallel, but it can only start the "next" crate if there's enough metadata available about all its dependencies.
So, the light blue parts are where we're still busy generating metadata, and any dependents (crates that depend on us) are blocked. When transitioning to lavender, dependents are unblocked and can start building immediately.
You can see this happening here: regex-syntax transitions from light blue to lavender and regex (line 16) immediately starts building; the dotted line shows the dependency:
You can read more about pipelining on the Rust compiler team's repository.
And here we can see, at a glance, the most expensive crates: tree-sitter-collection, one of my own crates, which compiles, like, 12 different tree-sitter parsers (as C code, with gcc); libsqlite3-sys, which builds all of sqlite3 (so I don't have to have it installed on whichever system I run this binary); and libquickjs-sys, which is a whole-ass JavaScript engine (which I use for KaTeX).
And then the last hugeeeeeee light blue bar, that's just the binary crate we're
compiling. (And that is the crux of my issue).
For completeness, here's what else it shows us: a CPU usage and concurrency
graph:
You can see that CPU usage is really high for the first third of the compilation, and it spikes again at the end, but there's a big valley in between, where it appears to be doing very little.
"Very little" is misleading, of course, since there's 32 logical cores, we're
actually looking at one core being at full utilization. Still, the system as a
whole is "mostly idle".
Finally, it gives us a table, sort of a hall of fame of slow-to-build crates:
Now for a hot build! If we change a single line in main.rs and rebuild, we get:
Which is almost completely devoid of insight.
Because it's a bin crate, cargo --timings doesn't report codegen time, and (at least as of this writing) it doesn't report link time either, so... we have a big opaque block of 72 seconds, doing... something.
If you want to visualize "how parallel" your build is, or "where cargo is spending time", you can use --timings.
For very large Rust projects, and for hot builds, it's not uncommon for linking to make up most of the "build time". And that's why people tend to reach for lld (LLVM's linker), for example.
Thing is... we're already using lld.
Here's the graph with GNU ld instead (the default for this system):
That is the difference lld makes here: it's three seconds faster than GNU ld for this build (out of some unknown "link duration"; we could measure it, but I can't be bothered right now).
Using mold instead, we have this:
Barely faster than lld. So linking is not the bottleneck here.
When building larger applications, linking can become the slowest / most expensive part of a rebuild. GNU ld hasn't been the best option for a while now. Rui Ueyama changed the game, twice, by working on lld, LLVM's linker, and then on mold, which will soon be officially recognized by GCC!
By default, no debug symbols are included in "release" cargo builds.
Debug symbols allow mapping memory addresses to "source locations", among other things, which in turn unlocks comfortable ways of debugging our executable.
For example, with debug information, we can ask the debugger to stop execution
on a specific function by adding a breakpoint on a function name:
Shell session
$ rust-gdb --quiet --args futile serve
Reading symbols from futile...
(gdb) set print thread-events off
(gdb) break futile::Config::read
Breakpoint 1 at 0x7d105a: file src/main.rs, line 69.
(gdb) r
Starting program: /home/amos/.cargo/bin/futile serve
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Thread 1 "futile" hit Breakpoint 1, futile::Config::read () at src/main.rs:69
69 let config_path = if config_path.is_dir() {
(gdb)
Wait, couldn't we set a breakpoint even if we didn't have debug info?
Well, let's try it!
Shell session
$ rust-gdb --quiet --args futile serve
Reading symbols from futile...
(No debugging symbols found in futile)
(gdb) set print thread-events off
(gdb) break futile::Config::read
Function "futile::Config::read" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (futile::Config::read) pending.
(gdb) r
Starting program: /home/amos/.cargo/bin/futile serve
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
INFO futile > Reading config from ./futile.toml
INFO futile > No config file found, will use defaults
INFO futile > Base dir is /home/amos/bearcove/fasterthanli.me
WARN futile > sqlite: [283] recovered 2 frames from WAL file /home/amos/bearcove/fasterthanli.me/.futile/content.db-wal
(etc.)
No!
But I thought... don't ELF files (like executables and dynamic libraries, even static libraries) have symbol tables that are distinct from debug info?
They do! If we look at libm.so (lib math), it has a bunch of symbols. Over a thousand, in fact:
Shell session
$ nm --dynamic --defined-only /usr/lib64/libm.so.6 | head -10
00000000000119d0 W acos@@GLIBC_2.2.5
0000000000015370 W acosf@@GLIBC_2.2.5
0000000000064bc0 W acosf128@@GLIBC_2.26
0000000000046510 T __acosf128_finite@GLIBC_2.26
0000000000015370 W acosf32@@GLIBC_2.27
00000000000119d0 W acosf32x@@GLIBC_2.27
00000000000119d0 W acosf64@@GLIBC_2.27
0000000000010170 W acosf64x@@GLIBC_2.27
0000000000038310 T __acosf_finite@GLIBC_2.15
0000000000025630 i __acos_finite@GLIBC_2.15
$ nm --dynamic --defined-only /usr/lib64/libm.so.6 | wc -l
1102
Our executable, however, doesn't define any symbols:
Shell session
$ nm --dynamic --defined-only $(which futile)
It expects symbols to be defined by other dynamic libraries it depends on, though!
Shell session
$ nm --dynamic --undefined-only $(which futile) | head
U abort@GLIBC_2.2.5
U accept4@GLIBC_2.10
U access@GLIBC_2.2.5
U acos@GLIBC_2.2.5
U acosh@GLIBC_2.2.5
U asin@GLIBC_2.2.5
U asinh@GLIBC_2.2.5
U __assert_fail@GLIBC_2.2.5
U atan@GLIBC_2.2.5
U atan2@GLIBC_2.2.5
Note the difference: we were using --defined-only, now we're using --undefined-only, emphasis on the privative un- prefix.
Look, math functions! I guess it links against libm.so...
Shell session
$ objdump -p $(which futile) | grep 'NEEDED'
NEEDED libstdc++.so.6
NEEDED libgcc_s.so.1
NEEDED libm.so.6
NEEDED libc.so.6
NEEDED ld-linux-x86-64.so.2
Yes it does.
So symbols are only exported if they need to be exported. In the case of a library, we need to be able to call acos and asin from somewhere else, and so they're in the table. But for an executable, all we need to know (sort of) is where the code starts, and that's the "start address" here:
Shell session
$ objdump -f $(which futile)
/home/amos/.cargo/bin/futile: file format elf64-x86-64
architecture: i386:x86-64, flags 0x00000150:
HAS_SYMS, DYNAMIC, D_PAGED
start address 0x0000000000635ec0
Okay, so futile::Config::read doesn't need to be "exported", because it's only ever called "internally", from futile into futile... so how did GDB ever successfully set a breakpoint there?
It looked at debug information instead!
Because debug information is so useful, and because I like to be able to look at
stack traces with "source location" (file name and line number) information, and
because I like to be able to step through code in a debugger, even on release
builds, I have this:
TOML markup
# in `futile/Cargo.toml`
[profile.release]
debug = 1
And so my executable has a bunch of additional ELF sections:
Shell session
$ objdump --wide --headers $(which futile) | grep -F '.debug'
12 .debug_gdb_scripts 00000022 00000000004a8b99 00000000004a8b99 004a8b99 2**0 CONTENTS, ALLOC, LOAD, READONLY, DATA
32 .debug_abbrev 0006a4d3 0000000000000000 0000000000000000 0164f4dc 2**0 CONTENTS, READONLY, DEBUGGING, OCTETS
33 .debug_info 0195631c 0000000000000000 0000000000000000 016b99af 2**0 CONTENTS, READONLY, DEBUGGING, OCTETS
34 .debug_aranges 00108190 0000000000000000 0000000000000000 0300fcd0 2**4 CONTENTS, READONLY, DEBUGGING, OCTETS
35 .debug_ranges 00d770e0 0000000000000000 0000000000000000 03117e60 2**0 CONTENTS, READONLY, DEBUGGING, OCTETS
36 .debug_str 016b3e43 0000000000000000 0000000000000000 03e8ef40 2**0 CONTENTS, READONLY, DEBUGGING, OCTETS
37 .debug_pubnames 0154980f 0000000000000000 0000000000000000 05542d83 2**0 CONTENTS, READONLY, DEBUGGING, OCTETS
38 .debug_pubtypes 0000690c 0000000000000000 0000000000000000 06a8c592 2**0 CONTENTS, READONLY, DEBUGGING, OCTETS
39 .debug_line 00da927e 0000000000000000 0000000000000000 06a92e9e 2**0 CONTENTS, READONLY, DEBUGGING, OCTETS
40 .debug_loclists 00453d2b 0000000000000000 0000000000000000 0783c11c 2**0 CONTENTS, READONLY, DEBUGGING, OCTETS
41 .debug_rnglists 0009618f 0000000000000000 0000000000000000 07c8fe47 2**0 CONTENTS, READONLY, DEBUGGING, OCTETS
42 .debug_line_str 000025c6 0000000000000000 0000000000000000 07d25fd6 2**0 CONTENTS, READONLY, DEBUGGING, OCTETS
43 .debug_macro 00002a41 0000000000000000 0000000000000000 07d2859c 2**0 CONTENTS, READONLY, DEBUGGING, OCTETS
44 .debug_frame 000001d0 0000000000000000 0000000000000000 07d2afe0 2**3 CONTENTS, READONLY, DEBUGGING, OCTETS
And the format of those is quite complicated, but with a little help from our
friends we can trace back what GDB used to set a breakpoint on
futile::Config::read:
Shell session
$ readelf --debug-dump=info $(which futile) | rustfilt | grep 'futile::Config::read'
<2e68b4> DW_AT_linkage_name: (indirect string, offset: 0xb960a1): futile::Config::read
Note that rustfilt
is necessary here; otherwise, we'd see the mangled form
_ZN6futile6Config4read17h2622e8ecc143c293E
instead.
Alternatively, we could've piped into grep -E 'futile.Config.read'
, which
would have searched for the words futile
, Config
, and read
separated by
any single character.
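That works because the legacy mangling scheme prefixes each path segment with its length, and here those prefixes happen to be single digits; a sketch:
Shell session

```shell
# `.` matches any single character, so it matches the `6` and `4`
# length prefixes inside the mangled name — no rustfilt required
echo '_ZN6futile6Config4read17h2622e8ecc143c293E' | grep -E 'futile.Config.read'
```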
But here's the point I'm getting at: generating that debug information is not
free. It takes time!
In fact, the
profile.<name>.debug
cargo configuration key is not a simple on/off switch: there are levels 0 (or
false), 1, and 2 (or true).
Most of the time, 2 really is overkill. Let's compare timings just so you can be
convinced that it really does make an impact on the build times of futile
:
With debug = 0
, a cold build takes ~1m55s
With debug = 1
, a cold build takes ~2m04s
With debug = 2
, a cold build takes ~2m12s
Now keep in mind we have a block of roughly ~72s where we're not sure what the
compiler is doing, and the rest is embarrassingly parallel, so the difference
there is less impressive than it should be. We'll try again once we fix whatever
is taking so long.
Debug information is super duper useful, if only so that when you get a crash,
you can easily map a stack trace (a bunch of memory addresses) to a bunch of
source locations (file name / line number, but also function name, etc.).
In Cargo.toml
, debug = true
actually means debug = 2
, and it's usually
overkill, unless you're doing the sort of debugging where you need to be able
to inspect the value of local variables for example. If all you're after is
a stack trace, debug = 1
is good enough.
If you think debug information is too large, first off, you're correct, but
secondly, try reaching for objcopy --compress-debug-sections
to compress it
instead of throwing it away with strip
: you might find that you're happy with
that compromise.
One thing I've made sure to mention since the beginning of this article is that
I'm talking about release builds, not debug builds. That's why all cargo
invocations have had the --release
flag.
Is it as slow to make a debug build of futile
?
Cold debug build: 1m33s (vs 2m04s for release)
Hot debug build: 19s (vs 1m11s for release)
Huh indeed! So again, we expect there to be a difference: in debug builds, we
do very few optimizations, whereas in release builds, we try to optimize as much
as possible.
And optimizing isn't free, so, it makes sense that a release build would take
longer. But that much longer?
There's another difference: by default, cargo's debug profile enables
incremental
builds .
And cargo's release profile has incremental builds disabled.
What happens if we enable incremental builds for release?
TOML markup
# in `futile/Cargo.toml`
[profile.release]
debug = 1
incremental = true
Cold incremental build: 2m08s (vs 2m04s non-incremental)
Hot incremental build: 20s (vs 1m11s non-incremental)
AhAH! That's better! Much better!
Yes it is! That's the whole trade-off of incremental builds: cold builds are a
little slower, but hot builds should be faster.
Mhh I'm looking at the docs here and... it says something about "codegen units"
too, and it being different for incremental builds vs non-incremental builds?
Yes! The goal of "incremental builds" is to be able to recompile fast. It helps
if compiling is fast in general, and if we only have to recompile as little code
as possible.
So, with incremental builds, crates are split into more, smaller codegen units.
In 1.57.0 at least, non-incremental builds default to only 16 codegen units,
whereas incremental builds default to 256 codegen units.
Could it be what's making the difference? The number of codegen units, and not
incremental builds?
Let's try it!
TOML markup
# in `futile/Cargo.toml`
[profile.release]
debug = 1
incremental = false
codegen-units = 256
Cold 256-cgu build: 2m02s (vs 2m04s for 16-cgu)
Hot 256-cgu build: 1m11s (vs 1m11s for 16-cgu)
Nope. No significant difference.
Incremental builds sound great . Why wouldn't we want to use them all the time?
Well, they're not well-supported by sccache ,
which I use all the time in CI, so it's not really an option there.
I'm not sure if there's any compelling reason not to use incremental builds all
the time locally : I used to think more codegen units meant less optimization
opportunities, but from what I read of the cargo docs, I'm not sure that's true
anymore.
Generally I would advise setting incremental = true
in profile.release
, and
just disabling it in CI by setting the environment variable CARGO_INCREMENTAL
to 0
.
Incremental builds are super useful to iterate quickly on something. That's why
they're enabled by default for debug builds.
If you make a lot of release builds locally, you may want to enable incremental
builds for the release profile too.
If the reason you make release builds locally is because you depend on crates
that do gzip/bzip2/brotli/zstd decompression for example, and those are super
slow in debug, you may want to set up
overrides
for those dependencies instead!
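Such an override might look like this (a sketch: `zstd` is just a stand-in for whichever dependency is slow in debug). It tells cargo to compile that one dependency with optimizations, even in debug builds:
TOML markup

```toml
# in `Cargo.toml` — sketch; replace `zstd` with the actual slow dependency
[profile.dev.package.zstd]
opt-level = 3
```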
Unless you explicitly set lto = "off"
in your profile in Cargo.toml
, cargo
performs "thin local LTO", which means it'll try to inline calls across codegen
units within the same crate.
But if we want it to be able to inline calls across crates , we have to pick
other types of LTO. We can do "fat" LTO, the classical method, and that one
should be super expensive:
TOML markup
# in `futile/Cargo.toml`
[profile.release]
debug = 1
lto = "fat"
Cold "fat" LTO release build: 3m39s (vs 2m04s thin-local LTO)
Hot "fat" LTO release build: 2m50s (vs 1m11s thin-local LTO)
As expected, it's doing a lot more work, and so it's slower. (But with the
promise of maybe generating faster code).
And we can do thin LTO :
TOML markup
# in `futile/Cargo.toml`
[profile.release]
debug = 1
lto = "thin"
Cold thin LTO release build: 2m10s (vs 2m04s thin-local LTO)
Hot thin LTO release build: 1m20s (vs 1m11s thin-local LTO)
And as promised, thin LTO is a lot faster than "fat" LTO.
But we still have a block of 72s where it's doing something that's not
linking,
right?
Right! We should probably do those measurements again once we've solved that.
Since everything we've tried (and that usually works) has failed, it's time to
go to the next level, which I didn't know existed until a couple days ago.
I said it was sorta easier to see what's going on when building, say, C/C++ code
using gcc, because the steps were obvious: in a Makefile or so. But that wasn't
entirely true.
Because what we have here, is the same as if we had a single .c
source file
that took over a minute to build. We know exactly what's being run, but we don't
know where the compiler is spending time.
And I have no idea how easy it is to profile GCC, Clang, or MSVC, but I know
it's really easy to profile rustc, with -Z self-profile.
That's also a nightly-only option, so we'll use the same RUSTC_BOOTSTRAP
crime. It's also a rustc flag, not a cargo flag, so instead of using cargo build
we'll need to use cargo rustc
.
First, let's make a cold build, because what we really want to know is what
happens during a hot build:
Shell session
$ cargo clean
$ RUSTC_BOOTSTRAP=1 cargo rustc --release
(cut)
Finished release [optimized + debuginfo] target(s) in 2m 04s
And now let's change something in main.rs
and rebuild, passing -Z self-profile
. Because it's a rustc option, we need to pass it after --
, which
separates cargo options from rustc options:
Shell session
$ RUSTC_BOOTSTRAP=1 cargo rustc --release -- -Z self-profile
Compiling futile v1.9.0 (/home/amos/bearcove/futile)
Finished release [optimized + debuginfo] target(s) in 1m 14s
rustc wrote a profile to disk, because we asked nicely:
Shell session
$ ls -lhA *profdata
-rw-r--r--. 1 amos amos 39M Dec 30 12:52 futile-1004573.mm_profdata
To visualize it, we have a couple options! So we'll install a couple tools:
Shell session
$ cargo install --git https://github.com/rust-lang/measureme crox flamegraph summarize
(cut)
Summary Successfully installed crox, flamegraph, summarize!
Let's try summarize
, the simplest, first:
Shell session
$ summarize summarize futile-1004573.mm_profdata | head -10
+------------------------------------------------+-----------+-----------------+----------+------------+
| Item | Self time | % of total time | Time | Item count |
+------------------------------------------------+-----------+-----------------+----------+------------+
| evaluate_obligation | 58.01s | 36.678 | 58.17s | 31347 |
+------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_module_optimize_module_passes | 23.74s | 15.008 | 23.74s | 16 |
+------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_lto_optimize | 21.09s | 13.334 | 21.09s | 16 |
+------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_module_codegen_emit_obj | 16.17s | 10.220 | 16.17s | 17 |
(cut)
Yes, the summarize
command-line tool accepts subcommands, and one of them is
summarize
. The stutter is intentional.
Interesting! It's spending 58s in evaluate_obligation
.
Let's try flamegraph
next!
Shell session
$ flamegraph futile-1004573.mm_profdata
Mhhh.. nothing?
Shell session
$ ls -lhA *.svg
-rw-r--r--. 1 amos amos 28K Dec 30 12:58 rustc.svg
Ah!
It's interactive and stuff, like you'd expect a flamegraph to be (you can hover
on items to see their full names, the number and % of samples, and you can zoom
around and filter stuff!), but again, all you get is a screenshot unless you
go and do it yourself:
I've highlighted evaluate_obligation
, since it was the most expensive bit
reported by summarize
.
There's a third way to visualize that information, which is the richest of them
all, and it's to use crox
to convert it to a "Chrome profile":
Shell session
$ crox futile-1004573.mm_profdata
$ ls -lhA chrome_profiler.json
-rw-r--r--. 1 amos amos 161M Dec 30 13:12 chrome_profiler.json
Now we have a boatload of JSON that we can load into chrome://tracing
(only
works in Chromium-based browsers):
We see much of what we've seen in the flamegraph, but with richer info! Also,
it's not the same graph at all: that one is a timeline, not a flame graph.
Flame graphs let us know about frequency, whereas timelines show what actually
happened in chronological order. Which is made obvious if we search for
evaluate_obligation
to highlight it:
We can see there's a lot of evaluate_obligation
calls, there's just one
that's taking a lot longer. Which points to our issue: it's not that there's a
lot of obligations to evaluate, it's that one is taking forever (over 40
seconds).
If you look closer, you can see there's actually a link
item in there, at
the top-right: it takes only 6.2 seconds. That's where the difference was made
between GNU ld, LLVM lld and mold.
Oh yeah, definitely worth measuring those again!
But... we still don't know what's happening. We know something's wrong, which is
great! Or rather, we now strongly suspect we're hitting a pathological case
somewhere.
What's a pathological case? It's when an algorithm is the slowest it can
possibly be, due to some input.
Take insertion sort, for example: if you go to this Sorting Algorithms
Animation page and click
play on "Insertion", look at the "Reversed" input: see how long it takes? It has
to do the maximum amount of swaps, because it's being given the worst input
possible. By comparison, "Nearly sorted" finishes very quickly.
So here, we probably have an algorithm in rustc that performs reasonably well
for most inputs, but our program somehow has a shape that makes that algorithm
perform really poorly. This can happen when an algorithm accidentally has
"quadratic complexity".
Allow me, for a minute, to get all scientific with MS Paint drawings.
Logarithmic complexity is fine, as the time increases less and less as the input
size increases.
O(log(n)) — logarithmic time complexity
Linear complexity is fine too, as far as I'm concerned: if you have twice as
much input, it takes twice as long:
O(n) — linear time complexity
But quadratic complexity, is not fine at all. Well, it might be fine for small
values, but the time required shoots up real quick :
O(n^2) — quadratic time complexity
Okay, back to our actual build. We suspect something like that is happening...
but where?
We've already said it would be a good idea to break down our big binary crate
into smaller crates but... where to start? Plus, that's a lot of effort! I would
really like to know what's causing it.
Luckily, there's an additional rustc option we can use there to record the
arguments passed to its functions: so we'll know which obligation it's
evaluating.
Let's change something else in main.rs
and measure again:
Shell session
$ RUSTC_BOOTSTRAP=1 cargo rustc --release -- -Z self-profile -Z self-profile-events=default,args
This time we'll do two things differently: first off, when converting the
profile to the Chrome format, we'll ignore events below a certain duration:
that'll make for a smaller, easier to explore profile.
But before we do, let's use another subcommand of summarize
: diff
:
Shell session
$ summarize diff futile-1004573.mm_profdata futile-1034530.mm_profdata | head -10
+-------------------------------------------------+---------------+------------------+---------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| Item | Self Time | Self Time Change | Time | Time Change | Item count | Cache hits | Blocked time | Incremental load time | Incremental hashing time |
+-------------------------------------------------+---------------+------------------+---------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| self_profile_alloc_query_strings | +6.609373529s | +55952.80% | +6.631967425s | +56144.08% | +0 | +0 | +0ns | +0ns | +0ns |
+-------------------------------------------------+---------------+------------------+---------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| finish_ongoing_codegen | -5.700469852s | -100.00% | -5.701132355s | -100.00% | +0 | +0 | +0ns | +0ns | +0ns |
+-------------------------------------------------+---------------+------------------+---------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| LLVM_module_codegen_emit_obj | +840.972695ms | +5.20% | +840.972695ms | +5.20% | +0 | +0 | +0ns | +0ns | +0ns |
+-------------------------------------------------+---------------+------------------+---------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| LLVM_lto_optimize | +614.923743ms | +2.92% | +614.923743ms | +2.92% | +0 | +0 | +0ns | +0ns | +0ns |
(cut)
Well, it's not really a fair comparison since we passed different parameters to
self-profile
, and I guess that's what that result is showing us.
But it would be really useful if we had changed something meaningful about our
code instead!
So, let's convert to a chrome profile, ignoring anything that takes less than,
say... half a second.
Shell session
$ crox --minimum-duration 500000 futile-1034530.mm_profdata
$ ls -lhA chrome_profiler.json
-rw-r--r--. 1 amos amos 345K Dec 30 13:48 chrome_profiler.json
(Yes, --minimum-duration
takes microseconds. Yes, I had to look it up. No, I
don't trust myself not to make order-of-magnitude mistakes.)
That's a much, much smaller profile! The second thing we're doing differently
is, instead of using chrome://tracing
, we're going to use another visualizer:
Perfetto .
--minimum-duration
made the profile a lot easier to read, eliminating the
noise. Loading it in Perfetto made it much nicer to look at. And adding -Z self-profile-events=default,args
means we can see which obligation it's
evaluating, and it is:
TraitPredicate(<
futures::future::Map<warp::hyper::Server<warp::hyper::server::conn::AddrIncoming, warp::hyper::service::make::MakeServiceFn<[closure@warp::Server<warp::log::internal::WithLog<[closure@warp::log::{closure#0}], warp::filter::or::Or<warp::filter::or::Or<warp::filter::or::Or<warp::filter::map::Map<warp::filter::and::And<warp::filter::and::And<warp::filter::FilterFn<[closure@warp::filters::method::method_is<[closure@warp::get::{closure#0}]>::{closure#0}]>, warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filters::any::Any, warp::path::Exact<warp::path::internal::Opaque<serve::serve::{closure#0}::__StaticPath>>>, warp::path::Exact<warp::path::internal::Opaque<serve::serve::{closure#0}::__StaticPath>>>, warp::filter::FilterFn<[closure@warp::path::end::{closure#0}]>>>, warp::filter::map::Map<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::map::Map<warp::filters::any::Any, [closure@src/serve/mod.rs:158:38: 158:66]>, warp::filter::FilterFn<[closure@warp::path::full::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::query::raw::{closure#0}], futures::future::Ready<std::result::Result<std::string::String, warp::Rejection>>>::{closure#0}]>, [closure@src/serve/mod.rs:165:26: 168:18]>>>, warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::header::optional<std::string::String>::{closure#0}], futures::future::Ready<std::result::Result<std::option::Option<std::string::String>, warp::Rejection>>>::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::and_then::AndThen<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::body::body::{closure#0}], futures::future::Ready<std::result::Result<warp::hyper::Body, warp::Rejection>>>::{closure#0}]>, [closure@warp::body::bytes::{closure#0}]>, [closure@src/serve/mod.rs:174:26: 
180:18]>>>, [closure@src/serve/mod.rs:184:13: 207:14]>>, [closure@src/serve/mod.rs:231:14: 231:92]>, warp::filter::and_then::AndThen<warp::filter::and::And<warp::filter::FilterFn<[closure@warp::filters::method::method_is<[closure@warp::get::{closure#0}]>::{closure#0}]>, warp::filter::map::Map<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::map::Map<warp::filters::any::Any, [closure@src/serve/mod.rs:158:38: 158:66]>, warp::filter::FilterFn<[closure@warp::path::full::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::query::raw::{closure#0}], futures::future::Ready<std::result::Result<std::string::String, warp::Rejection>>>::{closure#0}]>, [closure@src/serve/mod.rs:165:26: 168:18]>>>, warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::header::optional<std::string::String>::{closure#0}], futures::future::Ready<std::result::Result<std::option::Option<std::string::String>, warp::Rejection>>>::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::and_then::AndThen<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::body::body::{closure#0}], futures::future::Ready<std::result::Result<warp::hyper::Body, warp::Rejection>>>::{closure#0}]>, [closure@warp::body::bytes::{closure#0}]>, [closure@src/serve/mod.rs:174:26: 180:18]>>>, [closure@src/serve/mod.rs:184:13: 207:14]>>, [closure@src/serve/mod.rs:235:19: 235:61]>>, warp::filter::and_then::AndThen<warp::filter::and::And<warp::filter::FilterFn<[closure@warp::filters::method::method_is<[closure@warp::head::{closure#0}]>::{closure#0}]>, warp::filter::map::Map<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::map::Map<warp::filters::any::Any, [closure@src/serve/mod.rs:158:38: 158:66]>, 
warp::filter::FilterFn<[closure@warp::path::full::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::query::raw::{closure#0}], futures::future::Ready<std::result::Result<std::string::String, warp::Rejection>>>::{closure#0}]>, [closure@src/serve/mod.rs:165:26: 168:18]>>>, warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::header::optional<std::string::String>::{closure#0}], futures::future::Ready<std::result::Result<std::option::Option<std::string::String>, warp::Rejection>>>::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::and_then::AndThen<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::body::body::{closure#0}], futures::future::Ready<std::result::Result<warp::hyper::Body, warp::Rejection>>>::{closure#0}]>, [closure@warp::body::bytes::{closure#0}]>, [closure@src/serve/mod.rs:174:26: 180:18]>>>, [closure@src/serve/mod.rs:184:13: 207:14]>>, [closure@src/serve/mod.rs:239:19: 239:61]>>, warp::filter::and_then::AndThen<warp::filter::and::And<warp::filter::FilterFn<[closure@warp::filters::method::method_is<[closure@warp::post::{closure#0}]>::{closure#0}]>, warp::filter::map::Map<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::map::Map<warp::filters::any::Any, [closure@src/serve/mod.rs:158:38: 158:66]>, warp::filter::FilterFn<[closure@warp::path::full::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::query::raw::{closure#0}], futures::future::Ready<std::result::Result<std::string::String, warp::Rejection>>>::{closure#0}]>, [closure@src/serve/mod.rs:165:26: 168:18]>>>, warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::header::optional<std::string::String>::{closure#0}], 
futures::future::Ready<std::result::Result<std::option::Option<std::string::String>, warp::Rejection>>>::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::and_then::AndThen<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::body::body::{closure#0}], futures::future::Ready<std::result::Result<warp::hyper::Body, warp::Rejection>>>::{closure#0}]>, [closure@warp::body::bytes::{closure#0}]>, [closure@src/serve/mod.rs:174:26: 180:18]>>>, [closure@src/serve/mod.rs:184:13: 207:14]>>, [closure@src/serve/mod.rs:243:19: 243:62]>>>>::bind_ephemeral<std::net::SocketAddr>::{closure#1}::{closure#0}]>>, [closure@warp::Server<warp::log::internal::WithLog<[closure@warp::log::{closure#0}], warp::filter::or::Or<warp::filter::or::Or<warp::filter::or::Or<warp::filter::map::Map<warp::filter::and::And<warp::filter::and::And<warp::filter::FilterFn<[closure@warp::filters::method::method_is<[closure@warp::get::{closure#0}]>::{closure#0}]>, warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filters::any::Any, warp::path::Exact<warp::path::internal::Opaque<serve::serve::{closure#0}::__StaticPath>>>, warp::path::Exact<warp::path::internal::Opaque<serve::serve::{closure#0}::__StaticPath>>>, warp::filter::FilterFn<[closure@warp::path::end::{closure#0}]>>>, warp::filter::map::Map<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::map::Map<warp::filters::any::Any, [closure@src/serve/mod.rs:158:38: 158:66]>, warp::filter::FilterFn<[closure@warp::path::full::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::query::raw::{closure#0}], futures::future::Ready<std::result::Result<std::string::String, warp::Rejection>>>::{closure#0}]>, [closure@src/serve/mod.rs:165:26: 168:18]>>>, 
warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::header::optional<std::string::String>::{closure#0}], futures::future::Ready<std::result::Result<std::option::Option<std::string::String>, warp::Rejection>>>::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::and_then::AndThen<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::body::body::{closure#0}], futures::future::Ready<std::result::Result<warp::hyper::Body, warp::Rejection>>>::{closure#0}]>, [closure@warp::body::bytes::{closure#0}]>, [closure@src/serve/mod.rs:174:26: 180:18]>>>, [closure@src/serve/mod.rs:184:13: 207:14]>>, [closure@src/serve/mod.rs:231:14: 231:92]>, warp::filter::and_then::AndThen<warp::filter::and::And<warp::filter::FilterFn<[closure@warp::filters::method::method_is<[closure@warp::get::{closure#0}]>::{closure#0}]>, warp::filter::map::Map<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::map::Map<warp::filters::any::Any, [closure@src/serve/mod.rs:158:38: 158:66]>, warp::filter::FilterFn<[closure@warp::path::full::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::query::raw::{closure#0}], futures::future::Ready<std::result::Result<std::string::String, warp::Rejection>>>::{closure#0}]>, [closure@src/serve/mod.rs:165:26: 168:18]>>>, warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::header::optional<std::string::String>::{closure#0}], futures::future::Ready<std::result::Result<std::option::Option<std::string::String>, warp::Rejection>>>::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::and_then::AndThen<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::body::body::{closure#0}], futures::future::Ready<std::result::Result<warp::hyper::Body, warp::Rejection>>>::{closure#0}]>, 
[closure@warp::body::bytes::{closure#0}]>, [closure@src/serve/mod.rs:174:26: 180:18]>>>, [closure@src/serve/mod.rs:184:13: 207:14]>>, [closure@src/serve/mod.rs:235:19: 235:61]>>, warp::filter::and_then::AndThen<warp::filter::and::And<warp::filter::FilterFn<[closure@warp::filters::method::method_is<[closure@warp::head::{closure#0}]>::{closure#0}]>, warp::filter::map::Map<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::map::Map<warp::filters::any::Any, [closure@src/serve/mod.rs:158:38: 158:66]>, warp::filter::FilterFn<[closure@warp::path::full::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::query::raw::{closure#0}], futures::future::Ready<std::result::Result<std::string::String, warp::Rejection>>>::{closure#0}]>, [closure@src/serve/mod.rs:165:26: 168:18]>>>, warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::header::optional<std::string::String>::{closure#0}], futures::future::Ready<std::result::Result<std::option::Option<std::string::String>, warp::Rejection>>>::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::and_then::AndThen<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::body::body::{closure#0}], futures::future::Ready<std::result::Result<warp::hyper::Body, warp::Rejection>>>::{closure#0}]>, [closure@warp::body::bytes::{closure#0}]>, [closure@src/serve/mod.rs:174:26: 180:18]>>>, [closure@src/serve/mod.rs:184:13: 207:14]>>, [closure@src/serve/mod.rs:239:19: 239:61]>>, warp::filter::and_then::AndThen<warp::filter::and::And<warp::filter::FilterFn<[closure@warp::filters::method::method_is<[closure@warp::post::{closure#0}]>::{closure#0}]>, warp::filter::map::Map<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::and::And<warp::filter::map::Map<warp::filters::any::Any, 
[closure@src/serve/mod.rs:158:38: 158:66]>, warp::filter::FilterFn<[closure@warp::path::full::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::query::raw::{closure#0}], futures::future::Ready<std::result::Result<std::string::String, warp::Rejection>>>::{closure#0}]>, [closure@src/serve/mod.rs:165:26: 168:18]>>>, warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::header::optional<std::string::String>::{closure#0}], futures::future::Ready<std::result::Result<std::option::Option<std::string::String>, warp::Rejection>>>::{closure#0}]>>, warp::filter::unify::Unify<warp::filter::recover::Recover<warp::filter::and_then::AndThen<warp::filter::FilterFn<[closure@warp::filter::filter_fn_one<[closure@warp::body::body::{closure#0}], futures::future::Ready<std::result::Result<warp::hyper::Body, warp::Rejection>>>::{closure#0}]>, [closure@warp::body::bytes::{closure#0}]>, [closure@src/serve/mod.rs:174:26: 180:18]>>>, [closure@src/serve/mod.rs:184:13: 207:14]>>, [closure@src/serve/mod.rs:243:19: 243:62]>>>>::bind_ephemeral<std::net::SocketAddr>::{closure#0}]>
as
warp::Future
>)
That is... big.
And that's from warp , the web server framework I'm
using in futile.
rustc has a built-in profiler (-Z self-profile
), and its output can be
visualized in a multitude of ways. The
measureme repository contains
a summarize
tool, which shows a table in the CLI, a flamegraph
tool which
generates a flamegraph ,
and crox
, which converts the profile to Chrome's tracing format.
Those can be viewed in chrome://tracing
in a Chromium-based browser, on
Perfetto , Speedscope ,
and more!
...I didn't actually trust warp
that much. It's perfectly fine! It just... has
a tendency to make for very very long types, like the one we see above. Just
like tower .
But I hadn't noticed those long compile times before... did something change
recently?
Let's try Rust 1.54.0 just to check. Because rustup
makes it unreasonably
easy!
Shell session
$ rustup install 1.54.0
$ cargo clean
$ cargo +1.54.0 build --release
(cut)
(Note: I had to commit a few crimes to get 1.54.0 to build my project since I've
already updated a couple crates to Rust edition 2021 - the RUSTC_BOOTSTRAP
escape
hatch + manually adding cargo-features = ["edition2021"]
did the trick).
Here are the timings:
Cold 1.54.0 build: 1m12s (vs 2m04s for 1.57.0)
Hot 1.54.0 build: 18.07s (vs 1m11s for 1.57.0)
Yeah. It's a pretty bad regression. And it is
known .
Apparently Rust 1.58 should improve the situation. Does that mean... nightly
should do better?
Shell session
$ rustup install nightly
$ cargo clean
$ cargo +nightly build --release
(cut)
Cold 2021-12-19 build: 1m05s (vs 2m04s for 1.57.0)
Hot 2021-12-19 build: 15.25s (vs 1m11s for 1.57.0)
That is much, much better.
So what, we just switch to nightly?
Well, we could! It would be as simple as pinning that nightly version in our
rust-toolchain.toml
, i.e. changing:
TOML markup
[toolchain]
channel = "1.57.0"
components = ["rustfmt", "clippy"]
To:
TOML markup
[toolchain]
channel = "nightly-2021-12-19"
components = ["rustfmt", "clippy"]
But there's probably a workaround. Warp provides Filter::boxed, and boxing is
a great way to hide actual types. That will mean more heap allocations, but I
think I can take the performance hit, so let's give it a try.
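To see why boxing helps, here's a tiny standalone illustration of type erasure. This is nothing warp-specific (warp's Filter::boxed applies the same idea to filters); it's just the general mechanism:

```rust
use std::fmt::Display;

// Without boxing: the return type is some concrete, possibly very long
// type — `impl Display` hides it from us, but not from the compiler.
fn make_greeting() -> impl Display {
    "hello"
}

// With boxing: the concrete type is erased behind `dyn Display`, so
// everything downstream only ever sees `Box<dyn Display>` — a short,
// fixed type name, whatever's inside.
fn make_boxed_greeting() -> Box<dyn Display> {
    Box::new("hello")
}

fn main() {
    println!("{}", make_greeting());
    println!("{}", make_boxed_greeting());
}
```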
Right now, the code I'm using to compose all routes looks like this:
Rust code
let livereload_route = method::get()
    .and(warp::path!("api" / "livereload"))
    .and(with_cx())
    .map(|cx: Context| warp::sse::reply(warp::sse::keep_alive().stream(sse_events(cx))));

let catchall_get = method::get()
    .and(with_cx())
    .and_then(|cx: Context| cx.handle(routes::serve_get));

let catchall_head = method::head()
    .and(with_cx())
    .and_then(|cx: Context| cx.handle(routes::serve_get));

let catchall_post = method::post()
    .and(with_cx())
    .and_then(|cx: Context| cx.handle(routes::serve_post));

let all_routes = livereload_route
    .or(catchall_get)
    .or(catchall_head)
    .or(catchall_post);

let access_log = warp::filters::log::log("access");

let addr: SocketAddr = config.address.parse()?;
warp::serve(all_routes.with(access_log)).run(addr).await;
It's not very warp-y, since I do some routing in the catchall_{get,head,post}
functions. You can see I've already been trying to stay away from large types,
and kinda fighting off "the warp way". Probably a good sign that the framework
just isn't for me - I'm excited to try out axum.
Note that there's a bunch of methods and closures not shown here: they deal with
server-sent events , passing
around a Context
struct (a reference-counted state of the server), dealing with
cookies, extracting the query string, etc.
Simply peppering a few .boxed()
...
Rust code
let all_routes = livereload_route
    .boxed()
    .or(catchall_get.boxed())
    .or(catchall_head.boxed())
    .or(catchall_post.boxed());
...brought down the build times significantly.
Cold boxed build: 1m06s (vs 2m04s non-boxed)
Hot boxed build: 14s (vs 1m11s non-boxed)
Now that we've got a new standard for our build times, let's try our other
tricks again and see what change they make!
Because 14s is much much better than 1m11s, but it's already more than I'm
comfortable with.
First, let's get a rough idea of where we're spending our time now.
Shell session
$ RUSTC_BOOTSTRAP=1 cargo rustc --release -- -Z self-profile -Z self-profile-events=default,args
(cut)
$ crox --minimum-duration 500000 futile-1118608.mm_profdata
Okay! We're spending most of our time in codegen and LLVM: linking only accounts
for about two seconds, if we're to trust that link
item (and even then, it's
spending time in finish_ongoing_codegen, which sounds like it's just waiting
for the other parts to be finished).
First, instead of doing a release build, let's try a debug build: most
significantly, this should 1) enable incremental compilation, 2) bump codegen
units from 16 to 256, and 3) disable optimizations.
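For reference, here's roughly what the built-in dev profile amounts to, spelled out explicitly (values as documented in the Cargo book; the release profile uses opt-level = 3 and 16 codegen units instead):

```toml
# roughly what `[profile.dev]` gives you out of the box
[profile.dev]
opt-level = 0        # no optimizations
debug = true         # full debug info (same as `debug = 2`)
incremental = true   # incremental compilation
codegen-units = 256  # many parallel (but less-optimized) codegen units
```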
Cold debug build: 36.44s (vs 1m06s release)
Hot debug build: 3.05s (vs 14s release)
That's more like it!!! Three seconds is super acceptable honestly.
Let's look where it's spending time again:
We've now reached the point where profiling is interfering with the build
(making it more than 4x slower), and where a minimum duration of 500ms is hiding
a lot of important information.
Let's try again with a minimum duration of 200ms:
Okay, so, still spending most of our time in codegen and LLVM.
The new iteration time to beat is still 3s.
The dev cargo profile defaults to debug = true, which is to say, debug = 2.
Can we do better if we set it to 1?
Cold debug = 1 build: 33.86s (vs 36.44s for debug = 2)
Hot debug = 1 build: 3.03s (vs 3.05s for debug = 2)
Not a huge difference.
Okay well... is linking only taking ~961ms like it says it does?
Let's try GNU ld, which should be worse. (I just commented out the relevant
section of my .cargo/config.toml
):
Cold "GNU ld" build: 41.50s (vs 36.44s "LLVM lld")
Hot "GNU ld" build: 7.77s (vs 3.05s "LLVM lld")
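For reference, opting into lld through .cargo/config.toml usually looks something like this (the target triple here is an assumption — adjust for your platform):

```toml
# in `.cargo/config.toml`
[target.x86_64-unknown-linux-gnu]
rustflags = ["-C", "link-arg=-fuse-ld=lld"]
```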
Okay, lld is definitely doing a bunch of work for us here. How about mold?
Shell session
$ cargo clean
$ mold -run cargo build
(cut)
Cold "mold" build: 37.18s (vs 36.44s "LLVM lld")
Hot "mold" build: 2.96s (vs 3.05s "LLVM lld")
No significant difference here. There would probably be a difference if the
linker were doing more work, but it seems lld is already doing great.
What if we disabled LTO entirely? Even thin-local LTO?
TOML markup
# in `futile/Cargo.toml`
[profile.dev]
lto = "off"
Cold "lto = off" build: 35.96s (vs 36.44s "thin-local LTO")
Hot "lto = off" build: 3.16s (vs 3.05s "thin-local LTO")
No significant difference here.
What else can we do? We've talked about splitting the bin crate into multiple
crates, and it being a lot of work... but here's a split that's easy to do:
we turn it into a lib
crate and add a bin
to it.
The bin
part will simply be in charge of setting up error handling, tracing,
and parsing command-line arguments, and it'll rely on the lib
part for any
actual work.
I won't go into too many details here, but the basic idea is to rename
src/main.rs
to src/lib.rs
, then create src/bin/futile.rs
, and move some
code from the former to the latter.
Also, to add this to the Cargo.toml
file:
TOML markup
[[bin]]
name = "futile"
path = "src/bin/futile.rs"
And now, well, it'll be hard to compare but let's try anyway:
Cold build: 36.26s (vs 36.44s "just-bin")
Hot "changed bin" build: 2.47s (vs Hot 3.05s "just-bin")
Hot "changed lib" build: 3.63s (vs Hot 3.05s "just-bin")
Mhh. When we change the lib, it's blocked on building the lib before
it can build the bin again, which I think explains the longer times.
But when we change the bin, the whole lib doesn't need to be touched
whatsoever, just linked, which explains the faster times.
And with mold?
Cold build: 37.53s (vs 36.26s "LLVM lld")
Hot "changed bin": 2.29s (vs 2.47s "LLVM lld")
Hot "changed lib": 3.62s (vs 3.63s "LLVM lld")
Again, mold hardly makes a difference here: I think the linking is really just
cobbling together two static libraries (one for the lib, one for the bin) and
a handful of dynamic libraries (libc, libm, etc.), so there's not a lot of
work to do.
I gave it a couple hours, and now I have a lot of crates instead of just one!
TOML markup
# in the top-level `Cargo.toml`
[workspace]
members = [
    "crates/futile",
    "crates/futile-db",
    "crates/futile-config",
    "crates/futile-templating",
    "crates/futile-backtrace-printer",
    "crates/futile-highlight",
    "crates/futile-content",
    "crates/futile-frontmatter",
    "crates/futile-extract-text",
    "crates/futile-reading-time",
    "crates/futile-parent-path",
    "crates/futile-asset-rewriter",
    "crates/futile-friendcodes",
    "crates/futile-reddit",
    "crates/futile-patreon",
]

[profile.release]
debug = 1
It's not perfect — futile-templating, for example, is one of the largest bits:
Shell session
$ for i in crates/*; do echo $i $(tokei $i -o json | jq .Rust.code); done
crates/futile 1472
crates/futile-asset-rewriter 433
crates/futile-backtrace-printer 12
crates/futile-config 158
crates/futile-content 769
crates/futile-db 818
crates/futile-extract-text 26
crates/futile-friendcodes 78
crates/futile-frontmatter 105
crates/futile-highlight 533
crates/futile-parent-path 10
crates/futile-patreon 522
crates/futile-query 14
crates/futile-reading-time 6
crates/futile-reddit 69
crates/futile-templating 1513
And I don't love how the dependency graph looks:
Shell session
$ cargo install cargo-deps
$ cargo deps --filter $(cargo metadata --format-version 1 | jq '.workspace_members[]' -r | cut -d ' ' -f 1 | tr '\n' ' ') | dot -Tsvg > depgraph.svg
But in my defense, there's definitely a way to lay that out that's less
confusing.
Let's look at build times!
Cold build time: 35.28s (vs 37.53s bin+lib crate)
Hot build time (changed futile-extract-text): 4.22s
Hot build time (changed futile-content): 3.97s
Hot build time (changed futile-templating): 3.74s
Hot build time (changed bin): 2.70s
Well. At least with incremental builds, this didn't change much.
Let's try release builds again maybe?
Cold release build time: 1m03s (vs 1m06s)
Hot release build time (changed futile-extract-text): 12s (vs 14s)
Hot release build time (changed futile-content): 11.87s (vs 14s)
Hot release build time (changed futile-templating): 11.94s (vs 14s)
Hot release build time (changed bin): 8s (vs 14s)
That's interesting! Does linking dominate those times? I wonder. Let's try
with mold
again, since there's probably more linking work to be done there.
Cold release mold: 1m04s (vs 1m06s)
Hot release mold (futile-extract-text): 12s (vs 14s)
Hot release mold (futile-content): 10.23s (vs 14s)
Hot release mold (futile-templating): 11.84s (vs 14s)
Hot release mold (bin): 8.10s (vs 14s)
Not super different from lld: again, I guess we don't have a lot of linking to
do.
Let's look again at a rustc profile:
Mh! It seems we're spending a bunch of time "monomorphizing", which is when we
turn something like:
Rust code
fn add<T>(a: T, b: T) -> T
where
    T: Add,
{
    a + b
}
Into this:
Rust code
fn add_usize(a: usize, b: usize) -> usize {
    a + b
}
Except with much bigger types. And many of these.
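Here's a toy illustration of the trade-off (nothing from futile, just the general idea): generics get monomorphized into one copy per concrete type, while dynamic dispatch keeps a single copy behind a vtable.

```rust
use std::fmt::Debug;
use std::ops::Add;

// Generic: the compiler stamps out one copy of `add` per concrete `T`
// it's instantiated with — that's monomorphization, and every copy is
// more LLVM IR for the backend to chew on.
fn add<T: Add<Output = T>>(a: T, b: T) -> T {
    a + b
}

// Dynamic dispatch: a single compiled copy, whatever we pass, at the
// cost of a vtable call at runtime.
fn debug_string(x: &dyn Debug) -> String {
    format!("{:?}", x)
}

fn main() {
    // Two separate instantiations: `add::<usize>` and `add::<f64>`.
    println!("{}", add(1usize, 2usize));
    println!("{}", add(1.5f64, 2.5f64));
    // One copy of `debug_string` serves both calls.
    println!("{}", debug_string(&42));
    println!("{}", debug_string(&"hi"));
}
```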
I'll leave you with a last tip: another factor in build time is the sheer amount
of LLVM IR being generated, and you can easily look at that with
cargo-llvm-lines .
For best results, we'll want to pass a couple of other flags.
-Zsymbol-mangling-version=v0
enables a new mangling
scheme
which should be the default soon, and -Z build-std
instructs cargo to build
libstd from source, instead of using the prebuilt version that comes with our
rust distribution.
Let's try it out!
Shell session
$ cd crates/futile
$ RUSTFLAGS=-Zsymbol-mangling-version=v0 cargo -Z build-std llvm-lines --bin futile | head -20
Compiling futile v1.9.0 (/home/amos/bearcove/futile/crates/futile)
warning: `futile` (bin "futile") generated 2 warnings
Finished dev [unoptimized + debuginfo] target(s) in 6.30s
Lines Copies Function name
----- ------ -------------
304016 (100%) 6746 (100%) (TOTAL)
2665 (0.9%) 1 (0.0%) futile[507d45e82218702e]::serve::routes::revision_routes::serve_single::{closure#0}
2462 (0.8%) 1 (0.0%) <futile[507d45e82218702e]::serve::Context>::serve_template::{closure#0}
2222 (0.7%) 1 (0.0%) <hyper[cbe7455c31603e19]::proto::h2::server::Serving<hyper[cbe7455c31603e19]::common::io::rewind::Rewind<hyper[cbe7455c31603e19]::server::tcp::addr_stream::AddrStream>, hyper[cbe7455c31603e19]::body::body::Body>>::poll_server::<hyper[cbe7455c31603e19]::service::util::ServiceFn<<warp[d2eac20ce511edd0]::server::Server<warp[d2eac20ce511edd0]::filters::log::internal::WithLog<warp[d2eac20ce511edd0]::filters::log::log::{closure#0}, warp[d2eac20ce511edd0]::filter::or::Or<warp[d2eac20ce511edd0]::filter::or::Or<warp[d2eac20ce511edd0]::filter::or::Or<warp[d2eac20ce511edd0]::filter::boxed::BoxedFilter<(warp[d2eac20ce511edd0]::filters::sse::SseReply<warp[d2eac20ce511edd0]::filters::sse::SseKeepAlive<futures_util[db872dedfe3ac01e]::stream::stream::chain::Chain<futures_util[db872dedfe3ac01e]::stream::iter::Iter<alloc[f5d0236c37905b20]::vec::into_iter::IntoIter<core[49c27447ffaf9e48]::result::Result<warp[d2eac20ce511edd0]::filters::sse::Event, core[49c27447ffaf9e48]::convert::Infallible>>>, futures_util[db872dedfe3ac01e]::stream::stream::filter_map::FilterMap<tokio_stream[b4aa1729856eb832]::wrappers::broadcast::BroadcastStream<alloc[f5d0236c37905b20]::string::String>, core[49c27447ffaf9e48]::future::from_generator::GenFuture<futile[507d45e82218702e]::serve::serve::{closure#0}::sse_events::{closure#0}::{closure#0}>, futile[507d45e82218702e]::serve::serve::{closure#0}::sse_events::{closure#0}>>>>,)>, warp[d2eac20ce511edd0]::filter::boxed::BoxedFilter<(alloc[f5d0236c37905b20]::boxed::Box<dyn warp[d2eac20ce511edd0]::reply::Reply>,)>>, warp[d2eac20ce511edd0]::filter::boxed::BoxedFilter<(alloc[f5d0236c37905b20]::boxed::Box<dyn warp[d2eac20ce511edd0]::reply::Reply>,)>>, warp[d2eac20ce511edd0]::filter::boxed::BoxedFilter<(alloc[f5d0236c37905b20]::boxed::Box<dyn warp[d2eac20ce511edd0]::reply::Reply>,)>>>>>::bind_ephemeral<std[5c4e0e91f40690d7]::net::addr::SocketAddr>::{closure#1}::{closure#0}::{closure#0}, hyper[cbe7455c31603e19]::body::body::Body>, 
hyper[cbe7455c31603e19]::common::exec::Exec>
2218 (0.7%) 1 (0.0%) <hyper[cbe7455c31603e19]::proto::h2::server::H2Stream<warp[d2eac20ce511edd0]::filter::service::FilteredFuture<warp[d2eac20ce511edd0]::filters::log::internal::WithLogFuture<warp[d2eac20ce511edd0]::filters::log::log::{closure#0}, warp[d2eac20ce511edd0]::filter::or::EitherFuture<warp[d2eac20ce511edd0]::filter::or::Or<warp[d2eac20ce511edd0]::filter::or::Or<warp[d2eac20ce511edd0]::filter::boxed::BoxedFilter<(warp[d2eac20ce511edd0]::filters::sse::SseReply<warp[d2eac20ce511edd0]::filters::sse::SseKeepAlive<futures_util[db872dedfe3ac01e]::stream::stream::chain::Chain<futures_util[db872dedfe3ac01e]::stream::iter::Iter<alloc[f5d0236c37905b20]::vec::into_iter::IntoIter<core[49c27447ffaf9e48]::result::Result<warp[d2eac20ce511edd0]::filters::sse::Event, core[49c27447ffaf9e48]::convert::Infallible>>>, futures_util[db872dedfe3ac01e]::stream::stream::filter_map::FilterMap<tokio_stream[b4aa1729856eb832]::wrappers::broadcast::BroadcastStream<alloc[f5d0236c37905b20]::string::String>, core[49c27447ffaf9e48]::future::from_generator::GenFuture<futile[507d45e82218702e]::serve::serve::{closure#0}::sse_events::{closure#0}::{closure#0}>, futile[507d45e82218702e]::serve::serve::{closure#0}::sse_events::{closure#0}>>>>,)>, warp[d2eac20ce511edd0]::filter::boxed::BoxedFilter<(alloc[f5d0236c37905b20]::boxed::Box<dyn warp[d2eac20ce511edd0]::reply::Reply>,)>>, warp[d2eac20ce511edd0]::filter::boxed::BoxedFilter<(alloc[f5d0236c37905b20]::boxed::Box<dyn warp[d2eac20ce511edd0]::reply::Reply>,)>>, warp[d2eac20ce511edd0]::filter::boxed::BoxedFilter<(alloc[f5d0236c37905b20]::boxed::Box<dyn warp[d2eac20ce511edd0]::reply::Reply>,)>>>>, hyper[cbe7455c31603e19]::body::body::Body>>::poll2
2191 (0.7%) 1 (0.0%) futile[507d45e82218702e]::serve::serve::{closure#0}
2080 (0.7%) 1 (0.0%) <futile_patreon[a3d798751567a674]::FutileCredentials>::load_from_cookies::{closure#0}
1871 (0.6%) 1 (0.0%) <futile_patreon[a3d798751567a674]::PatreonCredentials>::to_futile_credentials_once::{closure#0}
1579 (0.5%) 1 (0.0%) futile[507d45e82218702e]::serve::routes::revision_routes::serve::{closure#0}
1494 (0.5%) 1 (0.0%) futile[507d45e82218702e]::serve::routes::login::serve_patreon_oauth::{closure#0}
1438 (0.5%) 1 (0.0%) <hyper[cbe7455c31603e19]::proto::h2::PipeToSendStream<hyper[cbe7455c31603e19]::body::body::Body> as core[49c27447ffaf9e48]::future::future::Future>::poll
1387 (0.5%) 94 (1.4%) warp::server::Server<warp::filters::log::internal::WithLog<[closure@warp::filters::log::log::{closure#0}], warp::filter::or::Or<warp::filter::or::Or<warp::filter::or::Or<warp::filter::boxed::BoxedFilter<
1329 (0.4%) 1 (0.0%) futile[507d45e82218702e]::serve::routes::serve_get::{closure#0}
1311 (0.4%) 1 (0.0%) <hyper[cbe7455c31603e19]::proto::h1::dispatch::Dispatcher<hyper[cbe7455c31603e19]::proto::h1::dispatch::Server<hyper[cbe7455c31603e19]::service::util::ServiceFn<<warp[d2eac20ce511edd0]::server::Server<warp[d2eac20ce511edd0]::filters::log::internal::WithLog<warp[d2eac20ce511edd0]::filters::log::log::{closure#0}, warp[d2eac20ce511edd0]::filter::or::Or<warp[d2eac20ce511edd0]::filter::or::Or<warp[d2eac20ce511edd0]::filter::or::Or<warp[d2eac20ce511edd0]::filter::boxed::BoxedFilter<(warp[d2eac20ce511edd0]::filters::sse::SseReply<warp[d2eac20ce511edd0]::filters::sse::SseKeepAlive<futures_util[db872dedfe3ac01e]::stream::stream::chain::Chain<futures_util[db872dedfe3ac01e]::stream::iter::Iter<alloc[f5d0236c37905b20]::vec::into_iter::IntoIter<core[49c27447ffaf9e48]::result::Result<warp[d2eac20ce511edd0]::filters::sse::Event, core[49c27447ffaf9e48]::convert::Infallible>>>, futures_util[db872dedfe3ac01e]::stream::stream::filter_map::FilterMap<tokio_stream[b4aa1729856eb832]::wrappers::broadcast::BroadcastStream<alloc[f5d0236c37905b20]::string::String>, core[49c27447ffaf9e48]::future::from_generator::GenFuture<futile[507d45e82218702e]::serve::serve::{closure#0}::sse_events::{closure#0}::{closure#0}>, futile[507d45e82218702e]::serve::serve::{closure#0}::sse_events::{closure#0}>>>>,)>, warp[d2eac20ce511edd0]::filter::boxed::BoxedFilter<(alloc[f5d0236c37905b20]::boxed::Box<dyn warp[d2eac20ce511edd0]::reply::Reply>,)>>, warp[d2eac20ce511edd0]::filter::boxed::BoxedFilter<(alloc[f5d0236c37905b20]::boxed::Box<dyn warp[d2eac20ce511edd0]::reply::Reply>,)>>, warp[d2eac20ce511edd0]::filter::boxed::BoxedFilter<(alloc[f5d0236c37905b20]::boxed::Box<dyn warp[d2eac20ce511edd0]::reply::Reply>,)>>>>>::bind_ephemeral<std[5c4e0e91f40690d7]::net::addr::SocketAddr>::{closure#1}::{closure#0}::{closure#0}, hyper[cbe7455c31603e19]::body::body::Body>, hyper[cbe7455c31603e19]::body::body::Body>, hyper[cbe7455c31603e19]::body::body::Body, 
hyper[cbe7455c31603e19]::server::tcp::addr_stream::AddrStream, hyper[cbe7455c31603e19]::proto::h1::role::Server>>::poll_write
1252 (0.4%) 72 (1.1%) warp::filters::log::log::{closure#0}], warp::filter::or::EitherFuture<warp::filter::or::Or<warp::filter::or::Or<warp::filter::boxed::BoxedFilter<
1249 (0.4%) 1 (0.0%) <color_backtrace[f3bfce8006cb17f7]::Frame>::print::<futile[507d45e82218702e]::serve::html_color_output::HtmlColorOutput>
1249 (0.4%) 1 (0.0%) <color_backtrace[f3bfce8006cb17f7]::Frame>::print::<termcolor[ed53d2579171698e]::StandardStream>
1228 (0.4%) 1 (0.0%) <hyper[cbe7455c31603e19]::proto::h1::io::Buffered<hyper[cbe7455c31603e19]::server::tcp::addr_stream::AddrStream, hyper[cbe7455c31603e19]::proto::h1::encode::EncodedBuf<bytes[65b0fce232ff3b0d]::bytes::Bytes>>>::parse::<hyper[cbe7455c31603e19]::proto::h1::role::Server>
What to do with those results? Well, nothing looks out of the ordinary here. If
a single symbol was responsible for say, >10% of LLVM lines (left column), we'd
definitely want to take a closer look at it. But from the looks of this, we're
fine.
I hope you had fun learning about all this! If nothing else, you should now
have a lot of places to look when you want to make your builds faster.
You can go deeper than this still! I mentioned in How I learned to love build
systems that some companies don't
even use cargo
to build their Rust code, and I've recently been experimenting
with that myself — and it's quite usable, even when you're not a megacorp.
Let me know if you'd like to read more about this: I realized that I've
accidentally acquired some expertise in release engineering, so now I'm bound
to get more questions about this.
Until next time, take care!