Thanks to my sponsors: Eugene Bulkin, Berkus Decker, James Rhodes, Chris Biscardi, Romain Ruetschi, René Ribaud, compwhizii, James Leitch, David Souther, Daniel Wagner-Hall, Nyefan, Ivo Murrell, Duane Sibilly, Horváth-Lázár Péter, Matěj Volf, Geoff Cant, Mathias Brossard, notryanb, Philipp Hatt, Kristoffer Winther Balling and 254 more
Profiling linkers
👋 This page was last updated ~3 years ago. Just so you know.
In the wake of Why is my Rust build so
slow?, developers from the mold
and
lld
linkers reached
out,
wondering why using their linker didn't make a big difference.
Of course the answer was "there's just not that much linking to do", and so any
difference between mold
and lld
was within a second. GNU ld was lagging way
behind, at four seconds or so.
But in their quest for linker performance (I love that this is a thing, by the way), they told me how to profile them!
Profiling mold
The way I usually run mold is simply by wrapping the cargo
invocation with
mold -run
:
$ mold -run cargo build
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished dev [unoptimized + debuginfo] target(s) in 1.74s
I can then verify that it's actually been built with mold
by reading ELF
comment sections with readelf
:
$ readelf -p .comment target/debug/salvage
String dump of section '.comment':
[ 0] GCC: (GNU) 11.2.1 20211203 (Red Hat 11.2.1-7)
[ 2e] mold 1.0.0 (ed9924895d9b9584106791247596677db8113528; compatible with GNU ld and GNU gold)
[ 89] GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
[ b3] clang version 12.0.1
mold
's profiling option is -perf
, but I couldn't find a way to use it in
conjunction with -run
, so, I did the next best thing: a wrapper script.
In ~/mold-wrapper
, I put:
#!/bin/bash
mold -perf "$@" 2>&1 >> /tmp/mold-log.txt
Here, "$@"
forwards all arguments from our script to mold. 2>&1
redirects
input from standard error (stderr
, file descriptor 2), to standard output
(stdout
, file descriptor 1), and then we append (>>
) to a file at
/tmp/mold-log.txt
.
Since it didn't make a big difference for futile
, I'll try it on another
project of mine, salvage. Let's also try release builds instead of debug builds!
First with GNU ld:
# in `.cargo/config.toml`
[target.x86_64-unknown-linux-gnu]
# use the default linker: here it's GNU ld
rustflags = []
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 6.99s
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 7.04s
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 6.91s
Okay, so 7 seconds is our baseline.
Now with mold-wrapper:
# in `.cargo/config.toml`
[target.x86_64-unknown-linux-gnu]
rustflags = [
"-C", "link-arg=-fuse-ld=/home/amos/mold-wrapper",
]
$ cargo build
(cut)
= note: cc: error: unrecognized command-line option '-fuse-ld=/home/amos/mold-wrapper'
Oh no! What's happening?
Ah. Well... "linker" can mean many things. When running executables, a dynamic linker is involved, like we discussed in making our own executable packer.
Here it's definitely a "build-time linker", but really there's, I think: a
linker driver, and the actual linker. The linker driver, by default here, is
cc
, which is provided by GCC. It accepts the -fuse-ld=flabazorp
option to
use the flabazorp
linker.
Except GCC has an allowlist of values it'll accept for -fuse-ld
. If we use
clang
as a linker driver, however, we can specify a path to whichever linker
we want, even our wrapper script:
# in `.cargo/config.toml`
[target.x86_64-unknown-linux-gnu]
linker = "clang"
rustflags = [
"-C", "link-arg=-fuse-ld=/home/amos/mold-wrapper",
]
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 5.17s
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 5.12s
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 5.10s
Okay, we're closer to 5 seconds - almost 2 seconds faster!
Let's look at what we have in /tmp/mold-log.txt
:
User System Real Name
0.800 5.360 0.392 all
0.171 0.515 0.177 read_input_files
0.068 2.334 0.101 register_section_pieces
0.564 2.512 0.114 total
0.380 0.102 0.021 before_copy
0.000 0.000 0.000 apply_exclude_libs
0.062 0.000 0.002 resolve_obj_symbols
0.000 0.000 0.000 eliminate_comdats
0.003 0.002 0.000 convert_common_symbols
0.000 0.000 0.000 apply_version_script
0.000 0.000 0.000 compute_import_export
0.113 0.050 0.006 gc
0.005 0.015 0.001 mark_nonalloc_fragments
0.037 0.028 0.002 collect_root_set
0.066 0.000 0.002 mark
0.005 0.007 0.001 sweep
0.103 0.003 0.004 compute_merged_section_sizes
0.098 0.003 0.004 MergedSection assign_offsets
0.002 0.015 0.001 bin_sections
0.000 0.002 0.001 claim_unresolved_symbols
0.003 0.020 0.000 check_dup_syms
0.000 0.000 0.000 sort_init_fini
0.002 0.004 0.001 compute_section_sizes
0.064 0.002 0.003 scan_rels
0.000 0.000 0.000 DynsymSection::finalize
0.000 0.000 0.000 fill_verdef
0.000 0.001 0.000 fill_verneed
0.022 0.000 0.001 compute_symtab
0.005 0.001 0.001 eh_frame
0.000 0.000 0.000 osec_offset
0.000 0.001 0.001 open_file
0.184 2.407 0.091 copy
0.102 2.400 0.084 copy_buf
0.000 0.000 0.000 (header)
0.000 0.000 0.000 .fini_array
0.000 0.000 0.000 (header)
0.000 0.000 0.000 .interp
0.000 0.000 0.000 .note.gnu.build-id
0.088 2.073 0.070 .debug_line
0.000 0.000 0.000 .note.ABI-tag
0.000 0.000 0.000 .rela.dyn
0.000 0.002 0.000 .init_array
0.000 0.000 0.000 .strtab
0.006 0.125 0.004 .data
0.000 0.000 0.000 .shstrtab
0.100 2.399 0.082 .debug_str
0.000 0.000 0.000 .got.plt
0.000 0.000 0.000 .rela.plt
0.007 0.145 0.005 .debug_abbrev
0.064 1.271 0.044 .symtab
0.000 0.000 0.000 .debug_frame
0.089 2.291 0.076 .debug_info
0.000 0.000 0.000 .gnu.build.attributes.startup
0.000 0.001 0.000 .gnu.build.attributes.exit
0.102 2.399 0.084 .debug_pubnames
0.000 0.001 0.000 .dynsym
0.000 0.000 0.000 .bss
0.000 0.000 0.000 .comment
0.055 1.133 0.038 .debug_aranges
0.000 0.000 0.000 .common
0.014 0.423 0.015 .gcc_except_table
0.000 0.000 0.000 .gnu.build.attributes
0.000 0.000 0.000 .tm_clone_table
0.097 2.394 0.080 .debug_ranges
0.000 0.002 0.000 .fini
0.000 0.001 0.000 .rodata.str
0.000 0.000 0.000 .tbss
0.006 0.125 0.005 .debug_rnglists
0.002 0.089 0.003 .got
0.000 0.000 0.000 .note.stapsdt
0.000 0.001 0.000 .gnu.hash
0.000 0.000 0.000 .gnu.build.attributes.unlikely
0.000 0.000 0.000 (header)
0.000 0.001 0.000 .debug_loc
0.084 1.893 0.063 .debug_loclists
0.000 0.002 0.000 .debug_line_str
0.000 0.000 0.000 .gnu.build.attributes.hot
0.000 0.001 0.000 .dynstr
0.000 0.001 0.000 .gnu.version
0.015 0.519 0.017 .rodata.str
0.000 0.000 0.000 .dynamic
0.035 0.847 0.028 .eh_frame
0.064 1.271 0.044 .data.rel.ro
0.000 0.000 0.000 .gnu.version_r
0.000 0.000 0.000 .debug_gdb_scripts
0.000 0.000 0.000 .init
0.096 2.390 0.080 .text
0.000 0.000 0.000 .tdata
0.001 0.023 0.002 .debug_pubtypes
0.000 0.000 0.000 .plt
0.000 0.000 0.000 .plt.got
0.086 1.972 0.066 .rodata
0.015 0.188 0.006 .eh_frame_hdr
0.000 0.004 0.000 .rodata
0.000 0.021 0.001 .rodata.cst
0.002 0.007 0.001 sort_dynamic_relocs
0.000 0.000 0.000 clear_padding
0.080 0.001 0.007 build_id
0.000 0.001 0.001 close_file
mold
is self-reporting a "real" run-time of 392 milliseconds here. That's
pretty fast! Nothing particular jumps out to me — but if there was some pass
that seemed slow, I'd know where to file an issue.
Profiling lld
Now let's try lld
!
I made another wrapper script, in /home/amos/ld.lld-wrapper
:
#!/bin/bash
ld.lld --time-trace --time-trace-file=/tmp/lld.time-trace.json "$@"
Let's do a few hot builds:
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 5.27s
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 5.26s
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 5.19s
Not too bad! Here's how different linkers compared:
(Note that this is full build time, not just link-time).
The lld maintainers suggested using jq to report those metrics:
$ cat /tmp/lld.time-trace.json | jq -r '.traceEvents[] | select(.name|contains("Total")) | "\(.dur/1000000) \(.name)"'
0.449559 Total ExecuteLinker
0.445119 Total Link
0.193442 Total Write output file
0.074866 Total Parse input files
0.061933 Total Merge/finalize input sections
0.028164 Total Split sections
0.023066 Total markLive
0.022392 Total Scan relocations
0.009324 Total Assign sections
0.007581 Total Add local symbols
0.005941 Total Finalize synthetic sections
0.004179 Total Load input files
0.004127 Total Finalize .eh_frame
0.003314 Total Process symbol versions
0.003111 Total Add symbols to symtabs
0.00283 Total LTO
0.002436 Total Finalize address dependent content
0.001592 Total Aggregate sections
0.001035 Total Combine EH sections
0.000874 Total Locate library
0.000826 Total Strip sections
0.000807 Total Redirect symbols
0.000448 Total Demote shared symbols
0.00033 Total Replace common symbols
0.00031 Total Sort sections
0.000134 Total Create output files
0.000124 Total Read linker script
0 Total Resolve SHF_LINK_ORDER
0 Total Diagnose orphan sections
And:
$ cat /tmp/lld.time-trace.json | jq -r '.traceEvents[] | select(.name|contains("Write")) | "\(.dur/1000000) \(.name)"'
0.193442 Write output file
0.193442 Total Write output file
Hey that's a useless use of cat!
It is! Gotta let redditors comment something about our otherwise perfect articles.
Here nothing jumps out either: lld self-reports 449 milliseconds, about 60 milliseconds more than mold.
What about LTO?
Let's profile link-time optimization too! We'll omit mold
from these
measurements, since it doesn't support
LTO at the time of this writing.
[profile.release]
lto = "fat"
debug = 1
Let's try with GNU ld:
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 1m10s
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 1m10s
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 1m10s
Man, LTO is not free.
Nope!
Now "fat" LTO with lld
:
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 1m10s
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 1m09s
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 1m10s
I'm starting to think maybe GNU ld wasn't involved at all there...
Yeah, me too. Rust LTO seems LLVM-specific, so I wonder if it's even using GNU ld at all.
Let's look at the timings:
$ cat /tmp/lld.time-trace.json | jq -r '.traceEvents[] | select(.name|contains("Total")) | "\(.dur/1000000) \(.name)"'
0.242555 Total ExecuteLinker
0.239797 Total Link
0.131408 Total Write output file
0.037888 Total Merge/finalize input sections
0.021132 Total Parse input files
0.016607 Total Scan relocations
0.01026 Total markLive
0.007482 Total Split sections
0.004815 Total Add local symbols
0.003422 Total Assign sections
0.002502 Total Load input files
0.002039 Total Finalize synthetic sections
0.001379 Total Finalize .eh_frame
0.001058 Total Finalize address dependent content
0.000891 Total Locate library
0.000878 Total Add symbols to symtabs
0.000505 Total Aggregate sections
0.000359 Total LTO
0.000336 Total Process symbol versions
0.000221 Total Strip sections
0.000212 Total Combine EH sections
0.000203 Total Redirect symbols
0.000188 Total Sort sections
0.000125 Total Create output files
0.000121 Total Read linker script
9.5e-05 Total Demote shared symbols
5.7e-05 Total Replace common symbols
0 Total Resolve SHF_LINK_ORDER
0 Total Diagnose orphan sections
Well that can't be right.
Mhh no, it doesn't look right at all. I would expect some of those values to be much higher.
I'm sure someone will point out what we're doing wrong!
Finally, let's try thin LTO:
[profile.release]
lto = "thin"
debug = 1
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 14.55s
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 14.73s
$ # (change main.rs)
$ cargo build --release
Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
Finished release [optimized + debuginfo] target(s) in 14.95
And here's the timings:
$ cat /tmp/lld.time-trace.json | jq -r '.traceEvents[] | select(.name|contains("Total")) | "\(.dur/1000000) \(.name)"'
0.388797 Total ExecuteLinker
0.369694 Total Link
0.175247 Total Write output file
0.059522 Total Merge/finalize input sections
0.043918 Total Parse input files
0.02205 Total Split sections
0.020696 Total Scan relocations
0.018724 Total Load input files
0.017484 Total markLive
0.007214 Total Assign sections
0.006661 Total Add local symbols
0.004376 Total Finalize synthetic sections
0.002845 Total Finalize .eh_frame
0.002703 Total Add symbols to symtabs
0.001943 Total Finalize address dependent content
0.001889 Total Process symbol versions
0.001439 Total LTO
0.001075 Total Aggregate sections
0.001072 Total Locate library
0.00069 Total Combine EH sections
0.000504 Total Strip sections
0.000459 Total Redirect symbols
0.000282 Total Sort sections
0.000273 Total Demote shared symbols
0.000162 Total Replace common symbols
0.000145 Total Read linker script
0.000111 Total Create output files
0 Total Resolve SHF_LINK_ORDER
0 Total Diagnose orphan sections
Again, those are fishy. I'm either not measuring the right thing, or the work is being done elsewhere.
Finally, here's the comparison:
Happy profiling!
Here's another article just for you:
Recursive iterators in Rust
I've been looking for this blog post everywhere, but it doesn't exist, so I guess it's my turn to write about Some Fun with Rust.
The task at hand
Let's say you have a recursive, acyclic data structure, like so:
struct Node {
values: Vec<i32>,
children: Vec<Node>,
}
This allows you to represent a tree-like structure: