Profiling linkers
Thanks to my sponsors: Sam Leonard, Mark Tomlin, Jonathan Adams, Guy Waldman, Torben Clasen, Daniel Strittmatter, Scott Sanderson, Jon Gjengset, Marcus Griep, Brian L. Troutwine, Morgan Rosenkranz, Andy Gocke, Steven Pham, Òscar Pérez, Jimmy Hartzell, Egor Ternovoi, Dave Minter, Eugene Bulkin, Max Heaton, Enrico Zschemisch and 278 more
👋 This page was last updated ~4 years ago. Just so you know.
In the wake of Why is my Rust build so
slow?, developers from the mold and
lld linkers reached
out,
wondering why using their linker didn’t make a big difference.
Of course the answer was “there’s just not that much linking to do”, and so any
difference between mold and lld was within a second. GNU ld was lagging way
behind, at four seconds or so.
But in their quest for linker performance (I love that this is a thing, by the way), they told me how to profile them!
Profiling mold
The way I usually run mold is simply by wrapping the cargo invocation with
mold -run:
$ mold -run cargo build
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished dev [unoptimized + debuginfo] target(s) in 1.74s
I can then verify that it’s actually been built with mold by reading ELF
comment sections with readelf:
$ readelf -p .comment target/debug/salvage
String dump of section '.comment':
  [     0]  GCC: (GNU) 11.2.1 20211203 (Red Hat 11.2.1-7)
  [    2e]  mold 1.0.0 (ed9924895d9b9584106791247596677db8113528; compatible with GNU ld and GNU gold)
  [    89]  GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
  [    b3]  clang version 12.0.1
mold’s profiling option is -perf, but I couldn’t find a way to use it in
conjunction with -run, so, I did the next best thing: a wrapper script.
In ~/mold-wrapper, I put:
#!/bin/bash
mold -perf "$@" 2>&1 >> /tmp/mold-log.txt
Here, "$@" forwards all arguments from our script to mold. 2>&1 redirects
input from standard error (stderr, file descriptor 2), to standard output
(stdout, file descriptor 1), and then we append (>>) to a file at
/tmp/mold-log.txt.
Since it didn’t make a big difference for futile, I’ll try it on another
project of mine, salvage. Let’s also try release builds instead of debug builds!
First with GNU ld:
# in `.cargo/config.toml`
[target.x86_64-unknown-linux-gnu]
# use the default linker: here it's GNU ld
rustflags = []
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 6.99s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 7.04s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 6.91s
Okay, so 7 seconds is our baseline.
Now with mold-wrapper:
# in `.cargo/config.toml`
[target.x86_64-unknown-linux-gnu]
rustflags = [
    "-C", "link-arg=-fuse-ld=/home/amos/mold-wrapper",
]
$ cargo build
(cut)
  = note: cc: error: unrecognized command-line option '-fuse-ld=/home/amos/mold-wrapper'
Oh no! What’s happening?
Ah. Well… “linker” can mean many things. When running executables, a dynamic linker is involved, like we discussed in making our own executable packer.
Here it’s definitely a “build-time linker”, but really there’s, I think: a
linker driver, and the actual linker. The linker driver, by default here, is
cc, which is provided by GCC. It accepts the -fuse-ld=flabazorp option to
use the flabazorp linker.
Except GCC has an allowlist of values it’ll accept for -fuse-ld. If we use
clang as a linker driver, however, we can specify a path to whichever linker
we want, even our wrapper script:
# in `.cargo/config.toml`
[target.x86_64-unknown-linux-gnu]
linker = "clang"
rustflags = [
    "-C", "link-arg=-fuse-ld=/home/amos/mold-wrapper",
]
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 5.17s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 5.12s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 5.10s
Okay, we’re closer to 5 seconds - almost 2 seconds faster!
Let’s look at what we have in /tmp/mold-log.txt:
     User   System     Real  Name
    0.800    5.360    0.392  all
    0.171    0.515    0.177    read_input_files
    0.068    2.334    0.101    register_section_pieces
    0.564    2.512    0.114    total
    0.380    0.102    0.021      before_copy
    0.000    0.000    0.000        apply_exclude_libs
    0.062    0.000    0.002        resolve_obj_symbols
    0.000    0.000    0.000        eliminate_comdats
    0.003    0.002    0.000        convert_common_symbols
    0.000    0.000    0.000        apply_version_script
    0.000    0.000    0.000        compute_import_export
    0.113    0.050    0.006        gc
    0.005    0.015    0.001          mark_nonalloc_fragments
    0.037    0.028    0.002          collect_root_set
    0.066    0.000    0.002          mark
    0.005    0.007    0.001          sweep
    0.103    0.003    0.004        compute_merged_section_sizes
    0.098    0.003    0.004          MergedSection assign_offsets
    0.002    0.015    0.001        bin_sections
    0.000    0.002    0.001        claim_unresolved_symbols
    0.003    0.020    0.000        check_dup_syms
    0.000    0.000    0.000        sort_init_fini
    0.002    0.004    0.001        compute_section_sizes
    0.064    0.002    0.003        scan_rels
    0.000    0.000    0.000        DynsymSection::finalize
    0.000    0.000    0.000        fill_verdef
    0.000    0.001    0.000        fill_verneed
    0.022    0.000    0.001        compute_symtab
    0.005    0.001    0.001        eh_frame
    0.000    0.000    0.000        osec_offset
    0.000    0.001    0.001      open_file
    0.184    2.407    0.091      copy
    0.102    2.400    0.084        copy_buf
    0.000    0.000    0.000          (header)
    0.000    0.000    0.000          .fini_array
    0.000    0.000    0.000          (header)
    0.000    0.000    0.000          .interp
    0.000    0.000    0.000          .note.gnu.build-id
    0.088    2.073    0.070          .debug_line
    0.000    0.000    0.000          .note.ABI-tag
    0.000    0.000    0.000          .rela.dyn
    0.000    0.002    0.000          .init_array
    0.000    0.000    0.000          .strtab
    0.006    0.125    0.004          .data
    0.000    0.000    0.000          .shstrtab
    0.100    2.399    0.082          .debug_str
    0.000    0.000    0.000          .got.plt
    0.000    0.000    0.000          .rela.plt
    0.007    0.145    0.005          .debug_abbrev
    0.064    1.271    0.044          .symtab
    0.000    0.000    0.000          .debug_frame
    0.089    2.291    0.076          .debug_info
    0.000    0.000    0.000          .gnu.build.attributes.startup
    0.000    0.001    0.000          .gnu.build.attributes.exit
    0.102    2.399    0.084          .debug_pubnames
    0.000    0.001    0.000          .dynsym
    0.000    0.000    0.000          .bss
    0.000    0.000    0.000          .comment
    0.055    1.133    0.038          .debug_aranges
    0.000    0.000    0.000          .common
    0.014    0.423    0.015          .gcc_except_table
    0.000    0.000    0.000          .gnu.build.attributes
    0.000    0.000    0.000          .tm_clone_table
    0.097    2.394    0.080          .debug_ranges
    0.000    0.002    0.000          .fini
    0.000    0.001    0.000          .rodata.str
    0.000    0.000    0.000          .tbss
    0.006    0.125    0.005          .debug_rnglists
    0.002    0.089    0.003          .got
    0.000    0.000    0.000          .note.stapsdt
    0.000    0.001    0.000          .gnu.hash
    0.000    0.000    0.000          .gnu.build.attributes.unlikely
    0.000    0.000    0.000          (header)
    0.000    0.001    0.000          .debug_loc
    0.084    1.893    0.063          .debug_loclists
    0.000    0.002    0.000          .debug_line_str
    0.000    0.000    0.000          .gnu.build.attributes.hot
    0.000    0.001    0.000          .dynstr
    0.000    0.001    0.000          .gnu.version
    0.015    0.519    0.017          .rodata.str
    0.000    0.000    0.000          .dynamic
    0.035    0.847    0.028          .eh_frame
    0.064    1.271    0.044          .data.rel.ro
    0.000    0.000    0.000          .gnu.version_r
    0.000    0.000    0.000          .debug_gdb_scripts
    0.000    0.000    0.000          .init
    0.096    2.390    0.080          .text
    0.000    0.000    0.000          .tdata
    0.001    0.023    0.002          .debug_pubtypes
    0.000    0.000    0.000          .plt
    0.000    0.000    0.000          .plt.got
    0.086    1.972    0.066          .rodata
    0.015    0.188    0.006          .eh_frame_hdr
    0.000    0.004    0.000          .rodata
    0.000    0.021    0.001          .rodata.cst
    0.002    0.007    0.001        sort_dynamic_relocs
    0.000    0.000    0.000        clear_padding
    0.080    0.001    0.007        build_id
    0.000    0.001    0.001      close_file
mold is self-reporting a “real” run-time of 392 milliseconds here. That’s
pretty fast! Nothing particular jumps out to me — but if there was some pass
that seemed slow, I’d know where to file an issue.
Profiling lld
Now let’s try lld!
I made another wrapper script, in /home/amos/ld.lld-wrapper:
#!/bin/bash
ld.lld --time-trace --time-trace-file=/tmp/lld.time-trace.json "$@"
Let’s do a few hot builds:
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 5.27s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 5.26s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 5.19s
Not too bad! Here’s how different linkers compared:
(Note that this is full build time, not just link-time).
The lld maintainers suggested using jq to report those metrics:
$ cat /tmp/lld.time-trace.json | jq -r '.traceEvents[] | select(.name|contains("Total")) | "\(.dur/1000000) \(.name)"'
0.449559 Total ExecuteLinker
0.445119 Total Link
0.193442 Total Write output file
0.074866 Total Parse input files
0.061933 Total Merge/finalize input sections
0.028164 Total Split sections
0.023066 Total markLive
0.022392 Total Scan relocations
0.009324 Total Assign sections
0.007581 Total Add local symbols
0.005941 Total Finalize synthetic sections
0.004179 Total Load input files
0.004127 Total Finalize .eh_frame
0.003314 Total Process symbol versions
0.003111 Total Add symbols to symtabs
0.00283 Total LTO
0.002436 Total Finalize address dependent content
0.001592 Total Aggregate sections
0.001035 Total Combine EH sections
0.000874 Total Locate library
0.000826 Total Strip sections
0.000807 Total Redirect symbols
0.000448 Total Demote shared symbols
0.00033 Total Replace common symbols
0.00031 Total Sort sections
0.000134 Total Create output files
0.000124 Total Read linker script
0 Total Resolve SHF_LINK_ORDER
0 Total Diagnose orphan sections
And:
$ cat /tmp/lld.time-trace.json | jq -r '.traceEvents[] | select(.name|contains("Write")) | "\(.dur/1000000) \(.name)"'
0.193442 Write output file
0.193442 Total Write output file
Hey that’s a useless use of cat!
It is! Gotta let redditors comment something about our otherwise perfect articles.
Here nothing jumps out either: lld self-reports 449 milliseconds, about 60 milliseconds more than mold.
What about LTO?
Let’s profile link-time optimization too! We’ll omit mold from these
measurements, since it doesn’t support
LTO at the time of this writing.
[profile.release]
lto = "fat"
debug = 1
Let’s try with GNU ld:
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 1m10s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 1m10s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 1m10s
Man, LTO is not free.
Nope!
Now “fat” LTO with lld:
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 1m10s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 1m09s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 1m10s
I’m starting to think maybe GNU ld wasn’t involved at all there…
Yeah, me too. Rust LTO seems LLVM-specific, so I wonder if it’s even using GNU ld at all.
Let’s look at the timings:
$ cat /tmp/lld.time-trace.json | jq -r '.traceEvents[] | select(.name|contains("Total")) | "\(.dur/1000000) \(.name)"'
0.242555 Total ExecuteLinker
0.239797 Total Link
0.131408 Total Write output file
0.037888 Total Merge/finalize input sections
0.021132 Total Parse input files
0.016607 Total Scan relocations
0.01026 Total markLive
0.007482 Total Split sections
0.004815 Total Add local symbols
0.003422 Total Assign sections
0.002502 Total Load input files
0.002039 Total Finalize synthetic sections
0.001379 Total Finalize .eh_frame
0.001058 Total Finalize address dependent content
0.000891 Total Locate library
0.000878 Total Add symbols to symtabs
0.000505 Total Aggregate sections
0.000359 Total LTO
0.000336 Total Process symbol versions
0.000221 Total Strip sections
0.000212 Total Combine EH sections
0.000203 Total Redirect symbols
0.000188 Total Sort sections
0.000125 Total Create output files
0.000121 Total Read linker script
9.5e-05 Total Demote shared symbols
5.7e-05 Total Replace common symbols
0 Total Resolve SHF_LINK_ORDER
0 Total Diagnose orphan sections
Well that can’t be right.
Mhh no, it doesn’t look right at all. I would expect some of those values to be much higher.
I’m sure someone will point out what we’re doing wrong!
Finally, let’s try thin LTO:
[profile.release]
lto = "thin"
debug = 1
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 14.55s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 14.73s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 14.95
And here’s the timings:
$ cat /tmp/lld.time-trace.json | jq -r '.traceEvents[] | select(.name|contains("Total")) | "\(.dur/1000000) \(.name)"'
0.388797 Total ExecuteLinker
0.369694 Total Link
0.175247 Total Write output file
0.059522 Total Merge/finalize input sections
0.043918 Total Parse input files
0.02205 Total Split sections
0.020696 Total Scan relocations
0.018724 Total Load input files
0.017484 Total markLive
0.007214 Total Assign sections
0.006661 Total Add local symbols
0.004376 Total Finalize synthetic sections
0.002845 Total Finalize .eh_frame
0.002703 Total Add symbols to symtabs
0.001943 Total Finalize address dependent content
0.001889 Total Process symbol versions
0.001439 Total LTO
0.001075 Total Aggregate sections
0.001072 Total Locate library
0.00069 Total Combine EH sections
0.000504 Total Strip sections
0.000459 Total Redirect symbols
0.000282 Total Sort sections
0.000273 Total Demote shared symbols
0.000162 Total Replace common symbols
0.000145 Total Read linker script
0.000111 Total Create output files
0 Total Resolve SHF_LINK_ORDER
0 Total Diagnose orphan sections
Again, those are fishy. I’m either not measuring the right thing, or the work is being done elsewhere.
Finally, here’s the comparison:
Happy profiling!
Here's another article just for you:
A Rust match made in hell
I often write pieces that showcase how well Rust can work for you, and how it can let you build powerful abstractions, and prevents you from making a bunch of mistakes.
If you read something like Some mistakes Rust doesn’t catch in isolation, it could seem as if I had only nice things to say about Rust, and it’s a perfect little fantasy land where nothing ever goes wrong.