Profiling linkers

In the wake of Why is my Rust build so slow?, developers from the mold and lld linkers reached out, wondering why using their linker didn't make a big difference.

Of course the answer was "there's just not that much linking to do", and so any difference between mold and lld was within a second. GNU ld was lagging way behind, at four seconds or so.

But in their quest for linker performance (I love that this is a thing, by the way), they told me how to profile them!

Profiling mold

The way I usually run mold is simply by wrapping the cargo invocation with mold -run:

Shell session
$ mold -run cargo build
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished dev [unoptimized + debuginfo] target(s) in 1.74s

I can then verify that it's actually been built with mold by reading ELF comment sections with readelf:

Shell session
$ readelf -p .comment target/debug/salvage 

String dump of section '.comment':
  [     0]  GCC: (GNU) 11.2.1 20211203 (Red Hat 11.2.1-7)
  [    2e]  mold 1.0.0 (ed9924895d9b9584106791247596677db8113528; compatible with GNU ld and GNU gold)
  [    89]  GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
  [    b3]  clang version 12.0.1

mold's profiling option is -perf, but I couldn't find a way to use it in conjunction with -run, so, I did the next best thing: a wrapper script.

In ~/mold-wrapper, I put:

Bash
#!/bin/bash
mold -perf "$@" 2>&1 >> /tmp/mold-log.txt

Here, "$@" forwards all arguments from our script to mold. 2>&1 redirects input from standard error (stderr, file descriptor 2), to standard output (stdout, file descriptor 1), and then we append (>>) to a file at /tmp/mold-log.txt.

Since it didn't make a big difference for futile, I'll try it on another project of mine, salvage. Let's also try release builds instead of debug builds!

First with GNU ld:

TOML markup
# in `.cargo/config.toml`

[target.x86_64-unknown-linux-gnu]
# use the default linker: here it's GNU ld
rustflags = []
Shell session
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 6.99s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 7.04s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 6.91s

Okay, so 7 seconds is our baseline.

Now with mold-wrapper:

TOML markup
# in `.cargo/config.toml`

[target.x86_64-unknown-linux-gnu]
rustflags = [
    "-C", "link-arg=-fuse-ld=/home/amos/mold-wrapper",
]
Shell session
$ cargo build
(cut)
  = note: cc: error: unrecognized command-line option '-fuse-ld=/home/amos/mold-wrapper'

Oh no! What's happening?

Ah. Well... "linker" can mean many things. When running executables, a dynamic linker is involved, like we discussed in making our own executable packer.

Here it's definitely a "build-time linker", but really there's, I think: a linker driver, and the actual linker. The linker driver, by default here, is cc, which is provided by GCC. It accepts the -fuse-ld=flabazorp option to use the flabazorp linker.

Except GCC has an allowlist of values it'll accept for -fuse-ld. If we use clang as a linker driver, however, we can specify a path to whichever linker we want, even our wrapper script:

TOML markup
# in `.cargo/config.toml`

[target.x86_64-unknown-linux-gnu]
linker = "clang"
rustflags = [
    "-C", "link-arg=-fuse-ld=/home/amos/mold-wrapper",
]
Shell session
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 5.17s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 5.12s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 5.10s

Okay, we're closer to 5 seconds - almost 2 seconds faster!

Let's look at what we have in /tmp/mold-log.txt:

     User   System     Real  Name
    0.800    5.360    0.392  all
    0.171    0.515    0.177    read_input_files
    0.068    2.334    0.101    register_section_pieces
    0.564    2.512    0.114    total
    0.380    0.102    0.021      before_copy
    0.000    0.000    0.000        apply_exclude_libs
    0.062    0.000    0.002        resolve_obj_symbols
    0.000    0.000    0.000        eliminate_comdats
    0.003    0.002    0.000        convert_common_symbols
    0.000    0.000    0.000        apply_version_script
    0.000    0.000    0.000        compute_import_export
    0.113    0.050    0.006        gc
    0.005    0.015    0.001          mark_nonalloc_fragments
    0.037    0.028    0.002          collect_root_set
    0.066    0.000    0.002          mark
    0.005    0.007    0.001          sweep
    0.103    0.003    0.004        compute_merged_section_sizes
    0.098    0.003    0.004          MergedSection assign_offsets
    0.002    0.015    0.001        bin_sections
    0.000    0.002    0.001        claim_unresolved_symbols
    0.003    0.020    0.000        check_dup_syms
    0.000    0.000    0.000        sort_init_fini
    0.002    0.004    0.001        compute_section_sizes
    0.064    0.002    0.003        scan_rels
    0.000    0.000    0.000        DynsymSection::finalize
    0.000    0.000    0.000        fill_verdef
    0.000    0.001    0.000        fill_verneed
    0.022    0.000    0.001        compute_symtab
    0.005    0.001    0.001        eh_frame
    0.000    0.000    0.000        osec_offset
    0.000    0.001    0.001      open_file
    0.184    2.407    0.091      copy
    0.102    2.400    0.084        copy_buf
    0.000    0.000    0.000          (header)
    0.000    0.000    0.000          .fini_array
    0.000    0.000    0.000          (header)
    0.000    0.000    0.000          .interp
    0.000    0.000    0.000          .note.gnu.build-id
    0.088    2.073    0.070          .debug_line
    0.000    0.000    0.000          .note.ABI-tag
    0.000    0.000    0.000          .rela.dyn
    0.000    0.002    0.000          .init_array
    0.000    0.000    0.000          .strtab
    0.006    0.125    0.004          .data
    0.000    0.000    0.000          .shstrtab
    0.100    2.399    0.082          .debug_str
    0.000    0.000    0.000          .got.plt
    0.000    0.000    0.000          .rela.plt
    0.007    0.145    0.005          .debug_abbrev
    0.064    1.271    0.044          .symtab
    0.000    0.000    0.000          .debug_frame
    0.089    2.291    0.076          .debug_info
    0.000    0.000    0.000          .gnu.build.attributes.startup
    0.000    0.001    0.000          .gnu.build.attributes.exit
    0.102    2.399    0.084          .debug_pubnames
    0.000    0.001    0.000          .dynsym
    0.000    0.000    0.000          .bss
    0.000    0.000    0.000          .comment
    0.055    1.133    0.038          .debug_aranges
    0.000    0.000    0.000          .common
    0.014    0.423    0.015          .gcc_except_table
    0.000    0.000    0.000          .gnu.build.attributes
    0.000    0.000    0.000          .tm_clone_table
    0.097    2.394    0.080          .debug_ranges
    0.000    0.002    0.000          .fini
    0.000    0.001    0.000          .rodata.str
    0.000    0.000    0.000          .tbss
    0.006    0.125    0.005          .debug_rnglists
    0.002    0.089    0.003          .got
    0.000    0.000    0.000          .note.stapsdt
    0.000    0.001    0.000          .gnu.hash
    0.000    0.000    0.000          .gnu.build.attributes.unlikely
    0.000    0.000    0.000          (header)
    0.000    0.001    0.000          .debug_loc
    0.084    1.893    0.063          .debug_loclists
    0.000    0.002    0.000          .debug_line_str
    0.000    0.000    0.000          .gnu.build.attributes.hot
    0.000    0.001    0.000          .dynstr
    0.000    0.001    0.000          .gnu.version
    0.015    0.519    0.017          .rodata.str
    0.000    0.000    0.000          .dynamic
    0.035    0.847    0.028          .eh_frame
    0.064    1.271    0.044          .data.rel.ro
    0.000    0.000    0.000          .gnu.version_r
    0.000    0.000    0.000          .debug_gdb_scripts
    0.000    0.000    0.000          .init
    0.096    2.390    0.080          .text
    0.000    0.000    0.000          .tdata
    0.001    0.023    0.002          .debug_pubtypes
    0.000    0.000    0.000          .plt
    0.000    0.000    0.000          .plt.got
    0.086    1.972    0.066          .rodata
    0.015    0.188    0.006          .eh_frame_hdr
    0.000    0.004    0.000          .rodata
    0.000    0.021    0.001          .rodata.cst
    0.002    0.007    0.001        sort_dynamic_relocs
    0.000    0.000    0.000        clear_padding
    0.080    0.001    0.007        build_id
    0.000    0.001    0.001      close_file

mold is self-reporting a "real" run-time of 392 milliseconds here. That's pretty fast! Nothing particular jumps out to me — but if there was some pass that seemed slow, I'd know where to file an issue.

Profiling lld

Now let's try lld!

I made another wrapper script, in /home/amos/ld.lld-wrapper:

Bash
#!/bin/bash
ld.lld --time-trace --time-trace-file=/tmp/lld.time-trace.json "$@"

Let's do a few hot builds:

Shell session
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 5.27s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 5.26s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 5.19s

Not too bad! Here's how different linkers compared:

(Note that this is full build time, not just link-time).

The lld maintainers suggested using jq to report those metrics:

Shell session
$ cat /tmp/lld.time-trace.json | jq -r '.traceEvents[] | select(.name|contains("Total")) | "\(.dur/1000000) \(.name)"'
0.449559 Total ExecuteLinker
0.445119 Total Link
0.193442 Total Write output file
0.074866 Total Parse input files
0.061933 Total Merge/finalize input sections
0.028164 Total Split sections
0.023066 Total markLive
0.022392 Total Scan relocations
0.009324 Total Assign sections
0.007581 Total Add local symbols
0.005941 Total Finalize synthetic sections
0.004179 Total Load input files
0.004127 Total Finalize .eh_frame
0.003314 Total Process symbol versions
0.003111 Total Add symbols to symtabs
0.00283 Total LTO
0.002436 Total Finalize address dependent content
0.001592 Total Aggregate sections
0.001035 Total Combine EH sections
0.000874 Total Locate library
0.000826 Total Strip sections
0.000807 Total Redirect symbols
0.000448 Total Demote shared symbols
0.00033 Total Replace common symbols
0.00031 Total Sort sections
0.000134 Total Create output files
0.000124 Total Read linker script
0 Total Resolve SHF_LINK_ORDER
0 Total Diagnose orphan sections

And:

Shell session
$ cat /tmp/lld.time-trace.json | jq -r '.traceEvents[] | select(.name|contains("Write")) | "\(.dur/1000000) \(.name)"'
0.193442 Write output file
0.193442 Total Write output file

Hey that's a useless use of cat!

It is! Gotta let redditors comment something about our otherwise perfect articles.

Here nothing jumps out either: lld self-reports 449 milliseconds, about 60 milliseconds more than mold.

What about LTO?

Let's profile link-time optimization too! We'll omit mold from these measurements, since it doesn't support LTO at the time of this writing.

TOML markup
[profile.release]
lto = "fat"
debug = 1

Let's try with GNU ld:

Shell session
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 1m10s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 1m10s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 1m10s

Man, LTO is not free.

Nope!

Now "fat" LTO with lld:

Shell session
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 1m10s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 1m09s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 1m10s

I'm starting to think maybe GNU ld wasn't involved at all there...

Yeah, me too. Rust LTO seems LLVM-specific, so I wonder if it's even using GNU ld at all.

Let's look at the timings:

Shell session
$ cat /tmp/lld.time-trace.json | jq -r '.traceEvents[] | select(.name|contains("Total")) | "\(.dur/1000000) \(.name)"'
0.242555 Total ExecuteLinker
0.239797 Total Link
0.131408 Total Write output file
0.037888 Total Merge/finalize input sections
0.021132 Total Parse input files
0.016607 Total Scan relocations
0.01026 Total markLive
0.007482 Total Split sections
0.004815 Total Add local symbols
0.003422 Total Assign sections
0.002502 Total Load input files
0.002039 Total Finalize synthetic sections
0.001379 Total Finalize .eh_frame
0.001058 Total Finalize address dependent content
0.000891 Total Locate library
0.000878 Total Add symbols to symtabs
0.000505 Total Aggregate sections
0.000359 Total LTO
0.000336 Total Process symbol versions
0.000221 Total Strip sections
0.000212 Total Combine EH sections
0.000203 Total Redirect symbols
0.000188 Total Sort sections
0.000125 Total Create output files
0.000121 Total Read linker script
9.5e-05 Total Demote shared symbols
5.7e-05 Total Replace common symbols
0 Total Resolve SHF_LINK_ORDER
0 Total Diagnose orphan sections

Well that can't be right.

Mhh no, it doesn't look right at all. I would expect some of those values to be much higher.

I'm sure someone will point out what we're doing wrong!

Finally, let's try thin LTO:

TOML markup
[profile.release]
lto = "thin"
debug = 1
Shell session
$ cargo build --release
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 14.55s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 14.73s
$ # (change main.rs)
$ cargo build --release
   Compiling salvage v1.4.0 (/home/amos/bearcove/salvage)
    Finished release [optimized + debuginfo] target(s) in 14.95

And here's the timings:

Shell session
$ cat /tmp/lld.time-trace.json | jq -r '.traceEvents[] | select(.name|contains("Total")) | "\(.dur/1000000) \(.name)"'
0.388797 Total ExecuteLinker
0.369694 Total Link
0.175247 Total Write output file
0.059522 Total Merge/finalize input sections
0.043918 Total Parse input files
0.02205 Total Split sections
0.020696 Total Scan relocations
0.018724 Total Load input files
0.017484 Total markLive
0.007214 Total Assign sections
0.006661 Total Add local symbols
0.004376 Total Finalize synthetic sections
0.002845 Total Finalize .eh_frame
0.002703 Total Add symbols to symtabs
0.001943 Total Finalize address dependent content
0.001889 Total Process symbol versions
0.001439 Total LTO
0.001075 Total Aggregate sections
0.001072 Total Locate library
0.00069 Total Combine EH sections
0.000504 Total Strip sections
0.000459 Total Redirect symbols
0.000282 Total Sort sections
0.000273 Total Demote shared symbols
0.000162 Total Replace common symbols
0.000145 Total Read linker script
0.000111 Total Create output files
0 Total Resolve SHF_LINK_ORDER
0 Total Diagnose orphan sections

Again, those are fishy. I'm either not measuring the right thing, or the work is being done elsewhere.

Finally, here's the comparison:

Happy profiling!

If you liked what you saw, please support my work!

Github logo Donate on GitHub Patreon logo Donate on Patreon

Latest video

video cover image
How does the detour crate work?

We want to hook a function, so that our code gets called... but we also want the original code to execute. How the heck does that work?

Watch now

You can watch more videos over there

Looking for the homepage?
Another article: A Rust match made in hell