I write a ton of articles about Rust. And in those articles, the main focus is on writing Rust code that compiles. Once it compiles, well, we're basically in the clear! Especially if it compiles to a single executable that's made up entirely of Rust code.

That works great for short tutorials, or one-off explorations.

Unfortunately, "in the real world", our code often has to share the stage with other code. And Rust is great at that. Compiling Go code to a static library, for example, is relatively finnicky. It insists on being built with GCC (and no other compiler), and linked with GNU ld (and no other linker).

In contrast, Rust lends itself very well to "just write a bit of fast and safe code and integrate it into something else". It uses LLVM for codegen, which, as detractors will point out, doesn't support as many targets as GCC does (but there's always work in progress to address that), and it supports using GCC, Clang, and MSVC to compile C dependencies, and GNU ld, the LLVM linker, and the Microsoft linker to link the result.

In this murder mystery, there are many potential suspects, since the project in question is:

  1. an Electron application,
  2. which loads a native node.js module,
  3. whose glue code is written in Rust,
  4. and which links in a large Go codebase,
  5. along with a number of C dependencies.

The crime? Murdering the static TLS block.

Shell session
$ ./node_modules/.bin/electron --no-sandbox ../tests/test.js
App threw an error during load
Error: /home/amos/work/valet/artifacts/linux-x86_64/index.node: cannot allocate memory in static TLS block
    at process.func [as dlopen] (electron/js2c/asar.js:140:31)
    at Object.Module._extensions..node (internal/modules/cjs/loader.js:1034:18)
    at Object.func [as .node] (electron/js2c/asar.js:140:31)
    at Module.load (internal/modules/cjs/loader.js:815:32)
    at Module._load (internal/modules/cjs/loader.js:727:14)
    at Function.Module._load (electron/js2c/asar.js:769:28)
    at Module.require (internal/modules/cjs/loader.js:852:19)
    at require (internal/modules/cjs/helpers.js:74:18)
    at Object.<anonymous> (/home/amos/work/valet/index.js:61:13)
    at Module._compile (internal/modules/cjs/loader.js:967:30)

Setting the scene

In the past few months, I've been working on a new release of an Electron-based desktop application.

Cool bear's hot tip

What's that?

It's a git tag built via continuous integration and deployed to our servers so our customers can enjoy new bugs and feature fixes, but that's not important right now.

I figured out pretty early that writing core business logic in JavaScript (or even in TypeScript), using node.js APIs, was troublesome. I wanted to access lower-level APIs (actual Win32 or Cocoa functions, for example) for various purposes: pre-allocating disk space, sandboxing, better control over the network stack, etc.

So I set out to write most of it in Go. The desktop app would contain mostly user interface code, and on first run, it would download a second executable that acted as a JSON-RPC service.

Everything was mostly fine, until I realized two things:

  1. Downloading an extra executable that listens on a (local) TCP port makes third-party Antivirus software very unhappy on Windows.
  2. I do not want to write any more Go.

That solution worked for most people, mind you. Most of the time, the executable was downloaded fine, and most of the time, it was able to start. Most of the time, it was able to bind to a port on 127.0.0.1 and most of the time the desktop app was able to connect to it and start exchanging JSON-RPC messages.

But in some cases, for some users, on some full moons, it didn't! And those times were extremely hard to diagnose. I mostly blame Antivirus software. I'm not fond of them.

The reality, though, is that I had built a system with too many moving parts. Having the desktop app download components was great, because it meant I could push updates to those components without updating the app itself. And it could update those components without restarting!

But I eventually decided the whole thing was too fragile, and that's when I started working the "dynamic library" angle.

Cool bear's hot tip

So just to be clear for the readers - the original problem is 100% your fault, right?

I mean... yes. In the sense that I was too trusting. I thought that Antivirus software making critical decisions at random was a thing of the past. But the AV mafia is still alive and well, so, for the time being, I have to find ways to appease them.

Cool bear's hot tip

You say you're working the "dynamic library" angle, what does that mean exactly?

Well, node.js (and Electron) lets you write native modules. You write C/C++ code, build it as a dynamic library, rename the result to a .node file, and then you can require() it just like any other JavaScript module.

Cool bear's hot tip

AhAH! But you said earlier that you wanted to access lower-level APIs - and that's why you wrote Go code in the first place.

Exactly! One important thing to note is that, when I started, native node addons used the "node APIs" directly, which meant you had to recompile them every time you switched to a different node version. And the version of node shipped with Electron often didn't match the version of node you had installed on your system.

It was only later that N-API was stabilized. By then, I was already deeply invested in the "second executable" approach.

Cool bear's hot tip

So now you use N-API to make a native addon that you don't need to recompile all the time?

That's correct! Now, my native addon is built by CI/CD workers, uploaded to GitHub, and the native bits are downloaded during npm install.

And the result works across all (semi-recent) versions of node and Electron.

Cool bear's hot tip

Okay. There's still one thing I don't get: why is Rust involved at all?

Couldn't you have just compiled your existing Go code into a native node addon?

Well, using cgo and a non-trivial amount of C or C++ code as glue, I'm sure there would be a way. (And others have been trying exactly that).

But I also wanted to move away from Go in general, and if I was going to write some glue code, I was going to set the rules - I didn't really feel like writing C or C++ code either. I wanted my glue code to be checked by a compiler that has more information about what's going on, so it can prevent more mistakes.

Cool bear's hot tip

Okay so you threw away a design that isn't Antivirus-friendly, you now have to build a dynamic library "just the way node likes it", using the stable "N-API" ABI, you're using Rust for the glue code, and Go is still in the picture because you have a large existing codebase you don't want to rewrite entirely right now.

Now you get it!

Cool bear's hot tip

And that new design is "less fragile" than the previous one because as soon as it compiles, you're in the clear. There's no "download" phase, there's no sockets involved, just function calls.

Right! Unless the dynamic linker is unhappy.

Cool bear's hot tip

Unless the dynamic linker is unhappy.

Wherein the dynamic linker becomes unhappy

The transition to the new design went well. Since N-API doesn't ship with official Rust support, I had to get a bit creative and take inspiration from existing crates, but things were more or less smooth sailing.

It's just C functions!
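To give you an idea, here's a minimal sketch of what a native module's entry point can look like under the N-API registration scheme. The napi_* type declarations below are hand-rolled stand-ins for the real headers, purely for illustration:

Rust code
// in `src/lib.rs`, built as a `cdylib`, then renamed to `index.node`
#![allow(non_camel_case_types)]

// opaque handles, normally provided by the N-API headers
#[repr(C)]
pub struct napi_env__ {
    _priv: [u8; 0],
}
pub type napi_env = *mut napi_env__;

#[repr(C)]
pub struct napi_value__ {
    _priv: [u8; 0],
}
pub type napi_value = *mut napi_value__;

// node calls this when the module is `require()`d...
#[no_mangle]
pub unsafe extern "C" fn napi_register_module_v1(
    _env: napi_env,
    exports: napi_value,
) -> napi_value {
    // ...and this is where we'd call `napi_*` functions to
    // attach functions and objects to `exports`
    exports
}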

After much fiddling, I was able to conjure the correct set of flags to get all my Go (and cgo) code to compile and link into a single dynamic library, that acts as a valid node.js module. This is probably worth a write-up in itself, but I'm still not sure enough folks would be curious about it.

Cool bear's hot tip

So instead you write about thread-local storage?

Yeah! I uhh I see your point. But linker problems, and TLS in particular, are problems you can encounter with any programming language, in any situation. They're not specific to my current setup.

So, things were going swimmingly, and I was porting over the self-update logic for my desktop app. I had some trouble with futures on the way, which I was happy to solve and go on my merry way.

And then I got this error:

Error: index.node: cannot allocate memory in static TLS block

I've written about thread-local storage before, so I had at least the notion that this was unrelated to cryptography.

Beyond that, though? I was completely at a loss as to how to diagnose and fix this problem.

The same codebase, which was already pretty large the day before, compiled, linked and loaded fine! I had just added a couple dozen lines of code to make some HTTP requests. What could've possibly gone wrong over a day of work?

Before we move into the investigation phase of the murder mystery, let's take a quick detour to try and understand why thread-local storage is useful.

Who even wants TLS?

Let's take a completely artificial example, that somehow always finds its way into explanations about threading - a classic, if you will.

Say we have five threads, and they all increment a shared counter by one, a thousand times each.

Normally the explanation would be written in C, but I can't bring myself to write C, so we'll have to simulate C with unsafe Rust instead:

Rust code
fn main() {
    let mut counter: usize = 0;
    let unsafe_on_purpose = UnsafeOnPurpose(&mut counter);

    #[derive(Clone, Copy)]
    struct UnsafeOnPurpose(*mut usize);
    unsafe impl Sync for UnsafeOnPurpose {}
    unsafe impl Send for UnsafeOnPurpose {}

    let mut handles = vec![];
    for _ in 0..5 {
        handles.push(std::thread::spawn(move || {
            for _ in 0..1000 {
                unsafe {
                    *unsafe_on_purpose.0 += 1;
                }
            }
        }));
    }
    handles.into_iter().for_each(|h| h.join().unwrap());

    println!("counter = {}", counter);
}

The classic explanation would go on and ask you what you think the final value of counter would be. 5000, right?

Shell session
$ cargo run -q
counter = 2796
$ cargo run -q
counter = 2507
$ cargo run -q
counter = 1615
$ cargo run -q
counter = 2630

"Wrong! Very wrong!" the writer would cackle, as if they got a kick out of getting the wrong answer out of you.

"You see..." the the explanation would go on, explaining that programming with threads is fraught with danger, and that only the most experienced hackers can hope to write multi-threaded code that gives the expected result most of the time.

Of course, that classic has aged pretty badly, as I've just looked at the back of the box Rust came in, and it says "fearless concurrency" in flashy letters.

So if we turn off the C simulator, and let the Rust compiler massage our code into something that's safe, this is one of the things we could end up with:

Rust code
use std::sync::{Arc, Mutex};

fn main() {
    let counter = Arc::new(Mutex::new(0_usize));

    let mut handles = vec![];
    for _ in 0..5 {
        let counter = counter.clone();
        handles.push(std::thread::spawn(move || {
            for _ in 0..1000 {
                *counter.lock().unwrap() += 1;
            }
        }));
    }
    handles.into_iter().for_each(|h| h.join().unwrap());

    println!("counter = {}", counter.lock().unwrap());
}

And there, the answer is always 5000:

Shell session
$ cargo run -q
counter = 5000
$ cargo run -q
counter = 5000
$ cargo run -q
counter = 5000

But of course, that's not the end of the road for us. Now that Rust is there to ensure memory safety, and now that correctness is out of the way, we can hyperfocus on performance instead.

For starters, clippy tells us we should consider using an AtomicUsize instead:

Shell session
$ cargo clippy
warning: Consider using an `AtomicUsize` instead of a `Mutex` here. If you just want the locking behavior and not the internal type, consider using `Mutex<()>`.
 --> src/main.rs:4:28
  |
4 |     let counter = Arc::new(Mutex::new(0_usize));
  |                            ^^^^^^^^^^^^^^^^^^^
  |
  = note: `#[warn(clippy::mutex_atomic)]` on by default
  = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#mutex_atomic

And I can see its point! It is entirely possible that an atomic variable would be much faster than a mutex here. And it would still be just as correct:

Rust code
use std::sync::atomic::{AtomicUsize, Ordering};

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let mut handles = vec![];
    for _ in 0..5 {
        handles.push(std::thread::spawn(|| {
            for _ in 0..1000 {
                COUNTER.fetch_add(1, Ordering::SeqCst);
            }
        }));
    }
    handles.into_iter().for_each(|h| h.join().unwrap());

    println!("counter = {}", COUNTER.load(Ordering::SeqCst));
}
Shell session
$ cargo run -q
counter = 5000
$ cargo run -q
counter = 5000
$ cargo run -q
counter = 5000

But if we really wanted to hyperfocus on performance, we'd surely point out that atomic operations are not free.

And that, in that case, it would probably be better to have each thread first accumulate in a variable of its own, and only later sum up each thread's result.

Rust code
fn main() {
    let mut handles = vec![];
    for _ in 0..5 {
        handles.push(std::thread::spawn(|| {
            let mut counter = 0;
            for _ in 0..1000 {
                counter += 1;
            }
            counter
        }));
    }

    let mut counter = 0;
    handles
        .into_iter()
        .for_each(|h| counter += h.join().unwrap());

    println!("counter = {}", counter);
}
Shell session
$ cargo run -q
counter = 5000

In fact, that silly, artificial example code could be written in a much more succinct manner if we allowed ourselves a peek at rayon, for example.
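For instance - and this is just a sketch, assuming we've added the rayon crate as a dependency - the whole thing collapses into a parallel iterator chain:

Rust code
use rayon::prelude::*;

fn main() {
    // five batches of a thousand increments each,
    // summed across rayon's thread pool
    let counter: usize = (0..5)
        .into_par_iter()
        .map(|_| (0..1000).map(|_| 1).sum::<usize>())
        .sum();

    println!("counter = {}", counter);
}

...and it still prints counter = 5000, every time.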

But what if we weren't in charge of counting? What if we were using a legacy counting API, that only had those two functions:

Rust code
fn counter_inc() {
    todo!()
}

fn counter_get() -> usize {
    todo!()
}

Clearly our legacy counting API was designed at a time when multi-threaded programming was not really a concern - maybe it was originally implemented like that:

Rust code
static mut GLOBAL_COUNTER: usize = 0;

fn counter_inc() {
    unsafe {
        GLOBAL_COUNTER += 1;
    }
}

fn counter_get() -> usize {
    unsafe { GLOBAL_COUNTER }
}

If we just used this API as-is, our counts would be all wrong:

Rust code
fn main() {
    let mut handles = vec![];
    for _ in 0..5 {
        handles.push(std::thread::spawn(|| {
            for _ in 0..1000 {
                counter_inc();
            }
            counter_get()
        }));
    }

    let mut counter = 0;
    handles
        .into_iter()
        .for_each(|h| counter += h.join().unwrap());

    println!("counter = {}", counter);
}
Shell session
$ cargo run -q
counter = 8909
$ cargo run -q
counter = 11047
$ cargo run -q
counter = 9026
$ cargo run -q
counter = 6757

Luckily, our legacy API is made up of functions, so we could build a contraption to let each thread have its own counter:

Shell session
$ cargo add once_cell
    Updating 'https://github.com/rust-lang/crates.io-index' index
      Adding once_cell v1.4.0 to dependencies
Rust code
use once_cell::sync::Lazy;
use std::{cell::RefCell, collections::HashMap, sync::RwLock, thread::ThreadId};

// omitted: main

struct NotThreadSafe<T>(RefCell<T>);
unsafe impl<T> Sync for NotThreadSafe<T> {}

static THREAD_COUNTERS: Lazy<RwLock<HashMap<ThreadId, NotThreadSafe<usize>>>> =
    Lazy::new(|| RwLock::new(HashMap::new()));

fn counter_inc() {
    let tid = std::thread::current().id();
    let r_guard = THREAD_COUNTERS.read().unwrap();
    match r_guard.get(&tid) {
        Some(counter) => {
            *counter.0.borrow_mut() += 1;
        }
        None => {
            drop(r_guard);
            THREAD_COUNTERS
                .write()
                .unwrap()
                .insert(tid, NotThreadSafe(RefCell::new(1)));
        }
    }
}

fn counter_get() -> usize {
    let tid = std::thread::current().id();
    THREAD_COUNTERS
        .read()
        .unwrap()
        .get(&tid)
        .map(|counter| *counter.0.borrow())
        .unwrap_or_default()
}

And then our legacy API would start working for multi-threaded programs:

Shell session
$ cargo run -q
counter = 5000

Of course, if that counting API was part of a family of system libraries, and one of them controlled threading, we would be able to use a much simpler approach, with less locking - we'd simply set up storage for the counter whenever a new thread starts, and then reading and writing to it would involve no locking at all.

We could even... bake that into the compilers and linkers, and call it thread-local storage.

Rust code
use std::cell::RefCell;

thread_local! {
    static COUNTER: RefCell<usize> = RefCell::new(0);
}

fn counter_inc() {
    COUNTER.with(|c| *c.borrow_mut() += 1);
}

fn counter_get() -> usize {
    COUNTER.with(|c| *c.borrow())
}
Shell session
$ cargo run -q
counter = 5000
Cool bear's hot tip

Rust has a thread_local attribute awaiting stabilization, but we can get a preview on nightly:

Rust code
#![feature(thread_local)]

// omitted: main

#[thread_local]
static mut COUNTER: usize = 0;

fn counter_inc() {
    // my understanding is that this `unsafe` block should
    // not be necessary - but it is right now.
    unsafe { COUNTER += 1 }
}

fn counter_get() -> usize {
    unsafe { COUNTER }
}
Shell session
$ cargo +nightly run -q
counter = 5000

And that... is more or less the story of errno.

Consider the C function fopen:

C code
FILE *fopen(const char *pathname, const char *mode);

When something goes wrong, it returns NULL. But how do we know exactly what went wrong? That's where errno comes in. It's a global variable that is set when something goes wrong, to a constant like EEXIST, EBUSY, EACCES, or ENOSPC.

The problem, of course, is that fopen can be called by multiple threads concurrently. So for a while, on Linux/glibc platforms, errno was declared as a macro that did pthreads-specific stuff to return the right errno for a given thread.

Later still, after the C99 standard was released, a scheme for supporting thread-local storage as a C extension was devised. For ELF, Ulrich Drepper published a document in 2013.

In this new scheme, errno no longer needs to be a macro - it is simply a global variable, but a thread-local one.
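Here's a quick sketch of what that looks like from Rust (assuming the libc crate - and on Unix, std::io::Error::last_os_error() reads errno under the hood):

Rust code
use std::ffi::CString;

fn main() {
    // try to fopen a file that doesn't exist...
    let path = CString::new("/no/such/file").unwrap();
    let mode = CString::new("r").unwrap();
    let file = unsafe { libc::fopen(path.as_ptr(), mode.as_ptr()) };
    assert!(file.is_null());

    // ...then read back this thread's errno value
    println!("{}", std::io::Error::last_os_error());
    // prints: No such file or directory (os error 2)
}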

Of course, there are plenty of other legitimate uses for thread-local storage - we'll see one of them very shortly.

Anyway, for this scheme to work, several things need to happen:

  1. Compilers need to emit special code (and relocations) whenever thread-local variables are accessed.
  2. Linkers need to collect all the thread-local variables into a TLS segment of the output file.

And, perhaps most importantly:

  3. Whenever a thread starts, enough storage needs to be set up to hold every thread-local variable it might access.

How does all this work? Barely, if you ask me!

When executing a program:

  1. The dynamic loader maps the executable, and every library it depends on, into memory.
  2. Each of those may have a TLS segment, whose required size can be read from its program headers.

At this point, we haven't started executing the program yet. But we know how much thread-local storage space we're going to need, so we can allocate enough.

Or can we? We won't know about all the dependencies of an application until it's done running. It could call dlopen at any point and load additional libraries.

So, what does the GNU dynamic loader do?

C code
// in `glibc/csu/libc-tls.c`

/* Size of the static TLS block.  Giving this initialized value
   preallocates some surplus bytes in the static TLS area.  */
size_t _dl_tls_static_size = 2048;

It reserves 2048 bytes more than it needs to. Just in case.

2048 bytes ought to be enough for everyone, right?

Cool bear's hot tip

Where have I heard that before...

2048 bytes is not enough for everyone

Clearly, the error message said what it said - there was not enough space in the static TLS block to load my dynamic library.

So I started suspecting everything. Was the Go code somehow using a large amount of TLS, and we were just under the limit before, which would explain why a relatively modest addition of Rust code would bring us over the limit?

I tried to look for a flag to control the thread-local storage model in Go and found out that, actually, Go doesn't have thread-local storage because, and I quote:

...every feature comes at a cost, and in my opinion the cost of threadlocals far outweighs their benefits. They're just not a good fit for Go.

Well. I started thinking maybe one of the C dependencies was putting us over our TLS budget then. But which one?

I decided to start looking at all the libraries my dynamic library depended on:

JavaScript code
// in `list.js`

let { execSync } = require("child_process");

let cmdLines = (command) => {
    return execSync(command, { encoding: "utf-8" }).split("\n")
}

for (let line of cmdLines(`ldd ${process.argv[2]}`)) {
    let tokens = line.split("=>");
    if (tokens.length == 2) {
        let matches = /[^\s]+/.exec(tokens[1]);
        if (matches) {
            let path = matches[0];
            let header = "";
            for (let line of cmdLines(`readelf -Wl "${path}"`)) {
                if (/PhysAddr/.test(line)) {
                    header = line;
                }
                if (/TLS/.test(line)) {
                    console.log(`==== ${path} ====`);
                    console.log(header);
                    console.log(line);
                }
            }
        }
    }
}
Shell session
$ node list.js ./index.node
==== /usr/lib/libc.so.6 ====
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  TLS            0x1bc440 0x00000000001bd440 0x00000000001bd440 0x000010 0x000090 R   0x8

The column we want to be looking at here is MemSiz. 0x90 is 144 bytes - that seems reasonable. Plus, libc.so.6 is probably something Electron already depends on:

Shell session
$ ldd electron | grep libc.s
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f9756e05000)

...so it wouldn't count against the 2048-byte surplus.
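(For the curious, the same check can be done in Rust - here's a sketch using the goblin crate to print the PT_TLS segment size of a single ELF file; consider it illustrative rather than battle-tested:)

Rust code
use goblin::elf::{program_header::PT_TLS, Elf};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = std::env::args().nth(1).expect("usage: tls-size FILE");
    let bytes = std::fs::read(&path)?;
    let elf = Elf::parse(&bytes)?;

    for ph in &elf.program_headers {
        // PT_TLS describes the thread-local storage template
        if ph.p_type == PT_TLS {
            println!("{}: TLS MemSiz = {:#x} ({} bytes)", path, ph.p_memsz, ph.p_memsz);
        }
    }
    Ok(())
}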

After a lot of head-scratching, I decided to try and change the TLS model for my Rust code. We haven't really talked about TLS models so far - there's four of them: initial-exec, local-exec, local-dynamic, and global-dynamic.

They all have different restrictions and performance characteristics. For example, with initial-exec, one of the most restrictive TLS models, each thread-local variable has an offset that's known at, well, initial execution.

Before process execution, when TLS relocations are processed, fixed offsets are written into the code, and everything is just as fast as if they were non-thread-local variables. The restriction is that we have to know about every thread-local variable on initial execution.

Thread-local variables declared by a dynamic library that's loaded at runtime with dlopen cannot be accessed with code under the initial-exec model.
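(And dlopen is exactly how node loads native modules - via process.dlopen. Here's a sketch of how one could reproduce the failure by hand, assuming the libc crate:)

Rust code
use std::ffi::{CStr, CString};

fn main() {
    // hypothetical path to a library with a large TLS segment
    let path = CString::new("./index.node").unwrap();
    let handle = unsafe { libc::dlopen(path.as_ptr(), libc::RTLD_NOW) };
    if handle.is_null() {
        let err = unsafe { CStr::from_ptr(libc::dlerror()) };
        // e.g. "cannot allocate memory in static TLS block"
        eprintln!("dlopen failed: {}", err.to_string_lossy());
    }
}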

Anyway, Rust has an unstable flag for controlling the TLS model, which is forwarded almost verbatim to LLVM. So I added the following to my .cargo/config:

TOML markup
[build]
rustflags = ["-Z", "tls-model=global-dynamic"]

But that didn't help. What I missed at the time was that the flag only affects code generated by rustc - the Go and C parts of the library were completely unaffected.

So, by this point, I had the following elements:

  1. Loading index.node failed because there wasn't enough room left in the static TLS block.
  2. The Go code barely used any thread-local storage.
  3. Of all the libraries we depended on, only libc.so.6 had a TLS segment - a small one.
  4. Changing the TLS model of the Rust code didn't help.

And I finally checked how much thread-local storage my library needed:

Shell session
$ readelf -Wl ./artifacts/linux-x86_64/index.node | grep -E 'PhysAddr|TLS'
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  TLS            0x1379be0 0x000000000137abe0 0x000000000137abe0 0x000005 0x001270 R   0x20

0x1270? That's 4720 bytes - way over the 2048-byte surplus the loader set aside. That's too much.

The investigation continues

I had to take a breather. I was already a bit anxious about inviting all those libraries into the same memory space, but now someone had committed a thread-local crime and nobody would fess up to it.

I was out of ideas! Should I try to remove dependencies one by one, and see when TLS usage would drop? That's kinda hard to do - I only used one or two crates, they pulled in a few more, and so on.

I tried a few things that didn't work. One workaround that's commonly recommended when faced with insufficient static TLS block space is to just LD_PRELOAD the offending libraries.

Of course, this was not an option for me:

Shell session
$ LD_PRELOAD=./artifacts/linux-x86_64/index.node node -e 'require("./artifacts/linux-x86_64")'
internal/modules/cjs/loader.js:1250
  return process.dlopen(module, path.toNamespacedPath(filename));
                 ^

Error: Module did not self-register: '/home/amos/work/valet/artifacts/linux-x86_64/index.node'.
    at Object.Module._extensions..node (internal/modules/cjs/loader.js:1250:18)
    at Module.load (internal/modules/cjs/loader.js:1049:32)
    at Function.Module._load (internal/modules/cjs/loader.js:937:14)
    at Module.require (internal/modules/cjs/loader.js:1089:19)
(etc)

...because node.js modules have an initializer. Whenever they're loaded, their initializer function is called (by the dynamic loader), and that's when they register themselves with the node.js runtime.

By preloading it, I was running the initializer before node itself was loaded, so the size of the TLS static block was not an issue, but my module never had a chance to register itself.

I also went to see the old LD_DEBUG environment variable, but it didn't have much to tell me:

Shell session
$ LD_DEBUG=all node -e 'require("./artifacts/linux-x86_64")'
    215183:     checking for version `GLIBC_2.2.5' in file /usr/lib/libpthread.so.0 [0] required by file /home/amos/work/valet/artifacts/linux-x86_64/index.node [0]
    215183:     object=/home/amos/work/valet/artifacts/linux-x86_64/index.node [0]
    215183:      scope 0: node /usr/lib/libz.so.1 /usr/lib/libcares.so.2 /usr/lib/libnghttp2.so.14 /usr/lib/libcrypto.so.1.1 /usr/lib/libssl.so.1.1 /usr/lib/libicui18n.so.67 /usr/lib/libicuuc.so.67 /usr/lib/libdl.so.2 /usr/lib/libstdc++.so.6 /usr/lib/libm.so.6 /usr/lib/libgcc_s.so.1 /usr/lib/libpthread.so.0 /usr/lib/libc.so.6 /usr/lib/libicudata.so.67 /lib64/ld-linux-x86-64.so.2
    215183:      scope 1: /home/amos/work/valet/artifacts/linux-x86_64/index.node /usr/lib/libdl.so.2 /usr/lib/libpthread.so.0 /usr/lib/libgcc_s.so.1 /usr/lib/libc.so.6 /lib64/ld-linux-x86-64.so.2 /usr/lib/libm.so.6
    215183:
    215183:
    215183:     relocation processing: /home/amos/work/valet/artifacts/linux-x86_64/index.node
    215183:
    215183:     file=/home/amos/work/valet/artifacts/linux-x86_64/index.node [0];  destroying link map
internal/modules/cjs/loader.js:1250
  return process.dlopen(module, path.toNamespacedPath(filename));
                 ^

Error: /home/amos/work/valet/artifacts/linux-x86_64/index.node: cannot allocate memory in static TLS block
    at Object.Module._extensions..node (internal/modules/cjs/loader.js:1250:18)
    at Module.load (internal/modules/cjs/loader.js:1049:32)
    at Function.Module._load (internal/modules/cjs/loader.js:937:14)
    at Module.require (internal/modules/cjs/loader.js:1089:19)
    at require (internal/modules/cjs/helpers.js:73:18)
(many lines omitted)

"Destroying link map", they said. More like destroying evidence.

A feeling of hopelessness started washing over me. That's what happens when you play with fire. You think you know... until the linker is unhappy. And then it's all over.

Unless... unless we look at symbols.

That's it! Everything is just symbols. It's a Unix system^W^Wbunch of ELF files! I know this!

First, I looked in the .tdata section:

Shell session
$ objdump -t ./artifacts/linux-x86_64/index.node | grep -F '.tdata'
000000000137abe0 l    d  .tdata 0000000000000000              .tdata
0000000000000003 l       .tdata 0000000000000002              _ZN5tokio7runtime5enter7ENTERED7__getit5__KEY17ha484cdfaf18d69f5E
0000000000000000 l       .tdata 0000000000000003              _ZN5tokio4coop7CURRENT7__getit5__KEY17h8278716387cbc8f1E.llvm.760478660955371005

Ugh, name mangling. I really wasn't in the mood for that. I decided to use llvm-objdump instead:

Shell session
$ llvm-objdump -C -t ./artifacts/linux-x86_64/index.node | grep -F '.tdata'
000000000137abe0 l    d  .tdata 00000000 .tdata
0000000000000003 l     O .tdata 00000002 tokio::runtime::enter::ENTERED::__getit::__KEY::ha484cdfaf18d69f5
0000000000000000 l     O .tdata 00000003 tokio::coop::CURRENT::__getit::__KEY::h8278716387cbc8f1 (.llvm.760478660955371005)

Better. But - no suspects here. Just tokio being tokio.

Next up, .tbss:

Shell session
$ llvm-objdump -C -t ./artifacts/linux-x86_64/index.node | grep -F '.tbss'
000000000137ac00 l    d  .tbss  00000000 .tbss
0000000000001268 l     O .tbss  00000008 runtime.tlsg
0000000000001140 l     O .tbss  00000028 std::io::stdio::LOCAL_STDERR::__getit::__KEY::he207ce34e31d9b05
0000000000001100 l     O .tbss  00000028 std::io::stdio::LOCAL_STDOUT::__getit::__KEY::ha967a278cccaa842
00000000000011d8 l     O .tbss  00000040 tokio::runtime::context::CONTEXT::__getit::__KEY::hca1d1663d4c3ce31
0000000000001240 l     O .tbss  00000018 tokio::runtime::thread_pool::worker::CURRENT::FOO::__getit::__KEY::h191212b01662790d
0000000000001220 l     O .tbss  00000018 tokio::runtime::basic_scheduler::CURRENT::FOO::__getit::__KEY::h12b29ef52d395a65
00000000000010b8 l     O .tbss  00000018 reqwest::util::fast_random::RNG::__getit::__KEY::h9c9ced3e4c71f36f (.llvm.3808017548947348701)
00000000000010d0 l     O .tbss  00000020 std::collections::hash::map::RandomState::new::KEYS::__getit::__KEY::h2e52b9d06b996239 (.llvm.5237804589333414006)
0000000000001180 l     O .tbss  00000038 std::sys_common::thread_info::THREAD_INFO::__getit::__KEY::h043f97a4ccd012f9 (.llvm.5237804589333414006)
0000000000001258 l     O .tbss  00000010 tokio::park::thread::CURRENT_PARKER::__getit::__KEY::h5deb40f082f99b44 (.llvm.14254176473769782221)
0000000000000020 l     O .tbss  00001098 rand::rngs::thread::THREAD_RNG_KEY::__getit::__KEY::h65d289a02f603765 (.llvm.7091445804684704757)
00000000000011c0 l     O .tbss  00000018 std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::__KEY::h31577e4a8fe1e192 (.llvm.5237804589333414006)

AhAH! What's this?

0000000000000020 l     O .tbss  00001098 rand::rngs::thread::THREAD_RNG_KEY::__getit::__KEY::h65d289a02f603765

0x1098? 4248 bytes? That's too much.

The rand crate is our culprit.

Happy ending

The story didn't stop there. I filed a terrible report, and the truth came out.

The rand crate comes with several random number generators. As of version 0.7.x, it defaults to rand_chacha on non-wasm platforms. But before that - in version 0.6.5, for example - it depended on all those RNGs unconditionally, including rand_hc, which uses a "large" amount of thread-local storage.
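To illustrate (with a simplified stand-in - this is not rand's actual code): a thread_local! whose value lives inline contributes its full size to the library's TLS segment, whereas one that merely holds a pointer to heap-allocated state contributes little more than a pointer:

Rust code
use std::cell::RefCell;

// stored inline: all 4096 bytes end up in the TLS segment
thread_local! {
    static INLINE_STATE: RefCell<[u8; 4096]> = RefCell::new([0; 4096]);
}

// boxed: only a pointer lives in the TLS segment - the state
// itself is heap-allocated, once per thread, on first access
thread_local! {
    static BOXED_STATE: RefCell<Box<[u8; 4096]>> =
        RefCell::new(Box::new([0; 4096]));
}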

And guess which crate depended on rand 0.6.5? The backoff crate, which I've just written about!

In the end, the investigation was a success, even though looking into linker errors is much harder than looking into other kinds of errors.

Another case cracked.

What did we learn?

When debugging linker issues:

LD_DEBUG=all lets you trace everything the dynamic linker / dynamic loader does on Linux. Well, almost everything. Even though it didn't pan out for amos, it's always a good starting point, to gather additional information beyond "didn't load lol".

The ldd command is certainly a classic, but be aware that it only lists direct dependencies. The dynamic libraries listed may depend on other dynamic libraries, and even if you build the full graph yourself, it still won't take into account anything loaded at run-time with dlopen.

readelf and objdump (or llvm-objdump) are intimidating, but also invaluable. A proper debugger would've been no help at all, as we weren't dealing with a crash, just a dynamic linker error, which was correctly caught by the node.js runtime and resulted in a clean process exit, with code 1.

Finally, don't go at those problems alone! Exchanging ideas and theories with a fellow linker sufferer will very often unblock your investigation.

On that note, many thanks to @iximeow for listening to my dynamic linker ramblings and helping me get to the bottom of this.

That concludes our dynamic linker murder mystery... for now.

Until next crime, take care.