Thread-local storage
This article is part of the Making our own executable packer series.
- A short (and mostly incorrect) history of Intel chips
- The 286
- The 386
- The 64-bit era
- What about segmentation?
- A quick detour through typestates
- C programs
Contents
Welcome back and thanks for joining us for the reads notes... the thirteenth installment of our series on ELF files, what they are, what they can do, what does the dynamic linker do to them, and how can we do it ourselves.
I've been pretty successfully avoiding talking about TLS so far (no, not that one) but I guess we've reached a point where it cannot be delayed any further, so.
So. Thread-local storage.
We know from our adventures in reading files the hard way that, as a userland application, we are a guest on the computer.
Sure, we may execute some instructions, we may even politely request that certain devices tend to our needs but ultimately the one who's calling the shots is the kernel. We are tolerated here. Honestly, the kernel would rather nothing execute at all.
In fact, if you haven't read that article yet, go read it.
I'll even link it again. I'm serious. I'll wait, don't worry.
Occasionally though, the kernel will let non-kernel code execute. And again, it's in charge of exactly how and when that happens - and for how long.
By now we've formed a fairly good idea of how processes are loaded into memory: the kernel itself parses the file we want to execute, if it's ELF it parses it (it's not interested in nearly as many things as we are, though), maps a few things, then "hands off" control to it.
But what does hand off mean? In concrete terms, what happens? Well, today's not the day we get into kernel debugging (although... nah. unless? no.), but we sure can get a rough idea what's going on.
What is a computer? A miserable little pile of registers. That's right - it's global variables all the way down.
Here's the value of some of the CPU registers just as echidna
's main function
starts executing:
Is that all of them? Nope! There's 128-bit registers (SSE), 256-bit registers (AVX), 512-bit registers (AVX-512) - and of course we still have the x87/FPU registers, from back when you needed a co-processor for that.
TL;DR - it's a whole mess. The point is, we have a bunch of global variables that are, like, really fast to read from and write to. So optimizing compilers tend to want to use them whenever possible.
And by "them" I mean the general-purpose ones in the bunch - from %rax
to
%r15
. And sometimes, if your optimizer feels particularly giddy, some of
the %xmmN
registers as well (as we have painfully learned in the last
article).
And then there's special-purpose registers, like cs
, ss
, ds
, es
, etc.
We're not overly concerned with those four in particular, because on we're on
64-bit Linux and our memory model is somewhat simpler.
In fact, we've been using registers all this time to send the kernel love
letters - in echidna
's write
function for example:
pub unsafe fn write(fd: u32, buf: *const u8, count: usize) { let syscall_number: u64 = 1; asm!( "syscall", inout("rax") syscall_number => _, in("rdi") fd, in("rsi") buf, in("rdx") count, lateout("rcx") _, lateout("r11") _, options(nostack) ); }
So both the kernel and userland applications use registers. One of my
favorite registers - seeing as I'm in the middle of writing a series
about ELF files - is %rip
, the instruction pointer.
I'm being told that it wasn't always that simple, but on 64-bit Linux, it
just points to the (virtual) address of the code we're currently executing.
Whenever program execution moves forward, so does %rip
- by however many
bytes it took to encode the instruction that was just executed:
So, this answers part of our question - how does the kernel "hand off"
control to a program: it just changes %rip
! And the processor does the
rest. Well. Sorta kinda. "Among other things", let's say.
(Note that, on x86, you can't write to the %rip
register directly - you
have to use instructions like jmp
, call
, or ret
.)
To be fair, it also switches from ring 0 to ring 3 - again, something we've briefly discussed in Reading files the hard way Part 2. And it switches from the "kernel virtual address space" to the "userland virtual address space".
And other stuff happens too. I lied earlier. It's actually quite involved.
Point is - that's also how switching between processes works. As far as the user is concerned, processes execute in parallel, but as far as the kernel is concerned, its scheduler is handing out slices of time. Whenever it lets process "foo" execute for a bit, it:
- Sets up a system timer interrupt
- Restores the state of all CPU registers to what it was for that process
- Switches from Ring 0 to Ring 3, also jumping to whatever address
%rip
had when process "foo" was last interrupted
Eventually, the system timer interrupt goes off, and execution immediately jumps back to the kernel's interrupt handler - at which point the kernel decides whether the process has been naughty or nice and whether it merits more time.
If not - for example, if it decides we really should be giving process "bar" more time next, then the kernel saves the state of "foo", (most of the registers), resets a bunch of CPU state (mostly memory caches), and switches to "bar" the way we've just described.
That's the very distant overview of things. It's also not entirely correct. But for our purposes, it's correct enough.
That's for processes. But what about threads? Threads are also "preemptive multitasking" - instead of explicitly relinquishing control, their execution can be violently interrupted (ie. "preempted") so that other threads can be executed.
The "other" multitasking is cooperative multitasking - which you don't need the kernel's help to do. That's how coroutines work - just bits of userland state that all play nice together when it comes to deciding whose turn is it.
Switching between threads is simpler though. Because of all threads of a given process share the same address space. So there's less state to save and restore when switching from one to the other.
But then... the question arises: how do you tell threads apart? If several threads are started with the same entry point, how do you know which is which? Is that something the CPU handles? or the kernel?
What's the story here?
Let's run a little experiment.
$ cd samples/ $ mkdir twothreads
// in `samples/twothreads/twothreads.c` #include <unistd.h> #include <pthread.h> void *in_thread(void *unused) { while (1) { sleep(1); } } int main() { pthread_t t1, t2; pthread_create(&t1, NULL, in_thread, NULL); pthread_create(&t2, NULL, in_thread, NULL); pthread_join(t1, NULL); pthread_join(t2, NULL); }
There. Two threads, one entry point. Two enter, neither leaves.
$ cd samples/twothreads $ gcc twothreads.c -pthread -o twothreads $ ./twothreads (program doesn't print anything and never exits)
Now, let's run that program under GDB, break on in_thread
and
compare registers.
$ gdb --quiet ./twothreads Reading symbols from ./twothreads... (gdb) break in_thread Breakpoint 1 at 0x1175: file twothreads.c, line 6. (gdb) run Starting program: /home/amos/ftl/elf-series/samples/twothreads/twothreads [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib/libthread_db.so.1". [New Thread 0x7ffff7dbe640 (LWP 27480)] [New Thread 0x7ffff75bd640 (LWP 27481)] [Switching to Thread 0x7ffff7dbe640 (LWP 27480)] Thread 2 "twothreads" hit Breakpoint 1, in_thread (unused=0x0) at twothreads.c:6 6 sleep(1);
Everything makes sense so far. We've got three threads - the main thread, and the two other we created. So really, the program should be named "threethreads".
Using the GDB command bt
(or backtrace
) shows us the backtrace of the current
thread:
(gdb) bt #0 in_thread (unused=0x0) at twothreads.c:6 #1 0x00007ffff7f943e9 in start_thread (arg=0x7ffff7dbe640) at pthread_create.c:463 #2 0x00007ffff7ec2293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Just like info registers
, GDB has info threads
, which lets us know
what's going on with all of them:
(gdb) info threads Id Target Id Frame 1 Thread 0x7ffff7dbf740 (LWP 27476) "twothreads" __pthread_clockjoin_ex (threadid=140737351771712, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145 * 2 Thread 0x7ffff7dbe640 (LWP 27480) "twothreads" in_thread (unused=0x0) at twothreads.c:6 3 Thread 0x7ffff75bd640 (LWP 27481) "twothreads" in_thread (unused=0x0) at twothreads.c:6
We can set the "current GDB thread" to whatever we want, for example if we want to see what the main thread is up to:
(gdb) thread 1 [Switching to thread 1 (Thread 0x7ffff7dbf740 (LWP 27476))] #0 __pthread_clockjoin_ex (threadid=140737351771712, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145 145 lll_futex_wait_cancel (&pd->tid, tid, LLL_SHARED);
I'm curious - our t1
and t2
variables - what do they contain exactly?
(gdb) frame 1 #1 0x00005555555551e3 in main () at twothreads.c:14 14 pthread_join(t1, NULL); (gdb) info locals t1 = 140737351771712 t2 = 140737343379008 (gdb) p/x {t1, t2} $5 = {0x7ffff7dbe640, 0x7ffff75bd640}
Those look like pointers. Okay. So, now that we know how to inspect the state of various threads, let's see what's going on with our two threads - here they are back-to-back:
Things look eerily similar. And they should - both threads are doing exactly the same thing - waiting for time to run out, one second at a time.
Sure, some register values are off by 0x1000
(%rbp
through %r10
), but,
for example, %rip
is exactly the same for both. Which is reassuring, to be
honest. Not all our assumptions are wrong.
But there has to be a way to tell them apart. For starters, "pthreads" (POSIX threads) are implemented as a userland library:
(gdb) info sharedlibrary From To Syms Read Shared Object Library 0x00007ffff7fd2090 0x00007ffff7ff2746 Yes /lib64/ld-linux-x86-64.so.2 0x00007ffff7f92a70 0x00007ffff7fa1025 Yes /usr/lib/libpthread.so.0 0x00007ffff7de8650 0x00007ffff7f336bd Yes /usr/lib/libc.so.6
...and it exposes functions like pthread_self()
- which returns the ID of the
calling thread. So it must know which thread we're currently in. And all
we have to go by are... registers.
But which one?
Let's do something I wish I had figured out months ago, when I was still researching whether "rolling your own dynamic linker" was even a halfway reasonable thing to do.
Let's disassemble pthread_self
.
(gdb) disas pthread_self Dump of assembler code for function pthread_self: 0x00007ffff7e48730 <+0>: endbr64 0x00007ffff7e48734 <+4>: mov rax,QWORD PTR fs:0x10 0x00007ffff7e4873d <+13>: ret End of assembler dump.
And with that, the hunt is over.
It's %fs
. That was the culprit all along.
The 6 Segment Registers are:
- Stack Segment (SS). Pointer to the stack.
- Code Segment (CS). Pointer to the code.
- Data Segment (DS). Pointer to the data.
- Extra Segment (ES). Pointer to extra data ('E' stands for 'Extra').
- F Segment (FS). Pointer to more extra data ('F' comes after 'E').
- G Segment (GS). Pointer to still more extra data ('G' comes after 'F').
Source: X86 Assembly Wikibook
Great. So the "s" stands for "segment" and the "f" stands for "fxtra data".
We've reached peak x86.
But hold on a second. I'm pretty sure %fs
was 0x0 every time we looked at
it. Let's double check:
(gdb) t a a i r fs Thread 3 (Thread 0x7ffff75bd640 (LWP 27481) "twothreads"): fs 0x0 0 Thread 2 (Thread 0x7ffff7dbe640 (LWP 27480) "twothreads"): fs 0x0 0 Thread 1 (Thread 0x7ffff7dbf740 (LWP 27476) "twothreads"): fs 0x0 0
t a a i r fs
is just the obscure way of saying thread apply all info register fs
.
That's right - whenever it's not ambiguous, GDB lets you shorten any command or
option name. In fact, if you see a shortcut being used and you're not sure what it does,
you can ask gdb, since its help
command also accepts the shortcut form.
For example:
(gdb) help ni nexti, ni Step one instruction, but proceed through subroutine calls. Usage: nexti [N] Argument N means step N times (or till program stops for another reason).
So, GDB tells us %fs
is 0x0
for all three of our threads.
Is this a lie? Yes. If that was the case, pthread_self
would try to read
from memory address 0x0+0x10
and definitely segfault.
But it doesn't:
(gdb) print (void*) pthread_self() [Switching to Thread 0x7ffff75bd640 (LWP 27481)] Thread 3 "twothreads" hit Breakpoint 1, in_thread (unused=0x0) at twothreads.c:6 6 sleep(1); The program stopped in another thread while making a function call from GDB. Evaluation of the expression containing the function (pthread_self) will be abandoned. When the function is done executing, GDB will silently stop.
It uh... reads notes hang on a minute:
(gdb) set scheduler-locking on (gdb) print (void*) pthread_self() $7 = (void *) 0x7ffff75bd640
It doesn't! It doesn't segfault.
scheduler-locking
is a feature of GDB that politely asks the Linux kernel
to, like, not preempt the current thread, because we're looking at it.
More info is available on Kevin Pouget's blog.
So GDB is lying. But it's not entirely surprising - the %fs
register
is thread-local (on Linux 64-bit! remember that whatever a register is
used for is entirely defined in the ABI and it's up to the kernel to make it
so), and GDB itself is running its own threads distinct from the inferior's
threads.
It's been a while since we've been over weird GDB terminology, so, just in case, the "inferior" is the "process being debugged". I know. Weird. Moving on.
Is there another way to grab the contents of the %fs
register? Sure there
is! We can ask the kernel politely via the arch_prctl
syscall. We'll use
libc's wrapper for it:
#include <asm/prctl.h> #include <sys/prctl.h> int arch_prctl(int code, unsigned long addr); int arch_prctl(int code, unsigned long *addr); #define ARCH_SET_GS 0x1001 #define ARCH_SET_FS 0x1002 #define ARCH_GET_FS 0x1003 #define ARCH_GET_GS 0x1004
That's right. The same function is defined once as taking an uint64_t
, and
a second time as taking a pointer to an uint64_t
. You know, since it can
both get and set things.
That's just how libc rolls, baby. Whoever tells you C has a type system is either delusional or mischievous.
(gdb) print (void) arch_prctl(0x1003, $rsp-0x8) $8 = void (gdb) x/gx $rsp-0x8 0x7ffff75bce38: 0x00007ffff75bd640
That looks like a real value!
Why the ceremony? Well, %fs
and %gs
aren't general-purpose registers -
they're segment registers. Segment registers were a lot more relevant before
the 64-bit era.
Let's go back in time for a little while...
A short (and mostly incorrect) history of Intel chips
The year is 1976. Four years have passed since the released of the 8-bit Intel 8008, and other companies are releasing 16-bit microprocessors left and right.
An Intel 8008 chip
Digital Equipment Corporation (DEC), Fairchild Semiconductor and National Semiconductor have all released some form of 16-bit microprocessor. One year prior, National even released the PACE, a single chip based loosely on its own IMP-16 design.
Meanwhile, Intel is one year into the iAPX 432 project, which.. really warrants at least one entire article. Ada was the intended programming language for the processor, and it supported object-oriented programming and capability-based addressing.
The iAPX 432 project was struggling though - turns out those abstractions weren't free. Not only did they require significantly more transistors, performance of equivalent programs suffered compared to competing microprocessors.
So, in May of 1976, the folks at Intel go "okay let's work on some 16-bit chip that we can release before iAPX 432 is done cooking. This is one month before Texas Instruments (TI) releases the TMS9900, another single-ship 16-bit microprocessor - the pressure is real.
But what does "a 16-bit chip" really mean? Well actually... it all depends.
For example, I've referred to the Intel 8008 as an "8-bit chip" - but it's not that simple.
Sure, the registers of the 8008 were eight bits. Each bit can be on or off:
This article is part 13 of the Making our own executable packer series.
If you liked what you saw, please support my work!