Looking at that latest mental model, it's.. a bit suspicious that every program ends up calling the same set of functions. It's almost like something different happens when calling those.

Are those even regular functions? Can we step through them with a debugger?

If we run our stdio-powered C program in gdb, and break on read, we can confirm that we indeed end up calling a read function:

Cool bear's hot tip

GDB is an open-source debugger that runs on Linux, macOS (sometimes), and Windows (with some limitations).

It allows, among many other things, setting breakpoints and stepping through code.

We'll be using it a bunch.

Same goes for the Rust program:

However, when we try to step through it... nothing. For the sake of the investigation, I cloned the glibc repository (since that's where the read function seems to live), and found this:

Cool bear's hot tip

The source code for glibc (shown above) can be found in this git repository.

A libc is a very complicated piece of software, for many historical and practical reasons. There are other popular ones, like musl.

The reason we can't find the source of read is because... it lives in another land entirely:

Cool bear's hot tip

"syscall" is what the "s" in strace stands for.

It's not just two different sets of software. They run with different privileges. The Linux kernel (and its device drivers) runs in ring 0, where everything is allowed. Userland applications, however, run in ring 3.

This is a classic diagram, so I had to show it here, but I don't think it's super intuitive. I prefer to think of it this way:

Because what you can do from ring 0 is a strict superset of what you can do from ring 3. Ring 3 is like a prison. Anybody from ring 0 can visit, but ring 3 can only send letters (i.e. make syscalls).

So the kernel handles things like reading and writing. But it also handles things like processes. When we start our application, it runs in a process - and the kernel decides which process gets to run and when. It interrupts processes and resumes them, prioritizing the important stuff - but also, giving the illusion that a single CPU core can do multiple things at once (when it really can only do one at a time).

Cool bear's hot tip

Reality is a lot more complicated. CPU cores do do multiple things at once, just not in a way that's easy to observe.

There's a lot the kernel is responsible for, but let's focus on files. Each process has resources associated with it - like file descriptors! When we open a file (with the open syscall), the kernel:

And in our further communication with the kernel, whenever we want to refer to that resource, we'll just use that number.

And this answers one of the questions you might have had while following this article: in the strace output for most programs, we saw a call to close (which, well, closes a file descriptor) - but in our sample C program, we never bothered calling it!

This is because the kernel, in its infinite wisdom, keeps track of opened file descriptors, and cleans them all up once the process exits.

"This is all getting a bit theoretical" whispers someone in the back. "I'm glad he's not showing us kernel code, but.. are we just supposed to trust that the kernel cleans up file descriptors?"

Good question! There is a command to list open file descriptors for a specific path, so we can verify that real quick with a simple C program:

C code
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int fd = open("/etc/hosts", O_RDONLY);
    printf("Our file descriptor for /etc/hosts is %d\n", fd);
    printf("Press enter to exit...\n");
    getc(stdin);
}

What did we learn?

The kernel is all-powerful. It decides how processes are run, manages access to all devices (including disks), and is in charge of enforcing security.

Regular function calls are just "jumps" to another part of the code. Syscalls are not regular function calls. They are a secure interface between ring 3 (userland, our applications) and ring 0 (the kernel).

Making a syscall involves writing parameters somewhere accessible from userland, and politely asking the kernel to consider our request. The kernel is free to deny it, for various reasons: the file may not exist, we might not have permission to read it, etc.

Making a syscall

We need to clear up a potential source of confusion. We saw a read() function in the source code for glibc (the C library that ships with most Linux distributions), but it is distinct from the actual read syscall.

It seems like most of Unix is written in C, but can we make a syscall without using libc? Something like this:

Hey, Go is not C - does Go use libc to make syscalls? Let's find out.

This is the source code for a simple Go program that prints the contents of /etc/hosts:

Go code
package main

import (
	"fmt"
	"io/ioutil"
)

func main() {
	payload, err := ioutil.ReadFile("/etc/hosts")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", string(payload))
}

It sure works, but it doesn't seem to link against libc:

Cool bear's hot tip

ldd is a tool that "prints shared object dependencies".

The ldd man page has more info.

In our case, it's useful to make sure our program doesn't use glibc. Go programs are usually statically linked, unless they use obscure packages like net or os/user.

Welp, I guess Go programs are usually dynamic after all, but you can fix that.

Trying to break on read in gdb also gives nothing. Well, can we make sure it still uses the openat and read syscalls at least? Let's strace it:

It does! So one doesn't need libc to make a syscall. What a relief.

What did we learn?

Even though, in some respects, Go is a higher-level language than C (it has a garbage collector, it comes with concurrency primitives, etc.), it doesn't rely on libc to make syscalls.

This contrasts with the Node.js runtime, and the Rust standard library, which both use libc to make syscalls.

Making a Linux syscall on x86_64

So we've seen that pretty much all languages, no matter how many levels of abstraction they sit on, have to eventually make syscalls one way or the other.

But how does one make a syscall? So far we've been using languages that either:

Let's try to make a syscall ourselves, in assembly.

Cool bear's hot tip

Assembly is one of the intermediate forms most compiled programs go through before they become full executables.

In very simple terms: a C compiler translates C to assembly, the assembler translates assembly to machine code, and a linker glues together several pieces of machine code into an executable.

We'll use nasm, because everything else gives me high blood pressure. Our code is going to be in readfile.asm, and we're going to build it with this makefile:

.PHONY: all

all:
	nasm -f elf64 -F dwarf -g readfile.asm
	ld readfile.o -o readfile

The nasm invocation assembles our assembly into an object file, and the ld invocation links it into a full executable.

We're going to start with a very simple program, let's put this in readfile.asm:

X86 Assembly
global _start     ; _start is our entry point - this is its declaration...

section .text     ; the text section is where we'll put executable code

_start:
    xor rdi, rdi  ; ...and this is its definition. we just set rdi to 0.

Cool bear's hot tip

rdi is a register.

Registers are memory locations in the CPU that are used for many things: temporary storage, passing arguments to functions, returning values from functions, etc. You can think of them as global variables.

Each architecture has its own set of registers. We're on x86_64, so we'll be using a few of the 64-bit general-purpose registers, like rax, rdi, rsi. We'll also be using the stack pointer, rsp.

This compiles and links just fine, but it segfaults when we run it:

To get our program to exit cleanly, we need to.. make a syscall.

The first thing we need to make a syscall is its number, which we can find on our Manjaro/ArchLinux system in /usr/include/asm/unistd_64.h. Searching for exit reveals that its number is 60:

The second thing we need to do is.. count our lucky stars, because on x86_64, there is a dedicated instruction to make a syscall. (It's called syscall).

So now we just need to put the syscall number 60 in the rax register, and use the syscall instruction, and we should be good:

X86 Assembly
global _start

section .text

_start:
    mov rax, 60
    syscall

And just like that, we made a syscall without using libc! That wasn't so hard.

Cool bear's hot tip

If it's not that hard to make a syscall without libc, why do so many languages use libc to make syscalls?

Well, it's easy to make Linux syscalls, on x86_64. 32-bit architectures have different ways to make syscalls. Other operating systems have completely different sets of syscalls.

You can learn about some of those differences in the man page for syscall(2). You can pull it up from a Linux system with the man 2 syscall command, or read it online.

Can we re-implement our whole readfile application in assembly? Let's see.

Logging isn't going to be as easy as with Node.js, or Rust, or C, or Go. So we're going to have to lean on the debugger a little bit. Thanks to the -F dwarf -g flags we passed to nasm, we have great debug information:

So, let's try the open syscall. It needs the same parameters as in C: first a path, then a set of flags. We'll store the path in the data section.

X86 Assembly
global _start

section .text

_start:
    mov rax, 2        ; "open" syscall
    mov rdi, path     ; arg 1: path
    xor rsi, rsi      ; arg 2: flags (0 = O_RDONLY)
    syscall

    mov rax, 60       ; "exit" syscall
    syscall

section .data

path: db "/etc/hosts", 0 ; null-terminated

Cool bear's hot tip

You don't need to be fluent in assembly to read this article - it's good to be exposed even to languages we don't fully understand. Most people never get "formal training" in assembly, but pick up bits and pieces over the years.

If you want to learn a little more assembly before continuing, you might want to check out this NASM tutorial first.

Stepping through this with gdb, we can make sure open succeeded, by using lsof in another terminal:

We can also use strace on our resulting binary. It shows whether syscalls succeed or fail, so that works out great:

Woops, it looks like we're not exiting with status code 0, let's fix that:

X86 Assembly
    mov rax, 60
    xor rdi, rdi      ; <--- exit with code 0
    syscall

That's better. Now let's try reading some bytes from this file descriptor. The return value of open is stored in rax, as with every other syscall.

We're going to be using rax to make our next syscall, so we need to save it to the stack with push. Also, we need to allocate memory for our buffer - let's allocate 16 bytes on the stack.

X86 Assembly
_start:
    mov rax, 2        ; "open"
    mov rdi, path     ;
    xor rsi, rsi      ; O_RDONLY
    syscall

    push rax          ; push file descriptor onto stack
    sub rsp, 16       ; reserve 16 bytes of memory

    xor rax, rax      ; "read"
    mov rdi, [rsp+16] ; file descriptor
    mov rsi, rsp      ; address of buffer
    mov rdx, 16       ; size of buffer
    syscall

    ; exit, etc.

Cool bear's hot tip

The stack is an area we can use to store data. It's limited in size, but simpler to use than the heap.

To reserve memory, we simply subtract from rsp, a register that contains the address of the "top of the stack".

Here's a handy diagram that shows what happens to the stack just before we call read:

Running our program still prints nothing so far, but strace lets us know that everything went fine:

Now let's print that buffer to stdout by using the write call.

X86 Assembly
    xor rax, rax      ; "read"
    mov rdi, [rsp+16] ; file descriptor
    mov rsi, rsp      ; address of buffer
    mov rdx, 16       ; size of buffer
    syscall

    ; `rax` contains the number of bytes read
    ; write takes the number of bytes to write via `rdx`
    mov rdx, rax      ; number of bytes
    mov rax, 1        ; "write"
    mov rdi, 1        ; file descriptor (stdout)
    mov rsi, rsp      ; address of buffer
    syscall

And we're finally seeing some output:

Finally, we just need to repeat reading and writing until read returns 0 bytes read. (We won't be doing any error checking).

Here's our final program:

X86 Assembly
global _start

section .text

_start:
    mov rax, 2        ; "open"
    mov rdi, path     ;
    xor rsi, rsi      ; O_RDONLY
    syscall

    push rax          ; push file descriptor onto stack
    sub rsp, 16       ; reserve 16 bytes of memory

read_buffer:
    xor rax, rax      ; "read"
    mov rdi, [rsp+16] ; file descriptor
    mov rsi, rsp      ; address of buffer
    mov rdx, 16       ; size of buffer
    syscall

    test rax, rax     ; jz means 'jump if zero'
    jz exit

    mov rdx, rax      ; number of bytes
    mov rax, 1        ; "write"
    mov rdi, 1        ; file descriptor (stdout)
    mov rsi, rsp      ; address of buffer
    syscall

    jmp read_buffer

exit:
    mov rax, 60       ; "exit"
    xor rdi, rdi      ; return code 0
    syscall

section .data

path: db "/etc/hosts", 0 ; null-terminated

What did we learn?

Making syscalls on Linux x86_64 involves putting values in some registers, and then using the syscall instruction.

We can use the stack (which grows downward) as temporary storage space.

Cool bear's hot tip

The information about syscalls in this article is extremely Linux-specific.

For example, Windows syscall numbers change across OS versions - sometimes even service packs. If you're curious, check out this table.

While it's possible to make syscalls without using OS libraries, it's not always practical.

Memory-mapped files

Let's back up a little and take a look at the strace for mousepad, the program we used to read a file in the GUI.

Note: to obtain this trace, I had to use the -f flag for strace, because the I/O happens in a child process.

We recognize the openat syscall, and also fstat - but.. it doesn't use read or write. What's happening over here?

Well, remember when we said the kernel is an all-powerful overseer that controls everything the userland interacts with? That goes for memory too!

In an operating system like Linux, each process has its own virtual address space. Some of it is mapped to physical memory, via the Memory management unit (MMU for short).

Cool bear's hot tip

Physical memory is divided into "pages", which makes it easier to address. Pages are often 4KiB, but not always!

When a process is started, a few pages are reserved for its stack. (Which we used above). When allocating memory on the heap, say, with malloc, glibc's allocator asks the kernel for more pages, and keeps track of all the allocations, so that free() works properly.

First, let's check that every process does indeed have a separate address space.

We can make a first program, write.c:

C code
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv) {
    // allocate 4 bytes
    int *ptr = malloc(sizeof(int));
    // write a very specific value to it
    *ptr = 0xFEEDFACE;
    // read back the value, and print the address
    printf("Wrote %x to %p\n", *ptr, ptr);
    // wait for user input
    getc(stdin);
}

There's a good chance this program will print a different address every time, but when I ran it, it printed this:

Shell session
$ gcc write.c -o write
$ ./write
Wrote feedface to 0x56459a9c7260

We can use that to write a second program, read.c:

C code
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int *ptr = (int *) 0x56459a9c7260;
    printf("Read %x to %p\n", *ptr, ptr);
}

And run it:

$ gcc read.c -o read
$ ./read
[1]    6429 segmentation fault (core dumped)  ./read

What happened? 0x56459a9c7260 was a valid address in write's virtual address space, but not in read's. Attempting to read from it is an access violation, which results in the kernel sending a signal to our process, and the default handler for that signal terminates the process.

Cool bear's hot tip

We used this in Part 1 to get our stdio-powered C program to segfault!

An access violation is just one type of page fault. A page fault occurs when we try to read from or write to a (virtual) address that isn't currently mapped to physical memory.

And this is precisely the trick behind mmap. When we first mmap a file, the kernel might eagerly read the first 4K of the file into a buffer of its own, and set up the page tables so that the (userland) process can read directly from that buffer:

But once the process reads past the first 4K, then that's a page fault!

Remember, the kernel can do anything in response to a page fault: it may decide that it's an access violation, and send a signal to the process. In this case, it chooses simply to fulfill its promise that "this virtual address range contains the contents of the file", just.. not until it's needed.

The requested part of the file is actually read, and a new page mapping is set up:

The kernel is of course free to "page out" parts of the file, when they haven't been accessed in a while (or as soon as it wants, really!).

Cool bear's hot tip

When executing a program, its image is memory-mapped.

This allows a program to start executing before it's entirely read from disk, which matters a lot if the executable is large, or the I/O device is slow.

Using mmap from assembly

Can we use that from our assembly program? Sure we can!

Since we're not sure what parameters mmap needs (and which registers to put them into), we'll use this Searchable Linux Syscall table for x86 and x86_64 by @FiloSottile.

First, as usual, we'll need to open the file:

X86 Assembly
_start: mov rax, 2 ; "open" mov rdi, path ; xor rsi, rsi ; O_RDONLY syscall

Next, we want to find the size of the file in bytes, so we can pass it to mmap. We'll use the fstat syscall for that:

To help me write the next part, I wrote a simple C program that dumps the struct's size, along with the offset of the st_size field, and two constants:

C code
#include <stdio.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/mman.h>

int main() {
    printf("size of stat struct: %zu\n", sizeof(struct stat));
    printf("offset of st_size : %zu\n", offsetof(struct stat, st_size));
    printf("PROT_READ = 0x%x\n", PROT_READ);
    printf("MAP_PRIVATE = 0x%x\n", MAP_PRIVATE);
}

Which outputs:

size of stat struct: 144
offset of st_size : 48
PROT_READ = 0x1
MAP_PRIVATE = 0x2

So, it looks like we'll need to allocate 144 bytes on the stack:

X86 Assembly
    mov rdi, rax      ; fd (returned from open)
    sub rsp, 144      ; allocate stat struct
    mov rsi, rsp      ; address of 'struct stat'
    mov rax, 5        ; "fstat" syscall
    syscall

And then we can feed our file descriptor, file size, and flags to mmap. Note that we can specify an address (but NULL is fine) and an offset (but 0 is fine, since we want the whole file).

X86 Assembly
    mov rsi, [rsp+48] ; len = file size (from 'struct stat')
    add rsp, 144      ; free 'struct stat'
    mov r8, rdi       ; fd (still in rdi from last syscall)
    xor rdi, rdi      ; address = 0
    mov rdx, 0x1      ; protection = PROT_READ
    mov r10, 0x2      ; flags = MAP_PRIVATE
    xor r9, r9        ; offset = 0
    mov rax, 9        ; "mmap" syscall
    syscall

Finally, we can write out the whole file in a single write syscall:

X86 Assembly
    mov rdx, rsi      ; count (file size from last call)
    mov rsi, rax      ; buffer address (returned from mmap)
    mov rdi, 1        ; fd = stdout
    mov rax, 1        ; "write" syscall
    syscall

And there we have it:

What did we learn?

A process's address space refers to virtual memory, which is then mapped to physical memory via page tables. When an unmapped range is accessed, it results in a page fault.

Instead of reading parts of files with read, we can map them into the virtual address space with mmap. Reading from that range will result in the kernel reading the relevant parts of the file.

Executables are memory-mapped when run (even on Windows).

In the next part, we're going to take a look within the kernel, to see how files are organized and read - and how we can find and read them by using this knowledge, bypassing as much abstraction as we can.