Aug 31, 2019 18 min #os · #golang · #strace · #kernel · #assembly

Reading files the hard way - Part 2 (x86 asm, linux kernel)

From the series Reading files the hard way

Looking at that latest mental model, it’s.. a bit suspicious that every program ends up calling the same set of functions. It’s almost like something different happens when calling those.

Are those even regular functions? Can we step through them with a debugger?

If we run our stdio-powered C program in gdb, and break on read, we can confirm that we indeed end up calling a read function (which is called __GI___libc_read here, but oh well):

$ gcc -g -O0 readfile-f.c -o readfile-f

$ gdb --silent ./readfile-f
Reading symbols from ./readfile-f...
(gdb) break read
Function "read" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (read) pending.
(gdb) r
Starting program: /home/amos/bearcove/read-files-the-hard-way/readfile-f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, __GI___libc_read (fd=3, buf=0x5555555594a0, nbytes=4096) at ../sysdeps/unix/sysv/linux/read.c:25
25      ../sysdeps/unix/sysv/linux/read.c: No such file or directory.
(gdb) bt
#0  __GI___libc_read (fd=3, buf=0x5555555594a0, nbytes=4096) at ../sysdeps/unix/sysv/linux/read.c:25
#1  0x00007ffff7c88596 in _IO_new_file_underflow (fp=0x5555555592a0) at ./libio/libioP.h:947
#2  0x00007ffff7c86e00 in __GI__IO_file_xsgetn (fp=0x5555555592a0, data=<optimized out>, n=15) at ./libio/fileops.c:1321
#3  0x00007ffff7c7b709 in __GI__IO_fread (buf=0x555555559480, size=1, count=15, fp=0x5555555592a0) at ./libio/iofread.c:38
#4  0x0000555555555288 in main (argc=1, argv=0x7fffffffdb98) at readfile-f.c:16
(gdb)

Cool Bear's hot tip

GDB is an open-source debugger that runs on Linux, macOS (sometimes), and Windows (with some limitations).

It allows, among many other things, setting breakpoints and stepping through code.

We’ll be using it a bunch.

Same goes for the Rust program:

$ rust-gdb --silent target/debug/readfile-rs
Reading symbols from target/debug/readfile-rs...
(gdb) break read
Breakpoint 1 at 0x1ea73: file library/std/src/fs.rs, line 872.
(gdb) r
Starting program: /home/amos/bearcove/read-files-the-hard-way/readfile-rs/target/debug/readfile-rs
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, std::sys::unix::fs::OpenOptions::read () at library/std/src/sys/unix/fs.rs:893
893     library/std/src/sys/unix/fs.rs: No such file or directory.
(gdb) c
Continuing.

Breakpoint 1, __GI___libc_read (fd=3, buf=0x5555555a8ba0, nbytes=220) at ../sysdeps/unix/sysv/linux/read.c:25
25      ../sysdeps/unix/sysv/linux/read.c: No such file or directory.
(gdb) bt
#0  __GI___libc_read (fd=3, buf=0x5555555a8ba0, nbytes=220) at ../sysdeps/unix/sysv/linux/read.c:25
#1  0x000055555557454f in std::sys::unix::fd::FileDesc::read_buf () at library/std/src/sys/unix/fd.rs:136
#2  std::sys::unix::fs::File::read_buf () at library/std/src/sys/unix/fs.rs:1062
#3  std::fs::{impl#5}::read_buf () at library/std/src/fs.rs:737
#4  std::io::default_read_to_end<std::fs::File> () at library/std/src/io/mod.rs:376
#5  0x000055555557297d in std::io::default_read_to_string::{closure#0}<std::fs::File> () at library/std/src/io/mod.rs:430
#6  std::io::append_to_string<std::io::default_read_to_string::{closure_env#0}<std::fs::File>> () at library/std/src/io/mod.rs:338
#7  std::io::default_read_to_string<std::fs::File> () at library/std/src/io/mod.rs:430
#8  std::fs::{impl#5}::read_to_string () at library/std/src/fs.rs:754
#9  0x000055555555c899 in readfile_rs::main () at src/main.rs:9

However, when we try to step through it… nothing. For the sake of the investigation, I cloned the glibc repository (since that’s where the read function seems to live), and found this:

// in `glibc/sysdeps/unix/sysv/linux/read.c`

/* Read NBYTES into BUF from FD.  Return the number read or -1.  */
ssize_t
__libc_read (int fd, void *buf, size_t nbytes)
{
  return SYSCALL_CANCEL (read, fd, buf, nbytes);
}
libc_hidden_def (__libc_read)

libc_hidden_def (__read)
weak_alias (__libc_read, __read)
libc_hidden_def (read)
weak_alias (__libc_read, read)

Cool Bear's hot tip

The source code for glibc (shown above) can be found in this git repository or this unofficial GitHub mirror.

A libc is a very complicated piece of software, for many historical and practical reasons. There are other popular ones, like musl.

The reason we can’t find the source of read is because… it lives in another land entirely:

Cool Bear's hot tip

“syscall” is what the “s” in strace stands for.

It’s not just two different sets of software. They run with different privileges. The Linux kernel (and its device drivers) runs in ring 0, where everything is allowed. Userland applications, however, run in ring 3.

This is a classic diagram, so I had to show it here, but I don’t think it’s super intuitive. I prefer to think of it this way:

Because what you can do from ring 0 is a strict superset of what you can do from ring 3. Ring 3 is like a prison. Anybody from ring 0 can visit, but ring 3 can only send letters (ie. make syscalls).

So the kernel handles things like reading and writing. But it also handles things like processes. When we start our application, it runs in a process - and the kernel decides which process gets to run and when. It interrupts processes and resumes them, prioritizing the important stuff - but also, giving the illusion that a single CPU core can do multiple things at once (when it really can only do one at a time).

Cool Bear's hot tip

Reality is a lot more complicated. CPU cores do do multiple things at once, just not in a way that’s easy to observe.

There’s a lot the kernel is responsible for, but let’s focus on files. Each process has resources associated with it - like file descriptors! When we open a file (with the open syscall), the kernel:

Decides whether or not this is allowed
Asks the VFS who’s responsible for this particular path
Reserves a file descriptor, which is:
- just a number, really
- unique per-process
Makes a note that this number corresponds to that resource
Tells us what the number is

And in our further communication with the kernel, whenever we want to refer to that resource, we’ll just use that number.
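To make the "just a number" part concrete, here is a minimal C sketch. The helper names (open_readonly, fd_number_is_reused) are made up for this illustration. It leans on one documented property of open: it returns the lowest-numbered descriptor that isn't currently open in the process, so a number we just closed gets handed right back.

```c
#include <fcntl.h>
#include <unistd.h>

/* Open `path` read-only. Returns a file descriptor, or -1 on failure. */
int open_readonly(const char *path) {
    return open(path, O_RDONLY);
}

/* Hypothetical helper: open `path` twice, close the first descriptor,
 * then open it a third time. Returns 1 if the kernel handed us the freed
 * number again, 0 if it did not, and -1 if any open failed. */
int fd_number_is_reused(const char *path) {
    int first = open_readonly(path);
    int second = open_readonly(path);
    if (first < 0 || second < 0) {
        if (first >= 0) close(first);
        if (second >= 0) close(second);
        return -1;
    }

    close(first); /* the number `first` is now free again */
    int third = open_readonly(path);
    int reused = (third == first);

    close(second);
    if (third >= 0) close(third);
    return reused;
}
```

Two descriptors open at the same time always get different numbers, even when they point at the same file. The number only identifies an entry in the process's own table.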

And this answers one of the questions you might have had while following this article: in the strace output for most programs, we saw a call to close (which, well, closes a file descriptor) - but in our sample C program, we never bothered calling it!

This is because the kernel, in its infinite wisdom, keeps track of open file descriptors, and cleans them all up once the process exits.

“This is all getting a bit theoretical” whispers someone in the back. “I’m glad he’s not showing us kernel code, but.. are we just supposed to trust that the kernel cleans up file descriptors?”

Good question! There is a command to list open file descriptors for a specific path, so we can verify that real quick with a simple C program:

// in `open.c`

#include <stdio.h>
#include <fcntl.h>

int main(int argc, char **argv) {
    int fd = open("/etc/hosts", O_RDONLY);

    printf("Our file descriptor for /etc/hosts is %d\n", fd);
    printf("Press enter to exit...\n");
    getc(stdin);
}

$ gcc open.c -o open

$ lsof /etc/hosts 2> /dev/null

$ ./open
Our file descriptor for /etc/hosts is 3
Press enter to exit...
^Z
[1]+  Stopped                 ./open

$ lsof /etc/hosts 2> /dev/null
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
open    1917875 amos    3r   REG    8,3      220 16253092 /etc/hosts

$ fg
./open


$ lsof /etc/hosts 2> /dev/null

What did we learn?

The kernel is all-powerful. It decides how processes are run, manages access to all devices (including disks), and is in charge of enforcing security.

Regular function calls are just “jumps” to another part of the code. Syscalls are not regular function calls. They are a secure interface between ring 3 (userland, our applications) and ring 0 (the kernel).

Making a syscall involves writing parameters somewhere accessible from userland, and politely asking the kernel to consider our request. The kernel is free to deny it, for various reasons: the file may not exist, we might not have permission to read it, etc.
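When going through libc, a denied request shows up as a -1 return value plus a reason stored in errno. Here is a small sketch of that, assuming a made-up helper name (why_open_failed) and a path that presumably doesn't exist on your machine:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to open `path` read-only.
 * Returns 0 if the kernel granted the request, or the errno value
 * explaining why it refused (ENOENT, EACCES, ...). */
int why_open_failed(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return errno; /* set by libc from the kernel's answer */
    }
    close(fd);
    return 0;
}
```

Calling why_open_failed("/this/path/should/not/exist") should give ENOENT ("No such file or directory"), while a readable path gives 0.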

Making a syscall

We need to clear up a potential source of confusion. We saw a read() function in the source code for glibc (the C library that ships with most Linux distributions), but it is distinct from the actual read syscall.
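One way to see the difference from C: glibc also exposes a generic syscall() function that takes the syscall number as its first argument. The sketch below (read_via_syscall is a name invented here) skips the read() wrapper, but note that syscall() is itself a libc function, so this is still not "without libc".

```c
#define _GNU_SOURCE
#include <sys/syscall.h> /* SYS_read */
#include <unistd.h>      /* syscall(), ssize_t */

/* Ask the kernel to read up to `nbytes` from `fd` into `buf`, naming the
 * syscall by number instead of calling the read() wrapper.
 * Same return convention as read(): bytes read, 0 at EOF, -1 on error. */
ssize_t read_via_syscall(int fd, void *buf, size_t nbytes) {
    return syscall(SYS_read, fd, buf, nbytes);
}
```

Under strace, this shows up as a plain read(...) line, same as the wrapper: the kernel only sees the syscall number and its arguments.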

It seems like most of Unix is written in C, but can we make a syscall without using libc? Something like this:

Hey, Go is not C - does Go use libc to make syscalls? Let’s find out.

This is the source code for a simple Go program that prints the contents of /etc/hosts:

package main

import (
  "io/ioutil"
  "fmt"
)

func main() {
  payload, err := ioutil.ReadFile("/etc/hosts")
  if err != nil {
    panic(err)
  }

  fmt.Printf("%s\n", string(payload))
}

It sure works, but it doesn’t seem to link against libc:

$ go build main.go

$ ./main | head -3
127.0.0.1       localhost
127.0.1.1       sonic


$ ldd main
        not a dynamic executable

Cool Bear's hot tip

ldd is a tool that “prints shared object dependencies”.

The ldd man page has more info.

In our case, it’s useful to make sure our program doesn’t use glibc. Go programs are usually statically linked, unless they use obscure packages like net or os/user.

Welp, I guess Go programs are usually dynamic after all, but you can fix that.

Trying to break on read in gdb also gives nothing. Well, can we make sure it still uses the openat and read syscalls at least? Let’s strace it:

$ strace -e openat,read,write ./main
openat(AT_FDCWD, "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", O_RDONLY) = 3
read(3, "2097152\n", 20)                = 8
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1923780, si_uid=1000} ---
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1923780, si_uid=1000} ---
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
read(3, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 512) = 220
read(3, "", 292)                        = 0
write(1, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 221127.0.0.1 localhost
127.0.1.1       sonic

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

) = 221
+++ exited with 0 +++

It does! So one doesn’t need libc to make a syscall. What a relief.

What did we learn?

Even though, in some respects, Go is a higher-level language than C (it has a garbage collector, it comes with concurrency primitives, etc.), it doesn’t rely on libc to make syscalls.

This contrasts with the Node.js runtime, and the Rust standard library, which both use libc to make syscalls.

Making a Linux syscall on x86_64

So we’ve seen that pretty much all languages, no matter how many levels of abstraction they’re on, have to eventually make syscalls one way or the other.

But how does one make a syscall? So far we’ve been using languages that either:

Use libc to make syscalls, via wrapper functions (Node.js, Rust, C)
Make syscalls for us (Go)

Let’s try to make a syscall ourselves, in assembly.

Cool Bear's hot tip

Assembly is one of the intermediate forms that most compiled programs go through before they become full executables.

In very simple terms: a C compiler translates C to assembly, the assembler translates assembly to machine code, and a linker glues together several pieces of machine code into an executable.

We’ll use yasm, because everything else gives me high blood pressure. Our code is going to be in readfile.asm, and we’re going to build it with this makefile:

.PHONY: all

all:
	yasm -f elf64 -g dwarf2 readfile.asm
	ld readfile.o -o readfile

The yasm invocation assembles our assembly into an object file, and the ld invocation links it into a full executable.

We’re going to start with a very simple program, let’s put this in readfile.asm:

            global _start ; _start is our entry point - this is its declaration...

            section .text ; the text section is where we'll put executable code
_start:     xor rdi, rdi  ; ...and this is its definition. we just set rdi to 0.

Cool Bear's hot tip

rdi is a register.

Registers are memory locations in the CPU that are used for many things: temporary storage, passing arguments to functions, returning values from functions, etc. You can think of them as global variables.

Each architecture has its own set of registers. We’re on x86_64, so we’ll be using a few of the 64-bit general-purpose registers, like rax, rdi, rsi. We’ll also be using the stack pointer, rsp.

This compiles and links just fine, but it segfaults when we run it:

$ make
yasm -f elf64 -g dwarf2 readfile.asm
ld readfile.o -o readfile

$ ./readfile
Segmentation fault (core dumped)

To get our program to exit cleanly, we need to.. make a syscall.

The first thing we need to make a syscall is its number. On Ubuntu for example, you can find it in a header file:

// in `/usr/include/x86_64-linux-gnu/asm/unistd_64.h`

// (cut)
#define __NR_exit 60
// (cut)

But you can also find some syscall tables online, like Filippo’s, which is searchable.

The second thing we need to do is.. count our lucky stars, because on x86_64, there is a dedicated instruction to make a syscall. (It’s called syscall).

So now we just need to put the syscall number 60 in the rax register, and use the syscall instruction, and we should be good:

            global _start

            section .text
_start:     mov rax, 60
            syscall

$ make
yasm -f elf64 -g dwarf2 readfile.asm
ld readfile.o -o readfile

$ ldd ./readfile
        not a dynamic executable

And just like that, we made a syscall without using libc! That wasn’t so hard.
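The same trick works from C with a bit of GCC-style inline assembly. This is a sketch under a few assumptions: x86_64 Linux only, a GCC or Clang compiler, and a helper name (raw_syscall3) made up for this example. The register constraints mirror what the assembly does: number in rax, arguments in rdi, rsi, rdx. The syscall instruction overwrites rcx and r11, so they are listed as clobbered.

```c
#include <errno.h>
#include <sys/syscall.h> /* SYS_getpid, SYS_close, SYS_exit */
#include <unistd.h>

/* Make a raw Linux x86_64 syscall with up to three arguments.
 * Unlike libc wrappers, the raw return value is either the result
 * or a negated errno value (for example -EBADF). */
long raw_syscall3(long nr, long a1, long a2, long a3) {
    long ret;
    __asm__ volatile(
        "syscall"
        : "=a"(ret)                          /* result comes back in rax */
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3) /* rax, rdi, rsi, rdx */
        : "rcx", "r11", "memory");
    return ret;
}
```

With this, raw_syscall3(SYS_exit, 0, 0, 0) does what the two-instruction program does (plus a proper exit code), and raw_syscall3(SYS_close, -1, 0, 0) returns -EBADF rather than setting errno, since translating that into errno is libc's job.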

Cool Bear's hot tip

If it’s not that hard to make a syscall without libc, why do so many languages use libc to make syscalls?

Well, it’s easy to make Linux syscalls, on x86_64. 32-bit architectures have different ways to make syscalls. Other operating systems have completely different sets of syscalls.

You can learn about some of those differences in the man page for syscall(2). You can pull it up from a Linux system with the man 2 syscall command, or read it online.

Can we re-implement our whole readfile application in assembly? Let’s see.

Logging isn’t going to be as easy as with Node.js, or Rust, or C, or Go. So we’re going to have to lean on the debugger a little bit. Thanks to the -g dwarf2 flags we passed to yasm, we have great debug information:

$ gdb --silent ./readfile
Reading symbols from ./readfile...
(gdb) break _start
Breakpoint 1 at 0x401000: file readfile.asm, line 4.
(gdb) r
Starting program: /home/amos/bearcove/read-files-the-hard-way/readfile

Breakpoint 1, _start () at readfile.asm:4
4       _start:     mov rax, 60
(gdb) s
5                   syscall
(gdb) s
[Inferior 1 (process 1843089) exited normally]
(gdb)

So, let’s try the open syscall. It needs the same parameters as in C: first a path, then a set of flags. We’ll store the path in the data section.

            global _start

            section .text
_start:     mov     rax, 2      ; "open" syscall
            mov     rdi, path   ; arg 1: path
            xor     rsi, rsi    ; arg 2: flags (0 = O_RDONLY)
            syscall

            mov     rax, 60     ; "exit" syscall
            syscall

            section .data
path:       db      "/etc/hosts", 0 ; null-terminated

Cool Bear's hot tip

You don’t need to be fluent in assembly to read this article - it’s good to be exposed even to languages we don’t fully understand. Most people never get “formal training” in assembly, but pick up bits and pieces over the years.

If you want to learn a little more assembly before continuing, you might want to check out the YASM documentation or this NASM tutorial first.

Stepping through this with gdb, we can make sure open succeeded, by using lsof in another terminal:

# GDB session
$ gdb --silent ./readfile
Reading symbols from ./readfile...
(gdb) starti
Starting program: /home/amos/bearcove/read-files-the-hard-way/readfile

Program stopped.
_start () at readfile.asm:4
4       _start:     mov     rax, 2      ; "open" syscall
(gdb) s
5                   mov     rdi, path   ; arg 1: path
(gdb)
6                   xor     rsi, rsi    ; arg 2: flags (0 = O_RDONLY)
(gdb)
7                   syscall
(gdb)
9                   mov     rax, 60     ; "exit" syscall
(gdb)

# In another shell
$ lsof /etc/hosts
COMMAND      PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
readfile 1846055 amos    3r   REG    8,3      220 16253092 /etc/hosts

We can also use strace on our resulting binary. It shows whether syscalls succeed or fail, so that works out great:

$ strace ./readfile
execve("./readfile", ["./readfile"], 0x7fffb827f0f0 /* 63 vars */) = 0
open("/etc/hosts", O_RDONLY)            = 3
exit(4202496)                           = ?
+++ exited with 0 +++
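The last two lines of that trace look contradictory: exit was handed 4202496, yet the process "exited with 0". That's because only the low 8 bits of the exit code make it back to whoever waits on the process, and 4202496 is 0x402000, whose low byte is zero. Here is a small C sketch to check this (observed_exit_status is a made-up helper name), forking a child that exits with a given code:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that immediately exits with `code`, and return the exit
 * status the parent observes. Returns -1 if fork/waitpid fails or the
 * child did not exit normally. */
int observed_exit_status(int code) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        _exit(code); /* child: exit right away, no cleanup handlers */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

So observed_exit_status(4202496) gives 0, and observed_exit_status(256 + 7) gives 7.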

Woops, it looks like we’re not passing 0 to exit (rdi still holds the address of path), let’s fix that:

            mov     rax, 60
            xor     rdi, rdi    ; <--- exit with code 0
            syscall

That’s better. Now let’s try reading some bytes from this file descriptor. The return value of open is stored in rax, like all other syscalls.

We’re going to be using rax to make our next syscall, so we need to save it to the stack with push. Also, we need to allocate memory for our buffer - let’s allocate 16 bytes on the stack.

            global _start

            section .text
_start:     mov     rax, 2      ; "open"
            mov     rdi, path   ;
            xor     rsi, rsi    ; O_RDONLY
            syscall

            push rax            ; push file descriptor onto stack
            sub rsp, 16         ; reserve 16 bytes of memory

            xor     rax, rax    ; "read"
            mov     rdi, [rsp+16] ; file descriptor
            mov     rsi, rsp    ; address of buffer
            mov     rdx, 16     ; size of buffer
            syscall

            mov     rax, 60     ; "exit" syscall
            syscall

            section .data
path:       db      "/etc/hosts", 0 ; null-terminated

Cool Bear's hot tip

The stack is an area we can use to store data. It’s limited in size, but simpler to use than the heap.

To reserve memory, we simply subtract from rsp, a register that contains the address of the “top of the stack”.

Here’s a handy diagram that shows what happens to the stack just before we call read:

Running our program still prints nothing so far, but strace lets us know that everything went fine:

$ make
yasm -f elf64 -g dwarf2 readfile.asm
ld readfile.o -o readfile

$ ./readfile

$ strace ./readfile
execve("./readfile", ["./readfile"], 0x7ffc3dfee7d0 /* 60 vars */) = 0
open("/etc/hosts", O_RDONLY)            = 3
read(3, "127.0.0.1\tlocalh", 16)        = 16
exit(3)                                 = ?
+++ exited with 3 +++

$

Now let’s print that buffer to stdout by using the write call.

            xor     rax, rax    ; "read"
            mov     rdi, [rsp+16] ; file descriptor
            mov     rsi, rsp    ; address of buffer
            mov     rdx, 16     ; size of buffer
            syscall

            ; `rax` contains the number of bytes read
            ; write takes the number of bytes to write via `rdx`
            mov     rdx, rax    ; number of bytes
            mov     rax, 1      ; "write"
            mov     rdi, 1      ; file descriptor (stdout)
            mov     rsi, rsp    ; address of buffer
            syscall

And we’re finally seeing some output:

$ make
yasm -f elf64 -g dwarf2 readfile.asm
ld readfile.o -o readfile

$ ./readfile
127.0.0.1       localh

Finally, we just need to repeat reading and writing until read returns 0 bytes read. (We won’t be doing any error checking).

Here’s our final program:

            global _start

            section .text
_start:     mov     rax, 2      ; "open"
            mov     rdi, path   ;
            xor     rsi, rsi    ; O_RDONLY
            syscall

            push rax            ; push file descriptor onto stack
            sub rsp, 16         ; reserve 16 bytes of memory

read_buffer:
            xor     rax, rax    ; "read"
            mov     rdi, [rsp+16] ; file descriptor
            mov     rsi, rsp    ; address of buffer
            mov     rdx, 16     ; size of buffer
            syscall

            test    rax, rax
            ; jz means 'jump if zero'
            jz      exit

            mov     rdx, rax    ; number of bytes
            mov     rax, 1      ; "write"
            mov     rdi, 1      ; file descriptor (stdout)
            mov     rsi, rsp    ; address of buffer
            syscall

            jmp read_buffer

exit:
            mov     rax, 60     ; "exit"
            xor     rdi, rdi    ; return code 0
            syscall

            section .data
path:       db      "/etc/hosts", 0 ; null-terminated
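For comparison, here is roughly the same loop written in C, as a sketch rather than a line-by-line translation. The copy_fd name is made up here. It keeps the 16-byte buffer, and adds the error checking the assembly version skips, plus a retry when write accepts fewer bytes than we asked for.

```c
#include <unistd.h>

/* Copy everything from `in_fd` to `out_fd`, 16 bytes at a time.
 * Returns the total number of bytes copied, or -1 on error. */
ssize_t copy_fd(int in_fd, int out_fd) {
    char buf[16];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n < 0) {
            return -1;    /* read failed */
        }
        if (n == 0) {
            return total; /* 0 bytes read: end of file */
        }

        /* write may accept fewer bytes than requested, so keep going
         * until the whole chunk is out. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

Calling copy_fd(fd, 1) with a descriptor for /etc/hosts should produce the same alternating read and write lines in strace as the assembly program.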

What did we learn?

Making syscalls on Linux x86_64 involves putting values in some registers, and then using the syscall instruction.

We can use the stack (which grows downward) as temporary storage space.

Cool Bear's hot tip

The information about syscalls in this article is extremely Linux-specific.

For example, Windows syscall numbers change across OS versions - sometimes even service packs. If you’re curious, check out this table.

While it’s possible to make syscalls without using OS libraries, it’s not always practical.

Memory-mapped files

Let’s try and use strace on another program: ripgrep.

$ strace rg 'localhost' /etc/hosts
(cut)
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=220, ...}) = 0
mmap(NULL, 220, PROT_READ, MAP_SHARED, 3, 0) = 0x7fd3a694a000
write(1, "\33[0m\33[32m1\33[0m:127.0.0.1\t\33[0m\33[1"..., 511:127.0.0.1        localhost) = 51
write(1, "\n", 1
)                       = 1
write(1, "\33[0m\33[32m5\33[0m:::1     ip6-\33[0m\33"..., 665:::1     ip6-localhost ip6-loopback) = 66
write(1, "\n", 1
)                       = 1
munmap(0x7fd3a694a000, 220)             = 0
close(3)                                = 0
(cut)

We recognize the openat syscall, and also statx - but.. it doesn’t use read. What’s happening over here?

Well, remember when we said the kernel is an all-powerful overseer that controls everything the userland interacts with? That goes for memory too!

In an operating system like Linux, each process has its own virtual address space. Some of it is mapped to physical memory, via the Memory management unit (MMU for short).

Cool Bear's hot tip

Physical memory is divided into “pages”, to make it easier to address. Pages are often 4KiB, but not always!

For example, Apple Silicon processors like the M1 have 16KiB pages.
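Rather than guessing, a program can ask for the page size at runtime with sysconf. The sketch below does that, and also rounds a length up to whole pages, since mappings are handed out in page-sized units: the 220-byte mapping in the ripgrep trace still occupies a full page. Both function names are invented for this example.

```c
#include <stddef.h>
#include <unistd.h>

/* Page size of the running system, in bytes (4096 on most x86_64 setups). */
long page_size_bytes(void) {
    return sysconf(_SC_PAGESIZE);
}

/* Round `len` up to a whole number of pages. */
size_t round_up_to_page(size_t len) {
    size_t page = (size_t)page_size_bytes();
    return (len + page - 1) / page * page;
}
```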

When a process is started, a few pages are reserved for its stack. (Which we used above). When allocating memory on the heap, say, with malloc, glibc’s allocator asks the kernel for more pages, and keeps track of all the allocations, so that free() works properly.

First, let’s check that every process does indeed have a separate address space.

We can make a first program, write.c:

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv) {
  // allocate 4 bytes
  int *ptr = malloc(sizeof(int));
  // write a very specific value to it
  *ptr = 0xFEEDFACE;
  // read back the value, and print the address
  printf("Wrote %x to %p\n", *ptr, ptr);
  // wait for user input
  getc(stdin);
}

There’s a good chance this program will print a different address every time, but when I ran it, it printed this:

$ gcc write.c -o write
$ ./write
Wrote feedface to 0x56459a9c7260

We can use that to write a second program, read.c:

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int *ptr = (int *) 0x56459a9c7260;
    printf("Read %x from %p\n", *ptr, (void *) ptr);
}

And run it:

$ gcc read.c -o read
$ ./read
[1]   6429 segmentation fault (core dumped)   ./read

What happened? 0x56459a9c7260 was a valid address in write’s virtual address space, but not in read’s. Attempting to read from it is an access violation, which results in the kernel sending a signal to our process, and the default handler for that signal terminates the process.
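Here is another angle on the same idea, as a sketch: after fork, parent and child have a variable at the very same virtual address, but each one has its own memory behind it. The function name is made up, and the 0x11111111 starting value is arbitrary.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that overwrites a variable, then return the value the
 * parent still sees at that same address. Returns 0 if fork fails. */
unsigned int parent_value_after_child_write(void) {
    /* volatile so the child's store really happens */
    static volatile unsigned int value = 0x11111111u;

    pid_t pid = fork();
    if (pid < 0) {
        return 0;
    }
    if (pid == 0) {
        /* Child: same virtual address, but its own copy of the page. */
        value = 0xFEEDFACEu;
        _exit(0);
    }

    waitpid(pid, NULL, 0); /* wait until the child has written and exited */
    return value;          /* parent's copy is untouched */
}
```

The parent gets 0x11111111 back, not feedface: the child's write landed in the child's address space only.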

Cool Bear's hot tip

We used this in Part 1 to get our stdio-powered C program to segfault!

An access violation is just one type of page fault. A page fault occurs when we try to read from or write to a (virtual) address that isn’t currently mapped to physical memory.

And this is precisely the trick behind mmap. When we first mmap a file, the kernel might eagerly read the first 4K of the file into a buffer of its own, and set up the page tables so that the (userland) process can read directly from that buffer:

But once the process reads past the first 4K, then that’s a page fault!

Remember, the kernel can do anything in response to a page fault: it may decide that it’s an access violation, and send a signal to the process. In this case, it chooses simply to fulfill its promise that “this virtual address range contains the contents of the file”, just.. not until it’s needed.

The requested part of the file is actually read, and a new page mapping is set up:

The kernel is of course free to “page out” parts of the file, when they haven’t been accessed in a while (or as soon as it wants, really!).

Cool Bear's hot tip

When executing a program, its image is memory-mapped.

This allows a program to start executing before it’s entirely read from disk, which matters a lot if the executable is large, or the I/O device is slow.

Using mmap from assembly

Can we use that from our assembly program? Sure we can!

Since we’re not sure what parameters mmap needs (and which registers to put them into), we’ll use this Searchable Linux Syscall table for x86 and x86_64 by @FiloSottile.

First, as usual, we’ll need to open the file:

_start:     mov     rax, 2          ; "open"
            mov     rdi, path       ;
            xor     rsi, rsi        ; O_RDONLY
            syscall

Next, we want to find the size of the file in bytes, so we can pass it to mmap. We’ll use the fstat syscall for that, which takes fd (a file descriptor) in register rdi, and struct stat __user* statbuf in register rsi.

To help me write the next part, I wrote a simple C program that dumps the struct’s size, along with the offset of the st_size field, and two constants:

#include <stdio.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/mman.h>

int main() {
    printf("size of stat struct: %zu\n", sizeof(struct stat));
    printf("offset of st_size  : %zu\n", offsetof(struct stat, st_size));
    printf("PROT_READ   = 0x%x\n", PROT_READ);
    printf("MAP_PRIVATE = 0x%x\n", MAP_PRIVATE);
}

Which outputs:

size of stat struct: 144
offset of st_size  : 48
PROT_READ   = 0x1
MAP_PRIVATE = 0x2

So, it looks like we’ll need to allocate 144 bytes on the stack:

            mov     rdi, rax        ; fd (returned from open)
            sub     rsp, 144        ; allocate stat struct
            mov     rsi, rsp        ; address of 'struct stat'
            mov     rax, 5          ; "fstat" syscall
            syscall

And then we can feed our file descriptor, file size, and flags to mmap. Note that we can specify an address (but NULL is fine) and an offset (but 0 is fine, since we want the whole file).

Cool Bear's hot tip

The mmap syscall takes:

addr in %rdi
len in %rsi
prot (protection) in %rdx
flags in %r10
fd in %r8
off (offset) in %r9

            mov     rsi, [rsp+48]   ; len = file size (from 'struct stat')
            add     rsp, 144        ; free 'struct stat'
            mov     r8, rdi         ; fd (still in rdi from last syscall)
            xor     rdi, rdi        ; address = 0
            mov     rdx, 0x1        ; protection = PROT_READ
            mov     r10, 0x2        ; flags = MAP_PRIVATE
            xor     r9, r9          ; offset = 0
            mov     rax, 9          ; "mmap" syscall
            syscall

Finally, we can write out the whole file in a single write syscall:

Cool Bear's hot tip

The write syscall takes:

fd in %rdi
buf in %rsi
count in %rdx

            mov     rdx, rsi        ; count (file size from last call)
            mov     rsi, rax        ; buffer address (returned from mmap)
            mov     rdi, 1          ; fd = stdout
            mov     rax, 1          ; "write" syscall
            syscall

And there we have it:

$ make
yasm -f elf64 -g dwarf2 readfile.asm
ld readfile.o -o readfile

$ strace ./readfile > /dev/null
execve("./readfile", ["./readfile"], 0x7ffffc6ddbf0 /* 60 vars */) = 0
open("/etc/hosts", O_RDONLY)            = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=220, ...}) = 0
mmap(NULL, 220, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f11fbe5a000
write(1, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 220) = 220
exit(0)                                 = ?
+++ exited with 0 +++
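For reference, the same open, fstat, mmap, write sequence can be sketched in C. This is not a translation of the assembly: dump_file_mmap is a name made up here, it goes through the libc wrappers, and it adds error checks plus a munmap. It also special-cases empty files, because mmap refuses a length of 0.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write the contents of `path` to `out_fd` using a read-only mapping.
 * Returns the number of bytes written, or -1 on error. */
ssize_t dump_file_mmap(const char *path, int out_fd) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        close(fd);
        return 0; /* nothing to map: mmap rejects zero-length mappings */
    }

    size_t len = (size_t)st.st_size;
    void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (map == MAP_FAILED) {
        return -1;
    }

    /* A single write, like the assembly version. For large files or pipes
     * it may be short; a real program would loop. */
    ssize_t written = write(out_fd, map, len);
    munmap(map, len);
    return written;
}
```

Calling dump_file_mmap("/etc/hosts", 1) should print the file. Under strace it will likely show openat instead of open, since that's what glibc's open wrapper uses.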

What did we learn?

A process’s address space refers to virtual memory, which is then mapped to physical memory via page tables. When an unmapped range is accessed, it results in a page fault.

Instead of reading parts of files with read, we can map them into the virtual address space with mmap. Reading from that range will result in the kernel reading the relevant parts of the file.

Executables are memory-mapped when run (even on Windows).

In the next part, we’re going to take a look within the kernel, to see how files are organized and read - and how we can find and read them by using this knowledge, bypassing as much abstraction as we can.