Reading files the hard way - Part 2 (x86 asm, linux kernel)
👋 This page was last updated ~5 years ago. Just so you know.
Looking at that latest mental model, it's.. a bit suspicious that every program ends up calling the same set of functions. It's almost like something different happens when calling those.
Are those even regular functions? Can we step through them with a debugger?
If we run our stdio-powered C program in gdb, and break on read, we can confirm that we indeed end up calling a read function (which is called __GI___libc_read here, but oh well):
$ gcc -g -O0 readfile-f.c -o readfile-f
$ gdb --silent ./readfile-f
Reading symbols from ./readfile-f...
(gdb) break read
Function "read" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (read) pending.
(gdb) r
Starting program: /home/amos/bearcove/read-files-the-hard-way/readfile-f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, __GI___libc_read (fd=3, buf=0x5555555594a0, nbytes=4096) at ../sysdeps/unix/sysv/linux/read.c:25
25 ../sysdeps/unix/sysv/linux/read.c: No such file or directory.
(gdb) bt
#0 __GI___libc_read (fd=3, buf=0x5555555594a0, nbytes=4096) at ../sysdeps/unix/sysv/linux/read.c:25
#1 0x00007ffff7c88596 in _IO_new_file_underflow (fp=0x5555555592a0) at ./libio/libioP.h:947
#2 0x00007ffff7c86e00 in __GI__IO_file_xsgetn (fp=0x5555555592a0, data=<optimized out>, n=15) at ./libio/fileops.c:1321
#3 0x00007ffff7c7b709 in __GI__IO_fread (buf=0x555555559480, size=1, count=15, fp=0x5555555592a0) at ./libio/iofread.c:38
#4 0x0000555555555288 in main (argc=1, argv=0x7fffffffdb98) at readfile-f.c:16
(gdb)
Cool bear's hot tip
GDB is an open-source debugger that runs on Linux, macOS (sometimes), and Windows (with some limitations).
It allows, among many other things, setting breakpoints and stepping through code.
We'll be using it a bunch.
Same goes for the Rust program:
$ rust-gdb --silent target/debug/readfile-rs
Reading symbols from target/debug/readfile-rs...
(gdb) break read
Breakpoint 1 at 0x1ea73: file library/std/src/fs.rs, line 872.
(gdb) r
Starting program: /home/amos/bearcove/read-files-the-hard-way/readfile-rs/target/debug/readfile-rs
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, std::sys::unix::fs::OpenOptions::read () at library/std/src/sys/unix/fs.rs:893
893 library/std/src/sys/unix/fs.rs: No such file or directory.
(gdb) c
Continuing.
Breakpoint 1, __GI___libc_read (fd=3, buf=0x5555555a8ba0, nbytes=220) at ../sysdeps/unix/sysv/linux/read.c:25
25 ../sysdeps/unix/sysv/linux/read.c: No such file or directory.
(gdb) bt
#0 __GI___libc_read (fd=3, buf=0x5555555a8ba0, nbytes=220) at ../sysdeps/unix/sysv/linux/read.c:25
#1 0x000055555557454f in std::sys::unix::fd::FileDesc::read_buf () at library/std/src/sys/unix/fd.rs:136
#2 std::sys::unix::fs::File::read_buf () at library/std/src/sys/unix/fs.rs:1062
#3 std::fs::{impl#5}::read_buf () at library/std/src/fs.rs:737
#4 std::io::default_read_to_end<std::fs::File> () at library/std/src/io/mod.rs:376
#5 0x000055555557297d in std::io::default_read_to_string::{closure#0}<std::fs::File> () at library/std/src/io/mod.rs:430
#6 std::io::append_to_string<std::io::default_read_to_string::{closure_env#0}<std::fs::File>> () at library/std/src/io/mod.rs:338
#7 std::io::default_read_to_string<std::fs::File> () at library/std/src/io/mod.rs:430
#8 std::fs::{impl#5}::read_to_string () at library/std/src/fs.rs:754
#9 0x000055555555c899 in readfile_rs::main () at src/main.rs:9
However, when we try to step through it... nothing. For the sake of the investigation, I cloned the glibc repository (since that's where the read function seems to live), and found this:
// in `glibc/sysdeps/unix/sysv/linux/read.c`
/* Read NBYTES into BUF from FD. Return the number read or -1. */
ssize_t
__libc_read (int fd, void *buf, size_t nbytes)
{
return SYSCALL_CANCEL (read, fd, buf, nbytes);
}
libc_hidden_def (__libc_read)
libc_hidden_def (__read)
weak_alias (__libc_read, __read)
libc_hidden_def (read)
weak_alias (__libc_read, read)
Cool bear's hot tip
The source code for glibc (shown above) can be found in this git repository or this unofficial GitHub mirror.
A libc is a very complicated piece of software, for many historical and practical reasons. There are other popular ones, like musl.
The reason we can't find the source of read is because... it lives in another land entirely:
Cool bear's hot tip
"syscall" is what the "s" in strace stands for.
It's not just two different sets of software. They run with different privileges. The Linux kernel (and its device drivers) run in ring 0, where everything is allowed. Userland applications, however, run in ring 3.
This is a classic diagram, so I had to show it here, but I don't think it's super intuitive. I prefer to think of it this way:
Because what you can do from ring 0 is a strict superset of what you can do from ring 3. Ring 3 is like a prison. Anybody from ring 0 can visit, but ring 3 can only send letters (i.e. make syscalls).
So the kernel handles things like reading and writing. But it also handles things like processes. When we start our application, it runs in a process - and the kernel decides which process gets to run and when. It interrupts processes and resumes them, prioritizing the important stuff - but also, giving the illusion that a single CPU core can do multiple things at once (when it really can only do one at a time).
Cool bear's hot tip
Reality is a lot more complicated. CPU cores do do multiple things at once, just not in a way that's easy to observe.
There's a lot the kernel is responsible for, but let's focus on files.
A process has resources associated with it - like file descriptors! When we open a file (with the open syscall), the kernel:
- Decides whether or not this is allowed
- Asks the VFS who's responsible for this particular path
- Reserves a file descriptor, which is:
  - just a number, really
  - unique per-process
- Makes a note that this number corresponds to that resource
- Tells us what the number is
And in our further communication with the kernel, whenever we want to refer to that resource, we'll just use that number.
And this answers one of the questions you might have had while following this article: in the strace output for most programs, we saw a call to close (which, well, closes a file descriptor) - but in our sample C program, we never bothered calling it!
This is because the kernel, in its infinite wisdom, keeps track of open file descriptors, and cleans them all up once the process exits.
"This is all getting a bit theoretical" whispers someone in the back. "I'm glad he's not showing us kernel code, but.. are we just supposed to trust that the kernel cleans up file descriptors?"
Good question! There is a command to list open file descriptors for a specific path, so we can verify that real quick with a simple C program:
// in `open.c`
#include <stdio.h>
#include <fcntl.h>
int main(int argc, char **argv) {
    int fd = open("/etc/hosts", O_RDONLY);
    printf("Our file descriptor for /etc/hosts is %d\n", fd);
    printf("Press enter to exit...\n");
    getc(stdin);
}
$ gcc open.c -o open
$ lsof /etc/hosts 2> /dev/null
$ ./open
Our file descriptor for /etc/hosts is 3
Press enter to exit...
^Z
[1]+ Stopped ./open
$ lsof /etc/hosts 2> /dev/null
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
open 1917875 amos 3r REG 8,3 220 16253092 /etc/hosts
$ fg
./open
$ lsof /etc/hosts 2> /dev/null
What did we learn?
The kernel is all-powerful. It decides how processes are run, manages access to all devices (including disks), and is in charge of enforcing security.
Regular function calls are just "jumps" to another part of the code. Syscalls are not regular function calls. They are a secure interface between ring 3 (userland, our applications) and ring 0 (the kernel).
Making a syscall involves writing parameters somewhere accessible from userland, and politely asking the kernel to consider our request. The kernel is free to deny it, for various reasons: the file may not exist, we might not have permission to read it, etc.
Making a syscall
We need to clear up a potential source of confusion. We saw a read() function in the source code for glibc (the C library that ships with most Linux distributions), but it is distinct from the actual read syscall.
It seems like most of Unix is written in C, but can we make a syscall without using libc? Something like this:
Hey, Go is not C - does Go use libc to make syscalls? Let's find out.
This is the source code for a simple Go program that prints the contents of /etc/hosts:
package main
import (
	"fmt"
	"io/ioutil"
)
func main() {
	payload, err := ioutil.ReadFile("/etc/hosts")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", string(payload))
}
It sure works, but it doesn't seem to link against libc:
$ go build main.go
$ ./main | head -3
127.0.0.1 localhost
127.0.1.1 sonic
$ ldd main
not a dynamic executable
Cool bear's hot tip
ldd is a tool that "prints shared object dependencies". The ldd man page has more info.
In our case, it's useful to make sure our program doesn't use glibc. Go programs are usually statically linked, unless they use obscure packages like net or os/user.
Welp, I guess Go programs are usually dynamic after all, but you can fix that.
Trying to break on read in gdb also gives nothing. Well, can we make sure it still uses the openat and read syscalls at least? Let's strace it:
$ strace -e openat,read,write ./main
openat(AT_FDCWD, "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", O_RDONLY) = 3
read(3, "2097152\n", 20) = 8
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1923780, si_uid=1000} ---
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1923780, si_uid=1000} ---
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
read(3, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 512) = 220
read(3, "", 292) = 0
write(1, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 221127.0.0.1 localhost
127.0.1.1 sonic
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
) = 221
+++ exited with 0 +++
It does! So one doesn't need libc to make a syscall. What a relief.
What did we learn?
Even though, in some respects, Go is a higher-level language than C (it has a garbage collector, it comes with concurrency primitives, etc.), it doesn't rely on libc to make syscalls.
This contrasts with the Node.js runtime, and the Rust standard library, which both use libc to make syscalls.
Making a Linux syscall on x86_64
So we've seen that pretty much all languages, no matter how many levels of abstraction they're on, have to eventually make syscalls one way or the other.
But how does one make a syscall? So far we've been using languages that either:
- Use libc to make syscalls, via wrapper functions (Node.js, Rust, C)
- Make syscalls for us (Go)
Let's try to make a syscall ourselves, in assembly.
Cool bear's hot tip
Assembly is one of the intermediate forms that most compiled programs go through before they become full executables.
In very simple terms: a C compiler translates C to assembly, the assembler translates assembly to machine code, and a linker glues together several pieces of machine code into an executable.
We'll use yasm, because everything else gives me high blood pressure. Our code is going to be in readfile.asm, and we're going to build it with this makefile:
.PHONY: all
all:
	yasm -f elf64 -g dwarf2 readfile.asm
	ld readfile.o -o readfile
The yasm invocation assembles our assembly into an object file, and the ld invocation links it into a full executable.
We're going to start with a very simple program. Let's put this in readfile.asm:
global _start ; _start is our entry point - this is its declaration...
section .text ; the text section is where we'll put executable code
_start: xor rdi, rdi ; ...and this is its definition. we just set rdi to 0.
Cool bear's hot tip
rdi is a register.
Registers are memory locations in the CPU that are used for many things: temporary storage, passing arguments to functions, returning values from functions, etc. You can think of them as global variables.
Each architecture has its own set of registers. We're on x86_64, so we'll be using a few of the 64-bit general-purpose registers, like rax, rdi, rsi. We'll also be using the stack pointer, rsp.
This compiles and links just fine, but it segfaults when we run it:
$ make
yasm -f elf64 -g dwarf2 readfile.asm
ld readfile.o -o readfile
$ ./readfile
Segmentation fault (core dumped)
To get our program to exit cleanly, we need to.. make a syscall.
The first thing we need to make a syscall is its number. On Ubuntu for example, you can find it in a header file:
// in `/usr/include/x86_64-linux-gnu/asm/unistd_64.h`
// (cut)
#define __NR_exit 60
// (cut)
But you can also find some syscall tables online, like Filippo's, which is searchable.
The second thing we need to do is.. count our lucky stars, because on x86_64, there is a dedicated instruction to make a syscall. (It's called syscall).
So now we just need to put the syscall number 60 in the rax register, and use the syscall instruction, and we should be good:
global _start
section .text
_start: mov rax, 60
syscall
$ make
yasm -f elf64 -g dwarf2 readfile.asm
ld readfile.o -o readfile
$ ldd ./readfile
not a dynamic executable
And just like that, we made a syscall without using libc! That wasn't so hard.
Cool bear's hot tip
If it's not that hard to make a syscall without libc, why do so many languages use libc to make syscalls?
Well, it's easy to make Linux syscalls, on x86_64. 32-bit architectures have different ways to make syscalls. Other operating systems have completely different sets of syscalls.
You can learn about some of those differences in the man page for syscall(2). You can pull it up from a Linux system with the man 2 syscall command, or read it online.
Can we re-implement our whole readfile application in assembly? Let's see.
Logging isn't going to be as easy as with Node.js, or Rust, or C, or Go. So we're going to have to lean on the debugger a little bit. Thanks to the -g dwarf2 flags we passed to yasm, we have great debug information:
$ gdb --silent ./readfile
Reading symbols from ./readfile...
(gdb) break _start
Breakpoint 1 at 0x401000: file readfile.asm, line 4.
(gdb) r
Starting program: /home/amos/bearcove/read-files-the-hard-way/readfile
Breakpoint 1, _start () at readfile.asm:4
4 _start: mov rax, 60
(gdb) s
5 syscall
(gdb) s
[Inferior 1 (process 1843089) exited normally]
(gdb)
So, let's try the open syscall. It needs the same parameters as in C: first a path, then a set of flags. We'll store the path in the data section.
global _start
section .text
_start: mov rax, 2 ; "open" syscall
mov rdi, path ; arg 1: path
xor rsi, rsi ; arg 2: flags (0 = O_RDONLY)
syscall
mov rax, 60 ; "exit" syscall
syscall
section .data
path: db "/etc/hosts", 0 ; null-terminated
Cool bear's hot tip
You don't need to be fluent in assembly to read this article - it's good to be exposed even to languages we don't fully understand. Most people never get "formal training" in assembly, but pick up bits and pieces over the years.
If you want to learn a little more assembly before continuing, you might want to check out the YASM documentation or this NASM tutorial first.
Stepping through this with gdb, we can make sure open succeeded, by using lsof in another terminal:
# GDB session
$ gdb --silent ./readfile
Reading symbols from ./readfile...
(gdb) starti
Starting program: /home/amos/bearcove/read-files-the-hard-way/readfile
Program stopped.
_start () at readfile.asm:4
4 _start: mov rax, 2 ; "open" syscall
(gdb) s
5 mov rdi, path ; arg 1: path
(gdb)
6 xor rsi, rsi ; arg 2: flags (0 = O_RDONLY)
(gdb)
7 syscall
(gdb)
9 mov rax, 60 ; "exit" syscall
(gdb)
# In another shell
$ lsof /etc/hosts
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
readfile 1846055 amos 3r REG 8,3 220 16253092 /etc/hosts
We can also use strace on our resulting binary. It shows whether syscalls succeed or fail, so that works out great:
$ strace ./readfile
execve("./readfile", ["./readfile"], 0x7fffb827f0f0 /* 63 vars */) = 0
open("/etc/hosts", O_RDONLY) = 3
exit(4202496) = ?
+++ exited with 0 +++
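Whoops, it looks like we're not passing 0 to exit: rdi still holds the address of path (4202496 is 0x402000). The process only appears to exit with status 0 because just the low 8 bits of that value end up in the exit status. Let's fix that: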
mov rax, 60
xor rdi, rdi ; <--- exit with code 0
syscall
That's better. Now let's try reading some bytes from this file descriptor.
The return value of open is stored in rax, as with all other syscalls. We're going to be using rax to make our next syscall, so we need to save it to the stack with push. Also, we need to allocate memory for our buffer - let's allocate 16 bytes on the stack.
global _start
section .text
_start: mov rax, 2 ; "open"
mov rdi, path ;
xor rsi, rsi ; O_RDONLY
syscall
push rax ; push file descriptor onto stack
sub rsp, 16 ; reserve 16 bytes of memory
xor rax, rax ; "read"
mov rdi, [rsp+16] ; file descriptor
mov rsi, rsp ; address of buffer
mov rdx, 16 ; size of buffer
syscall
mov rax, 60 ; "exit" syscall
syscall
section .data
path: db "/etc/hosts", 0 ; null-terminated
Cool bear's hot tip
The stack is an area we can use to store data. It's limited in size, but simpler to use than the heap.
To reserve memory, we simply subtract from rsp, a register that contains the address of the "top of the stack".
Here's a handy diagram that shows what happens to the stack just before we call read:
Running our program still prints nothing so far, but strace lets us know that everything went fine:
$ make
yasm -f elf64 -g dwarf2 readfile.asm
ld readfile.o -o readfile
$ ./readfile
$ strace ./readfile
execve("./readfile", ["./readfile"], 0x7ffc3dfee7d0 /* 60 vars */) = 0
open("/etc/hosts", O_RDONLY) = 3
read(3, "127.0.0.1\tlocalh", 16) = 16
exit(3) = ?
+++ exited with 3 +++
$
Now let's print that buffer to stdout by using the write call.
xor rax, rax ; "read"
mov rdi, [rsp+16] ; file descriptor
mov rsi, rsp ; address of buffer
mov rdx, 16 ; size of buffer
syscall
; `rax` contains the number of bytes read
; write takes the number of bytes to write via `rdx`
mov rdx, rax ; number of bytes
mov rax, 1 ; "write"
mov rdi, 1 ; file descriptor (stdout)
mov rsi, rsp ; address of buffer
syscall
And we're finally seeing some output:
$ make
yasm -f elf64 -g dwarf2 readfile.asm
ld readfile.o -o readfile
$ ./readfile
127.0.0.1 localh
Finally, we just need to repeat reading and writing until read returns 0 bytes read. (We won't be doing any error checking).
Here's our final program:
global _start
section .text
_start: mov rax, 2 ; "open"
mov rdi, path ;
xor rsi, rsi ; O_RDONLY
syscall
push rax ; push file descriptor onto stack
sub rsp, 16 ; reserve 16 bytes of memory
read_buffer:
xor rax, rax ; "read"
mov rdi, [rsp+16] ; file descriptor
mov rsi, rsp ; address of buffer
mov rdx, 16 ; size of buffer
syscall
test rax, rax
; jz means 'jump if zero'
jz exit
mov rdx, rax ; number of bytes
mov rax, 1 ; "write"
mov rdi, 1 ; file descriptor (stdout)
mov rsi, rsp ; address of buffer
syscall
jmp read_buffer
exit:
mov rax, 60 ; "exit"
xor rdi, rdi ; return code 0
syscall
section .data
path: db "/etc/hosts", 0 ; null-terminated
What did we learn?
Making syscalls on Linux x86_64 involves putting values in some registers, and then using the syscall instruction.
We can use the stack (which grows downward) as temporary storage space.
Cool bear's hot tip
The information about syscalls in this article is extremely Linux-specific.
For example, Windows syscall numbers change across OS versions - sometimes even service packs. If you're curious, check out this table.
While it's possible to make syscalls without using OS libraries, it's not always practical.
Memory-mapped files
Let's try and use strace on another program: ripgrep.
$ strace rg 'localhost' /etc/hosts
(cut)
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=220, ...}) = 0
mmap(NULL, 220, PROT_READ, MAP_SHARED, 3, 0) = 0x7fd3a694a000
write(1, "\33[0m\33[32m1\33[0m:127.0.0.1\t\33[0m\33[1"..., 511:127.0.0.1 localhost) = 51
write(1, "\n", 1
) = 1
write(1, "\33[0m\33[32m5\33[0m:::1 ip6-\33[0m\33"..., 665:::1 ip6-localhost ip6-loopback) = 66
write(1, "\n", 1
) = 1
munmap(0x7fd3a694a000, 220) = 0
close(3) = 0
(cut)
We recognize the openat syscall, and also statx - but.. it doesn't use read. What's happening over here?
Well, remember when we said the kernel is an all-powerful overseer that controls everything the userland interacts with? That goes for memory too!
In an operating system like Linux, each process has its own virtual address space. Some of it is mapped to physical memory, via the Memory management unit (MMU for short).
Cool bear's hot tip
Physical memory is divided into "pages", which makes it easier to address. Pages are often 4KiB, but not always!
For example, Apple Silicon processors like the M1 have 16KiB pages.
When a process is started, a few pages are reserved for its stack. (Which we used above). When allocating memory on the heap, say, with malloc, glibc's allocator asks the kernel for more pages, and keeps track of all the allocations, so that free() works properly.
First, let's check that every process does indeed have a separate address space.
We can make a first program, write.c:
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char **argv) {
    // allocate 4 bytes (unsigned, so that 0xFEEDFACE fits)
    unsigned int *ptr = malloc(sizeof(unsigned int));
    // write a very specific value to it
    *ptr = 0xFEEDFACE;
    // read back the value, and print the address
    printf("Wrote %x to %p\n", *ptr, (void *) ptr);
    // wait for user input
    getc(stdin);
}
There's a good chance this program will print a different address every time, but when I ran it, it printed this:
$ gcc write.c -o write
$ ./write
Wrote feedface to 0x56459a9c7260
We can use that to write a second program, read.c:
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char **argv) {
    // the address printed by `write` above
    unsigned int *ptr = (unsigned int *) 0x56459a9c7260;
    printf("Read %x from %p\n", *ptr, (void *) ptr);
}
And run it:
$ gcc read.c -o read
$ ./read
[1] 6429 segmentation fault (core dumped) ./read
What happened? 0x56459a9c7260 was a valid address in write's virtual address space, but not in read's. Attempting to read from it is an access violation, which results in the kernel sending a signal (SIGSEGV) to our process, and the default handler for that signal terminates the process.
Cool bear's hot tip
We used this in Part 1 to get our stdio-powered C program to segfault!
An access violation is just one type of page fault. A page fault occurs when we try to read from or write to a (virtual) address that isn't currently mapped to physical memory.
And this is precisely the trick behind mmap. When we first mmap a file, the kernel might eagerly read the first 4K of the file into a buffer of its own, and set up the page tables so that the (userland) process can read directly from that buffer:
But once the process reads past the first 4K, then that's a page fault!
Remember, the kernel can do anything in response to a page fault: it may decide that it's an access violation, and send a signal to the process. In this case, it chooses simply to fulfill its promise that "this virtual address range contains the contents of the file", just.. not until it's needed.
The requested part of the file is actually read, and a new page mapping is set up:
The kernel is of course free to "page out" parts of the file, when they haven't been accessed in a while (or as soon as it wants, really!).
Cool bear's hot tip
When executing a program, its image is memory-mapped.
This allows a program to start executing before it's entirely read from disk, which matters a lot if the executable is large, or the I/O device is slow.
Using mmap from assembly
Can we use that from our assembly program? Sure we can!
Since we're not sure what parameters mmap needs (and which registers to put them into), we'll use this Searchable Linux Syscall table for x86 and x86_64 by @FiloSottile.
First, as usual, we'll need to open the file:
_start: mov rax, 2 ; "open"
mov rdi, path ;
xor rsi, rsi ; O_RDONLY
syscall
Next, we want to find the size of the file in bytes, so we can pass it to mmap. We'll use the fstat syscall for that, which takes fd (a file descriptor) in register rdi, and struct stat __user* statbuf in register rsi.
To help me write the next part, I wrote a simple C program that dumps the struct's size, along with the offset of the st_size field, and two constants:
#include <stdio.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/mman.h>
int main() {
    printf("size of stat struct: %zu\n", sizeof(struct stat));
    printf("offset of st_size : %zu\n", offsetof(struct stat, st_size));
    printf("PROT_READ = 0x%x\n", PROT_READ);
    printf("MAP_PRIVATE = 0x%x\n", MAP_PRIVATE);
}
Which outputs:
size of stat struct: 144
offset of st_size : 48
PROT_READ = 0x1
MAP_PRIVATE = 0x2
So, it looks like we'll need to allocate 144 bytes on the stack:
mov rdi, rax ; fd (returned from open)
sub rsp, 144 ; allocate stat struct
mov rsi, rsp ; address of 'struct stat'
mov rax, 5 ; "fstat" syscall
syscall
And then we can feed our file descriptor, file size, and flags to mmap. Note that we can specify an address (but NULL is fine) and an offset (but 0 is fine, since we want the whole file).
Cool bear's hot tip
The mmap syscall takes:
- addr in %rdi
- len in %rsi
- prot (protection) in %rdx
- flags in %r10
- fd in %r8
- off (offset) in %r9
mov rsi, [rsp+48] ; len = file size (from 'struct stat')
add rsp, 144 ; free 'struct stat'
mov r8, rdi ; fd (still in rdi from last syscall)
xor rdi, rdi ; address = 0
mov rdx, 0x1 ; protection = PROT_READ
mov r10, 0x2 ; flags = MAP_PRIVATE
xor r9, r9 ; offset = 0
mov rax, 9 ; "mmap" syscall
syscall
Finally, we can write out the whole file in a single write syscall:
Cool bear's hot tip
The write syscall takes:
- fd in %rdi
- buf in %rsi
- count in %rdx
mov rdx, rsi ; count (file size from last call)
mov rsi, rax ; buffer address (returned from mmap)
mov rdi, 1 ; fd = stdout
mov rax, 1 ; "write" syscall
syscall
And there we have it:
$ make
yasm -f elf64 -g dwarf2 readfile.asm
ld readfile.o -o readfile
$ strace ./readfile > /dev/null
execve("./readfile", ["./readfile"], 0x7ffffc6ddbf0 /* 60 vars */) = 0
open("/etc/hosts", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=220, ...}) = 0
mmap(NULL, 220, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f11fbe5a000
write(1, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 220) = 220
exit(0) = ?
+++ exited with 0 +++
What did we learn?
A process's address space refers to virtual memory, which is then mapped to physical memory via page tables. When an unmapped range is accessed, it results in a page fault.
Instead of reading parts of files with read, we can map them into the virtual address space with mmap. Reading from that range will result in the kernel reading the relevant parts of the file.
Executables are memory-mapped when run (even on Windows).
In the next part, we're going to take a look within the kernel, to see how files are organized and read - and how we can find and read them by using this knowledge, bypassing as much abstraction as we can.