Reading files the hard way - Part 1 (node.js, C, rust, strace)

👋 This page was last updated ~5 years ago. Just so you know.

Everybody knows how to use files. You just open up File Explorer, the Finder, or a File Manager, and bam - it's chock-full of files. There's folders and files as far as the eye can see. It's a genuine filapalooza. I have never once heard someone complain there were not enough files on their computer.

But what is a file, really? And what does reading a file entail, exactly?

That's what we're about to find out. First, by reading a file in the simplest, most user-friendly way, and then peeling off abstraction layers until we are deep, deep into the entrails of the beast.

For the purpose of this article, we're interested in a specific file: /etc/hosts. It's a file present in most Unix-y systems, that maps domain names (like example.org) to IP addresses (like 127.0.0.1). It has a great backstory, too. To read it, we can simply use a GUI:

Prerequisites

Before we go any further: if you want to follow along with this article, you're going to need a Linux system. A real one - not WSL1. If you're already running Linux, great! If not, I recommend setting up a Manjaro virtual machine.

If you've never set up virtual machines before, VirtualBox will do the job, and it's rather easy to use. In fact, Manjaro in VirtualBox is exactly what I used to write this article:

A basic installation with the default settings is enough to follow the first part of the article. An Ubuntu or Fedora install will work just as well, and WSL2 will probably work too.

Alright, let's get back to it!

Command-line tools that deal with files

So far, we've opened File Manager and managed to read a file. If this was our only experience with files (I wish!), our mental model may look like this:

...but files don't actually live in the graphical user interface. They live.. somewhere, down below. Way, way down. And I can prove it.

If we switch to a text terminal, and completely shut down the GUI, we can still access the file:

And it's the exact same file. I'm not going to show it here, but if we edit or remove the file from the GUI or the text terminal, the changes will be reflected everywhere. It follows that they must point to the same location.

Our mental model can be upgraded to the vastly superior:

What did we learn?

Whether we use graphical tools (like Mousepad), or command-line tools (like cat), a name like /etc/hosts always refers to the same file.

Reading files with node.js

So far, we've been reading files by hand (either graphically, or with the terminal). That's all well and good, but what if you want to read lots of files?

Let's give node.js a shot:

JavaScript code
// require returns an object that has a `readFileSync` member,
// this is a destructuring assignment.
const { readFileSync } = require("fs");

// let's use the synchronous variant for now.
const contents = readFileSync("/etc/hosts");

// in a browser, this logs to the developer tools.
// in node.js, this just prints to the terminal.
console.log(contents);

Curious about destructuring assignments? Read more on MDN.

Running this code gives us the following output:

$ node readfile.js
<Buffer 31 32 37 2e 30 2e 30 2e 31 09 6c 6f 63 61 6c 68 6f 73 74 0a 31 32 37
2e 30 2e 31 2e 31 09 62 6c 69 6e 6b 70 61 64 0a 3a 3a 31 09 6c 6f 63 61 6c 68
6f ... 425 more bytes>

This doesn't look like text at all. In fact, it says right in the output: it's just a bunch of bytes. They could be anything at all! We just choose to interpret them as text. Terminal commands like cat are often text-oriented, so they usually assume we're dealing with text files.

If we want text, we're going to have to interpret those bytes as some text encoding - UTF-8 sounds like a safe bet. Luckily, node.js comes with a string decoder facility:

JavaScript code
const { readFileSync } = require("fs");
const { StringDecoder } = require("string_decoder");

const bytes = readFileSync("/etc/hosts");
const decoder = new StringDecoder("utf8");
const text = decoder.write(bytes);
console.log(text);

Running this gives the expected output:

$ node readfile.js
127.0.0.1       localhost
127.0.1.1       blinkpad
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

But here, we took the long way home, just to make a point. You can actually pass the encoding of the file directly to readFileSync, which makes for much shorter code.

JavaScript code
const { readFileSync } = require("fs");

const text = readFileSync("/etc/hosts", { encoding: "utf8" });
console.log(text);

This gives the exact same output. Instead of returning a Buffer, readFileSync now returns a String. All is well.

What happens if we use this code on a non-text file? Say, for example, /bin/true?

Although there are some strings in there, the output is all garbled and filled with replacement characters. It even messed up my terminal, so I had to use the reset command afterwards.

As an aside: if we really wanted to get the strings from /bin/true, we could have used the strings command (shocker!):

Shell session
$ strings /bin/true | head -15
/lib64/ld-linux-x86-64.so.2
__cxa_finalize
__libc_start_main
__cxa_atexit
dcgettext
error
abort
nl_langinfo
__fpending
fileno
__freading
fflush
__errno_location
fclose
lseek
What did we learn?

Files are just collections of bytes. To do anything useful with them, we must interpret them. UTF-8 is a common encoding for text.

Reading files with Rust

Let's move into the land of compiled languages for a hot minute.

If we use the cargo new readfile, go into the readfile directory and edit src/main.rs to be this:

Rust code
use std::{
    fs::File,
    io::{Error, Read},
};

fn main() -> Result<(), Error> {
    let mut file = File::open("/etc/hosts")?;
    let mut text = String::new();
    file.read_to_string(&mut text)?;
    println!("{}", text);

    Ok(())
}
Cool bear's hot tip

Note: As Pascal pointed out, we could just use std::fs::read_to_string.

But then we wouldn't get to learn as much stuff! We use a ReadAt trait in Part 3, and I think it's nice to see the difference.

Then cargo run will print the contents of /etc/hosts (as text). Of course, we could also run cargo build and have a compiled executable in target/debug/readfile.

Cool bear's hot tip

The Rust Programming Language is a gentle introduction to the language.

It's available online for free, if you want to take a break to go through the first few chapters, cool bear will not blame you.

For the benefit of those among us who are new to Rust, I'll go through each section and explain what it does. (If the code is obvious to you, feel free to skip to the next section)

This is a main function - it's what will get run when the program starts:

Rust code
fn main() {}

Since we're going to perform operations that can fail, like reading files, we can use a form of main that returns a result.

Rust code
fn main() -> Result<T, E> {}

Result is a generic type, T is the type of.. the result, if everything goes well. And E is the type of the error, if something goes wrong.

In our case, we don't really have a result, so for T we can just use the empty tuple, (). As for the error type, E, the only fallible operations are I/O (input-output), so we can use std::io::Error.

Rust code
use std::io::Error;

// `Error` is the type we just imported above with `use`
fn main() -> Result<(), Error> {
    // `Ok` is a variant of `Result` that indicates success. The pair of
    // parentheses inside is an empty tuple, just like we promised we'd return
    // Since this expression is the *last thing* in our function, and there
    // is no semi-colon (`;`) after it, it will get returned.
    Ok(())
}

Next, we notice that, unlike in our node.js example, here we have to open a file before reading it.

Rust code
use std::{
    io::Error,
    fs::File,
};

fn main() -> Result<(), Error> {
    // `File::open()` *also* returns a `Result<File, Error>`.
    // The `?` sigil means: if it returns an error, then we
    // should also return that error. If it returns a result,
    // then assign it to file and carry on.
    let file = File::open("/etc/hosts")?;

    Ok(())
}

As it stands, this program opens the file, but does nothing with it. (In fact, the compiler will warn us about that).

There is a read_to_string method defined in the std::io::Read trait, which std::fs::File implements, and it sounds exactly like what we want, so:

  • We need to use that trait, so that we can use its methods
  • And we need a String that the file can be read into.

Note that:

  • The String has to be mutable, because we're changing it (by putting the contents of the file into it).
  • The File also has to be mutable, because reading from it will change the position within the file - which is a mutation, not of the file, but of the file handle.

Finally, we can print it with the println macro. Since it's a macro, we need to write an exclamation mark after its name to invoke it.

Rust code
// import everything we need
use std::{
    fs::File,
    io::{Error, Read},
};

// main is fallible, it can fail because of an I/O error
fn main() -> Result<(), Error> {
    // open the file, return early if it fails.
    // the `file` binding is `mut`, which means mutable,
    // so we can take a mutable reference to it later.
    let mut file = File::open("/etc/hosts")?;
    // create an empty string, have a mutable binding to it,
    // so we can take a mutable reference to it later.
    let mut text = String::new();
    // call `Read::read_to_string` on the file, pass it
    // a mutable reference to our destination string.
    // The signature for `read_to_string` takes `&mut self`,
    // so this line actually takes a mutable reference to file too.
    file.read_to_string(&mut text)?;
    // at this point, `file` can be dropped, because we don't
    // use it anymore. this also frees OS resources associated with it.

    // call the println macro, our format string is just `{}`,
    // which will format an argument that implements the `std::fmt::Display`
    // trait. In our case, `String` just prints its contents as, well, text.
    println!("{}", text);

    // everything went fine, signal sucess with an empty (tuple) result.
    Ok(())
}

Hopefully this is clear enough, even for those who aren't familiar with Rust.

What did we learn?

In Rust, we must first create a new String, to receive the contents of a file.

Opening a file and reading a file are both operations that can fail.

Any mutation must be explicitly allowed via the mut keyword. Reading from a file is a mutation, because it changes our position in the file.

Reading files with C

The C standard library provides us with a "high-level" interface to open and read files, so let's give it a shot.

C code
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>

int main(int argc, char **argv) {
    FILE *file = fopen("/etc/hosts", "r");
    if (!file) {
        fprintf(stderr, "could not open file\n");
        return 1;
    }

    const size_t buffer_size = 16;
    char *buffer = malloc(buffer_size);

    while (true) {
        size_t size = fread(buffer, 1, buffer_size - 1, file);
        if (size == 0) {
            break;
        }

        buffer[size] = '\0';
        printf("%s", buffer);
    }

    printf("\n");
    return 0;
}

Well that was painful, let's review: first we need to open the file. Since it's a "high-level" API, we can specify the mode via a string constant. We're only interested in reading.

C code
    FILE *file = fopen("/etc/hosts", "r");

Then, we have to do a bit of error checking. Note that we won't be able to tell why we could not open the file, just that it didn't work. We'd have to call more functions to do that.

C code
    if (!file) {
        fprintf(stderr, "could not open file\n");
        return 1;
    }

Then, much like in Rust, we have to allocate a buffer. Note that nothing says it's a string at this point - it's just a place to store bytes. I arbitrarily chose "16 bytes" so that it's short enough that we'd notice if our reading logic was wrong (hopefully).

C code
    const size_t buffer_size = 16;
    char *buffer = malloc(buffer_size);

Then, we simply attempt to read a bufferful, repeatedly - until we reach an error or the end of the file. Again, proper error handling would require more code and more function calls.

Rust gave us all of that for free. Just saying.

C code
    while (true) {
        // the arguments are: destination buffer, member size,
        // number of members to read, and `FILE*` handle
        size_t size = fread(buffer, 1, buffer_size - 1, file);
        if (size == 0) {
            break;
        }

        buffer[size] = '\0';
        printf("%s", buffer);
    }

Oh, another fun bit in C. There is no string type. But that's okay, because the %s formatter for printf will just stop as soon as it encounters a null byte. As long as we never ever forget to set it, and never make off by one errors (who makes those anyway?) then we definitely won't expose sensitive information or allow remote code execution. Nothing to see here.

Finally, we make sure to print a final newline because that's what well-behaved CLI (command-line interface) applications do. And we return zero, because apparently that's C for success (except when it isn't, and it means error - it all depends).

What did we learn?

In C, there are no strings, only "arrays" of bytes. When using stdio, we must manually allocate a buffer into which to read a file. If the buffer isn't large enough, we have to repeatedly read into it.

The FILE* returned by fopen is an opaque struct that refers to the opened file. We can only use it with other stdio functions, never look inside.

C makes it extremely easy to shoot oneself in the paw: off-by-one errors, forgetting to terminate strings with a null byte (\0), forgetting to check return values or testing them incorrectly.

Cool bear's hot tip

Learning C is important, because so many projects are written in it.

But nowadays, there's often better alternatives.

Reading files with C, without stdio

But this is still too high level. The fopen, fclose, fread, fwrite family of functions does a lot for us. For example, they do buffering.

So if you run this program:

C code
#include <stdio.h>

int main() {
    for (int i = 0; i < 20; i++) {
        // this is the roundabout way to spell `printf`
        // point is, `printf` is in the `f` family.
        fprintf(stdout, ".");
    }

    // woops, that's illegal
    *((char*) 0xBADBADBAD) = 43;
}

Then it prints nothing:

$ gcc -O3 --std=c99 -Wall -Wpedantic woops.c -o woops
$ ./woops
[1]    22082 segmentation fault (core dumped)  ./woops

because the dots didn't actually get written out until a newline is printed. They were just stored in an in-memory buffer. And the program crashed, so exit handlers didn't get to run, so nothing got written to the terminal.

Cool bear's hot tip

This line:

C code
    *((char*) 0xBADBADBAD) = 43;

...causes a segmentation fault because we're writing to memory that is almost definitely not valid in our process's address space.

However, we can't rely on this always happening. On some embedded systems, this would either silently corrupt data or do nothing.

However! If we use open, read, write, close, now we're starting to get low-level. An indeed this program:

C code
#include <unistd.h>

int main() {
    for (int i = 0; i < 20; i++) {
        // hey that line's different
        write(STDOUT_FILENO, ".", 1);
    }

    // still illegal
    *((char*) 0xBADBADBAD) = 43;
}

...produces the expected output:

$ gcc -O3 --std=c99 -Wall -Wpedantic woops.c -o woops && ./woops
....................[1]    22741 segmentation fault (core dumped)  ./woops
Cool bear's hot tip

-O3 is the third-level of optimisation. -Wall means "all warnings". -Wpedantic is the compiler equivalent of asking to be roasted.

So, can we update our readfile program to use file descriptors instead of <stdio.h>? Absolutely!

C code
#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>

// that's a whole lot of headers, sorry
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char **argv) {
    int fd = open("/etc/hosts", O_RDONLY);
    // oh look this time '0' does not mean error!
    if (fd == -1) {
        fprintf(stderr, "could not open file\n");
        return 1;
    }

    const size_t buffer_size = 16;
    char *buffer = malloc(buffer_size);

    while (true) {
        // this time the size is signed!
        ssize_t size = read(fd, buffer, buffer_size);
        // and, again, -1 means error
        if (size == -1) {
            fprintf(stderr, "an error happened while reading the file");
            return 1;
        }
        // 0 means end of file, probably?
        if (size == 0) {
            break;
        }

        // no need to null-terminate here, just pass
        // a size to write().
        write(STDOUT_FILENO, buffer, size);
    }
    
    write(STDOUT_FILENO, "\n", 1);
    return 0;
}

Aww yeaahh, now this is verbose as heck, surely we are real programmers now.

What did we learn?

stdio is a "high-level" interface to files, that handles buffering, for performance's sake.

The open(), read(), and write() functions are lower-level. When using those, we should bring our own buffering!

A bit of introspection

Remember when we used cat to print the contents of /etc/hosts?

Gods, we were young then.

Well, cat ships with pretty much every Unix-y system, so it must be made by real programmers. If only there was a way to look at their secrets.

As it turns out, strace is exactly what we're looking for.

Shell session
$ strace cat /etc/hosts
execve("/usr/bin/cat", ["cat", "/etc/hosts"], 0x7ffee7075518 /* 60 vars */) = 0
brk(NULL)                               = 0x560f3346c000
arch_prctl(0x3001 /* ARCH_??? */, 0x7fff4452be80) = -1 EINVAL (Invalid argument)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe562a81000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=88313, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 88313, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe562a6b000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3206\2\0\0\0\0\0"..., 832) = 832
(cut)

What does strace do? It "traces system calls and signals". More on that later. At this point, all we know is that it prints a lot of lines, and we don't care about all of them.

We only want to know how cat reads files! The answer lies somewhere near the end of the trace:

openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=220, ...}, AT_EMPTY_PATH) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe562a46000
read(3, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 131072) = 220
write(1, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 220127.0.0.1 localhost

Hey, this is familiar! Let's compare with an strace of our readfile binary. (The C one, that uses open, read and write).

$ strace ./readfile
(cut)
openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
getrandom("\x99\x17\x0c\x29\x8c\xf3\x28\x2e", 8, GRND_NONBLOCK) = 8
brk(NULL)                               = 0x55cabb575000
brk(0x55cabb596000)                     = 0x55cabb596000
read(3, "127.0.0.1\tlocalh", 16)        = 16
write(1, "127.0.0.1\tlocalh", 16127.0.0.1       localh)       = 16

So, it looks like, ultimately, readfile and cat do the same thing, except that cat:

  • uses the fstat syscall to find out the size of the file,
  • allocates a buffer of the proper size
  • reads the entire file in one syscall
  • also writes the entire file in one syscall
What did we learn?

The strace tool allows us to figure out what a program does, under the hood.

Not everything shows up in strace traces, but a lot of interesting stuff does.

Cool bear's hot tip

The cat command takes its name from "catenate", a synonym of "concatenate".

Updating our mental model

Right now, it looks like there's a bunch of ways to read files. Depending on the language you use - C even gives you two ways out of the box!

Our mental model right now may look like this:

...but we can disprove that easily.

If we strace our very first program, the node.js one, we find this:

Shell session
$ strace node ./readfile.js
(cut)
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 17
statx(17, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=220, ...}) = 0
read(17, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 220) = 220
close(17)

Just like the others, it uses open and read! Instead of fstat, it uses statx, which returns extended file status.

Cool bear's hot tip

The "status" of a file is its size in bytes, which user and group it belongs to, various timestamps, and other, lower-level information, which we'll get to later.

Likewise, if we strace the rust one, we find this:

Shell session
$ strace readfile-rs/target/debug/readfile-rs
(cut)
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=220, ...}) = 0
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 220) = 220
read(3, "", 32)                         = 0
write(1, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 220127.0.0.1 localhost

It also uses open and read.

So, no matter what language we use, we always end up back at openat, read, write.

Let's update our mental model:

What did we learn?

Node.js, Rust, and C applications all end up using the same few functions: open(), read(), write(), etc.

You did it!

Whew, that was a lot to cover. But you made it to the end!

We're going through a lot of material quickly. It's ok not to fully understand everything. A lot of these subjects may be covered more in-depth in separate articles at some point.

If Part 1 didn't teach you much, that's ok too! Sometimes a reminder is nice. The next parts do go a lot lower. I'm sure there's something in them you didn't know about.

You're reading the Reading files the hard way series.

If you liked what you saw, please support my work!

Github logo Donate on GitHub Patreon logo Donate on Patreon

Here's another article just for you:

Peeking inside a Rust enum

During a recent Rust Q&A Session on my twitch channel, someone asked a question that seemed simple: why are small string types, like SmartString or SmolStr, the same size as String, but small vec types, like SmallVec, are larger than Vec?

Now I know I just used the adjective simple, but the truth of the matter is: to understand the question, we're going to need a little bit of background.