Reading files the hard way - Part 1 (node.js, C, rust, strace)
Everybody knows how to use files. You just open up File Explorer, the Finder, or a File Manager, and bam - it's chock-full of files. There's folders and files as far as the eye can see. It's a genuine filapalooza. I have never once heard someone complain there were not enough files on their computer.
But what is a file, really? And what does reading a file entail, exactly?
That's what we're about to find out. First, by reading a file in the simplest, most user-friendly way, and then peeling off abstraction layers until we are deep, deep into the entrails of the beast.
For the purpose of this article, we're interested in a specific file: `/etc/hosts`.

It's a file present in most Unix-y systems, that maps domain names (like `example.org`) to IP addresses (like `127.0.0.1`). It has a great backstory, too.
To read it, we can simply use a GUI:
Prerequisites
Before we go any further: if you want to follow along with this article, you're going to need a Linux system. A real one - not WSL1. If you're already running Linux, great! If not, I recommend setting up a Manjaro virtual machine.
If you've never set up virtual machines before, VirtualBox will do the job, and it's rather easy to use. In fact, Manjaro in VirtualBox is exactly what I used to write this article:
A basic installation with the default settings is enough to follow the first part of the article. An Ubuntu or Fedora install will work just as well, and WSL2 will probably work too.
Alright, let's get back to it!
Command-line tools that deal with files
So far, we've opened File Manager and managed to read a file. If this was our only experience with files (I wish!), our mental model may look like this:
...but files don't actually live in the graphical user interface. They live.. somewhere, down below. Way, way down. And I can prove it.
If we switch to a text terminal, and completely shut down the GUI, we can still access the file:
And it's the exact same file. I'm not going to show it here, but if we edit or remove the file from the GUI or the text terminal, the changes will be reflected everywhere. It follows that they must point to the same location.
Our mental model can be upgraded to the vastly superior:
Whether we use graphical tools (like Mousepad), or command-line tools (like `cat`), a name like `/etc/hosts` always refers to the same file.
Reading files with node.js
So far, we've been reading files by hand (either graphically, or with the terminal). That's all well and good, but what if you want to read lots of files?
Let's give node.js a shot:
```javascript
// require returns an object that has a `readFileSync` member,
// this is a destructuring assignment.
const { readFileSync } = require("fs");

// let's use the synchronous variant for now.
const contents = readFileSync("/etc/hosts");

// in a browser, this logs to the developer tools.
// in node.js, this just prints to the terminal.
console.log(contents);
```
Curious about destructuring assignments? Read more on MDN.
Running this code gives us the following output:
```
$ node readfile.js
<Buffer 31 32 37 2e 30 2e 30 2e 31 09 6c 6f 63 61 6c 68 6f 73 74 0a 31 32 37 2e 30 2e 31 2e 31 09 62 6c 69 6e 6b 70 61 64 0a 3a 3a 31 09 6c 6f 63 61 6c 68 6f ... 425 more bytes>
```
This doesn't look like text at all. In fact, it says right in the output: it's a `Buffer` - just a bunch of bytes. They could be anything at all! We just choose to interpret them as text. Terminal commands like `cat` are often text-oriented, so they usually assume we're dealing with text files.
If we want text, we're going to have to interpret those bytes as some text encoding - UTF-8 sounds like a safe bet. Luckily, node.js comes with a string decoder facility:
```javascript
const { readFileSync } = require("fs");
const { StringDecoder } = require("string_decoder");

const bytes = readFileSync("/etc/hosts");
const decoder = new StringDecoder("utf8");
const text = decoder.write(bytes);
console.log(text);
```
Running this gives the expected output:
```
$ node readfile.js
127.0.0.1	localhost
127.0.1.1	blinkpad
::1	localhost ip6-localhost ip6-loopback
ff02::1	ip6-allnodes
ff02::2	ip6-allrouters
```
But here, we took the long way home, just to make a point. You can actually pass the encoding of the file directly to `readFileSync`, which makes for much shorter code.
```javascript
const { readFileSync } = require("fs");

const text = readFileSync("/etc/hosts", { encoding: "utf8" });
console.log(text);
```
This gives the exact same output. Instead of returning a `Buffer`, `readFileSync` now returns a `String`. All is well.
What happens if we use this code on a non-text file? Say, for example, `/bin/true`?
Although there are some strings in there, the output is all garbled and filled with replacement characters. It even messed up my terminal, so I had to use the `reset` command afterwards.
As an aside: if we really wanted to get the strings from `/bin/true`, we could have used the `strings` command (shocker!):
```
$ strings /bin/true | head -15
/lib64/ld-linux-x86-64.so.2
__cxa_finalize
__libc_start_main
__cxa_atexit
dcgettext
error
abort
nl_langinfo
__fpending
fileno
__freading
fflush
__errno_location
fclose
lseek
```
Files are just collections of bytes. To do anything useful with them, we must interpret them. UTF-8 is a common encoding for text.
Reading files with Rust
Let's move into the land of compiled languages for a hot minute.
If we run `cargo new readfile`, go into the `readfile` directory and edit `src/main.rs` to be this:
```rust
use std::{
    fs::File,
    io::{Error, Read},
};

fn main() -> Result<(), Error> {
    let mut file = File::open("/etc/hosts")?;
    let mut text = String::new();
    file.read_to_string(&mut text)?;
    println!("{}", text);
    Ok(())
}
```
Note: As Pascal pointed out, we could just use `std::fs::read_to_string`. But then we wouldn't get to learn as much stuff! We use a `ReadAt` trait in Part 3, and I think it's nice to see the difference.
Then `cargo run` will print the contents of `/etc/hosts` (as text). Of course, we could also run `cargo build` and have a compiled executable in `target/debug/readfile`.
The Rust Programming Language is a gentle introduction to the language.
It's available online for free. If you want to take a break to go through the first few chapters, cool bear will not blame you.
For the benefit of those among us who are new to Rust, I'll go through each section and explain what it does. (If the code is obvious to you, feel free to skip to the next section)
This is a main function - it's what will get run when the program starts:
```rust
fn main() {}
```
Since we're going to perform operations that can fail, like reading files, we can use a form of main that returns a result.
```rust
fn main() -> Result<T, E> {}
```
`Result` is a generic type: `T` is the type of.. the result, if everything goes well. And `E` is the type of the error, if something goes wrong.

In our case, we don't really have a result, so for `T` we can just use the empty tuple, `()`. As for the error type, `E`, the only fallible operations are I/O (input-output), so we can use `std::io::Error`.
```rust
use std::io::Error;

// `Error` is the type we just imported above with `use`
fn main() -> Result<(), Error> {
    // `Ok` is a variant of `Result` that indicates success. The pair of
    // parentheses inside is an empty tuple, just like we promised we'd return.
    // Since this expression is the *last thing* in our function, and there
    // is no semi-colon (`;`) after it, it will get returned.
    Ok(())
}
```
Next, we notice that, unlike in our node.js example, here we have to open a file before reading it.
```rust
use std::{
    io::Error,
    fs::File,
};

fn main() -> Result<(), Error> {
    // `File::open()` *also* returns a `Result<File, Error>`.
    // The `?` sigil means: if it returns an error, then we
    // should also return that error. If it returns a result,
    // then assign it to `file` and carry on.
    let file = File::open("/etc/hosts")?;

    Ok(())
}
```
As it stands, this program opens the file, but does nothing with it. (In fact, the compiler will warn us about that).
There is a `read_to_string` method defined in the `std::io::Read` trait, which `std::fs::File` implements, and it sounds exactly like what we want, so:

- We need to `use` that trait, so that we can use its methods
- And we need a String that the file can be read into
Note that:
- The String has to be mutable, because we're changing it (by putting the contents of the file into it).
- The File also has to be mutable, because reading from it will change the position within the file - which is a mutation, not of the file, but of the file handle.
Finally, we can print it with the `println` macro. Since it's a macro, we need to write an exclamation mark after its name to invoke it.
```rust
// import everything we need
use std::{
    fs::File,
    io::{Error, Read},
};

// main is fallible, it can fail because of an I/O error
fn main() -> Result<(), Error> {
    // open the file, return early if it fails.
    // the `file` binding is `mut`, which means mutable,
    // so we can take a mutable reference to it later.
    let mut file = File::open("/etc/hosts")?;

    // create an empty string, have a mutable binding to it,
    // so we can take a mutable reference to it later.
    let mut text = String::new();

    // call `Read::read_to_string` on the file, pass it
    // a mutable reference to our destination string.
    // The signature for `read_to_string` takes `&mut self`,
    // so this line actually takes a mutable reference to `file` too.
    file.read_to_string(&mut text)?;

    // at this point, `file` can be dropped, because we don't
    // use it anymore. this also frees OS resources associated with it.

    // call the println macro, our format string is just `{}`,
    // which will format an argument that implements the `std::fmt::Display`
    // trait. In our case, `String` just prints its contents as, well, text.
    println!("{}", text);

    // everything went fine, signal success with an empty (tuple) result.
    Ok(())
}
```
Hopefully this is clear enough, even for those who aren't familiar with Rust.
In Rust, we must first create a new String, to receive the contents of a file.
Opening a file and reading a file are both operations that can fail.
Any mutation must be explicitly allowed via the `mut` keyword. Reading from a file is a mutation, because it changes our position in the file.
Reading files with C
The C standard library provides us with a "high-level" interface to open and read files, so let's give it a shot.
```c
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>

int main(int argc, char **argv) {
    FILE *file = fopen("/etc/hosts", "r");
    if (!file) {
        fprintf(stderr, "could not open file\n");
        return 1;
    }

    const size_t buffer_size = 16;
    char *buffer = malloc(buffer_size);

    while (true) {
        size_t size = fread(buffer, 1, buffer_size - 1, file);
        if (size == 0) {
            break;
        }
        buffer[size] = '\0';
        printf("%s", buffer);
    }
    printf("\n");

    return 0;
}
```
Well that was painful, let's review: first we need to open the file. Since it's a "high-level" API, we can specify the mode via a string constant. We're only interested in reading.
```c
FILE *file = fopen("/etc/hosts", "r");
```
Then, we have to do a bit of error checking. Note that we won't be able to tell why we could not open the file, just that it didn't work. We'd have to call more functions to do that.
```c
if (!file) {
    fprintf(stderr, "could not open file\n");
    return 1;
}
```
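If we did want to know why it failed, the C standard library records the reason in `errno`; here's a minimal sketch (my variation, not part of the program above) that reports it with `strerror`:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *file = fopen("/etc/hosts", "r");
    if (!file) {
        // fopen sets errno on failure; strerror turns that code into
        // a human-readable message, e.g. "No such file or directory".
        fprintf(stderr, "could not open file: %s\n", strerror(errno));
        return 1;
    }
    fclose(file);
    return 0;
}
```

(`perror("fopen")` is an even shorter way to print the same kind of message.)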
Then, much like in Rust, we have to allocate a buffer. Note that nothing says it's a string at this point - it's just a place to store bytes. I arbitrarily chose "16 bytes" so that it's short enough that we'd notice if our reading logic was wrong (hopefully).
```c
const size_t buffer_size = 16;
char *buffer = malloc(buffer_size);
```
Then, we simply attempt to read a bufferful, repeatedly - until we reach an error or the end of the file. Again, proper error handling would require more code and more function calls.
Rust gave us all of that for free. Just saying.
```c
while (true) {
    // the arguments are: destination buffer, member size,
    // number of members to read, and `FILE*` handle
    size_t size = fread(buffer, 1, buffer_size - 1, file);
    if (size == 0) {
        break;
    }
    buffer[size] = '\0';
    printf("%s", buffer);
}
```
Oh, another fun bit in C. There is no string type. But that's okay, because the `%s` formatter for printf will just stop as soon as it encounters a null byte. As long as we never ever forget to set it, and never make off-by-one errors (who makes those anyway?), then we definitely won't expose sensitive information or allow remote code execution. Nothing to see here.
Finally, we make sure to print a final newline because that's what well-behaved CLI (command-line interface) applications do. And we return zero, because apparently that's C for success (except when it isn't, and it means error - it all depends).
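As a small aside (my addition, not from the program above): `<stdlib.h>` defines named constants for these exit codes, which can read a little clearer than bare numbers:

```c
#include <stdlib.h>

int main(void) {
    // same as `return 0;` - the conventional "everything went fine" exit code.
    // for failures, we'd `return EXIT_FAILURE;` (a non-zero value) instead.
    return EXIT_SUCCESS;
}
```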
In C, there are no strings, only "arrays" of bytes. When using stdio, we must manually allocate a buffer into which to read a file. If the buffer isn't large enough, we have to repeatedly read into it.
The `FILE*` returned by `fopen` is an opaque struct that refers to the opened file. We can only use it with other stdio functions, never look inside.
C makes it extremely easy to shoot oneself in the paw: off-by-one errors, forgetting to terminate strings with a null byte (`\0`), forgetting to check return values or testing them incorrectly.
Learning C is important, because so many projects are written in it.
But nowadays, there are often better alternatives.
Reading files with C, without stdio
But this is still too high-level. The `fopen`, `fclose`, `fread`, `fwrite` family of functions does a lot for us. For example, they do buffering.
So if you run this program:
```c
#include <stdio.h>

int main() {
    for (int i = 0; i < 20; i++) {
        // this is the roundabout way to spell `printf`
        // point is, `printf` is in the `f` family.
        fprintf(stdout, ".");
    }

    // woops, that's illegal
    *((char*) 0xBADBADBAD) = 43;
}
```
Then it prints nothing:
```
$ gcc -O3 --std=c99 -Wall -Wpedantic woops.c -o woops
$ ./woops
[1] 22082 segmentation fault (core dumped) ./woops
```
because the dots didn't actually get written out until a newline is printed. They were just stored in an in-memory buffer. And the program crashed, so exit handlers didn't get to run, so nothing got written to the terminal.
This line:
```c
*((char*) 0xBADBADBAD) = 43;
```
...causes a segmentation fault because we're writing to memory that is almost definitely not valid in our process's address space.
However, we can't rely on this always happening. On some embedded systems, this would either silently corrupt data or do nothing.
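As an aside (my variation, not from the article's program): if we wanted the dots to show up even though the program still crashes, we could hand stdio's buffer to the OS ourselves, right before the bad write:

```c
#include <stdio.h>

int main() {
    for (int i = 0; i < 20; i++) {
        fprintf(stdout, ".");
    }

    // force stdio to flush its in-memory buffer now,
    // instead of waiting for a newline or a clean exit.
    fflush(stdout);

    // still illegal, still crashes - but the dots are already out.
    *((char*) 0xBADBADBAD) = 43;
}
```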
However! If we use `open`, `read`, `write`, `close`, now we're starting to get low-level. And indeed, this program:
```c
#include <unistd.h>

int main() {
    for (int i = 0; i < 20; i++) {
        // hey that line's different
        write(STDOUT_FILENO, ".", 1);
    }

    // still illegal
    *((char*) 0xBADBADBAD) = 43;
}
```
...produces the expected output:
```
$ gcc -O3 --std=c99 -Wall -Wpedantic woops.c -o woops && ./woops
....................[1] 22741 segmentation fault (core dumped) ./woops
```
`-O3` is the third level of optimisation. `-Wall` means "all warnings". `-Wpedantic` is the compiler equivalent of asking to be roasted.
So, can we update our `readfile` program to use file descriptors instead of `<stdio.h>`? Absolutely!
```c
#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>

// that's a whole lot of headers, sorry
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char **argv) {
    int fd = open("/etc/hosts", O_RDONLY);
    // oh look this time '0' does not mean error!
    if (fd == -1) {
        fprintf(stderr, "could not open file\n");
        return 1;
    }

    const size_t buffer_size = 16;
    char *buffer = malloc(buffer_size);

    while (true) {
        // this time the size is signed!
        ssize_t size = read(fd, buffer, buffer_size);

        // and, again, -1 means error
        if (size == -1) {
            fprintf(stderr, "an error happened while reading the file");
            return 1;
        }

        // 0 means end of file, probably?
        if (size == 0) {
            break;
        }

        // no need to null-terminate here, just pass
        // a size to write().
        write(STDOUT_FILENO, buffer, size);
    }
    write(STDOUT_FILENO, "\n", 1);

    return 0;
}
```
Aww yeaahh, now this is verbose as heck, surely we are real programmers now.
stdio is a "high-level" interface to files that handles buffering, for performance's sake.

The `open()`, `read()`, and `write()` functions are lower-level. When using those, we should bring our own buffering!
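To illustrate what "bringing our own buffering" might look like, here's a made-up sketch (the names `out_buf`, `flush_out` and `buffered_write` are mine, and the 4096-byte size is arbitrary): collect bytes in memory, and only call `write()` when the buffer fills up or we're done.

```c
#include <string.h>
#include <unistd.h>

// a tiny, hypothetical output buffer
static char out_buf[4096];
static size_t out_len = 0;

// hand whatever we've accumulated to the OS in a single write()
static void flush_out(void) {
    if (out_len > 0) {
        write(STDOUT_FILENO, out_buf, out_len);
        out_len = 0;
    }
}

// queue bytes in memory; only issue a syscall when the buffer is full
static void buffered_write(const char *data, size_t len) {
    while (len > 0) {
        size_t space = sizeof(out_buf) - out_len;
        size_t chunk = len < space ? len : space;
        memcpy(out_buf + out_len, data, chunk);
        out_len += chunk;
        data += chunk;
        len -= chunk;
        if (out_len == sizeof(out_buf)) {
            flush_out();
        }
    }
}

int main(void) {
    for (int i = 0; i < 20; i++) {
        buffered_write(".", 1);  // no syscall here...
    }
    buffered_write("\n", 1);
    flush_out();                 // ...just one write() at the end
    return 0;
}
```

This is, roughly, the service that stdio's `FILE*` buffering performs for us behind the scenes.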
A bit of introspection
Remember when we used `cat` to print the contents of `/etc/hosts`?
Gods, we were young then.
Well, `cat` ships with pretty much every Unix-y system, so it must be made by real programmers. If only there was a way to look at their secrets.
As it turns out, `strace` is exactly what we're looking for.
```
$ strace cat /etc/hosts
execve("/usr/bin/cat", ["cat", "/etc/hosts"], 0x7ffee7075518 /* 60 vars */) = 0
brk(NULL) = 0x560f3346c000
arch_prctl(0x3001 /* ARCH_??? */, 0x7fff4452be80) = -1 EINVAL (Invalid argument)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe562a81000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=88313, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 88313, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe562a6b000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3206\2\0\0\0\0\0"..., 832) = 832
(cut)
```
What does strace do? It "traces system calls and signals". More on that later. At this point, all we know is that it prints a lot of lines, and we don't care about all of them.
We only want to know how `cat` reads files! The answer lies somewhere near the end of the trace:
```
openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=220, ...}, AT_EMPTY_PATH) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe562a46000
read(3, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 131072) = 220
write(1, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 220127.0.0.1 localhost
```
Hey, this is familiar! Let's compare with an strace of our `readfile` binary. (The C one, that uses `open`, `read` and `write`.)
```
$ strace ./readfile
(cut)
openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
getrandom("\x99\x17\x0c\x29\x8c\xf3\x28\x2e", 8, GRND_NONBLOCK) = 8
brk(NULL) = 0x55cabb575000
brk(0x55cabb596000) = 0x55cabb596000
read(3, "127.0.0.1\tlocalh", 16) = 16
write(1, "127.0.0.1\tlocalh", 16127.0.0.1 localh) = 16
```
So, it looks like, ultimately, `readfile` and `cat` do the same thing, except that `cat`:

- uses the `fstat` syscall to find out the size of the file,
- allocates a buffer of the proper size,
- reads the entire file in one syscall,
- also writes the entire file in one syscall.
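Here's roughly what that approach could look like in C - a sketch based on the list above, not `cat`'s actual source:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hosts", O_RDONLY);
    if (fd == -1) {
        fprintf(stderr, "could not open file\n");
        return 1;
    }

    // ask the kernel for the file's status - we only care about its size
    struct stat st;
    if (fstat(fd, &st) == -1) {
        fprintf(stderr, "could not stat file\n");
        return 1;
    }

    // allocate a buffer of exactly the right size...
    char *buffer = malloc(st.st_size);

    // ...so a single read() and a single write() are (usually) enough
    ssize_t size = read(fd, buffer, st.st_size);
    if (size > 0) {
        write(STDOUT_FILENO, buffer, size);
    }

    free(buffer);
    close(fd);
    return 0;
}
```

In practice a single `read()` isn't guaranteed to return the whole file, so real code would loop, but for a small file like `/etc/hosts` this usually does the trick.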
The `strace` tool allows us to figure out what a program does, under the hood. Not everything shows up in strace traces, but a lot of interesting stuff does.
The `cat` command takes its name from "catenate", a synonym of "concatenate".
Updating our mental model
Right now, it looks like there's a bunch of ways to read files, depending on the language you use - C even gives you two ways out of the box!
Our mental model right now may look like this:
...but we can disprove that easily.
If we strace our very first program, the node.js one, we find this:
```
$ strace node ./readfile.js
(cut)
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 17
statx(17, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=220, ...}) = 0
read(17, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 220) = 220
close(17)
```
Just like the others, it uses `open` and `read`! Instead of `fstat`, it uses `statx`, which returns extended file status.
The "status" of a file is its size in bytes, which user and group it belongs to, various timestamps, and other, lower-level information, which we'll get to later.
Likewise, if we strace the rust one, we find this:
```
$ strace readfile-rs/target/debug/readfile-rs
(cut)
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=220, ...}) = 0
lseek(3, 0, SEEK_CUR) = 0
read(3, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 220) = 220
read(3, "", 32) = 0
write(1, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 220127.0.0.1 localhost
```
It also uses `open` and `read`.
So, no matter what language we use, we always end up back at `openat`, `read`, `write`.
Let's update our mental model:
Node.js, Rust, and C applications all end up using the same few functions: `open()`, `read()`, `write()`, etc.
You did it!
Whew, that was a lot to cover. But you made it to the end!
We're going through a lot of material quickly. It's ok not to fully understand everything. A lot of these subjects may be covered more in-depth in separate articles at some point.
If Part 1 didn't teach you much, that's ok too! Sometimes a reminder is nice. The next parts do go a lot lower. I'm sure there's something in them you didn't know about.