...but files don't actually live in the graphical user interface. They live..
somewhere, down below. Way, way down . And I can
prove it.
And it's the exact same file. I'm not going to show it here, but if we edit or remove the
file from the GUI or the text terminal, the changes will be reflected everywhere. It follows
that they must point to the same location.
Whether we use graphical tools (like Mousepad), or command-line tools (like
cat), a name like /etc/hosts
always refers to the same file.
So far, we've been reading files by hand (either graphically, or with the
terminal). That's all well and good, but what if you want to read lots of files?
Let's give node.js a shot:
JavaScript code
// require returns an object that has a `readFileSync` member,
// this is a destructuring assignment.
const { readFileSync } = require ( "fs" ) ;
// let's use the synchronous variant for now.
const contents = readFileSync ( "/etc/hosts" ) ;
// in a browser, this logs to the developer tools.
// in node.js, this just prints to the terminal.
console . log ( contents ) ;
Curious about destructuring assignments? Read more on MDN .
Running this code gives us the following output:
$ node readfile.js
<Buffer 31 32 37 2e 30 2e 30 2e 31 09 6c 6f 63 61 6c 68 6f 73 74 0a 31 32 37
2e 30 2e 31 2e 31 09 62 6c 69 6e 6b 70 61 64 0a 3a 3a 31 09 6c 6f 63 61 6c 68
6f ... 425 more bytes>
This doesn't look like text at all. In fact, it says right in the output:
it's just a bunch of bytes. They could be anything at all! We just choose
to interpret them as text. Terminal commands like cat
are often text-oriented,
so they usually assume we're dealing with text files.
If we want text, we're going to have to interpret those bytes as some
text encoding - UTF-8 sounds like a safe bet.
Luckily, node.js comes with a string decoder facility:
JavaScript code
const { readFileSync } = require ( "fs" ) ;
const { StringDecoder } = require ( "string_decoder" ) ;
const bytes = readFileSync ( "/etc/hosts" ) ;
const decoder = new StringDecoder ( "utf8" ) ;
const text = decoder . write ( bytes ) ;
console . log ( text ) ;
Running this gives the expected output:
$ node readfile.js
127.0.0.1 localhost
127.0.1.1 blinkpad
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
But here, we took the long way home, just to make a point. You can actually
pass the encoding of the file directly to readFileSync
, which makes for
much shorter code.
JavaScript code
const { readFileSync } = require ( "fs" ) ;
const text = readFileSync ( "/etc/hosts" , { encoding : "utf8" } ) ;
console . log ( text ) ;
This gives the exact same output. Instead of returning a Buffer
,
readFileSync
now returns a String
. All is well.
What happens if we use this code on a non-text file? Say, for
example, /bin/true
?
Although there are some strings in there, the output is all garbled
and filled with replacement characters . It even messed up my terminal,
so I had to use the reset
command afterwards.
As an aside: if we really wanted to get the strings from /bin/true
,
we could have used the strings
command (shocker!):
Shell session
$ strings /bin/true | head -15
/lib64/ld-linux-x86-64.so.2
__cxa_finalize
__libc_start_main
__cxa_atexit
dcgettext
error
abort
nl_langinfo
__fpending
fileno
__freading
fflush
__errno_location
fclose
lseek
Files are just collections of bytes. To do anything useful with them, we must
interpret them. UTF-8 is a common encoding for text.
Let's move into the land of compiled languages for a hot minute.
If we use the cargo new readfile
, go into the readfile
directory and edit
src/main.rs
to be this:
Rust code
use std:: {
fs:: File,
io:: { Error, Read} ,
} ;
fn main ( ) -> Result < ( ) , Error > {
let mut file = File:: open ( "/etc/hosts" ) ?;
let mut text = String:: new ( ) ;
file. read_to_string ( & mut text) ?;
println ! ( "{}" , text) ;
Ok ( ( ) )
}
Then cargo run
will print the contents of /etc/hosts
(as text).
Of course, we could also run cargo build
and have a compiled executable
in target/debug/readfile
.
The Rust Programming Language is
a gentle introduction to the language.
It's available online for free, if you want to take a break to go through the
first few chapters, cool bear will not blame you.
For the benefit of those among us who are new to Rust, I'll go through each
section and explain what it does. (If the code is obvious to you, feel free
to skip to the next section)
This is a main function - it's what will get run when the program starts:
Since we're going to perform operations that can fail, like reading files,
we can use a form of main that returns a result.
Rust code
fn main ( ) -> Result < T , E > { }
Result is a generic type, T
is the type of.. the result, if everything goes well.
And E
is the type of the error, if something goes wrong.
In our case, we don't really have a result, so for T
we can just use the
empty tuple, ()
. As for the error type, E
, the only fallible operations are
I/O (input-output), so we can use std::io::Error
.
Rust code
use std:: io:: Error;
// `Error` is the type we just imported above with `use`
fn main ( ) -> Result < ( ) , Error > {
// `Ok` is a variant of `Result` that indicates success. The pair of
// parentheses inside is an empty tuple, just like we promised we'd return
// Since this expression is the *last thing* in our function, and there
// is no semi-colon (`;`) after it, it will get returned.
Ok ( ( ) )
}
Next, we notice that, unlike in our node.js example, here we have to open a
file before reading it.
Rust code
use std:: {
io:: Error,
fs:: File,
} ;
fn main ( ) -> Result < ( ) , Error > {
// `File::open()` *also* returns a `Result<File, Error>`.
// The `?` sigil means: if it returns an error, then we
// should also return that error. If it returns a result,
// then assign it to file and carry on.
let file = File:: open ( "/etc/hosts" ) ?;
Ok ( ( ) )
}
As it stands, this program opens the file, but does nothing with it.
(In fact, the compiler will warn us about that).
There is a read_to_string
method defined in the std::io::Read
trait,
which std::fs::File
implements, and it sounds exactly like what we
want, so:
We need to use
that trait, so that we can use its methods
And we need a String that the file can be read into.
Note that:
The String has to be mutable, because we're changing it (by putting
the contents of the file into it).
The File also has to be mutable, because reading from it will change
the position within the file - which is a mutation, not of the file, but
of the file handle .
Finally, we can print it with the println
macro. Since it's a macro,
we need to write an exclamation mark after its name to invoke it.
Rust code
// import everything we need
use std:: {
fs:: File,
io:: { Error, Read} ,
} ;
// main is fallible, it can fail because of an I/O error
fn main ( ) -> Result < ( ) , Error > {
// open the file, return early if it fails.
// the `file` binding is `mut`, which means mutable,
// so we can take a mutable reference to it later.
let mut file = File:: open ( "/etc/hosts" ) ?;
// create an empty string, have a mutable binding to it,
// so we can take a mutable reference to it later.
let mut text = String:: new ( ) ;
// call `Read::read_to_string` on the file, pass it
// a mutable reference to our destination string.
// The signature for `read_to_string` takes `&mut self`,
// so this line actually takes a mutable reference to file too.
file. read_to_string ( & mut text) ?;
// at this point, `file` can be dropped, because we don't
// use it anymore. this also frees OS resources associated with it.
// call the println macro, our format string is just `{}`,
// which will format an argument that implements the `std::fmt::Display`
// trait. In our case, `String` just prints its contents as, well, text.
println ! ( "{}" , text) ;
// everything went fine, signal sucess with an empty (tuple) result.
Ok ( ( ) )
}
Hopefully this is clear enough, even for those who aren't familiar with Rust.
In Rust , we must first create a new String, to
receive the contents of a file.
Opening a file and reading a file are both operations that can fail.
Any mutation must be explicitly allowed via the mut
keyword. Reading from a
file is a mutation, because it changes our position in the file.
The C standard library provides us with a "high-level" interface to open
and read files, so let's give it a shot.
C code
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
int main (int argc , char * * argv ) {
FILE * file = fopen ("/etc/hosts" , "r" );
if (!file ) {
fprintf (stderr , "could not open file\n" );
return 1;
}
const size_t buffer_size = 16;
char * buffer = malloc (buffer_size );
while (true) {
size_t size = fread (buffer , 1, buffer_size - 1, file );
if (size == 0) {
break ;
}
buffer [size ] = '\0';
printf ("%s" , buffer );
}
printf ("\n" );
return 0;
}
Well that was painful, let's review: first we need to open the file. Since
it's a "high-level" API, we can specify the mode via a string constant.
We're only interested in reading.
C code
FILE * file = fopen ("/etc/hosts" , "r" );
Then, we have to do a bit of error checking. Note that we won't be
able to tell why we could not open the file, just that it didn't work.
We'd have to call more functions to do that.
C code
if (!file ) {
fprintf (stderr , "could not open file\n" );
return 1;
}
Then, much like in Rust, we have to allocate a buffer. Note that nothing
says it's a string at this point - it's just a place to store bytes. I
arbitrarily chose "16 bytes" so that it's short enough that we'd notice if
our reading logic was wrong (hopefully).
C code
const size_t buffer_size = 16;
char * buffer = malloc (buffer_size );
Then, we simply attempt to read a bufferful, repeatedly - until we reach
an error or the end of the file. Again, proper error handling would require
more code and more function calls.
Rust gave us all of that for free. Just saying.
C code
while (true) {
// the arguments are: destination buffer, member size,
// number of members to read, and `FILE*` handle
size_t size = fread (buffer , 1, buffer_size - 1, file );
if (size == 0) {
break ;
}
buffer [size ] = '\0';
printf ("%s" , buffer );
}
Oh, another fun bit in C. There is no string type. But that's okay, because
the %s
formatter for printf will just stop as soon as it encounters a null
byte. As long as we never ever forget to set it, and never make off by one
errors (who makes those anyway?) then we definitely won't expose sensitive
information or allow remote code execution. Nothing to see here.
Finally, we make sure to print a final newline because that's what
well-behaved CLI (command-line interface) applications do. And we return
zero, because apparently that's C for success (except when it isn't, and
it means error - it all depends).
In C, there are no strings, only "arrays" of bytes. When using stdio, we must
manually allocate a buffer into which to read a file. If the buffer isn't
large enough, we have to repeatedly read into it.
The FILE*
returned by fopen
is an opaque struct that refers to the opened
file. We can only use it with other stdio functions, never look inside.
C makes it extremely easy to shoot oneself in the paw: off-by-one errors,
forgetting to terminate strings with a null byte (\0
), forgetting to check
return values or testing them incorrectly.
Learning C is important, because so many projects are written in it.
But nowadays, there's often better alternatives .
But this is still too high level. The fopen
, fclose
, fread
, fwrite
family of functions does a lot for us. For example, they do buffering.
So if you run this program:
C code
#include <stdio.h>
int main () {
for (int i = 0; i < 20; i ++ ) {
// this is the roundabout way to spell `printf`
// point is, `printf` is in the `f` family.
fprintf (stdout , "." );
}
// woops, that's illegal
* ((char * ) 0xBADBADBAD) = 43;
}
Then it prints nothing:
$ gcc -O3 --std=c99 -Wall -Wpedantic woops.c -o woops
$ ./woops
[1] 22082 segmentation fault (core dumped) ./woops
because the dots didn't actually get written out until a newline
is printed. They were just stored in an in-memory buffer. And
the program crashed, so exit handlers didn't get to run, so
nothing got written to the terminal.
This line:
C code
* ((char * ) 0xBADBADBAD) = 43;
...causes a segmentation fault because we're writing to memory that is
almost definitely not valid in our process's address space.
However, we can't rely on this always happening. On some embedded
systems , this would either
silently corrupt data or do nothing.
However! If we use open
, read
, write
, close
, now we're starting
to get low-level. An indeed this program:
C code
#include <unistd.h>
int main () {
for (int i = 0; i < 20; i ++ ) {
// hey that line's different
write (STDOUT_FILENO , "." , 1);
}
// still illegal
* ((char * ) 0xBADBADBAD) = 43;
}
...produces the expected output:
$ gcc -O3 --std=c99 -Wall -Wpedantic woops.c -o woops && ./woops
....................[1] 22741 segmentation fault (core dumped) ./woops
-O3
is the third-level of optimisation. -Wall
means "all warnings".
-Wpedantic
is the compiler equivalent of asking to be roasted.
So, can we update our readfile
program to use file descriptors instead
of <stdio.h>
? Absolutely!
C code
#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>
// that's a whole lot of headers, sorry
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int main (int argc , char * * argv ) {
int fd = open ("/etc/hosts" , O_RDONLY );
// oh look this time '0' does not mean error!
if (fd == -1) {
fprintf (stderr , "could not open file\n" );
return 1;
}
const size_t buffer_size = 16;
char * buffer = malloc (buffer_size );
while (true) {
// this time the size is signed!
ssize_t size = read (fd , buffer , buffer_size );
// and, again, -1 means error
if (size == -1) {
fprintf (stderr , "an error happened while reading the file" );
return 1;
}
// 0 means end of file, probably?
if (size == 0) {
break ;
}
// no need to null-terminate here, just pass
// a size to write().
write (STDOUT_FILENO , buffer , size );
}
write (STDOUT_FILENO , "\n" , 1);
return 0;
}
Aww yeaahh, now this is verbose as heck, surely we are real programmers now.
stdio is a "high-level" interface to files, that handles buffering, for performance's sake.
The open()
, read()
, and write()
functions are lower-level. When using those,
we should bring our own buffering!
Remember when we used cat
to print the contents of /etc/hosts
?
Gods, we were young then.
Well, cat
ships with pretty much every Unix-y system, so it must be made
by real programmers. If only there was a way to look at their secrets.
As it turns out, strace
is exactly what we're looking for.
Shell session
$ strace cat /etc/hosts
execve("/usr/bin/cat", ["cat", "/etc/hosts"], 0x7ffee7075518 /* 60 vars */) = 0
brk(NULL) = 0x560f3346c000
arch_prctl(0x3001 /* ARCH_??? */, 0x7fff4452be80) = -1 EINVAL (Invalid argument)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe562a81000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=88313, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 88313, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe562a6b000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3206\2\0\0\0\0\0"..., 832) = 832
(cut)
What does strace do? It "traces system calls and signals". More on that later.
At this point, all we know is that it prints a lot of lines, and we don't
care about all of them.
We only want to know how cat
reads files! The answer lies somewhere near
the end of the trace:
openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=220, ...}, AT_EMPTY_PATH) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe562a46000
read(3, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 131072) = 220
write(1, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 220127.0.0.1 localhost
Hey, this is familiar! Let's compare with an strace of our readfile
binary.
(The C one, that uses open
, read
and write
).
$ strace ./readfile
(cut)
openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
getrandom("\x99\x17\x0c\x29\x8c\xf3\x28\x2e", 8, GRND_NONBLOCK) = 8
brk(NULL) = 0x55cabb575000
brk(0x55cabb596000) = 0x55cabb596000
read(3, "127.0.0.1\tlocalh", 16) = 16
write(1, "127.0.0.1\tlocalh", 16127.0.0.1 localh) = 16
So, it looks like, ultimately, readfile
and cat
do the same thing, except
that cat
:
uses the fstat
syscall to find out the size of the file,
allocates a buffer of the proper size
reads the entire file in one syscall
also writes the entire file in one syscall
The strace
tool allows us to figure out what a program does, under the hood.
Not everything shows up in strace traces, but a lot of interesting stuff does.
The cat
command takes its name from "catenate", a synonym of "concatenate".
Right now, it looks like there's a bunch of ways to read files. Depending
on the language you use - C even gives you two ways out of the box!
Our mental model right now may look like this:
...but we can disprove that easily.
If we strace our very first program, the node.js one, we find this:
Shell session
$ strace node ./readfile.js
(cut)
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 17
statx(17, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=220, ...}) = 0
read(17, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 220) = 220
close(17)
Just like the others, it uses open
and read
! Instead of fstat
,
it uses statx
, which returns extended file status .
The "status" of a file is its size in bytes, which user and group it belongs
to, various timestamps, and other, lower-level information, which we'll get
to later.
Likewise, if we strace the rust one, we find this:
Shell session
$ strace readfile-rs/target/debug/readfile-rs
(cut)
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=220, ...}) = 0
lseek(3, 0, SEEK_CUR) = 0
read(3, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 220) = 220
read(3, "", 32) = 0
write(1, "127.0.0.1\tlocalhost\n127.0.1.1\tso"..., 220127.0.0.1 localhost
It also uses open
and read
.
So, no matter what language we use, we always end up back at openat
,
read
, write
.
Let's update our mental model:
Node.js, Rust, and C applications all end up using the same few functions:
open()
, read()
, write()
, etc.
Whew, that was a lot to cover. But you made it to the end!
We're going through a lot of material quickly. It's ok not to fully
understand everything. A lot of these subjects may be covered more in-depth
in separate articles at some point.
If Part 1 didn't teach you much, that's ok too! Sometimes a reminder is nice.
The next parts do go a lot lower. I'm sure there's something in them you
didn't know about.