Reading files the hard way - Part 3 (ftrace, disk layouts, ext4)
This article is part of the Reading files the hard way series.
Contents
- In the belly of the beast
- Where did our paths go?
- You can make a filesystem out of that (or can you?)
- Enter inodes
- Solving the remaining problems
- Now for a taste of the real world
- Let's read a whole partition I guess
- Block groups
- Reading an inode in ext4
So far, we've seen many ways to read a file from different programming languages, and we've learned about syscalls and how to make them from assembly. Then we learned about memory mapping, virtual address spaces, and generally some of the mechanisms through which userland and the kernel interact.
But in our exploration, we've always considered the kernel more or less like a "black box". It's time to change that.
In the belly of the beast
In part 2, we've used gdb to step through our userland program, readfile.
This allowed us to inspect the state of CPU registers after each instruction.
But we weren't able to see what was going on in the kernel.
We've also used strace to trace system calls. It was less noisy than gdb, because it only displayed information when our userland program was communicating with the kernel.
Now it's time to step it up and use ftrace. It's a tracing facility integrated into the Linux kernel. Thanks to it, we can enable tracing of various kernel functions and events.
There's an excellent Introduction to ftrace over on Julia Evans's blog.
Speaking of Julia, check out her fantastic zines!
The easiest way for us to get started is to use trace-cmd. On Manjaro/ArchLinux, there's an AUR package for that, so we can just grab it:
yay -S trace-cmd
The plan is to:
- Enable tracing for all ext4 events (more on this later)
- Run our readfile program (the assembly one)
- Look at the report and try to piece together what's going on
My first trace wasn't successful - it only contained events related to zsh (my shell), and git (called by my shell's prompt to show indicators).
It turns out there are several levels of caching, and /etc/hosts was already cached to the maximum, so reading it didn't even register.
Luckily, we can clear all these caches (while the system is running!) by running the following as root:
sync
echo 3 > /proc/sys/vm/drop_caches
Then, we can start tracing (again, as root):
trace-cmd record -e ext4
Then launch our program, and ask trace-cmd to format the collected data in a human-readable format:
trace-cmd report trace.dat > trace.txt
And there we have it:
readfile-30843 [001] 1557.395867: ext4_es_lookup_extent_enter: dev 8,1 ino 1311701 lblk 0
readfile-30843 [001] 1557.395869: ext4_es_lookup_extent_exit: dev 8,1 ino 1311701 found 0 [0/0) 0
readfile-30843 [001] 1557.395872: ext4_ext_map_blocks_enter: dev 8,1 ino 1311701 lblk 0 len 3 flags
readfile-30843 [001] 1557.395875: ext4_ext_show_extent: dev 8,1 ino 1311701 lblk 0 pblk 5339898 len 3
readfile-30843 [001] 1557.395877: ext4_ext_map_blocks_exit: dev 8,1 ino 1311701 flags lblk 0 pblk 5339898 len 3 mflags 0x20 ret 3
readfile-30843 [001] 1557.395878: ext4_es_insert_extent: dev 8,1 ino 1311701 es [0/3) mapped 5339898 status W
readfile-30843 [001] 1557.403239: ext4_es_lookup_extent_enter: dev 8,1 ino 1840090 lblk 0
readfile-30843 [001] 1557.403243: ext4_es_lookup_extent_exit: dev 8,1 ino 1840090 found 0 [0/0) 0
readfile-30843 [001] 1557.403245: ext4_ext_map_blocks_enter: dev 8,1 ino 1840090 lblk 0 len 1 flags
readfile-30843 [001] 1557.403248: ext4_ext_show_extent: dev 8,1 ino 1840090 lblk 0 pblk 26724819 len 1
readfile-30843 [001] 1557.403251: ext4_ext_map_blocks_exit: dev 8,1 ino 1840090 flags lblk 0 pblk 26724819 len 1 mflags 0x20 ret 1
readfile-30843 [001] 1557.403252: ext4_es_insert_extent: dev 8,1 ino 1840090 es [0/1) mapped 26724819 status W
The columns appear to be formatted like so:
process_name-pid [cpu_core] timestamp: event_name: parameters
I was unable to find precise documentation on each of the events, but we can spot that two main things are being done here:
- Lookup and map "inode 1311701"
- Lookup and map "inode 1840090"
And that's it.
ftrace allows tracing behavior inside kernel code. We can't change it, but we can get a better idea of what's going on.
Where did our paths go?
Whereas strace showed us /etc/hosts, the path of the file we're mapping, ftrace doesn't. Instead it shows us inode numbers. To understand why, we must think about the ergonomics of a file system.
✨ Let's talk about symbolic links ✨
Symbolic links (symlinks for short) are files, but they're not regular files. They point to another file, and whenever they're opened, the other file should be opened instead. There's.. reasons to use symlinks, which I won't get into, just trust me on that one.
Point is, there's a bunch of symlinks in almost any Linux system, for example in /usr/lib:

Here, libcairo.so and libcairo.so.2 both point to libcairo.so.2.11703.0.
If we run a command like wc (which counts words, but also bytes), we see that it's accessing the file being pointed to in both cases:

However, they're not copies. They're just.. symbolic links. As for implementation details, we notice two things: the size of those symlink files is exactly the size of the path they point to, and they have a slightly different mode:

So symbolic links are files, with a special mode flag, whose contents is the path they point to. This means they can be dangling!
A dangling symlink is one that points to a non-existent file.
See the example below:

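If you want to poke at this yourself, here's a quick Rust sketch using only the standard library (the paths are made up). It creates a dangling symlink, reads its target, and shows that the link's own size is exactly the length of that target path:
use std::fs;
use std::os::unix::fs::symlink;

fn main() -> std::io::Result<()> {
    // create a symlink pointing to a file that doesn't exist
    symlink("/does/not/exist", "/tmp/dangling")?;

    // read_link reads the symlink itself: its contents is just a path
    let target = fs::read_link("/tmp/dangling")?;
    println!("points to: {}", target.display());

    // symlink_metadata inspects the link, not its target - note that
    // len() is 15, the length of "/does/not/exist"
    let meta = fs::symlink_metadata("/tmp/dangling")?;
    println!("len = {}", meta.len());

    // but actually opening it fails, since the target doesn't exist
    assert!(fs::File::open("/tmp/dangling").is_err());

    fs::remove_file("/tmp/dangling")?;
    Ok(())
}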
You can make a filesystem out of that (or can you?)
Let's talk about folders. As we've seen, symlinks are just files, with a special mode, and their contents is a path. Couldn't we do the same for folders?
Let's make up a filesystem. What's a filesystem? Just a way to organize files and folders, their contents, and their metadata on disk. You can think of a disk as a big blank slate, like so:

Throughout the rest of this article, we'll be using the word "disk" when what we really mean is "storage device". It's shorter!
And let's say that, just for fun, we divide this disk in a series of blocks. Let's make them 4096 bytes (or 4KiB). That way, they might be the same size as kernel memory pages. Or a multiple. At least they're a power of two.

Let's say in our filesystem, each folder is just a list of names: their children. We can traverse the filesystem down just fine:
- / has /etc in its list of children
- /etc has /etc/hosts in its list of children
- /etc/hosts is a regular file
Sounds good right? I mean, we still have to solve a few problems.
First, we need to decide where on the disk our folders and files live. But that's not a terribly hard problem. We can just have a list of paths and a list of offsets, maybe like that:

And when we want to access /etc/hosts, well, we just: 1) go through our table, find the right entry, and 2) read it!

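A minimal Rust sketch of that scheme (all the names here are made up for illustration) could look like this:
// one global table mapping absolute paths to disk offsets
struct Entry {
    path: String, // e.g. "/etc/hosts"
    offset: u64,  // where its data block lives on disk
}

// finding a file means scanning the whole table, entry by entry
fn lookup(table: &[Entry], path: &str) -> Option<u64> {
    table.iter().find(|e| e.path == path).map(|e| e.offset)
}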
This doesn't seem so bad. Why is it problematic?
Problem 1: Looking for a file is O(n), where n is the number of files.
I've just counted the files on one of my Linux systems, and there are 1'128'252 of them. Sounds like looking up a file with a linear search is going to get expensive.
Problem 2: The contents of each file needs to be contiguous.
With our current layout, if a file becomes too big to fit in one block, and the next block is used, then we need to move it:

Problem 3: Metadata is mixed up with data.
Let's say we want to implement the tree command. It prints the name, mode and size of every file in the filesystem, no matter how deeply nested.
With our disk layout, we'd have to read the start of every block that contains a file just to know its metadata. Along with, of course, the contents of every folder (but that's normal).

This is especially problematic because reading metadata (which happens very often) requires a lot of seeking. Seeking is the process of "jumping to a different location" when reading a storage device.
This is expensive for optical disk readers, or hard disk drives. It is extremely expensive for tape drives.

Problem 4: Rename operations are extremely expensive.
If we want to rename /etc to /config, we have a bunch of entries to re-write:

This means reading a bunch of entries to find out what we should even rewrite, then write, in many different locations, all across the disk. Remember seeking being expensive? Forget about it.
It gets worse the more subfolders and files a directory has.
Problem 5: Every file operation needs to parse entire paths.
With our scheme, if we want to go from /etc/hosts to /etc, there's no direct link! We have to parse /etc/hosts into ["etc", "hosts"], remove the last one to get ["etc"], and build the /etc path from that. Then we have to use our lookup table at the very beginning to find where /etc actually is.
All in all, this is a pretty bad disk layout. It would technically work for a toy operating system, but I wouldn't want to use it as a daily driver.
A filesystem's disk layout must be carefully designed.
Operations like enumerating files, traversing downwards or upwards, renaming folders, should be achievable in as few operations as possible, minimizing seeking.
Enter inodes
One of the lessons we learned from our first attempt at a filesystem disk layout is that we should probably separate metadata and data. We should also find another way to refer to files than by their fully qualified names.
Whether our file lives at /etc/hosts or /config/hosts - it's the same file! And our scheme should reflect that.
So, let's start by storing all the metadata next to each other, as early as possible. We need to reserve a big chunk of space, large enough to describe every file on disk.
This will be our index: each element is an "index node", or "inode".

Each of those inodes will contain the usual, but it'll also contain the number of the block we can find its data in:

As for folders - easy! In the data block, we store a list of directory entries. Each entry refers to another inode, and gives it a name.

Let's review. Did we actually solve our problems? Firstly, how do we look up a file in this new scheme?
Let's say we want to open /etc/hosts. First we read the inode for /. For convenience's sake, let's decide that, since it's the filesystem root (topmost) node, it will always be in inode 1:

Then we read its directory entries, look for "etc". Then we read "etc"'s inode. Then we read its directory entries, look for "hosts". Finally, we read "hosts"'s data.
I made a quick diagram so it's easier to follow:

Ok, so it's not easier to follow.
But there is one upside compared to our previous scheme. Instead of doing a linear search among all the files in the filesystem, by absolute paths, we search across much smaller sets of directory entries, multiple times:

The diagram above is not to scale: we just reduced the worst case from 1.1M string comparisons to 228 string comparisons (that's the 24 entries in / plus the 204 entries in /etc).
The new worst case is 0.02% of the old one.
Of course, this depends on how many files there are per directory!
What about rename operations? You might have noticed that in our new scheme, the inodes don't contain path information - there's no name field in there. The names are actually contained in the directory entries:

If we want to rename /etc, we just need to change the directory entry - we don't need to touch the inodes of any of the children, no matter how deep the subdirectory hierarchy goes:
If we want to move /etc/hosts to /hosts, we simply remove the directory entry in /etc's inode, and add one to /'s inode.

Again, it's a constant number of operations, no matter how many files and subdirectories are involved. This is much better.
Right now, traversing downwards is efficient (more than before), but traversing upwards (going from /etc to /, for example) still requires manipulating paths, and we don't want to do that.
Here's an idea: what if we add an entry to each directory, that points to its parent? We could call it "..". And for the root, we'll just make it point to itself:

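To recap the whole toy scheme, here's a small Rust sketch (the types and the read_dir helper are made up for illustration) of how a path lookup walks directory entries, one component at a time:
// a directory entry gives a name to an inode number
struct DirEntry {
    name: String,
    inode: u64,
}

// walking "/etc/hosts": start at the root (we decided it's always
// inode 1), then resolve one path component at a time.
// `read_dir` stands in for "read this inode's directory entries".
fn lookup(read_dir: impl Fn(u64) -> Vec<DirEntry>, path: &str) -> Option<u64> {
    let mut current = 1;
    for component in path.split('/').filter(|c| !c.is_empty()) {
        current = read_dir(current)
            .into_iter()
            .find(|e| e.name == component)?
            .inode;
    }
    Some(current)
}
Note that no path string is ever stored on disk in this scheme - paths only exist transiently, while we walk them.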
Solving the remaining problems
I don't know about you, but I'm pretty happy with our disk layout so far.
However, there's (at least) two things we could be doing better.
Right now, every directory entry is stored in no particular order, in a list. To find a specific entry, we have to do a linear search, i.e. compare against each entry one after the other, and stop only when we've found what we're looking for.
This is only practical if directories never have more than a few thousand files. If we get into the tens of thousands, or hundreds of thousands territory, then things will start to be really slow.
A potential solution is to hash the names of all the entries, and use a Self-balancing binary search tree.
I won't get into the details here, but basically, it allows insertion (adding an entry), deletion (removing an entry), and searching in logarithmic time, which means it gets expensive much more slowly than a linear data structure.
"Data structures" is a rich and complex field, which would warrant many, many articles.
Let's not go there today.
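Still, just to make the idea concrete, here's a toy sketch using Rust's standard BTreeMap (this is not what ext4 actually does - ext4 uses hashed b-trees for large directories - it just illustrates logarithmic lookup; the entries are made up):
use std::collections::BTreeMap;

fn main() {
    // name -> inode number; a BTreeMap keeps entries ordered,
    // so a lookup takes O(log n) comparisons instead of O(n)
    let mut entries: BTreeMap<String, u64> = BTreeMap::new();
    entries.insert("hosts".into(), 1840090);
    entries.insert("passwd".into(), 1234);
    assert_eq!(entries.get("hosts"), Some(&1840090));
}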
The second problem is that our files still have to be contiguous. If they grow, we might have to move them to another set of blocks. We might even find that there is no contiguous set of blocks large enough to hold the file!
To address that, we can use extents.
For each file, we can store a series of (start, length) pairs:

This brings on a new problem: previously, if we wanted to read just the second half of a file, we could simply calculate the address of the block where it starts:

But now, the middle of the file may be in any of the extents. It's not simple arithmetic anymore. To remedy this, we can also use a tree data structure.
If we want to read the file from the beginning, then we can simply walk the tree in order. If we're seeking somewhere else, we'll do a search through the tree until we find the right extent, and then do some simple arithmetic.
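Here's a little sketch of that arithmetic (the struct and field names are made up; a linear scan is used here, though since extents are sorted, a binary search or a tree works the same way, only faster):
struct Extent {
    logical_start: u64,  // first logical block this extent covers
    physical_start: u64, // the matching physical block on disk
    len: u64,            // length in blocks
}

// map a logical block number to a physical one by finding
// the extent that contains it
fn to_physical(extents: &[Extent], lblk: u64) -> Option<u64> {
    let e = extents
        .iter()
        .find(|e| lblk >= e.logical_start && lblk < e.logical_start + e.len)?;
    Some(e.physical_start + (lblk - e.logical_start))
}
So seeking anywhere in the file costs one extent search plus an addition - no moving data around.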
Now for a taste of the real world
Enough playing around! It's time to look at a real live filesystem.
Before we started thinking about disk layouts, we were tracing kernel events while opening and mapping /etc/hosts.
The trace we got contained lines like:
ext4_ext_map_blocks_enter: dev 8,1 ino 1311701 lblk 0 len 3 flags
...
ext4_ext_map_blocks_enter: dev 8,1 ino 1840090 lblk 0 len 1 flags
This makes a little more sense now.
The first line refers to inode "1311701" (we now know what this means). The _ext_ part of the event name probably refers to extents, because there's a "len 3" (a length of 3).
Could it be /etc/hosts?
On Linux, we can pass the -i flag to the ls command, to make it display inode numbers.

Nope! /etc/hosts is 1840090. That's the second line.
What's the first one then?
Well, remember when I said binaries were mapped when they got executed?

Bingo.
So as far as we can tell, the filesystem this Linux system uses also has inodes. And extents!
We can find out what filesystem it is with the df command (specifically, the -T flag):

So, apparently it's ext4. It appears the most up-to-date documentation for ext4 is on kernel.org.
Unfortunately, it seems this documentation wasn't written by the authors of ext4, but rather by someone else reading the code for ext4 in the Linux kernel. Oh well, that's life.
But first: what's this /dev/sda1?

Well, remember when I said the kernel was boss? And that it controlled the reality userland processes see? That goes doubly for files.
Not all things that can be open()'d are files in the traditional sense.
Some are just.. resources.
For example, /proc/cpuinfo "contains" information about the installed CPUs:

It's not actually on any storage device. It's just a way for the kernel to give userland processes some information, through standard system calls like open, read, write and close.
/proc/self/ is a directory that contains information about the currently running process. If we modify our Go program like so:
package main

import (
    "fmt"
    "io/ioutil"
)

func main() {
    payload, _ := ioutil.ReadFile("/proc/self/cmdline")
    fmt.Printf("%s\n", string(payload))
}
...and run it like so, it'll print its arguments:

In fact, in The Unix Philosophy, everything is a file.
This has made a lot of people angry.
So, /dev/sda1 is simply a "file" that contains the entire contents of our root ext4 partition. If we wanted the whole disk we'd open /dev/sda instead. But then we'd have to worry about partition tables, and that's better left for later.
Let's read a whole partition I guess
Now that we're done with the introductory material, let's jump right into it.
We'll be using Rust for this next part.
use std::fs::OpenOptions;
use std::io::Read;
use hex_slice::AsHex;

fn main() -> Result<(), std::io::Error> {
    // open our ext4 partition, READ-ONLY.
    let mut file = OpenOptions::new().read(true).open("/dev/sda1")?;

    // allocate a 128-byte buffer
    let mut buf = vec![0u8; 128];

    // read the first 128 bytes of the file
    file.read_exact(&mut buf)?;

    // print it as hexadecimal
    println!("{:x}", buf.as_hex());
    Ok(())
}
This builds happily with cargo build, but it won't run:

See, we're currently running this program as a normal user. And if any user could have access to any partition willy-nilly, then they could bypass file permissions.
And that would be bad.
So let's pretend we're root for a hot minute and:

Ah. It's full of zeroes. We might need that documentation after all.
Here's what it has to say:
For the special case of block group 0, the first 1024 bytes are unused, to allow for the installation of x86 boot sectors and other oddities. The superblock will start at offset 1024 bytes, whichever block that happens to be (usually 0).
Okay then, let's read starting from byte 1024. We'll use the positioned-io crate for that.
// instead of std::io::Read
use positioned_io::ReadAt;

fn main() -> Result<(), std::io::Error> {
    // (cut)

    // read 128 bytes of the file, this time starting at offset 1024
    file.read_exact_at(1024, &mut buf)?;

    // (cut)
}

Now we're cooking! The docs mention a "superblock", and if we read up on its structure, it says that we should find the magic number 0xEF53 at offset 0x38. It also says it's a little-endian 16-bit integer.
Well, there's a crate for that.
Since we're going to be reading a lot of stuff, let's make a helper struct. We'll need a few more use directives:
// this will let us read integers of various width
use byteorder::{LittleEndian, ReadBytesExt};
// a cursor keeps track of our position within any
// resource that implements ReadAt.
use positioned_io::{ReadAt, Cursor};
// 'failure' gives us backtraces when we return
// errors.
use failure::Fallible;

type Result<T> = std::result::Result<T, failure::Error>;

// we'll be able to read from any type that
// implements ReadAt.
struct Reader<IO: ReadAt> {
    inner: IO,
}

impl<IO: ReadAt> Reader<IO> {
    fn new(inner: IO) -> Self {
        Self { inner }
    }

    fn u16(&self, offset: u64) -> Fallible<u16> {
        let mut cursor = Cursor::new_pos(&self.inner, offset);
        Ok(cursor.read_u16::<LittleEndian>()?)
    }
}
Now, any time we want to read something from the partition, we'll first make a Slice, pass it to a Reader, and use it.
Note that none of the File, Slice, or Reader need to be mutable, because none of them maintain position information.
They can even be used concurrently!
use positioned_io::Slice;

// using our new result type:
fn main() -> Result<()> {
    let file = OpenOptions::new().read(true).open("/dev/sda1")?;

    // create a slice that corresponds to the superblock
    let r = Reader::new(Slice::new(file, 1024, None));

    // as per the docs:
    let magic = r.u16(0x38)?;
    println!("magic = {:x}", magic);
    Ok(())
}

Hey, that's the value the documentation gave us! We're on the right track.
Block groups
In our toy disk layout, we divided the disk in blocks. ext4 does just the same thing! At offset 0x18 of the superblock, we find the size of a block. Well, we find n, where the block size is 2 ^ (10 + n).
(Note: for this part, we need to add a u32 method to our Reader type. It's modelled exactly after u16, so I won't add it here).
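If you're following along and want it spelled out anyway, it would look something like this - same pattern as u16, just a wider integer:
// in impl Reader - modelled exactly after u16:
fn u32(&self, offset: u64) -> Fallible<u32> {
    let mut cursor = Cursor::new_pos(&self.inner, offset);
    Ok(cursor.read_u32::<LittleEndian>()?)
}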

What luck! The blocks are 4KiB, just like in our toy disk layout.
However, if we look at the docs, we'll notice the blocks are grouped (into.. block groups).

And those are described in... group descriptors (GDT stands for "group descriptor table"):

The number of blocks per group is stored in the superblock, at offset 0x20:
// in main
let bpg = r.u32(0x20)?;
println!("blocks per group = {}", bpg);

Finally, the number of inodes per group is stored at offset 0x28:
// in main
let ipg = r.u32(0x28)?;
println!("inodes per group = {}", ipg);

The good news is that all those values look legit. The bad news is that we're nowhere near reading an inode.
Reading an inode in ext4
Let's recap. An ext4 partition starts with 1024 bytes of padding, then a superblock, which contains various settings like the size of a block, the number of blocks per group, the number of inodes per group, etc.
Then come the block group descriptors. Those contain the offset of a few things, but the one we're interested in is the inode table:

Let's start at the beginning - find the inode for /.
There's a few special inodes in ext4:

The documentation seems about as confident as we are. Those question marks there are.. just great.
Anyway, the inode for the root, /, is always 2. How do we find it?
Again, the docs help:

And that makes sense!
Since we're using a good programming language, let's start building out a few abstractions.
First, the superblock:
// this is like derive(Debug), but better.
// see https://crates.io/crates/custom_debug_derive
use custom_debug_derive::{Debug as CustomDebug};

#[derive(CustomDebug)]
struct Superblock {
    #[debug(format = "{:x}")]
    magic: u16,
    block_size: u64,
    blocks_per_group: u64,
    inodes_per_group: u64,
    inode_size: u64,
}

impl Superblock {
    fn new(dev: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(Slice::new(dev, 1024, None));
        // note: we're casting a few fields to `u64` now.
        // this will save us a bunch of grief later.
        Ok(Self {
            magic: r.u16(0x38)?,
            block_size: (2u32.pow(10 + r.u32(0x18)?)) as u64,
            blocks_per_group: r.u32(0x20)? as u64,
            inodes_per_group: r.u32(0x28)? as u64,
            inode_size: r.u16(0x58)? as u64,
        })
    }
}
Let's take it for a spin:
fn main() -> Result<()> {
    // open our ext4 partition, READ-ONLY.
    let file = OpenOptions::new().read(true).open("/dev/sda1")?;
    let sb = Superblock::new(&file)?;
    println!("{:#?}", sb);
    Ok(())
}

Seems fine!
Now let's make a type for block group descriptors. We're going to use it as a container for a constant: the size of a block group descriptor:
struct BlockGroupDescriptor {}

impl BlockGroupDescriptor {
    const SIZE: u64 = 64;
}
Next, let's make a type for block group numbers. Sure, it's just a number - but we'll add a method on it to get a slice of the descriptor.
A tuple struct with a single field is called a newtype.
#[derive(Debug, Clone, Copy)]
struct BlockGroupNumber(u64);

impl BlockGroupNumber {
    fn desc_slice<T>(self, sb: &Superblock, dev: T) -> Slice<T>
    where
        T: ReadAt,
    {
        assert!(sb.block_size != 1024, "1024 block size not supported");
        // the superblock takes up 1 block
        let gdt_start = 1 * sb.block_size;
        let offset = gdt_start + self.0 * BlockGroupDescriptor::SIZE;
        Slice::new(dev, offset, None)
    }
}
Next up, let's make a type for inode numbers! We'll add a method that tells us in which block group a particular inode is.
#[derive(Debug, Clone, Copy)]
struct InodeNumber(u64);

impl InodeNumber {
    fn blockgroup_number(self, sb: &Superblock) -> BlockGroupNumber {
        let n = (self.0 - 1) / sb.inodes_per_group;
        BlockGroupNumber(n)
    }
}
Let's try this all out. We're looking for inode 2:
fn main() -> Result<()> {
    // open our ext4 partition, READ-ONLY.
    let file = OpenOptions::new().read(true).open("/dev/sda1")?;
    let sb = Superblock::new(&file)?;
    println!("{:#?}", sb);

    let root_bg = InodeNumber(2).blockgroup_number(&sb);
    dbg!(&root_bg);

    let mut buf = vec![0u8; 64];
    root_bg.desc_slice(&sb, &file).read_at(0, &mut buf)?;
    println!("{:x}", buf.as_hex());
    Ok(())
}

As expected, inode 2 is in block group 0, and the descriptor slice, well, is not all zeroes, so that's a start.
The only thing we care about in the block descriptor is the offset of the inode table. Like most other locations, it's a block number.
This one's a bit annoying, because the lower 32-bits are stored at 0x8, and the upper 32-bits are stored at 0x28.
We could do the bit twiddling directly in BlockGroupDescriptor, but let's add a convenience method to Reader instead:
// in impl Reader
fn u64_lohi(&self, lo: u64, hi: u64) -> Fallible<u64> {
    Ok(self.u32(lo)? as u64 + ((self.u32(hi)? as u64) << 32))
}
And let's fill out BlockGroupDescriptor:
#[derive(Debug)]
struct BlockGroupDescriptor {
    inode_table: u64,
}

impl BlockGroupDescriptor {
    const SIZE: u64 = 64;

    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        Ok(Self {
            inode_table: r.u64_lohi(0x8, 0x28)?,
        })
    }
}
Now, we can add a method to BlockGroupNumber to directly grab the block group descriptor:
// in impl BlockGroupNumber
fn desc(self, sb: &Superblock, dev: &dyn ReadAt) -> Result<BlockGroupDescriptor> {
    let slice = self.desc_slice(sb, dev);
    BlockGroupDescriptor::new(&slice)
}
Again, let's try it:

Hey, 1070 looks like a pretty reasonable number for the inode table of the first block group.
✨ We're making progress ✨
Now, the inode table is a series of fixed-size inodes.
The first field is a 16-bit (little-endian, still) integer that contains the file mode. A bit later on, we find the inode's size, again divided between lower 32 bits (at 0x4) and upper 32 bits (at 0x6C).
At 0x28, we find the "block map or extent tree". It's a block of 60 bytes. I have a feeling we're going to need that as well!
Let's add a convenience method to Reader so we can grab that block easily:
// in impl Reader
fn vec(&self, offset: u64, len: usize) -> Fallible<Vec<u8>> {
    let mut v = vec![0u8; len];
    self.inner.read_exact_at(offset, &mut v)?;
    Ok(v)
}
Next, let's add a method to grab the slice for a specific inode number.
// in impl InodeNumber
fn inode_slice<T>(self, sb: &Superblock, dev: T) -> Result<Slice<T>>
where
    T: ReadAt,
{
    let desc = self.blockgroup_number(sb).desc(sb, &dev)?;
    let table_off = desc.inode_table * sb.block_size;
    let idx_in_table = (self.0 - 1) % sb.inodes_per_group;
    let inode_off = table_off + sb.inode_size * idx_in_table;
    Ok(Slice::new(dev, inode_off, Some(sb.inode_size)))
}
Now, let's make an Inode type, with all the data we wanted: mode, size, and the 60-byte block. We'll use Reader::vec, which we added recently:
#[derive(CustomDebug)]
struct Inode {
    #[debug(format = "{:o}")]
    mode: u16,
    size: u64,
    #[debug(skip)]
    block: Vec<u8>,
}

impl Inode {
    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        Ok(Self {
            mode: r.u16(0x0)?,
            size: r.u64_lohi(0x4, 0x6C)?,
            block: r.vec(0x28, 60)?,
        })
    }
}
And finally, let's add a method to InodeNumber that directly reads the inode:
// in impl InodeNumber
fn inode(self, sb: &Superblock, dev: &dyn ReadAt) -> Result<Inode> {
    let slice = self.inode_slice(sb, dev)?;
    Inode::new(&slice)
}
We're ready. Let's try it:
fn main() -> Result<()> {
    // open our ext4 partition, READ-ONLY.
    let file = OpenOptions::new().read(true).open("/dev/sda1")?;
    let sb = Superblock::new(&file)?;
    println!("{:#?}", sb);

    let root_inode = InodeNumber(2).inode(&sb, &file)?;
    println!("{:#?}", root_inode);
    Ok(())
}

Hurray! The values match what the stat command reports. I'm sure there's lots of directories with the same permissions and size, so we could be reading the wrong inode, but at least we're probably not reading, say, a random data block instead.
Let's quickly add a Filetype enum so we can make sure it's a directory. The file type is contained in the mode:

We'll use the num_enum crate here:
use num_enum::*;
use std::convert::TryFrom;

#[derive(Debug, TryFromPrimitive)]
#[repr(u16)]
enum Filetype {
    Fifo = 0x1000,
    CharacterDevice = 0x2000,
    Directory = 0x4000,
    BlockDevice = 0x6000,
    Regular = 0x8000,
    SymbolicLink = 0xA000,
    Socket = 0xC000,
}
And add a getter to Inode:
// in impl Inode
fn filetype(&self) -> Filetype {
    Filetype::try_from(self.mode & 0xF000).unwrap()
}
To keep the example code concise, we use unwrap() here. If we somehow read an inode from the wrong slice, or if the inode has an invalid mode, our program will panic. This is not desirable in a real filesystem implementation, but here it works as a nice sanity check.
Let's give it a shot:
// in main
let root_inode = InodeNumber(2).inode(&sb, &file)?;
println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

Everything looks fine!
Now that we're sure we're dealing with a directory, we can start reading its entries.
But first - remember extents? Those are in the block we just read!
The kernel documentation is here to help.
#[derive(Debug)]
struct ExtentHeader {
    entries: u64,
    depth: u64,
}

impl ExtentHeader {
    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        let magic = r.u16(0x0)?;
        assert_eq!(magic, 0xF30A);

        Ok(Self {
            entries: r.u16(0x2)? as u64,
            depth: r.u16(0x6)? as u64,
        })
    }
}
Let's read one.
// in main
let ext_header = ExtentHeader::new(&Slice::new(&root_inode.block, 0, Some(12)))?;
println!("{:#?}", ext_header);
Well, it's not much but - again, the values make sense. The header is followed by a single entry, so we know the data for / is stored in a single range of contiguous blocks. The header has a depth of 0, which means it's a leaf node.
In our case, we'll assume that /, /etc, and /etc/hosts all only have one extent each.
This makes sense, since there's only 24 children for /, only 204 children for /etc, and /etc/hosts is shorter (511 bytes) than a block (4096 bytes).
However, we'll verify our assumptions with assert!.
#[derive(Debug)]
struct Extent {
    len: u64,
    start: u64,
}

impl Extent {
    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        Ok(Self {
            len: r.u16(0x4)? as u64,
            // the block number the extent points to is split
            // between upper 16-bits and lower 32-bits.
            start: ((r.u16(0x6)? as u64) << 32) + r.u32(0x8)? as u64,
        })
    }
}
Our complete extent reading code becomes:
// in main, after finding root_inode
let ext_header = ExtentHeader::new(&Slice::new(&root_inode.block, 0, Some(12)))?;
println!("{:#?}", ext_header);
assert_eq!(ext_header.depth, 0);
assert_eq!(ext_header.entries, 1);

let ext = Extent::new(&Slice::new(&root_inode.block, 12, Some(12)))?;
println!("{:#?}", ext);

Hey, those are still reasonable values!
According to this, we can find the data for / at block 9262.
Let's take a moment to add a convenience method to the Inode type:
fn data<T>(&self, sb: &Superblock, dev: T) -> Result<Slice<T>>
where
    T: ReadAt,
{
    let ext_header = ExtentHeader::new(&Slice::new(&self.block, 0, Some(12)))?;
    assert_eq!(ext_header.depth, 0);
    assert_eq!(ext_header.entries, 1);

    let ext = Extent::new(&Slice::new(&self.block, 12, Some(12)))?;
    assert_eq!(ext.len, 1);

    let offset = ext.start * sb.block_size;
    let len = ext.len * sb.block_size;
    Ok(Slice::new(dev, offset, Some(len)))
}
And then dump the first 128 bytes of /'s data, as if it were a string:
// in main
let root_inode = InodeNumber(2).inode(&sb, &file)?;
println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

let root_data = root_inode.data(&sb, &file)?;
let data_start = Reader::new(&root_data).vec(0, 128)?;
println!("{}", String::from_utf8_lossy(&data_start));

Now, I don't know about you, but this looks very exciting to me.
We can see some of the top-level directory names in there: proc, dev, bin - and even . and ..! (the current directory and parent directory, respectively).
Let's read them proper: the directory entry format is actually pretty straightforward.
#[derive(CustomDebug)]
struct DirectoryEntry {
    #[debug(skip)]
    len: u64,
    inode: InodeNumber,
    name: String,
}

impl DirectoryEntry {
    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        let name_len = r.u8(0x6)? as usize;
        Ok(Self {
            inode: InodeNumber(r.u32(0x0)? as u64),
            len: r.u16(0x4)? as u64,
            name: String::from_utf8_lossy(&r.vec(0x8, name_len)?).into(),
        })
    }
}
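Note that this uses a u8 method on Reader that we haven't shown. It would follow the same pattern as u16 and u32 - something like this (byteorder's read_u8 takes no endianness parameter, since it's a single byte):
// in impl Reader
fn u8(&self, offset: u64) -> Fallible<u8> {
    let mut cursor = Cursor::new_pos(&self.inner, offset);
    Ok(cursor.read_u8()?)
}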
We'll simply add a method to Inode that returns a vec of DirectoryEntry. This won't fit all use cases, but it'll fit ours!
// in impl Inode
fn dir_entries(&self, sb: &Superblock, dev: &dyn ReadAt) -> Result<Vec<DirectoryEntry>> {
    let data = self.data(sb, dev)?;
    let mut entries = Vec::new();

    let mut offset: u64 = 0;
    loop {
        let entry = DirectoryEntry::new(&Slice::new(&data, offset, None))?;
        if entry.inode.0 == 0 {
            break;
        }
        offset += entry.len;
        entries.push(entry);
    }
    Ok(entries)
}
Now for the moment of truth:
// in main:
let root_inode = InodeNumber(2).inode(&sb, &file)?;
println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

let root_entries = root_inode.dir_entries(&sb, &file)?;
println!("{:#?}", root_entries);

🎉🎉🎉
Everything appears to be in order.
Let's add a convenience method to find a particular child:
// in impl Inode
fn child(&self, name: &str, sb: &Superblock, dev: &dyn ReadAt) -> Result<Option<InodeNumber>> {
    let entries = self.dir_entries(sb, dev)?;
    Ok(entries
        .into_iter()
        .filter(|x| x.name == name)
        .map(|x| x.inode)
        .next())
}
Why is the return type Result<Option<T>>?
Well, reading from the partition might fail - and that would return Err(e).
Reading from the partition might succeed, but there might be no child with this name. That would return Ok(None).
Finally, reading might succeed, and we might find such a child, and that would return Ok(Some(n)).
And let's use it:
// in main
let root_inode = InodeNumber(2).inode(&sb, &file)?;
println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

let etc_inode = root_inode.child("etc", &sb, &file)?;
println!("{:#?}", etc_inode);

Everything checks out.
Let's go one step further and read that inode too:
// in main
let root_inode = InodeNumber(2).inode(&sb, &file)?;
println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

let etc_inode = root_inode
    .child("etc", &sb, &file)?
    .expect("/etc should exist")
    .inode(&sb, &file)?;
println!("({:?}) {:#?}", etc_inode.filetype(), etc_inode);

Still good. Can we find hosts in there?
// in main
let hosts_inode = etc_inode
    .child("hosts", &sb, &file)?
    .expect("/etc/hosts should exist")
    .inode(&sb, &file)?;
println!("({:?}) {:#?}", hosts_inode.filetype(), hosts_inode);

Excitement intensifies
And can we read it as a string??
// in main
let hosts_data = hosts_inode.data(&sb, &file)?;
let hosts_data = Reader::new(&hosts_data).vec(0, hosts_inode.size as usize)?;
let hosts_data = String::from_utf8_lossy(&hosts_data);

println!("---------------------------------------------");
println!("{}", hosts_data);

You can find the complete code listing on GitHub Gist. It clocks in at exactly 300 lines of Rust! (Counting empty lines and punctuation lines).
This article is part 3 of the Reading files the hard way series.
If you liked what you saw, please support my work!