Reading files the hard way - Part 3 (ftrace, disk layouts, ext4)
This article is part of the Reading files the hard way series.
Contents
- In the belly of the beast
- Where did our paths go?
- You can make a filesystem out of that (or can you?)
- Enter inodes
- Solving the remaining problems
- Now for a taste of the real world
- Let's read a whole partition I guess
- Block groups
- Reading an inode in ext4
So far, we've seen many ways to read a file from different programming languages, and we've learned about syscalls and how to make them from assembly. Then we learned about memory mapping, virtual address spaces, and generally some of the mechanisms through which userland and the kernel interact.
But in our exploration, we've always considered the kernel more or less like a "black box". It's time to change that.
In the belly of the beast
In part 2, we've used gdb to step through our userland program, readfile.
This allowed us to inspect the state of CPU registers after each instruction.
But we weren't able to see what was going on in the kernel.
We've also used strace to trace system calls. It was less noisy than gdb, because it only displayed information when our userland program was communicating with the kernel.
Now it's time to step it up and use ftrace. It's a tracing facility integrated into the Linux kernel. Thanks to it, we can enable tracing of various kernel functions and events.
There's an excellent Introduction to ftrace over on Julia Evans's blog.
Speaking of Julia, check out her fantastic zines!
The easiest way for us to get started is to use trace-cmd. On Manjaro/ArchLinux, there's an AUR package for that, so we can just grab it:
yay -S trace-cmd
The plan is to:
- Enable tracing for all ext4 events (more on this later)
- Run our readfile program (the assembly one)
- Look at the report and try to piece together what's going on
My first trace wasn't successful - it only contained events related to zsh (my shell), and git (called by my shell's prompt to show indicators).
It turns out there are several levels of caching, and /etc/hosts was already cached to the maximum, so reading it didn't even register.
Luckily, we can clear all these caches (while the system is running!) by running the following as root:
sync
echo 3 > /proc/sys/vm/drop_caches
Then, we can start tracing (again, as root):
trace-cmd record -e ext4
Then launch our program, and ask trace-cmd to format the collected data in a human-readable format:
trace-cmd report trace.dat > trace.txt
And there we have it:
readfile-30843 [001] 1557.395867: ext4_es_lookup_extent_enter: dev 8,1 ino 1311701 lblk 0
readfile-30843 [001] 1557.395869: ext4_es_lookup_extent_exit: dev 8,1 ino 1311701 found 0 [0/0) 0
readfile-30843 [001] 1557.395872: ext4_ext_map_blocks_enter: dev 8,1 ino 1311701 lblk 0 len 3 flags
readfile-30843 [001] 1557.395875: ext4_ext_show_extent: dev 8,1 ino 1311701 lblk 0 pblk 5339898 len 3
readfile-30843 [001] 1557.395877: ext4_ext_map_blocks_exit: dev 8,1 ino 1311701 flags lblk 0 pblk 5339898 len 3 mflags 0x20 ret 3
readfile-30843 [001] 1557.395878: ext4_es_insert_extent: dev 8,1 ino 1311701 es [0/3) mapped 5339898 status W
readfile-30843 [001] 1557.403239: ext4_es_lookup_extent_enter: dev 8,1 ino 1840090 lblk 0
readfile-30843 [001] 1557.403243: ext4_es_lookup_extent_exit: dev 8,1 ino 1840090 found 0 [0/0) 0
readfile-30843 [001] 1557.403245: ext4_ext_map_blocks_enter: dev 8,1 ino 1840090 lblk 0 len 1 flags
readfile-30843 [001] 1557.403248: ext4_ext_show_extent: dev 8,1 ino 1840090 lblk 0 pblk 26724819 len 1
readfile-30843 [001] 1557.403251: ext4_ext_map_blocks_exit: dev 8,1 ino 1840090 flags lblk 0 pblk 26724819 len 1 mflags 0x20 ret 1
readfile-30843 [001] 1557.403252: ext4_es_insert_extent: dev 8,1 ino 1840090 es [0/1) mapped 26724819 status W
The columns appear to be formatted like so:
process_name-pid [cpu_core] timestamp: event_name: parameters
I was unable to find precise documentation on each of the events, but we can spot that two main things are being done here:
- Lookup and map "inode 1311701"
- Lookup and map "inode 1840090"
And that's it.
ftrace allows tracing behavior inside kernel code. We can't change it, but we can get a better idea of what's going on.
Where did our paths go?
Whereas strace showed us /etc/hosts, the path of the file we're mapping, ftrace doesn't. Instead it shows us inode numbers. To understand why, we must think about the ergonomics of a file system.
✨ Let's talk about symbolic links ✨
Symbolic links (symlinks for short) are files, but they're not regular files. They point to another file, and whenever they're opened, the other file should be opened instead. There's.. reasons to use symlinks, which I won't get into, just trust me on that one.
Point is, there's a bunch of symlinks in almost any Linux system, for example in /usr/lib:

Here, libcairo.so and libcairo.so.2 both point to libcairo.so.2.11703.0.
If we run a command like wc (which counts words, but also bytes), we see that it's accessing the file being pointed to in both cases:

However, they're not copies. They're just.. symbolic links. As for implementation details, we notice two things: the size of those symlink files is exactly the size of the path they point to, and they have a slightly different mode:

So symbolic links are files, with a special mode flag, whose contents is the path they point to. This means they can be dangling!
A dangling symlink is one that points to a non-existent file.
See the example below:

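If you want to poke at this yourself, here's a quick Rust sketch using only the standard library (the paths are made up). It creates a dangling symlink, reads its target, and shows that the link's own size is exactly the length of that target path:
use std::fs;
use std::os::unix::fs::symlink;

fn main() -> std::io::Result<()> {
    // create a symlink pointing to a file that doesn't exist
    symlink("/does/not/exist", "/tmp/dangling")?;

    // read_link reads the symlink itself: its contents is just a path
    let target = fs::read_link("/tmp/dangling")?;
    println!("points to: {}", target.display());

    // symlink_metadata inspects the link, not its target - note that
    // len() is 15, the length of "/does/not/exist"
    let meta = fs::symlink_metadata("/tmp/dangling")?;
    println!("len = {}", meta.len());

    // but actually opening it fails, since the target doesn't exist
    assert!(fs::File::open("/tmp/dangling").is_err());

    fs::remove_file("/tmp/dangling")?;
    Ok(())
}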
You can make a filesystem out of that (or can you?)
Let's talk about folders. As we've seen, symlinks are just files, with a special mode, and their contents is a path. Couldn't we do the same for folders?
Let's make up a filesystem. What's a filesystem? Just a way to organize files and folders, their contents, and their metadata on disk. You can think of a disk as a big blank slate, like so:

Throughout the rest of this article, we'll be using the word "disk" when what we really mean is "storage device". It's shorter!
And let's say that, just for fun, we divide this disk in a series of blocks. Let's make them 4096 bytes (or 4KiB). That way, they might be the same size as kernel memory pages. Or a multiple. At least they're a power of two.

Let's say in our filesystem, each folder is just a list of names: their children. We can traverse the filesystem down just fine:
- / has /etc in its list of children
- /etc has /etc/hosts in its list of children
- /etc/hosts is a regular file
Sounds good right? I mean, we still have to solve a few problems.
First, we need to decide where on the disk our folders and files live. But that's not a terribly hard problem. We can just have a list of paths and a list of offsets, maybe like that:

And when we want to access /etc/hosts, well, we just: 1) go through our table, find the right entry, and 2) read it!

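A minimal Rust sketch of that scheme (all the names here are made up for illustration) could look like this:
// one global table mapping absolute paths to disk offsets
struct Entry {
    path: String, // e.g. "/etc/hosts"
    offset: u64,  // where its data block lives on disk
}

// finding a file means scanning the whole table, entry by entry
fn lookup(table: &[Entry], path: &str) -> Option<u64> {
    table.iter().find(|e| e.path == path).map(|e| e.offset)
}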
This doesn't seem so bad. Why is it problematic?
Problem 1: Looking for a file is O(n), where n is the number of files.
I've just counted the files on one of my Linux systems, and there are 1'128'252 of them. Sounds like looking up a file with a linear search is going to get expensive.
Problem 2: The contents of each file needs to be contiguous.
With our current layout, if a file becomes too big to fit in one block, and the next block is used, then we need to move it:

Problem 3: Metadata is mixed up with data.
Let's say we want to implement the tree command. It prints the name, mode and size of every file in the filesystem, no matter how deeply nested.
With our disk layout, we'd have to read the start of every block that contains a file just to know its metadata. Along with, of course, the contents of every folder (but that's normal).

This is especially problematic because reading metadata (which happens very often) requires a lot of seeking. Seeking is the process of "jumping to a different location" when reading a storage device.
This is expensive for optical disk readers, or hard disk drives. It is extremely expensive for tape drives.

Problem 4: Rename operations are extremely expensive.
If we want to rename /etc to /config, we have a bunch of entries to re-write:

This means reading a bunch of entries to find out what we should even rewrite, then write, in many different locations, all across the disk. Remember seeking being expensive? Forget about it.
It gets worse the more subfolders and files a directory has.
Problem 5: Every file operation needs to parse entire paths.
With our scheme, if we want to go from /etc/hosts to /etc, there's no direct link! We have to parse /etc/hosts into ["etc", "hosts"], remove the last one to get ["etc"], and build the /etc path from that. Then we have to use our lookup table at the very beginning to find where /etc actually is.
All in all, this is a pretty bad disk layout. It would technically work for a toy operating system, but I wouldn't want to use it as a daily driver.
A filesystem's disk layout must be carefully designed.
Operations like enumerating files, traversing downwards or upwards, renaming folders, should be achievable in as few operations as possible, minimizing seeking.
Enter inodes
One of the lessons we learned from our first attempt at a filesystem disk layout is that we should probably separate metadata and data. We should also find another way to refer to files than by their fully qualified names.
Whether our file lives at /etc/hosts or /config/hosts - it's the same file! And our scheme should reflect that.
So, let's start by storing all the metadata next to each other, as early as possible. We need to reserve a big chunk of space, large enough to describe every file on disk.
This will be our index: each element is an "index node", or "inode".

Each of those inodes will contain the usual, but it'll also contain the number of the block we can find its data in:

As for folders - easy! In the data block, we store a list of directory entries. Each entry refers to another inode, and gives it a name.

Let's review. Did we actually solve our problems? Firstly, how do we look up a file in this new scheme?
Let's say we want to open /etc/hosts. First we read the inode for /. For convenience's sake, let's decide that, since it's the filesystem root (topmost) node, it will always be in inode 1:

Then we read its directory entries, look for "etc". Then we read "etc"'s inode. Then we read its directory entries, look for "hosts". Finally, we read "hosts"'s data.
I made a quick diagram so it's easier to follow:

Ok, so it's not easier to follow.
But there is one upside compared to our previous scheme. Instead of doing a linear search among all the files in the filesystem, by absolute paths, we search across much smaller sets of directory entries, multiple times:

The diagram above is not to scale: we just reduced the worst case from 1.1M string comparisons to 228 string comparisons (that's the 24 entries in / plus the 204 entries in /etc).
The new worst case is 0.02% of the old one.
Of course, this depends on how many files there are per directory!
What about rename operations? You might have noticed that in our new scheme, the inodes don't contain path information - there's no name field in there. The names are actually contained in the directory entries:

If we want to rename /etc, we just need to change the directory entry - we don't need to touch the inodes of any of the children, no matter how deep the subdirectory hierarchy goes:
If we want to move /etc/hosts to /hosts, we simply remove the directory entry in /etc's inode, and add one to /'s inode.

Again, it's a constant number of operations, no matter how many files and subdirectories are involved. This is much better.
Right now, traversing downwards is efficient (more than before), but traversing upwards (going from /etc to /, for example) still requires manipulating paths, and we don't want to do that.
Here's an idea: what if we add an entry to each directory, that points to its parent? We could call it "..". And for the root, we'll just make it point to itself:

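To recap the whole toy scheme, here's a small Rust sketch (the types and the read_dir helper are made up for illustration) of how a path lookup walks directory entries, one component at a time:
// a directory entry gives a name to an inode number
struct DirEntry {
    name: String,
    inode: u64,
}

// walking "/etc/hosts": start at the root (we decided it's always
// inode 1), then resolve one path component at a time.
// `read_dir` stands in for "read this inode's directory entries".
fn lookup(read_dir: impl Fn(u64) -> Vec<DirEntry>, path: &str) -> Option<u64> {
    let mut current = 1;
    for component in path.split('/').filter(|c| !c.is_empty()) {
        current = read_dir(current)
            .into_iter()
            .find(|e| e.name == component)?
            .inode;
    }
    Some(current)
}
Note that no path string is ever stored on disk in this scheme - paths only exist transiently, while we walk them.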
Solving the remaining problems
I don't know about you, but I'm pretty happy with our disk layout so far.
However, there's (at least) two things we could be doing better.
Right now, every directory entry is stored in no particular order, in a list. To find a specific entry, we have to do a linear search, i.e. compare against each entry one after the other, and stop only when we've found what we're looking for.
This is only practical if directories never have more than a few thousand files. If we get into the tens of thousands, or hundreds of thousands territory, then things will start to be really slow.
A potential solution is to hash the names of all the entries, and use a Self-balancing binary search tree.
I won't get into the details here, but basically, it allows insertion (adding an entry), deletion (removing an entry), and searching in logarithmic time, which means it gets expensive much more slowly than a linear data structure.
"Data structures" is a rich and complex field, which would warrant many, many articles.
Let's not go there today.
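Still, just to make the idea concrete, here's a toy sketch using Rust's standard BTreeMap (this is not what ext4 actually does - ext4 uses hashed b-trees for large directories - it just illustrates logarithmic lookup; the entries are made up):
use std::collections::BTreeMap;

fn main() {
    // name -> inode number; a BTreeMap keeps entries ordered,
    // so a lookup takes O(log n) comparisons instead of O(n)
    let mut entries: BTreeMap<String, u64> = BTreeMap::new();
    entries.insert("hosts".into(), 1840090);
    entries.insert("passwd".into(), 1234);
    assert_eq!(entries.get("hosts"), Some(&1840090));
}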
The second problem is that our files still have to be contiguous. If they grow, we might have to move them to another set of blocks. We might even find that there is no contiguous set of blocks large enough to hold the file!
To address that, we can use extents.
For each file, we can store a series of (start, length) pairs:

This brings on a new problem: previously, if we wanted to read just the second half of a file, we could simply calculate the address of the block where it starts:

But now, the middle of the file may be in any of the extents. It's not simple arithmetic anymore. To remedy this, we can also use a tree data structure.
If we want to read the file from the beginning, then we can simply walk the tree in order. If we're seeking somewhere else, we'll do a search through the tree until we find the right extent, and then do some simple arithmetic.
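Here's a little sketch of that arithmetic (the struct and field names are made up; a linear scan is used here, though since extents are sorted, a binary search or a tree works the same way, only faster):
struct Extent {
    logical_start: u64,  // first logical block this extent covers
    physical_start: u64, // the matching physical block on disk
    len: u64,            // length in blocks
}

// map a logical block number to a physical one by finding
// the extent that contains it
fn to_physical(extents: &[Extent], lblk: u64) -> Option<u64> {
    let e = extents
        .iter()
        .find(|e| lblk >= e.logical_start && lblk < e.logical_start + e.len)?;
    Some(e.physical_start + (lblk - e.logical_start))
}
So seeking anywhere in the file costs one extent search plus an addition - no moving data around.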
Now for a taste of the real world
Enough playing around! It's time to look at a real live filesystem.
Before we started thinking about disk layouts, we were tracing kernel events while opening and mapping /etc/hosts.
The trace we got contained lines like:
ext4_ext_map_blocks_enter: dev 8,1 ino 1311701 lblk 0 len 3 flags
...
ext4_ext_map_blocks_enter: dev 8,1 ino 1840090 lblk 0 len 1 flags
This makes a little more sense now.
The first line refers to inode "1311701" (we now know what this means). The _ext_ part of the event name probably refers to extents, because there's a "len 3" (a length of 3).
Could it be /etc/hosts?
On Linux, we can pass the -i flag to the ls command, to make it display inode numbers.

Nope! /etc/hosts is 1840090. That's the second line.
What's the first one then?
Well, remember when I said binaries were mapped when they got executed?

Bingo.
So as far as we can tell, the filesystem this Linux system uses also has inodes. And extents!
We can find out what filesystem it is with the df command (specifically, the -T flag):

So, apparently it's ext4. It appears the most up-to-date documentation for ext4 is on kernel.org.
Unfortunately, it seems this documentation wasn't written by the authors of ext4, but rather by someone else reading the code for ext4 in the Linux kernel. Oh well, that's life.
But first: what's this /dev/sda1?

Well, remember when I said the kernel was boss? And that it controlled the reality userland processes see? That goes doubly for files.
Not all things that can be open()'d are files in the traditional sense.
Some are just.. resources.
For example, /proc/cpuinfo "contains" information about the installed CPUs:

It's not actually on any storage device. It's just a way for the kernel to give userland processes some information, through standard system calls like open, read, write and close.
/proc/self/ is a directory that contains information about the currently running process. If we modify our Go program like so:
package main

import (
    "fmt"
    "io/ioutil"
)

func main() {
    payload, _ := ioutil.ReadFile("/proc/self/cmdline")
    fmt.Printf("%s\n", string(payload))
}
...and run it like so, it'll print its arguments:

In fact, in The Unix Philosophy, everything is a file.
This has made a lot of people angry.
So, /dev/sda1 is simply a "file" that contains the entire contents of our root ext4 partition. If we wanted the whole disk we'd open /dev/sda instead. But then we'd have to worry about partition tables, and that's better left for later.
Let's read a whole partition I guess
Now that we're done with the introductory material, let's jump right into it.
We'll be using Rust for this next part.
use std::fs::OpenOptions;
use std::io::Read;
use hex_slice::AsHex;

fn main() -> Result<(), std::io::Error> {
    // open our ext4 partition, READ-ONLY.
    let mut file = OpenOptions::new().read(true).open("/dev/sda1")?;

    // allocate a 128-byte buffer
    let mut buf = vec![0u8; 128];

    // read the first 128 bytes of the file
    file.read_exact(&mut buf)?;

    // print it as hexadecimal
    println!("{:x}", buf.as_hex());
    Ok(())
}
This builds happily with cargo build, but it won't run:

See, we're currently running this program as a normal user. And if any user could have access to any partition willy-nilly, then they could bypass file permissions.
And that would be bad.
So let's pretend we're root for a hot minute and:

Ah. It's full of zeroes. We might need that documentation after all.
Here's what it has to say:
For the special case of block group 0, the first 1024 bytes are unused, to allow for the installation of x86 boot sectors and other oddities. The superblock will start at offset 1024 bytes, whichever block that happens to be (usually 0).
Okay then, let's read starting from byte 1024. We'll use the positioned-io crate for that.
// instead of std::io::Read
use positioned_io::ReadAt;

fn main() -> Result<(), std::io::Error> {
    // (cut)

    // read 128 bytes of the file, this time starting at offset 1024
    file.read_exact_at(1024, &mut buf)?;

    // (cut)
}

Now we're cooking! The docs mention a "superblock", and if we read up on its structure, it says that we should find the magic number 0xEF53 at offset 0x38. It also says it's a little-endian 16-bit integer.
Well, there's a crate for that.
Since we're going to be reading a lot of stuff, let's make a helper struct. We'll need a few more use directives:
// this will let us read integers of various width
use byteorder::{LittleEndian, ReadBytesExt};
// a cursor keeps track of our position within any
// resource that implements ReadAt.
use positioned_io::{ReadAt, Cursor};
// 'failure' gives us backtraces when we return
// errors.
use failure::Fallible;

type Result<T> = std::result::Result<T, failure::Error>;

// we'll be able to read from any type that
// implements ReadAt.
struct Reader<IO: ReadAt> {
    inner: IO,
}

impl<IO: ReadAt> Reader<IO> {
    fn new(inner: IO) -> Self {
        Self { inner }
    }

    fn u16(&self, offset: u64) -> Fallible<u16> {
        let mut cursor = Cursor::new_pos(&self.inner, offset);
        Ok(cursor.read_u16::<LittleEndian>()?)
    }
}
Now, any time we want to read something from the partition, we'll first make a Slice, pass it to a Reader, and use it.
Note that none of the File, Slice, or Reader need to be mutable, because none of them maintain position information.
They can even be used concurrently!
use positioned_io::Slice;

// using our new result type:
fn main() -> Result<()> {
    let file = OpenOptions::new().read(true).open("/dev/sda1")?;

    // create a slice that corresponds to the superblock
    let r = Reader::new(Slice::new(file, 1024, None));

    // as per the docs:
    let magic = r.u16(0x38)?;
    println!("magic = {:x}", magic);
    Ok(())
}

Hey, that's the value the documentation gave us! We're on the right track.
Block groups
In our toy disk layout, we divided the disk in blocks. ext4 does just the same thing! At offset 0x18 of the superblock, we find the size of a block. Well, we find n, where the block size is 2 ^ (10 + n).
(Note: for this part, we need to add a u32 method to our Reader type. It's modelled exactly after u16, so I won't add it here).
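If you're following along and want it spelled out anyway, it would look something like this - same pattern as u16, just a wider integer:
// in impl Reader - modelled exactly after u16:
fn u32(&self, offset: u64) -> Fallible<u32> {
    let mut cursor = Cursor::new_pos(&self.inner, offset);
    Ok(cursor.read_u32::<LittleEndian>()?)
}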

What luck! The blocks are 4KiB, just like in our toy disk layout.
However, if we look at the docs, we'll notice the blocks are grouped (into.. block groups).

And those are described in... group descriptors (GDT stands for "group descriptor table"):

The number of blocks per group is stored in the superblock, at offset 0x20:
// in main
let bpg = r.u32(0x20)?;
println!("blocks per group = {}", bpg);

Finally, the number of inodes per group is stored at offset 0x28:
// in main
let ipg = r.u32(0x28)?;
println!("inodes per group = {}", ipg);

The good news is that all those values look legit. The bad news is that we're nowhere near reading an inode.
Reading an inode in ext4
Let's recap. An ext4 partition starts with 1024 bytes of padding, then a superblock, which contains various settings like the size of a block, the number of blocks per group, the number of inodes per group, etc.
Then come the block group descriptors. Those contain the offset of a few things, but the one we're interested in is the inode table:

Let's start at the beginning - find the inode for /.
There's a few special inodes in ext4:

The documentation seems about as confident as we are. Those question marks there are.. just great.
Anyway, the inode for the root, /, is always 2. How do we find it?
Again, the docs help:

And that makes sense!
Since we're using a good programming language, let's start building out a few abstractions.
First, the superblock:
// this is like derive(Debug), but better.
// see https://crates.io/crates/custom_debug_derive
use custom_debug_derive::{Debug as CustomDebug};

#[derive(CustomDebug)]
struct Superblock {
    #[debug(format = "{:x}")]
    magic: u16,
    block_size: u64,
    blocks_per_group: u64,
    inodes_per_group: u64,
    inode_size: u64,
}

impl Superblock {
    fn new(dev: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(Slice::new(dev, 1024, None));
        // note: we're casting a few fields to `u64` now.
        // this will save us a bunch of grief later.
        Ok(Self {
            magic: r.u16(0x38)?,
            block_size: (2u32.pow(10 + r.u32(0x18)?)) as u64,
            blocks_per_group: r.u32(0x20)? as u64,
            inodes_per_group: r.u32(0x28)? as u64,
            inode_size: r.u16(0x58)? as u64,
        })
    }
}
Let's take it for a spin:
fn main() -> Result<()> {
    // open our ext4 partition, READ-ONLY.
    let file = OpenOptions::new().read(true).open("/dev/sda1")?;
    let sb = Superblock::new(&file)?;
    println!("{:#?}", sb);
    Ok(())
}

Seems fine!
Now let's make a type for block group descriptors. We're going to use it as a container for a constant: the size of a block group descriptor:
struct BlockGroupDescriptor {}

impl BlockGroupDescriptor {
    const SIZE: u64 = 64;
}
Next, let's make a type for block group numbers. Sure, it's just a number - but we'll add a method on it to get a slice of the descriptor.
A tuple struct with a single field is called a newtype.
#[derive(Debug, Clone, Copy)]
struct BlockGroupNumber(u64);

impl BlockGroupNumber {
    fn desc_slice<T>(self, sb: &Superblock, dev: T) -> Slice<T>
    where
        T: ReadAt,
    {
        assert!(sb.block_size != 1024, "1024 block size not supported");
        // the superblock takes up 1 block
        let gdt_start = 1 * sb.block_size;
        let offset = gdt_start + self.0 * BlockGroupDescriptor::SIZE;
        Slice::new(dev, offset, None)
    }
}
Next up, let's make a type for inode numbers! We'll add a method that tells us in which block group a particular inode is.
#[derive(Debug, Clone, Copy)]
struct InodeNumber(u64);

impl InodeNumber {
    fn blockgroup_number(self, sb: &Superblock) -> BlockGroupNumber {
        let n = (self.0 - 1) / sb.inodes_per_group;
        BlockGroupNumber(n)
    }
}
Let's try this all out. We're looking for inode 2:
fn main() -> Result<()> {
    // open our ext4 partition, READ-ONLY.
    let file = OpenOptions::new().read(true).open("/dev/sda1")?;
    let sb = Superblock::new(&file)?;
    println!("{:#?}", sb);

    let root_bg = InodeNumber(2).blockgroup_number(&sb);
    dbg!(&root_bg);

    let mut buf = vec![0u8; 64];
    root_bg.desc_slice(&sb, &file).read_at(0, &mut buf)?;
    println!("{:x}", buf.as_hex());
    Ok(())
}

As expected, inode 2 is in block group 0, and the descriptor slice, well, is not all zeroes, so that's a start.
The only thing we care about in the block descriptor is the offset of the inode table. Like most other locations, it's a block number.
This one's a bit annoying, because the lower 32-bits are stored at 0x8, and the upper 32-bits are stored at 0x28.
We could do the bit twiddling directly in BlockGroupDescriptor, but let's add a convenience method to Reader instead:
// in impl Reader
fn u64_lohi(&self, lo: u64, hi: u64) -> Fallible<u64> {
    Ok(self.u32(lo)? as u64 + ((self.u32(hi)? as u64) << 32))
}
And let's fill out BlockGroupDescriptor:
#[derive(Debug)]
struct BlockGroupDescriptor {
    inode_table: u64,
}

impl BlockGroupDescriptor {
    const SIZE: u64 = 64;

    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        Ok(Self {
            inode_table: r.u64_lohi(0x8, 0x28)?,
        })
    }
}
Now, we can add a method to BlockGroupNumber to directly grab the block group descriptor:
// in impl BlockGroupNumber
fn desc(self, sb: &Superblock, dev: &dyn ReadAt) -> Result<BlockGroupDescriptor> {
    let slice = self.desc_slice(sb, dev);
    BlockGroupDescriptor::new(&slice)
}
Again, let's try it:

Hey, 1070 looks like a pretty reasonable number for the inode table of the first block group.
✨ We're making progress ✨
Now, the inode table is a series of fixed-size inodes.
The first field is a 16-bit (little-endian, still) integer that contains the file mode. A bit later on, we find the inode's size, again divided between lower 32 bits (at 0x4) and upper 32 bits (at 0x6C).
At 0x28, we find the "block map or extent tree". It's a block of 60 bytes. I have a feeling we're going to need that as well!
Let's add a convenience method to Reader so we can grab that block easily:
// in impl Reader
fn vec(&self, offset: u64, len: usize) -> Fallible<Vec<u8>> {
    let mut v = vec![0u8; len];
    self.inner.read_exact_at(offset, &mut v)?;
    Ok(v)
}
Next, let's add a method to grab the slice for a specific inode number.
// in impl InodeNumber
fn inode_slice<T>(self, sb: &Superblock, dev: T) -> Result<Slice<T>>
where
    T: ReadAt,
{
    let desc = self.blockgroup_number(sb).desc(sb, &dev)?;
    let table_off = desc.inode_table * sb.block_size;
    let idx_in_table = (self.0 - 1) % sb.inodes_per_group;
    let inode_off = table_off + sb.inode_size * idx_in_table;
    Ok(Slice::new(dev, inode_off, Some(sb.inode_size)))
}
Now, let's make an Inode type, with all the data we wanted: mode, size, and the 60-byte block. We'll use Reader::vec, which we added recently:
#[derive(CustomDebug)]
struct Inode {
    #[debug(format = "{:o}")]
    mode: u16,
    size: u64,
    #[debug(skip)]
    block: Vec<u8>,
}

impl Inode {
    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        Ok(Self {
            mode: r.u16(0x0)?,
            size: r.u64_lohi(0x4, 0x6C)?,
            block: r.vec(0x28, 60)?,
        })
    }
}
And finally, let's add a method to InodeNumber that directly reads the inode:
// in impl InodeNumber
fn inode(self, sb: &Superblock, dev: &dyn ReadAt) -> Result<Inode> {
    let slice = self.inode_slice(sb, dev)?;
    Inode::new(&slice)
}
We're ready. Let's try it:
fn main() -> Result<()> {
    // open our ext4 partition, READ-ONLY.
    let file = OpenOptions::new().read(true).open("/dev/sda1")?;
    let sb = Superblock::new(&file)?;
    println!("{:#?}", sb);

    let root_inode = InodeNumber(2).inode(&sb, &file)?;
    println!("{:#?}", root_inode);
    Ok(())
}

Hurray! The values match what the stat command reports. I'm sure there's lots of directories with the same permissions and size, so we could be reading the wrong inode, but at least we're probably not reading, say, a random data block instead.
Let's quickly add a Filetype enum so we can make sure it's a directory. The file type is contained in the mode:

We'll use the num_enum crate here:
use num_enum::*;
use std::convert::TryFrom;

#[derive(Debug, TryFromPrimitive)]
#[repr(u16)]
enum Filetype {
    Fifo = 0x1000,
    CharacterDevice = 0x2000,
    Directory = 0x4000,
    BlockDevice = 0x6000,
    Regular = 0x8000,
    SymbolicLink = 0xA000,
    Socket = 0xC000,
}
And add a getter to Inode:
// in impl Inode
fn filetype(&self) -> Filetype {
    Filetype::try_from(self.mode & 0xF000).unwrap()
}
To keep the example code concise, we use unwrap() here. If we somehow read an inode from the wrong slice, or if the inode has an invalid mode, our program will panic. This is not desirable in a real filesystem implementation, but here it works as a nice sanity check.
Let's give it a shot:
// in main
let root_inode = InodeNumber(2).inode(&sb, &file)?;
println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

Everything looks fine!
Now that we're sure we're dealing with a directory, we can start reading its entries.
But first - remember extents? Those are in the block we just read!
The kernel documentation is here to help.
#[derive(Debug)]
struct ExtentHeader {
    entries: u64,
    depth: u64,
}

impl ExtentHeader {
    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        let magic = r.u16(0x0)?;
        assert_eq!(magic, 0xF30A);

        Ok(Self {
            entries: r.u16(0x2)? as u64,
            depth: r.u16(0x6)? as u64,
        })
    }
}
Let's read one.
// in main
let ext_header = ExtentHeader::new(&Slice::new(&root_inode.block, 0, Some(12)))?;
println!("{:#?}", ext_header);
Well, it's not much but - again, the values make sense. The header is followed by a single entry, so we know the data for / is stored in a single range of contiguous blocks. The header has a depth of 0, which means it's a leaf node.
In our case, we'll assume that /, /etc, and /etc/hosts all only have one extent each.
This makes sense, since there's only 24 children for /, only 204 children for /etc, and /etc/hosts is shorter (511 bytes) than a block (4096 bytes).
However, we'll verify our assumptions with assert!.
#[derive(Debug)]
struct Extent {
    len: u64,
    start: u64,
}

impl Extent {
    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        Ok(Self {
            len: r.u16(0x4)? as u64,
            // the block number the extent points to is split
            // between upper 16-bits and lower 32-bits.
            start: ((r.u16(0x6)? as u64) << 32) + r.u32(0x8)? as u64,
        })
    }
}
Our complete extent reading code becomes:
// in main, after finding root_inode
let ext_header = ExtentHeader::new(&Slice::new(&root_inode.block, 0, Some(12)))?;
println!("{:#?}", ext_header);
assert_eq!(ext_header.depth, 0);
assert_eq!(ext_header.entries, 1);

let ext = Extent::new(&Slice::new(&root_inode.block, 12, Some(12)))?;
println!("{:#?}", ext);

Hey, those are still reasonable values!
According to this, we can find the data for / at block 9262.
Let's take a moment to add a convenience method to the Inode type:
fn data<T>(&self, sb: &Superblock, dev: T) -> Result<Slice<T>>
where
    T: ReadAt,
{
    let ext_header = ExtentHeader::new(&Slice::new(&self.block, 0, Some(12)))?;
    assert_eq!(ext_header.depth, 0);
    assert_eq!(ext_header.entries, 1);

    let ext = Extent::new(&Slice::new(&self.block, 12, Some(12)))?;
    assert_eq!(ext.len, 1);

    let offset = ext.start * sb.block_size;
    let len = ext.len * sb.block_size;
    Ok(Slice::new(dev, offset, Some(len)))
}
And then dump the first 128 bytes of /'s data, as if it were a string:
// in main
let root_inode = InodeNumber(2).inode(&sb, &file)?;
println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

let root_data = root_inode.data(&sb, &file)?;
let data_start = Reader::new(&root_data).vec(0, 128)?;
println!("{}", String::from_utf8_lossy(&data_start));

Now, I don't know about you, but this looks very exciting to me.
We can see some of the top-level directory names in there: proc, dev, bin - and even . and ..! (the current directory and parent directory, respectively).
Let's read them proper: the directory entry format is actually pretty straightforward.
#[derive(CustomDebug)]
struct DirectoryEntry {
    #[debug(skip)]
    len: u64,
    inode: InodeNumber,
    name: String,
}

impl DirectoryEntry {
    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        let name_len = r.u8(0x6)? as usize;
        Ok(Self {
            inode: InodeNumber(r.u32(0x0)? as u64),
            len: r.u16(0x4)? as u64,
            name: String::from_utf8_lossy(&r.vec(0x8, name_len)?).into(),
        })
    }
}
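Note that this uses a u8 method on Reader that we haven't shown. It would follow the same pattern as u16 and u32 - something like this (byteorder's read_u8 takes no endianness parameter, since it's a single byte):
// in impl Reader
fn u8(&self, offset: u64) -> Fallible<u8> {
    let mut cursor = Cursor::new_pos(&self.inner, offset);
    Ok(cursor.read_u8()?)
}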
We'll simply add a method to Inode that returns a vec of DirectoryEntry. This won't fit all use cases, but it'll fit ours!
// in impl Inode
fn dir_entries(&self, sb: &Superblock, dev: &dyn ReadAt) -> Result<Vec<DirectoryEntry>> {
    let data = self.data(sb, dev)?;
    let mut entries = Vec::new();

    let mut offset: u64 = 0;
    loop {
        let entry = DirectoryEntry::new(&Slice::new(&data, offset, None))?;
        if entry.inode.0 == 0 {
            break;
        }
        offset += entry.len;
        entries.push(entry);
    }
    Ok(entries)
}
Now for the moment of truth:
// in main:
let root_inode = InodeNumber(2).inode(&sb, &file)?;
println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

let root_entries = root_inode.dir_entries(&sb, &file)?;
println!("{:#?}", root_entries);

🎉🎉🎉
Everything appears to be in order.
Let's add a convenience method to find a particular child:
// in impl Inode
fn child(&self, name: &str, sb: &Superblock, dev: &dyn ReadAt) -> Result<Option<InodeNumber>> {
    let entries = self.dir_entries(sb, dev)?;
    Ok(entries
        .into_iter()
        .filter(|x| x.name == name)
        .map(|x| x.inode)
        .next())
}
Why is the return type Result<Option<T>>?
Well, reading from the partition might fail - and that would return Err(e).
Reading from the partition might succeed, but there might be no child with this name. That would return Ok(None).
Finally, reading might succeed, and we might find such a child, and that would return Ok(Some(n)).
And let's use it:
// in main
let root_inode = InodeNumber(2).inode(&sb, &file)?;
println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

let etc_inode = root_inode.child("etc", &sb, &file)?;
println!("{:#?}", etc_inode);

Everything checks out.
Let's go one step further and read that inode too:
// in main
let root_inode = InodeNumber(2).inode(&sb, &file)?;
println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

let etc_inode = root_inode
    .child("etc", &sb, &file)?
    .expect("/etc should exist")
    .inode(&sb, &file)?;
println!("({:?}) {:#?}", etc_inode.filetype(), etc_inode);

Still good. Can we find hosts in there?
// in main
let hosts_inode = etc_inode
    .child("hosts", &sb, &file)?
    .expect("/etc/hosts should exist")
    .inode(&sb, &file)?;
println!("({:?}) {:#?}", hosts_inode.filetype(), hosts_inode);

Excitement intensifies
And can we read it as a string??
// in main
let hosts_data = hosts_inode.data(&sb, &file)?;
let hosts_data = Reader::new(&hosts_data).vec(0, hosts_inode.size as usize)?;
let hosts_data = String::from_utf8_lossy(&hosts_data);

println!("---------------------------------------------");
println!("{}", hosts_data);

You can find the complete code listing on GitHub Gist. It clocks in at exactly 300 lines of Rust! (Counting empty lines and punctuation lines).
This article is part 3 of the Reading files the hard way series.
If you liked what you saw, please support my work!