Thanks to my sponsors: David E Disch, Mario Fleischhacker, Lucille Blumire, Brandon Piña, ofrighil, Steven Pham, zaurask, Makoto Nakashima, Henrik Tudborg, Shane Lillie, joseph, Paige Ruten, Aiden Scandella, clement, Ronen Ulanovsky, WeblWabl, Dirkjan Ochtman, Santiago Lema, Blake Johnson, old.woman.josiah and 230 more
Reading files the hard way - Part 3 (ftrace, disk layouts, ext4)
👋 This page was last updated ~5 years ago. Just so you know.
So far, we've seen many ways to read a file from different programming languages, we've learned about syscalls, how to make those from assembly, then we've learned about memory mapping, virtual address spaces, and generally some of the mechanisms in which userland and the kernel interact.
But in our exploration, we've always considered the kernel more or less like a "black box". It's time to change that.
In the belly of the beast
In part 2, we've used gdb
to step through our userland program, readfile
.
This allowed us to inspect the state of CPU registers after each instruction.
But we weren't able to see what was going on in the kernel.
We've also used strace
to trace system calls. It was less noisy than gdb,
because it only displayed information when our userland program was communicating
with the kernel.
Now it's time to step it up and use ftrace
. It's a tracing facility integrated
into the Linux kernel. Thanks to it, we're able to enable tracing of various
kernel functions, and events.
Cool bear's hot tip
There's an excellent Introduction to ftrace over on Julia Evans's blog.
Speaking of Julia, check out her fantastic zines!
The easiest way for us to get started is to use trace-cmd
. On Manjaro/ArchLinux,
there's an AUR package for that, so we can just grab it:
$ yay -S trace-cmd
And on Ubuntu, there's a package as well:
$ sudo apt install trace-cmd
The plan is to:
- Enable tracing for all
ext4
events (more on this later) - Run our
readfile
program (the assembly one) - Look at the report and try to piece together what's going on
My first trace wasn't successful - it only contained events related to zsh (my shell), and git (called by my shell's prompt to show indicators).
It turns out there are several levels of caching, and /etc/hosts
was already cached to the maximum, so reading it didn't even register.
Luckily, we can clear all these caches (while the system is running!) by running the following as root:
$ sync
$ echo 3 > /proc/sys/vm/drop_caches
Then, we can start tracing (again, as root):
$ trace-cmd record -e ext4
Then launch our program, hit Ctrl-C
to stop trace-cmd
, and launch it again
trace-cmd
to format the collected data in a human-readable format:
$ trace-cmd report trace.dat > trace.txt
And there we have it:
readfile-2056669 [002] 112770.475550: ext4_es_lookup_extent_enter: dev 8,3 ino 16252929 lblk 0
readfile-2056669 [002] 112770.475551: ext4_es_lookup_extent_exit: dev 8,3 ino 16252929 found 1 [0/1) 65019936 WR
readfile-2056669 [002] 112770.475553: ext4_es_lookup_extent_enter: dev 8,3 ino 16252929 lblk 1
readfile-2056669 [002] 112770.475553: ext4_es_lookup_extent_exit: dev 8,3 ino 16252929 found 1 [1/2) 65020464 WR
readfile-2056669 [002] 112770.475565: ext4_es_lookup_extent_enter: dev 8,3 ino 16253092 lblk 0
readfile-2056669 [002] 112770.475565: ext4_es_lookup_extent_exit: dev 8,3 ino 16253092 found 0 [0/0) 0
readfile-2056669 [002] 112770.475565: ext4_ext_map_blocks_enter: dev 8,3 ino 16253092 lblk 0 len 1 flags
readfile-2056669 [002] 112770.475566: ext4_es_cache_extent: dev 8,3 ino 16253092 es [0/1) mapped 65050624 status W
readfile-2056669 [002] 112770.475566: ext4_ext_show_extent: dev 8,3 ino 16253092 lblk 0 pblk 65050624 len 1
readfile-2056669 [002] 112770.475566: ext4_ext_map_blocks_exit: dev 8,3 ino 16253092 flags lblk 0 pblk 65050624 len 1 mflags M ret 1
readfile-2056669 [002] 112770.475566: ext4_es_insert_extent: dev 8,3 ino 16253092 es [0/1) mapped 65050624 status W
readfile-2056669 [002] 112770.474893: ext4_es_lookup_extent_enter: dev 8,3 ino 18491454 lblk 0
readfile-2056669 [002] 112770.474893: ext4_es_lookup_extent_exit: dev 8,3 ino 18491454 found 0 [0/0) 0
readfile-2056669 [002] 112770.474893: ext4_ext_map_blocks_enter: dev 8,3 ino 18491454 lblk 0 len 4 flags
readfile-2056669 [002] 112770.474894: ext4_es_cache_extent: dev 8,3 ino 18491454 es [0/5) mapped 93400358 status W
readfile-2056669 [002] 112770.474894: ext4_ext_show_extent: dev 8,3 ino 18491454 lblk 0 pblk 93400358 len 5
readfile-2056669 [002] 112770.474894: ext4_ext_map_blocks_exit: dev 8,3 ino 18491454 flags lblk 0 pblk 93400358 len 4 mflags M ret 4
readfile-2056669 [002] 112770.474894: ext4_es_insert_extent: dev 8,3 ino 18491454 es [0/4) mapped 93400358 status W
The columns appear to be formatted like so:
process_name-pid [cpu_core] timestamp: event_name: parameters
I was unable to find precise documentation on each of the events, but we can spot that three main things are being done here:
- Look up "inode 16252929"
- Look up and map "inode 16253092"
- Look up and map "inode 18491454"
And that's it.
What did we learn?
ftrace
allows tracing behavior inside kernel code. We can't change it,
but we can get a better idea what's going on.
Where did our paths go?
Whereas strace
showed us /etc/hosts
, the path of the file we're mapping,
ftrace
doesn't. Instead it shows us inode numbers. To understand why, we must
think about the ergonomics of a file system.
✨ Let's talk about symbolic links ✨
Symbolic links (symlinks for short) are files, but they're not regular files. They point to another file, and whenever they're opened, the other file should be opened instead. There's.. reasons to use symlinks, which I won't get into, just trust me on that one.
Point is, there's a bunch of symlinks in almost any Linux system, for example
in /usr/lib
:
$ ll /usr/lib/x86_64-linux-gnu/libcairo.so*
lrwxrwxrwx 1 root root 21 Oct 1 21:19 /usr/lib/x86_64-linux-gnu/libcairo.so.2 -> libcairo.so.2.11600.0
-rw-r--r-- 1 root root 1203976 Oct 7 2021 /usr/lib/x86_64-linux-gnu/libcairo.so.2.11600.0
Here, libcairo.so.2
points to libcairo.so.2.11600.0
.
If we run a command like wc
(which count words, but also bytes), we see that
it's accessing the file being pointed to in both cases:
$ wc /usr/lib/x86_64-linux-gnu/libcairo.so.2
4419 24794 1203976 /usr/lib/x86_64-linux-gnu/libcairo.so.2
$ wc /usr/lib/x86_64-linux-gnu/libcairo.so.2.11600.0
4419 24794 1203976 /usr/lib/x86_64-linux-gnu/libcairo.so.2.11600.0
However, they're not copies. They're just.. symbolic links. As for implementation details, we notice two things: the size of those symlink files is exactly the size of the path they point to, and they have a slightly different mode:
So symbolic links are files, with a special mode flag, whose contents is the path they point to. This means they can be dangling!
Cool bear's hot tip
A dangling symlink is one that points to a non-existent file.
See the example below:
$ echo "Hello there" > regular.txt
$ cat regular.txt
Hello there
$ ln -s regular.txt symlink.txt
$ ll
total 144
drwxrwxr-x 2 amos amos 4096 Mar 11 14:11 ./
drwxrwxrwt 76 root root 135168 Mar 11 14:11 ../
-rw-rw-r-- 1 amos amos 12 Mar 11 14:11 regular.txt
lrwxrwxrwx 1 amos amos 11 Mar 11 14:11 symlink.txt -> regular.txt
$ cat symlink.txt
Hello there
$ rm regular.txt
$ cat symlink.txt
cat: symlink.txt: No such file or directory
You can make a filesystem out of that (or can you?)
Let's talk about folders. As we've seen, symlinks are just files, with a special mode, and their contents is a path. Couldn't we do the same for folders?
Let's make up a filesystem. What's a filesystem? Just a way to organize file and folders, their contents, and their metadata on disk. You can think of a disk as a big blank slate, like so:
Cool bear's hot tip
Throughout the rest of this article, we'll be using the word "disk" when what we really mean is "storage device". It's shorter!
And let's say that, just for fun, we divide this disk in a series of blocks. Let's make them 4096 bytes (or 4KiB). That way, they might be the same size as kernel memory pages. Or a multiple. At least they're a power of two.
Let's say in our filesystem, each folder is just a list of names: their children. We can traverse the filesystem down just fine:
/
has/etc
in its list of children/etc
has/etc/hosts
in its list of children/etc/hosts
is a regular file
Sounds good right? I mean, we still have to solve a few problems.
First, we need to decide where on the disk our folders and files live. But that's not a terribly hard problem. We can just have a list of paths and a list of offsets, maybe like that:
And when we want to access /etc/hosts
, well, we just: 1) go through our table,
find the right entry, and 2) read it!
This doesn't seem so bad. Why is it problematic?
Problem 1: Looking for a file is O(n), where n is the number of files.
I've just counted the files on one of my Linux systems, and there are 1'128'252 of them. Sounds like looking up a file with a linear search is going to get expensive.
Problem 2: The contents of each file needs to be contiguous.
With our current layout, if a file becomes too big to fit in one block, and the next block is used, then we need to move it:
Problem 3: Metadata is mixed up with data.
Let's say we want to implement the tree
command. It prints the name
mode and size of every file in the filesystem, no matter how deeply nested.
With our disk layout, we'd have to read the start of every block that contains a file just to know its metadata. Along with, of course, the contents of every folder (but that's normal).
This is especially problematic because reading metadata (which happens very often) requires a lot of seeking. Seeking is the process of "jumping to a different location" when reading a storage device.
This is expensive for optical disk readers, or hard disk drives. It is extremely expensive for tape drives.
Problem 4: Rename operations are extremely expensive.
If we want to rename /etc
to /config
, we have a bunch of entries to
re-write:
This means reading a bunch of entries to find out what we should even rewrite, then write, in many different locations, all across the disk. Remember seeking being expensive? Forget about it.
It gets worse the more subfolders and files a directory has.
Problem 5: Every file operation needs to parse entire paths.
With our scheme, if we want to go from /etc/hosts
to /etc
, there's
no direct link! We have to parse /etc/hosts
into ["etc", "hosts"]
,
remove the last one to get ["etc"]
and build the /etc
path from that.
Then we have to use our lookup table at the very beginning to find where
/etc
actually is.
All in all, this is a pretty bad disk layout. It would technically work for a toy operating system, but I wouldn't want to use it as a daily driver.
What did we learn?
A filesystem's disk layout must be carefully designed.
Operations like enumerating files, traversing downwards or upwards, renaming folders, should be achievable in as few operations as possible, minimizing seeking.
Enter inodes
One of the lessons we learned from our first attempt at a filesystem disk layout is that we should probably separate metadata and data. We should also find another way to refer to files than by their fully qualified names.
Whether our file lives at /etc/hosts
our /config/hosts
- it's the same file!
And our scheme should reflect that.
So, let's start by storing all the metadata next to each other, as early as possible. We need to reserve a big chunk of space, large enough to describe every file on disk.
This will be our index, each element is an "index node", or "inode".
Each of those inodes will contain the usual, but also, it'll contain the number of the block we can find its data in:
As for folders - easy! In the data block, we store a list of directory entries. Each entry refers to another inode, and gives it a name.
Let's review. Did we actually solve our problems? Firstly, how do we look up a file in this new scheme?
Let's say we want to open /etc/hosts
. First we read the inode for /
.
For convenience's sake, let's decide that, since it's the filesystem
root (topmost) node, it will always be in inode 1:
Then we read its directory entries, look for "etc". Then we read "etc"'s inode. Then we read its directories entires, look for "hosts". Finally, we read "hosts"'s data.
I made a quick diagram so it's easier to follow:
Ok, so it's not easier to follow.
But there is one upside compared to our previous scheme. Instead of doing a linear search among all the files in the filesystem, by absolute paths, we search across much smaller sets of directory entries, multiple times:
Cool bear's hot tip
The diagram above is not to scale: we just reduced the worst case from 1.1M string comparisons to 228 string comparisons.
The new worst case is 0.02% of the old one.
Of course, this depends how many files there are per directory!
What about rename operations?. You might have noticed in our new scheme,
the inodes don't contain path information - there's no name
field
in there.
The names are actually contained in the directory entries:
If we want to rename /etc
, we just need to change the directory
entry - we don't need to touch the inodes of any of the children,
no matter how deep the subdirectory hierarchy goes:
If we want to move /etc/hosts
to /hosts
, we simply remove
the directory entry in /etc
's inode, and add one to /
's inode.
Again, it's a flat number of operations, no matter how many directories and files our directories have. This is much better.
Right now, traversing downwards is efficient (more than before), but
traversing upwards (going from /etc
to /
, for example) still requires
manipulating paths, and we don't want to do that.
Here's an idea: what if we add an entry to each directory, that points to its parent? We could call it "..". And for the root, we'll just make it point to itself:
Solving the remaining problems
I don't know about you, but I'm pretty happy with our disk layout so far.
However, there's (at least) two things we could be doing better.
Right now, every directory entry is stored in no particular order, in a list. To find a specific entry, we have to do a linear search, ie. do a comparison with each entry one after the other, and stop only when we've found what we've looking for.
This is only practical if directories never have more than a few thousand files. If we get into the tens of thousands, or hundreds of thousands territory, then things will start to be really slow.
A potential solution is to hash the names of all the entries, and use a Self-balancing binary search tree.
I won't get into the details here, but basically, it allows insertion (adding an entry), deletion (removing an entry), and searching in logarithmic time, which means it will get expensive much slower than a linear data structure.
Cool bear's hot tip
"Data structures" is a rich and complex field, which would warrant many, many articles.
Let's not go there today.
The second problem is that our files still have to be contiguous. If they grow, we might have to move them to another set of blocks. We might even find that there is no contiguous set of blocks large enough to hold the file!
To address that, we can use extents.
For each file, we can store a series of (start, length)
pairs:
This brings on a new problem: previously, if we wanted to read just the second half of a file, we could simply calculate the address of the first block:
But now, the middle of the file may be in any of the extents. It's not simple arithmetic anymore. To remedy this, we can also use a tree data structure.
If we want to read the file from the beginning, then we can simply walk the tree in order. If we're seeking somewhere else, we'll simply do a search through the tree until we find the right extent, and then so simple arithmetic.
Now for a taste of the real world
Enough playing around! It's time to look at a real live filesystem.
Before we started thinking about disk layouts, we were tracing
kernel events while opening and mapping /etc/hosts
.
The trace we got contained lines like:
ext4_ext_map_blocks_enter: dev 8,3 ino 18491454 lblk 0 len 4 flags
ext4_ext_map_blocks_enter: dev 8,3 ino 16253092 lblk 0 len 1 flags
This makes a little more sense now.
The first line refers to inode "18491454" (we now know what this means). The
_ext_
part of the event name probably refers to extents, because there's a
"len 4" (length of 4).
Could it be /etc/hosts
?
Cool bear's hot tip
On Linux, we can pass the -i
flag to the ls
command, to make it display inode numbers.
$ ls -i -l /etc/hosts
16253092 -rw-r--r-- 1 root root 220 Oct 1 21:21 /etc/hosts
Nope! /etc/hosts
is 16253092
. That's the second line.
What's the first one then?
Well, remember when I said binaries were mapped when they got executed?
$ ls -i ./readfile
18491454 ./readfile
Bingo.
So as far as we can tell, the real file system this Linux system is using is also using inodes. And extents!
We can find out what filesystem it is with the df
command (specifically,
the -T
flag):
$ df -Th /
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda3 ext4 548G 96G 425G 19% /
So, apparently it's ext4. It appears the most up-to-date documentation for ext4 is on kernel.org.
Unfortunately, it seems this documentation wasn't written by the authors of ext4, but rather by someone else reading the code for ext4 in the Linux kernel. Oh well, that's life.
But first: what's this /dev/sda3
?
Well, remember when I said the kernel was boss? And that it controlled the reality userland processes see? That goes doubly for files.
Not all things that can be open()'d are files in the traditional sense.
Some are just.. resources.
For example, /proc/cpuinfo
"contains" information about the installed CPUs:
$ head -5 /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 25
model : 33
model name : AMD Ryzen 9 5950X 16-Core Processor
It's not actually on any storage device. It's just a way for the kernel
to give userland processes some information, through standard system calls
like open
, read
, write
and close
.
/proc/self/
is a directory that contains information about the currently
running process. If we modify our Go program like so:
package main
import (
"fmt"
"strings"
"io/ioutil"
)
func main() {
payload, _ := ioutil.ReadFile("/proc/self/cmdline")
fmt.Printf("%#v\n", strings.Split(string(payload), "\x00"))
}
...and run it like so, it'll print its arguments:
$ go build ./main.go
$ ./main "foo" "argument 2" "bar"
[]string{"./main", "foo", "argument 2", "bar", ""}
In fact, in The Unix Philosophy, everything is a file.
Cool bear's hot tip
This has made a lot of people angry.
So, /dev/sda3
is simply a "file" that contains the entire contents of
our root ext4 partition.
If we wanted the whole disk we'd open /dev/sda
instead. But
then we'd have to worry about partition tables, and that's better
left for later.
Let's read a whole partition I guess
Now that we're done with the introductory material, let's jump right into it.
We'll be using rust for this next part.
$ cargo new read-raw-ext4
Created binary (application) `read-raw-ext4` package
$ cd read-raw-ext4/
Let's add one of the crates we'll need:
$ cargo add hex-slice@0.1
Updating crates.io index
Adding hex-slice v0.1.4 to dependencies.
Updating crates.io index
// in `src/main.rs`
use hex_slice::AsHex;
use std::{fs::OpenOptions, io::Read};
fn main() -> Result<(), std::io::Error> {
// open our ext4 partition, READ-ONLY.
let mut file = OpenOptions::new().read(true).open("/dev/sda3")?;
// allocate a 128-byte buffer
let mut buf = vec![0u8; 128];
// read the first 128 bytes of the file
file.read_exact(&mut buf)?;
// print it as hexadecimal
println!("{:x}", buf.as_hex());
Ok(())
}
This builds happily with cargo build
, but it won't run:
$ cargo build
Compiling hex-slice v0.1.4
Compiling read-raw-ext4 v0.1.0 (/home/amos/bearcove/read-raw-ext4)
Finished dev [unoptimized + debuginfo] target(s) in 0.37s
$ ./target/debug/read-raw-ext4
Error: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
See, we're currently running this program as a normal user. And if any user could have access to any partition willy-nilly, then they could bypass file permissions.
And that would be bad.
So let's pretend we're root for a hot minute and:
$ sudo ./target/debug/read-raw-ext4
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Ah. It's full of zeroes. We might need that documentation after all.
Here's what it has to say:
For the special case of block group 0, the first 1024 bytes are unused, to allow for the installation of x86 boot sectors and other oddities. The superblock will start at offset 1024 bytes, whichever block that happens to be (usually 0).
Okay then, let's read starting from byte 1024. We'll use the positioned-io crate for that.
$ cargo add positioned-io@0.3
Updating crates.io index
Adding positioned-io v0.3 to dependencies.
Features as of v0.3.0:
+ byteorder
Updating crates.io index
// in `src/main.rs`
use hex_slice::AsHex;
use positioned_io::ReadAt;
use std::fs::OpenOptions;
fn main() -> Result<(), std::io::Error> {
let file = OpenOptions::new().read(true).open("/dev/sda3")?;
let mut buf = vec![0u8; 128];
// read 128 bytes of the file starting at offset 1024
file.read_exact_at(1024, &mut buf)?;
println!("{:x}", buf.as_hex());
Ok(())
}
$ cargo b && sudo ./target/debug/read-raw-ext4
Compiling libc v0.2.140
Compiling byteorder v1.4.3
Compiling positioned-io v0.3.1
Compiling read-raw-ext4 v0.1.0 (/home/amos/bearcove/read-raw-ext4)
Finished dev [unoptimized + debuginfo] target(s) in 0.84s
[0 e0 2d 2 0 3b b7 8 c0 8f 6f 0 3b 9 d 7 ac 7e 10 2 0 0 0 0 2 0 0 0 2 0 0 0 0 80 0 0 0 80 0 0 0 20 0 0 c5 b4 4 64 c4 b4 4 64 43 0 ff ff 53 ef 1 0 1 0 0 0 98 92 38 63 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 b 0 0 0 0 1 0 0 3c 0 0 0 c6 2 0 0 6b 4 0 0 f7 6a 4c a7 98 2f 4d 8b bb cb ae 71 15 ed d6 90 0 0 0 0 0 0 0 0]
Now we're cooking! The docs mention a "superblock", and if we read up on
its structure, it says that we should find the magic number 0xEF53
at offset 0x38
. It also says it's a little-endian 16-bit integer.
Well, there's a crate for that: byteorder.
$ cargo add byteorder@1.4
Updating crates.io index
Adding byteorder v1.4 to dependencies.
Features as of v1.4.0:
+ std
- i128
And let's bring one more for error handling:
$ cargo add color-eyre@0.6
Updating crates.io index
Adding color-eyre v0.6 to dependencies.
Features as of v0.6.0:
+ capture-spantrace
+ color-spantrace
+ tracing-error
+ track-caller
- issue-url
- url
Updating crates.io index
Since we're going to be reading a lot of stuff, let's make a helper struct. We'll need a few more use directives:
// this will let us read integers of various width
use byteorder::{LittleEndian, ReadBytesExt};
// a cursor keeps track of our position within any
// resource that implements ReadAt.
use positioned_io::{ReadAt, Cursor};
// this gives us a generic error type that's pretty solid by default
use color_eyre::Result;
struct Reader<IO> {
inner: IO,
}
impl<IO: ReadAt> Reader<IO> {
fn new(inner: IO) -> Self {
Self { inner }
}
fn u16(&self, offset: u64) -> Result<u16> {
let mut cursor = Cursor::new_pos(&self.inner, offset);
Ok(cursor.read_u16::<LittleEndian>()?)
}
}
Now, any time we want to read something from the partition, we'll first make a
positioned_io::Slice
, pass it to a Reader
, and use it.
Cool bear's hot tip
Note that neither the File, Slice, or Reader need to be mutable, because none of them maintain position information.
They can even be used concurrently!
use positioned_io::Slice;
fn main() -> Result<()> {
let file = OpenOptions::new().read(true).open("/dev/sda3")?;
// create a slice that corresponds to the superblock
let r = Reader::new(Slice::new(file, 1024, None));
// as per the docs:
let magic = r.u16(0x38)?;
println!("magic = {magic:x}");
Ok(())
}
$ cargo b -q && sudo ./target/debug/read-raw-ext4
magic = ef53
Hey, that's the value the documentation gave us! We're on the right track.
Block groups
In our toy disk layout, we divided the disk in blocks. ext4 does
just the same thing! At offset 0x18
of the super block, we find
the size of a block.
Well, we find n
, where the block size is 2 ^ (10 + n)
.
We need to add a u32
method to our Reader:
impl<IO: ReadAt> Reader<IO> {
fn new(inner: IO) -> Self {
Self { inner }
}
fn u16(&self, offset: u64) -> Result<u16> {
let mut cursor = Cursor::new_pos(&self.inner, offset);
Ok(cursor.read_u16::<LittleEndian>()?)
}
// 👇 new!
fn u32(&self, offset: u64) -> Result<u32> {
let mut cursor = Cursor::new_pos(&self.inner, offset);
Ok(cursor.read_u32::<LittleEndian>()?)
}
}
And in main
, do a little bit of math:
fn main() -> Result<()> {
let file = OpenOptions::new().read(true).open("/dev/sda3")?;
let r = Reader::new(Slice::new(file, 1024, None));
let magic = r.u16(0x38)?;
println!("magic = {magic:x}");
// 👇 new!
let n = r.u32(0x18)?;
let block_size = 1 << (10 + n);
println!("block_size = {block_size}");
Ok(())
}
$ cargo b -q && sudo ./target/debug/read-raw-ext4
magic = ef53
block_size = 4096
What luck! The blocks are 4KB, just like in our toy disk layout.
However, if we look at the docs, we'll notice the blocks are grouped (into.. block groups).
An ext4 file system is split into a series of block groups. To reduce performance difficulties due to fragmentation, the block allocator tries very hard to keep each file's blocks within the same group, thereby reducing seek times. The size of a block group is specified in
sb.s_blocks_per_group
blocks, though it can also calculated as 8 *block_size_in_bytes
. With the default block size of 4KiB, each group will contain 32,768 blocks, for a length of 128MiB. The number of block groups is the size of the device divided by the size of a block group.
And those are described in... group descriptors (GDT stands for "group descriptor table"):
The layout of a standard block group is approximately as follows (each of these fields is discussed in a separate section below):
Group 0 Padding ext4 Super Block Group Descriptors Reserved GDT Blocks Data Block Bitmap inode Bitmap inode Table Data Blocks 1024 bytes 1 block many blocks many blocks 1 block 1 block many blocks many more blocks
The number of blocks per group is stored in the superblock, at offset 0x20:
// in main
let bpg = r.u32(0x20)?;
println!("blocks per group = {bpg}");
$ cargo b -q && sudo ./target/debug/read-raw-ext4
magic = ef53
block_size = 4096
blocks per group = 32768
Finally, the number of inodes per group is stored at offset 0x28;
// in main
let ipg = r.u32(0x28)?;
println!("inodes per group = {ipg}");
$ cargo b -q && sudo ./target/debug/read-raw-ext4
magic = ef53
block_size = 4096
blocks per group = 32768
inodes per group = 8192
The good news is that all those values look legit. The bad news is that we're nowhere near reading an inode.
Reading an inode in ext4
Let's recap. An ext4 partition starts with 1024 bytes of padding, then a superblock, which contains various settings like the size of a block, the number of blocks per group, the number of inodes per group, etc.
Then come the block group descriptors. Those contain the offset of a few things, but the one we're interested in is the inode table:
Let's start at the beginning - find the inode for /
.
There's a few special inodes in ext4:
inode Number Purpose 0 Doesn't exist; there is no inode 0. 1 List of defective blocks. 2 Root directory. 3 User quota. 4 Group quota. 5 Boot loader. 6 Undelete directory. 7 Reserved group descriptors inode. ("resize inode") 8 Journal inode. 9 The "exclude" inode, for snapshots(?) 10 Replica inode, used for non-upstream feature? 11 Traditional first non-reserved inode. Usually this is the lost+found directory. See s_first_ino in the superblock.
The documentation seems about as confident as we are. Those question marks there are.. just great.
Anyway, the inode for the root, /
, is always 2. How do we find it?
Again, the docs help:
4.1.2. Finding an Inode
Each block group contains
sb->s_inodes_per_group
inodes. Because inode 0 is defined not to exist, this formula can be used to find the block group that an inode lives in:bg = (inode_num - 1) / sb->s_inodes_per_group
. The particular inode can be found within the block group's inode table atindex = (inode_num - 1) % sb->s_inodes_per_group
. To get the byte address within the inode table, useoffset = index * sb->s_inode_size
.
And that makes sense!
Since we're using a good programming language, let's start building out a few abstractions.
$ cargo add custom_debug@0.5
Updating crates.io index
Adding custom_debug v0.5 to dependencies.
First, the superblock:
// this is like derive(Debug), but better.
// see https://crates.io/crates/custom_debug
use custom_debug::{Debug as CustomDebug};
#[derive(CustomDebug)]
struct Superblock {
#[debug(format = "{:x}")]
magic: u16,
block_size: u64,
blocks_per_group: u64,
inodes_per_group: u64,
inode_size: u64,
}
impl Superblock {
fn new(dev: &dyn ReadAt) -> Result<Self> {
let r = Reader::new(Slice::new(dev, 1024, None));
// note: we're casting a few fields to `u64` now.
// this will save us a bunch of grief later.
Ok(Self {
magic: r.u16(0x38)?,
block_size: (2u32.pow(10 + r.u32(0x18)?)) as u64,
blocks_per_group: r.u32(0x20)? as u64,
inodes_per_group: r.u32(0x28)? as u64,
inode_size: r.u16(0x58)? as u64,
})
}
}
Let's take it for a spin:
fn main() -> Result<()> {
color_eyre::install()?;
// open our ext4 partition, READ-ONLY.
let file = OpenOptions::new().read(true).open("/dev/sda3")?;
let sb = Superblock::new(&file)?;
println!("{sb:#?}");
Ok(())
}
$ cargo b -q && sudo ./target/debug/read-raw-ext4
Superblock {
magic: ef53,
block_size: 4096,
blocks_per_group: 32768,
inodes_per_group: 8192,
inode_size: 256,
}
Seems fine!
Now let's make a type for block group descriptors. We're going to use it as a container for a constant: the size of a block group descriptor:
struct BlockGroupDescriptor {}
impl BlockGroupDescriptor {
const SIZE: u64 = 64;
}
Next, let's make a type for block group numbers. Sure, it's just a number - but we'll add a method on it to get a slice of the descriptor.
Cool bear's hot tip
A tuple or struct with a single field is called a newtype.
#[derive(Debug, Clone, Copy)]
struct BlockGroupNumber(u64);
impl BlockGroupNumber {
fn desc_slice<T>(self, sb: &Superblock, dev: T) -> Slice<T>
where
T: ReadAt,
{
assert!(sb.block_size != 1024, "1024 block size not supported");
// the superblock takes up 1 block
let gdt_start = sb.block_size;
let offset = gdt_start + self.0 * BlockGroupDescriptor::SIZE;
Slice::new(dev, offset, None)
}
}
Next up, let's make a type for inode numbers! We'll add a method that tells us in which block group a particular inode is.
#[derive(Debug, Clone, Copy)]
struct InodeNumber(u64);
impl InodeNumber {
fn blockgroup_number(self, sb: &Superblock) -> BlockGroupNumber {
let n = (self.0 - 1) / sb.inodes_per_group;
BlockGroupNumber(n)
}
}
Let's try this all out. We're looking for inode 2:
fn main() -> Result<()> {
color_eyre::install()?;
// open our ext4 partition, READ-ONLY.
let file = OpenOptions::new().read(true).open("/dev/sda3")?;
let sb = Superblock::new(&file)?;
println!("{sb:#?}");
let root_bg = InodeNumber(2).blockgroup_number(&sb);
dbg!(&root_bg);
let mut buf = vec![0u8; 64];
root_bg.desc_slice(&sb, &file).read_at(0, &mut buf)?;
println!("{:x}", buf.as_hex());
Ok(())
}
$ cargo b -q && sudo ./target/debug/read-raw-ext4
Superblock {
magic: ef53,
block_size: 4096,
blocks_per_group: 32768,
inodes_per_group: 8192,
inode_size: 256,
}
[src/main.rs:92] &root_bg = BlockGroupNumber(
0,
)
[47 4 0 0 57 4 0 0 67 4 0 0 a6 4d ee 1f 2 0 4 0 0 0 0 0 2b 7e e7 d ed 1f a4 83 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 a8 e1 c3 9c 0 0 0 0]
As expected, inode 2 is in block group 0, and the descriptor slice, well, is not all zeroes, so that's a start.
The only thing we care about in the block descriptor is the offset of the inode table. Like most other locations, it's a block number.
This one's a bit annoying, because the lower 32-bits are
stored at 0x8
, and the upper 32-bits are stored at 0x28
.
We could do the bit twiddling directly in BlockGroupDescriptor
,
but let's add a convenience method to Reader
instead:
// in `impl Reader {`
fn u64_lohi(&self, lo: u64, hi: u64) -> Result<u64> {
Ok(self.u32(lo)? as u64 + ((self.u32(hi)? as u64) << 32))
}
And let's fill out BlockGroupDescriptor
:
#[derive(Debug)]
struct BlockGroupDescriptor {
inode_table: u64,
}
impl BlockGroupDescriptor {
const SIZE: u64 = 64;
fn new(slice: &dyn ReadAt) -> Result<Self> {
let r = Reader::new(slice);
Ok(Self {
inode_table: r.u64_lohi(0x8, 0x28)?,
})
}
}
Now, we can add a method to directly grab the block group
descriptor to BlockGroupNumber
:
// in impl BlockGroupNumber
fn desc(self, sb: &Superblock, dev: &dyn ReadAt) -> Result<BlockGroupDescriptor> {
let slice = self.desc_slice(sb, dev);
BlockGroupDescriptor::new(&slice)
}
Again, let's try it:
fn main() -> Result<()> {
color_eyre::install()?;
// open our ext4 partition, READ-ONLY.
let file = OpenOptions::new().read(true).open("/dev/sda3")?;
let sb = Superblock::new(&file)?;
println!("{sb:#?}");
let bgd = InodeNumber(2).blockgroup_number(&sb).desc(&sb, &file)?;
println!("{bgd:#?}");
Ok(())
}
$ cargo b -q && sudo ./target/debug/read-raw-ext4
Superblock {
magic: ef53,
block_size: 4096,
blocks_per_group: 32768,
inodes_per_group: 8192,
inode_size: 256,
}
BlockGroupDescriptor {
inode_table: 1127,
}
Hey, 1127 looks like a pretty reasonable number for the inode table of the first block group.
✨ We're making progress ✨
Now, the inode table is a series of fixed-size inodes.
The first field is a 16-bit (little-endian, still) integer that contains the
file mode. A bit later on, we find the inode's size, again divided between
lower 32 bits (at 0x4
) and upper 32 bits (at 0x6C
).
At 0x28
, we find the "block map or extent tree". It's a block of 60 bytes. I
have a feeling we're going to need that as well!
Let's add a convenience method to reader so we can grab that block easily:
// in impl Reader
fn vec(&self, offset: u64, len: usize) -> Result<Vec<u8>> {
let mut v = vec![0u8; len];
self.inner.read_exact_at(offset, &mut v)?;
Ok(v)
}
Next, let's add a method to grab the slice for a specific inode number.
// in impl InodeNumber
fn inode_slice<T>(self, sb: &Superblock, dev: T) -> Result<Slice<T>>
where
T: ReadAt,
{
let desc = self.blockgroup_number(sb).desc(sb, &dev)?;
let table_off = desc.inode_table * sb.block_size;
let idx_in_table = (self.0 - 1) % sb.inodes_per_group;
let inode_off = table_off + sb.inode_size * idx_in_table;
Ok(Slice::new(dev, inode_off, Some(sb.inode_size)))
}
Now, let's make an Inode
type, with all the data we wanted: mode, size,
and the 60-byte block. We'll use Reader::vec
which we added recently:
#[derive(CustomDebug)]
struct Inode {
#[debug(format = "{:o}")]
mode: u16,
size: u64,
#[debug(skip)]
block: Vec<u8>,
}
impl Inode {
fn new(slice: &dyn ReadAt) -> Result<Self> {
let r = Reader::new(slice);
Ok(Self {
mode: r.u16(0x0)?,
size: r.u64_lohi(0x4, 0x6C)?,
block: r.vec(0x28, 60)?,
})
}
}
And finally, let's add a method to InodeNumber
that directly reads
the inode:
// in impl InodeNumber
fn inode(self, sb: &Superblock, dev: &dyn ReadAt) -> Result<Inode> {
let slice = self.inode_slice(sb, dev)?;
Inode::new(&slice)
}
We're ready. Let's try it:
fn main() -> Result<()> {
color_eyre::install()?;
// open our ext4 partition, READ-ONLY.
let file = OpenOptions::new().read(true).open("/dev/sda3")?;
let sb = Superblock::new(&file)?;
println!("{sb:#?}");
let root_inode = InodeNumber(2).inode(&sb, &file)?;
println!("{root_inode:#?}");
Ok(())
}
$ cargo b -q && sudo ./target/debug/read-raw-ext4
Superblock {
magic: ef53,
block_size: 4096,
blocks_per_group: 32768,
inodes_per_group: 8192,
inode_size: 256,
}
Inode {
mode: 40755,
size: 4096,
}
Hurray! The values match what the stat
command reports:
$ stat /
File: /
Size: 4096 Blocks: 8 IO Block: 4096 directory
Device: 803h/2051d Inode: 2 Links: 23
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-03-11 12:40:08.175908741 +0100
Modify: 2022-12-26 15:42:11.483999585 +0100
Change: 2022-12-26 15:42:11.483999585 +0100
Birth: 2022-10-01 21:18:48.000000000 +0200
I'm sure there's lots of directories with the same permissions and size, so we could be reading the wrong inode, but at least we're probably not reading, say, a random data block instead.
Let's quickly add a Filetype
enum so we can make sure it's a directory.
The file type is contained in the mode:
These are mutually-exclusive file types 0x1000 S_IFIFO (FIFO) 0x2000 S_IFCHR (Character device) 0x4000 S_IFDIR (Directory) 0x6000 S_IFBLK (Block device) 0x8000 S_IFREG (Regular file) 0xA000 S_IFLNK (Symbolic link) 0xC000 S_IFSOCK (Socket)
We'll use the num_enum crate here:
$ cargo add num_enum@0.5
Updating crates.io index
Adding num_enum v0.5 to dependencies.
Features as of v0.5.0:
+ std
- complex-expressions
- external_doc
Updating crates.io index
use num_enum::*;
use std::convert::TryFrom;
#[derive(Debug, TryFromPrimitive)]
#[repr(u16)]
enum Filetype {
Fifo = 0x1000,
CharacterDevice = 0x2000,
Directory = 0x4000,
BlockDevice = 0x6000,
Regular = 0x8000,
SymbolicLink = 0xA000,
Socket = 0xC000,
}
And add a getter to Inode
:
// in impl Inode
fn filetype(&self) -> Filetype {
Filetype::try_from(self.mode & 0xF000).unwrap()
}
Cool bear's hot tip
To keep the example code concise, we use unwrap()
here.
If we somehow read an inode from the wrong slice, or if the inode has an invalid mode, our program will panic.
This is not desirable in a real filesystem implementation, but here it works as a nice sanity check.
Let's give it a shot:
// in main
let root_inode = InodeNumber(2).inode(&sb, &file)?;
let root_inode_type = root_inode.filetype();
println!("({root_inode_type:?}) {root_inode:#?}");
$ cargo b -q && sudo ./target/debug/read-raw-ext4
Superblock {
magic: ef53,
block_size: 4096,
blocks_per_group: 32768,
inodes_per_group: 8192,
inode_size: 256,
}
(Directory) Inode {
mode: 40755,
size: 4096,
}
Everything looks fine!
Now that we're sure we're dealing with a directory, we can start reading its entries.
But first - remember extents? Those are in the block we just read!
The kernel documentation is here to help.
#[derive(Debug)]
struct ExtentHeader {
entries: u64,
depth: u64,
}
impl ExtentHeader {
fn new(slice: &dyn ReadAt) -> Result<Self> {
let r = Reader::new(slice);
let magic = r.u16(0x0)?;
assert_eq!(magic, 0xF30A);
Ok(Self {
entries: r.u16(0x2)? as u64,
depth: r.u16(0x6)? as u64,
})
}
}
Let's read one.
// in main
let ext_header = ExtentHeader::new(&Slice::new(&root_inode.block, 0, Some(12)))?;
println!("{ext_header:#?}");
$ cargo b -q && sudo ./target/debug/read-raw-ext4
(cut)
ExtentHeader {
entries: 1,
depth: 0,
}
Well, it's not much but - again, the values make sense. It's followed
by a single entry, so we know the data for /
is stored in a single range
of contiguous blocks. It has a depth of 0
, which means it's a leaf node.
Cool bear's hot tip
In our case, we'll assume that /
, /etc
, and /etc/hosts
all only have
one extent node.
This makes sense, since there's only 24 children for /
, only 204 children
for /etc
, and /etc/hosts
is shorter (511) than a block (4096).
However, we'll verify our assumptions with assert!
.
#[derive(Debug)]
struct Extent {
len: u64,
start: u64,
}
impl Extent {
fn new(slice: &dyn ReadAt) -> Result<Self> {
let r = Reader::new(slice);
Ok(Self {
len: r.u16(0x4)? as u64,
// the block number the extent points to is split
// between upper 16-bits and lower 32-bits.
start: ((r.u16(0x6)? as u64) << 32) + r.u32(0x8)? as u64,
})
}
}
Our complete extent reading code becomes:
// in main, after finding root_inode
let ext_header = ExtentHeader::new(&Slice::new(&root_inode.block, 0, Some(12)))?;
println!("{ext_header:#?}");
assert_eq!(ext_header.depth, 0);
assert_eq!(ext_header.entries, 1);
let ext = Extent::new(&Slice::new(&root_inode.block, 12, Some(12)))?;
println!("{:#?}", ext);
$ cargo b -q && sudo ./target/debug/read-raw-ext4
(Directory) Inode {
mode: 40755,
size: 4096,
}
ExtentHeader {
entries: 1,
depth: 0,
}
Extent {
len: 1,
start: 9319,
}
Hey, those are still reasonable values!
According to this, we can find the data for /
at block 9319
.
Let's take a moment to add a convenience method to the Inode
type:
// in impl Inode
fn data<T>(&self, sb: &Superblock, dev: T) -> Result<Slice<T>>
where
T: ReadAt,
{
let ext_header = ExtentHeader::new(&Slice::new(&self.block, 0, Some(12)))?;
assert_eq!(ext_header.depth, 0);
assert_eq!(ext_header.entries, 1);
let ext = Extent::new(&Slice::new(&self.block, 12, Some(12)))?;
assert_eq!(ext.len, 1);
let offset = ext.start * sb.block_size;
let len = ext.len * sb.block_size;
Ok(Slice::new(dev, offset, Some(len)))
}
And then dump the first 128 bytes of /
's data, as if it were a string:
// in main
let root_inode = InodeNumber(2).inode(&sb, &file)?;
let root_inode_type = root_inode.filetype();
println!("({root_inode_type:?}) {root_inode:#?}");
let root_data = root_inode.data(&sb, &file)?;
let data_start = Reader::new(&root_data).vec(0, 128)?;
println!("{}", String::from_utf8_lossy(&data_start));
$ $ cargo b -q && sudo ./target/debug/read-raw-ext4
(Directory) Inode {
mode: 40755,
size: 4096,
}
.
..
lost+found(
boot
swapfile�
etc�media�
var
bin@
Now, I don't know about you, but this looks very exciting to me.
We can see some of the top-level directory names in there: lost+found
, boot
,
swapfile
— and even .
and ..
! (the current directory and parent directory,
respectively).
Let's read them proper: the directory entry format is actually pretty straightforward.
#[derive(CustomDebug)]
struct DirectoryEntry {
#[debug(skip)]
len: u64,
inode: InodeNumber,
name: String,
}
impl DirectoryEntry {
fn new(slice: &dyn ReadAt) -> Result<Self> {
let r = Reader::new(slice);
// adding `Reader::u8` is left as an exercise to the reader
let name_len = r.u8(0x6)? as usize;
Ok(Self {
inode: InodeNumber(r.u32(0x0)? as u64),
len: r.u16(0x4)? as u64,
name: String::from_utf8_lossy(&r.vec(0x8, name_len)?).into(),
})
}
}
We'll simply add a method to Inode
that returns a vec of DirectoryEntry
.
This won't fit all usecases, but it'll fit ours!
// in impl Inode
fn dir_entries(&self, sb: &Superblock, dev: &dyn ReadAt) -> Result<Vec<DirectoryEntry>> {
let data = self.data(sb, dev)?;
let mut entries = Vec::new();
let mut offset: u64 = 0;
loop {
let entry = DirectoryEntry::new(&Slice::new(&data, offset, None))?;
if entry.inode.0 == 0 {
break;
}
offset += entry.len;
entries.push(entry);
}
Ok(entries)
}
Now for the moment of truth:
// in main:
let root_inode = InodeNumber(2).inode(&sb, &file)?;
let root_inode_type = root_inode.filetype();
println!("({root_inode_type:?}) {root_inode:#?}");
let root_entries = root_inode.dir_entries(&sb, &file)?;
println!("{root_entries:#?}");
$ cargo b -q && sudo ./target/debug/read-raw-ext4
(Directory) Inode {
mode: 40755,
size: 4096,
}
[
DirectoryEntry {
inode: InodeNumber(
2,
),
name: ".",
},
DirectoryEntry {
inode: InodeNumber(
2,
),
name: "..",
},
DirectoryEntry {
inode: InodeNumber(
11,
),
name: "lost+found",
},
DirectoryEntry {
inode: InodeNumber(
19398657,
),
name: "boot",
},
DirectoryEntry {
inode: InodeNumber(
12,
),
name: "swapfile",
},
DirectoryEntry {
inode: InodeNumber(
16252929,
),
name: "etc",
},
(cut)
🎉🎉🎉
Everything appears to be in order.
Let's add a convenience method to find a particular child:
// in impl Inode
fn child(&self, name: &str, sb: &Superblock, dev: &dyn ReadAt) -> Result<Option<InodeNumber>> {
let entries = self.dir_entries(sb, dev)?;
Ok(entries
.into_iter()
.filter(|x| x.name == name)
.map(|x| x.inode)
.next())
}
Cool bear's hot tip
Why is the return type Result<Option<T>>
?
Well, reading from the partition might fail - and that would return Err(e)
.
Reading from the partition might succeed, but there might be no child
with this name. That would return Ok(None)
.
Finally, reading might succeed, and we might find such a child, and
that would return Ok(Some(n))
.
And let's use it:
// in main
let root_inode = InodeNumber(2).inode(&sb, &file)?;
println!("({:?}) {:#?}", root_inode.filetype(), root_inode);
let etc_inode = root_inode.child("etc", &sb, &file)?;
println!("{etc_inode:?}");
$ cargo b -q && sudo ./target/debug/read-raw-ext4
(Directory) Inode {
mode: 40755,
size: 4096,
}
Some(InodeNumber(16252929))
$ stat /etc
File: /etc
Size: 12288 Blocks: 24 IO Block: 4096 directory
Device: 803h/2051d Inode: 16252929 Links: 146
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-03-11 19:51:34.661133912 +0100
Modify: 2023-03-11 19:47:58.858792961 +0100
Change: 2023-03-11 19:47:58.858792961 +0100
Birth: 2022-10-01 21:18:52.867725869 +0200
Everything checks out.
Let's go one step further and read that inode too:
// in main
let root_inode = InodeNumber(2).inode(&sb, &file)?;
let root_inode_type = root_inode.filetype();
println!("({root_inode_type:?}) {root_inode:#?}");
let etc_inode = root_inode
.child("etc", &sb, &file)?
.expect("/etc should exist")
.inode(&sb, &file)?;
let etc_inode_type = etc_inode.filetype();
println!("({etc_inode_type:?}) {etc_inode:#?}");
$ cargo b -q && sudo ./target/debug/read-raw-ext4
(Directory) Inode {
mode: 40755,
size: 4096,
}
(Directory) Inode {
mode: 40755,
size: 12288,
}
Still good. Can we find hosts
in there?
// in main
let hosts_inode = etc_inode
.child("hosts", &sb, &file)?
.expect("/etc/hosts should exist")
.inode(&sb, &file)?;
let hosts_inode_type = hosts_inode.filetype();
println!("({hosts_inode_type:?}) {hosts_inode:#?}");
$ cargo b -q && sudo ./target/debug/read-raw-ext4
(Directory) Inode {
mode: 40755,
size: 4096,
}
(Directory) Inode {
mode: 40755,
size: 12288,
}
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `2`,
right: `1`', src/main.rs:110:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Ah!
See, when I originally wrote this series, all the way back in 2019, on an ArchLinux system installed on a Lenovo X200 (I know right?), this used to work.
But now it no longer does, because there's two entries in our extent header, not just one.
So, that's okay, we can adjust our code a little!
First, let's add an Inode::data_count
method that returns the number of data
blocks associated with an inode:
// in impl Inode
fn data_count(&self) -> Result<u64> {
let ext_header = ExtentHeader::new(&Slice::new(&self.block, 0, Some(12)))?;
assert_eq!(ext_header.depth, 0);
Ok(ext_header.entries)
}
And we'll change the data
function to take an index
, so it can retrieve any
of the data blocks.
// in impl Inode
// new! this method takes an index parameter 👇
fn data<T>(&self, sb: &Superblock, dev: T, index: u64) -> Result<Slice<T>>
where
T: ReadAt,
{
let ext_header = ExtentHeader::new(&Slice::new(&self.block, 0, Some(12)))?;
assert_eq!(ext_header.depth, 0);
// 👇 new: (previously was asserting only 1 entry, now checking index)
assert!(index < ext_header.entries);
// 👇 new: the offset depends of the index
let offset = 12 * (index + 1);
let ext = Extent::new(&Slice::new(&self.block, offset, Some(12)))?;
assert_eq!(ext.len, 1);
let offset = ext.start * sb.block_size;
let len = ext.len * sb.block_size;
Ok(Slice::new(dev, offset, Some(len)))
}
Now, in dir_entries
, where we call self.data
, we actually have to pass an
index! Also, our end condition (finding an entry with inode number 0) only works
for the last data block. For the ones before that, we must stop when we reach
the end of the extent.
// in impl Inode
fn dir_entries(&self, sb: &Superblock, dev: &dyn ReadAt) -> Result<Vec<DirectoryEntry>> {
let mut entries = Vec::new();
for block_index in 0..self.data_count()? {
let data = self.data(sb, dev, block_index)?;
let data_size = data.size()?.unwrap();
let mut offset: u64 = 0;
while offset < data_size {
let entry = DirectoryEntry::new(&Slice::new(&data, offset, None))?;
if entry.inode.0 == 0 {
break;
}
offset += entry.len;
entries.push(entry);
}
}
Ok(entries)
}
And now...
$ cargo b -q && sudo ./target/debug/read-raw-ext4
(Directory) Inode {
mode: 40755,
size: 4096,
}
(Directory) Inode {
mode: 40755,
size: 12288,
}
(Regular) Inode {
mode: 100644,
size: 220,
}
It finds /etc/hosts
!
excitement intensifies
And can we read it as a string??
// in main
let hosts_data = hosts_inode.data(&sb, &file, 0)?;
let hosts_data = Reader::new(&hosts_data).vec(0, hosts_inode.size as usize)?;
let hosts_data = String::from_utf8_lossy(&hosts_data);
println!("---------------------------------------------");
println!("{hosts_data}");
$ cargo b -q && sudo ./target/debug/read-raw-ext4
(Directory) Inode {
mode: 40755,
size: 4096,
}
(Directory) Inode {
mode: 40755,
size: 12288,
}
(Regular) Inode {
mode: 100644,
size: 220,
}
---------------------------------------------
127.0.0.1 localhost
127.0.1.1 sonic
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
We did it!
You can find the complete code listing on GitHub:
It clocks in at ~320 lines of Rust code.
If you've read all the way to the end, consider supporting my work through donations, if you aren't already.
Also, if you've run out of things to read, I make videos!
Here's another article just for you:
Pin and suffering
I'd like to think that my understanding of "async Rust" has increased over the past year or so. I'm 100% onboard with the basic principle: I would like to handle thousands of concurrent tasks using a handful of threads. That sounds great!
And to become proficient with async Rust, I've accepted a lot of things. There are blue functions and red functions, and red (async) functions are contagious.