Welcome back to the eighteenth and final part of "Making our own executable packer".

In the last article, we had a lot of fun. We already had a "packer" executable, minipak, which joined together stage1 (a launcher) and a compressed version of whichever executable we wanted to pack.

What we added was a whole bunch of abstractions to parse ELF headers using deku, which we used from stage1 to be able to launch the guest executable from memory, instead of writing it to a file and using execve on it.

But then we discovered that, unfortunately, that only worked if the guest was a relocatable executable, i.e. if it could be mapped anywhere.

For non-relocatable executables, well... if the guest executable and our stage1 launcher have the ~same base address, like here for example:

Shell session
$ readelf -Wl ~/go/bin/hugo | grep -A 2 "Program Headers"
Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
                                      👇
  PHDR           0x000040 0x0000000000400040 0x0000000000400040 0x000188 0x000188 R   0x1000
Shell session
$ readelf -Wl ./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/stage1 | grep -A 2 "Program Headers"
Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
                                      👇
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x000224 0x000224 R   0x1000

Then the resulting executable, the "packed" executable, has the same base address as stage1:

Shell session
$ readelf -Wl /tmp/hugo.pak | grep -A 2 "Program Headers"
Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
                                      👇
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x000224 0x000224 R   0x1000

...and when it tries to map the guest binary (hugo) to that same address, it overwrites itself, and kaboom!
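To make the collision concrete, here's a minimal sketch of the overlap check, using the VirtAddr and MemSiz columns from the readelf output above. The helper names (`mapped_range`, `ranges_overlap`) are made up for illustration; they are not part of minipak or pixie.

```rust
use std::ops::Range;

/// Hypothetical helper: turns a `(VirtAddr, MemSiz)` pair, as printed by
/// `readelf -Wl`, into a half-open address range.
fn mapped_range(vaddr: u64, mem_size: u64) -> Range<u64> {
    vaddr..vaddr + mem_size
}

/// Returns true when two half-open address ranges share at least one byte.
fn ranges_overlap(a: &Range<u64>, b: &Range<u64>) -> bool {
    a.start < b.end && b.start < a.end
}
```

With stage1's first LOAD segment (`mapped_range(0x400000, 0x224)`) and hugo's PHDR segment (`mapped_range(0x400040, 0x188)`), `ranges_overlap` returns true: both want the same bytes of address space.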

So, we're going to have to fix that, because to qualify as a "real executable packer", we'd like our packer to be able to process both relocatable and non-relocatable binaries.

Why?

Well, you know what binaries are very large and also typically non-relocatable? Go binaries! Just like hugo. If minipak had any practical use, it would probably be to compress statically-linked Go binaries.

Meanwhile, in Rust nightly land

Before we do anything useful, let's check out what changed in Rust nightly since we last tried to work on minipak.

Let's bump rust-toolchain to the latest nightly version (at the time of this writing) that has all the components.

[toolchain]
channel = "nightly-2021-04-25"
components = ["rustfmt", "clippy"]
targets = ["x86_64-unknown-linux-gnu"]
Shell session
$ cargo clean
(this installs the newer toolchain)
$ cargo build
   Compiling proc-macro2 v1.0.24
   Compiling unicode-xid v0.2.1
(cut)
   Compiling stage1 v0.1.0 (/home/amos/ftl/minipak/crates/stage1)
error[E0557]: feature has been removed
  --> crates/stage1/src/main.rs:12:12
   |
12 | #![feature(link_args)]
   |            ^^^^^^^^^ feature has been removed
   |
   = note: removed in favor of using `-C link-arg=ARG` on command line, which is available from cargo build scripts with `cargo:rustc-link-arg` now

error: cannot find attribute `link_args` in this scope
  --> crates/stage1/src/main.rs:16:3
   |
16 | #[link_args = "-nostartfiles -nodefaultlibs -static"]
   |   ^^^^^^^^^ help: a built-in attribute with a similar name exists: `linkage`

Uh oh! A feature has disappeared from underneath us.

Well, that's what you get for using nightly.

Yeah, and everything would've been fine if I hadn't manually bumped the version in rust-toolchain!

But I like staying on top of things, so let's actually make the required changes here.

Instead of specifying linker arguments in the source, we now have to specify them in build scripts. We don't have a build script for stage1 yet, so let's add it:

Rust code
// in `crates/stage1/build.rs`

fn main() {
    println!(
        "cargo:rustc-link-arg={}",
        "-nostartfiles -nodefaultlibs -static"
    );
}

Then let's remove both #![feature(link_args)] and #[link_args = ...] from stage1, and...

Shell session
$ cargo b
   Compiling stage1 v0.1.0 (/home/amos/ftl/minipak/crates/stage1)
   Compiling minipak v0.1.0 (/home/amos/ftl/minipak/crates/minipak)
warning: cargo:rustc-link-arg requires -Zextra-link-arg flag
error: linking with `cc` failed: exit status: 1
  |
  = note: "cc" "-m64" "-Wl,--eh-frame-hdr" "-Wl,-znoexecstack" "-Wl,--as-needed" "-L" "/home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "/home/amos/ftl/minipak/tar
  (cut: a very, very long error)

Oh no! We need to opt into this new cargo feature, just like we opted into the #[link_args] feature before.

We can do so with a .cargo/config file (not the one in the home folder, but one directly in our minipak/ cargo workspace):

[unstable]
extra-link-arg = true

Surely now things will work!

Shell session
$ cargo b
   Compiling minipak v0.1.0 (/home/amos/ftl/minipak/crates/minipak)
   Compiling stage1 v0.1.0 (/home/amos/ftl/minipak/crates/stage1)
error: linking with `cc` failed: exit status: 1
  |
  = note: "cc" "-m64" "-Wl,--eh-frame-hdr"
  (cut: very long error)
  = note: cc: error: unrecognized command-line option '-nostartfiles -nodefaultlibs -static'

Ohh.

Oh hey, it thinks we're passing all three arguments as.. a single argument!

Yeah! The problem with command-line arguments is... when you're invoking a command from a shell, like that:

Shell session
$ command foo bar baz

Then command gets three arguments: foo, bar, and baz.

But here, in our build script:

Rust code
// in `crates/stage1/build.rs`

fn main() {
    println!(
        "cargo:rustc-link-arg={}",
        "-nostartfiles -nodefaultlibs -static"
    );
}

Every cargo:rustc-link-arg=SOMETHING line is supposed to be its own argument.

So it's as if we did this instead:

Shell session
$ command "foo bar baz"

Well, that's no problem at all! We can just write three lines instead:

Rust code
// in `crates/stage1/build.rs`

fn main() {
    for &arg in &["-nostartfiles", "-nodefaultlibs", "-static"] {
        println!("cargo:rustc-link-arg={}", arg);
    }
}
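If the flags ever come from a single string (say, an environment variable), the same idea can be sketched as a small helper that emits one directive per flag. `link_arg_directives` is a hypothetical name, and the splitting is deliberately naive: it assumes no flag contains quoted whitespace.

```rust
/// Splits a whitespace-separated flag string into one
/// `cargo:rustc-link-arg=...` directive per flag.
///
/// Assumption: no individual flag contains spaces, so plain whitespace
/// splitting is good enough here.
fn link_arg_directives(flags: &str) -> Vec<String> {
    flags
        .split_whitespace()
        .map(|flag| format!("cargo:rustc-link-arg={}", flag))
        .collect()
}
```

A build script would then print each returned line with `println!("{}", line)`, which amounts to the same thing as the loop above.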

And now...

Shell session
$ cargo b
   Compiling stage1 v0.1.0 (/home/amos/ftl/minipak/crates/stage1)
   Compiling minipak v0.1.0 (/home/amos/ftl/minipak/crates/minipak)
error[E0557]: feature has been removed
  --> crates/minipak/src/main.rs:12:12
   |
12 | #![feature(link_args)]
   |            ^^^^^^^^^ feature has been removed
   |
   = note: removed in favor of using `-C link-arg=ARG` on command line, which is available from cargo build scripts with `cargo:rustc-link-arg` now

error: cannot find attribute `link_args` in this scope
  --> crates/minipak/src/main.rs:16:3
   |
16 | #[link_args = "-nostartfiles -nodefaultlibs -static"]
   |   ^^^^^^^^^ help: a built-in attribute with a similar name exists: `linkage`

..well, now we have to give the same treatment to the minipak crate — remember, we have several crates in our cargo workspace: encore, minipak, pixie and stage1.

Well it turns out minipak already has a build script, so let's just move some things around:

Cool bear's hot tip

minipak already has a build script because it includes the stage1 binary into itself at compile time, so it's able to generate "packed binaries" later.

Rust code
// in `crates/minipak/build.rs`

use std::{
    path::{Path, PathBuf},
    process::Command,
};

fn main() {
    for &arg in &["-nostartfiles", "-nodefaultlibs", "-static"] {
        println!("cargo:rustc-link-arg={}", arg);
    }

    cargo_build(&PathBuf::from("../stage1"));
}

fn cargo_build(path: &Path) {
    // omitted (same as before)
}

Now we just need to remember to remove #[link_args] and #![feature(link_args)] from the minipak crate, and...

Shell session
$ cargo run --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak && /tmp/hugo.pak
   Compiling minipak v0.1.0 (/home/amos/ftl/minipak/crates/minipak)
    Finished release [optimized + debuginfo] target(s) in 1.93s
     Running `target/release/minipak /home/amos/go/bin/hugo -o /tmp/hugo.pak`
Wrote /tmp/hugo.pak (51.05% of input)
The guest is at 18380..1edd205
[1]    9398 segmentation fault  /tmp/hugo.pak

Yeah! That's as far as we had gotten last article.

Okay. Good. Staying on top of things, alright!

Formulating a plan

So, let's try to summarize the predicament we find ourselves in.

We're trying to make two things fit into the same executable: stage1 (our launcher) and the guest.

If they have the same base address, then by the time stage1 is mapped and up and running, it's too late: after we decompress hugo somewhere in memory, we cannot copy it where it wants to be, because that's the same place stage1 already occupies!

Ooh, ooh, I know! Why don't we simply make stage1 relocatable?

Well! If we made stage1 relocatable, then it would be mapped "at a random address", thanks to Address space layout randomization (ASLR).

And, I agree that it would be really unlucky that the address picked would be the same as the fixed address that our guest wants, but it could happen, and then we would overwrite ourselves all over again, which...

Kaboom?

Kaboom, yes.

Okay, well... if we made stage1 relocatable, then when minipak runs, it could relocate it to somewhere else, that doesn't conflict with the guest! Right?

Well... it could, definitely, yes. But there's an additional constraint: we can't just "get out of the way of the guest", because "truly static" executables, i.e. executables that are non-relocatable, expect the heap to be at a specific address. They know exactly where it should be.

Not all executables use that, but if we want to make a "real executable packer", then we should try to make sure that the brk (the top of the heap) is where our guest expects it to be.

That.. that was hot nonsense, what are you even talking about?

Fair, let's make a few diagrams.

The brk, the brk and nothing but the brk

We've made that kind of diagram a lot, but let's give it one more try.

We know that an ELF file is basically just a database of "segments" to be mapped in memory in specific places. Sometimes the layout in memory looks an awful lot like the layout on disk, with some exceptions.

Here for example, the "globals" segment is bigger in memory — all the zero-initialized globals go last, and they're not mapped from the file:

The important bit in the diagram above is that the brk's initial value (the value initially returned by the brk syscall) is at the end of the globals — the end of the last segment.
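The "bigger in memory" part is just the difference between a segment's MemSiz and FileSiz. As a rough sketch (the function name is ours, not something from pixie), here's how many trailing bytes get zero-filled rather than read from the file:

```rust
/// Number of bytes at the end of a segment that are zero-filled in memory
/// instead of being mapped from the file (the zero-initialized globals).
///
/// Returns `None` for a malformed segment whose file size exceeds its
/// memory size.
fn zero_fill_len(file_size: u64, mem_size: u64) -> Option<u64> {
    mem_size.checked_sub(file_size)
}
```

For the RW segment of hello-pie that readelf reports further down (FileSiz 0x005ba8, MemSiz 0x007438), `zero_fill_len(0x5ba8, 0x7438)` gives `Some(0x1890)`.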

It's kinda hard to show this, since not all programs even use brk — hugo, for example, doesn't.

But if we take a C program, like samples/hello-pie, which was just this:

C code
// in `samples/hello-pie.c`

#include <stdio.h>

int main() {
    printf("Hello! I am a C program.\n");

    return 0;
}

Then we can catch the first brk syscall from GDB:

Shell session
$ gdb --quiet --args ./samples/hello-pie
Reading symbols from ./samples/hello-pie...
(gdb) starti
Starting program: /home/amos/ftl/minipak/samples/hello-pie

Program stopped.
0x00007ffff7f4b840 in _start ()
(gdb) catch syscall brk
Catchpoint 1 (syscall 'brk' [12])
(gdb) c
Continuing.

Catchpoint 1 (call to syscall brk), 0x00007ffff7fad82b in brk ()
(gdb) c
Continuing.

Catchpoint 1 (returned from syscall brk), 0x00007ffff7fad82b in brk ()
(gdb) p/x $rax
$1 = 0x7ffff7fff000
(gdb)

Since hello-pie is relocatable, the calculation is a bit more complicated, since we need to take into account the base address that GDB picked. In this case, it's...

Shell session
(gdb) info proc mappings
process 18852
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
      0x7ffff7f3e000     0x7ffff7f41000     0x3000        0x0 [vvar]
      0x7ffff7f41000     0x7ffff7f43000     0x2000        0x0 [vdso]
           👇
      0x7ffff7f43000     0x7ffff7f4b000     0x8000        0x0 /home/amos/ftl/minipak/samples/hello-pie
      0x7ffff7f4b000     0x7ffff7fcd000    0x82000     0x8000 /home/amos/ftl/minipak/samples/hello-pie
      0x7ffff7fcd000     0x7ffff7ff6000    0x29000    0x8a000 /home/amos/ftl/minipak/samples/hello-pie
      0x7ffff7ff7000     0x7ffff7ffe000     0x7000    0xb3000 /home/amos/ftl/minipak/samples/hello-pie
      0x7ffff7ffe000     0x7ffff7fff000     0x1000        0x0 [heap]
      0x7ffffffdd000     0x7ffffffff000    0x22000        0x0 [stack]

0x7ffff7f43000.

So, if we look at the "Load" segments for hello-pie:

Shell session
$ readelf -Wl ./samples/hello-pie | grep -A 8 "Program Headers" | grep -E "MemSiz|LOAD"
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x007f20 0x007f20 R   0x1000
  LOAD           0x008000 0x0000000000008000 0x0000000000008000 0x081f7d 0x081f7d R E 0x1000
  LOAD           0x08a000 0x000000000008a000 0x000000000008a000 0x028bc8 0x028bc8 R   0x1000
                                 👇                                         👇          👇
  LOAD           0x0b3768 0x00000000000b4768 0x00000000000b4768 0x005ba8 0x007438 RW  0x1000

We can see that it's supposed to be at...

Shell session
(gdb) p/x 0x7ffff7f43000 + 0x00000000000b4768 + 0x007438
$2 = 0x7ffff7ffeba0

Well, we also need to align that to 0x1000:

Shell session
(gdb) p/x (0x7ffff7ffeba0 & ~(0xFFF)) + 0x1000
$3 = 0x7ffff7fff000

And we find... 0x7ffff7fff000! The value we got back from the very first brk syscall.

I'm getting mixed up with all these sevens and effs.

Yeah, me too, but I promise, they're the same value.

(gdb) p/x $3 - $1
$4 = 0x0
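The same arithmetic, as a small Rust sketch. The function names are made up for illustration. One difference from the GDB expression above: `align_up` leaves an already-aligned address alone instead of always adding a page. This also assumes no brk randomization, which matches the GDB session (GDB disables ASLR by default); with ASLR on, the kernel may place the brk elsewhere.

```rust
/// Rounds `addr` up to the next multiple of `align`.
/// `align` must be a power of two (0x1000 for 4 KiB pages).
fn align_up(addr: u64, align: u64) -> u64 {
    debug_assert!(align.is_power_of_two());
    (addr + align - 1) & !(align - 1)
}

/// Expected initial brk: the page-aligned end of the last LOAD segment,
/// offset by the base address the executable was mapped at.
fn expected_initial_brk(base: u64, vaddr: u64, mem_size: u64, page_size: u64) -> u64 {
    align_up(base + vaddr + mem_size, page_size)
}
```

Plugging in the values from above, `expected_initial_brk(0x7ffff7f43000, 0xb4768, 0x7438, 0x1000)` comes out to 0x7ffff7fff000.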

So, the problem with "just moving the launcher out of the way of the guest" is that we'll end up with the wrong brk value on startup:

Can't we adjust the brk after startup?

You'd think so, but: not really, no. If the guest was after our launcher, we could sorta kinda do that.

Effectively, we'd "allocate on the heap" the distance between the end of the launcher and the end of the guest, and the value would be correct — but it would look all wrong in debuggers, for example.

And what if the guest is non-relocatable and has a fixed base address that's very low? Low enough that we don't have room to fit our launcher?

Yeah, sounds messy.

So, here's my proposal: we do it in two steps.

Two stages, if you will.

First off, because we need brk to be at the correct position, we generate an executable that will run stage1, but that will have a layout very similar to the guest.

We then have stage1 map stage2 (a dynamic library) out of the way of the guest, and finally, we have stage2 map the guest (a non-relocatable executable), and jmp to it:

Well. When you put it like that, it seems real simple.

Yeah, and it's going to be! Because we got solid abstractions, and a good understanding of what it is we need to do. That's what the seventeen previous articles were all about.

Okay, what do we start with?

Chaaaaaaaaange places!

Well, we start with throwing away our current stage1 crate completely.

In our current version of minipak, stage1 is a static, non-relocatable executable, which we simply concatenate with our compressed guest payload (and a manifest, so that we can find everything at runtime).

But moving forward, stage1 is going to be... a dynamic library.

A library?

Yes!

But... stage1 is effectively the template for the output minipak produces, right?

That is correct.

So does that mean... we're going to turn a dynamic library into an executable.

Yes!

So let's do that.

Shell session
$ rm -rf crates/stage1
$ (cd crates && cargo new --lib stage1)
     Created library `stage1` package

We gotta be more specific, since cargo new --lib makes a rust library, whereas we want a "C dynamic library", or cdylib. And we'd like to use our encore crate:

TOML markup
# in crates/stage1/Cargo.toml

[lib]
crate-type = ["cdylib"]

[dependencies]
encore = { path =  "../encore" }

So, just as before, our library is going to be no_std. Let's export a single empty function, to see what we get:

Rust code
// Don't use libstd
#![no_std]
// Use the default allocation error handler
#![feature(default_alloc_error_handler)]

use encore::prelude::*;

/// # Safety
/// Wildly unsafe, do not call.
#[no_mangle]
pub unsafe extern "C" fn entry() {}
Shell session
$ cargo clean

$ cargo build --package stage1
(cut)
    Finished dev [unoptimized + debuginfo] target(s) in 5.34s

$ nm -D ./target/debug/libstage1.so
00000000000013f0 T bcmp
                 w __cxa_finalize
0000000000001330 T entry
                 w __gmon_start__
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
00000000000024a0 T memcmp
0000000000002110 T memcpy
0000000000002200 T memmove
00000000000023f0 T memset
00000000000013e0 T _Unwind_Resume

Alright, cool! There's our entry symbol right there. I'm sure if we move some things around, we can make that the entry point of an executable.

But before we go any further... I'd like to just add a tiny teeny linker flag.

The problem with making dynamic libraries — or "shared" libraries, as GCC and GNU ld tend to call them, is that having undefined symbols is a-okay.

For example, if we were to call a function that does not exist from within stage1:

Rust code
// in `crates/stage1/src/lib.rs`

// Don't use libstd
#![no_std]
// Use the default allocation error handler
#![feature(default_alloc_error_handler)]

use encore::prelude::*;

extern "C" {
    fn i_do_not_exist();
}

/// # Safety
/// Wildly unsafe, do not call.
#[no_mangle]
pub unsafe extern "C" fn entry() {
    i_do_not_exist();
}

Then it would still build just fine:

Shell session
$ cargo build --quiet --package stage1 && nm -D ./target/debug/libstage1.so
00000000000013f0 T bcmp
                 w __cxa_finalize
0000000000001330 T entry
                 w __gmon_start__
                 👇 👇
                 U i_do_not_exist
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
00000000000024a0 T memcmp
0000000000002110 T memcpy
0000000000002200 T memmove
00000000000023f0 T memset
00000000000013e0 T _Unwind_Resume

It would just... ask for an i_do_not_exist symbol. So the error would be pushed to load time, i.e. whenever the library is dlopened by some program. Or possibly even later, when i_do_not_exist is called, if the dynamic loader is feeling particularly lazy.

There's just one problem with that...

We do not intend to dlopen this file. So we'll never see the error, just a crash.

Our libstage1.so must be entirely self-contained. It cannot possibly depend on something else that's "already present at load time".

In other words, the output of nm -D libstage1.so must not contain any U entries.

The good news is: there is a linker flag for that! (And also, that is the default behavior for executables).

So, just as before, we'll need to add a build script for the stage1 crate:

Rust code
// in `crates/stage1/build.rs`

fn main() {
    println!("cargo:rustc-link-arg=-Wl,-z,defs");
}

And with that:

Shell session
$ cargo build --quiet --package stage1 && nm -D ./target/debug/libstage1.so
error: linking with `cc` failed: exit status: 1
  |
  = note: "cc" "-m64" "-Wl,--eh-frame-hdr" "-Wl,-znoexecstack" "-Wl,--as-needed" "-L" "/home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "/home/amos/ftl/minipak/target/debug/deps/stage1.4mxsy4jfjjxw9bnk.rcgu.o" "-o" "/home/amos/ftl/minipak/target/debug/deps/libstage1.so" "-Wl,--version-script=/tmp/rustcjlHmwn/list" "/home/amos/ftl/minipak/target/debug/deps/stage1.4vnmey6720pw9y0s.rcgu.o" "-Wl,--gc-sections" "-shared" "-Wl,-zrelro" "-Wl,-znow" "-nodefaultlibs" "-L" "/home/amos/ftl/minipak/target/debug/deps" "-L" "/home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-Wl,-Bstatic" "/home/amos/ftl/minipak/target/debug/deps/libencore-3df7881431236f53.rlib" "/home/amos/ftl/minipak/target/debug/deps/libbitflags-a7b2d45df01a1ead.rlib" "/home/amos/ftl/minipak/target/debug/deps/liblinked_list_allocator-e21337cbc3886b5a.rlib" "/home/amos/ftl/minipak/target/debug/deps/libspinning_top-9628df27563378b7.rlib" "/home/amos/ftl/minipak/target/debug/deps/liblock_api-6930031095633bf5.rlib" "/home/amos/ftl/minipak/target/debug/deps/libscopeguard-3fda057cf6676019.rlib" "/home/amos/ftl/minipak/target/debug/deps/librlibc-0003478002090c29.rlib" "/home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/liballoc-9849bb0fbad7f0f5.rlib" "/home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcompiler_builtins-8b33f9cbbc9652fe.rlib" "/home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/librustc_std_workspace_core-a1fd7734706d5518.rlib" "/home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcore-c8ded1707ad10767.rlib" "/home/amos/ftl/minipak/target/debug/deps/libcompiler_builtins-98a8751107c546a9.rlib" "-Wl,-z,defs" 
"-Wl,-Bdynamic"
  = note: /usr/sbin/ld: /home/amos/ftl/minipak/target/debug/deps/stage1.4mxsy4jfjjxw9bnk.rcgu.o: in function `entry':
          /home/amos/ftl/minipak/crates/stage1/src/lib.rs:17: undefined reference to `i_do_not_exist'
          collect2: error: ld returned 1 exit status


error: aborting due to previous error

error: could not compile `stage1`

To learn more, run the command again with --verbose.

...it errors out if we ever refer to something that stage1 does not define itself.
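If we wanted a belt-and-suspenders check on top of the linker flag, the "no U entries" rule could also be verified after the fact by scanning the text that `nm -D` prints. This is only a sketch: `undefined_symbols` is a hypothetical helper, and it assumes the GNU nm layout shown above (an optional address, a one-letter type, then the name).

```rust
/// Extracts the names of undefined (`U`) symbols from `nm -D` output.
///
/// Defined symbols have three fields (address, type, name); undefined ones
/// have only two (type, name). Weak undefined symbols (`w`) are ignored,
/// since they're allowed to be missing.
fn undefined_symbols(nm_output: &str) -> Vec<&str> {
    nm_output
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            match (fields.next(), fields.next(), fields.next()) {
                (Some("U"), Some(name), None) => Some(name),
                _ => None,
            }
        })
        .collect()
}
```

Fed the earlier listing that contains `U i_do_not_exist`, this returns a one-element list with that name; fed the clean listing, it returns an empty list.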

This may sound trivial, but had I known about it when I started researching this part, it would've saved me a lot of grief. Also, encore here is doing the heavy lifting, providing a panic handler and a memory allocator, so there are fewer chances of us getting it wrong.

Wait, don't we need to hook up the memory allocator manually?

Well it's a library so-

Well, now it's a library, but once minipak is done with it, it'll be an executable. So it'll need to set up its own memory allocator! Oh, and don't forget: entry points are not regular functions. We need to save the stack pointer to pass it to the program later!

Right, right, let's set all that up:

Rust code
// in `crates/stage1/src/lib.rs`

// Allow inline assembly
#![feature(asm)]
// Allow naked (no-prologue) functions
#![feature(naked_functions)]
// Don't use libstd
#![no_std]
// Use the default allocation error handler
#![feature(default_alloc_error_handler)]

extern crate alloc;

use encore::prelude::*;

macro_rules! info {
    ($($tokens: tt)*) => {
        println!("[stage1] {}", alloc::format!($($tokens)*));
    }
}

/// # Safety
/// Uses inline assembly so it can behave as the entry point of a static
/// executable.
#[no_mangle]
#[naked]
pub unsafe extern "C" fn entry() {
    asm!("mov rdi, rsp", "call premain", options(noreturn))
}

/// # Safety
/// Initializes the allocator.
#[no_mangle]
#[inline(never)]
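// `extern "C"` because both the assembly in `entry` and the C test
// program call this with the C calling convention.
unsafe extern "C" fn premain(stack_top: *mut u8) -> ! {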
    init_allocator();
    crate::main(stack_top)
}

/// # Safety
/// Nothing bad so far.
#[inline(never)]
unsafe fn main(stack_top: *mut u8) -> ! {
    info!("Stack top: {:?}", stack_top);

    syscall::exit(0)
}

There! That's a good start.

Shell session
$ cargo build --quiet --package stage1 && nm -D ./target/debug/libstage1.so
0000000000003780 T bcmp
                 w __cxa_finalize
00000000000029d0 T entry
                 w __gmon_start__
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
00000000000084c0 T memcmp
0000000000008130 T memcpy
0000000000008220 T memmove
0000000000008410 T memset
00000000000029e0 T premain
0000000000003770 T _Unwind_Resume

Now, there are actually two entry points into that ELF object.

entry is the correct entry point if we want it to behave as a static executable. But as a first test, we can definitely try to load it as a dynamic library, and call premain, just to see if the rest is all wired up correctly!

C code
// in `samples/loadtest.c`

#include <dlfcn.h>
#include <stdio.h>
#include <stdint.h>

int main() {
    void *lib = dlopen("target/debug/libstage1.so", RTLD_NOW);
    if (!lib) {
        fprintf(stderr, "Could not load library\n");
        return 1;
    }

    void *sym = dlsym(lib, "premain");
    if (!sym) {
        fprintf(stderr, "Could not find symbol\n");
        return 1;
    }

    typedef void (*premain_t)(uint64_t);
    premain_t premain = (premain_t)(sym);
    fprintf(stderr, "Calling premain...\n");
    premain(0x1234);
}
# in `samples/Justfile`

loadtest:
    gcc -g loadtest.c -o loadtest -ldl
    file loadtest
Shell session
$ just samples/loadtest
gcc -g loadtest.c -o loadtest -ldl
file loadtest
loadtest: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=4d6d20c4608d5291659f96def4fb0e387b70dda4, for GNU/Linux 4.4.0, with debug_info, not stripped

$ samples/loadtest
Calling premain...
[stage1] Stack top: 0x1234

Eyyyyyy! First time!

Very nice!

Now all we have to do is turn that dynamic library into a non-relocatable executable. And we have most of the tools to do that.

First, let's just adjust minipak's build script to operate on libstage1.so instead of stage1 (since it used to be a binary):

Rust code
// in `crates/minipak/build.rs`

use std::{
    path::{Path, PathBuf},
    process::Command,
};

fn main() {
    for &arg in &["-nostartfiles", "-nodefaultlibs", "-static"] {
        println!("cargo:rustc-link-arg={}", arg);
    }

    cargo_build(&PathBuf::from("../stage1"));
}

fn cargo_build(path: &Path) {
    println!("cargo:rerun-if-changed=..");

    let out_dir = std::env::var("OUT_DIR").unwrap();
    let target_dir = format!("{}/embeds", out_dir);

    let output = Command::new("cargo")
        .arg("build")
        .arg("--target-dir")
        .arg(&target_dir)
        .arg("--release")
        .current_dir(path)
        .spawn()
        .unwrap()
        .wait_with_output()
        .unwrap();
    if !output.status.success() {
        panic!(
            "Building {} failed.\nStdout: {}\nStderr: {}",
            path.display(),
            String::from_utf8_lossy(&output.stdout[..]),
            String::from_utf8_lossy(&output.stderr[..]),
        );
    }

    // Let's just assume the library has the same name as the crate
    //                         👇
    let lib_name = format!("lib{}.so", path.file_name().unwrap().to_str().unwrap());
    let output = Command::new("objcopy")
        .arg("--strip-all")
        .arg(&format!("release/{}", lib_name))
        .arg(lib_name)
        .current_dir(&target_dir)
        .spawn()
        .unwrap()
        .wait_with_output()
        .unwrap();
    if !output.status.success() {
        panic!(
            "Stripping failed.\nStdout: {}\nStderr: {}",
            String::from_utf8_lossy(&output.stdout[..]),
            String::from_utf8_lossy(&output.stderr[..]),
        );
    }
}
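The "same name as the crate" assumption can be spelled out as its own little function. This is a sketch, not what the build script above does: `cdylib_file_name` is a made-up name, it assumes a Linux target (hence `.so`), and it assumes the directory is named after the package. The one extra thing it handles is that cargo replaces hyphens with underscores in library file names.

```rust
use std::path::Path;

/// Guesses the file name of the cdylib built from the crate in `crate_dir`.
///
/// Assumptions: the directory name matches the package name, and the
/// target is Linux (so the library is `lib<name>.so`).
fn cdylib_file_name(crate_dir: &Path) -> Option<String> {
    let package = crate_dir.file_name()?.to_str()?;
    // cargo turns `my-crate` into `libmy_crate.so`
    Some(format!("lib{}.so", package.replace('-', "_")))
}
```

For `Path::new("../stage1")`, this yields `libstage1.so`, which is the name the build script relies on.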

In Part 15 of this series, we enabled an option in rust-analyzer, and it turns out it changed names! If in .vscode/settings.json you had rust-analyzer.cargo.loadOutDirsFromCheck, you may want to change it to rust-analyzer.cargo.runBuildScripts.

The whole file should read:

JSON
{
  "rust-analyzer.checkOnSave.allTargets": false,
  "rust-analyzer.procMacro.enable": true,
  "rust-analyzer.cargo.runBuildScripts": true
}

(Don't forget to reload the vscode window after that change)

Now! What should minipak do?

Well, first it should calculate the convex hull of the guest executable, so that we know how to lay out the "output" executable.

That, we know how to do:

Rust code
// in `crates/minipak/src/main.rs`

// Opt out of libstd
#![no_std]
// Let us worry about the entry point.
#![no_main]
// Use the default allocation error handler
#![feature(default_alloc_error_handler)]
// Let us make functions without any prologue - assembly only!
#![feature(naked_functions)]
// Let us use inline assembly!
#![feature(asm)]

/// Our entry point.
#[naked]
#[no_mangle]
unsafe extern "C" fn _start() {
    asm!("mov rdi, rsp", "call pre_main", options(noreturn))
}

use encore::prelude::*;
use pixie::{Object, PixieError, Writer};

mod cli;

#[no_mangle]
unsafe fn pre_main(stack_top: *mut u8) {
    init_allocator();
    main(Env::read(stack_top)).unwrap();
    syscall::exit(0);
}

#[allow(clippy::unnecessary_wraps)]
fn main(env: Env) -> Result<(), PixieError> {
    let args = cli::Args::parse(&env);

    println!("Packing guest {:?}", args.input);
    let guest_file = File::open(args.input)?;
    let guest_map = guest_file.map()?;
    let guest_obj = Object::new(guest_map.as_ref())?;

    let guest_hull = guest_obj.segments().load_convex_hull()?;
    println!("Guest hull: {:0x?}", guest_hull);

    let mut output = Writer::new(&args.output, 0o755)?;
    output.write_all("TODO\n".as_bytes())?;

    Ok(())
}

Let's take it for a spin:

Shell session
$ cargo run --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak
(cut)
Packing guest "/home/amos/go/bin/hugo"
Guest hull: 400000..3180968

$ cat /tmp/hugo.pak
TODO

Does that seem right?

Shell session
$ readelf -Wl ~/go/bin/hugo | grep -E "LOAD|MemSiz"
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x172b4b0 0x172b4b0 R E 0x1000
  LOAD           0x172c000 0x0000000001b2c000 0x0000000001b2c000 0x155ceb8 0x155ceb8 R   0x1000
  LOAD           0x2c89000 0x0000000003089000 0x0000000003089000 0x0b08c0 0x0f7968 RW  0x1000

$ gdb -q -ex "set noconfirm" -ex "p/x 0x0000000003089000 + 0x0f7968" -ex "quit"
No symbol table is loaded.  Use the "file" command.
$1 = 0x3180968

Yeah, that's bang on!
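For reference, here's the computation we just did by hand, as a standalone sketch. This is not pixie's actual `load_convex_hull`; it's a simplified stand-in that takes `(VirtAddr, MemSiz)` pairs for the LOAD segments directly.

```rust
use std::ops::Range;

/// Smallest address range that covers every given segment.
/// Each segment is a `(vaddr, mem_size)` pair; returns `None` if there
/// are no segments at all.
fn convex_hull(segments: &[(u64, u64)]) -> Option<Range<u64>> {
    let start = segments.iter().map(|&(vaddr, _)| vaddr).min()?;
    let end = segments
        .iter()
        .map(|&(vaddr, mem_size)| vaddr + mem_size)
        .max()?;
    Some(start..end)
}
```

With hugo's three LOAD segments from the readelf output above, this gives `0x400000..0x3180968`, same as what minipak printed.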

Okay, cool — next up, we're going to need to relink stage1, so that it transforms from a dynamic library into a static executable, that has roughly the same shape as the guest, as shown in the first column of our three-column plan:

I thought it was a two-stage plan?

Yeah! Three columns, two stages. Get it together, bear.

Okay, cool, so, to relink stage1 we're going to need the guest hull, and also a mutable reference to the Writer, let's make a function for that:

Rust code
// in `crates/minipak/src/main.rs`

// new!
use core::ops::Range;

#[allow(clippy::unnecessary_wraps)]
//                                👇
fn main(env: Env) -> Result<(), Error> {
    let args = cli::Args::parse(&env);

    println!("Packing guest {:?}", args.input);
    let guest_file = File::open(args.input)?;
    let guest_map = guest_file.map()?;
    let guest_obj = Object::new(guest_map.as_ref())?;

    let guest_hull = guest_obj.segments().load_convex_hull()?;
    let mut output = Writer::new(&args.output, 0o755)?;
    relink_stage1(guest_hull, &mut output)?;

    Ok(())
}

fn relink_stage1(guest_hull: Range<u64>, writer: &mut Writer) -> Result<(), Error> {
    println!("Guest hull: {:0x?}", guest_hull);

    let obj = Object::new(include_bytes!(concat!(
        env!("OUT_DIR"),
        "/embeds/libstage1.so"
    )))?;

    // TODO: fill out!

    Ok(())
}

Quick detour through error handling: we might return various types of errors: from deku, from encore, or from pixie, so let's give minipak an Error type now and not have to worry about it later:

TOML markup
# in `crates/minipak/Cargo.toml`

[dependencies]
displaydoc = { version = "0.1.7", default-features = false }
Rust code
// in `crates/minipak/src/error.rs`

use encore::prelude::*;
use pixie::{deku::DekuError, PixieError};

#[derive(displaydoc::Display, Debug)]
pub enum Error {
    /// `{0}`
    Encore(EncoreError),
    /// deku error: `{0}`
    Deku(DekuError),
    /// pixie error: `{0}`
    Pixie(PixieError),
}

impl From<EncoreError> for Error {
    fn from(e: EncoreError) -> Self {
        Self::Encore(e)
    }
}

impl From<DekuError> for Error {
    fn from(e: DekuError) -> Self {
        Self::Deku(e)
    }
}

impl From<PixieError> for Error {
    fn from(e: PixieError) -> Self {
        Self::Pixie(e)
    }
}

And we just need to use it from main.rs:

Rust code
// in `crates/minipak/src/main.rs`

mod error;
use error::Error;

Does everything still work? Yes?

Shell session
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak
Packing guest "/home/amos/go/bin/hugo"
Guest hull: 400000..3180968

Yes. Good.

Next up: some basic tests. We expect stage1 to be relocatable, so let's check that expectation. If it is relocatable, then its convex hull starts at 0x0:

Rust code
// in `relink_stage1`

    let hull = obj.segments().load_convex_hull()?;
    assert_eq!(hull.start, 0, "stage1 must be relocatable");

Then we have a decision to make: where will our executable start? If we're packing a relocatable executable, we can pick any base address! If we're packing a non-relocatable executable, we have to pick their base address.

Rust code
// in `relink_stage1`

    // Pick a base offset. If our guest is a relocatable executable, pick a
    // random one, otherwise, pick theirs.
    let base_offset = if guest_hull.start == 0 {
        0x800000 // by fair dice roll
    } else {
        guest_hull.start
    };
    println!("Picked base_offset 0x{:x}", base_offset);

    let hull = (hull.start + base_offset)..(hull.end + base_offset);
    println!("Stage1 hull: {:x?}", hull);
    println!(" Guest hull: {:x?}", guest_hull);

This alone should give us some interesting output:

Shell session
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak
Packing guest "/home/amos/go/bin/hugo"
Picked base_offset 0x400000
Stage1 hull: 400000..40b048
 Guest hull: 400000..3180968

Cool! hugo is not relocatable, so we will be mapping stage1 starting from the same base address.

If we try to pack ls however, which is relocatable:

Shell session
$ cargo run --quiet --release --bin minipak -- /bin/ls -o /tmp/ls.pak
Packing guest "/bin/ls"
Picked base_offset 0x800000
Stage1 hull: 800000..80b048
 Guest hull: 0..24558

Then we pick 0x800000 as a base address.
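
If you'd like to poke at that decision in isolation, here's a minimal sketch of it using only the standard library. The names `pick_base_offset` and `shift_hull` are made up for this example (in minipak the logic lives inline in `relink_stage1`), and `0x800000` is the same arbitrary constant as above:

```rust
use std::ops::Range;

/// Hypothetical helper mirroring the choice made in `relink_stage1`.
fn pick_base_offset(guest_hull: &Range<u64>) -> u64 {
    if guest_hull.start == 0 {
        // Relocatable guest: any base works, 0x800000 is arbitrary.
        0x800000
    } else {
        // Non-relocatable guest: we have to sit at its base address.
        guest_hull.start
    }
}

/// Hypothetical helper: shift a hull that starts at 0 up to `base_offset`.
fn shift_hull(hull: &Range<u64>, base_offset: u64) -> Range<u64> {
    (hull.start + base_offset)..(hull.end + base_offset)
}
```

Feeding it stage1's hull (`0..0xb048`) and hugo's hull gives `0x400000..0x40b048`, and feeding it the hull of `/bin/ls` gives `0x800000..0x80b048`, which lines up with the two runs above.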

Alright, cool. Next up, we're going to proceed as if we wanted to load stage1 as a library.

It's an ELF object, so it has segments, and so it can be mapped:

Rust code
    let mut mapped = MappedObject::new(&obj, None)?;
    println!("Loaded stage1");

And then we should relocate it.

Wait, relocate it? Like we did in parts uhhh... in the previous parts?

Exactly! But we'll only care about four relocation types: 64, GlobDat, JumpSlot, Relative.

Because that's all we have in our library:

Shell session
$ readelf -Wr ./target/debug/libstage1.so | grep -oE 'R_X86\w+' | sort -u
R_X86_64_GLOB_DAT
R_X86_64_JUMP_SLOT
R_X86_64_RELATIVE

Okay, so 64 hasn't shown up yet, but let's handle it anyway.

Alright!

Relocating is ELF business, and we do our ELF business in the pixie crate.

Let's add a relocate method to MappedObject:

Rust code
// in `crates/pixie/src/lib.rs`

use alloc::boxed::Box;

impl<'a> MappedObject<'a> {
    /// Apply relocations with the given base offset
    pub fn relocate(&mut self, base_offset: u64) -> Result<(), PixieError> {
        if !self.is_relocatable() {
            return Err(PixieError::CannotRelocateNonRelocatableObject);
        }

        let dyn_entries = self.object.read_dynamic_entries()?;
        let syms = dyn_entries.syms()?;

        let relas = dyn_entries
            .find(DynamicTagType::Rela)?
            .parse_all(dyn_entries.find(DynamicTagType::RelaSz)?);
        let plt_relas: Box<dyn Iterator<Item = _>> = match dyn_entries.find(DynamicTagType::JmpRel)
        {
            Ok(jmprel) => Box::new(jmprel.parse_all(dyn_entries.find(DynamicTagType::PltRelSz)?)),
            Err(_) => Box::new(core::iter::empty()) as _,
        };

        for rela in relas.chain(plt_relas) {
            let rela = rela?;
            self.apply_rela(&syms, &rela, base_offset)?;
        }
        Ok(())
    }
}

Okay, uh, we jumped a few steps: there are a lot of symbols in here that we haven't defined yet.

We need a new error variant:

Rust code
// in `crates/pixie/src/lib.rs`

#[derive(displaydoc::Display, Debug)]
/// A pixie error
pub enum PixieError {
    /// `{0}`
    Deku(DekuError),
    /// `{0}`
    Encore(EncoreError),

    /// no segments found
    NoSegmentsFound,
    /// could not find segment of type `{0:?}`
    SegmentNotFound(SegmentType),

    /// cannot map non-relocatable object at fixed position
    CannotMapNonRelocatableObjectAtFixedPosition,

    // 👇 new!

    /// cannot relocate non-relocatable object
    CannotRelocateNonRelocatableObject,
}

And then we need to teach pixie about the kind of entries contained in the "Dynamic" segment.

Rust code
// in `crates/pixie/src/format/dynamic.rs`

use super::prelude::*;

#[derive(Debug, Clone, DekuRead, DekuWrite)]
pub struct DynamicTag {
    pub typ: DynamicTagType,
    pub addr: u64,
}

#[derive(Debug, DekuRead, DekuWrite, Clone, Copy, PartialEq)]
#[deku(type = "u64")]
pub enum DynamicTagType {
    #[deku(id = "0")]
    Null,
    #[deku(id = "2")]
    PltRelSz,
    #[deku(id = "5")]
    StrTab,
    #[deku(id = "6")]
    SymTab,
    #[deku(id = "7")]
    Rela,
    #[deku(id = "8")]
    RelaSz,
    #[deku(id = "11")]
    SymEnt,
    #[deku(id = "23")]
    JmpRel,
    #[deku(id_pat = "_")]
    Other(u64),
}

That's a new module of pixie::format, we need to declare it:

Rust code
// in `crates/pixie/src/format/mod.rs`

mod dynamic;
pub use dynamic::*;

Then, we'll need to implement Object::read_dynamic_entries.

Rust code
// in `crates/pixie/src/lib.rs`

impl<'a> Object<'a> {
    /// Read all dynamic entries
    pub fn read_dynamic_entries(&self) -> Result<DynamicEntries<'a>, PixieError> {
        let dyn_seg = self.segments.find(SegmentType::Dynamic)?;
        let mut entries = DynamicEntries::default();
        let mut input = (dyn_seg.slice(), 0);
        loop {
            let (rest, tag) = DynamicTag::from_bytes(input)?;
            if tag.typ == DynamicTagType::Null {
                break;
            }
            entries.items.push(DynamicEntry {
                tag,
                full_slice: &self.slice,
            });
            input = rest;
        }
        Ok(entries)
    }
}
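
If that loop feels a bit magic, here's roughly what it amounts to without deku. This is a hedged, std-only sketch: `read_dynamic_tags` and `le_u64` are hypothetical names, and it assumes a 64-bit little-endian ELF file, where each dynamic entry is a pair of `u64` values (the tag, then the value) and the list ends at the first `Null` tag (0). Unlike the real method, it quietly stops if it runs out of bytes instead of returning an error:

```rust
/// Read a little-endian u64 from the first 8 bytes of `bytes`.
fn le_u64(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[..8]);
    u64::from_le_bytes(buf)
}

/// Hypothetical stand-in for `read_dynamic_entries`: returns (tag, value)
/// pairs up to, but not including, the first Null tag.
fn read_dynamic_tags(mut bytes: &[u8]) -> Vec<(u64, u64)> {
    const ENTRY_SIZE: usize = 16;
    const DT_NULL: u64 = 0;

    let mut entries = Vec::new();
    while bytes.len() >= ENTRY_SIZE {
        let tag = le_u64(&bytes[0..8]);
        let value = le_u64(&bytes[8..16]);
        if tag == DT_NULL {
            // Anything after the Null tag is not part of the list.
            break;
        }
        entries.push((tag, value));
        bytes = &bytes[ENTRY_SIZE..];
    }
    entries
}
```

Given a buffer holding the tags 7 (`Rela`), 8 (`RelaSz`), 0 (`Null`) and then some trailing garbage, it returns just the first two entries.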

Which returns a type DynamicEntries, very similar to the Segments type we made before — it just has a bunch of utility methods:

Rust code
// in `crates/pixie/src/lib.rs`

/// Entries in the `DYNAMIC` segment.
#[derive(Default)]
pub struct DynamicEntries<'a> {
    items: Vec<DynamicEntry<'a>>,
}

impl<'a> DynamicEntries<'a> {
    /// Returns a slice of all entries
    pub fn all(&self) -> &[DynamicEntry<'a>] {
        &self.items
    }

    /// Iterates over all entries of a given type
    pub fn of_type(&self, typ: DynamicTagType) -> impl Iterator<Item = &DynamicEntry<'a>> {
        self.items.iter().filter(move |entry| entry.typ() == typ)
    }

    /// Finds the first entry of a given type
    pub fn find(&self, typ: DynamicTagType) -> Result<&DynamicEntry<'a>, PixieError> {
        self.of_type(typ)
            .next()
            .ok_or(PixieError::DynamicEntryNotFound(typ))
    }

    /// Constructs an instance of `Syms`. Requires the presence of the `SymTab`,
    /// `SymEnt` and `StrTab` dynamic entries.
    pub fn syms(&'a self) -> Result<Syms<'a>, PixieError> {
        Ok(Syms {
            symtab: self.find(DynamicTagType::SymTab)?,
            syment: self.find(DynamicTagType::SymEnt)?,
            strtab: self.find(DynamicTagType::StrTab)?,
        })
    }
}

This brings a new error variant — when we can't find a dynamic entry of the requested type:

Rust code
#[derive(displaydoc::Display, Debug)]
/// A pixie error
pub enum PixieError {
    // (cut)

    /// could not find dynamic entry of type `{0:?}`
    DynamicEntryNotFound(DynamicTagType),
}

The DynamicEntry type holds both a "dynamic tag", and the corresponding data (which is always more or less a u64, but can be a number, an address, or something else still):

Rust code
// in `crates/pixie/src/lib.rs`

/// An entry in the `DYNAMIC` section
pub struct DynamicEntry<'a> {
    /// The dynamic tag as read from the `DYNAMIC` section
    tag: DynamicTag,

    /// A slice of the full ELF object
    full_slice: &'a [u8],
}

impl<'a> DynamicEntry<'a> {
    /// Returns the type of this dynamic entry
    pub fn typ(&self) -> DynamicTagType {
        self.tag.typ
    }

    /// Returns a slice of the full file starting with this entry interpreted as
    /// an offset.
    pub fn as_slice(&self) -> &'a [u8] {
        &self.full_slice[self.as_usize()..]
    }

    /// Returns this entry's value as an `usize`
    pub fn as_usize(&self) -> usize {
        self.as_u64() as usize
    }

    /// Returns this entry's value as an `u64`
    pub fn as_u64(&self) -> u64 {
        self.tag.addr
    }

    /// Parses several `T` records, using `self` as the start of the input, and
    /// `len` as the total length of the input.
    pub fn parse_all<T>(
        &self,
        len: &DynamicEntry<'a>,
    ) -> impl Iterator<Item = Result<T, PixieError>> + 'a
    where
        T: DekuContainerRead<'a>,
    {
        let slice = &self.as_slice()[..len.as_usize()];
        let mut input = (slice, 0);

        core::iter::from_fn(move || -> Option<Result<T, PixieError>> {
            if input.0.is_empty() {
                return None;
            }

            let (rest, t) = match T::from_bytes(input) {
                Ok(x) => x,
                Err(e) => return Some(Err(e.into())),
            };
            input = rest;
            Some(Ok(t))
        })
    }

    /// Parses the nth `T` record, using `self` as the start of the input, and
    /// `record_len` as the record length.
    pub fn parse_nth<T>(&self, record_len: &DynamicEntry<'a>, n: usize) -> Result<T, DekuError>
    where
        T: DekuContainerRead<'a>,
    {
        let slice = &self.as_slice()[(record_len.as_usize() * n)..];
        let input = (slice, 0);
        let (_, t) = T::from_bytes(input)?;
        Ok(t)
    }
}

Then, we have Syms, which allows looking up symbol names, using the symtab, syment, and strtab dynamic entries:

Rust code
// in `crates/pixie/src/lib.rs`

/// Allows reading symbols out of an ELF file
pub struct Syms<'a> {
    /// Indicates the start of the symbol table
    symtab: &'a DynamicEntry<'a>,
    /// Indicates the size of a symbol entry
    syment: &'a DynamicEntry<'a>,
    /// Indicates the start of the string table
    strtab: &'a DynamicEntry<'a>,
}

impl<'a> Syms<'a> {
    /// Read the nth symbol
    pub fn nth(&self, n: usize) -> Result<(Sym, &'a str), PixieError> {
        let sym: Sym = self.symtab.parse_nth(&self.syment, n)?;
        let name = unsafe { self.strtab.as_slice().as_ptr().add(sym.name as _).cstr() };
        Ok((sym, name))
    }

    /// Find a symbol by name. Will end up panicking if the symbol
    /// is not found!
    pub fn by_name(&self, name: &str) -> Result<Sym, PixieError> {
        let mut i = 0;
        loop {
            let (sym, sym_name) = self.nth(i)?;
            if sym_name == name {
                return Ok(sym);
            }
            i += 1;
        }
    }
}

Sym is also its own type:

Rust code
// in `crates/pixie/src/format/mod.rs`

mod sym;
pub use sym::*;
Rust code
// in `crates/pixie/src/format/sym.rs`

use super::prelude::*;

#[derive(Debug, DekuRead, DekuWrite, Clone)]
pub struct Sym {
    pub name: u32,

    pub bind: SymBind,
    #[deku(pad_bytes_after = "1")]
    pub typ: SymType,

    pub shndx: u16,
    pub value: u64,
    pub size: u64,
}

#[derive(Debug, DekuRead, DekuWrite, Clone, Copy, PartialEq)]
#[deku(type = "u8", bits = 4)]
pub enum SymBind {
    #[deku(id = "0")]
    Local,
    #[deku(id = "1")]
    Global,
    #[deku(id = "2")]
    Weak,
    #[deku(id_pat = "_")]
    Other(u8),
}

#[derive(Debug, DekuRead, DekuWrite, Clone, Copy, PartialEq)]
#[deku(type = "u8", bits = 4)]
pub enum SymType {
    #[deku(id = "0")]
    None,
    #[deku(id = "1")]
    Object,
    #[deku(id = "2")]
    Func,
    #[deku(id = "3")]
    Section,
    #[deku(id = "4")]
    File,
    #[deku(id = "6")]
    Tls,
    #[deku(id = "10")]
    IFunc,
    #[deku(id_pat = "_")]
    Other(u8),
}

Whoa, whoahey, that's a lot of code, isn't it?

Yeah, but it's not that bad, is it? We're just making some nice abstractions, as usual, and using what deku gives us to parse symbols easily.
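
To see what deku is saving us from, here's a hedged sketch of parsing one symbol by hand, std only. `RawSym`, `parse_sym` and `le_u64` are hypothetical names for this example. It assumes the 64-bit little-endian symbol layout: 24 bytes per symbol, with the binding in the high four bits of the info byte and the type in the low four bits, which is (as far as I can tell) what the two `bits = 4` enums above end up reading:

```rust
/// Hypothetical plain-data version of `Sym`, with bind and type kept as raw
/// numbers instead of enums.
#[derive(Debug, PartialEq)]
struct RawSym {
    name: u32,
    bind: u8,
    typ: u8,
    shndx: u16,
    value: u64,
    size: u64,
}

/// Read a little-endian u64 from the first 8 bytes of `bytes`.
fn le_u64(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[..8]);
    u64::from_le_bytes(buf)
}

/// Parse one 24-byte symbol record, or return `None` if `b` is too short.
fn parse_sym(b: &[u8]) -> Option<RawSym> {
    if b.len() < 24 {
        return None;
    }
    let info = b[4];
    Some(RawSym {
        name: u32::from_le_bytes([b[0], b[1], b[2], b[3]]),
        bind: info >> 4,   // high nibble: 0 = Local, 1 = Global, 2 = Weak
        typ: info & 0x0f,  // low nibble: 1 = Object, 2 = Func, ...
        // b[5] is the byte skipped by `pad_bytes_after = "1"`
        shndx: u16::from_le_bytes([b[6], b[7]]),
        value: le_u64(&b[8..16]),
        size: le_u64(&b[16..24]),
    })
}
```

An info byte of `0x12`, for instance, splits into bind 1 (`Global`) and type 2 (`Func`).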

Now that we have all that, we can focus on actually applying relocations.

And, remember how we very carefully handled each relocation type differently, making sure to apply the formula from the SysV ABI?

Yeah?

Yeah well, not this time.

Rust code
// in `crates/pixie/src/lib.rs`

impl<'a> MappedObject<'a> {
    /// Apply a single relocation
    fn apply_rela(&mut self, syms: &Syms, rela: &Rela, base_offset: u64) -> Result<(), PixieError> {
        match rela.typ {
            RelType::_64 | RelType::GlobDat | RelType::JumpSlot | RelType::Relative => {
                // we support these
            }
            _ => {
                return Err(PixieError::UnsupportedRela(rela.clone()));
            }
        }

        // some relocations don't use symbols, we'll just use the 0th symbol
        // for them, which is fine.
        let (sym, _) = syms.nth(rela.sym as _)?;
        let value = base_offset + sym.value + rela.addend;

        let mem_offset = self.vaddr_to_mem_offset(rela.offset);
        unsafe {
            let target = self.mem.as_ptr().add(mem_offset) as *mut u64;
            *target = value;
        }

        Ok(())
    }
}

Turns out we can compute them all the same way! It's literally always just "base_offset + symbol value + addend". Some relocations refer to the 0th symbol, which has a value of 0, and some relocations don't have an addend, so the addend "field" is 0, and it all ends up being correct.
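
Here's that formula as a tiny safe sketch operating on a plain byte buffer instead of mapped memory. `apply_rela_sketch` is a hypothetical name, and there are two deliberate differences from the method above: it bounds-checks instead of writing through a raw pointer, and it uses `wrapping_add`, an assumption on my part that keeps the sum well-defined if an addend is negative (ELF addends are signed, and a negative one stored in a `u64` looks like a huge number):

```rust
/// Hypothetical helper: compute `base_offset + sym_value + addend` and store
/// it as a little-endian u64 at `mem_offset` in `mem`.
/// Returns the value written, or `None` if the write would go out of bounds.
fn apply_rela_sketch(
    mem: &mut [u8],
    mem_offset: usize,
    base_offset: u64,
    sym_value: u64,
    addend: u64,
) -> Option<u64> {
    let value = base_offset.wrapping_add(sym_value).wrapping_add(addend);
    let end = mem_offset.checked_add(8)?;
    let target = mem.get_mut(mem_offset..end)?;
    target.copy_from_slice(&value.to_le_bytes());
    Some(value)
}
```

For a `Relative` relocation with addend `0x28e0`, symbol value 0 and a base offset of `0x400000`, the eight bytes written are `e0 28 40 00 00 00 00 00`.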

There are still two missing pieces. First, yet another error variant, in case we encounter a relocation type we do not support (which should never happen):

Rust code
// in `crates/pixie/src/lib.rs`

#[derive(displaydoc::Display, Debug)]
/// A pixie error
pub enum PixieError {
    // (cut)

    /// unsupported relocation type `{0:?}`
    UnsupportedRela(Rela),
}

And of course, the Rela and RelType types, which are also part of the ELF format:

Rust code
// in `crates/pixie/src/format/mod.rs`

mod rela;
pub use rela::*;
Rust code
// in `crates/pixie/src/format/rela.rs`

use super::prelude::*;

#[derive(Debug, DekuRead, DekuWrite, Clone)]
pub struct Rela {
    pub offset: u64,
    pub typ: RelType,
    pub sym: u32,
    pub addend: u64,
}

#[derive(Debug, DekuRead, DekuWrite, Clone, Copy, PartialEq)]
#[deku(type = "u32")]
pub enum RelType {
    #[deku(id = "0")]
    Null,
    #[deku(id = "1")]
    _64,
    #[deku(id = "6")]
    GlobDat,
    #[deku(id = "7")]
    JumpSlot,
    #[deku(id = "8")]
    Relative,
    #[deku(id = "16")]
    DtpMod64,
    #[deku(id_pat = "_")]
    Other(u32),
}

And with all that, we can relocate stage1.

To get an idea of the result, let's write the relocated version of stage1 directly to the output:

Rust code
// in `crates/minipak/src/main.rs`

fn relink_stage1(guest_hull: Range<u64>, writer: &mut Writer) -> Result<(), Error> {
    let obj = Object::new(include_bytes!(concat!(
        env!("OUT_DIR"),
        "/embeds/libstage1.so"
    )))?;

    let hull = obj.segments().load_convex_hull()?;
    assert_eq!(hull.start, 0, "stage1 must be relocatable");

    // Pick a base offset. If our guest is a relocatable executable, pick a
    // random one, otherwise, pick theirs.
    let base_offset = if guest_hull.start == 0 {
        0x800000 // by fair dice roll
    } else {
        guest_hull.start
    };
    println!("Picked base_offset 0x{:x}", base_offset);

    let hull = (hull.start + base_offset)..(hull.end + base_offset);
    println!("Stage1 hull: {:x?}", hull);
    println!(" Guest hull: {:x?}", guest_hull);

    // Map stage1 wherever...
    let mut mapped = MappedObject::new(&obj, None)?;
    println!("Loaded stage1");

    // 👇 new code's here

    // ...but relocate it as if it was mapped at `base_offset`
    mapped.relocate(base_offset)?;
    println!("Relocated stage1");

    // Dump the relocated version of the executable segment to disk, for comparison:
    let exec_segment = mapped.vaddr_slice(
        obj.segments()
            .of_type(pixie::SegmentType::Load)
            .find(|x| x.header().flags == (ProgramHeader::EXECUTE | ProgramHeader::READ))
            .unwrap()
            .header()
            .mem_range(),
    );
    writer.write_all(exec_segment)?;

    Ok(())
}
Shell session
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak
Packing guest "/home/amos/go/bin/hugo"
Picked base_offset 0x400000
Stage1 hull: 400000..40b048
 Guest hull: 400000..3180968
Loaded stage1
Relocated stage1

Okay! Now let's try to compare the non-relocated and the relocated version of the first segment. First we need to extract just the right part of the non-relocated libstage1.so:

Shell session
$ readelf -Wl ./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/libstage1.so | grep -E 'MemSiz|LOAD'
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x001060 0x001060 R   0x1000
  LOAD           0x002000 0x0000000000002000 0x0000000000002000 0x004b0d 0x004b0d R E 0x1000
  LOAD           0x007000 0x0000000000007000 0x0000000000007000 0x001b88 0x001b88 R   0x1000
  LOAD           0x009750 0x000000000000a750 0x000000000000a750 0x0008c0 0x0008f8 RW  0x1000

$ dd if=./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/libstage1.so of=/tmp/unrelocated bs=1 skip=$((0x002000)) count=$((0x004b0d))
19213+0 records in
19213+0 records out
19213 bytes (19 kB, 19 KiB) copied, 0.0276595 s, 695 kB/s

Now, if my calculations are correct, the first few bytes should be the same:

Shell session
$ xxd /tmp/unrelocated | head -3
00000000: f30f 1efa 4883 ec08 488b 05e9 8f00 0048  ....H...H......H
00000010: 85c0 7402 ffd0 4883 c408 c300 0000 0000  ..t...H.........
00000020: ff35 c28e 0000 ff25 c48e 0000 0f1f 4000  .5.....%......@.

$ xxd /tmp/hugo.pak | head -3
00000000: f30f 1efa 4883 ec08 488b 05e9 8f00 0048  ....H...H......H
00000010: 85c0 7402 ffd0 4883 c408 c300 0000 0000  ..t...H.........
00000020: ff35 c28e 0000 ff25 c48e 0000 0f1f 4000  .5.....%......@.

Yeah, yes! That looks similar.

Let's find the differences, shall we?

Shell session
$ diff <(xxd /tmp/unrelocated) <(xxd /tmp/hugo.pak)

$

Huh. No output? They're the same? Did our program just... do nothing?

Maybe there's no relocations in the executable segment?

Ohhhhhhhh, right! It probably uses rip-relative addressing to avoid any relocations touching the executable segment, so that it can be shared across multiple loads of the same dynamic library.

We've seen that in Part 9, uh, over a year ago.

Time flies!

So then, where are the relocations?

Shell session
$ readelf -Wr ./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/libstage1.so | head

Relocation section '.rela.dyn' at offset 0x490 contains 125 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
             👇
000000000000a750  0000000000000008 R_X86_64_RELATIVE                         28e0
000000000000a758  0000000000000008 R_X86_64_RELATIVE                         2890
000000000000a760  0000000000000008 R_X86_64_RELATIVE                         2900
000000000000a778  0000000000000008 R_X86_64_RELATIVE                         2a90
000000000000a780  0000000000000008 R_X86_64_RELATIVE                         2910
000000000000a788  0000000000000008 R_X86_64_RELATIVE                         2a40
000000000000a790  0000000000000008 R_X86_64_RELATIVE                         7000

$ readelf -Wl ./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/libstage1.so | grep -E "MemSiz|LOAD"
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x001060 0x001060 R   0x1000
  LOAD           0x002000 0x0000000000002000 0x0000000000002000 0x004b0d 0x004b0d R E 0x1000
  LOAD           0x007000 0x0000000000007000 0x0000000000007000 0x001b88 0x001b88 R   0x1000
                                         👇
  LOAD           0x009750 0x000000000000a750 0x000000000000a750 0x0008c0 0x0008f8 RW  0x1000

Ah! In the read-write segment.
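
That "eyeball the offsets against the program headers" step can also be written down. The sketch below is purely illustrative: `Load`, `stage1_loads` and `segment_containing` are hypothetical names, and the table is copied by hand from the `readelf -Wl` output above:

```rust
/// Hypothetical, trimmed-down view of a LOAD program header.
struct Load {
    vaddr: u64,
    memsz: u64,
    flags: &'static str,
}

/// The four LOAD segments of libstage1.so, as printed by readelf above.
fn stage1_loads() -> Vec<Load> {
    vec![
        Load { vaddr: 0x0000, memsz: 0x1060, flags: "R" },
        Load { vaddr: 0x2000, memsz: 0x4b0d, flags: "R E" },
        Load { vaddr: 0x7000, memsz: 0x1b88, flags: "R" },
        Load { vaddr: 0xa750, memsz: 0x08f8, flags: "RW" },
    ]
}

/// Find the segment whose memory range contains `vaddr`, if any.
fn segment_containing(segs: &[Load], vaddr: u64) -> Option<&Load> {
    segs.iter()
        .find(|seg| vaddr >= seg.vaddr && vaddr - seg.vaddr < seg.memsz)
}
```

Looking up `0xa750`, the offset of the first relocation, lands in the `RW` segment, and so does `0xa790`.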

Okay then:

Rust code
// in `relink_stage1`

    let rw_segment = mapped.vaddr_slice(
        obj.segments()
            .of_type(pixie::SegmentType::Load)
            .find(|x| x.header().flags == (ProgramHeader::READ | ProgramHeader::WRITE))
            .unwrap()
            .header()
            .mem_range(),
    );
    writer.write_all(rw_segment)?;
Shell session
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak
Packing guest "/home/amos/go/bin/hugo"
Picked base_offset 0x400000
Stage1 hull: 400000..40b048
 Guest hull: 400000..3180968
Loaded stage1
Relocated stage1

$ readelf -Wl ./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/libstage1.so | grep -E "MemSiz|LOAD"
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x001060 0x001060 R   0x1000
  LOAD           0x002000 0x0000000000002000 0x0000000000002000 0x004b0d 0x004b0d R E 0x1000
  LOAD           0x007000 0x0000000000007000 0x0000000000007000 0x001b88 0x001b88 R   0x1000
                     👇                                            👇
  LOAD           0x009750 0x000000000000a750 0x000000000000a750 0x0008c0 0x0008f8 RW  0x1000

$ dd if=./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/libstage1.so of=/tmp/unrelocated bs=1 skip=$((0x009750)) count=$((0x0008c0))
2240+0 records in
2240+0 records out
2240 bytes (2.2 kB, 2.2 KiB) copied, 0.0038677 s, 579 kB/s

Let's diff again:

Shell session
$ diff <(xxd /tmp/unrelocated) <(xxd /tmp/hugo.pak) | head -14
1,6c1,6
                  👇
< 00000000: e028 0000 0000 0000 9028 0000 0000 0000  .(.......(......
< 00000010: 0029 0000 0000 0000 0800 0000 0000 0000  .)..............
< 00000020: 0800 0000 0000 0000 902a 0000 0000 0000  .........*......
< 00000030: 1029 0000 0000 0000 402a 0000 0000 0000  .)......@*......
< 00000040: 0070 0000 0000 0000 4b00 0000 0000 0000  .p......K.......
< 00000050: 5c01 0000 1300 0000 0029 0000 0000 0000  \........)......
---
                  👇
> 00000000: e028 4000 0000 0000 9028 4000 0000 0000  .(@......(@.....
> 00000010: 0029 4000 0000 0000 0800 0000 0000 0000  .)@.............
> 00000020: 0800 0000 0000 0000 902a 4000 0000 0000  .........*@.....
> 00000030: 1029 4000 0000 0000 402a 4000 0000 0000  .)@.....@*@.....
> 00000040: 0070 4000 0000 0000 4b00 0000 0000 0000  .p@.....K.......
> 00000050: 5c01 0000 1300 0000 0029 4000 0000 0000  \........)@.....

Ah, there we have it! A bunch of 0s that become 4s.

Is it because hugo has a base address of 0x400000? What would happen if we operated on /bin/ls instead?

Well, let's try it:

Shell session
$ cargo run --quiet --release --bin minipak -- /bin/ls -o /tmp/ls.pak
Packing guest "/bin/ls"
Picked base_offset 0x800000
Stage1 hull: 800000..80b048
 Guest hull: 0..24558
Loaded stage1
Relocated stage1

$ diff <(xxd /tmp/unrelocated) <(xxd /tmp/ls.pak) | head -14
1,6c1,6
                  👇
< 00000000: e028 0000 0000 0000 9028 0000 0000 0000  .(.......(......
< 00000010: 0029 0000 0000 0000 0800 0000 0000 0000  .)..............
< 00000020: 0800 0000 0000 0000 902a 0000 0000 0000  .........*......
< 00000030: 1029 0000 0000 0000 402a 0000 0000 0000  .)......@*......
< 00000040: 0070 0000 0000 0000 4b00 0000 0000 0000  .p......K.......
< 00000050: 5c01 0000 1300 0000 0029 0000 0000 0000  \........)......
---
                  👇
> 00000000: e028 8000 0000 0000 9028 8000 0000 0000  .(.......(......
> 00000010: 0029 8000 0000 0000 0800 0000 0000 0000  .)..............
> 00000020: 0800 0000 0000 0000 902a 8000 0000 0000  .........*......
> 00000030: 1029 8000 0000 0000 402a 8000 0000 0000  .)......@*......
> 00000040: 0070 8000 0000 0000 4b00 0000 0000 0000  .p......K.......
> 00000050: 5c01 0000 1300 0000 0029 8000 0000 0000  \........)......

Yup, sure enough! They're now 8.

Alright, well, there's no telling if our relocations are correct yet, but at least there's definitely something being relocated.

Which means all we need to do now... is generate an ELF object that happens to be an executable.

So, what, write a header?

Yes! And program headers, everything we need.

And it'll be easy! Because deku not only lets us read binary formats, it also lets us write binary formats.
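
For a sense of what "writing a header" means at the byte level, here's a hedged std-only sketch that serializes a 64-bit little-endian ELF header for a static x86-64 executable by hand. `elf64_exec_header` is a hypothetical name and is not how minipak does it (we'll let deku do the work), but the constants are the standard ELF ones: class 2 for 64-bit, data 1 for little-endian, type 2 for an executable, machine 62 for x86-64:

```rust
/// Hypothetical helper: build the 64-byte ELF header of a static x86-64
/// executable with `ph_count` program headers and no section headers.
fn elf64_exec_header(entry_point: u64, ph_count: u16) -> Vec<u8> {
    const EHDR_SIZE: u16 = 64; // size of this header
    const PHDR_SIZE: u16 = 56; // size of one program header

    let mut out = Vec::with_capacity(EHDR_SIZE as usize);
    out.extend_from_slice(&[0x7f, b'E', b'L', b'F']); // magic
    out.extend_from_slice(&[2, 1, 1, 0]); // 64-bit, little-endian, v1, SysV
    out.extend_from_slice(&[0u8; 8]); // ABI version + padding
    out.extend_from_slice(&2u16.to_le_bytes()); // type: executable
    out.extend_from_slice(&62u16.to_le_bytes()); // machine: x86-64
    out.extend_from_slice(&1u32.to_le_bytes()); // version, again
    out.extend_from_slice(&entry_point.to_le_bytes());
    out.extend_from_slice(&u64::from(EHDR_SIZE).to_le_bytes()); // ph offset
    out.extend_from_slice(&0u64.to_le_bytes()); // sh offset: no sections
    out.extend_from_slice(&0u32.to_le_bytes()); // flags
    out.extend_from_slice(&EHDR_SIZE.to_le_bytes());
    out.extend_from_slice(&PHDR_SIZE.to_le_bytes());
    out.extend_from_slice(&ph_count.to_le_bytes());
    out.extend_from_slice(&0u16.to_le_bytes()); // sh entry size
    out.extend_from_slice(&0u16.to_le_bytes()); // sh count
    out.extend_from_slice(&0u16.to_le_bytes()); // sh name index
    out
}
```

With six program headers right after it, the header area takes up 64 + 6 × 56 = 400 bytes, so the first byte that isn't a header sits at file offset `0x190`.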

Let's go!

Rust code
fn relink_stage1(guest_hull: Range<u64>, writer: &mut Writer) -> Result<(), Error> {
    let obj = Object::new(include_bytes!(concat!(
        env!("OUT_DIR"),
        "/embeds/libstage1.so"
    )))?;

    let hull = obj.segments().load_convex_hull()?;
    assert_eq!(hull.start, 0, "stage1 must be relocatable");

    // Pick a base offset. If our guest is a relocatable executable, pick a
    // random one, otherwise, pick theirs.
    let base_offset = if guest_hull.start == 0 {
        0x800000 // by fair dice roll
    } else {
        guest_hull.start
    };
    println!("Picked base_offset 0x{:x}", base_offset);

    let hull = (hull.start + base_offset)..(hull.end + base_offset);
    println!("Stage1 hull: {:x?}", hull);
    println!(" Guest hull: {:x?}", guest_hull);

    // Map stage1 wherever...
    let mut mapped = MappedObject::new(&obj, None)?;
    println!("Loaded stage1");

    // ...but relocate it as if it was mapped at `base_offset`
    mapped.relocate(base_offset)?;
    println!("Relocated stage1");

    println!("Looking for `entry` in stage1...");
    let entry_sym = mapped.lookup_sym("entry")?;
    let entry_point = base_offset + entry_sym.value;

    // Collect all the load segments
    let mut load_segs = obj
        .segments()
        .of_type(SegmentType::Load)
        .collect::<Vec<_>>();

    // Now write out some ELF!
    let out_header = ObjectHeader {
        class: pixie::ElfClass::Elf64,
        endianness: Endianness::Little,
        version: 1,
        os_abi: OsAbi::SysV,
        typ: ElfType::Exec,
        machine: ElfMachine::X86_64,
        version_bis: 1,
        entry_point,

        flags: 0,
        hdr_size: ObjectHeader::SIZE,
        // Two additional segments: one for `brk` alignment, and GNU_STACK.
        ph_count: load_segs.len() as u16 + 2,
        ph_offset: ObjectHeader::SIZE as _,
        ph_entsize: ProgramHeader::SIZE,
        // We're not adding any sections, our object will be opaque to debuggers
        sh_count: 0,
        sh_entsize: 0,
        sh_nidx: 0,
        sh_offset: 0,
    };
    writer.write_deku(&out_header)?;

    let static_headers = load_segs.iter().map(|seg| {
        let mut ph = seg.header().clone();
        ph.vaddr += base_offset;
        ph.paddr += base_offset;
        ph
    });
    for ph in static_headers {
        writer.write_deku(&ph)?;
    }

    // Insert dummy segment to offset the `brk` to its original position
    // for the guest, if we can.
    {
        let current_hull = align_hull(hull);
        let desired_hull = align_hull(guest_hull);

        let pad_size = if current_hull.end > desired_hull.end {
            println!("WARNING: Guest executable is too small, the `brk` will be wrong.");
            0x0
        } else {
            desired_hull.end - current_hull.end
        };

        let ph = ProgramHeader {
            paddr: current_hull.end,
            vaddr: current_hull.end,
            memsz: pad_size,
            filesz: 0,
            offset: 0,
            align: 0x1000,
            typ: SegmentType::Load,
            flags: ProgramHeader::WRITE | ProgramHeader::READ,
        };
        writer.write_deku(&ph)?;
    }

    // Add a GNU_STACK program header for alignment and to make it
    // non-executable.
    {
        let ph = ProgramHeader {
            paddr: 0,
            vaddr: 0,
            memsz: 0,
            filesz: 0,
            offset: 0,
            align: 0x10,
            typ: SegmentType::GnuStack,
            flags: ProgramHeader::WRITE | ProgramHeader::READ,
        };
        writer.write_deku(&ph)?;
    }

    // Sort load segments by file offset and copy them.
    {
        load_segs.sort_by_key(|&seg| seg.header().offset);

        println!("Copying stage1 segments...");
        let copy_start_offset = writer.offset();
        println!("copy_start_offset = 0x{:x}", copy_start_offset);
        let copied_segments = load_segs
            .into_iter()
            .filter(move |seg| seg.header().offset > copy_start_offset);

        for cp_seg in copied_segments {
            let ph = cp_seg.header();
            println!("copying {:?}", ph);

            // Pad space between segments with zeros:
            writer.pad(ph.offset - writer.offset())?;

            // Then copy.
            let start = ph.vaddr;
            let len = ph.filesz;
            let end = start + len;

            writer.write_all(mapped.vaddr_slice(start..end))?;
        }
    }

    // Pad end of last segment with zeros:
    writer.align(0x1000)?;

    Ok(())
}

We've used a couple of helper functions, so let's define them now. First, align_hull:

Rust code
// in `crates/pixie/src/lib.rs`

/// Align *down* to the nearest 4K boundary
pub fn floor(val: u64) -> u64 {
    val & !0xFFF
}

/// Align *up* to the nearest 4K boundary
pub fn ceil(val: u64) -> u64 {
    if floor(val) == val {
        val
    } else {
        floor(val + 0x1000)
    }
}

/// Given a convex hull, align its start *down* to the nearest 4K boundary and
/// its end *up* to the nearest 4K boundary
pub fn align_hull(hull: Range<u64>) -> Range<u64> {
    floor(hull.start)..ceil(hull.end)
}

And then MappedObject::lookup_sym, which we use to find the address of entry in libstage1.so. Luckily this one is trivially expressed using the abstractions we've already carefully constructed:

Rust code
// in `crates/pixie/src/lib.rs`

impl<'a> MappedObject<'a> {
    /// Looks up a symbol by name. Its `value` is the (non-relocated) vaddr.
    pub fn lookup_sym(&self, name: &str) -> Result<Sym, PixieError> {
        let dyn_entries = self.object.read_dynamic_entries()?;
        dyn_entries.syms()?.by_name(name)
    }
}

And now, well... we should be generating a fully-relocated, statically linked executable from libstage1.so.

Let's try it?

Shell session
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak
Packing guest "/home/amos/go/bin/hugo"
Picked base_offset 0x400000
Stage1 hull: 400000..40b048
 Guest hull: 400000..3180968
Loaded stage1
Relocated stage1
Looking for `entry` in stage1...
Copying stage1 segments...
copy_start_offset = 0x190
copying ProgramHeader { typ: Load, flags: 0x5, offset: 0x2000, vaddr: 0x2000, paddr: 0x2000, filesz: 0x4b0d, memsz: 0x4b0d, align: 0x1000 }
copying ProgramHeader { typ: Load, flags: 0x4, offset: 0x7000, vaddr: 0x7000, paddr: 0x7000, filesz: 0x1b88, memsz: 0x1b88, align: 0x1000 }
copying ProgramHeader { typ: Load, flags: 0x6, offset: 0x9750, vaddr: 0xa750, paddr: 0xa750, filesz: 0x8c0, memsz: 0x8f8, align: 0x1000 }

$ /tmp/hugo.pak
[stage1] Stack top: 0x7fffa8187a40

Hurray!!

Woo! We did it!

...and if we've done our job correctly, it should have a structure that's very similar to the original guest executable:

Shell session
$ readelf -Wl ~/go/bin/hugo | grep -E 'LOAD|MemSiz'
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x172b4b0 0x172b4b0 R E 0x1000
  LOAD           0x172c000 0x0000000001b2c000 0x0000000001b2c000 0x155ceb8 0x155ceb8 R   0x1000
  LOAD           0x2c89000 0x0000000003089000 0x0000000003089000 0x0b08c0 0x0f7968 RW  0x1000

$ readelf -Wl /tmp/hugo.pak | grep -E 'LOAD|MemSiz'
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x001060 0x001060 R   0x1000
  LOAD           0x002000 0x0000000000402000 0x0000000000402000 0x004b0d 0x004b0d R E 0x1000
  LOAD           0x007000 0x0000000000407000 0x0000000000407000 0x001b88 0x001b88 R   0x1000
  LOAD           0x009750 0x000000000040a750 0x000000000040a750 0x0008c0 0x0008f8 RW  0x1000
  LOAD           0x000000 0x000000000040c000 0x000000000040c000 0x000000 0x2d75000 RW  0x1000

$ gdb -q -ex "set noconfirm" -ex "p/x 0x0000000003089000 + 0x0f7968" -ex "p/x 0x000000000040c000 + 0x2d75000" -ex "quit"
No symbol table is loaded.  Use the "file" command.
$1 = 0x3180968
$2 = 0x3181000

Yes! After alignment, both executables end at the same address, and so their brk should be the same.
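
The "after alignment" part can be spelled out. Here's a small sketch of the arithmetic, assuming 4K pages (the `Align` column above); `align_up` is a helper written just for this example.

```rust
/// Rounds `addr` up to the next multiple of `align` (which must be a power of two).
fn align_up(addr: u64, align: u64) -> u64 {
    debug_assert!(align.is_power_of_two());
    (addr + align - 1) & !(align - 1)
}

fn main() {
    const PAGE: u64 = 0x1000;

    // End of the guest's last LOAD segment: vaddr + memsz
    let guest_end = 0x3089000_u64 + 0x0f7968;
    // End of the packed executable's last LOAD segment
    let packed_end = 0x40c000_u64 + 0x2d75000;

    println!("guest end:  0x{:x} -> 0x{:x}", guest_end, align_up(guest_end, PAGE));
    println!("packed end: 0x{:x} -> 0x{:x}", packed_end, align_up(packed_end, PAGE));
    assert_eq!(align_up(guest_end, PAGE), align_up(packed_end, PAGE));
}
```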

Enter stage two

Now remember, we cannot have stage1 directly load the guest — well, right now, we're not even writing the compressed guest to our output file, so it's tiny:

Shell session
$ ls -lhA /tmp/hugo.pak
-rwxr-xr-x 1 amos amos 44K May  1 22:59 /tmp/hugo.pak

But still, we cannot have stage1 load the guest, because it's mapped where the guest should be.

First, we need to map stage2 out of the way.

But where is "out of the way"?

Well, that's the beauty of it! The whole area where the guest will eventually be is already mapped from our executable.

So any call to mmap (without the FIXED flag) will give us a region that's "out of the way" — it won't overwrite an already-mapped region.
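
To make the conflict concrete, here's a tiny sketch that checks the two hulls minipak printed for hugo against each other. The `overlaps` helper and the `elsewhere` range are made up for the example; the latter just stands in for wherever the kernel decides to put a non-FIXED mapping.

```rust
use std::ops::Range;

/// Two half-open address ranges overlap if each one starts before the other ends.
fn overlaps(a: &Range<u64>, b: &Range<u64>) -> bool {
    a.start < b.end && b.start < a.end
}

fn main() {
    // Hulls as printed by minipak when packing hugo
    let stage1_hull = 0x400000..0x40b048_u64;
    let guest_hull = 0x400000..0x3180968_u64;
    // Made-up range, standing in for an address picked by the kernel
    let elsewhere = 0x7f00_0000_0000..0x7f00_0001_0000_u64;

    println!("stage1 vs guest:    {}", overlaps(&stage1_hull, &guest_hull));
    println!("elsewhere vs guest: {}", overlaps(&elsewhere, &guest_hull));
}
```

The first check comes out `true` (stage1 sits right where the guest wants to go), the second `false`.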

Well, right now we don't even have a stage2 to map, so, let's make one!

Shell session
$ (cd crates && cargo new --lib stage2)
warning: compiling this new package may not work due to invalid workspace configuration

current package believes it's in a workspace when it's not:
current:   /home/amos/ftl/minipak/crates/stage2/Cargo.toml
workspace: /home/amos/ftl/minipak/Cargo.toml

this may be fixable by adding `crates/stage2` to the `workspace.members` array of the manifest located at: /home/amos/ftl/minipak/Cargo.toml
Alternatively, to keep it out of the workspace, add the package to the `workspace.exclude` array, or add an empty `[workspace]` table to the package's manifest.
     Created library `stage2` package

Well, it's requested so politely:

TOML markup
# in the top-level `Cargo.toml`

[workspace]
members = [
    "crates/encore",
    "crates/pixie",
    "crates/minipak",
    "crates/stage1",
    "crates/stage2",
]

Let's also add a build script:

Rust code
// in `crates/stage2/build.rs`

fn main() {
    println!("cargo:rustc-link-arg=-Wl,-z,defs");
}

A dependency on encore, and setting the crate type to cdylib:

TOML markup
# in `crates/stage2/Cargo.toml`

[lib]
crate-type = ["cdylib"]

[dependencies]
encore = { path =  "../encore" }

And, well, let's add an entry point to it too!

Rust code
// in `crates/stage2/src/lib.rs`

// Don't use libstd
#![no_std]
// Allow inline assembly
#![feature(asm)]
// Allow naked (no-prelude) functions
#![feature(naked_functions)]
// Use the default allocation error handler
#![feature(default_alloc_error_handler)]

extern crate alloc;

use encore::prelude::*;

macro_rules! info {
    ($($tokens: tt)*) => {
        println!("[stage2] {}", alloc::format!($($tokens)*));
    }
}

#[no_mangle]
#[inline(never)]
/// # Safety
/// Does a raw syscall, initializes the global allocator
unsafe extern "C" fn entry(stack_top: *mut u8) -> ! {
    init_allocator();
    crate::main(stack_top)
}

/// # Safety
/// Maps and jmps to another ELF object
#[inline(never)]
unsafe fn main(stack_top: *mut u8) -> ! {
    info!("Stack top: {:?}", stack_top);
    encore::syscall::exit(0);
}

Now, let's consider the chain of events: minipak generates its executable from the guest and stage1. So by the time the "packed executable" starts up, stage1 is already mapped.

stage2, however, must be mapped by stage1. So, it must be embedded into the "packed executable" as well.

Luckily, we made this next part very easy for ourselves.

First off, whenever we build minipak, we also want to build stage2 — let's add it to our build script:

Rust code
// in `crates/minipak/build.rs`

// omitted: other functions

fn main() {
    for &arg in &["-nostartfiles", "-nodefaultlibs", "-static"] {
        println!("cargo:rustc-link-arg={}", arg);
    }

    cargo_build(&PathBuf::from("../stage1"));
    //                          new! 👇
    cargo_build(&PathBuf::from("../stage2"));
}

Second, let's add it as a Resource in the pixie manifest:

Rust code
// in `crates/pixie/src/manifest.rs`

#[derive(Debug, DekuRead, DekuWrite)]
#[deku(magic = b"piximani")]
pub struct Manifest {
    pub stage2: Resource,
    pub guest: Resource,
}

And thirdly, well, thirdly let's embed both libstage2.so and the compressed guest into the output executable.

They'll go right after our "relinked stage1":

Rust code
// in `crates`

#[allow(clippy::unnecessary_wraps)]
fn main(env: Env) -> Result<(), Error> {
    let args = cli::Args::parse(&env);

    println!("Packing guest {:?}", args.input);
    let guest_file = File::open(args.input)?;
    let guest_map = guest_file.map()?;
    let guest_obj = Object::new(guest_map.as_ref())?;

    let guest_hull = guest_obj.segments().load_convex_hull()?;
    let mut output = Writer::new(&args.output, 0o755)?;
    relink_stage1(guest_hull, &mut output)?;

    let stage2_slice = include_bytes!(concat!(env!("OUT_DIR"), "/embeds/libstage2.so"));

    let stage2_offset = output.offset();
    println!("Copying stage2 at 0x{:x}", stage2_offset);
    output.write_all(stage2_slice)?;
    output.align(0x8)?;

    println!("Compressing guest...");
    let compressed_guest = lz4_flex::compress_prepend_size(guest_map.as_ref());
    let guest_offset = output.offset();
    println!("Copying compressed guest at 0x{:x}", guest_offset);
    output.write_all(&compressed_guest)?;
    output.align(0x8)?;

    let manifest_offset = output.offset();
    println!("Writing manifest at 0x{:x}", manifest_offset);
    let manifest = Manifest {
        stage2: Resource {
            offset: stage2_offset as _,
            len: stage2_slice.len(),
        },
        guest: Resource {
            offset: guest_offset as _,
            len: compressed_guest.len(),
        },
    };
    output.write_deku(&manifest)?;
    output.align(0x8)?;

    println!("Writing end marker");
    let end_marker = EndMarker {
        manifest_offset: manifest_offset as _,
    };
    output.write_deku(&end_marker)?;

    println!("Written to ({})", args.output);

    Ok(())
}
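
The resulting file layout is: relinked stage1, then stage2, then the compressed guest, then the manifest, then the end marker, with padding to 8 bytes in between. Here's a std-only sketch of that idea. Everything in it is a stand-in: the in-memory `Writer`, the `pack` function, and especially the byte layout (magic followed by little-endian `u64`s, and an end marker that is just the manifest offset in the last 8 bytes) are assumptions made for the example, not what deku actually emits for our real types.

```rust
/// In-memory stand-in for the output file.
struct Writer {
    buf: Vec<u8>,
}

impl Writer {
    fn offset(&self) -> u64 {
        self.buf.len() as u64
    }

    fn write_all(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
    }

    /// Pads with zeros until the offset is a multiple of `align`.
    fn align(&mut self, align: usize) {
        while self.buf.len() % align != 0 {
            self.buf.push(0);
        }
    }
}

/// Lays out stage1, stage2, guest, manifest and end marker, in that order.
fn pack(stage1: &[u8], stage2: &[u8], guest: &[u8]) -> Vec<u8> {
    let mut out = Writer { buf: Vec::new() };
    out.write_all(stage1);
    out.align(8);

    let stage2_offset = out.offset();
    out.write_all(stage2);
    out.align(8);

    let guest_offset = out.offset();
    out.write_all(guest);
    out.align(8);

    // Assumed manifest layout: magic, then (offset, len) for each resource
    let manifest_at = out.offset();
    out.write_all(b"piximani");
    for v in &[stage2_offset, stage2.len() as u64, guest_offset, guest.len() as u64] {
        out.write_all(&v.to_le_bytes());
    }
    out.align(8);

    // Assumed end marker: just the manifest offset, at the very end of the file
    out.write_all(&manifest_at.to_le_bytes());
    out.buf
}

/// Reads the manifest offset back from the last 8 bytes of the file.
fn manifest_offset(file: &[u8]) -> Option<u64> {
    let tail = file.len().checked_sub(8)?;
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&file[tail..]);
    Some(u64::from_le_bytes(bytes))
}

fn main() {
    let file = pack(&[0xAA; 13], &[0xBB; 5], &[0xCC; 21]);
    let at = manifest_offset(&file).unwrap() as usize;
    assert_eq!(&file[at..at + 8], b"piximani");
    println!("manifest at 0x{:x}, file is {} bytes", at, file.len());
}
```

Putting the marker at the very end is what lets a reader find everything else without knowing any sizes up front: start from the tail and follow the offsets.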

There! Now minipak is feature-complete. Well, the minipak crate, not the whole project — stage1 and stage2 are still not complete, but our minipak executable does everything we want it to do, and its output is a lot chunkier than before:

Shell session
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak
Packing guest "/home/amos/go/bin/hugo"
Picked base_offset 0x400000
Stage1 hull: 400000..40b048
 Guest hull: 400000..3180968
Loaded stage1
Relocated stage1
Looking for `entry` in stage1...
Copying stage1 segments...
copy_start_offset = 0x190
copying ProgramHeader { typ: Load, flags: 0x5, offset: 0x2000, vaddr: 0x2000, paddr: 0x2000, filesz: 0x4b0d, memsz: 0x4b0d, align: 0x1000 }
copying ProgramHeader { typ: Load, flags: 0x4, offset: 0x7000, vaddr: 0x7000, paddr: 0x7000, filesz: 0x1b88, memsz: 0x1b88, align: 0x1000 }
copying ProgramHeader { typ: Load, flags: 0x6, offset: 0x9750, vaddr: 0xa750, paddr: 0xa750, filesz: 0x8c0, memsz: 0x8f8, align: 0x1000 }
Copying stage2 at 0xb000
Compressing guest...
Copying compressed guest at 0x15670
Writing manifest at 0x1eda4f8
Writing end marker
Written to (/tmp/hugo.pak)

$ ls -lhA /tmp/hugo.pak
-rwxr-xr-x 1 amos amos 31M May  2 00:06 /tmp/hugo.pak

Mh. How does it compare with the original binary though?

Shh let's keep that for later. When we actually get it to work.

Okay, so! Clearly stage1 has to read the EndMarker, to find the Manifest, so it knows where stage2 is, and it can map it.

Turns out this is relatively compact:

Rust code
// in `crates/stage1/src/lib.rs`

use pixie::{Manifest, MappedObject, Object};

/// # Safety
/// Maps and calls into another ELF object
#[inline(never)]
unsafe fn main(stack_top: *mut u8) -> ! {
    info!("Stack top: {:?}", stack_top);

    // Open ourselves and read the manifest.
    let file = File::open("/proc/self/exe").unwrap();
    let map = file.map().unwrap();
    let slice = map.as_ref();
    let manifest = Manifest::read_from_full_slice(slice).unwrap();

    // Load stage2 anywhere in memory
    let s2_slice = &slice[manifest.stage2.as_range()];
    let s2_obj = Object::new(s2_slice).unwrap();
    let mut s2_mapped = MappedObject::new(&s2_obj, None).unwrap();
    info!(
        "Mapped stage2 at base 0x{:x} (offset 0x{:x})",
        s2_mapped.base(),
        s2_mapped.base_offset()
    );
    info!("Relocating stage2...");
    s2_mapped.relocate(s2_mapped.base_offset()).unwrap();
    info!("Relocating stage2... done!");

    // Find stage2's entry function and call it
    let s2_entry = s2_mapped.lookup_sym("entry").unwrap();
    info!("Found entry_sym {:?}", s2_entry);
    let entry: unsafe extern "C" fn(*mut u8) -> ! =
        core::mem::transmute(s2_mapped.base_offset() + s2_entry.value);

    entry(stack_top);
}
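
The last step, turning "base offset + symbol value" into something callable, is the scariest-looking one. Here's a self-contained sketch of just that trick. `fake_entry` and `as_entry` are made up for the example, and the signature is simplified so we can check the result (the real `entry` takes the stack top and never returns).

```rust
/// Stand-in for stage2's `entry`, with a simpler signature.
extern "C" fn fake_entry(x: u64) -> u64 {
    x + 1
}

/// Turns a raw address into a callable function pointer.
///
/// # Safety
/// `addr` must be the address of a function with exactly this signature.
unsafe fn as_entry(addr: usize) -> extern "C" fn(u64) -> u64 {
    unsafe { std::mem::transmute::<usize, extern "C" fn(u64) -> u64>(addr) }
}

fn main() {
    // Pretend the object was mapped at `base_offset` and that the symbol
    // table gave us `value`; their sum is the function's runtime address.
    let f: extern "C" fn(u64) -> u64 = fake_entry;
    let runtime_addr = f as usize;
    let base_offset = runtime_addr & !0xfff;
    let value = runtime_addr - base_offset;

    let entry = unsafe { as_entry(base_offset + value) };
    println!("entry(41) = {}", entry(41));
}
```

Nothing checks that the address really points at a function with that signature, which is why this has to live in an `unsafe` block.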

Of course this uses some types and functions from pixie, so:

TOML markup
# in `crates/stage1/Cargo.toml`

[dependencies]
pixie = { path = "../pixie" }

And now... well, the whole thing won't quite work, but at least we should reach stage2.

Shell session
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak
Packing guest "/home/amos/go/bin/hugo"
(cut)

$ /tmp/hugo.pak
[stage1] Stack top: 0x7ffca05e6de0
[stage1] Mapped stage2 at base 0x7f199bfd2000 (offset 0x7f199bfd2000)
[stage1] Relocating stage2...
[stage1] Relocating stage2... done!
[stage1] Found entry_sym Sym { name: 112, bind: Global, typ: Func, shndx: 7, value: 26240, size: 20 }
[stage2] Stack top: 0x7ffca05e6de0

...and we do!

And now, the pièce de résistance.

Running an executable from memory

Hey, we've done that already!

Why yes, yes we have! But this is our last, and our best.

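To launch the executable we have to decompress the guest, map it in memory, point the auxiliary vectors (PHDR, PHNUM, ENTRY) at it, and jump to its entry point.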

Oh, what about dynamically-linked executables? That need an interpreter?

Ah, I guess we can do that too! If we find an interpreter segment, we can map it in memory as well, and jump to its entry point, instead of the guest's.

Let's go!

Rust code
// in `crates/stage2/src/lib.rs`

use pixie::{Manifest, MappedObject, Object, ObjectHeader};

/// # Safety
/// Maps and jmps to another ELF object
#[inline(never)]
unsafe fn main(stack_top: *mut u8) -> ! {
    info!("Stack top: {:?}", stack_top);

    let mut stack = Env::read(stack_top as _);

    // Open ourselves and read the manifest.
    let file = File::open("/proc/self/exe").unwrap();
    info!("Mapping self...");
    let map = file.map().unwrap();
    info!("Mapping self... done!");
    let slice = map.as_ref();
    let manifest = Manifest::read_from_full_slice(slice).unwrap();

    let compressed_guest = &slice[manifest.guest.as_range()];
    let guest = lz4_flex::decompress_size_prepended(compressed_guest).unwrap();
    let guest_obj = Object::new(guest.as_ref()).unwrap();
    let guest_hull = guest_obj.segments().load_convex_hull().unwrap();

    let at = if guest_hull.start == 0 {
        // guest is relocatable, load it with the same base as ourselves
        let elf_header_address = stack.find_vector(AuxvType::PHDR).value;
        let self_base = elf_header_address - ObjectHeader::SIZE as u64;
        Some(self_base)
    } else {
        // guest is non-relocatable, it'll be loaded at its preferred offset
        None
    };
    let base_offset = at.unwrap_or_default();

    let guest_mapped = MappedObject::new(&guest_obj, at).unwrap();
    info!("Mapped guest at 0x{:x}", guest_mapped.base());

    // Set phdr auxiliary vector
    let at_phdr = stack.find_vector(AuxvType::PHDR);
    at_phdr.value = guest_mapped.base() + guest_obj.header().ph_offset;

    // Set phnum auxiliary vector
    let at_phnum = stack.find_vector(AuxvType::PHNUM);
    at_phnum.value = guest_obj.header().ph_count as _;

    // Set entry auxiliary vector
    let at_entry = stack.find_vector(AuxvType::ENTRY);
    at_entry.value = base_offset + guest_obj.header().entry_point;

    match guest_obj.segments().find(SegmentType::Interp) {
        Ok(interp) => {
            let interp = core::str::from_utf8(interp.slice()).unwrap();
            println!("Should load interpreter {}!", interp);

            let interp_file = File::open(interp).unwrap();
            let interp_map = interp_file.map().unwrap();
            let interp_obj = Object::new(interp_map.as_ref()).unwrap();
            let interp_hull = interp_obj.segments().load_convex_hull().unwrap();
            if interp_hull.start != 0 {
                panic!("Expected interpreter to be relocatable");
            }

            // Map interpreter anywhere
            let interp_mapped = MappedObject::new(&interp_obj, None).unwrap();

            // Adjust base
            let at_base = stack.find_vector(AuxvType::BASE);
            at_base.value = interp_mapped.base();

            let entry_point = interp_mapped.base() + interp_obj.header().entry_point;
            info!("Jumping to interpreter's entry point 0x{:x}", entry_point);
            pixie::launch(stack_top, entry_point);
        }
        Err(_) => {
            let entry_point = base_offset + guest_obj.header().entry_point;
            info!("Jumping to guest's entry point 0x{:x}", entry_point);
            pixie::launch(stack_top, entry_point);
        }
    }
}
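
The auxiliary vector fix-ups are the least obvious part of that function, so here's a std-only sketch of what they amount to. It models the auxiliary vector as a plain slice of (type, value) pairs terminated by `AT_NULL`, which is a simplification of what `Env` does on the real stack. The `Auxv` struct and the `patch_for_guest` helper are invented for this example; the `AT_*` numbers are the standard Linux ones.

```rust
const AT_NULL: u64 = 0;
const AT_PHDR: u64 = 3;
const AT_PHNUM: u64 = 5;
const AT_ENTRY: u64 = 9;

/// One auxiliary vector entry, as laid out on the stack on x86_64.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct Auxv {
    typ: u64,
    value: u64,
}

/// Finds the first entry of the given type, stopping at the AT_NULL terminator.
fn find_vector(auxv: &mut [Auxv], typ: u64) -> Option<&mut Auxv> {
    auxv.iter_mut()
        .take_while(|v| v.typ != AT_NULL)
        .find(|v| v.typ == typ)
}

/// Rewrites the entries so they describe the guest instead of the packer.
fn patch_for_guest(auxv: &mut [Auxv], base: u64, ph_offset: u64, ph_count: u64, entry: u64) {
    if let Some(v) = find_vector(auxv, AT_PHDR) {
        v.value = base + ph_offset;
    }
    if let Some(v) = find_vector(auxv, AT_PHNUM) {
        v.value = ph_count;
    }
    if let Some(v) = find_vector(auxv, AT_ENTRY) {
        v.value = entry;
    }
}

fn main() {
    // Placeholder values, standing in for what the kernel set up for the packer
    let mut auxv = [
        Auxv { typ: AT_PHDR, value: 0x1111 },
        Auxv { typ: AT_PHNUM, value: 0x2222 },
        Auxv { typ: AT_ENTRY, value: 0x3333 },
        Auxv { typ: AT_NULL, value: 0 },
    ];

    // hugo: mapped at 0x400000, program headers at file offset 0x40.
    // The count of 7 is derived from the PHDR size (0x188 / 56 bytes each).
    patch_for_guest(&mut auxv, 0x400000, 0x40, 7, 0x4712a0);

    for v in &auxv {
        println!("type {:>2} = 0x{:x}", v.typ, v.value);
    }
}
```

These matter because the guest (or its interpreter) reads them at startup to find its own program headers and entry point; left untouched, they would still describe the packed executable.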

We just need a couple dependencies:

TOML markup
# in `crates/stage2/Cargo.toml`

[dependencies]
encore = { path =  "../encore" }
# 👇 new!
pixie = { path =  "../pixie" }
# 👇 also new!
lz4_flex = { version = "0.7.5", default-features = false, features = ["safe-encode", "safe-decode"] }

And we're off to the races!

Shell session
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak
Packing guest "/home/amos/go/bin/hugo"
(cut)

$ /tmp/hugo.pak
[stage1] Stack top: 0x7ffc7371c880
[stage1] Mapped stage2 at base 0x7f6cf5481000 (offset 0x7f6cf5481000)
[stage1] Relocating stage2...
[stage1] Relocating stage2... done!
[stage1] Found entry_sym Sym { name: 119, bind: Global, typ: Func, shndx: 7, value: 75936, size: 20 }
[stage2] Stack top: 0x7ffc7371c880
[stage2] Mapping self...
[stage2] Mapping self... done!
[stage2] Mapped guest at 0x400000
[stage2] Jumping to guest's entry point 0x4712a0
Total in 0 ms
Error: Unable to locate config file or config directory. Perhaps you need to create a new site.
       Run `hugo help new` for details.

✨✨✨

We did it! We finally did it.

Let's make sure it also works with dynamically-linked executables:

Shell session
$ cargo run --quiet --release --bin minipak -- /bin/ls -o /tmp/ls.pak
Packing guest "/bin/ls"
Picked base_offset 0x800000
Stage1 hull: 800000..81e048
 Guest hull: 0..24558
Loaded stage1
Relocated stage1
Looking for `entry` in stage1...
WARNING: Guest executable is too small, the `brk` will be wrong.
Copying stage1 segments...
copy_start_offset = 0x190
copying ProgramHeader { typ: Load, flags: 0x5, offset: 0x3000, vaddr: 0x3000, paddr: 0x3000, filesz: 0x1226d, memsz: 0x1226d, align: 0x1000 }
copying ProgramHeader { typ: Load, flags: 0x4, offset: 0x16000, vaddr: 0x16000, paddr: 0x16000, filesz: 0x485c, memsz: 0x485c, align: 0x1000 }
copying ProgramHeader { typ: Load, flags: 0x6, offset: 0x1b380, vaddr: 0x1c380, paddr: 0x1c380, filesz: 0x1c90, memsz: 0x1cc8, align: 0x1000 }
Copying stage2 at 0x1e000
Compressing guest...
Copying compressed guest at 0x39670
Writing manifest at 0x4e180
Writing end marker
Written to (/tmp/ls.pak)

$ /tmp/ls.pak -lhA
[stage1] Stack top: 0x7fff2495c6e0
[stage1] Mapped stage2 at base 0x7fce6c804000 (offset 0x7fce6c804000)
[stage1] Relocating stage2...
[stage1] Relocating stage2... done!
[stage1] Found entry_sym Sym { name: 119, bind: Global, typ: Func, shndx: 7, value: 75936, size: 20 }
[stage2] Stack top: 0x7fff2495c6e0
[stage2] Mapping self...
[stage2] Mapping self... done!
[stage2] Mapped guest at 0x800000
Should load interpreter /lib64/ld-linux-x86-64.so.2!
[stage2] Jumping to interpreter's entry point 0x7fce6474b090
total 48K
drwxr-xr-x 2 amos amos 4.0K May  1 15:53 .cargo
-rw-r--r-- 1 amos amos 6.5K May  2 00:38 Cargo.lock
-rw-r--r-- 1 amos amos  223 May  1 23:33 Cargo.toml
drwxr-xr-x 7 amos amos 4.0K May  1 23:33 crates
-rw------- 1 amos amos 2.6K May  1 18:11 .gdb_history
drwxr-xr-x 8 amos amos 4.0K May  1 23:28 .git
-rw-r--r-- 1 amos amos   21 Feb 21 20:14 .gitignore
-rw-r--r-- 1 amos amos  117 May  1 15:44 rust-toolchain
drwxr-xr-x 2 amos amos 4.0K May  1 19:45 samples
drwxr-xr-x 4 amos amos 4.0K May  1 19:58 target
drwxr-xr-x 2 amos amos 4.0K Feb 21 18:28 .vscode

Beary cool! Does it really make sense to compress ls though?

Well, no, not really. ls is already so small, our packed version is actually larger:

Shell session
$ ls -lhA /bin/ls
-rwxr-xr-x 1 root root 139K Mar  6  2020 /bin/ls

$ ls -lhA /tmp/ls.pak
-rwxr-xr-x 1 amos amos 313K May  2 00:48 /tmp/ls.pak

...because it contains stage1, stage2 and a compressed version of ls, and both stages are pretty chunky right now:

Shell session
$ ls -lhA ./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/
total 244K
-rw-r--r-- 1 amos amos  177 May  1 19:58 CACHEDIR.TAG
-rwxr-xr-x 1 amos amos 118K May  2 00:47 libstage1.so
-rwxr-xr-x 1 amos amos 110K May  2 00:47 libstage2.so
drwxr-xr-x 7 amos amos 4.0K May  2 00:47 release
-rw-r--r-- 1 amos amos 1.6K May  1 19:58 .rustc_info.json

So, let's answer a bunch of questions!

Why are stage1 / stage2 sorta chunky?

Well, first off, ~110K is not that chunky, by desktop computer standards.

It's positively tiny by server computer standards, and it's enormous by embedded standards, but we're not targeting your smartwatch, so all is well.

Still, I was curious what was in there, so I looked, using Bloaty McBloatface:

Shell session
$ cargo build --release
$ objcopy --strip-all ./target/release/libstage1.so /tmp/libstage1.so
$ bloaty -d symbols -n 0 --debug-file ./target/release/libstage1.so /tmp/libstage1.so | head -30
    FILE SIZE        VM SIZE
 --------------  --------------
   6.9%  8.07Ki   0.0%       0    [Unmapped]
   6.3%  7.41Ki   6.9%  7.41Ki    [section .rela.dyn]
   4.7%  5.50Ki   5.1%  5.50Ki    [section .rodata]
   4.3%  5.02Ki   4.7%  5.02Ki    [section .data.rel.ro]
   3.2%  3.77Ki   3.5%  3.77Ki    _$LT$pixie..format..header..ObjectHeader$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::hf0dca140941584be
   2.7%  3.13Ki   2.9%  3.13Ki    _$LT$bitvec..ptr..span..BitSpanError$LT$T$GT$$u20$as$u20$core..fmt..Debug$GT$::fmt::hae2f2e9efb0f4129
   2.1%  2.49Ki   2.3%  2.49Ki    pixie::MappedObject::relocate::h2fcec852915cca41
   2.0%  2.31Ki   2.1%  2.31Ki    bitvec::slice::BitSlice$LT$O$C$T$GT$::clone_from_bitslice::hf0e2687b7949ac19
   1.7%  2.02Ki   1.9%  2.02Ki    [section .text]
   1.5%  1.77Ki   1.6%  1.77Ki    core::fmt::Formatter::pad::hcb18266da989bb74
   1.3%  1.58Ki   1.5%  1.58Ki    deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$u16$GT$::read::h52a077145863edef
   1.3%  1.58Ki   1.5%  1.58Ki    deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$u32$GT$::read::h66b6fabedc184b5f
   1.3%  1.58Ki   1.5%  1.58Ki    deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$usize$GT$::read::h62ed5fa41ab068c2
   1.3%  1.52Ki   1.4%  1.52Ki    deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$u8$GT$::read::h4e61a7d6b96113b7
   1.3%  1.50Ki   0.0%       0    [ELF Headers]
   1.1%  1.35Ki   1.3%  1.35Ki    _$LT$str$u20$as$u20$core..fmt..Debug$GT$::fmt::h06fbb5704eb2e464
   1.1%  1.28Ki   1.2%  1.28Ki    stage1::main::hdd1a3e200abaead0
   1.1%  1.26Ki   1.2%  1.26Ki    pixie::Object::new::hb545e2dcce88210a
   1.0%  1.23Ki   1.1%  1.23Ki    _$LT$pixie..format..sym..Sym$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::h3cb963e32c924534
   1.0%  1.21Ki   1.1%  1.21Ki    core::fmt::Formatter::pad_integral::h5030801cc5b3cd80
   0.9%  1.05Ki   1.0%  1.05Ki    _$LT$pixie..format..program_header..ProgramHeader$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::h92ac268876508bb9
   0.9%  1.04Ki   1.0%  1.04Ki    core::str::slice_error_fail::h02d9683ab20ccc40
   0.8%    1023   0.9%    1023    _$LT$pixie..manifest..Manifest$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::h9d75f433f2e12a83
   0.8%     920   0.8%     920    linked_list_allocator::hole::HoleList::allocate_first_fit::h2b05751692364505
   0.7%     873   0.8%     873    bitvec::vec::api::_$LT$impl$u20$bitvec..vec..BitVec$LT$O$C$T$GT$$GT$::extend_with::h24831e0a831e998d
   0.7%     870   0.8%     870    pixie::MappedObject::lookup_sym::hf84feee74de2706b
   0.7%     853   0.8%     853    bitvec::slice::BitSlice$LT$O$C$T$GT$::copy_within_unchecked::h460ce8747088367a
   0.7%     817   0.7%     817    _$LT$pixie..format..rela..Rela$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::hef92d06a7ec7525c

Well well well. I won't call out anyone here, but, convenience does come at a cost, it would seem.

Let's look at stage2:

Shell session
$ objcopy --strip-all ./target/release/libstage2.so /tmp/libstage2.so
$ bloaty -d symbols -n 0 --debug-file ./target/release/libstage2.so /tmp/libstage2.so | head -30
    FILE SIZE        VM SIZE
 --------------  --------------
   8.0%  8.77Ki   0.0%       0    [Unmapped]
   5.8%  6.40Ki   6.4%  6.40Ki    [section .rela.dyn]
   5.1%  5.58Ki   5.6%  5.58Ki    stage2::main::he49bd1bea95ff619
   4.6%  5.08Ki   5.1%  5.08Ki    [section .rodata]
   3.4%  3.77Ki   3.8%  3.77Ki    _$LT$pixie..format..header..ObjectHeader$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::hf0dca140941584be
   3.3%  3.58Ki   3.6%  3.58Ki    [section .data.rel.ro]
   2.1%  2.31Ki   2.3%  2.31Ki    bitvec::slice::BitSlice$LT$O$C$T$GT$::clone_from_bitslice::hf0e2687b7949ac19
   1.7%  1.90Ki   1.9%  1.90Ki    [section .text]
   1.6%  1.77Ki   1.8%  1.77Ki    core::fmt::Formatter::pad::hcb18266da989bb74
   1.4%  1.58Ki   1.6%  1.58Ki    deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$u16$GT$::read::h52a077145863edef
   1.4%  1.58Ki   1.6%  1.58Ki    deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$u32$GT$::read::h66b6fabedc184b5f
   1.4%  1.58Ki   1.6%  1.58Ki    deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$usize$GT$::read::h62ed5fa41ab068c2
   1.4%  1.57Ki   1.6%  1.57Ki    _$LT$bitvec..ptr..span..BitSpanError$LT$T$GT$$u20$as$u20$core..fmt..Debug$GT$::fmt::hae2f2e9efb0f4129
   1.4%  1.52Ki   1.5%  1.52Ki    deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$u8$GT$::read::h4e61a7d6b96113b7
   1.3%  1.38Ki   0.0%       0    [ELF Headers]
   1.2%  1.35Ki   1.4%  1.35Ki    _$LT$str$u20$as$u20$core..fmt..Debug$GT$::fmt::h06fbb5704eb2e464
   1.1%  1.26Ki   1.3%  1.26Ki    pixie::Object::new::hb545e2dcce88210a
   1.1%  1.21Ki   1.2%  1.21Ki    core::fmt::Formatter::pad_integral::h5030801cc5b3cd80
   1.0%  1.05Ki   1.1%  1.05Ki    _$LT$pixie..format..program_header..ProgramHeader$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::h92ac268876508bb9
   1.0%  1.04Ki   1.1%  1.04Ki    core::str::slice_error_fail::h02d9683ab20ccc40
   0.9%    1023   1.0%    1023    _$LT$pixie..manifest..Manifest$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::h9d75f433f2e12a83
   0.8%     920   0.9%     920    linked_list_allocator::hole::HoleList::allocate_first_fit::h2b05751692364505
   0.8%     873   0.9%     873    bitvec::vec::api::_$LT$impl$u20$bitvec..vec..BitVec$LT$O$C$T$GT$$GT$::extend_with::h24831e0a831e998d
   0.8%     853   0.8%     853    bitvec::slice::BitSlice$LT$O$C$T$GT$::copy_within_unchecked::h460ce8747088367a
   0.7%     812   0.8%     812    _$LT$pixie..manifest..EndMarker$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::h83ebca148e53570a
   0.7%     796   0.8%     796    encore::fs::File::raw_open::hdba6f4608ca26b0d
   0.7%     772   0.8%     772    _$LT$core..fmt..builders..PadAdapter$u20$as$u20$core..fmt..Write$GT$::write_str::h7ca3568df6f09b6a
   0.7%     770   0.8%     770    bitvec::slice::BitSlice$LT$O$C$T$GT$::copy_within_unchecked::h31c15c4829e980b1

Interestingly, lz4_flex (which stage1 does not use) doesn't even show up in the top 30 "hungriest hippos" (er, symbols):

Shell session
$ bloaty -d symbols -n 0 --debug-file ./target/release/libstage2.so /tmp/libstage2.so | grep lz4
   0.4%     433   0.4%     433    lz4_flex::block::decompress_safe::duplicate_overlapping_slice::h018f1b297902314e
   0.3%     370   0.4%     370    _$LT$lz4_flex..block..DecompressError$u20$as$u20$core..fmt..Debug$GT$::fmt::hb893af1a554d89f6
   0.2%     205   0.2%     205    lz4_flex::block::decompress_safe::copy_24::hfa85bd20ab36cca7

Although, maybe it's just been mostly inlined in stage2::main? Hard to tell.

We can always try to ask the compiler to "optimize for size", see if it makes a difference?

TOML markup
[profile.release]
opt-level = "s"
Shell session
$ cargo build --release
$ objcopy --strip-all ./target/release/libstage1.so /tmp/libstage1.so
$ objcopy --strip-all ./target/release/libstage2.so /tmp/libstage2.so
$ ls -lhA /tmp/libstage*
-rwxr-xr-x 1 amos amos 114K May  2 01:30 /tmp/libstage1.so
-rwxr-xr-x 1 amos amos  98K May  2 01:31 /tmp/libstage2.so

It helps a little!

What about optimization level z?

TOML markup
[profile.release]
opt-level = "z"
Shell session
$ ls -lhA /tmp/libstage*
-rwxr-xr-x 1 amos amos 118K May  2 01:32 /tmp/libstage1.so
-rwxr-xr-x 1 amos amos 106K May  2 01:32 /tmp/libstage2.so

Mh, nope, s was better for us.

What if we switch from "thin" LTO to "fat" LTO?

TOML markup
[profile.release]
lto = "fat"
Shell session
$ ls -lhA /tmp/libstage*
-rwxr-xr-x 1 amos amos 94K May  2 01:33 /tmp/libstage1.so
-rwxr-xr-x 1 amos amos 82K May  2 01:33 /tmp/libstage2.so

Mhh, mhh, small gains. We can even bring down codegen-units to 1, to really take advantage of LTO, as explained here by James.

TOML markup
[profile.release]
codegen-units = 1
incremental = false
Shell session
$ ls -lhA /tmp/libstage*
-rwxr-xr-x 1 amos amos 82K May  2 01:36 /tmp/libstage1.so
-rwxr-xr-x 1 amos amos 70K May  2 01:37 /tmp/libstage2.so

Finally, we can force libcore to be built with those settings as well:

Cool bear's hot tip

To get this to compile, we had to comment out mentions of compiler-builtins in the encore crate.

Also, -Z build-std is a nightly flag: it only works because we ask for a nightly toolchain in the rust-toolchain file.

Shell session
$ cargo build -Z build-std --target x86_64-unknown-linux-gnu --release

Note also that -Z build-std requires --target to be set, and that changes the directory where the libraries are produced:

Shell session
$ objcopy --strip-all ./target/x86_64-unknown-linux-gnu/release/libstage1.so /tmp/libstage1.so
$ objcopy --strip-all ./target/x86_64-unknown-linux-gnu/release/libstage2.so /tmp/libstage2.so
$ ls -lhA /tmp/libstage*
-rwxr-xr-x 1 amos amos 70K May  2 01:43 /tmp/libstage1.so
-rwxr-xr-x 1 amos amos 54K May  2 01:43 /tmp/libstage2.so

Let's try to re-pack ls:

Shell session
$ ./target/x86_64-unknown-linux-gnu/release/minipak /bin/ls -o /tmp/ls.pak
Packing guest "/bin/ls"
(cut)

$ ls -lhA /tmp/ls.pak
-rwxr-xr-x 1 amos amos 237K May  2 01:44 /tmp/ls.pak

Well, that's about 24% less than before (313K down to 237K)! Pretty cool.

There are other things we could do!

For example, we could compress stage2 itself — we've seen that adding lz4_flex to the mix didn't make a big difference between stage1 and stage2, and stage2 is actually quite compressible:

Shell session
$ lz4 -9 ./target/x86_64-unknown-linux-gnu/release/libstage2.so /tmp/libstage2.so.lz4
Compressed 701800 bytes into 243010 bytes ==> 34.63%

We could also make stage1 not use pixie at all: we could have minipak do most of the work, generating a list of relocations that we can easily read and process from stage1 without knowledge of the ELF file format. That would probably cut down on stage1's size.
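
To give an idea of what such a pre-digested list could look like, here's a rough sketch. It is purely hypothetical (none of this exists in minipak): it assumes we only need "relative"-style relocations, where the value to write is the load base plus an addend, and it invents a flat table format of 16-byte records. `FlatReloc`, `encode` and `apply` are names made up for the example.

```rust
/// Hypothetical pre-digested relocation: "write `base + addend` at `offset`".
struct FlatReloc {
    offset: u64,
    addend: u64,
}

/// What the packer would emit: 16 bytes per record, little-endian.
fn encode(relocs: &[FlatReloc]) -> Vec<u8> {
    let mut table = Vec::with_capacity(relocs.len() * 16);
    for r in relocs {
        table.extend_from_slice(&r.offset.to_le_bytes());
        table.extend_from_slice(&r.addend.to_le_bytes());
    }
    table
}

/// What the launcher would run: no ELF parsing, just a loop over records.
fn apply(image: &mut [u8], base: u64, table: &[u8]) {
    for rec in table.chunks_exact(16) {
        let mut word = [0u8; 8];
        word.copy_from_slice(&rec[..8]);
        let offset = u64::from_le_bytes(word) as usize;
        word.copy_from_slice(&rec[8..]);
        let addend = u64::from_le_bytes(word);

        let value = base.wrapping_add(addend);
        image[offset..offset + 8].copy_from_slice(&value.to_le_bytes());
    }
}

fn main() {
    let table = encode(&[
        FlatReloc { offset: 8, addend: 0x1000 },
        FlatReloc { offset: 24, addend: 0x2000 },
    ]);
    // Stand-in for the mapped object's memory
    let mut image = vec![0u8; 32];
    apply(&mut image, 0x7f00_0000_0000, &table);
    println!("{:02x?}", &image[8..16]);
}
```

The appeal is that the loop in `apply` needs no formatting machinery and no parser, which is where most of stage1's bytes seem to go according to bloaty.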

And finally, we could switch to a different compression format. I'm not sure LZ4 is the best compromise in terms of compression ratio, decompression speed and code size.

So, how did we do?

The last thing I want to do is compare against... well, the only ELF packer I'm aware of: UPX.

Unfortunately, UPX refuses to pack /bin/ls:

Shell session
$ upx -1 /bin/ls -o /tmp/ls.upx1
                       Ultimate Packer for eXecutables
                          Copyright (C) 1996 - 2020
UPX git-d7ba31+ Markus Oberhumer, Laszlo Molnar & John Reiser   Jan 23rd 2020

        File size         Ratio      Format      Name
   --------------------   ------   -----------   -----------
upx: /bin/ls: CantPackException: bad DT_GNU_HASH n_bucket=0x15  n_bitmask=0x2  len=0xb0

Packed 0 files.

But it'll happily pack hugo, so, let's get comparing:

Shell session
$ ls -lhA ~/go/bin/hugo /tmp/hugo.*
-rwxr-xr-x 1 amos amos 61M Jan 26 10:44 /home/amos/go/bin/hugo
-rwxr-xr-x 1 amos amos 31M May  2 01:56 /tmp/hugo.pak
-rwxr-xr-x 1 amos amos 29M Jan 26 10:44 /tmp/hugo.upx1
-rwxr-xr-x 1 amos amos 26M Jan 26 10:44 /tmp/hugo.upx9

Honestly? I'm pretty happy with those results.

I'm also curious how long it takes to start up each of these.

I'm on a laptop right now, so this will be, uh, less than scientific, also there's disk caches involved, and the whole thing is running in WSL2, but still, let's take a look using hyperfine:

Shell session
$ hyperfine --warmup 5 '~/go/bin/hugo version' '/tmp/hugo.upx1 version' '/tmp/hugo.upx9 version' '/tmp/hugo.pak version'
Benchmark #1: ~/go/bin/hugo version
  Time (mean ± σ):      24.9 ms ±   3.6 ms    [User: 30.6 ms, System: 16.2 ms]
  Range (min … max):    20.5 ms …  41.8 ms    112 runs

Benchmark #2: /tmp/hugo.upx1 version
  Time (mean ± σ):     209.2 ms ±  15.5 ms    [User: 214.9 ms, System: 17.1 ms]
  Range (min … max):   195.8 ms … 253.6 ms    14 runs

Benchmark #3: /tmp/hugo.upx9 version
  Time (mean ± σ):     179.8 ms ±  18.8 ms    [User: 183.1 ms, System: 20.0 ms]
  Range (min … max):   160.9 ms … 232.5 ms    16 runs

Benchmark #4: /tmp/hugo.pak version
  Time (mean ± σ):     203.4 ms ±   9.3 ms    [User: 179.7 ms, System: 45.8 ms]
  Range (min … max):   187.2 ms … 217.5 ms    15 runs

Summary
  '~/go/bin/hugo version' ran
    7.23 ± 1.28 times faster than '/tmp/hugo.upx9 version'
    8.18 ± 1.23 times faster than '/tmp/hugo.pak version'
    8.41 ± 1.36 times faster than '/tmp/hugo.upx1 version'

Again, nothing to be ashamed of there — the upx -9 version seems faster than both the upx -1 version and the minipak version, but dang, I'm pretty happy with those results.

Now let's try it on a large Rust binary: futile, which powers this website.

Shell session
$ ls -lhA ~/futile /tmp/futile.*
-rwxr-xr-x 1 amos amos  27M May  2 02:05 /home/amos/futile
-rwxr-xr-x 1 amos amos  12M May  2 02:06 /tmp/futile.pak
-rwxr-xr-x 1 amos amos  11M May  2 02:05 /tmp/futile.upx1
-rwxr-xr-x 1 amos amos 8.3M May  2 02:05 /tmp/futile.upx9

Again, not too bad! upx -9 has a strong lead here too, but keep in mind it's been developed over two decades and its -9 setting uses seven different passes!

What about startup times?

Shell session
$ hyperfine --warmup 15 '~/futile help' '/tmp/futile.upx1 help' '/tmp/futile.upx9 help' '/tmp/futile.pak help'
Benchmark #1: ~/futile help
  Time (mean ± σ):       7.0 ms ±   3.5 ms    [User: 4.4 ms, System: 6.2 ms]
  Range (min … max):     1.8 ms …  16.1 ms    520 runs

  Warning: Command took less than 5 ms to complete. Results might be inaccurate.

Benchmark #2: /tmp/futile.upx1 help
  Time (mean ± σ):      80.5 ms ±   9.4 ms    [User: 76.1 ms, System: 6.1 ms]
  Range (min … max):    74.9 ms … 115.1 ms    37 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark #3: /tmp/futile.upx9 help
  Time (mean ± σ):      72.0 ms ±   6.7 ms    [User: 69.0 ms, System: 4.6 ms]
  Range (min … max):    67.7 ms … 103.6 ms    39 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark #4: /tmp/futile.pak help
  Time (mean ± σ):      71.9 ms ±   4.9 ms    [User: 66.4 ms, System: 6.9 ms]
  Range (min … max):    68.3 ms …  94.5 ms    38 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '~/futile help' ran
   10.34 ± 5.26 times faster than '/tmp/futile.pak help'
   10.35 ± 5.30 times faster than '/tmp/futile.upx9 help'
   11.58 ± 5.99 times faster than '/tmp/futile.upx1 help'

Not bad at all!

(I did my best here, but there were statistical outliers in all the runs. I closed every program I could afford to, and still I couldn't get it to behave. Ah well.)

So, we've done big Go binary, big Rust binary... how about big C++ binary?

Let's compress electron with this!

Compressing with UPX went fine, but minipak crashed:

Shell session
$ ~/ftl/minipak/target/x86_64-unknown-linux-gnu/release/minipak electron -o electron.pak
Packing guest "electron"
Picked base_offset 0x800000
Stage1 hull: 800000..815040
 Guest hull: 0..8227a08
Loaded stage1
Relocated stage1Looking for `entry` in stage1...
Copying stage1 segments...copy_start_offset = 0x190
copying ProgramHeader { typ: Load, flags: 0x5, offset: 0x2000, vaddr: 0x2000, paddr: 0x2000, filesz: 0xbdc1, memsz: 0xbdc1, align: 0x1000 }copying ProgramHeader { typ: Load, flags: 0x4, offset: 0xe000, vaddr: 0xe000, paddr: 0xe000, filesz: 0x4224, memsz: 0x4224, align: 0x1000 }
copying ProgramHeader { typ: Load, flags: 0x6, offset: 0x12a20, vaddr: 0x13a20, paddr: 0x13a20, filesz: 0x15e8, memsz: 0x1620, align: 0x1000 }
Copying stage2 at 0x15000
Compressing guest...panicked at 'memory allocation of 150278328 bytes failed', /home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:386:9
[1]    32021 illegal hardware instruction  ~/ftl/minipak/target/x86_64-unknown-linux-gnu/release/minipak electron -o

Well, yup, we used a fixed heap size and it looks like, for this, 128 MiB weren't enough!
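To see how far off we were, here's a quick back-of-the-envelope check. The request size is copied from the panic message above; `fits_in_heap` and `MIB` are hypothetical names made up for this sketch (not part of minipak or encore), and it assumes the heap size is expressed in units of 1024 * 1024 bytes.

```rust
/// Bytes per MiB (assumption: the heap size is counted in these units).
const MIB: u64 = 1024 * 1024;

/// Returns true if a single request of `request` bytes could fit in a heap
/// of `heap_size_mb` MiB, ignoring everything else already allocated.
/// Hypothetical helper, for illustration only.
fn fits_in_heap(request: u64, heap_size_mb: u64) -> bool {
    request <= heap_size_mb * MIB
}

fn main() {
    // Size of the failed allocation, from the panic message.
    let request: u64 = 150_278_328;

    println!("request: {:.1} MiB", request as f64 / MIB as f64);
    println!("fits in a 128 MiB heap: {}", fits_in_heap(request, 128));
    println!("fits in a 512 MiB heap: {}", fits_in_heap(request, 512));
}
```

That one request is a bit over 143 MiB, so it could never fit in a 128 MiB heap, and that's before counting whatever else was already living there.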

Let's bump that to 512:

Rust code
// in `crates/encore/src/items.rs`

/// Heap size, in megabytes
const HEAP_SIZE_MB: u64 = 512;

And now, compression works!

Let's compare sizes:

Shell session
$ ls -lhA ./electron*
-rwxr-xr-x 1 amos amos 131M May  2 02:15 ./electron
-rwxr-xr-x 1 amos amos  73M May  2 02:23 ./electron.pak
-rwxr-xr-x 1 amos amos  63M May  2 02:15 ./electron.upx1
-rwxr-xr-x 1 amos amos  53M May  2 02:15 ./electron.upx9
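To compare these with the futile numbers, it helps to look at each packed size as a percentage of the original. Here's a rough sketch: the sizes are the rounded values `ls -h` printed above, so the percentages are approximate, and `percent_of_original` is a hypothetical helper made up for this example.

```rust
/// Packed size as a percentage of the original size.
/// Hypothetical helper, for illustration only.
fn percent_of_original(packed_mb: f64, original_mb: f64) -> f64 {
    packed_mb / original_mb * 100.0
}

fn main() {
    // (name, original, minipak, upx -1, upx -9), in MB as rounded by `ls -h`.
    let rows = [
        ("futile", 27.0, 12.0, 11.0, 8.3),
        ("electron", 131.0, 73.0, 63.0, 53.0),
    ];

    for &(name, original, pak, upx1, upx9) in &rows {
        println!(
            "{}: minipak {:.0}%, upx -1 {:.0}%, upx -9 {:.0}%",
            name,
            percent_of_original(pak, original),
            percent_of_original(upx1, original),
            percent_of_original(upx9, original),
        );
    }
}
```

Lower is better here. Going by those rounded sizes, every packer does noticeably worse on electron than on futile.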

And startup times:

Shell session
$ hyperfine --warmup 3 './electron -v' './electron.upx1 -v' './electron.upx9 -v' './electron.pak -v'
Benchmark #1: ./electron -v
  Time (mean ± σ):     107.5 ms ±  12.0 ms    [User: 72.0 ms, System: 11.5 ms]
  Range (min … max):    98.2 ms … 140.1 ms    29 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark #2: ./electron.upx1 -v
  Time (mean ± σ):      1.679 s ±  0.124 s    [User: 766.7 ms, System: 84.0 ms]
  Range (min … max):    1.491 s …  1.807 s    10 runs

Benchmark #3: ./electron.upx9 -v
  Time (mean ± σ):      1.511 s ±  0.138 s    [User: 704.6 ms, System: 66.0 ms]
  Range (min … max):    1.335 s …  1.670 s    10 runs

Benchmark #4: ./electron.pak -v
  Time (mean ± σ):      2.079 s ±  0.597 s    [User: 558.6 ms, System: 309.1 ms]
  Range (min … max):    1.235 s …  2.833 s    10 runs

Summary
  './electron -v' ran
   14.06 ± 2.03 times faster than './electron.upx9 -v'
   15.63 ± 2.09 times faster than './electron.upx1 -v'
   19.34 ± 5.96 times faster than './electron.pak -v'

No surprises there — electron is a beast (but still, 100ms startup time uncompressed is remarkable, given how much it packs).

But, like... we made an executable packer, that compresses electron.

And it runs, it's the real thing!

That's pretty darn cool.

Epilogue

This concludes my longest series so far, "Making our own executable packer".

All the way back in Part 3, I jokingly predicted that I would never finish it:

This series is never going to end.

In 2060, when I'm 70, and everybody will have switched to using Fuchsia on the desktop, my friends will still poke fun at me: "Hey amos, remember your ELF series? When's it gonna end?", and I'll feign a smile, but inside I will be acutely, painfully aware that I have angered the binary gods and that I should have left well enough alone.

Well, take that, 2020 me. We did it reddit! Woo!

I'd like to thank everyone for sticking around to see this series to its conclusion, especially my patrons.

I know you're all probably wondering "what's next??" and the answer is: sleep.

Lots and lots of sleep.

And then, who knows! So many interesting topics. I'm sure y'all will have great suggestions.

I hope you enjoyed the series, and if you've followed along at home, send me screenshots of your stuff running! That would make me really happy.

Until next time, take care!