Fine, we'll relocate our own binary!
Welcome back to the eighteenth and final part of "Making our own executable packer".
In the last article, we had a lot of fun. We already had a "packer" executable, minipak, which joined together stage1 (a launcher), and a compressed version of whichever executable we wanted to pack.
What we added was a whole bunch of abstractions to parse ELF headers using deku, which we used from stage1 to be able to launch the guest executable from memory, instead of writing it to a file and using execve on it.
But then we discovered that, unfortunately, that only worked if the guest was a relocatable executable, ie. if it could be mapped anywhere.
For non-relocatable executables, well... if the guest executable and our stage1 launcher have the ~same base address, like here for example:
$ readelf -Wl ~/go/bin/hugo | grep -A 2 "Program Headers" Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align 👇 PHDR 0x000040 0x0000000000400040 0x0000000000400040 0x000188 0x000188 R 0x1000
$ readelf -Wl ./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/stage1 | grep -A 2 "Program Headers" Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align 👇 LOAD 0x000000 0x0000000000400000 0x0000000000400000 0x000224 0x000224 R 0x1000
Then the resulting executable, the "packed" executable, has the same base address as stage1:
$ readelf -Wl /tmp/hugo.pak | grep -A 2 "Program Headers" Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align 👇 LOAD 0x000000 0x0000000000400000 0x0000000000400000 0x000224 0x000224 R 0x1000
...and when it tries to map the guest binary (hugo) to that same address, it overwrites itself, and kaboom!
So, we're going to have to fix that, because to qualify as a "real executable packer", we'd like our packer to be able to process both relocatable and non-relocatable binaries.
Why?
Well, you know what binaries are very large and also typically non-relocatable? Go binaries! Just like hugo. If minipak had any practical use, it would probably be to compress statically-linked Go binaries.
Meanwhile, in Rust nightly land
Before we do anything useful, let's check out what changed in Rust nightly since we last tried to work on minipak.
Let's bump rust-toolchain to the latest nightly version (at the time of this writing) that has all the components.
[toolchain]
channel = "nightly-2021-04-25"
components = ["rustfmt", "clippy"]
targets = ["x86_64-unknown-linux-gnu"]
$ cargo clean (this installs the newer toolchain) $ cargo build Compiling proc-macro2 v1.0.24 Compiling unicode-xid v0.2.1 (cut) Compiling stage1 v0.1.0 (/home/amos/ftl/minipak/crates/stage1) error[E0557]: feature has been removed --> crates/stage1/src/main.rs:12:12 | 12 | #![feature(link_args)] | ^^^^^^^^^ feature has been removed | = note: removed in favor of using `-C link-arg=ARG` on command line, which is available from cargo build scripts with `cargo:rustc-link-arg` now error: cannot find attribute `link_args` in this scope --> crates/stage1/src/main.rs:16:3 | 16 | #[link_args = "-nostartfiles -nodefaultlibs -static"] | ^^^^^^^^^ help: a built-in attribute with a similar name exists: `linkage`
Uh oh! A feature has disappeared from underneath us.
Well, that's what you get for using nightly.
Yeah, and everything would've been fine if I hadn't manually bumped the version in rust-toolchain!
But I like staying on top of things, so let's actually make the required changes here.
Instead of specifying linker arguments in the source, we now have to specify them in build scripts. We don't have a build script for stage1 yet, so let's add it:
// in `crates/stage1/build.rs`

fn main() {
    println!(
        "cargo:rustc-link-arg={}",
        "-nostartfiles -nodefaultlibs -static"
    );
}
Then let's remove both #![feature(link_args)] and #[link_args = ...] from stage1, and...
$ cargo b Compiling stage1 v0.1.0 (/home/amos/ftl/minipak/crates/stage1) Compiling minipak v0.1.0 (/home/amos/ftl/minipak/crates/minipak) warning: cargo:rustc-link-arg requires -Zextra-link-arg flag error: linking with `cc` failed: exit status: 1 | = note: "cc" "-m64" "-Wl,--eh-frame-hdr" "-Wl,-znoexecstack" "-Wl,--as-needed" "-L" "/home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "/home/amos/ftl/minipak/tar (cut: a very, very long error)
Oh no! We need to opt into this new cargo feature, just like we opted into the #[link_args] feature before.
We can do so with a .cargo/config file. No, not in the home folder: directly in our minipak/ cargo workspace:
[unstable]
extra-link-arg = true
Surely now things will work!
$ cargo b Compiling minipak v0.1.0 (/home/amos/ftl/minipak/crates/minipak) Compiling stage1 v0.1.0 (/home/amos/ftl/minipak/crates/stage1) error: linking with `cc` failed: exit status: 1 | = note: "cc" "-m64" "-Wl,--eh-frame-hdr" (cut: very long error) = note: cc: error: unrecognized command-line option '-nostartfiles -nodefaultlibs -static'
Ohh.
Oh hey, it thinks we're passing all three arguments as.. a single argument!
Yeah! The problem with command-line arguments is... when you're invoking a command from a shell, like that:
$ command foo bar baz
Then command gets three arguments:
- "foo"
- "bar"
- "baz"
But here, in our build script:
// in `crates/stage1/build.rs`

fn main() {
    println!(
        "cargo:rustc-link-arg={}",
        "-nostartfiles -nodefaultlibs -static"
    );
}

Every cargo:rustc-link-arg=SOMETHING line is supposed to be its own argument.
So it's as if we did this instead:
$ command "foo bar baz"
Well, that's no problem at all! We can just write three lines instead:
// in `crates/stage1/build.rs`

fn main() {
    for &arg in &["-nostartfiles", "-nodefaultlibs", "-static"] {
        println!("cargo:rustc-link-arg={}", arg);
    }
}
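If the flags ever come in as a single shell-style string, the same "one directive per argument" idea can be sketched like this. `link_arg_directives` is a hypothetical helper (it is not part of minipak), and it only splits on whitespace, so it assumes no flag contains quoted spaces:

```rust
/// Hypothetical helper: turn a whitespace-separated string of linker flags
/// into one `cargo:rustc-link-arg=...` directive per flag.
/// Assumption: no flag contains spaces or shell quoting.
fn link_arg_directives(flags: &str) -> Vec<String> {
    flags
        .split_whitespace()
        .map(|flag| format!("cargo:rustc-link-arg={}", flag))
        .collect()
}
```

A build script would then print each returned string on its own line, which is what the loop above does by hand.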
And now...
$ cargo b Compiling stage1 v0.1.0 (/home/amos/ftl/minipak/crates/stage1) Compiling minipak v0.1.0 (/home/amos/ftl/minipak/crates/minipak) error[E0557]: feature has been removed --> crates/minipak/src/main.rs:12:12 | 12 | #![feature(link_args)] | ^^^^^^^^^ feature has been removed | = note: removed in favor of using `-C link-arg=ARG` on command line, which is available from cargo build scripts with `cargo:rustc-link-arg` now error: cannot find attribute `link_args` in this scope --> crates/minipak/src/main.rs:16:3 | 16 | #[link_args = "-nostartfiles -nodefaultlibs -static"] | ^^^^^^^^^ help: a built-in attribute with a similar name exists: `linkage`
..well, now we have to give the same treatment to the minipak crate. Remember, we have several crates in our cargo workspace: encore, minipak, pixie and stage1.
Well, it turns out minipak already has a build script, so let's just move some things around:
minipak already has a build script because it includes the stage1 binary at compile-time into itself, so it's able to generate "packed binaries" later.
// in `crates/minipak/build.rs`

use std::{
    path::{Path, PathBuf},
    process::Command,
};

fn main() {
    for &arg in &["-nostartfiles", "-nodefaultlibs", "-static"] {
        println!("cargo:rustc-link-arg={}", arg);
    }

    cargo_build(&PathBuf::from("../stage1"));
}

fn cargo_build(path: &Path) {
    // omitted (same as before)
}
Now we just need to remember to remove #[link_args] and #![feature(link_args)] from the minipak crate, and...
$ cargo run --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak && /tmp/hugo.pak Compiling minipak v0.1.0 (/home/amos/ftl/minipak/crates/minipak) Finished release [optimized + debuginfo] target(s) in 1.93s Running `target/release/minipak /home/amos/go/bin/hugo -o /tmp/hugo.pak` Wrote /tmp/hugo.pak (51.05% of input) The guest is at 18380..1edd205 [1] 9398 segmentation fault /tmp/hugo.pak
Yeah! That's as far as we had gotten last article.
Okay. Good. Staying on top of things, alright!
Formulating a plan
So, let's try to summarize the predicament we find ourselves in.
We're trying to make two things fit into the same executable:
- Our launcher, stage1
- Whichever compressed guest executable, lately hugo
If they have the same base address, then by the time stage1 is mapped, and up and running, it's too late: after we decompress hugo somewhere in memory, we cannot copy it where it wants to be, which is the same place that stage1 already is!
Ooh, ooh, I know! Why don't we simply make stage1 relocatable?
Well! If we made stage1 relocatable, then it would be mapped "at a random address", thanks to Address space layout randomization (ASLR).
And, I agree that it would be really unlucky that the address picked would be the same as the fixed address that our guest wants, but it could happen, and then we would overwrite ourselves all over again, which...
Kaboom?
Kaboom, yes.
Okay, well... if we made stage1 relocatable, then when minipak runs, it could relocate it to somewhere else, that doesn't conflict with the guest! Right?
Well... it could, definitely, yes. But there's an additional constraint: we can't just "get out of the way of the guest", because "truly static" executables, ie. executables that are non-relocatable, expect the heap to be at a specific address. They know exactly where it should be.
Not all executables use that, but if we want to make a "real executable packer", then we should try to make sure that the brk (the top of the heap) is where our guest expects it to be.
That.. that was hot nonsense, what are you even talking about?
Fair, let's make a few diagrams.
The brk, the brk and nothing but the brk
We've made that kind of diagram a lot, but let's give it one more try.
We know that an ELF file is basically just a database of "segments" to be mapped in memory in specific places. Sometimes the layout in memory looks an awful lot like the layout on disk, with some exceptions.
Here for example, the "globals" segment is bigger in memory — all the zero-initialized globals go last, and they're not mapped from the file:
The important bit in the diagram above is that the brk's initial value (the value initially returned by the brk syscall) is at the end of the globals, that is, the end of the last segment.
It's kinda hard to show this, since not all programs even use brk: hugo, for example, doesn't.
But if we take a C program, like samples/hello-pie, which was just this:
// in `samples/hello-pie.c` #include <stdio.h> int main() { printf("Hello! I am a C program.\n"); return 0; }
Then we can catch the first brk syscall from GDB:
$ gdb --quiet --args ./samples/hello-pie Reading symbols from ./samples/hello-pie... (gdb) starti Starting program: /home/amos/ftl/minipak/samples/hello-pie Program stopped. 0x00007ffff7f4b840 in _start () (gdb) catch syscall brk Catchpoint 1 (syscall 'brk' [12]) (gdb) c Continuing. Catchpoint 1 (call to syscall brk), 0x00007ffff7fad82b in brk () (gdb) c Continuing. Catchpoint 1 (returned from syscall brk), 0x00007ffff7fad82b in brk () (gdb) p/x $rax $1 = 0x7ffff7fff000 (gdb)
Since hello-pie is relocatable, the calculation is a bit more complicated, since we need to take into account the base address that GDB picked. In this case, it's...
(gdb) info proc mappings process 18852 Mapped address spaces: Start Addr End Addr Size Offset objfile 0x7ffff7f3e000 0x7ffff7f41000 0x3000 0x0 [vvar] 0x7ffff7f41000 0x7ffff7f43000 0x2000 0x0 [vdso] 👇 0x7ffff7f43000 0x7ffff7f4b000 0x8000 0x0 /home/amos/ftl/minipak/samples/hello-pie 0x7ffff7f4b000 0x7ffff7fcd000 0x82000 0x8000 /home/amos/ftl/minipak/samples/hello-pie 0x7ffff7fcd000 0x7ffff7ff6000 0x29000 0x8a000 /home/amos/ftl/minipak/samples/hello-pie 0x7ffff7ff7000 0x7ffff7ffe000 0x7000 0xb3000 /home/amos/ftl/minipak/samples/hello-pie 0x7ffff7ffe000 0x7ffff7fff000 0x1000 0x0 [heap] 0x7ffffffdd000 0x7ffffffff000 0x22000 0x0 [stack]
0x7ffff7f43000.
So, if we look at the "Load" segments for hello-pie:
$ readelf -Wl ./samples/hello-pie | grep -A 8 "Program Headers" | grep -E "MemSiz|LOAD" Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x007f20 0x007f20 R 0x1000 LOAD 0x008000 0x0000000000008000 0x0000000000008000 0x081f7d 0x081f7d R E 0x1000 LOAD 0x08a000 0x000000000008a000 0x000000000008a000 0x028bc8 0x028bc8 R 0x1000 👇 👇 👇 LOAD 0x0b3768 0x00000000000b4768 0x00000000000b4768 0x005ba8 0x007438 RW 0x1000
We can see that it's supposed to be at...
(gdb) p/x 0x7ffff7f43000 + 0x00000000000b4768 + 0x007438 $2 = 0x7ffff7ffeba0
Well, we also need to align that to 0x1000:
(gdb) p/x (0x7ffff7ffeba0 & ~(0xFFF)) + 0x1000 $3 = 0x7ffff7fff000
And we find... 0x7ffff7fff000! The value we got back from the very first brk syscall.
I'm getting mixed up with all these sevens and effs.
Yeah, me too, but I promise, they're the same value.
(gdb) p/x $3 - $1 $4 = 0x0
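The same arithmetic can be written down as a small Rust sketch. The names `align_up` and `expected_initial_brk` are made up for illustration, and the page size is assumed to be 0x1000 as in the GDB session above. Unlike the GDB expression, `align_up` leaves an already-aligned address untouched instead of adding a full page:

```rust
/// Assumed page size, matching the `0x1000` alignment used above.
const PAGE_SIZE: u64 = 0x1000;

/// Round `addr` up to the next multiple of `align` (which must be a power of two).
fn align_up(addr: u64, align: u64) -> u64 {
    (addr + align - 1) & !(align - 1)
}

/// Hypothetical helper: given the base address the object was mapped at, and
/// the virtual address and in-memory size of its last LOAD segment, compute
/// where we expect the initial program break to be.
fn expected_initial_brk(base: u64, last_vaddr: u64, last_memsz: u64) -> u64 {
    align_up(base + last_vaddr + last_memsz, PAGE_SIZE)
}
```

Plugging in the base address from `info proc mappings` and the last LOAD segment from readelf gives back the value the brk syscall returned.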
So, the problem with "just moving the launcher out of the way of the guest" is that we'll end up with the wrong brk value on startup:
Can't we adjust the brk after startup?
You'd think so, but: not really, no. If the guest was after our launcher, we could sorta kinda do that.
Effectively, we'd "allocate on the heap" the distance between the end of the launcher and the end of the guest, and the value would be correct — but it would look all wrong in debuggers, for example.
And what if the guest is non-relocatable and has a fixed base address that's very low? Low enough that we don't have room to fit our launcher?
Yeah, sounds messy.
So, here's my proposal: we do it in two steps.
Two stages, if you will.
First off, because we need brk to be at the correct position, we generate an executable that will run stage1, but that will have a layout very similar to the guest.
We then have stage1 map stage2 (a dynamic library) out of the way of the guest, and finally, we have stage2 map the guest (a non-relocatable executable), and jmp to it:
Well. When you put it like that, it seems real simple.
Yeah, and it's going to be! Because we got solid abstractions, and a good understanding of what it is we need to do. That's what the seventeen previous articles were all about.
Okay, what do we start with?
Chaaaaaaaaange places!
Well, we start with throwing away our current stage1 crate completely.
In our current version of minipak, stage1 is a relocatable binary, which we simply concatenate with our compressed guest payload (and a manifest, so that we can find everything at runtime).
But moving forward, stage1 is going to be... a dynamic library.
A library?
Yes!
But... stage1 is effectively the template for the output minipak produces, right?
That is correct.
So does that mean... we're going to turn a dynamic library into an executable.
Yes!
So let's do that.
$ rm -rf crates/stage1 $ (cd crates && cargo new --lib stage1) Created library `stage1` package
We gotta be more specific, since cargo new --lib makes a rust library, whereas we want a "C dynamic library", or cdylib. And we'd like to use our encore crate:
# in crates/stage1/Cargo.toml

[lib]
crate-type = ["cdylib"]

[dependencies]
encore = { path = "../encore" }
So, just as before, our library is going to be no_std. Let's export a single empty function, to see what we get:
// Don't use libstd #![no_std] // Use the default allocation error handler #![feature(default_alloc_error_handler)] use encore::prelude::*; /// # Safety /// Wildly unsafe, do not call. #[no_mangle] pub unsafe extern "C" fn entry() {}
$ cargo clean $ cargo build --package stage1 (cut) Finished dev [unoptimized + debuginfo] target(s) in 5.34s $ nm -D ./target/debug/libstage1.so 00000000000013f0 T bcmp w __cxa_finalize 0000000000001330 T entry w __gmon_start__ w _ITM_deregisterTMCloneTable w _ITM_registerTMCloneTable 00000000000024a0 T memcmp 0000000000002110 T memcpy 0000000000002200 T memmove 00000000000023f0 T memset 00000000000013e0 T _Unwind_Resume
Alright, cool! There's our entry symbol right there. I'm sure if we move some things around, we can make that the entry point of an executable.
But before we go any further... I'd like to just add a tiny teeny linker flag.
The problem with making dynamic libraries (or "shared" libraries, as GCC and GNU ld tend to call them) is that having undefined symbols is a-okay.
For example, if we were to call a function that does not exist from within stage1:
// in `crates/stage1/src/lib.rs` // Don't use libstd #![no_std] // Use the default allocation error handler #![feature(default_alloc_error_handler)] use encore::prelude::*; extern "C" { fn i_do_not_exist(); } /// # Safety /// Wildly unsafe, do not call. #[no_mangle] pub unsafe extern "C" fn entry() { i_do_not_exist(); }
Then it would still build just fine:
$ cargo build --quiet --package stage1 && nm -D ./target/debug/libstage1.so 00000000000013f0 T bcmp w __cxa_finalize 0000000000001330 T entry w __gmon_start__ 👇 👇 U i_do_not_exist w _ITM_deregisterTMCloneTable w _ITM_registerTMCloneTable 00000000000024a0 T memcmp 0000000000002110 T memcpy 0000000000002200 T memmove 00000000000023f0 T memset 00000000000013e0 T _Unwind_Resume
It would just... ask for an i_do_not_exist symbol. So the error would be pushed to load time, ie. whenever the library is dlopened by some program. Or possibly even later, when i_do_not_exist is called, if the dynamic loader is feeling particularly lazy.
There's just one problem with that...
We do not intend to dlopen this file. So we'll never see the error, just a crash.
Our libstage1.so must be entirely self-contained. It cannot possibly depend on something else that's "already present at load time".
In other words, the output of nm -D libstage1.so must not contain any U entries.
The good news is: there is a linker flag for that! (And also, that is the default behavior for executables).
So, just as before, we'll need to add a build script for the stage1 crate:
// in `crates/stage1/build.rs`

fn main() {
    println!("cargo:rustc-link-arg=-Wl,-z,defs");
}
And with that:
$ cargo build --quiet --package stage1 && nm -D ./target/debug/libstage1.so error: linking with `cc` failed: exit status: 1 | = note: "cc" "-m64" "-Wl,--eh-frame-hdr" "-Wl,-znoexecstack" "-Wl,--as-needed" "-L" "/home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "/home/amos/ftl/minipak/target/debug/deps/stage1.4mxsy4jfjjxw9bnk.rcgu.o" "-o" "/home/amos/ftl/minipak/target/debug/deps/libstage1.so" "-Wl,--version-script=/tmp/rustcjlHmwn/list" "/home/amos/ftl/minipak/target/debug/deps/stage1.4vnmey6720pw9y0s.rcgu.o" "-Wl,--gc-sections" "-shared" "-Wl,-zrelro" "-Wl,-znow" "-nodefaultlibs" "-L" "/home/amos/ftl/minipak/target/debug/deps" "-L" "/home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-Wl,-Bstatic" "/home/amos/ftl/minipak/target/debug/deps/libencore-3df7881431236f53.rlib" "/home/amos/ftl/minipak/target/debug/deps/libbitflags-a7b2d45df01a1ead.rlib" "/home/amos/ftl/minipak/target/debug/deps/liblinked_list_allocator-e21337cbc3886b5a.rlib" "/home/amos/ftl/minipak/target/debug/deps/libspinning_top-9628df27563378b7.rlib" "/home/amos/ftl/minipak/target/debug/deps/liblock_api-6930031095633bf5.rlib" "/home/amos/ftl/minipak/target/debug/deps/libscopeguard-3fda057cf6676019.rlib" "/home/amos/ftl/minipak/target/debug/deps/librlibc-0003478002090c29.rlib" "/home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/liballoc-9849bb0fbad7f0f5.rlib" "/home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcompiler_builtins-8b33f9cbbc9652fe.rlib" "/home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/librustc_std_workspace_core-a1fd7734706d5518.rlib" 
"/home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcore-c8ded1707ad10767.rlib" "/home/amos/ftl/minipak/target/debug/deps/libcompiler_builtins-98a8751107c546a9.rlib" "-Wl,-z,defs" "-Wl,-Bdynamic" = note: /usr/sbin/ld: /home/amos/ftl/minipak/target/debug/deps/stage1.4mxsy4jfjjxw9bnk.rcgu.o: in function `entry': /home/amos/ftl/minipak/crates/stage1/src/lib.rs:17: undefined reference to `i_do_not_exist' collect2: error: ld returned 1 exit status error: aborting due to previous error error: could not compile `stage1` To learn more, run the command again with --verbose.
...it errors out if we ever refer to something that stage1 does not define itself.
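The linker flag is the real safeguard here, but the "no U entries" rule is also easy to check after the fact. Here's a minimal sketch that scans the text printed by nm -D for undefined symbols. `undefined_symbols` is a hypothetical helper, and it assumes the two-column `U name` layout seen in the output above:

```rust
/// Hypothetical helper: collect the names of undefined (`U`) symbols from
/// the text printed by `nm -D`.
/// Assumption: undefined symbols appear as exactly two columns, `U name`,
/// with no address column, like in the output shown earlier.
fn undefined_symbols(nm_output: &str) -> Vec<&str> {
    nm_output
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            match (fields.next(), fields.next(), fields.next()) {
                (Some("U"), Some(name), None) => Some(name),
                _ => None,
            }
        })
        .collect()
}
```

Weak entries (the `w` lines) are deliberately ignored, since the rule stated above is only about `U` entries.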
This may sound trivial, but had I known about it when I started researching this part, it would've saved me a lot of grief. Also, encore here is doing the heavy lifting, providing a panic handler and a memory allocator, so there's less chance of us getting it wrong.
Wait, don't we need to hook up the memory allocator manually?
Well it's a library so-
Well, now it's a library, but once minipak is done with it, it'll be an executable. So it'll need to set up its own memory allocator! Oh, and don't forget: entry points are not regular functions. We need to save the stack pointer to pass it to the program later!
Right, right, let's set all that up:
// in `crates/stage1/src/lib.rs`

// Allow inline assembly
#![feature(asm)]
// Allow naked (no-prologue) functions
#![feature(naked_functions)]
// Don't use libstd
#![no_std]
// Use the default allocation error handler
#![feature(default_alloc_error_handler)]

extern crate alloc;

use encore::prelude::*;

macro_rules! info {
    ($($tokens: tt)*) => {
        println!("[stage1] {}", alloc::format!($($tokens)*));
    }
}

/// # Safety
/// Uses inline assembly so it can behave as the entry point of a static
/// executable.
#[no_mangle]
#[naked]
pub unsafe extern "C" fn entry() {
    asm!("mov rdi, rsp", "call premain", options(noreturn))
}

/// # Safety
/// Initializes the allocator.
#[no_mangle]
#[inline(never)]
unsafe fn premain(stack_top: *mut u8) -> ! {
    init_allocator();
    crate::main(stack_top)
}

/// # Safety
/// Nothing bad so far.
#[inline(never)]
unsafe fn main(stack_top: *mut u8) -> ! {
    info!("Stack top: {:?}", stack_top);
    syscall::exit(0)
}
There! That's a good start.
$ cargo build --quiet --package stage1 && nm -D ./target/debug/libstage1.so 0000000000003780 T bcmp w __cxa_finalize 00000000000029d0 T entry w __gmon_start__ w _ITM_deregisterTMCloneTable w _ITM_registerTMCloneTable 00000000000084c0 T memcmp 0000000000008130 T memcpy 0000000000008220 T memmove 0000000000008410 T memset 00000000000029e0 T premain 0000000000003770 T _Unwind_Resume
Now, there are actually two entry points into that ELF object.
entry is the correct entry point if we want it to behave as a static executable. But as a first test, we can definitely try to load it as a dynamic library, and call premain, just to see if the rest is all wired up correctly!
// in `samples/loadtest.c` #include <dlfcn.h> #include <stdio.h> #include <stdint.h> int main() { void *lib = dlopen("target/debug/libstage1.so", RTLD_NOW); if (!lib) { fprintf(stderr, "Could not load library\n"); return 1; } void *sym = dlsym(lib, "premain"); if (!sym) { fprintf(stderr, "Could not find symbol\n"); return 1; } typedef void (*premain_t)(uint64_t); premain_t premain = (premain_t)(sym); fprintf(stderr, "Calling premain...\n"); premain(0x1234); }
# in `samples/Justfile` loadtest: gcc -g loadtest.c -o loadtest -ldl file loadtest
$ just samples/loadtest gcc -g loadtest.c -o loadtest -ldl file loadtest loadtest: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=4d6d20c4608d5291659f96def4fb0e387b70dda4, for GNU/Linux 4.4.0, with debug_info, not stripped $ samples/loadtest Calling premain... [stage1] Stack top: 0x1234
Eyyyyyy! First time!
Very nice!
Now all we have to do is turn that dynamic library into a non-relocatable executable. And we have most of the tools to do that.
First, let's just adjust minipak's build script to operate on libstage1.so instead of stage1 (since it used to be a binary):
// in `crates/minipak/build.rs` use std::{ path::{Path, PathBuf}, process::Command, }; fn main() { for &arg in &["-nostartfiles", "-nodefaultlibs", "-static"] { println!("cargo:rustc-link-arg={}", arg); } cargo_build(&PathBuf::from("../stage1")); } fn cargo_build(path: &Path) { println!("cargo:rerun-if-changed=.."); let out_dir = std::env::var("OUT_DIR").unwrap(); let target_dir = format!("{}/embeds", out_dir); let output = Command::new("cargo") .arg("build") .arg("--target-dir") .arg(&target_dir) .arg("--release") .current_dir(path) .spawn() .unwrap() .wait_with_output() .unwrap(); if !output.status.success() { panic!( "Building {} failed.\nStdout: {}\nStderr: {}", path.display(), String::from_utf8_lossy(&output.stdout[..]), String::from_utf8_lossy(&output.stderr[..]), ); } // Let's just assume the library has the same name as the crate // 👇 let lib_name = format!("lib{}.so", path.file_name().unwrap().to_str().unwrap()); let output = Command::new("objcopy") .arg("--strip-all") .arg(&format!("release/{}", lib_name)) .arg(lib_name) .current_dir(&target_dir) .spawn() .unwrap() .wait_with_output() .unwrap(); if !output.status.success() { panic!( "Stripping failed.\nStdout: {}\nStderr: {}", String::from_utf8_lossy(&output.stdout[..]), String::from_utf8_lossy(&output.stderr[..]), ); } }
In Part 15 of this series, we enabled an option in rust-analyzer, and it turns out it changed names! If in .vscode/settings.json you had rust-analyzer.cargo.loadOutDirsFromCheck, you may want to change it.
The whole file should read:
{
  "rust-analyzer.checkOnSave.allTargets": false,
  "rust-analyzer.procMacro.enable": true,
  "rust-analyzer.cargo.runBuildScripts": true
}
(Don't forget to reload the vscode window after that change)
Now! What should minipak do?
Well, first it should calculate the convex hull of the guest executable, so that we know how to lay out the "output" executable.
That, we know how to do:
// in `crates/minipak/src/main.rs` // Opt out of libstd #![no_std] // Let us worry about the entry point. #![no_main] // Use the default allocation error handler #![feature(default_alloc_error_handler)] // Let us make functions without any prologue - assembly only! #![feature(naked_functions)] // Let us use inline assembly! #![feature(asm)] /// Our entry point. #[naked] #[no_mangle] unsafe extern "C" fn _start() { asm!("mov rdi, rsp", "call pre_main", options(noreturn)) } use encore::prelude::*; use pixie::{Object, PixieError, Writer}; mod cli; #[no_mangle] unsafe fn pre_main(stack_top: *mut u8) { init_allocator(); main(Env::read(stack_top)).unwrap(); syscall::exit(0); } #[allow(clippy::unnecessary_wraps)] fn main(env: Env) -> Result<(), PixieError> { let args = cli::Args::parse(&env); println!("Packing guest {:?}", args.input); let guest_file = File::open(args.input)?; let guest_map = guest_file.map()?; let guest_obj = Object::new(guest_map.as_ref())?; let guest_hull = guest_obj.segments().load_convex_hull()?; println!("Guest hull: {:0x?}", guest_hull); let mut output = Writer::new(&args.output, 0o755)?; output.write_all("TODO\n".as_bytes())?; Ok(()) }
Let's take it for a spin:
$ cargo run --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak (cut) Packing guest "/home/amos/go/bin/hugo" Guest hull: 400000..3180968 $ cat /tmp/hugo.pak TODO
Does that seem right?
$ readelf -Wl ~/go/bin/hugo | grep -E "LOAD|MemSiz" Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align LOAD 0x000000 0x0000000000400000 0x0000000000400000 0x172b4b0 0x172b4b0 R E 0x1000 LOAD 0x172c000 0x0000000001b2c000 0x0000000001b2c000 0x155ceb8 0x155ceb8 R 0x1000 LOAD 0x2c89000 0x0000000003089000 0x0000000003089000 0x0b08c0 0x0f7968 RW 0x1000 $ gdb -q -ex "set noconfirm" -ex "p/x 0x0000000003089000 + 0x0f7968" -ex "quit" No symbol table is loaded. Use the "file" command. $1 = 0x3180968
Yeah, that's bang on!
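For reference, here is the convex hull computation boiled down to a standalone sketch. `LoadSegment` is a made-up stand-in that keeps only two fields of a LOAD program header, and this `load_convex_hull` is a free function written for illustration, not pixie's actual implementation:

```rust
use std::ops::Range;

/// Hypothetical stand-in for a LOAD program header: just the two fields
/// the hull computation needs.
struct LoadSegment {
    vaddr: u64,
    memsz: u64,
}

/// Smallest address range that covers every segment in memory, or `None`
/// when there are no segments at all.
fn load_convex_hull(segments: &[LoadSegment]) -> Option<Range<u64>> {
    let start = segments.iter().map(|s| s.vaddr).min()?;
    let end = segments.iter().map(|s| s.vaddr + s.memsz).max()?;
    Some(start..end)
}
```

Fed with the three LOAD segments readelf printed for hugo, it yields the same 400000..3180968 range.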
Okay, cool. Next up, we're going to need to relink stage1, so that it transforms from a dynamic library into a static executable that has roughly the same shape as the guest, as shown in the first column of our three-column plan:
I thought it was a two-stage plan?
Yeah! Three columns, two stages. Get it together, bear.
Okay, cool, so, to relink stage1 we're going to need the guest hull, and also a mutable reference to the Writer, so let's make a function for that:
// in `crates/minipak/src/main.rs` // new! use core::ops::Range; #[allow(clippy::unnecessary_wraps)] // 👇 fn main(env: Env) -> Result<(), Error> { let args = cli::Args::parse(&env); println!("Packing guest {:?}", args.input); let guest_file = File::open(args.input)?; let guest_map = guest_file.map()?; let guest_obj = Object::new(guest_map.as_ref())?; let guest_hull = guest_obj.segments().load_convex_hull()?; let mut output = Writer::new(&args.output, 0o755)?; relink_stage1(guest_hull, &mut output)?; Ok(()) } fn relink_stage1(guest_hull: Range<u64>, writer: &mut Writer) -> Result<(), Error> { println!("Guest hull: {:0x?}", guest_hull); let obj = Object::new(include_bytes!(concat!( env!("OUT_DIR"), "/embeds/libstage1.so" )))?; // TODO: fill out! Ok(()) }
Quick detour through error handling: we might return various types of errors (from deku, from encore, or from pixie), so let's give minipak an Error type now and not have to worry about it later:
# in `crates/minipak/Cargo.toml`

[dependencies]
displaydoc = { version = "0.1.7", default-features = false }
// in `crates/minipak/src/error.rs`

use encore::prelude::*;
use pixie::{deku::DekuError, PixieError};

#[derive(displaydoc::Display, Debug)]
pub enum Error {
    /// `{0}`
    Encore(EncoreError),
    /// deku error: `{0}`
    Deku(DekuError),
    /// pixie error: `{0}`
    Pixie(PixieError),
}

impl From<EncoreError> for Error {
    fn from(e: EncoreError) -> Self {
        Self::Encore(e)
    }
}

impl From<DekuError> for Error {
    fn from(e: DekuError) -> Self {
        Self::Deku(e)
    }
}

impl From<PixieError> for Error {
    fn from(e: PixieError) -> Self {
        Self::Pixie(e)
    }
}

And we just need to use it from main.rs:
// in `crates/minipak/src/main.rs`

mod error;
use error::Error;
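In case it's not obvious why the From impls matter: they're what lets `?` convert a library error into our own Error type on the way out. Here's a tiny standalone sketch of that mechanism. `ParseFailure`, `parse_magic` and `check` are invented for the example and have nothing to do with the real deku, encore or pixie types:

```rust
/// Hypothetical stand-in for an error coming out of some library.
#[derive(Debug, PartialEq)]
struct ParseFailure(&'static str);

/// Our own error type, wrapping the library error.
#[derive(Debug, PartialEq)]
enum Error {
    Parse(ParseFailure),
}

impl From<ParseFailure> for Error {
    fn from(e: ParseFailure) -> Self {
        Self::Parse(e)
    }
}

/// Pretend library function: only checks the four ELF magic bytes.
fn parse_magic(bytes: &[u8]) -> Result<(), ParseFailure> {
    if bytes.starts_with(b"\x7fELF") {
        Ok(())
    } else {
        Err(ParseFailure("bad magic"))
    }
}

/// Returns our own `Error`: the `?` operator calls `Error::from` on the
/// `ParseFailure` for us.
fn check(bytes: &[u8]) -> Result<(), Error> {
    parse_magic(bytes)?;
    Ok(())
}
```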
Does everything still work? Yes?
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak Packing guest "/home/amos/go/bin/hugo" Guest hull: 400000..3180968
Yes. Good.
Next up: some basic tests. We expect stage1 to be relocatable, so, let's check that expectation. If it is relocatable, then its convex hull starts at 0x0:
// in `relink_stage1`

let hull = obj.segments().load_convex_hull()?;
assert_eq!(hull.start, 0, "stage1 must be relocatable");
Then we have a decision to make: where will our executable start? If we're packing a relocatable executable, we can pick any base address! If we're packing a non-relocatable executable, we have to pick their base address.
// in `relink_stage1`

// Pick a base offset. If our guest is a relocatable executable, pick a
// random one, otherwise, pick theirs.
let base_offset = if guest_hull.start == 0 {
    0x800000 // by fair dice roll
} else {
    guest_hull.start
};
println!("Picked base_offset 0x{:x}", base_offset);

let hull = (hull.start + base_offset)..(hull.end + base_offset);
println!("Stage1 hull: {:x?}", hull);
println!(" Guest hull: {:x?}", guest_hull);
This alone should give us some interesting output:
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak Packing guest "/home/amos/go/bin/hugo" Picked base_offset 0x400000 Stage1 hull: 400000..40b048 Guest hull: 400000..3180968
Cool! hugo is not relocatable, so we will be mapping stage1 starting from the same base address.
If we try to pack ls however, which is relocatable:
$ cargo run --quiet --release --bin minipak -- /bin/ls -o /tmp/gcc.pak Packing guest "/bin/ls" Picked base_offset 0x800000 Stage1 hull: 800000..80b048 Guest hull: 0..24558
Then we pick 0x800000 as a base address.
Alright, cool. Next up, we're going to proceed as if we wanted to load stage1 as a library.
It's an ELF object, so it has segments, and so it can be mapped:
let mut mapped = MappedObject::new(&obj, None)?;
println!("Loaded stage1");
And then we should relocate it.
Wait, relocate it? Like we did in parts uhhh... in the previous parts?
Exactly! But we'll only care about four relocation types: 64, GlobDat, JumpSlot, Relative.
Because that's all we have in our library:
$ readelf -Wr ./target/debug/libstage1.so | grep -oE 'R_X86\w+' | sort -u R_X86_64_GLOB_DAT R_X86_64_JUMP_SLOT R_X86_64_RELATIVE
Okay, so 64 hasn't shown up yet, but let's handle it anyway.
Alright!
Relocating is ELF business, and we do our ELF business in the pixie
crate.
Let's add a relocate
method to MappedObject
:
// in `crates/pixie/src/lib.rs` use alloc::boxed::Box; impl<'a> MappedObject<'a> { /// Apply relocations with the given base offset pub fn relocate(&mut self, base_offset: u64) -> Result<(), PixieError> { if !self.is_relocatable() { return Err(PixieError::CannotRelocateNonRelocatableObject); } let dyn_entries = self.object.read_dynamic_entries()?; let syms = dyn_entries.syms()?; let relas = dyn_entries .find(DynamicTagType::Rela)? .parse_all(dyn_entries.find(DynamicTagType::RelaSz)?); let plt_relas: Box<dyn Iterator<Item = _>> = match dyn_entries.find(DynamicTagType::JmpRel) { Ok(jmprel) => Box::new(jmprel.parse_all(dyn_entries.find(DynamicTagType::PltRelSz)?)), Err(_) => Box::new(core::iter::empty()) as _, }; for rela in relas.chain(plt_relas) { let rela = rela?; self.apply_rela(&syms, &rela, base_offset)?; } Ok(()) } }
Okay, uh, we jumped a few steps: there are a lot of symbols in here that we haven't defined yet.
We need a new error variant:
// in `crates/pixie/src/lib.rs` #[derive(displaydoc::Display, Debug)] /// A pixie error pub enum PixieError { /// `{0}` Deku(DekuError), /// `{0}` Encore(EncoreError), /// no segments found NoSegmentsFound, /// could not find segment of type `{0:?}` SegmentNotFound(SegmentType), /// cannot map non-relocatable object at fixed position CannotMapNonRelocatableObjectAtFixedPosition, // 👇 new! /// cannot relocate non-relocatable object CannotRelocateNonRelocatableObject, }
And then we need to teach pixie
about the kind of entries contained in the
"Dynamic" segment.
// in `crates/pixie/src/format/dynamic.rs` use super::prelude::*; #[derive(Debug, Clone, DekuRead, DekuWrite)] pub struct DynamicTag { pub typ: DynamicTagType, pub addr: u64, } #[derive(Debug, DekuRead, DekuWrite, Clone, Copy, PartialEq)] #[deku(type = "u64")] pub enum DynamicTagType { #[deku(id = "0")] Null, #[deku(id = "2")] PltRelSz, #[deku(id = "5")] StrTab, #[deku(id = "6")] SymTab, #[deku(id = "7")] Rela, #[deku(id = "8")] RelaSz, #[deku(id = "11")] SymEnt, #[deku(id = "23")] JmpRel, #[deku(id_pat = "_")] Other(u64), }
That's a new module of pixie::format
, we need to declare it:
// in `crates/pixie/src/format/mod.rs` mod dynamic; pub use dynamic::*;
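If the deku attributes hide too much, here's roughly what that layout means in plain bytes: the dynamic segment is a list of 16-byte records, a `u64` tag followed by a `u64` value, ending with a Null tag. This is a std-only sketch and not how pixie does it; `read_dynamic_pairs` and `le_u64` are hypothetical helpers, and little-endian input is assumed:

```rust
/// `DT_NULL`: marks the end of the dynamic table.
const DT_NULL: u64 = 0;

/// Reads a little-endian u64 from the first 8 bytes of `bytes`.
fn le_u64(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[..8]);
    u64::from_le_bytes(buf)
}

/// Reads `(tag, value)` pairs from the raw bytes of a dynamic segment,
/// stopping at the first `DT_NULL` tag or when the input runs out.
fn read_dynamic_pairs(bytes: &[u8]) -> Vec<(u64, u64)> {
    let mut pairs = Vec::new();
    for record in bytes.chunks_exact(16) {
        let tag = le_u64(&record[..8]);
        let value = le_u64(&record[8..]);
        if tag == DT_NULL {
            break;
        }
        pairs.push((tag, value));
    }
    pairs
}
```

Anything after the Null tag is ignored, which is the same stopping rule `read_dynamic_entries` uses.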
Then, we'll need to implement Object::read_dynamic_entries
.
// in `crates/pixie/src/lib.rs` impl<'a> Object<'a> { /// Read all dynamic entries pub fn read_dynamic_entries(&self) -> Result<DynamicEntries<'a>, PixieError> { let dyn_seg = self.segments.find(SegmentType::Dynamic)?; let mut entries = DynamicEntries::default(); let mut input = (dyn_seg.slice(), 0); loop { let (rest, tag) = DynamicTag::from_bytes(input)?; if tag.typ == DynamicTagType::Null { break; } entries.items.push(DynamicEntry { tag, full_slice: &self.slice, }); input = rest; } Ok(entries) } }
Which returns a type DynamicEntries
, very similar to the Segments
type we
made before — it just has a bunch of utility methods:
// in `crates/pixie/src/lib.rs` /// Entries in the `DYNAMIC` segment. #[derive(Default)] pub struct DynamicEntries<'a> { items: Vec<DynamicEntry<'a>>, } impl<'a> DynamicEntries<'a> { /// Returns a slice of all entries pub fn all(&self) -> &[DynamicEntry<'a>] { &self.items } /// Iterates over all entries of a given type pub fn of_type(&self, typ: DynamicTagType) -> impl Iterator<Item = &DynamicEntry<'a>> { self.items.iter().filter(move |entry| entry.typ() == typ) } /// Finds the first entry of a given type pub fn find(&self, typ: DynamicTagType) -> Result<&DynamicEntry<'a>, PixieError> { self.of_type(typ) .next() .ok_or(PixieError::DynamicEntryNotFound(typ)) } /// Constructs an instance of `Syms`. Requires the presence of the `SymTab`, /// `SymEnt` and `StrTab` dynamic entries. pub fn syms(&'a self) -> Result<Syms<'a>, PixieError> { Ok(Syms { symtab: self.find(DynamicTagType::SymTab)?, syment: self.find(DynamicTagType::SymEnt)?, strtab: self.find(DynamicTagType::StrTab)?, }) } }
This brings a new error variant — when we can't find a dynamic entry of the requested type:
#[derive(displaydoc::Display, Debug)] /// A pixie error pub enum PixieError { // (cut) /// could not find dynamic entry of type `{0:?}` DynamicEntryNotFound(DynamicTagType), }
The DynamicEntry type holds both a "dynamic tag" and the corresponding data (which is always more or less a u64, but can be a number, an address, or something else still):
// in `crates/pixie/src/lib.rs` /// An entry in the `DYNAMIC` section pub struct DynamicEntry<'a> { /// The dynamic tag as read from the `DYNAMIC` section tag: DynamicTag, /// A slice of the full ELF object full_slice: &'a [u8], } impl<'a> DynamicEntry<'a> { /// Returns the type of this dynamic entry pub fn typ(&self) -> DynamicTagType { self.tag.typ } /// Returns a slice of the full file starting with this entry interpreted as /// an offset. pub fn as_slice(&self) -> &'a [u8] { &self.full_slice[self.as_usize()..] } /// Returns this entry's value as an `usize` pub fn as_usize(&self) -> usize { self.as_u64() as usize } /// Returns this entry's value as an `u64` pub fn as_u64(&self) -> u64 { self.tag.addr } /// Parses several `T` records, using `self` at the start of the input, and /// `len` total length of the input. pub fn parse_all<T>( &self, len: &DynamicEntry<'a>, ) -> impl Iterator<Item = Result<T, PixieError>> + 'a where T: DekuContainerRead<'a>, { let slice = &self.as_slice()[..len.as_usize()]; let mut input = (slice, 0); core::iter::from_fn(move || -> Option<Result<T, PixieError>> { if input.0.is_empty() { return None; } let (rest, t) = match T::from_bytes(input) { Ok(x) => x, Err(e) => return Some(Err(e.into())), }; input = rest; Some(Ok(t)) }) } /// Parses the nth `T` record, using `self` as the start of the input, and /// `record_len` as the record length. pub fn parse_nth<T>(&self, record_len: &DynamicEntry<'a>, n: usize) -> Result<T, DekuError> where T: DekuContainerRead<'a>, { let slice = &self.as_slice()[(record_len.as_usize() * n)..]; let input = (slice, 0); let (_, t) = T::from_bytes(input)?; Ok(t) } }
Then, we have Syms, which allows looking up symbol names, using the symtab, syment, and strtab dynamic entries:
// in `crates/pixie/src/lib.rs` /// Allows reading symbols out of an ELF file pub struct Syms<'a> { /// Indicates the start of the symbol table symtab: &'a DynamicEntry<'a>, /// Indicates the size of a symbol entry syment: &'a DynamicEntry<'a>, /// Indicates the start of the string table strtab: &'a DynamicEntry<'a>, } impl<'a> Syms<'a> { /// Read the nth symbol pub fn nth(&self, n: usize) -> Result<(Sym, &'a str), PixieError> { let sym: Sym = self.symtab.parse_nth(&self.syment, n)?; let name = unsafe { self.strtab.as_slice().as_ptr().add(sym.name as _).cstr() }; Ok((sym, name)) } /// Find a symbol by name. Will end up panicking if the symbol /// is not found! pub fn by_name(&self, name: &str) -> Result<Sym, PixieError> { let mut i = 0; loop { let (sym, sym_name) = self.nth(i)?; if sym_name == name { return Ok(sym); } i += 1; } } }
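The name lookup in `nth` boils down to "read a NUL-terminated string starting at some offset in the string table". Here's a bounds-checked, std-only take on that one step. It's a sketch of the idea rather than what pixie does (pixie goes through a raw pointer), and `name_at` is a hypothetical helper:

```rust
/// Returns the NUL-terminated name starting at `offset` in a string table,
/// or `None` if the offset is out of bounds, the terminator is missing,
/// or the bytes are not valid UTF-8.
fn name_at(strtab: &[u8], offset: usize) -> Option<&str> {
    let tail = strtab.get(offset..)?;
    let len = tail.iter().position(|&b| b == 0)?;
    std::str::from_utf8(&tail[..len]).ok()
}
```

With a table like `b"\0entry\0main\0"`, offset 1 yields `entry` and offset 7 yields `main`. Offset 0 is the empty name, which is what the 0th symbol uses.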
Sym
is also its own type:
// in `pixie/src/format/mod.rs` mod sym; pub use sym::*;
// in `pixie/src/format/sym.rs` use super::prelude::*; #[derive(Debug, DekuRead, DekuWrite, Clone)] pub struct Sym { pub name: u32, pub bind: SymBind, #[deku(pad_bytes_after = "1")] pub typ: SymType, pub shndx: u16, pub value: u64, pub size: u64, } #[derive(Debug, DekuRead, DekuWrite, Clone, Copy, PartialEq)] #[deku(type = "u8", bits = 4)] pub enum SymBind { #[deku(id = "0")] Local, #[deku(id = "1")] Global, #[deku(id = "2")] Weak, #[deku(id_pat = "_")] Other(u8), } #[derive(Debug, DekuRead, DekuWrite, Clone, Copy, PartialEq)] #[deku(type = "u8", bits = 4)] pub enum SymType { #[deku(id = "0")] None, #[deku(id = "1")] Object, #[deku(id = "2")] Func, #[deku(id = "3")] Section, #[deku(id = "4")] File, #[deku(id = "6")] Tls, #[deku(id = "10")] IFunc, #[deku(id_pat = "_")] Other(u8), }
Whoa, whoahey, that's a lot of code, isn't it?
Yeah, but it's not that bad, is it? We're just making some nice abstractions, as
usual, and using what deku
gives us to parse symbols easily.
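The two 4-bit enums in `Sym` come from a single byte in the ELF symbol record, `st_info`: the binding lives in the high four bits and the type in the low four. Without deku, the split is just a shift and a mask. A tiny sketch, with `split_st_info` being a made-up name:

```rust
/// Splits the `st_info` byte of an ELF symbol into `(bind, type)`:
/// binding in the high 4 bits, type in the low 4 bits.
fn split_st_info(info: u8) -> (u8, u8) {
    (info >> 4, info & 0xf)
}
```

So a byte of `0x12` means bind 1 (Global) and type 2 (Func), matching the ids in the enums above.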
Now that we have all that, we can focus on actually applying relocations.
And, remember how we very carefully handled each relocation type differently, making sure to apply the formula from the SysV ABI?
Yeah?
Yeah well, not this time.
// in `crates/pixie/src/lib.rs` impl<'a> MappedObject<'a> { /// Apply a single relocation fn apply_rela(&mut self, syms: &Syms, rela: &Rela, base_offset: u64) -> Result<(), PixieError> { match rela.typ { RelType::_64 | RelType::GlobDat | RelType::JumpSlot | RelType::Relative => { // we support these } _ => { return Err(PixieError::UnsupportedRela(rela.clone())); } } // some relocations don't use symbols, we'll just use the 0th symbol // for them, which is fine. let (sym, _) = syms.nth(rela.sym as _)?; let value = base_offset + sym.value + rela.addend; let mem_offset = self.vaddr_to_mem_offset(rela.offset); unsafe { let target = self.mem.as_ptr().add(mem_offset) as *mut u64; *target = value; } Ok(()) } }
Turns out: we can compute them all the same way! It's literally always just "base_offset + symbol value + addend". Some relocations refer to the 0th symbol, which has a value of 0, and some relocations don't have an addend, so the addend "field" is 0, and it all ends up being correct.
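For the skeptical, here's a small sketch comparing the per-type formulas from the SysV x86-64 ABI (`S + A`, `S`, `S`, `B + A`) against the one-size-fits-all version. It assumes every symbol is defined in the object being relocated, so that `S` is simply base plus the symbol's value. `RelKind`, `abi_value` and `uniform_value` are names invented for this example:

```rust
/// The four relocation types we care about (hypothetical stand-in
/// for pixie's `RelType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum RelKind {
    Abs64,
    GlobDat,
    JumpSlot,
    Relative,
}

/// Per-type formulas: B is the base, S the symbol's runtime address,
/// A the addend. Assumes the symbol lives in the same object.
fn abi_value(kind: RelKind, base: u64, sym_value: u64, addend: u64) -> u64 {
    let s = base + sym_value;
    match kind {
        RelKind::Abs64 => s + addend,             // S + A
        RelKind::GlobDat | RelKind::JumpSlot => s, // S
        RelKind::Relative => base + addend,        // B + A
    }
}

/// The shortcut: always base + symbol value + addend.
fn uniform_value(base: u64, sym_value: u64, addend: u64) -> u64 {
    base + sym_value + addend
}
```

The two only agree because the unused field is zero in practice: a Relative relocation points at the 0th symbol (value 0), and GlobDat/JumpSlot entries carry an addend of 0. A file that broke either assumption would need the per-type version.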
There are still two missing pieces. First, yet another error variant, in case we encounter a relocation type we do not support (which should never happen):
// in `crates/pixie/src/lib.rs` #[derive(displaydoc::Display, Debug)] /// A pixie error pub enum PixieError { // (cut) /// unsupported relocation type `{0:?}` UnsupportedRela(Rela), }
And of course, the Rela
and RelType
types, which are also part of the ELF
format:
// in `crates/pixie/src/format/mod.rs` mod rela; pub use rela::*;
// in `crates/pixie/src/format/rela.rs` use super::prelude::*; #[derive(Debug, DekuRead, DekuWrite, Clone)] pub struct Rela { pub offset: u64, pub typ: RelType, pub sym: u32, pub addend: u64, } #[derive(Debug, DekuRead, DekuWrite, Clone, Copy, PartialEq)] #[deku(type = "u32")] pub enum RelType { #[deku(id = "0")] Null, #[deku(id = "1")] _64, #[deku(id = "6")] GlobDat, #[deku(id = "7")] JumpSlot, #[deku(id = "8")] Relative, #[deku(id = "16")] DtpMod64, #[deku(id_pat = "_")] Other(u32), }
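The `typ` then `sym` ordering in `Rela` might look backwards if you've read the ELF spec, where both live in one 64-bit `r_info` field: symbol index in the high 32 bits, type in the low 32 bits. On a little-endian machine the low half comes first in the file, hence the field order. Here's a std-only sketch of parsing one 24-byte record by hand; `RawRela`, `parse_rela` and `le_u64` are hypothetical names:

```rust
/// One relocation record, as laid out in the file (hypothetical mirror
/// of pixie's `Rela`, with raw numbers instead of enums).
#[derive(Debug, PartialEq)]
struct RawRela {
    offset: u64,
    typ: u32,
    sym: u32,
    addend: u64,
}

/// Reads a little-endian u64 from the first 8 bytes of `bytes`.
fn le_u64(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[..8]);
    u64::from_le_bytes(buf)
}

/// Parses a single 24-byte record: offset, info, addend.
fn parse_rela(bytes: &[u8]) -> Option<RawRela> {
    if bytes.len() < 24 {
        return None;
    }
    let info = le_u64(&bytes[8..16]);
    Some(RawRela {
        offset: le_u64(&bytes[0..8]),
        typ: info as u32,         // low 32 bits of r_info
        sym: (info >> 32) as u32, // high 32 bits of r_info
        addend: le_u64(&bytes[16..24]),
    })
}
```

An `r_info` of `0x8` therefore decodes as type 8 (Relative) against symbol 0.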
And with all that, we can relocate stage1
.
To get an idea of the result, let's write the relocated version of stage1
directly to the output:
// in `crates/minipak/src/main.rs` fn relink_stage1(guest_hull: Range<u64>, writer: &mut Writer) -> Result<(), Error> { let obj = Object::new(include_bytes!(concat!( env!("OUT_DIR"), "/embeds/libstage1.so" )))?; let hull = obj.segments().load_convex_hull()?; assert_eq!(hull.start, 0, "stage1 must be relocatable"); // Pick a base offset. If our guest is a relocatable executable, pick a // random one, otherwise, pick theirs. let base_offset = if guest_hull.start == 0 { 0x800000 // by fair dice roll } else { guest_hull.start }; println!("Picked base_offset 0x{:x}", base_offset); let hull = (hull.start + base_offset)..(hull.end + base_offset); println!("Stage1 hull: {:x?}", hull); println!(" Guest hull: {:x?}", guest_hull); // Map stage1 wherever... let mut mapped = MappedObject::new(&obj, None)?; println!("Loaded stage1"); // 👇 new code's here // ...but relocate it as if it was mapped at `base_offset` mapped.relocate(base_offset)?; println!("Relocated stage1"); // Dump the relocated version of the executable segment to disk, for comparison: let exec_segment = mapped.vaddr_slice( obj.segments() .of_type(pixie::SegmentType::Load) .find(|x| x.header().flags == (ProgramHeader::EXECUTE | ProgramHeader::READ)) .unwrap() .header() .mem_range(), ); writer.write_all(exec_segment)?; Ok(()) }
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak Packing guest "/home/amos/go/bin/hugo" Picked base_offset 0x400000 Stage1 hull: 400000..40b048 Guest hull: 400000..3180968 Loaded stage1 Relocated stage1
Okay! Now let's try to compare the non-relocated and the relocated versions of the executable segment. First we need to extract just the right part of the non-relocated libstage1.so:
$ readelf -Wl ./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/libstage1.so | grep -E 'MemSiz|LOAD' Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x001060 0x001060 R 0x1000 LOAD 0x002000 0x0000000000002000 0x0000000000002000 0x004b0d 0x004b0d R E 0x1000 LOAD 0x007000 0x0000000000007000 0x0000000000007000 0x001b88 0x001b88 R 0x1000 LOAD 0x009750 0x000000000000a750 0x000000000000a750 0x0008c0 0x0008f8 RW 0x1000 $ dd if=./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/libstage1.so of=/tmp/unrelocated bs=1 skip=$((0x002000)) count=$((0x004b0d)) 19213+0 records in 19213+0 records out 19213 bytes (19 kB, 19 KiB) copied, 0.0276595 s, 695 kB/s
Now, if my calculations are correct, the first few bytes should be the same:
$ xxd /tmp/unrelocated | head -3 00000000: f30f 1efa 4883 ec08 488b 05e9 8f00 0048 ....H...H......H 00000010: 85c0 7402 ffd0 4883 c408 c300 0000 0000 ..t...H......... 00000020: ff35 c28e 0000 ff25 c48e 0000 0f1f 4000 .5.....%......@. $ xxd /tmp/hugo.pak | head -3 00000000: f30f 1efa 4883 ec08 488b 05e9 8f00 0048 ....H...H......H 00000010: 85c0 7402 ffd0 4883 c408 c300 0000 0000 ..t...H......... 00000020: ff35 c28e 0000 ff25 c48e 0000 0f1f 4000 .5.....%......@.
Yeah, yes! That looks similar.
Let's find the differences, shall we?
$ diff <(xxd /tmp/unrelocated) <(xxd /tmp/hugo.pak) $
Huh. No output? They're the same? Did our program just... do nothing?
Maybe there's no relocations in the executable segment?
Ohhhhhhhh, right! It probably uses rip-relative addressing to avoid any relocations touching the executable segment, so that it can be shared across multiple loads of the same dynamic library.
We've seen that in Part 9, uh, over a year ago.
Time flies!
So then, where are the relocations?
$ readelf -Wr ./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/libstage1.so | head Relocation section '.rela.dyn' at offset 0x490 contains 125 entries: Offset Info Type Symbol's Value Symbol's Name + Addend 👇 000000000000a750 0000000000000008 R_X86_64_RELATIVE 28e0 000000000000a758 0000000000000008 R_X86_64_RELATIVE 2890 000000000000a760 0000000000000008 R_X86_64_RELATIVE 2900 000000000000a778 0000000000000008 R_X86_64_RELATIVE 2a90 000000000000a780 0000000000000008 R_X86_64_RELATIVE 2910 000000000000a788 0000000000000008 R_X86_64_RELATIVE 2a40 000000000000a790 0000000000000008 R_X86_64_RELATIVE 7000 $ readelf -Wl ./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/libstage1.so | grep -E "MemSiz|LOAD" Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x001060 0x001060 R 0x1000 LOAD 0x002000 0x0000000000002000 0x0000000000002000 0x004b0d 0x004b0d R E 0x1000 LOAD 0x007000 0x0000000000007000 0x0000000000007000 0x001b88 0x001b88 R 0x1000 👇 LOAD 0x009750 0x000000000000a750 0x000000000000a750 0x0008c0 0x0008f8 RW 0x1000
Ah! In the read-write segment.
Okay then:
// in `relink_stage1` let rw_segment = mapped.vaddr_slice( obj.segments() .of_type(pixie::SegmentType::Load) .find(|x| x.header().flags == (ProgramHeader::READ | ProgramHeader::WRITE)) .unwrap() .header() .mem_range(), ); writer.write_all(rw_segment)?;
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak Packing guest "/home/amos/go/bin/hugo" Picked base_offset 0x400000 Stage1 hull: 400000..40b048 Guest hull: 400000..3180968 Loaded stage1 Relocated stage1 $ readelf -Wl ./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/libstage1.so | grep -E "MemSiz|LOAD" Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x001060 0x001060 R 0x1000 LOAD 0x002000 0x0000000000002000 0x0000000000002000 0x004b0d 0x004b0d R E 0x1000 LOAD 0x007000 0x0000000000007000 0x0000000000007000 0x001b88 0x001b88 R 0x1000 👇 👇 LOAD 0x009750 0x000000000000a750 0x000000000000a750 0x0008c0 0x0008f8 RW 0x1000 $ dd if=./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/libstage1.so of=/tmp/unrelocated bs=1 skip=$((0x009750)) count=$((0x0008c0)) 2240+0 records in 2240+0 records out 2240 bytes (2.2 kB, 2.2 KiB) copied, 0.0038677 s, 579 kB/s
Let's diff again:
$ diff <(xxd /tmp/unrelocated) <(xxd /tmp/hugo.pak) | head -14 1,6c1,6 👇 < 00000000: e028 0000 0000 0000 9028 0000 0000 0000 .(.......(...... < 00000010: 0029 0000 0000 0000 0800 0000 0000 0000 .).............. < 00000020: 0800 0000 0000 0000 902a 0000 0000 0000 .........*...... < 00000030: 1029 0000 0000 0000 402a 0000 0000 0000 .)......@*...... < 00000040: 0070 0000 0000 0000 4b00 0000 0000 0000 .p......K....... < 00000050: 5c01 0000 1300 0000 0029 0000 0000 0000 \........)...... --- 👇 > 00000000: e028 4000 0000 0000 9028 4000 0000 0000 .(@......(@..... > 00000010: 0029 4000 0000 0000 0800 0000 0000 0000 .)@............. > 00000020: 0800 0000 0000 0000 902a 4000 0000 0000 .........*@..... > 00000030: 1029 4000 0000 0000 402a 4000 0000 0000 .)@.....@*@..... > 00000040: 0070 4000 0000 0000 4b00 0000 0000 0000 .p@.....K....... > 00000050: 5c01 0000 1300 0000 0029 4000 0000 0000 \........)@.....
Ah, there we have it! A bunch of 0
that become 4
.
Is it because hugo has a base address of 0x400000? What would happen if we operated on /bin/ls instead?
Well, let's try it:
$ cargo run --quiet --release --bin minipak -- /bin/ls -o /tmp/ls.pak Packing guest "/bin/ls" Picked base_offset 0x800000 Stage1 hull: 800000..80b048 Guest hull: 0..24558 Loaded stage1 Relocated stage1 $ diff <(xxd /tmp/unrelocated) <(xxd /tmp/ls.pak) | head -14 1,6c1,6 👇 < 00000000: e028 0000 0000 0000 9028 0000 0000 0000 .(.......(...... < 00000010: 0029 0000 0000 0000 0800 0000 0000 0000 .).............. < 00000020: 0800 0000 0000 0000 902a 0000 0000 0000 .........*...... < 00000030: 1029 0000 0000 0000 402a 0000 0000 0000 .)......@*...... < 00000040: 0070 0000 0000 0000 4b00 0000 0000 0000 .p......K....... < 00000050: 5c01 0000 1300 0000 0029 0000 0000 0000 \........)...... --- 👇 > 00000000: e028 8000 0000 0000 9028 8000 0000 0000 .(.......(...... > 00000010: 0029 8000 0000 0000 0800 0000 0000 0000 .).............. > 00000020: 0800 0000 0000 0000 902a 8000 0000 0000 .........*...... > 00000030: 1029 8000 0000 0000 402a 8000 0000 0000 .)......@*...... > 00000040: 0070 8000 0000 0000 4b00 0000 0000 0000 .p......K....... > 00000050: 5c01 0000 1300 0000 0029 8000 0000 0000 \........)......
Yup, sure enough! They're now 8
.
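We can also predict what the very first patched word should look like. The first relocation listed by readelf is a Relative one with addend 0x28e0, targeting the start of the read-write segment, and x86-64 stores integers little-endian. Here's a sketch of the "write the value" step on a plain byte buffer, a safe stand-in for the raw pointer write in `apply_rela`. `patch_u64` is a hypothetical helper:

```rust
/// Overwrites 8 bytes at `offset` with `value`, little-endian.
/// Returns false (and leaves `image` untouched) if the write
/// would fall outside the buffer.
fn patch_u64(image: &mut [u8], offset: usize, value: u64) -> bool {
    let end = match offset.checked_add(8) {
        Some(end) => end,
        None => return false,
    };
    match image.get_mut(offset..end) {
        Some(slot) => {
            slot.copy_from_slice(&value.to_le_bytes());
            true
        }
        None => false,
    }
}
```

Patching with 0x400000 + 0x28e0 produces the bytes `e0 28 40 00 00 00 00 00`, which is the `e028 4000 0000 0000` at the start of the hugo diff. With 0x800000 the third byte becomes `80`, as in the ls diff.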
Alright, well, there's no telling if our relocations are correct yet, but at least there's definitely something being relocated.
Which means all we need to do now... is generate an ELF object that happens to be an executable.
So, what, write a header?
Yes! And program headers, everything we need.
And it'll be easy! Because deku
not only lets us read binary formats,
it also lets us write binary formats.
Let's go!
fn relink_stage1(guest_hull: Range<u64>, writer: &mut Writer) -> Result<(), Error> { let obj = Object::new(include_bytes!(concat!( env!("OUT_DIR"), "/embeds/libstage1.so" )))?; let hull = obj.segments().load_convex_hull()?; assert_eq!(hull.start, 0, "stage1 must be relocatable"); // Pick a base offset. If our guest is a relocatable executable, pick a // random one, otherwise, pick theirs. let base_offset = if guest_hull.start == 0 { 0x800000 // by fair dice roll } else { guest_hull.start }; println!("Picked base_offset 0x{:x}", base_offset); let hull = (hull.start + base_offset)..(hull.end + base_offset); println!("Stage1 hull: {:x?}", hull); println!(" Guest hull: {:x?}", guest_hull); // Map stage1 wherever... let mut mapped = MappedObject::new(&obj, None)?; println!("Loaded stage1"); // ...but relocate it as if it was mapped at `base_offset` mapped.relocate(base_offset)?; println!("Relocated stage1"); println!("Looking for `entry` in stage1..."); let entry_sym = mapped.lookup_sym("entry")?; let entry_point = base_offset + entry_sym.value; // Collect all the load segments let mut load_segs = obj .segments() .of_type(SegmentType::Load) .collect::<Vec<_>>(); // Now write out some ELF! let out_header = ObjectHeader { class: pixie::ElfClass::Elf64, endianness: Endianness::Little, version: 1, os_abi: OsAbi::SysV, typ: ElfType::Exec, machine: ElfMachine::X86_64, version_bis: 1, entry_point, flags: 0, hdr_size: ObjectHeader::SIZE, // Two additional segments: one for `brk` alignment, and GNU_STACK. 
ph_count: load_segs.len() as u16 + 2, ph_offset: ObjectHeader::SIZE as _, ph_entsize: ProgramHeader::SIZE, // We're not adding any sections, our object will be opaque to debuggers sh_count: 0, sh_entsize: 0, sh_nidx: 0, sh_offset: 0, }; writer.write_deku(&out_header)?; let static_headers = load_segs.iter().map(|seg| { let mut ph = seg.header().clone(); ph.vaddr += base_offset; ph.paddr += base_offset; ph }); for ph in static_headers { writer.write_deku(&ph)?; } // Insert dummy segment to offset the `brk` to its original position // for the guest, if we can. { let current_hull = align_hull(hull); let desired_hull = align_hull(guest_hull); let pad_size = if current_hull.end > desired_hull.end { println!("WARNING: Guest executable is too small, the `brk` will be wrong."); 0x0 } else { desired_hull.end - current_hull.end }; let ph = ProgramHeader { paddr: current_hull.end, vaddr: current_hull.end, memsz: pad_size, filesz: 0, offset: 0, align: 0x1000, typ: SegmentType::Load, flags: ProgramHeader::WRITE | ProgramHeader::READ, }; writer.write_deku(&ph)?; } // Add a GNU_STACK program header for alignment and to make it // non-executable. { let ph = ProgramHeader { paddr: 0, vaddr: 0, memsz: 0, filesz: 0, offset: 0, align: 0x10, typ: SegmentType::GnuStack, flags: ProgramHeader::WRITE | ProgramHeader::READ, }; writer.write_deku(&ph)?; } // Sort load segments by file offset and copy them. { load_segs.sort_by_key(|&seg| seg.header().offset); println!("Copying stage1 segments..."); let copy_start_offset = writer.offset(); println!("copy_start_offset = 0x{:x}", copy_start_offset); let copied_segments = load_segs .into_iter() .filter(move |seg| seg.header().offset > copy_start_offset); for cp_seg in copied_segments { let ph = cp_seg.header(); println!("copying {:?}", ph); // Pad space between segments with zeros: writer.pad(ph.offset - writer.offset())?; // Then copy. 
let start = ph.vaddr; let len = ph.filesz; let end = start + len; writer.write_all(mapped.vaddr_slice(start..end))?; } } // Pad end of last segment with zeros: writer.align(0x1000)?; Ok(()) }
We've used a couple of helper functions, so let's define them now. First, align_hull:
// in `crates/pixie/src/lib.rs` /// Align *down* to the nearest 4K boundary pub fn floor(val: u64) -> u64 { val & !0xFFF } /// Align *up* to the nearest 4K boundary pub fn ceil(val: u64) -> u64 { if floor(val) == val { val } else { floor(val + 0x1000) } } /// Given a convex hull, align its start *down* to the nearest 4K boundary and /// its end *up* to the nearest 4K boundary pub fn align_hull(hull: Range<u64>) -> Range<u64> { floor(hull.start)..ceil(hull.end) }
And then MappedObject::lookup_sym
, which we use to find the address of entry
in libstage1.so
. Luckily this one is trivially expressed using the abstractions
we've already carefully constructed:
// in `crates/pixie/src/lib.rs` impl<'a> MappedObject<'a> { /// Looks up a symbol by name; its `value` is the (non-relocated) vaddr pub fn lookup_sym(&self, name: &str) -> Result<Sym, PixieError> { let dyn_entries = self.object.read_dynamic_entries()?; dyn_entries.syms()?.by_name(name) } }
And now, well... we should be generating a fully-relocated, statically linked
executable from libstage1.so
.
Let's try it?
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak Packing guest "/home/amos/go/bin/hugo" Picked base_offset 0x400000 Stage1 hull: 400000..40b048 Guest hull: 400000..3180968 Loaded stage1 Relocated stage1 Looking for `entry` in stage1... Copying stage1 segments... copy_start_offset = 0x190 copying ProgramHeader { typ: Load, flags: 0x5, offset: 0x2000, vaddr: 0x2000, paddr: 0x2000, filesz: 0x4b0d, memsz: 0x4b0d, align: 0x1000 } copying ProgramHeader { typ: Load, flags: 0x4, offset: 0x7000, vaddr: 0x7000, paddr: 0x7000, filesz: 0x1b88, memsz: 0x1b88, align: 0x1000 } copying ProgramHeader { typ: Load, flags: 0x6, offset: 0x9750, vaddr: 0xa750, paddr: 0xa750, filesz: 0x8c0, memsz: 0x8f8, align: 0x1000 } $ /tmp/hugo.pak [stage1] Stack top: 0x7fffa8187a40
Hurray!!
Woo! We did it!
...and if we've done our job correctly, it should have a structure that's very similar to the original guest executable:
$ readelf -Wl ~/go/bin/hugo | grep -E 'LOAD|MemSiz' Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align LOAD 0x000000 0x0000000000400000 0x0000000000400000 0x172b4b0 0x172b4b0 R E 0x1000 LOAD 0x172c000 0x0000000001b2c000 0x0000000001b2c000 0x155ceb8 0x155ceb8 R 0x1000 LOAD 0x2c89000 0x0000000003089000 0x0000000003089000 0x0b08c0 0x0f7968 RW 0x1000 $ readelf -Wl /tmp/hugo.pak | grep -E 'LOAD|MemSiz' Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align LOAD 0x000000 0x0000000000400000 0x0000000000400000 0x001060 0x001060 R 0x1000 LOAD 0x002000 0x0000000000402000 0x0000000000402000 0x004b0d 0x004b0d R E 0x1000 LOAD 0x007000 0x0000000000407000 0x0000000000407000 0x001b88 0x001b88 R 0x1000 LOAD 0x009750 0x000000000040a750 0x000000000040a750 0x0008c0 0x0008f8 RW 0x1000 LOAD 0x000000 0x000000000040c000 0x000000000040c000 0x000000 0x2d75000 RW 0x1000 $ gdb -q -ex "set noconfirm" -ex "p/x 0x0000000003089000 + 0x0f7968" -ex "p/x 0x000000000040c000 + 0x2d75000" -ex "quit" No symbol table is loaded. Use the "file" command. $1 = 0x3180968 $2 = 0x3181000
Yes! After alignment, both executables end at the same address, and so their
brk
should be the same.
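Here's the arithmetic behind that dummy segment as a standalone sketch, so the numbers can be checked without gdb. It uses a mask-based round-up instead of the `floor`/`ceil` pair from pixie, and `brk_pad` is a hypothetical name for what `relink_stage1` does inline:

```rust
use std::ops::Range;

const PAGE: u64 = 0x1000;

/// Rounds down to a 4K boundary.
fn page_floor(val: u64) -> u64 {
    val & !(PAGE - 1)
}

/// Rounds up to a 4K boundary.
fn page_ceil(val: u64) -> u64 {
    (val + PAGE - 1) & !(PAGE - 1)
}

/// Aligns a hull's start down and its end up.
fn align_hull(hull: Range<u64>) -> Range<u64> {
    page_floor(hull.start)..page_ceil(hull.end)
}

/// Returns `(vaddr, memsz)` for the padding segment that moves the end of
/// our mappings to where the guest's would be, or `None` if stage1
/// already extends past the guest.
fn brk_pad(stage1_hull: Range<u64>, guest_hull: Range<u64>) -> Option<(u64, u64)> {
    let current = align_hull(stage1_hull);
    let desired = align_hull(guest_hull);
    if current.end > desired.end {
        None
    } else {
        Some((current.end, desired.end - current.end))
    }
}
```

For hugo, `brk_pad(0x400000..0x40b048, 0x400000..0x3180968)` gives `(0x40c000, 0x2d75000)`, the exact VirtAddr and MemSiz of the last LOAD segment in the readelf output above.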
Enter stage two
Now remember, we cannot have stage1
directly load the guest — well, right
now, we're not even writing the compressed guest to our output file, so it's
tiny:
$ ls -lhA /tmp/hugo.pak -rwxr-xr-x 1 amos amos 44K May 1 22:59 /tmp/hugo.pak
But still, we cannot have stage1 load the guest, because it's mapped where the guest should be.
First, we need to map stage2
out of the way.
But where is "out of the way"?
Well, that's the beauty of it! The whole area where guest
will eventually be
is already mapped from our executable.
So any call to mmap
(without the FIXED
flag) will give us a region that's
"out of the way" — it won't overwrite an already-mapped region.
Well, right now we don't even have a stage2
to map, so, let's make one!
$ (cd crates && cargo new --lib stage2) warning: compiling this new package may not work due to invalid workspace configuration current package believes it's in a workspace when it's not: current: /home/amos/ftl/minipak/crates/stage2/Cargo.toml workspace: /home/amos/ftl/minipak/Cargo.toml this may be fixable by adding `crates/stage2` to the `workspace.members` array of the manifest located at: /home/amos/ftl/minipak/Cargo.toml Alternatively, to keep it out of the workspace, add the package to the `workspace.exclude` array, or add an empty `[workspace]` table to the package's manifest. Created library `stage2` package
Well, it's requested so politely:
# in the top-level `Cargo.toml` [workspace] members = [ "crates/encore", "crates/pixie", "crates/minipak", "crates/stage1", "crates/stage2", ]
Let's also add a build script:
// in `crates/stage2/build.rs` fn main() { println!("cargo:rustc-link-arg=-Wl,-z,defs"); }
A dependency on encore
, and setting the crate type to cdylib
:
# in `crates/stage2/Cargo.toml` [lib] crate-type = ["cdylib"] [dependencies] encore = { path = "../encore" }
And, well, let's add an entry point to it too!
// in `crates/stage2/src/lib.rs` // Don't use libstd #![no_std] // Allow inline assembly #![feature(asm)] // Allow naked (no-prelude) functions #![feature(naked_functions)] // Use the default allocation error handler #![feature(default_alloc_error_handler)] extern crate alloc; use encore::prelude::*; macro_rules! info { ($($tokens: tt)*) => { println!("[stage2] {}", alloc::format!($($tokens)*)); } } #[no_mangle] #[inline(never)] /// # Safety /// Does a raw syscall, initializes the global allocator unsafe extern "C" fn entry(stack_top: *mut u8) -> ! { init_allocator(); crate::main(stack_top) } /// # Safety /// Maps and jmps to another ELF object #[inline(never)] unsafe fn main(stack_top: *mut u8) -> ! { info!("Stack top: {:?}", stack_top); encore::syscall::exit(0); }
Now, let's consider the chain of events: minipak
generates its executable from
the guest and stage1
. So by the time the "packed executable" starts up,
stage1
is already mapped.
stage2
, however, must be mapped by stage1
. So, it must be embedded into the
"packed executable" as well.
Luckily, we made this next part very easy for ourselves.
First off, whenever we build minipak
, we also want to build stage2
— let's
add it to our build script:
// in `crates/minipak/build.rs` // omitted: other functions fn main() { for &arg in &["-nostartfiles", "-nodefaultlibs", "-static"] { println!("cargo:rustc-link-arg={}", arg); } cargo_build(&PathBuf::from("../stage1")); // new! 👇 cargo_build(&PathBuf::from("../stage2")); }
Second, let's add it as a Resource
in the pixie manifest:
// in `crates/pixie/src/manifest.rs` #[derive(Debug, DekuRead, DekuWrite)] #[deku(magic = b"piximani")] pub struct Manifest { pub stage2: Resource, pub guest: Resource, }
And thirdly, well, thirdly let's embed both libstage2.so
and the compressed
guest into the output executable.
They'll go right after our "relinked stage1":
// in `crates` #[allow(clippy::unnecessary_wraps)] fn main(env: Env) -> Result<(), Error> { let args = cli::Args::parse(&env); println!("Packing guest {:?}", args.input); let guest_file = File::open(args.input)?; let guest_map = guest_file.map()?; let guest_obj = Object::new(guest_map.as_ref())?; let guest_hull = guest_obj.segments().load_convex_hull()?; let mut output = Writer::new(&args.output, 0o755)?; relink_stage1(guest_hull, &mut output)?; let stage2_slice = include_bytes!(concat!(env!("OUT_DIR"), "/embeds/libstage2.so")); let stage2_offset = output.offset(); println!("Copying stage2 at 0x{:x}", stage2_offset); output.write_all(stage2_slice)?; output.align(0x8)?; println!("Compressing guest..."); let compressed_guest = lz4_flex::compress_prepend_size(guest_map.as_ref()); let guest_offset = output.offset(); println!("Copying compressed guest at 0x{:x}", guest_offset); output.write_all(&compressed_guest)?; output.align(0x8)?; let manifest_offset = output.offset(); println!("Writing manifest at 0x{:x}", manifest_offset); let manifest = Manifest { stage2: Resource { offset: stage2_offset as _, len: stage2_slice.len(), }, guest: Resource { offset: guest_offset as _, len: compressed_guest.len(), }, }; output.write_deku(&manifest)?; output.align(0x8)?; println!("Writing end marker"); let end_marker = EndMarker { manifest_offset: manifest_offset as _, }; output.write_deku(&end_marker)?; println!("Written to ({})", args.output); Ok(()) }
There! Now minipak
is feature-complete. Well, the minipak
crate, not the
whole project — stage1
and stage2
are still not complete, but our minipak
executable does everything we want it to do, and its output is a lot chunkier
than before:
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak Packing guest "/home/amos/go/bin/hugo" Picked base_offset 0x400000 Stage1 hull: 400000..40b048 Guest hull: 400000..3180968 Loaded stage1 Relocated stage1 Looking for `entry` in stage1... Copying stage1 segments... copy_start_offset = 0x190 copying ProgramHeader { typ: Load, flags: 0x5, offset: 0x2000, vaddr: 0x2000, paddr: 0x2000, filesz: 0x4b0d, memsz: 0x4b0d, align: 0x1000 } copying ProgramHeader { typ: Load, flags: 0x4, offset: 0x7000, vaddr: 0x7000, paddr: 0x7000, filesz: 0x1b88, memsz: 0x1b88, align: 0x1000 } copying ProgramHeader { typ: Load, flags: 0x6, offset: 0x9750, vaddr: 0xa750, paddr: 0xa750, filesz: 0x8c0, memsz: 0x8f8, align: 0x1000 } Copying stage2 at 0xb000 Compressing guest... Copying compressed guest at 0x15670 Writing manifest at 0x1eda4f8 Writing end marker Written to (/tmp/hugo.pak) $ ls -lhA /tmp/hugo.pak -rwxr-xr-x 1 amos amos 31M May 2 00:06 /tmp/hugo.pak
Mh. How does it compare with the original binary though?
Shh let's keep that for later. When we actually get it to work.
Okay, so! Clearly stage1 has to read the EndMarker, to find the Manifest, so it knows where stage2 is, and it can map it.
Turns out this is relatively compact:
// in `stage1/src/lib.rs`

use pixie::{Manifest, MappedObject, Object};

/// # Safety
/// Maps and calls into another ELF object
#[inline(never)]
unsafe fn main(stack_top: *mut u8) -> ! {
    info!("Stack top: {:?}", stack_top);

    // Open ourselves and read the manifest.
    let file = File::open("/proc/self/exe").unwrap();
    let map = file.map().unwrap();
    let slice = map.as_ref();
    let manifest = Manifest::read_from_full_slice(slice).unwrap();

    // Load stage2 anywhere in memory
    let s2_slice = &slice[manifest.stage2.as_range()];
    let s2_obj = Object::new(s2_slice).unwrap();
    let mut s2_mapped = MappedObject::new(&s2_obj, None).unwrap();
    info!(
        "Mapped stage2 at base 0x{:x} (offset 0x{:x})",
        s2_mapped.base(),
        s2_mapped.base_offset()
    );
    info!("Relocating stage2...");
    s2_mapped.relocate(s2_mapped.base_offset()).unwrap();
    info!("Relocating stage2... done!");

    // Find stage2's entry function and call it
    let s2_entry = s2_mapped.lookup_sym("entry").unwrap();
    info!("Found entry_sym {:?}", s2_entry);
    let entry: unsafe extern "C" fn(*mut u8) -> ! =
        core::mem::transmute(s2_mapped.base_offset() + s2_entry.value);
    entry(stack_top);
}
Of course this uses some types and functions from pixie, so:
# Cargo.toml [dependencies] pixie = { path = "../pixie" }
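To make the "read the EndMarker, find the Manifest" step a bit more concrete, here is a std-only sketch that skips deku entirely. The layout is an assumption for illustration only: a little-endian `u64` manifest offset in the last 8 bytes of the file, pointing at four little-endian `u64`s (stage2 offset, stage2 length, guest offset, guest length). pixie's real on-disk format may well differ, and `read_u64`, `read_manifest` and `build_fake_file` are hypothetical helpers.

```rust
use std::ops::Range;

/// Reads a little-endian u64 at `offset`, or `None` if out of bounds.
fn read_u64(slice: &[u8], offset: usize) -> Option<u64> {
    let bytes = slice.get(offset..offset.checked_add(8)?)?;
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    Some(u64::from_le_bytes(buf))
}

/// Hypothetical manifest: byte ranges of stage2 and of the compressed guest.
#[derive(Debug, PartialEq)]
struct Manifest {
    stage2: Range<usize>,
    guest: Range<usize>,
}

/// Follows the end marker (last 8 bytes) to the manifest and decodes it.
fn read_manifest(slice: &[u8]) -> Option<Manifest> {
    let marker_offset = slice.len().checked_sub(8)?;
    let manifest_offset = read_u64(slice, marker_offset)? as usize;
    let field = |i: usize| -> Option<usize> {
        read_u64(slice, manifest_offset.checked_add(i * 8)?).map(|v| v as usize)
    };
    let (s2_off, s2_len, g_off, g_len) = (field(0)?, field(1)?, field(2)?, field(3)?);
    Some(Manifest {
        stage2: s2_off..s2_off.checked_add(s2_len)?,
        guest: g_off..g_off.checked_add(g_len)?,
    })
}

/// Builds a fake packed file: payloads, then manifest, then end marker.
fn build_fake_file() -> Vec<u8> {
    let mut file = vec![0xAA; 16]; // pretend stage1, bytes 0..16
    file.extend_from_slice(&[0xBB; 8]); // pretend stage2, bytes 16..24
    file.extend_from_slice(&[0xCC; 8]); // pretend guest, bytes 24..32
    let manifest_offset = file.len() as u64;
    for v in &[16u64, 8, 24, 8] {
        file.extend_from_slice(&v.to_le_bytes());
    }
    file.extend_from_slice(&manifest_offset.to_le_bytes());
    file
}

fn main() {
    let file = build_fake_file();
    let manifest = read_manifest(&file).unwrap();
    println!("{:?}", manifest);
}
```

Reading from the end is what lets the packer append things in any order: only the very last bytes need to sit at a known position.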
And now... well, the whole thing won't quite work, but at least we should reach stage2.
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak Packing guest "/home/amos/go/bin/hugo" (cut) $ /tmp/hugo.pak [stage1] Stack top: 0x7ffca05e6de0 [stage1] Mapped stage2 at base 0x7f199bfd2000 (offset 0x7f199bfd2000) [stage1] Relocating stage2... [stage1] Relocating stage2... done! [stage1] Found entry_sym Sym { name: 112, bind: Global, typ: Func, shndx: 7, value: 26240, size: 20 } [stage2] Stack top: 0x7ffca05e6de0
...and we do!
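The jump itself boils down to "add the symbol's value to the base offset, then treat the result as a function pointer". With the numbers from this run, that's 0x7f199bfd2000 + 26240. The sketch below shows the same transmute trick in an ordinary std program, using a local function's address as a stand-in for a symbol found in a mapped object (`fake_entry`, `call_at` and `symbol_address` are hypothetical names):

```rust
/// Stand-in for stage2's `entry`; the real one never returns.
extern "C" fn fake_entry(stack_top: *mut u8) -> usize {
    stack_top as usize + 1
}

/// Calls the function located at `addr`.
///
/// # Safety
/// `addr` must be the address of a live `extern "C" fn(*mut u8) -> usize`.
unsafe fn call_at(addr: usize, stack_top: *mut u8) -> usize {
    // Reinterpret the integer address as a function pointer of the same size.
    let entry: extern "C" fn(*mut u8) -> usize = unsafe { std::mem::transmute(addr) };
    entry(stack_top)
}

/// Address of a symbol once its object is mapped with the given base offset.
fn symbol_address(base_offset: u64, sym_value: u64) -> u64 {
    base_offset + sym_value
}

fn main() {
    // Values printed by stage1 in the run above.
    println!("entry at 0x{:x}", symbol_address(0x7f199bfd2000, 26240));

    // The pointer argument is never dereferenced, it's only passed along.
    let addr = fake_entry as *const () as usize;
    let result = unsafe { call_at(addr, 0x1000usize as *mut u8) };
    println!("fake_entry returned 0x{:x}", result);
}
```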
And now, the pièce de résistance.
Running an executable from memory
Hey, we've done that already!
Why yes, yes we have! But this is our last, and our best.
To launch the executable we have to:
- Map ourselves
- Read the manifest
- Decompress the guest in memory
- Map it in place
- Adjust the PHDR, PHNUM and ENTRY auxiliary vectors
- Jump to the entry point
Oh, what about dynamically-linked executables? That need an interpreter?
Ah, I guess we can do that too! If we find an interpreter segment, we can map it in memory as well, and jump to its entry point, instead of the guest's.
Let's go!
// in `crates/stage2/src/lib.rs`

use pixie::{Manifest, MappedObject, Object, ObjectHeader};

/// # Safety
/// Maps and jmps to another ELF object
#[inline(never)]
unsafe fn main(stack_top: *mut u8) -> ! {
    info!("Stack top: {:?}", stack_top);
    let mut stack = Env::read(stack_top as _);

    // Open ourselves and read the manifest.
    let file = File::open("/proc/self/exe").unwrap();
    info!("Mapping self...");
    let map = file.map().unwrap();
    info!("Mapping self... done!");
    let slice = map.as_ref();
    let manifest = Manifest::read_from_full_slice(slice).unwrap();

    let compressed_guest = &slice[manifest.guest.as_range()];
    let guest = lz4_flex::decompress_size_prepended(compressed_guest).unwrap();
    let guest_obj = Object::new(guest.as_ref()).unwrap();

    let guest_hull = guest_obj.segments().load_convex_hull().unwrap();
    let at = if guest_hull.start == 0 {
        // guest is relocatable, load it with the same base as ourselves
        let elf_header_address = stack.find_vector(AuxvType::PHDR).value;
        let self_base = elf_header_address - ObjectHeader::SIZE as u64;
        Some(self_base)
    } else {
        // guest is non-relocatable, it'll be loaded at its preferred offset
        None
    };
    let base_offset = at.unwrap_or_default();
    let guest_mapped = MappedObject::new(&guest_obj, at).unwrap();
    info!("Mapped guest at 0x{:x}", guest_mapped.base());

    // Set phdr auxiliary vector
    let at_phdr = stack.find_vector(AuxvType::PHDR);
    at_phdr.value = guest_mapped.base() + guest_obj.header().ph_offset;

    // Set phnum auxiliary vector
    let at_phnum = stack.find_vector(AuxvType::PHNUM);
    at_phnum.value = guest_obj.header().ph_count as _;

    // Set entry auxiliary vector
    let at_entry = stack.find_vector(AuxvType::ENTRY);
    at_entry.value = base_offset + guest_obj.header().entry_point;

    match guest_obj.segments().find(SegmentType::Interp) {
        Ok(interp) => {
            let interp = core::str::from_utf8(interp.slice()).unwrap();
            println!("Should load interpreter {}!", interp);

            let interp_file = File::open(interp).unwrap();
            let interp_map = interp_file.map().unwrap();
            let interp_obj = Object::new(interp_map.as_ref()).unwrap();
            let interp_hull = interp_obj.segments().load_convex_hull().unwrap();
            if interp_hull.start != 0 {
                panic!("Expected interpreter to be relocatable");
            }

            // Map interpreter anywhere
            let interp_mapped = MappedObject::new(&interp_obj, None).unwrap();

            // Adjust base
            let at_base = stack.find_vector(AuxvType::BASE);
            at_base.value = interp_mapped.base();

            let entry_point = interp_mapped.base() + interp_obj.header().entry_point;
            info!("Jumping to interpreter's entry point 0x{:x}", entry_point);
            pixie::launch(stack_top, entry_point);
        }
        Err(_) => {
            let entry_point = base_offset + guest_obj.header().entry_point;
            info!("Jumping to guest's entry point 0x{:x}", entry_point);
            pixie::launch(stack_top, entry_point);
        }
    }
}
We just need a couple dependencies:
# in `crates/stage2/Cargo.toml` [dependencies] encore = { path = "../encore" } # 👇 new! pixie = { path = "../pixie" } # 👇 also new! lz4_flex = { version = "0.7.5", default-features = false, features = ["safe-encode", "safe-decode"] }
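The auxiliary-vector bookkeeping in stage2 is easier to see in isolation. In this sketch the auxiliary vector is modeled as a plain slice of (type, value) pairs instead of live stack memory. The `AT_*` numbers are the standard Linux values, while `Auxv`, `find_vector` and `patch_for_guest` are simplified stand-ins, not encore's or pixie's real API. As in the code above, PHDR is computed from the address the guest is mapped at, and ENTRY from the offset added to its virtual addresses (0 for a non-relocatable guest).

```rust
const AT_PHDR: u64 = 3; // address of the program headers
const AT_PHNUM: u64 = 5; // number of program headers
const AT_ENTRY: u64 = 9; // entry point of the executable

#[derive(Debug, Clone, Copy, PartialEq)]
struct Auxv {
    typ: u64,
    value: u64,
}

/// Returns a mutable reference to the first vector of the given type.
fn find_vector(auxv: &mut [Auxv], typ: u64) -> Option<&mut Auxv> {
    auxv.iter_mut().find(|v| v.typ == typ)
}

/// Rewrites PHDR, PHNUM and ENTRY so they describe the guest, not the packer.
/// Returns `None` if one of the three vectors is missing.
fn patch_for_guest(
    auxv: &mut [Auxv],
    base: u64,        // address the guest's first byte is mapped at
    base_offset: u64, // what was added to the guest's virtual addresses
    ph_offset: u64,
    ph_count: u64,
    entry_point: u64,
) -> Option<()> {
    find_vector(auxv, AT_PHDR)?.value = base + ph_offset;
    find_vector(auxv, AT_PHNUM)?.value = ph_count;
    find_vector(auxv, AT_ENTRY)?.value = base_offset + entry_point;
    Some(())
}

fn main() {
    // Made-up values standing in for what the kernel set up for the packer.
    let mut auxv = [
        Auxv { typ: AT_PHDR, value: 0x1111 },
        Auxv { typ: AT_PHNUM, value: 4 },
        Auxv { typ: AT_ENTRY, value: 0x2222 },
    ];
    // A non-relocatable guest mapped at 0x400000; the header count is made up.
    patch_for_guest(&mut auxv, 0x400000, 0, 0x40, 7, 0x4712a0).unwrap();
    for v in &auxv {
        println!("type {} = 0x{:x}", v.typ, v.value);
    }
}
```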
And we're off to the races!
$ cargo run --quiet --release --bin minipak -- ~/go/bin/hugo -o /tmp/hugo.pak Packing guest "/home/amos/go/bin/hugo" (cut) $ /tmp/hugo.pak [stage1] Stack top: 0x7ffc7371c880 [stage1] Mapped stage2 at base 0x7f6cf5481000 (offset 0x7f6cf5481000) [stage1] Relocating stage2... [stage1] Relocating stage2... done! [stage1] Found entry_sym Sym { name: 119, bind: Global, typ: Func, shndx: 7, value: 75936, size: 20 } [stage2] Stack top: 0x7ffc7371c880 [stage2] Mapping self... [stage2] Mapping self... done! [stage2] Mapped guest at 0x400000 [stage2] Jumping to guest's entry point 0x4712a0 Total in 0 ms Error: Unable to locate config file or config directory. Perhaps you need to create a new site. Run `hugo help new` for details.
✨✨✨
We did it! We finally did it.
Let's make sure it also works with dynamically-linked executables:
$ cargo run --quiet --release --bin minipak -- /bin/ls -o /tmp/ls.pak Packing guest "/bin/ls" Picked base_offset 0x800000 Stage1 hull: 800000..81e048 Guest hull: 0..24558 Loaded stage1 Relocated stage1 Looking for `entry` in stage1... WARNING: Guest executable is too small, the `brk` will be wrong. Copying stage1 segments... copy_start_offset = 0x190 copying ProgramHeader { typ: Load, flags: 0x5, offset: 0x3000, vaddr: 0x3000, paddr: 0x3000, filesz: 0x1226d, memsz: 0x1226d, align: 0x1000 } copying ProgramHeader { typ: Load, flags: 0x4, offset: 0x16000, vaddr: 0x16000, paddr: 0x16000, filesz: 0x485c, memsz: 0x485c, align: 0x1000 } copying ProgramHeader { typ: Load, flags: 0x6, offset: 0x1b380, vaddr: 0x1c380, paddr: 0x1c380, filesz: 0x1c90, memsz: 0x1cc8, align: 0x1000 } Copying stage2 at 0x1e000 Compressing guest... Copying compressed guest at 0x39670 Writing manifest at 0x4e180 Writing end marker Written to (/tmp/ls.pak) $ /tmp/ls.pak -lhA [stage1] Stack top: 0x7fff2495c6e0 [stage1] Mapped stage2 at base 0x7fce6c804000 (offset 0x7fce6c804000) [stage1] Relocating stage2... [stage1] Relocating stage2... done! [stage1] Found entry_sym Sym { name: 119, bind: Global, typ: Func, shndx: 7, value: 75936, size: 20 } [stage2] Stack top: 0x7fff2495c6e0 [stage2] Mapping self... [stage2] Mapping self... done! [stage2] Mapped guest at 0x800000 Should load interpreter /lib64/ld-linux-x86-64.so.2! 
[stage2] Jumping to interpreter's entry point 0x7fce6474b090 total 48K drwxr-xr-x 2 amos amos 4.0K May 1 15:53 .cargo -rw-r--r-- 1 amos amos 6.5K May 2 00:38 Cargo.lock -rw-r--r-- 1 amos amos 223 May 1 23:33 Cargo.toml drwxr-xr-x 7 amos amos 4.0K May 1 23:33 crates -rw------- 1 amos amos 2.6K May 1 18:11 .gdb_history drwxr-xr-x 8 amos amos 4.0K May 1 23:28 .git -rw-r--r-- 1 amos amos 21 Feb 21 20:14 .gitignore -rw-r--r-- 1 amos amos 117 May 1 15:44 rust-toolchain drwxr-xr-x 2 amos amos 4.0K May 1 19:45 samples drwxr-xr-x 4 amos amos 4.0K May 1 19:58 target drwxr-xr-x 2 amos amos 4.0K Feb 21 18:28 .vscode
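As an aside on that `Should load interpreter` line: the contents of an `INTERP` segment are a NUL-terminated path. Here is a std-only sketch of turning those bytes into a string. `interp_path` is a hypothetical helper, and how stage2 and encore actually deal with the terminator isn't shown in this article.

```rust
/// Extracts the interpreter path from the raw bytes of an INTERP segment.
/// Returns `None` if there is no NUL terminator or the path isn't valid UTF-8.
fn interp_path(segment: &[u8]) -> Option<&str> {
    let nul = segment.iter().position(|&b| b == 0)?;
    std::str::from_utf8(&segment[..nul]).ok()
}

fn main() {
    let segment = b"/lib64/ld-linux-x86-64.so.2\0";
    println!("{:?}", interp_path(segment));
}
```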
Beary cool! Does it really make sense to compress ls though?
Well, no, not really. ls is already so small, our packed version is actually larger:
$ ls -lhA /bin/ls -rwxr-xr-x 1 root root 139K Mar 6 2020 /bin/ls $ ls -lhA /tmp/ls.pak -rwxr-xr-x 1 amos amos 313K May 2 00:48 /tmp/ls.pak
...because it contains stage1, stage2 and a compressed version of ls, and both stages are pretty chunky right now:
$ ls -lhA ./target/release/build/minipak-51b667ed4cbdb6ec/out/embeds/ total 244K -rw-r--r-- 1 amos amos 177 May 1 19:58 CACHEDIR.TAG -rwxr-xr-x 1 amos amos 118K May 2 00:47 libstage1.so -rwxr-xr-x 1 amos amos 110K May 2 00:47 libstage2.so drwxr-xr-x 7 amos amos 4.0K May 2 00:47 release -rw-r--r-- 1 amos amos 1.6K May 1 19:58 .rustc_info.json
So, let's answer a bunch of questions!
Why are stage1 / stage2 sorta chunky?
Well, first off, ~110K is not that chunky, by desktop computer standards.
It's positively tiny by server computer standards, and it's enormous by embedded standards, but we're not targeting your smartwatch, so all is well.
Still, I was curious what was in there, so I looked, using Bloaty McBloatface:
$ cargo build --release $ objcopy --strip-all ./target/release/libstage1.so /tmp/libstage1.so $ bloaty -d symbols -n 0 --debug-file ./target/release/libstage1.so /tmp/libstage1.so | head -30 FILE SIZE VM SIZE -------------- -------------- 6.9% 8.07Ki 0.0% 0 [Unmapped] 6.3% 7.41Ki 6.9% 7.41Ki [section .rela.dyn] 4.7% 5.50Ki 5.1% 5.50Ki [section .rodata] 4.3% 5.02Ki 4.7% 5.02Ki [section .data.rel.ro] 3.2% 3.77Ki 3.5% 3.77Ki _$LT$pixie..format..header..ObjectHeader$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::hf0dca140941584be 2.7% 3.13Ki 2.9% 3.13Ki _$LT$bitvec..ptr..span..BitSpanError$LT$T$GT$$u20$as$u20$core..fmt..Debug$GT$::fmt::hae2f2e9efb0f4129 2.1% 2.49Ki 2.3% 2.49Ki pixie::MappedObject::relocate::h2fcec852915cca41 2.0% 2.31Ki 2.1% 2.31Ki bitvec::slice::BitSlice$LT$O$C$T$GT$::clone_from_bitslice::hf0e2687b7949ac19 1.7% 2.02Ki 1.9% 2.02Ki [section .text] 1.5% 1.77Ki 1.6% 1.77Ki core::fmt::Formatter::pad::hcb18266da989bb74 1.3% 1.58Ki 1.5% 1.58Ki deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$u16$GT$::read::h52a077145863edef 1.3% 1.58Ki 1.5% 1.58Ki deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$u32$GT$::read::h66b6fabedc184b5f 1.3% 1.58Ki 1.5% 1.58Ki deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$usize$GT$::read::h62ed5fa41ab068c2 1.3% 1.52Ki 1.4% 1.52Ki deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$u8$GT$::read::h4e61a7d6b96113b7 1.3% 1.50Ki 0.0% 0 [ELF Headers] 1.1% 1.35Ki 1.3% 1.35Ki _$LT$str$u20$as$u20$core..fmt..Debug$GT$::fmt::h06fbb5704eb2e464 1.1% 1.28Ki 1.2% 1.28Ki stage1::main::hdd1a3e200abaead0 1.1% 1.26Ki 1.2% 1.26Ki pixie::Object::new::hb545e2dcce88210a 1.0% 1.23Ki 1.1% 1.23Ki 
_$LT$pixie..format..sym..Sym$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::h3cb963e32c924534 1.0% 1.21Ki 1.1% 1.21Ki core::fmt::Formatter::pad_integral::h5030801cc5b3cd80 0.9% 1.05Ki 1.0% 1.05Ki _$LT$pixie..format..program_header..ProgramHeader$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::h92ac268876508bb9 0.9% 1.04Ki 1.0% 1.04Ki core::str::slice_error_fail::h02d9683ab20ccc40 0.8% 1023 0.9% 1023 _$LT$pixie..manifest..Manifest$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::h9d75f433f2e12a83 0.8% 920 0.8% 920 linked_list_allocator::hole::HoleList::allocate_first_fit::h2b05751692364505 0.7% 873 0.8% 873 bitvec::vec::api::_$LT$impl$u20$bitvec..vec..BitVec$LT$O$C$T$GT$$GT$::extend_with::h24831e0a831e998d 0.7% 870 0.8% 870 pixie::MappedObject::lookup_sym::hf84feee74de2706b 0.7% 853 0.8% 853 bitvec::slice::BitSlice$LT$O$C$T$GT$::copy_within_unchecked::h460ce8747088367a 0.7% 817 0.7% 817 _$LT$pixie..format..rela..Rela$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::hef92d06a7ec7525c
Well well well. I won't call out anyone here, but, convenience does come at a cost, it would seem.
Let's look at stage2:
$ objcopy --strip-all ./target/release/libstage2.so /tmp/libstage2.so $ bloaty -d symbols -n 0 --debug-file ./target/release/libstage2.so /tmp/libstage2.so | head -30 FILE SIZE VM SIZE -------------- -------------- 8.0% 8.77Ki 0.0% 0 [Unmapped] 5.8% 6.40Ki 6.4% 6.40Ki [section .rela.dyn] 5.1% 5.58Ki 5.6% 5.58Ki stage2::main::he49bd1bea95ff619 4.6% 5.08Ki 5.1% 5.08Ki [section .rodata] 3.4% 3.77Ki 3.8% 3.77Ki _$LT$pixie..format..header..ObjectHeader$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::hf0dca140941584be 3.3% 3.58Ki 3.6% 3.58Ki [section .data.rel.ro] 2.1% 2.31Ki 2.3% 2.31Ki bitvec::slice::BitSlice$LT$O$C$T$GT$::clone_from_bitslice::hf0e2687b7949ac19 1.7% 1.90Ki 1.9% 1.90Ki [section .text] 1.6% 1.77Ki 1.8% 1.77Ki core::fmt::Formatter::pad::hcb18266da989bb74 1.4% 1.58Ki 1.6% 1.58Ki deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$u16$GT$::read::h52a077145863edef 1.4% 1.58Ki 1.6% 1.58Ki deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$u32$GT$::read::h66b6fabedc184b5f 1.4% 1.58Ki 1.6% 1.58Ki deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$usize$GT$::read::h62ed5fa41ab068c2 1.4% 1.57Ki 1.6% 1.57Ki _$LT$bitvec..ptr..span..BitSpanError$LT$T$GT$$u20$as$u20$core..fmt..Debug$GT$::fmt::hae2f2e9efb0f4129 1.4% 1.52Ki 1.5% 1.52Ki deku::impls::primitive::_$LT$impl$u20$deku..DekuRead$LT$$LP$deku..ctx..Endian$C$deku..ctx..Size$RP$$GT$$u20$for$u20$u8$GT$::read::h4e61a7d6b96113b7 1.3% 1.38Ki 0.0% 0 [ELF Headers] 1.2% 1.35Ki 1.4% 1.35Ki _$LT$str$u20$as$u20$core..fmt..Debug$GT$::fmt::h06fbb5704eb2e464 1.1% 1.26Ki 1.3% 1.26Ki pixie::Object::new::hb545e2dcce88210a 1.1% 1.21Ki 1.2% 1.21Ki core::fmt::Formatter::pad_integral::h5030801cc5b3cd80 1.0% 1.05Ki 1.1% 1.05Ki 
_$LT$pixie..format..program_header..ProgramHeader$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::h92ac268876508bb9 1.0% 1.04Ki 1.1% 1.04Ki core::str::slice_error_fail::h02d9683ab20ccc40 0.9% 1023 1.0% 1023 _$LT$pixie..manifest..Manifest$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::h9d75f433f2e12a83 0.8% 920 0.9% 920 linked_list_allocator::hole::HoleList::allocate_first_fit::h2b05751692364505 0.8% 873 0.9% 873 bitvec::vec::api::_$LT$impl$u20$bitvec..vec..BitVec$LT$O$C$T$GT$$GT$::extend_with::h24831e0a831e998d 0.8% 853 0.8% 853 bitvec::slice::BitSlice$LT$O$C$T$GT$::copy_within_unchecked::h460ce8747088367a 0.7% 812 0.8% 812 _$LT$pixie..manifest..EndMarker$u20$as$u20$deku..DekuContainerRead$GT$::from_bytes::h83ebca148e53570a 0.7% 796 0.8% 796 encore::fs::File::raw_open::hdba6f4608ca26b0d 0.7% 772 0.8% 772 _$LT$core..fmt..builders..PadAdapter$u20$as$u20$core..fmt..Write$GT$::write_str::h7ca3568df6f09b6a 0.7% 770 0.8% 770 bitvec::slice::BitSlice$LT$O$C$T$GT$::copy_within_unchecked::h31c15c4829e980b1
Interestingly, lz4_flex (which stage1 does not use) doesn't even show up in the top 30 hungriest hippos, er, symbols:
$ bloaty -d symbols -n 0 --debug-file ./target/release/libstage2.so /tmp/libstage2.so | grep lz4 0.4% 433 0.4% 433 lz4_flex::block::decompress_safe::duplicate_overlapping_slice::h018f1b297902314e 0.3% 370 0.4% 370 _$LT$lz4_flex..block..DecompressError$u20$as$u20$core..fmt..Debug$GT$::fmt::hb893af1a554d89f6 0.2% 205 0.2% 205 lz4_flex::block::decompress_safe::copy_24::hfa85bd20ab36cca7
Although, maybe it's just been mostly inlined in stage2::main? Hard to tell.
We can always try to ask the compiler to "optimize for size", see if it makes a difference?
[profile.release] opt-level = "s"
$ cargo build --release $ objcopy --strip-all ./target/release/libstage1.so /tmp/libstage1.so $ objcopy --strip-all ./target/release/libstage2.so /tmp/libstage2.so $ ls -lhA /tmp/libstage* -rwxr-xr-x 1 amos amos 114K May 2 01:30 /tmp/libstage1.so -rwxr-xr-x 1 amos amos 98K May 2 01:31 /tmp/libstage2.so
It helps a little!
What about optimization level z?
$ ls -lhA /tmp/libstage* -rwxr-xr-x 1 amos amos 118K May 2 01:32 /tmp/libstage1.so -rwxr-xr-x 1 amos amos 106K May 2 01:32 /tmp/libstage2.so
Mh, nope, s was better for us.
What if we switch from "thin" LTO to "fat" LTO?
[profile.release] lto = "fat"
$ ls -lhA /tmp/libstage* -rwxr-xr-x 1 amos amos 94K May 2 01:33 /tmp/libstage1.so -rwxr-xr-x 1 amos amos 82K May 2 01:33 /tmp/libstage2.so
Mhh, mhh, small gains. We can even bring down codegen-units to 1, to really take advantage of LTO, as explained here by James.
[profile.release] codegen-units = 1 incremental = false
$ ls -lhA /tmp/libstage* -rwxr-xr-x 1 amos amos 82K May 2 01:36 /tmp/libstage1.so -rwxr-xr-x 1 amos amos 70K May 2 01:37 /tmp/libstage2.so
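Put together, the size-oriented settings tried so far would sit in a single profile section. This only merges the snippets shown above (keeping `opt-level = "s"`, since `z` came out larger here); nothing new is added:

```toml
[profile.release]
opt-level = "s"      # optimize for size ("z" was larger in this case)
lto = "fat"          # "fat" LTO instead of "thin"
codegen-units = 1    # a single codegen unit, to get the most out of LTO
incremental = false
```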
Finally, we can force libcore to be built with those settings as well:
To get this to compile, we had to comment out mentions of compiler-builtins in the encore crate.
Also, -Z build-std is a nightly flag, it only works because we ask for a nightly toolchain in the rust-toolchain file.
$ cargo build -Z build-std --target x86_64-unknown-linux-gnu --release
Note also that -Z build-std requires --target to be set, and that changes the directory where the libraries are produced:
$ objcopy --strip-all ./target/x86_64-unknown-linux-gnu/release/libstage1.so /tmp/libstage1.so $ objcopy --strip-all ./target/x86_64-unknown-linux-gnu/release/libstage2.so /tmp/libstage2.so $ ls -lhA /tmp/libstage* -rwxr-xr-x 1 amos amos 70K May 2 01:43 /tmp/libstage1.so -rwxr-xr-x 1 amos amos 54K May 2 01:43 /tmp/libstage2.so
Let's try to re-pack ls:
$ ./target/x86_64-unknown-linux-gnu/release/minipak /bin/ls -o /tmp/ls.pak Packing guest "/bin/ls" (cut) $ ls -lhA /tmp/ls.pak -rwxr-xr-x 1 amos amos 237K May 2 01:44 /tmp/ls.pak
Well, that's 25% less than before! Pretty cool.
There are other things we could do!
For example, we could compress stage2 itself: we've seen that adding lz4_flex to the mix didn't make a big difference between stage1 and stage2, and stage2 is actually quite compressible:
$ lz4 -9 ./target/x86_64-unknown-linux-gnu/release/libstage2.so /tmp/libstage2.so.lz4 Compressed 701800 bytes into 243010 bytes ==> 34.63%
We could also make stage1 not use pixie at all: we could have minipak do most of the work, generating a list of relocations that we can easily read and process from stage1 without knowledge of the ELF file format. That would probably cut down on stage1's size.
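One way such a precomputed relocation list could look, purely as a sketch: if minipak boiled every relocation down to "at this offset, store base plus this addend" (which is how `R_X86_64_RELATIVE` relocations behave), stage1 would only need a loop over 16-byte records. The record layout and the `apply_relocs` name are invented here, and the example patches a plain byte buffer rather than mapped memory.

```rust
/// Applies `(offset, addend)` records, each made of two little-endian u64s,
/// by writing `base + addend` at `image[offset..offset + 8]`.
/// Returns the number of records applied, or `None` if the list is truncated
/// or a record points outside the image.
fn apply_relocs(image: &mut [u8], base: u64, relocs: &[u8]) -> Option<usize> {
    if relocs.len() % 16 != 0 {
        return None;
    }
    let mut applied = 0;
    for record in relocs.chunks_exact(16) {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&record[..8]);
        let offset = u64::from_le_bytes(buf) as usize;
        buf.copy_from_slice(&record[8..]);
        let addend = u64::from_le_bytes(buf);

        let target = image.get_mut(offset..offset.checked_add(8)?)?;
        target.copy_from_slice(&base.wrapping_add(addend).to_le_bytes());
        applied += 1;
    }
    Some(applied)
}

fn main() {
    let mut image = vec![0u8; 32];
    // One record: at offset 8, store base + 0x1234.
    let mut relocs = Vec::new();
    relocs.extend_from_slice(&8u64.to_le_bytes());
    relocs.extend_from_slice(&0x1234u64.to_le_bytes());

    let applied = apply_relocs(&mut image, 0x7f00_0000_0000, &relocs).unwrap();
    println!("applied {} relocation(s): {:02x?}", applied, &image[8..16]);
}
```

The point of a format like this is that the reader needs no ELF parsing at all, just two integer reads per record.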
And finally, we could switch to a different compression format. I'm not sure LZ4 is the best compromise in terms of compression ratio, decompression speed and code size.
So, how did we do?
The last thing I want to do is compare against... well, the only ELF packer I'm aware of: UPX.
Unfortunately, UPX refuses to pack /bin/ls:
$ upx -1 /bin/ls -o /tmp/ls.upx1 Ultimate Packer for eXecutables Copyright (C) 1996 - 2020 UPX git-d7ba31+ Markus Oberhumer, Laszlo Molnar & John Reiser Jan 23rd 2020 File size Ratio Format Name -------------------- ------ ----------- ----------- upx: /bin/ls: CantPackException: bad DT_GNU_HASH n_bucket=0x15 n_bitmask=0x2 len=0xb0 Packed 0 files.
But it'll happily pack hugo, so, let's get comparing:
$ ls -lhA ~/go/bin/hugo /tmp/hugo.* -rwxr-xr-x 1 amos amos 61M Jan 26 10:44 /home/amos/go/bin/hugo -rwxr-xr-x 1 amos amos 31M May 2 01:56 /tmp/hugo.pak -rwxr-xr-x 1 amos amos 29M Jan 26 10:44 /tmp/hugo.upx1 -rwxr-xr-x 1 amos amos 26M Jan 26 10:44 /tmp/hugo.upx9
Honestly? I'm pretty happy with those results.
I'm also curious how long it takes to start up each of these.
I'm on a laptop right now, so this will be, uh, less than scientific, also there's disk caches involved, and the whole thing is running in WSL2, but still, let's take a look using hyperfine:
$ hyperfine --warmup 5 '~/go/bin/hugo version' '/tmp/hugo.upx1 version' '/tmp/hugo.upx9 version' '/tmp/hugo.pak version' Benchmark #1: ~/go/bin/hugo version Time (mean ± σ): 24.9 ms ± 3.6 ms [User: 30.6 ms, System: 16.2 ms] Range (min … max): 20.5 ms … 41.8 ms 112 runs Benchmark #2: /tmp/hugo.upx1 version Time (mean ± σ): 209.2 ms ± 15.5 ms [User: 214.9 ms, System: 17.1 ms] Range (min … max): 195.8 ms … 253.6 ms 14 runs Benchmark #3: /tmp/hugo.upx9 version Time (mean ± σ): 179.8 ms ± 18.8 ms [User: 183.1 ms, System: 20.0 ms] Range (min … max): 160.9 ms … 232.5 ms 16 runs Benchmark #4: /tmp/hugo.pak version Time (mean ± σ): 203.4 ms ± 9.3 ms [User: 179.7 ms, System: 45.8 ms] Range (min … max): 187.2 ms … 217.5 ms 15 runs Summary '~/go/bin/hugo version' ran 7.23 ± 1.28 times faster than '/tmp/hugo.upx9 version' 8.18 ± 1.23 times faster than '/tmp/hugo.pak version' 8.41 ± 1.36 times faster than '/tmp/hugo.upx1 version'
Again, nothing to be ashamed of there: the upx -9 version seems faster than both the upx -1 version and the minipak version, but dang, I'm pretty happy with those results.
Now let's try it on a large Rust binary: futile, which powers this website.
$ ls -lhA ~/futile /tmp/futile.* -rwxr-xr-x 1 amos amos 27M May 2 02:05 /home/amos/futile -rwxr-xr-x 1 amos amos 12M May 2 02:06 /tmp/futile.pak -rwxr-xr-x 1 amos amos 11M May 2 02:05 /tmp/futile.upx1 -rwxr-xr-x 1 amos amos 8.3M May 2 02:05 /tmp/futile.upx9
Again, not too bad! upx -9 has a strong lead here too, but keep in mind it's been developed over two decades and its -9 setting uses seven different passes!
What about startup times?
$ hyperfine --warmup 15 '~/futile help' '/tmp/futile.upx1 help' '/tmp/futile.upx9 help' '/tmp/futile.pak help' Benchmark #1: ~/futile help Time (mean ± σ): 7.0 ms ± 3.5 ms [User: 4.4 ms, System: 6.2 ms] Range (min … max): 1.8 ms … 16.1 ms 520 runs Warning: Command took less than 5 ms to complete. Results might be inaccurate. Benchmark #2: /tmp/futile.upx1 help Time (mean ± σ): 80.5 ms ± 9.4 ms [User: 76.1 ms, System: 6.1 ms] Range (min … max): 74.9 ms … 115.1 ms 37 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Benchmark #3: /tmp/futile.upx9 help Time (mean ± σ): 72.0 ms ± 6.7 ms [User: 69.0 ms, System: 4.6 ms] Range (min … max): 67.7 ms … 103.6 ms 39 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Benchmark #4: /tmp/futile.pak help Time (mean ± σ): 71.9 ms ± 4.9 ms [User: 66.4 ms, System: 6.9 ms] Range (min … max): 68.3 ms … 94.5 ms 38 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Summary '~/futile help' ran 10.34 ± 5.26 times faster than '/tmp/futile.pak help' 10.35 ± 5.30 times faster than '/tmp/futile.upx9 help' 11.58 ± 5.99 times faster than '/tmp/futile.upx1 help'
Not bad at all!
(I did my best here, but there were statistical outliers in all the runs. I closed every program I could afford to, and still I couldn't get it to behave. Ah well.)
So, we've done big Go binary, big Rust binary... how about big C++ binary?
Let's compress electron with this!
Compressing with UPX went fine, but minipak crashed:
$ ~/ftl/minipak/target/x86_64-unknown-linux-gnu/release/minipak electron -o electron.pak Packing guest "electron" Picked base_offset 0x800000 Stage1 hull: 800000..815040 Guest hull: 0..8227a08 Loaded stage1 Relocated stage1Looking for `entry` in stage1... Copying stage1 segments...copy_start_offset = 0x190 copying ProgramHeader { typ: Load, flags: 0x5, offset: 0x2000, vaddr: 0x2000, paddr: 0x2000, filesz: 0xbdc1, memsz: 0xbdc1, align: 0x1000 }copying ProgramHeader { typ: Load, flags: 0x4, offset: 0xe000, vaddr: 0xe000, paddr: 0xe000, filesz: 0x4224, memsz: 0x4224, align: 0x1000 } copying ProgramHeader { typ: Load, flags: 0x6, offset: 0x12a20, vaddr: 0x13a20, paddr: 0x13a20, filesz: 0x15e8, memsz: 0x1620, align: 0x1000 } Copying stage2 at 0x15000 Compressing guest...panicked at 'memory allocation of 150278328 bytes failed', /home/amos/.rustup/toolchains/nightly-2021-04-25-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:386:9 [1] 32021 illegal hardware instruction ~/ftl/minipak/target/x86_64-unknown-linux-gnu/release/minipak electron -o
Well, yup, we used a fixed heap size and it looks like, for this, 128 MiB weren't enough!
Let's bump that to 512:
// in `crates/encore/src/items.rs` /// Heap size, in megabytes const HEAP_SIZE_MB: u64 = 512;
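The arithmetic behind that failure is quick to check: the panic message asks for 150278328 bytes, and 128 MiB is 134217728 bytes. A tiny sketch (the `heap_bytes` and `fits_in_heap` helpers are hypothetical, and they ignore whatever else is already living on the heap):

```rust
/// Heap size in bytes for a size given in MiB.
const fn heap_bytes(heap_size_mb: u64) -> u64 {
    heap_size_mb * 1024 * 1024
}

/// Whether a single allocation could fit in an otherwise empty heap.
fn fits_in_heap(heap_size_mb: u64, request: u64) -> bool {
    request <= heap_bytes(heap_size_mb)
}

fn main() {
    let request: u64 = 150_278_328; // from the panic message above
    for &mb in &[128u64, 512] {
        println!(
            "{} MiB heap ({} bytes): fits = {}",
            mb,
            heap_bytes(mb),
            fits_in_heap(mb, request)
        );
    }
}
```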
And now, compression works!
Let's compare sizes:
$ ls -lhA ./electron* -rwxr-xr-x 1 amos amos 131M May 2 02:15 ./electron -rwxr-xr-x 1 amos amos 73M May 2 02:23 ./electron.pak -rwxr-xr-x 1 amos amos 63M May 2 02:15 ./electron.upx1 -rwxr-xr-x 1 amos amos 53M May 2 02:15 ./electron.upx9
And startup times:
$ hyperfine --warmup 3 './electron -v' './electron.upx1 -v' './electron.upx9 -v' './electron.pak -v'
Benchmark #1: ./electron -v
  Time (mean ± σ): 107.5 ms ± 12.0 ms [User: 72.0 ms, System: 11.5 ms]
  Range (min … max): 98.2 ms … 140.1 ms 29 runs
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark #2: ./electron.upx1 -v
  Time (mean ± σ): 1.679 s ± 0.124 s [User: 766.7 ms, System: 84.0 ms]
  Range (min … max): 1.491 s … 1.807 s 10 runs
Benchmark #3: ./electron.upx9 -v
  Time (mean ± σ): 1.511 s ± 0.138 s [User: 704.6 ms, System: 66.0 ms]
  Range (min … max): 1.335 s … 1.670 s 10 runs
Benchmark #4: ./electron.pak -v
  Time (mean ± σ): 2.079 s ± 0.597 s [User: 558.6 ms, System: 309.1 ms]
  Range (min … max): 1.235 s … 2.833 s 10 runs
Summary
  './electron -v' ran
    14.06 ± 2.03 times faster than './electron.upx9 -v'
    15.63 ± 2.09 times faster than './electron.upx1 -v'
    19.34 ± 5.96 times faster than './electron.pak -v'
No surprises there — electron is a beast (but still, 100ms startup time uncompressed is remarkable, given how much it packs).
But, like... we made an executable packer, that compresses electron.
And it runs, it's the real thing!
That's pretty darn cool.
Epilogue
This concludes my longest series so far, "Making our own executable packer".
All the way back in Part 3, I jokingly predicted that I would never finish it:
This series is never going to end.
In 2060, when I'm 70, and everybody will have switched to using Fuchsia on the desktop, my friends will still poke fun at me: "Hey amos, remember your ELF series? When's it gonna end?", and I'll feign a smile, but inside I will be acutely, painfully aware that I have angered the binary gods and that I should have left well enough alone.
Well, take that, 2020 me. We did it reddit! Woo!
I'd like to thank everyone for sticking around to see this series to its conclusion, especially my patrons.
I know you're all probably wondering "what's next??" and the answer is: sleep.
Lots and lots of sleep.
And then, who knows! So many interesting topics. I'm sure y'all will have great suggestions.
I hope you enjoyed the series, and if you've followed at home, send me screenshots of your stuff running! That would make me really happy.
Until next time, take care!