Like I said earlier, we're not going to bother with big-endian ELFs, or 32-bit ELFs, so we
can hardcode a few values. Let's get started!
That looks reasonable. Let's make another crate to test the delf
crate.
We'll name it "elk", for "Executable & Linker Kit".
Good! Good start.
Let's start with the type - declaring an enum in Rust is simple enough.
We'll want to derive some useful traits: Debug for printing, Clone and Copy so it has copy semantics (instead of moving ownership), and PartialEq and Eq to compare Type values for equality.
Rust code
// in `delf/src/lib.rs`
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Type {
    None,
    Rel,
    Exec,
    Dyn,
    Core,
}
When reading our ELF file though, we won't get a Type - we'll get a u16. And similarly, when we write out an ELF file, we'll also need a u16.
We could do the mapping ourselves, like so:
Rust code
// in `delf/src/lib.rs`
impl Type {
    fn to_u16(&self) -> u16 {
        match self {
            Self::None => 0,
            Self::Rel => 1,
            Self::Exec => 2,
            Self::Dyn => 3,
            // etc.
        }
    }
}
Or we could just set the representation of the Type enum to u16 - and then we'd get that for free:
Rust code
// in `delf/src/lib.rs`
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u16)]
pub enum Type {
    None = 0x0,
    Rel = 0x1,
    Exec = 0x2,
    Dyn = 0x3,
    Core = 0x4,
}
Now, we can use the as operator to cast our Type to u16 - let's write a quick test to validate our assumptions:
Rust code
// in `delf/src/lib.rs`
#[cfg(test)]
mod tests {
    #[test]
    fn type_to_u16() {
        assert_eq!(super::Type::Dyn as u16, 0x3);
    }
}
And run it:
Shell session
$ # t == test, --lib == only library unit tests (not doc tests)
$ cargo t --lib
Finished test [unoptimized + debuginfo] target(s) in 0.02s
Running target/debug/deps/delf-d6fdd5529c793a0b
running 1 test
test tests::type_to_u16 ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Splendid! Now there's the small matter of going the other way - converting a u16 to a Type. The problem here, of course, is that not all u16 values are valid Type values.
In fact, had we written from_u16 ourselves, we would have had to deal with this problem head-on:
Rust code
// in `delf/src/lib.rs`
impl Type {
    pub fn from_u16(x: u16) -> Self {
        match x {
            0 => Self::None,
            1 => Self::Rel,
            2 => Self::Exec,
            3 => Self::Dyn,
            4 => Self::Core,
        }
    }
}
The code above fails to compile, and the compiler tells us why:
Shell session
# b = build, -q = quiet
$ cargo b -q
error[E0004]: non-exhaustive patterns: `5u16..=std::u16::MAX` not covered
--> src/lib.rs:17:15
|
17 | match x {
| ^ pattern `5u16..=std::u16::MAX` not covered
|
= help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
To deal with that possibility, we could return an Option<Self> instead:
Rust code
// in `delf/src/lib.rs`
impl Type {
    pub fn from_u16(x: u16) -> Option<Self> {
        match x {
            0 => Some(Self::None),
            1 => Some(Self::Rel),
            2 => Some(Self::Exec),
            3 => Some(Self::Dyn),
            4 => Some(Self::Core),
            _ => None,
        }
    }
}

#[cfg(test)]
mod tests {
    // omitted: previous tests

    #[test]
    fn type_from_u16() {
        assert_eq!(super::Type::from_u16(0x3), Some(super::Type::Dyn));
        assert_eq!(super::Type::from_u16(0xf00d), None);
    }
}
And that works!
Shell session
$ cargo t --lib
Finished test [unoptimized + debuginfo] target(s) in 0.02s
Running target/debug/deps/delf-d6fdd5529c793a0b
running 2 tests
test tests::type_to_u16 ... ok
test tests::type_from_u16 ... ok
test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
...but it's kinda tedious. We already specify a Type <-> u16 mapping when we first define the enum. Why should we repeat ourselves?
As it turns out.. *eyes light up in wonder* there is a crate for that! derive-try-from-primitive solves exactly that problem. Let's give it a shot:
Shell session
$ cd delf/
$ cargo add derive-try-from-primitive@1
Adding derive-try-from-primitive v1.0.0 to dependencies
Rust code
// in `delf/src/lib.rs`
use derive_try_from_primitive::TryFromPrimitive;

#[derive(Debug, Clone, Copy, PartialEq, Eq, TryFromPrimitive)]
#[repr(u16)]
pub enum Type {
    None = 0x0,
    Rel = 0x1,
    Exec = 0x2,
    Dyn = 0x3,
    Core = 0x4,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, TryFromPrimitive)]
#[repr(u16)]
pub enum Machine {
    X86 = 0x03,
    X86_64 = 0x3e,
}

#[cfg(test)]
mod tests {
    use super::Machine;
    use std::convert::TryFrom;

    #[test]
    fn try_enums() {
        assert_eq!(Machine::X86_64 as u16, 0x3E);
        assert_eq!(Machine::try_from(0x3E), Ok(Machine::X86_64));
        assert_eq!(Machine::try_from(0xFA), Err(0xFA));
    }
}
In a previous version of this article, we used derive-try-from-primitive v0.1.0, which returned an Option<T>. But, since then, derive-try-from-primitive v1.0.0 was released, which returns a Result<T, E> since it implements the standard TryFrom interface.
Sure enough, it works as expected!
Shell session
$ cargo t
Finished test [unoptimized + debuginfo] target(s) in 0.01s
Running /home/amos/ftl/elf-series/target/debug/deps/delf-2d44b198a598eda8
running 1 test
test tests::try_enums ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Doc-tests delf
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
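If you're curious what the derive buys us, here is a hand-written sketch of an equivalent TryFrom implementation for Machine. It is an assumption about the general shape of the generated code (the test above only tells us the error type is the rejected u16), not the crate's literal output, and it uses nothing but the standard library.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u16)]
pub enum Machine {
    X86 = 0x03,
    X86_64 = 0x3e,
}

// Hand-written stand-in for the derive (assumed shape, not the crate's
// actual expansion): unknown values are handed back as the error.
impl TryFrom<u16> for Machine {
    type Error = u16;

    fn try_from(x: u16) -> Result<Self, Self::Error> {
        match x {
            x if x == Machine::X86 as u16 => Ok(Machine::X86),
            x if x == Machine::X86_64 as u16 => Ok(Machine::X86_64),
            other => Err(other),
        }
    }
}

fn main() {
    assert_eq!(Machine::try_from(0x3e), Ok(Machine::X86_64));
    assert_eq!(Machine::try_from(0xfa), Err(0xfa));
    println!("{:?}", Machine::try_from(0x03));
}
```

The match guards reuse the discriminants from the enum definition, so the mapping is still only written down once per variant.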
Now, we can finally parse both the type and the machine. First let's add them to the File struct:
Rust code
// in `delf/src/lib.rs`
#[derive(Debug)]
pub struct File {
    pub type: Type,
    pub machine: Machine,
}
But wait... isn't type a keyword?
Shell session
$ cargo b -q
error: expected identifier, found keyword `type`
--> src/lib.rs:24:5
|
24 | type: Type,
| ^^^^ expected identifier, found keyword
|
help: you can escape reserved keywords to use them as identifiers
|
24 | r#type: Type,
| ^^^^^^
It is! We could use typ instead - but just this once, let's use the escaped form, like the compiler suggests.
Rust code
#[derive(Debug)]
pub struct File {
    pub r#type: Type,
    pub machine: Machine,
}
Better! Kinda awkward, but at least this compiles.
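Here's a tiny standalone sketch of how a raw identifier behaves, using cut-down versions of our types (only one field, only two variants). The r# prefix only exists in source code: the field's actual name is still type, which is what the derived Debug implementation prints.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Type {
    Exec,
    Dyn,
}

#[derive(Debug)]
pub struct File {
    pub r#type: Type,
}

fn main() {
    let file = File { r#type: Type::Dyn };
    // The `r#` prefix is needed every time we name the field in code...
    assert_eq!(file.r#type, Type::Dyn);
    assert_ne!(file.r#type, Type::Exec);
    // ...but the derived Debug output uses the plain field name.
    assert_eq!(format!("{:?}", file), "File { type: Dyn }");
}
```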
Let's parse:
Rust code
// in `delf/src/lib.rs`

// new!
use std::convert::TryFrom;

impl File {
    // omitted: magic constant

    pub fn parse(i: parse::Input) -> parse::Result<Self> {
        use nom::{
            bytes::complete::{tag, take},
            combinator::map,
            error::context,
            number::complete::le_u16,
            sequence::tuple,
        };

        // omitted: parse magic, etc.

        let (i, (r#type, machine)) = tuple((
            context("Type", map(le_u16, |x| Type::try_from(x).unwrap())),
            context("Machine", map(le_u16, |x| Machine::try_from(x).unwrap())),
        ))(i)?;

        let res = Self { machine, r#type };
        Ok((i, res))
    }
}
To try that, we'll switch back to our binary crate, elk, for a second, and print the File struct, now that it has fields!
Rust code
// in `elk/src/main.rs`
// (imports shown for completeness)
use std::{env, error::Error, fs};

fn main() -> Result<(), Box<dyn Error>> {
    let input_path = env::args().nth(1).expect("usage: elk FILE");
    let input = fs::read(&input_path)?;
    let (_, file) = delf::File::parse(&input[..]).map_err(|e| format!("{:?}", e))?;
    // new!
    println!("{:#?}", file);

    Ok(())
}
Shell session
$ # in elk/
$ cargo b -q
$ ./target/debug/elk /bin/true
File {
type: Dyn,
machine: X86_64,
}
Wonderful! What happens if we run it on a 32-bit ELF?
Shell session
$ ./target/debug/elk /usr/lib32/libc.so
Error: "Error(VerboseError { errors: [([47, 42, 32, 71, 78, 85, 32, 108, 100, 32, 115, 99, 114, 105, 112, 116, 10, 32, 32, 32, 85, 115, 101, 32, 116, 104, 101, 32, 115, 104, 97, 114, 101, 100, 32, 108, 105, 98, 114, 97, 114, 121, 44, 32, 98, 117, 116, 32, 115, 111, 109, 101, 32, 102, 117, 110, 99, 116, 105, 111, 110, 115, 32, 97, 114, 101, 32, 111, 110, 108, 121, 32, 105, 110, 10, 32, 32, 32, 116, 104, 101, 32, 115, 116, 97, 116, 105, 99, 32, 108, 105, 98, 114, 97, 114, 121, 44, 32, 115, 111, 32, 116, 114, 121, 32, 116, 104, 97, 116, 32, 115, 101, 99, 111, 110, 100, 97, 114, 105, 108, 121, 46, 32, 32, 42, 47, 10, 79, 85, 84, 80, 85, 84, 95, 70, 79, 82, 77, 65, 84, 40, 101, 108, 102, 51, 50, 45, 105, 51, 56, 54, 41, 10, 71, 82, 79, 85, 80, 32, 40, 32, 47, 117, 115, 114, 47, 108, 105, 98, 51, 50, 47, 108, 105, 98, 99, 46, 115, 111, 46, 54, 32, 47, 117, 115, 114, 47, 108, 105, 98, 51, 50, 47, 108, 105, 98, 99, 95, 110, 111, 110, 115, 104, 97, 114, 101, 100, 46, 97, 32, 32, 65, 83, 95, 78, 69, 69, 68, 69, 68, 32, 40, 32, 47, 117, 115, 114, 47, 108, 105, 98, 51, 50, 47, 108, 100, 45, 108, 105, 110, 117, 120, 46, 115, 111, 46, 50, 32, 41, 32, 41, 10], Nom(Tag)), ([47, 42, 32, 71, 78, 85, 32, 108, 100, 32, 115, 99, 114, 105, 112, 116, 10, 32, 32, 32, 85, 115, 101, 32, 116, 104, 101, 32, 115, 104, 97, 114, 101, 100, 32, 108, 105, 98, 114, 97, 114, 121, 44, 32, 98, 117, 116, 32, 115, 111, 109, 101, 32, 102, 117, 110, 99, 116, 105, 111, 110, 115, 32, 97, 114, 101, 32, 111, 110, 108, 121, 32, 105, 110, 10, 32, 32, 32, 116, 104, 101, 32, 115, 116, 97, 116, 105, 99, 32, 108, 105, 98, 114, 97, 114, 121, 44, 32, 115, 111, 32, 116, 114, 121, 32, 116, 104, 97, 116, 32, 115, 101, 99, 111, 110, 100, 97, 114, 105, 108, 121, 46, 32, 32, 42, 47, 10, 79, 85, 84, 80, 85, 84, 95, 70, 79, 82, 77, 65, 84, 40, 101, 108, 102, 51, 50, 45, 105, 51, 56, 54, 41, 10, 71, 82, 79, 85, 80, 32, 40, 32, 47, 117, 115, 114, 47, 108, 105, 98, 51, 50, 47, 108, 105, 98, 99, 46, 115, 111, 46, 54, 32, 47, 117, 
115, 114, 47, 108, 105, 98, 51, 50, 47, 108, 105, 98, 99, 95, 110, 111, 110, 115, 104, 97, 114, 101, 100, 46, 97, 32, 32, 65, 83, 95, 78, 69, 69, 68, 69, 68, 32, 40, 32, 47, 117, 115, 114, 47, 108, 105, 98, 51, 50, 47, 108, 100, 45, 108, 105, 110, 117, 120, 46, 115, 111, 46, 50, 32, 41, 32, 41, 10], Context(\"Magic\"))] })"
Oh my.
That's too much output. And we'd rather see hexadecimal notation here, wouldn't we?
Let's make a quick helper:
Rust code
// in `delf/src/lib.rs`
use std::fmt;

pub struct HexDump<'a>(&'a [u8]);

impl<'a> fmt::Debug for HexDump<'a> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // only print the first 20 bytes, two hex digits each
        for &x in self.0.iter().take(20) {
            write!(f, "{:02x} ", x)?;
        }
        Ok(())
    }
}
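As a quick standalone sanity check of the helper (repeated here so the snippet compiles on its own), we can dump the four magic bytes and a slice that's longer than the 20-byte limit:

```rust
use std::fmt;

pub struct HexDump<'a>(&'a [u8]);

impl<'a> fmt::Debug for HexDump<'a> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // Only show the first 20 bytes, each as two hex digits plus a space
        for &x in self.0.iter().take(20) {
            write!(f, "{:02x} ", x)?;
        }
        Ok(())
    }
}

fn main() {
    // The ELF magic: 0x7f followed by "ELF"
    let magic = [0x7f, b'E', b'L', b'F'];
    let dumped = format!("{:?}", HexDump(&magic));
    assert_eq!(dumped, "7f 45 4c 46 ");

    // Longer inputs are cut off after 20 bytes (3 characters per byte)
    let long = [0u8; 64];
    assert_eq!(format!("{:?}", HexDump(&long)).len(), 60);
    println!("{}", dumped);
}
```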
And try it. Remember there are two failure modes in nom:
nom::Err::Error is recoverable. Maybe another branch will work (if we're trying several parsers in a row), or maybe we just need to get more input.
nom::Err::Failure is unrecoverable - we've tried all the parsers, more input won't help, something is just plain wrong.
In either case, they just wrap a nom::error::VerboseError, which itself can contain multiple nom::error::VerboseErrorKind entries (such as Nom(ErrorKind) or Context(...)), along with the relevant input slice.
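To make the recoverable / unrecoverable distinction concrete, here is a deliberately simplified model of it in plain Rust. ParseErr, Parser, first_match and the three toy parsers are all made up for this sketch; nom's real error type is generic and has a third variant, and first_match only roughly mimics what trying several parsers in a row does.

```rust
// Simplified stand-in for nom's error wrapper (hypothetical names,
// not nom's actual types).
#[derive(Debug, PartialEq)]
enum ParseErr {
    Error(&'static str),   // recoverable: another branch may still match
    Failure(&'static str), // unrecoverable: stop trying
}

type Parser = fn(&[u8]) -> Result<u8, ParseErr>;

// Try each parser in order: move on after an `Error`,
// but give up immediately on a `Failure`.
fn first_match(parsers: &[Parser], input: &[u8]) -> Result<u8, ParseErr> {
    let mut last = ParseErr::Error("no parsers given");
    for parser in parsers {
        match parser(input) {
            Ok(value) => return Ok(value),
            Err(ParseErr::Error(msg)) => last = ParseErr::Error(msg),
            Err(failure) => return Err(failure),
        }
    }
    Err(last)
}

fn byte_a(input: &[u8]) -> Result<u8, ParseErr> {
    match input.first() {
        Some(b'a') => Ok(b'a'),
        _ => Err(ParseErr::Error("expected 'a'")),
    }
}

fn byte_b(input: &[u8]) -> Result<u8, ParseErr> {
    match input.first() {
        Some(b'b') => Ok(b'b'),
        _ => Err(ParseErr::Error("expected 'b'")),
    }
}

fn give_up(_input: &[u8]) -> Result<u8, ParseErr> {
    Err(ParseErr::Failure("something is just plain wrong"))
}

fn main() {
    // byte_a returns a recoverable Error, so byte_b still gets a chance
    let lenient: [Parser; 2] = [byte_a, byte_b];
    assert_eq!(first_match(&lenient, b"b"), Ok(b'b'));

    // give_up returns a Failure, so byte_b is never tried
    let strict: [Parser; 2] = [give_up, byte_b];
    assert_eq!(
        first_match(&strict, b"b"),
        Err(ParseErr::Failure("something is just plain wrong"))
    );
}
```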
I know, I know - that's a lot! But thanks to pattern matching, it's not so bad.
We can make a much better error printer in just a few lines. We'll add it directly in delf, so that we don't have to use nom types in elk:
Rust code
// in `delf/src/lib.rs`
impl File {
    pub fn parse_or_print_error(i: parse::Input) -> Option<Self> {
        match Self::parse(i) {
            Ok((_, file)) => Some(file),
            Err(nom::Err::Failure(err)) | Err(nom::Err::Error(err)) => {
                eprintln!("Parsing failed:");
                for (input, err) in err.errors {
                    eprintln!("{:?} at:", err);
                    eprintln!("{:?}", HexDump(input));
                }
                None
            }
            Err(_) => panic!("unexpected nom error"),
        }
    }
}
Now we can use it from elk:
Rust code
// in `elk/src/main.rs`
fn main() -> Result<(), Box<dyn Error>> {
    // omitted: grabbing first argument, reading the file into `input`

    let file = match delf::File::parse_or_print_error(&input[..]) {
        Some(f) => f,
        None => std::process::exit(1),
    };
    println!("{:#?}", file);

    Ok(())
}
Shell session
$ cargo b -q
$ ./target/debug/elk /usr/lib32/libc.so
Parsing failed:
Nom(Tag) at:
2f 2a 20 47 4e 55 20 6c 64 20 73 63 72 69 70 74 0a 20 20 20
Context("Magic") at:
2f 2a 20 47 4e 55 20 6c 64 20 73 63 72 69 70 74 0a 20 20 20
Wonderful! Well. It could be improved a lot - for now we'll just add
the position of each error in the file, as a byte offset:
Rust code
// in `delf/src/lib.rs`
impl File {
    pub fn parse_or_print_error(i: parse::Input) -> Option<Self> {
        match Self::parse(i) {
            Ok((_, file)) => Some(file),
            Err(nom::Err::Failure(err)) | Err(nom::Err::Error(err)) => {
                eprintln!("Parsing failed:");
                for (input, err) in err.errors {
                    // new!
                    use nom::Offset;
                    let offset = i.offset(input);
                    eprintln!("{:?} at position {}:", err, offset);
                    eprintln!("{:>08x}: {:?}", offset, HexDump(input));
                }
                None
            }
            Err(_) => panic!("unexpected nom error"),
        }
    }
}
And here's the result:
Shell session
$ cargo b -q && ./target/debug/elk /usr/lib32/libz.so
Parsing failed:
Nom(Tag) at position 4:
00000004: 01 01 01 00 00 00 00 00 00 00 00 00 03 00 03 00 01 00 00 00
Context("Class") at position 4:
00000004: 01 01 01 00 00 00 00 00 00 00 00 00 03 00 03 00 01 00 00 00
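How does an offset fall out of two slices? Since every error carries a sub-slice of the original input, the position is just the distance between the two start pointers. Here's a minimal sketch of that idea in plain Rust; offset_of is a hypothetical helper written for this example, and I'm assuming (not quoting) that nom's Offset trait does something equivalent for byte slices.

```rust
// Byte offset of `sub` inside `whole`. This only makes sense if `sub`
// really is a sub-slice of `whole`; the caller has to guarantee that.
fn offset_of(whole: &[u8], sub: &[u8]) -> usize {
    sub.as_ptr() as usize - whole.as_ptr() as usize
}

fn main() {
    let input: [u8; 8] = [0x7f, 0x45, 0x4c, 0x46, 0x01, 0x01, 0x01, 0x00];

    // Pretend a parser consumed the 4-byte magic, then failed on the rest
    let rest = &input[4..];
    let offset = offset_of(&input, rest);
    assert_eq!(offset, 4);

    // Same layout as our error printer: zero-padded offset, then the bytes
    println!("{:08x}: {:02x?}", offset, rest);
}
```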
We could get much, much nicer output, but let's not get distracted - there
is so much to do!
If you want to read more about error handling (and pretty printing!) with nom, read Making our own ping: Part 9.
Let's not waste any more time and move forward with our parser. One thing I don't like about our current parser is that it panics if it encounters, say, a value for Machine it doesn't support.
We're going to allow ourselves to panic in elk - the binary crate, our CLI tool - but we'll avoid panicking in delf, the library crate. Luckily, we've already learned how to do that.
In our case, we want to implement parse on the Machine type. It should return a parse::Result, like any other parser, and if we find a variant we don't support, we shall return one of the built-in nom error kinds, say... Alt. Because what are enums, if not a miserable little pile of alternatives?
Rust code
// in `delf/src/lib.rs`
impl Machine {
    pub fn parse(i: parse::Input) -> parse::Result<Self> {
        let original_i = i;
        use nom::{
            error::{ErrorKind, ParseError, VerboseError},
            number::complete::le_u16,
            Err,
        };

        let (i, x) = le_u16(i)?;
        // `try_from` returns a `Result` as of derive-try-from-primitive v1.0.0
        match Self::try_from(x) {
            Ok(res) => Ok((i, res)),
            Err(_) => Err(Err::Failure(VerboseError::from_error_kind(
                original_i,
                ErrorKind::Alt,
            ))),
        }
    }
}
Okay. So.
That's a little bit on the verbose side. Also, there's no context! We won't know
which enum failed to parse. I'm sure we can do a bit better than that.
We've used the map combinator before, but it won't work here - map works when all outputs of the previous parser are valid, whereas in our case, try_from can fail.
Luckily, there's another combinator, map_res, that fits our needs. It allows us to return a Result<O2, E2> instead - where O2 is the output of our mapping function, and E2 is the error type our mapping function can return.
Rust code
// in `delf/src/lib.rs`
impl Machine {
    pub fn parse(i: parse::Input) -> parse::Result<Self> {
        use nom::{combinator::map_res, error::ErrorKind, number::complete::le_u16};

        map_res(le_u16, |x| match Self::try_from(x) {
            Ok(x) => Ok(x),
            Err(_) => Err(ErrorKind::Alt),
        })(i)
    }
}
Better! Now nom takes care of making a "verbose error", and picks an Err variant for us. It even annotates with the location, since map_res has a reference to the input before le_u16 even runs.
But our code isn't ideal right now - try_from returns a Result<T, E>, and we're using pattern matching to turn it into another Result<T, E>. And as much as I like pattern matching, there's a better tool for the job: map_err.
Let's look at its prototype (keep in mind it's implemented on Result<T, E>):
Rust code
// rust standard library
pub fn map_err<F, O>(self, op: O) -> Result<T, F>
where
    O: FnOnce(E) -> F
So, if the Result is the Err variant, map_err will call the function we provide, op, to change it to another error type - from E to F. An Ok value passes through untouched.
This means we can just write:
Rust code
// in `delf/src/lib.rs`
impl Machine {
    pub fn parse(i: parse::Input) -> parse::Result<Self> {
        use nom::{combinator::map_res, error::ErrorKind, number::complete::le_u16};

        map_res(le_u16, |x| Self::try_from(x).map_err(|_| ErrorKind::Alt))(i)
    }
}
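If map_err is new to you, here it is in isolation, away from nom. This sketch uses the standard library's u8::try_from (which fails for values that don't fit in a byte) and swaps its error for a made-up NarrowError type; both NarrowError and narrow are names invented for this example.

```rust
use std::convert::TryFrom;

#[derive(Debug, PartialEq)]
enum NarrowError {
    TooBig,
}

// `u8::try_from` fails with std's `TryFromIntError`; `map_err` swaps that
// for our own error type and leaves `Ok` values untouched.
fn narrow(x: u16) -> Result<u8, NarrowError> {
    u8::try_from(x).map_err(|_| NarrowError::TooBig)
}

fn main() {
    assert_eq!(narrow(0x3e), Ok(0x3e));
    assert_eq!(narrow(0xf00d), Err(NarrowError::TooBig));
}
```

It's the same move as in Machine::parse: we don't care what the original error was, we only need it to have the type the caller expects.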
We can even add some context now. It's all just combinators!
Rust code
impl Machine {
    pub fn parse(i: parse::Input) -> parse::Result<Self> {
        use nom::{
            combinator::map_res,
            error::{context, ErrorKind},
            number::complete::le_u16,
        };

        context(
            "Machine",
            map_res(le_u16, |x| Self::try_from(x).map_err(|_| ErrorKind::Alt)),
        )(i)
    }
}
And even though this implementation is much more pleasant, I wouldn't want to write it by hand more than once, so.. let's make a macro!
We'll slap it in the parse module:
Rust code
// in `delf/src/parse.rs`

// omitted: definitions for `Input` and `Result`

#[macro_export]
macro_rules! impl_parse_for_enum {
    ($type: ident, $number_parser: ident) => {
        impl $type {
            pub fn parse(i: parse::Input) -> parse::Result<Self> {
                use nom::{
                    combinator::map_res,
                    error::{context, ErrorKind},
                    number::complete::$number_parser,
                };
                let parser = map_res($number_parser, |x| {
                    Self::try_from(x).map_err(|_| ErrorKind::Alt)
                });
                context(stringify!($type), parser)(i)
            }
        }
    };
}
And now we can implement parse for both Type and Machine:
Rust code
// in `delf/src/lib.rs`

// omitted: struct definitions for `Type` and `Machine`

impl_parse_for_enum!(Type, le_u16);
impl_parse_for_enum!(Machine, le_u16);
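The two tricks doing the heavy lifting in that macro are ident fragments (which let us splice a type name or a function name into the generated code) and stringify!, which turns tokens into a string literal at compile time. Here's a nom-free sketch of the same shape; impl_describe and describe are hypothetical names used only for illustration.

```rust
// Hypothetical mini-macro: same shape as `impl_parse_for_enum`, minus nom.
macro_rules! impl_describe {
    ($type:ident, $repr:ident) => {
        impl $type {
            pub fn describe() -> String {
                // stringify! turns each identifier into a string literal
                format!("{} (stored as {})", stringify!($type), stringify!($repr))
            }
        }
    };
}

pub struct Type;
pub struct Machine;

impl_describe!(Type, u16);
impl_describe!(Machine, u16);

fn main() {
    assert_eq!(Type::describe(), "Type (stored as u16)");
    assert_eq!(Machine::describe(), "Machine (stored as u16)");
}
```

That's why the context in our real macro needs no extra argument: the name of the enum is already there, we just have to stringify it.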
And finally use those in our main parser:
Rust code
impl File {
    const MAGIC: &'static [u8] = &[0x7f, 0x45, 0x4c, 0x46];

    pub fn parse(i: parse::Input) -> parse::Result<Self> {
        // note: you can remove `map` and `le_u16` imports

        // omitted: rest of parser

        // before:
        let (i, (r#type, machine)) = tuple((
            context("Type", map(le_u16, |x| Type::try_from(x).unwrap())),
            context("Machine", map(le_u16, |x| Machine::try_from(x).unwrap())),
        ))(i)?;

        // after:
        let (i, (r#type, machine)) = tuple((Type::parse, Machine::parse))(i)?;

        let res = Self { machine, r#type };
        Ok((i, res))
    }
}
Phew. I feel lighter already. Nothing like a good abstraction to clear the mind.
Does everything still work?
Shell session
$ cargo b -q && ./target/debug/elk /bin/true
File {
type: Dyn,
machine: X86_64,
}
It does! Good, good.
I guess it's time to parse the rest of the ELF header.
Next up is the entry point - in ELF64, it's.. a 64-bit integer. Well, it's more of a memory address. We'll want to print that in hexadecimal. So let's make a wrapper type!
We might end up doing some arithmetic with those addresses, so we'll quickly grab a crate to derive more traits, derive_more:
Shell session
$ cd delf/
$ cargo add derive_more@0.99
Adding derive_more v0.99.9 to dependencies
Rust code
// in `delf/src/lib.rs`
use derive_more::*;

// "Add" and "Sub" are in `derive_more`
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Add, Sub)]
pub struct Addr(pub u64);

impl fmt::Debug for Addr {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{:08x}", self.0)
    }
}

impl fmt::Display for Addr {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        fmt::Debug::fmt(self, f)
    }
}

// This will come in handy when serializing
impl Into<u64> for Addr {
    fn into(self) -> u64 {
        self.0
    }
}

// This will come in handy when indexing / sub-slicing slices
impl Into<usize> for Addr {
    fn into(self) -> usize {
        self.0 as usize
    }
}

// This will come in handy when parsing
impl From<u64> for Addr {
    fn from(x: u64) -> Self {
        Self(x)
    }
}

impl Addr {
    pub fn parse(i: parse::Input) -> parse::Result<Self> {
        use nom::{combinator::map, number::complete::le_u64};
        map(le_u64, From::from)(i)
    }
}
Well, that was a mouthful. But it should be smooth sailing from now on!
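To get a feel for what the wrapper gives us, here's a cut-down, standalone version. So that it compiles without any dependency, Add and Sub are written by hand here; treat them as stand-ins for what derive_more generates, not as its actual output. The addresses are just example values.

```rust
use std::fmt;
use std::ops::{Add, Sub};

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Addr(pub u64);

impl fmt::Debug for Addr {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // at least 8 hex digits, zero-padded
        write!(f, "{:08x}", self.0)
    }
}

// Hand-written stand-ins for the derived `Add` and `Sub`
impl Add for Addr {
    type Output = Addr;
    fn add(self, rhs: Addr) -> Addr {
        Addr(self.0 + rhs.0)
    }
}

impl Sub for Addr {
    type Output = Addr;
    fn sub(self, rhs: Addr) -> Addr {
        Addr(self.0 - rhs.0)
    }
}

fn main() {
    let base = Addr(0x400000);
    let entry = base + Addr(0x1000);

    // Debug output is hexadecimal, without a `0x` prefix
    assert_eq!(format!("{:?}", entry), "00401000");
    // Arithmetic and comparisons work on the wrapper directly
    assert_eq!(entry - base, Addr(0x1000));
    assert!(entry > base);
}
```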
We can add it to our File struct, and our parser:
Rust code
// in `delf/src/lib.rs`
#[derive(Debug)]
pub struct File {
    pub r#type: Type,
    pub machine: Machine,
    // new!
    pub entry_point: Addr,
}

impl File {
    pub fn parse(i: parse::Input) -> parse::Result<Self> {
        // omitted: most of parser

        use nom::{combinator::verify, number::complete::le_u32};

        let (i, (r#type, machine)) = tuple((Type::parse, Machine::parse))(i)?;
        // this 32-bit integer should always be set to 1 in the current
        // version of ELF, see the diagram. We don't *have* to check it,
        // but it's so easy to, let's anyway!
        let (i, _) = context("Version (bis)", verify(le_u32, |&x| x == 1))(i)?;
        let (i, entry_point) = Addr::parse(i)?;

        let res = Self {
            machine,
            r#type,
            entry_point,
        };
        Ok((i, res))
    }
}
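For comparison, here is what those four reads boil down to without nom: fixed offsets and little-endian integers. This is a sketch that assumes the standard little-endian ELF64 layout (16 identification bytes, then a u16 type, a u16 machine, a u32 version and a u64 entry point); read_fields is a hypothetical helper, and the header bytes are made up for the example.

```rust
use std::convert::TryInto;

// Read the fields that follow the 16-byte identification block of a
// little-endian ELF64 header. Returns None if the slice is too short.
fn read_fields(header: &[u8]) -> Option<(u16, u16, u32, u64)> {
    let r#type = u16::from_le_bytes(header.get(16..18)?.try_into().ok()?);
    let machine = u16::from_le_bytes(header.get(18..20)?.try_into().ok()?);
    let version = u32::from_le_bytes(header.get(20..24)?.try_into().ok()?);
    let entry = u64::from_le_bytes(header.get(24..32)?.try_into().ok()?);
    Some((r#type, machine, version, entry))
}

fn main() {
    // A made-up header: 16 identification bytes (zeroed here), then our fields
    let mut header = vec![0u8; 16];
    header.extend_from_slice(&[0x02, 0x00]); // type: Exec
    header.extend_from_slice(&[0x3e, 0x00]); // machine: X86_64
    header.extend_from_slice(&[0x01, 0x00, 0x00, 0x00]); // version: 1
    header.extend_from_slice(&[0x00, 0x10, 0x40, 0x00, 0, 0, 0, 0]); // entry point

    assert_eq!(read_fields(&header), Some((2, 0x3e, 1, 0x401000)));
    // A truncated header is rejected instead of panicking
    assert_eq!(read_fields(&header[..20]), None);
}
```

The nom version does the same work, but it also tracks where we are in the input and gives us error context for free.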
And then give elk a go! We don't need to change anything in it - it refers to delf by path, and we've updated delf's File to have a new field, so we just need to build and run it again:
Shell session
$ cargo b -q && ./target/debug/elk /bin/true
File {
type: Dyn,
machine: X86_64,
entry_point: 00002130,
}
Nice! But what does entry_point mean anyway?
Our diagram said it's the "address at which program execution starts".
For example, if we run it on our sample assembly program from above:
Shell session
$ cargo b -q && ./target/debug/elk samples/hello
File {
type: Exec,
machine: X86_64,
entry_point: 00401000,
}
How can we verify our assumption? Well, we can debug our program!
When researching this article series, I went hunting for a good GDB GUI - and I'm sorry to report that the state of GDB graphical user interfaces is... not great. I mean, Nemiver still runs, but I didn't have to look very far to find several annoying bugs.
So, naturally, I went hunting on https://crates.io, and found... ugdb! It's a TUI (text user interface), but it's nicer than the built-in GDB TUI, with syntax highlighting and a cozy default pane layout. I like it a lot!
To install it, simply run:
Shell session
$ # note: you'll need cmake. use --force if updating
$ cargo install ugdb
We can then load up samples/hello in ugdb simply with:
Shell session
$ ugdb ./samples/hello
And be greeted with this:
Now, if we just use the start GDB command, our program will run entirely, and we'll see its output in the right pane:
But that's not very interesting. GDB will not break anywhere in our program by default, because our program doesn't crash - it just makes two syscalls: write and exit.
To quit GDB (and similarly, ugdb), use the quit command (or q, for short). If it asks for a confirmation, reply with y (and press enter).
We can't set breakpoints on syscalls... but we can set catchpoints!
(gdb) catch syscall
Catchpoint 1 (any syscall)
(gdb) start
Nice! GDB paused program execution right before the write syscall was executed.
We can already see in this screenshot that the beginning of _start - our entry point - is at 0x401000. Everything checks out so far!
Another way we could've checked this is to set a breakpoint at address 0x401000, like so:
(gdb) break *0x401000
Breakpoint 1 at 0x401000
(gdb) start
While in ugdb, we can use the stepi command (or si for short) to advance to the next instruction.
Here's a quick video:
There's one thing that bothers me though - the disassembly shown in ugdb doesn't really look like our source program. We wrote mov rdi, 1, not mov $0x1, %edi.
The explanation is simple: GDB is showing us AT&T assembly syntax, whereas we got used to Intel syntax. Luckily, we can adjust that by creating a ~/.gdbinit file:
# in ~/.gdbinit
set disassembly-flavor intel
Now GDB (and ugdb) shows us disassembly in Intel syntax!
Let's try doing all that on /bin/true - the entry point for it was 0x2130, so,
(gdb) break *0x2130
(gdb) start
Starting program: /usr/bin/true
Warning:
Cannot insert breakpoint 1.
Cannot access memory at address 0x2130
Oh. That's unfortunate. What address does /bin/true start at, then? And how can we break at the entry point, if it's not 0x2130?
Recent versions of gdb come with the starti command, which does exactly that. We can use it to inspect at which address /bin/true actually starts:
Oh. 0x7ffff7fd4100. Okay. Well, that's 140 terabytes. I sorta doubt /bin/true needs that much memory to proceed, but uh...
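In case you don't trust my mental arithmetic, here's the conversion spelled out, reading the address as a byte count. The helper names are made up for this snippet.

```rust
// Whole terabytes (10^12 bytes) in a byte count
fn as_terabytes(bytes: u64) -> u64 {
    bytes / 1_000_000_000_000
}

// Whole tebibytes (2^40 bytes) in a byte count
fn as_tebibytes(bytes: u64) -> u64 {
    bytes >> 40
}

fn main() {
    let addr: u64 = 0x7ffff7fd4100;
    assert_eq!(as_terabytes(addr), 140);
    // just shy of 128 TiB, i.e. 2^47 bytes
    assert_eq!(as_tebibytes(addr), 127);
    println!("{} TB, or {} TiB and change", as_terabytes(addr), as_tebibytes(addr));
}
```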
Okay, forget /bin/true for a minute - let's make a C program that prints the address of main. That should be the entry point, right?
C code
// in `elk/samples/entry_point.c`
#include <stdio.h>

int main() {
    printf("main is at %p\n", &main);
}
Shell session
$ # in `elk/samples/`
$ gcc entry_point.c -o entry_point
$ ../target/debug/elk ./entry_point
File {
type: Dyn,
machine: X86_64,
entry_point: 00001040,
}
$ ./entry_point
main is at 0x562601e0f139
$ ./entry_point
main is at 0x56311bd0c139
$ ./entry_point
main is at 0x5588a0a39139
$ ./entry_point
main is at 0x559c3e3df139
$ ./entry_point
main is at 0x557578dc4139
$ ./entry_point
main is at 0x55c80e189139
Okay.
Uh.
Let's try gdb?
(gdb) starti
Starting program /home/amos/ftl/elk/samples/entry_point
Program stopped.
0x0000000000401ac0 in _start ()
(gdb) c
Continuing.
main is at 0x401be5
[Inferior 1 (process 7394) exited normally]
(gdb) starti
Starting program: /home/amos/ftl/elk/samples/entry_point
Program stopped.
0x0000000000401ac0 in _start ()
(gdb) c
Continuing.
main is at 0x401be5
[Inferior 1 (process 7397) exited normally]
(gdb) starti
Starting program: /home/amos/ftl/elk/samples/entry_point
Program stopped.
0x0000000000401ac0 in _start ()
(gdb) c
Continuing.
main is at 0x401be5
[Inferior 1 (process 7398) exited normally]
Well.
That doesn't answer any of my questions.
If anything, I have more questions now.
Why is the entry point stored in an ELF file sometimes the same one we see in GDB, and sometimes not?
Why is the entry point sometimes a valid offset into the ELF file, and sometimes not?
Why is the entry point always the same in GDB, but not when running the executable directly?
Clearly, we have a lot more detective work to do...
The ELF binary file format is old, but still in use today! It contains information about the Operating System's Application Binary Interface (OS ABI) and the processor architecture (i.e. the Machine), among other things.
It is easily parsed with nom. For Rust enums, we can use the as operator to go from an enum variant to its integer representation (u8, u16, u32), and the derive-try-from-primitive crate to go the other way.
The ELF header also contains the "entry point" for a program - where it should start executing. Although, when we run a program, we sometimes find a different address being executed.
We can use xxd to visualize the contents of a binary file - it's a "hexdump" (hexadecimal.. printer) tool. It accepts options such as: offset, length, endianness, and how many bytes to group together when printing. It does not understand hexadecimal offsets, but we can use shell arithmetic - $(()) - to do that.
GDB, the GNU debugger, will let us start a program and step through it, even if it doesn't have debug or source information - in that case, it will show us disassembly of the program. We can even ask GDB to use Intel syntax when showing assembly. ugdb is a nice text UI frontend to GDB.