What's in the box?

Contents

- A practical and very innocent example
- The great appearing act
- What were we doing again?
- What the heck is a `Box`?
- But let's get back to boxes
- Now for some more Rust
- One `Sized` fits all
- The many ways we can return a Result
- Error propagation and the `?` sigil
- How does `Box` unify types?
- How do we unify types without forcing a heap allocation?
Here's a sentence I find myself saying several times a week:
...or we could just box it.
There are two remarkable things about this sentence.
The first is that the advice is very rarely heeded; instead, whoever I just said it to disappears for two days, emerging victorious, basking in the knowledge that, YES, the compiler could inline that, if it wanted to.
And the second is that, without a lot of context, this sentence is utter nonsense if you don't have a working knowledge of Rust. As a Java developer, you may be wondering if we're trying to turn numbers into objects (we are not). In fact, even as a Rust developer, you may have just accepted that boxing is just a fact of life.
It's just a thing we have to do sometimes, so the compiler stops being mad at us, and things just suddenly start working. That's not necessarily a bad thing. That's just how good compiler diagnostics are, that it can just tell you "hold on there friend, I really think you want to box it", and you can copy and paste the solution, and the puzzle is cracked.
But! Just because we can get by for a very long time without knowing what it means, doesn't mean I can resist the sweet sweet temptation of explaining in excruciating detail what it actually means. And so, that's exactly what we're going to do in this article.
Before we do that, though, let's look at a simple example where we might be enjoined by a well-intentioned colleague to, as it were, "just box it".
A practical and very innocent example
Whenever `cargo new` is invoked, it generates a simple "hello world" application, that looks like this:

```rust
fn main() {
    println!("Hello, world!");
}
```

It is pure, and innocent, and devoid of things that can fail, which is great.

```shell
$ cargo run
   Compiling whatbox v0.1.0 (/home/amos/ftl/whatbox)
    Finished dev [unoptimized + debuginfo] target(s) in 0.47s
     Running `target/debug/whatbox`
Hello, world!
```
But sometimes we want to do things that can fail!
Like reading a file, for example:

```rust
fn main() {
    println!("{}", std::fs::read_to_string("/etc/issue").unwrap())
}
```

```shell
$ cargo run --quiet
Arch Linux \r (\l)
```

`read_to_string` can fail! And that's why it returns a `Result<String, E>` and not just a `String`.

And that's also why we need to call `.unwrap()` on it, to go from `Result<String, E>` to either:

- a panic

```shell
$ cargo run --quiet
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/main.rs:2:59
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

- or a `String` (which we've already seen)
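Panicking isn't the only way out of a `Result`, of course. As a quick sketch of my own (not part of the original samples), we could also handle both variants explicitly with `match`:

```rust
fn main() {
    // Instead of .unwrap(), handle both outcomes ourselves:
    match std::fs::read_to_string("/etc/issue") {
        Ok(contents) => println!("{}", contents),
        Err(e) => eprintln!("could not read /etc/issue: {}", e),
    }
}
```

This way, a missing file prints an error message instead of panicking.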
Okay, good!
But let's say we want to read a string inside a function. Our own function.
Something like that:

```rust
fn main() {
    println!("{}", read_issue())
}

fn read_issue() -> String {
    std::fs::read_to_string("/etc/issue").unwrap()
}
```
Well, here, everything works:
```shell
$ cargo run --quiet
Arch Linux \r (\l)
```
But that's not really the code we want. See, the `read_issue` function feels like "library code". Right now, it's in our application, but I could see myself splitting that function into its own crate, maybe a crate named `linux-info` or something, because it could be useful to other applications.

And so, even though it's in the same crate as the `main` function, I don't feel comfortable causing a panic in `read_issue`, the way I felt comfortable causing a panic at the disco in `main`.

Instead, I think I want `read_issue` to return a `Result`, too. Because `Result<T, E>` is an enum that can represent two things: the operation has succeeded (and we get a `T`), or it failed (and we get an `E`).
```rust
enum Result<T, E> {
    Ok(T),
    Err(E),
}
```
And we know that when the operation succeeds, we get a `String`, so we know what to pick for `T`. But the question is: what do we pick for `E`?

```rust
fn main() {
    println!("{}", read_issue().unwrap())
}

// what is `E` supposed to be? 👇
fn read_issue() -> Result<String, E> {
    std::fs::read_to_string("/etc/issue")
}
```
And that problem, that specific problem, is not something we really have to worry about in some other languages, like for instance... ECMAScript! I mean, JavaScript!
Because in JavaScript, if something goes wrong, we just throw!
```javascript
import { readFileSync } from "fs";

function main() {
  let issue = readIssue();
  console.log(`${issue}`);
}

function readIssue() {
  return readFileSync("/etc/i-do-not-exist");
}

main();
```

```shell
$ node js/index.mjs
node:fs:505
  handleErrorFromBinding(ctx);
  ^

Error: ENOENT: no such file or directory, open '/etc/i-do-not-exist'
    at Object.openSync (node:fs:505:3)
    at readFileSync (node:fs:401:35)
    at readIssue (file:///home/amos/ftl/whatbox/js/index.mjs:9:5)
    at main (file:///home/amos/ftl/whatbox/js/index.mjs:4:17)
    at file:///home/amos/ftl/whatbox/js/index.mjs:12:1
    at ModuleJob.run (node:internal/modules/esm/module_job:154:23)
    at async Loader.import (node:internal/modules/esm/loader:177:24)
    at async Object.loadESM (node:internal/process/esm_loader:68:5) {
  errno: -2,
  syscall: 'open',
  code: 'ENOENT',
  path: '/etc/i-do-not-exist'
}
```
And we don't have to worry whether `readIssue` can or cannot throw when we call it:

```javascript
function main() {
  // 👇
  let issue = readIssue();
  console.log(`${issue}`);
}
```
Well, maybe we should! Maybe we should wrap it in a try-catch, just so we can recover from any exceptions thrown. But we don't have to. Our program follows the happy path happily.
In Go, there are no exceptions, but there is usually an indication, in a function's signature, that it can fail.

```go
package main

import (
	"log"
	"os"
)

func main() {
	issue := readIssue()
	log.Printf("issue = %v", issue)
}

func readIssue() string {
	bs, _ := os.ReadFile("")
	return string(bs)
}
```
Here, `readIssue` cannot fail! It only returns a `string`.
But here, it can fail:
```go
package main

import (
	"log"
	"os"
)

func main() {
	// we get two values out of readIssue, including `err`
	issue, err := readIssue()
	// ...which we should check for nil-ness
	if err != nil {
		// ...and handle
		log.Fatalf("fatal error: %+v", err)
	}
	log.Printf("issue = %v", issue)
}

func readIssue() (string, error) {
	bs, err := os.ReadFile("")
	// same here, `ReadFile` does a multi-valued return, so we need
	// to check `err` first:
	if err != nil {
		return "", err
	}
	// and only here do we know reading the file actually succeeded:
	return string(bs), nil
}
```
And here, since we do our error handling properly, the output we get is:
```shell
$ go run go/main.go
2021/04/17 20:47:37 fatal error: open : no such file or directory
exit status 1
```
However, note that it does not tell us where in the code the error occurred, whereas the JavaScript/Node.js version did.
There's a solution to that, but by default, out of the box, Go errors do not capture stack traces.
And then there's Rust, the strictest of the three: it forces us to declare that a function can fail, forces us to handle any error that may have occurred in a function, and also forces us to describe what the possible error values are.
And that's where it can get confusing.
You see, in JavaScript, you can throw anything.
```javascript
function main() {
  let issue = readIssue();
  console.log(`${issue}`);
}

function readIssue() {
  throw "woops";
}

main();
```

```shell
$ node js/index.mjs
node:internal/process/esm_loader:74
    internalBinding('errors').triggerUncaughtException(
                              ^
woops
(Use `node --trace-uncaught ...` to show where the exception was thrown)
```
This is not a good idea. Mostly, because then we don't get a stack trace.
No, not even with `--trace-uncaught`:

```shell
$ node --trace-uncaught js/index.mjs
node:internal/process/esm_loader:74
    internalBinding('errors').triggerUncaughtException(
                              ^
woops
Thrown at:
    at loadESM (node:internal/process/esm_loader:74:31)
```
So please, never ever do that.
Instead, throw an `Error` object, like so:

```javascript
function main() {
  let issue = readIssue();
  console.log(`${issue}`);
}

function readIssue() {
  throw new Error("woops");
}

main();
```

```shell
$ node js/index.mjs
file:///home/amos/ftl/whatbox/js/index.mjs:7
  throw new Error("woops");
  ^

Error: woops
    at readIssue (file:///home/amos/ftl/whatbox/js/index.mjs:7:11)
    at main (file:///home/amos/ftl/whatbox/js/index.mjs:2:17)
    at file:///home/amos/ftl/whatbox/js/index.mjs:10:1
    at ModuleJob.run (node:internal/modules/esm/module_job:154:23)
    at async Loader.import (node:internal/modules/esm/loader:177:24)
    at async Object.loadESM (node:internal/process/esm_loader:68:5)
```
As for Go, well. You can't just say you're going to return an `error`, and then just return a `string`. That's good.

```go
func readIssue() (string, error) {
	return "", "woops"
}
```

```shell
$ go run go/main.go
# command-line-arguments
go/main.go:17:13: cannot use "woops" (type string) as type error in return argument:
	string does not implement error (missing Error method)
```
Whatever you return has to be of type `error`, and there is a shorthand for that:

```go
func readIssue() (string, error) {
	return "", errors.New("woops")
}
```

Which is just this:

```go
// New returns an error that formats as the given text.
// Each call to New returns a distinct error value even if the text is identical.
func New(text string) error {
	return &errorString{text}
}
```
Where `errorString` is simply a struct:

```go
// errorString is a trivial implementation of error.
type errorString struct {
	s string
}
```
That implements the `error` interface. All the interface asks for is that there is an `Error()` method that returns a `string`:

```go
func (e *errorString) Error() string {
	return e.s
}
```
And so our sample program now shows this:
```shell
$ go run go/main.go
2021/04/17 20:59:37 fatal error: woops
exit status 1
```
Which is not to say that error handling in Go is a walk in the park.
This first bit has been pointed out in almost every article that has even the slightest amount of feelings about Go: it's just way too easy to ignore, or "forget to handle" Go errors:
```go
func readIssue() (string, error) {
	bs, err := os.ReadFile("/etc/issue")
	err = os.WriteFile("/tmp/issue-copy", bs, 0o644)
	if err != nil {
		return "", err
	}
	return string(bs), nil
}
```
Woops! No warnings, no nothing. If we fail to read the file, that error is gone forever. The issue here is of course that Go returns "multiple things": both the "success value" and the "error value", and it's on you to pinky swear not to touch the success value, if you haven't checked the error value first.
And that problem doesn't exist in a language with sum types — a Rust `Result` is either `Result::Ok(T)`, or `Result::Err(E)`, never both.
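Here's a minimal sketch of my own (not one of the article's samples) showing what that buys us: the compiler simply won't let us touch the success value without going through the `Result` first.

```rust
fn read_issue() -> Result<String, std::io::Error> {
    std::fs::read_to_string("/etc/issue")
}

fn main() {
    // There's no way to "forget" the error: the String is locked
    // inside the Ok variant until we decide what to do about the Err.
    match read_issue() {
        Ok(contents) => println!("{}", contents),
        Err(e) => eprintln!("failed to read: {}", e),
    }
}
```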
But everyone knows about that one. The other one is a lot more fun.
If we make our own error type:

```go
type naughtyError struct{}

func (ne *naughtyError) Error() string {
	return "oh no"
}
```
Then we can return it as an `error`. Because `error` is an interface, and `*naughtyError` has an `Error` method that returns a string, everything fits together, boom, composition, alright!

```go
func readIssue() (string, error) {
	return "", &naughtyError{}
}
```

```shell
$ go run go/main.go
2021/04/17 21:06:28 fatal error: oh no
exit status 1
```
But if we accidentally return a value of type `*naughtyError` that just happens to be `nil`, well...

```go
package main

import (
	"log"
)

func readIssue() (string, error) {
	var err *naughtyError
	log.Printf("(in readIssue) is err nil? %v", err == nil)
	return "", err
}

func main() {
	issue, err := readIssue()
	log.Printf("(in main) is err nil? %v", err == nil)
	if err != nil {
		log.Fatalf("fatal error: %+v", err)
	}
	log.Printf("issue = %v", issue)
}

type naughtyError struct{}

func (ne *naughtyError) Error() string {
	return "oh no"
}
```

```shell
$ go run go/main.go
2021/04/17 21:08:08 (in readIssue) is err nil? true
2021/04/17 21:08:08 (in main) is err nil? false
2021/04/17 21:08:08 fatal error: oh no
exit status 1
```
...then bad things happen.
And this is really fun to me, but it is really bad for Go.
The first issue, "forgetting to check for nil", is easy to understand. We told you where the error was. Just don't forget to check it. It's easy to fit into one's mental model of Go, which is advertised as really really simple.
The second one is a lot worse, because it betrays a leaky abstraction.
You see... there's some magic afoot.
The great appearing act
We have two `err` values in our last, naughty sample program. One of them compares equal to `nil`, and the other does not.

But the differences don't stop there:

```go
package main

import (
	"log"
	"unsafe"
)

func readIssue() (string, error) {
	var err *naughtyError
	log.Printf("(in readIssue) nil? %v, size = %v", err == nil, unsafe.Sizeof(err))
	return "", err
}

func main() {
	issue, err := readIssue()
	log.Printf("(in main) nil? %v, size = %v", err == nil, unsafe.Sizeof(err))
	if err != nil {
		log.Fatalf("fatal error: %+v", err)
	}
	log.Printf("issue = %v", issue)
}

type naughtyError struct{}

func (ne *naughtyError) Error() string {
	return "oh no"
}
```
Why is `Sizeof` part of the `unsafe` package? Well, that's a very good question. The package docs say:

> Package unsafe contains operations that step around the type safety of Go programs.
>
> Packages that import unsafe may be non-portable and are not protected by the Go 1 compatibility guidelines.
...but what we're doing here is completely harmless. The important bit, as I understand it, is that as a Go developer, you're not supposed to care.
You're not supposed to look at these things. Go is simple! Byte slices are strings! Go has no pointer arithmetic! Who cares how large a type is!
Until you do care, and then, well, you're on your own. And "using unsafe" is exactly what "being on your own" is. But it's okay. We're all on our own together.
The program above prints the following:
```shell
$ go run go/main.go
2021/04/17 21:19:12 (in readIssue) nil? true, size = 8
2021/04/17 21:19:12 (in main) nil? false, size = 16
2021/04/17 21:19:12 fatal error: oh no
exit status 1
```
Which is iiiiinteresting.
This is the kind of example that, given enough time, one could figure out the solution all on their own. But when coming face to face with it, and when it has been a while, it is... puzzling.
The first line makes a ton of sense.
We declared a pointer, like this:

```go
var err *naughtyError
```

The zero value of a pointer is `nil`, so it's equal to `nil`. And we're (well, I'm) on 64-bit Linux, so the size of a pointer is 64 bits, or 8 bytes.
Is a byte always 8 bits?
According to ISO/IEC 80000, yes.
If you're reading this from a machine whose byte isn't 8 bits, please, please send a picture.
The second line is a lot more surprising — not only does it not equal `nil`, but it's also twice as large.
We can shed some light on the whole thing by introducing yet another error type:
```go
package main

import (
	"log"
)

func main() {
	var err error

	err = (*naughtyError)(nil)
	log.Printf("%v", err)

	err = (*niceError)(nil)
	log.Printf("%v", err)
}

type naughtyError struct{}

func (ne *naughtyError) Error() string {
	return "oh no"
}

type niceError struct{}

func (ne *niceError) Error() string {
	return "ho ho ho!"
}
```

What a nice holiday-themed error. We have two `nil` values, and they both print different things!

```shell
$ go run go/main.go
2021/04/17 21:26:42 oh no
2021/04/17 21:26:42 ho ho ho!
```
Ah. AH! This is why it's bigger! This is why `error` is wider than `*naughtyError`!

Yes bear?

Because these values are both `nil`! But uhhh when acting as an interface value (for the `error` interface), they behave differently!

Yes!

And so the size of an `error` interface value is 16 bytes because... there's two pointers!

Precisely!

And the second pointer is... to the type!

Well, in Go, yes!

And it allows us to "downcast" it.

To what?

To "downcast" it, i.e. to go from the interface type, back to the concrete type:
```go
package main

import (
	"errors"
	"log"
)

func showType(err error) {
	// 👇 downcasting action happens here
	if _, ok := err.(*naughtyError); ok {
		log.Printf("got a *naughtyError")
	} else if _, ok := err.(*niceError); ok {
		log.Printf("got a *niceError")
	} else {
		log.Printf("got another kind of error")
	}
}

func main() {
	showType((*naughtyError)(nil))
	showType((*niceError)(nil))
	showType(errors.New(""))
}

type naughtyError struct{}

func (ne *naughtyError) Error() string {
	return "oh no"
}

type niceError struct{}

func (ne *niceError) Error() string {
	return "ho ho ho!"
}
```

```shell
$ go run go/main.go
2021/04/17 21:33:48 got a *naughtyError
2021/04/17 21:33:48 got a *niceError
2021/04/17 21:33:48 got another kind of error
```
Ah, so mystery solved! One pointer for the value, one pointer for the type: 8 bytes each, together, 16 bytes.
Case closed.
Right! Close enough.
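Incidentally, Rust plays the very same trick with trait objects. A quick sketch of my own (not one of the article's samples), using `std::mem::size_of` to peek at the widths:

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug)]
struct MyError;

impl fmt::Display for MyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "oh no")
    }
}

impl Error for MyError {}

fn main() {
    // a reference to a concrete type is a single pointer...
    println!("{}", std::mem::size_of::<&MyError>());
    // ...but a reference to a trait object carries a data pointer plus a
    // vtable pointer (which knows the concrete type): twice as wide.
    println!("{}", std::mem::size_of::<&dyn Error>());
}
```

On 64-bit targets, that prints 8 and 16: the same "one pointer for the value, one pointer for the type" layout as Go's interface values.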
And now, let's turn our attention back to Rust.
What were we doing again?
Ah, right! We were here:
```rust
fn main() {
    println!("{}", read_issue().unwrap())
}

fn read_issue() -> Result<String, E> {
    std::fs::read_to_string("/etc/issue")
}
```

And we had to pick an `E`.
Because, as we mentioned, Rust forces you to pick an "error type".
But... there is also a standard error type. Except in Rust, capitalization does not mean "private or public" (there's a keyword for that). Instead, all types are capitalized, by convention, so it's not `error`, it's `Error`.

More specifically, it's `std::error::Error`.
So, we can try to pick that:
```rust
// 👇 we import it here
use std::error::Error;

fn main() {
    println!("{}", read_issue().unwrap())
}

// and use it there 👇
fn read_issue() -> Result<String, Error> {
    std::fs::read_to_string("/etc/issue")
}
```
And, well...
```shell
$ cargo run --quiet
warning: trait objects without an explicit `dyn` are deprecated
 --> src/main.rs:7:35
  |
7 | fn read_issue() -> Result<String, Error> {
  |                                   ^^^^^ help: use `dyn`: `dyn Error`
  |
  = note: `#[warn(bare_trait_objects)]` on by default

(rest omitted)
```

Oh, no, a warning! It says to use the `dyn` keyword. Alright, who am I to object, let's use the `dyn` keyword.

```rust
// 👇
fn read_issue() -> Result<String, dyn Error> {
    std::fs::read_to_string("/etc/issue")
}
```
Let's try this again:
```shell
$ cargo run --quiet
error[E0277]: the size for values of type `(dyn std::error::Error + 'static)` cannot be known at compilation time
   --> src/main.rs:7:20
    |
7   | fn read_issue() -> Result<String, dyn Error> {
    |                    ^^^^^^^^^^^^^^^^^^^^^^^^^ doesn't have a size known at compile-time
    |
   ::: /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/result.rs:241:20
    |
241 | pub enum Result<T, E> {
    |                    - required by this bound in `std::result::Result`
    |
    = help: the trait `Sized` is not implemented for `(dyn std::error::Error + 'static)`

error: aborting due to previous error
```
And, especially coming from Go, this error is really puzzling.
Because this code feels more or less like a direct translation of that code:
```go
func readIssue() (string, error) {
	bs, err := os.ReadFile("/etc/issue")
	return string(bs), err
}
```
And that code "just works".
Well, the explanation is rather simple: it is not a direct translation.
A direct translation would look more like this:
```rust
use std::error::Error;

fn main() {
    println!("{}", read_issue().unwrap())
}

// 👇
fn read_issue() -> Result<String, Box<dyn Error>> {
    std::fs::read_to_string("/etc/issue")
}
```
Which, as you can see, works just f-
```shell
$ cargo run --quiet
error[E0308]: mismatched types
 --> src/main.rs:8:5
  |
7 | fn read_issue() -> Result<String, Box<dyn Error>> {
  |                    ------------------------------ expected `std::result::Result<String, Box<(dyn std::error::Error + 'static)>>` because of return type
8 |     std::fs::read_to_string("/etc/issue")
  |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected struct `Box`, found struct `std::io::Error`
  |
  = note: expected enum `std::result::Result<_, Box<(dyn std::error::Error + 'static)>>`
             found enum `std::result::Result<_, std::io::Error>`

error: aborting due to previous error
```
...okay, so it doesn't work. But we can make it work fairly easily:
```rust
use std::error::Error;

fn main() {
    println!("{}", read_issue().unwrap())
}

fn read_issue() -> Result<String, Box<dyn Error>> {
    Ok(std::fs::read_to_string("/etc/issue")?)
}
```

```shell
$ cargo run --quiet
Arch Linux \r (\l)
```
Theeeeeeere we go. Now we're even. This is the closest we'll get to the aforementioned Go code.
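What made the difference there? The `?` sigil. As a rough sketch (this captures the spirit of the desugaring, not the compiler's exact output), `?` early-returns on the error case, converting the error along the way with `From` — and the standard library provides a blanket conversion from any `E: Error` into a `Box<dyn Error>`:

```rust
use std::error::Error;

// roughly what `read_issue` with `?` expands to: on Err, convert the
// concrete std::io::Error into a Box<dyn Error> and return early.
fn read_issue_desugared() -> Result<String, Box<dyn Error>> {
    match std::fs::read_to_string("/etc/issue") {
        Ok(s) => Ok(s),
        Err(e) => Err(Box::from(e)),
    }
}

fn main() {
    match read_issue_desugared() {
        Ok(s) => println!("{}", s),
        Err(e) => eprintln!("error: {}", e),
    }
}
```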
But at that point, you may very well have several questions.
What the heck is a `Box`?
Well, for the time being, you can sort of think about it as a pointer.
But it's not, not really.
This is a pointer:
```rust
struct MyError {
    value: u32,
}

fn main() {
    let e = MyError { value: 32 };
    let e_ptr: *const MyError = &e;
    print_error(e_ptr);
}

fn print_error(e: *const MyError) {
    if e != std::ptr::null() {
        println!("MyError (value = {})", unsafe { (*e).value });
    }
}
```

```shell
$ cargo run --quiet
MyError (value = 32)
```

But as you may have noticed, dereferencing a pointer is unsafe:

```rust
fn print_error(e: *const MyError) {
    if e != std::ptr::null() {
        // 👇
        println!("MyError (value = {})", unsafe { (*e).value });
    }
}
```
Why is dereferencing a pointer `unsafe`? Well, because it might be null! Or it might point to an address that does not fall within an area that's meaningful for the currently running program, and that would cause a segmentation fault.
So, whenever we dereference a pointer, we're on our own.
Getting the size of something, though, is perfectly safe:
```rust
struct MyError {
    value: u32,
}

fn main() {
    let e = MyError { value: 32 };
    let e_ptr: *const MyError = &e;
    // 👇 no unsafe!
    dbg!(std::mem::size_of_val(&e_ptr));
    print_error(e_ptr);
}

fn print_error(e: *const MyError) {
    if e != std::ptr::null() {
        println!("MyError (value = {})", unsafe { (*e).value });
    }
}
```

```shell
$ cargo run --quiet
[src/main.rs:8] std::mem::size_of_val(&e_ptr) = 8
MyError (value = 32)
```
And, as expected, the size of a pointer is 8 bytes, because I'm still writing this from Linux 64-bit.
But: if constructing a pointer value is safe, dereferencing it (reading from the memory it points to, or writing to it) is not.
So we often don't use it at all in Rust.
Instead, we use references!
```rust
struct MyError {
    value: u32,
}

fn main() {
    let e = MyError { value: 32 };
    let e_ref: &MyError = &e;
    dbg!(std::mem::size_of_val(&e_ref));
    print_error(e_ref);
}

fn print_error(e: &MyError) {
    println!("MyError (value = {})", (*e).value);
}
```

Which are still 8 bytes:

```shell
$ cargo run --quiet
[src/main.rs:8] std::mem::size_of_val(&e_ref) = 8
MyError (value = 32)
```
...but they're also perfectly safe to dereference, because it is guaranteed that they point to valid memory: in safe code, it is impossible to construct an invalid reference, or to keep a reference to some value after that value has been freed.
In fact, it's so safe that we don't even need to use the `*` operator to dereference: we can just rely on "autoderef":

```rust
fn print_error(e: &MyError) {
    // star be gone! 👇
    println!("MyError (value = {})", e.value);
}
```
And that works just as well.
And now, a quick note about safety: you'll notice that I just said "in safe code, it is impossible to construct an invalid reference".
In unsafe code, it is very possible:
```rust
struct MyError {
    value: u32,
}

fn main() {
    let e: *const MyError = std::ptr::null();
    // ooooh no no no. crimes! 👇
    let e_ref: &MyError = unsafe { &*e };
    dbg!(std::mem::size_of_val(&e_ref));
    print_error(e_ref);
}

fn print_error(e: &MyError) {
    println!("MyError (value = {})", e.value);
}
```

And then BOOM:

```shell
$ cargo run --quiet
[src/main.rs:8] std::mem::size_of_val(&e_ref) = 8
[1]    17569 segmentation fault  cargo run --quiet
```
Segmentation fault.
But that's not news. That's not a big flaw in Rust's safety model.
That is Rust's safety model.
The idea is that, if all the unsafe code is sound, then all the safe code is safe, too.
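To make that concrete, here's a tiny sketch of my own (not from the samples above): a safe function whose body uses unsafe code, and which is sound because the function itself upholds the precondition the unsafe operation relies on.

```rust
// A safe wrapper around an unsafe operation. Callers can't misuse it:
// the bounds check lives inside the function.
fn first_byte(bytes: &[u8]) -> Option<u8> {
    if bytes.is_empty() {
        None
    } else {
        // SAFETY: we just checked that the slice is non-empty,
        // so index 0 is in bounds.
        Some(unsafe { *bytes.get_unchecked(0) })
    }
}

fn main() {
    assert_eq!(first_byte(b"hello"), Some(b'h'));
    assert_eq!(first_byte(b""), None);
}
```

As long as the `unsafe` block's precondition holds (and here, the `is_empty` check guarantees it), no safe caller can trigger undefined behavior through this function.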
And you have a lot less "unsafe" code than you have "safe" code, which makes it a lot easier to audit. It's also very visible, with explicit `unsafe` blocks, `unsafe` traits and `unsafe` functions, and so it's easy to statically determine where unsafe code is — it's not just "woops you imported the forbidden package".
Finally, there are tools like the Miri interpreter that help with unsafe code, just like there are sanitizers for C/C++, which do not have that safe/unsafe split.
But let's get back to boxes
So, we've seen two kinds of "pointers" in Rust so far: raw pointers, aka `*const T` (and its sibling, `*mut T`), and references (`&T` and `&mut T`).
We said we were going to ignore raw pointers, so let's focus on references.
In Go, when you get a pointer to an object, you can do anything with it. You can hold onto it as long as you want, you can shove it into a `map` — even if that object was originally going to be freed, you, as a function that receives a pointer to that object, can extend the lifetime of that object to be however long you need it to be.
This works because Go is garbage-collected, so, as long as there's at least one reference to an object, it's "live", and it's not going to be collected (or "freed").
As soon as there are zero references left to an object, then it qualifies for garbage collection. The garbage collector does not guarantee how soon an object will actually be freed, or if it will ever be freed. It just qualifies.
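By way of contrast (this example is mine, not part of the Go walkthrough that follows): Rust has no garbage collector at all. A `Box`'s heap allocation is freed deterministically, the moment the `Box` goes out of scope:

```rust
struct Noisy;

impl Drop for Noisy {
    fn drop(&mut self) {
        // runs exactly when the owning Box goes out of scope,
        // not "eventually, maybe" like a garbage collector
        println!("freed!");
    }
}

fn main() {
    {
        let b = Box::new(Noisy);
        println!("boxed at {:p}", b);
    } // 👈 `b` is dropped right here; the heap allocation is freed
    println!("after the scope");
}
```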
And it's not immediately obvious if we try to showcase this with code like that:
```go
package main

import (
	"log"
)

func main() {
	var slice []string
	addString(&slice)
	log.Printf("==== from main ====")
	for _, str := range slice {
		log.Printf("%v, %v", &str, str)
	}
}

func addString(slice *[]string) {
	var str = "hello"
	log.Printf("%v, %v", &str, str)
	*slice = append(*slice, str)
}
```

This should show the address of the string in both the `addString` function and in `main`, right? After all, I just said they were the same string, `main` just ends up keeping a reference to it.

But we get two different addresses:

```shell
$ go run ./go/main.go
2021/04/18 11:34:42 0xc00011e220, hello
2021/04/18 11:34:42 ==== from main ====
2021/04/18 11:34:42 0xc00011e250, hello
```
To really see what's going on, we need to peel away one more layer of Go magic, and cast our `string` to a `reflect.StringHeader`:

```go
package main

import (
	"log"
	"reflect"
	"unsafe"
)

func main() {
	var slice []string
	addString(&slice)
	log.Printf("==== from main ====")
	for _, str := range slice {
		log.Printf("%v, %v", &str, str)
		sh := (*reflect.StringHeader)(unsafe.Pointer(&str))
		log.Printf("%#v", sh)
	}
}

func addString(slice *[]string) {
	var str = "hello"
	log.Printf("%v, %v", &str, str)
	sh := (*reflect.StringHeader)(unsafe.Pointer(&str))
	log.Printf("%#v", sh)
	*slice = append(*slice, str)
}
```

```shell
$ go run ./go/main.go
2021/04/18 11:35:24 0xc000010240, hello
2021/04/18 11:35:24 &reflect.StringHeader{Data:0x4c63e1, Len:5}
2021/04/18 11:35:24 ==== from main ====
2021/04/18 11:35:24 0xc000010270, hello
2021/04/18 11:35:24 &reflect.StringHeader{Data:0x4c63e1, Len:5}
```
There. Now we know it's the same string.
We have `reflect.StringHeader`, which is a Go struct — the type that `string` actually is, with copy semantics, just like other Go structs — and then we have "the string data", which lives at `0x4c63e1`.

Which... is a peculiar memory address. It's very low. Much lower than the two `StringHeader` values we have, which were at `0xc000010240` and `0xc000010270`, respectively.
So again, to understand what's really going on, we need to get our hands dirty.
```shell
$ go build ./go/main.go
$ gdb --quiet ./main
Reading symbols from ./main...
Loading Go Runtime support.
(gdb) catch syscall exit exit_group
Catchpoint 1 (syscalls 'exit' [60] 'exit_group' [231])
(gdb) r
Starting program: /home/amos/ftl/whatbox/main
[New LWP 24224]
[New LWP 24225]
[New LWP 24226]
[New LWP 24227]
[New LWP 24228]
2021/04/18 11:41:24 0xc00011e220, hello
2021/04/18 11:41:24 &reflect.StringHeader{Data:0x4c63e1, Len:5}
2021/04/18 11:41:24 ==== from main ====
2021/04/18 11:41:24 0xc00011e250, hello
2021/04/18 11:41:24 &reflect.StringHeader{Data:0x4c63e1, Len:5}

Thread 1 "main" hit Catchpoint 1 (call to syscall exit_group), runtime.exit ()
    at /usr/lib/go/src/runtime/sys_linux_amd64.s:57
57		RET
```

Okay, we've now successfully executed our `main` Go binary from GDB, and we've managed to pause execution right before it exits.

And at that point, we can inspect memory mappings:

```shell
(gdb) info proc map
process 24220
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x400000           0x4a2000    0xa2000        0x0 /home/amos/ftl/whatbox/main
            0x4a2000           0x545000    0xa3000    0xa2000 /home/amos/ftl/whatbox/main
            0x545000           0x55b000    0x16000   0x145000 /home/amos/ftl/whatbox/main
            0x55b000           0x58e000    0x33000        0x0 [heap]
        0xc000000000       0xc004000000  0x4000000        0x0
      0x7fffd1329000     0x7fffd369a000  0x2371000        0x0
      0x7fffd369a000     0x7fffe381a000 0x10180000        0x0
      0x7fffe381a000     0x7fffe381b000     0x1000        0x0
      0x7fffe381b000     0x7ffff56ca000 0x11eaf000        0x0
      0x7ffff56ca000     0x7ffff56cb000     0x1000        0x0
      0x7ffff56cb000     0x7ffff7aa0000  0x23d5000        0x0
      0x7ffff7aa0000     0x7ffff7aa1000     0x1000        0x0
      0x7ffff7aa1000     0x7ffff7f1a000   0x479000        0x0
      0x7ffff7f1a000     0x7ffff7f1b000     0x1000        0x0
      0x7ffff7f1b000     0x7ffff7f9a000    0x7f000        0x0
      0x7ffff7f9a000     0x7ffff7ffa000    0x60000        0x0
      0x7ffff7ffa000     0x7ffff7ffd000     0x3000        0x0 [vvar]
      0x7ffff7ffd000     0x7ffff7fff000     0x2000        0x0 [vdso]
      0x7ffffffdd000     0x7ffffffff000    0x22000        0x0 [stack]
```
And what we see here is very interesting.
First off, we notice that `0x4c63e1`, where our string data actually was, is in a region mapped directly from our `main` binary:

```shell
          Start Addr           End Addr       Size     Offset objfile
            0x4a2000           0x545000    0xa3000    0xa2000 /home/amos/ftl/whatbox/main
```

And indeed, if we read 5 bytes at `str_addr - region_start_addr + region_file_offset`...

```shell
$ dd status=none if=./main skip=$((0x4c63e1-0x4a2000+0xa2000)) bs=1 count=5
hello%
```
...there it is!
The `%` character is just what Z shell prints when a command's output is not terminated with a newline. That way the prompt is not messed up, but you still know that there was no newline.

And the other very interesting thing is that the `StringHeader` values, in the `0xc00011e000` neighborhood, are not in the region GDB tells us is the `[stack]`:

```shell
          Start Addr           End Addr       Size     Offset objfile
      0x7ffffffdd000     0x7ffffffff000    0x22000        0x0 [stack]
```

And they're not in the region GDB tells us is the `[heap]`:

```shell
          Start Addr           End Addr       Size     Offset objfile
            0x55b000           0x58e000    0x33000        0x0 [heap]
```
Why is that?
Well, because Go has its own stack. And its own heap. And everything is garbage-collected. And also, that makes goroutines cheap, and they can adjust their stack size dynamically, and it also complicates FFI a bunch.
But, point is: our example is a little moot, because `"hello"` is never going to be garbage-collected — it's read directly from the executable, which never disappears as long as our program runs.
In fact, here's a fun way to show this:
```go
package main

import (
	"log"
	"reflect"
	"runtime"
	"unsafe"
)

func main() {
	var str string
	sh := (*reflect.StringHeader)(unsafe.Pointer(&str))
	log.Printf("(main) %v, %#v", &str, str)
	log.Printf("(main) %#v", sh)

	data, len := lol()

	// Now that there's no pointers left to `"hello"`, let's try to get it
	// garbage-collected. There's no guarantees, still, but we're doing our
	// best.
	runtime.GC()

	sh.Data = uintptr(data)
	sh.Len = len
	log.Printf("(main) %v, %#v", &str, str)
	log.Printf("(main) %#v", sh)
}

func lol() (uint64, int) {
	var str = "hello"
	sh := (*reflect.StringHeader)(unsafe.Pointer(&str))
	log.Printf("(lol) %v, %#v", &str, str)
	log.Printf("(lol) %#v", sh)
	// we return `sh.Data` as an `uint64`, which _does not count as a pointer_
	// because Go has a precise GC, not a conservative GC.
	return uint64(sh.Data), sh.Len
}
```

```shell
$ go run ./go/main.go
2021/04/18 12:08:22 (main) 0xc00009e220, ""
2021/04/18 12:08:22 (main) &reflect.StringHeader{Data:0x0, Len:0}
2021/04/18 12:08:22 (lol) 0xc00009e230, "hello"
2021/04/18 12:08:22 (lol) &reflect.StringHeader{Data:0x4c63dd, Len:5}
2021/04/18 12:08:22 (main) 0xc00009e220, "hello"
2021/04/18 12:08:22 (main) &reflect.StringHeader{Data:0x4c63dd, Len:5}
```
Neat! Even though we explicitly invoke the garbage collector, the data at `0x4c63dd` doesn't "disappear". It's still there.

Whereas if we compare with this code, which puts `"hello"` in the "Go heap":

```go
// omitted: rest of the code

func lol() (uint64, int) {
	var str = string([]byte{'h', 'e', 'l', 'l', 'o'})
	sh := (*reflect.StringHeader)(unsafe.Pointer(&str))
	log.Printf("(lol) %v, %#v", &str, str)
	log.Printf("(lol) %#v", sh)
	return uint64(sh.Data), sh.Len
}
```

```shell
$ go run ./go/main.go
2021/04/18 12:12:06 (main) 0xc00009e220, ""
2021/04/18 12:12:06 (main) &reflect.StringHeader{Data:0x0, Len:0}
2021/04/18 12:12:06 (lol) 0xc00009e230, "hello"
2021/04/18 12:12:06 (lol) &reflect.StringHeader{Data:0xc0000b80b8, Len:5}
2021/04/18 12:12:06 (main) 0xc00009e220, "hello"
2021/04/18 12:12:06 (main) &reflect.StringHeader{Data:0xc0000b80b8, Len:5}
```
...then we see that `"hello"` is indeed in the Go heap (it's in the `0xc000000000` neighborhood).
But uh... it doesn't disappear either.
Just curious, what did you expect?
To see the empty string? I don't know, that's a good question...
Well... the garbage collector doesn't really zero out memory blocks when it frees them, right? It just "marks them as free", and doesn't change anything about the contents of the memory.
Right, yes, I suppose.
And so unless something else gets allocated at the exact same location, re-using the previously-freed block, then we should still see the same `"hello"` string, even if it's been garbage-collected.
Right.
If only there was a way to get the Go GC to fill a memory block with nonsense after it's been freed... oh wait, hang on, there it is, we can just use `GODEBUG=clobberfree=1`:
> clobberfree: setting clobberfree=1 causes the garbage collector to clobber the memory content of an object with bad content when it frees the object.
Let's try it:
```shell
$ GODEBUG=clobberfree=1 go run ./go/main.go
2021/04/18 12:16:00 (main) 0xc000012240, ""
2021/04/18 12:16:00 (main) &reflect.StringHeader{Data:0x0, Len:0}
2021/04/18 12:16:00 (lol) 0xc000012250, "hello"
2021/04/18 12:16:00 (lol) &reflect.StringHeader{Data:0xc0000161a8, Len:5}
2021/04/18 12:16:00 (main) 0xc000012240, "ï¾\xde\xef"
2021/04/18 12:16:00 (main) &reflect.StringHeader{Data:0xc0000161a8, Len:5}
```
There! We have successfully written unsafe Go code, with the help of the `unsafe` and `reflect` packages.
To really make our example work, though, we have to run the version where `"hello"` was mapped directly from the executable file, also with `clobberfree`:
```go
func lol() (uint64, int) {
    var str = "hello"
    // etc.
}
```
```shell
$ GODEBUG=clobberfree=1 go run ./go/main.go
2021/04/18 12:17:11 (main) 0xc000012240, ""
2021/04/18 12:17:11 (main) &reflect.StringHeader{Data:0x0, Len:0}
2021/04/18 12:17:11 (lol) 0xc000012250, "hello"
2021/04/18 12:17:11 (lol) &reflect.StringHeader{Data:0x4c63dd, Len:5}
2021/04/18 12:17:11 (main) 0xc000012240, "hello"
2021/04/18 12:17:11 (main) &reflect.StringHeader{Data:0x4c63dd, Len:5}
```
Go `string` values are actually structs, with a `Data` field that points somewhere in memory. The structs themselves, of type `reflect.StringHeader`, have copy semantics, so `s2 := s1` creates a new `StringHeader`, pointing to the same area in memory.
The area in memory to which a `StringHeader` can point can be in two different regions: "static data" mapped directly from the executable file, for string constants, or "that big block Go allocates", where the GC heap lives.
Now for some more Rust
Some of the same concepts apply to Rust code as well.
For instance, if we have a string literal, it will be neither on the heap nor the stack, it will be "static data", mapped directly from the executable:
```rust
fn main() {
    let s = "hello";
    dbg!(s as *const _);
}
```
```shell
$ cargo build --quiet
$ gdb --quiet --args ./target/debug/whatbox
Reading symbols from ./target/debug/whatbox...
(gdb) catch syscall exit exit_group
Catchpoint 1 (syscalls 'exit' [60] 'exit_group' [231])
(gdb) r
Starting program: /home/amos/ftl/whatbox/target/debug/whatbox
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[src/main.rs:3] s as *const _ = 0x000055555558c000

Catchpoint 1 (call to syscall exit_group), 0x00007ffff7e71621 in _exit () from /usr/lib/libc.so.6
(gdb) info proc map
process 30848
Mapped address spaces:

          Start Addr           End Addr       Size     Offset  objfile
      0x555555554000     0x555555559000     0x5000        0x0  /home/amos/ftl/whatbox/target/debug/whatbox
      0x555555559000     0x55555558c000    0x33000     0x5000  /home/amos/ftl/whatbox/target/debug/whatbox
      0x55555558c000     0x555555599000     0xd000    0x38000  /home/amos/ftl/whatbox/target/debug/whatbox
      0x555555599000     0x55555559c000     0x3000    0x44000  /home/amos/ftl/whatbox/target/debug/whatbox
      0x55555559c000     0x55555559d000     0x1000    0x47000  /home/amos/ftl/whatbox/target/debug/whatbox
      0x55555559d000     0x5555555be000    0x21000        0x0  [heap]
      0x7ffff7da3000     0x7ffff7da5000     0x2000        0x0
      0x7ffff7da5000     0x7ffff7dcb000    0x26000        0x0  /usr/lib/libc-2.33.so
      0x7ffff7dcb000     0x7ffff7f17000   0x14c000    0x26000  /usr/lib/libc-2.33.so
      0x7ffff7f17000     0x7ffff7f63000    0x4c000   0x172000  /usr/lib/libc-2.33.so
      0x7ffff7f63000     0x7ffff7f66000     0x3000   0x1bd000  /usr/lib/libc-2.33.so
      0x7ffff7f66000     0x7ffff7f69000     0x3000   0x1c0000  /usr/lib/libc-2.33.so
(cut)
```
Here `"hello"` was at address `0x55555558c000`, which is in this range:
```
          Start Addr           End Addr       Size     Offset  objfile
      0x55555558c000     0x555555599000     0xd000    0x38000  /home/amos/ftl/whatbox/target/debug/whatbox
```
...in fact, it's at the very start of this range, and we can pull the same trick, to read it directly from the executable file ourselves:
```shell
$ dd status=none if=./target/debug/whatbox skip=$((0x38000)) bs=1 count=5
hello%
```
We can also have things that are on the stack. For example, if we turn it into a `String`, the `String` itself will be on the stack:
```rust
fn main() {
    //  👇
    let s: String = "hello".into();
    //    👇
    dbg!(&s as *const _);
}
```
```shell
$ cargo build --quiet
$ gdb --quiet --args ./target/debug/whatbox
Reading symbols from ./target/debug/whatbox...
(gdb) catch syscall exit exit_group
Catchpoint 1 (syscalls 'exit' [60] 'exit_group' [231])
(gdb) r
Starting program: /home/amos/ftl/whatbox/target/debug/whatbox
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
                                    👇
[src/main.rs:3] &s as *const _ = 0x00007fffffffd760

Catchpoint 1 (call to syscall exit_group), 0x00007ffff7e71621 in _exit () from /usr/lib/libc.so.6
(gdb) info proc map
process 31339
Mapped address spaces:

          Start Addr           End Addr       Size     Offset  objfile
      0x555555554000     0x555555559000     0x5000        0x0  /home/amos/ftl/whatbox/target/debug/whatbox
      0x555555559000     0x55555558e000    0x35000     0x5000  /home/amos/ftl/whatbox/target/debug/whatbox
      0x55555558e000     0x55555559c000     0xe000    0x3a000  /home/amos/ftl/whatbox/target/debug/whatbox
      0x55555559c000     0x55555559f000     0x3000    0x47000  /home/amos/ftl/whatbox/target/debug/whatbox
      0x55555559f000     0x5555555a0000     0x1000    0x4a000  /home/amos/ftl/whatbox/target/debug/whatbox
      0x5555555a0000     0x5555555c1000    0x21000        0x0  [heap]
      0x7ffff7da3000     0x7ffff7da5000     0x2000        0x0
(cut)
      0x7ffff7fb4000     0x7ffff7fb6000     0x2000        0x0
      0x7ffff7fc7000     0x7ffff7fca000     0x3000        0x0  [vvar]
      0x7ffff7fca000     0x7ffff7fcc000     0x2000        0x0  [vdso]
      0x7ffff7fcc000     0x7ffff7fcd000     0x1000        0x0  /usr/lib/ld-2.33.so
      0x7ffff7fcd000     0x7ffff7ff1000    0x24000     0x1000  /usr/lib/ld-2.33.so
      0x7ffff7ff1000     0x7ffff7ffa000     0x9000    0x25000  /usr/lib/ld-2.33.so
      0x7ffff7ffb000     0x7ffff7ffd000     0x2000    0x2e000  /usr/lib/ld-2.33.so
      0x7ffff7ffd000     0x7ffff7fff000     0x2000    0x30000  /usr/lib/ld-2.33.so
             👇                 👇
      0x7ffffffdd000     0x7ffffffff000    0x22000        0x0  [stack]
```
But the `String`'s data is on the heap!
```rust
fn main() {
    let s: String = "hello".into();
    //   👇
    dbg!(s.as_bytes() as *const _);
}
```
```shell
$ cargo build --quiet
$ gdb --quiet --args ./target/debug/whatbox
Reading symbols from ./target/debug/whatbox...
(gdb) catch syscall exit exit_group
Catchpoint 1 (syscalls 'exit' [60] 'exit_group' [231])
(gdb) r
Starting program: /home/amos/ftl/whatbox/target/debug/whatbox
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
                                              👇
[src/main.rs:3] s.as_bytes() as *const _ = 0x00005555555a0aa0

Catchpoint 1 (call to syscall exit_group), 0x00007ffff7e71621 in _exit () from /usr/lib/libc.so.6
(gdb) info proc map
process 31715
Mapped address spaces:

          Start Addr           End Addr       Size     Offset  objfile
      0x555555554000     0x555555559000     0x5000        0x0  /home/amos/ftl/whatbox/target/debug/whatbox
      0x555555559000     0x55555558e000    0x35000     0x5000  /home/amos/ftl/whatbox/target/debug/whatbox
      0x55555558e000     0x55555559c000     0xe000    0x3a000  /home/amos/ftl/whatbox/target/debug/whatbox
      0x55555559c000     0x55555559f000     0x3000    0x47000  /home/amos/ftl/whatbox/target/debug/whatbox
      0x55555559f000     0x5555555a0000     0x1000    0x4a000  /home/amos/ftl/whatbox/target/debug/whatbox
             👇                 👇
      0x5555555a0000     0x5555555c1000    0x21000        0x0  [heap]
(cut)
```
So a Rust `String` is like a Go `string`? I mean, a Go `StringHeader`?
Well, not exactly.
Because as we mentioned before, a Go `string` / `StringHeader` has copy semantics, which means we can simply assign a `string` to another variable, and it'll create a new `StringHeader`, pointing to the same memory area:
```go
package main

import (
    "log"
    "reflect"
    "unsafe"
)

func main() {
    var s1 = string([]byte{'h', 'e', 'l', 'l', 'o'})
    var s2 = s1
    var s3 = s1

    log.Printf("&s1 = %#v", &s1)
    log.Printf("&s2 = %#v", &s2)
    log.Printf("&s3 = %#v", &s3)

    var sh *reflect.StringHeader
    sh = (*reflect.StringHeader)(unsafe.Pointer(&s1))
    log.Printf("s1 points to %#v", sh.Data)
    sh = (*reflect.StringHeader)(unsafe.Pointer(&s2))
    log.Printf("s2 points to %#v", sh.Data)
    sh = (*reflect.StringHeader)(unsafe.Pointer(&s3))
    log.Printf("s3 points to %#v", sh.Data)
}
```
```shell
$ go run ./go/main.go
2021/04/18 12:36:04 &s1 = (*string)(0xc00009e220) // these are all different
2021/04/18 12:36:04 &s2 = (*string)(0xc00009e230)
2021/04/18 12:36:04 &s3 = (*string)(0xc00009e240)
2021/04/18 12:36:04 s1 points to 0xc0000b8010 // these are the same
2021/04/18 12:36:04 s2 points to 0xc0000b8010
2021/04/18 12:36:04 s3 points to 0xc0000b8010
```
But `String` in Rust does not implement the `Copy` trait, so it has "move semantics":
```rust
fn main() {
    let s1: String = "hello".into();
    let s2 = s1;
    let s3 = s1;

    dbg!(&s1 as *const _);
    dbg!(&s2 as *const _);
    dbg!(&s3 as *const _);

    dbg!(s1.as_bytes() as *const _);
    dbg!(s2.as_bytes() as *const _);
    dbg!(s3.as_bytes() as *const _);
}
```
```shell
$ cargo run --quiet
error[E0382]: use of moved value: `s1`
 --> src/main.rs:4:14
  |
2 |     let s1: String = "hello".into();
  |         -- move occurs because `s1` has type `String`, which does not implement the `Copy` trait
3 |     let s2 = s1;
  |              -- value moved here
4 |     let s3 = s1;
  |              ^^ value used here after move
(cut)
```
When we first do `let s2 = s1`, we move the String into `s2`, and so `s1` can no longer be used. Which means `let s3 = s1` is illegal.
What we can do is clone `s1`, but then the contents are also cloned, so they point to different copies of the data as well:
```rust
fn main() {
    let s1: String = "hello".into();
    let s2 = s1.clone();
    let s3 = s1.clone();

    dbg!(&s1 as *const _);
    dbg!(&s2 as *const _);
    dbg!(&s3 as *const _);

    dbg!(s1.as_bytes() as *const _);
    dbg!(s2.as_bytes() as *const _);
    dbg!(s3.as_bytes() as *const _);
}
```
```shell
$ cargo run --quiet
[src/main.rs:6] &s1 as *const _ = 0x00007fff40426188 // all different
[src/main.rs:7] &s2 as *const _ = 0x00007fff404261a0
[src/main.rs:8] &s3 as *const _ = 0x00007fff404261b8
[src/main.rs:10] s1.as_bytes() as *const _ = 0x000055ac20174aa0 // all different
[src/main.rs:11] s2.as_bytes() as *const _ = 0x000055ac20174ac0
[src/main.rs:12] s3.as_bytes() as *const _ = 0x000055ac20174ae0
```
Now, if we want to get something closer to the Go version, we can use references:
```rust
fn main() {
    let data: String = "hello".into();
    let s1: &str = &data;
    let s2: &str = &data;
    let s3: &str = &data;

    dbg!(&s1 as *const _);
    dbg!(&s2 as *const _);
    dbg!(&s3 as *const _);

    dbg!(s1.as_bytes() as *const _);
    dbg!(s2.as_bytes() as *const _);
    dbg!(s3.as_bytes() as *const _);
}
```
```shell
$ cargo run --quiet
[src/main.rs:8] &s1 as *const _ = 0x00007ffeb7e82510
[src/main.rs:9] &s2 as *const _ = 0x00007ffeb7e82520
[src/main.rs:10] &s3 as *const _ = 0x00007ffeb7e82530
[src/main.rs:12] s1.as_bytes() as *const _ = 0x0000563249bbcaa0
[src/main.rs:13] s2.as_bytes() as *const _ = 0x0000563249bbcaa0
[src/main.rs:14] s3.as_bytes() as *const _ = 0x0000563249bbcaa0
```
Now, `s1`, `s2`, and `s3` are all distinct references to the same underlying data.
But that's still not really what Go does. Because we cannot return a reference to a local variable, for example:
```rust
fn main() {
    let s = lol();
    dbg!(s as *const _);
    dbg!(s.as_bytes() as *const _);
}

fn lol() -> &str {
    let data: String = "hello".into();
    let s: &str = &data;
    s
}
```
```shell
$ cargo run --quiet
error[E0106]: missing lifetime specifier
 --> src/main.rs:7:13
  |
7 | fn lol() -> &str {
  |             ^ expected named lifetime parameter
  |
  = help: this function's return type contains a borrowed value, but there is no value for it to be borrowed from
help: consider using the `'static` lifetime
  |
7 | fn lol() -> &'static str {
  |             ^^^^^^^^
```
The Rust compiler is trying to help us. "You can't just return a reference to something", it pleads. "You need to tell me how long the thing that's referenced will live".
And so, we can add `'static`, to say that it will not be freed until the program exits.
We can say that...
```rust
fn lol() -> &'static str {
    let data: String = "hello".into();
    let s: &str = &data;
    s
}
```
...but it's not true!
```shell
$ cargo run --quiet
error[E0515]: cannot return value referencing local variable `data`
  --> src/main.rs:10:5
   |
9  |     let s: &str = &data;
   |                   ----- `data` is borrowed here
10 |     s
   |     ^ returns a value referencing data owned by the current function
```
Because `data` is owned by the current function! Sure, the "string data" actually lives on the heap, but it is owned by the `String`, which means it's allocated when `let data` is declared, and it's freed whenever `data` "goes out of scope", in this case, at the end of the `lol` function.
So, if we were able to return a reference to it, that reference would point to an object that is no longer live. We would have a good old dangling pointer.
If the string data lived elsewhere, say, if it were static data, in the executable itself, then we would be able to return a reference to it!
```rust
fn main() {
    let s = lol();
    dbg!(s as *const _);
    dbg!(s.as_bytes() as *const _);
}

fn lol() -> &'static str {
    let s: &'static str = "hello";
    s
}
```
```shell
$ cargo run --quiet
[src/main.rs:3] s as *const _ = 0x000055dab3e0e128
[src/main.rs:4] s.as_bytes() as *const _ = 0x000055dab3e0e128
```
Mhh. That address looks suspiciously close to the heap though.
Correct!
And now is as good a time as any to show some diagrams.
Let's get some addresses directly from GDB so our diagram can be close to reality.
```shell
$ cargo build --quiet
$ gdb --quiet --args ./target/debug/whatbox
(gdb) catch syscall exit exit_group
(gdb) run
(gdb) info proc map
process 3406
Mapped address spaces:

          Start Addr           End Addr       Size     Offset  objfile
      0x55555558c000     0x555555599000     0xd000    0x38000  /home/amos/ftl/whatbox/target/debug/whatbox
      0x55555559d000     0x5555555be000    0x21000        0x0  [heap]
      0x7ffffffdd000     0x7ffffffff000    0x22000        0x0  [stack]
```
In this case, our three main regions of interest are laid out roughly like this (not to scale):