So, in the interest of fairness, let's jam a raw pointer in our Rust struct
instead:
Ok, that's more reasonable. And smaller than our C version! What's happening there?

And then... I guess it's saving space by overlapping the u64 and the *const c_char, since either of them could be valid, but never both at the same time: that's why we also call this construct a "disjoint union". That's 16 bytes total.
Can we do the same thing in C? Sure we can! There's a union keyword, like, right there. It's just like struct, except everything overlaps, and it has the size of the largest member (more or less - there are also alignment considerations).
And indeed, if we do this:
C code
struct UserID {
    enum UserIDKind kind;
    union {
        uint64_t number;
        char *text;
    };
};

int main() {
    printf("sizeof(struct UserID) = %ld\n", sizeof(struct UserID));
}
Then we get the same result as Rust:
Shell session
$ clang -Wall main.c -o main && ./main
sizeof(struct UserID) = 16
We can go even further - we can switch from an int (4 bytes for our current target, which happens to be clang 10 on 64-bit Linux) to a uint8_t:
C code
struct UserID {
    uint8_t kind;
    union {
        uint64_t number;
        char *text;
    };
};
Shell session
$ clang -Wall main.c -o main && ./main
sizeof(struct UserID) = 16
Mhh that didn't change much...
Much like commercial food packaging - it's probably mostly padding.
Ah right! Let's try packing our struct again.
C code
struct __attribute__((packed)) UserID {
    uint8_t kind;
    union {
        uint64_t number;
        char *text;
    };
};
Shell session
$ clang -Wall main.c -o main && ./main
sizeof(struct UserID) = 9
Just nine bytes! Now that's compact.
Can we do the same with our Rust enum?

Let's look at structs first - just like in C, Rust defaults to proper alignment for fields, so if we have a u8 and a u64:
Rust code
use std::mem::size_of;

struct Foo {
    bar: u8,
    baz: u64,
}

fn main() {
    dbg!(size_of::<Foo>());
}
...then we end up with a 16-byte struct:
Shell session
$ cargo run -q
[src/main.rs:16] size_of::<Foo>() = 16
But, just like in C, if we ask nicely, Rust will pack it:
Rust code
#[repr(packed)]
struct Foo {
    bar: u8,
    baz: u64,
}
Shell session
$ cargo run -q
[src/main.rs:17] size_of::<Foo>() = 9
But what about enums?
Rust code
#[repr(packed)]
enum UserID {
    Number(u64),
    Text(*const c_char),
}
Shell session
$ cargo run -q
error[E0517]: attribute should be applied to struct or union
--> src/main.rs:4:8
|
4 | #[repr(packed)]
| ^^^^^^
5 | / enum UserID {
6 | | Number(u64),
7 | | Text(*const c_char),
8 | | }
| |_- not a struct or union
We can't pack them! It's been discussed before, along with other exotic enum layout optimizations, but currently, you can't do it.

And yet... clearly smartstring is doing something like that.

When a SmartString is stored on the heap (its "boxed" variant), it's 24 bytes, just like a String.

But if we try to make our own smartstring, with a Rust enum, we can't even get close to that:
Rust code
use std::mem::size_of;

#[allow(dead_code)]
enum SmartString {
    Boxed(String),
    Inline([u8; 24]),
}

fn main() {
    dbg!(size_of::<String>());
    dbg!(size_of::<[u8; 24]>());
    dbg!(size_of::<SmartString>());
}
Shell session
$ cargo run -q
[src/main.rs:10] size_of::<String>() = 24
[src/main.rs:11] size_of::<[u8; 24]>() = 24
[src/main.rs:12] size_of::<SmartString>() = 32
There is one thing we can do - since Rust won't pack our enum, we can just make our own enum.

First off, we can't use Rust unions, because a union only accepts fields of types that 1) are Copy, 2) do not implement Drop.
But what we can do... is just treat any type as what it is: a miserable
little pile of bytes.
Rust code
use std::mem::size_of;

#[allow(dead_code)]
#[repr(packed)]
struct SmartString {
    discriminant: u8,
    data: [u8; 24],
}

fn main() {
    dbg!(size_of::<SmartString>());
}
Shell session
$ cargo run -q
[src/main.rs:11] size_of::<SmartString>() = 25
There! 25 bytes. It's the best we can hope for right now.
But it doesn't actually do anything yet - we're just storing 25 bytes in a
struct.
We need to figure out a way to store both our variants:
- boxed: a String
- inline: some utf-8 bytes, and I guess a length?

In the "inline" variant, we can never store more than 24 bytes, so we can definitely represent the length as a u8, too. So we actually want something like this:
Rust code
struct Inline {
    len: u8,
    data: [u8; 23],
}
And then, to be sure all our types actually have the same size, we'll use the static_assertions crate:
Shell session
$ cargo add static_assertions
Adding static_assertions v1.1.0 to dependencies
Rust code
use static_assertions::*;
use std::mem::size_of;

#[allow(dead_code)]
#[repr(packed)]
struct SmartString {
    discriminant: u8,
    data: [u8; 24],
}

#[allow(dead_code)]
struct Inline {
    len: u8,
    data: [u8; 23],
}

assert_eq_size!(String, Inline);

fn main() {
    dbg!(size_of::<SmartString>());
}
Shell session
$ cargo check
Finished dev [unoptimized + debuginfo] target(s) in 0.00s
Good. But to be extra super duper sure, since we're going to be doing a lot of unsafe stuff, we can even define our Inline in terms of the size of a String, so it's checked both ways:
Rust code
use static_assertions::*;
use std::mem::size_of;

const VARIANT_SIZE: usize = std::mem::size_of::<String>();

#[allow(dead_code)]
#[repr(packed)]
struct SmartString {
    discriminant: u8,
    data: [u8; VARIANT_SIZE],
}

#[allow(dead_code)]
struct Inline {
    len: u8,
    data: [u8; VARIANT_SIZE - 1],
}

assert_eq_size!(String, Inline);

fn main() {
    dbg!(size_of::<SmartString>());
}
Okay. Now let's actually implement our hand-rolled enum. The first thing
we'll need is... to actually be able to build it.
We're only using [u8; VARIANT_SIZE] to reserve VARIANT_SIZE bytes - if we actually want to store something in there, we're going to need to obtain a *mut pointer to it, and cast it to whatever we want:
Rust code
impl SmartString {
    pub fn new_boxed(s: String) -> Self {
        Self::new(0, s)
    }

    pub fn new_inline() -> Self {
        Self::new(
            1,
            Inline {
                len: 0,
                data: Default::default(),
            },
        )
    }

    fn new<T>(discriminant: u8, data: T) -> Self {
        let mut res = Self {
            discriminant,
            data: Default::default(),
        };
        let ptr: *mut T = res.data.as_mut_ptr().cast();
        unsafe { ptr.write_unaligned(data) };
        res
    }
}
We can now build both variants of our SmartString:
Rust code
fn main() {
    let boxed = SmartString::new_boxed("This is a longer string, would not fit inline".into());
    let inline = SmartString::new_inline();
}
Except we can't do much with them beyond that point.

Let's try to do something useful, like, obtaining a &str slice from them:
Rust code
use std::mem::ManuallyDrop;

impl AsRef<str> for SmartString {
    fn as_ref(&self) -> &str {
        match self.discriminant {
            0 => {
                let s: *const ManuallyDrop<String> = self.data.as_ptr().cast();
                let tmp = unsafe { s.read_unaligned() };
                unsafe { &*(tmp.as_ref() as *const str) }
            }
            1 => {
                let s: *const Inline = self.data.as_ptr().cast();
                unsafe {
                    let slice = std::slice::from_raw_parts((*s).data.as_ptr(), (*s).len as _);
                    std::str::from_utf8_unchecked(slice)
                }
            }
            _ => unreachable!(),
        }
    }
}
Remember my bit about how we review unsafe code extra hard to make sure it doesn't violate invariants? That definitely applies here. We're using the "safer" variant, unreachable, but if I were feeling adventurous, I'd think about using unreachable_unchecked instead.
Now that we have an AsRef implementation, we can actually print the contents of our SmartString - no matter what the underlying variant is. For convenience, we'll implement Display and Debug:
Rust code
use std::fmt;

impl fmt::Display for SmartString {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let s: &str = self.as_ref();
        fmt::Display::fmt(s, f)
    }
}

impl fmt::Debug for SmartString {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let s: &str = self.as_ref();
        fmt::Debug::fmt(s, f)
    }
}

fn main() {
    let boxed = SmartString::new_boxed("This is a longer string, would not fit inline".into());
    let inline = SmartString::new_inline();
    dbg!(boxed, inline);
}
Shell session
$ cargo run -q
[src/main.rs:84] boxed = "This is a longer string, would not fit inline"
[src/main.rs:84] inline = ""
We're still missing a lot of things - we can't mutate our SmartString at all, which the smartstring crate allows. We also never "demote" a boxed variant to an inline variant, and we haven't had to think about "promoting" an inline variant to a boxed variant, in case it grows too much.
But there's a much more pressing worry.
Allow me to demonstrate.
Rust code
fn main() {
    let s: String = "this is just some text".into();
    dbg!(s);
}
Shell session
$ cargo build --quiet --release && valgrind --tool=memcheck ./target/release/enumpeek
==173592== Memcheck, a memory error detector
==173592== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==173592== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==173592== Command: ./target/release/enumpeek
==173592==
[src/main.rs:82] s = "this is just some text"
==173592==
==173592== HEAP SUMMARY:
==173592== in use at exit: 0 bytes in 0 blocks
==173592== total heap usage: 15 allocs, 15 frees, 2,335 bytes allocated
==173592==
==173592== All heap blocks were freed -- no leaks are possible
==173592==
==173592== For lists of detected and suppressed errors, rerun with: -s
==173592== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Rust code
fn main() {
    let s: SmartString = SmartString::new_boxed("this is just some text".into());
    dbg!(s);
}
Shell session
$ cargo build --quiet --release && valgrind --tool=memcheck ./target/release/enumpeek
==173779== Memcheck, a memory error detector
==173779== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==173779== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==173779== Command: ./target/release/enumpeek
==173779==
[src/main.rs:82] s = "this is just some text"
==173779==
==173779== HEAP SUMMARY:
==173779== in use at exit: 22 bytes in 1 blocks
==173779== total heap usage: 15 allocs, 14 frees, 2,335 bytes allocated
==173779==
==173779== LEAK SUMMARY:
==173779== definitely lost: 22 bytes in 1 blocks
==173779== indirectly lost: 0 bytes in 0 blocks
==173779== possibly lost: 0 bytes in 0 blocks
==173779== still reachable: 0 bytes in 0 blocks
==173779== suppressed: 0 bytes in 0 blocks
==173779== Rerun with --leak-check=full to see details of leaked memory
==173779==
==173779== For lists of detected and suppressed errors, rerun with: -s
==173779== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
We're leaking memory!

String is a struct just like any other struct, but it owns memory that is heap-allocated. In SmartString::new_boxed, we take ownership of the String... and its associated heap-allocated memory... and then we just never free it.

The compiler doesn't know to drop the String we're holding inside of our boxed-variant SmartString, because that's not the type we're holding - as far as it knows, we're just holding 24 bytes, which could be anything.

If we know that it is, in fact, a String, and that these need to be dropped, then we need to say so.
Here's our first stab at Drop:
Rust code
impl Drop for SmartString {
    fn drop(&mut self) {
        match self.discriminant {
            0 => {
                let s: *mut String = self.data.as_mut_ptr().cast();
                let b: String = unsafe { *s };
                drop(b);
            }
            1 => {
                // etc.
            }
            _ => unreachable!(),
        }
    }
}
Shell session
$ cargo run -q
error[E0507]: cannot move out of `*s` which is behind a raw pointer
--> src/main.rs:46:42
|
46 | let b: String = unsafe { *s };
| ^^
| |
| move occurs because `*s` has type `std::string::String`, which does not implement the `Copy` trait
| help: consider borrowing here: `&*s`
Whoops, that doesn't work. We can't move out from behind a raw pointer, since String is not Copy.

What can we do... can we maybe make a Box from it? It has from_raw, which sounds delightful:
Rust code
impl Drop for SmartString {
    fn drop(&mut self) {
        match self.discriminant {
            0 => {
                let s: *mut String = self.data.as_mut_ptr().cast();
                let b = unsafe { Box::from_raw(s) };
                drop(b);
            }
            1 => {
                // etc.
            }
            _ => unreachable!(),
        }
    }
}
Shell session
$ cargo run -q
[src/main.rs:117] s = "this is just some text"
free(): invalid pointer
[1] 179297 abort (core dumped) cargo run -q
Uh oh.
Wow, you weren't lying, unsafe code is tricky.
Let's check in with our friend Valgrind:
Shell session
$ cargo build --quiet --release && valgrind --tool=memcheck ./target/release/enumpeek
==179648== Memcheck, a memory error detector
==179648== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==179648== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==179648== Command: ./target/release/enumpeek
==179648==
[src/main.rs:117] s = "this is just some text"
==179648== Invalid free() / delete / delete[] / realloc()
==179648== at 0x483B9AB: free (vg_replace_malloc.c:538)
==179648== by 0x10D501: enumpeek::main (in /home/amos/ftl/enumpeek/target/release/enumpeek)
==179648== by 0x10D8E2: std::rt::lang_start::{{closure}} (in /home/amos/ftl/enumpeek/target/release/enumpeek)
==179648== by 0x1163F7: {{closure}} (rt.rs:52)
==179648== by 0x1163F7: do_call<closure-0,i32> (panicking.rs:297)
==179648== by 0x1163F7: try<i32,closure-0> (panicking.rs:274)
==179648== by 0x1163F7: catch_unwind<closure-0,i32> (panic.rs:394)
==179648== by 0x1163F7: std::rt::lang_start_internal (rt.rs:51)
==179648== by 0x10D561: main (in /home/amos/ftl/enumpeek/target/release/enumpeek)
==179648== Address 0x1ffefff561 is on thread 1's stack
==179648== in frame #1, created by enumpeek::main (???:)
==179648==
==179648==
==179648== HEAP SUMMARY:
==179648== in use at exit: 0 bytes in 0 blocks
==179648== total heap usage: 15 allocs, 16 frees, 2,335 bytes allocated
==179648==
==179648== All heap blocks were freed -- no leaks are possible
==179648==
==179648== For lists of detected and suppressed errors, rerun with: -s
==179648== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
The problem here is that it's trying to deallocate the String as if it had been allocated on the heap. But it hasn't! The String's internal data has been allocated on the heap, but the String struct itself hasn't.
Here's one way that does appear to work:
Rust code
impl Drop for SmartString {
    fn drop(&mut self) {
        match self.discriminant {
            0 => {
                let s: *mut String = self.data.as_mut_ptr().cast();
                let s: String = unsafe { std::ptr::read_unaligned(s) };
                drop(s);
            }
            1 => {
                // etc.
            }
            _ => unreachable!(),
        }
    }
}
We can even:

- Make a generic internal method to avoid code duplication with the Inline case
- Forget about the drop, since dropping will happen as soon as the result of std::ptr::read_unaligned goes out of scope

Let's go!
Rust code
impl SmartString {
    unsafe fn drop_variant<T>(&self) {
        std::ptr::read_unaligned(self.data.as_ptr().cast::<T>());
    }
}

impl Drop for SmartString {
    fn drop(&mut self) {
        match self.discriminant {
            0 => unsafe { self.drop_variant::<String>() },
            1 => unsafe { self.drop_variant::<Inline>() },
            _ => unreachable!(),
        }
    }
}
Shell session
$ cargo build --quiet --release && valgrind --tool=memcheck ./target/release/enumpeek
==181085== Memcheck, a memory error detector
==181085== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==181085== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==181085== Command: ./target/release/enumpeek
==181085==
[src/main.rs:99] s = "this is just some text"
==181085==
==181085== HEAP SUMMARY:
==181085== in use at exit: 0 bytes in 0 blocks
==181085== total heap usage: 15 allocs, 15 frees, 2,335 bytes allocated
==181085==
==181085== All heap blocks were freed -- no leaks are possible
==181085==
==181085== For lists of detected and suppressed errors, rerun with: -s
==181085== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Wonderful! Is it actually correct? I don't know! I wouldn't use it in
production before asking a bunch of other folks to look at it first. But at
least, in our case, it doesn't seem to leak memory or do invalid frees.
That's always good.
We could go on all day re-implementing all of smartstring's features on our homegrown version, but the point remains: our version is larger. One entire byte larger.

Much like the smallvec::SmallVec type is larger than the Vec type.
Shell session
$ cargo add smallvec
      Adding smallvec v1.4.2 to dependencies
Rust code
use smallvec::SmallVec;
use std::mem::size_of;

fn main() {
    dbg!(size_of::<Vec<u8>>(), size_of::<SmallVec<[u8; 1]>>());
}
Shell session
$ cargo run -q
[src/main.rs:100] size_of::<Vec<u8>>() = 24
[src/main.rs:100] size_of::<SmallVec<[u8; 1]>>() = 32
So, hopefully, by now, 44 whole-ass minutes into this article, you understand exactly why that is an interesting question.

The mystery is not why SmallVec<[u8; 1]> is 32 bytes, 8 more than a Vec<u8>. A SmallVec is "just an enum" - its discriminant only needs to account for two variants, but because Rust enums have padding to maintain alignment, we pay 8 bytes for it.

The mystery is: how the heck is a SmartString just 24 bytes?
To answer that question, we need to take a closer look at pointers.
So, what's a pointer? Just a number, right? That tells you where in memory something is.

For example, if we declare a local x, of type i32, it probably ends up on the stack:
Rust code
fn main() {
    // this is a signed 32-bit integer
    let x = 30;
    // this is a reference to a signed 32-bit integer
    let x_ref = &x;
    // this is a pointer to a signed 32-bit integer
    let x_ptr = x_ref as *const _;
    dbg!(x_ptr);
}
Shell session
$ cargo run -q
[src/main.rs:105] x_ptr = 0x00007fff10be39ec
Of course, a local could also be in a register. But that's irrelevant here.
As soon as we take the address of something, it needs to be mapped somewhere
in our virtual memory address space, and registers aren't, so for the
purposes of this explanation, let's pretend registers don't exist.
Yeah let's just completely disregard the fastest storage available on a modern
computer, seems good.
Look cool bear, do you want to finish this article?
So. Number that tells you where in memory something is. It is very much an
address, the same way some countries have addresses for physical places,
except with more indirection.
An "aligned" pointer is a pointer whose value (the address) is a multiple of
the data size. It is convenient for CPUs when data has its "natural
alignment".
Let's look at some examples.
The smallest unit of memory we can address is a byte. A pointer to a byte is always aligned, since pointers count in bytes. In other words, the alignment of u8 is 1.
Rust code
fn main() {
    let arr = [1u8, 2u8, 3u8, 4u8];
    dbg!(
        &arr[0] as *const _,
        &arr[1] as *const _,
        &arr[2] as *const _,
        &arr[3] as *const _,
    );
}
Shell session
$ cargo run -q
[src/main.rs:106] &arr[0] as *const _ = 0x00007ffd6474abdc
[src/main.rs:106] &arr[1] as *const _ = 0x00007ffd6474abdd
[src/main.rs:106] &arr[2] as *const _ = 0x00007ffd6474abde
[src/main.rs:106] &arr[3] as *const _ = 0x00007ffd6474abdf
If we're talking about pointers to u16, then we want an alignment of 2.
Rust code
fn main() {
    let arr = [1u16, 2u16, 3u16, 4u16];

    fn inspect<T>(t: *const T) -> (*const T, bool) {
        (t, t as usize % 2 == 0)
    }

    dbg!(
        inspect(&arr[0] as *const _),
        inspect(&arr[1] as *const _),
        inspect(&arr[2] as *const _),
        inspect(&arr[3] as *const _),
    );
}
Shell session
$ cargo run -q
[src/main.rs:110] inspect(&arr[0] as *const _) = (
0x00007ffd81bf5918,
true,
)
[src/main.rs:110] inspect(&arr[1] as *const _) = (
0x00007ffd81bf591a,
true,
)
[src/main.rs:110] inspect(&arr[2] as *const _) = (
0x00007ffd81bf591c,
true,
)
[src/main.rs:110] inspect(&arr[3] as *const _) = (
0x00007ffd81bf591e,
true,
)
For u32 values, we want an alignment of 4, and for u64 values, we want an alignment of 8.
For the visual thinkers among us, here are some examples of values that are properly aligned:

The little boxes at the bottom represent what types we can store at these locations if we want them to be properly aligned.

The top part represents the actual memory layout of, for example, a struct, like this one here:
C code
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

struct S {
    uint8_t a;
    uint8_t b;
    uint16_t c;
    uint32_t d;
};

int main() {
    printf("sizeof(S) = %ld\n", sizeof(struct S));
    printf("offsetof(struct S, a) = %zu\n", offsetof(struct S, a));
    printf("offsetof(struct S, b) = %zu\n", offsetof(struct S, b));
    printf("offsetof(struct S, c) = %zu\n", offsetof(struct S, c));
    printf("offsetof(struct S, d) = %zu\n", offsetof(struct S, d));
}
Shell session
$ clang -Wall main.c -o main && ./main
sizeof(S) = 8
offsetof(struct S, a) = 0
offsetof(struct S, b) = 1
offsetof(struct S, c) = 2
offsetof(struct S, d) = 4
Here, things work out just fine - but they could be less ideal.
For example, we could have this layout instead:
C code
struct S {
    uint8_t a;
    uint16_t b;
    uint8_t c;
    uint32_t d;
};
What would this look like?
Shell session
$ clang -Wall main.c -o main && ./main
sizeof(S) = 12
offsetof(struct S, a) = 0
offsetof(struct S, b) = 2
offsetof(struct S, c) = 4
offsetof(struct S, d) = 8
To maintain alignment, the compiler has inserted padding:
"Padding" is not necessarily zeroed - it's just unused space. Even if it's initially zeroed, it's not guaranteed to stay zero if you assign members. Regular usage of the struct might mess with the values in the padding, and so a good old block memory comparison (memcmp) would not be enough to test two structs for equality.
What happens if we define the same struct layout in Rust?
Rust code
fn main() {
    struct S {
        a: u8,
        b: u16,
        c: u8,
        d: u32,
    }

    dbg!(std::mem::size_of::<S>());
}
Shell session
$ cargo run -q
[src/main.rs:112] std::mem::size_of::<S>() = 8
It's only 8 bytes again! What happened? Let's look at its layout:
Shell session
$ cargo add memoffset
Adding memoffset v0.5.5 to dependencies
Rust code
fn main() {
    struct S {
        a: u8,
        b: u16,
        c: u8,
        d: u32,
    }

    use memoffset::offset_of;
    dbg!(
        std::mem::size_of::<S>(),
        offset_of!(S, a),
        offset_of!(S, b),
        offset_of!(S, c),
        offset_of!(S, d)
    );
}
Shell session
$ cargo run -q
[src/main.rs:113] std::mem::size_of::<S>() = 8
[src/main.rs:113] offset_of!(S, a) = 6
[src/main.rs:113] offset_of!(S, b) = 4
[src/main.rs:113] offset_of!(S, c) = 7
[src/main.rs:113] offset_of!(S, d) = 0
Our fields were re-ordered!

We can tell the Rust compiler not to re-order our fields with repr(C):
Rust code
fn main() {
    #[repr(C)]
    struct S {
        a: u8,
        b: u16,
        c: u8,
        d: u32,
    }

    use memoffset::offset_of;
    dbg!(
        std::mem::size_of::<S>(),
        offset_of!(S, a),
        offset_of!(S, b),
        offset_of!(S, c),
        offset_of!(S, d)
    );
}
Rust code
cargo run -q
[ src/main. rs : 11 ] std:: mem:: size_of :: < S > ( ) = 12
[ src/main. rs : 11 ] offset_of ! ( S, a) = 0
[ src/main. rs : 11 ] offset_of ! ( S, b) = 2
[ src/main. rs : 11 ] offset_of ! ( S, c) = 4
[ src/main. rs : 11 ] offset_of ! ( S, d) = 8
And now we have the same layout we had with C (hence repr(C)), with the same padding:

Or we could tell the Rust compiler to not re-order fields, and to pack the struct (i.e. to not insert any padding), completely disregarding alignment:
Rust code
fn main() {
    #[repr(C, packed)]
    struct S {
        a: u8,
        b: u16,
        c: u8,
        d: u32,
    }

    use memoffset::offset_of;
    dbg!(
        std::mem::size_of::<S>(),
        offset_of!(S, a),
        offset_of!(S, b),
        offset_of!(S, c),
        offset_of!(S, d)
    );
}
But now... S.b is no longer aligned properly!
Rust code
$ cargo run -q
[ src/main. rs : 11 ] std:: mem:: size_of :: < S > ( ) = 8
[ src/main. rs : 11 ] offset_of ! ( S, a) = 0
[ src/main. rs : 11 ] offset_of ! ( S, b) = 1
[ src/main. rs : 11 ] offset_of ! ( S, c) = 3
[ src/main. rs : 11 ] offset_of ! ( S, d) = 4
And if we try to take a reference to it, Rust will warn us (for now,
it'll become a hard error later):
Rust code
fn main() {
    #[repr(C, packed)]
    #[derive(Default)]
    struct S {
        a: u8,
        b: u16,
        c: u8,
        d: u32,
    }

    let s: S = Default::default();
    dbg!(&s.b);
}
Shell session
$ cargo run -q
warning: borrow of packed field is unsafe and requires unsafe function or block (error E0133)
--> src/main.rs:12:10
|
12 | dbg!(&s.b);
| ^^^^
|
= note: `#[warn(safe_packed_borrows)]` on by default
= warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
= note: for more information, see issue #46043 <https://github.com/rust-lang/rust/issues/46043>
= note: fields of packed structs might be misaligned: dereferencing a misaligned pointer or even just creating a misaligned reference is undefined behavior
warning: 1 warning emitted
[src/main.rs:12] &s.b = 0
And yet... everything seems to work fine on my 2018 Core i7.
We can even mutate it, no problem:
Rust code
fn main() {
    #[repr(C, packed)]
    #[derive(Default)]
    struct S {
        a: u8,
        b: u16,
        c: u8,
        d: u32,
    }

    let mut s: S = Default::default();
    unsafe {
        s.b = 0x123;
        println!("{:#x}", s.b);
    }
}
Shell session
$ cargo run -q
0x123
That's not the only way to obtain an unaligned pointer. Using some pointer casting, we can also treat, say, two u8 as a single, unaligned u16:
Rust code
fn main() {
    let mut arr = [1u8, 2u8, 3u8];
    let ptr_u16 = (&mut arr[1]) as *mut _ as *mut u16;
    unsafe {
        *ptr_u16 = 0x123;
        println!("{:#x}", *ptr_u16);
    }
}
Note that clippy will catch this - and treats it as a hard error by default:
Shell session
$ cargo clippy
Checking enumpeek v0.1.0 (/home/amos/ftl/enumpeek)
error: casting from `*mut u8` to a more-strictly-aligned pointer (`*mut u16`) (1 < 2 bytes)
--> src/main.rs:3:19
|
3 | let ptr_u16 = (&mut arr[1]) as *mut _ as *mut u16;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[deny(clippy::cast_ptr_alignment)]` on by default
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#cast_ptr_alignment
It does work, though, again, on my 2018 Core i7.
Shell session
$ cargo run -q
0x123
So... why should we care about alignment again?
Well, it's a long story.
Basically, back when C was "invented", some processors had poor support for
unaligned memory access.
For some of them, unaligned memory accesses would raise a processor exception: the exception handler might be able to correct the unaligned access, but at a very high performance cost. Or the handler might not be able to correct the unaligned access at all, and program execution would just abort.
Some other architectures, like Intel's "Core 2" series (succeeded by Nehalem), transparently support unaligned memory accesses, at "some performance cost".
I'm tempted to link to some microbenchmarks here, but they all contradict
each other - it depends on so many factors. Some benchmarks show a 10%
slowdown, some show a 50% slowdown, there's clearly a lot going on there.
Just remember that, even after processors started getting first-class support
for unaligned reads, it was still desirable to avoid them for performance
reasons.
But I've kept the best for last:
Some architectures would not raise a processor exception, but silently
perform a different read instead.
An example ARMv4t chip: the sound processor chip for the SEGA NAOMI arcade system.
Yaca2671
Wait... it would perform a different read?

That sounds terrifying. Did that really happen? Do you have any sources on that?
I'll do you one better - I'll show you.
We've done some unaligned reads in this article already.
It's relatively easy to do!
First you need some data - we'll just pick 8 unique byte values, so it's
easier to follow what's going on:
C code
uint8_t arr[8] = { 0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01 };
And then we do a read from an address that isn't aligned properly. For example, we try to read a uint32_t starting from the second item of our array.
Here's the full code sample:
C code
#include <stdint.h>
#include <stdio.h>

int main() {
    uint8_t arr[8] = {0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01};
    // arrays are zero-indexed, so `1` is the second item
    uint32_t *ptr = (uint32_t *)(&arr[1]);
    printf("0x%08x\n", *ptr);
    return 0;
}
What is this going to print? 0xcdab8967? Wrong!
My 2018 Core i7 is a little-endian CPU:
Shell session
$ lscpu | grep -E '(Byte Order|Model name)'
Byte Order: Little Endian
Model name: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
...which means bytes are stored from "least significant" to "most
significant".
So instead of having the thing that makes sense:
We have this upside-down world arrow spaghetti thing:
Shell session
$ gcc -Wall main.c -o main && ./main
0x6789abcd
And that's why I picked such specific values for our uint8_t array.
Okay, I still don't believe you though - this unaligned read is
working perfectly fine.
Yes, yes.
But what if we compile that code... for the Game Boy Advance?
A 'Glacier' Game Boy Advance - one of the colors available at launch in North America
Evan-Amos
The Game Boy Advance has a 16.8 MHz 32-bit ARM7TDMI processor - which implements the ARMv4T architecture.

It means we went back to 2001, Marty.
Okay... let's say I'm curious - how does one even compile for the GBA in 2020?

It's simple! Just install devKitPro on your OS of choice, and, uh, read a bit (okay, a lot) of documentation, and you're good to go!
Didn't I see you trying to set up an SA-1100 virtual machine earlier? And then a SH4 one?

Yeah well, hindsight is 20/20. Do you want to see something cool or not?
So, I didn't actually read any documentation about the Game Boy Advance, I just found a sample project I liked, and this is what it showed in VisualBoyAdvance:
The C code for the project looked something like this:
C code
// (cut)

// clear screen
iprintf("\x1b[2J");
// print at coordinates 10,10
iprintf("\x1b[10;10H");
iprintf("Hello World!");

// (cut)
So - easy enough! We have some sort of printf implementation - and apart from that... it's just C! And it's built with GCC, the same compiler family we just used on 64-bit Linux for my 2018 Core i7.

So, it shouldn't be too hard to get our code running there.

Here's the full listing of source/console.c:
C code
#include <gba_console.h>
#include <gba_video.h>
#include <gba_interrupt.h>
#include <gba_systemcalls.h>

int main(void) {
    irqInit();
    irqEnable(IRQ_VBLANK);
    consoleDemoInit();

    // clear screen
    iprintf("\x1b[2J");
    // print at coordinates 10,10
    iprintf("\x1b[10;10H");

    uint8_t arr[8] = {0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01};
    // arrays are zero-indexed, so `1` is the second item
    uint32_t *ptr = (uint32_t *)(&arr[1]);
    iprintf("0x%08x", *ptr);

    while (1) {
        VBlankIntrWait();
    }
}
And as you can see, cool bear, it does the wrong rea...
AhAH! SEE? It's fine! Everything's fine.
No, wait this can't be right - is VBA-M somehow fixing the unaligned access
transparently? This wouldn't work on the actual chip... let me just-
Yeah yeah, the actual chip, sure.
All I'm seeing here is that you're once again making outrageous claims you
can't back. So much for checking your sources, Amos. "Oooh we have to be
careful about factual errors, ooh we don't want to contribute to
misinformation", "our job is to dispel myths", yada-yada.
Hold on, let me see if I can disassemble this.
I noticed that, along with a .gba file, devKitPro also gave us an .elf file.
Shell session
$ file ./ansi_console/ansi_console.elf
./ansi_console/ansi_console.elf: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, with debug_info, not stripped
And I noticed that, alongside gcc, as, ld, etc., it also had an objdump:
Shell session
$ $DEVKITARM/bin/arm-none-eabi-objdump --disassemble --source ./ansi_console/ansi_console.elf
int main(void) {
(cut)
// print at coordinates 10,10
iprintf("\x1b[10;10H");
8000286: 4805 ldr r0, [pc, #20] ; (800029c <main+0x2c>)
8000288: f000 fcec bl 8000c64 <iprintf>
uint8_t arr[8] = {0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01};
// arrays are zero-indexed, so `1` is the second item
uint32_t *ptr = (uint32_t *)(&arr[1]);
iprintf("0x%08x", *ptr);
800028c: 4904 ldr r1, [pc, #16] ; (80002a0 <main+0x30>)
800028e: 4805 ldr r0, [pc, #20] ; (80002a4 <main+0x34>)
8000290: f000 fce8 bl 8000c64 <iprintf>
/*! \fn void VBlankIntrWait()
\brief waits for a vertical blank interrupt to occur.
*/
static inline
void VBlankIntrWait() { SystemCall(5); }
8000294: df05 svc 5
8000296: e7fd b.n 8000294 <main+0x24>
Huh, first time looking at ARM assembly, neat.
Yeah uh, good thing objdump has a --source option or I would be completely lost.
So I'm assuming we want to look at 800028c, where it does the interesting iprintf call?
Right. It seems to be using ldr (load with immediate offset) to put values in
registers r1 and r0. I'm assuming r0 is the printf format string, and r1 is the
first argument, which is loaded relative to the program counter.
Yes! And the argument in question ends up being read from 80002a0,
which... let me scroll down a bit:
Shell session
8000298: 080086cc .word 0x080086cc
800029c: 080086d4 .word 0x080086d4
80002a0: 6789abcd .word 0x6789abcd
80002a4: 080086e0 .word 0x080086e0
...which is a constant? Equal to 0x6789abcd?
Bwahahahahahahahahahahaha. Amos.
Do you want to hear a joke?
What's the worst thing about laundry day?
I don't know, what is the worst thing about laundry day?
The folding.
...okay. How does that help?
Well, I'm afraid GCC just went ahead and evaluated all your little unaligned
crimes at compile-time.
So it ends up printing a constant. Which is the expected result, if
unaligned reads were supported transparently.
Right! Because everything is constant - the array is a constant, the pointer
arithmetic and dereferencing will be the same for all runs... I see it now.
I can't believe I trusted GCC.
Don't go blaming GCC. GCC is generating the fastest code - no code at all.
By the way, you might want to disassemble your 64-bit Linux program as well.
Shell session
$ objdump --disassemble --source ./main
(cut)
int main() {
1040: 48 83 ec 08 sub rsp,0x8
uint8_t arr[8] = {0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01};
// arrays are zero-indexed, so `1` is the second item
uint32_t *ptr = (uint32_t *)(&arr[1]);
printf("0x%08x\n", *ptr);
1044: be cd ab 89 67 mov esi,0x6789abcd
1049: 48 8d 3d b4 0f 00 00 lea rdi,[rip+0xfb4] # 2004 <_IO_stdin_used+0x4>
1050: 31 c0 xor eax,eax
1052: e8 d9 ff ff ff call 1030 <printf@plt>
return 0;
}
Oh come on. It's just doing mov esi,0x6789abcd!
Yup. Which just goes to show: don't mess with something you don't
underst..
That's baloney. We definitely should mess with things we don't understand,
we should just verify our work, and possibly have more experienced folks
double-check it.
Okay, valid - how are you going to get out of that one, though?
Well, let's fix the Linux 64-bit version first - I want the CPU to be doing
that read, not one of GCC's passes.
What if I made a function? Then it definitely couldn't just constant-fold
all the fun away.
C code
#include <stdint.h>
#include <stdio.h>

uint32_t deref(uint32_t *ptr) {
    return *ptr;
}

int main() {
    uint8_t arr[8] = {0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01};
    uint32_t *ptr = (uint32_t *)(&arr[1]);
    // we are now calling deref:
    printf("0x%08x\n", deref(ptr));
    return 0;
}
Shell session
$ gcc -O3 -g main.c -o main && ./main
0x6789abcd
...now try disassembling it?
Shell session
$ objdump --disassemble --source ./main
0000000000001040 <main>:
uint32_t deref(uint32_t *ptr) {
return *ptr;
}
int main() {
1040: 48 83 ec 08 sub rsp,0x8
uint8_t arr[8] = {0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01};
uint32_t *ptr = (uint32_t *)(&arr[1]);
// we are now calling deref:
printf("0x%08x\n", deref(ptr));
1044: be cd ab 89 67 mov esi,0x6789abcd
1049: 48 8d 3d b4 0f 00 00 lea rdi,[rip+0xfb4] # 2004 <_IO_stdin_used+0x4>
1050: 31 c0 xor eax,eax
1052: e8 d9 ff ff ff call 1030 <printf@plt>
return 0;
}
1057: 31 c0 xor eax,eax
1059: 48 83 c4 08 add rsp,0x8
105d: c3 ret
105e: 66 90 xchg ax,ax
Bwahahahahahahha inliner goes brrrr.
Don't forget you're doing this to yourself.
Okay fine, fine, this is GCC, I can just tell it not to inline the function.
Then it has to pass the pointer through a register, uhh, %rdi probably? And it
has to do the actual memory read in the function to return it through %rax.
Okay, whatever you say champ!
C code
uint32_t __attribute__((noinline)) deref(uint32_t *ptr) {
    return *ptr;
}
Shell session
$ gcc -O3 -g main.c -o main && ./main
0x6789abcd
$ objdump --disassemble --source ./main
(cut)
00000000000011b0 <deref>:
return *ptr;
11b0: 8b 07 mov eax,DWORD PTR [rdi]
}
11b2: c3 ret
11b3: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
11ba: 00 00 00
11bd: 0f 1f 00 nop DWORD PTR [rax]
Shell session
0000000000001050 <main>:
uint32_t __attribute__((noinline)) deref(uint32_t *ptr) {
return *ptr;
}
int main() {
1050: 48 83 ec 18 sub rsp,0x18
uint8_t arr[8] = {0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01};
1054: 64 48 8b 04 25 28 00 mov rax,QWORD PTR fs:0x28
105b: 00 00
105d: 48 89 44 24 08 mov QWORD PTR [rsp+0x8],rax
1062: 48 b8 ef cd ab 89 67 movabs rax,0x123456789abcdef
1069: 45 23 01
uint32_t *ptr = (uint32_t *)(&arr[1]);
// we are now calling deref:
printf("0x%08x\n", deref(ptr));
106c: 48 8d 7c 24 01 lea rdi,[rsp+0x1]
uint8_t arr[8] = {0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01};
1071: 48 89 04 24 mov QWORD PTR [rsp],rax
printf("0x%08x\n", deref(ptr));
1075: e8 36 01 00 00 call 11b0 <deref>
107a: 48 8d 3d 83 0f 00 00 lea rdi,[rip+0xf83] # 2004 <_IO_stdin_used+0x4>
1081: 89 c6 mov esi,eax
1083: 31 c0 xor eax,eax
1085: e8 b6 ff ff ff call 1040 <printf@plt>
return 0;
}
Yes! Okay, that should do it.
Now let's do the same to our Game Boy Advance, uh, game:
C code
// includes omitted

uint32_t __attribute__((noinline)) deref(uint32_t *ptr) {
    return *ptr;
}

int main(void) {
    irqInit();
    irqEnable(IRQ_VBLANK);
    consoleDemoInit();

    // clear screen
    iprintf("\x1b[2J");
    // print at coordinates 10,10
    iprintf("\x1b[10;10H");

    uint8_t arr[8] = {0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01};
    uint32_t *ptr = (uint32_t *)(&arr[1]);
    // we are now calling deref:
    iprintf("0x%08x", deref(ptr));

    while (1) {
        VBlankIntrWait();
    }
}
Shell session
$ make && vbam -f 1 ./ansi_console/ansi_console.gba
(cut)
The above code, run on an actual Game Boy Advance.
@leo60228
Okay, that is pretty cool.
How is it getting to that result, though? Shouldn't the result be 0x89abcdef?
Well, no - the address isn't just "rounded down". The bottom 2 bits are not
part of the address that's being read, true - but the processor also
interprets them as a number of bytes by which to circular-shift the result to
the right.
It uses which bits to what now?
Okay it'll be simpler to just show you: if we wanted to emulate the behavior
of ARMv4t in our 64-bit Linux executable, we would do this:
C code
// performs a circular shift to the right by `n` bits
uint32_t circ_rshift(uint32_t input, int n) {
    // masking the left shift amount avoids an undefined
    // 32-bit shift when `n` is 0
    return (input >> n) | (input << ((32 - n) & 31));
}

uint32_t __attribute__((noinline)) deref_armv4t(uint32_t *input_addr) {
    size_t input = (size_t) input_addr;
    // everything except the bottom 2 bits
    const size_t ADDR_MASK = 0xfffffffffffffffc;
    // just the bottom 2 bits
    const size_t SHFT_MASK = ~ADDR_MASK;

    uint32_t *addr = (uint32_t *) (input & ADDR_MASK);
    int shift_bytes_amount = (int) (input & SHFT_MASK);
    int shift_bits_amount = shift_bytes_amount * 8;
    return circ_rshift(*addr, shift_bits_amount);
}

int main() {
    uint8_t arr[8] = {0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01};
    uint32_t *ptr = (uint32_t *)(&arr[1]);
    printf("0x%08x\n", deref_armv4t(ptr));
    return 0;
}
Shell session
$ gcc -O3 -g main.c -o main && ./main
0xef89abcd
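As an aside - Rust exposes this rotation directly as u32::rotate_right, so the same ARMv4 behavior can be sketched without any hand-rolled shifting (the bytes and expected value match the example above):

```rust
fn main() {
    // Same bytes as `arr` above, viewed as two u32 words so the
    // base address is guaranteed to be 4-byte aligned.
    let words: [u32; 2] = [0x89abcdef, 0x01234567];
    let base = words.as_ptr() as usize;
    let unaligned = base + 1; // like `&arr[1]`

    // Old-ARM behavior: the bottom 2 bits don't address memory...
    let aligned = (unaligned & !0b11) as *const u32;
    // ...they select how many bytes to rotate the result by.
    let rotation_bits = ((unaligned & 0b11) * 8) as u32;

    let value = unsafe { *aligned }.rotate_right(rotation_bits);
    assert_eq!(value, 0xef89abcd);
    println!("0x{:08x}", value);
}
```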
Okay, that'll take a few reads, but thanks.
This was a long-ass example though. Are you sure it was worth it?
Sure! Developing for small devices is fun. Pick up your phone and send "ARM"
to 0xef89abcd if you want more of this.
Let's take a minute to regroup.
Historically, it was easier - and cheaper - for processors to only provide
good support for aligned reads. So, for historical and architectural reasons,
natural data alignment has been very desirable in a language like C.
Which is why C automatically inserts padding when laying out structs, and
also why the outcome of "obtaining an unaligned pointer and dereferencing it"
is UB (undefined behavior) - it varies from platform to platform!
Because of that history, data alignment is a concept built deep into the C
language, and compiler infrastructure like LLVM - even though our modern
processors might be fine with it in some cases.
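(As an aside: Rust does give you a well-defined escape hatch here. Dereferencing an unaligned pointer is UB, but std::ptr::read_unaligned performs the same read with defined behavior - a minimal sketch:)

```rust
fn main() {
    let arr: [u8; 8] = [0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01];
    // this pointer is not 4-byte aligned...
    let ptr = arr[1..].as_ptr() as *const u32;
    // ...so `*ptr` would be UB - but read_unaligned is always defined
    let value = unsafe { std::ptr::read_unaligned(ptr) };
    if cfg!(target_endian = "little") {
        // bytes cd ab 89 67, read as a little-endian u32
        assert_eq!(value, 0x6789abcd);
    }
    println!("0x{:08x}", value);
}
```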
On modern x86-64 processors (post-Nehalem), unaligned memory access typically
comes at a much lower performance cost than, say, on a Core 2, especially if
the data is already in cache.
So, are there any reasons left to care about alignment on modern processors?
Yes! Even though performance is getting harder and harder to "get a feel for",
due to the increasing complexity of processor architectures, aligned memory
access performance remains more predictable.
We've also seen the advent of SIMD (Single Instruction, Multiple Data)
instruction sets, which as their name indicates, allow operating on large
amounts of data (more than 64 bits' worth) with a single instruction.
Those come with even stricter alignment requirements. For example, most SSE
instructions require their memory operands to be 16-byte (128-bit) aligned -
unless you reach for the explicitly-unaligned variants like movups.
Some atomic operations also have alignment requirements - for example, Go's
sync/atomic package documentation notes the following:
On ARM, x86-32, and 32-bit MIPS, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically. The first word in a variable or in an allocated struct, array, or slice can be relied upon to be 64-bit aligned.
https://godoc.org/sync/atomic#pkg-note-bug
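Rust takes a different route and bakes the requirement into the type: AtomicU64 is declared with 8-byte alignment, so the compiler arranges it for you wherever the value lives. A quick check:

```rust
use std::mem::align_of;
use std::sync::atomic::AtomicU64;

fn main() {
    // AtomicU64 carries an align(8) requirement in its definition, so
    // in a struct, array, or on the stack, it is always 8-byte aligned -
    // Go's "caller must arrange alignment" caveat doesn't apply here.
    assert_eq!(align_of::<AtomicU64>(), 8);
    println!("align_of::<AtomicU64>() = {}", align_of::<AtomicU64>());
}
```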
And finally, data alignment rules let us do the one trick smartstring relies on.
Let's clone smartstring locally, and navigate through it using rust-analyzer's
"Go To Definition" functionality.
The type we already know about is SmartString - it is defined as follows:
Rust code
#[cfg_attr(target_pointer_width = "64", repr(C, align(8)))]
#[cfg_attr(target_pointer_width = "32", repr(C, align(4)))]
pub struct SmartString<Mode: SmartStringMode> {
    data: MaybeUninit<InlineString<Mode>>,
}
In other words - it's mostly a wrapper over InlineString, which looks like:
Rust code
#[repr(C)]
pub(crate) struct InlineString<Mode: SmartStringMode> {
    pub(crate) marker: Marker,
    pub(crate) data: Mode::InlineArray,
}
Interesting! There's definitely a parallel with our homegrown version,
shown here as a refresher:
Rust code
struct Inline {
    len: u8,
    data: [u8; 23],
}
And indeed, Marker is "just an u8":
Rust code
#[derive(Clone, Copy, Debug)]
pub(crate) struct Marker(u8);
If we look at Marker's code, though, we notice something interesting: it seems
to contain both the discriminant, and 7 bits of data.
Rust code
impl Marker {
    #[inline(always)]
    fn assemble(discriminant: Discriminant, data: u8) -> u8 {
        let data = data;
        if UPSIDE_DOWN_LAND {
            data << 1 | discriminant.bit()
        } else {
            discriminant.bit() << 7 | data
        }
    }
}
How? Why?
Well... it's all thanks to data alignment.
I lied a little just now - when I said SmartString was "mostly just a wrapper
for InlineString". You know what else SmartString can be? A String!
When a SmartString is built from a String, it's literally just assigning that
String to itself, via std::ptr::write:
Rust code
impl<Mode: SmartStringMode> SmartString<Mode> {
    fn from_boxed(boxed: Mode::BoxedString) -> Self {
        let mut out = Self {
            data: MaybeUninit::uninit(),
        };
        let data_ptr: *mut Mode::BoxedString = out.data.as_mut_ptr().cast();
        unsafe { data_ptr.write(boxed) };
        out
    }
}
Ok so what you're saying is...
Sorry, no, I don't get it. Come again? A SmartString can be just a String?
How does it know if it's a String or an InlineString? Is it... is it somehow
using unused parts of the String struct to store the discriminant?
Is it using padding maybe? Padding is unused.
No, that would be undefined behavior! Padding can get randomly messed up if
the compiler thinks it's fast, remember?
Okay so, what else? What else is unused in String, which can be used to store
the discriminant of a SmartString?
Well - let's look at the definition of the String struct.
Rust code
pub struct String {
    vec: Vec<u8>,
}
Mh indeed. Let's look at the definition of Vec:
Rust code
pub struct Vec<T> {
    buf: RawVec<T>,
    len: usize,
}
It's turtles all the way down. Let's look at the definition of RawVec:
Rust code
pub struct RawVec<T, A: AllocRef = Global> {
    ptr: Unique<T>,
    cap: usize,
    alloc: A,
}
Okay... and finally, let's look at the definition of Unique:
Rust code
#[repr(transparent)]
pub struct Unique<T: ?Sized> {
    pointer: *const T,
    // (cut)
}
So to summarize, String owns a Vec<u8>, which owns a RawVec<u8>, which
contains a Unique<u8>, which is actually just a *const u8.
So, the first member of the String struct is a pointer to an u8. The second is
an usize, storing the capacity, and the third one is the length - also 8 bytes -
for a total of 24 bytes on a 64-bit target.
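We can check that claim directly - a quick sketch, with sizes that assume a 64-bit target:

```rust
use std::mem::size_of;

fn main() {
    // pointer (8) + capacity (8) + length (8) = 24 bytes on 64-bit
    assert_eq!(size_of::<String>(), 24);
    // bonus: the pointer inside is NonNull, so Option<String> can use
    // the all-zeroes bit pattern as its None "niche" - no extra space
    assert_eq!(size_of::<Option<String>>(), 24);
    println!("size_of::<String>() = {}", size_of::<String>());
}
```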
Right - String starts with a pointer.
But it's a pointer to an u8, right? Shouldn't it have an alignment of 1?
Right! But it points to heap-allocated memory. In other words, it's set to a
value returned by our friendly neighborhood system allocator (or a custom
allocator, which hopefully behaves).
And the malloc(3) man page says:
The malloc() and calloc() functions return a pointer to the allocated memory,
which is suitably aligned for any built-in type.
For any built-in type. So it doesn't even rely on the amount of memory
requested - even if we allocate a single byte, over and over, it'll
still give us 64-bit aligned addresses on 64-bit Linux.
Let's try allocating ONE MILLION BYTES, just for fun.
Why did you use your supervillain voice just now
I don't know what you're talking about
C code
#include <stdio.h>
#include <stdlib.h>

int main() {
    int aligned_count = 0;
    const int max = 1000 * 1000;
    // N.B: this leaks memory on purpose - if we freed, we might
    // not get unique addresses.
    for (int i = 0; i < max; i++) {
        void *ptr = malloc(4);
        // is this 64-bit aligned?
        if (((size_t) ptr % 8) == 0) {
            aligned_count += 1;
        }
    }
    printf("%d/%d addresses malloc returned were 64-bit aligned\n", aligned_count, max);
}
Shell session
$ gcc -Wall -O3 -g main.c -o main && ./main
1000000/1000000 addresses malloc returned were 64-bit aligned
Alright, so String starts with a pointer that is 8-byte-aligned on 64-bit
Linux. What does that mean?
Well - all multiples of 8 have the bottom 3 bits set to zero:
0 is 0b0
8 is 0b1000
16 is 0b10000
24 is 0b11000
etc.
In fact, we can also test alignment that way:
C code
#include <stdio.h>
#include <stdlib.h>

int main() {
    int aligned_count = 0;
    const int max = 1000 * 1000;
    for (int i = 0; i < max; i++) {
        void *ptr = malloc(4);
        // N.B: binary literals are a GCC extension
        if (((size_t) ptr & 0b111) == 0) {
            aligned_count += 1;
        }
    }
    printf("%d/%d addresses malloc returned had their bottom 3 bits zeroed\n", aligned_count, max);
}
Shell session
$ gcc -Wall -O3 -g main.c -o main && ./main
1000000/1000000 addresses malloc returned had their bottom 3 bits zeroed
Alright - so, on x86_64, pointers that are heap-allocated have their bottom 3
bits zeroed.
How does that help us?
Well... we can make an enum out of that!
Let's try to make our own smartstring again.
Since we want to beat Rust enums, we'll have to use a struct, again:
Rust code
#[derive(Default)]
struct SmartString {
    data: InlineString,
}
Our SmartString will be either a String or an InlineString:
Rust code
#[derive(Default)]
#[repr(C, align(8))]
struct InlineString {
    marker: u8,
    len: u8,
    contents: [u8; 22],
}
...so we need to make sure that both types have the same size, and the same
alignment (here, using the static_assertions crate's macros):
Rust code
assert_eq_size!(SmartString, String);
assert_eq_align!(SmartString, String);
Building a SmartString from a String is easy:
Rust code
impl SmartString {
    fn new_boxed(s: String) -> Self {
        let mut res: Self = Default::default();
        let ptr: *mut String = (&mut res.data) as *mut _ as *mut String;
        // `write` moves `s` in without dropping whatever was there
        // before (a plain `*ptr = s` would try to drop a bogus String)
        unsafe { ptr.write(s) };
        res
    }
}
Building the inline variant is slightly more involved - we need to copy over
the data, and we only have 22 bytes of storage. Luckily, using slice copying
operations, the standard library will panic if things go wrong, instead of
resulting in silent memory corruption.
Rust code
impl SmartString {
    fn new_inline(data: &str) -> Self {
        let mut inline: InlineString = Default::default();
        // set bottom bit so we can distinguish from a `String`
        inline.marker = 0x1;

        let src = data.as_bytes();
        // this will panic if `data` is too large
        let dst = &mut inline.contents[..data.len()];
        dst.copy_from_slice(src);
        inline.len = data.len() as u8;

        Self { data: inline }
    }
}
Pay extra attention to the marker field - it's at the very beginning of our
InlineString struct.
Mhh, aren't the bottom 3 bits at.. the bottom? In the last byte of the data
pointer?
They are! But we're on Little-Endian, so that "last" byte is stored first.
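To make that concrete, here's a small sketch showing that on a little-endian target, the least significant byte of a value is the one stored first in memory:

```rust
fn main() {
    let x: u64 = 0x1122_3344_5566_7788;
    // to_ne_bytes gives us the bytes in memory order for this target
    let bytes = x.to_ne_bytes();
    if cfg!(target_endian = "little") {
        // the least significant byte (0x88) is stored first...
        assert_eq!(bytes[0], 0x88);
        // ...and the most significant byte (0x11) last
        assert_eq!(bytes[7], 0x11);
    }
    println!("{:02x?}", bytes);
}
```

So the bottom bits of an 8-byte-aligned pointer overlap the first byte of the struct - exactly where marker lives.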
So, once we have that, we can easily determine whether our current variant is
Boxed or Inline:
Rust code
#[derive(Debug)]
enum Discriminant {
    Boxed,
    Inline,
}

impl SmartString {
    fn discriminant(&self) -> Discriminant {
        if self.data.marker & 0x1 == 0 {
            Discriminant::Boxed
        } else {
            Discriminant::Inline
        }
    }
}
Then, we can make a helper to get an Option<&String>, which will be Some only
for the boxed variant:
Rust code
impl SmartString {
    fn as_boxed(&self) -> Option<&String> {
        if let Discriminant::Boxed = self.discriminant() {
            #[allow(clippy::transmute_ptr_to_ref)]
            Some(unsafe { std::mem::transmute(&self.data as *const _) })
        } else {
            None
        }
    }
}
And finally, we can add an as_str function:
Rust code
impl SmartString {
    fn as_str(&self) -> &str {
        match self.as_boxed() {
            Some(s) => s.as_str(),
            None => unsafe {
                std::str::from_utf8_unchecked(&self.data.contents[..self.data.len as usize])
            },
        }
    }
}
And we've got ourselves a proof of concept!
Let's take our second attempt at smartstring
out for a run:
Rust code
fn main() {
    let a = SmartString::new_boxed("Hi there".into());
    let b = SmartString::new_inline("Hi there");
    println!("{:?} => {}", a.discriminant(), a.as_str());
    println!("{:?} => {}", b.discriminant(), b.as_str());
}
Shell session
$ cargo run -q
Boxed => Hi there
Inline => Hi there
Neat!
Okay, but - apart from missing 95% of smartstring's features, our version is
not as good. It only has 22 bytes of inline storage!
That's right! And smartstring has 23. How does that work?
Well... see our marker field here?
Rust code
#[derive(Default)]
#[repr(C, align(8))]
struct InlineString {
    // that one
    marker: u8,
    len: u8,
    contents: [u8; 22],
}
We're only using one bit of it. With a little creativity, we can use the
other 7 bits to store the length, which gives us... 128 possible values.
Since we only need to count up to 23, that works for us!
Let's make a newtype just to handle the bit twiddling:
Rust code
#[derive(Default)]
#[repr(C, align(8))]
struct InlineString {
    // new, was just a u8
    marker: Marker,
    // note that `len` is gone
    // new, used to be 22
    contents: [u8; 23],
}

#[derive(Default, Clone, Copy, Debug)]
#[repr(transparent)]
struct Marker {
    val: u8,
}
And here is the bit twiddling in question. When assembling, we shift
everything to the left, and do a binary OR with 0x1
, our discriminant bit.
And when we need to retrieve our data, we just shift everything to the right
by one, and the discriminant bit "falls off":
Rust code
impl Marker {
    fn assemble(len: usize) -> Self {
        assert!(len <= 23);
        Self {
            val: 0x1 | ((len as u8) << 1),
        }
    }

    fn len(&self) -> usize {
        (self.val >> 1) as _
    }
}
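A standalone sanity check of that scheme - written with hypothetical free functions instead of the Marker type, so it runs on its own: bit 0 carries the discriminant, bits 1 through 7 carry the length.

```rust
// free-function sketch of Marker::assemble / Marker::len
fn assemble(len: usize) -> u8 {
    assert!(len <= 23);
    // discriminant bit in bit 0, length in bits 1..=7
    0x1 | ((len as u8) << 1)
}

fn len(marker: u8) -> usize {
    // shift right by one; the discriminant bit "falls off"
    (marker >> 1) as usize
}

fn main() {
    for l in 0..=23 {
        let m = assemble(l);
        assert_eq!(m & 0x1, 1); // discriminant bit always set
        assert_eq!(len(m), l); // length survives the round-trip
    }
    println!("all 24 lengths round-trip");
}
```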
Then we just need to adjust SmartString::discriminant to read from marker.val
instead:
Rust code
impl SmartString {
    fn discriminant(&self) -> Discriminant {
        if self.data.marker.val & 0x1 == 0 {
            Discriminant::Boxed
        } else {
            Discriminant::Inline
        }
    }
}
And SmartString::as_str to get its length from the marker:
Rust code
impl SmartString {
    pub fn as_str(&self) -> &str {
        match self.as_boxed() {
            Some(s) => s.as_str(),
            None => unsafe {
                std::str::from_utf8_unchecked(&self.data.contents[..self.data.marker.len()])
            },
        }
    }
}
And it all works together!
Shell session
$ cargo run -q
Boxed => Hi there
Inline => Hi there
And that's exactly how smartstring works on little-endian architectures.
Okay, that one's fun.
Do you know how much memory you can address with 64 bits?
18.45 exabytes (or 16 exbibytes).
In practice, in the current implementations of the x86_64 (AMD64)
architecture, only the lower 48 bits are used, so we can only address
281.5 terabytes of memory (256 tebibytes).
Which means the high 16 bits are all zeroed, as well! Note that this is not
guaranteed to remain true in the future - it's an implementation detail.
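We can peek at a real heap address to see this - again, relying on an implementation detail of today's x86_64, not a guarantee:

```rust
fn main() {
    let boxed = Box::new(42u8);
    let addr = &*boxed as *const u8 as usize;
    // on current x86_64 Linux, user-space addresses fit in 48 bits,
    // so the top 16 bits of a heap pointer are zero
    assert_eq!(addr >> 48, 0);
    println!("addr = {:#018x}, top 16 bits = {:#x}", addr, addr >> 48);
}
```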
Now, x86_64 is little-endian, but the same principle applies for big-endian
architectures - in practice, none of them need more than 281 terabytes of
memory right now, so they also don't use the high 16 bits, and so, we can
use that for our discriminant.
The only difference is, we now use 0x80
to get/set our marker bit, and
the length is shifted to the right when set, and to the left when read.
But the marker byte is at the same position!
It is! And it answers one half of the question - why is smartstring so damn
compact. The other half of the question is: why isn't smallvec as compact?
Even the smallest SmallVec is 32 bytes:
Rust code
fn main ( ) {
let small_v = smallvec:: SmallVec:: < [ u8 ; 1 ] > :: new ( ) ;
let stand_v = Vec :: < u8 > :: new ( ) ;
use std:: mem:: size_of_val;
dbg ! ( size_of_val( &small_v) , size_of_val( &stand_v) ) ;
}
Shell session
$ cargo run -q
[src/main.rs:100] size_of_val(&small_v) = 32
[src/main.rs:100] size_of_val(&stand_v) = 24
Well, if we look at SmallVec's definition, we see this:
Rust code
pub struct SmallVec<A: Array> {
    capacity: usize,
    data: SmallVecData<A>,
}
And SmallVecData is... you guessed it, an enum:
Rust code
#[cfg(not(feature = "union"))]
enum SmallVecData<A: Array> {
    Inline(MaybeUninit<A>),
    Heap((*mut A::Item, usize)),
}
But what's this... a union feature? Let's read the docs:
When the union feature is enabled smallvec will track its state (inline or
spilled) without the use of an enum tag, reducing the size of the smallvec by
one machine word. This means that there is potentially no space overhead
compared to Vec. Note that smallvec can still be larger than Vec if the inline
buffer is larger than two machine words.
Let's try it out!
TOML markup
# in Cargo.toml
[dependencies]
smallvec = { version = "1.4.2", features = ["union"] }
Let's also make our smallvec a bit bigger:
Rust code
fn main() {
    // was `u8; 1`
    let small_v = smallvec::SmallVec::<[u8; 16]>::new();
    let stand_v = Vec::<u8>::new();

    use std::mem::size_of_val;
    dbg!(size_of_val(&small_v), size_of_val(&stand_v));
}
Shell session
$ cargo run -q
[src/main.rs:100] size_of_val(&small_v) = 24
[src/main.rs:100] size_of_val(&stand_v) = 24
There we go! No overhead!
Our SmallVec only becomes larger than Vec if we exceed the size of "two
machine words" (two u64 here):
Rust code
fn main() {
    // was `u8; 16`
    let small_v = smallvec::SmallVec::<[u8; 32]>::new();
    let stand_v = Vec::<u8>::new();

    use std::mem::size_of_val;
    dbg!(size_of_val(&small_v), size_of_val(&stand_v));
}
Shell session
$ cargo run -q
[src/main.rs:100] size_of_val(&small_v) = 40
[src/main.rs:100] size_of_val(&stand_v) = 24
So there we have it, the definitive answer to the question "why is SmartString
the same size as String, but SmallVec is larger than Vec?".
And the answer is: that's not true - SmallVec can be small as well (as long as
the inline capacity is 16 bytes or less), but either way, it involves a lot of
neat memory layout tricks.
Special thanks to @rawrafox, @trav_downs, @exelotl, @rzidane360, @lcnr7, and
@maybewaffle, who helped research this article and should NOT be held
accountable for any inaccuracies, as I ran off into the distance and wrote
everything myself anyway.