Working with strings in Rust

Feb 19, 2020 34 minute read rust

Clear, simple, and wrong
A very quick UTF-8 primer
Back to C
Now for some Rust
Error handling
Iteration
Passing strings around
uppercase, but in Rust
Closing words

There's a question that always comes up when people pick up the Rust programming language: why are there two string types? Why is there String, and &str?

My Declarative Memory Management article answers the question partially, but there is a lot more to say about it, so let's run a few experiments and see if we can conjure up a thorough defense of Rust's approach over, say, C's.

Clear, simple, and wrong

Let's start with a simple C program that prints its arguments.

C code

// in `print.c`

#include <stdio.h> // for printf

int main(int argc, char **argv) {
    for (int i = 0; i < argc; i++) {
        char *arg = argv[i];
        printf("%s\n", arg);
    }

    return 0;
}

Shell session

$ gcc print.c -o print
$ ./print "ready" "set" "go"
./print
ready
set
go

Okay! Simple enough.

We're using the standard C11 main function signature, which takes the number of argument (argc, for argument count) as an int, and an "array" of "strings" (argv, for argument vector) as a char**, or char *[].

Then we use the printf format specifier %s to print each argument as a string - followed by \n, a newline. And sure enough, it prints each argument on its own line.

Before proceeding, let's make sure we have a proper understanding of what's going on.

C code

// in `print.c`

int main(int argc, char **argv) {
    printf("argv = %p\n", argv); // new!
    for (int i = 0; i < argc; i++) {
        char *arg = argv[i];
        printf("argv[%d] = %p\n", i, argv[i]); // new!
        printf("%s\n", arg);
    }

    return 0;
}

Now we're using the %p format specifier, which prints.. pointers!

Shell session

$ gcc print.c -o print
$ ./print "ready" "set" "go"
argv = 0x7ffcc35d84a8
argv[0] = 0x7ffcc35d9039
./print
argv[1] = 0x7ffcc35d9041
ready
argv[2] = 0x7ffcc35d9047
set
argv[3] = 0x7ffcc35d904b
go

Okay, so, argv is an array of addresses, and at those addresses, there is.. string data. Something like that:

Mhh. How does printf's %s specifier know when to stop printing? Since it just gets a single address, not a start and end address, or a start address and a length?

Let's try printing each argument ourselves:

C code

// in `print.c`

#include <stdio.h> // printf

int main(int argc, char **argv) {
    for (int i = 0; i < argc; i++) {
        char *arg = argv[i];
        // we don't know where to stop, so let's just print 15 characters.
        for (int j = 0; j < 15; j++) {
            char character = arg[j];
            // the %c specifier is for characters
            printf("%c", character);
        }
        printf("\n");
    }

    return 0;
}

Shell session

$ gcc print.c -o print
$ ./print "ready" "set" "go"
./printreadys
readysetgoCD
setgoCDPATH=.
goCDPATH=.:/ho

Uh oh. Our command-line arguments are bleeding into each other. It looks like they... all follow each other?

Let's try piping our program into a hexadecimal dumper like xxd, to see exactly what's going on:

Shell session

$ # note: "-g 1" means "show groups of one byte",
$ # xxd defaults to "-g 2".
$ ./print "ready" "set" "go" | xxd -g 1
00000000: 2e 2f 70 72 69 6e 74 00 72 65 61 64 79 00 73 0a  ./print.ready.s.
00000010: 72 65 61 64 79 00 73 65 74 00 67 6f 00 43 44 0a  ready.set.go.CD.
00000020: 73 65 74 00 67 6f 00 43 44 50 41 54 48 3d 2e 0a  set.go.CDPATH=..
00000030: 67 6f 00 43 44 50 41 54 48 3d 2e 3a 2f 68 6f 0a  go.CDPATH=.:/ho.

AhAH! They do follow each other, but there's something in between - here's the same output, annotated with ^^ where the separators are:

00000000: 2e 2f 70 72 69 6e 74 00 72 65 61 64 79 00 73 0a  ./print.ready.s.
          .  /  p  r  i  n  t  ^^ r  e  a  d  y  ^^

It looks like every argument is terminated by the value 0. Indeed, C has null-terminated strings.

So we can "fix" our printing program:

C code

#include <stdio.h> // printf

int main(int argc, char **argv) {
    for (int i = 0; i < argc; i++) {
        char *arg = argv[i];
        // note: the loop condition is gone, we just loop forever.
        // well, until a 'break' at least.
        for (int j = 0;; j++) {
            char character = arg[j];

            // technically, we ought to use '\0' rather than just 0,
            // but even `gcc -Wall -Wextra -Wpedantic` doesn't chastise
            // us, so let's just go with it.
            if (character == 0) {
                break;
            }
            printf("%c", character);
        }
        printf("\n");
    }

    return 0;
}

Shell session

$ gcc print.c -o print
$ ./print "ready" "set" "go"
./print
ready
set
go

All better! Although, we need to fix our diagram, too:

Cool bear's hot tip

You may have noticed that when our print program went beyond the end of our arguments, it showed CDPATH=.:/ho too.

That was (part of) an environment variable! Those are stored right next to the program's arguments in glibc, the GNU C library.

But the specifics are out of scope for this article, you may want to check out the Making our own executable packer series instead.

Okay! Now that we fully understand what's going on, let's do something a little more interesting: printing our arguments converted to upper-case.

So if we run ./print hello, it should print HELLO.

We'll also skip the first argument, because it's the name of the program, and that's not really interesting to us right now.

C code

#include <stdio.h> // printf
#include <ctype.h> // toupper

int main(int argc, char **argv) {
    // start from 1, skips program name
    for (int i = 1; i < argc; i++) {
        char *arg = argv[i];
        for (int j = 0;; j++) {
            char character = arg[j];
            if (character == 0) {
                break;
            }
            printf("%c", toupper(character));
        }
        printf("\n");
    }

    return 0;
}

Shell session

$ gcc print.c -o print
$ ./print "hello"
HELLO

Alright! Nice going! Looks feature-complete to me, ready to be shipped.

Out of an abundance of caution, let's run one last test:

Shell session

$ gcc print.c -o print
$ ./print "élément"
éLéMENT

Oh, uh, woops. What we really wanted was "ÉLÉMENT" but clearly, we haven't yet figured out everything that's going on.

Okay, maybe upper-casing is too complicated for now, let's do something simpler: print each character separated by a space.

C code

// in `print.c`

#include <stdio.h> // printf

int main(int argc, char **argv) {
    for (int i = 1; i < argc; i++) {
        char *arg = argv[i];
        for (int j = 0;; j++) {
            char character = arg[j];
            if (character == 0) {
                break;
            }
            // notice the space following `%c`
            printf("%c ", character);
        }
        printf("\n");
    }

    return 0;
}

Shell session

$ gcc print.c -o print
$ ./print "élément"
  l   m e n t

Oh no. This won't do - it won't do at all.

Let's go back to our last well-behaved version, the one that just printed each character, without spaces in between, and see what the output actually was

C code

// in main
// in for
// in second for
            printf("%c", character); // notice the lack of space after `%c`

Shell session

$ gcc print.c -o print
$ ./print "élément" | xxd -g 1
00000000: c3 a9 6c c3 a9 6d 65 6e 74 0a                    ..l..ment.
          ^^^^^    ^^^^^

If I'm, uh, reading this correctly, "é" is not a char, it's actually two chars in a trenchcoat.

That seems... strange.

Let's write a quick JavaScript program and run it with Node.js:

JavaScript code

// in `print.js`

const { argv, stdout } = process;

// we have to skip *two* arguments: the path to node,
// and the path to our script
for (const arg of argv.slice(2)) {
    for (const character of arg) {
        stdout.write(character);
        stdout.write(" ");
    }
    stdout.write("\n");
}

Shell session

$ node print.js "élément"
é l é m e n t

Ah! Much better! Can Node.js also convert to upper-case properly?

JavaScript code

// in `print.js`

const { argv, stdout } = process;

for (const arg of argv.slice(2)) {
    stdout.write(arg.toUpperCase());
    stdout.write("\n");
}

Shell session

$ node print.js "élément"
ÉLÉMENT

It can. Let's look at a hexadecimal dump:

Shell session

$ node print.js "élément" | xxd -g 1
00000000: c3 89 4c c3 89 4d 45 4e 54 0a                    ..L..MENT.
          ^^^^^    ^^^^^

Although our Node.js program behaves as expected, we can see that É is also different from the other letters, and that the upper-case counterpart *of "c3 a9" is "c3 89".

Our C program didn't work - it couldn't work, because it was only seeing "c3" and "a9" individually, when it should have considered it as a single, uh, "Unicode scalar value".

Why is "é" encoded as "c3 a9"? It's time for a very quick UTF-8 encoding course.

A very quick UTF-8 primer

So, characters like "abcdefghijklmnopqrstuvwxyz", "ABCDEFGHIJKLMNOPQRSTUVWXYZ" and "123456789", and "!@#$%^&*()", etc., all have numbers.

For example, the number for "A" is 65. Why is that so? It's a convention! All a computer knows about is numbers, and we often use bytes as the smallest unit, so, a long time ago, someone just decided that if a byte has the value 65, then it refers to the letter "A".

Since ASCII is a 7-bit encoding, it has 128 possible values: from 0 to 127 (inclusive). But, on modern machines at least, a byte is 8 bits, so there's another 128 possible values.

Great! Everyone thought. We can just stuff "special characters" in there:

It's not... just ASCII, it's ASCII plus 128 characters of our choice. Of course, there's a lot of languages out there, so not every language's non-ASCII character can fit in those additional 128 values, so there were several alternative interpretations of those any value that was greater than 127.

Those interpretations were named "codepages". The picture above is Codepage 437, also known as CP437, OEM-US, OEM 437, PC-8, or DOS Latin US.

It's sorta adequate for languages like French, if you don't care about capital letters. It's not adequate at all for Eastern european languages, and doesn't even begin to cover Asian languages.

So, Japan came up with its own thing, where they replaced ASCII's backslash with a yen sign, the tilde with an overline (sure, why not), and introduced double-byte characters, because 128 extra characters sure wasn't enough for them.

Cool bear's hot tip

Wait, replacing backslash? Does that mean... in file paths... ?

..yep.

And for the languages with smaller alphabets, people used other code pages like Windows-1252 for years, and most text in the Western world was still sorta kinda ASCII, also known as "extended ASCII".

But eventually, the world collectively started to put their affairs in order and settled on UTF-8, which:

Looks like ASCII (not extended) for ASCII characters, and uses the same space.
Allows for a lot more characters - like, billions of them with multi-byte sequences.

Of course, before that happened, people asked, isn't two bytes enough? (Or sequences of two two-byte characters?), and surely four bytes is okay, but eventually, for important reasons like compactness, and keeping most C programs half-broken instead of completely broken, everyone adopted UTF-8.

Except Microsoft.

Well, okay, they kinda did, although it feels like too little, too late. Everything is still UTF-16 internally. RIP.

Cool bear's hot tip

Speaking of, the Bush hid the facts saga is hilarious.

So, yeah, ASCII plus multi-byte character sequences, how does it even work? Well, it's the same basic principle, each character has a value, so in Unicode, the number for "é" is "e9" - we usually write codepoints like so: "U+00E9".

And 0xE9 is 233 in decimal, which is greater than 127, so, it's not ASCII, and we need to do multi-byte encoding.

How does UTF-8 do multi-byte encoding? With bit sequences!

If a byte starts with 110 it means we'll need two bytes
If a byte starts with 1110 it means we'll need three bytes
If a byte starts with 11110 it means we'll need four bytes
If a byte starts with 10, it means it's a continuation of a multi-byte character sequence.

So, for "é", which has codepoint U+00E9, its binary representation is "11101001", and we know we're going to need two bytes, so we should have something like this:

We can see in the lower part that two-byte UTF-8 sequences give us 11 bits of storage: 5 bits in the first byte, and 6 bits in the second byte. We only need to fit 8 bits, so we fill them from right to left, first the last 6 bits:

Then the remaining 2 bits:

The rest is padding, filled with zeroes:

We're done! 0b11000011 is 0xC3, and 0b10101001 is 0xA9.

Which corresponds to what we've seen earlier - "é" is "c3 a9".

Back to C

So, uh, our C program. If we want to really separate characters, we have to do some UTF-8 decoding.

We can do that ourselves - no, really! Well, we can try anyway.

C code

// in `print.c`

#include <stdio.h> // printf
#include <stdint.h> // uint8_t

void print_spaced(char *s) {
    // start at the beginning
    int i = 0;

    while (1) {
        // we're going to be shifting bytes around,
        // so treat them like unsigned 8-bit values
        uint8_t c = s[i];
        if (c == 0) {
            // reached null terminator, stop printing
            break;
        }

        // length of the sequence, ie., number of bytes
        // that encode a single Unicode scalar value
        int len = 1;
        if (c >> 5 == 0b110) {
            len = 2;
        } else if (c >> 4 == 0b1110) {
            len = 3;
        } else if (c >> 3 == 0b11110) {
            len = 4;
        }

        // print the entire UTF-8-encoded Unicode scalar value
        for (; len > 0; len--) {
            printf("%c", s[i]);
            i++;
        }
        // print space separator
        printf(" ");
    }
}

int main(int argc, char **argv) {
    for (int i = 1; i < argc; i++) {
        print_spaced(argv[i]);
        printf("\n");
    }

    return 0;
}

There! Simple! None of that String and &str business. In fact, there's a remarkable lack of Rust code for an article about Rust string handling, and we're about ten minutes in already!

Cool bear's hot tip

Binary literals, e.g. 0b100101010 are not standard C, they're a GNU extension. Normally you'd see hexadecimal literals, e.g. 0xDEADBEEF, but it's much harder to see what's going on since UTF-8 deals with individual bits.

Does our program work?

Shell session

$ gcc print.c -o print
$ ./print "eat the rich"
e a t   t h e   r i c h

So far so good!

Shell session

$ ./print "platée de rösti"
p l a t é e   d e   r ö s t i

Nice!

Shell session

$ ./print "23€ ≈ ¥2731"
2 3 €   ≈   ¥ 2 7 3 1

Cool!

Shell session

$ ./print "text 🤷 encoding"
t e x t   🤷   e n c o d i n g

Alright!

Well I don't know what everyone is complaining about, UTF-8 is super easy to implement, it only took us a few minutes and it is 100% correct and accurate and standards-compliant and it will work forever on all inputs and always do the right thing

...or will it?

Here comes the counter-example... I can feel it in my bones.

Consider the following string:

Shell session

$ echo "noe\\u0308l"
noël

It's just Christmas in French! Surely our program can handle that, no sweat:

Shell session

$ ./print $(echo "noe\\u0308l")
n o e ̈ l

Uh oh.

Turns out U+0308 is a "combining diaeresis", which is fancy talk for "just slap two dots on the previous character".

In fact, we can slap more of them if we want (for extra Christmas cheer):

Cool bear's hot tip

The combination of multiple scalar values that end up showing a single "shape" are called "grapheme clusters", and you should read Henri Sivonen's It’s Not Wrong that "🤦🏼‍♂️".length == 7 if you want to learn more about them.

Also, since I'm French, I'm writing this whole article with a Latin-1 bias, I haven't even talked about right-to-left scripts, vowel modifiers, syllable blocks, etc. - luckily, Manish Goregaokar has a great piece on that, Breaking Our Latin-1 Assumptions.

Okay. OKAY. So maybe our program doesn't implement all the subtleties of UTF-8 encoding, but, you know, we came pretty close.

We haven't tried converting things to upper-case in a while though.

Can we do that? Is that legal?

Let's find out.

So we'll disregard combining characters for now, and focus on Unicode scalar values. What we want is to:

Decode our input, ie. transform it from UTF-8 to a series of Unicode scalar values (we'll pick uint32_t)
Transform said scalar values to their upper-case counterparts
Re-encode as UTF-8
Print to the console

So let's start with a decode_utf8 function. We'll only handle 2-byte sequences:

C code

// in `upper.c`

#include <stdio.h> // printf
#include <stdint.h> // uint8_t, uint32_t
#include <stdlib.h> // exit

void decode_utf8(char *src, uint32_t *dst) {
    int i = 0;
    int j = 0;

    while (1) {
        uint8_t c = src[i];
        if (c == 0) {
            dst[j] = 0;
            break; // null terminator
        }

        uint32_t scalar;
        int len;

        if (c >> 3 == 0b11110) {
            fprintf(stderr, "decode_utf8: 4-byte sequences are not supported!\n");
            exit(1);
        } if (c >> 4 == 0b1110) {
            fprintf(stderr, "decode_utf8: 3-byte sequences are not supported!\n");
            exit(1);
        } else if (c >> 5 == 0b110) {
            // 2-byte sequence
            uint32_t b1 = (uint32_t) src[i];
            uint32_t b2 = (uint32_t) src[i + 1];
            uint32_t mask1 = 0b0000011111000000;
            uint32_t mask2 = 0b0000000000111111;

            scalar = ((b1 << 6) & mask1) | ((b2 << 0) & mask2);
            len = 2;
        } else {
            // 1-byte sequence
            scalar = (uint32_t) c;
            len = 1;
        }
        dst[j++] = scalar;
        i += len;
    }
}

int main(int argc, char **argv) {
    uint32_t scalars[1024]; // hopefully that's enough
    decode_utf8(argv[1], scalars);

    for (int i = 0;; i++) {
        if (scalars[i] == 0) {
            break;
        }
        printf("U+%04X ", scalars[i]);
    }
    printf("\n");

    return 0;
}

Shell session

$ gcc upper.c -o upper
$ ./upper "noël"
U+006E U+006F U+00EB U+006C

Logically, U+00EB should be the codepoint for "ë"... and it is!

Its full name is "Latin Small Letter E with Diaeresis". Okay.

So now we just have to do the reverse conversion - easy!

C code

// in `upper.c`

void encode_utf8(uint32_t *src, char *dst) {
    int i = 0;
    int j = 0;

    while (1) {
        uint32_t scalar = src[i];

        if (scalar == 0) {
            dst[j] = 0; // null terminator
            break;
        }

        if (scalar > 0b11111111111) {
            fprintf(stderr, "Can only encode codepoints <= 0x%x", 0b11111111111);
            exit(1);
        }

        if (scalar > 0b1111111) { // 7 bits
            // 2-byte sequence

            uint8_t b1 = 0b11000000 | ((uint8_t) ((scalar & 0b11111000000) >> 6));
            //           2-byte marker              first 5 of 11 bits

            uint8_t b2 = 0b10000000 | ((uint8_t) (scalar & 0b111111));
            //           continuation               last 6 of 11 bits  

            dst[j + 0] = b1;
            dst[j + 1] = b2;
            j += 2;
        } else {
            // 1-byte sequence
            dst[j] = (char) scalar;
            j++;
        }

        i++;
    }
}

// omitted: decode_utf8

int main(int argc, char **argv) {
    uint32_t scalars[1024]; // hopefully that's enough
    decode_utf8(argv[1], scalars);

    for (int i = 0;; i++) {
        if (scalars[i] == 0) {
            break;
        }
        printf("U+%04X ", scalars[i]);
    }
    printf("\n");

    uint8_t result[1024]; // yolo
    encode_utf8(scalars, result);

    printf("%s\n", result);

    return 0;
}

Shell session

$ gcc upper.c -o upper
$ ./upper "noël"
U+006E U+006F U+00EB U+006C 
noël

Fantastic!

Now all we need is some sort of conversion table! From lower-case codepoints to their upper-case equivalents.

We'll fill in just enough to support French:

C code

#include <ctype.h> // toupper

int main(int argc, char **argv) {
    uint32_t scalars[1024]; // hopefully that's enough
    decode_utf8(argv[1], scalars);

    for (int i = 0;; i++) {
        if (scalars[i] == 0) {
            break;
        }
        printf("U+%04X ", scalars[i]);
    }
    printf("\n");

    // this is the highest codepoint we can decode/encode successfully
    const size_t table_size = 0b11111111111;
    uint32_t lower_to_upper[table_size];
    // initialize the table to just return the codepoint unchanged
    for (uint32_t cp = 0; cp < table_size; cp++) {
        lower_to_upper[cp] = cp;
    }
    // set a-z => A-Z
    for (int c = 97; c <= 122; c++) { // ha.
        lower_to_upper[(uint32_t) c] = (uint32_t) toupper(c);
    }

    // note: nested functions is a GNU extension!
    void set(char *lower, char *upper) {
        uint32_t lower_s[1024];
        uint32_t upper_s[1024];
        decode_utf8(lower, lower_s);
        decode_utf8(upper, upper_s);
        for (int i = 0;; i++) {
            if (lower_s[i] == 0) {
                break;
            }
            lower_to_upper[lower_s[i]] = upper_s[i];
        }
    }
    // set a few more
    set(
        "éêèàâëüöïÿôîçæœ",
        "ÉÊÈÀÂËÜÖÏŸÔÎÇÆŒ"
    );

    // now convert our scalars to upper-case
    for (int i = 0;; i++) {
        if (scalars[i] == 0) {
            break;
        }
        scalars[i] = lower_to_upper[scalars[i]];
    }

    uint8_t result[1024]; // yolo
    encode_utf8(scalars, result);

    printf("%s\n", result);

    return 0;
}

Let's take it for a spin:

Shell session

$ gcc upper.c -o upper
$ ./upper "Voix ambiguë d'un cœur qui, au zéphyr, préfère les jattes de kiwis"
U+0056 U+006F U+0069 U+0078 U+0020 U+0061 U+006D U+0062 U+0069 U+0067 U+0075 U+00EB U+0020 U+0064 U+0027 U+0075 U+006E U+0020 U+0063 U+0153 U+0075 U+0072 U+0020 U+0071 U+0075 U+0069 U+002C U+0020 U+0061 U+0075 U+0020 U+007A U+00E9 U+0070 U+0068 U+0079 U+0072 U+002C U+0020 U+0070 U+0072 U+00E9 U+0066 U+00E8 U+0072 U+0065 U+0020 U+006C U+0065 U+0073 U+0020 U+006A U+0061 U+0074 U+0074 U+0065 U+0073 U+0020 U+0064 U+0065 U+0020 U+006B U+0069 U+0077 U+0069 U+0073 
VOIX AMBIGUË D'UN CŒUR QUI, AU ZÉPHYR, PRÉFÈRE LES JATTES DE KIWIS

Wonderful!

Now for some Rust

Okay, so that was about 140 lines of C (with GNU extensions) - here's the gist if you want to mess with it.

Obviously, there's libraries to do all that for you, correctly. ICU is a popular choice, for its irritating completeness, but there's lighter alternatives that may suit your needs.

Cool bear's hot tip

Proper UTF-8 handling is not just about computer security - software impacts actual human safety (CW: death) all the time.

Let's see what a Rust program that does the same thing would look like, though:

Shell session

$ cargo new rustre
     Created binary (application) `rustre` package
$ cd rustre

Rust code

fn main() {
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");
    println!("{}", arg.to_uppercase());
}

Cool bear's hot tip

Turbo explanation of the above: std::env::args() returns an Iterator of strings. skip(1) ignores the program name (which is usually the first argument), next() gets the next element in the iterator (the first "real") argument.

By that point we have an Option<String> - there might be a next argument, or there might not be. If there isn't, .expect(msg) stops the program by printing msg. If there is, we now have a String!

Shell session

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/rustre`
thread 'main' panicked at 'should have one argument', src/libcore/option.rs:1188:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

Okay! So when we don't pass an argument, it says so. Wonderful.

Let's pass a few test strings:

Shell session

$ cargo run --quiet -- "noël"
NOËL
$ cargo run --quiet -- "trans rights"
TRANS RIGHTS
$ cargo run --quiet -- "voix ambiguë d'un cœur qui, au zéphyr, préfère les jattes de kiwis"
VOIX AMBIGUË D'UN CŒUR QUI, AU ZÉPHYR, PRÉFÈRE LES JATTES DE KIWIS
$ cargo run --quiet -- "heinz große"
HEINZ GROSSE

Everything checks out!

That last one is particularly cool - in German, "ß" (eszett) is indeed a ligature for "ss". Well, it's complicated, but that's the gist.

Error handling

So, Rust definitely behaves as if strings were UTF-8, that means it must decode our command-line arguments at some point. And that means... this can fail.

But I only see error handling for when we have no arguments at all, not for when the arguments are invalid utf-8.

Speaking of... what is invalid utf-8? Well, we've seen that "é" is encoded as "c3 e9", so this works:

Shell session

$ cargo run --quiet -- $(printf "\\xC3\\xA9")
É

And it works with our C program too:

$ ../print $(printf "\\xC3\\xA9")
É

And we've seen that a two-byte UTF-8 sequence has:

An indication in the first byte that it was a two-byte sequence (the first three bits, 110)
An indication in the second byte that it was a continuation of a multi-byte sequence (the first two bits, 10)

So.. what if we start a two-byte sequence and then abruptly stop? What if we pass C3, but not A9?

Shell session

$ cargo run --quiet -- $(printf "\\xC3")
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "\xC3"', src/libcore/result.rs:1188:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

AhAH! A panic. Let's look at the backtrace:

Shell session

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "\xC3"', src/libcore/result.rs:1188:5                                                
stack backtrace:
(cut)
  13: core::result::unwrap_failed
             at src/libcore/result.rs:1188
  14: core::result::Result<T,E>::unwrap
             at /rustc/5e1a799842ba6ed4a57e91f7ab9435947482f7d8/src/libcore/result.rs:956
  15: <std::env::Args as core::iter::traits::iterator::Iterator>::next::{{closure}}
             at src/libstd/env.rs:789
  16: core::option::Option<T>::map
             at /rustc/5e1a799842ba6ed4a57e91f7ab9435947482f7d8/src/libcore/option.rs:450
  17: <std::env::Args as core::iter::traits::iterator::Iterator>::next
             at src/libstd/env.rs:789
  18: <&mut I as core::iter::traits::iterator::Iterator>::next
             at /rustc/5e1a799842ba6ed4a57e91f7ab9435947482f7d8/src/libcore/iter/traits/iterator.rs:2991
  19: core::iter::traits::iterator::Iterator::nth
             at /rustc/5e1a799842ba6ed4a57e91f7ab9435947482f7d8/src/libcore/iter/traits/iterator.rs:323
  20: <core::iter::adapters::Skip<I> as core::iter::traits::iterator::Iterator>::next
             at /rustc/5e1a799842ba6ed4a57e91f7ab9435947482f7d8/src/libcore/iter/adapters/mod.rs:1657
  21: rustre::main
             at src/main.rs:2
(cut)

Okay, so, even with most of the frames cut, it's still a mouthful but basically:

in main()
we called .next() on an Iterator
which eventually called .unwrap() on a Result
which panicked

This means it only panics if we try to get the arguments as a String.

If we get it as an OsString, for example, it doesn't panic:

Rust code

fn main() {
    let arg = std::env::args_os()
        .skip(1)
        .next()
        .expect("should have one argument");
    println!("{:?}", arg)
}

Shell session

$ cargo run --quiet -- hello
"hello"
$ cargo run --quiet $(printf "\\xC3")
"\xC3"

...but then we don't have a .to_uppercase() method. Because it's not text. It's an OsString, ie. it's a a series of bytes which we might be able to interpret as text (given the encoding) — or not.

Speaking of, how does our C program handle our invalid utf-8 input?

Shell session

$ ../upper $(printf "\\xC3")
U+00C0 U+0043 U+0044 U+0050 U+0041 U+0054 U+0048 U+003D U+002E U+003A U+002F U+0068 U+006F U+006D U+0065 U+002F U+0061 U+006D U+006F U+0073 U+002F U+0072 U+0075 U+0073 U+0074 U+003A U+002F U+0068 U+006F U+006D U+0065 U+002F U+0061 U+006D U+006F U+0073 U+002F U+0067 U+006F U+003A U+002F U+0068 U+006F U+006D U+0065 U+002F U+0061 U+006D U+006F U+0073 U+002F U+0066 U+0074 U+006C U+003A U+002F U+0068 U+006F U+006D U+0065 U+002F U+0061 U+006D U+006F U+0073 U+002F U+0070 U+0065 U+0072 U+0073 U+006F U+003A U+002F U+0068 U+006F U+006D U+0065 U+002F U+0061 U+006D U+006F U+0073 U+002F U+0077 U+006F U+0072 U+006B 
ÀCDPATH=.:/HOME/AMOS/RUST:/HOME/AMOS/GO:/HOME/AMOS/FTL:/HOME/AMOS/PERSO:/HOME/AMOS/WORK

The answer is: not well. Not well at all, in fact.

Our naive UTF-8 decoder first read C3 and was all like "neat, a 2-byte sequence!", and then it read the next byte (which happened to be the null terminator), and decided the result should be "à".

So, instead of stopping, it read past the end of the argument, right into the environment block, finding the first environment variable, and now you can see the places I cd to frequently (in upper-case).

Now, this seems pretty tame in this context... but what if it wasn't?

What if our C program was used as part of a web server, and its output was shown directly to the user? And what if the first environment variable wasn't CDPATH, but SECRET_API_TOKEN?

Then it would be a disaster. And it's not a hypothetical, it happens all the time.

Cool bear's hot tip

By the way, our program is also vulnerable to buffer overflow attacks: if the input decodes to more than 1024 scalar values, it could overwrite other variables, potentially variables that are involved in verifying someone's credentials...

So, our C program will happily do dangerous things (which is very on-brand), but our Rust program panics early if the command-line arguments are not valid utf-8.

What if we want to handle that case gracefully?

Then we can use OsStr::to_str, which returns an Option - a value that is either something or nothing.

Rust code

fn main() {
    let arg = std::env::args_os()
        .skip(1)
        .next()
        .expect("should have one argument");

    match arg.to_str() {
        Some(arg) => println!("valid UTF-8: {}", arg),
        None => println!("not valid UTF-8: {:?}", arg),
    }
}

Shell session

$ cargo run --quiet -- "é"
valid UTF-8: é
$ cargo run --quiet -- $(printf "\\xC3")
not valid UTF-8: "\xC3"

Wonderful.

What did we learn?

In Rust, provided you don't explicitly work around it with unsafe, values of type String are always valid UTF-8.

If you try to build a String with invalid UTF-8, you won't get a String, you'll get an error instead. Some helpers, like std::env::args(), hide the error handling because the error case is very rare - but it still checks for it, and panics if it happens, because that's the safe thing to do.

By comparison, C has no string type. It doesn't even have a real character type. char is.. an ASCII character plus an additional bit - effectively, it's just a signed 8-bit integer: int8_t.

There is absolutely no guarantee that anything in a char* is valid UTF-8, or valid anything for that matter. There is no encoding associated to a char*, which is just an address in memory. There is no length associated to it either, so computing its length involves finding the null terminator.

Null-terminated strings are also a serious security concern. Not to mention that NUL is a valid Unicode character, so null-terminated strings cannot represent all valid UTF-8 strings.

Iteration

How would we go about separating characters by spaces?

Rust code

fn main() {
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");

    for c in arg.chars() {
        print!("{} ", c);
    }
    println!()
}

Shell session

$ cargo run --quiet -- "cup of tea"
c u p   o f   t e a

That was easy! Let's try it with non-ASCII characters:

Shell session

$ cargo run --quiet -- "23€ ≈ ¥2731"
2 3 €   ≈   ¥ 2 7 3 1

Shell session

$ cargo run --quiet -- "memory safety 🥺 please 🙏"
m e m o r y   s a f e t y   🥺   p l e a s e   🙏

Everything seems fine.

What if we want to print the Unicode scalar values's numbers instead of their, uh, graphemes?

Rust code

fn main() {
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");

    for c in arg.chars() {
        print!("{} (U+{:04X}) ", c, c as u32);
    }
    println!()
}

Shell session

$ cargo run --quiet -- "aimée"
a (U+0061) i (U+0069) m (U+006D) é (U+00E9) e (U+0065)

Cool!

What if we want to show how it's encoded as UTF-8? By which I mean, print the individual bytes?

Rust code

fn main() {
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");

    for b in arg.bytes() {
        print!("{:02X} ", b);
    }
    println!()
}

Shell session

$ cargo run --quiet -- "aimée"
61 69 6D C3 A9 65

There's our c3 a9!

Easy enough. We didn't even have to worry about types - there hasn't been a single String or &str in any of our Rust programs so far.

So, let's go looking for trouble.

Passing strings around

First, C.

C is easy! Just use char*! Don't worry about anything else.

C code

// in `woops.c`

#include <stdio.h>

int len(char *s) {
    int l = 0;
    while (s[l]) {
        l++;
    }
    return l;
}

int main(int argc, char **argv) {
    char *arg = argv[1];
    int l = len(arg);
    printf("length of \"%s\" = %d\n", arg, l);
}

Shell session

$ # we're back into the parent of the "rustre" directory
$ # (in case you're following along)
$ gcc woops.c -o woops
$ ./woops "dog"
length of "dog" = 3

See? Easy! None of that String / &str nonsense. No lifetimes, no nothing.

Ah. deep breath. Simpler times.

Okay, back to reality. First of all, that's not really the length of a string. It's.. the number of bytes it takes to encode it using UTF-8.

So, for example:

Shell session

$ ./woops "née"
length of "née" = 4

And also:

Shell session

$ ./woops "🐈"
length of "🐈" = 4

But in all fairness, that was to be expected. We didn't spend half the article implementing a half-baked UTF-8 decoder and encoder just to be surprised that, without one, we can't count characters properly.

Also, that's not what's bothering me right now.

What's bothering me right now is that the compiler does nothing to prevent us from doing this:

C code

#include <stdio.h>

int len(char *s) {
    s[0] = '\0';
    return 0;
}

int main(int argc, char **argv) {
    char *arg = argv[1];
    int l = len(arg);
    printf("length of \"%s\" = %d\n", arg, l);
}

Shell session

$ gcc woops.c -o woops
$ ./woops "some user input"
length of "" = 0

And, you know, len() is right. By the time it's done... the length of the string is zero. (It even "works" on non-ASCII inputs!).

This would pass unit tests. And if no one bothered to look at the len function itself - say, if it was in a third-party library, or worse, a proprietary third-party library, then it would be... interesting... to debug.

"But Amos!" - you interject - "C has const"!

Okay, sure, C has const:

C code

int len(const char *s) {
    s[0] = '\0';
    return 0;
}

Now it doesn't compile:

Shell session

woops.c: In function ‘len’:
woops.c:4:10: error: assignment of read-only location ‘*s’
    4 |     s[0] = '\0';
      |

But C also has unbounded leniency, and so:

C code

int len(const char *s) {
    char *S = (void *) s;
    S[0] = '\0';
    return 0;
}

Now it compiles again. And it runs. And it doesn't fail at runtime - it silently overwrites our input string, just the same.

Even -Wall, -Wextra and -Wpedantic don't warn us about this. They warn us about argc being unused. Which, fair enough, not passing an argument to ./woops definitely ends up reading from unmapped memory addresses and crashes right now.

And if this is in a proprietary library, you're lulled into a false sense of security, because you look at the header file and you see this:

C code

int len(const char *s);

But, granted - that's a contrived example. You'd have to be a pretty evil vendor to ship a len function that mutates its input. Unless you do it accidentally. Which you'd never do, right? Unless you do. In which case, well, you shouldn't have. Obviously.

Right?

Okay so let's go with a more realistic example: a function that turns a string uppercase:

C code

// in `woops.c`

#include <stdio.h>
#include <ctype.h>

void uppercase(char *s) {
    // this is peak C right there
    do {
        *s = toupper(*s);
    } while (*s++);
}

int main(int argc, char **argv) {
    char *arg = argv[1];

    char *upp = arg;
    uppercase(upp);

    printf("upp = %s\n", upp);
}

Sounds familiar right? But we're focused on a different aspect.

Anyway, it works:

Shell session

$ gcc woops.c -o woops
$ ./woops "dog"
upp = "DOG"

But... I'm not happy with this code. What if we needed arg later?

C code

// in `woops.c`

#include <stdio.h>
#include <ctype.h>

void uppercase(char *s) {
    // this is peak C right there
    do {
        *s = toupper(*s);
    } while (*s++);
}

int main(int argc, char **argv) {
    char *arg = argv[1];

    char *upp = arg;
    uppercase(upp);

    printf("arg = %s\n", arg);
    printf("upp = %s\n", upp);
}

Shell session

$ ./woops "dog"
arg = DOG
upp = DOG

Woops. The arg and upp variables point to the same memory, and uppercase works in-place, so it changed both.

Well, I can think of a few ways to fix that! Either we can duplicate the string before we call uppercase:

C code

// in `woops.c`

#include <stdio.h>
#include <string.h>
#include <ctype.h>

void uppercase(char *s) {
    do { *s = toupper(*s); } while (*s++);
}

int main(int argc, char **argv) {
    char *arg = argv[1];

    char *upp = strdup(arg);
    uppercase(upp);

    printf("arg = %s\n", arg);
    printf("upp = %s\n", upp);
}

Shell session

$ ./woops "dog"
arg = dog
upp = DOG

Or we could have uppercase return a fresh char *.

And this way, we can even use it directly as an argument to printf. How convenient!

C code

// in `woops.c`

#include <stdio.h>
#include <string.h>
#include <ctype.h>

char *uppercase(char *s) {
    s = strdup(s);
    char *t = s;
    do { *t = toupper(*t); } while (*t++);
    return s;
}

int main(int argc, char **argv) {
    char *arg = argv[1];
    printf("upp = %s\n", uppercase(arg));
    printf("arg = %s\n", arg);
}

Shell session

$ ./woops "dog"
upp = DOG
arg = dog

Wonderful! Correctness, at last!

We did, however, just introduce a memory leak into our program.

See, arg points to somewhere in memory that is set up at process startup. Again, the details are out of scope, but what I can tell you with confidence is that it hasn't been allocated with malloc, and it shouldn't be freed with free.

The result of strdup, however, definitely needs to be freed by calling free.

In fact, valgrind can easily confirm our program is leaking memory right now:

Shell session

$ valgrind ./woops "hello valgrind"
==3628== Memcheck, a memory error detector
==3628== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3628== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==3628== Command: ./woops hello\ valgrind
==3628== 
upp = HELLO VALGRIND
arg = hello valgrind
==3628== 
==3628== HEAP SUMMARY:
==3628==     in use at exit: 15 bytes in 1 blocks
==3628==   total heap usage: 2 allocs, 1 frees, 1,039 bytes allocated
==3628== 
==3628== LEAK SUMMARY:
==3628==    definitely lost: 15 bytes in 1 blocks
==3628==    indirectly lost: 0 bytes in 0 blocks
==3628==      possibly lost: 0 bytes in 0 blocks
==3628==    still reachable: 0 bytes in 0 blocks
==3628==         suppressed: 0 bytes in 0 blocks
==3628== Rerun with --leak-check=full to see details of leaked memory
==3628== 
==3628== For lists of detected and suppressed errors, rerun with: -s
==3628== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Of course, for a short-lived program like that, it might not matter. If we're writing a more long-lived application, maybe a desktop one, or, worse, a server one, then it matters a lot.

So now, we have to remember to call free() on the result of uppercase().

Every time.

And the only hint in the function's signature is that it returns a char *. Which is not immediately obvious when you just look at the call site.

So here's our non-leaky program:

C code

// in `woops.c`

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <stdlib.h>

char *uppercase(char *s) {
    s = strdup(s);
    char *t = s;
    do { *t = toupper(*t); } while (*t++);
    return s;
}

int main(int argc, char **argv) {
    char *arg = argv[1];
    char *upp = uppercase(arg);
    free(upp);
    printf("upp = %s\n", upp);
    printf("arg = %s\n", arg);
}

Shell session

$ ./woops "good clean code"
upp = 
arg = good clean code

Uhhh.. hang on.

$ valgrind ./woops "good clean code"
==4132== Memcheck, a memory error detector
==4132== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==4132== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==4132== Command: ./woops good\ clean\ code
==4132==
==4132== Invalid read of size 1
==4132==    at 0x483BCE2: __strlen_sse2 (vg_replace_strmem.c:463)
==4132==    by 0x48EC61D: __vfprintf_internal (in /usr/lib/libc-2.30.so)
==4132==    by 0x48D726E: printf (in /usr/lib/libc-2.30.so)
==4132==    by 0x10920D: main (in /home/amos/ftl/strings/woops)
==4132==  Address 0x4a4a040 is 0 bytes inside a block of size 16 free'd
==4132==    at 0x48399AB: free (vg_replace_malloc.c:540)
==4132==    by 0x1091F5: main (in /home/amos/ftl/strings/woops)
==4132==  Block was alloc'd at
==4132==    at 0x483877F: malloc (vg_replace_malloc.c:309)
==4132==    by 0x490EBEE: strdup (in /usr/lib/libc-2.30.so)
==4132==    by 0x109180: uppercase (in /home/amos/ftl/strings/woops)
==4132==    by 0x1091E5: main (in /home/amos/ftl/strings/woops)

Oh wow, valgrind got good.

If we built our program with -g, it'd even show us line numbers in our ".c" sources. It does show line numbers in glibc's ".c" sources, because I installed glibc debug symbols recently, for reasons, but yeah, whoa, look at that output.

So anyway, silly me, I freed upp right before printing it, my fingers slipped, luckily this never happens in real life right haha.

It would be cool if the compiler could tell me about it at compile-time but I guess it's impossible because free() is just a regular function so who's to say what it actually does? What if we defined our own free function that doesn't invalidate its argument? Checkmate, static analysts.

Anyway let's fix it real quick:

C code

// in `woops.c`

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <stdlib.h>

char *uppercase(char *s) {
    s = strdup(s);
    char *t = s;
    do { *t = toupper(*t); } while (*t++);
    return s;
}

int main(int argc, char **argv) {
    char *arg = argv[1];
    char *upp = uppercase(arg);
    printf("upp = %s\n", upp);
    free(upp);
    printf("arg = %s\n", arg);
}

Shell session

$ ./woops "good clean code"
upp = GOOD CLEAN CODE
arg = good clean code

Wonderful!

Except.. it's not. Not really. Having a C function return a new allocation is, well, kind of an anti-pattern. Precisely because you might forget to free it.

Instead, we'd usually have a "source" and a "destination" parameter:

C code

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <stdlib.h>

void uppercase(const char *src, char *dst) {
    do { *dst++ = toupper(*src); } while (*src++);
}

int main(int argc, char **argv) {
    const char *arg = argv[1];
    char *upp = malloc(strlen(arg));
    uppercase(arg, upp);

    printf("upp = %s\n", upp);
    free(upp);
    printf("arg = %s\n", arg);
}

How nice. We even use const in all the right places! I think! Except maybe argv! Who knows? The compiler sure doesn't seem to care much. I guess casting non-const to const is pretty harmless. Fair enough, GCC, fair enough.

So anyway this works great:

Shell session

$ gcc woops.c -o woops
$ ./woops "clean design"
upp = CLEAN DESIGN
arg = clean design

It works so well I'm sure we don't even need to run it under valgrind. Who's got the time?

Well.. we do. We have the time. Let's be super duper safe:

$ valgrind ./woops "clean design"
==5823== Memcheck, a memory error detector
==5823== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5823== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==5823== Command: ./woops clean\ design
==5823== 
==5823== Invalid write of size 1
==5823==    at 0x10920A: uppercase (in /home/amos/ftl/strings/woops)
==5823==    by 0x1090A8: main (in /home/amos/ftl/strings/woops)
==5823==  Address 0x4a4a04c is 0 bytes after a block of size 12 alloc'd
==5823==    at 0x483877F: malloc (vg_replace_malloc.c:309)
==5823==    by 0x10909A: main (in /home/amos/ftl/strings/woops)
==5823== 
==5823== Invalid read of size 1
==5823==    at 0x483BCF4: __strlen_sse2 (vg_replace_strmem.c:463)
==5823==    by 0x48EC61D: __vfprintf_internal (in /usr/lib/libc-2.30.so)
==5823==    by 0x48D726E: printf (in /usr/lib/libc-2.30.so)
==5823==    by 0x1090B9: main (in /home/amos/ftl/strings/woops)
==5823==  Address 0x4a4a04c is 0 bytes after a block of size 12 alloc'd
==5823==    at 0x483877F: malloc (vg_replace_malloc.c:309)
==5823==    by 0x10909A: main (in /home/amos/ftl/strings/woops)
==5823== 
upp = CLEAN DESIGN
arg = clean design
==5823== 
==5823== HEAP SUMMARY:
==5823==     in use at exit: 0 bytes in 0 blocks
==5823==   total heap usage: 2 allocs, 2 frees, 1,036 bytes allocated
==5823== 
==5823== All heap blocks were freed -- no leaks are possible
==5823== 
==5823== For lists of detected and suppressed errors, rerun with: -s
==5823== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

Ohhhhh no. What did we do now?

C code

    char *upp = malloc(strlen(arg));

Oh RIGHT haha this never happens to me I swear, but, yeah, if you're allocating space for a string, you need to account for the null terminator.

This is like, Computers 101, I'm not sure what's up with me today - students have been learning this stuff for /decades/, I must be the first dude to ever get it wrong.

Ah well, let's add that one... last... fix:

C code

    char *upp = malloc(strlen(arg) + 1);

Shell session

$ gcc woops.c -o woops
$ ./woops "last fix"
upp = LAST FIX
arg = last fix

Fantastic! Even valgrind doesn't complain!

Now our program 100% does what it should: for each character of src, we convert it to uppercase, and then store into dst, which was allocated by the caller, so it's, well, clear-er that it's the caller's job to free it.

But um. Speaking of null terminators... what happened to it exactly? I don't remember setting a null terminator in dst.

Let's print everything we send to toupper:

C code

void uppercase(const char *src, char *dst) {
    do {
        printf("toupper('%c') (as hex: 0x%x)\n", *src, *src);
        *dst++ = toupper(*src);
    } while (*src++);
}

Shell session

$ gcc woops.c -o woops
$ ./woops "last fix"
toupper('l') (as hex: 0x6c)
toupper('a') (as hex: 0x61)
toupper('s') (as hex: 0x73)
toupper('t') (as hex: 0x74)
toupper(' ') (as hex: 0x20)
toupper('f') (as hex: 0x66)
toupper('i') (as hex: 0x69)
toupper('x') (as hex: 0x78)
toupper('') (as hex: 0x0)
upp = LAST FIX
arg = last fix

Oh, haha. It turns out that, the way we wrote our loop, we read the null terminator and pass it to toupper(), eventually writing it to dst. So we made two mistakes (iterating too far, and not writing a null terminator) but they, kinda, canceled each other out.

Lucky toupper has no way to return an error and just returns 0 for 0, right? Or maybe 0 is what it returns on error? Who knows! It's a C API! Anything is possible.

Again, I don't usually make those mistakes - I'm a programmer. Humans are not known for being fallible. This is clearly just a fluke. It would never happen if I was actually writing production code, much less if I worked for a large company.

Never.

And if I did, it probably wouldn't result in remote code execution or something nasty like that. It'd probably display a playful panda instead, that asks our users to wait until some of our engineers tend to the problem.

Right?

And even if I was working for a large company, and was writing production code, some tool probably would have caught that. Obviously Valgrind didn't, but you know, some other tool. I'm sure there's a tool. Somewhere.

Unless it's a perfectly legitimate use of the C language and standard library, just.. an incorrect one. And in that case no amount of tooling will help us. But that's why we have teams! Well, if we have a team. Surely a colleague would have caught it during code review. I thought it looked a bit suspicious. Didn't I say I felt funny about the code? I could've sworn.

Anyway, let's fix it. It won't look as C-ish as before, but it's for the sake of robust code, so just go with it:

C code

void uppercase(const char *src, char *dst) {
    for (int i = 0;; i++) {
        if (src[i] == '\0') {
            // null terminator
            dst[i] = '\0';
            return;
        }

        dst[i] = toupper(src[i]);
    }
}

$ gcc woops.c -o woops
$ ./woops "last fix (for real this time)"
upp = LAST FIX (FOR REAL THIS TIME)
arg = last fix (for real this time)

Alright. Enough playing around. We've finally achieved C perfection.

Although.....

Things certainly might go wrong if someone were to do something like this:

C code

int main(int argc, char **argv) {
    const char *arg = argv[1];
    char *upp = malloc(strlen(arg) + 1);

    arg = "doggo override";
    uppercase(arg, upp);

    printf("upp = %s\n", upp);
    free(upp);
    printf("arg = %s\n", arg);
}

Shell session

$ gcc woops.c -o woops
$ ./woops "dog"
upp = DOGGO OVERRIDE
arg = doggo override

Well, nothing seems wrong.

rings bell Garçon?

Shell session

$ valgrind ./woops "dog"                                            
==8023== Memcheck, a memory error detector                     
==8023== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==8023== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==8023== Command: ./woops dog                                                 
==8023==    
==8023== Invalid write of size 1
==8023==    at 0x1091E1: uppercase (in /home/amos/ftl/strings/woops)
==8023==    by 0x10923F: main (in /home/amos/ftl/strings/woops)
==8023==  Address 0x4a4a044 is 0 bytes after a block of size 4 alloc'd
==8023==    at 0x483877F: malloc (vg_replace_malloc.c:309)
==8023==    by 0x10921D: main (in /home/amos/ftl/strings/woops)
==8023== 
==8023== Invalid write of size 1
==8023==    at 0x1091B1: uppercase (in /home/amos/ftl/strings/woops)
==8023==    by 0x10923F: main (in /home/amos/ftl/strings/woops)
==8023==  Address 0x4a4a04e is 10 bytes after a block of size 4 alloc'd
==8023==    at 0x483877F: malloc (vg_replace_malloc.c:309)
==8023==    by 0x10921D: main (in /home/amos/ftl/strings/woops)
==8023== 
==8023== Invalid read of size 1
==8023==    at 0x483BCF4: __strlen_sse2 (vg_replace_strmem.c:463)
==8023==    by 0x48EC61D: __vfprintf_internal (in /usr/lib/libc-2.30.so)
==8023==    by 0x48D726E: printf (in /usr/lib/libc-2.30.so)
==8023==    by 0x109257: main (in /home/amos/ftl/strings/woops)
==8023==  Address 0x4a4a044 is 0 bytes after a block of size 4 alloc'd
==8023==    at 0x483877F: malloc (vg_replace_malloc.c:309)
==8023==    by 0x10921D: main (in /home/amos/ftl/strings/woops)
==8023== 
==8023== Invalid read of size 1
==8023==    at 0x49018F4: _IO_file_xsputn@@GLIBC_2.2.5 (in /usr/lib/libc-2.30.so)
==8023==    by 0x48EAB51: __vfprintf_internal (in /usr/lib/libc-2.30.so)

Oh whoa. That's a lot of output. I've cut out the rest, but it keeps going on and on.

So the problem here is that... we allocate enough room for "DOG\0", but we end up converting "doggo override\0" to uppercase, so we end up writing past the area that malloc allocated for us.

I believe the technical term is a ~~~fucky wucky~~~ buffer overflow.

Well, let's look at the standard C library, for ideas on proper API design:

C code

char *strdup(const char *s);
char *strndup(const char *s, size_t n);

Ah, of course! They have this n variant that takes, well, a size, and it won't write past that size.

But wait.. is n counting the null terminator or not?

The strndup() function is similar, but copies at most n bytes. If s is longer than n, only n bytes are copied, and a terminating null byte ('\0') is added.

It does not.

Well, let's follow the shoulders of giants I guess (that's the expression, right?)

C code

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <stdlib.h>

void nuppercase(const char *src, char *dst, int n) {
    for (int i = 0; i < n; i++) {
        if (src[i] == '\0') {
            // null terminator
            dst[i] = '\0';
            return;
        }

        dst[i] = toupper(src[i]);
    }

    // null terminator in case we stopped because `i >= n`
    dst[n] = '\0';
}

int main(int argc, char **argv) {
    const char *arg = argv[1];
    size_t upp_len = strlen(arg);
    char *upp = malloc(upp_len + 1);

    arg = "doggo override";
    nuppercase(arg, upp, upp_len);

    printf("upp = %s\n", upp);
    free(upp);
    printf("arg = %s\n", arg);
}

Shell session

$ gcc woops.c -o woops
$ ./woops "cat"
upp = DOG
arg = doggo override

Fantastic!

Of course, our program doesn't really do what we meant for it to do, but at least we've silenced Valgrind, so, close enough.

Also: we've only ever dealt with char*! None of that "multiple string types" noise.

So simple. So relaxing.

uppercase, but in Rust

Okay so by now, we know that the program we ended up with can be written in Rust simply as:

Rust code

fn main() {
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");

    println!("upp = {}", arg.to_uppercase());
    println!("arg = {}", arg);
}

Shell session

$ cargo run --quiet -- "dog"
upp = DOG
arg = dog

But let's try to get this as close as possible to the C code, as an exercise.

First of all, doesn't this have a memory leak? We're not freeing the result of to_uppercase anywhere! Nor are we freeing arg, for that matter.

Shell session

$ cargo build --quiet
$ valgrind ./target/debug/rustre "dog"
==9534== Memcheck, a memory error detector
==9534== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==9534== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==9534== Command: ./target/debug/rustre dog
==9534== 
upp = DOG
arg = dog
==9534== 
==9534== HEAP SUMMARY:
==9534==     in use at exit: 0 bytes in 0 blocks
==9534==   total heap usage: 23 allocs, 23 frees, 3,420 bytes allocated
==9534== 
==9534== All heap blocks were freed -- no leaks are possible
==9534== 
==9534== For lists of detected and suppressed errors, rerun with: -s
==9534== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Mhh. No leaks. Very well.

But it doesn't look like the C version either. Our C version had an uppercase function, so, let's make one:

Rust code

fn main() {
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");

    println!("upp = {}", uppercase(arg));
    println!("arg = {}", arg);
}

fn uppercase(s: String) -> String {
    s.to_uppercase()
}

Shell session

$ cargo build --quiet
error[E0382]: borrow of moved value: `arg`
 --> src/main.rs:8:26
  |
2 |     let arg = std::env::args()
  |         --- move occurs because `arg` has type `std::string::String`, which does not implement the `Copy` trait
...
7 |     println!("upp = {}", uppercase(arg));
  |                                    --- value moved here
8 |     println!("arg = {}", arg);
  |                          ^^^ value borrowed here after move

error: aborting due to previous error

For more information about this error, try `rustc --explain E0382`.
error: could not compile `rustre`.

Oh lord here comes the compiler.

squints

Okay so... uh, the problem is that we call uppercase() which.. moves (?) arg, and then we use arg again, and that's not good.

I guess we.. could first print arg and then call uppercase() second.

Would that work?

Rust code

fn main() {
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");

    println!("arg = {}", arg);
    println!("upp = {}", uppercase(arg));
}

fn uppercase(s: String) -> String {
    s.to_uppercase()
}

Shell session

$ cargo run --quiet -- "dog"
arg = dog
upp = DOG

It does! Alright!

But let's say we don't want to. Let's say we want to call uppercase first.

I guess we could pass it a copy of arg. Well, the compiler just told us String does not implement the Copy trait, so.. I guess a clone of arg?

Rust code

fn main() {
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");

    println!("upp = {}", uppercase(arg.clone()));
    println!("arg = {}", arg);
}

fn uppercase(s: String) -> String {
    s.to_uppercase()
}

Shell session

$ cargo run --quiet -- "dog"
upp = DOG
arg = dog

Neat!

This is kinda silly though - why do we need to clone arg? It's just the input to uppercase. We don't need a second copy of it in memory. The first copy was just fine!

Right now, in memory we have:

arg (which is "dog")
the copy of arg we sent to uppercase() (which is "dog")
the value uppercase() returns (which is "DOG")

Now, I'm no expert, but I'm guessing that's where &str comes in, huh?

Let's try it:

Rust code

fn main() {
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");

    println!("upp = {}", uppercase(arg));
    println!("arg = {}", arg);
}

fn uppercase(s: &str) -> String {
    s.to_uppercase()
}

I'm not too sure about this, but in doubt, let's RTFC (Run The Friendly Compiler):

Rust code

cargo run --quiet -- "dog"
error[E0308]: mismatched types
 --> src/main.rs:7:36
  |
7 |     println!("upp = {}", uppercase(arg));
  |                                    ^^^
  |                                    |
  |                                    expected `&str`, found struct `std::string::String`
  |                                    help: consider borrowing here: `&arg`

Ahah. We were missing one piece of the puzzle.

Very well then:

Rust code

    println!("upp = {}", uppercase(&arg));

Shell session

$ cargo run --quiet -- "dog"
upp = DOG
arg = dog

Nice! And still no memory leaks, if Valgrind is to be trusted.

Also, this function signature:

Rust code

    fn uppercase(s: &str) -> String

...can be trusted. Without using unsafe code (which you can audit for), there is no way to modify s. It's not just "a cast away". You really, really have to explicitly opt into doing dangerous stuff.

But, you know, still, this isn't close to the C code.

For it to be closer to the C code, we should:

allocate a "destination"
pass the destination to uppercase()
have uppercase() iterate through each character individually, convert them to uppercase, and append them to the destination

Obviously we don't have malloc and friends, and our strings aren't null-terminated, but we can still get pretty close.

Rust code

fn main() {
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");

    let mut upp = String::new();
    println!("upp = {}", uppercase(&arg, upp));
    println!("arg = {}", arg);
}

fn uppercase(src: &str, dst: String) -> String {
    for c in src.chars() {
        dst.push(c.to_uppercase());
    }
    dst
}

chants Run the friendly compiler! Run the friendly compiler!

Shell session

$ cargo run --quiet -- "dog"
error[E0308]: mismatched types
  --> src/main.rs:14:18
   |
14 |         dst.push(c.to_uppercase());
   |                  ^^^^^^^^^^^^^^^^ expected `char`, found struct `std::char::ToUppercase`

audience ooohs

Well, this doesn't work. What the heck is a ToUppercase anyway. More types? Just what we needed.

Let's read the docs I guess:

Returns an iterator that yields the uppercase equivalent of a char.

This struct is created by the to_uppercase method on char. See its documentation for more.

Mh. Oh. OH! That's right! Remember this?

Shell session

$ cargo run --quiet -- "heinz große"
HEINZ GROSSE

Sometimes the upper-case equivalent of a char is.. several chars. So it returns an iterator instead of a single char.

We didn't have that problem in woops.c, because, well, we didn't handle UTF-8 at all.

Very well, iterator, we know this - I think - we can use for x in y:

Rust code

fn uppercase(src: &str, dst: String) -> String {
    for c in src.chars() {
        for c in c.to_uppercase() {
            dst.push(c);
        }
    }
    dst
}

Shell session

$ error[E0596]: cannot borrow `dst` as mutable, as it is not declared as mutable
  --> src/main.rs:15:13
   |
12 | fn uppercase(src: &str, dst: String) -> String {
   |                         --- help: consider changing this to be mutable: `mut dst`
...
15 |             dst.push(c);
   |             ^^^ cannot borrow as mutable

Mh. When did we ever borrow dst as mutable? I don't see a single & in uppercase.

Let's look at the declaration of String::push:

Rust code

pub fn push(&mut self, ch: char)

Oh, there it is! So dst.push(c) is exactly the same as String::push(&mut dst, c). Convenient.

So, yeah, let's make dst mut, just like the compiler suggested.

Rust code

fn uppercase(src: &str, mut dst: String) -> String {
    for c in src.chars() {
        for c in c.to_uppercase() {
            dst.push(c);
        }
    }
    dst
}

$ cargo run --quiet -- "dog"
upp = DOG
arg = dog

Cool bear's hot tip

std::char::to_uppercase() returning an Iterator is great for performance - our C UTF-8 implementation was eager - it always decoded (or re-encoded) an entire string, but Rust's standard library uses Iterators everywhere to make it lazy.

If we were to iterate through the characters of a string, convert them to uppercase, and check that the result contains "S", even if the input was a very large string, we'd only have a handful of characters at a time in memory, never the full thing. Also, we'd stop decoding once we've found the "S". It's a very flexible design!

But... that's not quite what the C code did, is it? When we had the C version of uppercase take an src and dst, it returned... void!

Very well then:

Rust code

fn main() {
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");

    let upp = String::new();
    uppercase(&arg, upp);

    println!("upp = {}", upp);
    println!("arg = {}", arg);
}

fn uppercase(src: &str, mut dst: String) {
    for c in src.chars() {
        for c in c.to_uppercase() {
            dst.push(c);
        }
    }
}

Shell session

cargo run --quiet -- "dog"
error[E0382]: borrow of moved value: `upp`
  --> src/main.rs:10:26
   |
7  |     let upp = String::new();
   |         --- move occurs because `upp` has type `std::string::String`, which does not implement the `Copy` trait
8  |     uppercase(&arg, upp);
   |                     --- value moved here
9  | 
10 |     println!("upp = {}", upp);
   |                          ^^^ value borrowed here after move

Oh, this again.

Well, now, we definitely cannot pass a copy of upp to uppercase() (even though it is tempting), because then uppercase would not operate on the right copy of upp, so, this would be wrong:

Rust code

fn main() {
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");

    let upp = String::new();
    uppercase(&arg, upp.clone());

    println!("upp = {}", upp);
    println!("arg = {}", arg);
}

Shell session

$ cargo run --quiet -- "dog"
upp = 
arg = dog

As expected, the upp we print ends up empty, which is what String::new returns.

Cool bear's hot tip

By the way, String is heap-allocated, because it's growable. Whereas a &str can refer to data from anywhere: the heap, the stack, even the program's data segment.

If you have lots of small strings, check out the smol_str crate, which is immutable, provides O(1) cloning, and stack-allocates strings smaller than 22 bytes.

It's similar in spirit to the smallvec crate.

So instead, we need to... let uppercase write to upp, just for a little bit.

We need to let it... borrow upp mutably. Right? Right.

Rust code

fn main() {
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");

    let mut upp = String::new();
    // was just `upp`
    uppercase(&arg, &mut upp);

    println!("upp = {}", upp);
    println!("arg = {}", arg);
}

// was `mut dst: String`
fn uppercase(src: &str, dst: &mut String) {
    for c in src.chars() {
        for c in c.to_uppercase() {
            dst.push(c);
        }
    }
}

..and now it works again!

Shell session

$ cargo run --quiet -- "dog"
upp = DOG
arg = dog

Of course, we're still not 100% at parity with the C version. We didn't have to call malloc, because String::new and String::push worry about allocation for us. We didn't have to call free, because a String going out of scope frees it automatically.

We didn't have to make an "n" variant of our function, because dst is a mutable reference to a growable string, so it's impossible for us to write past the end of it.

And finally, we couldn't mishandle the null terminator, because there is none.

So, wait, growable string and all.. does that mean we could pre-allocate a String of reasonable size, and just re-use it for multiple uppercase calls?

Rust code

fn main() {
    let mut upp = String::with_capacity(512);
    for arg in std::env::args().skip(1) {
        upp.clear();
        uppercase(&arg, &mut upp);
        println!("upp = {}", upp);
        println!("arg = {}", arg);
    }
}

fn uppercase(src: &str, dst: &mut String) {
    for c in src.chars() {
        for c in c.to_uppercase() {
            dst.push(c);
        }
    }
}

Shell session

cargo run --quiet -- "dog" "cat" "parrot"
upp = DOG
arg = dog
upp = CAT
arg = cat
upp = PARROT
arg = parrot

Yes, it does!

C allowed us to do direct indexing, does Rust allow us to do that?

Rust code

fn main() {
    for arg in std::env::args().skip(1) {
        for i in 0..arg.len() {
            println!("arg[{}] = {}", i, arg[i]);
        }
    }
}

Shell session

$ cargo run --quiet -- "dog"
error[E0277]: the type `std::string::String` cannot be indexed by `usize`
 --> src/main.rs:4:41
  |
4 |             println!("arg[{}] = {}", i, arg[i]);
  |                                         ^^^^^^ `std::string::String` cannot be indexed by `usize`
  |
  = help: the trait `std::ops::Index<usize>` is not implemented for `std::string::String`

Mh. No we cannot.

I guess we could first convert it to an array of Unicode scalar values, and then index that:

Rust code

fn main() {
    for arg in std::env::args().skip(1) {
        let scalars: Vec<char> = arg.chars().collect();
        for i in 0..scalars.len() {
            println!("arg[{}] = {}", i, scalars[i]);
        }
    }
}

Shell session

$ cargo run --quiet -- "dog"
arg[0] = d
arg[1] = o
arg[2] = g

Yeah, that works! And to be honest, it's better, because we only have to decode the UTF-8 string once - and then we get random access. That's probably why it doesn't implement Index<usize>.

Here's an interesting thing that is supported though: Index<Range<usize>>

Rust code

fn main() {
    for arg in std::env::args().skip(1) {
        let mut stripped = &arg[..];
        while stripped.starts_with(" ") {
            stripped = &stripped[1..]
        }
        while stripped.ends_with(" ") {
            stripped = &stripped[..stripped.len() - 1]
        }
        println!("     arg = {:?}", arg);
        println!("stripped = {:?}", stripped);
    }
}

Rust code

$ cargo run --quiet -- "  floating in space   "
     arg = "  floating in space   "
stripped = "floating in space"

And what's even more interesting, is that this doesn't involve any allocations!

None!

It's a different &str, that points to the same memory region, it just starts and ends at different offsets.

In fact, we could make it a function:

Rust code

fn main() {
    for arg in std::env::args().skip(1) {
        let stripped = strip(&arg);
        println!("     arg = {:?}", arg);
        println!("stripped = {:?}", stripped);
    }
}

fn strip(src: &str) -> &str {
    let mut dst = &src[..];
    while dst.starts_with(" ") {
        dst = &dst[1..]
    }
    while dst.ends_with(" ") {
        dst = &dst[..dst.len() - 1]
    }
    dst
}

And it would work just as well.

That seems.. dangerous, though - what if the memory of the original string is freed?

Rust code

fn main() {
    let stripped;
    {
        let original = String::from("  floating in space  ");
        stripped = strip(&original);
    }
    println!("stripped = {:?}", stripped);
}

$ cargo run --quiet -- "  floating in space   "
error[E0597]: `original` does not live long enough
 --> src/main.rs:5:26
  |
5 |         stripped = strip(&original);
  |                          ^^^^^^^^^ borrowed value does not live long enough
6 |     }
  |     - `original` dropped here while still borrowed
7 |     println!("stripped = {:?}", stripped);
  |                                 -------- borrow later used here

That is neat. If we had a similar function in C, it would be real easy to just free "original" and still use "stripped", and then, boom, instant use-after-free.

Which is why we probably wouldn't do it, in C. We would probably store the stripped string separately. But in Rust? Sure, why not! The compiler has got your back. It'll check for any shenanigans.

Lastly - indexing into a String with ranges is cool, but.. is it character ranges?

Rust code

fn main() {
    for arg in std::env::args().skip(1) {
        println!("first four = {:?}", &arg[..4]);
    }
}

Shell session

$ cargo run --quiet -- "want safety?"
first four = "want"
$ cargo run --quiet -- "🙈🙉🙊💥"
first four = "🙈"

Mh, byte ranges.

Wait. WAIT A MINUTE. I thought all Rust strings were UTF-8?

But using slicing, we can end up with partial multi-byte sequences, just like we crafted by hand earlier. And then, boom, invalid UTF-8, just like that.

Unless...

Rust code

fn main() {
    for arg in std::env::args().skip(1) {
        println!("first two = {:?}", &arg[..2]);
    }
}

Shell session

$ cargo run --quiet -- "🙈🙉🙊💥"
thread 'main' panicked at 'byte index 2 is not a char boundary; it is inside '🙈' (bytes 0..4) of `🙈🙉🙊💥`', src/libcore/str/mod.rs:2069:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

That is... that is just excellent. Not just the fact that it panics - which is the safe thing to do - but look at that message. I'm just completely in awe. No, I hadn't tried it before writing this article but DARN look at that thing.

Closing words

Anyway, this article is getting pretty long.

Hopefully it's a nice enough introduction to the world of string handling in Rust, and it should motivate why Rust has both String and &str.

The answer, of course, is safety, correctness, and performance. As usual.

We did encounter roadblocks when writing our last Rust string manipulation program, but they were mostly compile-time errors, or panics. Not once have we:

Read from an invalid address
Written to an invalid address
Forgotten to free something
Overwritten some other data
Needed an additional sanitizer tool to tell us where the problem was

And that, plus the amazing compiler messages, and the community, is the beauty of Rust.

If you want to learn more about Rust, check out A half-hour to learn Rust and Declarative Memory Management.

Cool bear's hot tip

There's another string-ish type I haven't mentioned, and if you're curious, I definitely recommend reading Pascal Hertleif's The Secret Life Of Cows.

It's about "clone-on-write" strings, which let us do even fewer allocations.

And if you do read that, you should definitely take a look at cow-utils, which implements clone-on-write versions of replace, to_lowercase, to_uppercase and more.

Discuss this article on /r/fasterthanlime

If you liked what you saw, please support my work!

Donate on GitHub Donate on Patreon

Looking for the homepage?

Another article: What's in the box?

Working with strings in Rust

Contents