The bottom emoji breaks rust-analyzer

Some bugs are merely fun. Others are simply delicious!

Today's pick is the latter.

Reproducing the issue, part 1

(It may be tempting to skip that section, but reproducing an issue is an important part of figuring it out, so.)

I've never used Emacs before, so let's install it. I do most of my computing on an era-appropriate Ubuntu, today it's Ubuntu 22.10, so I just need to:

Shell session
$ sudo apt install emacs-nox

And then create a new Rust crate, cd into it, and launch emacs:

Shell session
$ cargo new bottom
     Created binary (application) `bottom` package
$ cd bottom/
$ emacs

I am greeted with a terminal interface that says: "Welcome to GNU Emacs, one component of the GNU/Linux operating system". I feel religious already.

To open src/main.rs, I can press C-x C-f, where C- means Ctrl+ (and M- means Alt+, at least for me), which opens a "Find file" prompt at the bottom, pre-filled with the current working directory: for me that's ~/bearcove/bottom.

There's tab-completion in there, so s<TAB>m<TAB> completes it to .../src/main.rs, and pressing Enter (which I understand Emacs folks spell RET, for Return) opens the file.

Without additional configuration, not much happens: there's no syntax highlighting, no code intelligence of any kind, we have to ask for that.

So let's ask for that.

With guidance from a friend (on IRC, of all places! what year is this?), I ended up putting the following in my ~/.emacs file:

Emacs Lisp
;; in `~/.emacs`

(require 'package)
(add-to-list 'package-archives '("melpa" . "https://melpa.org/packages/"))
(add-to-list 'package-archives '("gnu"   . "https://elpa.gnu.org/packages/"))

(package-refresh-contents)
(package-install 'use-package)

(require 'use-package-ensure)
(setq use-package-always-ensure t)

(use-package rustic)
(use-package lsp-mode
  :ensure
  :commands lsp
  :custom
  ;; what to use when checking on-save. "check" is default, I prefer clippy
  (lsp-rust-analyzer-cargo-watch-command "clippy")
  (lsp-eldoc-render-all t)
  (lsp-idle-delay 0.6)
  (lsp-rust-analyzer-server-display-inlay-hints t)
  :config
  (add-hook 'lsp-mode-hook 'lsp-ui-mode))

(use-package lsp-ui
  :ensure
  :commands lsp-ui-mode
  :custom
  (lsp-ui-peek-always-show t)
  (lsp-ui-sideline-enable nil)
  (lsp-ui-doc-enable t))

(custom-set-variables
 ;; custom-set-variables was added by Custom.
 ;; If you edit it by hand, you could mess it up, so be careful.
 ;; Your init file should contain only one such instance.
 ;; If there is more than one, they won't work right.
 '(package-selected-packages '(lsp-ui rustic lsp-mode ## cmake-mode)))
(custom-set-faces
 ;; custom-set-faces was added by Custom.
 ;; If you edit it by hand, you could mess it up, so be careful.
 ;; Your init file should contain only one such instance.
 ;; If there is more than one, they won't work right.
 )

(The bit at the bottom was there originally, I'm guessing Ubuntu ships it? It didn't have lsp-ui / rustic / lsp-mode before I restarted emacs).

Restarting emacs installs everything, just like Vim plug-in managers would do, so far, so good.

Opening src/main.rs again prompts me to import a workspace root, I pick the first option, which seems reasonable, and tada, we have Rust highlighting and inline documentation and stuff!

[Screenshot: Windows Terminal showing emacs open on a Rust hello world source file]

Or rather, I do, because I have a rust-analyzer binary in path from a while ago (I contribute occasionally):

Shell session
$ which rust-analyzer
/home/amos/.cargo/bin/rust-analyzer

$ rust-analyzer --version
rust-analyzer 0.0.0 (caf23f291 2022-07-11)

This is old. rust-analyzer is released every week. Let's remove it and see if things still work.

Shell session
$ rm $(which rust-analyzer)

$ emacs src/main.rs

(cut)
Server rls:83343/starting exited (check corresponding stderr buffer
for details). Do you want to restart it? (y or n)

Ah. It doesn't. That's unfortunate!

How rust-analyzer is distributed

Rust components are normally distributed with rustup. rustc is a Rust component, so, it is:

Shell session
$ which rustc
/home/amos/.cargo/bin/rustc

$ rustc -V
rustc 1.67.1 (d5a82bbd2 2023-02-07)

$ rustup show active-toolchain
stable-x86_64-unknown-linux-gnu (default)

$ rustup which rustc
/home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc

$ ~/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc -V
rustc 1.67.1 (d5a82bbd2 2023-02-07)

You can also get it from elsewhere, and I'm sure you have your reasons, but they're not relevant to the topic at hand.

After rust-analyzer was promoted from "very useful peripheral project" to official rust-lang project, it started being distributed as a rustup component.

Back then, it was done by it being a git submodule in the rust-lang/rust repository. Since then, it's been migrated to a git subtree, which helped resolve the "rust-analyzer support for proc macros in Rust nightly" issue.

Why do we care? Because that means there's technically two trees for rust-analyzer at any given moment in time: a git subtree can (and should) be merged in either direction: ra->rust because ra moves fast and is developed mostly in its own repository, and rust->ra because the proc-macro bridge might change.

So! We can install rust-analyzer with rustup:

Shell session
$ rustup component add rust-analyzer
info: downloading component 'rust-analyzer'
info: installing component 'rust-analyzer'

But that doesn't set up a "proxy binary" under ~/.cargo/bin:

Shell session
$ which rust-analyzer

(it prints nothing)

It is, however, somewhere on disk:

Shell session
$ rustup which rust-analyzer
/home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rust-analyzer

We can even try it out ourselves by launching it, and typing {} then Enter/Return:

Shell session
$ ~/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rust-analyzer
{}
[ERROR rust_analyzer] Unexpected error: expected initialize request, got error: receiving on an empty and disconnected channel
expected initialize request, got error: receiving on an empty and disconnected channel

That's right! It's our old friend JSON-over-stdin.
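Strictly speaking it's JSON-RPC with a tiny header in front: the LSP base protocol wants a `Content-Length` header (counted in bytes), a blank line, and then the JSON body, so a bare `{}` was never going to get us very far. Here's a minimal sketch of that framing. The `frame` helper is made up for illustration, and the request body is a heavily abbreviated stand-in, not what a real client sends:

```rust
/// Wrap a JSON-RPC body in the LSP base-protocol framing:
/// a `Content-Length` header (in bytes), a blank line, then the body.
fn frame(body: &str) -> String {
    // `str::len` counts UTF-8 bytes, which is the unit Content-Length uses.
    format!("Content-Length: {}\r\n\r\n{}", body.len(), body)
}

fn main() {
    // Hypothetical, abbreviated request: a real `initialize` carries far more.
    let body = r#"{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}"#;
    print!("{}", frame(body));
}
```

Piping that into the server still won't produce a useful session, since the params are empty, but it is at least the right shape of message.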

Since this is a reasonable way in which users might want to install rust-analyzer, let's see if our emacs setup picks it up:

Shell session
$ emacs src/main.rs
(cut)
Server rls:86574/starting exited (check corresponding stderr buffer for details).
Do you want to restart it? (y or n)

No. This is unfortunate, again.

And even if it did work, it would still not be ideal, because... that version is old too:

Shell session
$ $(rustup which rust-analyzer) --version
rust-analyzer 1.67.1 (d5a82bb 2023-02-07)

Okay.. that one's only a few days old, but only because Rust 1.67.0 did an oopsie-woopsie when it comes to thin archives and they had to release 1.67.1 shortly after.

But yeah, if we check the version shipped with Rust 1.67.0:

Shell session
$ $(rustup +1.67.0 which rust-analyzer) --version
rust-analyzer 1.67.0 (fc594f1 2023-01-24)

It's from around that time. And that hash refers to a commit in rust-lang/rust, not rust-lang/rust-analyzer, so depending on when the last sync was made, it might be even further behind.

rust-analyzer wants to ship every Monday, and so, it does! At the time of this writing, the latest release is from 2023-02-13, and it's on GitHub, which is where the VSCode extension used to download it from.

These days, it ships the rust-analyzer binary directly in the extension:

Shell session
$ unzip -l rust-lang.rust-analyzer-0.4.1398@linux-x64.vsix
Archive:  rust-lang.rust-analyzer-0.4.1398@linux-x64.vsix
  Length      Date    Time    Name
---------  ---------- -----   ----
     2637  2023-02-10 00:29   extension.vsixmanifest
      462  2023-02-10 00:29   [Content_Types].xml
    15341  2023-02-09 18:46   extension/icon.png
     1036  2023-02-09 18:46   extension/language-configuration.json
    12006  2023-02-09 18:46   extension/LICENSE.txt
   415788  2023-02-10 00:29   extension/out/main.js
   298705  2023-02-09 18:46   extension/package-lock.json
    86857  2023-02-10 00:29   extension/package.json
      756  2023-02-09 18:46   extension/ra_syntax_tree.tmGrammar.json
     2422  2023-02-10 00:29   extension/README.md
 41073816  2023-02-10 00:29   extension/server/rust-analyzer
   648798  2023-02-10 00:29   extension/node_modules/d3-graphviz/build/d3-graphviz.min.js
   279449  2023-02-10 00:29   extension/node_modules/d3/dist/d3.min.js
---------                     -------
 42838073                     13 files

But other editor plug-ins, like coc-rust-analyzer, download it right from GitHub releases.

It seemed odd that the Emacs ecosystem would lack such functionality, so I looked it up: rustic says about automatic server installation that lsp-mode provides this feature, but eglot doesn't, and then it says "Install rust-analyzer manually".

In lsp-mode, I found code that seems like it should download rust-analyzer, but I'm not sure how to use it.

The docs talk about an lsp-enable-suggest-server-download option, which defaults to true, but I've never seen it download the server (I know because I checked ~/.emacs.d/.cache/lsp).

Although... now that I look at it closer, this error message:

Server rls:83343/starting exited (check corresponding stderr buffer
for details). Do you want to restart it? (y or n)

Mentions rls, which is the predecessor to rust-analyzer, and that definitely sounds wrong. But lsp-rust-server defaults to rust-analyzer, so.. is it just falling back? Setting the option explicitly doesn't seem to change much.

After more Emacs learning, I discovered how to switch to the *lsp-log* buffer the docs point to and discovered the following:

Command "rls" is present on the path.
Command "rust-analyzer" is not present on the path.
Command "rls" is present on the path.
Command "rust-analyzer" is not present on the path.
Found the following clients for /home/amos/bearcove/bottom/src/main.rs: (server-id rls, priority -1)
The following clients were selected based on priority: (server-id rls, priority -1)

This is a horrible fallback. RLS is deprecated. The only reason it's falling back to it is because there is a proxy binary for it, that exists, but errors out since the component is not installed:

Shell session
$ which rls
/home/amos/.cargo/bin/rls

$ rls
error: 'rls' is not installed for the toolchain 'stable-x86_64-unknown-linux-gnu'
To install, run `rustup component add rls`

$ echo $?
1
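The gap here is between "a file with that name is on PATH" and "running it actually works". A rough sketch of the stricter check, in Rust since that's what we have lying around. To be clear, `command_works` is a hypothetical helper, this is not how lsp-mode probes for servers, and it assumes the rustup proxy fails with `--version` the same way it does with no arguments:

```rust
use std::process::{Command, Stdio};

/// Returns true only if `program` can be spawned *and* exits successfully.
fn command_works(program: &str, args: &[&str]) -> bool {
    Command::new(program)
        .args(args)
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status() // Err if the binary can't be found or executed
        .map(|status| status.success())
        .unwrap_or(false)
}

fn main() {
    for program in ["rls", "rust-analyzer"] {
        println!("{program}: usable = {}", command_works(program, &["--version"]));
    }
}
```

A check along those lines would treat a proxy that exits with status 1 the same as a missing binary.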

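Let's summarize the situation: rustup ships a rust-analyzer component, but no proxy binary for it, so it's not in PATH. lsp-mode never offered to download one for me. And because rustup does ship a proxy binary for rls, lsp-mode finds "rls" in PATH and falls back to it, even though the component isn't installed and the proxy just errors out.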

I must admit this is an even bigger deal than what I was originally planning to write about! If this is the experience Emacs folks have been having with Rust, this explains a lot of things.

My sample size is N=3, but everyone in that sample ended up building rust-analyzer from source, and that means they get an extremely up-to-date RA once, and then most probably forget to update it forever, which is even worse than grabbing it from rustup.

Please remind me to submit a PR to lsp-mode that yanks ALL the rls stuff after I'm done writing this article. Someone must.

ANYWAY.

For today, let's just tell lsp-mode to use the one from rustup:

Emacs Lisp
;; under (use-package lsp-mode
;; in the :custom block
  (lsp-rust-analyzer-server-command (list (string-trim (shell-command-to-string "rustup which rust-analyzer"))))

Just kidding! That doesn't work. And yes, that evaluates to the right path. And yes, that "custom" option expects a list. And yes, I did learn what custom-set-variables does while troubleshooting this, and no, setting it through there doesn't work either.

The *Messages* buffer still shows that it couldn't find rust-analyzer in PATH.

My best guess is that rustic, which can use either lsp-mode or eglot, overrides this very late in the game, and I can't do a damn thing about it. I'm not sure why they even have an "Automatic server installation" section in their docs then.

So. Fine. We'll use violence and create a shell script in ~/.cargo/bin/rust-analyzer, containing this:

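Bash
#!/bin/bash

# Hand off to whichever rust-analyzer the active toolchain provides.
# Quoted so a path with spaces survives, exec'd so we don't leave a shell around.
exec "$(rustup which rust-analyzer)" "$@"

Shell session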
$ chmod +x ~/.cargo/bin/rust-analyzer
$ hash -r
$ rust-analyzer --version
rust-analyzer 1.67.1 (d5a82bb 2023-02-07)

Tada, we did a crime!

Reproducing the issue, part 2

So, now that we have emacs configured with lsp-mode and rustic, using a recent-ish rust-analyzer, we're back to square one.

Note that I forgot to add company-mode, the thing that actually provides completions: we can add it next to (use-package rustic) in ~/.emacs:

Emacs Lisp
;; in `~/.emacs`

;; cut: everything before that line

(require 'use-package-ensure)
(setq use-package-always-ensure t)

(use-package rustic)
;; 👇 new!
(use-package company) ;; the package is named "company"; company-mode is the minor mode it provides

Restarting emacs installs it, and we now get completions "after typing a few characters and waiting a bit". You can bind a key combination to "company-complete" to have it pop up on-demand, which I did, and then Emacs told me "no, you shouldn't write it C-SPC, you should write it [?\C- ]", which is exceedingly rude, but let's move on.

(The keybinding didn't work in the terminal but it did work in emacs-gtk, which I installed out of frustration).

Anyway!

Now onto the actual bug: let's add to our code.. an emoji! Any emoji.

Rust code
fn main() {
  // 🥺
  println!("Hello, world!");
}

Upon typing this emoji in the editor, a message will pop up in the bottom bar (appropriate) saying the LSP server crashed, would you like to restart it, no I wouldn't, I'd like to see the messages, a little C-x b *rust-anal<TAB>::std<TAB> later I'm in buffer *rust-analyzer::stderr* seeing this:

Panic context:
>
version: 1.67.1 (d5a82bb 2023-02-07)
notification: textDocument/didChange

thread 'LspServer' panicked at 'assertion failed: self.is_char_boundary(n)', /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/string.rs:1819:29
stack backtrace:
   0: rust_begin_unwind
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:111:5
   3: <alloc::string::String>::replace_range::<core::ops::range::Range<usize>>
   4: rust_analyzer::lsp_utils::apply_document_changes::<<rust_analyzer::global_state::GlobalState>::on_notification::{closure#3}::{closure#0}>
   5: <<rust_analyzer::global_state::GlobalState>::on_notification::{closure#3} as core::ops::function::FnOnce<(&mut rust_analyzer::global_state::GlobalState, lsp_types::DidChangeTextDocumentParams)>>::call_once
   6: <rust_analyzer::dispatch::NotificationDispatcher>::on::<lsp_types::notification::DidChangeTextDocument>
   7: <rust_analyzer::global_state::GlobalState>::handle_event
   8: <rust_analyzer::global_state::GlobalState>::run
   9: rust_analyzer::main_loop::main_loop
  10: rust_analyzer::run_server
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Process rust-analyzer stderr finished

That is the bug we're interested in.

And if you have some instinct as to the source of this bug, let me tell you: it's so much worse than you think.

Exploring UTF-8 and UTF-16 with Rust

Rust strings are UTF-8, period.

And by "Rust strings" I mean the owned type String and string slices, aka &str.

Cool bear's hot tip

There's a ton of other types that dereference to str, like Box<str>, Arc<str> etc., but they're not relevant to this discussion.

We can tell by printing the underlying byte representation:

Rust code
fn main() {
    println!("{:02x?}", "abc".as_bytes());
}

Shell session
$ cargo run --quiet
[61, 62, 63]

Which is not to say that you cannot use UTF-16 string representation in Rust, you just.. make your own type for it. Or, more likely, you use a crate, like widestring:

Shell session
$ cargo add widestring
    Updating crates.io index
      Adding widestring v1.0.2 to dependencies.
             Features:
             + alloc
             + std

Rust code
fn main() {
    // this is a macro, it does the right thing
    let s = widestring::u16str!("abc");
    println!("{:04x?}", s.as_slice());
}

This gives us a &[u16], so it's not quite what we're looking for:

Shell session
$ cargo run --quiet
[0061, 0062, 0063]

Getting at the bytes is harder, but not impossible:

Rust code
fn main() {
    // this is a macro, it does the right thing
    let s = widestring::u16str!("abc");
    {
        let u16s = s.as_slice();
        let (_, u8s, _) = unsafe { u16s.align_to::<u8>() };
        println!("{:02x?}", u8s);
    }
}

Heck, it shouldn't even really require unsafe, since anything that's u16-aligned is also definitely u8-aligned, here, let me make a safe wrapper for it:

Rust code
fn u16_slice_to_u8_slice(s: &[u16]) -> &[u8] {
    unsafe {
        // Safety: u8 slices don't require any alignment, it really doesn't
        // get any smaller without bit-twiddling
        s.align_to::<u8>().1
    }
}

There:

Rust code
fn main() {
    // this is a macro, it does the right thing
    let s = widestring::u16str!("abc");
    {
        let u16s = s.as_slice();
        let u8s = u16_slice_to_u8_slice(u16s);
        println!("{:02x?}", u8s);
    }
}

Shell session
$ cargo run --quiet
[61, 00, 62, 00, 63, 00]
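One caveat: `align_to` just reinterprets memory, so the byte order above is whatever the machine uses (little-endian on x86-64, hence `61, 00`). If you'd rather spell out the byte order and skip `unsafe` altogether, the standard library is enough. A small sketch, where `utf16_le_bytes` is a name I made up, and which copies into a `Vec` rather than borrowing:

```rust
/// Encode `s` as UTF-16 and serialize each code unit as two little-endian bytes.
fn utf16_le_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16()
        .flat_map(|unit| unit.to_le_bytes())
        .collect()
}

fn main() {
    println!("{:02x?}", utf16_le_bytes("abc"));
}
```

Swap `to_le_bytes` for `to_be_bytes` to get the big-endian flavor instead.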

Okay, cool! So utf-16 is just utf-8 with extra zeroes.

No, that's... no.

No, of course not. It's easy with ASCII characters, but it gets more complicated the fancier you want to get.

How about "é" for example. That sounds fancy! How does it look?

Rust code
const S: &str = "é";

fn main() {
    println!("{S:?}");
    println!("UTF-8   {:02x?}", S.as_bytes());
    println!(
        "UTF-16  {:02x?}",
        u16_slice_to_u8_slice((widestring::u16str!(S)).as_slice())
    );
}

fn u16_slice_to_u8_slice(s: &[u16]) -> &[u8] {
    unsafe {
        // Safety: u8 slices don't require any alignment, it really doesn't
        // get any smaller without bit-twiddling
        s.align_to::<u8>().1
    }
}

Shell session
$ cargo run -q
"é"
UTF-8   [c3, a9]
UTF-16  [e9, 00]

Ah, not fancy enough! It takes up 2 bytes in UTF-8, and one "code unit" in UTF-16, which we're showing as two bytes, because we can.

Let's try something fancier!

Shell session
$ cargo run -q
"œ"
UTF-8   [c5, 93]
UTF-16  [53, 01]

"Latin Small Ligature Oe" is in the Latin Extended-A range: that goes 0100-017F. We no longer have a "useless byte" in the UTF-16 encoding, since the code unit is actually "0153" (in hexadecimal).

Meanwhile, in UTF-8 land, we still need two bytes to encode it. All is well.

Let's get fancier still! How about some Hiragana? (3040-309F)

Shell session
$ cargo run -q
"ぁ"
UTF-8   [e3, 81, 81]
UTF-16  [41, 30]

"Hiragana Letter Small A" takes 3 bytes in UTF-8, but only 2 bytes (one 16-bit code unit) in UTF-16. Huh! That must be why people loved UTF-16 so much, they made it the default string representation for Java and JavaScript (no relation).

Because at some point in the history of the human race, we thought 65536 characters were more than enough. It would be absurd to need more characters than that. And so we made UCS-2, which used two bytes for every character, enabling the encoding of the BMP, the Basic Multilingual Plane.

But that's a lie. We knew 65536 characters weren't enough, because China.

(Among other reasons).

So Unicode ended up with room for about 1.1 million code points, most of which wouldn't fit in two bytes, and so we made UCS-4, which used four bytes for every character, but that sounds terribly wasteful compared to plain old ASCII-with-a-codepage-good-luck-exporting-documents I guess.

So ISO/IEC 2022 specifies escape sequences to switch between character sets. I wish I was making it up, but I'm not! Apparently xterm supports some of it! I know!!! I can't believe it either!

So anyway, UTF-16 is an evolution of UCS-2 that says hey maybe switching character encodings in-band isn't the greatest thing (also I'm not sure how widely adopted those escape sequences were in the first place), let's just have a thing where most characters fit in 2 bytes, and those that don't fit in.. more.

So let's look at our Hiragana again but simply print UTF-16 code units now, instead of the underlying two bytes:

Rust code
const S: &str = "ぁ";

fn main() {
    println!("{S:?}");
    println!("UTF-8   {:02x?}", S.as_bytes());
    println!("UTF-16  {:02x?}", (widestring::u16str!(S)).as_slice());
}

Shell session
$ cargo run -q
"ぁ"
UTF-8   [e3, 81, 81]
UTF-16  [3041]

Okay, cool, so: for U+3041, we need 3 bytes in UTF-8, and one UTF-16 code unit (2 bytes).
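If all you want is the sizes, you don't need widestring at all: `char` has `len_utf8` and `len_utf16` built in. A quick sketch over the characters we've looked at so far (`widths` is just a helper name for this example):

```rust
/// (UTF-8 length in bytes, UTF-16 length in 16-bit code units) for one char.
fn widths(c: char) -> (usize, usize) {
    (c.len_utf8(), c.len_utf16())
}

fn main() {
    // a, é, œ, ぁ
    for c in ['a', '\u{E9}', '\u{153}', '\u{3041}'] {
        let (utf8, utf16) = widths(c);
        println!("U+{:04X} {c}: {utf8} UTF-8 byte(s), {utf16} UTF-16 code unit(s)", c as u32);
    }
}
```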

Let us now embark on a cross-plane journey:

Shell session
$ cargo run -q
"𠀀"
UTF-8   [f0, a0, 80, 80]
UTF-16  [d840, dc00]

But first, a minute of silence for anyone reading this from a Linux desktop machine.

...

There. In case you can't see it, the character in question is "Ideograph the sound made by breathing in; oh! (cf. U+311B BOPOMOFO LETTER O, which is derived from this character) CJK", or "hē", for friends.

It is U+20000, which is outside the BMP.

The?

BMP, Basic Multilingual Plane, we've gone over this: it's 0000-FFFF, except... except part of it is reserved for UTF-16 surrogates! High surrogates are D800-DBFF (the Unicode block actually named "High Surrogates" stops at DB7F, the rest are private-use high surrogates), and surprise surprise, that's what our first code unit is: d840. Low surrogates are DC00-DFFF, and that's what our second code unit is.

How does this work? Well, we take our codepoint, in our case U+20000, and subtract 0x10000 from it. We get 0x10000, or, in binary:

 0b00010000000000000000
   <--------><-------->
       hi        lo

Then we take the high ten bits, add them to 0xD800, in our case, 0b1000000 gives 0x40, so we get 0xD840. As for the low ten bits, we add them to 0xDC00, for us that's 0x0 (should've picked a different example), so we get 0xDC00.

Easy peasy.
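The same arithmetic as code, for the skeptics. This is a sketch of the transformation described above, with `to_surrogates` being a name I picked; the standard library's `char::encode_utf16` is what you'd actually use, and it's there as a cross-check:

```rust
/// Split a code point above the BMP into a UTF-16 (high, low) surrogate pair.
/// Returns `None` for code points that fit in one code unit or are out of range.
fn to_surrogates(code_point: u32) -> Option<(u16, u16)> {
    if !(0x10000..=0x10FFFF).contains(&code_point) {
        return None;
    }
    let v = code_point - 0x10000; // now a 20-bit value
    let high = 0xD800 + (v >> 10) as u16; // top ten bits
    let low = 0xDC00 + (v & 0x3FF) as u16; // bottom ten bits
    Some((high, low))
}

fn main() {
    println!("{:04x?}", to_surrogates(0x20000));
    // Cross-check against the standard library's encoder.
    let mut buf = [0u16; 2];
    println!("{:04x?}", '\u{20000}'.encode_utf16(&mut buf));
}
```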

And so, "𠀀" (or "hē", for friends) is, in UTF-16, the surrogate pair d840 dc00: one high surrogate, one low surrogate, four bytes total.

All clear? Good.

How about our emoji? Our innocent little emoji? Look at it. It didn't mean to cause any trouble. Let's find out:

Shell session
$ cargo run -q
"🥺"
UTF-8   [f0, 9f, a5, ba]
UTF-16  [d83e, dd7a]

Okay, same! Ooh I bet we can find what codepoint it is by doing the inverse transformation on the UTF-16 surrogate pair here:

Shell session
$ gdb --quiet
(cut)
(gdb) p/x 0xd83e - 0xd800
$1 = 0x3e
(gdb) p/x 0xdd7a - 0xdc00
$2 = 0x17a

That's your hex calculator of choice?

Shh I'm trying to focus here.

Shell session
(continued)

(gdb) p/t 0x3e
$3 = 111110
(gdb) p/t 0x17a
$4 = 101111010

Wait wait why are you going through binary, you can just do a left shift here, right?

Yes, yes, alright

Shell session
(continued)

(gdb) p/x (0x3e << 10) + 0x17a + 0x10000
$5 = 0x1f97a

Hurray, here it is! U+1F97A Bottom Face Emoji.
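And for those who don't keep gdb open as a calculator, here's the inverse transformation as a sketch in Rust. `from_surrogates` is a made-up helper; `char::decode_utf16` is the standard library's version, which also handles whole sequences and unpaired surrogates:

```rust
/// Combine a UTF-16 surrogate pair back into a `char`.
/// Returns `None` if either unit is outside its surrogate range.
fn from_surrogates(high: u16, low: u16) -> Option<char> {
    // D800-DBFF covers every high surrogate, private-use ones included.
    if !(0xD800..=0xDBFF).contains(&high) || !(0xDC00..=0xDFFF).contains(&low) {
        return None;
    }
    let code_point = (((high - 0xD800) as u32) << 10) + (low - 0xDC00) as u32 + 0x10000;
    char::from_u32(code_point)
}

fn main() {
    let c = from_surrogates(0xD83E, 0xDD7A);
    println!("{:?}", c.map(|c| format!("U+{:X}", c as u32)));
    // The standard library can do the same thing for whole sequences.
    let decoded: Vec<_> = char::decode_utf16([0xD83E_u16, 0xDD7A]).collect();
    println!("{decoded:?}");
}
```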

So. Rust really likes UTF-8, and so do I. Hence, I really like Rust.

Whereas some languages just do not let you mutate strings at all (it's safer that way), Rust does! You can totally do this, for example:

Rust code
fn main() {
    let mut s = String::from("amos");
    s.replace_range(0..1, "c");
    dbg!(s);
}

Shell session
$ cargo run -q
[src/main.rs:4] s = "cmos"

Note that this is safe rust!

Also note that Amos is my first name, and CMOS is a type of metal-oxide-semiconductor field-effect transistor fabrication process that uses complementary and symmetrical pairs of p-type and n-type MOSFETs for logic functions.

Coincidence? Probably.

We can also do this, using unsafe Rust:

Rust code
fn main() {
    let mut s = String::from("amos");
    // why b'c'? because 'c' is a char, whereas b'c' is a u8 (a byte).
    unsafe { s.as_bytes_mut()[0] = b'c' };
    dbg!(s);
}

And we get the same output.

Which begs the question, why is the former safe and the latter unsafe, if they do the same darn thing!

It's because they don't.

The unsafe version lets us do this:

Rust code
fn main() {
    let mut s = String::from("🥺");
    unsafe {
        s.as_bytes_mut()[3] = b'#';
    };
    dbg!(s);
}

Shell session
$ cargo run --quiet
[src/main.rs:6] s = "�#"

Which is, say it with me: undefined behaviorrrr ooooh spooky ghost emoji! We broke an invariant (strings must always be valid UTF-8) and now anything goes, we could even have memory corruption come out of this! Probably!

Them's the rules of unsafe Rust: if we use it, we are suddenly responsible for maintaining all the invariants ourselves. Just like we are, all the time, in unsafe languages like C/C++ (also C, and C++).
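For contrast, here's roughly what it looks like to poke at raw bytes without `unsafe`: copy them out, mutate, and let `String::from_utf8` re-check the invariant on the way back in. This is a sketch, and `overwrite_byte` is a hypothetical helper, not something from the standard library:

```rust
/// Overwrite one byte of `s`, re-validating the result as UTF-8 instead of trusting it.
fn overwrite_byte(s: &str, index: usize, byte: u8) -> Result<String, std::string::FromUtf8Error> {
    let mut bytes = s.as_bytes().to_vec();
    bytes[index] = byte; // panics if `index` is out of bounds
    String::from_utf8(bytes) // checks the invariant that `unsafe` let us skip
}

fn main() {
    match overwrite_byte("🥺", 3, b'#') {
        Ok(s) => println!("still valid: {s:?}"),
        Err(e) => println!("rejected: {e}"),
    }
    // If garbage-in, replacement-character-out is acceptable:
    let lossy = String::from_utf8_lossy(&[0xf0, 0x9f, 0xa5, b'#']).into_owned();
    println!("{lossy:?}");
}
```

It costs an allocation and a validation pass, which is the price of never holding an invalid `String`.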

replace_range does not let us do that:

Rust code
fn main() {
    let mut s = String::from("🥺");
    s.replace_range(3..4, "#");
    dbg!(s);
}

Shell session
$ cargo run --quiet
thread 'main' panicked at 'assertion failed: self.is_char_boundary(n)', /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/string.rs:1811:29
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Well well well, doesn't that look familiar.

Again with a backtrace:

Shell session
$ RUST_BACKTRACE=1 cargo run --quiet
thread 'main' panicked at 'assertion failed: self.is_char_boundary(n)', /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/string.rs:1811:29
stack backtrace:
   0: rust_begin_unwind
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:111:5
   3: alloc::string::String::replace_range
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/string.rs:1811:29
   4: bottom::main
             at ./src/main.rs:3:5
   5: core::ops::function::FnOnce::call_once
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/ops/function.rs:507:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Yep okay this is definitely familiar.
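The assertion that fires is a call to `is_char_boundary`, and nothing stops us from asking the same question before calling `replace_range`. A minimal sketch of a non-panicking variant follows. `try_replace_range` is a name I made up, and I'm not claiming this is what rust-analyzer does or should do:

```rust
use std::ops::Range;

/// Like `String::replace_range`, but returns `false` instead of panicking
/// when the range is out of bounds or cuts through a multi-byte character.
fn try_replace_range(s: &mut String, range: Range<usize>, with: &str) -> bool {
    let valid = range.start <= range.end
        && range.end <= s.len()
        && s.is_char_boundary(range.start)
        && s.is_char_boundary(range.end);
    if valid {
        s.replace_range(range, with);
    }
    valid
}

fn main() {
    let mut s = String::from("🥺");
    println!("3..4 -> {}", try_replace_range(&mut s, 3..4, "#"));
    println!("0..4 -> {}", try_replace_range(&mut s, 0..4, "#"));
    println!("{s:?}");
}
```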

So this is what's happening with lsp-mode + rust-analyzer, right?

lsp-mode is sending bad offsets, rust-analyzer panics. Mystery solved?

Well, yes. But why does that happen? And what should rust-analyzer actually do in this case?
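To get a feel for how an offset can be "bad", it helps to measure the same line in different units and then pretend each number is a byte offset. This sketch deliberately doesn't say which unit lsp-mode counts in, that's the open question; `lengths` is just a helper for the demonstration:

```rust
/// Length of `s` measured three ways: UTF-8 bytes, UTF-16 code units, chars.
fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    let line = "  // 🥺";
    let (bytes, utf16, chars) = lengths(line);
    println!("bytes={bytes} utf16={utf16} chars={chars}");
    // Reinterpreting each count as a *byte* offset into the same line:
    for offset in [bytes, utf16, chars] {
        println!("{offset}: char boundary = {}", line.is_char_boundary(offset));
    }
}
```

Only the byte count lands on a character boundary; the other two end up somewhere inside the emoji, which is exactly the kind of offset `replace_range` refuses.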

The Language Server Protocol

To answer these questions, we must look at the LSP: the Language Server Protocol.

And instead of reading the spec directly, let's look at what's actually being sent between an LSP client (like lsp-mode for Emacs) and an LSP server (like rust-analyzer).

There's something built into lsp-mode to do this, the lsp-log-io setting, but I think it's a lot more fun to capture it ourselves, by providing a wrapper for rust-analyzer.

Zig
// in `src/main.zig`

const std = @import("std");
const ChildProcess = std.ChildProcess;
const StdIo = ChildProcess.StdIo;
const File = std.fs.File;
const Thread = std.Thread;

const DEBUG = false;
const DEBUG_LEAKS = false;
const CLEANUP_STREAMS_WORKAROUND_NAME = ".cleanupStreamsWorkaround";

fn debug(comptime fmt: []const u8, args: anytype) void {
    if (DEBUG) {
        std.debug.print(fmt, args);
    }
}

pub fn main() !void {
    var alloc: std.mem.Allocator = undefined; // like the behavior

    var gpa: ?std.heap.GeneralPurposeAllocator(.{}) = null; // null, not undefined: the `defer` below reads it
    // this `defer` cannot live inside the `if DEBUG_LEAKS` block, because then
    // the allocator would be deinitialized before our program even runs.
    defer if (gpa) |*g| {
        std.debug.print("Checking for memory leaks...\n", .{});
        std.debug.assert(!g.deinit());
    };

    if (DEBUG_LEAKS) {
        // can't figure out how to elide the type name here
        // assigning apparently promotes from `GPA` to `?GPA`
        gpa = std.heap.GeneralPurposeAllocator(.{}){};
        // ...but we still need to 'unwrap' it here (could get away
        // with a temporary variable)
        alloc = gpa.?.allocator();
    } else {
        alloc = std.heap.page_allocator;
    }

    // note: on Linux, this ends up using `fork+execve`. this is a bad idea,
    // as others have found out before: https://github.com/golang/go/issues/5838
    //
    // I was curious why passing `rustup` here worked: for me it had to either
    // parse `PATH` or go through a shell of some sort. Turns out it _does_ parse
    // `PATH` (falling back to `/usr/local/bin:/bin:/usr/bin`) and then does
    // a _bunch_ of `execve` calls in a row, continuing until it doesn't return
    // "AccessDenied" or "FileNotFound".
    var res = try ChildProcess.exec(.{ // anonymous struct literal
        // C99 syntax, yum
        .allocator = alloc,
        .argv = &[_][]const u8{ "rustup", "which", "rust-analyzer" },
    });
    // `ExecResult` doesn't have a `deinit` method, if you don't call these
    // you have a leak
    defer alloc.free(res.stdout);
    defer alloc.free(res.stderr);

    var ra_path = x: { // block with a label
        // I'm not sure why I have to pass `u8` here instead of it being inferred
        var splits = std.mem.split(u8, res.stdout, "\n");
        // breaking the label - blocks aren't expressions, so this is how
        // it's done. note that `first()` doesn't return `?[]const u8`, it's
        // infallible
        break :x splits.first();
    };
    // ra_path is a slice of res.stdout, so we can't free it first
    debug("RA path = '{s}'\n", .{ra_path});

    var proc = ChildProcess.init(
        // the array length can be inferred (from `_`), but the element type cannot
        &[_][]const u8{ra_path},
        alloc,
    );
    // nothing's stopping you from assigning `proc.stdin` etc. at this stage
    // but it won't do anything.
    proc.stdin_behavior = StdIo.Pipe;
    proc.stdout_behavior = StdIo.Pipe;
    proc.stderr_behavior = StdIo.Inherit;
    try proc.spawn();

    // the first argument is `SpawnConfig`, its only member right now
    // is `stack_size`, which defaults to 16MiB.
    //
    // the third argument is a tuple, which corresponds to the arguments of
    // the function you pass. the function takes a single argument which is an
    // anonymous struct, hence the `.{.{}}` dance
    var copy_stdin = try Thread.spawn(.{}, forward, .{.{ .alloc = alloc, .dir = "to-server", .src = std.io.getStdIn(), .dst = &proc.stdin.?, .close_on_eof = true }});
    // this avoids an `expected *T but got *const T` error
    var our_stdout = std.io.getStdOut();

    // close_on_eof is omitted here, so it takes the default value (which is nice)
    var copy_stdout = try Thread.spawn(.{}, forward, .{.{ .alloc = alloc, .dir = "from-server", .src = proc.stdout.?, .dst = &our_stdout }});

    copy_stdin.join();
    copy_stdout.join();

    var term = try proc.wait();
    // note: "structs implement Debug", sort of — printing a `StringHashMap`
    // will make you really sad. any field that's `[]const u8` will be printed
    // as an array of numbers
    debug("term = {}\n", .{term});

    try std.fs.cwd().deleteFile(CLEANUP_STREAMS_WORKAROUND_NAME);
}

// note: `!void` means "return any error or void" (and then you can match on that).
// we could also list every possible error type we return, but it's a lot:
// error{AccessDenied,BadPathName,DeviceBusy,FileBusy,FileLocksNotSupported,
// FileNotFound,FileTooBig,InvalidHandle,InvalidUtf8,IsDir,NameTooLong,NoDevice,
// NoSpaceLeft,NotDir,PathAlreadyExists,PipeBusy,ProcessFdQuotaExceeded,
// SharingViolation,SymLinkLoop,SystemFdQuotaExceeded,SystemResources,
// Unexpected,WouldBlock}
//
// dst is a pointer, because we have to be able to close it _and_ replace it
// with a valid file descriptor, see below.
fn forward(args: struct { alloc: std.mem.Allocator, dir: []const u8, src: File, dst: *File, close_on_eof: bool = false }) !void {
    // note: no destructuring assignment, gotta keep the language small 🤷
    var alloc = args.alloc;
    var src = args.src;
    var dst = args.dst;

    // note: removing the `@as` complains about something comptime related
    var msg_number = @as(usize, 1);
    // note: try is sorta like `?` in Rust, it bubbles up any error
    var tmp = try std.fs.openDirAbsolute("/tmp", .{});

    // you can have tagged enums but you need to define two types
    const StageTag = enum {
        headers,
        body,
    };
    const Stage = union(StageTag) { headers: void, body: struct {
        headers: std.StringHashMap([]const u8),
        content_length: usize,
    } };

    var stage = Stage{
        // `{}` is void, not sure why
        .headers = {},
    };

    const delimiter = "\r\n\r\n";
    // note: intentionally starting with a small buffer to test realloc
    var buffer: []u8 = try alloc.alloc(u8, 10);
    // this makes me nervous, what if `buffer` is re-assigned before free-ing?
    // which value gets freed, the first one or the last one?
    defer alloc.free(buffer);

    // using `@as` would've been an option here too, not sure why
    // some std library code prefers `@as` to this:
    var read: usize = 0;

    while (true) {
        debug("buffer is {} bytes\n", .{buffer.len});

        // note: this does nothing in ReleaseFast and ReleaseSmall modes
        std.debug.assert(read < buffer.len);

        var n = try src.read(buffer[read..]);
        debug("read {}\n", .{n});
        if (n == 0) {
            debug("zero-length read, probably EOF\n", .{});
            if (args.close_on_eof) {
                dst.close();
                // `Process.wait` closes any stream that has StdIo.Pipe behavior,
                // but since we've already closed it, it'll reach "unreachable"
                // code when it notices close returned .EBADF.
                // Unfortunately we really _do_ need to close it here, so we
                // substitute another valid file descriptor to work around this.
                dst.* = try std.fs.cwd().createFile(CLEANUP_STREAMS_WORKAROUND_NAME, .{});
            }
            return;
        }
        read += n;
        debug("{} bytes in buffer\n", .{read});

        switch (stage) {
            StageTag.headers => {
                debug("Looking for CRLF\n", .{});
                x: {
                    // `orelse` unwraps optionals, going from `?T` to `T`.
                    // we use a labelled break to break out of this switch arm
                    // if we haven't found CRLF yet.
                    //
                    // we once again have to specify `u8` even though we're
                    // passing a `[]u8`
                    //
                    // note: `buffer[..n]` is not valid syntax, but `buffer[n..]` is.
                    // search everything read so far (`read`), not just the latest chunk (`n`)
                    var index = std.mem.indexOf(u8, buffer[0..read], delimiter) orelse break :x;

                    debug("found CRLF at {}\n", .{index});
                    var msg = buffer[0..index];
                    debug("msg = {s}\n", .{msg});

                    // this hashes keys as strings - `AutoHashMap` is very
                    // unhappy if you try to use it with `[]const u8` keys
                    var headers = std.StringHashMap([]const u8).init(alloc);

                    var splits = std.mem.split(u8, msg, "\r\n");
                    while (splits.next()) |line| {
                        var tokens = std.mem.split(u8, line, ": ");
                        // essentially "strdup", very important since there's no
                        // concept of ownership in zig: de/re-allocating after
                        // that would make the hashmap point to garbage
                        var k = try alloc.dupe(u8, tokens.next().?);
                        var v = try alloc.dupe(u8, tokens.next().?);
                        try headers.put(k, v);
                    }

                    var content_length_s = headers.get("Content-Length").?;
                    var content_length = try std.fmt.parseUnsigned(usize, content_length_s, 10);
                    debug("content-length = {}\n", .{content_length});

                    // note: overlapping src/dst are fine as long as you're
                    // copying to the left, not to the right.
                    std.mem.copy(u8, buffer, buffer[index + delimiter.len ..]);
                    read -= (index + delimiter.len);

                    stage = Stage{
                        .body = .{
                            .headers = headers,
                            .content_length = content_length,
                        },
                    };
                }
            },
            // note: if we used |head| we'd get a copy of the struct, which means
            // a copy of the hashmap, which means we couldn't deinit it.
            StageTag.body => |*head| {
                if (read >= head.content_length) {
                    var body = buffer[0..head.content_length];

                    var msg_dst_path = try std.fmt.allocPrint(alloc, "{s}-{d:0>3}.bin", .{ args.dir, msg_number });
                    // note: unlike Go, `defer` happens at the end of the scope
                    // (a block is a scope, an if is a scope), not on function return
                    defer alloc.free(msg_dst_path);

                    msg_number += 1;
                    // note: this is why we opened `/tmp`, zig's stdlib is dead
                    // set on using `openat`.
                    var msg_dst = try tmp.createFile(msg_dst_path, .{});
                    defer msg_dst.close();

                    var iter = head.headers.iterator();
                    while (iter.next()) |entry| {
                        debug("header: {s}: {s}\n", .{ entry.key_ptr.*, entry.value_ptr.* });
                        var header_line = try std.fmt.allocPrint(alloc, "{s}: {s}\r\n", .{ entry.key_ptr.*, entry.value_ptr.* });
                        defer alloc.free(header_line);

                        try dst.writeAll(header_line);
                        try msg_dst.writeAll(header_line);
                    }
                    try dst.writeAll("\r\n");
                    try msg_dst.writeAll("\r\n");
                    try dst.writeAll(body);
                    try msg_dst.writeAll(body);

                    std.mem.copy(u8, buffer, buffer[head.content_length..]);
                    read -= head.content_length;

                    var free_iter = head.headers.iterator();
                    while (free_iter.next()) |entry| {
                        // key/values need to be freed _before_ calling deinit
                        // if you remove `.*` from these, the compiler will crash
                        // with a monochrome message (it's usually colored)
                        alloc.free(entry.key_ptr.*);
                        alloc.free(entry.value_ptr.*);
                    }
                    head.headers.deinit();

                    stage = Stage{
                        .headers = {},
                    };
                }
                // otherwise keep reading
            },
        }

        if (read >= buffer.len) {
            buffer = try alloc.realloc(buffer, buffer.len * 2);
            debug("re-allocated buffer, new len = {}\n", .{buffer.len});
        }
    }
}

Of course we'll need a build.zig file too, which zig init-exe can generate. This is what I got:

Zig
const std = @import("std");

pub fn build(b: *std.build.Builder) void {
    // Standard target options allows the person running `zig build` to choose
    // what target to build for. Here we do not override the defaults, which
    // means any target is allowed, and the default is native. Other options
    // for restricting supported target set are available.
    const target = b.standardTargetOptions(.{});

    // Standard release options allow the person running `zig build` to select
    // between Debug, ReleaseSafe, ReleaseFast, and ReleaseSmall.
    const mode = b.standardReleaseOptions();

    const exe = b.addExecutable("lsp-proxy", "src/main.zig");
    exe.setTarget(target);
    exe.setBuildMode(mode);
    exe.install();

    const run_cmd = exe.run();
    run_cmd.step.dependOn(b.getInstallStep());
    if (b.args) |args| {
        run_cmd.addArgs(args);
    }

    const run_step = b.step("run", "Run the app");
    run_step.dependOn(&run_cmd.step);

    const exe_tests = b.addTest("src/main.zig");
    exe_tests.setTarget(target);
    exe_tests.setBuildMode(mode);

    const test_step = b.step("test", "Run unit tests");
    test_step.dependOn(&exe_tests.step);
}

We can then make a ReleaseFast build:

Shell session
$ time zig build -Drelease-fast
zig build -Drelease-fast  6.30s user 0.32s system 101% cpu 6.551 total

And drop it inside ~/.cargo/bin:

Shell session
$ cp zig-out/bin/lsp-proxy ~/.cargo/bin/rust-analyzer

Now, if we run emacs again, and try to insert the emoji, we can see what happens because each message is dumped as /tmp/from-server-XXX.bin and /tmp/to-server-XXX.bin.

Here's the first message emacs/lsp-mode sends to rust-analyzer:

Content-Length: 4467

{"jsonrpc":"2.0","method":"initialize","params":{"processId":252882,"rootPath":"/home/amos/bearcove/bottom","clientInfo":{"name":"emacs","version":"GNU Emacs 27.1 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.30, cairo version 1.16.0)\n of 2022-01-24, modified by Debian"},"rootUri":"file:///home/amos/bearcove/bottom","capabilities":{"workspace":{"workspaceEdit":{"documentChanges":true,"resourceOperations":["create","rename","delete"]},"applyEdit":true,"symbol":{"symbolKind":{"valueSet":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]}},"executeCommand":{"dynamicRegistration":false},"didChangeWatchedFiles":{"dynamicRegistration":true},"workspaceFolders":true,"configuration":true,"codeLens":{"refreshSupport":true},"fileOperations":{"didCreate":false,"willCreate":false,"didRename":true,"willRename":true,"didDelete":false,"willDelete":false}},"textDocument":{"declaration":{"dynamicRegistration":true,"linkSupport":true},"definition":{"dynamicRegistration":true,"linkSupport":true},"references":{"dynamicRegistration":true},"implementation":{"dynamicRegistration":true,"linkSupport":true},"typeDefinition":{"dynamicRegistration":true,"linkSupport":true},"synchronization":{"willSave":true,"didSave":true,"willSaveWaitUntil":true},"documentSymbol":{"symbolKind":{"valueSet":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]},"hierarchicalDocumentSymbolSupport":true},"formatting":{"dynamicRegistration":true},"rangeFormatting":{"dynamicRegistration":true},"onTypeFormatting":{"dynamicRegistration":true},"rename":{"dynamicRegistration":true,"prepareSupport":true},"codeAction":{"dynamicRegistration":true,"isPreferredSupport":true,"codeActionLiteralSupport":{"codeActionKind":{"valueSet":["","quickfix","refactor","refactor.extract","refactor.inline","refactor.rewrite","source","source.organizeImports"]}},"resolveSupport":{"properties":["edit","command"]},"dataSupport":true},"completion":{"completionItem":{"snippetSupport":false,"documentationFormat":["markdown","plaintext"],"resolveAdditionalTextEditsSupport":true,"insertReplaceSupport":true,"deprecatedSupport":true,"resolveSupport":{"properties":["documentation","detail","additionalTextEdits","command"]},"insertTextModeSupport":{"valueSet":[1,2]}},"contextSupport":true,"dynamicRegistration":true},"signatureHelp":{"signatureInformation":{"parameterInformation":{"labelOffsetSupport":true}},"dynamicRegistration":true},"documentLink":{"dynamicRegistration":true,"tooltipSupport":true},"hover":{"contentFormat":["markdown","plaintext"],"dynamicRegistration":true},"foldingRange":{"dynamicRegistration":true},"selectionRange":{"dynamicRegistration":true},"callHierarchy":{"dynamicRegistration":false},"typeHierarchy":{"dynamicRegistration":true},"publishDiagnostics":{"relatedInformation":true,"tagSupport":{"valueSet":[1,2]},"versionSupport":true},"linkedEditingRange":{"dynamicRegistration":true}},"window":{"workDoneProgress":true,"showDocument":{"support":true}},"experimental":{"snippetTextEdit":null}},"initializationOptions":{"diagnostics":{"enable":true,"enableExperimental":false,"disabled":[],"warningsAsHint":[],"warningsAsInfo":[]},"imports":{"granularity":{"enforce":false,"group":"crate"},"group":true,"merge":{"glob":true},"prefix":"plain"},"lruCapacity":null,"checkOnSave":{"enable":true,"command":"clippy","extraArgs":[],"features":[],"overrideCommand":[]},"files":{"exclude":[],"watcher":"client","excludeDirs":[]},"cargo":{"allFeatures":false,"noDefaultFeatures":false,"features":[],"target":null,"runBuildScripts":true,"loadOutDirsFromCheck":true,"autoreload":true,"useRustcWrapperForBuildScripts":true,"unsetTest":[]},"rustfmt":{"extraArgs":[],"overrideCommand":[],"rangeFormatting":{"enable":false}},"inlayHints":{"bindingModeHints":false,"chainingHints":false,"closingBraceHints":{"enable":true,"minLines":25},"closureReturnTypeHints":false,"lifetimeElisionHints":{"enable":"never","useParameterNames":false},"maxLength":null,"parameterHints":false,"reborrowHints":"never","renderColons":true,"typeHints":{"enable":true,"hideClosureInitialization":false,"hideNamedConstructor":false}},"completion":{"addCallParenthesis":true,"addCallArgumentSnippets":true,"postfix":{"enable":true},"autoimport":{"enable":true},"autoself":{"enable":true}},"callInfo":{"full":true},"procMacro":{"enable":true},"rustcSource":null,"linkedProjects":[],"highlighting":{"strings":true},"experimental":{"procAttrMacros":true}},"workDoneToken":"1"},"id":1}

Interestingly, instead of going for newline-separated JSON, LSP goes for an HTTP/1.1-like protocol, with a header section. I don't love this, but nobody asked.
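For the curious, here's roughly what that framing looks like from the receiving end, as a minimal Rust sketch. It does the same job as the Zig proxy's headers/body stages, but it's not the code rust-analyzer or lsp-mode actually use: `parse_frame` is a made-up name, and the sketch lumps "not enough bytes yet" and "malformed header" together as `None`.

```rust
/// Try to split one framed message off the front of `buf`.
///
/// Returns the body and the total number of bytes consumed, or `None` when
/// the header section or the body is incomplete (or, in this simplified
/// sketch, when the headers are malformed).
fn parse_frame(buf: &[u8]) -> Option<(&[u8], usize)> {
    const DELIMITER: &[u8] = b"\r\n\r\n";

    // Find the blank line that ends the header section.
    let header_end = buf.windows(DELIMITER.len()).position(|w| w == DELIMITER)?;

    // Headers are ASCII, so this conversion is expected to succeed.
    let headers = std::str::from_utf8(&buf[..header_end]).ok()?;

    // Look for `Content-Length: <n>` among the header lines.
    let mut content_length = None;
    for line in headers.split("\r\n") {
        if let Some((name, value)) = line.split_once(": ") {
            if name.eq_ignore_ascii_case("Content-Length") {
                content_length = value.trim().parse::<usize>().ok();
            }
        }
    }

    let body_start = header_end + DELIMITER.len();
    let body_end = body_start.checked_add(content_length?)?;

    // `get` returns `None` if the body hasn't fully arrived yet.
    let body = buf.get(body_start..body_end)?;
    Some((body, body_end))
}
```

A caller would keep appending bytes to a buffer, call `parse_frame` on it, and drop the consumed prefix whenever it returns `Some`.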

One long line isn't very readable, so I've prettified the JSON for you:

JSON
{
  "jsonrpc": "2.0",
  "method": "initialize",
  "params": {
    "processId": 252882,
    "rootPath": "/home/amos/bearcove/bottom",
    "clientInfo": {
      "name": "emacs",
      "version": "GNU Emacs 27.1 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.30, cairo version 1.16.0)\n of 2022-01-24, modified by Debian"
    },
    "rootUri": "file:///home/amos/bearcove/bottom",
    "capabilities": {
      "workspace": {
        "workspaceEdit": {
          "documentChanges": true,
          "resourceOperations": [
            "create",
            "rename",
            "delete"
          ]
        },
        "applyEdit": true,
        "symbol": {
          "symbolKind": {
            "valueSet": [
              1,
              2,
              3,
              4,
              5,
              6,
              7,
              8,
              9,
              10,
              11,
              12,
              13,
              14,
              15,
              16,
              17,
              18,
              19,
              20,
              21,
              22,
              23,
              24,
              25,
              26
            ]
          }
        },
        "executeCommand": {
          "dynamicRegistration": false
        },
        "didChangeWatchedFiles": {
          "dynamicRegistration": true
        },
        "workspaceFolders": true,
        "configuration": true,
        "codeLens": {
          "refreshSupport": true
        },
        "fileOperations": {
          "didCreate": false,
          "willCreate": false,
          "didRename": true,
          "willRename": true,
          "didDelete": false,
          "willDelete": false
        }
      },
      "textDocument": {
        "declaration": {
          "dynamicRegistration": true,
          "linkSupport": true
        },
        "definition": {
          "dynamicRegistration": true,
          "linkSupport": true
        },
        "references": {
          "dynamicRegistration": true
        },
        "implementation": {
          "dynamicRegistration": true,
          "linkSupport": true
        },
        "typeDefinition": {
          "dynamicRegistration": true,
          "linkSupport": true
        },
        "synchronization": {
          "willSave": true,
          "didSave": true,
          "willSaveWaitUntil": true
        },
        "documentSymbol": {
          "symbolKind": {
            "valueSet": [
              1,
              2,
              3,
              4,
              5,
              6,
              7,
              8,
              9,
              10,
              11,
              12,
              13,
              14,
              15,
              16,
              17,
              18,
              19,
              20,
              21,
              22,
              23,
              24,
              25,
              26
            ]
          },
          "hierarchicalDocumentSymbolSupport": true
        },
        "formatting": {
          "dynamicRegistration": true
        },
        "rangeFormatting": {
          "dynamicRegistration": true
        },
        "onTypeFormatting": {
          "dynamicRegistration": true
        },
        "rename": {
          "dynamicRegistration": true,
          "prepareSupport": true
        },
        "codeAction": {
          "dynamicRegistration": true,
          "isPreferredSupport": true,
          "codeActionLiteralSupport": {
            "codeActionKind": {
              "valueSet": [
                "",
                "quickfix",
                "refactor",
                "refactor.extract",
                "refactor.inline",
                "refactor.rewrite",
                "source",
                "source.organizeImports"
              ]
            }
          },
          "resolveSupport": {
            "properties": [
              "edit",
              "command"
            ]
          },
          "dataSupport": true
        },
        "completion": {
          "completionItem": {
            "snippetSupport": false,
            "documentationFormat": [
              "markdown",
              "plaintext"
            ],
            "resolveAdditionalTextEditsSupport": true,
            "insertReplaceSupport": true,
            "deprecatedSupport": true,
            "resolveSupport": {
              "properties": [
                "documentation",
                "detail",
                "additionalTextEdits",
                "command"
              ]
            },
            "insertTextModeSupport": {
              "valueSet": [
                1,
                2
              ]
            }
          },
          "contextSupport": true,
          "dynamicRegistration": true
        },
        "signatureHelp": {
          "signatureInformation": {
            "parameterInformation": {
              "labelOffsetSupport": true
            }
          },
          "dynamicRegistration": true
        },
        "documentLink": {
          "dynamicRegistration": true,
          "tooltipSupport": true
        },
        "hover": {
          "contentFormat": [
            "markdown",
            "plaintext"
          ],
          "dynamicRegistration": true
        },
        "foldingRange": {
          "dynamicRegistration": true
        },
        "selectionRange": {
          "dynamicRegistration": true
        },
        "callHierarchy": {
          "dynamicRegistration": false
        },
        "typeHierarchy": {
          "dynamicRegistration": true
        },
        "publishDiagnostics": {
          "relatedInformation": true,
          "tagSupport": {
            "valueSet": [
              1,
              2
            ]
          },
          "versionSupport": true
        },
        "linkedEditingRange": {
          "dynamicRegistration": true
        }
      },
      "window": {
        "workDoneProgress": true,
        "showDocument": {
          "support": true
        }
      },
      "experimental": {
        "snippetTextEdit": null
      }
    },
    "initializationOptions": {
      "diagnostics": {
        "enable": true,
        "enableExperimental": false,
        "disabled": [
          
        ],
        "warningsAsHint": [
          
        ],
        "warningsAsInfo": [
          
        ]
      },
      "imports": {
        "granularity": {
          "enforce": false,
          "group": "crate"
        },
        "group": true,
        "merge": {
          "glob": true
        },
        "prefix": "plain"
      },
      "lruCapacity": null,
      "checkOnSave": {
        "enable": true,
        "command": "clippy",
        "extraArgs": [
          
        ],
        "features": [
          
        ],
        "overrideCommand": [
          
        ]
      },
      "files": {
        "exclude": [
          
        ],
        "watcher": "client",
        "excludeDirs": [
          
        ]
      },
      "cargo": {
        "allFeatures": false,
        "noDefaultFeatures": false,
        "features": [
          
        ],
        "target": null,
        "runBuildScripts": true,
        "loadOutDirsFromCheck": true,
        "autoreload": true,
        "useRustcWrapperForBuildScripts": true,
        "unsetTest": [
          
        ]
      },
      "rustfmt": {
        "extraArgs": [
          
        ],
        "overrideCommand": [
          
        ],
        "rangeFormatting": {
          "enable": false
        }
      },
      "inlayHints": {
        "bindingModeHints": false,
        "chainingHints": false,
        "closingBraceHints": {
          "enable": true,
          "minLines": 25
        },
        "closureReturnTypeHints": false,
        "lifetimeElisionHints": {
          "enable": "never",
          "useParameterNames": false
        },
        "maxLength": null,
        "parameterHints": false,
        "reborrowHints": "never",
        "renderColons": true,
        "typeHints": {
          "enable": true,
          "hideClosureInitialization": false,
          "hideNamedConstructor": false
        }
      },
      "completion": {
        "addCallParenthesis": true,
        "addCallArgumentSnippets": true,
        "postfix": {
          "enable": true
        },
        "autoimport": {
          "enable": true
        },
        "autoself": {
          "enable": true
        }
      },
      "callInfo": {
        "full": true
      },
      "procMacro": {
        "enable": true
      },
      "rustcSource": null,
      "linkedProjects": [
        
      ],
      "highlighting": {
        "strings": true
      },
      "experimental": {
        "procAttrMacros": true
      }
    },
    "workDoneToken": "1"
  },
  "id": 1
}

Good job on scrolling down all that way! Take a second to catch your breath.

You can check the JSON-RPC 2.0 spec if you're curious about the request/response coating we have here, but mostly, yeah, the client is announcing capabilities to the server, and passing some cargo/rustfmt specific config.

They even have a user-agent! (clientInfo.version).

Similarly, the server responds with its own capabilities:

JSON
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "capabilities": {
      "textDocumentSync": {
        "openClose": true,
        "change": 2,
        "save": {
          
        }
      },
      "selectionRangeProvider": true,
      "hoverProvider": true,
      "completionProvider": {
        "resolveProvider": true,
        "triggerCharacters": [
          ":",
          ".",
          "'",
          "("
        ],
        "completionItem": {
          "labelDetailsSupport": false
        }
      },
      "signatureHelpProvider": {
        "triggerCharacters": [
          "(",
          ",",
          "<"
        ]
      },
      "definitionProvider": true,
      "typeDefinitionProvider": true,
      "implementationProvider": true,
      "referencesProvider": true,
      "documentHighlightProvider": true,
      "documentSymbolProvider": true,
      "workspaceSymbolProvider": true,
      "codeActionProvider": {
        "codeActionKinds": [
          "",
          "quickfix",
          "refactor",
          "refactor.extract",
          "refactor.inline",
          "refactor.rewrite"
        ],
        "resolveProvider": true
      },
      "codeLensProvider": {
        "resolveProvider": true
      },
      "documentFormattingProvider": true,
      "documentRangeFormattingProvider": false,
      "documentOnTypeFormattingProvider": {
        "firstTriggerCharacter": "=",
        "moreTriggerCharacter": [
          ".",
          ">",
          "{"
        ]
      },
      "renameProvider": {
        "prepareProvider": true
      },
      "foldingRangeProvider": true,
      "declarationProvider": true,
      "workspace": {
        "fileOperations": {
          "willRename": {
            "filters": [
              {
                "scheme": "file",
                "pattern": {
                  "glob": "**/*.rs",
                  "matches": "file"
                }
              },
              {
                "scheme": "file",
                "pattern": {
                  "glob": "**",
                  "matches": "folder"
                }
              }
            ]
          }
        }
      },
      "callHierarchyProvider": true,
      "semanticTokensProvider": {
        "legend": {
          "tokenTypes": [
            "comment",
            "decorator",
            "enumMember",
            "enum",
            "function",
            "interface",
            "keyword",
            "macro",
            "method",
            "namespace",
            "number",
            "operator",
            "parameter",
            "property",
            "string",
            "struct",
            "typeParameter",
            "variable",
            "angle",
            "arithmetic",
            "attribute",
            "attributeBracket",
            "bitwise",
            "boolean",
            "brace",
            "bracket",
            "builtinAttribute",
            "builtinType",
            "character",
            "colon",
            "comma",
            "comparison",
            "constParameter",
            "derive",
            "deriveHelper",
            "dot",
            "escapeSequence",
            "formatSpecifier",
            "generic",
            "label",
            "lifetime",
            "logical",
            "macroBang",
            "parenthesis",
            "punctuation",
            "selfKeyword",
            "selfTypeKeyword",
            "semicolon",
            "typeAlias",
            "toolModule",
            "union",
            "unresolvedReference"
          ],
          "tokenModifiers": [
            "documentation",
            "declaration",
            "static",
            "defaultLibrary",
            "async",
            "attribute",
            "callable",
            "constant",
            "consuming",
            "controlFlow",
            "crateRoot",
            "injected",
            "intraDocLink",
            "library",
            "mutable",
            "public",
            "reference",
            "trait",
            "unsafe"
          ]
        },
        "range": true,
        "full": {
          "delta": true
        }
      },
      "inlayHintProvider": {
        "resolveProvider": true
      },
      "experimental": {
        "externalDocs": true,
        "hoverRange": true,
        "joinLines": true,
        "matchingBrace": true,
        "moveItem": true,
        "onEnter": true,
        "openCargoToml": true,
        "parentModule": true,
        "runnables": {
          "kinds": [
            "cargo"
          ]
        },
        "ssr": true,
        "workspaceSymbolScopeKindFiltering": true
      }
    },
    "serverInfo": {
      "name": "rust-analyzer",
      "version": "1.67.1 (d5a82bb 2023-02-07)"
    }
  }
}

Which is thankfully much shorter.

So, what character encoding do we have here? It's UTF-8. You can pass a Content-Type header, but I don't think anyone does, and it defaults to application/vscode-jsonrpc; charset=utf-8.

Note that headers are ASCII-only, just like HTTP/1.1 (don't lawyer me on this) — that already lets us know it's not using UTF-16 for everything.

In this code, when inserting the bottom emoji at this location:

Rust code
fn main() {
    // 🥺
    println!("Hello, world!");
}

These are the messages sent by lsp-mode to rust-analyzer:

JSON
{
  "jsonrpc": "2.0",
  "method": "textDocument/didChange",
  "params": {
    "textDocument": {
      "uri": "file:///home/amos/bearcove/bottom/src/main.rs",
      "version": 1
    },
    "contentChanges": [
      {
        "range": {
          "start": {
            "line": 1,
            "character": 7
          },
          "end": {
            "line": 1,
            "character": 7
          }
        },
        "rangeLength": 0,
        "text": "🥺"
      }
    ]
  }
}

It's important to note that the file hasn't been saved yet. You don't need to save your code to get completions, or be able to hover symbols or jump to definitions: that means the LSP server operates on what the editor has in memory, and so it needs a view of that, not of the file. I don't think LSP servers need to do any I/O at all (at least not for opened documents).

Here the change is: for line 1 (lines are zero-indexed), at character 7, we're inserting "🥺". I imagine if the range wasn't of length zero, this would serve as a "replace" operation.
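To make the "insert or replace" reading concrete, here's a small Rust sketch of how a server might apply such a change to its in-memory copy of the document. The function names are hypothetical, and the sketch assumes that `character` counts UTF-16 code units within the line: that's an assumption of this example, not something the message itself tells us.

```rust
/// Convert a (line, character) position into a byte offset in `text`.
/// ASSUMPTION: `character` counts UTF-16 code units within the line.
fn position_to_offset(text: &str, line: usize, character: usize) -> Option<usize> {
    // Byte offset at which the requested (zero-indexed) line starts.
    let mut line_start = 0;
    for _ in 0..line {
        line_start += text[line_start..].find('\n')? + 1;
    }
    let rest = &text[line_start..];
    let line_text = &rest[..rest.find('\n').unwrap_or(rest.len())];

    // Walk the line, counting UTF-16 code units until we reach `character`.
    let mut units = 0;
    for (byte_idx, ch) in line_text.char_indices() {
        if units == character {
            return Some(line_start + byte_idx);
        }
        units += ch.len_utf16();
    }
    // The position may also sit at the very end of the line.
    if units == character {
        Some(line_start + line_text.len())
    } else {
        None
    }
}

/// Replace the text between `start` and `end` (both `(line, character)`)
/// with `new_text`. A zero-length range makes this a plain insertion.
fn apply_change(
    text: &mut String,
    start: (usize, usize),
    end: (usize, usize),
    new_text: &str,
) -> Option<()> {
    let from = position_to_offset(text.as_str(), start.0, start.1)?;
    let to = position_to_offset(text.as_str(), end.0, end.1)?;
    if from > to {
        return None;
    }
    text.replace_range(from..to, new_text);
    Some(())
}
```

With the message above, that's `apply_change(&mut doc, (1, 7), (1, 7), "🥺")`: start and end are identical, so nothing is removed and the emoji is inserted.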

Then, we have:

JSON
{
  "jsonrpc": "2.0",
  "method": "textDocument/codeAction",
  "params": {
    "textDocument": {
      "uri": "file:///home/amos/bearcove/bottom/src/main.rs"
    },
    "range": {
      "start": {
        "line": 1,
        "character": 8
      },
      "end": {
        "line": 1,
        "character": 8
      }
    },
    "context": {
      "diagnostics": [
        
      ]
    }
  },
  "id": 17
}

According to the spec:

The code action request is sent from the client to the server to compute commands for a given text document and range.

These commands are typically code fixes to either fix problems or to beautify/refactor code.

The result of a textDocument/codeAction request is an array of Command literals which are typically presented in the user interface.

And here lies the heart of the issue: "🥺" was inserted at character 7. And lsp-mode is now asking for code actions at character 8.

Which raises the question: what is a character?

What is a character?

Of course, nobody can quite agree on the answer.

You won't make enemies by stating that "a" is a character.

You might even convince them that "é" is a character.

But is "é" one character?

Wait, you said the same thing twice.

No I didn't!

Shell session
$ node
Welcome to Node.js v18.14.0.
Type ".help" for more information.
> new TextEncoder().encode("é")
Uint8Array(2) [ 195, 169 ]
> new TextEncoder().encode("é")
Uint8Array(3) [ 101, 204, 129 ]

The former is U+00E9 Latin Small Letter E with Acute, whereas the latter is U+0065 Latin Small Letter E followed by U+0301 Combining Acute Accent.

They're, respectively, one Unicode code point...

Shell session
> "é".length
1
> "é".charAt(0) 
"é"

...and two Unicode code points:

Shell session
> "é".length
2
> "é".charAt(0)
'e'
> "é".charAt(1)
'́'

They are both, however, only one grapheme cluster.
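Since the two spellings look identical on screen, it can help to write them with escapes instead. Here's a short Rust sketch of the same comparison (the constant and function names are mine, purely for illustration). Counting grapheme clusters would need an external crate, so this sticks to bytes and code points.

```rust
/// Precomposed form: U+00E9 Latin Small Letter E with Acute.
const PRECOMPOSED: &str = "\u{e9}";
/// Decomposed form: U+0065 followed by U+0301 Combining Acute Accent.
const DECOMPOSED: &str = "e\u{301}";

/// List the code points of `s` in `U+XXXX` notation.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}
```

`PRECOMPOSED.as_bytes()` gives the two bytes 195, 169 and `DECOMPOSED.as_bytes()` the three bytes 101, 204, 129, matching the Node output above. Rust's `==` on strings compares bytes, so the two forms are not equal unless you normalize them first.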

What about emojis?

We already know "🥺" encodes to four bytes in UTF-8:

Shell session
> new TextEncoder("utf-8").encode("🥺")
Uint8Array(4) [ 240, 159, 165, 186 ]

And if you enjoyed things like Wat, you might already know what "🥺".length is going to be:

Shell session
> "🥺".length
2
> "🥺".charAt(0)
'\ud83e'
> "🥺".charAt(1)
'\udd7a'

Hey... hey! Those are surrogates!

They sure are! charAt returns UTF-16 code units!
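For comparison, Rust lets you ask for each of these counts explicitly. A minimal sketch (the `Lengths` struct and `lengths` function are made up for this example):

```rust
/// How long a string is under three different definitions of "character".
#[derive(Debug, PartialEq)]
struct Lengths {
    utf8_bytes: usize,
    utf16_units: usize,
    code_points: usize,
}

fn lengths(s: &str) -> Lengths {
    Lengths {
        // `str::len` counts UTF-8 bytes, not characters.
        utf8_bytes: s.len(),
        // `encode_utf16` yields UTF-16 code units, surrogates included.
        utf16_units: s.encode_utf16().count(),
        // `chars` yields Unicode scalar values (code points minus surrogates).
        code_points: s.chars().count(),
    }
}
```

For "🥺" (U+1F97A) this reports 4 UTF-8 bytes, 2 UTF-16 code units and 1 code point: three different answers to "how long is this string?".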

It goes even deeper for more involved emojis:

Shell session
> "👨‍👩‍👦".length
8

This "family: man, woman, boy" emoji (which should render as a single grapheme cluster for you, unless you're visiting us with a browser from the past, or from the terminal) is actually made up of the Man emoji, the Woman emoji and the Boy emoji, joined together with ZWJ, Zero Width Joiners.

We can see each individual component:

Shell session
> [..."👨‍👩‍👦"]
[ '👨', '‍', '👩', '‍', '👦' ]

What are these components exactly though? They're not UTF-16 code units: we know that's what .length returns, and it said 8 (we also know each of these emojis probably takes up 2 UTF-16 code units).

They're Unicode code points. We can get code units by iterating from 0 to s.length and calling charCodeAt:

Shell session
> var s = "👨‍👩‍👦"; for (let i = 0; i < s.length; i++) { console.log(s.charCodeAt(i).toString(16)) }
d83d
dc68
200d
d83d
dc69
200d
d83d
dc66

And we can get code point values with codePointAt, which takes offsets in UTF-16 code units. MDN says:

The codePointAt() method returns a non-negative integer that is the Unicode code point value at the given position. Note that this function does not give the nth code point in a string, but the code point starting at the specified string index.

And further:

If the element at pos is a UTF-16 high surrogate, returns the code point of the surrogate pair.

And:

If the element at pos is a UTF-16 low surrogate, returns only the low surrogate code point.

So, if we really want to index into a string, we have to be very careful, or we'll get wrong results (where Rust chooses to panic):

Shell session
> var s = "👨‍👩‍👦"; [0, 2, 3, 5, 7].map(i => `U+${s.codePointAt(i).toString(16)}`)
[ 'U+1f468', 'U+200d', 'U+1f469', 'U+200d', 'U+dc66' ]

U+200D is Zero-Width Joiner, this checks out.

To not get it wrong, it's easier to use iteration, which is based on Unicode code points, unlike indexing which is based on UTF-16 code units:

Shell session
> for (const c of "👨‍👩‍👦") { console.log(`U+${c.codePointAt(0).toString(16)}`) }
U+1f468
U+200d
U+1f469
U+200d
U+1f466
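Rust makes the same distinction, just with different defaults: `str::chars` iterates over code points (Unicode scalar values, strictly speaking) and `str::encode_utf16` yields UTF-16 code units. A minimal sketch, where `FAMILY`, `code_points` and `utf16_units` are hypothetical names:

```rust
/// "family: man, woman, boy", spelled out as escapes:
/// Man, ZWJ, Woman, ZWJ, Boy.
const FAMILY: &str = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}";

/// Code points, formatted like the `for..of` loop above.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:x}", c as u32)).collect()
}

/// UTF-16 code units, formatted like the `charCodeAt` loop above.
fn utf16_units(s: &str) -> Vec<String> {
    s.encode_utf16().map(|u| format!("{:x}", u)).collect()
}
```

`code_points(FAMILY)` has 5 entries, `utf16_units(FAMILY)` has 8, and `FAMILY.len()`, which counts UTF-8 bytes, is 18. Three different answers to "how long is this string", none of which is "1".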

Note that even though for..of is already iterating over codepoints, it "yields" strings itself, since there's no "character" or "rune" type in ECMAScript - the closest you get is what codePointAt returns, which is a Number, which is only accurate for integers up to 53 bits wide.

Luckily, the maximum valid Unicode code point is U+10FFFF, so we only need 21 bits. (Also, only ~150K code points are defined by Unicode 15.0, the latest version as of February of 2023).

Which explains why, if you go to read the documentation for Emacs, a 47-year-old piece of software, you will find the following:

To support this multitude of characters and scripts, Emacs closely follows the Unicode Standard.

The Unicode Standard assigns a unique number, called a codepoint, to each and every character. The range of codepoints defined by Unicode, or the Unicode codespace, is 0..#x10FFFF (in hexadecimal notation), inclusive.

Emacs extends this range with codepoints in the range #x110000..#x3FFFFF, which it uses for representing characters that are not unified with Unicode and raw 8-bit bytes that cannot be interpreted as characters.

Thus, a character codepoint in Emacs is a 22-bit integer.

Makes perfect sense to me.
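The bit counts are easy to double-check. Here's a small sketch; `bits_needed` and the two constants are hypothetical names, not an Emacs or Unicode API:

```rust
/// Number of bits needed to represent `n` (0 needs 0 bits).
/// `bits_needed` is a hypothetical helper, used only for illustration.
fn bits_needed(n: u32) -> u32 {
    u32::BITS - n.leading_zeros()
}

/// Highest Unicode code point.
const UNICODE_MAX: u32 = 0x10FFFF;
/// Highest character code in Emacs, per the manual quoted above.
const EMACS_MAX: u32 = 0x3FFFFF;
```

`bits_needed(UNICODE_MAX)` is 21 and `bits_needed(EMACS_MAX)` is 22. Rust's own `char::MAX` is `'\u{10FFFF}'`, so a Rust `char` stops where Unicode does, and Emacs keeps going for one more bit.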

Caught in the middle

In summary: no one can agree what a character is.

I can't find reliable sources on when "multibyte support" was added to Emacs, but it's safe to say they decided a "character" was a "Unicode code point", ignoring zero-width joiners, combining characters, etc. Fair enough.

As for the LSP spec, it was originally developed for Visual Studio Code, which is implemented in ECMAScript (JavaScript).

And because indexing into strings in ECMAScript (and s.length, etc.) is based on UTF-16 code units, they decided a character should be.. a UTF-16 code unit!

You may be recoiling in horror at this point, and who could blame you, but the point is, this is a fantastic cautionary tale: if your API ever has a field named character:

JSON
"start": {
  "line": 1,
  "character": 8
}

...then you've taken a wrong turn and need to reconsider. Immediately.

To finish elucidating the exact LSP transcript we've obtained between lsp-mode and rust-analyzer: that "character" field should have value 9, not 8, because even though "🥺" is one "Emacs character" (one Unicode code point), it is two "LSP characters" (two UTF-16 code units).
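To make the off-by-one concrete, here's a sketch of the conversion a client would need: from a column counted in code points to a column counted in UTF-16 code units. `utf16_col` is a hypothetical helper, not part of lsp-mode, and the line used in the example below is made up rather than taken from the transcript.

```rust
/// Converts a column counted in Unicode code points (what Emacs calls
/// characters) into a column counted in UTF-16 code units (what LSP calls
/// characters). `line` is the text of a single line, without its newline.
/// `utf16_col` is a hypothetical helper, not part of lsp-mode.
fn utf16_col(line: &str, code_point_col: usize) -> usize {
    line.chars()
        .take(code_point_col)  // everything before the cursor
        .map(char::len_utf16)  // 1 for BMP code points, 2 for the rest
        .sum()
}
```

For the made-up line `"// \u{1F97A} ok"`, `utf16_col(line, 3)` is still 3 (nothing but ASCII before the cursor), but `utf16_col(line, 4)` is 5: as soon as the cursor is past the emoji, the two ways of counting disagree by one.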

What of rust-analyzer then? Well, it certainly doesn't use UTF-16 as its internal representation: like most Rust programs (that don't need to deal with pre-UTF-8-codepage Windows APIs), it uses UTF-8.

But it does implement LSP to the letter and treats "character offsets" as UTF-16 code units. Which it can do because, thanks to textDocument/didChange, it has its own copy of the document text (in UTF-8).

But it's able to translate UTF-16 code unit offsets to UTF-8 offsets. Up until now I've been making claims about what rust-analyzer does without looking, so it's time to look: if we follow the thread, we'll surely find... yes!:

Rust code
// in `rust-analyzer/crates/ide-db/src/line_index.rs`

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct LineIndex {
    /// Offset the the beginning of each line, zero-based
    pub(crate) newlines: Vec<TextSize>,
    /// List of non-ASCII characters on each line
    pub(crate) utf16_lines: NoHashHashMap<u32, Vec<Utf16Char>>,
}

#[derive(Clone, Debug, Hash, PartialEq, Eq)]
pub(crate) struct Utf16Char {
    /// Start offset of a character inside a line, zero-based
    pub(crate) start: TextSize,
    /// End offset of a character inside a line, zero-based
    pub(crate) end: TextSize,
}

...an index of every non-ASCII character for every line of every file, which it can then use to translate between UTF-8 and UTF-16 column offsets (shown here in the UTF-8 to UTF-16 direction):

Rust code
impl LineIndex {
    fn utf8_to_utf16_col(&self, line: u32, col: TextSize) -> usize {
        let mut res: usize = col.into();
        if let Some(utf16_chars) = self.utf16_lines.get(&line) {
            for c in utf16_chars {
                if c.end <= col {
                    res -= usize::from(c.len()) - c.len_utf16();
                } else {
                    // From here on, all utf16 characters come *after* the character we are mapping,
                    // so we don't need to take them into account
                    break;
                }
            }
        }
        res
    }
}

To do that, it uses a nice property: if some codepoint needs 4 UTF-8 bytes, it needs 2 UTF-16 code units:

Rust code
impl Utf16Char {
    /// Returns the length in 8-bit UTF-8 code units.
    fn len(&self) -> TextSize {
        self.end - self.start
    }

    /// Returns the length in 16-bit UTF-16 code units.
    fn len_utf16(&self) -> usize {
        //                             👇
        if self.len() == TextSize::from(4) {
            2
        } else {
            1
        }
    }
}
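That property can be checked exhaustively with nothing but the standard library. The sketch below also includes an index-free version of the same translation, which walks the line instead of consulting a precomputed table. Both function names are hypothetical and neither is rust-analyzer code.

```rust
/// Checks the property over every Unicode scalar value: a code point takes
/// 2 UTF-16 code units exactly when it takes 4 UTF-8 bytes.
fn four_bytes_iff_two_units() -> bool {
    (0..=0x10FFFF_u32)
        .filter_map(char::from_u32) // skips the surrogate range
        .all(|c| (c.len_utf8() == 4) == (c.len_utf16() == 2))
}

/// Slow, index-free translation of a UTF-8 byte column into a UTF-16 column.
/// Returns `None` if `byte_col` is out of range or does not fall on a
/// character boundary. Hypothetical helper, for illustration only.
fn utf8_to_utf16_col_slow(line: &str, byte_col: usize) -> Option<usize> {
    let prefix = line.get(..byte_col)?;
    Some(prefix.chars().map(char::len_utf16).sum())
}
```

The slow version re-scans the line on every call, which is exactly the cost the `LineIndex` table avoids: it only has to remember the non-ASCII characters, because for everything else the UTF-8 and UTF-16 columns are the same number.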

And that's how rust-analyzer adheres to the LSP spec. And lsp-mode doesn't.

Of course, as a user, it's hard to bring oneself to care about that. As a rust-analyzer user, it feels like rust-analyzer's bug. It's the one crashing!

Why does it crash? Couldn't it just... not?

Couldn't it just ignore erroneous input? Or... JSON-RPC has mechanisms for error reporting, couldn't it just use that?

Both of these are nonstarters, and I'll explain why.

If rust-analyzer ever gets a character offset that's in the middle of a UTF-16 surrogate pair, like it did in our recorded sequence:

A diagram showing the encoding of our comment then bottom emoji line, showing utf-8 bytes, where the emoji takes 4 slots, utf-16 code units, where it takes 2, and unicode code points like in Emacs, where it takes one

Then... all bets are off!

It can't just "assume there's been an oopsie" and go to the end of the UTF-16 surrogate pair, because there might have already been errors before that: if we had two similar emojis before, it would think it's before the bottom emoji, rather than after.

That means completions would be wrong, which might seem innocent enough, but it also means that any further textDocument/didChange requests will give the LSP server (in this case rust-analyzer) a corrupted view of the document!

And that can already happen, without crashing!

If we insert "🥲😎" in one go, then "🥺" at offset 2, the Emacs buffer will have "🥲😎🥺", but any compliant LSP server will have "🥲🥺😎".

So, when it is caught, when an LSP server detects that it's getting bad offsets from the client (in this case lsp-mode), it absolutely should sound the alarm!
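Both situations, the silent divergence and the detectable one, fit in a short sketch. All three functions are hypothetical helpers written for this illustration; they are not how Emacs or rust-analyzer apply edits.

```rust
/// Inserts `text` at an offset counted in Unicode code points, the way the
/// Emacs side interprets it in this scenario.
fn insert_at_code_point(doc: &str, offset: usize, text: &str) -> String {
    // Byte index of the `offset`-th code point, or the end of the string.
    let byte_idx = doc.char_indices().nth(offset).map_or(doc.len(), |(i, _)| i);
    let mut out = String::from(doc);
    out.insert_str(byte_idx, text);
    out
}

/// Maps an offset counted in UTF-16 code units to a byte index. Returns
/// `None` if the offset is past the end or inside a surrogate pair.
fn utf16_offset_to_byte(doc: &str, offset: usize) -> Option<usize> {
    let mut units = 0;
    for (byte_idx, c) in doc.char_indices() {
        if units == offset {
            return Some(byte_idx);
        }
        if units > offset {
            return None; // we stepped over it: it was mid-pair
        }
        units += c.len_utf16();
    }
    if units == offset { Some(doc.len()) } else { None }
}

/// Inserts `text` at an offset counted in UTF-16 code units, the way a
/// spec-compliant server interprets it. `None` means the offset was bad.
fn insert_at_utf16(doc: &str, offset: usize, text: &str) -> Option<String> {
    let byte_idx = utf16_offset_to_byte(doc, offset)?;
    let mut out = String::from(doc);
    out.insert_str(byte_idx, text);
    Some(out)
}
```

With `doc = "🥲😎"`, `insert_at_code_point(doc, 2, "🥺")` gives `"🥲😎🥺"`, while `insert_at_utf16(doc, 2, "🥺")` gives `Some("🥲🥺😎")`: same message, two different documents, and nobody noticed. Only an odd offset like 1 lands inside a surrogate pair and comes back as `None`, which is the one case a server is able to catch.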

But is crashing the right thing to do though?

Didn't you mention JSON-RPC had a built-in mechanism for reporting errors?

Oh it does! But unfortunately, error handling is hard. There's nothing in the LSP spec that says "your implementation is fundamentally flawed and we cannot rely on ANYTHING going forward". At best you can do "this specific request has failed".

But requests fail all the time for various reasons and LSP clients barely bother to show you something (VSCode notably shows a notification, which is more annoying than anything else, since you can't do anything about it).

How should a client behave when it gets an error? Should it decide to give up after a certain amount of errors? How many is enough? Eight? Sixteen? The situation isn't going to fix itself. If a textDocument/didChange request fails, it's game over - the representations on either side have already diverged.

Restarting the server, like lsp-mode suggests in a prompt, is not really going to help either, because editing any code after the emoji will have the wrong offsets, and so the representations will start to diverge again very soon.

This is an error condition that is rare (someone messed up the protocol in a BIG way) and unrecoverable (there's a very low likelihood of anything correct or useful happening after that): as frustrating as it is for Emacs users, rust-analyzer is absolutely correct in panicking here.

(In fact, it would be neat if the LSP handshake included some basic checks to make sure both ends agree on what offsets mean, but it's a little late for that.)

The way forward

This is where things get delicate.

Because the tech of it all is, as usual, relatively straightforward. If you've made it this far, you have a thorough understanding of exactly what's going wrong and why, going all the way back to JavaScript being designed in 10 days in 1995.

The way forward is for lsp-mode to fix itself, clearly. But if you're maintaining the tool at the other end of the LSP discussion, it's hard to politely yet urgently signify this to them. Ideally we should get along with our neighbors just fine!

And in the neighbor's defense, using UTF-16 code units for offsets is... it's not good. At all. Modelling a spec after a specific language's limitations is how we ended up with "you can't really send integers bigger than 9e15 in a JSON document and be sure they'll come out okay on the other side".
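That 9e15 figure is roughly 2^53, the point past which a 64-bit float (which is what a JavaScript Number is) can no longer represent every integer. A small sketch in Rust, where `through_f64` and `TWO_POW_53` are hypothetical names:

```rust
/// Round-trips an integer through an IEEE 754 double, which is what happens
/// to a JSON number when it is parsed into a JavaScript `Number`.
fn through_f64(n: u64) -> u64 {
    (n as f64) as u64
}

/// 2^53 = 9_007_199_254_740_992, a little over 9e15.
const TWO_POW_53: u64 = 1 << 53;
```

`through_f64(TWO_POW_53 - 1)` comes back unchanged, but `through_f64(TWO_POW_53 + 1)` comes back as `TWO_POW_53`: the integer went in one side and a different one came out the other.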

"Sending the correct UTF-16 code unit offsets" would probably involve something like re-encoding lines from "whatever Emacs uses internally" (presumably UTF-8) to UTF-16, which can be time-consuming, especially if you're doing that in Emacs Lisp I suppose. Or you could do something smarter, like rust-analyzer does, but then you better have good unit tests (like rust-analyzer also does).

Still in the neighbor's defense, there's a lot of code out there that does not use any code points outside of the BMP, thus doesn't have any UTF-16 surrogate pairs. So it seems like an edge case for lsp-mode (whereas it is, in fact, a fundamental disagreement with the spec).

Luckily, and this is where I feel like the rust-analyzer project has been diplomatic, there have already been steps taken to move this forward. You and I, we're not the only UTF-16 haters in the world, there's also clangd, the C/C++ language server, which went "lol what no" and straight up decided to allow clients to use UTF-8 offsets instead.

And, extending a hand to lsp-mode, so they don't have to do something dumb but expensive (re-encoding as UTF-16) or clever but fast (keep a cache of offsets, use the length-in-other-encoding trick RA does), rust-analyzer has implemented that extension.
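The idea behind that kind of negotiation is simple. As I understand the form LSP 3.17.0 gives it, the client lists the encodings it supports, the server picks one, and UTF-16 stays the mandatory fallback when nothing is offered. Here's a sketch of the server side; the enum and `negotiate` are hypothetical names, not rust-analyzer's or clangd's actual code.

```rust
/// The units a position's `character` field may be counted in.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PositionEncoding {
    Utf8,
    Utf16,
    Utf32,
}

/// Picks an encoding for a server that would rather speak UTF-8.
/// `offered` is the list the client advertised, if it advertised one at all.
fn negotiate(offered: Option<&[PositionEncoding]>) -> PositionEncoding {
    match offered {
        Some(list) if list.contains(&PositionEncoding::Utf8) => PositionEncoding::Utf8,
        // No list, or no UTF-8 in it: fall back to UTF-16, which every
        // client is assumed to support.
        _ => PositionEncoding::Utf16,
    }
}
```

A client that says nothing gets UTF-16, exactly as before, so old clients keep working. A client that offers UTF-8 gets UTF-8, and two UTF-8 programs finally get to stop converting back and forth.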

So the timeline now goes:

And here we are, 2.5 years after the initial filing, with the same issue being noticed and reported over and over again against rust-analyzer (and, I imagine, similarly correctness-minded language servers).

So, here's my call to action.

Dear lsp-mode maintainers, please, in this order:

  1. Do not fall back to RLS, ever.
  2. Fix your "character offsets" to be UTF-16 code units. I know it's bad, but it's bad for everyone and that's what the spec says.
  3. Implement the UTF-8 offset extensions (LSP 3.17.0), that way we don't have this silly situation where two tools that use UTF-8 internally do conversions in both directions every time they speak.

It's overdue.
