The bottom emoji breaks rust-analyzer
Some bugs are merely fun. Others are simply delicious!
Today's pick is the latter.
Reproducing the issue, part 1
(It may be tempting to skip that section, but reproducing an issue is an important part of figuring it out, so.)
I've never used Emacs before, so let's install it. I do most of my computing on an era-appropriate Ubuntu, today it's Ubuntu 22.10, so I just need to:
$ sudo apt install emacs-nox
And then create a new Rust crate, cd into it, and launch emacs:
$ cargo new bottom
     Created binary (application) `bottom` package
$ cd bottom/
$ emacs
I am greeted with a terminal interface that says: "Welcome to GNU Emacs, one component of the GNU/Linux operating system". I feel religious already.
To open src/main.rs, I can press C-x C-f, where C- means Ctrl+ (and M- means Alt+, at least for me), which opens a "Find file" prompt at the bottom, pre-filled with the current working directory: for me that's ~/bearcove/bottom.
There's tab-completion in there, so s<TAB>m<TAB> completes it to .../src/main.rs, and pressing Enter (which I understand Emacs folks spell RET, for Return) opens the file.
Without additional configuration, not much happens: there's no syntax highlighting, no code intelligence of any kind, we have to ask for that.
So let's ask for that.
With guidance from a friend (on IRC, of all places! what year is this?), I ended up putting the following in my ~/.emacs file:
;; in `~/.emacs`

(require 'package)
(add-to-list 'package-archives '("melpa" . "https://melpa.org/packages/"))
(add-to-list 'package-archives '("gnu" . "https://elpa.gnu.org/packages/"))
(package-refresh-contents)
(package-install 'use-package)
(require 'use-package-ensure)
(setq use-package-always-ensure t)

(use-package rustic)

(use-package lsp-mode
  :ensure
  :commands lsp
  :custom
  ;; what to use when checking on-save. "check" is default, I prefer clippy
  (lsp-rust-analyzer-cargo-watch-command "clippy")
  (lsp-eldoc-render-all t)
  (lsp-idle-delay 0.6)
  (lsp-rust-analyzer-server-display-inlay-hints t)
  :config
  (add-hook 'lsp-mode-hook 'lsp-ui-mode))

(use-package lsp-ui
  :ensure
  :commands lsp-ui-mode
  :custom
  (lsp-ui-peek-always-show t)
  (lsp-ui-sideline-enable nil)
  (lsp-ui-doc-enable t))

(custom-set-variables
 ;; custom-set-variables was added by Custom.
 ;; If you edit it by hand, you could mess it up, so be careful.
 ;; Your init file should contain only one such instance.
 ;; If there is more than one, they won't work right.
 '(package-selected-packages '(lsp-ui rustic lsp-mode ## cmake-mode)))
(custom-set-faces
 ;; custom-set-faces was added by Custom.
 ;; If you edit it by hand, you could mess it up, so be careful.
 ;; Your init file should contain only one such instance.
 ;; If there is more than one, they won't work right.
 )
(The bit at the bottom was there originally, I'm guessing Ubuntu ships it? It didn't have lsp-ui / rustic / lsp-mode before I restarted emacs).
Restarting emacs installs everything, just like Vim plug-in managers would do. So far, so good.
Opening src/main.rs again prompts me to import a workspace root, I pick the first option, which seems reasonable, and tada, we have Rust highlighting and inline documentation and stuff!
Or rather, I do, because I have a rust-analyzer binary in path from a while ago (I contribute occasionally):
$ which rust-analyzer
/home/amos/.cargo/bin/rust-analyzer
$ rust-analyzer --version
rust-analyzer 0.0.0 (caf23f291 2022-07-11)
This is old. rust-analyzer is released every week. Let's remove it and see if things still work.
$ rm $(which rust-analyzer)
$ emacs src/main.rs
(cut)
Server rls:83343/starting exited (check corresponding stderr buffer for details). Do you want to restart it? (y or n)
Ah. It doesn't. That's unfortunate!
How rust-analyzer is distributed
Rust components are normally distributed with rustup. rustc is a Rust component, so, it is:
$ which rustc
/home/amos/.cargo/bin/rustc
$ rustc -V
rustc 1.67.1 (d5a82bbd2 2023-02-07)
$ rustup show active-toolchain
stable-x86_64-unknown-linux-gnu (default)
$ rustup which rustc
/home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc
$ ~/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc -V
rustc 1.67.1 (d5a82bbd2 2023-02-07)
You can also get it from elsewhere, and I'm sure you have your reasons, but they're not relevant to the topic at hand.
After rust-analyzer was promoted from "very useful peripheral project" to official rust-lang project, it started being distributed as a rustup component.
Back then, that was done by making it a git submodule in the rust-lang/rust repository. Since then, it's been migrated to a git subtree, which helped resolve the "rust-analyzer support for proc macros in Rust nightly" issue.
Why do we care? Because that means there's technically two trees for rust-analyzer at any given moment in time: a git subtree can (and should) be merged in either direction: ra->rust because ra moves fast and is developed mostly in its own repository, and rust->ra because the proc-macro bridge might change.
So! We can install rust-analyzer with rustup:
$ rustup component add rust-analyzer
info: downloading component 'rust-analyzer'
info: installing component 'rust-analyzer'
But that doesn't set up a "proxy binary" under ~/.cargo/bin:
$ which rust-analyzer
(it prints nothing)
It is, however, somewhere on disk:
$ rustup which rust-analyzer /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rust-analyzer
We can even try it out ourselves by launching it, and typing {} then Enter/Return:
$ ~/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rust-analyzer
{}
[ERROR rust_analyzer] Unexpected error: expected initialize request, got error: receiving on an empty and disconnected channel
expected initialize request, got error: receiving on an empty and disconnected channel
That's right! It's our old friend JSON-over-stdin.
Since this is a reasonable way in which users might want to install rust-analyzer, let's see if our emacs setup picks it up:
$ emacs src/main.rs (cut) Server rls:86574/starting exited (check corresponding stderr buffer for details). Do you want to restart it? (y or n)
No. This is unfortunate, again.
And even if it did work, it would still not be ideal, because... that version is old too:
$ $(rustup which rust-analyzer) --version rust-analyzer 1.67.1 (d5a82bb 2023-02-07)
Okay.. that one's only a few days old, but only because Rust 1.67.0 did an oopsie-woopsie when it comes to thin archives and they had to release 1.67.1 shortly after.
But yeah, if we check the version shipped with Rust 1.67.0:
$ $(rustup +1.67.0 which rust-analyzer) --version rust-analyzer 1.67.0 (fc594f1 2023-01-24)
It's from around that time. And that hash refers to a commit in rust-lang/rust, not rust-lang/rust-analyzer, so depending on when the last sync was made, it might be even further behind.
rust-analyzer wants to ship every Monday, and so, it does! At the time of this writing, the latest release is from 2023-02-13, and it's on GitHub, which is where the VSCode extension used to download it from.
These days, it ships the rust-analyzer binary directly in the extension:
$ unzip -l rust-lang.rust-analyzer-0.4.1398@linux-x64.vsix
Archive:  rust-lang.rust-analyzer-0.4.1398@linux-x64.vsix
  Length      Date    Time    Name
---------  ---------- -----   ----
     2637  2023-02-10 00:29   extension.vsixmanifest
      462  2023-02-10 00:29   [Content_Types].xml
    15341  2023-02-09 18:46   extension/icon.png
     1036  2023-02-09 18:46   extension/language-configuration.json
    12006  2023-02-09 18:46   extension/LICENSE.txt
   415788  2023-02-10 00:29   extension/out/main.js
   298705  2023-02-09 18:46   extension/package-lock.json
    86857  2023-02-10 00:29   extension/package.json
      756  2023-02-09 18:46   extension/ra_syntax_tree.tmGrammar.json
     2422  2023-02-10 00:29   extension/README.md
 41073816  2023-02-10 00:29   extension/server/rust-analyzer
   648798  2023-02-10 00:29   extension/node_modules/d3-graphviz/build/d3-graphviz.min.js
   279449  2023-02-10 00:29   extension/node_modules/d3/dist/d3.min.js
---------                     -------
 42838073                     13 files
But other editor plug-ins, like coc-rust-analyzer, download it right from GitHub releases.
It seemed odd that the Emacs ecosystem would lack such functionality, so I looked it up: rustic says about automatic server installation that lsp-mode provides this feature, but eglot doesn't, and then it says "Install rust-analyzer manually".
In lsp-mode, I found code that seems like it should download rust-analyzer, but I'm not sure how to use it.
The docs talk about an lsp-enable-suggest-server-download option, which defaults to true, but I've never seen it download the server (I know because I checked ~/.emacs.d/.cache/lsp).
Although... now that I look at it closer, this error message:
Server rls:83343/starting exited (check corresponding stderr buffer for details). Do you want to restart it? (y or n)
Mentions rls, which is the predecessor to rust-analyzer, and that definitely sounds wrong. But lsp-rust-server defaults to rust-analyzer, so.. is it just falling back? Setting the option explicitly doesn't seem to change much.
After more Emacs learning, I discovered how to switch to the *lsp-log* buffer the docs point to and found the following:
Command "rls" is present on the path.
Command "rust-analyzer" is not present on the path.
Command "rls" is present on the path.
Command "rust-analyzer" is not present on the path.
Found the following clients for /home/amos/bearcove/bottom/src/main.rs: (server-id rls, priority -1)
The following clients were selected based on priority: (server-id rls, priority -1)
This is a horrible fallback. RLS is deprecated. The only reason it's falling back to it is because there is a proxy binary for it, that exists, but errors out since the component is not installed:
$ which rls
/home/amos/.cargo/bin/rls
$ rls
error: 'rls' is not installed for the toolchain 'stable-x86_64-unknown-linux-gnu'
To install, run `rustup component add rls`
$ echo $?
1
Let's summarize the situation:
- rustup is the "blessed" way to get a Rust toolchain (you can do nice things like pinning with it)
- it sets up an rls proxy binary under ~/.cargo/bin like it should, because even though rls is deprecated, some folks still use it for now I guess
- lsp-mode looks for rust-analyzer first, rls second, and if it finds the second (even if it's just a stub/proxy that says "sorry not installed"), it falls back to it and the "auto download rust-analyzer" logic never triggers
I must admit this is an even bigger deal than what I was originally planning to write about! If this is the experience Emacs folks have been having with Rust, this explains a lot of things.
My sample size is N=3, but everyone in that sample ended up building rust-analyzer from source, and that means they get an extremely up-to-date RA once, and then most probably forget to update it forever, which is even worse than grabbing it from rustup.
Please remind me to submit a PR to lsp-mode that yanks ALL the rls stuff after I'm done writing this article. Someone must.
ANYWAY.
For today, let's just tell lsp-mode to use the one from rustup:
;; under (use-package lsp-mode
;; in the :custom block
(lsp-rust-analyzer-server-command (list (string-trim (shell-command-to-string "rustup which rust-analyzer"))))
Just kidding! That doesn't work. And yes, that evaluates to the right path. And yes, that "custom" option expects a list. And yes, I did learn what custom-set-variables does while troubleshooting this, and no, setting it through there doesn't work either.
The *Messages* buffer still shows that it couldn't find rust-analyzer in PATH.
My best guess is that rustic, which can use either lsp-mode or eglot, overrides this very late in the game, and I can't do a damn thing about it. I'm not sure why they even have an "Automatic server installation" section in their docs then.
So. Fine. We'll use violence and create a shell script in ~/.cargo/bin/rust-analyzer, containing this:
#!/bin/bash
$(rustup which rust-analyzer) "$@"

$ chmod +x ~/.cargo/bin/rust-analyzer
$ hash -r
$ rust-analyzer --version
rust-analyzer 1.67.1 (d5a82bb 2023-02-07)
Tada, we did a crime!
Reproducing the issue, part 2
So, now that we have emacs configured with lsp-mode and rustic, using a recent-ish rust-analyzer, we're back to square one.
Note that I forgot to add company-mode, the thing that actually provides completions: we can add it next to (use-package rustic) in ~/.emacs:
;; in `~/.emacs`

;; cut: everything before that line

(require 'use-package-ensure)
(setq use-package-always-ensure t)

(use-package rustic)
;; 👇 new!
(use-package company-mode)
Restarting emacs installs it, and we now get completions "after typing a few characters and waiting a bit". You can bind a key combination to "company-complete" to have it pop up on-demand, which I did, and then Emacs told me "no, you shouldn't write it C-SPC, you should write it [?\C- ]", which is exceedingly rude, but let's move on.
(The keybinding didn't work in the terminal but it did work in emacs-gtk, which I installed out of frustration).
Anyway!
Now onto the actual bug: let's add to our code.. an emoji! Any emoji.
fn main() {
    // 🥺
    println!("Hello, world!");
}
Upon typing this emoji in the editor, a message will pop up in the bottom bar (appropriate) saying the LSP server crashed, would you like to restart it, no I wouldn't, I'd like to see the messages, a little C-x b *rust-anal<TAB>::std<TAB> later I'm in buffer *rust-analyzer::stderr* seeing this:
Panic context: > version: 1.67.1 (d5a82bb 2023-02-07) notification: textDocument/didChange thread 'LspServer' panicked at 'assertion failed: self.is_char_boundary(n)', /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/string.rs:1819:29 stack backtrace: 0: rust_begin_unwind at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5 1: core::panicking::panic_fmt at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14 2: core::panicking::panic at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:111:5 3: <alloc::string::String>::replace_range::<core::ops::range::Range<usize>> 4: rust_analyzer::lsp_utils::apply_document_changes::<<rust_analyzer::global_state::GlobalState>::on_notification::{closure#3}::{closure#0}> 5: <<rust_analyzer::global_state::GlobalState>::on_notification::{closure#3} as core::ops::function::FnOnce<(&mut rust_analyzer::global_state::GlobalState, lsp_types::DidChangeTextD\ ocumentParams)>>::call_once 6: <rust_analyzer::dispatch::NotificationDispatcher>::on::<lsp_types::notification::DidChangeTextDocument> 7: <rust_analyzer::global_state::GlobalState>::handle_event 8: <rust_analyzer::global_state::GlobalState>::run 9: rust_analyzer::main_loop::main_loop 10: rust_analyzer::run_server note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace. Process rust-analyzer stderr finished
That is the bug we're interested in.
And if you have some instinct as to the source of this bug, let me tell you: it's so much worse than you think.
Exploring UTF-8 and UTF-16 with Rust
Rust strings are UTF-8, period.
And by "Rust strings" I mean the owned type String and string slices, aka &str.
There's a ton of other types that dereference to str, like Box<str>, Arc<str> etc., but they're not relevant to this discussion.
We can tell by printing the underlying byte representation:
fn main() {
    println!("{:02x?}", "abc".as_bytes());
}

$ cargo run --quiet
[61, 62, 63]
Which is not to say that you cannot use UTF-16 string representation in Rust, you just.. make your own type for it. Or, more likely, you use a crate, like widestring:
$ cargo add widestring
    Updating crates.io index
      Adding widestring v1.0.2 to dependencies.
             Features:
             + alloc
             + std

fn main() {
    // this is a macro, it does the right thing
    let s = widestring::u16str!("abc");
    println!("{:04x?}", s.as_slice());
}

This gives us a &[u16], so it's not quite what we're looking for:

$ cargo run --quiet
[0061, 0062, 0063]
Getting at the bytes is harder, but not impossible:
fn main() {
    // this is a macro, it does the right thing
    let s = widestring::u16str!("abc");
    {
        let u16s = s.as_slice();
        let (_, u8s, _) = unsafe { u16s.align_to::<u8>() };
        println!("{:02x?}", u8s);
    }
}
Heck, it shouldn't even really require unsafe, since anything that's u16-aligned is also definitely u8-aligned, here, let me make a safe wrapper for it:
fn u16_slice_to_u8_slice(s: &[u16]) -> &[u8] {
    unsafe {
        // Safety: u8 slices don't require any alignment, it really doesn't
        // get any smaller without bit-twiddling
        s.align_to::<u8>().1
    }
}
There:
fn main() {
    // this is a macro, it does the right thing
    let s = widestring::u16str!("abc");
    {
        let u16s = s.as_slice();
        let u8s = u16_slice_to_u8_slice(u16s);
        println!("{:02x?}", u8s);
    }
}

$ cargo run --quiet
[61, 00, 62, 00, 63, 00]
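One caveat about that output: the align_to wrapper reinterprets memory in place, so the byte order it prints is whatever the machine uses. The run above is on x86_64, which is little-endian, hence 61 before 00. Here's a minimal sketch that makes the byte order explicit using only the standard library (str::encode_utf16 and u16::to_le_bytes / to_be_bytes). The function names utf16_le_bytes and utf16_be_bytes are made up for this example.

```rust
/// Hypothetical helper: UTF-16 code units serialized least-significant byte first.
fn utf16_le_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16()
        .flat_map(|unit| unit.to_le_bytes())
        .collect()
}

/// Hypothetical helper: same code units, most-significant byte first.
fn utf16_be_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16()
        .flat_map(|unit| unit.to_be_bytes())
        .collect()
}

fn main() {
    // Same bytes as the align_to version on a little-endian machine.
    println!("LE {:02x?}", utf16_le_bytes("abc"));
    // What a big-endian machine would have shown instead.
    println!("BE {:02x?}", utf16_be_bytes("abc"));
}
```

No unsafe needed, at the cost of allocating a Vec instead of borrowing the original slice.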
Okay, cool! So utf-16 is just utf-8 with extra zeroes.
No, that's... no.
No, of course not. It's easy with ASCII characters, but it gets more complicated the fancier you want to get.
How about "é" for example. That sounds fancy! How does it look?
const S: &str = "é";

fn main() {
    println!("{S:?}");
    println!("UTF-8 {:02x?}", S.as_bytes());
    println!(
        "UTF-16 {:02x?}",
        u16_slice_to_u8_slice((widestring::u16str!(S)).as_slice())
    );
}

fn u16_slice_to_u8_slice(s: &[u16]) -> &[u8] {
    unsafe {
        // Safety: u8 slices don't require any alignment, it really doesn't
        // get any smaller without bit-twiddling
        s.align_to::<u8>().1
    }
}

$ cargo run -q
"é"
UTF-8 [c3, a9]
UTF-16 [e9, 00]
Ah, not fancy enough! It takes up 2 bytes in UTF-8, and one "code unit" in UTF-16, which we're showing as two bytes, because we can.
Let's try something fancier!
$ cargo run -q
"œ"
UTF-8 [c5, 93]
UTF-16 [53, 01]
"Latin Small Ligature Oe" is in the Latin Extended-A range: that goes 0100-017F. We no longer have a "useless byte" in the UTF-16 encoding, since the code unit is actually "0153" (in hexadecimal).
Meanwhile, in UTF-8 land, we still need two bytes to encode it. All is well.
Let's get fancier still! How about some Hiragana? (3040-309F)
$ cargo run -q
"ぁ"
UTF-8 [e3, 81, 81]
UTF-16 [41, 30]
"Hiragana Letter Small A" takes 3 bytes in UTF-8, but only 2 bytes (one 16-bit code unit) in UTF-16. Huh! That must be why people loved UTF-16 so much, they made it the default string representation for Java and JavaScript (no relation).
Because at some point in the history of the human race, we thought 65536 characters were more than enough. It would be absurd to need more characters than that. And so we made UCS-2, which used two bytes for every character, enabling the encoding of the BMP, the Basic Multilingual Plane.
But that's a lie. We knew 65536 characters weren't enough, because China.
(Among other reasons).
So there were already 1.1 million characters defined by Unicode, and some of those wouldn't fit, and so we made UCS-4, which used four bytes for every character, but that sounds terribly wasteful compared to plain old ASCII-with-a-codepage-good-luck-exporting-documents I guess.
So ISO/IEC 2022 specifies escape sequences to switch between character sets. I wish I was making it up, but I'm not! Apparently xterm supports some of it! I know!!! I can't believe it either!
So anyway, UTF-16 is an evolution of UCS-2 that says hey maybe switching character encodings in-band isn't the greatest thing (also I'm not sure how widely adopted those escape sequences were in the first place), let's just have a thing where most characters fit in 2 bytes, and those that don't fit in.. more.
So let's look at our Hiragana again but simply print UTF-16 code units now, instead of the underlying two bytes:
const S: &str = "ぁ";

fn main() {
    println!("{S:?}");
    println!("UTF-8 {:02x?}", S.as_bytes());
    println!("UTF-16 {:02x?}", (widestring::u16str!(S)).as_slice());
}

$ cargo run -q
"ぁ"
UTF-8 [e3, 81, 81]
UTF-16 [3041]
Okay, cool, so: for U+3041, we need 3 bytes in UTF-8, and one UTF-16 code unit (2 bytes).
Let us now embark on a cross-plane journey:
$ cargo run -q
"𠀀"
UTF-8 [f0, a0, 80, 80]
UTF-16 [d840, dc00]
But first, a minute of silence for anyone reading this from a Linux desktop machine.
...
There. In case you can't see it, the character in question is "Ideograph the sound made by breathing in; oh! (cf. U+311B BOPOMOFO LETTER O, which is derived from this character) CJK", or "hē", for friends.
It is U+20000, which is outside the BMP.
The?
BMP, Basic Multilingual Plane, we've gone over this: it's 0000-FFFF, except... except part of it is reserved for UTF-16 surrogates! High surrogates are D800-DB7F, and surprise surprise, that's what our first code unit is: d840. Low surrogates are DC00-DFFF, and that's what our second code unit is.
How does this work? Well, we take our codepoint, in our case U+20000, and subtract 0x10000 from it. We get 0x10000, or, in binary:
0b00010000000000000000
  <--------><-------->
      hi        lo
Then we take the high ten bits, add them to 0xD800, in our case, 0b1000000 gives 0x40, so we get 0xD840. As for the low ten bits, we add them to 0xDC00, for us that's 0x0 (should've picked a different example), so we get 0xDC00.
Easy peasy.
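For the skeptics, the recipe above (subtract 0x10000, split into two groups of ten bits, add the surrogate bases) can be sketched in a few lines of Rust. The function name to_surrogate_pair is invented for this example; the cross-check uses char::encode_utf16 from the standard library.

```rust
/// Hypothetical helper: split a code point outside the BMP into a
/// (high, low) UTF-16 surrogate pair. Returns None for anything that
/// fits in a single code unit or is out of Unicode's range.
fn to_surrogate_pair(code_point: u32) -> Option<(u16, u16)> {
    if !(0x10000..=0x10FFFF).contains(&code_point) {
        return None;
    }
    let v = code_point - 0x10000; // at most 20 bits are left
    let hi = 0xD800 + (v >> 10) as u16; // high ten bits
    let lo = 0xDC00 + (v & 0x3FF) as u16; // low ten bits
    Some((hi, lo))
}

fn main() {
    let (hi, lo) = to_surrogate_pair(0x20000).unwrap();
    println!("{hi:04x} {lo:04x}");

    // Cross-check against what the standard library produces.
    let mut buf = [0u16; 2];
    '\u{20000}'.encode_utf16(&mut buf);
    assert_eq!(buf, [hi, lo]);
}
```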
And so, "𠀀" is:
- 1 grapheme cluster
- 1 unicode character (with the unicode code point U+20000)
- 4 UTF-8 bytes
- 2 UTF-16 code units
All clear? Good.
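Three of those four numbers can be checked with the standard library alone; here's a small sketch. The counts function is made up for this example. Grapheme clusters are left out on purpose: std has no API for them, you'd need an external crate for that.

```rust
/// Hypothetical helper: (Unicode scalar values, UTF-8 bytes, UTF-16 code units).
fn counts(s: &str) -> (usize, usize, usize) {
    (
        s.chars().count(),        // Unicode scalar values ("characters")
        s.len(),                  // str::len is a byte count, not a char count
        s.encode_utf16().count(), // UTF-16 code units
    )
}

fn main() {
    // U+20000, written as an escape so it survives any font situation
    println!("{:?}", counts("\u{20000}"));
    println!("{:?}", counts("abc"));
}
```

Worth keeping in mind: every offset into a Rust String is in that middle unit, UTF-8 bytes.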
How about our emoji? Our innocent little emoji? Look at it. It didn't mean to cause any trouble. Let's find out:
$ cargo run -q
"🥺"
UTF-8 [f0, 9f, a5, ba]
UTF-16 [d83e, dd7a]
Okay, same! Ooh I bet we can find what codepoint it is by doing the inverse transformation on the UTF-16 surrogate pair here:
$ gdb --quiet
(cut)
(gdb) p/x 0xd83e - 0xd800
$1 = 0x3e
(gdb) p/x 0xdd7a - 0xdc00
$2 = 0x17a
That's your hex calculator of choice?
Shh I'm trying to focus here.
(continued)
(gdb) p/t 0x3e
$3 = 111110
(gdb) p/t 0x17a
$4 = 101111010
Wait wait why are you going through binary, you can just do a left shift here, right?
Yes, yes, alright
(continued)
(gdb) p/x (0x3e << 10) + 0x17a + 0x10000
$5 = 0x1f97a
Hurray, here it is! U+1F97A Bottom Face Emoji.
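If gdb isn't your hex calculator of choice, the same inverse transformation fits in a short Rust sketch. from_surrogate_pair is a name invented for this example; char::decode_utf16 is the standard library's version of it, used here as a cross-check.

```rust
/// Hypothetical helper: combine a (high, low) surrogate pair back into
/// a code point. Returns None if either unit is not a surrogate of the
/// expected kind.
fn from_surrogate_pair(hi: u16, lo: u16) -> Option<u32> {
    // D800-DBFF here also covers the private-use high surrogates (DB80-DBFF).
    if !(0xD800..=0xDBFF).contains(&hi) || !(0xDC00..=0xDFFF).contains(&lo) {
        return None;
    }
    let hi_bits = (hi - 0xD800) as u32;
    let lo_bits = (lo - 0xDC00) as u32;
    Some((hi_bits << 10) + lo_bits + 0x10000)
}

fn main() {
    let cp = from_surrogate_pair(0xD83E, 0xDD7A).unwrap();
    println!("U+{cp:X}");

    // Cross-check against the standard library's decoder.
    let decoded: Vec<char> = char::decode_utf16([0xD83E, 0xDD7A])
        .map(|r| r.unwrap())
        .collect();
    assert_eq!(decoded, vec![char::from_u32(cp).unwrap()]);
}
```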
So. Rust really likes UTF-8, and so do I. Hence, I really like Rust.
Whereas some languages just do not let you mutate strings at all (it's safer that way), Rust does! You can totally do this, for example:
fn main() {
    let mut s = String::from("amos");
    s.replace_range(0..1, "c");
    dbg!(s);
}

$ cargo run -q
[src/main.rs:4] s = "cmos"
Note that this is safe rust!
Also note that Amos is my first name, and CMOS is a type of metal-oxide-semiconductor field-effect transistor fabrication process that uses complementary and symmetrical pairs of p-type and n-type MOSFETs for logic functions.
Coincidence? Probably.
We can also do this, using unsafe Rust:
fn main() {
    let mut s = String::from("amos");
    // why b'c'? because 'c' is a char, whereas b'c' is a u8 (a byte).
    unsafe { s.as_bytes_mut()[0] = b'c' };
    dbg!(s);
}
And we get the same output.
Which begs the question, why is the former safe and the latter unsafe, if they do the same darn thing!
It's because they don't.
The unsafe version lets us do this:
fn main() { let mut s = String::from("π₯Ί"); unsafe { s.as_bytes_mut()[3] = b'#'; }; dbg!(s); }
$ cargo run --quiet [src/main.rs:6] s = "οΏ½#"
Which is, say it with me: undefined behaviorrrr ooooh spooky ghost emoji! We broke an invariant (strings must always be valid UTF-8) and now anything goes, we could even have memory corruption come out of this! Probably!
Them's the rules of unsafe Rust: if we use it, we are suddenly responsible for maintaining all the invariants ourselves. Just like we are, all the time, in unsafe languages like C/C++ (also C, and C++).
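To see the broken invariant without actually invoking undefined behavior, we can do the same overwrite on a plain byte copy and ask std::str::from_utf8 whether the result is still valid. This is only a sketch; is_valid_after_overwrite is a name made up for this example.

```rust
/// Hypothetical helper: copy the string's bytes, overwrite one of them,
/// and report whether the result is still valid UTF-8.
fn is_valid_after_overwrite(s: &str, index: usize, byte: u8) -> bool {
    let mut bytes = s.as_bytes().to_vec(); // a Vec<u8> has no UTF-8 invariant
    bytes[index] = byte;
    std::str::from_utf8(&bytes).is_ok()
}

fn main() {
    // ASCII for ASCII: still valid, which is why "cmos" worked out fine.
    println!("{}", is_valid_after_overwrite("amos", 0, b'c'));
    // Clobbering the last byte of U+1F97A leaves a truncated 4-byte sequence.
    println!("{}", is_valid_after_overwrite("\u{1F97A}", 3, b'#'));
}
```

The first call prints true and the second prints false: the lead byte f0 promises three continuation bytes, and b'#' isn't one.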
replace_range does not let us do that:
fn main() {
    let mut s = String::from("🥺");
    s.replace_range(3..4, "#");
    dbg!(s);
}

$ cargo run --quiet
thread 'main' panicked at 'assertion failed: self.is_char_boundary(n)', /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/string.rs:1811:29
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Well well well, doesn't that look familiar.
Again with a backtrace:
$ RUST_BACKTRACE=1 cargo run --quiet thread 'main' panicked at 'assertion failed: self.is_char_boundary(n)', /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/string.rs:1811:29 stack backtrace: 0: rust_begin_unwind at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5 1: core::panicking::panic_fmt at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14 2: core::panicking::panic at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:111:5 3: alloc::string::String::replace_range at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/string.rs:1811:29 4: bottom::main at ./src/main.rs:3:5 5: core::ops::function::FnOnce::call_once at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/ops/function.rs:507:5 note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Yep okay this is definitely familiar.
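For what it's worth, the assertion that fires is a check anyone can run ahead of time: str::is_char_boundary is public. Here's a minimal sketch of a non-panicking wrapper. try_replace_range and its String error type are made up for this example, and this is not a claim about what rust-analyzer does or should do.

```rust
use std::ops::Range;

/// Hypothetical helper: like String::replace_range, but returns an error
/// instead of panicking when the byte range is out of bounds or does not
/// fall on char boundaries.
fn try_replace_range(s: &mut String, range: Range<usize>, with: &str) -> Result<(), String> {
    if range.start > range.end || range.end > s.len() {
        return Err(format!("range {range:?} out of bounds for length {}", s.len()));
    }
    for idx in [range.start, range.end] {
        if !s.is_char_boundary(idx) {
            return Err(format!("byte offset {idx} is not a char boundary"));
        }
    }
    s.replace_range(range, with);
    Ok(())
}

fn main() {
    let mut s = String::from("\u{1F97A}"); // the bottom emoji, 4 bytes in UTF-8
    println!("{:?}", try_replace_range(&mut s, 3..4, "#")); // offset 3 is mid-character
    println!("{:?}", try_replace_range(&mut s, 0..4, "#")); // the whole emoji: fine
    println!("{s:?}");
}
```

The first call returns an error and leaves the string untouched, the second one succeeds.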
So this is what's happening with lsp-mode + rust-analyzer, right?
lsp-mode is sending bad offsets, rust-analyzer panics. Mystery solved?
Well, yes. But why does that happen? And what should rust-analyzer actually do in this case?
The Language Server Protocol
To answer these questions, we must look at the LSP: the Language Server Protocol.
And instead of reading the spec directly, let's look at what's actually being sent between an LSP client (like lsp-mode for Emacs) and an LSP server (like rust-analyzer).
There's something built into lsp-mode to do this, the lsp-log-io setting, but I think it's a lot more fun to capture it ourselves, by providing a wrapper for rust-analyzer.
// in `src/main.zig`

const std = @import("std");
const ChildProcess = std.ChildProcess;
const StdIo = ChildProcess.StdIo;
const File = std.fs.File;
const Thread = std.Thread;

const DEBUG = false;
const DEBUG_LEAKS = false;
const CLEANUP_STREAMS_WORKAROUND_NAME = ".cleanupStreamsWorkaround";

fn debug(comptime fmt: []const u8, args: anytype) void {
    if (DEBUG) {
        std.debug.print(fmt, args);
    }
}

pub fn main() !void {
    var alloc: std.mem.Allocator = undefined; // like the behavior
    var gpa: ?std.heap.GeneralPurposeAllocator(.{}) = undefined;
    // this `defer` cannot live inside the `if DEBUG_LEAKS` block, because then
    // the allocator would be deinitialized before our program even runs.
    defer if (gpa) |*g| {
        std.debug.print("Checking for memory leaks...\n", .{});
        std.debug.assert(!g.deinit());
    };

    if (DEBUG_LEAKS) {
        // can't figure out how to elide the type name here
        // assigning apparently promotes from `GPA` to `?GPA`
        gpa = std.heap.GeneralPurposeAllocator(.{}){};
        // ...but we still need to 'unwrap' it here (could get away
        // with a temporary variable)
        alloc = gpa.?.allocator();
    } else {
        alloc = std.heap.page_allocator;
    }

    // note: on Linux, this ends up using `fork+execve`. this is a bad idea,
    // as others have found out before: https://github.com/golang/go/issues/5838
    //
    // I was curious why passing `rustup` here worked: for me it had to either
    // parse `PATH` or go through a shell of some sort. Turns out it _does_ parse
    // `PATH` (falling back to `/usr/local/bin:/bin:/usr/bin`) and then does
    // a _bunch_ of `execve` calls in a row, continuing until it doesn't return
    // "AccessDenied" or "FileNotFound".
    var res = try ChildProcess.exec(.{ // anonymous struct literal
        // C99 syntax, yum
        .allocator = alloc,
        .argv = &[_][]const u8{ "rustup", "which", "rust-analyzer" },
    });
    // `ExecResult` doesn't have a `deinit` method, if you don't call these
    // you have a leak
    defer alloc.free(res.stdout);
    defer alloc.free(res.stderr);

    var ra_path = x: { // block with a label
        // I'm not sure why I have to pass `u8` here instead of it being inferred
        var splits = std.mem.split(u8, res.stdout, "\n");
        // breaking the label - blocks aren't expressions, so this is how
        // it's done. note that `first()` doesn't return `?[]const u8`, it's
        // infallible
        break :x splits.first();
    };
    // ra_path is a slice of res.stdout, so we can't free it first
    debug("RA path = '{s}'\n", .{ra_path});

    var proc = ChildProcess.init(
        // the array length can be inferred (from `_`), but the element type cannot
        &[_][]const u8{ra_path},
        alloc,
    );
    // nothing's stopping you from assigning `proc.stdin` etc. at this stage
    // but it won't do anything.
    proc.stdin_behavior = StdIo.Pipe;
    proc.stdout_behavior = StdIo.Pipe;
    proc.stderr_behavior = StdIo.Inherit;
    try proc.spawn();

    // the first argument is `SpawnConfig`, its only member right now
    // is `stack_size`, which defaults to 16MiB.
    //
    // the third argument is a tuple, which corresponds to the arguments of
    // the function you pass. the function takes a single argument which is an
    // anonymous struct, hence the `.{.{}}` dance
    var copy_stdin = try Thread.spawn(.{}, forward, .{.{ .alloc = alloc, .dir = "to-server", .src = std.io.getStdIn(), .dst = &proc.stdin.?, .close_on_eof = true }});

    // this avoids an `expected *T but got *const T` error
    var our_stdout = std.io.getStdOut();
    // close_on_eof is omitted here, so it takes the default value (which is nice)
    var copy_stdout = try Thread.spawn(.{}, forward, .{.{ .alloc = alloc, .dir = "from-server", .src = proc.stdout.?, .dst = &our_stdout }});

    copy_stdin.join();
    copy_stdout.join();

    var term = try proc.wait();
    // note: "structs implement Debug", sort of - printing a `StringHashMap`
    // will make you really sad. any field that's `[]const u8` will be printed
    // as an array of numbers
    debug("term = {}\n", .{term});

    try std.fs.cwd().deleteFile(CLEANUP_STREAMS_WORKAROUND_NAME);
}

// note: `!void` means "return any error or void" (and then you can match on that).
// we could also list every possible error type we return, but it's a lot:
// error{AccessDenied,BadPathName,DeviceBusy,FileBusy,FileLocksNotSupported,
// FileNotFound,FileTooBig,InvalidHandle,InvalidUtf8,IsDir,NameTooLong,NoDevice,
// NoSpaceLeft,NotDir,PathAlreadyExists,PipeBusy,ProcessFdQuotaExceeded,
// SharingViolation,SymLinkLoop,SystemFdQuotaExceeded,SystemResources,
// Unexpected,WouldBlock}
//
// dst is a pointer, because we have to be able to close it _and_ replace it
// with a valid file descriptor, see below.
fn forward(args: struct { alloc: std.mem.Allocator, dir: []const u8, src: File, dst: *File, close_on_eof: bool = false }) !void {
    // note: no destructuring assignment, gotta keep the language small 🤷
    var alloc = args.alloc;
    var src = args.src;
    var dst = args.dst;

    // note: removing the `@as` complains about something comptime related
    var msg_number = @as(usize, 1);

    // note: try is sorta like `?` in Rust, it bubbles up any error
    var tmp = try std.fs.openDirAbsolute("/tmp", .{});

    // you can have tagged enums but you need to define two types
    const StageTag = enum {
        headers,
        body,
    };
    const Stage = union(StageTag) {
        headers: void,
        body: struct {
            headers: std.StringHashMap([]const u8),
            content_length: usize,
        }
    };
    var stage = Stage{
        // `{}` is void, not sure why
        .headers = {},
    };
    const delimiter = "\r\n\r\n";

    // note: intentionally starting with a small buffer to test realloc
    var buffer: []u8 = try alloc.alloc(u8, 10);
    // this makes me nervous, what if `buffer` is re-assigned before free-ing?
    // which value gets freed, the first one or the last one?
    defer alloc.free(buffer);

    // using `@as` would've been an option here too, not sure why
    // some std library code prefers `@as` to this:
    var read: usize = 0;

    while (true) {
        debug("buffer is {} bytes\n", .{buffer.len});
        // note: this does nothing in ReleaseFast and ReleaseSmall modes
        std.debug.assert(read < buffer.len);
        var n = try src.read(buffer[read..]);
        debug("read {}\n", .{n});

        if (n == 0) {
            debug("zero-length read, probably EOF\n", .{});
            if (args.close_on_eof) {
                dst.close();
                // `Process.wait` closes any stream that has StdIo.Pipe behavior,
                // but since we've already closed it, it'll reach "unreachable"
                // code when it notices close returned .EBADF.
                // Unfortunately we really _do_ need to close it here, so we
                // substitute another valid file descriptor to work around this.
                dst.* = try std.fs.cwd().createFile(CLEANUP_STREAMS_WORKAROUND_NAME, .{});
            }
            return;
        }

        read += n;
        debug("{} bytes in buffer\n", .{read});

        switch (stage) {
            StageTag.headers => {
                debug("Looking for CRLF\n", .{});
                x: {
                    // `orelse` unwraps optionals, going from `?T` to `T`.
                    // we use a labelled break to break out of this switch arm
                    // if we haven't found CRLF yet.
                    //
                    // we once again have to specify `u8` even though we're
                    // passing a `[]u8`
                    //
                    // note: `buffer[..n]` is not valid syntax, but `buffer[n..]` is.
                    var index = std.mem.indexOf(u8, buffer[0..n], delimiter) orelse break :x;
                    debug("found CRLF at {}\n", .{index});
                    var msg = buffer[0..index];
                    debug("msg = {s}\n", .{msg});

                    // this hashes keys as strings - `AutoHashMap` is very
                    // unhappy if you try to use it with `[]const u8` keys
                    var headers = std.StringHashMap([]const u8).init(alloc);

                    var splits = std.mem.split(u8, msg, "\r\n");
                    while (splits.next()) |line| {
                        var tokens = std.mem.split(u8, line, ": ");
                        // essentially "strdup", very important since there's no
                        // concept of ownership in zig: de/re-allocating after
                        // that would make the hashmap point to garbage
                        var k = try alloc.dupe(u8, tokens.next().?);
                        var v = try alloc.dupe(u8, tokens.next().?);
                        try headers.put(k, v);
                    }

                    var content_length_s = headers.get("Content-Length").?;
                    var content_length = try std.fmt.parseUnsigned(usize, content_length_s, 10);
                    debug("content-length = {}\n", .{content_length});

                    // note: overlapping src/dst are fine as long as you're
                    // copying to the left, not to the right.
                    std.mem.copy(u8, buffer, buffer[index + delimiter.len ..]);
                    read -= (index + delimiter.len);

                    stage = Stage{
                        .body = .{
                            .headers = headers,
                            .content_length = content_length,
                        },
                    };
                }
            },
            // note: if we used |head| we'd get a copy of the struct, which means
            // a copy of the hashmap, which means we couldn't deinit it.
            StageTag.body => |*head| {
                if (read >= head.content_length) {
                    var body = buffer[0..head.content_length];

                    var msg_dst_path = try std.fmt.allocPrint(alloc, "{s}-{d:0>3}.bin", .{ args.dir, msg_number });
                    // note: unlike Go, `defer` happens at the end of the scope
                    // (a block is a scope, an if is a scope), not on function return
                    defer alloc.free(msg_dst_path);
                    msg_number += 1;

                    // note: this is why we opened `/tmp`, zig's stdlib is dead
                    // set on using `openat`.
                    var msg_dst = try tmp.createFile(msg_dst_path, .{});
                    defer msg_dst.close();

                    var iter = head.headers.iterator();
                    while (iter.next()) |entry| {
                        debug("header: {s}: {s}\n", .{ entry.key_ptr.*, entry.value_ptr.* });
                        var header_line = try std.fmt.allocPrint(alloc, "{s}: {s}\r\n", .{ entry.key_ptr.*, entry.value_ptr.* });
                        defer alloc.free(header_line);
                        try dst.writeAll(header_line);
                        try msg_dst.writeAll(header_line);
                    }
                    try dst.writeAll("\r\n");
                    try msg_dst.writeAll("\r\n");
                    try dst.writeAll(body);
                    try msg_dst.writeAll(body);

                    std.mem.copy(u8, buffer, buffer[head.content_length..]);
                    read -= head.content_length;

                    var free_iter = head.headers.iterator();
                    while (free_iter.next()) |entry| {
                        // key/values need to be freed _before_ calling deinit
                        // if you remove `.*` from these, the compiler will crash
                        // with a monochrome message (it's usually colored)
                        alloc.free(entry.key_ptr.*);
                        alloc.free(entry.value_ptr.*);
                    }
                    head.headers.deinit();

                    stage = Stage{
                        .headers = {},
                    };
                }
                // otherwise keep reading
            },
        }

        if (read >= buffer.len) {
            buffer = try alloc.realloc(buffer, buffer.len * 2);
            debug("re-allocated buffer, new len = {}\n", .{buffer.len});
        }
    }
}
Of course we'll need a build.zig file too, which zig init-exe can generate. This is what I got:
const std = @import("std"); pub fn build(b: *std.build.Builder) void { // Standard target options allows the person running `zig build` to choose // what target to build for. Here we do not override the defaults, which // means any target is allowed, and the default is native. Other options // for restricting supported target set are available. const target = b.standardTargetOptions(.{}); // Standard release options allow the person running `zig build` to select // between Debug, ReleaseSafe, ReleaseFast, and ReleaseSmall. const mode = b.standardReleaseOptions(); const exe = b.addExecutable("lsp-proxy", "src/main.zig"); exe.setTarget(target); exe.setBuildMode(mode); exe.install(); const run_cmd = exe.run(); run_cmd.step.dependOn(b.getInstallStep()); if (b.args) |args| { run_cmd.addArgs(args); } const run_step = b.step("run", "Run the app"); run_step.dependOn(&run_cmd.step); const exe_tests = b.addTest("src/main.zig"); exe_tests.setTarget(target); exe_tests.setBuildMode(mode); const test_step = b.step("test", "Run unit tests"); test_step.dependOn(&exe_tests.step); }
We can then make a ReleaseFast build:
$ time zig build -Drelease-fast
zig build -Drelease-fast  6.30s user 0.32s system 101% cpu 6.551 total
And drop it inside ~/.cargo/bin:
$ cp zig-out/bin/lsp-proxy ~/.cargo/bin/rust-analyzer
Now, if we run emacs again and try to insert the emoji, we can see what happens, because each message is dumped as /tmp/from-server-XXX.bin and /tmp/to-server-XXX.bin.
Here's the first message emacs/lsp-mode sends to rust-analyzer:
Content-Length: 4467 {"jsonrpc":"2.0","method":"initialize","params":{"processId":252882,"rootPath":"/home/amos/bearcove/bottom","clientInfo":{"name":"emacs","version":"GNU Emacs 27.1 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.30, cairo version 1.16.0)\n of 2022-01-24, modified by Debian"},"rootUri":"file:///home/amos/bearcove/bottom","capabilities":{"workspace":{"workspaceEdit":{"documentChanges":true,"resourceOperations":["create","rename","delete"]},"applyEdit":true,"symbol":{"symbolKind":{"valueSet":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]}},"executeCommand":{"dynamicRegistration":false},"didChangeWatchedFiles":{"dynamicRegistration":true},"workspaceFolders":true,"configuration":true,"codeLens":{"refreshSupport":true},"fileOperations":{"didCreate":false,"willCreate":false,"didRename":true,"willRename":true,"didDelete":false,"willDelete":false}},"textDocument":{"declaration":{"dynamicRegistration":true,"linkSupport":true},"definition":{"dynamicRegistration":true,"linkSupport":true},"references":{"dynamicRegistration":true},"implementation":{"dynamicRegistration":true,"linkSupport":true},"typeDefinition":{"dynamicRegistration":true,"linkSupport":true},"synchronization":{"willSave":true,"didSave":true,"willSaveWaitUntil":true},"documentSymbol":{"symbolKind":{"valueSet":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]},"hierarchicalDocumentSymbolSupport":true},"formatting":{"dynamicRegistration":true},"rangeFormatting":{"dynamicRegistration":true},"onTypeFormatting":{"dynamicRegistration":true},"rename":{"dynamicRegistration":true,"prepareSupport":true},"codeAction":{"dynamicRegistration":true,"isPreferredSupport":true,"codeActionLiteralSupport":{"codeActionKind":{"valueSet":["","quickfix","refactor","refactor.extract","refactor.inline","refactor.rewrite","source","source.organizeImports"]}},"resolveSupport":{"properties":["edit","command"]},"dataSupport":true},"completion":{"completionItem":{"snippetSupport"
:false,"documentationFormat":["markdown","plaintext"],"resolveAdditionalTextEditsSupport":true,"insertReplaceSupport":true,"deprecatedSupport":true,"resolveSupport":{"properties":["documentation","detail","additionalTextEdits","command"]},"insertTextModeSupport":{"valueSet":[1,2]}},"contextSupport":true,"dynamicRegistration":true},"signatureHelp":{"signatureInformation":{"parameterInformation":{"labelOffsetSupport":true}},"dynamicRegistration":true},"documentLink":{"dynamicRegistration":true,"tooltipSupport":true},"hover":{"contentFormat":["markdown","plaintext"],"dynamicRegistration":true},"foldingRange":{"dynamicRegistration":true},"selectionRange":{"dynamicRegistration":true},"callHierarchy":{"dynamicRegistration":false},"typeHierarchy":{"dynamicRegistration":true},"publishDiagnostics":{"relatedInformation":true,"tagSupport":{"valueSet":[1,2]},"versionSupport":true},"linkedEditingRange":{"dynamicRegistration":true}},"window":{"workDoneProgress":true,"showDocument":{"support":true}},"experimental":{"snippetTextEdit":null}},"initializationOptions":{"diagnostics":{"enable":true,"enableExperimental":false,"disabled":[],"warningsAsHint":[],"warningsAsInfo":[]},"imports":{"granularity":{"enforce":false,"group":"crate"},"group":true,"merge":{"glob":true},"prefix":"plain"},"lruCapacity":null,"checkOnSave":{"enable":true,"command":"clippy","extraArgs":[],"features":[],"overrideCommand":[]},"files":{"exclude":[],"watcher":"client","excludeDirs":[]},"cargo":{"allFeatures":false,"noDefaultFeatures":false,"features":[],"target":null,"runBuildScripts":true,"loadOutDirsFromCheck":true,"autoreload":true,"useRustcWrapperForBuildScripts":true,"unsetTest":[]},"rustfmt":{"extraArgs":[],"overrideCommand":[],"rangeFormatting":{"enable":false}},"inlayHints":{"bindingModeHints":false,"chainingHints":false,"closingBraceHints":{"enable":true,"minLines":25},"closureReturnTypeHints":false,"lifetimeElisionHints":{"enable":"never","useParameterNames":false},"maxLength":null,"parameterHints":f
alse,"reborrowHints":"never","renderColons":true,"typeHints":{"enable":true,"hideClosureInitialization":false,"hideNamedConstructor":false}},"completion":{"addCallParenthesis":true,"addCallArgumentSnippets":true,"postfix":{"enable":true},"autoimport":{"enable":true},"autoself":{"enable":true}},"callInfo":{"full":true},"procMacro":{"enable":true},"rustcSource":null,"linkedProjects":[],"highlighting":{"strings":true},"experimental":{"procAttrMacros":true}},"workDoneToken":"1"},"id":1}
Interestingly, instead of going for newline-separated JSON, LSP goes for an HTTP/1.1-like protocol, with a header section. I don't love this, but nobody asked.
One long line isn't very readable, so I've prettified the JSON for you:
{ "jsonrpc": "2.0", "method": "initialize", "params": { "processId": 252882, "rootPath": "/home/amos/bearcove/bottom", "clientInfo": { "name": "emacs", "version": "GNU Emacs 27.1 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.30, cairo version 1.16.0)\n of 2022-01-24, modified by Debian" }, "rootUri": "file:///home/amos/bearcove/bottom", "capabilities": { "workspace": { "workspaceEdit": { "documentChanges": true, "resourceOperations": [ "create", "rename", "delete" ] }, "applyEdit": true, "symbol": { "symbolKind": { "valueSet": [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26 ] } }, "executeCommand": { "dynamicRegistration": false }, "didChangeWatchedFiles": { "dynamicRegistration": true }, "workspaceFolders": true, "configuration": true, "codeLens": { "refreshSupport": true }, "fileOperations": { "didCreate": false, "willCreate": false, "didRename": true, "willRename": true, "didDelete": false, "willDelete": false } }, "textDocument": { "declaration": { "dynamicRegistration": true, "linkSupport": true }, "definition": { "dynamicRegistration": true, "linkSupport": true }, "references": { "dynamicRegistration": true }, "implementation": { "dynamicRegistration": true, "linkSupport": true }, "typeDefinition": { "dynamicRegistration": true, "linkSupport": true }, "synchronization": { "willSave": true, "didSave": true, "willSaveWaitUntil": true }, "documentSymbol": { "symbolKind": { "valueSet": [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26 ] }, "hierarchicalDocumentSymbolSupport": true }, "formatting": { "dynamicRegistration": true }, "rangeFormatting": { "dynamicRegistration": true }, "onTypeFormatting": { "dynamicRegistration": true }, "rename": { "dynamicRegistration": true, "prepareSupport": true }, "codeAction": { "dynamicRegistration": true, "isPreferredSupport": true, "codeActionLiteralSupport": { "codeActionKind": { "valueSet": [ "", "quickfix", "refactor", 
"refactor.extract", "refactor.inline", "refactor.rewrite", "source", "source.organizeImports" ] } }, "resolveSupport": { "properties": [ "edit", "command" ] }, "dataSupport": true }, "completion": { "completionItem": { "snippetSupport": false, "documentationFormat": [ "markdown", "plaintext" ], "resolveAdditionalTextEditsSupport": true, "insertReplaceSupport": true, "deprecatedSupport": true, "resolveSupport": { "properties": [ "documentation", "detail", "additionalTextEdits", "command" ] }, "insertTextModeSupport": { "valueSet": [ 1, 2 ] } }, "contextSupport": true, "dynamicRegistration": true }, "signatureHelp": { "signatureInformation": { "parameterInformation": { "labelOffsetSupport": true } }, "dynamicRegistration": true }, "documentLink": { "dynamicRegistration": true, "tooltipSupport": true }, "hover": { "contentFormat": [ "markdown", "plaintext" ], "dynamicRegistration": true }, "foldingRange": { "dynamicRegistration": true }, "selectionRange": { "dynamicRegistration": true }, "callHierarchy": { "dynamicRegistration": false }, "typeHierarchy": { "dynamicRegistration": true }, "publishDiagnostics": { "relatedInformation": true, "tagSupport": { "valueSet": [ 1, 2 ] }, "versionSupport": true }, "linkedEditingRange": { "dynamicRegistration": true } }, "window": { "workDoneProgress": true, "showDocument": { "support": true } }, "experimental": { "snippetTextEdit": null } }, "initializationOptions": { "diagnostics": { "enable": true, "enableExperimental": false, "disabled": [ ], "warningsAsHint": [ ], "warningsAsInfo": [ ] }, "imports": { "granularity": { "enforce": false, "group": "crate" }, "group": true, "merge": { "glob": true }, "prefix": "plain" }, "lruCapacity": null, "checkOnSave": { "enable": true, "command": "clippy", "extraArgs": [ ], "features": [ ], "overrideCommand": [ ] }, "files": { "exclude": [ ], "watcher": "client", "excludeDirs": [ ] }, "cargo": { "allFeatures": false, "noDefaultFeatures": false, "features": [ ], "target": null, 
"runBuildScripts": true, "loadOutDirsFromCheck": true, "autoreload": true, "useRustcWrapperForBuildScripts": true, "unsetTest": [ ] }, "rustfmt": { "extraArgs": [ ], "overrideCommand": [ ], "rangeFormatting": { "enable": false } }, "inlayHints": { "bindingModeHints": false, "chainingHints": false, "closingBraceHints": { "enable": true, "minLines": 25 }, "closureReturnTypeHints": false, "lifetimeElisionHints": { "enable": "never", "useParameterNames": false }, "maxLength": null, "parameterHints": false, "reborrowHints": "never", "renderColons": true, "typeHints": { "enable": true, "hideClosureInitialization": false, "hideNamedConstructor": false } }, "completion": { "addCallParenthesis": true, "addCallArgumentSnippets": true, "postfix": { "enable": true }, "autoimport": { "enable": true }, "autoself": { "enable": true } }, "callInfo": { "full": true }, "procMacro": { "enable": true }, "rustcSource": null, "linkedProjects": [ ], "highlighting": { "strings": true }, "experimental": { "procAttrMacros": true } }, "workDoneToken": "1" }, "id": 1 }
Good job on scrolling down all that way! Take a second to catch your breath.
You can check the JSON-RPC 2.0 spec if you're curious about the request/response coating we have here, but mostly, yeah, the client is announcing capabilities to the server, and passing some cargo/rustfmt specific config.
They even have a user-agent! (clientInfo.version).
Similarly, the server responds with its own capabilities:
{ "jsonrpc": "2.0", "id": 1, "result": { "capabilities": { "textDocumentSync": { "openClose": true, "change": 2, "save": { } }, "selectionRangeProvider": true, "hoverProvider": true, "completionProvider": { "resolveProvider": true, "triggerCharacters": [ ":", ".", "'", "(" ], "completionItem": { "labelDetailsSupport": false } }, "signatureHelpProvider": { "triggerCharacters": [ "(", ",", "<" ] }, "definitionProvider": true, "typeDefinitionProvider": true, "implementationProvider": true, "referencesProvider": true, "documentHighlightProvider": true, "documentSymbolProvider": true, "workspaceSymbolProvider": true, "codeActionProvider": { "codeActionKinds": [ "", "quickfix", "refactor", "refactor.extract", "refactor.inline", "refactor.rewrite" ], "resolveProvider": true }, "codeLensProvider": { "resolveProvider": true }, "documentFormattingProvider": true, "documentRangeFormattingProvider": false, "documentOnTypeFormattingProvider": { "firstTriggerCharacter": "=", "moreTriggerCharacter": [ ".", ">", "{" ] }, "renameProvider": { "prepareProvider": true }, "foldingRangeProvider": true, "declarationProvider": true, "workspace": { "fileOperations": { "willRename": { "filters": [ { "scheme": "file", "pattern": { "glob": "**/*.rs", "matches": "file" } }, { "scheme": "file", "pattern": { "glob": "**", "matches": "folder" } } ] } } }, "callHierarchyProvider": true, "semanticTokensProvider": { "legend": { "tokenTypes": [ "comment", "decorator", "enumMember", "enum", "function", "interface", "keyword", "macro", "method", "namespace", "number", "operator", "parameter", "property", "string", "struct", "typeParameter", "variable", "angle", "arithmetic", "attribute", "attributeBracket", "bitwise", "boolean", "brace", "bracket", "builtinAttribute", "builtinType", "character", "colon", "comma", "comparison", "constParameter", "derive", "deriveHelper", "dot", "escapeSequence", "formatSpecifier", "generic", "label", "lifetime", "logical", "macroBang", "parenthesis", "punctuation", 
"selfKeyword", "selfTypeKeyword", "semicolon", "typeAlias", "toolModule", "union", "unresolvedReference" ], "tokenModifiers": [ "documentation", "declaration", "static", "defaultLibrary", "async", "attribute", "callable", "constant", "consuming", "controlFlow", "crateRoot", "injected", "intraDocLink", "library", "mutable", "public", "reference", "trait", "unsafe" ] }, "range": true, "full": { "delta": true } }, "inlayHintProvider": { "resolveProvider": true }, "experimental": { "externalDocs": true, "hoverRange": true, "joinLines": true, "matchingBrace": true, "moveItem": true, "onEnter": true, "openCargoToml": true, "parentModule": true, "runnables": { "kinds": [ "cargo" ] }, "ssr": true, "workspaceSymbolScopeKindFiltering": true } }, "serverInfo": { "name": "rust-analyzer", "version": "1.67.1 (d5a82bb 2023-02-07)" } } }
Which is thankfully much shorter.
So, what character encoding do we have here? It's UTF-8. You can pass a Content-Type header, but I don't think anyone does, and it defaults to application/vscode-jsonrpc; charset=utf-8.
Note that headers are ASCII-only, just like HTTP/1.1 (don't lawyer me on this), which already lets us know it's not using UTF-16 for everything.
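To make the framing concrete, here is a minimal sketch of both directions in Node. The helper names encodeFrame and parseFrame are made up for this example (they don't come from lsp-mode, rust-analyzer or the Zig proxy), and the sketch assumes a well-formed Content-Length header is always present.

```javascript
// Hypothetical helpers for illustration; not taken from lsp-mode or rust-analyzer.

// Frame one JSON-RPC message: ASCII headers, a blank line, then a UTF-8 body.
function encodeFrame(message) {
  const body = Buffer.from(JSON.stringify(message), "utf8");
  // Content-Length is the size of the body in bytes, not in characters.
  const header = Buffer.from(`Content-Length: ${body.length}\r\n\r\n`, "ascii");
  return Buffer.concat([header, body]);
}

// Parse one frame from the start of `buf`, or return null if it is incomplete.
function parseFrame(buf) {
  const delimiter = "\r\n\r\n";
  const headerEnd = buf.indexOf(delimiter);
  if (headerEnd === -1) return null; // headers not fully received yet

  const headers = {};
  for (const line of buf.subarray(0, headerEnd).toString("ascii").split("\r\n")) {
    const sep = line.indexOf(": ");
    headers[line.slice(0, sep)] = line.slice(sep + 2);
  }

  const length = Number.parseInt(headers["Content-Length"], 10);
  const bodyStart = headerEnd + delimiter.length;
  if (buf.length < bodyStart + length) return null; // body not fully received yet

  const body = buf.subarray(bodyStart, bodyStart + length).toString("utf8");
  return { headers, message: JSON.parse(body), rest: buf.subarray(bodyStart + length) };
}

const frame = encodeFrame({ text: "\u{1F97A}" }); // the bottom emoji
console.log(frame.toString("utf8").split("\r\n")[0]); // Content-Length: 15
console.log(parseFrame(frame).message.text.length);   // 2 (UTF-16 code units)
```

The emoji counts as four bytes towards Content-Length, but as two units once it's a JavaScript string. That mismatch is the theme of everything that follows.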
In this code, when inserting the bottom emoji at this location:
fn main() {
    // 🥺
    println!("Hello, world!");
}
These are the messages sent by lsp-mode to rust-analyzer:
{ "jsonrpc": "2.0", "method": "textDocument/didChange", "params": { "textDocument": { "uri": "file:///home/amos/bearcove/bottom/src/main.rs", "version": 1 }, "contentChanges": [ { "range": { "start": { "line": 1, "character": 7 }, "end": { "line": 1, "character": 7 } }, "rangeLength": 0, "text": "🥺" } ] } }
It's important to note that the file hasn't been saved yet. You don't need to save your code to get completions, or be able to hover symbols or jump to definitions: that means the LSP server operates on what the editor has in memory, and so it needs a view of that, not of the file. I don't think LSP servers need to do any I/O at all (at least not for opened documents).
Here the change is: for line 1 (lines are zero-indexed), at character 7, we're inserting "🥺". I imagine if the range wasn't of length zero, this would serve as a "replace" operation.
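Here's a rough sketch of how a server could apply such a change to its in-memory copy. offsetAt and applyChange are hypothetical helpers, and the sketch assumes "\n" line endings and valid positions. Since JavaScript string indices are already UTF-16 code units, no conversion is needed on this side.

```javascript
// Hypothetical helpers: no validation, "\n" line endings assumed.

// Turn an LSP { line, character } position into a string index.
function offsetAt(text, position) {
  let offset = 0;
  for (let line = 0; line < position.line; line++) {
    offset = text.indexOf("\n", offset) + 1;
  }
  return offset + position.character;
}

// Replace the given range with the new text (an empty range is an insertion).
function applyChange(text, change) {
  const start = offsetAt(text, change.range.start);
  const end = offsetAt(text, change.range.end);
  return text.slice(0, start) + change.text + text.slice(end);
}

const before = 'fn main() {\n    // \n    println!("Hello, world!");\n}\n';
const after = applyChange(before, {
  range: { start: { line: 1, character: 7 }, end: { line: 1, character: 7 } },
  text: "\u{1F97A}", // the bottom emoji
});
console.log(after.split("\n")[1]);
```

With a non-empty range, the same function replaces text instead of inserting it.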
Then, we have:
{ "jsonrpc": "2.0", "method": "textDocument/codeAction", "params": { "textDocument": { "uri": "file:///home/amos/bearcove/bottom/src/main.rs" }, "range": { "start": { "line": 1, "character": 8 }, "end": { "line": 1, "character": 8 } }, "context": { "diagnostics": [ ] } }, "id": 17 }
According to the spec:
The code action request is sent from the client to the server to compute commands for a given text document and range.
These commands are typically code fixes to either fix problems or to beautify/refactor code.
The result of a textDocument/codeAction request is an array of Command literals which are typically presented in the user interface.
And here lies the heart of the issue: "🥺" was inserted at character 7. And lsp-mode is now asking for code actions at character 8.
Which begs the question: what is a character?
What is a character?
Of course, nobody can quite agree on the answer.
You won't make enemies by stating that "a" is a character.
You might even convince them that "é" is a character.
But is "é" one character?
Wait, you said the same thing twice.
No I didn't!
$ node
Welcome to Node.js v18.14.0.
Type ".help" for more information.
> new TextEncoder().encode("é")
Uint8Array(2) [ 195, 169 ]
> new TextEncoder().encode("é")
Uint8Array(3) [ 101, 204, 129 ]
The former is U+00E9 Latin Small Letter E with Acute, whereas the latter is U+0065 Latin Small Letter E followed by U+0301 Combining Acute Accent.
They're, respectively, one unicode code point...
> "é".length
1
> "é".charAt(0)
"é"
...and two unicode code points:
> "é".length
2
> "é".charAt(0)
'e'
> "é".charAt(1)
'́'
They are both, however, only one grapheme cluster.
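If you want to count grapheme clusters rather than code points, ECMAScript has Intl.Segmenter for that (it ships with recent Node versions). A small sketch, where countGraphemes is a name I made up:

```javascript
// Count user-perceived characters (grapheme clusters) with the standard Intl API.
function countGraphemes(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(s)].length;
}

const precomposed = "\u00e9"; // U+00E9
const decomposed = "e\u0301"; // U+0065 followed by U+0301

console.log(precomposed.length, countGraphemes(precomposed)); // 1 1
console.log(decomposed.length, countGraphemes(decomposed));   // 2 1

// NFC normalization composes the pair into the single code point.
console.log(decomposed.normalize("NFC") === precomposed);     // true
```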
What about emojis?
We already know "🥺" encodes to four bytes in UTF-8:
> new TextEncoder("utf-8").encode("🥺")
Uint8Array(4) [ 240, 159, 165, 186 ]
And if you enjoyed things like Wat, you might already know what "🥺".length is going to be:
> "🥺".length
2
> "🥺".charAt(0)
'\ud83e'
> "🥺".charAt(1)
'\udd7a'
Hey... hey! Those are surrogates!
They sure are! charAt returns UTF-16 code units!
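Those two values aren't arbitrary. Here's the standard UTF-16 arithmetic that turns a code point above U+FFFF into a surrogate pair, wrapped in a helper I'm calling toSurrogatePair for the occasion:

```javascript
// Split a code point outside the BMP into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("only code points above U+FFFF need a surrogate pair");
  }
  const offset = codePoint - 0x10000;    // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f97a); // U+1F97A, the bottom emoji
console.log(high.toString(16), low.toString(16)); // d83e dd7a
console.log(String.fromCharCode(high, low) === "\u{1F97A}"); // true
```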
It goes even deeper for more involved emojis:
> "👨‍👩‍👦".length
8
This "family: man, woman, boy" emoji (which should render as a single grapheme cluster for you, unless you're visiting us with a browser from the past, or from the terminal) is actually made up of the Man emoji, the Woman emoji and the Boy emoji, joined together with ZWJ, Zero Width Joiners.
We can see each individual component:
> [..."👨‍👩‍👦"]
[ '👨', '‍', '👩', '‍', '👦' ]
What are these components exactly though? They're not UTF-16 code units: we know that's what .length returns, and it said 8 (we also know each of these emojis probably takes up 2 UTF-16 code units).
They're Unicode code points. We can get code units by iterating from 0 to s.length and calling charCodeAt:
> var s = "👨‍👩‍👦"; for (let i = 0; i < s.length; i++) { console.log(s.charCodeAt(i).toString(16)) }
d83d
dc68
200d
d83d
dc69
200d
d83d
dc66
And we can get code point values with codePointAt, which takes offsets in UTF-16 code units. MDN says:
The codePointAt() method returns a non-negative integer that is the Unicode code point value at the given position. Note that this function does not give the nth code point in a string, but the code point starting at the specified string index.
And further:
If the element at pos is a UTF-16 high surrogate, returns the code point of the surrogate pair.
And:
If the element at pos is a UTF-16 low surrogate, returns only the low surrogate code point.
So, if we really want to index into a string, we have to be very careful, or we'll get wrong results (where Rust chooses to panic):
> var s = "👨‍👩‍👦"; [0, 2, 3, 5, 7].map(i => `U+${s.codePointAt(i).toString(16)}`)
[ 'U+1f468', 'U+200d', 'U+1f469', 'U+200d', 'U+dc66' ]
U+200D is Zero-Width Joiner, this checks out.
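One way to avoid landing on a low surrogate like that last U+dc66 is to check whether an index sits in the middle of a pair before using it. A small sketch, with isCodePointBoundary being a hypothetical helper:

```javascript
// An index is "inside" a code point when it points at a low surrogate
// that directly follows a high surrogate.
function isCodePointBoundary(s, i) {
  const unit = s.charCodeAt(i);     // NaN when out of range: comparisons are false
  const prev = s.charCodeAt(i - 1);
  const isLow = unit >= 0xdc00 && unit <= 0xdfff;
  const prevIsHigh = prev >= 0xd800 && prev <= 0xdbff;
  return !(isLow && prevIsHigh);
}

// Man, ZWJ, woman, ZWJ, boy
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F466}";
const starts = [];
for (let i = 0; i < family.length; i++) {
  if (isCodePointBoundary(family, i)) starts.push(i);
}
console.log(starts); // [ 0, 2, 3, 5, 6 ]
console.log(starts.map((i) => `U+${family.codePointAt(i).toString(16)}`));
```

Index 7 isn't in the list, which is why asking codePointAt(7) returned a lone low surrogate.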
To not get it wrong, it's easier to use iteration, which is based on Unicode code points, unlike indexing which is based on UTF-16 code units:
> for (const c of "👨‍👩‍👦") { console.log(`U+${c.codePointAt(0).toString(16)}`) }
U+1f468
U+200d
U+1f469
U+200d
U+1f466
Note that even though for..of is already iterating over codepoints, it "yields" strings itself, since there's no "character" or "rune" type in ECMAScript - the closest you get is what codePointAt returns, which is a Number, which is only accurate for integers up to 53 bits wide.
Luckily, the maximum valid Unicode code point is U+10FFFF, so we only need 21 bits. (Also, only ~150K code points are defined by Unicode 15.0, the latest version as of February of 2023).
Which explains why, if you go to read the documentation for Emacs, a 47-year-old piece of software, you will find the following:
To support this multitude of characters and scripts, Emacs closely follows the Unicode Standard.
The Unicode Standard assigns a unique number, called a codepoint, to each and every character. The range of codepoints defined by Unicode, or the Unicode codespace, is 0..#x10FFFF (in hexadecimal notation), inclusive.
Emacs extends this range with codepoints in the range #x110000..#x3FFFFF, which it uses for representing characters that are not unified with Unicode and raw 8-bit bytes that cannot be interpreted as characters.
Thus, a character codepoint in Emacs is a 22-bit integer.
Makes perfect sense to me.
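The bit counts are easy to double-check. bitLength below is a throwaway helper of mine, which goes through BigInt so that the 53-bit case doesn't run into floating-point rounding:

```javascript
// Number of binary digits needed to write a positive integer.
function bitLength(n) {
  return BigInt(n).toString(2).length;
}

console.log(bitLength(0x10ffff));                // 21: largest Unicode code point
console.log(bitLength(0x3fffff));                // 22: largest Emacs "character"
console.log(bitLength(Number.MAX_SAFE_INTEGER)); // 53: what a Number holds exactly
```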
Caught in the middle
In summary: no one can agree what a character is.
I can't find reliable sources on when "multibyte support" was added to Emacs, but it's safe to say they decided a "character" was a "Unicode code point", ignoring zero-width joiners, combining characters, etc. Fair enough.
As for the LSP spec, it was originally developed for Visual Studio Code, which is implemented in ECMAScript (JavaScript).
And because indexing into strings in ECMAScript (and s.length, etc.) is based on UTF-16 code units, they decided a character should be... a UTF-16 code unit!
You may be recoiling in horror at this point, and who could blame you, but the point is, this is a fantastic cautionary tale: if your API ever has a field named character:
"start": { "line": 1, "character": 8 }
...then you've taken a wrong turn and need to reconsider. Immediately.
To finish elucidating the exact LSP transcript we've obtained between lsp-mode and rust-analyzer: that "character" field should have value 9, not 8, because even though "🥺" is one "Emacs character" (one Unicode code point), it is two "LSP characters" (two UTF-16 code units).
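Here's that conversion as a sketch: going from a column counted in code points (what Emacs has) to a column counted in UTF-16 code units (what LSP wants by default). codePointColToUtf16Col is a made-up name and this is not what lsp-mode does, just the arithmetic.

```javascript
// Convert a column counted in code points into a column counted in
// UTF-16 code units, by walking the line one code point at a time.
function codePointColToUtf16Col(line, col) {
  let units = 0;
  let seen = 0;
  for (const c of line) {
    if (seen === col) break;
    units += c.length; // 1 for BMP code points, 2 for everything above
    seen++;
  }
  return units;
}

const line = "    // \u{1F97A}"; // line 1 of our main.rs, after the insertion
console.log(codePointColToUtf16Col(line, 7)); // 7: before the emoji, nothing changes
console.log(codePointColToUtf16Col(line, 8)); // 9: after the emoji, off by one
```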
What of rust-analyzer then? Well, it certainly doesn't use UTF-16 as its internal representation: like most Rust programs (that don't need to deal with pre-UTF-8-codepage Windows APIs), it uses UTF-8.
But it does implement LSP to the letter and treats "character offsets" as UTF-16 code units. Which it can do because, thanks to textDocument/didChange, it has its own copy of the document text (in UTF-8).
And it's able to translate UTF-16 code unit offsets to UTF-8 offsets. Up until now I've been making claims about what rust-analyzer does without looking, so it's time to look: if we follow the thread, we'll surely find... yes!:
// in `rust-analyzer/crates/ide-db/src/line_index.rs`

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct LineIndex {
    /// Offset of the beginning of each line, zero-based
    pub(crate) newlines: Vec<TextSize>,
    /// List of non-ASCII characters on each line
    pub(crate) utf16_lines: NoHashHashMap<u32, Vec<Utf16Char>>,
}

#[derive(Clone, Debug, Hash, PartialEq, Eq)]
pub(crate) struct Utf16Char {
    /// Start offset of a character inside a line, zero-based
    pub(crate) start: TextSize,
    /// End offset of a character inside a line, zero-based
    pub(crate) end: TextSize,
}
...an index of every non-ASCII character for every line of every file, which it can then use to translate between UTF-16 and UTF-8 columns (shown here in the UTF-8 to UTF-16 direction):
impl LineIndex {
    fn utf8_to_utf16_col(&self, line: u32, col: TextSize) -> usize {
        let mut res: usize = col.into();
        if let Some(utf16_chars) = self.utf16_lines.get(&line) {
            for c in utf16_chars {
                if c.end <= col {
                    res -= usize::from(c.len()) - c.len_utf16();
                } else {
                    // From here on, all utf16 characters come *after* the character we are mapping,
                    // so we don't need to take them into account
                    break;
                }
            }
        }
        res
    }
}
To do that, it uses a nice property: if some codepoint needs 4 UTF-8 bytes, it needs 2 UTF-16 code units (and if it needs 3 bytes or fewer, it needs just one):
impl Utf16Char {
    /// Returns the length in 8-bit UTF-8 code units.
    fn len(&self) -> TextSize {
        self.end - self.start
    }

    /// Returns the length in 16-bit UTF-16 code units.
    fn len_utf16(&self) -> usize {
        // 👇
        if self.len() == TextSize::from(4) {
            2
        } else {
            1
        }
    }
}
And that's how rust-analyzer adheres to the LSP spec. And lsp-mode doesn't.
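For the other direction (a UTF-16 column coming in from the client, a UTF-8 offset needed internally), here's a naive sketch in JavaScript. It is not rust-analyzer's code: it walks the whole line instead of using an index, and utf16ColToUtf8Col is a hypothetical name. It does show where a bad offset becomes detectable.

```javascript
// Translate a UTF-16 column into a UTF-8 byte offset within one line.
// Throws when the column falls inside a surrogate pair or past the line end.
function utf16ColToUtf8Col(line, utf16Col) {
  let utf8 = 0;
  let utf16 = 0;
  for (const c of line) {
    if (utf16 === utf16Col) return utf8;
    if (utf16 > utf16Col) break;          // we stepped over the column
    utf16 += c.length;                    // 1 or 2 UTF-16 code units
    utf8 += Buffer.byteLength(c, "utf8"); // 1 to 4 UTF-8 bytes
  }
  if (utf16 === utf16Col) return utf8;    // column at the very end of the line
  throw new RangeError(`column ${utf16Col} is not on a code point boundary`);
}

const line = "    // \u{1F97A}";
console.log(utf16ColToUtf8Col(line, 7)); // 7
console.log(utf16ColToUtf8Col(line, 9)); // 11
try {
  utf16ColToUtf8Col(line, 8);            // what lsp-mode sent
} catch (e) {
  console.log(e.message);
}
```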
Of course, as a user, it's hard to bring oneself to care about that. As a rust-analyzer user, it feels like rust-analyzer's bug. It's the one crashing!
Why does it crash? Couldn't it just... not?
Couldn't it just ignore erroneous input? Or... JSON-RPC has mechanisms for error reporting, couldn't it just use that?
Both of these are nonstarters, and I'll explain why.
If rust-analyzer ever gets a character offset that's in the middle of a UTF-16 surrogate pair, like it did in our recorded sequence (the textDocument/codeAction request at character 8), then... all bets are off!
It can't just "assume there's been an oopsie" and go to the end of the UTF-16 surrogate pair, because there might have already been errors before that: if we had two similar emojis before, it would think it's before the bottom emoji, rather than after.
That means completions would be wrong, which might seem innocent enough, but it also means that any further textDocument/didChange requests will give the LSP server (in this case rust-analyzer) a corrupted view of the document!
And that can already happen, without crashing!
If we insert "🥲😎" in one go, then "🥺" at offset 2, the Emacs buffer will have "🥲😎🥺", but any compliant LSP server will have "🥲🥺😎".
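That silent divergence is easy to reproduce. In this sketch, one function inserts at a code point offset (the editor's view) and the other at a UTF-16 code unit offset (a compliant server's view). Both function names are mine, and any two emojis outside the BMP would do for the initial text.

```javascript
// Editor side: offsets count code points.
function insertAtCodePoint(text, offset, inserted) {
  const points = [...text];
  points.splice(offset, 0, inserted);
  return points.join("");
}

// Server side: offsets count UTF-16 code units, as the spec says.
function insertAtUtf16Unit(text, offset, inserted) {
  return text.slice(0, offset) + inserted + text.slice(offset);
}

const doc = "\u{1F972}\u{1F60E}"; // two emojis, each outside the BMP
const bottom = "\u{1F97A}";

const editorView = insertAtCodePoint(doc, 2, bottom);
const serverView = insertAtUtf16Unit(doc, 2, bottom);
console.log(editorView, serverView);
console.log(editorView === serverView); // false
```

Offset 2 happens to fall on a boundary in both interpretations, so nothing can be detected: the two sides just disagree from then on.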
So, when it is caught, when an LSP server detects that it's getting bad offsets from the client (in this case lsp-mode), it absolutely should sound the alarm!
But is crashing the right thing to do though?
Didn't you mention JSON-RPC had a built-in mechanism for reporting errors?
Oh it does! But unfortunately, error handling is hard. There's nothing in the LSP spec that says "your implementation is fundamentally flawed and we cannot rely on ANYTHING going forward". At best you can do "this specific request has failed".
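For reference, this is roughly what "this specific request has failed" looks like on the wire. The envelope and the -32602 "Invalid params" code come from JSON-RPC 2.0; the errorResponse helper and the message text are made up.

```javascript
// -32602 is the JSON-RPC 2.0 predefined code for "Invalid params".
const INVALID_PARAMS = -32602;

// Build an error response for the request with the given id.
function errorResponse(id, code, message) {
  return { jsonrpc: "2.0", id, error: { code, message } };
}

// id 17 is the textDocument/codeAction request from our transcript.
const response = errorResponse(17, INVALID_PARAMS, "position is inside a surrogate pair");
console.log(JSON.stringify(response));
```

Nothing in that envelope lets the server say "and by the way, every offset you send from now on is suspect".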
But requests fail all the time for various reasons and LSP clients barely bother to show you something (VSCode notably shows a notification, which is more annoying than anything else, since you can't do anything about it).
How should a client behave when it gets an error? Should it decide to give up after a certain number of errors? How many is enough? Eight? Sixteen? The situation isn't going to fix itself. If a textDocument/didChange request fails, it's game over - the representations on either side have already diverged.
Restarting the server, like lsp-mode suggests in a prompt, is not really going to help either, because editing any code after the emoji will have the wrong offsets, and so the representations will start to diverge again very soon.
This is an error condition that is rare (someone messed up the protocol in a BIG way) and unrecoverable (there's a very low likelihood of anything correct or useful happening after that): as frustrating as it is for Emacs users, rust-analyzer is absolutely correct in panicking here.
(In fact, it would be neat if the LSP handshake included some basic checks to make sure both ends agree on what offsets mean, but it's a little late for that.)
The way forward
This is where things get delicate.
Because the tech of it all is, as usual, relatively straightforward. If you've made it this far, you have a thorough understanding of exactly what's going wrong and why, going all the way back to JavaScript being designed in 10 days in 1995.
The way forward is for lsp-mode to fix itself, clearly. But if you're maintaining the tool at the other end of the LSP discussion, it's hard to politely yet urgently signify this to them. Ideally we should get along with our neighbors just fine!
And in the neighbor's defense, using UTF-16 code units for offsets is... it's not good. At all. Modelling a spec after a specific language's limitations is how we ended up with "you can't really send integers bigger than 9e15 in a JSON document and be sure they'll come out okay on the other side".
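That integer limit is quick to see in Node. The literal below is 2^53 + 1, a perfectly valid JSON number:

```javascript
const wire = "9007199254740993"; // 2^53 + 1
const parsed = JSON.parse(wire);

console.log(parsed);                       // 9007199254740992
console.log(Number.isSafeInteger(parsed)); // false
console.log(BigInt(wire));                 // 9007199254740993n
```

The value survives as a BigInt, but JSON.parse hands back a Number, and the last digit is already gone by then.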
"Sending the correct UTF-16 code unit offsets" would probably involve something like re-encoding lines from "whatever Emacs uses internally" (presumably UTF-8) to UTF-16, which can be time-consuming, especially if you're doing that in Emacs Lisp I suppose. Or you could do something smarter, like rust-analyzer does, but then you better have good unit tests (like rust-analyzer also does).
Still in the neighbor's defense, there's a lot of code out there that does not use any code points outside of the BMP, and thus doesn't have any UTF-16 surrogate pairs. So it seems like an edge case for lsp-mode (whereas it is, in fact, a fundamental disagreement with the spec).
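To make the "dumb but expensive" option concrete, here's a minimal sketch, in Rust rather than Emacs Lisp, assuming we're handed a line of text and a UTF-8 byte offset into it. The function names are made up for illustration; this is neither lsp-mode's nor rust-analyzer's code. The second function counts code points instead, which is the kind of answer you get if you just count "characters".

```rust
/// UTF-8 byte offset within `line` -> UTF-16 code unit offset.
/// Returns `None` if the offset is past the end of the line or lands in
/// the middle of a multi-byte character.
fn utf16_col(line: &str, byte_offset: usize) -> Option<usize> {
    // `str::get` refuses ranges that don't end on a character boundary.
    let prefix = line.get(..byte_offset)?;
    // Each char is 1 UTF-16 code unit, or 2 if it's outside the BMP.
    Some(prefix.chars().map(char::len_utf16).sum())
}

/// Same position, but counted in Unicode code points.
fn code_point_col(line: &str, byte_offset: usize) -> Option<usize> {
    let prefix = line.get(..byte_offset)?;
    Some(prefix.chars().count())
}
```

For `héllo`, both functions agree at every position, because `é` is in the BMP and takes a single UTF-16 code unit. For `a🥺b`, the `b` sits at code point offset 2 but UTF-16 offset 3. That's why the disagreement stays invisible until someone types an emoji. Note that both functions walk the whole prefix on every call, which is the cost being worried about above.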
Luckily, and this is where I feel like the rust-analyzer project has been diplomatic, steps have already been taken to move this forward. You and I, we're not the only UTF-16 haters in the world: there's also clangd, the C/C++ language server, which went "lol what no" and straight up decided to allow clients to use UTF-8 offsets instead.
And, extending a hand to lsp-mode, so they don't have to do something dumb but expensive (re-encoding as UTF-16) or clever but fast (keep a cache of offsets, use the length-in-other-encoding trick RA does), rust-analyzer has implemented that extension.
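The "clever but fast" option can be sketched roughly like this. It's loosely inspired by the idea of remembering where the non-ASCII characters are, and it is not a copy of rust-analyzer's implementation: `WideChar` and `LineCache` are names invented here, and a real version would track every line of a file rather than a single one.

```rust
/// One non-ASCII character on a line.
#[derive(Debug, Clone, Copy)]
struct WideChar {
    start: usize,     // UTF-8 byte offset of the character within the line
    len_utf8: usize,  // 2, 3 or 4 bytes
    len_utf16: usize, // 1 or 2 code units
}

/// Hypothetical per-line cache: ASCII costs nothing, only wide characters
/// are recorded, in the order they appear.
struct LineCache {
    wide: Vec<WideChar>,
}

impl LineCache {
    fn new(line: &str) -> Self {
        let wide = line
            .char_indices()
            .filter(|(_, c)| !c.is_ascii())
            .map(|(start, c)| WideChar {
                start,
                len_utf8: c.len_utf8(),
                len_utf16: c.len_utf16(),
            })
            .collect();
        LineCache { wide }
    }

    /// UTF-8 byte offset -> UTF-16 code unit offset.
    fn utf8_to_utf16(&self, byte_col: usize) -> usize {
        let mut col = byte_col;
        for w in &self.wide {
            // Stop at the first wide character that ends after the position.
            if w.start + w.len_utf8 > byte_col {
                break;
            }
            // Shrink by the difference in encoded lengths.
            col -= w.len_utf8 - w.len_utf16;
        }
        col
    }

    /// UTF-16 code unit offset -> UTF-8 byte offset.
    fn utf16_to_utf8(&self, utf16_col: usize) -> usize {
        // `col` is converted progressively: it is already a byte offset
        // with respect to every wide character handled so far.
        let mut col = utf16_col;
        for w in &self.wide {
            if col <= w.start {
                break;
            }
            col += w.len_utf8 - w.len_utf16;
        }
        col
    }
}
```

Once the cache is built, a conversion only visits the wide characters that come before the position, never the whole line, and a pure-ASCII line does no work at all. It's also the kind of code where an off-by-one is easy to write, hence the remark about unit tests.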
So the timeline now goes:
- Aug 14, 2020: issue filed against lsp-mode
- Jan 19, 2021: rust-analyzer team learns about this
- Jan 26, 2021: rust-analyzer adds support for clangd UTF-8 offsets extensions
- Jan 31, 2021: someone makes a fix for lsp-mode to send proper UTF-16 offsets, that isn't merged
- Feb 05, 2022: someone requests UTF-8 offset support in lsp-mode
- May 10, 2022: the LSP spec adds support for UTF-8 offsets officially in 3.17.0
- Oct 05, 2022: LSP 3.17.0 becomes the "current" version of the spec
- Oct 21, 2022: lsp-types (Rust crate) adds support for UTF-8 offsets
- Oct 26, 2022: rust-analyzer switches from clangd UTF-8 offsets to the official way
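The "official way" in LSP 3.17 is a small negotiation: the client lists the encodings it can speak in `general.positionEncodings`, and the server answers with a single `positionEncoding` in its capabilities. Here's a rough sketch of the server's side of that choice. The function name is hypothetical, and a real server would parse these values out of the `initialize` request rather than take a slice of strings.

```rust
/// Pick a position encoding from what the client advertised in
/// `general.positionEncodings`. `None` means the client didn't send
/// the field at all (for example, a pre-3.17 client).
fn pick_position_encoding(client_offers: Option<&[&str]>) -> &'static str {
    match client_offers {
        // Both ends can count in UTF-8: no conversion needed on either side.
        Some(offers) if offers.contains(&"utf-8") => "utf-8",
        // UTF-16 is the one encoding the spec requires everyone to support,
        // so it's always a safe fallback.
        _ => "utf-16",
    }
}
```

A client that never mentions UTF-8 still gets UTF-16, so nothing changes for existing clients; only the ones that opt in get to skip the conversion.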
And here we are, 2.5 years after the initial filing, with the same issue being noticed and reported over and over again against rust-analyzer (and, I imagine, similarly correctness-minded language servers).
So, here's my call to action.
Dear lsp-mode maintainers, please, in that order:
- Do not fall back to RLS, ever.
- Fix your "character offsets" to be UTF-16 code units. I know it's bad, but it's bad for everyone and that's what the spec says.
- Implement the UTF-8 offset extensions (LSP 3.17.0), that way we don't have this silly situation where two tools that use UTF-8 internally do conversions in both directions every time they speak.
It's overdue.