My gift to the rustdoc team
Thanks to my sponsors: Christoph Grabo, Dirkjan Ochtman, James Leitch, David Barsky, WeblWabl, Manuel Hutter, John VanEnk, Aleksandre Khokhiashvili, Laine Taffin Altman, Stephan Buys, Max Bruckner, James Brown, Marcus Griep, Dylan Anthony, Dave Minter, Radu Matei, Guilherme Neubaner, Josh Triplett, Alex Krantz, Michal Hošna and 257 more
About two weeks ago I entered a discussion with the docs.rs team about, basically, why we have to look at this:
When we could be looking at this:
And of course, as always, there are reasons why things are the way they are. In an effort to understand those reasons, I opened a GitHub issue which resulted in a short but productive discussion.
I walked away discouraged, and then decided to, reasons be damned, attack this problem from three different angles.
But first, the minimal required amount of background information on all this.
Background
Rust provides everyone with a tool that lets them generate HTML and JSON documentation for their crates, from doc comments (///, or //! for modules).
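As a minimal reminder of what those two styles look like (the module and function names here are made up for illustration):

```rust
pub mod math {
    //! Module-level docs: `//!` documents the enclosing item
    //! (here, the `math` module itself).

    /// Item-level docs: `///` documents the item that follows.
    ///
    /// Doc comments are Markdown, and fenced Rust blocks inside
    /// them double as doctests.
    pub fn double(x: u32) -> u32 {
        x * 2
    }
}
```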
Which is amazing. You can easily get offline documentation before hopping on a plane, and you can preview what your docs will look like before publication.
Once you’re done iterating on the documentation of your crate, which you should do because documentation is important, it’s time to publish your crate to crates.io.
This puts your crate in the build queue at docs.rs, or rather, one of the two build queues, the one for nice people and the one for naughty people:
If/when the build succeeds, you get thrown in the 7.75TiB bucket with the others, and you get a little corner of the internet to call yours, with a fancy navbar that connects you to the rest of the docs.rs-verse:
The bucket contains a bunch of HTML, CSS, and JavaScript that is completely immutable, unless you run another rustdoc build from scratch (which the docs.rs team does for the latest version of all crates, but not historical versions).
This kind of explains the first reason why it is hard to just make those things colored. There is no way in hell that we are rebuilding every version of every crate ever with the “I like colors” feature turned on. That’s simply not feasible.
Problems
And that’s just the first of many different problems.
First off, there are many different solutions to highlight code.
- Which one do you pick?
- Which languages do you include?
- Can you trust it to run and to provide quality output?
- Does it require dynamic linking?
- Does it build on all the target platforms that rustdoc supports?
- The HTML markup for syntax highlighted code is bigger than for non-syntax highlighted code
- By how much?
- Can we even afford that?
- Who’s gonna implement all this?
Well!
tree-sitter, 96 of them by popular vote, yes, no, yes, not much, probably, me.
Solutions
I have been using tree-sitter for as long as I have over-engineered my website, which is six years now.
As far as I’m concerned, it is the gold standard for syntax highlighting; only an LSP can beat it, but good luck convincing anyone to run one just to generate a bunch of documentation.
LSP meaning Language Server Protocol, which is the protocol that Rust Analyzer and your code editor speak. They are able to do semantic highlighting, but of course require loading all of your source code, all of its dependencies, and the entire sysroot, which takes a lot of time and memory.
Therefore, it is unsuitable for offline syntax highlighting. Well, I mean… don’t let me stop you. I’m a bear, not a cop.
However, even though there are crates for the tree-sitter core and for tree-sitter-highlight, the rest you kind of have to put together yourself.
First, you have to find a grammar for your language. If your language is Rust or C++, then you’re in a very good position because a high quality grammar that’s up to date is available right now on the tree-sitter-grammars GitHub org.
But if your tastes are a little more uncommon, then you might find yourself searching for the perfect grammar for quite some time, or even writing your own.
Or, finding one that looks like it might be okay but was actually written against a much older version of tree-sitter and needs to be cleaned up and regenerated, with some weird rules removed because they make the compilation time explode…
“Regenerating” in this context means taking the grammar.js (and possibly scanner.cc) from the grammar repository and running it through the tree-sitter CLI, which generates a mountain of C code for the actual parser.
You have to do that, of course, for every language you want to highlight:
I collected 18 different grammars before I started wondering if I couldn’t solve the problem for everyone once and for all, especially since I started having different projects that all needed to highlight something.
What those grammars and the automatically generated crate alongside them do is export a single symbol, which is a pointer to a struct that contains parsing tables along with function pointers to the scanner if there’s one, etc.
It is not ready to use by any stretch of the imagination.
Actually, I lied, and you can see it on that screenshot. If you’re lucky, they export other things too, like a highlights query and an injections query, which you need if you want to actually highlight the result of parsing code into a tree.
If you don’t have highlights queries, then you have a tree of nodes, but you don’t know which corresponds to what. You don’t know what’s a keyword, what’s a function, what’s a number, a string, anything that could have some sort of meaningful color.
You don’t know how to match your color theme to all the nodes that you have. That’s what the highlights query does. As for the injections queries, they let you know what other grammar is nested inside of yours.
For example, Svelte components typically are HTML and can embed scripts and styles. So you inject JavaScript and CSS in there, and sometimes TypeScript.
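To make that concrete, here is roughly what a few lines of each query look like. They’re written in tree-sitter’s S-expression query language; I’m showing them as Rust string constants, the way a crate might embed them. The patterns are illustrative simplifications, not arborium’s actual queries:

```rust
// A highlights query maps tree nodes to capture names like @keyword,
// which a theme then maps to colors.
const HIGHLIGHTS: &str = r#"
"fn" @keyword
(function_item name: (identifier) @function)
(string_literal) @string
"#;

// An injections query marks regions whose contents should be parsed
// with another grammar, e.g. scripts inside an HTML-ish document.
const INJECTIONS: &str = r#"
(script_element
  (raw_text) @injection.content
  (#set! injection.language "javascript"))
"#;
```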
There is a callback system in tree-sitter-highlight to handle injections, but having the right dependencies and implementing that callback are all up to you!
Unless you’re me and you’ve been dealing with that problem for 6 years and you have your own private stash of all the good grammars.
That changes today. I am happy to announce: arborium.
arborium
For the 96 languages that people requested, I went and searched for the best available grammar, vendored it, fixed it up, made sure the highlight queries worked, made sure the license and attribution are present in my redistribution, and integrated it into one of the cargo feature flags of the main arborium crate.
But it goes a little further. If you depend, for example, on Svelte, then it’s also going to bring the crates that are needed to highlight the Svelte component fully, namely HTML, CSS, and JavaScript.
Much like the original tree-sitter crates, they cannot do much by themselves; you’re meant to use them through the main arborium crate, which has a very simple interface for highlighting code:
```rust
use arborium::Highlighter;

let mut highlighter = Highlighter::new();
let html = highlighter.highlight_to_html("rust", "fn main() {}")?;
```
Granted, here we are kind of eschewing the subtlety of incremental parsing and highlighting that tree-sitter provides, but don’t worry, there are more complicated APIs right there if you need them.
Everything can be configured, from the theme (we ship a fair number built in) to the style of the HTML output. By default, we go for the modern, compact, and widely-supported:
```html
<a-k>keyword</a-k>
```
If you insist on being retro and pinky promise that Brotli compression makes up for it anyway, then you can use the long-winded alternative:
```html
<span class="code-keyword">keyword</span>
```
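For a rough sense of the per-token cost difference, here’s a little back-of-envelope comparison (the `token_sizes` helper is just for this example):

```rust
/// Byte size of a single highlighted token in each markup style:
/// the compact custom-element form vs the classic span form.
fn token_sizes() -> (usize, usize) {
    let compact = "<a-k>keyword</a-k>";
    let classic = r#"<span class="code-keyword">keyword</span>"#;
    (compact.len(), classic.len())
}
```

The compact form comes out at under half the size of the classic one, per token, before compression.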
If you’re more of a terminal kind of person, then you can have it output ANSI escapes instead. Even with an optional background color, some margin and padding, and a border, if you really want to make it stand out:
And perhaps most importantly, the Rust crates are set up in such a way that they can compile through cargo to the wasm32-unknown-unknown target.
This was the thing that tripped me up because it requires providing just enough libc symbols so that the grammars are happy.
crates/arborium-sysroot/wasm-sysroot › main 1 18via v17.0.0-clang › 18:10 🪴
› ls --tree
.
├── assert.h
├── ctype.h
├── endian.h
├── inttypes.h
(cut)
But Amos! Didn’t you just show a “WASM playground” that you got by running tree-sitter build --wasm then tree-sitter playground?

Yeah, that’s because those build for wasm32-wasi, which is slightly different.
At the end of the day, someone has to provide system functions, and in our case, it’s me.
Most of the functions provided are simple (isupper, islower, etc.), with the exception of malloc, free and friends, which in arborium’s case are provided by dlmalloc.
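As a sketch of what the simple shims look like (the function names follow libc, but this is illustrative, not arborium-sysroot’s actual source), exposing ctype-style helpers as C symbols, with `no_mangle` gated to the wasm target so a native build doesn’t collide with the real libc:

```rust
// Tiny ctype shims, exported as C symbols for the grammars' C code.
// `no_mangle` only applies on wasm32, where there is no libc to
// conflict with.
#[cfg_attr(target_arch = "wasm32", no_mangle)]
pub extern "C" fn isupper(c: i32) -> i32 {
    (c >= 'A' as i32 && c <= 'Z' as i32) as i32
}

#[cfg_attr(target_arch = "wasm32", no_mangle)]
pub extern "C" fn islower(c: i32) -> i32 {
    (c >= 'a' as i32 && c <= 'z' as i32) as i32
}
```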
Because all of those crates compile with a Rust toolchain (which invokes a C toolchain) to wasm32-unknown-unknown, we can run them in a browser. With a little glue!
Angle 1: just include this script
Right now, if you publish a crate and want the documentation to be highlighted for languages other than Rust, you can follow the instructions at arborium.bearcove.eu, to:
- Create an HTML file within your repository
- Add metadata to your Cargo.toml file so the docs.rs build process picks it up
You can see this in action on the arborium_docsrs_demo page, and its sources in the arborium repository.
I even went the little extra mile of detecting that you’re running on docs.rs and matching the theme that is currently active in a responsive way. So it’s gonna use docs.rs light, docs.rs dark, and the Ayu theme, depending on whatever the page does.
Those themes do not appeal to my personal aesthetic, but I decided that consistency mattered most here.
This solution is great because it works today.
It’s great because it means zero extra work for the Rust docs team. They don’t have to mess with Rustdoc, their build pipeline, or their infrastructure. It just works. It’s a wonderful escape hatch.
People have used it to integrate KaTeX (to render LaTeX equations), to render diagrams, and to do all sorts of things on the front end.
This solution is also the worst! Because it requires not just JavaScript but also WebAssembly, it forces people to download large grammar bundles (sometimes hundreds of kilobytes!) just to highlight small code blocks.
But most importantly, it’s a security disaster waiting to happen.
You should never let anyone inject third-party JavaScript into the main context of your page. Right now on docs.rs, there’s not much to steal except your favorite theme, but that might not always be the case. It’s just bad practice, and the team knows it—they want, or should want, to close that hole.
If you’re confused about why this is so bad, imagine everyone adopts Arborium as the main way of highlighting code on their docs.rs pages. A few years down the line, I decide to turn evil. All I have to do is publish a malicious version of the arborium package on NPM to reach millions of people instantly.
Contrary to popular belief and this stock photo I paid a monthly subscription for and I'm DAMN WELL gonna use, you don't need to wear a hoodie to do hacking.
You could, of course, have people pin to a specific version of the Arborium package, but that would also prevent them from getting important updates. Ideally, all the JavaScript distributed on docs.rs pages should come from the docs team, so that the world is only in danger if the docs teams themselves turn evil.
Therefore, in the long term, in a world where we have money and people and time to address this, we must consider two other angles.
Angle 2: it goes in the rustdoc hole
Arborium is just a bunch of Rust crates that contain a bunch of C code, both of which are extremely portable. There is nothing funky going on here: no dynamic linking, no plugin folder, no asynchronous loading or whatever. Just a bunch of grammars and the code you need to actually highlight things.
Therefore, I was able to make a PR against rustdoc to get it to highlight other languages:
At +537 -11, it’s a pretty small PR that, in reality, pulls in literal millions of lines of C code (parsers generated by tree-sitter).
This makes the question of “what grammars do we bundle?” all the more important—thankfully, I’m not going to be the one who solves it.
rust › rustdoc-arborium 3via v3.14.2 › 00:54 🪴
› ls -lhA build/aarch64-apple-darwin/stage2/bin/rustdoc
Permissions Size User Date Modified Name
.rwxr-xr-x 171M amos 14 Dec 00:52 build/aarch64-apple-darwin/stage2/bin/rustdoc
rust › main via v3.14.2 › 01:44 🪴
› ls -lhA build/aarch64-apple-darwin/stage2/bin/rustdoc
Permissions Size User Date Modified Name
.rwxr-xr-x 22M amos 14 Dec 01:44 build/aarch64-apple-darwin/stage2/bin/rustdoc
Top: a custom rustdoc with all 96 languages compiled in. Bottom: “main branch” rustdoc.
I fully anticipate that at some point in the discussion someone might look at those binary sizes and go: “yeesh, I don’t think we can do that”.
Consequently, I present to you: angle number three.
Angle 3: only in the backend
If it’s not feasible to afford everyone the luxury of highlighting hundreds of programming, markup, and configuration languages at home, then I will settle for doing the deed in the backend of docs.rs.
Enter: arborium-rustdoc.
It’s a post-processor specifically for rustdoc. It detects code blocks in HTML files and highlights them! It also patches the main CSS file to add its styles at the bottom.
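The core idea can be sketched in a few lines of std-only Rust. The `language-*` class convention and the `highlight` callback below are illustrative stand-ins, not arborium-rustdoc’s actual matching logic, which is more involved:

```rust
/// Minimal sketch of the post-processing idea: find code blocks in a
/// rendered HTML page and replace their contents with highlighted
/// markup produced by a highlighter callback.
fn rewrite_code_blocks(html: &str, highlight: &dyn Fn(&str, &str) -> String) -> String {
    const OPEN: &str = "<pre class=\"language-";
    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while let Some(i) = rest.find(OPEN) {
        out.push_str(&rest[..i]);
        let after = &rest[i + OPEN.len()..];
        // The language name runs up to the closing quote; the block's
        // contents run from the first `>` to the closing `</pre>`.
        let (Some(quote), Some(gt)) = (after.find('"'), after.find('>')) else {
            out.push_str(&rest[i..]);
            return out;
        };
        let lang = &after[..quote];
        let Some(end) = after[gt + 1..].find("</pre>") else {
            out.push_str(&rest[i..]);
            return out;
        };
        let code = &after[gt + 1..gt + 1 + end];
        out.push_str(&highlight(lang, code));
        rest = &after[gt + 1 + end + "</pre>".len()..];
    }
    out.push_str(rest);
    out
}
```

Everything outside recognized code blocks passes through untouched, which is what makes the size overhead so small.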
I tested it on all dependencies of the facet monorepo, and the size of the ~900MB doc folder went up by a whopping 24KB!
I really hope we can afford this. I’m even willing to personally chip in.
Post-mortem
The most challenging part of this whole project was probably the CI setup: when building a small package, GitHub Actions is bearable. When orchestrating 2×96 builds plus supporting packages, and publishing with provenance to two platforms, it really isn’t.
I’d like to thank Depot.dev for generously donating their beefy CI runners, without which I would’ve just bailed out of this project early.
Even then, I distributed plugin jobs into ten tree-themed groups:
Any CI failure is punishing, so I kept as much of the logic as possible out of YAML and inside a cargo-xtask instead. It’s actually very friendly!
But it’s not just progress bars and nerd font icons. It’s also making sure that every single artifact we produce can be loaded in a browser, by parsing the WebAssembly bundle and checking its imports via walrus (instead of summarily piping wasm-objdump -x into grep or whatever).
There’s a lot of build engineering going on here. I’m using blake3 hashes to avoid recomputing inputs (mostly because I think the name sounds cool). A dozen crazy things happened during those two weeks, and I barely remember half of it.
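The idea is plain content-addressed caching: hash a build step’s inputs and skip the step when the hash hasn’t changed. A self-contained sketch, with std’s DefaultHasher standing in for blake3 and a hypothetical `Cache` type:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Sketch of hash-based build caching: recompute an artifact only
/// when its input bytes change.
struct Cache {
    seen: HashMap<String, u64>,
}

impl Cache {
    fn new() -> Self {
        Cache { seen: HashMap::new() }
    }

    /// Returns true if `input` changed since the last call for `key`,
    /// i.e. the expensive build step needs to run again.
    fn needs_rebuild(&mut self, key: &str, input: &[u8]) -> bool {
        let mut h = DefaultHasher::new();
        input.hash(&mut h);
        let digest = h.finish();
        // `insert` returns the previous value; a differing (or absent)
        // previous digest means the inputs changed.
        self.seen.insert(key.to_string(), digest) != Some(digest)
    }
}
```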
Conclusion
I built arborium so it could last us for the next 20 years. I’m thrilled to donate it to the commons (it’s Apache-2.0 + MIT) and to, hopefully, see accurate syntax highlighting blossom on the web, just like we’ve seen code editors suddenly get better at it before.
I believe tree-sitter can change the world a second time. This time, for everyone who simply doesn’t have the time or know-how to put all the pieces together.
All the details are on the arborium website.
For docs.rs specifically, if I had to do it, realistically? I’d go with arborium-rustdoc as a post-processing step. It’s fast, you can build it with support for all languages, and it doesn’t have any of the security or bundle size implications of the other two solutions. You can even sandbox it!
Happy holidays!