The case for sans-io
The most popular option for decompressing ZIP files from the Rust programming language is a crate simply named zip — at the time of this writing, it has 48 million downloads. It’s fully-featured, supporting various compression methods, encryption, and even writing ZIP files.
However, that’s not the crate everyone uses to read ZIP files. Some applications benefit from using asynchronous I/O, especially if they decompress archives that they download from the network.
Such is the case, for example, of the uv Python package manager, written in Rust. uv doesn’t use the zip crate; it uses the async_zip crate, which is maintained by a single person and gets a lot less attention.
This situation is fairly common in Rust: the same code gets written against sync interfaces and async interfaces. This results in a split ecosystem, duplication of effort, and of course, more bugs overall.
Character encoding differences
And that’s a shame because there are a lot of things about dealing with the ZIP format that are completely non-trivial. It is an old crufty format with a lot of edge cases.
Even though there is an ISO standard for the zip format, and most of it is described in the freely available PKWARE APPNOTE, there are still a lot of surprises to be found when looking at zip files in the wild, like I did when I worked at itch.io.
The zip format predates the universal adoption of UTF-8. Don’t tell me Windows still uses UTF-16, I’m trying to ignore that right now. Plus they have a UTF-8 code page nowadays, so, shrug.
The zip format predates UTF-8, and that means the encoding of filenames in ZIP files used to be whatever code page your system happened to be set to.
Only in the year 2007 was the app note updated to document “extra field” values indicating that the file names and file comments are actually encoded with UTF-8.
This was probably fine when you passed zip files on floppy disks from one office to the next in the same country, but at itch.io we had a situation where a Japanese game developer used the built-in Windows ZIP creation tool from Explorer and had file names encoded as Shift-JIS, a successor of JIS X 0201, a single-byte Japanese Industrial Standard text encoding developed in 1969.
Most ZIP tools, however, treated that file as if it was encoded with code page 437, the character set of the original 1981 IBM Personal Computer, you know, where “PC” comes from? Which to be fair is a pretty good guess in the west if the UTF-8 bit flag isn’t set.
Because the format only tells us whether a filename is “UTF-8” or “not UTF-8”, the solution I came up with, so that the itch.io desktop app can install games from all over the world…
…is to take all textual content from the zip file — filenames, comments, etc. — and do statistical analysis, trying to figure out what the character set is based on the frequency of certain byte sequences, like these, for Shift-JIS:
var commonChars_sjis = []uint16{
0x8140, 0x8141, 0x8142, 0x8145, 0x815b, 0x8169, 0x816a, 0x8175, 0x8176, 0x82a0,
0x82a2, 0x82a4, 0x82a9, 0x82aa, 0x82ab, 0x82ad, 0x82af, 0x82b1, 0x82b3, 0x82b5,
0x82b7, 0x82bd, 0x82be, 0x82c1, 0x82c4, 0x82c5, 0x82c6, 0x82c8, 0x82c9, 0x82cc,
0x82cd, 0x82dc, 0x82e0, 0x82e7, 0x82e8, 0x82e9, 0x82ea, 0x82f0, 0x82f1, 0x8341,
0x8343, 0x834e, 0x834f, 0x8358, 0x835e, 0x8362, 0x8367, 0x8375, 0x8376, 0x8389,
0x838a, 0x838b, 0x838d, 0x8393, 0x8e96, 0x93fa, 0x95aa,
}
This gives us a list of probabilities, and then you just take the highest and… hope for the best!
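For flavor, here's a hypothetical sketch of that kind of scoring in Rust. The function name and the scoring formula are made up for this example; real detectors are far more sophisticated:
fn sjis_score(bytes: &[u8], common_chars_sjis: &[u16]) -> f64 {
    // slide over the bytes two at a time, counting pairs that appear
    // in the (sorted) table of common Shift-JIS sequences shown above
    if bytes.len() < 2 {
        return 0.0;
    }
    let hits = bytes
        .windows(2)
        .filter(|w| {
            let pair = u16::from_be_bytes([w[0], w[1]]);
            common_chars_sjis.binary_search(&pair).is_ok()
        })
        .count();
    hits as f64 / (bytes.len() - 1) as f64
}
You'd compute one such score per candidate encoding, and the highest one wins.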
I’m not aware of any other tool that bothers doing that — I think if I had to do it again, I would just require a standard archive format instead of trying to extract whatever stuff developers would shove in the file upload dialog.
Platform differences
But that’s not the only crufty part of the ZIP file format.
For example, it doesn’t really distinguish between files and directories. Directories simply have length 0, and their paths end with a forward slash.
~/zip-samples
❯ unzip -l wine-10.0-rc2.zip | head -8
Archive: wine-10.0-rc2.zip
Length Date Time Name
--------- ---------- ----- ----
0 12-13-2024 22:32 wine-10.0-rc2/
0 12-13-2024 22:32 wine-10.0-rc2/documentation/
8913 12-13-2024 22:32 wine-10.0-rc2/documentation/README-ru.md
5403 12-13-2024 22:32 wine-10.0-rc2/documentation/README-no.md
5611 12-13-2024 22:32 wine-10.0-rc2/documentation/README-fi.md
What about Windows?
Well, first off, did you know? All Windows APIs support using forward slashes as a path separator.
Microsoft has a very good article about file paths on Windows that I’m sure you can learn a lot from. I know I did.
And secondly, this is one of the things the app note is very clear on:
The path stored MUST NOT contain a drive or device letter, or a leading slash. All slashes MUST be forward slashes ‘/’ as opposed to backwards slashes ‘\’ for compatibility with Amiga and UNIX file systems etc.
PKWARE APPNOTE.TXT v6.3.10, section 4.4.17: file name
Of course, if the ZIP was actually created on Unix, then the entry would have a mode, and in the mode bits you can tell whether it’s a directory, a regular file, or a symbolic link.
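Checking those bits is a one-liner per file type; here's a sketch using the standard S_IFMT mask values from <sys/stat.h> (these are not ZIP-specific):
fn entry_kind(mode: u32) -> &'static str {
    // the top four bits of a Unix mode encode the file type
    match mode & 0o170000 {
        0o040000 => "directory",     // S_IFDIR
        0o100000 => "regular file",  // S_IFREG
        0o120000 => "symbolic link", // S_IFLNK
        _ => "something else",
    }
}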
In the wild I’ve noticed symbolic links tend to have their target as the contents of the entry, but of course that’s not what the APPNOTE says.
It says that in the Unix extra field, there is a variable size data field that can be used to store the target of a symbolic link or hard link.
Emphasis on “can”.
Because there were so many different tools that could create zip archives, and standardization only came later with the ISO standard (which mandates UTF-8 file names), the APPNOTE takes a descriptive rather than prescriptive approach.
It simply documents the various zip format implementations found in the wild, without making value judgments about the choices made by different software authors.
So if you want to support most zip files out there, you have to be able to read DOS-style timestamps and UNIX-style timestamps, which are completely different.
DOS timestamps, for example, are completely bonkers?
They fit in 32 bits, half for the time, half for the date, so far so good…
The day is a 5-bit integer, the month is a 4-bit integer, the year is a 7-bit integer counting from 1980 — and as for the time, it’s stored in two-second intervals! It’s… it’s fun.
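Here's a sketch of decoding one, following the layout just described; the 6-bit minutes and 5-bit hours fields fill in the remaining bits:
fn decode_dos_datetime(date: u16, time: u16) -> (u32, u16, u16, u16, u16, u16) {
    let day = date & 0b11111; // 5 bits
    let month = (date >> 5) & 0b1111; // 4 bits
    let year = 1980 + (date >> 9) as u32; // 7 bits, counting from 1980
    let seconds = (time & 0b11111) * 2; // 5 bits, in two-second intervals!
    let minutes = (time >> 5) & 0b111111; // 6 bits
    let hours = time >> 11; // 5 bits
    (year, month, day, hours, minutes, seconds)
}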
I think of it every time someone says that IEEE 754 is “weird” because doing 0.1 + 0.2 shows a lot of decimals after 0.3 or whatever.
The end of central directory record
But okay, fine. Those are details you can probably ignore for files that have been created with recent tools.
But even the most basic fundamental aspects of the zip file format are slightly cursed?
Most file formats start with a magic number and then a header including metadata, and then the actual body, the actual meat of the file, so: pixel data for an image, or vertex coordinates for a model, things like that.
fasterthanli.me/content/img on main [!?]
❯ hexyl logo-round-2.png | head
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 89 50 4e 47 0d 0a 1a 0a ┊ 00 00 00 0d 49 48 44 52 │×PNG__•_┊⋄⋄⋄_IHDR│
│00000010│ 00 00 01 00 00 00 01 00 ┊ 08 06 00 00 00 5c 72 a8 │⋄⋄•⋄⋄⋄•⋄┊••⋄⋄⋄\r×│
│00000020│ 66 00 00 2a b5 7a 54 58 ┊ 74 52 61 77 20 70 72 6f │f⋄⋄*×zTX┊tRaw pro│
│00000030│ 66 69 6c 65 20 74 79 70 ┊ 65 20 65 78 69 66 00 00 │file typ┊e exif⋄⋄│
│00000040│ 78 da a5 9c 6b 76 5d b7 ┊ 8e 84 ff 73 14 77 08 7c │x×××kv]×┊×××s•w•|│
│00000050│ 93 18 0e 9f 6b f5 0c 7a ┊ f8 fd 15 8f e4 eb 38 ce │ו•×k×_z┊×ו×××8×│
│00000060│ ed a4 db 89 25 59 3a da ┊ 9b 9b 00 0a 55 00 78 dc │××××%Y:×┊××⋄_U⋄x×│
│00000070│ f9 ef ff ba ee 5f ff fa ┊ 57 f0 3e 54 97 4b eb d5 │×××××_××┊W×>T×K××│
│00000080│ 6a f5 fc c9 96 2d 0e be ┊ e8 fe f3 67 bc 8f c1 e7 │j××××-•×┊×××g××××│
But not ZIP! The only correct way of reading a zip file is to start from the end of the file and walk back until you find the signature of the end of central directory record.
And that’s why if you take a look at the zip crate API, it requires the input to implement both Read and Seek, because even just to list the entries of the zip file, you need to be able to move around it.
impl<R: Read + Seek> ZipArchive<R> {
/// Read a ZIP archive, collecting the files it contains.
///
/// This uses the central directory record of the ZIP file, and ignores local file headers.
pub fn new(reader: R) -> ZipResult<ZipArchive<R>> {
// ✂️
}
}
Doing this properly is not as simple as it may sound!
Originally, the zip crate made 4-byte reads starting from almost the end of the file and then moved left by 1 byte every time it didn’t match the signature of the end of central directory record, which was hugely wasteful.
The async_zip crate, which was written later, improved on that by making reads of 2 KiB, and moving to the left by 2 KiB minus the size of the signature, to handle the case where the signature would straddle two reads, which is pretty smart! The comments mention a 500x speedup compared to the zip method.
The zip crate eventually caught up in May of 2024 by doing 512-byte reads, which temporarily made it much faster, until August of 2024, when they fixed a bug in the EOCD finding logic. A pretty fun one, actually.
Boundary confusion
Most file formats have some sort of framing mechanism. You read the file moving forward, and then you have records prefixed by their length.
MP4, or rather, MPEG-4 Part 14, calls those boxes. Media authoring software tends to write a lot of metadata that media players don’t necessarily know about, but anyone can skip over those boxes, even if they’re of a completely unknown type.
This property also makes it impossible to mistake data for the actual structure of the file. Each box has a type, and the type can be a valid UTF-8 byte sequence, but there is never any ambiguity as to whether you’re reading the type of a box or the name of the author of the media file.
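A sketch of that framing idea; this ignores real-world details like 64-bit “largesize” boxes, but it shows why skipping unknown boxes is trivial:
fn walk_boxes(data: &[u8]) {
    let mut pos = 0;
    while pos + 8 <= data.len() {
        // each box starts with a 4-byte big-endian size (which includes
        // the 8-byte header) followed by a 4-byte type
        let size = u32::from_be_bytes(data[pos..pos + 4].try_into().unwrap()) as usize;
        let kind = std::str::from_utf8(&data[pos + 4..pos + 8]).unwrap_or("????");
        println!("box `{kind}`, {size} bytes");
        if size < 8 {
            break; // sizes 0 and 1 have special meanings, out of scope here
        }
        pos += size; // skip the whole box, known type or not
    }
}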
However, in the ZIP format, because you’re scanning from the end of the file going backwards, it is possible to read part of a comment or file path, and have it accidentally match the signature bytes for the end of central directory record.
And that’s the bug that was fixed in the zip crate in August of 2024. Instead of stopping at the first thing that looks like an EOCD signature, they now keep scanning the entire file and keep track of all the offsets at which signature-like things were found.
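Here's a sketch of that kind of chunked backward scan. This is the shape of the algorithm, not the zip crate's actual code, and it's Unix-only because of the read_at extension trait:
use std::os::unix::fs::FileExt;

const EOCD_SIG: [u8; 4] = [0x50, 0x4b, 0x05, 0x06]; // "PK\x05\x06"
const CHUNK: u64 = 512;

fn find_eocd_candidates(file: &std::fs::File, size: u64) -> std::io::Result<Vec<u64>> {
    let mut candidates = Vec::new();
    let mut buf = [0u8; CHUNK as usize];
    let mut end = size;
    while end > 0 {
        let start = end.saturating_sub(CHUNK);
        let n = file.read_at(&mut buf[..(end - start) as usize], start)?;
        for (i, w) in buf[..n].windows(EOCD_SIG.len()).enumerate() {
            if w == &EOCD_SIG[..] {
                // signature-like bytes: only a candidate, which still
                // needs validating against the rest of the record
                candidates.push(start + i as u64);
            }
        }
        if start == 0 {
            break;
        }
        // overlap the next read by (signature length - 1) bytes so a
        // signature straddling two chunks isn't missed
        end = start + EOCD_SIG.len() as u64 - 1;
    }
    Ok(candidates)
}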
But of course, reading an entire multi-gigabyte file by increments of half a kilobyte, seeking backwards every time, is pretty much the worst possible read pattern that you can do on any kind of device? Any buffering done in userland or in the kernel is woefully unprepared for… that.
And I was going to give the example of a 4GB file, that would require 8 million syscalls just to find the EOCD, but then I stumbled upon this comment in the GitHub repository:
I tried this PR on a 200GB zip file (233899 files within) that I access over a networked share.
…and, it’s not like that person is doing anything wrong? But also, good lord.
If you’re confused about all the complexity in the linked code, remember that you can have garbage at the beginning of a zip file or at the end of the zip file, and most tools will still be able to decompress it.
For example, self-extracting zip files start with a native executable (note MZ):
~/Downloads
❯ file winzip76-downwz.exe
winzip76-downwz.exe: PE32 executable (GUI) Intel 80386, for MS Windows
~/Downloads
❯ hexyl --length 64 winzip76-downwz.exe
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 4d 5a 90 00 03 00 00 00 ┊ 04 00 00 00 ff ff 00 00 │MZ×⋄•⋄⋄⋄┊•⋄⋄⋄××⋄⋄│
│00000010│ b8 00 00 00 00 00 00 00 ┊ 40 00 00 00 00 00 00 00 │×⋄⋄⋄⋄⋄⋄⋄┊@⋄⋄⋄⋄⋄⋄⋄│
│00000020│ 00 00 00 00 00 00 00 00 ┊ 00 00 00 00 00 00 00 00 │⋄⋄⋄⋄⋄⋄⋄⋄┊⋄⋄⋄⋄⋄⋄⋄⋄│
│00000030│ 00 00 00 00 00 00 00 00 ┊ 00 00 00 00 20 01 00 00 │⋄⋄⋄⋄⋄⋄⋄⋄┊⋄⋄⋄⋄ •⋄⋄│
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘
…and the zip file is just tacked on at the end (note PK):
~/Downloads
❯ unzip -l winzip76-downwz.exe | head
Archive: winzip76-downwz.exe
warning [winzip76-downwz.exe]: 2785280 extra bytes at beginning or within zipfile
(attempting to process anyway)
Length Date Time Name
--------- ---------- ----- ----
2700 09-06-2024 18:34 common/css/common.css
21825 09-06-2024 18:34 common/css/jquery-ui.css
30945 09-06-2024 18:34 common/img/arrow.png
14982 09-06-2024 18:34 common/img/button-hover.png
14982 09-06-2024 18:34 common/img/button-normal.png
728365 09-06-2024 18:34 common/img/centerImg.png
17027 09-06-2024 18:34 common/img/close-hover.png
~/Downloads
❯ hexyl --skip 2785280 --length 64 winzip76-downwz.exe
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│002a8000│ 50 4b 03 04 14 00 00 00 ┊ 08 00 51 94 26 59 ad 3a │PK•••⋄⋄⋄┊•⋄Q×&Y×:│
│002a8010│ 80 57 d1 02 00 00 8c 0a ┊ 00 00 15 00 00 00 63 6f │×Wו⋄⋄×_┊⋄⋄•⋄⋄⋄co│
│002a8020│ 6d 6d 6f 6e 2f 63 73 73 ┊ 2f 63 6f 6d 6d 6f 6e 2e │mmon/css┊/common.│
│002a8030│ 63 73 73 b5 56 cb 6e db ┊ 30 10 bc 07 c8 3f 10 30 │css×V×n×┊0•ו×?•0│
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘
In December of 2024, as I was rewriting this piece, a PR landed, after 11 weeks of back and forth, that rewrites the EOCD detection algorithm again, fixing the huge performance regression introduced in August.
Is the async_zip crate impacted by any of the bugs that were fixed and then re-fixed in the zip crate? Probably! It was last released in April of 2024, so, who knows.
Not doing any I/O at all
I didn’t check, because I have my own zip crate, rc-zip, which I believe to be the best of the three — not just because it also does character set detection, but because, contrary to the zip crate or the async_zip crate, it is not tied to any particular style of I/O.
Also, it has a cool logo, by the exceedingly talented Misia:
The logo looks different in light mode and dark mode!
There is ample precedent for sans-io approaches, and I am very happy to credit Geoffroy Couprie of nom fame for encouraging me to take that approach five years ago when I started working on rc-zip.
There are examples of sans-io in the Rust ecosystem already: the rustls crate comes pretty close. Although it still somehow ties itself to the standard Read and Write traits, the consumer of the library is free to choose when to call read_tls and write_tls, which means it integrates seamlessly with a readiness-based library like mio.
The integration with tokio in tokio-rustls is a bit more awkward.
The sans-io pattern is even more common in the C ecosystem because, well, they have no standard I/O interface. You could have your APIs accept a file descriptor, but that would be fairly limiting.
The ZStandard decompression API, for example, looks like this:
// from the `zstd-sys` crate
pub unsafe extern "C" fn ZSTD_decompressStream(
zds: *mut ZSTD_DStream,
output: *mut ZSTD_outBuffer,
input: *mut ZSTD_inBuffer,
) -> usize
The input and output buffers are each simply a pointer, a size, and a position:
struct ZSTD_inBuffer {
pub src: *const c_void,
pub size: usize,
pub pos: usize,
}
Calling decompressStream updates the pos field on the input buffer and the output buffer, and lets the caller determine what happened based on the various values in those structs.
If the input’s position is less than its size, that means only part of the input was used during this call, and the rest should be passed again to the next call. This can happen if the decoder didn’t have enough space in the output buffer, for example!
If the output’s position is less than the output’s size, it means the decoder is done for now: it has flushed all of its internal buffers.
If the output’s position is equal to the output buffer’s size, however, that means you should call it again with more output buffer.
All these states are surprisingly tricky to get right: the decompressor might need more input, and you may have no more input to give it — that could easily result in an infinite loop! Instead, you should have a way to signal that you have no more input to feed it, and that it should error out if it thinks the input is truncated.
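To make those states concrete, here's a minimal sketch of a drive loop over the raw zstd-sys bindings. Error handling is trimmed down, and since we hand the decoder all of the input up front, “no more input to give” simply means the input buffer is exhausted:
use std::ffi::c_void;
use zstd_sys::*;

unsafe fn decompress_all(src: &[u8]) -> Result<Vec<u8>, &'static str> {
    let zds = ZSTD_createDStream();
    let mut input = ZSTD_inBuffer {
        src: src.as_ptr() as *const c_void,
        size: src.len(),
        pos: 0,
    };
    let mut chunk = vec![0u8; ZSTD_DStreamOutSize()];
    let mut result = Vec::new();
    loop {
        let mut output = ZSTD_outBuffer {
            dst: chunk.as_mut_ptr() as *mut c_void,
            size: chunk.len(),
            pos: 0,
        };
        let hint = ZSTD_decompressStream(zds, &mut output, &mut input);
        if ZSTD_isError(hint) != 0 {
            ZSTD_freeDStream(zds);
            return Err("decompression error");
        }
        result.extend_from_slice(&chunk[..output.pos]);
        if hint == 0 {
            break; // a frame was fully decoded and flushed: we're done
        }
        if input.pos == input.size && output.pos < output.size {
            // the decoder wants more input, and we have none left to give:
            // the stream is truncated, error out instead of looping forever
            ZSTD_freeDStream(zds);
            return Err("truncated input");
        }
    }
    ZSTD_freeDStream(zds);
    Ok(result)
}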
The structure of rc-zip
Well, rc-zip does the same thing, except things are a bit more complicated because… the first thing we have to do is scan backwards from the end of the file, and after that, we want to be able to extract individual entries from the ZIP file, in any order, skipping over some, going back… pretty far from a linear scan!
To achieve this, it exposes two state machines: ArchiveFsm is used to read the central directory, returning an Archive, and from there, you can build an EntryFsm to read individual entries — knowing their offset, compression method, etc.
Driving the ArchiveFsm to completion involves following a simple loop.
pub fn wants_read(&self) -> Option<u64>
First, we call wants_read — if the machine wants more data, it returns Some with the offset of where in the file it wants us to read. Most of the time, this follows the last read we did, but not always!
pub fn space(&mut self) -> &mut [u8]
If it did return Some, we call space, which borrows its internal buffer mutably. Unlike the C API, we’re not dealing with raw pointers: we get a slice back, which means we know the maximum amount of data we can put in there.
pub fn fill(&mut self, count: usize) -> usize
Once we’ve performed a read, we call fill, indicating how many bytes we read. As with the standard Read trait, a read of size 0 indicates end-of-file. (In the standard Read trait, a read of size 0 can also indicate that the passed buffer was of size zero, but this never happens with ArchiveFsm.)
Finally, once we’ve fed our machine, we can call the process method, and I’m fairly happy with the design here…
pub fn process(self) -> Result<FsmResult<Self, Archive>, Error>
…because it consumes the state machine! If it’s done, then it returns the Done variant of FsmResult, and we can never accidentally call another method on the state machine again. If it’s not done — if it wants more input and we should go around for another turn of the loop, then it returns the Continue variant, yielding back ownership of itself to the consumer.
/// Indicates whether or not the state machine has completed its work
pub enum FsmResult<M, R> {
/// The I/O loop needs to continue, the state machine is given back.
Continue(M),
/// The state machine is done, and the result is returned.
Done(R),
}
We could, of course, go deeper into type safety with typestates, but I’m fairly happy with the current design, which plugs fairly easily into both synchronous I/O, via rc-zip-sync, and asynchronous I/O, via rc-zip-tokio.
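Putting the four methods together, here's roughly what a synchronous drive loop looks like. This is a simplified sketch (the function name is mine, and it's Unix-only for the read_at convenience), not rc-zip-sync's actual code:
use std::{fs::File, os::unix::fs::FileExt};

use rc_zip::{
    error::Error,
    fsm::{ArchiveFsm, FsmResult},
    parse::Archive,
};

pub fn read_zip_blocking(file: &File) -> Result<Archive, Error> {
    let size = file.metadata()?.len();
    let mut fsm = ArchiveFsm::new(size);
    loop {
        if let Some(offset) = fsm.wants_read() {
            // with blocking I/O, we can read straight into the machine's buffer
            let n = file.read_at(fsm.space(), offset)?;
            fsm.fill(n);
        }
        fsm = match fsm.process()? {
            FsmResult::Done(archive) => return Ok(archive),
            FsmResult::Continue(fsm) => fsm,
        };
    }
}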
Bringing io_uring into it
Well, I say that — the rc-zip-tokio implementation is actually fairly messy, because asynchronous file I/O on Linux is a mess. You want to know how tokio does an asynchronous file read on Linux? With a background thread!
// tokio 1.42, `src/fs/file.rs`
impl AsyncRead for File {
fn poll_read(
self: Pin<&mut Self>,
cx: &mut Context<'_>,
dst: &mut ReadBuf<'_>,
) -> Poll<io::Result<()>> {
ready!(crate::trace::trace_leaf(cx));
let me = self.get_mut();
let inner = me.inner.get_mut();
loop {
match inner.state {
State::Idle(ref mut buf_cell) => {
let mut buf = buf_cell.take().unwrap();
if !buf.is_empty() {
buf.copy_to(dst);
*buf_cell = Some(buf);
return Poll::Ready(Ok(()));
}
buf.ensure_capacity_for(dst, me.max_buf_size);
let std = me.std.clone();
// here! 👇
inner.state = State::Busy(spawn_blocking(move || {
let res = buf.read_from(&mut &*std);
(Operation::Read(res), buf)
}));
}
State::Busy(ref mut rx) => {
// ✂️
}
}
}
}
}
I think of that every time someone blogs about how reading a file with tokio is slower than with the standard library. No shit! Look at all the work it’s doing!
This is only the case for files, by the way, not TCP sockets, which is where tokio actually shines.
Just reading one gibibyte from /dev/urandom with tokio and with libstd, we can see a difference in performance:
use std::io::Read;
use tokio::{fs::File, io::AsyncReadExt};
#[tokio::main]
async fn main() {
use std::time::Instant;
const SIZE: usize = 1024 * 1024 * 1024;
eprintln!("============= starting async");
let start_async = Instant::now();
let mut f = File::open("/dev/urandom").await.unwrap();
let mut buffer = vec![0; SIZE];
f.read_exact(&mut buffer[..]).await.unwrap();
let duration_async = start_async.elapsed();
eprintln!("============= done async");
eprintln!("============= starting sync");
let start_sync = Instant::now();
let mut f = std::fs::File::open("/dev/urandom").unwrap();
let mut buffer = vec![0; SIZE];
f.read_exact(&mut buffer[..]).unwrap();
let duration_sync = start_sync.elapsed();
eprintln!("============= done sync");
eprintln!("Async operation took: {:?}", duration_async);
eprintln!("Sync operation took: {:?}", duration_sync);
}
The sync operation is consistently faster on a Linux server of mine.
The actual numbers matter very little — what’s interesting is digging in with lurk, an strace-like tool written in Rust.
Did you know the strace logo is an ostrich? Now you do!
With lurk, we can observe that the async version is doing a lot of this:
[1000458] read(9, "\u0007×[Ã\toP©w«mÉOþþ«u\u00128Bz°©4Å©o\u000e-ñR`çâ8\bFu¦¼è¸$»æÔg!e¶ãçYëurw{fED-jø%r", 2097152) = 0x200000
[1000457] futex(0x7FFFF7CD6C20, 128, 1, 0x0, 93824993192448, 140737304358928) = 0
[1000458] futex(0x7FFFF7CD6C20, 129, 1, 0x1, 0, 140736615946240) = 1
[1000457] futex(0x7FFFF7CD5660, 129, 1, 0x7FFFF7AD4648, 93824993193232, 93824993192960) = 1
[1000458] futex(0x7FFFF7CD5660, 128, 1, 0x7FFFF7CD4B98, 140736615946240, 0) = 0
[1000458] read(9, "¿CÙ37ý¶äh÷ÉÏQ3$¡\u001bÂè\u0001zzCÍ\u0014ÌÄ\u001e@\f}éTö\u000bz¾è#<ÀvrJÌ_\u0015\u0013¤\u0004\\Çd\r\bØÿ.A\nð·
éWGã@¨Âǯ=,\fOò$S̺Ç<·\u0014x\rÏÆgPDʼÖ×\u0006FK\u0001H\u000eµXÐzf·IøgÊæ«Ueªd\u001b^).s¢ÑNwáaÝtq©\u0004F±^Vc¡ÎäQ\u001c\u0016ñ±\u001e~j\tBÿwácÊÉ,èa úòöæÔ
;Äp¯\u0019ߺL)\u0004§m[f,¨\u0002á#n\u0013 Þ\u0013¨òÞâ\u0006Èx<Z\u001diw\u0012\u0012î´¼ífÕ¿Y*ë\u0018Ûjéml.M\u0002ïô¨¿!Ô\bÆ$ \u001018X<þ¢\u0017¥X\tqçHl|N\u000fIj®\u000fäY¥vÙÐPêßJ*cÝ^é3\u0006ÆÝoΦdú±|é\u0010Y\rÀ¥í§~¯.Çugh·>obP=ó]Úà\u0019WÆF÷\u0016m;âið\u0011Ú\u0015´Fã¦\bMîç(¸*¹{^ùJ}¯ëMâ°Y(\f\bû-F+ãx2
\u0002»Ë}SÈlþ3`jLc\f:3·:t\u0001?\"^{\u0012\u0007\u001fô1ø¸ÄÂ÷ìÎ\"îuÉůXq\b;_\u0003\nQ\u001dâhG\"ê.\u0007øOùæ\u0006áôéEj.\"l;9oP}99©\u001f!<~2Ø\u0011¦.ÒÃER<E0Ê¿Ïaôú\u0013\u0006º,\u0011ùÙëÿÎ#\rû÷èÜð;dUK\u0019\u001d\u0001eOBï$R¡u¨óþtÚÍu1C3d£é»|$¡z pè&\u0007l\u0013ÍGçÜÔVë:2\"¥Dà", 2097152) = 0x200000
[1000457] futex(0x7FFFF7CD6C20, 128, 1, 0x0, 93824993192960, 140737304358928) = 0
[1000458] futex(0x7FFFF7CD6C20, 129, 1, 0x1, 0, 140736615946240) = 1
[1000457] futex(0x7FFFF7CD5660, 129, 1, 0x7FFFF7AD4648, 93824993193616, 93824993193344) = 1
[1000458] futex(0x7FFFF7CD5660, 128, 1, 0x7FFFF7CD4B98, 140736615946240, 0) = 0
[1000458] read(9, "©Âׯ^kd±2Þ\u0015õ³gó=Çø½29Ç\u0003{Ù&¶«â\u001c\u000fYT]wfx/ù¥°Á\u0017b\u0014ϤK7U\u0005m#þÒ\u001dÛ'J\fÓ\u0005^cãNÌ¢[i'4\u001fû\bûQD\b.Ýt¾*\u001b\u001cßóµÇD)Í\u0016uèÅù\t\ná὿(\róî¹\u0014\u001fƼÚ\u0010ÜËaÑ#M½).¬?XDÓ\u0018Æ/ËüSÉÏj{éF³Lßÿ²wò±Ì`£µ÷¬`QÚÕrÃÅXèË6\u001c÷I¸íGÊ!®Ò(\r¬#
\u001b.Ïx\u0010ãtÄ\râ¡.´ÿÅ×àV@ü\u0016,aÀÎ\"µp-NÇ+ôÝÐó \u0012dȨRÍã=\u001c!4Ej)ÝBQZ½ÓµÕÄBfÜÔqÛ\r\u001céB \u0001é-\u0014`\u001c²hÖ£äxÀè\r\u0019#¹ò8ù\u000e7\u000bƬbÔ9\u001bï\u0001¨?§U¨ù[g!P¶9;\nß.¢,)Bò\u0006#ò§Ïb*Um\u0016Zúpb)î³×\u000fHC¿\u0010\u000e", 2097152) = 0x200000
[1000457] futex(0x7FFFF7CD6C20, 128, 1, 0x0, 93824993193344, 140737304358928) = 0
[1000458] futex(0x7FFFF7CD6C20, 129, 1, 0x1, 0, 140736615946240) = 1
[1000457] futex(0x7FFFF7CD5660, 129, 1, 0x7FFFF7AD4648, 93824993194128, 93824993193856) = 1
[1000458] futex(0x7FFFF7CD5660, 128, 1, 0x7FFFF7CD4B98, 140736615946240, 0) = 0
[1000458] read(9, "çCÙÍ96´æ]è*7jtbäÿïÕTý5\u0004ö¾f\fYEW0«ÞOì\u0010\u000fô\u0012U¯á)ð=\"á
8bnÓÙþï^«ÀÀÕÆãÈ\u000em\u001d_Y\bÀ\u0004ô\r¾$:ó(»Ó
\u0017°Cá(.¥à×9ÈÛ\u0002ébª\u0002eüÛÕDÞFaøp#\u001fOJÛ'¢ÐÇØÃ÷±*9¥¥ÁC
2ý\u0006\u001fN", 2097152) = 0x200000
[1000457] futex(0x7FFFF7CD6C20, 128, 1, 0x0, 93824993193856, 140737304358928) = 0
[1000458] futex(0x7FFFF7CD6C20, 129, 1, 0x1, 0, 140736615946240) = 1
[1000457] futex(0x7FFFF7CD5660, 129, 1, 0x7FFFF7AD4648, 93824993194512, 93824993194240) = 1
[1000458] futex(0x7FFFF7CD5660, 128, 1, 0x7FFFF7CD4B98, 140736615946240, 0) = 0
It makes reads of 128 KiB from one thread, which then wakes up another thread, which queues some more work, and so on and so forth — doing that dance eight thousand times over the course of the program.
By comparison, the synchronous version simply does this:
[1000457] write(2, "============= starting sync\n====...", 28) = 28
[1000457] openat(4294967196, "/dev/urandom", 524288) = 10
[1000457] mmap(0x0, 1073745920, 3, 34, 4294967295, 0) = 0x7FFF1FFFE000
[1000457] read(10, "7¹5T\t4B{&ð_\u000fògÚ2\u0015¤(è6Và\\ʵzO\u000e]\u000bñ\u001cW¿GMxó\u0011¿ª°\u001b;zâÞÕjySdDiÉùTµ\u001f~\u0010ÙÄÜ8gë\u0012æ'_[Ìdòme¨º%Ä\u0012l³6?óÝbæ
Ƭ®Ñ,\u001f\u0014^\u0001Ç,ª\u000b\u0014\"²(çݯ\u0017ÖÄ÷T_¢\u0007", 1073741824) = 0x40000000
============= done sync
[1000457] write(2, "============= done sync\nAsync op...", 24) = 24
One majestic 1 GiB read syscall.
You might need to scroll that code block to notice that the read call returns 0x40000000.
But it’s not tokio’s fault, not really. There simply was no good way to do async file reads on Linux — until io_uring came around.
If we change that terrible test program to force it to do reads of at most 128 KiB, which is what tokio does anyway, and we add a tokio-uring variant, we see that it is consistently competitive with the sync version, and consistently faster than “classic” tokio by about 10%.
I’m not giving exact numbers because I’m frankly ashamed of my setup, and you could tune the numbers to make them say what you want — what I do want to show you is the read loop of the tokio-uring version:
[1047471] io_uring_enter(13, 0, 0, 0, 0x0, 128) = 0
[1047471] epoll_wait(9, 0x7FFFA0000CB0, 1024, 4294967295) = 1
[1047471] io_uring_enter(13, 1, 0, 0, 0x0, 128) = 1
[1047471] epoll_wait(9, 0x7FFFA0000CB0, 1024, 4294967295) = 1
[1047471] write(10, "\u0001", 8) = 8
[1047471] write(10, "\u0001", 8) = 8
[1047471] io_uring_enter(13, 0, 0, 0, 0x0, 128) = 0
[1047471] epoll_wait(9, 0x7FFFA0000CB0, 1024, 4294967295) = 1
[1047471] io_uring_enter(13, 1, 0, 0, 0x0, 128) = 1
[1047471] epoll_wait(9, 0x7FFFA0000CB0, 1024, 4294967295) = 1
[1047471] write(10, "\u0001", 8) = 8
[1047471] write(10, "\u0001", 8) = 8
[1047471] io_uring_enter(13, 0, 0, 0, 0x0, 128) = 0
[1047471] epoll_wait(9, 0x7FFFA0000CB0, 1024, 4294967295) = 1
[1047471] io_uring_enter(13, 1, 0, 0, 0x0, 128) = 1
[1047471] epoll_wait(9, 0x7FFFA0000CB0, 1024, 4294967295) = 1
In steady state, it calls io_uring_enter to submit the read operation, epoll_wait to wait for some operations to be completed, and write to… wake itself up, because that’s how tokio channels work!
Wanna see? Here’s (part of) a stacktrace:
Thread 22 "zipring" hit Catchpoint 1 (call to syscall write), 0x00007ffff7dd027f in write () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007ffff7dd027f in write () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00005555555bad90 in std::sys::pal::unix::fd::FileDesc::write () at std/src/sys/pal/unix/fd.rs:306
#2 std::sys::pal::unix::fs::File::write () at std/src/sys/pal/unix/fs.rs:1289
#3 std::fs::{impl#6}::write () at std/src/fs.rs:937
#4 0x000055555559aa54 in mio::sys::unix::waker::Waker::wake () at src/sys/unix/waker/eventfd.rs:53
#5 0x0000555555592015 in tokio::runtime::io::driver::Handle::unpark () at src/runtime/io/driver.rs:208
#6 tokio::runtime::driver::IoHandle::unpark () at src/runtime/driver.rs:198
#7 tokio::runtime::driver::Handle::unpark () at src/runtime/driver.rs:90
#8 0x00005555555994ef in tokio::runtime::scheduler::current_thread::{impl#7}::wake_by_ref () at src/runtime/scheduler/current_thread/mod.rs:700
#9 tokio::runtime::scheduler::current_thread::{impl#7}::wake () at src/runtime/scheduler/current_thread/mod.rs:694
#10 tokio::util::wake::wake_arc_raw<tokio::runtime::scheduler::current_thread::Handle> () at src/util/wake.rs:60
#11 0x0000555555572c16 in core::task::wake::Waker::wake () at /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/task/wake.rs:459
#12 tokio_uring::runtime::driver::op::Lifecycle::complete () at src/runtime/driver/op/mod.rs:283
#13 0x0000555555570d9f in tokio_uring::runtime::driver::Ops::complete () at src/runtime/driver/mod.rs:491
#14 tokio_uring::runtime::driver::Driver::dispatch_completions () at src/runtime/driver/mod.rs:92
#15 0x0000555555575826 in tokio_uring::runtime::driver::handle::Handle::dispatch_completions () at src/runtime/driver/handle.rs:45
#16 tokio_uring::runtime::drive_uring_wakes::{async_fn#0} () at src/runtime/mod.rs:165
#17 tokio::runtime::task::core::{impl#6}::poll::{closure#0}<tokio_uring::runtime::drive_uring_wakes::{async_fn_env#0}, alloc::sync::Arc<tokio::task::local::Shared, alloc::alloc::Global>> () at /home/amos/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.42.0/src/runtime/task/core.rs:331
#18 tokio::loom::std::unsafe_cell::UnsafeCell::with_mut<tokio::runtime::task::core::Stage<tokio_uring::runtime::drive_uring_wakes::{async_fn_env#0}>, core::task::poll::Poll<()>, tokio::runtime::task::core::{impl#6}::poll::{closure_env#0}<tokio_uring::runtime::drive_uring_wakes::{async_fn_env#0}, alloc::sync::Arc<tokio::task::local::Shared, alloc::alloc::Global>>> () at /home/amos/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.42.0/src/loom/std/unsafe_cell.rs:16
✂️
When submitting ops (that’s what “asynchronous syscalls” are called in io_uring parlance), tokio-uring keeps a waker around, as we can see in their Lifecycle enum:
#[allow(dead_code)]
pub(crate) enum Lifecycle {
/// The operation has been submitted to uring and is currently in-flight
Submitted,
/// The submitter is waiting for the completion of the operation
Waiting(Waker),
/// The submitter no longer has interest in the operation result. The state
/// must be passed to the driver and held until the operation completes.
Ignored(Box<dyn std::any::Any>),
/// The operation has completed with a single cqe result
Completed(cqueue::Entry),
    /// One or more completion results have been received
/// This holds the indices uniquely identifying the list within the slab
CompletionList(SlabListIndices),
}
That Waker really is just a boxed trait object in disguise:
pub struct Waker {
waker: RawWaker,
}
pub struct RawWaker {
data: *const (),
vtable: &'static RawWakerVTable,
}
…with a vtable that contains clone, wake, wake_by_ref, and drop functions:
pub struct RawWakerVTable {
clone: unsafe fn(*const ()) -> RawWaker,
wake: unsafe fn(*const ()),
wake_by_ref: unsafe fn(*const ()),
drop: unsafe fn(*const ()),
}
And, well, what tokio actually does when you call wake_by_ref is up to the mio crate, which, on Linux, uses eventfd — an API that allows applications to create file descriptors just for the purpose of signaling events! Cheaper than a pipe, and it can be multiplexed via epoll, just like any other file descriptor: regular files, network sockets, etc.
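If you've never played with eventfd, the whole mechanism fits in a few lines. A sketch using the libc crate directly: writes increment a 64-bit counter, reads consume it:
use libc::c_void;

fn main() {
    // create an eventfd with an initial counter of zero
    let fd = unsafe { libc::eventfd(0, 0) };
    assert!(fd >= 0);

    // "wake": add 1 to the counter (the value is written as 8 bytes)
    let one: u64 = 1;
    let n = unsafe { libc::write(fd, &one as *const u64 as *const c_void, 8) };
    assert_eq!(n, 8);

    // the woken-up side: read returns the counter value and resets it
    let mut val: u64 = 0;
    let n = unsafe { libc::read(fd, &mut val as *mut u64 as *mut c_void, 8) };
    assert_eq!(n, 8);
    println!("counter was {val}");

    unsafe { libc::close(fd) };
}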
This kind of overhead, of mixing epoll and io_uring, is why some folks chose to make their own runtime, entirely separate from tokio. Datadog folks made glommio, Bytedance folks made monoio, vertexclique made nuclei, there is no shortage of interesting work!
Adding a monoio variant to our test program shows that the hot loop becomes just io_uring_enter:
[1142572] io_uring_enter(9, 1, 1, 1, 0x0, 128) = 1
[1142572] io_uring_enter(9, 1, 1, 1, 0x0, 128) = 1
[1142572] io_uring_enter(9, 1, 1, 1, 0x0, 128) = 1
[1142572] io_uring_enter(9, 1, 1, 1, 0x0, 128) = 1
[1142572] io_uring_enter(9, 1, 1, 1, 0x0, 128) = 1
[1142572] io_uring_enter(9, 1, 1, 1, 0x0, 128) = 1
[1142572] io_uring_enter(9, 1, 1, 1, 0x0, 128) = 1
✂️
It is, however, important to note that this isn’t actually a benchmark. Actual benchmarks barely indicate anything about the performance of real-world systems, and this test program didn’t even attempt to indicate anything. We were just poking at various systems to see how they worked.
Plugging rc-zip into monoio
All that said, monoio looks promising, so, to cap it all off, I think we should make an rc-zip-monoio package — just because we can!
We’ll keep it simple and try to implement a single async function taking a reference to a file, and returning an Archive or an error.
pub async fn read_zip_from_file(file: &File) -> Result<Archive, Error> {
// TODO: the rest of the owl
}
The file type here is from monoio, and so it comes with a native read_at method. But it has a signature that departs from the usual tokio stuff:
pub async fn read_at<T: IoBufMut>(
&self,
buf: T,
pos: u64,
) -> BufResult<usize, T>
pub type BufResult<T, B> = (Result<T>, B);
It takes ownership of the buffer and returns it, even if the operation failed.
This is a requirement for a memory-safe io_uring interface in Rust: it prevents the buffer from being freed before the operation completes or is cancelled, it’s like we’re giving ownership of the buffer to the kernel.
There was an excellent P99 conf talk about that recently by… oh, look, it’s me! And Sherlock. Awww.
That API makes the structure of our code a little peculiar.
First off, our buffer is not a Vec<u8> — we don’t need to track capacity and length separately, and we don’t need it to grow. So, we simply have a boxed slice of u8 instead, of 256 kibibytes, fully initialized (MaybeUninit is out of scope for today):
let mut buf = vec![0u8; 256 * 1024].into_boxed_slice();
After finding out the size of the file, we create the state machine, and enter the loop:
let meta = file.metadata().await?;
let size = meta.len();
let mut fsm = ArchiveFsm::new(size);
loop {
// rest of the code goes here
}
In the loop, if the machine wants a read…
if let Some(offset) = fsm.wants_read() {
// rest of the code goes here
}
…then the first thing we do is calculate how big of a read we can make.
We don’t want to read more than what the machine has room for, but we also can’t use the machine’s buffer, due to the current rc-zip API: it only lends us its buffer mutably, it doesn’t give us ownership of it, so we can’t transfer that ownership to the kernel.
We will need to read into our own buffer, and then copy it into the machine’s buffer.
Changing the rc-zip API to address this would be relatively easy, but it would also be a breaking change. So I’m not doing it today, but it’s in the cards for the future.
The maximum read size is the minimum between the size of our buffer and the size of the machine’s buffer:
let dst = fsm.space();
let max_read = dst.len().min(buf.len());
Once we’ve established that, we can obtain a SliceMut<Box<[u8]>>, a type provided by monoio (tokio-uring has a similar thing): it’s like a slice, but owned! It’ll make sure we don’t read too much data.
let slice = IoBufMut::slice_mut(buf, 0..max_read);
I chose to make the call fully-qualified (instead of buf.slice_mut(0..max_read)) to make it really obvious where that function comes from — monoio’s IoBufMut trait.
And then, we have a native, actual read_at method on the file:
let (res, slice) = file.read_at(slice, offset).await;
And as promised, we get the buffer back whether or not the operation was successful. So, first we propagate errors, and then we copy however many bytes we read into the machine’s buffer, letting it know how much that was with its fill method:
let n = res?;
(dst[..n]).copy_from_slice(&slice[..n]);
fsm.fill(n);
…and finally, we can take back ownership of our buffer, which is stashed inside the SliceMut we got back from read_at:
buf = slice.into_inner();
And this explains why buf is a mutable binding! We were able to move out of it during a loop iteration, on the condition that we put it back. If we didn’t, the Rust compiler would gently but firmly refuse to proceed:
error[E0382]: borrow of moved value: `buf`
--> rc-zip-monoio/src/lib.rs:35:42
|
27 | let mut buf = vec![0u8; 256 * 1024].into_boxed_slice();
| ------- move occurs because `buf` has type `Box<[u8]>`, which does not implement the `Copy` trait
...
30 | loop {
| ---- inside of this loop
...
35 | let max_read = dst.len().min(buf.len());
| ^^^ value borrowed here after move
...
41 | let slice = IoBufMut::slice_mut(buf, 0..max_read);
| ------------------------------------- `buf` moved due to this method call, in previous iteration of loop
|
note: `slice_mut` takes ownership of the receiver `self`, which moves `buf`
--> /Users/amos/.cargo/registry/src/index.crates.io-6f17d22bba15001f/monoio-0.2.4/src/buf/io_buf.rs:256:22
|
256 | fn slice_mut(mut self, range: impl ops::RangeBounds<usize>) -> SliceMut<Self>
| ^^^^
help: you can `clone` the value and consume it, but this might not be your desired behavior
|
41 | let slice = IoBufMut::slice_mut(buf.clone(), 0..max_read);
| ++++++++
After that, we can call process on the state machine and either break out of the loop or keep going:
fsm = match fsm.process()? {
FsmResult::Done(archive) => {
break Ok(archive);
}
FsmResult::Continue(fsm) => {
fsm
}
}
And that’s it! Here’s the complete listing:
use monoio::{buf::IoBufMut, fs::File};
use rc_zip::{
error::Error,
fsm::{ArchiveFsm, FsmResult},
parse::Archive,
};
pub async fn read_zip_from_file(file: &File) -> Result<Archive, Error> {
let meta = file.metadata().await?;
let size = meta.len();
let mut buf = vec![0u8; 256 * 1024].into_boxed_slice();
let mut fsm = ArchiveFsm::new(size);
loop {
if let Some(offset) = fsm.wants_read() {
let dst = fsm.space();
let max_read = dst.len().min(buf.len());
let slice = IoBufMut::slice_mut(buf, 0..max_read);
let (res, slice) = file.read_at(slice, offset).await;
let n = res?;
(dst[..n]).copy_from_slice(&slice[..n]);
fsm.fill(n);
buf = slice.into_inner();
}
fsm = match fsm.process()? {
FsmResult::Done(archive) => {
break Ok(archive);
}
FsmResult::Continue(fsm) => fsm,
}
}
}
And a program that uses it:
use monoio::fs::File;
use rc_zip_monoio::read_zip_from_file;
#[cfg(not(target_os = "linux"))]
type DefaultDriver = monoio::LegacyDriver;
#[cfg(target_os = "linux")]
type DefaultDriver = monoio::IoUringDriver;
fn main() {
monoio::start::<DefaultDriver, _>(async_main())
}
async fn async_main() {
let zip_path = [
std::env::var("HOME").unwrap().as_str(),
"zip-samples/wine-10.0-rc2.zip",
]
.join("/");
let file = File::open(&zip_path).await.unwrap();
let archive = read_zip_from_file(&file).await.unwrap();
for (i, e) in archive.entries().enumerate() {
println!("- {}", e.sanitized_name().unwrap_or_default());
if i > 10 {
break;
}
}
}
This program runs on macOS, my main machine, using monoio’s legacy driver, and also on Linux, using the io-uring driver!
We can see that, from the io_uring_setup call to the printing of the file listing, there is not a single read or write syscall — it’s all happening as io_uring ops:
amos in 🌐 brat in monozip on main via 🦀 v1.83.0
❯ lurk -f ./target/release/monozip
[2705391] execve("", "", "") = 0
✂️
[2705391] io_uring_setup(1024, 0x7FFFFFFFCE50) = 3
[2705391] mmap(0x0, 65536, 3, 32769, 3, 268435456) = 0x7FFFF7DA4000
[2705391] mmap(0x0, 37184, 3, 32769, 3, 0) = 0x7FFFF7D9A000
[2705391] io_uring_enter(3, 1, 1, 1, 0x0, 128) = 1
[2705391] io_uring_enter(3, 1, 1, 1, 0x0, 128) = 1
[2705391] mmap(0x0, 266240, 3, 34, 4294967295, 0) = 0x7FFFF7D59000
[2705391] mmap(0x0, 266240, 3, 34, 4294967295, 0) = 0x7FFFF7D18000
[2705391] io_uring_enter(3, 1, 1, 1, 0x0, 128) = 1
[2705391] io_uring_enter(3, 1, 1, 1, 0x0, 128) = 1
[2705391] io_uring_enter(3, 1, 1, 1, 0x0, 128) = 1
[2705391] brk(0x55555565B000) = 0x55555565B000
[2705391] mmap(0x0, 233472, 3, 34, 4294967295, 0) = 0x7FFFF7CDF000
[2705391] mremap(0x7FFFF7CDF000, 233472, 462848, 1, 0x0) = 0x7FFFF7C6E000
[2705391] io_uring_enter(3, 1, 1, 1, 0x0, 128) = 1
[2705391] brk(0x55555567C000) = 0x55555567C000
[2705391] mremap(0x7FFFF7C6E000, 462848, 921600, 1, 0x0) = 0x7FFFF7B8D000
[2705391] brk(0x55555569D000) = 0x55555569D000
[2705391] io_uring_enter(3, 1, 1, 1, 0x0, 128) = 1
[2705391] brk(0x5555556BE000) = 0x5555556BE000
[2705391] io_uring_enter(3, 1, 1, 1, 0x0, 128) = 1
[2705391] brk(0x5555556DF000) = 0x5555556DF000
[2705391] mremap(0x7FFFF7B8D000, 921600, 1839104, 1, 0x0) = 0x7FFFF79CC000
[2705391] brk(0x555555700000) = 0x555555700000
[2705391] io_uring_enter(3, 1, 1, 1, 0x0, 128) = 1
[2705391] brk(0x555555721000) = 0x555555721000
[2705391] brk(0x555555743000) = 0x555555743000
[2705391] mmap(0x0, 151552, 3, 34, 4294967295, 0) = 0x7FFFF7CF3000
[2705391] mremap(0x7FFFF7CF3000, 151552, 299008, 1, 0x0) = 0x7FFFF7CAA000
[2705391] mremap(0x7FFFF7CAA000, 299008, 593920, 1, 0x0) = 0x7FFFF7C19000
[2705391] brk(0x555555764000) = 0x555555764000
[2705391] mremap(0x7FFFF7C19000, 593920, 1183744, 1, 0x0) = 0x7FFFF78AB000
[2705391] brk(0x555555785000) = 0x555555785000
[2705391] brk(0x5555557A6000) = 0x5555557A6000
[2705391] mremap(0x7FFFF78AB000, 1183744, 2363392, 1, 0x0) = 0x7FFFF766A000
[2705391] brk(0x5555557C7000) = 0x5555557C7000
[2705391] munmap(0x7FFFF79CC000, 1839104) = 0
[2705391] munmap(0x7FFFF7D18000, 266240) = 0
[2705391] munmap(0x7FFFF7D59000, 266240) = 0
[2705391] write(1, "- wine-10.0-rc2/\nxp 00000000 00:...", 17) = 17
[2705391] write(1, "- wine-10.0-rc2/documentation/\n:...", 31) = 31
[2705391] write(1, "- wine-10.0-rc2/documentation/RE...", 43) = 43
[2705391] write(1, "- wine-10.0-rc2/documentation/RE...", 43) = 43
[2705391] write(1, "- wine-10.0-rc2/documentation/RE...", 43) = 43
[2705391] write(1, "- wine-10.0-rc2/documentation/RE...", 43) = 43
[2705391] write(1, "- wine-10.0-rc2/documentation/RE...", 43) = 43
[2705391] write(1, "- wine-10.0-rc2/documentation/RE...", 43) = 43
[2705391] write(1, "- wine-10.0-rc2/documentation/RE...", 46) = 46
[2705391] write(1, "- wine-10.0-rc2/documentation/RE...", 43) = 43
[2705391] write(1, "- wine-10.0-rc2/documentation/RE...", 46) = 46
[2705391] write(1, "- wine-10.0-rc2/documentation/RE...", 43) = 43
[2705391] munmap(0x7FFFF766A000, 2363392) = 0
[2705391] io_uring_enter(3, 2, 0, 0, 0x0, 128) = 2
[2705391] munmap(0x7FFFF7D9A000, 37184) = 0
[2705391] munmap(0x7FFFF7DA4000, 65536) = 0
[2705391] close(3) = 0
[2705391] sigaltstack(0x7FFFFFFFDD80, 0x0) = 0
[2705391] munmap(0x7FFFF7FC0000, 12288) = 0
[2705391] exit_group(0) = ?
The only syscalls we do see are brk and mmap-related things, which are definitely related to heap allocation.
We talk about brk and heap allocation in the Making our own executable packer series.
The implementation of the other state machine, EntryFsm, is left as an exercise to the reader; you can see my draft pull request on the rc-zip repository itself — it’s simpler in a way, since the reads are linear, and also more complicated, because it actually streams data out as the file is decompressed.
But, you only need to implement it once, and then you get support for all the compression methods supported by rc-zip, including deflate, bzip2, LZMA, and ZStandard!
Closing words
Although there are other avenues being explored to avoid that sync/async chasm, like keyword generics, I believe the way forward is to simply implement formats, protocols, etc. in a sans-io way.
I think unifying libstd and tokio is the wrong approach, because neither interface is compatible with modern I/O APIs like io_uring.
I say that knowing full well that my HTTP implementation, loona, is actually tied to a specific I/O model, but, I was trying to solve one problem at a time, and still learning about the inner workings of HTTP.
Now that I have the benefit of hindsight, I think it might be fun to rewrite loona as completely sans-io, and then it would be usable in all contexts: high-performance proxies with something like monoio, web applications with “classic” tokio, and maybe simpler CLI tools that don’t need or want async with a synchronous interface!
I also want to change the rc-zip interface to avoid that copy between the I/O buffers and the decoding buffers — making an API “uring-friendly” involves rethinking a lot of things.
And it’s fun to see that other ecosystems that don’t have any standard I/O abstraction, like C, or ecosystems with a much higher level of abstraction, like Node.js, have been faster at adopting io_uring than something like Rust, where a lot of code was written against a different, less flexible model.
See? I can say bad things about Rust! I’m not a shill.