The case for sans-io
The most popular option to decompress ZIP files from the Rust programming language is a crate simply named zip. At the time of this writing, it has 48 million downloads. It’s fully featured: it supports various compression methods and encryption, and can even write ZIP files.
However, that’s not the crate everyone uses to read ZIP files. Some applications benefit from using asynchronous I/O, especially if they decompress archives that they download from the network.
Such is the case, for example, of uv, the Python package manager written in Rust. uv doesn’t use the zip crate; it uses the async_zip crate, which is maintained by a single person and gets a lot less attention.
This situation is fairly common in Rust: the same code gets written against sync interfaces and async interfaces. This results in a split ecosystem, duplication of effort, and of course, more bugs overall.
Character encoding differences
And that’s a shame because there are a lot of things about dealing with the ZIP format that are completely non-trivial. It is an old crufty format with a lot of edge cases.
Even though there is an ISO standard for the zip format and most of it is described in the freely available PKWARE APPNOTE, there are still a lot of surprises to be found when looking at zip files in the wild, like I did when I worked at itch.io.
The zip format predates the universal adoption of UTF-8. Don’t tell me Windows still uses UTF-16, I’m trying to ignore that right now. Plus they have a UTF-8 code page nowadays, so, shrug.
The zip format predates UTF-8, and that means the encoding of filenames in ZIP files used to be whatever code page your system happened to be set to.
Only in 2007 was the app note updated to document “extra field” values that indicate that the file names and file comments are actually encoded with UTF-8.
This was probably fine when you passed zip files on floppy disks from one office to the next in the same country, but at itch.io we had a situation where a Japanese game developer used the built-in Windows ZIP creation tool from Explorer and had file names encoded as Shift-JIS, a successor of JIS X 0201, a single-byte Japanese Industrial Standard text encoding developed in 1969.
Most ZIP tools, however, treated that file as if it was encoded with code page 437, the character set of the original 1981 IBM Personal Computer, you know, where “PC” comes from? Which to be fair is a pretty good guess in the west if the UTF-8 bit flag isn’t set.
Because the format only tells us whether a filename is “UTF-8” or “not UTF-8”, the solution I came up with, so that the itch.io desktop app can install games from all over the world…
…is to take all textual content from the zip file — filenames, comments, etc. — and do statistical analysis, trying to figure out what the character set is based on the frequency of certain byte sequences, like these, for Shift-JIS:
var commonChars_sjis = []uint16{
0x8140, 0x8141, 0x8142, 0x8145, 0x815b, 0x8169, 0x816a, 0x8175, 0x8176, 0x82a0,
0x82a2, 0x82a4, 0x82a9, 0x82aa, 0x82ab, 0x82ad, 0x82af, 0x82b1, 0x82b3, 0x82b5,
0x82b7, 0x82bd, 0x82be, 0x82c1, 0x82c4, 0x82c5, 0x82c6, 0x82c8, 0x82c9, 0x82cc,
0x82cd, 0x82dc, 0x82e0, 0x82e7, 0x82e8, 0x82e9, 0x82ea, 0x82f0, 0x82f1, 0x8341,
0x8343, 0x834e, 0x834f, 0x8358, 0x835e, 0x8362, 0x8367, 0x8375, 0x8376, 0x8389,
0x838a, 0x838b, 0x838d, 0x8393, 0x8e96, 0x93fa, 0x95aa,
}
This gives us a list of probabilities, and then you just take the highest and... hope for the best!
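To make the idea concrete, here is a minimal Rust sketch of that kind of scoring. It is not the code the itch.io app runs: the function name `sjis_score`, the shortened table, and the "fraction of hits" metric are all assumptions made for illustration. It walks the bytes, pairs up anything that looks like a Shift-JIS lead byte with the byte that follows, and reports how many of those pairs show up in the table of common sequences.

```rust
/// A few entries from the table above, shortened for the example.
const COMMON_SJIS: &[u16] = &[
    0x8140, 0x8141, 0x8142, 0x82a0, 0x82a2, 0x82cc, 0x8341, 0x8393,
];

/// Hypothetical helper: fraction of double-byte sequences in `bytes`
/// that appear in `COMMON_SJIS`. Returns 0.0 when there are none.
fn sjis_score(bytes: &[u8]) -> f64 {
    let mut pairs = 0usize;
    let mut hits = 0usize;
    let mut i = 0;
    while i < bytes.len() {
        let lead = bytes[i];
        // Shift-JIS lead bytes live in 0x81..=0x9f and 0xe0..=0xfc
        // (the upper range includes vendor extensions).
        let is_lead = matches!(lead, 0x81..=0x9f | 0xe0..=0xfc);
        if is_lead && i + 1 < bytes.len() {
            let pair = u16::from_be_bytes([lead, bytes[i + 1]]);
            pairs += 1;
            if COMMON_SJIS.contains(&pair) {
                hits += 1;
            }
            i += 2;
        } else {
            i += 1;
        }
    }
    if pairs == 0 {
        0.0
    } else {
        hits as f64 / pairs as f64
    }
}
```

A real detector would compute a score like this for every candidate encoding and compare them, which is where the "take the highest" part comes in.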
I’m not aware of any other tool that bothers doing that — I think if I had to do it again, I would just require a standard archive format instead of trying to extract whatever stuff developers would shove in the file upload dialog.
Platform differences
But that’s not the only crufty part of the ZIP file format.
For example, it doesn’t really distinguish between files and directories. Directories simply have length 0 and their paths end with a forward slash.
~/zip-samples
❯ unzip -l wine-10.0-rc2.zip | head -8
Archive: wine-10.0-rc2.zip
Length Date Time Name
--------- ---------- ----- ----
0 12-13-2024 22:32 wine-10.0-rc2/
0 12-13-2024 22:32 wine-10.0-rc2/documentation/
8913 12-13-2024 22:32 wine-10.0-rc2/documentation/README-ru.md
5403 12-13-2024 22:32 wine-10.0-rc2/documentation/README-no.md
5611 12-13-2024 22:32 wine-10.0-rc2/documentation/README-fi.md
What about Windows?
Well, first off, did you know: Windows file APIs accept forward slashes as a path separator (the main exception being paths that use the \\?\ prefix).
Microsoft has a very good article about file paths on Windows that I’m sure you can learn a lot from. I know I did.
And secondly, this is one of the things the app note is very clear on:
The path stored MUST NOT contain a drive or device letter, or a leading slash. All slashes MUST be forward slashes ‘/’ as opposed to backwards slashes ‘\’ for compatibility with Amiga and UNIX file systems etc.
PKWARE APPNOTE.TXT v6.3.10, section 4.4.17: file name
Of course, if the ZIP was actually created on Unix, then the entry would have a mode, and in the mode bits you can tell whether it’s a directory, a regular file, or a symbolic link.
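As a rough sketch of how a reader might combine both hints: by widespread convention (Info-ZIP and friends), archives made on Unix store the file mode in the upper 16 bits of the "external file attributes" field, and the upper byte of "version made by" is 3 for Unix. The `EntryKind` enum and `entry_kind` function below are hypothetical names, and the fallback to the trailing-slash rule is an assumption about what a sensible reader would do, not a description of any particular crate.

```rust
#[derive(Debug, PartialEq)]
enum EntryKind {
    Directory,
    File,
    Symlink,
}

// Standard Unix file type bits.
const S_IFMT: u32 = 0o170000;
const S_IFDIR: u32 = 0o040000;
const S_IFREG: u32 = 0o100000;
const S_IFLNK: u32 = 0o120000;

/// `version_made_by` and `external_attrs` are the raw fields from a
/// central directory header; `name` is the decoded entry path.
fn entry_kind(name: &str, version_made_by: u16, external_attrs: u32) -> EntryKind {
    // Upper byte of "version made by" is the host system; 3 means Unix.
    let made_on_unix = (version_made_by >> 8) == 3;
    if made_on_unix {
        let mode = external_attrs >> 16;
        match mode & S_IFMT {
            S_IFDIR => return EntryKind::Directory,
            S_IFREG => return EntryKind::File,
            S_IFLNK => return EntryKind::Symlink,
            _ => {} // no usable mode: fall through to the name check
        }
    }
    // No Unix mode available: rely on the trailing-slash convention.
    if name.ends_with('/') {
        EntryKind::Directory
    } else {
        EntryKind::File
    }
}
```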
In the wild I’ve noticed symbolic links tend to have their target as the contents of the entry, but of course that’s not what the APPNOTE says.
It says that in the Unix extra field, there is a variable size data field that can be used to store the target of a symbolic link or hard link.
Emphasis on “can”.
Because there were so many different tools that could create zip archives, and standardization only came later with the ISO standard (which mandates UTF-8 file names), the APPNOTE takes a descriptive rather than prescriptive approach.
It simply documents the various zip format implementations found in the wild, without making value judgments about the choices made by different software authors.
So if you want to support most zip files out there, you have to be able to read DOS-style timestamps and UNIX-style timestamps, which are completely different.
DOS timestamps, for example, are completely bonkers?
They fit in 32 bits, half for the time, half for the date, so far so good…
The day is a 5-bit integer, the month is a 4-bit integer, the year is a 7-bit wide integer, counting from 1980 — and as for the time, it’s stored in two-second intervals! It’s.. it’s fun.
I think of it every time someone says that IEEE 754 is “weird” because doing 0.1 + 0.2 shows a lot of decimals after 0.3 or whatever.
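For the curious, the DOS timestamp layout described above can be unpacked with a few shifts and masks. This is a minimal sketch: the `DosDateTime` struct and `decode_dos_datetime` function are names made up for the example, the time half uses the usual MS-DOS layout (5 bits of hours, 6 bits of minutes, 5 bits of two-second units), and nothing here validates out-of-range values like month 0 or day 31 in February.

```rust
#[derive(Debug, PartialEq)]
struct DosDateTime {
    year: u16,
    month: u8,
    day: u8,
    hour: u8,
    minute: u8,
    second: u8,
}

/// Decodes the two 16-bit halves of a DOS timestamp.
fn decode_dos_datetime(date: u16, time: u16) -> DosDateTime {
    DosDateTime {
        // Date: 7 bits of year (offset from 1980), 4 bits of month, 5 bits of day.
        year: 1980 + (date >> 9),
        month: ((date >> 5) & 0x0f) as u8,
        day: (date & 0x1f) as u8,
        // Time: 5 bits of hours, 6 bits of minutes, 5 bits of seconds / 2.
        hour: (time >> 11) as u8,
        minute: ((time >> 5) & 0x3f) as u8,
        second: ((time & 0x1f) * 2) as u8,
    }
}
```

Note that odd seconds simply cannot be represented, and neither can time zones: the value is whatever local time the archiving machine believed in.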
The end of central directory record
But okay, fine. Those are details you can probably ignore for files that have been created with recent tools.
But even the most basic fundamental aspects of the zip file format are slightly cursed?
Most file formats start with a magic number and then a header including metadata, and then the actual body, the actual meat of the file, so: pixel data for an image, or vertex coordinates for a model, things like that.
fasterthanli.me/content/img on main [!?]
❯ hexyl logo-round-2.png | head
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 89 50 4e 47 0d 0a 1a 0a ┊ 00 00 00 0d 49 48 44 52 │×PNG__•_┊⋄⋄⋄_IHDR│
│00000010│ 00 00 01 00 00 00 01 00 ┊ 08 06 00 00 00 5c 72 a8 │⋄⋄•⋄⋄⋄•⋄┊••⋄⋄⋄\r×│
│00000020│ 66 00 00 2a b5 7a 54 58 ┊ 74 52 61 77 20 70 72 6f │f⋄⋄*×zTX┊tRaw pro│
│00000030│ 66 69 6c 65 20 74 79 70 ┊ 65 20 65 78 69 66 00 00 │file typ┊e exif⋄⋄│
│00000040│ 78 da a5 9c 6b 76 5d b7 ┊ 8e 84 ff 73 14 77 08 7c │x×××kv]×┊×××s•w•|│
│00000050│ 93 18 0e 9f 6b f5 0c 7a ┊ f8 fd 15 8f e4 eb 38 ce │ו•×k×_z┊×ו×××8×│
│00000060│ ed a4 db 89 25 59 3a da ┊ 9b 9b 00 0a 55 00 78 dc │××××%Y:×┊××⋄_U⋄x×│
│00000070│ f9 ef ff ba ee 5f ff fa ┊ 57 f0 3e 54 97 4b eb d5 │×××××_××┊W×>T×K××│
│00000080│ 6a f5 fc c9 96 2d 0e be ┊ e8 fe f3 67 bc 8f c1 e7 │j××××-•×┊×××g××××│
But not ZIP! The only correct way of reading a zip file is to start from the end of the file and walk back until you find the signature of the end of central directory record.
And that’s why if you take a look at the zip crate API, it requires the input to implement both Read and Seek, because even just to list the entries of the zip file, you need to be able to move around it.
impl<R: Read + Seek> ZipArchive<R> {
    /// Read a ZIP archive, collecting the files it contains.
    ///
    /// This uses the central directory record of the ZIP file, and ignores local file headers.
    pub fn new(reader: R) -> ZipResult<ZipArchive<R>> {
        // ✂️
    }
}
Doing this properly is not as simple as it may sound!
Originally, the zip crate made 4-byte reads starting from almost the end of the file and then moved left by 1 byte every time it didn’t match the signature of the end of central directory record, which was hugely wasteful.
The async_zip crate, which was written later, improved on that by making reads of 2 KiB, and moving to the left 2 KiB minus the size of the signature to handle the case where the signature would straddle two buffers, which is pretty smart! The comments mention a 500x speedup compared to the zip method.
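Here is a minimal sketch of that chunked backward scan, written against plain synchronous Read + Seek. It is not the async_zip code: the function name, the constants, and the exact overlap handling are assumptions, and it overlaps consecutive windows by the signature length minus one so a signature split across a boundary is still seen whole.

```rust
use std::io::{self, Read, Seek, SeekFrom};

const EOCD_SIGNATURE: [u8; 4] = [0x50, 0x4b, 0x05, 0x06]; // "PK\x05\x06"
const CHUNK_SIZE: u64 = 2048;

/// Hypothetical helper: offset of the last EOCD-looking signature, if any.
fn find_last_eocd_signature<R: Read + Seek>(reader: &mut R) -> io::Result<Option<u64>> {
    let file_len = reader.seek(SeekFrom::End(0))?;
    // Re-read this many bytes so a signature straddling two windows is caught.
    let overlap = (EOCD_SIGNATURE.len() - 1) as u64;
    let mut buf = vec![0u8; CHUNK_SIZE as usize];
    let mut window_end = file_len;

    loop {
        let window_start = window_end.saturating_sub(CHUNK_SIZE);
        let len = (window_end - window_start) as usize;
        reader.seek(SeekFrom::Start(window_start))?;
        reader.read_exact(&mut buf[..len])?;

        // Search right to left: we want the match closest to the end.
        let found = buf[..len]
            .windows(EOCD_SIGNATURE.len())
            .rposition(|w| w == EOCD_SIGNATURE.as_slice());
        if let Some(pos) = found {
            return Ok(Some(window_start + pos as u64));
        }
        if window_start == 0 {
            return Ok(None);
        }
        // Step left, keeping the first `overlap` bytes of this window in view.
        window_end = window_start + overlap;
    }
}
```

This sketch will happily scan the whole file if it never finds a match. Since the record is 22 bytes plus a comment of at most 65,535 bytes, a reader could also choose to give up after that distance from the end.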
The zip crate eventually caught up in May of 2024 by doing 512-byte reads, which temporarily made it much faster until August of 2024 when they fixed a bug in the EOCD finding logic. A pretty fun one actually.
Boundary confusion
Most file formats have some sort of framing mechanism. You read the file moving forward, and then you have records prefixed by their length.
MP4, or rather, MPEG-4 Part 14, calls those boxes. Media authoring software tends to write a lot of metadata that media players don’t necessarily know about, but anyone can skip over those boxes, even if they’re of a completely unknown type.
This property also makes it impossible to mistake data for the actual structure of the file. Each box has a type and the type can be a valid UTF-8 byte sequence. But there is never any ambiguity as to whether you’re reading the type of box or whether you’re reading the name of the author of the media file.
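To illustrate, here is a simplified Rust sketch of walking length-prefixed boxes: a 4-byte big-endian size (which counts the 8-byte header), then a 4-byte type, then the payload. The `list_boxes` name is made up, and the sketch deliberately ignores the special sizes 0 (box runs to end of file) and 1 (a 64-bit size follows), so treat it as a demonstration of framing rather than an MP4 parser.

```rust
/// Hypothetical helper: returns the type and total size of each top-level box.
fn list_boxes(data: &[u8]) -> Vec<([u8; 4], usize)> {
    let mut boxes = Vec::new();
    let mut offset = 0;
    while offset + 8 <= data.len() {
        let size = u32::from_be_bytes([
            data[offset],
            data[offset + 1],
            data[offset + 2],
            data[offset + 3],
        ]) as usize;
        let kind = [
            data[offset + 4],
            data[offset + 5],
            data[offset + 6],
            data[offset + 7],
        ];
        // Simplification: stop on special sizes (0, 1) or truncated boxes.
        if size < 8 || offset + size > data.len() {
            break;
        }
        boxes.push((kind, size));
        // Skip the payload without ever looking inside it.
        offset += size;
    }
    boxes
}
```

Because the payload is skipped by length and never inspected, a payload that happens to contain bytes spelling out a box type cannot be confused with a real header.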
However, in the ZIP format, because you’re scanning from the end of the file going backwards, it is possible to read part of a comment or file path, and have it accidentally match the signature bytes for the end of central directory record.
And that’s the bug that was fixed in the zip crate in August of 2024. Instead of stopping at the first thing that looks like an EOCD signature, they now keep scanning the entire file and keep track of all the offsets at which signature-like things were found.
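A small in-memory sketch shows why a single match can't be trusted. The first function collects every offset where the signature bytes appear. The second applies one plausible sanity check, which is an assumption of mine and not necessarily what the zip crate does: the 2-byte comment length at offset 20 of the 22-byte record should land exactly on the end of the file. Both function names are hypothetical.

```rust
const EOCD_SIGNATURE: &[u8] = b"PK\x05\x06";
const EOCD_FIXED_LEN: usize = 22; // size of the record without its comment

/// Every offset in `data` where something signature-like appears.
fn eocd_candidates(data: &[u8]) -> Vec<usize> {
    let mut offsets = Vec::new();
    for (i, window) in data.windows(EOCD_SIGNATURE.len()).enumerate() {
        if window == EOCD_SIGNATURE {
            offsets.push(i);
        }
    }
    offsets
}

/// Assumed check: does the candidate's comment end exactly at end of file?
fn comment_reaches_eof(data: &[u8], offset: usize) -> bool {
    let record = match data.get(offset..offset + EOCD_FIXED_LEN) {
        Some(record) => record,
        None => return false, // not enough bytes left for a full record
    };
    let comment_len = u16::from_le_bytes([record[20], record[21]]) as usize;
    offset + EOCD_FIXED_LEN + comment_len == data.len()
}
```

With an archive comment that itself contains the signature bytes, `eocd_candidates` returns two offsets, and a backward scan that stops at the first hit would pick the wrong one.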
But of course, reading an entire multi-gigabyte file by increments of half a kilobyte, seeking backwards every time, is pretty much the worst possible read pattern that you can do on any kind of device? Any buffering done in userland or in the kernel is woefully unprepared for… that.
And I was going to give the example of a 4GB file, which would require about 8 million read syscalls (4 GiB divided by 512 bytes) just to find the EOCD, but then I stumbled upon this comment in the GitHub repository:
I tried this PR on a 200GB zip file (233899 files within) that I access over a networked share.
…and, it’s not like that person is doing anything wrong? But also, good lord.