Jan 16, 2017

A little while ago, I wrote an article on things that can go wrong when downloading, it listed a series of reasons, from network problems to invalid content to imperfect hardware that may occur when initially installing a game.

This article discusses what methods we can use to upgrade a game to a later version, when an older version has been successfully installed.

## Compression

The amount of bandwidth available to a user is usually a constant - their internet access has a maximum theoretical speed of 20 mbps, 100 mbps, 1 gbps, or maybe, in some countries, much much less.

*[mbps]: megabits per second - for example, 20 mbps = 2.5 megabytes per second. *[gbps]: gigabits per second (1000 mbps)

I don't have fiber in my home office yet so I'm in the 20 mbps bucket. Any time I feel like complaining, the internet reminds me that others have it way worse, like this guy who took screenshots of how sites look after he let them load for 1 minute. Maybe 20 mbps isn't that bad, all things considered.

This means transferring some data from the itch.io servers to the user will always take about the same time. Since we have no control over the user's internet access, what we can do is reduce the amount of data we need to transfer for them to be able to play the game.

Traditionally, this is done by compressing the data before transferring it. Compression is a complex field and studying it in-depth is out of scope for this article, but the gist of it is taking advantage of repetitions. For example, if you had to tell someone the following sequence over the phone:

``````banana banana banana banana banana
``````

…you'd probably just say “banana repeated five times” instead of just saying each instance of “banana” out loud. This works for more complex sequences too:

``````banana cherry apple pie
banana cherry apple sauce
banana cherry apple pizza
``````

You could say:

• Let's call “banana cherry apple” “X”.
• The sequence is: “X pie X sauce X pizza”

And without realizing it, you'd be doing dictionary-based compression. The “X” => “banana cherry apple” mapping is part of your dictionary, which you defined once and then re-used three times, to save space. Imagine this, but with years of research, refinement and a whole new bag of tricks, and you get “modern” compression algorithms.

The algorithms we're interested in are “lossless” compression algorithms, meaning that the sequence is completely unchanged when we fully compress, then decompress it. A “lossy” algorithm would be you telling your friend over the phone “it's a bunch of nonsense about fruits”. They get the gist, but you lost some information on the way.

Lossy compression algorithms are actually quite useful for images, videos, and sound.

They strive for a high compression ratio (the compressed data is much smaller) while maintaining as much fidelity (the decompressed output looks like the original) as possible.

In our case, we can't afford to lose or alter any data at all - any single change would make the game impossible to run. Besides, some files of the games are already compressed using lossy algorithms!

### Real-world compression

Let's consider a practical case. The itch.io command-line uploader, butler, uses the brotli compression format on all files developers upload using it. The idea is that we can reduce the total transfer time by applying just a little compression. If we used too much, it would take longer to compress the data than to just transfer it, as shown by this helpful stacked area chart:

We can adjust “how much compression” we want by setting the “quality level”: lower quality levels mean less compression, but also less computation needed, whereas higher quality levels achieve higher compression, but require more computation (ie. more time).

Let's look at some actual data: if we take a build of Overland for Linux:

• The uncompressed game is 748MiB
• Compressed at quality level 1, it's 340MiB (took 17 seconds on my machine)
• Compressed at quality level 4, it's 327MiB (took 25 seconds)
• Compressed at quality level 6, it's 316MiB (took 51 seconds)
• Compressed at quality level 9, it's 311MiB (took almost 4 minutes)

As you can see, there's diminishing returns in increasing the quality level. The biggest jump is from “no compression” to “a little compression”.

This is not to say that higher compression levels are useless. Simply, they're not used for the initial upload, but are used later on to re-compress the data on itch.io servers, so that the download size is even smaller than what was uploaded.

## Delta encoding

So far, we've seen that compression helps us reduce the amount of data a user needs to download to play game.

• It lets developers upload faster
• It lets players play games sooner
• And it helps us saves on bandwidth costs (since we pay for each byte players download)

When updating games, there is something better than compression though. Imagine you already told your friend on the phone the sequence from earlier:

``````banana cherry apple pie
banana cherry apple sauce
banana cherry apple pizza
``````

If it changed to this:

``````banana merry apple pie
banana merry apple sauce
banana merry apple pizza
``````

You could simply tell your friend that each occurence of “cherry” was changed to “merry”, instead of re-transmitting the entire thing. Or perhaps you could even tell your friend that each occurence of “ch” is changed to “m” - even less data, and that would still work.

Finding the smallest set of instructions you can tell your friend so they can reconstruct the new data from the old data is what delta encoding is all about.

### Rsync

A simple form of delta encoding is the rsync algorithm, devised in 1996 by Andrew Tridgell. The idea is fairly simple. First, we divide the old sequence into blocks of equal sizes (except the last one). Let's do it on a shortened version of the sequence we've been relentlessly transmitting to our phone friend:

``````(bana)(na c)(herr)(y ap)(ple )(pie)
``````

Each of these blocks has a position and a hash. A hash is something that's shorter but can still be used to almost-uniquely identify something (that's an over-simplification, but it's good enough). For our example, we'll take the first letter of the blocks as its hash. Which gives us:

• (bana) is at position 0 and has the hash ‘b’
• (na c) is at position 4 and has the hash ‘n’
• (herr) is at position 8 and has the hash ‘h’
• (y ap) is at position 12 and has the hash ‘y’
• (ple ) is at position 16 and has the hash ‘p’
• (pie) is at position 20 and has the hash ‘p’

We see here that (ple ) and (pie) have the same hash. This is called a hash collision - and it's okay! In the algorithm's defense, we're using a pretty weak hash.

Now, we scan the new sequence and try to find if some of the blocks we already know about are in there:

``````(bana)na merry apple pie
``````

We start off with the first four letters. We compute its hash (here, we take the first letter, ‘b’) - and use it to look up an old block that may be similar. We do have a block with hash ‘b’, at position 0 of the old sequence. Now, because of hash collisions, we have to compare the contents - and indeed, (bana) has the same contents as (bana), so that means we can re-use an old block!

Moving ahead by a full block, we keep scanning the new sequence:

``````bana(na m)erry apple pie
``````

There is an old block with hash ‘n’ - but its contents are (na c), not (na m), so we can't re-use it. Similarly, there's no old block with (a me), ( mer), (merr), so on until (ry a), so we have no choice but to transfer that data to our phone friend. So far, our instructions are:

• We're working with blocks of length 4
• Re-use the block at position 0 of the old sequence
• Now write down “na merr”

We keep scanning:

``````banana merr(y ap)ple pie
``````

We do have an old block with hash ‘y’, and it is (y ap) - we can re-use that one. Moving ahead by a full block:

``````banana merry ap(ple )pie
``````

We have two old blocks with hash ‘p’ - (ple ) and (pie) - comparing them, we decide to re-use (ple ). And so on until we're done scanning. Our complete set of instructions is now:

• We're working with blocks of length 4
• Re-use the block at position 0 of the old sequence
• Now write down “na merr”
• Re-use the block at position 12 of the old sequence
• Re-use the block at position 16 of the old sequence
• Re-use the block at position 20 of the old sequence

Since blocks at positions 12, 16, and 20 are adjacent, we can reduce our instruction set further:

• We're working with blocks of length 4
• Re-use the block at position 0 of the old sequence
• Now write down “na merr”
• Re-use three blocks starting at position 12 of the old sequence

And this is basically how rsync works. Over the phone, it's not that great - but it's actually quite suited to computers! The real rsync algorithm uses two hashes. The first hash is weak (cheap to compute, high collision chance). The second hash is stronger, hence more expensive to compute, which is why it's only computed when the first hash matches.

The first hash also has a nice property: it's rolling. Computing the hash of a [0…n] block takes a certain amount of work, but once it's done, computing the hash of [1…n+1] is much faster, since we can re-use the previous result. This is especially well-suited to rsync, which looks at blocks of n bytes, moving byte by byte until it finds a match.

For more details, read the rsync paper.

That's the first level of delta encoding we do at itch.io. It's done when developers upload a new version of their game:

• First, butler downloads a set of hashes of the older version
• Then it scans the new version to see which blocks it can re-use, based on their hashes
• Then it transmits both “re-use instructions” (if it found old blocks with the same contents) along with fresh data (for the parts it couldn't re-use)

The patch produced by that step looks something like that:

### BSDiff

rsync is pretty good - but there are cases where it performs really poorly.

Consider the following sequence:

``````(bear)(fall)
``````

If it changes into the following:

``````(dear)(hall)
``````

They're not that different right? If you were still playing “patch your friends” on the phone, you'd simply say the “b” changed into an “d” and the “f” changed into an “h” - all done. As far as rsync is concerned, though, there is no 4-letter block in the new sequence that is equivalent to a 4-letter block from the old sequence - and so, it is forced to include the entire new sequence as “fresh data”, throwing away all the old data.

There is something interesting about this change though - ‘d’ is two letters after ‘b’ in the alphabet, and ‘h’ is two letters after ‘f’. If we construct a new sequence that is the difference in positions between each letter, we get:

``````  b e a r f a l l
- d e a r h a l l
-----------------
= 2 0 0 0 2 0 0 0
``````

And that compresses super well!

• Write down 2, then three 0s
• Now do it again

And done!

You might be skeptical of how useful this is in real life. It turns out, files being changed by the same “difference” in multiple places happens a lot.

Consider a phonebook that also lists relatives, along with the page number on which they appear. If a few pages are added at the start of the phonebook, all the page numbers for the relatives need to be adjusted (by the same amount).

In that way, executables are a lot like phonebooks. Adding a function tends to change the address of a bunch of other functions.

This is why bsdiff works so well - a “naive” binary delta encoding algorithm designed by Colin Percival in 2003 (seven years after the rsync paper).

Of course, our example is really simplistic. In practice, we're not looking at only eight letters, but several megabytes, even gigabytes of data. The hard part is to find good “approximate matches” between the old data and the new data.

### Suffix sorting

The first thing to do when looking for approximate matches (which we can store the compressed difference of, for a patch), is to find exact matches between the old and the new data. Think of it as searching for a word in a text.

There is a data structure we can build for a text that lets us find any word relatively quickly: a suffix array. To build a suffix array, we need to sort all suffixes of the text.

The text “rabbit” has the suffixes “rabbit”, “abbit”, “bbit”, and so on. If we sort them all by lexicographical ordering, we get:

• “abbit” (suffix 2)
• “bbit” (suffix 3)
• “bit” (suffix 4)
• “it” (suffix 5)
• “rabbit” (suffix 1)
• “t” (suffix 6)

*[lexicographical]: You can think of it as “alphabetical” for the sake of our examples.

Hence, the suffix array of “rabbit” is (2 3 4 5 1 6). Now then, if we want to search for a string inside of it, like “it”, we'd do what is called a binary search.

The idea behind a binary search is that if you have a sorted set of values, you can first compare with the value in the middle. If it's smaller than the value you're searching for, you can eliminate the first half and search in the second half. On the contrary, if it's larger than the value you're looking for, you can eliminate the second half and search in the first half.

Eventually, we're left with a set of size 1. If the only element in there is the one we're searching for, bingo! If not, it's not in the set.

Searching for the longest matching suffix using a suffix array works similarly, except instead of comparing numbers, we're comparing strings lexicographically:

Once we find an exact match, the next step in the bsdiff algorithm is to try and “expand” it, both forward and backward, as long as at least half the letters match. This gives us a longer, approximate match. We then compute the difference between both regions and store it, compressed. We repeat for each approximate match we find, including the fresh data (for which no match has been found, even approximate) in the patch.

### How bsdiff fits in the pipeline

Let's sum up:

• Each build of a game is distributed compressed, which reduces their size
• Distributing upgrades as patches saves even more bandwidth (taking advantage of the older version of the game that's already there)
• rsync patches are constructed on upload, but we can do a second pass with bsdiff to produce even smaller patches

To run bsdiff:

• We need the entire old file, and the entire new file
• That's why it's being run on itch.io servers (which have high-speed internet access to all builds) rather than the developers’ machines (which only have the latest build on disk)
• First we compute the suffix array for the old file (ie. we “suffix sort” it)
• Then we scan the new file, looking for approximate matches

There's one thing I haven't mentioned: bsdiff computes a patch to reconstruct one file using another file - but game builds include many files. So how do we know which new file corresponds to which old file? They may have been renamed, or split, or joined - so we can't rely on their name alone.

Luckily, there's a simple solution: remember how the rsync patch contains instructions like “re-use N blocks from that old file”? We can scan the rsync patch looking for these, and whichever old file we're re-using the most blocks of, that's the file we want to bsdiff against.

If we look at a real game, Mewnbase:

• The two most recent builds is 82MiB
• The rsync patch is 33MiB
• The bsdiff patch is 174KiB

As mentioned, bsdiff happens in two phases: (suffix) sorting, and scanning. If we re-use the original 2003 implementation of it, here's what it looks for this game in particular:

• Sorting takes 13.6s
• Scanning takes 1s

The numbers vary with the data being diffed, but overall, sorting is by far the most expensive part.

Luckily, the suffix sorting algorithm used by the original bsdiff, which we'll call Larsson-Sadakane, can be parallelized relatively easily.

I won't go into too many details here, but basically, Larsson-Sadakane uses a trick to construct the suffix array faster. It stems from the observation (made by Karp, Miller and Rosenberg) that each proper suffix of a string is also a suffix of another suffix. So, instead of sorting every suffix by the first symbol, then the second, then the third, the fourth, and so on, we can re-use the previous passes, to sort by the first, the second, the fourth, the eigth, the sixteenth, etc. - reducing the number of passes to log(n) instead of n.

Each of these passes divide the suffix array into buckets (which are later merged back together) - which are then sorted. Since each bucket is sorted individually, that work can be done in parallel on different cores. The parallel version of Larsson-Sadakane which I implemented uses one additional array of n integers, that contains the result of the previous pass - so that the various cores don't try to read from memory that's currently being written by other threads.

In practice, this works relatively well. Using 12 cores, the runtime for optimizing that particular Mewnbase patch looks like this:

• Sorting takes 4.68s (13.6s before)
• Scanning takes 1s

Scanning takes exactly the same amount of time, since it's the same code working from the same suffix array, but sorting is 3x faster. This is both good news (faster bsdiff, finally!) and bad news: a 3x speed up for using 12 cores isn't that great. Distributing the buckets to the cores for sorting and synchronizing the whole thing has a non-negligible cost here.

### Modern suffix sorting algorithms

Suffix sorting has been a subject of research over the past few decades, not because everyone is eager for smaller game updates (I wish!), but because it has applications in biotechnology: DNA is a series of four symbols, and being able to find common sequences amongst different genomes quickly is very important. (Except, they run their research on super-computers which routinely have more than 128 cores. But let's not get picky here).

For that reason, SACA algorithms newer than Larsson-Sadakane (1999) have been designed. In recent times, the most interesting research comes from Ge Nong (Professor of Sun Yat-sen University in Guangzhou, China). One of his papers, “An optimal suffix array construction algorithm”, describes exactly what you'd expect from the title.

I haven't had the time to research how the algorithm works well enough to describe it here, but I can tell you it's based on SA-IS, that it uses induced sorting (similar to DC3? correct me if I'm wrong), and that if you're really curious and want to dig in, you can read up on induced sorting here, and read the SA-IS paper here.

*[SACA]: Suffix Array Construction Algorithms

And since the output is still exactly the same suffix array, it's a drop-in replacement for the first phase of bsdiff! Using a golang implementation of that algorithm, we get the following runtimes:

• Sorting takes 9.9s (13.6s before)
• Scanning takes 1s

On other datasets, the difference is more striking:

• dictwords (200KiB) is sorted in 51ms using Larsson-Sadakane, 21ms using gosaca
• dictwords (2MiB) is sorted in 1237ms using Larsson-Sadakane, 355ms using gosaca

The bigger the input is, the faster gosaca is compared to Larsson-Sadakane. But it's still overall slower than parallel Larsson-Sadakane. If only there was a way to parallelize it…

### Parallelizing “an optimal SACA”

When we parallelized Larsson-Sadakane, we still ended up with exactly the same suffix array. I don't understand “optimal SACA” enough to parallelize it similarly - but what if instead of constructing a suffix array for the entire old file, we divided it into P intervals and constructed suffix arrays for those in parallel?

As expected, doing this is faster. On an eight-core machine:

• dictwords (200KiB) takes
• 21ms to sort as a single interval
• 12ms as two intervals sorted in parallel
• 8ms as four intervals
• 7ms as eight intervals
• dictcalls (2MiB) takes
• 355ms to sort as a single interval
• 171ms as two intervals sorted in parallel
• 95ms as four intervals
• 72ms as eight intervals

Of course, this isn't a silver bullet. The problem with this is that we end up with P different suffix arrays, one for each interval of the old file. And when doing the “scanning” phase of bsdiff, we have to look in each of them, because we don't know in which interval the best match lies.

But let's look at optimizing a patch for a much larger game: Aven Colony. A single version of the game nowadays is 11GiB. Using single-core (original) gosaca, we get:

• A reduction of size from 532MiB to 125MiB
• Sorting took 25 minutes
• Scanning took 5 minutes and 20 seconds

If we use two intervals, sorting should be reduced to about 13 minutes, and scanning would only get twice as slow, ie. 10 minutes - which would reduce processing time from 35 minutes to 23 minutes - still a win!

Using more intervals would not help: let's say we reduce sorting to 7 minutes with 4 intervals, scanning would now dominate with 20 minutes, giving a total runtime of 24 minutes, which is a setback.

### Parallelizing scanning

We've seen that constructing suffix arrays on intervals of the old file results in faster patch optimization overall, but also that the gains are still pretty minimal, since scanning becomes slower.

So how can we reduce the cost of scanning, when we have P cores available? I see two strategies.

The first is, now that we have P suffix arrays, to search them all in parallel every time the bsdiff scanning process is looking for the longest exact match. In practice, this ended up being slightly worse than looking in each suffix array sequentially because of synchronization costs typical of trying to do anything in parallel.

The second strategy is to run the bsdiff scanning phase on “blocks” of the new file in parallel. The synchronization costs are much lower because each unit of work (bsdiffing a few megabytes of the new file) is much larger. Each bsdiff scan reads from the same suffix array (which, after sorting, is read-only), and constructs a list of approximate matches, which are then put back together in order to write the patch.

In practice, this gives excellent results. For a sample ~800KiB file randomly generated, with random similar changes at various intervals:

• sequential bsdiff took 8.10s (7s sorting, 1s scanning)
• 2-intervals bsdiff took 4.48s (3.47s sorting, 938ms scanning)
• 4-intervals bsdiff took 3.10s (2.09s sorting, 939ms scanning)
• 8-intervals bsdiff took 2.74s (1.64s sorting, 1.03s scanning)

Note that the “scanning” figures reported here also include subtracting new data and old data for the approximate matches, and writing it as zstd-compressed protobuf messages.

The important thing to notice here is that, as intuited, sorting time goes down, whereas scanning time stays relatively constant. Technically more work is done, but it's distributed on more cores evenly, so it remains the same.

### Preliminary results

By using 11 partitions (number of cores - 1), we achieve the following runtime when optimizing Mewnbase:

• Sorting takes 1.54s (13.6s before)
• Scanning takes 866ms (1s before)

Overall, this took patch optimization time from 14.84s to 2.84s, a 5.22x speedup.

On the bigger game, Aven Colony, total processing time went from 69 minutes to 10 minutes, almost a 7x speedup.

These improvements to bsdiff all combined help us deliver updates to itch.io players faster and keep our bandwidth costs in check!

If you like this, you may also like Things that can go wrong when downloading, a similar simplified explanation of some challenges we face, this time when installing a game initially.

The techniques presented above are used in itch.io refinery, a customizable toolset for first releases & playtest - and for a bunch of already-released games via the itch.io app, available now for Linux, Windows and macOS.

Everything presented here is part of our open diffing and patching protocol, wharf. You can check out the source code directly on our GitHub organization.