Efficient game updates
A little while ago, I wrote an article on things that can go wrong when downloading. It covered a range of problems that can occur when initially installing a game, from network issues to invalid content to imperfect hardware.
This article discusses what methods we can use to upgrade a game to a later version, when an older version has been successfully installed.
Compression
The amount of bandwidth available to a user is usually a constant - their internet access has a maximum theoretical speed of 20 mbps, 100 mbps, 1 gbps, or maybe, in some countries, much much less.
*[mbps]: megabits per second - for example, 20 mbps = 2.5 megabytes per second.
*[gbps]: gigabits per second (1000 mbps)
I don't have fiber in my home office yet so I'm in the 20 mbps bucket. Any time I feel like complaining, the internet reminds me that others have it way worse, like this guy who took screenshots of how sites look after he let them load for 1 minute. Maybe 20 mbps isn't that bad, all things considered.
This means transferring some data from the itch.io servers to the user will always take about the same time. Since we have no control over the user's internet access, what we can do is reduce the amount of data we need to transfer for them to be able to play the game.
Traditionally, this is done by compressing the data before transferring it. Compression is a complex field and studying it in-depth is out of scope for this article, but the gist of it is taking advantage of repetitions. For example, if you had to tell someone the following sequence over the phone:
banana banana banana banana banana
...you'd probably just say "banana repeated five times" instead of saying each instance of "banana" out loud. This works for more complex sequences too:
banana cherry apple pie
banana cherry apple sauce
banana cherry apple pizza
You could say:
- Let's call "banana cherry apple" "X".
- The sequence is: "X pie X sauce X pizza"
And without realizing it, you'd be doing dictionary-based compression. The "X" => "banana cherry apple" mapping is part of your dictionary, which you defined once and then re-used three times, to save space. Imagine this, but with years of research, refinement and a whole new bag of tricks, and you get "modern" compression algorithms.
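If you want to see the dictionary idea in code, here is a toy sketch in Go - purely an illustration of substitution-based compression, nothing like a real compressor (the single-token dictionary is made up for this example):

```go
package main

import (
	"fmt"
	"strings"
)

// compress replaces every occurrence of phrase with the short token "X"
// and returns the compressed text along with the dictionary entry.
// (Toy example: a real compressor would pick tokens that cannot appear
// in the input, and would discover repeated phrases by itself.)
func compress(text, phrase string) (string, map[string]string) {
	dictionary := map[string]string{"X": phrase}
	return strings.ReplaceAll(text, phrase, "X"), dictionary
}

// decompress undoes the substitution using the dictionary.
func decompress(compressed string, dictionary map[string]string) string {
	for token, phrase := range dictionary {
		compressed = strings.ReplaceAll(compressed, token, phrase)
	}
	return compressed
}

func main() {
	text := "banana cherry apple pie\nbanana cherry apple sauce\nbanana cherry apple pizza"
	compressed, dict := compress(text, "banana cherry apple")

	fmt.Println(compressed)                           // "X pie", "X sauce", "X pizza"
	fmt.Println(decompress(compressed, dict) == text) // true: lossless
	fmt.Println(len(compressed), "vs", len(text))     // fewer bytes to transfer
}
```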
The algorithms we're interested in are "lossless" compression algorithms, meaning that the sequence is completely unchanged when we fully compress, then decompress it. A "lossy" algorithm would be you telling your friend over the phone "it's a bunch of nonsense about fruits". They get the gist, but you lost some information on the way.
Lossy compression algorithms are actually quite useful for images, videos, and sound.
They strive for a high compression ratio (the compressed data is much smaller) while maintaining as much fidelity (the decompressed output looks like the original) as possible.
In our case, we can't afford to lose or alter any data at all - any single change would make the game impossible to run. Besides, some of the games' files are already compressed using lossy algorithms!
Real-world compression
Let's consider a practical case. The itch.io command-line uploader, butler, uses the brotli compression format on all files developers upload with it. The idea is that we can reduce the total transfer time by applying just a little compression - if we used too much, it would take longer to compress the data than to just transfer it.
We can adjust "how much compression" we want by setting the "quality level": lower quality levels mean less compression, but also less computation needed, whereas higher quality levels achieve higher compression, but require more computation (i.e. more time).
Let's look at some actual data. If we take a build of Overland for Linux:
- The uncompressed game is 748MiB
- Compressed at quality level 1, it's 340MiB (took 17 seconds on my machine)
- Compressed at quality level 4, itâs 327MiB (took 25 seconds)
- Compressed at quality level 6, itâs 316MiB (took 51 seconds)
- Compressed at quality level 9, itâs 311MiB (took almost 4 minutes)
As you can see, there are diminishing returns to increasing the quality level. The biggest jump is from "no compression" to "a little compression".
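If you wanted to reproduce that kind of measurement yourself, a rough sketch might look like the following - this assumes the andybalholm/brotli Go package and a hypothetical game-build.dat input file; it is not how butler does it internally, just an illustration of trading quality level against time:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"time"

	"github.com/andybalholm/brotli"
)

// countingWriter discards its input but keeps track of how many bytes were
// written to it - enough to measure the compressed size without storing it.
type countingWriter struct{ n int64 }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.n += int64(len(p))
	return len(p), nil
}

// compressAt compresses the file at path with the given brotli quality level
// and reports the compressed size along with how long the compression took.
func compressAt(path string, quality int) (int64, time.Duration, error) {
	in, err := os.Open(path)
	if err != nil {
		return 0, 0, err
	}
	defer in.Close()

	counter := &countingWriter{}
	w := brotli.NewWriterLevel(counter, quality)

	start := time.Now()
	if _, err := io.Copy(w, in); err != nil {
		return 0, 0, err
	}
	if err := w.Close(); err != nil {
		return 0, 0, err
	}
	return counter.n, time.Since(start), nil
}

func main() {
	// "game-build.dat" is a stand-in for whatever build you want to measure.
	for _, quality := range []int{1, 4, 6, 9} {
		size, took, err := compressAt("game-build.dat", quality)
		if err != nil {
			panic(err)
		}
		fmt.Printf("quality %d: %d bytes in %v\n", quality, size, took)
	}
}
```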
This is not to say that higher compression levels are useless - they're simply not used for the initial upload. They're used later on to re-compress the data on itch.io servers, so that the download size is even smaller than what was uploaded.
Delta encoding
So far, we've seen that compression helps us reduce the amount of data a user needs to download to play a game:
- It lets developers upload faster
- It lets players play games sooner
- And it helps us save on bandwidth costs (since we pay for each byte players download)
When updating games, there is something better than compression though. Imagine you already told your friend on the phone the sequence from earlier:
banana cherry apple pie
banana cherry apple sauce
banana cherry apple pizza
If it changed to this:
banana merry apple pie
banana merry apple sauce
banana merry apple pizza
You could simply tell your friend that each occurrence of "cherry" was changed to "merry", instead of re-transmitting the entire thing. Or perhaps you could even tell your friend that each occurrence of "ch" was changed to "m" - even less data, and that would still work.
Finding the smallest set of instructions you can tell your friend so they can reconstruct the new data from the old data is what delta encoding is all about.
Rsync
A simple form of delta encoding is the rsync algorithm, devised in 1996 by Andrew Tridgell. The idea is straightforward. First, we divide the old sequence into blocks of equal size (except possibly the last one). Let's do it on a shortened version of the sequence we've been relentlessly transmitting to our phone friend:
(bana)(na c)(herr)(y ap)(ple )(pie)
Each of these blocks has a position and a hash. A hash is something that's shorter but can still be used to almost-uniquely identify something (that's an over-simplification, but it's good enough). For our example, we'll take the first letter of each block as its hash. That gives us:
- (bana) is at position 0 and has the hash "b"
- (na c) is at position 4 and has the hash "n"
- (herr) is at position 8 and has the hash "h"
- (y ap) is at position 12 and has the hash "y"
- (ple ) is at position 16 and has the hash "p"
- (pie) is at position 20 and has the hash "p"
We see here that (ple ) and (pie) have the same hash. This is called a hash collision - and it's okay! In the algorithm's defense, we're using a pretty weak hash.
Now, we scan the new sequence and try to find if some of the blocks we already know about are in there:
(bana)na merry apple pie
We start off with the first four letters and compute their hash (here, we take the first letter, "b"), then use it to look up an old block that may be similar. We do have a block with hash "b", at position 0 of the old sequence. Now, because of hash collisions, we have to compare the contents - and indeed, (bana) has the same contents as (bana), so we can re-use an old block!
Moving ahead by a full block, we keep scanning the new sequence:
bana(na m)erry apple pie
There is an old block with hash "n" - but its contents are (na c), not (na m), so we can't re-use it. Similarly, there's no old block with (a me), ( mer), (merr), and so on until (ry a), so we have no choice but to transfer that data to our phone friend. So far, our instructions are:
- We're working with blocks of length 4
- Re-use the block at position 0 of the old sequence
- Now write down "na merr"
We keep scanning:
banana merr(y ap)ple pie
We do have an old block with hash "y", and it is (y ap) - we can re-use that one. Moving ahead by a full block:
banana merry ap(ple )pie
We have two old blocks with hash "p" - (ple ) and (pie) - comparing them, we decide to re-use (ple ). And so on until we're done scanning. Our complete set of instructions is now:
- We're working with blocks of length 4
- Re-use the block at position 0 of the old sequence
- Now write down "na merr"
- Re-use the block at position 12 of the old sequence
- Re-use the block at position 16 of the old sequence
- Re-use the block at position 20 of the old sequence
Since blocks at positions 12, 16, and 20 are adjacent, we can reduce our instruction set further:
- We're working with blocks of length 4
- Re-use the block at position 0 of the old sequence
- Now write down "na merr"
- Re-use three blocks starting at position 12 of the old sequence
And this is basically how rsync works. Over the phone, it's not that great - but it's actually quite suited to computers! The real rsync algorithm uses two hashes. The first hash is weak (cheap to compute, high collision chance). The second hash is stronger, hence more expensive to compute, which is why it's only computed when the first hash matches.
The first hash also has a nice property: it's rolling. Computing the hash of a [0..n] block takes a certain amount of work, but once it's done, computing the hash of [1..n+1] is much faster, since we can re-use the previous result. This is especially well-suited to rsync, which looks at blocks of n bytes, moving byte by byte until it finds a match.
For more details, read the rsync paper.
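Here is a heavily simplified sketch of the whole idea in Go: fixed-size blocks, a deliberately weak "sum of bytes" hash (which has the rolling property described above, although this toy version recomputes it from scratch), and no strong hash at all - none of which matches the real rsync or wharf implementations, but it shows the shape of the signature-then-scan process:

```go
package main

import "fmt"

const blockSize = 4

// weakHash is a deliberately weak hash: the sum of the bytes in the block.
// It rolls nicely: given the hash of data[i:i+n], the hash of data[i+1:i+n+1]
// is hash - data[i] + data[i+n]. (This toy version just recomputes it.)
func weakHash(block []byte) int {
	sum := 0
	for _, b := range block {
		sum += int(b)
	}
	return sum
}

// instruction is either "re-use the old block at reuseAt" (reuseAt >= 0)
// or "write down this fresh data" (reuseAt == -1).
type instruction struct {
	reuseAt int
	fresh   []byte
}

// diff produces instructions that rebuild newData out of oldData.
// Simplification: only full blocks are matched, and there is no strong hash.
func diff(oldData, newData []byte) []instruction {
	// Build the signature: weak hash -> positions of old blocks.
	signature := map[int][]int{}
	for pos := 0; pos+blockSize <= len(oldData); pos += blockSize {
		h := weakHash(oldData[pos : pos+blockSize])
		signature[h] = append(signature[h], pos)
	}

	var out []instruction
	var fresh []byte
	flush := func() {
		if len(fresh) > 0 {
			out = append(out, instruction{reuseAt: -1, fresh: fresh})
			fresh = nil
		}
	}

	for i := 0; i < len(newData); {
		matched := false
		if i+blockSize <= len(newData) {
			h := weakHash(newData[i : i+blockSize])
			// Hash collisions are possible, so always compare contents too.
			for _, pos := range signature[h] {
				if string(oldData[pos:pos+blockSize]) == string(newData[i:i+blockSize]) {
					flush()
					out = append(out, instruction{reuseAt: pos})
					i += blockSize
					matched = true
					break
				}
			}
		}
		if !matched {
			// No re-usable block here: emit one byte of fresh data, move ahead by one.
			fresh = append(fresh, newData[i])
			i++
		}
	}
	flush()
	return out
}

func main() {
	oldData := []byte("banana cherry apple pie")
	newData := []byte("banana merry apple pie")
	for _, ins := range diff(oldData, newData) {
		if ins.reuseAt >= 0 {
			fmt.Printf("re-use old block at position %d\n", ins.reuseAt)
		} else {
			fmt.Printf("fresh data: %q\n", ins.fresh)
		}
	}
}
```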
That's the first level of delta encoding we do at itch.io. It's done when developers upload a new version of their game:
- First, butler downloads a set of hashes of the older version
- Then it scans the new version to see which blocks it can re-use, based on their hashes
- Then it transmits "re-use instructions" (where it found old blocks with the same contents) along with fresh data (for the parts it couldn't re-use)
The patch produced by that step looks something like this: a stream of re-use instructions interleaved with fresh data.
BSDiff
rsync is pretty good - but there are cases where it performs really poorly.
Consider the following sequence:
(bear)(fall)
If it changes into the following:
(dear)(hall)
They're not that different, right? If you were still playing "patch your friends" on the phone, you'd simply say the "b" changed into a "d" and the "f" changed into an "h" - all done. As far as rsync is concerned, though, there is no 4-letter block in the new sequence that is equivalent to a 4-letter block from the old sequence - and so, it is forced to include the entire new sequence as "fresh data", throwing away all the old data.
There is something interesting about this change though - "d" is two letters after "b" in the alphabet, and "h" is two letters after "f". If we construct a new sequence that is the difference in alphabet positions between corresponding letters, we get:
  d e a r h a l l
- b e a r f a l l
-----------------
= 2 0 0 0 2 0 0 0
And that compresses super well!
- Write down 2, then three 0s
- Now do it again
And done!
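Here is that worked example as a few lines of Go - computing the byte-wise difference, and showing that applying the patch is just adding the difference back onto the old data (the compression step is left out; any general-purpose compressor handles long runs of zeroes very well):

```go
package main

import "fmt"

func main() {
	oldData := []byte("bearfall")
	newData := []byte("dearhall")

	// The difference sequence: mostly zeroes, with a 2 wherever a letter
	// moved two places forward in the alphabet.
	diff := make([]byte, len(newData))
	for i := range newData {
		diff[i] = newData[i] - oldData[i]
	}
	fmt.Println(diff) // [2 0 0 0 2 0 0 0]

	// Applying the patch is just adding the difference back onto the old data.
	rebuilt := make([]byte, len(oldData))
	for i := range oldData {
		rebuilt[i] = oldData[i] + diff[i]
	}
	fmt.Println(string(rebuilt)) // "dearhall"
}
```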
You might be skeptical of how useful this is in real life. It turns out, files being changed by the same "difference" in multiple places happens a lot.
Consider a phonebook that also lists relatives, along with the page number on which they appear. If a few pages are added at the start of the phonebook, all the page numbers for the relatives need to be adjusted (by the same amount).
In that way, executables are a lot like phonebooks. Adding a function tends to change the address of a bunch of other functions.
This is why bsdiff - a "naive" binary delta encoding algorithm designed by Colin Percival in 2003 (seven years after the rsync paper) - works so well.
Of course, our example is really simplistic. In practice, we're not looking at only eight letters, but several megabytes, even gigabytes of data. The hard part is to find good "approximate matches" between the old data and the new data.
Suffix sorting
The first thing to do when looking for approximate matches (whose compressed difference we can then store in a patch) is to find exact matches between the old and the new data. Think of it as searching for a word in a text.
There is a data structure we can build for a text that lets us find any word relatively quickly: a suffix array. To build a suffix array, we need to sort all suffixes of the text.
The text "rabbit" has the suffixes "rabbit", "abbit", "bbit", and so on. If we sort them all by lexicographical ordering, we get:
- "abbit" (suffix 2)
- "bbit" (suffix 3)
- "bit" (suffix 4)
- "it" (suffix 5)
- "rabbit" (suffix 1)
- "t" (suffix 6)
*[lexicographical]: You can think of it as "alphabetical" for the sake of our examples.
Hence, the suffix array of "rabbit" is (2 3 4 5 1 6). Now, if we want to search for a string inside the text, like "it", we'd do what is called a binary search.
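A naive way to build that array in Go (nowhere near what real suffix sorting algorithms do, but enough to check the result above) is to collect all suffix start positions and sort them by the suffixes they point at:

```go
package main

import (
	"fmt"
	"sort"
)

// suffixArray returns the start positions of all suffixes of text, sorted
// lexicographically. Positions are 0-based here, whereas the example above
// numbers suffixes starting at 1.
func suffixArray(text string) []int {
	sa := make([]int, len(text))
	for i := range sa {
		sa[i] = i
	}
	sort.Slice(sa, func(a, b int) bool {
		return text[sa[a]:] < text[sa[b]:]
	})
	return sa
}

func main() {
	fmt.Println(suffixArray("rabbit")) // [1 2 3 4 0 5]
}
```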
Binary search
The idea behind a binary search is that if you have a sorted set of values, you can first compare against the value in the middle. If it's smaller than the value you're searching for, you can eliminate the first half and search in the second half. Conversely, if it's larger than the value you're looking for, you can eliminate the second half and search in the first half.
Eventually, we're left with a set of size 1. If the only element in there is the one we're searching for, bingo! If not, it's not in the set.
Searching for the longest matching suffix using a suffix array works similarly, except instead of comparing numbers, we're comparing strings lexicographically.
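As a sketch, here is that binary search expressed with Go's sort.Search over the suffix array from the previous example - simplified to return any position where the pattern occurs, whereas the real bsdiff search looks for the longest match:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// findMatch binary-searches the suffix array sa of text for a position
// where pattern occurs, returning -1 if it does not occur at all.
func findMatch(text string, sa []int, pattern string) int {
	// sort.Search finds the first suffix that is >= pattern.
	i := sort.Search(len(sa), func(i int) bool {
		return text[sa[i]:] >= pattern
	})
	if i < len(sa) && strings.HasPrefix(text[sa[i]:], pattern) {
		return sa[i]
	}
	return -1
}

func main() {
	text := "rabbit"
	sa := []int{1, 2, 3, 4, 0, 5} // the suffix array from the previous sketch

	fmt.Println(findMatch(text, sa, "it"))  // 4
	fmt.Println(findMatch(text, sa, "cat")) // -1
}
```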
Once we find an exact match, the next step in the bsdiff algorithm is to try and "expand" it, both forward and backward, as long as at least half the letters match. This gives us a longer, approximate match. We then compute the difference between both regions and store it, compressed. We repeat for each approximate match we find, including the fresh data (for which no match has been found, even approximate) in the patch.
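Here is a sketch of just the forward-extension step, following the "at least half the letters match" rule described above (the real bsdiff uses a more careful scoring and also extends backward, so treat this as an approximation):

```go
package main

import "fmt"

// extendForward grows an exact match between oldData[oldPos:] and
// newData[newPos:] into a longer, approximate match, keeping the longest
// length at which at least half of the compared bytes are still equal.
func extendForward(oldData, newData []byte, oldPos, newPos, exactLen int) int {
	best := exactLen
	matches, total := exactLen, exactLen
	for oldPos+total < len(oldData) && newPos+total < len(newData) {
		if oldData[oldPos+total] == newData[newPos+total] {
			matches++
		}
		total++
		// Only commit to the longer length while the half-matching rule holds.
		if matches*2 >= total {
			best = total
		}
	}
	return best
}

func main() {
	oldData := []byte("bearfall and more")
	newData := []byte("dearhall and more")
	// Exact match "ear" of length 3 at old position 1 / new position 1.
	fmt.Println(extendForward(oldData, newData, 1, 1, 3)) // 16
}
```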
How bsdiff fits in the pipeline
Let's sum up:
- Each build of a game is distributed compressed, which reduces its size
- Distributing upgrades as patches saves even more bandwidth (taking advantage of the older version of the game that's already there)
- rsync patches are constructed on upload, but we can do a second pass with bsdiff to produce even smaller patches
To run bsdiff:
- We need the entire old file, and the entire new file
- That's why it's run on itch.io servers (which have high-speed access to all builds) rather than the developers' machines (which only have the latest build on disk)
- First we compute the suffix array for the old file (i.e. we "suffix sort" it)
- Then we scan the new file, looking for approximate matches
There's one thing I haven't mentioned: bsdiff computes a patch to reconstruct one file using another file - but game builds include many files. So how do we know which new file corresponds to which old file? They may have been renamed, split, or joined - so we can't rely on their names alone.
Luckily, there's a simple solution: remember how the rsync patch contains instructions like "re-use N blocks from that old file"? We can scan the rsync patch looking for these, and whichever old file we're re-using the most blocks of, that's the file we want to bsdiff against.
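In code, that heuristic is just counting - here is a sketch where the rsync patch has already been parsed into simplified "re-use N blocks from this old file" entries (the types and file names are made up for illustration and are not wharf's actual patch format):

```go
package main

import "fmt"

// blockReuse is a simplified stand-in for an rsync patch instruction:
// "re-use count blocks from old file oldFile".
type blockReuse struct {
	oldFile string
	count   int
}

// pickDiffSource returns the old file we re-use the most blocks of - the
// best candidate to bsdiff the new file against.
func pickDiffSource(instructions []blockReuse) string {
	totals := map[string]int{}
	for _, ins := range instructions {
		totals[ins.oldFile] += ins.count
	}
	best, bestCount := "", -1
	for file, count := range totals {
		if count > bestCount {
			best, bestCount = file, count
		}
	}
	return best
}

func main() {
	patch := []blockReuse{
		{oldFile: "data/level1.pak", count: 120},
		{oldFile: "data/level2.pak", count: 8},
		{oldFile: "data/level1.pak", count: 300},
	}
	fmt.Println(pickDiffSource(patch)) // data/level1.pak
}
```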
If we look at a real game, Mewnbase:
- The two most recent builds are 82MiB
- The rsync patch is 33MiB
- The bsdiff patch is 174KiB
As mentioned, bsdiff happens in two phases: (suffix) sorting, and scanning. If we re-use the original 2003 implementation of it, here's what it looks like for this game in particular:
- Sorting takes 13.6s
- Scanning takes 1s
The numbers vary with the data being diffed, but overall, sorting is by far the most expensive part.
Parallelizing Larsson-Sadakane
Luckily, the suffix sorting algorithm used by the original bsdiff, which we'll call Larsson-Sadakane, can be parallelized relatively easily.
I won't go into too many details here, but basically, Larsson-Sadakane uses a trick to construct the suffix array faster. It stems from the observation (made by Karp, Miller and Rosenberg) that each proper suffix of a string is also a suffix of another suffix. So, instead of sorting every suffix by the first symbol, then the second, then the third, the fourth, and so on, we can re-use the previous passes to sort by the first symbol, then the first two, the first four, the first eight, the first sixteen, etc. - reducing the number of passes to log(n) instead of n.
Each of these passes divides the suffix array into buckets (which are later merged back together), and each bucket is then sorted. Since the buckets are sorted individually, that work can be done in parallel on different cores. The parallel version of Larsson-Sadakane I implemented uses one additional array of n integers, containing the result of the previous pass - so that the various cores don't read from memory that's currently being written to by other threads.
In practice, this works relatively well. Using 12 cores, the runtime for optimizing that particular Mewnbase patch looks like this:
- Sorting takes 4.68s (13.6s before)
- Scanning takes 1s
Scanning takes exactly the same amount of time, since it's the same code working from the same suffix array, but sorting is 3x faster. This is both good news (faster bsdiff, finally!) and bad news: a 3x speed-up from using 12 cores isn't that great. Distributing the buckets to the cores for sorting and synchronizing the whole thing has a non-negligible cost here.
Modern suffix sorting algorithms
Suffix sorting has been a subject of research over the past few decades, not because everyone is eager for smaller game updates (I wish!), but because it has applications in biotechnology: DNA is a series of four symbols, and being able to find common sequences among different genomes quickly is very important. (Except they run their research on super-computers that routinely have more than 128 cores. But let's not get picky here.)
For that reason, SACAs newer than Larsson-Sadakane (1999) have been designed. In recent times, the most interesting research comes from Ge Nong (Professor at Sun Yat-sen University in Guangzhou, China). One of his papers, "An optimal suffix array construction algorithm", describes exactly what you'd expect from the title.
I haven't had the time to research how the algorithm works well enough to describe it here, but I can tell you it's based on SA-IS, that it uses induced sorting (similar to DC3? correct me if I'm wrong), and that if you're really curious and want to dig in, you can read up on induced sorting here, and read the SA-IS paper here.
*[SACA]: Suffix Array Construction Algorithms
And since the output is still exactly the same suffix array, it's a drop-in replacement for the first phase of bsdiff! Using a Go implementation of that algorithm, we get the following runtimes:
- Sorting takes 9.9s (13.6s before)
- Scanning takes 1s
On other datasets, the difference is more striking:
- dictwords (200KiB) is sorted in 51ms using Larsson-Sadakane, 21ms using gosaca
- dictcalls (2MiB) is sorted in 1237ms using Larsson-Sadakane, 355ms using gosaca
The bigger the input, the faster gosaca is compared to Larsson-Sadakane. But it's still slower overall than parallel Larsson-Sadakane. If only there were a way to parallelize it...
Parallelizing "an optimal SACA"
When we parallelized Larsson-Sadakane, we still ended up with exactly the same suffix array. I don't understand "optimal SACA" well enough to parallelize it the same way - but what if, instead of constructing a suffix array for the entire old file, we divided it into P intervals and constructed suffix arrays for those in parallel?
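Structurally, that partitioning looks like this - with a naive sort standing in for the real construction algorithm (gosaca, in butler's case), since the point here is only the "one suffix array per interval, one goroutine per interval" shape:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// naiveSuffixArray stands in for a real suffix array construction algorithm
// (gosaca, in butler's case) - it just sorts suffix positions directly.
func naiveSuffixArray(data []byte) []int {
	sa := make([]int, len(data))
	for i := range sa {
		sa[i] = i
	}
	sort.Slice(sa, func(a, b int) bool {
		return string(data[sa[a]:]) < string(data[sa[b]:])
	})
	return sa
}

// parallelSuffixArrays splits data into p intervals and builds one suffix
// array per interval, each on its own goroutine.
func parallelSuffixArrays(data []byte, p int) [][]int {
	arrays := make([][]int, p)
	size := (len(data) + p - 1) / p

	var wg sync.WaitGroup
	for i := 0; i < p; i++ {
		start, end := i*size, (i+1)*size
		if start > len(data) {
			start = len(data)
		}
		if end > len(data) {
			end = len(data)
		}
		wg.Add(1)
		go func(i, start, end int) {
			defer wg.Done()
			arrays[i] = naiveSuffixArray(data[start:end])
		}(i, start, end)
	}
	wg.Wait()
	return arrays
}

func main() {
	data := []byte("banana cherry apple pie, banana cherry apple sauce")
	for i, sa := range parallelSuffixArrays(data, 4) {
		fmt.Printf("interval %d: %d suffixes sorted\n", i, len(sa))
	}
}
```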
As expected, doing this is faster. On an eight-core machine:
- dictwords (200KiB) takes
- 21ms to sort as a single interval
- 12ms as two intervals sorted in parallel
- 8ms as four intervals
- 7ms as eight intervals
- dictcalls (2MiB) takes
- 355ms to sort as a single interval
- 171ms as two intervals sorted in parallel
- 95ms as four intervals
- 72ms as eight intervals
Of course, this isn't a silver bullet. The problem is that we end up with P different suffix arrays, one for each interval of the old file. And when doing the "scanning" phase of bsdiff, we have to look in each of them, because we don't know in which interval the best match lies.
But let's look at optimizing a patch for a much larger game: Aven Colony. A single version of the game nowadays is 11GiB. Using single-core (original) gosaca, we get:
- A reduction of size from 532MiB to 125MiB
- Sorting took 25 minutes
- Scanning took 5 minutes and 20 seconds
If we use two intervals, sorting should be reduced to about 13 minutes, and scanning would only get twice as slow, i.e. 10 minutes - which would reduce processing time from 35 minutes to 23 minutes - still a win!
Using more intervals would not help: let's say we reduce sorting to 7 minutes with 4 intervals; scanning would now dominate at 20 minutes, giving a total runtime of 27 minutes, which is a setback.
Parallelizing scanning
We've seen that constructing suffix arrays on intervals of the old file results in faster patch optimization overall, but also that the gains are still pretty minimal, since scanning becomes slower.
So how can we reduce the cost of scanning, when we have P cores available? I see two strategies.
The first is, now that we have P suffix arrays, to search them all in parallel every time the bsdiff scanning process looks for the longest exact match. In practice, this ended up being slightly worse than looking in each suffix array sequentially, because of the synchronization costs typical of trying to do anything in parallel.
The second strategy is to run the bsdiff scanning phase on "blocks" of the new file in parallel. The synchronization costs are much lower because each unit of work (bsdiffing a few megabytes of the new file) is much larger. Each bsdiff scan reads from the same suffix array (which, after sorting, is read-only) and constructs a list of approximate matches, which are then put back together in order to write the patch.
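Here is the shape of that second strategy, with the actual bsdiff scan abstracted away behind a placeholder function: each chunk of the new file is scanned on its own goroutine against the same read-only suffix array, and the per-chunk results are collected back in their original order (this is a structural sketch, not wharf's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// scanChunk stands in for the real bsdiff scan: it would walk newChunk,
// look up matches in the (read-only) suffix array of the old file, and emit
// approximate-match / fresh-data entries for the patch.
func scanChunk(id int, newChunk []byte, oldSuffixArray []int) string {
	return fmt.Sprintf("patch entries for chunk %d (%d bytes)", id, len(newChunk))
}

// parallelScan splits newData into chunks, scans them concurrently against
// the same suffix array, then reassembles the results in their original order.
func parallelScan(newData []byte, oldSuffixArray []int, chunkSize int) []string {
	var chunks [][]byte
	for start := 0; start < len(newData); start += chunkSize {
		end := start + chunkSize
		if end > len(newData) {
			end = len(newData)
		}
		chunks = append(chunks, newData[start:end])
	}

	results := make([]string, len(chunks))
	var wg sync.WaitGroup
	for i, chunk := range chunks {
		wg.Add(1)
		go func(i int, chunk []byte) {
			defer wg.Done()
			results[i] = scanChunk(i, chunk, oldSuffixArray)
		}(i, chunk)
	}
	wg.Wait()
	return results // in order, ready to be written out as the patch
}

func main() {
	newData := make([]byte, 10000)
	for _, entry := range parallelScan(newData, nil, 4000) {
		fmt.Println(entry)
	}
}
```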
In practice, this gives excellent results. For a randomly generated sample file of ~800KiB, with similar random changes at various intervals:
- sequential bsdiff took 8.10s (7s sorting, 1s scanning)
- 2-intervals bsdiff took 4.48s (3.47s sorting, 938ms scanning)
- 4-intervals bsdiff took 3.10s (2.09s sorting, 939ms scanning)
- 8-intervals bsdiff took 2.74s (1.64s sorting, 1.03s scanning)
Note that the "scanning" figures reported here also include subtracting the old data from the new data for the approximate matches, and writing the result as zstd-compressed protobuf messages.
The important thing to notice here is that, as intuited, sorting time goes down, whereas scanning time stays relatively constant. Technically more work is done, but it's distributed evenly across more cores, so the wall-clock time remains the same.
Preliminary results
By using 11 partitions (number of cores - 1), we achieve the following runtime when optimizing Mewnbase:
- Sorting takes 1.54s (13.6s before)
- Scanning takes 866ms (1s before)
Overall, this took patch optimization time from 14.84s to 2.84s, a 5.22x speedup.
On the bigger game, Aven Colony, total processing time went from 69 minutes to 10 minutes, almost a 7x speedup.
These improvements to bsdiff all combined help us deliver updates to itch.io players faster and keep our bandwidth costs in check!
Links!
If you like this, you may also like Things that can go wrong when downloading, a similar simplified explanation of some challenges we face, this time when installing a game initially.
The techniques presented above are used in itch.io refinery, a customizable toolset for first releases & playtests - and for a bunch of already-released games via the itch.io app, available now for Linux, Windows and macOS.
Everything presented here is part of our open diffing and patching protocol, wharf. You can check out the source code directly on our GitHub organization.
If you'd like to learn more about itch.io, I recommend heading over to:
- Our about page
- Leaf Corcoran's original post, Announcing itch.io
- Leaf's 2015 article, Running an indie game store
And finally, if you'd like to buy the games used as examples in this article, here you go:
Thanks for reading, and, until next time, take care!