# Fast font packing for fun and profit

Being creative is hard work, let's go optimizing instead! My graphics engine dye was pretty naive about displaying text, and it was wasteful. Let's see how I made it all better with this one weird tip.

Disclaimer: Even after a few years I'm still very much an OpenGL newbie. Please don't hit me with crowbars.

Once upon a time, OpenGL was easy to use - and also falling out of relevancy as far as high-performance 3D graphics were concerned. But it wasn't all bad! You could basically pick up any library out there and integrate it with your existing GL project. Not that it's a good idea, but it usually just worked.

For example, if you wanted to display TrueType fonts in your OpenGL application, you could just go ahead and pick up ftgl (with ooc bindings), and you'd be a happy camper.

But then later you might want to support OpenGL ES, or just OpenGL 3, and then FTGL wouldn't know what to do with a core context, where glPushMatrix and glVertex2f are things of the past. Of course, you could look for an OpenGL 3 port of FTGL, but where's the fun in that?

Displaying TrueType text isn't that hard, at least for the simple case (ASCII, Americano-centric text). Let's go over a few approaches.

## Pre-processing into Bitmap fonts

The nice thing about TrueType fonts is that they're basically vector graphics with a shitton of metrics on top of them. So, they look good at pretty much any size.

But in your game, chances are you don't have arbitrarily large text. And if you do, it's probably upscaled from a lower-res texture. So if you don't feel like loading TrueType fonts into your game directly and having to bother with all of that, you can turn them into bitmap fonts instead, maybe using something like BMFont.

That is a totally valid approach, which I chose to disregard, because, again, where's the fun in that? And I tend to dislike approaches that require a lot of preprocessing / external tools without a lot of obvious benefits. Choosing a good font is hard, I'd rather have them easy to swap in and out.

## Using FreeType

FreeType is life, FreeType is love - it's basically the open-source package to read, interpret, and render .ttf (TrueType) font files. It usually goes a little something like this:

• Initialize FreeType globally (once per application)
• Load a new font face (a .ttf file, basically)
• Set a character size with a given resolution - units are a bit weird. From experience, stick to 72dpi. FreeType accepts font sizes in 1/64ths of a point, so you want to pass fontSize * 64 as the first argument to setCharSize.
• Render each 'glyph' (basically, character) that you want into bitmaps
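To make the unit weirdness concrete, here's a small sketch of the arithmetic in Python. The helper names `to_26_6` and `pixel_size` are made up for illustration; they are not part of FreeType or of its ooc bindings.

```python
def to_26_6(points: float) -> int:
    """Convert a size in points to 1/64th-of-a-point units."""
    return int(round(points * 64))


def pixel_size(points: float, dpi: int = 72) -> float:
    """Nominal size in pixels: a point is 1/72 of an inch."""
    return points * dpi / 72


char_size = to_26_6(16)      # the value you'd hand to setCharSize
pixels = pixel_size(16, 72)  # at 72dpi, one point maps to one pixel
```

That last line is why sticking to 72dpi is convenient: the font size you ask for is also the size in pixels.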

And then you have a black-and-white bitmap of each glyph somewhere in memory. This is easy to convert into an RGBA32 format OpenGL likes. Good! But what now? How can you display them? Well, you have a few solutions.
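The conversion could look something like the sketch below. It assumes the rendered bitmap has one byte of coverage per pixel and no padding between rows; the function name `coverage_to_rgba` is hypothetical. The color stays constant and the glyph shape goes entirely into the alpha channel.

```python
def coverage_to_rgba(coverage: bytes, color=(255, 255, 255)) -> bytes:
    """Expand one-byte-per-pixel coverage into 4-byte RGBA pixels.

    Assumption: each input byte is 0 (empty) to 255 (fully covered).
    """
    r, g, b = color
    out = bytearray(len(coverage) * 4)
    for i, alpha in enumerate(coverage):
        # Same color everywhere, coverage becomes the alpha value.
        out[4 * i:4 * i + 4] = bytes((r, g, b, alpha))
    return bytes(out)
```

With white as the base color, the text can still be tinted later by multiplying with a color in the shader.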

## One texture per glyph

That was my first approach. If you handle each glyph as a separate sprite, you can just lay them out properly and you'll have text rendering. Don't go home quite yet though - that approach is perhaps the least efficient.

Let's say that, like me, you only care about characters between ASCII 32 and ASCII 126 (all the printable characters) - that makes 95 glyphs in memory, and each of those glyphs has:

• One OpenGL texture, rounded up to the nearest power of two (because rectangle textures are evil-ish)
• An OpenGL VBO (Vertex Buffer Object) to contain the geometry (texture coordinates and vertex coordinates), an OpenGL VAO (Vertex Array Object) to contain components/stride info + other shader info.

And you have to use one OpenGL draw call per glyph. It might not seem like a big deal but let's put that into perspective - with a simple scene, 7 tile layers, 1 main character, and one line of text (about 20 letters long), 2/3 of our draw calls would be dedicated to text rendering. Not so hot.

The advantage of this approach, of course, is that even if you modify the text 60 times per second, the cost doesn't grow - all the textures, VBOs and stuff are already there, it's just a matter of drawing them in the right places.

## One texture per text object

The way text works in dye is that you have GlText objects that have a value string. One can assume at least a bit of coherency: they won't be updated every frame (and even if the value 'pointer' is updated every frame, e.g. in naive UI code, chances are the actual string contents are the same, and that's an easy equality test). So what we can do instead is forget about an individual texture per glyph, and have one texture per text object.
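Here's a minimal sketch of that equality test in Python. The `TextObject` class is a hypothetical stand-in, not dye's actual GlText, and the rebuild is just a counter standing in for the expensive part.

```python
class TextObject:
    """Hypothetical text object that rebuilds its texture lazily."""

    def __init__(self, value: str):
        self.value = value
        self._rendered_value = None  # contents the texture was last built from
        self.rebuilds = 0

    def prepare(self) -> None:
        """Rebuild the texture only if the string contents changed."""
        if self.value == self._rendered_value:
            return  # same contents, even if a new string object was assigned
        self._rebuild_texture()
        self._rendered_value = self.value

    def _rebuild_texture(self) -> None:
        # Placeholder for the real work: layout, blitting, upload.
        self.rebuilds += 1
```

Calling `prepare()` every frame is then cheap as long as the contents stay the same.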

Whenever text changes (or at initialization), you need to do the following:

• Create a texture big enough to fit all the text in the text object (so, you need to grok the font metrics and go through the text - are there newlines, how do you handle them, is the text aligned, do you allow custom spacing, etc.)
• Manually blit glyph bitmap data (that you got from FreeType, remember?) into your texture
• Upload that texture to OpenGL (glTexImage2D)
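The blitting step, sketched in Python under the assumption that both buffers are row-major with no padding between rows. The `blit` helper is hypothetical, and real code would more likely do this with memcpy-style row copies in the engine's own language.

```python
def blit(dst: bytearray, dst_width: int, src: bytes, src_width: int,
         src_height: int, x: int, y: int, bpp: int = 4) -> None:
    """Copy a src_width x src_height bitmap into dst at pixel (x, y).

    bpp is the number of bytes per pixel (4 for RGBA32).
    """
    dst_height = len(dst) // (dst_width * bpp)
    if x < 0 or y < 0 or x + src_width > dst_width or y + src_height > dst_height:
        raise ValueError("bitmap does not fit in the destination texture")
    row_bytes = src_width * bpp
    for row in range(src_height):
        src_start = row * row_bytes
        dst_start = ((y + row) * dst_width + x) * bpp
        # One slice copy per row: rows are contiguous, columns are not.
        dst[dst_start:dst_start + row_bytes] = src[src_start:src_start + row_bytes]
```

That's one copy per glyph row, for every glyph, every time the text changes.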

Now, that's bad on several levels - first off, chances are you'll be creating/destroying textures often. You could be smart (and complicate your code) by allocating textures bigger than needed and actually updating only a subset (with glTexSubImage2D) of the texture, but that doesn't get you off the hook - you're still blitting (copying bitmap data) by hand, on the CPU, using either loops or memcpy or whatever, and that's kind of silly when you have a GPU sitting around doing practically nothing.

So, for text that never ever changes, that might be the fastest approach - you only have one quad (well, two triangles, or one if you're smart about texture coordinates). For everything else, e.g. the UI in an action game that changes pretty much constantly, it's not that hot.

## One texture per font-size tuple, one VBO per text object

That's the approach used in freetype-gl - from which I heavily drew inspiration for my own implementation.

The idea is to manually blit all the glyphs you need into a single texture that's tightly packed.

The "one texture per glyph" approach was wasteful in more ways than one - since we only want power-of-two texture sizes, often the texture was larger than needed - it's not a big deal if the glyph is 12x14 and the texture is 16x16, but a 257x257 glyph would then require a 512x512 texture... and that gets wasteful quickly.
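To put numbers on the power-of-two waste, here's a quick sketch in Python. Both helper names are made up for this example.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    power = 1
    while power < n:
        power *= 2
    return power


def wasted_fraction(width: int, height: int) -> float:
    """Share of the power-of-two texture that the glyph leaves unused."""
    tex_w = next_power_of_two(width)
    tex_h = next_power_of_two(height)
    return 1 - (width * height) / (tex_w * tex_h)
```

A 12x14 glyph in a 16x16 texture wastes about a third of it, which is only 88 pixels. A 257x257 glyph in a 512x512 texture wastes roughly three quarters, and that's close to 200,000 pixels.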

Another way that approach was wasteful was that having that many different textures pretty much prevented us from having a single draw call for a text object, even if we wanted to - we'd have to sample from many different texture objects and I don't know a simple way to do that in OpenGL.

But by having all the glyphs for a font-size tuple in a single texture, we can have a single Vertex Buffer that contains a whole text object. In dye, VBOs are usually laid out like this:

• 2 floats for 2D texture coordinates
• 2 floats for 2D vertex coordinates
• rinse, repeat

Since GL_QUADS is deprecated in recent versions of OpenGL, we need two triangles, aka 6 vertices, which makes 6 * 4 = 24 floats per character. Surely there's a way to reduce that even more, but I'm not going to cover it here. (Basically: GL_TRIANGLE_STRIP allows you to define a quad in only 4 vertices, but then you need degenerate polygons to move on to another quad elsewhere - so you go from 6N vertices to 6N - 2, which is not a huge saving.)
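Here's what building that buffer could look like, as a Python sketch. The function names, the tuple layout for glyphs, the winding order, and the y-down coordinates are all assumptions made for the example, not dye's actual code.

```python
def quad_vertices(x, y, w, h, u0, v0, u1, v1):
    """Return 24 floats: two triangles, each vertex stored as (u, v, x, y)."""
    top_left = (u0, v0, x, y)
    top_right = (u1, v0, x + w, y)
    bottom_left = (u0, v1, x, y + h)
    bottom_right = (u1, v1, x + w, y + h)
    data = []
    # Two triangles sharing the top_right / bottom_left diagonal.
    for vertex in (top_left, bottom_left, top_right,
                   top_right, bottom_left, bottom_right):
        data.extend(float(component) for component in vertex)
    return data


def text_vertices(glyphs):
    """Concatenate the quads of (x, y, w, h, u0, v0, u1, v1) glyph tuples."""
    data = []
    for glyph in glyphs:
        data.extend(quad_vertices(*glyph))
    return data
```

A 20-letter line of text comes out to 20 * 24 = 480 floats, drawn in a single call.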

But wait, we didn't even cover how to pack a font tightly in a single texture. When faced with a hard problem that's been solved before, pretend to read the paper, and then follow suit and port the C++ code.

You know what, let's do things proper. With pen and paper even. Let's dive into that algorithm.

## The RectangleBinPack algorithm

RectangleBinPack is an algorithm and a data structure. It's basically a binary tree. You have to choose a 'bin size'. Let's try it with a 6x6 bin. At first, it looks like this:

There's only one node in our tree, we'll name it "root" (although nodes don't actually have names in the data structure). Its position is 0, 0 (the origin is on top left), and its size is 6x6. So far, so good.

Let's try to insert a 1x2 rectangle in there:

What happened? Well, the 1x2 rectangle fit in the 6x6 root node, the root node was a leaf (it had no children), so:

• The root node changed size, from 6x6 to 1x2
• ...and it split in the vertical direction: it now has two children, l (a 1x4 rectangle at 0,2) and r (a 5x6 rectangle at 1,0).

The root node is no longer a leaf node, so it no longer represents free space, but occupied space instead. l and r now span the space that remains free.

Let's insert a 3x1 rectangle now. We'll walk our tree, depth-first:

• Root node: has children, not a leaf node, so we don't even try to fit and explore the children instead
• l node: is a leaf node, but a 3x1 rectangle doesn't fit in there, we return without success.
• r node: is a leaf node, and fits, so r node changes size to 3x1 and splits in the horizontal direction, into l2 and r2.

We can start to see that the algorithm tries to leave big spaces free by the way it chooses split directions. We can also see that for a low occupancy, most shapes are going to fit in the shape of an inverted L (or a capital gamma if you want).

Let's do one last step to show something interesting. Let's insert a 5x2 rectangle. Same as before, we do a depth-first walk (implemented here by a recursive function call):

• root is not a leaf node, let's explore the children
• l - leaf node, but too small
• r - not a leaf node, let's explore the children
• l2 - leaf node, but too small
• r2 - now we're talking. We have room, so we change size to 5x2 and split into l3 and r3.

But wait, where is l3? The tree tells us it's a 0x2 node at 6,1. The area of l3 is zero, since its width is 0: we call that a "degenerate" node. We could add a special case in our algorithm to remove degenerate nodes from the tree, or we could leave it in: since no rectangle is ever going to fit in there, it doesn't hurt. Traversal performance is barely impacted either, because a degenerate node is never going to split.
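Here's a minimal sketch of that walk in Python. The rule for picking the split direction (compare the leftover width to the leftover height) is an assumption: it's chosen so the sketch reproduces the three insertions above, and the actual dye code may differ in the details.

```python
class Node:
    """A free rectangle (leaf), or an occupied one with two children."""

    def __init__(self, x, y, width, height):
        self.x, self.y = x, y
        self.width, self.height = width, height
        self.left = None
        self.right = None

    def is_leaf(self):
        return self.left is None and self.right is None


def insert(node, width, height):
    """Depth-first search for a leaf that fits. Returns the node or None."""
    if not node.is_leaf():
        return (insert(node.left, width, height)
                or insert(node.right, width, height))
    if width > node.width or height > node.height:
        return None

    w = node.width - width    # leftover space beside the rectangle
    h = node.height - height  # leftover space below the rectangle
    if w <= h:
        # Horizontal split: left is the strip beside the rectangle,
        # right is everything below, across the node's full width.
        node.left = Node(node.x + width, node.y, w, height)
        node.right = Node(node.x, node.y + height, node.width, h)
    else:
        # Vertical split: left is the strip below the rectangle,
        # right is everything beside it, across the node's full height.
        node.left = Node(node.x, node.y + height, width, h)
        node.right = Node(node.x + width, node.y, w, node.height)
    node.width, node.height = width, height  # the node is now occupied
    return node
```

Replaying the walkthrough is three calls on `root = Node(0, 0, 6, 6)`: `insert(root, 1, 2)`, `insert(root, 3, 1)`, then `insert(root, 5, 2)`. The last one leaves behind the degenerate 0x2 node at 6,1.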

And there you have it. If you keep doing that you'll eventually reach a point where all your rectangles are snuggling together in a bin, or you'll reach a point where the bin's occupancy is too high to fit another rectangle in there, and then you can start over with a larger bin.

For dye, we start out with a 32x32 bin and then go on to try 64x64, 128x128, etc.
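The retry loop could be sketched like this in Python. The names, the 4096 upper limit, and the `try_pack` callback shape are all assumptions for the example. To keep the block self-contained, `row_pack` is a deliberately naive stand-in that fills rows left to right; it is not the tree-based packer described above.

```python
def pack_with_growing_bin(sizes, try_pack, start=32, limit=4096):
    """Retry with bins of 32, 64, 128... until every rectangle fits.

    try_pack(bin_size, sizes) returns a list of (x, y) positions, or None
    if the rectangles don't all fit in a bin_size x bin_size bin.
    """
    bin_size = start
    while bin_size <= limit:
        positions = try_pack(bin_size, sizes)
        if positions is not None:
            return bin_size, positions
        bin_size *= 2  # start over with a bin twice as wide and tall
    raise ValueError("rectangles do not fit in the largest allowed bin")


def row_pack(bin_size, sizes):
    """Naive stand-in packer: fill rows left to right, top to bottom."""
    x = y = row_height = 0
    positions = []
    for width, height in sizes:
        if width > bin_size:
            return None
        if x + width > bin_size:  # no room left in this row, start a new one
            x, y, row_height = 0, y + row_height, 0
        if y + height > bin_size:
            return None
        positions.append((x, y))
        x += width
        row_height = max(row_height, height)
    return positions
```

Starting over from scratch sounds wasteful, but it only happens a handful of times per font-size tuple, since each retry quadruples the available area.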

## Future work

There's still a lot that could be improved on the work done so far:

• The rectangle packer could try to do a π/2 rotation on the rectangles to see if it is more optimal. But then our text display code would have to take that into account when computing texture coordinates.
• Right now, when text is changed, the VBO data is uploaded via glBufferData - we could squeeze more performance by using glMapBufferRange and writing to the system-side buffer instead
• There's still a lot of state switching in dye that I'm not fond of - since we don't allow external libraries to mess with our OpenGL context, maybe keeping track of which shader/texture unit is bound could reduce the number of OpenGL calls even more.
• The transformation code in dye generates too much garbage - every matrix is GC-heap-allocated yet we know exactly when we don't need them anymore.
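As a rough idea of what the state tracking from the list above might look like, here's a Python sketch. The `BindCache` class and the `bind_fn` callback are hypothetical; in a real engine the callback would wrap whichever GL bind call applies to that slot.

```python
class BindCache:
    """Remember the last bound object per slot and skip repeat binds."""

    def __init__(self, bind_fn):
        self._bind_fn = bind_fn  # hypothetical wrapper around the GL call
        self._current = {}       # slot name -> id of the bound object

    def bind(self, slot, object_id):
        """Bind object_id to slot. Returns False if it was already bound."""
        if slot in self._current and self._current[slot] == object_id:
            return False  # already bound, no GL call needed
        self._bind_fn(slot, object_id)
        self._current[slot] = object_id
        return True
```

This only works because nothing else touches the context: if an external library changed a binding behind our back, the cache would go stale.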

Anyway, I hope you enjoyed the article! Let this article prove that I'm just an attention-grabber looking for fame and to make an easy buck by talking about popular subjects everybody wants to read about.

Until next time, take care!
