While looking to write a pure ooc version of ftgl, I was reading the source of ftgl-gl3 and I stumbled upon this piece of code:

cpp
static inline GLuint NextPowerOf2(GLuint in)
{
     in -= 1;

     in |= in >> 16;
     in |= in >> 8;
     in |= in >> 4;
     in |= in >> 2;
     in |= in >> 1;

     return in + 1;
}

This is needed because dealing with power-of-two textures (32x32, 64x64, 128x128, etc.) is more efficient with OpenGL (some implementations don't even support non-power-of-two textures!).

As it turns out, it's an old trick - and it's been explained before, but I had a bit of a hard time wrapping my mind around it, so I thought I'd blog about it here.

Purpose

So the basic idea is that we have a minimum size for our texture, for example 45x187, and we want to know the smallest power-of-two texture size that will fit. In that case, we can see that it'll be 64x256 - but how do we compute that efficiently?

We need a function that maps 45 to 64 and 187 to 256 - but that works for all integer values.

Naive approaches

One naive approach would be this:

ooc
nextPowerOfTwo: func (in: Int) -> Int {
  match {
    case in <= 1    => 1
    case in <= 2    => 2
    case in <= 4    => 4
    case in <= 8    => 8
    case in <= 16   => 16
    case in <= 32   => 32
    case in <= 64   => 64
    // and so on up to 2^32, ie. 4_294_967_296
  }
}

This will work just fine, but we're doing lots of comparisons.

Besides, the code isn't very DRY - we're mentioning each power of two twice.

We could do something else, like:

ooc
nextPowerOfTwo: func (in: Int) -> Int {
  pot := 0
  out := 1
  
  while (pot < 32 && in > out) {
    out = out << 1
    pot += 1
  }
  out
}

But we're still doing 32 loop iterations in the worst case - along with an add, a left shift, and two comparisons for each iteration.

Of course, a sufficiently smart compiler would be able to optimize that. Wouldn't it? Let's find out.

LLVM IR for the naive approach

Let's see the LLVM IR code generated by clang 3.2 that ships with OSX 10.8.2. I used rock -v --noclean -c -cc=clang +-S +-emit-llvm npot.ooc, and found it in .libs/ooc/npot/npot.o (I guess rock could use a -S option of its own so it wouldn't put assembly in .o files...).

llvm
define i32 @npot__nextPowerOfTwo(i32 %in) nounwind uwtable readnone ssp {
  %1 = icmp sgt i32 %in, 1
  br i1 %1, label %.lr.ph, label %._crit_edge

.lr.ph:                                           ; preds = %0, %.lr.ph
  %out.02 = phi i32 [ %2, %.lr.ph ], [ 1, %0 ]
  %pot.01 = phi i32 [ %3, %.lr.ph ], [ 0, %0 ]
  %2 = shl i32 %out.02, 1
  %3 = add nsw i32 %pot.01, 1
  %4 = icmp slt i32 %3, 32
  %5 = icmp slt i32 %2, %in
  %6 = and i1 %4, %5
  br i1 %6, label %.lr.ph, label %._crit_edge

._crit_edge:                                      ; preds = %.lr.ph, %0
  %out.0.lcssa = phi i32 [ 1, %0 ], [ %2, %.lr.ph ]
  ret i32 %out.0.lcssa
}

There's nothing too surprising here - LLVM IR is a bit destabilizing because it uses phi nodes instead of straightforward goto/jumps - but that makes implementing optimizations a lot easier because the flow is easier to analyze.

So, as we can see, there are 6 registers being allocated here: the first one is the result of a signed int comparison between the input parameter, in, and

  1. LLVM found a 'critical edge' case, and it thought it'd be a good idea to immediately return 1 if the input is 1, instead of going through the shift loop.

Whether or not it is actually a good idea is left as an exercise to the reader (hint: textures of 1x1 are actually useless - but the compiler doesn't know that).

After the .lr.ph label, we see two phi nodes - if we came here for the first time, we initialize out to 1, and pot to 0 - as specified in the source. If we came from a previous iteration, we just carry the values stored in registers %2 and %3 from before.

Then we have a shl (shift left) instruction on %out.02 (the previous value of out) and an add (integer addition), on %pot.01 (the previous value of pot). The nsw specifier means "no signed wrap", ie. we don't want the value to wrap around when the binary representation reaches 2^31. In fact, at this point I realize my code is probably slightly wrong, as I'm using a signed Int and yet I'm handling values up to 2^32 - which is wrong.

Next up, the two comparisons in our loop, pot < 32 and in > out (%in is not stored in a local register, LLVM IR is convenient like that - you can just refer to parameters and it'll do the right thing). Then the logical and between the results of our two tests, and finally a branching instructions - if the two conditions are satisfied, loop by jumping to label .lr.ph, otherwise, jump to %._crit_edge.

At .crit_edge, we have two cases: either we came from the critical edge case, directly from the first branching instruction, in which case we set %out.0.lcssa to 1, or we came from the loop, and the return value is stored in register %2.

Then we return a signed 32-bit int. All good so far.

Intel x86_64 assembly for the naive approach

While lower-level than C in some aspects (dealing with direct type widths, using control flow such as jumps whereas goto is recommended against in C), LLVM IR is still a relatively high-level view of things (phi nodes, virtual registers that aren't mapped to hardware registers yet, etc.).

So I thought it'd be interesting to look at the output of the assembly output, compiled by LLVM with the -O2 flag. For that, I compiled a test program with rock -v +-O2 --cc=clang npot, and used llvm-objdump -disassemble npot to look at the disassembled code.

x86 assembly
pushq   %rbp
movq    %rsp, %rbp
movl    $1, %ecx
movl    $1, %eax
cmpl    $2, %edi
jl      26
nopw    %cs:(%rax,%rax)
addl    %eax, %eax
cmpl    $31, %ecx
jg      6
incl    %ecx
cmpl    %edi, %eax
jl      -13
popq    %rbp
ret

This was disassembled by llvm-objdump and as-is, is a bit harder to read - the first reason is that we have the function prologue and epilogue at the start and end of the function, because we have amd64-specific instructions (nopw), and because jumps (jl, jg) are relative to the binary itself - hence, not terribly convenient for a human to read.

But let's give it a shot anyway. The first interesting thing to notice is the second cmpl instruction: it compares with the integer value 31. Why is that? Our original comparison was pot < 32. Also, it initializes two registers to 1, whereas we were using 0 and 1.

Well, let's try to rewrite the assembly above in pseudo-code to try and understand what happens:

label nextPowerOfTwo:
  // function prologue
  ecx = 1
  eax = 1
  if (edi < 2) goto crit_edge
label loop:
  eax += eax
  if (ecx > 31) goto crit_edge
  ecx += 1
  if (eax < edi) goto loop
label crit_edge:
  // function epilogue

Okay, that's a bit clearer, but really convoluted still - that's what you get for reading optimized assembly. I've removed padding and the prologue/epilogue for readability.

Here, we use only three registers, all of which store 32-bit ints

Let's read through it now: it seems that the critical edge case detected by LLVM can still be seen in this assembly code - the first compare/jump handles the case where in is equal to 1 - and jumps directly to the function epilogue, as eax already contains the right return value.

Why is it doing a jl (jump if less than), though? Maybe it's a shorter opcode (smaller binary size). Another explanation would be that it's faster to do a less than than an equal comparison. For example, doing 0b1000_0000_0000_0000 < 0b0111_0000_0000_000 requires only one bit comparison. Whereas doing 0b0000_0000_0000_0001 == 0b0000_0000_0000_0001 would require 32 bit comparisons, if the processor goes from most significant (left) to least significant (right) bits. But perhaps I'm trying to overjustify here.

Let's read the code further. There's no sign of shift in sight, but instead, the assembly reads addl %eax, %eax - adding out with itself. Indeed, doing out <<= 2 and out += out is exactly the same - except the latter is faster, so we can give credit for our assembler to do the most basic of optimizations here.

Then, the second if we encounter is also a classic optimization known as Short-circuit evaluation. In short, if you do a && b, and a is false, there's no need to test for the value of b. That's exactly what happens here - since we have a logical and in our while loop, as soon as pot reaches 31, we're done. Or is it 32?

And now we're starting to understand something... something terrible... it's that the optimizer messed with our pot! Our original code had pot range from 0 to 31 (included), whereas our optimizer made it range from 1 to 32 (included).

So if you were to use a debugger on that code, and you somehow managed to break inside the loop, you could witness that what used to be our pot variable is greater by 1 than what we had intended. However, what does it matter? It's only used in the control flow - and never gets returned, or passed to a subroutine. So it's the optimizer's perfect right to change it.

Back to our while loop: if the first condition is true, we increment %ecx (aka pot + 1), and test our second condition. If %eax (aka out) is greater or equal to %edi (aka in), we're all good and we can return - otherwise, we jump back to the beginning of the loop.

So, despite all the torture (sorry, optimizations) LLVM -O2 made our code go through, the basic algorithm is still the same: we have a loop, and in the worst case, it does 32 iterations, along with a few integer operations and comparisons.

Even writing the same algorithm in assembly by hand wouldn't help - in fact, I definitely wouldn't have thought of the smart tricks I explained above. I only learned about them by looking at the assembly output - which I encourage everyone to do more often.

The bit-twiddling approach

Now let's go back to the piece of code in the intro - also known as the bit-twiddling approach. Instead of using a loop, this approach uses properties of binary representations.

An example with the number 13 as entry would give this:

  13 // 0000 1101
  12 // 0000 1100 // -= 1
  12 // 0000 1100 // |= in >> 16
  12 // 0000 1100 // |= in >> 8
  12 // 0000 1100 // |= in >> 4
  15 // 0000 1111 // |= in >> 2
  15 // 0000 1111 // |= in >> 1
==================================
  16 // 0001 0000 // + 1

Maybe this is enough for you to understand the process - but keep reading, more disassembly ahead.

Let's go into more details: we start out with any 32-bit integer. Let's pick 100663860, which has the following binary representation (most significant bit is on the left):

    0000 0110 0000 0000  0000 0010 0011 0100

And then we do a binary or of itself shifted by 16 positions to the right. This gives us:

    0000 0110 0000 0000  0000 0010 0011 0100 // original
    0000 0000 0000 0000  0000 0110 0000 0000 // shifted by 16
    =============== binary or ==============
    0000 0110 0000 0000  0000 0110 0011 0100 // result

And then we take the result, and do a binary or of itself shifted by 8 positions to the right, this time:

    0000 0110 0000 0000  0000 0110 0011 0100 // previous result
    0000 0000 0000 0110  0000 0000 0000 0110 // shifted by 8
    =============== binary or ==============
    0000 0110 0000 0110  0000 0110 0011 0110 // result

Same deal, with a shift by 4:

    0000 0110 0000 0110  0000 0110 0011 0110 // previous result
    0000 0000 0110 0000  0110 0000 0110 0011 // shifted by 4
    =============== binary or ==============
    0000 0110 0110 0110  0110 0110 0111 0111 // result

After a shift by two, we've ensured that every other bit to the right of the leftmost bit set to 1 in the original, has been set to 1:

    0000 0110 0110 0110  0110 0110 0111 0111 // previous result
    0000 0001 1001 1001  1001 1001 1001 1101 // shifted by 2
    =============== binary or ==============
    0000 0111 1111 1111  1111 1111 1111 1111 // result

With the number we've picked, the shift by 1 is unnecessary, but it doesn't hurt either:

    0000 0111 1111 1111  1111 1111 1111 1111 // previous result
    0000 0011 1111 1111  1111 1111 1111 1111 // previous result
    =============== binary or ==============
    0000 0111 1111 1111  1111 1111 1111 1111 // result

And then all we have to do is add 1, to find the next power of two:

    0000 0111 1111 1111  1111 1111 1111 1111 // result
    0000 0000 0000 0000  0000 0000 0000 0001 // 1
    ========== integer addition ============
    0000 1000 0000 0000  0000 0000 0000 0000 // result

And the result is 134217728, aka 2^27, which is indeed the result we are looking for (and probably way too big for an OpenGL texture.)

There's one more thing: you'll notice that for powers of two, this algorithm will give the next power of two - whereas we would be fine with the power of two we passed. For this, all we have to do is subtract one from the original input number, and that's it!

That algorithm can be written like this in ooc:

ooc
nextPowerOfTwo: func (in: Int) -> Int {
    in -= 1

    in |= in >> 16
    in |= in >> 8
    in |= in >> 4
    in |= in >> 2
    in |= in >> 1

    in + 1
}

LLVM IR for the bit-twiddling approach

Using the same command lines as before on my new .ooc source file, let's see the LLVM IR:

llvm
define i32 @npot2__nextPowerOfTwo(i32 %in) nounwind uwtable readnone ssp {
  %1 = add nsw i32 %in, -1

  %2 = ashr i32 %1, 16
  %3 = or i32 %2, %1

  %4 = ashr i32 %3, 8
  %5 = or i32 %4, %3

  %6 = ashr i32 %5, 4
  %7 = or i32 %6, %5

  %8 = ashr i32 %7, 2
  %9 = or i32 %8, %7

  %10 = ashr i32 %9, 1
  %11 = or i32 %10, %9

  %12 = add nsw i32 %11, 1
  
  ret i32 %12
}

This is pretty much a 1-1 mapping of the ooc code above, except that each line is split into two separate steps: the right shift, and then the binary or.

There's not much to comment here, except the funny fact that LLVM doesn't seem to have increment and decrement instructions, but instead does add -1 and add 1 (which honestly makes sense).

Let's see if the assembly is more interesting.

Assembly for the bit-twiddling approach

x86 assembly
pushq   %rbp
movq    %rsp, %rbp

decl    %edi

movl    %edi, %eax
sarl    $16, %eax
orl     %edi, %eax

movl    %eax, %ecx
sarl    $8, %ecx
orl     %eax, %ecx

movl    %ecx, %eax
sarl    $4, %eax
orl     %ecx, %eax

movl    %eax, %ecx
sarl    $2, %ecx
orl     %eax, %ecx

movl    %ecx, %eax
sarl    %eax
orl     %ecx, %eax

leal    1(%rax), %eax

popq    %rbp
ret
nopl    (%rax)

Whereas LLVM used a grand total of %12 different registers, on hardware and without the constraints of SSA, it's a virtue to use as few registers as possible. Here, we use:

Turns out this assembly code is as straight-forward as the LLVM IR: movl is basically an assignment, sarl is a right shift, and orl is a binary or.

The only surprising part is the usage of leal near the end. Why not just use incl or addl? Well, according to Nils Pipenbrinck on StackOverflow , it means load effective address and can be used for address calculation.

In this case, we add an offset of 1 to %rax, which is the 64-bit variant of %eax, and store it in %eax.

Conclusion

That's all for today, folks! I didn't plan on going this far with this post, making it the longest to date, but I hoped you enjoyed it. If my readers like analyzing disassembled code, I might write more about that later.

In conclusion, I'd say that knowing how integers are represented can help us write more efficient code: in this case, we went for a code with 3 comparisons and up to 32 loop iterations to a code with 5 shifts, 5 ors, one addition, one subtraction, and no comparisons/branch instructions whatsoever.