Next power of two
While looking to write a pure ooc version of ftgl, I was reading the source of ftgl-gl3 and I stumbled upon this piece of code:
static inline GLuint NextPowerOf2(GLuint in)
{
in -= 1;
in |= in >> 16;
in |= in >> 8;
in |= in >> 4;
in |= in >> 2;
in |= in >> 1;
return in + 1;
}
This is needed because dealing with power-of-two textures (32x32, 64x64, 128x128, etc.) is more efficient with OpenGL (some implementations don't even support non-power-of-two textures!).
As it turns out, it's an old trick - and it's been explained before, but I had a bit of a hard time wrapping my mind around it, so I thought I'd blog about it here.
Purpose
So the basic idea is that we have a minimum size for our texture, for example 45x187, and we want to know the smallest power-of-two texture size that will fit. In that case, we can see that it'll be 64x256 - but how do we compute that efficiently?
We need a function that maps 45 to 64 and 187 to 256 - but that works for all integer values.
Naive approaches
One naive approach would be this:
nextPowerOfTwo: func (in: Int) -> Int {
match {
case in <= 1 => 1
case in <= 2 => 2
case in <= 4 => 4
case in <= 8 => 8
case in <= 16 => 16
case in <= 32 => 32
case in <= 64 => 64
// and so on up to 2^32, ie. 4_294_967_296
}
}
This will work just fine, but we're doing lots of comparisons.
Besides, the code isn't very DRY - we're mentioning each power of two twice.
We could do something else, like:
nextPowerOfTwo: func (in: Int) -> Int {
pot := 0
out := 1
while (pot < 32 && in > out) {
out = out << 1
pot += 1
}
out
}
But we're still doing 32 loop iterations in the worst case - along with an add, a left shift, and two comparisons for each iteration.
Of course, a sufficiently smart compiler would be able to optimize that. Wouldn't it? Let's find out.
LLVM IR for the naive approach
Let's see the LLVM IR code generated by the clang 3.2 that ships with OSX 10.8.2. I used rock -v --noclean -c -cc=clang +-S +-emit-llvm npot.ooc, and found it in .libs/ooc/npot/npot.o (I guess rock could use a -S option of its own so it wouldn't put assembly in .o files...).
define i32 @npot__nextPowerOfTwo(i32 %in) nounwind uwtable readnone ssp {
%1 = icmp sgt i32 %in, 1
br i1 %1, label %.lr.ph, label %._crit_edge
.lr.ph: ; preds = %0, %.lr.ph
%out.02 = phi i32 [ %2, %.lr.ph ], [ 1, %0 ]
%pot.01 = phi i32 [ %3, %.lr.ph ], [ 0, %0 ]
%2 = shl i32 %out.02, 1
%3 = add nsw i32 %pot.01, 1
%4 = icmp slt i32 %3, 32
%5 = icmp slt i32 %2, %in
%6 = and i1 %4, %5
br i1 %6, label %.lr.ph, label %._crit_edge
._crit_edge: ; preds = %.lr.ph, %0
%out.0.lcssa = phi i32 [ 1, %0 ], [ %2, %.lr.ph ]
ret i32 %out.0.lcssa
}
There's nothing too surprising here - LLVM IR is a bit disorienting at first because it uses phi nodes instead of straightforward gotos/jumps - but that makes implementing optimizations a lot easier, because the flow is easier to analyze.
So, as we can see, there are 6 registers being allocated here. The first one is the result of a signed int comparison between the input parameter, in, and 1 - LLVM found a 'critical edge' case, and it thought it'd be a good idea to immediately return 1 if the input is 1 or less, instead of going through the shift loop.
Whether or not it is actually a good idea is left as an exercise to the reader (hint: 1x1 textures are actually useless - but the compiler doesn't know that).
After the .lr.ph label, we see two phi nodes: if we came here for the first time, we initialize out to 1, and pot to 0 - as specified in the source. If we came from a previous iteration, we just carry over the values stored in registers %2 and %3 from before.
Then we have a shl (shift left) instruction on %out.02 (the previous value of out), and an add (integer addition) on %pot.01 (the previous value of pot). The nsw specifier means "no signed wrap", ie. we don't want the value to wrap around when the binary representation reaches 2^31. In fact, at this point I realize my code is probably slightly wrong: I'm using a signed Int, and yet I'm handling values up to 2^32, which a signed 32-bit integer cannot represent.
Next up, the two comparisons in our loop, pot < 32 and in > out (%in is not stored in a local register - LLVM IR is convenient like that: you can just refer to parameters and it'll do the right thing). Then the logical and between the results of our two tests, and finally a branching instruction: if both conditions are satisfied, loop by jumping to label .lr.ph, otherwise, jump to ._crit_edge.
At ._crit_edge, we have two cases: either we came from the critical edge case, directly from the first branching instruction, in which case we set %out.0.lcssa to 1, or we came from the loop, and the return value is stored in register %2.
Then we return a signed 32-bit int. All good so far.
Intel x86_64 assembly for the naive approach
While lower-level than C in some respects (dealing with explicit type widths, using jumps for control flow where goto is recommended against in C), LLVM IR is still a relatively high-level view of things (phi nodes, virtual registers that aren't mapped to hardware registers yet, etc.).
So I thought it'd be interesting to look at the assembly output, compiled by LLVM with the -O2 flag. For that, I compiled a test program with rock -v +-O2 --cc=clang npot, and used llvm-objdump -disassemble npot to look at the disassembled code.
pushq %rbp
movq %rsp, %rbp
movl $1, %ecx
movl $1, %eax
cmpl $2, %edi
jl 26
nopw %cs:(%rax,%rax)
addl %eax, %eax
cmpl $31, %ecx
jg 6
incl %ecx
cmpl %edi, %eax
jl -13
popq %rbp
ret
This was disassembled by llvm-objdump and, as-is, is a bit harder to read: we have the function prologue and epilogue at the start and end of the function, there are amd64-specific padding instructions (nopw), and the jumps (jl, jg) are relative offsets into the binary itself - hence, not terribly convenient for a human to read.
But let's give it a shot anyway. The first interesting thing to notice is the second cmpl instruction: it compares with the integer value 31. Why is that? Our original comparison was pot < 32. Also, it initializes two registers to 1, whereas we were using 0 and 1.
Well, let's try to rewrite the assembly above in pseudo-code to try and understand what happens:
label nextPowerOfTwo:
// function prologue
ecx = 1
eax = 1
if (edi < 2) goto crit_edge
label loop:
eax += eax
if (ecx > 31) goto crit_edge
ecx += 1
if (eax < edi) goto loop
label crit_edge:
// function epilogue
Okay, that's a bit clearer, but really convoluted still - that's what you get for reading optimized assembly. I've removed padding and the prologue/epilogue for readability.
Here, we use only three registers, all of which store 32-bit ints:
- %edi is the input argument - it is set by the caller
- %ecx corresponds to our original variable pot
- %eax corresponds to our original variable out - as it turns out, %eax is also used to return the result!
Let's read through it now: the critical edge case detected by LLVM can still be seen in this assembly code - the first compare/jump handles the case where in is less than 2 (ie. at most 1), and jumps directly to the function epilogue, as eax already contains the right return value.
Why is it doing a jl (jump if less than), though? Maybe it's a shorter opcode (smaller binary size). Another explanation would be that a less than comparison is faster than an equals comparison. For example, doing 0b1000_0000_0000_0000 < 0b0111_0000_0000_0000 requires only one bit comparison, whereas doing 0b0000_0000_0000_0001 == 0b0000_0000_0000_0001 would require 32 bit comparisons, if the processor goes from most significant (left) to least significant (right) bits. But perhaps I'm trying to overjustify here.
Let's read the code further. There's no shift in sight; instead, the assembly reads addl %eax, %eax - adding out to itself. Indeed, doing out <<= 1 and out += out is exactly the same thing - except the latter can be faster, so we can give the compiler credit for doing the most basic of optimizations here.
Then, the second branch we encounter is also a classic optimization, known as short-circuit evaluation. In short, if you do a && b, and a is false, there's no need to test the value of b. That's exactly what happens here: since we have a logical and in our while loop, as soon as pot reaches 31, we're done. Or is it 32?
And now we're starting to understand something... something terrible... the optimizer messed with our pot! Our original code had pot range from 0 to 31 (included), whereas the optimizer made it range from 1 to 32 (included). So if you were to use a debugger on that code, and you somehow managed to break inside the loop, you would see that what used to be our pot variable is greater by 1 than what we had intended. But what does it matter? It's only used in the control flow - it never gets returned, or passed to a subroutine. So the optimizer has every right to change it.
Back to our while loop: if the first condition is true, we increment %ecx (aka pot + 1), and test our second condition. If %eax (aka out) is greater than or equal to %edi (aka in), we're all good and we can return - otherwise, we jump back to the beginning of the loop.
So, despite all the torture (sorry, optimizations) LLVM -O2 made our code go through, the basic algorithm is still the same: we have a loop, and in the worst case, it does 32 iterations, along with a few integer operations and comparisons.
Even writing the same algorithm in assembly by hand wouldn't help - in fact, I definitely wouldn't have thought of the smart tricks I explained above. I only learned about them by looking at the assembly output - which I encourage everyone to do more often.
The bit-twiddling approach
Now let's go back to the piece of code in the intro - also known as the bit-twiddling approach. Instead of using a loop, this approach uses properties of binary representations.
An example with the number 13 as input gives this:
13 // 0000 1101
12 // 0000 1100 // -= 1
12 // 0000 1100 // |= in >> 16
12 // 0000 1100 // |= in >> 8
12 // 0000 1100 // |= in >> 4
15 // 0000 1111 // |= in >> 2
15 // 0000 1111 // |= in >> 1
==================================
16 // 0001 0000 // + 1
Maybe this is enough for you to understand the process - but keep reading, more disassembly ahead.
Let's go into more details: we start out with any 32-bit integer. Let's pick 100663860, which has the following binary representation (most significant bit is on the left):
0000 0110 0000 0000 0000 0010 0011 0100
And then we do a binary or of itself shifted by 16 positions to the right. This gives us:
0000 0110 0000 0000 0000 0010 0011 0100 // original
0000 0000 0000 0000 0000 0110 0000 0000 // shifted by 16
=============== binary or ==============
0000 0110 0000 0000 0000 0110 0011 0100 // result
And then we take the result, and do a binary or of itself shifted by 8 positions to the right, this time:
0000 0110 0000 0000 0000 0110 0011 0100 // previous result
0000 0000 0000 0110 0000 0000 0000 0110 // shifted by 8
=============== binary or ==============
0000 0110 0000 0110 0000 0110 0011 0110 // result
Same deal, with a shift by 4:
0000 0110 0000 0110 0000 0110 0011 0110 // previous result
0000 0000 0110 0000 0110 0000 0110 0011 // shifted by 4
=============== binary or ==============
0000 0110 0110 0110 0110 0110 0111 0111 // result
After a shift by two, we've ensured that every other bit to the right of the leftmost bit set to 1 in the original, has been set to 1:
0000 0110 0110 0110 0110 0110 0111 0111 // previous result
0000 0001 1001 1001 1001 1001 1001 1101 // shifted by 2
=============== binary or ==============
0000 0111 1111 1111 1111 1111 1111 1111 // result
With the number we've picked, the shift by 1 is unnecessary, but it doesn't hurt either:
0000 0111 1111 1111 1111 1111 1111 1111 // previous result
0000 0011 1111 1111 1111 1111 1111 1111 // shifted by 1
=============== binary or ==============
0000 0111 1111 1111 1111 1111 1111 1111 // result
And then all we have to do is add 1, to find the next power of two:
0000 0111 1111 1111 1111 1111 1111 1111 // result
0000 0000 0000 0000 0000 0000 0000 0001 // 1
========== integer addition ============
0000 1000 0000 0000 0000 0000 0000 0000 // result
And the result is 134217728, aka 2^27, which is indeed the result we were looking for (and probably way too big for an OpenGL texture).
There's one more thing: you'll notice that for powers of two, this algorithm will give the next power of two - whereas we would be fine with the power of two we passed. For this, all we have to do is subtract one from the original input number, and that's it!
That algorithm can be written like this in ooc:
nextPowerOfTwo: func (in: Int) -> Int {
in -= 1
in |= in >> 16
in |= in >> 8
in |= in >> 4
in |= in >> 2
in |= in >> 1
in + 1
}
LLVM IR for the bit-twiddling approach
Using the same command lines as before on my new .ooc source file, let's see the LLVM IR:
define i32 @npot2__nextPowerOfTwo(i32 %in) nounwind uwtable readnone ssp {
%1 = add nsw i32 %in, -1
%2 = ashr i32 %1, 16
%3 = or i32 %2, %1
%4 = ashr i32 %3, 8
%5 = or i32 %4, %3
%6 = ashr i32 %5, 4
%7 = or i32 %6, %5
%8 = ashr i32 %7, 2
%9 = or i32 %8, %7
%10 = ashr i32 %9, 1
%11 = or i32 %10, %9
%12 = add nsw i32 %11, 1
ret i32 %12
}
This is pretty much a 1-1 mapping of the ooc code above, except that each line is split into two separate steps: the right shift, and then the binary or.
There's not much to comment on here, except the funny fact that LLVM doesn't seem to have increment and decrement instructions, but instead does add -1 and add 1 (which honestly makes sense).
Let's see if the assembly is more interesting.
Assembly for the bit-twiddling approach
pushq %rbp
movq %rsp, %rbp
decl %edi
movl %edi, %eax
sarl $16, %eax
orl %edi, %eax
movl %eax, %ecx
sarl $8, %ecx
orl %eax, %ecx
movl %ecx, %eax
sarl $4, %eax
orl %ecx, %eax
movl %eax, %ecx
sarl $2, %ecx
orl %eax, %ecx
movl %ecx, %eax
sarl %eax
orl %ecx, %eax
leal 1(%rax), %eax
popq %rbp
ret
nopl (%rax)
Whereas the LLVM IR used a grand total of 12 different registers, on hardware - without the constraints of SSA - it's a virtue to use as few registers as possible. Here, we use:
- %edi for the input (as before)
- %eax and %ecx as temporary variables to handle the various values of in in our algorithm - and %eax is used for the return value as well
Turns out this assembly code is as straightforward as the LLVM IR: movl is basically an assignment, sarl is a right shift, and orl is a binary or.
The only surprising part is the use of leal near the end. Why not just use incl or addl? Well, according to Nils Pipenbrinck on StackOverflow, it means load effective address, and can be used for address calculation. In this case, we add an offset of 1 to %rax (the 64-bit register whose lower half is %eax), and store the result in %eax.
Conclusion
That's all for today, folks! I didn't plan on going this far with this post - it's the longest to date - but I hope you enjoyed it. If my readers like analyzing disassembled code, I might write more about that later.
In conclusion, I'd say that knowing how integers are represented can help us write more efficient code: in this case, we went from code with 3 comparisons and up to 32 loop iterations to code with 5 shifts, 5 ors, one addition, one subtraction, and no comparisons or branch instructions whatsoever.