How to fast compute log2(n) ? We are going to explore few methods.

#### 1)The obvious:

`var log2:int = Math.log(n)*Math.LOG2E`

This is going to be incredibly slow, due to the function call, so lets try other methods described in

http://graphics.stanford.edu/~seander/bithacks.html, and add some salt to them.

#### 2)Log2 with branch

One idea is to divide the number of bits in an integer in half (16:16 bits), then see if our number has “bits” set in the high bits, and select that zone (0xffff0000), otherwise choose the other zone (0x0000ffff), then continue to divide again (8:8) , then (4:4) , then (2:2) and finally (1:1).

Here the code:

public static function logBase2_v1(v:int):int { var r:int =0; if(v & 0xFFFF0000) { v >>=16; r = 16; } if(v & 0xFF00) { v >>=8; r |= 8; } if(v & 0xF0) { v >>=4; r |= 4; } if(v & 0xC) { v >>=2; r |= 2; } if(v & 0x2) { v >>=1; r |= 1; } return r; }

as you can see, this method find the log2 in 5 tests, actionScript is not super fast with branches.

#### 3)Log2 no branch

We can rewrite the method above without branches by doing proper bit shifting, in 5 steps, and by saving in a local variable “shift” which zone is selected and to proper mask the result.

unsigned int v; // 32-bit value to find the log2 of

register unsigned int r; // result of log2(v) will go here

register unsigned int shift;

r = (v > 0xFFFF) << 4; v >>= r;

shift = (v > 0xFF ) << 3; v >>= shift; r |= shift;

shift = (v > 0xF ) << 2; v >>= shift; r |= shift;

shift = (v > 0x3 ) << 1; v >>= shift; r |= shift;

r |= (v >> 1);

that pseudo-code above can be optimized by combining the last 3 instructions blocks:

`v >>= shift; r |= shift; r |= (v >> 1);`

into something a bit faster (we don’t need to update and save v)

`r |= (v>>(shift+1))|shift;`

also, `(v>0xFFFF)`

needs a type conversion in Flash, so to avoid it, we can rewrite it by shifting the sign bit:

`(0xFFFF-v)>>>31`

same goes with (v>0xFF) , (v>0xF), (v>0x3)

And our code becomes:

public static function logBase2_v2(v:int):int { var r:int = ((0xFFFF-v)>>>31) << 4; v >>= r; var shift:int = ((0xFF-v)>>>31) << 3; v >>= shift; r |= shift; shift = ((0xF-v)>>>31) << 2; v >>= shift; r |= shift; shift = ((0x3-v)>>>31) << 1; return r|shift|(v >> (shift+1)); }

#### 3)Log2 LUT

What about a version with a lookup-table ?

I like the MultiplyDeBruijnBitPosition solution: the idea is to hash a power of 2 number by multiplying our number by a “pseudo” magic number to get a different number for each power of 2 bit position, corresponding to an entry in a LUT.

To make the lookup table efficient, we can save the LUT in our alchemy byteArray for faster access:

Here our initialization code:

public static function logBase2_init_v4():void { const kMultiplyDeBruijnBitPosition:Vector.<int> = Vector.<int>( [ 0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31] ); for(var i:int=0;i<kMultiplyDeBruijnBitPosition.length;i++) { //save at position 16 as we use position 0->15 already fastmem.fastSetByte(kMultiplyDeBruijnBitPosition[i],16+i); } }

And the code to compute the log2, which reads in our lookup table:

public static function logBase2_v3( v:int ):int { v |= v >> 1; // first round down to one less than a power of 2 v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16; return fastmem.fastGetByte(((v * 0x07C4ACDD) >>> 27) + 16); //+16 , as its where our lookup is }

the first part, find the first high bit set to 1, and sets all the lower bits starting at that position.

`v |= v >> 1;`

v |= v >> 2;

v |= v >> 4;

v |= v >> 8;

v |= v >> 16;

It can be-rewritten using AS3 asm byte code:

1) to not save the local ‘v’ variable 4 times, and access it 4 times, by just keeping v on the VM stack:

2) instead of using PushByte, directly use PushInt so no conversion is needed by the VM when doing the shift.

public static function logBase2_v3( v:int ):int { //v |= v >> 1; // first round down to one less than a power of 2 //v |= v >> 2; //v |= v >> 4; //v |= v >> 8; //v |= v >> 16; __asm( GetLocal(v), //+1|-0 ConvertInt, Dup, //+1|-0 PushInt(1), //+1|-0 ShiftRight, //+1|-2 BitOr, Dup, PushInt(2), ShiftRight, //+1|-2 BitOr, Dup, PushInt(4), ShiftRight, //+1|-2 BitOr, Dup, PushInt(8), ShiftRight, //+1|-2 BitOr, Dup, PushInt(16), ShiftRight, //+1|-2 BitOr, SetLocal(v) //+0|-1 ); return fastmem.fastGetByte(((v * 0x07C4ACDD) >>> 27) + 16); }

#### 4)Log with IEEE 32bits float

Finally, a very promising method, we can convert our integer into a 32 bits float, and read directly the exp in the 23rd position.

public static function logBase2_v3( v:int ):int { fastmem.fastSetFloat(v,0); return (fastmem.fastGetI32(0) >>> 23)-127; }

#### Profiling…

Which version is the fastest ?

#### What do you get?

#### Full code

package { import apparat.asm.*; import apparat.inline.Inlined; import com.buraks.utils.fastmem; public class TestPerf extends Inlined { public static function logBase2_v2(v:int):int { var r:int = ((0xFFFF-v)>>>31) << 4; v >>= r; var shift:int = ((0xFF-v)>>>31) << 3; v >>= shift; r |= shift; shift = ((0xF-v)>>>31) << 2; v >>= shift; r |= shift; shift = ((0x3-v)>>>31) << 1; return r|shift|(v >> (shift+1)); } public static function logBase2_v1(v:int):int { var r:int =0; if(v & 0xFFFF0000) { v >>=16; r = 16; } if(v & 0xFF00) { v >>=8; r |= 8; } if(v & 0xF0) { v >>=4; r |= 4; } if(v & 0xC) { v >>=2; r |= 2; } if(v & 0x2) { v >>=1; r |= 1; } return r; } public static function logBase2_v4( v:int ):int { fastmem.fastSetFloat(v,0); return (fastmem.fastGetI32(0) >>> 23)-127; } public static function logBase2_init_v3():void { const kMultiplyDeBruijnBitPosition:Vector.<int> = Vector.<int>( [ 0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31] ); for(var i:int=0;i<kMultiplyDeBruijnBitPosition.length;i++) { fastmem.fastSetByte(kMultiplyDeBruijnBitPosition[i],16+i); } } public static function logBase2_v3( v:int ):int { //v |= v >> 1; // first round down to one less than a power of 2 //v |= v >> 2; //v |= v >> 4; //v |= v >> 8; //v |= v >> 16; __asm( GetLocal(v), //+1|-0 ConvertInt, Dup, //+1|-0 PushInt(1), //+1|-0 ShiftRight, //+1|-2 BitOr, Dup, PushInt(2), ShiftRight, //+1|-2 BitOr, Dup, PushInt(4), ShiftRight, //+1|-2 BitOr, Dup, PushInt(8), ShiftRight, //+1|-2 BitOr, Dup, PushInt(16), ShiftRight, //+1|-2 BitOr, SetLocal(v) //+0|-1 ); return fastmem.fastGetByte(((v * 0x07C4ACDD) >>> 27) + 16); } } }

little test function using those log2 functions, first we verify the results, then do some profiling:

private function logBase2Test():void { var i:int = 0; var j:int = 0; var a:int; var b:int; var c:int; var m:int; const REPS:int = 5000000; TestPerf.logBase2_init_v3(); var str:String="" for(i=1;i

I get very different results on this environment:

Windows 7

2.67 Ghz Intel Xeon (quad core)

Flash Player 11.8.800.97 (release, Google Chrome pepper plugin)

Results:

branch: 31

no branch: 10

alchemy look up: 23

alchemy float: 13

I don’t know why, but it goes to show just how much of a difference a CPU can make, I think. What does your testing environment look like?

I was testing on my PC , Windows7, 64Bits, CPU 2.2GHZ , FlashPlayer 11.7, release, I tried on Internet Explorer.

My experience for performances with Pepper flash / Chrome is usually not as good as the regular version.

So based on your results and mine, the “no branch” version is probably the best version across environments.