How to fast compute log2(n) ? We are going to explore few methods.
1)The obvious:
var log2:int = Math.log(n)*Math.LOG2E
This is going to be incredibly slow, due to the function call, so lets try other methods described in
http://graphics.stanford.edu/~seander/bithacks.html, and add some salt to them.
2)Log2 with branch
One idea is to divide the number of bits in an integer in half (16:16 bits), then see if our number has “bits” set in the high bits, and select that zone (0xffff0000), otherwise choose the other zone (0x0000ffff), then continue to divide again (8:8) , then (4:4) , then (2:2) and finally (1:1).
Here the code:
public static function logBase2_v1(v:int):int { var r:int =0; if(v & 0xFFFF0000) { v >>=16; r = 16; } if(v & 0xFF00) { v >>=8; r |= 8; } if(v & 0xF0) { v >>=4; r |= 4; } if(v & 0xC) { v >>=2; r |= 2; } if(v & 0x2) { v >>=1; r |= 1; } return r; }
as you can see, this method find the log2 in 5 tests, actionScript is not super fast with branches.
3)Log2 no branch
We can rewrite the method above without branches by doing proper bit shifting, in 5 steps, and by saving in a local variable “shift” which zone is selected and to proper mask the result.
unsigned int v; // 32-bit value to find the log2 of
register unsigned int r; // result of log2(v) will go here
register unsigned int shift;
r = (v > 0xFFFF) << 4; v >>= r;
shift = (v > 0xFF ) << 3; v >>= shift; r |= shift;
shift = (v > 0xF ) << 2; v >>= shift; r |= shift;
shift = (v > 0x3 ) << 1; v >>= shift; r |= shift;
r |= (v >> 1);
that pseudo-code above can be optimized by combining the last 3 instructions blocks:
v >>= shift; r |= shift; r |= (v >> 1);
into something a bit faster (we don’t need to update and save v)
r |= (v>>(shift+1))|shift;
also, (v>0xFFFF)
needs a type conversion in Flash, so to avoid it, we can rewrite it by shifting the sign bit:
(0xFFFF-v)>>>31
same goes with (v>0xFF) , (v>0xF), (v>0x3)
And our code becomes:
public static function logBase2_v2(v:int):int { var r:int = ((0xFFFF-v)>>>31) << 4; v >>= r; var shift:int = ((0xFF-v)>>>31) << 3; v >>= shift; r |= shift; shift = ((0xF-v)>>>31) << 2; v >>= shift; r |= shift; shift = ((0x3-v)>>>31) << 1; return r|shift|(v >> (shift+1)); }
3)Log2 LUT
What about a version with a lookup-table ?
I like the MultiplyDeBruijnBitPosition solution: the idea is to hash a power of 2 number by multiplying our number by a “pseudo” magic number to get a different number for each power of 2 bit position, corresponding to an entry in a LUT.
To make the lookup table efficient, we can save the LUT in our alchemy byteArray for faster access:
Here our initialization code:
public static function logBase2_init_v4():void { const kMultiplyDeBruijnBitPosition:Vector.<int> = Vector.<int>( [ 0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31] ); for(var i:int=0;i<kMultiplyDeBruijnBitPosition.length;i++) { //save at position 16 as we use position 0->15 already fastmem.fastSetByte(kMultiplyDeBruijnBitPosition[i],16+i); } }
And the code to compute the log2, which reads in our lookup table:
public static function logBase2_v3( v:int ):int { v |= v >> 1; // first round down to one less than a power of 2 v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16; return fastmem.fastGetByte(((v * 0x07C4ACDD) >>> 27) + 16); //+16 , as its where our lookup is }
the first part, find the first high bit set to 1, and sets all the lower bits starting at that position.
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
It can be-rewritten using AS3 asm byte code:
1) to not save the local ‘v’ variable 4 times, and access it 4 times, by just keeping v on the VM stack:
2) instead of using PushByte, directly use PushInt so no conversion is needed by the VM when doing the shift.
public static function logBase2_v3( v:int ):int { //v |= v >> 1; // first round down to one less than a power of 2 //v |= v >> 2; //v |= v >> 4; //v |= v >> 8; //v |= v >> 16; __asm( GetLocal(v), //+1|-0 ConvertInt, Dup, //+1|-0 PushInt(1), //+1|-0 ShiftRight, //+1|-2 BitOr, Dup, PushInt(2), ShiftRight, //+1|-2 BitOr, Dup, PushInt(4), ShiftRight, //+1|-2 BitOr, Dup, PushInt(8), ShiftRight, //+1|-2 BitOr, Dup, PushInt(16), ShiftRight, //+1|-2 BitOr, SetLocal(v) //+0|-1 ); return fastmem.fastGetByte(((v * 0x07C4ACDD) >>> 27) + 16); }
4)Log with IEEE 32bits float
Finally, a very promising method, we can convert our integer into a 32 bits float, and read directly the exp in the 23rd position.
public static function logBase2_v3( v:int ):int { fastmem.fastSetFloat(v,0); return (fastmem.fastGetI32(0) >>> 23)-127; }
Profiling…
Which version is the fastest ?
What do you get?
Full code
package { import apparat.asm.*; import apparat.inline.Inlined; import com.buraks.utils.fastmem; public class TestPerf extends Inlined { public static function logBase2_v2(v:int):int { var r:int = ((0xFFFF-v)>>>31) << 4; v >>= r; var shift:int = ((0xFF-v)>>>31) << 3; v >>= shift; r |= shift; shift = ((0xF-v)>>>31) << 2; v >>= shift; r |= shift; shift = ((0x3-v)>>>31) << 1; return r|shift|(v >> (shift+1)); } public static function logBase2_v1(v:int):int { var r:int =0; if(v & 0xFFFF0000) { v >>=16; r = 16; } if(v & 0xFF00) { v >>=8; r |= 8; } if(v & 0xF0) { v >>=4; r |= 4; } if(v & 0xC) { v >>=2; r |= 2; } if(v & 0x2) { v >>=1; r |= 1; } return r; } public static function logBase2_v4( v:int ):int { fastmem.fastSetFloat(v,0); return (fastmem.fastGetI32(0) >>> 23)-127; } public static function logBase2_init_v3():void { const kMultiplyDeBruijnBitPosition:Vector.<int> = Vector.<int>( [ 0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31] ); for(var i:int=0;i<kMultiplyDeBruijnBitPosition.length;i++) { fastmem.fastSetByte(kMultiplyDeBruijnBitPosition[i],16+i); } } public static function logBase2_v3( v:int ):int { //v |= v >> 1; // first round down to one less than a power of 2 //v |= v >> 2; //v |= v >> 4; //v |= v >> 8; //v |= v >> 16; __asm( GetLocal(v), //+1|-0 ConvertInt, Dup, //+1|-0 PushInt(1), //+1|-0 ShiftRight, //+1|-2 BitOr, Dup, PushInt(2), ShiftRight, //+1|-2 BitOr, Dup, PushInt(4), ShiftRight, //+1|-2 BitOr, Dup, PushInt(8), ShiftRight, //+1|-2 BitOr, Dup, PushInt(16), ShiftRight, //+1|-2 BitOr, SetLocal(v) //+0|-1 ); return fastmem.fastGetByte(((v * 0x07C4ACDD) >>> 27) + 16); } } }
little test function using those log2 functions, first we verify the results, then do some profiling:
private function logBase2Test():void { var i:int = 0; var j:int = 0; var a:int; var b:int; var c:int; var m:int; const REPS:int = 5000000; TestPerf.logBase2_init_v3(); var str:String="" for(i=1;i
I get very different results on this environment:
Windows 7
2.67 Ghz Intel Xeon (quad core)
Flash Player 11.8.800.97 (release, Google Chrome pepper plugin)
Results:
branch: 31
no branch: 10
alchemy look up: 23
alchemy float: 13
I don’t know why, but it goes to show just how much of a difference a CPU can make, I think. What does your testing environment look like?
I was testing on my PC , Windows7, 64Bits, CPU 2.2GHZ , FlashPlayer 11.7, release, I tried on Internet Explorer.
My experience for performances with Pepper flash / Chrome is usually not as good as the regular version.
So based on your results and mine, the “no branch” version is probably the best version across environments.