In this previous post http://guihaire.com/code/?p=396, we explored
few algorithms to perform a modulo, to know a little bit more about alchemy speed, local variable speeds, and AS3 Byte code.
The native modulo “%” was faster in all my tests…
Today, I am going to revisit the code to try to Beat the Native % operator !
So the best “readable” implementation in that post was the russian peasant , with an early special case for trivial modulo (i<=j)
public static function Modulo_X1(i:int,j:int):int { if(i<=j) //trivial modulo return (i==j)? 0 : i; var x:int = j; var halfi:int = i>>1; while (x < halfi) { x <<= 1; } while (i >= j) { if (i >= x) i -= x; else x >>= 1; } return i; }
That version can be optimized !
1) doing the early special case is slow
if(i<=j) //trivial modulo return (i==j)? 0 : i;
One idea is to only include in our special case value when (i<j) in the modulo.
And i<j can be optimized by using ((i-j)>>>31) , which is 1 when i<j
So the test can be optimized that way:
if((i-j)>>>31) //trivial modulo return i;
The first loop can also be optimized using the >>>31 trick to test for <
so the loop starting with :while (x < halfi)
is now changed to:
var x:int = j; var halfi:int = i>>1; while ((x-halfi)>>>31) { x <<= 1; }
By doing our early-out special case test if(i<j), we only go to the rest of the code if i>=j
Which means the first time we go into
while (i >= j)
, its always true !
So we can move that test at the end of the loop with a do { } while( ) instead:
do { if (i >= x) i -= x; else x >>= 1; } while (i >= j)
Now, inside that loop, we can optimize the (i >= x by reversing the test order, to be able to use the >>>31 trick:
if (i >= x) i -= x; else x >>= 1;
changed to:
if((i-x)>>>31) x >>= 1; else i -= x;
Finally, in the last loop do {] while (cond), we can optimize the while (i >= j)
by reversing the test and by using the >>>31 operator again:
This time we can change the loop to: do { } while(true); , then we need to add a if() to exit the loop:
public static function Modulo_X4(i:int,j:int):int { if((i-j)>>>31) return i; var x:int = j; var halfi:int = i>>1; while ((x-halfi)>>>31) { x <<= 1; } do { if((i-x)>>>31) x >>= 1; else i -= x; if((i-j)>>>31) return i; } while (true); return i; }
ByteCode optimizations
Great... can we do more ? YES !
Lets look at the byte code generated:
58 operation(s): +1|-0 GetLocal(0) +0|-1 PushScope() +1|-0 GetLocal(1) +1|-0 GetLocal(2) +1|-2 Subtract() +1|-0 PushByte(31) +1|-2 ShiftRightUnsigned() +0|-1 IfFalse(L0) +1|-0 GetLocal(1) +0|-1 ReturnValue() L0: +1|-0 GetLocal(2) +1|-1 ConvertInt() +0|-1 SetLocal(3) +1|-0 GetLocal(1) +1|-0 PushByte(1) +1|-2 ShiftRight() +0|-1 SetLocal(4) +0|-0 Jump(L1) L2: +0|-0 Label() +1|-0 GetLocal(3) +1|-0 PushByte(1) +1|-2 ShiftLeft() +0|-1 SetLocal(3) L1: +1|-0 GetLocal(3) +1|-0 GetLocal(4) +1|-2 Subtract() +1|-0 PushByte(31) +1|-2 ShiftRightUnsigned() +0|-1 IfTrue(L2) +0|-0 Jump(L3) L7: +0|-0 Label() L3: +1|-0 GetLocal(1) +1|-0 GetLocal(3) +1|-2 Subtract() +1|-0 PushByte(31) +1|-2 ShiftRightUnsigned() +0|-1 IfFalse(L4) +1|-0 GetLocal(3) +1|-0 PushByte(1) +1|-2 ShiftRight() +0|-1 SetLocal(3) +0|-0 Jump(L5) L4: +1|-0 GetLocal(1) +1|-0 GetLocal(3) +1|-2 Subtract() +1|-1 ConvertInt() +0|-1 SetLocal(1) L5: +1|-0 GetLocal(1) +1|-0 GetLocal(2) +1|-2 Subtract() +1|-0 PushByte(31) +1|-2 ShiftRightUnsigned() +0|-1 IfFalse(L6) +1|-0 GetLocal(1) +0|-1 ReturnValue() L6: +0|-0 Jump(L7) +1|-0 GetLocal(1) +0|-1 ReturnValue()
We are going to see that the code generated is pretty disappointing ! (this is release), so lets optimize it.
Labels:
+0|-0 Jump(L3) L7: +0|-0 Label() L3: +1|-0 GetLocal(1)
so here, we do a Jump to L3, just to skip L7 Label, useless !
later in the code we have:
+0|-1 IfFalse(L6) +1|-0 GetLocal(1) +0|-1 ReturnValue() L6: +0|-0 Jump(L7)
so we jump to L6 , to immediately jump to L7... we should jump to L7 directly !
the code is also doing lots of
PushByte( )
which should be changed to
PushInt( )
and all the other bytecode optimizations we talked about in the posthttp://guihaire.com/code/?p=396 can be done again (try to keep on the stack our values as much as possible).
The code becomes:
__asm( GetLocal(i), // ConvertInt, SetLocal(a), GetLocal(i), GetLocal(j), SubtractInt, PushInt(31), ShiftRightUnsigned, IfTrue("L13"), GetLocal(j), GetLocal(a), //x,a PushInt(1), //x,a,1 ShiftRight, //x,a>>1 ConvertInt, SetLocal(halfi),//x Jump("L14"), //x "L15:", //x PushInt(1), //x,1 ShiftLeft, //x<<1 => x "L14:", //x Dup, //x,x GetLocal(halfi), //x,x,halfi SubtractInt, //x,x-halfi PushInt(31), //x,x-halfi,31 ShiftRightUnsigned, //x,(x-halfi)>>31 IfTrue("L15"), //x ConvertInt, SetLocal(x), // GetLocal(a), //a "L20:", Dup, //a,a GetLocal(x), //a,a,x SubtractInt, //a,a-x PushInt(31), //a,a-x,31 ShiftRightUnsigned,//a,(a-x)>>>31 IfFalse("L17"), //a GetLocal(x), //a,x PushInt(1), //a,x,1 ShiftRight, //a,x>>1 ConvertInt, SetLocal(x), //a Jump("L18"), //a "L17:", //a GetLocal(x), //a,x SubtractInt, //a-x=>a "L18:", //a Dup, //a,a GetLocal(j), //a,a,j SubtractInt, //a,a-j PushInt(31), //a,a-j,31 ShiftRightUnsigned, //a,a-j>>>31 IfFalse("L20"), //a ConvertInt, SetLocal(a), // "L13:" //a
Our AS3 versions are faster than the native % , and our optimized Bytecode version is about 2x faster than i%j !
What do you get?
code..
Here the full code used to profile that. Instead of using the apparat "inlined", I inlined myself the code in the loops:
private function integer_Modulo2():void { var i:int = 0; var j:int = 0; var a:int; const REPS:int = 3000; var time1:uint = getTimer(); for(j= 1;j<REPS;j++) { for(i=1;i<REPS;i++) { //operator a = i%j; //if(a != (i%j)) // throw("error1 ["+i+"%"+j+"]="+(i%j)+" a:"+a); } } var time2:uint = getTimer(); for(j= 1;j<REPS;j++) { for(i=1;i<REPS;i++) { a =i; if((i-j)>>>31) { } else { var x:int = j; var halfi:int = a>>1; while ((x-halfi)>>>31) { x <<= 1; } do { if((a-x)>>>31) x >>= 1; else a -= x; if((a-j)>>>31) { break; } } while (true); } //if(a != (i%j)) // throw("error2 ["+i+"%"+j+"]="+(i%j)+" a:"+a); } } var time3:uint = getTimer(); for(j= 1;j<REPS;j++) { for(i=1;i<REPS;i++) { __asm( GetLocal(i), // ConvertInt, SetLocal(a), GetLocal(i), GetLocal(j), SubtractInt, PushInt(31), ShiftRightUnsigned, IfTrue("L13"), GetLocal(j), GetLocal(a), //x,a PushInt(1), //x,a,1 ShiftRight, //x,a>>1 ConvertInt, SetLocal(halfi),//x Jump("L14"), //x "L15:", //x PushInt(1), //x,1 ShiftLeft, //x<<1 => x "L14:", //x Dup, //x,x GetLocal(halfi), //x,x,halfi SubtractInt, //x,x-halfi PushInt(31), //x,x-halfi,31 ShiftRightUnsigned, //x,(x-halfi)>>31 IfTrue("L15"), //x ConvertInt, SetLocal(x), // GetLocal(a), //a "L20:", Dup, //a,a GetLocal(x), //a,a,x SubtractInt, //a,a-x PushInt(31), //a,a-x,31 ShiftRightUnsigned,//a,(a-x)>>>31 IfFalse("L17"), //a GetLocal(x), //a,x PushInt(1), //a,x,1 ShiftRight, //a,x>>1 ConvertInt, SetLocal(x), //a Jump("L18"), //a "L17:", //a GetLocal(x), //a,x SubtractInt, //a-x=>a "L18:", //a Dup, //a,a GetLocal(j), //a,a,j SubtractInt, //a,a-j PushInt(31), //a,a-j,31 ShiftRightUnsigned, //a,a-j>>>31 IfFalse("L20"), //a ConvertInt, SetLocal(a), // "L13:" //a ); //if(a != (i%j)) // throw("error3 ["+i+"%"+j+"]="+(i%j)+" a:"+a); } } var time4:uint = getTimer(); startTest(); setTest( "a=i%j" , "integer modulo", time2-time1 ); setTest( "russian peasant" , "integer modulo", time3-time2 ); setTest( "russian peasant asm" , "integer modulo", time4-time3 ); endTest(); }
Hey Benjamin (smashing name btw),
Just wondering if you tried just using:
a – floor(a / b) * b
When I tried that a while back I was getting faster speeds than the inbuilt %.
For 10,000,000 itterations I got:
203 ms for a % b;
147 ms for a – (a / b | 0) * b;
About a 25% performance boost and just 1 line of code, might not be as fast as the ol pleasant russians but with no while loops it should have a much more consistent run speed across values. That said I might have missed something obvious (I so often do).
Anyway thanks for the host of intersesting performance related articles!!
ben
I didn’t, but its a good idea !