In this previous post http://guihaire.com/code/?p=396, we explored
few algorithms to perform a modulo, to know a little bit more about alchemy speed, local variable speeds, and AS3 Byte code.
The native modulo “%” was faster in all my tests…
Today, I am going to revisit the code to try to Beat the Native % operator !
So the best “readable” implementation in that post was the russian peasant , with an early special case for trivial modulo (i<=j)
public static function Modulo_X1(i:int,j:int):int
{
if(i<=j) //trivial modulo
return (i==j)? 0 : i;
var x:int = j;
var halfi:int = i>>1;
while (x < halfi)
{
x <<= 1;
}
while (i >= j)
{
if (i >= x)
i -= x;
else
x >>= 1;
}
return i;
}
That version can be optimized !
1) doing the early special case is slow
if(i<=j) //trivial modulo
return (i==j)? 0 : i;
One idea is to only include in our special case value when (i<j) in the modulo.
And i<j can be optimized by using ((i-j)>>>31) , which is 1 when i<j
So the test can be optimized that way:
if((i-j)>>>31) //trivial modulo
return i;
The first loop can also be optimized using the >>>31 trick to test for <
so the loop starting with :while (x < halfi)
is now changed to:
var x:int = j;
var halfi:int = i>>1;
while ((x-halfi)>>>31)
{
x <<= 1;
}
By doing our early-out special case test if(i<j), we only go to the rest of the code if i>=j
Which means the first time we go into
while (i >= j) , its always true !
So we can move that test at the end of the loop with a do { } while( ) instead:
do
{
if (i >= x)
i -= x;
else
x >>= 1;
}
while (i >= j)
Now, inside that loop, we can optimize the (i >= x by reversing the test order, to be able to use the >>>31 trick:
if (i >= x)
i -= x;
else
x >>= 1;
changed to:
if((i-x)>>>31) x >>= 1; else i -= x;
Finally, in the last loop do {] while (cond), we can optimize the while (i >= j) by reversing the test and by using the >>>31 operator again:
This time we can change the loop to: do { } while(true); , then we need to add a if() to exit the loop:
public static function Modulo_X4(i:int,j:int):int
{
if((i-j)>>>31)
return i;
var x:int = j;
var halfi:int = i>>1;
while ((x-halfi)>>>31)
{
x <<= 1;
}
do
{
if((i-x)>>>31)
x >>= 1;
else
i -= x;
if((i-j)>>>31)
return i;
}
while (true);
return i;
}
ByteCode optimizations
Great... can we do more ? YES !
Lets look at the byte code generated:
58 operation(s):
+1|-0 GetLocal(0)
+0|-1 PushScope()
+1|-0 GetLocal(1)
+1|-0 GetLocal(2)
+1|-2 Subtract()
+1|-0 PushByte(31)
+1|-2 ShiftRightUnsigned()
+0|-1 IfFalse(L0)
+1|-0 GetLocal(1)
+0|-1 ReturnValue()
L0: +1|-0 GetLocal(2)
+1|-1 ConvertInt()
+0|-1 SetLocal(3)
+1|-0 GetLocal(1)
+1|-0 PushByte(1)
+1|-2 ShiftRight()
+0|-1 SetLocal(4)
+0|-0 Jump(L1)
L2: +0|-0 Label()
+1|-0 GetLocal(3)
+1|-0 PushByte(1)
+1|-2 ShiftLeft()
+0|-1 SetLocal(3)
L1: +1|-0 GetLocal(3)
+1|-0 GetLocal(4)
+1|-2 Subtract()
+1|-0 PushByte(31)
+1|-2 ShiftRightUnsigned()
+0|-1 IfTrue(L2)
+0|-0 Jump(L3)
L7: +0|-0 Label()
L3: +1|-0 GetLocal(1)
+1|-0 GetLocal(3)
+1|-2 Subtract()
+1|-0 PushByte(31)
+1|-2 ShiftRightUnsigned()
+0|-1 IfFalse(L4)
+1|-0 GetLocal(3)
+1|-0 PushByte(1)
+1|-2 ShiftRight()
+0|-1 SetLocal(3)
+0|-0 Jump(L5)
L4: +1|-0 GetLocal(1)
+1|-0 GetLocal(3)
+1|-2 Subtract()
+1|-1 ConvertInt()
+0|-1 SetLocal(1)
L5: +1|-0 GetLocal(1)
+1|-0 GetLocal(2)
+1|-2 Subtract()
+1|-0 PushByte(31)
+1|-2 ShiftRightUnsigned()
+0|-1 IfFalse(L6)
+1|-0 GetLocal(1)
+0|-1 ReturnValue()
L6: +0|-0 Jump(L7)
+1|-0 GetLocal(1)
+0|-1 ReturnValue()
We are going to see that the code generated is pretty disappointing ! (this is release), so lets optimize it.
Labels:
+0|-0 Jump(L3)
L7: +0|-0 Label()
L3: +1|-0 GetLocal(1)
so here, we do a Jump to L3, just to skip L7 Label, useless !
later in the code we have:
+0|-1 IfFalse(L6)
+1|-0 GetLocal(1)
+0|-1 ReturnValue()
L6: +0|-0 Jump(L7)
so we jump to L6 , to immediately jump to L7... we should jump to L7 directly !
the code is also doing lots of
PushByte( )
which should be changed to
PushInt( )
and all the other bytecode optimizations we talked about in the posthttp://guihaire.com/code/?p=396 can be done again (try to keep on the stack our values as much as possible).
The code becomes:
__asm(
GetLocal(i), //
ConvertInt,
SetLocal(a),
GetLocal(i),
GetLocal(j),
SubtractInt,
PushInt(31),
ShiftRightUnsigned,
IfTrue("L13"),
GetLocal(j),
GetLocal(a), //x,a
PushInt(1), //x,a,1
ShiftRight, //x,a>>1
ConvertInt,
SetLocal(halfi),//x
Jump("L14"), //x
"L15:", //x
PushInt(1), //x,1
ShiftLeft, //x<<1 => x
"L14:", //x
Dup, //x,x
GetLocal(halfi), //x,x,halfi
SubtractInt, //x,x-halfi
PushInt(31), //x,x-halfi,31
ShiftRightUnsigned, //x,(x-halfi)>>31
IfTrue("L15"), //x
ConvertInt,
SetLocal(x), //
GetLocal(a), //a
"L20:",
Dup, //a,a
GetLocal(x), //a,a,x
SubtractInt, //a,a-x
PushInt(31), //a,a-x,31
ShiftRightUnsigned,//a,(a-x)>>>31
IfFalse("L17"), //a
GetLocal(x), //a,x
PushInt(1), //a,x,1
ShiftRight, //a,x>>1
ConvertInt,
SetLocal(x), //a
Jump("L18"), //a
"L17:", //a
GetLocal(x), //a,x
SubtractInt, //a-x=>a
"L18:", //a
Dup, //a,a
GetLocal(j), //a,a,j
SubtractInt, //a,a-j
PushInt(31), //a,a-j,31
ShiftRightUnsigned, //a,a-j>>>31
IfFalse("L20"), //a
ConvertInt,
SetLocal(a), //
"L13:" //a
Our AS3 versions are faster than the native % , and our optimized Bytecode version is about 2x faster than i%j !
What do you get?
code..
Here the full code used to profile that. Instead of using the apparat "inlined", I inlined myself the code in the loops:
private function integer_Modulo2():void
{
var i:int = 0;
var j:int = 0;
var a:int;
const REPS:int = 3000;
var time1:uint = getTimer();
for(j= 1;j<REPS;j++)
{
for(i=1;i<REPS;i++)
{
//operator
a = i%j;
//if(a != (i%j))
// throw("error1 ["+i+"%"+j+"]="+(i%j)+" a:"+a);
}
}
var time2:uint = getTimer();
for(j= 1;j<REPS;j++)
{
for(i=1;i<REPS;i++)
{
a =i;
if((i-j)>>>31)
{
}
else
{
var x:int = j;
var halfi:int = a>>1;
while ((x-halfi)>>>31)
{
x <<= 1;
}
do
{
if((a-x)>>>31)
x >>= 1;
else
a -= x;
if((a-j)>>>31)
{
break;
}
}
while (true);
}
//if(a != (i%j))
// throw("error2 ["+i+"%"+j+"]="+(i%j)+" a:"+a);
}
}
var time3:uint = getTimer();
for(j= 1;j<REPS;j++)
{
for(i=1;i<REPS;i++)
{
__asm(
GetLocal(i), //
ConvertInt,
SetLocal(a),
GetLocal(i),
GetLocal(j),
SubtractInt,
PushInt(31),
ShiftRightUnsigned,
IfTrue("L13"),
GetLocal(j),
GetLocal(a), //x,a
PushInt(1), //x,a,1
ShiftRight, //x,a>>1
ConvertInt,
SetLocal(halfi),//x
Jump("L14"), //x
"L15:", //x
PushInt(1), //x,1
ShiftLeft, //x<<1 => x
"L14:", //x
Dup, //x,x
GetLocal(halfi), //x,x,halfi
SubtractInt, //x,x-halfi
PushInt(31), //x,x-halfi,31
ShiftRightUnsigned, //x,(x-halfi)>>31
IfTrue("L15"), //x
ConvertInt,
SetLocal(x), //
GetLocal(a), //a
"L20:",
Dup, //a,a
GetLocal(x), //a,a,x
SubtractInt, //a,a-x
PushInt(31), //a,a-x,31
ShiftRightUnsigned,//a,(a-x)>>>31
IfFalse("L17"), //a
GetLocal(x), //a,x
PushInt(1), //a,x,1
ShiftRight, //a,x>>1
ConvertInt,
SetLocal(x), //a
Jump("L18"), //a
"L17:", //a
GetLocal(x), //a,x
SubtractInt, //a-x=>a
"L18:", //a
Dup, //a,a
GetLocal(j), //a,a,j
SubtractInt, //a,a-j
PushInt(31), //a,a-j,31
ShiftRightUnsigned, //a,a-j>>>31
IfFalse("L20"), //a
ConvertInt,
SetLocal(a), //
"L13:" //a
);
//if(a != (i%j))
// throw("error3 ["+i+"%"+j+"]="+(i%j)+" a:"+a);
}
}
var time4:uint = getTimer();
startTest();
setTest( "a=i%j" , "integer modulo", time2-time1 );
setTest( "russian peasant" , "integer modulo", time3-time2 );
setTest( "russian peasant asm" , "integer modulo", time4-time3 );
endTest();
}


Hey Benjamin (smashing name btw),
Just wondering if you tried just using:
a – floor(a / b) * b
When I tried that a while back I was getting faster speeds than the inbuilt %.
For 10,000,000 itterations I got:
203 ms for a % b;
147 ms for a – (a / b | 0) * b;
About a 25% performance boost and just 1 line of code, might not be as fast as the ol pleasant russians but with no while loops it should have a much more consistent run speed across values. That said I might have missed something obvious (I so often do).
Anyway thanks for the host of intersesting performance related articles!!
ben
I didn’t, but its a good idea !