Can we fast compute 1.0/x ?
static public function recriprocal( value:Number ):Number { return 1.0/value; }
So here some ideas, we already have a fast way to compute 1/sqrt(x), if you look at 1/x and 1/sqrt(x) , you will notice that both equations are pretty close,
so it gives us a pretty good “first guess”.
See the post about computing 1/sqrt here.
By removing the extra *0.5 in rsqrt, and choose a better initial value, the code becomes:
static public function recriprocal1( value:Number ):Number { fastmem.fastSetFloat(value,0); fastmem.fastSetI32( 0x7EF311C2 - fastmem.fastGetI32(0),0); return fastmem.fastGetFloat(0); }
From there, using the Newton–Raphson, we can converge to a closer result:
We can fine tune our first guess magic value by trying a bunch of values and do a binary search and converge to the optimal value depending on how many Newton iterations are performed, here the results:
static public function recriprocal2( value:Number ):Number { fastmem.fastSetFloat(value,0); fastmem.fastSetI32( 0x7EF311C3 - fastmem.fastGetI32(0),0); var inv:Number = fastmem.fastGetFloat(0); return inv * (2.0 - inv * value); } static public function recriprocal3( value:Number ):Number { fastmem.fastSetFloat(value,0); fastmem.fastSetI32( 0x7EF312AC - fastmem.fastGetI32(0),0); var inv:Number = fastmem.fastGetFloat(0); inv *= (2.0 - inv * value); return inv * (2.0 - inv * value); } static public function recriprocal4( value:Number ):Number { fastmem.fastSetFloat(value,0); fastmem.fastSetI32( 0x7EEEEBB3 - fastmem.fastGetI32(0),0); var inv:Number = fastmem.fastGetFloat(0); inv *= (2.0 - inv * value); inv *= (2.0 - inv * value); return inv * (2.0 - inv * value); }
Is it accurate ?
Now, is that faster ?
And in my case, I need to compute a bunch of values in a byteArray from x to 1/x, would it be faster in that case?
So the basic idea, using alchemy would be do just do that:
for(adr=0;(adr+=4)<adrMax;) { fastmem.fastSetFloat(1/fastmem.fastGetFloat(adr),adr); }
If we use our ieee-trick version, we do instead:
direct guess:
for(adr=0;(adr+=4)<adrMax;) { fastmem.fastSetI32( 0x7EF311C2 - fastmem.fastGetI32(adr),adr); }
one newton iteration:
for(adr=0;(adr+=4)<adrMax;) { fastmem.fastSetI32( 0x7EF311C3 - fastmem.fastGetI32(adr),0); inv = fastmem.fastGetFloat(0); fastmem.fastSetFloat(inv * (2.0 - inv * fastmem.fastGetFloat(adr) ),adr); }
The assembly code of the last version is:
+1|-0 PushInt(2129859011)
+1|-0 GetLocal(9)
+1|-1 GetInt()
+1|-2 Subtract()
+1|-0 PushByte(0)
+0|-2 SetInt()
+1|-0 PushByte(0)
+1|-1 GetFloat()
+1|-1 ConvertDouble()
+0|-1 SetLocal(5)
+1|-0 GetLocal(5)
+1|-0 PushByte(2)
+1|-0 GetLocal(5)
+1|-0 GetLocal(9)
+1|-1 GetFloat()
+1|-2 Multiply()
+1|-2 Subtract()
+1|-2 Multiply()
+1|-0 GetLocal(9)
+0|-2 SetFloat()
We can rewrite it to use the faster SubtractInt instead of Subtract, PushInt instead of Push Bytes etc:
for(adr=0;(adr+=4)<adrMax;) { //fastmem.fastSetI32( 0x7EF311C3 - fastmem.fastGetI32(adr),0); //inv = fastmem.fastGetFloat(0); //fastmem.fastSetFloat(inv * (2.0 - inv * fastmem.fastGetFloat(adr) ),adr); __asm( PushInt(2129859011), GetLocal(adr), GetInt, SubtractInt, PushByte(0), SetInt, PushByte(0), GetFloat, ConvertDouble, Dup, PushInt(2), Swap, GetLocal(adr), GetFloat, Multiply, Subtract, Multiply, GetLocal(adr), SetFloat ); }
Performances ?
Nice !
What do you get?
Code
protected function reciprocal0(v:Number):Number { return TestPerf.recriprocal0(v);} protected function reciprocal1(v:Number):Number { return TestPerf.recriprocal1(v);} protected function reciprocal2(v:Number):Number { return TestPerf.recriprocal2(v);} protected function reciprocal3(v:Number):Number { return TestPerf.recriprocal3(v);} protected function reciprocal4(v:Number):Number { return TestPerf.recriprocal4(v);} protected function reciprocalTest():void { visibleitems.addElement(curve); initSinLut(); resetCurve(); drawCurve("1/a",reciprocal0,1,100,0.05); drawCurve("v1 alchemy 1/a",reciprocal1,1,100,0.05); drawCurve("v2 alchemy 1/a",reciprocal2,1,100,0.05); drawCurve("v3 alchemy 1/a",reciprocal3,1,100,0.05); drawCurve("v4 alchemy 1/a",reciprocal4,1,100,0.05); var j:Number = 0; var reciprocal:Number=0; var val:Number; var val2:Number; const REPSJ:int = 1000; const INC:Number = 0.0001; var t:int; var calibrateTime:Number = AccurateTimer.calibrateTimer(); t = AccurateTimer.startTimer(); for(j= INC;j<REPSJ;j+=INC) { reciprocal = TestPerf.recriprocal0(j); } var time0:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); for(j= INC;j<REPSJ;j+=INC) { reciprocal = TestPerf.recriprocal1(j); } var time1:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); for(j= INC;j<REPSJ;j+=INC) { reciprocal = TestPerf.recriprocal2(j); } var time2:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); for(j= INC;j<REPSJ;j+=INC) { reciprocal = TestPerf.recriprocal3(j); } var time3:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); for(j= INC;j<REPSJ;j+=INC) { reciprocal = TestPerf.recriprocal4(j); } var time4:Number = AccurateTimer.endTimer(t,calibrateTime); startTest(); setTest( "1/j" , "reciprocal", time0); setTest( "v1 alchemy reciprocal" , "reciprocal", time1); setTest( "v2 alchemy reciprocal" , "reciprocal", time2); setTest( "v3 alchemy reciprocal" , "reciprocal", time3); setTest( "v4 alchemy reciprocal" , "reciprocal", time4); endTest(); } protected function reciprocalTest2():void { var j:Number = 0; var reciprocal:Number=0; var val:Number; var val2:Number; var inv:Number; const REPSJ:int = 1000; const INC:Number = 0.001; var t:int; //fill our byteArray with floats var adr:int = 0; for(j= INC;j<REPSJ;j+=INC) { fastmem.fastSetFloat(j,adr+=4); } var adrMax:int = adr; var calibrateTime:Number = AccurateTimer.calibrateTimer(); t = AccurateTimer.startTimer(); for(adr=0;(adr+=4)<adrMax;) { fastmem.fastSetFloat(1/fastmem.fastGetFloat(adr),adr); } var time0:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); for(adr=0;(adr+=4)<adrMax;) { fastmem.fastSetI32( 0x7EF311C2 - fastmem.fastGetI32(adr),adr); } var time1:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); for(adr=0;(adr+=4)<adrMax;) { fastmem.fastSetI32( 0x7EF311C3 - fastmem.fastGetI32(adr),0); inv = fastmem.fastGetFloat(0); fastmem.fastSetFloat(inv * (2.0 - inv * fastmem.fastGetFloat(adr) ),adr); } var time2:Number = AccurateTimer.endTimer(t,calibrateTime); /* +1|-0 PushInt(2129859011) +1|-0 GetLocal(9) +1|-1 GetInt() +1|-2 Subtract() +1|-0 PushByte(0) +0|-2 SetInt() +1|-0 PushByte(0) +1|-1 GetFloat() +1|-1 ConvertDouble() +0|-1 SetLocal(5) +1|-0 GetLocal(5) +1|-0 PushByte(2) +1|-0 GetLocal(5) +1|-0 GetLocal(9) +1|-1 GetFloat() +1|-2 Multiply() +1|-2 Subtract() +1|-2 Multiply() +1|-0 GetLocal(9) +0|-2 SetFloat() */ t = AccurateTimer.startTimer(); for(adr=0;(adr+=4)<adrMax;) { //fastmem.fastSetI32( 0x7EF311C3 - fastmem.fastGetI32(adr),0); //inv = fastmem.fastGetFloat(0); //fastmem.fastSetFloat(inv * (2.0 - inv * fastmem.fastGetFloat(adr) ),adr); __asm( PushInt(2129859011), GetLocal(adr), GetInt, SubtractInt, PushByte(0), SetInt, PushByte(0), GetFloat, ConvertDouble, Dup, PushInt(2), Swap, GetLocal(adr), GetFloat, Multiply, Subtract, Multiply, GetLocal(adr), SetFloat ); } var time3:Number = AccurateTimer.endTimer(t,calibrateTime); startTest(); setTest( "1/j" , "byteArray reciprocal", time0); setTest( "alchemy reciprocal v1" , "byteArray reciprocal", time1); setTest( "alchemy reciprocal v2" , "byteArray reciprocal", time2); setTest( "alchemy reciprocal v2 asm" , "byteArray reciprocal", time3); endTest(); } public class TestPerf extends Inlined { static public function recriprocal0( value:Number ):Number { return 1.0/value; } static public function recriprocal1( value:Number ):Number { fastmem.fastSetFloat(value,0); fastmem.fastSetI32( 0x7EF311C2 - fastmem.fastGetI32(0),0); return fastmem.fastGetFloat(0); } static public function recriprocal2( value:Number ):Number { fastmem.fastSetFloat(value,0); fastmem.fastSetI32( 0x7EF311C3 - fastmem.fastGetI32(0),0); var inv:Number = fastmem.fastGetFloat(0); return inv * (2.0 - inv * value); } static public function recriprocal3( value:Number ):Number { fastmem.fastSetFloat(value,0); fastmem.fastSetI32( 0x7EF312AC - fastmem.fastGetI32(0),0); var inv:Number = fastmem.fastGetFloat(0); inv *= (2.0 - inv * value); return inv * (2.0 - inv * value); } static public function recriprocal4( value:Number ):Number { fastmem.fastSetFloat(value,0); fastmem.fastSetI32( 0x7EEEEBB3 - fastmem.fastGetI32(0),0); var inv:Number = fastmem.fastGetFloat(0); inv *= (2.0 - inv * value); inv *= (2.0 - inv * value); return inv * (2.0 - inv * value); } }
I basically get the same graphs as you on my Core i7 Mac. This seems like a good tip to keep in mind for the next time I’m processing a big
ByteArray
of data. I wonder though- would some of your other Alchemy-based tests that were slower than their AS3 counterparts also perform better if you add the “what if it’s in a bigByteArray
” type of test?right, it would be interesting to revisit those old tests, and with context3D , index and vertex buffers can all be in ByteArray, so manipulating ByteArray fast can be critical in a 3d app.