In my previous posts, I presented optimizations for e bunch of Maths functions, so lets continue and see whats next in the list ! Today I am going to explore ways to optimize Math.pow(2,x) or 2^x , as well as Math.exp
One easy trick to fast perform 2^i for integers is to use bit shifting, 1<<i.
But what about floats?
Optimizing Math.exp(x) or Math(2,x) is going to be very similar, as pow(a,b) = exp(b * log(a)) , so Math.exp(x*log(2)) is equivalent to Math(2,x), the same way Math.exp(y) is equivalent to Math.pow(2,y/log(2)). So there is only a constant multiplication to introduce in the formula between the 2.
fast exp implementations
One simple approach is to use Taylor series. For exp, we have:
for the exp for a degree 7 , by refactoring, we get:
exp_x = (5040+x*(5040+x*(2520+x*(840+x*(210+x*(42+x*(7+x)))))))*0.00019841269f;
Another approach is to use this serie which converges to exp(x) :
for n=8, we get :
p = (1+x * 0.00390625); //=1/256 for n=8 p *=p;p *=p;p *=p;p *=p; p *=p;p *=p;p *=p; exp_x = p*p;
If we want to compute 2^x , we just need to multiply by ln2, which gives:
p = (1+x *0.00270760617406228636491106297445); //LN2/256 = 0.00270760617406228636491106297445 p *=p;p *=p;p *=p;p *=p; p *=p;p *=p;p *=p; pow2_x = p*p;
This approach has one major problem : it gives good results for x values around zero , but the approximation error grows really high as it gets farther from zero.
We can go around that limitation by using the fact that exp(a+b) = exp(a)*exp(b)
So we could put the integer part of x in a, and the fractional part in b, with x=a+b.
for example exp(3.45) = exp(3)*exp(0.45)
same with 2^3.45 = 2^3 * 2^0.45, to compute 2^i , we just need to do 1<<i, and re-use the serie for the fractional part:
Here my implementation:
static public function pow2_v8( p:Number ):Number { //2^x = 2^(i+f) = 2^i * 2^f , i is integer part , f is fractional part, with x=i+f //2^i is (1<<i), //2^f use (1+x*len(2)/n)^n = 2^x with n=8 bellow: var i:int = p; //i is the integer part of p p = (p-i) * 0.00270760617406228636491106297445+1; //LN2/256 = 0.00270760617406228636491106297445 , (p-i) is the fractional part p *=p; p *=p; p *=p; p *=p; p *=p; p *=p; p *=p; return p*p*(1<<i); }
How accurate is the new version compared to the previous one ?
Okay, hard to tell, this time let’s plot the difference between our version and the reference version flash Math.pow(2,x), so I do: (plot(x) = Math.abs( pow2Optimized(x) – Math.pow(2,x) );
This time, we can really see the accuracy improvement of the new version !
TODO: We could improve the accuracy when the fractional part is > 0.5, by computing exp(integerPart+1)*exp(fractionalPart-1) to stay closer to 0 on the fractional part.
TODO: support negative numbers.
Alternate options ?
IEEE-754 float tricks to our rescue !
using float 32 bits aliased to integers, we can extract our exp approximation:
static public function pow2_v3( p:Number ):Number { fastmem.fastSetI32(/*(1 << 23)*/8388608 * (p + 126.94269504),0); return fastmem.fastGetFloat(0); }
Another solution is described in this paper
nic.schraudolph.org/pubs/Schraudolph99.pdf, it exploits ieee-754 64bits double format to fast compute the exp:
Here, I use apparat to alias an integer as a float and get the 2^ approximation.
Note that we write the integer32 at byte 4, and get the double at byte0, and the first 4 bytes needs to be initialized correctly to zero with an initialization function.
static public function pow2_v4( p:Number ):Number { // idea from http://nic.schraudolph.org/pubs/Schraudolph99.pdf //var magic1:Number = (1048576/Math.LN2) / 1.442695040; //var magic2:Number = (1072693248 - 60801) // p * magic1 + magic2; fastmem.fastSetI32(p * 1048576.0006461141 + 1072632447,4); return fastmem.fastGetDouble(0); } static public function pow2_v4Initialize():void { fastmem.fastSetI32(0,0); }
More optimizations ?
In the 32 floats version (labelled pow2_v3) , there is a conversion from Number to integer, we have another trick to convert Number to integer (Number to int), lets use it ! And the new version becomes:
See how bizarre it gets: we write a double on 0, but read the first 4 bytes as a float:
static public function pow2_v6( p:Number ):Number { // idea from http://nic.schraudolph.org/pubs/Schraudolph99.pdf //var magic1:Number = (1048576/Math.LN2) / 1.442695040; //var magic2:Number = (1072693248 - 60801) // p * magic1 + magic2; // + Number to integer conversion by Sree fastmem.fastSetDouble(((p + 126.94269504)*8388608/*(1 << 23)*/)+6755399441055744.0,0); return fastmem.fastGetFloat(0); }
Now, what about accuracy ?
As far as precision, our alchemy versions are much better than basic taylor series, but not as good as our modified taylor version.
Now, the period of the error coincides exactly with when the binary value overflows from the mantissa into the exponent. We can extract in the mantissa the bits making it periodic, by excluding the exp section. if we keep only the high 9 bits, we can make a LUT table of 512 values to compensate for the error (we could do a bigger LUT table too by keeping 10 or more bits)
We can compensate the error by a multiplier, so we need to compute : err = Math.pow(2,x) / pow2_v6(x) , and here the LUT initialization code:
private function buildv6PreciseLut():void { for(var i:int;i<512;i++) { var v:Number = 1.0/128.0 + Number(i)/128.0; var err:Number= Math.pow(2,v)/TestPerf.pow2_v6(v); var i511:int = fastmem.fastGetUI16(2)&511; if(expLUT[i511]==0) { expLUT[i511] = err; //fastmem.fastSetFloat(err,(i511<<2)+16); //if you want the result as float, saved starting at position 16 fastmem.fastSetDouble(err,(i511<<3)+2064); //if you want the result as double, saved starting at 2064 } } }
And to use our LUT table, we extract the same bits extracted to build the LUT table, here the code:
By getting our error multiplier as a float in our LUT table:
pow2 = TestPerf.pow2_v6(j)*fastmem.fastGetFloat(((fastmem.fastGetUI16(2)&511)<<2)+16);
and here using the LUT table as a double:
pow2 = TestPerf.pow2_v6(j)*fastmem.fastGetDouble(((fastmem.fastGetUI16(2)&511)<<3)+2064);
How precise is our new version ?
in green is our new precise method, which as you can see is about as accurate as the taylor-modified version above.
Great, now lets see the perfs !
* if accuracy is not a problem, then the v6 version is probably the best.
* if you need accuray, the taylor modified version offer better performances than Math.pow(2,x), but only works for positive values.
note: I tried to optimize the accurate-taylor version in Assembly version, but I didn't get any improvements, the code is bellow with more versions tested:
What do you get?
code
package { import apparat.asm.*; import apparat.inline.Inlined; import com.buraks.utils.fastmem; public class TestPerf extends Inlined { static public function pow2_v1( p:Number ):Number { var clipp:Number = (p < -126) ? -126.0 : p; var z:Number = clipp - (clipp>>0) + (p < 0); //fastmem.fastSetI32(( /*(1 << 23)*/8388608 * (clipp + 121.2740575 + 27.7280233 / (4.84252568 - z) - 1.49012907 * z) ),0); fastmem.fastSetDouble(( /*(1 << 23)*/8388608 * (clipp + 121.2740575 + 27.7280233 / (4.84252568 - z) - 1.49012907 * z) )+6755399441055744.0,0); return fastmem.fastGetFloat(0); } static public function pow2_v2( p:Number ):Number { fastmem.fastSetI32(/*(1 << 23)*/8388608 * (((p < -126) ? -126.0 : p) + 126.94269504),0); return fastmem.fastGetFloat(0); } static public function pow2_v3( p:Number ):Number { fastmem.fastSetI32(/*(1 << 23)*/8388608 * (p + 126.94269504),0); return fastmem.fastGetFloat(0); } static public function pow2_v4( p:Number ):Number { // idea from http://nic.schraudolph.org/pubs/Schraudolph99.pdf //var magic1:Number = (1048576/Math.LN2) / 1.442695040; //var magic2:Number = (1072693248 - 60801) // p * magic1 + magic2; fastmem.fastSetI32(p * 1048576.0006461141 + 1072632447,4); return fastmem.fastGetDouble(0); } static public function pow2_v4Initialize():void { fastmem.fastSetI32(0,0); } static public function pow2_v5( p:Number ):Number { // idea from http://nic.schraudolph.org/pubs/Schraudolph99.pdf //var magic1:Number = (1048576/Math.LN2) / 1.442695040; //var magic2:Number = (1072693248 - 60801) // p * magic1 + magic2; fastmem.fastSetDouble((p * 1048576.0006461141 + 1072632447)+6755399441055744.0,4); return fastmem.fastGetDouble(0); } static public function pow2_v6( p:Number ):Number { // idea from http://nic.schraudolph.org/pubs/Schraudolph99.pdf //var magic1:Number = (1048576/Math.LN2) / 1.442695040; //var magic2:Number = (1072693248 - 60801) // p * magic1 + magic2; // + Number to integer conversion by Sree fastmem.fastSetDouble(((p + 126.94269504)*8388608/*(1 << 23)*/)+6755399441055744.0,0); return fastmem.fastGetFloat(0); } static public function pow2_v7( p:Number ):Number { p = (1+p * 0.00270760617406228636491106297445); //LN2/256 = 0.00270760617406228636491106297445 p *=p; p *=p; p *=p; p *=p; p *=p; p *=p; p *=p; return p*p; } static public function pow2_v8( p:Number ):Number { //2^x = 2^(i+f) = 2^i * 2^f , i is integer part , f is fractional part, with x=i+f //2^i is (1<<i), //2^f use (1+x*len(2)/n)^n = 2^x with n=8 bellow: var i:int = p; //i is the integer part of p p = (p-i) * 0.00270760617406228636491106297445+1; //LN2/256 = 0.00270760617406228636491106297445 , (p-i) is the fractional part p *=p; p *=p; p *=p; p *=p; p *=p; p *=p; p *=p; return p*p*(1<<i); } /* asm +1|-0 GetLocal(1) +1|-1 ConvertInt() +0|-1 SetLocal(2) +1|-0 GetLocal(1) +1|-0 GetLocal(2) +1|-2 Subtract() +1|-0 PushDouble(0.0027076061740622863) +1|-2 Multiply() +1|-0 PushByte(1) +1|-2 Add() +1|-1 ConvertDouble() +0|-1 SetLocal(1) +1|-0 GetLocal(1) +1|-0 GetLocal(1) +1|-2 Multiply() +1|-1 ConvertDouble() +0|-1 SetLocal(1) +1|-0 GetLocal(1) +1|-0 GetLocal(1) +1|-2 Multiply() +1|-1 ConvertDouble() +0|-1 SetLocal(1) +1|-0 GetLocal(1) +1|-0 GetLocal(1) +1|-2 Multiply() +1|-1 ConvertDouble() +0|-1 SetLocal(1) +1|-0 GetLocal(1) +1|-0 GetLocal(1) +1|-2 Multiply() +1|-1 ConvertDouble() +0|-1 SetLocal(1) +1|-0 GetLocal(1) +1|-0 GetLocal(1) +1|-2 Multiply() +1|-1 ConvertDouble() +0|-1 SetLocal(1) +1|-0 GetLocal(1) +1|-0 GetLocal(1) +1|-2 Multiply() +1|-1 ConvertDouble() +0|-1 SetLocal(1) +1|-0 GetLocal(1) +1|-0 GetLocal(1) +1|-2 Multiply() +1|-1 ConvertDouble() +0|-1 SetLocal(1) +1|-0 GetLocal(1) +1|-0 GetLocal(1) +1|-2 Multiply() +1|-0 PushByte(1) +1|-0 GetLocal(2) +1|-2 ShiftLeft() +1|-2 Multiply() +0|-1 ReturnValue() */ static public function pow2_v9( p:Number ):Number { var i:int; __asm( GetLocal(p), Dup, ConvertInt, Dup, SetLocal(i), Subtract, PushDouble(0.0027076061740622863), Multiply, PushDouble(1), Add, ConvertDouble, Dup, Multiply, ConvertDouble, Dup, Multiply, ConvertDouble, Dup, Multiply, ConvertDouble, Dup, Multiply, ConvertDouble, Dup, Multiply, ConvertDouble, Dup, Multiply, ConvertDouble, Dup, Multiply, ConvertDouble, Dup, Multiply, PushInt(1), GetLocal(i), ShiftLeft, Multiply, ReturnValue ); return p; } } } private function math_pow2( v:Number):Number { return Math.pow(2,v); } private function pow2_v1( v:Number):Number { return TestPerf.pow2_v1(v); } private function pow2_v2( v:Number):Number { return TestPerf.pow2_v2(v); } private function pow2_v3( v:Number):Number { return TestPerf.pow2_v3(v); } private function pow2_v4( v:Number):Number { fastmem.fastSetI32(0,0); return TestPerf.pow2_v4(v); } private function pow2_v5( v:Number):Number { fastmem.fastSetI32(0,0); return TestPerf.pow2_v5(v); } private function pow2_v6( v:Number):Number { return TestPerf.pow2_v6(v); } private function pow2_v6_precise( v:Number):Number { //fill our lut table: return TestPerf.pow2_v6(v)*expLUT[fastmem.fastGetUI16(2)&511]; } private function pow2_v7( v:Number):Number { return TestPerf.pow2_v7(v); } private function pow2_v8( v:Number):Number { return (v>0) ? TestPerf.pow2_v8(v) : 0; } private function pow2_v9( v:Number):Number { return (v>0) ? TestPerf.pow2_v9(v) : 0; } private function pow2_v1_error( v:Number):Number { return Math.abs(TestPerf.pow2_v1(v)-Math.pow(2,v)); } private function pow2_v2_error( v:Number):Number { return Math.abs(TestPerf.pow2_v2(v)-Math.pow(2,v)); } private function pow2_v3_error( v:Number):Number { return Math.abs(TestPerf.pow2_v3(v)-Math.pow(2,v)); } private function pow2_v4_error( v:Number):Number { fastmem.fastSetI32(0,0); return Math.abs(TestPerf.pow2_v4(v)-Math.pow(2,v)); } private function pow2_v5_error( v:Number):Number { fastmem.fastSetI32(0,0); return Math.abs(TestPerf.pow2_v5(v)-Math.pow(2,v)); } private function pow2_v6_error( v:Number):Number { return Math.abs(TestPerf.pow2_v6(v)-Math.pow(2,v)); } static protected var expLUT:Vector.<Number> = new Vector.<Number>(512); private function buildv6PreciseLut():void { for(var i:int;i<512;i++) { var v:Number = 1.0/128.0 + Number(i)/128.0; var err:Number= Math.pow(2,v)/TestPerf.pow2_v6(v); var i511:int = fastmem.fastGetUI16(2)&511; if(expLUT[i511]==0) { expLUT[i511] = err; fastmem.fastSetFloat(err,(i511<<2)+16); fastmem.fastSetDouble(err,(i511<<3)+2064); } } } private function pow2_v6_precise_error( v:Number):Number { //fill our lut table: return Math.abs(TestPerf.pow2_v6(v)*expLUT[fastmem.fastGetUI16(2)&511]-Math.pow(2,v)); } private function pow2_v7_error( v:Number):Number { return Math.abs(TestPerf.pow2_v7(v)-Math.pow(2,v)); } private function pow2_v8_error( v:Number):Number { return (v>0) ? Math.abs(TestPerf.pow2_v8(v)-Math.pow(2,v)) : 0; } private function pow2Test():void { buildv6PreciseLut(); //DEBUG --- v6 // visibleitems.addElement(curve2); // resetCurve(); // drawCurve2("pow2v6absError",pow2_v6_error,0/*0.5/512*/,9,1.0/512.0); // drawCurve2("pow2v6_precise_absError",pow2_v6_precise_error,0/*0.5/512*/,9,1.0/512.0); // return; var INC:Number = 0.00001; var REPS:Number = 9; var j:Number = 0; var pow2:Number; var calibrateTime:Number = AccurateTimer.calibrateTimer(); var t:int = AccurateTimer.startTimer(); for(j=0;j<REPS;j+=INC) { pow2 += Math.pow(2,j); } var time0:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); //pow2 float for(j=-REPS;j<REPS;j+=INC) { pow2 += TestPerf.pow2_v1(j); } var time1:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); //pow2 float for(j=-REPS;j<REPS;j+=INC) { pow2 += TestPerf.pow2_v3(j); } var time3:Number = AccurateTimer.endTimer(t,calibrateTime); TestPerf.pow2_v4Initialize();// need to clear t = AccurateTimer.startTimer(); //pow2 double for(j=-REPS;j<REPS;j+=INC) { pow2 += TestPerf.pow2_v4(j); } var time4:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); //pow2 double for(j=-REPS;j<REPS;j+=INC) { pow2 += TestPerf.pow2_v5(j); } var time5:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); //pow2 float2 for(j=-REPS;j<REPS;j+=INC) { pow2 += TestPerf.pow2_v6(j); } var time6:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); //pow2 float2 for(j=-REPS;j<REPS;j+=INC) { pow2 += TestPerf.pow2_v6(j)*fastmem.fastGetFloat(((fastmem.fastGetUI16(2)&511)<<2)+16); } var time7:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); //pow2 float2 for(j=-REPS;j<REPS;j+=INC) { pow2 += TestPerf.pow2_v6(j)*fastmem.fastGetDouble(((fastmem.fastGetUI16(2)&511)<<3)+2064); } var time8:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); //pow2 v7 for(j=-REPS;j<REPS;j+=INC) { pow2 += TestPerf.pow2_v7(j); } var time9:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); //pow2 v8 - positive values var REPS2:Number = REPS + REPS; for(j=0;j<REPS2;j+=INC) { pow2 += TestPerf.pow2_v8(j); } var time10:Number = AccurateTimer.endTimer(t,calibrateTime); var REPS2:Number = REPS + REPS; for(j=0;j<REPS2;j+=INC) { pow2 += TestPerf.pow2_v9(j); } var time11:Number = AccurateTimer.endTimer(t,calibrateTime); startTest(); setTest( "Math.pow2" , "pow2", time0 ); //setTest( "pow2 v1 precis" , "pow2", time1 ); setTest( "pow2 v3 float" , "pow2", time3 ); setTest( "pow2 v4 double" , "pow2", time4 ); setTest( "pow2 v5 double2" , "pow2", time5 ); setTest( "pow2 v6 float" , "pow2", time6 ); setTest( "pow2 v6 lut float precis" , "pow2", time7 ); setTest( "pow2 v6 lut double precis" , "pow2", time8 ); setTest( "pow2 v7 serie" , "pow2", time9 ); setTest( "pow2 v8 serie int precis" , "pow2", time10 ); //setTest( "pow2 v9 serie asm" , "pow2", time11 ); endTest(); //show curve: visibleitems.addElement(curve); visibleitems.addElement(curve2); resetCurve(); drawCurve("Math.pow2",math_pow2,-3,9,0.002); // drawCurve("pow2 v1",pow2_v1,-3,9,0.002); drawCurve("pow2 v3",pow2_v3,-3,9,0.002); drawCurve("pow2 v4",pow2_v4,-3,9,0.002); drawCurve("pow2 v5",pow2_v5,-3,9,0.002); drawCurve("pow2 v6",pow2_v6,-3,9,0.002); drawCurve("pow2 v6_precise",pow2_v6_precise,-3,9,0.002); drawCurve("pow2 v7",pow2_v7,-3,9,0.002); drawCurve("pow2 v8",pow2_v8,-3,9,0.002); drawCurve("pow2 v9",pow2_v9,-3,9,0.002); // drawCurve2("pow2v1absError",pow2_v1_error,-3,9,0.002); drawCurve2("pow2v3absError",pow2_v3_error,-3,9,0.002); drawCurve2("pow2v4absError",pow2_v4_error,-3,9,0.002); drawCurve2("pow2v5absError",pow2_v5_error,-3,9,0.002); drawCurve2("pow2v6absError",pow2_v6_error,-3,9,0.002); drawCurve2("pow2v6PreciseabsError",pow2_v6_precise_error,-3,9,0.002); drawCurve2("pow2v7absError",pow2_v7_error,-3,9,0.002); drawCurve2("pow2v8absError",pow2_v8_error,-3,9,0.002); }
Nice optimization (~2x) on a commonly-used function!
I get the same results as you do on my 3rd gen Intel Core i7 with Mac OS 10.8 with Chrome’s FP 11.9 release.
Jackson,
I updated the test , 10x optimizations, try it again !!