In this post http://guihaire.com/code/?p=681 about radix sorting, we need to clear a portion of our domain ByteArray, and in 2 of the comments, the question was,
what is the fastest way to perform that clear?
So we have to clear (2048*4)*3 Bytes before writing into the histogram,
To clear the memory region, I don’t want to re-allocate a new “alchemy byteArray” each time I perform a radix sort, as I also manage in the same byteArray other things for the rest of my app, and switching DomainMemory byteArray is very expensive and should always be avoided (especially on mobile)
So here few methods I tried:
1)write a local variable set to ZERO written as a double
for(adr=b0;adr<max;adr+=32) { fastmem.fastSetDouble(mem,adr); fastmem.fastSetDouble(mem,adr+8); fastmem.fastSetDouble(mem,adr+16); fastmem.fastSetDouble(mem,adr+24); }
2)Write ZERO as a double, enrolled 4 times
for(adr=b0;adr<max;adr+=32) { fastmem.fastSetDouble(0,adr); fastmem.fastSetDouble(0,adr+8); fastmem.fastSetDouble(0,adr+16); fastmem.fastSetDouble(0,adr+24); }
3)Same as 2,but enrolled 8 times:
for(adr=b0;adr<max;adr+=64) { fastmem.fastSetDouble(0,adr); fastmem.fastSetDouble(0,adr+8); fastmem.fastSetDouble(0,adr+16); fastmem.fastSetDouble(0,adr+24); fastmem.fastSetDouble(0,adr+32); fastmem.fastSetDouble(0,adr+40); fastmem.fastSetDouble(0,adr+48); fastmem.fastSetDouble(0,adr+56); }
4)Allocate a separate byteArray, and use ReadBytes
var ba:ByteArray = new ByteArray(); ba.length = len; ba.endian = Endian.LITTLE_ENDIAN; ba.position = 0; alchemyByteArray.position = b0; ba.readBytes(alchemyByteArray,b0,len);
5) Same as 4, but do not time the allocation of the temp byteArray
as we can pre-allocate it when the app starts:
alchemyByteArray.position = b0; ba.position = 0; ba.readBytes(alchemyByteArray,b0,len);
6)Write as int, enrolled 8 times:
for(adr=b0;adr<max;adr+=32) { fastmem.fastSetI32(0,adr); fastmem.fastSetI32(0,adr+4); fastmem.fastSetI32(0,adr+8); fastmem.fastSetI32(0,adr+12); fastmem.fastSetI32(0,adr+16); fastmem.fastSetI32(0,adr+20); fastmem.fastSetI32(0,adr+24); fastmem.fastSetI32(0,adr+28); }
7)same as 6, but with a cast:
for(adr=b0;adr<max;adr+=32) { fastmem.fastSetI32(0,adr); fastmem.fastSetI32(0,int(adr+4)); fastmem.fastSetI32(0,int(adr+8)); fastmem.fastSetI32(0,int(adr+12)); fastmem.fastSetI32(0,int(adr+16)); fastmem.fastSetI32(0,int(adr+20)); fastmem.fastSetI32(0,int(adr+24)); fastmem.fastSetI32(0,int(adr+28)); }
8)same as 1, but enrolled 64 times, and perform adr+=8:
for(adr=b0;adr<max;) { fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); ....repeat this line 512/8 times }
8)same as 1, but enrolled 64 times
for(adr=b0;adr<max;adr+=512) { fastmem.fastSetDouble(mem,adr); fastmem.fastSetDouble(mem,adr+8); fastmem.fastSetDouble(mem,adr+16); fastmem.fastSetDouble(mem,adr+24); fastmem.fastSetDouble(mem,adr+32); fastmem.fastSetDouble(mem,adr+40); fastmem.fastSetDouble(mem,adr+48); fastmem.fastSetDouble(mem,adr+56); fastmem.fastSetDouble(mem,adr+64); fastmem.fastSetDouble(mem,adr+72); fastmem.fastSetDouble(mem,adr+80); fastmem.fastSetDouble(mem,adr+88); fastmem.fastSetDouble(mem,adr+96); fastmem.fastSetDouble(mem,adr+104); fastmem.fastSetDouble(mem,adr+112); fastmem.fastSetDouble(mem,adr+120); fastmem.fastSetDouble(mem,adr+128); fastmem.fastSetDouble(mem,adr+136); fastmem.fastSetDouble(mem,adr+144); fastmem.fastSetDouble(mem,adr+152); fastmem.fastSetDouble(mem,adr+160); fastmem.fastSetDouble(mem,adr+168); fastmem.fastSetDouble(mem,adr+176); fastmem.fastSetDouble(mem,adr+184); fastmem.fastSetDouble(mem,adr+192); fastmem.fastSetDouble(mem,adr+200); fastmem.fastSetDouble(mem,adr+208); fastmem.fastSetDouble(mem,adr+216); fastmem.fastSetDouble(mem,adr+224); fastmem.fastSetDouble(mem,adr+232); fastmem.fastSetDouble(mem,adr+240); fastmem.fastSetDouble(mem,adr+248); fastmem.fastSetDouble(mem,adr+256); fastmem.fastSetDouble(mem,adr+264); fastmem.fastSetDouble(mem,adr+272); fastmem.fastSetDouble(mem,adr+280); fastmem.fastSetDouble(mem,adr+288); fastmem.fastSetDouble(mem,adr+296); fastmem.fastSetDouble(mem,adr+304); fastmem.fastSetDouble(mem,adr+312); fastmem.fastSetDouble(mem,adr+320); fastmem.fastSetDouble(mem,adr+328); fastmem.fastSetDouble(mem,adr+336); fastmem.fastSetDouble(mem,adr+344); fastmem.fastSetDouble(mem,adr+352); fastmem.fastSetDouble(mem,adr+360); fastmem.fastSetDouble(mem,adr+368); fastmem.fastSetDouble(mem,adr+376); fastmem.fastSetDouble(mem,adr+384); fastmem.fastSetDouble(mem,adr+392); fastmem.fastSetDouble(mem,adr+400); fastmem.fastSetDouble(mem,adr+408); fastmem.fastSetDouble(mem,adr+416); fastmem.fastSetDouble(mem,adr+424); fastmem.fastSetDouble(mem,adr+432); fastmem.fastSetDouble(mem,adr+440); fastmem.fastSetDouble(mem,adr+448); fastmem.fastSetDouble(mem,adr+456); fastmem.fastSetDouble(mem,adr+464); fastmem.fastSetDouble(mem,adr+472); fastmem.fastSetDouble(mem,adr+480); fastmem.fastSetDouble(mem,adr+488); fastmem.fastSetDouble(mem,adr+496); fastmem.fastSetDouble(mem,adr+504); }
Results:
– casting with a int() is really slow.
– writing a local mem is faster than writing 0
this is not surprising as the compiler when pushing 0 on the stack does PushByte(0), which then needs to be converted to a Number, and when using a local variable, the variable already has the right type : Number, no conversions.
– readBytes with a pre-allocated buffer is fast, but we still pay the cost of calling some glue code function, there is also extra memory involved.
– the solution writing a Number equal 0 , with the loop enrolled 64 times is the best..
we could enroll the entire loop 🙂
Profiling code
protected function clearAlchemyByteArray():void { var b0:int = 32; var len:int = ((2048<<2)*3); var max:int = b0 + len; var mem:Number = 0; var REPS:int = 100; var calibrateTime:Number = AccurateTimer.calibrateTimer(); var adr:int = 0; var t:int,i:int; t = AccurateTimer.startTimer(); for(i=0;i<REPS;i++) for(adr=b0;adr<max;adr+=32) { fastmem.fastSetDouble(mem,adr); fastmem.fastSetDouble(mem,adr+8); fastmem.fastSetDouble(mem,adr+16); fastmem.fastSetDouble(mem,adr+24); } var time1:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); for(i=0;i<REPS;i++) for(adr=b0;adr<max;adr+=32) { fastmem.fastSetDouble(0,adr); fastmem.fastSetDouble(0,adr+8); fastmem.fastSetDouble(0,adr+16); fastmem.fastSetDouble(0,adr+24); } var time2:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); for(i=0;i<REPS;i++) for(adr=b0;adr<max;adr+=64) { fastmem.fastSetDouble(0,adr); fastmem.fastSetDouble(0,adr+8); fastmem.fastSetDouble(0,adr+16); fastmem.fastSetDouble(0,adr+24); fastmem.fastSetDouble(0,adr+32); fastmem.fastSetDouble(0,adr+40); fastmem.fastSetDouble(0,adr+48); fastmem.fastSetDouble(0,adr+56); } var time3:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); for(i=0;i<REPS;i++) { var ba:ByteArray = new ByteArray(); ba.length = len; ba.endian = Endian.LITTLE_ENDIAN; ba.position = 0; alchemyByteArray.position = b0; ba.readBytes(alchemyByteArray,b0,len); } var time4:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); for(i=0;i<REPS;i++) { alchemyByteArray.position = b0; ba.position = 0; ba.readBytes(alchemyByteArray,b0,len); } var time5:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); for(i=0;i<REPS;i++) for(adr=b0;adr<max;adr+=32) { fastmem.fastSetI32(0,adr); fastmem.fastSetI32(0,adr+4); fastmem.fastSetI32(0,adr+8); fastmem.fastSetI32(0,adr+12); fastmem.fastSetI32(0,adr+16); fastmem.fastSetI32(0,adr+20); fastmem.fastSetI32(0,adr+24); fastmem.fastSetI32(0,adr+28); } var time6:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); for(i=0;i<REPS;i++) for(adr=b0;adr<max;adr+=32) { fastmem.fastSetI32(0,adr); fastmem.fastSetI32(0,int(adr+4)); fastmem.fastSetI32(0,int(adr+8)); fastmem.fastSetI32(0,int(adr+12)); fastmem.fastSetI32(0,int(adr+16)); fastmem.fastSetI32(0,int(adr+20)); fastmem.fastSetI32(0,int(adr+24)); fastmem.fastSetI32(0,int(adr+28)); } var time7:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); for(i=0;i<REPS;i++) for(adr=b0;adr<max;) { fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); fastmem.fastSetDouble(mem,adr+=8); } var time8:Number = AccurateTimer.endTimer(t,calibrateTime); t = AccurateTimer.startTimer(); for(i=0;i<REPS;i++) for(adr=b0;adr<max;adr+=512) { fastmem.fastSetDouble(mem,adr); fastmem.fastSetDouble(mem,adr+8); fastmem.fastSetDouble(mem,adr+16); fastmem.fastSetDouble(mem,adr+24); fastmem.fastSetDouble(mem,adr+32); fastmem.fastSetDouble(mem,adr+40); fastmem.fastSetDouble(mem,adr+48); fastmem.fastSetDouble(mem,adr+56); fastmem.fastSetDouble(mem,adr+64); fastmem.fastSetDouble(mem,adr+72); fastmem.fastSetDouble(mem,adr+80); fastmem.fastSetDouble(mem,adr+88); fastmem.fastSetDouble(mem,adr+96); fastmem.fastSetDouble(mem,adr+104); fastmem.fastSetDouble(mem,adr+112); fastmem.fastSetDouble(mem,adr+120); fastmem.fastSetDouble(mem,adr+128); fastmem.fastSetDouble(mem,adr+136); fastmem.fastSetDouble(mem,adr+144); fastmem.fastSetDouble(mem,adr+152); fastmem.fastSetDouble(mem,adr+160); fastmem.fastSetDouble(mem,adr+168); fastmem.fastSetDouble(mem,adr+176); fastmem.fastSetDouble(mem,adr+184); fastmem.fastSetDouble(mem,adr+192); fastmem.fastSetDouble(mem,adr+200); fastmem.fastSetDouble(mem,adr+208); fastmem.fastSetDouble(mem,adr+216); fastmem.fastSetDouble(mem,adr+224); fastmem.fastSetDouble(mem,adr+232); fastmem.fastSetDouble(mem,adr+240); fastmem.fastSetDouble(mem,adr+248); fastmem.fastSetDouble(mem,adr+256); fastmem.fastSetDouble(mem,adr+264); fastmem.fastSetDouble(mem,adr+272); fastmem.fastSetDouble(mem,adr+280); fastmem.fastSetDouble(mem,adr+288); fastmem.fastSetDouble(mem,adr+296); fastmem.fastSetDouble(mem,adr+304); fastmem.fastSetDouble(mem,adr+312); fastmem.fastSetDouble(mem,adr+320); fastmem.fastSetDouble(mem,adr+328); fastmem.fastSetDouble(mem,adr+336); fastmem.fastSetDouble(mem,adr+344); fastmem.fastSetDouble(mem,adr+352); fastmem.fastSetDouble(mem,adr+360); fastmem.fastSetDouble(mem,adr+368); fastmem.fastSetDouble(mem,adr+376); fastmem.fastSetDouble(mem,adr+384); fastmem.fastSetDouble(mem,adr+392); fastmem.fastSetDouble(mem,adr+400); fastmem.fastSetDouble(mem,adr+408); fastmem.fastSetDouble(mem,adr+416); fastmem.fastSetDouble(mem,adr+424); fastmem.fastSetDouble(mem,adr+432); fastmem.fastSetDouble(mem,adr+440); fastmem.fastSetDouble(mem,adr+448); fastmem.fastSetDouble(mem,adr+456); fastmem.fastSetDouble(mem,adr+464); fastmem.fastSetDouble(mem,adr+472); fastmem.fastSetDouble(mem,adr+480); fastmem.fastSetDouble(mem,adr+488); fastmem.fastSetDouble(mem,adr+496); fastmem.fastSetDouble(mem,adr+504); } var time9:Number = AccurateTimer.endTimer(t,calibrateTime); startTest(); setTest( "1)write Double local mem" , "ByteArrayClear", time1 ); setTest( "2)write Double 0 32+" , "ByteArrayClear", time2 ); setTest( "3)write Double 0 64+" , "ByteArrayClear",time3 ); setTest( "4)readBytes+Alloc", "ByteArrayClear", time4 ); setTest( "5)readBytes", "ByteArrayClear", time5 ); setTest( "6)write int32 " , "ByteArrayClear", time6 ); setTest( "7)write int32 int() cast " , "ByteArrayClear", time7 ); setTest( "8)write Double local mem +=8" , "ByteArrayClear", time8 ); setTest( "9)write Double local mem +512" , "ByteArrayClear", time9 ); endTest(); }
Given that it’s about twice as fast when going from #1 (4 doubles per loop) to #9 (64 doubles per loop), perhaps writing all 3072 doubles with no loop would be twice as fast again. Yes, the code would be ugly. No, you shouldn’t have to do this to get good performance. 🙂
So I tried to enroll the full loop, use Assembly code, and push on the stack directly the destination addresses (so there is no add to perform as we already know where to clear in our byteArray)… I measured no gain when doing that.
Here the code: