intro
I was trying to find a very efficient way to sort an array of integers (or vector.<int>).
Second goal is to beat the native Array.sort, so we don’t have to copy our Vector to an array then back to a Vector (which might cause Garbage collection).
I found this great post from Eugene http://blog.inspirit.ru/?p=271 with lots of sort algorithms implemented in as3, including flashSort, a non recursive quick-sort implementation and more.
flash sort code:
So given those timing results, I downloaded Eugene code, took the fastest algorithm , FlashSort, and converted the code to work for integers instead of Number as it was my requirement.
According to wikipedia, Flashsort is a distribution sorting algorithm showing linear computational complexity O(n) for uniformly distributed data sets and relatively little additional memory requirement. The original work was published in 1998 by Karl-Dietrich Neubert.
Note that Eugene code is a direct port of the “reference” flashSort code (here a bunch of implementations http://www.neubert.net/Flacodes/FLACodes.html#Sprung10)
static public function flashSortInt(a:Vector.<int>):void { var n:int = a.length; var i:int = 0, j:int = 0, k:int = 0, t:int; var m:int = int(n * .125); var l:Vector.<int> = new Vector.<int>(m); var anmin:int = a[0]; var nmax:int = 0; var nmove:int = 0; for (i = 1; i < n; ++i) { if (a[i] < anmin) anmin = a[i]; if (a[i] > a[nmax]) nmax = i; } if (anmin == a[nmax]) return; var c1:Number = (m - 1) / (a[nmax] - anmin); for (i = 0; i < n; ++i) { k = int(c1*(a[i] - anmin)); ++l[k]; } for (k = 1; k < m; ++k) { l[k] += l[int(k-1)]; } var hold:int = a[nmax]; a[nmax] = a[0]; a[0] = hold; var flash:int; j = 0; k = int(m - 1); i = int(n - 1); while (nmove < i) { while (j > (l[k]-1)) { k = int(c1 * (a[ int(++j) ] - anmin)); } flash = a[j]; while (!(j == l[k])) { k = int(c1 * (flash - anmin)); hold = a[ (t = int(l[k]-1)) ]; a[ t ] = flash; flash = hold; --l[k]; ++nmove; } } for(j = 1; j < n; ++j) { hold = a[j]; i = int(j - 1); while(i >= 0 && a[i] > hold) a[int(i+1)] = a[ int(i--) ]; a[int(i+1)] = hold; } }
Great, can we optimize it ?
Yes ! Let’s walk through the code, and see what optimizations can be done.
The first loop just walk the array fo find the min and the max value, and save the index:
for (i = 1; i < n; ++i) { if (a[i] < anmin) anmin = a[i]; if (a[i] > a[nmax]) nmax = i; }
In that code, Accessing a[i] twice (or 3 times if passing the first test) is slow, and we can cache the value in a local variable.
Now, for each iteration, we test (i<n) , (a[i] < anmin) , (a[i] > a[nmax]),
example: if the array has a length of 100000 , thats 300000 test/branches,
can we do better ?
If we enroll the loop to read 2 values each time, (and increment i+=2), we can find in the pair which one is the smallest, and do the test for anmin on the min value , and the test with a[nmax] on the max value of the pair, so we do:
var kmin:int,kmax:int,kimin:int,kimax:int; for (i=0; (i+=2) < n;) { if( (kmin=a[i])>(kmax=a[i-1])) { if (kmax< anmin) anmin = kmax; if (kmin> anmax) { anmax = kmin; nmax = i;} } else { if (kmin< anmin) anmin = kmin; if (kmax> anmax) { anmax = kmax; nmax = i-1;} } }
So this is 4 tests per iterations, but we divided by 2 the number of iterations,
so back to our example array of 100000, thats 100000/2 = 50000 iterations * 4 = 200000 tests instead of 300000 before.
Note that we skip the first element at the beginning, then do 2 values at a time, so if the array has an even length, we need to do a very last test after the enrolled loop:
var kmin:int,kmax:int,kimin:int,kimax:int; for (i=0; (i+=2) < n;) { if( (kmin=a[i])>(kmax=a[i-1])) { if (kmax< anmin) anmin = kmax; if (kmin> anmax) { anmax = kmin; nmax = i;} } else { if (kmin< anmin) anmin = kmin; if (kmax > anmax) { anmax = kmax; nmax = i-1;} } } if(--i < n) { if ((k = a[i]) < anmin) anmin = k; else if (k > anmax) { anmax = k; nmax = i;} }
next optimization
after that, we have:
var c1:Number = (m - 1) / (a[nmax] - anmin);
c1 is a number, which leads to int->number->int conversions in the following loops. So we can use a fixpoint value instead:
var c1:int = ((m - 1)<<13) / (anmax - anmin);
Note that the choice of shift 13 is debatable, and depends on the delta of the values and the number of elements in the array you are trying to sort, so if you are not sure, you might want to keep the c1 value as a Number. 13 works if the temporary array used by flashSort is less or equal than length=262143 (0x3ffff)
after that, the second loop:
for (i = 0; i < n; ++i) { k = (c1*(a[i] - anmin)); ++l[k]; }
So here again we can avoid the temporary k variable, but we need to shift to the right the index with 13 because of c1 is now a fixpoint:
for (i = -1; ++i < n;) { ++l[(c1*(a[i] - anmin))>>13]; }
next loop:
histogram:
for (k = 1; k < m; ++k) { l[k] += l[int(k-1)]; }
the biggest problem in that loop is the need to access the array twice per iteration. but we always access the previous element to sum the histogram, so we can cache the previous result (in var lk:int)
var lk:int = l[0]; for (k = 0; ++k < m;) { lk = (l[k] += lk); }
lets continue to the next code block:
while (j > (l[k]-1)) { k = int(c1 * (a[ int(++j) ] - anmin)); }
Instead of testing (j > (l[k]-1)), we can test while (j >= l[k]), which removes the need to do -1
Also, we need to change the code inside the loop a bit because of c1 being a fixpoint:
I also removed the unnecessary int( ) which by the way makes the function slower (for number to integer conversions, there is faster solutions in AS3:http://guihaire.com/code/?p=156)
while (j >= l[k]) { k = (c1 * (a[ (++j)] - anmin))>>13; }
lets go on…
while (!(j == l[k])) { k = (c1 * (flash - anmin)); hold = a[ (t = (l[k]-1)) ]; a[ t ] = flash; flash = hold; --l[k]; ++nmove; }
l[k] is accessed 3 times, the code can be re-worked to read and write l[k] a single time in the loop !
lk = l[k]; while (j !=lk) { hold = a[(lk = (--l[(k = ((c1 * (flash - anmin))>>13))]))]; a[lk] = flash; flash = hold; ++nmove; }
last loop
for(j = 1; j < n; ++j) { hold = a[j]; i = int(j - 1); while(i >= 0 && a[i] > hold) a[int(i+1)] = a[ int(i--) ]; a[int(i+1)] = hold; }
The last loop access the vector a[] 3 times per iteration, we can do better !
for(j = 0; ++j < n;) { hold = a[j]; i = j; while((--i >= 0) && ((k=a[i]) > hold)) a[i+1] = k; a[(i+1)] = hold; }
Finally, we can replace all the tests with (a<b) with the faster ((a-b)>>>31) (see this post: http://guihaire.com/code/?p=718)
Great, lets profile the new code :
Here with sorting 500000 integers:
Nice, 30% faster !!!
Can we do better ?
One nice attribute of the flashsort algorithm, is the fact that we can setup a temp array of the size we want. In Eugene implementation, its the size of the (array/8), is 1/8 the best value ?
var m:int = int(n * .125);
var l:Vector.<int> = new Vector.<int>(m);
So I modified the function to be able to choose the size of the array, and I tried:
tempBuffer = sizeOfArray/8 , *0.3 , *0.4, *0.43 , *0.5
According to other articles about FlashSort, 0.43 should be the best value, but it really depends on the generated code, computer used etc.. and in our case the Flash VM. so the temporary buffer size allow us to control how much time we spend doing the initial classification versus how much time we do the final insertion sort, so we need to find the best balance.
So here some new profiling results:
an additional 16% improvement, making the sorting 46% Faster !!!
Could we do even better ?
Probably,
* pre allocate the temporary array so we don’t pay allocation/initialization time + later Garbage collection time.
* we could try rewriting the flashSort in alchemy opcode.
* unroll more loops
* look at the generated assembly code and rewrite it with better stack control.
FlashSort updated code:
Anyway, here my updated code, if you spot additionals optimizations, please let me know !
static public function flashSortInt(a:Vector.<int> , multiplier:Number = 0.43):void { var n:int = a.length; var i:int = 0, j:int = 0, k:int = 0; var m:int = multiplier * n;if(m>262143)m=262143; var l:Vector.<int> = new Vector.<int>(m); var anmin:int = a[0]; var anmax:int = anmin; var nmax:int = 0; var nmove:int = 0; var lk:int; i =0; //do 2 indexes / iterations var kmin:int,kmax:int,kimin:int,kimax:int; for (i=0; ((i+=2)-n)>>>31;) { if( ((kmax=a[i-1])-(kmin=a[i]))>>>31) { if ((kmax-anmin)>>>31) anmin = kmax; if ((anmax-kmin)>>>31) { anmax = kmin; nmax = i;} } else { if ((kmin-anmin)>>>31) anmin = kmin; if ((anmax-kmax)>>>31) { anmax = kmax; nmax = i-1;} } } if((--i -n)>>>31) { if (((k = a[i])-anmin)>>>31) anmin = k; else if ((anmax-k)>>>31 ) { anmax = k; nmax = i;} } //var time3:int = getTimer(); //time2 = time3-time2; //trace(time1,time2); if (anmin == anmax) return; var c1:int = ((m - 1)<<13) / (anmax - anmin); for (i = -1; (++i - n)>>>31;) { ++l[(c1*(a[i] - anmin))>>13]; } lk = l[0]; for (k = 0; (++k - m)>>>31;) { lk = (l[k] += lk); } //swap a[nmax] and a[0] var hold:int = anmax; a[nmax] = a[0]; a[0] = hold; var flash:int; j = 0; k = (m - 1); i = (n - 1); while ((nmove - i)>>>31) { while (j >= l[k]) { k = (c1 * (a[ (++j)] - anmin))>>13; } flash = a[j]; lk = l[k]; while (j !=lk) { hold = a[(lk = (--l[(k = ((c1 * (flash - anmin))>>13))]))]; a[lk] = flash; flash = hold; ++nmove; } } for(j = 0; (++j - n)>>>31;) { hold = a[j]; i = j; while((--i >= 0) && ((hold-(k=a[i]))>>>31)) a[(i+1)] = k; a[(i+1)] = hold; } }
Sorts Comparaisons
How is our sort performing against the fastest native Flash sorting: Array.sort ?
– wow ! our Flash sort is 4x faster than the Array.sort !
– as we already know, Array.sort is much faster than Vector.sort
– The Shell sort (presented here http://guihaire.com/code/?p=20 is faster than Vector<>.sort, but overall gives poor results.
Here more profiling with fewer values sorted:
All the tests shows that my updated flashSort alway beat the Array.Sort, from 20 to 500000 items.
What do you get ?
profiling code
private function flashSortIntegerTestNum( nbr:int ):void { //generate nbr numbers: var i:int = -1; var array1:Vector.<int> = new Vector.<int>(nbr); var array2:Vector.<int> = new Vector.<int>(nbr); var array3:Vector.<int> = new Vector.<int>(nbr); var array4:Vector.<int> = new Vector.<int>(nbr); var array5:Vector.<int> = new Vector.<int>(nbr); var array6:Vector.<int> = new Vector.<int>(nbr); var array7:Vector.<int> = new Vector.<int>(nbr); var sorted:Vector.<int> = new Vector.<int>(nbr); var half:int = nbr>>1; for(i=-1;(++i)<nbr;) { sorted[i] = i-half; } for(i=-1;(++i)<nbr;) { var index:int = Math.random()*sorted.length; array7[i] = array6[i] = array5[i] = array4[i] = array3[i] = array2[i] = array1[i] = sorted[index]; sorted.splice(index,1); } var time1:uint = getTimer(); //FlashSort1 { SortAlgorithms.flashSortInt(array1); } var time2:uint = getTimer(); //FlashSort2 , 0.125 { SortAlgorithms.flashSortInt3(array2,0.125); } var time3:uint = getTimer(); //FlashSort2 , 0.3 { SortAlgorithms.flashSortInt3(array3,0.3); } var time4:uint = getTimer(); //FlashSort2 , 0.35 { SortAlgorithms.flashSortInt3(array4,0.35); } var time5:uint = getTimer(); //FlashSort2 , 0.4 { SortAlgorithms.flashSortInt3(array5,0.4); } var time6:uint = getTimer(); //FlashSort2 , 0.45 { SortAlgorithms.flashSortInt3(array6,0.45); } var time7:uint = getTimer(); //FlashSort2 , 0.5 { SortAlgorithms.flashSortInt3(array7,0.5); } var time8:uint = getTimer(); verifySort(array1); verifySort(array2); verifySort(array3); verifySort(array4); verifySort(array5); verifySort(array6); verifySort(array7); startTest(); setTest( "FlashEugene.sort" , "sortInt"+nbr, time2-time1 ); setTest( "FlashBen.125.sort" , "sortInt"+nbr, time3-time2 ); setTest( "FlashBen.3.sort" , "sortInt"+nbr, time4-time3 ); setTest( "FlashBen.35.sort" , "sortInt"+nbr, time6-time5 ); setTest( "FlashBen.4.sort" , "sortInt"+nbr, time5-time4 ); setTest( "FlashBen.45.sort" , "sortInt"+nbr, time7-time6 ); setTest( "FlashBen.5.sort" , "sortInt"+nbr, time8-time7 ); endTest(); }
On my 2.3 GHz Intel Core i7, Flash Player 11.8, Mac OS X 10.8 it’s hard to tell any difference at 20,000 since everything is mostly 1 ms. You could run the sort in a loop, say 100 times to get larger numbers with more differentiation.
With 100,000, your version beats the original by about 2x or about 11x depending on the temporary Vector size. Which version of yours is fastest is really inconsistent from test to test, so perhaps it’s really susceptible to variance in the random number generator. Regardless, it’s always faster than the original which is great.
Looking forward to the ultimate assembly version. 🙂
Jackson, I like your idea to time smaller sorting sizes, it will be interesting to know if FLashSort always beat the Array.sort (and what is the minimum size where it starts to be interesting)
Jackson, I updated the code with your suggestion, when performing a sort of less than 10000 values, it will do the test multiple times by allocating multiple arrays with the exact same random numbers in each. Note when performing the test multiple times, I always do the test from an unsorted array.
I also added in the graph the sorting of the native Array.sort
Its Good to see that my updated FlashSort is always faster than the Native Array.Sort!