A Fast sorting algorithm for almost sorted data



I am currently trying my hand at a compressor for the hutter prize. So
far my compressor has potential but is nowhere near ready.

It does however make heavy use of sorting. So spent some time trying
to find a specialise sorting algorithm which would sort this type of
data very quickly. There is plenty of information on the usual general
sorting algorithms on the net however I did not have much luck finding
one which would be very efficient on almost sorted data.

I would like to ask whether any one has knowledge of such a sorting
algorithm?

In my desperation i have made a potentially new sorting algorithm
which I am currently calling Run sort.

It seems to be on average twice as fast as java's standard
java.util.Arrays.sort(). According to Java's API... "The sorting
algorithm is a tuned quicksort, adapted from Jon L. Bentley and M.
Douglas McIlroy's "Engineering a Sort Function", Software-Practice and
Experience, Vol. 23(11) P. 1249-1265 (November 1993). This algorithm
offers n*log(n) performance on many data sets that cause other
quicksorts to degrade to quadratic performance. "

I also compared it to GNU complier's implementation of Arrays.sort()
which seems to be a insertion/merge sort hybrid. In this case the Run
sort seems to perform about 2-3 times as fast than the GNU sort.

This sorting Algorithm works ok, however i really need an ultra
efficent sort as it is my bottle neck in my compresser.



A bench mark program with source included in the JAR and data sets
generated by my compression algorithm can be found at:
http://javaunlimited.net/hosted/moogie/benchmark.zip

The high-level algorithm:

1. iterate over the input array identifying the runs of in-order
objects
2. while we have more than 1 run goto 3 else goto 11
3. iterate over the runs comparing the "first" object of each run to
identify the run which has the next sorted object making note of the
previously selected run.
4. if the selected run's "last" object is greater than or equal to the
previously selected run's "first" object then we can assume that the
entire selected run can be added to the sorted output array. goto 5
else goto 6
5. copy all the objects in the selected run into the sorted output
array. goto 7
6. copy only the "first" object in the selected run into the sorted
output array.
7. remove the "first" object from the selected run.
8. if the selected run has no more objects then goto 9 else 10
9. remove the selected run from the list of runs
10. goto 2
11. there is only one run and so directly copy the run into sorted
output array.
12. return the sorted array.



My current Implementation of Run Sort is as follows:

package wikicompress;

import java.util.Comparator;

/**
*
* @author nklaebe
* Email: budgetanime@xxxxxxxxxxx
*
*/
public class RunSort implements Comparator
{
final static public RunSort defaultComparator=new RunSort();

private RunSort()
{

}

public static void sort(Comparable a[])
{
sort(a,0,a.length,defaultComparator);
}

public static void sort(Comparable a[], int start,int end)
{
sort(a,start,end,defaultComparator);
}

public static void sort(Comparable a[], int start,int
end,Comparator comparator)
{
Comparable[] tempArray=new Comparable[end-start];
sort(a,start,end,comparator,tempArray);
}

public static void sort(Comparable a[], int start,int
end,Comparator comparator,Comparable[] tempArray)
{
// calculate the dimenstion of objects to sort in the given array
int size=end-start;


// copy the objects to sort into the working temporary array
System.arraycopy(a,start,tempArray,0,size);

boolean removeRun=false;
int prevIndex=0;
int i;
Comparable tmp=null;
Comparable prev;
Comparable nextBestObject;

Run root=null;
Run currRun;
Run prevRun=null;
Run selectedRun=null;
int runCounts=0;

//
// construct the runs of already sorted objects
//

// iterate over the array creating runs of objects already in
order...
prev=tempArray[0];
for (i=1;i<size;i++)
{
tmp=tempArray[i];
if (comparator.compare(tmp,prev)<0)
{
currRun=new Run();
runCounts++;
currRun.parentRun=prevRun;
if (prevRun!=null) prevRun.nextRun=currRun;
else root=currRun;
currRun.endIndex=i;
currRun.span=i-prevIndex;
currRun.lastObjectInRun=prev;

prevRun=currRun;
prevIndex=i;
}
prev=tmp;
}

// construct the final run
runCounts++;
currRun=new Run();
currRun.parentRun=prevRun;
if (prevRun!=null) prevRun.nextRun=currRun;
else root=currRun;
currRun.endIndex=i;
currRun.span=i-prevIndex;
currRun.lastObjectInRun=prev;

//
// construct the sorted objects
//

i=0;

// while there are still multiple runs...
while (runCounts>1)
{
// select the "first" object from the first run
prev=tempArray[root.endIndex-root.span];
nextBestObject=prev;
selectedRun=root;

// compare this object against the "first" objects from the other
existing runs.
currRun=root.nextRun;
while (currRun!=null)
{
tmp=tempArray[currRun.endIndex-currRun.span];

// if the current run's "first" object is higher than the
current best "first" object then this run now has the
// best "first" object and so update the selected run and best
object.
if (comparator.compare(tmp,prev)<0)
{
nextBestObject=prev;
selectedRun=currRun;
prev=tmp;
}
currRun=currRun.nextRun;
}


// check to see whether the entire run is less than the next
closest run's 'first' value.
// if it is then copy the entire run into the input array and set
the run to be removed.
if (selectedRun.span>10 &&
comparator.compare(nextBestObject,selectedRun.lastObjectInRun)>=0)
{
System.arraycopy(tempArray,selectedRun.endIndex-
selectedRun.span,a,start+i,selectedRun.span);
i+=selectedRun.span;
removeRun=true;
}
// else insert the best ( the next highest object) into the input
array.
else
{
a[start+i++]=prev;
}

// update the offset into the selected run. This offset
determines the "first" object of the run.
selectedRun.span--;

// if the run is set to be removed or if we have reached the end
of a run. i.e. a run with only one object...
if (removeRun || selectedRun.span==0)
{
removeRun=false;

// update the list of runs by removing the selected run from the
list making sure that the root run
// and the tail run are updated as necessary.

if (selectedRun==root)
{
root=selectedRun.nextRun;
}
else if (selectedRun.nextRun==null)
{
selectedRun.parentRun.nextRun=null;
}
else
{
selectedRun.parentRun.nextRun=selectedRun.nextRun;
selectedRun.nextRun.parentRun=selectedRun.parentRun;
}
runCounts--;
}
}

// we now only have one run and so no comparisions are needed and
just copy the run into the input array as is.
System.arraycopy(tempArray,root.endIndex-root.span,a,start
+i,root.span);
}


public int compare(Object arg0, Object arg1) {
return ((Comparable) arg0).compareTo(arg1);
}

}






package wikicompress;

public class Run {

/** the next Run in the list of Runs */
public Run nextRun;

/** this Run's parent in the list of Runs */
public Run parentRun;

/** the index into the input array where this run ends (non
inclusive) */
public int endIndex;

/** the current span length of the run. i.e. the size of the run */
public int span;

/** a reference to the last Object in this run */
Comparable lastObjectInRun;
}

.



Relevant Pages

  • Re: "Sorting" assignment
    ... If the array is already sorted, this means that you end up ... attempt to sort them. ... int quicksortPartition ... for (intIndex1 = intLeft; ...
    (comp.programming)
  • Re: indirect sort
    ... puts it into an array and sorts this array is faster than ... just sorting the objects. ... worthy to have my own implementation of sort(), ... comparator can simple subtract one argument from the other and cast ...
    (comp.lang.java.programmer)
  • Re: problem with array
    ... > different sort algorythms. ... > the sort the array elements are compared and swapped. ... void selectionsort(void *base, size_t n, size_t size, ... void PrintArrays(int *a, int **pa,size_t n); ...
    (comp.lang.c)
  • Re: indirect sort
    ... I wrote a Comparator (I don't want to use ... Comparable in order to be able to choose the field I am sorting by) ... a way to create a doublearray of my fields I want to sort by, ...
    (comp.lang.java.programmer)
  • Re: max size of ptr array
    ... The problem is that I get a stack overflow error when I do a quick sort ... you'll get a compiler error. ... int main{ ... And since an int requires 4 "bytes" plus the fact that your array is ...
    (microsoft.public.vc.language)