Re: A Fast sorting algorithm for almost sorted data
- From: Matt Mahoney <matmahoney@xxxxxxxxx>
- Date: 4 May 2007 09:47:55 -0700
On May 4, 3:39 am, moogie <budgetan...@xxxxxxxxxxxxxx> wrote:
I am currently trying my hand at a compressor for the hutter prize. So
far my compressor has potential but is nowhere near ready.
It does however make heavy use of sorting. So spent some time trying
to find a specialise sorting algorithm which would sort this type of
data very quickly. There is plenty of information on the usual general
sorting algorithms on the net however I did not have much luck finding
one which would be very efficient on almost sorted data.
I would like to ask whether any one has knowledge of such a sorting
algorithm?
In my desperation i have made a potentially new sorting algorithm
which I am currently calling Run sort.
It seems to be on average twice as fast as java's standard
java.util.Arrays.sort(). According to Java's API... "The sorting
algorithm is a tuned quicksort, adapted from Jon L. Bentley and M.
Douglas McIlroy's "Engineering a Sort Function", Software-Practice and
Experience, Vol. 23(11) P. 1249-1265 (November 1993). This algorithm
offers n*log(n) performance on many data sets that cause other
quicksorts to degrade to quadratic performance. "
I also compared it to GNU complier's implementation of Arrays.sort()
which seems to be a insertion/merge sort hybrid. In this case the Run
sort seems to perform about 2-3 times as fast than the GNU sort.
This sorting Algorithm works ok, however i really need an ultra
efficent sort as it is my bottle neck in my compresser.
A bench mark program with source included in the JAR and data sets
generated by my compression algorithm can be found at:http://javaunlimited.net/hosted/moogie/benchmark.zip
The high-level algorithm:
1. iterate over the input array identifying the runs of in-order
objects
2. while we have more than 1 run goto 3 else goto 11
3. iterate over the runs comparing the "first" object of each run to
identify the run which has the next sorted object making note of the
previously selected run.
4. if the selected run's "last" object is greater than or equal to the
previously selected run's "first" object then we can assume that the
entire selected run can be added to the sorted output array. goto 5
else goto 6
5. copy all the objects in the selected run into the sorted output
array. goto 7
6. copy only the "first" object in the selected run into the sorted
output array.
7. remove the "first" object from the selected run.
8. if the selected run has no more objects then goto 9 else 10
9. remove the selected run from the list of runs
10. goto 2
11. there is only one run and so directly copy the run into sorted
output array.
12. return the sorted array.
My current Implementation of Run Sort is as follows:
package wikicompress;
import java.util.Comparator;
/**
*
* @author nklaebe
* Email: budgetan...@xxxxxxxxxxx
*
*/
public class RunSort implements Comparator
{
final static public RunSort defaultComparator=new RunSort();
private RunSort()
{
}
public static void sort(Comparable a[])
{
sort(a,0,a.length,defaultComparator);
}
public static void sort(Comparable a[], int start,int end)
{
sort(a,start,end,defaultComparator);
}
public static void sort(Comparable a[], int start,int
end,Comparator comparator)
{
Comparable[] tempArray=new Comparable[end-start];
sort(a,start,end,comparator,tempArray);
}
public static void sort(Comparable a[], int start,int
end,Comparator comparator,Comparable[] tempArray)
{
// calculate the dimenstion of objects to sort in the given array
int size=end-start;
// copy the objects to sort into the working temporary array
System.arraycopy(a,start,tempArray,0,size);
boolean removeRun=false;
int prevIndex=0;
int i;
Comparable tmp=null;
Comparable prev;
Comparable nextBestObject;
Run root=null;
Run currRun;
Run prevRun=null;
Run selectedRun=null;
int runCounts=0;
//
// construct the runs of already sorted objects
//
// iterate over the array creating runs of objects already in
order...
prev=tempArray[0];
for (i=1;i<size;i++)
{
tmp=tempArray[i];
if (comparator.compare(tmp,prev)<0)
{
currRun=new Run();
runCounts++;
currRun.parentRun=prevRun;
if (prevRun!=null) prevRun.nextRun=currRun;
else root=currRun;
currRun.endIndex=i;
currRun.span=i-prevIndex;
currRun.lastObjectInRun=prev;
prevRun=currRun;
prevIndex=i;
}
prev=tmp;
}
// construct the final run
runCounts++;
currRun=new Run();
currRun.parentRun=prevRun;
if (prevRun!=null) prevRun.nextRun=currRun;
else root=currRun;
currRun.endIndex=i;
currRun.span=i-prevIndex;
currRun.lastObjectInRun=prev;
//
// construct the sorted objects
//
i=0;
// while there are still multiple runs...
while (runCounts>1)
{
// select the "first" object from the first run
prev=tempArray[root.endIndex-root.span];
nextBestObject=prev;
selectedRun=root;
// compare this object against the "first" objects from the other
existing runs.
currRun=root.nextRun;
while (currRun!=null)
{
tmp=tempArray[currRun.endIndex-currRun.span];
// if the current run's "first" object is higher than the
current best "first" object then this run now has the
// best "first" object and so update the selected run and best
object.
if (comparator.compare(tmp,prev)<0)
{
nextBestObject=prev;
selectedRun=currRun;
prev=tmp;
}
currRun=currRun.nextRun;
}
// check to see whether the entire run is less than the next
closest run's 'first' value.
// if it is then copy the entire run into the input array and set
the run to be removed.
if (selectedRun.span>10 &&
comparator.compare(nextBestObject,selectedRun.lastObjectInRun)>=0)
{
System.arraycopy(tempArray,selectedRun.endIndex-
selectedRun.span,a,start+i,selectedRun.span);
i+=selectedRun.span;
removeRun=true;
}
// else insert the best ( the next highest object) into the input
array.
else
{
a[start+i++]=prev;
}
// update the offset into the selected run. This offset
determines the "first" object of the run.
selectedRun.span--;
// if the run is set to be removed or if we have reached the end
of a run. i.e. a run with only one object...
if (removeRun || selectedRun.span==0)
{
removeRun=false;
// update the list of runs by removing the selected run from the
list making sure that the root run
// and the tail run are updated as necessary.
if (selectedRun==root)
{
root=selectedRun.nextRun;
}
else if (selectedRun.nextRun==null)
{
selectedRun.parentRun.nextRun=null;
}
else
{
selectedRun.parentRun.nextRun=selectedRun.nextRun;
selectedRun.nextRun.parentRun=selectedRun.parentRun;
}
runCounts--;
}
}
// we now only have one run and so no comparisions are needed and
just copy the run into the input array as is.
System.arraycopy(tempArray,root.endIndex-root.span,a,start
+i,root.span);
}
public int compare(Object arg0, Object arg1) {
return ((Comparable) arg0).compareTo(arg1);
}
}
package wikicompress;
public class Run {
/** the next Run in the list of Runs */
public Run nextRun;
/** this Run's parent in the list of Runs */
public Run parentRun;
/** the index into the input array where this run ends (non
inclusive) */
public int endIndex;
/** the current span length of the run. i.e. the size of the run */
public int span;
/** a reference to the last Object in this run */
Comparable lastObjectInRun;
}
The problem might not be in the sort algorithm but in an inefficient
comparison function. I looked at fast sorting in developing BBB (a
BWT compressor, see http://cs.fit.edu/~mmahoney/compression/text.html#1640
)
I compared qsort(), sort(), and stable_sort() in g++, and found
stable_sort() to be the fastest. I believe they all use quicksort.
For the comparison function, I compare the first byte directly, then
if they are equal, I use memcmp() for the rest. I experimented with
variations of this. g++ optimizes the mem...() functions to inline
assembler. Look at the lessthan() function in bbb.cpp.
BBB is still slow on highly redundant files because it doesn't use any
preprocessing to prevent long matches, which slows down string
comparison. Most BWT compressors use something like LZP to remove
long matches first. Also, BBB was mainly designed to compress in
large blocks (80% of memory), which is why it does well on enwik9. On
smaller files it is comparable to other BWT compressors.
-- Matt Mahoney
.
- Follow-Ups:
- Re: A Fast sorting algorithm for almost sorted data
- From: Ben Rudiak-Gould
- Re: A Fast sorting algorithm for almost sorted data
- From: moogie
- Re: A Fast sorting algorithm for almost sorted data
- References:
- A Fast sorting algorithm for almost sorted data
- From: moogie
- A Fast sorting algorithm for almost sorted data
- Prev by Date: Re: a little (very little,) data, extracted from my front-end feeding a conventional compressor
- Next by Date: Re: PNG-ZLIB INFLATE question
- Previous by thread: Re: A Fast sorting algorithm for almost sorted data
- Next by thread: Re: A Fast sorting algorithm for almost sorted data
- Index(es):
Relevant Pages
|