Sunday, November 7, 2010

Java Collections sorting: copy, copy & copy in every sense!

This morning I was hacking into OpenJDK some implementation details. Curiosity, perhaps.

After the announcement that Java is going to use timsort for generic sorting (which is a very good piece of news).
By the way, this is an actual implementation. I'm also wondering why these things are mostly ignored in Algorithm courses, but this is another story.

So... I checked the actual implementation. I had to download the OpenJDK source, as apparently on OS X IntelliJ does not auto-show implementation of base classes when I "jump to implementation". Anyway... the implementation of sorting is basically in Arrays.java.

I want to point out the funny comment relative to sort1 (which is the private function which actually does the job for non floating point stuff):
    /*
     * The code for each of the seven primitive types is largely identical.
     * C'est la vie.
     */


which really saved the day... Yes... it is almost cut and paste for the primitive types (char, int, byte, etc). I was amazed... it is 2010, we have generics but, as primitive types are not objects we have to actually write it 7 times. As a side note, code is recursive quicksort variant.

Ok, nice. You have implementation constraints and you have to cope with that. Yes... C++ templates perfectly solve this crap. But then, there is another very funny discovery (well, in fact I have known this for quite a few time). But that is not the only (technically) unnecessary copy which happen(ed|s) when dealing with sorting in Java; this is the sort function in Collections.java

public static <T extends Comparable<? super T>> void sort(List<T> list) {
Object[] a = list.toArray();
Arrays.sort(a);
ListIterator<T> i = list.listIterator();
for (int j=0; j<a.length; j++) {
i.next();
i.set((T)a[j]);
}
}


Wow! It is clear what it does? For every collection you sort, a copy is made. To be honest, the documentation states this clearly:


This implementation dumps the specified list into an array, sorts
the array, and iterates over the list resetting each element
from the corresponding position in the array. This avoids the
n^2 log(n) performance that would result from attempting
to sort a linked list in place.


Nice! Very nice... this means that the default sort function may be unnecessarily slow. Unless you use arrays (but using arrays in place of collections is considered a bad practice). Moreover, Collection is an interface: its implementation may be such that a full copy into main memory is not feasible. BTW, I should double check because the comments in Arrays.sort state that the algorithm is a modified quicksort, not a modified mergesort.

Oh... nice! Tomorrow I am discussing other methods in Arrays and Collections.

No comments: