Saturday, July 23, 2011

Atkin for everyone (benchmark)

How I got here is quite a long story. However, I'm benchmarking trivial implementations of the Atkin sieve with different programming languages/implementations. Why the "trivial implementation"? Because:

  1. I did not want to spend to much time to write clever ones in each language
  2. Because otherwise half the benchmark would have been about my specific optimization skills (platform internals knowledge etc.) in each language
  3. I'm lazy

So... we try not to put too much importance on unimportant things and draw some interesting conclusions. First, as a comparison I use a very fast C implementation of the atkin sieve which I found here. This is the bottom-line. It is a cleverly implemented version in a fast language. I do not expect anybody to get close.

It is almost unbetable:

% time primes 1 10000000 > /dev/null
primes 1 10000000 > /dev/null 0.07s user 0.00s system 97% cpu 0.070 total

Consider that I had to suppress the output because this one not only computes the primes, but it also prints them.

Quite interestingly an unoptimized Java implementation is just above 100 milliseconds. Ok, I did not write every number on standard output. C still is faster. The point is that the Java code took like 10 minutes to be written and is compared against a carefully crafted implementation in a language considered to be faster. This is a "good enough result" for Java (and it shows some excellent compiler work behind that). Both Java and C had to find the primes below 10000000. This is also what I did with the other languages.

Python

This is a kludge I used to test both regular Python (using numpy, which should provide a reasonably fast buffer to support the sieve) and Pypy, which should JIT to death integer computations.

I found quite interesting that using on not using numpy does not change things much. Regular Python took  6.16s without numpy and 6.12s with numpy. This makes sense: afterall, numpy is fast if used matrix-wise. In this case, however, I'm doing things element-wise. Perhaps I should devise a pure matrix manipulation implementation and benchmark that instead.

Pypy is surprisingly good: it takes 1.20 seconds to run the 10000000 numbers computation.

Javascript

Recently I got lots of interest in Javascript (especially thanks to the wonderful Javascript: the good parts, and the fact that about every piece of software I'm using nowadays is scripted in Javascript). So I decided to compare node.js/v8 as well. Writing the code was quite easy, even though Javascript has quite too many quirks for me to really like it. The virtual machine is impressing: less than one second (0.88) for the whole computation.

I have also to say that perhaps this benchmark is more in favor of Pypy than of v8, considering the huge amount of resources behind v8, compared to the resources behind pypy. Moreover, I find uncomparably more confortable to work with Python that with Javascript.

Clojure

After testing Java, the natural candidate is Clojure. How much inefficiency does the clojure runtime add? The code I wrote is voluntarily unclojurish and is more a direct translation from Java to clojure, exactly because now the focus is on the overhead. Apparently there are some big problems.

The first version I wrote did not involve the int coercions that the present version has and took about 2.8 secs. Replacing loops with doseqs made it slower (to about 3.x secs), so I went back to the explicit loop version. After that adding and removing coercions made times vary. Some coercions slowed things down, some sped them up. Another funny thing (logical, in fact) is that it is not true that a given coercion always speeds things up: for example the coercion on n1 was an improvement only after I also coerced end. For no reason I could fathom coercing n2 and n3 makes things worse.

In fact the presented version takes 1.5-1.6 seconds. However, adding the coercion on n2 makes the whole thing jump to 2.7 seconds. While for other coercions I understood why thing actually got worse, in this case I just don't know what to think.

Moreover, uncommenting the count-primes function call, makes the timing for atkin jump at 2.2 seconds. I think I plainly admit my incapacity to optimize clojure code: the logic behind its behavior just evades me.

Technorati Tags:

"JavaScript: The Good Parts" (Douglas Crockford)

2 comments:

Anonymous said...

Looking at the Clojure code I think you are running into boxing issues - assuming you are using Clojure 1.2 then most of your arithmetic operations are resulting in numbers getting boxed. Ways round this are either to use the unchecked-* operations or to move to Clojure 1.3 (where primitive operations are unboxed and much faster by default)

Unknown said...

Yes. I believe they are boxing issues as well (otherwise things would not improve/worsen depending on which values I coerce).

Unchecked? May try that one as well. However, I think its a bit beyond what I'm trying to do here. The other solutions were written without special thought on optimization.