Saturday, July 23, 2011

Atkin for everyone (benchmark)

How I got here is quite a long story; the short version is that I'm benchmarking trivial implementations of the Atkin sieve across different programming languages/implementations. Why "trivial implementations"? Because:

  1. I did not want to spend too much time writing clever ones in each language
  2. Otherwise half the benchmark would have been about my specific optimization skills (platform internals knowledge, etc.) in each language
  3. I'm lazy

So... let's try not to put too much importance on unimportant things and see whether we can still draw some interesting conclusions. First, as a reference I use a very fast C implementation of the Atkin sieve which I found here. This is the bottom line: a cleverly crafted version in a fast language, and I do not expect anything else to get close.

It is almost unbeatable:

% time primes 1 10000000 > /dev/null
primes 1 10000000 > /dev/null 0.07s user 0.00s system 97% cpu 0.070 total

Note that I had to suppress the output, because this program not only computes the primes but also prints them.

Quite interestingly, an unoptimized Java implementation comes in just above 100 milliseconds. OK, I did not write every number to standard output, and C is still faster. The point is that the Java code took about 10 minutes to write and is being compared against a carefully crafted implementation in a language considered faster. This is a "good enough" result for Java (and it shows the excellent compiler work behind it). Both Java and C had to find the primes below 10000000, and that is also what I did with the other languages.

import java.lang.Boolean;
import java.lang.Integer;

public class AtkinSieveArray {
    static private int count(boolean[] buffer) {
        int counter = 0;
        for (int i = 2; i < buffer.length; ++i) {
            if (buffer[i]) {
                counter++;
            }
        }
        return counter;
    }

    static private void sieve(boolean[] buffer) {
        int end = buffer.length;
        double limit = Math.sqrt(end);
        int n, x, x2, x2_3, y, y2;
        int k, i, m;
        for (x = 1; x <= limit; ++x) {
            x2 = x * x;
            x2_3 = 3 * x2;
            for (y = 1; y <= limit; ++y) {
                y2 = y * y;
                n = 4 * x2 + y2;
                if ((n < end) && (n % 12 == 1 || n % 12 == 5)) {
                    buffer[n] = !buffer[n];
                }
                n = x2_3 + y2; /* 3x^2 + y^2 */
                if ((n < end) && (n % 12 == 7)) {
                    buffer[n] = !buffer[n];
                }
                n = x2_3 - y2; /* 3x^2 - y^2 */
                if ((x > y) && (n < end) && (n % 12 == 11)) {
                    buffer[n] = !buffer[n];
                }
            }
        }
        for (n = 5; n <= limit; n += 2) {
            if (buffer[n]) {
                k = n * n;
                for (i = 1, m = k; m < end; i += 2, m = i * k) {
                    buffer[m] = false;
                }
            }
        }
        buffer[3] = true;
        buffer[2] = true;
    }

    public static void main(String[] args) {
        int max;
        boolean buffer[] = null;
        if (args.length > 0) {
            max = Integer.valueOf(args[0]);
        } else {
            max = 1000;
        }
        buffer = new boolean[max + 1];
        long start = System.currentTimeMillis();
        sieve(buffer);
        System.out.println(count(buffer));
        System.out.println(System.currentTimeMillis() - start);
    }
}
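
For reference, the class can be compiled and run with something like this (it prints the number of primes found, followed by the elapsed milliseconds):

% javac AtkinSieveArray.java
% java AtkinSieveArray 10000000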

Python

This is a kludge I used to test both regular Python (using numpy, which should provide a reasonably fast buffer to support the sieve) and PyPy, which should JIT the integer computations to death.

try:
    import numpy as np

    def create_buffer(size):
        arr = np.zeros(size + 1, dtype=np.bool)
        return arr, arr.size
except ImportError:
    def create_buffer(size):
        arr = [False, ] * (size + 1)
        return arr, len(arr)

# ...

def main(args):
    max_integer = parse_args(args)
    buffer, upper_bound = create_buffer(max_integer)
    atkin(buffer, upper_bound)
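The part elided above (# ...) is essentially the atkin function shown further down plus a bit of glue. Something along these lines is enough to run it (parse_args here is just an illustrative stand-in, not the code I actually measured):

import sys
import time

def parse_args(args):
    # illustrative stand-in: read the upper bound from the command line,
    # defaulting to 10000000 as in the other benchmarks
    return int(args[1]) if len(args) > 1 else 10000000

if __name__ == '__main__':
    start = time.time()
    main(sys.argv)
    print(time.time() - start)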

I found it quite interesting that using or not using numpy does not change things much: regular Python took 6.16s without numpy and 6.12s with it. This makes sense: after all, numpy is fast when used matrix-wise, while here I'm doing things element-wise. Perhaps I should devise a pure matrix-manipulation implementation and benchmark that instead (a rough, unbenchmarked sketch of the idea follows the next listing).

import math

def atkin(buffer, end):
    limit = int(math.ceil(math.sqrt(end)))
    for x in range(1, limit + 1):
        x2 = x * x
        x2_3 = 3 * x2
        for y in range(1, limit + 1):
            y2 = y * y
            n = 4 * x2 + y2
            if (n < end) and (n % 12 == 1 or n % 12 == 5):
                buffer[n] = not buffer[n]
            n = x2_3 + y2
            if (n < end) and (n % 12 == 7):
                buffer[n] = not buffer[n]
            n = x2_3 - y2
            if (x > y) and (n < end) and (n % 12 == 11):
                buffer[n] = not buffer[n]
    for n in range(5, limit + 1):
        if buffer[n]:
            k = n * n
            m = k
            i = 1
            while m < end:
                buffer[m] = False
                i += 2
                m = i * k
    buffer[2] = True
    buffer[3] = True
    return buffer
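Just to make the "matrix-wise" remark concrete, a vectorized version could be sketched roughly as follows. It is untested against the benchmark above (so no timings), and the bincount-based toggle counting is a shortcut of mine rather than something taken from the measured code:

import numpy as np

def atkin_vectorized(limit):
    # Rough, unbenchmarked sketch of a "matrix-wise" Atkin sieve in numpy.
    # Instead of flipping entries one at a time, it counts how many times each
    # index is hit (np.bincount) and keeps the ones hit an odd number of times.
    buffer = np.zeros(limit + 1, dtype=bool)
    xs = np.arange(1, int(np.ceil(np.sqrt(limit))) + 1)
    x2 = (xs * xs)[:, None]        # column vector of x^2
    y2 = (xs * xs)[None, :]        # row vector of y^2
    toggles = np.zeros(limit + 1, dtype=int)

    n = (4 * x2 + y2).ravel()      # 4x^2 + y^2 with n % 12 in {1, 5}
    n = n[(n <= limit) & ((n % 12 == 1) | (n % 12 == 5))]
    toggles += np.bincount(n, minlength=limit + 1)

    n = (3 * x2 + y2).ravel()      # 3x^2 + y^2 with n % 12 == 7
    n = n[(n <= limit) & (n % 12 == 7)]
    toggles += np.bincount(n, minlength=limit + 1)

    n = (3 * x2 - y2)[np.greater.outer(xs, xs)]  # 3x^2 - y^2, only where x > y
    n = n[(n > 0) & (n <= limit) & (n % 12 == 11)]
    toggles += np.bincount(n, minlength=limit + 1)

    buffer[toggles % 2 == 1] = True              # odd number of hits: candidate prime

    for k in range(5, int(np.sqrt(limit)) + 1):  # remove multiples of squares
        if buffer[k]:
            buffer[k * k::k * k] = False

    buffer[2] = buffer[3] = True
    return buffer

Whether this actually beats the element-wise loop I have not checked; it is only meant to show what "using numpy matrix-wise" would look like.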

PyPy is surprisingly good: it takes 1.20 seconds to run the computation up to 10000000.

Javascript

Recently I have gotten quite interested in Javascript (thanks especially to the wonderful JavaScript: The Good Parts, and to the fact that about every piece of software I use nowadays is scripted in Javascript). So I decided to benchmark node.js/v8 as well. Writing the code was quite easy, even though Javascript has a few too many quirks for me to really like it. The virtual machine is impressive: less than one second (0.88s) for the whole computation.

Array.dim = function(dimension, initial) {
    var a = [], i;
    for (i = 0; i < dimension; i += 1) {
        a[i] = initial;
    }
    return a;
};

var atkin = function(buffer) {
    var end = buffer.length;
    var limit = Math.sqrt(end);
    var n, x, x2, x2_3, y, y2;
    var k, i, m;
    for (x = 1; x <= limit; ++x) {
        x2 = x * x;
        x2_3 = 3 * x2;
        for (y = 1; y <= limit; ++y) {
            y2 = y * y;
            n = 4 * x2 + y2;
            if ((n < end) && (n % 12 == 1 || n % 12 == 5)) {
                buffer[n] = ~buffer[n]; /* ~ flips the entry between truthy and falsy */
            }
            n = x2_3 + y2; /* 3x^2 + y^2 */
            if ((n < end) && (n % 12 == 7)) {
                buffer[n] = ~buffer[n];
            }
            n = x2_3 - y2; /* 3x^2 - y^2 */
            if ((x > y) && (n < end) && (n % 12 == 11)) {
                buffer[n] = ~buffer[n];
            }
        }
    }
    for (n = 5; n <= limit; n += 2) {
        if (buffer[n]) {
            k = n * n;
            for (i = 1, m = k; m < end; i += 2, m = i * k) {
                buffer[m] = false;
            }
        }
    }
    buffer[3] = true;
    buffer[2] = true;
};

var printAll = function(buffer) {
    var i;
    for (i = 0; i < buffer.length; ++i) {
        if (buffer[i]) {
            console.log("%d, ", i);
        }
    }
};

var countAll = function(buffer) {
    var i, counter;
    counter = 0;
    for (i = 0; i < buffer.length; ++i) {
        if (buffer[i]) {
            counter += 1;
        }
    }
    return counter;
};

var defaultSize = 1000;
var size = defaultSize;
var sizeString = process.argv[2];
if (sizeString !== undefined) {
    size = parseInt(sizeString, 10);
    size = isNaN(size) ? defaultSize : size; /* NaN never equals itself, so test with isNaN */
}
var buffer = Array.dim(size, false);
console.time('atkin');
atkin(buffer);
console.timeEnd('atkin');
console.log("%d\n", countAll(buffer));
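It can be run under node with something like:

% node atkin.js 10000000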

I also have to say that this benchmark is perhaps more flattering to PyPy than to v8, considering the huge amount of resources behind v8 compared to those behind PyPy. Moreover, I find it incomparably more comfortable to work with Python than with Javascript.

Clojure

After testing Java, the natural next candidate is Clojure: how much overhead does the Clojure runtime add? The code I wrote is deliberately un-Clojure-ish, more a direct translation from Java to Clojure, precisely because the focus here is on the overhead. Apparently there are some big problems.

(set! *warn-on-reflection* true)

(defn atkin [^booleans buffer]
  (let [end (int (alength buffer))
        limit (int (Math/ceil (Math/sqrt end)))]
    (loop [x (int 1)]
      (let [x2 (int (* x x))
            x2_3 (int (* x2 3))]
        (loop [y (int 1)]
          (let [y2 (int (* y y))
                n1 (int (+ (* 4 x2) y2))
                n2 (+ x2_3 y2)
                n3 (- x2_3 y2)]
            (when (and (< n1 end) (or (= (mod n1 12) 1) (= (mod n1 12) 5)))
              (aset buffer n1 (not (aget buffer n1))))
            (when (and (< n2 end) (= (mod n2 12) 7))
              (aset buffer n2 (not (aget buffer n2))))
            (when (and (> x y) (< n3 end) (= (mod n3 12) 11))
              (aset buffer n3 (not (aget buffer n3)))))
          (if (< y limit) (recur (inc y))))
        (if (< x limit) (recur (inc x)))))
    (loop [n (int 5)]
      (let [k (int (* n n))]
        (when (aget buffer n)
          (doseq [i (iterate #(+ 2 %) 1) :let [m (* i k)]
                  :while (< m end)]
            (aset buffer m false))))
      (if (<= n limit) (recur (int (+ n 2)))))
    (aset buffer 2 true)
    (aset buffer 3 true)))

(defn print-numbers [buffer]
  (let [indexed-buffer (map vector (iterate inc 0) buffer)]
    (doseq [[i v] indexed-buffer
            :when v]
      (println i))))

(defn count-primes [buffer]
  (reduce + (for [bit buffer :when bit] 1)))

(let [buffer (boolean-array 10000000)]
  (time (atkin buffer))
  ;(print-numbers buffer)
  ;(println (count-primes buffer)))
  )
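To run it, something along these lines should do, depending on how Clojure is installed:

% java -cp clojure.jar clojure.main atkin.clj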

The first version I wrote did not have the int coercions the present version has, and it took about 2.8 seconds. Replacing the loops with doseqs made it slower (around 3.x seconds), so I went back to the explicit loop version. After that, adding and removing coercions made the times vary: some coercions slowed things down, some sped them up. Another funny (in fact, logical) thing is that a given coercion does not always speed things up: for example, the coercion on n1 was an improvement only after I also coerced end. And for no reason I could fathom, coercing n2 and n3 makes things worse.

In fact, the presented version takes 1.5-1.6 seconds, but adding the coercion on n2 makes the whole thing jump to 2.7 seconds. For the other coercions I understood why things got worse; in this case I just don't know what to think.

Moreover, uncommenting the count-primes call makes the timing for atkin jump to 2.2 seconds. I plainly admit my inability to optimize Clojure code: the logic behind its behavior just evades me.

"JavaScript: The Good Parts" (Douglas Crockford)

2 comments:

Anonymous said...

Looking at the Clojure code I think you are running into boxing issues - assuming you are using Clojure 1.2 then most of your arithmetic operations are resulting in numbers getting boxed. Ways round this are either to use the unchecked-* operations or to move to Clojure 1.3 (where primitive operations are unboxed and much faster by default)

Unknown said...

Yes. I believe they are boxing issues as well (otherwise things would not improve/worsen depending on which values I coerce).

Unchecked? I may try that one as well. However, I think it's a bit beyond what I'm trying to do here: the other solutions were written without any special thought given to optimization.