Monday, August 8, 2011

Python, Numpy, Atkin & Erathostenes

In this post I compared some versions of the Atkin sieve in different languages. Basically it was just the same version in different languages.

Regarding Python, I had a version in basic Python and one using numpy arrays as the intermediary data structure. Essentially the execution time was the same. Somewhere in the post I commented that it was unsurprising and that I should write a version for numpy which more thoroughly used vector-operations to speed things up.

Unfortunately, it seems to me that atkin algorithm is not well suited to this kind of optimization. At least, I did not found an easy way to do it. On the other hand, Eratosthenes Sieve has an extremely natural formulation in terms of vector operations. The reason why this is way faster than using element-wise operation is that essentially it means pushing whole loops below the python level, at C/Fortan level (also reducing the creation of Python objects in favor of primitive C types).

Essentially the code is somewhat easier to read (having basic numpy/matlab understanding) that the regular version.

The times are in the order of:

% time python erat_matrix.py 10000000
(array([      2,       3,       5, ..., 9999971, 9999973, 9999991]),)
python erat_matrix.py 10000000  0.61s user 0.12s system 96% cpu 0.750 total

(that result was to compare times with those in the previous post, i.e., they are better than anything but Java and the highly optimized C code) and

% time python erat_matrix.py 100000000
(array([       2,        3,        5, ..., 99999959, 99999971, 99999989]),)
python erat_matrix.py 100000000  5.90s user 0.85s system 99% cpu 6.781 total

The result is rather surprising: I have seen the code running from 36 seconds to the result reported. Yesterday it was almost always around 10 seconds, so I decided to make "faster" versions. Today such "faster versions" are not faster, but at least are more consistent (always around 7 seconds, instead than varying so much). I have no clues on the reason of these discrepancies: I'm not seeing any exceptional load on the machine. Benchmarks lies, that's it.

To see if the code could benefit from a list of prebuilt primes I first tried with some manual unrolling

and since the results were promising:

Essentially, this version always runs in about 7 seconds on my machine, which is roughly 10 times slower than the highly optimized atkins sieve in C.

No comments: