Wednesday, July 27, 2011

Once we were...

Recently I've been working with a strange Python module (which I won't name here). The module apparently works well. Still, when I read through the sources I felt bad. The code was so unpythonic that a direct translation from Java would have given more pythonic results.

Object oriented to the bone, even where it is not necessary. Type checks done with type(foo) == ....
As if PEP 8 had been burned at the stake...
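
To illustrate the kind of check I mean (the names here are made up for the example, they are not from the actual module), compare an exact type comparison with the more pythonic isinstance, which also plays well with subclasses and duck typing:

    class Animal:
        pass

    class Dog(Animal):
        pass

    rex = Dog()

    # the style I found everywhere: exact type comparison
    if type(rex) == Animal:        # False: subclasses do not count
        print("animal (type check)")

    # the more pythonic check
    if isinstance(rex, Animal):    # True: works for subclasses too
        print("animal (isinstance)")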

I have to fight the temptation to just fix the whole codebase, which would probably break everything. The tests are there, but the usual ways to run them do not work.

How is it that I'm not surprised I found a .pydevproject file in the root? Ah, once there were just vim and Emacs. And the CLI. And things were harder to get working, but then they worked. For everyone.

Saturday, July 23, 2011

Atkin for everyone (benchmark)

How I got here is quite a long story. In short, I'm benchmarking trivial implementations of the Atkin sieve across different programming languages/implementations. Why "trivial implementations"? Because:

  1. I did not want to spend too much time writing clever ones in each language
  2. Otherwise half the benchmark would have been about my specific optimization skills (knowledge of platform internals, etc.) in each language
  3. I'm lazy

So... we try not to put too much importance on unimportant things and draw some interesting conclusions. First, as a comparison I use a very fast C implementation of the Atkin sieve which I found here. This is the baseline: a cleverly implemented version in a fast language. I do not expect anybody to get close.

It is almost unbeatable:

% time primes 1 10000000 > /dev/null
primes 1 10000000 > /dev/null 0.07s user 0.00s system 97% cpu 0.070 total

Consider that I had to suppress the output because this implementation not only computes the primes, it also prints them.

Quite interestingly, an unoptimized Java implementation comes in just above 100 milliseconds. Ok, I did not write every number to standard output, and C is still faster. The point is that the Java code took about 10 minutes to write and is being compared against a carefully crafted implementation in a language considered to be faster. This is a "good enough" result for Java (and it shows some excellent compiler work behind it). Both Java and C had to find the primes below 10000000, which is also what I asked of the other languages.

Python

This is a kludge I used to test both regular Python (with numpy, which should provide a reasonably fast buffer to back the sieve) and Pypy, which should JIT integer computations to death.
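
For reference, a trivial element-wise Sieve of Atkin in Python looks roughly like this (a sketch, not the exact code I ran, but the same trivial flavor):

    def atkin(limit):
        sieve = [False] * (limit + 1)
        root = int(limit ** 0.5) + 1
        # toggle candidates according to the three quadratic forms
        for x in range(1, root):
            for y in range(1, root):
                n = 4 * x * x + y * y
                if n <= limit and n % 12 in (1, 5):
                    sieve[n] = not sieve[n]
                n = 3 * x * x + y * y
                if n <= limit and n % 12 == 7:
                    sieve[n] = not sieve[n]
                n = 3 * x * x - y * y
                if x > y and n <= limit and n % 12 == 11:
                    sieve[n] = not sieve[n]
        # eliminate multiples of squares of primes
        for k in range(5, root):
            if sieve[k]:
                for m in range(k * k, limit + 1, k * k):
                    sieve[m] = False
        return [2, 3] + [n for n in range(5, limit + 1) if sieve[n]]

    print(len(atkin(10000000)))   # 664579 primes below 10000000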

I found it quite interesting that using or not using numpy does not change things much. Regular Python took 6.16s without numpy and 6.12s with numpy. This makes sense: after all, numpy is fast when used matrix-wise, while here I'm doing things element-wise. Perhaps I should devise a pure matrix-manipulation implementation and benchmark that instead; a sketch of what I have in mind is below.
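
Something like the following, where the three quadratic forms are evaluated on whole x/y grids at once instead of one candidate at a time (an untested sketch, not part of the benchmark above):

    import numpy as np

    def atkin_numpy(limit):
        root = int(limit ** 0.5) + 1
        x = np.arange(1, root)[:, None]    # column vector
        y = np.arange(1, root)[None, :]    # row vector

        # evaluate each quadratic form on the whole grid at once
        n1 = (4 * x * x + y * y).ravel()
        n1 = n1[(n1 <= limit) & ((n1 % 12 == 1) | (n1 % 12 == 5))]
        n2 = (3 * x * x + y * y).ravel()
        n2 = n2[(n2 <= limit) & (n2 % 12 == 7)]
        n3 = (3 * x * x - y * y)[x > y]
        n3 = n3[(n3 <= limit) & (n3 % 12 == 11)]

        # a candidate survives if it is generated an odd number of times
        flips = np.zeros(limit + 1, dtype=np.int64)
        for cand in (n1, n2, n3):
            flips += np.bincount(cand, minlength=limit + 1)
        sieve = (flips % 2).astype(bool)

        # eliminate multiples of squares of primes
        for k in range(5, root):
            if sieve[k]:
                sieve[k * k::k * k] = False
        return np.concatenate(([2, 3], np.flatnonzero(sieve)))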

Pypy is surprisingly good: it takes 1.20 seconds to run the computation up to 10000000.

Javascript

Recently I have developed a lot of interest in Javascript (especially thanks to the wonderful JavaScript: The Good Parts, and the fact that almost every piece of software I'm using nowadays is scripted in Javascript). So I decided to benchmark node.js/v8 as well. Writing the code was quite easy, even though Javascript has too many quirks for me to really like it. The virtual machine is impressive: less than one second (0.88) for the whole computation.

I also have to say that perhaps this benchmark speaks more in favor of Pypy than of v8, considering the huge amount of resources behind v8 compared to the resources behind Pypy. Moreover, I find it incomparably more comfortable to work with Python than with Javascript.

Clojure

After testing Java, the natural candidate is Clojure. How much inefficiency does the Clojure runtime add? The code I wrote is deliberately un-clojurish and is more of a direct translation from Java to Clojure, exactly because the focus here is on the overhead. Apparently there are some big problems.

The first version I wrote did not have the int coercions that the present version has and took about 2.8 seconds. Replacing loops with doseqs made it slower (about 3.x seconds), so I went back to the explicit loop version. After that, adding and removing coercions made the times vary: some coercions slowed things down, some sped them up. Another funny (in fact, logical) thing is that it is not true that a given coercion always speeds things up: for example, the coercion on n1 was an improvement only after I also coerced end. And for no reason I could fathom, coercing n2 and n3 made things worse.

In fact the presented version takes 1.5-1.6 seconds. However, adding the coercion on n2 makes the whole thing jump to 2.7 seconds. While for other coercions I understood why things actually got worse, in this case I just don't know what to think.

Moreover, uncommenting the count-primes function call makes the timing for atkin jump to 2.2 seconds. I plainly admit my inability to optimize Clojure code: the logic behind its behavior just evades me.

Books

"JavaScript: The Good Parts" (Douglas Crockford)

Thursday, July 21, 2011

Is state really bad?

Prologue

This post was actually started in March, when some of the facts referred to in the body of the article happened. I am resuming it today, after months of extensive (and rather unbloggable) work.

Intro

Stateful programming is more or less the default strategy in the industry nowadays, and it has been for the last 60 years. If we consider the 10 most widespread languages in the TIOBE index (one statistic is as good as another), they are all imperative languages strongly based on stateful programming. This month Lisp has improved enormously (Land of Lisp, anyone?), but even if we considered Common Lisp and Scheme together, they would not make the first 10 positions (even if only by a few points).

The Haskell case

In fact, in the last few years we have witnessed an increase of interest in functional languages. As far as I remember, about 5 years ago there was a lot of hype about Haskell (which was already some 15 years old by then). It was related to a famous user-friendly Linux distro having written some of its tools in Haskell, and to the Darcs project. Darcs has almost disappeared, with Git (Perl+C) and Mercurial/Bazaar (Python, Python+C) sharing the market of distributed version control tools.

The case is rather interesting: Darcs was built with a very strong "theory" behind it, the algebra of patches. The problem is that the merge process is still exponential in the worst case, and that the tool is rather buggy. It is interesting that the software is both slow and bug-ridden. It is tempting to infer something about Haskell, at least to counter the hypothesis that pure functional programming makes it relatively easy to write bug-free programs. Perhaps this should be taken into account the next time someone looks down on setf. Well, I do like Haskell, just not as much as I love Lisp.

The last thing I would like to mention is that I like the idea of insulating imperative code from functional/algorithmic code (but this is something I suggest in every language). However, being forced to do that is not necessarily a good idea. In fact, it makes Haskell quite a bit harder to pick up, and it has the paradoxical effect that if you are a developer good enough to code in Haskell, you probably already know how to separate side-effecting code from side-effect-free code. Besides, although I like purity (to some extent), I'm not sure whether state monads should be considered stateless. I mean, a state monad implements state in a nice functional construct, mathematically sound and pure. In fact, it is almost an executable translation of the declarative semantics of a stateful program. Consequently, I'm not sure whether frequent use of state monads is really an improvement over carefully used stateful programming.

In other words, although I admit my limited experience with Haskell in real-world projects, I believe that if we think functionally the code is going to be clear (but it will be as clear in Haskell as in Lisp). On the other hand, if we think in a heavily stateful way, our code may be less clear in Lisp and quite messy in Haskell, where we would implement our (wrong) thinking with state monads popping out everywhere. Feel free to correct me, possibly with real experience and examples.

Erlang and Clojure

Erlang differs from Haskell in that functional, stateless programming is required for practical purposes: in order to allow a shared-nothing, process-based architecture (and the actor model), state is essentially banned. Of course, there are solutions which bring state back (ETS, DETS, plus various libraries such as Mnesia). It is worth noting that this kind of state is generally different from the usual imperative state.

In imperative languages everything is stateful by default. Here, on the other hand, we basically have only "db-like" solutions. This does not encourage real stateful programming and, in fact, usual Erlang programs are pretty much state-free. The essential difference is that they are not side-effect-free.

Being state-free and being side-effect-free are two very different things. Having no side effects is pure functional programming, and it is extremely hard to do. In fact, it would not even make sense in a programming language built around the idea of passing messages between processes: if you send a message, you are not side-effect-free anymore.

Somewhat in the same sense, Clojure is stateless but not side-effect-free. Most Clojure coding is performed without explicit state change, and there are a few explicit ways to deal with state. This is done in a transactional way, so that state change is essentially allowed when explicitly requested. State change should not be mingled with stuff which has effects outside the Clojure process: you can "unset" a variable and recover the old value, but you cannot unsend a message.

At the moment I feel that I can easily tolerate (and like) the kind of state-change insulation provided by Clojure, and the pragmatics, typical of all functional languages, of avoiding side-effecting stuff for as long as possible. On the other hand, I find it really hard to accept absolute statelessness.

Lisp and Languages with FOFs

On the other end of the spectrum of "functional languages" we have Common Lisp and the various freedom languages where functions are first-order objects (Python, Ruby, ...), with Scheme leaning a bit more towards the functional side. Technically speaking, Python is not even a functional language. It is surely not stateless (but neither is Lisp). In fact, in Python recursion is frowned upon in favor of iteration (but iteration is often built around generators, which are a user-friendly/dumbed-down[0] instance of anamorphism -- unfold --, which is strongly functionally flavored). On the other hand, some call Ruby the Matz-Lisp.
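
To make the generators-as-unfold remark concrete, here is a toy unfold written as a generator, next to the plain generator a Python programmer would actually write (an illustrative sketch only):

    def unfold(step, seed):
        # anamorphism: step(seed) returns (value, next_seed), or None to stop
        while True:
            result = step(seed)
            if result is None:
                return
            value, seed = result
            yield value

    def countdown(n):
        # the "dumbed-down" everyday version of the same idea
        while n > 0:
            yield n
            n -= 1

    print(list(unfold(lambda n: (n, n - 1) if n > 0 else None, 5)))  # [5, 4, 3, 2, 1]
    print(list(countdown(5)))                                        # [5, 4, 3, 2, 1]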
In these languages there is no explicit restriction on stateful programming. This means that the usual algorithms and ways of thinking are easy to implement (in Lisp especially, both very functional and very stateful algorithms are easy to express). On the other hand, restricting state change is left to the good sense of the programmer (which is not always a good thing, but I don't really like languages which tell me exactly what to do... they just have to make the better way obvious -- I'm not Dutch, sorry --).

To be honest, there are things that are just so much easier in stateful programming than in stateless programming, and for other stuff the converse applies. Consequently, the language should not push me too hard towards one style or the other. Moreover, stateless-only languages tend to be harder to pick up for beginners (Python and Ruby, on the other hand, are extremely easy to start with, and I believe Scheme is a wonderful first language for computer scientists). Finally, stuff like GUIs is really hard to implement in fully stateless settings (Clojure essentially relies on the underlying Java classes, and I have actually never used Erlang for GUIs -- not that I am a strong GUI developer --).

On performance

One of the consequences is that lots of algorithms, data structures and ideas have simply been developed with stateful programming in mind. This is the reason books such as Okasaki's Purely Functional Data Structures have been written: precisely to bridge the gap between mainstream stateful programming and functional reasoning.

Unfortunately many algorithms have been created with imperative languages in mind, and the data structures supporting those algorithms strongly reflect the imperative setting. Implementing such algorithms in functional settings is usually inefficient, to say the least. Much of the theory (and not only the practice) is built around state change: The Art of Computer Programming uses a pseudo-assembly language which essentially manipulates a von Neumann machine.

Something as easy as "modify an array" may be a source of inefficiency in many languages. E.g., Clojure provides transients just to make it cheaper to express such operations. On the other hand, functional languages usually have some trouble providing a data structure that is immutable, offers O(1) random element access and allows element modification without copying the whole data structure.
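
A deliberately naive illustration of the problem in Python (plain copying, not the persistent trees Clojure actually uses): updating one element of a mutable list is O(1), while "updating" an immutable tuple means rebuilding it, which is O(n) per update.

    from timeit import timeit

    N = 1000000
    xs = list(range(N))       # mutable: in-place update
    ys = tuple(range(N))      # immutable: must be rebuilt to "change"

    def set_list(i, v):
        xs[i] = v             # O(1) in-place assignment

    def set_tuple(t, i, v):
        # rebuilding the whole structure just to change one element: O(n)
        return t[:i] + (v,) + t[i + 1:]

    t_list = timeit(lambda: set_list(N // 2, -1), number=1000)
    t_tuple = timeit(lambda: set_tuple(ys, N // 2, -1), number=1000)
    print(t_list, t_tuple)    # the tuple version is orders of magnitude slower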

Books


"Purely Functional Data Structures" (Chris Okasaki)


"Art of Computer Programming, The, Volumes 1-3 Boxed Set (3rd Edition) (Vol 1-3)" (Donald E. Knuth)



Notes

[0] Depends on whether you are a Python or a Scheme programmer: Python programmers usually say user-friendly.