Thursday, December 30, 2010

IntelliJ 10 and PyCharm

So now I'm downloading IntelliJ 10. I already bought the upgrade, without even trying it. I am a very happy IntelliJ 9 user, the upgrade price is reasonable (especially because I qualify for academic discounts).

Moreover, I tried Pycharm and startup times have improved greatly. And IntelliJ is also very fast, almost comparable with some not too lightweight editors. I think I can live without many of the improvements they did to the X version of IntelliJ, however faster startup + mercurial + github integration for me are worth the price. And then there is the question of Python and Ruby plugins, which alone can make the difference.

If the Python plugin is as good as PyCharm (and it should) then I can do without upgrading WingIDE. WingIDE is still GTK (which bugs me a lot). And it is a nice IDE, I love it. However, it lacks refactoring. Ok, most refactoring in Python can't be done automatically, however simple local variables renames could (and both Eclipse and PyCharm do them).

Saturday, December 25, 2010


I must buy just because of this:

Well... more on this here

Friday, December 17, 2010

Networks: an introduction

I usually do not discuss here things closely related to my research area. This is for a number of reasons: for example, I usually write this blog in my spare time and I focus on different subjects. There are exceptions... for example I'm writing about PageRank (here, here and I have not yet finished), which is rather close to a specific part of what I'm doing (no, I'm not re-writing Google from scratch ;) ).

Nowadays I have lots of interest in network and graph theory. I like it because it keeps my maths fresh (I miss maths a lot, in fact) and most computer science and engineering subjects can be better grasped with a sound and complete (pun intended) understanding of graph theory.

I recently found a wonderful book, which is "Networks: An Introduction" by M.E.J. Newman. This is not about "computer networks" (though they are dealt with). It is about the maths behind networks. It is a single book which cover all the basis of network theory.

Different kinds of networks are described (biological, social, information). Network properties, metrics and measures are defined, and results on average properties for each kind of network are given, together with the large-scale description of such networks. Then, a whole lot of computer algorithms are presented. Models to generate networks (random graphs, Barabasi and Albert model, Strogatz and Watts model) are explained and eventually network processes (epidemics, percolation, node removal/addition, dynamic systems, search) are presented. Basically, it's all there.

Moreover, the book is very readable, even with little experience on the subject and each part is mostly independent. That is to say, it is possible to start reading almost from any point, which makes the book a great reference. But it is also a great introduction to the subject. And > 300 references mean that the book is a great hub to start further explorations of the world of networks.

, ,

Wednesday, December 15, 2010

Introducing Page Rank Maths

In my previous post, I started introducing Google PageRank. Now I'm starting to dig into PageRank maths. PageRank definition is not very "computation oriented".

r(Pi) = Pj ∈ Li  r(Pj)

There is not a clear temporal ordering of pages or any other sensible starting point. Hopefully, an iterative method may converge to the result. So we subscript the rank with a number indicating the step and express the (i)-th step in terms of the (i−1)-step:

rk(Pi) = Pj ∈ Li  rk−1(Pj)

Of course, we are not happy with the whole "may-converge" thing and would rather have some precise results. The first step is using some convenient notation: formula 2.2 is easy to grasp but awkward to manipulate.
The idea here is express the equations in matrix form. Let L = (eij) be the adjacency matrix of our graph. In other words:

eij =
1    if there is a link from i to j
0    otherwise
Then we consider the matrix H = D−1·A, where D = max(I, D) and D is the diagonal matrix of the degrees of the nodes. In Python we can compute H with:
import numpy as np

def h_matrix(adjacency_matrix):
    degrees = np.sum(adjacency_matrix, axis=1)
    degrees[degrees==0] = 1
    return, adjacency_matrix)

# ...

The H j-th row can be interpreted as the distribution of probability for jumping from node j to each other node. If our graph is the web, H has billions of rows (and columns) but in average each page has about 10 out links. H is a very sparse matrix, indeed.
Let π be the rank-vector. Then we can reformulate the problem as:
πT = πT·H
Basically we are looking for the eigenvector associated with the eigenvalue λ1 = 1. The idea here is to use the power method. The power method is an iterative method that computes the eigenpair (λ1, π) of a matrix A. A must satisfy some basic hypotheses for the power method to work. Let {λ1,…,λn} be the set of eigenvalues of A, then it must be true that |λ1| > |λ2| ≥ … ≥ |λm|. As a consequence, λ1 ∈ \mathbbR.
Then it is possible to compute (λ1, π) with:

xn + 1 = yn

Then xn→ x and cn→λ1.
Here cn is a scaling factor; e.g. λ1, ||Anx0||, the component of yn of maximal magnitude. Now, the fact that we suggested λ1 as a possible scaling factor is not a mistake: it is proved that the actual Google matrix has λ1.
Moreover x0R(A−λ1I), where R(M)={v|v=Ax}. However, in practice "most" starting vectors x0 are good.
If H were a usable Google Matrix, things would be very nice. H is essentially the matrix of a random walk over the graph. We could just see the whole thing as a Markov process. More on this can be found in [3].
The analogy with Markov processes here can give us some insight on the problems with the H matrix. If we interpret nodes as states, we can see that a closed class of state (possibly a singleton) would be a serious issue. And any node with no outgoing links constitutes such a class.
In the Markov chain interpretation, if such a node is reachable, then we would just remain there. Such nodes would be rank sinks. In such a setting, every web site with no outgoing links would acquire a rank far superior to its actual importance. Using Markov chains terminology, the H matrix is substochastic.
Another problem is related to cycles. E.g., consider the matrix
C =

Then: C2n = I and C2n+1 = C. As a consequence we cannot expect to converge (notice, here we have eigenvalues with same magnitude, so we know that the power method is not going to work).
Luckily enough, these problems are rather easy to fix modifying the matrix. More on this, later!


M. Franceschet. PageRank: Standing on the shoulders of giants.
M.E.J. Newman. The structure and function of complex networks. Arxiv preprint cond-mat/0303516, 2003.
M.E.J. Newman. Networks: An Introduction. Oxford University Press, 2010.
L Page and S Brin. The pagerank citation ranking: Bringing order to the web. 1998.

Lisp and Scheme

Here (scheme-vs-common-lisp) an interesting comparison between Common Lisp and Scheme.
Looks rather objective. I actually prefer Common Lisp... the only two I really hate are:
  1. 2-namespaces... but funcalling is not bad as I remembered, so I really don't care anymore (perhaps "hate" is a bit of a misnomer here)
  2. no standard prescribed TCO (though the implementations I chose support it, so after all I DGAF)... ah, no call/cc. However, I don't like call/cc (perhaps I have not grasped it fully)

Well... In fact I quite like that in Lisp I have conditions (which are just a generalization of exceptions). I'm quite used to using exceptions. And Lisp also offers throw/catch and named blocks which may be used to jump.

It looks like unwind-protect and call/cc do are not good friends. Moreover, I do not fiddle very often with control structures. However, I write quite a lot of code making use of control structures... error conditions and such.

Tuesday, December 14, 2010

TTH and listings

TTH is a beautiful TeX/LaTeX to HTML converter. Its main focus is converting math formulas without resorting to images. An example can be seen here.

It is easy to use, easy to install and work acceptably. The only trouble I had so far regards including source codes. TTH supports verbatim but does not support lstlisting. Consequently I defined the conditional \iftth with
It is by default false (under latex) but it is true when running tth. This way I can have conditional commands such as:
    {\iftth \verbatiminput{#1} \else \lstinputlisting{#1} \fi}

Doubling the RAM...

gives new life even to a macpro!


Monday, December 13, 2010

Other applications of the ideas behind Page Rank

One of the problem a search engine has to solve is to sort the potentially very long list of matches for a given query with a sensible ordering. Sensible here means that ideally the more interesting results come first.
Many strategies could be applied here (and there is a lot of ongoing research on the subject). Many complex strategies are already applied. However, the basic idea [4] Google used was very simple (as an idea) and very robust.
It is easy to see that the idea was simple. Let r(Pi) be the ranking of page Pi, Li be the set of pages linking to Pi, and |Pi| be the number of outbound links of |Pi| then:

r(Pi) =

Pj ∈ Li 

Simple, indeed. With the minor problem that we have no starting point to trigger the computation. Luckily enough, iterative methods come to help. More on this in a few paragraph. I have not yet clarified the "robust" part.
In fact the kinds of ideas underlying the PageRank are rather old. Similar ideas have been spotted in bibliometrics, econometrics and sociometrics. This is not to be considered a defect (lack of novelty), rather I consider it a feature as in "already field-tested".
Using words, PageRank states that a page is important if many important pages link to it. Which is not different from the usual definition of importance of journals, where a journal is important if many important journals contain articles referring to articles in that journal. Well, in the journal world things are easier as there are far less journals than web pages and there is a clear temporal order that can be used. Moreover, articles usually cite only papers written before them (at least until some research has a paper describing his implementation of time-machines, then he could easily refer to articles not yet published).
In 1949 Seeley proposed to "measure" the popularity of a child (he was researching relationships in children social group) considering the popularity of the children which enlisted him among his friends. That is, once again, something like Pi = f(F11,...,F1k), where F1j is a child who listed 1 among his friends. The very same criterion can be applied to recommendations.
Leontief won Nobel Prize for Economy for his ideas on econometrics on the input-output model, which is yet an application of the same underlying ideas. Here I won't give further details. These example come from [1] where further details are given.

I explore further how to compute the PageRank here.

Sunday, December 12, 2010

Combinatory Logic in Lisp

Many year ago, I was introduced to a formalism called "combinatory language". I quite fell in love with it, but that my personal taste. Combinatory logic, according to Wikipedia, was originally introduced as a notation to remove variables from logic. However, in the presentation I'm familiar with variables are used [1]. In fact, I have seen it a computation model: combinatory language is proved equivalent to lambda calculus (trivial) and as such it is also a computation model (and a possible foundation for functional programming).

The fundamental units are combinators, that is to say elementary (higher order) functions. The result of the computation comes from function application (but the elementary functions are so simple that it is possible to see implement the evaluation just as a simple rewrite system). In fact, just two combinators are needed S and K (definition follows). Usually the I combinator is defined as well, but that can be defined in terms of K and S, so it is not "primitive".

Quite interestingly, it is possible to use combinatory logic to define regular first order logic. So it can be a model for mathematics as well... and some rather important (and difficult to prove) theorems in logic are rather elementary and easy results in combinatory logic. But well... to the metal.

Let us have an infinite sequence of variables (lowercase) and a finite or infinite sequence of constants (uppercase), including the three fundamental combinators I, K, S.  Variables and constants are atoms. The definition of combinatory term is as follows:
  1. Every atom is a combinatory term
  2. If X and Y are combinatory terms, (XY) is a combinatory term as well
A combinator is a combinatory term whose atoms are only S, K, I. Variables are disjoint from constants.

Since combinatory logic is very "operative" we define the fundamental computation tool, which is the weak reduction.
  • IX -> X
  • KXY -> X
  • SXYZ -> XZ(YZ)
  • X -> X
where the parentheses have the usual meaning (that is to say xzu where u is the result of evaluating yz). Right now I don't want to introduce axioms regarding variables. Now I rewrite the combinators as "usual functions" to clarify what they really do:
  • I(f) = f
  • (K(a))(x) = a
  • (S(f, g))(x) = f(x, g(x))
  • (B(f, g))(x) = f(g(x))

The B combinator is not fundamental, and can be easily defined as S(KS)K. So now we have a pretty clear idea of what combinators are. I corresponds to the identity function. K is "like" a function returning a constant whatever arguments we pass. In fact K is a "function" which given something yields the function always returning that something. B is a function composition operator. S is a bit harder to understand (but wait to see the Y combintor! ;) ),

My first significant lisp program years ago was the implementation (in lisp) of the the basic computable functions (primitive recursive and mu operator, IIRC what I did). I thought that a nice example of another lisp program could be implementing the combinators in Lisp. After doing some more theory (e.g., we haven't yet stated how to "compute" programs or represent numbers), we can go for the implementation.

  1. R. Hindley, B. Lercher, J. P. Seldin, Introduction to combinatory logic (CUP Archive, 1972), p. 170

, , , , , ,

Powered by ScribeFire.

Saturday, December 4, 2010

Pointfree programming

I have to admit that, nowadays, I'm not a huge fan of Haskell. It may be that I learned it when "Real World Haskell" did not exist. I learned on excellent Hudak's book, and although it is a very nice book (as implied by the adjective "excellent") I think that the focus is quite too academical (which is an awkward criticism from me). When learning a language I like books which give me the pragmatic of the language (that is to say "how I'm I expected to write the code") but also practical examples. Something to hack myself in order to dig into the language.

Unfortunately back then most of the "nicer" example could be run easily on Windows or Linux but not OS X (some graphic library, IIRC). So... I grow tired of Haskell rather soon (it's not nice to code continuously referring to library indexes to see if what you are doing has already been done and trying to understand the logic of some very idiomatic piece of code you have not been introduced to).

RWH does a very nice job in teaching Haskell (and perhaps you can read Hudak as a second book, if you need). However, it came somewhat too late. Perhaps another time I will start with Haskell again.

Moreover, while code in RWH is idiomatic and readable, I think that many programming styles popular in the Haskell community (judging from pieces of code I stumbled upon surfing the web) tend to be so clever that "I'm not able to debug them". Not unless I have decomposed them...

For example, I quite like the idea that everything is curried. I got a BSc in maths before turning to computer science and I have seen plenty of complex function. I'm used to functions that take other functions as argument, return functions. Functions (sometimes called operators) which bridge spaces of functions and similar stuff.

Consequently I quite like the idea that + is a unary function which returns a unary function (that is to say an "adder").

def add(x):
    def adder(y):
        return x + y
    return adder

This is a bit awkward in Python, however you can use stuff in the functools module to do the trick. However, Haskell is built around this very concept. If you just write (+ 1) you have the "increment" function. While I think this is nice (after all I don't really like to write a full lambda to do such a thing), abuses are easy to spot.

Another thing I rather like is the fact that you can omit "unused" parameters.

E.g., it is perfectly ok to write:

inc = (+ 1)

Which makes sense: (+ 1) is the function which adds 1 and we just give it the name inc. So far so good. If you are not very happy with operators, the classical example is:

sum = foldr (+) 0

instead of

sum' xs = foldr (+) 0 xs

(these examples come from the point-free Haskell guide). And I agree (to a point). In fact I still believe that there should be just one way to do things; however, Haskell guides make it very clear that the first version is preferred. Moreover, you can always write down the second variant and then remove dummy variables.

What I especially like is how it stresses the functional nature of the language: you don't "define" (defun?) functions, you just give them names. Functions just are. Ok, to much lambda calculus lately.

The point is that Haskell programmers tell you that:

fn = f . g . h

is clearer than
fn x = f (g (h x))

On that I strongly disagree. Since I have been introduced to the function composition operator, I have found that most people get it wrong. I have no idea (afterall you just substitute the composition operator with an open parentheses and everything is fine). Now, that is not necessarily a problem: not everyone is destined to become a programmer (and an Haskell programmer).

On this I found some support in the lisp/scheme/clojure community:

Avoid functional combinators, or, worse, the point-free (or `point-less') style of code that is popular in the Haskell world. At most, use function composition only where the composition of functions is the crux of the idea being expressed, rather than simply a procedure that happens to be a composition of two others. Rationale: Tempting as it may be to recognize patterns that can be structured as combinations of functional combinators -- say, `compose this procedure with the projection of the second argument of that other one', or (COMPOSE FOO (PROJECT 2 BAR)) --, the reader of the code must subsequently examine the elaborate structure that has been built up to obscure the underlying purpose. The previous fragment could have been written (LAMBDA (A B) (FOO (BAR B))), which is in fact shorter, and which tells the reader directly what argument is being passed on to what, and what argument is being ignored, without forcing the reader to search for the definitions of FOO and BAR or the call site of the final composition. The explicit fragment contains substantially more information when intermediate values are named, which is very helpful for understanding it and especially for modifying it later on.

which comes from here. Of course this can be an instance of "Lisp syntax is not very good at doing this kind of things, so you should not even try them". And it is not hard to see that (compose foo bar) could not be used extensively as function composition in Haskell, as it carries more syntactic noise. Clojure actually has builtin comp and partial functions:

clojure.core/comp([f] [f g] [f g h] [f1 f2 f3 & fs])  Takes a set of functions and returns a fn that is the composition  of those fns. The returned fn takes a variable number of args,  applies the rightmost of fns to the args, the next  fn (right-to-left) to the result, etc.clojure.core/partial([f arg1] [f arg1 arg2] [f arg1 arg2 arg3] [f arg1 arg2 arg3 & more])  Takes a function f and fewer than the normal arguments to f, and  returns a fn that takes a variable number of additional args. When  called, the returned function calls f with args + additional args.

Which is useful, are used sparingly.

However, when you start freely mixing the various features of Haskell syntax you have got a real mess. At least in my opinion. Functions taking multiple parameters in Haskell are really just unary functions returning functions.

E.g., (A x B) -> C in fact is just A->(B->C).

From the point of view of a programmer coming from another language, you can "bind" only a subset of the arguments of a function. Which, in Haskell terms means that you simply get a function instead of just a value.

For example, (map (*2)) is a function that doubles all the value of a list. It's just like that! And if you don't have the list in a variable: you can compute it on the fly (e.g. filter odd [1..10], which means take only the odd values from the range [1..10] -> 1, 3, 5, 7, 9).

The first tentative fails:

Prelude% map (*2) filter odd [1..10]
    Couldn't match expected type `[a]'
           against inferred type `(a1 -> Bool) -> [a1] -> [a1]'
    In the second argument of `map', namely `filter'
    In the expression: map (* 2) filter odd [1 .. 10]
    In the definition of `it': it = map (* 2) filter odd [1 .. 10]

Here the problem is that the "second parameter" of map should be a list. However the compiler finds "filter". Filter is a perfectly legal value for a function, only it has type (a1 -> Bool) -> [a1] -> [a1] and not [a]. In other words, we are passing a function where a list is expected.

Essentially the compiler just reads the arguments and tries to "fill" the parameters. In this case, however, it does not work. We should have written map (*2) (filter odd [1..10]).

See the parentheses? They do the right thing. Now the compiler knows that it is supposed to pass map the return value of the filter function.

However, now we have a "moral" problem: we have parentheses. Point free... does not work! We would have to write

f lst = map (*2) (filter odd lst)

instead of

f = map (*2) (filter odd)

In fact the latter is not valid. (filter odd) has type [a]->[a], not [a]. Still, I the use of parentheses is very readable in my opinion. But we want to use point free, remember? "Luckily" map (*2) has type [a]->[a] and (filter odd) also has type [a]->[a]. They can be composed! How?

map (*2) . filter odd

Now, if you are a real Haskell programmer, you are going to like it. I somewhat like it... it has some perverse elegance, I have to admit. Then imagine that you can compose this way any function. Haskell programs are highly composible.

In fact it is not hard to write very very complex functions just chaining other functions with . (and $). E.g., which is the function that computes the sum of all the perfect squares of a sequence of numbers, excluding the ones smaller than 50?

sum . dropWhile (< 50) . map (** 2)

The point is that there is no limit to the nonsense you can do this way.

Friday, December 3, 2010

Closures in imperative settings: hidden and not so hidden problems (Python)

Nowadays, the number of essentially imperative languages that have some degree of support for closures is not a small one. Even languages which traditionally do not have closures like C have been extended in order to support something which roughly behaves like a closure. E.g., the Apple blocks extension goes in this direction. Perhaps one day I will discuss this feature. Even the good old gcc nested function extension (with all its drawbacks) does the job. And in Java you can do with an anonymous class. Of course Python and Ruby have "almost" full support for closure. What do we mean with "almost full"? The point is that a closure is essentially a functional concept. In functional languages you usually don't change the contents of a variable. Either you can't (Haskell) or you shouldn't (Lisp). On the other hand imperative languages do not have the restriction and imperative programmers, even those using closures, do not think that it is a bad thing.  However, both Python and Ruby suffer greatly on this. In our first example we are using closures, although many imperative programmers may have written code like that without being aware they were dealing with closures. Consider a simple graphical program using the omnipresent Tkinter framework.
import Tkinter as tk

class Window(tk.Frame):
    def __init__(self, root):
        tk.Frame.__init__(self, root)
        self.value = 0
        def inc():
            self.value += 1
        self.label = tk.Label(text=str(self.value))
        self.button = tk.Button(text="Increment", command=inc)

root = tk.Tk()
w = Window(root)
It may be not clear, but inc is a closure. In fact the self variable is a local variable to the __init__ method and is referenced inside inc, defined in __init__. The above code runs fine. However, if we do not define value as an attribute of self, errors arise. Consider this code:
import Tkinter as tk

class Window(tk.Frame):
    def __init__(self, root):
        tk.Frame.__init__(self, root)
        value = 0
        def inc():
            value += 1
        self.label = tk.Label(text=str(value))
        self.button = tk.Button(text="Increment", command=inc)

root = tk.Tk()
w = Window(root)
It looks right, but it is not. In fact, a "UnboundLocalError: local variable 'value' referenced before assignment" exeception is thrown. Why? When executing value += 1 does not find any local variable value, because value is not local: it's local to the enclosing scope. And is not global. So the error. If you are wondering what happens if value is global (and in inc you ask for a global value variable)... things work. But every time a global is used a programmer dies somewhere. Who knows who's next... In fact in Python closure can read variables from the enclosing scope. They can modify them (for example using methods which mutates them, or like in the first example, rebinding an attribute). However, you can't "rebind" the variable. And since integers are immutable, += 1 "assigns" a new integer.

, , , ,

Powered by ScribeFire.

Thursday, December 2, 2010

Connected. Take this REPL, brother, and may it serve you well.

Ok... slime is finally come to reason.
Reading documentation usually helps.

Yeah... auto-completes too.

, ,

Powered by ScribeFire.

Wednesday, December 1, 2010

Land of Lisp (part 1)

I did not want to buy this book. I have a pretty busy schedule and I /knew/ that if I had it on my desk it would have made things worse. Moreover, Lisper have the tendency to make every other language inadequate; considering that I am (unfortunately) not going to use Lisp in the near future, that would be on the demotivational side of things. Just hope to be able to use clojure, but that's another story.

The introduction is one of the funniest pieces of half-technical writing I ever read. In the rest of the book the humor is present and that is great value. I don't consider myself a Lisp expert; however, it's about a decade since my first exposure to Lisp, I have recently dig into scheme and used quite an amount of functional, logic, dynamic, high-level and whatever languages. That means that I was afraid the first chapters of the book would just be very boring to read (because I already know the contents), but perhaps a necessary evil to familiarize with the concepts and the philosophy of the book, which could prove useful when reading more advanced stuff.

Well, I was right and I was wrong. I was right since it's pretty hard resisting to read the book, even though I have some compelling things to do (or maybe more so). O was wrong because the first chapters are a pleasant reading /even if you know/ the concepts presented. Perhaps it is even more interesting as my brain is not involved into understanding difficult new concepts but can just enjoy the beautiful presentation of familiar stuff.

Besides, the book is a wonderful introduction to the functional way of thinking and covers relatively advanced topics. You can really learn Lisp from the book, even if you don't know a thing about Lisp and functional programming. However, it is nice to read for someone familiar with FP and not with Lisp. I believe that it is funny even for experienced Lisp hackers, as the comics and jokes inside the book are really funny.

Clojure is mentioned, as are Scheme and Arc. However, the focus is just Common Lisp. It is not hard to port the examples to other Lisps, however. As macros are introduced relatively late, the example could also be used with other languages, such as Python.

Friday, November 26, 2010

Slowly abandoning Apple?

There was a time when I used to say most software I use was Cocoa. Consequently, I did not use cross-platform software (a part from CLI stuff). I preferred the OS X only, non portable solution to cross-platform solutions every time the OS X provided a niftier integration with the system.

I was rather shocked considering that it is no longer true. First: I mostly work with non Apple systems (that is to say Windows and Linux). This is not really an issue: I was aware that my job could be done without a Mac. Being mostly a developer, I just need the development tools: compilers, open source interpreters, Eclipse, WingIDE, IntelliJ. All this stuff is cross platform. I also use latex (which is also cross platform). Unfortunately I use Word and the wonderful MathType quite often. Well... once I was very skeptical about the pair, long time ago. I changed my mind, of course. Still don't like word, but MathType is wonderful. JEdit and gvim more than TextMate or BBEdit.

I still use OmniGraffle, and I find it wonderful. However, I think that it sucks not being able to use it on my netbook (my apple laptop has a broken hard drive and I haven't fixed it yet). I mostly dropped Papers for Mendeley (cross-platform and I think more useful). My synchronization needs have increased: I use gmail more often than Mail; Dropbox beats the crap out of MobileMe (which I'm not going to renew).

Etc etc etc. In fact the last bundles of Mac software are not interesting from my point of view. I bought some sw some time ago, and I did not use it. Not that the MacOS is not great, beware. Just I feel like I have to be able to leave the platform with no hassle should the need arise. My MacPro is going to last, so is my laptop. But not forever; and then? I even find some things are better outside the mac (e.g., key shortcuts of IDEs... they are not standardized like on the mac, of course, but the ides just don't work at 100% on the mac... perhaps the new versions are improving).

Just happy I bought 1Password for Windows. Now I just have to leave Yojimbo. Still, no time to export the notes... and the serial numbers. I don't even want to think how much I money I threw away. :(

Yeah... of course, if I had more time to play things would be different. I just love Logic (Express) and doing audio on the Mac. Though, I have not enough time.

Powered by ScribeFire.

, , , ,

Friday, November 19, 2010

WingIDE 4.x out!

WingIDE 4 beta is available. I'm trying it right now.
Unfortunately I have not an active Django project to test the IDE.

, , ,

Powered by ScribeFire.

Thursday, November 18, 2010

Geometry/graphics algorithms: segment intersection

With modern gaming libraries dealing with most geometric/graphics issues, it is probably possible to create simple (and not so simple) games without a clue on the basics of computational geometry. However, the elementary algorithms are really so simple that it is worth to keep them in mind nonetheless.
First, I will briefly review some mathematical concepts we will be using, then I will present the algorithms. In this post I will present algorithms to find if two consecutive segments represent a left or a right turn and to discover whether two segments intersect.
The segment intersection is probably the single most important operation: for example it can be used as a building block to detect collisions in a 2D-game and can be generalized to work with 3D games as well.

Magic of the cross product

For those who thought that they belonged to somewhere else during the geometry classes (or who profoundly hated trigonometry) this will be short. Moreover, we won't use sin and cos since they are extremely slow operations. Nor will we use divisions (as they are both slow and yield to potentially badly conditioned algorithms).

Let A and B two vectors, that is to say two segments with the origin in the origin of the system (0, 0). Their cross product A x B is defined as:

Physics (and mathematicians) would say that A x B is a vector perpendicular to both A and B (thus not contained in the considered 2D plane) according to the right hand rule and with magnitude |xAyB-xByA|. However, we are only interested in the sign:

  • If the cross-product is negative, then A is counterclockwise from B
  • If the cross-product is 0, then A and B are collinear
  • If the cross-product is positive, then A is clockwise from B
Our second step is to consider pairs of segments with one end in common. Let s=CA and t=CB (CA and CB should have a bar above, but I can’t do that for typographical reasons) be the segments, where A=( x’s, y’s), B=( x’t, y’t) and C=(x, y).
We know how to compute whether the segment CA is clockwise from the segment CB, that is to say:
As a consequence:
  • If the cross-product is negative, then CA is counterclockwise from CB
  • If the cross-product is 0, then CA and CB are collinear
  • If the cross-product is positive, then CA is clockwise from CB

Turn left or turn right?

Now, given the consecutive segments AB and BC, we want to find out whether going into BC is a left or a right turn. A simple schema clarifies the thing.

Essentially we can compute the cross product ACxAB in order to discover if BC is a left or right turn with respect to AB. This is a very important building block to compute the segment intersection.
We notice how we did not use sin, cos or divisions. We have simply a bunch of multiplications, sums and subtractions. It is even likely that some architecture can perform the determinant of a matrix with a single operation.

Segment intersection

At this stage, we can introduce the code to compute the segment intersection. First I present the function computing the direction between two given segments. Considering that we do not want to make this functions dependent from the specific representation of segments, we just let it take the points A, B and C.
The function uses only multiplications, additions and other elementary inexpensive operations (no divisions, no cos or sin). The basic idea is using the simple function we defined in the last section and eventually deal with the special case when one segment end lies on the other segment.

So, the function is:

def direction(origin, first, second):
    return cross_product(second-origin, first-origin)

provided that we represented points like that:

import collections

def cross_product(lhs, rhs):
    return lhs.x * rhs.y - lhs.y * rhs.x

class Point(collections.namedtuple('Point', 'x y')):
    def __add__(self, other):
        return Point(self.x+other.x, self.y+other.y)

    def __sub__(self, other):
        return Point(self.x-other.x, self.y-other.y)

def direction(origin, first, second):
    return cross_product(second-origin, first-origin)

Notice the use of named tuples here. We have efficiency of tuples and ease of use of regular classes. But this is another story. Then, we have the real thing:

def does_intersect(first_start, first_end, second_start, second_end):
    direction_1 = direction(second_start, second_end, first_start)
    direction_2 = direction(second_start, second_end, first_end)
    direction_3 = direction(first_start, first_end, second_start)
    direction_4 = direction(first_start, first_end, second_end)

    if (direction_1 * direction_2 < 0
        and direction_3 * direction_4 < 0):
        return True
    elif direction_1 == 0 and on_segment(second_start, second_end, first_start):
        return True
    elif direction_2 == 0 and on_segment(second_start, second_end, first_end):
        return True
    elif direction_3 == 0 and on_segment(first_start, first_end, second_start):
        return True
    elif direction_4 == 0 and on_segment(first_start, first_end, second_end):
        return True
        return False

What do we mean? direction_1 tells us "on which side" the start of the first segment is with respect to the second segment. On the other hand, direction_2 tells us on which side the second segment end lies. If both points are on the same side (they are both left turns or right turns), then the two segments do not intersect. But if the extremes of the first segment are on different side of the second segment and the extremes of the second segment are on different sides of the first segment, then it means that the two segments must have an intersection. The rest of the procedure deals with collinear segments: in this case some extreme of one segment must lie on the other segment. The last procedure is:

def on_segment(origin, first, second):
    if (min(origin.x, first.x) <= second.x <= max(origin.x, first.x)
        and min(origin.y, first.y) <= second.y <= may(origin.y, first.y)):
        return True
    return False 

Wednesday, November 17, 2010

My EEE tells me which characters I use most

After a handful of months my EEE shows clearly which characters I use most.
That is to say: E, R, T, I, O, A, S, L, C, N, M and "SPACE".

How I figured out? The keyboard is worn. But not in an unpleasant sense!
It's just that the keys were slightly rugged and now in the center are completely sleek.

Tuesday, November 16, 2010

Interpret escape sequences in Python strings

In Python source files it is possible to define "raw" string literals which mostly do not interpret the escape sequences in the string literal. These string literals are introduced with an "r" prefix, like:
In [2]: s = r'\n'
In [3]: len(s)
Out[3]: 2

As it can be seen that is not the new line character, but a sequence of two characters, '\' and 'n'. String obtained with raw_input have somewhat the same property. For example, if we input \x61, it is not interpreted as the character with code 0x61 (which is 'a'), but as the sequence of 4 bytes '\x61'; that probably is the part saying "raw" input. ;)

If we want to evaluate the escape sequences "as the python compiler does when it encounters a regular python literal (e.g., 'hi\n'), then we have to use the decode method of string objects.

In [17]: s = raw_input(
In [18]: s
Out[18]: '\\x61'

In [19]: s.decode('string-escape')
Out[19]: 'a'

, ,

Saturday, November 13, 2010

Have you got enought Fiber?

I recently bought last edition of the Pickaxe and started reading through. The "Fibers, Threads and Processes" chapter caught my attention immediately. I'm not on par with latest ruby features (especially, I have never used 1.9), so I decided to understand what fibers are.

The example in the book closely resembles the one presented in the Pickaxe when explaining Fibers. However, I performed some slight modifications are I prefer "filter" programs, reading from stdin and printing to stdout, than the ones which require a command line argument (or worse, have a fixed constant in the program code). Apparently this strategy is very poorly supported by IDEs but well... shit happens. Afterall I'm already tolerating that the environment is not usable before 2 whole minutes... ah, Netbooks!

Anyway, this is the code:
require 'fiber'
require 'fiber'

words = do
  STDIN.each_line do |line|
    line.scan(/\w+/) do |word|
      Fiber.yield word

counts =
while word = words.resume
  counts[word] += 1

counts.keys.sort.each {|k| print "#{k}: #{counts[k]}\n"}

So, the idea is that basically the block passed to the fiber is not executed.
When the resume message is sent to the fiber, the control is transferred inside the fiber. The fiber is executed up to a yield message; at this point the control is transferred back to the code which called the Fiber and the value returned by the resume message are the arguments of the yield message.

Ok, rather easy. This is basically equivalent to Python generators. As it happens most of the times, in Ruby you pass a block so to someone, in Python you define a function.

So I decided to provide a rather faithful Python version:
import re
import sys
import collections

def yield_words(iterable):
    for line in iterable:
        for word in re.finditer('\w+', line):
count = collections.defaultdict(lambda: 0)
for word in yield_words(sys.stdin):
    count[word] += 1
for word, occurrencies in count.iteritems():
    print word, ':', occurrencies

The structure is essentially the same. The last component in Python and Ruby is extremely similar. Of course in Ruby you pass a block, in Python you have a for loop on an iterable.

The second component is also quite similar. However, in Python we simply iterate over a for loop. We don't have to call explicitly a "resume" command/method (although we could). I think that the essential difference is that Python generators are iterators. On the other hand Ruby Fibers are something slightly different

I decided to keep the structure of the first block as similar as possible. Chaining Python iterators with functions in functools and intertools, the yield_words function could have been written like:

import itertools as it
import functools as ft

def yield_words(iterable):
    return ( for word in
                      (re.finditer('\w+', line) for line in iterable)))

Not really sure that chaining all that stuff increases readability, though. As a side note, all the presented versions are really interactive. If you use stdin instead of piping a file (you were not typing lots of text in, were you?), the program is fully interactive.

That is to say, for example, that is, each time you read a line it is immediately processed, as if the two program parts (the "producer", which asks input and the "consumer" which puts it in the dictionary) were mingled. However, the logical separation is quite evident and the code clarity is greatly improved.

IntelliJ Ruby Plugin

As I promised... more stuff on Ruby! This is about the installation.
OS X and Linux mostly have Ruby installed. However, you may want to install JRuby as well. On windows, I installed them both.

As a development environment, I choose IntelliJ, which I already use profitably for Java. Probably there are better choices (TextMate on OS X springs to mind, gvim/emacs). However, I want to see what can a "traditional IDE" (and a good one) offer for developing with a very dynamic language like Ruby. Unfortunately Python support does not seem as mature (besides, Ruby plugin is supported by IntelliJ developers and they also offer a Ruby IDE -- whose features should be offered by the Ruby plugin as well --).

As a side note, for Python I already bought WingIDE and I'm happy with that. However, since I don't plan to work with Ruby anytime soon, I only consider free (as free f*ckin' beer) solutions.

Here some screenshots from the setup of a Ruby project

So here some stuff on what the plugin can do. It's lots of interesting stuff.

However, I would like to point out that Ruby is a dynamic language; and the first things you are going to do are simple scripts to automate some repetitive task unless you are going to do Rails development. Since I'm not interested in Rails, I just point out that setting up a Rails environment seems very easy and well supported.

On the other hand I'm mostly interested in the "scripting" part, as it's the first you are going to meet if you don't do rails, like I said.

IRB is well supported. But I don't think its integration with the IDE is sufficient. I can open an IRB console and put Ruby commands inside. Good. A REPL is a very important part of dynamic languages development. I can also load a Ruby script directly in IRB (alt-shift-l -- btw, I suppose it will be a PITA on OS X).

However, since that command only 'requires' the file in IRB, successive invocations have no effect, unless you close and reopen IRB. It would be nice if modified versions of the script could be run without closing and opening IRB. It is something I would expect from a Ruby IDE. E.g., in TextMate it is very easy to do; and at the beginning it is just what you need.

The other options is creating a "runner". Which is easy, there is a menu entry that automagically creates one. This is good to run a script (but remember... you already created a project, and then create runners for every -- logically unrelated -- script in the project). This is a lot of useless work.

Things work slightly better if you just select the text and load (the selected text) in IRB. This way you can run the code more than once. Moreover, you can also have functions defined in IRB and than play with them. I think this is far more useful than 'requiring' stuff, especially because you can't reload.

Ok... the nice things another day. Maybe. However, the "environment" support seems really javish. Not happy. :(

Friday, November 12, 2010

Java and OS X (good news, afterall)

It looks like my concerns were unjustified. OS X and Oracle (well, mostly the open jdk community) are going to collaborate to provide future versions of Java for OS X.

Apple is basically gives its current Java machine to the OpenJDK project. Oracle (and OpenJDK guys) will provide future versions, as it is desirable. Perhaps we will not have JVM lagging behind anymore. Or perhaps we will, but the lag will be shorter.

, , , ,

Powered by ScribeFire.

Pickaxe (almost) for free [Ruby]

Although here in Italy there is lots of talking about Ruby, they are not quite referring to the programming language. However, there are some very interesting news about the programming language as well.

These days the new version of Programming Ruby (also known as the Pickaxe) has a huge discount: both the printed version and the electronic version cost $10. Buying both costs 20$.

I believe that publishers should really reconsider charging extra money for those who want also an electronic version of a book they already bought. They should reconsider this (as I'm afraid piracy will just become a serious problem as ipad/ebook riders and similar device are going to spread). However, in this case I don't care.

Why? 20$ to have a full book (electronic & paper) is a very convenient price and I'm happy with it that way. I just bought them so expect some more ruby related posts in the future. ;)

Sunday, November 7, 2010

Java Collections sorting: copy, copy & copy in every sense!

This morning I was hacking into OpenJDK some implementation details. Curiosity, perhaps.

After the announcement that Java is going to use timsort for generic sorting (which is a very good piece of news).
By the way, this is an actual implementation. I'm also wondering why these things are mostly ignored in Algorithm courses, but this is another story.

So... I checked the actual implementation. I had to download the OpenJDK source, as apparently on OS X IntelliJ does not auto-show implementation of base classes when I "jump to implementation". Anyway... the implementation of sorting is basically in

I want to point out the funny comment relative to sort1 (which is the private function which actually does the job for non floating point stuff):
     * The code for each of the seven primitive types is largely identical.
     * C'est la vie.

which really saved the day... Yes... it is almost cut and paste for the primitive types (char, int, byte, etc). I was amazed... it is 2010, we have generics but, as primitive types are not objects we have to actually write it 7 times. As a side note, code is recursive quicksort variant.

Ok, nice. You have implementation constraints and you have to cope with that. Yes... C++ templates perfectly solve this crap. But then, there is another very funny discovery (well, in fact I have known this for quite a few time). But that is not the only (technically) unnecessary copy which happen(ed|s) when dealing with sorting in Java; this is the sort function in

public static <T extends Comparable<? super T>> void sort(List<T> list) {
Object[] a = list.toArray();
ListIterator<T> i = list.listIterator();
for (int j=0; j<a.length; j++) {;

Wow! It is clear what it does? For every collection you sort, a copy is made. To be honest, the documentation states this clearly:

This implementation dumps the specified list into an array, sorts
the array, and iterates over the list resetting each element
from the corresponding position in the array. This avoids the
n^2 log(n) performance that would result from attempting
to sort a linked list in place.

Nice! Very nice... this means that the default sort function may be unnecessarily slow. Unless you use arrays (but using arrays in place of collections is considered a bad practice). Moreover, Collection is an interface: its implementation may be such that a full copy into main memory is not feasible. BTW, I should double check because the comments in Arrays.sort state that the algorithm is a modified quicksort, not a modified mergesort.

Oh... nice! Tomorrow I am discussing other methods in Arrays and Collections.

Friday, November 5, 2010

FOAF, crawlers, etc


As part of my research, I have to gather a large number of FOAF profiles. So I looked for some crawlers able to get the profiles (without having to write one myself, which is boring, considering the flexibility of RDF -- and consequently the number of outbound links which could be meaningful).


So... looking for FOAF crawlers (which are semantic crawlers, also called scutters). The first one I found is Futil (funny name choice, BTW). The project page gives some information on the project (quite generic) and has a link to the web svn browsing page. I did not find references to the actual svn repo (but I could easily infer it). Ok. So far so good.

So I examined the list of dependencies the project has:

  • Python >= 2.4.0
  • RDFLib >= 2.3.1
  • PyLucene >= 2.0.0
  • PySQLite >= 2.3.2
  • Python-MySQLdb >= 1.2.1
  • PyXML >= 0.8.4
  • Python-URL >= 1.0.1
The first dependency is obvious. It's a python project so we need a python environment. And 2.4 is a very low requirement. Good. RDFLib can be easily installed with easy_install, even on windows. Really, easy_install RDFLib


just works! Ok. Then it comes to PyLucene: I found it quite impossible to install on windows. I'm sure I can work it out, but instructions are terrible. There is a binary build project, however, it does not work. At least, not the simple instructions they give. I installed the egg and then:

Python 2.6.5 (r265:79096, Mar 19 2010, 21:48:26) [MSC v.1500 32 bit (Intel)] on
Type "help", "copyright", "credits" or "license" for more information.
[1]: import lucene
Traceback (most recent call last):
File "stdin", line 1, in module
File "C:\Python26\lib\site-packages\lucene-2.9.3-py2.6-win32.egg\lucene\", line 2, in module
import os, _lucene
ImportError: DLL load failed: The specified module could not be found.

Ok. Things are getting nasty and for no reason. Compiling Python modules on Windows is usually a PITA. Especially because I have Visual Studio 2010 and not Visual Studio 2009 or 2008 or whatever is the default compiler for Python on Windows. Thanks to the little "limited compile yourself dll hell" I'm not going that route (moreover, I'm a gcc guy).

So I tried the futil package without pylucene. Maybe, I thought, something works even without PyLucene. Afterall, Python is a very dynamic language. It is trivial to create a program/library which does not load things it does not use. So I tries just:


python --help<br />
What I got was an error related to not having MySQLdb. What the fuck! I asked for an help message. Nothing more! I need no stinking mysql! Just a bloody help message. Consequently, I suppose that the whole project needs everything installed, even if I don't use it. Which is bad.


I suppose I'll try with slug , written in Java, and see what happens.

Tuesday, November 2, 2010

Papers: word, ipad, and other stuff

I have been using Papers for something like an year now. I quite like the application, indeed. It is easy to use and rather powerful.

I use it to keep all the articles I read and study for work organised. This way the papers are in the very same directory and I can access them easily from the application in a similar way to what iTunes does for music. I can also add notes and stuff like that.

Papers is also able to perform searches on the web (e.g., using Google Scholar). Then, I can import the papers into the application and possibly categorise them with "groups", assign an "evaluation" (which is rather useful... for example it lets me save time with articles with promising title and abstract but deluding content; this way I don't read them again when looking for information to cite).

Papers (not the application ;) ) have metadata, which is also used for citations. Most of the times (well, not really... sometimes -- but this is not Papers fault as this comes from Scholar or badly formatted papers or whatever) the metadata is accurate and everything is fine.  Unfortunately sometimes (which is, once again unfortunately, most of the times) the metadata is not correct. The quickest way I found to fix this is "unmatching" the paper and then searching again selecting some relevant data from the Paper. The interface is very well done and it is easy to understand how it works. In this case, I usually end up with correct metadata ready to be included in a bibliography.

And here it comes the first weak spot of Papers. Unfortunately (or fortunately, depending on the point of view), my group works with Word. As I am the youngest, I simply accept the decision. Consequently, we have to build the bibliography with Word.

Word has a nice bibliography management feature; but it is quite broken. Or I have not found how it works. Word has a built-in bibliographic database. Papers is able to "write" a new database (possibly making a backup of the old one). At this point, word is able to insert citations from the db to the current document. Word has a very limited array of bibliography styles and although more can be downloaded, I found that they do not work very well. Sometimes the generated bibliography is not formatted correctly and it is very hard to make it fit into the requested paper style. At some point I usually end up making it static and then do everything manually.
Perhaps, there are better ways to do it. I have not found them, unfortunately.

This is not necessarily Papers fault. But it is a serious problem. Perhaps I should try different approaches, like going through bibtex or something like that. I remember that when I used to work with latex the bibliography part was quite easy (as long as bibtex was used... when creating bibliographies manually it was a real PITA).

Another very nice feature Paper has is iphone/ipad synchronization. I often read papers on the ipad (and save forests) and that is nice. Essentially if I just remember to synchronize, I have my full library with me all the times. This is dramatically nice. I'm just looking forward for the ipad to allow directly printing files.

The problem is that Papers does not easily synchronise data between two macs. This is awful. Right now I work most of the times with my EEE netbook. However, I would like to have the library on the macpro always in sync with the MacBook Pro's. I read some post requesting the feature, but it seems it has not been added yet. This is a major drawback, especially considering that Mendeley automatically synchronises not only among macs, but among every computer and device (win PCs, Linux ones, android, iphone/ipad and macs).

One of these days I'll share my opinions on Mendeley as well.

Sunday, October 31, 2010

What about Apple quality (ipad, iphoto, camera connection kit)?

The beginning

I quit using Apple products in the late nineties as I felt that directly and solely using Linux was far a better choice to my needs. As a consequence, I stopped doing many things I use to (such as recording and producing music) and that was it. Some years later, I bought a 12" powerbook and I think it is one of the best notebooks I ever had. Jaguar was crap compared to the systems we have today.

Integration with open source stuff was mostly left to the users (and that way I learnt lots of things I had for granted in Linux). On the other hand, I had lots of beautiful applications and a unix system which worked flawlessly on my laptop, which, at the time, was not common at all. Since then I have been quite enthusiastic about apple products. I bought many more Macs and used them as my main development platform. I bought mac software. I have an iPad and an iPhone.

I was very satisfied with the general quality of the products. Few bugs, few problems. On the other hand, I'm now starting to seriously question Apple strategy. I have already expressed my concerns about the mac store. Though I'm starting to believe things are worse.

The tragedy

For example, this fall I went to Cornwall on holiday. I brought my camera and my ipad. The idea was that I could store the pictures on my iPad (I bought also the camera connection kit).

I was pleasantly surprised that the kit stored the pictures in raw format and consequently I had no loss of quality. On the other hand, this detail was not clear at all (I had to sent myself an email containing one picture to find out that it still was a NEF file). I think that this information is irrelevant for most amateurs (as most digital compact cameras save in JPEG); however, it is critical for a professional. And it may be critical for some amateurs as well, the ones with digital reflex cameras.

So as soon as my memory card was full, I connected it to the ipad and started downloading the pictures. Apparently, everything went well; I simply felt that the process was quite too slow (but I may be biased as I wanted to use the ipad at the same time to check the email and I did not want to interrupt the transfer). So I decided that I could afford buying another memory card. With that one and the now emptied old one, I could manage to take pictures without another transfer.

Went I came back, I connected the ipad to the mac pro and started downloading the pictures. The process hanged after importing a hundred pictures or so. Since I connected it to my monitor (which has an USB hub), I initially blamed the monitor. However, I should have suspects on iphoto since the "cancel" button did not work and I had to manually force quit the application. Then I tried another hub (which was more tested); same results. And then I directly connected the ipad to the Mac and once again the transfer hanged. My guess, at this point, was that the ipad had issues. In fact, I transferred greater quantities of pictures (but I forgot I did not use iphoto, but Nikon Transfer).

As a side note, I stopped using Nikon Transfer because I lost some pictures due to unclear options related to "deleting pictures after transfer". For NT it meant delete all the pictures after the transfer, both the ones you transferred and the ones you did not, while I thought that it conservatively meant "delete the pictures that you just transferred". Luckily enough CardRaider saved the day.

Well, iPhoto was not very clear on what they meant with "Keep pictures" "Delete pictures" either, so I decided to "keep" and manually delete. I think that Apple could have been more precise.

Anyway, at this point I was quite desperate: I found no options in ipad-iPhoto to mass put all the pictures somewhere where I could access them. Luckily, I remembered that Preview is able to import pictures from cameras. And it is able to import pictures from the ipad as well. The program is faster than iphoto. It is faster to start, faster to transfer. The image selection interface is also far clearer: moreover, failed or interrupted downloads showed clearly the already imported images, so starting from the point where I left was easier.

In fact, the first time I got the same transfer error I had with iphoto. However, plug and unplug + quit & reopen fixed the issue and now I have all the pictures safely on my hard drive.


I never really trusted iphoto: I kept my files outside the library in order to more easily access them. I trust the program even less: I won't use it to import pictures, just to present them. The "find faces" functionality is just a curiosity: I still haven't found any use for it. Moreover, much of the imported pictures are not processed by that functionality. I don't know why.

As a consequence, I would like to get rid of iphoto. I want to stop subscribing to mobile me as well (e.g., out of the blue, iWeb is not able to update my website anymore). Now I just have to find a substitute; something which has similar features and that possibly retains the association between the raw pictures and the modified one. Still iPhoto was not that bad: I could have used it. I just think it is not refined and a bit left on its own.

And now there is the new version. And I am not very motivated in buying that software. On the other hand abandoning most apple ties is quite hard and since all the stuff is very well integrated, it is somehow required to stop using all that crap together, otherwise problems arise.

Some pictures

Ocean Pictures (Dawn)

Ocean Picture

Ocean Picture (Night)



Technorati Tags: , , , , , , , , , ,

Saturday, October 30, 2010

Python String Concatenation (again): the revenge of join!

In this second post (the first here) I proposed some benchmarks to investigate the efficiency of string concatenation in Python. I believe that the results are partial for two basic reasons:
  1. we benchmark just concatenation of huge amounts of very small strings
  2. we are creating the strings in the iteration. As a consequence, we are implicitly penalizing the methods using str.join
In this post I address the second issue. I moved the creation of the list outside the benchmarked code, and simplified the functions accordingly.
def str_sum(seq):
    out_str = ''
    for item in seq:
        out_str += item
    return out_str

def str_sum_b(seq):
    out_str = ''
    for item in seq:
        out_str = out_str + item
    return out_str

def str_join(seq):
    return ''.join(seq)

def str_join_lc(seq):
    return ''.join(item for item in seq)

def string_io(seq):
    file_str = StringIO()
    for item in seq:
    out_str = file_str.getvalue()
    return out_str
These are the same functions we benchmarked last time simplified where necessary. str_sum and str_sum_b are basically unchanged. On the other hand, str_join is completely different: since the first three lines where used to create the list, now they have completely disappeared.
def str_join(loop_count):
    str_list = []
    for num in xrange(loop_count):
    out_str = ''.join(str_list)
    return out_str
As a consequence, the funcion is much shorter. I mantained the _lc variant: however, I will remove it from serious benchmarks as it is inevitably slower than the regular version. I also introduced some new functions to test here:
def str_sum_c(seq):
    return reduce(op.add, seq)

def str_sum_d(seq):
    return reduce(op.iadd, seq)

def str_sum_e(seq):
    return sum(seq, '')
They are also disabled: the _e variant is invalid Python. Calling sum on a sequence of strings gives this instructive error:
TypeError: sum() can't sum strings [use ''.join(seq) instead]
The other two were excluded because they are too slow:
itsstr_sum str_sum_cstr_sum_d
Unfortunately, reduce appear to be very slow. Consider these additional functions:
def str_sum_f(seq):
    return reduce(str.__add__, seq)

def str_sum_g(seq):
    out_str = ''
    for item in seq:
        op.iadd(out_str, item)
    return out_str
I decided to benchmark only a relatively small subset of the presented functions. The measures I plotted are the result of 10 executions of the function; the average time can be obtained simply dividing by 10. In the first plot I consider strings composed by numbers of fragments one order of magnitude smaller than the ones benchmarked in my other post.
In this other plot, I present the results with the same order of magnitude. However, the times presented result from 10 executions of the same function: the spikes and irregularities were not eliminated by averaging the times. Consequently, I don't consider the plot meaningful unless I discover the reason of such behavior (which comes completely unexpected).
In the future I will try a benchmark consisting of concatenation of larger strings. Technorati Tags: , , , , ,

Thursday, October 28, 2010

jEdit 2

Seems I forgot to mention that it supports rectangular selections, vertical editing (meaning that it is possible to modify more lines with the same command).

I also found quite useful buffer local properties (written in the file itself). However, I feel that using a completely different standard from Emacs and vi is not a very good idea. Yeah, Java is a world on its own.

Unfortunately, it seems that development is going quite slow. Well, as a TextMate user I should not comment on this... ;)

Unfortunately IDEs like IntelliJ have so much more features that for Java development jEdit is not on par. On the other hand the startup times are very reasonable even on my netbook. I actually favor jEdit over other Windows evolved editors. I need more time in order to make a serious comparison with BBEdit (which I have used since the first versions) and TextMate (which is somehow even more intuitive).

Technorati Tags:
, , , ,