Thursday, October 14, 2010

JEdit: the forgotten editor

These days, I'm rethinking lots of stuff regarding my geeky life. For work reasons, I hugely increased my exposure to Java. And to my surprise, I find it quite bearable. Perhaps, one day, I'll say that I actually like it. This is much more likely if they just add closures... but that's another story.

In this process, I decided to try JEdit. Back then, some people suggested JEdit as a viable alternative. I quite refused the claim based on some hasty experiments. I think that my bias was such that I could not be pleased with it. Perhaps back then (10 years ago or something like that) it was not that good; perhaps not. And then I sticked to my early negative experience.
As a partial excuse I can point out that back then the unix community was very skeptical on everything which was not vim and Emacs (with the same old flame wars among the editors).

Right now, I'm using jEdit quite a lot and I want to describe my impressions. First: its main drawback is that it is slow to load. It is quite objective, even on my MacPro. However, BBEdit is not much faster and I use that regularly. But like BBEdit it is very fast to open new files once it is opened. In other words, you can just leave it open and that is fine. Of course, with vi this is not necessary. The other major drawback is that it can't work in text only mode: if you need to edit files remotely, you might be in a trouble.

However, the ssh/ftp plugin is quite useful and partially solves this problem. By the way, there is no reason not to use vim when needs arises.

jEdit supports 130 syntax languages (according to wikipedia, I did not count them), which is good. In fact, it is likely that you open a file and gets the correct syntax highlight and customizable options for each of them. Which I find nice.

Plugins

jEdit plugins are very easy to install; easier than default installation process on Emacs and vim, though auto-installers are catching up. In fact, even easier than text mate and bbedit plugins. In the plugin list, there are lots of different plugins; the majority are java related, but this is not exclusive. Some plugins just remind me of good old programming practices. For example the whitespace plugin does many whitespace related functions, like showing trailing whitespace and auto-removing it on save or auto-converting between tabs and spaces. This is something most programmers want.

jEdit has very flexible frame management. It is possible to split text area in different parts, dock and undock specific panes. For example with the Console module it is possible to have a "shell". The users chooses to have it in a separate window or to place it at the bottom of the main frame. Emacs anyone?

Of course, in the console windows it is possible to place many different REPLs, such as a Python or a Prolog shell. Or clojure (which, is also installed by the plugin: installing the plugin fully sets up a working clojure environment).




Another very useful feature is the integration with BeanShell (which allows running Java snippets). Moreover, plugins can also be written in Jython and other JVM supported languages. Of course there are plugins providing Java autocomplete and refactoring and similar stuff.

The are plugins to do spell-check and most things one expects from a text editor, latex mode, etc etc etc. The XML mode provides autocomplete for HTML files (or perhaps it is just builtin), and that makes it quite a powerful editor for HTML.

Macros


jEdit supports macros, which should be rather akin to other macro facilities ;they do not only generate text (but can remove, modify or do completely text unrelated stuff like changing interface elements). I have not explored this part yet.

The Editor


Then, it comes to basic text editing capabilities. Here is the point where most NENV (Non Emacs, Non Vim) editors fail. They just suck at manipulating text. jEdit, even though not the most powerful editor I tried, is an honest contender here.

Abbreviations provide expansion of long snippets of text. Buffer based completion is also available.

The clipboard is very powerful: there are "registers" and it is possible to manipulate text in there, append text to clipboard, vertical paste, etc. All this possibilities have keyboard shortcuts.

Selection is also keyboard controllable (move one word/character/paragraph forward/backward etc). It is possible to select code blocks, up to parentheses, etc...

Search comes in two flavors: incremental search and "standard" search. Regular expressions are available. Moreover, it is possible to apply search to whole directories, filtering files according to file names. The "replace" part can also be provided by a bean shell expression, which basically gives unlimited power to the replace feature.

Other text related features (spaces->tabs, lowercase, delete to end of line, delete to start of line, etc) are provided. So it is paren matching.

Markers (bookmarks) are supported, and so it is folding. Some utilities (file managers and similar stuff) are provided, and manual editing of jedit configuration files is also available.

Conclusion


I think that jEdit is a seriuos contender in the "2nd generation" editor market. It is free, it supports many languages in a rather complete way. Of course, many languages which are supported in Emacs (haskell, ocaml, erlang, ...) are not supported. However, it is a nice alternative to do lightweight java editing, to work with Python, web editing and similar stuff. Moreover, it is a very interesting alternative for Clojure/Scala development.

Wednesday, October 13, 2010

Ubuntu 10.10: so far, so good

On October 11th I decided to upgrade to the new ubuntu. I wanted to try the update procedure, as:
  1. It's not a big issue to reinstall afterwards (right now Linux is not critical for my job)
  2. I was really curious (and scared, from what I read on the web... but well, if OS X manages to just update, why Ubuntu should not?)
First I updated the system to the latest version as recomended. Then, I started the proper upgrade process. The installation process took *hours*. Not only because of the download process (here at the University we've got a pretty fast connection ;) ). It was just the setup/configure process. It took somethink like five or six bloody hours (I don't know, since I left the netbook on and went home). Besides, having question asked during the installation process (question blocking the process) means you have to be in front of the monitor for the whole process. So the day after I found some nice blocking panes asking me silly questions[0]. Then it also stopped responding to input. I inadvertently touched the shut down key and the system ignored my tentatives to press the cancel button. However, Ubuntu rebooted the half installed system fine (I think that it was just setting up unimportant applicative packages when I interrupted the procedure). dpkg --configure -a and now *everything* is fine. Perhaps I have been lucky, but I think they are doing a great job. I was expecting an epic failure (my fault, by the way). Usage impressions another day. Maybe. Or maybe not.

Tuesday, October 12, 2010

Borg, Singleton, Memoization and my mistakes

Today something rather embarrassing is happening to me. In fact, I feel that I'm plainly missing the right design. As I wanted to do some testing and benchmarking on the quicksort algorithms we discussed (here, here and in a yet unpublished post), I needed a fairly large body of random data. As a consequence I wanted to create a simple script to generate a "random" sequence of lines which could be used as an input to build a list of "Persons". I am interested in sorting "complex" objects according to multiple criteria, not only simple data. The language for the script is obviously Python as it's the language I'm sure will get the task done faster (in my hands). As a remark... I am quite against random tests, as they are not reproducible. On the other hand randomly generating the fixtures is fine. Once they are generated, they remain the same (thus, tests are reproducible). Moreover, they can be easily accessed from multiple languages. My first idea showed I had not clear the problem at hand. I started with the idea of randomly generating the names (as random sequence of characters), and I asked myself if I had to randomize name length as well. Luckily, I realized the mistake after writing 2 lines of code. It is strange to generate random string to emulate names, when I could just download a list of first names and surnames and build the data from them. I was confident the effort would be roughly the same. In fact, it is. Considering these URLS (for example):
SURNAME_URL = 'http://names.mongabay.com/most_common_surnames.htm'
MALENAME_URL = 'http://names.mongabay.com/male_names.htm'
FEMNAME_URL = 'http://names.mongabay.com/female_names.htm'
I could just get the list of names with these specific lines:
parser = html.parse(url)
lines = [el.text for el
    in parser.xpath('//td/table/tr/td[1]')
    if el.text]
I quite do not like the fact that I'm so dependent on the website structure and perhaps I could look for a site offering the names just as a plain file. But I don't really care enough to look for another website. Moreover, the "way" I get the lines is quite orthogonal from the matter of this post. Now the point is that I want to access those files with a nice API in order to avoid code repetition. And here my brain completely short-circuited. The first idea I got was something like this:
from lxml import html

class NamesMeta(type):
    SURNAME_URL = 'http://names.mongabay.com/most_common_surnames.htm'
    MALENAME_URL = 'http://names.mongabay.com/male_names.htm'
    FEMNAME_URL = 'http://names.mongabay.com/female_names.htm'

    @property
    def male_names(cls):
        try:
            return cls._male_names
        except AttributeError:
            cls._male_names = cls._list_of(cls.MALENAME_URL)
        return cls._male_names

    @staticmethod
    def _list_of(url):
        parser = html.parse(url)
        return [el.text for el in parser.xpath('//td/table/tr/td[1]')
                if el.text]

class Names(object):
    __metaclass__ = NamesMeta
For some mysterious reason I liked the idea of accessing the names with:
Names.male_names
. While I don't believe it is inherently a bad idea to be able to do it, I do believe it is a bad idea not having an "ask" method. However, I am going to deal with one design problem at a time. Going back to the example, this approach works, but it requires me to duplicate logic for the three properties (male_names, female_names and surnames). And I hate code duplication (and looping "execing" methods is not a good strategy, either). Then there is this thing... I used metaclasses, for no good reason. Well, my idea was to use the "class object as singleton pattern" strategy. This is a very easy way to implement a singleton. Here the first error is that I should really be using Borg/Monostate (or perhaps a bunch of functions...). The code is redundant (if finished) and overly complex. In order to understand how it works you have to be familiar with metaclasses, et all. And for no good reason. Moreover, the @staticmethod points out another clear design problem. As far as the class is only conceived to hide the logic of getting the name lists, it is ok to have the function to get the names in the class interface. However, I believe that we can generalize the class a bit. What if I want to "ask" more things? I could add an interface to add more sources of knowledge. I'm not sure that this counts as "over-generalization" (that is over-engineering). I think that generalizing, in this case should help me to clean the design. And I got to this:
from lxml import html

class Failure(Exception):
    pass

def name_getter(question):
    def _list_of(url):
        parser = html.parse(url)
        return [el.text for el in parser.xpath('//td/table/tr/td[1]')
                if el.text]
    url_db = dict(
        male_names='http://names.mongabay.com/male_names.htm',
        female_names='http://names.mongabay.com/female_names.htm',
        surnames='http://names.mongabay.com/most_common_surnames.htm'
    )
    try:
        url = url_db[question]
    except KeyError:
        raise Failure
    return _list_of(url)

class Knowledge(object):
    # (question, fetcher)
    answers = dict(
        male_names=name_getter,
        female_names=name_getter,
        surnames=name_getter
    )

    def __getattr__(self, key):
        try:
            getter = self.answers[key]
        except KeyError:
            raise Failure
        else:
            value = getter(key)
            setattr(self, key, value)
            return value
This is decent. I can access the "questions" rather easily. I can easily add an interface to query which question the class can answer; the code is rather easy to understand, use and whatever. Moreover, using __getattr__ still has the advantage that the "I have not yet cached the answer" logic is dealt by Object. We will get rid of this, in time. Right now, it is easy to understand and we do not have to write code to deal with it. But then... idiocy struck me again. I felt like being able to instantiate multiple Knowledge classes would be a BAD THING. Which in some sense it is... but hell, it's just a helper script (though the object oriented designer in me is quite a bitch). And since the metaclass strategy worked for the past example, something like:
class KnowledgeMeta(type):
    # (question, fetcher)
    answers = dict(
        male_names=name_getter,
        female_names=name_getter,
        surnames=name_getter
    )
    # ...

class Knowledge(object):
    __metaclass__ = KnowledgeMeta
works. But at this point it should be clear that the (object oriented) solution should use the Borg pattern. Identity is a false myth and this is an archetypal example of "only shared state matters". Moreover, it is so appropriate to have a Borg here... we have object that send to the hive-mind whatever they learn. They are borgs! The solution I propose is now:
from lxml import html

class Failure(Exception):
    pass

class Borg(object):
    # Martelli A., Martelli Ravenscroft A., Asher D.,
    # "Python Cookbook", 2nd Edition, O'Reilly
    _shared_state = {}
    def __new__(cls, *a, **k):
        obj = object.__new__(cls, *a, **k)
        obj.__dict__ = cls._shared_state
    return obj

    def __hash__(self):
        return 42 # randomly chosen number from a dice with one face

    def __eq__(self, other):
        try:
            return self.__dict__ is other.__dict__
        except AttributeError:
            return False

def name_getter(question):
    def _list_of(url):
        parser = html.parse(url)
        return [el.text for el in parser.xpath('//td/table/tr/td[1]')
            if el.text]
    url_db = dict(
        male_names='http://names.mongabay.com/male_names.htm',
        female_names='http://names.mongabay.com/female_names.htm',
        surnames='http://names.mongabay.com/most_common_surnames.htm'
    )
    try:
        url = url_db[question]
    except KeyError:
        raise Failure
    return _list_of(url)

class Knowledge(Borg):
    # (question, fetcher)
    find_answer = dict(
        male_names=name_getter,
        female_names=name_getter,
        surnames=name_getter
    )

    def __getattr__(self, key):
        try:
            finder = self.find_answer[key]
        except KeyError:
            raise Failure
        value = finder(key)
        setattr(self, key, value)
        return value

    @property
    def known_questions(self):
        sreturn self.find_answers.keys()
The idea of using properties instead of method calls is perhaps not very good: it adds no advantage, no clarity. And somewhat hides the fact we may be accessing something on the net. In fact a simple API like:
def answer(question): #...
def questions(): # ...
def add_question(question, answer_getter): # ...
would be perfectly acceptable, usable and clear. However, in the latter solution we would have some global variables; I still believe that an object oriented interface here is better. __getattr__ just simplified some logic, but also masked that we are dealing with caching/memoization: __getattr__ is called iff the key is not find in __dict__ (that is to say in Borg._shared_state), which is exactly what memoization is about. If we want to have an "ask" method we would have to manually deal with the logic of not finding stuff in our cached answers. And I believe an "ask" method would be very useful. For example because it allows for "questions" which are not valid python identifiers. I find much more readable something like:
knowledge.ask('number of dwarves')
than:
getattr(knowlege, 'number of dwarves')
Luckily enough, memoization patterns are widespread and we can also find ready to use implementations. However, this time, we provide a memoize decorator. This decorator takes the dictionary used as a cache as an argument. Of course, Borg._shared_state is the dictionary we want to use. The decorator is:
def memoize_method(db):
    def memoizer(method):
        def aux(self, key):
            try:
                value = db[key]
            except KeyError:
                value = db[key] = method(self, key)
            finally:
                return value
        return aux
    return memoizer
Defining an "ask" method now is trivial. It should just get the answer finder from the find_answer dictionary, call it, and return the value. Caching is dealt by the decorator. So, eventually:
class Knowledge(Borg):
    # (question, fetcher)
    find_answer = dict(
        male_names=name_getter,
        female_names=name_getter,
        surnames=name_getter
    )

    @memoize_method(Borg._shared_state)
    def ask(self, key):
        try:
            finder = self.find_answer[key]
        except KeyError:
            raise Failure
        else:
            return finder(key)


    def __getattr__(self, key):
        return self.ask(key)

    @property
    def known_questions(self):
        return self.find_answer.keys()

Technorati Tags: , , ,

Monday, October 11, 2010

Design of Sorting in Java and Python (pt. 2)

In this post I presented a pattern which arises very quickly when doing comparisons among objects. The pattern is essentially:


def name_cmp(fst, snd):
 return cmp(f(fst), f(snd))

In fact I see huge repetition in this kind of code, because we write code in that shape for many different "f"s (that is to say, getter or more complicated computation, but it boils down to that pattern).

For example, let us define a Person object in Java and also define comparators to sort a list of persons according to both the number and the phone number.

public class Person {
    private String name;
    private String phone;

    public Person(String name, String phone) {
        this.name = name;
        this.phone = phone;
    }

    public String getPhone() {
        return phone;
    }

    public String getName() {
        return name;
    }

    static final private PhoneComparator PHONE_COMPARATOR = 
     new PhoneComparator();
    static final private NameComparator NAME_COMPARATOR = 
     new NameComparator();
    
    static private class PhoneComparator implements Comparator<Person> {
        public int compare(Person o1, Person o2) {
            return o1.getPhone().compareTo(o2.getPhone());
        }
    }

    static private class NameComparator implements Comparator<Person> {
        public int compare(Person o1, Person o2) {
            return o1.getName().compareTo(o2.getName());
        }
    }

    static Comparator<Person> ageComparator() { return PHONE_COMPARATOR; }
    static Comparator<Person> nameComparator() { return NAME_COMPARATOR; }
    
    // ide generated toString, equals etc. here\
};

I see huge repetition in this kind of code. But even in Python, defining all the compatators is boring and repetive. Of course in Python something like:

def cmp_buider(f):
 def cmp_(fst, snd):
  return cmp(f(fst), f(snd))
 return cmp_

would come quite natural. For the ones who do not like functionally flavoured code, this is the Java variant (which is longer and uglier).

public interface Key<K extends Comparable<K>, E> {
    public K index(E obj);
}

public class KeyToComparator {
    public static <K extends Comparable<K>, T> Comparator<T> build(final Key<K, T> key) {
        return new Comparator<T>() {
            public int compare(T o1, T o2) {
                return key.index(o1).compareTo(key.index(o2));
            }
        };
    }
}

and then we use it like:

static public void sortWithKey(List<Person> lstPerson) {
    Collections.sort(lstPerson, KeyToComparator.build(
            new Key<String, Person>() {
                public String index(Person obj) {
                    return obj.getName();
                }
            }));
}

I hope I'm not the only one feeling that this stuff is ugly as hell. However, it is quite general and powerful. And it lets you avoid code repetition. At the very least, the code to write a key is way shorter than the one to write a Comparator. Of course, if methods where first class objects...

Sunday, October 10, 2010

Parametrized Unit Tests (in Java)

While "creating" by hand parametrized test functionality is not extremely hard in Python (and in fact there are multiple people who did that, plus the "blessed" implementation which is going to be included in unittest2 some time soon, I hope), doing the same thing in Java is a nightmare.

Luckily enough, JUnit 4 offers the functionality out of the box.

In fact, I have to admit that it seems cleaner than Python alternative. Let us start with a Java quicksort (this time ugly, but not buggy, as far as my tests are correct).

package sorting.quicksort;

import java.util.List;

public class QuickSorter<E extends Comparable<E>> {
    List<E> lst;

    public synchronized void sort(List<E> lst) {
        this.lst = lst;
        sort(0, lst.size()-1);
    }

    private void sort(final int start, final int end) {
        if(start >= end) {
            return;
        }

        int pivot = partition(start, end);

        sort(start, pivot-1);
        sort(pivot+1, end);
    }

    private int partition(int start, int end) {
        int pivot = choosePivot(start, end);
        swap(pivot, start);

        int leftProcessed  = start, rightProcessed = end + 1;
        E pivotElement = lst.get(start);

        while(true) {
            do {
                leftProcessed++;
            } while(leftProcessed <= end &&
                lst.get(leftProcessed).compareTo(pivotElement) < 0);
            do {
                rightProcessed--;
            } while(lst.get(rightProcessed).compareTo(pivotElement) > 0);
            if(leftProcessed > rightProcessed) {
                break;
            }
            swap(leftProcessed, rightProcessed);
        }

        swap(start, rightProcessed);
        return rightProcessed;
    }

    private void swap(final int i, final int j) {
        E tmp = lst.get(i);
        lst.set(i, lst.get(j));
        lst.set(j, tmp);
    }

    private int choosePivot(final int left, final int right) {
        return left + (right - left)/2;
    }
}

The test case is:
package sorting.quicksort;

import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

import java.util.*;

@RunWith(value=Parameterized.class)
public class QuickSorterTest<E extends Comparable<E>> {
    private List<E> lst;
    private List<E> copyOfLst;
    private QuickSorter<E> sorter;

    public QuickSorterTest(List<E> lst) {
        this.lst = lst;
    }

    @Parameterized.Parameters
    public static Collection<Object[]> getParameters() {
        return Arrays.asList(
                new Object[]{ Arrays.asList(1, 2, 3, 4) },
                new Object[]{ Arrays.asList(1, 2, 4, 3) },
                new Object[]{ Arrays.asList(1, 3, 2, 4) },
                new Object[]{ Arrays.asList(1, 3, 2, 3) },
                new Object[]{ Arrays.asList(4, 2, 3, 1) },
                new Object[]{ Arrays.asList(1, 4, 3, 2) },
                new Object[]{ Arrays.asList(3, 2, 1, 4) },
                new Object[]{ Arrays.asList(3, 2, 2, 4) },
                new Object[]{ Arrays.asList(3, 1, 1, 3) },
                new Object[]{ Arrays.asList(1, 1, 1, 1) },
                new Object[]{ Arrays.asList(1, 2, 1, 1) },
                new Object[]{ Arrays.asList(1, 2, 3) },
                new Object[]{ Arrays.asList(1, 3, 2) },
                new Object[]{ Arrays.asList(2, 1, 3) },
                new Object[]{ Arrays.asList(2, 3, 1) },
                new Object[]{ Arrays.asList(3, 1, 2) },
                new Object[]{ Arrays.asList(3, 2, 1) },
                new Object[]{ Arrays.asList(3, 2, 3) },
                new Object[]{ Arrays.asList(3, 1, 1) },
                new Object[]{ Arrays.asList() },
                new Object[]{ Arrays.asList(1) },
                new Object[]{ Arrays.asList("hello", "cruel", "world") });
    }

    @Before
    public void setUp() {
        lst = new ArrayList<E>(lst);
        copyOfLst = new ArrayList<E>(lst);
        sorter = new QuickSorter<E>();
    }

    @Test
    public void testSort() throws Exception {
        sorter.sort(lst);
        Collections.sort(copyOfLst);
        Assert.assertEquals(copyOfLst, lst);
    }
}

Now, the idea is quite clear. The code to be written is ugly. The parametrized test case is annotated with @RunWith and the Parametrized.class. The test case must have a method annotated with @Parameterized.Parameters. This method return a collection of array of objects.

The idea is that for each element returned by that method a new test case object is constructed using the array of objects as actual arguments. In fact, the design is not extremely different from what we saw in Python.

In Python we had a generator yielding the tests. Here we have a static method which does quite the same job. Of course, here a Collection is returned. And Java syntax here is quite cumbersome. Luckily enough there is some flexibility with types so that I do not need to make different test case classes just because of that.

In the example we sort both arrays of integers and arrays of strings.

The setUp method and the test method do roughly the same thing they do in Python. And also the Java version uses a constructor (in fact, I'm led to believe that it is simply unavoidable).

Saturday, October 9, 2010

Design of Sorting in Java and Python (pt. 1)

The sorting interface built-in in Java collection model essentially depends on object which are either Comparable or to external comparator objects. The idea is quite straightforward: if the object does not provide a natural ordering (or the developers did not provide one) it is very easy to create a Comparator object.

Such an object takes to objects of type T and returns -1, 0 or 1 depending on whether the first object is "less than", "equal to" or "greater than" the second object. This strategy is extremely flexible (and in fact it is also an example of the very Strategy design pattern) and allows for different orders. For example it is possible to create ordering based on one or more properties and whatever the developers need to.

Of course, this is one of the cases when trying to be a purely object oriented language sucks: we are creating object to do the job of functions. Yes... comparators should not really have any state and just provide a compare "functional" method. However, this i relatively natural in Java. And as Comparable is an interface, we can just provide anonymous comparators like we would have provided anonymous functions.

Collections.sort(lst, new Comparator<Integer>() {
    public int compare(Integer left, Integer right) {
        return new Integer(left * left).compareTo(right * right); 
    }
});

In Python < 2.3 we had quite the same interface. Of course, having higher order functions is great and it was possible to just pass a cmp function. However, now the cmp parameter of sort is deprecated and in Python 3 disappeared... we just can do better than that!

So now, we may want to reflect. For example, let us write the Java code we just presented in Python.

def si_cmp(a, b):
 return cmp(abs(a), abs(b))
 
nums = [-5, -1, -3, 2, 4, 1]
nums.sort(cmp=si_cmp)
print nums

We are switching to python just to get rid of syntax noise. The very same logic does apply to Java. Basically we "lifted" the comparison operator with abs (ok, we should be a bit more rigorous, but the formalization is left to the willing reader).

In fact, this is an extremely common pattern. When we want to compare a pair of Person objects using their name property, we are once again "lifting" the comparison on strings to the comparison on Persons.

class Person(object):
 def __init__(self, name, phone):
  self.name = name
  self.phone = phone
  
 def __repr__(self):
  return '{name=%s, phone=%s}' % (self.name, self.phone)
  
persons = [Person(name, phone) for name, phone in (
   ('John Smith', 'xxxxxxxx'),
   ('Robert White', 'yyyyyyyy'),
   ('Hannibal Lecter', 'zzzzzzzzz'))]
   
def name_cmp(fst, snd):
 return cmp(fst.name, snd.name)
 
persons.sort(cmp=name_cmp)

If we print the persons array the output is (with some formatting licensed regarding whitespace I took in order to better fit in the web page):

[{name=Hannibal Lecter, phone=zzzzzzzzz}, 
 {name=John Smith, phone=xxxxxxxx}, 
 {name=Robert White, phone=yyyyyyyy}]

It may not be apparent that this is an instance of the very same pattern. Perhaps, because we do not really consider accessing an attribute as a function call. Well, think about a getName method (which would be bad practice in Python, but it is exactly what you would have in Java...

class Person(object):
 def __init__(self, name, phone):
  self.name = name
  self.phone = phone
  
 def getName(self):
  return self.name

and then:

def name_cmp(fst, snd):
 return cmp(fst.getName(), snd.getName())

which in Python is exactly equivalent to:
def name_cmp(fst, snd):
 return cmp(Person.getName(fst), Person.getName(snd))

In fact, in Python we prefer using the useful operator module attrgetter.
That function takes the name of an attribute and returns a function that gets that attribute. So operator.attrgetter('name') is a function returning the name attribute of any given object.

import operator as op

def name_cmp(fst, snd):
 get_name = op.attrgetter('name')
 return cmp(get_name(fst), get_name(snd))

More reflections on this to come...

Friday, October 8, 2010

Parametrized Unit Testing (in Python)

In my last post I briefly discussed parametrized tests. Now it is time to see them in practice.

Right now, in Python there is no way to do parametrized testing. Well, actually there are many ways. Python is an extremely flexible and high level languange and adding that features is a matter of less than a hundred lines of code (or somewhat more, perhaps, if you want an even richer semantics). Just the bare bone functionality may take just a dozen lines of code.


For this reason, many people implemented their own version (see the linked thread). Plus there are nose and py.test variants. Actually, I have used nose to perform that kind of tests.


Nose allows tests to be simple functions (and uses python decorators to add setUp and tearDown functions when necessary).This is a (buggy) quicksort implementation in Python:

def quicksort(lst, select_pivot=lambda s, e: s + (s-e)/2):
    def partition(start, end):
        pivot = select_pivot(start, end)
        lst[pivot], lst[start] = lst[start], lst[pivot]
        left = start
        right = end + 1
        pivot_element = lst[pivot]
        
        while 1:
            left += 1
            right -= 1
            while (left <= end) and (lst[left] < pivot_element):
                left += 1
            while lst[right] > pivot_element:
                right -= 1
            if left > right:
                break
            lst[left], lst[right] = lst[right], lst[left]
        lst[start], lst[right] = lst[right], lst[start]
        return right
    
    def sort(start, end):
        if start >= end:
            return
        pivot = partition(start, end)
        sort(start, pivot-1)
        sort(pivot+1, start)
        
    sort(0, len(lst)-1)

And this is the test:

import qsort
from nose import tools

fixture_lsts = [
    [1, 2, 3, 4], [1, 2, 4, 3], [1, 3, 2, 4], [1, 3, 2, 3],
    [4, 2, 3, 1], [1, 4, 3, 2], [3, 2, 1, 4], [3, 2, 2, 4],
    [3, 1, 1, 3], [1, 1, 1, 1], [1, 2, 1, 1], [1, 2, 3],
    [1, 3, 2], [2, 1, 3], [2, 3, 1], [3, 1, 2], [3, 2, 1],
    [3, 2, 3], [3, 1, 1], [], [1]
]

def check_sorted(lst):
    sorted_lst = sorted(lst)
    lst = lst[:] # do not modify lists
    qsort.quicksort(lst)
    tools.assert_equals(lst, sorted_lst)
    
def test_quicksort():
    for lst in fixture_lsts:
        yield check_sorted, lst
        


This is what I see from WingIDE:

Buggy quicksort test output


In fact, the test function is allowed to be a generator. In this case, it is assumed to yield tuples where the first argument is a callable and the others are its parameters. This way it is very easy to do parametrized testing.

Please notice (the documentation points that out) that if you attach setup and teardown stuff to the test function then they will be run just once, before and after the whole test function and not before and after every parametrized test. This is what I usually would like, in fact. In order to have this behavior, it is sufficient to add the functions to the yielded callable.

For example, I see that in the proposed "check" function we are doing stuff which should be done in a setup method. Unfortunately one of my greatest dissatisfactions with nose (and with_setup in particular) is that it completely forces you to use a stateful model.

From the setup function to the executed test function the only way to communicate is through the global environment. This is half bad (I've not yet decided if I prefer the overkill of TestCase classes -- where classes are just used for grouping purposes without "true" object behavior from the user point of view -- or using module variables -- which seems very pythonic as modules are objects but somewhat encourages the use of global variables).

This works quite well if you model all your tests with this idea in mind (and basically it's not different from xUnit model, just no classes). Unfortunately when you use generator based testing the test function receives the part of the SUT to test as a parameter, and the parameter is not accessible from the setup function. The easiest solution (the one which does not involve writing some custom test loader or at least a new piece of code at nose level), is useing classes.

In fact, the yielded callable is allowed to be a callable object. In this case, it is possible to define a setup function "as if" it were a TestCase (still, it is not a TestCase.

The resulting code is:

class CheckSorted(object):
    def __init__(self, lst):
        self.lst = lst

    def setup(self):
        self.sorted_lst = sorted(self.lst)
        self.lst = self.lst[:]

    def __call__(self):
        qsort.quicksort(self.lst)
        tools.assert_equals(self.lst, self.sorted_lst)


def test_quicksort2():
    for lst in fixture_lsts:
        yield CheckSorted(lst),

Please notice the , (comma) at the end of the yield statement. It is necessaty as a tuple must be yielded. Of course code could be made more explicit with

yield (CheckSorted(lst), )

I don't actually like this solution very much, since it is somewhat confusing what the costructor should do: usually TestCases do not have a constructor. Setup methods just set them up. However, in this case, part of the job is performed by the constructor (and there is no other way -- unless we use a global or something like that, but I believe the cure is worse than the illness). Anyway, this is what we have.

Thursday, October 7, 2010

Parametrized Unit Testing

Classical unit testing examples are usually performed with rather simple objects whose methods have boundary conditions rather easy to identify. As a consequence, it is quite natural to separate success and failure conditions in separate methods of the TestCase object. The example is easy because it needs to be.

Moreover, the underlying assumption is that if the boundary conditions were not easily identifiable, the method should be split in sub-methods more easily testable. In practice this an ideal condition not always realizable in practice.

Sometimes algorithms just have complex behavior. Sometimes we can’t afford the cost of subroutine calls (especially if we are using dynamic languages were the compiler optimizations are basically crippled). Or perhaps we are using a language without functions (Java anyone?) and we don’t want to make internal working subroutines public (or toy with modifying access permissions). Of course, we could put up some more complex architecture with a facade with the full algorithm hidden and an internal fully testable object implementing the algorithm.

But this is, in fact, just a complications. Sometimes the need to test just a function/method with a lot of different parameters arises. Above situations usually lead to this. For example, consider the number of different arrays we would like to test a quicksort procedure with.

A first attempt at doing this could be something like:

def testQuicksort(self):
    for lst in parameters_set:
        self.assertEquals(sorted(lst), quicksort(lst))

Nice try. It surely works. And I’m quite sure I wrote code like this one. I should not, but hey I’m human. However, if some failures arise, this is troublesome for a bunch of reasons.

  1. It is not even always immediate to find out exactly which input caused the failure
  2. Just the first failure is reported, because of the way xUnit works (and besides, it is correct that it works that way, no need to fix the library). Sometimes seeing all the failing inputs can help in locating the place where the bug is.
  3. It may be very expensive in terms of cpu time to run the test on every input: most IDEs have feature like “rerun just the the tests which failed” but this strategy is useless in this case. The whole bunch of tests is seen as just one single failing test. In facts it is such a huge set of unrolled assertions that trivial Eager Test (assertion roulette) design errors pale in comparison.
  4. Additional stuff like singularly enabling/disabling (skipping) tests is not going to work.


  5. If you like to inspect failing tests with a debugger, it is a real bitch to set the debugger conditions to trigger the debugger just for the input that is going to fail.
  6. Kent Beck is going to cry.


Tuesday, October 5, 2010

Me and my IDE

Introduction

As customary in my articles on text editors and IDE I originally started this post with a self-condescending presentation of my skills and expertise. Then there have been some miscomprehension between chrome and blogger and the second part of the draft was not saved although blogger stated otherwise.

As a consequence I had the more self-condescending part and I lost the slightly more interesting one. I thought to do everybody a favor and delete that stuff. Now, a slightly less condescending paragraph on myself in order to motivate my point of view.

The past

I almost always used editors. Emacs, vim… and when on BBEdit and TextMate. Ah… light tools. Not only that: I was quite vocal in the IDE vs. Editor war (from the editor side, of course). And this can also be related to my choice of  “freedom languages” (more on this on later posts, perhaps?).

But some weeks ago I realized that now, without noticing, I’m mostly using IDEs. How has that happened? The easy part to explain is the Java related one. Java greatly benefits from IDEs (or perhaps the widespread diffusion of IDEs has tampered the development of good language features?); in fact it is quite much harder to develop Java without an IDE, since the language is very verbose. On the other hand its restrictions and bounds make it quite easier to develop good and powerful tools. Of course I could use UltraEdit on Windows… but no luck with OS X and Linux (moreover, I had to pay that as well… and I preferred to buy IntelliJ).

Emacs has very nice modules to work with Java. On the other hand they have to be installed separately (and in my experience also slowed down normal operations). When I have to loose to much time configuring my editor, I usually try other solutions. That is why I used TextMate and BBEdit extensively… because they are almost already set-up. I also like vim, since has decent default setups for most situations.

For somewhat similar reasons, I bought WingIDE to work with Python. I could set up Emacs (or vim) to do that stuff, of course. Still, I had not much time, not motivation… and I bought the thing. I especially liked it because it worked equally well on single files and on large projects. It also had the best auto-completion for Python I ever tried, nice debugging features (I use them rarely, but when I do I love to have an easy to use environment, especially because I use them rarely and I tend to forget the tricks). In the pack there is also nice support for unit-testing (and that is a definite plus).

In the same time I worked extensively with C++. I mostly used editors (vim, because I especially like it with C/C++) and BBEdit because of its easy to set up “project management” features. BBEdit also has most the processing power of classical editors and that is nice. I quite enjoyed the best effort auto-completion it provided and ctags based stuff. Ok… still too much template metaprogramming (or simply correct and not basic use of templates) crippled smart “IDE like” features which I would have liked.

In other words, I sincerely craved for a nice C++ IDE. Perhaps I should try Java based IDEs with C++ as well… I may be surprised. Though I hate setting up projects in IDEs. Every IDE has its conventions and it is not always immediate how to work with a project that both needs “classical command line build interface” (which is something I’m not disposed – nor can I afford to – lose) and the IDE based build. Somewhat CMake eases the problem, but at the price of restricting myself to the supported IDEs.

Other languages I use, but I think that they account for less than 30% of my development time (and it is going to drop). And perhaps they could be moved to IDEs as well… as soon as I solve the “single file project problem” [or I simply give up my hopes with that].

This basically rules me out from editor flame-wars. Still, I believe that learning using IDEs is a terrible mistake… but is a modest position. Moreover, nowadays most editor proponents show you how to add all the IDE features to their editor of choice. So what’s the point? It’s not an Editor vs. IDE matter, it’s Emacs vs. Eclipse. And somewhat I believe Eclipse is going to become what Emacs was in the past years. And hopefully it is going to become faster like Emacs did…

The future

So what now? I’m lazy to the bone. And I’m quite new to the IDE world. I have my habits and I don’t really want to change them. And don’t believe the ones who tell that using IDEs is easier. Nor faster. It depends on your skill set. As I have my own requirements, some things may be slower with IDEs, at least until I figure out how to do them.

Which basically means that an IDE is not automatically more productive: it may be, provided that you know the tool. Yes, some have a very good learning curve (like WingIDE), others are very good, but need you to figure the way they work to exploit them fully (IntelliJ). Moreover, I still have to figure out how to develop Java projects easily built both with IDEs and with command line tools. Perhaps I should explore maven (which should be well integrated with all the main IDEs). After-all its racist on my part to spend time with cmake and auto-tools and not doing that for Java “based” projects.

Monday, October 4, 2010

Something on the edit distance (Levenshtein distance)

The edit distance intuitively measures the “distance” between two strings in terms some operations considered elementary (e.g., adding or removing a character). We could say that the edit distance between two words w1 and w2 is the minimum number of such operations necessary to transform w1 in w2.

The more mathematically inclined may notice this is actually a metric in mathematical sense. Let d(u,v):{\Sigma ^*} \times {\Sigma ^*} \to {\mathbb{N}^ + } \subset \mathbb{R}the edit function over the alphabet Σ. It is trivial to prove that. for all u and v, d(u,v) = 0\,\,\,\,{\rm{iff}}\,\,\,\,u = v and d(u,v) = d(v,u). The proof that the triangle inequality (d(u,v) \ge d(u,w) + d(w,v), with w \in {\Sigma ^*}) holds is slightly trickier and is provided just should have been provided at the end of the post, if your author were not a very lazy person (though he reserves to write it dow some time in the future).

There are multiple edit distance. The most popular is the Levenshtein distance. In fact, it is so popular that is called “Edit distance” tout court. The elementary operations considered by the Levenshtein distance are:
  1. Add a character
  2. Remove a character
  3. Substitute a character
Essentially, since we want the minimum number of operation, we should try all possible paths of modifications from the string w1 to w2. Luckily enough a maximum bound is easy to estimate. If len(w1) is n and len(w2) is m, we know for sure that we can remove all n characters from w1 and add the m characters of w2. That is to say, we can automatically discard all the paths of operations longer than m + n. Luckily enough, we can build algorithms much more efficient than that. The more popular algorithm to compute the edit distance can be expressed as:

package string_distance.levenshtein;

public class Distance {
    static public int distance(String u, String v) {
        int n = u.length();
        int m = v.length();

        int distance[][] = new int[n+1][m+1];

        for(int i = 0; i <= n; ++i) {
            distance[i][0] = i;
        }
        for(int j = 0; j <= m; ++j) {
            distance[0][j] = j;
        }

        for(int i = 1; i <= n; ++i) {
            for(int j = 1; j <= m; ++j) {
                if(u.charAt(i-1) == v.charAt(j-1)) {
                    distance[i][j] = distance[i-1][j-1];
                } else {
                    distance[i][j] = 1 + minimum(
                            distance[i-1][j],
                            distance[i][j-1],
                            distance[i-1][j-1]
                    );
                }
            }
        }

        return distance[n][m];
    }

    private static int minimum(int i, int j, int k) {
        return Math.min(Math.min(i,j), k);
    }
}
It is quite important to understand the basic idea. The algorithm uses an (n+1)x(m+1) matrix (provided that len(w1) = n and len(w2) = m) to store intermediate results. In an imperative setting, this data structure comes particularly handy and natural. In order to simplify the notation, we use Python slicing notation, e.g., s[0:4] is the substring of s starting at position 0 and ending before position 4 (that is to say, len(s[a:b]) = b-a). Essentially the i,j cell of the matrix holds the Levenshtein distance between w1[0:i] and w2[0:j]. This is why the matrix holds n+1 rows and m+1 columns: how many substring starting at 0 has string s of length l? l+1, because also the empty string and the full string are substrings. Also notice that the indexes i and j used throughout the algorithm do not refer to the “i-th character”, but to the i-th substring.

The first two loops consider the case when one of the two substrings is empty (s[0:0] is always the empty string). In this case, the edit distance is the length of the non-empty string. Intuitively, the first loop means that the edit distance of w1[0:i] with the empty substring of w2 is i, because we have to delete i characters, the second one that the edit distance of the empty substring of w1 with w2[0:j] is j because we have to add j characters.
The main part of the algorithm is constituted by the two nested for loops. They dominate the cost of the algorithm which always runs in O(nm), which means that is essentially quadratic in both space and time in the size of the longest string. In other words it quite unpractical to run the algorithm as described using large strings as input.
 For each possible pair of i and j, if the (i-1)-th character and the (j-1)-th character are the same, we can simply report the value computed for the (i-1)-th and the (j-1)-th substrings. That is to say, if u[i-1] = v[j-1], then the edit distance between u[0:i] and v[0:i] is by definition the same edit distance between u[0:i-1] and v[0:i-1]. Suppose it is not the case. We know that u[i-1] != v[j-1] and we know the edit distances between the pairs (u[0:i-1], v[0:j]), (u[0:i], v[0:j-1]), (u[0:i-1], v[0:j-1]). We say that the edit distance between s[0:i] and s[0:j] is 1 plus the minimum value among the edit distances between the aforementioned pairs. What we mean?
Suppose that the minimum edit distance is between u[0:i-1] and v[0:j]. Then it means that u[0:i-1] and v[0:j] are more similar than the other possibilities. And we know that if we get from u[0:i-1] to v[0:j] with k operations, then we can get from u[0:i] to v[0:j] with k+1 operations, because we can “delete” the last character from u[0:i] and then it is like starting from u[0:i-1]. The same ideas apply in the other cases, but it the additional action is an addition (the last character of v[0:j]) or a modification (we modify u[i-1] into v[j-1]).

This basically the idea. I have to say that I felt very uncomfortable with the indexing schema. I am used to half closed intervals when looping. That is to say I usually write loops where indexes range j \in \left[ {a,b} \right) \cap \mathbb{Z} (with a and b integers, of course), as for example for(int j = a; j < b; ++j).

This indexing stuff is not a minor matter, as off-by-one errors are very frequent and only thorough unit testing or demanding formal verification can provide use with sufficient confidence in the code itself[0].

 Perhaps the code could have been clearer if I used the convention that i and j are the indexes in the string (thus using i+1 and j+1 to access the matrix). However, not many presentations use this schema and I felt that this could make confusion when studying the code and comparing it to different implementations. A third possibility would have been to loop over the Java strings and manually increment the indexes. But for-each is not applicable to Java strings, thus I would have to create more ad-hoc code (which is what I would have done in real world code, at least until proved too inefficient).

 Luckily enough, in Python I was able to express the code in a more elegant way using the enumerate built-in. I am quite fond of that one! I also decided to use numpy arrays to implement the matrix. An alternative would have been creating a custom Matrix class (perhaps implemented with a linearized list, with a list of lists or with a dictionary). However, this would have been far less efficient and more grievous to write. Notice how two for loops have been substituted with matricial operations (and more to come… ;) ).

def fast_lev(a, b):
    m = len(a) 
    n = len(b)
    distance = np.zeros((m+1, n+1), dtype=int)
    distance[0,::] = np.arange(n+1)
    distance[::,0] = np.arange(m+1)
    
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                distance[i,j] = distance[i-1,j-1]
            else:
                distance[i,j] = min(
                    distance[i-1,j] + 1,
                    distance[i,j-1] + 1,
                    distance[i-1,j-1] + 1
                )
    return distance[m,n]

[0] Andrew Koenig, “C Traps and Pitfals”, Addison-Wesley Professional (some available here)