Sunday, October 31, 2010

What about Apple quality (iPad, iPhoto, Camera Connection Kit)?

The beginning

I quit using Apple products in the late nineties, as I felt that using Linux directly and exclusively was a far better choice for my needs. As a consequence, I stopped doing many things I used to do (such as recording and producing music), and that was it. Some years later, I bought a 12" PowerBook and I think it is one of the best notebooks I have ever had. Jaguar was crap compared to the systems we have today.

Integration with open source stuff was mostly left to the users (and that way I learnt lots of things I had taken for granted in Linux). On the other hand, I had lots of beautiful applications and a Unix system which worked flawlessly on my laptop, which, at the time, was not common at all. Since then I have been quite enthusiastic about Apple products. I bought many more Macs and used them as my main development platform. I bought Mac software. I have an iPad and an iPhone.

I was very satisfied with the general quality of the products: few bugs, few problems. On the other hand, I'm now starting to seriously question Apple's strategy. I have already expressed my concerns about the Mac Store, though I'm starting to believe things are worse.

The tragedy

For example, this fall I went to Cornwall on holiday. I brought my camera and my iPad. The idea was that I could store the pictures on my iPad (I also bought the Camera Connection Kit).

I was pleasantly surprised that the kit stored the pictures in raw format, so there was no loss of quality. On the other hand, this detail was not clear at all (I had to send myself an email containing one picture to find out that it was still a NEF file). I think this information is irrelevant for most amateurs (as most compact digital cameras save in JPEG); however, it is critical for a professional. And it may be critical for some amateurs as well, the ones with digital reflex cameras.

So as soon as my memory card was full, I connected it to the iPad and started downloading the pictures. Apparently, everything went well; I simply felt that the process was rather slow (but I may be biased, as I wanted to use the iPad at the same time to check my email and did not want to interrupt the transfer). So I decided I could afford to buy another memory card. With that one and the now emptied old one, I managed to take pictures without another transfer.

When I came back, I connected the iPad to the Mac Pro and started downloading the pictures. The process hung after importing a hundred pictures or so. Since I had connected it through my monitor (which has a USB hub), I initially blamed the monitor. However, I should have suspected iPhoto, since the "Cancel" button did not work and I had to manually force quit the application. Then I tried another, better-tested hub; same result. And then I connected the iPad directly to the Mac, and once again the transfer hung. My guess, at this point, was that the iPad had issues. In fact, I had transferred larger quantities of pictures before (but I forgot that I had not used iPhoto, but Nikon Transfer).

As a side note, I stopped using Nikon Transfer because I lost some pictures due to unclear options related to "deleting pictures after transfer". For NT it meant deleting all the pictures after the transfer, both the ones you transferred and the ones you did not, while I thought it conservatively meant "delete the pictures that you just transferred". Luckily, CardRaider saved the day.

Well, iPhoto was not very clear on what it meant by "Keep pictures" and "Delete pictures" either, so I decided to "keep" and delete manually. I think Apple could have been more precise.

Anyway, at this point I was quite desperate: I found no option in the iPad's photo application to move all the pictures in bulk somewhere I could access them. Luckily, I remembered that Preview is able to import pictures from cameras. And it is able to import pictures from the iPad as well. The program is faster than iPhoto: faster to start, faster to transfer. The image selection interface is also far clearer; moreover, failed or interrupted downloads clearly showed the already imported images, so resuming from the point where I left off was easier.

In fact, the first time I got the same transfer error I had with iPhoto. However, plugging and unplugging plus quitting and reopening fixed the issue, and now I have all the pictures safely on my hard drive.

Conclusion

I never really trusted iPhoto: I kept my files outside its library in order to access them more easily. Now I trust the program even less: I won't use it to import pictures, just to present them. The "find faces" functionality is just a curiosity: I still haven't found any use for it. Moreover, many of the imported pictures are not processed by that functionality. I don't know why.

As a consequence, I would like to get rid of iPhoto. I want to stop subscribing to MobileMe as well (e.g., out of the blue, iWeb is not able to update my website anymore). Now I just have to find a substitute: something with similar features which possibly retains the association between the raw pictures and the modified ones. Still, iPhoto was not that bad: I could have used it. I just think it is not refined and a bit left on its own.

And now there is the new version, and I am not very motivated to buy that software. On the other hand, abandoning most Apple ties is quite hard: since all the stuff is very well integrated, you somehow have to stop using all of it together, otherwise problems arise.

Some pictures

Ocean Pictures (Dawn)

Ocean Picture

Ocean Picture (Night)



Saturday, October 30, 2010

Python String Concatenation (again): the revenge of join!

In this second post (the first is here) I revisit the benchmarks I proposed to investigate the efficiency of string concatenation in Python. I believe those results are partial for two basic reasons:
  1. we benchmark only the concatenation of huge amounts of very small strings
  2. we create the strings inside the iteration; as a consequence, we implicitly penalize the methods using str.join
In this post I address the second issue. I moved the creation of the list outside the benchmarked code, and simplified the functions accordingly.
from cStringIO import StringIO

def str_sum(seq):
    out_str = ''
    for item in seq:
        out_str += item
    return out_str

def str_sum_b(seq):
    out_str = ''
    for item in seq:
        out_str = out_str + item
    return out_str

def str_join(seq):
    return ''.join(seq)

def str_join_lc(seq):
    return ''.join(item for item in seq)

def string_io(seq):
    file_str = StringIO()
    for item in seq:
        file_str.write(item)
    out_str = file_str.getvalue()
    return out_str
These are the same functions we benchmarked last time, simplified where necessary. str_sum and str_sum_b are basically unchanged. On the other hand, str_join is completely different: since its first three lines were used to create the list, they have now disappeared entirely.
@old_version
def str_join(loop_count):
    str_list = []
    for num in xrange(loop_count):
        str_list.append(str(num))
    out_str = ''.join(str_list)
    return out_str
As a consequence, the function is much shorter. I maintained the _lc variant; however, I will remove it from serious benchmarks, as it is inevitably slower than the regular version. I also introduced some new functions to test here:
import operator as op

def str_sum_c(seq):
    return reduce(op.add, seq)

def str_sum_d(seq):
    return reduce(op.iadd, seq)

def str_sum_e(seq):
    return sum(seq, '')
They are disabled as well: the _e variant is not even valid Python. Calling sum on a sequence of strings gives this instructive error:
TypeError: sum() can't sum strings [use ''.join(seq) instead]
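A quick check confirms this (the behavior is the same in Python 3):

```python
# Sanity check: sum() explicitly refuses to concatenate strings and
# points you at str.join instead.
try:
    sum(['a', 'b', 'c'], '')
except TypeError as e:
    print(e)  # sum() can't sum strings [use ''.join(seq) instead]
```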
The other two were excluded because they are too slow:
iterations    str_sum     str_sum_c    str_sum_d
100000        0.010572    2.781853     2.704670
Unfortunately, reduce appears to be very slow. Consider these additional functions:
def str_sum_f(seq):
    return reduce(str.__add__, seq)

def str_sum_g(seq):
    out_str = ''
    for item in seq:
        op.iadd(out_str, item)
    return out_str
iterations    str_sum_b    str_sum_f    str_sum_g
100000        0.011448     2.812612     0.018933
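A note of mine on str_sum_g (not from the measurements above): it is fast because it is broken. operator.iadd cannot mutate an immutable string in place; it returns a new string, which the loop in str_sum_g discards, so the function always returns the empty string:

```python
import operator as op

# op.iadd on an immutable str returns a new object; the original name
# is not rebound unless we assign the result ourselves.
out_str = ''
op.iadd(out_str, 'abc')            # result thrown away, as in str_sum_g
print(repr(out_str))               # '' -- the accumulator never grew

out_str = op.iadd(out_str, 'abc')  # rebinding the name does work
print(repr(out_str))               # 'abc'
```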
I decided to benchmark only a relatively small subset of the presented functions. The measures I plotted are the result of 10 executions of each function; the average time can be obtained simply by dividing by 10. In the first plot, I consider strings composed of a number of fragments one order of magnitude smaller than the ones benchmarked in my other post.
In this other plot, I present the results at the same order of magnitude. However, even though the times presented result from 10 executions of the same function, the spikes and irregularities were not eliminated by averaging. Consequently, I don't consider the plot meaningful unless I discover the reason for such behavior (which is completely unexpected).
In the future I will try a benchmark consisting of the concatenation of larger strings.

Thursday, October 28, 2010

jEdit 2

It seems I forgot to mention that jEdit supports rectangular selections and vertical editing (meaning that it is possible to modify multiple lines with the same command).

I also found buffer-local properties (written in the file itself) quite useful. However, I feel that using a standard completely different from Emacs's and vi's is not a very good idea. Yeah, Java is a world of its own.

Unfortunately, it seems that development is going quite slow. Well, as a TextMate user I should not comment on this... ;)

Unfortunately, IDEs like IntelliJ have so many more features that jEdit is not on par for Java development. On the other hand, its startup times are very reasonable, even on my netbook. I actually favor jEdit over other evolved editors on Windows. I need more time to make a serious comparison with BBEdit (which I have used since its first versions) and TextMate (which is somehow even more intuitive).



Saturday, October 23, 2010

Python repeated string concatenation: should we change our habits?

One of the most common errors newbies make is related to string concatenation in Python. In Python, strings are immutable objects; consequently, concatenating two strings effectively creates a new object (and possibly releases the two former objects). The common advice is to build strings using the popular str.join method.
sep.join(string_fragments_sequence)
which frequently just turns into:
''.join(string_fragments_sequence)
Of course, sometimes the advice goes as far as "never ever concatenate two strings". In some sense, the advice is good, but it is often given for the wrong reasons. In fact, most of the time the '%' (or format) idiom is simply clearer. Compare:
'[%d:%d] %d %s>' % (min, hour, ret_code, path)
with:
'[' + str(min) + ':' + str(hour) + '] ' + str(ret_code) + ' ' + path + '>'
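For comparison, the same message with str.format (available since Python 2.6) sits somewhere in between. The values below are placeholders of mine, and the names are slightly adapted to avoid shadowing the min builtin:

```python
# Placeholder values for the fields used in the snippets above.
minute, hour, ret_code, path = 5, 12, 0, '/tmp'

percent = '[%d:%d] %d %s>' % (minute, hour, ret_code, path)
fmt = '[{0}:{1}] {2} {3}>'.format(minute, hour, ret_code, path)
print(percent)  # [5:12] 0 /tmp>
assert percent == fmt
```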
Anyway, recently it has been disputed whether the str.join alternative is really so much faster than repeated concatenation, given that the += operator has been improved. My goal is now to investigate this matter.

Baseline

At first, I decided to benchmark this simple function:
def baseline(loop_count):
    for num in xrange(loop_count):
        str(num)
Since my testing code creates strings from a bunch of integers, here I am just benchmarking the performance of the bare operation, without actually building the string. The second step is to accumulate all the fragments. The idea is to discover the point where thrashing starts, that is to say, the range where the operation's behavior remains linear. Outside this zone, other factors come into play and the benchmark is not relevant. A third baseline test uses list comprehensions.
def baseline2(loop_count):
    lst = []
    for num in xrange(loop_count):
        lst.append(str(num))
    return lst

def baseline3(loop_count):
    return [str(num) for num in xrange(loop_count)]
If we examine the plot of the execution times, we can see that when we keep all the string fragments in memory, thrashing starts after 10^7 iterations. The graph is logarithmic on both axes. As a side note, using a list comprehension is twice as fast as appending items to the list one by one.

Benchmark of String Fragments Creation

Tested Code

These are the functions I actually benchmarked. Full code for the benchmark can be found in my git repository.
def str_sum(loop_count):
    out_str = ''
    for num in xrange(loop_count):
        out_str += str(num)
    return out_str

def str_join(loop_count):
    str_list = []
    for num in xrange(loop_count):
        str_list.append(str(num))
    out_str = ''.join(str_list)
    return out_str

def str_join_lc(loop_count):
    return ''.join(str(num) for num in xrange(loop_count))

from cStringIO import StringIO

def string_io(loop_count):
    file_str = StringIO()
    for num in xrange(loop_count):
        file_str.write(str(num))
    out_str = file_str.getvalue()
    return out_str
str_sum is essentially string concatenation using the += operator. str_join and str_join_lc are string concatenations performed with str.join; the str_join_lc variant uses a generator expression. string_io is a StringIO-based method (here, the StringIO from cStringIO). The code has been adapted from the code benchmarked here. However, I decided not to use the `` operator, even though it is faster: since it is going away from Python, I believe it is not relevant anymore. Moreover, using str in place of `` uniformly slows down all the methods.
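Such functions can be timed with the standard timeit module. This is a minimal sketch of mine, not the actual harness from the repository, and it uses range so that it also runs on Python 3:

```python
import timeit

# Same logic as str_join above (range instead of xrange for Python 3).
def str_join(loop_count):
    str_list = []
    for num in range(loop_count):
        str_list.append(str(num))
    return ''.join(str_list)

# Total wall-clock time for 10 executions of the function.
elapsed = timeit.timeit(lambda: str_join(10000), number=10)
print(elapsed)
```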

Results

In this graph, I present results for Python 2.6. Essentially, nowadays the fastest method to create the strings has proven to be the once-advised str.join. However, I believe the difference is not great enough to change habits.
Here I also present plots for Python 2.7 and Python 2.5; essentially there is no significant difference in the relative results. Python 2.7 is just faster than Python 2.6 with every considered method (and Python 2.5 is slower).
So, it appears that concatenating strings with the += operator is indeed fast enough in practice. Out of curiosity, I also tried a variant:
def str_sum_b(loop_count):
    out_str = ''
    for num in xrange(loop_count):
        out_str = out_str + str(num)
    return out_str
and I found no real difference from the += variant. Wow! The benchmark was executed on my Mac Pro. Python 2.5 and Python 2.6 are the versions distributed with the OS (Snow Leopard); Python 2.7 was downloaded from Python.org. In this other post I investigate string concatenation using pre-built lists of string fragments. The results are quite different, indeed!

Mac App Store Review Guidelines

Lately, it seems to me that Apple is doing everything it can to completely annoy developers and geeks. Which is, of course, a very bad premise for a post; as far as I have seen, Apple has proved to be right against all the odds most of the time. People continuously complain about Apple (and I have never seen a company with so many haters, with the possible exception of Microsoft), yet sales grow and the products improve.

In fact, Snow Leopard is the best OS they have made (and perhaps the best I have ever tried). The machines they build are sturdy and elegant at the same time. And the iPad is a wonderful tool.

There are many things I don't like, though. Perhaps one day I will elaborate on this. At present, however, I would like to point out the conditions for having an app accepted into the Apple store.

Some are worried that it will become the sole way to distribute software. I am confident it will not. First, the EU already questioned whether it was acceptable to do that on the iPhone/iPad; I am sure they will seriously avoid letting Apple do that on a PC. Moreover, Apple would lose developers and applications (thinking Adobe and Microsoft and many other commercial companies will comply with those guidelines is pure madness)... And I even think that it would be good to create such an OS.

It means slowly moving towards some of Raskin's ideas (he also opposed the very idea of third-party applications). However, it would be an OS completely useless to me. If they do it and they do it right, perhaps it is a good thing. But they won't get my money. Windows 7 is a solid system and Ubuntu gets better with every release: I'm not even sure they would catch up without Apple going so experimental (they just lack some software).

Back to the rules...

2.1 Apps that crash will be rejected

2.2 Apps that exhibit bugs will be rejected

Am I reading this for real? Every developer knows that every non-trivial piece of software has bugs. We strive to introduce fewer bugs and to reduce their severity. But no bugs? Madness. An uncaught exception means that the application is going to crash. As far as I can see, those rules basically reduce the set of acceptable apps to applications so small that they can be manually proved correct (unfortunately, this basically excludes GUI and concurrent applications).

So I think that those rules are fake. They don't mean a thing. If they do, I am going to buy some Apple software from the store and ask them to remove it as soon as it manifests a bug. And it will... ah, it will! Apple software is far from perfect: it just has a very careful UI design... ;)

2.7 Apps that duplicate apps already in the App Store may be rejected, particularly if there are many of them

Oh, really? So you've got to be fast! Perhaps if developers publish a bunch of word processors (which, with the Cocoa API, is quite simple), then Microsoft Office will be rejected. I don't really believe this, but it would be funny. In fact, I'm not even sure that MS is going to put anything on the store.

2.8 Apps that are not very useful or do not provide any lasting entertainment value may be rejected

What the fuck does it mean? Office does not provide any entertainment value: it's a tool. I am not entertained when I work with Word or with Keynote. Dear sirs at Apple... there are people working with your bloody Macintosh computers. We are not there to be entertained: we don't want a funny dog advisor to make the software entertaining.

2.16 Apps that download or install additional code or resources to add functionality or change their primary purpose will be rejected

At last, they rephrased this sentence from the iPad guidelines. The way it was before would basically have ruled out a web browser.

2.19 Apps that require license keys or implement their own copy protection will be rejected

2.20 Apps that present a license screen at launch will be rejected

I love those two. I hate developers spending effort on protections. If I like an app, I am going to pay for it. If I don't like it, I'm not gonna use it. There are developers who add copy protection before writing useful functionality (and sometimes they sell the betas as well... luckily, the Mac Store will at least make this behavior unrealizable).

2.18 Apps that install kexts will be rejected

2.23 Apps that spawn processes that continue to run after a user has quit the app without user consent will be rejected

7.4 Apps containing "rental" content or services that expire after a limited time will be rejected

OK... here it is obvious that other means of installation will remain.

2.24 Apps that use deprecated or optionally installed technologies (e.g., Java, Rosetta) will be rejected

No Eclipse on the Mac Store... It seems that Python apps will be OK, as Python ships with every Mac. At least nowadays.

2.25 Apps that do not run on the currently shipping OS will be rejected

Oh... so, developers, you always have to keep up with the latest Apple OS!


3.1 Apps with metadata that mentions the name of any other computer platform will be rejected

I don't know... CommodorEmulator seems to violate the rule.

6.1 Apps must comply with all terms and conditions explained in the Apple Macintosh Human Interface Guidelines

6.4 Apple and our customers place a high value on simple, refined, creative, well thought through interfaces. They take more work but are worth it. Apple sets a high bar. If your user interface is complex or less than very good it may be rejected

OK. Reasonable. Perhaps even possible. A "(Good UI)" certification would be very good; a prohibition to sell is bad.

6.2 Apps that look similar to Apple Products or apps bundled on the Mac, including the Finder, iChat, iTunes, and Dashboard, will be rejected

What is this called in English? Abuse of a dominant position? Yeah. If this Mac Store is ever going to be successful, then Apple will get into trouble. And rightly so.


7.6 In general, the more expensive your app, the more thoroughly we will review it

LOL. ROTFL. Makes sense, though.

8.3 Apps that are simply web clippings, content aggregators, or a collection of links, may be rejected

So, no Wikipedia app.

9.1 Apps that encourage users to use an Apple product in a way that may cause damage to the device will be rejected


LOL.

9.2 Apps that rapidly drain a product's battery or generate excessive heat will be rejected

So, no scientific computation tools allowed.

11.1 Apps portraying realistic images of people or animals being killed or maimed, shot, stabbed, tortured or injured will be rejected

11.3 "Enemies" within the context of a game cannot solely target a specific race, culture, a real government or corporation, or any other real entity

Looks like some games will be rejected. E.g., in WWII games, it seems you will have to be able to play both sides.


Friday, October 22, 2010

No more Java for OS X?

As of the release of Java for Mac OS X 10.6 Update 3, the version of Java that is ported by Apple, and that ships with Mac OS X, is deprecated.

This means that the Apple-produced runtime will not be maintained at the same level, and may be removed from future versions of Mac OS X. The Java runtime shipping in Mac OS X 10.6 Snow Leopard, and Mac OS X 10.5 Leopard, will continue to be supported and maintained through the standard support cycles of those products.

Link

Don't really know how to interpret this one.


At the moment Oracle does not offer an OS X Java implementation. What is going to happen?






Asus Survey

I own an ASUS netbook and I am quite satisfied with it. I also wrote about specific problems related to performance, Java, RAM, and those things you are supposed to be able to do with a netbook but which are painstakingly slow. However, things improved since I bought 2 GB of RAM.

Anyway... I'm a rather satisfied ASUS customer and I received a survey request.

"""Thank you for registering your Asus product. We are holding our 2010 Asus laptop current usage survey."""

So I decided to take the survey. OK... I started answering all the questions; then I tried to submit the whole thing.

The system would not let me send the survey unless all the questions were answered. Which is idiotic: I left some questions blank on purpose, because I did not want to answer or did not know how to. But I have to provide an answer nonetheless. And of course, that is a fucking random answer.

For example, they insist on asking me how I would like my next touch-sensitive ASUS laptop when I had just answered that I do not want any bloody touch-screen, touch-whatever laptop. I have an iPad for that. And they keep asking where I would use the laptop I told them I would not buy. WTF!

In the end, after giving some random answers (if your analysts and web designers can't do their job and you get worthless answers, the problem is yours, not mine), the survey could not be submitted without providing my personal details. I don't see the point.

I am doing you a favor, ASUS. You don't have to bother me this way: no survey from me under these conditions. Of course, ASUS can do well without my survey, but I don't think I'm the only one with a position like this. By the way, I considered filling in the survey because I am a satisfied customer: I was more than willing to spend time answering properly and reflecting on the answers. Instead, the survey really annoyed me.

Tuesday, October 19, 2010

Checked Exceptions

I definitely hate them. They are evil.



He is a boy-scout in comparison.

Monday, October 18, 2010

Spoilt by Python?

To my horror and dismay, I found out that my bash-fu (OK, zsh-fu) has dropped to the point of seriously sucking. The problem was simple... I made a judgment error and put a git repository in my Dropbox area, and I also used it actively from multiple computers (while I should have just used Dropbox to synchronize stuff and edited two clones).

So I basically found some files like "./.git/index (enrico-eee's conflicted copy 2010-10-16)" and my commits disappeared. Dropbox had renamed the "right" file with that name and used an older file. It is not clear why this happened. However, it was not pleasant.

Given my modest (but not non-existent) knowledge of git, I supposed that everything could be solved by renaming the files to their rightful names. And in fact it was. The point was renaming the files.

My idea was to use find. I tried some solutions and I "almost" got there; in fact, a couple of cut-and-pastes could have finished the job. However, I understood that:

  1. given my present knowledge of shell scripting, I would have done better using a for loop in the first place
  2. considering that I had to start from scratch, I would have finished earlier had I used Python
  3. laziness wins over thirst for knowledge if the specific knowledge is dull and uninteresting

I eventually put down a bunch of python lines doing the trick:

import os
import re
import shutil
from os import path

rex = re.compile(r"(?P<name>.*) \(enrico-eee's conflicted copy .*\)(?P<ext>.*)")

for root, dirs, files in os.walk(os.getcwd()):
    for filename in files:
        match = rex.match(filename)
        if match:
            new_name = "%(name)s%(ext)s" % match.groupdict()
            shutil.move(
                path.join(root, filename),
                path.join(root, new_name)
            )

I think my shell scripting abilities are going to keep decreasing. Almost every time I have a problem, I solve it with Python, since I'm far more skilled in Python scripting. I'm not even sure that is a bad thing.

Sunday, October 17, 2010

Anagrams: fast python algorithm

So in this post I showed how the trivial algorithm to compute anagrams is stupid (and consequently slow) beyond hope. On the other hand, I promised a nice algorithm to do it. In fact, it is just a matter of pre-processing the data.

The idea is to compute a key which is shared by all the words made of the very same letters, and then group all the words with the same key. One of the simplest keys would be a bag of the letters (if your language supports bags -- or multisets). Python sets discard duplicate letters, so they are not a good candidate. However, a "sorted" variant of the word has the right properties: every word with the same letters (which is the definition of an anagram) yields the same string if we sort its letters.

A good structure is a dictionary with this key as the key and a list of strings as the value. When we find a string with a previously seen key, we add it to the corresponding list of strings; otherwise, we create a new list. This is the very popular setdefault idiom:
d.setdefault(key, []).append(w)
The key for a word of length n is computed in O(n log n); insertion in the dictionary is O(1) (getting the key is O(1) on average and appending to the list is O(1)). So the algorithm uses rather efficient elementary operations. At this point, "general" questions on the processed wordlist can be answered in O(M), where M is the number of words (e.g., longest anagram, word with the most anagrams). Simply getting the anagrams of a given word is an O(n log n) operation. We are not using an excessive amount of memory: we basically have a hash table and some "pointers" to the original words. And now the code:
class Anagrams(object):
    def __init__(self, dictionary):
        self.dictionary = dictionary
        self.indexed_words = {}
        self.preprocess()

    def make_key(self, word):
        sorted_word = list(word)
        sorted_word.sort()
        return tuple(sorted_word)

    def preprocess(self):
        for word in self.dictionary:
            key = self.make_key(word)
            self.indexed_words.setdefault(key, []).append(word)

    def anagram(self, word):
        key = self.make_key(word)
        return tuple(self.indexed_words[key])

    def anagrams(self):
        return [anagrams for key, anagrams in self.indexed_words.iteritems()
                if len(anagrams) > 1]
As a matter of fact, if we extend the corpus of words, the required time increases: using the dictionary from here, it took 4 whole seconds. Which is nonetheless a lot less than some millions of years. I did not use that dictionary because it contains lots of misspellings, which can make sense when computing Levenshtein distances, but are quite a PITA when computing anagrams (we don't want non-existent words to pop up as solutions).
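The core idea can be sketched in a few self-contained lines (adapted to Python 3 here, since the class above uses Python 2's iteritems; the toy wordlist is mine):

```python
# Sorted-letters key: all words made of the same letters share the key.
def make_key(word):
    return tuple(sorted(word))

words = ['listen', 'silent', 'enlist', 'google', 'banana']
index = {}
for w in words:
    index.setdefault(make_key(w), []).append(w)

print(index[make_key('silent')])  # ['listen', 'silent', 'enlist']
```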


Saturday, October 16, 2010

Anagrams and (brains vs. cpu)

In AI there is a vast category of algorithms which follow a (blind) generate-and-test strategy. They essentially "generate" the solutions (as a whole or step by step) and then test whether each solution complies with an acceptance function. Unfortunately, this simple strategy is usually ineffective: the cost is often exponential in the size of the input and, since there is no "intelligence", all solutions (even dumb ones) are explored. In AI it is customary to use smarter search techniques (e.g., A*) or to "change" the problem so that even smarter solutions become viable.

An apparently unrelated example is "finding the anagrams of a word". The trivial (generate-and-test) solution is:
  1. [generate]: try all the permutations of the word
  2. [test]: check if the word is in the dictionary
and this strategy shares the very same problems as the other generate-and-test strategies. The permutations of a k-character word number k! = k(k-1)(k-2)...2·1. It is customary to see algorithms like:
def yield_anagrams(dictionary, word):
    return [word for word in
            (''.join(candidate) for candidate in it.permutations(word))
            if word in dictionary]
which in other languages are even longer and less attractive. This approach is rather tempting for very small words. I got some interesting wordlists from here. As a good starting point I used the 12Dict list and in particular the 2of12.txt file. More info on the lists can be found in the ReadMes. In the rest of the post we are going to work with that dictionary.

This graph shows the distribution of word lengths:

Distribution of Words by Length

This other graph shows how many groups of words with the same anagram we have by word length:

Frequency of Anagram Groups by Word Length

For those preferring tabular data:
Word length    Number of groups
2              8
3              97
4              319
5              313
6              329
7              239
8              124
9              73
10             30
11             27
12             9
13             4
14             2
As we can see, the longest words with anagrams are 14 characters long. Even though most anagrams are among very short words, our algorithm should work reasonably well with every input. The code I showed cannot. For example, let us consider the time required to compute the anagrams of words of increasing size:
Word            Length    Time (s)
scoreless       9         0.174
praetorian      10        1.828
rationalize     11        21.03
indiscreetly    12        269.0
Linear/linear graph of the performance of computing anagrams in a naive way

In the first graph, both axes are linear. If you do not believe me when I say that the growth is exponential, then look at this other graph, where the y axis is on a logarithmic scale. Exponential means that if you simply change the implementation or the implementation language, things are going to blow up once again just a couple of steps later.

Linear/log graph of the performance of computing anagrams in a naive way

So, let us sum up: we have 41238 words. Of these, 5644 have 12 or more characters. If we suppose that anagrams for words longer than 12 characters cost exactly as much as anagrams for 12-character words, the whole process is going to take 1518236 seconds, that is to say, more than 17 days. In fact, the whole process is going to be far more expensive. I have a full table (excel -- here you may want to download the file --, csv) with times estimated using the formula derived from our 4 measurements. Of course, this is not extremely precise, but the sheer magnitude of the computed values should make the point clear.

To compute all the anagrams for all the words up to size 11 (included) takes one day. If you want to reach size 13, it is going to take one year. However, it is still a long way to process all the words... for example, in order to discover whether our dictionary has any word of length 15 with an anagram, our program would have to run for more than 38 years. Concluding that no word longer than 14 has an anagram (according to our dictionary) takes more than 4000000 years. Of course, I did not wait all that time to gather these results... tomorrow I am showing a smarter strategy which drops the required time from 4000000 years to 0.16 seconds.
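The extrapolation can be sketched as follows. This sketch of mine assumes the cost grows exactly as k! and fixes the constant from the measured 269 s for a 12-letter word, a simplification of the actual fit used for the full table:

```python
import math

# Assumed model: time(k) = C * k!, with C fixed by the 12-letter word
# measurement (269 seconds). This is a rough per-word estimate.
C = 269.0 / math.factorial(12)

for k in (13, 14, 15):
    seconds = C * math.factorial(k)
    print(k, round(seconds / 3600.0, 1), 'hours per word')
```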


Friday, October 15, 2010

Get a list of words/dictionary in Python

In this post I briefly discussed an imperative algorithm to compute the edit distance between two words. In order to test the software (and benchmark it), I developed a simple Python module which downloads a dictionary from the web and returns a wordlist. I later made the API more complex and accurate. Basically, it all boils down to one function:

def get_wordlist(pathname, backup_url, no_words=None, min_length=2,
                 max_length=32, lengths=None, predicate=(lambda x: True)):
    '''Return a list of words.

    pathname is the place where the dictionary should be stored
    backup_url is used to download the dictionary to pathname if not present
    the other arguments have the same meaning as in read_all
    '''


Essentially, if a valid dictionary (according to the module) is found in pathname, then it is used; otherwise it is downloaded from backup_url, expanded into pathname and then used. The other arguments have the same meaning as in the read_all function; in particular:

dict_dir is the directory where the dictionary is located
no_words is the maximum number of words (or None for all) to be read
min_length/max_length specify the range of the lengths of words to be read
alternatively, lengths is an iterable containing the desired lengths (e.g., 3, 4, 7)
only words which make predicate true are returned


Essentially, get_wordlist can be used to get, in just one line of code, a list of words specifying the length range (or the length set -- e.g., only words with odd length from 3 to 10 (excluded) plus words of length 12). Moreover, the predicate parameter allows filtering words according to a given criterion, e.g., only words containing '-'. In fact, the dictionary I chose contains lots of misspelled words, compound words and similar stuff, which is quite appropriate for testing Levenshtein distance. In fact, no_words has been included for a reason: we randomly sample the words in the list because otherwise Levenshtein is going to blow up. It is an inherently quadratic algorithm, and limiting the number of words allows for many more nice tests.
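The filtering logic described above can be sketched as follows (read_words is a made-up name for illustration; the real module also downloads, caches and unpacks the dictionary, which is omitted here):

```python
import random

def read_words(lines, no_words=None, min_length=2, max_length=32,
               lengths=None, predicate=lambda w: True):
    # Keep words whose length is either in the explicit `lengths` set
    # or within [min_length, max_length], and which satisfy `predicate`.
    wanted = set(lengths) if lengths is not None else None
    words = []
    for line in lines:
        w = line.strip()
        ok_length = (len(w) in wanted) if wanted is not None \
            else (min_length <= len(w) <= max_length)
        if ok_length and predicate(w):
            words.append(w)
    if no_words is not None and len(words) > no_words:
        # random sampling keeps quadratic algorithms (Levenshtein) feasible
        words = random.sample(words, no_words)
    return words

# e.g., lengths 3, 5, 7, 9 and 12 only, excluding compound words
sample = read_words(['cat', 'house-boat', 'extraordinary', 'peace'],
                    lengths=(3, 5, 7, 9, 12),
                    predicate=lambda w: '-' not in w)
# -> ['cat', 'peace']
```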

The full source file can be found here.

Thursday, October 14, 2010

JEdit: the forgotten editor

These days, I'm rethinking lots of stuff regarding my geeky life. For work reasons, I hugely increased my exposure to Java. And to my surprise, I find it quite bearable. Perhaps, one day, I'll say that I actually like it. This is much more likely if they just add closures... but that's another story.

In this process, I decided to try jEdit. Back then, some people suggested jEdit as a viable alternative. I rather dismissed the claim based on some hasty experiments. I think that my bias was such that I could not be pleased with it. Perhaps back then (10 years ago or something like that) it was not that good; perhaps not. And then I stuck to my early negative experience.
As a partial excuse I can point out that back then the unix community was very skeptical of anything which was not vim or Emacs (with the same old flame wars between the two).

Right now, I'm using jEdit quite a lot and I want to describe my impressions. First: its main drawback is that it is slow to load. This is quite objective, even on my MacPro. However, BBEdit is not much faster and I use that regularly. And like BBEdit, it is very fast at opening new files once it is running. In other words, you can just leave it open and that is fine. Of course, with vi this is not necessary. The other major drawback is that it can't work in text-only mode: if you need to edit files remotely, you might be in trouble.

However, the ssh/ftp plugin is quite useful and partially solves this problem. By the way, there is no reason not to use vim when the need arises.

jEdit supports syntax highlighting for 130 languages (according to Wikipedia; I did not count them), which is good. In fact, it is likely that when you open a file you get the correct syntax highlighting, with customizable options for each language. Which I find nice.

Plugins

jEdit plugins are very easy to install; easier than the default installation process in Emacs and vim, though auto-installers are catching up. In fact, they are even easier than TextMate and BBEdit plugins. The plugin list contains lots of different plugins; the majority are Java related, but not exclusively. Some plugins just remind me of good old programming practices. For example, the whitespace plugin performs many whitespace-related functions, like showing trailing whitespace and auto-removing it on save, or auto-converting between tabs and spaces. This is something most programmers want.

jEdit has very flexible frame management. It is possible to split the text area into different parts, and to dock and undock specific panes. For example, with the Console plugin it is possible to have a "shell". The user chooses whether to have it in a separate window or to place it at the bottom of the main frame. Emacs, anyone?

Of course, in the console window it is possible to place many different REPLs, such as a Python or a Prolog shell. Or Clojure (which is also installed by the plugin: installing the plugin fully sets up a working Clojure environment).




Another very useful feature is the integration with BeanShell (which allows running Java snippets). Moreover, plugins can also be written in Jython and other JVM supported languages. Of course there are plugins providing Java autocomplete and refactoring and similar stuff.

There are plugins for spell-checking and most things one expects from a text editor, a LaTeX mode, etc. The XML plugin provides autocompletion for HTML files (or perhaps that is just built in), and that makes it quite a powerful editor for HTML.

Macros


jEdit supports macros, which should be rather akin to other macro facilities; they do not only generate text, but can remove, modify or do completely text-unrelated stuff, like changing interface elements. I have not explored this part yet.

The Editor


Then we come to basic text editing capabilities. This is the point where most NENV (Non-Emacs, Non-Vim) editors fail: they just suck at manipulating text. jEdit, even though not the most powerful editor I have tried, is an honest contender here.

Abbreviations provide expansion of long snippets of text. Buffer based completion is also available.

The clipboard is very powerful: there are "registers" and it is possible to manipulate text in there, append text to the clipboard, paste vertically, etc. All these possibilities have keyboard shortcuts.

Selection is also keyboard controllable (move one word/character/paragraph forward/backward etc). It is possible to select code blocks, up to parentheses, etc...

Search comes in two flavors: incremental search and "standard" search. Regular expressions are available. Moreover, it is possible to apply search to whole directories, filtering files according to file names. The "replace" part can also be provided by a BeanShell expression, which basically gives unlimited power to the replace feature.

Other text-related features (spaces->tabs, lowercase, delete to end of line, delete to start of line, etc.) are provided. So is paren matching.

Markers (bookmarks) are supported, and so is folding. Some utilities (file managers and similar stuff) are provided, and manual editing of jEdit configuration files is also possible.

Conclusion


I think that jEdit is a serious contender in the "2nd generation" editor market. It is free, and it supports many languages in a rather complete way. Of course, many languages which are supported in Emacs (Haskell, OCaml, Erlang, ...) are not supported. However, it is a nice alternative for lightweight Java editing, for working with Python, for web editing and similar stuff. Moreover, it is a very interesting alternative for Clojure/Scala development.

Wednesday, October 13, 2010

Ubuntu 10.10: so far, so good

On October 11th I decided to upgrade to the new ubuntu. I wanted to try the update procedure, as:
  1. It's not a big issue to reinstall afterwards (right now Linux is not critical for my job)
  2. I was really curious (and scared, from what I read on the web... but well, if OS X manages to just update, why should Ubuntu not?)
First I updated the system to the latest version as recommended. Then, I started the proper upgrade process. The installation took *hours*, and not only because of the download (here at the University we've got a pretty fast connection ;) ). It was just the setup/configure step: it took something like five or six bloody hours (I don't know exactly, since I left the netbook on and went home).

Besides, having questions asked during the installation (questions blocking the process) means you have to be in front of the monitor the whole time. So the day after I found some nice blocking panes asking me silly questions[0]. Then it also stopped responding to input. I inadvertently touched the shutdown key and the system ignored my attempts to press the cancel button. However, Ubuntu rebooted the half-installed system fine (I think it was just setting up unimportant applicative packages when I interrupted the procedure). dpkg --configure -a and now *everything* is fine.

Perhaps I have been lucky, but I think they are doing a great job. I was expecting an epic failure (my fault, by the way). Usage impressions another day. Maybe. Or maybe not.

Tuesday, October 12, 2010

Borg, Singleton, Memoization and my mistakes

Today something rather embarrassing is happening to me. In fact, I feel that I'm plainly missing the right design. As I wanted to do some testing and benchmarking on the quicksort algorithms we discussed (here, here and in a yet unpublished post), I needed a fairly large body of random data. As a consequence, I wanted to create a simple script generating a "random" sequence of lines which could be used as input to build a list of "Persons": I am interested in sorting "complex" objects according to multiple criteria, not only simple data. The language for the script is obviously Python, as it's the language in which I'm sure I will get the task done fastest.

As a remark... I am quite against random tests, as they are not reproducible. On the other hand, randomly generating the fixtures is fine: once they are generated, they remain the same (thus, tests are reproducible). Moreover, they can be easily accessed from multiple languages.

My first idea showed I had not clearly understood the problem at hand. I started with the idea of randomly generating the names (as random sequences of characters), and I asked myself whether I had to randomize name length as well. Luckily, I realized the mistake after writing 2 lines of code. It is strange to generate random strings to emulate names, when I could just download a list of first names and surnames and build the data from them. I was confident the effort would be roughly the same. In fact, it is. Consider these URLs (for example):
SURNAME_URL = 'http://names.mongabay.com/most_common_surnames.htm'
MALENAME_URL = 'http://names.mongabay.com/male_names.htm'
FEMNAME_URL = 'http://names.mongabay.com/female_names.htm'
I could just get the list of names with these specific lines:
parser = html.parse(url)
lines = [el.text for el
    in parser.xpath('//td/table/tr/td[1]')
    if el.text]
I do not quite like the fact that I'm so dependent on the website structure, and perhaps I could look for a site offering the names just as a plain file. But I don't really care enough to look for another website. Moreover, the "way" I get the lines is quite orthogonal to the matter of this post. Now the point is that I want to access those lists with a nice API in order to avoid code repetition. And here my brain completely short-circuited. The first idea I got was something like this:
from lxml import html

class NamesMeta(type):
    SURNAME_URL = 'http://names.mongabay.com/most_common_surnames.htm'
    MALENAME_URL = 'http://names.mongabay.com/male_names.htm'
    FEMNAME_URL = 'http://names.mongabay.com/female_names.htm'

    @property
    def male_names(cls):
        try:
            return cls._male_names
        except AttributeError:
            cls._male_names = cls._list_of(cls.MALENAME_URL)
        return cls._male_names

    @staticmethod
    def _list_of(url):
        parser = html.parse(url)
        return [el.text for el in parser.xpath('//td/table/tr/td[1]')
                if el.text]

class Names(object):
    __metaclass__ = NamesMeta
For some mysterious reason I liked the idea of accessing the names with:
Names.male_names
While I don't believe it is inherently a bad idea to be able to do it, I do believe it is a bad idea not to have an "ask" method. However, I am going to deal with one design problem at a time.

Going back to the example, this approach works, but it requires me to duplicate logic for the three properties (male_names, female_names and surnames). And I hate code duplication (and "exec"-ing methods in a loop is not a good strategy, either). Then there is this thing... I used metaclasses, for no good reason. Well, my idea was to use the "class object as singleton" strategy, which is a very easy way to implement a singleton. Here the first error is that I should really be using Borg/Monostate (or perhaps a bunch of functions...). The code is redundant (if finished) and overly complex: in order to understand how it works you have to be familiar with metaclasses and so on. And for no good reason.

Moreover, the @staticmethod points out another clear design problem. As long as the class is only conceived to hide the logic of getting the name lists, it is OK to have the function that gets the names in the class interface. However, I believe that we can generalize the class a bit. What if I want to "ask" more things? I could add an interface to register more sources of knowledge. I'm not sure whether this counts as "over-generalization" (that is, over-engineering); I think that generalizing, in this case, should help me clean the design. And I got to this:
from lxml import html

class Failure(Exception):
    pass

def name_getter(question):
    def _list_of(url):
        parser = html.parse(url)
        return [el.text for el in parser.xpath('//td/table/tr/td[1]')
                if el.text]
    url_db = dict(
        male_names='http://names.mongabay.com/male_names.htm',
        female_names='http://names.mongabay.com/female_names.htm',
        surnames='http://names.mongabay.com/most_common_surnames.htm'
    )
    try:
        url = url_db[question]
    except KeyError:
        raise Failure
    return _list_of(url)

class Knowledge(object):
    # (question, fetcher)
    answers = dict(
        male_names=name_getter,
        female_names=name_getter,
        surnames=name_getter
    )

    def __getattr__(self, key):
        try:
            getter = self.answers[key]
        except KeyError:
            raise Failure
        else:
            value = getter(key)
            setattr(self, key, value)
            return value
This is decent. I can access the "questions" rather easily. I can easily add an interface to query which questions the class can answer; the code is rather easy to understand, use and whatever. Moreover, using __getattr__ still has the advantage that the "I have not yet cached the answer" logic is dealt with by ordinary attribute lookup. We will get rid of this, in time. Right now, it is easy to understand and we do not have to write code to deal with it. But then... idiocy struck me again. I felt like being able to instantiate multiple Knowledge objects would be a BAD THING. Which in some sense it is... but hell, it's just a helper script (though the object-oriented designer in me is quite a bitch). And since the metaclass strategy worked for the past example, something like:
class KnowledgeMeta(type):
    # (question, fetcher)
    answers = dict(
        male_names=name_getter,
        female_names=name_getter,
        surnames=name_getter
    )
    # ...

class Knowledge(object):
    __metaclass__ = KnowledgeMeta
works. But at this point it should be clear that the (object-oriented) solution should use the Borg pattern. Identity is a false myth and this is an archetypal example of "only shared state matters". Moreover, it is so appropriate to have a Borg here... we have objects that send to the hive-mind whatever they learn. They are Borgs! The solution I propose is now:
from lxml import html

class Failure(Exception):
    pass

class Borg(object):
    # Martelli A., Martelli Ravenscroft A., Asher D.,
    # "Python Cookbook", 2nd Edition, O'Reilly
    _shared_state = {}
    def __new__(cls, *a, **k):
        obj = object.__new__(cls, *a, **k)
        obj.__dict__ = cls._shared_state
        return obj

    def __hash__(self):
        return 42 # randomly chosen number from a dice with one face

    def __eq__(self, other):
        try:
            return self.__dict__ is other.__dict__
        except AttributeError:
            return False

def name_getter(question):
    def _list_of(url):
        parser = html.parse(url)
        return [el.text for el in parser.xpath('//td/table/tr/td[1]')
            if el.text]
    url_db = dict(
        male_names='http://names.mongabay.com/male_names.htm',
        female_names='http://names.mongabay.com/female_names.htm',
        surnames='http://names.mongabay.com/most_common_surnames.htm'
    )
    try:
        url = url_db[question]
    except KeyError:
        raise Failure
    return _list_of(url)

class Knowledge(Borg):
    # (question, fetcher)
    find_answer = dict(
        male_names=name_getter,
        female_names=name_getter,
        surnames=name_getter
    )

    def __getattr__(self, key):
        try:
            finder = self.find_answer[key]
        except KeyError:
            raise Failure
        value = finder(key)
        setattr(self, key, value)
        return value

    @property
    def known_questions(self):
        return self.find_answer.keys()
The idea of using properties instead of method calls is perhaps not very good: it adds no advantage, no clarity. And it somewhat hides the fact that we may be accessing something on the net. In fact, a simple API like:
def answer(question): #...
def questions(): # ...
def add_question(question, answer_getter): # ...
would be perfectly acceptable, usable and clear. However, with the latter solution we would have some global variables; I still believe that an object-oriented interface is better here. __getattr__ just simplified some logic, but it also masked the fact that we are dealing with caching/memoization: __getattr__ is called iff the key is not found in __dict__ (that is to say, in Borg._shared_state), which is exactly what memoization is about. If we want to have an "ask" method, we have to manually deal with the logic of not finding stuff in our cached answers. And I believe an "ask" method would be very useful, for example because it allows "questions" which are not valid Python identifiers. I find much more readable something like:
knowledge.ask('number of dwarves')
than:
getattr(knowledge, 'number of dwarves')
Luckily enough, memoization patterns are widespread and we can also find ready-to-use implementations. However, this time, we provide a memoize decorator ourselves. This decorator takes the dictionary used as a cache as an argument. Of course, Borg._shared_state is the dictionary we want to use. The decorator is:
def memoize_method(db):
    def memoizer(method):
        def aux(self, key):
            try:
                value = db[key]
            except KeyError:
                value = db[key] = method(self, key)
            return value
        return aux
    return memoizer
Defining an "ask" method now is trivial. It should just get the answer finder from the find_answer dictionary, call it, and return the value. Caching is dealt by the decorator. So, eventually:
class Knowledge(Borg):
    # (question, fetcher)
    find_answer = dict(
        male_names=name_getter,
        female_names=name_getter,
        surnames=name_getter
    )

    @memoize_method(Borg._shared_state)
    def ask(self, key):
        try:
            finder = self.find_answer[key]
        except KeyError:
            raise Failure
        else:
            return finder(key)


    def __getattr__(self, key):
        return self.ask(key)

    @property
    def known_questions(self):
        return self.find_answer.keys()


Monday, October 11, 2010

Design of Sorting in Java and Python (pt. 2)

In this post I presented a pattern which arises very quickly when doing comparisons among objects. The pattern is essentially:


def name_cmp(fst, snd):
 return cmp(f(fst), f(snd))

In fact I see huge repetition in this kind of code, because we write code of that shape for many different "f"s (that is to say, getters or more complicated computations, but it boils down to that pattern).

For example, let us define a Person object in Java and also define comparators to sort a list of persons according to both the name and the phone number.

public class Person {
    private String name;
    private String phone;

    public Person(String name, String phone) {
        this.name = name;
        this.phone = phone;
    }

    public String getPhone() {
        return phone;
    }

    public String getName() {
        return name;
    }

    static final private PhoneComparator PHONE_COMPARATOR = 
     new PhoneComparator();
    static final private NameComparator NAME_COMPARATOR = 
     new NameComparator();
    
    static private class PhoneComparator implements Comparator<Person> {
        public int compare(Person o1, Person o2) {
            return o1.getPhone().compareTo(o2.getPhone());
        }
    }

    static private class NameComparator implements Comparator<Person> {
        public int compare(Person o1, Person o2) {
            return o1.getName().compareTo(o2.getName());
        }
    }

    static Comparator<Person> phoneComparator() { return PHONE_COMPARATOR; }
    static Comparator<Person> nameComparator() { return NAME_COMPARATOR; }
    
    // ide-generated toString, equals etc. here
};

I see huge repetition in this kind of code. But even in Python, defining all the comparators is boring and repetitive. Of course, in Python something like:

def cmp_builder(f):
 def cmp_(fst, snd):
  return cmp(f(fst), f(snd))
 return cmp_

would come quite naturally. For the ones who do not like functionally flavoured code, here is the Java variant (which is longer and uglier).

public interface Key<K extends Comparable<K>, E> {
    public K index(E obj);
}

public class KeyToComparator {
    public static <K extends Comparable<K>, T> Comparator<T> build(final Key<K, T> key) {
        return new Comparator<T>() {
            public int compare(T o1, T o2) {
                return key.index(o1).compareTo(key.index(o2));
            }
        };
    }
}

and then we use it like:

static public void sortWithKey(List<Person> lstPerson) {
    Collections.sort(lstPerson, KeyToComparator.build(
            new Key<String, Person>() {
                public String index(Person obj) {
                    return obj.getName();
                }
            }));
}

I hope I'm not the only one feeling that this stuff is ugly as hell. However, it is quite general and powerful. And it lets you avoid code repetition. At the very least, the code to write a Key is way shorter than the code to write a Comparator. Of course, if methods were first-class objects...
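Going back to the Python side, the comparator-building pattern can be exercised like this sketch. Note the cmp shim and functools.cmp_to_key: they are needed to run it on Python 3, where cmp and the cmp= sort parameter are gone (on Python 2 one would simply call people.sort(cmp=by_name)).

```python
import functools
import operator

def cmp(a, b):
    # builtin in Python 2; defined here so the sketch also runs on Python 3
    return (a > b) - (a < b)

def cmp_builder(f):
    # The pattern from above: lift cmp through a key-extracting function f.
    def cmp_(fst, snd):
        return cmp(f(fst), f(snd))
    return cmp_

class Person(object):
    def __init__(self, name, phone):
        self.name, self.phone = name, phone

people = [Person('Robert White', 'y'), Person('John Smith', 'x')]
by_name = cmp_builder(operator.attrgetter('name'))
people.sort(key=functools.cmp_to_key(by_name))
assert [p.name for p in people] == ['John Smith', 'Robert White']
```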

Sunday, October 10, 2010

Parametrized Unit Tests (in Java)

While "creating" by hand parametrized test functionality is not extremely hard in Python (and in fact there are multiple people who did that, plus the "blessed" implementation which is going to be included in unittest2 some time soon, I hope), doing the same thing in Java is a nightmare.

Luckily enough, JUnit 4 offers the functionality out of the box.

In fact, I have to admit that it seems cleaner than the Python alternative. Let us start with a Java quicksort (this time ugly, but not buggy, as far as my tests are correct).

package sorting.quicksort;

import java.util.List;

public class QuickSorter<E extends Comparable<E>> {
    List<E> lst;

    public synchronized void sort(List<E> lst) {
        this.lst = lst;
        sort(0, lst.size()-1);
    }

    private void sort(final int start, final int end) {
        if(start >= end) {
            return;
        }

        int pivot = partition(start, end);

        sort(start, pivot-1);
        sort(pivot+1, end);
    }

    private int partition(int start, int end) {
        int pivot = choosePivot(start, end);
        swap(pivot, start);

        int leftProcessed  = start, rightProcessed = end + 1;
        E pivotElement = lst.get(start);

        while(true) {
            do {
                leftProcessed++;
            } while(leftProcessed <= end &&
                lst.get(leftProcessed).compareTo(pivotElement) < 0);
            do {
                rightProcessed--;
            } while(lst.get(rightProcessed).compareTo(pivotElement) > 0);
            if(leftProcessed > rightProcessed) {
                break;
            }
            swap(leftProcessed, rightProcessed);
        }

        swap(start, rightProcessed);
        return rightProcessed;
    }

    private void swap(final int i, final int j) {
        E tmp = lst.get(i);
        lst.set(i, lst.get(j));
        lst.set(j, tmp);
    }

    private int choosePivot(final int left, final int right) {
        return left + (right - left)/2;
    }
}

The test case is:
package sorting.quicksort;

import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

import java.util.*;

@RunWith(value=Parameterized.class)
public class QuickSorterTest<E extends Comparable<E>> {
    private List<E> lst;
    private List<E> copyOfLst;
    private QuickSorter<E> sorter;

    public QuickSorterTest(List<E> lst) {
        this.lst = lst;
    }

    @Parameterized.Parameters
    public static Collection<Object[]> getParameters() {
        return Arrays.asList(
                new Object[]{ Arrays.asList(1, 2, 3, 4) },
                new Object[]{ Arrays.asList(1, 2, 4, 3) },
                new Object[]{ Arrays.asList(1, 3, 2, 4) },
                new Object[]{ Arrays.asList(1, 3, 2, 3) },
                new Object[]{ Arrays.asList(4, 2, 3, 1) },
                new Object[]{ Arrays.asList(1, 4, 3, 2) },
                new Object[]{ Arrays.asList(3, 2, 1, 4) },
                new Object[]{ Arrays.asList(3, 2, 2, 4) },
                new Object[]{ Arrays.asList(3, 1, 1, 3) },
                new Object[]{ Arrays.asList(1, 1, 1, 1) },
                new Object[]{ Arrays.asList(1, 2, 1, 1) },
                new Object[]{ Arrays.asList(1, 2, 3) },
                new Object[]{ Arrays.asList(1, 3, 2) },
                new Object[]{ Arrays.asList(2, 1, 3) },
                new Object[]{ Arrays.asList(2, 3, 1) },
                new Object[]{ Arrays.asList(3, 1, 2) },
                new Object[]{ Arrays.asList(3, 2, 1) },
                new Object[]{ Arrays.asList(3, 2, 3) },
                new Object[]{ Arrays.asList(3, 1, 1) },
                new Object[]{ Arrays.asList() },
                new Object[]{ Arrays.asList(1) },
                new Object[]{ Arrays.asList("hello", "cruel", "world") });
    }

    @Before
    public void setUp() {
        lst = new ArrayList<E>(lst);
        copyOfLst = new ArrayList<E>(lst);
        sorter = new QuickSorter<E>();
    }

    @Test
    public void testSort() throws Exception {
        sorter.sort(lst);
        Collections.sort(copyOfLst);
        Assert.assertEquals(copyOfLst, lst);
    }
}

Now, the idea is quite clear, although the code to be written is ugly. The parametrized test case is annotated with @RunWith and Parameterized.class. The test case must have a method annotated with @Parameterized.Parameters; this method returns a collection of arrays of objects.

The idea is that for each element returned by that method a new test case object is constructed using the array of objects as actual arguments. In fact, the design is not extremely different from what we saw in Python.

In Python we had a generator yielding the tests. Here we have a static method which does quite the same job. Of course, here a Collection is returned. And Java syntax here is quite cumbersome. Luckily enough there is some flexibility with types so that I do not need to make different test case classes just because of that.

In the example we sort both arrays of integers and arrays of strings.

The setUp method and the test method do roughly the same thing they do in Python. And also the Java version uses a constructor (in fact, I'm led to believe that it is simply unavoidable).

Saturday, October 9, 2010

Design of Sorting in Java and Python (pt. 1)

The sorting interface built into the Java collections framework essentially depends on objects which are either Comparable or compared through external comparator objects. The idea is quite straightforward: if the object does not provide a natural ordering (or the developers did not provide one), it is very easy to create a Comparator object.

Such an object takes two objects of type T and returns -1, 0 or 1 depending on whether the first object is "less than", "equal to" or "greater than" the second object. This strategy is extremely flexible (in fact, it is also an example of the very Strategy design pattern) and allows for different orders. For example, it is possible to create orderings based on one or more properties, and on whatever else the developers need.

Of course, this is one of the cases where trying to be a purely object-oriented language sucks: we are creating objects to do the job of functions. Yes... comparators should not really have any state; they just provide a compare "functional" method. However, this is relatively natural in Java. And since Comparator is an interface, we can just provide anonymous comparators the way we would have provided anonymous functions.

Collections.sort(lst, new Comparator<Integer>() {
    public int compare(Integer left, Integer right) {
        return new Integer(left * left).compareTo(right * right); 
    }
});

In Python < 2.3 we had quite the same interface. Of course, having higher-order functions is great and it was possible to just pass a cmp function. However, now the cmp parameter of sort is deprecated, and in Python 3 it has disappeared... we can just do better than that!

So now, we may want to reflect. For example, let us write the Java code we just presented in Python.

def si_cmp(a, b):
 return cmp(abs(a), abs(b))
 
nums = [-5, -1, -3, 2, 4, 1]
nums.sort(cmp=si_cmp)
print nums

We are switching to Python just to get rid of syntax noise; the very same logic applies to Java. Basically, we "lifted" the comparison operator with abs (OK, we should be a bit more rigorous, but the formalization is left to the willing reader).

In fact, this is an extremely common pattern. When we want to compare a pair of Person objects using their name property, we are once again "lifting" the comparison on strings to the comparison on Persons.

class Person(object):
 def __init__(self, name, phone):
  self.name = name
  self.phone = phone
  
 def __repr__(self):
  return '{name=%s, phone=%s}' % (self.name, self.phone)
  
persons = [Person(name, phone) for name, phone in (
   ('John Smith', 'xxxxxxxx'),
   ('Robert White', 'yyyyyyyy'),
   ('Hannibal Lecter', 'zzzzzzzzz'))]
   
def name_cmp(fst, snd):
 return cmp(fst.name, snd.name)
 
persons.sort(cmp=name_cmp)

If we print the persons array, the output is (with some liberties taken regarding whitespace, in order to better fit the web page):

[{name=Hannibal Lecter, phone=zzzzzzzzz}, 
 {name=John Smith, phone=xxxxxxxx}, 
 {name=Robert White, phone=yyyyyyyy}]

It may not be apparent that this is an instance of the very same pattern, perhaps because we do not really consider accessing an attribute to be a function call. Well, think about a getName method (which would be bad practice in Python, but is exactly what you would have in Java)...

class Person(object):
    def __init__(self, name, phone):
        self.name = name
        self.phone = phone

    def getName(self):
        return self.name

and then:

def name_cmp(fst, snd):
    return cmp(fst.getName(), snd.getName())

which in Python is exactly equivalent to:

def name_cmp(fst, snd):
    return cmp(Person.getName(fst), Person.getName(snd))

In fact, in Python we prefer using the handy attrgetter function from the operator module. That function takes the name of an attribute and returns a function that gets that attribute. So operator.attrgetter('name') is a function returning the name attribute of any given object.

import operator as op

def name_cmp(fst, snd):
    get_name = op.attrgetter('name')
    return cmp(get_name(fst), get_name(snd))
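For completeness, this lifting is exactly what the key parameter of sort and sorted captures directly, and it is what replaces cmp in Python 3. A minimal sketch (Python 3 syntax, with a stripped-down Person for brevity):

```python
import operator

nums = [-5, -1, -3, 2, 4, 1]
# key=abs lifts the comparison through abs: elements are ordered by abs(x)
print(sorted(nums, key=abs))            # [-1, 1, 2, -3, 4, -5]

class Person(object):
    def __init__(self, name, phone):
        self.name = name
        self.phone = phone

persons = [Person('John Smith', 'xxxxxxxx'),
           Person('Hannibal Lecter', 'zzzzzzzzz')]
# attrgetter('name') lifts string comparison to Person comparison
persons.sort(key=operator.attrgetter('name'))
print([p.name for p in persons])        # ['Hannibal Lecter', 'John Smith']
```

Notice that key computes each attribute only once per element, while a cmp function recomputes it at every pairwise comparison.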

More reflections on this to come...

Friday, October 8, 2010

Parametrized Unit Testing (in Python)

In my last post I briefly discussed parametrized tests. Now it is time to see them in practice.

Right now, in Python there is no standard way to do parametrized testing. Well, actually there are many ways. Python is an extremely flexible, high-level language, and adding that feature is a matter of less than a hundred lines of code (or somewhat more, perhaps, if you want an even richer semantics). The bare-bones functionality may take just a dozen lines of code.


For this reason, many people implemented their own version (see the linked thread). Plus, there are nose and py.test variants. I have actually used nose to perform that kind of test.


Nose allows tests to be simple functions (and uses Python decorators to add setUp and tearDown functions when necessary). This is a (buggy) quicksort implementation in Python:

def quicksort(lst, select_pivot=lambda s, e: s + (s-e)/2):
    def partition(start, end):
        pivot = select_pivot(start, end)
        lst[pivot], lst[start] = lst[start], lst[pivot]
        left = start
        right = end + 1
        pivot_element = lst[pivot]
        
        while 1:
            left += 1
            right -= 1
            while (left <= end) and (lst[left] < pivot_element):
                left += 1
            while lst[right] > pivot_element:
                right -= 1
            if left > right:
                break
            lst[left], lst[right] = lst[right], lst[left]
        lst[start], lst[right] = lst[right], lst[start]
        return right
    
    def sort(start, end):
        if start >= end:
            return
        pivot = partition(start, end)
        sort(start, pivot-1)
        sort(pivot+1, start)
        
    sort(0, len(lst)-1)

And this is the test:

import qsort
from nose import tools

fixture_lsts = [
    [1, 2, 3, 4], [1, 2, 4, 3], [1, 3, 2, 4], [1, 3, 2, 3],
    [4, 2, 3, 1], [1, 4, 3, 2], [3, 2, 1, 4], [3, 2, 2, 4],
    [3, 1, 1, 3], [1, 1, 1, 1], [1, 2, 1, 1], [1, 2, 3],
    [1, 3, 2], [2, 1, 3], [2, 3, 1], [3, 1, 2], [3, 2, 1],
    [3, 2, 3], [3, 1, 1], [], [1]
]

def check_sorted(lst):
    sorted_lst = sorted(lst)
    lst = lst[:] # do not modify lists
    qsort.quicksort(lst)
    tools.assert_equals(lst, sorted_lst)
    
def test_quicksort():
    for lst in fixture_lsts:
        yield check_sorted, lst
        


This is what I see from WingIDE:

[Screenshot: buggy quicksort test output]


In fact, the test function is allowed to be a generator. In this case, it is assumed to yield tuples where the first element is a callable and the others are its parameters. This way, it is very easy to do parametrized testing.

Please notice (the documentation points this out) that if you attach setup and teardown functions to the test function, they will run just once, before and after the whole test function, and not before and after every parametrized test. What I usually want, in fact, is setup and teardown around every single parametrized test. In order to get that behavior, it is sufficient to attach the functions to the yielded callable.

For example, I see that in the proposed check function we are doing stuff which should be done in a setup method. Unfortunately, one of my greatest dissatisfactions with nose (and with_setup in particular) is that it forces you to use a stateful model.

From the setup function to the executed test function, the only way to communicate is through the global environment. This is half bad (I have not yet decided whether I prefer the overkill of TestCase classes -- where classes are used just for grouping, without "true" object behavior from the user's point of view -- or module variables -- which seem very pythonic, as modules are objects, but somewhat encourage the use of global variables).

This works quite well if you model all your tests with this idea in mind (and basically it is no different from the xUnit model, just without classes). Unfortunately, when you use generator-based testing, the test function receives the part of the SUT to test as a parameter, and that parameter is not accessible from the setup function. The easiest solution (the one which does not involve writing a custom test loader, or at least a new piece of code at the nose level) is using classes.

In fact, the yielded callable is allowed to be a callable object. In this case, it is possible to define a setup function "as if" it were a TestCase (still, it is not a TestCase).

The resulting code is:

class CheckSorted(object):
    def __init__(self, lst):
        self.lst = lst

    def setup(self):
        self.sorted_lst = sorted(self.lst)
        self.lst = self.lst[:]

    def __call__(self):
        qsort.quicksort(self.lst)
        tools.assert_equals(self.lst, self.sorted_lst)


def test_quicksort2():
    for lst in fixture_lsts:
        yield CheckSorted(lst),

Please notice the , (comma) at the end of the yield statement. It is necessary, as a tuple must be yielded. Of course, the code could be made more explicit with:

yield (CheckSorted(lst), )

I don't actually like this solution very much, since it is somewhat confusing what the constructor should do: usually TestCases do not have a constructor; setup methods just set them up. However, in this case, part of the job is performed by the constructor (and there is no other way -- unless we use a global or something like that, but I believe the cure is worse than the illness). Anyway, this is what we have.