Friday, October 15, 2010

Get a list of words/dictionary in Python

In this post I briefly discussed an imperative algorithm to compute the edit distance between two words. In order to test the software (and benchmark it) I developed a simple Python module which downloaded a dictionary from the web and returned a wordlist. Later I made the API more flexible and accurate. Basically it just boils down to one function:

def get_wordlist(pathname, backup_url, no_words=None, min_length=2,
                 max_length=32, lengths=None, predicate=(lambda x: True)):
    '''Return a list of words.

    pathname is the place where the dictionary should be stored
    backup_url is used to download the dictionary to pathname if not present
    the other arguments have the same meaning as in read_all
    '''


Essentially, if a valid dictionary (according to the module) is found at pathname, it is used. Otherwise, one is downloaded from backup_url, expanded into pathname, and then used. The other arguments have the same meaning as in the read_all function; in particular:
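A minimal sketch of that look-then-download logic, in modern Python and assuming a plain-text, one-word-per-line dictionary (the actual module also expands the downloaded archive and handles the no_words and lengths parameters, which are omitted here):

```python
import os
import urllib.request

def get_wordlist(pathname, backup_url, min_length=2, max_length=32,
                 predicate=lambda w: True):
    # Use the local copy if present, otherwise fetch it from backup_url.
    if not os.path.exists(pathname):
        urllib.request.urlretrieve(backup_url, pathname)
    with open(pathname) as f:
        words = [line.strip() for line in f]
    # Keep only words in the requested length range that satisfy predicate.
    return [w for w in words
            if min_length <= len(w) <= max_length and predicate(w)]
```
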

dict_dir is the directory where the dictionary is located
no_words is the maximum number of words (or None for all) to be read
min_length/max_length specify the range of the lengths of words to be read
alternatively, lengths is an iterable containing the allowed lengths, e.g., (3, 4, 7)
only words for which predicate is true are returned
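As a sketch, the per-word filter implied by these parameters might look like the following (keep and the sample words are hypothetical; the real read_all code may differ):

```python
def keep(word, min_length=2, max_length=32, lengths=None,
         predicate=lambda w: True):
    # lengths, when given, overrides the min/max range.
    if lengths is not None:
        ok_length = len(word) in lengths
    else:
        ok_length = min_length <= len(word) <= max_length
    return ok_length and predicate(word)

words = ['a', 'cat', 'make-up', 'extraordinarily']
kept = [w for w in words if keep(w, lengths=(3, 7))]
# kept is ['cat', 'make-up']: only words of length 3 or 7 survive
```
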


Essentially, get_wordlist can be used to get a list of words in just one line of code, specifying a length range or a length set -- e.g., only words with odd length from 3 to 10 (excluded), plus words of length 12. Moreover, the predicate parameter allows filtering words according to a given criterion, e.g., keeping only words containing a '-'. In fact, the dictionary I chose contains lots of misspelled words, compound words and similar stuff, which is quite appropriate for testing Levenshtein distance. The no_words parameter has been included for a reason: we randomly sample the words in the list, because otherwise the Levenshtein benchmark would blow up. Comparing every pair of words is inherently quadratic, and limiting the number of words makes many more tests feasible.
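To see why capping the list matters, here is a standard dynamic-programming Levenshtein (a sketch, not necessarily the implementation from the earlier post) together with the quadratic pair count that motivates sampling:

```python
import itertools
import random

def levenshtein(a, b):
    # Classic two-row dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# All-pairs comparison is quadratic: n words -> n*(n-1)/2 distance calls.
words = ['kitten', 'sitting', 'mitten', 'fitting']
pairs = list(itertools.combinations(words, 2))

# Sampling no_words items (here 2) keeps the benchmark tractable.
sample = random.sample(words, 2)
```
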

The full source file can be found here.
