def get_wordlist(pathname, backup_url, no_words=None, min_length=2, max_length=32, lengths=None, predicate=(lambda x: True)): '''Return a list of words. pathname is the place where the dictionary should be stored backup_url is used to download the dictinary to pathname if not present the other arguments have the same meaning than in read_all '''
Essentially if in pathname a valid dictionary (according to the module) is found, then it used. Otherwise is downloaded from backup_url, expanded into pathname and then used. The other argument have the same meaning than in the read_all function and, in particultar:
dict_dir is the directory where the dictionary is located no_words is the maximum number of words (or None for all) to be read min_length/max_length specify the range of the lengths of words to be read alternatively, lengths is an iterable containing the lengths (3, 4, 7) only words which make predicate true are returned
Essentially get_wordlist can be used to get in just one line of code a list of words specifying the length range (or the length set -- e.g., we want only words with odd length from 3 to 10 (excluded) and words with length 12). Moreover, the predicate parameter allows to filter words according to a given criterion, e.g., only words containing the '-'. In fact, the dictionary I chose contains lots of misspelled words, compound words and similar stuff. This is quite appropriate to test levehenstein distance. In fact no_words is been included for a reason: we random sample the words in the list because otherwise levenshtein is going to blow up. It is an inherently quadratic algorithm, and limiting the number of words allows for many more nice tests.
The full source file can be found here.
No comments:
Post a Comment