Sunday, October 17, 2010

Anagrams: fast python algorithm

So in this post I showed how the trivial algorithm to compute anagrams is stupid (and consequently slow) beyond hope. On the other hand I promised a nice algorithm to do that. In fact, it is just a matter of pre-processing the data. The idea is to compute a key which is unique to all the words which have the very same letters and then group all the words with the same key. One of the simplest keys could be a bag of the letters (if your language supports bags -- or multi-sets --). Python sets have unique keys, so they are not a good candidate. However a "sorted" variant of the word has the right properties. Every word with the same letters (which is the definition of palindrome) yields the same string if we sort its letters. A good structure is a dictionary with this key as key and a list of strings as value. When we find a string with a previously found key, we add it to the list of strings, otherwise we generate a new list. This is the very popular setdefault idiom:
d.setdefault(key, []).append(w)
The key for a word of length n is computed in O(n log n), insertion in the dictionary is O(1) (get the key is O(1) in average and append to the list is O(1)). So the algorithm uses rather efficient elementary operations. At this point "general" questions on the processed wordlist can be answered in O(M) if M is the number of the words (e.g., longest anagram, word with most anagrams). Simply getting the anagrams of a given word is an O(n log n) operation. We are not using an excessive amount of memory. We basically have an hash-table and some "pointers" to the original words. And now the code:
class Anagrams(object):
    def __init__(self, dictionary):
        self.dictionary = dictionary
        self.indexed_words = {}
        self.preprocess()

    def make_key(self, word):
        sorted_word = list(word)
        sorted_word.sort()
        return tuple(sorted_word)

    def preprocess(self):
        for word in self.dictionary:
            key = self.make_key(word)
            self.indexed_words.setdefault(key, []).append(word)

    def anagram(self, word):
        key = self.make_key(word)
        return tuple(self.indexed_words[key])

    def anagrams(self):
        return [anagrams for key, anagrams in self.indexed_words.iteritems()
                if len(anagrams) > 1]
As a matter of fact, if we extend the corpus of words, the required time increases. In fact, using the dictionary from here it took 4 whole seconds. Which is nonetheless a lot less time than some million of years. I did not use that dictionary because it contains lots of misspells which can have sense when computing Levenshtein distances, but are quite a PITA if we compute anagrams (we don't want non existing words to popup as solutions).

Technorati Tags: , , , ,

2 comments:

Tony said...

For my recursive anagram solver posted 11.11.11 11:11:11
see http://www.daniweb.com/software-development/python/code/393153/multiword-anagrams-by-recursive-generator

Unknown said...

Nice!