Tuesday, October 12, 2010

Borg, Singleton, Memoization and my mistakes

Today something rather embarrassing is happening to me. In fact, I feel that I'm plainly missing the right design. As I wanted to do some testing and benchmarking on the quicksort algorithms we discussed (here, here and in a yet unpublished post), I needed a fairly large body of random data. As a consequence I wanted to create a simple script to generate a "random" sequence of lines which could be used as an input to build a list of "Persons". I am interested in sorting "complex" objects according to multiple criteria, not only simple data. The language for the script is obviously Python as it's the language I'm sure will get the task done faster (in my hands). As a remark... I am quite against random tests, as they are not reproducible. On the other hand randomly generating the fixtures is fine. Once they are generated, they remain the same (thus, tests are reproducible). Moreover, they can be easily accessed from multiple languages. My first idea showed I had not clear the problem at hand. I started with the idea of randomly generating the names (as random sequence of characters), and I asked myself if I had to randomize name length as well. Luckily, I realized the mistake after writing 2 lines of code. It is strange to generate random string to emulate names, when I could just download a list of first names and surnames and build the data from them. I was confident the effort would be roughly the same. In fact, it is. Considering these URLS (for example):
SURNAME_URL = 'http://names.mongabay.com/most_common_surnames.htm'
MALENAME_URL = 'http://names.mongabay.com/male_names.htm'
FEMNAME_URL = 'http://names.mongabay.com/female_names.htm'
I could just get the list of names with these specific lines:
parser = html.parse(url)
lines = [el.text for el
    in parser.xpath('//td/table/tr/td[1]')
    if el.text]
I quite do not like the fact that I'm so dependent on the website structure and perhaps I could look for a site offering the names just as a plain file. But I don't really care enough to look for another website. Moreover, the "way" I get the lines is quite orthogonal from the matter of this post. Now the point is that I want to access those files with a nice API in order to avoid code repetition. And here my brain completely short-circuited. The first idea I got was something like this:
from lxml import html

class NamesMeta(type):
    SURNAME_URL = 'http://names.mongabay.com/most_common_surnames.htm'
    MALENAME_URL = 'http://names.mongabay.com/male_names.htm'
    FEMNAME_URL = 'http://names.mongabay.com/female_names.htm'

    @property
    def male_names(cls):
        try:
            return cls._male_names
        except AttributeError:
            cls._male_names = cls._list_of(cls.MALENAME_URL)
        return cls._male_names

    @staticmethod
    def _list_of(url):
        parser = html.parse(url)
        return [el.text for el in parser.xpath('//td/table/tr/td[1]')
                if el.text]

class Names(object):
    __metaclass__ = NamesMeta
For some mysterious reason I liked the idea of accessing the names with:
Names.male_names
. While I don't believe it is inherently a bad idea to be able to do it, I do believe it is a bad idea not having an "ask" method. However, I am going to deal with one design problem at a time. Going back to the example, this approach works, but it requires me to duplicate logic for the three properties (male_names, female_names and surnames). And I hate code duplication (and looping "execing" methods is not a good strategy, either). Then there is this thing... I used metaclasses, for no good reason. Well, my idea was to use the "class object as singleton pattern" strategy. This is a very easy way to implement a singleton. Here the first error is that I should really be using Borg/Monostate (or perhaps a bunch of functions...). The code is redundant (if finished) and overly complex. In order to understand how it works you have to be familiar with metaclasses, et all. And for no good reason. Moreover, the @staticmethod points out another clear design problem. As far as the class is only conceived to hide the logic of getting the name lists, it is ok to have the function to get the names in the class interface. However, I believe that we can generalize the class a bit. What if I want to "ask" more things? I could add an interface to add more sources of knowledge. I'm not sure that this counts as "over-generalization" (that is over-engineering). I think that generalizing, in this case should help me to clean the design. And I got to this:
from lxml import html

class Failure(Exception):
    pass

def name_getter(question):
    def _list_of(url):
        parser = html.parse(url)
        return [el.text for el in parser.xpath('//td/table/tr/td[1]')
                if el.text]
    url_db = dict(
        male_names='http://names.mongabay.com/male_names.htm',
        female_names='http://names.mongabay.com/female_names.htm',
        surnames='http://names.mongabay.com/most_common_surnames.htm'
    )
    try:
        url = url_db[question]
    except KeyError:
        raise Failure
    return _list_of(url)

class Knowledge(object):
    # (question, fetcher)
    answers = dict(
        male_names=name_getter,
        female_names=name_getter,
        surnames=name_getter
    )

    def __getattr__(self, key):
        try:
            getter = self.answers[key]
        except KeyError:
            raise Failure
        else:
            value = getter(key)
            setattr(self, key, value)
            return value
This is decent. I can access the "questions" rather easily. I can easily add an interface to query which question the class can answer; the code is rather easy to understand, use and whatever. Moreover, using __getattr__ still has the advantage that the "I have not yet cached the answer" logic is dealt by Object. We will get rid of this, in time. Right now, it is easy to understand and we do not have to write code to deal with it. But then... idiocy struck me again. I felt like being able to instantiate multiple Knowledge classes would be a BAD THING. Which in some sense it is... but hell, it's just a helper script (though the object oriented designer in me is quite a bitch). And since the metaclass strategy worked for the past example, something like:
class KnowledgeMeta(type):
    # (question, fetcher)
    answers = dict(
        male_names=name_getter,
        female_names=name_getter,
        surnames=name_getter
    )
    # ...

class Knowledge(object):
    __metaclass__ = KnowledgeMeta
works. But at this point it should be clear that the (object oriented) solution should use the Borg pattern. Identity is a false myth and this is an archetypal example of "only shared state matters". Moreover, it is so appropriate to have a Borg here... we have object that send to the hive-mind whatever they learn. They are borgs! The solution I propose is now:
from lxml import html

class Failure(Exception):
    pass

class Borg(object):
    # Martelli A., Martelli Ravenscroft A., Asher D.,
    # "Python Cookbook", 2nd Edition, O'Reilly
    _shared_state = {}
    def __new__(cls, *a, **k):
        obj = object.__new__(cls, *a, **k)
        obj.__dict__ = cls._shared_state
    return obj

    def __hash__(self):
        return 42 # randomly chosen number from a dice with one face

    def __eq__(self, other):
        try:
            return self.__dict__ is other.__dict__
        except AttributeError:
            return False

def name_getter(question):
    def _list_of(url):
        parser = html.parse(url)
        return [el.text for el in parser.xpath('//td/table/tr/td[1]')
            if el.text]
    url_db = dict(
        male_names='http://names.mongabay.com/male_names.htm',
        female_names='http://names.mongabay.com/female_names.htm',
        surnames='http://names.mongabay.com/most_common_surnames.htm'
    )
    try:
        url = url_db[question]
    except KeyError:
        raise Failure
    return _list_of(url)

class Knowledge(Borg):
    # (question, fetcher)
    find_answer = dict(
        male_names=name_getter,
        female_names=name_getter,
        surnames=name_getter
    )

    def __getattr__(self, key):
        try:
            finder = self.find_answer[key]
        except KeyError:
            raise Failure
        value = finder(key)
        setattr(self, key, value)
        return value

    @property
    def known_questions(self):
        sreturn self.find_answers.keys()
The idea of using properties instead of method calls is perhaps not very good: it adds no advantage, no clarity. And somewhat hides the fact we may be accessing something on the net. In fact a simple API like:
def answer(question): #...
def questions(): # ...
def add_question(question, answer_getter): # ...
would be perfectly acceptable, usable and clear. However, in the latter solution we would have some global variables; I still believe that an object oriented interface here is better. __getattr__ just simplified some logic, but also masked that we are dealing with caching/memoization: __getattr__ is called iff the key is not find in __dict__ (that is to say in Borg._shared_state), which is exactly what memoization is about. If we want to have an "ask" method we would have to manually deal with the logic of not finding stuff in our cached answers. And I believe an "ask" method would be very useful. For example because it allows for "questions" which are not valid python identifiers. I find much more readable something like:
knowledge.ask('number of dwarves')
than:
getattr(knowlege, 'number of dwarves')
Luckily enough, memoization patterns are widespread and we can also find ready to use implementations. However, this time, we provide a memoize decorator. This decorator takes the dictionary used as a cache as an argument. Of course, Borg._shared_state is the dictionary we want to use. The decorator is:
def memoize_method(db):
    def memoizer(method):
        def aux(self, key):
            try:
                value = db[key]
            except KeyError:
                value = db[key] = method(self, key)
            finally:
                return value
        return aux
    return memoizer
Defining an "ask" method now is trivial. It should just get the answer finder from the find_answer dictionary, call it, and return the value. Caching is dealt by the decorator. So, eventually:
class Knowledge(Borg):
    # (question, fetcher)
    find_answer = dict(
        male_names=name_getter,
        female_names=name_getter,
        surnames=name_getter
    )

    @memoize_method(Borg._shared_state)
    def ask(self, key):
        try:
            finder = self.find_answer[key]
        except KeyError:
            raise Failure
        else:
            return finder(key)


    def __getattr__(self, key):
        return self.ask(key)

    @property
    def known_questions(self):
        return self.find_answer.keys()

Technorati Tags: , , ,

No comments: