Thursday, July 8, 2010

How much do you cost me?

Introduction

The study of algorithms and data structures is a major field in computer science. Efficiently representing and processing data has been of paramount importance since the early days of computing. Many problems are simply not approachable without proper study, since their computational cost would overwhelm the machines of the time.

Today, we have powerful machines. Tomorrow, we will have more powerful and “more parallel” machines. One day, our computers will surpass the computing power of our own brains. However, no matter how fast the computer is, there are always problems which are too hard, too expensive and, in short, intractable.

Essentially, the cost of an algorithm depends on the size of its input. For example, searching for an element in a list intuitively takes time proportional to the number of elements: after all, we don’t know where it is, we have to start somewhere, and the element could be at the other end of the list.

In this case we say that the worst case costs exactly A∙N, if A is the cost of comparing two elements and N is the length of the list. The average case is “only” A∙(N/2), since “on average” the element is found after examining only half of the elements. Intuitively, the probability that the element is the last one is equal to the probability that it is the first one, and so on. Computing the average case cost is usually harder than the worst case cost, because it involves probability and random variables.
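For the sake of concreteness, here is a minimal sketch of such a linear search (the function linear_search and the example list are mine, not taken from any library):

def linear_search(L, x):
    # Worst case: x is the last element, or absent: N comparisons.
    # Average case (x present, position uniformly random): about N/2.
    for i in xrange(len(L)):
        if L[i] == x:
            return i
    return -1

print linear_search([3, 1, 4, 1, 5], 4)  # prints 2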

Landau notation and estimates

The computational cost of algorithms is usually expressed in a unified notation (called Landau notation, from the name of the famous mathematician who introduced it in the field of calculus). Essentially, we don’t care about the multiplicative constants in the costs (e.g., A, “the cost of comparing two elements”) as long as they are, well… constants. We usually say that looking for an element in a list costs O(N), which means that the time is proportional to the size of the list itself.

Most of the time, computing the cost of an algorithm is rather straightforward. In a strictly imperative setting, you essentially estimate the number of times the body of a loop is run. For example, let us consider the following code:

M = {}
n = 4

# Fill a symmetric "matrix" (stored as a dictionary): the body of
# the inner loop runs i times for each value of i.
for i in xrange(0, n):
    for j in xrange(0, i):
        M[i,j] = M[j,i] = i + j

print M

It is rather easy to see that, if A is the cost of M[i,j] = M[j,i] = i + j, the whole thing costs (using Gauss’s formula for the sum of the first n integers):

A∙(0 + 1 + ⋯ + (n − 1)) = A∙n∙(n − 1)/2
That is to say, it is O(N^2). A correct but coarser argument is that the inner loop runs i times, and i is at most n; so in the worst case it runs n times. Using this upper bound, the estimate is n∙n (again, O(n^2)).
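As a quick sanity check (a little sketch of mine, not part of the original argument), we can count how many times the inner loop body actually runs and compare it with n∙(n − 1)/2:

count = 0
n = 100
for i in xrange(0, n):
    for j in xrange(0, i):
        count += 1  # stands in for the constant-cost assignment

print count            # prints 4950
print n * (n - 1) / 2  # prints 4950 as well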

Things are not always that easy: sometimes the worst case is completely uninteresting. E.g., the widely used simplex algorithm[0] has an exponential worst-case cost. Exponential means that it essentially sucks: 2^1000 is a huge[1] number, and 1000 is not an extremely large input size. In practice, the simplex algorithm is used anyway, since its average complexity is polynomial. There is still active research on the estimation of the complexity of that algorithm.

The Master Theorem

A large class of algorithms are called "divide et impera" (divide and conquer); a typical example is mergesort[2]. A trivial Python implementation is given here:

def merge_sort(L):
    # Base case: a list of zero or one elements is already sorted.
    if len(L) <= 1:
        return L
    else:
        # Divide: split the list into two halves and sort them recursively.
        middle = len(L) / 2
        left = merge_sort(L[:middle])
        right = merge_sort(L[middle:])
        # Conquer: merge the two sorted halves.
        return merge(left, right)

def merge(left, right):
    i = j = 0
    max_i = len(left)
    max_j = len(right)
    sorted_ = []
    # Repeatedly take the smaller of the two front elements.
    while i < max_i and j < max_j:
        if left[i] < right[j]:
            sorted_.append(left[i])
            i += 1
        else:
            sorted_.append(right[j])
            j += 1
    # One list is exhausted: append the leftovers of the other.
    if i < max_i:
        sorted_.extend(left[i:])
    else:
        sorted_.extend(right[j:])
    return sorted_
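A quick usage check (mine, not part of the original snippet):

print merge_sort([5, 2, 4, 7, 1, 3, 2, 6])  # prints [1, 2, 2, 3, 4, 5, 6, 7]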

Of course, no Python user would use these functions: they use recursion instead of iteration, they are not particularly efficient, and the "Timsort"[3] implemented in the Python library is far more efficient both algorithmically and code-wise. Algorithm books tell us that merge-sort has worst and average case O(N log N). As with most divide et impera algorithms, the result can be proved with an ad hoc demonstration. However, I have found it extremely useful to use the so-called master theorem[4]. This is essentially a Swiss Army knife result that gives us the complexity of an algorithm given the way the data is "divided" and "recomposed" (to rule, of course!).

Let T(n) be the complexity of the algorithm on an input of size n; f(n) is the cost paid outside the recursive calls (e.g., the time needed to decompose and recompose the data); a is the number of subproblems in the recursion; and n/b is the size of the input of each subproblem, so that

T(n) = a∙T(n/b) + f(n)

For example, in merge-sort a = 2 and b = 2, and f(n) is proportional to the size of the input vector. The master theorem states that:

if f(n) = O(n^(log_b(a) − ε)) for some ε > 0, then T(n) = Θ(n^(log_b(a)));
if f(n) = Θ(n^(log_b(a))), then T(n) = Θ(n^(log_b(a)) ∙ log n);
if f(n) = Ω(n^(log_b(a) + ε)) for some ε > 0 and, moreover, a∙f(n/b) ≤ c∙f(n) for some c < 1 and all sufficiently large n, then T(n) = Θ(f(n)).
The big theta notation essentially says that the specified bound is asymptotically both a lower and an upper bound. The big omega means that the specified bound is a lower bound.
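Applied to the merge-sort above (a worked example I am adding for concreteness): a = 2 and b = 2, so n^(log_2(2)) = n^1 = n. Since f(n) = Θ(n), the second case applies and

T(n) = Θ(n ∙ log n)

which matches the O(N log N) the algorithm books report.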

Notes

[0] http://en.wikipedia.org/wiki/Simplex_algorithm
[1] 10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
[2] http://en.wikipedia.org/wiki/Merge_sort
[3] http://bugs.python.org/file4451/timsort.txt
[4] Introduction to Algorithms; Cormen, Leiserson, Rivest, and Stein; MIT Press

More on Word

Some pieces of software are unbelievable: as soon as you start noticing how nice certain features are, you discover that some other parts are so incredibly broken that you want to throw the whole thing in the trash.

As is customary, I received the template to format my article, which was an IEEE standard one. Nice. I opened it in Word 2008 for Mac (and the template should be compatible with that), removed the contents of the template (as advised on the template itself) and filled it with my own text. Shit happens.

For no good reason, the styles built into the template were imported utterly broken. For example, the base font size was plainly wrong, some styles did not use the “keep with next paragraph” setting, and so on. Essentially, I had to manually compare the styles I used with the ones in the unmodified template in order to fix things. I managed, but it took time.

Then I sent the file to my advisor (as a Word file, since he might want to modify some parts). And he told me I was not following the standard, for example because the figure labels did not start with “Figure x.”. I found this quite strange, since I had been pleasantly surprised by the fact that this text was auto-inserted by choosing the style (I think as a kind of numbered list style). I checked my file, and the “Figure x.” was still there. So I opened the document with Word for Windows. And the “Figure x.” was missing.

Apparently Office 2008 for Mac and Office 2007 for Windows are not compatible at all. The question is: how many other things are going to be fucked up? If I get a lot of nice features which I can’t use, because otherwise I can’t share files with colleagues, then the whole point of using Word instead of LaTeX (and especially the “revisions” feature, which is very useful) is lost.

As a nice side note, I discovered that the IEEE bibliography style I downloaded to “auto-manage” the bibliography is utterly broken. First, references like [4] have italic numbers, while the standard I have to follow does not… perhaps there are multiple standards, but then… how do you find the right file?

Second, the bibliography itself is formatted as a huge double-column table, which is completely unusable in two-column mode. Other styles (such as the ones built into Word) do not format the references, so simply applying the “references” style to the unformatted references solves the problem. But not with the style I have to use. And I don’t understand why Word does not come with such widely used styles, so that I have to download broken third-party implementations. The solution was to format the references by hand.

Besides, Papers’ “import into Word” feature also gave me some problems, as every single link contained the PDF URL, even though that is not standard. I don’t understand if it’s Papers’ fault, Word’s fault or the style’s fault. Well… doing things by hand, I solved the problem.

I still don’t understand the “Word is user friendly” part.