Sunday, April 23, 2006

Quicksort (C and Python)

The quick-sort algorithm

Quick-sort is a sorting algorithm developed by C. A. R. Hoare that on average performs O(n log n) operations. The O() notation means that if the list length is n, the number of operations used to sort the list is on the order of magnitude of n log n (in this case).

To be more formal, we shall say that there exist two constants c1 and c2 such that the number of operations m satisfies c1 * n * log n <= m <= c2 * n * log n. Notice that c1 and c2 are not specified in the O notation.
We stop with the math here. If you want more precise information about the O notation, read Wikipedia.

Now I'm going to briefly explain the quick-sort algorithm, and then we will use that algorithm to exploit the expressive power of some programming languages. Keep in mind that more expressive does not necessarily mean faster to run. It means faster to write. But being faster to write means that we can try more efficient algorithms, thus improving performance in a more effective way.
Expressive languages are usually high-level. We can also profile the code, track down the bottlenecks and rewrite only a couple of functions in C or in ASM. This leads to fast programs with fast development cycles.

Quick-sort strategy

Quick-sort uses a divide and conquer approach. We divide a list in two sublists using a pivot element. All the elements less than the pivot come before it, those greater after. Then we recursively apply quick-sort to the two sublists.
Notice that the pivot is chosen arbitrarily. Some implementations choose the first element of the list, some choose it randomly, some use other criteria.
Moreover, some implementations of quick-sort sort the list in place (that is to say, memory use is bounded), while some need extra storage. Guess which ones are more efficient.

On Wikipedia you can find more information about the quick-sort algorithm. You can also find some implementations, and a link to a page full of implementations in different languages.

I wrote almost no code myself. I took code from Wikipedia or from around the web, and I'm going to comment on it. That is the "new" work I did: comments and explanations, not code. I did write a couple of the implementations myself, though I don't remember which ones.

Python

2.4.2 (#1, Mar 23 2006, 22:00:44)  
[GCC 4.0.1 (Apple Computer, Inc. build 5250)] 

Now let's examine a high-level Python implementation.

def qsort(L): 
    if L == []: return [] 
    return qsort([x for x in L[1:] if x< L[0]]) + L[0:1] + \ 
            qsort([x for x in L[1:] if x>=L[0]]) 

This is an exact description of the quick-sort algorithm, if you know what Python list comprehensions are. When I write [x for x in L if p(x)], Python computes the list of the elements of the container L for which p(x) is true.

L[a:b] is the sublist [L[a], L[a+1], ..., L[b-1]] (the element at index b is excluded). L[a:] == L[a:len(L)], L[:b] == L[0:b].

[x for x in L[1:] if x< L[0]] is the sublist of the elements of L, apart from the first one (which is the pivot), such that x is less than the pivot (L[0]).

[x for x in L[1:] if x>=L[0]] is the sublist of the elements of L, apart from the first one (which is the pivot), such that x is greater than or equal to the pivot (L[0]).
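Concretely, with a small list and the first element as pivot, the slices and comprehensions above behave like this (a tiny sanity check of standard Python semantics, not code from the original post):

```python
L = [3, 1, 4, 1, 5]
pivot = L[0]                                   # 3
assert L[1:] == [1, 4, 1, 5]                   # everything but the pivot
assert [x for x in L[1:] if x < pivot] == [1, 1]
assert [x for x in L[1:] if x >= pivot] == [4, 5]
```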

Then we pass those sublists to qsort, which recursively sorts them, and eventually we concatenate the sorted sublists with the pivot and return the result.
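Putting it together, a quick check that the one-liner really sorts (qsort exactly as defined above):

```python
def qsort(L):
    if L == []: return []
    return qsort([x for x in L[1:] if x < L[0]]) + L[0:1] + \
           qsort([x for x in L[1:] if x >= L[0]])

print(qsort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```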

This implementation, although beautiful to read and understand, is horrible from a performance point of view: it creates, concatenates and destroys a large number of lists. Moreover, it is purely recursive; for large lists we can even exhaust the stack space.

A version of quick-sort in Python that sorts in place, with no need for extra memory, is this:

def partition(array, begin, end, cmp):
    # Move the pivot (initially at array[begin]) back and forth
    # until the two indices meet; its final index is returned.
    while begin < end:
        while begin < end:
            if cmp(array[begin], array[end]):
                array[begin], array[end] = array[end], array[begin]
                break
            end -= 1
        while begin < end:
            if cmp(array[begin], array[end]):
                array[begin], array[end] = array[end], array[begin]
                break
            begin += 1
    return begin

def sort(array, cmp=lambda x, y: x > y, begin=None, end=None):
    if begin is None: begin = 0
    if end   is None: end   = len(array)
    if begin < end:
        i = partition(array, begin, end-1, cmp)
        sort(array, cmp, begin, i)
        sort(array, cmp, i+1, end)

If you want to run this code, remember that in Python indentation matters: the code was reformatted by hand, so if you just copy and paste it, check the indentation carefully.
Anyway, I benchmarked all of them. "Elegant quicksort" is the first implementation, "Standard quicksort" is the latter, and "Builtin sort" is Python's standard sort (which is not a quicksort, but a dedicated sorting algorithm).

Elegant qsort: [21.20, 19.75, 20.08, 20.46]
Builtin qsort: [1.61, 2.29, 2.29, 2.29]
Standard qsort: [31.15, 30.60, 30.68, 30.38]
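The original timing harness is not shown in the post; something along these lines (the function name, list size and value range are my guesses, not the original script) reproduces the shape of those numbers:

```python
import random
import time

def bench(sortfn, runs=4, n=100000):
    """Time sortfn on `runs` fresh random lists of n integers."""
    timings = []
    for _ in range(runs):
        data = [random.randrange(1000000) for _ in range(n)]
        start = time.time()
        sortfn(list(data))  # copy, so every run sees unsorted input
        timings.append(round(time.time() - start, 2))
    return timings

# e.g. bench(sorted) for the builtin, bench(qsort) for the "elegant" version
```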

Probably due to the heavy optimization of list comprehensions, the "elegant" algorithm is faster than the standard one. [from the future: I hardly believe that]. However, both of them are more than ten times slower than the Python built-in sort.
We have learnt:

  1. High-level languages are great for expressing algorithms in a readable way (and we can use them to test algorithms and see how they work)
  2. You should not implement "basic" algorithms in high-level languages: use their built-ins or write them in C (most built-ins are written in C).

C

Now it is time for a couple of pure C quick-sorts:

void qsort1(int a[], int lo, int hi ){ 
    int h, l, p, t; 
    if (lo < hi) { 
        l = lo; 
        h = hi; 
        p = a[hi]; 
        do { 
            while ((l < h) && (a[l] <= p))  
                l = l+1; 
            while ((h > l) && (a[h] >= p)) 
                h = h-1; 
            if (l < h) { 
                t = a[l]; 
                a[l] = a[h]; 
                a[h] = t; 
            } 
        } while (l < h); 
        t = a[l]; 
        a[l] = a[hi]; 
        a[hi] = t; 
        qsort1( a, lo, l-1 ); 
        qsort1( a, l+1, hi ); 
    } 
} 

Another implementation is:

void quicksort(int x[], int first, int last) { 
    int pivIndex = 0; 
    if(first < last) { 
        pivIndex = partition(x,first, last); 
        quicksort(x,first,(pivIndex-1)); 
        quicksort(x,(pivIndex+1),last); 
    } 
} 

int partition(int y[], int f, int l) { 
    int up,down,temp; 
    int piv = y[f]; 
    up = f; 
    down = l; 
    goto partLS; 
    do {  
        temp = y[up]; 
        y[up] = y[down]; 
        y[down] = temp; 
        partLS: 
        while (y[up] <= piv && up < l) { 
         up++; 
        } 
        while (y[down] > piv  && down > f ) { 
            down--; 
        } 
    } while (down > up); 
    y[f] = y[down]; 
    y[down] = piv; 
    return down; 
} 

C also has a built-in quick-sort (the standard library's qsort), which is much more advanced than these, since it lets you specify a comparison function and thus works with all kinds of data types.

Benchmarking these was hard: they were too fast to be measured on a list of a million elements with the standard POSIX clock function (and I avoided MacOS X specific timing functions, so that you can run the benchmark yourself).
So I used a list of 100000000 elements.
This is the (ugly) code I used for benchmarking. The test is not really interesting: of course, since C integers are machine integers, it is dramatically faster than the other solutions. It would have been more interesting to sort other data structures, where the gap could narrow.

#include <stdlib.h> 
#include <stdio.h> 
#include <time.h> 
#include "qsort1.h" 
#include "qsort2.h" 
#define LEN 100000000 

int intcmp(const void *a, const void *b){ 
    /* (x > y) - (x < y) avoids the overflow the classic "*a - *b"
       idiom can suffer on arbitrary ints; with rand() values either works */
    int x = *(const int*)a, y = *(const int*)b; 
    return (x > y) - (x < y); 
} 

int main(){ 
    size_t index; 
    int *l = malloc(sizeof(int)*LEN); 
    clock_t last, current; 
    srand(time(NULL)); 
    for(index = 0; index < LEN; ++index){ 
        l[index] = rand(); 
    } 
    
    last = clock(); 
    qsort(l,LEN, sizeof(int), intcmp); 
    current = clock(); 
    printf("C qsort: %f\n", (current-last)/(float)CLOCKS_PER_SEC); 
    
    for(index = 0; index < LEN; ++index){ 
        l[index] = rand(); 
    } 
    last = clock(); 
    qsort1(l,0, LEN-1); 
    current = clock(); 
    printf("int qsort: %f\n", (current-last)/(float)CLOCKS_PER_SEC); 
    for(index = 0; index < LEN; ++index){ 
        l[index] = rand(); 
    } 
    last = clock(); 
    quicksort(l,0, LEN-1); 
    current = clock(); 
    printf("qsort 2: %f\n", (current-last)/(float)CLOCKS_PER_SEC); 
    free(l); 
    return 0; 
} 

Friday, April 21, 2006

Something about Prolog...

As I had to admit in my last Prolog post, I'm no Prolog guru; I'd rather deserve the term newbie. However, I followed some mathematical logic courses that make it all less hard. If you find something incorrect, or if you don't understand something, comment on it.

The first thing you have to keep in mind is that you don't really tell Prolog how to do something. You rather ask "him" (or her?) to prove something following the rules you gave.

Basic datatypes

In Prolog we have "Strings". We can also use terms like string or dog. However, if they begin with a capital letter, they are variables, not simple terms.
We have lists, that is to say [] or [a, b, c]. If we write [H | T], H is the first element of the list and T is the list of all the remaining elements. If the list is [a, b, c], H is a and T is [b, c].
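In Python terms (my analogy, not standard Prolog practice), the [H | T] destructuring corresponds roughly to splitting a list into its head and tail:

```python
L = ['a', 'b', 'c']
H, T = L[0], L[1:]   # H is the head, T the tail
assert H == 'a'
assert T == ['b', 'c']
```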

Computation model

Relations

Suppose you are working with a graph. Now suppose that you want to query the graph. As an example I report here a graph about elven genealogy in the LOTR world. The schema comes from this site. If it is wrong, tell them, not me :)

For example, there are different interpretations regarding Indis: some say she's Ingwë's sister, some say she's the daughter of Ingwë's sister. We chose to say she's Ingwë's sister, since I did not want to introduce a character with no name into our database :).

The House of Finarfin:

                + - - - - - +
                :         Ingwë
             (sister)  + - - - - - +
                |      |           |
      Finwë = Indis   Olwë       Elwë = Melian
            |           |             |
          Finarfin = Ëarwen          Luthien
                   |
  +----------------+----------------+---------+
  |                |                |         |
  |                |              Aegnor      |
Finrod = Amarië Angrod = Edhellos    Galadriel = Celeborn
         :                |                         |
  (descendants        Orodreth          Elrond = Celebrian
     in Aman?)            |                    |
                  +-------+------+      +------+---------+
                  |              |      |      |         |
               Gil-galad    Finduilas Elladan Elrohir  Arwen

How would we represent this in an imperative language? We would probably have some kind of

class Elf 
- gender   (M/F) 
- name     (String) 
- children (Elven) 
- partner  (Elf) 
end 

We could also have a kind of "doubly linked" structure. This is probably wise, since otherwise querying someone's father would be dramatically expensive.

class Elf 
- gender   (M/F) 
- name     (String) 
- children (Elven) 
- partner  (Elf) 
- mother   (Elf) 
- father   (Elf) 
end 

In Prolog we do not have such a structure. We just state "facts". A fact is a predicate that is always true. You can see below how we define facts: for example, we are saying that aegnor is male.

male(aegnor). 
male(angrod). 
... 

You can imagine this like

Elf Aegnor; 
Aegnor.name = "Aegnor"; 
Aegnor.gender = M; 
... 

Note that in Prolog everything is a Term, both data and program. Above I'm saying that aegnor is male. I assert that male(aegnor) is true.
This is the complete list of facts defining the above graph. If I were smarter, I would have chosen a smaller graph. The list is long, but the imperative code to put it into memory would not have been significantly shorter (yes, we could read it from a file... where we would have had a list similar to the Prolog version :) )

female(amarie). 
female(arwen). 
female(celebrian). 
female(earwen). 
female(edhellos). 
female(finduilas). 
female(galadriel). 
female(indis). 
female(luthien). 
female(melian). 
male(aegnor). 
male(angrod). 
male(celeborn). 
male(elladan). 
male(elrohir). 
male(elrond). 
male(elwe). 
male(finarfin). 
male(finrod). 
male(finwe). 
male(gilgalad). 
male(ingwe). 
male(olwe). 
male(orodreth). 
parent(ingwe, olwe). 
parent(ingwe, elwe). 
parent(indis, finarfin). 
parent(finwe, finarfin). 
parent(olwe, earwen). 
parent(elwe, luthien). 
parent(melian, luthien). 
parent(finarfin, finrod). 
parent(earwen,   finrod). 
parent(finarfin, angrod). 
parent(earwen,   angrod). 
parent(finarfin, aegnor). 
parent(earwen,   aegnor). 
parent(finarfin, galadriel) . 
parent(earwen,   galadriel) . 
parent(angrod, orodreth). 
parent(edhellos, orodreth). 
parent(galadriel, celebrian). 
parent(celeborn, celebrian). 
parent(orodreth, gilgalad). 
parent(orodreth, finduilas). 
parent(elrond,    elladan). 
parent(celebrian, elladan). 
parent(elrond,    elrohir). 
parent(celebrian, elrohir). 
parent(elrond,    arwen). 
parent(celebrian, arwen). 

Rules

A rule is in the form:

goal(X) :- subgoal(X), subgoal2(X). 

In the above, goal(X) is true if and only if both subgoal(X) and subgoal2(X) are true. We say we reduce the goal to the two subgoals.
We will query something like:

goal(elrond). 

and it will succeed if and only if both subgoal(elrond) and subgoal2(elrond) succeed. Notice that this time we have constants, not variables.
Now it is time to define some interesting relations. A father is a male parent; a mother, a female parent. In Prolog we write:
father(X, Y) :- parent(X, Y), male(X). 
That is to say... X is father of Y if and only if X is parent of Y and X is male.
If we want to find Orodreth's father, we query:
father(X, orodreth). 
If it succeeds (and it does), X holds Orodreth's father: Angrod. The imperative version of this would have been... orodreth.father, if we were wise enough (or had enough space) to use a doubly linked structure. If we didn't, well... we would have to traverse the graph.
We can also do the converse.
father(finarfin, X). 
would return all the children of finarfin.

mother(X, Y) :- parent(X, Y), female(X). 
son(X, Y)    :- parent(Y, X), male(X). 
daughter(X, Y)  :- parent(Y, X), female(X). 

These are other trivial relationships. Now suppose you want to find all the ancestors of a given character. With the simpler structure, it is a pain in the ass; with the complete one, it is boring: basically you have to tell the system to traverse the graph following the father and mother links. Simple and tedious.
In fact, what you are computing is the "transitive closure" of the parent relationship. In Prolog it is written exactly as you would define it in natural language: X is an ancestor of Y if X is a parent of Y, or if X is a parent of Z and Z is an ancestor of Y.

ancestor(X, Y) :- parent(X, Y). 
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y). 
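To see what the interpreter has to do, here is a rough Python analogue of the two ancestor rules over the parent facts (only a handful of the facts above, stored as (parent, child) pairs; this is my sketch of the search, not how Prolog is actually implemented):

```python
# A handful of the parent/2 facts, as (parent, child) pairs.
parents = {
    ("finarfin", "galadriel"), ("earwen", "galadriel"),
    ("galadriel", "celebrian"), ("celeborn", "celebrian"),
    ("elrond", "arwen"), ("celebrian", "arwen"),
}

def is_ancestor(x, y):
    # ancestor(X, Y) :- parent(X, Y).
    if (x, y) in parents:
        return True
    # ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
    return any(is_ancestor(z, y) for (p, z) in parents if p == x)

assert is_ancestor("finarfin", "arwen")      # via galadriel and celebrian
assert not is_ancestor("arwen", "finarfin")
```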

All these relationships can be used to find ancestors or grandchildren, or to verify whether the relationship holds between two elves:
mother(galadriel, celebrian). returns yes, while mother(galadriel, arwen). returns no. You have to distinguish between defining facts and rules on the one hand and querying the system on the other. How to do this depends on your Prolog interpreter: read a book, a tutorial, something. This is just a hint (and what I need to explain quicksort later on).
If you name this file elves.pl you can load it in swi-prolog with

[elves]. 

(remember: no extension, and end with a dot). Then you can just make queries. To define new predicates, put them in elves.pl and reload the file.

Some arithmetic...

We can consider the trivial algorithm to sum two numbers, provided we have an INC and a DEC operation.
VAR A, B 
WHILE B > 0 
INC A 
DEC B 
END 

This is quite obvious. If you (like me) hate those pseudo code examples, this is pure C.

unsigned int sum(unsigned int a, unsigned int b){ 
    while(b) { 
        ++a; --b; 
    } 
    return a; 
} 
An example in a functional language is quite interesting. It resembles the basic definition of addition when working with primitive recursive functions (S is the successor function, and we omitted the i-th projection functions).

sum( x, 0   ) = x 
sum( x, y+1 ) = S(sum(x, y))

And this is Haskell (almost the same thing, apart from the notation).

mysum :: Int -> Int -> Int 
mysum x 0 = x 
mysum x y = mysum x (y-1) + 1 

In Prolog, once again, you follow the formal definition of sum in set theory (which, by the way, is much the same as above).

add(X, 0, X). 
add(0, Y, Y). 
add(X, Y, X1) :- succ(Y1, Y), add(X, Y1, X2), succ(X2, X1). 

How does this work? When I query something with add, Prolog will try to "match" the query against the arguments of the main goal. I don't want to be more precise about this "matching": it is called unification, and you are supposed to read a Prolog tutorial to understand more about it.
add(5, 0, 5). will match the first rule: the second parameter is 0, and the other two are the very same term. Now let's consider add(0, 5, X). This can match the second rule: 0 matches 0, 5 matches Y. From this point, for add to succeed, X must be 5 too. From swi-prolog:

2 ?- plus(0, 5, X). 
X = 5  
Yes 

However, we said that we expect to use Prolog predicates in many different ways: not only to sum two numbers, but also to check, given three numbers, whether the third is the sum of the first two. We also expect trivial predicates to be invertible: that is, given a number and the "sum", find the number that added to the first one gives the sum.
In other words, we expect to get subtraction by defining only the sum. And this is no magic: the Prolog builtin plus predicate does exactly this.
This is a copy and paste from the swi prolog terminal:

31 ?- plus(X, 4, 7). 
X = 3  
Yes 
32 ?- plus(8, 9, Z). 
Z = 17  
Yes 
33 ?- plus(3, Y, 1). 
Y = -2  
Yes 

For those of you who are interested, we can define such a predicate this way, using builtin prolog arithmetic and some metalogical variables:

add(X, Y, Z) :- nonvar(X), nonvar(Y), Z is X+Y . 
add(X, Y, Z) :- nonvar(X), nonvar(Z), Y is Z-X . 
add(X, Y, Z) :- nonvar(Y), nonvar(Z), X is Z-Y . 

nonvar succeeds if and only if its argument is not a variable. That is to say, if instead of plus we used add, add(X, 4, 7). would have used the "third" clause (for Y and Z are not variables, since they are "unified" -- substituted -- with 4 and 7), add(8, 9, Z). would have used the first clause, and add(3, Y, 1). the second.
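A Python sketch of the same dispatch (here None plays the role of an unbound variable; this is my illustration of the idea, not Prolog machinery):

```python
def add(x=None, y=None, z=None):
    """Mirror the three nonvar-guarded clauses: solve for whichever
    argument is "unbound" (None), returning the full (x, y, z) triple."""
    if x is not None and y is not None:   # add(X, Y, Z) :- Z is X+Y
        return x, y, x + y
    if x is not None and z is not None:   # add(X, Y, Z) :- Y is Z-X
        return x, z - x, z
    if y is not None and z is not None:   # add(X, Y, Z) :- X is Z-Y
        return z - y, y, z
    raise ValueError("need at least two bound arguments")

assert add(x=8, y=9) == (8, 9, 17)   # like plus(8, 9, Z) -> Z = 17
assert add(y=4, z=7) == (3, 4, 7)    # like plus(X, 4, 7) -> X = 3
assert add(x=3, z=1) == (3, -2, 1)   # like plus(3, Y, 1) -> Y = -2
```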

But we can do something even smarter. First of all, notice that the three clauses above are almost mutually exclusive. Two of the three clauses will fail if exactly two of the three parameters are not variables. They will all succeed in the very same way if all three are not variables. They will all fail if only one is not a variable.

So we can tell Prolog: once you have determined which argument is the variable, use only that rule; if none is a variable, use the first rule. We do this with "cuts". The rules become:

add(X, Y, Z) :- nonvar(X), nonvar(Y), !, Z is X+Y . 
add(X, Y, Z) :- nonvar(X), nonvar(Z), !, Y is Z-X . 
add(X, Y, Z) :- nonvar(Y), nonvar(Z), !, X is Z-Y . 

Cuts are a complex subject, and I'm not going to treat them here. Now I introduce the predicate between:


between(+Low, +High, ?Value)
    Low and High are integers, High >= Low. If Value is an integer,
    Low =< Value =< High. When Value is a variable, it is successively
    bound to all integers between Low and High. If High is inf
    or infinite, between/3 is true iff Value >= Low, a feature that is
    particularly interesting for generating integers from a certain
    value.

At this point we add these clauses:
add(X, Y, Z) :- nonvar(Z), !, between(0, Z, X), add(X, Y, Z). 
add(X, Y, Z) :- nonvar(Y), !, between(0, inf, X), add(X, Y, Z). 
add(X, Y, Z) :- nonvar(X), !, between(0, inf, Y), add(X, Y, Z). 

Now, given a single bound argument, you are able to generate pairs that satisfy the relationship (in fact, pairs of positive numbers only).
Now let's define prod.

prod(X, Y, Z) :- nonvar(X), nonvar(Y), !, Z is X*Y . 
prod(X, Y, Z) :- nonvar(X), nonvar(Z), !, Y is Z/X . 
prod(X, Y, Z) :- nonvar(Y), nonvar(Z), !, X is Z/Y . 
prod(X, Y, Z) :- nonvar(Z), !, between(1, Z, X), prod(X, Y, Z). 
prod(X, Y, Z) :- nonvar(Y), !, between(1, inf, X), prod(X, Y, Z). 
prod(X, Y, Z) :- nonvar(X), !, between(1, inf, Y), prod(X, Y, Z). 

Now it's time to define division. Do you know the Euclidean algorithm? I suppose so. However, you don't need to know it: all you need to know is that dividing x by y you are looking for two numbers q and r such that

  • x = y * q + r
  • 0 <= r < y

And in Prolog it is:

div(X, Y, 0, X) :- abs(X) < abs(Y). 
div(X, Y, Q, R) :- prod(Y, Q, T1), T1 =< X,  add(T1, R, X) , 
                   R >= 0, abs(R) < abs(Y). 

This does not work when X and Y do not have the same sign. However, this depends on the way I defined prod; you should be able to make it work.
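The two conditions are easy to check in Python for positive numbers; here is a sketch mirroring the search the clauses describe (repeated subtraction, positive x and y only, matching the caveat above):

```python
def div(x, y):
    """Find q and r with x == y*q + r and 0 <= r < y, for x, y > 0."""
    q, r = 0, x
    while r >= y:          # keep subtracting y until the remainder fits
        q, r = q + 1, r - y
    return q, r

q, r = div(17, 5)
assert 17 == 5 * q + r and 0 <= r < 5   # q == 3, r == 2
```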

Monday, April 10, 2006

Texniscope Intel

I have just compiled a Texniscope version for Intel.

It works for me, tell me if it works for you too.

Download

[BROKEN LINK!] If you really need this, email me, I will search it on my hard drive.

Saturday, April 8, 2006

Autoconf automake & libtool on MacOS X

I have just finished my first autotools project on MacOS. Until today I deployed software only in Python or Java. I had already worked on C and C++ projects that used libtool, automake and autoconf, but either I did not have to write the configure scripts myself or I developed them on GNU/Linux.

In fact, I'm afraid most of the problems I encountered were MacOS X's fault, plus my ignorance. That is to say: libtoolize does not exist, it is called glibtoolize. So if I do not rename it, autoreconf simply does not work.

Probably I just had to set the LIBTOOLIZE environment variable to glibtoolize. However, my solution works. The error I got was

Can't exec "libtoolize": No such file or directory at /opt/local/share/autoconf/Autom4te/FileUtils.pm line 288, <GEN3> line 3.

autoreconf: failed to run libtoolize: No such file or directory

Well... I have to fine-tune it now. I'm off.

Wednesday, April 5, 2006

Multitasking on MacIntel

I noticed that changing the subject also changes the link, so I repost this to redirect you to the new article.

I also have a more recent test that leads to rather different conclusions:

Investigated iMac Troubles: not a faulty scheduler but something related to memory

Some time ago I stated that multitasking on Mac OS X Intel was bugged. Under some conditions (which at the time I hadn't identified) the GUI hung (something I had never seen on the MacOS before) and the whole system slowed down terribly.

Computations

I anticipate the conclusion: the problem I found is real. MacIntels seem to have problems when large quantities of RAM are allocated. It is not a problem with the scheduler: the system simply slows down as a whole.
The first thing I did was to write a simple program that stressed the CPU, made a lot of I/O, and at the same time allocated and deallocated small quantities of memory in a quite inefficient way. However, the system was not slowed in any perceptible manner.

This is the post where I spoke about that program.

Here I add some benchmarking. First I have to describe the machines involved. Of course this is not a PPC vs. Intel bench: unfortunately my most powerful PPC machine is a notebook, and we can't expect it to compete with the iMac. What I want to show are the relative values.

Machines


Model         CPU               Clock    RAM     Bus      Hard Disk
PowerBook G4  G4 (Single Core)  1.5 GHz  512 MB  167 MHz  5400 rpm
iMac CoreDuo  Intel CoreDuo     2.0 GHz  1.5 GB  667 MHz  7200 rpm


big_matrix

This is the test I described here.
I compiled the test with no optimizations, which is probably a mistake.
The full test on the iMac took more than twenty minutes (500x500 matrix). The Mac was usable and had no slowdowns:
time ./big_matrix  
real    20m39.110s 
user    12m10.943s 
sys     7m46.112s 

Reducing the matrix size to 100x100, still with no optimization, the result is
time ./big_matrix 
real    0m9.683s 
user    0m5.805s 
sys     0m3.688s 

Compiling with the -fast option did not change things much, nor did -O3 or -Os (as I said, the code was intended to be quite inefficient; I'm not surprised the compilers weren't really able to optimize it). However, explicitly activating -mmmx -msse -msse2 -msse3 gave a little improvement (about 5%, which could even be a statistical variation).

As I said before the most important thing is however achieved: the mac remains perfectly usable.

For those who are interested in this sort of thing, the PowerBook took about an hour and a half. However, optimizations improved its speed by a full 10% (which is quite acceptable, indeed). Still, I'm sad it performed so badly; I should investigate why AltiVec did not kick in properly (if it did, I would not expect it to be more than four times slower than the Intel).

Keep in mind that my software wasn't designed to work on multiple threads (that could be an interesting addition, though). However, the system kept swapping it between the two cores, preventing many possible optimizations.

Wonderings...

Now only very large allocations remained to be tested. So I wrote this small (idiotic) program.

Basically it takes a filename as a command line argument, finds out the size of the file with a stat syscall, allocates enough space to hold it, and then fills the buffer. If the file is big enough, this (apart from being terribly inefficient) allocates a lot of RAM.
I ran it on a 985 MB file (that means the software allocated nearly 1 GB of real memory, since the buffer is not only allocated but filled too).

$ ls -lh ../../Desktop/Ubuntu_510.vpc7.sit  
-rw-r--r--   1 riko  staff        985M 12 Feb 03:13 ../../Desktop/Ubuntu_510.vpc7.sit 

The file is loaded correctly and this is the time bench.

$ time ./load_file ../../Desktop/Ubuntu_510.vpc7.sit  
real    3m31.010s 
user    0m0.001s 
sys     0m4.062s 

This value is quite variable: another time it took only 1m42s.
And... the Mac slowed down. I know such a program is idiotic, but it was one of the quickest ways to see how the iMac behaves when someone needs a lot of RAM (a memory leak, for example).
In fact, in some cases the Mac remains slow for a while afterwards, until the RAM is truly released and other processes are paged back in.

#include <stdlib.h> 
#include <stdio.h> 
#include <fcntl.h> 
#include <unistd.h> 
#include <sys/types.h> 
#include <sys/stat.h> 
#define BUFFER (2<<22) 

int main(int argc, char *argv[]){ 
    char *mem; 
    int fd; 
    size_t pos = 0, res=0; 
    off_t sz; 
    struct stat st; 
    stat(argv[1], &st); 
    sz = st.st_size; 
    
    mem = (char*)malloc(sz); 
    fd = open(argv[1], O_RDONLY, 0); 
    
    while( (res = read(fd, mem + pos, BUFFER) ) != 0){ 
        pos+=res; 
    } 
    
    close(fd); 
    free(mem); 
    return 0; 
} 

As you may notice, this makes no sanity check on the buffer returned by malloc. Don't use it on a 4 GB file: it will probably crash.
When I ran this very test on the PowerBook, I was prepared for terrible results. In fact, the PowerBook does not have 1 GB of free RAM. It does not even have 1 GB of RAM: it has only 512 MB. That means that allocating and filling nearly 1 GB relies heavily on paging (and makes a lot of disk accesses to swap pages in and out). Keeping this in mind, the results were quite good (and more stable: in fact, sometimes the iMac performs worse than the PowerBook, which has a third of the RAM). I would like someone with 1.5 or 2 GB of RAM to try this.

$ time ./load_file ../aks_old/nr.bkp  
real    3m31.526s 
user    0m0.002s 
sys     0m7.728s 

Moreover, the file used was slightly bigger. So it took about double the time (against the best iMac run) or nearly the same time (against the worst), but with a very big hardware handicap. Astonishing. This can also be read as: something slowed the iMac down considerably.
I didn't mention it before: although slightly slowed, the PowerBook was quite responsive and usable during the test, while the iMac was not.

I/O Only

I rewrote the software above to read the file into a smaller buffer instead of keeping it all in memory. This is the source code:

#include <stdlib.h> 
#include <stdio.h> 
#include <fcntl.h> 
#include <unistd.h> 
#include <sys/types.h> 
#include <sys/stat.h> 
#define BUFFER (2<<22) 
int main(int argc, char *argv[]){ 
        char *mem; 
        int fd; 
        size_t pos = 0, res=0; 
        off_t sz; 
        struct stat st; 
        stat(argv[1], &st); 
        sz = st.st_size; 
        mem = (char*)malloc(BUFFER); 
        fd = open(argv[1], O_RDONLY, 0); 
        while( (res = read(fd, mem, BUFFER) ) != 0); 
        close(fd); 
        free(mem); 
        return 0; 
} 

The speedup is amazing.

$ time ./read_file ../../Desktop/Ubuntu_510.vpc7.sit  
real    0m28.007s 
user    0m0.001s 
sys     0m1.472s 

Other times I got about 17s; I should investigate this variance. However, the system did not slow down at all and remained perfectly usable. That makes me think the problem does not concern I/O, but memory.
The powerbook performed like this:

$ time ./read_file ../aks_old/nr.bkp  
real    0m47.194s 
user    0m0.002s 
sys     0m3.833s 

Memory only...

The last step was writing a stupid program that only allocates large chunks of memory. I made it allocate (and release) progressively larger chunks. First of all, this demonstrates that the issue does not only concern memory leaks: applications that allocate big quantities of RAM in large chunks are slowed too. You can also see that the Mac slows down (and the allocation time increases) as the blocks get bigger.

#include <stdlib.h> 
#include <stdio.h> 

int main (int argc, const char * argv[]) { 
    unsigned long size = 2; 
    unsigned long i; 
    int *mem; 
    
    /* keep doubling until malloc fails or the byte count overflows to 0 */
    while(size * sizeof(int) > 0){ 
        mem = (int*) malloc(size * sizeof(int)); 
        if (mem==NULL) break; 
        printf("Allocated %lu bytes\n", size * sizeof(int)); 
        for(i=0; i < size; ++i) {
            mem[i]=i; 
        } 
        free(mem); 
        mem=NULL; 
        printf("Deallocated %lu bytes\n", size * sizeof(int)); 
        size*=2; 
    } 
    return 0; 
} 
I also wrote a version that only cycles through the variables without allocating. It took less than half a second to run, so it is not the cycling that affects performance. The first time I ran the allocator, with not-so-large chunks, the computer remained quite responsive. Then I ran it with full chunks, and it was hell: during the 1 GB allocation the computer was plainly unusable, not to speak of the 2 GB one. However, the machine was still much more usable than in the I/O + memory test.
time ./memory_allocator  
Allocated 2 bytes 
Deallocated 2 bytes 
[SNIP] 
Allocated 536870912 bytes 
Deallocated 536870912 bytes 
real    0m43.940s 
user    0m9.196s 
sys     0m9.137s 
time ./memory_allocator  
Allocated 8 bytes 
Deallocated 8 bytes 
[SNIP] 
Allocated 1073741824 bytes 
Deallocated 1073741824 bytes 
Allocated 2147483648 bytes 
Deallocated 2147483648 bytes 
real    0m36.538s 
user    0m9.181s 
sys     0m8.851s 

Small allocations

At this point I wrote a program that makes smaller allocations. You can see that what matters is the total quantity of RAM allocated: the very same task, once the process has allocated more than 1 GB, is significantly slower.
[Starting software] 
utime: 566              stime: 4198 
[Allocated first chunk] 
utime: 20               stime: 30 
[Populated first chunk] 
utime: 117010           stime: 558634 
[Allocated second chunk] 
utime: 27               stime: 50 
[Populated second chunk] 
utime: 132365           stime: 12 
[Allocated third chunk] 
utime: 38               stime: 487 
[Populated third chunk] 
utime: 229719           stime: 10 
[Allocated fourth chunk] 
utime: 22               stime: 41 
[Populated fourth chunk] 
utime: 228182           stime: 880172 
* Freed first chunk. 
* Freed second chunk. 
* Freed third chunk. 
* Freed fourth chunk. 
 
utime: 79               stime: 2 
And the software was:
#include <stdio.h> 
#include <stdlib.h> 
#include <unistd.h> 
#include <time.h> 
#include <sys/time.h> 
#include <sys/resource.h> 
void puts_rusage(){ 
        struct rusage ru; 
        static struct timeval slast = {0, 0}; 
        struct timeval scurrent; 
        static struct timeval ulast = {0, 0}; 
        struct timeval ucurrent; 
        getrusage(RUSAGE_SELF, &ru); 
        ucurrent = ru.ru_utime; 
        scurrent = ru.ru_stime; 
        /* print the deltas in microseconds, matching the output above */ 
        printf("utime: %ld\t\tstime: %ld\n",  
                        (ucurrent.tv_sec - ulast.tv_sec) * 1000000L 
                                + (ucurrent.tv_usec - ulast.tv_usec),  
                        (scurrent.tv_sec - slast.tv_sec) * 1000000L 
                                + (scurrent.tv_usec - slast.tv_usec) 
                        ); 
        ulast = ucurrent; 
        slast = scurrent; 
} 
int main (int argc, const char * argv[]) { 
        unsigned long size = 2<<26; 
        unsigned long i; 
        int *mem1; 
        int *mem2; 
        int *mem3; 
        int *mem4; 
        puts("[Starting software]"); 
        puts_rusage(); 
        mem1 = (int*) malloc(size*sizeof(int)); 
        puts("\n[Allocated first chunk]"); 
        puts_rusage(); 
        for(i=0; i < size; ++i) { 
                mem1[i]=i; 
        } 
        puts("\n[Populated first chunk]"); 
        puts_rusage(); 
        mem2 = (int*) malloc(size*sizeof(int)); 
        puts("\n[Allocated second chunk]"); 
        puts_rusage(); 
        for(i=0; i < size; ++i) { 
                mem2[i]=i; 
        } 
        puts("\n[Populated second chunk]"); 
        puts_rusage(); 
        mem3 = (int*) malloc(size*sizeof(int)); 
        puts("\n[Allocated third chunk]"); 
        puts_rusage(); 
        for(i=0; i < size; ++i) { 
                mem3[i]=i; 
        } 
        puts("\n[Populated third chunk]"); 
        puts_rusage(); 
        mem4 = (int*) malloc(size*sizeof(int)); 
        puts("\n[Allocated fourth chunk]"); 
        puts_rusage(); 
        for(i=0; i < size; ++i) { 
                mem4[i]=i; 
        } 
        puts("\n[Populated fourth chunk]"); 
        puts_rusage(); 
        free(mem1); 
        puts("\n\n* Freed first chunk."); 
        free(mem2); 
        puts("* Freed second chunk."); 
        free(mem3); 
        puts("* Freed third chunk."); 
        free(mem4); 
        puts("* Freed fourth chunk."); 
        puts_rusage(); 
        return 0; 
} 
The last test should be spawning several processes that each allocate a fairly large chunk of memory, and seeing how much they slow the system down (if they do -- I suppose that if you don't keep them doing something, they will simply be paged out).

Conclusion

I definitely think there is something out of order with the memory management. The scheduler seems fine: the same tests left the PowerBook usable while the iMac wasn't (although the iMac took significantly less time in almost every task).

From the Doxygen documentation....

/*! \class WindowsNT
* \brief Windows Nice Try.
* \author Bill Gates
* \author Several species of small furry animals gathered together
* in a cave and grooving with a pict.
* \version 4.0
* \date 1996-1998
* \bug It crashes a lot and requires huge amounts of memory.
* \bug The class introduces the more bugs, the longer it is used.
* \warning This class may explode in your face.
* \warning If you inherit anything from this class, you're doomed.
*/
class WindowsNT {};

Saturday, April 1, 2006

Sometimes humans still...

Some kinds of optimization are "obvious" to a human but not to a machine. I find the problems in the field of correctness very interesting (that is, trying to teach a computer to prove whether a source file is correct, whether it contains certain errors, and in general to understand what it does). The twin problem is optimization (here the computer has to understand what your code does and rewrite it into an equivalent but faster version).
For example, take this function:
#include <math.h>   /* sqrt */

#define MAX 10

int func(){
    int i;
    int x = 2<<10;
    for (i = 0; i < MAX; ++i){
        if (i >= 0 && sqrt(i) >= 0)
            x /= i;    /* note: divides by zero on the first iteration */
    }
}
At a glance you can see that the if check is needlessly heavy. You *know* that i is always greater than or equal to 0; a fairly straightforward formal proof can be given as well.
In practice the if guard is useless. The >= 0 check is unnecessary, and the sqrt even more so. Neither has side effects (for the first this is evident; for the second you need to know how sqrt is implemented: a priori there is no way to know that it doesn't set some global to i, print something, or have some other side effect, so it cannot simply be removed without changing the semantics of the program).
As I was saying: without optimization, this compiles with the whole if guard kept:
L3:
cmpl    $0, -16(%ebp)
js      L4
cvtsi2sd        -16(%ebp), %xmm0
sqrtsd  %xmm0, %xmm0
movapd  %xmm0, %xmm1
leal    LC0-"L00000000001$pb"(%ebx), %eax
movsd   (%eax), %xmm0
ucomisd %xmm0, %xmm1
jae     L7
jmp     L4

Here is the first comparison, i >= 0:
cmpl    $0, -16(%ebp)
js      L4


and there is also the comparison with the square root (it is using SSE registers).
Well... the whole L3 section can in fact simply be dropped. In this case the optimization is plainly trivial. And indeed the complete optimized code for the function above is:
.text
.globl _func
_func:
pushl   %ebp
movl    %esp, %ebp
xorl    %eax, %eax
L2:
addl    $1, %eax
cmpl    $10, %eax
jne     L2
popl    %ebp
ret
.subsections_via_symbols
As long as the guard check alone. And this is almost exactly what a programmer would have written:
a small prologue zeroes the first register; then, in the loop, it increments the first register (addl $1, %eax), compares it against the guard value (in the source MAX was a define set to 10; cmpl $10, %eax), and jumps back to L2 if they are not yet equal (jne L2). Then it exits.
It even optimized away x /= i. Since it sees that nothing is ever done with x, it says: why compute it at all? It only runs the loop. Had it been smarter, it wouldn't even have done that. Silly me for picking a badly chosen example.
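In C terms, what the optimizer was left with is roughly this (`func_optimized` is my own naming, a back-translation sketch of the assembly above, not actual compiler output; the return value is my choice, since the original func() returned nothing defined):

```c
/* Roughly what the optimizer reduced func() to: x is dead, so sqrt and
 * the division vanish and only the empty counting loop survives. */
int func_optimized(void) {
    int i;
    for (i = 0; i < 10; ++i)
        ;           /* nothing left in the body */
    return i;       /* 10; made explicit here for definiteness */
}
```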
So let's modify the function to return x, so that x actually has to be computed.
And here we see that at this point it is no longer so clever:
.text
.globl _func
_func:
pushl   %ebp
movl    %esp, %ebp
pushl   %ebx
xorl    %ecx, %ecx
movl    $2048, %eax
pxor    %xmm1, %xmm1
jmp     L2
L3:
testl   %ecx, %ecx
js      L11
L2:
cvtsi2sd        %ecx, %xmm0
sqrtsd  %xmm0, %xmm0
ucomisd %xmm1, %xmm0
jb      L11
cltd
idivl   %ecx
L11:
addl    $1, %ecx
cmpl    $9, %ecx
jle     L3
popl    %ebx
popl    %ebp
ret
.subsections_via_symbols

The square root reappears (sqrtsd %xmm0, %xmm0). A programmer writing the assembly by hand would have produced significantly faster code [in the case where this loop is very hot].
Granted, my example is thoroughly silly, and it is true that the compiler usually knows which instructions are fastest on a given architecture; but it is also true that formal-proof techniques are still relatively little used (partly because such proofs blow compilation time way up, are *very* hard to write, and risk piling bug upon bug from the tiniest mistakes). This applies to nearly every toolchain, not just gcc.
In other words: on the "clever" things the human is still cleverer, even though on brute force he doesn't stand a chance.