Wednesday, April 5, 2006

Investigated iMac Troubles: not a faulty scheduler but something related to memory

Some time ago I stated that multitasking on Mac OS X for Intel was buggy. Under some conditions (which at the time I hadn't identified) the GUI hung (something I had never seen on the Mac OS before) and the whole system slowed down terribly.

Computations

Let me anticipate the result: the problem I found is real. MacIntel machines seem to have trouble when large quantities of RAM are allocated. It is not a problem with the scheduler; the system simply slows down as a whole.
The first thing I did was to write a simple program that stressed the CPU, did a lot of I/O and at the same time allocated and deallocated small quantities of memory in a quite inefficient way. However, the system was not slowed in any perceptible manner.

This is the post where I wrote about that program.

Here I add some benchmarks. First I have to describe the machines involved. Of course this is not a PPC vs. Intel benchmark: the most powerful PPC machine I have is a notebook, and we can't expect it to compete with the iMac. What I want to show are the relative values between the two.

Machines


Model         | CPU              | Clock   | RAM    | Bus     | Hard Disk
PowerBook G4  | G4 (Single Core) | 1.5 GHz | 512 MB | 167 MHz | 5400 rpm
iMac CoreDuo  | Intel Core Duo   | 2.0 GHz | 1.5 GB | 667 MHz | 7200 rpm


big_matrix

This is the test I described here.
I compiled the test with no optimizations. This is probably a mistake.
The full test on the iMac took more than twenty minutes (matrix 500x500). The Mac was usable and had no slowdowns:
time ./big_matrix  
real    20m39.110s 
user    12m10.943s 
sys     7m46.112s 

Reducing the matrix size to 100x100 (again with no optimizations), the result is:
time ./big_matrix 
real    0m9.683s 
user    0m5.805s 
sys     0m3.688s 

Compiling with the -fast option did not change things much, nor did -O3 or -Os (as I said, the code was intended to be quite inefficient, so I'm not surprised the compiler wasn't really able to optimize it). However, explicitly enabling -mmmx -msse -msse2 -msse3 gave a small improvement (about 5%, which could even be statistical variation).

As I said before, the most important result stands: the Mac remains perfectly usable.

For those interested in this sort of thing, the PowerBook took about an hour and a half. Optimizations improved its speed by a full 10% (which is quite acceptable, indeed), but I'm sad it performed so badly. I should investigate why AltiVec did not kick in properly (if it had, I would not expect the PowerBook to be more than four times slower than the Intel).

Keep in mind that my software wasn't designed to work on multiple threads (this could be an interesting addition, though). Moreover, the system kept moving the process between the two cores, preventing some possible optimizations.

Wonderings...

Now only very large allocations remained to be tested, so I wrote this small (idiotic) program.

Basically it takes a filename as a command line argument, finds out the size of the file with a stat syscall, allocates enough space to hold it and then fills the buffer. If the file is big enough, this (apart from being terribly inefficient) allocates a lot of RAM.
I called it on a 985 MB file (which means the program touched nearly 1 GB of real memory, since the buffer is not only allocated but filled too).

$ ls -lh ../../Desktop/Ubuntu_510.vpc7.sit  
-rw-r--r--   1 riko  staff        985M 12 Feb 03:13 ../../Desktop/Ubuntu_510.vpc7.sit 

The file is loaded correctly, and this is the timing:

$ time ./load_file ../../Desktop/Ubuntu_510.vpc7.sit  
real    3m31.010s 
user    0m0.001s 
sys     0m4.062s 

This value is quite variable; another run took only 1m42s.
And... the Mac slowed down. I know such a program is idiotic, but it was one of the quickest ways to see how the iMac behaves when something needs a lot of RAM (a memory leak, for example).
In fact, in some cases the Mac remains slow for a while afterwards, until the RAM is truly released and other processes are paged back in.

#include <stdio.h> 
#include <stdlib.h> 
#include <fcntl.h> 
#include <unistd.h> 
#include <sys/types.h> 
#include <sys/stat.h> 
#define BUFFER 2<<22 

int main(int argc, char *argv[]){ 
    char *mem; 
    int fd; 
    size_t pos = 0, res=0; 
    off_t sz; 
    struct stat st; 
    stat(argv[1], &st); 
    sz = st.st_size; 
    
    mem = (char*)malloc(sz); 
    fd = open(argv[1], O_RDONLY, 0); 
    
    while( (res = read(fd, mem + pos, BUFFER) ) != 0){ 
        pos+=res; 
    } 
    
    close(fd); 
    free(mem); 
    return 0; 
} 

As you may notice, this makes no sanity check on the buffer returned by malloc. Don't use it on a 4 GB file; it will probably crash.
When I ran this very test on the PowerBook I was prepared for terrible results. The PowerBook does not have 1 GB of free RAM; it does not even have 1 GB of RAM, only 512 MB. That means allocating and filling nearly 1 GB relies heavily on paging (with a lot of disk accesses to swap pages in and out). Keeping this in mind, the results were quite good, and more stable too: in fact sometimes the iMac performs worse than the PowerBook, which has a third of the RAM. I would like someone with 1.5 or 2 GB of RAM to try this.

$ time ./load_file ../aks_old/nr.bkp  
real    3m31.526s 
user    0m0.002s 
sys     0m7.728s 

Moreover, the file used here was slightly bigger. So the PowerBook took about double the time (compared with the best iMac run) or almost the same time (compared with the worst), despite a very big hardware handicap. Astonishing. This can also be read as: something slowed the iMac down considerably.
I didn't mention it before: although slightly slowed, the PowerBook remained quite responsive and usable during the test, while the iMac did not.

I/O Only

I rewrote the program above to read the file into a small buffer instead of keeping it all in memory. This is the source code:

#include <stdio.h> 
#include <stdlib.h> 
#include <fcntl.h> 
#include <unistd.h> 
#include <sys/types.h> 
#include <sys/stat.h> 
#define BUFFER 2<<22 
int main(int argc, char *argv[]){ 
    char *mem; 
    int fd; 
    size_t pos = 0, res = 0; 
    off_t sz; 
    struct stat st; 
    stat(argv[1], &st); 
    sz = st.st_size; 
    mem = (char*)malloc(BUFFER); 
    fd = open(argv[1], O_RDONLY, 0); 
    while( (res = read(fd, mem, BUFFER) ) != 0) 
        ; 
    close(fd); 
    free(mem); 
    return 0; 
} 

The speedup is amazing.

$ time ./read_file ../../Desktop/Ubuntu_510.vpc7.sit  
real    0m28.007s 
user    0m0.001s 
sys     0m1.472s 

Other runs took about 17s; I should investigate this variance. However, the system did not slow down at all and remained perfectly usable. That makes me think the problem does not concern I/O, but memory.
The powerbook performed like this:

$ time ./read_file ../aks_old/nr.bkp  
real    0m47.194s 
user    0m0.002s 
sys     0m3.833s 

Memory only...

The last step was to write a stupid program that only allocates large chunks of memory. I made it allocate (and release) progressively larger chunks. First of all, this demonstrates that the issue does not only concern memory leaks: applications that allocate big quantities of RAM in large chunks are slowed too. You can also see that the Mac slows down (and the allocation time increases) as the blocks get bigger.

#include <stdio.h> 
#include <stdlib.h> 
int main (int argc, const char * argv[]) { 
    unsigned long size = 2; 
    unsigned long i; 
    int *mem; 
    
    while(size * sizeof(int) > 0){ 
        mem = (int*) malloc(size * sizeof(int)); 
        if (mem==NULL) break; 
        printf("Allocated %lu bytes\n", size * sizeof(int)); 
        for(i=0; i < size; ++i) {
            mem[i]=i; 
        } 
        free(mem); 
        mem=NULL; 
        printf("Deallocated %lu bytes\n", size * sizeof(int)); 
        size*=2; 
    } 
    return 0; 
} 
I also wrote a version that only cycles through variables without allocating. It took less than half a second to run, so it is not the looping that affects performance. The first time I ran the allocator with not-so-large chunks, and the computer remained quite responsive. Then I ran it with the full-size chunks, and it was hell: during the 1 GB allocation the computer was plainly unusable, not to mention the 2 GB one. Still, the machine was much more usable than in the I/O + memory test.
time ./memory_allocator  
Allocated 2 bytes 
Deallocated 2 bytes 
[SNIP] 
Allocated 536870912 bytes 
Deallocated 536870912 bytes 
real    0m43.940s 
user    0m9.196s 
sys     0m9.137s 
time ./memory_allocator  
Allocated 8 bytes 
Deallocated 8 bytes 
[SNIP] 
Allocated 1073741824 bytes 
Deallocated 1073741824 bytes 
Allocated 2147483648 bytes 
Deallocated 2147483648 bytes 
real    0m36.538s 
user    0m9.181s 
sys     0m8.851s 

Small allocations

At this point I wrote a program that did smaller allocations. You can see that what matters is the total quantity of RAM allocated: the very same task, once the process has allocated more than 1 GB, is significantly slower.
[Starting software] 
utime: 566              stime: 4198 
[Allocated first chunk] 
utime: 20               stime: 30 
[Populated first chunk] 
utime: 117010           stime: 558634 
[Allocated second chunk] 
utime: 27               stime: 50 
[Populated second chunk] 
utime: 132365           stime: 12 
[Allocated third chunk] 
utime: 38               stime: 487 
[Populated third chunk] 
utime: 229719           stime: 10 
[Allocated fourth chunk] 
utime: 22               stime: 41 
[Populated fourth chunk] 
utime: 228182           stime: 880172 
* Freed first chunk. 
* Freed second chunk. 
* Freed third chunk. 
* Freed fourth chunk. 
 
utime: 79               stime: 2 
and the software was:
#include <stdio.h> 
#include <stdlib.h> 
#include <sys/time.h> 
#include <sys/resource.h> 
#include <sys/types.h> 
#include <unistd.h> 
void puts_rusage(){ 
        struct rusage ru; 
        static struct timeval slast = {0, 0}; 
        struct timeval scurrent; 
        static struct timeval ulast = {0, 0}; 
        struct timeval ucurrent; 
        getrusage(RUSAGE_SELF, &ru); 
        ucurrent = ru.ru_utime; 
        scurrent = ru.ru_stime; 
        printf("utime: %ld\t\tstime: %ld\n",  
                        (ucurrent.tv_sec - ulast.tv_sec) * 1000000L 
                            + (ucurrent.tv_usec - ulast.tv_usec),  
                        (scurrent.tv_sec - slast.tv_sec) * 1000000L 
                            + (scurrent.tv_usec - slast.tv_usec) 
                        ); 
        ulast = ucurrent; 
        slast = scurrent; 
} 
int main (int argc, const char * argv[]) { 
    unsigned long size = 2<<26; 
    unsigned long i; 
    int *mem1; 
    int *mem2; 
    int *mem3; 
    int *mem4; 
    puts("[Starting software]"); 
    puts_rusage(); 
    mem1 = (int*) malloc(size*sizeof(int)); 
    puts("\n[Allocated first chunk]"); 
    puts_rusage(); 
    for(i=0; i < size; ++i){ 
        mem1[i]=i; 
    } 
    puts("\n[Populated first chunk]"); 
    puts_rusage(); 
    mem2 = (int*) malloc(size*sizeof(int)); 
    puts("\n[Allocated second chunk]"); 
    puts_rusage(); 
    for(i=0; i < size; ++i){ 
        mem2[i]=i; 
    } 
    puts("\n[Populated second chunk]"); 
    puts_rusage(); 
    mem3 = (int*) malloc(size*sizeof(int)); 
    puts("\n[Allocated third chunk]"); 
    puts_rusage(); 
    for(i=0; i < size; ++i){ 
        mem3[i]=i; 
    } 
    puts("\n[Populated third chunk]"); 
    puts_rusage(); 
    mem4 = (int*) malloc(size*sizeof(int)); 
    puts("\n[Allocated fourth chunk]"); 
    puts_rusage(); 
    for(i=0; i < size; ++i){ 
        mem4[i]=i; 
    } 
    puts("\n[Populated fourth chunk]"); 
    puts_rusage(); 
    free(mem1); 
    puts("\n\n* Freed first chunk."); 
    free(mem2); 
    puts("* Freed second chunk."); 
    free(mem3); 
    puts("* Freed third chunk."); 
    free(mem4); 
    puts("* Freed fourth chunk."); 
    puts_rusage(); 
    return 0; 
} 

The last test would be to spawn several processes that each allocate a fairly large chunk of memory and see how much they slow the system (if they do at all; I suppose that if you don't keep them doing something, they will simply be paged out).

Conclusion

I definitely think something is out of order in the memory management, while the scheduler seems fine. The same tests left the PowerBook usable while the iMac wasn't (even though the iMac took significantly less time in almost every task).
