Computations
I anticipate the conclusion here: the problem I found is real. MacIntel machines seem to have problems when large quantities of RAM are allocated. It is not a problem with the scheduler; the system simply slows down as a whole. The first thing I did was to write a simple program that stressed the CPU, did a lot of I/O and at the same time allocated and deallocated small quantities of memory in a quite inefficient way. However, the system was not slowed in any perceptible manner.
This is the post where I described that program.
Here I add some benchmarking, so first I have to describe the machines involved. Of course this is not a PPC vs. Intel benchmark: unfortunately my most powerful PPC machine is a notebook, and we can't expect it to compete with the iMac. What I want to show are the relative values between them.
Machines
| Model | CPU | Clock | RAM | Bus | Hard Disk |
|-------|-----|-------|-----|-----|-----------|
| PowerBook G4 | G4 (Single Core) | 1.5 GHz | 512 MB | 167 MHz | 5400 rpm |
| iMac CoreDuo | Intel Core Duo | 2.0 GHz | 1.5 GB | 667 MHz | 7200 rpm |
big_matrix
This is the test I described here. I compiled it with no optimizations, which is probably a mistake.
The full test on the iMac took more than twenty minutes (matrix 500x500). The Mac was usable and had no slowdowns:
time ./big_matrix

real    20m39.110s
user    12m10.943s
sys     7m46.112s
Reducing the matrix size to 100x100, still with no optimization, the result is:
time ./big_matrix

real    0m9.683s
user    0m5.805s
sys     0m3.688s
Compiling with the -fast option did not change things much, nor did -O3 or -Os (as I said, the code was intended to be quite inefficient, so I'm not surprised the compilers weren't really able to optimize it). However, explicitly activating -mmmx -msse -msse2 -msse3 gave a little improvement (about 5%, which could even be statistical variation).
As I said before, the most important thing is achieved anyway: the Mac remains perfectly usable.
For those who are interested in this sort of thing, the PowerBook took about an hour and a half. Optimizations, however, improved its speed by a full 10% (which is quite acceptable, indeed). Still, I'm sad it performed so badly; I should investigate why AltiVec did not work properly (if it did, I would not expect the PowerBook to be more than four times slower than the Intel).
Keep in mind that my software wasn't designed to work on multiple threads (this could be an interesting addition, though). The system kept swapping it between the two cores, which prevented some possible optimizations.
Wonderings...
Now only very large allocations remained to test. So I wrote this small (idiotic) program. Basically it takes a filename as a command-line argument, finds the size of the file with a stat syscall, allocates enough space to hold it, and then fills the buffer. If the file is big enough, this (apart from being terribly inefficient) allocates a lot of RAM.
I called it on a 985 MB file (which means the software allocated 985 MB of real memory, since the buffer is not only allocated but filled too).
$ ls -lh ../../Desktop/Ubuntu_510.vpc7.sit
-rw-r--r--  1 riko  staff  985M 12 Feb 03:13 ../../Desktop/Ubuntu_510.vpc7.sit
The file is loaded correctly and this is the time bench.
$ time ./load_file ../../Desktop/Ubuntu_510.vpc7.sit

real    3m31.010s
user    0m0.001s
sys     0m4.062s
This value is really variable. Another time it took only 1m42s.
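To quantify that variance, one could simply repeat the run a few times and collect the timings. This is my own sketch, assuming bash; `CMD` defaults to a no-op here and should be replaced with the actual `./load_file` invocation on the test file:

```shell
# Hypothetical repetition harness: run the command three times and log
# the wall-clock times reported by the bash `time` keyword.
CMD=${CMD:-"sleep 0"}   # replace with: ./load_file ../../Desktop/Ubuntu_510.vpc7.sit
rm -f load_times.log
for i in 1 2 3; do
  { time $CMD ; } 2>> load_times.log
done
grep real load_times.log   # compare the three runs
```

Comparing the spread of the `real` lines makes it easier to tell a caching effect from random noise.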
And... the Mac slowed down. I know that such a program is idiotic, but it was one of the quickest ways to understand how the iMac behaves when something needs a lot of RAM (a memory leak, for example).
In fact, in some cases the Mac remains slowed down for a while, until the RAM is truly released and other processes are paged back in.
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

#define BUFFER (2<<22)

int main(int argc, char *argv[])
{
    char *mem;
    int fd;
    size_t pos = 0, res = 0;
    off_t sz;
    struct stat st;

    stat(argv[1], &st);
    sz = st.st_size;
    mem = (char *)malloc(sz);
    fd = open(argv[1], O_RDONLY, 0);
    while ((res = read(fd, mem + pos, BUFFER)) != 0) {
        pos += res;
    }
    close(fd);
    free(mem);
    return 0;
}
As you may notice, this makes no check on sanity of the buffer allocated by malloc. Don't use it on a 4 GB file, it will probably crash.
When I ran this very test on the PowerBook I was prepared for terrible results. The PowerBook does not have 1 GB of free RAM; it does not even have 1 GB of RAM, it has only 512 MB. That means that allocating and filling 1 GB relies heavily on paging (and causes a lot of disk accesses to swap memory pages in and out). Keeping this in mind, the results were quite good (and more stable; in fact, sometimes the iMac performs worse than the PowerBook, which has a third of the RAM). I would like someone with 1.5 or 2 GB of RAM to try this.
$ time ./load_file ../aks_old/nr.bkp

real    3m31.526s
user    0m0.002s
sys     0m7.728s
Moreover, the file used was slightly bigger. So the PowerBook took about double the time (compared with the best iMac run) or nearly the same time (compared with the worst), despite a very big hardware handicap. Astonishing. This can also be interpreted as saying that something slowed down the iMac considerably.
I didn't mention it before: although slightly slowed, the PowerBook remained quite responsive and usable during the test, while the iMac did not.
I/O Only
I rewrote the software above to read the file into a small fixed-size buffer instead of keeping it all in memory. This is the source code:
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

#define BUFFER (2<<22)

int main(int argc, char *argv[])
{
    char *mem;
    int fd;
    size_t pos = 0, res = 0;
    off_t sz;
    struct stat st;

    stat(argv[1], &st);
    sz = st.st_size;
    mem = (char *)malloc(BUFFER);
    fd = open(argv[1], O_RDONLY, 0);
    while ((res = read(fd, mem, BUFFER)) != 0)
        ;
    close(fd);
    free(mem);
    return 0;
}
The speedup is amazing.
$ time ./read_file ../../Desktop/Ubuntu_510.vpc7.sit

real    0m28.007s
user    0m0.001s
sys     0m1.472s
Some other times I got about 17s; I should investigate this variance. However, the system did not slow down at all and remained perfectly usable. That makes me think the problem does not concern I/O, but memory.
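My guess is that the variance comes from the buffer cache: a second read of the same file can be served from RAM instead of the disk. The effect can be demonstrated with a throwaway file (my own sketch; the 64 MB size is arbitrary, and the dd fallback just covers the BSD vs. GNU spelling of the block size):

```shell
# Create a 64 MB scratch file, then read it twice: the second read
# should be noticeably faster because the pages are already cached.
dd if=/dev/zero of=/tmp/cache_test bs=1m count=64 2>/dev/null \
  || dd if=/dev/zero of=/tmp/cache_test bs=1M count=64 2>/dev/null
time cat /tmp/cache_test > /dev/null   # first read: may hit the disk
time cat /tmp/cache_test > /dev/null   # second read: served from RAM
rm /tmp/cache_test
```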
The powerbook performed like this:
$ time ./read_file ../aks_old/nr.bkp

real    0m47.194s
user    0m0.002s
sys     0m3.833s
Memory only...
The last step was writing a stupid program that only allocates large chunks of memory. I made it allocate (and release) progressively larger chunks. First of all, this demonstrates the issue does not only concern memory leaks: applications that allocate big quantities of RAM in large chunks are slowed down too. You can also see that the Mac slows down (and the allocation time increases) as the blocks get bigger.
#include <stdio.h>
#include <stdlib.h>

int main(int argc, const char *argv[])
{
    unsigned long size = 2;
    unsigned long i;
    int *mem;

    while (size * sizeof(int) > 0) {
        mem = (int *)malloc(size * sizeof(int));
        if (mem == NULL)
            break;
        printf("Allocated %lu bytes\n", size * sizeof(int));
        for (i = 0; i < size; ++i) {
            mem[i] = i;
        }
        free(mem);
        mem = NULL;
        printf("Deallocated %lu bytes\n", size * sizeof(int));
        size *= 2;
    }
    return 0;
}

I also wrote a version that only cycles through the variables without allocating; it took less than half a second to run, so it's not the looping that affects performance. The first time I ran it with not-so-large chunks, the computer remained quite responsive. Then I ran it with full-size chunks, and it was hell: during the 1 GB allocation the computer was plainly unusable, not to speak of the 2 GB one. However, the machine was still much more usable than in the I/O + memory test.
time ./memory_allocator
Allocated 2 bytes
Deallocated 2 bytes
[SNIP]
Allocated 536870912 bytes
Deallocated 536870912 bytes

real    0m43.940s
user    0m9.196s
sys     0m9.137s

time ./memory_allocator
Allocated 8 bytes
Deallocated 8 bytes
[SNIP]
Allocated 1073741824 bytes
Deallocated 1073741824 bytes
Allocated 2147483648 bytes
Deallocated 2147483648 bytes

real    0m36.538s
user    0m9.181s
sys     0m8.851s
Small allocations
At this point I wrote a program that did smaller allocations. You can see that what matters is the total quantity of RAM allocated: the very same task, once the process has allocated more than 1 GB, is significantly slower.

[Starting software]
utime: 566		stime: 4198

[Allocated first chunk]
utime: 20		stime: 30

[Populated first chunk]
utime: 117010		stime: 558634

[Allocated second chunk]
utime: 27		stime: 50

[Populated second chunk]
utime: 132365		stime: 12

[Allocated third chunk]
utime: 38		stime: 487

[Populated third chunk]
utime: 229719		stime: 10

[Allocated fourth chunk]
utime: 22		stime: 41

[Populated fourth chunk]
utime: 228182		stime: 880172

* Freed first chunk.
* Freed second chunk.
* Freed third chunk.
* Freed fourth chunk.
utime: 79		stime: 2

And the software was:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/resource.h>

void puts_rusage()
{
    struct rusage ru;
    static struct timeval slast = {0, 0};
    struct timeval scurrent;
    static struct timeval ulast = {0, 0};
    struct timeval ucurrent;

    getrusage(RUSAGE_SELF, &ru);
    ucurrent = ru.ru_utime;
    scurrent = ru.ru_stime;
    printf("utime: %ld\t\tstime: %ld\n",
           ucurrent.tv_sec - ulast.tv_sec,
           scurrent.tv_sec - slast.tv_sec);
    ulast = ucurrent;
    slast = scurrent;
}

int main(int argc, const char *argv[])
{
    unsigned long size = 2<<26;
    unsigned long i;
    int *mem1;
    int *mem2;
    int *mem3;
    int *mem4;

    puts("[Starting software]");
    puts_rusage();

    mem1 = (int *)malloc(size * sizeof(int));
    puts("\n[Allocated first chunk]");
    puts_rusage();
    for (i = 0; i < size; ++i) { mem1[i] = i; }
    puts("\n[Populated first chunk]");
    puts_rusage();

    mem2 = (int *)malloc(size * sizeof(int));
    puts("\n[Allocated second chunk]");
    puts_rusage();
    for (i = 0; i < size; ++i) { mem2[i] = i; }
    puts("\n[Populated second chunk]");
    puts_rusage();

    mem3 = (int *)malloc(size * sizeof(int));
    puts("\n[Allocated third chunk]");
    puts_rusage();
    for (i = 0; i < size; ++i) { mem3[i] = i; }
    puts("\n[Populated third chunk]");
    puts_rusage();

    mem4 = (int *)malloc(size * sizeof(int));
    puts("\n[Allocated fourth chunk]");
    puts_rusage();
    for (i = 0; i < size; ++i) { mem4[i] = i; }
    puts("\n[Populated fourth chunk]");
    puts_rusage();

    free(mem1);
    puts("\n\n* Freed first chunk.");
    free(mem2);
    puts("* Freed second chunk.");
    free(mem3);
    puts("* Freed third chunk.");
    free(mem4);
    puts("* Freed fourth chunk.");
    puts_rusage();

    return 0;
}

The last test should be to launch several processes that each allocate a quite large chunk of memory and see how much they slow the system (if they do; I suppose that if you don't keep them doing something, they will simply be paged out).