Suppose we want an

unsigned int pow(unsigned int n, unsigned int exp)

function, which raises a given number n to the power exp.
A C implementation is rather trivial:
unsigned int traditional_pow(unsigned int n, unsigned int exp) {
    unsigned int value = 1;
    for(unsigned int i = 0; i < exp; ++i) {
        value *= n;
    }
    return value;
}
Basically we could write the very same thing in C++. We could also perform some minor optimizations to make the code look more efficient. However, compilers are quite good at optimizing code; indeed, for this kind of low-level transformation they are usually better than humans.
Nonetheless, that function contains jumps. Notice that even when the exp parameter is known at compile time (a pretty common use case), the compiler has no way to eliminate those jumps short of a fairly expensive and thorough inter-procedural analysis, which gcc does not perform even at -O3.
The only result is that traditional_pow gets inlined into main. This is rather unsatisfactory: branching is among the most costly operations on modern CPUs.
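To make the point concrete, here is a minimal, hypothetical call site of the kind we have in mind (the variable and the exponent are arbitrary; traditional_pow is assumed to be defined as above in the same translation unit):

#include <cstdio>

// Defined above; declared here so the snippet is self-contained.
unsigned int traditional_pow(unsigned int n, unsigned int exp);

int main() {
    unsigned int x = 7;
    // The exponent is a literal constant at this call site, yet as
    // discussed above the compiled code still goes through the loop.
    std::printf("%u\n", traditional_pow(x, 5));
    return 0;
}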
We could try a recursive implementation:
unsigned int traditional_rec_pow(unsigned int n, unsigned int exp, unsigned int acc) {
    switch(exp) {
        case 0: return 1;
        case 1: return acc;
        default: return traditional_rec_pow(n, exp-1, acc*n);
    }
}

unsigned int traditional_rec_pow(unsigned int n, unsigned int exp) {
    return traditional_rec_pow(n, exp, n);
}
Here we use an accumulator (a standard functional programmer's trick) to make the function tail recursive. With mild optimizations turned on, the code generated for traditional_rec_pow looks like a less optimized version of traditional_pow (tail recursion is eliminated, though). Full optimizations yield a 74-line version of the function, in which some special cases have been optimized.
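To see the accumulator at work, here is a short hand trace together with a runnable check (a sketch that assumes the definitions above are in scope):

#include <cassert>

// Two-argument wrapper defined above.
unsigned int traditional_rec_pow(unsigned int n, unsigned int exp);

int main() {
    // traditional_rec_pow(3, 4) -> traditional_rec_pow(3, 4, 3)
    //                           -> traditional_rec_pow(3, 3, 9)
    //                           -> traditional_rec_pow(3, 2, 27)
    //                           -> traditional_rec_pow(3, 1, 81)
    //                           -> 81
    // Every recursive call is in tail position, so it can become a loop.
    assert(traditional_rec_pow(3, 4) == 81);
    return 0;
}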
We are still very far from the intuitive idea of fully optimizing pow when exp is known at compile time. We think the compiler should be able to produce the very same code it would for the expression:

n * n * n * n * n

in the case exp==5.
I promise you: we can do that.
I wrote the recursive functional version as an introduction to a template-based variant. The idea is that if exp is known at compile time, it can become a template parameter.
The structure remains similar, with template specialization used instead of explicit switch/if. This is even more readable for a functional programmer!
template<unsigned int exp>
inline unsigned int pow(unsigned int n, unsigned int acc) {
    return pow<exp-1>(n, acc*n);
}

template<>
inline unsigned int pow<1>(unsigned int n, unsigned int acc) {
    return acc;
}

template<>
inline unsigned int pow<0>(unsigned int n, unsigned int acc) {
    return 1;
}

template<unsigned int exp>
unsigned int pow(unsigned int n) {
    return pow<exp>(n, n);
}
Note that we provide specializations for exp==1 and exp==0. They are needed because otherwise the recursive instantiation would never terminate (recall that exp is unsigned, so exp-1 wraps around at zero); in practice the compiler stops with an error once it hits its template instantiation depth limit.
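As a usage sketch (assuming the template definitions above are visible at the point of instantiation), the exponent now has to be a compile-time constant and the base cases behave as expected:

#include <cassert>

int main() {
    assert(pow<0>(7) == 1);    // hits the pow<0> specialization
    assert(pow<1>(7) == 7);    // hits the pow<1> specialization
    assert(pow<5>(2) == 32);   // general case: pow<5> -> pow<4> -> ... -> pow<1>
    return 0;
}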
Here some of the computation is implicitly performed at compile time. Suppose that in the body of main we call
pow<3>(3);
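In full, such a test file might look like this (the file name and the compiler flags in the comments are just an example of how one could reproduce the listings that follow; the template definitions above are assumed to appear before main):

#include <cstdio>

// Hypothetical driver, e.g. pow_test.cpp. The generated assembly can be
// inspected with something like:
//   g++ -S pow_test.cpp        # no optimizations
//   g++ -O -S pow_test.cpp     # with optimizations
int main() {
    std::printf("%u\n", pow<3>(3));  // instantiates pow<3>, pow<2> and pow<1>
    return 0;
}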
With no optimizations, the compiler generates a handful of functions. There is the pow<1> specialization (which basically returns acc). acc is the second parameter, which according to the OS X Intel calling convention is found at 12(%ebp), while the return value goes in %eax:
.globl __Z3powILj1EEjjj
    .weak_definition __Z3powILj1EEjjj
__Z3powILj1EEjjj:
LFB13:
    pushl   %ebp
LCFI0:
    movl    %esp, %ebp
LCFI1:
    subl    $8, %esp
LCFI2:
    movl    12(%ebp), %eax
    leave
    ret
LFE13:
    .align 1
For each exp value greater than one a function is generated. Each of these functions calls the one "before" it:
.globl __Z3powILj2EEjjj
    .weak_definition __Z3powILj2EEjjj
__Z3powILj2EEjjj:
LFB19:
    pushl   %ebp
LCFI3:
    movl    %esp, %ebp
LCFI4:
    subl    $24, %esp
LCFI5:
    movl    12(%ebp), %eax
    imull   8(%ebp), %eax
    movl    %eax, 4(%esp)
    movl    8(%ebp), %eax
    movl    %eax, (%esp)
    call    __Z3powILj1EEjjj
    leave
    ret
LFE19:
    .align 1

.globl __Z3powILj3EEjjj
    .weak_definition __Z3powILj3EEjjj
__Z3powILj3EEjjj:
LFB18:
    pushl   %ebp
LCFI6:
    movl    %esp, %ebp
LCFI7:
    subl    $24, %esp
LCFI8:
    movl    12(%ebp), %eax
    imull   8(%ebp), %eax
    movl    %eax, 4(%esp)
    movl    8(%ebp), %eax
    movl    %eax, (%esp)
    call    __Z3powILj2EEjjj
    leave
    ret
LFE18:
    .align 1
Notice that __Z3powILj3EEjjj (pow<3>) calls __Z3powILj2EEjjj (pow<2>), which calls __Z3powILj1EEjjj (pow<1>). Apart from this, most instructions deal with the stack (creating and destroying the current function's stack frame) and multiply the first parameter by the second, passing the result to the subsequent call.
Notice that no function contains conditional operations, tests or jumps, apart from the call to the next function in the chain. This is the kind of code optimizers shine at.
Indeed, even -O does the right thing. All the functions are inlined and the generated code (leaving stack operations aside) is:
movl    8(%ebp), %eax
movl    %eax, %edx
imull   %eax, %edx
imull   %edx, %eax
Wow! This is exactly what we would have written by hand. Notice that if we have multiple pow<n> calls, specialized (and optimized) code is generated for each variant. For example, this is the code generated for pow<11>:
.globl __Z3powILj11EEjj
    .weak_definition __Z3powILj11EEjj
__Z3powILj11EEjj:
LFB17:
    pushl   %ebp
LCFI0:
    movl    %esp, %ebp
LCFI1:
    movl    8(%ebp), %eax
    movl    %eax, %edx
    imull   %eax, %edx
    imull   %eax, %edx
    imull   %eax, %edx
    imull   %eax, %edx
    imull   %eax, %edx
    imull   %eax, %edx
    imull   %eax, %edx
    imull   %eax, %edx
    imull   %eax, %edx
    imull   %edx, %eax
    leave
    ret
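Each distinct exponent used in a program therefore gets its own fully unrolled function. As a hypothetical illustration (the function name and the exponents are chosen arbitrarily), mixing several calls simply produces one specialized instantiation per exponent:

// Three different template arguments lead to three separate
// instantiations, each unrolled like the listing above.
unsigned int mix(unsigned int x) {
    return pow<2>(x) + pow<3>(x) + pow<11>(x);
}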