Tuesday, December 23, 2008

Using template programming for efficiency reasons

Suppose we want to write a

unsigned int pow(unsigned int n, unsigned int exp) 

function that raises a number to a given power.

A C implementation is rather trivial:

unsigned int
traditional_pow(unsigned int n, unsigned int exp) {

        unsigned int value = 1;

        for(unsigned int i = 0; i < exp; ++i) {

                value *= n;

        }

        return value;
}

Basically we could write the very same thing in C++. We could also perform some minor hand optimizations to make the code look more efficient. However, compilers are quite good at optimizing code; at this level, they are usually better than humans.

Nonetheless, that function contains jumps. Notice that even when the exp argument is known at compile time (which is a pretty common use case), there is no way for the compiler to eliminate the jumps short of a pretty expensive and thorough inter-procedural analysis, which gcc does not perform even with -O3.

The only result is having traditional_pow inlined in main. This is rather unsatisfactory: branching is among the more costly operations on modern CPUs.

We could try a recursive implementation:

unsigned int
traditional_rec_pow(
        unsigned int n,
        unsigned int exp,
        unsigned int acc) {
        switch(exp) {
                case 0: return 1;
                case 1: return acc;
                default:
                return traditional_rec_pow(n, exp-1, acc*n);
        }
}

unsigned int
traditional_rec_pow(unsigned int n, unsigned int exp) {
        return traditional_rec_pow(n, exp, n);
}

Here we are using an accumulator (a standard functional programmer's trick) in order to make the function tail recursive. With mild optimizations turned on, the code generated for traditional_rec_pow looks like a less optimized version of traditional_pow (tail recursion is eliminated, though). Full optimizations yield a 74-line version of the function, in which some special cases have been optimized.

We are still very far from the intuitive idea of fully optimizing pow when exp is known at compile time. We think the compiler should be able to produce the very same code it would for the expression:
n * n * n * n * n

in the case exp==5.

I promise you: we can do that.
I wrote the recursive functional version as an introduction to a template-based variant. The idea is that if exp is known at compile time, it can become a template parameter.

The structure remains similar, with template specialization used instead of explicit switch/if. This is even more readable for a functional programmer!



template<unsigned int exp> inline unsigned int
pow(unsigned int n, unsigned int acc) {
        return pow<exp-1>(n, acc*n);
}

template<> inline unsigned int
pow<1>(unsigned int n, unsigned int acc) {
        return acc;
}

template<> inline unsigned int
pow<0>(unsigned int n, unsigned int acc) {
        return 1;
}

template<unsigned int exp> unsigned int
pow(unsigned int n) {
        return pow<exp>(n, n);
}

Note that we give explicit specializations for exp==1 and exp==0. They are needed because otherwise the recursive instantiation would never terminate (in practice, it terminates with an error once the compiler hits its template instantiation depth limit).

Here some computation is implicitly performed at compile time. Suppose in the body of the main we call

pow<3>(3);

With no optimizations, the compiler generates a function for each instantiation. There is the pow<1> specialization (which simply returns acc). acc is the second parameter, which under the OS X Intel calling convention lives at 12(%ebp), while the return value goes in %eax:

.globl __Z3powILj1EEjjj
                .weak_definition __Z3powILj1EEjjj
        __Z3powILj1EEjjj:
        LFB13:
                pushl        %ebp
        LCFI0:
                movl        %esp, %ebp
        LCFI1:
                subl        $8, %esp
        LCFI2:
                movl        12(%ebp), %eax
                leave
                ret
        LFE13:
                .align 1

For each exp value greater than one a function is generated. Each of these functions calls the one "before" it:

.globl __Z3powILj2EEjjj
                .weak_definition __Z3powILj2EEjjj
        __Z3powILj2EEjjj:
        LFB19:
                pushl        %ebp
        LCFI3:
                movl        %esp, %ebp
        LCFI4:
                subl        $24, %esp
        LCFI5:
                movl        12(%ebp), %eax
                imull        8(%ebp), %eax
                movl        %eax, 4(%esp)
                movl        8(%ebp), %eax
                movl        %eax, (%esp)
                call        __Z3powILj1EEjjj
                leave
                ret
        LFE19:
                .align 1
        .globl __Z3powILj3EEjjj
                .weak_definition __Z3powILj3EEjjj
        __Z3powILj3EEjjj:
        LFB18:
                pushl        %ebp
        LCFI6:
                movl        %esp, %ebp
        LCFI7:
                subl        $24, %esp
        LCFI8:
                movl        12(%ebp), %eax
                imull        8(%ebp), %eax
                movl        %eax, 4(%esp)
                movl        8(%ebp), %eax
                movl        %eax, (%esp)
                call        __Z3powILj2EEjjj
                leave
                ret
        LFE18:
                .align 1

Notice that __Z3powILj3EEjjj (pow<3>) calls __Z3powILj2EEjjj (pow<2>), which calls __Z3powILj1EEjjj (pow<1>). Apart from this, most instructions deal with the stack (creating and destroying the current function's stack frame) and multiply the first parameter by the second, passing the result on to the subsequent call.

Notice that no function contains conditional operations, tests or jumps, apart from the call to the next function in the chain. This is the kind of code optimizers shine at.

Indeed, even -O does the right thing. All the functions are inlined and the generated code is (not considering stack operations):
movl        8(%ebp), %eax
movl        %eax, %edx
imull        %eax, %edx
imull        %edx, %eax

Wow! This is exactly what we would have written by hand. Notice that if we have multiple pow<n> calls, specialized (and optimized) code is generated for each variant. For example, this is pow<11>:


.globl __Z3powILj11EEjj
                .weak_definition __Z3powILj11EEjj
        __Z3powILj11EEjj:
        LFB17:
                pushl        %ebp
        LCFI0:
                movl        %esp, %ebp
        LCFI1:
                movl        8(%ebp), %eax
                movl        %eax, %edx
                imull        %eax, %edx
                imull        %eax, %edx
                imull        %eax, %edx
                imull        %eax, %edx
                imull        %eax, %edx
                imull        %eax, %edx
                imull        %eax, %edx
                imull        %eax, %edx
                imull        %eax, %edx
                imull        %edx, %eax
                leave
                ret
