C++ for Scientists - Technische Universität Dresden


math.tu.dresden.de
03.12.2012

160 CHAPTER 5. META-PROGRAMMING

    assign entry 0
    assign entry 1
    assign entry 2
    assign entry 3

In this implementation, we replaced the loop by a recursion (counting on the compiler to inline the operations, since otherwise it would be even slower than the loop) and made sure that no loop index is incremented and tested for termination. This is only beneficial for small loops that run in L1 cache. Larger loops are dominated by loading the data from memory, and the loop overhead is irrelevant. On the contrary, entirely unrolling operations on very large vectors will probably decrease the performance because a lot of instructions need to be loaded, which reduces the bandwidth available for the data. As mentioned before, compilers can unroll such operations by themselves (and hopefully know when it is better not to), and sometimes this automatic unrolling is even slightly faster than the explicit implementation.

5.4.2 Nested Unrolling

From our experience, compilers usually unroll nested loops. Even a good compiler that can handle certain nested loops will not be able to optimize every program kernel, in particular those in heavily templatized programs instantiated with user-defined types. We will demonstrate here how to unroll nested loops at compile time, using matrix-vector multiplication as an example. For this purpose, we introduce a simplistic fixed-size matrix type:

    template <typename T, int Rows, int Cols>
    class fsize_matrix
    {
        typedef fsize_matrix self;
      public:
        typedef T value_type;
        BOOST_STATIC_ASSERT((Rows * Cols > 0));
        const static int my_rows= Rows, my_cols= Cols;

        fsize_matrix()
        {
            for (int i= 0; i < my_rows; ++i)
                for (int j= 0; j < my_cols; ++j)
                    data[i][j]= T(0);
        }

        fsize_matrix( const self& that ) { ... }

        // cannot check column index
        const T* operator[](int r) const { return data[r]; }
        T* operator[](int r) { return data[r]; }

        mat_vec_et<self, fsize_vector<T, Cols> > operator*(const fsize_vector<T, Cols>& v) const
        {
            return mat_vec_et<self, fsize_vector<T, Cols> >(*this, v);
        }
      private:


5.4. META-TUNING: WRITE YOUR OWN COMPILER OPTIMIZATION 161

        T data[Rows][Cols];
    };

The bracket operator returns a pointer for the sake of simplicity, but a good implementation should return a proxy that allows for checking the column index. The multiplication with a vector is realized by means of an expression template in order to avoid copying the result vector.
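The proxy mentioned above can be sketched as follows. This is only an illustration under assumptions: the names row_proxy and checked_matrix are hypothetical and do not appear in the text, and the check uses a plain assert so that it disappears in release builds.

```cpp
#include <cassert>

// Hypothetical row proxy: it wraps the pointer to one row and knows the
// row length, so the second bracket operator can check the column index.
template <typename T, int Cols>
class row_proxy
{
  public:
    explicit row_proxy(T* p) : row(p) {}

    T& operator[](int c) const
    {
        assert(0 <= c && c < Cols);   // the check a raw pointer cannot perform
        return row[c];
    }
  private:
    T* row;
};

// Stripped-down matrix (assumed name) returning proxies instead of pointers
template <typename T, int Rows, int Cols>
class checked_matrix
{
  public:
    checked_matrix() : data() {}      // value-initializes every entry

    row_proxy<T, Cols> operator[](int r)
    {
        assert(0 <= r && r < Rows);
        return row_proxy<T, Cols>(data[r]);
    }

    row_proxy<const T, Cols> operator[](int r) const
    {
        assert(0 <= r && r < Rows);
        return row_proxy<const T, Cols>(data[r]);
    }
  private:
    T data[Rows][Cols];
};
```

With such a proxy, A[1][2] still reads like ordinary double indexing, but an access like A[1][3] on a matrix with three columns triggers the assertion in a debug build instead of silently reading the next row.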

Then the vector assignment needs a specialization for the expression template (see footnote 17):

    template <typename T, int Size>
    class fsize_vector
    {
        template <typename Matrix, typename Vector>
        self& operator=( const mat_vec_et<Matrix, Vector>& that )
        {
            typedef mat_vec_et<Matrix, Vector> et;
            fsize_mat_vec_mult<et::my_rows-1, et::my_cols-1>()(that.A, that.v, *this);
            return *this;
        }
    };
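The definition of mat_vec_et is not part of this excerpt. A minimal sketch of how such an expression template might look, assuming that it merely stores references to both operands and republishes the matrix dimensions, is given below. The types tiny_matrix and tiny_vector are hypothetical stand-ins with just enough interface for the demonstration.

```cpp
#include <cassert>

// Assumed shape of the expression template: it keeps references to the
// operands and forwards the dimensions; no product is computed here.
template <typename Matrix, typename Vector>
struct mat_vec_et
{
    static const int my_rows= Matrix::my_rows, my_cols= Matrix::my_cols;

    mat_vec_et(const Matrix& mat, const Vector& vec) : A(mat), v(vec) {}

    const Matrix& A;
    const Vector& v;
};

// Hypothetical stand-in operand types
struct tiny_matrix { static const int my_rows= 2, my_cols= 3; };
struct tiny_vector {};

// The product only builds the expression object; the arithmetic is
// deferred until the expression is assigned to a vector.
inline mat_vec_et<tiny_matrix, tiny_vector>
operator*(const tiny_matrix& A, const tiny_vector& v)
{
    return mat_vec_et<tiny_matrix, tiny_vector>(A, v);
}

inline void demo_et()
{
    tiny_matrix A;
    tiny_vector v;
    mat_vec_et<tiny_matrix, tiny_vector> e= A * v;
    assert(&e.A == &A && &e.v == &v);   // e refers to the operands, no copies
}
```

Because the expression object holds references, it must not outlive its operands; in an assignment like w= A * v this is guaranteed since the temporary is consumed within the same statement.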

The functor fsize_mat_vec_mult must now compute the matrix-vector product on the three arguments. The general implementation of the functor reads:

    template <int Rows, int Cols>
    struct fsize_mat_vec_mult
    {
        template <typename Matrix, typename VecIn, typename VecOut>
        void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)
        {
            fsize_mat_vec_mult<Rows, Cols-1>()(A, v_in, v_out);
            v_out[Rows]+= A[Rows][Cols] * v_in[Cols];
        }
    };

Again, the functor is only templatized on the sizes, and the container types are deduced. The operator assumes that all smaller column indices are already handled, so that we can increment v_out[Rows] by A[Rows][Cols] * v_in[Cols]. In particular, we assume that the first operation on v_out[Rows] initializes it. Thus we need a (partial) specialization for Cols = 0:

    template <int Rows>
    struct fsize_mat_vec_mult<Rows, 0>
    {
        template <typename Matrix, typename VecIn, typename VecOut>
        void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)
        {
            fsize_mat_vec_mult<Rows-1, Matrix::my_cols-1>()(A, v_in, v_out);
            v_out[Rows]= A[Rows][0] * v_in[0];
        }
    };

The careful reader has noticed the substitution of += by =. We also notice that we have to call the computation for the preceding row with all columns and inductively for all smaller rows. The

17 A better solution would be to implement all assignments with a functor and to specialize that functor, because partial template specialization of functions does not always work as expected.
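To see the two functor definitions work together, the following self-contained sketch repeats them with simplified containers. Several parts are assumptions and not taken from the text: the excerpt ends before the termination of the recursion is shown, so the full specialization for entry (0, 0) is our guess at how it could be written; the helper multiply is hypothetical and replaces the expression template; and the containers are reduced to plain aggregates.

```cpp
#include <cassert>

// Simplified stand-ins for the containers in the text
template <typename T, int Size>
struct fsize_vector
{
    T data[Size];
    T& operator[](int i) { return data[i]; }
    const T& operator[](int i) const { return data[i]; }
};

template <typename T, int Rows, int Cols>
struct fsize_matrix
{
    static const int my_rows= Rows, my_cols= Cols;
    T data[Rows][Cols];
    T* operator[](int r) { return data[r]; }
    const T* operator[](int r) const { return data[r]; }
};

// General case (Cols > 0): handle the smaller columns, then add this entry
template <int Rows, int Cols>
struct fsize_mat_vec_mult
{
    template <typename Matrix, typename VecIn, typename VecOut>
    void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)
    {
        fsize_mat_vec_mult<Rows, Cols-1>()(A, v_in, v_out);
        v_out[Rows]+= A[Rows][Cols] * v_in[Cols];
    }
};

// Column 0: finish the preceding row, then initialize this row's entry
template <int Rows>
struct fsize_mat_vec_mult<Rows, 0>
{
    template <typename Matrix, typename VecIn, typename VecOut>
    void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)
    {
        fsize_mat_vec_mult<Rows-1, Matrix::my_cols-1>()(A, v_in, v_out);
        v_out[Rows]= A[Rows][0] * v_in[0];
    }
};

// Assumed termination, not shown in the excerpt: entry (0, 0) ends the recursion
template <>
struct fsize_mat_vec_mult<0, 0>
{
    template <typename Matrix, typename VecIn, typename VecOut>
    void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)
    {
        v_out[0]= A[0][0] * v_in[0];
    }
};

// Hypothetical helper that starts the recursion at the last entry
template <typename T, int Rows, int Cols>
fsize_vector<T, Rows> multiply(const fsize_matrix<T, Rows, Cols>& A,
                               const fsize_vector<T, Cols>& v)
{
    fsize_vector<T, Rows> w;   // every entry is set by '=' in column 0
    fsize_mat_vec_mult<Rows-1, Cols-1>()(A, v, w);
    return w;
}

// Usage sketch: a 2x3 matrix times a vector of size 3
inline void demo_mult()
{
    fsize_matrix<double, 2, 3> A= {{{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}}};
    fsize_vector<double, 3>    v= {{1.0, 0.0, 2.0}};
    fsize_vector<double, 2>    w= multiply(A, v);
    assert(w[0] == 7.0 && w[1] == 16.0);
}
```

For the 2x3 case, the call fsize_mat_vec_mult<1, 2> unwinds into six statements executed in row-major order: one assignment followed by two increments for row 0, then the same for row 1. No loop index is incremented or tested at run time, provided that the compiler inlines the nested calls.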
