C++ for Scientists - Technische Universität Dresden


The number of columns in the matrix is taken from an internal definition in the matrix type for the sake of simplicity. Passing it as an extra template argument or using a type trait would have been more general, because we are now limited to types where my_cols is defined in the class. We still need a (full) specialization to terminate the recursion:

template <>
struct fsize_mat_vec_mult<0, 0>
{
    template <typename Matrix, typename VecIn, typename VecOut>
    void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)
    {
        v_out[0]= A[0][0] * v_in[0];
    }
};

With the inlining, our program will execute the operation w= A * v for vectors of size 4 as:

w[0]= A[0][0] * v[0];
w[0]+= A[0][1] * v[1];
w[0]+= A[0][2] * v[2];
w[0]+= A[0][3] * v[3];
w[1]= A[1][0] * v[0];
w[1]+= A[1][1] * v[1];
w[1]+= A[1][2] * v[2];
w[1]+= A[1][3] * v[3];
w[2]= A[2][0] * v[0];
w[2]+= A[2][1] * v[1];
w[2]+= A[2][2] * v[2];
w[2]+= A[2][3] * v[3];
w[3]= A[3][0] * v[0];
w[3]+= A[3][1] * v[1];
w[3]+= A[3][2] * v[2];
w[3]+= A[3][3] * v[3];

Our tests have shown that such an implementation is indeed faster than relying on the compiler's loop optimizations.18

18 TODO: Give numbers
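The excerpt does not show how the recursion is started, so here is a minimal sketch of a possible entry point. It assumes the general template of fsize_mat_vec_mult and the specializations above are in scope and that the matrix type exposes its dimensions as members (my_cols is mentioned in the text; my_rows and the wrapper name mat_vec_mult are assumptions for illustration):

// Sketch (not from the text): start the compile-time recursion at the last
// row and column, read from the matrix type's my_rows/my_cols constants.
template <typename Matrix, typename VecIn, typename VecOut>
inline void mat_vec_mult(const Matrix& A, const VecIn& v_in, VecOut& v_out)
{
    fsize_mat_vec_mult<Matrix::my_rows-1, Matrix::my_cols-1>()(A, v_in, v_out);
}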

Increasing Concurrency

A disadvantage of the preceding implementation is that all operations on one entry of the target vector are performed in a single sweep. Therefore, the second operation must wait for the first, the third for the second, and so on. The fifth operation can be performed in parallel with the fourth and the ninth with the eighth, but this is not satisfying. We would like more concurrency in our program, so that the parallel pipelines of superscalar processors are better utilized. Again, we can either twiddle our thumbs and hope that the compiler reorders the statements, or we can take it into our own hands. More concurrency is provided by the following operation sequence:

w[0]= A[0][0] * v[0];
w[1]= A[1][0] * v[0];
w[2]= A[2][0] * v[0];
w[3]= A[3][0] * v[0];
w[0]+= A[0][1] * v[1];
w[1]+= A[1][1] * v[1];
w[2]+= A[2][1] * v[1];
w[3]+= A[3][1] * v[1];
w[0]+= A[0][2] * v[2];
w[1]+= A[1][2] * v[2];
w[2]+= A[2][2] * v[2];
w[3]+= A[3][2] * v[2];
w[0]+= A[0][3] * v[3];
w[1]+= A[1][3] * v[3];
w[2]+= A[2][3] * v[3];
w[3]+= A[3][3] * v[3];

We only need to reorganize our functor. The general template now reads:

template <unsigned Rows, unsigned Cols>
struct fsize_mat_vec_mult_cm
{
    template <typename Matrix, typename VecIn, typename VecOut>
    void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)
    {
        fsize_mat_vec_mult_cm<Rows-1, Cols>()(A, v_in, v_out);
        v_out[Rows]+= A[Rows][Cols] * v_in[Cols];
    }
};

Now we need a partial specialization for row 0 that proceeds to the next column:

template <unsigned Cols>
struct fsize_mat_vec_mult_cm<0, Cols>
{
    template <typename Matrix, typename VecIn, typename VecOut>
    void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)
    {
        fsize_mat_vec_mult_cm<Matrix::my_rows-1, Cols-1>()(A, v_in, v_out);
        v_out[0]+= A[0][Cols] * v_in[Cols];
    }
};

A partial specialization for column 0 is also needed, to initialize the entry of the output vector:

template <unsigned Rows>
struct fsize_mat_vec_mult_cm<Rows, 0>
{
    template <typename Matrix, typename VecIn, typename VecOut>
    void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)
    {
        fsize_mat_vec_mult_cm<Rows-1, 0>()(A, v_in, v_out);
        v_out[Rows]= A[Rows][0] * v_in[0];
    }
};

Finally, we still need a specialization for row and column 0 to terminate the recursion. It can be reused from the previous functor:

template <>
struct fsize_mat_vec_mult_cm<0, 0>
    : fsize_mat_vec_mult<0, 0> {};
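For completeness, here is a small test sketch, not from the text: it assumes the general template of fsize_mat_vec_mult and its row specialization from the preceding page are in scope together with the specializations shown here, and it uses a hypothetical fixed-size matrix stand-in fmat that merely provides operator[] and the my_rows/my_cols constants the functors rely on. It checks that the row-wise and the column-wise unrolling compute the same result:

#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size matrix stand-in (illustration only).
template <unsigned R, unsigned C>
struct fmat
{
    static const unsigned my_rows= R, my_cols= C;
    std::array<double, C>&       operator[](std::size_t r)       { return data[r]; }
    const std::array<double, C>& operator[](std::size_t r) const { return data[r]; }
    std::array<std::array<double, C>, R> data;
};

int main()
{
    fmat<4, 4>            A;
    std::array<double, 4> v, w_rm, w_cm;
    for (unsigned i= 0; i < 4; ++i) {
        v[i]= i + 1.0;
        for (unsigned j= 0; j < 4; ++j)
            A[i][j]= i * 4.0 + j;
    }
    fsize_mat_vec_mult<3, 3>()(A, v, w_rm);     // row-wise unrolling
    fsize_mat_vec_mult_cm<3, 3>()(A, v, w_cm);  // column-wise unrolling
    for (unsigned i= 0; i < 4; ++i)
        assert(w_rm[i] == w_cm[i]);             // both orderings agree
}

Both functors accumulate each output entry in the same column order, so the results agree exactly; only the interleaving of the statements differs, which is what provides the additional instruction-level parallelism.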

