C++ for Scientists - Technische Universität Dresden


math.tu.dresden.de
03.12.2012

CHAPTER 5. META-PROGRAMMING

5.4. META-TUNING: WRITE YOUR OWN COMPILER OPTIMIZATION

        typedef typename Matrix::value_type value_type;
        unsigned s= A.num_rows();
        for (unsigned i= 0; i < s; i++)
            for (unsigned k= 0; k < s; k++) {
                value_type tmp(0);
                for (unsigned j= 0; j < s; j++)
                    tmp+= A(i, j) * B(j, k);
                C(i, k)= tmp;
            }
    }

For this implementation, we write a benchmark function:

    template <typename Matrix>
    void bench(const Matrix& A, const Matrix& B, Matrix& C, const unsigned rep)
    {
        boost::timer t1;
        for (unsigned j= 0; j < rep; j++)
            mult(A, B, C);
        double t= t1.elapsed() / double(rep);
        unsigned s= A.num_rows();

        std::cout << "Compute time mult(A, B, C) is " << 1000000.0 * t << " µs. These are "
                  << s * s * (2*s - 1) / t / 1000000.0 << " MFlops.\n";
    }

The run time and performance of our canonical implementation (with 128 × 128 matrices) are:

    Compute time mult(A, B, C) is 5290 µs. These are 789.777 MFlops.

This implementation is our reference regarding performance and results.

For the development of the unrolled implementation we go back to 4 × 4 matrices. In contrast to Section 5.4.6, we do not unroll a single reduction but perform multiple reductions in parallel. For the three loops, this means unrolling the two outer loops and replacing the body of the inner loop with multiple operations. The latter we achieve, as usual, with a functor.

As in the canonical implementation, the reduction shall not be performed in elements of C but in temporaries. For this purpose we use the class multi_tmp from Section 5.4.6. For the sake of simplicity we limit ourselves to matrix sizes that are multiples of the unroll parameters. [30] An unrolled matrix multiplication is shown in the following code:

    template <unsigned Size0, unsigned Size1, typename Matrix>
    inline void mult(const Matrix& A, const Matrix& B, Matrix& C)
    {
        assert(A.num_rows() == B.num_rows()); // ...
        assert(A.num_rows() % Size0 == 0); // we omitted cleanup here
        assert(A.num_cols() % Size1 == 0); // we omitted cleanup here

        typedef typename Matrix::value_type value_type;
        unsigned s= A.num_rows();

        mult_block<0, Size0-1, 0, Size1-1> block;
        for (unsigned i= 0; i < s; i+= Size0)
            for (unsigned k= 0; k < s; k+= Size1) {
                multi_tmp<Size0 * Size1, value_type> tmp(value_type(0));
                for (unsigned j= 0; j < s; j++)
                    block(tmp, A, B, i, j, k);
                block.update(tmp, C, i, k);
            }
    }

[30] A full implementation for arbitrary matrix sizes is realized in MTL4.
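To make the effect of the two unrolled outer loops concrete, the following sketch writes out by hand what the loop nest amounts to when both unroll parameters are 2: four independent reductions advance together in the inner loop, and the block of C is written only once at the end. The type DenseMatrix and the function mult_2x2_by_hand are hypothetical helpers introduced only for this illustration; they are not part of MTL4 or of the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal square matrix, only for this illustration.
struct DenseMatrix
{
    typedef double value_type;
    explicit DenseMatrix(unsigned n) : s(n), data(std::size_t(n) * n, 0.0) {}
    unsigned num_rows() const { return s; }
    unsigned num_cols() const { return s; }
    double& operator()(unsigned i, unsigned j) { return data[std::size_t(i) * s + j]; }
    const double& operator()(unsigned i, unsigned j) const { return data[std::size_t(i) * s + j]; }
    unsigned            s;
    std::vector<double> data;
};

// Hand-written counterpart of a 2 x 2 unrolling: i and k advance in steps of 2,
// and the inner loop updates four temporaries instead of one.
inline void mult_2x2_by_hand(const DenseMatrix& A, const DenseMatrix& B, DenseMatrix& C)
{
    unsigned s= A.num_rows();
    assert(s % 2 == 0); // no cleanup code for odd sizes in this sketch

    for (unsigned i= 0; i < s; i+= 2)
        for (unsigned k= 0; k < s; k+= 2) {
            double t00= 0.0, t01= 0.0, t10= 0.0, t11= 0.0;
            for (unsigned j= 0; j < s; j++) {
                t00+= A(i,     j) * B(j, k);
                t01+= A(i,     j) * B(j, k + 1);
                t10+= A(i + 1, j) * B(j, k);
                t11+= A(i + 1, j) * B(j, k + 1);
            }
            // Write the finished 2 x 2 block of C once, after the reduction.
            C(i,     k)= t00;  C(i,     k + 1)= t01;
            C(i + 1, k)= t10;  C(i + 1, k + 1)= t11;
        }
}
```

Each entry of A and B loaded in the inner loop is used twice here, which is one plausible reason such blocking can pay off; whether it does on a given machine has to be measured.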

We still owe the reader the implementation of the functor mult_block. The techniques are the same as in vector operations, but we have to deal with more indices and their respective limits:

    template <unsigned Index0, unsigned Max0, unsigned Index1, unsigned Max1>
    struct mult_block
    {
        typedef mult_block<Index0, Max0, Index1+1, Max1> next;

        template <typename Tmp, typename Matrix>
        void operator()(Tmp& tmp, const Matrix& A, const Matrix& B, unsigned i, unsigned j, unsigned k)
        {
            std::cout << "tmp." << tmp.bs << "+= A[" << i + Index0 << "][" << j << "] * B[" << j << "]["
                      << k + Index1 << "]\n";
            tmp.value+= A(i + Index0, j) * B(j, k + Index1);
            next()(tmp.sub, A, B, i, j, k);
        }

        template <typename Tmp, typename Matrix>
        void update(const Tmp& tmp, Matrix& C, unsigned i, unsigned k)
        {
            std::cout << "C[" << i + Index0 << "][" << k + Index1 << "]= tmp." << tmp.bs << "\n";
            C(i + Index0, k + Index1)= tmp.value;
            next().update(tmp.sub, C, i, k);
        }
    };

    template <unsigned Index0, unsigned Max0, unsigned Max1>
    struct mult_block<Index0, Max0, Max1, Max1>
    {
        typedef mult_block<Index0+1, Max0, 0, Max1> next;

        template <typename Tmp, typename Matrix>
        void operator()(Tmp& tmp, const Matrix& A, const Matrix& B, unsigned i, unsigned j, unsigned k)
        {
            std::cout << "tmp." << tmp.bs << "+= A[" << i + Index0 << "][" << j << "] * B[" << j << "]["
                      << k + Max1 << "]\n";
            tmp.value+= A(i + Index0, j) * B(j, k + Max1);
            next()(tmp.sub, A, B, i, j, k);
        }

        template <typename Tmp, typename Matrix>
        void update(const Tmp& tmp, Matrix& C, unsigned i, unsigned k)
        {
            std::cout << "C[" << i + Index0 << "][" << k + Max1 << "]= tmp." << tmp.bs << "\n";
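The pieces discussed here can be assembled into a small self-contained sketch. Several parts are assumptions rather than the author's code: multi_tmp is taken to be a recursive chain of temporaries with the members value and sub (as the accesses tmp.value and tmp.sub suggest), the template argument lists are a plausible reading of how the indices advance, and a third specialization for Index0 == Max0 and Index1 == Max1 is assumed to end the recursion. SquareMatrix is a hypothetical stand-in for a real matrix type, and the debug output is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a square matrix type; not the MTL4 interface.
struct SquareMatrix
{
    typedef double value_type;
    explicit SquareMatrix(unsigned n) : s(n), data(std::size_t(n) * n, 0.0) {}
    unsigned num_rows() const { return s; }
    double& operator()(unsigned i, unsigned j) { return data[std::size_t(i) * s + j]; }
    const double& operator()(unsigned i, unsigned j) const { return data[std::size_t(i) * s + j]; }
    unsigned            s;
    std::vector<double> data;
};

// Assumed shape of multi_tmp: BSize temporaries chained through the member sub.
template <unsigned BSize, typename Value>
struct multi_tmp
{
    multi_tmp(const Value& v) : value(v), sub(v) {}
    Value                     value;
    multi_tmp<BSize-1, Value> sub;
};
template <typename Value>
struct multi_tmp<0, Value> { multi_tmp(const Value&) {} };

// General case: handle block entry (Index0, Index1), then move one column to the right.
template <unsigned Index0, unsigned Max0, unsigned Index1, unsigned Max1>
struct mult_block
{
    typedef mult_block<Index0, Max0, Index1+1, Max1> next;
    template <typename Tmp, typename Matrix>
    void operator()(Tmp& tmp, const Matrix& A, const Matrix& B, unsigned i, unsigned j, unsigned k)
    {
        tmp.value+= A(i + Index0, j) * B(j, k + Index1);
        next()(tmp.sub, A, B, i, j, k);
    }
    template <typename Tmp, typename Matrix>
    void update(const Tmp& tmp, Matrix& C, unsigned i, unsigned k)
    {
        C(i + Index0, k + Index1)= tmp.value;
        next().update(tmp.sub, C, i, k);
    }
};

// Last column of a block row: continue with the first column of the next row.
template <unsigned Index0, unsigned Max0, unsigned Max1>
struct mult_block<Index0, Max0, Max1, Max1>
{
    typedef mult_block<Index0+1, Max0, 0, Max1> next;
    template <typename Tmp, typename Matrix>
    void operator()(Tmp& tmp, const Matrix& A, const Matrix& B, unsigned i, unsigned j, unsigned k)
    {
        tmp.value+= A(i + Index0, j) * B(j, k + Max1);
        next()(tmp.sub, A, B, i, j, k);
    }
    template <typename Tmp, typename Matrix>
    void update(const Tmp& tmp, Matrix& C, unsigned i, unsigned k)
    {
        C(i + Index0, k + Max1)= tmp.value;
        next().update(tmp.sub, C, i, k);
    }
};

// Assumed terminating case: last entry of the block, no further recursion.
template <unsigned Max0, unsigned Max1>
struct mult_block<Max0, Max0, Max1, Max1>
{
    template <typename Tmp, typename Matrix>
    void operator()(Tmp& tmp, const Matrix& A, const Matrix& B, unsigned i, unsigned j, unsigned k)
    { tmp.value+= A(i + Max0, j) * B(j, k + Max1); }
    template <typename Tmp, typename Matrix>
    void update(const Tmp& tmp, Matrix& C, unsigned i, unsigned k)
    { C(i + Max0, k + Max1)= tmp.value; }
};

template <unsigned Size0, unsigned Size1, typename Matrix>
inline void mult(const Matrix& A, const Matrix& B, Matrix& C)
{
    assert(A.num_rows() % Size0 == 0 && A.num_rows() % Size1 == 0); // no cleanup code
    typedef typename Matrix::value_type value_type;
    unsigned s= A.num_rows();
    mult_block<0, Size0-1, 0, Size1-1> block;
    for (unsigned i= 0; i < s; i+= Size0)
        for (unsigned k= 0; k < s; k+= Size1) {
            multi_tmp<Size0 * Size1, value_type> tmp(value_type(0));
            for (unsigned j= 0; j < s; j++)
                block(tmp, A, B, i, j, k);
            block.update(tmp, C, i, k);
        }
}
```

With these definitions, a call such as mult<2, 2>(A, B, C) on 4 × 4 matrices should give the same entries as the canonical triple loop, because each temporary still sums its products in ascending order of j.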
