03.12.2012 Views

C++ for Scientists - Technische Universität Dresden


CHAPTER 5. META-PROGRAMMING

    typedef typename Matrix::value_type value_type;
    unsigned s= A.num_rows();
    for (unsigned i= 0; i < s; i++)
        for (unsigned k= 0; k < s; k++) {
            value_type tmp(0);
            for (unsigned j= 0; j < s; j++)
                tmp+= A(i, j) * B(j, k);
            C(i, k)= tmp;
        }
}

For this implementation, we write a benchmark function:

template <typename Matrix>
void bench(const Matrix& A, const Matrix& B, Matrix& C, const unsigned rep)
{
    boost::timer t1;
    for (unsigned j= 0; j < rep; j++)
        mult(A, B, C);
    double t= t1.elapsed() / double(rep);
    unsigned s= A.num_rows();

    std::cout << "Compute time mult(A, B, C) is "
              << 1000000.0 * t << " µs. This are "
              << s * s * (2*s - 1) / t / 1000000.0 << " MFlops.\n";
}
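The benchmark above depends on Boost and on a matrix type that is not shown in this excerpt. As a minimal sketch that uses only the standard library, the following pairs the canonical triple loop with a timing function based on `std::chrono::steady_clock`. The class `dense_matrix` and the function `bench_seconds` are hypothetical names introduced here for illustration; they are not part of the book's code or of MTL4.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical minimal row-major matrix; stands in for the book's matrix type.
template <typename Value>
class dense_matrix
{
  public:
    typedef Value value_type;

    dense_matrix(std::size_t rows, std::size_t cols)
      : nr(rows), nc(cols), data(rows * cols, Value(0)) {}

    std::size_t num_rows() const { return nr; }
    std::size_t num_cols() const { return nc; }

    Value& operator()(std::size_t i, std::size_t j) { return data[i * nc + j]; }
    const Value& operator()(std::size_t i, std::size_t j) const { return data[i * nc + j]; }

  private:
    std::size_t        nr, nc;
    std::vector<Value> data;
};

// Canonical triple loop for square matrices, following the text.
template <typename Matrix>
void mult(const Matrix& A, const Matrix& B, Matrix& C)
{
    assert(A.num_rows() == B.num_rows());
    typedef typename Matrix::value_type value_type;
    const std::size_t s= A.num_rows();
    for (std::size_t i= 0; i < s; i++)
        for (std::size_t k= 0; k < s; k++) {
            value_type tmp(0);                  // reduce in a temporary, not in C(i, k)
            for (std::size_t j= 0; j < s; j++)
                tmp+= A(i, j) * B(j, k);
            C(i, k)= tmp;
        }
}

// Average wall-clock seconds per call of mult over rep repetitions.
template <typename Matrix>
double bench_seconds(const Matrix& A, const Matrix& B, Matrix& C, unsigned rep)
{
    assert(rep > 0);
    const auto start= std::chrono::steady_clock::now();
    for (unsigned j= 0; j < rep; j++)
        mult(A, B, C);
    const std::chrono::duration<double> elapsed= std::chrono::steady_clock::now() - start;
    return elapsed.count() / double(rep);
}
```

A call such as `bench_seconds(A, B, C, 100)` returns the average time per multiplication in seconds. Note that `steady_clock` measures wall-clock time, so the figures are not directly comparable with a timer that reports processor time.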

The run time and performance of our canonical implementation (with 128 × 128 matrices) are:

    Compute time mult(A, B, C) is 5290 µs. This are 789.777 MFlops.

This implementation is our reference regarding performance and results.
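To make the MFlops figure easier to check: each of the s × s entries of C needs s multiplications and s − 1 additions, which gives s · s · (2s − 1) floating-point operations in total. The sketch below reproduces that arithmetic; the helper names `flop_count` and `mflops` are assumptions made for this illustration and do not appear in the book.

```cpp
#include <cassert>
#include <cstddef>

// Floating-point operations of the canonical product of two s x s matrices:
// every entry of C takes s multiplications and s - 1 additions.
inline double flop_count(std::size_t s)
{
    return double(s) * double(s) * (2.0 * double(s) - 1.0);
}

// Rate in MFlops for one product that took the given number of seconds.
inline double mflops(std::size_t s, double seconds)
{
    assert(seconds > 0.0);
    return flop_count(s) / seconds / 1e6;
}
```

For s = 128 the count is 128 · 128 · 255 = 4,177,920 operations; dividing by the measured 5290 µs gives roughly 789.78 MFlops, in line with the output quoted above.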

For the development of the unrolled implementation we go back to 4 × 4 matrices. In contrast
to Section 5.4.6, we do not unroll a single reduction but perform multiple reductions in parallel.
For the three loops, this means that we unroll the two outer loops and replace the body of the
inner loop with multiple operations. The latter we achieve, as usual, with a functor.

As in the canonical implementation, the reduction shall not be performed in elements of C
but in temporaries. For this purpose we use the class multi_tmp from Section 5.4.6. For the sake of
simplicity we limit ourselves to matrix sizes that are multiples of the unroll parameters. [30] An
unrolled matrix multiplication is shown in the following code:

template <unsigned Size0, unsigned Size1, typename Matrix>
inline void mult(const Matrix& A, const Matrix& B, Matrix& C)
{
    assert(A.num_rows() == B.num_rows()); // ...
    assert(A.num_rows() % Size0 == 0); // we omitted cleanup here
    assert(A.num_cols() % Size1 == 0); // we omitted cleanup here

    typedef typename Matrix::value_type value_type;
    unsigned s= A.num_rows();

[30] A full implementation for arbitrary matrix sizes is realized in MTL4.
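The listing in this excerpt ends before the loop nest. As a rough sketch of the blocking idea described in the text, the following computes C in blocks of Size0 × Size1 entries and accumulates each block in temporaries, so that the loop over j carries several independent reductions. It deliberately swaps the book's functor and `multi_tmp` for a plain array of temporaries and loops with compile-time bounds, which a compiler may unroll; it is not the book's implementation. The names `square_matrix` and `mult_blocked` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal square matrix used only for this sketch.
struct square_matrix
{
    typedef double value_type;

    explicit square_matrix(std::size_t size) : n(size), data(size * size, 0.0) {}

    std::size_t num_rows() const { return n; }
    std::size_t num_cols() const { return n; }
    double& operator()(std::size_t i, std::size_t j) { return data[i * n + j]; }
    const double& operator()(std::size_t i, std::size_t j) const { return data[i * n + j]; }

    std::size_t         n;
    std::vector<double> data;
};

// C = A * B for square matrices, computed block-wise.
// Each Size0 x Size1 block of C is reduced in its own set of temporaries.
template <unsigned Size0, unsigned Size1, typename Matrix>
void mult_blocked(const Matrix& A, const Matrix& B, Matrix& C)
{
    assert(A.num_rows() == B.num_rows());
    assert(A.num_rows() % Size0 == 0);  // no cleanup for leftover rows
    assert(A.num_cols() % Size1 == 0);  // no cleanup for leftover columns

    typedef typename Matrix::value_type value_type;
    const std::size_t s= A.num_rows();

    for (std::size_t i= 0; i < s; i+= Size0)
        for (std::size_t k= 0; k < s; k+= Size1) {
            value_type tmp[Size0][Size1]= {};   // zero-initialized temporaries
            for (std::size_t j= 0; j < s; j++)
                for (unsigned i2= 0; i2 < Size0; i2++)
                    for (unsigned k2= 0; k2 < Size1; k2++)
                        tmp[i2][k2]+= A(i + i2, j) * B(j, k + k2);
            // write the finished block back to C
            for (unsigned i2= 0; i2 < Size0; i2++)
                for (unsigned k2= 0; k2 < Size1; k2++)
                    C(i + i2, k + k2)= tmp[i2][k2];
        }
}
```

A call such as `mult_blocked<2, 4>(A, B, C)` selects the block shape at compile time. Because every temporary is still summed over j in ascending order, the result should agree with the canonical loop entry by entry.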
