03.12.2012 Views

C++ for Scientists - Technische Universität Dresden


5.4. META-TUNING: WRITE YOUR OWN COMPILER OPTIMIZATION

of C++ one can use (or abuse) the compiler to generate the most efficient version without
rewriting the program sources, as one would need to in C or Fortran. The power of internal
code generation with the C++ compiler alone makes external code generation as in ATLAS [28]
unnecessary. In ATLAS, functions are written in a domain-specific language, and C programs [29] in
slight variations are generated with a tool and compared regarding performance. The techniques
presented here empower us to generate binaries equivalent to those variations by just using a
C++ compiler. Thus, we can tune our programs by changing template arguments or constants
(that might be set platform-dependently).
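To make the idea of tuning through template arguments concrete, here is a minimal sketch, not taken from the text. The names `dot_unrolled`, `dot`, and the macro `DOT_BLOCK_SIZE` are hypothetical, and the sketch assumes both vectors have the same length. The block size is a compile-time constant, so each instantiation is a separate code variation that the compiler can unroll on its own.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the text): a dot product whose unrolling
// width BSize is a template argument, i.e. the tuning knob.
// Assumption: x and y have the same size.
template <unsigned BSize>
double dot_unrolled(const std::vector<double>& x, const std::vector<double>& y)
{
    static_assert(BSize > 0, "block size must be positive");

    double partial[BSize] = {};                  // one accumulator per lane
    const std::size_t n = x.size();
    const std::size_t blocked = n - n % BSize;   // largest multiple of BSize <= n

    for (std::size_t i = 0; i < blocked; i += BSize)
        for (unsigned k = 0; k < BSize; ++k)     // trip count known at compile time
            partial[k] += x[i + k] * y[i + k];

    double sum = 0.0;
    for (unsigned k = 0; k < BSize; ++k)         // combine the partial sums
        sum += partial[k];
    for (std::size_t i = blocked; i < n; ++i)    // entries that did not fill a block
        sum += x[i] * y[i];
    return sum;
}

// Assumed default; a build system could override it per platform,
// e.g. with -DDOT_BLOCK_SIZE=8.
#ifndef DOT_BLOCK_SIZE
#define DOT_BLOCK_SIZE 4
#endif

inline double dot(const std::vector<double>& x, const std::vector<double>& y)
{
    return dot_unrolled<DOT_BLOCK_SIZE>(x, y);
}
```

Calling `dot_unrolled<1>(x, y)`, `dot_unrolled<4>(x, y)`, or `dot_unrolled<8>(x, y)` yields differently unrolled binaries from the same source; which one is fastest depends on the platform and has to be measured.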

5.4.7 Tuning Nested Loops

⇒ matrix_unroll_example.cpp

The most used (and abused) example in performance discussions is dense matrix multiplication.
We do not claim to compete with hand-tuned assembler codes, but we show the power of
metaprogramming to generate code variations from a single implementation. As a starting point, we
use the templatized implementation of the matrix class from Section 3.7.4.

We begin our implementation with a simple test case:<br />

    int main()
    {
        const unsigned s= 4; // s= 4 for testing and 128 for timing
        matrix A(s, s), B(s, s), C(s, s);

        for (unsigned i= 0; i < s; i++)
            for (unsigned j= 0; j < s; j++) {
                A(i, j)= 100.0 * i + j;
                B(i, j)= 200.0 * i + j;
            }
        mult(A, B, C);
        std::cout << "C is " << C << '\n';
    }

A matrix multiplication is easily implemented with three nested loops. One of the 6 possible
nestings is a dot-product-like calculation of each entry of C:

$$c_{ik} = A_i \cdot B^k$$

where $A_i$ is the $i$-th row of $A$ and $B^k$ the $k$-th column of $B$. We use a temporary in the innermost
loop to decrease the cache-invalidation overhead of writing to C's elements in each operation:

    template <typename Matrix>
    inline void mult(const Matrix& A, const Matrix& B, Matrix& C)
    {
        assert(A.num_rows() == B.num_rows()); // ...

machines at least for some years, since not everybody has a GPU card for numerics and not every algorithm is
already successfully ported (e.g. incomplete LU on arbitrary sparse matrices). At the time of this writing, there
is not even support for std::complex.

[28] http://math-atlas.sourceforge.net/

[29] In some cases the C programs contain assembler snippets for a given platform in order to achieve performance
close to peak.
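The dot-product-like nesting of Section 5.4.7 can be sketched in a self-contained form as follows. This is only an illustration under stated assumptions: `simple_matrix` is a hypothetical row-major stand-in for the matrix class of Section 3.7.4 (which is not reproduced here), and `mult_sketch` is a hypothetical name chosen to avoid confusion with the text's own `mult`. Each entry is computed as $c_{ik} = \sum_j a_{ij} b_{jk}$, accumulated in a temporary and written to `C` only once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the matrix class of Section 3.7.4:
// dense, row-major, entries value-initialized to zero.
template <typename Value>
class simple_matrix
{
  public:
    typedef Value value_type;

    simple_matrix(std::size_t r, std::size_t c) : nrows(r), ncols(c), data(r * c) {}

    Value& operator()(std::size_t i, std::size_t j) { return data[i * ncols + j]; }
    const Value& operator()(std::size_t i, std::size_t j) const { return data[i * ncols + j]; }

    std::size_t num_rows() const { return nrows; }
    std::size_t num_cols() const { return ncols; }

  private:
    std::size_t        nrows, ncols;
    std::vector<Value> data;
};

// Sketch of the dot-product-like loop nesting: rows of A times columns of B.
template <typename Matrix>
inline void mult_sketch(const Matrix& A, const Matrix& B, Matrix& C)
{
    // Assumption: general shape check instead of the square-matrix case.
    assert(A.num_cols() == B.num_rows());
    assert(C.num_rows() == A.num_rows() && C.num_cols() == B.num_cols());

    typedef typename Matrix::value_type value_type;

    for (std::size_t i = 0; i < A.num_rows(); ++i)
        for (std::size_t k = 0; k < B.num_cols(); ++k) {
            value_type tmp = value_type(0);          // accumulate locally ...
            for (std::size_t j = 0; j < A.num_cols(); ++j)
                tmp += A(i, j) * B(j, k);
            C(i, k) = tmp;                           // ... and write C(i, k) once
        }
}
```

Because the innermost loop only updates `tmp`, the compiler can keep the running sum in a register instead of storing to `C(i, k)` in every iteration.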
