C++ for Scientists - Technische Universität Dresden
5.4. META-TUNING: WRITE YOUR OWN COMPILER OPTIMIZATION
of C++ one can use (or abuse) the compiler to generate the most efficient version without
rewriting the program sources, as one would need to in C or Fortran. The power of internal
code generation with the C++ compiler alone makes external code generation as in ATLAS²⁸
unnecessary. In ATLAS, functions are written in a domain-specific language, and C programs²⁹ in
slight variations are generated with a tool and compared regarding performance. The techniques
presented here empower us to generate binaries equivalent to those variations by just using a
C++ compiler. Thus, we can tune our programs by changing template arguments or constants
(which might be set platform-dependently).
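As a small illustration of tuning through template arguments, the following sketch parameterizes a dot product by a compile-time block size. The names `dot_blocked` and `TUNE_BLOCK_SIZE` are hypothetical and not taken from the book. The sketch only assumes that a compiler can unroll an inner loop whose trip count is a template constant; whether any particular block size pays off has to be measured on the target platform.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tuning constant: a build system could set it per platform,
// for instance by passing -DTUNE_BLOCK_SIZE=8 to the compiler.
#ifndef TUNE_BLOCK_SIZE
#  define TUNE_BLOCK_SIZE 4
#endif

// Dot product that accumulates into BS independent partial sums.
// BS is a template argument, so the inner loop has a compile-time
// trip count and each instantiation is a separate code variation.
template <unsigned BS>
double dot_blocked(const std::vector<double>& x, const std::vector<double>& y)
{
    static_assert(BS > 0, "block size must be positive");
    assert(x.size() == y.size());

    double partial[BS] = {};                  // zero-initialized accumulators
    const std::size_t n = x.size();
    const std::size_t blocked = n - n % BS;   // largest multiple of BS <= n

    for (std::size_t i = 0; i < blocked; i += BS)
        for (unsigned j = 0; j < BS; ++j)     // fixed trip count BS
            partial[j] += x[i + j] * y[i + j];

    double sum = 0.0;
    for (unsigned j = 0; j < BS; ++j)
        sum += partial[j];
    for (std::size_t i = blocked; i < n; ++i) // remainder entries
        sum += x[i] * y[i];
    return sum;
}
```

A call such as `dot_blocked<TUNE_BLOCK_SIZE>(x, y)` then picks the variant selected at build time, while `dot_blocked<1>(x, y)` and `dot_blocked<8>(x, y)` can be timed against each other from the same source.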
5.4.7 Tuning Nested Loops<br />
⇒ matrix_unroll_example.cpp
The most used (and abused) example in performance discussions is dense matrix multiplication.
We do not claim to compete with hand-tuned assembler codes, but we show the power of metaprogramming
to generate code variations from a single implementation. As a starting point we
use a templated implementation of the matrix class from Section 3.7.4.
We begin our implementation with a simple test case:<br />
int main()
{
    const unsigned s= 4; // s= 4 for testing and 128 for timing
    matrix A(s, s), B(s, s), C(s, s);

    for (unsigned i= 0; i < s; i++)
        for (unsigned j= 0; j < s; j++) {
            A(i, j)= 100.0 * i + j;
            B(i, j)= 200.0 * i + j;
        }
    mult(A, B, C);
    std::cout << "C is " << C << '\n';
}
A matrix multiplication is easily implemented with three nested loops. One of the 6 possible<br />
nestings is a dot-product-like calculation of each entry from C:<br />
$$c_{ik} = A_i \cdot B^k$$
where $A_i$ is the $i$-th row of $A$ and $B^k$ the $k$-th column of $B$. We use a temporary in the innermost
loop to decrease the cache-invalidation overhead of writing to C’s elements in each operation:<br />
template <typename Matrix>
inline void mult(const Matrix& A, const Matrix& B, Matrix& C)
{
    assert(A.num_rows() == B.num_rows()); // ...
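For experimenting with the dot-product-like nesting described above, a minimal self-contained sketch could look as follows. The class `dense_matrix` is a hypothetical stand-in for the matrix class of Section 3.7.4, which is not reproduced here, and `double` entries are assumed. The dimension checks are also an assumption for general rectangular operands and need not match the book's assertions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the book's matrix class: row-major storage,
// element access with operator()(row, column).
class dense_matrix
{
  public:
    dense_matrix(std::size_t nr, std::size_t nc)
      : nr(nr), nc(nc), data(nr * nc, 0.0) {}

    double&       operator()(std::size_t i, std::size_t j)       { return data[i * nc + j]; }
    const double& operator()(std::size_t i, std::size_t j) const { return data[i * nc + j]; }

    std::size_t num_rows() const { return nr; }
    std::size_t num_cols() const { return nc; }

  private:
    std::size_t         nr, nc;
    std::vector<double> data;
};

// C = A * B: each entry c_ik is the dot product of row i of A
// and column k of B.
template <typename Matrix>
inline void mult(const Matrix& A, const Matrix& B, Matrix& C)
{
    assert(A.num_cols() == B.num_rows());
    assert(C.num_rows() == A.num_rows() && C.num_cols() == B.num_cols());

    for (std::size_t i = 0; i < A.num_rows(); ++i)
        for (std::size_t k = 0; k < B.num_cols(); ++k) {
            double tmp = 0.0;              // temporary: accumulate locally
            for (std::size_t j = 0; j < A.num_cols(); ++j)
                tmp += A(i, j) * B(j, k);
            C(i, k) = tmp;                 // write each entry of C only once
        }
}
```

With the test data from the example above and s = 2, this yields C(0, 0) = 200, C(0, 1) = 201, C(1, 0) = 20200 and C(1, 1) = 20401.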
machines at least for some years, since not everybody has a GPU card for numerics and not every algorithm has
already been successfully ported (e.g. incomplete LU on arbitrary sparse matrices). At the time of this writing there
is not even support for std::complex.
28 http://math-atlas.sourceforge.net/
29 In some cases the C programs contain assembler snippets for a given platform in order to achieve performance
close to peak.