
5.4 Meta-Tuning: Write Your Own Compiler Optimization

transforms the operations to a form that is best for execution time. We would only need a new compiler and our programs would become faster.¹³ But life, especially as an advanced C++ programmer, is no walk in the park. Of course, the compiler helps us a lot to speed up our programs. But there are limitations: many optimizations need knowledge of the semantic behavior and can therefore only be applied to types and operations whose semantics are known at the time the compiler is written; see also the discussion in [?]. Research is going on to overcome these limitations by providing concept-based optimization [?]. Unfortunately, it will take time until this becomes mainstream, especially now that concepts have been taken out of the C++0x standard. An alternative is source-to-source transformation with external tools like ROSE [?].

Even for types and operations that the compiler can handle, it has its limitations. Most compilers (gcc, ...¹⁴) only deal with the inner loop of a loop nest (see the solution in Section 5.4.2) and do not dare to introduce extra temporaries (see the solution in Section ??). Some compilers are particularly tuned for benchmarks.¹⁵ For instance, they use pattern matching to recognize a 3-fold nested loop that computes a dense matrix product and transform it into BLAS-like code with 7 or 9 platform-dependent loops.¹⁶ All this said, writing high-performance software is no walk in the park. That does not mean that such software must be unreadable and unmaintainable hackery. The route to success is, again, to provide appropriate abstractions. These can be empowered with compile-time optimizations so that applications are still written in natural mathematical notation while the generated binaries exploit all known techniques for fast execution.

5.4.1 Classical Fixed-Size Unrolling

The easiest form of compile-time optimization can be realized for fixed-size data types, in particular vectors as in Section 4.7. Similar to the default assignment, we can write a generic vector assignment:

template <typename Value, int Size>
class fsize_vector
{
    typedef fsize_vector self;
  public:
    const static int my_size= Size;

    // Generic assignment: copies all entries in a run-time loop
    self& operator=(const self& that)
    {
        for (int i= 0; i < my_size; ++i)
            data[i]= that[i];
        return *this;
    }

  private:
    Value data[my_size];
};
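The actual meta-tuning consists of unrolling this loop at compile time. Since my_size is a compile-time constant, the loop can be replaced by a recursive functor that the compiler instantiates, and usually completely inlines, during compilation. The following is a minimal sketch of this technique; the helper name fsize_assign and its exact interface are illustrative assumptions, not necessarily the implementation developed later in this section:

// Sketch: compile-time unrolled assignment. fsize_assign is a
// hypothetical helper; fsize_assign<N> copies entries 0..N.
template <int N>
struct fsize_assign
{
    template <typename Target, typename Source>
    void operator()(Target& tar, const Source& src) const
    {
        fsize_assign<N-1>()(tar, src);  // copy entries 0..N-1 first
        tar[N]= src[N];                 // then entry N
    }
};

template <>
struct fsize_assign<0>  // end of the recursion: first entry
{
    template <typename Target, typename Source>
    void operator()(Target& tar, const Source& src) const
    {
        tar[0]= src[0];
    }
};

The assignment operator can then delegate to this functor, and the compiler expands the recursion into straight-line code without loop counter and branch overhead:

// Inside fsize_vector: unrolled assignment instead of the run-time loop
self& operator=(const self& that)
{
    fsize_assign<my_size-1>()(*this, that);
    return *this;
}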

¹³ In some sense, this is the programming equivalent of communism: everybody contributes as much as he pleases and how he pleases, and in the end the right thing happens anyway, thanks to a self-improving society. Likewise, some people write software in a very naïve fashion and blame the compiler for not transforming their programs into high-performance code.

¹⁴ TODO: we should run some benchmarks on MSVC and icc.

¹⁵ TODO: search for paper on kcc.

¹⁶ One could sometimes get the impression that the HPC community believes that multiplying dense matrices at near-peak performance solves all performance issues of the world, or at least demonstrates that everything can be computed at near-peak performance if only one tries hard enough. Fortunately, more and more people in the supercomputing centers realize that their machines are not only running BLAS3 and LAPACK operations and that real-world applications are more often than not limited by memory bandwidth and latency.
