C++ for Scientists - Technische Universität Dresden
5.4. META-TUNING: WRITE YOUR OWN COMPILER OPTIMIZATION
transforms the operations into a form that is best for execution time. We would only need a new
compiler and our programs would become faster. 13 But life, especially as an advanced C++ programmer,
is no walk in the park. Of course, the compiler helps us a lot to speed up our programs.
But there are limitations: many optimizations need knowledge of semantic behavior and can
therefore only be applied to types and operations whose semantics are known at the time the
compiler is written; see also the discussion in [?]. Research is going on to overcome these limitations
by providing concept-based optimization [?]. Unfortunately, it will take time until this becomes
mainstream, especially now that concepts are taken out of the C++0x standard. An alternative
is source-to-source code transformation with external tools like ROSE [?].
Even for types and operations that the compiler can handle, it has its limitations. Most compilers
(gcc, . . . 14 ) only deal with the inner loop of a nest (see the solution in Section 5.4.2)
and do not dare to introduce extra temporaries (see the solution in Section ??). Some compilers
are particularly tuned for benchmarks. 15 For instance, they use pattern matching to recognize
a 3-fold nested loop that computes a dense matrix product and transform it into BLAS-like code
with 7 or 9 platform-dependent loops. 16 All this said, writing high-performance software is no
walk in the park. That does not mean that such software must be unreadable and unmaintainable
hackery. The route to success is, again, to provide appropriate abstractions. These can be
empowered with compile-time optimizations so that the applications are still written in natural
mathematical notation whereas the generated binaries can still exploit all known techniques
for fast execution.
5.4.1 Classical Fixed-Size Unrolling
The easiest form of compile-time optimization can be realized for fixed-size data types, in
particular vectors as in Section 4.7. Similar to the default assignment, we can write a generic
vector assignment:
template <typename T, int Size>
class fsize_vector
{
    typedef fsize_vector self;
  public:
    const static int my_size= Size;

    self& operator=(const self& that)
    {
        for (int i= 0; i < my_size; ++i)
            data[i]= that[i];
        return *this;
    }
  private:
    T data[my_size];
};
13 In some sense, this is the programming equivalent of communism: everybody contributes as much as he
pleases and how he pleases, and in the end the right thing happens anyway thanks to a self-improving society.
Likewise, some people write software in a very naïve fashion and blame the compiler for not transforming their
programs into high-performance code.
14 TODO: we should run some benchmarks on MSVC and icc.
15 TODO: search for paper on kcc.
16 One could sometimes get the impression that the HPC community believes that multiplying dense matrices
at near-peak performance solves all performance issues of the world, or at least demonstrates that everything can
be computed at near-peak performance if only one tries hard enough. Fortunately, more and more people in the
supercomputer centers realize that their machines are not only running BLAS3 and LAPACK operations and
that real-world applications are more often than not limited by memory bandwidth and latency.