C++ for Scientists - Technische Universität Dresden
5.4. META-TUNING: WRITE YOUR OWN COMPILER OPTIMIZATION
Earlier experiments with older compilers (gcc 3.4) [25] exposed a serious overhead for using arrays
or nested classes; in the end it was even slower than using one single variable. The reason was
probably that the compiler could not use registers for these types. [26]
The most likely way to get temporaries stored in registers is to declare them as separate variables:
    template <typename Vector>
    inline typename Vector::value_type one_norm(const Vector& v)
    {
        typename Vector::value_type s0(0), s1(0), s2(0), ...
    }
As one can see, the problem is how many variables to declare. The number cannot depend on the
template argument but must be fixed for all sizes (unless one writes a different implementation
for each number and undermines the expressiveness of templates). Thus, we have to settle on a certain
number of variables, say 8. Then we cannot unroll the loop more than eight times.
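With the number fixed at eight, a hand-written version shows what the unrolled code is meant to look like. The sketch below is an illustration only: the name `one_norm_8` and the remainder loop for sizes that are not a multiple of eight are assumptions, and no template machinery is involved yet.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the text): one-norm with eight separate
// accumulators and a loop body unrolled by hand.
template <typename Vector>
typename Vector::value_type one_norm_8(const Vector& v)
{
    using std::abs;
    typedef typename Vector::value_type value_type;

    // Separate scalar variables rather than an array, so that each one
    // is a candidate for its own register.
    value_type s0(0), s1(0), s2(0), s3(0), s4(0), s5(0), s6(0), s7(0);

    const std::size_t n  = v.size();
    const std::size_t sb = n / 8 * 8;   // largest multiple of 8 not exceeding n

    for (std::size_t i = 0; i < sb; i += 8) {
        s0 += abs(v[i]);
        s1 += abs(v[i + 1]);
        s2 += abs(v[i + 2]);
        s3 += abs(v[i + 3]);
        s4 += abs(v[i + 4]);
        s5 += abs(v[i + 5]);
        s6 += abs(v[i + 6]);
        s7 += abs(v[i + 7]);
    }

    // Entries that do not fill a complete block of eight.
    for (std::size_t i = sb; i < n; ++i)
        s0 += abs(v[i]);

    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
}
```

Calling `one_norm_8(v)` on a `std::vector<double>` returns the sum of the absolute values of its entries. The eight partial sums are independent of each other, which is what gives the compiler the freedom to keep them in registers.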
The next issue we run into is the number of function arguments. When we call the iteration
block we pass all variables (registers):
    for (unsigned i= 0; i < sb; i+= BSize)
        one_norm_ftor<0, BSize>()(s0, s1, s2, s3, s4, s5, s6, s7, v, i);
The first calculation in such a block is performed on s0, and s1–s7 are only passed on to the functors
for the following computations. After this, the second computation must accumulate on the
second function argument, the third calculation on the third argument, and so on. Unfortunately, this is
not implementable with templates (only with very ugly and highly error-prone source code
manipulations by macros).
Alternatively, each computation is performed on its first function argument, and subsequent
functors are called with the first argument omitted:
    one_norm_ftor<1, BSize>()(s1, s2, s3, s4, s5, s6, s7, v, i);
    one_norm_ftor<2, BSize>()(s2, s3, s4, s5, s6, s7, v, i);
    one_norm_ftor<3, BSize>()(s3, s4, s5, s6, s7, v, i);
This is not realizable with templates either.
The solution is to rotate the references to registers:
    one_norm_ftor<1, BSize>()(s1, s2, s3, s4, s5, s6, s7, s0, v, i);
    one_norm_ftor<2, BSize>()(s2, s3, s4, s5, s6, s7, s0, s1, v, i);
    one_norm_ftor<3, BSize>()(s3, s4, s5, s6, s7, s0, s1, s2, v, i);
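The effect of the rotation can be checked in isolation with a reduced sketch that uses four accumulators instead of eight. All names here (`rotating_one_norm`, `one_norm_rotated`, the parameters `Offset` and `Max`) are hypothetical, and the terminating specialization and the remainder loop are assumptions about how such a recursion might be closed off, not a statement of the implementation in the text.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reduced illustration with four accumulators.
// Offset: position inside the current block; Max: block size.
template <unsigned Offset, unsigned Max>
struct rotating_one_norm
{
    template <typename S, typename V>
    void operator()(S& s0, S& s1, S& s2, S& s3, const V& v, std::size_t i) const
    {
        using std::abs;
        s0 += abs(v[i + Offset]);   // every step accumulates on its FIRST argument
        // Rotate the references: the caller's second variable becomes the
        // first argument of the next step, and s0 moves to the end.
        rotating_one_norm<Offset + 1, Max>()(s1, s2, s3, s0, v, i);
    }
};

// Assumed termination: once Offset reaches Max, nothing is left to do.
template <unsigned Max>
struct rotating_one_norm<Max, Max>
{
    template <typename S, typename V>
    void operator()(S&, S&, S&, S&, const V&, std::size_t) const {}
};

// Hypothetical driver: processes complete blocks of BSize entries with the
// functor and the remaining entries with a plain loop.
template <unsigned BSize, typename Vector>
typename Vector::value_type one_norm_rotated(const Vector& v)
{
    static_assert(BSize > 0, "block size must be positive");
    using std::abs;
    typedef typename Vector::value_type value_type;

    value_type s0(0), s1(0), s2(0), s3(0);
    const std::size_t n  = v.size();
    const std::size_t sb = n / BSize * BSize;   // end of the last complete block

    for (std::size_t i = 0; i < sb; i += BSize)
        rotating_one_norm<0, BSize>()(s0, s1, s2, s3, v, i);

    for (std::size_t i = sb; i < n; ++i)
        s0 += abs(v[i]);

    return s0 + s1 + s2 + s3;
}
```

With `BSize` equal to 4, the four steps of a block add to `s0`, `s1`, `s2`, and `s3` in turn, although each step only ever writes to its first parameter. A call such as `one_norm_rotated<4>(v)` therefore yields the same sum as a plain loop, with every signature keeping the same number of arguments.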
This rotation is achieved by the following functor implementation:
    template <unsigned Offset, unsigned Max>
    struct one_norm_ftor
    {
        template <typename S, typename V>
        void operator()(S& s0, S& s1, S& s2, S& s3, S& s4, S& s5, S& s6, S& s7, const V& v, unsigned i)
        {
            using std::abs;
            s0+= abs(v[i+Offset]);
            one_norm_ftor<Offset+1, Max>()(s1, s2, s3, s4, s5, s6, s7, s0, v, i);
[25] TODO: Show!!!
[26] TODO: which raises the question why they can do it today