03.12.2012 Views

C++ for Scientists - Technische Universität Dresden


5.4. META-TUNING: WRITE YOUR OWN COMPILER OPTIMIZATION

Earlier experiments with older compilers (gcc 3.4) [25] exposed a serious overhead for using arrays or nested classes; in the end it was even slower than using one single variable. The reason was probably that the compiler could not use registers for these types. [26]

The most likely way to store temporaries in registers is to declare them as separate variables:

    inline one_norm(const Vector& v)
    {
        typename Vector::value_type s0(0), s1(0), s2(0), ...
    }

As one can see, the problem is how many variables to declare. The number cannot depend on the template argument but must be fixed for all sizes (unless one writes a different implementation for each number and thereby undermines the expressiveness of templates). Thus, we have to fix a certain number of variables, say 8. Then we cannot unroll the loop more than eight times.
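To make the idea of separate accumulator variables concrete, here is a minimal hand-unrolled sketch with four of them. The function name `one_norm_unrolled4`, the block size of 4, and the clean-up loop for the remaining entries are assumptions made for illustration; they do not appear in the text, and the sketch assumes a real-valued `value_type`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the text): one-norm with four hand-written
// accumulators, each a separate local variable the compiler can keep in a register.
template <typename Vector>
typename Vector::value_type one_norm_unrolled4(const Vector& v)
{
    using std::abs;
    typedef typename Vector::value_type value_type;

    value_type s0(0), s1(0), s2(0), s3(0);   // fixed number, independent of the vector size

    const std::size_t n  = v.size();
    const std::size_t sb = n / 4 * 4;        // largest multiple of 4 not exceeding n

    // Unrolled part: each statement accumulates into its own variable.
    for (std::size_t i = 0; i < sb; i += 4) {
        s0 += abs(v[i]);
        s1 += abs(v[i + 1]);
        s2 += abs(v[i + 2]);
        s3 += abs(v[i + 3]);
    }

    // Combine the partial sums and handle the entries left over after the last full block.
    value_type s = s0 + s1 + s2 + s3;
    for (std::size_t i = sb; i < n; ++i)
        s += abs(v[i]);
    return s;
}
```

Writing this by hand ties the code to one unroll factor, which is exactly the limitation the template-based approach tries to remove.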

The next issue we run into is the number of function arguments. When we call the iteration block, we pass all variables (registers):

    for (unsigned i= 0; i < sb; i+= BSize)
        one_norm_ftor<0, BSize>()(s0, s1, s2, s3, s4, s5, s6, s7, v, i);

The first calculation in such a block is performed on s0, while s1 to s7 are only passed on to the functors for the following computations. After this, the second computation must accumulate on the second function argument, the third calculation on the third argument, and so on. This is unfortunately not implementable with templates (only with very ugly and highly error-prone source code manipulations by macros).

Alternatively, each computation is performed on its first function argument and subsequent functors are called with the first argument omitted:

    one_norm_ftor<1, BSize>()(s1, s2, s3, s4, s5, s6, s7, v, i);
    one_norm_ftor<2, BSize>()(s2, s3, s4, s5, s6, s7, v, i);
    one_norm_ftor<3, BSize>()(s3, s4, s5, s6, s7, v, i);

This is not realizable with templates either.

The solution is to rotate the references to registers, so that every call keeps the same number of arguments and always accumulates on its first one:

    one_norm_ftor<1, BSize>()(s1, s2, s3, s4, s5, s6, s7, s0, v, i);
    one_norm_ftor<2, BSize>()(s2, s3, s4, s5, s6, s7, s0, s1, v, i);
    one_norm_ftor<3, BSize>()(s3, s4, s5, s6, s7, s0, s1, s2, v, i);

This rotation is achieved by the following functor implementation:

    template <unsigned Offset, unsigned Max>
    struct one_norm_ftor
    {
        template <typename S, typename V>
        void operator()(S& s0, S& s1, S& s2, S& s3, S& s4, S& s5, S& s6, S& s7, const V& v, unsigned i)
        {
            using std::abs;
            s0+= abs(v[i+Offset]);   // accumulate on the first register
            // pass the registers on, rotated by one position
            one_norm_ftor<Offset+1, Max>()(s1, s2, s3, s4, s5, s6, s7, s0, v, i);

[25] TODO: Show!!!

[26] TODO: which raises the question why they can do it today
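The functor listing on this page breaks off at the page boundary. As a self-contained sketch of the rotation technique, the version below completes it under several assumptions that are not confirmed by this excerpt: the second template parameter `Max`, the empty terminating specialization for `Offset == Max`, and the driver function `one_norm` with its clean-up loop are all hypothetical. The names `Offset`, `S`, `V`, `sb`, and `BSize` are taken from the text, and a real-valued `value_type` is assumed.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Accumulates |v[i+Offset]| on its first register argument and calls the next
// functor with the register references rotated by one position.
template <unsigned Offset, unsigned Max>
struct one_norm_ftor
{
    template <typename S, typename V>
    void operator()(S& s0, S& s1, S& s2, S& s3, S& s4, S& s5, S& s6, S& s7,
                    const V& v, unsigned i)
    {
        using std::abs;
        s0 += abs(v[i + Offset]);
        one_norm_ftor<Offset + 1, Max>()(s1, s2, s3, s4, s5, s6, s7, s0, v, i);
    }
};

// Assumed terminating case (not shown in the excerpt): once Offset reaches Max,
// nothing is left to do in this block.
template <unsigned Max>
struct one_norm_ftor<Max, Max>
{
    template <typename S, typename V>
    void operator()(S&, S&, S&, S&, S&, S&, S&, S&, const V&, unsigned) {}
};

// Hypothetical driver: BSize is the unroll factor, chosen at compile time.
template <unsigned BSize, typename Vector>
typename Vector::value_type one_norm(const Vector& v)
{
    static_assert(BSize > 0, "block size must be positive");
    using std::abs;
    typedef typename Vector::value_type value_type;

    value_type s0(0), s1(0), s2(0), s3(0), s4(0), s5(0), s6(0), s7(0);

    const unsigned n  = static_cast<unsigned>(v.size());
    const unsigned sb = n / BSize * BSize;   // part of the vector covered by full blocks

    for (unsigned i = 0; i < sb; i += BSize)
        one_norm_ftor<0, BSize>()(s0, s1, s2, s3, s4, s5, s6, s7, v, i);

    // Combine the partial sums, then add the entries after the last full block.
    value_type s = s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
    for (unsigned i = sb; i < n; ++i)
        s += abs(v[i]);
    return s;
}
```

A call such as `one_norm<4>(v)` then expands, for each block, into four accumulation statements on s0, s1, s2, and s3 in turn, because each nested call sees the next register as its first argument.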
