03.12.2012 Views

C++ for Scientists - Technische Universität Dresden


CHAPTER 5. META-PROGRAMMING

    }
};

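// Terminating specialization: the empty operator() ends the recursion.
// The parameter name Max and the arguments <Max, Max> are assumed.
template <unsigned Max>
struct one_norm_ftor<Max, Max>
{
    template <typename S, typename V>
    void operator()(S& s0, S& s1, S& s2, S& s3, S& s4, S& s5, S& s6, S& s7, const V& v, unsigned i) {}
};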

The corresponding one_norm function based on this functor is straightforward:

template <unsigned BSize, typename Vector>
typename Vector::value_type
inline one_norm(const Vector& v)
{
    using std::abs;
    typename Vector::value_type s0(0), s1(0), s2(0), s3(0), s4(0), s5(0), s6(0), s7(0);
    // sb is the largest multiple of BSize that does not exceed s
    unsigned s= size(v), sb= s / BSize * BSize;

    // Blocked part: each functor call handles BSize entries
    // (the template arguments <0, BSize> are assumed)
    for (unsigned i= 0; i < sb; i+= BSize)
        one_norm_ftor<0, BSize>()(s0, s1, s2, s3, s4, s5, s6, s7, v, i);
    // Combine the partial sums
    s0+= s1 + s2 + s3 + s4 + s5 + s6 + s7;

    // Remaining entries that do not fill a whole block
    for (unsigned i= sb; i < s; i++)
        s0+= abs(v[i]);
    return s0;
}
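For readers who want to compile the listing, the following is a minimal self-contained sketch. The general case of the functor is not shown on this page, so the version here is an assumption: it adds one entry to the first temporary and recurses with the temporaries rotated by one position. The parameter names Offset and Max are likewise assumed, and the container's size() member is used in place of the free function size(v).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// General case (assumed): add entry i+Offset to the first temporary,
// then recurse with the temporaries rotated one position to the left.
template <unsigned Offset, unsigned Max>
struct one_norm_ftor
{
    template <typename S, typename V>
    void operator()(S& s0, S& s1, S& s2, S& s3, S& s4, S& s5, S& s6, S& s7,
                    const V& v, unsigned i)
    {
        using std::abs;
        s0+= abs(v[i + Offset]);
        one_norm_ftor<Offset + 1, Max>()(s1, s2, s3, s4, s5, s6, s7, s0, v, i);
    }
};

// Terminating specialization: nothing left to add in this block.
template <unsigned Max>
struct one_norm_ftor<Max, Max>
{
    template <typename S, typename V>
    void operator()(S&, S&, S&, S&, S&, S&, S&, S&, const V&, unsigned) {}
};

template <unsigned BSize, typename Vector>
typename Vector::value_type one_norm(const Vector& v)
{
    static_assert(BSize > 0, "block size must be positive");
    using std::abs;
    typename Vector::value_type s0(0), s1(0), s2(0), s3(0),
                                s4(0), s5(0), s6(0), s7(0);
    // s: number of entries; sb: largest multiple of BSize not exceeding s
    unsigned s= static_cast<unsigned>(v.size()), sb= s / BSize * BSize;
    assert(sb <= s);

    // Blocked part
    for (unsigned i= 0; i < sb; i+= BSize)
        one_norm_ftor<0, BSize>()(s0, s1, s2, s3, s4, s5, s6, s7, v, i);
    s0+= s1 + s2 + s3 + s4 + s5 + s6 + s7;

    // Remainder
    for (unsigned i= sb; i < s; i++)
        s0+= abs(v[i]);
    return s0;
}
```

With a std::vector<double> v, a call such as one_norm<4>(v) selects the block size at compile time. Different block sizes may differ in the last bits of a floating-point result because the summation order changes.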

A slight disadvantage is that all registers must be accumulated after the first loop, no matter how small BSize is and how short the vector. A great advantage of the rotation is that BSize is not limited to the number of temporary variables in such accumulations. If BSize is larger, then some or all variables are used multiple times without corrupting the result. The number of temporaries is nonetheless a limiting factor for the concurrency.
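The effect of the rotation can be made concrete with a small counting sketch. It assumes eight temporaries and a rotation by one position per entry, so that the entry at offset k within a block is added to temporary k mod 8. The names num_temps and uses_per_temporary are hypothetical helpers for illustration only.

```cpp
#include <array>
#include <cassert>

// Number of temporaries s0..s7 (assumed to match the listing).
constexpr unsigned num_temps= 8;

// For one block of BSize entries, count how many entries are added
// to each temporary when the temporaries rotate by one per entry.
template <unsigned BSize>
std::array<unsigned, num_temps> uses_per_temporary()
{
    std::array<unsigned, num_temps> count{};   // zero-initialized
    for (unsigned offset= 0; offset < BSize; ++offset)
        ++count[offset % num_temps];           // entry 'offset' lands here

    // Every entry of the block is added to exactly one temporary.
    unsigned total= 0;
    for (unsigned c : count)
        total+= c;
    assert(total == BSize);
    return count;
}
```

Under these assumptions, uses_per_temporary<4>() reports that only s0 to s3 receive an entry, so at most four independent additions are available per block. uses_per_temporary<12>() reports two entries for s0 to s3 and one for s4 to s7: the block is larger than the number of temporaries, yet every entry is still added exactly once.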

On the test machine, this implementation takes:

Compute time one_norm(v) is 6.77 µs.<br />

Compute time one_norm(v) is 1.13 µs.<br />

Compute time one_norm(v) is 0.71 µs.<br />

Compute time one_norm(v) is 0.75 µs.<br />

Compute time one_norm(v) is 1.07 µs.<br />

This is comparable with the nested class (in this environment).<br />
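How the timings were obtained is not shown here. One possible way to measure such run times is sketched below with std::chrono; the helper names average_microseconds and plain_one_norm are hypothetical, and the plain loop only stands in for the function being measured.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <vector>

// Hypothetical helper: average wall-clock time of one call f() in
// microseconds, measured over the given number of repetitions.
template <typename F>
double average_microseconds(F f, unsigned repetitions)
{
    assert(repetitions > 0);
    using clock= std::chrono::steady_clock;
    const auto start= clock::now();
    for (unsigned r= 0; r < repetitions; ++r)
        f();
    const std::chrono::duration<double, std::micro> elapsed= clock::now() - start;
    return elapsed.count() / repetitions;
}

// Plain reference loop, used here only as something to measure.
inline double plain_one_norm(const std::vector<double>& v)
{
    double sum= 0.0;
    for (double x : v)
        sum+= std::abs(x);
    return sum;
}
```

A call could look like average_microseconds([&]{ sink= plain_one_norm(v); }, 1000), where sink is a volatile double so that the compiler cannot discard the computation. Results depend strongly on the compiler, its optimization flags, and the machine.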

Résumé on Reduction Tuning<br />

The goal of this section was not to determine the ultimately tuned reduction implementation for superscalar processors.²⁷ The main ambition of this section, in fact of the whole book, is to demonstrate the diversity of implementation opportunities. With the enormous expressiveness

²⁷ In the presence of the new GPU cards with hundreds of cores and millions of threads, the fight for this little concurrency is not so impressive. Nonetheless, we will still need performance tuning on single-core and “few-core”
