C++ for Scientists - Technische Universität Dresden
CHAPTER 5. META-PROGRAMMING
};
}
// Terminating specialization: ends the recursion, nothing is left to accumulate
template <unsigned Max>
struct one_norm_ftor<Max, Max>
{
    template <typename S, typename V>
    void operator()(S& s0, S& s1, S& s2, S& s3, S& s4, S& s5, S& s6, S& s7, const V& v, unsigned i) {}
};
The corresponding one_norm function based on this functor is straightforward:
template <unsigned BSize, typename Vector>
typename Vector::value_type
inline one_norm(const Vector& v)
{
    using std::abs;
    typename Vector::value_type s0(0), s1(0), s2(0), s3(0), s4(0), s5(0), s6(0), s7(0);
    unsigned s= size(v), sb= s / BSize * BSize;   // sb: largest multiple of BSize not exceeding s

    for (unsigned i= 0; i < sb; i+= BSize)        // blocked part
        one_norm_ftor<0, BSize>()(s0, s1, s2, s3, s4, s5, s6, s7, v, i);
    s0+= s1 + s2 + s3 + s4 + s5 + s6 + s7;        // combine the accumulators
    for (unsigned i= sb; i < s; i++)              // remaining entries
        s0+= abs(v[i]);
    return s0;
}
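Only the terminating specialization of the functor appears in this excerpt, so the following is a minimal self-contained sketch of how the pieces might fit together. The primary template, the parameter names Offset and Max, the static_assert, and the use of std::vector with v.size() instead of a free size function are assumptions made for illustration, not necessarily the original listing.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Assumed primary template: adds one entry to the first accumulator and
// recurses with the accumulators rotated by one position.
template <unsigned Offset, unsigned Max>
struct one_norm_ftor
{
    template <typename S, typename V>
    void operator()(S& s0, S& s1, S& s2, S& s3, S& s4, S& s5, S& s6, S& s7,
                    const V& v, unsigned i)
    {
        using std::abs;
        s0+= abs(v[i + Offset]);
        one_norm_ftor<Offset + 1, Max>()(s1, s2, s3, s4, s5, s6, s7, s0, v, i);
    }
};

// Terminating specialization: Offset has reached Max, so nothing is added.
template <unsigned Max>
struct one_norm_ftor<Max, Max>
{
    template <typename S, typename V>
    void operator()(S&, S&, S&, S&, S&, S&, S&, S&, const V&, unsigned) {}
};

template <unsigned BSize, typename Vector>
typename Vector::value_type one_norm(const Vector& v)
{
    static_assert(BSize > 0, "block size must be positive");
    using std::abs;
    typename Vector::value_type s0(0), s1(0), s2(0), s3(0), s4(0), s5(0), s6(0), s7(0);
    // Assumption: the container offers size(); sb is the blocked part of the range.
    unsigned s= static_cast<unsigned>(v.size()), sb= s / BSize * BSize;

    for (unsigned i= 0; i < sb; i+= BSize)
        one_norm_ftor<0, BSize>()(s0, s1, s2, s3, s4, s5, s6, s7, v, i);
    s0+= s1 + s2 + s3 + s4 + s5 + s6 + s7;   // combine the accumulators
    for (unsigned i= sb; i < s; i++)         // entries that did not fill a block
        s0+= abs(v[i]);
    return s0;
}
```

A call such as one_norm<4>(v) on a std::vector<double> then processes four entries per loop iteration and handles the remaining entries in the second loop.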
A slight disadvantage is that all registers must be accumulated after the first loop, no matter
how small BSize is and how short the vector. A great advantage of the rotation is that BSize
is not limited to the number of temporary variables in such accumulations. If BSize is larger
than that number, some or all variables are used multiple times without corrupting the result.
The number of temporaries is nonetheless a limiting factor for the concurrency.
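To see why a block size larger than the number of temporaries does no harm, the rotation can be reduced to two accumulators. The sketch below is illustrative only: the names rot2_ftor and one_block are hypothetical, and rotating two variables degenerates to swapping them. With a block of five entries, the first accumulator receives entries 0, 2 and 4 and the second receives entries 1 and 3, so each entry is counted exactly once.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical two-accumulator variant of the rotating functor.
template <unsigned Offset, unsigned Max>
struct rot2_ftor
{
    template <typename S, typename V>
    void operator()(S& s0, S& s1, const V& v, unsigned i)
    {
        using std::abs;
        s0+= abs(v[i + Offset]);
        rot2_ftor<Offset + 1, Max>()(s1, s0, v, i);   // rotate: s1 comes first next
    }
};

// End of recursion: all Max entries of the block have been added.
template <unsigned Max>
struct rot2_ftor<Max, Max>
{
    template <typename S, typename V>
    void operator()(S&, S&, const V&, unsigned) {}
};

// Processes one block of BSize entries starting at index 0 and returns
// both accumulators so that the distribution of the entries is visible.
template <unsigned BSize, typename Vector>
std::pair<typename Vector::value_type, typename Vector::value_type>
one_block(const Vector& v)
{
    assert(v.size() >= BSize);                // the block must fit into the vector
    typename Vector::value_type s0(0), s1(0);
    rot2_ftor<0, BSize>()(s0, s1, v, 0);
    return std::make_pair(s0, s1);
}
```

Only two independent additions can be in flight at a time in this variant, which mirrors the remark that the number of temporaries limits the concurrency.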
On the test machine, this implementation takes:
Compute time one_norm(v) is 6.77 µs.<br />
Compute time one_norm(v) is 1.13 µs.<br />
Compute time one_norm(v) is 0.71 µs.<br />
Compute time one_norm(v) is 0.75 µs.<br />
Compute time one_norm(v) is 1.07 µs.<br />
This is comparable with the nested class (in this environment).<br />
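The excerpt does not show how these timings were obtained. One possible way to measure such a reduction is sketched below with std::chrono; the helper name average_microseconds and the repetition scheme are assumptions, and the numbers it produces depend entirely on machine, compiler, and flags.

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Hypothetical helper: calls f() the given number of times and returns the
// average wall-clock time per call in microseconds.
template <typename F>
double average_microseconds(F f, unsigned repetitions)
{
    if (repetitions == 0)
        return 0.0;
    using clock = std::chrono::steady_clock;
    volatile double sink= 0.0;               // keeps the calls from being optimized away
    const auto start= clock::now();
    for (unsigned r= 0; r < repetitions; ++r)
        sink= sink + f();
    const auto stop= clock::now();
    (void) sink;
    const std::chrono::duration<double, std::micro> elapsed= stop - start;
    return elapsed.count() / repetitions;
}
```

A callable wrapping the reduction, for instance a lambda that returns the norm of a fixed vector, can be passed as f; averaging over many repetitions smooths out the clock resolution for such short computations.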
Résumé on Reduction Tuning<br />
The goal of this section was not to determine the ultimately tuned reduction implementation
for superscalar processors.²⁷ The main ambition of this section, in fact of the whole book, is to
demonstrate the diversity of implementation opportunities. With the enormous expressiveness
²⁷ In the presence of the new GPU cards with hundreds of cores and millions of threads, the fight for this little
concurrency is not so impressive. Nonetheless, we will still need performance tuning on single-core and “few-core”