C++ for Scientists - Technische Universität Dresden


math.tu.dresden.de
03.12.2012 Views

172 CHAPTER 5. META-PROGRAMMING

            for (unsigned i= 0; i < sb; i+= BSize)
                assign<0, BSize>()(ref, that, i);
            for (unsigned i= sb; i < s; i++)
                ref[i]= that[i];
            return ref;
        }
      private:
        V& ref;
    };

Evaluating the considered vector expressions for some block sizes yields:

    Compute time unroll(u)= v + v + w is 1.72 µs.
    Compute time unroll(u)= v + v + w is 1.52 µs.
    Compute time unroll(u)= v + v + w is 1.36 µs.
    Compute time unroll(u)= v + v + w is 1.37 µs.
    Compute time unroll(u)= v + v + w is 1.4 µs.

These few benchmarks are consistent with the previous results: unroll with a block size of 1 is equal to the canonical implementation, and unroll with a larger block size is as fast as the hard-wired unrolling.

5.4.6 Tuning Reduction Operations

Reducing on a Single Variable

⇒ reduction_unroll_example.cpp

In the preceding vector operations, the i-th entry of each vector was handled independently of any other entry. In reduction operations, the entries are related through one or more temporary variables, and these temporaries can become a serious bottleneck.

First, we test whether a reduction operation, say the discrete L1 norm (also known as the Manhattan norm), can be accelerated by the techniques from Section 5.4.4. We implement the one_norm function in terms of a functor for the iteration block:

    template <unsigned BSize, typename Vector>
    typename Vector::value_type
    inline one_norm(const Vector& v)
    {
        using std::abs;
        typename Vector::value_type sum(0);
        unsigned s= size(v), sb= s / BSize * BSize;

        for (unsigned i= 0; i < sb; i+= BSize)
            one_norm_ftor<0, BSize>()(sum, v, i);
        for (unsigned i= sb; i < s; i++)
            sum+= abs(v[i]);
        return sum;
    }
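The loop bounds in one_norm rely on integer division: s / BSize * BSize rounds s down to the nearest multiple of BSize, so the first loop only sees whole blocks and the second loop handles the entries that are left over. A minimal sketch of this arithmetic follows; the helper names blocked_end and remainder_count are assumptions made for illustration and do not appear in the original code.

```cpp
#include <cassert>

// Hypothetical helper (not from the original text): largest multiple of
// block_size that does not exceed s, i.e. the value called sb above.
inline unsigned blocked_end(unsigned s, unsigned block_size)
{
    assert(block_size > 0);               // a block size of 0 would divide by zero
    return s / block_size * block_size;   // integer division truncates
}

// Hypothetical helper: number of entries left for the clean-up loop.
inline unsigned remainder_count(unsigned s, unsigned block_size)
{
    return s - blocked_end(s, block_size);
}
```

For a vector of 10 entries and a block size of 4, blocked_end yields 8, so indices 0 to 7 are processed in two unrolled blocks and the two remaining indices are handled by the clean-up loop. If the vector is shorter than one block, blocked_end is 0 and the clean-up loop does all the work.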


5.4. META-TUNING: WRITE YOUR OWN COMPILER OPTIMIZATION 173

The functor is also implemented in the same manner as before:

    // Adds |v[i+Offset]| to sum, then continues with the next offset.
    template <unsigned Offset, unsigned Max>
    struct one_norm_ftor
    {
        template <typename S, typename V>
        void operator()(S& sum, const V& v, unsigned i)
        {
            using std::abs;
            sum+= abs(v[i+Offset]);
            one_norm_ftor<Offset+1, Max>()(sum, v, i);
        }
    };

    // Specialization that terminates the recursion when Offset reaches Max.
    template <unsigned Max>
    struct one_norm_ftor<Max, Max>
    {
        template <typename S, typename V>
        void operator()(S& sum, const V& v, unsigned i) {}
    };
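To make the effect of the recursive functor more tangible, the following sketch spells out by hand what the recursion amounts to once all calls are inlined. It assumes a starting offset of 0 and a block size of 4; the function name one_norm_unrolled4 and the restriction to std::vector<double> are illustrative choices, not part of the original code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hand-written counterpart of the template recursion for a block size of 4.
// Hypothetical function, shown only to illustrate the expansion.
inline double one_norm_unrolled4(const std::vector<double>& v)
{
    using std::abs;
    double sum= 0.0;
    const std::size_t s= v.size(), sb= s / 4 * 4;

    for (std::size_t i= 0; i < sb; i+= 4) {
        // One statement per offset, as produced by the recursive calls:
        sum+= abs(v[i+0]);   // Offset 0
        sum+= abs(v[i+1]);   // Offset 1
        sum+= abs(v[i+2]);   // Offset 2
        sum+= abs(v[i+3]);   // Offset 3; Offset 4 reaches the empty specialization
    }
    for (std::size_t i= sb; i < s; i++)   // entries that do not fill a block
        sum+= abs(v[i]);
    return sum;
}
```

Every statement in the unrolled block adds to the same variable sum, so each addition has to wait for the previous one even though the entries of v are independent.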

The measured run-time behavior is:

    Compute time one_norm(v) is 7.42 µs.
    Compute time one_norm(v) is 3.64 µs.
    Compute time one_norm(v) is 1.9 µs.
    Compute time one_norm(v) is 1.25 µs.
    Compute time one_norm(v) is 1.03 µs.

This is already a good improvement, but maybe we can do better. 23
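The text does not show how these times were obtained. One possible way to take such measurements is sketched below with std::chrono. The helper name average_time_us, the choice of steady_clock and the averaging over repeated calls are all assumptions, and the numbers it produces will differ from those listed above depending on machine, compiler and flags.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: average wall-clock time of one call to f(),
// in microseconds, measured over `repetitions` calls.
template <typename F>
double average_time_us(F f, unsigned repetitions)
{
    assert(repetitions > 0);   // the average is undefined for zero calls
    using clock= std::chrono::steady_clock;

    const auto start= clock::now();
    for (unsigned r= 0; r < repetitions; r++)
        f();
    const std::chrono::duration<double, std::micro> elapsed= clock::now() - start;

    return elapsed.count() / repetitions;
}
```

A call could look like average_time_us([&]{ result+= one_norm<4>(v); }, 1000), where result is a variable that is printed afterwards. Using the result matters, because an optimizing compiler may otherwise remove the computation that is being timed.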

Reducing on an Array

⇒ reduction_unroll_array_example.cpp

When we look at the previous computation, we see that a different entry of v is used in each iteration. But every computation accesses the same temporary variable sum, and this limits concurrency. To provide more concurrency, we can use multiple temporaries, 24 for instance in an array. The modified function then reads:

    template <unsigned BSize, typename Vector>
    typename Vector::value_type
    inline one_norm(const Vector& v)
    {
        using std::abs;
        typename Vector::value_type sum[BSize];
        for (unsigned i= 0; i < BSize; i++)
            sum[i]= 0;

23 TODO: Test it with gcc 3.4 and MSVC. Speed up in table

24 Strictly speaking, this is not true for every possible scalar type we can think of. The addition of the sum type must form a commutative monoid because we change the evaluation order. This holds of course for all intrinsic numeric types and certainly for almost all user-defined arithmetic types. But one is free to define an addition that is not commutative or not monoidal. In this case our transformation would be wrong. To deal with such exceptions we need semantic concepts, which will hopefully become part of C++ in the coming years.
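The listing of the array-based function breaks off at the end of the page. The following sketch shows one way the idea could be completed. It is an assumption about the continuation, not the author's code: the names one_norm_array and one_norm_array_ftor are hypothetical, v.size() stands in for the free function size(v), and the order in which the partial sums are combined is one choice among several.

```cpp
#include <cmath>
#include <vector>

// Hypothetical functor: entry i+Offset is accumulated into its own partial
// sum sum[Offset], so consecutive additions do not depend on each other.
template <unsigned Offset, unsigned Max>
struct one_norm_array_ftor
{
    template <typename S, typename V>
    void operator()(S* sum, const V& v, unsigned i)
    {
        using std::abs;
        sum[Offset]+= abs(v[i+Offset]);
        one_norm_array_ftor<Offset+1, Max>()(sum, v, i);
    }
};

// Specialization that terminates the recursion.
template <unsigned Max>
struct one_norm_array_ftor<Max, Max>
{
    template <typename S, typename V>
    void operator()(S*, const V&, unsigned) {}
};

template <unsigned BSize, typename Vector>
typename Vector::value_type one_norm_array(const Vector& v)
{
    static_assert(BSize > 0, "block size must be positive");
    using std::abs;

    typename Vector::value_type sum[BSize];
    for (unsigned i= 0; i < BSize; i++)
        sum[i]= 0;

    // Assumption: v.size() replaces the free function size(v) of the text.
    const unsigned s= static_cast<unsigned>(v.size()), sb= s / BSize * BSize;
    for (unsigned i= 0; i < sb; i+= BSize)
        one_norm_array_ftor<0, BSize>()(sum, v, i);

    for (unsigned i= sb; i < s; i++)      // entries that do not fill a block
        sum[0]+= abs(v[i]);
    for (unsigned i= 1; i < BSize; i++)   // combine the partial sums
        sum[0]+= sum[i];
    return sum[0];
}
```

Because the entries are now added in a different order than in the single-variable version, the result is only guaranteed to match when the addition behaves as described in footnote 24.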
