C++ for Scientists - Technische Universität Dresden


168 CHAPTER 5. META-PROGRAMMING

The answer is yes. All of them. The main reason (but not the only one) is that different processors have different numbers of registers. How many registers are needed in one iteration depends on the expression and on the types (a complex value needs more registers than a float). In the following section we will address both issues: how to encapsulate the transformation so that it does not show up in the application and how we can change the block size without rewriting the loop.

5.4.4 Unrolling Vector Expressions

For easier understanding, we discuss the abstraction in meta-tuning step by step. We start with the previous loop and implement a function for it. Say the function's name is my_axpy and it has a template argument for the block size so that we can write, for instance:

    for (unsigned j= 0; j < rep; j++)
        my_axpy<2>(u, v, w);

This function shall contain a main loop unrolled by a customizable block size and a clean-up loop at the end:

    template <unsigned BSize, typename U, typename V, typename W>
    void my_axpy(U& u, const V& v, const W& w)
    {
        assert(u.size() == v.size() && v.size() == w.size());
        unsigned s= u.size(), sb= s / BSize * BSize;

        for (unsigned i= 0; i < sb; i+= BSize)
            my_axpy_ftor<0, BSize>()(u, v, w, i);
        for (unsigned i= sb; i < s; i++)
            u[i]= 3.0f * v[i] + w[i];
    }

As mentioned before, deduced template types, the vector types in our case, must be defined at the end, and the explicitly given arguments, in our case the block size, must be at the beginning of the template arguments. The block statement in the first loop can be implemented similarly to the functor in Section 5.4.1. We deviate a bit from that implementation by using two template arguments, where the former is increased until it equals the second. This approach yielded faster binaries on gcc than using only one argument and counting it down to zero.[22] In addition, the two-argument version is more consistent with the multi-dimensional implementation in Section ??.
As for fixed-size unrolling, we need a recursive template definition. Within the operator, a single statement is performed and then the following statements are called:

    template <unsigned Offset, unsigned Max>
    struct my_axpy_ftor
    {
        template <typename U, typename V, typename W>
        void operator()(U& u, const V& v, const W& w, unsigned i)
        {
            u[i+Offset]= 3.0f * v[i+Offset] + w[i+Offset];
            my_axpy_ftor<Offset+1, Max>()(u, v, w, i);
        }
    };

[22] TODO: exercise for it

The only difference to fixed-size unrolling is that the indices are relative to an argument, here i. The operator() is first called with Offset equal to 0, then with 1, 2, . . . Since each call is inlined, the functor call results in one monolithic block of operations without loop control or function calls. Thus, the call of my_axpy_ftor<0, BSize>()(u, v, w, i) performs the same operations as one iteration of the first loop in Listing 5.4. Of course, this compilation would end in an infinite recursion if we forgot to specialize it for Max:

    template <unsigned Max>
    struct my_axpy_ftor<Max, Max>
    {
        template <typename U, typename V, typename W>
        void operator()(U& u, const V& v, const W& w, unsigned i) {}
    };

Performing the considered vector operation with different unrollings yields:

    Compute time unrolled loop is 1.44 µs.
    Compute time unrolled loop is 1.15 µs.
    Compute time unrolled loop is 1.15 µs.
    Compute time unrolled loop is 1.14 µs.

Now we can call this operation for any block size we like. On the other hand, it is rather cumbersome to implement the corresponding functions and functors for each vector expression. Therefore, we now combine this technique with expression templates.

5.4.5 Tuning an Expression Template

⇒ vector_unroll_example2.cpp

Let us recall Section 5.3.3. So far, we have developed a vector class with expression templates for vector sums. In the same manner we can implement the product of a scalar and a vector, but we leave this as an exercise and consider expressions with addition only, for example:

    u = v + v + w

Now we frame this vector operation with a repeating loop and the time measurement:

    boost::timer t;
    for (unsigned j= 0; j < rep; j++)
        u= v + v + w;
    std::cout << "Compute time is " << 1000000.0 * t.elapsed() / double(rep) << " µs.\n";

This results in:

    Compute time is 1.72 µs.

To incorporate meta-tuning into expression templates, we only need to modify the actual assignment because only there a loop is performed. All the other operations (well, so far we have

