C++ for Scientists - Technische Universität Dresden
168 CHAPTER 5. META-PROGRAMMING

The answer is yes. All of them. The main reason (but not the only one) is that different processors have different numbers of registers. How many registers are needed in one iteration depends on the expression and on the types (a complex value needs more registers than a float). In the following section we address both issues: how to encapsulate the transformation so that it does not show up in the application, and how we can change the block size without rewriting the loop.

5.4.4 Unrolling Vector Expressions

For easier understanding, we discuss the abstraction in meta-tuning step by step. We start with the previous loop and implement a function for it. Say the function's name is my_axpy and it has a template argument for the block size, so that we can write, for instance:

    for (unsigned j= 0; j < rep; j++)
        my_axpy<2>(u, v, w);

This function shall contain a main loop in unrolled manner with customizable block size and a clean-up loop at the end:

    template <unsigned BSize, typename U, typename V, typename W>
    void my_axpy(U& u, const V& v, const W& w)
    {
        assert(u.size() == v.size() && v.size() == w.size());
        unsigned s= u.size(), sb= s / BSize * BSize;

        for (unsigned i= 0; i < sb; i+= BSize)
            my_axpy_ftor<0, BSize>()(u, v, w, i);
        for (unsigned i= sb; i < s; i++)
            u[i]= 3.0f * v[i] + w[i];
    }

As mentioned before, deduced template types, like the vector types in our case, must be placed at the end, and the explicitly given arguments, in our case the block size, must come at the beginning of the template parameter list. The block statement in the first loop can be implemented similarly to the functor in Section 5.4.1. We deviate a bit from that implementation by using two template arguments, where the first is increased until it equals the second. This approach appeared to yield faster binaries on gcc than using only one argument and counting it down to zero.²² In addition, the two-argument version is more consistent with the multi-dimensional implementation in Section ??.
As for fixed-size unrolling, we need a recursive template definition. Within the operator, a single statement is performed and the following statements are called:

    template <unsigned Offset, unsigned Max>
    struct my_axpy_ftor
    {
        template <typename U, typename V, typename W>
        void operator()(U& u, const V& v, const W& w, unsigned i)
        {
            u[i+Offset]= 3.0f * v[i+Offset] + w[i+Offset];
            my_axpy_ftor<Offset+1, Max>()(u, v, w, i);
        }
    };

The only difference to fixed-size unrolling is that the indices are relative to an argument, here i. The operator() is first called with Offset equal to 0, then with 1, 2, . . . Since each call is inlined, the functor call results in one monolithic block of operations without loop control and function calls. Thus, the call of my_axpy_ftor<0, BSize>()(u, v, w, i) performs the same operations as one iteration of the first loop in Listing 5.4. Of course, the compilation would end in an infinite loop if we forgot to specialize it for Max:

    template <unsigned Max>
    struct my_axpy_ftor<Max, Max>
    {
        template <typename U, typename V, typename W>
        void operator()(U& u, const V& v, const W& w, unsigned i) {}
    };

Performing the considered vector operation with different unrollings yields:

    Compute time unrolled loop is 1.44 µs.
    Compute time unrolled loop is 1.15 µs.
    Compute time unrolled loop is 1.15 µs.
    Compute time unrolled loop is 1.14 µs.

Now we can call this operation for any block size we like. On the other hand, it is rather cumbersome to implement the corresponding functions and functors for each vector expression. Therefore, we now combine this technique with expression templates.

5.4.5 Tuning an Expression Template

⇒ vector_unroll_example2.cpp

Let us recall Section 5.3.3. So far, we have developed a vector class with expression templates for vector sums. In the same manner we could implement the product of a scalar and a vector, but we leave this as an exercise and consider expressions with addition only, for example:

    u = v + v + w

Now we frame this vector operation with a repeating loop and the time measurement:

    boost::timer t;
    for (unsigned j= 0; j < rep; j++)
        u= v + v + w;
    std::cout << "Compute time is " << 1000000.0 * t.elapsed() / double(rep) << " µs.\n";

This results in:

    Compute time is 1.72 µs.

To incorporate meta-tuning into expression templates, we only need to modify the actual assignment, because only there a loop is performed. All the other operations (well so far we have

²² TODO: exercise for it