C++ for Scientists - Technische Universität Dresden
CHAPTER 5. META-PROGRAMMING

        }
        return sum;

There is one piece still missing: we need to reduce the partial sums in multi_sum. Unfortunately, we cannot write a loop over the members of multi_sum, so we need a recursive function that dives down into multi_sum. This would be a bit cumbersome as a free function, especially as we try to avoid partial specialization of function templates. As a member function, it is much easier, and the specialization happens more safely at the class level:

    template <unsigned BSize, typename Value>
    struct multi_tmp
    {
        Value sum() const { return value + sub.sum(); }
    };

    template <typename Value>
    struct multi_tmp<0, Value>
    {
        Value sum() const { return 0; }
    };

Note that we start the summation with 0, not with the innermost value member. We could do the latter, but then we would need another specialization of multi_tmp. Likewise, we can implement a general reduction, but, as in std::accumulate, we need an initial element:

    template <unsigned BSize, typename Value>
    struct multi_tmp
    {
        template <typename Op>
        Value reduce(Op op, const Value& init) const
        { return op(value, sub.reduce(op, init)); }
    };

    template <typename Value>
    struct multi_tmp<0, Value>
    {
        template <typename Op>
        Value reduce(Op, const Value& init) const { return init; }
    };

The compute time of this version is:

    Compute time one_norm(v) is 7.47 µs.
    Compute time one_norm(v) is 1.14 µs.
    Compute time one_norm(v) is 0.71 µs.
    Compute time one_norm(v) is 0.75 µs.
    Compute time one_norm(v) is 1.01 µs.

Pushing Temporaries into Registers

⇒ reduction_unroll_registers_example.cpp
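For reference, the nested-class reduction that this subsection compares against can be tried in isolation. The following is a minimal sketch, not the book's listing: the data members value and sub, the constructors, and the helper make_demo are assumptions added so that the two member functions compile on their own. A lambda is used for the reduction, so C++11 or later is assumed.

```cpp
#include <algorithm>
#include <cassert>

// Nested accumulator: multi_tmp<BSize, Value> stores BSize values of type Value.
// The members and constructors are assumptions; only sum() and reduce() are
// taken from the text.
template <unsigned BSize, typename Value>
struct multi_tmp
{
    explicit multi_tmp(const Value& v) : value(v), sub(v) {}

    // Adds this level's value to the sum of all inner levels.
    Value sum() const { return value + sub.sum(); }

    // General reduction: op is applied from the innermost level outwards.
    template <typename Op>
    Value reduce(Op op, const Value& init) const
    {
        return op(value, sub.reduce(op, init));
    }

    Value                       value;
    multi_tmp<BSize - 1, Value> sub;
};

// Recursion end: an empty level contributes 0 to the sum and init to a reduction.
template <typename Value>
struct multi_tmp<0, Value>
{
    explicit multi_tmp(const Value&) {}

    Value sum() const { return Value(0); }

    template <typename Op>
    Value reduce(Op, const Value& init) const { return init; }
};

// Hypothetical helper: three partial sums holding 1.0, 2.5 and 4.0.
inline multi_tmp<3, double> make_demo()
{
    multi_tmp<3, double> t(0.0);
    t.value         = 1.0;
    t.sub.value     = 2.5;
    t.sub.sub.value = 4.0;
    return t;
}

// Hypothetical helper: maximum of the stored values via reduce().
inline double demo_max()
{
    return make_demo().reduce(
        [](double a, double b) { return std::max(a, b); }, 0.0);
}
```

Calling make_demo().sum() adds the three stored values, while demo_max() shows that the same recursion serves any binary operation once an initial element is supplied.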
5.4. META-TUNING: WRITE YOUR OWN COMPILER OPTIMIZATION

Earlier experiments with older compilers (gcc 3.4) [25] exposed a serious overhead for using arrays or nested classes; in the end it was even slower than using one single variable. The reason was probably that the compiler could not use registers for these types. [26] The most likely way to store temporaries in registers is to declare them as separate variables:

    inline typename Vector::value_type one_norm(const Vector& v)
    {
        typename Vector::value_type s0(0), s1(0), s2(0), ...
    }

As one can see, the problem is how many variables to declare. The number cannot depend on the template argument but must be fixed for all sizes (unless one writes a different implementation for each number and undermines the expressiveness of templates). Thus, we have to fix a certain number of variables, say 8. Then we cannot unroll the loop more than eight times.

The next issue we run into is the number of function arguments. When we call the iteration block, we pass all variables (registers):

    for (unsigned i= 0; i < sb; i+= BSize)
        one_norm_ftor<0, BSize>()(s0, s1, s2, s3, s4, s5, s6, s7, v, i);

The first calculation in such a block is performed on s0, and s1 through s7 are only passed on to the functors for the following computations. After this, the second computation must accumulate on the second function argument, the third computation on the third argument, and so on. This is unfortunately not implementable with templates (only with very ugly and highly error-prone source code manipulations by macros).

Alternatively, each computation is performed on its first function argument, and subsequent functors are called with the first argument omitted:

    one_norm_ftor<1, BSize>()(s1, s2, s3, s4, s5, s6, s7, v, i);
    one_norm_ftor<2, BSize>()(s2, s3, s4, s5, s6, s7, v, i);
    one_norm_ftor<3, BSize>()(s3, s4, s5, s6, s7, v, i);

This cannot be realized with templates either.
The solution is to rotate the references to the registers:

    one_norm_ftor<1, BSize>()(s1, s2, s3, s4, s5, s6, s7, s0, v, i);
    one_norm_ftor<2, BSize>()(s2, s3, s4, s5, s6, s7, s0, s1, v, i);
    one_norm_ftor<3, BSize>()(s3, s4, s5, s6, s7, s0, s1, s2, v, i);

This rotation is achieved by the following functor implementation:

    template <unsigned Offset, unsigned Max>
    struct one_norm_ftor
    {
        template <typename S, typename V>
        void operator()(S& s0, S& s1, S& s2, S& s3, S& s4, S& s5, S& s6, S& s7,
                        const V& v, unsigned i)
        {
            using std::abs;
            s0+= abs(v[i+Offset]);
            one_norm_ftor<Offset+1, Max>()(s1, s2, s3, s4, s5, s6, s7, s0, v, i);

[25] TODO: Show!!!
[26] TODO: which raises the question of why they can do it today
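The functor listing above breaks off at the page boundary, so the following is only a sketch of how the rotation might be completed into a compilable whole. The terminating specialization one_norm_ftor<Max, Max>, the driver one_norm<BSize>, the remainder loop, and the use of std::size_t for indices are assumptions that do not appear in this excerpt. The static_assert requires C++11.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Each instantiation accumulates into its FIRST argument and then passes the
// references on, rotated by one position, so the next instantiation sees the
// next accumulator first.
template <unsigned Offset, unsigned Max>
struct one_norm_ftor
{
    template <typename S, typename V>
    void operator()(S& s0, S& s1, S& s2, S& s3, S& s4, S& s5, S& s6, S& s7,
                    const V& v, std::size_t i)
    {
        using std::abs;
        s0 += abs(v[i + Offset]);
        one_norm_ftor<Offset + 1, Max>()(s1, s2, s3, s4, s5, s6, s7, s0, v, i);
    }
};

// Assumed recursion end: once Offset reaches Max, nothing is left to do.
template <unsigned Max>
struct one_norm_ftor<Max, Max>
{
    template <typename S, typename V>
    void operator()(S&, S&, S&, S&, S&, S&, S&, S&, const V&, std::size_t) {}
};

// Assumed driver: unrolls the loop BSize times with eight fixed accumulators.
template <unsigned BSize, typename Vector>
typename Vector::value_type one_norm(const Vector& v)
{
    static_assert(BSize >= 1 && BSize <= 8, "only eight accumulators available");
    using std::abs;
    typedef typename Vector::value_type value_type;

    value_type s0(0), s1(0), s2(0), s3(0), s4(0), s5(0), s6(0), s7(0);

    const std::size_t n  = v.size();
    const std::size_t sb = n / BSize * BSize;   // largest multiple of BSize <= n

    for (std::size_t i = 0; i < sb; i += BSize)
        one_norm_ftor<0, BSize>()(s0, s1, s2, s3, s4, s5, s6, s7, v, i);

    // Combine the partial sums; accumulators beyond BSize are still zero.
    s0 += s1 + s2 + s3 + s4 + s5 + s6 + s7;

    // Handle the entries that did not fill a complete block.
    for (std::size_t i = sb; i < n; ++i)
        s0 += abs(v[i]);

    return s0;
}
```

A call such as one_norm<4>(v) on a std::vector<double> uses four of the eight accumulators per block. Whether the compiler actually keeps them in registers depends on the compiler and its optimization settings, so timings should be measured rather than assumed.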