C++ for Scientists - Technische Universität Dresden
CHAPTER 5. META-PROGRAMMING
Compute time mult(A, B, C) is 5250 µs. This are 795.794 MFlops.
Compute time mult(A, B, C) is 2770 µs. This are 1508.27 MFlops.
Compute time mult(A, B, C) is 1990 µs. This are 2099.46 MFlops.
Compute time mult(A, B, C) is 2230 µs. This are 1873.51 MFlops.
Compute time mult(A, B, C) is 2130 µs. This are 1961.46 MFlops.
Compute time mult(A, B, C) is 2930 µs. This are 1425.91 MFlops.
Compute time mult(A, B, C) is 2350 µs. This are 1777.84 MFlops.
Compute time mult(A, B, C) is 3420 µs. This are 1221.61 MFlops.
Compute time mult(A, B, C) is 4010 µs. This are 1041.88 MFlops.
Compute time mult(A, B, C) is 2870 µs. This are 1455.72 MFlops.
Compute time mult(A, B, C) is 3230 µs. This are 1293.47 MFlops.
Compute time mult(A, B, C) is 3060 µs. This are 1365.33 MFlops.
Compute time mult(A, B, C) is 2780 µs. This are 1502.85 MFlops.
One can see that mult has the same performance as the original implementation, which in fact performs the operations in exactly the same order (as long as the compiler optimization does not reorder them internally). We also see that the unrolled versions are all faster, with a speed-up of up to 2.6.
With double matrices, the overall performance is lower:
Compute time mult(A, B, C) is 10080 µs. This are 414.476 MFlops.
Compute time mult(A, B, C) is 8700 µs. This are 480.221 MFlops.
Compute time mult(A, B, C) is 7470 µs. This are 559.293 MFlops.
Compute time mult(A, B, C) is 5910 µs. This are 706.924 MFlops.
Compute time mult(A, B, C) is 3750 µs. This are 1114.11 MFlops.
Compute time mult(A, B, C) is 5140 µs. This are 812.825 MFlops.
Compute time mult(A, B, C) is 3420 µs. This are 1221.61 MFlops.
Compute time mult(A, B, C) is 4590 µs. This are 910.222 MFlops.
Compute time mult(A, B, C) is 4310 µs. This are 969.355 MFlops.
Compute time mult(A, B, C) is 6280 µs. This are 665.274 MFlops.
Compute time mult(A, B, C) is 5310 µs. This are 786.802 MFlops.
Compute time mult(A, B, C) is 4290 µs. This are 973.874 MFlops.
Compute time mult(A, B, C) is 3490 µs. This are 1197.11 MFlops.
This shows that other parametrizations yield more acceleration here and that the performance can almost be tripled.
Which configuration is best, and why, is not the topic of this script, as mentioned before; we only show programming techniques. The reader is invited to try this program on his or her own computer. The technique in this section is intended for L1 cache usage. If the matrices are larger, one should use more levels of blocking. A general-purpose methodology for locality in L2, L3, main memory, local disk, and so on is recursion. It avoids a reimplementation for each cache size and even performs reasonably well in virtual memory; see for instance [?].