03.12.2012

C++ for Scientists - Technische Universität Dresden


CHAPTER 5. META-PROGRAMMING

Compute time mult(A, B, C) is 5250 µs. This are 795.794 MFlops.
Compute time mult(A, B, C) is 2770 µs. This are 1508.27 MFlops.
Compute time mult(A, B, C) is 1990 µs. This are 2099.46 MFlops.
Compute time mult(A, B, C) is 2230 µs. This are 1873.51 MFlops.
Compute time mult(A, B, C) is 2130 µs. This are 1961.46 MFlops.
Compute time mult(A, B, C) is 2930 µs. This are 1425.91 MFlops.
Compute time mult(A, B, C) is 2350 µs. This are 1777.84 MFlops.
Compute time mult(A, B, C) is 3420 µs. This are 1221.61 MFlops.
Compute time mult(A, B, C) is 4010 µs. This are 1041.88 MFlops.
Compute time mult(A, B, C) is 2870 µs. This are 1455.72 MFlops.
Compute time mult(A, B, C) is 3230 µs. This are 1293.47 MFlops.
Compute time mult(A, B, C) is 3060 µs. This are 1365.33 MFlops.
Compute time mult(A, B, C) is 2780 µs. This are 1502.85 MFlops.

One can see that mult has the same performance as the original implementation, which in fact
performs the operations in exactly the same order (as long as the compiler optimization does
not change the order internally). We also see that the unrolled versions are all faster, with
a speed-up of up to 2.6.
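
Timings and MFlops figures of this kind can be obtained with a small harness. The following is only a minimal sketch and not the benchmark code of this script: the names mult_naive, flop_count, mflops and time_microseconds are hypothetical, and the operation count (2n - 1)n² is an assumption. With n = 128 it is consistent with the printed figures (4177920 operations in 5250 µs give 795.794 MFlops), but the matrix size is not stated on this page.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the multiplication being measured:
// C = A * B for n x n matrices stored row-major in flat vectors.
template <typename Value>
void mult_naive(const std::vector<Value>& A, const std::vector<Value>& B,
                std::vector<Value>& C, std::size_t n)
{
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            Value sum = 0;
            for (std::size_t j = 0; j < n; ++j)
                sum += A[i * n + j] * B[j * n + k];
            C[i * n + k] = sum;
        }
}

// Assumed operation count: each of the n*n entries of C needs
// n multiplications and n-1 additions.
inline double flop_count(std::size_t n)
{
    return (2.0 * n - 1.0) * n * n;
}

// Operations per microsecond equal millions of operations per second.
inline double mflops(double flops, double microseconds)
{
    assert(microseconds > 0.0);
    return flops / microseconds;
}

// Average wall-clock time of one call to f, in microseconds.
template <typename F>
double time_microseconds(F f, int repetitions = 10)
{
    assert(repetitions > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; ++r)
        f();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / repetitions;
}
```

A call such as time_microseconds([&]{ mult_naive(A, B, C, n); }) yields the time t, and mflops(flop_count(n), t) yields the rate. The speed-up between two variants is the ratio of their times, for instance 5250 / 1990 ≈ 2.6 for the first and third line above.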

With double matrices, the overall performance is lower:

Compute time mult(A, B, C) is 10080 µs. This are 414.476 MFlops.
Compute time mult(A, B, C) is 8700 µs. This are 480.221 MFlops.
Compute time mult(A, B, C) is 7470 µs. This are 559.293 MFlops.
Compute time mult(A, B, C) is 5910 µs. This are 706.924 MFlops.
Compute time mult(A, B, C) is 3750 µs. This are 1114.11 MFlops.
Compute time mult(A, B, C) is 5140 µs. This are 812.825 MFlops.
Compute time mult(A, B, C) is 3420 µs. This are 1221.61 MFlops.
Compute time mult(A, B, C) is 4590 µs. This are 910.222 MFlops.
Compute time mult(A, B, C) is 4310 µs. This are 969.355 MFlops.
Compute time mult(A, B, C) is 6280 µs. This are 665.274 MFlops.
Compute time mult(A, B, C) is 5310 µs. This are 786.802 MFlops.
Compute time mult(A, B, C) is 4290 µs. This are 973.874 MFlops.
Compute time mult(A, B, C) is 3490 µs. This are 1197.11 MFlops.

This shows that, for double, other parametrizations yield the largest acceleration and that the
performance can almost be tripled.

Which configuration is best, and why, is (as mentioned before) not the topic of this script; we
only show programming techniques. The reader is invited to try this program on his or her own
computer. The technique in this section is intended for L1 cache usage. If the matrices are
larger, one should use more levels of blocking. A general-purpose methodology for locality on
L2, L3, main memory, local disk, ... is recursion. This avoids reimplementation for each cache
size and performs reasonably well even in virtual memory, see for instance [?].
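
The recursive approach can be sketched as follows. This is a minimal illustration under stated assumptions, not code from this script: the name mult_recursive, the wrapper multiply, the row-major layout with a leading dimension ld, the restriction to square matrices whose size is a power of two, and the default base-case size of 32 are all hypothetical choices. The matrices are split into quadrants until a block is small enough for a simple kernel, so every cache level is reused at some recursion depth without tuning for its size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C += A * B for n x n blocks inside row-major matrices whose rows
// are ld entries apart. Assumption: n is a power of two and base >= 1.
inline void mult_recursive(const double* A, const double* B, double* C,
                           std::size_t n, std::size_t ld,
                           std::size_t base = 32)
{
    assert(base >= 1);
    assert(n > 0 && (n & (n - 1)) == 0);

    if (n <= base) {
        // Base case: plain kernel on a block that is assumed to fit in cache.
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < n; ++k) {
                const double a = A[i * ld + k];
                for (std::size_t j = 0; j < n; ++j)
                    C[i * ld + j] += a * B[k * ld + j];
            }
        return;
    }

    // Split each matrix into four h x h quadrants by pointer offsets.
    const std::size_t h = n / 2;
    const double *A11 = A, *A12 = A + h, *A21 = A + h * ld, *A22 = A21 + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h * ld, *B22 = B21 + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h * ld, *C22 = C21 + h;

    // Each quadrant of C accumulates two quadrant products.
    mult_recursive(A11, B11, C11, h, ld, base);
    mult_recursive(A12, B21, C11, h, ld, base);
    mult_recursive(A11, B12, C12, h, ld, base);
    mult_recursive(A12, B22, C12, h, ld, base);
    mult_recursive(A21, B11, C21, h, ld, base);
    mult_recursive(A22, B21, C21, h, ld, base);
    mult_recursive(A21, B12, C22, h, ld, base);
    mult_recursive(A22, B22, C22, h, ld, base);
}

// Convenience wrapper for whole n x n matrices in flat vectors: returns A * B.
inline std::vector<double> multiply(const std::vector<double>& A,
                                    const std::vector<double>& B,
                                    std::size_t n)
{
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<double> C(n * n, 0.0);
    mult_recursive(A.data(), B.data(), C.data(), n, n);
    return C;
}
```

The base-case size plays the role of the innermost block. Choosing it together with the unrolling parameters of this section would combine both techniques, and which value works best again depends on the machine.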
