Auto-generating optimized CUDA for stencil ... - FEniCS Project
[Figure: Performance in Gflops (axis 0 to 50) of Mint-generated versus hand-written CUDA for the Heat 7pt, Poisson 7pt, Variable 7pt, and Poisson 19pt stencils, with one panel for the Tesla C1060 and one for the Tesla C2050.]
On Tesla C1060, Mint achieves 79% of the hand-optimized CUDA.
On Tesla C2050 (Fermi), Mint achieves 76% of the hand-optimized CUDA.
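To make the benchmarked kernel class concrete, the following is a minimal sketch of a 7-point heat stencil in plain C, the kind of loop nest that a C-to-CUDA translator takes as input. It is not the code used in the talk: the function names (`idx`, `heat7pt_sweep`, `heat7pt_run`), the coefficients `c0` and `c1`, and the cubic grid with one ghost layer per side are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Flat index into an (n+2)^3 grid that carries one ghost layer per side.
 * Hypothetical helper, not part of any library. */
static size_t idx(size_t n, size_t i, size_t j, size_t k) {
    size_t s = n + 2;
    return (k * s + j) * s + i;
}

/* One Jacobi sweep of a 7-point stencil over the n^3 interior points.
 * c0 weights the centre point and c1 weights each of the six face
 * neighbours (illustrative coefficients, not taken from the slides).
 * Ghost cells of unew are left untouched. */
static void heat7pt_sweep(size_t n, const double *u, double *unew,
                          double c0, double c1) {
    assert(n > 0 && u != NULL && unew != NULL && u != unew);
    for (size_t k = 1; k <= n; k++)
        for (size_t j = 1; j <= n; j++)
            for (size_t i = 1; i <= n; i++)
                unew[idx(n, i, j, k)] =
                    c0 * u[idx(n, i, j, k)]
                  + c1 * (u[idx(n, i - 1, j, k)] + u[idx(n, i + 1, j, k)]
                        + u[idx(n, i, j - 1, k)] + u[idx(n, i, j + 1, k)]
                        + u[idx(n, i, j, k - 1)] + u[idx(n, i, j, k + 1)]);
}

/* Run `steps` sweeps, swapping the two buffers after each one, and return
 * the buffer that holds the latest result. Both buffers must have their
 * ghost cells initialised by the caller, because sweeps never write them. */
static double *heat7pt_run(size_t n, double *a, double *b, int steps,
                           double c0, double c1) {
    for (int t = 0; t < steps; t++) {
        heat7pt_sweep(n, a, b, c0, c1);
        double *tmp = a;
        a = b;
        b = tmp;
    }
    return a;
}
```

A caller would allocate two buffers of `(n+2)*(n+2)*(n+2)` doubles, fill the interior and ghost cells, and call `heat7pt_run`. Each interior point reads seven values and writes one, so the triple loop is the part a GPU version would map onto threads.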
The Mint optimizer improves the performance 2.6x over the Mint baseline.
[Figure: Performance in Gflop/s (axis 0 to 80) for 1, 8, 16, and 32 MPI processes (8 Nehalem cores per node, 4 nodes) compared with the Mint baseline, the Mint optimizer, and hand-written CUDA on a Tesla C2050 GPU; the Mint optimizer bar is annotated 2.6x.]
• Petascale anelastic wave propagation code
  › Used by researchers at the Southern California Earthquake Center
  › Earthquake-induced seismic wave propagation
• Gordon Bell Prize finalist at SC'10
  › Yifeng Cui and Jun Zhou at SDSC
• Refers to 31 three-dimensional arrays
  › Asymmetric 13-point stencil
• Time-consuming loops: 185 lines
• Generated CUDA code: 1185 lines
The Minted code on a single GPU is slightly faster than 32 cores.
[Figure: Performance in Gflop/s, with the 32 MPI processes result labelled.]
• D. Unat, J. Zhou, Y. Cui, X. Cai, and S. Baden. "Accelerating a 3D Finite Difference Earthquake Simulation with a C-to-CUDA Translator", in Computing in Science and Engineering Journal, 2012.
The Minted code achieves 83% of the hand-optimized CUDA.
[Figure: Performance in Gflop/s (axis 0 to 80) for 1, 8, 16, and 32 MPI processes (8 Nehalem cores per node, 4 nodes) compared with the Mint baseline, the Mint optimizer, and hand-written CUDA on a Tesla C2050 GPU; annotated 83%.]