19.06.2015 Views

Auto-generating optimized CUDA for stencil ... - FEniCS Project


[Figure: Gflops (axis 0 to 50) for the Heat 7pt, Poisson 7pt, Variable 7pt, and Poisson 19pt stencils, comparing Mint with Hand-CUDA on Tesla C1060 and Tesla C2050.]

On Tesla C1060, Mint achieves 79% of the hand-optimized CUDA.

On Tesla C2050 (Fermi), Mint achieves 76% of the hand-optimized CUDA.
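
For readers unfamiliar with the kernels benchmarked here, the following is a minimal sketch in plain C of the kind of loop nest a 7-point heat stencil involves. The function name `heat7pt_sweep`, the `IDX` macro, the coefficients `c0` and `c1`, and the cubic grid layout are illustrative assumptions; they are not taken from the slides or from the Mint test codes.

```c
#include <stddef.h>

/* Hypothetical layout: flattened n*n*n grid, x index varies fastest. */
#define IDX(i, j, k, n) ((size_t)(k) * (n) * (n) + (size_t)(j) * (n) + (size_t)(i))

/* One Jacobi-style sweep of a 7-point stencil over the interior points.
 * c0 weights the center point; c1 weights each of the six face neighbors.
 * Boundary points of unew are left untouched. */
void heat7pt_sweep(const double *u, double *unew, int n, double c0, double c1)
{
    for (int k = 1; k < n - 1; k++)
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < n - 1; i++)
                unew[IDX(i, j, k, n)] =
                    c0 * u[IDX(i, j, k, n)] +
                    c1 * (u[IDX(i - 1, j, k, n)] + u[IDX(i + 1, j, k, n)] +
                          u[IDX(i, j - 1, k, n)] + u[IDX(i, j + 1, k, n)] +
                          u[IDX(i, j, k - 1, n)] + u[IDX(i, j, k + 1, n)]);
}
```

A time-stepping driver would call this once per step and swap the `u` and `unew` pointers between steps. Each interior point reads seven values and writes one, which is why the memory-access optimizations of the GPU version matter so much for the Gflops figures reported.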

The Mint optimizer improves the performance 2.6x over the Mint baseline.

[Figure: Gflop/s (axis 0 to 80) for 1, 8, 16, and 32 MPI processes (8 Nehalem cores per node, 4 nodes) and for Mint baseline, Mint optimizer, and Hand CUDA on a Tesla C2050 GPU; annotation: 2.6x.]

• Petascale anelastic wave propagation code
  › Used by researchers at the Southern CA Earthquake Center
  › Earthquake-induced seismic wave propagation
• Gordon Bell Prize finalist at SC’10
  › Yifeng Cui and Jun Zhou at SDSC
• Refers to 31 three-dimensional arrays
  › Asymmetric 13-point stencil
• Time-consuming loops: 185 lines
• Generated CUDA code: 1185 lines

The Minted code on a single GPU is slightly faster than 32 cores.

[Figure: Gflop/s (axis 0 to 80); annotation: 32 MPI processes.]

• D. Unat, J. Zhou, Y. Cui, X. Cai, and S. Baden. “Accelerating a 3D Finite Difference Earthquake Simulation with a C-to-CUDA Translator”, in Computing in Science and Engineering Journal, 2012.

The Minted code achieves 83% of the hand-optimized CUDA.

[Figure: Gflop/s (axis 0 to 80) for 1, 8, 16, and 32 MPI processes (8 Nehalem cores per node, 4 nodes) and for Mint baseline, Mint optimizer, and Hand CUDA on a Tesla C2050 GPU; annotation: 83%.]
