RAMSES on the GPU


gputechconf.com

Cosmological simulations and data analysis

Numerical simulations represent an extraordinary tool to study and to solve astrophysical problems. They require:
• Sophisticated simulation codes, including all the necessary physics and adopting suitable and effective algorithms
• Data processing, analysis and visualization tools to process the enormous amount of generated information
• High-end HPC architectures that provide the necessary computing power

© CSCS 2013 - Claudio Gheller


But...
• Can these applications efficiently exploit GPU accelerated systems?
• If so, what performance can we get out of them?

Three cases under investigation:
• RAMSES
• SPLOTCH
• (ENZO)

System used for our experiments: "TODI", a Cray XK7 system at CSCS: 272 nodes, each with a 16-core AMD Opteron CPU, 32 GB of DDR3 memory and one NVIDIA Tesla K20X GPU with 6 GB of GDDR5 memory, for a total of 4352 cores and 272 GPUs.


The computing node


RAMSES on the GPU

Various ongoing efforts to develop RAMSES on GPUs:
1. CUDA implementation of the hydro solver on a regular mesh (no AMR; P. Kestener et al.)
2. CUDA implementation of ATON, the radiative transfer and ionization solver (D. Aubert et al.)
3. OpenACC implementation of the hydro solver on an AMR grid


RAMSES on the GPU – 1 (D. Aubert, CDS, and R. Teyssier, Zurich)

The ATON module: an example of decoupled CPU-GPU computation. Radiative transfer for reionization is solved on the GPU, so that each device handles the part most suitable for it. CUDA based.

Main steps:
– Solve the hydrodynamical step on the CPU
– The CPU (host) sends the hydro data to the GPU
– The GPU computes the radiation transport,
– then the chemistry,
– and finally the cooling
– The host starts the next time step

Test: ionization of a uniform medium by a point source (64^3 example)

CPU version          GPU version
1 core    247 s      1 GPU    1.7 s
8 cores    34 s
16 cores   17 s

145x SPEEDUP (1 GPU vs. 1 CPU core)!


RAMSES on the GPU – 2 (C. Gheller, R. Teyssier, Alister Hart)

Directive based approach → OpenACC
Focused on the hydro core solver.
First attempt:


Results

Sedov blast wave test (hydro only):

ACC        | NVECTOR | Ttot (s) | Tacc (s) | Ttransf (s) | Eff. speed-up
OFF (1 PE) | –       | 94.54    | 0        | 0           | –
ON         | 512     | 55.83    | 38.22    | 9.2         | 2.01
ON         | 1024    | 45.66    | 29.27    | 9.2         | 2.67
ON         | 2048    | 42.08    | 25.36    | 9.2         | 3.07
ON         | 4096    | 41.32    | 23.20    | 9.2         | 3.29
ON         | 8192    | 41.19    | 23.15    | 9.2         | 3.30

20 GB transferred in/out.

POOR performance improvement. Four main problems:
• Data structure, irregular data distribution
• Amount of transferred data
• Low flops-per-byte ratio
• Asynchronous operations not permitted


RAMSES on the GPU – 3

Directive based approach → OpenACC
Focused on the hydro core solver.
Second attempt (ongoing):


The SPLOTCH code

Splotch is a ray-tracing algorithm for the effective visualization of large astrophysical datasets coming from particle-based simulations. It is based on the solution of the radiative transport equation along each line of sight (schematically, dI/dr = E(r) − A(r) I(r), with an emission term E and an absorption term A), where the density is calculated by smoothing the particle quantity over a "proper" neighborhood with a Gaussian distribution function.

© CSCS 2013 - Claudio Gheller


Splotch Workflow


Splotch on the GPU (C. Gheller, M. Rivi, M. Krokos)

1. Offload data (in chunks) to the GPU
2. Rasterization: an efficient one-thread-per-particle approach
3. Rendering: a refactoring of the algorithm is required to solve the following issues:
   • the uneven granularity of the particles can compromise execution times
   • each particle accesses the image randomly, so concurrent accesses to the same pixel may occur
4. Copy the final image back to the CPU and compose it with the one produced by the host


GPU algorithm design

1. Classify particles according to their size:
   – Small (1 pixel)
   – Regular (1 < size < size_max)
   – Large (size > size_max)
2. Process the different particle classes with different schemes:
   – Small: one thread per particle + frame buffer + final reduction (Thrust-based sort+reduce)
   – Regular: tiling-based scheme:
     1. Divide the image into tiles of appropriate size (each particle is entirely contained in a tile)
     2. Sort particles by tile index (Thrust)
     3. Compute each tile (tile → block): one pixel per thread
     4. Compose the tiles
   – Large: copy back to the CPU and overlap with the GPU calculation


CUDA tests – data dependency

Performance depends on the particle radius distribution and on its expectation value R_0.
• Image resolution fixed at N_pix = 1000
• 7 different camera positions, closer and closer to the box center

Dataset: a medium-sized (~7.5 GB) cosmological N-body simulation performed with the Gadget code:
• 200M dark matter particles
• 200M baryonic matter particles
• 10M star particles


CUDA tests: GPU speed-up and overhead

Kernel speed-ups (GPU/CPU) and specific GPU-related overheads:
• Rasterization is strongly accelerated: the speed-up ranges from 45 to 65
• The maximum rendering speed-up is 13, for R_0 around unity
• Overheads can be up to 50% of the total computing time


CUDA tests: overall speed-up

Overall computing time per particle vs. average particle radius:
• Bi-modal behaviour
• For R_s < 2, performance is 6x that of one CPU (8 cores, OpenMP)
• For R_s > 2, the rendering of large particles on the CPU becomes relevant


Summary

• Numerical simulations and data processing in astrophysics require huge computing resources, and GPUs can provide the necessary power
• Two codes are under investigation, one of each type: RAMSES and SPLOTCH
• Results and conclusions are quite "heterogeneous":
  1. It is hard to move a full code effectively onto the accelerator
  2. The performance of specific algorithmic components can be strongly improved by the GPU (even by orders of magnitude)
  3. Sometimes code refactoring introduces overheads that penalize the final performance
  4. GPU refactoring is in general demanding, but it finally pays off...


References

Posters @ GTC:
• P0166: RAMSES on the GPU: Accelerating Cosmological Simulations
• P0205: CUDA Splotch: GPU Accelerated Astrophysical Visualization

RAMSES web site: http://www.itp.uzh.ch/~teyssier/Site/RAMSES.html
Splotch web site: http://www.mpa-garching.mpg.de/~kdolag/Splotch

Papers:
• K. Dolag, M. Reinecke, C. Gheller, S. Imboden, "Splotch: Visualizing Cosmological Simulations", New Journal of Physics, 10(12) (2008)
• Z. Jin, M. Krokos, M. Rivi, C. Gheller, K. Dolag, M. Reinecke, "High-Performance Astrophysical Visualization using Splotch", Procedia Computer Science 1 (2010), pp. 1775-1784
• M. Rivi, C. Gheller, M. Krokos, "GPU Accelerated Particle Visualization with Splotch", in preparation
