RAMSES on the GPU


gputechconf.com

Cosmological simulations and data analysis

Numerical simulations represent an extraordinary tool to study and to solve astrophysical problems. They require:
• Sophisticated simulation codes, including all the necessary physics and adopting suitable and effective algorithms
• Data processing, analysis and visualization tools to process the enormous amount of generated information
• High-end HPC architectures that provide the necessary computing power

© CSCS 2013 - Claudio Gheller


But...
• Can these applications efficiently exploit GPU accelerated systems?
• If so, what performance can we get out of them?

Three cases under investigation:
• RAMSES
• SPLOTCH
• (ENZO)

System used for our experiments: "TODI", a Cray XK7 system at CSCS: 272 nodes, each with a 16-core AMD Opteron CPU, 32 GB of DDR3 memory and one NVIDIA Tesla K20X GPU with 6 GB of GDDR5 memory, for a total of 4352 cores and 272 GPUs.


The computing node


RAMSES on the GPU

Various ongoing efforts to develop RAMSES on GPUs:
1. CUDA implementation of the hydro solver on a regular mesh (no AMR; P. Kestener et al.)
2. CUDA implementation of ATON, the radiative transfer and ionization solver (D. Aubert et al.)
3. OpenACC implementation of the hydro solver on an AMR grid


RAMSES on the GPU – 1 (D. Aubert, CDS, and R. Teyssier, Zurich)

The ATON module: an example of decoupled CPU-GPU computation. Radiative transfer for reionization is solved on the GPU, so that each device handles the part most suitable for it. CUDA based.

Main steps:
– Solve the hydrodynamical step on the CPU
– The CPU (host) sends the hydro data to the GPU
– The GPU computes the radiation transport,
– then the chemistry,
– and finally the cooling
– The host starts the next time step

Test: ionization of a uniform medium by a point source (64^3 example)

CPU version          GPU version
1 core    247 s      1 GPU    1.7 s
8 cores    34 s
16 cores   17 s

145x SPEEDUP (1 GPU vs. 1 CPU core)!


RAMSES on the GPU – 2 (C. Gheller, R. Teyssier, Alister Hart)

Directive based approach → OpenACC
Focused on the hydro core solver.
First attempt:


Results

Sedov blast wave test (hydro only):

ACC        | NVECTOR | Ttot (s) | Tacc (s) | Ttransf (s) | Eff. speed-up
OFF (1 PE) | –       | 94.54    | 0        | 0           | –
ON         | 512     | 55.83    | 38.22    | 9.2         | 2.01
ON         | 1024    | 45.66    | 29.27    | 9.2         | 2.67
ON         | 2048    | 42.08    | 25.36    | 9.2         | 3.07
ON         | 4096    | 41.32    | 23.20    | 9.2         | 3.29
ON         | 8192    | 41.19    | 23.15    | 9.2         | 3.30

20 GB transferred in/out.

POOR performance improvement. Four main problems:
• Data structure, irregular data distribution
• Amount of transferred data
• Low flops-per-byte ratio
• Asynchronous operations not permitted


RAMSES on the GPU – 3

Directive based approach → OpenACC
Focused on the hydro core solver.
Second attempt (ongoing):


The SPLOTCH code

Splotch is a ray-tracing algorithm for the effective visualization of large astrophysical datasets coming from particle-based simulations. It is based on the solution of the radiative transport equation along each line of sight (schematically, dI/dr = E(r) − A(r) I(r), with an emission term E and an absorption term A), where the density is calculated by smoothing the particle quantity over a "proper" neighborhood with a Gaussian distribution function.

© CSCS 2013 - Claudio Gheller


Splotch Workflow


Splotch on the GPU (C. Gheller, M. Rivi, M. Krokos)

1. Offload data (in chunks) to the GPU
2. Rasterization: an efficient one-thread-per-particle approach
3. Rendering: a refactoring of the algorithm is required to solve the following issues:
   • the uneven granularity of the particles can compromise execution times
   • each particle accesses the image randomly, so concurrent accesses to the same pixel may occur
4. Copy the final image back to the CPU and compose it with the one produced by the host


GPU algorithm design

1. Classify particles according to their size:
   – Small (1 pixel)
   – Regular (1 < size < size_max)
   – Large (size > size_max)
2. Process the different particle classes with different schemes:
   – Small: one thread per particle + frame buffer + final reduction (Thrust-based sort+reduce)
   – Regular: tiling-based scheme:
     1. Divide the image into tiles of appropriate size (each particle is entirely contained in a tile)
     2. Sort particles by tile index (Thrust)
     3. Compute each tile (tile → block): one pixel per thread
     4. Compose the tiles
   – Large: copy back to the CPU and overlap with the GPU calculation


CUDA tests – data dependency

Performance depends on the particle radius distribution and on its expectation value R_0.
• Image resolution fixed at N_pix = 1000
• 7 different camera positions, closer and closer to the box center

Dataset: a medium-sized (~7.5 GB) cosmological N-body simulation performed with the Gadget code:
• 200M dark matter particles
• 200M baryonic matter particles
• 10M star particles


CUDA tests: GPU speed-up and overhead

Kernel speed-ups (GPU/CPU) and specific GPU-related overheads:
• Rasterization is strongly accelerated: the speed-up ranges from 45 to 65
• The maximum rendering speed-up is 13, for R_0 around unity
• Overheads can be up to 50% of the total computing time


CUDA tests: overall speed-up

Overall computing time per particle vs. average particle radius:
• Bi-modal behaviour
• For R_s < 2, performance is 6x that of one CPU (8 cores, OpenMP)
• For R_s > 2, the rendering of large particles on the CPU becomes relevant


Summary

• Numerical simulations and data processing in astrophysics require huge computing resources, and GPUs can provide the necessary power
• Two codes are under investigation, one of each type: RAMSES and SPLOTCH
• Results and conclusions are quite "heterogeneous":
  1. It is hard to move a full code effectively onto the accelerator
  2. The performance of specific algorithmic components can be strongly improved by the GPU (even by orders of magnitude)
  3. Sometimes code refactoring introduces overheads that penalize the final performance
  4. GPU refactoring is in general demanding, but it finally pays off...


References

Posters @ GTC:
• P0166: RAMSES on the GPU: Accelerating Cosmological Simulations
• P0205: CUDA Splotch: GPU Accelerated Astrophysical Visualization

RAMSES web site: http://www.itp.uzh.ch/~teyssier/Site/RAMSES.html
Splotch web site: http://www.mpa-garching.mpg.de/~kdolag/Splotch

Papers:
• K. Dolag, M. Reinecke, C. Gheller, S. Imboden, "Splotch: Visualizing Cosmological Simulations", New Journal of Physics, 10(12) (2008)
• Z. Jin, M. Krokos, M. Rivi, C. Gheller, K. Dolag, M. Reinecke, "High-Performance Astrophysical Visualization using Splotch", Procedia Computer Science 1 (2010), pp. 1775-1784
• M. Rivi, C. Gheller, M. Krokos, "GPU Accelerated Particle Visualization with Splotch", in preparation
