Hybrid MPI and OpenMP programming tutorial - Prace Training Portal

Outline

• Introduction / Motivation
• Programming models on clusters of SMP nodes
• Case studies / pure MPI vs. hybrid MPI+OpenMP
• Practical "how-to" on hybrid programming
• Mismatch problems
• Opportunities: application categories that can benefit from hybrid parallelization
• Thread-safety quality of MPI libraries
• Tools for debugging and profiling MPI+OpenMP
• Other options on clusters of SMP nodes
• Summary

Pure MPI – multi-core aware

• Hierarchical domain decomposition (or distribution of Cartesian arrays).
  Example on 10 nodes, each with 4 sockets, each with 6 cores:
  – Domain decomposition: 1 sub-domain per SMP node
  – Further partitioning: 1 sub-domain per socket, then 1 per core
  – Cache optimization: blocking inside each core; the block size relates to the cache size (1–3 cache levels)

How to achieve a hierarchical domain decomposition (DD)?

• Cartesian grids:
  – Several levels of subdivision
  – Ranking of MPI_COMM_WORLD – three choices:
    a) Sequential ranks through the original data structure
       + locating these ranks correctly on the hardware can be achieved with a one-level DD on the finest grid
       + special startup (mpiexec) with optimized rank mapping
    b) Sequential ranks in comm_cart (from MPI_CART_CREATE)
       requires an optimized MPI_CART_CREATE, or a special startup (mpiexec) with optimized rank mapping
       (see the first sketch below)
    c) Sequential ranks in MPI_COMM_WORLD
       + an additional communicator with sequential ranks in the data structure
       + self-written and optimized rank mapping
• Unstructured grids:
  – Multi-level DD:
    • Top-down: several levels of (Par)Metis – not recommended
    • Bottom-up: low-level DD + higher-level recombination
  – Single-level DD (finest level):
    • Analyze the communication pattern in a first run (with only a few iterations)
    • Apply an optimized rank mapping to the hardware before the production run
    • E.g., with CrayPAT + CrayApprentice
    (see the second sketch below)
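First sketch: a minimal illustration of choice (b), assuming a 3-D Cartesian decomposition and an MPI library whose MPI_Cart_create actually performs a hardware-aware reordering when reorder = 1 (an assumption; many implementations ignore the flag). MPI_Comm_split_type (MPI-3) is used only to check which Cartesian ranks ended up on the same SMP node.

    /* Sketch of option (b): sequential ranks in comm_cart from MPI_Cart_create.
       Whether the reordering is hardware-aware depends on the MPI library. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank, world_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* 3-D process grid; MPI_Dims_create fills in a balanced factorization. */
        int dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0};
        MPI_Dims_create(world_size, 3, dims);

        /* reorder = 1 allows the library to renumber ranks (option b);
           an implementation is free to ignore it, so check the mapping. */
        MPI_Comm comm_cart;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &comm_cart);

        int cart_rank, coords[3];
        MPI_Comm_rank(comm_cart, &cart_rank);
        MPI_Cart_coords(comm_cart, cart_rank, 3, coords);

        /* Node-local communicator: shows which Cartesian coordinates ended up
           on the same SMP node, i.e. whether the mapping is hierarchical. */
        MPI_Comm comm_node;
        MPI_Comm_split_type(comm_cart, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &comm_node);
        int node_rank;
        MPI_Comm_rank(comm_node, &node_rank);

        printf("world %3d  cart %3d  coords (%d,%d,%d)  node-local %2d\n",
               world_rank, cart_rank, coords[0], coords[1], coords[2], node_rank);

        MPI_Comm_free(&comm_node);
        MPI_Comm_free(&comm_cart);
        MPI_Finalize();
        return 0;
    }

Comparing world_rank, cart_rank, and node-local rank in the output shows whether sub-domains that are neighbors in the grid were actually placed on the same node.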

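Second sketch: for the unstructured-grid case the tutorial recommends measuring the communication pattern in a short pilot run and then remapping ranks with vendor tools such as CrayPAT. As a portable alternative (an assumption, not part of the original tutorial), the measured neighbor lists and message volumes can be handed to an MPI-3 distributed graph topology with reorder = 1, letting the library attempt the mapping. The neighbor lists and weights below are hypothetical placeholders.

    /* Hedged sketch: rank remapping via an MPI-3 distributed graph topology.
       The neighbor lists and weights are hypothetical; in practice they would
       come from the communication pattern measured in a short pilot run. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Toy pattern: each rank exchanges halo data with its left and right
           neighbor in a ring; weights stand for measured message volumes. */
        int nbrs[2]    = { (rank - 1 + size) % size, (rank + 1) % size };
        int weights[2] = { 100, 100 };

        /* reorder = 1 allows the library to renumber ranks so that heavily
           communicating ranks land close together; it may also be ignored. */
        MPI_Comm comm_graph;
        MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                       2, nbrs, weights,   /* sources      */
                                       2, nbrs, weights,   /* destinations */
                                       MPI_INFO_NULL, 1, &comm_graph);

        int new_rank;
        MPI_Comm_rank(comm_graph, &new_rank);
        printf("old rank %3d -> new rank %3d\n", rank, new_rank);

        MPI_Comm_free(&comm_graph);
        MPI_Finalize();
        return 0;
    }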