Hybrid MPI and OpenMP programming tutorial - Prace Training Portal

Outline

• Introduction / Motivation
• Programming models on clusters of SMP nodes
• Case studies / pure MPI vs. hybrid MPI+OpenMP
• Practical "how-to" on hybrid programming
• Mismatch problems
• Opportunities: application categories that can benefit from hybrid parallelization
• Thread-safety quality of MPI libraries
• Tools for debugging and profiling MPI+OpenMP
• Other options on clusters of SMP nodes
• Summary

Pure MPI – multi-core aware

• Hierarchical domain decomposition (or distribution of Cartesian arrays).
  Example on 10 nodes, each with 4 sockets, each with 6 cores:
  – Domain decomposition: 1 sub-domain per SMP node
  – Further partitioning: 1 sub-domain per socket, then 1 per core
  – Cache optimization: blocking inside each core; the block size relates to the cache size (1–3 cache levels)

How to achieve a hierarchical domain decomposition (DD)?

• Cartesian grids:
  – Several levels of subdivision
  – Ranking of MPI_COMM_WORLD – three choices:
    a) Sequential ranks through the original data structure
       + locating these ranks correctly on the hardware can be achieved with a one-level DD on the finest grid
       + special startup (mpiexec) with optimized rank mapping
    b) Sequential ranks in comm_cart (from MPI_CART_CREATE)
       requires an optimized MPI_CART_CREATE, or a special startup (mpiexec) with optimized rank mapping
       (see the first sketch below)
    c) Sequential ranks in MPI_COMM_WORLD
       + an additional communicator with sequential ranks in the data structure
       + self-written and optimized rank mapping
• Unstructured grids:
  – Multi-level DD:
    • Top-down: several levels of (Par)Metis – not recommended
    • Bottom-up: low-level DD + higher-level recombination
  – Single-level DD (finest level):
    • Analyze the communication pattern in a first run (with only a few iterations)
    • Apply an optimized rank mapping to the hardware before the production run
    • E.g., with CrayPAT + CrayApprentice
    (see the second sketch below)
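First sketch: a minimal illustration of choice (b), assuming a 3-D Cartesian decomposition and an MPI library whose MPI_Cart_create actually performs a hardware-aware reordering when reorder = 1 (an assumption; many implementations ignore the flag). MPI_Comm_split_type (MPI-3) is used only to check which Cartesian ranks ended up on the same SMP node.

    /* Sketch of option (b): sequential ranks in comm_cart from MPI_Cart_create.
       Whether the reordering is hardware-aware depends on the MPI library. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank, world_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* 3-D process grid; MPI_Dims_create fills in a balanced factorization. */
        int dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0};
        MPI_Dims_create(world_size, 3, dims);

        /* reorder = 1 allows the library to renumber ranks (option b);
           an implementation is free to ignore it, so check the mapping. */
        MPI_Comm comm_cart;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &comm_cart);

        int cart_rank, coords[3];
        MPI_Comm_rank(comm_cart, &cart_rank);
        MPI_Cart_coords(comm_cart, cart_rank, 3, coords);

        /* Node-local communicator: shows which Cartesian coordinates ended up
           on the same SMP node, i.e. whether the mapping is hierarchical. */
        MPI_Comm comm_node;
        MPI_Comm_split_type(comm_cart, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &comm_node);
        int node_rank;
        MPI_Comm_rank(comm_node, &node_rank);

        printf("world %3d  cart %3d  coords (%d,%d,%d)  node-local %2d\n",
               world_rank, cart_rank, coords[0], coords[1], coords[2], node_rank);

        MPI_Comm_free(&comm_node);
        MPI_Comm_free(&comm_cart);
        MPI_Finalize();
        return 0;
    }

Comparing world_rank, cart_rank, and node-local rank in the output shows whether sub-domains that are neighbors in the grid were actually placed on the same node.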

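Second sketch: for the unstructured-grid case the tutorial recommends measuring the communication pattern in a short pilot run and then remapping ranks with vendor tools such as CrayPAT. As a portable alternative (an assumption, not part of the original tutorial), the measured neighbor lists and message volumes can be handed to an MPI-3 distributed graph topology with reorder = 1, letting the library attempt the mapping. The neighbor lists and weights below are hypothetical placeholders.

    /* Hedged sketch: rank remapping via an MPI-3 distributed graph topology.
       The neighbor lists and weights are hypothetical; in practice they would
       come from the communication pattern measured in a short pilot run. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Toy pattern: each rank exchanges halo data with its left and right
           neighbor in a ring; weights stand for measured message volumes. */
        int nbrs[2]    = { (rank - 1 + size) % size, (rank + 1) % size };
        int weights[2] = { 100, 100 };

        /* reorder = 1 allows the library to renumber ranks so that heavily
           communicating ranks land close together; it may also be ignored. */
        MPI_Comm comm_graph;
        MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                       2, nbrs, weights,   /* sources      */
                                       2, nbrs, weights,   /* destinations */
                                       MPI_INFO_NULL, 1, &comm_graph);

        int new_rank;
        MPI_Comm_rank(comm_graph, &new_rank);
        printf("old rank %3d -> new rank %3d\n", rank, new_rank);

        MPI_Comm_free(&comm_graph);
        MPI_Finalize();
        return 0;
    }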