Hybrid MPI and OpenMP programming tutorial - Prace Training Portal
Hybrid Parallel Programming (Rabenseifner, Hager, Jost), slides 85-92 of 154

The Topology Problem with pure MPI (one MPI process on each core)
Application example on 80 cores:
• Cartesian application with 5 x 16 = 80 sub-domains
• On a system with 10 nodes x dual socket x quad-core
(Figure: MPI ranks 0-79 laid out row by row across the 10 nodes.)
One mapping of the sub-domains onto this rank layout gives:
• Two levels of domain decomposition
• Bad affinity of cores to thread ranks
• 12 x inter-node connections per node
• 4 x inter-socket connections per node
A mapping that matches the two-level decomposition gives:
• Two levels of domain decomposition
• Good affinity of cores to thread ranks
• 12 x inter-node connections per node
• 2 x inter-socket connections per node

— skipped —

The Topology Problem with hybrid MPI+OpenMP (MPI: inter-node communication; OpenMP: inside of each SMP node)
Example: 2 SMP nodes, 8 cores/node
• Loop worksharing on 8 threads of one MPI process per node: optimal?
• Minimizing ccNUMA data traffic through domain decomposition inside of each MPI process: optimal?
Problem:
– Does the application topology inside of the SMP parallelization fit the hardware topology of each SMP node?
Solutions:
– Domain decomposition inside of each thread-parallel MPI process, and
– first-touch strategy with OpenMP
Successful examples:
– Multi-Zone NAS Parallel Benchmarks (MZ-NPB)

— skipped —

The Topology Problem with hybrid MPI+OpenMP
Application example:
• Same Cartesian application, aspect ratio 5 x 16
• On a system with 10 nodes x dual socket x quad-core
• 2 x 5 domain decomposition on the MPI level (OpenMP inside each node)
• 3 x inter-node connections per node, but ~4 x more traffic
• 2 x inter-socket connections per node
Affinity of cores to thread ranks matters!
Numerical optimization inside of an SMP node (optimizing the numerical performance):
• 2nd level of domain decomposition: OpenMP
• 3rd level: 2nd-level cache
• 4th level: 1st-level cache

The Mapping Problem with mixed model (pure MPI and hybrid MPI+OpenMP)
Several multi-threaded MPI processes per SMP node.
(Figure: do we have this, one 4-way multithreaded MPI process per quad-core socket, or that, each MPI process with its threads spread across both sockets of the node?)
Problem:
– Where are your processes and threads really located?
Solutions:
– Depends on your platform, e.g., numactl
– Case study on the Sun Constellation Cluster Ranger with BT-MZ and SP-MZ
Further questions:
– Where is the NIC (Network Interface Card) located?
– Which cores share caches?

Unnecessary intra-node communication (pure MPI; a quality aspect of the MPI library)
Problem:
– With several MPI processes on each SMP node, there is unnecessary intra-node communication.
Solution:
– Only one MPI process per SMP node.
Remarks:
– The MPI library must use the appropriate fabric/protocol for intra-node communication.
– Intra-node bandwidth is higher than inter-node bandwidth, so the problem may be small.
– The MPI implementation may cause unnecessary data copying, wasting memory bandwidth.

Sleeping threads and network saturation with masteronly (MPI only outside of parallel regions)

for (iteration ...) {
    #pragma omp parallel
    {
        /* numerical code */
    }
    /* on master thread only */
    MPI_Send(original data to halo areas in other SMP nodes)
    MPI_Recv(halo data from the neighbors)
} /* end for loop */

(Figure: during communication, only the master thread on socket 1 is active; all other threads on both sockets are sleeping.)
Problem 1:
– Can the master thread saturate the network?
Solution:
– If not, use the mixed model, i.e., several MPI processes per SMP node.
Problem 2:
– Sleeping threads are wasting CPU time.
Solution:
– Overlapping of computation and communication.
Problems 1 and 2 together:
– Even more idle time, through the poor bandwidth achieved by the master thread alone.