
Hybrid MPI and OpenMP programming tutorial - Prace Training Portal

Numerical optimization inside of an SMP node

– 2nd level of domain decomposition: OpenMP
– 3rd level: 2nd-level cache
– 4th level: 1st-level cache
– Goal: optimizing the numerical performance

The mapping problem with the mixed model (pure MPI & hybrid MPI+OpenMP)

[Figure: two possible placements on a two-socket, quad-core SMP node. Either each multi-threaded MPI process runs with its four threads on one quad-core socket, or each process is spread with its threads across both sockets. Do we have this ... or that?]

Several multi-threaded MPI processes per SMP node:

Problem:
– Where are your processes and threads really located?

Solutions:
– Depends on your platform, e.g., pinning with numactl (a small probe program at the end of this section shows one way to check).
– Case study on the Sun Constellation Cluster Ranger with BT-MZ and SP-MZ.

Further questions:
– Where is the NIC (Network Interface Card) located?
– Which cores share caches?

Unnecessary intra-node communication (pure MPI)

Problem:
– Several MPI processes on each SMP node lead to unnecessary intra-node communication.

Solution:
– Only one MPI process per SMP node.

Remarks (quality aspects of the MPI library):
– The MPI library must use an appropriate fabric/protocol for intra-node communication.
– Intra-node bandwidth is higher than inter-node bandwidth, so the problem may be small.
– The MPI implementation may cause unnecessary data copying, i.e., a waste of memory bandwidth.

Sleeping threads and network saturation with Masteronly (mixed model: several multi-threaded MPI processes per SMP node)

Masteronly: MPI only outside of parallel regions.

    for (iteration = ...) {
        #pragma omp parallel
        {
            /* numerical code */
        }
        /* on master thread only: */
        MPI_Send(/* original data to halo areas in other SMP nodes */);
        MPI_Recv(/* halo data from the neighbors */);
    } /* end for loop */

[Figure: two SMP nodes connected by the node interconnect; on each node only the master thread on Socket 1 communicates, while the threads on Socket 2 are sleeping.]

Problem 1:
– Can the master thread saturate the network?
Solution:
– If not, use the mixed model, i.e., several MPI processes per SMP node.

Problem 2:
– Sleeping threads are wasting CPU time.
Solution:
– Overlap computation and communication.

Problems 1 & 2 together:
– The lousy bandwidth of the master thread produces even more idle time.
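
To make the Masteronly pattern above concrete, here is a minimal, self-contained sketch. The 1-D ring of neighbor ranks, the array size, and the Jacobi-style update are illustrative assumptions; only the structure follows the slide: initialize MPI with at least MPI_THREAD_FUNNELED, let all threads compute inside the parallel region, and exchange halos on the master thread only, outside of it.

    /* Masteronly sketch (illustrative): MPI is called only outside the
     * OpenMP parallel region, on the master thread. The 1-D ring of
     * neighbors and the array size are invented for this example. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000
    static double a[N + 2], b[N + 2];   /* cells 1..N plus two halo cells */

    int main(int argc, char **argv)
    {
        int provided, rank, size;

        /* Masteronly needs at least MPI_THREAD_FUNNELED thread support. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;
        double *u = a, *unew = b;

        for (int iteration = 0; iteration < 100; iteration++) {
            /* numerical code: all threads compute */
            #pragma omp parallel for
            for (int i = 1; i <= N; i++)
                unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
            double *tmp = u; u = unew; unew = tmp;

            /* on master thread only: exchange halos with both neighbors */
            MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                         &u[0], 1, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  1,
                         &u[N + 1], 1, MPI_DOUBLE, right, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        if (rank == 0)
            printf("done: u[1] = %f\n", u[1]);
        MPI_Finalize();
        return 0;
    }

Compile with, e.g., mpicc -fopenmp. Note that with MPI_THREAD_FUNNELED the MPI calls must stay on the thread that called MPI_Init_thread, which is exactly what Masteronly guarantees.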
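For the mapping problem, one quick way to see where processes and threads really end up is a probe that prints each thread's current core. sched_getcpu() is Linux/glibc-specific, and the program below is an illustrative sketch, not part of the tutorial:

    /* Placement probe (illustrative, Linux/glibc): prints which core each
     * OpenMP thread of each MPI rank is currently running on. */
    #define _GNU_SOURCE
    #include <sched.h>          /* sched_getcpu() */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, hostlen;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &hostlen);

        #pragma omp parallel
        printf("host %s  rank %d  thread %d -> core %d\n",
               host, rank, omp_get_thread_num(), sched_getcpu());

        MPI_Finalize();
        return 0;
    }

Running it with and without pinning (e.g., under numactl, or with the batch system's binding options) shows whether the threads of one rank stay on one socket or wander across both.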
