Hybrid MPI and OpenMP programming tutorial - Prace Training Portal


Hybrid Parallel Programming (Rabenseifner, Hager, Jost), slides 85-92 of 154

The Topology Problem with pure MPI (one MPI process on each core)

Application example on 80 cores:
• Cartesian application with 5 x 16 = 80 sub-domains
• Run on a system of 10 nodes, each dual-socket with quad-core CPUs (80 cores in total)

Slide 85/154: sequential rank placement
[Figure: MPI ranks 0-79 laid out row by row, eight consecutive ranks per node]
• Two levels of domain decomposition
• Bad affinity of cores to thread ranks
• 12 inter-node connections per node
• 4 inter-socket connections per node

Slide 86/154: rank placement matched to the decomposition
[Figure: the same ranks 0-79 with a placement that matches the domain decomposition]
• Two levels of domain decomposition
• Good affinity of cores to thread ranks
• 12 inter-node connections per node
• 2 inter-socket connections per node

(some slides skipped)

The Topology Problem with hybrid MPI+OpenMP (MPI: inter-node communication, OpenMP: inside of each SMP node)

Example: 2 SMP nodes, 8 cores per node, MPI process 0 and MPI process 1 with loop worksharing on 8 threads each. Is this optimal? ccNUMA data traffic can be minimized by a domain decomposition inside each MPI process.

Problem:
– Does the application topology inside the SMP parallelization fit the inner hardware topology of each SMP node?
Solutions (see the sketch after this section):
– domain decomposition inside each thread-parallel MPI process, and
– a first-touch strategy with OpenMP
Successful examples:
– Multi-Zone NAS Parallel Benchmarks (MZ-NPB)

(some slides skipped)
[Figure: parallelization hierarchy: Application, MPI level, OpenMP level]

Slides 87-88/154: the same example with hybrid MPI+OpenMP
• Same Cartesian application, aspect ratio 5 x 16
• On the system with 10 dual-socket quad-core nodes
• 2 x 5 domain decomposition at the MPI level
• 3 inter-node connections per node, but ~4 x more traffic
• 2 inter-socket connections per node
• Affinity of cores to thread ranks matters!
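To make the two-level decomposition concrete, the following is a minimal sketch (not from the original slides) of how the MPI level of such a setup might look. It assumes one MPI process per SMP node, i.e. 10 processes for the example system, and lets MPI_Dims_create build the 2-D process grid; the file name topo_sketch.c and the printed output are illustrative only.

    /* topo_sketch.c - two-level decomposition: MPI across nodes,
       OpenMP inside each node.  Compile e.g.: mpicc -fopenmp topo_sketch.c */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* First level: MPI domain decomposition, one process per SMP node.
           dims = {0,0} lets MPI_Dims_create choose a balanced 2-D grid;
           for 10 processes this gives 5 x 2, i.e. the 2 x 5 decomposition
           of the example above. */
        int dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2];
        MPI_Dims_create(size, 2, dims);

        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
        MPI_Cart_coords(cart, rank, 2, coords);

        /* Second level: OpenMP worksharing inside each MPI sub-domain,
           i.e. inside one SMP node.  For ccNUMA locality each thread
           should also first-touch (initialize) the slice of the
           sub-domain it will later work on. */
        #pragma omp parallel
        {
            printf("MPI rank %d (coords %d,%d), OpenMP thread %d of %d\n",
                   rank, coords[0], coords[1],
                   omp_get_thread_num(), omp_get_num_threads());
            /* ... thread-local part of the numerical work ... */
        }

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }

Whether the thread-level split inside each node then matches the socket and cache topology is exactly the affinity question the slides raise.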

Numerical optimization inside an SMP node
• 2nd level of domain decomposition: OpenMP
• 3rd level: 2nd-level cache
• 4th level: 1st-level cache
• Goal: optimizing the numerical performance

The Mapping Problem with the mixed model (pure MPI and hybrid MPI+OpenMP; several multi-threaded MPI processes per SMP node)
[Figure: "Do we have this ... or that?" - two possible placements of two 4-way multithreaded MPI processes on a dual-socket quad-core SMP node attached to the node interconnect]
Problem:
– Where are your processes and threads really located? (A simple check is sketched after this section.)
Solutions:
– depend on your platform,
– e.g., numactl
Case study: Sun Constellation Cluster Ranger with BT-MZ and SP-MZ.
Further questions:
– Where is the NIC (network interface card) located?
– Which cores share caches?

Unnecessary intra-node communication (pure MPI and mixed model; a quality aspect of the MPI library)
Problem:
– Several MPI processes on each SMP node cause unnecessary intra-node communication.
Solution:
– Only one MPI process per SMP node.
Remarks:
– The MPI library must use an appropriate fabric/protocol for intra-node communication.
– Intra-node bandwidth is higher than inter-node bandwidth, so the problem may be small.
– The MPI implementation may cause unnecessary data copying and thus waste memory bandwidth.

Sleeping threads and network saturation with Masteronly (MPI only outside of parallel regions)

The Masteronly scheme from the slide, as a skeleton:

    for (iteration = 0; iteration < n_iter; iteration++) {
        #pragma omp parallel
        {
            /* numerical code */
        }                       /* end omp parallel */

        /* on the master thread only: */
        MPI_Send( /* original data to halo areas in other SMP nodes */ );
        MPI_Recv( /* halo data from the neighbors */ );
    }                           /* end for loop */

[Figure: SMP node with the master thread on socket 1 communicating while all other threads on both sockets sleep]

Problem 1:
– Can the master thread saturate the network?
Solution:
– If not, use the mixed model, i.e., several MPI processes per SMP node.
Problem 2:
– Sleeping threads waste CPU time.
Solution:
– Overlap computation and communication.
Problems 1 and 2 together:
– Even more idle time is produced because of the poor bandwidth achieved by the master thread alone.
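The mapping question above ("where are your processes and threads really located?") can also be checked from inside the program. Below is a minimal, Linux-specific sketch, not from the original slides: sched_getcpu() is a glibc call and the file name where_am_i.c is illustrative only; on other platforms the slides' suggestion of numactl, or the batch system's binding report, serves the same purpose.

    /* where_am_i.c - print which core each MPI rank / OpenMP thread runs on.
       Compile e.g.: mpicc -fopenmp where_am_i.c */
    #define _GNU_SOURCE
    #include <sched.h>          /* sched_getcpu(), glibc/Linux only */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided, rank, namelen;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &namelen);

        #pragma omp parallel
        {
            /* Report the core each thread is currently scheduled on.
               Without explicit pinning (numactl, OMP_PLACES / OMP_PROC_BIND,
               or the MPI launcher's binding options) this may change over
               time, which is precisely the mapping problem. */
            printf("host %s  MPI rank %d  OpenMP thread %d  core %d\n",
                   host, rank, omp_get_thread_num(), sched_getcpu());
        }

        MPI_Finalize();
        return 0;
    }

Comparing the reported core numbers with the node topology (for example from numactl --hardware) shows whether threads ended up where they were intended; the slide's further questions, which cores share caches and where the NIC sits, usually need the platform documentation or a topology tool such as hwloc's lstopo.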
