Hybrid MPI and OpenMP programming tutorial - Prace Training Portal
Goals of this tutorial
• Sensitize to problems on clusters of SMP nodes
  (see sections: Case studies, Mismatch problems)
• Technical aspects of hybrid programming
  (see sections: Programming models on clusters, Examples of hybrid programming)
• Opportunities with hybrid programming
  (see section: Opportunities: Application categories that can benefit from hybrid parallelization)
• Issues and their solutions
  (see sections: Thread-safety quality of MPI libraries, Tools for debugging and profiling MPI+OpenMP)
• Less frustration and more success with your parallel program on clusters of SMP nodes

Outline
• Introduction / Motivation
• Programming models on clusters of SMP nodes
• Case studies / pure MPI vs. hybrid MPI+OpenMP
• Practical "How-To" on hybrid programming
• Mismatch problems
• Opportunities: application categories that can benefit from hybrid parallelization
• Thread-safety quality of MPI libraries
• Tools for debugging and profiling MPI+OpenMP
• Other options on clusters of SMP nodes
• Summary

Hybrid Parallel Programming – Rabenseifner, Hager, Jost

Major programming models on hybrid systems
• Pure MPI: one MPI process on each core
  – A sequential program on each core, with local data in each process
  – Explicit message passing by calling MPI_Send and MPI_Recv
• Hybrid MPI+OpenMP
  – OpenMP (shared memory) inside the SMP nodes
  – MPI (distributed memory) between the nodes, via the node interconnect
• Other: virtual shared memory systems, PGAS, HPF, …
• Hybrid programming (MPI+OpenMP) is often slower than pure MPI. Why?

Inside each node, OpenMP follows the usual fork-join model (master thread active, the other threads sleeping outside parallel regions):

  some_serial_code
  #pragma omp parallel for
  for (j = …; …; j++)
      block_to_be_parallelized
  again_some_serial_code

Parallel programming models on hybrid platforms
• Pure MPI: one MPI process on each core; no overlap of communication and computation
• Hybrid MPI+OpenMP, "masteronly": MPI only outside of parallel regions of the numerical application code; MPI handles inter-node communication, OpenMP parallelizes inside each SMP node; no overlap of communication and computation
• Hybrid MPI+OpenMP, overlapping communication and computation: MPI communication by one or a few threads while the other threads are computing
• OpenMP only: distributed virtual shared memory
Pure MPI
Advantages
– No modifications to existing MPI codes
– The MPI library need not support multiple threads
Major problems
– Does the MPI library internally use different protocols?
  • Shared memory inside the SMP nodes
  • Network communication between the nodes
– Does the application topology fit the hardware topology?
– Unnecessary MPI communication inside the SMP nodes!
(Discussed in detail later, in the section Mismatch problems.)

Hybrid masteronly: MPI only outside of parallel regions

  for (iteration = …; …; …) {
      #pragma omp parallel
      {
          numerical code
      } /* end omp parallel */

      /* on master thread only */
      MPI_Send(original data to halo areas in other SMP nodes)
      MPI_Recv(halo data from the neighbors)
  } /* end for loop */

Advantages
– No message passing inside the SMP nodes
– No topology problem
Major problems
– All other threads are sleeping while the master thread communicates!
– Which inter-node bandwidth can a single communicating thread achieve?
– The MPI library must support at least MPI_THREAD_FUNNELED
  (see section: Thread-safety quality of MPI libraries)

Overlapping communication and computation: MPI communication by one or a few threads while the other threads are computing

  if (my_thread_rank < …) {
      MPI_Send/Recv …
      /* i.e., communicate all halo data */
  } else {
      /* execute those parts of the application that do not
         need halo data (on the non-communicating threads) */
  }
  /* execute those parts of the application that
     need halo data (on all threads) */

Pure OpenMP on the cluster: distributed virtual shared memory
• A distributed virtual shared memory system is needed; it must support clusters of SMP nodes, e.g. Intel Cluster OpenMP
  – Shared-memory parallelism inside each SMP node
  – Communication of the modified parts of pages at each OpenMP flush (part of every OpenMP barrier)
• I.e., the OpenMP memory and parallelization model is prepared for clusters!
  (Experience: see the Mismatch problems section.)