Hybrid MPI and OpenMP programming tutorial - Prace Training Portal

Comparison: MPI-based parallelization vs. DSM

• MPI based:
  – Potential to exchange the boundary between two domains in one large message
    → dominated by the bandwidth of the network
• DSM based (e.g. Intel Cluster OpenMP):
  – Additional latency-based overhead in each barrier
    → may be marginal
  – Communication of the updated data of whole pages
    → not all of this data may be needed, i.e., too much data is transferred
    → packages may be too small → significant latency
  – Communication is not oriented on the boundaries of a domain decomposition
    → probably more data must be transferred than necessary
• Rule of thumb: communication in the OpenMP-only (DSM) model may be 10 times slower than with hybrid MPI+OpenMP.

(Hybrid Parallel Programming, slide 105/154; Rabenseifner, Hager, Jost)

— skipped —

Comparing results with the heat example

• Normal OpenMP on shared memory (ccNUMA): heat_x.c / heatc2_x.c with OpenMP on the NEC TX-7
  [Figure: speedup vs. number of threads (serial up to 10) for problem sizes 1000x1000, 250x250, 80x80 and 20x20, compared against ideal speedup]

(Hybrid Parallel Programming, slide 106/154; Rabenseifner, Hager, Jost)

— skipped —

Heat example: Cluster OpenMP efficiency

• Cluster OpenMP on a dual-Xeon cluster: heats2_x.c with Cluster OpenMP on a NEC dual Xeon EM64T cluster
  [Figure: speedup vs. number of nodes (1 to 8) for problem sizes 6000x6000, 3000x3000, 1000x1000 and 250x250, with static(default), static(1:1), static(1:2), static(1:10) and static(1:50) schedules and 1 or 2 threads per node]
• Efficiency only with a small communication footprint: up to 3 CPUs with 3000x3000
• Terrible with a non-default schedule
• No speedup with 1000x1000
• Second CPU only usable in small cases

(Hybrid Parallel Programming, slide 107/154; Rabenseifner, Hager, Jost)

Back to the mixed model – an example

  [Figure: two SMP nodes connected by the node interconnect; each node has two sockets, and on each socket one MPI process runs multithreaded on a quad-core (4 x CPU)]
• Topology problem solved: only horizontal inter-node communication
• Still intra-node communication
• Several threads per SMP node are communicating in parallel: network saturation is possible
• Additional OpenMP overhead
• With the masteronly style, 75% of the threads sleep while the master thread communicates (see the sketch below)
• With overlapping communication and computation, the master thread should be reserved for communication only partially – otherwise too expensive
• The MPI library must support
  – multiple threads
  – two fabrics (shared memory + inter-node)

(Hybrid Parallel Programming, slide 108/154; Rabenseifner, Hager, Jost)
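The masteronly style and the MPI thread-support requirement from slide 108 can be made concrete with a short sketch. The following hybrid MPI+OpenMP code is illustrative only and is not taken from the tutorial's heat_x.c / heats2_x.c sources: it assumes a 1D domain decomposition of a Jacobi-type heat stencil, and the array names, sizes and tags are invented for the example. All MPI calls are issued outside the OpenMP parallel regions, so requesting MPI_THREAD_FUNNELED is sufficient.

/* Minimal sketch of the "masteronly" hybrid style: MPI is called only
 * outside of OpenMP parallel regions. The halo exchange and array sizes
 * are illustrative and not taken from heat_x.c / heats2_x.c. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000            /* local rows per MPI process (plus 2 halo rows) */
#define M 1000            /* columns */
#define STEPS 100

static double u[N + 2][M], unew[N + 2][M];

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* Masteronly style only needs FUNNELED: all MPI calls are made by
     * the master thread, outside of parallel regions. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);   /* thread support too low */

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int step = 0; step < STEPS; step++) {

        /* Communication phase: master thread only; the other threads of
         * the process are idle here (the "75% sleep" issue on the slide). */
        MPI_Sendrecv(u[1],     M, MPI_DOUBLE, up,   0,
                     u[N + 1], M, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(u[N],     M, MPI_DOUBLE, down, 1,
                     u[0],     M, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Computation phase: all OpenMP threads, no MPI calls inside. */
        #pragma omp parallel for
        for (int i = 1; i <= N; i++)
            for (int j = 1; j < M - 1; j++)
                unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                     + u[i][j - 1] + u[i][j + 1]);

        #pragma omp parallel for
        for (int i = 1; i <= N; i++)
            for (int j = 1; j < M - 1; j++)
                u[i][j] = unew[i][j];
    }

    if (rank == 0)
        printf("done, provided thread level = %d\n", provided);
    MPI_Finalize();
    return 0;
}

Built with an MPI compiler wrapper plus OpenMP (e.g. mpicc -fopenmp) and started with one MPI process per socket and OMP_NUM_THREADS=4, this matches the placement drawn on slide 108. During the two MPI_Sendrecv calls only the master thread works, which is exactly the idle-threads drawback listed above; the overlapping style avoids this, but a thread communicating inside a parallel region needs at least MPI_THREAD_FUNNELED, and MPI_THREAD_MULTIPLE if several threads communicate concurrently.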
