Hybrid MPI and OpenMP programming tutorial - Prace Training Portal
Thread support within Open MPI

• To enable thread support in Open MPI, configure with:

      configure --enable-mpi-threads

  This turns on:
  – support for full MPI_THREAD_MULTIPLE
  – internal checks when run with threads (--enable-debug)

      configure --enable-mpi-threads --enable-progress-threads

  This additionally turns on:
  – progress threads to asynchronously send/receive data per network BTL
• Additional feature: compiling with debugging support, but without threads, will check for recursive locking.

Outline

• Introduction / Motivation
• Programming models on clusters of SMP nodes
• Case studies: pure MPI vs. hybrid MPI+OpenMP
• Practical "how-to" on hybrid programming
• Mismatch problems
• Opportunities: application categories that can benefit from hybrid parallelization
• Thread-safety quality of MPI libraries
• Tools for debugging and profiling MPI+OpenMP (this section is skipped; see the talks on tools on Thursday)
• Other options on clusters of SMP nodes
• Summary

Hybrid Parallel Programming, slides 125-132 / 154, Rabenseifner, Hager, Jost. Courtesy of Rainer Keller, HLRS and ORNL.

Thread Correctness – Intel ThreadChecker 1/3 (skipped)

Thread Correctness – Intel ThreadChecker 2/3 (skipped)

• Intel ThreadChecker operates in a similar fashion to helgrind.
• Compile with -tcheck, then run the program using tcheck_cl.
• Output can be written to HTML:

      tcheck_cl --format HTML --report pthread_race.html pthread_race

  The resulting report flags the data race, e.g.:

      ID 1 | Write -> Write data-race | Severity: Error | Count: 1
      Context [Best]: "pthread_race.c":25
      Description: Memory write of global_variable at "pthread_race.c":31
      conflicts with a prior memory write of global_variable at
      "pthread_race.c":31 (output dependence)
      1st Access [Best]: "pthread_race.c":31
      2nd Access [Best]: "pthread_race.c":31

• Caution: Intel Inspector XE 2011 is a GUI-based tool and may not be suitable for hybrid code execution.
Thread Correctness – Intel ThreadChecker 3/3 (skipped)

• To compile with threaded Open MPI (option for InfiniBand):

      configure --enable-mpi-threads --enable-debug \
          --enable-mca-no-build=memory-ptmalloc2 \
          CC=icc F77=ifort FC=ifort \
          CFLAGS='-debug all -inline-debug-info -tcheck' \
          CXXFLAGS='-debug all -inline-debug-info -tcheck' \
          FFLAGS='-debug all -tcheck' LDFLAGS='-tcheck'

• Then run with:

      mpirun --mca btl tcp,sm,self -np 2 tcheck_cl \
          --reinstrument -u full --format html \
          --cache_dir '/tmp/my_username_$$__tc_cl_cache' \
          --report 'tc_mpi_test_suite_$$' \
          --options 'file=tc_my_executable_%H_%I, \
          pad=128, delay=2, stall=2' -- \
          ./my_executable my_arg1 my_arg2 …

Performance Tools Support for Hybrid Code (skipped)

• Paraver examples have already been shown; tracing is done by linking against the (closed-source) omptrace or ompitrace libraries.
• For Vampir/VampirTrace performance analysis:

      ./configure --enable-omp --enable-hyb \
          --with-mpi-dir=/opt/OpenMPI/1.3-icc \
          CC=icc F77=ifort FC=ifort

  (Attention: this does not wrap MPI_Init_thread!)

Scalasca – Example "Wait at Barrier" (skipped)

• The Scalasca screenshot indicates a non-optimal load balance.

Scalasca – Example "Wait at Barrier", Solution (skipped)

• Better load balancing is obtained with a dynamic loop schedule.

Screenshots courtesy of KOJAK, JSC, FZ Jülich.