Hybrid MPI and OpenMP programming tutorial - PRACE Training Portal
Slide 53/154: Example: HP DL585 G5, 4-socket ccNUMA Opteron 8220 server
• CPU
  – 64 kB L1 per core
  – 1 MB L2 per core
  – No shared caches
  – On-chip memory controller (MI)
  – 10.6 GB/s local memory bandwidth
• HyperTransport 1000 network
  – 4 GB/s per link per direction
• 3 distance categories for core-to-memory connections:
  – same LD (locality domain)
  – 1 hop
  – 2 hops
• Q1: What are the real penalties for non-local accesses?
• Q2: What is the impact of contention?
[Diagram: four sockets, each with two cores (P) and caches (C) attached to a memory interface (MI) with local memory, connected by HyperTransport (HT) links; core-to-memory distances: local, 1 hop, 2 hops]

Slide 54/154: Effect of non-local access on HP DL585 G5: serial vector triad A(:)=B(:)+C(:)*D(:)
[Measurement plot skipped in source]

Hybrid Parallel Programming, Slide 53/154, Rabenseifner, Hager, Jost

Slide 55/154: Contention vs. parallel access on HP DL585 G5: OpenMP vector triad A(:)=B(:)+C(:)*D(:)
[Plot annotations: in-cache performance unharmed by ccNUMA; affinity matters! T = # threads, S = # sockets; a single LD is saturated by 2 cores; perfect scaling across LDs]

Slide 56/154: ccNUMA Memory Locality Problems
• Locality of reference is key to scalable performance on ccNUMA
  – Less of a problem with pure MPI, but see below
• What factors can destroy locality?
• MPI programming:
  – processes lose their association with the CPU on which the mapping originally took place
  – the OS kernel tries to maintain strong affinity, but sometimes fails
• Shared memory programming (OpenMP, hybrid):
  – threads lose their association with the CPU on which the mapping originally took place
  – improper initialization of distributed data
  – lots of extra threads running on a node, especially for hybrid
• All cases:
  – Other agents (e.g., the OS kernel) may fill memory with data that prevents optimal placement of user data
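The vector triad benchmarked on these slides can be sketched in C with OpenMP (the slides use Fortran array syntax A(:)=B(:)+C(:)*D(:)); the function name `triad` and the static schedule are illustrative choices, not taken from the tutorial's benchmark code:

```c
/* Vector triad kernel: a[i] = b[i] + c[i]*d[i].
 * Compiled with OpenMP support, the iterations are split statically
 * across threads; without it, the pragma is ignored and the loop
 * runs serially, corresponding to the serial triad on slide 54. */
void triad(double *a, const double *b, const double *c,
           const double *d, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = b[i] + c[i] * d[i];
}
```

On a ccNUMA node like the DL585 G5, where the threads executing this kernel run relative to the locality domain holding a, b, c, and d determines whether each access is local or crosses one or two HT hops, which is exactly what the measurements above probe.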
Avoiding locality problems
• How can we make sure that memory ends up where it is close to the CPU that uses it?
  – See the following slides
• How can we make sure that it stays that way throughout program execution?
  – See end of section

Important: Solving memory locality problems: first touch
• "Golden Rule" of ccNUMA:
  A memory page gets mapped into the local memory of the processor that first touches it!
  – Except if there is not enough local memory available; this might be a problem, see later
  – Some OSs allow influencing placement in more direct ways
    • cf. libnuma (Linux), MPO (Solaris), …
• Caveat: "touch" means "write", not "allocate"
• Example:
  double *huge = (double*)malloc(N*sizeof(double));
  // memory not mapped yet
  for(i=0; i<N; i++)   // mapping takes place here
    huge[i] = 0.0;
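The first-touch rule above suggests the usual remedy for OpenMP codes, sketched here under two assumptions not stated on this slide: threads are pinned (e.g. via OMP_PROC_BIND=close) and the later compute loops use the same static schedule. The helper name `alloc_first_touch` is hypothetical:

```c
#include <stdlib.h>

/* Allocate an array and perform a parallel first touch.
 * malloc() does not map pages; the first WRITE does. With a static
 * schedule and pinned threads, each thread's chunk of pages is
 * mapped into that thread's own locality domain ("Golden Rule"). */
double *alloc_first_touch(long n)
{
    double *huge = malloc((size_t)n * sizeof *huge);
    if (huge == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        huge[i] = 0.0;   /* mapping takes place here */

    return huge;
}
```

Subsequent compute loops over the array should reuse schedule(static) so each thread works on the pages it touched first; if threads migrate across locality domains, the placement benefit is lost, which is why affinity enforcement matters.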