Hybrid MPI and OpenMP programming tutorial - PRACE Training Portal
Slide 53/154: Example: HP DL585 G5, 4-socket ccNUMA Opteron 8220 server
• CPU
  – 64 kB L1 per core
  – 1 MB L2 per core
  – No shared caches
  – On-chip memory controller (MI)
  – 10.6 GB/s local memory bandwidth
• HyperTransport 1000 network
  – 4 GB/s per link per direction
• 3 distance categories for core-to-memory connections:
  – same LD (locality domain)
  – 1 hop
  – 2 hops
• Q1: What are the real penalties for non-local accesses?
• Q2: What is the impact of contention?
[Diagram: four sockets, each with two cores (P) and caches (C) attached to a memory interface (MI) with local memory, connected by HyperTransport (HT) links; core-to-memory distances: local, 1 hop, 2 hops]

Slide 54/154: Effect of non-local access on HP DL585 G5: serial vector triad A(:)=B(:)+C(:)*D(:)
[Measurement plot skipped in source]

Hybrid Parallel Programming, Slide 53/154, Rabenseifner, Hager, Jost

Slide 55/154: Contention vs. parallel access on HP DL585 G5: OpenMP vector triad A(:)=B(:)+C(:)*D(:)
[Plot annotations: in-cache performance unharmed by ccNUMA; affinity matters! T = # threads, S = # sockets; a single LD is saturated by 2 cores; perfect scaling across LDs]

Slide 56/154: ccNUMA Memory Locality Problems
• Locality of reference is key to scalable performance on ccNUMA
  – Less of a problem with pure MPI, but see below
• What factors can destroy locality?
• MPI programming:
  – processes lose their association with the CPU on which the mapping originally took place
  – the OS kernel tries to maintain strong affinity, but sometimes fails
• Shared memory programming (OpenMP, hybrid):
  – threads lose their association with the CPU on which the mapping originally took place
  – improper initialization of distributed data
  – lots of extra threads running on a node, especially for hybrid
• All cases:
  – Other agents (e.g., the OS kernel) may fill memory with data that prevents optimal placement of user data
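The vector triad benchmarked on these slides can be sketched in C with OpenMP (the slides use Fortran array syntax A(:)=B(:)+C(:)*D(:)); the function name `triad` and the static schedule are illustrative choices, not taken from the tutorial's benchmark code:

```c
/* Vector triad kernel: a[i] = b[i] + c[i]*d[i].
 * Compiled with OpenMP support, the iterations are split statically
 * across threads; without it, the pragma is ignored and the loop
 * runs serially, corresponding to the serial triad on slide 54. */
void triad(double *a, const double *b, const double *c,
           const double *d, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = b[i] + c[i] * d[i];
}
```

On a ccNUMA node like the DL585 G5, where the threads executing this kernel run relative to the locality domain holding a, b, c, and d determines whether each access is local or crosses one or two HT hops, which is exactly what the measurements above probe.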
Avoiding locality problems
• How can we make sure that memory ends up where it is close to the CPU that uses it?
  – See the following slides
• How can we make sure that it stays that way throughout program execution?
  – See end of section

Important: Solving memory locality problems: first touch
• "Golden Rule" of ccNUMA:
  A memory page gets mapped into the local memory of the processor that first touches it!
  – Except if there is not enough local memory available; this might be a problem, see later
  – Some OSs allow influencing placement in more direct ways
    • cf. libnuma (Linux), MPO (Solaris), …
• Caveat: "touch" means "write", not "allocate"
• Example:
  double *huge = (double*)malloc(N*sizeof(double));
  // memory not mapped yet
  for(i=0; i<N; i++)   // mapping takes place here
    huge[i] = 0.0;
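The first-touch rule above suggests the usual remedy for OpenMP codes, sketched here under two assumptions not stated on this slide: threads are pinned (e.g. via OMP_PROC_BIND=close) and the later compute loops use the same static schedule. The helper name `alloc_first_touch` is hypothetical:

```c
#include <stdlib.h>

/* Allocate an array and perform a parallel first touch.
 * malloc() does not map pages; the first WRITE does. With a static
 * schedule and pinned threads, each thread's chunk of pages is
 * mapped into that thread's own locality domain ("Golden Rule"). */
double *alloc_first_touch(long n)
{
    double *huge = malloc((size_t)n * sizeof *huge);
    if (huge == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        huge[i] = 0.0;   /* mapping takes place here */

    return huge;
}
```

Subsequent compute loops over the array should reuse schedule(static) so each thread works on the pages it touched first; if threads migrate across locality domains, the placement benefit is lost, which is why affinity enforcement matters.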