Hybrid MPI and OpenMP programming tutorial - Prace Training Portal


Intra-node MPI characteristics: IMB Ping-Pong benchmark (Slide 61/154, Rabenseifner, Hager, Jost)

[Node diagram on slides 61-63: two-socket node; each socket holds two cores with a shared cache, connected via the chipset to memory.]

• Code (to be run on 2 processes):

    wc = MPI_WTIME()
    do i = 1, NREPEAT
      if (rank .eq. 0) then
        call MPI_SEND(buffer, N, MPI_BYTE, 1, 0, MPI_COMM_WORLD, ierr)
        call MPI_RECV(buffer, N, MPI_BYTE, 1, 0, MPI_COMM_WORLD, status, ierr)
      else
        call MPI_RECV(…)
        call MPI_SEND(…)
      endif
    enddo
    wc = MPI_WTIME() - wc

• Intra-node (1 socket):  mpirun -np 2 -pin "1 3" ./a.out
• Intra-node (2 sockets): mpirun -np 2 -pin "2 3" ./a.out
• Inter-node:             mpirun -np 2 -pernode ./a.out

IMB Ping-Pong: Latency (Slide 62/154)
Intra-node vs. inter-node on a Woodcrest DDR-IB cluster (Intel MPI 3.1):
• IB inter-node:             3.24 μs
• IB intra-node, 2 sockets:  0.55 μs
• IB intra-node, 1 socket:   0.31 μs
Affinity matters!

IMB Ping-Pong: Bandwidth characteristics (Slide 63/154)
Intra-node vs. inter-node on a Woodcrest DDR-IB cluster (Intel MPI 3.1):
• Between two cores of one socket: shared-cache advantage
• Between two sockets of one node: intra-node shared-memory communication
• Between two nodes: via InfiniBand
Affinity matters!

OpenMP Overhead (Slide 64/154)
• As with intra-node MPI, the OpenMP loop start-up overhead varies with the mutual position of the threads in a team
• Possible variations:
  – Intra-socket vs. inter-socket
  – Different overhead for "parallel for" vs. plain "for"
  – If one multi-threaded MPI process spans multiple sockets,
    • … are neighboring threads on neighboring cores?
    • … or are threads distributed round-robin across the cores?
• Test benchmark: vector triad (a complete sketch follows below)

    #pragma omp parallel
    for (int j = 0; j < NITER; j++) {
      #pragma omp for      /* or: "omp parallel for" if the outer parallel region is omitted */
      for (int i = 0; i < N; ++i)
        a[i] = b[i] + c[i] * d[i];
      if (OBSCURE)
        dummy(a, b, c, d);
    }

• Look at performance for small array sizes!
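The triad fragment on slide 64 omits allocation, initialization, and timing. Below is a minimal self-contained sketch of such an overhead benchmark, not the authors' original code; the array size N, repetition count NITER, the initialization values, and the dummy() guard are illustrative assumptions.

    /* Vector-triad micro-benchmark for OpenMP loop start-up overhead (sketch). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* Hypothetical helper: called only on a condition the compiler cannot
     * predict, so the triad loop is not optimized away. */
    static void dummy(double *a, double *b, double *c, double *d) {
        printf("%f %f %f %f\n", a[0], b[0], c[0], d[0]);
    }

    int main(int argc, char **argv) {
        int N     = (argc > 1) ? atoi(argv[1]) : 1000;  /* small sizes expose overhead */
        int NITER = 100000;
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        double *d = malloc(N * sizeof(double));
        for (int i = 0; i < N; ++i) { a[i] = 0.0; b[i] = c[i] = d[i] = 1.0; }

        double t0 = omp_get_wtime();
        for (int j = 0; j < NITER; j++) {
            /* "parallel for" variant: a thread team is forked and joined in
             * every outer iteration, so the measurement includes the full
             * OpenMP start-up and barrier overhead. */
            #pragma omp parallel for
            for (int i = 0; i < N; ++i)
                a[i] = b[i] + c[i] * d[i];
            if (a[N / 2] < 0.0)          /* never true, defeats dead-code elimination */
                dummy(a, b, c, d);
        }
        double t = omp_get_wtime() - t0;

        printf("N=%d  %.2f MFlop/s\n", N, 2.0 * N * (double)NITER / t / 1e6);
        free(a); free(b); free(c); free(d);
        return 0;
    }

Running it for small N with threads pinned inside one socket and then across sockets makes the placement-dependent overhead visible, in the spirit of the plots on the following slides.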

OpenMP Overhead (Slide 65/154)
[Plot: vector-triad performance for different thread placements. Legend: 1S/2S = 1-/2-socket, RR = round-robin, SS = socket-socket, inner = parallel on inner loop.]

Thread synchronization overhead
Barrier overhead in CPU cycles: pthreads vs. OpenMP vs. spin loop

    2 Threads                Q9550 (shared L2)   i7 920 (shared L3)
    pthreads_barrier_wait    23739                6511
    omp barrier (icc 11.0)     399                 469
    Spin loop                  231                 270

    4 Threads                Q9550                i7 920 (shared L3)
    pthreads_barrier_wait    42533                9820
    omp barrier (icc 11.0)     977                 814
    Spin loop                 1106                 475

• OpenMP overhead can be comparable to MPI latency! Affinity matters!
• pthreads_barrier_wait involves an OS kernel call
• A spin loop does fine for shared-cache synchronization; the omp barrier numbers are for OpenMP with the Intel compiler (icc 11.0)

Thread synchronization overhead — Barrier overhead: OpenMP icc vs. gcc (Slides 66-67/154)
gcc obviously uses a pthreads barrier for the OpenMP barrier:

    2 Threads    Q9550 (shared L2)   i7 920 (shared L3)
    gcc 4.3.3    22603                7333
    icc 11.0       399                 469

    4 Threads    Q9550                i7 920 (shared L3)
    gcc 4.3.3    64143                10901
    icc 11.0       977                  814

Correct pinning of threads:
• Manual pinning in source code (see below), or
• likwid-pin: http://code.google.com/p/likwid/

Thread synchronization overhead — Barrier overhead: Topology influence (Slide 68/154)
[Node diagrams: Xeon E5420 (two sockets, core pairs sharing an L2 cache, chipset, memory) and Nehalem (two sockets with local memory each, SMT threads per core).]

    Xeon E5420, 2 Threads    shared L2   same socket   different socket
    pthreads_barrier_wait     5863        27032          27647
    omp barrier (icc 11.0)     576          760           1269
    Spin loop                  259          485          11602

    Nehalem, 2 Threads       shared SMT threads   shared L3   different socket
    pthreads_barrier_wait    23352                 4796        49237
    omp barrier (icc 11.0)    2761                  479         1206
    Spin loop                17388                  267          787

• SMT can be a big performance problem for synchronizing threads
• Well known for a long time…
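To reproduce such barrier numbers, a simple approach is to time a long sequence of barriers per thread. The sketch below is an assumption about how this could be measured (it is not the authors' benchmark); NREP is an arbitrary repetition count, and it reports microseconds from omp_get_wtime() rather than raw cycle counts, so divide by the core clock period to compare with the tables above.

    /* Measure the average cost of an OpenMP barrier (sketch). */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        const int NREP = 1000000;   /* number of barriers timed per thread */
        #pragma omp parallel
        {
            double t0 = omp_get_wtime();
            for (int i = 0; i < NREP; ++i) {
                #pragma omp barrier
            }
            double t = (omp_get_wtime() - t0) / NREP;
            /* One thread reports; all threads measured the same barriers. */
            #pragma omp master
            printf("%d threads: %.3f us per barrier\n",
                   omp_get_num_threads(), t * 1e6);
        }
        return 0;
    }

As the topology tables show, the result depends strongly on where the threads run, so pin them explicitly for each case, e.g. with likwid-pin (likwid-pin -c 0,1 ./a.out) to place two threads on cores that do or do not share a cache.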

