Hybrid MPI and OpenMP programming tutorial - Prace Training Portal


Intra-node MPI characteristics: IMB Ping-Pong benchmark (Slide 61/154, Rabenseifner, Hager, Jost)

[Node diagram on slides 61-63: two-socket node; each socket holds two cores with a shared cache, connected via the chipset to memory.]

• Code (to be run on 2 processes):

    wc = MPI_WTIME()
    do i = 1, NREPEAT
      if (rank .eq. 0) then
        call MPI_SEND(buffer, N, MPI_BYTE, 1, 0, MPI_COMM_WORLD, ierr)
        call MPI_RECV(buffer, N, MPI_BYTE, 1, 0, MPI_COMM_WORLD, status, ierr)
      else
        call MPI_RECV(…)
        call MPI_SEND(…)
      endif
    enddo
    wc = MPI_WTIME() - wc

• Intra-node (1 socket):  mpirun -np 2 -pin "1 3" ./a.out
• Intra-node (2 sockets): mpirun -np 2 -pin "2 3" ./a.out
• Inter-node:             mpirun -np 2 -pernode ./a.out

IMB Ping-Pong: Latency (Slide 62/154)
Intra-node vs. inter-node on a Woodcrest DDR-IB cluster (Intel MPI 3.1):
• IB inter-node:             3.24 μs
• IB intra-node, 2 sockets:  0.55 μs
• IB intra-node, 1 socket:   0.31 μs
Affinity matters!

IMB Ping-Pong: Bandwidth characteristics (Slide 63/154)
Intra-node vs. inter-node on a Woodcrest DDR-IB cluster (Intel MPI 3.1):
• Between two cores of one socket: shared-cache advantage
• Between two sockets of one node: intra-node shared-memory communication
• Between two nodes: via InfiniBand
Affinity matters!

OpenMP Overhead (Slide 64/154)
• As with intra-node MPI, the OpenMP loop start-up overhead varies with the mutual position of the threads in a team
• Possible variations:
  – Intra-socket vs. inter-socket
  – Different overhead for "parallel for" vs. plain "for"
  – If one multi-threaded MPI process spans multiple sockets,
    • … are neighboring threads on neighboring cores?
    • … or are threads distributed round-robin across the cores?
• Test benchmark: vector triad (a complete sketch follows below)

    #pragma omp parallel
    for (int j = 0; j < NITER; j++) {
      #pragma omp for      /* or: "omp parallel for" if the outer parallel region is omitted */
      for (int i = 0; i < N; ++i)
        a[i] = b[i] + c[i] * d[i];
      if (OBSCURE)
        dummy(a, b, c, d);
    }

• Look at performance for small array sizes!
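The triad fragment on slide 64 omits allocation, initialization, and timing. Below is a minimal self-contained sketch of such an overhead benchmark, not the authors' original code; the array size N, repetition count NITER, the initialization values, and the dummy() guard are illustrative assumptions.

    /* Vector-triad micro-benchmark for OpenMP loop start-up overhead (sketch). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* Hypothetical helper: called only on a condition the compiler cannot
     * predict, so the triad loop is not optimized away. */
    static void dummy(double *a, double *b, double *c, double *d) {
        printf("%f %f %f %f\n", a[0], b[0], c[0], d[0]);
    }

    int main(int argc, char **argv) {
        int N     = (argc > 1) ? atoi(argv[1]) : 1000;  /* small sizes expose overhead */
        int NITER = 100000;
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        double *d = malloc(N * sizeof(double));
        for (int i = 0; i < N; ++i) { a[i] = 0.0; b[i] = c[i] = d[i] = 1.0; }

        double t0 = omp_get_wtime();
        for (int j = 0; j < NITER; j++) {
            /* "parallel for" variant: a thread team is forked and joined in
             * every outer iteration, so the measurement includes the full
             * OpenMP start-up and barrier overhead. */
            #pragma omp parallel for
            for (int i = 0; i < N; ++i)
                a[i] = b[i] + c[i] * d[i];
            if (a[N / 2] < 0.0)          /* never true, defeats dead-code elimination */
                dummy(a, b, c, d);
        }
        double t = omp_get_wtime() - t0;

        printf("N=%d  %.2f MFlop/s\n", N, 2.0 * N * (double)NITER / t / 1e6);
        free(a); free(b); free(c); free(d);
        return 0;
    }

Running it for small N with threads pinned inside one socket and then across sockets makes the placement-dependent overhead visible, in the spirit of the plots on the following slides.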

OpenMP Overhead (Slide 65/154)
[Plot: vector-triad performance for different thread placements. Legend: 1S/2S = 1-/2-socket, RR = round-robin, SS = socket-socket, inner = parallel on inner loop.]

Thread synchronization overhead
Barrier overhead in CPU cycles: pthreads vs. OpenMP vs. spin loop

    2 Threads                Q9550 (shared L2)   i7 920 (shared L3)
    pthreads_barrier_wait    23739                6511
    omp barrier (icc 11.0)     399                 469
    Spin loop                  231                 270

    4 Threads                Q9550                i7 920 (shared L3)
    pthreads_barrier_wait    42533                9820
    omp barrier (icc 11.0)     977                 814
    Spin loop                 1106                 475

• OpenMP overhead can be comparable to MPI latency! Affinity matters!
• pthreads_barrier_wait involves an OS kernel call
• A spin loop does fine for shared-cache synchronization; the omp barrier numbers are for OpenMP with the Intel compiler (icc 11.0)

Thread synchronization overhead — Barrier overhead: OpenMP icc vs. gcc (Slides 66-67/154)
gcc obviously uses a pthreads barrier for the OpenMP barrier:

    2 Threads    Q9550 (shared L2)   i7 920 (shared L3)
    gcc 4.3.3    22603                7333
    icc 11.0       399                 469

    4 Threads    Q9550                i7 920 (shared L3)
    gcc 4.3.3    64143                10901
    icc 11.0       977                  814

Correct pinning of threads:
• Manual pinning in source code (see below), or
• likwid-pin: http://code.google.com/p/likwid/

Thread synchronization overhead — Barrier overhead: Topology influence (Slide 68/154)
[Node diagrams: Xeon E5420 (two sockets, core pairs sharing an L2 cache, chipset, memory) and Nehalem (two sockets with local memory each, SMT threads per core).]

    Xeon E5420, 2 Threads    shared L2   same socket   different socket
    pthreads_barrier_wait     5863        27032          27647
    omp barrier (icc 11.0)     576          760           1269
    Spin loop                  259          485          11602

    Nehalem, 2 Threads       shared SMT threads   shared L3   different socket
    pthreads_barrier_wait    23352                 4796        49237
    omp barrier (icc 11.0)    2761                  479         1206
    Spin loop                17388                  267          787

• SMT can be a big performance problem for synchronizing threads
• Well known for a long time…
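To reproduce such barrier numbers, a simple approach is to time a long sequence of barriers per thread. The sketch below is an assumption about how this could be measured (it is not the authors' benchmark); NREP is an arbitrary repetition count, and it reports microseconds from omp_get_wtime() rather than raw cycle counts, so divide by the core clock period to compare with the tables above.

    /* Measure the average cost of an OpenMP barrier (sketch). */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        const int NREP = 1000000;   /* number of barriers timed per thread */
        #pragma omp parallel
        {
            double t0 = omp_get_wtime();
            for (int i = 0; i < NREP; ++i) {
                #pragma omp barrier
            }
            double t = (omp_get_wtime() - t0) / NREP;
            /* One thread reports; all threads measured the same barriers. */
            #pragma omp master
            printf("%d threads: %.3f us per barrier\n",
                   omp_get_num_threads(), t * 1e6);
        }
        return 0;
    }

As the topology tables show, the result depends strongly on where the threads run, so pin them explicitly for each case, e.g. with likwid-pin (likwid-pin -c 0,1 ./a.out) to place two threads on cores that do or do not share a cache.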

