Hybrid MPI and OpenMP programming tutorial - Prace Training Portal
Intra-node MPI characteristics: IMB Ping-Pong benchmark

• Code (to be run on 2 processes):

      wc = MPI_WTIME()
      do i=1,NREPEAT
        if(rank.eq.0) then
          call MPI_SEND(buffer,N,MPI_BYTE,1,0,MPI_COMM_WORLD,ierr)
          call MPI_RECV(buffer,N,MPI_BYTE,1,0,MPI_COMM_WORLD, &
                        status,ierr)
        else
          call MPI_RECV(…)
          call MPI_SEND(…)
        endif
      enddo
      wc = MPI_WTIME() - wc

• Intra-node (1 socket):  mpirun -np 2 -pin "1 3" ./a.out
• Intra-node (2 sockets): mpirun -np 2 -pin "2 3" ./a.out
• Inter-node:             mpirun -np 2 -pernode ./a.out

IMB Ping-Pong: Latency
Intra-node vs. inter-node on a Woodcrest DDR-IB cluster (Intel MPI 3.1)

[Figure: latency chart. IB inter-node: 3.24 μs; IB intra-node, 2 sockets: 0.55 μs; IB intra-node, 1 socket: 0.31 μs. Affinity matters!]

IMB Ping-Pong: Bandwidth Characteristics
Intra-node vs. inter-node on a Woodcrest DDR-IB cluster (Intel MPI 3.1)

[Figure: bandwidth vs. message size for three placements — between two cores of one socket (shared-cache advantage), between two sockets of one node (intra-node shared-memory communication), and between two nodes via InfiniBand. Affinity matters!]

OpenMP Overhead

• As with intra-node MPI, the OpenMP loop start-up overhead varies with the mutual position of the threads in a team.
• Possible variations:
  – Intra-socket vs. inter-socket
  – Different overhead for "parallel for" vs. plain "for"
  – If one multi-threaded MPI process spans multiple sockets,
    • … are neighboring threads on neighboring cores?
    • … or are threads distributed "round-robin" across cores?
• Test benchmark: vector triad

      #pragma omp parallel
      for(int j=0; j < NITER; j++){
      /* plain "for" inside the enclosing parallel region; the other
         variant drops the outer region and uses "parallel for" here */
      #pragma omp for
        for(int i=0; i < N; ++i)
          a[i] = b[i] + c[i]*d[i];
        if(OBSCURE)
          dummy(a,b,c,d);   /* keeps the compiler from eliminating the loop */
      }

• Look at performance for small array sizes!
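As a rough illustration of the measurement described above, the following is a minimal, self-contained sketch (not the tutorial's actual benchmark code) that times the "parallel for" variant of the triad for a small array, where the parallel start-up overhead dominates. The array size N, the repetition count NITER, and the printf guard that replaces the slide's dummy()/OBSCURE trick are illustrative choices.

      /* Sketch: time the vector triad for a small array to expose
         OpenMP loop start-up overhead (assumed parameters).        */
      #include <stdio.h>
      #include <stdlib.h>
      #include <omp.h>

      #define N     1000       /* small array: overhead dominates   */
      #define NITER 100000     /* repeat to get measurable run time */

      int main(void)
      {
          double *a = malloc(N * sizeof(double));
          double *b = malloc(N * sizeof(double));
          double *c = malloc(N * sizeof(double));
          double *d = malloc(N * sizeof(double));
          for (int i = 0; i < N; i++) { a[i] = 0.0; b[i] = c[i] = d[i] = 1.0; }

          double t0 = omp_get_wtime();
          for (int j = 0; j < NITER; j++) {
              /* "parallel for" variant: a worksharing region is set up
                 in every iteration of the outer loop                   */
              #pragma omp parallel for
              for (int i = 0; i < N; i++)
                  a[i] = b[i] + c[i] * d[i];
              if (a[N/2] < 0.0)   /* never true; prevents dead-code elimination */
                  printf("%f\n", a[0]);
          }
          double t = omp_get_wtime() - t0;

          printf("%d threads: %.3f us per sweep, %.1f MFlop/s\n",
                 omp_get_max_threads(), 1e6 * t / NITER,
                 2.0 * N * NITER / t / 1e6);   /* 2 flops per element */

          free(a); free(b); free(c); free(d);
          return 0;
      }

Comparing this against a version with a single enclosing "#pragma omp parallel" and a plain "#pragma omp for" inside the outer loop, and re-running with the threads packed onto one socket vs. spread across sockets (for example via the OMP_PLACES/OMP_PROC_BIND environment variables of OpenMP 4.0+, or whatever pinning tool the system provides), reproduces the overhead variations the slide refers to.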
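To answer the slide's question of whether neighboring threads actually sit on neighboring cores, it helps to print the thread-to-core mapping before benchmarking. The sketch below is Linux-specific and is not part of the tutorial material: sched_getcpu() is a glibc call and requires _GNU_SOURCE.

      /* Linux-only sketch: report which core each OpenMP thread runs on. */
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <sched.h>
      #include <omp.h>

      int main(void)
      {
          #pragma omp parallel
          {
              /* Each thread prints its ID and current core;
                 output order is not guaranteed.              */
              printf("thread %2d of %2d runs on core %d\n",
                     omp_get_thread_num(), omp_get_num_threads(),
                     sched_getcpu());
          }
          return 0;
      }

Running this under different OMP_PLACES / OMP_PROC_BIND settings (or under the MPI launcher's pinning options) shows directly whether threads are packed per socket or scattered round-robin across the cores.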