Hybrid MPI and OpenMP programming tutorial - Prace Training Portal

Hybrid Parallel Programming (Rabenseifner, Hager, Jost), slides 61-64 of 154

Intra-node MPI characteristics: IMB Ping-Pong benchmark

• Code (to be run on 2 processors); a self-contained C sketch appears at the end of this section:

      wc = MPI_WTIME()
      do i=1,NREPEAT
        if(rank.eq.0) then
          MPI_SEND(buffer,N,MPI_BYTE,1,0,MPI_COMM_WORLD,ierr)
          MPI_RECV(buffer,N,MPI_BYTE,1,0,MPI_COMM_WORLD, &
                   status,ierr)
        else
          MPI_RECV(…)
          MPI_SEND(…)
        endif
      enddo
      wc = MPI_WTIME() - wc

• Intranode (1S): mpirun -np 2 -pin "1 3" ./a.out
• Intranode (2S): mpirun -np 2 -pin "2 3" ./a.out
• Internode: mpirun -np 2 -pernode ./a.out

[Figure: dual-socket node diagram with processors (P), caches (C), chipset, and memory]

IMB Ping-Pong: Latency
Intra-node vs. inter-node on a Woodcrest DDR-IB cluster (Intel MPI 3.1)

• Measured latencies: 3.24 μs (IB internode), 0.55 μs (IB intranode 2S), 0.31 μs (IB intranode 1S)
• Affinity matters!

IMB Ping-Pong: Bandwidth Characteristics
Intra-node vs. inter-node on a Woodcrest DDR-IB cluster (Intel MPI 3.1)

• Three cases compared:
  – between two cores of one socket (shared-cache advantage)
  – between two sockets of one node (intra-node shared-memory communication)
  – between two nodes via InfiniBand
• Affinity matters!

OpenMP Overhead

• As with intra-node MPI, OpenMP loop start-up overhead varies with the mutual position of the threads in a team
• Possible variations
  – intra-socket vs. inter-socket
  – different overhead for "parallel for" vs. plain "for"
  – if one multi-threaded MPI process spans multiple sockets,
    • … are neighboring threads on neighboring cores?
    • … or are threads distributed "round-robin" across cores?
• Test benchmark: vector triad (a complete sketch appears at the end of this section):

      #pragma omp parallel
      for(int j=0; j < NITER; j++){
        #pragma omp (parallel) for
        for(i=0; i < N; ++i)
          a[i] = b[i] + c[i]*d[i];
        if(OBSCURE)
          dummy(a,b,c,d);
      }

• Look at performance for small array sizes!
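To make the overhead measurement concrete, here is a minimal stand-alone version of the vector triad benchmark sketched on the "OpenMP Overhead" slide. It is an illustrative sketch, not the tutorial's original code: the array size N, the repetition count NITER, the dummy() stub, and the timing/output are assumptions chosen so that loop start-up overhead, rather than memory bandwidth, dominates.

      /*
       * Vector triad overhead benchmark: sketch only, with assumed
       * values for N and NITER (not taken from the tutorial).
       */
      #include <stdio.h>
      #include <stdlib.h>
      #include <omp.h>

      #define N      1000       /* small array: start-up overhead dominates */
      #define NITER  100000     /* repeat often enough for a stable timing  */

      static volatile int OBSCURE = 0;   /* never true; keeps the loop alive */

      /* never called; its mere presence prevents dead-code elimination */
      void dummy(double *a, double *b, double *c, double *d)
      { (void)a; (void)b; (void)c; (void)d; }

      int main(void)
      {
          double *a = malloc(N * sizeof(double));
          double *b = malloc(N * sizeof(double));
          double *c = malloc(N * sizeof(double));
          double *d = malloc(N * sizeof(double));
          for (int i = 0; i < N; ++i) { a[i] = 0.0; b[i] = c[i] = d[i] = 1.0; }

          double t0 = omp_get_wtime();
          #pragma omp parallel
          for (int j = 0; j < NITER; j++) {
              /* Worksharing "for" inside an enclosing parallel region.
               * For the "parallel for" variant mentioned on the slide,
               * drop the enclosing "#pragma omp parallel" and use
               * "#pragma omp parallel for" on the inner loop instead. */
              #pragma omp for
              for (int i = 0; i < N; ++i)
                  a[i] = b[i] + c[i] * d[i];
              if (OBSCURE)
                  dummy(a, b, c, d);
          }
          double t = omp_get_wtime() - t0;

          printf("triad: %.2f MFlop/s\n", 2.0 * N * NITER / t / 1e6);
          free(a); free(b); free(c); free(d);
          return 0;
      }

Running both variants while pinning the threads intra-socket vs. inter-socket reproduces the comparisons listed on the slide; at small array sizes the difference is almost entirely loop start-up and barrier cost.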
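Likewise, the Fortran ping-pong fragment from the first slide of this section could be fleshed out roughly as follows. This is a hedged C sketch under assumed values: the message size N, the repeat count NREPEAT, and the latency printout are placeholders, and process pinning is still left to the MPI launcher, as in the mpirun lines on that slide.

      /*
       * Ping-pong between ranks 0 and 1 (run with exactly 2 processes).
       * N and NREPEAT are illustrative, not tutorial values.
       */
      #include <stdio.h>
      #include <mpi.h>

      #define N        1        /* message size in bytes (latency regime) */
      #define NREPEAT  10000

      int main(int argc, char **argv)
      {
          char buffer[N];
          int rank;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          double wc = MPI_Wtime();
          for (int i = 0; i < NREPEAT; i++) {
              if (rank == 0) {
                  MPI_Send(buffer, N, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                  MPI_Recv(buffer, N, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
              } else if (rank == 1) {
                  MPI_Recv(buffer, N, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
                  MPI_Send(buffer, N, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
              }
          }
          wc = MPI_Wtime() - wc;

          if (rank == 0)   /* one-way latency = half the round-trip time */
              printf("latency: %.2f us\n", wc / NREPEAT / 2.0 * 1e6);

          MPI_Finalize();
          return 0;
      }

With a small N the measured time is dominated by latency, as in the numbers quoted above; increasing N moves the benchmark into the bandwidth regime compared on the bandwidth slide.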
