
Hybrid MPI and OpenMP programming tutorial - Prace Training Portal


SUN: Running hybrid on Sun Constellation Cluster Ranger (Slide 21/154, Rabenseifner, Hager, Jost)

• Highly hierarchical
• Shared memory:
– Cache-coherent, non-uniform memory access (ccNUMA) 16-way node (blade)
• Distributed memory:
– Network of ccNUMA blades
– Core-to-core, socket-to-socket, blade-to-blade, chassis-to-chassis

SUN: NPB-MZ Class E Scalability on Ranger (Slide 22/154)

[Figure: NPB-MZ Class E scalability on Sun Constellation. MFlop/s (0 to 5,000,000) versus core count (1024, 2048, 4096, 8192) for SP-MZ (MPI), SP-MZ MPI+OpenMP, BT-MZ (MPI), and BT-MZ MPI+OpenMP.]

• Scalability measured in MFlop/s
• MPI+OpenMP outperforms pure MPI
• Use of numactl is essential to achieve scalability
• BT: significant improvement (235%): load-balancing issues solved with MPI+OpenMP
• SP: pure MPI is already load-balanced, but hybrid is 9.6% faster due to the smaller message rate at the NIC; pure MPI cannot be built for 8192 processes!
• Hybrid: SP still scales, BT does not scale

NUMA Control: Process Placement (Slide 23/154)

• Affinity and policy can be changed externally through numactl at the socket and core level.

Commands:
numactl ./a.out
numactl -N 1 ./a.out
numactl -C 0,1 ./a.out

NUMA Operations: Memory Placement (Slide 24/154)

numactl -N 1 -l ./a.out

Memory allocation:
• MPI: local allocation is best
• OpenMP:
– Interleave is best for large, completely shared arrays that are randomly accessed by different threads
– Local is best for private arrays
• Once allocated, a memory structure's placement is fixed
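The numactl recipes above can be sketched as a small launcher script. This is a hedged illustration, not part of the original slides: the binary name ./a.out, the four-socket blade layout, and the helper function placement_cmd are assumptions chosen to mirror Ranger's hierarchy; on a real system the prefix would be passed to the MPI launcher (e.g. one numactl-wrapped task per socket).

```shell
#!/bin/sh
# Sketch: build a numactl prefix that pins one (hypothetical) hybrid
# MPI task per socket and keeps its memory allocations local,
# combining the process-placement (-N) and memory-placement (-l)
# options shown on slides 23 and 24.

placement_cmd() {
    socket="$1"                       # socket index for this MPI task
    # -N <socket> : run the process on the given NUMA node (socket)
    # -l          : allocate memory on the local node ("local is best")
    echo "numactl -N $socket -l ./a.out"
}

# One MPI task on each of the four sockets of a blade:
for s in 0 1 2 3; do
    placement_cmd "$s"
done
```

For large shared OpenMP arrays accessed randomly by all threads, the slides recommend interleaving instead; with numactl that would replace `-l` with `-i all` (interleave across all nodes).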
