Hybrid MPI and OpenMP programming tutorial - Prace Training Portal

[… earlier slides skipped …]

Running the code: examples with mpiexec (slide 49/154)

• Example for using mpiexec on a dual-socket quad-core cluster:
    $ export OMP_NUM_THREADS=8
    $ mpiexec -pernode ./a.out
• Same, but with 2 MPI processes per node:
    $ export OMP_NUM_THREADS=4
    $ mpiexec -npernode 2 ./a.out
• Pure MPI:
    $ export OMP_NUM_THREADS=1   # or nothing if serial code
    $ mpiexec ./a.out

Running the code efficiently? (slide 50/154)

• Symmetric, UMA-type compute nodes have become rare animals
  – NEC SX
  – Intel 1-socket ("Port Townsend/Melstone/Lynnfield") – see case studies
  – Hitachi SR8000, IBM SP2, single-core multi-socket Intel Xeon … (all dead)
• Instead, systems have become "non-isotropic" on the node level
  – ccNUMA (AMD Opteron, SGI Altix, IBM Power6 (p575), Intel Nehalem)
  – Multi-core, multi-socket
    • Shared vs. separate caches
    • Multi-chip vs. single-chip
    • Separate/shared buses
  [Figure: block diagrams of ccNUMA and chipset-based UMA node layouts, showing cores (P), caches (C), memory interfaces (MI) and memory]

Issues for running code efficiently on "non-isotropic" nodes (slide 51/154)

• ccNUMA locality effects
  – Penalties for inter-LD (locality-domain) access
  – Impact of contention
  – Consequences of file I/O for page placement
  – Placement of MPI buffers
• Multi-core / multi-socket anisotropy effects
  – Bandwidth bottlenecks, shared caches
  – Intra-node MPI performance: core ↔ core vs. socket ↔ socket
  – OpenMP loop overhead depends on the mutual position of threads in a team

A short introduction to ccNUMA (slide 52/154)

• ccNUMA:
  – the whole memory is transparently accessible by all processors
  – but physically distributed
  – with varying bandwidth and latency
  – and potential contention (shared memory paths)
  [Figure: two locality domains, each with four cores (C) and local memory (M)]

(Hybrid Parallel Programming, slides 49–52/154; Rabenseifner, Hager, Jost)
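To check the process/thread layout that the mpiexec examples on slide 49 actually produce, a minimal hybrid MPI+OpenMP hello-world can be launched the same way. The sketch below is not part of the original slides; file and program names are arbitrary, and it assumes a compiler wrapper such as mpicc with OpenMP enabled (e.g. mpicc -fopenmp hello_hybrid.c -o a.out).

    /* Minimal hybrid MPI+OpenMP hello-world (illustrative sketch, not from the slides). */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided, rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        /* MPI_THREAD_FUNNELED is sufficient as long as only the master
         * thread makes MPI calls (the usual hybrid "masteronly" style). */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);

    #pragma omp parallel
        {
            /* Each OpenMP thread reports its place in the hybrid layout. */
            printf("host %s, MPI rank %d, OpenMP thread %d of %d\n",
                   host, rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

With the second example above (OMP_NUM_THREADS=4, mpiexec -npernode 2), each node should report two ranks with four threads each; which cores the threads actually run on additionally depends on affinity/pinning settings.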

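Since slide 52 stresses that ccNUMA memory is transparently accessible but physically distributed, page placement matters for performance. The sketch below is an added illustration (not taken from the slides) of the common first-touch technique: arrays are initialized in an OpenMP-parallel loop with the same static schedule as the later compute loop, so that, with threads pinned to cores, each thread's pages end up in its own locality domain and inter-LD traffic is minimized.

    /* First-touch page placement on a ccNUMA node (illustrative sketch).
     * Assumes threads are pinned and both loops use the same static schedule. */
    #include <stdlib.h>
    #include <omp.h>

    #define N 10000000L

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));

        /* Parallel first touch: each thread's chunk of pages is mapped
         * into that thread's locality domain.                           */
    #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; ++i) {
            a[i] = 0.0;
            b[i] = (double)i;
        }

        /* Later compute loop with the same schedule accesses mostly local memory. */
    #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; ++i)
            a[i] = 2.0 * b[i];

        free(a);
        free(b);
        return 0;
    }

Note that first touch only helps if threads are actually pinned; without affinity control the operating system may migrate threads across locality domains and undo the placement.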