Hybrid MPI and OpenMP programming tutorial - Prace Training Portal

Comparison: MPI-based parallelization vs. DSM

• MPI based:
  – Potential to exchange the boundary between two domains in one large message
    → dominated by the bandwidth of the network
• DSM based (e.g. Intel Cluster OpenMP):
  – Additional latency-based overhead in each barrier
    → may be marginal
  – Communication of the updated data of whole pages
    → not all of this data may be needed, i.e., too much data is transferred
    → packages may be too small → significant latency
  – Communication is not oriented on the boundaries of a domain decomposition
    → probably more data must be transferred than necessary
• Rule of thumb: communication in the OpenMP-only (DSM) model may be 10 times slower than with hybrid MPI+OpenMP.

(Hybrid Parallel Programming, slide 105/154; Rabenseifner, Hager, Jost)

— skipped —

Comparing results with the heat example

• Normal OpenMP on shared memory (ccNUMA): heat_x.c / heatc2_x.c with OpenMP on the NEC TX-7
  [Figure: speedup vs. number of threads (serial up to 10) for problem sizes 1000x1000, 250x250, 80x80 and 20x20, compared against ideal speedup]

(Hybrid Parallel Programming, slide 106/154; Rabenseifner, Hager, Jost)

— skipped —

Heat example: Cluster OpenMP efficiency

• Cluster OpenMP on a dual-Xeon cluster: heats2_x.c with Cluster OpenMP on a NEC dual Xeon EM64T cluster
  [Figure: speedup vs. number of nodes (1 to 8) for problem sizes 6000x6000, 3000x3000, 1000x1000 and 250x250, with static(default), static(1:1), static(1:2), static(1:10) and static(1:50) schedules and 1 or 2 threads per node]
• Efficiency only with a small communication footprint: up to 3 CPUs with 3000x3000
• Terrible with a non-default schedule
• No speedup with 1000x1000
• Second CPU only usable in small cases

(Hybrid Parallel Programming, slide 107/154; Rabenseifner, Hager, Jost)

Back to the mixed model – an example

  [Figure: two SMP nodes connected by the node interconnect; each node has two sockets, and on each socket one MPI process runs multithreaded on a quad-core (4 x CPU)]
• Topology problem solved: only horizontal inter-node communication
• Still intra-node communication
• Several threads per SMP node are communicating in parallel: network saturation is possible
• Additional OpenMP overhead
• With the masteronly style, 75% of the threads sleep while the master thread communicates (see the sketch below)
• With overlapping communication and computation, the master thread should be reserved for communication only partially – otherwise too expensive
• The MPI library must support
  – multiple threads
  – two fabrics (shared memory + inter-node)

(Hybrid Parallel Programming, slide 108/154; Rabenseifner, Hager, Jost)
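The masteronly style and the MPI thread-support requirement from slide 108 can be made concrete with a short sketch. The following hybrid MPI+OpenMP code is illustrative only and is not taken from the tutorial's heat_x.c / heats2_x.c sources: it assumes a 1D domain decomposition of a Jacobi-type heat stencil, and the array names, sizes and tags are invented for the example. All MPI calls are issued outside the OpenMP parallel regions, so requesting MPI_THREAD_FUNNELED is sufficient.

/* Minimal sketch of the "masteronly" hybrid style: MPI is called only
 * outside of OpenMP parallel regions. The halo exchange and array sizes
 * are illustrative and not taken from heat_x.c / heats2_x.c. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000            /* local rows per MPI process (plus 2 halo rows) */
#define M 1000            /* columns */
#define STEPS 100

static double u[N + 2][M], unew[N + 2][M];

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* Masteronly style only needs FUNNELED: all MPI calls are made by
     * the master thread, outside of parallel regions. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);   /* thread support too low */

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int step = 0; step < STEPS; step++) {

        /* Communication phase: master thread only; the other threads of
         * the process are idle here (the "75% sleep" issue on the slide). */
        MPI_Sendrecv(u[1],     M, MPI_DOUBLE, up,   0,
                     u[N + 1], M, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(u[N],     M, MPI_DOUBLE, down, 1,
                     u[0],     M, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Computation phase: all OpenMP threads, no MPI calls inside. */
        #pragma omp parallel for
        for (int i = 1; i <= N; i++)
            for (int j = 1; j < M - 1; j++)
                unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                     + u[i][j - 1] + u[i][j + 1]);

        #pragma omp parallel for
        for (int i = 1; i <= N; i++)
            for (int j = 1; j < M - 1; j++)
                u[i][j] = unew[i][j];
    }

    if (rank == 0)
        printf("done, provided thread level = %d\n", provided);
    MPI_Finalize();
    return 0;
}

Built with an MPI compiler wrapper plus OpenMP (e.g. mpicc -fopenmp) and started with one MPI process per socket and OMP_NUM_THREADS=4, this matches the placement drawn on slide 108. During the two MPI_Sendrecv calls only the master thread works, which is exactly the idle-threads drawback listed above; the overlapping style avoids this, but a thread communicating inside a parallel region needs at least MPI_THREAD_FUNNELED, and MPI_THREAD_MULTIPLE if several threads communicate concurrently.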
