OpenMP and MPI - EPCC

Shared Memory Programming: OpenMP and MPI


Clustered architecture


Programming clusters

• How should we program such a machine?
• Cannot (in general) use OpenMP across the whole system
  – requires support for a single address space
  – this is possible in software, but inefficient
• Could use MPI across the whole system
• Could use OpenMP within a node and MPI between nodes
  – is there any advantage to this? (see the sketch below)
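
As a concrete illustration of the last option, here is a minimal hybrid "hello world" sketch (not from the course material): one MPI process per node, with OpenMP threads inside each process.

program hybrid_hello
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, rank, provided

  ! Ask for MPI_THREAD_FUNNELED: only the master thread will call MPI.
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Each MPI process runs a team of OpenMP threads on its node.
!$omp parallel
  print *, 'MPI process', rank, 'OpenMP thread', omp_get_thread_num()
!$omp end parallel

  call MPI_Finalize(ierr)
end program hybrid_hello

It would typically be launched with one process per node (e.g. mpirun -np <number of nodes>) and OMP_NUM_THREADS set to the number of cores per node.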


Issues

We need to consider:
• Development / maintenance costs
• Portability
• Performance


Development / maintenance

• In most cases, development and maintenance will be harder than for an MPI code, and much harder than for an OpenMP code.
• If an MPI code already exists, adding OpenMP may not involve too much extra effort.
• In some cases it may be possible to use a simpler MPI implementation, because the need for scalability is reduced.
  – e.g. a 1-D domain decomposition instead of 2-D


Portability

• Both OpenMP and MPI are themselves highly portable (but not perfect).
• Combined MPI/OpenMP is less so:
  – the main issue is the thread safety of MPI
  – if thread safety is assumed, portability will be reduced
  – system-level control of the number of threads/processes is not standardised
  – batch environments have varying amounts of support for mixed-mode codes.
• Need to make sure the code functions correctly (with conditional compilation) as a stand-alone MPI code and as a stand-alone OpenMP code (see the sketch below).
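
Below is a minimal skeleton of such conditional compilation, assuming a hypothetical MIXED_MODE macro (the macro name and the program are illustrative, not from the course): the same source builds as a stand-alone OpenMP code or as a mixed MPI/OpenMP code when preprocessed (e.g. as a .F90 file).

program portable_skeleton
#ifdef MIXED_MODE
  use mpi
#endif
  use omp_lib
  implicit none
  integer :: rank, nproc
#ifdef MIXED_MODE
  integer :: ierr, provided
  ! Request only FUNNELED support: MPI is called by the master thread,
  ! outside parallel regions, so full thread safety is not required.
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
#else
  rank  = 0          ! stand-alone OpenMP build: behave as a single process
  nproc = 1
#endif

  print *, 'rank', rank, 'of', nproc, 'using', &
           omp_get_max_threads(), 'threads'

#ifdef MIXED_MODE
  call MPI_Finalize(ierr)
#endif
end program portable_skeleton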


Performance

Six possible performance reasons for mixed OpenMP/MPI codes:
1. Intra-node MPI overheads
2. Poorly scaling MPI codes
3. Replicated data
4. Restricted MPI process numbers
5. Limited MPI process numbers
6. Computational power balancing


Intra-node MPI overheads

• Simple argument:
  – Use of OpenMP within a node avoids the overheads associated with calling the MPI library.
  – Therefore a mixed OpenMP/MPI implementation will outperform a pure MPI version.


Intra-node MPI overheads

• Complicating factors:
  – The OpenMP implementation may introduce additional overheads not present in the MPI code (e.g. synchronisation, false sharing, sequential sections).
  – The mixed implementation may require more synchronisation than a pure OpenMP version, if MPI is assumed not to be thread-safe.
  – Implicit point-to-point synchronisation may be replaced by (more expensive) barriers.
  – In the pure MPI code, intra-node messages will often be naturally overlapped with inter-node messages.
  – It is harder to overlap inter-thread communication with inter-node messages.


Example

!$omp parallel do
      DO I = 1, N
         A(I) = B(I) + C(I)
      END DO
!     <-- implicit barrier added here

      CALL MPI_BSEND(A(N), 1, .....)
      CALL MPI_RECV(A(0), 1, .....)
!     In the pure MPI code, these intra-node messages would be
!     overlapped with the inter-node messages.

!$omp parallel do
      DO I = 1, N
         D(I) = A(I-1) + A(I)
      END DO
!     Inter-thread communication occurs here: A(I-1) may have been
!     written by a different thread in the first loop.


Replicated data

• Some MPI codes use a replicated data strategy
  – all processes have a copy of a major data structure
• A pure MPI code needs one copy per process(or).
• A mixed code would only require one copy per node
  – the data structure can be shared by multiple threads within a process (see the sketch below).
• Should improve performance for large data sizes, through reduced swapping.
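
A minimal sketch of the idea (the program and the names are illustrative, not from the course): a lookup table that a pure MPI code would replicate in every process is allocated once per process and shared by all the OpenMP threads in it, so a node running 1 process x 8 threads holds one copy instead of eight.

program replicated_data_sketch
  use mpi
  use omp_lib
  implicit none
  integer, parameter :: ntable = 1000000
  real, allocatable  :: table(:)
  real    :: total
  integer :: ierr, i

  call MPI_Init(ierr)

  allocate(table(ntable))        ! one copy per MPI process, not per core
  table = 1.0
  ! Every process ends up with the same replicated data.
  call MPI_Bcast(table, ntable, MPI_REAL, 0, MPI_COMM_WORLD, ierr)

  total = 0.0
  ! All threads in this process read the single shared copy of table.
!$omp parallel do reduction(+:total)
  do i = 1, ntable
     total = total + table(i)
  end do

  print *, 'local sum =', total
  deallocate(table)
  call MPI_Finalize(ierr)
end program replicated_data_sketch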


Restricted MPI process numbers

• Some MPI codes cannot run on arbitrary numbers of processors
  – e.g. they may be restricted to powers of 2
• Some SMPs don't have power-of-two numbers of processors.
  – e.g. on a 12-processor node, a power-of-two code can still use every processor by running 4 MPI processes with 3 OpenMP threads each.
• It may be easier to exploit these processor counts with a mixed code than to re-engineer the MPI code.


Limited MPI process numbers

• The MPI library implementation may not be able to handle hundreds of processes adequately.
• This is only likely to be an issue on very large systems.
• A mixed MPI/OpenMP implementation will reduce the number of MPI processes.


Computational power balancing

• Dynamic load balancing technique
  – Huang and Tafti (1999)
• Mixed-mode code:
  – work is initially partitioned equally between MPI processes
  – when a process becomes overloaded, it increases its number of OpenMP threads (see the sketch below)
  – requires a flexible job scheduling policy
  – better suited to large shared-memory machines than to clusters.
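
The following sketch shows the general idea only (it is not Huang and Tafti's actual scheme; the program, names and the 10% threshold are assumptions): each process times its own work, compares it with the average across processes, and asks for more OpenMP threads if it is lagging.

program power_balance_sketch
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, nproc, nthreads, step
  double precision :: t0, my_time, sum_time, avg_time

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
  nthreads = omp_get_max_threads()

  do step = 1, 100
     t0 = MPI_Wtime()
     ! ... do this process's share of the work using nthreads threads ...
     my_time = MPI_Wtime() - t0

     ! Compare this process's time with the average over all processes.
     call MPI_Allreduce(my_time, sum_time, 1, MPI_DOUBLE_PRECISION, &
                        MPI_SUM, MPI_COMM_WORLD, ierr)
     avg_time = sum_time / nproc

     ! An overloaded process asks for more threads for the next step
     ! (this assumes the job scheduler has spare cores to give it).
     if (my_time > 1.1d0 * avg_time) then
        nthreads = nthreads + 1
        call omp_set_num_threads(nthreads)
     end if
  end do

  call MPI_Finalize(ierr)
end program power_balance_sketch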


Case study

• Regular 2-dimensional grid (N x N).
• Nearest-neighbour communication.
• Options for a mixed OpenMP/MPI implementation:


Option 1

• 1-D domain decomposition in MPI
• Loop-level parallelism in OpenMP

[diagram: the grid divided into four strips, one per MPI process (process 0 to process 3)]


Option 1

Pros:
• Simplest to program.
• MPI calls naturally occur outside parallel loops: thread safety of MPI is not an issue (see the sketch below).

Cons:
• Communication is O(N).
• Loop-level parallelism may incur unnecessary synchronisation overheads.
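
A compact sketch of Option 1 (the Jacobi-style update and the array names are illustrative, not from the course): a 1-D decomposition over the second array index in MPI, with loop-level OpenMP inside each process. All MPI calls sit outside parallel regions.

program option1_sketch
  use mpi
  use omp_lib
  implicit none
  integer, parameter :: n = 512, niter = 100   ! assumes nproc divides n
  integer :: ierr, rank, nproc, left, right, nloc, i, j, iter
  double precision, allocatable :: u(:,:), unew(:,:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

  nloc  = n / nproc                    ! columns owned by this process
  left  = rank - 1
  right = rank + 1
  if (rank == 0)         left  = MPI_PROC_NULL
  if (rank == nproc - 1) right = MPI_PROC_NULL

  allocate(u(n, 0:nloc+1), unew(n, 0:nloc+1))
  u = 0.0d0; unew = 0.0d0

  do iter = 1, niter
     ! Halo exchange with neighbouring processes (outside OpenMP).
     call MPI_Sendrecv(u(1,nloc),   n, MPI_DOUBLE_PRECISION, right, 0, &
                       u(1,0),      n, MPI_DOUBLE_PRECISION, left,  0, &
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
     call MPI_Sendrecv(u(1,1),      n, MPI_DOUBLE_PRECISION, left,  1, &
                       u(1,nloc+1), n, MPI_DOUBLE_PRECISION, right, 1, &
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

     ! Loop-level OpenMP parallelism over the local columns.
!$omp parallel do private(i)
     do j = 1, nloc
        do i = 2, n-1
           unew(i,j) = 0.25d0 * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))
        end do
     end do

     u = unew
  end do

  call MPI_Finalize(ierr)
end program option1_sketch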


Option 2

• 1-D domain decomposition in MPI
• 1-D domain decomposition (in the other dimension) in OpenMP

[diagram: the grid divided one way between process 0 to process 3, and in the other dimension between the OpenMP threads within each process]


Option 2

Pros:
• Still relatively simple to program (see the sketch below).
• Avoids excessive synchronisation.

Cons:
• Still O(N) communication.
• Need to take care if MPI is not thread-safe (additional synchronisation?).
• If MPI is thread-safe, each thread could communicate with the neighbouring threads in the neighbouring processes.
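
One way the Option 2 threading could look (an illustrative fragment, assuming the same arrays as the previous sketch plus integer variables tid, nth, ilo, ihi): instead of a parallel do, each thread takes a fixed block of rows computed from its thread id, so the same threads touch the same data on every iteration and there is no per-loop scheduling.

!$omp parallel private(tid, nth, ilo, ihi, i, j)
  tid = omp_get_thread_num()
  nth = omp_get_num_threads()
  ilo = 2 + ( tid      * (n - 2)) / nth   ! first interior row for this thread
  ihi = 1 + ((tid + 1) * (n - 2)) / nth   ! last interior row for this thread
  do j = 1, nloc
     do i = ilo, ihi
        unew(i,j) = 0.25d0 * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))
     end do
  end do
!$omp end parallel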


Option 3

• 2-D domain decomposition in both MPI and OpenMP


Option 3

Pros:
• O(N^1/2) communication.

Cons:
• Complex to program, especially if all threads communicate.


Practical Session: Traffic

• Starting from a working MPI version, develop a mixed OpenMP/MPI implementation of a simple traffic simulation.
• Experiment with different numbers of threads and processes.
