OpenMP and MPI - EPCC

Shared Memory Programming: OpenMP and MPI


Clustered architecture


Programming clusters

• How should we program such a machine?
• Cannot (in general) use OpenMP across the whole system
  – requires support for a single address space
  – this is possible in software, but inefficient
• Could use MPI across the whole system
• Could use OpenMP within a node and MPI between nodes
  – is there any advantage to this? (see the sketch below)
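
As a concrete illustration of the last option, here is a minimal hybrid "hello world" sketch (not from the course material): one MPI process per node, with OpenMP threads inside each process.

program hybrid_hello
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, rank, provided

  ! Ask for MPI_THREAD_FUNNELED: only the master thread will call MPI.
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Each MPI process runs a team of OpenMP threads on its node.
!$omp parallel
  print *, 'MPI process', rank, 'OpenMP thread', omp_get_thread_num()
!$omp end parallel

  call MPI_Finalize(ierr)
end program hybrid_hello

It would typically be launched with one process per node (e.g. mpirun -np <number of nodes>) and OMP_NUM_THREADS set to the number of cores per node.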


Issues

We need to consider:
• Development / maintenance costs
• Portability
• Performance


Development / maintenance

• In most cases, development and maintenance will be harder than for an MPI code, and much harder than for an OpenMP code.
• If an MPI code already exists, adding OpenMP may not involve too much extra effort.
• In some cases it may be possible to use a simpler MPI implementation, because the need for scalability is reduced.
  – e.g. a 1-D domain decomposition instead of 2-D


Portability

• Both OpenMP and MPI are themselves highly portable (but not perfect).
• Combined MPI/OpenMP is less so:
  – the main issue is the thread safety of MPI
  – if thread safety is assumed, portability will be reduced
  – system-level control of the number of threads/processes is not standardised
  – batch environments have varying amounts of support for mixed-mode codes.
• Need to make sure the code functions correctly (with conditional compilation) as a stand-alone MPI code and as a stand-alone OpenMP code (see the sketch below).
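
Below is a minimal skeleton of such conditional compilation, assuming a hypothetical MIXED_MODE macro (the macro name and the program are illustrative, not from the course): the same source builds as a stand-alone OpenMP code or as a mixed MPI/OpenMP code when preprocessed (e.g. as a .F90 file).

program portable_skeleton
#ifdef MIXED_MODE
  use mpi
#endif
  use omp_lib
  implicit none
  integer :: rank, nproc
#ifdef MIXED_MODE
  integer :: ierr, provided
  ! Request only FUNNELED support: MPI is called by the master thread,
  ! outside parallel regions, so full thread safety is not required.
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
#else
  rank  = 0          ! stand-alone OpenMP build: behave as a single process
  nproc = 1
#endif

  print *, 'rank', rank, 'of', nproc, 'using', &
           omp_get_max_threads(), 'threads'

#ifdef MIXED_MODE
  call MPI_Finalize(ierr)
#endif
end program portable_skeleton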


Performance

Six possible performance reasons for mixed OpenMP/MPI codes:
1. Intra-node MPI overheads
2. Poorly scaling MPI codes
3. Replicated data
4. Restricted MPI process numbers
5. Limited MPI process numbers
6. Computational power balancing


Intra-node MPI overheads

• Simple argument:
  – Use of OpenMP within a node avoids the overheads associated with calling the MPI library.
  – Therefore a mixed OpenMP/MPI implementation will outperform a pure MPI version.


Intra-node MPI overheads

• Complicating factors:
  – The OpenMP implementation may introduce additional overheads not present in the MPI code (e.g. synchronisation, false sharing, sequential sections).
  – The mixed implementation may require more synchronisation than a pure OpenMP version, if MPI is assumed not to be thread-safe.
  – Implicit point-to-point synchronisation may be replaced by (more expensive) barriers.
  – In the pure MPI code, intra-node messages will often be naturally overlapped with inter-node messages.
  – It is harder to overlap inter-thread communication with inter-node messages.


Example

!$omp parallel do
      DO I = 1, N
         A(I) = B(I) + C(I)
      END DO
!     <-- implicit barrier added here

      CALL MPI_BSEND(A(N), 1, .....)
      CALL MPI_RECV(A(0), 1, .....)
!     In the pure MPI code, these intra-node messages would be
!     overlapped with the inter-node messages.

!$omp parallel do
      DO I = 1, N
         D(I) = A(I-1) + A(I)
      END DO
!     Inter-thread communication occurs here: A(I-1) may have been
!     written by a different thread in the first loop.


Replicated data

• Some MPI codes use a replicated data strategy
  – all processes have a copy of a major data structure
• A pure MPI code needs one copy per process(or).
• A mixed code would only require one copy per node
  – the data structure can be shared by multiple threads within a process (see the sketch below).
• Should improve performance for large data sizes, through reduced swapping.
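
A minimal sketch of the idea (the program and the names are illustrative, not from the course): a lookup table that a pure MPI code would replicate in every process is allocated once per process and shared by all the OpenMP threads in it, so a node running 1 process x 8 threads holds one copy instead of eight.

program replicated_data_sketch
  use mpi
  use omp_lib
  implicit none
  integer, parameter :: ntable = 1000000
  real, allocatable  :: table(:)
  real    :: total
  integer :: ierr, i

  call MPI_Init(ierr)

  allocate(table(ntable))        ! one copy per MPI process, not per core
  table = 1.0
  ! Every process ends up with the same replicated data.
  call MPI_Bcast(table, ntable, MPI_REAL, 0, MPI_COMM_WORLD, ierr)

  total = 0.0
  ! All threads in this process read the single shared copy of table.
!$omp parallel do reduction(+:total)
  do i = 1, ntable
     total = total + table(i)
  end do

  print *, 'local sum =', total
  deallocate(table)
  call MPI_Finalize(ierr)
end program replicated_data_sketch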


Restricted MPI process numbers

• Some MPI codes cannot run on arbitrary numbers of processors
  – e.g. they may be restricted to powers of 2
• Some SMPs don't have power-of-two numbers of processors.
  – e.g. on a 12-processor node, a power-of-two code can still use every processor by running 4 MPI processes with 3 OpenMP threads each.
• It may be easier to exploit these processor counts with a mixed code than to re-engineer the MPI code.


Limited MPI process numbers

• The MPI library implementation may not be able to handle hundreds of processes adequately.
• This is only likely to be an issue on very large systems.
• A mixed MPI/OpenMP implementation will reduce the number of MPI processes.


Computational power balancing

• Dynamic load balancing technique
  – Huang and Tafti (1999)
• Mixed-mode code:
  – work is initially partitioned equally between MPI processes
  – when a process becomes overloaded, it increases its number of OpenMP threads (see the sketch below)
  – requires a flexible job scheduling policy
  – better suited to large shared-memory machines than to clusters.
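
The following sketch shows the general idea only (it is not Huang and Tafti's actual scheme; the program, names and the 10% threshold are assumptions): each process times its own work, compares it with the average across processes, and asks for more OpenMP threads if it is lagging.

program power_balance_sketch
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, nproc, nthreads, step
  double precision :: t0, my_time, sum_time, avg_time

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
  nthreads = omp_get_max_threads()

  do step = 1, 100
     t0 = MPI_Wtime()
     ! ... do this process's share of the work using nthreads threads ...
     my_time = MPI_Wtime() - t0

     ! Compare this process's time with the average over all processes.
     call MPI_Allreduce(my_time, sum_time, 1, MPI_DOUBLE_PRECISION, &
                        MPI_SUM, MPI_COMM_WORLD, ierr)
     avg_time = sum_time / nproc

     ! An overloaded process asks for more threads for the next step
     ! (this assumes the job scheduler has spare cores to give it).
     if (my_time > 1.1d0 * avg_time) then
        nthreads = nthreads + 1
        call omp_set_num_threads(nthreads)
     end if
  end do

  call MPI_Finalize(ierr)
end program power_balance_sketch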


Case study

• Regular 2-dimensional grid (N x N).
• Nearest-neighbour communication.
• Options for a mixed OpenMP/MPI implementation:


Option 1

• 1-D domain decomposition in MPI
• Loop-level parallelism in OpenMP

[diagram: the grid divided into four strips, one per MPI process (process 0 to process 3)]


Option 1

Pros:
• Simplest to program.
• MPI calls naturally occur outside parallel loops: thread safety of MPI is not an issue (see the sketch below).

Cons:
• Communication is O(N).
• Loop-level parallelism may incur unnecessary synchronisation overheads.
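
A compact sketch of Option 1 (the Jacobi-style update and the array names are illustrative, not from the course): a 1-D decomposition over the second array index in MPI, with loop-level OpenMP inside each process. All MPI calls sit outside parallel regions.

program option1_sketch
  use mpi
  use omp_lib
  implicit none
  integer, parameter :: n = 512, niter = 100   ! assumes nproc divides n
  integer :: ierr, rank, nproc, left, right, nloc, i, j, iter
  double precision, allocatable :: u(:,:), unew(:,:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

  nloc  = n / nproc                    ! columns owned by this process
  left  = rank - 1
  right = rank + 1
  if (rank == 0)         left  = MPI_PROC_NULL
  if (rank == nproc - 1) right = MPI_PROC_NULL

  allocate(u(n, 0:nloc+1), unew(n, 0:nloc+1))
  u = 0.0d0; unew = 0.0d0

  do iter = 1, niter
     ! Halo exchange with neighbouring processes (outside OpenMP).
     call MPI_Sendrecv(u(1,nloc),   n, MPI_DOUBLE_PRECISION, right, 0, &
                       u(1,0),      n, MPI_DOUBLE_PRECISION, left,  0, &
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
     call MPI_Sendrecv(u(1,1),      n, MPI_DOUBLE_PRECISION, left,  1, &
                       u(1,nloc+1), n, MPI_DOUBLE_PRECISION, right, 1, &
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

     ! Loop-level OpenMP parallelism over the local columns.
!$omp parallel do private(i)
     do j = 1, nloc
        do i = 2, n-1
           unew(i,j) = 0.25d0 * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))
        end do
     end do

     u = unew
  end do

  call MPI_Finalize(ierr)
end program option1_sketch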


Option 2

• 1-D domain decomposition in MPI
• 1-D domain decomposition (in the other dimension) in OpenMP

[diagram: the grid divided one way between process 0 to process 3, and in the other dimension between the OpenMP threads within each process]


Option 2

Pros:
• Still relatively simple to program (see the sketch below).
• Avoids excessive synchronisation.

Cons:
• Still O(N) communication.
• Need to take care if MPI is not thread-safe (additional synchronisation?).
• If MPI is thread-safe, each thread could communicate with the neighbouring threads in the neighbouring processes.
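
One way the Option 2 threading could look (an illustrative fragment, assuming the same arrays as the previous sketch plus integer variables tid, nth, ilo, ihi): instead of a parallel do, each thread takes a fixed block of rows computed from its thread id, so the same threads touch the same data on every iteration and there is no per-loop scheduling.

!$omp parallel private(tid, nth, ilo, ihi, i, j)
  tid = omp_get_thread_num()
  nth = omp_get_num_threads()
  ilo = 2 + ( tid      * (n - 2)) / nth   ! first interior row for this thread
  ihi = 1 + ((tid + 1) * (n - 2)) / nth   ! last interior row for this thread
  do j = 1, nloc
     do i = ilo, ihi
        unew(i,j) = 0.25d0 * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))
     end do
  end do
!$omp end parallel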


Option 3

• 2-D domain decomposition in both MPI and OpenMP


Option 3

Pros:
• O(N^1/2) communication.

Cons:
• Complex to program, especially if all threads communicate.


Practical Session: Traffic

• Starting from a working MPI version, develop a mixed OpenMP/MPI implementation of a simple traffic simulation.
• Experiment with different numbers of threads and processes.
