Automatic parallelization and optimization for irregular ... - IEEE Xplore

Figure 2. Subscript computations of index array redistribution in Example 1.

Another consideration is to align messages so that the message sizes within a step are as close to each other as possible, since the cost of a step is likely to be dictated by the length of the longest message exchanged during that step. We use a sparse matrix M to represent the communication pattern between the source and target processor sets, where M(i, j) = m expresses that a sending processor Pi should send a message of size m to a receiving processor Qj. The following matrices describe a redistribution pattern and its scheduling pattern (a dash in CS marks an idle step):

        | 0  3  0 10  3 |        | 1  4  3  - |
        |10  0  2  1  2 |        | 4  2  0  3 |
    M = | 2  2  0  3  9 | , CS = | 0  3  4  1 |
        |10  0  2  0  0 |        | -  0  2  - |
        | 0  4  3  3  0 |        | 3  1  -  2 |

We have proposed an algorithm that accepts M as input and generates a communication scheduling table CS expressing the scheduling result, where CS(i, k) = j means that sending processor Pi sends its message to receiving processor Qj at communication step k. The scheduling satisfies:

- there is no node contention in any communication step; and
- the sum of the longest message lengths over all steps is minimum.

4 Inter-procedural Communication Optimization for Irregular Loops

In some irregular scientific codes, an important required optimization is communication preprocessing across procedure calls. In this section, we extend a classical data-flow optimization technique, Partial Redundancy Elimination, to Interprocedural Partial Redundancy Elimination as a basis for performing interprocedural communication optimization [2].
Partial Redundancy Elimination encompasses traditional optimizations such as loop-invariant code motion and redundant computation elimination.

For irregular code with procedure calls, the initial intraprocedural analysis (see [8]) inserts a pre-communication call (including one buffering and one gathering routine) and a post-communication call (buffering and scattering routines) for each of the two data-parallel loops in the two subroutines. After interprocedural analysis, loop invariants can be hoisted outside the loop.

Here, the data arrays and index arrays are the same in the loop bodies of the two subroutines. While some communication statement may not be redundant, there may be some other communication statement that gathers at least a subset of the values gathered by this statement.

In some situations, the same data array A is accessed using an indirection array IA in one subroutine SUB1 and using another indirection array IB in another subroutine SUB2. Further, neither the indirection arrays nor the data array A is modified between the flow of control from the first loop to the second loop. There will then be at least some overlap between the off-processor data elements required by these two loops. Another case is that the data array and indirection array are the same but the loop bounds differ; in this case, the first loop can be viewed as a part of the second loop.

We distinguish two kinds of communication routines for such situations. A common communication routine takes more than one indirection array, or the common part of two indirection arrays; it will take in arrays IA and IB and produce a single buffering.
An incremental preprocessing routine will take in indirection arrays IA and IB and determine the off-processor references made uniquely through indirection array IB and not through indirection array IA. While executing the second loop, communication can then be done using an incremental schedule, gathering only the data elements that were not gathered during the first loop.

5 OpenMP Extensions for Irregular Applications

OpenMP's programming model uses fork-join parallelism: a master thread spawns a team of threads as needed. Parallelism is added incrementally, i.e. the sequential program evolves into a parallel program, so we do not have to parallelize the whole program at once. OpenMP is usually used to parallelize loops: a user finds the most time-consuming loops and annotates them with directives.

Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04) 0-7695-2132-0/04/$17.00 (C) 2004 IEEE
