for linear subscripts are completely disabled), because most practical irregular scientific applications contain this kind of loop.

Generally, in distributed-memory compilation, loop iterations are partitioned among processors according to the owner computes rule [1]. This rule specifies that, in a single-statement loop, each iteration is executed by the processor that owns the left-hand-side array reference of the assignment for that iteration.

However, the owner computes rule is often not well suited to irregular codes, because the use of indirection in accessing the left-hand-side array makes it difficult to partition the loop iterations according to this rule. Therefore, in the CHAOS library, Ponnusamy et al. [13, 14] proposed a heuristic method for irregular loop partitioning called the almost owner computes rule, in which an iteration is executed on the processor that owns the largest number of distributed array references in that iteration. Some HPF compilers implement this scheme via the EXECUTE-ON-HOME clause [15]. However, when we parallelized the fluid dynamics solver ZEUS-2D using the almost owner computes rule, we found that it is not optimal in minimizing communication cost, measured either as communication steps or as elements to be communicated. Another drawback is that it is not straightforward to choose the optimal owner when several processors own the same number of array references.

We propose a more efficient computes rule for irregular loop partitioning [8].
This approach assigns an iteration to a particular processor such that executing the iteration on that processor ensures

• the number of communication steps is minimum, and
• the total amount of data to be communicated is minimum.

In our approach, neither the owner computes rule nor the almost owner computes rule is used in the parallel execution of loop iterations of an irregular computation. Instead, a communication-cost-reducing computes rule, called the least communication computes rule, is proposed. For a given irregular loop, we first examine all processors P_k, 0 ≤ k < m (m is the number of processors), defining two sets FanIn(P_k) and FanOut(P_k) for each processor P_k. FanIn(P_k) is the set of processors that have to send data to processor P_k before the iteration is executed, and FanOut(P_k) is the set of processors to which processor P_k has to send data after the iteration is executed. Based on this knowledge, we assign each loop iteration to the processor on which minimal communication is ensured when executing that iteration, and proceed until all iterations are partitioned among the processors. Please refer to [8] for details.

3 Global-local Array Transformation and Index Array Remapping

There are two approaches to generating SPMD irregular code after loop partitioning. One is to receive the data required by an iteration immediately before that iteration is executed, and to send the changed values (originally resident on other processors) back to those processors after the iteration is executed. The other is to gather all remote data needed by the iterations executed on a processor before the loop, and to scatter all changed remote values to the other processors after all iterations are executed.
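The least communication computes rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a BLOCK data distribution, and the helper names (owner, reads, writes) are hypothetical.

```python
# Sketch of the "least communication computes" rule for a single iteration.
# reads/writes are the iteration's distributed array references, given as
# 1-based element indices; owner() maps an element to its owning processor
# under an assumed BLOCK distribution.

def owner(index, block_size):
    return (index - 1) // block_size

def least_communication_processor(reads, writes, num_procs, block_size):
    best = None
    for k in range(num_procs):
        # FanIn(Pk): processors that must send data to Pk before execution.
        fan_in = {owner(r, block_size) for r in reads} - {k}
        # FanOut(Pk): processors to which Pk must send results afterwards.
        fan_out = {owner(w, block_size) for w in writes} - {k}
        steps = len(fan_in) + len(fan_out)
        volume = sum(owner(r, block_size) != k for r in reads) + \
                 sum(owner(w, block_size) != k for w in writes)
        # Prefer fewest communication steps, then least data moved.
        if best is None or (steps, volume) < best[0]:
            best = ((steps, volume), k)
    return best[1]

# Iteration reading x(5), x(6) and writing y(9); block size 4, 3 processors:
print(least_communication_processor([5, 6], [9], 3, 4))  # -> 1
```

Executing on P_1 needs only one step (sending the result to P_2), whereas the owner of the written element, P_2, would need an extra step to fetch both reads from P_1.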
Because message aggregation is a principal means of communication optimization, the latter approach is clearly better for communication performance. To enable such communication aggregation, this section discusses the redistribution of indirection arrays.

3.1 Index Array Redistribution

After loop-partitioning analysis, all iterations have been assigned to processors: iter(P_0) = {i_{0,1}, i_{0,2}, ..., i_{0,α_0}}, ..., iter(P_{m-1}) = {i_{m-1,1}, i_{m-1,2}, ..., i_{m-1,α_{m-1}}}. If an iteration i_r is assigned to processor P_k, the index array elements ix(i_r), iy(i_r), ... are not necessarily resident on P_k. Therefore, we need to redistribute all index arrays so that, for iter(P_k) = {i_{k,1}, i_{k,2}, ..., i_{k,α_k}} and every index array ix, the elements ix(i_{k,1}), ..., ix(i_{k,α_k}) are locally accessible.

As mentioned above, the source BLOCK partition scheme for processor P_k is src_iter(P_k) = {⌈g/m⌉·k + 1, ..., ⌈g/m⌉·(k + 1)}, where g is the number of iterations of the loop. In a redistribution, the elements of an index array that need to be communicated from P_k to P_k' are then given by src_iter(P_k) ∩ iter(P_k'). Returning to Example 1 and, as in Example 3, letting the size of the data array and index arrays be 12, loop-partitioning analysis yields iter(P_0) = {1, 5, 8, 9, 10}, iter(P_1) = {2, 3, 4}, and iter(P_2) = {6, 7, 11, 12}. The index array elements to be redistributed are shown in Figure 2.

3.2 Scheduling in the Redistribution Procedure

A redistribution routine can be divided into two parts: subscript computation and interprocessor communication. Without communication scheduling in a redistribution routine, communication contention may occur, which increases the communication waiting time: in a given communication step, several processors may send messages to the same destination processor.
This leads to node contention, which results in overall performance degradation [5, 6]. Scheduling can avoid this contention.

Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04) 0-7695-2132-0/04/$17.00 (C) 2004 IEEE
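The subscript computation of Section 3.1 can be sketched directly from the formula src_iter(P_k) ∩ iter(P_k'), using the example's values g = 12 and m = 3; the printed sets match the communication pattern shown in Figure 2.

```python
# Sketch of the redistribution bookkeeping from Section 3.1 for the example
# (g = 12 iterations, m = 3 processors, BLOCK source distribution).
import math

g, m = 12, 3
blk = math.ceil(g / m)
# Source BLOCK partition: src_iter(Pk) = {blk*k + 1, ..., blk*(k + 1)}.
src_iter = [set(range(blk * k + 1, blk * (k + 1) + 1)) for k in range(m)]
# Target partition produced by the loop-partitioning analysis:
iter_ = [{1, 5, 8, 9, 10}, {2, 3, 4}, {6, 7, 11, 12}]

# Index-array elements processor Pk must send to Pk' in the redistribution:
for k in range(m):
    for kp in range(m):
        if k != kp:
            print(f"P{k} -> P{kp}: {sorted(src_iter[k] & iter_[kp])}")
```

Running this reproduces, for instance, that P_1 sends elements {5, 8} to P_0 and that P_0 sends nothing to P_2.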
src_iter(P_1) ∩ iter(P_0) = {5, 8}       src_iter(P_2) ∩ iter(P_0) = {9, 10}
src_iter(P_0) ∩ iter(P_1) = {2, 3, 4}    src_iter(P_2) ∩ iter(P_1) = {}
src_iter(P_0) ∩ iter(P_2) = {}           src_iter(P_1) ∩ iter(P_2) = {6, 7}

           P_0            P_1          P_2
i :   1  5  8  9 10 |  2  3  4 |  6  7 11 12
ix:   5  1  4  6  8 |  7  9 11 |  2  4 10 12
iy:   4  1 10  2  4 |  4  3  3 |  6  9  6  8
iz:   4  8  2  1  3 |  5  6  7 | 10 12  9 11

Figure 2. Subscript computations of index array redistribution in Example 1.

Another consideration is to balance messages so that the message sizes within a step are as close as possible, since the cost of a step is likely to be dictated by the length of the longest message exchanged during that step. We use a sparse matrix M to represent the communication pattern between the source and target processor sets, where entry M(i, j) gives the size of the message that sending processor P_i must send to receiving processor Q_j. The following matrices describe a redistribution pattern and a scheduling pattern:

M = [  0  3  0 10  3 ]      CS = [ 1 4 3   ]
    [ 10  0  2  1  2 ]           [ 4 2 0 3 ]
    [  2  2  0  3  9 ]           [ 0 3 4 1 ]
    [ 10  0  2  0  0 ]           [ 0 2     ]
    [  0  4  3  3  0 ]           [ 3 1 2   ]

We have proposed an algorithm that accepts M as input and generates a communication scheduling table CS expressing the scheduling result, where CS(i, k) = j means that sending processor P_i sends a message to receiving processor Q_j at communication step k. The scheduling satisfies:

• there is no node contention in any communication step; and
• the sum of the longest message lengths over all steps is minimum.

4 Inter-procedural Communication Optimization for Irregular Loops

In some irregular scientific codes, an important required optimization is communication preprocessing across procedure calls. In this section, we extend a classical data-flow optimization technique, Partial Redundancy Elimination, to Interprocedural Partial Redundancy Elimination as a basis for performing interprocedural communication optimization [2].
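As an aside to the scheduling of Section 3.2, the contention-free property can be illustrated with a greedy sketch: at each step every sender transmits its largest pending message to a receiver not yet used in that step. This is only an illustration under that greedy assumption, with names of our own choosing; the paper's actual algorithm, which additionally minimizes the sum of per-step longest messages, is not reproduced here.

```python
# Greedy contention-free scheduling sketch: no receiver gets two messages
# in the same communication step. M[i][j] is the size of the message from
# sender Pi to receiver Qj (0 means no message).

def schedule(M):
    pending = {(i, j): M[i][j] for i in range(len(M))
               for j in range(len(M[i])) if M[i][j] > 0}
    steps = []
    while pending:
        used_senders, used_receivers, step = set(), set(), {}
        # Try larger messages first so long messages tend to share a step.
        for (i, j), size in sorted(pending.items(), key=lambda kv: -kv[1]):
            if i not in used_senders and j not in used_receivers:
                step[i] = j
                used_senders.add(i)
                used_receivers.add(j)
        for i, j in step.items():
            del pending[(i, j)]
        steps.append(step)
    return steps  # steps[k][i] = j  means Pi sends to Qj at step k

M = [[0, 3, 0, 10, 3],
     [10, 0, 2, 1, 2],
     [2, 2, 0, 3, 9],
     [10, 0, 2, 0, 0],
     [0, 4, 3, 3, 0]]
for k, step in enumerate(schedule(M)):
    print(k, step)
```

Every step produced is contention-free by construction; the paper's algorithm further chooses among such schedules to minimize total step cost.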
Partial Redundancy Elimination encompasses traditional optimizations such as loop-invariant code motion and redundant computation elimination.

For irregular code with procedure calls, an initial intraprocedural analysis (see [8]) inserts a pre-communication call (comprising one buffering and one gathering routine) and a post-communication call (buffering and scattering routines) for each of the two data-parallel loops in the two subroutines. After interprocedural analysis, loop-invariant communication calls can be hoisted outside the loop.

Here, the data arrays and index arrays are the same in the loop bodies of the two subroutines. While a communication statement may not be entirely redundant, some other communication statement may gather at least a subset of the values it gathers.

In some situations, the same data array A is accessed using an indirection array IA in one subroutine SUB1 and using another indirection array IB in another subroutine SUB2, and neither the indirection arrays nor the data array A is modified between the flow of control from the first loop to the second. There will then be at least some overlap between the communication these two loops require. Another case is that the data array and the indirection array are the same but the loop bounds differ; the first loop can then be viewed as a part of the second.

We distinguish two kinds of communication routines for such situations. A common communication routine takes more than one indirection array, or the common part of two indirection arrays; it takes in arrays IA and IB and produces a single buffering.
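Under a BLOCK-ownership assumption (the helper names here are illustrative, not from the paper), the overlap exploited by a common communication routine, and the leftover references an incremental schedule would fetch, can be computed as simple set operations on the off-processor references of IA and IB:

```python
# Illustrative sketch: off-processor references of two indirection arrays
# IA and IB on processor `my_proc` (BLOCK distribution, 1-based indices).

def off_processor(indirection, my_proc, block_size):
    return {i for i in indirection if (i - 1) // block_size != my_proc}

def common_and_incremental(IA, IB, my_proc, block_size):
    first = off_processor(IA, my_proc, block_size)
    second = off_processor(IB, my_proc, block_size)
    # Common part: elements needed by both loops, gathered only once.
    # Incremental part: elements reached only through IB, to be fetched
    # before the second loop.
    return first & second, second - first

# On P0 with block size 4: loop 1 accesses A(IA(i)), loop 2 accesses A(IB(i)).
common, incremental = common_and_incremental([1, 5, 6], [2, 6, 9], 0, 4)
print(sorted(common), sorted(incremental))  # -> [6] [9]
```

Here A(6) is gathered once and reused, while only A(9) needs a new message before the second loop.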
An incremental preprocessing routine takes in indirection arrays IA and IB and determines the off-processor references made through indirection array IB but not through indirection array IA. While executing the second loop, communication can then use an incremental schedule, gathering only the data elements that were not gathered during the first loop.

5 OpenMP Extensions for Irregular Applications

OpenMP's programming model uses fork-join parallelism: a master thread spawns a team of threads as needed. Parallelism is added incrementally, i.e., the sequential program evolves into a parallel program, so we do not have to parallelize the whole program at once. OpenMP is usually used to parallelize loops. A user finds his most time