for linear subscripts are completely disabled), because most practical irregular scientific applications contain this kind of loop.

Generally, in distributed-memory compilation, loop iterations are partitioned among processors according to the owner computes rule [1]. This rule specifies that, in a single-statement loop, each iteration is executed by the processor that owns the left-hand-side array reference of the assignment for that iteration.

However, the owner computes rule is often not well suited to irregular codes, because the use of indirection in accessing the left-hand-side array makes it difficult to partition the loop iterations according to this rule. Therefore, in the CHAOS library, Ponnusamy et al. [13, 14] proposed a heuristic method for irregular loop partitioning called the almost owner computes rule, in which an iteration is executed on the processor that owns the largest number of distributed array references in that iteration. Some HPF compilers implement this scheme via the EXECUTE-ON-HOME clause [15]. However, when we parallelized the fluid dynamics solver ZEUS-2D using the almost owner computes rule, we found that it is not optimal in minimizing communication cost, measured either as communication steps or as elements to be communicated. Another drawback is that it is not straightforward to choose the optimal owner when several processors own the same number of array references.

We propose a more efficient computes rule for irregular loop partitioning [8].
This approach assigns an iteration to a particular processor such that executing the iteration on that processor ensures

• the number of communication steps is minimum, and
• the total amount of data to be communicated is minimum.

In our approach, neither the owner computes rule nor the almost owner computes rule is used in the parallel execution of loop iterations of an irregular computation. Instead, a communication-cost-reducing computes rule, called the least communication computes rule, is proposed. For a given irregular loop, we first examine all processors P_k, 0 ≤ k < m (m is the number of processors), defining two sets FanIn(P_k) and FanOut(P_k) for each processor P_k. FanIn(P_k) is the set of processors that have to send data to processor P_k before the iteration is executed, and FanOut(P_k) is the set of processors to which processor P_k has to send data after the iteration is executed. Based on this knowledge, we assign each loop iteration to the processor on which minimal communication is ensured when executing that iteration, and proceed until all iterations are partitioned among the processors. Please refer to [8] for details.

3 Global-local Array Transformation and Index Array Remapping

There are two approaches to generating SPMD irregular code after loop partitioning. One is to receive the data required by an iteration immediately before that iteration is executed, and to send the changed values (originally resident on other processors) back to those processors after the iteration is executed. The other is to gather all remote data needed by the iterations executed on a processor before the loop, and to scatter all changed remote values to the other processors after all iterations are executed.
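The least communication computes rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a BLOCK data distribution, and the helper names (owner, reads, writes) are hypothetical.

```python
# Sketch of the "least communication computes" rule for a single iteration.
# reads/writes are the iteration's distributed array references, given as
# 1-based element indices; owner() maps an element to its owning processor
# under an assumed BLOCK distribution.

def owner(index, block_size):
    return (index - 1) // block_size

def least_communication_processor(reads, writes, num_procs, block_size):
    best = None
    for k in range(num_procs):
        # FanIn(Pk): processors that must send data to Pk before execution.
        fan_in = {owner(r, block_size) for r in reads} - {k}
        # FanOut(Pk): processors to which Pk must send results afterwards.
        fan_out = {owner(w, block_size) for w in writes} - {k}
        steps = len(fan_in) + len(fan_out)
        volume = sum(owner(r, block_size) != k for r in reads) + \
                 sum(owner(w, block_size) != k for w in writes)
        # Prefer fewest communication steps, then least data moved.
        if best is None or (steps, volume) < best[0]:
            best = ((steps, volume), k)
    return best[1]

# Iteration reading x(5), x(6) and writing y(9); block size 4, 3 processors:
print(least_communication_processor([5, 6], [9], 3, 4))  # -> 1
```

Executing on P_1 needs only one step (sending the result to P_2), whereas the owner of the written element, P_2, would need an extra step to fetch both reads from P_1.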
Because message aggregation is a principal means of communication optimization, the latter approach is clearly better for communication performance. To enable such communication aggregation, this section discusses the redistribution of indirection arrays.

3.1 Index Array Redistribution

After loop-partitioning analysis, all iterations have been assigned to processors: iter(P_0) = {i_{0,1}, i_{0,2}, ..., i_{0,α_0}}, ..., iter(P_{m-1}) = {i_{m-1,1}, i_{m-1,2}, ..., i_{m-1,α_{m-1}}}. If an iteration i_r is assigned to processor P_k, the index array elements ix(i_r), iy(i_r), ... are not necessarily resident on P_k. Therefore, we need to redistribute all index arrays so that, for iter(P_k) = {i_{k,1}, i_{k,2}, ..., i_{k,α_k}} and every index array ix, the elements ix(i_{k,1}), ..., ix(i_{k,α_k}) are locally accessible.

As mentioned above, the source BLOCK partition scheme for processor P_k is src_iter(P_k) = {⌈g/m⌉·k + 1, ..., ⌈g/m⌉·(k + 1)}, where g is the number of iterations of the loop. In a redistribution, the elements of an index array that need to be communicated from P_k to P_k' are then given by src_iter(P_k) ∩ iter(P_k'). Returning to Example 1 and, as in Example 3, letting the size of the data array and index arrays be 12, loop-partitioning analysis yields iter(P_0) = {1, 5, 8, 9, 10}, iter(P_1) = {2, 3, 4}, and iter(P_2) = {6, 7, 11, 12}. The index array elements to be redistributed are shown in Figure 2.

3.2 Scheduling in the Redistribution Procedure

A redistribution routine can be divided into two parts: subscript computation and interprocessor communication. Without communication scheduling in a redistribution routine, communication contention may occur, which increases the communication waiting time: in a given communication step, several processors may send messages to the same destination processor.
This leads to node contention, which results in overall performance degradation [5, 6]. Scheduling can avoid this contention.

Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04) 0-7695-2132-0/04/$17.00 (C) 2004 IEEE
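The subscript computation of Section 3.1 can be sketched directly from the formula src_iter(P_k) ∩ iter(P_k'), using the example's values g = 12 and m = 3; the printed sets match the communication pattern shown in Figure 2.

```python
# Sketch of the redistribution bookkeeping from Section 3.1 for the example
# (g = 12 iterations, m = 3 processors, BLOCK source distribution).
import math

g, m = 12, 3
blk = math.ceil(g / m)
# Source BLOCK partition: src_iter(Pk) = {blk*k + 1, ..., blk*(k + 1)}.
src_iter = [set(range(blk * k + 1, blk * (k + 1) + 1)) for k in range(m)]
# Target partition produced by the loop-partitioning analysis:
iter_ = [{1, 5, 8, 9, 10}, {2, 3, 4}, {6, 7, 11, 12}]

# Index-array elements processor Pk must send to Pk' in the redistribution:
for k in range(m):
    for kp in range(m):
        if k != kp:
            print(f"P{k} -> P{kp}: {sorted(src_iter[k] & iter_[kp])}")
```

Running this reproduces, for instance, that P_1 sends elements {5, 8} to P_0 and that P_0 sends nothing to P_2.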
src_iter(P_1) ∩ iter(P_0) = {5, 8}       src_iter(P_2) ∩ iter(P_0) = {9, 10}
src_iter(P_0) ∩ iter(P_1) = {2, 3, 4}    src_iter(P_2) ∩ iter(P_1) = {}
src_iter(P_0) ∩ iter(P_2) = {}           src_iter(P_1) ∩ iter(P_2) = {6, 7}

           P_0            P_1          P_2
i :   1  5  8  9 10 |  2  3  4 |  6  7 11 12
ix:   5  1  4  6  8 |  7  9 11 |  2  4 10 12
iy:   4  1 10  2  4 |  4  3  3 |  6  9  6  8
iz:   4  8  2  1  3 |  5  6  7 | 10 12  9 11

Figure 2. Subscript computations of index array redistribution in Example 1.

Another consideration is to balance messages so that the message sizes within a step are as close as possible, since the cost of a step is likely to be dictated by the length of the longest message exchanged during that step. We use a sparse matrix M to represent the communication pattern between the source and target processor sets, where entry M(i, j) gives the size of the message that sending processor P_i must send to receiving processor Q_j. The following matrices describe a redistribution pattern and a scheduling pattern:

M = [  0  3  0 10  3 ]      CS = [ 1 4 3   ]
    [ 10  0  2  1  2 ]           [ 4 2 0 3 ]
    [  2  2  0  3  9 ]           [ 0 3 4 1 ]
    [ 10  0  2  0  0 ]           [ 0 2     ]
    [  0  4  3  3  0 ]           [ 3 1 2   ]

We have proposed an algorithm that accepts M as input and generates a communication scheduling table CS expressing the scheduling result, where CS(i, k) = j means that sending processor P_i sends a message to receiving processor Q_j at communication step k. The scheduling satisfies:

• there is no node contention in any communication step; and
• the sum of the longest message lengths over all steps is minimum.

4 Inter-procedural Communication Optimization for Irregular Loops

In some irregular scientific codes, an important required optimization is communication preprocessing across procedure calls. In this section, we extend a classical data-flow optimization technique, Partial Redundancy Elimination, to Interprocedural Partial Redundancy Elimination as a basis for performing interprocedural communication optimization [2].
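As an aside to the scheduling of Section 3.2, the contention-free property can be illustrated with a greedy sketch: at each step every sender transmits its largest pending message to a receiver not yet used in that step. This is only an illustration under that greedy assumption, with names of our own choosing; the paper's actual algorithm, which additionally minimizes the sum of per-step longest messages, is not reproduced here.

```python
# Greedy contention-free scheduling sketch: no receiver gets two messages
# in the same communication step. M[i][j] is the size of the message from
# sender Pi to receiver Qj (0 means no message).

def schedule(M):
    pending = {(i, j): M[i][j] for i in range(len(M))
               for j in range(len(M[i])) if M[i][j] > 0}
    steps = []
    while pending:
        used_senders, used_receivers, step = set(), set(), {}
        # Try larger messages first so long messages tend to share a step.
        for (i, j), size in sorted(pending.items(), key=lambda kv: -kv[1]):
            if i not in used_senders and j not in used_receivers:
                step[i] = j
                used_senders.add(i)
                used_receivers.add(j)
        for i, j in step.items():
            del pending[(i, j)]
        steps.append(step)
    return steps  # steps[k][i] = j  means Pi sends to Qj at step k

M = [[0, 3, 0, 10, 3],
     [10, 0, 2, 1, 2],
     [2, 2, 0, 3, 9],
     [10, 0, 2, 0, 0],
     [0, 4, 3, 3, 0]]
for k, step in enumerate(schedule(M)):
    print(k, step)
```

Every step produced is contention-free by construction; the paper's algorithm further chooses among such schedules to minimize total step cost.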
Partial Redundancy Elimination encompasses traditional optimizations such as loop-invariant code motion and redundant computation elimination.

For irregular code with procedure calls, an initial intraprocedural analysis (see [8]) inserts a pre-communication call (comprising one buffering and one gathering routine) and a post-communication call (buffering and scattering routines) for each of the two data-parallel loops in the two subroutines. After interprocedural analysis, loop-invariant communication calls can be hoisted outside the loop.

Here, the data arrays and index arrays are the same in the loop bodies of the two subroutines. While a communication statement may not be entirely redundant, some other communication statement may gather at least a subset of the values it gathers.

In some situations, the same data array A is accessed using an indirection array IA in one subroutine SUB1 and using another indirection array IB in another subroutine SUB2, and neither the indirection arrays nor the data array A is modified between the flow of control from the first loop to the second. There will then be at least some overlap between the communication these two loops require. Another case is that the data array and the indirection array are the same but the loop bounds differ; the first loop can then be viewed as a part of the second.

We distinguish two kinds of communication routines for such situations. A common communication routine takes more than one indirection array, or the common part of two indirection arrays; it takes in arrays IA and IB and produces a single buffering.
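Under a BLOCK-ownership assumption (the helper names here are illustrative, not from the paper), the overlap exploited by a common communication routine, and the leftover references an incremental schedule would fetch, can be computed as simple set operations on the off-processor references of IA and IB:

```python
# Illustrative sketch: off-processor references of two indirection arrays
# IA and IB on processor `my_proc` (BLOCK distribution, 1-based indices).

def off_processor(indirection, my_proc, block_size):
    return {i for i in indirection if (i - 1) // block_size != my_proc}

def common_and_incremental(IA, IB, my_proc, block_size):
    first = off_processor(IA, my_proc, block_size)
    second = off_processor(IB, my_proc, block_size)
    # Common part: elements needed by both loops, gathered only once.
    # Incremental part: elements reached only through IB, to be fetched
    # before the second loop.
    return first & second, second - first

# On P0 with block size 4: loop 1 accesses A(IA(i)), loop 2 accesses A(IB(i)).
common, incremental = common_and_incremental([1, 5, 6], [2, 6, 9], 0, 4)
print(sorted(common), sorted(incremental))  # -> [6] [9]
```

Here A(6) is gathered once and reused, while only A(9) needs a new message before the second loop.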
An incremental preprocessing routine takes in indirection arrays IA and IB and determines the off-processor references made through indirection array IB but not through indirection array IA. While executing the second loop, communication can then use an incremental schedule, gathering only the data elements that were not gathered during the first loop.

5 OpenMP Extensions for Irregular Applications

OpenMP's programming model uses fork-join parallelism: a master thread spawns a team of threads as needed. Parallelism is added incrementally, i.e., the sequential program evolves into a parallel program, so we do not have to parallelize the whole program at once. OpenMP is usually used to parallelize loops. A user finds his most time