for linear subscripts are completely disabled), because most practical irregular scientific applications contain this kind of loop.

Generally, in distributed memory compilation, loop iterations are partitioned to processors according to the owner computes rule [1]. This rule specifies that, in a single-statement loop, each iteration is executed by the processor which owns the left hand side array reference of the assignment for that iteration.

However, the owner computes rule is often not well suited for irregular codes, because the use of indirection in accessing the left hand side array makes it difficult to partition the loop iterations according to this rule. Therefore, in the CHAOS library, Ponnusamy et al. [13, 14] proposed a heuristic method for irregular loop partitioning called the almost owner computes rule, in which an iteration is executed on the processor that owns the largest number of distributed array references in the iteration. Some HPF compilers employ this scheme through the EXECUTE-ON-HOME clause [15]. However, when we parallelized the fluid dynamics solver ZEUS-2D using the almost owner computes rule, we found that it is not optimal in minimizing communication cost, measured either in communication steps or in elements to be communicated. Another drawback is that it is not straightforward to choose the owner when several processors own the same number of array references.

We propose a more efficient computes rule for irregular loop partitioning [8]. This approach assigns each iteration to a particular processor such that executing the iteration on that processor ensures that

• the number of communication steps is minimal, and
• the total amount of data to be communicated is minimal.

In our approach, neither the owner computes rule nor the almost owner computes rule is used in the parallel execution of a loop iteration for irregular computation. Instead, a communication cost reduction computes rule, called the least communication computes rule, is proposed. For a given irregular loop, we first examine every processor P_k, 0 ≤ k ≤ m − 1 (m is the number of processors), and define two sets FanIn(P_k) and FanOut(P_k). FanIn(P_k) is the set of processors that have to send data to processor P_k before the iteration is executed, and FanOut(P_k) is the set of processors to which processor P_k has to send data after the iteration is executed. Based on this knowledge, each loop iteration is assigned to a processor on which minimal communication is ensured when executing that iteration, until all iterations are partitioned among the processors. Please refer to [8] for details.
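The selection step of the least communication computes rule can be illustrated with a short sketch. The Python fragment below is only a schematic reconstruction of the description above, not the implementation of [8]; the choose_processor and owner() names, the reference lists, and the tie-breaking by data volume after communication steps are assumptions.

```python
# Schematic sketch of the least communication computes rule (assumed helper
# names; owner(ref) returns the processor that owns an array reference).

def choose_processor(lhs_refs, rhs_refs, num_procs, owner):
    """Assign one iteration to the processor that minimizes, in order,
    the number of communication steps and the amount of data moved."""
    best_key, best_proc = None, None
    for k in range(num_procs):
        # FanIn(P_k): processors that must send right-hand-side data to P_k
        # before the iteration executes.
        fan_in = {owner(r) for r in rhs_refs if owner(r) != k}
        # FanOut(P_k): processors to which P_k must send left-hand-side
        # results after the iteration executes.
        fan_out = {owner(r) for r in lhs_refs if owner(r) != k}
        steps = len(fan_in) + len(fan_out)                        # communication steps
        volume = sum(1 for r in lhs_refs + rhs_refs if owner(r) != k)  # elements moved
        if best_key is None or (steps, volume) < best_key:
            best_key, best_proc = (steps, volume), k
    return best_proc
```

For example, with owner = lambda r: r % 3, choose_processor([4], [1, 7], 3, owner) returns 1: processor 1 owns all three references of that iteration, so no communication is needed.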
3 Global-local Array Transformation and Index Array Remapping

There are two approaches to generating SPMD irregular codes after loop partitioning. One is to receive the data required by an iteration every time before the iteration is executed, and to send the changed values (which originally reside on other processors) back to the other processors after the iteration is executed. The other is to gather all remote data from other processors for all iterations executed on this processor, and to scatter all changed remote values to the other processors after all iterations are executed. Because message aggregation is a principal means of communication optimization, the latter is clearly better for communication performance. In order to perform communication aggregation, this section discusses the redistribution of indirection arrays.

3.1 Index Array Redistribution

After loop partitioning analysis, all iterations have been assigned to processors: iter(P_0) = {i_{0,1}, i_{0,2}, ..., i_{0,α_0}}, ..., iter(P_{m−1}) = {i_{m−1,1}, i_{m−1,2}, ..., i_{m−1,α_{m−1}}}. If an iteration i_r is partitioned to processor P_k, the index array elements ix(i_r), iy(i_r), ... do not necessarily reside on P_k. Therefore, we need to redistribute all index arrays so that, for iter(P_k) = {i_{k,1}, i_{k,2}, ..., i_{k,α_k}} and every index array ix, the elements ix(i_{k,1}), ..., ix(i_{k,α_k}) are locally accessible.

As mentioned above, the source BLOCK partition scheme for processor P_k is src_iter(P_k) = {⌈g/m⌉·k + 1, ..., ⌈g/m⌉·(k + 1)}, where g is the number of iterations of the loop. Then, in a redistribution, the elements of an index array that need to be communicated from P_k to P_k′ are given by src_iter(P_k) ∩ iter(P_k′). Returning to Example 1, with the same setting as Example 3, let the size of the data array and the index arrays be 12; after loop partitioning analysis we obtain iter(P_0) = {1, 5, 8, 9, 10}, iter(P_1) = {2, 3, 4}, and iter(P_2) = {6, 7, 11, 12}. The index array elements to be redistributed are shown in Figure 2, and a small sketch of this intersection computation is given at the end of this section.

3.2 Scheduling in Redistribution Procedure

A redistribution routine can be divided into two parts: subscript computation and interprocessor communication. If there is no communication scheduling in a redistribution routine, communication contention may occur, which increases the communication waiting time: in a given communication step, several processors may send messages to the same destination processor. This leads to node contention, and node contention results in overall performance degradation [5, 6]. Scheduling can avoid this contention.
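As an illustration of the intersection computation in Section 3.1, the following sketch derives, for each processor pair, which subscripts of an index array have to move. It is only a reconstruction under stated assumptions: the function names are hypothetical, the iteration numbering is taken to be 1-based as in the text, and the printed pairs simply follow from the BLOCK formula and the example partitioning above, not from Figure 2 itself.

```python
import math

def block_src_iter(k, g, m):
    """Iterations initially owned by P_k under the source BLOCK scheme:
    {ceil(g/m)*k + 1, ..., ceil(g/m)*(k + 1)}, 1-based and clipped to g."""
    b = math.ceil(g / m)
    return set(range(b * k + 1, min(b * (k + 1), g) + 1))

def redistribution_sets(iter_parts, g):
    """For every processor pair (k, k'), the subscripts whose index array
    elements must move, i.e. src_iter(P_k) ∩ iter(P_k')."""
    m = len(iter_parts)
    moves = {}
    for k in range(m):
        src = block_src_iter(k, g, m)
        for kp, target in enumerate(iter_parts):
            if kp != k:
                common = src & set(target)
                if common:
                    moves[(k, kp)] = sorted(common)
    return moves

# Example partitioning from Section 3.1 (12 iterations on 3 processors).
parts = [[1, 5, 8, 9, 10], [2, 3, 4], [6, 7, 11, 12]]
print(redistribution_sets(parts, 12))
# {(0, 1): [2, 3, 4], (1, 0): [5, 8], (1, 2): [6, 7], (2, 0): [9, 10]}
```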

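Section 3.2 motivates communication scheduling without spelling out an algorithm. As a hedged illustration only (not the scheduling method of [5, 6]), the sketch below greedily packs the hypothetical messages from the previous example into steps in which no destination, and no source, processor appears twice, thereby avoiding the node contention described above.

```python
def schedule_steps(messages):
    """Place each (sender, receiver) message into the earliest step in which
    neither its sender nor its receiver is already busy."""
    busy = []   # busy[s] = (senders used in step s, receivers used in step s)
    plan = []   # plan[s] = list of (sender, receiver) pairs sent in step s
    for src, dst in sorted(messages):
        for s, (used_src, used_dst) in enumerate(busy):
            if src not in used_src and dst not in used_dst:
                used_src.add(src)
                used_dst.add(dst)
                plan[s].append((src, dst))
                break
        else:
            busy.append(({src}, {dst}))
            plan.append([(src, dst)])
    return plan

# Sender -> receiver pairs taken from the redistribution sketch above.
messages = [(0, 1), (1, 0), (1, 2), (2, 0)]
print(schedule_steps(messages))
# [[(0, 1), (1, 0)], [(1, 2), (2, 0)]]  -- two contention-free steps
```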