Automatic parallelization and optimization for irregular ... - IEEE Xplore


consuming loops in their code, and split them up between threads.

OpenMP is a model for programming any parallel architecture that provides the abstraction of a shared address space to the programmer. The purpose of OpenMP is to ease the development of portable parallel code by enabling incremental construction of parallel programs and hiding the details of the underlying hardware/software interface from the programmer. A parallel OpenMP program can be obtained directly from its sequential counterpart by adding parallelization directives.

If the loop shown in Fig. 1 is split across OpenMP threads then, although threads will always have distinct values of j, the values of prev_elem and next_elem may simultaneously be the same on different threads. As a result, there is a potential problem with updating the value of new_cell. There are some simple solutions to this problem, which include making all the updates atomic, or having each thread compute temporary results which are then combined across threads. However, for the extremely common situation of sparse array access, neither of these approaches is particularly efficient.

5.1 Directive Extension for Irregular Loops

This section presents our new extension to OpenMP, the irregular directive.
The irregular directive can be applied to the parallel do directive in one of the following situations:

• When the parallel region is recognized as an irregular loop: in this case the compiler invokes a runtime library that partitions the irregular loop according to a special compute rule.

• When an ordered clause is recognized in a parallel region whose loop is irregular: in this case the compiler treats the loop as partially ordered; that is, some iterations are executed sequentially while others may be executed in parallel.

• When a reduction clause is recognized in a parallel region whose loop is irregular: in this case the compiler invokes inspector/executor routines to perform the irregular reduction in parallel.

The irregular directive in the extended OpenMP version may have the following patterns:

!$omp parallel do schedule(irregular, [irarray1, ..., irarrayN])

This pattern designates that the compiler will encounter an irregular loop, where irarray1, ..., irarrayN are possible indirection arrays, or

!$omp parallel do reduction|ordered irregular([expr1, ..., exprN])

This pattern designates that the compiler will encounter a special irregular reduction or irregular ordered loop, where expr1, ..., exprN are expressions such as loop index variables.

5.2 Partially Ordered Loops

A special case of the example in Fig. 1 occurs when the shared updates must be performed not only in mutual exclusion but also in an ordered way. The use of the irregular clause in this case tells the compiler that iterations which may update the same data on different threads must be executed in an ordered way. Other iterations are still executed in parallel.
An example of code using the irregular clause in this manner is the following:

Example 1

!$omp parallel do ordered irregular(x, y)
do i = 1, n
   x[i] = indirect(1, i)
   y[i] = indirect(2, i)
!$omp ordered
   a(x[i]) = a(y[i]) ......
!$omp end ordered
end do
!$omp end parallel do

6 Conclusions

The efficiency of loop partitioning considerably influences the performance of a parallel program. For automatically parallelizing irregular scientific codes, the owner-computes rule is not suitable for partitioning irregular loops. In this paper, we have presented an efficient loop partitioning approach that reduces the communication cost for a class of irregular loops with nonlinear array subscripts. In our approach, runtime preprocessing is used to determine the communication required between the processors, and we have developed algorithms for performing these communication optimizations. We have also shown how interprocedural communication optimization can be achieved. Furthermore, for irregular codes parallelized in OpenMP, we have proposed extensions to the OpenMP directives suitable for compiling and executing such codes. We have completed a preliminary implementation of the schemes presented in this paper, and the experimental results demonstrate their efficacy.

Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04)
0-7695-2132-0/04/$17.00 (C) 2004 IEEE
