automatically exploiting cross-invocation parallelism using runtime ...
[62] L. Rauchwerger and D. Padua. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. ACM SIGPLAN Notices, volume 30, pages 218–232, 1995.

[63] A. Robison, M. Voss, and A. Kukanov. Optimization via reflection on work stealing in TBB. In IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2008.

[64] S. Rus, L. Rauchwerger, and J. Hoeflinger. Hybrid analysis: Static & dynamic memory reference analysis. Int. J. Parallel Program., volume 31, pages 251–283, August 2003.

[65] J. Saltz, R. Mirchandaney, and R. Crowley. Run-time parallelization and scheduling of loops. IEEE Transactions on Computers, volume 40, 1991.

[66] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: A dynamic data race detector for multithreaded programs. ACM Transactions on Computer Systems, volume 15, pages 391–411, 1997.

[67] N. Shavit and D. Touitou. Software transactional memory. In Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing (PODC), 1995.

[68] M. F. Spear, M. M. Michael, and C. von Praun. RingSTM: Scalable transactions with a single atomic instruction. In Proceedings of the 20th Annual Symposium on Parallelism in Algorithms and Architectures (SPAA), 2008.

[69] Standard Performance Evaluation Corporation (SPEC). http://www.spec.org/.

[70] J. G. Steffan, C. Colohan, A. Zhai, and T. C. Mowry. The STAMPede approach to thread-level speculation. ACM Transactions on Computer Systems, volume 23, pages 253–300, February 2005.
[71] P. Swamy and C. Vipin. Minimum dependence distance tiling of nested loops with non-uniform dependences. In Proceedings of the 6th IEEE Symposium on Parallel and Distributed Processing (IPDPS), 1994.

[72] C.-W. Tseng. Compiler optimizations for eliminating barrier synchronization. In Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), 1995.

[73] T. Tzen and L. Ni. Trapezoid self-scheduling: A practical scheduling scheme for parallel compilers. IEEE Transactions on Parallel and Distributed Systems, volume 4, January 1993.

[74] M. Weiser. Program slicing. In Proceedings of the 5th International Conference on Software Engineering (ICSE), 1981.

[75] M. Wolfe. Doany: Not just another parallel loop. In Proceedings of the 4th Workshop on Languages and Compilers for Parallel Computing (LCPC), 1992.

[76] M. J. Wolfe. Optimizing Compilers for Supercomputers. PhD thesis, Department of Computer Science, University of Illinois, Urbana, IL, October 1982.

[77] L. Yen, J. Bobba, M. R. Marty, K. E. Moore, H. Volos, M. D. Hill, M. M. Swift, and D. A. Wood. LogTM-SE: Decoupling hardware transactional memory from caches. In Proceedings of the 13th IEEE International Symposium on High Performance Computer Architecture (HPCA), 2007.

[78] N. Yonezawa, K. Wada, and T. Aida. Barrier elimination based on access dependency analysis for OpenMP. In Parallel and Distributed Processing and Applications, 2006.

[79] J. Zhao, J. Shirako, V. K. Nandivada, and V. Sarkar. Reducing task creation and termination overhead in explicitly parallel programs. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2010.