advanced forms of parallelism (MPI, GPU) mainly borrowed already parallelized code and did not know how to further adjust the parallelism to their own computing environment. These survey results implied that researchers need parallelism to achieve bigger research goals. Nevertheless, they were not competent enough to write scalable parallel programs by themselves.

A promising alternative approach for producing multi-threaded codes is to let the compiler automatically convert single-threaded applications into multi-threaded ones. This approach is attractive as it removes the burden of writing multi-threaded code from the programmer. Additionally, it allows the compiler to automatically adjust the amount and type of parallelism extracted based on the underlying architecture, just as instruction-level parallelism (ILP) optimizations relieved programmers of the burden of targeting their applications to complex single-threaded architectures.

Numerous compiler-based automatic parallelization techniques have been proposed in the past. Some of them [1, 15] achieved success in parallelizing array-based programs with regular memory accesses and limited control flow. More recent techniques [42, 53, 55, 60, 62, 65] perform speculation and pipeline-style parallelization to successfully parallelize general-purpose codes with arbitrary control flow and memory access patterns. However, all these automatic parallelization techniques only exploit loop-level parallelism. A loop is a sequence of statements that can be executed 0, 1, or any finite number of times. A single execution of the sequence of statements is referred to as a loop iteration, and one execution of all iterations within a loop is defined as a loop invocation. These techniques parallelize each loop iteration and globally synchronize at the end of each loop invocation. Consequently, programs with many loop invocations have to synchronize frequently. These parallelization techniques fail to deliver scalable performance because synchronization forces all threads to wait for the last thread to finish an invocation [45]. At high thread counts, threads spend more time idling at synchronization points than doing useful computation. There is an opportunity to improve the performance by exploiting additional cross-invocation parallelism.
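To make the iteration/invocation distinction concrete, the following minimal C/OpenMP sketch (not taken from this dissertation; the names update, N, and STEPS are illustrative placeholders) shows a loop that is invoked many times: within each invocation the iterations run in parallel, but the implicit barrier at the end of the parallel loop forces every thread to wait for the slowest one before the next invocation can begin.

```c
/* Illustrative sketch of loop-level parallelization with a global
 * barrier after every loop invocation. update(), N, and STEPS are
 * hypothetical placeholders, not code from the dissertation. */
#include <omp.h>

#define N     1000000
#define STEPS 1000            /* number of loop invocations */

static double data[N];

/* Hypothetical per-element work done by one loop iteration. */
static double update(double x) { return x * 0.5 + 1.0; }

int main(void) {
    for (int t = 0; t < STEPS; ++t) {   /* each pass is one loop invocation */
        #pragma omp parallel for        /* iterations execute in parallel   */
        for (int i = 0; i < N; ++i)
            data[i] = update(data[i]);
        /* Implicit barrier here: all threads idle until the last one
         * finishes, and this repeats once per invocation, which is the
         * scalability bottleneck described above. */
    }
    return 0;
}
```

With STEPS invocations, the program synchronizes STEPS times; as the thread count grows, the time spent waiting at these barriers grows relative to the useful work per invocation.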
