13.07.2015 Views

automatically exploiting cross-invocation parallelism using runtime ...

automatically exploiting cross-invocation parallelism using runtime ...

automatically exploiting cross-invocation parallelism using runtime ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

A. for (i = 0; i < N; i++) {B. start = A[i];C. end = B[i];D. for (j = start; j < end; j++) {E. update (&C[j]);}}DEBADC(a)(b)<strong>cross</strong>-iteration Dependencesintra-iteration DependencesE(c)Figure 3.1: Example program: (a) Simplified code for a nested loop in CG (b) PDG forinner loop. The dependence pattern allows DOALL parallelization. (c) PDG for outerloop. Cross-iteration dependence deriving from E to itself has manifest rate 72.4%.3.1 Motivation and OverviewTo motivate the DOMORE technique, we present a running example <strong>using</strong> the program CGfrom the NAS suite [48]. Figure 3.1 (a) shows a simplified version of CG’s performancedominating loop nest. The outer loop computes the loop bounds of the inner loop, andthe inner loop calls the update function, updating values in array C. For the outer loop,aside from induction variables, the only <strong>cross</strong>-iteration dependence is between calls to theupdate function. Profiling reveals that this dependence manifests a<strong>cross</strong> 72.4% of outerloop iterations. The inner loop has no <strong>cross</strong>-iteration dependence since no two iterations inthe same <strong>invocation</strong> update the same element in array C.The update dependence prevents DOALL parallelization [1] of the outer loop. Spec-DOALL [62] can parallelize outer loop by speculating that the update dependence doesnot occur. However, speculating the outer loop dependence is not profitable, since theupdate dependence frequently manifests a<strong>cross</strong> outer loop iterations. As a result, DOALLwill parallelize the inner loop and insert barrier synchronizations between inner loop <strong>invocation</strong>sto ensure the dependence is respected between <strong>invocation</strong>s.Figure 3.2(a) shows the execution plan for a DOALL parallelization of CG. Iterationsin the same inner loop <strong>invocation</strong> execute in parallel. After each inner loop <strong>invocation</strong>,threads synchronize at the barrier. Typically, threads do not reach barriers at the same time25

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!