automatically exploiting cross-invocation parallelism using runtime ...
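As a concrete companion to the computeAddr function of Figure 3.7(b), discussed in Section 3.3.4, here is a minimal compilable sketch. It rests on assumptions that are not in the original text: that iternum plays the role of the inner-loop index j, that C is a small global int array, and that addresses are stored as std::uintptr_t. The conflicts helper is hypothetical and only illustrates how a scheduler might compare two address sets; it is not DOMORE's actual runtime interface.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical shared array standing in for C in Figure 3.7(a).
static int C[8];

// Sketch of a computeAddr-style function.
// Assumption: iternum corresponds to the inner-loop index j, so the only
// address touched by this iteration is &C[iternum].
std::vector<std::uintptr_t> computeAddr(int iternum) {
    std::vector<std::uintptr_t> addrSet;
    addrSet.push_back(reinterpret_cast<std::uintptr_t>(&C[iternum]));
    return addrSet;
}

// Hypothetical helper (not from the source): reports whether two iterations
// access at least one common address, i.e. whether they may depend on each other.
bool conflicts(int iterA, int iterB) {
    for (std::uintptr_t a : computeAddr(iterA)) {
        for (std::uintptr_t b : computeAddr(iterB)) {
            if (a == b) {
                return true;
            }
        }
    }
    return false;
}
```

Under these assumptions, two iterations with the same index conflict and two with different indices do not. The function has no side effects, which matches the restriction stated in Section 3.3.4.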
Algorithm 3: Pseudo-code for generating the computeAddr function from the worker function

    Input: worker : worker function IR
    Input: pdg : program dependence graph
    Output: computeAddr : computeAddr function IR
    depInsts ← getCrossMemDepInsts(pdg)
    depAddr ← getMemOperands(depInsts)
    computeAddr ← reverseProgramSlice(worker, depAddr)

    for (i = 0; i < N; i++) {
        start = A[i];
        end = B[i];
        for (j = start; j < end; j++) {
            update(&C[j]);
        }
    }
    (a)

    vector computeAddr(int iternum) {
        vector addrSet;
        int addr = (long)&C[j];
        addrSet.push_back(addr);
        return addrSet;
    }
    (b)

Figure 3.7: (a) Example loop from Figure 3.1; (b) computeAddr function generated for this loop

3.3.4 Generating the computeAddr function

The scheduler thread uses the computeAddr function to determine which addresses will be accessed by worker threads. DOMORE automatically generates the computeAddr function from the worker thread function using Algorithm 3. The algorithm takes as input the worker thread's IR in SSA form and a program dependence graph (PDG) describing the dependences in the original loop nest. The compiler uses the PDG to find all instructions with memory dependences across the inner loop iterations or invocations. These instructions will consist of loads and stores. In the worker thread, program slicing [74] is performed to create the set of instructions required to generate the address of the memory being accessed. Presently, the DOMORE transformation does not handle computeAddr functions with side-effects. If program slicing duplicates instructions with side-effects, the DOMORE transformation aborts. After the transformation, a performance guard compares the weights of the computeAddr function and the original worker thread. If the computeAddr function is too heavy relative to the original worker, the scheduler would