13.07.2015 Views

automatically exploiting cross-invocation parallelism using runtime ...



3.3 Compiler Implementation

The DOMORE compiler generates scalable parallel programs by exploiting both intra-invocation and cross-invocation parallelism. DOMORE first detects a candidate code region that contains a large number of loop invocations. DOMORE currently targets loop nests whose outer loop cannot be efficiently parallelized because of frequent runtime dependences, and whose inner loop is invoked many times and can be parallelized easily. For each candidate loop nest, DOMORE generates parallel code for the scheduler and worker threads. This section uses the example loop from CG (Figure 3.1) to demonstrate each step of the code transformation. Figure 3.6(a) gives the pseudo IR code of the CG example.

3.3.1 Partitioning Scheduler and Worker

DOMORE allows threads to execute iterations from consecutive parallel invocations. However, two parallel invocations do not necessarily execute back to back; typically a sequential region lies between them. In CG's loop, statements A, B, and C belong to the sequential region. After the barriers are removed, threads must execute these sequential regions before starting the iterations of the next parallel invocation.

DOMORE executes the sequential code in the scheduler thread. This provides a general solution for handling the sequential code enclosed by the outer loop. After partitioning, only the scheduler thread executes the sequential code, so there is no redundant computation and no need for special handling of side-effecting operations. If a data flow dependence exists between the scheduler and the worker threads, the value can be forwarded to the worker threads through the same queues used to communicate synchronization conditions.

The rule for partitioning code into worker and scheduler threads is straightforward. The inner loop is partitioned into two sections. The loop-traversal instructions belong to the scheduler thread, and the inner loop body belongs to the worker thread. Instructions outside the inner loop but enclosed by the outer loop are treated as sequential code and thus
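The partitioning rule of Section 3.3.1 can be illustrated with a minimal Python sketch. It is not the code DOMORE generates: the loop nest, the names `run_partitioned`, `run_sequential`, `worker`, and `scale`, and the update performed in the inner loop body are all hypothetical. The sketch also assumes a simple stand-in for DOMORE's runtime synchronization: iteration j always goes to worker j % n_workers, so all updates to the same element are processed in order by the same worker and no barrier is needed between invocations.

```python
import queue
import threading

STOP = None  # sentinel telling a worker to exit


def worker(inbox, data):
    """Worker thread: executes only the inner loop body."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            break
        j, scale = msg                 # scale is forwarded by the scheduler
        data[j] = data[j] * scale + j  # hypothetical inner loop body


def run_partitioned(data, n_invocations, n_workers=2):
    """Scheduler thread: outer loop, sequential region, inner-loop traversal."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(q, data)) for q in inboxes]
    for t in threads:
        t.start()

    for i in range(n_invocations):     # outer loop
        scale = i + 1                  # sequential region, run once per invocation
        for j in range(len(data)):     # loop-traversal instructions
            # Assumption: a fixed owner per element stands in for runtime
            # dependence checks; the same queue carries the forwarded value.
            inboxes[j % n_workers].put((j, scale))

    for q in inboxes:
        q.put(STOP)
    for t in threads:
        t.join()
    return data


def run_sequential(data, n_invocations):
    """Reference version of the same hypothetical loop nest."""
    for i in range(n_invocations):
        scale = i + 1
        for j in range(len(data)):
            data[j] = data[j] * scale + j
    return data
```

Calling `run_partitioned([1, 2, 3, 4, 5], 3)` should give the same list as `run_sequential([1, 2, 3, 4, 5], 3)`. The scheduler never waits for an invocation to finish before dispatching the next one, which is the barrier-free behavior the text describes.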
