13.07.2015 Views

automatically exploiting cross-invocation parallelism using runtime ...

automatically exploiting cross-invocation parallelism using runtime ...

automatically exploiting cross-invocation parallelism using runtime ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

for CG, LLUBENCHMARK and BLACKSCHOLES because their scheduler threads arequite small compared to the worker threads (< 5% <strong>runtime</strong>, as shown in Table 5.3) andprocessor utilization is high. ECLAT, FLUIDANIMATE and SYMM do not show as muchimprovement. The following paragraphs provide details about those programs.ECLAT from MineBench [46] is a data mining program <strong>using</strong> a vertical database format.The target loop is a two-level nested-loop. The outer loop traverses a graph of nodes.The inner loop traverses a list of items in each node and appends each item to correspondinglist(s) in the database based upon on the item’s transaction number. Since two items mightshare the same transaction number, and the transaction number is calculated non-linearly,static analysis cannot determine the dependence pattern. Profiling information shows thatthere is no dynamic dependence in the inner loop. For the outer loop, the same dependencemanifests in each iteration (99%). As a result, Spec-DOALL is chosen to parallelize theinner loop and a barrier is inserted after each <strong>invocation</strong>. Spec-DOALL achieves its peakspeedup at 3 cores. For DOMORE, a relatively large scheduler thread (12.5% scheduler/-worker ratio) limits scalability. As we can see in Figure 5.1, DOMORE achieves scalableperformance up to 5 processors. After that, the sequential code becomes the bottleneck andno more speedup is achieved.FLUIDANIMATE from the PARSEC [6] benchmark suite uses an extension of theSmoothed Particle Hydrodynamics (SPH) method to simulate an incompressible fluid forinteractive animation purposes. The target loop is a six-level nested-loop. The outer loopgoes through each particle while an inner loop goes through the nearest neighbors of thatparticle. The inner loop calculates influences between the particle and its neighbors andupdates all of their statuses. One particle can be neighbor to multiple particles, resultingin statically unanalyzable update patterns. LOCALWRITE chooses to parallelize the innerloop to attempt to reduce some of the computational redundancy. Performance resultsshow that parallelizing the inner loop does not provide any performance gain. Redundantcomputation and barrier synchronizations negate the benefits of <strong>parallelism</strong>. DOMORE is79

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!