automatically exploiting cross-invocation parallelism using runtime ...

applied to the outermost loop, generating a parallel program with the redundant code in the scheduler thread; each inner loop iteration is scheduled only to the appropriate owner thread. Although DOMORE reduces the overhead of redundant computation, partitioning the redundant code to the scheduler increases the size of the sequential region, which becomes the major factor limiting scalability in this case.

SYMM from the PolyBench [54] suite demonstrates the capabilities of a very simple multi-grid solver in computing a three-dimensional potential field. The target loop is a three-level nested loop, and DOALL is applicable to the second-level inner loop. As the results show, even after the DOMORE optimization, the scalability of SYMM is poor. The major cause is that each inner loop invocation takes only about 4,000 clock cycles; as the number of threads increases, the overhead of multi-threading outweighs all performance gain.

The performance of DOMORE is limited by the sequential scheduler thread at large thread counts. To address this problem, we could parallelize the computeAddr function; the algorithm proposed in [36] could be adapted for that purpose. This is left as future work.
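The scheduling scheme described above can be sketched as follows. This is a minimal, hypothetical illustration, not DOMORE's actual implementation: `compute_addr`, its access pattern, and the worker count are stand-ins, and the real transform enqueues iterations to concurrently running worker threads rather than collecting them in lists.

```python
from collections import defaultdict

NUM_WORKERS = 4

def compute_addr(i):
    """Hypothetical stand-in for computeAddr: maps an inner-loop
    iteration to the worker thread that owns the data it accesses."""
    return (i * 7) % NUM_WORKERS

def schedule(num_iters):
    """Scheduler thread: runs the (otherwise redundant) address
    computation once per iteration and places each iteration on the
    work list of its owner thread. This loop is sequential, which is
    why the scheduler limits scalability at large thread counts."""
    worklists = defaultdict(list)
    for i in range(num_iters):
        worklists[compute_addr(i)].append(i)
    return worklists

def run_workers(worklists):
    """Each worker executes only the iterations it owns; here the
    'loop body' simply records which iterations ran."""
    executed = []
    for w in sorted(worklists):
        executed.extend(worklists[w])
    return executed
```

Because every iteration is assigned to exactly one owner, no two workers ever touch the same owned data, which is what lets the worker loops run without synchronization; the price is that `schedule` itself remains a sequential region.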

[Figure 5.1: Performance comparison between code parallelized with pthread barrier and DOMORE. Each panel plots loop speedup against number of threads (2–24) for DOMORE and the pthread-barrier baseline: (a) BLACKSCHOLES, (b) CG, (c) ECLAT, (d) FLUIDANIMATE-1, (e) LLUBENCH, (f) SYMM.]
