automatically exploiting cross-invocation parallelism using runtime ...
automatically exploiting cross-invocation parallelism using runtime ...
automatically exploiting cross-invocation parallelism using runtime ...
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
sequential_func() {for (t = 0; t < STEP; t++) {L1: for (i = 0; i < M; i++) {A[i] = calc_1(B[C[i]]);}L2: for (j = 0; j < M; j++) {B[j] = calc_2(A[D[j]]);}}}(a) Sequential Programnaïve_parallel_func() {for (t = 0; t < STEP; t++) {L1: for (i = TID; i < M; i = i + THREADNUM) {A[i] = calc_1(B[C[i]]);}L2: for (j = TID; j < M; j = j + THREADNUM) {B[j] = calc_2(A[D[j]]);}}}(b) Naïve Parallelization without Synchronizationsbarrier_parallel_func() {for (t = 0; t < STEP; t++) {L1: for (i = TID; i < M; i = i + THREADNUM) {A[i] = calc_1(B[C[i]]);}pthread_barrier_wait(&barrier);L2: for (j = TID; j < M; j = j + THREADNUM) {B[j] = calc_2(A[D[j]]);}pthread_barrier_wait(&barrier);}}(c) Parallelization with Barrier Synchronizationtransaction_parallel_func() {for (t = 0; t < STEP; t++) {L1: for (i = TID; i < M; i = i + THREADNUM) {TX_begin();A[i] = calc_1(B[C[i]]);TX_end();}L2: for (j = TID; j < M; j = j + THREADNUM) {TX_begin();B[j] = calc_2(A[D[j]]);TX_end();}}}(d) Parallelization with TransactionsFigure 4.2: Example of parallelizing a program with different techniquesamount of time threads sit idle waiting for the slowest thread to reach the barrier. For mostof these programs, barrier overhead accounts for more than 30% of the parallel executiontime and increases with the number of threads. This overhead translates into an Amdahl’slaw limit of 3.33× max program speedup.4.1.2 Speculative <strong>cross</strong>-<strong>invocation</strong> parallelizationThe conservativeness of barrier synchronization prevents any <strong>cross</strong>-<strong>invocation</strong> parallelizationand significantly limits performance gain. Alternatively, the optimistic approach unlockspotential opportunities for <strong>cross</strong>-<strong>invocation</strong> parallelization. It allows the threads toexecute a<strong>cross</strong> the <strong>invocation</strong> boundary under the assumption that the dependences rarelymanifest at <strong>runtime</strong>. Two major optimistic solutions have been proposed in literature sofar: one relies on transactional memory and the other uses speculative barriers.52