Parallel programming with OpenMP - LinkSCEEM
Klaus Klingmüller<br />
Parallel programming<br />
with OpenMP<br />
Klaus Klingmüller<br />
The Cyprus Institute<br />
Acknowledgement<br />
Alexander Schnurpfeil
Distributed memory<br />
Each CPU has its own RAM; the CPUs exchange data by message passing, e.g. via MPI.<br />
Shared memory<br />
All CPUs access one common RAM; this is the domain of <strong>OpenMP</strong>.<br />
Shared memory nodes<br />
MPI + <strong>OpenMP</strong><br />
A cluster of shared memory nodes: MPI between the nodes, <strong>OpenMP</strong> <strong>with</strong>in each node (several CPUs sharing one RAM per node).
Why not simply one MPI task per CPU?<br />
Sometimes this actually is the best solution,<br />
but not if<br />
• memory consumption is relevant<br />
• communication overhead is a problem<br />
• the nature of the problem limits the number<br />
of MPI tasks<br />
Often only little extra work is required for<br />
MPI/<strong>OpenMP</strong> hybrid parallelisation
Targets<br />
• Multiprocessor systems<br />
• Multi-core CPUs<br />
• Simultaneous multithreading CPUs<br />
Programming languages<br />
C, C++, Fortran
Highlights<br />
• Easy to use<br />
- compiler works out the details<br />
- incremental parallelisation<br />
- single source code for sequential<br />
and parallel version<br />
• De facto standard<br />
- widely adopted<br />
- strong vendor support
History<br />
• Proprietary designs by some vendors (SGI, CRAY,<br />
ORACLE, IBM, ...) at the end of the 1980s<br />
• Different unsuccessful attempts to standardize API<br />
• PCF<br />
• ANSI X3H5<br />
• <strong>OpenMP</strong>-Forum founded to define portable API<br />
• 1997 first API for Fortran (V1.0)<br />
• 1998 first API for C/C++ (V1.0)<br />
• 2000 Fortran V2.0<br />
• 2002 C/C++ V2.0<br />
• 2005 combined C/C++/Fortran specification V2.5<br />
• 2008 combined C/C++/Fortran specification V3.0<br />
• 2011 combined C/C++/Fortran specification V3.1<br />
16. August 2011 Slide 12
Why start using <strong>OpenMP</strong> now?<br />
• Multi-core CPUs everywhere<br />
• More and more cores per CPU<br />
Cy-Tera: 12 cores per node<br />
24 simultaneous threads<br />
• OpenACC extends <strong>OpenMP</strong>'s concept to<br />
GPU <strong>programming</strong><br />
tomorrow's lecture<br />
• Many integrated cores (MIC):<br />
1 TFLOPS on one PCIe card
Literature/Links<br />
• Chapman, Jost, van der Pas, Using <strong>OpenMP</strong><br />
• www.openmp.org<br />
Specifications, summary cards<br />
• www.compunity.org<br />
• www.google.com#q=openmp<br />
• linksceem.eu/ATutor/go.php/7/index.php
euclid.cyi.ac.cy<br />
tar -xf /opt/examples/pragma-training/openmp_tutorial.tar.gz
Demo: Mandelbrot set<br />
Complex numbers c,<br />
for which the sequence<br />
z_0, z_1, z_2, ...,<br />
where<br />
z_0 = 0,<br />
z_{i+1} = z_i^2 + c,<br />
is bounded
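The listings on the following slides call an `iterate()` helper that these extracted slides never show. A plausible sketch, assuming `iterate(rec, imc, ni)` returns the number of iterations of z → z² + c before |z| exceeds 2, capped at `ni` (the name and signature are taken from the calls in the slides; the body is an assumption):

```c
/* Hypothetical sketch of the iterate() helper used in the Mandelbrot
   examples: count iterations of z -> z^2 + c until |z| > 2 or the
   iteration limit ni is reached. */
int iterate(double rec, double imc, int ni)
{
    double zr = 0., zi = 0., tmp;
    int n;
    for (n = 0; n < ni; n++) {
        if (zr * zr + zi * zi > 4.)   /* |z| > 2: the sequence diverges */
            break;
        tmp = zr * zr - zi * zi + rec; /* real part of z^2 + c */
        zi = 2. * zr * zi + imc;       /* imaginary part of z^2 + c */
        zr = tmp;
    }
    return n;
}
```

Points inside the set exhaust all `ni` iterations; points far outside escape after very few, which is why the workload per pixel varies so strongly.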
mandelbrot/cmandelbrot-v1.c<br />
int main()<br />
{<br />
int i, j, result[NREC][NIMC];<br />
double deltarec, deltaimc, rec, imc;<br />
deltarec = (RECMAX - RECMIN) / (NREC - 1);<br />
deltaimc = (IMCMAX - IMCMIN) / (NIMC - 1);<br />
for (i = 0; i < NREC; i++) {<br />
rec = RECMIN + deltarec * i;<br />
for (j = 0; j < NIMC; j++) {<br />
imc = IMCMIN + deltaimc * j;<br />
result[i][j] = iterate(rec, imc, NI);<br />
}<br />
}<br />
printpgm(NREC, NIMC, NI, result[0]);<br />
return 0;<br />
}
mandelbrot/fmandelbrot-v1.f90<br />
PROGRAM mandelbrot<br />
IMPLICIT NONE<br />
INTEGER, PARAMETER :: ni = 30000, nrec = 800, nimc = 600<br />
REAL, PARAMETER :: recmin = -2., recmax = .666, imcmin = -1., imcmax =1.<br />
INTEGER :: n, i, j, result(nrec, nimc)<br />
REAL :: rec, imc, deltarec, deltaimc<br />
COMPLEX :: c<br />
deltarec = (recmax - recmin) / (nrec - 1)<br />
deltaimc = (imcmax - imcmin) / (nimc - 1)<br />
DO j = 1, nimc<br />
imc = imcmin + deltaimc * (j - 1)<br />
DO i = 1, nrec<br />
rec = recmin + deltarec * (i - 1)<br />
c = CMPLX(rec, imc)<br />
CALL iterate(c, ni, n)<br />
result(i, j) = n<br />
END DO<br />
END DO<br />
CALL printpgm(nrec, nimc, ni, result)<br />
END PROGRAM mandelbrot
C<br />
deltarec = (RECMAX - RECMIN) / (NREC - 1);<br />
deltaimc = (IMCMAX - IMCMIN) / (NIMC - 1);<br />
#pragma omp parallel for default(none) \<br />
shared(result, deltarec, deltaimc) private(i, j, rec, imc)<br />
for (i = 0; i < NREC; i++) {<br />
rec = RECMIN + deltarec * i;<br />
for (j = 0; j < NIMC; j++) {<br />
imc = IMCMIN + deltaimc * j;<br />
result[i][j] = iterate(rec, imc, NI);<br />
}<br />
}<br />
Fortran<br />
deltarec = (recmax - recmin) / (nrec - 1)<br />
deltaimc = (imcmax - imcmin) / (nimc - 1)<br />
!$OMP PARALLEL DO DEFAULT(NONE) &<br />
!$OMP SHARED(result, deltarec, deltaimc) PRIVATE(n, i, j, rec, imc, c)<br />
DO j = 1, nimc<br />
imc = imcmin + deltaimc * (j - 1)<br />
DO i = 1, nrec<br />
rec = recmin + deltarec * (i - 1)<br />
c = CMPLX(rec, imc)<br />
CALL iterate(c, ni, n)<br />
result(i, j) = n<br />
END DO<br />
END DO<br />
!$OMP END PARALLEL DO<br />
CALL printpgm(nrec, nimc, ni, result)
Compile & run<br />
C<br />
make<br />
qsub cmandelbrot-v1.pbs<br />
...<br />
cat out.txt<br />
Fortran<br />
make<br />
qsub fmandelbrot-v1.pbs<br />
...<br />
cat out.txt
C<br />
deltarec = (RECMAX - RECMIN) / (NREC - 1);<br />
deltaimc = (IMCMAX - IMCMIN) / (NIMC - 1);<br />
#pragma omp parallel for default(none) schedule(dynamic) \<br />
shared(result, deltarec, deltaimc) private(i, j, rec, imc)<br />
for (i = 0; i < NREC; i++) {<br />
rec = RECMIN + deltarec * i;<br />
for (j = 0; j < NIMC; j++) {<br />
imc = IMCMIN + deltaimc * j;<br />
result[i][j] = iterate(rec, imc, NI);<br />
}<br />
}<br />
Fortran<br />
deltarec = (recmax - recmin) / (nrec - 1)<br />
deltaimc = (imcmax - imcmin) / (nimc - 1)<br />
!$OMP PARALLEL DO DEFAULT(NONE) SCHEDULE(DYNAMIC) &<br />
!$OMP SHARED(result, deltarec, deltaimc) PRIVATE(n, i, j, rec, imc, c)<br />
DO j = 1, nimc<br />
imc = imcmin + deltaimc * (j - 1)<br />
DO i = 1, nrec<br />
rec = recmin + deltarec * (i - 1)<br />
c = CMPLX(rec, imc)<br />
CALL iterate(c, ni, n)<br />
result(i, j) = n<br />
END DO<br />
END DO<br />
!$OMP END PARALLEL DO<br />
CALL printpgm(nrec, nimc, ni, result)
Basics
Basic <strong>Parallel</strong> Programming Paradigm: SPMD<br />
• SPMD: Single Program Multiple Data<br />
• Programmer writes one program, which is executed on all<br />
processors<br />
• Basic paradigm for implementing parallel programs<br />
• Special cases (e.g., different control flow) handled inside<br />
the program<br />
Processes and Threads<br />
Process<br />
• active instance of program code<br />
• unique process id (pid)<br />
• stack for storing local data<br />
• heap for storing dynamic data<br />
• space for storing global data<br />
Threads<br />
• one or more threads belong to a<br />
process<br />
• thread id (tid)<br />
• each thread has its own stack<br />
Threads<br />
• Elements of a process, e.g. program code, data area, heap<br />
etc. are shared among the existing threads<br />
• A single program <strong>with</strong> several threads is able to handle<br />
several tasks concurrently<br />
• Shared access to common resources:<br />
• Less information has to be stored<br />
• It is more efficient to switch between threads instead of<br />
processes<br />
<strong>OpenMP</strong> is based on this thread concept<br />
Single-threaded vs. Multi-threaded<br />
Fork/join model: the initial thread forks a team of threads, which later join back into the initial thread.<br />
<strong>Parallel</strong> construct<br />
C, C++<br />
#pragma omp parallel [clause[[,] clause]. . . ]<br />
structured block<br />
Fortran<br />
!$omp parallel [clause[[,] clause]. . . ]<br />
structured block<br />
!$omp end parallel
parallel_construct/parallel_construct.c<br />
#include <stdio.h><br />
#include <omp.h><br />
int main()<br />
{<br />
printf("This is thread %d. ", omp_get_thread_num());<br />
printf("Number of threads: %d.\n", omp_get_num_threads());<br />
printf("+++\n");<br />
#pragma omp parallel<br />
{<br />
printf("This is thread %d. ", omp_get_thread_num());<br />
printf("Number of threads: %d.\n", omp_get_num_threads());<br />
}<br />
printf("+++\n");<br />
printf("This is thread %d. ", omp_get_thread_num());<br />
printf("Number of threads: %d.\n", omp_get_num_threads());<br />
return 0;<br />
}
parallel_construct/parallel_construct.c<br />
Fortran: instead of the #include lines,<br />
use omp_lib
Syntax of Directives<br />
• Directives are special compiler pragmas<br />
• Directive and API function names are case-sensitive and<br />
are written lowercase<br />
• There are no END directives as in Fortran, but rather the<br />
directives apply to the following structured block (statement<br />
<strong>with</strong> one entry and one exit)<br />
#pragma omp DirectiveName [ParameterList]<br />
structured-block<br />
• Directive continuation lines:<br />
#pragma omp DirectiveName here-is-something \<br />
and-here-is-some-more<br />
Syntax of Directives<br />
• Directives are special-formatted comments<br />
• Directives ignored by non-<strong>OpenMP</strong> compiler<br />
• Case-insensitive<br />
!$OMP DirectiveName [ParameterList]<br />
block<br />
!$OMP END DirectiveName<br />
• Directive continuation lines:<br />
!$OMP DirectiveName here-is-something &<br />
!$OMP and-here-is-some-more<br />
Directives<br />
• In general, all directives have the same structure in C/C++<br />
and Fortran<br />
#pragma omp DirectiveName [ParameterList]<br />
!$OMP DirectiveName [ParameterList]<br />
• There are directives for<br />
• <strong>Parallel</strong>ism<br />
• Work<br />
• Synchronization<br />
• Many parameters apply to directives equally (e.g.,<br />
PRIVATE(var-list))<br />
Environment Variables<br />
• OMP_NUM_THREADS: set default number of threads for parallel regions<br />
• OMP_SCHEDULE: set scheduling type for parallel loops <strong>with</strong> scheduling<br />
strategy RUNTIME<br />
Run-time Library Functions<br />
#include <omp.h> // header<br />
omp_set_num_threads(nthreads)<br />
• Set number of threads to use for subsequent parallel<br />
regions (overrides value of OMP_NUM_THREADS)<br />
nthreads = omp_get_num_threads()<br />
• Returns the number of threads in current team<br />
iam = omp_get_thread_num()<br />
• Returns thread number ( iam = 0 .. nthreads-1)<br />
time = omp_get_wtime()<br />
• Returns elapsed wall clock time in seconds<br />
Data Environment I<br />
• Data objects (variables) can be shared or private<br />
• By default almost all variables are shared<br />
• Accessible to all threads<br />
• Single instance in shared memory<br />
• Caveat: pointers themselves are shared, but not the data they point to<br />
• Variables can be declared private<br />
• Then each thread allocates its own private copy of the<br />
data<br />
• Only exists during the execution of a parallel region!<br />
• Value undefined upon entry of parallel region<br />
Data Environment II<br />
• Exceptions to default shared:<br />
• Loop index variable in parallel loop<br />
• Local variables of subroutines called inside parallel<br />
regions<br />
Directive Parameters for Data Environment I<br />
• Variables (varList) are declared to be private<br />
PRIVATE(varList)<br />
• Variables (varList) are declared to be shared<br />
SHARED(varList)<br />
• Set the default data mode (private only allowed in Fortran)<br />
DEFAULT(shared | private | none)<br />
• Variables (varList) are declared to be private but also get<br />
initialized <strong>with</strong> the value the original shared variable had<br />
just before entering the parallel region<br />
FIRSTPRIVATE(varList)<br />
Directive Parameters for Data Environment II<br />
• Variables (varList) are declared private and the value from<br />
the sequentially last iteration (or lexically last section) is<br />
written to the corresponding shared variable afterwards<br />
LASTPRIVATE(varList)<br />
• Variables can appear in both FIRSTPRIVATE and<br />
LASTPRIVATE<br />
• (varList): comma-separated list of variables<br />
<strong>Parallel</strong> Region II<br />
• Nesting of parallel regions is allowed, but not all compilers<br />
implement this feature yet. The master thread is always<br />
thread 0; it executes the sequential parts of the<br />
program<br />
• Typically, inside of parallel regions further <strong>OpenMP</strong><br />
constructs are used to coordinate the work of the different<br />
threads: work-sharing constructs like a parallel loop<br />
Exercises: Introduction I<br />
• Go to directory /ex_introduction<br />
a) Write a program where the master thread forks a parallel<br />
region. Each thread should print its thread number.<br />
Furthermore, only the master thread should print the used<br />
number of threads.<br />
b) Let the program run <strong>with</strong> two and four threads respectively.<br />
Set the number of threads by using<br />
1) The corresponding environment variable<br />
2) The corresponding <strong>OpenMP</strong> library function<br />
For Fortran programmers: include "use omp_lib" in your<br />
code when you use the Intel compiler<br />
Loops
Loop construct<br />
C, C++<br />
#pragma omp for [clause[[,] clause]. . . ]<br />
for-loop<br />
Fortran<br />
!$omp do [clause[[,] clause]. . . ]<br />
do-loop<br />
!$omp end do [nowait]
loop_construct/loop_construct-v1.c<br />
#include <stdio.h><br />
#include <omp.h><br />
#define N 10<br />
int main()<br />
{<br />
int i;<br />
#pragma omp parallel<br />
{<br />
#pragma omp for<br />
for (i = 0; i < N; i++) {<br />
printf("i = %d (thread %d)\n", i, omp_get_thread_num());<br />
}<br />
}<br />
return 0;<br />
}
Do-loop Execution I<br />
Program<br />
a = 1<br />
!$OMP PARALLEL<br />
a = 2<br />
!$OMP DO<br />
do i=1,9<br />
call work(i)<br />
enddo<br />
!$OMP END DO<br />
!$OMP END PARALLEL<br />
a = 3<br />
(<strong>Parallel</strong>) Execution<br />
Thread 0: a = 2; do i=1,3: call work(i)<br />
Thread 1: a = 2; do i=4,6: call work(i)<br />
Thread 2: a = 2; do i=7,9: call work(i)<br />
Afterwards the initial thread sets a = 3<br />
<strong>Parallel</strong> Loop<br />
• Iterations of a parallel loop are executed in parallel by all<br />
threads of the current team<br />
• The calculations inside an iteration must not depend on<br />
other iterations: responsibility of the programmer<br />
• A schedule determines how iterations are divided among<br />
the threads<br />
• Specified by SCHEDULE parameter<br />
• Default: undefined!<br />
• The form of the loop has to allow computing the number of<br />
iterations prior to entry into the loop: e.g., no WHILE<br />
loops<br />
• Implicit barrier synchronization at the end of the loop unless<br />
NOWAIT parameter is specified<br />
<strong>Parallel</strong> Loop for C/C++<br />
• Same restrictions as <strong>with</strong> Fortran DO-loop<br />
• No END pragma: NOWAIT parameter is specified at entry<br />
of loop<br />
• for loop can only have very restricted form<br />
#pragma omp for<br />
for ( var = first ; var cmp-op end ; incr-expr)<br />
loop-body<br />
• Comparison operators (cmp-op)<br />
<, <=, >, >=<br />
• Increment expression (incr-expr)<br />
++var, var++, --var, var--,<br />
var += incr, var -= incr,<br />
var = var + incr, var = incr + var<br />
• first, end, incr: constant expressions<br />
Canonical Loop Structure<br />
Restrictions on loops to simplify compiler-based parallelization<br />
• Allows number of iterations to be computed at loop entry<br />
• Program must complete all iterations of the loop<br />
• no break or goto leaving the loop (C/C++)<br />
• no EXIT or GOTO leaving the loop (Fortran)<br />
• Exiting current iteration and starting next one possible<br />
• continue allowed (C/C++)<br />
• CYCLE allowed (Fortran)<br />
• Termination of the entire program inside the loop<br />
possible<br />
• exit allowed (C/C++)<br />
• STOP allowed (Fortran)<br />
Combined Constructs (Shortcut Version)<br />
• If a parallel region only contains a parallel loop or parallel<br />
sections, a shortcut notation can be used:<br />
#pragma omp parallel for (C/C++)<br />
!$OMP PARALLEL DO (Fortran)<br />
loop_construct/loop_construct-v1.c<br />
#include <stdio.h><br />
#include <omp.h><br />
#define N 10<br />
int main()<br />
{<br />
int i;<br />
#pragma omp parallel<br />
{<br />
#pragma omp for<br />
for (i = 0; i < N; i++) {<br />
printf("i = %d (thread %d)\n", i, omp_get_thread_num());<br />
}<br />
}<br />
return 0;<br />
}
loop_construct/loop_construct-v2.c<br />
#include <stdio.h><br />
#include <omp.h><br />
#define N 10<br />
int main()<br />
{<br />
int i;<br />
#pragma omp parallel for<br />
for (i = 0; i < N; i++) {<br />
printf("i = %d (thread %d)\n", i, omp_get_thread_num());<br />
}<br />
return 0;<br />
}
Example: Work Sharing <strong>with</strong> <strong>Parallel</strong> Loop II<br />
<strong>Parallel</strong> region and parallel for-loop<br />
#pragma omp parallel<br />
{<br />
#pragma omp for schedule(static)<br />
for (i = 0; i < 100; i++) {<br />
a[i] = b[i] + c[i];<br />
}<br />
}<br />
Example: Work Sharing <strong>with</strong> <strong>Parallel</strong> Loop III<br />
<strong>Parallel</strong> region and parallel do-loop<br />
!$OMP PARALLEL<br />
!$OMP DO SCHEDULE(STATIC)<br />
do i = 1, 100<br />
a(i) = b(i) + c(i)<br />
enddo<br />
!$OMP END DO<br />
!$OMP END PARALLEL<br />
!$OMP PARALLEL DO SCHEDULE(STATIC)<br />
do i = 1, 100<br />
a(i) = b(i) + c(i)<br />
enddo<br />
!$OMP END PARALLEL DO<br />
Static and Dynamic Scheduling<br />
• Static scheduling<br />
• Distribution is done at loop-entry time based on<br />
• Number of threads<br />
• Total number of iterations<br />
• Less flexible<br />
• Almost no scheduling overhead<br />
• Dynamic scheduling<br />
• Distribution is done during execution of the loop<br />
• Each thread is assigned a subset of the iterations at loop<br />
entry<br />
• After completion each thread asks for more iterations<br />
• More flexible<br />
• Can easily adjust to load imbalances<br />
• More scheduling overhead (synchronization)<br />
Scheduling Strategies I<br />
• Distribution of iterations occurs in chunks<br />
• Chunks may have different sizes<br />
• Chunks are assigned either statically or dynamically<br />
• There are different assignment algorithms (types)<br />
• SCHEDULE parameter<br />
schedule (type [, chunk])<br />
• Schedule types<br />
• static<br />
• dynamic<br />
• guided<br />
• runtime<br />
Scheduling Strategies II<br />
• Schedule type STATIC <strong>with</strong>out chunk size<br />
• One chunk of iterations per thread, all chunks (nearly) equal size<br />
• Schedule type STATIC <strong>with</strong> chunk size<br />
• Chunks <strong>with</strong> specified size are assigned in round-robin fashion<br />
• Schedule type DYNAMIC<br />
• Threads request new chunks dynamically<br />
• Default chunk size is 1<br />
• Schedule type GUIDED<br />
• First chunk has implementation-dependent size<br />
• Size of each successive chunk decreases exponentially<br />
• Chunks are assigned dynamically<br />
• Chunk size specifies the minimum size, default is 1<br />
Scheduling Strategies III<br />
• Schedule type RUNTIME<br />
• Scheduling strategy is determined by environment variable<br />
export OMP_SCHEDULE="type [, chunk]"<br />
• If variable is not set, scheduling is implementation dependent<br />
• No SCHEDULE parameter<br />
• Scheduling implementation dependent<br />
• Correctness of program must not depend on scheduling<br />
strategy<br />
• Setting via runtime library routines possible (<strong>OpenMP</strong> 3.0)<br />
• omp_set_schedule / omp_get_schedule<br />
Scheduling - Summary<br />
type | chunk | chunk size | number of chunks<br />
static | no | n/p | p<br />
static | yes | c | n/c (overlapping)<br />
dynamic | optional | c | n/c<br />
guided | optional | first n/p, then decreasing | less than n/c<br />
runtime | no | depends | depends<br />
n: number of iterations, p: number of threads, c: chunk size<br />
Reason for Different Work Schedules<br />
Work load should be balanced, e.g. each thread needs the same<br />
period of time to handle its task<br />
• Load might randomly differ from iteration to iteration<br />
‣ dynamic scheduling<br />
• Execution speed might increase/decrease <strong>with</strong> increasing iteration<br />
index<br />
‣ guided scheduling: chunk size is proportional to the number of<br />
iterations left divided by the number of threads in the team<br />
Exercise: <strong>OpenMP</strong> Work Distribution<br />
do i = 1,10<br />
call work(i)<br />
end do<br />
Which schedule clause results in which work distribution?<br />
(Four distributions a) to d) of iterations 1 to 10 among threads 0 to 3 were shown as a figure.)<br />
<strong>OpenMP</strong> Work Distribution<br />
do i = 1,10<br />
call work(i)<br />
end do<br />
• Sequential: iterations 1 to 10 run on one thread<br />
• Cyclic: !$omp do schedule(static,1) ...<br />
• Block: !$omp do schedule(static) ...<br />
• <strong>OpenMP</strong> also provides:<br />
schedule(dynamic [, chunk]), schedule(guided [, chunk])<br />
[Figure: assignment of image rows (Re c) to threads in the Mandelbrot example]<br />
Exercise<br />
Test different scheduling options in the<br />
Mandelbrot example and compare the<br />
performance
Synchronization: NOWAIT<br />
• Use NOWAIT parameter for parallel loops which do not<br />
need to be synchronized upon exit of the loop<br />
• keeps synchronization overhead low<br />
#pragma omp parallel<br />
{<br />
#pragma omp for nowait<br />
for (i = 0; i < 100; i++) {<br />
a[i] = b[i] + c[i];<br />
}<br />
#pragma omp for nowait<br />
for (i = 0; i < 100; i++) {<br />
z[i] = sqrt(b[i]);<br />
}<br />
}<br />
Synchronization: NOWAIT<br />
• Use NOWAIT parameter for parallel loops which do not<br />
need to be synchronized upon exit of the loop<br />
• keeps synchronization overhead low<br />
!$OMP PARALLEL<br />
!$OMP DO<br />
do i = 1, 100<br />
a(i) = b(i) + c(i)<br />
enddo<br />
!$OMP END DO NOWAIT<br />
!$OMP DO<br />
do i = 1, 100<br />
z(i) = sqrt(b(i))<br />
enddo<br />
!$OMP END DO NOWAIT<br />
!$OMP END PARALLEL<br />
Loop Nests<br />
• <strong>Parallel</strong> loop construct only applies to the loop directly following<br />
it<br />
• Nesting of work-sharing constructs is illegal in <strong>OpenMP</strong>!<br />
• <strong>Parallel</strong>ization of nested loops<br />
• Normally try to parallelize the outer loop: less overhead<br />
• Sometimes necessary to parallelize inner loops (e.g., small n): rearrange<br />
loops?<br />
• Manually re-write into one loop<br />
• Nested parallel regions not yet supported by all compilers<br />
If Clause: Conditional <strong>Parallel</strong>ization I<br />
• Avoiding parallel overhead because of low number of loop<br />
iterations<br />
Explicit versioning:<br />
if (size > 10000) {<br />
#pragma omp parallel for<br />
for (i=0 ; i < size; i++) {<br />
...<br />
}<br />
}<br />
If Clause: Conditional <strong>Parallel</strong>ization II<br />
Using the "if clause":<br />
#pragma omp parallel for if (size > 10000)<br />
for (i=0 ; i < size; i++) {<br />
...<br />
}<br />
Exercises: Introduction II<br />
a) Write a program that implements DAXPY (Double<br />
precision real Alpha X Plus Y):<br />
y = α·x + y<br />
(vector sizes and values can be chosen as you like)<br />
‣ Measure the time it takes in dependence of the number<br />
of threads<br />
Exercise<br />
Matrix times vector: y_i = Σ_j a_ij x_j<br />
Solution in using_openmp_examples/<br />
Reduction
Reduction Parameter<br />
• For solving cases like this, <strong>OpenMP</strong> provides the<br />
REDUCTION parameter<br />
Reductions<br />
• Reductions often occur <strong>with</strong>in parallel regions or loops<br />
• Reduction variables have to be shared in enclosing parallel<br />
context<br />
• Thread-local results get combined <strong>with</strong> outside variables<br />
using reduction operation<br />
• Note: order of operations unspecified: can produce<br />
different results than sequential version<br />
• Applications: Compute sum of array or find the largest<br />
element<br />
Reduction Operators<br />
The table shows:<br />
• The operators allowed for reductions<br />
• The initial values used to initialize<br />
Operator | Data Types | Initial Value<br />
+ | floating point, integer | 0<br />
* | floating point, integer | 1<br />
- | floating point, integer | 0<br />
& | integer | all bits on<br />
| | integer | 0<br />
^ | integer | 0<br />
&& | integer | 1<br />
|| | integer | 0<br />
Reduction Operators I (Fortran)<br />
The table shows:<br />
• The operators and intrinsics allowed for reductions<br />
• The initial values used to initialize<br />
Operator | Data Types | Initial Value<br />
+ | floating point, integer (complex or real) | 0<br />
* | floating point, integer (complex or real) | 1<br />
- | floating point, integer (complex or real) | 0<br />
.AND. | logical | .TRUE.<br />
.OR. | logical | .FALSE.<br />
.EQV. | logical | .TRUE.<br />
.NEQV. | logical | .FALSE.<br />
Reduction Operators II (Fortran)<br />
Operator | Data Types | Initial Value<br />
MAX | floating point, integer (real only) | smallest possible value<br />
MIN | floating point, integer (real only) | largest possible value<br />
IAND | integer | all bits on<br />
IOR | integer | 0<br />
IEOR | integer | 0<br />
Exercises<br />
• x·y (dot product)<br />
• n! (factorial)
reduction_clause/reduction_clause.c<br />
#include <stdio.h><br />
int main()<br />
{<br />
int i;<br />
double a[3], b[3], dotp;<br />
a[0] = .2;<br />
a[1] = 2.2;<br />
a[2] = -1.;<br />
b[0] = 1.;<br />
b[1] = -.5;<br />
b[2] = 1.5;<br />
dotp = 0;<br />
#pragma omp parallel for reduction(+:dotp) \<br />
default(none) shared(a, b) private(i)<br />
for (i = 0; i < 3; i++) {<br />
dotp += a[i] * b[i];<br />
}<br />
printf("%g\n", dotp);<br />
return 0;<br />
}
Exercises: Advanced Topics I<br />
• Go to directory /ex_advanced_topics<br />
a) π can be calculated in the following way:<br />
π = Σ_{i=1}^{n} 4.0 / (1.0 + x_i^2) · Δx, where x_i = (i - 0.5) · Δx and Δx = 1/n<br />
Write a program that does the calculation and parallelize it<br />
using <strong>OpenMP</strong> (10^8 iterations). How much time does it<br />
take?<br />
Play around <strong>with</strong> the "schedule" clause. Which setting<br />
gives the best performance?<br />
Exercise<br />
Monte Carlo computation of<br />
π<br />
Random number generators in random/ (don't use them<br />
for science), solution in pi/
pi/cpi.c<br />
#include <stdio.h><br />
#ifdef _OPENMP<br />
#include <omp.h><br />
#else<br />
#define omp_get_thread_num() 0<br />
#endif<br />
#define N 1000000000<br />
#include "../random/rand_r.c"<br />
int main()<br />
{<br />
long i, counter;<br />
unsigned seed;<br />
double x, y, pi;<br />
counter = 0;<br />
#pragma omp parallel default(none) private(i, x, y, seed) \<br />
reduction(+: counter)<br />
{<br />
seed = omp_get_thread_num();<br />
#pragma omp for<br />
for (i = 0; i < N; i++) {<br />
x = (double) rand_r(&seed) / RAND_MAX;<br />
y = (double) rand_r(&seed) / RAND_MAX;<br />
if (x * x + y * y <= 1.) counter++;<br />
}<br />
}<br />
pi = 4. * counter / N;<br />
printf("%g\n", pi);<br />
return 0;<br />
}
Conditional Compilation<br />
• Preprocessor macro _OPENMP for C/C++ and Fortran:<br />
#ifdef _OPENMP<br />
iam = omp_get_thread_num();<br />
#endif<br />
• Special comment for Fortran preprocessor<br />
!$ iam = OMP_GET_THREAD_NUM()<br />
• Helpful check of serial and parallel version of the code<br />
Synchronisation
What now?<br />
Critical Region<br />
Critical Region II<br />
• A critical region restricts execution of the associated block<br />
of statements to a single thread at a time<br />
• An optional name may be used to identify the critical region<br />
• A thread waits at the beginning of a critical region until no<br />
other thread is executing a critical region (anywhere in the<br />
program) <strong>with</strong> the same name<br />
Example: Critical Region<br />
• Note: the loop body only consists of a critical region<br />
• The program gets extremely slow: no speedup<br />
Critical Region III<br />
Example: Critical Region II<br />
Atomic Statements<br />
• The ATOMIC directive ensures that a specific memory<br />
location is updated atomically.<br />
• An OpenMP implementation could replace it with a CRITICAL<br />
construct, but the ATOMIC construct permits better optimization<br />
(based on hardware instructions)<br />
• The statement following the ATOMIC directive must have<br />
one of the following forms:<br />
x binop= expr, x++, ++x, x--, --x<br />
Example: Atomic Statement<br />
s = 0.;<br />
#pragma omp parallel private(i, s_local)<br />
{<br />
s_local = 0.0;<br />
#pragma omp for<br />
for (i = 0; i < n; i++)<br />
s_local += a[i];<br />
#pragma omp atomic<br />
s += s_local;<br />
}<br />
Locks<br />
• A more flexible way of implementing "critical regions"<br />
• Lock variable type: omp_lock_t, passed by address<br />
• Initialize the lock<br />
omp_init_lock(&lockvar)<br />
• Remove (deallocate) the lock<br />
omp_destroy_lock(&lockvar)<br />
• Block the calling thread until the lock is available<br />
omp_set_lock(&lockvar)<br />
• Release the lock<br />
omp_unset_lock(&lockvar)<br />
• Test and try to set the lock (returns 1 on success, else 0)<br />
intvar = omp_test_lock(&lockvar)<br />
Locks<br />
• A more flexible way of implementing "critical regions"<br />
• Lock variables have to be of type INTEGER (KIND=OMP_LOCK_KIND)<br />
• Initialize the lock<br />
CALL OMP_INIT_LOCK(lockvar)<br />
• Remove (deallocate) the lock<br />
CALL OMP_DESTROY_LOCK(lockvar)<br />
• Block the calling thread until the lock is available<br />
CALL OMP_SET_LOCK(lockvar)<br />
• Release the lock<br />
CALL OMP_UNSET_LOCK(lockvar)<br />
• Test and try to set the lock (returns .TRUE. on success)<br />
logicalvar = OMP_TEST_LOCK(lockvar)<br />
Example: Locks<br />
Master / Single I<br />
• Sometimes it is useful to have just one thread execute code<br />
within a parallel region, e.g., to read/write data. OpenMP<br />
provides two ways to accomplish this:<br />
• MASTER construct: The master thread (thread 0)<br />
executes the enclosed code, all other threads ignore<br />
this block of statements, i.e. there is no implicit<br />
barrier<br />
#pragma omp master<br />
{ statementBlock }<br />
!$OMP MASTER<br />
statementBlock<br />
!$OMP END MASTER<br />
Master / Single II<br />
• Single construct: The first thread reaching the<br />
directive executes the code. All threads execute an<br />
implicit barrier synchronization unless NOWAIT is<br />
specified<br />
#pragma omp single [nowait, private(list),...]<br />
{ statementBlock }<br />
!$OMP SINGLE [PRIVATE(list), FIRSTPRIVATE(list)]<br />
statementBlock<br />
!$OMP END SINGLE [NOWAIT]<br />
single_construct/single_construct.c<br />
#include <stdio.h><br />
#include <omp.h><br />
#define N 4<br />
int main()<br />
{<br />
int i, j;<br />
j = 1;<br />
#pragma omp parallel default(none) shared(j) private(i)<br />
{<br />
#pragma omp for<br />
for (i = 0; i < N; i++) {<br />
printf("i = %d, j = %d (thread %d)\n", i, j,<br />
omp_get_thread_num());<br />
}<br />
#pragma omp single<br />
{<br />
j = 2;<br />
printf("Thread %d changes the shared variable j.\n",<br />
omp_get_thread_num());<br />
}<br />
#pragma omp for<br />
for (i = 0; i < N; i++) {<br />
printf("i = %d, j = %d (thread %d)\n", i, j,<br />
omp_get_thread_num());<br />
}<br />
}<br />
return 0;<br />
}
Barrier<br />
• The barrier directive explicitly synchronizes all the threads<br />
in a team.<br />
• When encountered, each thread in the team waits until all<br />
the others have reached this point<br />
#pragma omp barrier<br />
!$OMP BARRIER<br />
• There are also implicit barriers at the end<br />
• of parallel regions<br />
• of work-sharing constructs (DO/for, SECTIONS, SINGLE);<br />
these can be disabled by specifying NOWAIT<br />
Misc
Section construct<br />
Klaus Klingmüller<br />
The Cyprus Institute<br />
C, C++<br />
#pragma omp sections [clause[[,] clause]. . . ]<br />
{<br />
#pragma omp section<br />
structured block<br />
#pragma omp section<br />
structured block<br />
...<br />
}<br />
Fortran<br />
!$omp sections [clause[[,] clause]. . . ]<br />
!$omp section<br />
structured block<br />
!$omp section<br />
structured block<br />
...<br />
!$omp end sections [nowait]
section_construct/section_construct.c<br />
#include <stdio.h><br />
#include <omp.h><br />
int main()<br />
{<br />
#pragma omp parallel sections<br />
{<br />
#pragma omp section<br />
printf("Section 1: thread %d\n", omp_get_thread_num());<br />
#pragma omp section<br />
printf("Section 2: thread %d\n", omp_get_thread_num());<br />
#pragma omp section<br />
printf("Section 3: thread %d\n", omp_get_thread_num());<br />
#pragma omp section<br />
printf("Section 4: thread %d\n", omp_get_thread_num());<br />
}<br />
return 0;<br />
}
Workshare construct<br />
Fortran<br />
!$omp workshare<br />
structured block<br />
!$omp end workshare [nowait]
workshare_construct/<br />
workshare_construct.f90<br />
PROGRAM workshare_construct<br />
IMPLICIT NONE<br />
INTEGER, PARAMETER :: n = 5<br />
REAL :: a(n), b(n), c(n)<br />
INTEGER :: i<br />
a = (/ (.1 * i, i = 1, n) /)<br />
b = (/ (.5 * i, i = 1, n) /)<br />
!$OMP PARALLEL WORKSHARE<br />
c = a + b<br />
!$OMP END PARALLEL WORKSHARE<br />
WRITE(*, *) a<br />
WRITE(*, *) b<br />
WRITE(*, *) c<br />
END PROGRAM workshare_construct
Orphaned Work-sharing Construct<br />
• A work-sharing construct can occur anywhere outside the<br />
lexical extent of a parallel region: an orphaned construct<br />
• Case 1: called in a parallel context: works as expected<br />
• Case 2: called in a sequential context: the directive is "ignored"<br />
• Global variables are shared<br />
OpenMP 3.x
Loop Parallelism I (OpenMP 3.0)<br />
• OpenMP 3.0 guarantees for the STATIC schedule:<br />
• given the same number of iterations,<br />
• a second loop is divided into the same chunks<br />
• not guaranteed in OpenMP 2.5<br />
Loop Parallelism II (OpenMP 3.0)<br />
• Loop collapsing<br />
!$OMP PARALLEL DO COLLAPSE(2)<br />
do i = 1, n<br />
do j = 1, m<br />
do k = 1, l<br />
foo(i, j, k)<br />
enddo<br />
enddo<br />
enddo<br />
!$OMP END PARALLEL DO<br />
• Requires a rectangular iteration space for the outer two loops<br />
• The outer two loops are collapsed into one larger loop with more<br />
iterations<br />
Task Parallelism (OpenMP 3.0)<br />
Tasks:<br />
• Unit of independent work (block of code, like sections)<br />
• Direct or deferred execution of work by one thread of<br />
team<br />
• Tasks are composed of<br />
• code to execute, and<br />
• the data environment (constructed at creation time)<br />
• internal control variables (ICV)<br />
• can be tied to a thread: only this thread can execute the<br />
task<br />
• can be used for unbounded loops, recursive algorithms,<br />
working on linked lists (pointer), consumer/producer<br />
Task directive<br />
#pragma omp task [clause[[,] clause]. . . ]<br />
!$OMP TASK [clause[[,] clause]. . . ]<br />
• Clauses:<br />
• data scope: default, private, firstprivate, shared<br />
• conditional: if (expr)<br />
‣ create the task if expr is true, otherwise execute directly<br />
• scheduling: untied<br />
• Each encountering thread creates a new task<br />
• Tasks can be nested, into another task or a work-sharing<br />
construct<br />
Task synchronization (OpenMP 3.0)<br />
• A task is executed by a thread of the current team at some point<br />
• Completion of all tasks created by all threads of the current team<br />
at synchronization points:<br />
• Barriers<br />
• implicit: e.g. end of parallel region<br />
• explicit: e.g. OMP BARRIER<br />
• Task barrier:<br />
#pragma omp taskwait<br />
!$OMP TASKWAIT<br />
the encountering task suspends until all its child tasks<br />
(generated before the taskwait) are completed<br />
Task vs Sections<br />
Task<br />
• an OpenMP task can be defined by any thread of the team<br />
• a thread of the team executes the OpenMP task<br />
• OpenMP tasks can be defined at any location<br />
• dynamic setup of OpenMP tasks<br />
Sections<br />
• independent tasks can be performed in parallel<br />
• static blocks of code<br />
• nature and number of the tasks must be known at compile time<br />
Exercises: Advanced Topics II<br />
a) Parallelize fibonacci.c using OpenMP tasks (Fibonacci<br />
sequence: 0, 1, 1, 2, 3, 5, 8, 13, …)<br />
Troubleshooting
Race Conditions and Data Dependencies I<br />
Most important rule: "Parallelization of code must not affect the<br />
correctness of a program!"<br />
• In loops: the results of each single iteration must not<br />
depend on each other<br />
• Race conditions must be avoided<br />
• Result must not depend on the order of threads<br />
• Correctness of the program must not depend on number<br />
of threads<br />
• Correctness must not depend on the work schedule<br />
Race Conditions and Data Dependencies II<br />
• Threads read and write to the same object at the same time<br />
‣ Unpredictable results (sometimes work, sometimes not)<br />
‣ Wrong answers <strong>with</strong>out a warning signal!<br />
• Correctness depends on order of read/write accesses<br />
• Hard to debug because the debugger often runs the<br />
program in a serialized, deterministic ordering.<br />
• To ensure that readers do not get ahead of writers, thread<br />
synchronization is needed.<br />
• Distributed memory systems: messages are often used<br />
to synchronize, <strong>with</strong> readers blocking until the message<br />
arrives.<br />
• Shared memory systems: need barriers, critical regions,<br />
locks, ...<br />
Race Conditions and Data Dependencies III<br />
• Note: be careful <strong>with</strong> synchronization<br />
• Degrades performance<br />
• Deadlocks: threads lock up waiting for a locked resource<br />
that never will become available<br />
Exercises: Data Dependencies I<br />
a)<br />
b)<br />
c)<br />
d)<br />
for (int i=1 ; i
Exercises: Data Dependencies II<br />
e)<br />
f)<br />
for (int i=1 ; i
Intel Inspector XE (formerly Intel Thread Checker)<br />
• Memory error and thread checker tool<br />
• Supported languages on Linux systems:<br />
• C/C++, Fortran<br />
• Maps errors to the source code line and call stack<br />
• Detects problems that are not recognized by the compiler<br />
(e.g. race conditions, data dependencies)<br />
Basics<br />
Shared memory<br />
Fork-Join<br />
SPMD Threads<br />
Compiler directives<br />
<strong>Parallel</strong> construct<br />
Reduction<br />
Private vs. shared<br />
Critical region<br />
Loops<br />
Scheduling<br />
Atomic statements<br />
OpenMP 3.x<br />
Tasks<br />
Race conditions<br />
Data dependencies<br />
Synchronisation<br />
Single<br />
Barrier<br />
Locks<br />
Master<br />
Troubleshooting<br />
Thread checker<br />