
Parallel programming with OpenMP

Klaus Klingmüller
The Cyprus Institute



Acknowledgement

Alexander Schnurpfeil


Distributed memory vs. shared memory

Distributed memory: each CPU has its own RAM; the CPUs exchange data via MPI.

Shared memory: several CPUs share one RAM; this is the domain of OpenMP.

Shared memory nodes

A cluster of shared-memory nodes (several CPUs sharing the RAM within each node) is programmed with MPI + OpenMP: MPI between the nodes, OpenMP within a node.


Why not simply one MPI task per CPU?

Sometimes this actually is the best solution, but not if
• memory consumption is relevant
• communication overhead is a problem
• the nature of the problem limits the number of MPI tasks

Often only little extra work is required for MPI/OpenMP hybrid parallelisation.


Targets
• Multiprocessor systems
• Multi-core CPUs
• Simultaneous multithreading CPUs

Programming languages
C, C++, Fortran


Highlights
• Easy to use
  - the compiler works out the details
  - incremental parallelisation
  - single source code for the sequential and parallel versions
• De facto standard
  - widely adopted
  - strong vendor support


History
• Proprietary designs by some vendors (SGI, CRAY, ORACLE, IBM, ...) at the end of the 1980s
• Various unsuccessful attempts to standardize an API
  • PCF
  • ANSI X3H5
• OpenMP Forum founded to define a portable API
• 1997 first API for Fortran (V1.0)
• 1998 first API for C/C++ (V1.0)
• 2000 Fortran V2.0
• 2002 C/C++ V2.0
• 2005 combined C/C++/Fortran specification V2.5
• 2008 combined C/C++/Fortran specification V3.0
• 2011 combined C/C++/Fortran specification V3.1


Why start using OpenMP now?
• Multi-core CPUs everywhere
• More and more cores per CPU
  (Cy-Tera: 12 cores per node, 24 simultaneous threads)
• OpenACC extends OpenMP's concept to GPU programming (tomorrow's lecture)
• Many integrated cores (MIC): 1 TFLOPS on one PCIe card


Literature/Links
• Chapman, Jost, van der Pas: Using OpenMP
• www.openmp.org (specifications, summary cards)
• www.compunity.org
• www.google.com#q=openmp
• linksceem.eu/ATutor/go.php/7/index.php


euclid.cyi.ac.cy

tar -xf /opt/examples/pragma-training/openmp_tutorial.tar.gz


Demo: Mandelbrot set

Complex numbers c for which the sequence z_0, z_1, z_2, ..., with z_0 = 0 and z_(i+1) = z_i^2 + c, is bounded.


mandelbrot/cmandelbrot-v1.c

int main()
{
    int i, j, result[NREC][NIMC];
    double deltarec, deltaimc, rec, imc;

    deltarec = (RECMAX - RECMIN) / (NREC - 1);
    deltaimc = (IMCMAX - IMCMIN) / (NIMC - 1);
    for (i = 0; i < NREC; i++) {
        rec = RECMIN + deltarec * i;
        for (j = 0; j < NIMC; j++) {
            imc = IMCMIN + deltaimc * j;
            result[i][j] = iterate(rec, imc, NI);
        }
    }
    printpgm(NREC, NIMC, NI, result[0]);
    return 0;
}


mandelbrot/fmandelbrot-v1.f90

PROGRAM mandelbrot
  IMPLICIT NONE
  INTEGER, PARAMETER :: ni = 30000, nrec = 800, nimc = 600
  REAL, PARAMETER :: recmin = -2., recmax = .666, imcmin = -1., imcmax = 1.
  INTEGER :: n, i, j, result(nrec, nimc)
  REAL :: rec, imc, deltarec, deltaimc
  COMPLEX :: c
  deltarec = (recmax - recmin) / (nrec - 1)
  deltaimc = (imcmax - imcmin) / (nimc - 1)
  DO j = 1, nimc
    imc = imcmin + deltaimc * (j - 1)
    DO i = 1, nrec
      rec = recmin + deltarec * (i - 1)
      c = CMPLX(rec, imc)
      CALL iterate(c, ni, n)
      result(i, j) = n
    END DO
  END DO
  CALL printpgm(nrec, nimc, ni, result)
END PROGRAM mandelbrot


C

deltarec = (RECMAX - RECMIN) / (NREC - 1);
deltaimc = (IMCMAX - IMCMIN) / (NIMC - 1);
#pragma omp parallel for default(none) \
        shared(result, deltarec, deltaimc) private(i, j, rec, imc)
for (i = 0; i < NREC; i++) {
    rec = RECMIN + deltarec * i;
    for (j = 0; j < NIMC; j++) {
        imc = IMCMIN + deltaimc * j;
        result[i][j] = iterate(rec, imc, NI);
    }
}

Fortran

deltarec = (recmax - recmin) / (nrec - 1)
deltaimc = (imcmax - imcmin) / (nimc - 1)
!$OMP PARALLEL DO DEFAULT(NONE) &
!$OMP SHARED(result, deltarec, deltaimc) PRIVATE(n, i, j, rec, imc, c)
DO j = 1, nimc
  imc = imcmin + deltaimc * (j - 1)
  DO i = 1, nrec
    rec = recmin + deltarec * (i - 1)
    c = CMPLX(rec, imc)
    CALL iterate(c, ni, n)
    result(i, j) = n
  END DO
END DO
!$OMP END PARALLEL DO
CALL printpgm(nrec, nimc, ni, result)


Compile & run

C:
make
qsub cmandelbrot-v1.pbs
...
cat out.txt

Fortran:
make
qsub fmandelbrot-v1.pbs
...
cat out.txt


C (with dynamic scheduling)

deltarec = (RECMAX - RECMIN) / (NREC - 1);
deltaimc = (IMCMAX - IMCMIN) / (NIMC - 1);
#pragma omp parallel for default(none) schedule(dynamic) \
        shared(result, deltarec, deltaimc) private(i, j, rec, imc)
for (i = 0; i < NREC; i++) {
    rec = RECMIN + deltarec * i;
    for (j = 0; j < NIMC; j++) {
        imc = IMCMIN + deltaimc * j;
        result[i][j] = iterate(rec, imc, NI);
    }
}

Fortran (with dynamic scheduling)

deltarec = (recmax - recmin) / (nrec - 1)
deltaimc = (imcmax - imcmin) / (nimc - 1)
!$OMP PARALLEL DO DEFAULT(NONE) SCHEDULE(DYNAMIC) &
!$OMP SHARED(result, deltarec, deltaimc) PRIVATE(n, i, j, rec, imc, c)
DO j = 1, nimc
  imc = imcmin + deltaimc * (j - 1)
  DO i = 1, nrec
    rec = recmin + deltarec * (i - 1)
    c = CMPLX(rec, imc)
    CALL iterate(c, ni, n)
    result(i, j) = n
  END DO
END DO
!$OMP END PARALLEL DO
CALL printpgm(nrec, nimc, ni, result)


Basics


Basic Parallel Programming Paradigm: SPMD
• SPMD: Single Program Multiple Data
• The programmer writes one program, which is executed on all processors
• Basic paradigm for implementing parallel programs
• Special cases (e.g., different control flow) are handled inside the program


Processes and Threads

Process
• active instance of program code
• unique process id (pid)
• stack for storing local data
• heap for storing dynamic data
• space for storing global data

Thread
• one or more threads belong to a process
• thread id (tid)
• has its own stack


Threads
• Elements of a process (program code, data area, heap, etc.) are shared among its threads
• A single program with several threads can handle several tasks concurrently
• Shared access to common resources:
  • less information has to be stored
  • switching between threads is more efficient than switching between processes
• OpenMP is based on this thread concept


Single-threaded vs. multi-threaded


Fork-join model

Initial thread → fork → team of threads → join → initial thread


Parallel construct

C, C++
#pragma omp parallel [clause[[,] clause]. . . ]
    structured block

Fortran
!$omp parallel [clause[[,] clause]. . . ]
    structured block
!$omp end parallel


parallel_construct/parallel_construct.c

#include <stdio.h>
#include <omp.h>

int main()
{
    printf("This is thread %d. ", omp_get_thread_num());
    printf("Number of threads: %d.\n", omp_get_num_threads());
    printf("+++\n");
    #pragma omp parallel
    {
        printf("This is thread %d. ", omp_get_thread_num());
        printf("Number of threads: %d.\n", omp_get_num_threads());
    }
    printf("+++\n");
    printf("This is thread %d. ", omp_get_thread_num());
    printf("Number of threads: %d.\n", omp_get_num_threads());
    return 0;
}


parallel_construct/parallel_construct.c (same listing as above)

Fortran: use omp_lib instead of #include <omp.h>


Syntax of Directives (C/C++)
• Directives are special compiler pragmas
• Directive and API function names are case-sensitive and written in lowercase
• There are no END directives as in Fortran; a directive applies to the following structured block (a statement with one entry and one exit)

#pragma omp DirectiveName [ParameterList]
    structured-block

• Directive continuation lines:
#pragma omp DirectiveName here-is-something \
    and-here-is-some-more


Syntax of Directives (Fortran)
• Directives are specially formatted comments
• Directives are ignored by non-OpenMP compilers
• Case-insensitive

!$OMP DirectiveName [ParameterList]
    block
!$OMP END DirectiveName

• Directive continuation lines:
!$OMP DirectiveName here-is-something &
!$OMP and-here-is-some-more


Directives
• In general, all directives have the same structure in C/C++ and Fortran:
  #pragma omp DirectiveName [ParameterList]
  !$OMP DirectiveName [ParameterList]
• There are directives for
  • parallelism
  • work sharing
  • synchronization
• Many parameters apply to several directives equally (e.g., PRIVATE(var-list))


Environment Variables
• OMP_NUM_THREADS: set the default number of threads for parallel regions
• OMP_SCHEDULE: set the scheduling type for parallel loops with scheduling strategy RUNTIME


Run-time Library Functions

#include <omp.h>   // header

omp_set_num_threads(nthreads)
• Sets the number of threads to use for subsequent parallel regions (overrides the value of OMP_NUM_THREADS)

nthreads = omp_get_num_threads()
• Returns the number of threads in the current team

iam = omp_get_thread_num()
• Returns the thread number (iam = 0 .. nthreads-1)

t = omp_get_wtime()
• Returns elapsed wall-clock time in seconds
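A minimal usage sketch of these functions (illustrative, not from the original slides; the thread count 4 is arbitrary):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double t0 = omp_get_wtime();             /* wall-clock start time */
    omp_set_num_threads(4);                  /* request 4 threads for following regions */
    #pragma omp parallel
    {
        int iam = omp_get_thread_num();           /* 0 .. nthreads-1 */
        int nthreads = omp_get_num_threads();     /* size of the current team */
        printf("thread %d of %d\n", iam, nthreads);
    }
    printf("elapsed: %g s\n", omp_get_wtime() - t0);
    return 0;
}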


Data Environment I
• Data objects (variables) can be shared or private
• By default almost all variables are shared
  • accessible to all threads
  • single instance in shared memory
  • caveat: a pointer itself is shared, but not the data it points to
• Variables can be declared private
  • each thread then allocates its own private copy of the data
  • this copy only exists during the execution of the parallel region
  • its value is undefined upon entry to the parallel region


Data Environment II
• Exceptions to the shared default:
  • the loop index variable in a parallel loop
  • local variables of subroutines called inside parallel regions


Running OpenMP I


Directive Parameters for Data Environment I
• PRIVATE(varList): the variables in varList are declared private
• SHARED(varList): the variables in varList are declared shared
• DEFAULT(shared | private | none): set the default data mode (private only allowed in Fortran)
• FIRSTPRIVATE(varList): the variables are declared private, but each copy is initialized with the value the original shared variable had just before entering the parallel region


Directive Parameters for Data Environment II
• LASTPRIVATE(varList): the variables are declared private, and the value from the sequentially last iteration (or last section) is written to the corresponding shared variable afterwards
• Variables can appear in both FIRSTPRIVATE and LASTPRIVATE
• varList: comma-separated list of variables
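A small illustrative sketch of firstprivate and lastprivate (the values 10 and 8 are arbitrary, not from the slides):

#include <stdio.h>

int main(void)
{
    int i, x = 10, last = -1;
    /* x starts in every thread with the value 10 (firstprivate);
       last receives the value from the sequentially last iteration (lastprivate) */
    #pragma omp parallel for firstprivate(x) lastprivate(last)
    for (i = 0; i < 8; i++) {
        last = x + i;              /* private inside the loop */
    }
    printf("last = %d\n", last);   /* 10 + 7 = 17 */
    return 0;
}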


Running OpenMP II


Parallel Region II
• Nesting of parallel regions is allowed, but not all compilers implement this feature yet
• The master thread is always thread 0; it executes the sequential code of the program
• Typically, further OpenMP constructs are used inside parallel regions to coordinate the work of the different threads, e.g. work-sharing constructs such as a parallel loop


Exercises: Introduction I
• Go to directory /ex_introduction
a) Write a program where the master thread forks a parallel region. Each thread should print its thread number. Furthermore, only the master thread should print the number of threads used.
b) Let the program run with two and four threads, respectively. Set the number of threads by using
  1) the corresponding environment variable
  2) the corresponding OpenMP library function
For Fortran programmers: include "use omp_lib" in your code when you use the Intel compiler.


Loops


Loop construct

C, C++
#pragma omp for [clause[[,] clause]. . . ]
    for-loop

Fortran
!$omp do [clause[[,] clause]. . . ]
    do-loop
!$omp end do [nowait]


loop_construct/loop_construct-v1.c

#include <stdio.h>
#include <omp.h>

#define N 10

int main()
{
    int i;
    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            printf("i = %d (thread %d)\n", i, omp_get_thread_num());
        }
    }
    return 0;
}


Do-loop Execution I

Program:

a = 1
!$OMP PARALLEL
a = 2
!$OMP DO
do i = 1, 9
   call work(i)
enddo
!$OMP END DO
!$OMP END PARALLEL
a = 3

(Parallel) execution with three threads: each thread executes a = 2, then works on its share of the iterations (thread 0: i = 1..3, thread 1: i = 4..6, thread 2: i = 7..9); after the join, a = 3 is executed by the initial thread.


Parallel Loop
• Iterations of a parallel loop are executed in parallel by all threads of the current team
• The calculations inside an iteration must not depend on other iterations (responsibility of the programmer)
• A schedule determines how iterations are divided among the threads
  • specified by the SCHEDULE parameter
  • default: undefined!
• The form of the loop has to allow computing the number of iterations prior to entry into the loop, e.g. no WHILE loops
• Implicit barrier synchronization at the end of the loop unless the NOWAIT parameter is specified


Parallel Loop for C/C++
• Same restrictions as with the Fortran DO loop
• No END pragma; the NOWAIT parameter is specified at the entry of the loop
• The for loop can only have a very restricted form:

#pragma omp for
for (var = first; var cmp-op end; incr-expr)
    loop-body

• Comparison operators (cmp-op): <, <=, >, >=
• Increment expressions (incr-expr):
  ++var, var++, --var, var--,
  var += incr, var -= incr,
  var = var + incr, var = incr + var
• first, end, incr: loop-invariant expressions


Canonical Loop Structure
Restrictions on loops to simplify compiler-based parallelization:
• The number of iterations can be computed at loop entry
• The program must complete all iterations of the loop
  • no break or goto leaving the loop (C/C++)
  • no EXIT or GOTO leaving the loop (Fortran)
• Exiting the current iteration and starting the next one is possible
  • continue allowed (C/C++)
  • CYCLE allowed (Fortran)
• Termination of the entire program inside the loop is possible
  • exit allowed (C/C++)
  • STOP allowed (Fortran)


Combined Constructs (Shortcut Version)
• If a parallel region only contains a parallel loop or parallel sections, a shortcut notation can be used


loop_construct/loop_construct-v1.c (same listing as above)


loop_construct/loop_construct-v2.c

#include <stdio.h>
#include <omp.h>

#define N 10

int main()
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        printf("i = %d (thread %d)\n", i, omp_get_thread_num());
    }
    return 0;
}


Example: Work Sharing with Parallel Loop I


Example: Work Sharing with Parallel Loop II

Parallel region and parallel for loop:

#pragma omp parallel
{
    #pragma omp for schedule(static)
    for (i = 0; i < 100; i++) {
        a[i] = b[i] + c[i];   /* loop body as in the Fortran example on the next slide */
    }
}


Example: Work Sharing with Parallel Loop III

Parallel region and parallel do loop:

!$OMP PARALLEL
!$OMP DO SCHEDULE(STATIC)
do i = 1, 100
   a(i) = b(i) + c(i)
enddo
!$OMP END DO
!$OMP END PARALLEL

Combined construct:

!$OMP PARALLEL DO SCHEDULE(STATIC)
do i = 1, 100
   a(i) = b(i) + c(i)
enddo
!$OMP END PARALLEL DO


Static and Dynamic Scheduling
• Static scheduling
  • distribution is done at loop-entry time, based on
    • the number of threads
    • the total number of iterations
  • less flexible
  • almost no scheduling overhead
• Dynamic scheduling
  • distribution is done during execution of the loop
  • each thread is assigned a subset of the iterations at loop entry
  • after completion, each thread asks for more iterations
  • more flexible
  • can easily adjust to load imbalances
  • more scheduling overhead (synchronization)


Scheduling Strategies I
• Distribution of iterations occurs in chunks
• Chunks may have different sizes
• Chunks are assigned either statically or dynamically
• There are different assignment algorithms (types)
• SCHEDULE parameter:
  schedule(type [, chunk])
• Schedule types: static, dynamic, guided, runtime


Scheduling Strategies II
• Schedule type STATIC without chunk size
  • one chunk of iterations per thread, all chunks of (nearly) equal size
• Schedule type STATIC with chunk size
  • chunks of the specified size are assigned in round-robin fashion
• Schedule type DYNAMIC
  • threads request new chunks dynamically
  • default chunk size is 1
• Schedule type GUIDED
  • the first chunk has an implementation-dependent size
  • the size of each successive chunk decreases exponentially
  • chunks are assigned dynamically
  • the chunk size specifies the minimum size, default is 1
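A short illustrative sketch of the schedule clause (loop length and chunk size chosen for illustration only):

#include <stdio.h>
#include <omp.h>

#define N 16

int main(void)
{
    int i;
    /* one contiguous block of iterations per thread */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++)
        printf("static:    i=%2d thread %d\n", i, omp_get_thread_num());

    /* chunks of 2 iterations, handed out on demand */
    #pragma omp parallel for schedule(dynamic, 2)
    for (i = 0; i < N; i++)
        printf("dynamic,2: i=%2d thread %d\n", i, omp_get_thread_num());

    /* exponentially shrinking chunks, minimum size 1 */
    #pragma omp parallel for schedule(guided)
    for (i = 0; i < N; i++)
        printf("guided:    i=%2d thread %d\n", i, omp_get_thread_num());
    return 0;
}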


Scheduling Strategies III
• Schedule type RUNTIME
  • the scheduling strategy is determined by an environment variable:
    export OMP_SCHEDULE="type [, chunk]"
  • if the variable is not set, the scheduling is implementation dependent
• No SCHEDULE parameter
  • scheduling is implementation dependent
• The correctness of the program must not depend on the scheduling strategy
• Setting via runtime library routines is possible (OpenMP 3.0):
  omp_set_schedule / omp_get_schedule


Scheduling - Summary

type      chunk      size of chunks               number of chunks
static    no         n/p                          p
static    yes        c                            n/c (overlapping)
dynamic   optional   c                            n/c
guided    optional   first n/p, then decreasing   less than n/c
runtime   no         depends                      depends

n: number of iterations, p: number of threads, c: chunk size


Reason for Different Work Schedules
The work load should be balanced, i.e. each thread needs the same period of time to handle its task.
• The load might randomly differ from iteration to iteration → dynamic scheduling
• The execution speed might increase/decrease with increasing iteration index → guided scheduling: the chunk size is proportional to the number of iterations left, divided by the number of threads in the team


Exercise: OpenMP Work Distribution

do i = 1, 10
   call work(i)
end do

Which schedule clause results in which work distribution?

a) 1 2 3 4 5 6 7 8 9 10
b) 1 2 3 4 5 6 7 8 9 10
c) 1 2 3 4 5 6 7 8 9 10
d) 1 2 3 4 5 6 7 8 9 10

(in the original figure the iterations are coloured according to which of threads 0-3 executes them)


OpenMP Work Distribution

do i = 1, 10
   call work(i)
end do

• Sequential: 1 2 3 4 5 6 7 8 9 10 (one thread)
• Cyclic:     !$omp do schedule(static,1)  (iterations handed out one by one, round-robin)
• Block:      !$omp do schedule(static)    (one contiguous block of iterations per thread)
• OpenMP also provides: schedule(dynamic [, chunk]), schedule(guided [, chunk])


[Figure: distribution of the Mandelbrot computation across threads over the complex plane (axis: Re c)]


Exercise

Test different scheduling options in the Mandelbrot example and compare the performance.


Synchronization: NOWAIT
• Use the NOWAIT parameter for parallel loops which do not need to be synchronized upon exit of the loop
• keeps synchronization overhead low

#pragma omp parallel
{
    #pragma omp for nowait
    for (i = 0; i < 100; i++)
        a[i] = b[i] + c[i];     /* C analogue of the Fortran example below */
    #pragma omp for nowait
    for (i = 0; i < 100; i++)
        z[i] = sqrt(b[i]);
}


The same in Fortran:

!$OMP PARALLEL
!$OMP DO
do i = 1, 100
   a(i) = b(i) + c(i)
enddo
!$OMP END DO NOWAIT
!$OMP DO
do i = 1, 100
   z(i) = sqrt(b(i))
enddo
!$OMP END DO NOWAIT
!$OMP END PARALLEL


Loop Nests
• A parallel loop construct only applies to the loop directly following it
• Nesting of work-sharing constructs is illegal in OpenMP!
• Parallelization of nested loops (see the sketch after this list):
  • normally try to parallelize the outer loop (less overhead)
  • sometimes it is necessary to parallelize inner loops (e.g., small n) → rearrange loops?
  • manually re-write into one loop
  • nested parallel regions are not yet supported by all compilers
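An illustrative sketch of parallelizing the outer loop of a nest (array sizes and the scale operation are arbitrary):

#include <omp.h>

#define NI 400
#define NJ 300

void scale(double a[NI][NJ], double factor)
{
    int i, j;
    /* the work-sharing construct applies only to the outer i loop;
       each thread handles complete rows, the inner j loop stays sequential */
    #pragma omp parallel for private(j)
    for (i = 0; i < NI; i++)
        for (j = 0; j < NJ; j++)
            a[i][j] *= factor;
}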


If Clause: Conditional Parallelization I
• Avoid the parallel overhead when the number of loop iterations is low

Explicit versioning:

if (size > 10000) {
    #pragma omp parallel for
    for (i = 0; i < size; i++)
        ...                      /* loop body */
} else {
    for (i = 0; i < size; i++)
        ...                      /* same loop body, executed sequentially */
}


If Clause: Conditional Parallelization II

Using the if clause:

#pragma omp parallel for if (size > 10000)
for (i = 0; i < size; i++)
    ...                          /* loop body; runs in parallel only if size > 10000 */


Exercises: Introduction II
a) Write a program that implements DAXPY (Double precision real Alpha X Plus Y):

   y = alpha * x + y

   (vector sizes and values can be chosen as you like)
   Measure the time it takes depending on the number of threads.
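One possible starting point (a sketch of a solution; vector size and values are arbitrary, as the exercise allows):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000000

int main(void)
{
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);
    double alpha = 2.0, t;
    long i;

    for (i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }   /* arbitrary values */

    t = omp_get_wtime();
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        y[i] = alpha * x[i] + y[i];                       /* DAXPY */
    printf("time: %g s, y[0] = %g\n", omp_get_wtime() - t, y[0]);

    free(x);
    free(y);
    return 0;
}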


Exercise

Matrix times vector: y_i = sum over j of a_ij * x_j

Solution in using_openmp_examples/
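A minimal sketch of the matrix-vector product with a parallel loop (illustrative only, not the provided solution in using_openmp_examples/):

#include <stdio.h>

#define NROWS 4
#define NCOLS 3

int main(void)
{
    double a[NROWS][NCOLS], x[NCOLS], y[NROWS];
    int i, j;

    for (i = 0; i < NROWS; i++)
        for (j = 0; j < NCOLS; j++)
            a[i][j] = i + j;               /* arbitrary test data */
    for (j = 0; j < NCOLS; j++)
        x[j] = 1.0;

    /* each thread computes complete rows of y, so there is no data race on y[i] */
    #pragma omp parallel for private(j)
    for (i = 0; i < NROWS; i++) {
        y[i] = 0.0;
        for (j = 0; j < NCOLS; j++)
            y[i] += a[i][j] * x[j];
    }

    for (i = 0; i < NROWS; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}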


Reduction


Reduction Parameter
• For solving cases like this, OpenMP provides the REDUCTION parameter


Reductions
• Reductions often occur within parallel regions or loops
• Reduction variables have to be shared in the enclosing parallel context
• Thread-local results get combined with the outside variable using the reduction operation
• Note: the order of operations is unspecified and can produce different results than the sequential version
• Applications: computing the sum of an array, finding the largest element


Reductions, Example
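A minimal sum-reduction sketch in C (illustrative, not the slide's original example; array contents are arbitrary):

#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], sum = 0.0;
    int i;

    for (i = 0; i < N; i++)
        a[i] = 0.5 * i;

    /* each thread accumulates a private copy of sum (initialised to 0);
       the copies are added to the shared sum at the end of the loop */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %g\n", sum);
    return 0;
}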


Reduction Operators (C/C++)
The table shows the operators allowed for reductions and the initial values used for initialization.

Operator   Data types                Initial value
+          floating point, integer   0
*          floating point, integer   1
-          floating point, integer   0
&          integer                   all bits on
|          integer                   0
^          integer                   0
&&         integer                   1
||         integer                   0


Reduction Operators I (Fortran)
The table shows the operators and intrinsics allowed for reductions and the initial values used for initialization.

Operator   Data types                                   Initial value
+          floating point, integer (complex or real)    0
*          floating point, integer (complex or real)    1
-          floating point, integer (complex or real)    0
.AND.      logical                                      .TRUE.
.OR.       logical                                      .FALSE.
.EQV.      logical                                      .TRUE.
.NEQV.     logical                                      .FALSE.


Reduction Operators II (Fortran)

Operator   Data types                            Initial value
MAX        floating point, integer (real only)   smallest possible value
MIN        floating point, integer (real only)   largest possible value
IAND       integer                               all bits on
IOR        integer                               0
IEOR       integer                               0


Exercises
• x·y (dot product)
• n! (factorial)
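A possible sketch for the factorial exercise using a multiplicative reduction (one of several valid approaches; n = 15 is arbitrary and chosen so the result fits in a long long):

#include <stdio.h>

int main(void)
{
    int n = 15, i;
    long long fact = 1;

    /* each thread multiplies its share of the factors into a private copy
       (initialised to 1); the copies are combined with * at the end */
    #pragma omp parallel for reduction(*:fact)
    for (i = 2; i <= n; i++)
        fact *= i;

    printf("%d! = %lld\n", n, fact);
    return 0;
}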


reduction_clause/reduction_clause.c

#include <stdio.h>

int main()
{
    int i;
    double a[3], b[3], dotp;

    a[0] = .2;
    a[1] = 2.2;
    a[2] = -1.;
    b[0] = 1.;
    b[1] = -.5;
    b[2] = 1.5;
    dotp = 0;
    #pragma omp parallel for reduction(+:dotp) \
            default(none) shared(a, b) private(i)
    for (i = 0; i < 3; i++) {
        dotp += a[i] * b[i];
    }
    printf("%g\n", dotp);
    return 0;
}


Exercises: Advanced Topics I
• Go to directory /ex_advanced_topics
a) Pi can be calculated in the following way:

   pi ≈ sum over i = 1..n of 4.0 / (1.0 + x_i^2) * dx,  with dx = 1/n and x_i = (i - 0.5) * dx

   Write a program that does the calculation and parallelize it using OpenMP (10^8 iterations). How much time does it take?
   Play around with the "schedule" clause. Which setting gives the best performance?
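One possible sketch of the calculation (illustrative; 10^8 iterations as in the exercise, scheduling left at the default):

#include <stdio.h>
#include <omp.h>

#define N 100000000   /* 10^8 iterations */

int main(void)
{
    double dx = 1.0 / N, x, sum = 0.0, t = omp_get_wtime();
    long i;

    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 1; i <= N; i++) {
        x = (i - 0.5) * dx;            /* midpoint of the i-th interval */
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi ~= %.10f  (%g s)\n", sum * dx, omp_get_wtime() - t);
    return 0;
}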


Exercise

Monte Carlo computation of pi

Random number generators in random/ (don't use them for science), solution in pi/


pi/cpi.c

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_thread_num() 0
#endif

#define N 1000000000

#include "../random/rand_r.c"

int main()
{
    long i, counter;
    unsigned seed;
    double x, y, pi;

    counter = 0;
    #pragma omp parallel default(none) private(i, x, y, seed) \
            reduction(+: counter)
    {
        seed = omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < N; i++) {
            x = (double) rand_r(&seed) / RAND_MAX;
            y = (double) rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.)   /* point inside the quarter circle? */
                counter++;
        }
    }
    pi = 4. * counter / N;
    printf("%g\n", pi);
    return 0;
}




Conditional Compilation
• Preprocessor macro _OPENMP for C/C++ and Fortran:
  #ifdef _OPENMP
      iam = omp_get_thread_num();
  #endif
• Special comment for the Fortran preprocessor:
  !$ iam = OMP_GET_THREAD_NUM()
• Helpful for checking the serial and the parallel version of the code




Synchronisation




Critical Region


Critical Region II
• A critical region restricts execution of the associated block of statements to a single thread at a time
• An optional name may be used to identify the critical region
• A thread waits at the beginning of a critical region until no other thread is executing a critical region (anywhere in the program) with the same name
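A minimal sketch of a critical region protecting an update of a shared variable (illustrative; the name count_update is arbitrary):

#include <stdio.h>

#define N 1000000

int main(void)
{
    long i, counter = 0;

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        /* only one thread at a time may execute this block,
           so the update of the shared counter is safe */
        #pragma omp critical (count_update)
        counter++;
    }

    printf("counter = %ld\n", counter);
    return 0;
}

Because every iteration serializes on the critical region, a loop like this shows no speedup, which is exactly the point made on the example slides that follow.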


Example: Critical Region
• Note: the loop body only consists of a critical region
• The program gets extremely slow → no speedup


Critical Region III

Example: Critical Region II


Atomic Statements
• The ATOMIC directive ensures that a specific memory location is updated atomically
• An OpenMP implementation could replace it with a CRITICAL construct; the ATOMIC construct, however, permits better optimization (based on hardware instructions)
• The statement following the ATOMIC directive must have one of the following forms:


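For reference, the forms permitted by the OpenMP specification are, for C/C++:

x++;   x--;   ++x;   --x;
x binop= expr;        /* e.g. x += expr, x *= expr; binop is one of + * - / & ^ | << >> */

and for Fortran: x = x operator expr, x = expr operator x, or x = intrinsic(x, expr) with intrinsics such as MAX, MIN, IAND, IOR, IEOR.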


Example: Atomic Statement

s = 0;
#pragma omp parallel private(i, s_local)
{
    s_local = 0.0;
    #pragma omp for
    for (i = 0; i < n; i++)       /* continuation reconstructed: */
        s_local += a[i];          /* private partial sum */
    #pragma omp atomic
    s += s_local;                 /* combine the partial sums atomically */
}




Locks (C)
• A more flexible way of implementing "critical regions"
• Lock variable type: omp_lock_t, passed by address
• Initialize lock: omp_init_lock(&lockvar)
• Remove (deallocate) lock: omp_destroy_lock(&lockvar)
• Block the calling thread until the lock is available: omp_set_lock(&lockvar)
• Release the lock: omp_unset_lock(&lockvar)
• Test and try to set the lock (returns 1 on success, else 0): intvar = omp_test_lock(&lockvar)


Locks (Fortran)
• A more flexible way of implementing "critical regions"
• Lock variables have to be of type integer
• Initialize lock: CALL OMP_INIT_LOCK(lockvar)
• Remove (deallocate) lock: CALL OMP_DESTROY_LOCK(lockvar)
• Block the calling thread until the lock is available: CALL OMP_SET_LOCK(lockvar)
• Release the lock: CALL OMP_UNSET_LOCK(lockvar)
• Test and try to set the lock (returns .TRUE. on success): logicalvar = OMP_TEST_LOCK(lockvar)
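A minimal illustrative sketch of protecting a shared update with a lock (not the original example slides; the counting loop is arbitrary):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_lock_t lock;
    int counter = 0, i;

    omp_init_lock(&lock);                 /* create the lock */
    #pragma omp parallel for
    for (i = 0; i < 1000; i++) {
        omp_set_lock(&lock);              /* wait until the lock is free */
        counter++;                        /* protected update */
        omp_unset_lock(&lock);            /* release it again */
    }
    omp_destroy_lock(&lock);              /* deallocate the lock */

    printf("counter = %d\n", counter);
    return 0;
}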




Master / Single I
• Sometimes it is useful that within a parallel region just one thread executes code, e.g. to read/write data. OpenMP provides two ways to accomplish this:
• MASTER construct: the master thread (thread 0) executes the enclosed code, all other threads ignore this block of statements, i.e. there is no implicit barrier

#pragma omp master
{ statementBlock }

!$OMP MASTER
statementBlock
!$OMP END MASTER


Master / Single II
• SINGLE construct: the first thread reaching the directive executes the code. All threads execute an implicit barrier synchronization unless NOWAIT is specified

#pragma omp single [nowait, private(list), ...]
{ statementBlock }

!$OMP SINGLE [PRIVATE(list), FIRSTPRIVATE(list)]
statementBlock
!$OMP END SINGLE [NOWAIT]


single_construct/single_construct.c

#include <stdio.h>
#include <omp.h>

#define N 4

int main()
{
    int i, j;

    j = 1;
    #pragma omp parallel default(none) shared(j) private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            printf("i = %d, j = %d (thread %d)\n", i, j,
                   omp_get_thread_num());
        }
        #pragma omp single
        {
            j = 2;
            printf("Thread %d changes the shared variable j.\n",
                   omp_get_thread_num());
        }
        #pragma omp for
        for (i = 0; i < N; i++) {
            printf("i = %d, j = %d (thread %d)\n", i, j,
                   omp_get_thread_num());
        }
    }
    return 0;
}


Barrier
• The barrier directive explicitly synchronizes all the threads in a team
• When encountered, each thread in the team waits until all the others have reached this point

#pragma omp barrier
!$OMP BARRIER

• There are also implicit barriers at the end
  • of parallel regions
  • of work-sharing constructs (DO/for, SECTIONS, SINGLE); these can be disabled by specifying NOWAIT
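A tiny illustrative sketch of an explicit barrier:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        printf("thread %d: phase 1\n", omp_get_thread_num());
        /* no thread starts phase 2 before all threads have finished phase 1 */
        #pragma omp barrier
        printf("thread %d: phase 2\n", omp_get_thread_num());
    }
    return 0;
}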


Misc


Section construct

C, C++
#pragma omp sections [clause[[,] clause]. . . ]
{
    #pragma omp section
        structured block
    #pragma omp section
        structured block
    ...
}

Fortran
!$omp sections [clause[[,] clause]. . . ]
!$omp section
    structured block
!$omp section
    structured block
...
!$omp end sections [nowait]


section_construct/section_construct.c

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("Section 1: thread %d\n", omp_get_thread_num());
        #pragma omp section
        printf("Section 2: thread %d\n", omp_get_thread_num());
        #pragma omp section
        printf("Section 3: thread %d\n", omp_get_thread_num());
        #pragma omp section
        printf("Section 4: thread %d\n", omp_get_thread_num());
    }
    return 0;
}


Workshare construct

Fortran
!$omp workshare
    structured block
!$omp end workshare [nowait]


workshare_construct/workshare_construct.f90

PROGRAM workshare_construct
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 5
  REAL :: a(n), b(n), c(n)
  INTEGER :: i
  a = (/ (.1 * i, i = 1, n) /)
  b = (/ (.5 * i, i = 1, n) /)
  !$OMP PARALLEL WORKSHARE
  c = a + b
  !$OMP END PARALLEL WORKSHARE
  WRITE(*, *) a
  WRITE(*, *) b
  WRITE(*, *) c
END PROGRAM workshare_construct


Orphaned Work-sharing Construct
• A work-sharing construct can occur anywhere outside the lexical extent of a parallel region → orphaned construct
• Case 1: called in a parallel context → works as expected
• Case 2: called in a sequential context → the directive is effectively "ignored"
• Global variables are shared
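A minimal illustrative sketch of an orphaned work-sharing construct (names and loop length chosen for illustration):

#include <stdio.h>
#include <omp.h>

#define N 8

/* the omp for here is "orphaned": it is not lexically inside a parallel region */
void work(void)
{
    int i;
    #pragma omp for
    for (i = 0; i < N; i++)
        printf("i = %d (thread %d)\n", i, omp_get_thread_num());
}

int main(void)
{
    work();                /* sequential context: directive effectively ignored */
    #pragma omp parallel
    work();                /* parallel context: iterations shared among the team */
    return 0;
}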


OpenMP 3.x


Loop Parallelism I (OpenMP 3.0)
• OpenMP 3.0 guarantees for the STATIC schedule that a second loop with the same number of iterations is divided into the same chunks
• This is not guaranteed in OpenMP 2.5


Loop Parallelism II (OpenMP 3.0)
• Loop collapsing

!$OMP PARALLEL DO COLLAPSE(2)
do i = 1, n
   do j = 1, m
      do k = 1, l
         foo(i, j, k)
      enddo
   enddo
enddo
!$OMP END PARALLEL DO

• Requires a rectangular iteration space for the outer two loops
• The outer loops are collapsed into one larger loop with more iterations


Task Parallelism (OpenMP 3.0)
Tasks:
• Unit of independent work (a block of code, like sections)
• Direct or deferred execution of the work by one thread of the team
• Tasks are composed of
  • the code to execute,
  • the data environment (constructed at creation time), and
  • internal control variables (ICVs)
• A task can be tied to a thread: only this thread can execute the task
• Tasks can be used for unbounded loops, recursive algorithms, working on linked lists (pointers), consumer/producer patterns


Task directive
• ParameterList:
  • data scope: default, private, firstprivate, shared
  • conditional: if (expr) → create a task if expr is true, otherwise execute directly
  • scheduling: untied
• Each encountering thread creates a new task
• Tasks can be nested, into another task or into a work-sharing construct


Task synchronization (OpenMP 3.0)
• A task is executed by a thread of the current team at some point
• Completion of all tasks created by all threads of the current team at synchronization points:
  • barriers
    • implicit: e.g. end of a parallel region
    • explicit: e.g. OMP BARRIER
• Task barrier:
  #pragma omp taskwait
  !$OMP TASKWAIT
  The encountering task suspends until all its child tasks (generated before the taskwait) are completed
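A small illustrative sketch of task and taskwait on a toy recursion (a range sum, not the fibonacci.c from the exercise; cutoff and range are arbitrary):

#include <stdio.h>

/* sums the integers lo..hi by splitting the range into two child tasks */
long long range_sum(long long lo, long long hi)
{
    long long left, right;
    if (hi - lo < 1000) {                  /* small ranges: compute directly */
        long long s = 0, i;
        for (i = lo; i <= hi; i++) s += i;
        return s;
    }
    #pragma omp task shared(left)
    left = range_sum(lo, (lo + hi) / 2);
    #pragma omp task shared(right)
    right = range_sum((lo + hi) / 2 + 1, hi);
    #pragma omp taskwait                   /* wait for both child tasks */
    return left + right;
}

int main(void)
{
    long long s;
    #pragma omp parallel
    #pragma omp single                     /* one thread creates the top-level tasks */
    s = range_sum(1, 1000000);
    printf("sum = %lld\n", s);
    return 0;
}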


Task vs. Sections

Task
• an OpenMP task can be defined by any thread of the team
• a thread of the team executes the OpenMP task
• OpenMP tasks can be defined at any location
• dynamic setup of OpenMP tasks

Sections
• independent tasks can be performed in parallel
• static blocks of code
• the nature and number of the tasks must be known at compile time


Exercises: Advanced Topics II
a) Parallelize fibonacci.c using OpenMP tasks (Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, ...)


Troubleshooting


Race Conditions and Data Dependencies I
Most important rule: "Parallelization of code must not affect the correctness of a program!"
• In loops: the result of each single iteration must not depend on the other iterations
• Race conditions must be avoided
• The result must not depend on the order of the threads
• The correctness of the program must not depend on the number of threads
• The correctness must not depend on the work schedule


Race Conditions and Data Dependencies II
• Threads read and write the same object at the same time
  ‣ unpredictable results (it sometimes works, sometimes not)
  ‣ wrong answers without a warning signal!
• Correctness depends on the order of the read/write accesses
• Hard to debug, because a debugger often runs the program in a serialized, deterministic ordering
• To ensure that readers do not get ahead of writers, thread synchronization is needed:
  • distributed memory systems: messages are often used to synchronize, with readers blocking until the message arrives
  • shared memory systems: barriers, critical regions, locks, ...


Race Conditions and Data Dependencies III
• Note: be careful with synchronization
  • it degrades performance
  • deadlocks: threads lock up waiting for a locked resource that will never become available


Exercises: Data Dependencies I

a)
b)
c)
d)

for (int i=1 ; i


Exercises: Data Dependencies II

e)
f)

for (int i=1 ; i


Intel Inspector XE (formerly Intel Thread Checker)
• Memory error and thread checking tool
• Supported languages on Linux systems: C/C++, Fortran
• Maps errors to the source code line and call stack
• Detects problems that are not recognized by the compiler (e.g. race conditions, data dependencies)


Summary of topics

Basics · shared memory · fork-join · SPMD · threads · compiler directives · parallel construct · private vs. shared · reduction · critical region · loops · scheduling · atomic statements · OpenMP 3.x · tasks · race conditions · data dependencies · synchronisation · single · barrier · locks · master · troubleshooting · thread checker
