PRACE Autumn School 2010
Basic Programming Models


Introduction
• Basic programming environment

  #include <stdio.h>

  int main (int argc, char * argv[])
  {
      printf ("Hello world\n");
      return 0;
  }

  Compiler:
  $ gcc -o hello hello.c          (produces the binary file hello)

  Execution environment (OS & libraries) and hardware:
  $ ./hello
  hello world
  $


Architecture
• Single processor model
  – Core
  – Cache hierarchy
  – Memory


A problem
• The number of transistors in a chip keeps increasing
  – but power consumption and heat dissipation have reached a limit
  – Example: Intel Core i7 Processor Extreme Edition – 32 nm, 6 cores / 12 threads, 3.33 GHz clock speed (cannot increase!!!), 12 MB cache
• Solution: keep the same clock rate and use the extra area to co-locate several processors (cores) on the same die
• Use parallelism to increase overall throughput


Parallelism
• "Serial" architectures already had a lot of parallelism
  – ILP – Instruction-Level Parallelism
    • Pipelined and superscalar processors
  – SIMD – Single-Instruction Multiple-Data
    • SSE4, AVX, AltiVec...
• The focus is now TLP – Thread-Level Parallelism
  – Also exploited in SMP nodes with single-core processors
  – Requires rewriting applications to take advantage of multiple hardware threads/cores


A thread...
• ...is simply defined as an independent execution context
  – Threads appear at all levels: application and libraries (user level), OS, hardware
  – Before multicore processors, each processor had only one hardware thread


Simultaneous multithreading
• Single processor model with SMT
  – Seen as two hardware threads from the OS point of view
Q: How do we spawn work onto the second SMT thread?


Multicore and SMT
• Multi-core model (with SMT)
  – Each core is seen as two hardware threads from the OS point of view
• No difference with respect to SMT from the OS perspective (see /proc/cpuinfo)
Q: What are the differences between SMT and multi-core?


Multicore SMP machines
• Multi-chip model and memory interconnect
• Coherence protocol
  – Hardware ensures data coherency and consistency
  – At cache-line granularity
    • Invalidations, false sharing...
Q: Interfaces?


Accelerators
• Heterogeneity
  – GPUs, FPGAs – accelerators in general


Cluster of multicore SMP machines
• Cluster model – distributed memory, nodes connected through a cluster interconnect
Q: How is data transferred among memories in different nodes?
Q: Interfaces?


Basic Programming Models - Outline
• Introduction
• Key concepts
  – Architectures
  – Programming models
  – Programming languages
  – Compilers
  – Operating system & libraries
  – APIs


Programming Models
• Shared memory
  – Automatic data accesses
  – No need to express communication
  – Up to 256–1024 cores
• Message passing – distributed memory
  – The programmer is responsible for data movement...
  – ...by expressing communication explicitly
  – Up to 30000–75000 nodes, connected through a cluster interconnect


Shared memory
• Memory is accessed (read and written) from all processor cores
  (Diagram: CPU1 writes x = N; CPU2 reads x and observes N)
• Communication and synchronization happen through shared memory


Shared memory
• Cache coherence protocol – guarantees on the order of accesses:
  – write – read on the same hardware thread P: after W(x, N), a later read of x returns N, provided there are no other writes to x in between
  – write (on P1) – read (on P2): after P1 writes x = N, a read of x on P2 returns N, provided there are no other writes to x in between
  – write (on P1) – write (on P2): if P1 writes x = N and P2 then writes x = M, no processor ever reads location x as M first and then N


Distributed memory
• Message passing
  – Each processor has its own memory – it cannot be accessed directly from other processors
    (Diagram: CPU1 holds its own copy x = N; CPU2 has a separate x in its own memory)
  – Communication and synchronization happen through explicit messages


Performance measurement
• Speed-up (S)
  – Expresses how much faster (or slower) the parallel execution is
  – T_s: execution time of the sequential version of the program
  – T(p): execution time of the parallel version running on p processors

    S(p) = T_s / T(p)


Performance measurement
• Efficiency (E)
  – Measures how well we are using the machine resources

    E(p) = S(p) / p
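
A short worked example with hypothetical numbers (not from the slides): if the sequential program takes T_s = 100 s and the parallel version on p = 8 processors takes T(8) = 20 s, then

    S(8) = 100 / 20 = 5        E(8) = 5 / 8 ≈ 0.63

i.e. a speed-up of 5 and an efficiency of about 63%.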


Performance measurement
• Scalability curve
  – Expresses how good the efficiency is as we increase the number of processors
  (Plot: speed-up vs. number of processors, with superlinear, ideal, acceptable and poor-scalability curves)


Amdahl's Law (from Gene Amdahl)
• Determines the maximum speed-up we can expect
• Represents the impact of the sequential part of a program on its overall scalability
  – f: fraction of the program that is sequential (cannot be parallelized)

    S_max(p) = 1 / (f + (1 - f) / p)

  (Plot: S_max(p) for f = 0.0001, f = 0.001, f = 0.01 (1% serial), f = 0.1 (10% serial!!))


Comparison 100 to 1000 cores
• Amdahl's Law
  (Plot: maximum speed-up at 100 and 1000 cores for f = 0.0001, f = 0.001, f = 0.01 (1% serial), f = 0.1 (10% serial!!))
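
The following is a minimal C sketch (not part of the original slides) that tabulates Amdahl's formula for the serial fractions shown above, at 100 and 1000 cores; file and variable names are illustrative only.

  /* amdahl.c – print S_max(p) = 1 / (f + (1 - f)/p) */
  #include <stdio.h>

  static double amdahl (double f, int p)
  {
      return 1.0 / (f + (1.0 - f) / p);
  }

  int main (void)
  {
      double f[] = { 0.0001, 0.001, 0.01, 0.1 };
      int    p[] = { 100, 1000 };

      for (int i = 0; i < 4; i++)
          for (int j = 0; j < 2; j++)
              printf ("f = %-7g p = %-5d S_max = %.1f\n",
                      f[i], p[j], amdahl (f[i], p[j]));
      return 0;
  }

With f = 0.01 (1% serial) the maximum speed-up is about 50 on 100 cores and only about 91 on 1000 cores, which is the point of the comparison above.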


Reality is usually worse
• Scalability curve degraded by
  – Overhead
  – Communication & synchronization between threads
  – Conflicts
  (Plot: speed-up vs. number of processors showing superlinear, ideal, acceptable and degraded-performance curves)


Sources of overhead
• Management of parallelism
  – Cost of creating and joining parallelism
• Communication
  – Cache coherence and/or application messaging
• Synchronization
  – Locks, barriers... depending on their use and implementation
• Load imbalance
  – The program does not have enough work to keep all resources busy, or
  – The distribution of work results in some processors receiving more work to do than others


Sources of overhead
• They mainly cause an increase in the serial portion of the program
  – With the corresponding loss of scalability
• Their impact depends on
  – the architecture
  – the runtime system
  – the application


Programming Languages and Interfaces
• C, Fortran
• Pthreads, Message Passing Interface (MPI)
• OpenMP
• Global address space approaches
  – UPC, Coarray Fortran, X10, Chapel, Fortress


Identifying parallelism
• Two main sources
  – Functional decomposition (task parallelism)
    • Which parts of the application can run in parallel?
  – Data decomposition (data parallelism)
    • Which operations on data can be performed in parallel?
    • Loops are usually a good source of parallelism


Choosing the right granularity
• Should we parallelize big or small portions of the application?
  – Ideally, we should choose the coarsest grain
    • Lower overall overhead
    • Usually less communication and synchronization
    • ...but it can lead to load imbalance
  – Using a finer grain
    • Larger overall overhead, communication and synchronization
    • ...but much better load balance is possible
  – Large applications use multi-level parallelization
    • With the possibility of combining several programming models, e.g. MPI+OpenMP


After parallelization... solve the problems
• Correctness
  – Incorrect parallelization, race conditions, deadlock...
• Performance
  – Load imbalance
  – False sharing
  – Bad locality management
• Finding the source of problems is intrinsically harder than in sequential programs
  – Use as much help as possible from support tools


True sharing and false sharing
• True sharing: two processors access the same memory location, and the hardware will transfer the data from CPU1's cache to CPU2's cache
  (Diagram: CPU1 writes x = N; CPU2 reads x and gets N)
• False sharing: two processors access different (but close) memory locations that lie in the same cache line
  (Diagram: CPU1 writes x0 = N and CPU2 writes x4 = M; the cache line holding x0..x4 keeps moving between CPU1's cache and CPU2's cache)
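
A minimal Pthreads sketch (not part of the original slides; Pthreads are introduced later in the deck) that exhibits false sharing: the two counters sit in the same cache line unless the hypothetical padding field is uncommented.

  /* false_sharing.c – compile with: gcc -O2 -o false_sharing false_sharing.c -lpthread */
  #include <stdio.h>
  #include <pthread.h>

  #define ITERS 100000000L

  struct counters {
      volatile long a;
      /* char pad[64]; */   /* uncomment to push b into its own cache line */
      volatile long b;
  };

  static struct counters c;

  static void * inc_a (void * arg) { for (long i = 0; i < ITERS; i++) c.a++; return NULL; }
  static void * inc_b (void * arg) { for (long i = 0; i < ITERS; i++) c.b++; return NULL; }

  int main (void)
  {
      pthread_t t1, t2;
      pthread_create (&t1, NULL, inc_a, NULL);
      pthread_create (&t2, NULL, inc_b, NULL);
      pthread_join (t1, NULL);
      pthread_join (t2, NULL);
      printf ("a = %ld, b = %ld\n", c.a, c.b);
      return 0;
  }

Each thread only ever touches its own counter, yet without the padding the run is typically much slower because the shared cache line keeps bouncing between the two cores.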


Compilers
• Different languages require different compilers (or compiler drivers)
  – gcc -o hello hello.c
  – g++ -o hello hello.C
  – gfortran -o hello hello.f
  – gfortran -ffree-form -o hello hello.f
  – mpicc -o hello hello.c
  – mpif90 -o hello hello.f90


Detailed compilation flow
  Source code + headers -> Preprocessor -> Compiler -> Assembler -> Object file
  Object file + other object files + libraries -> Linker -> Executable file
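
The individual stages can be reproduced with standard gcc flags; a small sketch (file names are illustrative only):

  $ gcc -E hello.c -o hello.i     # preprocessor: expand headers and macros
  $ gcc -S hello.i -o hello.s     # compiler: generate assembly
  $ gcc -c hello.s -o hello.o     # assembler: produce the object file
  $ gcc hello.o -o hello          # linker: combine object files and libraries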


Some compiler options
• Optimization: -O, -O3
• Debugging: -g, -ggdb
• Profiling: -pg
• Preprocess only: -E
• Stop at the object file: -c
• Add an include search directory: -I
• Add a library search directory: -L
• Add libraries to link with: -l
  – -lpthread links with the libpthread.so library
  – -lm links with the math library (use also #include <math.h>)
• OpenMP support: -fopenmp
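
Putting several of these options together in one command line (paths and file names are illustrative only):

  $ gcc -O3 -g -fopenmp -I$HOME/include -L$HOME/lib -o app app.c -lm -lpthread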


Basic debugging
• A debugger allows you to examine the execution of a program
  – Start your program (run)
    • gdb hello
    • run
  – Specifying any parameters that might affect its behavior
    • run -i -t 10
  – Stops your program on critical errors (e.g. segmentation fault)
  – Stops your program on specified conditions
  – Examine the program when stopped
  – Change registers/variables/memory in your program
  – Solve small bugs and keep going


Debugging a program
• Pay attention to the compiler flags
  – -g to generate debugging information
  – -O0 to get accurate information about your program
  – -O, -O3 can cause inaccuracies
• Invoke the debugger with your program
  – gdb <executable>
• Run the program
  – run <arguments>
  – Ctrl-C will interrupt execution and gdb will regain control


Debugging a program
• Attaching to a running process
  – gdb -pid PID
• Setting breakpoints
  – break function
  – break file:function
  – break <line>
  – break file:<line>
  – break ... if i>5
• Setting watchpoints
  – watch i==100
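
A short illustrative gdb session tying these commands together (the program name app, the function compute and the variable i are hypothetical):

  $ gcc -g -O0 -o app app.c
  $ gdb ./app
  (gdb) break main                # stop at the start of main
  (gdb) break compute if i>5      # conditional breakpoint
  (gdb) watch i==100              # stop when the condition becomes true
  (gdb) run -i -t 10              # start the program with its arguments
  (gdb) print i                   # inspect a variable when stopped
  (gdb) backtrace                 # show the call stack
  (gdb) continue                  # resume execution
  (gdb) quit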


Execution environment
• Applications
• Runtime libraries
• Operating system
• Hardware
  (Diagram: a process – the application plus its runtime libraries – contains tasks/threads, each with its own PC+SP, managed by the operating system on top of the hardware)


OS support
• Generic structure of the address space
  (Diagram: code, data, BSS/heap, shared libraries, thread stacks and main stack, with the PC and SP of each running thread)
  – Data contains constant-initialized data
  – The BSS/heap area grows with malloc
  – Other memory areas can be mapped with mmap
  – Thread stacks are allocated at thread creation
  – The main stack grows automatically up to ulimit -s (e.g. 8 MB)


Runtime libraries
• C support library – glibc
  – UNIX system calls – open/close/read/write/fork/exec...
  – Buffered I/O – printf, fprintf, fread, fwrite, fopen...
  – Sockets – TCP and UDP communications
    • MPI builds on sockets or other communication infrastructures
  – ...
• Pthreads
  – Creation, termination
  – Attributes
  – Scheduling, priorities, binding...
  – OpenMP builds on top of Pthreads


Static and dynamic libraries
• Non-shared (static)
  – The executable contains the code of all functions
    • Uses more memory space
    • It cannot take advantage of new library versions
• Dynamic (shared)
  – The executable contains only its own code
  – It links the libraries dynamically
    • Only one resident copy of the library
    • Installing a new version will affect all programs
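
Typical commands for building and linking against each kind of library (file names are illustrative only):

  $ gcc -c mylib.c -o mylib.o
  $ ar rcs libmylib.a mylib.o                    # static library
  $ gcc -shared -fPIC mylib.c -o libmylib.so     # shared library
  $ gcc app.c -L. -lmylib -o app                 # links the shared version if both are present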


Pthreads
• Creation of a pthread
  – Allocates a thread descriptor
  – Allocates the stack and builds the stack frame
    • Function, argument
  – Creates a kernel-level thread if needed
    • Sharing process resources

  #include <pthread.h>

  int pthread_create (pthread_t * thread,
                      pthread_attr_t * attr,
                      void * (* start_routine) (void *),
                      void * arg);


Pthreads
• Example: creation of a pthread

  void * func (void * argument)
  {
      printf ("argument %d\n", (int) argument);
  }

  int main ()
  {
      int res;
      pthread_t th;
      ...
      res = pthread_create (&th, NULL, func, argument);
      ...
  }


Pthreads – types of parallelism
• Fork/join
  – Fork: n x pthread_create
  – Join: n x pthread_join – gets the termination code of the pthreads


Pthreads – types of parallelism
• Unstructured – detach
  – pthread_create + pthread_detach; each pthread finishes with pthread_exit
  – Detach: the application does not need to get the pthreads' termination code


Pthreads - interface
• pthread_exit (code)
  – Saves the termination code in the pthread structure
• pthread_join (pth, &status)
  – Retrieves the pthread termination code
• pthread_detach (pth)
  – Marks the pthread structure to be freed when the pthread terminates
• The pthread descriptor can be released after
  – pthread_exit and pthread_join
  – pthread_detach and pthread_exit
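
A complete fork/join sketch combining these calls (not part of the original slides; thread count and names are illustrative):

  /* forkjoin.c – compile with: gcc -o forkjoin forkjoin.c -lpthread */
  #include <stdio.h>
  #include <pthread.h>

  #define NTHREADS 4

  static void * worker (void * arg)
  {
      long id = (long) arg;
      printf ("hello from thread %ld\n", id);
      pthread_exit ((void *) id);              /* saves the termination code */
  }

  int main (void)
  {
      pthread_t th[NTHREADS];
      void * status;

      for (long i = 0; i < NTHREADS; i++)      /* fork: n x pthread_create */
          pthread_create (&th[i], NULL, worker, (void *) i);

      for (int i = 0; i < NTHREADS; i++) {     /* join: n x pthread_join */
          pthread_join (th[i], &status);       /* retrieves the termination code */
          printf ("thread %d returned %ld\n", i, (long) status);
      }
      return 0;
  }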


Pthreads - main difficulties
• Code outlining
  – The programmer has to identify parallelism, outline regions into functions and define the interfaces
• Synchronization
  – Locking
  – Condition variables
• In heterogeneous multicores (Cell/B.E., GPUs...)
  – The program has to be split into different files for compilation


MPI
• Based on messaging added to the application, and the MPI library
  – msg_send, msg_rcv
• Compiler driver
  – mpicc -o hello hello.c
    • Takes care of adding -I and -lmpi
• Runtime system
  – The MPI library
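
A minimal MPI sketch using the standard send/receive calls (not part of the original slides): every non-zero rank sends its rank number to rank 0, which prints what it receives.

  /* hello_mpi.c – compile with: mpicc -o hello_mpi hello_mpi.c
     run with, e.g.:             mpirun -np 4 ./hello_mpi        */
  #include <stdio.h>
  #include <mpi.h>

  int main (int argc, char * argv[])
  {
      int rank, size;

      MPI_Init (&argc, &argv);
      MPI_Comm_rank (MPI_COMM_WORLD, &rank);   /* my process id             */
      MPI_Comm_size (MPI_COMM_WORLD, &size);   /* total number of processes */

      if (rank != 0) {
          MPI_Send (&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
      } else {
          for (int src = 1; src < size; src++) {
              int value;
              MPI_Recv (&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                        MPI_STATUS_IGNORE);
              printf ("received %d from rank %d\n", value, src);
          }
      }

      MPI_Finalize ();
      return 0;
  }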


OpenMP
• Based on directives on top of C/C++/Fortran
• Compiler driver
  – gcc -fopenmp
    • -lgomp and -lpthread are added automatically by the gcc compiler
• Runtime system
  – Based on Pthreads, not seen by the programmer
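
A minimal OpenMP sketch (not part of the original slides) showing a directive-based parallel region:

  /* hello_omp.c – compile with: gcc -fopenmp -o hello_omp hello_omp.c */
  #include <stdio.h>
  #include <omp.h>

  int main (void)
  {
      /* the block below is executed by every thread in the team */
      #pragma omp parallel
      {
          printf ("hello from thread %d of %d\n",
                  omp_get_thread_num (), omp_get_num_threads ());
      }
      return 0;
  }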


Next steps on this track
• MPI
  – Monday afternoon to Wednesday
• OpenMP
  – Thursday and Friday
