PRACE Autumn School 2010
Basic Programming Models


Introduction
• Basic programming environment

  #include <stdio.h>

  int main (int argc, char * argv[])
  {
      printf ("Hello world\n");
      return 0;
  }

  Compiler:
  $ gcc -o hello hello.c          (produces the binary file hello)

  Execution environment (OS & libraries) and hardware:
  $ ./hello
  hello world
  $


Architecture
• Single processor model
  – Core
  – Cache hierarchy
  – Memory


A problem
• The number of transistors in a chip keeps increasing
  – but power consumption and heat dissipation have reached a limit
  – Example: Intel Core i7 Processor Extreme Edition – 32 nm, 6 cores / 12 threads, 3.33 GHz clock speed (cannot increase!!!), 12 MB cache
• Solution: keep the same clock rate and use the extra area to co-locate several processors (cores) on the same die
• Use parallelism to increase overall throughput


Parallelism
• "Serial" architectures already had a lot of parallelism
  – ILP – Instruction-Level Parallelism
    • Pipelined and superscalar processors
  – SIMD – Single-Instruction Multiple-Data
    • SSE4, AVX, AltiVec...
• The focus is now TLP – Thread-Level Parallelism
  – Also exploited in SMP nodes with single-core processors
  – Requires rewriting applications to take advantage of multiple hardware threads/cores


A thread...
• ...is simply defined as an independent execution context
  – Threads appear at all levels: application and libraries (user level), OS, hardware
  – Before multicore processors, each processor had only one hardware thread


Simultaneous multithreading
• Single processor model with SMT
  – Seen as two hardware threads from the OS point of view
Q: How do we spawn work onto the second SMT thread?


Multicore and SMT
• Multi-core model (with SMT)
  – Each core is seen as two hardware threads from the OS point of view
• No difference with respect to SMT from the OS perspective (see /proc/cpuinfo)
Q: What are the differences between SMT and multi-core?


Multicore SMP machines
• Multi-chip model and memory interconnect
• Coherence protocol
  – Hardware ensures data coherency and consistency
  – At cache-line granularity
    • Invalidations, false sharing...
Q: Interfaces?


Accelerators
• Heterogeneity
  – GPUs, FPGAs – accelerators in general


Cluster of multicore SMP machines
• Cluster model – distributed memory, nodes connected through a cluster interconnect
Q: How is data transferred among memories in different nodes?
Q: Interfaces?


Basic Programming Models - Outline
• Introduction
• Key concepts
  – Architectures
  – Programming models
  – Programming languages
  – Compilers
  – Operating system & libraries
  – APIs


Programming Models
• Shared memory
  – Automatic data accesses
  – No need to express communication
  – Up to 256–1024 cores
• Message passing – distributed memory
  – The programmer is responsible for data movement...
  – ...by expressing communication explicitly
  – Up to 30000–75000 nodes, connected through a cluster interconnect


Shared memory
• Memory is accessed (read and written) from all processor cores
  (Diagram: CPU1 writes x = N; CPU2 reads x and observes N)
• Communication and synchronization happen through shared memory


Shared memory
• Cache coherence protocol – guarantees on the order of accesses:
  – write – read on the same hardware thread P: after W(x, N), a later read of x returns N, provided there are no other writes to x in between
  – write (on P1) – read (on P2): after P1 writes x = N, a read of x on P2 returns N, provided there are no other writes to x in between
  – write (on P1) – write (on P2): if P1 writes x = N and P2 then writes x = M, no processor ever reads location x as M first and then N


Distributed memory
• Message passing
  – Each processor has its own memory – it cannot be accessed directly from other processors
    (Diagram: CPU1 holds its own copy x = N; CPU2 has a separate x in its own memory)
  – Communication and synchronization happen through explicit messages


Performance measurement
• Speed-up (S)
  – Expresses how much faster (or slower) the parallel execution is
  – T_s: execution time of the sequential version of the program
  – T(p): execution time of the parallel version running on p processors

    S(p) = T_s / T(p)


Performance measurement
• Efficiency (E)
  – Measures how well we are using the machine resources

    E(p) = S(p) / p
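
A short worked example with hypothetical numbers (not from the slides): if the sequential program takes T_s = 100 s and the parallel version on p = 8 processors takes T(8) = 20 s, then

    S(8) = 100 / 20 = 5        E(8) = 5 / 8 ≈ 0.63

i.e. a speed-up of 5 and an efficiency of about 63%.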


Performance measurement
• Scalability curve
  – Expresses how good the efficiency is as we increase the number of processors
  (Plot: speed-up vs. number of processors, with superlinear, ideal, acceptable and poor-scalability curves)


Amdahl's Law (from Gene Amdahl)
• Determines the maximum speed-up we can expect
• Represents the impact of the sequential part of a program on its overall scalability
  – f: fraction of the program that is sequential (cannot be parallelized)

    S_max(p) = 1 / (f + (1 - f) / p)

  (Plot: S_max(p) for f = 0.0001, f = 0.001, f = 0.01 (1% serial), f = 0.1 (10% serial!!))


Comparison 100 to 1000 cores
• Amdahl's Law
  (Plot: maximum speed-up at 100 and 1000 cores for f = 0.0001, f = 0.001, f = 0.01 (1% serial), f = 0.1 (10% serial!!))
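
The following is a minimal C sketch (not part of the original slides) that tabulates Amdahl's formula for the serial fractions shown above, at 100 and 1000 cores; file and variable names are illustrative only.

  /* amdahl.c – print S_max(p) = 1 / (f + (1 - f)/p) */
  #include <stdio.h>

  static double amdahl (double f, int p)
  {
      return 1.0 / (f + (1.0 - f) / p);
  }

  int main (void)
  {
      double f[] = { 0.0001, 0.001, 0.01, 0.1 };
      int    p[] = { 100, 1000 };

      for (int i = 0; i < 4; i++)
          for (int j = 0; j < 2; j++)
              printf ("f = %-7g p = %-5d S_max = %.1f\n",
                      f[i], p[j], amdahl (f[i], p[j]));
      return 0;
  }

With f = 0.01 (1% serial) the maximum speed-up is about 50 on 100 cores and only about 91 on 1000 cores, which is the point of the comparison above.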


Reality is usually worse
• Scalability curve degraded by
  – Overhead
  – Communication & synchronization between threads
  – Conflicts
  (Plot: speed-up vs. number of processors showing superlinear, ideal, acceptable and degraded-performance curves)


Sources of overhead
• Management of parallelism
  – Cost of creating and joining parallelism
• Communication
  – Cache coherence and/or application messaging
• Synchronization
  – Locks, barriers... depending on their use and implementation
• Load imbalance
  – The program does not have enough work to keep all resources busy, or
  – The distribution of work results in some processors receiving more work to do than others


Sources of overhead
• They mainly cause an increase in the serial portion of the program
  – With the corresponding loss of scalability
• Their impact depends on
  – the architecture
  – the runtime system
  – the application


Programming Languages and Interfaces
• C, Fortran
• Pthreads, Message Passing Interface (MPI)
• OpenMP
• Global address space approaches
  – UPC, Coarray Fortran, X10, Chapel, Fortress


Identifying parallelism
• Two main sources
  – Functional decomposition (task parallelism)
    • Which parts of the application can run in parallel?
  – Data decomposition (data parallelism)
    • Which operations on data can be performed in parallel?
    • Loops are usually a good source of parallelism


Choosing the right granularity
• Should we parallelize big or small portions of the application?
  – Ideally, we should choose the coarsest grain
    • Lower overall overhead
    • Usually less communication and synchronization
    • ...but it can lead to load imbalance
  – Using a finer grain
    • Larger overall overhead, communication and synchronization
    • ...but much better load balance is possible
  – Large applications use multi-level parallelization
    • With the possibility of combining several programming models, e.g. MPI+OpenMP


After parallelization... solve the problems
• Correctness
  – Incorrect parallelization, race conditions, deadlock...
• Performance
  – Load imbalance
  – False sharing
  – Bad locality management
• Finding the source of problems is intrinsically harder than in sequential programs
  – Use as much help as possible from support tools


True sharing and false sharing
• True sharing: two processors access the same memory location, and the hardware will transfer the data from CPU1's cache to CPU2's cache
  (Diagram: CPU1 writes x = N; CPU2 reads x and gets N)
• False sharing: two processors access different (but close) memory locations that lie in the same cache line
  (Diagram: CPU1 writes x0 = N and CPU2 writes x4 = M; the cache line holding x0..x4 keeps moving between CPU1's cache and CPU2's cache)
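
A minimal Pthreads sketch (not part of the original slides; Pthreads are introduced later in the deck) that exhibits false sharing: the two counters sit in the same cache line unless the hypothetical padding field is uncommented.

  /* false_sharing.c – compile with: gcc -O2 -o false_sharing false_sharing.c -lpthread */
  #include <stdio.h>
  #include <pthread.h>

  #define ITERS 100000000L

  struct counters {
      volatile long a;
      /* char pad[64]; */   /* uncomment to push b into its own cache line */
      volatile long b;
  };

  static struct counters c;

  static void * inc_a (void * arg) { for (long i = 0; i < ITERS; i++) c.a++; return NULL; }
  static void * inc_b (void * arg) { for (long i = 0; i < ITERS; i++) c.b++; return NULL; }

  int main (void)
  {
      pthread_t t1, t2;
      pthread_create (&t1, NULL, inc_a, NULL);
      pthread_create (&t2, NULL, inc_b, NULL);
      pthread_join (t1, NULL);
      pthread_join (t2, NULL);
      printf ("a = %ld, b = %ld\n", c.a, c.b);
      return 0;
  }

Each thread only ever touches its own counter, yet without the padding the run is typically much slower because the shared cache line keeps bouncing between the two cores.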


Compilers
• Different languages require different compilers (or compiler drivers)
  – gcc -o hello hello.c
  – g++ -o hello hello.C
  – gfortran -o hello hello.f
  – gfortran -ffree-form -o hello hello.f
  – mpicc -o hello hello.c
  – mpif90 -o hello hello.f90


Detailed compilation flow
  Source code + headers -> Preprocessor -> Compiler -> Assembler -> Object file
  Object file + other object files + libraries -> Linker -> Executable file
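
The individual stages can be reproduced with standard gcc flags; a small sketch (file names are illustrative only):

  $ gcc -E hello.c -o hello.i     # preprocessor: expand headers and macros
  $ gcc -S hello.i -o hello.s     # compiler: generate assembly
  $ gcc -c hello.s -o hello.o     # assembler: produce the object file
  $ gcc hello.o -o hello          # linker: combine object files and libraries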


Some compiler options
• Optimization: -O, -O3
• Debugging: -g, -ggdb
• Profiling: -pg
• Preprocess only: -E
• Stop at the object file: -c
• Add an include search directory: -I
• Add a library search directory: -L
• Add libraries to link with: -l
  – -lpthread links with the libpthread.so library
  – -lm links with the math library (use also #include <math.h>)
• OpenMP support: -fopenmp
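
Putting several of these options together in one command line (paths and file names are illustrative only):

  $ gcc -O3 -g -fopenmp -I$HOME/include -L$HOME/lib -o app app.c -lm -lpthread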


Basic debugging
• A debugger allows you to examine the execution of a program
  – Start your program (run)
    • gdb hello
    • run
  – Specifying any parameters that might affect its behavior
    • run -i -t 10
  – Stops your program on critical errors (e.g. segmentation fault)
  – Stops your program on specified conditions
  – Examine the program when stopped
  – Change registers/variables/memory in your program
  – Solve small bugs and keep going


Debugging a program
• Pay attention to the compiler flags
  – -g to generate debugging information
  – -O0 to get accurate information about your program
  – -O, -O3 can cause inaccuracies
• Invoke the debugger with your program
  – gdb <executable>
• Run the program
  – run <arguments>
  – Ctrl-C will interrupt execution and gdb will regain control


Debugging a program
• Attaching to a running process
  – gdb -pid PID
• Setting breakpoints
  – break function
  – break file:function
  – break <line>
  – break file:<line>
  – break ... if i>5
• Setting watchpoints
  – watch i==100
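
A short illustrative gdb session tying these commands together (the program name app, the function compute and the variable i are hypothetical):

  $ gcc -g -O0 -o app app.c
  $ gdb ./app
  (gdb) break main                # stop at the start of main
  (gdb) break compute if i>5      # conditional breakpoint
  (gdb) watch i==100              # stop when the condition becomes true
  (gdb) run -i -t 10              # start the program with its arguments
  (gdb) print i                   # inspect a variable when stopped
  (gdb) backtrace                 # show the call stack
  (gdb) continue                  # resume execution
  (gdb) quit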


Execution environment
• Applications
• Runtime libraries
• Operating system
• Hardware
  (Diagram: a process – the application plus its runtime libraries – contains tasks/threads, each with its own PC+SP, managed by the operating system on top of the hardware)


OS support
• Generic structure of the address space
  (Diagram: code, data, BSS/heap, shared libraries, thread stacks and main stack, with the PC and SP of each running thread)
  – Data contains constant-initialized data
  – The BSS/heap area grows with malloc
  – Other memory areas can be mapped with mmap
  – Thread stacks are allocated at thread creation
  – The main stack grows automatically up to ulimit -s (e.g. 8 MB)


Runtime libraries
• C support library – glibc
  – UNIX system calls – open/close/read/write/fork/exec...
  – Buffered I/O – printf, fprintf, fread, fwrite, fopen...
  – Sockets – TCP and UDP communications
    • MPI builds on sockets or other communication infrastructures
  – ...
• Pthreads
  – Creation, termination
  – Attributes
  – Scheduling, priorities, binding...
  – OpenMP builds on top of Pthreads


Static and dynamic libraries
• Non-shared (static)
  – The executable contains the code of all functions
    • Uses more memory space
    • It cannot take advantage of new library versions
• Dynamic (shared)
  – The executable contains only its own code
  – It links the libraries dynamically
    • Only one resident copy of the library
    • Installing a new version will affect all programs
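
Typical commands for building and linking against each kind of library (file names are illustrative only):

  $ gcc -c mylib.c -o mylib.o
  $ ar rcs libmylib.a mylib.o                    # static library
  $ gcc -shared -fPIC mylib.c -o libmylib.so     # shared library
  $ gcc app.c -L. -lmylib -o app                 # links the shared version if both are present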


Pthreads
• Creation of a pthread
  – Allocates a thread descriptor
  – Allocates the stack and builds the stack frame
    • Function, argument
  – Creates a kernel-level thread if needed
    • Sharing process resources

  #include <pthread.h>

  int pthread_create (pthread_t * thread,
                      pthread_attr_t * attr,
                      void * (* start_routine) (void *),
                      void * arg);


Pthreads
• Example: creation of a pthread

  void * func (void * argument)
  {
      printf ("argument %d\n", (int) argument);
  }

  int main ()
  {
      int res;
      pthread_t th;
      ...
      res = pthread_create (&th, NULL, func, argument);
      ...
  }


Pthreads – types of parallelism
• Fork/join
  – Fork: n x pthread_create
  – Join: n x pthread_join – gets the termination code of the pthreads


Pthreads – types of parallelism
• Unstructured – detach
  – pthread_create + pthread_detach; each pthread finishes with pthread_exit
  – Detach: the application does not need to get the pthreads' termination code


Pthreads - interface
• pthread_exit (code)
  – Saves the termination code in the pthread structure
• pthread_join (pth, &status)
  – Retrieves the pthread termination code
• pthread_detach (pth)
  – Marks the pthread structure to be freed when the pthread terminates
• The pthread descriptor can be released after
  – pthread_exit and pthread_join
  – pthread_detach and pthread_exit
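
A complete fork/join sketch combining these calls (not part of the original slides; thread count and names are illustrative):

  /* forkjoin.c – compile with: gcc -o forkjoin forkjoin.c -lpthread */
  #include <stdio.h>
  #include <pthread.h>

  #define NTHREADS 4

  static void * worker (void * arg)
  {
      long id = (long) arg;
      printf ("hello from thread %ld\n", id);
      pthread_exit ((void *) id);              /* saves the termination code */
  }

  int main (void)
  {
      pthread_t th[NTHREADS];
      void * status;

      for (long i = 0; i < NTHREADS; i++)      /* fork: n x pthread_create */
          pthread_create (&th[i], NULL, worker, (void *) i);

      for (int i = 0; i < NTHREADS; i++) {     /* join: n x pthread_join */
          pthread_join (th[i], &status);       /* retrieves the termination code */
          printf ("thread %d returned %ld\n", i, (long) status);
      }
      return 0;
  }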


Pthreads - main difficulties
• Code outlining
  – The programmer has to identify parallelism, outline regions into functions and define the interfaces
• Synchronization
  – Locking
  – Condition variables
• In heterogeneous multicores (Cell/B.E., GPUs...)
  – The program has to be split into different files for compilation


MPI
• Based on messaging added to the application, and the MPI library
  – msg_send, msg_rcv
• Compiler driver
  – mpicc -o hello hello.c
    • Takes care of adding -I and -lmpi
• Runtime system
  – The MPI library
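
A minimal MPI sketch using the standard send/receive calls (not part of the original slides): every non-zero rank sends its rank number to rank 0, which prints what it receives.

  /* hello_mpi.c – compile with: mpicc -o hello_mpi hello_mpi.c
     run with, e.g.:             mpirun -np 4 ./hello_mpi        */
  #include <stdio.h>
  #include <mpi.h>

  int main (int argc, char * argv[])
  {
      int rank, size;

      MPI_Init (&argc, &argv);
      MPI_Comm_rank (MPI_COMM_WORLD, &rank);   /* my process id             */
      MPI_Comm_size (MPI_COMM_WORLD, &size);   /* total number of processes */

      if (rank != 0) {
          MPI_Send (&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
      } else {
          for (int src = 1; src < size; src++) {
              int value;
              MPI_Recv (&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                        MPI_STATUS_IGNORE);
              printf ("received %d from rank %d\n", value, src);
          }
      }

      MPI_Finalize ();
      return 0;
  }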


OpenMP
• Based on directives on top of C/C++/Fortran
• Compiler driver
  – gcc -fopenmp
    • -lgomp and -lpthread are added automatically by the gcc compiler
• Runtime system
  – Based on Pthreads, not seen by the programmer
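
A minimal OpenMP sketch (not part of the original slides) showing a directive-based parallel region:

  /* hello_omp.c – compile with: gcc -fopenmp -o hello_omp hello_omp.c */
  #include <stdio.h>
  #include <omp.h>

  int main (void)
  {
      /* the block below is executed by every thread in the team */
      #pragma omp parallel
      {
          printf ("hello from thread %d of %d\n",
                  omp_get_thread_num (), omp_get_num_threads ());
      }
      return 0;
  }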


Next steps on this track
• MPI
  – Monday afternoon to Wednesday
• OpenMP
  – Thursday and Friday
