PGAS Programming with UPC and Fortran Coarrays
EPCC PRACE Advanced Training Centre
Course Slides
9-10 January 2013
EPCC
© The University of Edinburgh


Parallel Programming Languages and Approaches
Dr Alan D Simpson
Technical Director, EPCC
a.simpson@epcc.ed.ac.uk
+44 131 650 5120

Contents
• A Little Bit of History
  – Non-Parallel Programming Languages
  – Vector Processing
  – Data Parallel
  – Early Parallel Languages
• Current Status of Parallel Programming
  – Parallelisation Strategies
  – Mainstream HPC
• Alternative Parallel Programming Languages
  – Single-Sided Communication
  – PGAS
  – Accelerators
  – Hybrid Approaches
• Final Remarks and Summary


Non-Parallel Programming Languages
• Serial languages are also important for HPC
  – Used for much scientific computing
  – Basis for parallel languages
• PRACE survey results:
  [Chart: response counts by language – Fortran, C, C++, Python, Perl, Other, Java, Chapel, Co-array Fortran]
• The PRACE survey indicates that nearly all applications are written in:
  – Fortran: well suited for scientific computing
  – C/C++: allows good access to hardware
• Supplemented by
  – Scripts using Python, Perl and bash
  – PGAS languages starting to be used


Vector Programming
• Exploit hardware support for pipelining
  – and for fast data access
• Early supercomputers were often vector systems
  – Such as the Cray-1
• Allowed operations on vectors
  – A vector is a series of values
  – e.g., a section of a Fortran array
• Typical vector loop:
      DO i = 1, n
        y(i) = a*x(i) + y(i)
      END DO

Vector Multiply
• The multiply operation is made up of a number of stages
• Vector hardware allows the stages to work independently and to pass results to each other in an "assembly line" manner
• Start-up cost as the pipeline fills, but then a result every cycle
  [Diagram: results R(1)–R(5) produced one per operation in serial execution, but overlapped in the vector pipeline so a new result completes every cycle]


Vectorisation
• Sometimes required restructuring of loops to allow efficient vectorisation
• Directives used to provide information to the compiler about whether a particular operation was vectorisable
• Compilers became increasingly good at spotting opportunities for vectorisation
• Vector supercomputers became less popular as parallel computing grew
• However, many modern CPUs contain vector-like features
  – e.g., Interlagos Opteron processors in the Cray XE

Data Parallel
• Processors perform similar operations across data elements in an array
• Higher-level programming paradigm, characterised by:
  – single-threaded control
  – global name space
  – loosely synchronous processes
  – parallelism implied by operations applied to data
  – compiler directives
• Data parallel languages: generally a serial language (e.g., Fortran 90) plus
  – compiler directives (e.g., for data distribution)
  – first-class language constructs to support parallelism
  – new intrinsics and library functions
• Paradigm well suited to a number of early (SIMD) parallel computers
  – Connection Machine, DAP, MasPar, …


Data Parallel II
• Many data parallel languages implemented:
  – Fortran-Plus, DAP Fortran, MP Fortran, CM Fortran, *LISP, C*, CRAFT, Fortran D, Vienna Fortran
• Languages expressed data parallel operations differently
• Machine-specific languages meant poor portability
• Needed a portable standard: High Performance Fortran
• Easy to port codes to, but performance could rarely match that from message passing codes
  – Struggled to gain broad popularity

Early Parallel Languages
• Connection Machine languages
  – Thinking Machines Corporation provided data parallel versions of a variety of sequential languages (*LISP, C*, CM-Fortran)
  – Allowed users to exploit a large number of simple processors in parallel
• OCCAM
  – Early message passing language
  – …based on Communicating Sequential Processes
  – Developed by INMOS for programming Transputers
  – Explicitly parallel loops via the PAR keyword
  – Language constructs for sending and receiving data through named channels
  – Could only communicate with neighbouring processors
  – → Message routing had to be done in user software
• Most early languages for parallel computing were vendor-specific


Parallelisation Strategies
• PRACE recently asked more than 400 European HPC users:
  – "Which parallelisation implementations do you use?"
  [Chart: response counts – MPI, OpenMP, Combined MPI+OpenMP, MPI including MPI-2 single-sided, POSIX threads, Other, Combined MPI+POSIX threads, Combined MPI+SHMEM, SHMEM, HPF]
  – Unsurprisingly, the most popular answers were MPI, OpenMP and combined MPI+OpenMP
  – Some users of single-sided communications


Parallelisation Strategies II
• PRACE also asked users of the very largest systems:
  – "Which parallelisation method does your application use?"
• Most popular: "MPI only" and "Combined MPI+OpenMP"
• 12% used single-sided routines

Mainstream HPC
• For the last 15+ years, most HPC cycles on large systems have been used to run MPI programs, written in Fortran or C/C++
  – Plus OpenMP used on shared memory systems/nodes
• The MSc in HPC includes compulsory courses in MPI and OpenMP
• However, there are now reasons why this may be changing:
  – Currently, HPC systems have increasingly large numbers of cores, but individual core performance is relatively static
  – There are significant challenges in exploiting future Exascale systems
• So, alongside mainstream HPC, there is also significant activity in:
  – Single-sided communication
  – PGAS languages
  – Accelerators
  – Hybrid approaches
• Many of these areas are discussed later in this course


Shared Memory
• Multiple threads sharing global memory
• Developed for systems with shared memory (MIMD-SM)
• Program loop iterations can be distributed to threads
• Each thread can refer to private objects within a parallel context
• Implementation
  – Threads map to user threads running on one shared memory node
  – Extensions to distributed memory not so successful
• POSIX threads (Pthreads) is a portable standard for threading
• Vendors had various shared-memory directives
  – OpenMP developed as a common standard for HPC
• OpenMP is a good model to use within a node
  – More recent task features

Message Passing
• Processes cooperate to solve the problem by exchanging data
• Can be used on most architectures
  – Especially suited to distributed memory systems (MIMD-DM)
• The message passing model is based on the notion of processes
  – Process: an instance of a running program, together with the program's data
• Each process has access only to its own data
  – i.e., all variables are private
• Processes communicate with each other by sending and receiving messages
  – Typically library calls from a conventional sequential language
• During the 1980s, there was an explosion in message passing languages and libraries
  – CS Tools, OCCAM, CHIMP (developed by EPCC), PVM, PARMACS, …


MPI: Message Passing Interface
• De facto standard developed by a working group of around 60 vendors and researchers from 40 organisations in the USA and Europe
  – Took two years
  – MPI-1 released in 1993
  – Built on experiences from previous message passing libraries
• MPI's prime goals are:
  – To provide source-code portability
  – To allow efficient implementation
• MPI-2 was released in 1997
  – New features: parallel I/O, dynamic process management and remote memory operations (single-sided communication)
• Now, MPI is used by nearly all message passing programs


Single-Sided Communication
• Allows direct access to the memory of other processors
  – Each process can access the total memory, even on distributed memory systems
• A simpler protocol can bring performance benefits
  – But requires thinking about synchronisation, remote addresses, caching, ...
• Key routines
  – PUT is a remote write
  – GET is a remote read
• Libraries give PGAS functionality
• Vendor-specific libraries
  – SHMEM (Cray/SGI), LAPI (IBM)
• Portable implementations
  – MPI-2, OpenSHMEM

Single-Sided Communication
• Single-sided communication is a major part of the MPI-2 standard
  – Quite general and portable to most platforms
  – However, portability and robustness can have an impact on latency
  – Quite complicated and messy to use
• Better performance from lower-level interfaces, like SHMEM
  – Originally developed by Cray, but a variety of similar implementations were developed on other platforms
  – Simple interface but hard to program correctly
• OpenSHMEM
  – New initiative to provide a standard interface
  – See http://www.openshmem.org
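To make the PUT idea concrete, here is a minimal MPI-2 one-sided sketch in C (my own illustration, not taken from the slides): rank 0 writes directly into a window exposed by rank 1, and the receiving side never posts a matching receive. It assumes the program is run with at least two processes.

    /* Minimal sketch (not from the slides): single-sided PUT with MPI-2.
       Run with at least two processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, buf = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* every process exposes one int as a remotely accessible window */
        MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                 /* open the access epoch */
        if (rank == 0 && size > 1) {
            int value = 42;
            /* remote write into rank 1's buf: no receive call on rank 1 */
            MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);                 /* complete the epoch */

        if (rank == 1) printf("rank 1 now holds %d\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }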


PGAS: Partitioned Global Address Space
• Access to local memory via standard program mechanisms, plus access to remote memory directly supported by the language
• The combination of access to all data plus exploiting locality could give good performance and scaling
• Well suited to modern MIMD systems with multicore (shared memory) nodes
• Newly popular approach, initially driven by US funding
  – Productive, Easy-to-use, Reliable Computing System (PERCS) project funded by DARPA's High Productivity Computing Systems (HPCS) programme

PGAS II
• Currently an active and enthusiastic community
• Very wide variety of languages under the PGAS banner
• See http://www.pgas.org
• Including: CAF, UPC, Titanium, Fortress, X10, CAF 2.0, Chapel, Global Arrays, HPF?, …
• Often, these languages have more differences than similarities…


PGAS Languages
• The broad range of PGAS languages makes it difficult to choose which to use
• Currently, CAF and UPC are probably the most relevant, as Cray's compilers and hardware now support CAF and UPC quite efficiently
• CAF: Fortran with Coarrays
  – Minimal addition to Fortran to support parallelism
  – Incorporated in the Fortran 2008 standard!
• UPC: Unified Parallel C
  – Adds parallel features to C

Accelerators
• Use accelerator hardware for faster node performance
• Recently, most HPC systems have been increasing the number of cores, but individual cores are not getting faster
  – This gives significant scaling challenges
• Accelerators are increasingly interesting
  – …for some applications
• PRACE survey
  – "Could your application benefit from accelerators, such as GPGPUs?"
  – 61% thought so


UK Usage of Accelerators
  [Charts: current usage (GPUs, Intel MIC) and future plans (GPUs, Intel MIC, FPGAs, other)]
• FPGAs were very fashionable a few years ago
  – But proved difficult to program
• Currently, most interest is around GPUs
  – With Intel MIC as an important future prospect
• Most users thought accelerators would increase in importance

Programming GPUs
• Graphics Processing Units (GPUs) have been increasing in performance much more quickly than standard processor cores
• This has led to an interest in GPGPU (General Purpose computation on Graphics Processing Units)
  – Where the GPU acts as an accelerator to the CPU
• Variety of different ways to program GPGPUs
  – CUDA
    – NVIDIA's proprietary interface to the architecture
    – Extensions to the C (and Fortran) language which allow interfacing to the hardware
  – OpenCL
    – Cross-platform API
    – Similar in design to CUDA, but lower level and not so mature
  – OpenACC
    – Directives-based approach
    – OpenMP-style directives abstract complexities away from the programmer
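As a flavour of the directives-based approach, the fragment below (my own sketch, not from the slides) marks a loop for offload with a single OpenACC directive; the compiler generates the device code and the data movement.

    /* Sketch (not from the slides): OpenACC offloads the loop to an
       accelerator; without one, it still runs correctly on the CPU. */
    void scale(int n, float *a, float factor) {
        #pragma acc parallel loop copy(a[0:n])
        for (int i = 0; i < n; i++)
            a[i] *= factor;
    }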


Hybrid Approaches
• Use more than one parallelisation strategy within a single program
• Trying to obtain more parallelism by exploiting the hierarchy in the hardware
• Most commonly, combining MPI + OpenMP
  – Using OpenMP across a shared-memory partition, with MPI to communicate between partitions
  – May make sense to use OpenMP just within a many-core processor
• But can also combine MPI with Pthreads or OpenSHMEM or CAF…
• Using many GPGPUs also often requires the use of MPI alongside the GPGPU programming approach


Why do Languages Survive or Die?
• It is not always entirely clear why some languages and approaches thrive while others fade away…
• However, languages which survive do have a number of common characteristics:
  – Appropriate model for current hardware
  – Good portability
  – Ease of use
  – Applicable to a broad range of problems
  – Strong engagement from both vendors and user communities
  – Efficient implementations available

Summary
• Development of portable standards has been essential for the uptake of new parallel programming ideas
• Mainstream HPC is currently based on MPI and OpenMP
  – However, there are alternatives
• Exascale challenges have injected new life into the development of novel parallel programming languages and approaches
• The remainder of this course focuses on PGAS languages and programming GPGPUs
  – Plus lectures on data parallel programming and single-sided communication


References
• PRACE-PP
  – D6.1: Identification and Categorisation of Applications and Initial Benchmarks Suite, Alan Simpson, Mark Bull and Jon Hill, EPCC
• PRACE-1IP
  – D7.4.1: Applications and User Requirements for Tier-0 Systems, Mark Bull (EPCC), Xu Guo (EPCC), Ioannis Liabotis (GRNET)
  – D7.4.3: Tier-0 Applications and Systems Usage, Xu Guo, Mark Bull (EPCC)
• ARCHER User Requirements
  – Project Working Group: Katharine Bowes (EPSRC), Ian Reid (NAG), Simon McIntosh-Smith (Bristol), Bryan Lawrence (NCAS/Reading) and Alan Simpson (EPCC)


UPC: Introduction & Basics
Dr Michèle Weiland
Applications Consultant, EPCC
m.weiland@epcc.ed.ac.uk

Objectives of the coming three lectures:
o understand the basic principles of UPC
o motivation behind PGAS
o learn about data distribution, synchronisation
o advanced features (dynamic memory allocation, collectives)
→ Practicals will try and emphasise the most important aspects of UPC


UPC: Unified Parallel C
Parallel extension to ISO C 99, adding
o explicit parallelism
o global shared address space
o synchronisation
Both commercial and open source compilers available
o Cray, IBM, SGI, HP
o GWU, LBNL, GCC

UPC and the World of PGAS
UPC != PGAS
o PGAS is a programming model
o UPC is one implementation of this model
Many other implementations
o language extension: Coarray Fortran
o new languages: Chapel, X10, Fortress, Titanium
o PGAS-like libraries: Global Arrays, OpenSHMEM
All implementations are different, but follow the same model!


UPC threads
UPC uses threads that operate independently in a SPMD fashion
→ all threads execute the same UPC program
Identifiers that return information about the program environment:
  THREADS:  holds the total number of threads
  MYTHREAD: stores the thread index
            → the index runs from 0 to THREADS-1

UPC threads
      #include <upc.h>
      #include <stdio.h>

      void main() {
          printf("Thread %d of %d says: Hello!", MYTHREAD, THREADS);
      }


Private vs. shared memory space
Concept of two memory spaces: private and shared
o objects declared in the private memory space are only accessible by a single thread
o objects declared in the shared memory space are accessible by all threads
→ the shared memory space is used to communicate information between threads

Private vs. shared data
private variables are declared as normal C variables
o multiple instances of the variable will exist
      int x; // private variable
shared variables are declared with the shared qualifier
o only allocated once, in the shared memory space
o accessible by all threads
      shared int y; // shared variable


UPC data locality
If a shared variable is a scalar, space is only allocated on thread 0
      int x;
      shared int y;
[Diagram: threads 0-3 each hold a private x; the single shared y sits in the shared memory space with affinity to thread 0]

Affinity
all threads can directly access shared data, even if it resides in a remote location
UPC creates a logical partitioning of the shared memory space
→ objects have affinity to one thread
→ shared scalars always have affinity to thread 0
better performance if a thread accesses data to which it has affinity
→ always keep data locality and affinity in mind


Shared array distribution
If a shared variable is an array, space is allocated across the shared memory space in a cyclic fashion by default
      int x;
      shared int y[16];
[Diagram: with 4 threads, y[0], y[4], y[8], y[12] have affinity to thread 0; y[1], y[5], y[9], y[13] to thread 1; y[2], y[6], y[10], y[14] to thread 2; y[3], y[7], y[11], y[15] to thread 3; each thread also has its own private x]

Shared array distribution (2)
Change the data layout by adding a "blocking factor" to shared arrays
      shared[blocksize] type array[n]
      int x;
      shared[2] int y[16];
[Diagram: with a blocking factor of 2, thread 0 holds y[0..1] and y[8..9], thread 1 holds y[2..3] and y[10..11], thread 2 holds y[4..5] and y[12..13], thread 3 holds y[6..7] and y[14..15]]
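A quick way to see the effect of a blocking factor is to ask the runtime which thread owns each element. The probe below is my own sketch (not from the slides) and uses the standard upc_threadof inquiry function; it assumes a static THREADS environment as in the slide's example.

    /* Sketch (not from the slides): print the affinity of every element
       of a blocked shared array. Assumes a static THREADS environment
       (e.g. compiled for 4 threads). */
    #include <upc.h>
    #include <stdio.h>

    #define N 16
    shared [2] int y[N];   /* blocking factor 2, as on the slide */

    int main(void) {
        if (MYTHREAD == 0) {
            int i;
            for (i = 0; i < N; i++)
                printf("y[%2d] has affinity to thread %d\n",
                       i, (int) upc_threadof(&y[i]));
        }
        return 0;
    }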


Work sharing
Shared data means shared workload!
If shared data is distributed between threads, the threads can distribute the work on this data between them
UPC has a built-in mechanism for explicitly distributing and sharing work

Work sharing: upc_forall
Statement for work distribution
o allows loop assignment of tasks to threads
o parallel for loop
The 4th parameter defines affinity to a thread
o if "affinity % THREADS" matches MYTHREAD, that thread executes the iteration
      upc_forall(expression; expression; expression; affinity)
Condition: the iterations of upc_forall must be independent!


Example: vector addition (1)
      #define N 10 * THREADS

      shared int vector1[N];
      shared int vector2[N];
      shared int sum[N];

      void main() {
          int i;
          for(i=0; i …
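The code above is cut off at this point in the transcription, and the follow-on slide with the upc_forall version is missing entirely. The sketch below is my reconstruction of the two forms the lectures go on to discuss, not the slides' exact code:

    /* Sketch (reconstruction, not the slides' exact code): completing the
       vector addition in two ways. */
    #include <upc.h>

    #define N 10 * THREADS

    shared int vector1[N];
    shared int vector2[N];
    shared int sum[N];

    int main(void) {
        int i;

        /* (1) plain C loop: a modulo test picks out the elements each
           thread owns under the default cyclic distribution */
        for (i = 0; i < N; i++)
            if (MYTHREAD == i % THREADS)
                sum[i] = vector1[i] + vector2[i];

        /* (2) the same work with upc_forall: the affinity expression i
           assigns iteration i to thread i % THREADS */
        upc_forall(i = 0; i < N; i++; i)
            sum[i] = vector1[i] + vector2[i];

        return 0;
    }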


Side effects of shared data
Holding data in a shared memory space has implications:
1) the lifetime of shared data needs to extend beyond the scope it was defined in (unless this is program scope)
   → storage duration
2) the shared data needs to be kept up-to-date
   → synchronisation

Storage duration of shared objects
Shared objects cannot have automatic storage duration
o i.e. any variable defined inside a function!
Why?
The SPMD model means a shared variable may be accessed outside the lifetime of the function!
Conclusion
shared variables must either
o have file scope;
o or be declared as static if defined inside a function.


"Static" keyword
ensures shared objects are accessible throughout program execution
→ objects are not linked to the scope of a thread
→ objects will not simply "disappear" after a thread exits the scope in which the object was defined

Example: maximum of an array
Here: the shared variables have file scope!
      #define max(a,b) (((a)>(b)) ? (a) : (b))

      shared int maximum[THREADS];
      shared int globalMax = 0;
      shared int a[THREADS*10];

      void main(int argc, char **argv) {
          … // initialise array a
          upc_forall(int i=0; i …


Synchronisation
Ensure all threads reach the same point in execution
o necessary for memory and data consistency
Barriers are used for synchronisation
o blocking
o split-phase (non-blocking)
      upc_barrier              → blocking
      upc_notify, upc_wait     → non-blocking

Example: maximum of an array
Here: the shared variables have file scope!
      #define max(a,b) (((a)>(b)) ? (a) : (b))

      shared int maximum[THREADS];
      shared int globalMax = 0;
      shared int a[THREADS*10];

      void main(int argc, char **argv) {
          … // initialise array a
          upc_barrier;
          upc_forall(int i=0; i …
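The loop body and the final reduction are cut off in both copies of this example; the following is my reconstruction of how it plausibly continues (not the slides' exact code), with the barriers placed as the slide indicates:

    /* Sketch (reconstruction, not the slides' exact code): each thread
       scans the elements it owns, then thread 0 combines the results. */
    #include <upc.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define max(a,b) (((a)>(b)) ? (a) : (b))

    shared int maximum[THREADS];
    shared int globalMax = 0;
    shared int a[THREADS*10];

    int main(void) {
        int i;

        /* each thread initialises the elements it has affinity to */
        upc_forall(i = 0; i < THREADS*10; i++; &a[i])
            a[i] = rand() % 1000;

        maximum[MYTHREAD] = 0;
        upc_barrier;        /* a[] is fully initialised everywhere */

        upc_forall(i = 0; i < THREADS*10; i++; &a[i])
            maximum[MYTHREAD] = max(maximum[MYTHREAD], a[i]);

        upc_barrier;        /* all per-thread maxima are ready */

        if (MYTHREAD == 0) {
            for (i = 0; i < THREADS; i++)
                globalMax = max(globalMax, maximum[i]);
            printf("global maximum = %d\n", globalMax);
        }
        return 0;
    }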


References
• UPC Language Specification (Version 1.2): http://upc.gwu.edu/docs/upc_specs_1.2.pdf
• UPC homepage: http://upc.gwu.edu/
• GCC UPC compiler: http://www.gccupc.org

UPC on HECToR XE6
The Cray compiler is so far the only one to support UPC
– it is the default compiler on HECToR XE6
– start by checking that the correct programming environment/compiler is loaded:
      userid@hector-xe6:~> which cc
      /opt/cray/xt-asyncpe/5.09/bin/cc
The compiler option for UPC code is: -h upc
For full information on compiler options execute
      man craycc


XE6: Compiling & running your code
To compile, either use the provided Makefile or do
      cc -h upc -o myprogram myprogram.upc
On HECToR, login nodes and compute nodes use different file systems
o compile on the login nodes, i.e. $HOME
o batch jobs can only read from/write to the compute nodes' file system, i.e. $WORK
o copy your binary, any input files and the submit script to $WORK, e.g.
      userid@hector-xe6:~> cp myprogram script $WORK/upc/myprogram
→ Always keep copies of critical files on $HOME, as $WORK is not backed up!

Submitting a job
The batch scheduler is PBS
– qsub -q myscript.pbs to submit a job
– qstat to check job status
– qdel to delete a job from the queue
The parallel job launcher for the compute nodes is aprun
– argument -n specifies the total number of processes
– argument -N specifies the number of tasks per node
– call aprun from a subdirectory of /work
Job submission is easiest with a PBS script submitted from /work


Sample PBS script for XE6
      #!/bin/bash --login
      #PBS -N my_job
      #PBS -l mppwidth=32
      #PBS -l mppnppn=32
      #PBS -l walltime=00:20:00
      #PBS -A d45
      # (use the correct budget code above)

      # Change to the directory that the job was submitted from
      cd $PBS_O_WORKDIR

      # Set the number of UPC threads
      export NPROC=`qstat -f $PBS_JOBID | awk '/mppwidth/ {print $3}'`
      export NTASK=`qstat -f $PBS_JOBID | awk '/mppnppn/ {print $3}'`

      aprun -n $NPROC -N $NTASK ./myprogram


UPC: Data distribution, synchronisation & work sharing
Dr Michèle Weiland
Applications Consultant, EPCC
m.weiland@epcc.ed.ac.uk

→ data distribution
  o multi-dimensional data
→ synchronisation methods
  o blocking versus non-blocking
→ work sharing
  o examples: vector addition revisited, matrix-vector multiplication


Brief recap…
→ private and shared data, logically partitioned memory space
→ data objects have affinity to exactly one thread
→ work sharing through upc_forall
→ distribution of shared data
→ storage duration of shared data
→ synchronisation
(distribution of shared data, work sharing and synchronisation are the focus of today's lecture)

Data Distribution
Cyclic distribution is the default
      int x;
      shared int y[13];
[Diagram: with 4 threads, thread 0 holds y[0], y[4], y[8], y[12]; thread 1 holds y[1], y[5], y[9]; thread 2 holds y[2], y[6], y[10]; thread 3 holds y[3], y[7], y[11]]


Data Distribution (2)
If the number of elements is not an exact multiple of the thread count, threads can end up with uneven numbers of elements:
      int x;
      shared[2] int y[13];
[Diagram: with a blocking factor of 2 and 4 threads, thread 0 holds y[0..1] and y[8..9], thread 1 holds y[2..3] and y[10..11], thread 2 holds y[4..5] and y[12], thread 3 holds y[6..7]]

Blocking factor
should be used if the default distribution is not suitable
o more on the meaning of "suitable" later on…
→ four different cases
      shared [4]    defines a block size of 4 elements
      shared [0]    all elements are given affinity to thread 0
      shared [*]    when possible, data is distributed in contiguous blocks
      shared []     equivalent to shared [0]


Multi-dimensional data
UPC can distribute data using block cyclic distributions
Distributions represent a top-down approach
o shared objects can be distributed into segments using the blocking factor
→ conceptually the opposite of CAF, where shared objects are "created" by merging the pieces from every image in a bottom-up approach

2D array decomposition
distribution using the * layout qualifier and empty brackets
‣ Block distribution
      shared[*] int y[8][8];
‣ Entire array on master
      shared[] int y[8][8];   or
      shared[0] int y[8][8];
[Diagram: shared[*] splits the 8x8 array y into contiguous blocks of whole rows, one block per thread; shared[] / shared[0] places the entire array on thread 0]


2D array decomposition (2)
Distribution using different blocking factors
      shared[8] int y[8][8];
      shared[6] int y[8][8];
[Diagram: with a blocking factor of 8 each block is exactly one row of y; with a blocking factor of 6 the blocks straddle the row boundaries]

Multi-dimensional data – Case 1
      shared double grid[8][8][8]   with THREADS == 3
[Diagram: the default (cyclic) element-by-element distribution of the 8x8x8 grid over 3 threads]
N.B. the array layout convention is arr[x][y][z]!


Multi-dimensional data – Case 1
      shared double grid[8][8][8]   with THREADS == 4
[Diagram: the default (cyclic) distribution of the 8x8x8 grid over 4 threads]
N.B. the array layout convention is arr[x][y][z]!

Multi-dimensional data – Case 2
      shared [3] double grid[8][8][8]   with THREADS == 3
[Diagram: block-cyclic distribution with a blocking factor of 3 over 3 threads]


Multi-dimensional data – Case 2
blocking factor depending on the dimensions → pencil distribution
      shared [8] double grid[8][8][8]   with THREADS == 3
[Diagram: each block of 8 contiguous elements forms a "pencil" along the z dimension]

Multi-dimensional data – Case 2
combining thread count and blocking factor → slab distribution
      shared [8] double grid[8][8][8]   with THREADS == 4
[Diagram: with 4 threads the blocks of 8 combine into slab-shaped regions of the grid]


Multi-dimensional data – Case 3
slabs are contiguous in memory → blocking factor is the product of 2 dimensions
      shared [8*8] double grid[8][8][8]   with THREADS == 4
[Diagram: each thread holds two complete 8x8 slabs of the grid]

Why is the distribution important?
it is all about performance and minimising the cost of reading and writing data…
accessing shared data which resides in a physically remote location is more expensive than accessing shared data which has affinity with the thread!
optimise the layout of data by using knowledge of the problem size and, if possible, the number of threads


Static vs. dynamic compilation (1)
the number of UPC threads can be specified either at compile time (static) or at runtime (dynamic)
o Cray compiler: -X numThreads
Advantages
o dynamic: the program can be executed using any number of threads
o static: easier to distribute data based on THREADS
Disadvantages
o dynamic: not always possible to achieve the best possible distribution
o static: the program must be executed with the number of threads specified at compile time

Static vs. dynamic compilation (2)
"An array declaration is illegal if THREADS is specified at runtime and the number of elements to allocate at each thread depends on THREADS."
legal for static and dynamic environments:
      shared int x[4*THREADS];
      shared[] int x[8];
illegal for the dynamic environment:
      shared int x[8];
      shared[] int x[THREADS];
      shared int x[10+THREADS];


Static vs. dynamic compilation (3)
static compilation can often give better performance, as the data distribution is much easier to control
dynamic compilation provides greater flexibility, however the optimal data distribution may not always be achievable
→ a tradeoff between performance and convenience

Synchronisation
needed to ensure all threads reach the same point in the execution flow
o memory and data consistency
o two types of synchronisation: blocking and non-blocking
blocking synchronisation makes all threads wait at a barrier until the last thread has reached that barrier before allowing execution to continue
non-blocking synchronisation allows some local computation to be executed while waiting for the other threads


Barrier
      upc_barrier exp_opt
1. all threads execute the code that requires synchronisation
2. once finished, they wait at the barrier
3. when the last thread reaches the barrier, all threads are released to continue execution
[Diagram: threads t0 … tn all stop at upc_barrier before continuing]

Split-phase barrier
      upc_notify exp_opt   and   upc_wait exp_opt
1. a thread finishes the work that requires synchronisation → upc_notify to inform the others
2. the thread performs local computations → once finished, it waits
3. when all threads have executed upc_notify, the threads waiting at the barrier can continue
[Diagram: threads t0 … tn issue upc_notify (non-blocking), perform local computation, then block in upc_wait]
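A small sketch of the split-phase pattern (my own illustration, not from the slides): the thread notifies as soon as its shared data is ready, overlaps purely local work with the synchronisation, and only blocks in upc_wait:

    /* Sketch (not from the slides): overlapping local computation with a
       split-phase barrier. */
    #include <upc.h>

    #define N 64

    shared [N] double halo[N*THREADS];   /* data other threads will read */
    double scratch[N];                   /* purely private work array */

    void step(void) {
        int i;

        /* update the block of shared data this thread owns */
        for (i = 0; i < N; i++)
            halo[MYTHREAD*N + i] = 1.0 * i;

        upc_notify;    /* tell the other threads our shared data is ready */

        /* local work that touches no shared data */
        for (i = 0; i < N; i++)
            scratch[i] = 0.5 * scratch[i] + i;

        upc_wait;      /* block only until every thread has notified */

        /* it is now safe to read the other threads' halo blocks */
    }

    int main(void) {
        step();
        return 0;
    }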


Debugging barriers
the optional value exp can be used to check that all threads have reached the same barrier
if a thread executes a barrier with a different exp tag than the other threads, the application reports an expression mismatch and aborts
→ very useful for making sure that all threads are on the intended execution path

Work sharing revisited
the 4th parameter in the upc_forall loop represents affinity
→ it is evaluated to decide whether MYTHREAD will execute an iteration
affinity is an integer expression
→ the iteration is executed when affinity % THREADS == MYTHREAD
affinity is a pointer-to-shared
→ the iteration is executed when the object pointed to has affinity to MYTHREAD, i.e. upc_threadof(affinity) == MYTHREAD


Example: vector addition (1/3)
the three vectors are distributed in a cyclic fashion with the default blocking factor of 1
the modulo function identifies the local elements per thread
o if the distribution changes, this code will fail to identify the local elements
o it will still produce the correct result!
      #include <upc.h>
      #define N 100*THREADS
      shared int v1[N], v2[N], v1plusv2[N];

      void main() {
          int i;
          for(i=0; i<N; i++)
              if(MYTHREAD == i%THREADS)       /* modulo test picks the local elements */
                  v1plusv2[i] = v1[i] + v2[i];
      }


Example: Vector addition (3/3)
Advantage of the affinity parameter as an integer expression: simple syntax
→ if the distribution changes, upc_forall will still behave correctly and identify the local elements (for round-robin, unless modified)
      #include <upc.h>
      #define N 100*THREADS
      shared int v1[N], v2[N], v1plusv2[N];

      void main() {
          int i;
          /* the affinity parameter is the integer expression i */
          upc_forall(i=0; i<N; i++; i)
              v1plusv2[i] = v1[i] + v2[i];
      }
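Part (2/3) of this example is missing from this transcription; the usual alternative it would contrast with is a pointer-to-shared affinity expression, which follows whatever blocking factor the arrays were declared with. A sketch of that form (my own, not the slide's code):

    /* Sketch (not the slides' exact code): pointer-to-shared affinity.
       Iteration i runs on the thread that owns v1plusv2[i], whatever the
       blocking factor of the arrays. */
    #include <upc.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], v1plusv2[N];

    int main(void) {
        int i;
        upc_forall(i = 0; i < N; i++; &v1plusv2[i])
            v1plusv2[i] = v1[i] + v2[i];
        return 0;
    }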


Data distribution
Matrix-vector multiplication c = A*b
[Diagram: with 3 threads and the default cyclic distribution, the elements of A, b and c are scattered across the threads, so most of the operands each thread needs are remote]
number of remote operations:
      C0 = A0,0*B0 + A0,1*B1 + A0,2*B2    4
      C1 = A1,0*B0 + A1,1*B1 + A1,2*B2    4
      C2 = A2,0*B0 + A2,1*B1 + A2,2*B2    4

We can do better
distribute matrix a in blocks of size THREADS
→ each row is placed locally on one thread
      #include <upc.h>
      shared [THREADS] int a[THREADS][THREADS];
      shared int b[THREADS], c[THREADS];

      void main (void)
      {
          int i, j;

          upc_forall( i = 0 ; i < THREADS ; i++; &c[i]) {
              c[i] = 0;
              for ( j = 0 ; j < THREADS ; j++)
                  c[i] += a[i][j]*b[j];
          }
      }


Improved data distribution
[Diagram: with the blocked distribution, row i of A and element c[i] both live on thread i; only two of the b elements each thread needs are remote]
number of remote operations:
      C0 = A0,0*B0 + A0,1*B1 + A0,2*B2    2
      C1 = A1,0*B0 + A1,1*B1 + A1,2*B2    2
      C2 = A2,0*B0 + A2,1*B1 + A2,2*B2    2

Summary
correct data distribution is important for performance
keep the number of remote reads/writes as low as possible
UPC gives programmers control over data layout and work sharing
→ it is important to be aware of the performance implications
→ aim to keep work sharing loops independent of the data distribution


UPC: UPC Pointers, Dynamic Memory Allocation, UPC Collectives
Dr Michèle Weiland
Applications Consultant, EPCC
m.weiland@epcc.ed.ac.uk

Advanced use of UPC
‣ C and UPC pointers
‣ dynamic memory allocation
‣ locks
‣ UPC collectives


C pointers
A pointer in C is a data type whose value points to another variable's memory address
      float array[2];
      float* ptr = &array[0]; // Value of ptr is 213
[Diagram: ptr holds the address 213, where array[] starts in memory]

Pointer arithmetic
change which object a pointer refers to through pointer arithmetic:
o it is type dependent
o incrementing a float pointer will move it by sizeof(float)
      float array[2];
      float* ptr = &array[0]; // Value of ptr is 213
      ptr++;                  // Value of ptr is now 217
[Diagram: ptr advances from address 213 to 217, i.e. by one float]


UPC pointers
similar concept as in C
pointers are variables that contain the addresses of other variables
UPC pointers can
→ reside in the private or the shared memory space
→ reference the private or the shared memory space

Types of pointers
private to private:   int *p1;
private to shared:    shared int *p2;
shared to private:    int *shared p3;   (not recommended)
shared to shared:     shared int *shared p4;
[Diagram: p1 and p2 live in each thread's private space while p3 and p4 live in the shared space; p2 and p4 point into shared memory, p1 and p3 into private memory]


UPC pointers
UPC pointers-to-shared have three fields
o thread : the thread affinity of the pointer
o address : the virtual address of the block
o phase : indicates the element location within that block
      [ thread | block address | phase ]
the values of these fields are obtained from the functions
      size_t upc_threadof (shared void *ptr)
      size_t upc_phaseof (shared void *ptr)
      size_t upc_addrfield (shared void *ptr)

Pointer properties (1/3)
pointer arithmetic takes the blocking factor into account
      shared int x[16];               // shared int array
      shared int *dp = &x[5], *dp1;   // private pointers to shared
      dp1 = dp + 9;                   // default blocking factor 1
[Diagram: with the default cyclic layout over 4 threads, dp points at x[5] on thread 1 and dp+9 lands on x[14] on thread 2]


Pointer properties (2/3)
the pointer follows its own blocking factor
      shared [3] int x[16];
      shared [3] int *dp = &x[5], *dp1;
      dp1 = dp + 6;   // blocking factor 3
[Diagram: with blocks of 3 over 4 threads, dp points at x[5] on thread 1 and dp+6 lands on x[11] on thread 3]

Pointer properties (3/3)
‣ casting a shared pointer to a private pointer is allowed, but not the other way around
‣ casting a shared pointer to a private pointer will result in a loss of information
  ‣ thread & phase
‣ the cast is only well defined if the object pointed to by the shared pointer has local affinity


Pointer casting
casting a shared pointer to a private pointer results in information loss
      shared [3] int x[16];
      shared int *dp = &x[5];
      int *ptr;
      ptr = (int *) dp;    // ptr != upc_addrfield(dp)
[Diagram: dp carries phase 2, thread 1 and block address 0003FFB008; after the cast, ptr holds only the local address 00AFF53008]

Dynamic memory allocation
so far we have only seen how to allocate memory statically
dynamic memory allocation in UPC is of course possible
→ provides flexibility
→ allows object sizes to change during runtime
→ this is one of the cases where pointers are required
dynamic memory allocation in the private space is done using the standard C functions
dynamic memory allocation in the shared space is achieved using special UPC functions
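A short sketch of the legitimate use of such a cast (my own illustration, not from the slides): converting to a plain C pointer for cheap access to elements that have local affinity anyway:

    /* Sketch (not from the slides): casting to a private pointer is only
       well defined for elements with affinity to the calling thread. */
    #include <upc.h>
    #include <stdio.h>

    shared int x[10*THREADS];    /* default cyclic distribution */

    int main(void) {
        int i, sum = 0;
        int *local;              /* private pointer: cheap to dereference */

        for (i = MYTHREAD; i < 10*THREADS; i += THREADS) {
            /* x[i] has affinity to MYTHREAD, so the cast is well defined */
            local = (int *) &x[i];
            sum += *local;
        }
        printf("thread %d: local sum = %d\n", MYTHREAD, sum);
        return 0;
    }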


Dynamic memory allocation (2)
two types of memory allocation functions
o non-collective: upc_global_alloc, upc_alloc
o collective: upc_all_alloc
collective calls are made by all threads and return the same address value (pointer) to all of them
non-collective calls can be executed by multiple threads; each call will allocate a different shared block
free the allocated memory using upc_free
o not a collective call

Non-collective dynamic allocation
each thread allocates a memory block in its own shared memory space
      shared [] int *ptr;
      ptr = (shared [] int *) upc_alloc(N*sizeof(int));
[Diagram: each calling thread gets its own block of N ints in its part of the shared space, referenced through a private pointer]


Non-collective dynamic allocation (2)
      shared [N] int *ptr;
      ptr = (shared [N] int *) upc_global_alloc(THREADS, N*sizeof(int));
[Diagram: each calling thread allocates its own array of THREADS blocks of N ints, spread across the shared space of all threads]

Collective dynamic allocation
allocate contiguous segments of shared memory
      shared [N] int *ptr;
      ptr = (shared [N] int *) upc_all_alloc(THREADS, N*sizeof(int));
[Diagram: a single array of THREADS blocks of N ints is allocated, one block with affinity to each thread; every thread's ptr refers to the same object]
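Putting the collective form to work (my own sketch, not from the slides): every thread makes the same upc_all_alloc call and receives the same pointer-to-shared, each fills its own block, and a single thread frees the space once everyone has finished:

    /* Sketch (not from the slides): collective allocation of a blocked
       shared array at runtime, filled with upc_forall, then freed. */
    #include <upc.h>
    #include <stdio.h>

    #define N 8   /* elements per thread; illustrative value */

    int main(void) {
        shared [N] int *data;
        int i;

        /* collective call: all threads get the same pointer */
        data = (shared [N] int *) upc_all_alloc(THREADS, N * sizeof(int));

        /* iteration i runs on the thread that owns data[i] */
        upc_forall(i = 0; i < N * THREADS; i++; &data[i])
            data[i] = i;

        upc_barrier;    /* everyone has finished writing */

        if (MYTHREAD == 0) {
            printf("last element = %d\n", (int) data[N*THREADS - 1]);
            upc_free(data);    /* free once, from a single thread */
        }
        return 0;
    }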


Locks
access control mechanism for critical sections
→ sections which should be executed by one thread at a time, i.e. serialised execution
UPC shared data type: upc_lock_t
→ can have one of two states: locked or unlocked
→ can be seen by all threads
→ locks need to be manipulated through pointers

Program flow with locks
[Diagram: a thread obtains the lock and enters the critical section, so no other thread can enter; when it completes the critical section it unlocks, and the next thread can obtain the lock]


Creating locks
the initial state of a new lock object is unlocked
locks can be created collectively
→ the return value on every thread points to the same object
      upc_lock_t *upc_all_lock_alloc(void);
or non-collectively
→ all threads that call the function obtain different locks
      upc_lock_t *upc_global_lock_alloc(void);
resources allocated by locks need to be freed
      upc_lock_free(upc_lock_t *ptr);

Using locks
threads need to lock and unlock
      upc_lock(upc_lock_t *ptr);          /* blocking */
      upc_lock_attempt(upc_lock_t *ptr);  /* non-blocking */
      upc_unlock(upc_lock_t *ptr);
[Diagram: upc_lock(lock): if the lock is already locked, try again; once it is unlocked, obtain the lock, complete the critical section, then call upc_unlock(lock)]
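As a small illustration of this API (my own sketch, not from the slides), each thread adds its contribution to a shared total inside a lock-protected critical section:

    /* Sketch (not from the slides): protecting a shared accumulator with
       a collectively allocated lock. */
    #include <upc.h>
    #include <stdio.h>

    shared double total;   /* static shared storage is zero-initialised */

    int main(void) {
        /* every thread gets a pointer to the same lock object */
        upc_lock_t *lock = upc_all_lock_alloc();
        double my_contribution = 1.0 + MYTHREAD;

        upc_lock(lock);                    /* enter the critical section */
        total = total + my_contribution;   /* read-modify-write is now safe */
        upc_unlock(lock);                  /* leave it */

        upc_barrier;                       /* all updates are complete */

        if (MYTHREAD == 0) {
            printf("total = %f\n", total);
            upc_lock_free(lock);
        }
        return 0;
    }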


Example: Dining Philosophers
[Diagram: philosophers sit around a table with one fork between each pair of neighbours; a philosopher needs both neighbouring forks to eat]

Example: Dining Philosophers (contd)
model the forks as a shared array of locks; each thread then allocates a lock (i.e. a fork) dynamically and stores a pointer to it:
      upc_lock_t *shared fork[THREADS];
      fork[MYTHREAD] = upc_global_lock_alloc();
now attempt to get the locks (forks) on either side:
      left_fork = upc_lock_attempt(fork[MYTHREAD]);
      right_fork = upc_lock_attempt(fork[(MYTHREAD+1)%THREADS]);


Example: Dining Philosophers (contd)
if both forks were obtained, eat the meal and release the forks when finished:
      upc_unlock(fork[MYTHREAD]);
      upc_unlock(fork[(MYTHREAD+1)%THREADS]);
if only one fork could be obtained, release that fork and try again until two forks become available

UPC collectives
supported by most compilers
o readable code
o but not necessarily optimised for performance
requires a separate header file
      #include <upc_collective.h>


Collectives
Two types of collective operations are defined as part of the UPC standard specification:
1. relocalisation collectives
   upc_all_broadcast, upc_all_scatter, upc_all_gather, upc_all_gather_all, upc_all_exchange, upc_all_permute
2. computational collectives
   upc_all_reduceT, upc_all_prefix_reduceT, upc_all_sort
Supported operations: UPC_ADD, UPC_MULT, UPC_AND, UPC_OR, UPC_XOR, UPC_LOGAND, UPC_LOGOR, UPC_MIN, UPC_MAX
user-specified functions are supported via: UPC_FUNC, UPC_NONCOMM_FUNC
→ Calls to these functions must be performed by all threads

Computational collectives
11 variations of upc_all_reduceT and upc_all_prefix_reduceT
→ T is replaced with the type code used in the reduction operation:
      C  signed char        L  signed long
      UC unsigned char      UL unsigned long
      S  signed short       F  float
      US unsigned short     D  double
      I  signed int         LD long double
      UI unsigned int


Broadcast
      shared[] int A[2];
      shared[2] int B[8];
      upc_all_broadcast(B, A, 2*sizeof(int), UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
[Diagram: the two elements of A, which live on thread 0, are copied into every thread's block of B]

Scatter
      shared[] int A[8];
      shared[2] int B[8];
      upc_all_scatter(B, A, 2*sizeof(int), UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
[Diagram: A[0..7] on thread 0 is split into blocks of two; thread i receives A[2i] and A[2i+1] in its block of B]


Gather
      shared[] int A[8];
      shared[2] int B[8];
      upc_all_gather(A, B, 2*sizeof(int), UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
[Diagram: each thread's two-element block of B is collected into A[0..7] on thread 0]

Reduction
      shared[4] double A[16];
      shared double B;
      upc_all_reduceD(&B, A, UPC_ADD, 16, 4, NULL, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
[Diagram: the 16 elements of A, four per thread, are summed into the scalar B on thread 0]


Final summary
PGAS extension to C
→ data distribution & work sharing
→ pointers, dynamic memory allocation
→ synchronisation through barriers & locks
→ collective operations

References
http://upc.gwu.edu/documentation.html
→ Language Specification (version 1.2)
→ UPC Manual
→ UPC Collective Operations Specification (version 1.0)


Parallel Programming with Fortran Coarrays
David Henty, Alan Simpson (EPCC)
Harvey Richardson, Bill Long (Cray)

Overview
• Parallel Programming with Fortran Coarrays
• The Fortran Programming Model
• Basic coarray features
• Further coarray features
• Advanced coarray features
• Experiences with coarrays


The Fortran Programming Model

Motivation
• Fortran now supports parallelism as a full first-class feature of the language
• Changes are minimal
• Performance is maintained
• Flexibility in expressing communication patterns


Programming models for HPC
• The challenge is to efficiently map a problem to the architecture we have
  • Take advantage of all computational resources
  • Manage distributed memories etc.
  • Optimal use of any communication networks
• The HPC industry has long experience in parallel programming
  • Vector, threading, data-parallel, message-passing etc.
• We would like to have models, or combinations, that are
  • efficient
  • safe
  • easy to learn and use

Why consider new programming models?
• Next-generation architectures bring new challenges:
  • Very large numbers of processors with many cores
  • Complex memory hierarchy
  • even today (2011) we are at 500k cores
• Parallel programming is hard; we need to make this simpler
• Some of the models we currently use are
  • bolt-ons to existing languages as APIs or directives
  • Hard to program for the underlying architecture
  • unable to scale due to overheads
• So, is there an alternative to the models prevalent today?
  • Most popular are OpenMP and MPI …


Fortran 2008 coarray model
• An example of a Partitioned Global Address Space (PGAS) model
• Set of participating processes, like MPI
• Participating processes have access to local memory via standard program mechanisms
• Access to remote memory is directly supported by the language

Fortran coarray model
[Diagram: each process has its own memory and CPU; the language lets a process access the memory of the other processes directly]


Fortran coarray model
[Diagram: the statement a = b[3] on one process reads the variable b directly from the memory of image 3]

Fortran coarrays
• Remote access is a full feature of the language:
  • Type checking
  • Opportunity to optimise communication
• No penalty for local memory access
• Single-sided programming model is more natural for some algorithms
  • and a good match for modern networks with RDMA


Fortran coarrays: Basic Features

Coarray Fortran
"Coarrays were designed to answer the question:
'What is the smallest change required to convert Fortran into a robust and efficient parallel language?'
The answer: a simple syntactic extension.
It looks and feels like Fortran and requires Fortran programmers to learn only a few new rules."
John Reid, ISO Fortran Convener


Some History
• Introduced in the current form by Numrich and Reid in 1998 as a simple extension to Fortran 95 for parallel processing
• Many years of experience, mainly on Cray hardware
• A set of core features are now part of the Fortran standard ISO/IEC 1539-1:2010
• Additional features are expected to be published in a Technical Specification in due course

How Does It Work?
• SPMD - Single Program, Multiple Data
  • a single program is replicated a fixed number of times
• Each replication is called an image
• Images are executed asynchronously
  • the execution path may differ from image to image
  • some situations cause images to synchronise
• Images access remote data using coarrays
• Normal rules of Fortran apply


What are coarrays?
• Arrays or scalars that can be accessed remotely
  • images can access data objects on any other image
• Additional Fortran syntax for coarrays
  • Specifying a codimension declares a coarray
        real, dimension(10), codimension[*] :: x
        real :: x(10)[*]
  • these are equivalent declarations of an array x of size 10 on each image
  • x is now remotely accessible
  • coarrays have the same size on each image!

Accessing coarrays
        integer :: a(4)[*], b(4)[*]   ! declare coarrays
        b(:) = a(:)[n]                ! copy
• integer arrays a and b declared to be of size 4 on all images
• copy array a from remote image n into local array b
• () for local access, [] for remote access
• e.g. for two images and n = 2:
[Diagram: image 1 has a = (1,2,3,4) and image 2 has a = (2,9,3,7); after the copy, b on both images holds image 2's a, i.e. (2,9,3,7)]


Synchronisation
• Be careful when updating coarrays:
  • If we get remote data, was it valid?
  • Could another process send us data and overwrite something we have not yet used?
  • How do we know that data sent to us has arrived?
• Fortran provides synchronisation statements
  • For example, a barrier for synchronisation of all images:
        sync all
• do not make assumptions about execution timing on images
  • unless executed after synchronisation
• Note there is implicit synchronisation at program start

Retrieving information about images
• Two intrinsics provide the index of this image and the number of images
  • this_image()  (image indexes start at 1)
  • num_images()
        real :: x[*]
        if (this_image() == 1) then
          read *, x
          do image = 2, num_images()
            x[image] = x
          end do
        end if
        sync all


Making remote references
• We used a loop over images
        do image = 2, num_images()
          x[image] = x
        end do
• Note that array indexing within the coindex is not allowed, so we cannot write
        x[2:num_images()] = x   ! illegal

Data usage
• coarrays have the same size on every image
• Declarations:
  • round brackets () describe the rank, shape and extent of the local data
  • square brackets [] describe the layout of the images that hold the local data
• Many HPC problems have physical quantities mapped to n-dimensional grids
• You need to implement your view of the global data from the local coarrays, as Fortran does not provide the global view
  • You can be flexible with the coindexing (see later)
  • You can use any access pattern you wish


Data usage
• print out a 16-element "global" integer array A from 4 processors
  • 4 elements per processor = 4-element coarrays on 4 images
        integer :: ca(4)[*]
        do image = 1, num_images()
          print *, ca(:)[image]
        end do
[Diagram: the global array A is assembled as ca(1:4)[1], ca(1:4)[2], ca(1:4)[3], ca(1:4)[4] across images 1-4]

1D cyclic data access
• coarray declarations remain unchanged
  • but we use a cyclic access pattern
        integer :: ca(4)[*]
        do i = 1, 4
          do image = 1, num_images()
            print *, ca(i)[image]
          end do
        end do
[Diagram: the global array is now traversed element by element, round-robin across the images]


Synchronisation
• code execution on images is independent
  • the programmer has to control execution using synchronisation
• synchronise before accessing coarrays
  • ensure content is not updated from remote images before you can use it
• synchronise after accessing coarrays
  • ensure new content is available to all images
• implicit synchronisation after variable declarations, at the first executable statement
  • guarantees coarrays exist on all images when your first program statement is executed
• We will revisit this topic later

Example: maximum of array
        real :: a(10)
        real :: maximum[*]
        ! implicit synchronisation here

        call random_number(a)
        maximum = maxval(a)

        sync all   ! ensure all images have set their local maximum

        if (this_image() == 1) then
          do image = 2, num_images()
            maximum = max(maximum, maximum[image])
          end do
          do image = 2, num_images()
            maximum[image] = maximum
          end do
        end if

        sync all   ! ensure all images have a copy of the maximum value


Recap
We now know the basics of coarrays
• declarations
• references with []
• this_image() and num_images()
• sync all
Now consider a full program example...

Example 2: Calculate density of primes
        program pdensity
        implicit none
        integer, parameter :: n=10000000, nimages=8
        integer start, end, i
        integer, dimension(nimages) :: nprimes[*]
        real density

        start = (this_image()-1) * n/num_images() + 1
        end = start + n/num_images() - 1

        nprimes(this_image())[1] = num_primes(start, end)
        sync all


Example 2: Calculate density of primes (continued)
        if (this_image() == 1) then
          nprimes(1) = sum(nprimes)
          density = real(nprimes(1))/n
          print *, "Calculating prime density on", &
                   num_images(), "images"
          print *, nprimes(1), 'primes in', n, 'numbers'
          write(*,'(" density is ",2Pf0.2,"%")') density
          write(*,'(" asymptotic theory gives ", &
                   &2Pf0.2,"%")') 1.0/(log(real(n))-1.0)
        end if

Example 2: sample output
        Calculating prime density on 2 images
        664580 primes in 10000000 numbers
        density is 6.65%
        asymptotic theory gives 6.61%
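The num_primes function that the program calls is not shown on the slides. A deliberately simple version, written as my own sketch rather than the course's code, could be placed after a contains statement in program pdensity:

        ! Sketch (not from the slides): a naive num_primes the example
        ! could use, placed after a CONTAINS statement in program pdensity.
        integer function num_primes(first, last)
          implicit none
          integer, intent(in) :: first, last
          integer :: i, j
          logical :: isprime

          num_primes = 0
          do i = max(first, 2), last
            isprime = .true.
            do j = 2, int(sqrt(real(i)))
              if (mod(i, j) == 0) then
                isprime = .false.
                exit
              end if
            end do
            if (isprime) num_primes = num_primes + 1
          end do
        end function num_primes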


Launching a coarray program
• The Fortran standard does not specify how a program is launched
• The number of images may be set at compile, link or run time
• A compiler could optimise for a single image
• Examples on Linux
  • Cray XE:
        aprun -n 16000 solver
  • g95:
        ./solver --g95 -images=2

Observations so far on coarrays
• Natural extension, easy to learn
• Makes the parallel parts of a program obvious (syntax)
• Part of the Fortran language (type checking, etc.)
• No mapping of data to buffers (or copying) or creation of complex types (as we might have with MPI)
• The compiler can optimise for communication
• More observations later...


Exercise Session 1
• Look at the Exercise Notes document for full details
• Write, compile and run a "Hello World" program that prints out the value of the running image's image index and the number of images
• Extend the simple Fortran code provided in order to perform operations on parts of a picture using coarrays

Additional Slides
Comparison of Programming Models


Shared-memory directives and OpenMP
[Diagram: several threads all operating on one shared memory]

OpenMP: work distribution
        !$OMP PARALLEL DO
        do i=1,32
          a(i)=a(i)*2
        end do
[Diagram: the 32 iterations are split between the threads (1-8, 9-16, 17-24, 25-32), all updating the shared array a in memory]


OpenMP implementation
[Diagram: the threads of a single process map onto the CPUs of one shared-memory node]

Shared Memory Directives
• Multiple threads share global memory
• Most common variant: OpenMP
• Program loop iterations distributed to threads, more recent task features
• Each thread has a means to refer to private objects within a parallel context
• Terminology
  • Thread, thread team
• Implementation
  • Threads map to user threads running on one SMP node
  • Extensions to distributed memory not so successful
• OpenMP is a good model to use within a node


Cooperating Processes Models
[Diagram: a problem decomposed among a set of cooperating processes]

Message Passing, MPI
[Diagram: each process has its own memory and CPU; data is exchanged only by messages between the processes]


MPI
[Diagram: process 0 calls MPI_Send(a,...,1,…) and process 1 calls the matching MPI_Recv(a,...,0,…)]

Message Passing
• Participating processes communicate using a message-passing API
• Remote data can only be communicated (sent or received) via the API
• MPI (the Message Passing Interface) is the standard
• Implementation: MPI processes map to processes within one SMP node or across multiple networked nodes
• The API provides process numbering, point-to-point and collective messaging operations
• Mostly used in a two-sided way: each endpoint coordinates in sending and receiving
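For comparison with the single-sided examples earlier, here is a minimal two-sided MPI exchange (my own sketch, not from the slides) in which both endpoints take part; it assumes at least two processes:

    /* Sketch (not from the slides): two-sided message passing with MPI.
       Run with at least two processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, a = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            a = 42;
            MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* sender takes part */
        } else if (rank == 1) {
            MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                     /* receiver takes part */
            printf("process 1 received %d\n", a);
        }

        MPI_Finalize();
        return 0;
    }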


SHMEM
[Diagram: process 0 calls shmem_put(a,b,…) to write directly into the memory of process 1]

SHMEM
• Participating processes communicate using an API
• Fundamental operations are based on one-sided PUT and GET
• Need to use symmetric memory locations
• The remote side of the communication does not participate
• Can test for completion
• Barriers and collectives
• Popular on Cray and SGI hardware; also a Blue Gene version
• To make sense it needs hardware support for low-latency RDMA-type operations
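To make the PUT operation concrete, here is a minimal OpenSHMEM sketch in C (my own illustration, not from the slides): every PE writes into a symmetric variable on its right-hand neighbour, and a barrier guarantees completion before the data is read.

    /* Sketch (not from the slides): one-sided PUT with OpenSHMEM. */
    #include <shmem.h>
    #include <stdio.h>

    int dest;   /* global, therefore symmetric: same address on every PE */

    int main(void) {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();
        int value = 100 + me;

        /* write our value into the next PE's copy of dest */
        shmem_int_put(&dest, &value, 1, (me + 1) % npes);

        shmem_barrier_all();   /* all puts are complete and visible */
        printf("PE %d received %d\n", me, dest);

        shmem_finalize();
        return 0;
    }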


High Performance Fortran (HPF)
• Data Parallel programming model
• Single thread of control
• Arrays can be distributed and operated on in parallel
• Loosely synchronous
• Parallelism mainly from Fortran 90 array syntax, FORALL and intrinsics
• This model was popular on SIMD hardware (AMT DAP, Connection Machines) but was extended to clusters, where the control thread is replicated

HPF
[Diagram: a single control thread; the array data is distributed across the memories of several processing elements]
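A tiny illustration of the HPF style (my own sketch, not from the slides): a directive describes how the array is distributed, and ordinary Fortran 90 array syntax then operates on it in parallel:

        ! Sketch (not from the slides): HPF-style data-parallel code.
        real :: a(1024)
        !HPF$ DISTRIBUTE a(BLOCK)
        a = 1.0
        a = sqrt(a)     ! applied in parallel across the distributed array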


HPF
• Diagram: an array statement such as A = SQRT(A) operates on the distributed array A across all processing elements

UPC
• Diagram: each UPC thread has its own memory and CPU, forming a partitioned global address space


UPC
• Diagram: work distribution across threads using a upc_forall loop (upc_forall(i=0;i...), the UPC analogue of the OpenMP parallel do shown earlier


Fortran 2008 coarray model
• Example of a Partitioned Global Address Space (PGAS) model
• Set of participating processes like MPI
• Participating processes have access to local memory via standard program mechanisms
• Access to remote memory is directly supported by the language

Fortran coarray model
• Diagram: each process (image) has its own memory and CPU


Fortran coarray model
• Diagram: the statement a = b[3] reads the coarray b from image 3 directly into a local variable (see the sketch below)

Fortran coarrays
• Remote access is a full feature of the language:
  • Type checking
  • Opportunity to optimize communication
  • No penalty for local memory access
• Single-sided programming model more natural for some algorithms
• and a good match for modern networks with RDMA
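A minimal sketch of the remote read shown in the diagram; the sync all is needed so that b is defined everywhere before any other image reads it (variable names follow the diagram, the printed output is illustrative):

  program remote_read
    implicit none
    real :: a, b[*]            ! b is a scalar coarray: one copy per image
    b = real(this_image())
    sync all                   ! image control statement: b now defined on every image
    if (num_images() >= 3) then
       a = b[3]                ! read image 3's copy of b, as in the diagram
       print *, 'image', this_image(), 'read b[3] =', a
    end if
  end program remote_read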




Parallel Programming with Fortran Coarrays: Overview of Exercises
David Henty, Alan Simpson (EPCC)
Harvey Richardson, Bill Long (Cray)

Exercise 1
• Hello world example (a minimal sketch follows below)
  • check you can log on, compile, submit and run
• Writing arrays as pictures
  • declare and manipulate coarrays
  • write out arrays in PGM picture format
  • view them using display from ImageMagick
  • use both remote reads and remote writes
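A minimal sketch of the "Hello world" part of Exercise 1; the exact wording of the output is up to you:

  program hello
    implicit none
    ! Every image executes this; each prints its own index and the total image count
    write(*,*) 'Hello from image ', this_image(), ' of ', num_images()
  end program hello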


Sample output on 4 images
• Diagram: example picture output decomposed across 4 images

Exercise 2
• Perform simple edge detection of features in a picture
  • halo communication between a 1D grid of images
• Reconstruct the picture from the supplied edges
  • an iterative algorithm
  • computationally intensive so worth parallelising
• Terminate based on some stopping criterion
  • requires global sums
• Use global or point-to-point synchronisation
• Look at scalability


Edge detection and picture reconstruction
• Diagram: edge detection is a single pass; reconstruction takes hundreds of iterations

Exercise 3 (Extra)
• Decompose the picture across a 2D grid of images
  • using multiple codimensions


Documentation
• Full instructions in the exercise notes
  • PDF copy in doc/ subdirectory
• Go at your own pace
  • no direct dependencies between practicals & lectures
  • each exercise follows on from the last
• If you're not sure what to do or if you have any other questions then please ask us!


More Coarray Features
Parallel Programming with Fortran Coarrays
David Henty, Alan Simpson (EPCC)
Harvey Richardson, Bill Long (Cray)

Overview
• Multiple Dimensions and Codimensions
• Allocatable Coarrays and Components of Coarray Structures
• Coarrays and Procedures


Mapping data to images
• Diagram: a physical quantity (PRESSURE) held in an array P(m,n) can be mapped onto the images as P(m,n)[*] or P(m,n)[k,*]

2D Data
• Coarray Fortran has a "bottom-up" approach to global data
  • assemble rather than distribute
  • unlike HPF ("top-down") or UPC shared distributed data
• Can assemble a 2D data structure from 1D arrays
  integer :: ca(4)[*]
• Diagram: each image holds its own ca(4); together they form a 2D structure


2D Data
• However, images are not restricted to a 1D arrangement
• For example, we can arrange images in a 2x2 grid
  • coarrays with 2 codimensions
  integer :: ca(4)[2,*]
• Diagram: image 1 holds ca(:)[1,1], image 2 holds ca(:)[2,1], image 3 holds ca(:)[1,2] and image 4 holds ca(:)[2,2]

2D Local Array on 2D Grid
• Diagram: a 2D array A assembled from ca(1:4,1:4)[1,1] (image 1), ca(1:4,1:4)[2,1] (image 2), ca(1:4,1:4)[1,2] (image 3) and ca(1:4,1:4)[2,2] (image 4)
• global access: ca(3,1)[2,2]
• local access: ca(3,1)
• (see the sketch below)
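A small sketch of the 2x2 arrangement above, assuming the program is run on exactly 4 images; the printed values are illustrative:

  program grid2x2
    implicit none
    integer :: ca(4)[2,*]          ! images arranged as a 2 x 2 grid for this coarray
    integer :: me(2)
    ca = this_image()              ! fill my local ca with my image index
    me = this_image(ca)            ! my (row, column) cosubscripts in the 2 x 2 grid
    sync all                       ! every image's ca is defined before any remote read
    if (this_image() == 1) then
       print *, 'ca(1) held by the image at [2,2] is ', ca(1)[2,2]   ! global access
    end if
    print *, 'image', this_image(), 'has cosubscripts', me
  end program grid2x2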


Coarray Subscripts
• Fortran arrays are defined by rank, bounds and shape
  integer, dimension(10,4) :: array
  • rank 2
  • lower bounds 1, 1; upper bounds 10, 4
  • shape [10, 4]
• Coarray Fortran adds corank, cobounds and coshape
  integer :: array(10,4)[3,*]
  • corank 2
  • lower cobounds 1, 1; upper cobounds 3, m
  • coshape [3, m]
  • m would be ceiling(num_images()/3)

Multiple Codimensions
• Coarrays with multiple codimensions:
  • character :: a(4)[2,*]      ! 2D grid of images: for 4 images the grid is 2x2; for 16 images it is 2x8
  • real :: b(8,8,8)[10,5,*]    ! 3D grid of images: 8x8x8 local array; with 150 images the grid is 10x5x3
  • integer :: c(6,5)[0:9,0:*]  ! 2D grid of images: lower cobounds [0, 0]; upper cobounds [9, n]; useful if you want to interface with MPI or want C-like coding
• Sum of rank and corank should not exceed 15
• Flexibility with cobounds
  • can set all but the final upper cobound as required


Codimensions: What They Mean
• Images are organised into a logical 2D, 3D, ... grid
  • for that coarray only
• A map so an image can find the coarray on any other image
  • access the coarray using its grid coordinates
• e.g. character a(4)[2,*] on 6 images
  • gives a 2 x 3 image grid
  • usual Fortran subscript order determines the image index
• Diagram: axis 1 has extent 2 and axis 2 has extent 3, so image 1 is at [1,1], image 2 at [2,1], image 3 at [1,2], image 4 at [2,2], image 5 at [1,3] and image 6 at [2,3]; a(4)[1,2] lives on image 3 and a(2)[2,3] on image 6

Codimensions and Array-Element Order
• Storage order for multi-dimensional Fortran arrays: real p(2,3,8)
  • location 1 holds p(1,1,1), 2 holds p(2,1,1), 3 holds p(1,2,1), 4 holds p(2,2,1), 5 holds p(1,3,1), 6 holds p(2,3,1), 7 holds p(1,1,2), 8 holds p(2,1,2), ..., 48 holds p(2,3,8)
• Ordering of images in multi-dimensional cogrids follows the same rule: real q(4)[2,3,*]
  • image 1 holds q(1:4)[1,1,1], image 2 q(1:4)[2,1,1], image 3 q(1:4)[1,2,1], image 4 q(1:4)[2,2,1], image 5 q(1:4)[1,3,1], image 6 q(1:4)[2,3,1], image 7 q(1:4)[1,1,2], image 8 q(1:4)[2,1,2], ..., image 48 q(1:4)[2,3,8]


Multi Codimensions: An Example
• Domain Decomposition
  • () gives the local domain size
  • [] provides the image grid and easy access to other images
• 2D domain decomposition of "Braveheart"
  • Global data is 360 x 192
  • Domain decomposition on 8 images with a 4 x 2 grid
  • local array size: (360 / 4) x (192 / 2) = 90 x 96
  • declaration: real :: localPic(90,96)[4,*]
• Diagram: the 4 x 2 grid of 90 x 96 tiles has images 1-4 down axis 1 in the first column and images 5-8 in the second column along axis 2 (a sketch of locating each tile follows below)
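A sketch of the declaration above together with the arithmetic that locates each image's 90 x 96 tile inside the 360 x 192 picture, assuming the code runs on exactly 8 images; variable names other than localPic are illustrative:

  program decomp2d
    implicit none
    real    :: localPic(90,96)[4,*]   ! 4 x 2 image grid, one 90 x 96 tile per image
    integer :: coords(2), gx0, gy0
    coords = this_image(localPic)     ! my position in the 4 x 2 grid
    gx0 = (coords(1)-1)*90 + 1        ! first global row of my tile
    gy0 = (coords(2)-1)*96 + 1        ! first global column of my tile
    write(*,*) 'image', this_image(), ' holds rows ', gx0, ':', gx0+89, &
               ' and columns ', gy0, ':', gy0+95
  end program decomp2d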


this_image() & image_index()
• this_image() returns the image index, i.e., a number between 1 and num_images()
• this_image(z) returns the rank-one integer array of cosubscripts for the calling image corresponding to the coarray z
  • this_image(z, dim) returns the cosubscript of codimension dim of z
• image_index(z, sub) returns the image index with cosubscripts sub for coarray z
  • sub is a rank-one integer array

Example 1
  PROGRAM CAF_Intrinsics
    real :: b(90,96)[4,*]
    write(*,*) "this_image() =", this_image()
    write(*,*) "this_image(b) =", this_image(b)
    write(*,*) "image_index(b,[3,2]) =", image_index(b,[3,2])
  END PROGRAM CAF_Intrinsics
• Sample output: on image 2, this_image(b) = [2, 1]; on image 5, this_image(b) = [1, 2]; on image 7, this_image(b) = [3, 2]; image_index(b,[3,2]) = 7 on every image


Example 2
  PROGRAM CAF_Intrinsics
    real :: c(4,4,4)[5,-1:4,*]
    write(*,*) "this_image() =", this_image()
    write(*,*) "this_image(c) =", this_image(c)
    write(*,*) "image_index(c,[1,0,4]) =", image_index(c,[1,0,4])
  END PROGRAM CAF_Intrinsics
• Sample output: on image 13, this_image(c) = [3, 1, 1]; on image 90, this_image(c) = [5, 4, 3]; on image 96, this_image(c) = [1, 0, 4]; image_index(c,[1,0,4]) = 96 on every image

Boundary Swapping
  PROGRAM CAF_HaloSwap
    integer, parameter :: nximages = 4, nyimages = 2
    integer, parameter :: nxlocal = 90, nylocal = 96
    real :: pic(0:nxlocal+1, 0:nylocal+1)[nximages,*]  ! Declare coarray with halos
    integer :: myimage(2)                              ! Array for my row & column coordinates

    myimage = this_image(pic)    ! Find my row & column cosubscripts
    ...                          ! Initialise pic on each image
    sync all                     ! Ensures pic initialised before accessed by other images

    ! Halo swap
    if (myimage(1) > 1) &
      pic(0,1:nylocal) = pic(nxlocal,1:nylocal)[myimage(1)-1,myimage(2)]
    if (myimage(1) < nximages) &
      pic(nxlocal+1,1:nylocal) = pic(1,1:nylocal)[myimage(1)+1,myimage(2)]
    if (myimage(2) > 1) &
      pic(1:nxlocal,0) = pic(1:nxlocal,nylocal)[myimage(1),myimage(2)-1]
    if (myimage(2) < nyimages) &
      pic(1:nxlocal,nylocal+1) = pic(1:nxlocal,1)[myimage(1),myimage(2)+1]

    sync all                     ! Ensures all images have got old values before pic is updated
    ...                          ! Update pic on each image
  END PROGRAM CAF_HaloSwap


Allocatable Coarrays
• Can have allocatable coarrays
  real, allocatable :: x(:)[:], s[:,:]
  n = num_images()
  allocate(x(n)[*], s[4,*])
• Must specify cobounds in the allocate statement
• The size and value of each bound and cobound must be the same on all images
  • allocate(x(this_image())[*]) ! Not allowed
• Implicit synchronisation of all images...
  • ...after each allocate statement involving coarrays
  • ...before deallocate statements involving coarrays
• (a sketch follows below)

Differently Sized Coarray Components
• A coarray structure component can vary in size per image
• Declare a coarray of derived type with a component that is allocatable (or pointer)...
  ! Define data type with allocatable component
  type diffSize
    real, allocatable :: data(:)
  end type diffSize
  ! Declare coarray of type diffSize
  type(diffSize) :: x[*]
  ! Allocate x%data to a different size on each image
  allocate(x%data(this_image()))
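A minimal sketch of the allocation rules from the first slide above; the cobounds appear in the allocate statement, and allocation and deallocation of coarrays imply synchronisation (the values assigned are illustrative):

  program alloc_caf
    implicit none
    real, allocatable :: x(:)[:]
    integer :: n
    n = num_images()
    allocate( x(n)[*] )        ! same bounds and cobounds on every image; implies a sync
    x = 0.0
    x(this_image()) = 1.0
    sync all                   ! make the updates visible before any remote access
    if (this_image() == 1) print *, 'x on last image:', x(:)[num_images()]
    deallocate( x )            ! images synchronise before deallocation
  end program alloc_caf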


Pointer Coarray Structure Components
• We are allowed to have a coarray that contains components that are pointers
  • Note that the pointers have to point to local data
• We can then access one of the pointers on a remote image to get at the data it points to
• This technique is useful when adding coarrays into an existing MPI code
  • We can insert coarray code deep in the call tree without changing many subroutine argument lists
  • We don't need new coarray declarations
• Example follows...

Pointer Coarray Structure Components...
• Existing non-coarray arrays u, v, w
• Create a type (coords) to hold pointers (x, y, z) that we use to point to u, v, w; we can then use the vects coarray to access u, v, w remotely
  subroutine calc(u,v,w)
    real, intent(in), target, dimension(100) :: u,v,w
    type coords
      real, pointer, dimension(:) :: x,y,z
    end type coords
    type(coords), save :: vects[*]
    ! ...
    vects%x => u ; vects%y => v ; vects%z => w
    sync all
    firstx = vects[1]%x(1)


Coarrays and Procedures
• An explicit interface is required if a dummy argument is a coarray
• Dummy argument associated with the coarray, not a copy
  • avoids synchronisation on entry and return
• Other restrictions on passing coarrays are:
  • the actual argument should be contiguous
    • a(:,2) is OK, but a(2,:) is not contiguous
  • or the dummy argument should be assumed shape
  • ... to avoid copying
• Function results cannot be coarrays

Coarrays as Dummy Arguments
• As with standard Fortran arrays, the coarray dummy arguments in procedures can be:
  • Explicit shape: each dimension of a coarray declared with an explicit value
  • Assumed shape: extents and bounds determined by the actual array
  • Assumed size: only the size determined from the actual array
  • Allocatable: the size and shape can be determined at run-time
  subroutine s(n, a, b, c, d)
    integer :: n
    real :: a(n)[n,*]               ! explicit shape - permitted
    real :: b(:,:)[*]               ! assumed shape - permitted
    real :: c(n,*)[*]               ! assumed size - permitted
    real, allocatable :: d(:)[:,:]  ! allocatable - permitted


Assumed Size Coarrays
• Allow the coshape to be remapped to corank 1
  program cmax
    real, codimension[8,*] :: a(100), amax
    a = [ (i, i=1,100) ] * this_image() / 100.0
    amax = maxval( a )
    sync all
    amax = AllReduce_max(amax)
    ...
  contains
    real function AllReduce_max(r) result(rmax)
      real :: r[*]
      sync all
      rmax = r
      do i=1,num_images()
        rmax = max( rmax, r[i] )
      end do
      sync all
    end function AllReduce_max

Coarrays Local to a Procedure
• Coarrays declared in procedures must have the save attribute
  • unless they are dummy arguments or allocatable
  • save attribute: retains value between procedure calls
  • avoids synchronisation on entry and return
• Automatic coarrays are not permitted
  • Automatic array: a local array whose size depends on dummy arguments
  • would require synchronisation for memory allocation and deallocation
  • would need to ensure coarrays have the same size on all images
  subroutine t(n)
    integer :: n
    real :: temp(n)[*]        ! automatic - not permitted
    integer, save :: x(4)[*]  ! coarray with save attribute
    integer :: y(4)[*]        ! not saved - not permitted


Summary
• Coarrays with multiple codimensions are used to create a grid of images
  • () gives local domain information
  • [] gives an image grid with easy access to other images
  • Can be used in various ways to assemble a multi-dimensional data set
• this_image() and image_index()
  • are intrinsic functions that give information about the images in a multi-codimension grid
• Flexibility from non-coarray allocatable and pointer components of coarray structures
• Coarrays can be allocatable, can be passed as arguments to procedures, and can be dummy arguments


Advanced Features
Parallel Programming with Fortran Coarrays
MSc in HPC
David Henty, Alan Simpson (EPCC)
Harvey Richardson, Bill Long (Cray)

Advanced Features: Overview
• Execution segments and Synchronisation
• Non-global Synchronisation
• Critical Sections
• Visibility of changes to memory
• Other Intrinsics
• Miscellaneous features
• Future developments


More on Synchronisation
• We have to be careful with one-sided updates
  • If we read remote data, was it valid?
  • Could another process send us data and overwrite something we have not yet used?
  • How do we know when remote data has arrived?
• The standard introduces execution segments to deal with this: segments are bounded by image control statements
• The standard can be summarized as follows:
  • "If a variable is defined in a segment, it must not be referenced, defined, or become undefined in another segment unless the segments are ordered" - John Reid

Execution Segments
• Diagram: the same program runs on image 1 and image 2; the sync all is the image synchronisation point that orders each image's first segment before every image's second segment (a minimal sketch of the rule follows below)
  program hot
    double precision :: a(n)
    double precision :: temp(n)[*]
    !...
    ! segment 1
    if (this_image() == 1) then
      do i=1, num_images()
        read *,a
        temp(:)[i] = a
      end do
    end if
    sync all            ! image control statement: segment boundary
    ! segment 2
    temp = temp + 273d0
    ! ...
    call ensemble(temp)
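A minimal sketch of the segment rule itself: each image defines only its own copy of a coarray in the first segment, and the sync all orders that segment before the one in which another image reads it (variable names are illustrative):

  program segments
    implicit none
    integer :: val[*]
    val = this_image()        ! segment 1: each image defines only its own val
    sync all                  ! image control statement: ends segment 1, starts segment 2
    if (this_image() == 1) then
       print *, 'last image holds', val[num_images()]   ! segment 2: ordered after the definition
    end if
  end program segments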


Synchronisation mistakes
• This code is wrong - it breaks the rules
  subroutine allreduce_max_getput(v,vmax)
    double precision, intent(in) :: v[*]
    double precision, intent(out) :: vmax[*]
    integer i
    sync all
    vmax=v
    if (this_image()==1) then
      do i=2,num_images()
        vmax=max(vmax,v[i])
      end do
      do i=2,num_images()
        vmax[i]=vmax
      end do
    end if
    sync all
• Every image defines its own vmax in the same segment in which image 1 writes vmax[i] on the other images - the segments are not ordered


Synchronisation mistakes
• This is ok
  subroutine allreduce_max_getput(v,vmax)
    double precision, intent(in) :: v[*]
    double precision, intent(out) :: vmax[*]
    integer i
    sync all
    if (this_image()==1) then
      vmax=v
      do i=2,num_images()
        vmax=max(vmax,v[i])
      end do
      do i=2,num_images()
        vmax[i]=vmax
      end do
    end if
    sync all
• Only image 1 defines vmax, so no variable is touched from unordered segments

More about sync all
• Usually all images execute the same sync all statement
• But this is not a requirement...
  • Images can execute different code with different sync all statements
  • All images execute the first sync all they come across and...
  • this may match an arbitrary sync all on another image
  • causing incorrect execution and/or deadlock
• Need to be careful with this 'feature'
• Possible to write code which doesn't deadlock but gives wrong answers


More about sync all
• e.g. Image practical: wrong answer
  ! Do halo swap, taking care at the upper and lower picture boundaries
  if (myimage < numimage) then
    oldpic(1:nxlocal, nylocal+1) = oldpic(1:nxlocal, 1)[myimage+1]
    sync all                       ! All images NOT executing this sync all
  end if
  ! ... and the same for down halo
  ! Now update the local values of newpic...
  ! Need to synchronise to ensure that all images have finished reading the
  ! oldpic halo values on this image before overwriting it with newpic
  sync all
  oldpic(1:nxlocal,1:nylocal) = newpic(1:nxlocal,1:nylocal)
  ! Need to synchronise to ensure that all images have finished updating
  ! their oldpic arrays before this image reads any halo data from them
  sync all                         ! All images ARE executing this sync all

More about sync all
• sync images(imageList)
  • Performs a synchronisation of the image executing sync images with each of the images specified in imageList
  • imageList can be an array or a scalar
  if (myimage < numimage) then
    oldpic(1:nxlocal, nylocal+1) = oldpic(1:nxlocal, 1)[myimage+1]
  end if
  if (myimage > 1) then
    oldpic(1:nxlocal, 0) = oldpic(1:nxlocal, nylocal)[myimage-1]
  end if
  ! Now perform local pairwise synchronisations
  if (myimage == 1) then
    sync images( 2 )
  else if (myimage == numimage) then
    sync images( numimage-1 )
  else
    sync images( (/ myimage-1, myimage+1 /) )
  end if


Other Synchronisation
• Critical sections
  • Limit execution of a piece of code to one image at a time
  • e.g. calculating a global sum on the master image
  integer :: a(100)[*]
  integer :: globalSum[*] = 0, localSum
  ...                  ! Initialise a on each image
  localSum = SUM(a)    ! Find localSum of a on each image
  critical
    globalSum[1] = globalSum[1] + localSum
  end critical

Other Synchronisation
• sync memory
  • Coarray data held in caches/registers is made visible to all images
  • requires some other synchronisation to be useful
  • unlikely to be used in most coarray codes
• Example usage: mixing MPI and coarrays
  loop: coarray operations
        sync memory
        call MPI_Allreduce(...)
• sync memory is implied by sync all and sync images


Other Synchronisation
• lock and unlock statements
  • Control access to data defined or referenced by more than one image
  • as opposed to critical, which controls access to lines of code
  • USE the iso_fortran_env module and define a coarray of type(lock_type)
• e.g. to lock data on image 2
  type(lock_type) :: qLock[*]
  lock(qLock[2])
  ! access data on image 2
  unlock(qLock[2])

Other Intrinsic functions
• lcobound(z)
  • Returns the lower cobounds of the coarray z
  • lcobound(z,dim) returns the lower cobound for codimension dim of z
• ucobound(z)
  • Returns the upper cobounds of the coarray z
  • ucobound(z,dim) returns the upper cobound for codimension dim of z
• real :: array(10)[4,0:*] on 16 images
  • lcobound(array) returns [ 1, 0 ]
  • ucobound(array) returns [ 4, 3 ]
• (a sketch follows below)
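A small sketch of the cobound example above, assuming it is run on 16 images:

  program cobounds
    implicit none
    real :: array(10)[4,0:*]
    if (this_image() == 1) then
       print *, 'lcobound(array) =', lcobound(array)   ! [ 1, 0 ]
       print *, 'ucobound(array) =', ucobound(array)   ! [ 4, 3 ] when run on 16 images
    end if
  end program cobounds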


More on Cosubscripts
• integer :: a[*] on 8 images
  • cosubscript a[9] is not valid
• real :: b(10)[3,*] on 8 images
  • ucobound(b) returns [ 3, 3 ]
  • cosubscript b[2,3] is valid (corresponds to image 8)...
  • ...but cosubscript b[3,3] is invalid (image 9)
• The programmer needs to make sure that cosubscripts are valid
  • image_index returns 0 for invalid cosubscripts

Assumed Size Coarrays
• Codimensions can be remapped to corank greater than 1
  • useful for determining optimal extents at runtime
  program p2d
    real, codimension[*] :: picture(100,100)
    integer :: numimage, numimagex, numimagey
    numimage = num_images()
    call get_best_2d_decomposition(numimage, &
                                   numimagex, numimagey)
    ! Assume this ensures numimage = numimagex*numimagey
    call dothework(picture, numimagex, numimagey)
    ...
  contains
    subroutine dothework(array, m, n)
      real, codimension[m,*] :: array(100,100)
      ...
    end subroutine dothework


I/O
• Each image has its own set of input/output units
  • units are independent on each image
• Default input unit is preconnected on image 1 only
  • read *,... , read(*,...)...
• Default output unit is available on all images
  • print *,... , write(*,...)...
• It is expected that the implementation will merge records from each image into one stream

Program Termination
• STOP or END PROGRAM statements initiate normal termination, which includes a synchronisation step
• An image's data is still available after it has initiated normal termination
• Other images can test for this using the STAT= specifier on synchronisation calls or allocate/deallocate
  • test for STAT_STOPPED_IMAGE (defined in the ISO_FORTRAN_ENV module), as in the sketch below
• The ERROR STOP statement initiates error termination and it is expected all images will be terminated
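A minimal sketch of testing for a stopped image with the STAT= specifier; the choice of which image stops, and what the surviving images do, is illustrative:

  program stopped_check
    use, intrinsic :: iso_fortran_env, only : STAT_STOPPED_IMAGE
    implicit none
    integer :: st
    if (this_image() == 1) stop          ! image 1 initiates normal termination
    sync all (stat=st)                   ! remaining images detect this via STAT=
    if (st == STAT_STOPPED_IMAGE) then
       print *, 'image', this_image(), ': another image has stopped'
    end if
  end program stopped_check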


Coarray Technical Specification
• Additional coarray features may be described in a Technical Specification (TS)
• Developed as part of the official ISO standards process
• Work in progress; the areas of discussion are:
  • image teams
  • collective intrinsics for coarrays
  • atomics

TS: Teams
• Often useful to consider subsets of processes
  • e.g. MPI communicators
• Subsets not currently supported in Fortran, e.g.
  • sync all: all images
  • sync images: pairwise mutual synchronisation
• Extension involves TEAMS of images
  • user creates teams
  • specified as an argument to various functions


TS: Teams...
• To define a set of images as a team
  call form_team(oddteam,[ (i,i=1,n,2) ])
• To synchronise the team
  sync team(oddteam)
• To determine the images that constitute a team:
  oddimages=team_images(oddteam)

TS: Collectives
• Collective operations are a key part of real codes
  • broadcast
  • global sum
  • ...
• Supported in other parallel models
  • OpenMP reductions
  • MPI_Allreduce
• Not currently supported for coarrays
  • efficient implementation by hand is difficult
  • calling external MPI routines is rather ugly


TS: Collective intrinsic subroutines
• Collectives, with in/out arguments, invoked by the same statement on all images (or a team of images)
• Routines
  • CO_BCAST
  • CO_SUM and other reduction operations
  • basically reproduce the MPI functionality
• Arguments include SOURCE, RESULT, TEAM
• Still discussion on the need for implicit synchronisation and argument types (for example non-coarray arguments)

TS: Atomic operations
• Critical or lock synchronisation is sometimes overkill
  • counter[1] = counter[1] + 1
• Simple atomic operations can be optimised
  • e.g. OpenMP atomic
  !$OMP atomic
  sharedcounter = sharedcounter + 1
• New variable types and operations for coarrays


TS: Atomic variables
• Fortran already includes some atomic support (define, ref)
• The TS expands on this to support atomic compare-and-swap, fetch-and-add, ...
  integer (atomic_int_kind) :: counter[*]
  call atomic_define(counter[1],0)
  call atomic_add(counter[1],1)
  call atomic_ref(countval,counter[1])


Experiences with Coarrays
Parallel Programming with Fortran Coarrays
MSc in HPC
David Henty, Alan Simpson (EPCC)
Harvey Richardson, Bill Long (Cray)

Overview
• Implementations
• Performance considerations
• Where to use the coarray model
• Coarray benchmark suite
• Examples of coarrays in practice
• References
• Wrapup


Implementation Status
• History of coarrays dates back to Cray implementations
• Expect support from vendors as part of Fortran 2008
• G95 had multi-image support in 2010
  • has not been updated for some time
• gfortran
  • introduced single-image support at version 4.6
• Intel: multi-process coarray support in Intel Composer XE 2011 (based on the Fortran 2008 draft)
• Runtimes are SMP, GASNet and compiler/vendor runtimes
  • GASNet has support for multiple environments (IB, Myrinet, MPI, UDP and Cray/IBM systems) so could be an option for new implementations

Implementation Status (Cray)
• Cray has supported coarrays and UPC on various architectures over the last decade (from the T3E)
• Full PGAS support on the Cray XT/XE
  • Cray Compiling Environment 7.0 - Dec 2008
  • Current release is Cray Compiler Environment 7.4
• Full Fortran 2008 coarray support
  • Full Fortran 2003 with some Fortran 2008 features
• Fully integrated with the Cray software stack
  • Same compiler drivers, job launch tools, libraries
  • Integrated with Craypat - Cray performance tools
  • Can mix MPI and coarrays


Implementations we have used
• Cray X1/X2
  • Hardware supports communication by direct load/store
  • Very efficient with low overhead
• Cray XT
  • PGAS (UPC, CAF) layered on GASNet/Portals (so messaging)
  • Not that efficient
• Cray XE
  • PGAS layered on the DMAPP portable layer over Gemini network hardware
  • Intermediate between XT and X1/2
• Intel Composer XE 2011
  • SMP and message-passing runtimes

Implementations we have used...
• g95 on shared memory
  • Using cloned process images on Linux
  • This is not being actively developed


Intel XE on Ubuntu VM
• Screenshot: Intel Composer XE being used on an Ubuntu virtual machine

When to use coarrays
• Two obvious contexts
  • Complete application using coarrays
  • Mixed with MPI
• As an incremental addition to a (potentially large) serial code
• As an incremental addition to an MPI code (allowing reuse of most of the existing code)
  • Use coarrays for some of the communication
  • opportunity to express communication much more simply
  • opportunity to overlap communication
• For subset synchronisation
• Work-sharing schemes


Adding coarrays to existing applications
• Constrain use of coarrays to part of the application
  • Move relevant data into coarrays
  • Implement the parallel part with coarray syntax
  • Move data back to the original structures
• Use coarray structures to contain pointers to existing data
• Place relevant arrays in global scope (modules)
  • avoids multiple declarations
• Declare existing arrays as coarrays at the top level and through the complete call tree (some effort but only requires changes to declarations)

Performance Considerations
• What is the latency?
• Do you need to avoid strided transfers?
• Is the compiler optimising the communication for the target architecture?
  • Is it using blocking communication within a segment when it does not need to?
  • Is it optimising strided communication?
  • Can it pattern-match loops to single communication primitives or collectives?


Performance: Communication patterns
• Try to avoid creating traffic jams on the network, such as all images storing to a single image
• The following examples show two ways to implement an AllReduce() function using coarrays

AllReduce (everyone gets)
• All images get data from the others simultaneously
  function allreduce_max_allget(v) result(vmax)
    double precision :: vmax, v[*]
    integer i
    sync all
    vmax=v
    do i=1,num_images()
      vmax=max(vmax,v[i])
    end do


AllReduce (everyone gets, optimized)
• All images get data from the others simultaneously, but this is optimized so communication is more balanced
  !...
  sync all
  vmax=v
  do i=this_image()+1,num_images()
    vmax=max(vmax,v[i])
  end do
  do i=1,this_image()-1
    vmax=max(vmax,v[i])
  end do
• Have seen this much faster

Synchronization
• For some algorithms (finite-difference etc.) don't use sync all but pairwise synchronization using sync images(image)


Synchronization (one to many)
• Often one image will be communicating with a set of images
• In general not a good thing to do, but assume we are...
• Tempting to use sync all

Synchronisation (one to many)
• If this is all images then could do
  if ( this_image() == 1) then
    sync images(*)
  else
    sync images(1)
  end if
• Note that sync all is likely to be fast so is an alternative


Synchronisation (one to many)
• For a subset use this
  if ( this_image() == image_list(1)) then
    sync images(image_list)
  else
    sync images(image_list(1))
  end if
• instead of sync images(image_list) for all of them, which is likely to be slower

Collective Operations
• If you need scalability to a large number of images you may need to temporarily work around the current lack of collectives
  • Use MPI for the collectives if MPI+coarrays is supported
  • Implement your own, but this might be hard
    • For reductions of scalars a tree will be the best to try (a sketch follows below)
    • For reductions of more data you would have to experiment and this may depend on the topology
• Coarrays can be good for collective operations where
  • there is an unusual communication pattern that does not match what MPI collectives provide
  • there is opportunity to overlap communication with computation
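One possible hand-coded scalar tree reduction of the kind suggested above; a sketch only, assuming every image calls it and that the number of images is a power of two:

  real function treereduce_max(v) result(vmax)
    ! Binary-tree maximum reduction across images; result returned on all images
    real, intent(in) :: v
    real, save :: work[*]          ! saved coarray: coarrays local to procedures need save
    integer :: me, step
    me = this_image()
    work = v
    step = 1
    do while (step < num_images())
       sync all                    ! order segments: partners' work values are now valid
       if (mod(me-1, 2*step) == 0 .and. me+step <= num_images()) then
          work = max(work, work[me+step])
       end if
       step = 2*step
    end do
    sync all                       ! image 1 now holds the full reduction
    vmax = work[1]
  end function treereduce_max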


Tools: debugging and profiling
• Tool support should improve once coarray takeup increases
• Cray Craypat tool supports coarrays
• Totalview works with coarray programs on Cray systems
• Allinea DDT
  • support for coarrays and UPC for a number of compilers is in public beta and will be in DDT 3.1
• Scalasca
  • currently investigating how PGAS support can be incorporated

Debugging Synchronisation problems
• The one-sided model is tricky because subtle synchronisation errors change data
• TRY TO GET IT RIGHT FIRST TIME
  • look carefully at the remote operations in the code
• Think about synchronisation of segments
  • especially look for early-arriving communications trashing your data at the start of loops (this one is easy to miss)
• One way to test is to put sleep() calls in the code
  • Delay one or more images
  • Delay the master image, or other images for some patterns


Coarray Benchmark Suite
• Developed by David Henty at EPCC
• Aims to test fundamental features of a coarray implementation
• We don't have an API to test (cf. IMB for MPI)
• We can test basic language syntax for communication of data and synchronization
• Need to choose communication pattern and data access
• There is some scope for a given communication pattern:
  • array syntax, loops over array elements
  • inline code or use subroutines
• Choices can reveal compiler capabilities

Initial Benchmark
• Single contiguous point-to-point read and write
• Multiple contiguous point-to-point read and write
• Strided point-to-point read and write
• All basic synchronization operations
• Various representative communication patterns
  • Halo-swap in multi-dimensional regular domain decomposition
• All communications include the synchronisation cost
  • use double precision (8-byte) values as the basic type
  • use Fortran array syntax
• Synchronisation overhead measured separately as overhead = (delay + sync) - delay (a timing sketch follows below)
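A sketch of the "overhead = (delay + sync) - delay" measurement; the timer, the repeat count and the work inside the delay are all illustrative choices:

  program sync_overhead
    implicit none
    integer, parameter :: nrep = 1000
    integer :: i, c0, c1, crate
    double precision :: tdelay, tboth
    call system_clock(c0, crate)
    do i = 1, nrep
       call delay()                ! fixed local work on every image
    end do
    call system_clock(c1)
    tdelay = dble(c1-c0)/dble(crate)
    sync all
    call system_clock(c0)
    do i = 1, nrep
       call delay()
       sync all                    ! the operation being measured
    end do
    call system_clock(c1)
    tboth = dble(c1-c0)/dble(crate)
    if (this_image() == 1) print *, 'sync all overhead (s):', (tboth-tdelay)/nrep
  contains
    subroutine delay()
      integer :: k
      real :: s
      s = 0.0
      do k = 1, 10000
         s = s + sqrt(real(k))
      end do
      if (s < 0.0) print *, s      ! keeps the loop from being optimised away
    end subroutine delay
  end program sync_overhead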


Platforms
• Limited compiler support at present
  • results presented from Cray systems
  • Cray Compiler Environment (CCE) 7.4.1
• Cray XT6
  • MPP with dual 12-core Opteron Magny-Cours nodes and Cray Seastar2+ torus interconnect; coarrays implemented using GASNet
• Cray XE6
  • Same nodes as XT6 but with the Gemini torus interconnect, which supports RDMA

Pingpong
• Figure: point-to-point bandwidth (MB/s) against message length (doubles) for XE6 put, XT6 put, XT6 MPI and XE6 MPI


Pingpong (small message regime)
• Figure: point-to-point time (microseconds) against message length (doubles) for XE6 put, XT6 put, XE6 MPI and XT6 MPI
• Latencies (microseconds):
          sync images   put latency   MPI latency
  XT6        33.1          45.3           7.4
  XE6         3.0           3.7           1.6

Global Synchronisation
• Figure: XT6 synchronisation time (microseconds) against number of images (16 to 2048) for sync all and MPI Barrier * 100
• XT coarray implementation not keeping up with MPI


Global Synchronisation
• Figure: XE6 synchronisation time (microseconds) against number of images (4 to 16384) for XE6 sync all and XE6 MPI_Barrier
• Much faster than the previous XT results

Global Synchronisation
• Also measured sync images
  • various point-to-point patterns
• Observed that sync images is usually faster than sync all on more than 512 images


3D Halo Swap on XE6 (weak scaling V=50^3)
• Figure: bandwidth (MB/s) against number of images (8 to 32768) for put p2p, put all, get p2p and get all

Examples of coarrays in practice
• Puzzles
• Distributed Remote Gather
• HIMENO Halo-Swap
• Gyrokinetic Fusion Code


Solving Sudoku Puzzles

Going Parallel
• Started with serial code
• Changed to read in all 125,000 puzzles at the start
• Choose work-sharing strategy
  • One image (image 1) holds a queue of puzzles to solve
  • Each image picks work from the queue and writes results back to the queue
• Arbitrarily decide to parcel work as
  blocksize = npuzzles / (8 * num_images())


Data Structures
  use, intrinsic :: iso_fortran_env
  type puzzle
    integer :: input(9,9)
    integer :: solution(9,9)
  end type puzzle
  type queue
    type(lock_type) :: lock
    integer :: next_available = 1
    type(puzzle), allocatable :: puzzles(:)
  end type queue
  type(queue), save :: workqueue[*]
  type(puzzle) :: local_puzzle
  integer, save :: npuzzles[*], blocksize[*]

Input
  if (this_image() == 1) then
    ! After file setup
    inquire (unit=inunit, size=nbytes)
    nrecords = nbytes/10
    npuzzles = nrecords/9
    blocksize = npuzzles / (num_images()*8)
    write (*,*) "Found ", npuzzles, " puzzles."
    allocate (workqueue%puzzles(npuzzles))
    do i = 1, npuzzles
      call read_puzzles( &
           workqueue%puzzles(i)%input, inunit, &
           error)
    end do
    close(inunit)


Core program structure
  ! After coarray data loaded
  sync all
  blocksize = blocksize[1]
  npuzzles = npuzzles[1]
  done = .false.
  workloop: do
    ! Acquire lock and claim work
    ! Solve our puzzles
  end do workloop

Acquire lock and claim work
  ! Reserve the next block of puzzles
  lock (workqueue[1]%lock)
  next = workqueue[1]%next_available
  if (next


Solve the puzzles and write back
  ! Solve those puzzles
  do i = istart, iend
    local_puzzle%input = &
         workqueue[1]%puzzles(i)%input
    call sudoku_solve &
         (local_puzzle%input, local_puzzle%solution)
    workqueue[1]%puzzles(i)%solution = &
         local_puzzle%solution
  end do

Output the solutions
  ! Need to synchronize puzzle output updates
  sync all
  if (this_image() == 1) then
    open (outunit, file=outfile, iostat=error)
    do i = 1, npuzzles
      call write_puzzle &
           (workqueue%puzzles(i)%input, &
            workqueue%puzzles(i)%solution, outunit, error)
    end do


More on the Locking
• We protected access to the queue state by lock and unlock
  • During this time no other image can acquire the lock
• We need the discipline to only access the data within the window when we hold the lock
  • There is no connection between the lock variable and the other elements of the queue structure
• The unlock is acting like sync memory
  • If one image executes an unlock...
  • Another image getting the lock is ordered after the first image

Summary and Commentary
• We implemented solving the puzzles using a work-sharing scheme with coarrays
• Scalability limited by serial work done by image 1
  • I/O
    • Parallel I/O (deferred to the TS) with multiple images running distributed work queues
    • Defer the character-integer format conversion to the solver, which is executed in parallel
  • Lock contention
    • Could use distributed work queues, each with its own lock


Distributed remote gather
• The problem is how to implement the following gather loop on a distributed memory system
  REAL :: table(n), buffer(nelts)
  INTEGER :: index(nelts)   ! nelts


Remote gather: coarray implementation (get)
• Image 1 gets the values from the other images
  IF (myimg.eq.1) THEN
    DO i=1,nelts
      pe     = (index(i)-1)/nloc+1
      offset = MOD(index(i)-1,nloc)+1
      caf_buffer(i) = caf_table(offset)[pe]
    ENDDO
  ENDIF

Remote gather: coarray vs MPI
• Coarray implementations are much simpler
• Coarray syntax allows the expression of remote data in a natural way - no need for complex protocols
• Coarray implementation is orders of magnitude faster for small numbers of indices
• Figure: MPI time / coarray time ratio (1024 PEs) against number of elements (nelts)


HIMENO
• HIMENO Halo-Swap benchmark
  • Uses the Jacobi method to solve Poisson's equation
• Looked at a distributed implementation for GPUs
  • When distributed this gives a stencil computation and halo-swap communication
  • Used draft OpenMP GPU directives for the stencil computation
  • used MPI or coarrays for the halo swap between processes
• Coarray code for the halo swap was simple and was the best performing of the optimized versions
• There is still scope to optimize the coarray version (reduce an extra data copy)

HIMENO
• Figure: Himeno Benchmark - XL configuration; performance (TFlop/s) against number of nodes (0 to 256) for MPI/ACC Opt and CAF/ACC Opt


Gyrokinetic Fusion Code
• Particle in Cell (PIC) approach to simulate the motion of confined particles
  • Motion caused by the electromagnetic force on each particle
• Many particles stay in a cell for a small timestep but some don't
  • Timestep chosen to limit travel to 4 cells away
• Departing particles are stored in a buffer and when this is full the data is sent to the neighboring cell's incoming buffer
• Force fields are recomputed once particles are redistributed
• Coarrays used to avoid coordinating the receive of the data
• SC11 paper

References
• "Cray's Approach to Heterogeneous Computing", R. Ansaloni, A. Hart, ParCo 2011 (to appear)
• "The Himeno benchmark", Ryutaro Himeno, http://accc.riken.jp/HPC_e/himenobmt_e.html
• "Multithreaded Address Space Communication Techniques for Gyrokinetic Fusion Applications on Ultra-Scale Platforms", Robert Preissl, Nathan Wichmann, Bill Long, John Shalf, Stephane Ethier, Alice Koniges, SC11 best paper finalist


References
• "Co-array Fortran for parallel programming", Numrich and Reid, 1998, http://lacsi.rice.edu/software/caf/downloads/documentation/nrRAL98060.pdf
• "Coarrays in the next Fortran Standard", John Reid, April 2010, ftp://ftp.nag.co.uk/sc22wg5/N1801-N1850/N1824.pdf
• Ashby, J.V. and Reid, J.K. (2008). Migrating a scientific application from MPI to coarrays. CUG 2008 Proceedings. RAL-TR-2008-015. See http://www.numerical.rl.ac.uk/reports/reports.shtml
• Unified Parallel C at George Washington University: http://upc.gwu.edu/
• Berkeley Unified Parallel C Project: http://upc.lbl.gov/

Wrapup
Remember our first Motivation slide?
• Fortran now supports parallelism as a full first-class feature of the language
• Changes are minimal
• Performance is maintained
• Flexibility in expressing communication patterns
We hope you learned something and have success with coarrays in the future


Acknowledgements
The material for this tutorial is based on original content developed by EPCC of The University of Edinburgh for use in teaching their MSc in High-Performance Computing. The following people contributed to its development: Alan Simpson, Michele Weiland, Jim Enright and Paul Graham.
The material was subsequently developed by EPCC and Cray with contributions from the following people: David Henty, Alan Simpson (EPCC); Harvey Richardson, Bill Long, Roberto Ansaloni, Jef Dawson, Nathan Wichmann (Cray).
This material is Copyright © 2011 by The University of Edinburgh and Cray Inc.
