PGAS Programming with UPC and Fortran Coarrays
EPCC PRACE Advanced Training Centre
Course Slides
9-10 January 2013
EPCC
© The University of Edinburgh


Parallel Programming Languages and Approaches
Dr Alan D Simpson
Technical Director, EPCC
a.simpson@epcc.ed.ac.uk
+44 131 650 5120

Contents
• A Little Bit of History
  – Non-Parallel Programming Languages
  – Vector Processing
  – Data Parallel
  – Early Parallel Languages
• Current Status of Parallel Programming
  – Parallelisation Strategies
  – Mainstream HPC
• Alternative Parallel Programming Languages
  – Single-Sided Communication
  – PGAS
  – Accelerators
  – Hybrid Approaches
• Final Remarks and Summary


Non-Parallel Programming Languages
• Serial languages are also important for HPC
  – Used for much scientific computing
  – Basis for parallel languages
• PRACE survey results:
  [Chart: response counts by language – Fortran, C, C++, Python, Perl, Other, Java, Chapel, Co-array Fortran]
• The PRACE survey indicates that nearly all applications are written in:
  – Fortran: well suited for scientific computing
  – C/C++: allows good access to hardware
• Supplemented by
  – Scripts using Python, Perl and bash
  – PGAS languages starting to be used


Vector Programming
• Exploit hardware support for pipelining
  – and for fast data access
• Early supercomputers were often vector systems
  – Such as the Cray-1
• Allowed operations on vectors
  – A vector is a series of values
  – e.g., a section of a Fortran array
• Typical vector loop:
      DO i = 1, n
        y(i) = a*x(i) + y(i)
      END DO

Vector Multiply
• The multiply operation is made up of a number of stages
• Vector hardware allows the stages to work independently and to pass results to each other in an "assembly line" manner
• Start-up cost as the pipeline fills, but then a result every cycle
  [Diagram: results R(1)–R(5) produced one per operation in serial execution, but overlapped in the vector pipeline so a new result completes every cycle]


Vectorisation
• Sometimes required restructuring of loops to allow efficient vectorisation
• Directives used to provide information to the compiler about whether a particular operation was vectorisable
• Compilers became increasingly good at spotting opportunities for vectorisation
• Vector supercomputers became less popular as parallel computing grew
• However, many modern CPUs contain vector-like features
  – e.g., Interlagos Opteron processors in the Cray XE

Data Parallel
• Processors perform similar operations across data elements in an array
• Higher-level programming paradigm, characterised by:
  – single-threaded control
  – global name space
  – loosely synchronous processes
  – parallelism implied by operations applied to data
  – compiler directives
• Data parallel languages: generally a serial language (e.g., Fortran 90) plus
  – compiler directives (e.g., for data distribution)
  – first-class language constructs to support parallelism
  – new intrinsics and library functions
• Paradigm well suited to a number of early (SIMD) parallel computers
  – Connection Machine, DAP, MasPar, …


Data Parallel II
• Many data parallel languages implemented:
  – Fortran-Plus, DAP Fortran, MP Fortran, CM Fortran, *LISP, C*, CRAFT, Fortran D, Vienna Fortran
• Languages expressed data parallel operations differently
• Machine-specific languages meant poor portability
• Needed a portable standard: High Performance Fortran
• Easy to port codes to, but performance could rarely match that from message passing codes
  – Struggled to gain broad popularity

Early Parallel Languages
• Connection Machine languages
  – Thinking Machines Corporation provided data parallel versions of a variety of sequential languages (*LISP, C*, CM-Fortran)
  – Allowed users to exploit a large number of simple processors in parallel
• OCCAM
  – Early message passing language
  – …based on Communicating Sequential Processes
  – Developed by INMOS for programming Transputers
  – Explicitly parallel loops via the PAR keyword
  – Language constructs for sending and receiving data through named channels
  – Could only communicate with neighbouring processors
  – → Message routing had to be done in user software
• Most early languages for parallel computing were vendor-specific


Parallelisation Strategies
• PRACE recently asked more than 400 European HPC users:
  – "Which parallelisation implementations do you use?"
  [Chart: response counts – MPI, OpenMP, Combined MPI+OpenMP, MPI including MPI-2 single-sided, POSIX threads, Other, Combined MPI+POSIX threads, Combined MPI+SHMEM, SHMEM, HPF]
  – Unsurprisingly, the most popular answers were MPI, OpenMP and combined MPI+OpenMP
  – Some users of single-sided communications


Parallelisation Strategies II
• PRACE also asked users of the very largest systems:
  – "Which parallelisation method does your application use?"
• Most popular: "MPI only" and "Combined MPI+OpenMP"
• 12% used single-sided routines

Mainstream HPC
• For the last 15+ years, most HPC cycles on large systems have been used to run MPI programs, written in Fortran or C/C++
  – Plus OpenMP used on shared memory systems/nodes
• The MSc in HPC includes compulsory courses in MPI and OpenMP
• However, there are now reasons why this may be changing:
  – Currently, HPC systems have increasingly large numbers of cores, but individual core performance is relatively static
  – There are significant challenges in exploiting future Exascale systems
• So, alongside mainstream HPC, there is also significant activity in:
  – Single-sided communication
  – PGAS languages
  – Accelerators
  – Hybrid approaches
• Many of these areas are discussed later in this course


Shared Memory
• Multiple threads sharing global memory
• Developed for systems with shared memory (MIMD-SM)
• Program loop iterations can be distributed to threads
• Each thread can refer to private objects within a parallel context
• Implementation
  – Threads map to user threads running on one shared memory node
  – Extensions to distributed memory not so successful
• POSIX threads (Pthreads) is a portable standard for threading
• Vendors had various shared-memory directives
  – OpenMP developed as a common standard for HPC
• OpenMP is a good model to use within a node
  – More recent task features

Message Passing
• Processes cooperate to solve the problem by exchanging data
• Can be used on most architectures
  – Especially suited to distributed memory systems (MIMD-DM)
• The message passing model is based on the notion of processes
  – Process: an instance of a running program, together with the program's data
• Each process has access only to its own data
  – i.e., all variables are private
• Processes communicate with each other by sending and receiving messages
  – Typically library calls from a conventional sequential language
• During the 1980s, there was an explosion in message passing languages and libraries
  – CS Tools, OCCAM, CHIMP (developed by EPCC), PVM, PARMACS, …


MPI: Message Passing Interface
• De facto standard developed by a working group of around 60 vendors and researchers from 40 organisations in the USA and Europe
  – Took two years
  – MPI-1 released in 1993
  – Built on experiences from previous message passing libraries
• MPI's prime goals are:
  – To provide source-code portability
  – To allow efficient implementation
• MPI-2 was released in 1997
  – New features: parallel I/O, dynamic process management and remote memory operations (single-sided communication)
• Now, MPI is used by nearly all message passing programs


Single-Sided Communication
• Allows direct access to the memory of other processors
  – Each process can access the total memory, even on distributed memory systems
• A simpler protocol can bring performance benefits
  – But requires thinking about synchronisation, remote addresses, caching, ...
• Key routines
  – PUT is a remote write
  – GET is a remote read
• Libraries give PGAS functionality
• Vendor-specific libraries
  – SHMEM (Cray/SGI), LAPI (IBM)
• Portable implementations
  – MPI-2, OpenSHMEM

Single-Sided Communication
• Single-sided communication is a major part of the MPI-2 standard
  – Quite general and portable to most platforms
  – However, portability and robustness can have an impact on latency
  – Quite complicated and messy to use
• Better performance from lower-level interfaces, like SHMEM
  – Originally developed by Cray, but a variety of similar implementations were developed on other platforms
  – Simple interface but hard to program correctly
• OpenSHMEM
  – New initiative to provide a standard interface
  – See http://www.openshmem.org
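To make the PUT idea concrete, here is a minimal MPI-2 one-sided sketch in C (my own illustration, not taken from the slides): rank 0 writes directly into a window exposed by rank 1, and the receiving side never posts a matching receive. It assumes the program is run with at least two processes.

    /* Minimal sketch (not from the slides): single-sided PUT with MPI-2.
       Run with at least two processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, buf = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* every process exposes one int as a remotely accessible window */
        MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                 /* open the access epoch */
        if (rank == 0 && size > 1) {
            int value = 42;
            /* remote write into rank 1's buf: no receive call on rank 1 */
            MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);                 /* complete the epoch */

        if (rank == 1) printf("rank 1 now holds %d\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }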


PGAS: Partitioned Global Address Space
• Access to local memory via standard program mechanisms, plus access to remote memory directly supported by the language
• The combination of access to all data plus exploiting locality could give good performance and scaling
• Well suited to modern MIMD systems with multicore (shared memory) nodes
• Newly popular approach, initially driven by US funding
  – Productive, Easy-to-use, Reliable Computing System (PERCS) project funded by DARPA's High Productivity Computing Systems (HPCS) programme

PGAS II
• Currently an active and enthusiastic community
• Very wide variety of languages under the PGAS banner
• See http://www.pgas.org
• Including: CAF, UPC, Titanium, Fortress, X10, CAF 2.0, Chapel, Global Arrays, HPF?, …
• Often, these languages have more differences than similarities…


PGAS Languages
• The broad range of PGAS languages makes it difficult to choose which to use
• Currently, CAF and UPC are probably the most relevant, as Cray's compilers and hardware now support CAF and UPC quite efficiently
• CAF: Fortran with Coarrays
  – Minimal addition to Fortran to support parallelism
  – Incorporated in the Fortran 2008 standard!
• UPC: Unified Parallel C
  – Adds parallel features to C

Accelerators
• Use accelerator hardware for faster node performance
• Recently, most HPC systems have been increasing the number of cores, but individual cores are not getting faster
  – This gives significant scaling challenges
• Accelerators are increasingly interesting
  – …for some applications
• PRACE survey
  – "Could your application benefit from accelerators, such as GPGPUs?"
  – 61% thought so


UK Usage of Accelerators
  [Charts: current usage (GPUs, Intel MIC) and future plans (GPUs, Intel MIC, FPGAs, other)]
• FPGAs were very fashionable a few years ago
  – But proved difficult to program
• Currently, most interest is around GPUs
  – With Intel MIC as an important future prospect
• Most users thought accelerators would increase in importance

Programming GPUs
• Graphics Processing Units (GPUs) have been increasing in performance much more quickly than standard processor cores
• This has led to an interest in GPGPU (General Purpose computation on Graphics Processing Units)
  – Where the GPU acts as an accelerator to the CPU
• Variety of different ways to program GPGPUs
  – CUDA
    – NVIDIA's proprietary interface to the architecture
    – Extensions to the C (and Fortran) language which allow interfacing to the hardware
  – OpenCL
    – Cross-platform API
    – Similar in design to CUDA, but lower level and not so mature
  – OpenACC
    – Directives-based approach
    – OpenMP-style directives abstract complexities away from the programmer
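As a flavour of the directives-based approach, the fragment below (my own sketch, not from the slides) marks a loop for offload with a single OpenACC directive; the compiler generates the device code and the data movement.

    /* Sketch (not from the slides): OpenACC offloads the loop to an
       accelerator; without one, it still runs correctly on the CPU. */
    void scale(int n, float *a, float factor) {
        #pragma acc parallel loop copy(a[0:n])
        for (int i = 0; i < n; i++)
            a[i] *= factor;
    }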


Hybrid Approaches
• Use more than one parallelisation strategy within a single program
• Trying to obtain more parallelism by exploiting the hierarchy in the hardware
• Most commonly, combining MPI + OpenMP
  – Using OpenMP across a shared-memory partition, with MPI to communicate between partitions
  – May make sense to use OpenMP just within a many-core processor
• But can also combine MPI with Pthreads or OpenSHMEM or CAF…
• Using many GPGPUs also often requires the use of MPI alongside the GPGPU programming approach


Why do Languages Survive or Die?
• It is not always entirely clear why some languages and approaches thrive while others fade away…
• However, languages which survive do have a number of common characteristics:
  – Appropriate model for current hardware
  – Good portability
  – Ease of use
  – Applicable to a broad range of problems
  – Strong engagement from both vendors and user communities
  – Efficient implementations available

Summary
• Development of portable standards has been essential for the uptake of new parallel programming ideas
• Mainstream HPC is currently based on MPI and OpenMP
  – However, there are alternatives
• Exascale challenges have injected new life into the development of novel parallel programming languages and approaches
• The remainder of this course focuses on PGAS languages and programming GPGPUs
  – Plus lectures on data parallel programming and single-sided communication


References
• PRACE-PP
  – D6.1: Identification and Categorisation of Applications and Initial Benchmarks Suite, Alan Simpson, Mark Bull and Jon Hill, EPCC
• PRACE-1IP
  – D7.4.1: Applications and User Requirements for Tier-0 Systems, Mark Bull (EPCC), Xu Guo (EPCC), Ioannis Liabotis (GRNET)
  – D7.4.3: Tier-0 Applications and Systems Usage, Xu Guo, Mark Bull (EPCC)
• ARCHER User Requirements
  – Project Working Group: Katharine Bowes (EPSRC), Ian Reid (NAG), Simon McIntosh-Smith (Bristol), Bryan Lawrence (NCAS/Reading) and Alan Simpson (EPCC)


UPC: Introduction & Basics
Dr Michèle Weiland
Applications Consultant, EPCC
m.weiland@epcc.ed.ac.uk

Objectives of the coming three lectures:
o understand the basic principles of UPC
o motivation behind PGAS
o learn about data distribution, synchronisation
o advanced features (dynamic memory allocation, collectives)
→ Practicals will try and emphasise the most important aspects of UPC


UPC: Unified Parallel C
Parallel extension to ISO C 99, adding
o explicit parallelism
o global shared address space
o synchronisation
Both commercial and open source compilers available
o Cray, IBM, SGI, HP
o GWU, LBNL, GCC

UPC and the World of PGAS
UPC != PGAS
o PGAS is a programming model
o UPC is one implementation of this model
Many other implementations
o language extension: Coarray Fortran
o new languages: Chapel, X10, Fortress, Titanium
o PGAS-like libraries: Global Arrays, OpenSHMEM
All implementations are different, but follow the same model!


UPC threads
UPC uses threads that operate independently in a SPMD fashion
→ all threads execute the same UPC program
Identifiers that return information about the program environment:
  THREADS:  holds the total number of threads
  MYTHREAD: stores the thread index
            → the index runs from 0 to THREADS-1

UPC threads
      #include <upc.h>
      #include <stdio.h>

      void main() {
          printf("Thread %d of %d says: Hello!", MYTHREAD, THREADS);
      }


Private vs. shared memory space
Concept of two memory spaces: private and shared
o objects declared in the private memory space are only accessible by a single thread
o objects declared in the shared memory space are accessible by all threads
→ the shared memory space is used to communicate information between threads

Private vs. shared data
private variables are declared as normal C variables
o multiple instances of the variable will exist
      int x; // private variable
shared variables are declared with the shared qualifier
o only allocated once, in the shared memory space
o accessible by all threads
      shared int y; // shared variable


UPC data locality
If a shared variable is a scalar, space is only allocated on thread 0
      int x;
      shared int y;
[Diagram: threads 0-3 each hold a private x; the single shared y sits in the shared memory space with affinity to thread 0]

Affinity
all threads can directly access shared data, even if it resides in a remote location
UPC creates a logical partitioning of the shared memory space
→ objects have affinity to one thread
→ shared scalars always have affinity to thread 0
better performance if a thread accesses data to which it has affinity
→ always keep data locality and affinity in mind


Shared array distribution
If a shared variable is an array, space is allocated across the shared memory space in a cyclic fashion by default
      int x;
      shared int y[16];
[Diagram: with 4 threads, y[0], y[4], y[8], y[12] have affinity to thread 0; y[1], y[5], y[9], y[13] to thread 1; y[2], y[6], y[10], y[14] to thread 2; y[3], y[7], y[11], y[15] to thread 3; each thread also has its own private x]

Shared array distribution (2)
Change the data layout by adding a "blocking factor" to shared arrays
      shared[blocksize] type array[n]
      int x;
      shared[2] int y[16];
[Diagram: with a blocking factor of 2, thread 0 holds y[0..1] and y[8..9], thread 1 holds y[2..3] and y[10..11], thread 2 holds y[4..5] and y[12..13], thread 3 holds y[6..7] and y[14..15]]
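A quick way to see the effect of a blocking factor is to ask the runtime which thread owns each element. The probe below is my own sketch (not from the slides) and uses the standard upc_threadof inquiry function; it assumes a static THREADS environment as in the slide's example.

    /* Sketch (not from the slides): print the affinity of every element
       of a blocked shared array. Assumes a static THREADS environment
       (e.g. compiled for 4 threads). */
    #include <upc.h>
    #include <stdio.h>

    #define N 16
    shared [2] int y[N];   /* blocking factor 2, as on the slide */

    int main(void) {
        if (MYTHREAD == 0) {
            int i;
            for (i = 0; i < N; i++)
                printf("y[%2d] has affinity to thread %d\n",
                       i, (int) upc_threadof(&y[i]));
        }
        return 0;
    }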


Work sharing
Shared data means shared workload!
If shared data is distributed between threads, the threads can distribute the work on this data between them
UPC has a built-in mechanism for explicitly distributing and sharing work

Work sharing: upc_forall
Statement for work distribution
o allows loop assignment of tasks to threads
o parallel for loop
The 4th parameter defines affinity to a thread
o if "affinity % THREADS" matches MYTHREAD, that thread executes the iteration
      upc_forall(expression; expression; expression; affinity)
Condition: the iterations of upc_forall must be independent!


Example: vector addition (1)
      #define N 10 * THREADS

      shared int vector1[N];
      shared int vector2[N];
      shared int sum[N];

      void main() {
          int i;
          for(i=0; i …
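The code above is cut off at this point in the transcription, and the follow-on slide with the upc_forall version is missing entirely. The sketch below is my reconstruction of the two forms the lectures go on to discuss, not the slides' exact code:

    /* Sketch (reconstruction, not the slides' exact code): completing the
       vector addition in two ways. */
    #include <upc.h>

    #define N 10 * THREADS

    shared int vector1[N];
    shared int vector2[N];
    shared int sum[N];

    int main(void) {
        int i;

        /* (1) plain C loop: a modulo test picks out the elements each
           thread owns under the default cyclic distribution */
        for (i = 0; i < N; i++)
            if (MYTHREAD == i % THREADS)
                sum[i] = vector1[i] + vector2[i];

        /* (2) the same work with upc_forall: the affinity expression i
           assigns iteration i to thread i % THREADS */
        upc_forall(i = 0; i < N; i++; i)
            sum[i] = vector1[i] + vector2[i];

        return 0;
    }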


Side effects of shared data
Holding data in a shared memory space has implications:
1) the lifetime of shared data needs to extend beyond the scope it was defined in (unless this is program scope)
   → storage duration
2) the shared data needs to be kept up-to-date
   → synchronisation

Storage duration of shared objects
Shared objects cannot have automatic storage duration
o i.e. any variable defined inside a function!
Why?
The SPMD model means a shared variable may be accessed outside the lifetime of the function!
Conclusion
shared variables must either
o have file scope;
o or be declared as static if defined inside a function.


"Static" keyword
ensures shared objects are accessible throughout program execution
→ objects are not linked to the scope of a thread
→ objects will not simply "disappear" after a thread exits the scope in which the object was defined

Example: maximum of an array
Here: the shared variables have file scope!
      #define max(a,b) (((a)>(b)) ? (a) : (b))

      shared int maximum[THREADS];
      shared int globalMax = 0;
      shared int a[THREADS*10];

      void main(int argc, char **argv) {
          … // initialise array a
          upc_forall(int i=0; i …


Synchronisation
Ensure all threads reach the same point in execution
o necessary for memory and data consistency
Barriers are used for synchronisation
o blocking
o split-phase (non-blocking)
      upc_barrier              → blocking
      upc_notify, upc_wait     → non-blocking

Example: maximum of an array
Here: the shared variables have file scope!
      #define max(a,b) (((a)>(b)) ? (a) : (b))

      shared int maximum[THREADS];
      shared int globalMax = 0;
      shared int a[THREADS*10];

      void main(int argc, char **argv) {
          … // initialise array a
          upc_barrier;
          upc_forall(int i=0; i …
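The loop body and the final reduction are cut off in both copies of this example; the following is my reconstruction of how it plausibly continues (not the slides' exact code), with the barriers placed as the slide indicates:

    /* Sketch (reconstruction, not the slides' exact code): each thread
       scans the elements it owns, then thread 0 combines the results. */
    #include <upc.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define max(a,b) (((a)>(b)) ? (a) : (b))

    shared int maximum[THREADS];
    shared int globalMax = 0;
    shared int a[THREADS*10];

    int main(void) {
        int i;

        /* each thread initialises the elements it has affinity to */
        upc_forall(i = 0; i < THREADS*10; i++; &a[i])
            a[i] = rand() % 1000;

        maximum[MYTHREAD] = 0;
        upc_barrier;        /* a[] is fully initialised everywhere */

        upc_forall(i = 0; i < THREADS*10; i++; &a[i])
            maximum[MYTHREAD] = max(maximum[MYTHREAD], a[i]);

        upc_barrier;        /* all per-thread maxima are ready */

        if (MYTHREAD == 0) {
            for (i = 0; i < THREADS; i++)
                globalMax = max(globalMax, maximum[i]);
            printf("global maximum = %d\n", globalMax);
        }
        return 0;
    }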


References
• UPC Language Specification (Version 1.2): http://upc.gwu.edu/docs/upc_specs_1.2.pdf
• UPC homepage: http://upc.gwu.edu/
• GCC UPC compiler: http://www.gccupc.org

UPC on HECToR XE6
The Cray compiler is so far the only one to support UPC
– it is the default compiler on HECToR XE6
– start by checking that the correct programming environment/compiler is loaded:
      userid@hector-xe6:~> which cc
      /opt/cray/xt-asyncpe/5.09/bin/cc
The compiler option for UPC code is: -h upc
For full information on compiler options execute
      man craycc


XE6: Compiling & running your code
To compile, either use the provided Makefile or do
      cc -h upc -o myprogram myprogram.upc
On HECToR, login nodes and compute nodes use different file systems
o compile on the login nodes, i.e. $HOME
o batch jobs can only read from/write to the compute nodes' file system, i.e. $WORK
o copy your binary, any input files and the submit script to $WORK, e.g.
      userid@hector-xe6:~> cp myprogram script $WORK/upc/myprogram
→ Always keep copies of critical files on $HOME, as $WORK is not backed up!

Submitting a job
The batch scheduler is PBS
– qsub -q myscript.pbs to submit a job
– qstat to check job status
– qdel to delete a job from the queue
The parallel job launcher for the compute nodes is aprun
– argument -n specifies the total number of processes
– argument -N specifies the number of tasks per node
– call aprun from a subdirectory of /work
Job submission is easiest with a PBS script submitted from /work


Sample PBS script for XE6
      #!/bin/bash --login
      #PBS -N my_job
      #PBS -l mppwidth=32
      #PBS -l mppnppn=32
      #PBS -l walltime=00:20:00
      #PBS -A d45
      # (use the correct budget code above)

      # Change to the directory that the job was submitted from
      cd $PBS_O_WORKDIR

      # Set the number of UPC threads
      export NPROC=`qstat -f $PBS_JOBID | awk '/mppwidth/ {print $3}'`
      export NTASK=`qstat -f $PBS_JOBID | awk '/mppnppn/ {print $3}'`

      aprun -n $NPROC -N $NTASK ./myprogram


UPC: Data distribution, synchronisation & work sharing
Dr Michèle Weiland
Applications Consultant, EPCC
m.weiland@epcc.ed.ac.uk

→ data distribution
  o multi-dimensional data
→ synchronisation methods
  o blocking versus non-blocking
→ work sharing
  o examples: vector addition revisited, matrix-vector multiplication


Brief recap…
→ private and shared data, logically partitioned memory space
→ data objects have affinity to exactly one thread
→ work sharing through upc_forall
→ distribution of shared data
→ storage duration of shared data
→ synchronisation
(distribution of shared data, work sharing and synchronisation are the focus of today's lecture)

Data Distribution
Cyclic distribution is the default
      int x;
      shared int y[13];
[Diagram: with 4 threads, thread 0 holds y[0], y[4], y[8], y[12]; thread 1 holds y[1], y[5], y[9]; thread 2 holds y[2], y[6], y[10]; thread 3 holds y[3], y[7], y[11]]


Data Distribution (2)
If the number of elements is not an exact multiple of the thread count, threads can end up with uneven numbers of elements:
      int x;
      shared[2] int y[13];
[Diagram: with a blocking factor of 2 and 4 threads, thread 0 holds y[0..1] and y[8..9], thread 1 holds y[2..3] and y[10..11], thread 2 holds y[4..5] and y[12], thread 3 holds y[6..7]]

Blocking factor
should be used if the default distribution is not suitable
o more on the meaning of "suitable" later on…
→ four different cases
      shared [4]    defines a block size of 4 elements
      shared [0]    all elements are given affinity to thread 0
      shared [*]    when possible, data is distributed in contiguous blocks
      shared []     equivalent to shared [0]


Multi-dimensional data
UPC can distribute data using block cyclic distributions
Distributions represent a top-down approach
o shared objects can be distributed into segments using the blocking factor
→ conceptually the opposite of CAF, where shared objects are "created" by merging the pieces from every image in a bottom-up approach

2D array decomposition
distribution using the * layout qualifier and empty brackets
‣ Block distribution
      shared[*] int y[8][8];
‣ Entire array on master
      shared[] int y[8][8];   or
      shared[0] int y[8][8];
[Diagram: shared[*] splits the 8x8 array y into contiguous blocks of whole rows, one block per thread; shared[] / shared[0] places the entire array on thread 0]


2D array decomposition (2)
Distribution using different blocking factors
      shared[8] int y[8][8];
      shared[6] int y[8][8];
[Diagram: with a blocking factor of 8 each block is exactly one row of y; with a blocking factor of 6 the blocks straddle the row boundaries]

Multi-dimensional data – Case 1
      shared double grid[8][8][8]   with THREADS == 3
[Diagram: the default (cyclic) element-by-element distribution of the 8x8x8 grid over 3 threads]
N.B. the array layout convention is arr[x][y][z]!


Multi-dimensional data – Case 1
      shared double grid[8][8][8]   with THREADS == 4
[Diagram: the default (cyclic) distribution of the 8x8x8 grid over 4 threads]
N.B. the array layout convention is arr[x][y][z]!

Multi-dimensional data – Case 2
      shared [3] double grid[8][8][8]   with THREADS == 3
[Diagram: block-cyclic distribution with a blocking factor of 3 over 3 threads]


Multi-dimensional data – Case 2
blocking factor depending on the dimensions → pencil distribution
      shared [8] double grid[8][8][8]   with THREADS == 3
[Diagram: each block of 8 contiguous elements forms a "pencil" along the z dimension]

Multi-dimensional data – Case 2
combining thread count and blocking factor → slab distribution
      shared [8] double grid[8][8][8]   with THREADS == 4
[Diagram: with 4 threads the blocks of 8 combine into slab-shaped regions of the grid]


Multi-dimensional data – Case 3
slabs are contiguous in memory → blocking factor is the product of 2 dimensions
      shared [8*8] double grid[8][8][8]   with THREADS == 4
[Diagram: each thread holds two complete 8x8 slabs of the grid]

Why is the distribution important?
it is all about performance and minimising the cost of reading and writing data…
accessing shared data which resides in a physically remote location is more expensive than accessing shared data which has affinity with the thread!
optimise the layout of data by using knowledge of the problem size and, if possible, the number of threads


Static vs. dynamic compilation (1)
the number of UPC threads can be specified either at compile time (static) or at runtime (dynamic)
o Cray compiler: -X numThreads
Advantages
o dynamic: the program can be executed using any number of threads
o static: easier to distribute data based on THREADS
Disadvantages
o dynamic: not always possible to achieve the best possible distribution
o static: the program must be executed with the number of threads specified at compile time

Static vs. dynamic compilation (2)
"An array declaration is illegal if THREADS is specified at runtime and the number of elements to allocate at each thread depends on THREADS."
legal for static and dynamic environments:
      shared int x[4*THREADS];
      shared[] int x[8];
illegal for the dynamic environment:
      shared int x[8];
      shared[] int x[THREADS];
      shared int x[10+THREADS];


Static vs. dynamic compilation (3)
static compilation can often give better performance, as the data distribution is much easier to control
dynamic compilation provides greater flexibility, however the optimal data distribution may not always be achievable
→ a tradeoff between performance and convenience

Synchronisation
needed to ensure all threads reach the same point in the execution flow
o memory and data consistency
o two types of synchronisation: blocking and non-blocking
blocking synchronisation makes all threads wait at a barrier until the last thread has reached that barrier before allowing execution to continue
non-blocking synchronisation allows some local computation to be executed while waiting for the other threads


Barrier
      upc_barrier exp_opt
1. all threads execute the code that requires synchronisation
2. once finished, they wait at the barrier
3. when the last thread reaches the barrier, all threads are released to continue execution
[Diagram: threads t0 … tn all stop at upc_barrier before continuing]

Split-phase barrier
      upc_notify exp_opt   and   upc_wait exp_opt
1. a thread finishes the work that requires synchronisation → upc_notify to inform the others
2. the thread performs local computations → once finished, it waits
3. when all threads have executed upc_notify, the threads waiting at the barrier can continue
[Diagram: threads t0 … tn issue upc_notify (non-blocking), perform local computation, then block in upc_wait]
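A small sketch of the split-phase pattern (my own illustration, not from the slides): the thread notifies as soon as its shared data is ready, overlaps purely local work with the synchronisation, and only blocks in upc_wait:

    /* Sketch (not from the slides): overlapping local computation with a
       split-phase barrier. */
    #include <upc.h>

    #define N 64

    shared [N] double halo[N*THREADS];   /* data other threads will read */
    double scratch[N];                   /* purely private work array */

    void step(void) {
        int i;

        /* update the block of shared data this thread owns */
        for (i = 0; i < N; i++)
            halo[MYTHREAD*N + i] = 1.0 * i;

        upc_notify;    /* tell the other threads our shared data is ready */

        /* local work that touches no shared data */
        for (i = 0; i < N; i++)
            scratch[i] = 0.5 * scratch[i] + i;

        upc_wait;      /* block only until every thread has notified */

        /* it is now safe to read the other threads' halo blocks */
    }

    int main(void) {
        step();
        return 0;
    }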


Debugging barriers
the optional value exp can be used to check that all threads have reached the same barrier
if a thread executes a barrier with a different exp tag than the other threads, the application reports an expression mismatch and aborts
→ very useful for making sure that all threads are on the intended execution path

Work sharing revisited
the 4th parameter in the upc_forall loop represents affinity
→ it is evaluated to decide whether MYTHREAD will execute an iteration
affinity is an integer expression
→ the iteration is executed when affinity % THREADS == MYTHREAD
affinity is a pointer-to-shared
→ the iteration is executed when the object pointed to has affinity to MYTHREAD, i.e. upc_threadof(affinity) == MYTHREAD


Example: vector addition (1/3)
the three vectors are distributed in a cyclic fashion with the default blocking factor of 1
the modulo function identifies the local elements per thread
o if the distribution changes, this code will fail to identify the local elements
o it will still produce the correct result!
      #include <upc.h>
      #define N 100*THREADS
      shared int v1[N], v2[N], v1plusv2[N];

      void main() {
          int i;
          for(i=0; i<N; i++)
              if(MYTHREAD == i%THREADS)       /* modulo test picks the local elements */
                  v1plusv2[i] = v1[i] + v2[i];
      }


Example: Vector addition (3/3)
Advantage of the affinity parameter as an integer expression: simple syntax
→ if the distribution changes, upc_forall will still behave correctly and identify the local elements (for round-robin, unless modified)
      #include <upc.h>
      #define N 100*THREADS
      shared int v1[N], v2[N], v1plusv2[N];

      void main() {
          int i;
          /* the affinity parameter is the integer expression i */
          upc_forall(i=0; i<N; i++; i)
              v1plusv2[i] = v1[i] + v2[i];
      }
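Part (2/3) of this example is missing from this transcription; the usual alternative it would contrast with is a pointer-to-shared affinity expression, which follows whatever blocking factor the arrays were declared with. A sketch of that form (my own, not the slide's code):

    /* Sketch (not the slides' exact code): pointer-to-shared affinity.
       Iteration i runs on the thread that owns v1plusv2[i], whatever the
       blocking factor of the arrays. */
    #include <upc.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], v1plusv2[N];

    int main(void) {
        int i;
        upc_forall(i = 0; i < N; i++; &v1plusv2[i])
            v1plusv2[i] = v1[i] + v2[i];
        return 0;
    }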


Data distribution
Matrix-vector multiplication c = A*b
[Diagram: with 3 threads and the default cyclic distribution, the elements of A, b and c are scattered across the threads, so most of the operands each thread needs are remote]
number of remote operations:
      C0 = A0,0*B0 + A0,1*B1 + A0,2*B2    4
      C1 = A1,0*B0 + A1,1*B1 + A1,2*B2    4
      C2 = A2,0*B0 + A2,1*B1 + A2,2*B2    4

We can do better
distribute matrix a in blocks of size THREADS
→ each row is placed locally on one thread
      #include <upc.h>
      shared [THREADS] int a[THREADS][THREADS];
      shared int b[THREADS], c[THREADS];

      void main (void)
      {
          int i, j;

          upc_forall( i = 0 ; i < THREADS ; i++; &c[i]) {
              c[i] = 0;
              for ( j = 0 ; j < THREADS ; j++)
                  c[i] += a[i][j]*b[j];
          }
      }


Improved data distribution
[Diagram: with the blocked distribution, row i of A and element c[i] both live on thread i; only two of the b elements each thread needs are remote]
number of remote operations:
      C0 = A0,0*B0 + A0,1*B1 + A0,2*B2    2
      C1 = A1,0*B0 + A1,1*B1 + A1,2*B2    2
      C2 = A2,0*B0 + A2,1*B1 + A2,2*B2    2

Summary
correct data distribution is important for performance
keep the number of remote reads/writes as low as possible
UPC gives programmers control over data layout and work sharing
→ it is important to be aware of the performance implications
→ aim to keep work sharing loops independent of the data distribution


UPC: UPC Pointers, Dynamic Memory Allocation, UPC Collectives
Dr Michèle Weiland
Applications Consultant, EPCC
m.weiland@epcc.ed.ac.uk

Advanced use of UPC
‣ C and UPC pointers
‣ dynamic memory allocation
‣ locks
‣ UPC collectives


C pointers
A pointer in C is a data type whose value points to another variable's memory address
      float array[2];
      float* ptr = &array[0]; // Value of ptr is 213
[Diagram: ptr holds the address 213, where array[] starts in memory]

Pointer arithmetic
change which object a pointer refers to through pointer arithmetic:
o it is type dependent
o incrementing a float pointer will move it by sizeof(float)
      float array[2];
      float* ptr = &array[0]; // Value of ptr is 213
      ptr++;                  // Value of ptr is now 217
[Diagram: ptr advances from address 213 to 217, i.e. by one float]


UPC pointers
similar concept as in C
pointers are variables that contain the addresses of other variables
UPC pointers can
→ reside in the private or the shared memory space
→ reference the private or the shared memory space

Types of pointers
private to private:   int *p1;
private to shared:    shared int *p2;
shared to private:    int *shared p3;   (not recommended)
shared to shared:     shared int *shared p4;
[Diagram: p1 and p2 live in each thread's private space while p3 and p4 live in the shared space; p2 and p4 point into shared memory, p1 and p3 into private memory]


UPC pointers
UPC pointers-to-shared have three fields
o thread : the thread affinity of the pointer
o address : the virtual address of the block
o phase : indicates the element location within that block
      [ thread | block address | phase ]
the values of these fields are obtained from the functions
      size_t upc_threadof (shared void *ptr)
      size_t upc_phaseof (shared void *ptr)
      size_t upc_addrfield (shared void *ptr)

Pointer properties (1/3)
pointer arithmetic takes the blocking factor into account
      shared int x[16];               // shared int array
      shared int *dp = &x[5], *dp1;   // private pointers to shared
      dp1 = dp + 9;                   // default blocking factor 1
[Diagram: with the default cyclic layout over 4 threads, dp points at x[5] on thread 1 and dp+9 lands on x[14] on thread 2]


Pointer properties (2/3)
the pointer follows its own blocking factor
      shared [3] int x[16];
      shared [3] int *dp = &x[5], *dp1;
      dp1 = dp + 6;   // blocking factor 3
[Diagram: with blocks of 3 over 4 threads, dp points at x[5] on thread 1 and dp+6 lands on x[11] on thread 3]

Pointer properties (3/3)
‣ casting a shared pointer to a private pointer is allowed, but not the other way around
‣ casting a shared pointer to a private pointer will result in a loss of information
  ‣ thread & phase
‣ the cast is only well defined if the object pointed to by the shared pointer has local affinity


Pointer casting
casting a shared pointer to a private pointer results in information loss
      shared [3] int x[16];
      shared int *dp = &x[5];
      int *ptr;
      ptr = (int *) dp;    // ptr != upc_addrfield(dp)
[Diagram: dp carries phase 2, thread 1 and block address 0003FFB008; after the cast, ptr holds only the local address 00AFF53008]

Dynamic memory allocation
so far we have only seen how to allocate memory statically
dynamic memory allocation in UPC is of course possible
→ provides flexibility
→ allows object sizes to change during runtime
→ this is one of the cases where pointers are required
dynamic memory allocation in the private space is done using the standard C functions
dynamic memory allocation in the shared space is achieved using special UPC functions
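A short sketch of the legitimate use of such a cast (my own illustration, not from the slides): converting to a plain C pointer for cheap access to elements that have local affinity anyway:

    /* Sketch (not from the slides): casting to a private pointer is only
       well defined for elements with affinity to the calling thread. */
    #include <upc.h>
    #include <stdio.h>

    shared int x[10*THREADS];    /* default cyclic distribution */

    int main(void) {
        int i, sum = 0;
        int *local;              /* private pointer: cheap to dereference */

        for (i = MYTHREAD; i < 10*THREADS; i += THREADS) {
            /* x[i] has affinity to MYTHREAD, so the cast is well defined */
            local = (int *) &x[i];
            sum += *local;
        }
        printf("thread %d: local sum = %d\n", MYTHREAD, sum);
        return 0;
    }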


Dynamic memory allocation (2)
two types of memory allocation functions
o non-collective: upc_global_alloc, upc_alloc
o collective: upc_all_alloc
collective calls are made by all threads and return the same address value (pointer) to all of them
non-collective calls can be executed by multiple threads; each call will allocate a different shared block
free the allocated memory using upc_free
o not a collective call

Non-collective dynamic allocation
each thread allocates a memory block in its own shared memory space
      shared [] int *ptr;
      ptr = (shared [] int *) upc_alloc(N*sizeof(int));
[Diagram: each calling thread gets its own block of N ints in its part of the shared space, referenced through a private pointer]


Non-collective dynamic allocation (2)
      shared [N] int *ptr;
      ptr = (shared [N] int *) upc_global_alloc(THREADS, N*sizeof(int));
[Diagram: each calling thread allocates its own array of THREADS blocks of N ints, spread across the shared space of all threads]

Collective dynamic allocation
allocate contiguous segments of shared memory
      shared [N] int *ptr;
      ptr = (shared [N] int *) upc_all_alloc(THREADS, N*sizeof(int));
[Diagram: a single array of THREADS blocks of N ints is allocated, one block with affinity to each thread; every thread's ptr refers to the same object]
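Putting the collective form to work (my own sketch, not from the slides): every thread makes the same upc_all_alloc call and receives the same pointer-to-shared, each fills its own block, and a single thread frees the space once everyone has finished:

    /* Sketch (not from the slides): collective allocation of a blocked
       shared array at runtime, filled with upc_forall, then freed. */
    #include <upc.h>
    #include <stdio.h>

    #define N 8   /* elements per thread; illustrative value */

    int main(void) {
        shared [N] int *data;
        int i;

        /* collective call: all threads get the same pointer */
        data = (shared [N] int *) upc_all_alloc(THREADS, N * sizeof(int));

        /* iteration i runs on the thread that owns data[i] */
        upc_forall(i = 0; i < N * THREADS; i++; &data[i])
            data[i] = i;

        upc_barrier;    /* everyone has finished writing */

        if (MYTHREAD == 0) {
            printf("last element = %d\n", (int) data[N*THREADS - 1]);
            upc_free(data);    /* free once, from a single thread */
        }
        return 0;
    }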


Locks
access control mechanism for critical sections
→ sections which should be executed by one thread at a time, i.e. serialised execution
UPC shared data type: upc_lock_t
→ can have one of two states: locked or unlocked
→ can be seen by all threads
→ locks need to be manipulated through pointers

Program flow with locks
[Diagram: a thread obtains the lock and enters the critical section, so no other thread can enter; when it completes the critical section it unlocks, and the next thread can obtain the lock]


Creating locks
the initial state of a new lock object is unlocked
locks can be created collectively
→ the return value on every thread points to the same object
      upc_lock_t *upc_all_lock_alloc(void);
or non-collectively
→ all threads that call the function obtain different locks
      upc_lock_t *upc_global_lock_alloc(void);
resources allocated by locks need to be freed
      upc_lock_free(upc_lock_t *ptr);

Using locks
threads need to lock and unlock
      upc_lock(upc_lock_t *ptr);          /* blocking */
      upc_lock_attempt(upc_lock_t *ptr);  /* non-blocking */
      upc_unlock(upc_lock_t *ptr);
[Diagram: upc_lock(lock): if the lock is already locked, try again; once it is unlocked, obtain the lock, complete the critical section, then call upc_unlock(lock)]
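As a small illustration of this API (my own sketch, not from the slides), each thread adds its contribution to a shared total inside a lock-protected critical section:

    /* Sketch (not from the slides): protecting a shared accumulator with
       a collectively allocated lock. */
    #include <upc.h>
    #include <stdio.h>

    shared double total;   /* static shared storage is zero-initialised */

    int main(void) {
        /* every thread gets a pointer to the same lock object */
        upc_lock_t *lock = upc_all_lock_alloc();
        double my_contribution = 1.0 + MYTHREAD;

        upc_lock(lock);                    /* enter the critical section */
        total = total + my_contribution;   /* read-modify-write is now safe */
        upc_unlock(lock);                  /* leave it */

        upc_barrier;                       /* all updates are complete */

        if (MYTHREAD == 0) {
            printf("total = %f\n", total);
            upc_lock_free(lock);
        }
        return 0;
    }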


Example: Dining Philosophers
[Diagram: philosophers sit around a table with one fork between each pair of neighbours; a philosopher needs both neighbouring forks to eat]

Example: Dining Philosophers (contd)
model the forks as a shared array of locks; each thread then allocates a lock (i.e. a fork) dynamically and stores a pointer to it:
      upc_lock_t *shared fork[THREADS];
      fork[MYTHREAD] = upc_global_lock_alloc();
now attempt to get the locks (forks) on either side:
      left_fork = upc_lock_attempt(fork[MYTHREAD]);
      right_fork = upc_lock_attempt(fork[(MYTHREAD+1)%THREADS]);


Example: Dining Philosophers (contd)
if both forks were obtained, eat the meal and release the forks when finished:
      upc_unlock(fork[MYTHREAD]);
      upc_unlock(fork[(MYTHREAD+1)%THREADS]);
if only one fork could be obtained, release that fork and try again until two forks become available

UPC collectives
supported by most compilers
o readable code
o but not necessarily optimised for performance
requires a separate header file
      #include <upc_collective.h>


Collectives
Two types of collective operations are defined as part of the UPC standard specification:
1. relocalisation collectives
   upc_all_broadcast, upc_all_scatter, upc_all_gather, upc_all_gather_all, upc_all_exchange, upc_all_permute
2. computational collectives
   upc_all_reduceT, upc_all_prefix_reduceT, upc_all_sort
Supported operations: UPC_ADD, UPC_MULT, UPC_AND, UPC_OR, UPC_XOR, UPC_LOGAND, UPC_LOGOR, UPC_MIN, UPC_MAX
user-specified functions are supported via: UPC_FUNC, UPC_NONCOMM_FUNC
→ Calls to these functions must be performed by all threads

Computational collectives
11 variations of upc_all_reduceT and upc_all_prefix_reduceT
→ T is replaced with the type code used in the reduction operation:
      C  signed char        L  signed long
      UC unsigned char      UL unsigned long
      S  signed short       F  float
      US unsigned short     D  double
      I  signed int         LD long double
      UI unsigned int


Broadcast
      shared[] int A[2];
      shared[2] int B[8];
      upc_all_broadcast(B, A, 2*sizeof(int), UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
[Diagram: the two elements of A, which live on thread 0, are copied into every thread's block of B]

Scatter
      shared[] int A[8];
      shared[2] int B[8];
      upc_all_scatter(B, A, 2*sizeof(int), UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
[Diagram: A[0..7] on thread 0 is split into blocks of two; thread i receives A[2i] and A[2i+1] in its block of B]


Gather
      shared[] int A[8];
      shared[2] int B[8];
      upc_all_gather(A, B, 2*sizeof(int), UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
[Diagram: each thread's two-element block of B is collected into A[0..7] on thread 0]

Reduction
      shared[4] double A[16];
      shared double B;
      upc_all_reduceD(&B, A, UPC_ADD, 16, 4, NULL, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
[Diagram: the 16 elements of A, four per thread, are summed into the scalar B on thread 0]


Final summary
PGAS extension to C
→ data distribution & work sharing
→ pointers, dynamic memory allocation
→ synchronisation through barriers & locks
→ collective operations

References
http://upc.gwu.edu/documentation.html
→ Language Specification (version 1.2)
→ UPC Manual
→ UPC Collective Operations Specification (version 1.0)


Parallel Programming with Fortran Coarrays
David Henty, Alan Simpson (EPCC)
Harvey Richardson, Bill Long (Cray)

Overview
• Parallel Programming with Fortran Coarrays
• The Fortran Programming Model
• Basic coarray features
• Further coarray features
• Advanced coarray features
• Experiences with coarrays


The Fortran Programming Model

Motivation
• Fortran now supports parallelism as a full first-class feature of the language
• Changes are minimal
• Performance is maintained
• Flexibility in expressing communication patterns


Programming models for HPC
• The challenge is to efficiently map a problem to the architecture we have
  • Take advantage of all computational resources
  • Manage distributed memories etc.
  • Optimal use of any communication networks
• The HPC industry has long experience in parallel programming
  • Vector, threading, data-parallel, message-passing etc.
• We would like to have models, or combinations, that are
  • efficient
  • safe
  • easy to learn and use

Why consider new programming models?
• Next-generation architectures bring new challenges:
  • Very large numbers of processors with many cores
  • Complex memory hierarchy
  • even today (2011) we are at 500k cores
• Parallel programming is hard; we need to make this simpler
• Some of the models we currently use are
  • bolt-ons to existing languages as APIs or directives
  • Hard to program for the underlying architecture
  • unable to scale due to overheads
• So, is there an alternative to the models prevalent today?
  • Most popular are OpenMP and MPI …


Fortran 2008 coarray model
• An example of a Partitioned Global Address Space (PGAS) model
• Set of participating processes, like MPI
• Participating processes have access to local memory via standard program mechanisms
• Access to remote memory is directly supported by the language

Fortran coarray model
[Diagram: each process has its own memory and CPU; the language lets a process access the memory of the other processes directly]


Fortran coarray model
[Diagram: the statement a = b[3] on one process reads the variable b directly from the memory of image 3]

Fortran coarrays
• Remote access is a full feature of the language:
  • Type checking
  • Opportunity to optimise communication
• No penalty for local memory access
• Single-sided programming model is more natural for some algorithms
  • and a good match for modern networks with RDMA


Fortran coarrays: Basic Features

Coarray Fortran
"Coarrays were designed to answer the question:
'What is the smallest change required to convert Fortran into a robust and efficient parallel language?'
The answer: a simple syntactic extension.
It looks and feels like Fortran and requires Fortran programmers to learn only a few new rules."
John Reid, ISO Fortran Convener


Some History
• Introduced in the current form by Numrich and Reid in 1998 as a simple extension to Fortran 95 for parallel processing
• Many years of experience, mainly on Cray hardware
• A set of core features are now part of the Fortran standard ISO/IEC 1539-1:2010
• Additional features are expected to be published in a Technical Specification in due course

How Does It Work?
• SPMD - Single Program, Multiple Data
  • a single program is replicated a fixed number of times
• Each replication is called an image
• Images are executed asynchronously
  • the execution path may differ from image to image
  • some situations cause images to synchronise
• Images access remote data using coarrays
• Normal rules of Fortran apply


What are coarrays?
• Arrays or scalars that can be accessed remotely
  • images can access data objects on any other image
• Additional Fortran syntax for coarrays
  • Specifying a codimension declares a coarray
        real, dimension(10), codimension[*] :: x
        real :: x(10)[*]
  • these are equivalent declarations of an array x of size 10 on each image
  • x is now remotely accessible
  • coarrays have the same size on each image!

Accessing coarrays
        integer :: a(4)[*], b(4)[*]   ! declare coarrays
        b(:) = a(:)[n]                ! copy
• integer arrays a and b declared to be of size 4 on all images
• copy array a from remote image n into local array b
• () for local access, [] for remote access
• e.g. for two images and n = 2:
[Diagram: image 1 has a = (1,2,3,4) and image 2 has a = (2,9,3,7); after the copy, b on both images holds image 2's a, i.e. (2,9,3,7)]


Synchronisation
• Be careful when updating coarrays:
  • If we get remote data, was it valid?
  • Could another process send us data and overwrite something we have not yet used?
  • How do we know that data sent to us has arrived?
• Fortran provides synchronisation statements
  • For example, a barrier for synchronisation of all images:
        sync all
• do not make assumptions about execution timing on images
  • unless executed after synchronisation
• Note there is implicit synchronisation at program start

Retrieving information about images
• Two intrinsics provide the index of this image and the number of images
  • this_image()  (image indexes start at 1)
  • num_images()
        real :: x[*]
        if (this_image() == 1) then
          read *, x
          do image = 2, num_images()
            x[image] = x
          end do
        end if
        sync all


Making remote references
• We used a loop over images
        do image = 2, num_images()
          x[image] = x
        end do
• Note that array indexing within the coindex is not allowed, so we cannot write
        x[2:num_images()] = x   ! illegal

Data usage
• coarrays have the same size on every image
• Declarations:
  • round brackets () describe the rank, shape and extent of the local data
  • square brackets [] describe the layout of the images that hold the local data
• Many HPC problems have physical quantities mapped to n-dimensional grids
• You need to implement your view of the global data from the local coarrays, as Fortran does not provide the global view
  • You can be flexible with the coindexing (see later)
  • You can use any access pattern you wish


Data usage
• print out a 16-element "global" integer array A from 4 processors
  • 4 elements per processor = 4-element coarrays on 4 images
        integer :: ca(4)[*]
        do image = 1, num_images()
          print *, ca(:)[image]
        end do
[Diagram: the global array A is assembled as ca(1:4)[1], ca(1:4)[2], ca(1:4)[3], ca(1:4)[4] across images 1-4]

1D cyclic data access
• coarray declarations remain unchanged
  • but we use a cyclic access pattern
        integer :: ca(4)[*]
        do i = 1, 4
          do image = 1, num_images()
            print *, ca(i)[image]
          end do
        end do
[Diagram: the global array is now traversed element by element, round-robin across the images]


Synchronisation
• code execution on images is independent
  • the programmer has to control execution using synchronisation
• synchronise before accessing coarrays
  • ensure content is not updated from remote images before you can use it
• synchronise after accessing coarrays
  • ensure new content is available to all images
• implicit synchronisation after variable declarations, at the first executable statement
  • guarantees coarrays exist on all images when your first program statement is executed
• We will revisit this topic later

Example: maximum of array
        real :: a(10)
        real :: maximum[*]
        ! implicit synchronisation here

        call random_number(a)
        maximum = maxval(a)

        sync all   ! ensure all images have set their local maximum

        if (this_image() == 1) then
          do image = 2, num_images()
            maximum = max(maximum, maximum[image])
          end do
          do image = 2, num_images()
            maximum[image] = maximum
          end do
        end if

        sync all   ! ensure all images have a copy of the maximum value


Recap
We now know the basics of coarrays
• declarations
• references with []
• this_image() and num_images()
• sync all
Now consider a full program example...

Example 2: Calculate density of primes
        program pdensity
        implicit none
        integer, parameter :: n=10000000, nimages=8
        integer start, end, i
        integer, dimension(nimages) :: nprimes[*]
        real density

        start = (this_image()-1) * n/num_images() + 1
        end = start + n/num_images() - 1

        nprimes(this_image())[1] = num_primes(start, end)
        sync all


Example 2: Calculate density of primes (continued)
        if (this_image() == 1) then
          nprimes(1) = sum(nprimes)
          density = real(nprimes(1))/n
          print *, "Calculating prime density on", &
                   num_images(), "images"
          print *, nprimes(1), 'primes in', n, 'numbers'
          write(*,'(" density is ",2Pf0.2,"%")') density
          write(*,'(" asymptotic theory gives ", &
                   &2Pf0.2,"%")') 1.0/(log(real(n))-1.0)
        end if

Example 2: sample output
        Calculating prime density on 2 images
        664580 primes in 10000000 numbers
        density is 6.65%
        asymptotic theory gives 6.61%
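The num_primes function that the program calls is not shown on the slides. A deliberately simple version, written as my own sketch rather than the course's code, could be placed after a contains statement in program pdensity:

        ! Sketch (not from the slides): a naive num_primes the example
        ! could use, placed after a CONTAINS statement in program pdensity.
        integer function num_primes(first, last)
          implicit none
          integer, intent(in) :: first, last
          integer :: i, j
          logical :: isprime

          num_primes = 0
          do i = max(first, 2), last
            isprime = .true.
            do j = 2, int(sqrt(real(i)))
              if (mod(i, j) == 0) then
                isprime = .false.
                exit
              end if
            end do
            if (isprime) num_primes = num_primes + 1
          end do
        end function num_primes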


Launching a coarray program
• The Fortran standard does not specify how a program is launched
• The number of images may be set at compile, link or run time
• A compiler could optimise for a single image
• Examples on Linux
  • Cray XE:
        aprun -n 16000 solver
  • g95:
        ./solver --g95 -images=2

Observations so far on coarrays
• Natural extension, easy to learn
• Makes the parallel parts of a program obvious (syntax)
• Part of the Fortran language (type checking, etc.)
• No mapping of data to buffers (or copying) or creation of complex types (as we might have with MPI)
• The compiler can optimise for communication
• More observations later...


Exercise Session 1
• Look at the Exercise Notes document for full details
• Write, compile and run a "Hello World" program that prints out the value of the running image's image index and the number of images
• Extend the simple Fortran code provided in order to perform operations on parts of a picture using coarrays

Additional Slides
Comparison of Programming Models


Shared-memory directives and OpenMP
[Diagram: several threads all operating on one shared memory]

OpenMP: work distribution
        !$OMP PARALLEL DO
        do i=1,32
          a(i)=a(i)*2
        end do
[Diagram: the 32 iterations are split between the threads (1-8, 9-16, 17-24, 25-32), all updating the shared array a in memory]


OpenMP implementation
[Diagram: the threads of a single process map onto the CPUs of one shared-memory node]

Shared Memory Directives
• Multiple threads share global memory
• Most common variant: OpenMP
• Program loop iterations distributed to threads, more recent task features
• Each thread has a means to refer to private objects within a parallel context
• Terminology
  • Thread, thread team
• Implementation
  • Threads map to user threads running on one SMP node
  • Extensions to distributed memory not so successful
• OpenMP is a good model to use within a node


Cooperating Processes Models
[Diagram: a problem decomposed among a set of cooperating processes]

Message Passing, MPI
[Diagram: each process has its own memory and CPU; data is exchanged only by messages between the processes]


MPI
[Diagram: process 0 calls MPI_Send(a,...,1,…) and process 1 calls the matching MPI_Recv(a,...,0,…)]

Message Passing
• Participating processes communicate using a message-passing API
• Remote data can only be communicated (sent or received) via the API
• MPI (the Message Passing Interface) is the standard
• Implementation: MPI processes map to processes within one SMP node or across multiple networked nodes
• The API provides process numbering, point-to-point and collective messaging operations
• Mostly used in a two-sided way: each endpoint coordinates in sending and receiving
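For comparison with the single-sided examples earlier, here is a minimal two-sided MPI exchange (my own sketch, not from the slides) in which both endpoints take part; it assumes at least two processes:

    /* Sketch (not from the slides): two-sided message passing with MPI.
       Run with at least two processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, a = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            a = 42;
            MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* sender takes part */
        } else if (rank == 1) {
            MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                     /* receiver takes part */
            printf("process 1 received %d\n", a);
        }

        MPI_Finalize();
        return 0;
    }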


SHMEM
[Diagram: process 0 calls shmem_put(a,b,…) to write directly into the memory of process 1]

SHMEM
• Participating processes communicate using an API
• Fundamental operations are based on one-sided PUT and GET
• Need to use symmetric memory locations
• The remote side of the communication does not participate
• Can test for completion
• Barriers and collectives
• Popular on Cray and SGI hardware; also a Blue Gene version
• To make sense it needs hardware support for low-latency RDMA-type operations
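To make the PUT operation concrete, here is a minimal OpenSHMEM sketch in C (my own illustration, not from the slides): every PE writes into a symmetric variable on its right-hand neighbour, and a barrier guarantees completion before the data is read.

    /* Sketch (not from the slides): one-sided PUT with OpenSHMEM. */
    #include <shmem.h>
    #include <stdio.h>

    int dest;   /* global, therefore symmetric: same address on every PE */

    int main(void) {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();
        int value = 100 + me;

        /* write our value into the next PE's copy of dest */
        shmem_int_put(&dest, &value, 1, (me + 1) % npes);

        shmem_barrier_all();   /* all puts are complete and visible */
        printf("PE %d received %d\n", me, dest);

        shmem_finalize();
        return 0;
    }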


High Performance Fortran (HPF)
• Data Parallel programming model
• Single thread of control
• Arrays can be distributed and operated on in parallel
• Loosely synchronous
• Parallelism mainly from Fortran 90 array syntax, FORALL and intrinsics
• This model was popular on SIMD hardware (AMT DAP, Connection Machines) but was extended to clusters, where the control thread is replicated

HPF
[Diagram: a single control thread; the array data is distributed across the memories of several processing elements]
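A tiny illustration of the HPF style (my own sketch, not from the slides): a directive describes how the array is distributed, and ordinary Fortran 90 array syntax then operates on it in parallel:

        ! Sketch (not from the slides): HPF-style data-parallel code.
        real :: a(1024)
        !HPF$ DISTRIBUTE a(BLOCK)
        a = 1.0
        a = sqrt(a)     ! applied in parallel across the distributed array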


HPF
• Diagram: an array statement such as A = SQRT(A) operates on the distributed array A across all processing elements

UPC
• Diagram: each UPC thread has its own memory and CPU, forming a partitioned global address space


UPC
• Diagram: work distribution across threads using a upc_forall loop (upc_forall(i=0;i...), the UPC analogue of the OpenMP parallel do shown earlier


Fortran 2008 coarray model
• Example of a Partitioned Global Address Space (PGAS) model
• Set of participating processes like MPI
• Participating processes have access to local memory via standard program mechanisms
• Access to remote memory is directly supported by the language

Fortran coarray model
• Diagram: each process (image) has its own memory and CPU


Fortran coarray model
• Diagram: the statement a = b[3] reads the coarray b from image 3 directly into a local variable (see the sketch below)

Fortran coarrays
• Remote access is a full feature of the language:
  • Type checking
  • Opportunity to optimize communication
  • No penalty for local memory access
• Single-sided programming model more natural for some algorithms
• and a good match for modern networks with RDMA
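A minimal sketch of the remote read shown in the diagram; the sync all is needed so that b is defined everywhere before any other image reads it (variable names follow the diagram, the printed output is illustrative):

  program remote_read
    implicit none
    real :: a, b[*]            ! b is a scalar coarray: one copy per image
    b = real(this_image())
    sync all                   ! image control statement: b now defined on every image
    if (num_images() >= 3) then
       a = b[3]                ! read image 3's copy of b, as in the diagram
       print *, 'image', this_image(), 'read b[3] =', a
    end if
  end program remote_read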




Parallel Programming with Fortran Coarrays: Overview of Exercises
David Henty, Alan Simpson (EPCC)
Harvey Richardson, Bill Long (Cray)

Exercise 1
• Hello world example (a minimal sketch follows below)
  • check you can log on, compile, submit and run
• Writing arrays as pictures
  • declare and manipulate coarrays
  • write out arrays in PGM picture format
  • view them using display from ImageMagick
  • use both remote reads and remote writes
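A minimal sketch of the "Hello world" part of Exercise 1; the exact wording of the output is up to you:

  program hello
    implicit none
    ! Every image executes this; each prints its own index and the total image count
    write(*,*) 'Hello from image ', this_image(), ' of ', num_images()
  end program hello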


Sample output on 4 images
• Diagram: example picture output decomposed across 4 images

Exercise 2
• Perform simple edge detection of features in a picture
  • halo communication between a 1D grid of images
• Reconstruct the picture from the supplied edges
  • an iterative algorithm
  • computationally intensive so worth parallelising
• Terminate based on some stopping criterion
  • requires global sums
• Use global or point-to-point synchronisation
• Look at scalability


Edge detection and picture reconstruction
• Diagram: edge detection is a single pass; reconstruction takes hundreds of iterations

Exercise 3 (Extra)
• Decompose the picture across a 2D grid of images
  • using multiple codimensions


Documentation
• Full instructions in the exercise notes
  • PDF copy in doc/ subdirectory
• Go at your own pace
  • no direct dependencies between practicals & lectures
  • each exercise follows on from the last
• If you're not sure what to do or if you have any other questions then please ask us!


More Coarray Features
Parallel Programming with Fortran Coarrays
David Henty, Alan Simpson (EPCC)
Harvey Richardson, Bill Long (Cray)

Overview
• Multiple Dimensions and Codimensions
• Allocatable Coarrays and Components of Coarray Structures
• Coarrays and Procedures


Mapping data to images
• Diagram: a physical quantity (PRESSURE) held in an array P(m,n) can be mapped onto the images as P(m,n)[*] or P(m,n)[k,*]

2D Data
• Coarray Fortran has a "bottom-up" approach to global data
  • assemble rather than distribute
  • unlike HPF ("top-down") or UPC shared distributed data
• Can assemble a 2D data structure from 1D arrays
  integer :: ca(4)[*]
• Diagram: each image holds its own ca(4); together they form a 2D structure


2D Data
• However, images are not restricted to a 1D arrangement
• For example, we can arrange images in a 2x2 grid
  • coarrays with 2 codimensions
  integer :: ca(4)[2,*]
• Diagram: image 1 holds ca(:)[1,1], image 2 holds ca(:)[2,1], image 3 holds ca(:)[1,2] and image 4 holds ca(:)[2,2]

2D Local Array on 2D Grid
• Diagram: a 2D array A assembled from ca(1:4,1:4)[1,1] (image 1), ca(1:4,1:4)[2,1] (image 2), ca(1:4,1:4)[1,2] (image 3) and ca(1:4,1:4)[2,2] (image 4)
• global access: ca(3,1)[2,2]
• local access: ca(3,1)
• (see the sketch below)
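A small sketch of the 2x2 arrangement above, assuming the program is run on exactly 4 images; the printed values are illustrative:

  program grid2x2
    implicit none
    integer :: ca(4)[2,*]          ! images arranged as a 2 x 2 grid for this coarray
    integer :: me(2)
    ca = this_image()              ! fill my local ca with my image index
    me = this_image(ca)            ! my (row, column) cosubscripts in the 2 x 2 grid
    sync all                       ! every image's ca is defined before any remote read
    if (this_image() == 1) then
       print *, 'ca(1) held by the image at [2,2] is ', ca(1)[2,2]   ! global access
    end if
    print *, 'image', this_image(), 'has cosubscripts', me
  end program grid2x2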


Coarray Subscripts
• Fortran arrays are defined by rank, bounds and shape
  integer, dimension(10,4) :: array
  • rank 2
  • lower bounds 1, 1; upper bounds 10, 4
  • shape [10, 4]
• Coarray Fortran adds corank, cobounds and coshape
  integer :: array(10,4)[3,*]
  • corank 2
  • lower cobounds 1, 1; upper cobounds 3, m
  • coshape [3, m]
  • m would be ceiling(num_images()/3)

Multiple Codimensions
• Coarrays with multiple codimensions:
  • character :: a(4)[2,*]      ! 2D grid of images: for 4 images the grid is 2x2; for 16 images it is 2x8
  • real :: b(8,8,8)[10,5,*]    ! 3D grid of images: 8x8x8 local array; with 150 images the grid is 10x5x3
  • integer :: c(6,5)[0:9,0:*]  ! 2D grid of images: lower cobounds [0, 0]; upper cobounds [9, n]; useful if you want to interface with MPI or want C-like coding
• Sum of rank and corank should not exceed 15
• Flexibility with cobounds
  • can set all but the final upper cobound as required


Codimensions: What They Mean
• Images are organised into a logical 2D, 3D, ... grid
  • for that coarray only
• A map so an image can find the coarray on any other image
  • access the coarray using its grid coordinates
• e.g. character a(4)[2,*] on 6 images
  • gives a 2 x 3 image grid
  • usual Fortran subscript order determines the image index
• Diagram: axis 1 has extent 2 and axis 2 has extent 3, so image 1 is at [1,1], image 2 at [2,1], image 3 at [1,2], image 4 at [2,2], image 5 at [1,3] and image 6 at [2,3]; a(4)[1,2] lives on image 3 and a(2)[2,3] on image 6

Codimensions and Array-Element Order
• Storage order for multi-dimensional Fortran arrays: real p(2,3,8)
  • location 1 holds p(1,1,1), 2 holds p(2,1,1), 3 holds p(1,2,1), 4 holds p(2,2,1), 5 holds p(1,3,1), 6 holds p(2,3,1), 7 holds p(1,1,2), 8 holds p(2,1,2), ..., 48 holds p(2,3,8)
• Ordering of images in multi-dimensional cogrids follows the same rule: real q(4)[2,3,*]
  • image 1 holds q(1:4)[1,1,1], image 2 q(1:4)[2,1,1], image 3 q(1:4)[1,2,1], image 4 q(1:4)[2,2,1], image 5 q(1:4)[1,3,1], image 6 q(1:4)[2,3,1], image 7 q(1:4)[1,1,2], image 8 q(1:4)[2,1,2], ..., image 48 q(1:4)[2,3,8]


Multi Codimensions: An Example
• Domain Decomposition
  • () gives the local domain size
  • [] provides the image grid and easy access to other images
• 2D domain decomposition of "Braveheart"
  • Global data is 360 x 192
  • Domain decomposition on 8 images with a 4 x 2 grid
  • local array size: (360 / 4) x (192 / 2) = 90 x 96
  • declaration: real :: localPic(90,96)[4,*]
• Diagram: the 4 x 2 grid of 90 x 96 tiles has images 1-4 down axis 1 in the first column and images 5-8 in the second column along axis 2 (a sketch of locating each tile follows below)
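A sketch of the declaration above together with the arithmetic that locates each image's 90 x 96 tile inside the 360 x 192 picture, assuming the code runs on exactly 8 images; variable names other than localPic are illustrative:

  program decomp2d
    implicit none
    real    :: localPic(90,96)[4,*]   ! 4 x 2 image grid, one 90 x 96 tile per image
    integer :: coords(2), gx0, gy0
    coords = this_image(localPic)     ! my position in the 4 x 2 grid
    gx0 = (coords(1)-1)*90 + 1        ! first global row of my tile
    gy0 = (coords(2)-1)*96 + 1        ! first global column of my tile
    write(*,*) 'image', this_image(), ' holds rows ', gx0, ':', gx0+89, &
               ' and columns ', gy0, ':', gy0+95
  end program decomp2d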


this_image() & image_index()
• this_image() returns the image index, i.e., a number between 1 and num_images()
• this_image(z) returns the rank-one integer array of cosubscripts for the calling image corresponding to the coarray z
  • this_image(z, dim) returns the cosubscript of codimension dim of z
• image_index(z, sub) returns the image index with cosubscripts sub for coarray z
  • sub is a rank-one integer array

Example 1
  PROGRAM CAF_Intrinsics
    real :: b(90,96)[4,*]
    write(*,*) "this_image() =", this_image()
    write(*,*) "this_image(b) =", this_image(b)
    write(*,*) "image_index(b,[3,2]) =", image_index(b,[3,2])
  END PROGRAM CAF_Intrinsics
• Sample output: on image 2, this_image(b) = [2, 1]; on image 5, this_image(b) = [1, 2]; on image 7, this_image(b) = [3, 2]; image_index(b,[3,2]) = 7 on every image


Example 2
  PROGRAM CAF_Intrinsics
    real :: c(4,4,4)[5,-1:4,*]
    write(*,*) "this_image() =", this_image()
    write(*,*) "this_image(c) =", this_image(c)
    write(*,*) "image_index(c,[1,0,4]) =", image_index(c,[1,0,4])
  END PROGRAM CAF_Intrinsics
• Sample output: on image 13, this_image(c) = [3, 1, 1]; on image 90, this_image(c) = [5, 4, 3]; on image 96, this_image(c) = [1, 0, 4]; image_index(c,[1,0,4]) = 96 on every image

Boundary Swapping
  PROGRAM CAF_HaloSwap
    integer, parameter :: nximages = 4, nyimages = 2
    integer, parameter :: nxlocal = 90, nylocal = 96
    real :: pic(0:nxlocal+1, 0:nylocal+1)[nximages,*]  ! Declare coarray with halos
    integer :: myimage(2)                              ! Array for my row & column coordinates

    myimage = this_image(pic)    ! Find my row & column cosubscripts
    ...                          ! Initialise pic on each image
    sync all                     ! Ensures pic initialised before accessed by other images

    ! Halo swap
    if (myimage(1) > 1) &
      pic(0,1:nylocal) = pic(nxlocal,1:nylocal)[myimage(1)-1,myimage(2)]
    if (myimage(1) < nximages) &
      pic(nxlocal+1,1:nylocal) = pic(1,1:nylocal)[myimage(1)+1,myimage(2)]
    if (myimage(2) > 1) &
      pic(1:nxlocal,0) = pic(1:nxlocal,nylocal)[myimage(1),myimage(2)-1]
    if (myimage(2) < nyimages) &
      pic(1:nxlocal,nylocal+1) = pic(1:nxlocal,1)[myimage(1),myimage(2)+1]

    sync all                     ! Ensures all images have got old values before pic is updated
    ...                          ! Update pic on each image
  END PROGRAM CAF_HaloSwap


Allocatable Coarrays
• Can have allocatable coarrays
  real, allocatable :: x(:)[:], s[:,:]
  n = num_images()
  allocate(x(n)[*], s[4,*])
• Must specify cobounds in the allocate statement
• The size and value of each bound and cobound must be the same on all images
  • allocate(x(this_image())[*]) ! Not allowed
• Implicit synchronisation of all images...
  • ...after each allocate statement involving coarrays
  • ...before deallocate statements involving coarrays
• (a sketch follows below)

Differently Sized Coarray Components
• A coarray structure component can vary in size per image
• Declare a coarray of derived type with a component that is allocatable (or pointer)...
  ! Define data type with allocatable component
  type diffSize
    real, allocatable :: data(:)
  end type diffSize
  ! Declare coarray of type diffSize
  type(diffSize) :: x[*]
  ! Allocate x%data to a different size on each image
  allocate(x%data(this_image()))
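A minimal sketch of the allocation rules from the first slide above; the cobounds appear in the allocate statement, and allocation and deallocation of coarrays imply synchronisation (the values assigned are illustrative):

  program alloc_caf
    implicit none
    real, allocatable :: x(:)[:]
    integer :: n
    n = num_images()
    allocate( x(n)[*] )        ! same bounds and cobounds on every image; implies a sync
    x = 0.0
    x(this_image()) = 1.0
    sync all                   ! make the updates visible before any remote access
    if (this_image() == 1) print *, 'x on last image:', x(:)[num_images()]
    deallocate( x )            ! images synchronise before deallocation
  end program alloc_caf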


Pointer Coarray Structure Components
• We are allowed to have a coarray that contains components that are pointers
  • Note that the pointers have to point to local data
• We can then access one of the pointers on a remote image to get at the data it points to
• This technique is useful when adding coarrays into an existing MPI code
  • We can insert coarray code deep in the call tree without changing many subroutine argument lists
  • We don't need new coarray declarations
• Example follows...

Pointer Coarray Structure Components...
• Existing non-coarray arrays u, v, w
• Create a type (coords) to hold pointers (x, y, z) that we use to point to u, v, w; we can then use the vects coarray to access u, v, w remotely
  subroutine calc(u,v,w)
    real, intent(in), target, dimension(100) :: u,v,w
    type coords
      real, pointer, dimension(:) :: x,y,z
    end type coords
    type(coords), save :: vects[*]
    ! ...
    vects%x => u ; vects%y => v ; vects%z => w
    sync all
    firstx = vects[1]%x(1)


Coarrays and Procedures
• An explicit interface is required if a dummy argument is a coarray
• Dummy argument associated with the coarray, not a copy
  • avoids synchronisation on entry and return
• Other restrictions on passing coarrays are:
  • the actual argument should be contiguous
    • a(:,2) is OK, but a(2,:) is not contiguous
  • or the dummy argument should be assumed shape
  • ... to avoid copying
• Function results cannot be coarrays

Coarrays as Dummy Arguments
• As with standard Fortran arrays, the coarray dummy arguments in procedures can be:
  • Explicit shape: each dimension of a coarray declared with an explicit value
  • Assumed shape: extents and bounds determined by the actual array
  • Assumed size: only the size determined from the actual array
  • Allocatable: the size and shape can be determined at run-time
  subroutine s(n, a, b, c, d)
    integer :: n
    real :: a(n)[n,*]               ! explicit shape - permitted
    real :: b(:,:)[*]               ! assumed shape - permitted
    real :: c(n,*)[*]               ! assumed size - permitted
    real, allocatable :: d(:)[:,:]  ! allocatable - permitted


Assumed Size Coarrays
• Allow the coshape to be remapped to corank 1
  program cmax
    real, codimension[8,*] :: a(100), amax
    a = [ (i, i=1,100) ] * this_image() / 100.0
    amax = maxval( a )
    sync all
    amax = AllReduce_max(amax)
    ...
  contains
    real function AllReduce_max(r) result(rmax)
      real :: r[*]
      sync all
      rmax = r
      do i=1,num_images()
        rmax = max( rmax, r[i] )
      end do
      sync all
    end function AllReduce_max

Coarrays Local to a Procedure
• Coarrays declared in procedures must have the save attribute
  • unless they are dummy arguments or allocatable
  • save attribute: retains value between procedure calls
  • avoids synchronisation on entry and return
• Automatic coarrays are not permitted
  • Automatic array: a local array whose size depends on dummy arguments
  • would require synchronisation for memory allocation and deallocation
  • would need to ensure coarrays have the same size on all images
  subroutine t(n)
    integer :: n
    real :: temp(n)[*]        ! automatic - not permitted
    integer, save :: x(4)[*]  ! coarray with save attribute
    integer :: y(4)[*]        ! not saved - not permitted


Summary
• Coarrays with multiple codimensions are used to create a grid of images
  • () gives local domain information
  • [] gives an image grid with easy access to other images
  • Can be used in various ways to assemble a multi-dimensional data set
• this_image() and image_index()
  • are intrinsic functions that give information about the images in a multi-codimension grid
• Flexibility from non-coarray allocatable and pointer components of coarray structures
• Coarrays can be allocatable, can be passed as arguments to procedures, and can be dummy arguments


Advanced Features
Parallel Programming with Fortran Coarrays
MSc in HPC
David Henty, Alan Simpson (EPCC)
Harvey Richardson, Bill Long (Cray)

Advanced Features: Overview
• Execution segments and Synchronisation
• Non-global Synchronisation
• Critical Sections
• Visibility of changes to memory
• Other Intrinsics
• Miscellaneous features
• Future developments


More on Synchronisation
• We have to be careful with one-sided updates
  • If we read remote data, was it valid?
  • Could another process send us data and overwrite something we have not yet used?
  • How do we know when remote data has arrived?
• The standard introduces execution segments to deal with this: segments are bounded by image control statements
• The standard can be summarized as follows:
  • "If a variable is defined in a segment, it must not be referenced, defined, or become undefined in another segment unless the segments are ordered" - John Reid

Execution Segments
• Diagram: the same program runs on image 1 and image 2; the sync all is the image synchronisation point that orders each image's first segment before every image's second segment (a minimal sketch of the rule follows below)
  program hot
    double precision :: a(n)
    double precision :: temp(n)[*]
    !...
    ! segment 1
    if (this_image() == 1) then
      do i=1, num_images()
        read *,a
        temp(:)[i] = a
      end do
    end if
    sync all            ! image control statement: segment boundary
    ! segment 2
    temp = temp + 273d0
    ! ...
    call ensemble(temp)
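A minimal sketch of the segment rule itself: each image defines only its own copy of a coarray in the first segment, and the sync all orders that segment before the one in which another image reads it (variable names are illustrative):

  program segments
    implicit none
    integer :: val[*]
    val = this_image()        ! segment 1: each image defines only its own val
    sync all                  ! image control statement: ends segment 1, starts segment 2
    if (this_image() == 1) then
       print *, 'last image holds', val[num_images()]   ! segment 2: ordered after the definition
    end if
  end program segments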


Synchronisation mistakes
• This code is wrong - it breaks the rules
  subroutine allreduce_max_getput(v,vmax)
    double precision, intent(in) :: v[*]
    double precision, intent(out) :: vmax[*]
    integer i
    sync all
    vmax=v
    if (this_image()==1) then
      do i=2,num_images()
        vmax=max(vmax,v[i])
      end do
      do i=2,num_images()
        vmax[i]=vmax
      end do
    end if
    sync all
• Every image defines its own vmax in the same segment in which image 1 writes vmax[i] on the other images - the segments are not ordered


Synchronisation mistakes
• This is ok
  subroutine allreduce_max_getput(v,vmax)
    double precision, intent(in) :: v[*]
    double precision, intent(out) :: vmax[*]
    integer i
    sync all
    if (this_image()==1) then
      vmax=v
      do i=2,num_images()
        vmax=max(vmax,v[i])
      end do
      do i=2,num_images()
        vmax[i]=vmax
      end do
    end if
    sync all
• Only image 1 defines vmax, so no variable is touched from unordered segments

More about sync all
• Usually all images execute the same sync all statement
• But this is not a requirement...
  • Images can execute different code with different sync all statements
  • All images execute the first sync all they come across and...
  • this may match an arbitrary sync all on another image
  • causing incorrect execution and/or deadlock
• Need to be careful with this 'feature'
• Possible to write code which doesn't deadlock but gives wrong answers


More about sync all
• e.g. Image practical: wrong answer
  ! Do halo swap, taking care at the upper and lower picture boundaries
  if (myimage < numimage) then
    oldpic(1:nxlocal, nylocal+1) = oldpic(1:nxlocal, 1)[myimage+1]
    sync all                       ! All images NOT executing this sync all
  end if
  ! ... and the same for down halo
  ! Now update the local values of newpic...
  ! Need to synchronise to ensure that all images have finished reading the
  ! oldpic halo values on this image before overwriting it with newpic
  sync all
  oldpic(1:nxlocal,1:nylocal) = newpic(1:nxlocal,1:nylocal)
  ! Need to synchronise to ensure that all images have finished updating
  ! their oldpic arrays before this image reads any halo data from them
  sync all                         ! All images ARE executing this sync all

More about sync all
• sync images(imageList)
  • Performs a synchronisation of the image executing sync images with each of the images specified in imageList
  • imageList can be an array or a scalar
  if (myimage < numimage) then
    oldpic(1:nxlocal, nylocal+1) = oldpic(1:nxlocal, 1)[myimage+1]
  end if
  if (myimage > 1) then
    oldpic(1:nxlocal, 0) = oldpic(1:nxlocal, nylocal)[myimage-1]
  end if
  ! Now perform local pairwise synchronisations
  if (myimage == 1) then
    sync images( 2 )
  else if (myimage == numimage) then
    sync images( numimage-1 )
  else
    sync images( (/ myimage-1, myimage+1 /) )
  end if


Other Synchronisation
• Critical sections
  • Limit execution of a piece of code to one image at a time
  • e.g. calculating a global sum on the master image
  integer :: a(100)[*]
  integer :: globalSum[*] = 0, localSum
  ...                  ! Initialise a on each image
  localSum = SUM(a)    ! Find localSum of a on each image
  critical
    globalSum[1] = globalSum[1] + localSum
  end critical

Other Synchronisation
• sync memory
  • Coarray data held in caches/registers is made visible to all images
  • requires some other synchronisation to be useful
  • unlikely to be used in most coarray codes
• Example usage: mixing MPI and coarrays
  loop: coarray operations
        sync memory
        call MPI_Allreduce(...)
• sync memory is implied by sync all and sync images


Other Synchronisation
• lock and unlock statements
  • Control access to data defined or referenced by more than one image
  • as opposed to critical, which controls access to lines of code
  • USE the iso_fortran_env module and define a coarray of type(lock_type)
• e.g. to lock data on image 2
  type(lock_type) :: qLock[*]
  lock(qLock[2])
  ! access data on image 2
  unlock(qLock[2])

Other Intrinsic functions
• lcobound(z)
  • Returns the lower cobounds of the coarray z
  • lcobound(z,dim) returns the lower cobound for codimension dim of z
• ucobound(z)
  • Returns the upper cobounds of the coarray z
  • ucobound(z,dim) returns the upper cobound for codimension dim of z
• real :: array(10)[4,0:*] on 16 images
  • lcobound(array) returns [ 1, 0 ]
  • ucobound(array) returns [ 4, 3 ]
• (a sketch follows below)
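A small sketch of the cobound example above, assuming it is run on 16 images:

  program cobounds
    implicit none
    real :: array(10)[4,0:*]
    if (this_image() == 1) then
       print *, 'lcobound(array) =', lcobound(array)   ! [ 1, 0 ]
       print *, 'ucobound(array) =', ucobound(array)   ! [ 4, 3 ] when run on 16 images
    end if
  end program cobounds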


More on Cosubscripts
• integer :: a[*] on 8 images
  • cosubscript a[9] is not valid
• real :: b(10)[3,*] on 8 images
  • ucobound(b) returns [ 3, 3 ]
  • cosubscript b[2,3] is valid (corresponds to image 8)...
  • ...but cosubscript b[3,3] is invalid (image 9)
• The programmer needs to make sure that cosubscripts are valid
  • image_index returns 0 for invalid cosubscripts

Assumed Size Coarrays
• Codimensions can be remapped to corank greater than 1
  • useful for determining optimal extents at runtime
  program p2d
    real, codimension[*] :: picture(100,100)
    integer :: numimage, numimagex, numimagey
    numimage = num_images()
    call get_best_2d_decomposition(numimage, &
                                   numimagex, numimagey)
    ! Assume this ensures numimage = numimagex*numimagey
    call dothework(picture, numimagex, numimagey)
    ...
  contains
    subroutine dothework(array, m, n)
      real, codimension[m,*] :: array(100,100)
      ...
    end subroutine dothework


I/O
• Each image has its own set of input/output units
  • units are independent on each image
• Default input unit is preconnected on image 1 only
  • read *,... , read(*,...)...
• Default output unit is available on all images
  • print *,... , write(*,...)...
• It is expected that the implementation will merge records from each image into one stream

Program Termination
• STOP or END PROGRAM statements initiate normal termination, which includes a synchronisation step
• An image's data is still available after it has initiated normal termination
• Other images can test for this using the STAT= specifier on synchronisation calls or allocate/deallocate
  • test for STAT_STOPPED_IMAGE (defined in the ISO_FORTRAN_ENV module), as in the sketch below
• The ERROR STOP statement initiates error termination and it is expected all images will be terminated
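A minimal sketch of testing for a stopped image with the STAT= specifier; the choice of which image stops, and what the surviving images do, is illustrative:

  program stopped_check
    use, intrinsic :: iso_fortran_env, only : STAT_STOPPED_IMAGE
    implicit none
    integer :: st
    if (this_image() == 1) stop          ! image 1 initiates normal termination
    sync all (stat=st)                   ! remaining images detect this via STAT=
    if (st == STAT_STOPPED_IMAGE) then
       print *, 'image', this_image(), ': another image has stopped'
    end if
  end program stopped_check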


Coarray Technical Specification
• Additional coarray features may be described in a Technical Specification (TS)
• Developed as part of the official ISO standards process
• Work in progress; the areas of discussion are:
  • image teams
  • collective intrinsics for coarrays
  • atomics

TS: Teams
• Often useful to consider subsets of processes
  • e.g. MPI communicators
• Subsets not currently supported in Fortran, e.g.
  • sync all: all images
  • sync images: pairwise mutual synchronisation
• Extension involves TEAMS of images
  • user creates teams
  • specified as an argument to various functions


TS: Teams...
• To define a set of images as a team
  call form_team(oddteam,[ (i,i=1,n,2) ])
• To synchronise the team
  sync team(oddteam)
• To determine the images that constitute a team:
  oddimages=team_images(oddteam)

TS: Collectives
• Collective operations are a key part of real codes
  • broadcast
  • global sum
  • ...
• Supported in other parallel models
  • OpenMP reductions
  • MPI_Allreduce
• Not currently supported for coarrays
  • efficient implementation by hand is difficult
  • calling external MPI routines is rather ugly


TS: Collective intrinsic subroutines
• Collectives, with in/out arguments, invoked by the same statement on all images (or a team of images)
• Routines
  • CO_BCAST
  • CO_SUM and other reduction operations
  • basically reproduce the MPI functionality
• Arguments include SOURCE, RESULT, TEAM
• Still discussion on the need for implicit synchronisation and argument types (for example non-coarray arguments)

TS: Atomic operations
• Critical or lock synchronisation is sometimes overkill
  • counter[1] = counter[1] + 1
• Simple atomic operations can be optimised
  • e.g. OpenMP atomic
  !$OMP atomic
  sharedcounter = sharedcounter + 1
• New variable types and operations for coarrays


TS: Atomic variables
• Fortran already includes some atomic support (define, ref)
• The TS expands on this to support atomic compare-and-swap, fetch-and-add, ...
  integer (atomic_int_kind) :: counter[*]
  call atomic_define(counter[1],0)
  call atomic_add(counter[1],1)
  call atomic_ref(countval,counter[1])


Experiences with Coarrays
Parallel Programming with Fortran Coarrays
MSc in HPC
David Henty, Alan Simpson (EPCC)
Harvey Richardson, Bill Long (Cray)

Overview
• Implementations
• Performance considerations
• Where to use the coarray model
• Coarray benchmark suite
• Examples of coarrays in practice
• References
• Wrapup


Implementation Status
• History of coarrays dates back to Cray implementations
• Expect support from vendors as part of Fortran 2008
• G95 had multi-image support in 2010
  • has not been updated for some time
• gfortran
  • introduced single-image support at version 4.6
• Intel: multi-process coarray support in Intel Composer XE 2011 (based on the Fortran 2008 draft)
• Runtimes are SMP, GASNet and compiler/vendor runtimes
  • GASNet has support for multiple environments (IB, Myrinet, MPI, UDP and Cray/IBM systems) so could be an option for new implementations

Implementation Status (Cray)
• Cray has supported coarrays and UPC on various architectures over the last decade (from the T3E)
• Full PGAS support on the Cray XT/XE
  • Cray Compiling Environment 7.0 - Dec 2008
  • Current release is Cray Compiler Environment 7.4
• Full Fortran 2008 coarray support
  • Full Fortran 2003 with some Fortran 2008 features
• Fully integrated with the Cray software stack
  • Same compiler drivers, job launch tools, libraries
  • Integrated with Craypat - Cray performance tools
  • Can mix MPI and coarrays


Implementations we have used
• Cray X1/X2
  • Hardware supports communication by direct load/store
  • Very efficient with low overhead
• Cray XT
  • PGAS (UPC, CAF) layered on GASNet/Portals (so messaging)
  • Not that efficient
• Cray XE
  • PGAS layered on the DMAPP portable layer over Gemini network hardware
  • Intermediate between XT and X1/2
• Intel Composer XE 2011
  • SMP and message-passing runtimes

Implementations we have used...
• g95 on shared memory
  • Using cloned process images on Linux
  • This is not being actively developed


Intel XE on Ubuntu VM
• Screenshot: Intel Composer XE being used on an Ubuntu virtual machine

When to use coarrays
• Two obvious contexts
  • Complete application using coarrays
  • Mixed with MPI
• As an incremental addition to a (potentially large) serial code
• As an incremental addition to an MPI code (allowing reuse of most of the existing code)
  • Use coarrays for some of the communication
  • opportunity to express communication much more simply
  • opportunity to overlap communication
• For subset synchronisation
• Work-sharing schemes


Adding coarrays to existing applications
• Constrain use of coarrays to part of the application
  • Move relevant data into coarrays
  • Implement the parallel part with coarray syntax
  • Move data back to the original structures
• Use coarray structures to contain pointers to existing data
• Place relevant arrays in global scope (modules)
  • avoids multiple declarations
• Declare existing arrays as coarrays at the top level and through the complete call tree (some effort but only requires changes to declarations)

Performance Considerations
• What is the latency?
• Do you need to avoid strided transfers?
• Is the compiler optimising the communication for the target architecture?
  • Is it using blocking communication within a segment when it does not need to?
  • Is it optimising strided communication?
  • Can it pattern-match loops to single communication primitives or collectives?


Performance: Communication patterns
• Try to avoid creating traffic jams on the network, such as all images storing to a single image
• The following examples show two ways to implement an AllReduce() function using coarrays

AllReduce (everyone gets)
• All images get data from the others simultaneously
  function allreduce_max_allget(v) result(vmax)
    double precision :: vmax, v[*]
    integer i
    sync all
    vmax=v
    do i=1,num_images()
      vmax=max(vmax,v[i])
    end do


AllReduce (everyone gets, optimized)
• All images get data from the others simultaneously, but this is optimized so communication is more balanced
  !...
  sync all
  vmax=v
  do i=this_image()+1,num_images()
    vmax=max(vmax,v[i])
  end do
  do i=1,this_image()-1
    vmax=max(vmax,v[i])
  end do
• Have seen this much faster

Synchronization
• For some algorithms (finite-difference etc.) don't use sync all but pairwise synchronization using sync images(image)


Synchronization (one to many)
• Often one image will be communicating with a set of images
• In general not a good thing to do, but assume we are...
• Tempting to use sync all

Synchronisation (one to many)
• If this is all images then could do
  if ( this_image() == 1) then
    sync images(*)
  else
    sync images(1)
  end if
• Note that sync all is likely to be fast so is an alternative


Synchronisation (one to many)
• For a subset use this
  if ( this_image() == image_list(1)) then
    sync images(image_list)
  else
    sync images(image_list(1))
  end if
• instead of sync images(image_list) for all of them, which is likely to be slower

Collective Operations
• If you need scalability to a large number of images you may need to temporarily work around the current lack of collectives
  • Use MPI for the collectives if MPI+coarrays is supported
  • Implement your own, but this might be hard
    • For reductions of scalars a tree will be the best to try (a sketch follows below)
    • For reductions of more data you would have to experiment and this may depend on the topology
• Coarrays can be good for collective operations where
  • there is an unusual communication pattern that does not match what MPI collectives provide
  • there is opportunity to overlap communication with computation
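One possible hand-coded scalar tree reduction of the kind suggested above; a sketch only, assuming every image calls it and that the number of images is a power of two:

  real function treereduce_max(v) result(vmax)
    ! Binary-tree maximum reduction across images; result returned on all images
    real, intent(in) :: v
    real, save :: work[*]          ! saved coarray: coarrays local to procedures need save
    integer :: me, step
    me = this_image()
    work = v
    step = 1
    do while (step < num_images())
       sync all                    ! order segments: partners' work values are now valid
       if (mod(me-1, 2*step) == 0 .and. me+step <= num_images()) then
          work = max(work, work[me+step])
       end if
       step = 2*step
    end do
    sync all                       ! image 1 now holds the full reduction
    vmax = work[1]
  end function treereduce_max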


Tools: debugging and profiling
• Tool support should improve once coarray takeup increases
• Cray Craypat tool supports coarrays
• Totalview works with coarray programs on Cray systems
• Allinea DDT
  • support for coarrays and UPC for a number of compilers is in public beta and will be in DDT 3.1
• Scalasca
  • currently investigating how PGAS support can be incorporated

Debugging Synchronisation problems
• The one-sided model is tricky because subtle synchronisation errors change data
• TRY TO GET IT RIGHT FIRST TIME
  • look carefully at the remote operations in the code
• Think about synchronisation of segments
  • especially look for early-arriving communications trashing your data at the start of loops (this one is easy to miss)
• One way to test is to put sleep() calls in the code
  • Delay one or more images
  • Delay the master image, or other images for some patterns


Coarray Benchmark Suite
• Developed by David Henty at EPCC
• Aims to test fundamental features of a coarray implementation
• We don't have an API to test (cf. IMB for MPI)
• We can test basic language syntax for communication of data and synchronization
• Need to choose communication pattern and data access
• There is some scope for a given communication pattern:
  • array syntax, loops over array elements
  • inline code or use subroutines
• Choices can reveal compiler capabilities

Initial Benchmark
• Single contiguous point-to-point read and write
• Multiple contiguous point-to-point read and write
• Strided point-to-point read and write
• All basic synchronization operations
• Various representative communication patterns
  • Halo-swap in multi-dimensional regular domain decomposition
• All communications include the synchronisation cost
  • use double precision (8-byte) values as the basic type
  • use Fortran array syntax
• Synchronisation overhead measured separately as overhead = (delay + sync) - delay (a timing sketch follows below)
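A sketch of the "overhead = (delay + sync) - delay" measurement; the timer, the repeat count and the work inside the delay are all illustrative choices:

  program sync_overhead
    implicit none
    integer, parameter :: nrep = 1000
    integer :: i, c0, c1, crate
    double precision :: tdelay, tboth
    call system_clock(c0, crate)
    do i = 1, nrep
       call delay()                ! fixed local work on every image
    end do
    call system_clock(c1)
    tdelay = dble(c1-c0)/dble(crate)
    sync all
    call system_clock(c0)
    do i = 1, nrep
       call delay()
       sync all                    ! the operation being measured
    end do
    call system_clock(c1)
    tboth = dble(c1-c0)/dble(crate)
    if (this_image() == 1) print *, 'sync all overhead (s):', (tboth-tdelay)/nrep
  contains
    subroutine delay()
      integer :: k
      real :: s
      s = 0.0
      do k = 1, 10000
         s = s + sqrt(real(k))
      end do
      if (s < 0.0) print *, s      ! keeps the loop from being optimised away
    end subroutine delay
  end program sync_overhead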


Platforms
• Limited compiler support at present
  • results presented from Cray systems
  • Cray Compiler Environment (CCE) 7.4.1
• Cray XT6
  • MPP with dual 12-core Opteron Magny-Cours nodes and Cray Seastar2+ torus interconnect; coarrays implemented using GASNet
• Cray XE6
  • Same nodes as XT6 but with the Gemini torus interconnect, which supports RDMA

Pingpong
• Figure: point-to-point bandwidth (MB/s) against message length (doubles) for XE6 put, XT6 put, XT6 MPI and XE6 MPI


Pingpong (small message regime)
• Figure: point-to-point time (microseconds) against message length (doubles) for XE6 put, XT6 put, XE6 MPI and XT6 MPI
• Latencies (microseconds):
          sync images   put latency   MPI latency
  XT6        33.1          45.3           7.4
  XE6         3.0           3.7           1.6

Global Synchronisation
• Figure: XT6 synchronisation time (microseconds) against number of images (16 to 2048) for sync all and MPI Barrier * 100
• XT coarray implementation not keeping up with MPI


Global Synchronisation
• Figure: XE6 synchronisation time (microseconds) against number of images (4 to 16384) for XE6 sync all and XE6 MPI_Barrier
• Much faster than the previous XT results

Global Synchronisation
• Also measured sync images
  • various point-to-point patterns
• Observed that sync images is usually faster than sync all on more than 512 images


3D Halo Swap on XE6 (weak scaling V=50^3)
• Figure: bandwidth (MB/s) against number of images (8 to 32768) for put p2p, put all, get p2p and get all

Examples of coarrays in practice
• Puzzles
• Distributed Remote Gather
• HIMENO Halo-Swap
• Gyrokinetic Fusion Code


Solving Sudoku Puzzles

Going Parallel
• Started with serial code
• Changed to read in all 125,000 puzzles at the start
• Choose work-sharing strategy
  • One image (image 1) holds a queue of puzzles to solve
  • Each image picks work from the queue and writes results back to the queue
• Arbitrarily decide to parcel work as
  blocksize = npuzzles / (8 * num_images())


Data Structures
  use, intrinsic :: iso_fortran_env
  type puzzle
    integer :: input(9,9)
    integer :: solution(9,9)
  end type puzzle
  type queue
    type(lock_type) :: lock
    integer :: next_available = 1
    type(puzzle), allocatable :: puzzles(:)
  end type queue
  type(queue), save :: workqueue[*]
  type(puzzle) :: local_puzzle
  integer, save :: npuzzles[*], blocksize[*]

Input
  if (this_image() == 1) then
    ! After file setup
    inquire (unit=inunit, size=nbytes)
    nrecords = nbytes/10
    npuzzles = nrecords/9
    blocksize = npuzzles / (num_images()*8)
    write (*,*) "Found ", npuzzles, " puzzles."
    allocate (workqueue%puzzles(npuzzles))
    do i = 1, npuzzles
      call read_puzzles( &
           workqueue%puzzles(i)%input, inunit, &
           error)
    end do
    close(inunit)


Core program structure
  ! After coarray data loaded
  sync all
  blocksize = blocksize[1]
  npuzzles = npuzzles[1]
  done = .false.
  workloop: do
    ! Acquire lock and claim work
    ! Solve our puzzles
  end do workloop

Acquire lock and claim work
  ! Reserve the next block of puzzles
  lock (workqueue[1]%lock)
  next = workqueue[1]%next_available
  if (next


Solve the puzzles and write back
  ! Solve those puzzles
  do i = istart, iend
    local_puzzle%input = &
         workqueue[1]%puzzles(i)%input
    call sudoku_solve &
         (local_puzzle%input, local_puzzle%solution)
    workqueue[1]%puzzles(i)%solution = &
         local_puzzle%solution
  end do

Output the solutions
  ! Need to synchronize puzzle output updates
  sync all
  if (this_image() == 1) then
    open (outunit, file=outfile, iostat=error)
    do i = 1, npuzzles
      call write_puzzle &
           (workqueue%puzzles(i)%input, &
            workqueue%puzzles(i)%solution, outunit, error)
    end do


More on the Locking
• We protected access to the queue state by lock and unlock
  • During this time no other image can acquire the lock
• We need the discipline to only access the data within the window when we hold the lock
  • There is no connection between the lock variable and the other elements of the queue structure
• The unlock is acting like sync memory
  • If one image executes an unlock...
  • Another image getting the lock is ordered after the first image

Summary and Commentary
• We implemented solving the puzzles using a work-sharing scheme with coarrays
• Scalability limited by serial work done by image 1
  • I/O
    • Parallel I/O (deferred to the TS) with multiple images running distributed work queues
    • Defer the character-integer format conversion to the solver, which is executed in parallel
  • Lock contention
    • Could use distributed work queues, each with its own lock


Distributed remote gather
• The problem is how to implement the following gather loop on a distributed memory system
  REAL :: table(n), buffer(nelts)
  INTEGER :: index(nelts)   ! nelts


Remote gather: coarray implementation (get)
• Image 1 gets the values from the other images
  IF (myimg.eq.1) THEN
    DO i=1,nelts
      pe     = (index(i)-1)/nloc+1
      offset = MOD(index(i)-1,nloc)+1
      caf_buffer(i) = caf_table(offset)[pe]
    ENDDO
  ENDIF

Remote gather: coarray vs MPI
• Coarray implementations are much simpler
• Coarray syntax allows the expression of remote data in a natural way - no need for complex protocols
• Coarray implementation is orders of magnitude faster for small numbers of indices
• Figure: MPI time / coarray time ratio (1024 PEs) against number of elements (nelts)


HIMENO
• HIMENO Halo-Swap benchmark
  • Uses the Jacobi method to solve Poisson's equation
• Looked at a distributed implementation for GPUs
  • When distributed this gives a stencil computation and halo-swap communication
  • Used draft OpenMP GPU directives for the stencil computation
  • used MPI or coarrays for the halo swap between processes
• Coarray code for the halo swap was simple and was the best performing of the optimized versions
• There is still scope to optimize the coarray version (reduce an extra data copy)

HIMENO
• Figure: Himeno Benchmark - XL configuration; performance (TFlop/s) against number of nodes (0 to 256) for MPI/ACC Opt and CAF/ACC Opt


Gyrokinetic Fusion Code
• Particle in Cell (PIC) approach to simulate the motion of confined particles
  • Motion caused by the electromagnetic force on each particle
• Many particles stay in a cell for a small timestep but some don't
  • Timestep chosen to limit travel to 4 cells away
• Departing particles are stored in a buffer and when this is full the data is sent to the neighboring cell's incoming buffer
• Force fields are recomputed once particles are redistributed
• Coarrays used to avoid coordinating the receive of the data
• SC11 paper

References
• "Cray's Approach to Heterogeneous Computing", R. Ansaloni, A. Hart, ParCo 2011 (to appear)
• "The Himeno benchmark", Ryutaro Himeno, http://accc.riken.jp/HPC_e/himenobmt_e.html
• "Multithreaded Address Space Communication Techniques for Gyrokinetic Fusion Applications on Ultra-Scale Platforms", Robert Preissl, Nathan Wichmann, Bill Long, John Shalf, Stephane Ethier, Alice Koniges, SC11 best paper finalist


References
• "Co-array Fortran for parallel programming", Numrich and Reid, 1998, http://lacsi.rice.edu/software/caf/downloads/documentation/nrRAL98060.pdf
• "Coarrays in the next Fortran Standard", John Reid, April 2010, ftp://ftp.nag.co.uk/sc22wg5/N1801-N1850/N1824.pdf
• Ashby, J.V. and Reid, J.K. (2008). Migrating a scientific application from MPI to coarrays. CUG 2008 Proceedings. RAL-TR-2008-015. See http://www.numerical.rl.ac.uk/reports/reports.shtml
• Unified Parallel C at George Washington University: http://upc.gwu.edu/
• Berkeley Unified Parallel C Project: http://upc.lbl.gov/

Wrapup
Remember our first Motivation slide?
• Fortran now supports parallelism as a full first-class feature of the language
• Changes are minimal
• Performance is maintained
• Flexibility in expressing communication patterns
We hope you learned something and have success with coarrays in the future


Acknowledgements
The material for this tutorial is based on original content developed by EPCC of The University of Edinburgh for use in teaching their MSc in High-Performance Computing. The following people contributed to its development: Alan Simpson, Michele Weiland, Jim Enright and Paul Graham.
The material was subsequently developed by EPCC and Cray with contributions from the following people: David Henty, Alan Simpson (EPCC); Harvey Richardson, Bill Long, Roberto Ansaloni, Jef Dawson, Nathan Wichmann (Cray).
This material is Copyright © 2011 by The University of Edinburgh and Cray Inc.
