EE 657, Fall 2007
Parallel Programming Models, Standards, and Benchmarks
Lecture 6, September 14, 2007
Professor Kai Hwang
USC Internet and Grid Computing Laboratory
Email: kaihwang@usc.edu    http://GridSec.usc.edu

Parallel vs. Sequential Programming

  Applications: Database, Science, Engineering, Embedded Systems, etc.

                           Sequential                        Parallel
  Algorithmic paradigms:   Divide-and-Conquer,               Compute-Interact, Work-Pool,
                           Dynamic Programming,              Pipelining, Asynchronous Iteration,
                           Branch-and-Bound,                 Master-Slave, Cellular Automata
                           Backtracking, Greedy
  Programming models:      The von Neumann model             Implicit Parallel (KAP),
                           (Fortran, C, Cobol, 4GL)          Data Parallel (Fortran 90, HPF),
                                                             Message Passing (PVM, MPI),
                                                             Shared Variable (X3H5)
  Machine models:                                            Shared Memory (PVP, SMP, DSM),
                                                             Message Passing (MPP, Clusters),
                                                             Data Parallel (SIMD)

Five Parallel Algorithmic Paradigms

  [Figure: five paradigm diagrams]
  (a) Phase parallel: components compute independently, then meet at a synchronous
      interaction, and the cycle repeats
  (b) Divide and conquer: a problem is split into subproblems whose results are combined
  (c) Pipeline: processes P, Q, and R pass a data stream from stage to stage
  (d) Process farm: a master dispatches work to slave processes
  (e) Work pool: processes repeatedly take tasks from, and add tasks to, a shared pool
      (a small code sketch of this paradigm appears after this slide group)

Parallel Programming Components

  [Figure: layered view of how a parallel program is constructed]
  (Sequential or Parallel) Application Algorithm
    -> User (Programmer)
    -> (Sequential or Parallel) Source Program
    -> Compiler, Preprocessor, Assembler, and Linker,
       together with a Parallel Language and other Tools
       and Run-Time Support and other Libraries
    -> Native Parallel Code
    -> Parallel Platform (OS and Hardware)
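The work-pool paradigm (e) above is easy to make concrete. The following is a minimal sketch of my own, not taken from the slides: the pool is simply a shared counter of unclaimed task indices, each worker thread repeatedly claims the next index until the pool is empty, and the task body is a placeholder.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_WORKERS 3
    #define NUM_TASKS   10

    /* The "pool" is the next unclaimed task index, guarded by a mutex. */
    static int next_task = 0;
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
    static int results[NUM_TASKS];

    static void *worker(void *arg) {
        long id = (long)arg;
        for (;;) {
            pthread_mutex_lock(&pool_lock);
            int task = (next_task < NUM_TASKS) ? next_task++ : -1;  /* claim a task, if any */
            pthread_mutex_unlock(&pool_lock);
            if (task < 0) break;                 /* pool is empty: this worker retires */

            results[task] = task * task;         /* placeholder "work" (assumption) */
            printf("worker %ld finished task %d\n", id, task);
        }
        return NULL;
    }

    int main(void) {
        pthread_t tid[NUM_WORKERS];
        for (long w = 0; w < NUM_WORKERS; w++)
            pthread_create(&tid[w], NULL, worker, (void *)w);
        for (long w = 0; w < NUM_WORKERS; w++)
            pthread_join(tid[w], NULL);

        for (int t = 0; t < NUM_TASKS; t++)
            printf("results[%d] = %d\n", t, results[t]);
        return 0;
    }

The same structure extends to a process farm by replacing the shared counter with a master process that hands out task indices over messages.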


Parallel Program Generation

  (a) A sequential code fragment:

        for (i = 0; i < N; i++) A[i] = b[i] * b[i+1];
        for (i = 0; i < N; i++) c[i] = A[i] + A[i+1];

  (b) Parallel code using library routines in the style of MPI or PVM, here called
      my_process_id(), number_of_processes(), and barrier():

        id = my_process_id();
        p  = number_of_processes();
        for (i = id; i < N; i = i + p) A[i] = b[i] * b[i+1];
        barrier();
        for (i = id; i < N; i = i + p) c[i] = A[i] + A[i+1];

  (c) Fortran 90 using vector/array operations:

        A(0:N-1) = b(0:N-1) * b(1:N)
        c        = A(0:N-1) + A(1:N)

  (d) Parallel C program using SGI Power C compiler directives
      (an OpenMP version of this fragment is sketched below):

        #pragma parallel
        #pragma shared (A, b, c)
        #pragma local (i)
        {
          #pragma pfor iterate (i=0; N; 1)
          for (i = 0; i < N; i++) A[i] = b[i] * b[i+1];
          #pragma synchronize
          #pragma pfor iterate (i=0; N; 1)
          for (i = 0; i < N; i++) c[i] = A[i] + A[i+1];
        }

Main Features of the Data-Parallel, Message-Passing, and
Shared-Variable Parallel Programming Models

  Feature                    Data-Parallel               Message-Passing   Shared-Variable
  Control flow (threading)   Single                      Multiple          Multiple
  Synchrony                  Loosely synchronous         Asynchronous      Asynchronous
  Address space              Single                      Multiple          Single
  Interaction                Implicit                    Explicit          Explicit
  Data allocation            Implicit or semi-explicit   Explicit          Implicit or semi-explicit

Parallel Programming Standards:

  • OpenMP was developed for shared-memory parallel programming on Unix or Windows NT
    platforms, using compiler directives, library routines, and other parallel constructs
    drawn from the X3H5 standard.
  • MPI was developed for message-passing parallel programming on NUMA machines, clusters
    of computers, or loosely coupled Grid platforms.
  • Pthreads (POSIX Threads) is an IEEE standard for exploiting process- or task-level
    parallelism (a minimal Pthreads example appears below).
  • HPF stands for High Performance Fortran; it evolved from Fortran 90 and beyond.
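The SGI Power C directives in fragment (d) predate OpenMP, which the standards list above introduces. As a point of comparison, here is a minimal sketch, not from the slides, of the same two loops written with standard OpenMP directives; the array size N and the sample initialization of b are assumptions made only so the program is self-contained.

    #include <stdio.h>

    #define N 1024   /* assumption: the slides leave N unspecified */

    int main(void) {
        /* static arrays are zero-initialized, so A[N] and the tail of b are defined */
        static double A[N + 1], b[N + 2], c[N];
        int i;

        for (i = 0; i < N + 2; i++)      /* sample data; any initialization will do */
            b[i] = (double)i;

        #pragma omp parallel for         /* iterations are independent */
        for (i = 0; i < N; i++)
            A[i] = b[i] * b[i + 1];

        /* The implicit barrier at the end of the first parallel loop plays the role
           of the explicit "#pragma synchronize" in the SGI Power C version. */
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            c[i] = A[i] + A[i + 1];

        printf("c[0] = %f, c[N-1] = %f\n", c[0], c[N - 1]);
        return 0;
    }

Compiled without OpenMP support the pragmas are ignored and the program runs sequentially, which is one reason directive-based parallelization allows incremental porting of sequential code.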

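The standards list also names Pthreads without showing any code. Below is a minimal sketch, again my own rather than the lecture's, of task-level parallelism with POSIX threads: each thread computes a partial sum over a strided slice, and the main thread joins the workers and combines their results. The thread count and the summed range are arbitrary choices for illustration.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4   /* assumption: any small thread count works here */

    /* Each thread sums the integers in its own strided slice of 0..99. */
    static void *partial_sum(void *arg) {
        long id = (long)arg;
        long sum = 0;
        for (long i = id; i < 100; i += NUM_THREADS)
            sum += i;
        return (void *)sum;
    }

    int main(void) {
        pthread_t tid[NUM_THREADS];
        long total = 0;

        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&tid[t], NULL, partial_sum, (void *)t);

        for (long t = 0; t < NUM_THREADS; t++) {
            void *result;
            pthread_join(tid[t], &result);   /* join is also the synchronization point */
            total += (long)result;
        }

        printf("total = %ld\n", total);      /* expect 4950 */
        return 0;
    }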

Five Parallel Programming Standards
(Courtesy: OpenMP Standards Board [475], 1997)

  Attribute                     X3H5   MPI      Pthreads    HPF      OpenMP
  Scalable                      no     yes      sometimes   yes      yes
  Fortran binding               yes    yes      no          yes      yes
  C binding                     yes    yes      yes         no       planned
  High level                    yes    no       no          yes      yes
  Performance oriented          no     yes      no          yes      yes
  Supports data parallelism     yes    no       no          yes      yes
  Portable                      yes    yes      yes         yes      yes
  Vendor support                no     widely   Unix SMP    widely   starting
  Incremental parallelization   yes    no       no          no       yes

An OpenMP Program for Computing π Using Compiler Directives
and Run-Time Library Routines
(a C translation is sketched below)

        program compute_pi
        integer n, i
        real*8 w, x, sum, pi, f, a
  c     function to integrate
        f(a) = 4.d0 / (1.d0 + a*a)
        print *, 'Enter number of intervals: '
        read *, n
  c     calculate the interval size
        w = 1.0d0/n
        sum = 0.0d0
  !$OMP PARALLEL DO PRIVATE(x), SHARED(w), REDUCTION(+: sum)
        do i = 1, n
           x = w * (i - 0.5d0)
           sum = sum + f(x)
        enddo
  !$OMP END PARALLEL DO
        pi = w * sum
        print *, 'computed pi = ', pi
        stop
        end

Inter-Processor Synchronization for Parallel Program Construction

  Six synchronization constructs (semaphore or lock, critical section, test-and-set,
  compare-and-swap, transactional memory, and fetch-and-add) are rated against the
  problems they raise in parallel program construction: unstructured code, overspecified
  synchronization, state dependence, sequential execution, overhead, priority inversion,
  convoying/blocking, nonserializability, and deadlock.
  (A small lock vs. fetch-and-add example appears below.)

  Rating legend: XXX = severe or extremely difficult to overcome; XX = less severe, or
  can be alleviated if the user writes the code carefully; X = slight, or can be
  eliminated if the user follows certain well-defined rules; blank = not a problem,
  taken care of by the system.

Anatomy of MPI in Message Passing

  Example send and receive operations in a message-passing program, where the variables
  M and S are the send/receive buffers (a runnable two-rank version is sketched below):

      Process P:                        Process Q:
          M = 10;                           S = -100;
      L1: Send M to Process Q;          L1: Receive S from process P;
      L2: M = 20;                       L2: X = S + 1;
          goto L1;

  Specification of a typical MPI command:

      MPI_Send(&N, 1, MPI_INT, i, i, MPI_COMM_WORLD)
               |   |     |     |  |       |
               |   |     |     |  |       +-- communicator
               |   |     |     |  +---------- message tag
               |   |     |     +------------- destination process ID
               |   |     +------------------- message data type
               |   +------------------------- message count
               +----------------------------- message address

  MPI is a platform-independent standard for message-passing libraries, supporting C and
  Fortran bindings. MPICH is a public-domain implementation of MPI that is portable to
  all machine platforms.
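The π program above is written in fixed-form Fortran with OpenMP directives. For readers more comfortable with C, here is a rough translation of the same midpoint-rule integration, my sketch rather than the slide's code, using an OpenMP parallel for with a reduction clause.

    #include <stdio.h>

    int main(void) {
        int n, i;
        double w, x, sum = 0.0, pi;

        printf("Enter number of intervals: ");
        if (scanf("%d", &n) != 1 || n <= 0) return 1;

        w = 1.0 / n;   /* interval size */

        /* Integrate 4/(1+x*x) over [0,1] by the midpoint rule; the reduction clause
           gives each thread a private partial sum and combines them at the end. */
        #pragma omp parallel for private(x) reduction(+:sum)
        for (i = 1; i <= n; i++) {
            x = w * (i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }

        pi = w * sum;
        printf("computed pi = %.15f\n", pi);
        return 0;
    }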

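To make the synchronization constructs in the comparison above concrete, the following sketch (an illustration of mine, not part of the lecture) contrasts a lock-protected critical section with a fetch-and-add style atomic increment, using POSIX threads and C11 atomics; the thread and increment counts are arbitrary.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define INCREMENTS  100000

    static long counter_locked = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static atomic_long counter_atomic = 0;

    /* Critical-section style: every increment acquires and releases a lock. */
    static void *bump_with_lock(void *arg) {
        (void)arg;
        for (int i = 0; i < INCREMENTS; i++) {
            pthread_mutex_lock(&lock);
            counter_locked++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    /* Fetch-and-add style: the read-modify-write is done atomically by the hardware,
       so no lock is held and threads cannot convoy behind a lock holder. */
    static void *bump_with_atomic(void *arg) {
        (void)arg;
        for (int i = 0; i < INCREMENTS; i++)
            atomic_fetch_add(&counter_atomic, 1);
        return NULL;
    }

    int main(void) {
        pthread_t t[NUM_THREADS];

        for (int i = 0; i < NUM_THREADS; i++) pthread_create(&t[i], NULL, bump_with_lock, NULL);
        for (int i = 0; i < NUM_THREADS; i++) pthread_join(t[i], NULL);

        for (int i = 0; i < NUM_THREADS; i++) pthread_create(&t[i], NULL, bump_with_atomic, NULL);
        for (int i = 0; i < NUM_THREADS; i++) pthread_join(t[i], NULL);

        printf("locked counter = %ld\n", counter_locked);          /* expect 400000 */
        printf("atomic counter = %ld\n", (long)counter_atomic);    /* expect 400000 */
        return 0;
    }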

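The Process P / Process Q pseudocode above maps onto MPI roughly as follows. This is a minimal sketch under the assumption of exactly two ranks, with rank 0 playing P and rank 1 playing Q, and with the repeat loop of the pseudocode omitted.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                /* plays the role of Process P */
            int M = 10;
            MPI_Send(&M, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            M = 20;                     /* safe: MPI_Send has returned, so the buffer is reusable */
        } else if (rank == 1) {         /* plays the role of Process Q */
            int S = -100, X;
            MPI_Status status;
            MPI_Recv(&S, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            X = S + 1;
            printf("Process Q received S = %d, X = %d\n", S, X);
        }

        MPI_Finalize();
        return 0;
    }

Run with mpirun -np 2. Because a blocking MPI_Send returns only once the send buffer may be reused, the later assignment M = 20 cannot corrupt the message already handed to MPI.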
Point-to-Point and Collective Communications

  [Figure: eight communication patterns among processes P1, P2, P3 holding values 1, 3, 5]
  (a) Point-to-point: P1 sends 1 to P3
  (b) Broadcast: P1 sends 1 to all
  (c) Scatter: P1 sends one value to each node
  (d) Gather: P1 gets one value from each node
  (e) Total exchange: each node sends a distinct message to every node
  (f) Shift: each node sends one value to the next and receives one from the previous
  (g) Reduction: P1 gets the sum 1 + 3 + 5 = 9
  (h) Scan: P1 gets 1, P2 gets 1 + 3 = 4, and P3 gets 1 + 3 + 5 = 9
      (an MPI_Scan sketch appears below)

Collective Communications in MPI

  Type             Routine             Functionality
  Data movement    MPI_Bcast           One-to-all, identical message
                   MPI_Gather          All-to-one, personalized messages
                   MPI_Gatherv         A generalization of MPI_Gather
                   MPI_Allgather       A generalization of MPI_Gather
                   MPI_Allgatherv      A generalization of MPI_Allgather
                   MPI_Scatter         One-to-all, personalized messages
                   MPI_Scatterv        A generalization of MPI_Scatter
                   MPI_Alltoall        All-to-all, personalized messages
                   MPI_Alltoallv       A generalization of MPI_Alltoall
  Aggregation      MPI_Reduce          All-to-one reduction
                   MPI_Allreduce       A generalization of MPI_Reduce
                   MPI_Reduce_scatter  A generalization of MPI_Reduce
                   MPI_Scan            All-to-all parallel prefix
  Synchronization  MPI_Barrier         Barrier synchronization

A Message-Passing Program Using MPI

      #include "mpi.h"

      int foo(i) int i; { ... }

      main(argc, argv)
      int argc;
      char *argv[];
      {
        int i, tmp, sum = 0, group_size, my_rank, N;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &group_size);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        if (my_rank == 0) {
          printf("Enter N: "); scanf("%d", &N);
          for (i = 1; i < group_size; i++)
            MPI_Send(&N, 1, MPI_INT, i, i, MPI_COMM_WORLD);   /* the MPI_Send form shown earlier */
          ...                                                 /* remainder of the listing not shown here */
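The listing above breaks off before the worker side of the computation. A hedged reconstruction, not the slide's exact code, of the same master/worker summation is sketched below, using MPI_Bcast to distribute N and MPI_Reduce to combine partial sums, two of the collective routines listed above; the body of foo() is an assumption chosen only so the sketch compiles and runs.

    #include <mpi.h>
    #include <stdio.h>

    /* Placeholder work function; the original listing leaves foo()'s body unspecified. */
    static int foo(int i) { return i * i; }

    int main(int argc, char *argv[]) {
        int i, sum = 0, total = 0, group_size, my_rank, N = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &group_size);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        if (my_rank == 0) {
            printf("Enter N: ");
            fflush(stdout);
            scanf("%d", &N);
        }

        /* Broadcast replaces the explicit per-process MPI_Send loop of the fragment. */
        MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Each process sums foo(i) over a strided slice of 0..N-1. */
        for (i = my_rank; i < N; i += group_size)
            sum += foo(i);

        /* All partial sums are combined at rank 0. */
        MPI_Reduce(&sum, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (my_rank == 0)
            printf("The result = %d\n", total);

        MPI_Finalize();
        return 0;
    }

Run with, for example, mpirun -np 4. Each rank works on a strided slice, so the load is balanced whenever the per-item cost of foo(i) is roughly uniform.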

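Panel (h) in the figure above illustrates a scan (parallel prefix). A minimal sketch of the corresponding MPI call could look like this, with each rank contributing 2*rank + 1 so that three ranks reproduce the 1, 4, 9 pattern from the figure; the contribution formula is only an illustrative assumption.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value, prefix;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        value = 2 * rank + 1;   /* ranks 0, 1, 2 contribute 1, 3, 5 as in the figure */

        /* Inclusive prefix sum: rank k receives value_0 + value_1 + ... + value_k. */
        MPI_Scan(&value, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d: prefix sum = %d\n", rank, prefix);

        MPI_Finalize();
        return 0;
    }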

Comparing Parallel Programming Models (Cont'd)

  Programming model    Platform-independent             Platform-dependent
                       example languages/tools          example languages/tools
  Implicit model       KAP, Forge
  Data parallel        Fortran 90, HPF                  CM C*
  Message passing      PVM, MPI                         SP2 MPL, Paragon Nx
  Shared variable      X3H5, OpenMP                     Cray Craft, SGI Power C

  The models are also compared on semantic issues (termination, determinacy, correctness)
  and programmability issues (generality, portability, structuredness).

Parallel Programming Software Toolkits and Benchmarks:

  • LSF for cluster and Grid computing
  • Globus for enabling Grid platforms
  • Condor-G for distributed job scheduling
  • LINPACK is a linear-algebra solver suite, commonly used to rank the Top-500 list of
    the world's fastest supercomputing systems
  • NAS is a parallel benchmark suite for numerical aerodynamic simulation developed at
    the NASA Ames Research Center

NAS and LINPACK Benchmarks:

  ‣ NAS (Numerical Aerodynamic Simulation):
    • Benchmark of computational fluid dynamics
    • Resource sites: 12 sites (8 sites with 8 nodes/site, 4 sites with 16 nodes/site)
    • Jobs: arrival time and workload taken from the trace file
    • N = 16,000 jobs running on M = 12 Grid sites

  ‣ LINPACK benchmark for evaluating the Top-500 supercomputers:
    • Benchmark of linear-system and PDE solvers
    • Site processing speed: 10 levels, M = 20 sites
    • Job workload: 20 levels, N = 5,000 jobs
    • Job arrival: Poisson process at 0.008 jobs/sec
      (a small sketch of this arrival model appears after the reading list)

Readings on Parallel Benchmarks:

  • Pressel, D. M., "Results from Measuring the Performance of the NAS Benchmarks on the
    Current Generation of Parallel Computers", Proceedings of the Users Group Conference,
    2004, pp. 334-339.
  • Wang et al., "LINPACK Performance on a Geographically Distributed Linux Cluster",
    Proc. of the 18th Int'l Parallel and Distributed Processing Symposium, 2004.
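The LINPACK workload above models job arrivals as a Poisson process at 0.008 jobs/sec. As a small illustration of what that means (my sketch, not part of the lecture), arrival times can be generated by drawing exponentially distributed inter-arrival gaps with mean 1/0.008 = 125 seconds:

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Draw one exponentially distributed inter-arrival gap (seconds) for rate lambda. */
    static double exp_gap(double lambda) {
        double u = (rand() + 1.0) / (RAND_MAX + 2.0);   /* uniform in (0,1) */
        return -log(u) / lambda;
    }

    int main(void) {
        const double lambda = 0.008;   /* jobs per second, from the workload model */
        double t = 0.0;

        srand(12345);                  /* fixed seed so the generated trace is reproducible */
        for (int job = 0; job < 10; job++) {
            t += exp_gap(lambda);
            printf("job %2d arrives at t = %8.1f s\n", job, t);
        }
        return 0;
    }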


Non-Imperative Parallel Programming Styles:

  • Object-oriented programming exploits parallelism at the object level, using abstract
    data types and operational methods defined over the specified data structures.
  • Functional programming is based on dataflow control instead of the control flow of a
    program. This demands that the compiler generate the execution sequence from the
    functional specification.
  • Logic programming gives the highest level of abstraction by specifying the logical
    relationship between input and output. This demands an even more powerful compiler,
    one able to derive an algorithm from the logic description.

Programming Environment for Grids

  [Figure: Grid programming environment; see Chapter 20 in Grid Computing]

Conclusions on Parallel Programming:

  • The development and use of efficient languages, compilers, software tools, and common
    standards for parallel programming lag far behind the advances in parallel and
    distributed hardware and networking technologies.
  • OpenMP is meant to unify the world of shared-memory parallel programming.
  • MPI and its variants and extensions have yet to be accepted by ordinary programmers.
  • Data-parallel programming is used only in special-purpose (SIMD) computing systems,
    still far removed from everyday users.

Reading Assignments on Parallel Programming, OpenMP, and MPI:

  [1] K. Hwang and Z. Xu, Chapter 12: "Parallel Programming Models", in Scalable Parallel
      Computing, McGraw-Hill, 1998. (Handout)
  [2] G. Krawezik and F. Cappello, "Performance Comparison of MPI and Three OpenMP
      Programming Styles on Shared Memory Multiprocessors", Proc. of the 15th Annual ACM
      Symposium on Parallel Algorithms and Architectures, June 2003.
