
OpenMP and Offloading


Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld
dan/lars/bbarth/milfeld@tacc.utexas.edu
XSEDE 12, July 16, 2012


OpenMP and Offloading: Outline
• Strategies for Migrating to Stampede
• OpenMP – refresh
• MIC Example Codes
• Heterogeneous Threading with OpenMP


Strategies for Migrating to Stampede


Application Migration to Stampede
[Diagram: the user's view of a generic cluster (nodes N1, N2, N3, … Nx) migrates to Stampede, where each node Ni is paired with a coprocessor MICi]
• Symmetric computing: MPI on both CPU and MIC (with or without OpenMP)
• Asymmetric computing: offloaded (threaded) execution on the MIC


Strategies: Vector Reawakening
• Pre-Sandy Bridge: DP vector units are only 2 wide
• Sandy Bridge: DP vector units are 4 wide
• MIC: DP vector units are 8 wide
• Unvectorized loops lose 4x performance on Sandy Bridge and 8x performance on MIC!
• Evaluate performance with and without compiler vectorization to assess the overall level of vectorization
(DP = double precision, 8 bytes)
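One practical way to do this is to time the same loop built twice, once as usual and once with vectorization disabled (with the Intel compilers, the -no-vec flag). The loop and timing harness below are an illustrative sketch, not from the slides; build with OpenMP enabled so the omp_get_wtime() timer is available.

#include <stdio.h>
#include <omp.h>            /* for the omp_get_wtime() timer */

#define N 10000000

static double a[N], b[N], c[N];

int main(void) {
    int i;
    for (i = 0; i < N; i++) { b[i] = i; c[i] = 2.0*i; }

    /* Build once as usual and once with vectorization disabled
       (e.g. icc -O3 ... vs. icc -O3 -no-vec ...), then compare times. */
    double t0 = omp_get_wtime();
    for (i = 0; i < N; i++)
        a[i] = b[i] + 1.5*c[i];      /* simple DP loop: vectorizes well */
    double t1 = omp_get_wtime();

    printf("loop time: %f s  (a[7] = %f)\n", t1 - t0, a[7]);
    return 0;
}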


OpenMP – refresh


What is OpenMP?
• OpenMP is an acronym for Open Multi-Processing
• An Application Programming Interface (API) for developing parallel programs on shared-memory architectures
• Three primary components of the API:
  – Compiler directives
  – Runtime library routines
  – Environment variables
• De facto standard, specified for C, C++, and Fortran
• http://www.openmp.org/ has the specification, examples, tutorials, and documentation
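A minimal sketch showing the three components working together (the program itself is illustrative, not from the slides): a directive forks the threads, runtime routines query them, and an environment variable sets the team size.

#include <stdio.h>
#include <omp.h>                 /* runtime library routines */

int main(void) {
    /* Environment variable: run with e.g. "export OMP_NUM_THREADS=4" */

    #pragma omp parallel         /* compiler directive: fork a team */
    {
        int id = omp_get_thread_num();    /* runtime library routine */
        int nt = omp_get_num_threads();   /* runtime library routine */
        printf("Hello from thread %d of %d\n", id, nt);
    }
    return 0;
}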


OpenMP
• All about executing concurrent work (tasks)
  – Tasks execute independently
  – Variable updates must be mutually exclusive
  – Synchronization through barriers

...some work

// We use loops for
// repetitive tasks
for (i=0; i<N; i++) {
   ... work on iteration i
}


OpenMP Fork-Join Parallelism
• Programs begin as a single process: the master thread
• The master thread executes in serial mode until a parallel region construct is encountered
• The master thread creates (forks) a team of parallel threads that simultaneously execute tasks in the parallel region
• After executing the statements in the parallel region, the team threads synchronize and terminate (join), but the master continues

[Diagram: a 4-thread execution timeline alternating Serial (master thread only) and Parallel (4 threads, multi-threaded) phases]
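The timeline can be made visible with a small sketch (illustrative, not from the slides): two parallel regions separated by serial code that only the master thread executes.

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("serial: master thread only\n");

    #pragma omp parallel                 /* fork */
    printf("region 1: thread %d\n", omp_get_thread_num());
    /* join: implicit barrier, team threads end */

    printf("serial again: master continues\n");

    #pragma omp parallel                 /* fork a new team */
    printf("region 2: thread %d\n", omp_get_thread_num());

    return 0;
}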


Thread Memory Access
• Every thread has access to "global" (shared) memory
• All threads share the same address space
• Threads can synchronize through barriers (implicit/explicit) and can have exclusive access to shared variables
• Simultaneous updates to shared memory can create a race condition; use mutual exclusion directives to avoid data races. The C/C++ syntax:

#pragma omp atomic     -- for single statements
#pragma omp critical   -- for a code block
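A minimal sketch of both forms (the counter, the sum, and the loop are illustrative, not from the slides):

#include <stdio.h>

int main(void) {
    int    hits = 0;
    double sum  = 0.0;
    int i;

    #pragma omp parallel for
    for (i = 0; i < 1000; i++) {
        #pragma omp atomic            /* protects a single statement */
        hits++;

        #pragma omp critical          /* protects a code block */
        {
            sum += 1.0/(double)(i + 1);
        }
    }

    printf("hits = %d, sum = %f\n", hits, sum);   /* hits is always 1000 */
    return 0;
}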


OpenMP Syntax
• OpenMP directives: sentinel, construct, and clauses

#pragma omp construct [clause [[,] clause]…]    (C)
!$omp construct [clause [[,] clause]…]          (F90)

• Example:

#pragma omp parallel private(i) reduction(+:sum)    (C)
!$omp parallel private(i) reduction(+:sum)          (F90)

Most OpenMP constructs apply to a "structured block": a block of one or more statements with one point of entry at the top and one point of exit at the bottom.
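Wrapped into a complete program, the example directive above might be used as follows (the loop body and bounds are illustrative, not from the slides):

#include <stdio.h>

int main(void) {
    const int n = 1000;
    double sum = 0.0;
    int i;

    #pragma omp parallel private(i) reduction(+:sum)
    {
        #pragma omp for
        for (i = 0; i < n; i++)
            sum += i;       /* each thread accumulates its private copy */
    }                       /* private copies combined with + at the join */

    printf("sum = %f\n", sum);   /* 499500 regardless of thread count */
    return 0;
}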


OpenMP Constructs
OpenMP language "extensions":
• Parallel control structures – govern the flow of control in the program: parallel directive
• Worksharing control – distributes work among threads: do/for, sections, single directives
• Single task – assigns work to a thread: task directive
• Data environment – specifies variables as shared or private: shared, private, reduction clauses
• Synchronization – coordinates thread execution: critical, atomic, barrier directives
• Runtime environment – runtime functions and environment variables: omp_set_num_threads(), omp_get_thread_num(), OMP_NUM_THREADS, OMP_SCHEDULE (scheduling: static, dynamic, guided)
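The OMP_SCHEDULE variable, for instance, only takes effect when a loop requests schedule(runtime); a minimal sketch (the loop and bounds are illustrative, not from the slides):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i;
    /* run with e.g. "export OMP_SCHEDULE=dynamic,4" */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < 16; i++)
        printf("iteration %2d on thread %d\n", i, omp_get_thread_num());
    return 0;
}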


Parallel Region & Worksharing
Use OpenMP directives to specify a parallel region, worksharing constructs, and mutual exclusion.

General form (C):
#pragma omp parallel
{
   #pragma omp <worksharing/mutual-exclusion directive>
   <code block>
}  // end of parallel block

Use parallel … end parallel for F90; use parallel { … } for C.

Inside a parallel region:
• do / for    – worksharing
• sections    – worksharing
• single      – one thread (worksharing)
• critical    – one thread at a time
• atomic      – one thread at a time

A single worksharing construct (e.g. a do/for) may be combined on the parallel directive line: parallel do / parallel for, parallel sections.
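Of the constructs listed above, sections is the only one not demonstrated later in these slides; a minimal sketch (the printf bodies are illustrative, not from the slides):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        #pragma omp sections          /* each section goes to one thread */
        {
            #pragma omp section
            printf("section A on thread %d\n", omp_get_thread_num());

            #pragma omp section
            printf("section B on thread %d\n", omp_get_thread_num());
        }                             /* implicit barrier here */
    }
    return 0;
}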


Worksharing: Loop (C/C++)
1 #pragma omp parallel for
2 for (i=0; i<N; i++)
3    a[i] = b[i] + c[i];

or the general form:
#pragma omp parallel
{
   #pragma omp for
   for (i=0; i<N; i++)
      a[i] = b[i] + c[i];
}

Line 1: Team of threads formed (parallel region) and loop iterations split among threads.
Each loop iteration must be independent of the other iterations.


Worksharing: Loop (F90)
1 !$omp parallel do
2 do i=1,N
3    a(i) = b(i) + c(i)
4 enddo
5 !$omp end parallel do

or the general form:
!$omp parallel
!$omp do
do i=1,N
   a(i) = b(i) + c(i)
enddo
!$omp end parallel

Line 1: Team of threads formed (parallel region).
Lines 2-4: Loop iterations are split among threads.
Line 5: (Optional) end of parallel loop (implied barrier at enddo).
Each loop iteration must be independent of the other iterations.


MIC Example Codes


Programming Models
Ready to use on day one!
• TBB (Threading Building Blocks) will be available to C++ programmers
• MKL will be available
  – Automatic offloading by the compiler for some MKL features
• Cilk Plus
  – Useful for task-parallel programming (an add-on to OpenMP)
  – May become available for Fortran users as well
• OpenMP
  – TACC expects that OpenMP will be the most interesting programming model for our HPC users


MIC Programming with OpenMP
• A MIC-specific pragma precedes the OpenMP pragma
  – Fortran: !dir$ omp offload target(mic)
  – C:       #pragma offload target(mic)
• All data transfer is handled by the compiler (optional keywords provide user control)
• The following examples cover:
  1. Automatic data management
  2. Manual data management
  3. I/O from within an offloaded region
     – Data can "stream" through the MIC; no need to leave the MIC to fetch new data
     – Also very helpful when debugging (print statements)
  4. Offloading a subroutine and using MKL


Example 1: Automatic Data Management
• A 2-D array (a) is filled with data on the coprocessor
• Data management is handled automatically by the compiler:
  – memory for (a) is allocated on the coprocessor
  – private variables (i, j, x) are created
  – the result is copied back

Fortran:
!$ use omp_lib                            ! OpenMP
integer :: n = 1024                       ! Size
real, dimension(:,:), allocatable :: a    ! Array
integer :: i, j                           ! Index
real :: x                                 ! Scalar

allocate(a(n,n))                          ! Allocation

!dir$ omp offload target(mic)             ! Offloading
!$omp parallel do shared(a,n), &          ! Par. region
!$omp&   private(x, i, j), schedule(dynamic)
do j=1, n
   do i=j, n
      x = real(i + j); a(i,j) = x
   enddo
enddo

C:
#include <stdio.h>   /* C example */
#include <omp.h>

int main() {
   const int n = 1024;   /* Size of the array */
   float a[n][n];        /* Array             */
   int i, j;             /* Index variables   */
   float x;              /* Scalar            */

   #pragma offload target(mic)
   #pragma omp parallel for shared(a), \
           private(x, j), schedule(dynamic)
   for (i=0; i<n; i++)
      for (j=i; j<n; j++) {
         x = (float)(i + j); a[i][j] = x;
      }
   return 0;
}


Example 2: Manual Data Management
Stencil update and reduction with data persistent on the coprocessor.

Data management by the programmer:
• Copy in without deallocation:                   in(a: free_if(0))
• First and second use without any data movement: nocopy(a)
• Finally, copy out without allocation:           out(a: alloc_if(0))

! Allocate, transfer, no deallocation
!dir$ omp offload target(mic) in(a: free_if(0))

! Offloading: no alloc., no transfer, no dealloc.
!dir$ omp offload target(mic) nocopy(a)
!$omp parallel do shared(a)
do j=2, n-1
   do i=2, n-1
      a(i,j) = 0.5*(a(i+1,j) + a(i-1,j) + &
                    a(i,j-1) + a(i,j+1))
   enddo
enddo

sum = 0.   ! host code between offloaded regions

! Offloading: no alloc., no transfer, no dealloc.
!dir$ omp offload target(mic) nocopy(a)
!$omp parallel do shared(a) reduction(+:sum)
do j=1, n
   do i=1, n
      sum = sum + a(i,j)
   enddo
enddo

! No allocation, transfer, deallocate
!dir$ omp offload target(mic) out(a: alloc_if(0))


Example 3: I/O from Within an Offloaded Region
• The file is opened and closed by one thread (omp single)
• Threads read from the file one at a time (omp critical)
• Threads may also read in parallel (not shown):
  – parallel file system
  – threads read different parts of a file, stored on different targets

#pragma offload target(mic)        // Offload region
#pragma omp parallel
{
   #pragma omp single              /* Open file */
   {
      printf("Opening file in offload region\n");
      f1 = fopen("/var/tmp/mydata/list.dat", "r");
   }

   #pragma omp for
   for (i=1; i<n; i++) {
      #pragma omp critical         /* one reader at a time */
      fscanf(f1, "%f", &a[i]);
   }

   #pragma omp single              /* Close file */
   fclose(f1);
}


Example 4: Offloading a Subroutine and Using MKL
• Two routines: sgemm (MKL) and my_sgemm
• Both are called with an offload directive
  – Explicit data movement is used for my_sgemm
  – Input:  in(a, b)
  – Output: out(d)
• Use the special attribute to have a routine compiled for the coprocessor, or specify library use

! Snippet from the main program
!dir$ attributes offload:mic :: sgemm
!dir$ offload target(mic)             ! Offload to MIC
call &
   sgemm('N','N',n,n,n,alpha,a,n,b,n,beta,c,n)

! Offload to the accelerator with explicit
! clauses for the data movement
!dir$ offload target(mic) in(a,b) out(d)
call my_sgemm(d,a,b)

! Snippet from the hand-coded subprogram
!dir$ attributes offload:mic :: my_sgemm
subroutine my_sgemm(d,a,b)
   real, dimension(:,:) :: a, b, d
   !$omp parallel do
   do j=1, n
      do i=1, n
         d(i,j) = 0.0
         do k=1, n
            d(i,j) = d(i,j) + a(i,k)*b(k,j)
         enddo
      enddo
   enddo
end subroutine


Heterogeneous Threading with OpenMP


OpenMP Parallel – homogeneous threading
[Diagram: the MPI process (master thread) generates a parallel region; the whole thread team executes it, then waits]

C/C++:
#pragma omp parallel private(id)
{
   id = omp_get_thread_num();
   foo(id);
}

F90:
!$omp parallel private(id)
id = omp_get_thread_num()
call foo(id)
!$omp end parallel


OpenMP Parallel – heterogeneous threading, serial
"offload()" is any form of routine or pragma that does offloading.

[Diagram: the MPI process (master thread) generates a parallel region; a single thread performs the offload while the other threads sit idle at the implied barrier; afterwards the whole team workshares the CPU loop, then waits]

C/C++:
#pragma omp parallel
{
   #pragma omp single
   { offload(); }          /* implied barrier: team waits here */

   #pragma omp for
   for (i=0; i<N; i++) {
      ...                  /* CPU workshare */
   }
}


OpenMP Parallel – heterogeneous threading, concurrent
[Diagram: one thread offloads (single nowait) while the rest of the team immediately workshares the CPU loop; the offloading thread assists when finished with the single block]

C/C++:
#pragma omp parallel
{
   #pragma omp single nowait   /* no barrier: other threads proceed */
   { offload(); }

   #pragma omp for schedule(dynamic)
   for (i=0; i<N; i++) {
      ...                      /* CPU workshare */
   }
}


Offloading Example

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
   const int N = 100000000;
   int i, id, nt, N_mic, N_cpu;
   float *a;

   a = (float *) malloc(N*sizeof(float));
   for (i=0; i<N; i++) a[i] = (float) i;

   N_mic = N/2; N_cpu = N - N_mic;    /* split the work: MIC half, CPU half */

   #pragma omp parallel private(id)
   {
      id = omp_get_thread_num();

      #pragma omp single nowait       /* one thread drives the offload */
      {
         #pragma offload target(mic) inout(a: length(N_mic))
         #pragma omp parallel for
         for (i=0; i<N_mic; i++) a[i] = 2.0f*a[i];
      }

      #pragma omp for schedule(dynamic)  /* other threads process the CPU share */
      for (i=N_mic; i<N; i++) a[i] = 2.0f*a[i];
   }
   return 0;
}


Loop Nesting
[Diagram: execution timeline of a nested parallel region: serial, the master thread forks an outer team, each outer thread forks an inner team, join, serial]

OpenMP 3.0 supports nested parallelism; older implementations may ignore the nesting and serialize inner parallel regions.
A nested parallel region can specify any number of threads for its thread team; new thread ids are assigned.


Sections and Task Nesting
• sections allows one generating thread in each section
• A nested level re-defines a thread team with new thread ids (the worksharing team is no longer dependent upon the original parallel region's team size)
• task allows "execute at will" scheduling

omp_set_nested(1);
omp_set_max_active_levels(2);
omp_set_num_threads(5);

#pragma omp parallel private(id)
{
   printf("reporting in from %d\n", omp_get_thread_num());

   #pragma omp sections nowait
   {
      #pragma omp section
      {
         for (i = 0; i < 2; i++) {
            #pragma omp task     /* tasks execute at will */
            foo(i);
         }
      }

      #pragma omp section
      {
         #pragma omp parallel for num_threads(3)   /* nested team of 3 */
         for (i = 0; i < N; i++)
            foo(i);
      }
   }
}
