
Parallel implementation of TDSE on a Graphics Processing Unit (GPU) platform
Cathal Ó Broin
Dublin City University


Together with Lampros Nikolopoulos, in collaboration with Ken Taylor.


GPU Evolution
Compilers for higher-level languages (C and Fortran) are now available for GPUs. GPUs focus on parallelism. Compared to CPUs, GPUs have:
● fewer control units
● more processing elements (cores)
● an increased amount of on-chip memory


Current GPU Example
NVIDIA Tesla cards (with Fermi):
● 448 cores
● 6 GB of memory
● 0.5 teraflops peak double-precision performance
● 148 GB/s bandwidth to the GPU


GPU Architecture
GPUs are used for highly parallel tasks.
● Most graphics cards have a SIMD architecture
● Graphics cards have a large amount of on-board memory
● GPUs aim for high throughput
● Double precision is available


What tasks are GPUs suitable for?
GPUs are suitable for tasks where:
● the task can be broken up into groups of units
● the units in a group execute the same instructions on different data
But not for tasks that:
● require high levels of communication within the task
● require high levels of flow control, such as if conditions, within the code


The Physical Problem
An atomic or molecular system in an intense laser field fulfills the TDSE:
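The equation image on this slide did not survive extraction; the standard time-dependent Schrödinger equation it refers to is (in atomic units, with Ψ the system wavefunction and Ĥ(t) the field-coupled Hamiltonian — notation chosen here, not taken from the slide):

```latex
i \frac{\partial}{\partial t} \Psi(\mathbf{r}, t) = \hat{H}(t)\, \Psi(\mathbf{r}, t)
```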


The basis expansion approach
The problem can be changed to the form:
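The equation here was also an image. A plausible reconstruction, based on the standard basis-expansion approach (symbols chosen here): expanding Ψ on a field-free basis turns the PDE into a system of coupled ODEs for the coefficient vector,

```latex
\Psi(\mathbf{r}, t) = \sum_j c_j(t)\, \phi_j(\mathbf{r})
\quad\Longrightarrow\quad
i \frac{d}{dt}\, \mathbf{c}(t) = \mathbf{H}(t)\, \mathbf{c}(t)
```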


The Hamiltonian structure


Elements of the solution
The solution is of the form:
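The formula on this slide is lost; presumably it is the formal solution of the coupled coefficient equations over one time step, the exponential of the (time-integrated) Hamiltonian acting on the coefficient vector — a reconstruction under that assumption:

```latex
\mathbf{c}(t + \delta t) = \exp\!\left(-i \int_{t}^{t+\delta t} \mathbf{H}(t')\, dt'\right) \mathbf{c}(t)
```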


The Taylor Expansion Method (TE)
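The slide's equation image is missing; the Taylor-expansion propagator it presumably shows truncates the step exponential at some order p (taking H constant over the step — a reconstruction, not copied from the slide):

```latex
\mathbf{c}(t+\delta t) \approx \sum_{k=0}^{p} \frac{(-i\,\delta t)^k}{k!}\, \mathbf{H}^k\, \mathbf{c}(t)
```

Each term is obtained from the previous one by a single matrix–vector product, which is why the method maps naturally onto a GPU kernel.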


What is OpenCL?
OpenCL (which looks like C) is a language that generalizes the computational resources of a computer. OpenCL has:
● portability between all supported architectures
● combined use of CPU and GPU execution
● compilation of code at runtime
● massive hardware vendor support


kernel void MatrixMultiplication(const global double *a, const global double *b,
                                 global double *c, int n)
{
    int LId, GroupId;
    int divcol, divrow; // Number of answers we must get
    double curr;

    LId = get_local_id(0);
    GroupId = get_group_id(0);
    divcol = n / get_local_size(0);
    divrow = n / get_num_groups(0);

    // Memory protection:
    if ((GroupId + 1) * divrow > n)
        divrow = n;
    if (divcol * (LId + divcol) > n)
        divcol = n;

    for (int k = 0; k < divrow; k++) {
        for (int j = 0; j < divcol; j++) {
            curr = 0;
            for (int i = 0; i < n; i++)
                curr += a[(GroupId * divrow + k) * n + i] * b[i * n + divcol * LId + j];
            c[(GroupId * divrow + k) * n + divcol * LId + j] = curr;
        }
    }
}


Division of Work


Graphics card used
AMD FirePro 7800
● Cost approx. 750 euro (pre-installed)
● 1 GB of total global memory
● 32 KB per local memory unit
● 64 KB of total constant memory
● 8 KB of private registers per processing element
● 1440 processing elements
● 64 processing elements per SIMD
● 18 compute units
● 400 gigaflops maximum performance


Existing CPU code in C++
● Thoroughly tested on a number of systems (H, He, Mg, etc.)
● Tested over the last ten years
● Uses a NAG propagator


Results for N = 191
[Plot: time (sec, 0–100) vs. angular momentum (3–17), comparing OpenCL with work-group sizes 64, 128, and 256 against the NAG propagator]


Results for N = 391
[Plot: time (sec, 0–700) vs. highest angular-momentum value (4–13), comparing OpenCL against the NAG propagator]


Further Work
Work will be undertaken to port the implementation to the NVIDIA-specific CUDA so that it can operate at Ireland's High-Performance Computing Centre (ICHEC). Work will also be done to implement more sophisticated methods on the GPU.


END


N = 191, L = 12
[Plot: comparison of the OpenCL and NAG results]


On OpenCL
● Kernels are functions that are called from regular CPU-based programs (host code).
● Kernels are written in an OpenCL variant of C99.
● Multiple instances of a kernel function are executed by different work items.
● Global synchronization of memory to all work items cannot be done except at the start of a new kernel function call.


Work Items
Each work item executes an instance of a kernel.
A work item differs from a thread in that:
● its instruction set should be the same as the rest of the work group
● there is no communication between work items outside the work group


Queueing in host code
● A problem can be broken up into tasks divided along synchronization points.
● Each part of a task is then implemented in a kernel function.
● In host code, written in host languages such as C, C++, and Fortran, kernels are queued for execution.
● Other items can also be queued, such as copying of buffers, or reading/writing buffers into host memory.


Synchronization
● When one item in a queue is finished, the next item queued is guaranteed to execute after it.
● Any changes to memory will be seen by the next item.
● For the Taylor expansion, a synchronization point is required after the calculation of each successive derivative.


Results
[Plot: time (0–400) vs. x-axis values 0–45, comparing OpenCL with work-group sizes 16, 32, 64, 128, and 256 against NAG]


GPU Execution Model
