06.08.2012 Views

Uncontracted Rys Quadrature implementation of up to g functions on ...


Andrey Asadchev
Jacob Felder
Veerendra Allada
Dr. Mark S. Gordon
Dr. Theresa Windus
Dr. Brett Bode

GPU Technology Conference, NVIDIA, San Jose, 2009


• Computational Quantum Chemistry
• General Atomic and Molecular Electronic Structure System (GAMESS)
• Electron Repulsion Integral (ERI) Problem
• Our Approach
• CUDA Implementation
• Optimizations
• Automatically generated code
• Performance Results
• Future Goals
• Questions & Discussion


• Use computational methods to solve for the electronic structure and properties of molecules
• Finds utility in the design of new drugs and materials
• Underlying theory is based on quantum mechanics: the Schrödinger wave equation
• Properties calculated
  ▪ Energies
  ▪ Electronic charge distribution
  ▪ Dipole moments, vibrational frequencies
• Methods employed
  ▪ Ab initio methods (solve from first principles)
  ▪ Density Functional Theory (DFT)
  ▪ Semi-empirical methods
  ▪ Molecular Mechanics (MM)


• Ab initio molecular quantum chemistry software
• USDOE "SciDAC Basic Energy Sciences" (BES) application
• Serial and parallel versions for several methods
• In brief, GAMESS can compute
  ▪ Self-Consistent Field (SCF) wave functions (RHF, ROHF, UHF, GVB, and MCSCF) using the Hartree-Fock method
  ▪ Correlation corrections to SCF using configuration interaction (CI), second-order perturbation theory, and coupled cluster (CC) theories
  ▪ Density Functional Theory approximations

Reference: M. S. Gordon, M. W. Schmidt, "Advances in electronic structure theory: GAMESS a decade later", pp. 1167-1189, in "Theory and Applications of Computational Chemistry: the first forty years", C. E. Dykstra, G. Frenking, K. S. Kim, G. E. Scuseria (editors), Elsevier, Amsterdam, 2005.


• Molecules are made of atoms, and atoms have electrons
• Electrons live in shells: s, p, d, f, g, h
• Shells are made of sub-shells that all have the same angular momentum (L)
• Shells are represented using mathematical functions
  ▪ Gaussian functions are taken as the standard primitive functions (S. F. Boys):
    \[ \phi_a(\mathbf{r}) = x^{a_x}\, y^{a_y}\, z^{a_z} \exp(-\alpha r^2) \]
  ▪ x, y, z: Cartesian coordinates relative to the center
  ▪ a_x, a_y, a_z: angular momentum components; L = a_x + a_y + a_z
  ▪ α is the exponent
• Shells with low angular momentum are typically contracted:
    \[ \chi_a(\mathbf{r}) = \sum_{k=1}^{K} D_{ka}\, \phi_k(\mathbf{r}) \]
  ▪ K is the contraction length (the number of primitives); the D_k are the contraction coefficients
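The two definitions above can be evaluated directly. The following is a minimal Python sketch, not code from the talk: the function names are illustrative, the functions are left unnormalized, and all primitives of a contraction are assumed to share one center and one set of angular momentum components.

```python
import math

def primitive_gaussian(r, center, powers, alpha):
    """Unnormalized Cartesian primitive: x^ax * y^ay * z^az * exp(-alpha * r^2).

    x, y, z are the coordinates of point r measured from the center.
    """
    x, y, z = (r[i] - center[i] for i in range(3))
    ax, ay, az = powers
    return x**ax * y**ay * z**az * math.exp(-alpha * (x * x + y * y + z * z))

def contracted_gaussian(r, center, powers, exponents, coefficients):
    """Contraction: sum over k of D_k * primitive_k(r), with K = len(exponents)."""
    return sum(d * primitive_gaussian(r, center, powers, a)
               for a, d in zip(exponents, coefficients))

# An s-type contraction of two primitives, evaluated at its own center.
value = contracted_gaussian((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0, 0, 0),
                            exponents=[1.0, 0.5], coefficients=[0.5, 0.5])
```

An uncontracted function is simply the case K = 1 with coefficient 1.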


[Flowchart: the SCF procedure]

Molecule specification
• List of atoms (atomic numbers Z)
• List of nuclear coordinates (R)
• Number of electrons
• List of primitive functions, exponents
• Number of contractions

Form the basis functions (M)

1. H core (one-electron integrals): kinetic energy integrals (T) and nuclear attraction integrals (V). Cheap one-time operation.
2. Initial guess of the wave function: obtain the guess at the density matrix (P), O(M^2).
3. ERI: two-electron repulsion integrals (ij|kl), O(M^3) to O(M^4).
4. G matrix, O(M^2): G = [(ij|kl) - ½(ik|jl)] * P
5. Form the Fock matrix: F = H core + G
6. Transformations: F' = X'FX; C' ← Diagonalize(F'); C ← XC'
7. Convergence checks: if yes, stop; if no, continue with step 8.
8. Update the density matrix from C and repeat steps 3, 4, 5, 6, 7.

Notes on the two-electron part:
• Required in every iteration
• Very expensive operation
• Stored procedures are not scalable
• Re-compute in every iteration
• Good target for GPU
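The G-matrix step can be written out with explicit indices. The sketch below assumes the usual closed-shell reading of the shorthand, G_ij = Σ_kl P_kl [(ij|kl) - ½(ik|jl)], and stores all integrals in a dense nested list. That storage is practical only for tiny M and is used here purely for illustration; the function name and data layout are not from the talk.

```python
def build_g_matrix(eri, density):
    """Return G with G[i][j] = sum_kl P[k][l] * ((ij|kl) - 0.5 * (ik|jl)).

    eri[i][j][k][l] holds the integral (ij|kl); density is the M x M matrix P.
    """
    m = len(density)
    g = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            total = 0.0
            for k in range(m):
                for l in range(m):
                    coulomb = eri[i][j][k][l]
                    exchange = eri[i][k][j][l]
                    total += density[k][l] * (coulomb - 0.5 * exchange)
            g[i][j] = total
    return g

# Toy input: M = 2, every integral equal to 1, identity density matrix.
M = 2
toy_eri = [[[[1.0] * M for _ in range(M)] for _ in range(M)] for _ in range(M)]
toy_density = [[1.0, 0.0], [0.0, 1.0]]
toy_g = build_g_matrix(toy_eri, toy_density)
```

The four nested loops make the O(M^4) cost of the two-electron part visible, which is why this step dominates each SCF iteration.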


• Four-center two-electron repulsion integral
    \[ (ab|cd) = \iint \phi_a(1)\,\phi_b(1)\,\frac{1}{r_{12}}\,\phi_c(2)\,\phi_d(2)\; d\mathbf{r}_1\, d\mathbf{r}_2 \]
• Major computational step in both ab initio and DFT methods
• Complexity is O(M^3) to O(M^4), where M is the number of basis functions (Gaussian functions are standard)
• Rys quadrature, proposed by Dupuis, Rys, and King (DRK)
  ▪ Numerical Gaussian quadrature based on a set of orthogonal Rys polynomials
  ▪ Numerically stable, low memory footprint
  ▪ Amenable to GPUs and architectures with smaller caches

ERI inputs
• Bra: shell a, shell b. Ket: shell c, shell d. Each shell is a CGTO.
• Relevant to computations, for each shell: center (x, y, z), angular momenta (a_x, a_y, a_z), contraction coefficients, Gaussian exponents
• Lower order functions are typically contracted


• The two-electron integral is expressed as
    \[ (ij|kl) = \sum_{m=0}^{L} C_m F_m(X), \qquad F_m(X) = \int_0^1 t^{2m} \exp(-X t^2)\, dt \]
• X depends on the exponents and centers and is independent of the angular momenta:
    \[ X = \rho\,(\mathbf{r}_A - \mathbf{r}_B)^2, \qquad \rho = \frac{AB}{A+B}, \qquad A = \alpha_i + \alpha_j, \qquad B = \alpha_k + \alpha_l \]
    \[ \mathbf{r}_A = \frac{\alpha_i \mathbf{r}_i + \alpha_j \mathbf{r}_j}{A}, \qquad \mathbf{r}_B = \frac{\alpha_k \mathbf{r}_k + \alpha_l \mathbf{r}_l}{B} \]
• With L = L_a + L_b + L_c + L_d,
    \[ (ij|kl) = \int_0^1 \exp(-X t^2)\, P_L(t)\, dt \]
  where P_L(t) is a polynomial of degree L in t^2. It is evaluated using N-point quadrature, hence
    \[ (ij|kl) = \sum_{n=1}^{N} W_n\, P_L(t_n), \qquad N = L/2 + 1 \]
• Using separation of variables, P_L(t), which is an integral over dr_1 dr_2, can be written as a product of three 2-D integrals over dx_1 dx_2, dy_1 dy_2, dz_1 dz_2:
    \[ (ij|kl) = 2\,(\rho/\pi)^{1/2} \sum_{n=1}^{N} I_x(t_n)\, I_y(t_n)\, I_z(t_n)\, W_n \]
  where each I_q (q = x, y, z) has dimensions (N, 0:L_a, 0:L_b, 0:L_c, 0:L_d)
• I_x, I_y, I_z are computed using recurrence and transfer relations
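The quantities that set up the quadrature are cheap to compute. The following Python sketch evaluates ρ, X, and the number of roots N for one quartet of primitives. The function name and argument layout are illustrative assumptions, and N = L/2 + 1 is read as integer division.

```python
def rys_parameters(alpha, centers, ang_momenta):
    """Return (rho, X, N) for primitives i, j, k, l.

    alpha:       exponents (alpha_i, alpha_j, alpha_k, alpha_l)
    centers:     four (x, y, z) centers r_i, r_j, r_k, r_l
    ang_momenta: shell angular momenta (L_a, L_b, L_c, L_d)
    """
    ai, aj, ak, al = alpha
    ri, rj, rk, rl = centers
    A = ai + aj
    B = ak + al
    rho = A * B / (A + B)
    # Exponent-weighted centers of the bra and ket pairs.
    r_a = [(ai * p + aj * q) / A for p, q in zip(ri, rj)]
    r_b = [(ak * p + al * q) / B for p, q in zip(rk, rl)]
    X = rho * sum((p - q) ** 2 for p, q in zip(r_a, r_b))
    L = sum(ang_momenta)
    n_roots = L // 2 + 1
    return rho, X, n_roots

# Four d-type primitives with unit exponents; bra at the origin, ket at z = 2.
origin, shifted = (0.0, 0.0, 0.0), (0.0, 0.0, 2.0)
rho, X, n_roots = rys_parameters((1.0, 1.0, 1.0, 1.0),
                                 (origin, origin, shifted, shifted),
                                 (2, 2, 2, 2))
```

Only N depends on the angular momenta, so one set of roots and weights serves every element of an ERI block.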


Rys Quadrature Algorithm

for all l do
    for all k do
        for all j do
            for all i do
                I(i, j, k, l) ← Σ_n I_x(n, ix, jx, kx, lx) · I_y(n, iy, jy, ky, ly) · I_z(n, iz, jz, kz, lz)
            end for
        end for
    end for
end for

• Summation over the roots, over all the intermediate 2-D integrals
• Floating point operations:
    \[ 3\,N \binom{L_a+2}{2} \binom{L_b+2}{2} \binom{L_c+2}{2} \binom{L_d+2}{2} \]
  where \binom{L+2}{2} = (L+1)(L+2)/2 is the number of Cartesian functions in a shell of angular momentum L
• Recurrence, transfer, and roots have predictable memory access patterns and fewer flops. The quadrature step is the main focus here.
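The flop estimate can be checked against the flop counts in the performance tables. This Python sketch assumes Cartesian shells with (L+1)(L+2)/2 functions each and that a table's "flop count" is the per-block count multiplied by the number of blocks; the helper names are illustrative.

```python
from math import comb

SHELL_L = {"s": 0, "p": 1, "d": 2, "f": 3, "g": 4}

def cartesian_functions(L):
    """Number of Cartesian functions in a shell of angular momentum L."""
    return comb(L + 2, 2)

def quadrature_flops(shells):
    """Flops of the quadrature step for one ERI block, e.g. shells = 'dddd'."""
    momenta = [SHELL_L[s] for s in shells]
    n_roots = sum(momenta) // 2 + 1
    block_size = 1
    for L in momenta:
        block_size *= cartesian_functions(L)
    # Three 2-D integrals are combined per root for every ERI element.
    return 3 * n_roots * block_size

per_block = quadrature_flops("dddd")   # one (dd|dd) block
total = per_block * 60000              # 60000 (dd|dd) blocks
```

For (dd|dd) this gives 3 · 5 · 6^4 flops per block, which reproduces the tabulated count when multiplied by the number of blocks.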


• Example: (dd|dd) ERI block
  ▪ L_a = L_b = L_c = L_d = 2
  ▪ Number of roots, N = 5
  ▪ ERI size = 6^4 = 1296 elements
  ▪ Intermediate 2-D integrals I_x, I_y, I_z size: 3^4 × 5 = 405 elements each

Possible Optimizations
• ERI computations are memory bound, hence optimize memory accesses
• Intermediate 2-D integrals are reused multiple times to construct different ERI elements
• Generate the different combinations automatically
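The sizes in the example follow from the angular momenta alone, and they also show how much reuse the 2-D integrals see. The Python sketch below is a rough estimate: the "reuse" figure is reads of 2-D integral elements during the quadrature divided by the number of distinct elements, a measure introduced here for illustration and not stated in the talk.

```python
from math import prod

def block_sizes(ang_momenta):
    """Return (n_roots, eri_size, two_d_size, reuse) for one ERI block."""
    n_roots = sum(ang_momenta) // 2 + 1
    # One ERI element per combination of Cartesian functions.
    eri_size = prod((L + 1) * (L + 2) // 2 for L in ang_momenta)
    # Each of Ix, Iy, Iz is indexed (N, 0:La, 0:Lb, 0:Lc, 0:Ld).
    two_d_size = n_roots * prod(L + 1 for L in ang_momenta)
    reads = 3 * n_roots * eri_size
    reuse = reads / (3 * two_d_size)
    return n_roots, eri_size, two_d_size, reuse

n_roots, eri_size, two_d_size, reuse = block_sizes((2, 2, 2, 2))
```

For (dd|dd) each 2-D integral element is read 16 times on average under this estimate, which is the motivation for keeping those integrals in fast memory.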


[Figure: GPU architecture and CUDA thread hierarchy. Each streaming multiprocessor (1 to N) contains eight scalar processor cores (SP1 to SP8) with their registers, a multithreaded instruction unit, two special functional units (SFU), a double precision (DP) unit, shared memory, a constant cache, and a texture cache; all multiprocessors access device memory. Threads, indexed (0,0) to (4,2), form a thread block; blocks, indexed (0,0) to (2,1), form a grid of blocks.]

SM: Streaming Multiprocessor
SP: Scalar Processor Core
SFU: Special Functional Unit
DP: Double Precision Unit


• Since the 2-D integrals are reused multiple times, load them into shared memory
  ▪ However, shared memory access and synchronization are limited to thread block boundaries
  ▪ An ERI block should therefore be mapped onto a single thread block
• Is it possible to map all the ERI elements to individual threads in a block?
  ▪ The answer depends on the ERI block under consideration
• For a (dd|dd) ERI block, ERI size = 6^4 = 1296 elements
  ▪ Maximum of 512 or 768 threads per block
  ▪ Map the i, j, k indices, corresponding to three shells of the block, to unique threads and iterate over the l index
  ▪ Thread blocks are three-dimensional, so the mapping of i, j, k is natural
• For an (ff|ff) ERI block, ERI size = 10^4 = 10000 elements
  ▪ Map the i, j indices, corresponding to the first two shells of the block, to unique threads and iterate over the k and l indices
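The choice between the two mappings reduces to a thread-count check. The Python sketch below assumes a limit of 512 threads per block (the lower of the two figures quoted above) and Cartesian shell sizes; the function name and return format are illustrative.

```python
SHELL_L = {"s": 0, "p": 1, "d": 2, "f": 3, "g": 4}

def choose_mapping(shells, max_threads=512):
    """Pick the thread mapping for an ERI block such as 'dddd' for (dd|dd).

    Returns (mapping, threads): 'ijk' if the first three shells fit in one
    thread block, otherwise 'ij'. The remaining indices are iterated over.
    """
    ni, nj, nk, _ = [(SHELL_L[s] + 1) * (SHELL_L[s] + 2) // 2 for s in shells]
    if ni * nj * nk <= max_threads:
        return "ijk", ni * nj * nk
    return "ij", ni * nj

dd_dd = choose_mapping("dddd")   # 6 * 6 * 6 = 216 threads fit
ff_ff = choose_mapping("ffff")   # 10 * 10 * 10 = 1000 threads do not
```

With this limit, (dd|dd) uses 216 threads under the ijk mapping, while (ff|ff) falls back to 100 threads under the ij mapping.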


CUDA Rys quadrature: i, j, k mapping

# map threads to ERI elements
i = threadIdx.x, j = threadIdx.y, k = threadIdx.z
# arrays LX, LY, LZ map functions to exponents
(ix, iy, iz) ← (LX[i], LY[i], LZ[i])
(jx, jy, jz) ← (LX[j], LY[j], LZ[j])
(kx, ky, kz) ← (LX[k], LY[k], LZ[k])
for all l do
    syncthreads
    # load the 2-D integrals to shmem
    Ix,shmem ← Ix(:,:,:,LX[l])
    Iy,shmem ← Iy(:,:,:,LY[l])
    Iz,shmem ← Iz(:,:,:,LZ[l])
    syncthreads
    # sum over the N roots
    I(i, j, k, l) ← Σ_N Ix,shmem · Iy,shmem · Iz,shmem
end for

Further optimizations
• (dd|dd) case
  ▪ I{x,y,z},shmem = 5(3^3) = 135 elements per 2-D block
  ▪ Across iterations, some of the elements in shared memory can be reused

d-shell ordering and number of shared memory loads:
  dx², dy², dz², dxy, dxz, dyz → 18 loads
  dy², dz², dyz, dxy, dxz, dx² → 13 loads

Exponents for the reordered d-shell (* marks a load):
       dy²  dz²  dyz  dxy  dxz  dx²
  Ix   0*   0    0    1*   1    2*
  Iy   2*   0*   1*   1    0*   0
  Iz   0*   2*   1*   0*   1*   0*
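The load counts can be reproduced by walking through a d-shell ordering and counting how often each of I_x, I_y, I_z must be refreshed. The Python sketch below assumes that 18 corresponds to reloading all three blocks on every iteration (no reuse), and that with reuse a block is reloaded only when its exponent differs from the previous iteration. The names are illustrative.

```python
# Cartesian exponents (x, y, z) of the six d functions.
D_SHELL = {
    "dx2": (2, 0, 0), "dy2": (0, 2, 0), "dz2": (0, 0, 2),
    "dxy": (1, 1, 0), "dxz": (1, 0, 1), "dyz": (0, 1, 1),
}

def count_loads(order, reuse=True):
    """Count shared memory loads of the Ix, Iy, Iz blocks over the l loop."""
    loads = 0
    previous = (None, None, None)
    for name in order:
        powers = D_SHELL[name]
        for axis in range(3):
            # Reload unless the block left by the previous iteration matches.
            if not reuse or powers[axis] != previous[axis]:
                loads += 1
        previous = powers
    return loads

original = ["dx2", "dy2", "dz2", "dxy", "dxz", "dyz"]
reordered = ["dy2", "dz2", "dyz", "dxy", "dxz", "dx2"]
naive_loads = count_loads(original, reuse=False)
reordered_loads = count_loads(reordered)
```

Under these assumptions the counts match the 18 and 13 quoted above, and the positions where the reordered sequence reloads coincide with the starred entries in the table.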


CUDA Rys quadrature: i, j mapping

# map threads to ERI elements
i = threadIdx.x, j = threadIdx.y
# arrays LX, LY, LZ map functions to exponents
(ix, iy, iz) ← (LX[i], LY[i], LZ[i])
(jx, jy, jz) ← (LX[j], LY[j], LZ[j])
for all klz-block do
    syncthreads
    # load 2-D integrals to shmem
    Iz,shmem ← Iz(:,:,LZ[k],LZ[l])
    for all klxy in klz-block do
        syncthreads
        Ix,shmem ← Ix(:,:,LX[k],LX[l])
        Iy,shmem ← Iy(:,:,LY[k],LY[l])
        syncthreads
        # sum over the N roots
        I(i, j, k, l) ← Σ_N Ix,shmem · Iy,shmem · Iz,shmem
    end for
end for

Further optimizations
• (ff|ff) case
  ▪ I{x,y,z},shmem = 7(4^2) = 112 elements per 2-D block
  ▪ 10 functions in the f-shell
  ▪ Reorder them (next slide)


f-shell function orderings (Cartesian exponents X, Y, Z of each function)

Original order:
       fx³  fy³  fz³  fx²y  fx²z  fxy²  fxz²  fxyz  fy²z  fyz²
  X    3    0    0    2     2     1     1     1     0     0
  Y    0    3    0    1     0     2     0     1     2     1
  Z    0    0    3    0     1     0     2     1     1     2

Reordered:
       fx³  fx²y  fxy²  fy³  fy²z  fxyz  fx²z  fxz²  fyz²  fz³
  X    3    2     1     0    0     1     2     1     0     0
  Y    0    1     2     3    2     1     0     0     1     0
  Z    0    0     0     0    1     1     1     2     2     3
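One way to see the effect of the reordering is to group consecutive f functions that share a z exponent. The Python sketch below treats each such run as one block for which I_z stays loaded, which is an assumed single-shell simplification of the klz-block in the ij-mapping pseudocode. The names are illustrative.

```python
from itertools import groupby

# (name, (x, y, z) exponents) in the two orderings from the tables above.
F_ORIGINAL = [
    ("fx3", (3, 0, 0)), ("fy3", (0, 3, 0)), ("fz3", (0, 0, 3)),
    ("fx2y", (2, 1, 0)), ("fx2z", (2, 0, 1)), ("fxy2", (1, 2, 0)),
    ("fxz2", (1, 0, 2)), ("fxyz", (1, 1, 1)), ("fy2z", (0, 2, 1)),
    ("fyz2", (0, 1, 2)),
]
F_REORDERED = [
    ("fx3", (3, 0, 0)), ("fx2y", (2, 1, 0)), ("fxy2", (1, 2, 0)),
    ("fy3", (0, 3, 0)), ("fy2z", (0, 2, 1)), ("fxyz", (1, 1, 1)),
    ("fx2z", (2, 0, 1)), ("fxz2", (1, 0, 2)), ("fyz2", (0, 1, 2)),
    ("fz3", (0, 0, 3)),
]

def z_blocks(functions):
    """Group consecutive functions with equal z exponent into blocks."""
    return [(z, [name for name, _ in group])
            for z, group in groupby(functions, key=lambda f: f[1][2])]

blocks_before = z_blocks(F_ORIGINAL)
blocks_after = z_blocks(F_REORDERED)
```

The reordered list forms one block per z exponent, so under this reading I_z changes only three times while looping over the shell.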


• The number of registers per thread and the shared memory per thread block limit the number of thread blocks that can be assigned per SM
• Loops implemented directly result in high register usage
• Explicitly unroll the loops. How? Doing it manually is tedious and error-prone
• Use a common template and generate all the cases
• The Python-based Cheetah template engine is used, which makes it easy to reuse existing Python utilities and program support modules
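Template-driven unrolling can be illustrated with Python's standard library alone. The sketch below uses `string.Template` as a stand-in for Cheetah, whose syntax is not reproduced here; the generated device function, its name, and its signature are hypothetical and only show the idea of emitting one fully unrolled sum per root count.

```python
from string import Template

# Hypothetical CUDA device function skeleton; ${n} and ${terms} are filled in.
KERNEL_TEMPLATE = Template("""\
__device__ double quadrature_${n}(const double *Ix, const double *Iy,
                                  const double *Iz) {
    return ${terms};
}
""")

def generate_unrolled_sum(n_roots):
    """Emit source text with the sum over roots written out term by term."""
    terms = " + ".join(f"Ix[{a}]*Iy[{a}]*Iz[{a}]" for a in range(n_roots))
    return KERNEL_TEMPLATE.substitute(n=n_roots, terms=terms)

# One variant per number of roots, e.g. N = 5 for (dd|dd) and N = 7 for (ff|ff).
sources = {n: generate_unrolled_sum(n) for n in (5, 7)}
```

Because each variant is generated from the same template, adding a shell combination means adding an entry to the loop, not writing another kernel by hand.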


                                  GFLOPS SP          GFLOPS DP
ERI      blocks  flop count    ijk map  ij map    ijk map  ij map
(gg|gg)    2000  2733750000    n/a      45.23     n/a      22.55
(gg|ff)    4000  2160000000    n/a      34.42     n/a      15.32
(ff|gg)    4000  2160000000    n/a      30.91     n/a      14.11
(gg|dd)   10000  1701000000    n/a      43.08     n/a      21.05
(gg|pp)   40000  1458000000    n/a      36.53     n/a      17.08
(pp|gg)   40000  1458000000    34.23     6.93     18.20     5.38
(ff|ff)   10000  2100000000    n/a      40.43     n/a      20.11
(ff|dd)   20000  1296000000    n/a      37.54     n/a      18.29
(dd|ff)   20000  1296000000    37.69    23.32     16.53    15.04
(ff|pp)   80000  1080000000    27.43    31.46     15.23    17.05
(pp|ff)   80000  1080000000    32.23     6.21     17.45     4.84
(dd|dd)   60000  1166400000    31.10    20.17     16.38    13.67

• ERIs with an odd number of roots perform better than those with an even number of roots
  ▪ Odd roots: (gg|gg), (gg|dd), and (ff|ff) cases
  ▪ Even roots: (ff|gg), (gg|ff), and (dd|gg)
  ▪ The difference is as high as 25%
• The difference between single and double precision is roughly a factor of two
• The larger ijk mappings perform better than the ij mappings
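The factor-of-two observation can be checked against the table. The Python sketch below uses the ij-mapping figures for the three odd-root classes named above, copied from the table; other rows, such as (pp|gg) under the ij mapping, deviate from this ratio and are left out here.

```python
# (single precision GFLOPS, double precision GFLOPS), ij mapping, from the table.
IJ_GFLOPS = {
    "(gg|gg)": (45.23, 22.55),
    "(gg|dd)": (43.08, 21.05),
    "(ff|ff)": (40.43, 20.11),
}

def sp_dp_ratios(table):
    """Return the single-to-double precision throughput ratio per ERI class."""
    return {eri: sp / dp for eri, (sp, dp) in table.items()}

ratios = sp_dp_ratios(IJ_GFLOPS)
for eri, ratio in ratios.items():
    print(f"{eri}: SP/DP = {ratio:.2f}")
```

For these three classes the ratio stays within a few percent of 2.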


                                  GFLOPS SP          GFLOPS DP
ERI      blocks  flop count    GTX 275  Tesla     GTX 275  Tesla
(gg|gg)    2000  2733750000    45.23    55.97     22.55    27.34
(gg|ff)    4000  2160000000    34.42    42.07     15.32    18.67
(ff|gg)    4000  2160000000    30.91    37.70     14.11    17.19
(gg|dd)   10000  1701000000    43.08    53.39     21.05    25.34
(dd|gg)   10000  1701000000    23.63    24.03     16.35    29.88
(gg|pp)   40000  1458000000    36.53    45.15     17.08    20.65
(pp|gg)   40000  1458000000    34.23    42.42     18.20    22.09
(ff|ff)   10000  2100000000    40.43    50.19     20.11    24.46
(ff|dd)   20000  1296000000    37.54    46.15     18.29    22.44
(dd|ff)   20000  1296000000    37.69    45.71     16.53    19.71
(ff|pp)   80000  1080000000    31.46    39.38     17.05    20.10
(pp|ff)   80000  1080000000    32.23    40.33     17.45    21.46
(dd|dd)   60000  1166400000    31.10    38.74     16.38    19.78

Inferences
• Performance depends on the ERI class under evaluation and hence also on the mapping (i, j, k vs. i, j)
• The difference between single and double precision performance is roughly a factor of two
• The difference between the GTX and the Tesla is roughly 30% (consistent with the clock speeds)
• In terms of register and shared memory usage, both are identical


• The performance of the Rys quadrature implementation is comparable to or better than that of DGEMV BLAS routines
• Some further improvements are possible through caching (texture, constant) and through more aggressive memory reuse, possibly at the expense of recomputation
• It is very easy to generate the possible ERI shell combinations using a single template
• Explicit unrolling can be controlled at different levels, such as shells and roots, to test for performance improvements
• Being developed as a standalone, application-agnostic library


• ERIs are 4-dimensional, hence it is very expensive to transfer them to the host memory after computation
• The Fock matrix is 2-dimensional, so consume the ERIs as they are formed to build the Fock matrix
• Handle the contracted ERIs
• Mixed precision support
• A complete working SCF algorithm


1) Rys, J.; Dupuis, M.; King, H. J. Comput. Phys. 1976, 21, 144.
2) Boys, S. F. Proc. R. Soc. 1950, 200, 542.
3) Rys, J.; Dupuis, M.; King, H. J. Comput. Chem. 1983, 4, 154-157.
4) Gordon, M. S.; Schmidt, M. W. Advances in electronic structure theory: GAMESS a decade later. In Theory and Applications of Computational Chemistry: the first forty years; Dykstra, C. E.; Frenking, G.; Kim, K. S.; Scuseria, G. E., Eds.; Elsevier: Amsterdam, 2005.
5) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2008, 4, 222-231.
6) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2009, 5, 1004-1015.
7) Yasuda, K. J. Comput. Chem. 2008, 29, 334-342.


US Department of Energy
Department of Defense - DURIP Grant
Ames Laboratory, Iowa State University
Air Force Office of Scientific Research
National Science Foundation - Petascale Applications grant
NVIDIA Corporation
Professor Todd Martinez and his group

asadchev@gmail.com
jfelder@iastate.edu
allada.v@gmail.com
mark@si.msg.chem.iastate.edu
theresa@fi.ameslab.gov
brett@si.msg.chem.iastate.edu
