Ab initio molecular dynamics: Theory and Implementation

are the improved load balancing for the Fourier transforms and the bigger data packages in the matrix transposes. The number of plane waves in the row groups (N_PW^pr) is calculated as the sum over all local plane waves in the corresponding column groups.

    MODULE Density
      rho(1:N_x^pr, 1:N_y, 1:N_z) = 0
      FOR i = 1:N_b, 2*P_c
        CALL ParallelTranspose(c(:,i), colgrp)
        scr1(1:N_x, 1:N_r^pencil) = 0
        FOR j = 1:N_PW^pr
          scr1(ipg(1,j), mapxy(ipg(2,j), ipg(3,j))) = c(j,i) + I*c(j,i+1)
          scr1(img(1,j), mapxy(img(2,j), img(3,j))) = CONJG[c(j,i) + I*c(j,i+1)]
        END
        CALL ParallelFFT3D("INV", scr1, scr2, rowgrp)
        rho(1:N_x^pr, 1:N_y, 1:N_z) = rho(1:N_x^pr, 1:N_y, 1:N_z)
                                    + REAL[scr2(1:N_x^pr, 1:N_y, 1:N_z)]**2
                                    + IMAG[scr2(1:N_x^pr, 1:N_y, 1:N_z)]**2
      END
      CALL GlobalSum(rho, colgrp)

The use of two task groups in the example shown in Fig. 13 increases the speedup on 256 processors from 120 to 184 on a Cray T3E/600 computer.

The effect of the non-scalability of the global communication used in CPMD is shown in Fig. 14. This example shows the percentage of time spent in the global communication routines (global sums and broadcasts) and the time spent in the parallel Fourier transforms for a system of 64 silicon atoms with an energy cutoff of 12 Rydberg. It can clearly be seen that the global sums and broadcasts do not scale and therefore become more important the more processors are used. The Fourier transforms, on the other hand, scale nicely over this range of processors. Where the communication becomes dominant depends on the size of the system and on the ratio of communication to CPU performance.

Finally, the memory available on each processor may become a bottleneck for large computations. The replicated-data approach adopted for some arrays in the implementation of the code limits the system size that can be processed on a given type of computer.
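The inner loop of the algorithm above relies on a standard trick: because each orbital is real in real space, two bands can be packed into a single complex array (c(j,i) + I*c(j,i+1)) and transformed with one inverse FFT, after which the real and imaginary parts recover the two bands. A minimal serial sketch in Python/NumPy follows; the function names, the use of full FFT grids instead of distributed pencils, and the assumption of an even band count are illustrative choices, not the actual CPMD implementation:

```python
import numpy as np

def invfft_two_real_bands(cg1, cg2):
    """Inverse-transform two real-valued bands with a single complex FFT.

    cg1, cg2: complex plane-wave coefficient grids satisfying the
    Hermitian symmetry c(-G) = conj(c(G)), so that each band is real
    in real space. Packing c = cg1 + i*cg2 into one grid, a single
    inverse FFT returns band 1 in the real part and band 2 in the
    imaginary part.
    """
    scr = np.fft.ifftn(cg1 + 1j * cg2)
    return scr.real, scr.imag

def accumulate_density(coeff_grids):
    """rho(r) = sum_i |psi_i(r)|^2, processing bands two at a time.

    Assumes an even number of bands, mirroring the stride-2 band loop
    in the pseudocode above.
    """
    rho = np.zeros(coeff_grids[0].shape)
    for i in range(0, len(coeff_grids), 2):
        psi_a, psi_b = invfft_two_real_bands(coeff_grids[i], coeff_grids[i + 1])
        rho += psi_a**2 + psi_b**2
    return rho

# Demo: build Hermitian-symmetric coefficients from two real test fields.
rng = np.random.default_rng(0)
f1, f2 = rng.random((8, 8, 8)), rng.random((8, 8, 8))
c1, c2 = np.fft.fftn(f1), np.fft.fftn(f2)
r1, r2 = invfft_two_real_bands(c1, c2)
assert np.allclose(r1, f1) and np.allclose(r2, f2)
```

In the parallel version, the inverse FFT and the accumulation act only on each task group's local pencils and x-planes, and the partial densities are combined by the final global sum over the column groups.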
In the outline given in this chapter there are two types of arrays that scale quadratically with system size and are replicated: the overlap matrix of the projectors with the wavefunctions (fnl) and the overlap matrices of the wavefunctions themselves (smat). The fnl matrix is involved in two types of calculations, where the parallel loop runs either over the bands or over the projectors. To avoid communication, two copies of the array are kept on each processor; each copy holds the data needed in one of the two distribution patterns. This scheme needs only a small adaptation of the code described above.

The distribution of the overlap matrices (smat) causes some more problems. In
