over processors. All processors should hold approximately the same number of plane waves. If a plane wave for the wavefunction cutoff is on a certain processor, the same plane wave should be on the same processor for the density cutoff. The distribution of the plane waves should be such that at the beginning or end of a three-dimensional Fourier transform no additional communication is needed. To achieve all of these goals the following heuristic algorithm$^{137}$ is used. The plane waves are ordered into "pencils". Each pencil holds all plane waves with the same $g_y$ and $g_z$ components. The pencils are numbered according to the total number of plane waves they contain. Pencils are distributed over processors in a "round robin" fashion, switching directions after each round. This is first done for the wavefunction cutoff. For the density cutoff the distribution is carried over, and all new pencils are distributed according to the same algorithm. Experience shows that this algorithm gives good load balancing on both levels, the total number of plane waves and the total number of pencils. The number of pencils on a processor is proportional to the work for the first step of the three-dimensional Fourier transform.

Special care has to be taken for the processor that holds the $\mathbf{G} = 0$ component. This component has to be treated individually in the calculation of the overlaps. The processor that holds this component will be called $p_0$.

3.9.3 CPMD Program: Computational Kernels

There are three communication routines used most in the parallelization of the CPMD code. All of them are collective communication routines, meaning that all processors are involved. This also implies that synchronization steps are performed during the execution of these routines. Occasionally other communication routines have to be used (e.g. in the output routines for the collection of data), but they do not appear in the basic computational kernels. The three routines are the Broadcast, GlobalSum, and MatrixTranspose. In the Broadcast routine data is sent from one processor $p_x$ to all other processors

$$x_p \leftarrow x_{p_x} \; . \qquad (265)$$

In the GlobalSum routine a data item is replaced on each processor by the sum over this quantity on all processors

$$x_p \leftarrow \sum_p x_p \; . \qquad (266)$$

The MatrixTranspose changes the distribution pattern of a matrix, e.g. from row distribution to column distribution

$$x(p, :) \leftarrow x(:, p) \; . \qquad (267)$$

On a parallel computer with $P$ processors, a typical latency time $t_L$ (the time for the first data item to arrive) and a bandwidth $B$, the time spent in the communication routines for $N$ data items is approximately

Broadcast: $\log_2[P]\,\{t_L + N/B\}$
GlobalSum: $\log_2[P]\,\{t_L + N/B\}$
MatrixTranspose: $P\,t_L + N/(PB)$
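The round-robin pencil distribution described at the start of this section can be made concrete with a short sketch. The following C fragment is only an illustration under assumed data structures (the names Pencil and distribute_pencils are hypothetical, and CPMD's actual bookkeeping, written in Fortran, differs): pencils are sorted by descending plane-wave count and then dealt out in a "snake" order, 0, 1, ..., P-1, P-1, ..., 1, 0, 0, 1, ..., so that the largest pencils are spread evenly.

```c
#include <stdlib.h>

/* A pencil: all plane waves sharing the same (g_y, g_z). Only the
   per-pencil plane-wave count matters for the load balancing here. */
typedef struct { int id; int nwaves; } Pencil;

/* Sort pencils by descending number of plane waves. */
static int cmp_desc(const void *a, const void *b) {
    int na = ((const Pencil *)a)->nwaves, nb = ((const Pencil *)b)->nwaves;
    return (nb > na) - (nb < na);
}

/* Deal the sorted pencils out to P processors in a round robin that
   switches direction after each round; owner[id] receives the rank. */
void distribute_pencils(Pencil *pencils, int npencils, int P, int *owner) {
    qsort(pencils, npencils, sizeof(Pencil), cmp_desc);
    int p = 0, dir = +1;
    for (int i = 0; i < npencils; ++i) {
        owner[pencils[i].id] = p;
        if ((p == P - 1 && dir > 0) || (p == 0 && dir < 0))
            dir = -dir;          /* end of a round: reverse direction */
        else
            p += dir;
    }
}
```

For the density cutoff the same routine would be applied only to the pencils that are not already fixed by the wavefunction-cutoff distribution.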
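The three collectives map one-to-one onto standard MPI operations. The following C sketch shows one plausible realization; it is an assumption for illustration, not the actual CPMD wrappers.

```c
#include <mpi.h>

/* Broadcast, Eq. (265): processor px sends its copy of x to all others. */
void broadcast(double *x, int n, int px, MPI_Comm comm) {
    MPI_Bcast(x, n, MPI_DOUBLE, px, comm);
}

/* GlobalSum, Eq. (266): every processor ends up with the sum over all
   processors, accumulated in place. */
void global_sum(double *x, int n, MPI_Comm comm) {
    MPI_Allreduce(MPI_IN_PLACE, x, n, MPI_DOUBLE, MPI_SUM, comm);
}

/* MatrixTranspose, Eq. (267): switch a matrix from row to column
   distribution. Each processor exchanges one equal-sized block with
   every other; blocks must be packed contiguously in rank order. */
void matrix_transpose(const double *sendbuf, double *recvbuf,
                      int blocksize, MPI_Comm comm) {
    MPI_Alltoall(sendbuf, blocksize, MPI_DOUBLE,
                 recvbuf, blocksize, MPI_DOUBLE, comm);
}
```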
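Evaluating the timing model above makes the different character of the routines visible: Broadcast and GlobalSum move the full N items through log2[P] stages, whereas the MatrixTranspose moves only N/P items per processor but pays P latencies, so it becomes latency-bound for small messages. The numbers in the sketch below are invented for illustration only.

```c
#include <math.h>
#include <stdio.h>

/* Evaluate the communication-time model; compile with -lm. */
int main(void) {
    double P  = 128;    /* processors (illustrative)           */
    double tL = 1e-6;   /* latency in seconds (illustrative)   */
    double B  = 1e8;    /* bandwidth in items/s (illustrative) */
    double N  = 1e6;    /* number of data items (illustrative) */

    double t_bcast = log2(P) * (tL + N / B);  /* Broadcast       */
    double t_gsum  = log2(P) * (tL + N / B);  /* GlobalSum       */
    double t_trans = P * tL + N / (P * B);    /* MatrixTranspose */

    printf("Broadcast : %.3e s\n", t_bcast);
    printf("GlobalSum : %.3e s\n", t_gsum);
    printf("Transpose : %.3e s\n", t_trans);
    return 0;
}
```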
