are the improved load balancing for the Fourier transforms and the bigger data packages in the matrix transposes. The number of plane waves in the row groups (N_PW^pr) is calculated as the sum over all local plane waves in the corresponding column groups.

MODULE Density
  rho(1:N_x^pr, 1:N_y, 1:N_z) = 0
  FOR i = 1 : N_b , 2*P_c
    CALL ParallelTranspose(c(:,i), colgrp)
    scr1(1:N_x, 1:N_r^{PW,pencil}) = 0
    FOR j = 1 : N_PW^pr
      scr1(ipg(1,j), mapxy(ipg(2,j), ipg(3,j))) = c(j,i) + I * c(j,i+1)
      scr1(img(1,j), mapxy(img(2,j), img(3,j))) = CONJG[c(j,i) + I * c(j,i+1)]
    END
    CALL ParallelFFT3D("INV", scr1, scr2, rowgrp)
    rho(1:N_x^pr, 1:N_y, 1:N_z) = rho(1:N_x^pr, 1:N_y, 1:N_z) + &&
        REAL[scr2(1:N_x^pr, 1:N_y, 1:N_z)]**2 + &&
        IMAG[scr2(1:N_x^pr, 1:N_y, 1:N_z)]**2
  END
  CALL GlobalSum(rho, colgrp)

The use of two task groups in the example shown in Fig. 13 increases the speedup for 256 processors from 120 to 184 on a Cray T3E/600 computer.

The effect of the non-scalability of the global communication used in CPMD is shown in Fig. 14. This example shows the percentage of time spent in the global communication routines (global sums and broadcasts) and the time spent in the parallel Fourier transforms for a system of 64 silicon atoms with an energy cutoff of 12 Rydberg. It can clearly be seen that the global sums and broadcasts do not scale and therefore become more important the more processors are used. The Fourier transforms, on the other hand, scale nicely for this range of processors. Where the communication becomes dominant depends on the size of the system and the performance ratio of communication to CPU.

Finally, the memory available on each processor may become a bottleneck for large computations. The replicated-data approach adopted for some arrays in the implementation of the code poses limits on the system size that can be processed on a given type of computer.
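The inner loop above exploits the fact that for real-valued (Gamma-point) wavefunctions two states can be processed with a single complex Fourier transform: the coefficients of states i and i+1 are packed as c_i + I*c_{i+1}, and after the inverse transform the real and imaginary parts separate the two states again. A minimal serial NumPy sketch of this packing trick (one dimension instead of the parallel 3D transform; all array names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Two real-space "wavefunctions" (real-valued, as at the Gamma point).
psi1 = rng.standard_normal(n)
psi2 = rng.standard_normal(n)

# Their Fourier coefficients each obey the symmetry c(-G) = CONJG[c(G)].
c1 = np.fft.fft(psi1)
c2 = np.fft.fft(psi2)

# Pack both coefficient sets into one complex array and perform a single
# inverse FFT; linearity separates the two states into real/imaginary parts.
packed = c1 + 1j * c2
out = np.fft.ifft(packed)

assert np.allclose(out.real, psi1)
assert np.allclose(out.imag, psi2)

# Density accumulation as in the loop: REAL**2 + IMAG**2 adds the
# contributions of both states in one pass.
rho = out.real**2 + out.imag**2
assert np.allclose(rho, psi1**2 + psi2**2)
```

This halves the number of 3D transforms per density build, which is why the loop over bands advances in steps of two.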
In the outline given in this chapter there are two types of arrays that scale quadratically with system size and that are replicated: the overlap matrix of the projectors with the wavefunctions (fnl) and the overlap matrices of the wavefunctions themselves (smat). The fnl matrix is involved in two types of calculations, where the parallel loop runs either over the bands or over the projectors. To avoid communication, two copies of the array are kept on each processor; each copy holds the data needed in one of the two distribution patterns. This scheme needs only a small adaptation of the code described above.

The distribution of the overlap matrices (smat) causes some more problems.
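The double replication of fnl can be sketched as follows (a toy NumPy illustration with hypothetical sizes; axis 0 stands for the projectors, axis 1 for the bands). Each processor stores one block of a projector-wise split and one block of a band-wise split, so either parallel loop finds its data locally, at the cost of storing the matrix twice:

```python
import numpy as np

n_proj, n_band, n_proc = 6, 4, 2

# Toy stand-in for the fnl overlap matrix (projectors x bands).
fnl = np.arange(n_proj * n_band, dtype=float).reshape(n_proj, n_band)

# Copy 1: projectors distributed over processors -- used when the
# parallel loop runs over the projectors.
by_proj = np.array_split(fnl, n_proc, axis=0)

# Copy 2: bands distributed over processors -- used when the parallel
# loop runs over the bands.
by_band = np.array_split(fnl, n_proc, axis=1)

# Processor p holds by_proj[p] and by_band[p]; together the blocks of
# either copy reconstruct the full matrix, so no communication is needed
# inside either loop.
assert np.allclose(np.vstack(by_proj), fnl)
assert np.allclose(np.hstack(by_band), fnl)
```

The memory price is one extra copy of an array that grows quadratically with system size, which is acceptable as long as the per-processor memory limit discussed above is not reached.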
Figure 14. Percentage of total CPU time spent in global communication routines (solid line) and in Fourier transform routines (dashed line) for a system of 64 silicon atoms on a Cray T3E/600 computer.

In addition to the adaptation of the overlap routine, the matrix multiply routines needed for the orthogonalization step also have to be performed in parallel. Although libraries for these tasks are available, the complexity of the code is considerably increased.

3.9.5 Summary

Efficient parallel algorithms for the plane wave-pseudopotential density functional theory method exist. Implementations of these algorithms are available and were used in most of the large-scale applications presented at the end of this paper (Sect. 5). Depending on the size of the problem, excellent speedups can be achieved even on computers with several hundreds of processors. The limitations presented in the last paragraph are of importance for high-end applications. Together with the extensions presented, existing plane wave codes are well suited also for the next generation of supercomputers.

4 Advanced Techniques: Beyond ...

4.1 Introduction

The discussion up to this point revolved essentially around the "basic" ab initio molecular dynamics methodologies. This means in particular that classical nuclei evolve in the electronic ground state in the microcanonical ensemble. This combination already allows a multitude of applications, but many circumstances exist where the underlying approximations are unsatisfactory. Among these cases are