
the orbitals, $\alpha = 2$ [11]. The use of localized orbitals can improve this to $\alpha = 1$ [72]. For very large systems, or systems in which the orbitals are trivial to evaluate, the cost of updating the determinants will start to dominate: this gives $\alpha = 3$ with extended orbitals and $\alpha = 2$ with localized orbitals. Hence the average cost of propagating all the configurations over one time step, which is approximately the same on each processor, is
\[
T_{\rm CPU} \approx \frac{a N^{\alpha} N_C}{P},  \tag{461}
\]

where $a$ is a constant which depends on both the system being studied and the computer being used.

Let $\sigma_{e_p}$ be the standard deviation of the set of processor energies. We assume that $\langle e_p \rangle - \langle \min\{e_p\} \rangle \propto \sigma_{e_p} \propto \sqrt{NP/N_C}$. Hence $\langle c \rangle \propto \sqrt{N N_C/P}$. The cost $B$ of transferring a single configuration is proportional to the system size $N$. Hence the cost of load-balancing is

\[
T_{\rm comm} \approx b \sqrt{\frac{N_C N^3}{P}},  \tag{462}
\]

where the constant $b$ depends on the system being studied, the wave-function quality and the computer architecture. Note that good trial wave functions will lead to smaller population fluctuations and therefore less time spent load-balancing.
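Written out, Eq. (462) is just the product of the two scalings above: the expected number of configurations transferred per step, $\langle c \rangle \propto \sqrt{NN_C/P}$, multiplied by the cost $B \propto N$ of transferring each one,
\[
T_{\rm comm} \;\propto\; \langle c \rangle\, B \;\propto\; \sqrt{\frac{N N_C}{P}} \times N \;=\; \sqrt{\frac{N_C N^3}{P}}.
\]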

Clearly one would like to have $T_{\rm CPU} \gg T_{\rm comm}$, as the DMC algorithm would be perfectly parallel in this limit. The ratio of the cost of load-balancing to the cost of propagating the configurations is

\[
\frac{T_{\rm comm}}{T_{\rm CPU}} = \frac{b}{a} \left( \frac{N_C}{P} \right)^{-1/2} N^{3/2-\alpha}.  \tag{463}
\]
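This follows directly on dividing Eq. (462) by Eq. (461):
\[
\frac{T_{\rm comm}}{T_{\rm CPU}} \approx \frac{b\sqrt{N_C N^3/P}}{a N^{\alpha} N_C/P}
= \frac{b}{a}\,\frac{N^{3/2}\,(N_C/P)^{1/2}}{N^{\alpha}\,(N_C/P)}
= \frac{b}{a}\left(\frac{N_C}{P}\right)^{-1/2} N^{3/2-\alpha}.
\]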

It is immediately clear that by increasing the number of configurations per processor, $N_C/P$, the fraction of time spent on interprocessor communication can be made arbitrarily small. In practice the number of configurations per processor is limited by the available memory, and by the fact that carrying out DMC equilibration takes longer if more configurations are used. Increasing the number of configurations does not affect the efficiency of DMC statistics accumulation, so, assuming that equilibration remains a small fraction of the total CPU time, the configuration population should be made as large as memory constraints will allow.

For $\alpha > 3/2$ (which is always the case except in the regime where the cost of evaluating localized orbitals dominates), the fraction of time spent on interprocessor communication falls off with system size. Hence processor-scaling tests on small systems may significantly underestimate the maximum usable number of processors for larger problems.
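As a rough illustration of both points, the short Python sketch below simply evaluates Eq. (463); the value of $b/a$, the system sizes and the populations per core used here are arbitrary placeholders chosen for illustration, not constants measured for any real calculation.
\begin{verbatim}
# Minimal sketch: evaluate Eq. (463) for illustrative parameters.
# The ratio b/a, the system sizes N and the populations per core below
# are hypothetical placeholders, not values measured for any real system.

def comm_to_cpu_ratio(N, alpha, configs_per_core, b_over_a=1.0):
    """T_comm / T_CPU as given by Eq. (463)."""
    return b_over_a * configs_per_core**(-0.5) * N**(1.5 - alpha)

# Raising the population per core shrinks the communication fraction
# as 1/sqrt(N_C/P).
for nc_per_core in (1, 4, 16):
    print(f"N_C/P = {nc_per_core:3d}: "
          f"ratio = {comm_to_cpu_ratio(100, 2.0, nc_per_core):.4f}")

# For alpha > 3/2 the fraction also falls with system size, which is why
# scaling tests on small systems underestimate the usable processor count.
for n in (10, 100, 1000):
    print(f"N = {n:5d}: ratio = {comm_to_cpu_ratio(n, 2.0, 4):.4f}")
\end{verbatim}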

In summary, if a user wishes to assess the parallel performance of the DMC algorithm then it is best to perform tests using the system for which production calculations are to be performed. The number of configurations should be made as large as possible.

Note, however, that the new redistribution algorithm introduced in casino 2.8 means that the casino DMC algorithm now has essentially perfect scaling out to at least a hundred thousand cores and beyond, as shown in the following diagram obtained from calculations on 120000+ cores of a Cray XT5 and discussed in Ref. [114].

