
the orbitals, $\alpha = 2$ [11]. The use of localized orbitals can improve this to $\alpha = 1$ [72]. For very large systems, or systems in which the orbitals are trivial to evaluate, the cost of updating the determinants will start to dominate: this gives $\alpha = 3$ with extended orbitals and $\alpha = 2$ with localized orbitals. Hence the average cost of propagating all the configurations over one time step, which is approximately the same on each processor, is
\[
T_{\rm CPU} \approx \frac{a N^{\alpha} N_C}{P},  \tag{461}
\]

where $a$ is a constant which depends on both the system being studied and the computer being used.

Let $\sigma_{e_p}$ be the standard deviation of the set of processor energies. We assume that $\langle e_p \rangle - \langle \min\{e_p\} \rangle \propto \sigma_{e_p} \propto \sqrt{NP/N_C}$. Hence $\langle c \rangle \propto \sqrt{N N_C/P}$. The cost $B$ of transferring a single configuration is proportional to the system size $N$. Hence the cost of load-balancing is

\[
T_{\rm comm} \approx b \sqrt{\frac{N_C N^3}{P}},  \tag{462}
\]

where the constant $b$ depends on the system being studied, the wave-function quality and the computer architecture. Note that good trial wave functions will lead to smaller population fluctuations and therefore less time spent load-balancing.
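Written out, Eq. (462) is just the product of the two scalings above: the expected number of configurations transferred per step, $\langle c \rangle \propto \sqrt{NN_C/P}$, multiplied by the cost $B \propto N$ of transferring each one,
\[
T_{\rm comm} \;\propto\; \langle c \rangle\, B \;\propto\; \sqrt{\frac{N N_C}{P}} \times N \;=\; \sqrt{\frac{N_C N^3}{P}}.
\]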

Clearly one would like to have $T_{\rm CPU} \gg T_{\rm comm}$, as the DMC algorithm would be perfectly parallel in this limit. The ratio of the cost of load-balancing to the cost of propagating the configurations is

\[
\frac{T_{\rm comm}}{T_{\rm CPU}} = \frac{b}{a} \left( \frac{N_C}{P} \right)^{-1/2} N^{3/2-\alpha}.  \tag{463}
\]
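This follows directly on dividing Eq. (462) by Eq. (461):
\[
\frac{T_{\rm comm}}{T_{\rm CPU}} \approx \frac{b\sqrt{N_C N^3/P}}{a N^{\alpha} N_C/P}
= \frac{b}{a}\,\frac{N^{3/2}\,(N_C/P)^{1/2}}{N^{\alpha}\,(N_C/P)}
= \frac{b}{a}\left(\frac{N_C}{P}\right)^{-1/2} N^{3/2-\alpha}.
\]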

It is immediately clear that by increasing the number of configurations per processor, $N_C/P$, the fraction of time spent on interprocessor communication can be made arbitrarily small. In practice the number of configurations per processor is limited by the available memory, and by the fact that carrying out DMC equilibration takes longer if more configurations are used. Increasing the number of configurations does not affect the efficiency of DMC statistics accumulation, so, assuming that equilibration remains a small fraction of the total CPU time, the configuration population should be made as large as memory constraints will allow.

For $\alpha > 3/2$ (which is always the case except in the regime where the cost of evaluating localized orbitals dominates), the fraction of time spent on interprocessor communication falls off with system size. Hence processor-scaling tests on small systems may significantly underestimate the maximum usable number of processors for larger problems.
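As a rough illustration of both points, the short Python sketch below simply evaluates Eq. (463); the value of $b/a$, the system sizes and the populations per core used here are arbitrary placeholders chosen for illustration, not constants measured for any real calculation.
\begin{verbatim}
# Minimal sketch: evaluate Eq. (463) for illustrative parameters.
# The ratio b/a, the system sizes N and the populations per core below
# are hypothetical placeholders, not values measured for any real system.

def comm_to_cpu_ratio(N, alpha, configs_per_core, b_over_a=1.0):
    """T_comm / T_CPU as given by Eq. (463)."""
    return b_over_a * configs_per_core**(-0.5) * N**(1.5 - alpha)

# Raising the population per core shrinks the communication fraction
# as 1/sqrt(N_C/P).
for nc_per_core in (1, 4, 16):
    print(f"N_C/P = {nc_per_core:3d}: "
          f"ratio = {comm_to_cpu_ratio(100, 2.0, nc_per_core):.4f}")

# For alpha > 3/2 the fraction also falls with system size, which is why
# scaling tests on small systems underestimate the usable processor count.
for n in (10, 100, 1000):
    print(f"N = {n:5d}: ratio = {comm_to_cpu_ratio(n, 2.0, 4):.4f}")
\end{verbatim}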

In summary, if a user wishes to assess the parallel performance of the DMC algorithm then it is best to perform tests using the system for which production calculations are to be performed. The number of configurations should be made as large as possible.

Note, however, that the new redistribution algorithm introduced in casino 2.8 means that the casino DMC algorithm now has essentially perfect scaling out to at least a hundred thousand cores and beyond, as shown in the following diagram obtained from calculations on 120000+ cores of a Cray XT5 and discussed in Ref. [114].

