
38 Analysis of the performance of CASINO on parallel computers

38.1 VMC in parallel

The VMC algorithm is perfectly parallel: no interprocessor communication is required during simulations. Each processor carries out an independent random walk using a different random-number sequence, and the results are averaged at the end of each block, so that running for a length of time T on P processors generates the same amount of data as running for time PT on a single processor (assuming the equilibration time to be negligible). VMC should therefore scale to an arbitrarily large number of processors.

Note that, although the energy obtained by running for time T on P processors should be in statistical agreement with that obtained by running for time PT on a single processor, the results will not be exactly equal, because the random walks are different in the two cases.
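As a toy illustration of this pattern, the following Python/mpi4py sketch shows one block of an embarrassingly parallel average; it is not CASINO's Fortran/MPI code, and local_energy and all numerical details are placeholder assumptions.

```python
# Toy illustration of perfectly parallel VMC averaging (hypothetical,
# not CASINO's actual implementation).  Each rank runs an independent
# walk with its own random-number stream; communication happens only
# when the block averages are combined.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nproc = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(seed=12345 + rank)  # distinct stream per processor

def local_energy(r):
    """Placeholder for the real local energy E_L(R) = (H Psi)/Psi."""
    return float(np.sum(r * r))

steps_per_block = 1000
r = rng.standard_normal(3)
block_sum = 0.0
for _ in range(steps_per_block):
    r = r + 0.5 * rng.standard_normal(3)  # toy move (no accept/reject step)
    block_sum += local_energy(r)

# The only communication: combine the block sums at the end of the block.
total = comm.reduce(block_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("block-average energy:", total / (nproc * steps_per_block))
```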

38.2 Optimization in parallel

38.2.1 Standard variance minimization

The VMC stages of a variance-minimization calculation are perfectly parallel, as described above. In the optimization stages, the configuration set is distributed evenly between the processors. The master processor broadcasts the current set of optimizable parameters, then each processor calculates the local energy of each of its configurations and reports the energies (and weights, if required) to the master. The CPU time required to evaluate the local energies of the configuration set usually far exceeds the time spent communicating (reporting one or two numbers per configuration to the master and receiving a handful of parameter values at each iteration). In particular, the time spent evaluating the local energies increases with system size, whereas the time spent on interprocessor communication is independent of system size. The standard variance-minimization method is therefore essentially perfectly parallel.

Note that the number of interprocessor communications could easily be reduced further if each processor were simply to report the sum of its local energies and the sum of the squares of the local energies to the master.
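A minimal sketch of one iteration's communication pattern, again in hypothetical Python/mpi4py with local_energy and the configurations as stand-ins, follows; it also uses the reduced communication suggested in the note above.

```python
# Sketch of one optimization-stage iteration (hypothetical, not CASINO
# itself).  Each processor reports only the sum of its local energies
# and the sum of their squares, from which the master can reconstruct
# the mean and variance exactly.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def local_energy(params, config):
    return float(params @ config)  # stand-in for an E_L evaluation

# Each processor holds its own share of the configuration set.
my_configs = np.random.default_rng(rank).standard_normal((100, 5))

# The master broadcasts the current optimizable parameters...
params = np.ones(5) if rank == 0 else None
params = comm.bcast(params, root=0)

# ...each processor evaluates the local energy of its configurations...
energies = np.array([local_energy(params, c) for c in my_configs])

# ...and only three numbers per processor go back to the master.
sums = np.array([energies.sum(), (energies ** 2).sum(), float(len(energies))])
totals = comm.reduce(sums, op=MPI.SUM, root=0)

if rank == 0:
    n = totals[2]
    mean = totals[0] / n
    print("unreweighted variance of E_L:", totals[1] / n - mean ** 2)
```

The sums suffice because both the mean and the variance of the local energies are functions of the total count, the sum of energies, and the sum of squared energies alone.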

38.2.2 Variance minimization for linear Jastrow parameters

The VMC stage of the optimization (including the construction and accumulation of the quartic coefficients) is perfectly parallel. The optimization itself is carried out in serial on the master node. However, this stage typically takes a fraction of a second, and the time it requires is independent of system size. The varmin-linjas scheme is therefore essentially perfectly parallel, as the sketch after this paragraph illustrates.
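The following hypothetical Python/mpi4py sketch shows the pattern for a single linear parameter a: the coefficients of a quartic V(a) are accumulated in parallel and the quartic is minimized in serial on the master. The per-configuration contributions here are random stand-ins, not CASINO's actual estimators.

```python
# Sketch of the varmin-linjas pattern for one linear parameter a
# (hypothetical).  Accumulation of the coefficients of
# V(a) = sum_k c_k a^k is parallel; minimization is serial on the master.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

rng = np.random.default_rng(rank)
my_coeffs = np.zeros(5)
for _ in range(100):                 # one contribution per configuration
    contrib = rng.standard_normal(5)
    contrib[4] = abs(contrib[4])     # keep the toy quartic bounded below
    my_coeffs += contrib

coeffs = comm.reduce(my_coeffs, op=MPI.SUM, root=0)

if rank == 0:
    c = coeffs
    # Stationary points of V are the real roots of the cubic V'(a).
    roots = np.roots([4 * c[4], 3 * c[3], 2 * c[2], c[1]])
    real = roots[np.abs(roots.imag) < 1e-10].real
    V = lambda a: sum(c[k] * a ** k for k in range(5))
    print("optimal parameter:", min(real, key=V))
```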

38.2.3 Energy minimization

The VMC stages of an energy-minimization calculation are perfectly parallel, as described above. For the matrix-algebra stages, the configurations are divided evenly between the processors, each of which separately generates one section of the full matrices. The full matrices are then gathered on the master processor, where the matrix algebra is done. The time taken by the matrix algebra is usually insignificant in comparison with the time taken in VMC and matrix generation. The time spent on interprocessor communication is recorded and written out during energy minimization; it typically amounts to at most a few percent of the total time spent in an iteration, and often to much less than one percent. Overall, energy minimization is very nearly perfectly parallel.
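A hypothetical Python/mpi4py sketch of the matrix-generation stage follows; the per-configuration quantities are random stand-ins for wavefunction derivatives and local energies, and the generalized eigenproblem stands in for whatever matrix algebra the method actually performs.

```python
# Sketch of parallel matrix generation for energy minimization
# (hypothetical).  Each processor builds its section of the H and S
# matrices from its own configurations; the sections are gathered on
# the master, which does the matrix algebra in serial.
import numpy as np
import scipy.linalg as sla
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nparam = 4

rng = np.random.default_rng(rank)
myH = np.zeros((nparam, nparam))
myS = np.zeros((nparam, nparam))
for _ in range(50):                  # this processor's configurations
    v = rng.standard_normal(nparam)  # stand-in derivative vector
    e = rng.standard_normal()        # stand-in local energy
    myS += np.outer(v, v)
    myH += e * np.outer(v, v)

sections_H = comm.gather(myH, root=0)
sections_S = comm.gather(myS, root=0)

if rank == 0:
    H, S = sum(sections_H), sum(sections_S)
    # Serial matrix algebra, e.g. a generalized eigenproblem H c = E S c.
    evals, evecs = sla.eig(H, S)
    print("lowest eigenvalue:", evals.real.min())
```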

38.3 DMC in parallel

38.3.1 Parallelization strategy

When performing DMC on a parallel machine, the population of configurations is usually distributed evenly over the set of processors. The algorithm is not perfectly parallel, because the populations on the different processors fluctuate as configurations are created and destroyed by the branching process, so that configurations must occasionally be transferred between processors to keep the load balanced.

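The following hypothetical Python/mpi4py sketch illustrates the fluctuating per-processor populations and a deliberately naive rebalancing step (a gather and re-scatter); a production code would use more economical pairwise transfers, and nothing here should be read as CASINO's actual transfer algorithm.

```python
# Sketch of DMC population distribution and naive load balancing
# (hypothetical).  Branching leaves each processor with a fluctuating
# number of walkers, which are then dealt back out evenly.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nproc = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(rank)
my_walkers = [rng.standard_normal(3) for _ in range(int(rng.integers(5, 15)))]

counts = comm.allgather(len(my_walkers))
if rank == 0:
    print("populations before balancing:", counts)

# Naive rebalancing: pool every walker on the master, deal them back out.
pool = comm.gather(my_walkers, root=0)
if rank == 0:
    flat = [w for part in pool for w in part]
    chunks = [flat[i::nproc] for i in range(nproc)]
else:
    chunks = None
my_walkers = comm.scatter(chunks, root=0)

counts = comm.allgather(len(my_walkers))
if rank == 0:
    print("populations after balancing:", counts)
```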
