CASINO manual - Theory of Condensed Matter
[Figure: parallel scaling of CASINO on JaguarPF. y-axis: [CPU time (2592 cores) / CPU time (N cores)] x 2592; x-axis: number N of processor cores, up to 1.2e+05. Curves: ideal linear scaling, CASINO 2.6 and CASINO 2.8, at fixed target population per core.]
The largest calculations that have been done were by MDT on up to 524288 cores of Japan's K
computer, where a similar scaling was achieved.
Note, however, that perfect linear scaling may require that the combination of your hardware and
MPI implementation is capable of genuinely asynchronous non-blocking MPI, i.e., that commands like
MPI_ISEND actually do what they are supposed to (in some MPI implementations this functionality
is 'faked'). Understanding the extent to which this is true requires further study.
39 OpenMP support
39.1 Introduction
In addition to MPI, CASINO also has a preliminary implementation of OpenMP, currently considered
experimental. Further development depends on successful testing, which is currently underway.
It is believed that the top-performance computing systems of this decade (2010-2019), which should
reach the exaflop scale, will have processors with a hierarchical architecture due to limitations in the
amount of power that can be reasonably delivered to and dissipated from each processing unit [115].
It is likely that the different levels in the hierarchy will require multiple simultaneous approaches to
parallelism, with an OpenMP-like level parallelizing across the cores in one or a few CPUs and an
MPI-like level parallelizing across the entire system.
For a pure-MPI QMC calculation with P processors, the total computation time t is roughly given by
t ~ MCt_c/P, where M is the number of steps, C is the number of configurations and t_c is the average
time to move one configuration at each step. However, on very large computers one can be in a situation
where the desired C and P are such that P > C, which means that some nodes will have no
configurations on them (and thus sit idle), which is a waste of resources.
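As a rough illustration of this cost model (a sketch only: the function name and all numbers below are invented for the example, and are not CASINO quantities), the formula and the idle-core problem when P > C can be expressed as:

```python
import math

def qmc_time(M, C, t_c, P):
    """Approximate wall time for the pure-MPI cost model t ~ M*C*t_c/P:
    each core moves ceil(C/P) configurations per step, for M steps;
    cores in excess of C receive no configurations and sit idle."""
    configs_per_core = math.ceil(C / P)  # load on the busiest core
    return M * configs_per_core * t_c

M, C, t_c = 10000, 4096, 0.01  # steps, configurations, seconds per move

# While P <= C, doubling the core count halves the time:
print(qmc_time(M, C, t_c, 2048))  # 200.0 s
print(qmc_time(M, C, t_c, 4096))  # 100.0 s

# Beyond P = C the extra cores are idle and the time stops improving:
print(qmc_time(M, C, t_c, 8192))  # still 100.0 s
```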
The second level of parallelism becomes useful when P > C. Running multiple OpenMP threads on
multiple cores allows C to be kept small, effectively reducing t_c in the cost formula above.
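A hedged sketch of how the second level of parallelism changes the cost model (all names and numbers are invented for the example; in particular the thread efficiency factor `eff` is an assumption, not a measured CASINO quantity):

```python
import math

def hybrid_qmc_time(M, C, t_c, cores, threads, eff=0.9):
    """Approximate wall time with `threads` OpenMP threads per MPI
    process: t_c shrinks by an (imperfect) thread speed-up threads*eff,
    and the C configurations are spread over cores/threads processes."""
    procs = cores // threads             # number of MPI processes
    t_c_eff = t_c / (threads * eff)      # per-move time with threading
    return M * math.ceil(C / procs) * t_c_eff

M, C, t_c = 10000, 4096, 0.01

# Pure MPI on 16384 cores: 12288 cores hold no configuration and idle,
# so the time is stuck at the P = C limit:
print(hybrid_qmc_time(M, C, t_c, 16384, threads=1, eff=1.0))  # 100.0 s

# 4 threads per MPI process: all 16384 cores work on the same 4096
# configurations, each move finishing ~3.6x faster:
print(hybrid_qmc_time(M, C, t_c, 16384, threads=4, eff=0.9))  # ~27.8 s
```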