
[Figure: parallel scaling of CASINO 2.6 and CASINO 2.8 on JaguarPF with a fixed target population per core, compared with ideal linear scaling. The vertical axis is [CPU time (2592 cores) / CPU time (N cores)] x 2592; the horizontal axis is the number N of processor cores, up to 1.2e+05.]

The largest calculations that have been done were by MDT on up to 524288 cores of Japan's K computer, where a similar scaling was achieved.

Note, however, that perfect linear scaling may require that the combination of your hardware and MPI implementation is capable of genuinely asynchronous non-blocking MPI, i.e., that commands like MPI_ISEND actually do what they are supposed to (in some MPI implementations this functionality is 'faked'). Understanding the extent to which this is true requires further study.
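
As an illustration of what 'genuinely asynchronous' means here, the minimal C sketch below (not taken from the casino source; the buffer size and dummy work loop are arbitrary) posts a non-blocking send, performs unrelated computation, and only then waits on the request. With true asynchronous progress the message is transferred while the compute loop runs; with a 'faked' implementation the data only move inside MPI_Wait.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        static double buf[N], rbuf[N];
        double work = 0.0;
        int rank, nprocs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (nprocs < 2) MPI_Abort(MPI_COMM_WORLD, 1);

        MPI_Request req = MPI_REQUEST_NULL;
        if (rank == 0)       /* post a non-blocking send and return at once */
            MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        else if (rank == 1)  /* matching non-blocking receive               */
            MPI_Irecv(rbuf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

        /* Unrelated computation: with genuine asynchronous progress the
           message transfer overlaps with this loop.                      */
        for (long i = 1; i <= 100000000L; i++)
            work += 1.0 / (double)i;

        if (rank <= 1)
            MPI_Wait(&req, MPI_STATUS_IGNORE);

        if (rank == 0)
            printf("work = %f\n", work);
        MPI_Finalize();
        return 0;
    }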

39 OpenMP support<br />

39.1 Introduction<br />

In addition to MPI, casino also has a preliminary implementation of OpenMP, which is currently considered experimental; further development depends on the outcome of testing that is now under way.

It is believed that the top-performance computing systems of this decade (2010–2019), which should reach the exaflop scale, will have processors with a hierarchical architecture due to limitations in the amount of power that can reasonably be delivered to and dissipated from each processing unit [115]. It is likely that the different levels in the hierarchy will require multiple simultaneous approaches to parallelism, with an OpenMP-like level parallelizing across the cores in one or a few CPUs and an MPI-like level parallelizing across the entire system.

For a pure-MPI QMC calculation with P processors, the total computation time t is roughly given by t ≈ MCt_c/P, where M is the number of steps, C is the number of configurations and t_c is the average time to move one configuration at each step. However, on very large computers one can be in a situation where the desired C and P are such that P > C, which means that there will be nodes with no configurations in them (and thus idle), which is a waste of resources.
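
For illustration (with made-up numbers): if the target population is C = 2048 configurations but the job is run on P = 8192 MPI processes, only 2048 processes hold a configuration and the remaining 6144 sit idle, wasting three quarters of the machine.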

The second level of parallelism becomes useful precisely when P > C: running multiple OpenMP threads on the cores within each node makes it possible to keep C small while still using all the available cores, effectively reducing t_c in the cost formula above.
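
As a purely illustrative sketch of such a hybrid scheme (in C, with a hypothetical move_electron routine and made-up sizes; this is not taken from the casino source), each MPI process owns a few configurations, and the per-electron work within a single configuration move is shared among OpenMP threads, which is what reduces t_c:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define NELEC 256          /* electrons per configuration (made up)      */
    #define NCONF_PER_RANK 4   /* configurations owned by each MPI process   */

    /* Stand-in for the per-electron work of one configuration-move step. */
    static double move_electron(int iconf, int ielec)
    {
        return 1.0e-6 * (double)(iconf * NELEC + ielec);
    }

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* Threads live inside each MPI process, so threaded MPI is requested. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        for (int iconf = 0; iconf < NCONF_PER_RANK; iconf++) {
            /* Second level of parallelism: the work for ONE configuration
               move is split across OpenMP threads, reducing t_c.          */
            #pragma omp parallel for reduction(+:local)
            for (int ielec = 0; ielec < NELEC; ielec++)
                local += move_electron(iconf, ielec);
        }

        /* First level of parallelism: combine results across MPI processes. */
        double total;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %f\n", total);

        MPI_Finalize();
        return 0;
    }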
