
Scientific Report 2007-2009

Laboratories and Facilities of the Department of Physics

F3. The Tier–2 Computing Centre for LHC

The Tier–2 Computing Centre for LHC is a joint INFN–University effort, as are most research activities in high energy physics in Italy. As a result, most of the resources come from INFN–Sez. di Roma, while manpower comes from both INFN and the University.

LHC experiments at CERN need large computing resources as well as huge storage. In fact, each general purpose LHC experiment, such as ATLAS and CMS, is going to collect as many as 2–4 billion events per year. The event size is of the order of 1 MB, resulting in a total of 2–4 PB of data to be stored. Data processing, moreover, is a CPU time–consuming activity for which the only solution is to parallelize jobs over many CPU cores. Taking into account that data analysis requires comparison with simulated Monte Carlo events, that the number of these events must be of the same order of magnitude as the real data, and that their production is extremely costly in terms of CPU time, the figures given above almost double. Another factor of 2–4 is required to guarantee some redundancy. Moreover, the experiments are expected to run for 10–15 years and the data must remain available for analysis for at least 20 years.
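As a rough sanity check of these figures, the short Python sketch below turns the quoted event rate, event size, and safety factors into a yearly storage estimate; the specific multipliers chosen here are illustrative assumptions, not official experiment numbers.

```python
# Back-of-envelope estimate of the yearly storage need of one LHC
# general purpose experiment, using the figures quoted in the text.
# The multipliers are illustrative assumptions, not official numbers.

events_per_year = 3e9        # 2-4 billion events per year (take ~3e9)
event_size_mb = 1.0          # ~1 MB per event
mc_factor = 2.0              # simulated sample comparable in size to real data
redundancy_factor = 3.0      # extra factor 2-4 for replicas / redundancy

raw_pb = events_per_year * event_size_mb / 1e9   # 1e9 MB = 1 PB
total_pb = raw_pb * mc_factor * redundancy_factor

print(f"raw data per year:      {raw_pb:.1f} PB")
print(f"with MC and redundancy: {total_pb:.1f} PB per year")
```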

No single laboratory is able to concentrate enough computing power and enough storage in a single place, so computing for LHC experiments is a distributed activity. We benefit from the existing GRID services, partly developed in Italy, to distribute both data and CPU load over several centres around the world, in a way that is transparent to the users. Resources are arranged hierarchically to make the system scalable. Data collected close to the experiments are stored in the so–called Tier–0 at CERN, where they are initially processed as fast as possible. Once physics data have been reconstructed, events are distributed to a few Tier–1 centres around the world, one of which is located in Bologna. Tier–1 centres have custodial responsibility for the data and are in charge of data reprocessing, when needed.

From the Tier–1’s, data are distributed to the Tier–2’s, each of which usually hosts about 20 % of the data held by a Tier–1. Tier–2’s also provide computing power both for physicists’ data analysis and for the Monte Carlo production teams.
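The resulting data shares can be pictured with the minimal sketch below (plain Python); the site names and the exact fractions are assumptions used only to illustrate the roughly 20 % rule mentioned above.

```python
# Illustrative sketch of the tiered data distribution described above.
# Site names and fractions are assumptions, not the official data model.

yearly_raw_pb = 3.0  # raw data produced per year by one experiment (see estimate above)

tiers = {
    "Tier-0 (CERN)":    1.00,  # full copy, prompt reconstruction
    "Tier-1 (Bologna)": 0.25,  # assumed share of the custodial copies
    "Tier-2 (Roma)":    0.20,  # fraction of the data held by its Tier-1
}

tier1_pb = yearly_raw_pb * tiers["Tier-1 (Bologna)"]
tier2_pb = tier1_pb * tiers["Tier-2 (Roma)"]
print(f"Tier-1 share: ~{tier1_pb:.2f} PB/year, Tier-2 share: ~{tier2_pb:.2f} PB/year")
```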

Users submit their jobs to a Resource Broker on the GRID, which knows the location of the data as well as the availability of computing power in each centre. It then distributes the jobs to many Tier–2’s, close to the target data, collects and merges all the results, and returns them to the user. Users therefore do not need to know the exact location of the data, nor which computing centres are used. They do not need to know specific data file names, either: they just provide the dataset name, i.e. a conventional, human readable identifier of a large data sample, and the system associates it to a set of files that can be distributed and/or replicated to a few centres. Databases are used to keep track of the data and their location.
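The brokering step can be pictured with the hypothetical sketch below: a dataset name is resolved, through a catalogue, into the sites holding replicas, and jobs are dispatched to those with free CPU slots. The catalogue contents, site names, and function names are all made up for illustration.

```python
# Hypothetical sketch of the brokering step described above: resolve a
# human-readable dataset name into replica sites, then assign jobs to
# sites with free CPU slots.  The user never sees individual file names.

replica_catalogue = {
    "data09_7TeV.physics_Muons": ["ROMA1", "CNAF", "MILANO"],
}
free_cpu_slots = {"ROMA1": 120, "CNAF": 0, "MILANO": 45}

def broker(dataset, jobs):
    """Assign each job to a site that hosts the dataset and has free slots."""
    sites = [s for s in replica_catalogue[dataset] if free_cpu_slots.get(s, 0) > 0]
    # simple round-robin over the candidate sites
    return [(job, sites[i % len(sites)]) for i, job in enumerate(jobs)]

print(broker("data09_7TeV.physics_Muons", ["job1", "job2", "job3"]))
```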

One of the LHC Tier–2’s is located in Roma, in the basement of the Department of Physics, and serves both the ATLAS and CMS experiments. It hosts, in a dedicated room, seven innovative water–cooled racks, each 42U high. All the racks are currently almost filled with rack-mounted CPU servers and storage units. The centre has been designed to host up to 14 racks.

Three tons of water are kept in a reservoir at 12 °C by a redundant system of two chillers. A set of three computer-controlled pumps circulates the water through a large pipe to which the racks are attached in parallel. The racks, closed on all sides, contain a heat exchanger and three fans that create a depression such that cool air from the bottom rises to the top of the rack, forming a layer of cool air at the front of the rack. The fresh air then passes through the CPUs, driven by the fans contained in each server, and reaches the back of the rack, where it is pushed down to the bottom to be cooled again. Using water instead of air to keep the units at the right temperature has many advantages: it considerably reduces power consumption, it makes the temperature much more stable (the temperature is kept around (18 ± 0.1) °C), it keeps the temperature of the room comfortable, and it provides some inertia in case of failures (the water stored in the reservoir is enough to keep the whole centre at a reasonable temperature for about 20 minutes, allowing for interventions).
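The 20 minutes quoted above are consistent with a simple thermal-inertia estimate, sketched below; the heat load and the maximum acceptable water temperature used here are assumptions for illustration.

```python
# Rough estimate of how long the water reservoir can absorb the heat load
# of the centre if the chillers stop.  The load and the allowed temperature
# rise are illustrative assumptions.

water_mass_kg = 3000.0        # "three tons of water"
c_water = 4186.0              # specific heat of water, J/(kg K)
t_start, t_max = 12.0, 25.0   # reservoir temperature and assumed safe limit, deg C
heat_load_w = 120e3           # assumed total heat load of the centre, W

stored_energy_j = water_mass_kg * c_water * (t_max - t_start)
minutes = stored_energy_j / heat_load_w / 60.0
print(f"thermal buffer: ~{minutes:.0f} minutes")   # ~23 minutes
```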

All the racks are connected to a UPS-protected power line rated at up to 120 kVA, able to keep the system running for about 30 minutes in case of problems on the mains line. Moreover, the centre is equipped with flood sensors and smoke detectors, as well as with an automatic fire extinguishing system connected to sound and visual alarms.

Servers are internally connected by a 1 Gb/s LAN (to be upgraded to 10 Gb/s). The connection to the WAN is provided by two redundant links to two different GARR POPs, each over a 10 Gb/s fiber.

A lot of effort has been spent on keeping the centre under full control. In particular, we developed many monitoring tools that measure several quantities and report any anomaly to a centralized system, from which we can check the current status and up to one year of history of any monitored quantity. All information is accessible via the web, even remotely, and intelligent agents have been deployed to automatically recover from known problems. Moreover, the centre is equipped with a GSM interface that can be used either to send alarms to cell phones via SMS, or to receive commands from them in the form of an SMS. With this system we can remotely interrogate the databases as well as change some predefined configurations, even in the absence of any Internet access.
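A monitoring check of the kind described above can be pictured with the following minimal sketch; the quantity names, thresholds, and the way records would be forwarded to the central collector (and, on alarm, to the GSM gateway) are illustrative assumptions.

```python
# Minimal sketch of a monitoring check: sample a quantity, compare it
# against a threshold window, and build a record for the central system.
# Quantity names, thresholds, and the forwarding step are assumptions.

import time

THRESHOLDS = {"water_temp_c": (10.0, 14.0), "rack_inlet_c": (16.0, 20.0)}

def check(name, value):
    """Return a status record for one monitored quantity."""
    low, high = THRESHOLDS[name]
    status = "OK" if low <= value <= high else "ALARM"
    # in the real system the record would be pushed to the central database
    # and, on ALARM, forwarded to the GSM gateway as an SMS
    return {"time": time.time(), "quantity": name, "value": value, "status": status}

print(check("water_temp_c", 12.1))
print(check("rack_inlet_c", 21.4))
```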

The centre has been running almost flawlessly, 24/7, for the last two years.

Related research activities: P1, P2, P3, P4, P5, P6.

