12.07.2015 Views

Computational infrastructure for NGS data analysis - Bioinformatics ...


Computational infrastructure for NGS data analysis
Pablo Escobar
pescobar@cipf.es


Computational infrastructure for NGS
In NGS we have to process really big amounts of data, which is not trivial in computing terms.
Big NGS projects require supercomputing infrastructures.


Data tsunami is real
Some disks in our lab...


NGS file sizes


Sequencing cost vs IT cost
Sequencing cost goes down... so IT cost goes up
Image source: http://www.existencegenetics.com/fullgenome.php


Computational infrastructure for NGS
These infrastructures are expensive and not trivial to use. We require:
  • A conditioned data center (server room). This is expensive.
  • A computing cluster:
      • Many computing nodes (servers)
      • High-performance, high-capacity storage
      • Fast networks (10 Gb Ethernet, InfiniBand...)
  • People skilled in computing (sysadmins and developers). At CNAG there are currently 30 staff, more than 50% of them in informatics.


Computing cluster
  • Distributed memory cluster
      • 8 or 12 cores per node
      • x86_64 architecture
      • At least 48 GB RAM per node
  • Fast networks
      • 10 Gbit Ethernet
      • InfiniBand
  • Batch queue system (SGE, Condor, PBS, Slurm)
  • Many GPU tools are being developed, so it is not a bad idea to have some GPUs if you plan to use those tools.
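To illustrate what working with a batch queue system looks like, here is a minimal sketch that builds a job script for Slurm, one of the schedulers listed above. The choice of Slurm, the function name `slurm_script`, and the command `./align_reads.sh sample01.fastq` are assumptions made for illustration; they do not come from the slides. The resource figures (8 cores, 48 GB) reuse the per-node numbers given on the slide.

```python
def slurm_script(job_name: str, command: str, cores: int = 8, mem_gb: int = 48) -> str:
    """Return the text of a single-node Slurm batch script.

    The defaults mirror the node size from the slide (8 cores, 48 GB RAM).
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --nodes=1",                      # run on a single node
        f"#SBATCH --cpus-per-task={cores}",       # cores reserved for the job
        f"#SBATCH --mem={mem_gb}G",               # memory reserved on the node
        f"#SBATCH --output={job_name}.%j.log",    # %j expands to the job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical command: replace with the real analysis step.
script = slurm_script("align_sample01", "./align_reads.sh sample01.fastq")
print(script)
```

The printed text would be saved to a file and submitted with `sbatch`. The scheduler then decides when and on which node the job runs, which is what lets many users share the same cluster.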


Storage system
  • Storage is the most important piece of the IT infrastructure for NGS.
  • Storage is the most expensive piece.
  • Good design is really important. Talk with experts.
  • Keep storage scalability in mind.
  • Try to keep storage flexible. Changes come fast.


Storage system
  • Traditional backups are a problem, if they are possible at all.
  • RAID is your friend.
  • Plan a good data storage policy.
  • Recommended reading: http://www.bioteam.net/wp-content/uploads/2010/03/cdag-xgen-storageForNGS_v3.pdf


Storage system
  • Distributed filesystems for high-performance storage:
      • Lustre
      • GPFS
      • Ibrix
      • GlusterFS
      • Panasas
      • Isilon
  • These filesystems are not trivial to administer.
  • NFS is not a good option for supercomputing.


Distributed filesystem schema


Infrastructure schema


Small infrastructure
  • At least 2 machines recommended
  • 8 or 12 cores per machine
  • 48 GB RAM minimum per machine
  • BIG local disk: at least 4 TB per machine
  • As many local disks as we can afford
  • Price range: starting at 8,000 to 10,000 € (two machines)


Sequencing centers in Spain
Medical Genome Project
  • Sequencing instruments
      • 7 GS-FLX (Roche)
      • 4 SOLiD™ 5500 (Applied Biosystems)
  • Informatics infrastructure
      • 300-core cluster
      • 0.5 petabyte Ibrix filesystem


Medical Genome Project
[Photo labels: storage racks, IBRIX filesystem, front-ends]


MGP raw data generation
  • A SOLiD sequencer run:
      • 7 days running
      • Generates around 4 TB
  • The four SOLiD sequencers alone, working full time, can generate around 12 TB each week.
  • That is 12 TB of raw data only. Running the bioinformatics analysis generates more data.
  • Raw data size grows really fast:
      • New sequencer models
      • New reagents
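To put these figures in perspective, the sketch below estimates how long the 0.5 petabyte filesystem mentioned for the Medical Genome Project would last at around 12 TB of raw data per week. It assumes an initially empty filesystem, a constant growth rate, and 1 PB = 1000 TB. The `ANALYSIS_OVERHEAD` factor is a hypothetical value chosen only to show the effect of analysis output; the slides do not give a number for it.

```python
def weeks_until_full(capacity_tb: float, used_tb: float, growth_tb_per_week: float) -> float:
    """Weeks left before the filesystem is full at a constant growth rate."""
    if growth_tb_per_week <= 0:
        raise ValueError("growth rate must be positive")
    return max(capacity_tb - used_tb, 0.0) / growth_tb_per_week


RAW_TB_PER_WEEK = 12.0    # from the slide: four SOLiD sequencers working full time
CAPACITY_TB = 500.0       # 0.5 PB Ibrix filesystem, taking 1 PB = 1000 TB
ANALYSIS_OVERHEAD = 1.0   # hypothetical: analysis output as large as the raw data

raw_only = weeks_until_full(CAPACITY_TB, 0.0, RAW_TB_PER_WEEK)
with_analysis = weeks_until_full(
    CAPACITY_TB, 0.0, RAW_TB_PER_WEEK * (1 + ANALYSIS_OVERHEAD)
)

print(f"raw data only:  {raw_only:.1f} weeks")
print(f"raw + analysis: {with_analysis:.1f} weeks")
```

Under these assumptions the raw data alone would fill the filesystem in about 41.7 weeks, and in about 20.8 weeks if analysis output doubles the volume. That is one reason a data storage policy matters.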


MGP raw data generation


Sequencing centers in Spain
CNAG
  • Sequencing instruments
      • 10 Illumina HiSeq 2000
  • Informatics infrastructure
      • 850-core cluster
      • 1.2 petabyte Lustre filesystem (growing to 2 PB)
      • 10 x 10 Gb/s link with MareNostrum (Barcelona supercomputer, 10,240 cores)


CNAG


BGI: largest sequencing center in the world
  • Sequencing instruments
      • Illumina HiSeq
      • AB SOLiD System
      • Ion Torrent
  • Informatics infrastructure (8 data centers)
      • 20,576-core cluster
      • 17 PB
Source: http://www.genomics.cn/en/navigation/show_navigation?nid=4109


Largest sequencing center in the world
Beijing Genomics Institute (BGI)


Sequencing center resources


Most used operating system is GNU/Linux
Source: http://www.top500.org/stats/list/36/osfam


Alternatives: cloud computing
Pros
  • Flexibility.
  • You pay for what you use.
  • You don't need to maintain a data center.
Cons
  • Transferring big datasets over the internet is slow.
  • You pay for consumed bandwidth. That is a problem with big datasets.
  • Lower performance, especially in disk read/write.
  • Privacy/security concerns.
  • More expensive for big and long-term projects.
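The transfer-time concern can be made concrete with a rough estimate. The sketch below computes how long it would take to upload one week of raw data (around 12 TB, the figure given for the Medical Genome Project) over links of different speeds. The link speeds and the `efficiency` factor of 0.8 are assumptions for illustration, not measurements from the slides.

```python
def transfer_hours(size_tb: float, link_mbit_s: float, efficiency: float = 0.8) -> float:
    """Hours needed to move size_tb terabytes over a link of link_mbit_s megabits/s.

    efficiency is the assumed fraction of the nominal bandwidth actually achieved.
    """
    if link_mbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    size_bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    effective_bits_per_s = link_mbit_s * 1e6 * efficiency
    return size_bits / effective_bits_per_s / 3600


WEEKLY_RAW_TB = 12.0  # from the slides: weekly raw output of four SOLiD sequencers

# Hypothetical link speeds, in megabits per second.
for mbit_s in (100, 1_000, 10_000):
    hours = transfer_hours(WEEKLY_RAW_TB, mbit_s)
    print(f"{mbit_s:>6} Mbit/s: {hours:8.1f} hours")
```

With these assumptions, a 100 Mbit/s connection needs about 333 hours (roughly two weeks) to upload one week of raw data, so the upload cannot keep pace with the sequencers. A 10 Gbit/s link brings that down to a few hours.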
