Computational infrastructure for NGS data analysis - Bioinformatics ...
Computational infrastructure for NGS data analysis
Pablo Escobar
pescobar@cipf.es
Computational infrastructure for NGS
- In NGS we have to process very large amounts of data, which is not trivial in computing terms.
- Big NGS projects require supercomputing infrastructure.
The data tsunami is real
- Some disks in our lab...
NGS file sizes
Sequencing cost vs IT cost
- As sequencing cost goes down, IT cost goes up.
- Image source: http://www.existencegenetics.com/fullgenome.php
Computational infrastructure for NGS
These infrastructures are expensive and not trivial to use. We require:
- A climate-controlled data center (server room). This is expensive.
- A computing cluster:
  - Many computing nodes (servers)
  - High-performance, high-capacity storage
  - Fast networks (10 Gb Ethernet, InfiniBand...)
- People skilled in computing (sysadmins and developers). CNAG currently has 30 staff, more than 50% of them in informatics.
Computing cluster
- Distributed memory cluster
  - 8 or 12 cores per node
  - x86_64 architecture
  - At least 48 GB RAM per node
- Fast networks
  - 10 Gbit Ethernet
  - InfiniBand
- Batch queue system (SGE, Condor, PBS, Slurm)
- Many GPU tools are being developed, so it is not a bad idea to have some GPUs if you plan to use them.
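As an illustration of how work reaches a batch queue system, here is a minimal sketch that builds a Slurm batch script sized for one node of the kind described above (8 cores, 48 GB RAM). The job name, the `run_alignment.sh` command and the 24-hour time limit are hypothetical placeholders, not part of the original slides; the `#SBATCH` options are standard `sbatch` options.

```python
def make_sbatch_script(job_name, command, cores=8, mem_gb=48, hours=24):
    """Return the text of a Slurm batch script that requests one node.

    The defaults mirror the node size on the slide (8 cores, 48 GB RAM).
    The time limit is an assumption chosen only for illustration.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --nodes=1",
        f"#SBATCH --cpus-per-task={cores}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={hours:02d}:00:00",
        # %j is replaced by Slurm with the job ID
        f"#SBATCH --output={job_name}.%j.log",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: the script name and sample ID are placeholders.
script = make_sbatch_script("align_sample1", "./run_alignment.sh sample1")
print(script)
```

The generated text would be saved to a file and submitted with `sbatch`. SGE, Condor and PBS use different directive syntax, so this sketch applies to Slurm only.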
Storage system
- Storage is the most important piece of the IT infrastructure for NGS.
- Storage is also the most expensive piece.
- Good design is really important. Talk with experts.
- Keep storage scalability in mind.
- Try to keep storage flexible. Changes come fast.
Storage system
- Traditional backups are a problem, if they are possible at all.
- RAID is your friend.
- Plan a good data storage policy.
- Recommended reading: http://www.bioteam.net/wp-content/uploads/2010/03/cdag-xgen-storageForNGS_v3.pdf
Storage system
- Distributed filesystems for high-performance storage:
  - Lustre
  - GPFS
  - Ibrix
  - GlusterFS
  - Panasas
  - Isilon
- These filesystems are not trivial to administer.
- NFS is not a good option for supercomputing.
Distributed filesystem schema
Infrastructure schema
Small infrastructure
- At least 2 machines recommended
- 8 or 12 cores per machine
- 48 GB RAM minimum per machine
- BIG local disk: at least 4 TB per machine
- As many local disks as we can afford
- Price range: starting at €8,000 to €10,000 (two machines)
Sequencing centers in Spain: Medical Genome Project
- Sequencing instruments
  - 7 GS-FLX (Roche)
  - 4 SOLiD 5500 (Applied Biosystems)
- Informatics infrastructure
  - 300-core cluster
  - 0.5 petabyte IBRIX filesystem
Medical Genome Project
- Storage racks
- IBRIX filesystem
- Front-ends
MGP raw data generation
- A SOLiD sequencer run:
  - Takes 7 days
  - Generates around 4 TB
- The four SOLiD sequencers alone, working full time, can generate around 12 TB each week.
- That is 12 TB of raw data only. Running the bioinformatics analysis generates more data.
- Raw data size grows really fast:
  - New sequencer models
  - New reagents
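To put the weekly figure in perspective, the following sketch projects raw data volume over time and estimates how long a filesystem of a given size would last. It uses the 12 TB per week and the 0.5 petabyte filesystem quoted in the slides and assumes decimal units (1 PB = 1000 TB). The `expansion_factor` parameter is a hypothetical multiplier for the extra data produced by analysis; the slides give no figure for it.

```python
def raw_data_tb(weekly_tb, weeks):
    """Raw data accumulated after a number of weeks at a constant weekly rate."""
    return weekly_tb * weeks


def weeks_until_full(capacity_tb, weekly_tb, expansion_factor=1.0):
    """Weeks until the storage is full.

    expansion_factor is the total data stored per TB of raw data
    (raw plus analysis output). 1.0 means raw data only. This factor
    is an illustrative assumption, not a value from the slides.
    """
    return capacity_tb / (weekly_tb * expansion_factor)


# Figures from the slides: 12 TB/week, 0.5 PB (500 TB) filesystem.
per_year = raw_data_tb(12, 52)
raw_only = weeks_until_full(500, 12)
# Hypothetical case: analysis doubles the stored volume.
with_analysis = weeks_until_full(500, 12, expansion_factor=2.0)

print(f"Raw data per year: {per_year} TB")
print(f"Weeks until full (raw only): {raw_only:.1f}")
print(f"Weeks until full (factor 2.0): {with_analysis:.1f}")
```

At a constant rate, raw data alone would fill the 0.5 PB filesystem in under a year, which is why a data storage policy has to be planned up front.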
MGP raw <strong>data</strong> generation
Sequencing centers in Spain: CNAG
- Sequencing instruments
  - 10 Illumina HiSeq 2000
- Informatics infrastructure
  - 850-core cluster
  - 1.2 petabyte Lustre filesystem (growing to 2 PB)
  - 10 x 10 Gb/s link with MareNostrum (Barcelona Supercomputing Center, 10,240 cores)
CNAG
BGI - Largest sequencing center in the world
- Sequencing instruments
  - Illumina HiSeq
  - AB SOLiD System
  - Ion Torrent
- Informatics infrastructure (8 data centers)
  - 20,576-core cluster
  - 17 PB
- Source: http://www.genomics.cn/en/navigation/show_navigation?nid=4109
Largest sequencing center in the world
- Beijing Genomics Institute (BGI)
Sequencing center resources
The most used operating system is GNU/Linux
- Source: http://www.top500.org/stats/list/36/osfam
Alternatives: cloud computing
- Pros
  - Flexibility.
  - You pay for what you use.
  - You don't need to maintain a data center.
- Cons
  - Transferring big datasets over the internet is slow.
  - You pay for consumed bandwidth. That is a problem with big datasets.
  - Lower performance, especially in disk read/write.
  - Privacy/security concerns.
  - More expensive for big and long-term projects.
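The transfer-time concern can be made concrete with a back-of-the-envelope calculation. The sketch below estimates how long it would take to upload one week of raw data (the 12 TB figure quoted earlier) over links of different speeds. The link speeds and the `efficiency` parameter are assumptions for illustration; real throughput depends on protocol overhead and network conditions.

```python
def transfer_hours(size_tb, link_gbit_s, efficiency=1.0):
    """Hours needed to move size_tb terabytes over a link.

    Uses decimal units (1 TB = 1e12 bytes, 1 Gbit = 1e9 bits).
    efficiency is the fraction of nominal bandwidth actually achieved;
    1.0 is an idealized assumption.
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbit_s * 1e9 * efficiency)
    return seconds / 3600


# One week of raw data (12 TB) over three hypothetical links.
for gbit in (0.1, 1, 10):
    hours = transfer_hours(12, gbit)
    print(f"{gbit:>4} Gbit/s: {hours:8.1f} hours")
```

Even at an ideal 1 Gbit/s, one week of raw data takes more than a day to upload, and at 100 Mbit/s it takes longer than the week it took to generate.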