12.07.2015 Views

HPG Variant - Bioinformatics and Genomics Department at CIPF

Introduction: Data generation pipeline
  • NGS sequencer → Fastq file
  • QC, preprocessing and alignment → SAM / BAM file
  • QC, preprocessing and variant call analysis → VCF file
  • QC and preprocessing
  • Variant effect analysis
  • Genome-wide analysis


Introduction: Variant Call Format (VCF)
  • Text file format (versions 4.0 and 4.1)
  • Each line contains information about a position of the genome

    ##fileformat=VCFv4.0
    ##INFO=
    ##INFO=
    ##INFO=
    ##INFO=
    ##reference=human_b36_both.fasta
    ##FORMAT=
    ##FORMAT=
    ##FORMAT=
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878
    1 52066 rs28402963 T C . PASS AA=C;DP=84 GT:GQ:DP 1/0:44:23 1/0:43:20 1/0:70:36
    1 695745 . G A . PASS AA=.;DP=124 GT:GQ:DP 1|0:100:34 0|0:62:20 1|0:100:56
    1 742429 rs3094315 G A . PASS AA=g;DP=132;HM2 GT:GQ:DP 1|1:100:38 1|1:59:30 1|1:100:44
    1 742584 rs3131972 A G . PASS AA=a;DP=160;HM3 GT:GQ:DP 1|1:100:50 1|1:100:33 1|1:100:60
    1 744366 rs3115859 G A . PASS AA=g;DP=127 GT:GQ:DP 1|1:80:31 1|1:100:34 1|1:100:45
    1 746243 rs3131963 T A . PASS AA=t;DP=105 GT:GQ:DP 1|1:52:29 1|1:43:24 1|1:100:56
    1 746775 rs6699990 A G . PASS AA=N;DP=120 GT:GQ:DP 0|1:100:34 0|0:89:30 0|0:100:46
    1 747503 rs3115853 G A . PASS AA=-;DP=113 GT:GQ:DP 1|1:100:37 1|1:61:25 1|1:100:27
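To make the column layout of a VCF data line concrete, here is a minimal C sketch that splits one line into its fixed columns and extracts the genotype of the first sample. The `vcf_record` type and `parse_vcf_line` function are hypothetical names for illustration only, not part of HPG Variant. The sketch assumes columns are separated by tabs or spaces and that GT is the first FORMAT key, as in the example on the slide.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FIELDS 64  /* assumption: enough columns for this sketch */

/* Hypothetical record type for illustration; not an HPG Variant structure. */
typedef struct {
    const char *chrom;
    long pos;
    const char *id;
    const char *ref;
    const char *alt;
    const char *genotype;  /* GT of the first sample, e.g. "1/0" or "1|1" */
} vcf_record;

/* Splits `line` in place on tabs or spaces and fills `rec`.
 * Returns 0 on success, -1 for header lines or lines with too few columns.
 * Uses strtok, so it is not safe to call from several threads at once. */
int parse_vcf_line(char *line, vcf_record *rec) {
    assert(line != NULL && rec != NULL);
    if (line[0] == '#') return -1;  /* meta-information or column header */

    char *fields[MAX_FIELDS];
    int n = 0;
    for (char *tok = strtok(line, " \t\r\n"); tok != NULL && n < MAX_FIELDS;
         tok = strtok(NULL, " \t\r\n")) {
        fields[n++] = tok;
    }
    if (n < 10) return -1;  /* 8 fixed columns + FORMAT + at least one sample */

    rec->chrom = fields[0];
    rec->pos = strtol(fields[1], NULL, 10);
    rec->id = fields[2];
    rec->ref = fields[3];
    rec->alt = fields[4];

    /* With FORMAT "GT:GQ:DP", the genotype is the text before the first ':' */
    char *colon = strchr(fields[9], ':');
    if (colon != NULL) *colon = '\0';
    rec->genotype = fields[9];
    return 0;
}
```

Applied to the first data line of the example, this would yield chromosome "1", position 52066, ID rs28402963 and genotype "1/0" for sample NA12891.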


HPC technology: OpenMP
  • API for shared memory architectures
  • C/C++ and Fortran implementations
  • Directives translated by the compiler

    /* Loop iterations are divided among the threads */
    #pragma omp parallel for
    for (int i = 0; i < max_loop; i++) {
        c[i] = a[i] + b[i];
    }

    /* Each section is run by a different thread; the opening brace
       must go on the line after the directive */
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            push(element, stack);
        }
        #pragma omp section
        {
            element = pop(stack);
        }
    }
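The loop directive can be wrapped into complete functions, as in the following sketch. `vector_add` is the slide's loop with explicit types; `vector_sum` is an addition for illustration that uses the standard OpenMP `reduction` clause, which the slide does not show. Both function names are made up for this example. Compiled without OpenMP support, the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>

/* c[i] = a[i] + b[i]. Iterations are independent, so OpenMP can
 * distribute them among threads without extra synchronization. */
void vector_add(const double *a, const double *b, double *c, int n) {
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Sum of all elements. The reduction clause gives each thread a private
 * partial sum and combines the partial sums when the loop finishes. */
double vector_sum(const double *v, int n) {
    assert(n >= 0);
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += v[i];
    }
    return total;
}
```

With GCC or Clang, OpenMP is typically enabled with the `-fopenmp` compiler flag.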


HPC technology: OpenMP
  • Hands-on demo!


HPG VCF Tools: Utilities for VCF file management
  • Analysis involves common but tedious tasks:
      • Combining data from different sources
      • Removing non-interesting data
  • Existing tools implemented in Perl → slow!
  • Goal: Speed-up with low memory footprint
  • Results: Biologists focused on analysis


HPG VCF Tools: Splitting and merging
  • Split:
      • By criteria: chromosome, region, coverage...
      • hpg-vcf stats --vcf-file input.vcf
  • Merge:
      • 2 or more input VCF files
      • hpg-vcf merge --vcf-file input1.vcf,input2.vcf --output output.vcf
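As an illustration of a coverage criterion, the sketch below reads the DP (depth) entry from the INFO column of a record and compares it with a threshold. It is not taken from the HPG VCF Tools sources. The function names `info_depth` and `has_min_coverage` are hypothetical, and using the INFO-level DP as the coverage measure is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns the value of the DP entry in a VCF INFO string such as
 * "AA=C;DP=84", or -1 if the string has no DP entry. */
long info_depth(const char *info) {
    assert(info != NULL);
    const char *p = info;
    while (*p != '\0') {
        /* p always points at the start of a ';'-separated entry */
        if (strncmp(p, "DP=", 3) == 0) {
            return strtol(p + 3, NULL, 10);
        }
        const char *next = strchr(p, ';');
        if (next == NULL) break;
        p = next + 1;
    }
    return -1;
}

/* Hypothetical split criterion: 1 if the record reaches min_depth. */
int has_min_coverage(const char *info, long min_depth) {
    return info_depth(info) >= min_depth;
}
```

A splitter could route each record to one of two output files depending on the return value of `has_min_coverage`.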


HPG VCF Tools
  • Hands-on demo stage 2!


HPG Variant: Suite for variant analysis
  • State of the art in other apps:
      • Lots of features
      • Lack of performance
  • Goal: To exploit parallelism to speed up queries and analysis with a low memory footprint
  • Multiple tools:
      • Effect of variants
      • Genomic analysis
      • Variant calling


HPG Variant: Variant Effect (I)
  • A variant effect predictor tool
  • Port of VARIANT tool:
      • Queries CellBase database via a RESTful Web Services API
      • CellBase DB: many consequence types and species
      • WS, web and command-line interfaces


HPG Variant: Variant Effect (II)
  • New back-end and command-line interface
  • Implemented in C with OpenMP directives
  • Queries CellBase DB using libcurl library
  • Processes files of arbitrary size
  • Benchmark for ~500K variants, 147 individuals:
      • 3 main threads and 4 “forked” threads
      • Executed in 5:30 minutes on an Intel Xeon E31425, 8 GB RAM, network connection over Gigabit Ethernet
      • No more than 2 GB of memory used (configurable)
      • 1500 variants per second!
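One way to process files of arbitrary size within a fixed memory budget is to read them in batches, as sketched below. This is not necessarily the scheme used by HPG Variant. The names `read_batch` and `count_variants` are hypothetical, and the sketch assumes that no line is longer than `MAX_LINE` characters, which may not hold for VCF files with many samples. The per-batch work here is just counting data lines; a real tool would do its analysis in that loop.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_LINE 4096  /* assumption: no line exceeds this length */

/* Reads up to batch_size lines into `lines`; returns how many were read. */
static int read_batch(FILE *in, char (*lines)[MAX_LINE], int batch_size) {
    int n = 0;
    while (n < batch_size && fgets(lines[n], MAX_LINE, in) != NULL) {
        n++;
    }
    return n;
}

/* Counts data (non-header, non-empty) lines while keeping at most
 * batch_size lines in memory. Returns -1 if allocation fails. */
long count_variants(FILE *in, int batch_size) {
    assert(in != NULL && batch_size > 0);
    char (*lines)[MAX_LINE] = malloc((size_t)batch_size * sizeof *lines);
    if (lines == NULL) return -1;

    long total = 0;
    int n;
    while ((n = read_batch(in, lines, batch_size)) > 0) {
        long in_batch = 0;
        /* Lines of a batch are independent, so they can be handled in parallel */
        #pragma omp parallel for reduction(+:in_batch)
        for (int i = 0; i < n; i++) {
            if (lines[i][0] != '#' && lines[i][0] != '\n') in_batch++;
        }
        total += in_batch;
    }
    free(lines);
    return total;
}
```

Memory use is bounded by `batch_size * MAX_LINE` bytes regardless of the input size, which is the property the slide describes as configurable.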


HPG Variant: GWAS Analysis (I)
  • Genomic and family-based statistical tests
  • Parallel implementations where possible
  • Precursor: Plink
      • Serial implementation in C++
      • Can't manage ultra-large files, memory problem
  • PED & MAP formats
      • Used in Plink
      • Information about family relationships, genotypes and locus
      • Superseded by the PED & VCF combination


HPG Variant: GWAS Analysis (II)
  • Transmission disequilibrium test (TDT)
      • Family-based test
      • Parallelization scheme similar to effect tool
      • hpg-variant gwas --vcf-file input.vcf --ped-file input.ped
  • TDT for 147 individuals, ~500K variants:
      • 350 MB PED file, 300 MB VCF file
      • 3 main threads and 4 “forked” threads
  • On an Intel Xeon E31425, 8 GB RAM workstation:
      • Plink: 1:40 min
      • HPG Variant: 35 s, 3x speed-up, 1 GB RAM used
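For reference, the classic TDT statistic compares how often heterozygous parents transmit one allele versus the other to affected offspring: chi-square = (b - c)^2 / (b + c), with one degree of freedom, where b and c are the two transmission counts. The sketch below computes it for one variant and for an array of variants. The function names are hypothetical, returning 0.0 when there are no informative transmissions is an assumption of this sketch, and the per-variant parallel loop only illustrates why the test parallelizes well; it is not the HPG Variant implementation.

```c
#include <assert.h>

/* TDT chi-square for one variant.
 * transmitted:   times heterozygous parents passed on the tested allele
 * untransmitted: times they passed on the other allele
 * Returns 0.0 when there are no informative transmissions (assumption). */
double tdt_chi_square(int transmitted, int untransmitted) {
    assert(transmitted >= 0 && untransmitted >= 0);
    int informative = transmitted + untransmitted;
    if (informative == 0) return 0.0;
    double diff = (double)transmitted - (double)untransmitted;
    return diff * diff / (double)informative;
}

/* Each variant is tested independently, so the loop over variants
 * can be distributed among threads. */
void tdt_all(const int *transmitted, const int *untransmitted,
             double *chi_square, int num_variants) {
    assert(num_variants >= 0);
    #pragma omp parallel for
    for (int i = 0; i < num_variants; i++) {
        chi_square[i] = tdt_chi_square(transmitted[i], untransmitted[i]);
    }
}
```

For example, 30 transmissions against 10 non-transmissions give (30 - 10)^2 / 40 = 10.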


HPG Variant: GWAS Analysis (III)
  • Coming soon...
      • Association
      • Linear model
      • Logistic model
      • LOH
      • And more!


HPG Variant: Variant Calling
  • Process of searching for variants in aligned sequences
  • BAM input → VCF output
  • Current solutions:
      • GATK: Java
      • SAMtools mPileup: C
      • Sequential software → running for hours
  • Status: Porting GATK pipeline (see [2])


HPG Variant
  • Hands-on demo final stage!


Conclusions
  • NGS technology generates an increasing amount of data
  • Previous solutions are limited:
      • Sequential implementations
      • Files loaded as a whole into memory
  • More efficient applications developed:
      • Parallelism improves speed
      • Careful memory management makes it possible to process potentially infinite data


References
  1) A map of human genome variation from population-scale sequencing, The 1000 Genomes Project Consortium (available for downloading at http://www.1000genomes.org/sites/1000genomes.org/files/docs/nature09534.pdf)
  2) A framework for variation discovery and genotyping using next-generation DNA sequencing data, Mark A. DePristo et al. (Nature Genetics vol. 43, number 5, 2011)
