NGS data preprocessing & Quality Control (QC) - Bioinformatics and ...
NGS data preprocessing & Quality Control (QC)
Buenos Aires, October 2011
Javier Santoyo-Lopez
jsantoyo@cipf.es
http://bioinfo.cipf.es
Genomics Department
Centro de Investigacion Principe Felipe (CIPF)
(Valencia, Spain)
From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069
<strong>NGS</strong> platforms comparison10/18/11Source http://www.clcngs.com/2008/12/ngs-platforms-overview/

Next-gen sequencers
Adapted from John McPherson, OICR
[Figure: bases per machine run (1 Mb to 100 Gb) plotted against read length (10 bp to 1,000 bp)]
• AB SOLiDv3: 200 Gb, 50 bp reads
• Illumina HiSeq: 150 Gb, 100 bp reads
• 454 GS FLX Titanium: 0.4-1 Gb, 100-500 bp reads
• ABI capillary sequencer: 0.04-0.08 Mb, 450-800 bp reads

Many Gbs of Sequences and...
• Data management becomes a challenge.
– Moving data across file systems takes time (several hundred Gbs)
• What structure does the data have?
– Different sequencers output different files, but
– Some data formats are becoming widely accepted (e.g. FastQ format)
• Raw sequence data formats
– SFF
– Fasta, csfasta
– Qual file
– Fastq

Sequence to Variation Workflow
[Workflow diagram. Elements shown: Raw Data, FastQ, BWA/BWASW, SAM, SAMTools, BAM, Samtools mPileup, Pileup, BCFTools, Raw VCF, BCFTools, Filter VCF, GFF, IGV]

Sequence to Variation Workflow
[Same workflow diagram with a preprocessing step added: FastQ, FastX Toolkit, FastQ, then BWA/BWASW]

Why Quality Control and Preprocessing?
Sequencer output: Reads + quality
• Is the quality of my sequenced data OK?
• If something is wrong can I fix it?
Problem: HUGE files... How do they look?

    @HWUSIEAS460:2:1:368:1089#0/1
    TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG
    +HWUSIEAS460:2:1:368:1089#0/1
    aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB
    @HWUSIEAS460:2:1:368:528#0/1
    CTATTATAATATGACCGACCAGCTAGATCTACAGTC
    +HWUSIEAS460:2:1:368:528#0/1
    abbbbaaaabba^aa`Y``aa`aaa``a`a_\_`[_

Files are flat files and are big... tens of Gbs (please... don't use MS Word to see or edit them)
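
The four-line record layout shown above can be read with a few lines of Python. This is only a minimal sketch: the function names `read_fastq` and `phred_scores` are illustrative, and the quality offset of 64 is an assumption inferred from the characters in the example (lowercase letters and a run of `B`); many newer files use an offset of 33 instead.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        handle.readline()       # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record near {header!r}")
        yield header[1:].rstrip("\n"), seq, qual


def phred_scores(qual: str, offset: int = 64) -> List[int]:
    """Decode a quality string; offset=64 is an assumption about this data set."""
    return [ord(c) - offset for c in qual]


# First record from the slide, streamed from memory instead of a large file.
EXAMPLE = (
    "@HWUSIEAS460:2:1:368:1089#0/1\n"
    "TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG\n"
    "+HWUSIEAS460:2:1:368:1089#0/1\n"
    "aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
name, seq, qual = records[0]
scores = phred_scores(qual)
print(name, len(seq), min(scores), max(scores))
```

Because the reader is a generator, it processes one record at a time and never loads a multi-gigabyte file into memory.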

Sequence Quality Per base Position
Good data
• Consistent
• High quality along the read
* The central red line is the median value
* The yellow box represents the inter-quartile range (25-75%)
* The upper and lower whiskers represent the 10% and 90% points
* The blue line represents the mean quality

Sequence Quality Per base Position
Bad data
• High variance
• Quality decrease with length
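
The values drawn in a per-base quality plot (mean, median, quartiles, 10% and 90% points) can be computed per read position as sketched below. This is an illustration, not FastQC's own code: `per_position_quality` and `percentile` are hypothetical helpers, the percentile uses a simple rounded-index rule, and the offset of 64 is an assumption about the encoding of the example reads.

```python
from statistics import mean, median


def percentile(sorted_vals, p):
    """Pick the value at the rounded index p% of the way through a sorted list."""
    k = round(p / 100 * (len(sorted_vals) - 1))
    return sorted_vals[k]


def per_position_quality(quals, offset=64):
    """Summarise quality scores column by column (one row per read position)."""
    n = max(len(q) for q in quals)
    rows = []
    for pos in range(n):
        # Reads shorter than this position simply do not contribute.
        col = sorted(ord(q[pos]) - offset for q in quals if len(q) > pos)
        rows.append({
            "pos": pos + 1,
            "mean": mean(col),
            "median": median(col),
            "q25": percentile(col, 25),
            "q75": percentile(col, 75),
            "p10": percentile(col, 10),
            "p90": percentile(col, 90),
        })
    return rows


# The two quality strings from the FASTQ example in this deck.
quals = [
    "aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB",
    r"abbbbaaaabba^aa`Y``aa`aaa``a`a_\_`[_",
]
rows = per_position_quality(quals)
for row in (rows[0], rows[-1]):
    print(row)
```

A falling mean and a widening gap between the 10% and 90% points towards the end of the read is the "bad data" pattern described above.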

Per Sequence Quality Distribution
Good data
• Most are high-quality sequences

Per Sequence Quality Distribution
Bad data
• Not uniform distribution
• Low Quality Reads
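
A per-sequence quality distribution is a histogram of the mean quality of each read. The sketch below assumes quality strings are already in memory and uses an offset of 64 (an assumption); `mean_quality_histogram` is an illustrative name and the input strings are made up for the example.

```python
from collections import Counter


def mean_quality_histogram(quals, offset=64):
    """Count reads by their mean quality, rounded to the nearest integer."""
    hist = Counter()
    for q in quals:
        if q:  # skip empty reads
            hist[round(sum(ord(c) - offset for c in q) / len(q))] += 1
    return dict(sorted(hist.items()))


# Made-up quality strings: two good reads, one very poor read, one in between.
hist = mean_quality_histogram(["aaaa", "aaaa", "BBBB", "a_a_"])
print(hist)
```

A second peak at low mean quality, as with the `BBBB` read here, corresponds to the "low quality reads" case on the slide.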

Nucleotide Content per position
Good data
• Smooth over length
• Organism dependent (GC)

Nucleotide Content per position
Bad data
• Sequence position bias
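
Nucleotide content per position is the percentage of A, C, G, T and N observed at each read position. A minimal sketch, with the hypothetical function name `base_content_per_position` and made-up input reads:

```python
from collections import Counter


def base_content_per_position(seqs):
    """Return one dict per read position with the percentage of each base."""
    n = max(len(s) for s in seqs)
    table = []
    for pos in range(n):
        counts = Counter(s[pos].upper() for s in seqs if len(s) > pos)
        total = sum(counts.values())
        table.append({b: 100.0 * counts.get(b, 0) / total for b in "ACGTN"})
    return table


# Made-up reads for illustration.
table = base_content_per_position(["ACGT", "ACGA", "TCGN"])
for pos, row in enumerate(table, start=1):
    print(pos, {b: round(v, 1) for b, v in row.items()})
```

The same table also gives per-base GC (the G and C columns added together) and per-base N content, so strong changes along the columns point to a position bias.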

GC Distribution
Good data
• Fits with the expected
• Organism dependent

Per sequence GC Distribution
Bad data
• It does not fit with expected
• Organism dependent
• Library contamination?
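
The per-sequence GC distribution can be sketched as a histogram of GC percentage per read. The names `gc_percent` and `gc_histogram`, the bin width of 10 and the input reads are illustrative choices; comparing the result with the expected GC content of the organism is left to the reader.

```python
from collections import Counter


def gc_percent(seq):
    """GC percentage over called bases (N and other symbols are ignored)."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    return 100.0 * (seq.count("G") + seq.count("C")) / called if called else 0.0


def gc_histogram(seqs, bin_width=10):
    """Count reads per GC bin; a read with 100% GC goes into the last bin."""
    hist = Counter()
    for s in seqs:
        b = min(int(gc_percent(s) // bin_width) * bin_width, 100 - bin_width)
        hist[b] += 1
    return dict(sorted(hist.items()))


# Made-up reads for illustration.
gc_hist = gc_histogram(["GGCC", "ATGC", "ATAT"])
print(gc_hist)
```

A second peak away from the expected GC content is the kind of signal that suggests library contamination.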

Per base GC Distribution
Good data
• No variation across read sequence

Per base GC Distribution
Bad data
• Variation across read sequence

Per base N content
It's not good if there is an N bias at any base position

Duplicated Sequences
I don't expect a high number of duplicated sequences: PCR artifact?
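
Duplication can be estimated by counting exact copies of each read. The sketch below is a simplification that keeps every read in memory; `duplication_summary` is an illustrative name and the reads are made up.

```python
from collections import Counter


def duplication_summary(seqs):
    """Summarise exact-duplicate reads: totals plus a duplication-level table."""
    counts = Counter(seqs)
    total = len(seqs)
    distinct = len(counts)
    return {
        "total": total,
        "distinct": distinct,
        "percent_duplicated": 100.0 * (total - distinct) / total if total else 0.0,
        # levels[n] = number of distinct sequences seen exactly n times
        "levels": dict(sorted(Counter(counts.values()).items())),
    }


# Made-up reads for illustration.
summary = duplication_summary(["AAA", "AAA", "AAA", "CCC", "GGG", "GGG"])
print(summary)
```

On real data sets, tools usually sample a subset of reads for this count, since storing every distinct read can be expensive.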

Length Distribution
This is descriptive. Some sequencers output sequences of different lengths (e.g. 454)

Overrepresented Sequences
Question: If you obtain the exact same sequence too many times, do you have a problem?
Answer: Sometimes!
Examples: PCR primers (Illumina)
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
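
One way to find overrepresented sequences is to count identical reads, keep those above a chosen fraction of the total, and compare them with known contaminants. In this sketch the function name `overrepresented`, the threshold values and the reads are illustrative; the contaminant is the 14-base prefix of the primer string shown above, used only as an example entry.

```python
from collections import Counter


def overrepresented(seqs, min_fraction=0.001, contaminants=None):
    """List sequences making up at least min_fraction of all reads.

    contaminants maps a label to a known sequence; a hit is reported when one
    contains the other. Both arguments are illustrative choices.
    """
    contaminants = contaminants or {}
    total = len(seqs)
    hits = []
    for seq, n in Counter(seqs).most_common():
        if n / total < min_fraction:
            break  # most_common() is sorted, so nothing further can qualify
        source = next(
            (name for name, c in contaminants.items() if c in seq or seq in c),
            "no hit",
        )
        hits.append((seq, n, 100.0 * n / total, source))
    return hits


# Made-up reads: five copies of a read carrying the primer prefix, two others.
reads = ["ACGTGATCGGAAGAGCGG"] * 5 + ["ACGTACGTTTGACCAT", "GGCATTACGATTAGCA"]
known = {"primer prefix from the slide": "GATCGGAAGAGCGG"}
hits = overrepresented(reads, min_fraction=0.3, contaminants=known)
for hit in hits:
    print(hit)
```

A sequence with "no hit" is not necessarily a problem; highly expressed transcripts in RNA-seq data, for instance, can be overrepresented for biological reasons.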

K-mer Content
Helps to detect problems
Adapters?
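
A rough way to look at k-mer content is to count every k-mer together with the read position where it starts, then flag k-mers that pile up at one position. The scoring below (peak count divided by the mean count per position) is a simplified stand-in chosen for illustration, not the statistic of any particular tool; `kmer_counts`, `positional_bias`, k=5 and the reads are assumptions.

```python
from collections import Counter, defaultdict


def kmer_counts(seqs, k=5):
    """Count k-mers overall and per start position (k-mers with N are skipped)."""
    total = Counter()
    by_pos = defaultdict(Counter)
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if "N" in kmer:
                continue
            total[kmer] += 1
            by_pos[kmer][i] += 1
    return total, by_pos


def positional_bias(total, by_pos, n_positions):
    """Map each k-mer to (peak count / mean count per position, 1-based peak position)."""
    out = {}
    for kmer, n in total.items():
        peak_pos, peak = by_pos[kmer].most_common(1)[0]
        out[kmer] = (peak / (n / n_positions), peak_pos + 1)
    return out


# Made-up 10-base reads that all share the same 3' tail.
reads = ["ACGTAGATCG", "TTGCAGATCG", "CCATAGATCG"]
k = 5
total, by_pos = kmer_counts(reads, k)
bias = positional_bias(total, by_pos, n_positions=len(reads[0]) - k + 1)
print(total["AGATC"], bias["AGATC"])
```

K-mers that occur almost only at the same position near the read end are typical of adapter sequence.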

Practical: FastQC and Fastx-toolkit
Use FastQC to see your starting state.
Use Fastx-toolkit to optimize different datasets and then visualize the result with FastQC to prove your success!
Hints: Try trimming, clipping and quality filtering.
Go to the tutorial and try the exercises...
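
To make the three hinted operations concrete, here is a small Python sketch of what clipping, trimming and quality filtering do to a single read. It illustrates the ideas only and is not how Fastx-toolkit is implemented: the function names, the thresholds (quality 20, 80% of bases, minimum overlap 5) and the offset of 64 are assumptions, and the adapter is the first 20 bases of the primer string shown in the overrepresented-sequences slide.

```python
def clip_adapter(seq, qual, adapter, min_overlap=5):
    """Remove the adapter and everything after it; also catch a partial adapter at the 3' end."""
    idx = seq.find(adapter)
    if idx == -1:
        # Try progressively shorter adapter prefixes against the end of the read.
        for n in range(min(len(adapter), len(seq)) - 1, min_overlap - 1, -1):
            if seq.endswith(adapter[:n]):
                idx = len(seq) - n
                break
    return (seq[:idx], qual[:idx]) if idx != -1 else (seq, qual)


def quality_trim(seq, qual, min_q=20, offset=64):
    """Drop bases from the 3' end until one with quality >= min_q is reached."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def passes_filter(qual, min_q=20, min_percent=80, offset=64):
    """Keep a read only if at least min_percent of its bases have quality >= min_q."""
    if not qual:
        return False
    good = sum(1 for c in qual if ord(c) - offset >= min_q)
    return 100.0 * good / len(qual) >= min_percent


# First read of the FASTQ example in this deck.
seq = "TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG"
qual = "aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB"
adapter = "GATCGGAAGAGCGGTTCAGC"  # prefix of the primer string on the slide

clipped_seq, clipped_qual = clip_adapter(seq, qual, adapter)
trimmed_seq, trimmed_qual = quality_trim(clipped_seq, clipped_qual)
print(trimmed_seq, len(trimmed_seq))
print(passes_filter(qual), passes_filter(trimmed_qual))
```

In this example the raw read fails the quality filter, while the clipped read passes, which is the kind of before/after difference the FastQC reports should show.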