NGS data preprocessing & Quality Control (QC) - Bioinformatics and ...


NGS data preprocessing & Quality Control (QC)
Buenos Aires, October 2011
Javier Santoyo-Lopez
jsantoyo@cipf.es
http://bioinfo.cipf.es
Genomics Department
Centro de Investigacion Principe Felipe (CIPF) (Valencia, Spain)



From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069


NGS platforms comparison
10/18/11
Source: http://www.clcngs.com/2008/12/ngs-platforms-overview/


Next-gen sequencers
Adapted from John McPherson, OICR
[Figure: bases per machine run (1 Mb to 100 Gb) against read length (10 bp to 1,000 bp)]
• AB SOLiD v3: 200 Gb, 50 bp reads
• Illumina HiSeq: 150 Gb, 100 bp reads
• 454 GS FLX Titanium: 0.4-1 Gb, 100-500 bp reads
• ABI capillary sequencer: 0.04-0.08 Mb, 450-800 bp reads


Many Gbs of Sequences and...
• Data management becomes a challenge.
  – Moving data across file systems takes time (several hundred Gbs)
• What structure does the data have?
  – Different sequencers output different files, but
  – Some data formats are becoming widely accepted (e.g. FastQ format)
• Raw sequence data formats
  – SFF
  – Fasta, csfasta
  – Qual file
  – Fastq


Sequence to Variation Workflow
[Workflow diagram. Elements: Raw Data, FastQ, BWA/BWASW, SAM, SAMTools, BAM, Samtools mPileup, Pileup, BCFTools, Raw VCF, Filter VCF, GFF, IGV]


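Sequence to Variation Workflow
[Workflow diagram, now with a FastX Toolkit step between the raw FastQ and the alignment. Elements: Raw Data, FastQ, FastX Toolkit, FastQ, BWA/BWASW, SAM, SAMTools, BAM, Samtools mPileup, Pileup, BCFTools, Raw VCF, Filter VCF, GFF, IGV]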



Why Quality Control and Preprocessing?
Sequencer output: Reads + quality
• Is the quality of my sequenced data OK?
• If something is wrong can I fix it?
Problem: HUGE files... How do they look?

@HWUSI-EAS460:2:1:368:1089#0/1
TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG
+HWUSI-EAS460:2:1:368:1089#0/1
aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB
@HWUSI-EAS460:2:1:368:528#0/1
CTATTATAATATGACCGACCAGCTAGATCTACAGTC
+HWUSI-EAS460:2:1:368:528#0/1
abbbbaaaabba^aa`Y``aa`aaa``a`a_\_`[_

Files are flat files and are big... tens of Gbs (please... don't use MS Word to see or edit them)
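The four-line layout of the records above (header, bases, "+" line, quality string) can be read with a few lines of Python. This is a minimal sketch, not part of the original slides: `parse_fastq` and `phred_scores` are illustrative names, and the quality offset of 64 is an assumption based on the lowercase quality characters in the example, which suggest an older Illumina (Phred+64) encoding. Many other FastQ files use an offset of 33.

```python
import io

def parse_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FastQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:          # end of file (or a blank line) stops the loop
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()       # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record: {header!r}")
        yield header[1:], seq, qual

def phred_scores(qual, offset=64):
    """Convert a quality string to integer scores (offset 64 is an assumption)."""
    return [ord(c) - offset for c in qual]

# First record from the slide, used as an in-memory file.
EXAMPLE = (
    "@HWUSI-EAS460:2:1:368:1089#0/1\n"
    "TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG\n"
    "+HWUSI-EAS460:2:1:368:1089#0/1\n"
    "aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB\n"
)

records = list(parse_fastq(io.StringIO(EXAMPLE)))
name, seq, qual = records[0]
scores = phred_scores(qual)
print(name, len(seq), scores[:3], scores[-3:])
```

Reading record by record keeps memory use flat, which matters for files of tens of Gbs.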


Sequence Quality Per Base Position
Good data
• Consistent
• High quality along the read
* The central red line is the median value
* The yellow box represents the inter-quartile range (25-75%)
* The upper and lower whiskers represent the 10% and 90% points
* The blue line represents the mean quality
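The numbers behind such a box plot can be sketched as follows: for each read position, collect the scores of all reads and compute the median, the quartiles, the 10% and 90% points and the mean. This is an illustration only, not FastQC's own code. `per_position_stats` and the toy scores are made up for the example, and the sketch assumes at least two reads cover every position.

```python
import statistics

def per_position_stats(quality_lists):
    """Summarise Phred scores column by column (one column per read position)."""
    max_len = max(len(q) for q in quality_lists)
    stats = []
    for pos in range(max_len):
        # Scores of every read that is long enough to reach this position.
        column = [q[pos] for q in quality_lists if len(q) > pos]
        quartiles = statistics.quantiles(column, n=4, method="inclusive")
        deciles = statistics.quantiles(column, n=10, method="inclusive")
        stats.append({
            "position": pos + 1,
            "mean": statistics.mean(column),      # blue line
            "median": statistics.median(column),  # red line
            "q25": quartiles[0],                  # yellow box, lower edge
            "q75": quartiles[2],                  # yellow box, upper edge
            "p10": deciles[0],                    # lower whisker
            "p90": deciles[8],                    # upper whisker
        })
    return stats

# Toy data: five reads of four bases, quality dropping towards the 3' end.
reads = [
    [40, 40, 38, 30],
    [40, 39, 35, 20],
    [40, 38, 30, 10],
    [39, 38, 33, 25],
    [40, 40, 36, 15],
]
for row in per_position_stats(reads):
    print(row)
```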


Sequence Quality Per Base Position
Bad data
• High variance
• Quality decreases with length


Per Sequence Quality Distribution
Good data
• Most are high-quality sequences


Per Sequence Quality Distribution
Bad data
• Not uniform distribution
• Low Quality Reads
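A per-sequence quality distribution is a histogram of the mean quality of each read. A rough sketch, with illustrative names (`mean_quality_histogram`, `fraction_below`), an assumed Phred+64 offset and an arbitrary threshold of 20:

```python
from collections import Counter

def mean_quality_histogram(quality_strings, offset=64):
    """Count reads by their rounded mean quality score."""
    histogram = Counter()
    for qual in quality_strings:
        scores = [ord(c) - offset for c in qual]
        histogram[round(sum(scores) / len(scores))] += 1
    return dict(sorted(histogram.items()))

def fraction_below(histogram, threshold):
    """Fraction of reads whose mean quality is below the threshold."""
    total = sum(histogram.values())
    low = sum(n for q, n in histogram.items() if q < threshold)
    return low / total

# Toy quality strings: two good reads and two poor ones.
toy = ["aaaa", "aaab", "BBBB", "aBBB"]
hist = mean_quality_histogram(toy)
print(hist, fraction_below(hist, 20))
```

A large low-quality fraction, or a second peak at low scores, corresponds to the "bad data" picture.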


Nucleotide Content per position
Good data
• Smooth over length
• Organism dependent (GC)


Nucleotide Content per position
Bad data
• Sequence position bias
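Position bias can be checked by tabulating the fraction of each base at every read position; in good data the curves stay roughly flat along the read. A minimal sketch (`base_composition` is an illustrative name, and the reads are toy data):

```python
from collections import Counter

def base_composition(sequences):
    """Return, for each position, the fraction of A, C, G, T and N."""
    max_len = max(len(s) for s in sequences)
    counts = [Counter() for _ in range(max_len)]
    for seq in sequences:
        for pos, base in enumerate(seq.upper()):
            counts[pos][base] += 1
    table = []
    for column in counts:
        total = sum(column.values())
        table.append({b: column[b] / total for b in "ACGTN"})
    return table

# Toy reads: the first position is always A, a strong position bias.
toy_reads = ["ACGT", "ACGA", "ACTT", "AGGT"]
for pos, fractions in enumerate(base_composition(toy_reads), start=1):
    print(pos, fractions)
```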


GC Distribution
Good data
• Fits with the expected
• Organism dependent


Per sequence GC Distribution
Bad data
• It does not fit with expected
• Organism dependent
• Library contamination?
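The per-sequence GC distribution is the histogram of GC percentages over all reads, to be compared with what is expected for the organism. A sketch under simple assumptions: `gc_percent` and `gc_histogram` are illustrative names, N bases are left out of the denominator, and the 5% bin width is arbitrary.

```python
from collections import Counter

def gc_percent(seq):
    """GC percentage of one read, ignoring N; None if no base was called."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    if called == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / called

def gc_histogram(sequences, bin_width=5):
    """Count reads per GC bin (bins are labelled by their lower edge)."""
    hist = Counter()
    for seq in sequences:
        gc = gc_percent(seq)
        if gc is not None:
            hist[int(gc // bin_width) * bin_width] += 1
    return dict(sorted(hist.items()))

toy_reads = ["ATGC", "GGCC", "AATT", "ATGC"]
print(gc_histogram(toy_reads))
```

A second peak away from the expected GC value is one possible sign of library contamination.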


Per base GC Distribution
Good data
• No variation across read sequence


Per base GC Distribution
Bad data
• Variation across read sequence


Per base N content
It's not good if there is N bias per base position
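N bias can be spotted by computing the fraction of uncalled bases at each read position. A minimal sketch with an illustrative function name and toy reads:

```python
def n_fraction_per_position(sequences):
    """Fraction of reads with an N at each position."""
    max_len = max(len(s) for s in sequences)
    n_counts = [0] * max_len
    totals = [0] * max_len
    for seq in sequences:
        for pos, base in enumerate(seq.upper()):
            totals[pos] += 1
            if base == "N":
                n_counts[pos] += 1
    return [n / t for n, t in zip(n_counts, totals)]

# Toy reads: N calls pile up towards the end of the read.
toy_reads = ["ACGN", "ACNN", "ACGT", "ANGT"]
print(n_fraction_per_position(toy_reads))
```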


Duplicated Sequences
I don't expect a high number of duplicated sequences: PCR artifact?
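Duplication can be estimated by counting exact copies of each read. The sketch below is a simplified illustration (exact-match counting over all reads, with made-up function names), not the procedure of any particular tool:

```python
from collections import Counter

def duplication_levels(sequences):
    """Map 'number of copies' to 'how many distinct sequences have that many'."""
    copies = Counter(sequences)
    levels = Counter(copies.values())
    return dict(sorted(levels.items()))

def duplicated_fraction(sequences):
    """Fraction of reads that are extra copies of an already seen sequence."""
    copies = Counter(sequences)
    extra = sum(n - 1 for n in copies.values())
    return extra / len(sequences)

toy_reads = ["AAA", "AAA", "AAA", "CCC", "GGG", "GGG"]
print(duplication_levels(toy_reads), duplicated_fraction(toy_reads))
```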


Distribution Length
This is descriptive.
Some sequencers output sequences of different length (e.g. 454)


Overrepresented Sequences
Question: If you obtain the exact same sequence too many times, do you have a problem?
Answer: Sometimes!
Examples: PCR primers (Illumina)
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
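A simple way to list overrepresented sequences is to count every distinct read and report those above a chosen fraction of the total, then compare the hits with known primer or adapter sequences. In this sketch the function names, the threshold and the `known` dictionary are assumptions for illustration; the single entry is just the first 20 bases of the example string above.

```python
from collections import Counter

def overrepresented(sequences, min_fraction=0.001):
    """Return (sequence, count, fraction) for reads above min_fraction of the total."""
    counts = Counter(sequences)
    total = len(sequences)
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]

def matching_contaminants(seq, known):
    """Names of known sequences that contain, or are contained in, seq."""
    return [name for name, k in known.items() if k in seq or seq in k]

# Hypothetical contaminant list for the example only.
known = {"slide_example_prefix": "GATCGGAAGAGCGGTTCAGC"}

toy_reads = ["ACGT"] * 3 + ["GATCGGAAGAGCGGTTCAGC"] * 5 + ["TTTT", "CCCC"]
hits = overrepresented(toy_reads, min_fraction=0.25)
for seq, n, frac in hits:
    print(seq, n, frac, matching_contaminants(seq, known))
```

A hit that matches a primer or adapter points to a library problem; a hit with no match may be genuine biology, which is why the answer is "sometimes".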


K-mer Content
Helps to detect problems
Adapters?
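K-mer content can be sketched by counting every k-mer together with the read position where it starts; a k-mer that is frequent and piles up at one position is a candidate adapter fragment. The function names and the choice of k = 5 are assumptions for this illustration.

```python
from collections import Counter, defaultdict

def kmer_position_counts(sequences, k=5):
    """Map each k-mer to a Counter of the start positions where it occurs."""
    table = defaultdict(Counter)
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            table[seq[start:start + k]][start] += 1
    return table

def top_kmers(table, top=3):
    """Most frequent k-mers as (kmer, total_count, most_common_start_position)."""
    ranked = sorted(table.items(), key=lambda kv: sum(kv[1].values()), reverse=True)
    return [
        (kmer, sum(positions.values()), positions.most_common(1)[0][0])
        for kmer, positions in ranked[:top]
    ]

# Toy reads that all end with the same five bases.
toy_reads = ["ACGTGATCG", "TTTTGATCG", "CCAAGATCG"]
table = kmer_position_counts(toy_reads, k=5)
print(top_kmers(table, top=2))
```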


Practical: FastQC and Fastx-toolkit
Use FastQC to see your starting state.
Use Fastx-toolkit to optimize different datasets and then visualize the result with FastQC to prove your success!
Hints: Try trimming, clipping and quality filtering.
Go to the tutorial and try the exercises...
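To make the three hints concrete, here is a rough Python sketch of what trimming, clipping and quality filtering do to a single read. It is not the Fastx-toolkit code: the function names, the thresholds (quality 20, 80% of bases) and the Phred+64 offset are assumptions, and the adapter is simply the start of the primer example shown earlier. The read is the first FastQ record from the slides.

```python
def quality_trim(seq, scores, threshold=20):
    """Trimming: remove bases below the threshold from the 3' end."""
    end = len(seq)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return seq[:end], scores[:end]

def clip_adapter(seq, scores, adapter):
    """Clipping: cut the read at the first exact occurrence of the adapter."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, scores
    return seq[:idx], scores[:idx]

def passes_filter(scores, min_quality=20, min_percent=80):
    """Filtering: keep a read if enough of its bases reach min_quality."""
    if not scores:
        return False
    good = sum(1 for s in scores if s >= min_quality)
    return 100.0 * good / len(scores) >= min_percent

seq = "TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG"
qual = "aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB"
scores = [ord(c) - 64 for c in qual]        # assumed Phred+64 encoding

trimmed_seq, trimmed_scores = quality_trim(seq, scores)
clipped_seq, clipped_scores = clip_adapter(seq, scores, "GATCGGAAGAGCGG")
print(len(seq), len(trimmed_seq), len(clipped_seq), passes_filter(clipped_scores))
```

After each step, running FastQC again on the processed file shows whether the plots have improved.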
