NGS data preprocessing & Quality Control (QC) - Bioinformatics and ...


From bioinfo.cipf.es

11.07.2015

NGS data preprocessing & Quality Control (QC)
Buenos Aires, October 2011
Javier Santoyo-Lopez
jsantoyo@cipf.es
http://bioinfo.cipf.es
Genomics Department
Centro de Investigacion Principe Felipe (CIPF) (Valencia, Spain)


From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069

NGS platforms comparison
10/18/11
Source: http://www.clcngs.com/2008/12/ngs-platforms-overview/

Next-gen sequencers
Adapted from John McPherson, OICR
[Figure: bases per machine run (1 Mb to 100 Gb) plotted against read length (10 bp to 1,000 bp)]
• AB SOLiD v3: 200 Gb, 50 bp reads
• Illumina HiSeq: 150 Gb, 100 bp reads
• 454 GS FLX Titanium: 0.4-1 Gb, 100-500 bp reads
• ABI capillary sequencer: 0.04-0.08 Mb, 450-800 bp reads

Many Gbs of Sequences and...
• Data management becomes a challenge.
  – Moving data across file systems takes time (several hundred Gbs)
• What structure does the data have?
  – Different sequencers output different files, but
  – Some data formats are becoming widely accepted (e.g. FastQ format)
• Raw sequence data formats
  – SFF
  – Fasta, csfasta
  – Qual file
  – Fastq

Sequence to Variation Workflow
[Workflow diagram. Nodes: Raw Data, FastQ, BWA/BWASW, SAM, SAMTools, BAM, Samtools mPileup, Pileup, BCFTools, Raw VCF, Filter, VCF, GFF, IGV]

Sequence to Variation Workflow
[Workflow diagram, now with a preprocessing step. Nodes: Raw Data, FastQ, FastX Toolkit, FastQ, BWA/BWASW, SAM, SAMTools, BAM, Samtools mPileup, Pileup, BCFTools, Raw VCF, Filter, VCF, GFF, IGV]


Why Quality Control and Preprocessing?
Sequencer output: Reads + quality
Is the quality of my sequenced data OK?
If something is wrong can I fix it?
Problem: HUGE files... How do they look?

@HWUSIEAS460:2:1:368:1089#0/1
TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG
+HWUSIEAS460:2:1:368:1089#0/1
aa[a_a_aââ]VZ]R^P[]YNSUTZBBBBBBBBB
@HWUSIEAS460:2:1:368:528#0/1
CTATTATAATATGACCGACCAGCTAGATCTACAGTC
+HWUSIEAS460:2:1:368:528#0/1
abbbbaaaabbaâa`Y`àaàaa`àà_\_`[_

Files are flat files and are big... tens of Gbs (please... don't use MS Word to see or edit them)
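Because the files are too large to open in an editor, one option is to stream them record by record. The sketch below is a minimal reader that assumes strict four-line FastQ records (no wrapped sequence lines). The function names `read_fastq` and `phred_scores` are illustrative, not part of any tool named in these slides. The lowercase quality characters in the records above suggest an older Illumina encoding with an ASCII offset of 64, but that is an assumption, so the offset is left as a parameter.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) one record at a time.

    Assumes each record is exactly four lines: @name, sequence, +, quality.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def phred_scores(qual: str, offset: int = 64) -> List[int]:
    """Convert a quality string to integer scores (offset is an assumption)."""
    return [ord(c) - offset for c in qual]


# Tiny in-memory example; a real file would be opened with open(path).
example = io.StringIO(
    "@read1\nACGT\n+\nhhBB\n"
    "@read2\nGGCC\n+read2\nabab\n"
)
records = list(read_fastq(example))
```

Only one record is held in memory at a time, so the same loop works on files of tens of Gbs.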

Sequence Quality Per Base Position
Good data: consistent, high quality along the read
* The central red line is the median value
* The yellow box represents the inter-quartile range (25-75%)
* The upper and lower whiskers represent the 10% and 90% points
* The blue line represents the mean quality
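The values drawn in this plot can be computed directly from the quality strings. The following is a rough sketch, not FastQC's own code: `per_base_quality` and the nearest-rank `percentile` helper are hypothetical names, and the default ASCII offset of 33 is an assumption that must match the encoding of the file.

```python
import math
from statistics import mean, median


def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already sorted list (p in 0..100)."""
    idx = max(0, math.ceil(p / 100 * len(sorted_vals)) - 1)
    return sorted_vals[idx]


def per_base_quality(quals, offset=33):
    """Summarise quality scores column by column (one column per read position)."""
    columns = {}
    for q in quals:
        for pos, ch in enumerate(q):
            columns.setdefault(pos, []).append(ord(ch) - offset)

    summary = []
    for pos in sorted(columns):
        vals = sorted(columns[pos])
        summary.append({
            "position": pos + 1,            # 1-based, as in the plot
            "median": median(vals),         # central red line
            "q1": percentile(vals, 25),     # lower edge of the box
            "q3": percentile(vals, 75),     # upper edge of the box
            "p10": percentile(vals, 10),    # lower whisker
            "p90": percentile(vals, 90),    # upper whisker
            "mean": mean(vals),             # blue line
        })
    return summary


summary = per_base_quality(["II5", "I5", "II"])
```

Reads of different lengths are handled by letting later positions contain fewer values.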

Sequence Quality Per Base Position
Bad data: high variance, quality decreases with length

Per Sequence Quality Distribution
Good data: most are high-quality sequences

Per Sequence Quality Distribution
Bad data: non-uniform distribution, low-quality reads
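A per-sequence quality distribution is a histogram of the mean quality of each read. A minimal sketch follows, where the function names, the offset of 33 and the threshold of 25 are all illustrative assumptions rather than values taken from FastQC.

```python
from collections import Counter


def mean_quality_histogram(quals, offset=33):
    """Count reads by their mean quality, truncated to an integer."""
    hist = Counter()
    for q in quals:
        if q:  # skip empty quality strings
            scores = [ord(c) - offset for c in q]
            hist[int(sum(scores) / len(scores))] += 1
    return hist


def low_quality_fraction(hist, threshold=25):
    """Fraction of reads whose mean quality falls below the threshold."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    low = sum(n for score, n in hist.items() if score < threshold)
    return low / total


hist = mean_quality_histogram(["IIII", "II55", "5555"])
```

A second peak at low scores, or a large low-quality fraction, corresponds to the "bad data" case on this slide.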

Nucleotide Content per Position
Good data: smooth over length, organism dependent (GC)

Nucleotide Content per Position
Bad data: sequence position bias
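Position bias shows up when the fraction of each base changes along the read. As an illustration only (`base_content_per_position` is a hypothetical helper, not a FastQC function), the fractions can be tallied like this:

```python
from collections import Counter


def base_content_per_position(seqs):
    """Return, for each read position, the fraction of A, C, G, T and N."""
    counts = []  # one Counter per position
    for s in seqs:
        for pos, base in enumerate(s.upper()):
            if pos == len(counts):
                counts.append(Counter())
            counts[pos][base] += 1
    return [
        {b: c[b] / sum(c.values()) for b in "ACGTN"}
        for c in counts
    ]


content = base_content_per_position(["ACGT", "ACGA", "TCGA", "AC"])
```

In good data each of the four curves stays roughly flat; a position where one base dominates is a sign of the bias described above.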

GC Distribution
Good data: fits the expected distribution, organism dependent

Per Sequence GC Distribution
Bad data: does not fit the expected, organism-dependent distribution. Library contamination?
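The per-sequence GC distribution is a histogram of the GC percentage of each read, which can then be compared with what is expected for the organism. A minimal sketch, with hypothetical function names and an arbitrary bin width of 5%:

```python
from collections import Counter


def gc_percent(seq):
    """GC percentage of one read, ignoring anything that is not A, C, G or T."""
    called = [b for b in seq.upper() if b in "ACGT"]
    if not called:
        return None  # no called bases, GC is undefined
    return 100.0 * sum(b in "GC" for b in called) / len(called)


def gc_histogram(seqs, bin_width=5):
    """Count reads per GC bin; keys are the lower edge of each bin."""
    hist = Counter()
    for s in seqs:
        gc = gc_percent(s)
        if gc is not None:
            hist[int(gc // bin_width) * bin_width] += 1
    return hist


hist = gc_histogram(["ATGC", "ATGC", "AAAA"])
```

An extra peak away from the expected GC content is one way the possible library contamination mentioned above could show up.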

Per Base GC Distribution
Good data: no variation across the read sequence

Per Base GC Distribution
Bad data: variation across the read sequence

Per Base N Content
It's not good if there is an N bias per base position

Duplicated Sequences
I don't expect a high number of duplicated sequences: PCR artifact?
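A crude way to put a number on duplication is the fraction of reads that are exact copies of a read already seen. This sketch is not how FastQC estimates duplication levels; `duplication_fraction` is a hypothetical helper that simply compares the number of distinct sequences with the total, and it keeps every distinct sequence in memory.

```python
from collections import Counter


def duplication_fraction(seqs):
    """Fraction of reads that duplicate an earlier, identical read."""
    counts = Counter(seqs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - len(counts) / total


fraction = duplication_fraction(["AAA", "AAA", "CCC", "GGG"])
```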

Length Distribution
This is descriptive. Some sequencers output sequences of different lengths (e.g. 454)

Overrepresented Sequences
Question: If you obtain the exact same sequence too many times, do you have a problem?
Answer: Sometimes!
Examples: PCR primers (Illumina)
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
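To find candidates for this check, identical reads can be counted and those above a chosen share of the library reported. The sketch below is only illustrative: `overrepresented` is a hypothetical name and the default cut-off of 0.1% is an assumed value that should be tuned to the dataset.

```python
from collections import Counter


def overrepresented(seqs, min_fraction=0.001):
    """List (sequence, count, fraction) for sequences above min_fraction."""
    counts = Counter(seqs)
    total = sum(counts.values())
    return [
        (s, n, n / total)
        for s, n in counts.most_common()  # most frequent first
        if n / total > min_fraction
    ]


reads = ["GATCGG"] * 3 + ["ACGTAC", "TTTTTT"]
hits = overrepresented(reads, min_fraction=0.5)
```

Whether a hit is a problem still needs a human decision, for example by comparing it with known primer or adapter sequences.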

K-mer Content
Helps to detect problems. Adapters?
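K-mer content boils down to counting every substring of length k across the reads and looking for ones that are unexpectedly frequent. A minimal sketch, assuming k=5 as an arbitrary default and skipping k-mers that contain N (`count_kmers` is a hypothetical helper):

```python
from collections import Counter


def count_kmers(seqs, k=5):
    """Count all overlapping k-mers in the reads, ignoring those with an N."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


counts = count_kmers(["ACGTACGT"], k=4)
top = counts.most_common(1)[0]
```

K-mers that come from an adapter tend to rise to the top of such a count, which is why this check helps to spot them.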

Practical: FastQC and Fastx-toolkit
Use FastQC to see your starting state.
Use Fastx-toolkit to optimize different datasets and then visualize the result with FastQC to prove your success!
Hints: Try trimming, clipping and quality filtering.
Go to the tutorial and try the exercises...
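To make the three hints concrete, here is a rough Python sketch of what trimming, clipping and quality filtering do to a single read. These are simplified stand-ins, not the Fastx-toolkit implementation: the function names, the exact-match adapter search, the thresholds and the ASCII offset of 64 are all assumptions made for illustration.

```python
def trim_tail(seq, qual, min_q=20, offset=64):
    """Trimming: cut low-quality bases from the 3' end of the read."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def clip_adapter(seq, qual, adapter):
    """Clipping: remove the adapter (exact match only) and everything after it."""
    i = seq.find(adapter)
    if i == -1:
        return seq, qual  # adapter not found, read unchanged
    return seq[:i], qual[:i]


def passes_filter(qual, min_q=20, min_fraction=0.8, offset=64):
    """Quality filtering: keep reads where enough bases reach min_q."""
    if not qual:
        return False
    good = sum(ord(c) - offset >= min_q for c in qual)
    return good / len(qual) >= min_fraction


trimmed = trim_tail("ACGTAC", "hhhhBB")
clipped = clip_adapter("ACGTGATC", "hhhhhhhh", "GATC")
```

After applying steps like these, the read set can be written back to FastQ and inspected again to see whether the quality plots improved.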
