1 slide per page - GenoToul bioinformatics
1 slide per page - GenoToul bioinformatics
1 slide per page - GenoToul bioinformatics
- No tags were found...
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
Aligning NGS reads and SNPfindingPhilippe Bardou & Christophe Klopp
OrganisationMorning (9h30 -12h30) :- Sequence quality– Theory + exercises- Read mapping– Theory + exercisesAfternoon (14h-17h) :- SAM format– Theory + exercises- Visualisation– Theory + exercises- SNP calling– Theory + exercises2
Where are we?SequencingDe NovoAssemblyAlignmentGenomeTranscriptomeSNP Genome CallingTranscriptomeChip SeqTranscriptomicGenomeTranscriptomeMethylation4
What are you going to learn?●To extract reads and reference genome from the NCBI●To verify the read quality●To format reference sequence●To align the reads on the reference genome●To index the reference sequence and the aligned reads●To visualise the alignments and variations●To call SNPs5
What you should already know?●How to connect to a remote unix server (putty)?●What a unix command looks like?●How to move around the unix environment?●How to edit a file?6
The pieces of software●Quality : fastqc●BWA : alignment●Samtools : formating SNP discovery●IGV : visualisation7
The 1000 genomes project●Joint project NCBI / EBI●Common data formats :– fastq– SAM (Sequence Alignment/Map)8
OverviewSRAENA[…]fastqfastqfastqFastQCBWA(aln / bwsw)SAMSAMSAMGFFpileupIGVsamtools(view/merge/sort/...)awksamtools.plvarfiltersamtoolspileupBAMBAMpileuppileup9
NGS platforms●Two platforms :– Illumina Solexa– Roche 45410
NGS reads2010HiSeq 2000200 Gb/run100 pb /lecturePaired reads11
Sequencing bias bibliography12
Sequencing bias●Platform related●Roche 454 (data from Jean-Marc Aury CNS)– 99,9% mapped reads– Mean error rate : 0,55%– 37% deletions, 53% insertions, 10% substitutions.– homopolymers errors– emPCR duplications●Solexa (data from Jean-Marc Aury CNS)– 98,5% mapped reads– Mean error rate : 0,38%– 3% deletions, 2% insertions, 95% substitutions– Low A/T rich coverage13
What data will we use?●The needed data :– A reference sequence :● Genome● Parts of the genome● Transcriptome– Short reads14
Where to get a reference genome?●Assemble your own●Use a public assembly :– NCBI : Genbank– EMBL15
Where to get short reads?●Produce your own sequences :– CNS– Local platform– Private company●Use public data :– SRA : NCBI Sequence Read Archive– ENA : EMBL/EBI European Nucleotide Archive16
NCBI SRA?17
EBI ENA18
Meta data●Meta data structure :– Ex<strong>per</strong>iment– Sample– Study– Run– Data file19
What is a fastq file20
fastq file formats21
Sequence quality●Phred : base callingWhat is Phred Quality?Traditionally, Phred quality is defined on base calls. Eachbase call is an estimate of the true nucleotide. It is arandom variable and can be wrong. The probability that abase call is wrong is called error probability.Explanation about the quality values :source http://maq.sourceforge.net/qual.shtml22
Which reads should I keep?●All●Some : what criteria and threshold should I use– Composition (number of Ns, complexity,...),– Quality,– Alignment based criteria,●Should I trim the reads using :– Composition– Quality23
Sequence quality analysis●FastQC :25
Sequence quality analysis●FastQC :26
NG6 - Runs28
NG6 - Run29
NG6 - Stats30
NG6 - StatsK-mers31
NG6 - StatsSequence contentacross all bases32
NG6 - StatsN content acrossall bases33
Exercises / set 1●snp server connexion●Short read retrieval (wget)●Read statistics (fastqc)●Data sets :– SRX002048– ERR003037– ERR00001734
Read alignment●What are the ideas?●The different software generations :– Smith-Waterman / Needleman-Wunch (1970)– BLAST (1990)– MAQ (2008)– BWA (2009)35
BWA●Much faster than MAQ●Exact match●Limited number of errors (2 for 32bp, 4 for 100 bp)http://bio-bwa.sourceforge.net/36
BWA prefix trie●Word 'googol'●^ = start character●--- Search of 'lol' with one errorThe prefix trie is compressed to fit in memory in mostcases ( 1Go for the human genome).37
http://bio-bwa.sourceforge.net/bwa.shtml38
CommandsReference sequence indexing :bwa index -a bwtsw db.fastaRead Alignment :bwa aln db.fasta short_read.fastq > aln_sa.saibwa bwasw database.fasta long_read.fastq > aln.samFormatting unpaired reads :bwa samse db.fasta aln_sa.sai short_read.fastq > aln.samFormatting pair ends :bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fqread2.fq > aln.sam39
Index40
aln41
samse & sampe42
wasw43
Exercises / set 2●Retrieving the reference sequence in fasta format :– gi|224581838|ref|NC_012125.1| Salmonella entericasubsp. enterica serovar Paratyphi C strain RKS4594, complete genome●Indexing the reference sequence●Aligning the reads (fastq format)●Formatting the alignment in SAM44
Sequence Alignment/Map (SAM) format46
SAM formatAlignment section➢11 mandatory fields➢Variable number of optional fields➢Fields are tab delimited48
SAM formatFull exampleAlignementHeader [:: [...]]X? : Reserved for end usersNM : Number of nuc. DifferenceMD : String for mismatching positionsRG : Read group[...]A : Printable characteri : Signed 32bit integerf : Singleprecision float numberZ : Printable stringH : Hex string (high nybble first)49
SAM formatFlag fieldhttp://picard.sourceforge.net/explain-flags.html50
SAM formatExtended CIGAR formatRef: GCATTCAGATGCAGTACGCRead: ccTCAGGCATTAgtgPOS CIGAR5 2S4M2D6M3S51
SAM formatExtended CIGAR format52
SAMtools➢ Library and software package➢ Creating sorted and indexed BAM files from SAM files➢ Removing PCR duplicates➢ Merging alignments➢ Visualization of alignments from BAM files➢ SNP calling➢ Short indel detectionhttp://samtools.sourceforge.net/samtools.shtml54
SAMtoolsExample usage55
SAMtoolsExample usage➢ Create BAM from SAMsamtools view bS aln.sam o aln.bam➢ Sort BAM filesamtools sort example.bam sortedExample➢ Merge sorted BAM filessamtools merge sortedMerge.bam sorted1.bam sorted2.bam➢ Index BAM filesamtools index sortedExample.bam➢ Visualize BAM filesamtools tview sortedExample.bam reference.fa56
Picard➢ A SAMtools complementary package➢ More format conversion than SAMtools➢ Visualization of alignments not available➢ SNP calling & short indel detection not availablehttp://picard.sourceforge.net/57
PicardExample usage➢➢➢➢➢➢➢ValidateSamFileSortSamMarkDuplicatesEstimateLibraryComplexityMergeSamFilesViewSamReplaceSamHeader➢➢➢➢➢➢SamToFastqFastqToSamSamFormatConverterCreateSequenceDictionaryCleanSamCompareSAMsjava Xmx2g jar PicardCmd.jar OPTION1=value1 OPTION2=value2...58
Exercise / set 359
Visualizing the alignmentSAMtools : tviewsamtools tview aln.sorted.bam ref.fasta60
Visualizing the alignmentIGV➢IGV : Integrative Genomics Viewer➢Website : http://www.broadinstitute.org/igv61
Visualizing the alignmentIGV➢ High-<strong>per</strong>formance visualization tool➢ Interactive exploration of large, integrateddatasets➢ Supports a wide variety of data types➢ Documentations➢ Developed at the Broad Institute of MITand Harvard62
Visualizing the alignmentIGV63
Visualizing the alignmentIGV - Loading the reference64
Visualizing the alignmentIGV - Loading the reference65
Visualizing the alignmentIGV - Loading the bam file66
Visualizing the alignmentIGV - Loading the bam file67
Visualizing the alignmentIGV - Zoom68
Visualizing the alignmentIGV - Zoom69
Visualizing the alignmentIGV - Loading a gff file70
Visualizing the alignmentIGV - Loading a gff file71
Visualizing the alignmentIGV - Coverage➢ Generate the coverage information to be displayed in IGV.java jar igvtools.jar count aln.bam aln.depth.tdf ref.genome➢ Remark : ref.genome was generated when we imported thegenome sequence➢ This step is optional, but it is essential if you want to see theread depth information in large scale.72
Visualizing the alignmentIGV - Coverage73
Exercise / set 474
The pileup formatChr. - Coord. - Base('*' for indel) - Number of reads covering the site - Read bases* - Base qualitiesRead bases :➢➢➢➢'.' and ',' : match to the reference base on the forward/reverse strand'ACTGN' and 'actgn' : for a mismatch on the forward/reverse strand'^' and '$' : start/end of a read segment'+[0-9]+[ACGTNacgtn]+' and '-[0-9]+[ACGTNacgtn]+' : insertion/deletionhttp://samtools.sourceforge.net/pileup.shtml75
Variant Calling with SAMtools➢Get the raw variant :samtools pileup vcf ref.fa aln.bam > raw.txtsamtools view u aln.bam X | samtools pileup vcf ref.fa > rawX.txtref.faaln.bamFasta formatted file of the reference genomeSorted BAM formatted file, from the alignmentsraw[-X].txt Output pileup formatted, with consensus calls-c Calls the consensus base at each position-v Show positions that do not agree with ref.fa-f Reference sequence, ref.fa (in FastA format)76
The pileup formatConsensus base - Consensus quality - Probability of difference from ref. base - Max. mapping quality1 st indel allele - 2 nd indel allele - Reads supporting 1 s t - Reads supporting 2 nd - Reads supporting 3 rd cReads bases Reads qualities77
Filter - samtools.pl➢ Filter the raw variant calls :samtools.pl varFilter raw.txt > raw_ok.txt(samtools.pl varFilter p raw.txt > raw_ok.txt) >& raw_filtered.txt78
Filter - awkAcquire final variant calls by setting a quality threshold (50 forindels and 20 for substitutions) :awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' raw.flt.txt > raw.final.txtConsensus base - Consensus quality - Probability of difference from ref. base - Max. mapping quality79
Consensus - samtools.plConsensus : A way of representing the results of a multiplesequence alignment (which residues are most abundant in thealignment at each position).samtools.pl pileup2fq raw.txt > raw.fq80
SynthesisSRAENA[…]fastqfastqfastqFASTX-ToolkitBWA(aln / bwsw)SAMSAMSAMGFFpileupIGVsamtools(view/merge/sort/...)awksamtools.plvarfiltersamtoolspileupBAMBAMpileuppileup81
Exercise / set 582
The ENDSatisfaction form :http://bioinfo.genotoul.fr/index.php?id=60Exam :http://bioinfo.genotoul.fr/index.php?id=12283