12.07.2015 Views

1 slide per page - GenoToul bioinformatics

1 slide per page - GenoToul bioinformatics

1 slide per page - GenoToul bioinformatics

SHOW MORE
SHOW LESS
  • No tags were found...

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Aligning NGS reads and SNPfindingPhilippe Bardou & Christophe Klopp


OrganisationMorning (9h30 -12h30) :- Sequence quality– Theory + exercises- Read mapping– Theory + exercisesAfternoon (14h-17h) :- SAM format– Theory + exercises- Visualisation– Theory + exercises- SNP calling– Theory + exercises2


Where are we?SequencingDe NovoAssemblyAlignmentGenomeTranscriptomeSNP Genome CallingTranscriptomeChip SeqTranscriptomicGenomeTranscriptomeMethylation4


What are you going to learn?●To extract reads and reference genome from the NCBI●To verify the read quality●To format reference sequence●To align the reads on the reference genome●To index the reference sequence and the aligned reads●To visualise the alignments and variations●To call SNPs5


What you should already know?●How to connect to a remote unix server (putty)?●What a unix command looks like?●How to move around the unix environment?●How to edit a file?6


The pieces of software●Quality : fastqc●BWA : alignment●Samtools : formating SNP discovery●IGV : visualisation7


The 1000 genomes project●Joint project NCBI / EBI●Common data formats :– fastq– SAM (Sequence Alignment/Map)8


OverviewSRAENA[…]fastqfastqfastqFastQCBWA(aln / bwsw)SAMSAMSAMGFFpileupIGVsamtools(view/merge/sort/...)awksamtools.plvarfiltersamtoolspileupBAMBAMpileuppileup9


NGS platforms●Two platforms :– Illumina Solexa– Roche 45410


NGS reads2010HiSeq 2000200 Gb/run100 pb /lecturePaired reads11


Sequencing bias bibliography12


Sequencing bias●Platform related●Roche 454 (data from Jean-Marc Aury CNS)– 99,9% mapped reads– Mean error rate : 0,55%– 37% deletions, 53% insertions, 10% substitutions.– homopolymers errors– emPCR duplications●Solexa (data from Jean-Marc Aury CNS)– 98,5% mapped reads– Mean error rate : 0,38%– 3% deletions, 2% insertions, 95% substitutions– Low A/T rich coverage13


What data will we use?●The needed data :– A reference sequence :● Genome● Parts of the genome● Transcriptome– Short reads14


Where to get a reference genome?●Assemble your own●Use a public assembly :– NCBI : Genbank– EMBL15


Where to get short reads?●Produce your own sequences :– CNS– Local platform– Private company●Use public data :– SRA : NCBI Sequence Read Archive– ENA : EMBL/EBI European Nucleotide Archive16


NCBI SRA?17


EBI ENA18


Meta data●Meta data structure :– Ex<strong>per</strong>iment– Sample– Study– Run– Data file19


What is a fastq file20


fastq file formats21


Sequence quality●Phred : base callingWhat is Phred Quality?Traditionally, Phred quality is defined on base calls. Eachbase call is an estimate of the true nucleotide. It is arandom variable and can be wrong. The probability that abase call is wrong is called error probability.Explanation about the quality values :source http://maq.sourceforge.net/qual.shtml22


Which reads should I keep?●All●Some : what criteria and threshold should I use– Composition (number of Ns, complexity,...),– Quality,– Alignment based criteria,●Should I trim the reads using :– Composition– Quality23


Sequence quality analysis●FastQC :25


Sequence quality analysis●FastQC :26


NG6 - Runs28


NG6 - Run29


NG6 - Stats30


NG6 - StatsK-mers31


NG6 - StatsSequence contentacross all bases32


NG6 - StatsN content acrossall bases33


Exercises / set 1●snp server connexion●Short read retrieval (wget)●Read statistics (fastqc)●Data sets :– SRX002048– ERR003037– ERR00001734


Read alignment●What are the ideas?●The different software generations :– Smith-Waterman / Needleman-Wunch (1970)– BLAST (1990)– MAQ (2008)– BWA (2009)35


BWA●Much faster than MAQ●Exact match●Limited number of errors (2 for 32bp, 4 for 100 bp)http://bio-bwa.sourceforge.net/36


BWA prefix trie●Word 'googol'●^ = start character●--- Search of 'lol' with one errorThe prefix trie is compressed to fit in memory in mostcases ( 1Go for the human genome).37


http://bio-bwa.sourceforge.net/bwa.shtml38


CommandsReference sequence indexing :bwa index -a bwtsw db.fastaRead Alignment :bwa aln db.fasta short_read.fastq > aln_sa.saibwa bwasw database.fasta long_read.fastq > aln.samFormatting unpaired reads :bwa samse db.fasta aln_sa.sai short_read.fastq > aln.samFormatting pair ends :bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fqread2.fq > aln.sam39


Index40


aln41


samse & sampe42


wasw43


Exercises / set 2●Retrieving the reference sequence in fasta format :– gi|224581838|ref|NC_012125.1| Salmonella entericasubsp. enterica serovar Paratyphi C strain RKS4594, complete genome●Indexing the reference sequence●Aligning the reads (fastq format)●Formatting the alignment in SAM44


Sequence Alignment/Map (SAM) format46


SAM formatAlignment section➢11 mandatory fields➢Variable number of optional fields➢Fields are tab delimited48


SAM formatFull exampleAlignementHeader [:: [...]]X? : Reserved for end usersNM : Number of nuc. DifferenceMD : String for mismatching positionsRG : Read group[...]A : Printable characteri : Signed 32­bit integerf : Single­precision float numberZ : Printable stringH : Hex string (high nybble first)49


SAM formatFlag fieldhttp://picard.sourceforge.net/explain-flags.html50


SAM formatExtended CIGAR formatRef: GCATTCAGATGCAGTACGCRead: ccTCAG­­GCATTAgtgPOS CIGAR5 2S4M2D6M3S51


SAM formatExtended CIGAR format52


SAMtools➢ Library and software package➢ Creating sorted and indexed BAM files from SAM files➢ Removing PCR duplicates➢ Merging alignments➢ Visualization of alignments from BAM files➢ SNP calling➢ Short indel detectionhttp://samtools.sourceforge.net/samtools.shtml54


SAMtoolsExample usage55


SAMtoolsExample usage➢ Create BAM from SAMsamtools view ­bS aln.sam ­o aln.bam➢ Sort BAM filesamtools sort example.bam sortedExample➢ Merge sorted BAM filessamtools merge sortedMerge.bam sorted1.bam sorted2.bam➢ Index BAM filesamtools index sortedExample.bam➢ Visualize BAM filesamtools tview sortedExample.bam reference.fa56


Picard➢ A SAMtools complementary package➢ More format conversion than SAMtools➢ Visualization of alignments not available➢ SNP calling & short indel detection not availablehttp://picard.sourceforge.net/57


PicardExample usage➢➢➢➢➢➢➢ValidateSamFileSortSamMarkDuplicatesEstimateLibraryComplexityMergeSamFilesViewSamReplaceSamHeader➢➢➢➢➢➢SamToFastqFastqToSamSamFormatConverterCreateSequenceDictionaryCleanSamCompareSAMsjava ­Xmx2g ­jar PicardCmd.jar OPTION1=value1 OPTION2=value2...58


Exercise / set 359


Visualizing the alignmentSAMtools : tviewsamtools tview aln.sorted.bam ref.fasta60


Visualizing the alignmentIGV➢IGV : Integrative Genomics Viewer➢Website : http://www.broadinstitute.org/igv61


Visualizing the alignmentIGV➢ High-<strong>per</strong>formance visualization tool➢ Interactive exploration of large, integrateddatasets➢ Supports a wide variety of data types➢ Documentations➢ Developed at the Broad Institute of MITand Harvard62


Visualizing the alignmentIGV63


Visualizing the alignmentIGV - Loading the reference64


Visualizing the alignmentIGV - Loading the reference65


Visualizing the alignmentIGV - Loading the bam file66


Visualizing the alignmentIGV - Loading the bam file67


Visualizing the alignmentIGV - Zoom68


Visualizing the alignmentIGV - Zoom69


Visualizing the alignmentIGV - Loading a gff file70


Visualizing the alignmentIGV - Loading a gff file71


Visualizing the alignmentIGV - Coverage➢ Generate the coverage information to be displayed in IGV.java ­jar igvtools.jar count aln.bam aln.depth.tdf ref.genome➢ Remark : ref.genome was generated when we imported thegenome sequence➢ This step is optional, but it is essential if you want to see theread depth information in large scale.72


Visualizing the alignmentIGV - Coverage73


Exercise / set 474


The pileup formatChr. - Coord. - Base('*' for indel) - Number of reads covering the site - Read bases* - Base qualitiesRead bases :➢➢➢➢'.' and ',' : match to the reference base on the forward/reverse strand'ACTGN' and 'actgn' : for a mismatch on the forward/reverse strand'^' and '$' : start/end of a read segment'+[0-9]+[ACGTNacgtn]+' and '-[0-9]+[ACGTNacgtn]+' : insertion/deletionhttp://samtools.sourceforge.net/pileup.shtml75


Variant Calling with SAMtools➢Get the raw variant :samtools pileup ­vcf ref.fa aln.bam > raw.txtsamtools view ­u aln.bam X | samtools pileup ­vcf ref.fa ­ > raw­X.txtref.faaln.bamFasta formatted file of the reference genomeSorted BAM formatted file, from the alignmentsraw[-X].txt Output pileup formatted, with consensus calls-c Calls the consensus base at each position-v Show positions that do not agree with ref.fa-f Reference sequence, ref.fa (in FastA format)76


The pileup formatConsensus base - Consensus quality - Probability of difference from ref. base - Max. mapping quality1 st indel allele - 2 nd indel allele - Reads supporting 1 s t - Reads supporting 2 nd - Reads supporting 3 rd cReads bases Reads qualities77


Filter - samtools.pl➢ Filter the raw variant calls :samtools.pl varFilter raw.txt > raw_ok.txt(samtools.pl varFilter ­p raw.txt > raw_ok.txt) >& raw_filtered.txt78


Filter - awkAcquire final variant calls by setting a quality threshold (50 forindels and 20 for substitutions) :awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' raw.flt.txt > raw.final.txtConsensus base - Consensus quality - Probability of difference from ref. base - Max. mapping quality79


Consensus - samtools.plConsensus : A way of representing the results of a multiplesequence alignment (which residues are most abundant in thealignment at each position).samtools.pl pileup2fq raw.txt > raw.fq80


SynthesisSRAENA[…]fastqfastqfastqFASTX-ToolkitBWA(aln / bwsw)SAMSAMSAMGFFpileupIGVsamtools(view/merge/sort/...)awksamtools.plvarfiltersamtoolspileupBAMBAMpileuppileup81


Exercise / set 582


The ENDSatisfaction form :http://bioinfo.genotoul.fr/index.php?id=60Exam :http://bioinfo.genotoul.fr/index.php?id=12283

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!