1 slide per page - GenoToul bioinformatics

Aligning NGS reads and SNPfindingPhilippe Bardou & Christophe Klopp

OrganisationMorning (9h30 -12h30) :- Sequence quality– Theory + exercises- Read mapping– Theory + exercisesAfternoon (14h-17h) :- SAM format– Theory + exercises- Visualisation– Theory + exercises- SNP calling– Theory + exercises2

Where are we?SequencingDe NovoAssemblyAlignmentGenomeTranscriptomeSNP Genome CallingTranscriptomeChip SeqTranscriptomicGenomeTranscriptomeMethylation4

What are you going to learn?●To extract reads and reference genome from the NCBI●To verify the read quality●To format reference sequence●To align the reads on the reference genome●To index the reference sequence and the aligned reads●To visualise the alignments and variations●To call SNPs5

What you should already know?●How to connect to a remote unix server (putty)?●What a unix command looks like?●How to move around the unix environment?●How to edit a file?6

The pieces of software●Quality : fastqc●BWA : alignment●Samtools : formating SNP discovery●IGV : visualisation7

The 1000 genomes project●Joint project NCBI / EBI●Common data formats :– fastq– SAM (Sequence Alignment/Map)8

OverviewSRAENA[…]fastqfastqfastqFastQCBWA(aln / bwsw)SAMSAMSAMGFFpileupIGVsamtools(view/merge/sort/...)awksamtools.plvarfiltersamtoolspileupBAMBAMpileuppileup9

NGS platforms●Two platforms :– Illumina Solexa– Roche 45410

NGS reads2010HiSeq 2000200 Gb/run100 pb /lecturePaired reads11

Sequencing bias bibliography12

Sequencing bias●Platform related●Roche 454 (data from Jean-Marc Aury CNS)– 99,9% mapped reads– Mean error rate : 0,55%– 37% deletions, 53% insertions, 10% substitutions.– homopolymers errors– emPCR duplications●Solexa (data from Jean-Marc Aury CNS)– 98,5% mapped reads– Mean error rate : 0,38%– 3% deletions, 2% insertions, 95% substitutions– Low A/T rich coverage13

What data will we use?●The needed data :– A reference sequence :● Genome● Parts of the genome● Transcriptome– Short reads14

Where to get a reference genome?●Assemble your own●Use a public assembly :– NCBI : Genbank– EMBL15

Where to get short reads?●Produce your own sequences :– CNS– Local platform– Private company●Use public data :– SRA : NCBI Sequence Read Archive– ENA : EMBL/EBI European Nucleotide Archive16

NCBI SRA?17

EBI ENA18

Meta data●Meta data structure :– Ex<strong>per</strong>iment– Sample– Study– Run– Data file19

What is a fastq file20

fastq file formats21

Sequence quality●Phred : base callingWhat is Phred Quality?Traditionally, Phred quality is defined on base calls. Eachbase call is an estimate of the true nucleotide. It is arandom variable and can be wrong. The probability that abase call is wrong is called error probability.Explanation about the quality values :source http://maq.sourceforge.net/qual.shtml22

Which reads should I keep?●All●Some : what criteria and threshold should I use– Composition (number of Ns, complexity,...),– Quality,– Alignment based criteria,●Should I trim the reads using :– Composition– Quality23

Sequence quality analysis●FastQC :25

Sequence quality analysis●FastQC :26

NG6 - Runs28

NG6 - Run29

NG6 - Stats30

NG6 - StatsK-mers31

NG6 - StatsSequence contentacross all bases32

NG6 - StatsN content acrossall bases33

Exercises / set 1●snp server connexion●Short read retrieval (wget)●Read statistics (fastqc)●Data sets :– SRX002048– ERR003037– ERR00001734

Read alignment●What are the ideas?●The different software generations :– Smith-Waterman / Needleman-Wunch (1970)– BLAST (1990)– MAQ (2008)– BWA (2009)35

BWA●Much faster than MAQ●Exact match●Limited number of errors (2 for 32bp, 4 for 100 bp)http://bio-bwa.sourceforge.net/36

BWA prefix trie●Word 'googol'●^ = start character●--- Search of 'lol' with one errorThe prefix trie is compressed to fit in memory in mostcases ( 1Go for the human genome).37

http://bio-bwa.sourceforge.net/bwa.shtml38

CommandsReference sequence indexing :bwa index -a bwtsw db.fastaRead Alignment :bwa aln db.fasta short_read.fastq > aln_sa.saibwa bwasw database.fasta long_read.fastq > aln.samFormatting unpaired reads :bwa samse db.fasta aln_sa.sai short_read.fastq > aln.samFormatting pair ends :bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fqread2.fq > aln.sam39

Index40

aln41

samse & sampe42

wasw43

Exercises / set 2●Retrieving the reference sequence in fasta format :– gi|224581838|ref|NC_012125.1| Salmonella entericasubsp. enterica serovar Paratyphi C strain RKS4594, complete genome●Indexing the reference sequence●Aligning the reads (fastq format)●Formatting the alignment in SAM44

Sequence Alignment/Map (SAM) format46

SAM formatAlignment section➢11 mandatory fields➢Variable number of optional fields➢Fields are tab delimited48

SAM formatFull exampleAlignementHeader [:: [...]]X? : Reserved for end usersNM : Number of nuc. DifferenceMD : String for mismatching positionsRG : Read group[...]A : Printable characteri : Signed 32bit integerf : Singleprecision float numberZ : Printable stringH : Hex string (high nybble first)49

SAM formatFlag fieldhttp://picard.sourceforge.net/explain-flags.html50

SAM formatExtended CIGAR formatRef: GCATTCAGATGCAGTACGCRead: ccTCAGGCATTAgtgPOS CIGAR5 2S4M2D6M3S51

SAM formatExtended CIGAR format52

SAMtools➢ Library and software package➢ Creating sorted and indexed BAM files from SAM files➢ Removing PCR duplicates➢ Merging alignments➢ Visualization of alignments from BAM files➢ SNP calling➢ Short indel detectionhttp://samtools.sourceforge.net/samtools.shtml54

SAMtoolsExample usage55

SAMtoolsExample usage➢ Create BAM from SAMsamtools view bS aln.sam o aln.bam➢ Sort BAM filesamtools sort example.bam sortedExample➢ Merge sorted BAM filessamtools merge sortedMerge.bam sorted1.bam sorted2.bam➢ Index BAM filesamtools index sortedExample.bam➢ Visualize BAM filesamtools tview sortedExample.bam reference.fa56

Picard➢ A SAMtools complementary package➢ More format conversion than SAMtools➢ Visualization of alignments not available➢ SNP calling & short indel detection not availablehttp://picard.sourceforge.net/57

PicardExample usage➢➢➢➢➢➢➢ValidateSamFileSortSamMarkDuplicatesEstimateLibraryComplexityMergeSamFilesViewSamReplaceSamHeader➢➢➢➢➢➢SamToFastqFastqToSamSamFormatConverterCreateSequenceDictionaryCleanSamCompareSAMsjava Xmx2g jar PicardCmd.jar OPTION1=value1 OPTION2=value2...58

Exercise / set 359

Visualizing the alignmentSAMtools : tviewsamtools tview aln.sorted.bam ref.fasta60

Visualizing the alignmentIGV➢IGV : Integrative Genomics Viewer➢Website : http://www.broadinstitute.org/igv61

Visualizing the alignmentIGV➢ High-<strong>per</strong>formance visualization tool➢ Interactive exploration of large, integrateddatasets➢ Supports a wide variety of data types➢ Documentations➢ Developed at the Broad Institute of MITand Harvard62

Visualizing the alignmentIGV63

Visualizing the alignmentIGV - Loading the reference64

Visualizing the alignmentIGV - Loading the reference65

Visualizing the alignmentIGV - Loading the bam file66

Visualizing the alignmentIGV - Loading the bam file67

Visualizing the alignmentIGV - Zoom68

Visualizing the alignmentIGV - Zoom69

Visualizing the alignmentIGV - Loading a gff file70

Visualizing the alignmentIGV - Loading a gff file71

Visualizing the alignmentIGV - Coverage➢ Generate the coverage information to be displayed in IGV.java jar igvtools.jar count aln.bam aln.depth.tdf ref.genome➢ Remark : ref.genome was generated when we imported thegenome sequence➢ This step is optional, but it is essential if you want to see theread depth information in large scale.72

Visualizing the alignmentIGV - Coverage73

Exercise / set 474

The pileup formatChr. - Coord. - Base('*' for indel) - Number of reads covering the site - Read bases* - Base qualitiesRead bases :➢➢➢➢'.' and ',' : match to the reference base on the forward/reverse strand'ACTGN' and 'actgn' : for a mismatch on the forward/reverse strand'^' and '$' : start/end of a read segment'+[0-9]+[ACGTNacgtn]+' and '-[0-9]+[ACGTNacgtn]+' : insertion/deletionhttp://samtools.sourceforge.net/pileup.shtml75

Variant Calling with SAMtools➢Get the raw variant :samtools pileup vcf ref.fa aln.bam > raw.txtsamtools view u aln.bam X | samtools pileup vcf ref.fa > rawX.txtref.faaln.bamFasta formatted file of the reference genomeSorted BAM formatted file, from the alignmentsraw[-X].txt Output pileup formatted, with consensus calls-c Calls the consensus base at each position-v Show positions that do not agree with ref.fa-f Reference sequence, ref.fa (in FastA format)76

The pileup formatConsensus base - Consensus quality - Probability of difference from ref. base - Max. mapping quality1 st indel allele - 2 nd indel allele - Reads supporting 1 s t - Reads supporting 2 nd - Reads supporting 3 rd cReads bases Reads qualities77

Filter - samtools.pl➢ Filter the raw variant calls :samtools.pl varFilter raw.txt > raw_ok.txt(samtools.pl varFilter p raw.txt > raw_ok.txt) >& raw_filtered.txt78

Filter - awkAcquire final variant calls by setting a quality threshold (50 forindels and 20 for substitutions) :awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' raw.flt.txt > raw.final.txtConsensus base - Consensus quality - Probability of difference from ref. base - Max. mapping quality79

Consensus - samtools.plConsensus : A way of representing the results of a multiplesequence alignment (which residues are most abundant in thealignment at each position).samtools.pl pileup2fq raw.txt > raw.fq80

SynthesisSRAENA[…]fastqfastqfastqFASTX-ToolkitBWA(aln / bwsw)SAMSAMSAMGFFpileupIGVsamtools(view/merge/sort/...)awksamtools.plvarfiltersamtoolspileupBAMBAMpileuppileup81

Exercise / set 582

The ENDSatisfaction form :http://bioinfo.genotoul.fr/index.php?id=60Exam :http://bioinfo.genotoul.fr/index.php?id=12283

1 slide per page - GenoToul bioinformatics

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?