13.07.2015 Views

Differential expression analysis for sequencing count ... - Bioconductor

Differential expression analysis for sequencing count ... - Bioconductor

Differential expression analysis for sequencing count ... - Bioconductor

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

<strong>Differential</strong> <strong>expression</strong> <strong>analysis</strong><strong>for</strong> <strong>sequencing</strong> <strong>count</strong> dataSimon Anders


Two applications of RNA-Seq• Discovery• find new transcripts• find transcript boundaries• find splice junctions• ComparisonGiven samples from different experimental conditions, find effectsof the treatment on• gene <strong>expression</strong> strengths• iso<strong>for</strong>m abundance ratios, splice patterns, transcriptboundaries


Count data in HTSGene GliNS1 G144 G166 G179 CB541 CB66013CDNA73 4 0 6 1 0 5A2BP1 19 18 20 7 1 8A2M 2724 2209 13 49 193 548A4GALT 0 0 48 0 0 0AAAS 57 29 224 49 202 92AACS 1904 1294 5073 5365 3737 3511AADACL1 3 13 239 683 158 40[...]• RNA-Seq• Tag-Seq• ChIP-Seq• HiC• Bar-Seq• ...


Sample-to-sample variationcomparison oftwo replicatescomparison oftreatment vs control


Sample-to-sample variability• In RNA-Seq, the minimum variance given by thePoisson distribution.• Taking only Poisson noise into ac<strong>count</strong> isinsufficient, though.• Many publications ignore this.


<strong>Differential</strong> <strong>expression</strong>: Two questionsAssume you use RNA-Seq to determine the concentration oftranscripts from some gene in different samples. What is yourquestion?1. “Is the concentration in one sample different fromthe <strong>expression</strong> in another sample?”or2. “Can the difference in concentration betweentreated samples and control samples be attributed tothe treatment?”


ReplicatesTwo replicates permit to• globally estimate variationSufficiently many replicates permit to• estimate variation <strong>for</strong> each gene• randomize out unknown covariates• spot outliers• improve precision of <strong>expression</strong> and fold-changeestimates


Replication at what level?Replicates should differ in all aspects in which controland treatment samples differ, except <strong>for</strong> the actualtreatment.


Variance depends strongly on the meanVariance calculated from comparing two replicatesPoissonv = μPoisson + constant CV v = μ + α μ 2Poisson + local regression v = μ + f(μ 2 )


The NB distribution from a hierarchical modelBiological samplewith mean µ andvariance vPoisson distributionwith mean q andvariance q.Negative binomialwith mean µ andvariance q+v.


Model fitting• Estimate the variance from replicates• Fit a line to get the variance-mean dependence v(μ)(local regression <strong>for</strong> a gamma-family generalized linear model, extra mathneeded to handle differing library sizes)


Dispersion fit


<strong>Differential</strong> <strong>expression</strong>RNA-Seq data: over<strong>expression</strong> of two differentgenes in flies [data: Furlong group]


Type-I error controlcomparison oftwo replicatescomparison oftreatment vs control


Further use casesSimilar <strong>count</strong> data appears in• comparative ChiP-Seq• barcode <strong>sequencing</strong>• ...and can be analysed with DESeq as well.


Comparative ChIP-Seq with DESeqStep 1: Get a list of <strong>count</strong>ing bins by either• running a peak finder on each samples and merging thepeak lists, or• merging the reads and running the finder on the pooledreads, or• using windows around annotated featuresStep 2: Make a <strong>count</strong> table:columns – samples; rows – <strong>count</strong>ing binsand use DESeqNote: The input samples are used in Step 1 only.


Generalized linear modelsSimple design:• Two groups of samples (“control” and “treatment”),no sub-structure within each group.Common complex designs:• Designs with blocking factors• Factorial designs


GLMs: Blocking factorSample treated sexS1 no maleS2 no maleS3 no maleS4 no femaleS5 no femaleS6 yes maleS7 yes maleS8 yes femaleS9 yes femaleS10 yes female


GLMs: Blocking factorfull model <strong>for</strong> gene i:reduced model <strong>for</strong> gene i:


GLMs: Blocking factorcds


GLMs: Interactionfull model <strong>for</strong> gene i:reduced model <strong>for</strong> gene i:


GLMs: paired designs• Often, samples are paired (e.g., a tumour and ahealthy-tissue sample from the same patient)• Then, using pair identity as blocking factorimproves power.full model:reduced model:


Alternative splicing• So far, we <strong>count</strong>ed reads in genes.• To study alternative splicing, reads have to beassigned to transcripts.• This introduces ambiguity, which adds uncertainty.• Proper inference has to take thin into ac<strong>count</strong>, andsample-to-sample variability


Data set used <strong>for</strong> to demonstrate DEXSeq:Drosophila melanogaster S2 cell cultures:• control (no treatment):4 biological replicates (2x single end, 2x paired end)• treatment: knock-down of pasilla (a splicing factor)3 biological replicates (1x single end, 2x paired end)


Alternative iso<strong>for</strong>m regulationData: Brooks et al., Genome Res., 2010


Exon <strong>count</strong>ing bins


Count table <strong>for</strong> a genenumber of reads mapped to each exon (or part of exon) in gene msn:treated_1 treated_2 control_1 control_2E01 398 556 561 456E02 112 180 153 137E03 238 306 298 226E04 162 171 183 146E05 192 272 234 199E06 314 464 419 331E07 373 525 481 404E08 323 427 475 373E09 194 213 273 176E10 90 90 530 398


Model<strong>count</strong>s in gene i,sample j, exon lsize factordispersion<strong>expression</strong> strengthin controlchange in <strong>expression</strong>due to treatmentfraction of readsfalling onto exon lin controlchange to fraction ofreads <strong>for</strong> exon l dueto treatment


Model, refined<strong>count</strong>s in gene i,sample j, exon lsize factordispersion<strong>expression</strong> strengthin sample jfraction of readsfalling onto exon lin controlchange to fraction ofreads <strong>for</strong> exon l dueto treatmentfurther refinement: fit an extra factor <strong>for</strong> library type (paired-end vs single)


Dispersion vs mean


RpS14a (FBgn0004403)


DEXSeq and other tools• MISO and ALEXA-Seq do not ac<strong>count</strong> <strong>for</strong> biologicalvariability.• Neither does cuffdiff, as described in the authors'publications.• New versions of cuffdiff claim to ac<strong>count</strong> <strong>for</strong>biological variability, however ...• See also Glaus et al.'s EBSeq, though.


DEXSeq and other tools• MISO and ALEXA-Seq do not ac<strong>count</strong> <strong>for</strong> biologicalvariability.• Neither does cuffdiff, as described in the authors'publications.• New versions of cuffdiff claim to ac<strong>count</strong> <strong>for</strong>biological variability, however ...• See also Glaus et al.'s BitSeq, though.


Test cuffdiff vs DEXSeq


Exons vs iso<strong>for</strong>ms• DEXSeq deliberately tests at the level of exons, notiso<strong>for</strong>ms.• This might be an advantage: We have moreannotation on exons than on iso<strong>for</strong>ms, anyway.


DEXSeq• combination of Python scripts and an R package• Python script to get <strong>count</strong>ing bins from a GTF file• Python script to get <strong>count</strong> table from SAM files• R functions to set up model frames and per<strong>for</strong>mGLM fits and ANODEV• R functions to visualize results and compile anHTML report


Conclusion• Counting within exons and NB-GLMs allows tostudy iso<strong>for</strong>m regulation.• Proper statistical testing allows to see whetherchanges in iso<strong>for</strong>m abundances are just randomvariation or may be attributed to changes in tissuetype or experimental condition.• Testing on the level of individual exons gives powerand might be helpful to study the mechanisms ofalternative iso<strong>for</strong>m regulation.• DEXSeq is availabe from <strong>Bioconductor</strong>, paper ispublished in Genome Research.


Outlook: Current developmentsUse of shrinkage estimators (empirical Bayes) <strong>for</strong>• dispersion• fold changes / GLM coefficientsImprovements to DEXSeq• “splice graphs”• junction reads


AcknowledgementsCoauthors:• Alejandro Reyes• Wolfgang Huber• Michael LoveFunding:• EMBL

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!