STAT 435: Topics in Advanced Statistics Data analysis for ...

STAT 435: Topics in Advanced Statistics 

Data analysis for bioinformatics 

Dr Mik Black 

Department of Biochemistry 

University of Otago 

Lecture 7

Annotation 

• From wikipedia.org: “Annotation is extra information 

associated with a particular point in a document or other 

piece of information.” 

• Here our “document” is the genome. 

• The goal of annotating the genome is to link all information 

relating to sequences, genes, protein, function...

Gene annotation 

• Once a fragment of a gene has been sequenced, it is 

assigned a unique identifier called an accession number. 

• The accession number is used to track that sequence, with 

additional information (e.g., function, other sequences) 

becoming associated with it as more is learned. 

• Sequences used on microarrays can be tracked back to an 

accession number. 

• Databases of these identifiers are maintained by the 

National Center for Biotechnology Information (NCBI), and 

others.

Entrez Gene 

• With the sequencing of the human genome, individual 

sequence fragments can be mapped to their position in the 

genome, and annotated (in conjunction with other 

fragments) as “genes”. 

• Each putative gene in the genome is also assigned an 

identifier in the “Entrez gene” database (also at NCBI). 

• The accession numbers of the constituent fragments are 

then associated with this identifier (and vice versa). 

• The gene identifier is also linked to a more descriptive gene 

name. This usually conveys some information about what 

that gene does (or at least what it was understood to be 

involved in at the time it was named).

Microarrays 

• Each spot on a microarray represents a gene fragment, and 

has a unique accession number. 

• Through this number we can find out about the gene which 

the fragment represents (if known), where in the genome 

this fragment is located, and (maybe) what it does. 

• In microarray experiments this means that we can find out 

the identity of genes which undergo differential expression. 

• Depending on what is known about these genes, this 

information may provide important clues about the 

underlying biological process being studied.

Gene function 

• For biologists, it’s not particularly interesting to find out 

that a gene is significantly differentially expressed if no 

other information is known about that gene. 

• One (very good) reason for this is that in microarray 

experiments there are often a lot of false positives, so 

biologists tend to be a little bit skeptical... 

– Remember: we can only have as much faith in the analysis as we do in the underlying assumptions. Were those genes REALLY independent? How about the residuals - normally distributed?

Gene function 

• If enough is known about a differentially expressed gene for 

it to “make sense” or be “interesting” in the context of the 

experiment, then biologists tend to get a bit more excited. 

• Although a gene name is often somewhat informative, vast amounts of information about that gene may reside in journal publications and internet databases - how do we get this information?

PubMed identifiers 

• PubMed is a service provided by the National Library of 

Medicine. 

• Contains over 15 million citations from MEDLINE and 

other life sciences publications. 

• Every journal publication is given a unique PubMed 

identifier. 

• Those that relate to a particular gene or sequence are linked 

back to the appropriate identifiers. 

• Based on this, the NCBI search engine can be used to retrieve information about differentially expressed genes (the Bioinformatics Institute, www.bioinformatics.org.nz, hosts a local copy of the NCBI databases).

Problem - too much information 

• For situations where large numbers of genes are differentially 

expressed, there is simply too much information available. 

• Anyway, are we really interested in individual genes?

• Wouldn’t it be better to find groups of differentially expressed genes which share a common function?

Biological pathways 

• In reality genes are members of pathways, which perform 

major biological functions. 

• As more biological experimentation is done, researchers are 

able to build a better picture of how genes interact, and 

how pathways function. 

• Information about pathway membership and gene function is stored in publicly available databases.

• This information can be used to define gene sets (groups of 

genes which are functionally related), to which statistical 

analysis can be applied.

Biological pathways: KEGG 

• Kyoto Encyclopedia of Genes and Genomes.

www.genome.jp/kegg/kegg4.html 

• Provides nice (user-created) pathway diagrams. 

• XML output includes information about genes involved in 

pathways, and inter-gene (and gene product) relationships. 

– Can produce graphic representation of pathway based 

on XML alone.

KEGG pathway diagram (apoptosis)

XML output file for apoptosis pathway

XML-based KEGG pathway diagram (apoptosis)

Biological pathways: Biocarta 

• Maintain “open source” pathway database, and provide lab 

supplies for related experiments: 

www.biocarta.com 

• Pathway database is edited by “gurus”. 

• Very nice pathway diagrams (user created). 

• No non-html output.

Biocarta pathway diagram (apoptosis)

Gene ontology 

• Gene Ontology (GO) defines a collection of words (an 

ontology) which are used to classify the function of a gene. 

• Three broad classifications: 

– Molecular function. 

– Biological process. 

– Cellular component. 

• Each of these broad terms contains a hierarchy of 

categories, going from general to specific. 

• Each category is indexed by an identifier.

Example of GO hierarchy (apoptosis)

* all : all (218850)
  o GO:0008150 : biological_process (145098)
    + GO:0009987 : cellular process (91236)
      # GO:0050875 : cellular physiological process (81383)
        * GO:0008219 : cell death (2714)
          o GO:0012501 : programmed cell death (2395)
            + GO:0006915 : apoptosis (2061)
    + GO:0007582 : physiological process (96419)
      # GO:0050875 : cellular physiological process (81383)
        * GO:0008219 : cell death (2714)
          o GO:0012501 : programmed cell death (2395)
            + GO:0006915 : apoptosis (2061)
      # GO:0016265 : death (3054)
        * GO:0008219 : cell death (2714)
          o GO:0012501 : programmed cell death (2395)
            + GO:0006915 : apoptosis (2061)
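The hierarchy above can be walked programmatically. The following is a minimal sketch in base R, assuming a hand-built parent table containing only the terms listed on this slide; the `parents` list and the `go_ancestors` helper are illustrative names, not part of any GO package.

```r
# Toy parent table built only from the GO terms shown in the hierarchy above.
# Each entry maps a term to its direct parent(s); "all" is the root.
parents <- list(
  "GO:0008150" = "all",
  "GO:0009987" = "GO:0008150",
  "GO:0007582" = "GO:0008150",
  "GO:0050875" = c("GO:0009987", "GO:0007582"),
  "GO:0016265" = "GO:0007582",
  "GO:0008219" = c("GO:0050875", "GO:0016265"),
  "GO:0012501" = "GO:0008219",
  "GO:0006915" = "GO:0012501"
)

# Collect every ancestor of a term by repeatedly following parent links.
go_ancestors <- function(term, parents) {
  found <- character(0)
  queue <- parents[[term]]
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (!(current %in% found)) {
      found <- c(found, current)
      # parents[["all"]] is NULL, so the walk stops at the root.
      queue <- c(queue, parents[[current]])
    }
  }
  found
}

anc <- go_ancestors("GO:0006915", parents)
print(anc)
```

Because a term can have more than one parent (cell death sits under both "cellular physiological process" and "death"), the structure is a graph rather than a simple tree, which is why apoptosis appears three times in the printed hierarchy.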

Annotation for microarrays 

• Linking information back to the array spots. 

• Types of information: 

– Sequence. 

– Gene. 

– Chromosome location. 

– Publications. 

– Function. 

– Other (e.g., transcription factors, orthologs, proteins). 

• Amount of information available is organism-specific.

Common identifiers 

• Microarray experiments often utilize the following identifiers: 

– Manufacturer’s ID (e.g., Affymetrix probe ID, MWG ID 

number). 

– Accession number (sequence identifier). 

– Entrez Gene ID or LocusLink ID (gene identifier). 

– KEGG ID (KEGG pathway membership). 

– GO term ID (functional information). 

• This information can be used to enhance (or be 

incorporated into) a statistical analysis.

Annotation in Bioconductor 

• Bioconductor includes metadata packages which contain 

annotation information. 

– Oligo-set specific (e.g., MWG 30K set). 

– Array specific (e.g., Affymetrix HGU133A). 

– Organism specific (e.g., human, rat, mouse). 

• These packages provide linkage between the sequences used 

in array experiments, and the genes from which they are 

derived. 

• GO and KEGG libraries are also available, with links to LocusLink IDs (old name for Entrez Gene).

Bioconductor example: hu6800 

• Annotation package for Affymetrix array used in Golub et 

al. paper: 

> library(hu6800.db) 

> hu6800() 

Quality control information for hu6800: 

This package has the following mappings: 

hu6800ACCNUM has 7129 mapped keys (of 7129 keys) 

hu6800ALIAS2PROBE has 23712 mapped keys (of 110391 keys) 

hu6800CHR has 6178 mapped keys (of 7129 keys) 

hu6800CHRLENGTHS has 93 mapped keys (of 93 keys) 

hu6800CHRLOC has 6136 mapped keys (of 7129 keys) 

hu6800CHRLOCEND has 6136 mapped keys (of 7129 keys) 

...

Bioconductor example: hu6800

• Gene symbols:

> library(hu6800.db)

> sym <- as.list(hu6800SYMBOL)

> sym[1:3]

$U30894_at 

[1] "SGSH" 

$X85178_at 

[1] "SURF5" 

• KEGG pathways: 

> KEGG.list <- as.list(hu6800PATH)

> KEGG.list[1:2]

$U30894_at 

[1] "00531" "01032" 

$X85178_at 

[1] NA
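A probe-to-pathway list like this usually has to be inverted before gene sets can be analysed. The sketch below, in base R, assumes a toy list in the same shape as `KEGG.list`: the first two entries come from the output above, while `toy_probe_at` is made up for illustration. It builds one gene set per pathway and a 0/1 membership matrix (here called `C.toy`, a hypothetical name).

```r
# Toy probe-to-pathway list in the same shape as KEGG.list above
# (the first two entries are from the slide; the third is made up).
KEGG.toy <- list(
  U30894_at = c("00531", "01032"),
  X85178_at = NA,
  toy_probe_at = "00531"
)

# Drop probes with no pathway annotation.
annotated <- KEGG.toy[!sapply(KEGG.toy, function(p) all(is.na(p)))]

# Invert the list: one entry per pathway, holding the probes that map to it.
probe <- rep(names(annotated), times = lengths(annotated))
pathway <- unlist(annotated, use.names = FALSE)
gene.sets <- split(probe, pathway)

# Indicator matrix: rows are probes, columns are pathways, 1 = member.
C.toy <- sapply(gene.sets, function(s) as.integer(names(KEGG.toy) %in% s))
rownames(C.toy) <- names(KEGG.toy)

print(gene.sets)
print(C.toy)
```

Unannotated probes stay in the matrix as all-zero rows, so its rows still line up with the rows of the expression data.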

Bioconductor example: KEGG Pathway names 

> library(KEGG.db)

> kg <- as.list(KEGGPATHID2NAME)

> kg[1:3]

$`00010`

[1] "Glycolysis / Gluconeogenesis"

$`00020`

[1] "Citrate cycle (TCA cycle)"

> kg[match("00860",names(kg))]

$`00860`

[1] "Porphyrin and chlorophyll metabolism"

Detecting pathway-level changes 

• Microarrays are able to measure changes in gene expression 

across treatment conditions. 

• Can obtain information about gene sets (e.g., GO, KEGG, 

Biocarta). 

• Allows microarray data to be used to assess whether 

changes in expression occur at the group level. 

• Such changes often provide greater information than single 

gene changes.

Hypergeometric distribution 

• Simple approach to investigating coordinated gene 

expression - involves hypergeometric distribution. 

• Look for functional groupings within a set of significantly 

differentially expressed genes: 

– e.g., what is the probability of getting 10 apoptosis genes in my 100 differentially expressed genes?

• Similar to classic hypergeometric problem:

– e.g., what is the probability of selecting x white balls in a sample of size n from a bag containing M white and N − M black balls?

Hypergeometric distribution

$$P(X = x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}, \quad x = \max(0,\, n + M - N), \ldots, \min(n, M)$$

• Here x is the number of genes from a particular pathway (of size M) which showed up in our list of n differentially expressed genes (there are N genes in total).

• To calculate a p-value for this “test” we need to sum up all 

of the probabilities from x (which we observed) up to 

min(M, n). 

• This is done for each gene set, and then the p-values are 

adjusted to take multiple comparisons into account.
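Written out, the p-value described above is the upper tail of the hypergeometric distribution. In the block below, $x_{\mathrm{obs}}$ is notation introduced here for the observed count; the other symbols are those of the formula on this slide.

```latex
p = P(X \ge x_{\mathrm{obs}})
  = \sum_{k = x_{\mathrm{obs}}}^{\min(n, M)}
    \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}
```

This is a one-sided test for over-representation: it only asks whether the pathway contributes more genes to the list than chance would suggest.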

Example: hypergeometric distribution

$$P(X = x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}, \quad x = \max(0,\, n + M - N), \ldots, \min(n, M)$$

• Suppose that we observe 10 apoptosis genes in our 100 

differentially expressed genes. 

• There are 10,000 genes on our array, of which 500 are 

apoptosis genes. 

$$P(X = 10) = \frac{\binom{500}{10}\binom{9500}{90}}{\binom{10000}{100}} \approx 0.0164$$

• Summing the values from x = 10 to x = 100 gives a p-value of approximately 0.0275.

• Would then have to adjust this for the number of pathways 

being tested.
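The calculation in this example can be reproduced with the hypergeometric functions in base R. This is a minimal sketch: the two extra pathway p-values (`pathwayB`, `pathwayC`) are made-up numbers, included only to illustrate the multiple-comparison adjustment.

```r
# Numbers from the example above.
N <- 10000  # genes on the array
M <- 500    # apoptosis genes on the array
n <- 100    # differentially expressed genes
x <- 10     # apoptosis genes among the differentially expressed genes

# dhyper(x, m, n, k): m = "white balls", n = "black balls", k = sample size.
p.exact <- dhyper(x, M, N - M, n)

# Upper-tail p-value P(X >= x), summed term by term ...
p.sum <- sum(dhyper(x:min(n, M), M, N - M, n))
# ... or equivalently from the distribution function, as P(X > x - 1).
p.tail <- phyper(x - 1, M, N - M, n, lower.tail = FALSE)

# Hypothetical p-values for several pathways, adjusted for multiple testing.
p.pathways <- c(apoptosis = p.tail, pathwayB = 0.20, pathwayC = 0.65)
p.holm <- p.adjust(p.pathways, method = "holm")
p.bh <- p.adjust(p.pathways, method = "BH")

print(c(p.exact = p.exact, p.tail = p.tail))
print(rbind(holm = p.holm, BH = p.bh))
```

Note the `x - 1` in the `phyper` call: with `lower.tail = FALSE` the function returns P(X > q), so passing `x` itself would leave the observed count out of the tail.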

Limitations 

• The hypergeometric test only takes the size of gene sets 

into account. 

• All genes from the same group that are not significant are treated the same.

– What if they are “almost” significant?

• We are now thinking about the ranks of the genes. 

• Want to incorporate this rank information into our 

calculations.

Gene Set Analysis 

• How to incorporate functional information into statistical analysis?

• Want to make inferences about groups of genes (“gene 

sets”) rather than individual genes. 

• First publication to do this was by Mootha et al. (2003), 

which introduced an approach called gene set enrichment 

analysis (GSEA). 

– Somewhat limited - calculated the significance of the 

“most up-regulated” gene set in an experiment. 

– Later a second publication from the same group 

appeared in PNAS - nothing new (or exciting). 

– R code available, but not part of Bioconductor.

GSEA methodology (Mootha et al., 2003)

Significance analysis of function and expression 

• Barry et al. (2005) published a method called Significance 

Analysis of Function and Expression (SAFE) for detecting 

gene sets undergoing significant changes in expression 

between two experimental conditions. 

• Provided a simple approach to this type of analysis. 

• Also provided a Bioconductor package (safe) which allows this method to be used in R.

SAFE methodology 

• Gene sets are pre-defined using existing grouping (e.g., 

KEGG, GO, Biocarta) or expert knowledge. 

• Like the earlier GSEA method, SAFE breaks the analysis 

into two parts: 

1. Rank genes based on their level of differential expression 

between experimental conditions (local). 

2. Rank gene sets based on the ranking of the genes they 

contain (global). 

• Various methods can be used for each of the ranking steps.

SAFE gene (local) ranking 

• The default method used to rank genes based on their level 

of differential expression is the two sample t-test. 

– This provides an ordered list based on the absolute value 

of the test statistic for each gene. 

– Other statistics are obviously possible (e.g., SAM, 

limma etc). 

• No need to assess significance of differential expression - 

just want to rank the genes.
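The local ranking step can be sketched in base R on simulated data. This is an illustration of the idea, not the safe package's own code: the expression matrix, the group labels, the equal-variance form of the t-test, and the convention that a higher rank means more differential expression are all assumptions made here.

```r
set.seed(435)
n.genes <- 200

# Simulated log-expression matrix: rows are genes, columns are samples.
expr <- matrix(rnorm(n.genes * 12), nrow = n.genes,
               dimnames = list(paste0("gene", seq_len(n.genes)), NULL))
group <- factor(rep(c("A", "B"), each = 6))

# Shift the first 10 genes in group B so that some genes really differ.
expr[1:10, group == "B"] <- expr[1:10, group == "B"] + 2

# Two-sample (equal-variance) t-statistic for every gene.
tstat <- apply(expr, 1, function(y) {
  t.test(y[group == "B"], y[group == "A"], var.equal = TRUE)$statistic
})
names(tstat) <- rownames(expr)

# Rank on |t|: rank 1 is the least, rank n.genes the most, differentially expressed.
gene.rank <- rank(abs(tstat))
print(head(sort(gene.rank, decreasing = TRUE)))
```

No p-values are computed at this stage; only the ordering of the genes is carried forward to the gene set step.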

SAFE pathway (global) ranking 

• The default pathway-ranking method is to take the ranks 

for the genes in each pathway, and use a Wilcoxon Rank 

Sum statistic to rank the pathways. 

– Produces ranked list of pathways, based on the ranks of 

their constituent genes. 

– Again, other rank-based statistics are possible (e.g., 

Kolmogorov-Smirnov). 

• Permutation-based resampling used to assess significance, 

while accounting for correlation between pathways, and 

multiple testing.
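The global step and the permutation-based assessment can be sketched as follows, again in base R on simulated data. The gene sets, helper names (`local.ranks`, `global.stat`), and the number of permutations are hypothetical choices. The sketch computes unadjusted empirical p-values only and does not reproduce the multiple-testing adjustment offered by the safe package.

```r
set.seed(435)
n.genes <- 200
expr <- matrix(rnorm(n.genes * 12), nrow = n.genes)
group <- rep(c(0, 1), each = 6)
expr[1:10, group == 1] <- expr[1:10, group == 1] + 2  # genes 1-10 truly change

# Hypothetical gene sets, given as row indices of expr.
gene.sets <- list(setA = 1:10, setB = 51:70, setC = 101:115)

# Local step: equal-variance t-statistics for all genes, then ranks of |t|.
local.ranks <- function(expr, group) {
  x1 <- expr[, group == 1, drop = FALSE]
  x0 <- expr[, group == 0, drop = FALSE]
  n1 <- ncol(x1)
  n0 <- ncol(x0)
  pooled <- ((n1 - 1) * apply(x1, 1, var) + (n0 - 1) * apply(x0, 1, var)) /
    (n1 + n0 - 2)
  tstat <- (rowMeans(x1) - rowMeans(x0)) / sqrt(pooled * (1 / n1 + 1 / n0))
  rank(abs(tstat))
}

# Global step: Wilcoxon rank sum = sum of the gene ranks within each set.
global.stat <- function(ranks, sets) sapply(sets, function(s) sum(ranks[s]))

observed <- global.stat(local.ranks(expr, group), gene.sets)

# Permute the sample labels, recomputing both steps each time.
n.perm <- 500
perm <- replicate(n.perm,
                  global.stat(local.ranks(expr, sample(group)), gene.sets))

# Empirical p-value per set: share of permutations at least as extreme.
emp.p <- (rowSums(perm >= observed) + 1) / (n.perm + 1)
print(round(emp.p, 3))
```

Permuting sample labels, rather than genes, leaves the correlation between genes intact in every permuted data set.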

SAFE output 

• SAFE produces a list of significantly changed gene sets, 

based on a permutation p-value (and MCP adjustment 

incorporating dependency, if selected), using an α-level 

specified by the user. 

SAFE results: 

Local: t.Student 

Global: Wilcoxon 

Method: permutation 

Size Mean.Rank Emp.pvalue 

00860 15 2412.2 0.001 

04110 57 1955.9 0.003 

00561 12 2086.6 0.008 

04966 10 2312.9 0.011 

05144 32 1953.0 0.012 

00970 16 2133.1 0.018 

• Graphical summary (the “safeplot”) is also available.

Safeplot

Interpreting the safeplot 

• Significant changes in gene set activity are indicated by 

deviation from the diagonal. 

• The safeplot also shows the direction of differential 

expression observed for each gene in the group (of course to 

figure this out you need to know how the rankings relate to 

the comparison of interest).

R code for running SAFE 

library(safe) 

library(multtest) 

library(hu6800.db) 

data(golub) 

dimnames(golub)[[1]]

SAFE: summary 

• The SAFE methodology provides a relatively simple 

approach to investigating changes in the activity of 

pre-defined gene sets. 

• Various deficiencies exist (e.g., correlation between genes within a pathway not taken into account, magnitude of changes not accounted for by ranking method), but it is still a useful tool for this type of problem.

Extensions and improvements 

• Goeman and Bühlmann (2007) defined two different types

of hypotheses that could be tested in a gene set analysis: 

– a self-contained test, where each gene set is tested for 

differential expression between the sample classes 

independently of the other gene sets (the null 

distribution is generated by permuting the samples). 

– a competitive test, where each gene set is tested for 

differential expression between the sample classes 

relative to the other gene sets (the null distribution is 

generated by permuting the genes). 

• Tests of each type are implemented within the limma 

package, allowing use of a linear models approach.
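The difference between the two null distributions can be seen in a small base R simulation. This is a sketch under simplifying assumptions, not limma's implementation: the per-gene statistic is just the absolute difference in group means, and the gene set, sample sizes, and number of permutations are arbitrary choices.

```r
set.seed(2007)
n.genes <- 300
expr <- matrix(rnorm(n.genes * 10), nrow = n.genes)
group <- rep(c(0, 1), each = 5)
set.idx <- 1:15  # hypothetical gene set (row indices)
expr[set.idx, group == 1] <- expr[set.idx, group == 1] + 1.5

# Per-gene statistic: absolute difference in group means (kept simple on purpose).
gene.stat <- function(expr, group) {
  abs(rowMeans(expr[, group == 1]) - rowMeans(expr[, group == 0]))
}
set.stat <- function(stats, idx) mean(stats[idx])

stats.obs <- gene.stat(expr, group)
obs <- set.stat(stats.obs, set.idx)
n.perm <- 500

# Self-contained: permute sample labels; only the genes in the set matter.
null.self <- replicate(n.perm,
                       set.stat(gene.stat(expr, sample(group)), set.idx))
p.self <- (sum(null.self >= obs) + 1) / (n.perm + 1)

# Competitive: permute genes, i.e. compare with random sets of the same size.
null.comp <- replicate(n.perm,
                       set.stat(stats.obs, sample(n.genes, length(set.idx))))
p.comp <- (sum(null.comp >= obs) + 1) / (n.perm + 1)

print(c(self.contained = p.self, competitive = p.comp))
```

The self-contained test never looks at genes outside the set, while the competitive test never touches the sample labels.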

Gene set analysis with Limma 

• The roast command performs “rotation gene set testing” 

for linear models (mroast does the same for multiple gene 

sets): this is a self-contained test. 

• geneSetTest and wilcoxGST perform simple competitive gene set tests (similar to SAFE, but permuting genes rather than samples).

• romer provides “gene set enrichment analysis for linear 

models using rotation tests” as a competitive test.

geneSetTest (Limma) 

fit

geneSetTest (Limma) 

gst[which.min(gst)] 

04120 

0.0180696 

library(KEGG.db) 

path

Barcode plot (Limma) 

barcodeplot(selected=seq(1:length(tstat))[C.matrix[,which.min(gst)]==1],
            statistics=tstat, labels=c("Up","Down"),
            main=paste(path[match(names(gst[which.min(gst)]),names(path))][[1]],
                       ": p=", round(min(gst),5), sep=''))

Rotation-based gene set testing 

• Uses rotation of orthogonalized residuals from a linear 

model to generate p-values (other methods simply permute 

either genes or samples). 

• Applicable to correlated tests (e.g., correlations between 

genes), small sample sizes, and arbitrary experimental 

designs. 

• Wu et al. ROAST: rotation gene set tests for complex 

microarray experiments. Bioinformatics (2010) vol. 26 (17) 

pp. 2176-82.
