18.07.2013 Views

Molecular evolution of transcriptional regulation by Alan ... - Eisen Lab


Molecular evolution of transcriptional regulation

by

Alan Michael Moses

B.A. (Columbia University) 2000

A dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy

in

BIOPHYSICS

in the

GRADUATE DIVISION

of the

UNIVERSITY OF CALIFORNIA, BERKELEY

Committee in charge:

Professor Michael B. Eisen, Chair

Professor Adam P. Arkin

Professor Daniel S. Rokhsar

Professor Michael Levine

Spring 2005


The thesis of Alan Michael Moses is approved:

______________________________________________________________

Chair Date

______________________________________________________________

Date

______________________________________________________________

Date

______________________________________________________________

Date

University of California, Berkeley

Spring 2005


Molecular Evolution of Transcriptional Regulation

© 2005

by

Alan Michael Moses

This thesis is licensed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Abstract

Molecular evolution of transcriptional regulation

by

Alan Michael Moses

Doctor of Philosophy in Biophysics

University of California, Berkeley

Professor Michael B. Eisen, Chair

Regulation of transcription by sequence-specific DNA-binding proteins is an important component of many fundamental biological processes; these include autonomous cellular behavior, response to the environment, and the generation of functional and morphological diversity over development and evolution. Understanding how this regulatory information is encoded in genome sequences is therefore a major challenge. Taking advantage of the availability of the genome sequences of closely related organisms, I studied the evolution of the DNA binding sites recognized by these proteins (transcription factor binding sites) and found that, just as the pattern of evolution in protein coding sequences reflects the degeneracy of the genetic code, the evolution of these binding sites reflects the constraints under which they function. Using models of molecular evolution that described the observed patterns of evolution, I developed new computational approaches to determine both the sequence specificity of transcription factors and the sites in the genome to which these factors bind. These new methods exploit the specific patterns of evolution observed in transcription factor binding sites. I have applied these methods to test cases from the budding yeasts (sensu stricto Saccharomyces) and fruit fly (Drosophila) and to novel cases in the human genome, in order to discover the specificity of transcription factors and locations of binding sites.

The evolution of transcription factor binding sites not only helps reveal the regulatory information encoded in the genome, but is also a key mechanism underlying the evolution of transcription networks. In order to study the evolution of these networks, I have developed and applied computational methods to characterize the patterns of evolution of transcriptional regulatory systems over the ascomycete fungi. In order to relate binding site evolution at the level of DNA sequence to the evolution of transcriptional regulation, I have developed methods to systematically test for gains and losses of functional binding sites and to differentiate among evolutionary scenarios. Finally, I have suggested a general probabilistic framework to model the changing selective constraints on these sequences.

______________________________________________________________

Chair Date


To California


Table of contents

Part I – Introduction

1. Regulation of gene expression at the level of transcription

2. Understanding the evolution of non-coding DNA

3. Molecular mechanisms of transcriptional evolution

Part II – Identifying sequences controlling gene expression using data from a single species

4. Searching for statistically enriched sequences

5. Methods that associate sequences with other data

6. Improved methods of regulatory regions

7. Future challenges

Part III – Including molecular evolution in the search for transcription factor binding sites

8. Characterizing the constraints on transcription factor binding sites

9. Phylogenetic motif finding by expectation maximization on evolutionary mixtures

10. MONKEY: extending the probabilistic model to identify binding sites of factors with known specificity

11. Application to alignments of multiple primates and human-fugu conserved elements

12. Future challenges

Part IV – Evolution of transcriptional regulation

13. Evidence for evolution of transcription networks on a genome-wide scale

14. Evidence for losses and gains of functional binding sites

15. Future directions

Part V – Conclusions

16. Insight from quantitative evolutionary analyses


List of Figures

4-1 Enrichment of predicted zeste binding sites in bound fragments in a ChIP-chip experiment

4-2 Representing families of binding sites with probabilistic models

5-1 Z scores over the cell cycle

5-2 P-values from various statistical tests in the cell cycle dataset

5-3 Optimizing a matrix based on the KS-statistic

5-4 Clustering of 7-mers significantly associated with gene categories

6-1 Positional biases in binding sites associated with regulated genes

6-2 Upstream regions of the alpha cell type and other pheromone responsive genes

8-1 Characterized binding sites evolve more slowly than the promoters in which they are found

8-2 Comparison of rates of evolution to structures of protein-DNA complexes implies a model for the variation in the rate of evolution across binding motifs

8-3 Association between information profile and rate of evolution in characterized binding sites from SCPD

8-4 Test of the Halpern-Bruno proportionality

8-5 Information and rate of evolution for the recently reported Crz1p motif

9-1 Effect of models for motif evolution on motif detection

9-2 Effect of evolutionary distance on motif detection

10-1 Accuracy of p-value estimations

10-2 Significance of matches increases with evolutionary distance

10-3 Significance of binding sites in pairwise or three-way comparisons at similar evolutionary distance

10-4 Relationship between conserved Rpn4p-binding sites and expression

10-5 Some apparently functional Rpn4p-binding sites are not conserved

10-6 The evolutionary distance required to confidently identify conserved binding sites varies among transcription factors

10-7 Many characterized binding sites do not seem to be evolving under constraint

10-8 Applying MONKEY to alignments of Drosophila

11-1 Sequence logo representation of the LXRE

11-2 Closely spaced, highly conserved matches suggest the presence of a downstream CYP7A enhancer

11-3 Discovery of an enriched, conserved homeodomain-like motif using EMnEM

11-4 Identification of motifs specifically associated with enhancers that showed forebrain expression

13-1 Fungal Phylogeny

13-2 Conservation of Cis-Sequence Enrichment in Specific Gene Groups

13-3 Enrichment of Novel Sequences in Coregulated Genes from Other Species

13-4 Distribution of Cis-Regulatory Elements Upstream of Coregulated Genes

13-5 Spatial Relationships between Cis-Regulatory Elements

13-6 Position-Weight Matrices Representing Proteasome Cis-Regulatory Element

13-7 Sequence Alignment of the DNA-Binding Domain of Rpn4p and Its Orthologs

14-1 The CYP7A1 LXRE in alignment of primates

14-2 LXRa PPRE sites in an alignment of primates and mouse

14-3 P-values for the T-statistic to detect binding site evolution

14-4 Non-conserved binding sites in the Kr 730 enhancer

14-5 A misaligned bcd site in the eve stripe 2 enhancer

14-6 Enrichment of conserved and non-conserved binding sites in the regions bound by zeste

14-7 Evolutionary scenarios consistent with the observation of a non-conserved binding site in D. melanogaster

14-8 Enrichment of binding sites classified by evolutionary scenario

14-9 Evidence for purifying selection on zeste binding sites

15-1 Modeling changing evolutionary constraints, such as binding site gain or loss

15-2 Probabilistic evolutionary model for binding sites in un-alignable promoters


List of Tables

4-1 Running a discrete motif finder on the zeste data

8-1 Correlation between information content and substitutions per site for the experimentally characterized binding sites in the SCPD database

8-2 Evolution of motifs with known consensus, but binding sites identified by MEME

8-3 Motifs identified by MEME that do not correspond to the expected consensus sequences for the transcription factors thought to be regulating the cluster

9-1 Motif discovery using EMnEM and MEME

10-1 Definition of positive and negative sets of matrix matches

10-2 Performance of different scores in recognizing functional and nonfunctional sites

13-1 Orthologues assigned to S. cerevisiae genes


Acknowledgements

I gratefully acknowledge the Biophysics Graduate Group for financial and administrative support, particularly Shizuka Gannon, Diane Sigman, Bob Glaeser and Ehud Isacoff. I received much scientific and academic advice from many people at Berkeley including Adam Arkin, Jasper Rine, Mark Van der Laan, Terry Speed, Mike Levine and Dan Rokhsar, and especially Casey Bergman and Paul Spellman. I was lucky to be inspired and challenged in discussions with Hunter Fraser and Justin Fay and in attending a course by Michael Jordan. I have also been extremely fortunate to find myself in the midst of a scientific environment where colleagues have been excited to collaborate and share resources and data. Particularly generous to me have been Nobuo Ogawa, Jeff Wang, Len Pennacchio and Eddy Rubin.

Most important to the work presented here were my main collaborators in Mike Eisen’s Lab, Derek Chiang, Dan Pollard, Audrey Gasch, and of course Mike himself; I hereby acknowledge their guidance, insight, patience and kindness.

Finally, I thank my parents, sisters, grandmother, friends, and Tom, Rima and Miranda for their support and love.


Part I

Introduction


1. Regulation of gene expression at the level of transcription

The discovery of DNA as the genetic material (Hershey and Chase 1952) together with the mechanistic understanding of how the information encoded in DNA sequences is converted into protein sequences that carry out biological function (Crick et al. 1961) raises a fundamental question: How is the expression of the genetic material regulated (Orphanides and Reinberg 2002)?

One of the major mechanisms for gene regulation is regulation at the level of transcription (Jacob and Monod 1961). The best understood mechanism for regulation at the level of transcription is the action of sequence-specific DNA-binding proteins, which modulate the rate of transcript initiation through a variety of mechanisms (Gill 2001, Emerson 2002). These proteins recognize degenerate families of short sequences (Stormo 2000), and in eukaryotes they often act 'combinatorially', such that the effects of multiple factors are required to specify a given expression pattern (Wolberger 1999). Because the DNA-binding proteins can themselves be regulated at the level of transcription, they can be assembled into 'networks' that can transmit information through cascades or can set up arbitrarily complex patterns. The activity of transcription factors may also be regulated post-translationally, and they are often the effectors of classic signaling pathways. Because differences in gene expression programs are a major source of cellular diversity, it is perhaps not surprising (in retrospect) that many key molecules of developmental, evolutionary, and medical interest have turned out to be transcription factors.

Regulation by sequence-specific transcription factors is a major component of many fundamental biological processes including:

● Networks that produce autonomous cellular behaviors (Ptashne 1992), such as the eukaryotic cell cycle (Koch and Nasmyth 1994)

● Response to cellular environment (Gasch et al. 2000), such as response to oxysterols by LXR in mammals (Janowski et al. 1996)

● Cell-type differentiation, such as a/alpha determination in yeast (Johnson 1995)

● Developmental patterning (St Johnston and Nusslein-Volhard 1992), such as the gap/pair-rule system in Drosophila (Akam 1987)

● Evolutionary changes in morphology (Levine and Tjian 2003), such as changes in Ubx expression associated with appendage diversification in crustaceans (Averof and Patel 1997)

The availability of complete genome sequences has greatly impacted the study of transcriptional regulation. Technological innovations have allowed the measurement of gene expression (DeRisi et al. 1997) and in vivo binding of DNA-binding proteins (Lieb et al. 2001) on a genome-wide scale. Although these techniques have been applied mostly to yeast, systematic measurements of this kind are rapidly becoming available for many organisms (Arbeitman et al. 2002, Odom et al. 2004).

The purpose of this dissertation is twofold: first, to examine in detail the evolutionary properties of transcription factor binding sites in relation to the development of new methods and tools for their identification and annotation, and second, to examine the role of the evolution of binding sites as a mechanism underlying the evolution of gene expression.


Transcription factor binding sites in non-coding DNA

Understanding how transcriptional regulation is encoded in the genome is a major challenge. Transcriptional regulatory regions are found in non-coding DNA near (linked in cis) the genes they affect. However, because the non-coding regions of eukaryotic genomes are typically very large and the binding sites for transcription factors are small and degenerate (Stormo 2000), most of the sequences that seem to match the specificity of the binding proteins are not actually bound by the protein in vivo (Lieb et al. 2001). In eukaryotes, binding sites are often organized into modular groups called enhancers, which typically contain ~10 binding sites for ~3 different factors and specify a particular portion of a gene’s expression pattern. Developing approaches to identify transcription factor binding sites is a major area of research in computational biology, and many approaches have been developed (Tompa et al. 2005). In addition, many computational approaches have been developed to identify conserved regions in non-coding DNA (Berezikov et al. 2004, Loots et al. 2004, Bigelow et al. 2004, Loots et al. 2002, Lenhard et al. 2003, Sandelin et al. 2004, Mrowka et al. 2003), in the hope that these will point to regulatory elements. A detailed understanding of the constraints under which transcription factor binding sites evolve will allow development of more accurate models and more effective computational methods, thereby improving our ability to identify these regulatory regions.
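The point that short, degenerate binding sites match large non-coding regions many times by chance can be illustrated with a back-of-the-envelope calculation. The sketch below is not a method from this dissertation: it assumes a hypothetical consensus written in IUPAC degenerate codes, a background of independent, equally frequent bases, and a scan of the forward strand only. The function names and the example consensus are invented for illustration.

```python
from typing import List

# IUPAC degenerate nucleotide codes mapped to the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "N": "ACGT",
}


def match_probability(consensus: str) -> float:
    """Chance that a random site matches, assuming uniform independent bases."""
    p = 1.0
    for symbol in consensus:
        p *= len(IUPAC[symbol]) / 4.0
    return p


def expected_chance_matches(consensus: str, region_length: int) -> float:
    """Expected number of forward-strand matches in a random region."""
    positions = max(region_length - len(consensus) + 1, 0)
    return positions * match_probability(consensus)


def find_matches(consensus: str, sequence: str) -> List[int]:
    """Forward-strand start positions where the sequence fits the consensus."""
    width = len(consensus)
    return [
        start
        for start in range(len(sequence) - width + 1)
        if all(
            base in IUPAC[symbol]
            for symbol, base in zip(consensus, sequence[start:start + width])
        )
    ]


if __name__ == "__main__":
    # Hypothetical 6-bp motif with two fully degenerate positions.
    motif = "ACGTNN"
    print(match_probability(motif))                 # 0.00390625, i.e. 1 in 256
    print(expected_chance_matches(motif, 10_000))   # about 39 in 10 kb
    print(find_matches("RCGT", "AACGTGCGT"))        # [1, 5]
```

Under these simplifying assumptions, a motif with four specified positions is expected to occur by chance about once every 256 bp, so a few kilobases of non-coding sequence already contain many matches that need not be bound in vivo. Real genomes have non-uniform base composition, and sites can lie on either strand, so the numbers here are only indicative.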

References

Akam M. The molecular basis for metameric pattern in the Drosophila embryo. Development. 1987 Sep;101(1):1-22.

Arbeitman MN, Furlong EE, Imam F, Johnson E, Null BH, Baker BS, Krasnow MA, Scott MP, Davis RW, White KP. Gene expression during the life cycle of Drosophila melanogaster. Science. 2002 Sep 27;297(5590):2270-5. Erratum in: Science 2002 Nov 8;298(5596):1172.

Averof M, Patel NH. Crustacean appendage evolution associated with changes in Hox gene expression. Nature. 1997 Aug 14;388(6643):682-6.

Berezikov E, Guryev V, Plasterk RH, Cuppen E. CONREAL: conserved regulatory elements anchored alignment algorithm for identification of transcription factor binding sites by phylogenetic footprinting. Genome Res. 2004;14:170-178.

Bigelow HR, Wenick AS, Wong A, Hobert O. CisOrtho: a program pipeline for genome-wide identification of transcription factor target genes using phylogenetic footprinting. BMC Bioinformatics. 2004;5:27.

Crick FHC, Barnett L, Brenner S, Watts-Tobin RJ. General nature of the genetic code for proteins. Nature. 1961;192:1227-1232.

DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 1997 Oct 24;278(5338):680-6.

Emerson BM. Specificity of gene regulation. Cell. 2002 May 3;109(3):267-70.

Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO. Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. 2000 Dec;11(12):4241-57.

Gill G. Regulation of the initiation of eukaryotic transcription. Essays Biochem. 2001;37:33-43.

Hershey AD, Chase M. Independent functions of viral protein and nucleic acid in growth of bacteriophage. J Gen Physiol. 1952;36:39-56.

Jacob F, Monod J. Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol. 1961;3:318-356.

Janowski BA, Willy PJ, Devi TR, Falck JR, Mangelsdorf DJ. An oxysterol signalling pathway mediated by the nuclear receptor LXR alpha. Nature. 1996 Oct 24;383(6602):728-31.

Johnson AD. Molecular mechanisms of cell-type determination in budding yeast. Curr Opin Genet Dev. 1995 Oct;5(5):552-8.

Koch C, Nasmyth K. Cell cycle regulated transcription in yeast. Curr Opin Cell Biol. 1994;6:451-459.

Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW. Identification of conserved regulatory elements by comparative genome analysis. J Biol. 2003;2:13.

Levine M, Tjian R. Transcription regulation and animal diversity. Nature. 2003 Jul 10;424(6945):147-51.

Lieb JD, Liu X, Botstein D, Brown PO. Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat Genet. 2001 Aug;28(4):327-34. Erratum in: Nat Genet 2001 Sep;29(1):100.

Loots GG, Ovcharenko I, Pachter L, Dubchak I, Rubin EM. rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res. 2002;12:832-839.

Loots GG, Ovcharenko I. rVISTA 2.0: evolutionary analysis of transcription factor binding sites. Nucleic Acids Res. 2004;32(Web Server):W217-W221.

Mrowka R, Steinhage K, Patzak A, Persson PB. An evolutionary approach for identifying potential transcription factor binding sites: the renin gene as an example. Am J Physiol Regul Integr Comp Physiol. 2003;284:R1147-R1150.

Odom DT, Zizlsperger N, Gordon DB, Bell GW, Rinaldi NJ, Murray HL, Volkert TL, Schreiber J, Rolfe PA, Gifford DK, Fraenkel E, Bell GI, Young RA. Control of pancreas and liver gene expression by HNF transcription factors. Science. 2004 Feb 27;303(5662):1378-81.

Orphanides G, Reinberg D. A unified theory of gene expression. Cell. 2002 Feb 22;108(4):439-51.

Ptashne M. A Genetic Switch, 2nd edn. Cambridge, MA: Cell Press and Blackwell Press; 1992.

Sandelin A, Wasserman WW, Lenhard B. ConSite: web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Res. 2004;32(Web Server):W249-W252.

St Johnston D, Nusslein-Volhard C. The origin of pattern and polarity in the Drosophila embryo. Cell. 1992 Jan 24;68(2):201-19.

Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000 Jan;16(1):16-23.

Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005 Jan;23(1):137-44.

Wolberger C. Multiprotein-DNA complexes in transcriptional regulation. Annu Rev Biophys Biomol Struct. 1999;28:29-56.


2. Understanding the evolution of non-coding DNA

An important reason for studying the evolution of transcription factor binding sites is to fill in the picture of the evolutionary constraints on non-coding DNA, about which relatively little is known. Historically, non-coding DNA has been assumed to be under much less functional constraint than coding sequences, as it was believed to be largely ‘junk’ (Ohno 1972). While it is clear that transposons and other simple repetitive elements comprise a large portion of non-coding DNA, it is unclear what portion of the functional non-coding DNA is involved in the regulation of transcription. If cis-regulatory DNA makes up a large portion of non-coding DNA (Cameron et al. 2004), understanding the evolution of regulatory DNA will be critical in any understanding of evolution at the molecular level. Recently, non-coding DNA from closely related organisms has become increasingly available (e.g., Thomas et al. 2003), and understanding the patterns of evolution in non-coding regions is an important challenge. Genome-wide screens for highly conserved elements have revealed non-coding elements showing extreme conservation (Bejerano et al. 2004, Dermitzakis et al. 2003), a large number of which may drive gene expression patterns (Woolfe et al. 2005); their extreme conservation may reflect the presence of multiple overlapping functional constraints or indicate that they represent a class of as yet undiscovered non-coding elements. A better understanding of the evolution of transcription factor binding sites may help to clarify their relative contribution to these observations or to suggest new hypotheses.

References

Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D. Ultraconserved elements in the human genome. Science. 2004 May 28;304(5675):1321-5. Epub 2004 May 6.

Cameron RA, Oliveri P, Wyllie J, Davidson EH. cis-Regulatory activity of randomly chosen genomic fragments from the sea urchin. Gene Expr Patterns. 2004 Mar;4(2):205-13.

Dermitzakis ET, Reymond A, Scamuffa N, Ucla C, Kirkness E, Rossier C, Antonarakis SE. Evolutionary discrimination of mammalian conserved non-genic sequences (CNGs). Science. 2003 Nov 7;302(5647):1033-5.

Hardison RC. Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet. 2000 Sep;16(9):369-72.

Hunt C, Morimoto RI. Conserved features of eukaryotic hsp70 genes revealed by comparison with the nucleotide sequence of human hsp70. Proc Natl Acad Sci U S A. 1985 Oct;82(19):6455-9.

Ohno S. So much "junk" DNA in our genome. In: Evolution of Genetic Systems, edited by H.H. Smith. New York: Gordon and Breach; 1972. pp. 366-370.

Thomas JW, Touchman JW, Blakesley RW, Bouffard GG, Beckstrom-Sternberg SM, Margulies EH, Blanchette M, Siepel AC, Thomas PJ, McDowell JC, Maskeri B, Hansen NF, Schwartz MS, Weber RJ, Kent WJ, Karolchik D, Bruen TC, Bevan R, Cutler DJ, Schwartz S, Elnitski L, Idol JR, Prasad AB, Lee-Lin SQ, Maduro VV, Summers TJ, Portnoy ME, Dietrich NL, Akhter N, Ayele K, Benjamin B, Cariaga K, Brinkley CP, Brooks SY, Granite S, Guan X, Gupta J, Haghighi P, Ho SL, Huang MC, Karlins E, Laric PL, Legaspi R, Lim MJ, Maduro QL, Masiello CA, Mastrian SD, McCloskey JC, Pearson R, Stantripop S, Tiongson EE, Tran JT, Tsurgeon C, Vogt JL, Walker MA, Wetherby KD, Wiggins LS, Young AC, Zhang LH, Osoegawa K, Zhu B, Zhao B, Shu CL, De Jong PJ, Lawrence CE, Smit AF, Chakravarti A, Haussler D, Green P, Miller W, Green ED. Comparative analyses of multi-species sequences from targeted genomic regions. Nature. 2003 Aug 14;424(6950):788-93.

Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K, Walter K, Abnizova I, Gilks W, Edwards YJ, Cooke JE, Elgar G. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 2005 Jan;3(1):e7.


3. <strong>Molecular</strong> mechanisms <strong>of</strong> <strong>transcriptional</strong> <strong>evolution</strong><br />

Although the importance <strong>of</strong> regulatory <strong>evolution</strong> has long been recognized (Wilson<br />

1976), recent elucidation <strong>of</strong> the role <strong>of</strong> <strong>transcriptional</strong> <strong>regulation</strong> in development has<br />

suggested a specific picture <strong>of</strong> how <strong>evolution</strong> <strong>of</strong> <strong>transcriptional</strong> <strong>regulation</strong> can generate<br />

<strong>evolution</strong>ary diversity <strong>of</strong> morphology (Levine and Tjian, 2003). Differences in<br />

regulation of transcription can occur either in cis or in trans, but most attention has been<br />

focused on changes in cis-regulatory sequences. Less is known about changes in trans,<br />

and they are <strong>of</strong>ten presumed to be much less common (and therefore less important)<br />

because they are more likely to be pleiotropic (Wray et al. 2003). Systematic studies <strong>of</strong><br />

the genetic variation in gene expression within or between closely related species (Brem<br />

et al. 2002, Wittkopp et al. 2004) support the idea that there is abundant genetic variation<br />

in gene expression, both in cis and trans. The few cases where regulatory changes have<br />

been characterized in detail suggest that <strong>evolution</strong>ary changes in gene expression <strong>of</strong>ten<br />

have both cis and trans components (Gompel et al. 2005, Ludwig et al. 2005), with the<br />

important caveat that a cis-regulatory change for one gene may represent a trans change<br />

for a gene downstream. Ultimately, it seems likely that an understanding <strong>of</strong> both cis and<br />

trans changes will be critical if we are to obtain a mechanistic picture <strong>of</strong> the <strong>evolution</strong> <strong>of</strong><br />

gene <strong>regulation</strong> at the <strong>transcriptional</strong> level.<br />

Evolution <strong>of</strong> cis-regulatory elements<br />

What is certainly clear is that cis-regulatory changes are much easier to study; while trans<br />

changes may occur at the loci <strong>of</strong> any number <strong>of</strong> regulatory proteins or in any protein that<br />

lies upstream genetically, cis-regulatory changes are confined to the locus <strong>of</strong> the gene<br />

whose expression has evolved. Further, because transcription factor binding sites may<br />

represent a major part <strong>of</strong> the cis-regulatory DNA for many well-studied genes, they are<br />

an appropriate class <strong>of</strong> regulatory sequences on which to focus <strong>evolution</strong>ary studies.<br />

Analysis <strong>of</strong> the <strong>evolution</strong> <strong>of</strong> the expression <strong>of</strong> particular genes has yielded several<br />

interesting scenarios whose relative contributions to the <strong>evolution</strong> <strong>of</strong> gene expression<br />

remain unknown. For example, it has long been observed that regulatory elements may<br />

be conserved over long <strong>evolution</strong>ary distances (e.g., Hunt and Morimoto 1985) and that<br />

conserved non-coding sequences <strong>of</strong>ten turn out to function as regulatory sequences<br />

(Hardison 2000). On the other hand, landmark work has shown that lack of sequence<br />

conservation cannot be interpreted as change in function (Ludwig et al. 2000), as<br />

compensatory changes can apparently restore function and, conversely, that reasonably<br />

good sequence conservation can be associated with functional differences (Ludwig et al.<br />

2005). Finally, cases <strong>of</strong> cis-regulatory <strong>evolution</strong> over short <strong>evolution</strong>ary distances that<br />

lead to substantial functional differences have also been noted (e.g., Fang and Brennan<br />

1992), and simulations have suggested that transcription factor binding sites may emerge<br />

from random sequence <strong>by</strong> genetic drift on relatively short timescales (Stone and Wray<br />

2001).<br />

The availability of genome sequences that can be compared at the level of non-coding<br />

DNA (Kellis et al. 2003, Cliften et al. 2003, Richards et al. 2005) allows<br />

interrogation <strong>of</strong> these issues systematically, on a genomic scale. Formulation <strong>of</strong> explicit<br />

models for these processes will allow tests for selection and will differentiate between<br />

<strong>evolution</strong>ary hypotheses.<br />

Evolution <strong>of</strong> transcription networks<br />

Understanding the <strong>evolution</strong> <strong>of</strong> individual binding sites, or even the <strong>evolution</strong> <strong>of</strong> the<br />

expression <strong>of</strong> individual genes, is not sufficient to understand the contribution <strong>of</strong><br />

<strong>transcriptional</strong> <strong>regulation</strong> to <strong>evolution</strong>ary diversity. Multiple transcription factors and<br />

their targets are <strong>of</strong>ten linked in complex networks that ultimately produce biological<br />

function and pattern. It is therefore important to consider the <strong>evolution</strong> <strong>of</strong> regulatory<br />

systems, rather than single genes. Because <strong>transcriptional</strong> <strong>regulation</strong> is multifactorial, its<br />

<strong>evolution</strong> may be complex, and multiple components may evolve simultaneously.<br />

Much <strong>of</strong> the thinking about the <strong>evolution</strong> <strong>of</strong> <strong>transcriptional</strong> regulatory networks<br />

has been from the following perspective (Schneider 2000, Sengupta et al. 2002, Gerland<br />

and Hwa 2002, Berg et al. 2004): the cell wants to express some set <strong>of</strong> genes at a<br />

particular time and, therefore, designs a DNA binding protein that binds to their<br />

promoters and to a minimum number <strong>of</strong> other promoters. Given assumptions about the<br />

relationships of protein-DNA binding energy to expression levels, expression levels<br />

to selection, and selection to constraints on DNA sequences, these models attempt to<br />

address the emergence <strong>of</strong> binding sites under selection or the variability in sequence<br />

specificity or in binding energy <strong>of</strong> transcription factors. While these models are<br />

conceptually interesting, there is little data to test the predictions they make and even less<br />

to support the assumptions on which they are based.<br />

On the other hand, due to advances in concepts and techniques as well as accurate<br />

quantitative data on which to base these models, modeling <strong>of</strong> regulatory networks in<br />

single species has advanced rapidly in recent years (Wolf and Arkin 2003). This has<br />

allowed the first comparative studies <strong>of</strong> detailed models <strong>of</strong> regulatory networks (Rao et<br />

al. 2004). Genome-wide functional measurements and sequence data from closely<br />

related species, combined with an understanding <strong>of</strong> the <strong>evolution</strong> <strong>of</strong> <strong>transcriptional</strong><br />

regulatory sequences, will allow us to characterize patterns <strong>of</strong> <strong>evolution</strong> <strong>of</strong> regulatory<br />

networks.<br />

References<br />

Berg J, Willmann S, Lassig M. Adaptive <strong>evolution</strong> <strong>of</strong> transcription factor binding sites. BMC Evol Biol.<br />

2004 Oct 28;4(1):42.<br />

Brem RB, Yvert G, Clinton R, Kruglyak L. Genetic dissection <strong>of</strong> <strong>transcriptional</strong> <strong>regulation</strong> in budding<br />

yeast. Science. 2002 Apr 26;296(5568):752-5.<br />

Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M.<br />

Finding functional features in Saccharomyces genomes <strong>by</strong> phylogenetic footprinting. Science. 2003 Jul<br />

4;301(5629):71-6. Epub 2003 May 29.<br />

Fang XM, Brennan MD. Multiple cis-acting sequences contribute to evolved regulatory variation for<br />

Drosophila Adh genes. Genetics. 1992 Jun;131(2):333-43.<br />

Gerland U, Hwa T. On the selection and <strong>evolution</strong> <strong>of</strong> regulatory DNA motifs. J Mol Evol. 2002<br />

Oct;55(4):386-400.<br />

Gompel N, Prud'homme B, Wittkopp PJ, Kassner VA, Carroll SB. Chance caught on the wing: cis-regulatory<br />

evolution and the origin of pigment patterns in Drosophila. Nature. 2005 Feb 3;433(7025):481-7.<br />

Hardison RC. Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet.<br />

2000 Sep;16(9):369-72.<br />

Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison <strong>of</strong> yeast species to<br />

identify genes and regulatory elements. Nature. 2003 May 15;423(6937):241-54.<br />

Levine M, Tjian R. Transcription <strong>regulation</strong> and animal diversity. Nature. 2003 Jul 10;424(6945):147-51.<br />

Ludwig MZ, Bergman C, Patel NH, Kreitman M. Evidence for stabilizing selection in a eukaryotic<br />

enhancer element. Nature. 2000 Feb 3;403(6769):564-7.<br />

Ludwig MZ, Palsson A, Alekseeva E, Bergman CM, Nathan J, Kreitman M. Functional Evolution of a cis-Regulatory<br />

Module. PLoS Biol. 2005 Mar 15;3(4):e93.<br />

Rao CV, Kirby JR, Arkin AP. Design and diversity in bacterial chemotaxis: a comparative study in<br />

Escherichia coli and Bacillus subtilis. PLoS Biol. 2004 Feb;2(2):E49.<br />

Richards S, Liu Y, Bettencourt BR, Hradecky P, Letovsky S, Nielsen R, Thornton K, Hubisz MJ, Chen R,<br />

Meisel RP, Couronne O, Hua S, Smith MA, Zhang P, Liu J, Bussemaker HJ, van Batenburg MF, Howells<br />

SL, Scherer SE, Sodergren E, Matthews BB, Crosby MA, Schroeder AJ, Ortiz-Barrientos D, Rives CM,<br />

Metzker ML, Muzny DM, Scott G, Steffen D, Wheeler DA, Worley KC, Havlak P, Durbin KJ, Egan A,<br />

Gill R, Hume J, Morgan MB, Miner G, Hamilton C, Huang Y, Waldron L, Verduzco D, Clerc-Blankenburg<br />

KP, Dubchak I, Noor MA, Anderson W, White KP, Clark AG, Schaeffer SW, Gelbart W, Weinstock GM,<br />

Gibbs RA. Comparative genome sequencing of Drosophila pseudoobscura: chromosomal, gene, and cis-element<br />

evolution. Genome Res. 2005 Jan;15(1):1-18.<br />

Schneider TD. Evolution <strong>of</strong> biological information. Nucleic Acids Res. 2000 Jul 15;28(14):2794-9.<br />

Sengupta AM, Djordjevic M, Shraiman BI. Specificity and robustness in transcription control networks.<br />

Proc Natl Acad Sci U S A. 2002 Feb 19;99(4):2072-7.<br />

Stone JR, Wray GA. Rapid <strong>evolution</strong> <strong>of</strong> cis-regulatory sequences via local point mutations. Mol Biol Evol.<br />

2001 Sep;18(9):1764-70.<br />

Wilson, A. C., 1976 Gene <strong>regulation</strong> and <strong>evolution</strong>, pp. 225-234 in <strong>Molecular</strong> Evolution, edited <strong>by</strong> F. J.<br />

Ayala. Sinauer Associates, Sunderland, Mass.<br />

Wittkopp PJ, Haerum BK, Clark AG. Evolutionary changes in cis and trans gene <strong>regulation</strong>. Nature. 2004<br />

Jul 1;430(6995):85-8.<br />

Wolf DM, Arkin AP. Motifs, modules and games in bacteria. Curr Opin Microbiol. 2003 Apr;6(2):125-34.<br />

Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, Rockman MV, Romano LA. The evolution of<br />

transcriptional regulation in eukaryotes. Mol Biol Evol. 2003 Sep;20(9):1377-419. Epub 2003 May 30.<br />

Part III<br />

Including molecular <strong>evolution</strong> in the search for transcription factor binding sites<br />

8. Characterizing the constraints on transcription factor binding sites<br />

There are now closely related genomes available for nearly all experimental organisms. If<br />

the non-coding regions <strong>of</strong> a group <strong>of</strong> organisms can be aligned, it is possible to infer the<br />

<strong>evolution</strong>ary history <strong>of</strong> the sequences, and incorporate this information into the search for<br />

functional non-coding elements.<br />

A widely used approach is to search for regions <strong>of</strong> non-coding DNA that show<br />

slower <strong>evolution</strong> than the average for the locus or species being compared. This can be<br />

done using an <strong>evolution</strong>ary model. However, methods <strong>of</strong> this type identify generic<br />

conserved regions that may be any <strong>of</strong> several types <strong>of</strong> conserved non-coding DNA. In<br />

order to identify the binding sites for transcription factors, whether the specificity is already known or<br />

must be learned from the data, a detailed understanding of the evolution of binding sites seems<br />

necessary. We therefore examine the <strong>evolution</strong> <strong>of</strong> characterized transcription factor<br />

binding sites from S. cerevisiae. This work was published as Moses et al. 2003.<br />

Position specific variation in the rate <strong>of</strong> <strong>evolution</strong> in transcription factor binding sites<br />

Transcription factors recognize degenerate families <strong>of</strong> short sequences (5–25 base pairs).<br />

The binding specificities <strong>of</strong> transcription factors are typically represented as consensus<br />

sequences or position weight matrices [1] that summarize their position-specific sequence<br />

preferences. In some cases, such 'motif' models <strong>of</strong> transcription factor binding sites can<br />

be inferred from genome sequences using computational methods [2–7]. Despite the<br />

absence <strong>of</strong> a detailed understanding <strong>of</strong> the <strong>evolution</strong> <strong>of</strong> transcription factor binding sites,<br />

the comparison <strong>of</strong> sequences from related species has been used to identify transcription<br />

factor binding sites en masse, with the guiding hypothesis that functional regulatory<br />

sequences will be more conserved than the surrounding DNA. Several methods [8–12]<br />

have been developed to identify conserved non-coding sequences that, when tested, <strong>of</strong>ten<br />

function as regulatory sequences in vivo (reviewed in [13]).<br />

Here we characterize the <strong>evolution</strong> <strong>of</strong> known transcription factor binding sites<br />

using the complete genome sequences <strong>of</strong> the closely related budding yeasts<br />

Saccharomyces mikatae, S. bayanus, S. paradoxus [12] and S. cerevisiae [14]. We limit<br />

our focus to the conservation <strong>of</strong> binding sites due to purifying selection [15–18], though<br />

binding site turnover [15,16] (the loss and reappearance <strong>of</strong> binding sites) and other<br />

processes also occur. Preferential conservation <strong>of</strong> transcription factor binding sites has<br />

been observed previously in the genomes <strong>of</strong> organisms from bacteria [10,11] to mammals<br />

[16–18], and we expect the same to be true <strong>of</strong> yeast. In addition to the availability <strong>of</strong><br />

complete genome sequences, the budding yeasts are a particularly appealing system in<br />

which to test these hypotheses because <strong>of</strong> the relative wealth and easy accessibility <strong>of</strong><br />

biochemical and genetic information (e.g., [20]).<br />

Characterizing the pattern <strong>of</strong> <strong>evolution</strong> within transcription factor binding sites<br />

allows us to explore the nature <strong>of</strong> functional constraints on these sequences. As is well<br />

known for protein sequences [21–23], we expect the pattern <strong>of</strong> <strong>evolution</strong> in transcription<br />

factor binding sites to reflect the particular patterns <strong>of</strong> constraint under which they<br />

function; important regions or residues should be constrained, while unimportant<br />

positions may show fixed changes. Unlike protein sequences, where the relationship <strong>of</strong><br />

the amino acid sequence to the functional constraint is <strong>of</strong>ten difficult to discern, in the<br />

case <strong>of</strong> transcription factor binding sites, we suggest that the <strong>evolution</strong>ary constraints can<br />

be interpreted directly with respect to the physical constraints imposed by the DNA-binding<br />

protein. Protein-DNA interactions are of much interest (e.g., [24–27]) and an<br />

understanding <strong>of</strong> the <strong>evolution</strong> <strong>of</strong> the binding motifs may provide insight into these<br />

interactions. In particular, it has recently been shown that there is a relationship between<br />

the pattern <strong>of</strong> degeneracy in certain binding motifs and regions <strong>of</strong> contact between the<br />

DNA and the binding protein: positions with fewer points <strong>of</strong> contact in the structures <strong>of</strong><br />

protein-DNA complexes show greater variability among binding sites within a single<br />

genome [28]. If these degenerate positions are less important for the formation <strong>of</strong> the<br />

protein-DNA complex, they might be expected to show less constrained <strong>evolution</strong>, as<br />

changes at these positions have a smaller effect on the relative fitness <strong>of</strong> the organism,<br />

and therefore may become fixed in the population <strong>by</strong> drift with greater probability.<br />

Conversely, changes at positions in the motif that disrupt the recognition <strong>of</strong> the binding<br />

site by the binding protein are likely to be deleterious, and therefore removed from the<br />

population <strong>by</strong> purifying selection. This intuition leads to a theoretical prediction that the<br />

rate <strong>of</strong> <strong>evolution</strong> at each position is a function <strong>of</strong> the frequencies in the position weight<br />

matrix (analogous to the predictions for protein sequences found in [29]).<br />

Characterized binding sites show fewer substitutions than background DNA<br />

We first sought to verify that functional non-coding regions evolve more slowly than<br />

'background sequences.' To do so, we selected several transcription factors for which<br />

there were multiple experimentally validated binding sites in the S. cerevisiae genome<br />

listed in the Promoter Database of Saccharomyces cerevisiae (SCPD [20]), and compared<br />

the rate <strong>of</strong> <strong>evolution</strong> within these binding sites to that <strong>of</strong> the promoter regions in which<br />

they were found. We measured the rate <strong>of</strong> <strong>evolution</strong> in substitutions (i.e., inferred<br />

nucleotide changes) per site, where 'site' refers to a single nucleotide position, not the<br />

multi-basepair 'binding sites' <strong>of</strong> transcription factors. We first looked at Gal4p, a very<br />

well studied Zn2Cys6 binuclear cluster domain transcriptional activator [30]. The<br />

average rate of evolution within known Gal4p binding sites is 0.32 (+0.12, −0.09, n =<br />

119) substitutions per site, substantially slower than the 0.75 (± 0.03, n = 2760)<br />

substitutions per site observed in the promoters in which these Gal4p binding sites are<br />

found (fig. 1A, 1B; compare Gal4 'motif' and 'background').<br />

Figure 1.<br />

Characterized binding sites evolve more slowly than the promoters in which they<br />

are found.<br />

A. Histogram <strong>of</strong> the rate <strong>of</strong> <strong>evolution</strong> (estimated <strong>by</strong> maximum parsimony) in characterized Gal4p binding<br />

sites and randomly chosen sequences <strong>of</strong> the same length (17 basepairs) from the same promoters. B.<br />

Differences in the mean rate <strong>of</strong> <strong>evolution</strong> in motifs and the mean rate in the promoters in which they are<br />

found. Grey boxes represent the average in binding sites; unfilled boxes represent the average over the<br />

promoters in which the motifs are found (see methods). Error bars represent exact 95 % confidence<br />

intervals for a Poisson distribution.<br />
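The exact Poisson interval mentioned in the caption can be sketched in a few lines. This is a minimal illustration, not the code used in the thesis: it assumes the standard exact interval obtained by inverting the Poisson tail probabilities, and the function names (`poisson_cdf`, `bisect_decreasing`, `exact_poisson_ci`) are hypothetical. The count of 38 substitutions is also an assumption, back-calculated from the reported 0.32 substitutions per site over n = 119 sites.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term (k >= 0)."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)

def bisect_decreasing(func, lo, hi, tol=1e-10):
    """Root of a decreasing function with func(lo) > 0 > func(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if func(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exact_poisson_ci(k, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for the mean, given count k."""
    half = alpha / 2.0
    if k == 0:
        lower = 0.0
    else:
        # Lower limit: P(X >= k | lam) = alpha / 2
        lower = bisect_decreasing(
            lambda lam: half - (1.0 - poisson_cdf(k - 1, lam)), 0.0, float(k))
    # Upper limit: P(X <= k | lam) = alpha / 2
    search_max = k + 10.0 * math.sqrt(k + 1.0) + 10.0
    upper = bisect_decreasing(
        lambda lam: poisson_cdf(k, lam) - half, float(k), search_max)
    return lower, upper

# Assumed example: about 38 substitutions observed over 119 nucleotide sites.
k, n = 38, 119
lo, hi = exact_poisson_ci(k)
print(k / n, lo / n, hi / n)
```

Under the assumed count, the per-site interval comes out close to the asymmetric interval reported for Gal4p. The asymmetry is the reason for using the exact interval rather than a normal approximation when counts are small.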

To test the generality <strong>of</strong> this observation, we chose six other transcription factors<br />

representing different types <strong>of</strong> DNA-binding domains (see table 1) with relatively many<br />

characterized binding sites in the SCPD database. In each case there are significantly<br />

fewer substitutions (p < 0.05, 1000 bootstraps) in the characterized binding sites than in<br />

the promoters in which they lie (figure 1B), suggesting that, in general, characterized<br />

transcription factor binding sites evolve more slowly than the surrounding intergenic<br />

sequences. This is consistent with the hypothesis that these sequences are under<br />

functional constraint and their <strong>evolution</strong> reflects purifying selection.<br />
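The "substitutions per site" measure used above is estimated by maximum parsimony. A minimal sketch of how such a count could be obtained with Fitch's algorithm follows. The tree topology (S. cerevisiae and S. paradoxus as sisters, then S. mikatae, then S. bayanus), the species labels, and the toy alignment are assumptions made for illustration; the sketch ignores alignment gaps, and the thesis methods may differ in detail.

```python
def fitch(tree, column):
    """Fitch parsimony on a nested-tuple binary tree.

    Leaves are species names; `column` maps species name -> nucleotide.
    Returns (possible ancestral states, minimum number of changes).
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, column)
    right_states, right_changes = fitch(right, column)
    common = left_states & right_states
    if common:
        return common, left_changes + right_changes
    # Disjoint state sets imply one additional substitution.
    return left_states | right_states, left_changes + right_changes + 1

# Assumed topology for the four species (labels are hypothetical).
TREE = ((("Scer", "Spar"), "Smik"), "Sbay")

def substitutions_per_site(alignment):
    """Total parsimony changes divided by the number of aligned columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {species: seq[i] for species, seq in alignment.items()}
        total += fitch(TREE, column)[1]
    return total / length

# Toy, made-up alignment of a 5 bp stretch.
toy = {"Scer": "CGGAT", "Spar": "CGGAT", "Smik": "CGGTT", "Sbay": "CGCTA"}
print(substitutions_per_site(toy))
```

In the toy alignment the first two columns are invariant and the remaining three each require one change, so the estimate is three changes over five sites. Parsimony undercounts multiple changes at the same site, which matters less for closely related species such as these.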

Table 1: Correlation between information content and substitutions per site for the<br />

experimentally characterized binding sites in the SCPD database.<br />

Functionally important positions are preferentially conserved<br />

In order to further explore the functional constraints on transcription factor binding site<br />

<strong>evolution</strong>, we computed the rate <strong>of</strong> <strong>evolution</strong> at each position within the motif and<br />

observed that the rate <strong>of</strong> <strong>evolution</strong> is not constant over the binding sites. Some positions<br />

in the motif show fewer substitutions than background, while others do not. For example,<br />

in the Gal4p binding sites positions 1, 2, 3, 15, 16, and 17 show fewer substitutions than<br />

do positions 4–14 (fig. 2, right panel). Functionally important positions are expected to<br />

be under stronger purifying selection and therefore show stronger conservation. Indeed,<br />

the conserved positions in the Gal4p binding sites correspond to the points <strong>of</strong> contact in<br />

the crystal structure <strong>of</strong> the protein-DNA complex (fig. 2, right panel) that are required for<br />

the recognition <strong>of</strong> the target sequence [30].<br />

Figure 2.<br />

Comparison <strong>of</strong> rates <strong>of</strong> <strong>evolution</strong> to structures <strong>of</strong> protein-DNA complexes implies a<br />

model for the variation in the rate <strong>of</strong> <strong>evolution</strong> across binding motifs.<br />

The DNA backbone appears as a red helix; proteins appear as linked coloured cylinders. We propose that<br />

the formation <strong>of</strong> the protein-DNA complex is the functional constraint that leads to purifying selection, and<br />

therefore fewer substitutions at certain positions in the binding motif. Images <strong>of</strong> protein-DNA complex<br />

structures are from the Protein Data Bank [47]. Rate <strong>of</strong> <strong>evolution</strong> is in substitutions per site (estimated <strong>by</strong><br />

maximum parsimony) and error bars represent exact 95 % confidence intervals for a Poisson distribution.<br />

Another particularly interesting example is the case <strong>of</strong> Mcm1p. Although there is no<br />

specific base in the consensus at positions 8, 9 and 10, there is a strong A/T bias in the<br />

matrix at these positions and mutagenesis studies [31] <strong>of</strong> the binding site have suggested<br />

that this is needed to allow the high degree <strong>of</strong> bending known to be necessary for the<br />

formation of the Mcm1p-DNA complex [32–34]. The relative paucity of substitutions at<br />

positions 8, 9 and 10 (0.37, 0.22 and 0.5 respectively, compared to 0.70 over the entire<br />

promoters) further supports the notion that the constraint on functionally important<br />

positions slows their <strong>evolution</strong>.<br />

Positional variation within one genome is correlated with variation between genomes

Noting that positions with fewer substitutions seem to coincide with the positions that are<br />

non-degenerate in the consensus, we constructed position weight matrices using the<br />

characterized binding sites from S. cerevisiae and, in order to quantify the degeneracy,<br />

computed the information content at each position. The information content <strong>of</strong> a position<br />

within a binding site has been shown to correlate with the importance <strong>of</strong> that position in<br />

the formation <strong>of</strong> the protein-DNA complex [28]. For the transcription factors used above<br />

(fig. 1B), we observe that positions <strong>of</strong> high information correspond to positions with<br />

fewer substitutions (e.g., fig. 3). In 6 of 7 cases we found this correlation (Spearman's<br />

rank correlation coefficients of −0.70 to −0.84) statistically significant (p < 0.01), the lone exception being Tbp1p,<br />

where a negative correlation was observed (−0.46), but was not significant (p = 0.11)<br />

(table 1; see discussion). Thus the sequence variation in characterized transcription<br />

factor binding sites within one genome is directly related to the sequence variation at<br />

individual sites between genomes.<br />
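The comparison described above can be sketched as follows. This is an illustration only: it assumes the common definition of information content with a uniform background (2 bits minus the entropy of the column, with no small-sample correction), and both the example matrix and the per-position rates are invented numbers, not values from the thesis.

```python
import math

def information_content(freqs):
    """Bits at one motif position: 2 + sum(f * log2 f), uniform background assumed."""
    return 2.0 + sum(f * math.log2(f) for f in freqs if f > 0)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical 4-position matrix, rows are (fA, fC, fG, fT).
pwm = [(0.0, 0.0, 1.0, 0.0),
       (0.5, 0.0, 0.0, 0.5),
       (0.25, 0.25, 0.25, 0.25),
       (0.7, 0.1, 0.1, 0.1)]
info = [information_content(row) for row in pwm]
rates = [0.1, 0.4, 0.8, 0.5]  # made-up substitutions per site
print(info, spearman(info, rates))
```

With these invented numbers the ranking of positions by information content is exactly the reverse of their ranking by rate, so the coefficient is at its negative extreme; real motifs give weaker correlations, such as those reported in table 1.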

Figure 3.<br />

Association between information profile and rate of evolution in characterized<br />

binding sites from SCPD.<br />

A–D. Representative plots <strong>of</strong> information content and substitutions per site reveal a correspondence<br />

between positions <strong>of</strong> high information content and slower rates <strong>of</strong> <strong>evolution</strong>. Open symbols represent<br />

information content and filled symbols the number <strong>of</strong> substitutions per site (estimated <strong>by</strong> maximum<br />

parsimony). Consensus letters are included below the appropriate positions in the motif.<br />

Site-specific substitution rates are consistent with the proportionality <strong>of</strong> Halpern and<br />

Bruno<br />

If the nucleotide frequencies at each position <strong>of</strong> a position weight matrix accurately<br />

reflect the allowed sequence specificity for the formation <strong>of</strong> a functional protein-DNA<br />

complex, it is possible, under several assumptions, to predict the rates <strong>of</strong> <strong>evolution</strong> based<br />

on these frequencies, as has been done using the frequencies <strong>of</strong> residues in protein<br />

sequences [29]. The underlying intuition is that if, for example, at a given position in the<br />

motif, a transcription factor recognizes only guanine, i.e., (fA,fC,fG,fT) = (0,0,1,0), a<br />

mutation to any other nucleotide should prohibit formation <strong>of</strong> the protein-DNA complex,<br />

and therefore be deleterious. Such mutations should be removed from the population and<br />

therefore the number <strong>of</strong> observed substitutions at such a position is expected to be very<br />

small. Similarly, if the binding protein requires, say, A or T at a given position with no<br />

preference, i.e., (fA,fC,fG,fT) = (½, 0, 0, ½), we expect changes between A and T to persist<br />

in the population, but changes to C or G to be removed; we should therefore observe<br />

somewhat more substitutions, but still fewer than at positions where there is no<br />

preference at all, i.e., (fA,fC,fG,fT) = (¼, ¼, ¼, ¼), and all types <strong>of</strong> substitutions are<br />

permitted. Under several assumptions, it is possible to write the following proportionality<br />

for the rates <strong>of</strong> substitution between various residues and a function <strong>of</strong> their frequencies<br />

([29], equation 10; see methods).<br />

\[
R_{abp} \propto P_{ab} \times \frac{\ln\left(\dfrac{f_{bp} P_{ba}}{f_{ap} P_{ab}}\right)}{1 - \dfrac{f_{ap} P_{ab}}{f_{bp} P_{ba}}},
\]

where Rabp is the observed rate of substitution from residue a to residue b at position p,<br />

Pab and Pba are the (position independent) underlying rates of mutation from residue a to<br />

residue b and b to a, respectively, and fap and fbp are the frequencies of residue a and b at<br />

position p in the position weight matrix. The predicted rate of evolution at each position,<br />

Kp, is just the sum of the Rabp times the probability that that base was observed, i.e.,<br />

\[
K_p \propto \sum_{a} \sum_{b \neq a} f_{ap} R_{abp}.
\]


In order to test these predictions, we estimated a background mutation model (Pab) <strong>by</strong><br />

fitting the HKY85 model [34] to entire promoter sequences using the PAML package<br />

[35], treating all positions independently (see methods). Using the seven position weight<br />

matrices (fap) trained on the characterized binding sites (all from S. cerevisiae,) and<br />

scaling the proportionality <strong>by</strong> the total number <strong>of</strong> changes observed in the motif, we<br />

compared the predicted rates to the observed rates (figure 4).<br />

Although there is considerable scatter, the observed rates of evolution agree reasonably well<br />

with the predictions (R² = 0.67).<br />
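To make the Halpern-Bruno proportionality concrete, the sketch below evaluates the predicted relative rate Kp for the three example positions discussed earlier. It is illustrative only and rests on stated assumptions: a uniform mutation model (every Pab equal to 1) stands in for the HKY85 background fitted in the thesis, zero frequencies are handled by taking the limit of the expression, and all function names are hypothetical. It uses the identity ln(x) / (1 − 1/x) = x ln(x) / (x − 1), with x = (fbp Pba) / (fap Pab).

```python
import math

BASES = "ACGT"

def hb_factor(x):
    """x * ln(x) / (x - 1), with the limits 0 at x = 0 and 1 at x = 1."""
    if x <= 0.0:
        return 0.0
    if abs(x - 1.0) < 1e-12:
        return 1.0
    return x * math.log1p(x - 1.0) / (x - 1.0)

def relative_rate(a, b, f, P):
    """R_abp up to a constant, for frequencies f and mutation rates P."""
    if f[a] == 0.0:
        return 0.0  # base a never occurs here, so it contributes nothing to K_p
    x = (f[b] * P[(b, a)]) / (f[a] * P[(a, b)])
    return P[(a, b)] * hb_factor(x)

def predicted_rate(f, P):
    """K_p up to a constant: sum over a and b != a of f_ap * R_abp."""
    return sum(f[a] * relative_rate(a, b, f, P)
               for a in BASES for b in BASES if a != b)

# Assumption for illustration: uniform mutation model, P_ab = 1 for all a != b.
P_UNIFORM = {(a, b): 1.0 for a in BASES for b in BASES if a != b}

only_g = dict(zip(BASES, (0.0, 0.0, 1.0, 0.0)))
a_or_t = dict(zip(BASES, (0.5, 0.0, 0.0, 0.5)))
no_pref = dict(zip(BASES, (0.25, 0.25, 0.25, 0.25)))

for name, f in [("G only", only_g), ("A or T", a_or_t), ("no preference", no_pref)]:
    print(name, predicted_rate(f, P_UNIFORM))
```

Under the uniform mutation assumption the three positions are predicted to evolve at relative rates 0, 1 and 3, which matches the ordering argued in the text: no substitutions where a single base is required, some where two bases are tolerated, and the most where there is no preference. In practice a pseudocount is usually added to the matrix so that no frequency is exactly zero.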

Figure 4.<br />

Test <strong>of</strong> the Halpern-Bruno proportionality.<br />

Observed rate <strong>of</strong> <strong>evolution</strong> versus the predictions based on the nucleotide frequencies in the binding motif<br />

in S. cerevisiae. Each point represents the predicted and observed rates at a given position in a motif. For<br />

each factor the proportionality has been normalized <strong>by</strong> the total number <strong>of</strong> substitutions observed in the<br />

corresponding binding sites. See text for details.<br />

Computationally predicted binding sites show similar <strong>evolution</strong>ary properties<br />

There are relatively few transcription factors for which the number <strong>of</strong> experimentally<br />

characterized binding sites was sufficient to reliably estimate the information profile and<br />

rate <strong>of</strong> <strong>evolution</strong> at each position. To further establish the generality <strong>of</strong> these observations<br />

we extended the analysis to include additional factors where some information regarding<br />

the consensus or target genes was available. We ran the MEME motif-finding program<br />

[3] on the promoter regions <strong>of</strong> groups <strong>of</strong> genes that showed similar expression patterns to<br />

the known targets <strong>of</strong> these factors in microarray experiments to derive models <strong>of</strong> their<br />

binding specificity and identify putative binding sites. As in the experimentally<br />

characterized cases, the rate <strong>of</strong> <strong>evolution</strong> in these binding sites was slower than that <strong>of</strong> the<br />

promoters in which the sequences were found (table 2). Furthermore, most of the motifs<br />

showed the characteristic correlation between the information content at each position<br />

and the number <strong>of</strong> substitutions per site (table 2).<br />

Table 2.<br />

Evolution <strong>of</strong> motifs with known consensus, but binding sites identified <strong>by</strong> MEME<br />



Here binding sites are identified <strong>by</strong> running the MEME program [3] on genes that clustered with targets in<br />

micro-array gene expression data. Expected consensus sequences (from [20,40] or [7]) are underlined.<br />

'Motif subs.' and 'bg subs.' are the substitutions per site in the binding sites and the promoters in which they<br />

are found respectively. 'Corr.' and 'p-value' are the Spearman's rank correlation coefficient and the<br />

associated p-value between the rate <strong>of</strong> <strong>evolution</strong> at each position and the information content at each<br />

position. * Indicates significance at a per factor error rate <strong>of</strong> < 0.05. ** Indicates significance after<br />

Bonferroni correction for a global error rate < 0.05, assuming 50 tests were done in total. (?) indicates<br />

uncertainty as to the identity <strong>of</strong> the binding protein. + indicates clusters taken from hierarchical clustering<br />

[41] of yeast data from the Stanford Microarray database [42], ++ indicates clusters taken from hierarchical<br />

clustering <strong>of</strong> 300 genetic perturbations [43] and +++ indicates clusters taken from hierarchical clustering <strong>of</strong><br />

64 control experiments [43].<br />

The pattern <strong>of</strong> <strong>evolution</strong> may be useful in distinguishing real motifs from computational<br />

artifacts<br />

A challenge in computational motif detection is that algorithms <strong>of</strong>ten identify sequence<br />

motifs that do not represent real transcription factor binding sites. For example, in<br />

addition to the cases described above (table 2), there were several cases where the motif<br />

identified <strong>by</strong> MEME was not the binding motif for the factor known to regulate these<br />

genes. We computed the number <strong>of</strong> substitutions per site as well as the correlation<br />

between the number <strong>of</strong> substitutions and the information content for these motifs as well,<br />

and found no significant correlations (Table 3), suggesting that the reported motifs in<br />

these cases may be computational artefacts. It is possible that a reduction in the average<br />

number <strong>of</strong> substitutions per site, and a correlation between the information pr<strong>of</strong>ile and the<br />

substitutions across the motif will prove to be useful heuristics in assessing the support<br />

from comparative sequence data for computationally identified motifs. In order to further<br />

test this idea we ran MEME on the promoters <strong>of</strong> a group <strong>of</strong> proposed Crz1p target genes<br />



identified in a recent microarray study [36]. We found that the resulting motif (figure 5)<br />

was on average more conserved (0.38 subs. per site, n = 297) than the promoters <strong>of</strong> the<br />

genes in the group (0.65 subs. per site, n = 11832). In addition, it showed the<br />

characteristic correlation between the information pr<strong>of</strong>ile and rate <strong>of</strong> <strong>evolution</strong> across the<br />

motif (Spearman's rank = -0.78, p = 0.001). Thus, in this case, the comparative sequence<br />

data support the hypothesis that this is a functional binding motif in these genes.<br />

Table 3.<br />

Motifs identified <strong>by</strong> MEME that do not correspond to the expected consensus<br />

sequences for the transcription factors thought to be regulating the cluster.<br />

These motifs do not show the characteristic correlation with rate <strong>of</strong> substitution or the substantial decrease<br />

in substitution rate observed for the computationally identified motifs with the expected consensus. +<br />

indicates clusters taken from hierarchical clustering <strong>of</strong> yeast data from the Stanford Microarray database<br />

[42], ++ indicates clusters taken from hierarchical clustering <strong>of</strong> 300 genetic perturbations [43].<br />



Figure 5.<br />

Information and rate <strong>of</strong> <strong>evolution</strong> for the recently reported Crz1p motif.<br />

This motif shows the characteristic pattern <strong>of</strong> <strong>evolution</strong> observed for real motifs. Open symbols represent<br />

information content and filled symbols, the number of substitutions per site (estimated by maximum<br />

parsimony). Consensus letters are included below the appropriate positions in the motif.<br />

Discussion<br />

Motifs are conserved on average, but individual binding sites are not perfectly conserved<br />

We confirm an important motivating assumption of comparative sequencing projects: the<br />

rate of evolution within functional non-coding sequence elements is slower than that of the<br />

surrounding intergenic DNA (fig. 1). Although this means that binding sites are conserved on<br />

average, it is important to note that in no case was the average number of<br />

substitutions over the motif reduced to zero. Since substitutions do occur in characterized<br />

binding sites, simply searching through alignments for perfectly conserved segments<br />

would not have revealed all the real binding sites used in this study. Nevertheless,<br />

binding sites do show characteristic patterns <strong>of</strong> <strong>evolution</strong>, and it should be possible to<br />

take these into account in attempting to distinguish the functional instances <strong>of</strong> the motif.<br />

Position specific variation in the rate <strong>of</strong> <strong>evolution</strong> is consistent with models <strong>of</strong> functional<br />

constraint<br />

The observation that the rate <strong>of</strong> <strong>evolution</strong> is not constant over functional non-coding<br />

DNA sequences mirrors similar observations <strong>of</strong> regional variation in the number <strong>of</strong><br />

substitutions per site in peptide sequences; residues that are more important to the<br />

function or structure <strong>of</strong> the protein change much less rapidly, presumably because<br />

mutations at these positions are likely to be deleterious, and therefore do not drift to<br />



fixation [21–23]. By analogy to peptide sequences, the observation that the positions in<br />

functional non-coding DNA with high information content evolve more slowly is<br />

consistent with these positions being more important for the formation <strong>of</strong> the protein-<br />

DNA complex, and therefore under more functional constraint. Unlike peptide sequences,<br />

however, the purifying selection and accompanying reduction in the rate <strong>of</strong> substitution<br />

in transcription-factor binding sites seems to be a relatively straightforward mapping<br />

from the physical interaction <strong>of</strong> the DNA with the binding protein (as in fig. 2). Since the<br />

information content has been shown to correlate with the physical constraints imposed <strong>by</strong><br />

transcription factors on their motifs [28] it is consistent that we observe significant<br />

correlations between the information pr<strong>of</strong>iles and the rate <strong>of</strong> <strong>evolution</strong> as well. The<br />

binding sites <strong>of</strong> sequence specific transcription factors afford a rare opportunity to test<br />

theoretical predictions <strong>of</strong> the effects <strong>of</strong> purifying selection on site-specific rates <strong>of</strong><br />

<strong>evolution</strong>. By assuming the nucleotide frequencies from position specific weight matrices<br />

are the equilibrium frequencies under the purifying selection imposed on these sequences,<br />

we could make seemingly reasonable predictions for the rate <strong>of</strong> <strong>evolution</strong> at each position<br />

(figure 4). Although we do not have sufficient data to reliably estimate the rates for each<br />

type of substitution (e.g., A→T vs. A→G), the results presented here are promising. The<br />

same intuition that allows us to construct position weight matrices (i.e., that we may<br />

average over all the binding sites to learn the average sequence specificity) allows us to<br />

compute the rate <strong>of</strong> <strong>evolution</strong> across the motif <strong>by</strong> averaging the changes observed in the<br />

individual binding sites.<br />

Improved understanding <strong>of</strong> binding site <strong>evolution</strong> can guide the use <strong>of</strong> comparative data<br />



An accurate understanding <strong>of</strong> the <strong>evolution</strong> <strong>of</strong> functional regulatory sequences is critical<br />

to the optimal use <strong>of</strong> comparative sequence data in the analysis <strong>of</strong> <strong>transcriptional</strong><br />

<strong>regulation</strong>. Without such an understanding, it remains difficult to distinguish sequences<br />

under functional constraint from sequences that are similar because <strong>of</strong> shared descent, or<br />

to differentiate among the various classes <strong>of</strong> conserved non-coding sequences. We<br />

believe our observations linking position-specific variation in the rate <strong>of</strong> <strong>evolution</strong> within<br />

transcription factor binding sites to position-specific sequence variation within genomes<br />

(and to structural features <strong>of</strong> the protein-DNA complex) will be useful in comparative<br />

sequence analysis. For example, comparative sequence data can be used to verify the<br />

predictions <strong>of</strong> de novo motif finding algorithms that have been applied to single genomes,<br />

<strong>by</strong> allowing us to ascribe increased confidence to predicted motifs that are also<br />

conserved. However, simply assessing whether motifs are 'present' in other species can be<br />

ineffective as similar sequences are expected to be present in closely related species<br />

because they have had insufficient time to diverge or as the result <strong>of</strong> other functional<br />

constraints. We propose that the patterns <strong>of</strong> <strong>evolution</strong> we observe for known motifs –<br />

their conservation relative to flanking sequences and the correlation between position-<br />

specific rate <strong>of</strong> <strong>evolution</strong> and intragenomic degeneracy – can more accurately distinguish<br />

motif models that correspond to bona fide transcription factor binding sites from<br />

computational artefacts (compare tables 2 and 3). As a demonstration we show that<br />

comparative sequence data support the motif reported in [36] (fig. 5). Verification of<br />

computationally predicted motifs may be an immediate practical application <strong>of</strong> our<br />

observations and computational methods that incorporate models <strong>of</strong> binding site<br />

<strong>evolution</strong> should take more effective advantage <strong>of</strong> comparative sequence data. More<br />



generally, just as faster <strong>evolution</strong> at synonymous sites is an <strong>evolution</strong>ary signature <strong>of</strong><br />

protein coding regions [21–23], the pattern <strong>of</strong> position-specific variation in <strong>evolution</strong>ary<br />

rates within binding sites can be thought <strong>of</strong> as an <strong>evolution</strong>ary signature <strong>of</strong> transcription<br />

factors. We have shown here how these <strong>evolution</strong>ary signatures might be used to identify<br />

sequences or motifs that collectively have the properties <strong>of</strong> transcription factor binding<br />

sites. With sufficient sequence data, it should ultimately be possible to estimate the rate<br />

<strong>of</strong> <strong>evolution</strong> at every base in a genome, and to identify individual short sequences with<br />

the <strong>evolution</strong>ary characteristics <strong>of</strong> functional transcription factor binding sites.<br />

Conclusions<br />

We show that the rate <strong>of</strong> <strong>evolution</strong> in characterized and predicted transcription factor<br />

binding sites is slower than that <strong>of</strong> the intergenic regions in which they are found. In<br />

addition we show that there is position specific variation in the rate <strong>of</strong> <strong>evolution</strong> across<br />

these binding sites. We show that this variation is correlated to the variability in the<br />

sequence specificity for that factor and can be modelled <strong>by</strong> assuming that purifying<br />

selection acts to maintain these specificities. Together this suggests that the variation in<br />

the rate <strong>of</strong> <strong>evolution</strong> is a direct reflection <strong>of</strong> differences in the strength <strong>of</strong> purifying<br />

selection due to differing physical constraints on the DNA imposed <strong>by</strong> the interaction<br />

with the binding protein. The characterization <strong>of</strong> the pattern <strong>of</strong> conservation over known<br />

binding sites is an important step in understanding the <strong>evolution</strong> <strong>of</strong> functional non-coding<br />

DNA, and perhaps also towards the general understanding <strong>of</strong> protein-DNA interactions.<br />

Our observations should contribute to the effectiveness <strong>of</strong> comparative non-coding<br />

sequence analysis.<br />



Methods<br />

Rates <strong>of</strong> binding-site and intergenic <strong>evolution</strong><br />

Global alignments <strong>of</strong> intergenic regions from S. mikatae, S. paradoxus and S. bayanus<br />

were computed using clustalw (as described in [12]). Using the accepted species tree<br />

(Sbay, Smik,(Spar, Scer)) [8,12], we computed the minimal number <strong>of</strong> changes needed<br />

for each column <strong>of</strong> the alignment (the so called cost) using the classical parsimony<br />

algorithm (as described in [37]). We included only alignments where sequence from all<br />

four species was available; regions of ambiguity or missing sequence in the alignment<br />

were treated as gaps. The average rate <strong>of</strong> <strong>evolution</strong> within a binding site (in fig. 1A) is the<br />

sum <strong>of</strong> the cost at each position in the binding site divided <strong>by</strong> its length. The average rate<br />

<strong>of</strong> <strong>evolution</strong> for a motif (in fig. 1B) is the sum <strong>of</strong> the cost in the binding sites divided <strong>by</strong><br />

the total number <strong>of</strong> ungapped positions in the binding sites. Although gaps are not<br />

expected in alignments <strong>of</strong> functional binding sites, we allow for them so that we can<br />

apply the same metrics to binding sites as the surrounding sequences. The background<br />

histogram in figure 1A was made <strong>by</strong> calculating the average rate <strong>of</strong> <strong>evolution</strong> in<br />

randomly drawn 17-mers from the promoters <strong>of</strong> the genes containing the binding sites.<br />

The rate <strong>of</strong> background <strong>evolution</strong> (in fig. 1B) is the sum <strong>of</strong> the cost over the entire<br />

alignment divided <strong>by</strong> the total number <strong>of</strong> ungapped positions. The average rate <strong>of</strong><br />

<strong>evolution</strong> at each position is the sum (over all the binding sites) <strong>of</strong> the cost at that position<br />

divided <strong>by</strong> the total number <strong>of</strong> binding sites that have no gap at that position. Although a<br />

maximum likelihood estimator for the number <strong>of</strong> substitutions per site in DNA sequences<br />

has been constructed [38] its performance is expected to be similar to parsimony methods<br />



for short <strong>evolution</strong>ary distances as are considered here. We note that the rate <strong>of</strong><br />

background <strong>evolution</strong> differed significantly among the groups <strong>of</strong> genes examined<br />

(characterized targets, expression clusters). We address this variation, and examine<br />

possible explanations in another manuscript (Hunter B Fraser, AMM and MBE in<br />

preparation).<br />
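The per-column cost and the averages defined in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each alignment column is encoded as a four-character string in the order Sbay, Smik, Spar, Scer; the tree is rooted between (Sbay, Smik) and (Spar, Scer), which gives the same parsimony cost as the unrooted species tree; a Fitch-style set-intersection rule is used; and any column containing a character other than A, C, G or T is treated as gapped. The function names are hypothetical.

```python
def fitch_cost(column):
    """Minimum number of changes for one column (order: Sbay, Smik, Spar, Scer).

    Returns None for a gapped or ambiguous column (assumption of this sketch).
    """
    if any(base not in "ACGT" for base in column):
        return None

    def combine(left, right):
        # Fitch rule: keep the intersection if non-empty, else take the
        # union and count one change.
        common = left & right
        return (common, 0) if common else (left | right, 1)

    sbay, smik, spar, scer = ({base} for base in column)
    inner_a, cost_a = combine(spar, scer)
    inner_b, cost_b = combine(sbay, smik)
    _, cost_root = combine(inner_a, inner_b)
    return cost_a + cost_b + cost_root


def motif_rate(sites):
    """Total cost over all binding sites divided by the ungapped positions."""
    costs = [fitch_cost(col) for site in sites for col in site]
    scored = [c for c in costs if c is not None]
    return sum(scored) / len(scored)


def position_rates(sites):
    """Average cost at each motif position over the sites ungapped there."""
    rates = []
    for p in range(len(sites[0])):
        scored = [c for c in (fitch_cost(site[p]) for site in sites)
                  if c is not None]
        rates.append(sum(scored) / len(scored) if scored else float("nan"))
    return rates
```

Here `sites` is a list of aligned binding sites of equal width, each given as a list of column strings; the same `motif_rate` function applied to all columns of a promoter alignment gives the background rate.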

Statistics<br />

A Poisson distribution for the number <strong>of</strong> substitutions was used when reporting<br />

confidence intervals, and for error bars in figures 1 and 2, because this is thought to be a<br />

reasonable model for the underlying distribution for neutral substitution events. The<br />

significance <strong>of</strong> difference <strong>of</strong> means between motifs and background was estimated <strong>by</strong><br />

bootstrapping. We randomly selected sequences the same length as the motif (with<br />

replacement) from the upstream regions in which they were found, until we had the same<br />

number as we had characterized binding sites. We then calculated for these samples the<br />

mean number <strong>of</strong> substitutions exactly as for the characterized binding sites. Finally, we<br />

repeated this process 1000 times, and asked how <strong>of</strong>ten we observed an <strong>evolution</strong>ary rate<br />

smaller than for the characterized sites. Both the rate <strong>of</strong> <strong>evolution</strong> in promoter sequences<br />

and the locations <strong>of</strong> yeast transcription factor binding sites are known to show positional<br />

preferences ([39], AMM and MBE, unpublished data). To control for possible effects <strong>of</strong><br />

these biases, we also calculated the number <strong>of</strong> substitutions in sequences <strong>of</strong> the same size<br />

as the motif 5 basepairs away on either side <strong>of</strong> the binding sites, for each <strong>of</strong> the factors<br />

shown in fig. 1B. To be as conservative as possible, we simply computed the probability<br />

<strong>of</strong> observing the number <strong>of</strong> changes in the binding sites out <strong>of</strong> the total number <strong>of</strong><br />



changes in the binding sites and the flanking regions with no assumptions about the<br />

underlying distribution <strong>of</strong> the changes (using the hypergeometric distribution) and found<br />

that there were fewer substitutions in the binding sites than in the flanking regions (data<br />

not shown).<br />
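The two tests described in this section can be sketched as follows. This is a hedged illustration, not the code used for the analysis: `promoter_costs` is assumed to be a list of per-column parsimony costs for each upstream region (gaps are ignored for simplicity), and the hypergeometric test is set up under one possible reading of the text, in which the changes are distributed over the pooled binding-site and flanking positions with at most one change per position. All function and parameter names are hypothetical.

```python
import random
from math import comb


def bootstrap_fraction(observed_rate, promoter_costs, width, n_sites,
                       n_resamples=1000, rng=None):
    """Fraction of resamples whose mean rate is smaller than the observed one.

    Each resample draws n_sites windows of the motif width, with
    replacement, from the upstream regions.
    """
    rng = rng or random.Random(0)
    windows = [(i, start)
               for i, costs in enumerate(promoter_costs)
               for start in range(len(costs) - width + 1)]
    smaller = 0
    for _ in range(n_resamples):
        total = 0
        for _ in range(n_sites):
            i, start = rng.choice(windows)
            total += sum(promoter_costs[i][start:start + width])
        if total / (n_sites * width) < observed_rate:
            smaller += 1
    return smaller / n_resamples


def hypergeom_lower_tail(site_changes, n_changes, site_positions,
                         total_positions):
    """P(X <= site_changes) when n_changes fall at random among
    total_positions, of which site_positions lie in binding sites."""
    other_positions = total_positions - site_positions
    favourable = sum(comb(site_positions, i) * comb(other_positions, n_changes - i)
                     for i in range(site_changes + 1))
    return favourable / comb(total_positions, n_changes)
```

A small value of `bootstrap_fraction` indicates that randomly drawn promoter segments rarely evolve as slowly as the characterized sites; a small value of `hypergeom_lower_tail` indicates fewer changes in the sites than expected given the flanking regions.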

Identification <strong>of</strong> binding sites and construction <strong>of</strong> position weight matrices<br />

Characterized binding sites were taken from SCPD [20] for Gal4p (n = 10), Mcm1p (n =<br />

35), Abf1p (n = 16), and Rap1p (n = 17). For some <strong>of</strong> the short Gcn4p (n = 15), Reb1p (n<br />

= 18) and Tbp1p (n = 15) sites up to 5 flanking base pairs were included. Although SCPD<br />

lists additional binding sites for many <strong>of</strong> these factors, we excluded many <strong>of</strong> these<br />

because they were redundant listings <strong>of</strong> binding sites that have been characterized<br />

multiple times (e.g., STE6 has 4 Mcm1p binding sites listed, but these are actually the<br />

same two listed twice) or they were found in divergently transcribed genes, and were<br />

listed independently for both genes (e.g., GAL1 and GAL10 both have 4 Gal4p binding<br />

sites listed, but in fact they share these sites). For each factor, the sequences were aligned<br />

using the MEME program [3] and the 'letter-probability-matrix' from its output was used<br />

as the position weight matrix. SCPD lists binding sites for many other regulatory<br />

elements and transcription factors, most <strong>of</strong> which have few sites, or have sites from a<br />

small number <strong>of</strong> target genes. For each transcription factor, we attempted to identify<br />

groups <strong>of</strong> genes with similar expression patterns as the known target genes, as well as<br />

known target genes <strong>of</strong> other transcription factors (from [40]). These groups were then<br />

chosen <strong>by</strong> hand from hierarchical clustering [41] <strong>of</strong> expression data from various<br />

experimental treatments and over the cell-cycle, downloaded from the Stanford<br />



Microarray Database [42] or from 300 publicly available deletion and drug treatment<br />

experiments or 64 control experiments [43]. We ran MEME on the putative promoter<br />

regions <strong>of</strong> genes in expression clusters with the following parameters: motif width was<br />

allowed to range between 8 and 16, 'zoops' and 'tcm' models were both tried for each<br />

case, and both strands <strong>of</strong> the promoter were searched. When the 'tcm' model was used, we<br />

specified between 0.5 n and 2 n for the number <strong>of</strong> occurrences where n is the number <strong>of</strong><br />

genes in the cluster. For MEME runs, promoter regions were taken to be the 600<br />

basepairs upstream <strong>of</strong> the translation start (basepairs in other coding regions were<br />

excluded), except in the case <strong>of</strong> the proteasome and the repressed stress genes where 300<br />

basepairs were used because <strong>of</strong> a positional bias in the location <strong>of</strong> those binding sites<br />

(AMM and MBE, unpublished results). For computationally predicted binding sites,<br />

occurrences were taken to be those listed in the MEME output, and the 'letter-<br />

probability-matrix' was used as position weight matrix. In the case <strong>of</strong> Crz1p we used the<br />

starting consensus NNNNGGCNCNN, which was reported in [36].<br />

Correlation with information pr<strong>of</strong>iles<br />

Information at each position was calculated as<br />

$$I_p = 2 - H_p = 2 + \sum_{b} f_{bp} \log_2 f_{bp}, \qquad H_p = -\sum_{b} f_{bp} \log_2 f_{bp}$$

where $f_{bp}$ is the frequency of base $b$ at position $p$ in the motif, with $b \in \{A, C, G, T\}$, and<br />

$p \in [1, W]$ where $W$ is the width of the motif. Spearman's rank-order correlation (the<br />

linear correlation <strong>of</strong> the ranks) was computed and the significance <strong>of</strong> the correlation<br />

coefficient was assigned as described in [44].<br />
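As a minimal sketch of this calculation, the following Python code computes the information content at each position as 2 bits minus the Shannon entropy of the column, and the Spearman correlation as the linear correlation of the ranks (ties receive their average rank). The input layout (one list of four base frequencies per position) and the function names are assumptions made for illustration; the significance calculation of [44] is not reproduced.

```python
from math import log2


def information_profile(pwm):
    """Information (bits) per position; pwm is a list of [fA, fC, fG, fT]."""
    # Zero frequencies contribute nothing, since f * log2(f) -> 0 as f -> 0.
    return [2 + sum(f * log2(f) for f in column if f > 0) for column in pwm]


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Calling `spearman(information_profile(pwm), rates)` with the per-position substitution rates gives the correlation coefficient; a negative value corresponds to slower evolution at positions of higher information content.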



Predictions <strong>of</strong> the rate <strong>of</strong> <strong>evolution</strong><br />

We follow exactly the derivation for protein sequences found in [29]. Briefly, if we<br />

assume that sites are independent, <strong>evolution</strong> is reversible, and underlying probabilities <strong>of</strong><br />

mutation are invariant across sites, we can write the rate <strong>of</strong> <strong>evolution</strong> at each position as<br />

$$R_{abp} \propto P_{ab} \times F_{abp}$$

where Rabp is the rate <strong>of</strong> substitution from residue a to residue b at position p, Pab is the<br />

rate <strong>of</strong> mutation from residue a to residue b and Fabp is the probability <strong>of</strong> fixation <strong>of</strong> a<br />

mutation from residue a to residue b at position p. If we assume that the time <strong>of</strong> fixation<br />

is small relative to the time between fixations, a so-called weak-mutation model [45], we<br />

can use Kimura's equations [46] and write the following.<br />

$$F_{abp} = \frac{1 - e^{-2 s_p}}{1 - e^{-2 N s_p}} \approx \frac{2 s_p}{1 - e^{-2 N s_p}}, \qquad F_{bap} = \frac{1 - e^{2 s_p}}{1 - e^{2 N s_p}} \approx \frac{-2 s_p}{1 - e^{2 N s_p}},$$

where N is the effective population size and sp is the coefficient <strong>of</strong> selection at position p.<br />

As was noted in [29], if equilibrium has been reached, i.e., there has been sufficient time<br />

for all the possible mutations at that position to occur and either be fixed or removed,<br />

then<br />

$$\frac{f_{bp} P_{ba}}{f_{ap} P_{ab}} = \frac{F_{abp}}{F_{bap}} \approx e^{2 N s_p},$$

where fap is the equilibrium frequency <strong>of</strong> residue a at position p, in our case the frequency<br />

in the position weight matrix. This implies<br />

$$2 N s_p = \ln\left(\frac{f_{bp} P_{ba}}{f_{ap} P_{ab}}\right)$$

and therefore<br />

$$F_{abp} \propto \frac{\ln\left(\dfrac{f_{bp} P_{ba}}{f_{ap} P_{ab}}\right)}{1 - \dfrac{f_{ap} P_{ab}}{f_{bp} P_{ba}}}$$

which can be substituted to give the proportionality used in results. To fit a background<br />

mutation model, we used PAML [35] to fit the HKY model [34] to the promoters that<br />

contain the characterized binding sites for each factor. We fixed the alpha parameter at 0<br />

to use a constant rate across sites. The HKY model accounts for equilibrium frequencies<br />

<strong>of</strong> nucleotides as well as transition-transversion mutation bias. The equilibrium<br />

frequencies differed from (¼,¼,¼,¼), and transitions were more probable than<br />

transversions (kappa between 3 and 4). We also tested the site specific predictions for the<br />

rate <strong>of</strong> <strong>evolution</strong> assuming that all types <strong>of</strong> substitutions were equally likely and<br />

qualitatively the results were very similar (data not shown).<br />
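The prediction can be illustrated with a short Python sketch. It evaluates the selection-dependent factor ln(x) / (1 - 1/x), with x = f_bp P_ba / (f_ap P_ab), and sums the substitution rates over all pairs of bases at one position. Weighting each starting base a by its equilibrium frequency f_ap, the dictionary layout of the mutation rates, and the function names are assumptions of this sketch; all frequencies are assumed to be positive (for example after adding pseudocounts).

```python
from math import log

BASES = "ACGT"


def selection_factor(f_a, f_b, p_ab, p_ba):
    """ln(x) / (1 - 1/x) with x = f_b * p_ba / (f_a * p_ab).

    The factor tends to 1 as x tends to 1 (no selection between a and b).
    """
    x = (f_b * p_ba) / (f_a * p_ab)
    if abs(x - 1.0) < 1e-9:
        return 1.0
    return log(x) / (1.0 - 1.0 / x)


def predicted_rate(freqs, mutation):
    """Relative substitution rate at one position.

    freqs: dict base -> equilibrium frequency (a position weight matrix column).
    mutation: dict (a, b) -> mutation rate P_ab, e.g. from a fitted HKY model.
    """
    rate = 0.0
    for a in BASES:
        for b in BASES:
            if a != b:
                p_ab, p_ba = mutation[(a, b)], mutation[(b, a)]
                rate += freqs[a] * p_ab * selection_factor(
                    freqs[a], freqs[b], p_ab, p_ba)
    return rate


# Illustrative inputs only: equal mutation rates, one unconstrained column
# and one strongly constrained column.
uniform_mutation = {(a, b): 1.0 for a in BASES for b in BASES if a != b}
neutral = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
constrained = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
```

With these inputs, `predicted_rate(constrained, uniform_mutation)` is smaller than `predicted_rate(neutral, uniform_mutation)`, which is the qualitative behaviour expected for positions of high information content.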

References<br />

1. Stormo GD: DNA binding sites: representation and discovery. Bioinformatics 2000, 16(1):16-23.<br />

2. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF and Wootton JC: Detecting subtle<br />

sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993, 262(5131):208-214.<br />

3. Bailey TL and Elkan C: Fitting a mixture model <strong>by</strong> expectation maximization to discover motifs in<br />

biopolymers. Proceedings <strong>of</strong> the Second International Conference on Intelligent Systems for <strong>Molecular</strong><br />

Biology AAAI Press, Menlo Park, California; 1994:28-36.<br />

4. Eskin E and Pevzner PA: Finding composite regulatory patterns in DNA sequences. Bioinformatics<br />

2002, 18(Suppl 1):S354-363.<br />

5. Liu XS, Brutlag DL and Liu JS: An algorithm for finding protein-DNA binding sites with<br />

applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol 2002,<br />

20(8):835-839.<br />

6. Marsan L and Sagot MF: Algorithms for extracting structured motifs using a suffix tree with an<br />

application to promoter and regulatory site consensus identification. J Comput Biol 2000, 7(3–4):345-<br />

362.<br />

7. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ and Church GM: Systematic determination <strong>of</strong> genetic<br />

network architecture. Nat Genet 1999, 22(3):281-285.<br />



8. Cliften PF, Hillier LW, Fulton L, Graves T, Miner T, Gish WR, Waterston RH and Johnston M:<br />

Surveying Saccharomyces genomes to identify functional elements <strong>by</strong> comparative DNA sequence<br />

analysis. Genome Res 2001, 11(7):1175-1186.<br />

9. Blanchette M, Schwikowski B and Tompa M: Algorithms for phylogenetic footprinting. J Comput Biol<br />

2002, 9(2):211-223.<br />

10. McCue L, Thompson W, Carmack C, Ryan MP, Liu JS, Der<strong>by</strong>shire V and Lawrence CE: Phylogenetic<br />

footprinting <strong>of</strong> transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res 2001,<br />

29(3):774-782.<br />

11. Rajewsky N, Socci ND, Zapotocky M and Siggia ED: The <strong>evolution</strong> <strong>of</strong> DNA regulatory regions for<br />

proteo-gamma bacteria <strong>by</strong> interspecies comparisons. Genome Res 2002, 12(2):298-308.<br />

12. Kellis M, Patterson N, Endrizzi M, Birren B and Lander ES: Sequencing and Comparison of Yeast<br />

Species to Identify Genes and Regulatory Elements. Nature 2003, 423(6937):241-254.<br />

13. Hardison RC: Conserved noncoding sequences are reliable guides to regulatory elements. Trends in<br />

Genetics 2000, 16(9):369-372.<br />

14. G<strong>of</strong>feau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C,<br />

Johnston M, Louis EJ, Mewes HW, Murakami Y, Philippsen P, Tettelin H and Oliver SG: Life with 6000<br />

genes. Science 1996, 274(5287):563-567.<br />

15. Ludwig MZ, Patel NH and Kreitman M: Functional analysis <strong>of</strong> eve stripe 2 enhancer <strong>evolution</strong> in<br />

Drosophila: rules governing conservation and change. Development 1998, 125(5):949-958.<br />

16. Dermitzakis ET and Clark AG: Evolution <strong>of</strong> Transcription Factor Binding Sites in Mammalian<br />

Gene Regulatory Regions: Conservation and Turnover. Mol Biol Evol 2002, 19(7):1114-1121.<br />

17. Elnitski L, Hardison RC, Li J, Yang S, Kolbe D, Eswara P, O'Connor MJ, Schwartz S, Miller W and<br />

Chiaromonte F: Distinguishing regulatory DNA from neutral sites. Genome Res 2003, 13(1):64-72.<br />

18. Wasserman WW, Palumbo M, Thompson W, Fickett JW and Lawrence CE: Human-mouse<br />

genome comparisons to locate regulatory sites. Nat Genet 2000, 26(2):225-228.<br />

19. Levy S, Hannenhalli S and Workman C: Enrichment <strong>of</strong> regulatory signals in conserved non-coding<br />

genomic sequence. Bioinformatics 2001, 17(10):871-877.<br />

20. Zhu J and Zhang MQ: SCPD: a promoter database <strong>of</strong> the yeast Saccharomyces cerevisiae.<br />

Bioinformatics 1999, 15(7–8):607-611.<br />

21. Kimura M: The Neutral Theory <strong>of</strong> <strong>Molecular</strong> Evolution Cambridge University Press, Cambridge; 1983.<br />

22. Li WH: <strong>Molecular</strong> Evolution Sinauer Associates, Sunderland MA; 1997.<br />

23. Nei M: <strong>Molecular</strong> Evolutionary Genetics Columbia University Press, New York; 1987.<br />

24. Matthews BW: Protein-DNA interaction. No code for recognition. Nature 1988, 335(6188):294-295.<br />

25. Suzuki M, Brenner SE, Gerstein M and Yagi N: DNA recognition code <strong>of</strong> transcription factors.<br />

Protein Eng 1995, 8(4):319-328.<br />

26. Kono H and Sarai A: Structure-based prediction <strong>of</strong> DNA target sites <strong>by</strong> regulatory proteins.<br />

Proteins 1999, 35(1):114-131.<br />



27. Benos PV, Lapedes AS and Stormo GD: Is there a code for protein-DNA recognition?<br />

Probab(ilistical)ly. Bioessays 2002, 24(5):466-475.<br />

28. Mirny LA and Gelfand MS: Structural analysis <strong>of</strong> conserved base pairs in protein-DNA complexes.<br />

Nucleic Acids Res 2002, 30(7):1704-1711.<br />

29. Halpern AL and Bruno WJ: Evolutionary distances for protein-coding sequences: modelling site-specific<br />

residue frequencies. Mol Biol Evol 1998, 15(7):910-917.<br />

30. Marmorstein R, Carey M, Ptashne M and Harrison SC: DNA recognition <strong>by</strong> GAL4: structure <strong>of</strong> a<br />

protein-DNA complex. Nature 1992, 356(6368):408-414.<br />

31. Acton TB, Zhong H and Vershon AK: DNA-binding specificity <strong>of</strong> Mcm1: operator mutations that<br />

alter DNA-bending and <strong>transcriptional</strong> activities <strong>by</strong> a MADS box protein. Mol Cell Biol 1997,<br />

17(4):1881-1889.<br />

32. Kerppola TK: Transcriptional cooperativity: bending over backwards and doing the flip. Structure<br />

1998, 6(5):549-554.<br />

33. Tan S and Richmond TJ: Crystal structure <strong>of</strong> the yeast MATalpha2/MCM1/DNA ternary complex.<br />

Nature 1998, 391(6668):660-666.<br />

34. Yang Z, Goldman N and Friday AE: Comparison <strong>of</strong> models for nucleotide substitution used in<br />

maximum likelihood phylogenetic estimation. Mol Biol Evol 1994, 11(2):316-324.<br />

35. Yang Z: PAML: a program package for phylogenetic analysis <strong>by</strong> maximum likelihood. Comput<br />

Appl Biosci 1997, 13(5):555-556.<br />

36. Yoshimoto H, Saltsman K, Gasch AP, Li HX, Ogawa N, Botstein D, Brown PO and Cyert MS:<br />

Genome-wide Analysis <strong>of</strong> Gene Expression Regulated <strong>by</strong> the Calcineurin/Crz1p Signalling Pathway<br />

in Saccharomyces cerevisiae. J Biol Chem 2002, 277(34):31079-31088.<br />

37. Durbin R, Eddy S, Krogh A and Mitchison G: Biological Sequence Analysis: Probabilistic Models <strong>of</strong><br />

Proteins and Nucleic Acids Cambridge University Press, Cambridge, UK; 1998.<br />

38. Nielsen R: Site-<strong>by</strong>-site estimation <strong>of</strong> the rate <strong>of</strong> substitution and the correlation <strong>of</strong> rates in<br />

mitochondrial DNA. Syst Biol 1997, 46(2):346-353.<br />

39. Hampson S, Kibler D and Baldi P: Distribution patterns <strong>of</strong> overrepresented k-mers in non-coding<br />

yeast DNA. Bioinformatics 2002, 18(4):513-528.<br />

40. Hodges PE, Payne WE and Garrels JI: The Yeast Protein Database (YPD): a curated proteome<br />

database for Saccharomyces cerevisiae. Nucleic Acids Res 1998, 26(1):68-72.<br />

41. <strong>Eisen</strong> MB, Spellman PT, Brown PO and Botstein D: Cluster analysis and display <strong>of</strong> genome-wide<br />

expression patterns. Proc Natl Acad Sci U S A 1998, 95(25):14863-14868.<br />

42. Gollub J, Ball CA, Binkley G, Demeter J, Finkelstein DB, Hebert JM, Hernandez-Boussard T, Jin H,<br />

Kaloper M, Matese JC, Schroeder M, Brown PO, Botstein D and Sherlock G: The Stanford Microarray<br />

Database: data access and quality assessment tools. Nucleic Acids Res 2003, 31(1):94-96.<br />

43. Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, C<strong>of</strong>fey E, Dai<br />

H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte<br />



D, Chakraburtty K, Simon J, Bard M and Friend SH: Functional discovery via a compendium <strong>of</strong><br />

expression pr<strong>of</strong>iles. Cell 2000, 102(1):109-126.<br />

44. Press WH, Teukolsky SA, Vetterling WT and Flannery BP: Numerical Recipes in C 2nd edition.<br />

Cambridge University Press, Cambridge, UK; 1992.<br />

45. Golding B and Felsenstein J: A maximum likelihood approach to the detection <strong>of</strong> selection from a<br />

phylogeny. J Mol Evol 1990, 31:511-523.<br />

46. Kimura M: On the probability of fixation of mutant genes in a population. Genetics 1962, 47:713-719.<br />

47. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN and Bourne PE:<br />

The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235-242.<br />



9. Phylogenetic motif finding <strong>by</strong> expectation maximization on <strong>evolution</strong>ary mixtures<br />

The characteristic patterns <strong>of</strong> <strong>evolution</strong> in transcription factor binding sites described in<br />

the previous chapter suggest that we can develop new methods for both motif discovery<br />

and binding site identification that employ specific models <strong>of</strong> transcription factor binding<br />

site <strong>evolution</strong>. This work was published as Moses et al. 2004.<br />

The functional binding sites for a given protein are rarely identical, with most<br />

proteins binding to families <strong>of</strong> related sequences collectively referred to as their ‘motif’<br />

[1]. Although experimental methods exist to identify sequences bound <strong>by</strong> a specific<br />

protein, they have not been widely applied, and computational approaches [2,3,4] to<br />

‘motif discovery’ have proven to be a useful alternative. For example, the program<br />
MEME [5] models a collection of sequences as a mixture of multinomial models for<br />
motif and background and uses an Expectation-Maximization (EM) algorithm to estimate<br />
the parameters.<br />

Because functional binding sites are <strong>evolution</strong>arily constrained, their preferential<br />

conservation relative to background sequence has proven a useful approach for their<br />

identification [6]. With the availability of complete genomes for closely related species<br />
(e.g., [7]), it is possible to incorporate an understanding of binding site evolution into motif<br />
discovery as well. At present, few motif discovery methods simultaneously take<br />

advantage <strong>of</strong> both the statistical enrichment <strong>of</strong> motifs and the preferential conservation <strong>of</strong><br />

the sequences that match them. One recent study [7] enumerated spaced hexamers that<br />

were both preferentially conserved (in multiple sequence alignments) and statistically<br />

enriched. Another method, FootPrinter [8], identifies sequences (with mismatches) that show<br />
few changes over an evolutionary tree. Neither of these methods, however, makes use of<br />
an explicit probabilistic model.<br />

Here we present a unified probabilistic framework that combines the mixture models<br />

<strong>of</strong> MEME with probabilistic models <strong>of</strong> <strong>evolution</strong>, and can thus be viewed as an<br />

<strong>evolution</strong>ary extension <strong>of</strong> MEME. These <strong>evolution</strong>ary models (used in the maximum<br />

likelihood estimation <strong>of</strong> phylogeny [9]) consider observed sequences to have been<br />

generated <strong>by</strong> a continuous time Markov substitution process from unobserved ancestral<br />

sequences, and can accurately model the complicated statistical relationship between<br />

sequences that have diverged along a tree from a common ancestor. Our approach<br />

considers observed sequences to have been generated from ancestral sequences that are<br />

two component mixtures <strong>of</strong> motif and background, each with their own <strong>evolution</strong>ary<br />

model. The value of varying evolutionary models has been realized in other contexts as<br />
well (e.g., [10]), and such models have been successfully trained using EM [11]. A mixture<br />

<strong>of</strong> <strong>evolution</strong>ary models has been used previously to identify slowly evolving non-coding<br />

sequences [12], and this work can equally be regarded as an extension <strong>of</strong> that approach.<br />

Given a set <strong>of</strong> aligned sequences, we use an EM algorithm to obtain the maximum<br />

likelihood estimates <strong>of</strong> the motif matrix and a corresponding <strong>evolution</strong>ary model.<br />

We first describe the probabilistic framework used to model aligned non-coding<br />

sequences. We employ a mixture model, which can be written generically as<br />

$$p(\text{data}) = \sum_{\text{model}} p(\text{data} \mid \text{model})\, p(\text{model}),$$<br />

where p(x) is the probability density function for the random variable x. The sum over<br />
models indicates that the data are distributed as a mixture of component models, where<br />
the prior, p(model), is the mixing proportion. For simplicity, we first address the case of<br />
pair-wise sequence alignments. Given some motif size, w, we treat the entire alignment as<br />
a series of alignments of length w, each of which may be an instance of the motif or a<br />
piece of background sequence. We denote the pair of aligned sequences as X and Y,<br />
and represent the i-th position in each sequence as a vector of length 4 (one entry for each of<br />
A, C, G and T), where $X_{ib} = 1$ if the b-th base is observed, and 0 otherwise. We denote the<br />
unobserved ancestral sequence, A, similarly, except that the values of $A_{ib}$ are not observed.<br />
For a series of alignments of total length N, the likelihood, L, is given by<br />

$$L = \prod_{i=0}^{N-w} \sum_{m_i} p(m_i) \prod_{k=i}^{i+w-1} \sum_{b=0}^{3} p(X_k, Y_k \mid A_{kb}, m_i)\, p(A_{kb} \mid m_i)$$<br />
where the $m_i$ are unobserved indicator variables indexing the component models; in our<br />
case m is either motif or background. Generically, we let<br />
$$p(m_i) = \pi_m,$$<br />
the prior probability for each component.<br />
We incorporate the sequence specificity of the motif by letting the prior probabilities<br />
of observing each base in the ancestral sequence, $p(A_{kb} \mid m_i)$, be the frequency of each base<br />
at each position in the motif (the frequency matrix). We write<br />
$$p(A_{kb} \mid m_i) = f_{mkb},$$<br />
such that if m is motif, $f_{mkb}$ gives the probability of observing the b-th base at the (k-i)-th<br />
position. For the background model we use the average base frequencies for each<br />
alignment, and assume that they are independent of position. This allows us to run our<br />
algorithm on several alignments simultaneously [15]; the densities are therefore<br />
conditioned on the alignment as well, but we omit this here for notational clarity.<br />


Finally, because the two sequences descended independently from the ancestor, we can write<br />
$p(X_k, Y_k \mid A_{kb}, m_i) = p(X_k \mid A_{kb}, m_i)\, p(Y_k \mid A_{kb}, m_i)$, where $p(X_k \mid A_{kb}, m_i)$ is<br />
the probability of the residue $X_k$, given that the ancestral sequence, A, was base b at that<br />
position: a substitution matrix for each component model. For simplicity we use the<br />
Jukes-Cantor [16] substitution matrix, which is, in our notation,<br />

$$p(X_k \mid A_{kb}, m_i) = \left(\frac{1}{4} + \frac{3}{4} e^{-\frac{4}{3}\alpha_{mk}}\right)^{X_{kb}} \left(\frac{1}{4} - \frac{1}{4} e^{-\frac{4}{3}\alpha_{mk}}\right)^{1 - X_{kb}}$$<br />
where $\alpha_{mk}$ is the rate parameter at position k.<br />

It is here that we incorporate differences in evolution between the motif and<br />
background by specifying different substitution matrices for each component. For<br />
example, if we set $\alpha_m$ smaller for the motif than for background, the motif evolves at a<br />
slower rate than the background; that is, it is conserved. We test a variety of different<br />
substitution models for the motif and summarize the implications for motif discovery in<br />
the Gcn4p targets (see Results). Unfortunately, as the dependence of these models on the<br />
equilibrium frequencies becomes more complicated, deriving ML estimators for the<br />
parameters becomes more difficult, and more general optimization methods may be<br />
necessary. Once again, we can allow each alignment its own background rate [15] and<br />
express the motif rate as a proportion of background.<br />
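As an illustration of the substitution model above, the following Python sketch evaluates the Jukes-Cantor probabilities and the likelihood of a single pair-wise alignment column, summing over the unobserved ancestral base.<br />
It is not the EMnEM implementation; the names `jc_prob` and `column_likelihood`, the uniform base frequencies, and the rate values 0.05 and 0.5 are assumptions chosen only for the example.<br />

```python
import math

BASES = "ACGT"


def jc_prob(observed, ancestral, alpha):
    """Jukes-Cantor probability of an observed base given the ancestral base."""
    decay = math.exp(-4.0 * alpha / 3.0)
    if observed == ancestral:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def column_likelihood(x, y, freqs, alpha):
    """p(X_k, Y_k | m): sum over the unobserved ancestral base b."""
    return sum(
        freqs[b] * jc_prob(x, b, alpha) * jc_prob(y, b, alpha)
        for b in BASES
    )


# Hypothetical inputs: uniform ancestral base frequencies, two rates.
uniform = {b: 0.25 for b in BASES}
slow = column_likelihood("A", "A", uniform, alpha=0.05)  # motif-like rate
fast = column_likelihood("A", "A", uniform, alpha=0.5)   # background-like rate
```

Under these assumed values a conserved column (the same base in both species) receives a higher likelihood under the slower rate, which is the effect the mixture exploits.<br />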

3.2 An EM algorithm to train parameters<br />

Following the example of the MEME program [5], which uses an EM algorithm (an iterative<br />
optimization scheme guaranteed to find local maxima in the likelihood) to fit<br />
mixtures to unrelated sequences, we now derive an EM algorithm to train the parameters<br />
of the model described above. We write the ‘expected complete log likelihood’ [17]<br />

$$\langle \ln L_c \rangle = \sum_{i=0}^{N-w} \sum_{m_i} \langle m_i \rangle \left[ \ln \pi_m + \sum_{k=i}^{i+w-1} \sum_{b=0}^{3} \langle A_{kb} \rangle \left( \ln p(X_k, Y_k \mid A_{kb}, m_i) + \ln f_{mkb} \right) \right],$$<br />

where ln denotes the natural logarithm, and maximize by setting the derivatives with<br />
respect to the parameters to zero at each iteration. Setting<br />
$$\frac{\partial \langle \ln L_c \rangle}{\partial \pi_m} = 0, \quad \frac{\partial \langle \ln L_c \rangle}{\partial f_{mkb}} = 0 \quad \text{and} \quad \frac{\partial \langle \ln L_c \rangle}{\partial \alpha_{mk}} = 0$$<br />
and solving gives<br />
$$\pi_m = \frac{1}{N-w} \sum_i \langle m_i \rangle, \quad f_{mkb} = \frac{\sum_i \langle m_i \rangle \langle A_{kb} \rangle}{\sum_i \langle m_i \rangle} \quad \text{and} \quad \alpha_{mk} = -\frac{3}{4} \ln\left(\frac{1 - \frac{1}{3} R_{mk}}{1 + R_{mk}}\right),$$<br />

where $R_{mk}$ is the ratio of expected changed to identical residues under each model, and is<br />
given by<br />
$$R_{mk} = R_m = \frac{\sum_{i=0}^{N-w} \langle m_i \rangle \sum_{k=i}^{i+w-1} \sum_{b=0}^{3} \langle A_{kb} \rangle \left(2 - X_{kb} - Y_{kb}\right)}{\sum_{i=0}^{N-w} \langle m_i \rangle \sum_{k=i}^{i+w-1} \sum_{b=0}^{3} \langle A_{kb} \rangle \left(X_{kb} + Y_{kb}\right)}$$<br />

for all k in the case of a constant rate across the motif. The sufficient statistics $\langle A_{kb} \rangle$ and<br />
$\langle m_i \rangle$ are derived by applying Bayes’ theorem and are computed using the values of the<br />
parameters from the previous iteration. We have<br />
$$\langle m_i \rangle = p(m_i \mid X, Y) = \frac{p(m_i)\, p(X, Y \mid m_i)}{p(X, Y)},$$<br />
where<br />
$$p(X, Y \mid m_i) = \prod_{k=i}^{i+w-1} \sum_{b=0}^{3} p(X_k, Y_k \mid A_{kb}, m_i)\, p(A_{kb} \mid m_i)$$<br />
and<br />
$$p(X, Y) = \sum_{m_i} p(X, Y \mid m_i)\, p(m_i).$$<br />
Similarly,<br />

$$\langle A_{ib} \rangle = p(A_{ib} \mid X_i, Y_i) = \sum_{m_i} p(A_{ib} \mid X_i, Y_i, m_i)\, p(m_i) = \sum_{m_i} \frac{p(A_{ib} \mid m_i)\, p(X_i, Y_i \mid A_{ib}, m_i)}{p(X_i, Y_i \mid m_i)}\, p(m_i)$$<br />
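The updates above can be sketched for a pair-wise alignment as follows.<br />
This is a minimal illustration under stated assumptions, not the EMnEM code: the function names, the toy three-column motif matrix, the example sequences and all parameter values are hypothetical, only the posterior $\langle m_i \rangle$ is computed (the ancestral statistics $\langle A_{kb} \rangle$ are omitted), and the number of windows is taken to be N - w + 1.<br />

```python
import math

BASES = "ACGT"


def jc_prob(observed, ancestral, alpha):
    """Jukes-Cantor probability of an observed base given the ancestral base."""
    decay = math.exp(-4.0 * alpha / 3.0)
    return 0.25 + 0.75 * decay if observed == ancestral else 0.25 - 0.25 * decay


def window_likelihood(xs, ys, freqs, alpha):
    """p(X, Y | m_i) for one window; freqs holds one base-frequency dict per column."""
    like = 1.0
    for x, y, f in zip(xs, ys, freqs):
        like *= sum(f[b] * jc_prob(x, b, alpha) * jc_prob(y, b, alpha) for b in BASES)
    return like


def e_step(x, y, w, motif, background, prior_motif, a_motif, a_bg):
    """Posterior <m_i> that each window of width w is a motif instance."""
    post = []
    for i in range(len(x) - w + 1):
        xs, ys = x[i:i + w], y[i:i + w]
        lm = prior_motif * window_likelihood(xs, ys, motif, a_motif)
        lb = (1.0 - prior_motif) * window_likelihood(xs, ys, [background] * w, a_bg)
        post.append(lm / (lm + lb))
    return post


def rate_from_ratio(r):
    """M-step rate update: alpha = -(3/4) ln((1 - R/3) / (1 + R)), for 0 <= R < 3."""
    return -0.75 * math.log((1.0 - r / 3.0) / (1.0 + r))


# Hypothetical toy data: a motif matrix peaked on "TGA" and a uniform background.
motif = [{b: (0.85 if b == c else 0.05) for b in BASES} for c in "TGA"]
background = {b: 0.25 for b in BASES}
posteriors = e_step("AATGACC", "CGTGACT", 3, motif, background,
                    prior_motif=0.1, a_motif=0.05, a_bg=0.3)
```

In this toy alignment the window starting at position 2, where both sequences read TGA, receives the largest posterior; a full implementation would alternate this step with the parameter updates until the likelihood stops improving.<br />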

In order to extend these results beyond pair-wise alignments, we can simply replace<br />

the two sequences X and Y with the probability <strong>of</strong> the entire tree below conditioned on<br />

having observed base b in the ancestral sequence. The likelihood becomes<br />

$$L = \prod_{i=0}^{N-w} \sum_{m_i} p(m_i) \prod_{k=i}^{i+w-1} \sum_{b=0}^{3} p(\text{tree} \mid A_{kb}, m_i)\, p(A_{kb} \mid m_i),$$<br />

where $p(\text{tree} \mid A_{kb}, m_i)$ is computed using the ‘pruning’ algorithm [9]. Of course, a tree<br />
topology is needed in these cases; we used the accepted topology for the sensu stricto<br />
Saccharomyces [7] and computed the maximum likelihood branch lengths for each alignment<br />
using the PAML package [18].<br />
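The recursion behind the ‘pruning’ algorithm can be sketched as below for a single alignment column under Jukes-Cantor substitution.<br />
The nested-list tree encoding, the function name `conditional`, and the three-species tree with its branch lengths are assumptions made for the example; branch lengths are taken to be in expected substitutions per site, and scaling them by a motif-specific rate is left out.<br />

```python
import math

BASES = "ACGT"


def jc_prob(child, parent, t):
    """Jukes-Cantor probability of the child base given the parent base."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if child == parent else 0.25 - 0.25 * decay


def conditional(node):
    """Return {b: p(leaves below node | node has base b)}.

    A node is either a leaf (a one-letter string) or a list of
    (child_node, branch_length) pairs.
    """
    if isinstance(node, str):
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        below = conditional(child)
        for b in BASES:
            # Sum over the unobserved base c at the child end of the branch.
            result[b] *= sum(jc_prob(c, b, t) * below[c] for c in BASES)
    return result


# Hypothetical three-species column: ((A:0.1, A:0.1):0.05, G:0.2)
tree = [([("A", 0.1), ("A", 0.1)], 0.05), ("G", 0.2)]
p_tree_given_root = conditional(tree)
```

Each entry of `p_tree_given_root` plays the role of $p(\text{tree} \mid A_{kb}, m_i)$ for one ancestral base b; the work per column grows linearly with the number of branches.<br />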

3.3 Implementation<br />

We implemented a C++ program (EMnEM: Expectation-Maximization on Evolutionary<br />

Mixtures) to execute the algorithm described above, with the following extensions.<br />

Because instances <strong>of</strong> a motif may occur on either strand <strong>of</strong> DNA sequence, we also treat<br />

the strand <strong>of</strong> each occurrence as a hidden variable, and sum over the two possible<br />

orientations. In addition, because the mixture model treats each position in the alignment<br />

independently, we down-weight overlapping matches <strong>by</strong> limiting the total expected<br />

number <strong>of</strong> matches in any window <strong>of</strong> 2w to be less than one. Finally, because EM is<br />

guaranteed only to converge to a local optimum in the likelihood, we need to initialize the<br />

model in the region <strong>of</strong> the likelihood space where we believe the global optimum lies.<br />



Similar to the strategy used in the MEME program [5], we initialize the motif matrix with<br />

the reconstructed ancestral sequence <strong>of</strong> length w at each position in the alignments, and<br />

perform the full EM starting with the sequence at the position that had the greatest<br />

likelihood. EMnEM will be made available at http://rana.lbl.gov.<br />
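The treatment of strand as a hidden variable, described above, can be illustrated with a short sketch.<br />
For brevity it scores a single unaligned window against a frequency matrix rather than an alignment column under an evolutionary model, and it assumes equal prior probability for the two strands; the function names and the default `forward_prior=0.5` are hypothetical choices, not details taken from EMnEM.<br />

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement_matrix(matrix):
    """Frequency matrix of the motif as read on the opposite strand."""
    return [
        {COMPLEMENT[base]: p for base, p in column.items()}
        for column in reversed(matrix)
    ]


def window_prob(window, matrix):
    """Probability of a window under a position-specific frequency matrix."""
    prob = 1.0
    for base, column in zip(window, matrix):
        prob *= column[base]
    return prob


def both_strands_prob(window, matrix, forward_prior=0.5):
    """Sum over the hidden strand (equal strand priors assumed by default)."""
    rc = reverse_complement_matrix(matrix)
    return (forward_prior * window_prob(window, matrix)
            + (1.0 - forward_prior) * window_prob(window, rc))


# Hypothetical matrix peaked on "TGA"; its reverse complement is "TCA".
matrix = [{b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in "TGA"]
forward_hit = both_strands_prob("TGA", matrix)
reverse_hit = both_strands_prob("TCA", matrix)
```

With equal strand priors, a site and its reverse complement receive the same score, so occurrences on either strand contribute equally to the motif estimate.<br />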

3.4 Time complexity<br />

The time complexity of the EM algorithm is linear in the total length of the data, and<br />
the initialization heuristic we have implemented is quadratic in the length.<br />
Interestingly, because our algorithm runs on aligned sequences, the total length is reduced<br />
by a factor of 1/S relative to MEME, which treats sequences independently, where S<br />
is the number of sequences in the alignment. Usually, we lose this factor in each iteration<br />
when calculating $p(\text{tree} \mid A_{kb})$ using the ‘pruning’ algorithm [9], as it is linear in S. We<br />
note, however, that for evolutionary models (e.g., Jukes-Cantor) where $p(\text{tree} \mid A_{kb})$ is<br />
independent of $p(A_{kb} \mid m_i)$, we may learn the position-specific probability matrix (PSPM)<br />
without re-estimating the sufficient statistics $\langle A_{kb} \rangle$ (the reconstructed ancestral sequence)<br />
at each iteration. In these cases the complexity of EMnEM will indeed be linear in the length<br />
of the aligned sequence, a considerable speedup, especially in the quadratic initialization step.<br />

4 Results and Discussion<br />

4.1 A test case from the budding yeasts<br />

In order to compare our algorithm under various <strong>evolution</strong>ary models as well as to other<br />

motif discovery strategies, we chose to compare all methods on a single test case: the<br />

upstream regions from 5 sensu stricto Saccharomyces (S. bayanus, S. cerevisiae, S.<br />



kudriavzevii, S. mikatae, and S. paradoxus) <strong>of</strong> 9 known Gcn4p targets that are listed in<br />

SCPD [19]. In order to control for variability in alignment quality at different<br />

<strong>evolution</strong>ary distances, we made multiple alignments <strong>of</strong> all available upstream regions<br />

using T-c<strong>of</strong>fee [20] and then extracted the appropriate sequences for any subset <strong>of</strong> the<br />

species. The Gcn4p targets from SCPD are a good set on which to test our method<br />
because there is a relatively high number of characterized sites in these promoters. In<br />
addition, the upstream regions of these genes contain stretches of poly T, which are not<br />
known to be binding sites. As a result, MEME (“tcm” model, w = 10) assigns a lower<br />
(better) e-value to a ‘polyT’ motif (e=2.7e-03) than to the known Gcn4p motif (e=1.6e+06)<br />
when run on the S. cerevisiae upstream regions. Because this is typical of the types of<br />

false positives that motif finding algorithms produce, we use as an indicator <strong>of</strong> the<br />

success <strong>of</strong> our method the log ratio <strong>of</strong> the likelihood <strong>of</strong> the <strong>evolution</strong>ary mixture model<br />

using the real Gcn4p matrix, to that using the polyT matrix. If this indicator is greater<br />

than zero, i.e.,<br />

$$\log\left[\frac{p(\text{data} \mid \text{Gcn4p})}{p(\text{data} \mid \text{polyT})}\right] > 0,$$<br />

the real motif has a greater likelihood than the false positive, and should be returned as<br />

the top motif.<br />

4.2 Incorporating a model <strong>of</strong> motif <strong>evolution</strong> can eliminate false positives<br />

In order to explore the effects <strong>of</strong> incorporating models <strong>of</strong> motif <strong>evolution</strong> into motif<br />

detection, we tested several <strong>evolution</strong>ary models. In particular we were interested in the<br />

effect of incorporating evolutionary rate, as real motifs evolve more slowly than surrounding<br />

sequences. Using alignments <strong>of</strong> S. cerevisiae and S. mikatae, we calculated the log ratio<br />



<strong>of</strong> the likelihood using the real Gcn4p matrix to the likelihood using the polyT matrix<br />

with Jukes-Cantor substitution under several assumptions about the rate <strong>of</strong> <strong>evolution</strong> in<br />

the motif (Figure 1). Interestingly, slower evolution in the motif, at either ¼ or 0.03 (the ML<br />
estimate) times the background rate, is enough to assign a higher likelihood to the Gcn4p<br />

motif and thus eliminate the false positive. We tried two additional <strong>evolution</strong>ary models,<br />

in which the rate <strong>of</strong> substitution at each position depends on the frequency matrix. In the<br />

Felsenstein ’81 model (F81) the different types <strong>of</strong> changes occur at different rates, but the<br />

overall rate at each position is constant, while the Halpern-Bruno model (HB) assumes<br />

there is purifying selection at each position and can account for positional variation in<br />

overall rate [21,22]. In each case, these more realistic models further favored the Gcn4p<br />

matrix over the polyT.<br />

Figure 1.<br />

Effect <strong>of</strong> models for motif <strong>evolution</strong> on motif detection<br />

Plotted is the log ratio <strong>of</strong> the likelihood using the Gcn4p PSPM to the likelihood using polyT PSPM under<br />

various <strong>evolution</strong>ary models in alignments <strong>of</strong> S. cerevisiae to S. mikatae. Models that allow the motif to<br />

evolve more slowly than background, JC (0.25), JC (ML) and JC (HB), and models in which the rates <strong>of</strong><br />

<strong>evolution</strong> take into account the deviation from equilibrium base frequencies, F81 and JC (HB), assign<br />



higher likelihood to the Gcn4p PSPM. Also plotted is the negative log ratio <strong>of</strong> the e-values from MEME<br />

(‘tcm’ model, w 10). JC are Jukes-Cantor models with rate parameter equal to background (bg), ¼ <strong>of</strong><br />

background (0.25) or set to the maximum-likelihood estimate below background (ML).<br />

4.3 Success <strong>of</strong> motif discovery is dependent on <strong>evolution</strong>ary distance<br />

In order to test the generality of the results achieved for the S. cerevisiae-S. mikatae<br />

alignments, we calculated the log ratio <strong>of</strong> the likelihood <strong>of</strong> the <strong>evolution</strong>ary mixture using<br />

the real Gcn4p matrix to the polyT matrix over a range <strong>of</strong> <strong>evolution</strong>ary distances and<br />

rates <strong>of</strong> <strong>evolution</strong> (figure 2, filled symbols). At closer distances, more <strong>of</strong> the data is<br />

redundant, while over longer comparisons, conserved sequences should stand out more<br />

against the background. Indeed, at the distance of S. cerevisiae to S. paradoxus (~0.13<br />
substitutions per site), the likelihood of polyT is greater, while at the distance spanned by S.<br />
cerevisiae, S. mikatae, and S. paradoxus (~0.31 substitutions per site) the Gcn4p matrix is<br />

favored. Interestingly, this is true regardless <strong>of</strong> the rate <strong>of</strong> <strong>evolution</strong> assumed for the<br />

motif. While at all <strong>evolution</strong>ary distances slow <strong>evolution</strong> favors the Gcn4p matrix more<br />

than when the motif evolves at the background rate, the effect <strong>of</strong> including slower<br />

<strong>evolution</strong> is smaller than the effect <strong>of</strong> the varying <strong>evolution</strong>ary distance. Only at the<br />

borderline distance <strong>of</strong> S. cerevisiae to S. mikatae (~0.25 subs. per site), do the models<br />

perform differently. We also ran MEME (with the “tcm” model, w set at 10) on all the<br />

sequences (from all genes and all species) and calculated the negative log ratio <strong>of</strong> the<br />

MEME e-values for the two motifs (figure 2, heavy trace). MEME treats all the<br />

sequences independently, and continues to assign the polyT matrix a lower e-value over<br />

all the <strong>evolution</strong>ary distances. At least for this case, it seems more important to accurately<br />



model the phylogenetic relationships between the sequences (i.e., using a tree) than to<br />

accurately model the <strong>evolution</strong> within the motif.<br />

Figure 2.<br />

Effect <strong>of</strong> <strong>evolution</strong>ary distance on motif detection.<br />

Log ratio <strong>of</strong> the likelihood using the Gcn4p matrix to the likelihood using polyT matrix and alignments that<br />

span increasing <strong>evolution</strong>ary distance. At distances greater than S. cerevisiae to S. mikatae the <strong>evolution</strong>ary<br />

mixture assigns the Gcn4p matrix a greater likelihood whether the rate <strong>of</strong> <strong>evolution</strong> in the motif is equal to,<br />

½, ¼ or ⅛ <strong>of</strong> the background rate, (diamonds, squares, triangles and circles, respectively). Also plotted are<br />

negative log ratios of the MEME e-values for the Gcn4p to polyT, using the entire sequences, or<br />
pre-filtering alignments for 20 base pair windows of at least 70% or 50% identity to a reference genome<br />

(heavy, lighter and lightest traces, respectively.)<br />

4.4 The unified framework is preferable to using <strong>evolution</strong>ary information separately<br />

In order to compare our method, which incorporates <strong>evolution</strong>ary information directly<br />

into motif discovery, to approaches that use such information separately, we scanned the<br />

alignments at each evolutionary distance and removed regions that were less than 50% or<br />
70% identical to a reference genome in a 20 base pair window. This allows MEME,<br />
which does not take into account phylogenetic information, to focus on the conserved regions.<br />



We ran MEME and computed the negative log ratio <strong>of</strong> the e-values for the Gcn4p matrix<br />

and the polyT matrix. While in both cases there were distances where the real motif was<br />

favored (figure 2, lighter traces), the effect <strong>of</strong> the filtering was not consistent. At<br />

distances too close, not enough is filtered out, and the polyT is still preferred, while at<br />

distances too far, real instances <strong>of</strong> the motif will no longer pass the cut<strong>of</strong>f and the real<br />

motif is no longer recovered (figure 2, lighter traces). Thus, while incorporating<br />

<strong>evolution</strong>ary information separately can help recover the real motif, it depends critically<br />

on the choice <strong>of</strong> percent identity cut<strong>of</strong>f.<br />
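The pre-filtering step used for this comparison can be sketched as a sliding-window identity mask over a pair-wise alignment.<br />
The text above fixes only the window size (20 base pairs) and the cutoffs (50% or 70% identity); how overlapping windows are combined is not stated, so keeping every position covered by at least one passing window is an assumption of this sketch, as are the function name `conserved_mask` and the treatment of gap characters as mismatches.<br />

```python
def conserved_mask(ref, other, window=20, min_identity=0.7):
    """Mark positions covered by at least one window meeting the identity cutoff."""
    n = len(ref)
    keep = [False] * n
    for start in range(n - window + 1):
        pairs = zip(ref[start:start + window], other[start:start + window])
        # Count identical, non-gap aligned positions in this window.
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        if matches / window >= min_identity:
            for pos in range(start, start + window):
                keep[pos] = True
    return keep


# Hypothetical 40-column alignment: first half identical, second half diverged.
reference = "ACGT" * 10
aligned = reference[:20] + "T" * 20
mask = conserved_mask(reference, aligned)
```

Positions whose mask entry is False would be removed before running a motif finder on the remaining sequence; raising `min_identity` removes more sequence, which is the cutoff sensitivity discussed above.<br />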

4.5 Examples <strong>of</strong> other discovered motifs<br />

We ran both our program and MEME on the upstream regions <strong>of</strong> target genes <strong>of</strong> some<br />

transcription factors with few characterized targets and/or poorly defined motifs. In<br />

several cases, for a given motif size, our algorithm ranked a plausible motif first, and<br />

MEME ranked a polyT motif first (see Table 1).<br />



Table 1.<br />

Motif discovery using EMnEM and MEME.<br />

The EMnEM program was run using the Jukes-Cantor model for motif evolution with the rate set to ¼<br />
background (JC 0.25) on S. cerevisiae-S. mikatae alignments in each case. For cases where EMnEM ranked<br />
the motif higher, the consensus sequence and a plot of the information content are shown. MEME was run<br />
on the unaligned sequences from both species simultaneously. Target genes are from SCPD [19] (+) or YPD<br />
[23] (++). – indicates that a plausible motif was not found.<br />

5 Conclusions and future directions<br />

We have provided an <strong>evolution</strong>ary mixture model for transcription factor binding sites in<br />

aligned sequences, and a motif finding algorithm based on this framework. We believe<br />

that our approach has many advantages over current methods; it produces probabilistic<br />

models <strong>of</strong> motifs, can be applied directly to multiple or pair-wise alignments, and can be<br />

applied simultaneously at multiple loci. Our method should be applicable to any group <strong>of</strong><br />

species whose intergenic regions can be aligned, though because alignments may not be<br />



possible at large <strong>evolution</strong>ary distances, our reliance on them is a disadvantage <strong>of</strong> our<br />

method relative to FootPrinter [8]. It is not difficult to conceive of extending this<br />

framework to unaligned sequences <strong>by</strong> treating the alignment as a hidden variable as well;<br />

unfortunately, the space <strong>of</strong> multiple alignments is large, and improved optimization<br />

methods would certainly be needed.<br />

In addition to motif discovery, our probabilistic framework is also applicable to<br />

binding site identification. Current methods that search genome sequence for matches to<br />

motifs are also plagued <strong>by</strong> false positives, but optimally combining sequence specificity<br />

and <strong>evolution</strong>ary constraint may lead to considerable improvement.<br />

7 References<br />

1. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. (2000) Jan;16(1):16-23.<br />

2. Stormo GD, Hartzell GW 3rd. Identifying protein-binding sites from unaligned DNA fragments. Proc<br />

Natl Acad Sci U S A. (1989) Feb;86(4):1183-7.<br />

3. Lawrence CE, Reilly AA. An expectation maximization (EM) algorithm for the identification and<br />

characterization <strong>of</strong> common sites in unaligned biopolymer sequences. Proteins. 1990;7(1):41-51.<br />

4. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence<br />

signals: a Gibbs sampling strategy for multiple alignment. Science. (1993) Oct 8;262(5131):208-14.<br />

5. Bailey TL, Elkan C, Fitting a mixture model <strong>by</strong> expectation maximization to discover motifs in<br />

biopolymers, Proceedings <strong>of</strong> the Second International Conference on Intelligent Systems for <strong>Molecular</strong><br />

Biology, pp. 28-36, AAAI Press, Menlo Park, California, (1994.)<br />

6. Hardison RC. Conserved noncoding sequences are reliable guides to regulatory elements. Trends<br />
Genet. (2000) Sep;16(9):369-72.<br />

7. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison <strong>of</strong> yeast species to<br />

identify genes and regulatory elements. Nature. (2003) May 15;423(6937):241-54.<br />

8. Blanchette M, Schwikowski B, Tompa M. Algorithms for phylogenetic footprinting. J Comput Biol.<br />

(2002);9(2):211-23.<br />

9. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol.<br />

(1981);17(6):368-76.<br />



10. Ng PC, Henik<strong>of</strong>f JG, Henik<strong>of</strong>f S. PHAT: a transmembrane-specific substitution matrix. Predicted<br />

hydrophobic and transmembrane. Bioinformatics. 2000 Sep;16(9):760-6. Erratum in: Bioinformatics<br />

2001 Mar;17(3):290<br />

11. Holmes I, Rubin GM. An expectation maximization algorithm for training hidden substitution models.<br />

J Mol Biol. (2002) Apr 12;317(5):753-64.<br />

12. B<strong>of</strong>felli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM. Phylogenetic<br />

shadowing <strong>of</strong> primate sequences to find functional regions <strong>of</strong> the human genome. Science. (2003) Feb<br />

28;299(5611):1391-4.<br />

13. Yang Z. PAML: a program package for phylogenetic analysis <strong>by</strong> maximum likelihood. Comput Appl<br />

Biosci. (1997) Oct;13(5):555-6.<br />

14. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer-Verlag, NY. (2001)

15. Yang, Z. Maximum likelihood models for combined analyses <strong>of</strong> multiple sequence data. J Mol Evol.<br />

42:587-596 (1996.)<br />

16. Yang, Z., N. Goldman, and A. E. Friday. Comparison <strong>of</strong> models for nucleotide substitution used in<br />

maximum likelihood phylogenetic estimation. Mol Biol Evol. 11:316-324 (1994)<br />

17. M. I. Jordan, An Introduction to Probabilistic Graphical Models, in preparation.<br />

18. Yang Z: PAML: a program package for phylogenetic analysis <strong>by</strong> maximum likelihood. Comput. Appl.<br />

Biosci. (1997) 13(5):555-556<br />

19. Zhu J, Zhang MQ. SCPD: a promoter database <strong>of</strong> the yeast Saccharomyces cerevisiae. Bioinformatics.<br />

(1999) Jul-Aug;15(7-8):607-611.<br />

20. Notredame C, Higgins DG, Heringa J. T-C<strong>of</strong>fee: A novel method for fast and accurate multiple<br />

sequence alignment. J Mol Biol. (2000) Sep 8;302(1):205-17.<br />

21. Halpern AL, Bruno WJ. Evolutionary distances for protein-coding sequences: modeling site-specific<br />

residue frequencies. Mol Biol Evol. 1998 Jul;15(7):910-917.<br />

22. Moses AM, Chiang DY, Kellis M, Lander ES, <strong>Eisen</strong> MB. Position specific variation in the rate <strong>of</strong><br />

<strong>evolution</strong> in transcription factor binding sites. BMC Evol Biol. (2003) 3:18<br />

23. Hodges PE, Payne WE, Garrels JI. The Yeast Protein Database (YPD): a curated proteome database for<br />

Saccharomyces cerevisiae. Nucleic Acids Res. (1998) Jan 1;26(1):68-72.<br />



10. MONKEY – extending the probabilistic model to identify binding sites <strong>of</strong> factors<br />

with known specificity.<br />

This work was published as Moses et al. 2004.<br />

Different types <strong>of</strong> genomic features have characteristic patterns <strong>of</strong> <strong>evolution</strong> that, when<br />

sequences from closely related organisms are available, can be exploited to annotate<br />

genomes [1]. Methods for comparative sequence analysis that exploit variation in rates<br />

and patterns <strong>of</strong> nucleotide <strong>evolution</strong> can identify coding exons [1,2], noncoding<br />

sequences involved in the <strong>regulation</strong> <strong>of</strong> transcription [3,4] and various types <strong>of</strong> RNAs [5-<br />

7]. While most <strong>of</strong> these methods have been developed for and applied to pairwise<br />

comparisons, sequence data are increasingly available for multiple closely related species<br />

[8]. It is therefore <strong>of</strong> considerable importance to develop sequence-analysis methods that<br />

optimally exploit <strong>evolution</strong>ary information, and to explore the dependence <strong>of</strong> these<br />

methods on the <strong>evolution</strong>ary relationships <strong>of</strong> the species in comparison.<br />

Sequence-specific DNA-binding proteins involved in <strong>transcriptional</strong> <strong>regulation</strong><br />

(transcription factors) play a central role in many biological processes. Despite extensive<br />

biochemical and molecular analysis, it remains exceedingly difficult to predict where on<br />

the genome a given factor will bind. Transcription factors bind to degenerate families <strong>of</strong><br />

short (6-20 base-pairs (bp)) sequences that occur frequently in the genome, yet only a<br />

small fraction <strong>of</strong> these sequences are actually bona fide targets <strong>of</strong> the transcription factor<br />

[9]. A major challenge in understanding the <strong>regulation</strong> <strong>of</strong> transcription is to be able to<br />

distinguish real transcription factor binding sites (TFBSs) from sequences that simply<br />

match a factor's binding specificity. Because the <strong>evolution</strong>ary properties <strong>of</strong> TFBSs are<br />



expected to be different from their nonfunctional counterparts, comparative analyses hold<br />

great promise in helping to address this challenge.<br />

In the past few years, several methods have been introduced to identify conserved<br />

(and presumably functional) TFBSs for a factor <strong>of</strong> known specificity (in contrast to the<br />

larger set <strong>of</strong> methods that use comparative data in motif discovery or to otherwise<br />

identify sequences likely to be involved in cis-<strong>regulation</strong>). Each <strong>of</strong> these methods<br />

explicitly or implicitly adopts one of several distinct definitions of a conserved TFBS.

These include a binding site in a reference genome that is perfectly or highly conserved<br />

[8,10-12]; a binding site in a reference genome that lies in a highly conserved region [4];<br />

or a position at which the binding model predicts a binding site in all species [13-18].<br />

In a previous study we characterized the <strong>evolution</strong> <strong>of</strong> experimentally validated<br />

TFBSs in the Saccharomyces cerevisiae genome, finding that functional TFBSs evolve<br />

more slowly than flanking intergenic regions, and more strikingly, that there is<br />

considerable position-specific variation in <strong>evolution</strong>ary rates within TFBSs [19]. We<br />

further showed that <strong>evolution</strong>ary rate at each position is a function <strong>of</strong> the selectivity <strong>of</strong><br />

the factor for bases at that position. Our goal here is to incorporate these specific evolutionary properties of TFBSs into the search for conserved TFBSs or, more precisely, to develop a method that, given the specificity of a transcription factor,

identifies conserved binding sites in multiple alignments <strong>by</strong> taking into account the<br />

sequence specificity and patterns <strong>of</strong> <strong>evolution</strong> expected for TFBSs, while still fully<br />

exploiting the phylogenetic relationships <strong>of</strong> the species being compared.<br />

In addition to developing new methods, we are interested in testing several hypotheses regarding the comparative annotation of TFBSs. It has been noted that



the effectiveness <strong>of</strong> such analyses will depend critically on the <strong>evolution</strong>ary distance<br />

separating the species used. At very close distances TFBSs will appear conserved<br />

because there has been insufficient time for substitutions to occur. As distance increases,<br />

and substitutions occur most rapidly at nonfunctional positions, our ability to detect<br />

constrained binding sites should improve until we are no longer able to reliably assign<br />

orthology based on sequence alignment. To overcome this problem <strong>of</strong> divergence<br />

distances exceeding what can be aligned, the sequences <strong>of</strong> multiple closely related<br />

species can be used to span the same <strong>evolution</strong>ary distances (and presumably provide the<br />

same discriminatory power) as fewer more distantly related ones. However, aside from<br />

these qualitative expectations, the dependence <strong>of</strong> the ability to identify conserved TFBSs<br />

on <strong>evolution</strong>ary distance and tree topology has not been rigorously investigated. Because<br />

the s<strong>of</strong>tware MONKEY can be applied to multiple alignments <strong>of</strong> varying numbers <strong>of</strong><br />

species and produces scores that can be meaningfully compared across different sets <strong>of</strong><br />

species, we are now able to address these issues.<br />

Results<br />

Overview<br />

We developed an approach to identify conserved TFBSs that combines probabilistic<br />

models of binding-site specificity [20-22] with probabilistic models of evolution [23,24].

Starting with an alignment <strong>of</strong> sequences from multiple related species, we use the known<br />

sequence specificity for a transcription factor to compare the likelihood <strong>of</strong> the sequences<br />

under two <strong>evolution</strong>ary models - one for background and one for TFBSs. The central<br />

feature <strong>of</strong> this method that underlies its ability to identify conserved TFBSs is that it uses<br />

a specific probabilistic <strong>evolution</strong>ary model for the binding sites <strong>of</strong> each transcription<br />



factor. The <strong>evolution</strong>ary model we use for TFBSs [25] assumes that sites were under<br />

selection to remain binding sites throughout the <strong>evolution</strong>ary history <strong>of</strong> the species being<br />

studied. This model uses the sequence specificity <strong>of</strong> the factor to predict patterns and<br />

rates <strong>of</strong> <strong>evolution</strong> that recapitulate the patterns and rates observed in real TFBSs [19].<br />

MONKEY: scanning alignments to identify conserved transcription factor binding sites<br />

MONKEY, our tool for identifying conserved TFBSs, takes as input a multiple sequence<br />

alignment, a tree describing the relationship <strong>of</strong> the aligned species, a model <strong>of</strong> a<br />

transcription factor's binding specificity and a model for background noncoding DNA. It<br />

returns, for each position in the alignment, a likelihood ratio comparing the probability that the position is a conserved binding site for the selected factor to the probability that the position is background.

Extending matrix searches to multiple sequence alignments<br />

For the model of binding specificity, we use a traditional frequency matrix [20-22]. The values in the matrix, $f_{ib}$, represent the probability of observing the base b (A, C, G or T) at the ith position in a binding site of width w. For the model of the background, we use a single set of base frequencies $g_b$. A widely used statistic for scoring the similarity of a single sequence to a frequency matrix is the log likelihood ratio comparing the probability of having observed a sequence X of width w under the motif model (a frequency matrix, designated as motif) to the probability of having observed X under the background model (designated by bg), which can be easily reduced to:

$$S(X) = \log\frac{p(X \mid \mathrm{motif})}{p(X \mid \mathrm{bg})} = \sum_{i=1}^{w} \sum_{b} X_{ib} \log\frac{f_{ib}}{g_b},$$


where $X_{ib}$ is an indicator variable which equals 1 if base b is observed at position i, and zero otherwise. This classifier can be motivated by the approximation that the data are distributed as a two-component mixture of sequences matching the frequency matrix and sequences drawn from a uniform background. In practice, we compute this score using a position-specific scoring matrix (PSSM) with entries $M_{ib} = \log(f_{ib}/g_b)$, and find S for a particular w-mer by adding up the entries that correspond to the bases in the query sequence. In extending this to a pair of aligned sequences X and Y, we want to perform the same calculation on their common ancestor A. Since A is not observed, we consider all possible ancestral sequences by summing over them, weighting each by its probability given the data (X and Y), the phylogenetic tree (T) that relates the sequences, and a probabilistic evolutionary model [23].
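The single-sequence PSSM score described above can be sketched in a few lines of Python. This is only an illustration and not the MONKEY implementation: the function names, the toy three-column matrix and the use of base-2 logarithms (so that scores are in bits) are assumptions made for the example, and all matrix frequencies are assumed to be nonzero.

```python
import math

BASES = "ACGT"

def build_pssm(freq_matrix, background):
    """Return M[i][b] = log2(f_ib / g_b) for each motif position i."""
    return [
        {b: math.log2(col[b] / background[b]) for b in BASES}
        for col in freq_matrix
    ]

def score_word(pssm, word):
    """Sum the PSSM entries selected by the bases of one w-mer."""
    if len(word) != len(pssm):
        raise ValueError("word length must equal the motif width")
    return sum(col[base] for col, base in zip(pssm, word))

def scan(pssm, sequence):
    """Score every w-mer of a sequence; returns (offset, score) pairs."""
    w = len(pssm)
    return [(i, score_word(pssm, sequence[i:i + w]))
            for i in range(len(sequence) - w + 1)]

# Hypothetical frequency matrix (w = 3) and uniform background.
toy_motif = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
uniform_bg = {b: 0.25 for b in BASES}
toy_pssm = build_pssm(toy_motif, uniform_bg)
```

Calling `scan(toy_pssm, sequence)` on a longer sequence returns one score per offset, and the highest-scoring offsets are the candidate matches to the matrix.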

We can write a new score representing the log-likelihood ratio that compares the<br />

hypothesis that X and Y are a conserved example <strong>of</strong> the binding site represented <strong>by</strong> the<br />

frequency matrix to the hypothesis that they have been drawn from the background:<br />

$$\hat{S}(X, Y) = \log\frac{p(X, Y \mid \mathrm{motif}, T, R_{\mathrm{motif}})}{p(X, Y \mid \mathrm{bg}, T, R_{\mathrm{bg}})},$$

where $R_{\mathrm{motif}}$ and $R_{\mathrm{bg}}$ are rate matrices describing the substitution process of the binding site and background respectively. Using the conditional independence of the sequences X and Y given the ancestor, A, and writing $T_{AX}$ for the evolutionary distance separating sequence X from A, this becomes:

$$\hat{S}(X, Y) = \log\frac{\sum_{A} p(X \mid A, T_{AX}, R_{\mathrm{motif}})\, p(Y \mid A, T_{AY}, R_{\mathrm{motif}})\, p(A \mid \mathrm{motif})}{\sum_{A} p(X \mid A, T_{AX}, R_{\mathrm{bg}})\, p(Y \mid A, T_{AY}, R_{\mathrm{bg}})\, p(A \mid \mathrm{bg})}.$$


The class of evolutionary models used by MONKEY defines a substitution matrix, $p(X_i \mid A_i, t) = e^{Rt}$, that represents the probability of observing each base at position i in the extant sequence (X) given each base in the ancestral sequence (A) after t units of evolutionary time or distance, given some rate matrix, R [23]. Since these models retain positional independence, we can rewrite this as:

$$\hat{S}(X, Y) = \sum_{i=1}^{w} \log\frac{\sum_{b} p(X_i \mid A_{ib} = 1, T_{AX}, R_{\mathrm{motif}})\, p(Y_i \mid A_{ib} = 1, T_{AY}, R_{\mathrm{motif}})\, f_{ib}}{\sum_{b} p(X_i \mid A_{ib} = 1, T_{AX}, R_{\mathrm{bg}})\, p(Y_i \mid A_{ib} = 1, T_{AY}, R_{\mathrm{bg}})\, g_b}.$$

This can be extended to more than two sequences, that is, $\hat{S}(X, Y, \ldots, Z)$, by replacing the probabilities of X and Y with the probabilities of the left and right branches of the tree below the root, and performing the calculation at the root. The probabilities of the left and right branches of the tree can be calculated recursively as has been described previously [23].

Once again, for practical purposes we can convert these scores to a PSSM, whose entries<br />

are given for the pairwise case <strong>by</strong>:<br />

$$\hat{M}_{iab} = \log\frac{p(X_{ia} = 1, Y_{ib} = 1 \mid \mathrm{motif}, T, R_{\mathrm{motif}})}{p(X_{ia} = 1, Y_{ib} = 1 \mid \mathrm{bg}, T, R_{\mathrm{bg}})},$$

where at each position we now index by the bases a and b in the two sequences. For multiple alignments of n species, each position requires $4^n$ entries.
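To make the calculation at the root concrete, the following Python sketch computes the log likelihood ratio for a single alignment column by summing over the unobserved ancestral bases with the recursive (pruning) calculation of [23]. It is an illustration under simplifying assumptions rather than the MONKEY code: both the motif and the background use a Jukes-Cantor substitution process, the motif simply evolves at a hypothetical relative rate `motif_rate`, and the tree encoding (a leaf is a species name, an internal node is a list of `(child, branch_length)` pairs) is invented for the example.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of base b after distance t, given base a."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def conditional(tree, column, rate):
    """P(observed bases below this node | base at this node), per base."""
    if isinstance(tree, str):                  # leaf: look up observed base
        observed = column[tree]
        return {a: 1.0 if a == observed else 0.0 for a in BASES}
    result = {a: 1.0 for a in BASES}
    for child, branch in tree:                 # internal node: combine children
        below = conditional(child, column, rate)
        for a in BASES:
            result[a] *= sum(jc_prob(a, b, rate * branch) * below[b]
                             for b in BASES)
    return result

def column_likelihood(tree, column, root_freqs, rate):
    """Sum over root bases, weighted by the root (ancestral) frequencies."""
    below = conditional(tree, column, rate)
    return sum(root_freqs[a] * below[a] for a in BASES)

def column_score(tree, column, motif_freqs, bg_freqs, motif_rate=0.5):
    """log2 likelihood ratio of one column: motif model vs background."""
    num = column_likelihood(tree, column, motif_freqs, motif_rate)
    den = column_likelihood(tree, column, bg_freqs, 1.0)
    return math.log2(num / den)

# Pairwise example: X and Y each 0.1 substitutions per site from ancestor A.
pair_tree = [("X", 0.1), ("Y", 0.1)]
```

Evaluating `column_score` for every combination of bases at a motif position fills in the entries of the alignment PSSM; with branch lengths of zero and identical bases the score reduces to the single-sequence entry log(f/g).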

Evolutionary models<br />

The use of evolutionary models is critical to the function of MONKEY. Myriad such models exist, and in principle all can be used in MONKEY. For the background, it is

natural to use a model appropriate for sites with no particular constraint, such as the<br />

average intergenic or synonymous rates. MONKEY allows the use <strong>of</strong> the JC [26] or HKY<br />



[27] models, and here we use the latter with the base frequencies, rates and transition-<br />

transversion rate-ratio estimated from noncoding alignments assuming a single model <strong>of</strong><br />

<strong>evolution</strong> over the noncoding regions (see details in Materials and methods). It is also<br />

possible to estimate the <strong>evolution</strong>ary model separately for each intergenic alignment,<br />

although the small size <strong>of</strong> yeast intergenic regions leads to variable estimates. In<br />

principle, the JC and HKY models can also be used for the motif, with rates set according<br />

to our expectation <strong>of</strong> the overall rate <strong>of</strong> <strong>evolution</strong> in functional binding sites, which has<br />

been estimated as two to three times slower than the average intergenic rate [19].<br />

However, we have previously shown that there is position-specific variation in<br />

<strong>evolution</strong>ary rates within functional transcription factor binding sites [19] and that<br />

positions in a motif with low degeneracy in the binding-site model evolve more slowly

than positions with high degeneracy; this relationship between the equilibrium<br />

frequencies and the position-specific <strong>evolution</strong>ary rates is accurately predicted <strong>by</strong> an<br />

<strong>evolution</strong>ary model from Halpern and Bruno (HB model) [25].<br />

In using this model, we assume that sequences evolve under constant purifying<br />

selection to maintain a particular set <strong>of</strong> equilibrium base frequencies. The use <strong>of</strong> this<br />

model corresponds to a definition <strong>of</strong> a conserved TFBS as a sequence position where<br />

there has always been a binding site for the transcription factor. Although the model does<br />

not strictly require that a binding site be present in each <strong>of</strong> the observed species, positions<br />

lacking such sites will have lower probabilities as they require the use <strong>of</strong> less probable<br />

substitutions. The rate <strong>of</strong> change from residue a to b at position i in the motif is given <strong>by</strong>:<br />

$$R^{(i)}_{ab} = Q_{ab} \times \frac{\ln\left(\dfrac{f_{ib} Q_{ba}}{f_{ia} Q_{ab}}\right)}{1 - \dfrac{f_{ia} Q_{ab}}{f_{ib} Q_{ba}}},$$

where Q is the (position independent) underlying mutation matrix, which we set equal to the background model ($Q = R_{\mathrm{bg}}$), and f is the frequency matrix describing the specificity of the factor. Thus, for each position in the motif, the HB model predicts the rates of each type of substitution as a function of the frequency matrix and the background model.
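As a worked illustration of the rate formula, the sketch below builds the position-specific rate matrix for one motif column from its frequencies and a background rate matrix. The function name and the dictionary layout are assumptions for the example, all frequencies and off-diagonal rates are assumed to be positive, and the case where the ratio inside the logarithm equals one is handled by its limit (where the rate equals the background rate).

```python
import math

BASES = "ACGT"

def hb_rate_matrix(site_freqs, Q):
    """Halpern-Bruno rates R[a][b] for a single motif position.

    site_freqs: dict base -> f_ib for this position (all values > 0).
    Q: dict of dicts with background rates Q[a][b] for a != b.
    """
    R = {a: {} for a in BASES}
    for a in BASES:
        for b in BASES:
            if a == b:
                continue
            # x = f_ib * Q_ba / (f_ia * Q_ab)
            x = (site_freqs[b] * Q[b][a]) / (site_freqs[a] * Q[a][b])
            if abs(x - 1.0) < 1e-9:
                factor = 1.0   # limit of ln(x) / (1 - 1/x) as x -> 1
            else:
                factor = math.log(x) / (1.0 - 1.0 / x)
            R[a][b] = Q[a][b] * factor
        # Diagonal entry makes each row of the rate matrix sum to zero.
        R[a][a] = -sum(R[a][b] for b in BASES if b != a)
    return R

# Hypothetical inputs: Jukes-Cantor-like background and one skewed column.
jc_Q = {a: {b: 1.0 / 3.0 for b in BASES if b != a} for a in BASES}
column_freqs = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
column_rates = hb_rate_matrix(column_freqs, jc_Q)
```

In this toy column, substitutions away from the preferred base A are slower than the background rate and substitutions toward A are faster, which is the qualitative behavior described in the text for positions with low degeneracy.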

Comparing hits for different factors and <strong>evolution</strong>ary distances: computing the null<br />

distribution<br />

To compare scores from different <strong>evolution</strong>ary distances and different factors, it is<br />

critical that we are able to assign significance to a particular value <strong>of</strong> the score. To do so,<br />

we need to compute the distribution <strong>of</strong> the score under the null hypothesis that the<br />

sequence is part <strong>of</strong> the background. Calculating a p-value for a score S in a single<br />

sequence requires the enumeration <strong>of</strong> all possible w-mers that have a score S or greater<br />

under the background model. For n aligned sequences this requires the enumeration of all $4^{wn}$ possible sets of aligned w-mers with scores S or greater under the background model.

While the number <strong>of</strong> possible alignments <strong>of</strong> n w-mers can be unmanageably large for<br />

even small values <strong>of</strong> n and w, because we treat each position independently we can<br />

enumerate these possibilities efficiently using an algorithm developed for matrix searches<br />

<strong>of</strong> single sequences [28,29]. Every observed score is a sum <strong>of</strong> w numbers, one from each<br />

column <strong>of</strong> the matrix. The probability <strong>of</strong> observing exactly score S is the number <strong>of</strong> paths<br />

through the matrix whose entries add up to S, weighted <strong>by</strong> the probability <strong>of</strong> the path. By<br />

converting the matrix to integers, we can compute this probability for all values of S recursively. We initialize $P_i(S)$ (the probability of observing score S after i columns in the matrix) by setting $P_0(S) = 1$ for S = 0, and $P_0(S) = 0$ for S ≠ 0. We then compute the values of the function for i = 1, ..., w as follows:

$$P_i(S) = \sum_{c} P_{i-1}(S - \hat{M}_{ic})\, p(c \mid \mathrm{bg}, T, R_{\mathrm{bg}}).$$

For aligned sequences, c represents a column in the alignment, and the sum is over all $4^n$ possible columns of an alignment of n sequences. The probability distribution function (PDF) of scores is $P_w(S)$, and from this the cumulative distribution function (CDF), the probability of observing a score of S or greater, can be directly computed. Although in

principle we can compute the probabilities to arbitrary precision, because the time<br />

complexity increases with the number <strong>of</strong> possible scores, we limit the precision to within<br />

approximately 0.01 bits.<br />
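The recursion can be sketched as a short dynamic program over the integer-converted matrix. The code below is a minimal illustration, not the MONKEY implementation: the function names are hypothetical, a "column" is any hashable key (a single base for one sequence, or a tuple of bases for an alignment), and the background probability of each column is assumed to be supplied by the caller.

```python
from collections import defaultdict

def to_integers(matrix, step=0.01):
    """Convert real-valued scores to integer multiples of `step` bits."""
    return [{c: round(v / step) for c, v in col.items()} for col in matrix]

def score_distribution(int_matrix, column_probs):
    """Exact distribution of the summed integer score under the background.

    int_matrix: one dict per motif position, mapping column -> integer score.
    column_probs: dict mapping column -> background probability.
    """
    pdf = {0: 1.0}                             # P_0(S): all mass at S = 0
    for scores in int_matrix:                  # positions i = 1 .. w
        nxt = defaultdict(float)
        for s, p in pdf.items():
            for c, pc in column_probs.items():
                nxt[s + scores[c]] += p * pc   # reach S from S - M_ic
        pdf = dict(nxt)
    return pdf

def p_value(pdf, threshold):
    """Probability of observing a score of `threshold` or greater."""
    return sum(p for s, p in pdf.items() if s >= threshold)
```

Because the number of distinct integer scores grows with the precision of the conversion, a coarser `step` trades accuracy for speed, which is the trade-off noted above.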


Figure 1 compares empirical p-values from 5,000 pairs <strong>of</strong> sequences evolved in a<br />

simulation (see Materials and methods) with those computed <strong>by</strong> this method, and shows<br />

that they agree closely. We have used this method to compute the CDFs for alignments <strong>of</strong><br />

up to six species, and therefore can apply our method to most comparative genomics<br />

applications. We note, in addition, that the likelihood ratio scores are approximately<br />

Gaussian (data not shown). As the mean and variance of the scores under each model

can be computed efficiently (see Materials and methods) we can estimate p-values using<br />

a Gaussian approximation (Figure 1) when the number <strong>of</strong> sequences in the alignment is<br />

large.<br />
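A Gaussian approximation of this kind can be sketched as follows. The details of how the moments are computed are given in Materials and methods and are not reproduced here, so this example assumes the simplest approach: because positions are treated independently, the mean and variance of the total score are the sums of the per-position means and variances under the background. The function names are hypothetical.

```python
import math

def score_moments(matrix, column_probs):
    """Mean and variance of the summed score under the background model.

    matrix: one dict per motif position, mapping column -> score.
    column_probs: dict mapping column -> background probability.
    """
    mean = 0.0
    var = 0.0
    for scores in matrix:
        m = sum(column_probs[c] * scores[c] for c in column_probs)
        m2 = sum(column_probs[c] * scores[c] ** 2 for c in column_probs)
        mean += m                # means add across independent positions
        var += m2 - m * m        # and so do variances
    return mean, var

def gaussian_p_value(score, mean, var):
    """Upper-tail probability of a normal distribution with these moments."""
    z = (score - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

The cost of this calculation grows only with the number of matrix entries, not with the number of possible scores, which is why it remains practical when the exact enumeration does not.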



Figure 1.<br />

Accuracy <strong>of</strong> p-value estimations.<br />

To examine the accuracy <strong>of</strong> our p-value estimates, we compared the empirical p-value (computed from the<br />

observed distribution <strong>of</strong> scores) to p-values computed using either the exact method described above (black<br />

points) or Gaussian approximation (gray points). The scores represent the simple score at a distance <strong>of</strong> 0.1<br />

substitutions per site calculated using the Gcn4p matrix from SCPD [33]. Other models and matrices<br />

produce similar results.<br />

Heuristics for alignments with gaps<br />

The treatment <strong>of</strong> alignment gaps in identifying conserved TFBSs is somewhat<br />

problematic. On the one hand, nonfunctional sequences may be inserted and deleted

over <strong>evolution</strong> more rapidly than functional elements [30-32], and thus the presence <strong>of</strong> a<br />

gap aligned to a predicted binding site could indicate that it is nonfunctional. On the other<br />



hand, alignment algorithms are imperfect, and must <strong>of</strong>ten make arbitrary decisions about<br />

the placement <strong>of</strong> gaps. We sought to design a heuristic that accommodated both these<br />

aspects <strong>of</strong> genomic sequence data <strong>by</strong> locally optimizing alignments for the purpose <strong>of</strong><br />

comparative annotation <strong>of</strong> regulatory elements.<br />

The idea is to assign a poor score to regions <strong>of</strong> the alignment with a large number<br />

<strong>of</strong> gaps, but to locally realign regions with a small number <strong>of</strong> gaps to identify conserved<br />

but misaligned binding sites. To do this, we scan along the ungapped version <strong>of</strong> one <strong>of</strong><br />

the aligned sequences - the 'reference' sequence. For each position $p_r$ in the reference sequence, we define a window in each other sequence s around $p_s$, the position in sequence s aligned to position $p_r$. The window runs from $p_s - (a + b)$ to $p_s + w + (a + b)$, where a and b are the numbers of gaps in the aligned versions of sequences r and s in positions p to p + w, and p is the position in the alignment of $p_r$. For each subsequence of length w in the window, we calculate the percent identity to the reference sequence, and create an alignment of $p_r$ to $p_r + w$ (in the reference sequence) to the most similar word in the window of each other sequence. This locally optimized alignment is then scored. Note that if a and b are zero (meaning there are no gaps in the aligned sequences), no optimization is done. If a is too large (in most contexts greater than five) we exclude that region of the alignment from further analysis. This heuristic encapsulates the idea that too

many gaps are indicative <strong>of</strong> lack <strong>of</strong> constraint, but conservatively allows for a few gaps<br />

due to alignment or sequence imperfections.<br />
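The local realignment step for one non-reference sequence can be sketched as below. This is a simplified illustration with hypothetical names: it takes the ungapped sequences and the gap counts a and b as inputs, breaks ties between equally similar words by taking the first, and returns `None` when the region is excluded, all of which are assumptions rather than details stated in the text.

```python
def percent_identity(u, v):
    """Fraction of positions at which two equal-length words agree."""
    return sum(x == y for x, y in zip(u, v)) / len(u)

def best_match(ref_word, other_seq, ps, a, b, max_gaps=5):
    """Find the w-mer near position ps most similar to the reference word.

    ref_word: w-mer from the ungapped reference sequence.
    other_seq: ungapped sequence of another species.
    ps: position in other_seq aligned to the start of ref_word.
    a, b: gap counts in the aligned reference and other sequence.
    Returns (offset, word), or None if the region is excluded.
    """
    w = len(ref_word)
    if a > max_gaps:
        return None                            # too many gaps: exclude region
    if a == 0 and b == 0:
        return ps, other_seq[ps:ps + w]        # no gaps: keep the alignment
    pad = a + b
    start = max(0, ps - pad)
    end = min(len(other_seq), ps + w + pad)
    candidates = [(i, other_seq[i:i + w]) for i in range(start, end - w + 1)]
    if not candidates:
        return None                            # window shorter than the motif
    return max(candidates, key=lambda c: percent_identity(ref_word, c[1]))
```

The word returned for each species would then be stacked with the reference word to form the locally optimized alignment that is scored.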

Application to Saccharomyces<br />



The genome sequences <strong>of</strong> several species closely related to the budding yeast<br />

Saccharomyces cerevisiae have recently been published and become models for the<br />

comparative identification <strong>of</strong> transcription factor binding sites [8,11]. We aligned the<br />

intergenic regions <strong>of</strong> S. cerevisiae genes to their orthologs in S. paradoxus, S. mikatae, S.<br />

bayanus and S. kudriavzevii genomes using CLUSTALW (see Materials and methods)<br />

and sought to evaluate the effectiveness <strong>of</strong> MONKEY under different <strong>evolution</strong>ary<br />

models and distances. Ideally, we would use several diverse transcription factors with<br />

known binding specificity, where the set <strong>of</strong> matches to the factor's matrix in the S.<br />

cerevisiae genome could be divided into two reasonably sized sets: those known to be<br />

bound <strong>by</strong> the factor (positives) and those known not to be bound <strong>by</strong> the factor<br />

(negatives). Unfortunately, even in yeast, the number <strong>of</strong> such cases is limited. For many<br />

factors we can identify true positives <strong>by</strong> combining high- and low-throughput<br />

experimental data that supports the hypothesis that a particular position in the genome is<br />

bound by a given factor. A true negative set, however, must be constructed on the basis of lack of evidence that a sequence is functional, as the interpretation of negative results is almost always ambiguous. In the case of transcription factor binding sites this is

particularly problematic, because DNA-binding proteins have overlapping specificity,<br />

and we may therefore observe conservation <strong>of</strong> a binding site because it is bound <strong>by</strong><br />

another factor with similar specificity. After evaluating all factors with binding specificity in the Saccharomyces cerevisiae Promoter Database (SCPD) [33], we focus on

Gal4p and Rpn4p for further analysis (see Table 1 for properties <strong>of</strong> these factors, and<br />

Materials and methods for a description <strong>of</strong> the selection <strong>of</strong> positive and negative sets).<br />



Table 1.<br />

Definition <strong>of</strong> positive and negative sets <strong>of</strong> matrix matches<br />

Criteria used to define positive and negative sets to use in this study. It is important to avoid factors whose<br />

specificity overlaps with other factors, because binding sites that are not occupied <strong>by</strong> one factor may be<br />

constrained because <strong>of</strong> binding <strong>by</strong> another, and to choose factors with characterized specificity because our<br />

methods rely on the assumption that the specificity is known.<br />

The effects <strong>of</strong> <strong>evolution</strong>ary models on the discrimination <strong>of</strong> functional binding sites<br />

To evaluate the performance <strong>of</strong> our <strong>evolution</strong>ary method in correctly identifying bona<br />

fide binding sites, we calculated the p-values <strong>of</strong> the positive and negative sites for each<br />

factor, using MONKEY on alignments <strong>of</strong> all five genomes for Rpn4p and four species<br />

(with S. kudriavzevii excluded because too few sequences were available) for Gal4p. We<br />

compared the performance <strong>of</strong> MONKEY with the HB model to scores from S. cerevisiae<br />

alone and to a 'simple' score (equal to the average <strong>of</strong> the single sequence log likelihood<br />

ratios) that utilizes all the comparative data without an <strong>evolution</strong>ary model. The results<br />

are summarized in Table 2. An ideal scoring method would assign low p-values to real<br />

sites (positives) and high p-values to spurious sites (negatives), and we therefore<br />

compared the p-values assigned by MONKEY based on the HB model to those based on the

'simple' score. Not surprisingly, both methods were a great improvement over searching<br />

in S. cerevisiae alone. Overall, when compared to each other, the HB score assigned<br />

lower p-values to the binding sites more <strong>of</strong>ten in the positive sets (90% for Gal4p and<br />



80% for Rpn4p) and less <strong>of</strong>ten in the negative sets (20% for Gal4p and 25% for Rpn4p)<br />

than did the simple score. We note that some <strong>of</strong> the supposedly functional Rpn4p sites<br />

were assigned higher p-values in S. cerevisiae alone, suggesting that they are not in fact

conserved; these will be discussed below.<br />

Table 2.<br />

Performance <strong>of</strong> different scores in recognizing functional and nonfunctional sites<br />

The score based on the Halpern-Bruno (HB) model assigns lower p-values to functional binding sites and higher p-values to nonfunctional binding sites than the simple score, defined as the average of the single species scores at that position in the alignment. Both methods are far superior to p-values from S. cerevisiae alone. See text for details.

The effect <strong>of</strong> <strong>evolution</strong>ary distance on the discrimination <strong>of</strong> functional binding sites<br />

As <strong>evolution</strong>ary distance increases, we expect fewer matches to the matrix to be<br />

conserved <strong>by</strong> chance, which implies that the probability <strong>of</strong> observing matches as highly<br />

conserved as the functional sites should decrease. Similarly, we expect the nonfunctional<br />

sites to show many substitutions and their p-values to increase over <strong>evolution</strong>. To explore<br />

the change in p-values over <strong>evolution</strong>ary distance, we scored the functional and<br />

nonfunctional sets <strong>of</strong> binding sites at a variety <strong>of</strong> <strong>evolution</strong>ary distances <strong>by</strong> creating<br />

alignments <strong>of</strong> different combinations <strong>of</strong> species (see Materials and methods). The median<br />

p-value <strong>of</strong> the positive set <strong>of</strong> TFBSs decreases monotonically with <strong>evolution</strong>ary distance,<br />

with the rate <strong>of</strong> decrease an approximately constant function <strong>of</strong> <strong>evolution</strong>ary distance (see<br />

Figure 2). The median p-value for the binding sites in the negative set increases with<br />

<strong>evolution</strong>ary distance, although somewhat erratically. This demonstrates that MONKEY<br />

effectively exploits <strong>evolution</strong>ary distance, and confirms our intuition that as <strong>evolution</strong>ary<br />

distance increases, functional elements should be increasingly easy to distinguish from<br />

spurious predictions.<br />

Figure 2.<br />

Significance <strong>of</strong> matches increases with <strong>evolution</strong>ary distance.<br />

Median p-values for the positive (black squares) and negative (white triangles or white triangle points) sets<br />

<strong>of</strong> binding sites for (a) Gal4p and (b) Rpn4p at different <strong>evolution</strong>ary distances represented <strong>by</strong> comparing<br />

S. cerevisiae to different subsets <strong>of</strong> the available species. For both factors, as <strong>evolution</strong>ary distance<br />

increases, the median p-value <strong>of</strong> the functional matches decreases, indicating that they are less likely to<br />

have appeared <strong>by</strong> chance. Conversely, the median p-value <strong>of</strong> the nonfunctional matches (negative set, white<br />

symbols) increases. These observations agree with our predictions for the behavior <strong>of</strong> the p-values (solid<br />

traces) under either the HB <strong>evolution</strong> for the motif or HKY <strong>evolution</strong> for the background. There is little<br />

difference between these predictions and similar ones that assume that all the comparisons were pairwise<br />

(dotted traces).<br />

To test this hypothesis on a more quantitative level we sought to compare the observed<br />

scores with the expected scores assuming that binding sites evolved precisely according<br />

to the evolutionary models used by MONKEY. Briefly, given a binding-site model and a<br />

phylogenetic tree, we assume we have observed a binding site in the reference genome,<br />

and that this site evolves along the tree under either the motif model (HB) or background<br />

model (HKY), representing functional and nonfunctional binding sites, respectively (see<br />

Materials and methods for details). The expected p-values associated with the functional<br />

binding sites (Figure 2, solid lines) showed reasonable agreement with the models,<br />

consistent with previous observations that they are evolving under constraint that is well<br />

modeled <strong>by</strong> the purifying selection on the base frequencies in the specificity matrix [19].<br />

Pairwise versus multi-species comparisons<br />

The comparisons at the different <strong>evolution</strong>ary distances used in Figure 2 employed<br />

variable numbers <strong>of</strong> species, with the shorter distances representing primarily pairwise<br />

comparisons and the longer distances comparisons <strong>of</strong> three or more species. While we<br />

expect the variation in p-values with different combinations <strong>of</strong> species to be primarily a<br />

function <strong>of</strong> the <strong>evolution</strong>ary distance spanned <strong>by</strong> these species, there will also be effects<br />

related to the number of species and the topology of the tree. For example, in the limit<br />

of very long branch lengths, the evolutionary p-values are on the order of the single-genome p-value raised to the power of<br />

the number <strong>of</strong> species and are independent <strong>of</strong> <strong>evolution</strong>ary distance. In contrast, in the<br />

limit <strong>of</strong> very short branch lengths, the <strong>evolution</strong>ary p-values depend only on the distance<br />

spanned <strong>by</strong> the comparison, as most <strong>of</strong> the information provided <strong>by</strong> additional species is<br />

redundant. However, because most comparisons that are actually carried out are far from<br />

either <strong>of</strong> these extremes, we sought to evaluate the effects <strong>of</strong> species numbers and tree<br />

topology for the Saccharomyces species analyzed here. First, we recomputed the<br />

expected p-values for all the distances analyzed in Figure 2, except that instead <strong>of</strong> using<br />

the real tree topology, we used a single pairwise comparison at the same <strong>evolution</strong>ary<br />

distance (Figure 2, dotted lines). For example, for the Rpn4p analyses using all five<br />

species we assumed a pairwise comparison at an <strong>evolution</strong>ary distance <strong>of</strong> around 1.1<br />

substitutions per site. Note that this is considerably more distant than any <strong>of</strong> the pairwise<br />

comparisons available among these species. The predictions for the pairwise and multi-<br />

species comparisons are very similar, suggesting that at the <strong>evolution</strong>ary distances<br />

spanned <strong>by</strong> these species there is little difference in using multiple species alignments<br />

relative to a pairwise alignment that spans the same <strong>evolution</strong>ary distance. Only at the<br />

longest distances considered (greater than 0.8 substitutions per site) does the power <strong>of</strong> the<br />

pairwise comparison begin to level <strong>of</strong>f, although there are other reasons that multiple<br />

species comparisons might still be preferred (see Discussion). To complement this<br />

theoretical analysis, we were interested in using empirical data to compare pairwise and<br />

multi-species analyses. Fortuitously, the <strong>evolution</strong>ary distance between S. cerevisiae and<br />

S. kudriavzevii is almost exactly equal to the <strong>evolution</strong>ary distance spanned <strong>by</strong> S.<br />

cerevisiae, S. paradoxus and S. mikatae (median tree length approximately 0.5<br />

substitutions per site; see Figure 3a). Because our models predict that we are in a regime<br />

where <strong>evolution</strong>ary distance is the primary determinant <strong>of</strong> the p-values, we expect<br />

searches using these different sets <strong>of</strong> species to yield similar results. We tested this<br />

hypothesis <strong>by</strong> calculating the p-values associated with the Rpn4p-binding sites using the<br />

sequences from these two comparisons. The median p-values in both the positive and<br />

negative sets are very similar (Figure 3b), confirming that at these relatively short<br />

<strong>evolution</strong>ary distances, the power <strong>of</strong> the comparative method is independent <strong>of</strong> the<br />

number <strong>of</strong> species considered (see Discussion).<br />
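The long-branch limit mentioned above can be written out as a rough sketch. The notation is introduced here for illustration and is not the thesis's own: p_1 is the p-value of a match in a single genome and n is the number of aligned species. If the branches are long enough that the aligned sequences are effectively independent, and each species contributes a match of similar quality, the joint p-value behaves roughly like a product:

```latex
p_{\mathrm{evol}} \;\sim\; \prod_{k=1}^{n} p_1 \;=\; p_1^{\,n}
\qquad \text{(very long branches)}
```

Under this reading the value depends on n but not on the branch lengths, which is consistent with the statement that it becomes independent of evolutionary distance. At the short distances analyzed here the species are far from independent, so this limit is not approached.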

Figure 3.<br />

Significance <strong>of</strong> binding sites in pairwise or three-way comparisons at similar<br />

<strong>evolution</strong>ary distance.<br />

(a) Histogram <strong>of</strong> the percent identities <strong>of</strong> all aligned noncoding regions <strong>of</strong> S. cerevisiae and S. kudriavzevii<br />

(open squares) and S. cerevisiae, S. paradoxus and S. mikatae (filled squares). (b) Median p-values <strong>of</strong><br />

functional matches (positive set, gray bars) and the nonfunctional matches (negative set, open bars) for S.<br />

cerevisiae and S. kudriavzevii alignments (left) and S. cerevisiae, S. paradoxus and S. mikatae alignments<br />

(right). The similarity <strong>of</strong> these p-values supports the idea that multiple similar genomes can be used to span<br />

longer <strong>evolution</strong>ary distances, but at these close <strong>evolution</strong>ary distances provide little additional power.<br />

Taken together, these results strongly support the idea that when appropriate methods are<br />

used, data from multiple species can be combined effectively to span larger <strong>evolution</strong>ary<br />

distances. Note that this in no way implies that the addition <strong>of</strong> extra species to an existing<br />

pairwise comparison is not useful - such additions will always increase the evolutionary<br />

distance spanned <strong>by</strong> the species and thus will increase the power <strong>of</strong> the comparison.<br />

Testing the power of comparative annotation of transcription factor binding sites<br />

At the distances spanned <strong>by</strong> all available sequence data, the p-values are so small that we<br />

no longer expect to find matches <strong>of</strong> the quality <strong>of</strong> those in the positive set <strong>by</strong> chance,<br />

especially for Rpn4p. To test this further, we scanned both strands <strong>of</strong> all the available<br />

alignments <strong>of</strong> all five sensu stricto species (around 2.7 Mb) to identify our most confident<br />

predictions of conserved matches to the Rpn4p matrix. We chose the p-value cutoff of<br />

1.85 × 10^-8, which corresponds to a probability of 0.05 of observing one match at that<br />

level over the entire search (using a Bonferroni correction for multiple testing). After<br />

excluding divergently transcribed genes, there were 56 genes that contained putative<br />

binding sites at that p-value. Of 32 genes in our positive set that had sequence available<br />

for all five species, 30 had binding sites below this p-value. Of the 28 genes in the<br />

negative set for which sequences were available, only three had binding sites below this<br />

cut<strong>of</strong>f. In this (nearly ideal) case we have ruled out nearly 90% <strong>of</strong> the negative set at the<br />

expense <strong>of</strong> less than 10% <strong>of</strong> the positives. Examining the expression patterns <strong>of</strong> these<br />

genes (Figure 4a) allows them to be divided into three major classes. The first is a group<br />

(indicated <strong>by</strong> a blue bar) containing 30 genes (28 <strong>of</strong> which were in our original positive<br />

set and two other genes) that show a very similar pattern over the entire set <strong>of</strong> conditions.<br />

The second group (indicated <strong>by</strong> a green bar) contains 11 genes (<strong>of</strong> which only one was in<br />

our original positive set) that show uncoordinated gene expression changes in some<br />

conditions in addition to the stereotypical Rpn4p expression pattern. It is possible that<br />

these genes' <strong>regulation</strong> is controlled <strong>by</strong> multiple mechanisms under different conditions<br />

[34], and <strong>regulation</strong> <strong>by</strong> Rpn4p is one contribution to their overall pattern <strong>of</strong> expression.<br />

Further supporting this hypothesis, only one <strong>of</strong> these genes (UFD1) is annotated as<br />

involved in protein degradation, and three (YBR062C, YOR052C and YER163C) have<br />

unknown functions.<br />
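The Bonferroni cutoff and the recovery figures quoted above can be checked with a few lines of arithmetic. This is a sketch only: the function names are hypothetical, and the number of tests is inferred from the cutoff rather than taken from the text, which does not spell out how positions on the two strands were counted.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value cutoff giving a family-wise error rate of alpha."""
    return alpha / n_tests


def implied_tests(alpha, cutoff):
    """Number of tests implied by a Bonferroni cutoff at a given alpha."""
    return alpha / cutoff


# Values quoted in the text.
alpha = 0.05
cutoff = 1.85e-8
n_tests = implied_tests(alpha, cutoff)  # roughly 2.7 million

# Recovery of the positive and negative sets at this cutoff.
positives_total, positives_found = 32, 30
negatives_total, negatives_found = 28, 3

positives_lost = 1 - positives_found / positives_total      # fraction of positives missed
negatives_excluded = 1 - negatives_found / negatives_total  # fraction of negatives ruled out
```

The implied number of tests is close to the roughly 2.7 Mb of aligned sequence, and the two fractions come out at about 6% of positives lost and about 89% of negatives excluded, in line with the "less than 10%" and "nearly 90%" stated in the text.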

Finally, and most surprising from the perspective <strong>of</strong> comparative annotation, is a<br />

third set <strong>of</strong> 14 genes, including one from our original positive set and three from our<br />

negative set, most <strong>of</strong> which show no evidence <strong>of</strong> the proteasomal expression pattern<br />

associated with Rpn4p (Figure 4b). It is extremely unlikely that these sequences have<br />

been conserved <strong>by</strong> chance, and we suggest that they represent matches that are conserved<br />

for reasons other than binding <strong>by</strong> Rpn4p (see Discussion).<br />

Figure 4.<br />

Relationship between conserved Rpn4p-binding sites and expression.<br />

(a) We identified 56 Rpn4p-binding sites with p-values below 1.85 × 10^-8 using all five species and the<br />

HB model. The expression patterns <strong>of</strong> these genes (clustered and displayed as in [44]) fall into two major<br />

groups: the 'stereotypical' proteasomal pattern (indicated <strong>by</strong> a blue bar at the right), and a second group<br />

expressed in these and additional conditions (indicated <strong>by</strong> the green bar). The orange bars above the<br />

expression data correspond to (left to right) temperature changes, treatment with H2O2, treatment with the<br />

superoxide generating drug menadione, treatment with the sulfhydryl oxidant diamide, deletions <strong>of</strong> YAP1<br />

and MSN2/4, treatment with the DNA damaging agent methylmethanesulfonate (MMS), and heat shock in<br />

deletions <strong>of</strong> MEC1 and DUN1 [48,49]. (b) Examples <strong>of</strong> conserved Rpn4p sites (boxed) that do not fall in<br />

either expression group (neither blue nor green bar).<br />

Non-conserved binding sites in regulated genes<br />

Having identified examples <strong>of</strong> conserved binding sites whose near<strong>by</strong> genes showed no<br />

evidence <strong>of</strong> function, we decided to examine the converse: binding sites near regulated<br />

genes, and therefore presumably functional, that are not conserved. Figure 5 shows the p-<br />

values <strong>of</strong> individual positive Rpn4p sites at different <strong>evolution</strong>ary distances. While most<br />

<strong>of</strong> the sites follow the trajectory predicted for sites evolving under the HB model, the p-<br />

values for four <strong>of</strong> the positive sites seem to be well-modeled <strong>by</strong> the 'background' or<br />

unconstrained model. This is surprising because we expect these binding sites to be<br />

functional, and therefore under purifying selection. One explanation is that some <strong>of</strong> these<br />

sites may have been misannotated as functional. For example, in addition to a<br />

nonconserved positive site, the upstream region <strong>of</strong> REH1 contains another binding site<br />

that is a weaker match to the Rpn4p matrix (Figure 5b) and did not pass our threshold for<br />

inclusion in the positive set (see Materials and methods). This weaker match is more<br />

highly conserved and may represent the functional site in this promoter. In the case <strong>of</strong><br />

PTC3, however, we can find no other candidate binding sites near<strong>by</strong> (Figure 5c). This<br />

represents a possible example of binding site gain, a proposed mechanism of regulatory<br />

<strong>evolution</strong> at the molecular level (see Discussion).<br />

Figure 5.<br />

Some apparently functional Rpn4p-binding sites are not conserved.<br />

(a) The MONKEY p-values (points) <strong>of</strong> all putatively functional Rpn4p-binding sites at varying<br />

<strong>evolution</strong>ary distances, along with the expected values under the HB and HKY models (solid traces). The<br />

majority <strong>of</strong> sites behave as expected for conserved binding sites (lower trace). Several, however, behave as<br />

expected for unconstrained sites (upper trace). (b) The predicted binding site (indicated <strong>by</strong> a box) in REH1,<br />

which encodes a protein <strong>of</strong> unknown function in S. cerevisiae, is not conserved, whereas a binding site with<br />

a lower score is conserved (indicated <strong>by</strong> a black bar). (c) A very poorly conserved match upstream <strong>of</strong><br />

PTC3; in this case no other sites can be found in the region.<br />

Different factors have different relationships between significance and <strong>evolution</strong>ary<br />

distance<br />

The optimal selection <strong>of</strong> species for comparative sequence analysis remains an open<br />

question. To analyze this question for transcription factor binding sites, we examined the<br />

relationship between evolutionary distance and the MONKEY p-values for several S.<br />

cerevisiae transcription factors (Figure 6) for which sufficient characterized binding sites<br />

were available in SCPD [33]. We find that while all factors show the tendency for p-<br />

values to decrease with <strong>evolution</strong>ary distance, the p-values for each factor remain very<br />

different. For example, with alignments <strong>of</strong> four species spanning about 0.8 substitutions<br />

per site, we expect a conserved match to the Gcn4p matrix as good as the median<br />

functional binding site (Figure 6a, red triangles) approximately every million bases <strong>of</strong><br />

aligned sequence. This is in contrast to Rpn4p, for which in the same alignments we expect<br />

such a match (Figure 6a, violet crosses) only once in about 1 billion base pairs. Thus, the<br />

<strong>evolution</strong>ary distance required to achieve a desired p-value is different for different<br />

factors. Understanding the relationship between a frequency matrix and the behavior <strong>of</strong><br />

its p-values is an area for further theoretical exploration. We note that, once again, we can<br />

predict the behavior <strong>of</strong> these p-values (Figure 6b), and that while our predictions agree<br />

qualitatively, there is considerable variability.<br />

Software<br />

MONKEY is implemented in C++. It is available for download under the GPL and can<br />

be accessed over the web at [35].<br />

Discussion<br />

By formulating the problem <strong>of</strong> identifying conserved TFBSs in a probabilistic<br />

<strong>evolution</strong>ary framework, we have both created a useful tool (MONKEY) for comparative<br />

sequence analysis capable <strong>of</strong> functioning on relatively large numbers <strong>of</strong> related species,<br />

and enabled the examination <strong>of</strong> several important questions in comparative genomics.<br />

While most previous approaches to this problem have used heuristics to define conserved<br />

and nonconserved TFBSs, with the probabilistic scores and p-value estimates presented<br />

here the assumptions underlying our approach can be made explicit, and where those<br />

assumptions hold we can be assured of the reliability of our method. In addition, the<br />

probabilistic framework allows us to estimate the amount <strong>of</strong> <strong>evolution</strong>ary distance<br />

required to achieve a certain level <strong>of</strong> significance.<br />

Evolutionary models<br />

The score based on the <strong>evolution</strong>ary model proposed <strong>by</strong> Halpern and Bruno [25]<br />

effectively discriminated the functional and nonfunctional Gal4p- and Rpn4p-binding<br />

sites in S. cerevisiae (Table 2). We believe the success <strong>of</strong> the HB model in predicting<br />

position-specific rates <strong>of</strong> <strong>evolution</strong> [19] and identifying conserved TFBSs reflects its<br />

encapsulation <strong>of</strong> a model <strong>of</strong> binding sites evolving under constant purifying selection.<br />

Although not every functional binding site will remain under purifying selection, as a<br />

result <strong>of</strong> either functional change or binding-site turnover (see below), a large subset <strong>of</strong><br />

functional binding sites do remain under purifying selection, and for these, the 'HB' score<br />

performs better than the 'simple' score. It is interesting to note, however, that the simple<br />

score, which is not based on an <strong>evolution</strong>ary model and does not take into account the<br />

relationships <strong>of</strong> the species used in the comparison, still shows great improvement over<br />

one genome alone, highlighting the value <strong>of</strong> comparative sequence data even when used<br />

suboptimally.<br />

Effects <strong>of</strong> <strong>evolution</strong>ary distance<br />

An important hypothesis <strong>of</strong> the comparative genomics paradigm is that as <strong>evolution</strong>ary<br />

distance increases, observing a match with a given level <strong>of</strong> conservation should become<br />

less and less likely <strong>by</strong> chance - the p-values for functional sites that are conserved are<br />

expected to decrease. We confirm this hypothesis for a small number <strong>of</strong> factors from S.<br />

cerevisiae. In addition, our probabilistic models allow us to quantify this relationship. We<br />

can directly measure the confidence that a specific site is a conserved binding site, and<br />

we can predict the <strong>evolution</strong>ary distance needed to achieve a desired level <strong>of</strong> significance.<br />

Typical p-values for functional binding sites scored <strong>by</strong> matching a matrix to a single<br />

genome are on the order of 10^-4 to 10^-6. Even in a relatively small genome like yeast, with<br />

roughly 12 million bases, we expect many matches at this significance level to occur <strong>by</strong><br />

chance. Adding four closely related species that span a total <strong>evolution</strong>ary distance <strong>of</strong><br />

approximately one substitution per site reduces these p-values <strong>by</strong> approximately three<br />

orders of magnitude to the range 10^-7 to 10^-9. In the yeast genome we expect few, if any,<br />

matches to occur at this level <strong>of</strong> significance <strong>by</strong> chance. When we search the alignments<br />

<strong>of</strong> these species with the Rpn4p matrix with a low enough p-value that we expect a match<br />

at that significance to occur only once in a random 50 Mb genome, we recover nearly the<br />

entire positive set <strong>of</strong> Rpn4p-binding sites while excluding most <strong>of</strong> the negative set,<br />

highlighting the utility <strong>of</strong> MONKEY and the statistics we have developed. As a measure<br />

<strong>of</strong> the improvement over searching a single genome alone, we note that even the best<br />

possible match to the Rpn4p matrix in one genome does not meet this significance<br />

criterion. The expected relationship between <strong>evolution</strong>ary distance and p-value can, in<br />

principle, be used to guide the choice of species to be sequenced for comparative analyses.<br />

However, the dependence <strong>of</strong> p-values on <strong>evolution</strong>ary distance is not the same for all<br />

factors (Figure 6). This suggests that our ability to annotate functional sequences <strong>by</strong><br />

comparative methods will depend on the type <strong>of</strong> sequences that we are trying to annotate,<br />

and that there is no single <strong>evolution</strong>ary distance sweet-spot for identifying TFBSs.<br />
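The argument about chance matches in this section comes down to multiplying a per-position p-value by the number of positions searched. The sketch below works through the figures quoted in the text; it assumes one test per base and ignores the second strand and overlapping windows, and the function name is hypothetical.

```python
def expected_chance_matches(p_value, n_positions):
    """Expected number of matches at or below p_value among n_positions tests."""
    return p_value * n_positions


YEAST_GENOME = 12_000_000  # roughly 12 million bases, as quoted in the text

# Single-genome p-values of 10^-4 to 10^-6.
single_genome = [expected_chance_matches(p, YEAST_GENOME) for p in (1e-4, 1e-6)]

# p-values of 10^-7 to 10^-9 after adding the closely related species.
five_species = [expected_chance_matches(p, YEAST_GENOME) for p in (1e-7, 1e-9)]

# p-value at which one chance match is expected in a random 50 Mb genome.
one_per_50mb = 1 / 50_000_000
```

Under these assumptions a single genome yields on the order of 1,200 down to 12 chance matches, whereas the multi-species p-values yield about 1.2 down to 0.012. The 50 Mb figure corresponds to a p-value of 2 × 10^-8, close to the 1.85 × 10^-8 cutoff used in the Rpn4p search.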

Figure 6.<br />

The <strong>evolution</strong>ary distance required to confidently identify conserved binding sites<br />

varies among transcription factors.<br />

(a) Median p-values for functional binding sites for various factors at different <strong>evolution</strong>ary distances. The<br />

<strong>evolution</strong>ary distance needed to obtain a desired significance varies between factors. (b) Predicted<br />

dependence <strong>of</strong> the p-values on <strong>evolution</strong>ary distance. Specificity data and functional binding sites were<br />

obtained from the SCPD.<br />

Pairwise versus multiple species comparisons<br />

In theory, for a given reference genome it should be possible to pick a single comparison<br />

species at an <strong>evolution</strong>ary distance sufficient to identify any conserved feature <strong>of</strong> interest.<br />

Our results suggest that at distances <strong>of</strong> up to approximately 0.6 substitutions per site,<br />

pairwise alignments provide essentially the same amount <strong>of</strong> resolving power as multiple<br />

comparisons spanning the same <strong>evolution</strong>ary distance. We showed that S. cerevisiae and<br />

S. kudriavzevii span almost exactly the same <strong>evolution</strong>ary distance as S. cerevisiae, S.<br />

paradoxus and S. mikatae, and that that distance is well below 0.6 substitutions per site.<br />

Consistent with this, MONKEY produces nearly identical p-values for conserved binding<br />

sites from these two sets <strong>of</strong> species. Thus, our results suggest that from a theoretical<br />

perspective, if the goal <strong>of</strong> comparative analysis is to identify conserved binding sites for<br />

factors like the ones considered here, it is not necessary to sequence species much more<br />

closely related than this limit. We note, however, that there are myriad practical reasons<br />

other than <strong>evolution</strong>ary resolving power (the only factor considered in our models) for<br />

sequencing multiple closely related sequences. First, there may simply be no extant<br />

species at the exact <strong>evolution</strong>ary distance desired. Second, the quality <strong>of</strong> DNA alignments<br />

is expected to be much higher for multiple closely related species than for more distant<br />

pairwise alignments - if alignment errors prevent correct assignment <strong>of</strong> orthology,<br />

conserved binding sites will not be identified. For the factors considered here, the<br />

pairwise comparison performed nearly as well as the multiple species comparison well<br />

beyond the <strong>evolution</strong>ary distances at which pairwise alignments are reliable [36],<br />

suggesting that the necessity <strong>of</strong> alignment will limit the maximum distance between<br />

species. Finally, and perhaps most important, is the assumption that our models make<br />

about constant functional constraint over <strong>evolution</strong>. To illustrate this, consider the<br />

binding sites for Gal4p used in the analysis in Figure 2a. These binding sites could not be<br />

included in Figure 3 because S. kudriavzevii orthologs for these genes were not available<br />

in SGD, apparently because <strong>of</strong> the degeneration <strong>of</strong> the galactose-utilization pathway in<br />

this species [37]. Sequencing multiple closely related species provides insurance against<br />

such functional changes, because they are less likely to have occurred in all the lineages.<br />

Conserved sites and binding-site turnover<br />

MONKEY was very effective in identifying functional Rpn4p binding sites from the<br />

alignment <strong>of</strong> five Saccharomyces species. In our search, 41 <strong>of</strong> 56 (73%) predicted sites<br />

were found near genes showing the expected expression pattern, and are therefore likely<br />

to be functional. Even at this level <strong>of</strong> stringency, however, there are highly conserved<br />

sequences that match the matrix, but do not appear to be near genes that are regulated <strong>by</strong><br />

Rpn4p. It is very unlikely that these sites are conserved <strong>by</strong> chance. One possible<br />

explanation for this high degree <strong>of</strong> conservation is that these are functional sites, but that<br />

the expression <strong>of</strong> these genes is not accurately detected in high-throughput assays, or<br />

their function has not been accurately determined. A more likely possibility is that these<br />

sites are conserved because they perform other, unknown functions. Consistent with this<br />

hypothesis is the fact that many <strong>of</strong> these matches fall near other highly conserved<br />

sequences (Figure 4b), suggesting that they may be parts <strong>of</strong> larger conserved features.<br />

In addition to the conserved sequences that are unlikely to represent bona fide<br />

binding sites, we also found examples <strong>of</strong> binding sites associated with properly regulated<br />

genes that do not seem to be conserved (Figure 5). Once again there are several possible<br />

explanations for this observation. First, these binding sites may not actually be functional<br />

and may have been included in our positive set erroneously. While this is a possible<br />

explanation for the case <strong>of</strong> the Rpn4p-binding sites shown in Figure 5 (and may be likely<br />

in the case <strong>of</strong> REH1, where we could identify another apparently conserved binding site<br />

in the region) we have also found nonconserved examples among the TFBSs in the SCPD<br />

database (approximately 20% of the TFBSs we examined; see Figure 7), all of which have at<br />

least some direct experimental support. Another potential explanation is that these<br />

binding sites are actually conserved, but were not aligned correctly. While this is difficult<br />

to rule out in general, in the few nonconserved cases for Rpn4p at least we could not find<br />

(<strong>by</strong> eye) errors in the alignments. Most interesting, <strong>of</strong> course, would be the situation<br />

where these nonconserved binding sites are not due to some error on our part, but rather<br />

represent a biological change in the functional constraints on these sequences, possibly<br />

resulting in a change in the <strong>regulation</strong> <strong>of</strong> the expression <strong>of</strong> these genes. Our results<br />

represent an upper bound on the number <strong>of</strong> TFBSs for which this has occurred. Cis-<br />

regulatory changes have been proposed to be an important source <strong>of</strong> genetic variation<br />

[32]. Gains and losses <strong>of</strong> functional binding sites represent an important class <strong>of</strong> these<br />

changes [38,39], and an important area for future computational and experimental<br />

analysis, particularly as the genome sequences <strong>of</strong> closely related metazoans become<br />

available. We expect MONKEY to be a useful tool in the comparative analysis <strong>of</strong> these<br />

genomes, and we have found comparable increases in the significance <strong>of</strong> functional<br />

binding sites in alignments of Drosophila melanogaster and D. pseudoobscura (see figure<br />

8).<br />



Figure 7.<br />

Many characterized binding sites do not seem to be evolving under constraint.<br />

We defined a binding site as not conserved if its p-value in the four way alignment <strong>of</strong> S. cerevisiae, S.<br />

paradoxus, S. mikatae, and S. bayanus was higher (less significant) than in S. cerevisiae alone.<br />

Figure 8.<br />

Applying MONKEY to alignments <strong>of</strong> Drosophila<br />

Putative binding sites (D. melanogaster p


Conclusions<br />

We have developed a method to identify conserved TFBSs in sequence alignments from<br />

multiple related species that provides a quantitative framework for evaluating results. The<br />

method - implemented in the open-source s<strong>of</strong>tware MONKEY - extends probabilistic<br />

models <strong>of</strong> binding specificity to multiple species with probabilistic models <strong>of</strong> <strong>evolution</strong>.<br />

We have found that a probabilistic <strong>evolution</strong>ary model [25] that assumes binding sites are<br />

under constant purifying selection performs effectively in discriminating functional<br />

binding sites. We have developed methods to assess the significance <strong>of</strong> hits, and have<br />

shown that the significance <strong>of</strong> functional matches increases while the significance <strong>of</strong><br />

spurious matches decreases over increasing <strong>evolution</strong>ary distance. We can explicitly<br />

model the relationship between the significance <strong>of</strong> a hit and <strong>evolution</strong>ary distance,<br />

allowing the assessment <strong>of</strong> the potential <strong>of</strong> any collection <strong>of</strong> genomes for identifying<br />

conserved binding sites. Applying MONKEY to a collection <strong>of</strong> related yeast species we<br />

find that most functional binding sites are highly significantly conserved, but also find<br />

evidence for conserved sites that are not functional and vice versa. Our results suggest<br />

that development <strong>of</strong> methods that model the <strong>evolution</strong>ary relationships between species<br />

and the evolution of the genomic features of interest yields insight into the challenges for<br />

comparative genomics.<br />

Materials and methods<br />

Simulating pairs <strong>of</strong> sequences<br />

To generate the empirical p-values shown in Figure 1, random sequences <strong>of</strong> length w<br />

were generated according to the average intergenic base frequencies <strong>of</strong> the S. cerevisiae<br />



genome. These were then evolved according to the Jukes-Cantor substitution model, to a<br />

specified <strong>evolution</strong>ary distance. Likelihood ratio scores and p-values were then calculated<br />

for each <strong>of</strong> the pairs <strong>of</strong> sequences using the method implemented in MONKEY. Finally,<br />

all pairs <strong>of</strong> sequences were ranked <strong>by</strong> their scores, and the rank divided <strong>by</strong> the total<br />

number <strong>of</strong> pairs was taken as the empirical p-value.<br />
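The simulation procedure described above can be sketched as follows. This is a minimal illustration, not the code used for Figure 1: the function names are hypothetical, the scoring step performed by MONKEY is not reproduced (any list of scores can be passed to `empirical_p_values`), and ties between scores are broken arbitrarily.

```python
import math
import random

BASES = "ACGT"

def random_sequence(length, freqs, rng):
    """Draw a sequence of the given length from independent base frequencies (order A, C, G, T)."""
    return "".join(rng.choices(BASES, weights=freqs, k=length))

def jukes_cantor_evolve(seq, distance, rng):
    """Evolve seq to `distance` expected substitutions per site under Jukes-Cantor."""
    # Under Jukes-Cantor, a site differs from its ancestor with this probability.
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * distance / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_diff:
            # All three alternative bases are equally likely.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def empirical_p_values(scores):
    """Rank scores from highest to lowest and return rank / total for each score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    p_values = [0.0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        p_values[idx] = rank / len(scores)
    return p_values
```

With this convention the highest-scoring pair receives the smallest empirical p-value (1 divided by the number of pairs).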

Preparation <strong>of</strong> alignments for different groups <strong>of</strong> species<br />

We aligned the upstream regions <strong>of</strong> all S. cerevisiae genes to their orthologs in S.<br />

paradoxus, S. mikatae, S. bayanus and S. kudriavzevii <strong>by</strong> taking the 1,000 bp upstream <strong>of</strong><br />

each gene, identifying the corresponding region from the other species using data in the<br />

Saccharomyces Genome Database [40], aligning them with CLUSTAL W [41] and<br />

trimming them to remove regions corresponding to S. cerevisiae coding sequence. We<br />

used this strategy rather than simply aligning intergenic regions to control for differences<br />

in alignments that might arise from the use <strong>of</strong> variably sized regions. To obtain estimates<br />

<strong>of</strong> the <strong>evolution</strong>ary distance spanned <strong>by</strong> each comparison, we ran PAML [24] on the<br />

entire set of intergenic alignments, using the HKY model [26], with constant rates across<br />

sites. We used the median PAML estimate <strong>of</strong> kappa (the transition-transversion rate ratio)<br />

<strong>of</strong> 3.8, the S. cerevisiae background frequencies (ACGT) = (0.3, 0.2, 0.2, 0.3) and the<br />

median <strong>of</strong> the branch lengths estimates as the 'background' <strong>evolution</strong>ary model. The trees<br />

with these branch lengths were used as input to MONKEY to calculate p-values. The<br />

distances in Figure 4 represent the sum <strong>of</strong> the median branch lengths in each comparison.<br />

The subsets (with <strong>evolution</strong>ary distances in parentheses) were as follows: S. cerevisiae<br />

and S. paradoxus (0.194); S. cerevisiae and S. mikatae (0.403); S. cerevisiae, S.<br />



paradoxus and S. mikatae (0.477); S. cerevisiae and S. bayanus (0.559); S. cerevisiae, S.<br />

paradoxus, S. mikatae and S. bayanus (0.816); S. cerevisiae, S. paradoxus, S. mikatae, S.<br />

bayanus and S. kudriavzevii (1.090).<br />
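As an illustration of the background model, the sketch below builds an HKY-style rate matrix from the values quoted above (kappa = 3.8 and base frequencies (0.3, 0.2, 0.2, 0.3) for A, C, G, T). It assumes the standard HKY85 parameterization, scaled so that one unit of branch length corresponds to one expected substitution per site; the text does not state how PAML scales its matrix, and the function name is hypothetical.

```python
BASES = "ACGT"
# Transitions are purine<->purine (A<->G) and pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky_rate_matrix(kappa, freqs):
    """Return a 4x4 HKY85 rate matrix (rows and columns ordered A, C, G, T)."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                # Rate to base b is proportional to its frequency, times kappa for transitions.
                q[i][j] = freqs[j] * (kappa if (a, b) in TRANSITIONS else 1.0)
        q[i][i] = -sum(q[i])  # each row sums to zero
    # Normalize so that the expected substitution rate at equilibrium equals 1.
    rate = -sum(freqs[i] * q[i][i] for i in range(4))
    return [[x / rate for x in row] for row in q]

Q = hky_rate_matrix(3.8, (0.3, 0.2, 0.2, 0.3))
```

In this form the A-to-G rate is 3.8 times the A-to-C rate, because G and C have the same background frequency.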

Definition <strong>of</strong> Rpn4p and Gal4p matrices and positive and negative sets<br />

Rpn4p: we used Rpn4p sites in proteasomal genes [42,43] to build an Rpn4p specificity<br />

matrix (using a pseudocount <strong>of</strong> 1 per base per position). To identify additional likely<br />

targets, we obtained expression data from public sources [30,31] and compared the<br />

expression patterns <strong>of</strong> all genes to the average expression pattern <strong>of</strong> proteasomal genes<br />

using the following metric:<br />

$$t = \theta \sqrt{\frac{n-2}{1-\theta^{2}}}$$

where θ is the 'uncentered correlation', a commonly used distance metric for gene-<br />

expression data [44]. Our score adds a correction for the number <strong>of</strong> datapoints, n, that are<br />

available for each gene. All matches to the Rpn4p matrix (S. cerevisiae likelihood ratio<br />

score > 9) in the upstream region <strong>of</strong> a gene that matched the proteasomal expression<br />

pattern (t > 8) were considered to be true Rpn4p sites. The negative set consists of all<br />

sites that matched the Rpn4p matrix with a score greater than 9, and excluded sites in<br />

genes with even weak similarity to the proteasomal expression pattern (t > 0) or that were<br />

annotated [40] as involved in protein processing or degradation.<br />
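A minimal sketch of the expression-similarity score follows. It assumes that the metric has the form of the usual t statistic for a correlation coefficient, t = θ·sqrt((n − 2)/(1 − θ²)), and that the uncentered correlation is the cosine-like measure of [44]. The function names and the handling of missing values (`None`) are illustrative assumptions.

```python
import math

def uncentered_correlation(x, y):
    """Uncentered correlation: dot product divided by the product of the vector norms."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den

def t_score(x, y):
    """Score t for two expression profiles, using only datapoints present in both."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three shared datapoints")
    theta = uncentered_correlation([a for a, _ in pairs], [b for _, b in pairs])
    if abs(theta) >= 1.0:
        # Perfectly (anti)correlated profiles give an unbounded score.
        return math.copysign(math.inf, theta)
    return theta * math.sqrt((n - 2) / (1.0 - theta * theta))
```

For a fixed correlation, the score grows with the number of shared datapoints, which is the correction described in the text.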

Gal4p: we used the matrix from SCPD [33] (with a pseudo count <strong>of</strong> 1 per base per<br />

position). To define a positive set we used the binding sites in SCPD and systematic<br />

studies <strong>of</strong> this Gal4p regulatory system [45,46], and used matches near additional genes<br />



that we identified in these studies with scores above the lowest score in the SCPD set. To<br />

define a negative set, we again scanned the S. cerevisiae genome with a cut<strong>of</strong>f equal to<br />

the lowest score in the positive set and then eliminated any binding sites near genes that<br />

showed evidence for <strong>regulation</strong> in the systematic studies.<br />

It is important to note that our categorization <strong>of</strong> sequences as positive and<br />

negative is done independently <strong>of</strong> the comparative sequence data, thus avoiding potential<br />

circularity.<br />
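The matrix construction and genome scan used to define these sets can be sketched as below. This is an illustration under stated assumptions: natural logarithms are used (the text does not give the base of the logarithm behind the score cutoff of 9), only one strand is scanned, the cutoff is applied inclusively, and the function names are hypothetical.

```python
import math

BASES = "ACGT"

def build_matrix(sites, pseudocount=1.0):
    """Per-position base frequencies from aligned sites, with a pseudocount per base per position."""
    width = len(sites[0])
    matrix = []
    for i in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        matrix.append({b: counts[b] / total for b in BASES})
    return matrix

def log_likelihood_ratio(seq, matrix, background):
    """Sum over positions of log(p(base | motif) / p(base | background))."""
    return sum(math.log(col[b] / background[b]) for col, b in zip(matrix, seq))

def scan(sequence, matrix, background, cutoff):
    """Return (start, score) for every window scoring at or above the cutoff."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = log_likelihood_ratio(sequence[start:start + width], matrix, background)
        if score >= cutoff:
            hits.append((start, score))
    return hits
```

Setting the cutoff to the lowest score in the positive set, as in the Gal4p analysis, guarantees that every positive site is itself recovered by the scan.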

Calculations <strong>of</strong> expected scores<br />

Because our methods employ explicit probabilistic models for the <strong>evolution</strong> <strong>of</strong> noncoding<br />

DNA, it is possible to compute the expected scores under various assumptions. The<br />

expectation <strong>of</strong> the log likelihood ratio for examples <strong>of</strong> the motif is the 'information<br />

content' and its calculation has been addressed [47]. We can extend this calculation to<br />

our <strong>evolution</strong>ary scores, as follows. Using the fact that all the scores treat the positions <strong>of</strong><br />

the matrix independently, and the linearity <strong>of</strong> the expectation, we write:<br />

$$E[\hat{S}(X,Y)\mid m] = \sum_{i=1}^{w} E[\hat{S}_i(X_i,Y_i)\mid m] = \sum_{i=1}^{w}\sum_{X_i,Y_i} p(X_i,Y_i\mid m,T)\,\hat{S}_i(X_i,Y_i)$$

where E [x] denotes the expectation <strong>of</strong> the random variable x, m denotes a frequency<br />

matrix and a corresponding <strong>evolution</strong>ary model, either {motif, Rmotif} or {bg, Rbg}.<br />

p(Xi, Yi|m, T) is calculated as above, and we define:<br />

$$\hat{S}_i(X_i,Y_i) \equiv \log\frac{p(X_i,Y_i\mid \mathrm{motif},T,R_{\mathrm{motif}})}{p(X_i,Y_i\mid \mathrm{bg},T,R_{\mathrm{bg}})}$$

We can write a similar expression for the variance, V:<br />

$$V[\hat{S}(X,Y)\mid m] = \sum_{i=1}^{w}\sum_{X_i,Y_i} p(X_i,Y_i\mid m,T)\left(\hat{S}_i(X_i,Y_i) - E[\hat{S}_i(X_i,Y_i)\mid m]\right)^{2}$$
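The expectation and variance of the score can be computed position by position, as in the sketch below. The joint distributions p(Xi, Yi | m, T) are taken as inputs. The helper `f81_joint` is only a toy stand-in (an F81-style model in which a site is either unchanged or redrawn from the equilibrium frequencies); it is not the HB or HKY model used in the text, and all names here are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"

def f81_joint(freqs, change):
    """Toy joint p(x, y) = freqs[x] * ((1 - change) * [x == y] + change * freqs[y])."""
    return {(x, y): freqs[x] * ((1.0 - change) * (x == y) + change * freqs[y])
            for x, y in product(BASES, repeat=2)}

def expected_score_and_variance(joints_m, joints_motif, joints_bg):
    """Sum the per-position mean and variance of the log-ratio score under model m.

    Each argument is a list with one joint distribution (dict keyed by (x, y)) per position.
    """
    mean_total, var_total = 0.0, 0.0
    for jm, jmotif, jbg in zip(joints_m, joints_motif, joints_bg):
        # Per-position score: log of the motif joint over the background joint.
        score = {xy: math.log(jmotif[xy] / jbg[xy]) for xy in jm}
        mean_i = sum(jm[xy] * score[xy] for xy in jm)
        var_i = sum(jm[xy] * (score[xy] - mean_i) ** 2 for xy in jm)
        mean_total += mean_i
        var_total += var_i
    return mean_total, var_total
```

Passing the motif joints as `joints_m` gives the expected score for true sites, and passing the background joints gives the expected score for background sequence.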


In order to predict the scores for the genes in our positive and negative sets, we are<br />

interested in the case where we have observed a match to the motif in one species, but the<br />

constraints on its <strong>evolution</strong> are either those <strong>of</strong> the background or the motif. We can<br />

compute the expected scores under these assumptions as follows:<br />

$$E[\hat{S}(X,Y)\mid X=\mathrm{match}, m] = \sum_{i=1}^{w}\sum_{X_i} p(X_i\mid \mathrm{motif})\sum_{Y_i} p(Y_i\mid X_i,m,T)\,\hat{S}_i(X_i,Y_i)$$

where p(Xi|motif) is the single species probability <strong>of</strong> observing the base Xi at position i in<br />

the specificity matrix (f), and using Bayes' theorem:<br />

$$p(Y_i\mid X_i,m,T) = \frac{p(X_i,Y_i\mid m,T)}{p(X_i\mid m,T)} = \frac{p(X_i,Y_i\mid m,T)}{\sum_{Y_i} p(X_i,Y_i\mid m,T)}.$$
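The conditional calculation can be sketched for the two-species case as follows: the reference base is drawn from the motif column, and the second base from the conditional distribution obtained by Bayes' theorem. As a simplifying assumption, the joint distributions come from a toy F81-style model rather than the HB and HKY models of the text, and the function names are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"

def f81_joint(freqs, change):
    """Toy joint p(x, y) = freqs[x] * ((1 - change) * [x == y] + change * freqs[y])."""
    return {(x, y): freqs[x] * ((1.0 - change) * (x == y) + change * freqs[y])
            for x, y in product(BASES, repeat=2)}

def conditional(joint):
    """p(y | x) = p(x, y) / sum over y of p(x, y), for one position."""
    marginal = {x: sum(joint[(x, y)] for y in BASES) for x in BASES}
    return {(x, y): joint[(x, y)] / marginal[x] for x, y in product(BASES, repeat=2)}

def expected_score_given_match(columns, joints_m, joints_motif, joints_bg):
    """Expected score when X matches the motif and Y evolves under model m."""
    total = 0.0
    for f, jm, jmotif, jbg in zip(columns, joints_m, joints_motif, joints_bg):
        cond = conditional(jm)
        for x in BASES:
            # Inner sum over the second species, weighted by p(y | x, m, T).
            inner = sum(cond[(x, y)] * math.log(jmotif[(x, y)] / jbg[(x, y)])
                        for y in BASES)
            total += f[x] * inner
    return total
```

Comparing the result for m = motif with the result for m = background gives the expected separation between functional and spurious matches at a given evolutionary distance.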

This calculation can be extended to the multiple species case, by replacing the<br />

distributions p(Xi, Yi) and p(Yi|Xi) with p(Xi, Yi, ..., Zi) and p(Yi, ..., Zi|Xi) and changing<br />

the sum over Yi to a sum over all the other leaves in the tree except the reference, in this<br />

case, Xi. For the functional set, we assumed the binding sites were evolving under the HB<br />

model [25], and for the nonfunctional set we assumed evolution under the HKY<br />

background model described above. To model the sequence specificity matrices most<br />

accurately, we reduced the pseudocount (equal to the background probability of<br />

observing each base).<br />

References<br />

1. Ureta-Vidal A, Ettwiller L, Birney E: Comparative genomics: genome-wide analysis in metazoan<br />

eukaryotes. Nat Rev Genet 2003, 4:251-262.<br />

2. Morgenstern B, Rinner O, Abdeddaim S, Haase D, Mayer KF, Dress AW, Mewes HW: Exon discovery<br />

by genomic sequence alignment. Bioinformatics 2002, 18:777-787.<br />

3. Hardison RC: Conserved noncoding sequences are reliable guides to regulatory elements. Trends<br />

Genet 2000, 16:369-372.<br />

4. Wasserman WW, Palumbo M, Thompson W, Fickett JW, Lawrence CE: Human-mouse genome<br />

comparisons to locate regulatory sites. Nat Genet 2000, 26:225-228.<br />

5. Rivas E, Klein RJ, Jones TA, Eddy SR: Computational identification <strong>of</strong> noncoding RNAs in E. coli<br />

<strong>by</strong> comparative genomics. Curr Biol 2001, 11:1369-1373.<br />

6. Wassarman KM, Repoila F, Rosenow C, Storz G, Gottesman S: Identification <strong>of</strong> novel small RNAs<br />

using comparative genomics and microarrays. Genes Dev 2001, 15:1637-1651.<br />

7. Carter RJ, Dubchak I, Holbrook SR: A computational approach to identify genes for functional<br />

RNAs in genomic sequences. Nucleic Acids Res 2001, 29:3928-3938.<br />

8. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES: Sequencing and comparison <strong>of</strong> yeast species<br />

to identify genes and regulatory elements. Nature 2003, 423:241-254.<br />

9. Lieb JD, Liu X, Botstein D, Brown PO: Promoter-specific binding <strong>of</strong> Rap1 revealed <strong>by</strong> genome-wide<br />

maps <strong>of</strong> protein-DNA association. Nat Genet 2001, 28:327-334.<br />

10. Chiang DY, Moses AM, Kellis M, Lander ES, <strong>Eisen</strong> MB: Phylogenetically and spatially conserved<br />

word pairs associated with gene-expression changes in yeasts. Genome Biol 2003, 4:R43.<br />

11. Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston<br />

M: Finding functional features in Saccharomyces genomes <strong>by</strong> phylogenetic footprinting. Science 2003,<br />

301:71-76.<br />

12. Berezikov E, Guryev V, Plasterk RH, Cuppen E: CONREAL: conserved regulatory elements<br />

anchored alignment algorithm for identification <strong>of</strong> transcription factor binding sites <strong>by</strong> phylogenetic<br />

footprinting. Genome Res 2004, 14:170-178.<br />

13. Loots GG, Ovcharenko I: rVISTA 2.0: <strong>evolution</strong>ary analysis <strong>of</strong> transcription factor binding sites.<br />

Nucleic Acids Res 2004, 32(Web Server):W217-W221.<br />

14. Bigelow HR, Wenick AS, Wong A, Hobert O: CisOrtho: a program pipeline for genome-wide<br />

identification <strong>of</strong> transcription factor target genes using phylogenetic footprinting. BMC Bioinformatics<br />

2004, 5:27.<br />

15. Loots GG, Ovcharenko I, Pachter L, Dubchak I, Rubin EM: rVista for comparative sequence-based<br />

discovery <strong>of</strong> functional transcription factor binding sites. Genome Res 2002, 12:832-839.<br />

16. Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW: Identification <strong>of</strong><br />

conserved regulatory elements <strong>by</strong> comparative genome analysis. J Biol 2003, 2:13.<br />

17. Sandelin A, Wasserman WW, Lenhard B: ConSite: web-based prediction <strong>of</strong> regulatory elements<br />

using cross-species comparison. Nucleic Acids Res 2004, 32(Web Server):W249-W252.<br />

18. Mrowka R, Steinhage K, Patzak A, Persson PB: An <strong>evolution</strong>ary approach for identifying potential<br />

transcription factor binding sites: the renin gene as an example. Am J Physiol Regul Integr Comp<br />

Physiol 2003, 284:R1147-R1150.<br />

19. Moses AM, Chiang DY, Kellis M, Lander ES, <strong>Eisen</strong> MB: Position specific variation in the rate <strong>of</strong><br />

<strong>evolution</strong> in transcription factor binding sites. BMC Evol Biol 2003, 3:19.<br />

20. Berg OG, von Hippel PH: Selection of DNA binding sites by regulatory proteins. Statistical-mechanical<br />

theory and application to operators and promoters. J Mol Biol 1987, 193:723-750.<br />



21. Stormo GD, Schneider TD, Gold L, Ehrenfeucht A: Use <strong>of</strong> the 'Perceptron' algorithm to distinguish<br />

translational initiation sites in E. coli. Nucleic Acids Res 1982, 10:2997-3011.<br />

22. Staden R: Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res 1984,<br />

12:505-519.<br />

23. Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol<br />

Evol 1981, 17:368-376.<br />

24. Yang Z: PAML: a program package for phylogenetic analysis <strong>by</strong> maximum likelihood. Comput<br />

Appl Biosci 1997, 13:555-556.<br />

25. Halpern AL, Bruno WJ: Evolutionary distances for protein-coding sequences: modeling site-specific<br />

residue frequencies. Mol Biol Evol 1998, 15:910-917.<br />

26. Hasegawa M, Kishino H, Yano T: Dating <strong>of</strong> the human-ape splitting <strong>by</strong> a molecular clock <strong>of</strong><br />

mitochondrial DNA. J Mol Evol 1985, 22:160-174.<br />

27. Jukes T, Cantor C: Evolution <strong>of</strong> Protein Molecules. In Mammalian Protein Metabolism Edited <strong>by</strong>:<br />

Munro H. New York: Academic Press; 1969:121-132.<br />

28. Tatusov RL, Altschul SF, Koonin EV: Detection <strong>of</strong> conserved segments in proteins: iterative<br />

scanning <strong>of</strong> sequence databases with alignment blocks. Proc Natl Acad Sci USA 1994, 91:12091-12095.<br />

29. Staden R: Methods for calculating the probabilities <strong>of</strong> finding patterns in sequences. Comput Appl<br />

Biosci 1989, 5:89-96.<br />

30. Belting HG, Shashikant CS, Ruddle FH: Modification <strong>of</strong> expression and cis-<strong>regulation</strong> <strong>of</strong> Hoxc8 in<br />

the <strong>evolution</strong> <strong>of</strong> diverged axial morphology. Proc Natl Acad Sci USA 1998, 95:2355-2360.<br />

31. Ludwig MZ, Kreitman M: Evolutionary dynamics <strong>of</strong> the enhancer region <strong>of</strong> even-skipped in<br />

Drosophila. Mol Biol Evol 1995, 12:1002-1011.<br />

32. Wray GA, Hahn MW, Abouheif E, Balh<strong>of</strong>f JP, Pizer M, Rockman MV, Romano LA: The <strong>evolution</strong> <strong>of</strong><br />

<strong>transcriptional</strong> <strong>regulation</strong> in eukaryotes. Mol Biol Evol 2003, 20:1377-1419.<br />

33. Zhu J, Zhang MQ: SCPD: a promoter database <strong>of</strong> the yeast Saccharomyces cerevisiae.<br />

Bioinformatics 1999, 15:607-611.<br />

34. Gasch AP, <strong>Eisen</strong> MB: Exploring the conditional co<strong>regulation</strong> <strong>of</strong> yeast gene expression through<br />

fuzzy k-means clustering. Genome Biol 2002, 3:research0059.1-0059.22.<br />

35. webMONKEY [http://rana.lbl.gov/monkey]<br />

36. Pollard DA, Bergman CM, Stoye J, Celniker SE, <strong>Eisen</strong> MB: Benchmarking tools for the alignment <strong>of</strong><br />

functional noncoding DNA. BMC Bioinformatics 2004, 5:6.<br />

37. Hittinger CT, Rokas A, Carroll SB: Parallel inactivation <strong>of</strong> multiple GAL pathway genes and<br />

ecological diversification in yeasts. Proc Natl Acad Sci USA 2004, 101:14144-14149.<br />

38. Dermitzakis ET, Clark AG: Evolution <strong>of</strong> transcription factor binding sites in Mammalian gene<br />

regulatory regions: conservation and turnover. Mol Biol Evol 2002, 19:1114-1121.<br />

39. Ludwig MZ, Bergman C, Patel NH, Kreitman M: Evidence for stabilizing selection in a eukaryotic<br />

enhancer element. Nature 2000, 403:564-567.<br />



40. Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M,<br />

et al.: SGD: Saccharomyces Genome Database. Nucleic Acids Res 1998, 26:73-79.<br />

41. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity <strong>of</strong> progressive<br />

multiple sequence alignment through sequence weighting, position-specific gap penalties and weight<br />

matrix choice. Nucleic Acids Res 1994, 22:4673-4680.<br />

42. Mannhaupt G, Schnall R, Karpov V, Vetter I, Feldmann H: Rpn4p acts as a transcription factor <strong>by</strong><br />

binding to PACE, a nonamer box found upstream <strong>of</strong> 26S proteasomal and other genes in yeast. FEBS<br />

Lett 1999, 450:27-34.<br />

43. Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification <strong>of</strong> cis-regulatory<br />

elements associated with groups <strong>of</strong> functionally related genes in Saccharomyces cerevisiae. J Mol Biol<br />

2000, 296:1205-1214.<br />

44. <strong>Eisen</strong> MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display <strong>of</strong> genome-wide<br />

expression patterns. Proc Natl Acad Sci USA 1998, 95:14863-14868.<br />

45. Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner R, Goodlett DR,<br />

Aebersold R, Hood L: Integrated genomic and proteomic analyses <strong>of</strong> a systematically perturbed<br />

metabolic network. Science 2001, 292:929-934.<br />

46. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N,<br />

Kanin E, et al.: Genome-wide location and function <strong>of</strong> DNA binding proteins. Science 2000, 290:2306-<br />

2309.<br />

47. Stormo GD: DNA binding sites: representation and discovery. Bioinformatics 2000, 16:16-23.<br />

48. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, <strong>Eisen</strong> MB, Storz G, Botstein D, Brown PO:<br />

Genomic expression programs in the response <strong>of</strong> yeast cells to environmental changes. Mol Biol Cell<br />

2000, 11:4241-4257.<br />

49. Gasch AP, Huang M, Metzner S, Botstein D, Elledge SJ, Brown PO: Genomic expression responses<br />

to DNA-damaging agents and the regulatory role <strong>of</strong> the yeast ATR homolog Mec1p. Mol Biol Cell<br />

2001, 12:2987-3003.<br />

50. Todd RB, Andrianopoulos A: Evolution <strong>of</strong> a fungal regulatory gene family: the Zn(II)2Cys6<br />

binuclear cluster DNA binding motif. Fungal Genet Biol 1997, 21:388-405.<br />

51. Marmorstein R, Carey M, Ptashne M, Harrison SC: DNA recognition <strong>by</strong> GAL4: structure <strong>of</strong> a<br />

protein-DNA complex. Nature 1992, 356:408-414.<br />

52. Lohr D, Venkov P, Zlatanova J: Transcriptional <strong>regulation</strong> in the yeast GAL gene family: a<br />

complex genetic network. FASEB J 1995, 9:777-787.<br />



11. Applications to alignments <strong>of</strong> multiple primates and human-fugu conserved elements<br />

While the <strong>evolution</strong>ary models and statistical techniques on which the methods described<br />

above are based are general, and should in principle be applicable to any set <strong>of</strong> closely<br />

related species, it remains to be seen if the assumptions and approximations made are<br />

actually valid in other cases. In order to test these methods on other taxonomic groups I<br />

have applied both EMnEM and MONKEY to comparisons <strong>of</strong> human sequences in<br />

collaboration with Eddy Rubin’s group.<br />

Identifying conserved LXR binding sites in alignments <strong>of</strong> multiple primates.<br />

LXRa is a nuclear hormone receptor involved in the <strong>regulation</strong> <strong>of</strong> many genes in response<br />

to oxysterol levels. This is a medically important system because the targets <strong>of</strong> LXR<br />

include CYP7A1, APOE and potentially many other important genes involved in<br />

arteriosclerosis (Tontonoz and Mangelsdorf 2003). Because of the large loci and complex<br />

structure <strong>of</strong> many <strong>of</strong> these genes, most <strong>of</strong> the characterized binding sites are in the<br />

proximal promoters <strong>of</strong> these genes. By making multiple species alignments, we can<br />

search for conserved binding sites, thus greatly improving the specificity <strong>of</strong> the<br />

predictions, <strong>by</strong> ruling out matches to the matrix that are not conserved.<br />

LXR binds DNA as a DR4 heterodimer with RXR (Repa and Mangelsdorf 2000).<br />

Using LXREs (the binding sites for the LXRa-RXR heterodimer) curated from the<br />

literature (Zhang and Mangelsdorf 2002), I have constructed a specificity matrix (Figure<br />

1).<br />



[Figure 1 image: sequence logo; the ovals above the letters are labeled LXRa and RXR.]<br />

Figure 1.<br />

Sequence logo representation <strong>of</strong> the LXRE<br />

Ovals above the letters are a schematic <strong>of</strong> positions <strong>of</strong> the DR4 nuclear hormone receptor heterodimer.<br />

Note that despite the high variability, each half-site seems to contain the AGGTCA hexameric consensus.<br />

Using this matrix I have searched the loci <strong>of</strong> genes known to be regulated <strong>by</strong> LXR in<br />

order to identify conserved binding sites. One example <strong>of</strong> a potentially interesting<br />

prediction is in the SREBF1 locus. SREBF1 encodes the SREBP1 membrane bound<br />

transcription factor involved in lipid biosynthesis (Horton et al 2002). While two<br />

functional LXREs have already been characterized in the proximal promoter <strong>of</strong> this gene,<br />

and both of these elements are conserved in alignments of human and mouse (MONKEY<br />

p-values for human and mouse, 5.31E-11 and 5.54E-07), there is a predicted conserved<br />

binding site in the primate alignments (MONKEY p-value for primates, 3.31E-10)<br />

approximately 12.5 Kb away, which is gapped in the alignments <strong>of</strong> human and mouse.<br />

This conserved binding site may point to an LXR regulated enhancer, as has been found<br />

for the APOE gene (Laffitte et al. 2001). Identifying primate specific binding sites is of<br />

particular interest in the case <strong>of</strong> cholesterol <strong>regulation</strong> because there are important<br />

differences in the <strong>regulation</strong> <strong>of</strong> these genes between rodents and humans, and it is<br />



possible that in this case primate sequences will prove more informative. I shall return to<br />

this case in part IV, as an example <strong>of</strong> regulatory <strong>evolution</strong>.<br />

Identifying clusters <strong>of</strong> conserved binding sites<br />

In addition to LXR, there are other transcription factors that are known to be important<br />

for the <strong>regulation</strong> <strong>of</strong> genes involved in cholesterol transport and <strong>regulation</strong>. Although<br />

there are few characterized binding sites for these factors, there have been SELEX<br />

experiments done on a number of them, and matrices representing the specificity of<br />

several are available (Sandelin et al 2004). Using these matrices I searched for regions in<br />

the alignments where there were binding sites for more than one of these factors in close<br />

proximity. Approaches that search for multiple binding sites have proven effective in<br />

identifying regulatory regions computationally (Wasserman and Fickett 1998) because<br />

eukaryotic transcription factors <strong>of</strong>ten work in combination. Using the alignments <strong>of</strong><br />

closely related sequences, and MONKEY, we can extend this approach to search for<br />

clusters <strong>of</strong> conserved binding sites.<br />
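One simple way to search for such clusters is sketched below: keep the hits that pass a p-value threshold, then report groups of hits that fall within a fixed window and involve at least two different factors. The data structure, the thresholds and the window-based definition of a cluster are assumptions made for illustration; they are not a description of the procedure actually used here.

```python
from typing import List, NamedTuple, Tuple

class Hit(NamedTuple):
    position: int    # alignment coordinate of the predicted site
    factor: str      # name of the matrix that matched
    p_value: float   # conservation p-value of the match

def find_clusters(hits: List[Hit], window: int, max_p: float,
                  min_factors: int = 2) -> List[Tuple[int, int, List[str]]]:
    """Return (first position, last position, factors) for groups of significant hits
    that span at most `window` bp and contain at least `min_factors` distinct factors."""
    kept = sorted(h for h in hits if h.p_value <= max_p)  # sorted by position
    clusters = []
    last_end = -1
    for i, first in enumerate(kept):
        j = i
        # Extend the group while the next hit still lies within the window.
        while j + 1 < len(kept) and kept[j + 1].position - first.position <= window:
            j += 1
        factors = sorted({h.factor for h in kept[i:j + 1]})
        # Skip groups that are nested inside one already reported.
        if len(factors) >= min_factors and j > last_end:
            clusters.append((first.position, kept[j].position, factors))
            last_end = j
    return clusters
```

A window of about 110 bp would correspond to the size of the CYP7A1 region discussed in this section.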

An interesting example <strong>of</strong> a group <strong>of</strong> conserved binding sites is found in the<br />

CYP7A1 locus. CYP7A1 catalyzes the rate-limiting step in bile formation, and is already<br />

known to be regulated <strong>by</strong> a number <strong>of</strong> factors, including HNF1, HNF4, LXR and PPAR<br />

at the proximal promoter (Marrapodi and Chiang 2000, Stroup and Chiang 2000, Chen et<br />

al 1999, Chiang et al 2001). Scanning the alignment <strong>of</strong> the CYP7A1 locus, we identified<br />

a region <strong>of</strong> 110 bp that contained conserved matches to the matrices for HNF3, HNF4<br />

and a weak match to the LXR matrix (Figure 2). This putative enhancer is located nearly<br />



12 Kb downstream <strong>of</strong> the 3’ end <strong>of</strong> the CYP7A1 transcript, and nearly 20 Kb downstream<br />

<strong>of</strong> the proximal promoter.<br />

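[Figure 2A image: schematic of the CYP7A1 locus. Labels in the original figure: coordinates 27454, 27564, 39551 and 49521; CYP7A1; proximal promoter.]<br />

[Figure 2B: multiple sequence alignment]<br />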

TGCAGGGATATGTTTACTTTGACTTTAGTCCTATTGGAATTCTGTGACCTTGGGCTAAAATAGGCAGTTCTCAAATGTATTATTTCTGAAGGGCAAAGTTAAGAGATT Baboon<br />

TGCAGAGATATGTTTACTTTGACTTTAGTCCTATTGGAATTCTGTGACCTTGGGCTAAAATGGGCAGTGCTCCAATATATTATTTCTGAAGGTCAAAGTTAAGAGATT Colobus<br />

TGCAGGGATATGTTTACTTTGACTTTAGTCCTATTGGAATTCTGTGACCTTGGCCTAAAATGGACAGTGATCAAATATATTATTTCTGAAGGGCAAAGTTAAGAGATT Human<br />

TGCAGAGATATGTTTACTTTGACTTTAGTCCTATTGGAATTCTGTGACCTTGGGCTAAAATAGACAGTGCTCAAATATATTATTTCTGAAGGGCAAAGTTAAGAGATT Marmoset<br />

TGCAGAGATATGTTTACTTTGACTTTAGTACTATTGGAATTCTGTAACCTTGGGCTAAAATAGGCAGTTCTCAAATATACTATTTCTGAAGGGCAAAGTTAAGAGATT Owl monkey<br />

TGCAGAGATATGTTTACTTTGACTTTAGTCCTATTGGAATTCTGTAACCTTGAGCTAAAATAGGCAGTTCTCAAATATATTATTTCTGAAGGGCAAAGTTAAGAGATT Squirrel monkey<br />

TGCAAGGATATGTTTACTTTGACTTTAGTCCTATTGGAATTCTGTGACCTTGGGCTAAAATGAACAGTTCTCAAACATATTCTTTCTGAAGGGCAAAGTTAAGCGGTT Lemur<br />

TGAGGTGGCATGTTTACTTTGACTTTAGTCCTATTGGAATTCTGTAACCTTGGACTAAAATGGACAGTTCTCAAATATAGTCCTTCTAAAGGGCAAAGTTAAGAGGTT Mouse<br />

TGAGGTGGCATGTTTACTTTGACTTTAGTCCTATTGGAATTCTGTAACCTTGGACTAAAATGGACAGTTCTCAAATATAGTCCTTCTAAAGGGCAAAGTTACGAGGTT Rat<br />

** * ******************** *************** ****** ******* **** ** ** ** * **** **** ******** * * **<br />

[Figure 2B box labels: HNF3, LXR (weak match), HNF4]<br />

Figure 2.<br />

Closely spaced, highly conserved matches suggest the presence of a downstream<br />

CYP7A1 enhancer<br />

A: schematic of the CYP7A1 locus showing the transcribed portion (blue bar), the characterized proximal<br />

promoter (red bar) and the predicted enhancer (green bar). B: multiple sequence alignment with boxes<br />

indicating the conserved binding sites as predicted by MONKEY.<br />

Discovering motifs in human-fugu conserved sequences using EMnEM<br />

Although the vast non-coding regions in the human genome make regulatory element<br />

detection exceedingly difficult, it is possible to take advantage <strong>of</strong> more compact<br />

vertebrate genomes, such as the puffer fish, Fugu rubripes (Aparicio et al 1995, Elgar et<br />

al 1996). Because <strong>of</strong> the long <strong>evolution</strong>ary distance that separates them, shared non-<br />

coding sequences between mammals and fish are very likely to be under strong<br />

functional constraint, and may be representative <strong>of</strong> the features <strong>of</strong> the ancestral<br />



vertebrate. These highly conserved non-coding sequences that have been discovered in<br />

vertebrate genomes represent an ideal data set for evolutionary motif-finders. They are<br />

a set <strong>of</strong> relatively short sequences, they are very likely to be functional, and very little is<br />

known about their functions. I have applied EMnEM to a set <strong>of</strong> highly conserved regions<br />

that were shown to drive embryonic expression patterns, in vivo in a reporter construct<br />

(Len Pennacchio, personal communication).<br />

In order to test the hypothesis that there are shared sequence features between all<br />

these highly conserved enhancers, I ran EMnEM on the entire set, using alignments of<br />

whatever species were available for that region on the UCSC genome browser (Kent et<br />

al. 2002). The resulting motif is shown in Figure 3A.<br />

Figure 3.<br />

Discovery of an enriched, conserved homeodomain-like motif using EMnEM<br />

A: graphical representation of the motif discovered using EMnEM. EMnEM was run assuming the motif<br />

evolved at 0.5 the average rate in these highly conserved sequences, suggesting that this motif is<br />

preferentially conserved. B: TAAT is specifically enriched in these conserved enhancers relative to<br />

surrounding regions. The density of the core TAAT sequence in the enhancers is much greater than in the<br />

flanking sequences, further supporting the hypothesis that this element is functionally important in these<br />

enhancers.<br />

TAAT per bp<br />

124<br />

0.02<br />

0.015<br />

0.01<br />

0.005<br />

0<br />

-5000 0 5000<br />

position from left or right edge<br />

<strong>of</strong> enhance r<br />

average in<br />

overlapping 500<br />

bp windows<br />

average in<br />

enhancers<br />

genomic<br />

background<br />

(10000 random<br />

windows)
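The density comparison in panel B can be sketched as follows. This is a minimal illustration, not the code used to produce the figure: the function names, the 250 bp step between overlapping windows, and the choice to count only forward-strand matches to TAAT are all assumptions.<br />

```python
def motif_density(seq, motif="TAAT"):
    """Occurrences of motif per bp of seq, counting overlapping matches."""
    w = len(motif)
    if len(seq) < w:
        return 0.0
    hits = sum(1 for i in range(len(seq) - w + 1) if seq[i:i + w] == motif)
    return hits / len(seq)


def windowed_density(seq, motif="TAAT", window=500, step=250):
    """Density in overlapping windows; returns (window start, density) pairs."""
    return [(start, motif_density(seq[start:start + window], motif))
            for start in range(0, len(seq) - window + 1, step)]
```

Applying `motif_density` to each enhancer and `windowed_density` to its flanking sequence gives the two quantities that the panel compares.<br />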


The discovery <strong>of</strong> a highly conserved, significantly enriched motif that contains the<br />

ATTA/TAAT homeodomain core in these highly conserved enhancers that have been<br />

shown to drive specific spatial embryonic expression patterns and fall near<br />

developmental genes (Sandelin et al. 2004, Wolfe et al. 2005) is interesting for the<br />

following reasons. First, although we cannot say which proteins are binding these<br />

sequences, the hox and other homeodomain proteins are well known to regulate spatial<br />

and temporal patterns <strong>of</strong> expression during vertebrate development (Hunt and Krumlauf<br />

1992, Cavodeassi 2001) as well as nervous system patterning in particular (Boyl et al.<br />

2001, McMahon 2000, Cecchi and Boncinelli 2000). That members <strong>of</strong> that family are<br />

regulating these enhancers is consistent with their spatial patterns and putative<br />

developmental patterning functions. Second, and perhaps more interestingly, the<br />

discovery <strong>of</strong> high densities <strong>of</strong> binding sites in these enhancers supports the hypothesis<br />

that they are conserved at the nucleotide level because of the constraints these binding sites<br />

place on their <strong>evolution</strong>. However, the binding sites for most developmental regulatory<br />

homeodomain proteins are known to be highly degenerate, and it is therefore surprising<br />

that they would be conserved so highly at the nucleotide level. In passing I note that I<br />

also ran MEME (Bailey et al. 1994) on these sequences and did not recover this motif.<br />

Predicting gene expression using conserved binding sites<br />

Because the homeodomain core that was discovered <strong>by</strong> EMnEM could represent the<br />

overlapping specificity <strong>of</strong> many proteins, and the models underlying EMnEM do not<br />

account for the possibility <strong>of</strong> multiple motifs <strong>of</strong> overlapping specificity, I decided to<br />

search the enhancers for other previously characterized motifs. I searched the human-<br />



mouse-fugu alignments <strong>of</strong> the tested enhancers with 78 matrices from the JASPAR<br />

database (Sandelin et al. 2004) to find conserved matches to these matrices. Many <strong>of</strong><br />

these motifs are derived from SELEX or in vitro footprinting studies, and therefore<br />

represent the specificity of only one protein. Indeed, there are multiple matrices in the<br />

database that represent the specificities of different homeodomain-containing proteins<br />

(e.g., Figure 4B), and using these matrices it may be possible to distinguish between<br />

various members of this large protein family.<br />

Because the expression patterns <strong>of</strong> the tested fragments could be divided into<br />

those that showed expression in different spatial and morphological patterns, we were<br />

interested in identifying binding sites that were specifically associated with particular<br />

expression patterns. In testing this hypothesis, I found that several <strong>of</strong> these matrices were<br />

significantly enriched in the enhancers that showed forebrain expression patterns relative<br />

to the fragments that showed no pattern, and those that showed other patterns (Figure 4A).<br />

In order to test the predictive power <strong>of</strong> these conserved matches to the matrices, I<br />

performed a stepwise logistic regression analysis (Wasserman and Fickett 1998).<br />

Although several <strong>of</strong> the motifs were significantly associated with the forebrain expression<br />

pattern, only Nkx was highly significant when a model using more than one motif was<br />

constructed. Furthermore, in order to verify that these weakly significant additional<br />

motifs were not the result of the model being overfit, I performed leave-one-out<br />

cross-validation, and found that the error rate (~20%) using only the hits to Nkx was<br />

comparable to that using all 5 selected motifs (~18%). This suggested that most of the<br />

predictive power was coming from Nkx.<br />
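The cross-validation step can be illustrated with a small sketch. It fits a logistic regression by plain gradient ascent and estimates the leave-one-out error rate; the stepwise selection of motifs is not shown. The function names, the optimizer settings and the toy hit counts are hypothetical and are not the data or code used in this analysis.<br />

```python
import math


def _linear(w, xi):
    """Intercept plus weighted sum of features, clamped to keep exp() in range."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return max(-30.0, min(30.0, z))


def fit_logistic(X, y, steps=2000, lr=0.1):
    """Fit logistic regression weights (intercept first) by gradient ascent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - 1.0 / (1.0 + math.exp(-_linear(w, xi)))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def predict(w, xi):
    """Predict class 1 when the fitted probability exceeds 0.5."""
    return 1 if _linear(w, xi) > 0 else 0


def loocv_error(X, y):
    """Leave-one-out error: refit without each example, then test on it."""
    errors = 0
    for i in range(len(X)):
        w = fit_logistic(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        errors += int(predict(w, X[i]) != y[i])
    return errors / len(X)


# Hypothetical counts of conserved hits to one matrix per enhancer, with
# 1 = forebrain expression and 0 = no forebrain expression.
hits = [[0], [1], [0], [1], [5], [6], [5], [6]]
forebrain = [0, 0, 0, 0, 1, 1, 1, 1]
print(loocv_error(hits, forebrain))
```

Comparing `loocv_error` for a single-motif feature set against a multi-motif feature set mirrors the comparison of error rates made above.<br />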



Although I have presented no specific evidence that these enriched binding sites<br />

are actually responsible for the forebrain expression pattern, or that these enhancers are<br />

indeed regulated by Nkx proteins, these observations are certainly consistent with Nkx2.1<br />

and Nkx2.2 (vertebrate homologues of Drosophila vnd) being involved in forebrain<br />

patterning and neural fate determination (Briscoe et al 1999, Corbin et al 2003). That<br />

these binding sites are conserved since the common ancestor <strong>of</strong> the vertebrates is<br />

consistent with the suggestion <strong>of</strong> an ancient role for Nkx2.1 in the patterning and perhaps<br />

origin <strong>of</strong> the vertebrate forebrain (Venkatesh et al 1999).<br />

[Figure 4, panels A and B. Recoverable plot labels: y axis "Conserved hits in negatives" (0 to 90); x axis "Conserved hits in forebrain" (0 to 40); point labels "Sox5" and "chop cEBP".]<br />

Figure 4.<br />

Identification <strong>of</strong> motifs specifically associated with enhancers that showed forebrain<br />

expression.<br />

A conserved hits, defined as matches to the matrix in human (p


unusual for homeodomain proteins, as they bind sequences with a CAAG core as well as the canonical<br />

TAAT (Damante et al 1994).<br />

Discussion<br />

While detailed experimental tests <strong>of</strong> the predictions presented here will be necessary to<br />

confirm the power <strong>of</strong> these methods in analysis <strong>of</strong> mammalian vertebrate regulatory<br />

sequences, preliminary results (J. Wang personal communication) suggest that at least the<br />

downstream CYP7A1 enhancer can drive expression <strong>of</strong> a reporter in cell culture. These<br />

results suggest that methods such as those described here that take advantage <strong>of</strong> the<br />

<strong>evolution</strong>ary properties <strong>of</strong> binding sites, principally their preferential conservation in<br />

multiple sequence alignments, will prove effective tools in the efforts to understand even<br />

the most complex regulatory sequences, such as those involved in patterning <strong>of</strong> the<br />

vertebrate nervous system. Further, the methods developed here, which employ a<br />

probabilistic model of binding site evolution, seem not only to be justified in their<br />

conception, but practically useful in the cases studied here.<br />

References<br />

Aparicio S, Morrison A, Gould A, Gilthorpe J, Chaudhuri C, Rig<strong>by</strong> P, Krumlauf R, Brenner S. Detecting<br />

conserved regulatory elements with the model genome <strong>of</strong> the Japanese puffer fish, Fugu rubripes. Proc Natl<br />

Acad Sci U S A. 1995 Feb 28;92(5):1684-8.<br />

Boyl PP, Signore M, Annino A, Barbera JP, Acampora D, Simeone A. Otx genes in the development and<br />

<strong>evolution</strong> <strong>of</strong> the vertebrate brain. Int J Dev Neurosci. 2001 Jul;19(4):353-63.<br />

Briscoe J, Sussel L, Serup P, Hartigan-O'Connor D, Jessell TM, Rubenstein JL, Ericson J. Homeobox gene<br />

Nkx2.2 and specification <strong>of</strong> neuronal identity <strong>by</strong> graded Sonic hedgehog signalling. Nature. 1999 Apr<br />

15;398(6728):622-7.<br />

Cavodeassi F, Modolell J, Gomez-Skarmeta JL. The Iroquois family <strong>of</strong> genes: from body building to neural<br />

patterning. Development. 2001 Aug;128(15):2847-55.<br />

Cecchi C, Boncinelli E. Emx homeogenes and mouse brain development. Trends Neurosci. 2000<br />

Aug;23(8):347-52. Review.<br />



Chen J, Cooper AD, Levy-Wilson B. Hepatocyte nuclear factor 1 binds to and transactivates the human but<br />

not the rat CYP7A1 promoter. Biochem Biophys Res Commun. 1999 Jul 14;260(3):829-34.<br />

Chiang JY, Kimmel R, Stroup D. Regulation <strong>of</strong> cholesterol 7alpha-hydroxylase gene (CYP7A1)<br />

transcription <strong>by</strong> the liver orphan receptor (LXRalpha). Gene. 2001 Jan 10;26<br />

Corbin JG, Rutlin M, Gaiano N, Fishell G. Combinatorial function <strong>of</strong> the homeodomain proteins Nkx2.1<br />

and Gsh2 in ventral telencephalic patterning. Development. 2003 Oct;130(20):4895-906. Epub 2003 Aug<br />

20.<br />

Damante G, Fabbro D, Pellizzari L, Civitareale D, Guazzi S, Polycarpou-Schwartz M, Cauci S,<br />

Quadrifoglio F, Formisano S, Di Lauro R. Sequence-specific DNA recognition <strong>by</strong> the thyroid transcription<br />

factor-1 homeodomain. Nucleic Acids Res. 1994 Aug 11;22(15):3075-83.<br />

Elgar G, Sandford R, Aparicio S, Macrae A, Venkatesh B, Brenner S. Small is beautiful: comparative<br />

genomics with the pufferfish (Fugu rubripes). Trends Genet. 1996 Apr;12(4):145-50.<br />

Horton JD, Goldstein JL, Brown MS. SREBPs: activators <strong>of</strong> the complete program <strong>of</strong> cholesterol and fatty<br />

acid synthesis in the liver. J Clin Invest. 2002 May;109(9):1125-31.<br />

Hunt P, Krumlauf R. Hox codes and positional specification in vertebrate embryonic axes. Annu Rev Cell<br />

Biol. 1992;8:227-56.<br />

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome<br />

browser at UCSC. Genome Res. 2002 Jun;12(6):996-1006.<br />

Laffitte BA, Repa JJ, Joseph SB, Wilpitz DC, Kast HR, Mangelsdorf DJ, Tontonoz P. LXRs control lipid-inducible<br />

expression of the apolipoprotein E gene in macrophages and adipocytes. Proc Natl Acad Sci U S<br />

A. 2001 Jan 16;98(2):507-12.<br />

Marrapodi M, Chiang JY. Peroxisome proliferator-activated receptor alpha (PPARalpha) and agonist<br />

inhibit cholesterol 7alpha-hydroxylase gene (CYP7A1) transcription. J Lipid Res. 2000 Apr;41(4):514-20.<br />

McMahon AP. Neural patterning: the role <strong>of</strong> Nkx genes in the ventral spinal cord. Genes Dev. 2000 Sep<br />

15;14(18):2261-4.<br />

Repa JJ, Mangelsdorf DJ. The role <strong>of</strong> orphan nuclear receptors in the <strong>regulation</strong> <strong>of</strong> cholesterol homeostasis.<br />

Annu Rev Cell Dev Biol. 2000;16:459-81.<br />

Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B. JASPAR: an open-access database for<br />

eukaryotic transcription factor binding pr<strong>of</strong>iles. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D91-4.<br />

Sandelin A, Bailey P, Bruce S, Engstrom PG, Klos JM, Wasserman WW, Ericson J, Lenhard B. Arrays <strong>of</strong><br />

ultraconserved non-coding regions span the loci <strong>of</strong> key developmental genes in vertebrate genomes. BMC<br />

Genomics. 2004 Dec 21;5(1):99<br />

Stroup D, Chiang JY. HNF4 and COUP-TFII interact to modulate transcription of the cholesterol 7alpha-hydroxylase<br />

gene (CYP7A1). J Lipid Res. 2000 Jan;41(1):1-11.<br />

Tontonoz P, Mangelsdorf DJ. Liver X receptor signaling pathways in cardiovascular disease. Mol<br />

Endocrinol. 2003 Jun;17(6):985-93. Epub 2003 Apr 10. Review.<br />

Venkatesh TV, Holland ND, Holland LZ, Su MT, Bodmer R. Sequence and developmental expression <strong>of</strong><br />

amphioxus AmphiNk2-1: insights into the <strong>evolution</strong>ary origin <strong>of</strong> the vertebrate thyroid gland and forebrain.<br />

Dev Genes Evol. 1999 Apr;209(4):254-9.<br />



Wasserman WW, Fickett JW. Identification <strong>of</strong> regulatory regions which confer muscle-specific gene<br />

expression. J Mol Biol. 1998 Apr 24;278(1):167-81.<br />

Zhang Y, Mangelsdorf DJ. LuXuRies <strong>of</strong> lipid homeostasis: the unity <strong>of</strong> nuclear hormone receptors,<br />

transcription <strong>regulation</strong>, and cholesterol sensing. Mol Interv. 2002 Apr;2(2):78-87.<br />



12. Future challenges<br />

All <strong>of</strong> the future challenges that face the attempts to utilize sequence data from single<br />

genomes for motif finding and binding site prediction can be extended to the <strong>evolution</strong>ary<br />

situation using the probabilistic framework that we have described. In the <strong>evolution</strong>ary<br />

case, however, there are a great number <strong>of</strong> additional challenges. Of particular<br />

importance is to characterize the <strong>evolution</strong>ary constraints on promoter organization and<br />

context dependencies between binding sites. In order to do this, reasonably large sets <strong>of</strong><br />

promoters will be required where the binding sites have been characterized in detail, and<br />

we can be reasonably confident that the <strong>regulation</strong> has not changed over <strong>evolution</strong>.<br />

Needless to say, assembling such datasets will require considerable experimental efforts.<br />

Another major area <strong>of</strong> interest is to understand the situation where function is not<br />

strictly conserved at the level <strong>of</strong> the binding sites; this will be the focus <strong>of</strong> the majority <strong>of</strong><br />

the remainder <strong>of</strong> this dissertation. From the perspective <strong>of</strong> tools to detect motifs and<br />

identify binding sites, methods that do not assume binding sites to be strictly conserved<br />

since the common ancestor will be particularly useful where the number of species<br />

available is large – in these cases binding sites need not be required to be conserved in<br />

the entire tree. Once we try to allow for the <strong>evolution</strong> <strong>of</strong> functional constraints the<br />

problem <strong>of</strong> binding site <strong>evolution</strong> becomes much more complicated. Furthermore,<br />

because there have been relatively few mechanistic studies <strong>of</strong> binding site <strong>evolution</strong>, we<br />

have little empirical data on which to develop our models.<br />



Part II<br />

Identifying sequences controlling gene expression using data from a single species<br />



4. Searching for statistically enriched sequences<br />

Methods for identifying sequence determinants <strong>of</strong> <strong>transcriptional</strong> <strong>regulation</strong> take<br />

advantage <strong>of</strong> sequence specificity <strong>of</strong> transcription factors, and attempt to recognize<br />

statistically enriched patterns in non-coding DNA. For example, Figure 1 shows the<br />

enrichment <strong>of</strong> zeste binding sites in regions bound in vivo <strong>by</strong> this protein.<br />

Figure 1.<br />

Enrichment of predicted zeste binding sites in bound fragments in a ChIP-chip<br />

experiment<br />

The solid line is the density of matches to the zeste matrix in non-overlapping 100 bp windows surrounding<br />

peaks in signal intensity from a ChIP-chip experiment. The dotted traces are 99% confidence intervals for<br />

the binomial distribution. The peak in the number of matches to the zeste matrix corresponds to the peak<br />

in binding as measured in this experiment.<br />

[Plot: binding site density (per bp), y axis from 0 to 0.014, against distance relative to peak, x axis from -2500 to 2500.]<br />

Systematically identifying the binding sites for transcription factors is an important part<br />

of understanding how regulatory information is encoded in DNA sequence. We can<br />

break this problem into two parts: first, to find the specificity of transcription factors, and<br />

second, once we know the specificity, to identify the functional examples of their binding<br />

sites in the genome. Both of these problems can be addressed computationally; the first,<br />

identifying the specificity is a pattern discovery problem, while the second, identifying<br />

the ‘instances’ <strong>of</strong> the motif is a pattern matching problem. Much more attention has been<br />

paid to the former, and it shall take most <strong>of</strong> the focus here. I will return to the latter<br />

problem in more detail in the context <strong>of</strong> <strong>evolution</strong>ary binding site identification.<br />

Finding the sequence specificity <strong>of</strong> unknown transcription factors amounts to<br />

looking for statistically enriched patterns or motifs in non-coding DNA, and it is<br />

therefore worth some consideration <strong>of</strong> what definitions <strong>of</strong> statistical enrichment can be<br />

employed.<br />

Statistics for calculating ‘statistical’ enrichment<br />

An important question in these types <strong>of</strong> approaches is how to score the statistical<br />

enrichment <strong>of</strong> the sequence feature <strong>of</strong> interest. The simplest approach is to simply count<br />

the instances <strong>of</strong> the motif in the bound set and compare that to the expected frequency<br />

based on the genome, or some random subset <strong>of</strong> the genome. Statistical significance can<br />

be assessed in these cases using the binomial distribution, where p-values are given <strong>by</strong><br />

$$p(x \ge n \mid N, f) = \sum_{x=n}^{N} p(x \mid N, f) = 1 - \sum_{x=0}^{n-1} p(x \mid N, f),$$<br />

where x is the number of instances of the motif in the subset (n being the number actually observed), N is the number of base<br />

pairs in the subset, f is the genomic, or background, frequency of instances of this motif,<br />

and<br />

$$p(x \mid N, f) = \binom{N}{x} f^{x} (1 - f)^{N - x} = \frac{N!}{x!\,(N - x)!} f^{x} (1 - f)^{N - x},$$<br />

which is the binomial distribution. The assumption here is that each position in the<br />

sequence can be thought <strong>of</strong> as an independent trial, and that there is some background<br />

rate at which motif instances appear. Then we can ask directly the probability <strong>of</strong><br />

observing as many instances as we have given the background frequency. This statistical<br />

model has limitations, in particular in the case where a motif is self-overlapping, or there are<br />

repetitive regions in the subset. In these cases the significance can be greatly over-estimated,<br />

or one sequence (or a small region of a sequence) that contains many instances can<br />

contribute disproportionately.<br />
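As a worked illustration of the binomial test, the following sketch computes the upper-tail p-value directly from the definition. The function names are hypothetical, and summing the upper tail in log space (rather than computing one minus the lower tail) is a choice made here to keep very small p-values from being lost to rounding.<br />

```python
import math


def log_binom_pmf(x, N, f):
    """log p(x | N, f) for the binomial distribution, via log-gamma."""
    if f <= 0.0:
        return 0.0 if x == 0 else -math.inf
    if f >= 1.0:
        return 0.0 if x == N else -math.inf
    log_choose = math.lgamma(N + 1) - math.lgamma(x + 1) - math.lgamma(N - x + 1)
    return log_choose + x * math.log(f) + (N - x) * math.log1p(-f)


def binomial_enrichment_pvalue(n, N, f):
    """p(x >= n | N, f): chance of at least n instances in N positions."""
    if n <= 0:
        return 1.0
    tail = sum(math.exp(log_binom_pmf(x, N, f)) for x in range(n, N + 1))
    return min(1.0, tail)
```

For example, `binomial_enrichment_pvalue(20, 5000, 0.001)` would score a hypothetical subset of 5000 base pairs containing 20 instances of a motif whose background frequency is 0.001 per position.<br />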

In general, when we are searching for over-represented sequences, we may want<br />

to include some notion <strong>of</strong> the ‘coverage’ in the subset; not only should there be more<br />

instances than some expectation, but that these should be spread over more <strong>of</strong> the subset<br />

than expected. Another way <strong>of</strong> looking at the idea <strong>of</strong> coverage is that we are requiring<br />

the motif to be not only statistically enriched, but also shared amongst the sequences in<br />

the subset. Tests of enrichment which take this into account are also possible, and of<br />

particular utility has been one based on the hypergeometric distribution, which applies when the lengths<br />

of the sequences in the subset are all the same. In this case, instead of the sample size<br />

being the number of basepairs, because the sequences are all the same size, we can use<br />

the number of sequences. We now define x to be a ‘coverage statistic’, i.e., the number<br />

of sequences in the set that contain the motif. Now the numbers may be too small to<br />

invoke the binomial distribution, but we can use the hypergeometric distribution to<br />

calculate p-values:<br />


$$p(x \ge m \mid M, n, N) = \sum_{x=m}^{M} p(x \mid M, n, N) = 1 - \sum_{x=0}^{m-1} p(x \mid M, n, N),$$<br />

where x is now the number of sequences in the subset that contain the motif (m being the number actually observed), M is the<br />

number of sequences in the subset, n is the number of sequences of that size in the<br />

genome that contain the motif, N is the total number of sequences of that size in the<br />

genome, and<br />

$$p(x \mid M, n, N) = \frac{\binom{M}{x}\binom{N - M}{n - x}}{\binom{N}{n}},$$<br />

which is the hypergeometric distribution. Of course, the requirement that the sequences<br />

be <strong>of</strong> the same length is not always easy to satisfy. We have found this test extremely<br />

robust when using 600 bp upstream <strong>of</strong> each yeast gene, for example. More generally,<br />

when the lengths of sequences differ, it is possible to evaluate the significance of the<br />

coverage statistic (the number of sequences that contain the motif) by sampling<br />

sequences of the same size distribution from the genome. For motifs with low specificity<br />

that occur commonly in random sequence, this can be extended by defining the coverage<br />

statistic $x_k$ to be the number of sequences that contain at least k instances of the motif.<br />
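Both versions of the coverage test can be sketched in a few lines. The first two functions evaluate the hypergeometric p-value for equal-length sequences; the remaining functions estimate a p-value for the coverage statistic with at least k instances per sequence by sampling length-matched fragments. All names are hypothetical, the genome is assumed to be a single string longer than any sequence in the subset, only exact forward-strand matches are counted, and adding one to the numerator and denominator of the sampled p-value is a convention chosen here, not one stated in the text.<br />

```python
import math
import random


def hypergeom_pmf(x, M, n, N):
    """p(x | M, n, N): x of the M subset sequences contain the motif."""
    if x < 0 or x > min(M, n) or n - x > N - M:
        return 0.0
    return math.comb(M, x) * math.comb(N - M, n - x) / math.comb(N, n)


def coverage_pvalue(m, M, n, N):
    """p(x >= m | M, n, N) for sequences of equal length."""
    return sum(hypergeom_pmf(x, M, n, N)
               for x in range(max(m, 0), min(M, n) + 1))


def count_instances(seq, motif):
    """Number of (possibly overlapping) exact matches to motif in seq."""
    w = len(motif)
    return sum(1 for i in range(len(seq) - w + 1) if seq[i:i + w] == motif)


def coverage_statistic(seqs, motif, k=1):
    """Number of sequences that contain at least k instances of the motif."""
    return sum(1 for s in seqs if count_instances(s, motif) >= k)


def sampled_coverage_pvalue(subset, genome, motif, k=1, n_samples=1000, seed=0):
    """Empirical p-value from random genomic fragments with matched lengths."""
    rng = random.Random(seed)
    observed = coverage_statistic(subset, motif, k)
    as_extreme = 0
    for _ in range(n_samples):
        sample = []
        for s in subset:
            start = rng.randrange(len(genome) - len(s) + 1)
            sample.append(genome[start:start + len(s)])
        if coverage_statistic(sample, motif, k) >= observed:
            as_extreme += 1
    return (as_extreme + 1) / (n_samples + 1)
```

The sampled version makes no assumption about sequence length, at the cost of a p-value whose resolution is limited by the number of samples drawn.<br />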

Ideally, however, we desire some kind of hybrid approach that takes into account<br />

both the statistical enrichment (the excess number of instances) and the coverage (the<br />

excess number of sequences that contain an excess of instances).<br />

Motif finding based on discrete models <strong>of</strong> transcription factor binding sites<br />

There are two main approaches to the problem of discovering enriched patterns in non-coding<br />

DNA. The first is a set of approaches that treat the motif discretely, and attempt to<br />

enumerate motifs until statistically significant ones are found. Significance can be<br />

assessed <strong>by</strong> comparing the distribution in some subset <strong>of</strong> sequence to a background<br />

model, or explicitly comparing the distribution in the subset to the genomic background,<br />

and using statistical models such as the ones presented above. The availability <strong>of</strong> rigorous<br />

estimates <strong>of</strong> statistical significance, and that the number <strong>of</strong> motifs tested can be controlled<br />

are the main strengths <strong>of</strong> the discrete methods. The main drawback <strong>of</strong> such methods is<br />

that the discrete models they employ cannot always account for the true variability <strong>of</strong> the<br />

binding sites for a given factor.<br />

I have applied methods <strong>of</strong> this kind to a number <strong>of</strong> datasets. ChIP on chip data is<br />

particularly well suited to analysis <strong>of</strong> this kind because the output <strong>of</strong> such experiments is<br />

a list <strong>of</strong> 'bound' fragments that can be tested explicitly against the genome in order to<br />

identify enriched sequences that are candidates for the binding sites <strong>of</strong> these factors. A<br />

simple approach is to start with short words, and then extend them if the extension<br />

increases the significance relative to the background set. An example <strong>of</strong> the output <strong>of</strong><br />

such a motif finder is presented in Table 1. There has also been a lot of work in<br />

developing sophisticated algorithms to exhaustively search the space <strong>of</strong> discrete sequence<br />

patterns (e.g., Sinha and Tompa 2000).<br />
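The word-extension idea can be sketched as a greedy search. This is an illustration under stated assumptions rather than the motif finder used here: significance is scored with the hypergeometric coverage test, treating the bound and background sequences together as the population; words are extended one base at a time on either end; and ambiguity characters are not considered. All function names are hypothetical.<br />

```python
import math


def coverage_pvalue(m, M, n, N):
    """Hypergeometric p(x >= m): m of M bound sequences hit, n of N overall."""
    return sum(math.comb(M, x) * math.comb(N - M, n - x) / math.comb(N, n)
               for x in range(m, min(M, n) + 1))


def word_pvalue(word, bound, background):
    """Coverage p-value of a word in the bound set versus bound + background."""
    m = sum(word in s for s in bound)
    if m == 0:
        return 1.0
    n = m + sum(word in s for s in background)
    return coverage_pvalue(m, len(bound), n, len(bound) + len(background))


def extend_word(seed, bound, background, max_len=8):
    """Extend seed by one base at a time while the p-value keeps improving."""
    best, best_p = seed, word_pvalue(seed, bound, background)
    while len(best) < max_len:
        candidates = [b + best for b in "ACGT"] + [best + b for b in "ACGT"]
        p, word = min((word_pvalue(c, bound, background), c) for c in candidates)
        if p >= best_p:
            break
        best, best_p = word, p
    return best, best_p
```

Running `extend_word` from every short word found in the bound set and ranking the results by p-value gives a list of candidate motifs of the kind tabulated for the zeste data.<br />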

motif consensus n enrichment p-value<br />

1 YCACTCAR 156 1.89e-83<br />

2 RRGWGAGC 107 1.23e-43<br />

3 GAGWGARW 161 1.48e-37<br />

4 CRCTCRM 227 3.12e-28<br />

5 AGAGMG 137 6.25e-17<br />

Table 1<br />

Running a discrete motif finder on the zeste data.<br />

Output of a simple discrete motif-finding algorithm that enumerates w-mers, and adds ambiguity characters<br />

to maximize significance of the consensus in the bound set (using the simple coverage statistic described<br />

above) relative to a randomly drawn set. This was run on the zeste ChIP-chip data and the top motif<br />

matches the consensus of the zeste footprinted sites (underlined). The rest of the motifs correspond to<br />

related ‘GAGA’ sequences that may represent binding sites for the GAGA factor (also known as GAF or<br />

trl) which is known to work with zeste in regulating target genes (Laney and Biggin 1996, Mulholland et al.<br />

2003). This example highlights a major challenge for motif finders in this type of application: when<br />

multiple factors bind the same regulatory regions it is difficult to interpret the results of motif-finding<br />

algorithms. I shall return to this issue in the future directions section.<br />

Probabilistic approaches to motif finding<br />

The second set of motif finding approaches relies on a probabilistic model of the motif, a<br />

multinomial for each position (Stormo 2000). Now the parameters <strong>of</strong> the model can be<br />

inferred from the data using a variety <strong>of</strong> optimization techniques and some assumptions<br />

about the distribution <strong>of</strong> the binding sites in the sequence (e.g., Stormo and Hartzell,<br />

1989). These approaches often suffer from overfitting: the high number of parameters<br />

needed for the models requires large amounts of data to estimate accurately, and because<br />

they rely on a probabilistic model <strong>of</strong> the background, they <strong>of</strong>ten find sequences that are<br />

significantly enriched, but not specific to the set <strong>of</strong> genes <strong>of</strong> interest, such as polyA/T and<br />

other low complexity sequences. An example <strong>of</strong> a probabilistic model for the zeste<br />

motif is presented in Figure 2.<br />



A probability B<br />

pos. fA fC fG fT<br />

1 0.159 0.081 0.081 0.678<br />

2 0.011 0.193 0.044 0.752<br />

3 0.011 0.007 0.970 0.011<br />

4 0.974 0.007 0.007 0.011<br />

5 0.011 0.007 0.970 0.011<br />

6 0.048 0.156 0.007 0.789<br />

7 0.011 0.044 0.900 0.048<br />

Figure 2.<br />

Representing families <strong>of</strong> binding sites with probabilistic models<br />

A: a probabilistic representation <strong>of</strong> zeste binding sites. B: a graphical ‘sequence logo’ representation <strong>of</strong> the<br />

binding sites. These were constructed using in vitro footprinted binding sites (David Nix personal<br />

communication), and the ‘sequence logo’ was created using WebLogo (Crooks et al. 2004). While the<br />

consensus sequence TGAGTG does describe the binding sites, this probabilistic representation<br />

encapsulates the notion that while T is preferred at both the first and second positions, there is more<br />

variability at the first position than the second.<br />

We will return to probabilistic motif finding, both in the context <strong>of</strong> associating motifs<br />

with gene expression data, and in relation to incorporating conservation <strong>of</strong> binding sites<br />

into methods for motif finding and binding site identification.<br />

Methods<br />

Identifying instances <strong>of</strong> a motif<br />

For consensus sequences, we simply enumerate the number <strong>of</strong> times that sequence was<br />

found in the subset or in the background set. For probabilistic models <strong>of</strong> motifs we<br />

compute a likelihood ratio score that compares the probability <strong>of</strong> the w bases under the<br />

motif model to the probability under the background model. In the figure showing zeste<br />

enrichment (Figure 1), we used a score cutoff of 6.0, although the results are similar with other<br />

cutoffs. We shall return to this type of score, its evolutionary extensions and how to<br />

calculate its distribution in part III.<br />
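A minimal sketch of this likelihood ratio score, using the zeste frequencies tabulated in Figure 2A. The uniform background, the natural logarithm and the forward-strand-only scan are assumptions made for illustration (the base of the logarithm and the background model behind the 6.0 cutoff are not stated here), and the function names are hypothetical. Input is assumed to be uppercase A, C, G and T only.<br />

```python
import math

# Zeste frequency matrix copied from Figure 2A; columns are fA, fC, fG, fT.
ZESTE = [
    (0.159, 0.081, 0.081, 0.678),
    (0.011, 0.193, 0.044, 0.752),
    (0.011, 0.007, 0.970, 0.011),
    (0.974, 0.007, 0.007, 0.011),
    (0.011, 0.007, 0.970, 0.011),
    (0.048, 0.156, 0.007, 0.789),
    (0.011, 0.044, 0.900, 0.048),
]
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}
UNIFORM = (0.25, 0.25, 0.25, 0.25)  # assumed background model


def llr_score(site, matrix, background=UNIFORM):
    """Log-likelihood ratio of a w-base site: motif model vs background."""
    return sum(math.log(matrix[i][INDEX[b]] / background[INDEX[b]])
               for i, b in enumerate(site))


def find_matches(seq, matrix, cutoff=6.0, background=UNIFORM):
    """Return (position, score) for every window scoring at least cutoff."""
    w = len(matrix)
    matches = []
    for i in range(len(seq) - w + 1):
        score = llr_score(seq[i:i + w], matrix, background)
        if score >= cutoff:
            matches.append((i, score))
    return matches
```

Counting the entries returned by `find_matches` in each window of a sequence gives a match density of the kind plotted in the zeste enrichment figure.<br />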

Building probabilistic models <strong>of</strong> motifs<br />

Frequency matrices <strong>of</strong> the type commonly used to represent the specificity <strong>of</strong><br />

transcription factors are multinomials. The maximum-likelihood estimates for the<br />

parameters are then simply the counts of each base over the total number observed.<br />

However, because the sample sizes are small, we add a correction to the ML estimate,<br />

known as a pseudocount, to avoid having parameter estimates of 0 in the matrix. This is<br />

often interpreted as a prior distribution, and we use either the background base<br />

frequencies, or the add-one method.<br />
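The estimate with pseudocounts can be written out as a short sketch. The function name and the example sites are hypothetical; passing pseudocounts of one per base gives the add-one method, and passing the background base frequencies gives the other variant mentioned above.<br />

```python
def frequency_matrix(sites, pseudocounts=(1.0, 1.0, 1.0, 1.0)):
    """Per-position base frequencies (A, C, G, T) from aligned, equal-length sites.

    Each estimate is (count + pseudocount) / (number of sites + total pseudocount),
    so no entry is ever exactly 0.
    """
    total_pseudo = sum(pseudocounts)
    matrix = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        row = tuple((column.count(base) + pc) / (len(sites) + total_pseudo)
                    for base, pc in zip("ACGT", pseudocounts))
        matrix.append(row)
    return matrix


# Add-one method on two hypothetical sites.
add_one = frequency_matrix(["TGA", "TGC"])
# Background frequencies used as pseudocounts (hypothetical values).
with_background = frequency_matrix(["TGA", "TGC"], pseudocounts=(0.3, 0.2, 0.2, 0.3))
```

With only two sites, the add-one estimate for T at the first position is 3/6 rather than the maximum-likelihood value of 1, which shows how strongly the correction acts when the sample is small.<br />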

References<br />

Crooks GE, Hon G, Chandonia JM, Brenner SE (2004) WebLogo: A sequence logo generator. Genome Res<br />

14: 1188-1190.<br />

Laney JD, Biggin MD. Redundant control <strong>of</strong> Ultrabithorax <strong>by</strong> zeste involves functional levels <strong>of</strong> zeste<br />

protein binding at the Ultrabithorax promoter. Development. 1996 Jul;122(7):2303-11.<br />

Mulholland NM, King IF, Kingston RE. Regulation of Polycomb group complexes by the sequence-specific<br />

DNA binding proteins Zeste and GAGA. Genes Dev. 2003 Nov 15;17(22):2741-6.<br />

Sinha S, Tompa M. A statistical method for finding transcription factor binding sites. Proc Int Conf Intell<br />

Syst Mol Biol. 2000;8:344-54.<br />

Stormo GD, Hartzell GW 3rd. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl<br />

Acad Sci U S A. (1989) Feb;86(4):1183-7.<br />

Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. (2000) Jan;16(1):16-23.<br />

van Helden J, Andre B, Collado-Vides J. Extracting regulatory sites from the upstream region <strong>of</strong> yeast<br />

genes <strong>by</strong> computational analysis <strong>of</strong> oligonucleotide frequencies. J Mol Biol. 1998 Sep 4;281(5):827-42<br />



5. Methods that associate sequences with other data<br />

Thus far we have thought <strong>of</strong> the motif-finding problem as the attempt to find patterns that<br />

show statistical differences between a subset and the background model. Another way <strong>of</strong><br />

looking at this is to try to find sequence features that explain the variability <strong>of</strong> a certain<br />

quantity, and in the previous examples, this was always 1 or 0, subset or background. In<br />

that case we seek an association between some sequence feature and a discrete variable.<br />

In general, however, we may wish to find associations between sequence features and<br />

continuous or multidimensional variables. All manner <strong>of</strong> tests <strong>of</strong> association can be used,<br />

and several have been tried in this context, such as linear (Bussemaker and Siggia, 2001)<br />

or logistic (Keles et al. 2002) regression or z-tests (Chiang et al. 2001). A main<br />

observation from these studies has been that the discovered sequence features usually<br />

have low predictive power when compared to any functional readout, but can be<br />

discovered because <strong>of</strong> the large statistical power afforded <strong>by</strong> genomic datasets. An<br />

example <strong>of</strong> the patterns that such approaches can produce is given <strong>by</strong> figure 1.<br />

Figure 1.<br />



Z scores over the cell cycle<br />

Each point on this graph is the z-score (as in Chiang et al. 2001) for the genes that contain a particular<br />

sequence element vs. the entire genome in microarray data from S. cerevisiae over the cell cycle (Spellman<br />

et al. 1998). The cyclical patterns <strong>of</strong> transcript abundance recapitulate the temporal ordering <strong>of</strong> these<br />

transcription factors in cell cycle control: Mbp1p/Swi6p, Swi4p/Swi6p, Fkh1p/Fkh2p, Swi5p (Breeden<br />

1996, Merrill et al. 1992, Aerne et al. 1998, Zhu et al. 2000). Note that genes with Gal4p sites are induced<br />

in the first 2 experiments, which were conditional alleles induced in gal+ media, and genes with Ste12p<br />

sites are induced at the beginning <strong>of</strong> the first time-course because the cells were synchronized using alpha<br />

factor.<br />

Our approach to this problem has been to apply the KS test to find associations between<br />

sequence features and quantitative measurements <strong>of</strong> gene expression from microarrays.<br />

The KS-statistic explicitly compares the cumulative distributions <strong>of</strong> the two samples, and<br />

the distribution <strong>of</strong> this statistic can be approximated numerically as a function <strong>of</strong> the<br />

sample sizes (Press et al. 1992). Because it operates on the distributions <strong>of</strong> the data<br />

directly, there is no assumption about the form <strong>of</strong> the distribution and it can identify<br />

subtle relationships between distributions even when the means are not very<br />

different or when the sample size is small. * In general, it produces results similar to those <strong>of</strong> other<br />

statistical tests (Figure 2), with the main advantage being the relaxation <strong>of</strong> the<br />

need for a large number <strong>of</strong> genes containing a particular sequence feature.<br />

* In passing it is also important to note that I have found a problem with the KS test in the<br />

case where there are many ties and the number <strong>of</strong> points in two distributions differs. To<br />

our knowledge this problem has not been reported, and whether this is an implementation<br />

difficulty, or a fundamental problem with the test has not been explored.<br />
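A minimal sketch of the two-sample KS comparison described above, written in plain Python. The p-value uses the asymptotic series with an effective sample size, in the form I recall from Press et al. (1992); the function names and the example expression values are assumptions for illustration, and tied values are stepped past together, which is one possible convention rather than a fix for the tie problem noted in the footnote.

```python
import math

def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Consume every value equal to x in both samples before comparing,
        # so ties move both cumulative distributions at once.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_pvalue(d, n_a, n_b, terms=100):
    """Approximate p-value for a KS statistic d given the two sample sizes."""
    n_eff = n_a * n_b / (n_a + n_b)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:
        # The series converges slowly here and the p-value is essentially 1.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

# Hypothetical log expression ratios: genes with a word vs. all other genes.
with_word = [1.2, 0.9, 1.5, 0.8, 1.1]
without_word = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 0.4, -0.3]
d = ks_statistic(with_word, without_word)
p = ks_pvalue(d, len(with_word), len(without_word))
```

Because only the cumulative distributions are compared, `d` is the same whether the genes with the word are shifted up or down, which is the loss of direction noted for Figure 2.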



Figure 2.<br />

P-values from various statistical tests in the cell cycle dataset<br />

In these data the KS-test performs similarly to the parametric tests. Because the KS-test looks only at the<br />

cumulative distribution, we lose the information about the direction <strong>of</strong> the change; in this case whether the<br />

genes with the binding site upstream were relatively induced or repressed.<br />

This method effectively identifies both words and pairs <strong>of</strong> words that are associated with<br />

expression data. These words can then be clustered and motifs can be inferred (Chiang et<br />

al. 2003).<br />

Thus far, we have considered binding sites for transcription factors to be simple<br />

consensus sequences. However, transcription factors <strong>of</strong>ten bind sequence families that<br />

are better represented <strong>by</strong> probability distributions than <strong>by</strong> simple consensus sequences.<br />

We have therefore developed an approach to find matrices that maximize the significance <strong>of</strong><br />

KS-statistics on expression data. Using this type <strong>of</strong> training procedure we were able to<br />

find matrices with much improved predictive power, although even the optimized<br />

matrices still predicted many genes that were not actually regulated. Figure 3 shows an<br />

example <strong>of</strong> the results <strong>of</strong> such an approach applied to perturbations <strong>of</strong> the phosphate<br />

regulatory system in S. cerevisiae (Ogawa et al. 2000).<br />

[Figure 3 plot: histogram with y-axis "counts" (0 to 30) and x-axis "Relative expression". Series: entire genome in a pho4p constitutively active allele mutant vs. wt; genes containing pho4p binding site CACGTG; genes above cut<strong>of</strong>f on KS-optimized 10 bp matrix.]<br />

Figure 3.<br />

Optimizing a matrix based on the KS-statistic<br />

Starting with an initial w-mer, I performed a greedy optimization procedure where at each step I added a w-<br />

mer from one <strong>of</strong> the regulated sequences to the ‘matrix’ until no more improvement <strong>of</strong> the KS-statistic was<br />

possible. Even though the optimized matrix predicts many fewer <strong>of</strong> the genes that do not actually change in<br />

expression, these ‘false positives’ still outnumber the genes that are actually upregulated. This means that<br />

although we have found a sequence element that is strongly associated with the change in expression, the<br />

predictive power <strong>of</strong> the sequence motif is extremely poor.<br />
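The greedy procedure in this caption can be sketched in a much simplified form. The assumptions are substantial and are mine, not the method as implemented: the 'matrix' is represented as a plain set of w-mers, a gene counts as predicted when its promoter contains any word in the set (no scoring cutoff), and all names and data are hypothetical.

```python
def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def word_set_score(promoters, expression, words):
    """KS statistic between genes that contain any word and all other genes."""
    hit, miss = [], []
    for gene, seq in promoters.items():
        bucket = hit if any(w in seq for w in words) else miss
        bucket.append(expression[gene])
    if not hit or not miss:
        return 0.0
    return ks_statistic(hit, miss)

def greedy_word_set(promoters, expression, seed, regulated, width):
    """Add w-mers from regulated promoters while the KS statistic improves."""
    words = {seed}
    best = word_set_score(promoters, expression, words)
    candidates = {promoters[g][i:i + width]
                  for g in regulated
                  for i in range(len(promoters[g]) - width + 1)}
    while True:
        scored = [(word_set_score(promoters, expression, words | {c}), c)
                  for c in sorted(candidates - words)]
        if not scored:
            break
        top_score, top_word = max(scored)
        if top_score <= best:  # stop when no candidate improves the statistic
            break
        words.add(top_word)
        best = top_score
    return words, best

# Hypothetical promoters and expression ratios; g5 is an unregulated gene
# that nevertheless contains the seed word (a 'false positive').
promoters = {"g1": "AACACGTGAA", "g2": "TTCACGTTAA", "g3": "GGGGGGGGGG",
             "g4": "CCCCCCCCCC", "g5": "AACACGTGCC"}
expression = {"g1": 3.0, "g2": 2.5, "g3": 0.1, "g4": -0.2, "g5": 0.0}
words, best = greedy_word_set(promoters, expression, "CACGTG", ["g1", "g2"], 6)
```

In this toy case the statistic improves from 0.5 to 2/3 after one word is added, yet the unregulated gene `g5` is still predicted, mirroring the point that a strong association does not imply good predictive power.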

In addition to these approaches that associate sequence features with quantitative<br />

measurements <strong>of</strong> gene expression, I have developed approaches to find sequences<br />

associated with multivariate discrete expression or other functional data <strong>by</strong> treating it one<br />

variable at a time and performing hypergeometric tests, and then clustering the variables<br />

and sequences. We have found this to be an effective strategy when the categories have<br />

sufficient numbers <strong>of</strong> genes. I have applied this type <strong>of</strong> method to microarray data that<br />

has been discretized into expression clusters, MIPS or GO categories, but in principle this<br />



type <strong>of</strong> approach can be applied to any other functional categorization <strong>of</strong> genes.<br />
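For a single word and a single gene category, the hypergeometric test mentioned above reduces to a tail sum. The sketch below is illustrative only; the function and argument names are assumptions, and no multiple-testing correction is shown.

```python
from math import comb

def hypergeom_enrichment(n_genome, n_with_word, n_category, n_overlap):
    """P(X >= n_overlap) when n_category genes are drawn without replacement
    from n_genome genes, n_with_word of which contain the word."""
    upper = min(n_with_word, n_category)
    denom = comb(n_genome, n_category)
    tail = sum(comb(n_with_word, k) * comb(n_genome - n_with_word, n_category - k)
               for k in range(n_overlap, upper + 1))
    return tail / denom

# Hypothetical counts: 6000 genes, 300 contain the word, the category has
# 40 genes, and 12 of those contain the word.
p_value = hypergeom_enrichment(6000, 300, 40, 12)
```

Repeating this for every word and every category gives the matrix of p-values that can then be log-transformed and clustered.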

Applying this type <strong>of</strong> approach to yeast upstream regions yields several<br />

interesting observations (Figure 4). First, because this is an enumerative approach and I<br />

am using many categories <strong>of</strong> genes, I can expect to find many significant motifs – thus<br />

producing a systematic, global view <strong>of</strong> the yeast transcription network. Furthermore,<br />

because I am displaying all the data, it is easy to identify cases where multiple motifs are<br />

enriched in a single gene group, or when similar motifs are enriched in separate, though<br />

perhaps related gene groups. For example, consider the transcription factors Bas1p and<br />

Gcn4p. Both <strong>of</strong> these factors are known to bind sequences with the consensus TGACTC<br />

(Springer et al. 1996). While it is clear that this consensus is enriched in the groups <strong>of</strong><br />

genes that are targeted <strong>by</strong> Bas1p and Gcn4p, different variants <strong>of</strong> the flanking bases are<br />

statistically enriched in different expression data clusters. This suggests that there may<br />

be some flanking specificity in these proteins that contributes to their ability to regulate<br />

different genes in vivo. Another example <strong>of</strong> this may be the case <strong>of</strong> Pho4p and Cbf1p;<br />

while both bind the consensus CACGTG, a typical palindromic E-box motif, Cbf1p<br />

seems to prefer a flanking T or A, which is reflected in the significance <strong>of</strong> these w-mers in<br />

its target genes (methionine and sulfur biosynthesis genes). This example, however, also<br />

illustrates how our global, enumerative approach can identify multiple binding sites in a<br />

single group <strong>of</strong> regulated genes; these genes also contain binding sites for Met31p, a zinc<br />

finger transcription factor that can work with Cbf1p in regulating these genes (Blaiseau<br />

and Thomas 1998). Other examples <strong>of</strong> multiple motifs in a single expression group<br />

include Ste12p and Alpha1 in mating genes, TTTAAA and GATGAG in non-ribosomal<br />

repressed stress genes, Hap1p and Upc2p in respiration genes, and Rap1p and<br />



CCGTACA (the latter <strong>of</strong> which may represent a Rap1 site with altered specificity) in<br />

ribosomal proteins. In many cases these are consistent with previous observations (see<br />

Chiang et al. 2003).<br />

Figure 4.<br />

Clustering <strong>of</strong> 7-mers significantly associated with gene categories<br />

Each row in this figure represents a single 7-mer, and each column represents a (possibly overlapping) set <strong>of</strong><br />

genes. Gene categories are either expression data clusters (<strong>Eisen</strong> et al. 1998, Hughes et al. 2001, Gasch et<br />

al. 2003, Cyert et al. 2003) or MIPS categories (Mewes et al. 2000). Statistical significance was calculated<br />

using the coverage statistic described above for each category, and the resulting p-values were log-transformed<br />

and clustered. Displayed is a heat map, where brighter green intensity corresponds to greater<br />

statistical significance.<br />

Cluster labels shown beside the heat map:<br />

TTTAAA, GATGAG repressed in stress<br />

ACGCG, MCB, cell cycle<br />

GTGGCAA, Rpn4p, proteasome<br />

TGAAAC, CATGT, ste12p, alpha2p, mating<br />

CACGTG, Pho4p, phosphate metabolism<br />

TGACTCC, Bas1p, adenine biosynthesis<br />

TGACTCA, Gcn4p, amino acid biosynthesis<br />

CAGCAA, SCB, cell cycle<br />

CCGATA, TAAACG, Hap1p, Upc2p, ergos. biosyn.<br />

CCAAT, Hap2/3/4p, respiration<br />

GATAAG, GATA factors, nitrogen<br />

ACTGTG, TCACGTG, Met31p, Cbf1p met. biosyn.<br />

AGGC, Crz1p, calcium signaling<br />

CCCCT, STRE, induced in stress<br />

CACCC, Aft1p, metal transport<br />

ACCCA, CCGTACA, Rap1p, ribosome<br />

References<br />

Aerne BL, Johnson AL, Toyn JH, Johnston LH. Swi5 controls a novel wave <strong>of</strong> cyclin synthesis in late<br />

mitosis. Mol Biol Cell. 1998 Apr;9(4):945-56.<br />

Blaiseau PL, Thomas D: Multiple <strong>transcriptional</strong> activation complexes tether the yeast activator Met4 to<br />

DNA. EMBO J 1998, 17:6327-6336.<br />

Breeden L. Start-specific transcription in yeast. Curr Top Microbiol Immunol. 1996;208:95-127.<br />

Bussemaker HJ, Li H, Siggia ED. Regulatory element detection using correlation with expression. Nat<br />

Genet. 2001 Feb;27(2):167-71.<br />

Chiang DY, Brown PO, <strong>Eisen</strong> MB. Visualizing associations between genome sequences and gene<br />

expression data using genome-mean expression pr<strong>of</strong>iles. Bioinformatics. 2001;17 Suppl 1:S49-55.<br />

Keles S, van der Laan M, <strong>Eisen</strong> MB. Identification <strong>of</strong> regulatory elements using a feature selection method.<br />

Bioinformatics. 2002 Sep;18(9):1167-75.<br />

Merrill GF, Morgan BA, Lowndes NF, Johnston LH. DNA synthesis control in yeast: an <strong>evolution</strong>arily<br />

conserved mechanism for regulating DNA synthesis genes? Bioessays. 1992 Dec;14(12):823-30.<br />

Mewes, H. W., D. Frishman, C. Gruber, B. Geier, D. Haase, A. Kaps, K. Lemcke, G. Mannhaupt, F.<br />

Pfeiffer, C. Schüller, S. Stocker, and B. Weil. MIPS: a database for genomes and protein sequences. Nucleic<br />

Acids Res. 28: 37-40, 2000.<br />

Ogawa N, DeRisi J, Brown PO. New components <strong>of</strong> a system for phosphate accumulation and<br />

polyphosphate metabolism in Saccharomyces cerevisiae revealed <strong>by</strong> genomic expression analysis. Mol<br />

Biol Cell. 2000 Dec;11(12):4309-21<br />

Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical Recipes in C 2nd edition. Cambridge:<br />

Cambridge University Press; 1992.<br />

Springer C, Kunzler M, Balmelli T, Braus GH. Amino acid and adenine cross-pathway <strong>regulation</strong> act<br />

through the same 5'-TGACTC-3' motif in the yeast HIS7 promoter. J Biol Chem. 1996 Nov<br />

22;271(47):29637-43.<br />

Zhu G, Spellman PT, Volpe T, Brown PO, Botstein D, Davis TN, Futcher B. Two yeast forkhead genes<br />

regulate the cell cycle and pseudohyphal growth. Nature. 2000 Jul 6;406(6791):90-4.<br />



6. Improved models <strong>of</strong> regulatory regions<br />

Thus far we have thought <strong>of</strong> the sequence features <strong>of</strong> interest as simple sequences. It is<br />

possible to take into account other features <strong>of</strong> regulatory regions such as the position <strong>of</strong><br />

binding sites in promoters, the total number <strong>of</strong> sites, or the spacing between particular<br />

groups <strong>of</strong> sites that are thought to work together.<br />

Positional biases in transcription factor binding sites<br />

I studied the positional distributions <strong>of</strong> consensus sequences in S. cerevisiae promoters.<br />

Many known transcription factor binding sites show skewed distributions in the<br />

promoters, and genes with binding sites in the favoured regions are more likely to show<br />

expression patterns that are consistent with <strong>regulation</strong> <strong>by</strong> that factor. Figure 1 gives an<br />

example <strong>of</strong> such a positional bias for the MCB element.<br />

[Figure 1 plot: scatter plot for mbp1 (ACGCGT). x-axis: position <strong>of</strong> site (relative to ATG), from -800 to 0; y-axis: correlation, from -1 to 1.]<br />


Figure 1.<br />

Positional biases in binding sites associated with regulated genes<br />

Genes with Mbp1p binding sites in a particular region <strong>of</strong> the promoter are much more likely to show a<br />

strong correlation with the mean expression pattern for this motif. Each point in this figure represents a<br />

single instance <strong>of</strong> the MCB element, and the x coordinate is the position in the promoter relative to the<br />

ATG for the adjacent gene.<br />

Incorporating the positional distribution in the promoter could improve predictive power<br />

and increase the sensitivity <strong>of</strong> motif finding. The significance (<strong>of</strong> a coverage statistic) <strong>of</strong><br />

binding motifs in their target genes is much stronger when the search is restricted to the<br />

appropriate region <strong>of</strong> the promoter. Once again, it is possible to imagine optimization<br />

procedures that search for motifs that are associated with gene expression patterns and<br />

tend to be found in particular regions <strong>of</strong> the promoters.<br />
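Restricting a word search to a region of the promoter, as described above, can be sketched as follows. The coordinate convention (the last upstream base is position -1 relative to the ATG), the function name, and the example sequence are assumptions made for this illustration.

```python
def sites_in_window(upstream, word, window):
    """Positions of word, relative to the ATG, that fall inside window.

    upstream is the 5'-to-3' sequence ending just before the ATG, so its
    last base is at position -1. window is an inclusive (low, high) pair
    of negative coordinates.
    """
    low, high = window
    hits = []
    start = upstream.find(word)
    while start != -1:
        relative = start - len(upstream)
        if low <= relative <= high:
            hits.append(relative)
        start = upstream.find(word, start + 1)
    return hits

# Hypothetical 156 bp upstream region with two MCB-like words.
upstream = "ACGCGT" + "A" * 94 + "ACGCGT" + "A" * 50
proximal = sites_in_window(upstream, "ACGCGT", (-100, -1))
anywhere = sites_in_window(upstream, "ACGCGT", (-800, -1))
```

A gene would then be counted as containing the motif only if `proximal` is non-empty, and the enrichment statistic is recomputed on that restricted set.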

A main drawback <strong>of</strong> this type <strong>of</strong> approach is that the positions <strong>of</strong> motifs in these<br />

cases are all taken relative to the translation start (ATG) because yeast promoter<br />

prediction is very difficult. Another problem is that these methods may be <strong>of</strong> little use<br />

outside <strong>of</strong> yeast: while in yeast the <strong>transcriptional</strong> regulatory information is largely<br />

confined to the proximal promoters, in organisms with large genomes, cis-regulatory<br />

information can be encoded at distances <strong>of</strong> many thousands <strong>of</strong> basepairs from the<br />

translation start. In these cases rules about the spacing relative to the translation start will<br />

be <strong>of</strong> little use.<br />

Modeling combinatorial control<br />

A perhaps more general feature <strong>of</strong> eukaryotic <strong>transcriptional</strong> <strong>regulation</strong> is that it is<br />

combinatorial, such that the action <strong>of</strong> several transcription factors is necessary to specify<br />



an expression pattern. Because these factors may physically interact, or cooperatively<br />

recruit additional factors, their binding sites may be positioned non-randomly in the<br />

promoter. This 'promoter architecture' can be taken into account <strong>by</strong> searching for closely<br />

spaced binding sites. Several attempts have been made to model this feature <strong>of</strong> regulatory<br />

sequences. Of course, this adds additional parameters into the model, and makes<br />

probabilistic approaches more difficult. Recent studies have approached this problem<br />

either <strong>by</strong> separating the 'motif-discovery' and the inference <strong>of</strong> spacing and combinations<br />

<strong>of</strong> factors into different steps, or <strong>by</strong> searching for 'homotypic' clusters <strong>of</strong> binding sites,<br />

such that there are multiple sites from the same motif.<br />

We approached this problem <strong>by</strong> searching for pairs <strong>of</strong> closely spaced sequence<br />

features and using the KS-test to score the improvement in the statistical association with<br />

gene expression (Chiang et al. 2003). Overall, there is much work to be done in<br />

developing methods that capture the complexity <strong>of</strong> regulatory region organization.<br />
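The first step of the word-pair approach, finding promoters in which two words occur close together, can be sketched as below. The names, the 50 bp default for `max_gap`, and the example data are assumptions; the two resulting lists of expression values would then be compared with a KS test.

```python
def word_positions(seq, word):
    """Start positions of every (possibly overlapping) occurrence of word."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions

def has_close_pair(seq, word_a, word_b, max_gap):
    """True if an occurrence of word_a starts within max_gap bp of word_b."""
    pos_b = word_positions(seq, word_b)
    return any(abs(a - b) <= max_gap
               for a in word_positions(seq, word_a)
               for b in pos_b
               if word_a != word_b or a != b)  # ignore a site paired with itself

def split_by_pair(promoters, expression, word_a, word_b, max_gap=50):
    """Partition expression values by whether the promoter has the pair."""
    with_pair, without_pair = [], []
    for gene, seq in promoters.items():
        close = has_close_pair(seq, word_a, word_b, max_gap)
        (with_pair if close else without_pair).append(expression[gene])
    return with_pair, without_pair

# Hypothetical promoters: g2 has both words, but more than 50 bp apart.
promoters = {"g1": "AAACACGTGAAATGACTCAAA",
             "g2": "CACGTG" + "A" * 100 + "TGACTC",
             "g3": "AAAA"}
expression = {"g1": 2.0, "g2": 0.1, "g3": -0.3}
with_pair, without_pair = split_by_pair(promoters, expression, "CACGTG", "TGACTC")
```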

When binding specificity is known, this problem <strong>of</strong> modeling promoters is less<br />

difficult and some progress has been made. Approaches that attempt to classify genes<br />

based on the presence <strong>of</strong> multiple upstream motifs (Wasserman and Fickett 1998) and<br />

more recently, searches for 'clusters' <strong>of</strong> predefined sets <strong>of</strong> motifs (Markstein and Levine<br />

2002) have had some success. We have suggested a new approach to this problem that<br />

overcomes the need to predefine p-value cut<strong>of</strong>fs for the motifs, the length <strong>of</strong> the cluster,<br />

or a minimum density <strong>of</strong> binding sites in a cluster. Instead, we proposed to look for the<br />

groups <strong>of</strong> matches to a matrix that are the most statistically unlikely, taking into account<br />

both the number <strong>of</strong> hits, their similarity to the matrix, and the number <strong>of</strong> basepairs that<br />

we looked through. This should greatly reduce the parameterization <strong>of</strong> these methods and<br />



therefore allow clusters <strong>of</strong> binding sites to be discovered when there is a much smaller set<br />

<strong>of</strong> characterized enhancers. In general there has been much less focus on the problem <strong>of</strong><br />

discovering binding sites when the specificity <strong>of</strong> the DNA binding protein is already<br />

known. In fact, this is a very difficult pattern recognition problem because there is very<br />

little information in a motif alone. We have proposed a simple information theoretic<br />

approach to include the spacing between factors and their orientation.<br />

These methods, while useful in terms <strong>of</strong> their ability to predict new targets for<br />

groups <strong>of</strong> transcription factors, still fail to capture the complexity <strong>of</strong> the organization <strong>of</strong><br />

even relatively simple eukaryotic regulatory regions (Figure 2). Thus, despite their<br />

success, they leave us unsure about why they are succeeding, and yield surprisingly little<br />

insight into the underlying logic <strong>of</strong> the cis-regulatory code.<br />

------------1----1---------1------  STE14  mating<br />
-------------1-1-11--1--1-------------  STE12  mating and pseudohyphal growth<br />
------------1-----1---------1---1---  TEC1  pseudohyphal growth<br />
---1----1------------1-------------  FAR1  cell cycle<br />
-------------1-------1------------  STE4  signaling, pheromone pathway<br />
-----------------1---------------  GPA1  signaling, pheromone pathway<br />
-------------------1----1--1-----0--  AGA1  mating<br />
-------------------------1-11------  FUS1  mating; cell fusion<br />
---------------0-------1-1---------  SST2  mating<br />
--------------------------------  unknown  unknown<br />
---1--------------1---1----0--------  MFA1  mating<br />
-----------------1----0-0---1-------  AGA2  mating<br />
------------------1-----0-0-1--1-----  MFA2  mating<br />
----------1---------1---10-0--1-------  BAR1  mating<br />
-------------------------00-1-1-----  STE6  mating<br />
----------------1----1----0-01-------  STE2  mating<br />
------1----------------1-----1-----  FUS3  mating (cell cycle arrest)<br />
-------1--------------1--1---------  CDC20  mitosis<br />
Positions run from -800 (left) to -1 (right).<br />

Figure 2.<br />

Upstream regions <strong>of</strong> the alpha cell type and other pheromone responsive genes<br />

In this schematic representation <strong>of</strong> the upstream regions <strong>of</strong> these genes, a 1 corresponds to a binding site for<br />

ste12p and 0 corresponds to a halfsite for mcm1p/alpha2p complex. While these genes show similar<br />

expression patterns in response to pheromone (they all fall into one ‘cluster’ (<strong>Eisen</strong> et al. 1998)), it is clear<br />

based on the organization <strong>of</strong> binding sites in their promoters that they are regulated differently. This<br />

suggests that models incorporating spacing, number and order <strong>of</strong> binding sites will be necessary to accurately<br />

model the sequences that encode <strong>transcriptional</strong> <strong>regulation</strong>.<br />

References<br />

Chiang DY, Moses AM, Kellis M, Lander ES, <strong>Eisen</strong> MB. Phylogenetically and spatially conserved word<br />

pairs associated with gene-expression changes in yeasts. Genome Biol. 2003;4(7):R43. Epub 2003 Jun 26.<br />

Markstein M, Levine M. Decoding cis-regulatory DNAs in the Drosophila genome.<br />

Curr Opin Genet Dev. 2002 Oct;12(5):601-6.<br />

Wasserman WW, Fickett JW. Identification <strong>of</strong> regulatory regions which confer muscle-specific gene<br />

expression. J Mol Biol. 1998 Apr 24;278(1):167-81.<br />



7. Future challenges<br />

In addition to many currently unsolved problems in the field, with the availability <strong>of</strong><br />

complete genome sequences and genome wide functional data, motif finding and pattern<br />

matching to identify transcription factor specificity faces many new challenges. I will try<br />

to highlight those that seem the most important.<br />

More realistic cost functions for probabilistic methods<br />

Probabilistic methods are the most popular de novo motif finding methods. Current<br />

methods, however, nearly all optimize similar likelihood functions. While there has been<br />

some progress in developing methods that include some prior knowledge about the<br />

patterns <strong>of</strong> information in the motif (Sandelin and Wasserman 2004, Xing and Karp<br />

2004, Kechris et al. 2004), and incorporating spatial clustering <strong>of</strong> motifs (Zhou and Wong<br />

2004), most methods still rely on a simple set <strong>of</strong> models for the distribution <strong>of</strong> motifs in<br />

the regulated sequence set. These are usually models <strong>of</strong> the form ‘one occurrence per<br />

sequence’, ‘zero or one occurrence per sequence’ or ‘don’t use information about which<br />

sequence they come from’. While these cost functions have been somewhat successful,<br />

and are implemented in most popular motif-finders, in most applications, ‘one or more<br />

occurrences per sequence’ or ‘zero or one or more occurrences per sequence’ models<br />

would actually be preferable, particularly for application to transcription factor binding<br />

sites that occur multiple times in a promoter. Further, while some progress has been<br />

made (Narasimhan et al. 2003), most probabilistic motif-finders do not actually find<br />

optimal motifs relative to a real background set; instead they rely on some background<br />



Markov model. There are many situations where experimental data produce a set <strong>of</strong><br />

functional and non-functional sequences, and it is <strong>of</strong> interest to find motifs that are<br />

associated with function. As discussed above, there are many ways to think about the<br />

statistical enrichment <strong>of</strong> a motif in a group <strong>of</strong> genes, and each <strong>of</strong> these implies a different<br />

likelihood function to optimize.<br />

Overfitting <strong>of</strong> motif models<br />

There has been little work on overfitting, although it is clearly a rampant problem for<br />

current motif finding methods. One <strong>of</strong> the main difficulties here is that it is always<br />

possible to find a longer, more variable motif that is more specific to a subset. However,<br />

given the number <strong>of</strong> sequences used to make up a 'motif' it is possible to calculate the<br />

probability <strong>of</strong> observing a certain number <strong>of</strong> bases more than once, and thus assess<br />

whether a discovered pattern represents a real binding site or the result <strong>of</strong> a greedy<br />

learning algorithm fitting the noise. It is easy to illustrate this in the discrete case. In the<br />

case <strong>of</strong> two sequences, if ambiguity characters are allowed, one can construct a motif that<br />

will match the entirety <strong>of</strong> the two sequences, and it will contain roughly ¾ two-way<br />

ambiguity characters, and ¼ exact characters. Similar calculations or simulations could<br />

be done as a function <strong>of</strong> motif size and dataset size.<br />
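The two-sequence case can be checked with a short simulation. Assuming independent, uniformly distributed bases, two sequences agree at a position with probability 4 × (1/4)² = 1/4, so about ¼ of the positions in the all-matching motif are exact and about ¾ need a two-way ambiguity character. The function names and the bracket notation for ambiguity are assumptions for this sketch.

```python
import random

def pair_motif(seq_a, seq_b):
    """Motif matching both sequences: the shared base where they agree,
    a bracketed two-way ambiguity such as [CG] where they differ."""
    motif = []
    for a, b in zip(seq_a, seq_b):
        motif.append(a if a == b else "[" + "".join(sorted(a + b)) + "]")
    return motif

def exact_fraction(length, seed=0):
    """Fraction of exact (non-ambiguous) positions for two random sequences."""
    rng = random.Random(seed)
    seq_a = [rng.choice("ACGT") for _ in range(length)]
    seq_b = [rng.choice("ACGT") for _ in range(length)]
    motif = pair_motif(seq_a, seq_b)
    return sum(len(symbol) == 1 for symbol in motif) / length

example = pair_motif("ACGT", "ACCA")
fraction = exact_fraction(100000)
```

The same simulation could be extended to more sequences and wider ambiguity classes to estimate how often a motif of a given width arises by chance.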

In addition, another approach to avoiding overfitting may come from viewing<br />

motif-finding as a model selection problem, where we hope to show that the model which<br />

includes the discovered motif fits the data significantly better than a model without the<br />

motif. There are techniques that are widely used in molecular <strong>evolution</strong> and other fields<br />



to test whether the fit <strong>of</strong> a model is significantly better than the fit <strong>of</strong> another model,<br />

usually <strong>by</strong> adding a penalty for the addition <strong>of</strong> extra parameters. This type <strong>of</strong> nested<br />

model test may be applicable to the probabilistic learning methods employed in motif<br />

finding and can be used to decide whether the found motifs might have been discovered<br />

<strong>by</strong> chance. Unfortunately, model selection problems do not lend themselves naturally to<br />

hypothesis testing, particularly when the number <strong>of</strong> models is large and none <strong>of</strong> them is a<br />

particularly good representation <strong>of</strong> the data. In these cases Bayesian approaches may be<br />

more appropriate.<br />
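As an illustration of penalizing extra parameters, the sketch below compares a background-only model with a model that adds a motif, using the standard AIC and BIC definitions and the likelihood-ratio statistic for nested models. The log-likelihoods, parameter counts, and number of observations are hypothetical numbers chosen for the example, not results.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_lik

def lr_statistic(log_lik_null, log_lik_alt):
    """Likelihood-ratio statistic for a null model nested in the alternative."""
    return 2 * (log_lik_alt - log_lik_null)

# Hypothetical fits: the background model has 3 free base frequencies; the
# motif model adds 3 free parameters for each of 6 motif positions.
background_fit = {"log_lik": -1390.0, "n_params": 3}
motif_fit = {"log_lik": -1375.0, "n_params": 3 + 3 * 6}
n_obs = 1000

statistic = lr_statistic(background_fit["log_lik"], motif_fit["log_lik"])
aic_background = aic(**background_fit)
aic_motif = aic(**motif_fit)
```

Here the motif model improves the log-likelihood by 15 units, yet both criteria still favor the background model once the 18 extra parameters are penalized, which is the kind of check that could flag an overfitted motif.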

Motifs with overlapping specificities<br />

A major challenge, to which there is no obvious solution within current probabilistic motif<br />

finding frameworks, is that these approaches treat features as unique. Because<br />

transcription factors are members <strong>of</strong> large protein families there are <strong>of</strong>ten several proteins<br />

with very similar binding specificities. This leads to a violation <strong>of</strong> the assumptions <strong>of</strong><br />

almost all these models, and poses a very difficult theoretical problem. Given a family <strong>of</strong><br />

sequences that are all somewhat related, how does one decide that they are the binding<br />

sites for one DNA-binding protein, or two separate proteins with similar specificity? It<br />

seems that much more empirical data about the patterns <strong>of</strong> degeneracy that are observed<br />

in proteins with related DNA binding domains will be necessary before this can be<br />

adequately addressed.<br />

Truly scalable methods<br />



Finally, although some progress has been made with discrete methods (Kellis et al. 2003,<br />

Xie et al. 2005), there are really no probabilistic methods that can practically utilize<br />

genomic data to find motifs de novo in large complex genomes. As genomic functional<br />

data becomes increasingly available, discrete methods that can be applied to large<br />

datasets have already proven useful (e.g., as described above) and these methods should<br />

be extended to the probabilistic case.<br />

References<br />

Kechris KJ, van Zwet E, Bickel PJ, <strong>Eisen</strong> MB. Detecting DNA regulatory motifs <strong>by</strong> incorporating<br />

positional trends in information content. Genome Biol. 2004;5(7):R50. Epub 2004 Jun 24.<br />

Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison <strong>of</strong> yeast species to<br />

identify genes and regulatory elements. Nature. 2003 May 15;423(6937):241-54.<br />

Narasimhan C, LoCascio P, Uberbacher E. Background rareness-based iterative multiple sequence<br />

alignment algorithm for regulatory element detection. Bioinformatics. 2003 Oct 12;19(15):1952-63.<br />

Sandelin A, Wasserman WW. Constrained binding site diversity within families <strong>of</strong> transcription factors<br />

enhances pattern discovery bioinformatics. J Mol Biol. 2004 Apr 23;338(2):207-15.<br />

Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M. Systematic<br />

discovery <strong>of</strong> regulatory motifs in human promoters and 3' UTRs <strong>by</strong> comparison <strong>of</strong> several mammals.<br />

Nature. 2005 Mar 17;434(7031):338-45. Epub 2005 Feb 27.<br />

Xing EP, Karp RM. MotifPrototyper: a Bayesian pr<strong>of</strong>ile model for motif families. Proc Natl Acad Sci U S<br />

A. 2004 Jul 20;101(29):10523-8. Epub 2004 Jul 13.<br />

Zhou Q, Wong WH. CisModule: de novo discovery <strong>of</strong> cis-regulatory modules <strong>by</strong> hierarchical mixture<br />

modeling. Proc Natl Acad Sci U S A. 2004 Aug 17;101(33):12114-9. Epub 2004 Aug 5<br />



Part IV<br />

Evolution <strong>of</strong> <strong>transcriptional</strong> <strong>regulation</strong><br />



13. Evidence for <strong>evolution</strong> <strong>of</strong> transcription networks on a genome-wide scale<br />

Thus far, we have considered the problem <strong>of</strong> detecting motifs and identifying binding<br />

sites in single genomes, and we have attempted to include <strong>evolution</strong>ary models <strong>of</strong> these<br />

binding sites. In all these cases we have been assuming that gene <strong>regulation</strong> is conserved<br />

over evolution. Of course, we know this not to be the case: evolution of gene<br />

expression is a major source of the underlying molecular diversity that leads to<br />

organismal diversity, and this is one of the motivations for studying gene regulation in the<br />

first place.<br />

Therefore we now begin to address the <strong>evolution</strong> <strong>of</strong> the mechanisms controlling<br />

gene expression in earnest, and focus less on the development <strong>of</strong> methods for annotation<br />

and prediction. As a first step in this direction, we have examined the <strong>evolution</strong> <strong>of</strong> gene<br />

<strong>regulation</strong> over longer <strong>evolution</strong>ary timescales than have been previously discussed.<br />

Considering more divergent species affords two major advantages. The first is practical:<br />

these species are sufficiently diverged at the level of non-coding DNA that no similarity<br />

between regulatory regions remains due to descent, so the regions can be treated<br />

independently. This is useful because we can simply apply the methods that we have<br />

developed for analysis of single genomes to each of these related genomes<br />

independently and still gain evolutionary insight. Second, and perhaps more important,<br />

at these longer evolutionary timescales we are much more likely to see dramatic<br />

changes in the mechanisms of regulation. This work was published as Gasch et al. 2004.<br />

Conservation and <strong>evolution</strong> <strong>of</strong> cis-regulatory systems in ascomycete fungi<br />

Much <strong>of</strong> a gene’s expression pattern is dictated <strong>by</strong> flanking noncoding sequences that<br />

contain, among other things, binding sites recognized <strong>by</strong> sequence-specific nucleotide<br />

binding proteins that modulate transcript abundance. A number <strong>of</strong> recent studies have<br />

examined the <strong>evolution</strong> <strong>of</strong> cis-regulatory elements in alignments <strong>of</strong> orthologous<br />

regulatory regions, consistently showing that these elements evolve at a slower rate than<br />

the nonfunctional DNA that surrounds them (Hardison et al. 1997; Loots et al. 2000;<br />

McGuire et al. 2000; Bergman and Kreitman 2001; Dermitzakis and Clark 2002;<br />

Rajewsky et al. 2002; Moses et al. 2003). Most <strong>of</strong> these studies have been limited to<br />

closely related species whose orthologous noncoding sequences can be aligned, such that<br />

the putative cis-regulatory elements can be identified and compared. Cis-regulatory<br />

elements can be conserved in more distantly related species, even when the orthologous<br />

regulatory regions are too divergent to be accurately aligned (Piano et al. 1999; Cliften et<br />

al. 2003; Romano and Wray 2003). However, without the guidance <strong>of</strong> multiple<br />

alignments, little has been gleaned about the patterns <strong>of</strong> <strong>evolution</strong> or the functional<br />

constraints that act on cis-regulatory elements over longer <strong>evolution</strong>ary timescales.<br />

Recently, several methods have been developed to dissect the regulatory networks<br />

that function within an individual species. Myriad studies have shown that functional<br />

regulatory sequences can be identified in a set <strong>of</strong> coregulated genes on the basis <strong>of</strong> the<br />

enriched fraction <strong>of</strong> those genes that contain the sequence within their flanking regions<br />

(van Helden et al. 1998; Tavazoie et al. 1999; McGuire et al. 2000; Bussemaker et al.<br />

2001; Sinha and Tompa 2002). Gene coregulation can be conserved in related species,<br />

and this conservation has been exploited for the computational prediction of cis-regulatory<br />

elements that are highly conserved (Gelfand et al. 2000; Qin et al. 2003; Wang and Stormo 2003;<br />

Pritsker et al. 2004; Yu et al. 2004). We reasoned that we could extend<br />

this approach to examine the <strong>evolution</strong> <strong>of</strong> cis-regulatory networks across species, <strong>by</strong><br />

analyzing the orthologs <strong>of</strong> genes coregulated in S. cerevisiae. As a first step toward this<br />

goal, we have examined the simplest model <strong>of</strong> regulatory networks: the connection<br />

between groups <strong>of</strong> coregulated genes and the flanking cis-regulatory sequences that<br />

coordinate their expression. We characterized groups <strong>of</strong> coregulated S. cerevisiae genes<br />

and their orthologs in 13 additional ascomycete fungi (Figure 1) and assessed the<br />

enriched fraction <strong>of</strong> those genes that contain known and novel cis-regulatory sequences.<br />

Our results strongly suggest that many <strong>of</strong> the known cis-regulatory systems from S.<br />

cerevisiae have been conserved over hundreds <strong>of</strong> millions <strong>of</strong> years <strong>of</strong> <strong>evolution</strong> (Berbee<br />

and Taylor 1993; Heckman et al. 2001). Based on these observations, we present a<br />

number <strong>of</strong> models for the mechanisms <strong>of</strong> cis-regulatory <strong>evolution</strong>.<br />

Figure 1.<br />

Fungal Phylogeny<br />

The phylogenetic tree shows the 14 different fungi analyzed in this study. The topology <strong>of</strong> the tree was<br />

based on Kurtzman and Robnett (2003), and the branch lengths represent the average <strong>of</strong> maximum-<br />

likelihood estimates <strong>of</strong> amino acid substitutions (obtained using the PAML package [Yang 1997]) for the<br />

303 proteins that had orthologs assigned in all 14 <strong>of</strong> these genomes. The closely related saccharomycete<br />

species for which the orthologous upstream regions can be aligned are labeled in orange. The source <strong>of</strong><br />

each genome sequence is also indicated to the right <strong>of</strong> each species.<br />

Results<br />

We began <strong>by</strong> systematically characterizing known cis-regulatory elements and their gene<br />

targets in the well-studied yeast S. cerevisiae. We compiled a catalog <strong>of</strong> known and<br />

predicted S. cerevisiae cis-regulatory elements (Dataset S1) in two ways. First, we<br />

retrieved 80 known consensus transcription factor-binding sites from the literature, based<br />

in part on information summarized on the Yeast Proteome Database (Costanzo et al.<br />

2001) and the Saccharomyces Genome Database (Weng et al. 2003). The majority <strong>of</strong><br />

these sequences have been experimentally defined. Six others were identified <strong>by</strong> virtue <strong>of</strong><br />

their conservation in the 3' untranslated regions of closely related Saccharomyces<br />

species (Kellis et al. 2003), and five downstream elements were computationally<br />

predicted from mRNA immunoprecipitation experiments (Gerber et al. 2004). In addition<br />

to these known consensus sequences, we used the program MEME (Bailey and Elkan<br />

1994) to identify 597 upstream sequence motifs common to groups <strong>of</strong> predicted<br />

coregulated genes (see below). Genes that contained one or more instances of each of<br />

these sequences in the 1,000-bp upstream or 500-bp downstream regions were identified<br />

as described in Materials and Methods.<br />
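
The scan for genes that contain a consensus sequence can be sketched as follows. This is a minimal illustration and not the procedure from Materials and Methods: it assumes that consensus sites are written in IUPAC nucleotide codes, that both strands are searched, and that overlapping matches count. The function names and the toy sequence are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA or IUPAC string."""
    return seq.translate(COMPLEMENT)[::-1]


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus into a regex that allows overlapping matches."""
    body = "".join(IUPAC[c] for c in consensus.upper())
    return re.compile("(?=(" + body + "))")


def find_sites(consensus, region):
    """Return sorted (offset, strand) pairs for every match in `region`."""
    consensus = consensus.upper()
    region = region.upper()
    hits = {(m.start(), "+") for m in consensus_to_regex(consensus).finditer(region)}
    rc = reverse_complement(consensus)
    if rc != consensus:  # palindromic sites are reported once, on the + strand
        hits |= {(m.start(), "-") for m in consensus_to_regex(rc).finditer(region)}
    return sorted(hits)


def genes_with_site(consensus, region_by_gene):
    """Return the set of gene names whose flanking region has at least one match."""
    return {g for g, seq in region_by_gene.items() if find_sites(consensus, seq)}
```

For example, `find_sites("GGTGGCAAAW", "TTTGGTGGCAAATCC")` returns `[(3, '+')]`, because W matches the T that follows GGTGGCAAA in this toy region.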

We next identified and manually annotated 264 partially redundant groups <strong>of</strong><br />

genes that are predicted to be coregulated in S. cerevisiae, based on the genes’ similarity<br />

in expression, physical association with the same transcription factor, or functional<br />

relationships (Dataset S2; see Materials and Methods for details). For each gene group,<br />

we systematically scored the enrichment <strong>of</strong> genes that contained each <strong>of</strong> the putative<br />

regulatory elements identified above, compared to all genes in the S. cerevisiae genome<br />

that contained that flanking sequence. Of the 80 consensus sequences, 41 were identified<br />

as significant <strong>by</strong> this criterion. Of these significant sequences, 34 were identified in the<br />

gene group known to be regulated <strong>by</strong> that element (Dataset S3), suggesting an upper limit<br />

<strong>of</strong> 17% false-positive identifications. Of the 597 MEME matrices we identified, only 43<br />

were significantly enriched in the gene group that they were identified in (see matrices in<br />

Dataset S4). All but four <strong>of</strong> these matrices were very similar to the consensus element<br />

known to regulate those genes (see Materials and Methods for details). Therefore, out <strong>of</strong><br />

19,239 motif-gene group comparisons, we recovered 34 consensus sequences and four<br />

additional MEME matrices representing known cis-regulatory elements (thus 38 <strong>of</strong> 80<br />

known elements) and four unannotated MEME matrices that may represent novel S.<br />

cerevisiae regulatory sequences, for a total <strong>of</strong> 42 S. cerevisiae cis-elements in 35 unique<br />

gene groups. Many <strong>of</strong> these S. cerevisiae regulatory elements were shown to be<br />

conserved in orthologous regulatory regions from four closely related saccharomycete<br />

species (Figure 1, orange species) (Cliften et al. 2003; Kellis et al. 2003). However, it<br />

was not known whether these elements are conserved in more distantly related species for<br />

which the intergenic regions cannot be aligned. To explore this possibility, we reasoned<br />

that many genes that are coregulated in S. cerevisiae should also be coregulated in other<br />

fungal species, and that functional cis-regulatory elements could be identified with the<br />

same methods applied to coregulated S. cerevisiae genes. Therefore, for each group <strong>of</strong><br />

coregulated S. cerevisiae genes, we identified orthologs in each <strong>of</strong> 13 other fungal<br />

genomes using the method <strong>of</strong> Wall et al. (2003). This method identifies reciprocal<br />

BLAST hits between two genomes that span more than 80% <strong>of</strong> the protein lengths,<br />

thereby providing a more conservative list of putative orthologs than a simple BLAST<br />

method. The complete set <strong>of</strong> orthologs is available in Datasets S5–S13.<br />
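
The idea of keeping only reciprocal hits that cover most of both proteins can be sketched as below. This is a simplified reciprocal-best-hit filter written for illustration, not the algorithm of Wall et al. (2003). The tuple layout, the use of a single score per hit, and the gene names in the usage note are assumptions.

```python
def best_hits(hits, min_coverage=0.8):
    """Map each query to its best-scoring subject among well-covered hits.

    `hits` is an iterable of (query, subject, score, query_cov, subject_cov)
    tuples, where the coverages are fractions of each protein's length.
    """
    best = {}
    for query, subject, score, q_cov, s_cov in hits:
        if q_cov <= min_coverage or s_cov <= min_coverage:
            continue  # the alignment must span more than 80% of both proteins
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, min_coverage=0.8):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b, min_coverage)
    backward = best_hits(b_vs_a, min_coverage)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)
```

With hypothetical hits in which `YFG1` and `orfA` are each other's best well-covered match, while `orfB` prefers a third gene, only `("YFG1", "orfA")` is returned.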

For each species-specific gene group, we scored the enrichment <strong>of</strong> genes that<br />

contain each <strong>of</strong> the 80 consensus sequences or examples <strong>of</strong> the MEME matrices<br />

discovered in the orthologous S. cerevisiae genes, as described above. This procedure<br />

was performed separately on each species, so that the identification <strong>of</strong> an enriched<br />

sequence in one species was independent <strong>of</strong> its identification in the other species.<br />

Therefore, when a given sequence was enriched in the orthologous gene groups from<br />

multiple genomes, we interpreted this to reflect the conservation <strong>of</strong> the cis-regulatory<br />

system represented <strong>by</strong> that element in the corresponding species. It is important to note<br />

that we have characterized this conservation at the level <strong>of</strong> regulatory networks, which<br />

does not necessarily imply that the individual elements upstream <strong>of</strong> each gene have been<br />

perfectly conserved (see Discussion).<br />
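
The per-species enrichment score can be illustrated with a one-sided hypergeometric test. The text names the hypergeometric distribution only for the positional analysis in Figure 4, so its use here is an assumption, as are the function names; the exact scoring is described in Materials and Methods.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the motif."""
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def enrichment_p(group, with_motif, genome):
    """Score one gene group in one species.

    `group` is the set of orthologs in the gene group, `with_motif` the set of
    genes whose flanking region contains the sequence, and `genome` the set of
    all genes considered in that species.
    """
    genome = set(genome)
    group = set(group) & genome
    with_motif = set(with_motif) & genome
    k = len(group & with_motif)
    return hypergeom_tail(k, len(group), len(with_motif), len(genome))
```

Because each species is scored from its own gene sets, a small p-value in one species does not depend on the result in any other. In a toy genome of 10 genes with 4 motif-containing genes, a group of 3 genes that includes 2 of them gives p = 40/120 = 1/3.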

Many S. cerevisiae Cis-Regulatory Systems Are Conserved in Other Fungi<br />

The patterns <strong>of</strong> cis-sequence enrichment in gene groups from each species strongly<br />

suggest that many <strong>of</strong> the genes coregulated in S. cerevisiae are also coregulated in the<br />

other fungal species. Furthermore, these patterns suggest that the expression <strong>of</strong> those<br />

genes is likely to be governed <strong>by</strong> the same cis-regulatory systems. Figure 2 shows the<br />

enrichment measured for each S. cerevisiae cis-regulatory element in the gene group it is<br />

proposed to regulate (represented <strong>by</strong> each row <strong>of</strong> the figure) in the 14 fungal species<br />

(shown in each column in the figure). (All p-values are available in Datasets S14–S46.)<br />

All <strong>of</strong> the 42 elements were identified in the same gene groups from at least three <strong>of</strong> the<br />

four closely related saccharomycete species. The majority <strong>of</strong> these elements were<br />

identified in the orthologous genes from other hemiascomycete species as well: 31 (74%)<br />

were identified in S. castellii, 23 (55%) and 27 (64%) were found in the related species S.<br />

kluyveri and Kluyveromyces waltii, respectively, and 21 (50%) and 14 (33%) were found<br />

in Ashbya gossypii and Candida albicans, respectively. Outside of the hemiascomycete<br />

group, we identified three to four (7%–10%) <strong>of</strong> these elements in the euascomycete fungi<br />

and two (5%) in Schizosaccharomyces pombe. Notably, when an identical procedure was<br />

performed using randomized consensus sequences, zero sequences were enriched with p<br />

< 0.0002 in their respective gene group from any species (Figures S1 and S2). The<br />

number <strong>of</strong> regulatory systems that could be found in each species roughly correlates with<br />

the species tree, in that more cis-regulatory elements were identified in species closely<br />

related to S. cerevisiae compared to the more distantly related fungi. This result could<br />

arise from the decreased accuracy <strong>of</strong> ortholog assignment in the distantly related species,<br />

which would hinder the identification <strong>of</strong> conserved regulatory systems. However, control<br />

experiments indicate that our ability to identify each regulatory element <strong>by</strong> enrichment is<br />

largely insensitive to noise in each gene group and to the ortholog assignment parameters<br />

(Figure S3 and unpublished data). These results therefore suggest that the number <strong>of</strong><br />

regulatory systems conserved across species correlates with their divergence times. A<br />

handful <strong>of</strong> these cis-regulatory systems are conserved in all or nearly all <strong>of</strong> the fungal<br />

genomes. For example, the group <strong>of</strong> G1-phase cell-cycle genes from all species was<br />

significantly enriched for genes containing the upstream Mlu1-cell cycle box (MCB)<br />

(McIntosh 1993). This sequence regulates the expression <strong>of</strong> the G1-phase genes from S.<br />

cerevisiae (Moll et al. 1992) as well as its distant relative Sch. pombe (Lowndes et al.<br />

1992; Malhotra et al. 1993), strongly suggesting that the element has a similar role in the<br />

other fungi. Likewise, the Gcn4p binding site was identified in the amino acid-<br />

biosynthesis genes from all but Sch. pombe, consistent with the known involvement <strong>of</strong><br />

Gcn4p-like transcription factors in the amino acid-starvation responses <strong>of</strong> S. cerevisiae,<br />

C. albicans, Neurospora crassa, and Aspergillus nidulans (Hinnebusch 1986; Ebbole et<br />

al. 1991; Tazebay et al. 1997; Tripathi et al. 2002). The expression <strong>of</strong> nitrogen-<br />

catabolism genes in C. albicans, N. crassa, and As. nidulans is thought to be governed <strong>by</strong><br />

GATA-like factors (Kudla et al. 1990; Chiang et al. 1994; Marzluf 1997; Limjindaporn et<br />

al. 2003), as it is in S. cerevisiae (Magasanik and Kaiser 2002), consistent with our ability<br />

to detect upstream GATA-binding elements in the group <strong>of</strong> nitrogen catabolism genes<br />

from these species. In the majority <strong>of</strong> cases (approximately 80%) in which a given cis-<br />

regulatory element was identified <strong>by</strong> enrichment, we could also identify in that species an<br />

ortholog <strong>of</strong> its binding protein from S. cerevisiae. Therefore, the most parsimonious<br />

model is that gene-expression <strong>regulation</strong> through the identified cis-regulatory sequence is<br />

governed <strong>by</strong> the orthologous transcription factor in each species.<br />

Figure 2.<br />

Conservation <strong>of</strong> Cis-Sequence Enrichment in Specific Gene Groups<br />

Gene groups from each <strong>of</strong> the 14 species that are enriched for genes whose flanking regions contain known<br />

or novel cis-sequences are represented <strong>by</strong> orange boxes. Each row represents a group <strong>of</strong> coexpressed S.<br />

cerevisiae genes and a single cis-regulatory element known or predicted to control the genes’ expression, as<br />

indicated to the left of the figure. Each column in the figure represents the orthologous gene groups in one of the 14<br />

fungal species. An orange box indicates that the S. cerevisiae cis-regulatory sequence listed to the<br />

left <strong>of</strong> the diagram is enriched in the denoted S. cerevisiae genes or their orthologs in each fungal genome,<br />

according to the key at the bottom <strong>of</strong> the figure. The p-values for each group are available in Datasets S14–<br />

S46, and the number <strong>of</strong> orthologs in each gene group is available in Dataset S49. Some cis-regulatory<br />

elements did not meet our significance cutoff for enrichment but had been previously identified as<br />

conserved in related gene groups from the closely related saccharomycete species (Kellis et al. 2003), and<br />

these are denoted with a yellow box. A gray box indicates that the denoted sequence was not significantly<br />

enriched in that gene group, while a white box indicates that fewer than four orthologs were identified in<br />

the species. The rows are organized in decreasing order <strong>of</strong> the number <strong>of</strong> species in which the element was<br />

enriched.<br />

Novel Sequences Are Enriched in Coregulated Gene Groups from Other Fungi<br />

In many cases, we were unable to detect significant enrichment <strong>of</strong> the S. cerevisiae<br />

upstream elements in the orthologous gene groups from other species, particularly in the<br />

more distantly related fungi. One possible explanation for this observation is that,<br />

although the genes are still coregulated in these species, the cis-regulatory mechanisms<br />

that control their expression have evolved. We therefore searched the upstream regions<br />

from each group <strong>of</strong> orthologous genes for novel sequence motifs, using the program<br />

MEME (Bailey and Elkan 1994) and selected matrices that were significantly enriched in<br />

the gene group in which they were identified (see Materials and Methods for details). As<br />

has been previously noted for this type <strong>of</strong> motif discovery (Tavazoie et al. 1999; McGuire<br />

et al. 2000), the majority <strong>of</strong> the identified motifs were not significantly enriched in the<br />

appropriate gene group and may represent background sequences that are not functional.<br />

Thus, a total <strong>of</strong> 53 matrices were identified as significant in at least one species based on<br />

this criterion (the complete list <strong>of</strong> matrices and enrichment p values are available in<br />

Datasets S47 and S48). Over half <strong>of</strong> these were similar to known S. cerevisiae elements<br />

shown in Figure 2 and were enriched in the orthologous S. cerevisiae genes. Of the<br />

remaining motifs, two recognizably similar matrices were identified in the same gene<br />

group from multiple species, suggesting that they represent conserved regulatory systems<br />

not present in S. cerevisiae. To further examine this possibility, we scored the enrichment<br />

<strong>of</strong> genes containing examples <strong>of</strong> the 53 matrices in the orthologous gene groups from all<br />

species. This procedure identified 19 unique MEME matrices that were not identified in<br />

the S. cerevisiae genes and therefore may represent novel cis-regulatory elements in these<br />

fungi (Figure 3). More than a third <strong>of</strong> these elements were also enriched in the same gene<br />

group from other species, providing additional support for their functional relevance. For<br />

example, a number <strong>of</strong> upstream sequences identified in ribosomal protein genes were<br />

enriched in the same gene group from four or five other species, but not from S.<br />

cerevisiae. Similarly, sequences identified upstream <strong>of</strong> tRNA synthetase genes and<br />

upstream <strong>of</strong> the proteasome genes were identified in the same genes from all <strong>of</strong> the<br />

euascomycete fungi (N. crassa, Magnaporthe grisea, and As. nidulans). In the case <strong>of</strong> the<br />

proteasome genes, MEME identified the same motif upstream <strong>of</strong> orthologous genes from<br />

the related euascomycete Histoplasma capsulatum, for which partial genome sequence is<br />

available (http://www.genome.wustl.edu/projects/hcapsulatum/) (unpublished data). That<br />

these sequences were identified in the same gene groups from multiple euascomycetes<br />

(but not the other species) implies that they are clade specific. Although future<br />

experiments will be required to elucidate the exact roles <strong>of</strong> these sequences, our<br />

observations suggest that the identified cis-sequences are functionally relevant and<br />

conserved across species.<br />

Figure 3.<br />

Enrichment <strong>of</strong> Novel Sequences in Coregulated Genes from Other Species<br />

Gene groups from each <strong>of</strong> the 14 species that are enriched for genes containing novel upstream sequences<br />

identified <strong>by</strong> MEME (see Materials and Methods for details) are shown, as described in Figure 2.<br />

Enrichment <strong>of</strong> genes that contain the cis-sequence listed to the left <strong>of</strong> the diagram is indicated <strong>by</strong> a purple<br />

box, according to the key at the bottom <strong>of</strong> the figure.<br />

Cis-Regulatory Element Positions and Spacing Are Also Conserved across Species<br />

The physical locations <strong>of</strong> many characterized S. cerevisiae cis-regulatory elements are<br />

restricted to a narrow region upstream <strong>of</strong> their target genes (Mannhaupt et al. 1999;<br />

Tavazoie et al. 1999; McGuire et al. 2000; Lieb et al. 2001; Natarajan et al. 2001). This<br />

suggests that these elements must be positioned in the appropriate window <strong>of</strong> the<br />

upstream sequences, perhaps to promote proper interactions between the element’s<br />

binding protein and other factors (such as nucleosomes or RNA polymerase subunits)<br />

(Workman and Kingston 1992; Vashee and Kodadek 1995; Fry et al. 1997; Fry and<br />

Farnham 1999; GuhaThakurta and Stormo 2001). To characterize the upstream positions<br />

<strong>of</strong> cis-regulatory elements in S. cerevisiae, we compared the fraction <strong>of</strong> elements in 50-bp<br />

windows upstream <strong>of</strong> their target genes to the fraction <strong>of</strong> elements in the same 50-bp<br />

window upstream <strong>of</strong> all genes in the S. cerevisiae genome. (This model is required to<br />

overcome the nonrandom nucleotide distribution immediately upstream <strong>of</strong> genes in this<br />

and other species, as described in Materials and Methods.) We found that many <strong>of</strong> the S.<br />

cerevisiae cis-regulatory elements are non-randomly distributed upstream <strong>of</strong> their target<br />

genes (Figure 4, blue boxes). Each element shows a different window <strong>of</strong> peak enrichment<br />

in S. cerevisiae. This likely reflects mechanistic differences between the regulatory<br />

systems that control the expression <strong>of</strong> each set <strong>of</strong> genes.<br />
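
A minimal sketch of the window comparison follows. It assumes that element positions are given in bp upstream of the start codon (1 to 1,000), that the target genes are a subset of all genes, and that each window is scored with a hypergeometric tail over element instances. The function names are hypothetical and the details in Materials and Methods may differ.

```python
from math import comb

WINDOW = 50   # window width in bp
SPAN = 1000   # total upstream region considered, in bp


def window_counts(positions, window=WINDOW, span=SPAN):
    """Count element positions (1..span bp upstream) falling in each window."""
    counts = [0] * (span // window)
    for pos in positions:
        if 1 <= pos <= span:
            counts[(pos - 1) // window] += 1
    return counts


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n items are drawn from N, of which K are in the window."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)) / total


def window_enrichment(target_positions, genome_positions):
    """Return one (ratio, p) pair per window.

    `ratio` is the fraction of target-gene elements in the window divided by
    the fraction of all-gene elements in the same window (None if undefined).
    """
    t = window_counts(target_positions)
    g = window_counts(genome_positions)
    n, N = sum(t), sum(g)
    results = []
    for k, K in zip(t, g):
        ratio = (k / n) / (K / N) if n and K else None
        results.append((ratio, hypergeom_tail(k, n, K, N)))
    return results
```

Comparing each window with the same window upstream of all genes, rather than with a uniform expectation, is what lets the comparison absorb the nonrandom nucleotide composition near the start codon.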

In the majority <strong>of</strong> cases, when a cis-regulatory system was conserved in another<br />

species, the corresponding element had a similar upstream distribution to that seen in S.<br />

cerevisiae, in that the distributions had the same window <strong>of</strong> peak enrichment (Figure 4).<br />

This is notable because the underlying genomic distribution of many of these sequences is<br />

substantially different in each species, due in part to the different GC content <strong>of</strong> some <strong>of</strong><br />

the genomes (unpublished data). For many regulatory systems, there was no correlation<br />

between the positions of individual elements in orthologous upstream regions from<br />

multiple species (although there were some exceptions; Figures S4 and S5). This<br />

indicates that the distributions <strong>of</strong> these elements have been conserved, even though the<br />

precise positions <strong>of</strong> individual elements have not (see Discussion). In addition to the<br />

conserved S. cerevisiae elements, many <strong>of</strong> the novel cis-sequences presented in Figure 3<br />

also showed nonrandom distributions in the species in which they were identified (Figure<br />

4, purple boxes). Thus, the positional distribution <strong>of</strong> cis-regulatory elements appears to be<br />

a general feature <strong>of</strong> cis-<strong>regulation</strong> in multiple ascomycete species.<br />

Figure 4.<br />

Distribution <strong>of</strong> Cis-Regulatory Elements Upstream <strong>of</strong> Coregulated Genes<br />

The distribution of nine different sequence motifs (represented to the left of the figure by the consensus<br />

sequences and their known binding proteins) was measured in 50-bp windows within 1,000 bp upstream <strong>of</strong><br />

the putative target genes (denoted to the right <strong>of</strong> the figure). Each colored box represents the frequency <strong>of</strong><br />

an element in a 50-bp window upstream <strong>of</strong> the target genes compared to the element’s frequency in the<br />

corresponding window <strong>of</strong> all upstream regions in each genome. Blue boxes represent sequences that<br />

matched the S. cerevisiae MEME matrices, while purple boxes represent sequences that matched the<br />

designated species-specific MEME matrices. Distributions that were significantly different from<br />

background in at least one 50-bp window (p < 0.01) were identified using the hypergeometric distribution<br />

(as described in Materials and Methods) and are denoted <strong>by</strong> an asterisk.<br />

In one case, the close spacing between two cis-regulatory elements was conserved across<br />

species. Chiang et al. (2003) previously reported that the distance between the Cbf1p-<br />

and Met31/32p-binding sites upstream <strong>of</strong> the methionine biosynthesis genes is closer than<br />

expected <strong>by</strong> chance. We found this feature to be conserved in other species as well. The<br />

Cbf1p and Met31/32p elements were independently identified upstream <strong>of</strong> the<br />

methionine genes from almost all <strong>of</strong> the hemiascomycetes (see Figure 2). In addition, the<br />

closer-than-expected spacing between these sequences was also conserved in these<br />

species (Figure 5). The spacing between elements was independent <strong>of</strong> the exact positions<br />

of the Cbf1p or Met31/32p sites in the saccharomycete species, as indicated by permutation<br />

tests performed as previously described (p < 0.05; Chiang et al. 2003). Thus, the close<br />

spacing between these sites is not simply due to the conserved positioning <strong>of</strong> the<br />

individual elements in each orthologous upstream region, but likely resulted from an<br />

<strong>evolution</strong>ary constraint on the distance between these sequences (see Discussion).<br />
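
One way such a permutation test could be set up is sketched below. It is not necessarily the procedure of Chiang et al. (2003). It assumes one Cbf1p site and one Met31/32p site per gene, and it shuffles the Met31/32p positions among genes so that each element keeps its own positional distribution while the pairing is broken. The function names and the fixed seed are illustrative choices.

```python
import random


def mean_spacing(pairs):
    """Mean absolute distance between the two sites of each (a, b) pair."""
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def spacing_permutation_p(pairs, n_perm=10000, seed=0):
    """Estimate how often shuffled pairings are at least as close as observed.

    `pairs` holds one (site_a_position, site_b_position) tuple per gene.
    """
    rng = random.Random(seed)
    observed = mean_spacing(pairs)
    a_pos = [a for a, _ in pairs]
    b_pos = [b for _, b in pairs]
    as_close = 0
    for _ in range(n_perm):
        rng.shuffle(b_pos)  # break the gene-by-gene pairing
        if mean_spacing(list(zip(a_pos, b_pos))) <= observed:
            as_close += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (as_close + 1) / (n_perm + 1)
```

A small p-value under this null means that the two sites sit closer together within genes than their separate positional distributions would predict, which is the distinction drawn in the text.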

Figure 5.<br />

Spatial Relationships between Cis-Regulatory Elements<br />

The mean spacing between the Cbf1p- and Met31/32p- binding sites within 500 bp upstream <strong>of</strong> the<br />

methionine biosynthesis genes (m) and <strong>of</strong> all <strong>of</strong> the genes in each genome (g) was calculated for the species<br />

indicated. The error bars represent twice the standard error, indicating the range <strong>of</strong> the estimated means<br />

with 95% confidence. The values below each plot indicate the number <strong>of</strong> binding-site pairs used in each<br />

calculation.<br />

Evolution <strong>of</strong> the Proteasome Cis-Regulatory Element in S. cerevisiae and C. albicans<br />

We were particularly interested in exploring patterns <strong>of</strong> cis element <strong>evolution</strong> across<br />

fungi. One interesting example is the case <strong>of</strong> Rpn4p, a non-classical Cys2-His2 zinc-<br />

finger protein known to regulate proteasome gene expression in S. cerevisiae (Mannhaupt<br />

et al. 1999; Xie and Varshavsky 2001). For the group <strong>of</strong> S. cerevisiae proteasome genes,<br />

the enrichment <strong>of</strong> genes containing the known Rpn4p binding site was highly significant<br />

(GGTGGCAA; p = 6.3 × 10⁻⁴¹). The same consensus sequence was also enriched in the<br />

orthologous upstream regions <strong>of</strong> all <strong>of</strong> the hemiascomycete fungi, but not in the upstream<br />

regions retrieved from fungi outside <strong>of</strong> the hemiascomycete group. We noticed that, in<br />

addition to the Rpn4p consensus site, a number <strong>of</strong> related hexameric sequences were also<br />

highly enriched in the orthologous upstream regions from C. albicans (unpublished data).<br />

This hinted at the possibility that a slightly different set <strong>of</strong> regulatory sequences governs<br />

the expression <strong>of</strong> the C. albicans proteasome genes.<br />

To further explore this possibility, we compared sequences found upstream <strong>of</strong> the<br />

proteasome genes from S. cerevisiae and C. albicans. To identify these sequences in an<br />

unbiased way, we first generated a species-independent “meta-matrix” based on a<br />

limited subset <strong>of</strong> the proteasome upstream regions from both species (see Materials and<br />

Methods for details). We then identified all examples <strong>of</strong> the meta-matrix upstream <strong>of</strong> the<br />

proteasome genes from S. cerevisiae and C. albicans, partitioned the sequences according<br />

to their species, and calculated two species-specific position-weight matrices (Figure 6).<br />

These matrices were statistically different at the second, third, and ninth positions<br />

(p < 0.01; see Materials and Methods for details) and indicated that the C. albicans matrix had<br />

less basepair specificity at these positions.<br />
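
The construction of a position-weight matrix and the scoring of a candidate site, as detailed in Materials and Methods, can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function names are hypothetical, the example sites are invented, and the background is assumed to be a single set of base frequencies shared by all positions.

```python
from math import log

BASES = "ACGT"


def position_weight_matrix(sites):
    """Per-position base frequencies with one pseudocount per base.

    Each frequency is (count + 1) / (n + 4), where n is the number of
    aligned sites, following the description in Materials and Methods.
    """
    n = len(sites)
    width = len(sites[0])
    pwm = []
    for p in range(width):
        column = [site[p] for site in sites]
        pwm.append({b: (column.count(b) + 1) / (n + 4) for b in BASES})
    return pwm


def log_likelihood_score(seq, pwm, background):
    """Sum over positions of log(f_motif / f_background) for the observed base.

    Assumption: `background` maps each base to one genome-wide frequency.
    """
    return sum(log(pwm[p][b] / background[b]) for p, b in enumerate(seq))


# Invented example sites, for illustration only.
example_sites = ["GGTGGCAA", "GGTGGCAA", "GGAGGCAA", "GAAGGCAA"]
uniform = {b: 0.25 for b in BASES}
pwm = position_weight_matrix(example_sites)
print(log_likelihood_score("GGTGGCAA", pwm, uniform))
```

Because only the observed base contributes at each position, the double sum over positions and bases with indicator variables reduces to a single sum over the positions of the scored sequence.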

The matrices are useful because they summarize the set <strong>of</strong> related sequences that<br />

are common to the upstream regions in each group, but a more direct assessment <strong>of</strong> these<br />

elements is to inspect the sequences directly. Sequences upstream <strong>of</strong> the S. cerevisiae and<br />

C. albicans proteasome genes that matched the "meta-matrix" described above were<br />

combined and organized <strong>by</strong> sequence similarity, using a hierarchical clustering method<br />

described in Materials and Methods. The sequences could be classified into three general<br />

categories (Figure S6). The first category consisted <strong>of</strong> related sequences that were found<br />

in both S. cerevisiae and C. albicans proteasome upstream regions, the second was<br />

composed <strong>of</strong> sequences found almost exclusively upstream <strong>of</strong> S. cerevisiae genes, and the<br />

third was composed <strong>of</strong> elements found only upstream <strong>of</strong> the C. albicans proteasome<br />

genes. Manual inspection <strong>of</strong> the proteasome-gene upstream regions supported these<br />

classifications: There were zero instances <strong>of</strong> the S. cerevisiae-specific 10-mer<br />

GGTGGCAAAW upstream <strong>of</strong> any C. albicans proteasome genes, although nearly 75%<br />

<strong>of</strong> the S. cerevisiae proteasome genes contained this upstream sequence. Similarly, zero<br />

instances <strong>of</strong> the C. albicans-specific 10-mer GRAGGCAAAA were found upstream <strong>of</strong> S.<br />

cerevisiae proteasome genes, whereas 25% <strong>of</strong> the C. albicans genes contained the<br />

element. These observations suggest that S. cerevisiae and C. albicans use different<br />

sequences to govern the expression <strong>of</strong> the proteasome genes.<br />

Figure 6.<br />

Position-Weight Matrices Representing Proteasome Cis-Regulatory Element<br />

Sequences within 500 bp upstream <strong>of</strong> the S. cerevisiae or C. albicans proteasome genes that matched the<br />

species-independent meta-matrix were identified as described. The identified sequences were used to<br />

generate sequence logos (Crooks et al. 2004) to represent the set <strong>of</strong> cis-sequences from S. cerevisiae (left)<br />

or from C. albicans (right). The height <strong>of</strong> each letter represents the frequency <strong>of</strong> that base in that position <strong>of</strong><br />

the matrix. Positions in the matrices that are statistically different (see Materials and Methods for details)<br />

are indicated with an asterisk.<br />

Discussion<br />

The ascomycete fungi represent nearly 75% <strong>of</strong> all fungal species, and their diversity is<br />

evident <strong>by</strong> their unique morphologies, life styles, environmental interactions, and niches<br />

(Ainsworth et al. 2001). This diversity has been shaped <strong>by</strong> over a billion years <strong>of</strong><br />

<strong>evolution</strong> (Berbee and Taylor 1993; Heckman et al. 2001) and has almost certainly been<br />

affected <strong>by</strong> variation in gene expression. To explore the <strong>evolution</strong> <strong>of</strong> gene-expression<br />

<strong>regulation</strong> in these fungi, we have examined the cis-regulatory networks <strong>of</strong> 14<br />

ascomycete species whose genomes have been sequenced, using a framework that is not<br />

dependent on multiple alignments <strong>of</strong> orthologous regulatory regions. We have identified<br />

probable cis-acting sequences in each <strong>of</strong> these species <strong>by</strong> applying motif search and<br />

discovery methods to the flanking regions <strong>of</strong> orthologs <strong>of</strong> coregulated S. cerevisiae genes.<br />

Our ability to identify such sequences in the same gene groups from multiple species<br />

strongly suggests that the co<strong>regulation</strong> <strong>of</strong> those genes has been conserved. Examples<br />

from our analysis indicate that in many cases the genes’ co<strong>regulation</strong> is governed <strong>by</strong> a<br />

conserved regulatory system, while other examples suggest that some regulatory<br />

networks have evolved. These examples provide insights into the functional constraints<br />

that underlie the <strong>evolution</strong> <strong>of</strong> gene-expression <strong>regulation</strong>, as summarized below.<br />

Conservation <strong>of</strong> Cis-Regulatory Systems<br />

Our results indicate that a large number <strong>of</strong> cis-regulatory networks that function in S.<br />

cerevisiae are conserved in other ascomycete species. This is expected for the closely<br />

related species, since conserved regulatory elements can be readily identified in<br />

alignments <strong>of</strong> orthologous regulatory regions (Cliften et al. 2003; Kellis et al. 2003).<br />

However, we show here that many <strong>of</strong> the cis-regulatory systems represented <strong>by</strong> these<br />

elements are conserved over much longer <strong>evolution</strong>ary time frames, beyond those for<br />

which orthologous noncoding regions can be aligned. For example, 50%–75% <strong>of</strong> the<br />

regulatory systems identified in S. cerevisiae are also found in S. kluyveri and S. castellii,<br />

which are diverged enough from S. cerevisiae that much <strong>of</strong> the gene synteny is lost and<br />

most orthologous intergenic regions cannot be aligned (Cliften et al. 2003). Over a third<br />

<strong>of</strong> these regulatory systems were identified in C. albicans, which is estimated to have<br />

diverged from S. cerevisiae over 200 million years ago, and a small number <strong>of</strong> regulatory<br />

networks have been conserved since the origin <strong>of</strong> the Ascomycetes some 500 million to a<br />

billion years ago (Berbee and Taylor 1993; Heckman et al. 2001). It is likely that we have<br />

underestimated the number <strong>of</strong> conserved regulatory networks, partly because <strong>of</strong> statistical<br />

limitations <strong>of</strong> our method. Nonetheless, these data indicate that regulatory networks can<br />

be conserved over very long periods <strong>of</strong> <strong>evolution</strong>. Despite the widespread conservation <strong>of</strong><br />

cis-regulatory networks, it is important to note that this does not necessarily imply that<br />

the individual cis-elements have remained perfectly conserved. For example, while we<br />

could identify the same cis-sequences in orthologous gene groups, the positions <strong>of</strong> the<br />

individual elements in orthologous upstream regions in many cases appear to have<br />

changed (see Figure S4). Evolution <strong>of</strong> cis-element position has been observed in closely<br />

related drosophilids, mammals, and other species (Ludwig and Kreitman 1995; Ludwig et<br />

al. 1998; Piano et al. 1999; Dermitzakis and Clark 2002; Scemama et al. 2002;<br />

Dermitzakis et al. 2003) and is proposed to occur <strong>by</strong> two general mechanisms (reviewed<br />

in Wray et al. 2003). The first is binding-site turnover, where<strong>by</strong> the appearance <strong>of</strong> a new<br />

cis-element elsewhere in a promoter can compensate for the loss <strong>of</strong> a functional element<br />

in the same regulatory region. Simulation studies show that cis-element turnover occurs<br />

frequently over short <strong>evolution</strong>ary time scales and is likely to play an important role in<br />

gene-expression <strong>regulation</strong> (Stone and Wray 2001; Dermitzakis et al. 2003).<br />

Alternatively, small insertions and deletions in a regulatory region can permute the cis-<br />

element’s position without changing the element’s sequence (Ludwig and Kreitman<br />

1995; Piano et al. 1999; Ruvinsky and Ruvkun 2003). Thus, regulatory regions appear to<br />

be relatively plastic in their organization. Despite this plasticity, however, a gene’s<br />

expression pattern and the regulatory system governing its expression can remain intact<br />

even though the gene’s flanking regulatory region has undergone reorganization (Piano et<br />

al. 1999; Ludwig et al. 2000; Scemama et al. 2002; Hinman et al. 2003; Romano and<br />

Wray 2003; Ruvinsky and Ruvkun 2003). This indicates that some combination <strong>of</strong><br />

purifying selection and drift (Ludwig et al. 2000) can act to maintain the appropriate<br />

regulatory connections to conserve the gene’s expression pattern. Although the positions<br />

<strong>of</strong> many <strong>of</strong> the individual cis-elements have evolved in these species, we found that the<br />

distribution <strong>of</strong> elements upstream <strong>of</strong> their gene targets was <strong>of</strong>ten similar across species.<br />

This suggests that there has been constraint on the region in which the elements are<br />

positioned, without pressure to maintain the exact positions <strong>of</strong> individual elements. One<br />

explanation for this model is that mechanistic features <strong>of</strong> these regulatory systems are<br />

also conserved across species (Wray et al. 2003). For example, the restricted location <strong>of</strong><br />

cis-regulatory elements may promote interactions between the cognate binding protein<br />

and other regulatory proteins. Therefore, selective pressure may act to maintain these<br />

interactions through the relative positions <strong>of</strong> the underlying binding sites. This model<br />

may also explain the conserved close spacing between Cbf1p and Met31/32p elements in<br />

methionine biosynthesis genes from the hemiascomycete fungi. These transcription<br />

factors are proposed to act cooperatively in S. cerevisiae to recruit additional<br />

<strong>transcriptional</strong> regulators (Blaiseau and Thomas 1998). That the spacing between the<br />

Cbf1p and Met31/32p elements is closer than expected in other species as well suggests<br />

that the cooperative interaction between the factors has been conserved across the<br />

Hemiascomycetes.<br />

Evolution <strong>of</strong> Cis-Regulatory Networks<br />

In addition to the clear cases <strong>of</strong> network conservation discussed above, we also found<br />

evidence for the <strong>evolution</strong> <strong>of</strong> cis-regulatory systems. Our ability to identify novel<br />

sequences enriched in orthologs <strong>of</strong> coregulated S. cerevisiae genes implies that, although<br />

the genes are still coregulated in those species, the systems governing their expression<br />

have changed. This indicates that the regulatory regions <strong>of</strong> those genes coevolved to<br />

contain the same cis-sequences. We were interested in identifying global predictors <strong>of</strong> the<br />

relative rates <strong>of</strong> cis-regulatory network <strong>evolution</strong>, but these factors remain enigmatic.<br />

Unlike the <strong>evolution</strong>ary rates <strong>of</strong> protein coding regions, for which essential proteins<br />

typically evolve at a slower rate (Wilson et al. 1977; Hirsh and Fraser 2001; Krylov et al.<br />

2003; H. B. F., personal communication), we found no evidence for a retarded rate <strong>of</strong><br />

<strong>evolution</strong>/loss <strong>of</strong> the cis-regulatory systems <strong>of</strong> essential genes (unpublished data). For<br />

example, the proteasome subunits and the ribosomal proteins are among the most highly<br />

conserved proteins, and the genes that encode them are expressed with similar patterns in<br />

S. cerevisiae, C. albicans, and Sch. pombe (Gasch et al. 2000; Chen et al. 2003; Enjalbert<br />

et al. 2003). Nonetheless, we identified different upstream sequences for these groups in<br />

the different species we analyzed, suggesting that the <strong>regulation</strong> <strong>of</strong> the genes’ expression<br />

has evolved even though their expression patterns have not. This is consistent with<br />

previous observations <strong>of</strong> developmentally regulated genes in higher organisms, whose<br />

temporal and spatial expression can be conserved across taxa despite divergence in their<br />

<strong>regulation</strong> (Takahashi et al. 1999; True and Haag 2001; Scemama et al. 2002; Hinman et<br />

al. 2003; Romano and Wray 2003; Ruvinsky and Ruvkun 2003; Wang et al. 2004). In<br />

contrast, we observed that proteins involved in mating have a high rate <strong>of</strong> <strong>evolution</strong>, yet<br />

we could identify the Ste12p binding site (Fields and Herskowitz 1985) upstream <strong>of</strong><br />

mating genes in nearly all <strong>of</strong> the hemiascomycetes. Consistently, orthologs <strong>of</strong> Ste12p are<br />

known to be required for mating in distantly related fungi that mate through significantly<br />

different processes (Lengeler et al. 2000; Vallim et al. 2000; Young et al. 2000; Chang et<br />

al. 2001). Since mating may be triggered <strong>by</strong> similar environmental cues (Lengeler et al.<br />

2000), <strong>evolution</strong>ary pressure may have conserved the regulatory system that mediates this<br />

process (to the extent <strong>of</strong> our observations), even though the mating proteins have<br />

evolved. Although we could not find global correlates with the patterns <strong>of</strong> cis-regulatory<br />

network <strong>evolution</strong>, a number <strong>of</strong> individual examples from our analysis are consistent with<br />

specific models <strong>of</strong> network <strong>evolution</strong>. These examples are discussed below.<br />

Addition <strong>of</strong> Gene Targets into an Existing Regulatory Network<br />

Sequences that match cis-regulatory elements can readily appear in noncoding DNA<br />

through drift. In the same way that this process can promote binding site turnover within<br />

a given regulatory region, it can create de novo elements in the regulatory regions <strong>of</strong><br />

random genes, giving rise to novel targets <strong>of</strong> that regulatory system (Stone and Wray<br />

2001; Rockman and Wray 2002). The addition <strong>of</strong> novel targets into cis-regulatory<br />

systems may have occurred in the case <strong>of</strong> E2F-like transcription factors. In S. cerevisiae,<br />

the related MCB (ACGCG) and Swi4-Swi6 cell-cycle box, or SCB (CGCGAAA)<br />

regulatory elements are found upstream <strong>of</strong> G1-phase cell cycle genes, similar to the E2F<br />

element found in these genes in worms, flies, humans, and plants (Lowndes et al. 1992;<br />

Malhotra et al. 1993; DeGregori 2002; Ren et al. 2002; De Veylder et al. 2003; Rustici et<br />

al. 2004). What is striking about the conservation <strong>of</strong> this network is that cell-cycle<br />

progression is markedly different in these organisms: The hemiascomycete fungi<br />

replicate <strong>by</strong> budding, unlike the filamentous fungi in the euascomycete group, the fission<br />

yeast Sch. pombe, and the other higher eukaryotes. While some <strong>of</strong> the genes regulated <strong>by</strong><br />

these elements are well conserved across organisms (namely, the DNA replication<br />

proteins), genes whose products are involved in budding are also expressed in G1 phase<br />

and regulated <strong>by</strong> these elements in S. cerevisiae (Spellman et al. 1998; Iyer et al. 2001)<br />

and likely in its budding cousins as well. Because these genes are not conserved outside<br />

the hemiascomycete clade, and since it is unlikely that budding represents the ancestral<br />

mode <strong>of</strong> replication, this suggests that genes involved in budding were assumed into an<br />

existing cis-regulatory network in these yeasts.<br />

Co<strong>evolution</strong> <strong>of</strong> an Existing Regulatory Network<br />

Mutation <strong>of</strong> a cis-regulatory element can be compensated <strong>by</strong> the stabilizing effects <strong>of</strong><br />

binding site turnover (Ludwig et al. 2000), as discussed above, but it could also be<br />

overcome <strong>by</strong> corresponding changes in its DNA-binding protein, such that the interaction<br />

between the two is maintained. Parallel changes in DNA element and protein sequence<br />

can occur to conserve the overall regulatory network (i.e., the same binding protein<br />

regulating the same set <strong>of</strong> genes), despite <strong>evolution</strong> <strong>of</strong> their molecular interaction. We<br />

found slightly different sets <strong>of</strong> sequences enriched upstream <strong>of</strong> the proteasome genes<br />

from S. cerevisiae versus C. albicans, and these differences corresponded with the<br />

different binding specificities <strong>of</strong> Sc_Rpn4p and Ca_Rpn4p in vitro. This result is<br />

consistent with the model that the binding specificity <strong>of</strong> Sc_Rpn4p and Ca_Rpn4p<br />

coevolved with the elements found upstream <strong>of</strong> the proteasome genes in each species.<br />

[Neither Ca_Rpn4p nor the hybrid protein functioned in an in vivo reporter system<br />

(unpublished data); however, Sc_Rpn4p could transcribe a reporter gene to higher levels<br />

if Sequence A was present in its promoter compared to when Sequence B or a minimal<br />

promoter was placed upstream <strong>of</strong> the reporter gene (see Figure S7). These results are<br />

consistent with the hypothesis that Sc_Rpn4p ineffectively initiates transcription from the<br />

C. albicans-specific element. Since Ca_Rpn4p and Nc_Rpn4p both bind significantly to<br />

Sequence B, it is likely that this was also true <strong>of</strong> the proteins’ common ancestor and that<br />

Sc_Rpn4p largely lost the ability to bind productively to this sequence. The altered<br />

specificity <strong>of</strong> Sc_Rpn4p is due to amino acid differences in its DNA-binding domain,<br />

since the hybrid Rpn4p (containing the Ca_Rpn4p DNA binding domain) bound to<br />

Sequence B as well as it did to Sequence C (see Figure 7C). Determining which residues<br />

are responsible for the altered activity is a difficult task, however, since all <strong>of</strong> the residues<br />

known to participate in zinc coordination and DNA contact (Rhodes et al. 1996; Wolfe et<br />

al. 1999; Wolfe et al. 2000; Pabo et al. 2001; Benos et al. 2002) are perfectly conserved<br />

between these orthologs (Figure 9). One obvious difference in the orthologous proteins is<br />

the spacing between the cysteine and histidine pair in the second zinc finger, which is<br />

proposed to contact the first half <strong>of</strong> the DNA-binding site (Wolfe et al. 2000; Pabo et al.<br />

2001) wherein the base-specificity differences reside. Sc_Rpn4p, Ca_Rpn4p, and the<br />

euascomycete Rpn4p orthologs all vary in amino acid length and identity in this region,<br />

which implicated the region as relevant to the specificity differences. However, a mutant<br />

Sc_Rpn4p that contained the Nc_Rpn4p sequence in this region (see Figure 9) had the<br />

same binding specificity as the wild-type Sc_Rpn4p (albeit with less activity;<br />

unpublished data), indicating that this region alone is not sufficient to explain the<br />

differences in binding pr<strong>of</strong>iles.]<br />

Figure 9.<br />

Sequence Alignment <strong>of</strong> the DNA-Binding Domain <strong>of</strong> Rpn4p and Its Orthologs<br />

Clustal W was used to identify a multiple alignment between S. cerevisiae Rpn4p and its orthologs in the<br />

other fungi; the alignment over the DNA binding domain is shown. No ortholog was identified <strong>by</strong> our<br />

method in S. kluyveri, apparently due to poor sequence coverage in that region (unpublished data). The<br />

conserved cysteine and histidine residues <strong>of</strong> the two C2H2 zinc-finger domains are highlighted in yellow,<br />

and the domain in each finger that is predicted to contact the DNA is indicated with a gray bar. The region<br />

<strong>of</strong> sequence variation between the hemiascomycete and euascomycete Rpn4p proteins is indicated with a<br />

box.<br />

Cooption <strong>of</strong> a Regulatory System to Govern a Different Set <strong>of</strong> Genes<br />

An extreme example <strong>of</strong> the previously discussed modes <strong>of</strong> <strong>evolution</strong> is the complete<br />

alteration <strong>of</strong> a regulatory system’s target genes (True and Carroll 2002). This may have<br />

occurred for the Rpn4p regulatory system sometime after the divergence <strong>of</strong> the<br />

euascomycete and hemiascomycete fungi. Our data suggest that, while Sc_Rpn4p and<br />

Ca_Rpn4p control proteasome-gene expression in these species, the euascomycete<br />

orthologs <strong>of</strong> this transcription factor probably do not. Nc_Rpn4p did not bind the novel<br />

sequence we identified upstream <strong>of</strong> euascomycete proteasome genes, and reciprocally the<br />

majority <strong>of</strong> these genes did not contain examples <strong>of</strong> the Rpn4p binding site. One<br />

possibility is that Nc_Rpn4p and its orthologs regulate a different set <strong>of</strong> genes in the<br />

euascomycete clade. Preliminary investigation <strong>of</strong> orthologous euascomycete genes that<br />

contain examples <strong>of</strong> the Ca_Rpn4p matrix (used as a surrogate for the Nc_Rpn4p binding<br />

matrix) did not reveal any obvious relationships in the genes’ functional annotations or<br />

striking similarities in their patterns <strong>of</strong> expression (T. Kasuga, personal communication).<br />

Interestingly, however, the orthologs <strong>of</strong> RPN4 in all three euascomycete species<br />

contained upstream Rpn4p elements, raising the possibility that this gene is autoregulated<br />

at the level <strong>of</strong> expression in these fungi. Future experiments will test the function <strong>of</strong> this<br />

factor in N. crassa as well as the role <strong>of</strong> the novel sequence in mediating proteasome<br />

gene expression. The converse <strong>of</strong> this situation is that the regulatory regions <strong>of</strong><br />

coregulated genes must coevolve, such that they all contain the same regulatory elements<br />

recognized <strong>by</strong> the new system. This apparently occurs despite strong constraint on the<br />

genes’ expression patterns. For example, most proteasome subunits are essential and<br />

required in proper stoichiometric amounts (Russell et al. 1999; Kruger et al. 2001).<br />

Nonetheless, we found different cis-sequences upstream <strong>of</strong> the proteasome genes from<br />

the hemiascomycete and euascomycete fungi. Another example can be seen in the<br />

ribosomal protein genes, which must also be expressed to the same relative levels<br />

(Warner 1999; Zhao et al. 2003). In all species, we could find elements upstream <strong>of</strong> the<br />

ribosomal proteins, but different cis-sequences were identified in subsets <strong>of</strong> these species<br />

(see Figures 2 and 3). How the regulatory systems that control the genes’ expression<br />

evolve is unclear. This process may involve an intermediate stage in which the genes’<br />

expression is controlled <strong>by</strong> two distinct, but partially redundant, regulatory systems (True<br />

and Haag 2001; True and Carroll 2002). Differential loss <strong>of</strong> one system in two diverged<br />

species would render the orthologous genes coregulated <strong>by</strong> different regulatory systems.<br />

This model for regulatory system "turnover" is in direct analogy to the case <strong>of</strong> binding<br />

site turnover, in which partially redundant cis-elements that are created <strong>by</strong> drift coexist in<br />

a regulatory region before they are differentially lost in the diverged species (Ludwig et<br />

al. 2000; Stone and Wray 2001).<br />

Conclusions and Future Directions<br />

We have provided a framework for studying cis-regulatory <strong>evolution</strong> without relying on<br />

alignments <strong>of</strong> intergenic regions. The <strong>evolution</strong>ary dynamics <strong>of</strong> <strong>transcriptional</strong> <strong>regulation</strong><br />

is evident from the examples we have presented. We expect that as more complete fungal<br />

genomes emerge, particularly for fungi with intermediate <strong>evolution</strong>ary relationships,<br />

important gaps in the existing phylogeny will be filled. These key species may provide a<br />

window into intermediate stages <strong>of</strong> cis-element <strong>evolution</strong>, allowing us to further delineate<br />

the patterns <strong>of</strong> and constraints on the <strong>evolution</strong> <strong>of</strong> cis-<strong>regulation</strong>.<br />

Materials and Methods<br />

Genome sequences.<br />

Genome sequence and open reading frame (ORF) annotations for the saccharomycete<br />

species were obtained from P. Cliften, M. Kellis, and the Saccharomyces Genome<br />

Database (G<strong>of</strong>feau et al. 1996; Cliften et al. 2003; Kellis et al. 2003). Sequences for other<br />

genomes were downloaded from the published or listed Web sites as follows. K. waltii<br />

(Kellis et al. 2004), A. gossypii (Dietrich et al. 2004), C. albicans (Assembly 6;<br />

http://www-sequence.stanford.edu/group/candida/) (Jones et al. 2004), N. crassa<br />

(Release 3; Galagan et al. 2003), M. grisea (Release 2;<br />

http://www-genome.wi.mit.edu/annotation/fungi/magnaporthe/), As. nidulans (Release 3.1;<br />

http://www.broad.mit.edu/annotation/fungi/aspergillus/), and Sch. pombe (Wood et al.<br />

2002). A conservative list <strong>of</strong> putative ORFs from S. kudriavzevii, S. castellii, and S.<br />

kluyveri was generated, taking all ORFs <strong>of</strong> more than 100 amino acids as putative genes.<br />

ORFs orthologous to S. cerevisiae genes were identified as described below; some intron-<br />

containing S. cerevisiae genes that may also contain introns in these species (namely<br />

ribosomal protein genes) were identified <strong>by</strong> tBLASTn and manually added to the list <strong>of</strong><br />

orthologs for these species. Orthologs between S. cerevisiae and S. paradoxus, S.<br />

mikatae, and S. bayanus (Kellis et al. 2003) were downloaded from the Saccharomyces<br />

Genome Database (http://www.yeastgenome.org/). All other orthologs to S. cerevisiae<br />

genes were assigned using the method <strong>of</strong> Wall et al. (2003), with a BLAST<br />

e-value cut<strong>of</strong>f <strong>of</strong> $10^{-5}$ and the requirement for fewer than 20% gapped positions in the<br />

Clustal W alignments. The number <strong>of</strong> orthologs assigned in each species is listed in Table<br />

1, and the complete results are available in Datasets S5–S12.<br />

S. cerevisiae gene clusters.<br />

Groups <strong>of</strong> known or putatively coregulated genes were identified in three ways. First, we<br />

used hierarchical (<strong>Eisen</strong> et al. 1998) and fuzzy k-means (Gasch and <strong>Eisen</strong> 2002)<br />

clustering to organize publicly available yeast gene expression data (DeRisi et al. 1997;<br />

Spellman et al. 1998; Gasch et al. 2000; Lyons et al. 2000; Ogawa et al. 2000; Primig et<br />

al. 2000; Gasch et al. 2001; Yoshimoto et al. 2002), taking gene clusters that were<br />

correlated <strong>by</strong> more than about 0.7 or with a membership <strong>of</strong> 0.08 or greater (Gasch and<br />

<strong>Eisen</strong> 2002). Second, we identified genes or transcripts whose flanking regions are<br />

physically bound <strong>by</strong> the same DNA or RNA binding proteins, as indicated <strong>by</strong><br />

immunoprecipitation experiments (Simon et al. 1993; Iyer et al. 2001; Lieb et al. 2001;<br />

Simon et al. 2001; Lee et al. 2002; Gerber et al. 2004): For the DNA<br />

immunoprecipitation experiments, genes were ranked according to the published binding<br />

p-values, and a sliding p-value (between $10^{-2}$ and $10^{-4}$) was applied such that at least 20<br />

genes were selected in each group. Transcripts that are bound <strong>by</strong> RNA binding proteins<br />

were taken from Gerber et al. (2004). Finally, genes with the same functional annotations<br />

(Weng et al. 2003), and genes known to be coregulated <strong>by</strong> various transcription factors<br />

(Gasch et al. 2000; Lyons et al. 2000; Ogawa et al. 2000; Shakoury-Elizeh et al. 2004),<br />

were grouped together. In all, we identified 264 partially redundant groups <strong>of</strong> S.<br />

cerevisiae genes that are likely to be coregulated. These gene groups ranged in size from<br />

four to 570 genes, with a median size <strong>of</strong> 17 genes per group. The complete gene groups<br />

are available in Dataset S2.<br />
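
One way to read the sliding p-value cutoff applied to the DNA immunoprecipitation data is sketched below in Python. This is an assumption-laden illustration, not the procedure as implemented for this work: the function name `select_bound_genes` is hypothetical, and the rule that the cutoff starts at the strict bound and is relaxed one ranked gene at a time, stopping at the loose bound even if fewer than 20 genes qualify, is our interpretation of the description above.

```python
def select_bound_genes(pvalues, strict=1e-4, loose=1e-2, min_genes=20):
    """Select genes bound by a factor using a sliding p-value cutoff.

    `pvalues` maps gene name -> published binding p-value. All genes at or
    below `strict` are kept. If that yields fewer than `min_genes`, the
    cutoff slides down the ranked list, but never past `loose`.
    """
    ranked = sorted(pvalues.items(), key=lambda item: item[1])
    selected = [gene for gene, p in ranked if p <= strict]
    if len(selected) < min_genes:
        for gene, p in ranked[len(selected):]:
            if p > loose or len(selected) >= min_genes:
                break
            selected.append(gene)
    return selected


# Hypothetical p-values, for illustration only.
example = {"geneA": 1e-6, "geneB": 5e-4, "geneC": 0.2}
print(select_bound_genes(example, min_genes=2))
```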

Motif identification and enrichment.<br />

We compiled from the literature a list <strong>of</strong> 80 known transcription factor-binding sites,<br />

represented <strong>by</strong> IUPAC consensus sequences (Dataset S1) (Costanzo et al. 2001; Weng et<br />

al. 2003). Unless otherwise noted, we searched 1,000 bp upstream or 500 bp downstream<br />

<strong>of</strong> the genes from each group in each fungal genome for sequences that matched the<br />

consensus binding sites, <strong>by</strong> doing string comparisons on both strands using PERL scripts.<br />
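
A minimal Python sketch of such a search (not the original PERL scripts) might look like the following. The function names are hypothetical, the example region is invented, and the translation of IUPAC codes into regular-expression character classes is one of several reasonable ways to do the string comparison.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also maps ambiguity codes to their complements.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(consensus, region):
    """Return sorted (start, strand) pairs for matches on either strand.

    Positions are 0-based offsets into `region` as given; a '-' hit means the
    reverse complement of the consensus matches at that offset.
    """
    region = region.upper()
    hits = set()
    for strand, motif in (("+", consensus), ("-", reverse_complement(consensus))):
        # A zero-width lookahead lets overlapping matches be reported.
        pattern = re.compile("(?=" + "".join(IUPAC[c] for c in motif) + ")")
        for match in pattern.finditer(region):
            hits.add((match.start(), strand))
    return sorted(hits)


# Invented upstream region containing the S. cerevisiae-specific 10-mer.
print(find_sites("GGTGGCAAAW", "TTGGTGGCAAAATT"))
```

A gene would then be counted as containing the element if `find_sites` returns at least one hit within its flanking region.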

For each group <strong>of</strong> genes identified above, we scored the enrichment <strong>of</strong> genes whose<br />

flanking regions (either 500 bp upstream, 1,000 bp upstream, or 500 bp downstream)<br />

contain one or more examples <strong>of</strong> each cis-regulatory element, using the hypergeometric<br />

distribution<br />

$$\sum_{i=q}^{l} \frac{\binom{M}{i}\binom{N-M}{l-i}}{\binom{N}{l}},$$

where q is the number <strong>of</strong> genes that contain the motif in a group <strong>of</strong> l selected genes,<br />

relative to M genes that contain the motif in a genome <strong>of</strong> N genes. A p < 0.0002<br />

(approximately 0.01/80 tests) was deemed statistically significant for the consensus<br />

sequences, although if the sequence was enriched in the known group <strong>of</strong> target genes, we<br />

relaxed the cut<strong>of</strong>f to p < 0.01. A cut<strong>of</strong>f <strong>of</strong> $p < 2.3 \times 10^{-5}$ was applied to sequences that<br />

matched the MEME matrices. For the Mig1p and GATA binding sequences, which are<br />

sufficiently short and occur frequently in each genome, we also scored the enrichment <strong>of</strong><br />

genes whose upstream region contained two or more examples <strong>of</strong> the known binding<br />

sites. For each group <strong>of</strong> genes, we also ran the motif-finding algorithm MEME (Bailey<br />

and Elkan 1994) on the upstream regions <strong>of</strong> S. cerevisiae genes or their orthologs in each<br />

species, using a two-component mixture model both with and without a motif-width<br />

specification <strong>of</strong> 8 bp. Unless otherwise noted, we used 500 bp upstream (for the<br />

hemiascomycetes) or 1,000 bp upstream (for the euascomycetes and Sch. pombe) <strong>of</strong> the<br />

genes in each group. Thus, for each group <strong>of</strong> coregulated genes, we performed 14 MEME<br />

analyses (each identifying three matrices) on the upstream regions <strong>of</strong> the genes from a<br />

given species. Matrices that matched known S. cerevisiae regulatory elements were<br />

identified <strong>by</strong> manual and automated comparisons, similar to that previous described<br />

164


(Hughes et al. 2000). A position-weight matrix was calculated for each motif on the basis<br />

<strong>of</strong> n motif examples MEME identified <strong>by</strong> counting the number <strong>of</strong> occurrences <strong>of</strong> each<br />

base at each position in n motifs, adding one pseudocount, and dividing <strong>by</strong> n + 4. A log-<br />

likelihood score S was calculated for each motif example as follows.<br />

$$S = \sum_{p}\sum_{b} X_{pb}\,\log\!\left(\frac{f^{\mathrm{motif}}_{pb}}{f^{\mathrm{background}}_{pb}}\right)$$

In this formula, p is each position in the motif, b is the base (G, A, C, or T), and X is a matrix of indicator variables representing the sequence, where $X_{pb} = 1$ if the sequence has base b at position p, and zero otherwise. The probabilities of bases in the motif according to the position-weight matrix are represented by $f^{\mathrm{motif}}$, and the probabilities of bases in the genomic background are represented by $f^{\mathrm{background}}$ (see below). The score S′ was assigned to each matrix, equal to 0.75 times the average S of the motif examples, using the base frequency from each genome as the background model (G/C = 0.2 and A/T = 0.3 for all species except N. crassa, where G/C/A/T = 0.25). This score was used as a cutoff to identify genomic examples of the matrix.

To identify genes whose upstream regions contained examples of each motif, we calculated the log-likelihood S of each 8-bp sequence within the 1,000 bp upstream region of each gene. The background model was based on the genomic nucleotide frequency in the 50 bp upstream window corresponding to the position of the sequence being assessed. We used this model to overcome the species-specific positional nucleotide biases immediately upstream of coding sequences (A. M. M., A. P. G., D. Y. C., and M. B. E., unpublished data). A sequence was considered a match to the matrix if S > S′. The enrichment of genes that contained each motif was scored using the hypergeometric distribution, as described above. A p < 3 × 10⁻⁵ (0.01 divided by the number of matrices tested in each species) was considered statistically significant.

Out of the MEME matrices trained on the non-S. cerevisiae species, 53 were enriched in the gene group in which they were identified. Of these elements, 28 were similar to S. cerevisiae elements shown in Figure 2 and were enriched in the S. cerevisiae genes. An additional six matrices were redundantly identified in nearly identical gene groups (namely, Fhl1p targets and ribosomal protein genes) from the same species, and two elements were very similar and identified in the same gene group from As. nidulans and M. grisea. Thus, in all, 19 novel elements were identified. The complete list of matrices is available in Dataset S47.
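The matrix construction and scoring described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the analysis: the function names are hypothetical, the natural logarithm is assumed because the text does not give the base, only the forward strand is scanned, and a single genome-wide background is used in place of the position-specific 50-bp background applied during scanning.

```python
import math

BASES = "ACGT"


def build_pwm(examples):
    """Per-position base probabilities: (count + 1) / (n + 4)."""
    n = len(examples)
    width = len(examples[0])
    return [
        {b: (sum(1 for e in examples if e[p] == b) + 1) / (n + 4)
         for b in BASES}
        for p in range(width)
    ]


def log_likelihood(seq, pwm, background):
    """S: sum over positions of log(f_motif / f_background) for the
    base actually observed (natural log is an assumption)."""
    return sum(math.log(pwm[p][b] / background[b])
               for p, b in enumerate(seq))


def matrix_cutoff(examples, pwm, background, fraction=0.75):
    """S': a fraction (0.75 in the text) of the mean example score."""
    scores = [log_likelihood(e, pwm, background) for e in examples]
    return fraction * sum(scores) / len(scores)


def scan(region, pwm, background, cutoff):
    """Start positions of forward-strand windows with S > cutoff."""
    w = len(pwm)
    return [i for i in range(len(region) - w + 1)
            if log_likelihood(region[i:i + w], pwm, background) > cutoff]


# Toy data, not taken from the thesis.
examples = ["CACGTG", "CACGTG", "CACGTT", "CACGTG"]
background = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}
pwm = build_pwm(examples)
cutoff = matrix_cutoff(examples, pwm, background)
print(scan("TTTCACGTGAAA", pwm, background, cutoff))
```

Because one pseudocount is added to each of the four bases, every column sums to one and no base has zero probability, so the log ratio is always defined.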

Positional distribution and spacing of cis-sequences.

Genes that contained sequences that matched the S. cerevisiae position-weight matrices were identified as described above. We then calculated the frequency of each sequence in 50-bp windows upstream of the potential target genes and compared it to the frequency of that element in the corresponding upstream window for all of the genes in that genome. To identify distributions that were statistically different from the background, we identified 50-bp windows that contained a disproportionate number of the cis-sequences in the target upstream regions compared to the background, using the hypergeometric distribution presented above, where i was the total number of elements identified upstream of the genes in each group, M was the number of those elements that fell within a given 50-bp window, l was the total number of elements upstream of all of the genes in that genome, and N was the number of those elements that fell within the same 50-bp window. We considered an element’s distribution to be significant if there was at least one 50-bp window with p < 0.01; only 5%–10% of the elements had distributions that met this criterion in gene groups other than their putative target genes.

We calculated the correlation between element positions in S. cerevisiae and each of the other species by taking all possible pairwise combinations of a cis-element’s positions in a given S. cerevisiae upstream region and in the orthologous region from other species and plotting these values for each group of coregulated genes (example scatter plots shown in Figures S4 and S5).
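The 50-bp window test described above can be sketched as a hypergeometric upper tail. The argument names below are illustrative and are not the symbols used in the text: the population is every element found upstream of any gene in the genome, the marked items are those that fall in a given window, the draws are the elements upstream of the genes in the group, and the observed count is the number of those that fall in the same window. Positions are assumed to be distances in bp upstream of the coding sequence, and the group is assumed to be a subset of the genome.

```python
from math import comb


def hypergeom_tail(population, marked, draws, observed):
    """P(X >= observed) when `draws` items are sampled without
    replacement from `population` items, `marked` of which carry
    the feature of interest."""
    upper = min(marked, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(marked, k) * comb(population - marked, draws - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total


def window_enrichment(group_positions, genome_positions,
                      window=50, span=1000):
    """Map each window start to the p-value for the group being
    enriched for elements in that window."""
    results = {}
    for start in range(0, span, window):
        end = start + window
        n_group = sum(1 for p in group_positions if start <= p < end)
        n_genome = sum(1 for p in genome_positions if start <= p < end)
        results[start] = hypergeom_tail(
            len(genome_positions), n_genome,
            len(group_positions), n_group,
        )
    return results


# Toy positions, not taken from the thesis.
genome = [10] * 4 + [120] * 6
group = [10, 10, 120]
pvals = window_enrichment(group, genome)
print(pvals[0], pvals[100])
```

The same `hypergeom_tail` function applies to the gene-level enrichment test if genes are counted instead of elements.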

Genes that contained sequences that matched the S. cerevisiae Cbf1p and Met31/32p position-weight matrices were identified in each species as described above. The average spacing between Cbf1p and Met31/32p binding sites within the 500-bp upstream regions of the methionine biosynthesis genes and of all of the genes in each genome was measured by calculating the distance between all pairwise combinations of the two motifs in each upstream region and taking the average spacing for the respective group of genes.
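A minimal sketch of the spacing measurement follows. Two details are assumptions, because the text does not specify them: distances are taken between site start positions, and every within-region pair is pooled across the group before averaging (an average of per-region averages would be an alternative reading). The function name `average_spacing` is hypothetical.

```python
from itertools import product


def average_spacing(regions):
    """Mean distance between two motifs over a group of genes.

    regions: iterable of (sites_a, sites_b) position lists, one
    pair per upstream region. Pairs are only formed within a
    region. Returns None if no region contains both motifs.
    """
    gaps = [abs(a - b)
            for sites_a, sites_b in regions
            for a, b in product(sites_a, sites_b)]
    return sum(gaps) / len(gaps) if gaps else None


# Toy positions, not taken from the thesis.
regions = [
    ([100], [120, 160]),   # one site of motif A, two of motif B
    ([300, 350], [310]),   # two sites of motif A, one of motif B
    ([50], []),            # motif B absent, contributes no pairs
]
print(average_spacing(regions))
```

Running the same function on the target group and on all genes in the genome gives the two averages that the comparison requires.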

Rpn4p matrix comparisons.<br />

To compare the upstream sequences identified in proteasome genes from S. cerevisiae and C. albicans, and to ensure that the identified sequences were not obtained by sampling bias, we performed the following permutation analysis. We ran MEME on the entire set of upstream regions of 26 proteasome genes with orthologs in both species, using the conservative one-per-sequence model. This produced a “meta-matrix” that identified exactly one putative binding site from each gene, leaving us with a set of exactly 52 sites. We calculated the likelihood-ratio statistic, testing the hypothesis that the sequences were drawn from a single multinomial against the alternative that they were drawn from multinomials estimated separately for each species. To test the significance of this statistic, we randomly divided the data into two equal-sized groups 10,000 times, recalculated the statistic, and found that matrix positions 2, 3, and 9 had values of p
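The permutation analysis can be sketched as follows, assuming the statistic is computed separately for each matrix position and that the multinomials are maximum-likelihood estimates without pseudocounts; neither detail is stated in the text, and the function names are hypothetical.

```python
import math
import random
from collections import Counter


def column_lrt(col_a, col_b):
    """Likelihood-ratio statistic for one matrix position: two
    separately estimated multinomials versus one pooled one."""
    pooled = Counter(col_a) + Counter(col_b)
    total = len(col_a) + len(col_b)
    stat = 0.0
    for col in (col_a, col_b):
        for base, count in Counter(col).items():
            ratio = (count / len(col)) / (pooled[base] / total)
            stat += 2 * count * math.log(ratio)
    return stat


def column(sites, p):
    return [s[p] for s in sites]


def permutation_pvalues(sites_a, sites_b, n_perm=10000, seed=0):
    """Per-position p-values from random equal-sized re-partitions."""
    rng = random.Random(seed)
    width = len(sites_a[0])
    observed = [column_lrt(column(sites_a, p), column(sites_b, p))
                for p in range(width)]
    exceed = [0] * width
    pooled = list(sites_a) + list(sites_b)
    half = len(sites_a)
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:half], pooled[half:]
        for p in range(width):
            # Small tolerance so ties count as "at least as extreme".
            if column_lrt(column(a, p), column(b, p)) >= observed[p] - 1e-12:
                exceed[p] += 1
    return [e / n_perm for e in exceed]


# Toy sites, not taken from the thesis: position 0 is identical in
# both groups, position 1 differs completely.
pvals = permutation_pvalues(["AC"] * 6, ["AG"] * 6, n_perm=2000)
print(pvals)
```

A position whose base composition differs between the two species gives a large observed statistic that few random partitions match, which is what a small p-value reports.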


Bergman CM, Kreitman M (2001) Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences. Genome Res 11: 1335–1345.

Blaiseau PL, Thomas D (1998) Multiple transcriptional activation complexes tether the yeast activator Met4 to DNA. EMBO J 17: 6327–6336.

Bussemaker HJ, Li H, Siggia ED (2001) Regulatory element detection using correlation with expression. Nat Genet 27: 167–171.

Chang YC, Penoyer LA, Kwon-Chung KJ (2001) The second STE12 homologue of Cryptococcus neoformans is MATa-specific and plays an important role in virulence. Proc Natl Acad Sci U S A 98: 3258–3263.

Chen D, Toone WM, Mata J, Lyne R, Burns G et al. (2003) Global transcriptional responses of fission yeast to environmental stress. Mol Biol Cell 14: 214–229.

Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ et al. (2003) Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 31: 3497–3500.

Chiang DY, Moses AM, Kellis M, Lander ES, Eisen MB (2003) Phylogenetically and spatially conserved word pairs associated with gene-expression changes in yeasts. Genome Biol 4: R43.

Chiang TY, Rai R, Cooper TG, Marzluf GA (1994) DNA binding site specificity of the Neurospora global nitrogen regulatory protein NIT2: Analysis with mutated binding sites. Mol Gen Genet 245: 512–516.

Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B et al. (2003) Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301: 71–76.

Costanzo MC, Crawford ME, Hirschman JE, Kranz JE, Olsen P et al. (2001) YPD, PombePD and WormPD: Model organism volumes of the BioKnowledge library, an integrated resource for protein information. Nucleic Acids Res 29: 75–79.

Crooks GE, Hon G, Chandonia JM, Brenner SE (2004) WebLogo: A sequence logo generator. Genome Res 14: 1188–1190.

De Veylder L, Joubes J, Inze D (2003) Plant cell cycle transitions. Curr Opin Plant Biol 6: 536–543.

DeGregori J (2002) The genetics of the E2F family of transcription factors: Shared functions and unique roles. Biochim Biophys Acta 1602: 131–150.

DeRisi JL, Iyer VR, Brown PO (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278: 680–686.

Dermitzakis ET, Clark AG (2002) Evolution of transcription factor binding sites in mammalian gene regulatory regions: Conservation and turnover. Mol Biol Evol 19: 1114–1121.

Dermitzakis ET, Bergman CM, Clark AG (2003) Tracing the evolutionary history of Drosophila regulatory regions with models that identify transcription factor binding sites. Mol Biol Evol 20: 703–714.

Dietrich FS, Voegeli S, Brachat S, Lerch A, Gates K et al. (2004) The Ashbya gossypii genome as a tool for mapping the ancient Saccharomyces cerevisiae genome. Science 304: 304–307.


Ebbole DJ, Paluh JL, Plamann M, Sachs MS, Yanofsky C (1991) cpc-1, the general regulatory gene for genes of amino acid biosynthesis in Neurospora crassa, is differentially expressed during the asexual life cycle. Mol Cell Biol 11: 928–934.

Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95: 14863–14868.

Enjalbert B, Nantel A, Whiteway M (2003) Stress-induced gene expression in Candida albicans: Absence of a general stress response. Mol Biol Cell 14: 1460–1467.

Fields S, Herskowitz I (1985) The yeast STE12 product is required for expression of two sets of cell-type specific genes. Cell 42: 923–930.

Fry CJ, Farnham PJ (1999) Context-dependent transcriptional regulation. J Biol Chem 274: 29583–29586.

Fry CJ, Slansky JE, Farnham PJ (1997) Position-dependent transcriptional regulation of the murine dihydrofolate reductase promoter by the E2F transactivation domain. Mol Cell Biol 17: 1966–1976.

Galagan JE, Calvo SE, Borkovich KA, Selker EU, Read ND et al. (2003) The genome sequence of the filamentous fungus Neurospora crassa. Nature 422: 859–868.

Gasch AP, Eisen MB (2002) Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biol 3: RESEARCH0059.

Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB et al. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11: 4241–4257.

Gasch AP, Huang M, Metzner S, Botstein D, Elledge SJ et al. (2001) Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol Biol Cell 12: 2987–3003.

Gelfand MS, Koonin EV, Mironov AA (2000) Prediction of transcription regulatory sites in Archaea by a comparative genomic approach. Nucleic Acids Res 28: 695–705.

Gerber AP, Herschlag D, Brown PO (2004) Extensive association of functionally and cytotopically related mRNAs with Puf family RNA-binding proteins in yeast. PLoS Biol 2: E79.

Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B et al. (1996) Life with 6,000 genes. Science 274: 546, 563–567.

Gompel N, Carroll SB (2003) Genetic mechanisms and constraints governing the evolution of correlated traits in drosophilid flies. Nature 424: 931–935.

GuhaThakurta D, Stormo GD (2001) Identifying target sites for cooperatively binding factors. Bioinformatics 17: 608–621.

Guthrie C, Fink GR (2002) Guide to yeast genetics and molecular biology, Part B. Volume 350, Methods in enzymology. London: Academic Press. 623 p.

Hardison RC, Oeltjen J, Miller W (1997) Long human-mouse sequence alignments reveal novel regulatory elements: A reason to sequence the mouse genome. Genome Res 7: 959–966.

Heckman DS, Geiser DM, Eidell BR, Stauffer RL, Kardos NL et al. (2001) Molecular evidence for the early colonization of land by fungi and plants. Science 293: 1129–1133.


Hinman VF, Nguyen AT, Cameron RA, Davidson EH (2003) Developmental gene regulatory network architecture across 500 million years of echinoderm evolution. Proc Natl Acad Sci U S A 100: 13356–13361.

Hinnebusch AG (1986) The general control of amino acid biosynthetic genes in the yeast Saccharomyces cerevisiae. CRC Crit Rev Biochem 21: 277–317.

Hirsh AE, Fraser HB (2001) Protein dispensability and rate of evolution. Nature 411: 1046–1049.

Hughes JD, Estep PW, Tavazoie S, Church GM (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 296: 1205–1214.

Iyer VR, Horak CE, Scafe CS, Botstein D, Snyder M et al. (2001) Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409: 533–538.

Jones T, Federspiel NA, Chibana H, Dungan J, Kalman S et al. (2004) The diploid genome sequence of Candida albicans. Proc Natl Acad Sci U S A 101: 7329–7334.

Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES (2003) Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423: 241–254.

Kellis M, Birren BW, Lander ES (2004) Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428: 617–624.

Kruger E, Kloetzel PM, Enenkel C (2001) 20S proteasome biogenesis. Biochimie 83: 289–293.

Krylov DM, Wolf YI, Rogozin IB, Koonin EV (2003) Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res 13: 2229–2235.

Kudla B, Caddick MX, Langdon T, Martinez-Rossi NM, Bennett CF et al. (1990) The regulatory gene areA mediating nitrogen metabolite repression in Aspergillus nidulans. Mutations affecting specificity of gene activation alter a loop residue of a putative zinc finger. EMBO J 9: 1355–1364.

Kurtzman CP, Robnett CJ (2003) Phylogenetic relationships among yeasts of the “Saccharomyces complex” determined from multigene sequence analyses. FEMS Yeast Res 3: 417–432.

Lee PN, Callaerts P, De Couet HG, Martindale MQ (2003) Cephalopod Hox genes and the origin of morphological novelties. Nature 424: 1061–1065.

Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298: 799–804.

Lengeler KB, Davidson RC, D’Souza C, Harashima T, Shen WC et al. (2000) Signal transduction cascades regulating fungal development and virulence. Microbiol Mol Biol Rev 64: 746–785.

Lieb JD, Liu X, Botstein D, Brown PO (2001) Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat Genet 28: 327–334.

Limjindaporn T, Khalaf RA, Fonzi WA (2003) Nitrogen metabolism and virulence of Candida albicans require the GATA-type transcriptional activator encoded by GAT1. Mol Microbiol 50: 993–1004.

Loots GG, Locksley RM, Blankespoor CM, Wang ZE, Miller W et al. (2000) Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science 288: 136–140.


Lowndes NF, McInerny CJ, Johnson AL, Fantes PA, Johnston LH (1992) Control of DNA synthesis genes in fission yeast by the cell-cycle gene cdc10+. Nature 355: 449–453.

Ludwig MZ, Kreitman M (1995) Evolutionary dynamics of the enhancer region of even-skipped in Drosophila. Mol Biol Evol 12: 1002–1011.

Ludwig MZ, Patel NH, Kreitman M (1998) Functional analysis of eve stripe 2 enhancer evolution in Drosophila: Rules governing conservation and change. Development 125: 949–958.

Ludwig MZ, Bergman C, Patel NH, Kreitman M (2000) Evidence for stabilizing selection in a eukaryotic enhancer element. Nature 403: 564–567.

Lyons TJ, Gasch AP, Gaither LA, Botstein D, Brown PO et al. (2000) Genomewide characterization of the Zap1p zinc-responsive regulon in yeast. Proc Natl Acad Sci U S A 97: 7957–7962.

Magasanik B, Kaiser CA (2002) Nitrogen regulation in Saccharomyces cerevisiae. Gene 290: 1–18.

Malhotra P, Manohar CF, Swaminathan S, Toyama R, Dhar R et al. (1993) E2F site activates transcription in fission yeast Schizosaccharomyces pombe and binds to a 30-kDa transcription factor. J Biol Chem 268: 20392–20401.

Malmqvist M (1999) BIACORE: An affinity biosensor system for characterization of biomolecular interactions. Biochem Soc Trans 27: 335–340.

Mannhaupt G, Schnall R, Karpov V, Vetter I, Feldmann H (1999) Rpn4p acts as a transcription factor by binding to PACE, a nonamer box found upstream of 26S proteasomal and other genes in yeast. FEBS Lett 450: 27–34.

Marzluf GA (1997) Genetic regulation of nitrogen metabolism in the fungi. Microbiol Mol Biol Rev 61: 17–32.

McGuire AM, Hughes JD, Church GM (2000) Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes. Genome Res 10: 744–757.

McIntosh EM (1993) MCB elements and the regulation of DNA replication genes in yeast. Curr Genet 24: 185–192.

Moll T, Dirick L, Auer H, Bonkovsky J, Nasmyth K (1992) SWI6 is a regulatory subunit of two different cell cycle START-dependent transcription factors in Saccharomyces cerevisiae. J Cell Sci Suppl 16: 87–96.

Monod J, Jacob F (1961) Teleonomic mechanisms in cellular metabolism, growth, and differentiation. Cold Spring Harb Symp Quant Biol 26: 389–401.

Moses AM, Chiang DY, Kellis M, Lander ES, Eisen MB (2003) Position specific variation in the rate of evolution in transcription factor binding sites. BMC Evol Biol 3: 19.

Natarajan K, Meyer MR, Jackson BM, Slade D, Roberts C et al. (2001) Transcriptional profiling shows that Gcn4p is a master regulator of gene expression during amino acid starvation in yeast. Mol Cell Biol 21: 4347–4368.

Ogawa N, DeRisi J, Brown PO (2000) New components of a system for phosphate accumulation and polyphosphate metabolism in Saccharomyces cerevisiae revealed by genomic expression analysis. Mol Biol Cell 11: 4309–4321.


Pabo CO, Peisach E, Grant RA (2001) Design and selection of novel Cys2His2 zinc finger proteins. Annu Rev Biochem 70: 313–340.

Piano F, Parisi MJ, Karess R, Kambysellis MP (1999) Evidence for redundancy but not trans-factor-cis element coevolution in the regulation of Drosophila Yp genes. Genetics 152: 605–616.

Primig M, Williams RM, Winzeler EA, Tevzadze GG, Conway AR et al. (2000) The core meiotic transcriptome in budding yeasts. Nat Genet 26: 415–423.

Pritsker M, Liu YC, Beer MA, Tavazoie S (2004) Whole-genome discovery of transcription factor binding sites by network-level conservation. Genome Res 14: 99–108.

Qin ZS, McCue LA, Thompson W, Mayerhofer L, Lawrence CE et al. (2003) Identification of co-regulated genes through Bayesian clustering of predicted regulatory binding sites. Nat Biotechnol 21: 435–439.

Rajewsky N, Socci ND, Zapotocky M, Siggia ED (2002) The evolution of DNA regulatory regions for proteo-gamma bacteria by interspecies comparisons. Genome Res 12: 298–308.

Ren B, Cam H, Takahashi Y, Volkert T, Terragni J et al. (2002) E2F integrates cell cycle progression with DNA repair, replication, and G(2)/M checkpoints. Genes Dev 16: 245–256.

Rhodes D, Schwabe JW, Chapman L, Fairall L (1996) Towards an understanding of protein-DNA recognition. Philos Trans R Soc Lond B Biol Sci 351: 501–509.

Rockman MV, Wray GA (2002) Abundant raw material for cis-regulatory evolution in humans. Mol Biol Evol 19: 1991–2004.

Romano LA, Wray GA (2003) Conservation of Endo16 expression in sea urchins despite evolutionary divergence in both cis and trans-acting components of transcriptional regulation. Development 130: 4187–4199.

Russell SJ, Steger KA, Johnston SA (1999) Subcellular localization, stoichiometry, and protein levels of 26 S proteasome subunits in yeast. J Biol Chem 274: 21943–21952.

Rustici G, Mata J, Kivinen K, Lio P, Penkett CJ et al. (2004) Periodic gene expression program of the fission yeast cell cycle. Nat Genet 36: 809–817.

Ruvinsky I, Ruvkun G (2003) Functional tests of enhancer conservation between distantly related species. Development 130: 5133–5142.

Scemama JL, Hunter M, McCallum J, Prince V, Stellwag E (2002) Evolutionary divergence of vertebrate Hoxb2 expression patterns and transcriptional regulatory loci. J Exp Zool 294: 285–299.

Shakoury-Elizeh M, Tiedeman J, Rashford J, Ferea T, Demeter J et al. (2004) Transcriptional remodeling in response to iron deprivation in Saccharomyces cerevisiae. Mol Biol Cell 15: 1233–1243.

Simon I, Barnett J, Hannett N, Harbison CT, Rinaldi NJ et al. (2001) Serial regulation of transcriptional regulators in the yeast cell cycle. Cell 106: 697–708.

Simon PL, Kumar V, Lillquist JS, Bhatnagar P, Einstein R et al. (1993) Mapping of neutralizing epitopes and the receptor binding site of human interleukin 1 beta. J Biol Chem 268: 9771–9779.


Sinha S, Tompa M (2002) Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res 30: 5549–5560.

Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9: 3273–3297.

Stone JR, Wray GA (2001) Rapid evolution of cis-regulatory sequences via local point mutations. Mol Biol Evol 18: 1764–1770.

Takahashi H, Mitani Y, Satoh G, Satoh N (1999) Evolutionary alterations of the minimal promoter for notochord-specific Brachyury expression in ascidian embryos. Development 126: 3725–3734.

Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM (1999) Systematic determination of genetic network architecture. Nat Genet 22: 281–285.

Tazebay UH, Sophianopoulou V, Scazzocchio C, Diallinas G (1997) The gene encoding the major proline transporter of Aspergillus nidulans is upregulated during conidiospore germination and in response to proline induction and amino acid starvation. Mol Microbiol 24: 105–117.

Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673–4680.

Tripathi G, Wiltshire C, Macaskill S, Tournu H, Budge S et al. (2002) Gcn4 coordinates morphogenetic and metabolic responses to amino acid starvation in Candida albicans. EMBO J 21: 5448–5456.

True JR, Haag ES (2001) Developmental system drift and flexibility in evolutionary trajectories. Evol Dev 3: 109–119.

True JR, Carroll SB (2002) Gene co-option in physiological and morphological evolution. Annu Rev Cell Dev Biol 18: 53–80.

Vallim MA, Miller KY, Miller BL (2000) Aspergillus SteA (sterile12-like) is a homeodomain-C2/H2-Zn+2 finger transcription factor required for sexual reproduction. Mol Microbiol 36: 290–301.

van Helden J, Andre B, Collado-Vides J (1998) Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 281: 827–842.

Vashee S, Kodadek T (1995) The activation domain of GAL4 protein mediates cooperative promoter binding with general transcription factors in vivo. Proc Natl Acad Sci U S A 92: 10683–10687.

Wall DP, Fraser HB, Hirsh AE (2003) Detecting putative orthologs. Bioinformatics 19: 1710–1711.

Wang X, Greenberg JF, Chamberlin HM (2004) Evolution of regulatory elements producing a conserved gene expression pattern in Caenorhabditis. Evol Dev 6: 237–245.

Wang T, Stormo GD (2003) Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics 19: 2369–2380.

Warner JR (1999) The economics of ribosome biosynthesis in yeast. Trends Biochem Sci 24: 437–440.

Weng S, Dong Q, Balakrishnan R, Christie K, Costanzo M et al. (2003) Saccharomyces Genome Database (SGD) provides biochemical and structural information for budding yeast proteins. Nucleic Acids Res 31: 216–218.


Wilson AC, Maxson LR, Sarich VM (1974) Two types <strong>of</strong> molecular <strong>evolution</strong>. Evidence from studies <strong>of</strong><br />

interspecific hybridization. Proc Natl Acad Sci U S A 71: 2843–2847.<br />

Wilson AC, Carlson SS, White TJ (1977) Biochemical <strong>evolution</strong>. Annu Rev Biochem 46: 573–639.<br />

Wolfe SA, Greisman HA, Ramm EI, Pabo CO (1999) Analysis <strong>of</strong> zinc fingers optimized via phage display:<br />

Evaluating the utility <strong>of</strong> a recognition code. J Mol Biol 285: 1917-1934.<br />

Wolfe SA, Nekludova L, Pabo CO (2000) DNA recognition <strong>by</strong> Cys2His2 zinc finger proteins. Annu Rev<br />

Biophys Biomol Struct 29: 183–212.<br />

Wood V, Gwilliam R, Rajandream MA, Lyne M, Lyne R et al. (2002) The genome sequence <strong>of</strong><br />

Schizosaccharomyces pombe Nature 415: 871–880.<br />

Workman JL, Kingston RE (1992) Nucleosome core displacement in vitro via a metastable transcription<br />

factor-nucleosome complex. Science 258: 1780– 1784.<br />

Wray GA, Hahn MW, Abouheif E, Balh<strong>of</strong>f JP, Pizer M et al. (2003) The <strong>evolution</strong> <strong>of</strong> <strong>transcriptional</strong><br />

<strong>regulation</strong> in eukaryotes. Mol Biol Evol 20: 1377–1419.<br />

Xie Y, Varshavsky A (2001) RPN4 is a ligand, substrate, and <strong>transcriptional</strong> regulator <strong>of</strong> the 26S<br />

proteasome: A negative feedback circuit. Proc Natl Acad Sci U S A 98: 3056–3061.<br />

Yang Z (1997) PAML: A program package for phylogenetic analysis <strong>by</strong> maximum likelihood. Comput<br />

Appl Biosci 13: 555–556.<br />

Yoshimoto H, Saltsman K, Gasch AP, Li HX, Ogawa N et al. (2002) Genomewide analysis <strong>of</strong> gene<br />

expression regulated <strong>by</strong> the calcineurin/Crz1p signaling pathway in Saccharomyces cerevisiae J Biol Chem<br />

277: 31079–31088.<br />

Young LY, Lorenz MC, Heitman J (2000) A STE12 homolog is required for mating but dispensable for<br />

filamentation in Candida lusitaniae Genetics 155: 17–29.<br />

Yu H, Luscombe NM, Lu HX, Zhu X, Xia Y et al. (2004) Annotation transfer between genomes: Proteinprotein<br />

interologs and protein-DNA regulogs. Genome Res 14: 1107–1118.<br />

Zhao Y, Sohn JH, Warner JR (2003) Auto<strong>regulation</strong> in the biosynthesis <strong>of</strong> ribosomes. Mol Cell Biol 23:<br />

699–707.<br />



14. Evidence for losses and gains <strong>of</strong> functional binding sites<br />

Given the dramatic <strong>evolution</strong>ary dynamics that we observed over longer <strong>evolution</strong>ary<br />

distances, I sought to characterize the micro<strong>evolution</strong>ary processes at the level <strong>of</strong> the<br />

DNA sequence that are responsible for the network-level changes. As I noted in the<br />

context <strong>of</strong> the MONKEY results, even at short distances there are<br />

examples <strong>of</strong> binding sites that are functional and not conserved. In order to address<br />

whether gains and losses <strong>of</strong> individual binding sites could underlie the higher level<br />

changes, I decided to study these changes systematically. Thus having discussed at<br />

length the conservation <strong>of</strong> binding sites and methods to exploit this conservation, we now<br />

turn to <strong>evolution</strong>ary changes in binding sites and the part these changes play in the<br />

<strong>evolution</strong> <strong>of</strong> <strong>transcriptional</strong> regulatory networks.<br />

As I noted previously, we must remember that while conservation over <strong>evolution</strong><br />

implies function (perhaps not necessarily an understandable function) the lack <strong>of</strong><br />

conservation does not imply the lack <strong>of</strong> function. The absence <strong>of</strong> conservation in<br />

sequence alignment can be indicative <strong>of</strong> either a lack <strong>of</strong> constraint or changes in<br />

function over <strong>evolution</strong>. Thus the observation that binding sites that are functional in one<br />

species are not always conserved is consistent with several scenarios. Either the gene<br />

expression has changed and that change has become fixed in the population (e.g.,<br />

Shasikan et al. 1998), or gene expression has remained constant because a compensatory<br />

mutation added a new binding site or because the binding site was functionally redundant<br />

and its loss had no effect (Ludwig et al. 2000, Piano et al. 1999, Hersh & Carroll 2005).<br />

Regardless <strong>of</strong> the interpretation, a requirement for any such analysis is that we<br />



must first identify examples <strong>of</strong> these non-conserved functional binding sites. This is a<br />

challenge because non-functional sequences that are not binding sites, yet match the<br />

specificity <strong>of</strong> transcription factors, are frequent in the genome and these are not generally<br />

conserved. Further, we must account for the possibility that the binding sites are actually<br />

conserved but we have failed to identify them as such, because <strong>of</strong> the necessarily<br />

imperfect inferences we must make about their <strong>evolution</strong>ary history.<br />

Specific Examples<br />

The <strong>regulation</strong> <strong>of</strong> the cholesterol biosynthesis genes in mammals is an excellent example<br />

<strong>of</strong> a specific regulatory network that has evolved over relatively short time scales. This<br />

has led to very different responses to high cholesterol diets between, for example,<br />

humans and rodents. Aside from its clinical relevance, this system is attractive because<br />

<strong>of</strong> the relatively well-characterized regulatory proteins and binding sites that are<br />

responsible for the <strong>regulation</strong>. This system provides several examples <strong>of</strong> regulatory<br />

elements that have evolved since the divergence <strong>of</strong> the common ancestor <strong>of</strong> the rodents<br />

and humans. In order to study this <strong>evolution</strong> in more detail, we have obtained non-<br />

coding sequences from several primates, including lemur, for several <strong>of</strong> the important<br />

genes in the pathway.<br />

An excellent case study is the LXRE in the promoter <strong>of</strong> the CYP7A1 gene. In this<br />

case, in vivo and in vitro studies have shown that different transcription factors bind to<br />

this sequence in human and in mouse, leading to different patterns <strong>of</strong> activation and<br />

repression <strong>of</strong> the CYP7A1 gene under similar environmental conditions (Chen et al.<br />

1999). Comparison <strong>of</strong> the human and mouse sequences in this region <strong>of</strong> the promoter<br />

reveals few nucleotide substitutions, but these changes are responsible for the differential<br />

binding to the promoter. The pattern <strong>of</strong> <strong>evolution</strong> we observe is consistent with rapid<br />

<strong>evolution</strong> along the branch leading to all primates but lemur, suggesting that there was<br />

either positive selection or a relaxation <strong>of</strong> constraint in the common ancestor: while there<br />

are no changes between the mouse and lemur, there are 2 substitutions and one indel<br />

event along the lineage leading to humans, which, under the assumption <strong>of</strong> neutral<br />

<strong>evolution</strong>, we would expect to have the same branch length (see Figure 1).<br />

TGAGCTTGGTTGA-CAA 4.47493 Human<br />

TGAGCTTG-TTGACCAA -0.30334 Baboon<br />

TGAACTTGGTTGACCA 8.17303 Colobus<br />

TGAGCTTGGTTGACCA 7.26454 Marmoset<br />

TGAGCTTGGTTGACCA 7.26454 Owl Monkey<br />

TGAGCTTTGTTGACCA 7.12248 Squirrel Monkey<br />

TGAACTTGGGTGACCA 10.6822 Lemur<br />

TGAACTTGGGTGACCA 10.6822 Mouse<br />

*** *** * *** **<br />

Figure 1<br />

The CYP7A1 LXRE in alignment <strong>of</strong> primates<br />

The LXRE in the CYP7A1 promoter that is functional in rodents, but not in human. There is an apparent<br />

reduction in constraint since the common ancestor <strong>of</strong> the Lemur and the other primates. Substitution or<br />

deletion events inferred <strong>by</strong> parsimony are indicated <strong>by</strong> circles on the appropriate branches. The score <strong>of</strong> the<br />

sequence on the LXRE matrix is included to the right <strong>of</strong> the sequence for each species.<br />

While the LXRE in the CYP7A1 promoter exemplifies an <strong>evolution</strong>ary change in<br />

<strong>transcriptional</strong> <strong>regulation</strong> due to a cis-regulatory change, it has also been suggested that<br />

functional cis-regulatory changes may occur with no corresponding change in function<br />

(Ludwig et al. 2000, Piano et al. 1999).<br />

An example <strong>of</strong> binding site <strong>evolution</strong> without a change in expression pattern is found in<br />

the PPRE in the LXRa promoter. Here both the human and mouse promoters are<br />

regulated <strong>by</strong> the PPARa RXR heterodimer (Laffitte et al. 2001), but the binding site (the<br />

PPRE) is not composed <strong>of</strong> orthologous positions in the DNA sequence. This is consistent<br />

with so-called ‘binding site turnover’, where a new site has arisen to take the function <strong>of</strong><br />

the old. We can examine the <strong>evolution</strong> <strong>of</strong> this binding site over the primates in order to<br />

get a higher resolution view <strong>of</strong> this <strong>evolution</strong>ary change.<br />

A<br />

GTACAAAGTTCA 6.53315 Human<br />

GTACAAAGTTCA 6.53315 Baboon<br />

GTACAAAGTTCA 6.53315 Colobus<br />

GTATAAAGTTCA 2.89941 Dusky<br />

GTACAAAGTTCA 6.53315 Owl Monkey<br />

GTACAAAGTTCA 6.53315 Squirrel Monkey<br />

GCACAAAGTTCA 5.43068 Lemur<br />

TCAGAAAGTTGA -3.93158 Mouse<br />

* ****** *<br />

B<br />

GGTTTAGGATTT -4.56755 Human<br />

AGTTTAGGATTT -8.30522 Baboon<br />

GGTTTAGGATTT -4.56755 Colobus<br />

GGTTTAGGATTT -4.56755 Dusky<br />

GGTTTAGGATTT -4.56755 Owl Monkey<br />

GGTTTAGGATTT -4.56755 Squirrel Monkey<br />

GAGTCAGGATTT -11.9453 Lemur<br />

GGGCAAAGTTCA 8.98 Mouse<br />

* * *<br />

Figure 2.<br />

LXRa PPRE sites in an alignment <strong>of</strong> primates and mouse<br />

A: the functional human binding site. B: the functional mouse binding site. Inferred substitutions are<br />

indicated <strong>by</strong> circles and matrix scores are indicated beside the sequence for each species. Boxes indicate<br />

the direct repeat with a single spacer (DR1) structure <strong>of</strong> the PPAR alpha RXR heterodimer binding site.<br />

Once again, the pattern <strong>of</strong> substitutions is consistent with the functional observations.<br />

The PPRE that is functional in human shows few substitutions within the primate<br />



lineages, and many changes since the common ancestor <strong>of</strong> the primates and the<br />

mouse (Figure 2A). Similarly, the PPRE that is functional in mouse shows many changes<br />

along that lineage (Figure 2B). Thus the number <strong>of</strong> changes is consistent with an<br />

<strong>evolution</strong>ary change, but the number <strong>of</strong> changes does not imply anything about the<br />

direction <strong>of</strong> the functional change. However, this information combined with which<br />

species have strong similarity to the matrix allows the inference <strong>of</strong> the <strong>evolution</strong>ary<br />

changes. Interestingly, in this case the <strong>evolution</strong> seems to have occurred since the<br />

divergence <strong>of</strong> the common ancestor <strong>of</strong> rodent and primates, whereas in the case <strong>of</strong> the<br />

CYP7A1 LXRE the changes seem to have occurred after the divergence <strong>of</strong> the rest <strong>of</strong> the<br />

primates from the shared ancestor with lemur.<br />

These examples illustrate the two major types <strong>of</strong> <strong>evolution</strong>ary changes that we are<br />

interested in detecting. The former exemplifies a change at the level <strong>of</strong> the binding site<br />

that is accompanied <strong>by</strong> a corresponding change in <strong>transcriptional</strong> <strong>regulation</strong>, while the<br />

latter demonstrates that changes in functional binding sites may occur even while<br />

function remains constant. Further, these examples demonstrate that the number and<br />

phylogenetic position <strong>of</strong> changes are not sufficient to infer the functional scenario; rather,<br />

we must also consider what types <strong>of</strong> changes have occurred leading to binding sites on<br />

particular lineages.<br />

Statistics to detect binding site loss and gain in alignments <strong>of</strong> non-coding sequences<br />

In beginning to study functional changes in binding sites over <strong>evolution</strong>, we must first<br />

rule out the hypothesis that the binding site was present in the common ancestor, and has<br />

simply been conserved ever since (Dermitzakis et al. 2003). We must therefore develop<br />

statistical methods that classify binding sites as conserved or non-conserved.<br />

This is in contrast to the situation in the previous sections where the goal was<br />

to distinguish between binding sites and background sequences. The key point in<br />

developing these statistics is that the decision about whether the binding site is conserved<br />

should be made based on the inferred <strong>evolution</strong>ary history <strong>of</strong> the sequences, rather than a<br />

matrix similarity cut<strong>of</strong>f. Unless we have confidence that we have a representative <strong>of</strong><br />

every possible target site for a transcription factor, we cannot know how similar a sequence<br />

must be to a specificity matrix to be functional; the similarity cut<strong>of</strong>fs used are almost<br />

always arbitrary. When selecting binding sites the choice <strong>of</strong> a cut<strong>of</strong>f is not critical:<br />

including some spurious matches may add noise, or excluding some bona fide sites may<br />

throw out some data. However, if an arbitrary similarity cut<strong>of</strong>f is used in deciding<br />

whether a binding site has evolved or not, the interpretation <strong>of</strong> results based thereupon<br />

will be ambiguous – in that case, for example, it is not possible to know if binding sites<br />

are evolving into background sequences, or remaining binding sites, but slipping below<br />

the arbitrary threshold.<br />

A simple way to accomplish the classification based on <strong>evolution</strong>ary history is to<br />

first identify all binding sites with a p-value below some threshold in a single genome, to<br />

infer the <strong>evolution</strong>ary history <strong>of</strong> each position in the binding site (e.g., using parsimony),<br />

and then to compare the likelihood <strong>of</strong> each history under the model <strong>of</strong> binding site <strong>evolution</strong><br />

(e.g., HB) to the model <strong>of</strong> background or neutral <strong>evolution</strong> (e.g., HKY). For example,<br />

$$T_{\mathrm{parsimony}} = \log \frac{p(IH \mid HB)}{p(IH \mid HKY)},$$<br />

where IH represents the inferred history for the alignment <strong>of</strong> that binding site. Note that<br />



this statistic depends only on a single most parsimonious history.<br />
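The parsimony step can be illustrated with a minimal sketch of Fitch's small-parsimony algorithm applied to a single alignment column. The rooted four-species topology, the function name `fitch` and the example column (the first column of the Kr730 alignment shown later in this section) are assumptions made for illustration. The sketch only counts the changes in a most parsimonious history; it does not compute the likelihood ratio itself.

```python
# Assumed rooted topology: ((Dmel, Dsim), (Dyak, Dere)); tips are strings.
TREE = (("Dmel", "Dsim"), ("Dyak", "Dere"))

def fitch(node, column):
    """Return (candidate ancestral states, minimum number of changes) for a subtree.

    `column` maps each tip name to the base observed at one alignment position.
    """
    if isinstance(node, str):          # tip: its state is the observed base
        return {column[node]}, 0
    left, right = node
    left_states, left_changes = fitch(left, column)
    right_states, right_changes = fitch(right, column)
    shared = left_states & right_states
    if shared:                         # children agree: no change needed here
        return shared, left_changes + right_changes
    # children disagree: one change is required on a branch below this node
    return left_states | right_states, left_changes + right_changes + 1

column = {"Dmel": "G", "Dsim": "G", "Dyak": "G", "Dere": "A"}
root_states, n_changes = fitch(TREE, column)
print(root_states, n_changes)
```

Repeating this for every position of a candidate site gives one inferred history per column, which is the input that the likelihood ratio above would then score under the two evolutionary models.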

It is possible to formulate this in the probabilistic framework for binding site<br />

<strong>evolution</strong> described in part III, <strong>by</strong> conditioning the probabilities on the event that we have<br />

already observed a binding site in one <strong>of</strong> the genomes. We define<br />

$$T = \log \frac{p(Y, \ldots, Z \mid X, HB)}{p(Y, \ldots, Z \mid X, HKY)}.$$<br />

Here, as above, X,Y…Z represent the sequences in the alignment, and the likelihood <strong>of</strong><br />

the sequences is calculated <strong>by</strong> summing over all the possible ancestral states using<br />

Felsenstein’s algorithm (Felsenstein 1981). Unlike the statistic used above in MONKEY,<br />

we are now conditioning on the observation <strong>of</strong> one <strong>of</strong> the sequences, X, which we will<br />

choose to be the reference sequence, or the sequence that passed the threshold to be<br />

called a binding site. This means that the classification is done based on the pattern <strong>of</strong><br />

<strong>evolution</strong> conditioned on the observation that one <strong>of</strong> the sequences was a good match to<br />

the matrix; the initial similarity cut<strong>of</strong>f, then, should have little impact on the value <strong>of</strong> this<br />

statistic. We note that this can be re-written using Bayes’ theorem as<br />

$$\begin{aligned}
T &= \log \frac{p(X, Y, \ldots, Z \mid HB)\, p(X \mid HKY)}{p(X, Y, \ldots, Z \mid HKY)\, p(X \mid HB)} \\
&= \log \frac{p(X, Y, \ldots, Z \mid HB)}{p(X, Y, \ldots, Z \mid HKY)} - \log \frac{p(X \mid HB)}{p(X \mid HKY)} = S - \log \frac{\hat{p}(X \mid HB)}{p(X \mid HKY)},
\end{aligned}$$<br />

where the first term is familiar as the statistic used in MONKEY and the second term is<br />

similar to the single genome likelihood ratio, but takes into account the distance from the<br />

root to the reference node, X. As with the statistics used in MONKEY it is possible to<br />

compute the distribution <strong>of</strong> this statistic under various assumptions. Of particular interest<br />



in detecting non-conserved binding sites is the distribution <strong>of</strong> this statistic under the<br />

assumption that there was a conserved binding site, and using this to reject that<br />

hypothesis. This can be calculated exactly using the same methods described for<br />

MONKEY, but using the probabilities under the HB <strong>evolution</strong>ary model to generate the<br />

null (or background) distribution. In addition, we can use this statistic to provide a<br />

conservative test <strong>of</strong> binding site conservation <strong>by</strong> computing the probability <strong>of</strong> observing a<br />

score as large under the hypothesis that there was a match to the matrix, but it was<br />

evolving under the background (HKY) <strong>evolution</strong>ary model.<br />
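As a rough illustration of how the conditional statistic T can be computed, the sketch below applies Felsenstein's pruning algorithm to each column of a short alignment. It makes strong simplifying assumptions that are not part of the method described here: both the binding site model and the background model are replaced by an F81-style substitution process that differs only in its equilibrium frequencies (a motif column for the site model, uniform frequencies for the background), gaps are not handled, and all names (`f81_prob`, `conditional`, `t_statistic`) are hypothetical. Because this stand-in process is stationary, the marginal probability of the reference base X is simply its equilibrium frequency.

```python
import math

BASES = "ACGT"

def f81_prob(a, b, t, pi):
    """P(a -> b) along a branch of length t under an F81-style model (assumed stand-in)."""
    decay = math.exp(-t)
    return decay * (1.0 if a == b else 0.0) + (1.0 - decay) * pi[b]

def conditional(node, column, pi):
    """Pruning step: {base: P(tips below node | node has that base)}."""
    if isinstance(node, str):                       # tip, identified by name
        return {a: 1.0 if a == column[node] else 0.0 for a in BASES}
    result = {a: 1.0 for a in BASES}
    for child, t in node:                           # internal node: (child, length) pairs
        below = conditional(child, column, pi)
        for a in BASES:
            result[a] *= sum(f81_prob(a, b, t, pi) * below[b] for b in BASES)
    return result

def log_joint(tree, column, pi):
    """log p(X, Y, ..., Z | model), summing over all ancestral states."""
    root = conditional(tree, column, pi)
    return math.log(sum(pi[a] * root[a] for a in BASES))

def t_statistic(tree, alignment, ref, motif, background):
    """Sum over columns of log p(others | ref, site model) - log p(others | ref, background)."""
    total = 0.0
    for i, pi_site in enumerate(motif):
        column = {name: seq[i] for name, seq in alignment.items()}
        x = column[ref]
        # Conditioning on the reference base: joint / marginal, marginal = pi[x].
        site = log_joint(tree, column, pi_site) - math.log(pi_site[x])
        bg = log_joint(tree, column, background) - math.log(background[x])
        total += site - bg
    return total

# Toy example: one motif column that strongly prefers A, two species.
motif = [{"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}]
background = {b: 0.25 for b in BASES}
tree = (("X", 0.1), ("Y", 0.1))
t_conserved = t_statistic(tree, {"X": "A", "Y": "A"}, "X", motif, background)
t_diverged = t_statistic(tree, {"X": "A", "Y": "C"}, "X", motif, background)
```

In this toy case the conserved column gives a positive T and the diverged column a negative one, which mirrors the interpretation of the sign of T given in the text; the numerical values depend entirely on the assumed stand-in models.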

Figure 3.<br />

P-values for the T-statistic to detect binding site <strong>evolution</strong><br />

As described for the statistic used in MONKEY to identify conserved binding sites, it is possible to calculate<br />

p-values for the ‘T’ statistic described above to identify non-conserved binding sites. In this simulation,<br />

10000 binding sites were generated from the bcd matrix based on footprinted binding sites (Dan Pollard,<br />

personal communication, Bergman et al. 2005). They were evolved along the tree ((((dmel:0.01301,<br />

dsim:0.01011):0.04507, dyak:0.05366):0.11171, dpse:0.18951):0.12500, dvir:0.25924), and the 5132<br />

instances with a p-value in the dmel sequence less than 0.001 were assigned p-values using the exact<br />

method, and <strong>by</strong> ranking them and dividing <strong>by</strong> 5132. That these p-values agree, even though a cut<strong>of</strong>f was<br />



used to choose the binding sites suggests that the distribution is relatively independent <strong>of</strong> the single genome<br />

p-value cut<strong>of</strong>f chosen.<br />
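The rank-based p-values mentioned in this caption can be sketched as follows. The function name `empirical_p_values` is hypothetical, and the sketch assumes a lower-tail test (the p-value of a score is the fraction of simulated scores that are as small or smaller), which matches the way the T statistic is used later to flag non-conserved sites.

```python
import bisect

def empirical_p_values(scores):
    """Lower-tail empirical p-values: rank of each score divided by the sample size."""
    ordered = sorted(scores)
    n = len(ordered)
    # bisect_right counts how many simulated scores are <= the given score
    return [bisect.bisect_right(ordered, s) / n for s in scores]

simulated = [3.0, -1.0, 0.5, 2.0]
p_values = empirical_p_values(simulated)
print(p_values)
```

With a large simulated sample, such as the 5132 instances used for the figure, these ranks can be compared directly against the exactly computed p-values.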

Identifying evolving binding sites using the T statistic<br />

It is possible to search alignments <strong>of</strong> non-coding DNA to identify putative binding sites<br />

that are evolving rapidly. For example, the CYP7A1 LXRE that is functional in rodents,<br />

and conserved to Lemur has T= -1.85913, using the binding site in Lemur as the<br />

reference sequence. That the statistic is less than zero implies that the binding site<br />

appeared to be evolving more like the background model (HKY) than the model <strong>of</strong><br />

selective constraint (HB). As described above, it is possible to assign statistical<br />

significance to the score <strong>by</strong> calculating the probability <strong>of</strong> having observed a score as<br />

small or smaller assuming the binding site was evolving under the HB model. In this case<br />

the p-value associated was p=0.0149213.<br />

I searched alignments <strong>of</strong> known Drosophila enhancers to identify potential cases<br />

<strong>of</strong> binding site <strong>evolution</strong>. Figure 4 shows two non-conserved hb binding sites in the<br />

Kr730 enhancer (Hoch et al. 1990), which are within a region <strong>of</strong> the enhancer that has<br />

been shown to be bound <strong>by</strong> hunchback in vitro (Hoch et al. 1991). This demonstrates the<br />

utility <strong>of</strong> the T statistic in identifying candidates for cis-regulatory <strong>evolution</strong>.<br />

dmel GCATGATCATAAAAAGCAATTTGCTACAATTTA<br />

dsim GCATGATCATAAAAAGCAATTTGCTACAATTTA<br />

dyak GCATGGTCATAAAAAGCAATTTGCTACAATTTA<br />

dere ACATGATCATAAATTGCAATTTGCTACAATTTA<br />

**** ******* ******************<br />

dmel TATTTTTTT--GCTTTTCCTTCTTTTAAGCAT<br />

dsim TATTTTTTT--GCTTTTACTTCTTTTAAGCAG<br />

dyak TATTTTCCTATGCTTTTACCTCTTTAGAGCAG<br />

dere TATTTTCTTATGCTTTTACCTCTTTAAAGCAG<br />

****** * ****** * ***** ****<br />



Figure 4<br />

Non-conserved binding sites in the Kr 730 enhancer<br />

Two hunchback binding sites in D. melanogaster (p


stripe 2 enhancer (Small et al. 1991), aligned using the mlagan alignment algorithm<br />

(Brudno et al. 2003). In this case, there is almost certainly an orthologous binding site in<br />

the D. pseudoobscura sequence, but it has not been aligned to the appropriate sequence in<br />

D. melanogaster. This binding site had a T= –4.11328 (p=0.0017), and would therefore<br />

seem to be a good candidate for cis-regulatory <strong>evolution</strong>. In order to study binding site<br />

<strong>evolution</strong> systematically and describe the patterns <strong>of</strong> <strong>evolution</strong> quantitatively we must<br />

therefore understand the behaviors <strong>of</strong> the alignment algorithms and how <strong>of</strong>ten we can<br />

expect the binding sites to be incorrectly aligned.<br />

Controlling for alignment error<br />

Automated sequence alignment is one <strong>of</strong> the major accomplishments <strong>of</strong> computational<br />

molecular biology. Indeed, as genomic sequence data becomes available, manual<br />

alignments are increasingly impractical, and automated alignments must serve as the<br />

basic orthology map for all <strong>evolution</strong>ary inference. Assessing the impact <strong>of</strong> these<br />

automated alignments is therefore <strong>of</strong> much importance.<br />

Automated alignments are particularly difficult in the case <strong>of</strong> non-coding DNA<br />

where a sizeable fraction <strong>of</strong> bases can be expected to match between random (or non-<br />

homologous) sequences. A recent study showed that the accuracy <strong>of</strong> pairwise alignments<br />

<strong>of</strong> non-coding DNA varies considerably as a function <strong>of</strong> <strong>evolution</strong>ary distance (Pollard et<br />

al. 2004). It is therefore important to assess the impact <strong>of</strong> the alignment quality on<br />

phylogenetic analysis.<br />

Using a simulation <strong>of</strong> molecular <strong>evolution</strong> we showed that two important<br />

problems in phylogenetic analysis <strong>of</strong> non-coding DNA are pr<strong>of</strong>oundly affected <strong>by</strong> the<br />



quality <strong>of</strong> the alignments. For example, our estimates <strong>of</strong> non-coding divergence, and<br />

therefore estimates <strong>of</strong> the fraction <strong>of</strong> bases under constraint are alignment limited – at<br />

intermediate <strong>evolution</strong>ary distances (~1 subs. per site) we can accurately estimate these<br />

parameters using the true alignments, but once we use an alignment algorithm, our<br />

estimates become increasingly biased. This is in contrast to coding sequences, where<br />

saturation <strong>of</strong> sites in synonymous positions is thought to be limiting. This implies that<br />

estimates <strong>of</strong> the fraction <strong>of</strong> bases under constraint in non-coding DNA must be re-<br />

evaluated in the context <strong>of</strong> the divergence distance and specific alignment algorithms<br />

used.<br />

More important for the questions considered here is the reliability <strong>of</strong> automated<br />

sequence alignments with respect to transcription factor binding sites. By including<br />

transcription factor binding sites in our simulation, we showed that binding sites evolving<br />

under constant functional constraint are aligned incorrectly at relatively high frequency.<br />

This implies that estimates <strong>of</strong> conservation <strong>of</strong> functional elements must also be<br />

reevaluated in the context <strong>of</strong> the alignments and <strong>evolution</strong>ary distances considered. It has<br />

been suggested that using multiple alignments <strong>of</strong> more closely related species to span<br />

similar <strong>evolution</strong>ary distances can greatly improve alignment accuracy; we showed that if<br />

multiple species are used to span the same distances, we can accurately estimate<br />

divergences and identify conserved binding sites at much greater distances.<br />

In practice, however, we are limited <strong>by</strong> the species that are actually available.<br />

The sequencing <strong>of</strong> several closely related Drosophila species has provided us with<br />

several <strong>evolution</strong>ary comparisons to choose from. Using our simulation, we can show<br />

that in alignments <strong>of</strong> non-coding sequence <strong>of</strong> the four most closely related species (D.<br />



melanogaster, D. simulans, D. yakuba and D. erecta) which span a total <strong>of</strong> 0.45<br />

substitutions per synonymous site, we can expect our multiple alignment algorithm<br />

mlagan (Brudno et al. 2003) to align binding sites such that they overlap <strong>by</strong> at least<br />

one position more than 99% <strong>of</strong> the time. This is in contrast to comparisons which<br />

include D. pseudoobscura (such as the eve stripe 2 alignment shown above) in which<br />

binding sites can be expected to be aligned incorrectly ~20% <strong>of</strong> the time.<br />

Finally, because our simulations show that the binding sites will be<br />

overlapping but not necessarily exactly aligned in these sequences, I have developed a<br />

new version <strong>of</strong> the MONKEY heuristic to search for any groups <strong>of</strong> binding sites that<br />

overlap <strong>by</strong> at least 1 base pair. This is done <strong>by</strong> searching for matches in the unaligned<br />

versions <strong>of</strong> the single sequences, finding the best overlapping matches in the other<br />

sequences, and then excluding the region <strong>of</strong> the alignment that is spanned <strong>by</strong> these<br />

binding sites. In practice this can be done recursively and does not significantly change<br />

the performance <strong>of</strong> MONKEY.<br />
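The overlap criterion can be sketched as below. This is not the MONKEY implementation: it only shows, under the assumption that each match is reported as a start position and width in the unaligned sequence, how such matches can be mapped to alignment columns and tested for an overlap of at least one position. The helper names and the toy alignment rows are hypothetical.

```python
def column_index(aligned_row):
    """Alignment column of every ungapped base in one aligned row."""
    return [col for col, ch in enumerate(aligned_row) if ch != "-"]

def site_span(aligned_row, start, width):
    """First and last alignment column covered by a match found in the unaligned sequence."""
    cols = column_index(aligned_row)
    return cols[start], cols[start + width - 1]

def spans_overlap(a, b):
    """True if two (first, last) column spans share at least one column."""
    return a[0] <= b[1] and b[0] <= a[1]

row_ref = "ACGT--ACGT"    # toy aligned rows, '-' marks a gap
row_other = "AC--GTACGT"
ref_site = site_span(row_ref, start=2, width=4)
near_site = site_span(row_other, start=2, width=3)
far_site = site_span(row_other, start=0, width=2)
print(spans_overlap(ref_site, near_site), spans_overlap(ref_site, far_site))
```

Working in alignment columns rather than sequence coordinates is what allows matches that are shifted by a few bases or interrupted by gaps to be grouped together.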

Enrichment <strong>of</strong> non-conserved binding sites implies functional <strong>evolution</strong><br />

Equipped with statistical tests to detect binding site <strong>evolution</strong> and confidence that<br />

alignments <strong>of</strong> very closely related species will have extremely low error rates, we can<br />

look for evolving binding sites systematically in genome wide functional data. Figure 6<br />

shows the results <strong>of</strong> such an analysis on the zeste chromatin IP data. As expected, we see<br />

extremely strong enrichment <strong>of</strong> the zeste binding sites in D. melanogaster near the peak<br />

in signal on the array. In this case there are ~3 fold more binding sites immediately<br />

surrounding the peaks, implying a total <strong>of</strong> ~500 excess (functional) binding sites in these<br />

regions. Binding sites were defined as matches to the matrix in D. melanogaster with<br />

p


consistent with the action <strong>of</strong> purifying selection on the bound regions. I will return to this<br />

observation in the context <strong>of</strong> inferring selection.<br />

Most interesting from the perspective <strong>of</strong> binding site <strong>evolution</strong>, figure 6C shows<br />

that there is also an enrichment <strong>of</strong> non-conserved binding sites in this region. Once<br />

again, a significance threshold <strong>of</strong> p=0.025 was used, suggesting that fewer than 10<br />

conserved sites will pass the threshold. Regardless <strong>of</strong> the significance threshold chosen,<br />

there was an excess <strong>of</strong> ~30 non-conserved binding sites in the regions immediately<br />

adjacent to the peaks. This indicates that ~6% (30 <strong>of</strong> 500) <strong>of</strong> the functional zeste sites<br />

show patterns <strong>of</strong> <strong>evolution</strong> that are not consistent with constant selection to retain their<br />

function.<br />

[Figure 6, panels A, B and C: Number <strong>of</strong> zeste binding sites (vertical axis) plotted against position (horizontal axis, -3000 to 3000).]<br />

Figure 6<br />

Enrichment <strong>of</strong> conserved and non-conserved binding sites in the regions bound <strong>by</strong><br />

zeste<br />

A shows the number <strong>of</strong> binding sites (p


this allows me to provide statistical evidence for the hypothesis that binding site<br />

<strong>evolution</strong> occurred, it does not allow me to distinguish between different types <strong>of</strong><br />

<strong>evolution</strong>ary scenarios or test specific models <strong>of</strong> binding site <strong>evolution</strong>. For example, two<br />

more realistic scenarios I am interested in considering are 1) where most <strong>of</strong> the<br />

<strong>evolution</strong>ary history is consistent with <strong>evolution</strong> under functional constraint, but that the<br />

binding site has been lost on some subset <strong>of</strong> the lineages, or 2) most <strong>of</strong> the <strong>evolution</strong>ary<br />

history is consistent with <strong>evolution</strong> in the absence <strong>of</strong> the functional constraint, and that<br />

the binding site has appeared on some subset <strong>of</strong> the lineages.<br />

The T statistic will identify some such cases, but because it considers <strong>evolution</strong> over<br />

the entire tree, it will be extremely conservative in cases where, for example, the binding<br />

site has been lost on the lineage leading to a terminal branch (leaf). Once again, it is<br />

possible to formulate these scenarios in the probabilistic framework for binding site<br />

<strong>evolution</strong> that I have developed. In this case, however, instead <strong>of</strong> explicitly<br />

rejecting a null hypothesis, we are attempting to choose between a number <strong>of</strong> related<br />

models. From a statistical perspective this is a much more difficult problem particularly<br />

because the number <strong>of</strong> models is relatively large, and they have been selected from a<br />

potentially even larger set.<br />


Figure 7<br />

Evolutionary scenarios consistent with the observation <strong>of</strong> a non-conserved binding<br />

in D. melanogaster<br />

A and B show ‘gains’ <strong>of</strong> binding sites along the lineage leading to D. melanogaster. C-F show losses.<br />

Here a ‘1’ represents a binding site at that point on the tree, while a ‘0’ represents background sequence.<br />

Only scenarios consistent with one <strong>evolution</strong>ary change between binding site and background are<br />

considered.<br />

[Figure 7 panels: binding site (1) or background (0) at the tips Dmel, Dsim, Dyak, Dere.<br />

A: 1 0 0 0; B: 1 1 0 0; C: 1 1 0 0; D: 1 0 1 1; E: 1 1 0 1; F: 1 1 1 0.]<br />

Figure 7 shows the 6 <strong>evolution</strong>ary scenarios that are consistent with a non-conserved<br />

binding site that require only one change from the binding site ‘state’ to the background<br />

‘state.’ Because the <strong>evolution</strong>ary distances spanned here are small, it is reasonable to<br />

assume only one such event has occurred, although the process <strong>of</strong> binding site <strong>evolution</strong><br />



on a tree can be modeled explicitly, and we will return to that problem in a later section.<br />

There are an additional 6 scenarios requiring only one change, and these include gains<br />

and losses on the other lineages.<br />
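The count of scenarios can be checked with a short enumeration. The sketch assumes the rooted topology ((Dmel, Dsim), (Dyak, Dere)), which is inferred here from the tip patterns in Figure 7 rather than stated explicitly, and places a single gain or a single loss on each of the six branches. The function names are hypothetical.

```python
# Assumed rooted topology for the four species; tips are strings.
TREE = (("Dmel", "Dsim"), ("Dyak", "Dere"))

def leaves(node):
    """Names of all tips below a node."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]

def branch_leaf_sets(node):
    """Yield, for every branch of the rooted tree, the set of tips below it."""
    if isinstance(node, str):
        return
    for child in node:
        yield frozenset(leaves(child))
        yield from branch_leaf_sets(child)

def single_change_scenarios(tree):
    """List (kind, tip_states) for one gain or one loss placed on each branch."""
    tips = leaves(tree)
    scenarios = []
    for below in branch_leaf_sets(tree):
        for kind in ("gain", "loss"):
            # After a gain, only the tips below the branch carry the site (1);
            # after a loss, they are the only tips without it (0).
            states = tuple(int((tip in below) == (kind == "gain")) for tip in tips)
            scenarios.append((kind, states))
    return scenarios

scenarios = single_change_scenarios(TREE)
with_dmel_site = [s for s in scenarios if s[1][0] == 1]
print(len(scenarios), len(with_dmel_site))  # prints: 12 6
```

Under this assumed topology, the six scenarios in which D. melanogaster carries the site are two gains and four losses, in agreement with panels A to F of Figure 7, and the remaining six are the gains and losses on the other lineages.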

In order to distinguish between these scenarios, we can explicitly calculate the<br />

likelihood <strong>of</strong> each one using our <strong>evolution</strong>ary model. A similar approach has been<br />

applied to the detection <strong>of</strong> pseudogenes (Coin and Durbin 2004), and technically the<br />

method used is very similar. I calculate the likelihood <strong>of</strong> the <strong>evolution</strong> on each branch<br />

according to the state <strong>of</strong> the child node, and use the equilibrium frequencies associated<br />

with the state <strong>of</strong> the root node. For each binding site, I calculate the T-statistic to reject<br />

the hypothesis that it was conserved, and then calculate the likelihood <strong>of</strong> the data given<br />

the 12 <strong>evolution</strong>ary scenarios requiring only one change from binding site to background<br />

or vice versa, and assign the binding site to the most likely. In order to be unbiased, we<br />

now no longer require a binding site to pass the threshold in D. melanogaster, but rather<br />

consider all binding sites with p


Inferring the action <strong>of</strong> selection from patterns <strong>of</strong> binding site losses and gains<br />

Thus far we have developed methods to classify <strong>evolution</strong>ary patterns at the level<br />

<strong>of</strong> binding sites in order to provide systematic statistical evidence for the claim that<br />

binding sites are not simply conserved. The tools developed, however, allow us to infer<br />

the action <strong>of</strong> selection at the level <strong>of</strong> binding sites. I have shown previously that the<br />

individual nucleotides in binding sites are evolving slower than the surrounding<br />

sequences, which is consistent with purifying selection on the binding sites. Another<br />

prediction <strong>of</strong> the model that binding sites are evolving under purifying selection is that<br />

the binding sites in functional regions should be more likely to be conserved than the<br />

same binding sites that occur randomly in non-coding DNA.<br />



[Figure 8 plot, panels A and B: plot labels ‘gain on D mel lineage’ and ‘loss on D mel lineage’; horizontal axis ticks from -3000 to 3000.]<br />

Figure 8<br />

Enrichment <strong>of</strong> binding sites classified <strong>by</strong> <strong>evolution</strong>ary scenario<br />

A shows the total number (triangles) <strong>of</strong> non-conserved binding sites (p


melanogaster lineage (diamonds and squares, respectively), or gains or losses on the other lineages<br />

(triangles and x’s respectively).<br />

In the absence <strong>of</strong> selection, sequences randomly drift through sequence space turning<br />

from potential binding sites to background sequence and back according to some model<br />

<strong>of</strong> molecular <strong>evolution</strong>. In the presence <strong>of</strong> selection, purifying selection may act to<br />

preserve the sequences that are acting as binding sites, <strong>by</strong> removing from the population<br />

mutations that disrupt the binding sites, thus slowing the rate at which they convert into<br />

background sequence. We can observe this effect <strong>of</strong> purifying selection <strong>by</strong> looking at the<br />

fraction <strong>of</strong> the total binding sites that are conserved: the stronger the purifying selection,<br />

the greater the fraction <strong>of</strong> total binding sites that should be conserved. As alluded to<br />

earlier, using the T statistic to reject the hypothesis <strong>of</strong> background <strong>evolution</strong> is a<br />

conservative method to classify a binding site as conserved. Figure 9 shows this<br />

signature <strong>of</strong> selection in the zeste binding regions.<br />

[Figure 9 plot: horizontal axis from -3000 to 3000; vertical axis ‘fraction <strong>of</strong> total sites in D. mel’, from 0 to 0.6.]<br />

Figure 9<br />

Evidence for purifying selection on zeste binding sites<br />


The proportion <strong>of</strong> binding sites in D. melanogaster (p


Further, I have suggested a method to compare the relative frequencies <strong>of</strong> particular<br />

<strong>evolution</strong>ary scenarios, and shown that in these data losses <strong>of</strong> binding sites along the<br />

lineages other than D. melanogaster account for most <strong>of</strong> the functional non-conserved<br />

binding sites. Finally, I have noted that based on the patterns <strong>of</strong> conservation and<br />

<strong>evolution</strong> <strong>of</strong> binding sites detected here it is possible to detect the effects <strong>of</strong> selection on<br />

the functional binding sites relative to the binding sites occurring in surrounding regions.<br />

References<br />

Bergman CM, Carlson JW, Celniker SE. Drosophila DNase I footprint database: a systematic genome<br />

annotation <strong>of</strong> transcription factor binding sites in the fruitfly, Drosophila melanogaster. Bioinformatics.<br />

2005 Apr 15;21(8):1747-9.<br />

Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S; NISC<br />

Comparative Sequencing Program. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple<br />

alignment <strong>of</strong> genomic DNA. Genome Res. 2003 Apr;13(4):721-31.<br />

Chen J, Cooper AD, Levy-Wilson B. Hepatocyte nuclear factor 1 binds to and transactivates the human but<br />

not the rat CYP7A1 promoter. Biochem Biophys Res Commun. 1999 Jul 14;260(3):829-34<br />

Coin L, Durbin R. Improved techniques for the identification <strong>of</strong> pseudogenes. Bioinformatics. 2004 Aug<br />

4;20 Suppl 1:I94-I100.<br />

Dermitzakis ET, Bergman CM, Clark AG. Tracing the <strong>evolution</strong>ary history <strong>of</strong> Drosophila regulatory<br />

regions with models that identify transcription factor binding sites. Mol Biol Evol. 2003 May;20(5):703-<br />

14.<br />

Dermitzakis ET, Clark AG. Evolution <strong>of</strong> transcription factor binding sites in Mammalian gene regulatory<br />

regions: conservation and turnover. Mol Biol Evol. 2002 Jul;19(7):1114-21.<br />

Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 1981,<br />

17:368-376.<br />

Hersh BM, Carroll SB. Direct <strong>regulation</strong> <strong>of</strong> knot gene expression <strong>by</strong> Ultrabithorax and the <strong>evolution</strong> <strong>of</strong> cis-regulatory<br />

elements in Drosophila. Development. 2005 Apr;132(7):1567-77.<br />

Hoch M, Schroder C, Seifert E, Jackle H. cis-acting control elements for Kruppel expression in the<br />

Drosophila embryo. EMBO J. 1990 Aug;9(8):2587-95<br />

Hoch M, Seifert E, Jackle H. Gene expression mediated <strong>by</strong> cis-acting sequences <strong>of</strong> the Kruppel gene in<br />

response to the Drosophila morphogens bicoid and hunchback. EMBO J. 1991 Aug;10(8):2267-78.<br />

Laffitte BA, Repa JJ, Joseph SB, Wilpitz DC, Kast HR, Mangelsdorf DJ, Tontonoz P. LXRs control lipid-inducible<br />

expression <strong>of</strong> the apolipoprotein E gene in macrophages and adipocytes. Proc Natl Acad Sci U S<br />

A. 2001 Jan 16;98(2):507-12.<br />



Laney JD, Biggin MD. Redundant control <strong>of</strong> Ultrabithorax <strong>by</strong> zeste involves functional levels <strong>of</strong> zeste<br />

protein binding at the Ultrabithorax promoter. Development. 1996 Jul;122(7):2303-11.<br />

Laney JD, Biggin MD. zeste, a nonessential gene, potently activates Ultrabithorax transcription in the<br />

Drosophila embryo. Genes Dev. 1992 Aug;6(8):1531-41.<br />

Ludwig MZ, Bergman C, Patel NH, Kreitman M. Evidence for stabilizing selection in a eukaryotic<br />

enhancer element. Nature. 2000 Feb 3;403(6769):564-7.<br />

Ludwig MZ, Kreitman M. Evolutionary dynamics <strong>of</strong> the enhancer region <strong>of</strong> even-skipped in Drosophila.<br />

Mol Biol Evol. 1995 Nov;12(6):1002-11.<br />

Piano, F., M. J. Parisi, R. Karess, and M. P. Kam<strong>by</strong>sellis. Evidence for redundancy but not trans factor-cis<br />

element co<strong>evolution</strong> in the <strong>regulation</strong> <strong>of</strong> Drosophila Yp genes. Genetics 1999. 152:605-616.<br />

Pollard DA, Bergman CM, Stoye J, Celniker SE, <strong>Eisen</strong> MB. Benchmarking tools for the alignment <strong>of</strong><br />

functional noncoding DNA. BMC Bioinformatics. 2004 Jan 21;5(1):6.<br />

Shashikant CS, Kim CB, Borbely MA, Wang WCH, Ruddle FH. Comparative studies on<br />

mammalian Hoxc8 early enhancer sequence reveal a baleen whale-specific deletion <strong>of</strong> a cis-acting element.<br />

Proc Natl Acad Sci U S A. 1998;95:15446-15451.<br />

Small S, Kraut R, Hoey T, Warrior R, Levine M. Transcriptional <strong>regulation</strong> <strong>of</strong> a pair-rule stripe in<br />

Drosophila. Genes Dev. 1991 May;5(5):827-39.<br />



Part V<br />

Conclusions<br />



16. Insight from quantitative <strong>evolution</strong>ary analyses<br />

I have studied the <strong>evolution</strong> <strong>of</strong> <strong>transcriptional</strong> <strong>regulation</strong> at the molecular level <strong>by</strong><br />

examining the patterns <strong>of</strong> conservation and <strong>evolution</strong> <strong>of</strong> transcription factor binding sites.<br />

Based on models <strong>of</strong> <strong>evolution</strong> for these sequences, I have defined statistics to test for<br />

their conservation and <strong>evolution</strong> that have facilitated systematic analyses at the genome-<br />

scale. Because binding sites are a key component <strong>of</strong> <strong>transcriptional</strong> regulatory networks,<br />

understanding their <strong>evolution</strong> provides a key molecular mechanism for the <strong>evolution</strong> <strong>of</strong><br />

these networks and suggests that the <strong>evolution</strong> <strong>of</strong> binding sites may underlie important<br />

<strong>evolution</strong>ary processes.<br />

More generally, this detailed study <strong>of</strong> the <strong>evolution</strong>ary properties <strong>of</strong> a specific<br />

class <strong>of</strong> functional non-coding sequence is an important step in understanding the<br />

information encoded in non-coding sequences. I suggest that other such detailed<br />

characterizations will greatly improve our ability to detect other functional classes <strong>of</strong><br />

non-coding DNA, and understanding the constraints on their <strong>evolution</strong> will help us<br />

understand the function <strong>of</strong> these important molecules. This highlights the importance <strong>of</strong><br />

an understanding <strong>of</strong> <strong>evolution</strong> at the molecular level, not only for the sake <strong>of</strong><br />

understanding <strong>evolution</strong> - a major challenge in biology - but also because <strong>evolution</strong>ary<br />

information gives a unique perspective on the functions <strong>of</strong> biological molecules.<br />

Because the techniques developed here exploit the relationship between the<br />

structural-functional constraints on biological sequences and the <strong>evolution</strong>ary constraints<br />

on those sequences, where we have some knowledge <strong>of</strong> the former, these types <strong>of</strong><br />

approaches should be generally applicable, particularly as sequence data from closely<br />



related organisms continues to accumulate. Incorporating specific <strong>evolution</strong>ary<br />

information is a general and powerful approach to sequence analysis, regardless <strong>of</strong> the<br />

type <strong>of</strong> biological sequence considered.<br />

While analysis <strong>of</strong> the <strong>evolution</strong> <strong>of</strong> nucleotide and amino acid substitution<br />

processes has formed the cornerstone <strong>of</strong> molecular phylogenetic and <strong>evolution</strong>ary<br />

analysis since the inception <strong>of</strong> those disciplines, dramatic experimental and<br />

computational progress in molecular biology has led to a large number <strong>of</strong> motifs,<br />

domains and other elements that can be recognized in these sequences. Combined with<br />

the availability <strong>of</strong> mathematical tools and abundant computer power, we are entering an<br />

age where it will become possible to model the <strong>evolution</strong> <strong>of</strong> higher-order molecular<br />

characters, just as systematists have long studied the <strong>evolution</strong> <strong>of</strong> phenotypic characters.<br />

Because these molecular features can in many cases be predicted from primary sequence,<br />

the availability <strong>of</strong> complete genome sequences means that there are already enormous<br />

amounts <strong>of</strong> data to be analyzed. This dissertation provides an example <strong>of</strong> such an<br />

analysis: it focused on one example <strong>of</strong> such a higher-order molecular property, namely,<br />

the binding sites for sequence-specific transcription factors. It should be possible to<br />

glean the same types <strong>of</strong> <strong>evolution</strong>ary insights that we were able to gain here for all types<br />

<strong>of</strong> molecular features.<br />

This work also exemplifies the utility <strong>of</strong> quantitative modeling <strong>of</strong> biological<br />

phenomena. In order to perform analyses systematically, but still capture the subtlety and<br />

complexity <strong>of</strong> the <strong>evolution</strong> <strong>of</strong> transcription factor binding sites, realistic probabilistic<br />

models <strong>of</strong> their <strong>evolution</strong> have provided an invaluable tool. Once again, I suggest that<br />

the availability <strong>of</strong> such models for other classes <strong>of</strong> sequences, and more generally other<br />



molecular features and processes, will be critical if they are to be studied systematically<br />

and at the genomic scale. This seems to be particularly true in analyses that incorporate<br />

<strong>evolution</strong>ary information, where the descent from a common ancestor along a tree<br />

provides a reasonable approximation to the history <strong>of</strong> the molecules or features, but<br />

introduces complicated statistical relationships between the extant species.<br />

IN CLOSING, it is clear that the availability <strong>of</strong> non-coding sequence from many<br />

organisms has enabled the study <strong>of</strong> non-coding DNA and its <strong>evolution</strong>. The<br />

surprises and mysteries that await us present many exciting challenges and<br />

possibilities. Because <strong>of</strong> the incredible importance <strong>of</strong> <strong>transcriptional</strong> <strong>regulation</strong> in<br />

development and <strong>evolution</strong> <strong>of</strong> the animals, analysis <strong>of</strong> these sequences will yield<br />

enormous insight into our own <strong>evolution</strong> and development in the near future.<br />



15. Future directions<br />

The methods developed above have already yielded many exciting observations, and<br />

suggest that many fascinating hypotheses regarding binding site <strong>evolution</strong> will be<br />

systematically tested in the near future. Of particular interest will be to test hypotheses<br />

regarding compensation, which should be possible with the tools developed above. In the<br />

zeste data, however, there were few examples <strong>of</strong> binding site gains (<strong>by</strong> the criteria<br />

described above) and no correlation between the patterns <strong>of</strong> gains and losses that would<br />

suggest compensation could be observed. In addition to testing more hypotheses, there is<br />

much improvement possible in the methods described above. I now consider the next<br />

steps in modeling binding site <strong>evolution</strong>.<br />

Probabilistic models for changing functional constraints<br />

The probabilistic models that we have developed thus far make the approximation that<br />

for each branch in a phylogeny positions in multiple alignments are either binding sites or<br />

not. We are now interested in generalizing the models to account for dynamic<br />

changes in functional constraints over <strong>evolution</strong>. This is <strong>of</strong> particular interest<br />

given the evidence for gains and losses <strong>of</strong> functional binding sites described above, as<br />

well as the evidence for dramatic changes in <strong>transcriptional</strong> regulatory mechanisms<br />

observed over longer <strong>evolution</strong>ary distances.<br />

We seek to incorporate into our models a process that describes not only the<br />

changes from one base to another (e.g., A->T) but also changes in the functional<br />

constraints on the bases (binding site -> background sequence). Thus, we are considering<br />



an <strong>evolution</strong>ary process that switches between two constraint regimes - at any point in<br />

time the sequence may be evolving under a binding site <strong>evolution</strong>ary model or under<br />

some background or neutral model. In particular, a two-state continuous-time Markov<br />

model, which allows each position in a DNA sequence to be either part <strong>of</strong> a binding site or part<br />

<strong>of</strong> the background, seems appropriate. In addition, this seems like a natural<br />

extension <strong>of</strong> the two-component mixture formulation that is widely used in probabilistic<br />

motif finding.<br />

Figure 1<br />

Modeling changing <strong>evolution</strong>ary constraints, such as binding site gain or loss<br />

A. Schematic representation <strong>of</strong> changing constraints along an <strong>evolution</strong>ary tree (leaves A-G). A binding site appeared<br />

along the branch leading to the ancestor <strong>of</strong> E, F and G (filled circle). Evolution along those lineages (thick<br />

line) proceeds according to a different substitution model than the rest <strong>of</strong> the tree. B. Two-state<br />

continuous-time model for binding site loss and gain. Over <strong>evolution</strong>, w-mers can go from a background sequence state<br />

(represented as ‘0’) to binding sites (represented as ‘1’). The parameters <strong>of</strong> the model are the rate <strong>of</strong><br />

binding site gain (λ) and rate <strong>of</strong> binding site loss (µ).<br />
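For a two-state continuous-time model with gain rate λ and loss rate µ, the transition probabilities over a time t have a standard closed form. The sketch below is a minimal illustration; the function name and the rate values are hypothetical and are not estimates from the data.<br />

```python
import math


def two_state_transition_matrix(gain_rate, loss_rate, t):
    """Transition probabilities after time t for states 0 (background), 1 (site).

    Rows index the starting state and columns the ending state, for the rate
    matrix [[-gain, gain], [loss, -loss]].
    """
    total = gain_rate + loss_rate
    decay = math.exp(-total * t)
    pi_site = gain_rate / total        # equilibrium probability of state 1
    pi_background = loss_rate / total  # equilibrium probability of state 0
    return [
        [pi_background + pi_site * decay, pi_site * (1.0 - decay)],
        [pi_background * (1.0 - decay), pi_site + pi_background * decay],
    ]


# Hypothetical rates chosen only for illustration.
for t in (0.0, 0.5, 5.0):
    print(t, two_state_transition_matrix(gain_rate=0.1, loss_rate=1.0, t=t))
```

At t = 0 the matrix is the identity, and for large t both rows approach the equilibrium probabilities µ/(λ+µ) and λ/(λ+µ).<br />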

Modeling <strong>evolution</strong> <strong>of</strong> <strong>transcriptional</strong> <strong>regulation</strong> at short <strong>evolution</strong>ary distances<br />

At short <strong>evolution</strong>ary distances, where sequence can be aligned at the level <strong>of</strong> the DNA<br />



sequence, we are considering a model where each w-mer can change into any other, and<br />

there exists an unobserved hidden variable that indicates which selective regime the w-<br />

mer currently falls under. For notational simplicity let us consider the likelihood <strong>of</strong> an<br />

ancestral sequence A evolving into a new sequence X. Under the traditional probabilistic<br />

model <strong>of</strong> <strong>evolution</strong> (Felsenstein 1981) this would be written as<br />

$$L = p(\mathrm{data}) = \prod_{i=1}^{N-w} \sum_{A_i} p(A_i \mid \mathrm{data}_{\mathrm{above}}) \sum_{X_i} p(X_i \mid A_i)\, p(\mathrm{data}_{\mathrm{below}} \mid X_i),$$<br />

where i indexes the positions in the sequence, w is the length <strong>of</strong> the binding site,<br />

p(Ai|dataabove) and p(databelow|Xi) are the likelihoods <strong>of</strong> the rest <strong>of</strong> the tree above A and<br />

below X respectively, and p(Xi|Ai) is the substitution matrix that gives the probability <strong>of</strong><br />

observing the w-mer Xi given that the w-mer Ai was its ancestor.<br />
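The structure of this likelihood, a product over positions of nested sums over ancestral and descendant states, can be sketched in a few lines. The example below is a simplification under stated assumptions: each state is a single nucleotide rather than a w-mer, the substitution matrix is Jukes-Cantor, and all function names are hypothetical.<br />

```python
import math


def jukes_cantor(t):
    """4x4 matrix of p(X | A) under Jukes-Cantor (ASSUMED model).

    t is the branch length in expected substitutions per site.
    """
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
    return [[same if a == x else diff for x in range(4)] for a in range(4)]


def position_likelihood(p_above, p_sub, p_below):
    """sum_A p(A | data_above) * sum_X p(X | A) * p(data_below | X)."""
    return sum(
        p_above[a] * sum(p_sub[a][x] * p_below[x] for x in range(4))
        for a in range(4)
    )


def branch_likelihood(above_by_position, below_by_position, t):
    """Product over positions of the per-position likelihoods."""
    p_sub = jukes_cantor(t)
    likelihood = 1.0
    for p_above, p_below in zip(above_by_position, below_by_position):
        likelihood *= position_likelihood(p_above, p_sub, p_below)
    return likelihood


# Two positions; states are ordered A, C, G, T. The ancestor is uniform and the
# descendant is observed as 'A' at the first position and 'G' at the second.
above = [[0.25] * 4, [0.25] * 4]
below = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
print(branch_likelihood(above, below, t=0.1))
```

On a full tree the vectors `above` and `below` would themselves come from Felsenstein's recursion; here they are supplied directly to keep the sketch short.<br />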

We seek to extend this model to incorporate a new variable, m, that indicates the<br />

selective regime. We now have<br />

$$L = p(\mathrm{data}) = \prod_{i=1}^{N-w} \sum_{m_{A_i}} \sum_{A_i} p(A_i, m_{A_i} \mid \mathrm{data}_{\mathrm{above}}) \sum_{m_{X_i}} \sum_{X_i} p(X_i, m_{X_i} \mid A_i, m_{A_i})\, p(\mathrm{data}_{\mathrm{below}} \mid X_i, m_{X_i}),$$<br />

where the mi, the indicator variables at each position, retain the property that they can<br />

only depend on their parents, and the tree retains the property that the likelihood <strong>of</strong><br />

everything below can only depend on the variables at the parent.<br />

From a technical perspective this model possesses a very unattractive feature: it<br />

no longer treats the positions <strong>of</strong> the motif independently – the state space we must<br />

consider will now be on the order <strong>of</strong> 4^w. In fact the substitution matrix defined <strong>by</strong><br />

p(X,m|A,m) is a (2 × 4^w) <strong>by</strong> (2 × 4^w) matrix. Since the probabilities we seek are given <strong>by</strong> P =<br />

e^(Rt), we must consider how to calculate the matrix exponential <strong>of</strong> such an extremely large<br />



matrix. Even if this were possible, we must consider the time needed to run Felsenstein’s<br />

recursion on a tree with such a large state space.<br />
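To give a sense of scale, the size of the joint state space and of a dense substitution matrix can be tabulated for a few motif lengths. This is a back-of-the-envelope sketch; the function names and the assumption of 8 bytes per matrix entry are illustrative.<br />

```python
def joint_state_count(w):
    """Number of (w-mer, regime) states: 4**w sequences times 2 regimes."""
    return 2 * 4 ** w


def dense_matrix_bytes(w, bytes_per_entry=8):
    """Memory for a dense square matrix over the joint state space.

    ASSUMPTION: one 8-byte floating point number per entry.
    """
    n = joint_state_count(w)
    return n * n * bytes_per_entry


for w in (1, 4, 8, 12):
    print(w, joint_state_count(w), dense_matrix_bytes(w))
```

For w = 8 the state space already has 131,072 states, and a dense matrix over it would need 2^37 bytes, or roughly 137 GB, before any exponential is computed.<br />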

Despite the technical challenges, developing models <strong>of</strong> this kind would be<br />

extremely useful, not only for this application, but for many problems where it is<br />

reasonable to assume constraints may be changing over the course <strong>of</strong> <strong>evolution</strong>.<br />

Modeling <strong>evolution</strong> <strong>of</strong> transcription <strong>regulation</strong> at longer <strong>evolution</strong>ary distances<br />

As described above, beyond closely related species it is not possible to reliably align<br />

binding sites. In order to study <strong>evolution</strong> <strong>of</strong> <strong>transcriptional</strong> <strong>regulation</strong> over longer<br />

<strong>evolution</strong>ary distances we must consider methods that do not rely on alignments. At<br />

these distances, the non-coding portions <strong>of</strong> genomes can be treated as independent and at<br />

equilibrium. Because this model treats positions independently, in the absence <strong>of</strong><br />

selection, at equilibrium there is some constant probability <strong>of</strong> observing a binding site at<br />

that position. This means that each position is a Bernoulli trial, which implies that<br />

in any stretch <strong>of</strong> sequence <strong>of</strong> length L the number <strong>of</strong> binding sites is approximately Poisson distributed with<br />

parameter pL, as long as the probability <strong>of</strong> observing the binding site (p) is much<br />

smaller than 1.<br />
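The quality of the Poisson approximation when p is small can be checked numerically. The values of p and L below are hypothetical and chosen only to illustrate the comparison.<br />

```python
import math


def binomial_pmf(k, n, p):
    """Probability of k sites among n independent positions, each with probability p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(k, mean):
    """Probability of k sites under a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


# Hypothetical values: a 1000 bp stretch with per-position probability 0.002.
L = 1000
p = 0.002

for k in range(6):
    print(k, round(binomial_pmf(k, L, p), 5), round(poisson_pmf(k, p * L), 5))

largest_gap = max(
    abs(binomial_pmf(k, L, p) - poisson_pmf(k, p * L)) for k in range(11)
)
print("largest difference for k = 0..10:", largest_gap)
```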

We can also extend the model to the case <strong>of</strong> unaligned sequences without the<br />

equilibrium assumption <strong>by</strong> considering, instead <strong>of</strong> the indicator variable mi at each<br />

position i, the sum n = Σmi, which is simply the total number <strong>of</strong><br />

binding sites in each sequence. Once again, as long as we assume the number <strong>of</strong> binding<br />

sites will be much smaller than the total length <strong>of</strong> the regulatory region, and that the<br />



length <strong>of</strong> the regulatory region does not change over <strong>evolution</strong>, the number, n, follows a<br />

simple birth-death process, and the equilibrium distribution <strong>of</strong> n is Poisson with parameter<br />

equal to the rate <strong>of</strong> binding site gain divided <strong>by</strong> the rate <strong>of</strong> binding site loss (refs).<br />

Comparing this result to the one above immediately suggests an interesting<br />

property <strong>of</strong> transcription factor binding sites, i.e., the probability that we find them in<br />

non-coding sequence in the absence <strong>of</strong> selection depends only on the ratio <strong>of</strong> rate <strong>of</strong><br />

binding site gain to the rate <strong>of</strong> binding site loss, and not the magnitude <strong>of</strong> these rates.<br />

Further, as I have shown above, purifying selection acts <strong>by</strong> reducing the rate at which<br />

binding sites convert to background sequences. Because the equilibrium number <strong>of</strong><br />

binding sites depends only on the ratio <strong>of</strong> rates between binding site loss and gain,<br />

reducing the rate <strong>of</strong> loss through the action <strong>of</strong> purifying selection is sufficient to produce<br />

the large excess <strong>of</strong> binding sites observed in regulatory sequences.<br />
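The claim that the equilibrium depends only on the ratio <strong>of</strong> the rates can be verified numerically. The sketch below assumes a constant gain rate λ and a loss rate <strong>of</strong> nµ when n sites are present, and checks that a Poisson distribution with mean λ/µ satisfies detailed balance. Function names and rate values are hypothetical.<br />

```python
import math


def poisson_pmf(n, mean):
    """Probability of n binding sites under a Poisson distribution."""
    return math.exp(-mean) * mean ** n / math.factorial(n)


def equilibrium_mean(gain_rate, loss_rate):
    """Equilibrium mean number of sites; depends only on the ratio of rates."""
    return gain_rate / loss_rate


def detailed_balance_gap(gain_rate, loss_rate, max_n=20):
    """Largest violation of pi(n) * gain = pi(n + 1) * (n + 1) * loss.

    ASSUMPTION: gains occur at a constant rate and each of the n existing
    sites is lost independently at rate loss_rate.
    """
    mean = equilibrium_mean(gain_rate, loss_rate)
    return max(
        abs(poisson_pmf(n, mean) * gain_rate
            - poisson_pmf(n + 1, mean) * (n + 1) * loss_rate)
        for n in range(max_n)
    )


# Hypothetical rates: the same ratio at two different magnitudes, then a
# reduced loss rate standing in for the effect of purifying selection.
print(equilibrium_mean(0.02, 0.01), equilibrium_mean(0.2, 0.1))
print(equilibrium_mean(0.02, 0.002))
print(detailed_balance_gap(0.02, 0.01))
```

Scaling both rates <strong>by</strong> the same factor leaves the equilibrium unchanged, whereas lowering only the loss rate raises the expected number <strong>of</strong> sites.<br />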

[Figure 2 diagram: panel A shows a promoter with n = 1, 2 or 3 binding sites at different points over <strong>evolution</strong>; panel B shows states 0, 1, 2, 3, … with gain rate λ between successive states and loss rates µ, 2µ, 3µ, 4µ, ….]<br />


Figure 2<br />

Probabilistic <strong>evolution</strong>ary model for binding sites in un-alignable promoters.<br />

A. Schematic representation <strong>of</strong> the promoter <strong>of</strong> a single gene, as the number <strong>of</strong> binding sites (n) changes<br />

over <strong>evolution</strong>. B. Continuous-time Markov model for the number <strong>of</strong> binding sites. Once again there are<br />

two parameters, rate <strong>of</strong> binding site gain (λ) and rate <strong>of</strong> binding site loss (µ).<br />

In order to test for selection, we can estimate the equilibrium distribution, based on the<br />

genome frequency <strong>of</strong> binding sites, for example, and try to reject the hypothesis that the<br />

binding sites in a particular stretch <strong>of</strong> sequence are distributed according to this<br />

distribution. Indeed, the methods described in part II that search for sequences that are<br />

statistically enriched are taking advantage <strong>of</strong> precisely this principle. In order to address<br />

<strong>evolution</strong>ary questions, distantly related species whose non-coding regions show no<br />

similarity at the sequence level can also be assumed to be at equilibrium with respect to<br />

this model. Thus, we can test for selection on sequences <strong>of</strong> interest in each genome<br />

independently. Where we find evidence for a constraint in multiple species, we can infer<br />

that the common ancestor shared the constraint. This is precisely the principle we<br />

have applied in studying <strong>evolution</strong> <strong>of</strong> <strong>transcriptional</strong> <strong>regulation</strong> in the ascomycete fungi.<br />
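A test <strong>of</strong> this kind can be sketched as a Poisson tail probability: given the expected number <strong>of</strong> sites in a region under the equilibrium model, how surprising is the observed count? The numbers below are hypothetical, and the function name is illustrative rather than a description <strong>of</strong> the methods in part II.<br />

```python
import math


def poisson_upper_tail(observed, mean):
    """P(N >= observed) for N ~ Poisson(mean)."""
    below = sum(
        math.exp(-mean) * mean ** k / math.factorial(k)
        for k in range(observed)
    )
    return max(0.0, 1.0 - below)


# Hypothetical inputs: genome-wide per-position site frequency, region length,
# and the number of sites actually observed in the region of interest.
genome_frequency = 0.002
region_length = 1000
observed_sites = 8

expected_sites = genome_frequency * region_length
p_value = poisson_upper_tail(observed_sites, expected_sites)
print("expected:", expected_sites, "observed:", observed_sites, "p =", p_value)
```

A small tail probability would lead us to reject the equilibrium distribution for that region; applying the same calculation independently in each genome follows the logic described above.<br />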

However, in the context <strong>of</strong> the probabilistic model it is clear that we can test much<br />

more specific hypotheses about binding site <strong>evolution</strong> than simply rejecting the<br />

equilibrium distribution in the absence <strong>of</strong> selection. Indeed, it should be possible to<br />

obtain reconstructions <strong>of</strong> the ancestral states, and test for lineage specific changes as with<br />

any <strong>evolution</strong>ary character.<br />

Models for <strong>evolution</strong> <strong>of</strong> transcription networks.<br />



While I have developed methods and tools for understanding the <strong>evolution</strong> <strong>of</strong><br />

transcription factor binding sites and shown that their <strong>evolution</strong> can reflect the <strong>evolution</strong><br />

<strong>of</strong> the network, the problem <strong>of</strong> understanding how selection acts on transcription<br />

networks remains a challenge for the future. Understanding the <strong>evolution</strong> <strong>of</strong> transcription<br />

networks would be <strong>of</strong> great interest both for their role in generating the diversity <strong>of</strong> extant<br />

organisms and because understanding how <strong>evolution</strong> has designed these networks<br />

may teach us how to better manipulate and design them for medical and technological<br />

benefit.<br />

