

Automatic Procedures for Compilation of Promoter sequences and their Evaluation based on Signal Content and Positional Distributions

Christoph D. Schmid¹, Viviane Praz¹,², Mauro Delorenzi¹,², Rouaida Perier¹, Philipp Bucher¹,²

¹ Swiss Institute of Bioinformatics (SIB), ² Swiss Institute of Experimental Cancer Research (ISREC), Ch. des Boveresses, CH-1066 Epalinges s/Lausanne (Switzerland)

Christoph.Schmid, Viviane.Praz, Mauro.Delorenzi, Rouaida.Perier, Philipp.Bucher@isb-sib.ch

Keywords: Bioinformatics, Computational tools, Functional motifs, Genomic organization, Regulation of transcription

Introduction:

Collections of precisely mapped eukaryotic transcription start sites (TSS) are important resources for studying gene control elements and for developing promoter prediction algorithms. The present study consists of two parts. First, we present a new clustering method to infer TSS positions from EST 5' ends. Second, we present a quality evaluation of the promoter sequences defined by TSS positions from the first part of the study.

Methods:

Briefly, TSS were determined using an in silico version of standard primer extension experiments. As input, cDNA sequences from carefully selected full-length sequencing projects were aligned to genomic sequences. For each gene, the genomic positions of the 5' ends of the cDNAs were recorded in a data structure called a "cDNA 5' end profile". A new program named "MADAP" was then applied to define zero to several significant TSS clusters, each representative of a promoter. This program attempts to model the cDNA 5' end profiles by zero to several Gaussian probability distributions.
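To make the idea of modelling a 5' end profile with Gaussian distributions more concrete, the following minimal Python sketch builds a profile from a list of mapped 5' end positions and fits a fixed number of Gaussians by expectation-maximization. It is not the MADAP implementation: the function names `build_profile` and `fit_gaussian_mixture`, the use of plain EM, the fixed component count `k`, the `min_sd` floor, and the toy positions are all assumptions made for illustration. In particular, the sketch does not decide how many clusters are significant, which is the task the text assigns to MADAP.

```python
import math
from collections import Counter


def build_profile(five_prime_ends):
    """Count how many cDNA 5' ends map to each genomic position."""
    return Counter(five_prime_ends)


def fit_gaussian_mixture(profile, k, n_iter=200, min_sd=0.5):
    """Fit k Gaussians to a 5' end profile with plain EM (illustrative only).

    Returns a list of (weight, mean, sd) tuples sorted by mean.
    """
    positions = sorted(profile)
    counts = [profile[p] for p in positions]
    total = sum(counts)

    # Deterministic start: spread the initial means evenly over the observed range.
    lo, hi = positions[0], positions[-1]
    means = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    sds = [max((hi - lo) / (2.0 * k), min_sd)] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: responsibility of each component for each position.
        resp = []
        for x in positions:
            dens = [
                w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                for w, m, s in zip(weights, means, sds)
            ]
            z = sum(dens) or 1e-300  # guard against total underflow
            resp.append([d / z for d in dens])

        # M-step: re-estimate weight, mean and sd, weighting positions by their counts.
        for j in range(k):
            nj = sum(c * r[j] for c, r in zip(counts, resp))
            if nj < 1e-9:
                continue  # component received no data; leave it unchanged
            means[j] = sum(c * r[j] * x for c, r, x in zip(counts, resp, positions)) / nj
            var = sum(
                c * r[j] * (x - means[j]) ** 2 for c, r, x in zip(counts, resp, positions)
            ) / nj
            sds[j] = max(math.sqrt(var), min_sd)
            weights[j] = nj / total

    return sorted(zip(weights, means, sds), key=lambda t: t[1])


# Toy profile (invented positions): two groups of 5' ends about 250 bp apart.
toy_ends = [1000] * 6 + [1001] * 10 + [1002] * 5 + [1250] * 4 + [1251] * 7 + [1252] * 3
for weight, mean, sd in fit_gaussian_mixture(build_profile(toy_ends), k=2):
    print(f"cluster centre {mean:.1f}, sd {sd:.2f}, share of 5' ends {weight:.2f}")
```

Each returned triple gives the share of 5' ends assigned to a cluster, the cluster centre (a candidate TSS position), and its spread.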

Results:

Promoter compilation:

Based on the approach described above, two novel promoter sequence sets were generated:

(i) TSS clusters based on 5' ESTs from the MGC project [1], and

(ii) TSS clusters based on the 5' ends of oligo-capped cDNAs from DBTSS [2].

For comparison, we used two existing promoter sets available from public websites:

(iii) promoters described in the literature and manually compiled in EPD [3], and

(iv) automatically compiled promoters from the PRESTA database [4].

                                                                               DBTSS     MGC  PRESTA     EPD
Genes with a minimum of 10 cDNA transcripts                                     1522    8806    N.A.    N.A.
Genes with at least one promoter found and mapped to a genomic EMBL sequence    1034    1038     424     255
Genes with 1 promoter                                                            971     965     292     239
Genes with 2 alternative promoters                                                57      69      63      10
Genes with 3 alternative promoters                                                 5       4       2       4
Genes with 4 alternative promoters                                                 1       0       0       2
Total number of promoters                                                       1104    1115     490     279
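The rows of the table are linked by simple arithmetic: the number of genes with at least one promoter should equal the sum of the "1" to "4" promoter rows, and the total number of promoters should equal the sum of those rows weighted by promoter count. The short Python sketch below recomputes both quantities from the values as printed; the dictionary and function names are illustrative. With these values both checks hold for DBTSS, MGC and EPD, while the PRESTA column yields 357 genes and 424 promoters against the listed 424 and 490. The source gives no explanation for this, so the sketch only reports the difference.

```python
# Genes with 1, 2, 3 and 4 promoters, copied from the table above.
genes_by_promoter_count = {
    "DBTSS": [971, 57, 5, 1],
    "MGC": [965, 69, 4, 0],
    "PRESTA": [292, 63, 2, 0],
    "EPD": [239, 10, 4, 2],
}
# Listed values: (genes with at least one promoter, total number of promoters).
listed = {
    "DBTSS": (1034, 1104),
    "MGC": (1038, 1115),
    "PRESTA": (424, 490),
    "EPD": (255, 279),
}


def genes_from_counts(counts):
    """Number of genes with at least one promoter."""
    return sum(counts)


def promoters_from_counts(counts):
    """Total promoters: genes with k promoters contribute k each."""
    return sum(k * n for k, n in enumerate(counts, start=1))


for name, counts in genes_by_promoter_count.items():
    computed = (genes_from_counts(counts), promoters_from_counts(counts))
    status = "matches" if computed == listed[name] else "differs from"
    print(f"{name}: computed {computed} {status} listed {listed[name]}")
```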


Promoter evaluation:

To evaluate the quality of the four promoter sets, we searched the sequences around the TSS positions for common promoter elements using the Signal Search Analysis server [5]. The positional occurrence frequencies of elements such as the Initiator, TATA-box, CCAAT-box, and GC-box, as well as the presence of CpG islands, were used as indicators of the percentage of correctly mapped promoters in each of the different sets. Note that for a single promoter sequence, the presence or absence of these elements is not significant, as these elements occur only in (variable) subsets of promoters [6]. For sets of promoters of equal quality, however, we expect a similar average signal content.
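A positional occurrence frequency can be illustrated with a few lines of Python. The sketch below counts, for every offset in a window upstream of the TSS, the fraction of sequences in which a motif starts at exactly that offset. It is a simplification and not the Signal Search Analysis server's method: the function name `positional_frequency`, the exact-string motif `TATAAA`, the 0-based offset convention (offset 0 is the TSS base itself), and the toy sequences are assumptions chosen for illustration.

```python
def positional_frequency(sequences, tss_index, motif, window):
    """Fraction of sequences in which `motif` starts at each offset from the TSS.

    sequences : promoter strings that all place the TSS at index `tss_index`
    window    : (first_offset, last_offset), inclusive, relative to the TSS
    """
    first, last = window
    profile = {}
    for offset in range(first, last + 1):
        start = tss_index + offset
        hits = sum(
            1
            for seq in sequences
            if start >= 0 and seq[start:start + len(motif)] == motif
        )
        profile[offset] = hits / len(sequences)
    return profile


# Toy promoter set (invented): the TSS is at index 40 in every sequence.
toy_promoters = [
    "G" * 10 + "TATAAA" + "G" * 24 + "A" + "G" * 9,  # motif starts at offset -30
    "C" * 12 + "TATAAA" + "C" * 22 + "A" + "C" * 9,  # motif starts at offset -28
    "C" * 10 + "TATAAA" + "C" * 24 + "A" + "C" * 9,  # motif starts at offset -30
    "G" * 50,                                        # no motif
]
profile = positional_frequency(toy_promoters, tss_index=40, motif="TATAAA", window=(-35, -20))
peak_offset = max(profile, key=profile.get)
print(peak_offset, profile[peak_offset])
```

Averaged over a whole promoter set, such a profile says nothing about any single sequence, but its height and position can be compared between sets.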

We find that the frequencies of the CCAAT-box and the GC-box are similar for all four data sets. Based on this criterion, all sets appear to be of comparable quality and reasonably pure, assuming that the manually compiled EPD collection contains only a few false entries. We also tried to estimate the precision of the initiation site mapping from the widths of the TATA-box and Initiator element peaks, since both elements are believed to occur at nearly fixed distances from the TSS [6]. The narrowest TATA-box peak appears in the signal occurrence profile for the DBTSS set, at the expected position -27. The same promoter set also contains the highest proportion of TSS that exactly coincide with an Initiator motif. These two tests distinguish the promoter set derived from DBTSS, which is based on oligo-capped cDNAs, as the most precisely mapped one, providing strong evidence for the effectiveness of this cloning technique in generating full-length cDNAs.
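The two precision tests described in this paragraph can be sketched in Python as follows. The first function measures the spread of a positional occurrence profile as a frequency-weighted standard deviation, and the second computes the fraction of sequences whose annotated TSS coincides exactly with an Initiator match. Both are illustrative stand-ins rather than the study's procedure: the weighted standard deviation as a width measure, the commonly cited Initiator consensus `YYANWYY` with its `A` placed at the TSS, the function names, and the toy data are all assumptions.

```python
import math
import re

# IUPAC codes needed for the assumed consensus: Y = C or T, W = A or T, N = any base.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "[CT]", "W": "[AT]", "N": "[ACGT]"}


def peak_spread(profile):
    """Frequency-weighted standard deviation of the offsets in a positional profile."""
    total = sum(profile.values())
    mean = sum(offset * freq for offset, freq in profile.items()) / total
    return math.sqrt(
        sum(freq * (offset - mean) ** 2 for offset, freq in profile.items()) / total
    )


def fraction_tss_on_initiator(sequences, tss_index, consensus="YYANWYY", a_offset=2):
    """Fraction of sequences where `consensus` matches with its A exactly at the TSS.

    a_offset is the 0-based position of the A within the consensus.
    """
    pattern = re.compile("".join(IUPAC[code] for code in consensus))
    start = tss_index - a_offset
    hits = 0
    for seq in sequences:
        if start >= 0 and pattern.fullmatch(seq, start, start + len(consensus)):
            hits += 1
    return hits / len(sequences)


# Toy profiles (invented): a sharp peak and a diffuse one, both centred on -27.
narrow = {-28: 0.1, -27: 0.8, -26: 0.1}
broad = {-30: 0.25, -28: 0.25, -26: 0.25, -24: 0.25}
print(peak_spread(narrow) < peak_spread(broad))

# Toy sequences (invented) with the annotated TSS at index 5.
toy = ["GGGCCATTCCGG", "GGGTCAGACTGG", "GGGGGAGGGGGG", "GGCCATTCCGGG"]
print(fraction_tss_on_initiator(toy, tss_index=5))
```

In the last toy sequence the Initiator-like motif is present but shifted by one base relative to the annotated TSS, so it does not count as a hit. This is the kind of small mapping error that the coincidence test is meant to expose.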

Remarkably, EPD appears to be enriched in TATA-box-containing promoters with a single initiation site. This may result from a publication bias against TATA-less promoters with diffuse TSS patterns in the period before 1990, when most EPD entries were collected.

In summary, the goal of this study is to evaluate the quality of automatically compiled promoter sets in comparison with the manually compiled promoters from EPD. A further contribution of this work is a new method to infer transcription start sites (TSS) from collections of cDNA 5' ends mapped to the same genome region.

Our results demonstrate that 5' end sequencing of oligo-capped cDNA libraries, in conjunction with our newly developed data processing tools, constitutes the most efficient, accurate, and reliable method for large-scale eukaryotic promoter mapping. We hope that these results will encourage public support for promoter mapping projects based on this technology for various organisms.

References

[1] Strausberg, R.L. et al. (2002) Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. Proc. Natl. Acad. Sci. U.S.A. 99, 16899-16903.

[2] Suzuki, Y., Yamashita, R., Nakai, K. and Sugano, S. (2002) DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucleic Acids Res. 30, 328-331.

[3] Praz, V., Perier, R., Bonnard, C. and Bucher, P. (2002) The Eukaryotic Promoter Database, EPD: new entry types and links to gene expression data. Nucleic Acids Res. 30, 322-324.

[4] Mach, V. (2002) PRESTA: associating promoter sequences with information on gene expression. Genome Biol. 3(9), research0050.1.

[5] Ambrosini et al. (2003) The Signal Search Analysis server. Nucleic Acids Res., in press.

[6] Bucher, P. (1990) Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol. 212, 563-578.
