12.07.2015 Views

Initial sequencing and analysis of the human genome - Vitagenes

Initial sequencing and analysis of the human genome - Vitagenes

Initial sequencing and analysis of the human genome - Vitagenes

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

articlescopies <strong>of</strong> <strong>the</strong> rDNA t<strong>and</strong>em repeats in <strong>the</strong> draft <strong>genome</strong> sequence,owing to <strong>the</strong> deliberate bias in <strong>the</strong> initial phase <strong>of</strong> <strong>the</strong> <strong>sequencing</strong>effort against <strong>sequencing</strong> BAC clones whose restriction fragment®ngerprints showed <strong>the</strong>m to contain primarily t<strong>and</strong>emly repeatedsequence. Sequence similarity <strong>analysis</strong> with <strong>the</strong> BLASTN computerprogram does, however, detect hundreds <strong>of</strong> rDNA-derived sequencefragments dispersed throughout <strong>the</strong> complete <strong>genome</strong>, includingone `full-length' copy <strong>of</strong> an individual 5.8S rRNA gene not associatedwith a true t<strong>and</strong>em repeat unit (Table 20).The 5S rDNA genes also occur in t<strong>and</strong>em arrays, <strong>the</strong> largest <strong>of</strong>which is on chromosome 1 between 1q41.11 <strong>and</strong> 1q42.13, close to<strong>the</strong> telomere 265,266 . There are 200±300 true 5S genes in <strong>the</strong>searrays 265,267 . The number <strong>of</strong> 5S-related sequences in <strong>the</strong> <strong>genome</strong>,including numerous dispersed pseudogenes, is classically cited as2,000 (refs 252, 254). The long t<strong>and</strong>em array on chromosome 1 isnot yet present in <strong>the</strong> draft <strong>genome</strong> sequence because <strong>the</strong>re are noEcoRI or HindIII sites present, <strong>and</strong> thus it was not cloned in <strong>the</strong>most heavily utilized BAC libraries (Table 1). We expect to recover itduring <strong>the</strong> ®nishing stage. We do detect four individual copies <strong>of</strong> 5SrDNA by our search criteria ($ 95% identity <strong>and</strong> $ 95% fulllength). We also ®nd many more distantly related dispersedsequences (520 at P # 0.001), which we interpret as probablepseudogenes (Table 20).Small nucleolar RNA genes. Eukaryotic rRNA is extensively processed<strong>and</strong> modi®ed in <strong>the</strong> nucleolus. Much <strong>of</strong> this activity isdirected by numerous snoRNAs. These come in two families: C/Dbox snoRNAs (mostly involved in guiding site-speci®c 29-O-ribosemethylations <strong>of</strong> o<strong>the</strong>r RNAs) <strong>and</strong> H/ACA snoRNAs (mostlyinvolved in guiding site-speci®c pseudouridylations) 247,248 . Wecompiled a set <strong>of</strong> 97 known <strong>human</strong> snoRNA gene sequences; 84<strong>of</strong> <strong>the</strong>se (87%) have at least one copy in <strong>the</strong> draft <strong>genome</strong> sequence(Table 20), almost all as single-copy genes.It is thought that all 29-O-ribose methylations <strong>and</strong> pseudouridylationsin eukaryotic rRNA are guided by snoRNAs. There are105±107 methylations <strong>and</strong> around 95 pseudouridylations in <strong>human</strong>rRNA 268 . Only about half <strong>of</strong> <strong>the</strong>se have been tentatively assigned toknown guide snoRNAs. There are also snoRNA-directedmodi®cations on o<strong>the</strong>r stable RNAs, such as U6 (ref. 269), <strong>and</strong><strong>the</strong> extent <strong>of</strong> this is just beginning to be explored. Sequencesimilarity has so far proven insuf®cient to recognize all snoRNAgenes. We <strong>the</strong>refore expect that <strong>the</strong>re are many unrecognizedsnoRNA genes that are not detected by BLAST queries.Spliceosomal RNAs <strong>and</strong> o<strong>the</strong>r ncRNA genes. We also looked forcopies <strong>of</strong> o<strong>the</strong>r known ncRNA genes. We found at least one copy <strong>of</strong>21 (95%) <strong>of</strong> 22 known ncRNAs, including <strong>the</strong> spliceosomalsnRNAs. There were multiple copies for several ncRNAs, asexpected; for example, we ®nd 44 dispersed genes for U6 snRNA,<strong>and</strong> 16 for U1 snRNA (Table 20).For some <strong>of</strong> <strong>the</strong>se RNA genes, homogeneous multigene familiesthat occur in t<strong>and</strong>em arrays are again under-represented owing to<strong>the</strong> restriction enzymes used in constructing <strong>the</strong> BAC libraries <strong>and</strong>,in some instances, <strong>the</strong> decision to delay <strong>the</strong> <strong>sequencing</strong> <strong>of</strong> BACclones with low complexity ®ngerprints indicative <strong>of</strong> t<strong>and</strong>emlyrepeated DNA. The U2 RNA genes are located at <strong>the</strong> RNU2 locus,a t<strong>and</strong>em array <strong>of</strong> 10±20 copies <strong>of</strong> nearly identical 6.1-kb units at17q21±q22 (refs 270±272). Similarly, <strong>the</strong> U3 snoRNA genes(included in <strong>the</strong> aggregate count <strong>of</strong> C/D snoRNAs in Table 20) areclustered at <strong>the</strong> RNU3 locus at 17p11.2, not in a t<strong>and</strong>em array, but ina complex inverted repeat structure <strong>of</strong> about 5±10 copies perhaploid <strong>genome</strong> 273 . The U1 RNA genes are clustered with about30 copies at <strong>the</strong> RNU1 locus at 1p36.1, but this cluster is thought tobe loose <strong>and</strong> irregularly organized; no two U1 genes have beencloned on <strong>the</strong> same cosmid 271 . In <strong>the</strong> draft <strong>genome</strong> sequence, we seesix copies <strong>of</strong> U2 RNA that meet our criteria for true genes, three <strong>of</strong>which appear to be in <strong>the</strong> expected position on chromosome 17. ForU3, so far we see one true copy at <strong>the</strong> correct place on chromosome17p11.2. For U1, we see 16 true genes, 6 <strong>of</strong> which are looselyclustered within 0.6 Mb at 1p36.1 <strong>and</strong> ano<strong>the</strong>r 6 are elsewhere onchromosome 1. Again, <strong>the</strong>se <strong>and</strong> o<strong>the</strong>r clusters will be a matter for<strong>the</strong> ®nishing process.Table 20 Known non-coding RNA genes in <strong>the</strong> draft <strong>genome</strong> sequenceRNA gene* Number expected² Number found³ Number <strong>of</strong>related genes§FunctiontRNA 1,310 497 324 Protein syn<strong>the</strong>sisSSU (18S) rRNA 150±200 0 40 Protein syn<strong>the</strong>sis5.8S rRNA 150±200 1 11 Protein syn<strong>the</strong>sisLSU (28S) rRNA 150±200 0 181 Protein syn<strong>the</strong>sis5S rRNA 200±300 4 520 Protein syn<strong>the</strong>sisU1 ,30 16 134 Spliceosome componentU2 10±20 6 94 Spliceosome componentU4 ?? 4 87 Spliceosome componentU4atac ?? 1 20 Component <strong>of</strong> minor (U11/U12) spliceosomeU5 ?? 1 31 Spliceosome componentU6 ?? 44 1,135 Spliceosome componentU6atac ?? 4 32 Component <strong>of</strong> minor (U11/U12) spliceosomeU7 1 1 3 Histone mRNA 39 processingU11 1 0 6 Component <strong>of</strong> minor (U11/U12) spliceosomeU12 1 1 0 Component <strong>of</strong> minor (U11/U12) spliceosomeSRP (7SL) RNA 4 3 773 Component <strong>of</strong> signal recognition particle (protein secretion)RNAse P 1 1 2 tRNA 59 end processingRNAse MRP 1 1 6 rRNA processingTelomerase RNA 1 1 4 Template for addition <strong>of</strong> telomereshY1 1 1 353 Component <strong>of</strong> Ro RNP, function unknownhY3 1 25 414 Component <strong>of</strong> Ro RNP, function unknownhY4 1 3 115 Component <strong>of</strong> Ro RNP, function unknownhY5 (4.5S RNA) 1 1 9 Component <strong>of</strong> Ro RNP, function unknownVault RNAs 3 3 1 Component <strong>of</strong> 13-MDa vault RNP, function unknown7SK 1 1 330 UnknownH19 1 1 2 UnknownXist 1 1 0 Initiation <strong>of</strong> X chromosome inactivation (dosage compensation)Known C/D snoRNAs 81 69 558 Pre-rRNA processing or site-speci®c ribose methylation <strong>of</strong> rRNAKnown H/ACA snoRNAs 16 15 87 Pre-rRNA processing or site-speci®c pseudouridylation <strong>of</strong> rRNA...................................................................................................................................................................................................................................................................................................................................................................* Known ncRNA genes (or gene families, such as <strong>the</strong> C/D <strong>and</strong> H/ACA snoRNA families); reference sequences were extracted from GenBank <strong>and</strong> used to probe <strong>the</strong> draft <strong>genome</strong> sequence.² Number <strong>of</strong> genes that were expected in <strong>the</strong> <strong>human</strong> <strong>genome</strong>, based on previous literature (note that earlier experimental techniques probably tend to overestimate copy number, by counting closely relatedpseudogenes).³ The copy number <strong>of</strong> `true' full-length genes identi®ed in <strong>the</strong> draft <strong>genome</strong> sequence.§ The copy number <strong>of</strong> o<strong>the</strong>r signi®cantly related copies (pseudogenes, fragments, paralogues) found. Except for <strong>the</strong> 497 true tRNA genes, all sequence similarities were identi®ed by WashU BLASTN 2.0MP(W. Gish, unpublished; http://blast.wustl.edu), with parameters `-kap wordmask = seg B = 50000 W = 8' <strong>and</strong> <strong>the</strong> default +5/-4 DNA scoring matrix. True genes were operationally de®ned as BLAST hitswith $ 95% identity over $ 95% <strong>of</strong> <strong>the</strong> length <strong>of</strong> <strong>the</strong> query. Related sequences were operationally de®ned as all o<strong>the</strong>r BLAST hits with P-values # 0.001.NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com © 2001 Macmillan Magazines Ltd895

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!