Initial sequencing and analysis of the human genome - Vitagenes
Table 22 Properties of the IGI/IPI human protein set

| Source | Number | Average length (amino acids) | Matches to nonhuman proteins | Matches to RIKEN mouse cDNA set | Matches to RIKEN mouse cDNA set but not to nonhuman proteins |
|---|---|---|---|---|---|
| RefSeq/SwissProt/TrEMBL | 14,882 | 469 | 12,708 (85%) | 11,599 (78%) | 776 (36%) |
| Ensembl-Genie | 4,057 | 443 | 2,989 (74%) | 3,016 (74%) | 498 (47%) |
| Ensembl | 12,839 | 187 | 81,126 (63%) | 7,372 (57%) | 1,449 (31%) |
| Total | 31,778 | 352 | 23,813 (75%) | 21,987 (69%) | 2,723 (34%) |

The matches to nonhuman proteins were obtained by using Smith-Waterman sequence alignment with an E-value threshold of 10⁻³ and the matches to the RIKEN mouse cDNAs by using TBLASTN with an E-value threshold of 10⁻⁶. The last column shows that a significant number of the IGI members that do not have nonhuman protein matches do match sequences in the RIKEN mouse cDNA set, suggesting that both the IGI and the RIKEN sets contain a significant number of novel proteins.

newly discovered genes arising from independent work that were not used in our gene prediction effort. We identified 31 such genes: 22 recent entries to RefSeq and 9 from the Sanger Centre's gene identification program on chromosome X.
Of these, 28 were contained in the draft genome sequence and 19 were represented in the IGI/IPI. This suggests that the gene prediction process has a sensitivity of about 68% (19/28) for the detection of novel genes in the draft genome sequence and that the current IGI contains about 61% (19/31) of novel genes in the human genome. On average, 79% of each gene was detected. The extent of fragmentation could also be estimated: 14 of the genes corresponded to a single prediction in the IGI/IPI, three genes corresponded to two predictions, one gene to three predictions and one gene to four predictions. This corresponds to a fragmentation rate of about 1.4 gene predictions per true gene.

Comparison with RIKEN mouse cDNAs. In a less direct but larger-scale approach, we compared the IGI gene set to a set of mouse cDNAs sequenced by the Genome Exploration Group of the RIKEN Genomic Sciences Center (ref. 309). This set of 15,294 cDNAs, subjected to full-insert sequencing, was enriched for novel genes by selecting cDNAs with novel 3′ ends from a collection of nearly one million ESTs from diverse tissues and developmental timepoints.
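The sensitivity and fragmentation figures quoted above for the 31 independently discovered genes follow from simple ratios. The sketch below reproduces them from the counts given in the text; the function and variable names are illustrative and do not come from the paper.

```python
# Counts quoted in the text for the 31 independently discovered genes.
NEW_GENES_TOTAL = 31      # 22 RefSeq entries + 9 chromosome X genes
NEW_GENES_IN_DRAFT = 28   # contained in the draft genome sequence
NEW_GENES_IN_IGI = 19     # represented in the IGI/IPI

# Number of IGI/IPI predictions a detected gene was split into,
# mapped to how many genes showed that split (from the text).
PREDICTIONS_PER_GENE = {1: 14, 2: 3, 3: 1, 4: 1}


def sensitivity(found, total):
    """Fraction of reference genes recovered by the prediction set."""
    return found / total


def fragmentation_rate(histogram):
    """Average number of predictions per detected gene."""
    genes = sum(histogram.values())
    predictions = sum(k * n for k, n in histogram.items())
    return predictions / genes


if __name__ == "__main__":
    print(f"sensitivity in draft sequence: {sensitivity(NEW_GENES_IN_IGI, NEW_GENES_IN_DRAFT):.0%}")
    print(f"coverage of novel genes:       {sensitivity(NEW_GENES_IN_IGI, NEW_GENES_TOTAL):.0%}")
    print(f"fragmentation rate:            {fragmentation_rate(PREDICTIONS_PER_GENE):.2f}")
```

The histogram sums to 19 genes and 27 predictions, so the fragmentation rate is 27/19, which rounds to the quoted 1.4.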
We determined the proportion of the RIKEN cDNAs that showed sequence similarity to the draft genome sequence and the proportion that showed sequence similarity to the IGI/IPI. Around 81% of the genes in the RIKEN mouse set showed sequence similarity to the human genome sequence, whereas 69% showed sequence similarity to the IGI/IPI. This suggests a sensitivity of 85% (69/81). This is higher than the sensitivity estimate above, perhaps because some of the matches may be due to paralogues rather than orthologues. It is consistent with the IGI/IPI representing a substantial fraction of the human proteome.

Conversely, 69% (22,013/31,898) of the IGI matches the RIKEN cDNA set. Table 22 shows the breakdown of these matches among the different components of the IGI. This is lower than the proportion of matches among known proteins, although this is expected because known proteins tend to be more highly conserved (see above) and because the predictions are on average shorter than known proteins. Table 22 also shows the numbers of matches to the RIKEN cDNAs among IGI members that do not match known proteins.
The results indicate that both the IGI and the RIKEN set contain a significant number of genes that are novel in the sense of not having known protein homologues.

Comparison with genes on chromosome 22. We also compared the IGI/IPI with the gene annotations on chromosome 22, to assess the proportion of gene predictions corresponding to pseudogenes and to estimate the rate of overprediction. We compared 477 IGI gene predictions to 539 confirmed genes and 133 pseudogenes on chromosome 22 (with the immunoglobulin lambda locus excluded owing to its highly atypical gene structure). Of these, 43 hit 36 annotated pseudogenes. This suggests that 9% of the IGI predictions may correspond to pseudogenes and also suggests a fragmentation rate of 1.2 gene predictions per gene. Of the remaining hits, 63 did not overlap with any current annotations. This would suggest a rate of spurious predictions of about 13% (63/477), although the true rate is likely to be much lower because many of these may correspond to unannotated portions of existing gene predictions or to currently unannotated genes (of which there are estimated to be about 100 on this chromosome; ref. 94).

Chromosomal distribution. Finally, we examined the chromosomal distribution of the IGI gene set.
The average density of gene predictions is 11.1 per Mb across the genome, with the extremes being chromosome 19 at 26.8 per Mb and chromosome Y at 6.4 per Mb. It is likely that a significant number of the predictions on chromosome Y are pseudogenes (this chromosome is known to be rich in pseudogenes) and thus that the density for chromosome Y is an overestimate. The density of both genes and Alus on chromosome 19 is much higher than expected, even accounting for the high GC content of the chromosome; this supports the idea that Alu density is more closely correlated with gene density than with GC content itself.

Summary. We are clearly still some way from having a complete set of human genes. The current IGI contains significant numbers of partial genes, fragmented and fused genes, pseudogenes and spurious predictions, and it also lacks significant numbers of true genes. This reflects the current state of gene prediction methods in vertebrates even in finished sequence, as well as the additional challenges related to the current state of the draft genome sequence. Nonetheless, the gene predictions provide a valuable starting point for a wide range of biological studies and will be rapidly refined in the coming year.

The analysis above allows us to estimate the number of distinct genes in the IGI, as well as the number of genes in the human genome. The IGI set contains about 15,000 known genes and about 17,000 gene predictions. Assuming that the gene predictions are subject to a rate of overprediction (spurious predictions and pseudogenes) of 20% and a rate of fragmentation of 1.4, the IGI would be estimated to contain about 24,500 actual human genes. Assuming that the gene predictions contain about 60% of previously unknown human genes, the total number of genes in the human genome would be estimated to be about 31,000. This is consistent with most recent estimates based on sampling, which suggest a gene number of 30,000–35,000. If there are 30,000–35,000 genes, with an average coding length of about 1,400 bp and average genomic extent of about 30 kb, then about 1.5% of the human genome would consist of coding sequence and one-third of the genome would be transcribed in genes.

The IGI/IPI was constructed primarily on the basis of gene predictions from Ensembl. However, we also generated an expanded set (IGI+) by including additional predictions from two other gene prediction programs, Genie and GenomeScan (C. Burge, personal communication).
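The gene-number estimate in the preceding paragraph can be retraced with a few lines of arithmetic. The sketch below is one plausible reading of that calculation, not a formula stated in the text: predictions are discounted for overprediction, divided by the fragmentation rate, and then scaled by the assumed 60% sensitivity for novel genes. The genome size of 3.2 Gb used for the coding and transcribed fractions is an assumption of this sketch and is not given in this passage; all function and parameter names are illustrative.

```python
def estimate_gene_counts(known=15_000, predictions=17_000,
                         overprediction=0.20, fragmentation=1.4,
                         novel_sensitivity=0.60):
    """Return (genes in the IGI, genes in the genome) under the stated assumptions."""
    # Discount spurious predictions and pseudogenes, then merge fragments.
    real_novel = predictions * (1 - overprediction) / fragmentation
    genes_in_igi = known + real_novel
    # Predictions are assumed to capture ~60% of previously unknown genes.
    genes_in_genome = known + real_novel / novel_sensitivity
    return genes_in_igi, genes_in_genome


def genome_fractions(n_genes, coding_bp=1_400, extent_bp=30_000,
                     genome_bp=3.2e9):  # genome_bp is an assumed value
    """Return (coding fraction, transcribed fraction) of the genome."""
    coding = n_genes * coding_bp / genome_bp
    transcribed = n_genes * extent_bp / genome_bp
    return coding, transcribed


if __name__ == "__main__":
    in_igi, in_genome = estimate_gene_counts()
    print(f"genes in IGI:    {in_igi:,.0f}")
    print(f"genes in genome: {in_genome:,.0f}")
    for n in (30_000, 35_000):
        coding, transcribed = genome_fractions(n)
        print(f"{n:,} genes: coding {coding:.1%}, transcribed {transcribed:.0%}")
```

With these inputs the sketch gives roughly 24,700 genes in the IGI and 31,200 in the genome, close to the quoted figures of about 24,500 and about 31,000. The small gap suggests the published estimate used slightly different rounding or inputs than this reading.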
These predictions were not included in the core IGI set, because of the concern that each additional set will provide diminishing returns in identifying true genes while contributing its own false positives (increased sensitivity at the expense of specificity). Genie produced an additional 2,837 gene predictions not overlapping the IGI, and GenomeScan produced 6,534 such gene predictions. If all of these gene predictions were included in the IGI, the number of the 31 new 'known' genes (see above) contained in the IGI would rise from 19 to 24. This would amount to an increase of about 26% in sensitivity, at the expense of increasing the number of predicted genes (excluding knowns) by 55%. Allowing a higher

900 © 2001 Macmillan Magazines Ltd NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
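The sensitivity-versus-specificity trade-off described for the expanded IGI+ set can be checked from the numbers in the text. The sketch below assumes the 55% figure is measured against the roughly 17,000 non-known gene predictions mentioned earlier in the section; the names are illustrative.

```python
# Figures quoted in the text.
GENIE_EXTRA = 2_837          # Genie predictions not overlapping the IGI
GENOMESCAN_EXTRA = 6_534     # GenomeScan predictions not overlapping the IGI
IGI_PREDICTIONS = 17_000     # approximate IGI gene predictions, excluding knowns
FOUND_BEFORE, FOUND_AFTER = 19, 24   # of the 31 new 'known' genes


def relative_increase(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


if __name__ == "__main__":
    gain = relative_increase(FOUND_BEFORE, FOUND_AFTER)
    extra = GENIE_EXTRA + GENOMESCAN_EXTRA
    cost = extra / IGI_PREDICTIONS
    print(f"relative gain in sensitivity: {gain:.0%}")
    print(f"extra predictions: {extra:,} ({cost:.0%} more)")
```

Five additional genes on a base of 19 is a relative gain of about 26%, bought with 9,371 extra predictions, about 55% more than the core set.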