
Initial sequencing and analysis of the human genome - Vitagenes


Table 22 Properties of the IGI/IPI human protein set

| Source | Number | Average length (amino acids) | Matches to nonhuman proteins | Matches to RIKEN mouse cDNA set | Matches to RIKEN mouse cDNA set but not to nonhuman proteins |
| --- | --- | --- | --- | --- | --- |
| RefSeq/SwissProt/TrEMBL | 14,882 | 469 | 12,708 (85%) | 11,599 (78%) | 776 (36%) |
| Ensembl–Genie | 4,057 | 443 | 2,989 (74%) | 3,016 (74%) | 498 (47%) |
| Ensembl | 12,839 | 187 | 8,126 (63%) | 7,372 (57%) | 1,449 (31%) |
| Total | 31,778 | 352 | 23,813 (75%) | 21,987 (69%) | 2,723 (34%) |

The matches to nonhuman proteins were obtained by using Smith-Waterman sequence alignment with an E-value threshold of 10^-3 and the matches to the RIKEN mouse cDNAs by using TBLASTN with an E-value threshold of 10^-6. The last column shows that a significant number of the IGI members that do not have nonhuman protein matches do match sequences in the RIKEN mouse cDNA set, suggesting that both the IGI and the RIKEN sets contain a significant number of novel proteins.

newly discovered genes arising from independent work that were not used in our gene prediction effort. We identified 31 such genes: 22 recent entries to RefSeq and 9 from the Sanger Centre's gene identification program on chromosome X.
Of these, 28 were contained in the draft genome sequence and 19 were represented in the IGI/IPI. This suggests that the gene prediction process has a sensitivity of about 68% (19/28) for the detection of novel genes in the draft genome sequence and that the current IGI contains about 61% (19/31) of novel genes in the human genome. On average, 79% of each gene was detected. The extent of fragmentation could also be estimated: 14 of the genes corresponded to a single prediction in the IGI/IPI, three genes corresponded to two predictions, one gene to three predictions and one gene to four predictions. This corresponds to a fragmentation rate of about 1.4 gene predictions per true gene.

Comparison with RIKEN mouse cDNAs. In a less direct but larger-scale approach, we compared the IGI gene set to a set of mouse cDNAs sequenced by the Genome Exploration Group of the RIKEN Genomic Sciences Center (ref. 309). This set of 15,294 cDNAs, subjected to full-insert sequencing, was enriched for novel genes by selecting cDNAs with novel 3′ ends from a collection of nearly one million ESTs from diverse tissues and developmental timepoints.
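As a check on the figures quoted above for the 31 newly discovered genes, the sensitivity and fragmentation estimates can be recomputed from the stated counts. The following is a minimal sketch; the function and variable names are illustrative and do not come from the paper.

```python
# Counts quoted in the text for the newly discovered genes.
NEW_GENES = 31        # 22 recent RefSeq entries + 9 chromosome X genes
IN_DRAFT = 28         # of these, contained in the draft genome sequence
IN_IGI = 19           # of these, represented in the IGI/IPI

# Fragmentation data from the text: {predictions per gene: number of genes}.
PREDICTIONS_PER_GENE = {1: 14, 2: 3, 3: 1, 4: 1}


def sensitivity(detected: int, total: int) -> float:
    """Fraction of reference genes recovered by the prediction set."""
    return detected / total


def fragmentation_rate(histogram: dict[int, int]) -> float:
    """Mean number of predictions per detected gene."""
    predictions = sum(k * genes for k, genes in histogram.items())
    genes = sum(histogram.values())
    return predictions / genes


sens_draft = sensitivity(IN_IGI, IN_DRAFT)    # novel genes present in the draft
sens_genome = sensitivity(IN_IGI, NEW_GENES)  # all novel genes considered
frag = fragmentation_rate(PREDICTIONS_PER_GENE)

print(f"sensitivity within draft sequence: {sens_draft:.0%}")
print(f"share of all novel genes in IGI:   {sens_genome:.0%}")
print(f"fragmentation rate: {frag:.1f} predictions per gene")
```

The fragmentation histogram accounts for 27 predictions spread over the 19 detected genes, which is where the rate of about 1.4 comes from.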
We determined the proportion of the RIKEN cDNAs that showed sequence similarity to the draft genome sequence and the proportion that showed sequence similarity to the IGI/IPI. Around 81% of the genes in the RIKEN mouse set showed sequence similarity to the human genome sequence, whereas 69% showed sequence similarity to the IGI/IPI. This suggests a sensitivity of 85% (69/81). This is higher than the sensitivity estimate above, perhaps because some of the matches may be due to paralogues rather than orthologues. It is consistent with the IGI/IPI representing a substantial fraction of the human proteome.

Conversely, 69% (22,013/31,898) of the IGI matches the RIKEN cDNA set. Table 22 shows the breakdown of these matches among the different components of the IGI. This is lower than the proportion of matches among known proteins, although this is expected because known proteins tend to be more highly conserved (see above) and because the predictions are on average shorter than known proteins. Table 22 also shows the numbers of matches to the RIKEN cDNAs among IGI members that do not match known proteins.
The results indicate that both the IGI and the RIKEN set contain a significant number of genes that are novel in the sense of not having known protein homologues.

Comparison with genes on chromosome 22. We also compared the IGI/IPI with the gene annotations on chromosome 22, to assess the proportion of gene predictions corresponding to pseudogenes and to estimate the rate of overprediction. We compared 477 IGI gene predictions to 539 confirmed genes and 133 pseudogenes on chromosome 22 (with the immunoglobulin lambda locus excluded owing to its highly atypical gene structure). Of these, 43 hit 36 annotated pseudogenes. This suggests that 9% of the IGI predictions may correspond to pseudogenes and also suggests a fragmentation rate of 1.2 gene predictions per gene. Of the remaining hits, 63 did not overlap with any current annotations. This would suggest a rate of spurious predictions of about 13% (63/477), although the true rate is likely to be much lower because many of these may correspond to unannotated portions of existing gene predictions or to currently unannotated genes (of which there are estimated to be about 100 on this chromosome; ref. 94).

Chromosomal distribution. Finally, we examined the chromosomal distribution of the IGI gene set.
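The chromosome 22 rates quoted above follow directly from the stated counts. A short sketch of the arithmetic is given here; the variable names are illustrative, and reading the 1.2 fragmentation rate as 43 predictions over 36 pseudogenes is an interpretation of the text rather than a stated formula.

```python
# Counts quoted in the text for the chromosome 22 comparison.
IGI_PREDICTIONS = 477      # IGI gene predictions on chromosome 22
HITS_TO_PSEUDOGENES = 43   # predictions that hit annotated pseudogenes
PSEUDOGENES_HIT = 36       # distinct annotated pseudogenes hit
UNANNOTATED_HITS = 63      # predictions overlapping no current annotation

# Share of predictions that may correspond to pseudogenes.
pseudogene_fraction = HITS_TO_PSEUDOGENES / IGI_PREDICTIONS

# Predictions per pseudogene, read here as the fragmentation rate (assumption).
fragmentation = HITS_TO_PSEUDOGENES / PSEUDOGENES_HIT

# Upper bound on the spurious prediction rate; the text expects the true
# rate to be much lower because some hits fall in unannotated genes.
spurious_upper_bound = UNANNOTATED_HITS / IGI_PREDICTIONS

print(f"pseudogene fraction:  {pseudogene_fraction:.0%}")
print(f"fragmentation rate:   {fragmentation:.1f}")
print(f"spurious upper bound: {spurious_upper_bound:.0%}")
```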
The average density of gene predictions is 11.1 per Mb across the genome, with the extremes being chromosome 19 at 26.8 per Mb and chromosome Y at 6.4 per Mb. It is likely that a significant number of the predictions on chromosome Y are pseudogenes (this chromosome is known to be rich in pseudogenes) and thus that the density for chromosome Y is an overestimate. The density of both genes and Alus on chromosome 19 is much higher than expected, even accounting for the high GC content of the chromosome; this supports the idea that Alu density is more closely correlated with gene density than with GC content itself.

Summary. We are clearly still some way from having a complete set of human genes. The current IGI contains significant numbers of partial genes, fragmented and fused genes, pseudogenes and spurious predictions, and it also lacks significant numbers of true genes. This reflects the current state of gene prediction methods in vertebrates even in finished sequence, as well as the additional challenges related to the current state of the draft genome sequence. Nonetheless, the gene predictions provide a valuable starting point for a wide range of biological studies and will be rapidly refined in the coming year.

The analysis above allows us to estimate the number of distinct genes in the IGI, as well as the number of genes in the human genome. The IGI set contains about 15,000 known genes and about 17,000 gene predictions. Assuming that the gene predictions are subject to a rate of overprediction (spurious predictions and pseudogenes) of 20% and a rate of fragmentation of 1.4, the IGI would be estimated to contain about 24,500 actual human genes. Assuming that the gene predictions contain about 60% of previously unknown human genes, the total number of genes in the human genome would be estimated to be about 31,000. This is consistent with most recent estimates based on sampling, which suggest a gene number of 30,000–35,000. If there are 30,000–35,000 genes, with an average coding length of about 1,400 bp and average genomic extent of about 30 kb, then about 1.5% of the human genome would consist of coding sequence and one-third of the genome would be transcribed in genes.

The IGI/IPI was constructed primarily on the basis of gene predictions from Ensembl. However, we also generated an expanded set (IGI+) by including additional predictions from two other gene prediction programs, Genie and GenomeScan (C. Burge, personal communication).
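The gene-number estimate above can be retraced as follows. This is a sketch of one plausible reading of the calculation, not the authors' stated procedure. The genome size of 3,200 Mb is an assumption that does not appear in this passage, and all names are illustrative. With the rounded inputs quoted in the text, the IGI estimate comes out near 24,700 rather than exactly 24,500, presumably because the original calculation used unrounded counts.

```python
# Rounded inputs quoted in the text.
KNOWN_GENES = 15_000
GENE_PREDICTIONS = 17_000
OVERPREDICTION = 0.20          # spurious predictions and pseudogenes
FRAGMENTATION = 1.4            # predictions per true gene
NOVEL_GENE_SENSITIVITY = 0.60  # share of unknown genes that were predicted

# Assumption (not stated in this passage): genome size in megabases.
GENOME_MB = 3_200

# Discount predictions for overprediction, then merge fragments.
novel_genes_in_igi = GENE_PREDICTIONS * (1 - OVERPREDICTION) / FRAGMENTATION
genes_in_igi = KNOWN_GENES + novel_genes_in_igi

# Scale the novel genes up by the sensitivity to estimate the genome total.
total_genes = KNOWN_GENES + novel_genes_in_igi / NOVEL_GENE_SENSITIVITY


def genome_fractions(n_genes: int, coding_bp: int = 1_400, extent_kb: int = 30):
    """Return (coding fraction, transcribed fraction) of the genome."""
    coding_mb = n_genes * coding_bp / 1e6
    transcribed_mb = n_genes * extent_kb / 1e3
    return coding_mb / GENOME_MB, transcribed_mb / GENOME_MB


print(f"genes in IGI:   about {genes_in_igi:,.0f}")
print(f"genes in total: about {total_genes:,.0f}")
for n in (30_000, 35_000):
    coding, transcribed = genome_fractions(n)
    print(f"{n:,} genes: {coding:.1%} coding, {transcribed:.0%} transcribed")
```

Under these assumptions the coding fraction falls between roughly 1.3% and 1.5%, and the transcribed fraction between roughly 28% and 33%, in line with the figures of about 1.5% and one-third.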
These predictions were not included in the core IGI set, because of the concern that each additional set will provide diminishing returns in identifying true genes while contributing its own false positives (increased sensitivity at the expense of specificity). Genie produced an additional 2,837 gene predictions not overlapping the IGI, and GenomeScan produced 6,534 such gene predictions. If all of these gene predictions were included in the IGI, the number of the 31 new 'known' genes (see above) contained in the IGI would rise from 19 to 24. This would amount to an increase of about 26% in sensitivity, at the expense of increasing the number of predicted genes (excluding knowns) by 55%. Allowing a higher

900 © 2001 Macmillan Magazines Ltd NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
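The sensitivity and specificity trade-off described for the expanded set (IGI+) can be expressed as two relative increases. The sketch below uses the counts from the text. The baseline of 17,000 predictions is the approximate figure quoted earlier in the article and is used here as an assumption, and the variable names are illustrative.

```python
# Counts quoted in the text.
GENES_FOUND_IGI = 19        # of the 31 new 'known' genes, found in the IGI
GENES_FOUND_IGI_PLUS = 24   # found after adding Genie and GenomeScan
EXTRA_GENIE = 2_837         # Genie predictions not overlapping the IGI
EXTRA_GENOMESCAN = 6_534    # GenomeScan predictions not overlapping the IGI

# Assumption: baseline of "about 17,000" predictions (excluding knowns).
BASELINE_PREDICTIONS = 17_000

# Relative gain in detected genes (sensitivity).
sensitivity_gain = GENES_FOUND_IGI_PLUS / GENES_FOUND_IGI - 1

# Relative growth in the number of predicted genes.
extra_predictions = EXTRA_GENIE + EXTRA_GENOMESCAN
prediction_growth = extra_predictions / BASELINE_PREDICTIONS

print(f"sensitivity gain:  {sensitivity_gain:.0%}")
print(f"extra predictions: {extra_predictions:,}")
print(f"prediction growth: {prediction_growth:.0%}")
```

The predicted set grows about twice as fast as the detected-gene count, which is the diminishing return the text gives as the reason for keeping these predictions out of the core IGI.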
