
Initial sequencing and analysis of the human genome - Vitagenes


30% to 50%. The correlation appears to be due primarily to intron size, which drops markedly with increasing GC content (Fig. 36c). In contrast, coding properties such as exon length (Fig. 36c) or exon number (data not shown) vary little. Intergenic distance is also probably lower in high-GC areas, although this is hard to prove directly until all genes have been identified.

The large number of confirmed human introns allows us to analyse variant splice sites, confirming and extending recent reports (ref. 281). Intron positions were confirmed by applying a stringent criterion that EST or mRNA sequence show an exact match of 8 bp in the flanking exonic sequence on each side. Of 53,295 confirmed introns, 98.12% use the canonical dinucleotides GT at the 5′ splice site and AG at the 3′ site (GT–AG pattern). Another 0.76% use the related GC–AG. About 0.10% use AT–AC, which is a rare alternative pattern primarily recognized by the variant U12 splicing machinery (ref. 282). The remaining 1% belong to 177 types, some of which undoubtedly reflect sequencing or alignment errors.

Finally, we looked at alternative splicing of human genes. Alternative splicing can allow many proteins to be produced from a single gene and can be used for complex gene regulation. It appears to be prevalent in humans, with lower estimates of about 35% of human genes being subject to alternative splicing (refs 283–285).
These studies may have underestimated the prevalence of alternative splicing, because they examined only EST alignments covering only a portion of a gene.

To investigate the prevalence of alternative splicing, we analysed reconstructed mRNA transcripts covering the entire coding regions of genes on chromosome 22 (omitting small genes with coding regions of less than 240 bp). Potential transcripts identified by alignments of ESTs and cDNAs to genomic sequence were verified by human inspection. We found 642 transcripts, covering 245 genes (average of 2.6 distinct transcripts per gene). Two or more alternatively spliced transcripts were found for 145 (59%) of these genes. A similar analysis for the gene-rich chromosome 19 gave 1,859 transcripts, corresponding to 544 genes (average 3.2 distinct transcripts per gene). Because we are sampling only a subset of all transcripts, the true extent of alternative splicing is likely to be greater. These figures are considerably higher than those for worm, in which analysis reveals alternative splicing for 22% of genes for which ESTs have been found, with an average of 1.34 (12,816/9,516) splice variants per gene.
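The per-gene averages quoted above for chromosome 22 and for worm are simple ratios of the stated counts. The following is a minimal sketch of that arithmetic, using only numbers given in the text; the helper name `transcripts_per_gene` and the variable names are hypothetical and introduced purely for illustration.

```python
def transcripts_per_gene(transcripts: int, genes: int) -> float:
    """Average number of distinct transcripts per gene (hypothetical helper)."""
    if genes <= 0:
        raise ValueError("genes must be a positive count")
    return transcripts / genes


# Chromosome 22: 642 transcripts covering 245 genes (counts from the text).
chr22_avg = transcripts_per_gene(642, 245)

# Chromosome 22: 145 of the 245 genes had two or more spliced transcripts.
chr22_alt_fraction = 145 / 245

# Worm: 12,816 splice variants over 9,516 genes (counts from the text).
worm_avg = transcripts_per_gene(12_816, 9_516)

print(f"chr22: {chr22_avg:.2f} transcripts per gene")
print(f"chr22: {chr22_alt_fraction:.0%} of genes alternatively spliced")
print(f"worm:  {worm_avg:.3f} splice variants per gene")
```

The same helper could be applied to any other transcript and gene counts; it makes no correction for differences in EST coverage, which the text addresses separately.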
(The apparently higher extent of alternative splicing seen in human than in worm was not an artefact resulting from much deeper coverage of human genes by ESTs and mRNAs. Although there are many times more ESTs available for human than worm, these ESTs tend to have shorter average length (because many were the product of early sequencing efforts) and many match no human genes. We calculated the actual coverage per bp used in the analysis of the human and worm genes; the coverage is only modestly higher (about 50%) for the human, with a strong bias towards 3′ UTRs which tend to show much less alternative splicing. We also repeated the analysis using equal coverage for the two organisms and confirmed that higher levels of alternative splicing were still seen in human.)

Seventy per cent of alternative splice forms found in the genes on chromosomes 19 and 22 affect the coding sequence, rather than merely changing the 3′ or 5′ UTR. (This estimate may be affected by the incomplete representation of UTRs in the RefSeq database and in the transcripts studied.) Alternative splicing of the terminal exon was seen for 20% of 6,105 mRNAs that were aligned to the draft genome sequence and correspond to confirmed 3′ EST clusters.
In addition to alternative splicing, we found evidence of the terminal exon employing alternative polyadenylation sites (separated by > 100 bp) in 24% of cases.

Towards a complete index of human genes. We next focused on creating an initial index of human genes and proteins. This index is quite incomplete, owing to the difficulty of gene identification in human DNA and the imperfect state of the draft genome sequence. Nonetheless, it is valuable for experimental studies and provides important insights into the nature of human genes and proteins.

The challenge of identifying genes from genomic sequence varies greatly among organisms. Gene identification is almost trivial in bacteria and yeast, because the absence of introns in bacteria and their paucity in yeast means that most genes can be readily recognized by ab initio analysis as unusually long ORFs. It is not as simple, but still relatively straightforward, to identify genes in animals with small genomes and small introns, such as worm and fly. A major factor is the high signal-to-noise ratio: coding sequences comprise a large proportion of the genome and a large proportion of each gene (about 50% for worm and fly), and exons are relatively large.

Gene identification is more difficult in human DNA.
The signal-to-noise ratio is lower: coding sequences comprise only a few per cent of the genome and an average of about 5% of each gene; internal exons are smaller than in worms; and genes appear to have more alternative splicing. The challenge is underscored by the work on human chromosomes 21 and 22. Even with the availability of finished sequence and intensive experimental work, the gene content remains uncertain, with upper and lower estimates differing by as much as 30%. The initial report of the finished sequence of chromosome 22 (ref. 94) identified 247 previously known genes, 298 predicted genes confirmed by sequence homology or ESTs and 325 ab initio predictions without additional support. Many of the confirmed predictions represented partial genes.
In the past year, 440 additional exons (10%) have been added to existing gene annotations by the chromosome 22 annotation group, although the number of confirmed genes has increased by only 17 and some previously identified gene predictions have been merged (ref. 286).

Before discussing the gene predictions for the human genome, it is useful to consider background issues, including previous estimates of the number of human genes, lessons learned from worms and flies and the representativeness of currently 'known' human genes.

Previous estimates of human gene number. Although direct enumeration of human genes is only now becoming possible with the advent of the draft genome sequence, there have been many attempts in the past quarter of a century to estimate the number of genes indirectly. Early estimates based on reassociation kinetics estimated the mRNA complexity of typical vertebrate tissues to be 10,000–20,000, and were extrapolated to suggest around 40,000 for the entire genome (ref. 287). In the mid-1980s, Gilbert suggested that there might be about 100,000 genes, based on the approximate ratio of the size of a typical gene (~3 × 10^4 bp) to the size of the genome (3 × 10^9 bp).
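The arithmetic behind the 100,000 figure can be written out as a rough sketch. It assumes, as a simplification not stated in the text, that typical-sized genes tile the genome with no intergenic space; the symbol N_genes is introduced here only for illustration.

```latex
\[
N_{\text{genes}} \approx
\frac{\text{genome size}}{\text{typical gene size}}
= \frac{3 \times 10^{9}\ \text{bp}}{3 \times 10^{4}\ \text{bp}}
= 10^{5}
\]
```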
Although this was intended only as a back-of-the-envelope estimate, the pleasing roundness of the figure seems to have led to it being widely quoted and adopted in many textbooks (W. Gilbert, personal communication; ref. 288). An estimate of 70,000–80,000 genes was made by extrapolating from the number of CpG islands and the frequency of their association with known genes (ref. 129).

As human sequence information has accumulated, it has been possible to derive estimates on the basis of sampling techniques (ref. 289). Such studies have sought to extrapolate from various types of data, including ESTs, mRNAs from known genes, cross-species genome comparisons and analysis of finished chromosomes. Estimates based on ESTs (ref. 290) have varied widely, from 35,000 (ref. 130) to 120,000 genes (ref. 291). Some of the discrepancy lies in differing estimates of the amount of contaminating genomic sequence in the EST collection and the extent to which multiple distinct ESTs correspond to a single gene. The most rigorous analyses (ref. 130) exclude as spurious any ESTs that appear only once in the data set and carefully calibrate sensitivity and specificity. Such calculations consistently produce low estimates, in the region of 35,000.

Comparison of whole-genome shotgun sequence from the pufferfish T.
nigroviridis with the human genome (ref. 292) can be used to estimate the density of exons (detected as conserved sequences

898 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
© 2001 Macmillan Magazines Ltd
