Initial sequencing and analysis of the human genome - Vitagenes


Table 10 Number of CpG islands by GC content

GC content of island   Number of islands   Percentage of islands   Nucleotides in islands   Percentage of nucleotides in islands
Total                  28,890              100                     19,818,547               100
>80%                   22                  0.08                    5,916                    0.03
70–80%                 5,884               20                      3,111,965                16
60–70%                 18,779              65                      13,110,924               66
50–60%                 4,205               15                      3,589,742                18

Potential CpG islands were identified by searching the draft genome sequence one base at a time, scoring each dinucleotide (+17 for GC, -1 for others) and identifying maximally scoring segments. Each segment was then evaluated to determine GC content (≥50%), length (>200) and ratio of observed proportion of GC dinucleotides to the expected proportion on the basis of the GC content of the segment (>0.60), using a modification of a program developed by G. Micklem (personal communication).

various computer programs that attempt to identify CpG islands on the basis of primary sequence alone.
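As a rough illustration of the kind of sequence-only check such programs perform, the sketch below evaluates one candidate segment against the three thresholds given in the Table 10 legend (GC content ≥50%, length >200, observed/expected ratio >0.60). It assumes the dinucleotide of interest is CpG (C followed by G) and uses the commonly cited observed/expected formula, (number of CpG × length) / (number of C × number of G). The function names are hypothetical, and this is not the modified Micklem program used for the draft sequence.

```python
def cpg_island_stats(segment: str) -> dict:
    """Return length, GC fraction and observed/expected CpG ratio for a segment."""
    seq = segment.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # C immediately followed by G

    gc_fraction = (c_count + g_count) / n if n else 0.0
    # Expected CpG count if C and G were placed independently (assumed formula).
    expected_cpg = (c_count * g_count) / n if n else 0.0
    obs_exp = cpg_count / expected_cpg if expected_cpg else 0.0

    return {"length": n, "gc_fraction": gc_fraction, "obs_exp": obs_exp}


def passes_island_criteria(
    segment: str,
    min_gc: float = 0.50,       # GC content >= 50%
    min_length: int = 200,      # length > 200
    min_obs_exp: float = 0.60,  # observed/expected > 0.60
) -> bool:
    """Apply the three thresholds quoted in the Table 10 legend."""
    stats = cpg_island_stats(segment)
    return (
        stats["gc_fraction"] >= min_gc
        and stats["length"] > min_length
        and stats["obs_exp"] > min_obs_exp
    )
```

A segment that is GC-rich but almost free of CpG, such as 150 C bases followed by 150 G bases, meets the GC and length thresholds but fails the observed/expected test, which is the point of the third criterion.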
These programs differ in some important respects (such as how aggressively they subdivide long CpG-containing regions), and the precise correspondence with experimentally undermethylated islands has not been validated. Nevertheless, there is a good correlation, and computational analysis thus provides a reasonable picture of the distribution of CpG islands in the genome.

To identify CpG islands, we used the definition proposed by Gardiner-Garden and Frommer (ref. 128) and embodied in a computer program. We searched the draft genome sequence for CpG islands, using both the full sequence and the sequence masked to eliminate repeat sequences. The number of regions satisfying the definition of a CpG island was 50,267 in the full sequence and 28,890 in the repeat-masked sequence. The difference reflects the fact that some repeat elements (notably Alu) are GC-rich. Although some of these repeat elements may function as control regions, it seems unlikely that most of the apparent CpG islands in repeat sequences are functional. Accordingly, we focused on those in the non-repeated sequence. The count of 28,890 CpG islands is reasonably close to the previous estimate of about 35,000 (ref. 129, as modified by ref. 130).

Most of the islands are short, with 60–70% GC content (Table 10). More than 95% of the islands are less than 1,800 bp long, and more than 75% are less than 850 bp. The longest CpG island (on chromosome 10) is 36,619 bp long, and 322 are longer than 3,000 bp. Some of the larger islands contain ribosomal pseudogenes, although RNA genes and pseudogenes account for only a small proportion of all islands (<0.5%). The smaller islands are consistent with their previously hypothesized function, but the role of these larger islands is uncertain.

The density of CpG islands varies substantially among some of the chromosomes. Most chromosomes have 5–15 islands per Mb, with a mean of 10.5 islands per Mb. However, chromosome Y has an unusually low 2.9 islands per Mb, and chromosomes 16, 17 and 22 have 19–22 islands per Mb. The extreme outlier is chromosome 19, with 43 islands per Mb. Similar trends are seen when considering the percentage of bases contained in CpG islands. The relative density of CpG islands correlates reasonably well with estimates of relative gene density on these chromosomes, based both on previous mapping studies involving ESTs (Fig. 14) and on the distribution of gene predictions discussed below.

[Figure 14: scatter plot with one point per chromosome; x-axis: Number of CpG islands per Mb; y-axis: Number of genes per Mb]
Figure 14 Number of CpG islands per Mb for each chromosome, plotted against the number of genes per Mb (the number of genes was taken from GeneMap98 (ref. 100)). Chromosomes 16, 17, 22 and particularly 19 are clear outliers, with a density of CpG islands that is even greater than would be expected from the high gene counts for these four chromosomes.

Comparison of genetic and physical distance

The draft genome sequence makes it possible to compare genetic and physical distances and thereby to explore variation in the rate of recombination across the human chromosomes. We focus here on large-scale variation. Finer variation is examined in an accompanying paper (ref. 131).

The genetic and physical maps are integrated by 5,282 polymorphic loci from the Marshfield genetic map (ref. 102), whose positions are known in terms of centimorgans (cM) and Mb along the chromosomes.
Figure 15 shows the comparison of the draft genome sequence for chromosome 12 with the male, female and sex-averaged maps. One can calculate the approximate ratio of cM per Mb across a chromosome (reflected in the slopes in Fig. 15) and the average recombination rate for each chromosome arm.

Two striking features emerge from analysis of these data. First, the average recombination rate increases as the length of the chromosome arm decreases (Fig. 16). Long chromosome arms have an average recombination rate of about 1 cM per Mb, whereas the shortest arms are in the range of 2 cM per Mb. A similar trend has been seen in the yeast genome (refs 132, 133), despite the fact that the physical scale is nearly 200 times as small. Moreover, experimental studies have shown that lengthening or shortening yeast chromosomes results in a compensatory change in recombination rate (ref. 132).

The second observation is that the recombination rate tends to be suppressed near the centromeres and higher in the distal portions of most chromosomes, with the increase largely in the terminal

[Figure 15: line plot for chromosome 12 (arms 12p and 12q on either side of the centromere); x-axis: Position (Mb); y-axis: Distance from centromere (cM); series: sex-averaged, male, female]
Figure 15 Distance in cM along the genetic map of chromosome 12 plotted against position in Mb in the draft genome sequence. Female, male and sex-averaged maps are shown. Female recombination rates are much higher than male recombination rates. The increased slopes at either end of the chromosome reflect the increased rates of recombination per Mb near the telomeres. Conversely, the flatter slope near the centromere shows decreased recombination there, especially in male meiosis. This is typical of the other chromosomes as well (see http://genome.ucsc.edu/goldenPath/mapPlots). Discordant markers may be map, marker placement or assembly errors.

878 © 2001 Macmillan Magazines Ltd NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
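The cM-per-Mb ratio described in the text can be sketched as a slope calculation over markers that carry both a physical (Mb) and a genetic (cM) position. The marker coordinates below are invented for illustration and are not taken from the Marshfield map, and the function names are hypothetical. This is a minimal sketch of the arithmetic, not the procedure used for Figs 15 and 16.

```python
from typing import List, Tuple

Marker = Tuple[float, float]  # (physical position in Mb, genetic position in cM)


def average_rate(markers: List[Marker]) -> float:
    """Average recombination rate (cM per Mb) between the first and last marker."""
    ordered = sorted(markers)
    (mb_first, cm_first), (mb_last, cm_last) = ordered[0], ordered[-1]
    if mb_last == mb_first:
        raise ValueError("markers must span a non-zero physical distance")
    return (cm_last - cm_first) / (mb_last - mb_first)


def local_rates(markers: List[Marker]) -> List[Tuple[float, float]]:
    """Slope (cM per Mb) between each pair of adjacent markers.

    Returns (interval midpoint in Mb, rate) pairs; intervals of zero
    physical length are skipped.
    """
    ordered = sorted(markers)
    rates = []
    for (mb_a, cm_a), (mb_b, cm_b) in zip(ordered, ordered[1:]):
        if mb_b > mb_a:
            midpoint = (mb_a + mb_b) / 2
            rates.append((midpoint, (cm_b - cm_a) / (mb_b - mb_a)))
    return rates


# Invented example: steeper slopes at both ends, flatter in the middle.
EXAMPLE_MARKERS: List[Marker] = [
    (0.0, 0.0),
    (10.0, 25.0),
    (50.0, 45.0),
    (60.0, 70.0),
]
```

With these made-up markers, `local_rates(EXAMPLE_MARKERS)` gives a higher slope in the two outer intervals than in the central one, which mirrors the qualitative shape described in the Figure 15 legend. Real data would also need handling for discordant markers, which this sketch ignores.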
