
Initial sequencing and analysis of the human genome - Vitagenes


Segmental history of the human genome

In bacteria, genomic segments often convey important information about function: genes located close to one another often encode proteins in a common pathway and are regulated in a common operon. In mammals, genes found close to each other only rarely have common functions, but they are still interesting because they have a common history. In fact, the study of genomic segments can shed light on biological events as long as 500 Myr ago and as recently as 20,000 years ago.

Conserved segments between human and mouse

Humans and mice shared a common ancestor about 100 Myr ago. Despite the 200 Myr of evolutionary distance between the species, a significant fraction of genes show synteny between the two, being preserved within conserved segments. Genes tightly linked in one mammalian species tend to be linked in others. In fact, conserved segments have been observed in even more distant species: humans show conserved segments with fish (refs 350, 351) and even with invertebrates such as fly and worm (ref. 352). In general, the likelihood that a syntenic relationship will be disrupted correlates with the physical distance between the loci and the evolutionary distance between the species.

Studying conserved segments between human and mouse has several uses.
First, conservation of gene order has been used to identify likely orthologues between the species, particularly when investigating disease phenotypes. Second, the study of conserved segments among genomes helps us to deduce evolutionary ancestry. And third, detailed comparative maps may assist in the assembly of the mouse sequence, using the human sequence as a scaffold.

Figure 44 Relative expansions of protein families between human and fly. [Plot: x-axis, Human (number of genes); y-axis, Fly (number of genes); points are labelled with the InterPro entry numbers listed here.] These data have not been normalized for proteomic size differences. Blue line, equality between normalized family sizes in the two organisms. Green line, equality between unnormalized family sizes. Numbered InterPro entries: (1) immunoglobulin domain [IPR003006]; (2) zinc finger, C2H2 type [IPR000822]; (3) eukaryotic protein kinase [IPR000719]; (4) rhodopsin-like GPCR superfamily [IPR000276]; (5) ATP/GTP-binding site motif A (P-loop) [IPR001687]; (6) reverse transcriptase (RNA-dependent DNA polymerase) [IPR000477]; (7) RNA-binding region RNP-1 (RNA recognition motif) [IPR000504]; (8) G-protein β WD-40 repeats [IPR001680]; (9) ankyrin repeat [IPR002110]; (10) homeobox domain [IPR001356]; (11) PH domain [IPR001849]; (12) EF-hand family [IPR002048]; (13) EGF-like domain [IPR000561]; (14) Src homology 3 (SH3) domain [IPR001452]; (15) RING finger [IPR001841]; (16) KRAB box [IPR001909]; (17) leucine-rich repeat [IPR001611]; (18) fibronectin type III domain [IPR001777]; (19) PDZ domain (also known as DHR or GLGF) [IPR001478]; (20) TPR repeat [IPR001440]; (21) helicase C-terminal domain [IPR001650]; (22) ion transport protein [IPR002216]; (23) helix–loop–helix DNA-binding domain [IPR001092]; (24) cadherin domain [IPR002126]; (25) intermediate filament proteins [IPR001664]; (26) C2 domain [IPR000008]; (27) Src homology 2 (SH2) domain [IPR000980]; (28) serine proteases, trypsin family [IPR001254]; (29) BTB/POZ domain [IPR000210]; (30) tyrosine-specific protein phosphatase and dual specificity protein phosphatase family [IPR000387]; (31) collagen triple helix repeat [IPR000087]; (32) esterase/lipase/thioesterase [IPR000379]; (33) neutral zinc metallopeptidases, zinc-binding region [IPR000130]; (34) ATP-binding transport protein, 2nd P-loop motif [IPR001051]; (35) ABC transporters family [IPR001617]; (36) cytochrome P450 enzyme [IPR001128]; (37) insect cuticle protein [IPR000618].

Two types of linkage conservation are commonly described (ref. 353). 'Conserved synteny' indicates that at least two genes that reside on a common chromosome in one species are also located on a common chromosome in the other species. Syntenic loci are said to lie in a 'conserved segment' when not only the chromosomal position but the linear order of the loci has been preserved, without interruption by other chromosomal rearrangements.

An initial survey of homologous loci in human and mouse (ref. 354) suggested that the total number of conserved segments would be about 180. Subsequent estimates based on increasingly detailed comparative maps have remained close to this projection (refs 353, 355, 356; http://www.informatics.jax.org).
The distribution of segment lengths has corresponded reasonably well to the truncated negative exponential curve predicted by the random breakage model (ref. 357).

The availability of a draft human genome sequence allows the first global human–mouse comparison in which human physical distances can be measured in Mb, rather than cM or orthologous gene counts. We identified likely orthologues by reciprocal comparison of the human and mouse mRNAs in the LocusLink database, using megaBLAST. For each orthologous pair, we mapped the location of the human gene in the draft genome sequence and then checked the location of the mouse gene in the Mouse Genome Informatics database (http://www.informatics.jax.org). Using a conservative threshold, we identified 3,920 orthologous pairs in which the human gene could be mapped on the draft genome sequence with high confidence. Of these, 2,998 corresponding mouse genes had a known position in the mouse genome. We then searched for definitive conserved segments, defined as human regions containing orthologues of at least two genes from the same mouse chromosome region (< 15 cM) without interruption by segments from other chromosomes.

We identified 183 definitive conserved segments (Fig. 46).
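The operational definition of a definitive conserved segment given above can be sketched in code. The following Python is a minimal illustration, not the authors' pipeline: the `Orthologue` record, the `conserved_segments` function and the toy coordinates are all hypothetical. It also assumes one particular reading of the 15 cM rule, namely that consecutive orthologues along a human chromosome stay in the same run when they map to the same mouse chromosome less than 15 cM apart.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Orthologue:
    """One human-mouse orthologous pair (hypothetical record layout)."""
    name: str
    human_chrom: str
    human_mb: float   # position in the human draft sequence, in Mb
    mouse_chrom: str
    mouse_cm: float   # position on the mouse map, in cM


def conserved_segments(pairs, max_cm=15.0):
    """Return (definitive, singletons) as lists of runs of Orthologue records.

    Assumption: a run continues while the next orthologue along the human
    chromosome maps to the same mouse chromosome within max_cm of the
    previous one; any other mouse location starts a new run.
    """
    runs = []
    ordered = sorted(pairs, key=lambda p: (p.human_chrom, p.human_mb))
    for _, on_chrom in groupby(ordered, key=lambda p: p.human_chrom):
        current = []
        for p in on_chrom:
            if current:
                prev = current[-1]
                same_region = (p.mouse_chrom == prev.mouse_chrom
                               and abs(p.mouse_cm - prev.mouse_cm) < max_cm)
                if not same_region:
                    runs.append(current)
                    current = []
            current.append(p)
        runs.append(current)  # each groupby group is non-empty
    definitive = [r for r in runs if len(r) >= 2]
    singletons = [r for r in runs if len(r) == 1]
    return definitive, singletons


# Toy data, invented for illustration only.
toy = [
    Orthologue("A", "7", 1.0, "5", 10.0),
    Orthologue("B", "7", 2.5, "5", 12.0),
    Orthologue("C", "7", 4.0, "6", 3.0),
    Orthologue("D", "7", 6.0, "5", 14.0),
    Orthologue("E", "7", 7.0, "5", 20.0),
]

definitive, singletons = conserved_segments(toy)
for run in definitive:
    length_mb = run[-1].human_mb - run[0].human_mb
    print([p.name for p in run], f"{length_mb:.1f} Mb")
print("singletons:", [r[0].name for r in singletons])
```

In this toy input, gene C maps to a different mouse chromosome and so forms a singleton that splits the flanking genes into two definitive segments (A, B and D, E). Runs of length one are reported separately and excluded from the segment statistics, mirroring the treatment of singletons in the text.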
The average segment length was 15.4 Mb, with the largest segment being 90.5 Mb and the smallest 24 kb. There were also 141 'singletons', segments that contained only a single locus; these are not counted in the statistics. Although some of these could be short conserved segments, they could also reflect incorrect choices of orthologues or problems with the human or mouse maps. Because of this conservative approach, the observed number of definitive segments is likely to be lower than the correct total. One piece of evidence for this conclusion comes from a more detailed analysis on human chromosome 7 (ref. 358), which identified 20 conserved segments, of which three were singletons. Our analysis revealed only 13 definitive segments on this chromosome, with nine singletons.

The frequency of observing a particular gene count in a conserved segment is plotted on a logarithmic scale in Fig. 47. If chromosomal breaks occur in a random fashion (as has been proposed) and differences in gene density are ignored, a roughly straight line should result. There is a clear excess for n = 1, suggesting that 50% or more of the singletons are indeed artefactual. Thus, we estimate that the true number of conserved segments is around 190–230, in good agreement with the original Nadeau–Taylor prediction (ref. 354).

Figure 48 shows a plot of the frequency of lengths of conserved segments, where the x-axis scale is shown in Mb.
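The straight-line expectation behind the gene-count plot (Fig. 47) suggests one way to quantify an excess at n = 1: fit a line to the logarithm of the frequencies for n ≥ 2 and extrapolate back to n = 1. The Python below is a sketch of that idea under stated assumptions. The text does not say how the authors arrived at their figure, the function name `expected_singletons` is hypothetical, and the histogram is invented rather than taken from the paper.

```python
import math


def expected_singletons(counts):
    """Extrapolate the n = 1 frequency from a log-linear fit to n >= 2.

    counts maps genes-per-segment (n) to the number of segments observed.
    Empty bins are skipped because log(0) is undefined.
    """
    points = [(n, math.log(c)) for n, c in counts.items() if n >= 2 and c > 0]
    if len(points) < 2:
        raise ValueError("need at least two non-empty bins with n >= 2")
    mean_n = sum(n for n, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    # Ordinary least-squares slope and intercept of log(count) against n.
    slope = (sum((n - mean_n) * (y - mean_y) for n, y in points)
             / sum((n - mean_n) ** 2 for n, _ in points))
    intercept = mean_y - slope * mean_n
    return math.exp(intercept + slope * 1)


# Invented histogram for illustration; NOT the data behind Fig. 47.
toy_counts = {1: 300, 2: 64, 3: 32, 4: 16, 5: 8, 6: 4}
expected = expected_singletons(toy_counts)
excess = toy_counts[1] - expected
print(f"expected n=1: {expected:.0f}, excess: {excess:.0f}")
```

Here the invented bins for n ≥ 2 halve at each step, so the fitted line extrapolates to about 128 segments at n = 1, and anything observed above that would count as excess under the random breakage assumption. Real histograms are noisy in the sparse large-n bins, so a weighted fit or a fit restricted to well-populated bins may be more appropriate.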
As before, there is a fair amount of scatter in the data for the larger segments (where the numbers are small), but the trend appears to be consistent with a random breakage model.

We attempted to ascertain whether the breakpoint regions have any special characteristics. This analysis was complicated by imprecision in the positioning of these breaks, which will tend to blur any relationships. With 2,998 orthologues, the average interval within which a break is known to have occurred is about 1.1 Mb. We compared the aggregate features of these breakpoint intervals with the genome as a whole. The mean gene density was lower in breakpoint regions than in the conserved segments (13.8 versus

908 © 2001 Macmillan Magazines Ltd NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
