12.07.2015 Views

Initial sequencing and analysis of the human genome - Vitagenes



genomes 184–187. By studying sets of repeat elements belonging to a common cohort, one can directly measure nucleotide substitution rates in different regions of the genome. We find strong evidence that the pattern of neutral substitution differs as a function of local GC content (Fig. 27). Because the results are observed in repetitive elements throughout the genome, the variation in the pattern of nucleotide substitution seems likely to be due to differences in the underlying mutational process rather than to selection.

The effect can be seen most clearly by focusing on the substitution process γ ↔ α, where γ denotes GC or CG base pairs and α denotes AT or TA base pairs. If K is the equilibrium constant in the direction of α base pairs (defined by the ratio of the forward and reverse rates), then the equilibrium GC content should be 1/(1 + K). Two observations emerge.

First, there is a regional bias in substitution patterns. The equilibrium constant varies as a function of local GC content: γ base pairs are more likely to mutate towards α base pairs in AT-rich regions than in GC-rich regions. For the analysis in Fig. 27, the equilibrium constant K is 2.5, 1.9 and 1.2 when the draft genome sequence is partitioned into three bins with average GC content of 37, 43 and 50%, respectively.
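The relation between K and the equilibrium GC content can be checked with a few lines of Python. This is only an illustrative sketch, not the authors' analysis code: the function and variable names are hypothetical, and the process is assumed to be a simple two-state exchange between GC-type and AT-type base pairs. At equilibrium the forward and reverse fluxes balance, which gives a GC fraction of 1/(1 + K). Because the K values quoted in the text are rounded, the computed percentages may differ by about a point from figures reported elsewhere in the article.

```python
def equilibrium_gc(k: float) -> float:
    """Equilibrium GC fraction for a two-state process with equilibrium
    constant k = (rate GC->AT) / (rate AT->GC).

    Balance of fluxes: rate_forward * gc = rate_reverse * (1 - gc),
    hence gc = 1 / (1 + k).
    """
    if k < 0:
        raise ValueError("K must be non-negative")
    return 1.0 / (1.0 + k)


# (average GC content of the bin in %, K) as quoted in the text
BINS = [(37, 2.5), (43, 1.9), (50, 1.2)]


def summarize(bins=BINS):
    """Return (observed GC %, K, equilibrium GC %, observed - equilibrium)."""
    rows = []
    for observed_gc, k in bins:
        eq = 100.0 * equilibrium_gc(k)
        rows.append((observed_gc, k, eq, observed_gc - eq))
    return rows


if __name__ == "__main__":
    for observed_gc, k, eq, gap in summarize():
        print(f"bin GC {observed_gc}%  K={k}  "
              f"equilibrium GC={eq:.1f}%  gap={gap:.1f} points")
```

In every bin the computed equilibrium value falls below the observed GC content, which is the direction of bias the text describes.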
This bias could be due to a reported tendency for GC-rich regions to replicate earlier in the cell cycle than AT-rich regions and for guanine pools, which are limiting for DNA replication, to become depleted late in the cell cycle, thereby resulting in a small but significant shift in substitution towards α base pairs 186,188. Another theory proposes that many substitutions are due to differences in DNA repair mechanisms, possibly related to transcriptional activity and thereby to gene density and GC content 185,189,190.

[Figure 27 plot: y-axis "Substitution level in 18% diverged sequence (%)"; x-axis "Type of substitution"; series by average background nucleotide composition: 37% GC, 43% GC, 50% GC]

Figure 27 Substitution patterns in interspersed repeats differ as a function of GC content. We collected all copies of five DNA transposons (Tigger1, Tigger2, Charlie3, MER1 and HSMAR2), chosen for their high copy number and well defined consensus sequences. DNA transposons are optimal for the study of neutral substitutions: they do not segregate into subfamilies with diagnostic differences, presumably because they are short-lived and new active families do not evolve in a genome (see text). Duplicates and close paralogues resulting from duplication after transposition were eliminated. The copies were grouped on the basis of GC content of the flanking 1,000 bp on both sides and aligned to the consensus sequence (representing the state of the copy at integration). Recursive efforts using parameters arising from this study did not change the alignments significantly. Alignments were inspected by hand, and obvious misalignments caused by insertions and duplications were eliminated. Substitutions (n = 80,000) were counted for each position in the consensus, excluding those in CpG dinucleotides, and a substitution frequency matrix was defined. From the matrices for each repeat (which corresponded to different ages), a single rate matrix was calculated for these bins of GC content (<40% GC, 40–47% GC and >47% GC). Data are shown for a repeat with an average divergence (in non-CpG sites) of 18% in 43% GC content (the repeat has slightly higher divergence in AT-rich DNA and lower in GC-rich DNA). From the rate matrix, we calculated log-likelihood matrices with different entropies (divergence levels), which are theoretically optimal for alignments of neutrally diverged copies to their common ancestral state (A. Kas and A. F. A. Smit, unpublished). These matrices are in use by the RepeatMasker program.

There is also an absolute bias in substitution patterns resulting in directional pressure towards lower GC content throughout the human genome.
The genome is not at equilibrium with respect to the pattern of nucleotide substitution: the expected equilibrium GC content corresponding to the values of K above is 29, 35 and 44% for regions with average GC contents of 37, 43 and 50%, respectively. Recent observations on SNPs 190 confirm that the mutation pattern in GC-rich DNA is biased towards α base pairs; it should be possible to perform similar analyses throughout the genome with the availability of 1.4 million SNPs 97,191. On the basis solely of nucleotide substitution patterns, the GC content would be expected to be about 7% lower throughout the genome.

What accounts for the higher GC content? One possible explanation is that in GC-rich regions, a considerable fraction of the nucleotides is likely to be under functional constraint owing to the high gene density. Selection on coding regions and regulatory CpG islands may maintain the higher-than-predicted GC content. Another is that throughout the rest of the genome, a constant influx of transposable elements tends to increase GC content (Fig. 28). Young repeat elements clearly have a higher GC content than their surrounding regions, except in extremely GC-rich regions.
Moreover, repeat elements clearly shift with age towards a lower GC content, closer to that of the neighbourhood in which they reside. Much of the 'non-repeat' DNA in AT-rich regions probably consists of ancient repeats that are not detectable by current methods and that have had more time to approach the local equilibrium value.

The repeats can also be used to study how the mutation process is affected by the immediately adjacent nucleotide. Such 'context effects' will be discussed elsewhere (A. Kas and A. F. A. Smit, unpublished results).

Fast living on chromosome Y. The pattern of interspersed repeats can be used to shed light on the unusual evolutionary history of chromosome Y. Our analysis shows that the genetic material on

[Figure plot: y-axis "GC content of feature (%)"; series "All DNA", "Young interspersed repeats" (legend truncated in source)]
