Initial sequencing and analysis of the human genome - Vitagenes


contigs in 942 fingerprint clone contigs.

The hierarchy of contigs is summarized in Fig. 7. Initial sequence contigs are integrated to create merged sequence contigs, which are then linked to form sequence-contig scaffolds. These scaffolds reside within sequenced-clone contigs, which in turn reside within fingerprint clone contigs.

The draft genome sequence

The result of the assembly process is an integrated draft sequence of the human genome. Several features of the draft genome sequence are reported in Tables 5–7, including the proportion represented by finished, draft and predraft categories. The Tables also show the numbers and lengths of different types of contig, for each chromosome and for the genome as a whole.

The contiguity of the draft genome sequence at each level is an important feature. Two commonly used statistics have significant drawbacks for describing contiguity. The 'average length' of a contig is deflated by the presence of many small contigs comprising only a small proportion of the genome, whereas the 'length-weighted average length' is inflated by the presence of large segments of finished sequence. Instead, we chose to describe the contiguity as a property of the 'typical' nucleotide.
We used a statistic called the 'N50 length', defined as the largest length L such that 50% of all nucleotides are contained in contigs of size at least L.

The continuity of the draft genome sequence reported here and the effectiveness of assembly can be readily seen from the following: half of all nucleotides reside within an initial sequence contig of at least 21.7 kb, a sequence contig of at least 82 kb, a sequence-contig scaffold of at least 274 kb, a sequenced-clone contig of at least 826 kb and a fingerprint clone contig of at least 8.4 Mb (Tables 6, 7). The cumulative distributions for each of these measures of contiguity are shown in Fig. 8, in which the N50 values for each measure can be seen as the value at which the cumulative distributions cross 50%.

We have also estimated the size of each chromosome, by estimating the gap sizes (see below) and the extent of missing heterochromatic sequence^93,94,105–108 (Table 8). This is undoubtedly an oversimplification and does not adequately take into account the sequence status of each chromosome.
Nonetheless, it provides a useful way to relate the draft sequence to the chromosomes.

Quality assessment

The draft genome sequence already covers the vast majority of the genome, but it remains an incomplete, intermediate product that is regularly updated as we work towards a complete finished sequence. The current version contains many gaps and errors. We therefore sought to evaluate the quality of various aspects of the current draft genome sequence, including the sequenced clones themselves, their assignment to a position in the fingerprint clone contigs, and the assembly of initial sequence contigs from the individual clones into sequence-contig scaffolds.

Nucleotide accuracy is reflected in a PHRAP score assigned to each base in the draft genome sequence and available to users through the Genome Browsers (see below) and public database entries. A summary of these scores for the unfinished portion of the genome is shown in Table 9. About 91% of the unfinished draft genome sequence has an error rate of less than 1 per 10,000 bases (PHRAP score > 40), and about 96% has an error rate of less than 1 in 1,000 bases (PHRAP score > 30).
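The two thresholds quoted above (a score of 40 for 1 error per 10,000 bases, a score of 30 for 1 per 1,000) are consistent with a Phred-style logarithmic scale, in which a score Q corresponds to an error probability of 10^(-Q/10). The following is a minimal sketch under that assumption; the function names are illustrative and are not taken from PHRAP itself.

```python
import math


def error_probability(score):
    """Per-base error probability for a Phred-scaled quality score (assumed scale)."""
    return 10 ** (-score / 10)


def phred_score(probability):
    """Phred-scaled quality score for a per-base error probability."""
    return -10 * math.log10(probability)


# The two thresholds mentioned in the text.
for threshold in (30, 40):
    p = error_probability(threshold)
    print(f"score {threshold}: about 1 error per {round(1 / p):,} bases")
```

Because the scale is logarithmic, each additional 10 points of score corresponds to a tenfold lower expected error rate.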
These values are based only on the quality scores for the bases in the sequenced clones; they do not reflect additional confidence in the sequences that are represented in overlapping clones. The finished portion of the draft genome sequence has an error rate of less than 1 per 10,000 bases.

Individual sequenced clones. We assessed the frequency of misassemblies, which can occur when the assembly program PHRAP joins two nonadjacent regions in the clone into a single initial sequence contig. The frequency of misassemblies depends heavily on the depth and quality of coverage of each clone and the nature of the underlying sequence; thus it may vary among genomic regions and among individual centres. Most clone misassemblies are readily corrected as coverage is added during finishing, but they may have been propagated into the current version of the draft genome sequence and they justify caution for certain applications.

We estimated the frequency of misassembly by examining instances in which there was substantial overlap between a draft clone and a finished clone. We studied 83 Mb of such overlaps, involving about 9,000 initial sequence contigs.
We found 5.3 instances per Mb in which the alignment of an initial sequence contig to the finished sequence failed to extend to within 200 bases

Table 6 Clone level contiguity of the draft genome sequence

            Sequenced-clone contigs  Sequenced-clone-contig scaffolds   Fingerprint clone contigs with sequence
Chromosome  Number  N50 length (kb)  Number  N50 length (kb)            Number  N50 length (kb)
All         4,884   826              2,191   2,279                      942     8,398
1           453     650              197     1,915                      106     3,537
2           348     1,028            127     3,140                      52      10,628
3           409     672              201     1,550                      73      5,077
4           384     606              163     1,659                      41      6,918
5           385     623              164     1,642                      48      5,747
6           292     814              98      3,292                      17      24,680
7           224     1,074            86      3,527                      29      20,401
8           292     542              115     1,742                      43      6,236
9           143     1,242            78      2,411                      21      29,108
10          179     1,097            105     1,952                      16      30,284
11          224     887              89      3,024                      31      9,414
12          196     1,138            76      2,717                      28      9,546
13          128     1,151            56      3,257                      13      25,256
14          54      3,079            27      8,489                      14      22,128
15          123     797              56      2,095                      19      8,274
16          159     620              92      1,317                      57      2,716
17          138     831              58      2,138                      43      2,816
18          137     709              47      2,572                      24      4,887
19          159     569              79      1,200                      51      1,534
20          42      2,318            20      6,862                      9       23,489
21          5       28,515           5       28,515                     5       28,515
22          11      23,048           11      23,048                     11      23,048
X           325     572              181     1,082                      143     1,436
Y           27      1,539            20      3,290                      8       5,135
UL          47      227              40      281                        40      281

Number and size of sequenced-clone contigs, sequenced-clone-contig scaffolds and those fingerprint clone contigs (see Box 1) that contain sequenced clones; some small fingerprint clone contigs do not as yet have associated sequence. UL, fingerprint clone contigs that could not reliably be placed on a chromosome.
These length estimates are from the draft genome sequence, in which gaps between sequence contigs are arbitrarily represented with 100 Ns and gaps between sequenced-clone contigs with 50,000 Ns for 'bridged gaps' and 100,000 Ns for 'unbridged gaps'. These arbitrary values differ minimally from empirical estimates of gap size (see text), and using the empirically derived estimates would change the N50 lengths presented here only slightly. For unfinished chromosomes, the N50 length ranges from 1.5 to 3 times the arithmetic mean for sequenced-clone contigs, 1.5 to 3 times for sequenced-clone-contig scaffolds, and 1.5 to 6 times for fingerprint clone contigs with sequence.

NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com | © 2001 Macmillan Magazines Ltd | 871
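The N50 length defined in the text, and its relation to the two statistics it replaces, can be illustrated with a short sketch. The code below is not the authors' implementation; the function names and the toy contig lengths are hypothetical and chosen only to show how many small contigs pull the arithmetic mean far below the N50.

```python
def n50_length(contig_lengths):
    """Largest L such that contigs of length >= L hold at least half of all
    nucleotides (the N50 length as defined in the text)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point issues at exactly 50%.
        if 2 * running >= total:
            return length


def mean_length(contig_lengths):
    """Plain arithmetic mean: every contig counts equally."""
    return sum(contig_lengths) / len(contig_lengths)


def length_weighted_mean(contig_lengths):
    """Mean contig length seen from a randomly chosen nucleotide."""
    return sum(n * n for n in contig_lengths) / sum(contig_lengths)


# Hypothetical toy assembly (kb): three large contigs plus sixty 1-kb fragments.
toy_lengths_kb = [100, 80, 60] + [1] * 60

print("N50 (kb):", n50_length(toy_lengths_kb))
print("mean (kb):", round(mean_length(toy_lengths_kb), 2))
print("length-weighted mean (kb):", round(length_weighted_mean(toy_lengths_kb), 2))
```

For this toy input the N50 is 80 kb while the arithmetic mean is below 5 kb, even though the sixty small fragments make up only a fifth of the total sequence.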
