articles<strong>and</strong> <strong>the</strong>ir encoded protein products', it is important to rememberthat thous<strong>and</strong>s <strong>of</strong> <strong>human</strong> genes produce noncoding RNAs(ncRNAs) as <strong>the</strong>ir ultimate product 244 . There are several majorclasses <strong>of</strong> ncRNA. (1) Transfer RNAs (tRNAs) are <strong>the</strong> adapters thattranslate <strong>the</strong> triplet nucleic acid code <strong>of</strong> RNA into <strong>the</strong> amino-acidsequence <strong>of</strong> proteins; (2) ribosomal RNAs (rRNAs) are also centralto <strong>the</strong> translational machinery, <strong>and</strong> recent X-ray crystallographyresults strongly indicate that peptide bond formation is catalysed byrRNA, not protein 245,246 ; (3) small nucleolar RNAs (snoRNAs) arerequired for rRNA processing <strong>and</strong> base modi®cation in <strong>the</strong>nucleolus 247,248 ; <strong>and</strong> (4) small nuclear RNAs (snRNAs) are criticalcomponents <strong>of</strong> spliceosomes, <strong>the</strong> large ribonucleoprotein (RNP)complexes that splice introns out <strong>of</strong> pre-mRNAs in <strong>the</strong> nucleus.Humans have both a major, U2 snRNA-dependent spliceosome thatsplices most introns, <strong>and</strong> a minor, U12 snRNA-dependent spliceosomethat splices a rare class <strong>of</strong> introns that <strong>of</strong>ten have AT/ACdinucleotides at <strong>the</strong> splice sites instead <strong>of</strong> <strong>the</strong> canonical GT/AGsplice site consensus 249 .O<strong>the</strong>r ncRNAs include both RNAs <strong>of</strong> known biochemical function(such as telomerase RNA <strong>and</strong> <strong>the</strong> 7SL signal recognitionparticle RNA) <strong>and</strong> ncRNAs <strong>of</strong> enigmatic function (such as <strong>the</strong>large Xist transcript implicated in X dosage compensation 250 , or <strong>the</strong>small vault RNAs found in <strong>the</strong> bizarre vault ribonucleoproteincomplex 251 , which is three times <strong>the</strong> mass <strong>of</strong> <strong>the</strong> ribosome but hasunknown function).ncRNAs do not have translated ORFs, are <strong>of</strong>ten small <strong>and</strong> are notpolyadenylated. Accordingly, novel ncRNAs cannot readily befound by computational gene-®nding techniques (which searchfor features such as ORFs) or experimental <strong>sequencing</strong> <strong>of</strong> cDNA orEST libraries (most <strong>of</strong> which are prepared by reverse transcriptionusing a primer complementary to a poly(A) tail). Even if <strong>the</strong>complete ®nished sequence <strong>of</strong> <strong>the</strong> <strong>human</strong> <strong>genome</strong> were available,discovering novel ncRNAs would still be challenging. We can,however, identify genomic sequences that are homologous toknown ncRNA genes, using BLASTN or, in some cases, morespecialized methods.It is sometimes dif®cult to tell whe<strong>the</strong>r such homologous genesare orthologues, paralogues or closely related pseudogenes (becauseinactivating mutations are much less obvious than for proteincodinggenes). For tRNA, <strong>the</strong>re is suf®ciently detailed informationabout <strong>the</strong> cloverleaf secondary structure to allow true genes <strong>and</strong>pseudogenes to be distinguished with high sensitivity. For manyo<strong>the</strong>r ncRNAs, <strong>the</strong>re is much less structural information <strong>and</strong> so weemploy an operational criterion <strong>of</strong> high sequence similarity (. 95%sequence identity <strong>and</strong> . 95% full length) to distinguish true genesfrom pseudogenes. These assignments will eventually need to bereconciled with experimental data.Transfer RNA genes. The classical experimental estimate <strong>of</strong> <strong>the</strong>number <strong>of</strong> <strong>human</strong> tRNA genes is 1,310 (ref. 252). In <strong>the</strong> draft<strong>genome</strong> sequence, we ®nd only 497 <strong>human</strong> tRNA genes (Tables 19,20). How do we account for this discrepancy? We believe that <strong>the</strong>original estimate is likely to have been in¯ated in two respects. First,it came from a hybridization experiment that probably countedclosely related pseudogenes; by <strong>analysis</strong> <strong>of</strong> <strong>the</strong> draft <strong>genome</strong>sequence, <strong>the</strong>re are in fact 324 tRNA-derived putative pseudogenesTable 19 Number <strong>of</strong> tRNA genes in various organismsOrganism Number <strong>of</strong> canonical tRNAs SeCys tRNAHuman 497 1Worm 584 1Fly 284 1Yeast 273 0Methanococcus jannaschii 36 1Escherichia coli 86 1.............................................................................................................................................................................Number <strong>of</strong> tRNA genes in each <strong>of</strong> six <strong>genome</strong> sequences, according to <strong>analysis</strong> by <strong>the</strong> computerprogram tRNAscan-SE . Canonical tRNAs read one <strong>of</strong> <strong>the</strong> st<strong>and</strong>ard 61 sense codons; this categoryexcludes pseudogenes, undetermined anticodons, putative supressors <strong>and</strong> selenocysteine tRNAs.Most organisms have a selenocysteine (SeCys) tRNA species, but some unicellular eukaryotes donot (such as <strong>the</strong> yeast S. cerevisiae).(Table 20). Second, <strong>the</strong> earlier estimate assumed too high a value for<strong>the</strong> size <strong>of</strong> <strong>the</strong> <strong>human</strong> <strong>genome</strong>; repeating <strong>the</strong> calculation using <strong>the</strong>correct value yields an estimate <strong>of</strong> about 890 tRNA-related loci,which is in reasonable accord with our count <strong>of</strong> 821 tRNA genes <strong>and</strong>pseudogenes in <strong>the</strong> draft <strong>genome</strong> sequence.The <strong>human</strong> tRNA gene set predicted from <strong>the</strong> draft <strong>genome</strong>sequence appears to include most <strong>of</strong> <strong>the</strong> known <strong>human</strong> tRNAspecies. The draft <strong>genome</strong> sequence contains 37 <strong>of</strong> 38 <strong>human</strong>tRNA species listed in a tRNA database 253 , allowing for up to onemismatch. This includes one copy <strong>of</strong> <strong>the</strong> known gene for aspecialized selenocysteine tRNA, one <strong>of</strong> several components <strong>of</strong> abaroque translational mechanism that reads UGA as a selenocysteinecodon in certain rare mRNAs that carry a speci®c cis-actingRNA regulatory site (a so-called SECIS element) in <strong>the</strong>ir 39 UTRs.The one tRNA gene in <strong>the</strong> database not found in <strong>the</strong> draft <strong>genome</strong>sequence is DE9990, a tRNAGlu species, which differs in twopositions from <strong>the</strong> most related tRNA gene in <strong>the</strong> <strong>human</strong><strong>genome</strong>. Possible explanations are that <strong>the</strong> database version <strong>of</strong>this tRNA contains two errors, <strong>the</strong> gene is polymorphic or this isa genuine functional tRNA that is missing from <strong>the</strong> draft <strong>genome</strong>sequence. (The database also lists one additional tRNA gene(DS9994), but this is apparently a contaminant, most similar tobacterial tRNAs; <strong>the</strong> parent entry (Z13399) was withdrawn from <strong>the</strong>DNA database, but <strong>the</strong> tRNA entry has not yet been removed from<strong>the</strong> tRNA database.) Although <strong>the</strong> <strong>human</strong> set appears substantiallycomplete by this test, <strong>the</strong> tRNA gene numbers in Table 19 should beconsidered tentative <strong>and</strong> used with caution. The <strong>human</strong> <strong>and</strong> ¯y (butnot <strong>the</strong> worm) are known to be missing signi®cant amounts <strong>of</strong>heterochromatic DNA, <strong>and</strong> additional tRNA genes could be located<strong>the</strong>re.With this caveat, <strong>the</strong> results indicate that <strong>the</strong> <strong>human</strong> has fewertRNA genes than <strong>the</strong> worm, but more than <strong>the</strong> ¯y. This may seemsurprising, but tRNA gene number in metazoans is thought to berelated not to organismal complexity, but more to idiosyncrasies <strong>of</strong><strong>the</strong> dem<strong>and</strong> for tRNA abundance in certain tissues or stages <strong>of</strong>embryonic development. For example, <strong>the</strong> frog Xenopus laevis,which must load each oocyte with a remarkable 40 ng <strong>of</strong> tRNA,has thous<strong>and</strong>s <strong>of</strong> tRNA genes 254 .The degeneracy <strong>of</strong> <strong>the</strong> genetic code has allowed an inspiredeconomy <strong>of</strong> tRNA anticodon usage. Although 61 sense codonsneed to be decoded, not all 61 different anticodons are present intRNAs. Ra<strong>the</strong>r, tRNAs generally follow stereotyped <strong>and</strong> conservedwobble rules 255±257 . Wobble reduces <strong>the</strong> number <strong>of</strong> required anticodonssubstantially, <strong>and</strong> provides a connection between <strong>the</strong> geneticcode <strong>and</strong> <strong>the</strong> hybridization stability <strong>of</strong> modi®ed <strong>and</strong> unmodi®edRNA bases. In eukaryotes, <strong>the</strong> rules proposed by Guthrie <strong>and</strong>Abelson 256 predict that about 46 tRNA species will be suf®cient toread <strong>the</strong> 61 sense codons (counting <strong>the</strong> initiator <strong>and</strong> elongatormethionine tRNAs as two species). According to <strong>the</strong>se rules, in <strong>the</strong>codon's third (wobble) position, U <strong>and</strong> C are generally decoded by asingle tRNA species, whereas A <strong>and</strong> G are decoded by two separatetRNA species.In `two-codon boxes' <strong>of</strong> <strong>the</strong> genetic code (where codons endingwith U/C encode a different amino acid from those ending withA/G), <strong>the</strong> U/C wobble position should be decoded by a G at position34 in <strong>the</strong> tRNA anticodon. Thus, in <strong>the</strong> top left <strong>of</strong> Fig. 34, <strong>the</strong>re is notRNA with an AAA anticodon for Phe, but <strong>the</strong> GAA anticodon canrecognize both UUU <strong>and</strong> UUC codons in <strong>the</strong> mRNA. In `fourcodonboxes' <strong>of</strong> <strong>the</strong> genetic code (where U, C, A <strong>and</strong> G in <strong>the</strong> wobbleposition all encode <strong>the</strong> same amino acid), <strong>the</strong> U/C wobble positionis almost always decoded by I34 (inosine) in <strong>the</strong> tRNA, where <strong>the</strong>inosine is produced by post-transcriptional modi®cation <strong>of</strong> anadenine (A). In <strong>the</strong> bottom left <strong>of</strong> Fig. 34, for example, <strong>the</strong> GUU<strong>and</strong> GUC codons <strong>of</strong> <strong>the</strong> four-codon Val box are decoded by a tRNAwith an anticodon <strong>of</strong> AAC, which is no doubt modi®ed to IAC.Presumably this pattern, which is strikingly conserved in eukaryotes,has to do with <strong>the</strong> fact that IA base pairs are also possible; thusNATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com © 2001 Macmillan Magazines Ltd893
articles<strong>the</strong> IAC anticodon for a Val tRNA could recognize GUU, GUC <strong>and</strong>even GUA codons. Were this same I34 to be utilized in two-codonboxes, however, misreading <strong>of</strong> <strong>the</strong> NNA codon would occur, resultingin translational havoc. Eukaryotic glycine tRNAs represent aconserved exception to this last rule; <strong>the</strong>y use a GCC anticodon todecode GGU <strong>and</strong> GGC, ra<strong>the</strong>r than <strong>the</strong> expected ICC anticodon.Satisfyingly, <strong>the</strong> <strong>human</strong> tRNA set follows <strong>the</strong>se wobble rulesalmost perfectly (Fig. 34). Only three unexpected tRNA speciesare found: single genes for a tRNATyr-AUA, tRNAIle-GAU, <strong>and</strong>tRNAAsn-AUU. Perhaps <strong>the</strong>se are pseudogenes, but <strong>the</strong>y appear tobe plausible tRNAs. We also checked <strong>the</strong> possibility <strong>of</strong> <strong>sequencing</strong>errors in <strong>the</strong>ir anticodons, but each <strong>of</strong> <strong>the</strong>se three genes is in a region<strong>of</strong> high sequence accuracy, with PHRAP quality scores higherthan 70 for every base in <strong>the</strong>ir anticodons.As in all o<strong>the</strong>r organisms, <strong>human</strong> protein-coding genes showcodon biasÐpreferential use <strong>of</strong> one synonymous codon overano<strong>the</strong>r 258 (Fig. 34). In less complex organisms, such as yeast orbacteria, highly expressed genes show <strong>the</strong> strongest codon bias.Cytoplasmic abundance <strong>of</strong> tRNA species is correlated with bothcodon bias <strong>and</strong> overall amino-acid frequency (for example, tRNAsfor preferred codons <strong>and</strong> for more common amino acids are moreabundant). This is presumably driven by selective pressure foref®cient or accurate translation 259 . In many organisms, tRNAabundance in turn appears to be roughly correlated with tRNAgene copy number, so tRNA gene copy number has been used as aproxy for tRNA abundance 260 . In vertebrates, however, codon bias isnot so obviously correlated with gene expression level. Differingcodon biases between <strong>human</strong> genes is more a function <strong>of</strong> <strong>the</strong>irlocation in regions <strong>of</strong> different GC composition 261 . In agreementwith <strong>the</strong> literature, we see only a very rough correlation <strong>of</strong> <strong>human</strong>tRNA gene number with ei<strong>the</strong>r amino-acid frequency or codon bias(Fig. 34). The most obvious outliers in <strong>the</strong>se weak correlations are<strong>the</strong> strongly preferred CUG leucine codon, with a mere six tRNA-Leu-CAG genes producing a tRNA to decode it, <strong>and</strong> <strong>the</strong> relativelyrare cysteine UGU <strong>and</strong> UGC codons, with 30 tRNA genes to decode<strong>the</strong>m.The tRNA genes are dispersed throughout <strong>the</strong> <strong>human</strong> <strong>genome</strong>.However, this dispersal is nonr<strong>and</strong>om. tRNA genes have sometimesbeen seen in clusters at small scales 262,263 but we can now see strikingclustering on a <strong>genome</strong>-wide scale. More than 25% <strong>of</strong> <strong>the</strong> tRNAgenes (140) are found in a region <strong>of</strong> only about 4 Mb on chromosome6. This small region, only about 0.1% <strong>of</strong> <strong>the</strong> <strong>genome</strong>, containsan almost suf®cient set <strong>of</strong> tRNA genes all by itself. The 140 tRNAgenes contain a representative for 36 <strong>of</strong> <strong>the</strong> 49 anticodons found in<strong>the</strong> complete set; <strong>and</strong> <strong>of</strong> <strong>the</strong> 21 isoacceptor types, only tRNAs todecode Asn, Cys, Glu <strong>and</strong> selenocysteine are missing. Many <strong>of</strong> <strong>the</strong>setRNA genes, meanwhile, are clustered elsewhere; 18 <strong>of</strong> <strong>the</strong> 30 CystRNAs are found in a 0.5-Mb stretch <strong>of</strong> chromosome 7 <strong>and</strong> many <strong>of</strong><strong>the</strong> Asn <strong>and</strong> Glu tRNA genes are loosely clustered on chromosome 1.More than half <strong>of</strong> <strong>the</strong> tRNA genes (280 out <strong>of</strong> 497) reside on ei<strong>the</strong>rchromosome 1 or chromosome 6. Chromosomes 3, 4, 8, 9, 10, 12,18, 20, 21 <strong>and</strong> X appear to have fewer than 10 tRNA genes each; <strong>and</strong>chromosomes 22 <strong>and</strong> Y have none at all (each has a singlepseudogene).Ribosomal RNA genes. The ribosome, <strong>the</strong> protein syn<strong>the</strong>ticmachine <strong>of</strong> <strong>the</strong> cell, is made up <strong>of</strong> two subunits <strong>and</strong> contains fourrRNA species <strong>and</strong> many proteins. The large ribosomal subunitcontains 28S <strong>and</strong> 5.8S rRNAs (collectively called `large subunit'(LSU) rRNA) <strong>and</strong> also a 5S rRNA. The small ribosomal subunitcontains 18S rRNA (`small subunit' (SSU) rRNA). The genes forLSU <strong>and</strong> SSU rRNA occur in <strong>the</strong> <strong>human</strong> <strong>genome</strong> as a 44-kb t<strong>and</strong>emrepeat unit 264 . There are thought to be about 150±200 copies <strong>of</strong> thisrepeat unit arrayed on <strong>the</strong> short arms <strong>of</strong> acrocentric chromosomes13, 14, 15, 21 <strong>and</strong> 22 (refs 254, 264). There are no true completePheLeu171 UUU203 UUC73 UUA125 UUGAAA 0GAA 14UAA 8CAA 6Ser14717211845UCUUCCUCAUCGAGAGGAUGACGA10054Tyrstopstop12415800UAUUACUAAUAGAUAGUAUUACUA11100CysstopTrp991190122UGUUGCUGAUGGACA 0GCA 30UCA 0CCA 7Leu127 CUU187 CUC69 CUA392 CUGAAG 13GAG 0UAG 2CAG 6Pro17519717069CCUCCCCCACCGAGGGGGUGGCGG110104HisGln104147121343CAUCACCAACAGAUGGUGUUGCUG0121121Arg4710763115CGUCGCCGACGGACG 9GCG 0UCG 7CCG 5IleMet165 AUU218 AUC71 AUA221 AUGAAU 13GAU 1UAU 5CAU 17Thr13119215063ACUACCACAACGAGUGGUUGUCGU80107AsnLys174199248331AAUAACAAAAAGAUUGUUUUUCUU1331622SerArg121191113110AGUAGCAGAAGGACU 0GCU 7UCU 5CCU 4Val111 GUU146 GUC72 GUA288 GUGAAC 20GAC 0UAC 5CAC 19Ala18528216074GCUGCCGCAGCGAGCGGCUGCCGC250105AspGlu230262301404GAUGACGAAGAGAUCGUCUUCCUC010148Gly112230168160GGUGGCGGAGGGACC 0GCC 11UCC 5CCC 8Figure 34 The <strong>human</strong> genetic code <strong>and</strong> associated tRNA genes. For each <strong>of</strong> <strong>the</strong> 64codons, we show: <strong>the</strong> corresponding amino acid; <strong>the</strong> observed frequency <strong>of</strong> <strong>the</strong> codon per10,000 codons; <strong>the</strong> codon; predicted wobble pairing to a tRNA anticodon (black lines); anunmodi®ed tRNA anticodon sequence; <strong>and</strong> <strong>the</strong> number <strong>of</strong> tRNA genes found with thisanticodon. For example, phenylalanine is encoded by UUU or UUC; UUC is seen morefrequently, 203 to 171 occurrences per 10,000 total codons; both codons are expected tobe decoded by a single tRNA anticodon type, GAA, using a G/U wobble; <strong>and</strong> <strong>the</strong>re are 14tRNA genes found with this anticodon. The modi®ed anticodon sequence in <strong>the</strong> maturetRNA is not shown, even where post-transcriptional modi®cations can be con®dentlypredicted (for example, when an A is used to decode a U/C third position, <strong>the</strong> A is almostcertainly an inosine in <strong>the</strong> mature tRNA). The Figure also does not show <strong>the</strong> number <strong>of</strong>distinct tRNA species (such as distinct sequence families) for each anticodon; <strong>of</strong>ten <strong>the</strong>reis more than one species for each anticodon.894 © 2001 Macmillan Magazines Ltd NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com