Gene prediction: Database analysis
Pfam database: provides models (statistical definitions) of ≈9,000 protein families.
Query: Do my predicted proteins fit a known protein family?
350 million comparisons: on a standard computer, this would have taken over a century to complete.
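As a rough sanity check on the "over a century" figure, the sketch below works out how long each comparison would need to take for 350 million of them to fill 100 years on one machine. Strictly sequential execution and a 365.25-day year are assumptions made here for illustration; the slide states neither.

```python
# Back-of-envelope estimate. Assumptions (not from the slide):
# comparisons run one after another, and a year is 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

comparisons = 350_000_000              # figure quoted on the slide
century_seconds = 100 * SECONDS_PER_YEAR

# Time each comparison would take if the whole job lasted exactly a century.
seconds_per_comparison = century_seconds / comparisons

print(f"{seconds_per_comparison:.1f} s per comparison")
```

Under these assumptions each comparison would take roughly 9 seconds, so "over a century" implies a per-comparison cost of at least that order.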
Sorcerer II: What did you do to my database!?

[Figure: library construction. Gentle lysis, extraction, DNA size separation, then BAC/fosmid cloning (40–100 kb); or lysis, DNA shearing, then shotgun cloning (3 kb).]

5.7 million proteins

Table 2. Clustering and HMM Profiling Results Showing the Number of Predicted Proteins (Including Both Redundant and Nonredundant Sequences) in Each Dataset

Dataset        Original Set   Clustering (A)   HMM Profiling (B)   A ∩ B       A − B       B − A     Total Predicted Proteins (A ∪ B)
NCBI-nr        2,317,995      1,939,056        1,645,146           1,566,123   372,933     79,023    2,018,079
PG ORFs        3,049,695      575,729          448,159             418,503     157,226     29,656    605,385
TGI-EST ORFs   5,458,820      1,097,083        606,779             576,532     520,551     30,247    1,127,330
ENS            361,668        319,855          253,007             241,671     78,184      11,336    331,191
GOS ORFs       17,422,766     6,046,914        3,701,388           3,624,907   2,422,007   76,481    6,123,395
Total          28,610,944     9,978,637        6,654,479           6,427,736   3,550,901   226,743   10,205,380

A ∩ B denotes the number of predicted proteins common to both the clustering and the HMM profiling; A − B, the number of predicted proteins in clusters but not in the HMM profiling; B − A, the number in the HMM profiling but not in clusters; A ∪ B, the total number of predicted proteins.
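The set columns of Table 2 are tied together by inclusion-exclusion, which gives a quick way to check that the numbers are mutually consistent. The sketch below recomputes A − B, B − A and A ∪ B from the clustering count (A), the HMM profiling count (B) and their overlap (A ∩ B). The function name `partition` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
# Counts taken from Table 2: (clustering A, HMM profiling B, overlap A ∩ B).
counts = {
    "NCBI-nr":      (1_939_056, 1_645_146, 1_566_123),
    "PG ORFs":      (575_729, 448_159, 418_503),
    "TGI-EST ORFs": (1_097_083, 606_779, 576_532),
    "ENS":          (319_855, 253_007, 241_671),
    "GOS ORFs":     (6_046_914, 3_701_388, 3_624_907),
}

def partition(a, b, both):
    """Return (|A - B|, |B - A|, |A ∪ B|) from |A|, |B| and |A ∩ B|."""
    only_a = a - both          # found by clustering only
    only_b = b - both          # found by HMM profiling only
    union = a + b - both       # inclusion-exclusion
    return only_a, only_b, union

for name, (a, b, both) in counts.items():
    only_a, only_b, union = partition(a, b, both)
    print(f"{name:13s} A-B={only_a:>9,} B-A={only_b:>7,} A∪B={union:>10,}")
```

For every dataset, the recomputed values match the corresponding table columns; for GOS ORFs, for example, the union comes to 6,123,395 predicted proteins.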