The.Algorithm.Design.Manual.Springer-Verlag.1998
The.Algorithm.Design.Manual.Springer-Verlag.1998 The.Algorithm.Design.Manual.Springer-Verlag.1998
War Story: String 'em Up ``I've tried the fastest computer in our department, but our program is too slow,'' Dimitris complained. ``It takes forever on strings of length only 2,000 characters. We will never get up to 50,000.'' For interactive SBH to be competitive as a sequencing method, we had to be able to sequence long fragments of DNA, ideally over 50 kilobases in length. If we couldn't speed up the program, we would be in the embarrassing position of having a biological technique invented by computer scientists fail because the computations took too long. We profiled our program and discovered that almost all the time was spent searching in this data structure, which was no surprise. For each of the possible concatenations, we did this k-1 times. We needed a faster dictionary data structure, since search was the innermost operation in such a deep loop. ``What about using a hash table?'' I suggested. ``If we do it right, it should take O(k) time to hash a kcharacter string and look it up in our table. That should knock off a factor of , which will mean something when 2,000.'' Dimitris went back and implemented a hash table implementation for our dictionary. Again, it worked great up until the moment we ran it. ``Our program is still too slow,'' Dimitris complained. ``Sure, it is now about ten times faster on strings of length 2,000. So now we can get up to about 4,000 characters. Big deal. We will never get up to 50,000.'' ``We should have expected only a factor ten speedup,'' I mused. ``After all, . We need a faster data structure to search in our dictionary of strings.'' ``But what can be faster than a hash table?'' Dimitris countered. ``To look up a k-character string, you must read all k characters. Our hash table already does O(k) searching.'' ``Sure, it takes k comparisons to test the first substring. But maybe we can do better on the second test. Remember where our dictionary queries are coming from. When we concatenate ABCD with EFGH, we are first testing whether BCDE is in the dictionary, then CDEF. These strings differ from each other by only one character. We should be able to exploit this so that each subsequent test takes constant time to perform....'' ``We can't do that with a hash table,'' Dimitris observed. ``The second key is not going to be anywhere near the first in the table. A binary search tree won't help, either. Since the keys ABCD and BCDE differ according to the first character, the two strings will be in different parts of the tree.'' file:///E|/BOOK/BOOK/NODE39.HTM (3 of 5) [19/1/2003 1:28:38]
War Story: String 'em Up Figure: Suffix tree on ACAC and CACT, with the pointer to the suffix of ACAC ``But we can use a suffix tree to do this,'' I countered. ``A suffix tree is a trie containing all the suffixes of a given set of strings. For example, the suffixes of ACAC are . Coupled with suffixes of string CACT, we get the suffix tree of Figure . By following a pointer from ACAC to its longest proper suffix CAC, we get to the right place to test whether CACT is in our set of strings. One character comparison is all we need to do from there.'' Suffix trees are amazing data structures, discussed in considerably more detail in Section . Dimitris did some reading about them, then built a nice suffix tree implementation for our dictionary. Once again, it worked great up until the moment we ran it. ``Now our program is faster, but it runs out of memory,'' Dimitris complained. ``And this on a 128 megabyte machine with 400 megabytes virtual memory! The suffix tree builds a path of length k for each suffix of length k, so all told there can be nodes in the tree. It crashes when we go beyond 2,000 characters. We will never get up to strings with 50,000 characters.'' I wasn't yet ready to give up. ``There is a way around the space problem, by using compressed suffix trees,'' I recalled. ``Instead of explicitly representing long paths of character nodes, we can refer back to the original string.'' Compressed suffix trees always take linear space, as described in Section . Dimitris went back one last time and implemented the compressed suffix tree data structure. Now it worked great! As shown in Figure , we ran our simulation for strings of length n= 65,536 on a SPARCstation 20 without incident. Our results, reported in [MS95a], showed that interactive SBH could be a very efficient sequencing technique. Based on these simulations, we were able to arouse interest in our technique from biologists. Making the actual wet laboratory experiments feasible provided another computational challenge, which is reported in Section . file:///E|/BOOK/BOOK/NODE39.HTM (4 of 5) [19/1/2003 1:28:38]
- Page 119 and 120: Logarithms Next: Modeling the Probl
- Page 121 and 122: Logarithms justified in ignoring th
- Page 123 and 124: Modeling the Problem Figure: Modeli
- Page 125 and 126: About the War Stories Next: War Sto
- Page 127 and 128: War Story: Psychic Modeling Next: E
- Page 129 and 130: War Story: Psychic Modeling have pu
- Page 131 and 132: War Story: Psychic Modeling Next: E
- Page 133 and 134: Exercises (b) If I prove that an al
- Page 135 and 136: Fundamental Data Types Next: Contai
- Page 137 and 138: Containers Next: Dictionaries Up: F
- Page 139 and 140: Dictionaries Next: Binary Search Tr
- Page 141 and 142: Binary Search Trees BinaryTreeQuery
- Page 143 and 144: Priority Queues Next: Specialized D
- Page 145 and 146: Specialized Data Structures Next: S
- Page 147 and 148: Sorting Next: Applications of Sorti
- Page 149 and 150: Applications of Sorting Figure: Con
- Page 151 and 152: Data Structures Next: Incremental I
- Page 153 and 154: Incremental Insertion Next: Divide
- Page 155 and 156: Randomization Next: Bucketing Techn
- Page 157 and 158: Randomization Next: Bucketing Techn
- Page 159 and 160: Bucketing Techniques Algorithms Mon
- Page 161 and 162: War Story: Stripping Triangulations
- Page 163 and 164: War Story: Stripping Triangulations
- Page 165 and 166: War Story: Mystery of the Pyramids
- Page 167 and 168: War Story: Mystery of the Pyramids
- Page 169: War Story: String 'em Up We were co
- Page 173 and 174: Exercises Next: Implementation Chal
- Page 175 and 176: Exercises used to select the pivot.
- Page 177 and 178: Dynamic Programming Next: Fibonacci
- Page 179 and 180: Fibonacci numbers Next: The Partiti
- Page 181 and 182: Fibonacci numbers Next: The Partiti
- Page 183 and 184: The Partition Problem . What is the
- Page 185 and 186: The Partition Problem Figure: Dynam
- Page 187 and 188: Approximate String Matching Next: L
- Page 189 and 190: Approximate String Matching The val
- Page 191 and 192: Longest Increasing Sequence Will th
- Page 193 and 194: Minimum Weight Triangulation Next:
- Page 195 and 196: Limitations of Dynamic Programming
- Page 197 and 198: War Story: Evolution of the Lobster
- Page 199 and 200: War Story: Evolution of the Lobster
- Page 201 and 202: War Story: What's Past is Prolog Ne
- Page 203 and 204: War Story: What's Past is Prolog th
- Page 205 and 206: War Story: Text Compression for Bar
- Page 207 and 208: War Story: Text Compression for Bar
- Page 209 and 210: Divide and Conquer Next: Fast Expon
- Page 211 and 212: Fast Exponentiation Next: Binary Se
- Page 213 and 214: Binary Search Next: Square and Othe
- Page 215 and 216: Exercises Next: Implementation Chal
- Page 217 and 218: Exercises whose denominations are ,
- Page 219 and 220: The Friendship Graph Next: Data Str
War Story: String 'em Up<br />
Figure: Suffix tree on ACAC and CACT, with the pointer to the suffix of ACAC<br />
``But we can use a suffix tree to do this,'' I countered. ``A suffix tree is a trie containing all the suffixes of<br />
a given set of strings. For example, the suffixes of ACAC are . Coupled with<br />
suffixes of string CACT, we get the suffix tree of Figure . By following a pointer from ACAC to its<br />
longest proper suffix CAC, we get to the right place to test whether CACT is in our set of strings. One<br />
character comparison is all we need to do from there.''<br />
Suffix trees are amazing data structures, discussed in considerably more detail in Section . Dimitris<br />
did some reading about them, then built a nice suffix tree implementation for our dictionary. Once again,<br />
it worked great up until the moment we ran it.<br />
``Now our program is faster, but it runs out of memory,'' Dimitris complained. ``And this on a 128<br />
megabyte machine with 400 megabytes virtual memory! <strong>The</strong> suffix tree builds a path of length k for each<br />
suffix of length k, so all told there can be nodes in the tree. It crashes when we go beyond<br />
2,000 characters. We will never get up to strings with 50,000 characters.''<br />
I wasn't yet ready to give up. ``<strong>The</strong>re is a way around the space problem, by using compressed suffix<br />
trees,'' I recalled. ``Instead of explicitly representing long paths of character nodes, we can refer back to<br />
the original string.'' Compressed suffix trees always take linear space, as described in Section .<br />
Dimitris went back one last time and implemented the compressed suffix tree data structure. Now it<br />
worked great! As shown in Figure , we ran our simulation for strings of length n= 65,536 on a<br />
SPARCstation 20 without incident. Our results, reported in [MS95a], showed that interactive SBH could<br />
be a very efficient sequencing technique. Based on these simulations, we were able to arouse interest in<br />
our technique from biologists. Making the actual wet laboratory experiments feasible provided another<br />
computational challenge, which is reported in Section .<br />
file:///E|/BOOK/BOOK/NODE39.HTM (4 of 5) [19/1/2003 1:28:38]