The.Algorithm.Design.Manual.Springer-Verlag.1998

The.Algorithm.Design.Manual.Springer-Verlag.1998 The.Algorithm.Design.Manual.Springer-Verlag.1998

18.04.2013 Views

War Story: String 'em Up ``I've tried the fastest computer in our department, but our program is too slow,'' Dimitris complained. ``It takes forever on strings of length only 2,000 characters. We will never get up to 50,000.'' For interactive SBH to be competitive as a sequencing method, we had to be able to sequence long fragments of DNA, ideally over 50 kilobases in length. If we couldn't speed up the program, we would be in the embarrassing position of having a biological technique invented by computer scientists fail because the computations took too long. We profiled our program and discovered that almost all the time was spent searching in this data structure, which was no surprise. For each of the possible concatenations, we did this k-1 times. We needed a faster dictionary data structure, since search was the innermost operation in such a deep loop. ``What about using a hash table?'' I suggested. ``If we do it right, it should take O(k) time to hash a kcharacter string and look it up in our table. That should knock off a factor of , which will mean something when 2,000.'' Dimitris went back and implemented a hash table implementation for our dictionary. Again, it worked great up until the moment we ran it. ``Our program is still too slow,'' Dimitris complained. ``Sure, it is now about ten times faster on strings of length 2,000. So now we can get up to about 4,000 characters. Big deal. We will never get up to 50,000.'' ``We should have expected only a factor ten speedup,'' I mused. ``After all, . We need a faster data structure to search in our dictionary of strings.'' ``But what can be faster than a hash table?'' Dimitris countered. ``To look up a k-character string, you must read all k characters. Our hash table already does O(k) searching.'' ``Sure, it takes k comparisons to test the first substring. But maybe we can do better on the second test. Remember where our dictionary queries are coming from. When we concatenate ABCD with EFGH, we are first testing whether BCDE is in the dictionary, then CDEF. These strings differ from each other by only one character. We should be able to exploit this so that each subsequent test takes constant time to perform....'' ``We can't do that with a hash table,'' Dimitris observed. ``The second key is not going to be anywhere near the first in the table. A binary search tree won't help, either. Since the keys ABCD and BCDE differ according to the first character, the two strings will be in different parts of the tree.'' file:///E|/BOOK/BOOK/NODE39.HTM (3 of 5) [19/1/2003 1:28:38]

War Story: String 'em Up Figure: Suffix tree on ACAC and CACT, with the pointer to the suffix of ACAC ``But we can use a suffix tree to do this,'' I countered. ``A suffix tree is a trie containing all the suffixes of a given set of strings. For example, the suffixes of ACAC are . Coupled with suffixes of string CACT, we get the suffix tree of Figure . By following a pointer from ACAC to its longest proper suffix CAC, we get to the right place to test whether CACT is in our set of strings. One character comparison is all we need to do from there.'' Suffix trees are amazing data structures, discussed in considerably more detail in Section . Dimitris did some reading about them, then built a nice suffix tree implementation for our dictionary. Once again, it worked great up until the moment we ran it. ``Now our program is faster, but it runs out of memory,'' Dimitris complained. ``And this on a 128 megabyte machine with 400 megabytes virtual memory! The suffix tree builds a path of length k for each suffix of length k, so all told there can be nodes in the tree. It crashes when we go beyond 2,000 characters. We will never get up to strings with 50,000 characters.'' I wasn't yet ready to give up. ``There is a way around the space problem, by using compressed suffix trees,'' I recalled. ``Instead of explicitly representing long paths of character nodes, we can refer back to the original string.'' Compressed suffix trees always take linear space, as described in Section . Dimitris went back one last time and implemented the compressed suffix tree data structure. Now it worked great! As shown in Figure , we ran our simulation for strings of length n= 65,536 on a SPARCstation 20 without incident. Our results, reported in [MS95a], showed that interactive SBH could be a very efficient sequencing technique. Based on these simulations, we were able to arouse interest in our technique from biologists. Making the actual wet laboratory experiments feasible provided another computational challenge, which is reported in Section . file:///E|/BOOK/BOOK/NODE39.HTM (4 of 5) [19/1/2003 1:28:38]

War Story: String 'em Up<br />

Figure: Suffix tree on ACAC and CACT, with the pointer to the suffix of ACAC<br />

``But we can use a suffix tree to do this,'' I countered. ``A suffix tree is a trie containing all the suffixes of<br />

a given set of strings. For example, the suffixes of ACAC are . Coupled with<br />

suffixes of string CACT, we get the suffix tree of Figure . By following a pointer from ACAC to its<br />

longest proper suffix CAC, we get to the right place to test whether CACT is in our set of strings. One<br />

character comparison is all we need to do from there.''<br />

Suffix trees are amazing data structures, discussed in considerably more detail in Section . Dimitris<br />

did some reading about them, then built a nice suffix tree implementation for our dictionary. Once again,<br />

it worked great up until the moment we ran it.<br />

``Now our program is faster, but it runs out of memory,'' Dimitris complained. ``And this on a 128<br />

megabyte machine with 400 megabytes virtual memory! <strong>The</strong> suffix tree builds a path of length k for each<br />

suffix of length k, so all told there can be nodes in the tree. It crashes when we go beyond<br />

2,000 characters. We will never get up to strings with 50,000 characters.''<br />

I wasn't yet ready to give up. ``<strong>The</strong>re is a way around the space problem, by using compressed suffix<br />

trees,'' I recalled. ``Instead of explicitly representing long paths of character nodes, we can refer back to<br />

the original string.'' Compressed suffix trees always take linear space, as described in Section .<br />

Dimitris went back one last time and implemented the compressed suffix tree data structure. Now it<br />

worked great! As shown in Figure , we ran our simulation for strings of length n= 65,536 on a<br />

SPARCstation 20 without incident. Our results, reported in [MS95a], showed that interactive SBH could<br />

be a very efficient sequencing technique. Based on these simulations, we were able to arouse interest in<br />

our technique from biologists. Making the actual wet laboratory experiments feasible provided another<br />

computational challenge, which is reported in Section .<br />

file:///E|/BOOK/BOOK/NODE39.HTM (4 of 5) [19/1/2003 1:28:38]

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!