18.04.2013 Views

The.Algorithm.Design.Manual.Springer-Verlag.1998


War Story: String 'em Up

``I've tried the fastest computer in our department, but our program is too slow,'' Dimitris complained. ``It takes forever on strings of length only 2,000 characters. We will never get up to 50,000.''

For interactive SBH to be competitive as a sequencing method, we had to be able to sequence long fragments of DNA, ideally over 50 kilobases in length. If we couldn't speed up the program, we would be in the embarrassing position of having a biological technique invented by computer scientists fail because the computations took too long.

We profiled our program and discovered that almost all the time was spent searching in this data structure, which was no surprise. For each of the possible concatenations, we did this k-1 times. We needed a faster dictionary data structure, since search was the innermost operation in such a deep loop.

``What about using a hash table?'' I suggested. ``If we do it right, it should take O(k) time to hash a k-character string and look it up in our table. That should knock off a factor of O(log n), which will mean something when n ≈ 2,000.''

Dimitris went back and implemented a hash table for our dictionary. Again, it worked great up until the moment we ran it.

``Our program is still too slow,'' Dimitris complained. ``Sure, it is now about ten times faster on strings of length 2,000. So now we can get up to about 4,000 characters. Big deal. We will never get up to 50,000.''

``We should have expected only a factor ten speedup,'' I mused. ``After all, log2(2,000) ≈ 11. We need a faster data structure to search in our dictionary of strings.''
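The arithmetic behind the "factor ten" remark can be checked in a few lines. This is only a rough sketch: it assumes the speedup from switching to hashing is about log2(n) for a dictionary built from a string of length n, and the variable names are illustrative rather than taken from the original program.

```python
import math

# Assumption: the old dictionary cost about log2(n) times more per search
# than an O(k) hash lookup, so log2(n) approximates the expected speedup.
n = 2000
expected_speedup = math.log2(n)

print(round(expected_speedup, 2))  # 10.97
```

A logarithmic factor grows very slowly, which is why removing it bought only about one order of magnitude.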

``But what can be faster than a hash table?'' Dimitris countered. ``To look up a k-character string, you must read all k characters. Our hash table already does O(k) searching.''

``Sure, it takes k comparisons to test the first substring. But maybe we can do better on the second test. Remember where our dictionary queries are coming from. When we concatenate ABCD with EFGH, we are first testing whether BCDE is in the dictionary, then CDEF. These strings differ from each other by only one character. We should be able to exploit this so that each subsequent test takes constant time to perform....''
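To make the pattern of queries concrete, here is a minimal sketch of the k-1 lookups generated by one concatenation, using a Python set as a stand-in for the hash table. The function names and the sample dictionary contents are hypothetical; only the ABCD/EFGH example comes from the story.

```python
def spanning_substrings(left, right):
    """Yield the k-1 length-k substrings that straddle the junction of left + right."""
    k = len(left)
    assert len(right) == k, "both fragments are assumed to have length k"
    joined = left + right
    for start in range(1, k):
        yield joined[start:start + k]


def concatenation_is_consistent(left, right, dictionary):
    """Each membership test hashes all k characters, so this costs O(k) per query."""
    return all(s in dictionary for s in spanning_substrings(left, right))


# Hypothetical dictionary contents, chosen only for illustration.
dictionary = {"ABCD", "EFGH", "BCDE", "CDEF", "DEFG"}

queries = list(spanning_substrings("ABCD", "EFGH"))
print(queries)  # ['BCDE', 'CDEF', 'DEFG']
print(concatenation_is_consistent("ABCD", "EFGH", dictionary))  # True
```

Each query drops the first character of the previous one and appends one new character, so consecutive queries share k-1 characters. The set lookup ignores that overlap and rehashes every string from scratch.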

``We can't do that with a hash table,'' Dimitris observed. ``The second key is not going to be anywhere near the first in the table. A binary search tree won't help, either. Since the keys ABCD and BCDE differ according to the first character, the two strings will be in different parts of the tree.''
