18.04.2013 Views

The.Algorithm.Design.Manual.Springer-Verlag.1998


War Story: String 'em Up

We were convinced that using several small arrays would be more efficient than using one big array. We even had theory to justify our technique, but biologists aren't very inclined to believe theory. They demand experiments for proof. Hence we had to implement our algorithms and use simulation to prove that they worked.

So much for motivation. The rest of this tale will demonstrate the impact that clever data structures can have on a string processing application.

Our technique involved identifying all the strings of length 2k that are possible substrings of an unknown string S, given that we know all length-k substrings of S. For example, suppose we know that AC, CA, and CC are the only length-2 substrings of S. It is certainly possible that ACCA is a substring of S, since the center substring CC is one of our possibilities. However, CAAC cannot be a substring of S, since we know that AA is not a substring of S. We needed to find a fast algorithm to construct all the consistent length-2k strings, since S could be very long.
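The consistency test described above can be sketched in a few lines of Python. This is only an illustration, not the implementation from the story: the function name `consistent` and the use of a Python set for the known k-strings are assumptions made for the example.

```python
def consistent(candidate, known, k):
    """Return True if every length-k window of `candidate` is a known k-string.

    `known` is assumed to be a set holding all length-k substrings of S.
    """
    return all(candidate[i:i + k] in known
               for i in range(len(candidate) - k + 1))


known = {"AC", "CA", "CC"}           # the only length-2 substrings of S
print(consistent("ACCA", known, 2))  # windows AC, CC, CA are all known
print(consistent("CAAC", known, 2))  # window AA is not known
```

A string that passes this test is only possibly a substring of S; a string that fails it cannot be one.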

Figure: The concatenation of two fragments can be in S only if all subfragments are

The simplest algorithm to build the length-2k strings would be to concatenate all pairs of k-strings together, and then to check, for each pair, that all k-1 length-k substrings spanning the boundary of the concatenation were in fact substrings of S, as shown in the figure. For example, the nine possible concatenations of AC, CA, and CC are ACAC, ACCA, ACCC, CAAC, CACA, CACC, CCAC, CCCA, and CCCC. Only CAAC can be eliminated because of the absence of AA.
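A minimal sketch of this brute-force approach follows, assuming all input strings are non-empty and share the same length k. The function name `consistent_doubles` is hypothetical, and a Python set stands in for the dictionary of permissible k-strings; the story itself goes on to discuss which data structure to use.

```python
from itertools import product


def consistent_doubles(k_strings):
    """Concatenate every ordered pair of k-strings and keep the consistent ones.

    Assumes every string in `k_strings` is non-empty and has the same length k.
    """
    dictionary = set(k_strings)
    k = len(next(iter(dictionary)))
    result = []
    for left, right in product(sorted(dictionary), repeat=2):
        joined = left + right
        # Windows starting at positions 1..k-1 are the k-1 substrings
        # that straddle the boundary between `left` and `right`.
        if all(joined[i:i + k] in dictionary for i in range(1, k)):
            result.append(joined)
    return result


doubles = consistent_doubles(["AC", "CA", "CC"])
print(len(doubles))          # 8 of the 9 concatenations survive
print("CAAC" in doubles)     # False, because AA is not a known substring
```

Only the straddling windows are tested, since the two halves of each concatenation are known k-strings by construction.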

We needed a fast way of testing whether each of the k-1 substrings straddling the concatenation was a member of our dictionary of permissible k-strings. The time it takes to do this depends upon which kind of data structure we use to maintain this dictionary. With a binary search tree over the n strings in the dictionary, we could find the correct string within O(log n) comparisons, where each comparison involved testing which of two length-k strings appeared first in alphabetical order. Since each such comparison could require testing k pairs of characters, the total time using a binary search tree would be O(k log n).
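To make the cost of a lookup concrete, here is a hedged sketch that swaps in a sorted list with binary search for the binary search tree. It is not the implementation from the story, but it has the same comparison pattern as a balanced tree: O(log n) string comparisons per query, each examining up to k characters. The function names are assumptions made for the example.

```python
from bisect import bisect_left


def make_dictionary(k_strings):
    """Build a sorted, duplicate-free list of the permissible k-strings."""
    return sorted(set(k_strings))


def contains(sorted_strings, query):
    """Binary search for `query`: O(log n) string comparisons,
    each of which may compare up to k pairs of characters."""
    i = bisect_left(sorted_strings, query)
    return i < len(sorted_strings) and sorted_strings[i] == query


dictionary = make_dictionary(["AC", "CA", "CC"])
print(contains(dictionary, "CC"))  # a permissible 2-string
print(contains(dictionary, "AA"))  # absent from the dictionary
```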

That seemed pretty good. So my graduate student Dimitris Margaritis implemented a binary search tree data structure for our implementation. It worked great up until the moment we ran it.

