18.04.2013 Views

The.Algorithm.Design.Manual.Springer-Verlag.1998


Longest Common Substring

than edit distance.

Issues arising include:

● Are you looking for a common substring or scattered subsequence? - In detecting plagiarism or attempting to identify the authors of anonymous works, we might need to find the longest phrases shared between several documents. Since phrases are strings of consecutive characters, we need the longest common substring between the texts.

The longest common substring of a set of strings can be identified in linear time using suffix trees, discussed in Section . The trick is to build a suffix tree containing all the strings, label each leaf with the set of strings that contain it, and then do a depth-first traversal to identify the deepest node that has descendants from each input string.
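The labeling-and-traversal idea above can be sketched in Python. This is a minimal illustration under a stated assumption: it uses an uncompressed suffix trie rather than a true suffix tree, so it takes quadratic time and space and does not achieve the linear bound. The function name `longest_common_substring` and the node layout are hypothetical choices made for the example.

```python
def longest_common_substring(strings):
    """Return a longest substring shared by every string in `strings`.

    Assumption: an uncompressed suffix trie stands in for the suffix tree,
    so this is quadratic, not linear, in the total input length.
    """
    everyone = set(range(len(strings)))
    # Each node records its children and the set of input strings
    # that have a suffix passing through it.
    root = {"children": {}, "owners": set(everyone)}
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                children = node["children"]
                if ch not in children:
                    children[ch] = {"children": {}, "owners": set()}
                node = children[ch]
                node["owners"].add(idx)

    # Depth-first traversal: descend only through nodes owned by every
    # input string, and remember the deepest one reached.
    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if len(prefix) > len(best):
            best = prefix
        for ch, child in node["children"].items():
            if child["owners"] == everyone:
                stack.append((child, prefix + ch))
    return best
```

Pruning the traversal at any node not owned by every string is safe, because every prefix of a common substring is itself a common substring.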

For the rest of this discussion, we will restrict attention to finding common scattered subsequences. Dynamic programming can be used to find the longest common subsequence of two strings, $S$ and $T$, of $n$ and $m$ characters respectively. This algorithm is a special case of the edit-distance computation of Section . Let $M[i,j]$ denote the number of characters in the longest common subsequence of $S[1], \ldots, S[i]$ and $T[1], \ldots, T[j]$. In general, if $S[i] \neq T[j]$, there is no way the last pair of characters could match, so $M[i,j] = \max(M[i,j-1], M[i-1,j])$. If $S[i] = T[j]$, we have the option to select this character for our subsequence, so $M[i,j] = \max(M[i-1,j-1] + 1, M[i-1,j], M[i,j-1])$. This gives a recurrence that computes $M$, and thus finds the length of the longest common subsequence in $O(nm)$ time. We can reconstruct the actual common subsequence by walking backward from $M[n,m]$ and establishing which characters were actually matched along the way.
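The recurrence and the backward walk can be sketched in Python as follows. This is a minimal sketch, assuming 0-indexed Python strings with an extra row and column of zeros for the empty prefixes; the function name `longest_common_subsequence` is a hypothetical choice for the example.

```python
def longest_common_subsequence(s, t):
    """Return (length, subsequence) for a longest common subsequence of s and t."""
    n, m = len(s), len(t)
    # M[i][j] = LCS length of the first i characters of s and first j of t.
    # Row 0 and column 0 stand for empty prefixes and stay at zero.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                M[i][j] = max(M[i - 1][j - 1] + 1, M[i - 1][j], M[i][j - 1])
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])

    # Walk backward from M[n][m], collecting the characters that were matched.
    matched = []
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1] and M[i][j] == M[i - 1][j - 1] + 1:
            matched.append(s[i - 1])
            i -= 1
            j -= 1
        elif M[i - 1][j] >= M[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return M[n][m], "".join(reversed(matched))
```

When several subsequences share the maximum length, the walk returns one of them; which one depends on how ties between the upward and leftward moves are broken.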

● What if there are relatively few sets of matching characters? - For strings that do not contain too many copies of the same character, there is a faster algorithm. Let $r$ be the number of pairs of positions $(i,j)$ such that $S[i] = T[j]$. Thus $r$ can be as large as $mn$ if both strings consist entirely of the same character, but $r = n$ if the two strings are permutations of $\{1, \ldots, n\}$. This technique treats these $r$ pairs as defining points in the plane.

The complete set of $r$ such points can be found in $O(n + m + r)$ time by bucketing techniques. For each string, we create a bucket for each letter of the alphabet and then partition all of its characters into the appropriate buckets. For each letter $c$ of the alphabet, create a point $(s,t)$ from every pair $s \in S_c$ and $t \in T_c$, where $S_c$ and $T_c$ are buckets for $c$.
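The bucketing step can be sketched in Python. This is a minimal sketch, assuming dictionaries keyed by character stand in for a fixed array of per-letter buckets, and that positions are reported 1-indexed; the function name `matching_points` is a hypothetical choice for the example.

```python
from collections import defaultdict


def matching_points(s, t):
    """Return every (i, j) with s[i] == t[j], using 1-indexed positions."""
    # One bucket per character, holding the positions where it occurs.
    s_buckets = defaultdict(list)
    t_buckets = defaultdict(list)
    for i, ch in enumerate(s, start=1):
        s_buckets[ch].append(i)
    for j, ch in enumerate(t, start=1):
        t_buckets[ch].append(j)

    # For each character, pair every position in s with every position in t.
    points = []
    for ch, s_positions in s_buckets.items():
        t_positions = t_buckets.get(ch, ())
        for i in s_positions:
            for j in t_positions:
                points.append((i, j))
    return points
```

Filling the buckets takes one pass over each string, and the nested loops do constant work per emitted point, which is where the $O(n + m + r)$ bound comes from (with expected constant-time dictionary operations under the assumption above).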

A common subsequence represents a path through these points that moves only up and to the right, never down or to the left. Given these points, the longest such path can be found in $O((n + r) \lg n)$ time. We will sort the points in order of increasing x-coordinate (breaking ties in favor of increasing y-coordinate). We will insert these points one by one in this order, and for each $k$,
