The.Algorithm.Design.Manual.Springer-Verlag.1998


18.04.2013 Views

Longest Common Substring

For each k, maintain the minimum y-coordinate of any path going through exactly k points. Inserting a new point will change exactly one of these paths by reducing the y-coordinate of the path whose last point is barely greater than the new point.

● What if the strings are permutations? - If the strings are permutations, then there are exactly n pairs of matching characters, and the above algorithm runs in O(n log n) time. A particularly important case of this occurs in finding the longest increasing subsequence of a sequence of numbers. Sorting this sequence and then replacing each number by its rank in the total order gives us a permutation p. The longest common subsequence of p and the sorted sequence 1, 2, ..., n gives the longest increasing subsequence.

● What if we have more than two strings to align? - The basic dynamic programming algorithm can be generalized to k strings, taking O(2^k n^k) time, where n is the length of the longest string. This algorithm is exponential in the number of strings k, and so it will likely be too expensive for more than 3 to 4 strings. Further, the problem is NP-complete, so a substantially better exact algorithm is unlikely to come along.

This problem of multiple sequence alignment has received considerable attention, and numerous heuristics have been proposed. Many heuristics begin by computing the pairwise alignment between each of the pairs of strings, and then work to merge these alignments. One approach is to build a graph with a vertex for each character of each string. There will be an edge between two vertices if the corresponding characters are matched in the alignment between their strings S and T. Any k-clique (see Section ) in this graph describes a commonly aligned character, and all such cliques can be found efficiently because of the sparse structure of this graph.

Although these cliques will define a common subsequence, there is no reason to believe that it will be the longest such subsequence. Appropriately weakening the clique requirement provides a way to increase its length, but still there can be no promises.

Implementations: MAP (Multiple Alignment Program) [Hua94] by Xiaoqiu Huang is a C language program that computes a global multiple alignment of sequences using an iterative pairwise method. Certain parameters will need to be tweaked to make it accommodate non-DNA data. It is available by anonymous ftp from cs.mtu.edu in the pub/huang directory.

Combinatorica [Ski90] provides a Mathematica implementation of an algorithm to construct the longest increasing subsequence of a permutation, which is a special case of longest common subsequence. This algorithm is based on Young tableaux rather than dynamic programming. See Section .

Notes: Good expositions on longest common subsequence include [AHU83, CLR90]. A survey of algorithmic results appears in [GBY91]. The algorithm for the case where all the characters in each sequence are distinct or infrequent is due to Hunt and Szymanski [HS77]. Expositions of this algorithm include [Aho90, Man89]. Multiple sequence alignment for computational biology is treated in [Wat95].

Certain problems on strings become easier when we assume a constant-sized alphabet. Masek and Paterson [MP80] solve longest common subsequence in O(mn / log(min{m, n})) time for constant-sized alphabets, using the four Russians technique.

Related Problems: Approximate string matching (see page ), shortest common superstring (see page ).
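The permutation case can be illustrated with a short Python sketch. It is one common reading of the path-maintenance idea described above, not necessarily the exact formulation the text has in mind: `tails[k]` plays the role of the minimum final value over all increasing runs of length k+1, and each new element updates exactly one entry, located by binary search. The function names `longest_increasing_subsequence` and `lcs_of_permutations` are hypothetical, chosen for this example.

```python
from bisect import bisect_left


def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq in O(n log n)."""
    tails = []       # tails[k]: smallest last value of any increasing run of length k+1
    tails_idx = []   # position in seq of the element stored in tails[k]
    prev = [-1] * len(seq)  # back-pointers used to rebuild the subsequence
    for i, x in enumerate(seq):
        k = bisect_left(tails, x)   # first run whose last value is >= x
        if k == len(tails):
            tails.append(x)         # x extends the longest run found so far
            tails_idx.append(i)
        else:
            tails[k] = x            # x lowers the last value of exactly one run
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    out = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        out.append(seq[i])
        i = prev[i]
    return out[::-1]


def lcs_of_permutations(p, q):
    """Longest common subsequence of two permutations of the same elements."""
    pos = {v: j for j, v in enumerate(q)}   # where each element sits in q
    ranks = [pos[v] for v in p]             # p rewritten as positions in q
    # An increasing run of positions is a subsequence of both p and q.
    return [q[j] for j in longest_increasing_subsequence(ranks)]
```

For example, `lcs_of_permutations([3, 1, 2, 4], [1, 2, 3, 4])` yields `[1, 2, 4]`. The reduction runs in the opposite direction from the one in the text: here a longest common subsequence of two permutations is obtained from a longest increasing subsequence, which is why only n matching pairs need to be examined.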

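The generalization of dynamic programming to k strings can be sketched as follows. This is a minimal illustration under stated assumptions, not the book's algorithm: it computes only the length of the longest common subsequence, and it uses k transitions per table cell (drop the last character of one string at a time), which suffices for subsequences, rather than the 2^k transitions of a general alignment recurrence. The table still has on the order of n^k cells, so the exponential dependence on k remains. The name `lcs_length_k` is hypothetical.

```python
from itertools import product


def lcs_length_k(strings):
    """Length of the longest common subsequence of all the given strings."""
    k = len(strings)
    if k == 0 or any(len(s) == 0 for s in strings):
        return 0
    table = {}  # maps a tuple of prefix lengths to the LCS length of those prefixes
    ranges = [range(len(s) + 1) for s in strings]
    # product() yields index tuples in lexicographic order, so every cell's
    # predecessors (some coordinate decreased) are filled in before it is needed.
    for idx in product(*ranges):
        if 0 in idx:
            table[idx] = 0  # an empty prefix shares nothing with the others
            continue
        last = [s[i - 1] for s, i in zip(strings, idx)]
        if all(c == last[0] for c in last):
            # All prefixes end in the same character: match it everywhere.
            table[idx] = table[tuple(i - 1 for i in idx)] + 1
        else:
            # Otherwise some string's last character is unused: try dropping each.
            table[idx] = max(
                table[idx[:j] + (idx[j] - 1,) + idx[j + 1:]] for j in range(k)
            )
    return table[tuple(len(s) for s in strings)]
```

With three strings of length 4, such as `lcs_length_k(["abcd", "bacd", "acbd"])`, the table already holds 125 cells; each extra string multiplies that by another factor of n+1, which is why the exact method is rarely practical beyond 3 or 4 strings.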
