18.04.2013 Views

The.Algorithm.Design.Manual.Springer-Verlag.1998


Shortest Common Superstring

Finding a superstring of all the substrings is not difficult, as we can simply concatenate them all together. It is finding the shortest such string that is problematic. Indeed, shortest common superstring remains NP-complete under all reasonable classes of strings.

The problem of finding the shortest common superstring can easily be reduced to the traveling salesman problem (see Section ). Create an overlap graph $G$ where vertex $v_i$ represents string $S_i$. Edge $(v_i, v_j)$ has weight equal to the length of $S_i$ minus the overlap of $S_j$ with $S_i$. The minimum-weight path visiting all the vertices defines the shortest common superstring. The edge weights of this graph are not symmetric; after all, the overlap of $S_i$ and $S_j$ is not the same as the overlap of $S_j$ and $S_i$. Thus only programs capable of solving asymmetric TSPs can be applied to this problem.
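
The reduction can be illustrated with a minimal Python sketch. The function names (`overlap`, `overlap_graph`, `shortest_superstring_exhaustive`) are hypothetical, the overlap is computed by brute force, and the exhaustive search over vertex orders stands in for a real asymmetric TSP solver, so it is only practical for a handful of strings. It also assumes that no input string is a substring of another.

```python
from itertools import permutations

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b (brute force)."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(strings):
    """weight[i][j] = len(strings[i]) - overlap(strings[i], strings[j]); 0 on the diagonal."""
    n = len(strings)
    return [[len(strings[i]) - overlap(strings[i], strings[j]) if i != j else 0
             for j in range(n)]
            for i in range(n)]

def shortest_superstring_exhaustive(strings):
    """Try every vertex order and keep the cheapest path (exponential time)."""
    if not strings:
        return ""
    weight = overlap_graph(strings)
    best_total, best_order = None, None
    for order in permutations(range(len(strings))):
        # Path weight, plus the full length of the last string on the path.
        total = sum(weight[a][b] for a, b in zip(order, order[1:]))
        total += len(strings[order[-1]])
        if best_total is None or total < best_total:
            best_total, best_order = total, order
    # Spell out the superstring by appending the non-overlapping tails.
    result = strings[best_order[0]]
    for a, b in zip(best_order, best_order[1:]):
        result += strings[b][overlap(strings[a], strings[b]):]
    return result
```

For the strings "abc", "bcd", "cde", the edge from "abc" to "bcd" has weight 1 while the reverse edge has weight 3, which shows the asymmetry that rules out symmetric TSP codes.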

The greedy heuristic is the standard approach to approximating the shortest common superstring. Find the pair of strings with the maximum number of characters of overlap. Replace them by the merged string, and repeat until only one string remains. Given the overlap graph above, this heuristic can be efficiently implemented by inserting all of the edge weights into a heap (see Section ) and then merging if the appropriate ends of the two strings have not yet been used, which can be tracked with an array of Boolean flags.
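
A minimal Python sketch of this heap-based implementation follows. The names are hypothetical, overlaps are computed by brute force, and the input is assumed to contain distinct strings, none of which is a substring of another. One detail goes beyond the description above: besides the two Boolean flag arrays, the sketch uses a small union-find structure to reject a merge that would join the two ends of the same chain into a cycle.

```python
import heapq

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b (brute force)."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(strings):
    n = len(strings)
    if n == 0:
        return ""
    # heapq is a min-heap, so overlaps are negated to pop the largest first.
    heap = [(-overlap(strings[i], strings[j]), i, j)
            for i in range(n) for j in range(n) if i != j]
    heapq.heapify(heap)
    right_used = [False] * n  # right_used[i]: suffix end of string i already merged
    left_used = [False] * n   # left_used[j]: prefix end of string j already merged
    successor = [None] * n    # successor[i] = (j, overlap) once i is merged into j
    parent = list(range(n))   # union-find over chains, used to avoid cycles

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = 0
    while heap and merges < n - 1:
        neg, i, j = heapq.heappop(heap)
        if right_used[i] or left_used[j] or find(i) == find(j):
            continue  # an end is taken, or the merge would close a cycle
        right_used[i] = left_used[j] = True
        successor[i] = (j, -neg)
        parent[find(i)] = find(j)
        merges += 1

    # The chain starts at the only string whose prefix end was never used.
    cur = left_used.index(False)
    result = strings[cur]
    while successor[cur] is not None:
        j, k = successor[cur]
        result += strings[j][k:]
        cur = j
    return result
```

Because zero-overlap edges are also in the heap, the loop always performs exactly n - 1 merges, and the final walk along `successor` spells out the superstring.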

The potentially time-consuming part of this heuristic is in building the overlap graph. The brute-force approach to finding the maximum overlap of two length-$l$ strings takes $O(l^2)$ time, which must be repeated $O(n^2)$ times, once for each ordered pair of the $n$ strings. Faster times are possible by appropriately using suffix trees (see Section ). Build a tree containing all suffixes of all reversed strings of $S$. String $S_i$ overlaps with $S_j$ if a suffix of $S_i$ matches a suffix of the reverse of $S_j$. The longest overlap for each fragment can be found in time linear in its length.
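
A full suffix tree is too long to sketch here, but the quadratic cost of a single pairwise overlap can already be avoided with a different and simpler technique: the prefix function from Knuth-Morris-Pratt string matching. The sketch below is not the suffix-tree method described above. It computes one overlap in time linear in the two string lengths, but it must still be called once per ordered pair. The function name is hypothetical, and the code assumes the separator character `"\x00"` occurs in neither string.

```python
def max_overlap_linear(a: str, b: str) -> int:
    """Longest suffix of a that is also a prefix of b, via the KMP prefix function."""
    m = min(len(a), len(b))
    # Only the last m characters of a and the first m of b can overlap.
    # Assumption: the separator "\x00" appears in neither input string.
    s = b[:m] + "\x00" + a[len(a) - m:]
    pi = [0] * len(s)  # pi[i] = longest proper prefix of s[:i+1] that is also its suffix
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    # The separator keeps any match from extending past the prefix b[:m].
    return pi[-1]
```

The last entry of the prefix function is the length of the longest prefix of `b` that ends the combined string, which is exactly the overlap of `a` followed by `b`.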

How well does the greedy heuristic perform? If we are unlucky with the input, the greedy heuristic can be fooled into creating a superstring that is at least twice as long as optimal. Usually, it will be a lot better in practice. It is known that the resulting superstring can never be more than 2.75 times optimal.

Building superstrings becomes more difficult with positive and negative substrings, where negative substrings cannot be substrings of the superstring. The problem of deciding whether any such consistent superstring exists is NP-complete, unless you are allowed to add an extra character to the alphabet to use as a spacer.

Implementations: CAP (Contig Assembly Program) [Hua92] by Xiaoqiu Huang is a C language program supporting DNA shotgun sequencing by finding the shortest common superstring of a set of fragments. As to performance, CAP took 4 hours to assemble 1,015 fragments of a total of 252,000 characters on a Sun SPARCstation SLC. Certain parameters will need to be tweaked to make it accommodate non-DNA data. It is available by anonymous ftp from cs.mtu.edu in the pub/huang directory.
