The.Algorithm.Design.Manual.Springer-Verlag.1998
The.Algorithm.Design.Manual.Springer-Verlag.1998 The.Algorithm.Design.Manual.Springer-Verlag.1998
String Matching Next: Approximate String Matching Up: Set and String Problems Previous: Set Packing String Matching Input description: A text string t of length n. A pattern string p of length m. Problem description: Find the first (or all) instances of the pattern in the text. Discussion: String matching is fundamental to database and text processing applications. Every text editor must contain a mechanism to search the current document for arbitrary strings. Pattern matching programming languages such as Perl and Awk derive much of their power from their built-in string matching primitives, making it easy to fashion programs that filter and modify text. Spelling checkers scan an input text for words in the dictionary and reject any strings that do not match. Several issues arise in identifying the right string matching algorithm for the job: ● Are your search patterns and/or texts short? - If your strings are sufficiently short and your queries sufficiently infrequent, the brute-force O(m n)-time search algorithm will likely suffice. For each possible starting position , it tests whether the m characters starting from the ith position of the text are identical to the pattern. For very short patterns, say , you can't hope to beat brute force by much, if at all, and you file:///E|/BOOK/BOOK5/NODE203.HTM (1 of 4) [19/1/2003 1:32:08]
String Matching shouldn't try. Further, since we can reject the possibility of a match at a given starting position the instant we observe a text/pattern mismatch, we expect much better than O(mn) behavior for typical strings. Indeed, the trivial algorithm usually runs in linear time. But the worst case certainly can occur, as with pattern and text . ● What about longer texts and patterns? - By being more clever, string matching can be performed in worst-case linear time. Observe that we need not begin the search from scratch on finding a character mismatch between the pattern and text. We know something about the subsequent characters of the string as a result of the partial match from the previous position. Given a long partial match from position i, we can jump ahead to the first character position in the pattern/text that will provide new information about the text starting from position i+1. The Knuth-Moris- Pratt algorithm preprocesses the search pattern to construct such a jump table efficiently. The details are tricky to get correct, but the resulting implementations yield short, simple programs. Even better in practice is the Boyer-Moore algorithm, although it offers similar worst-case performance. Boyer-Moore matches the pattern against the text from right to left, in order to avoid looking at large chunks of text on a mismatch. Suppose the pattern is abracadabra, and the eleventh character of the text is x. This pattern cannot match in any of the first eleven starting positions of the text, and so the next necessary position to test is the 22nd character. If we get very lucky, only n/m characters need ever be tested. The Boyer-Moore algorithm involves two sets of jump tables in the case of a mismatch: one based on pattern matched so far, the other on the text character seen in the mismatch. Although somewhat more complicated than Knuth-Morris-Pratt, it is worth it in practice for patterns of length m > 5. ● Will you perform multiple queries on the same text? - Suppose you were building a program to repeatedly search a particular text database, such as the Bible. Since the text remains fixed, it pays build a data structure to speed up search queries. The suffix tree and suffix array data structures, discussed in Section , are the right tools for the job. ● Will you search many texts using the same patterns? - Suppose you were building a program to screen out dirty words from a text stream. Here, the set of patterns remains stable, while the search texts are free to change. In such applications, we may need to find all occurrences of each of k different patterns, where k can be quite large. Performing a linear-time search for each of these patterns yields an O(k (m+n)) algorithm. If k is large, a better solution builds a single finite automaton that recognizes each of these patterns and returns to the appropriate start state on any character mismatch. The Aho-Corasick algorithm builds such an automaton in linear time. Space savings can be achieved by optimizing the pattern recognition automaton, as discussed in Section . This algorithm was used in the original version of fgrep. Sometimes multiple patterns are specified not as a list of strings, but concisely as a regular expression. For example, the regular expression matches any string on (a,b,c) that begins and ends with an a, including a itself. The best way to test whether input strings are described by a given regular expression R is to construct the finite automaton equivalent to R and file:///E|/BOOK/BOOK5/NODE203.HTM (2 of 4) [19/1/2003 1:32:08]
- Page 575 and 576: Nearest Neighbor Search Implementat
- Page 577 and 578: Range Search Next: Point Location U
- Page 579 and 580: Range Search subdivisions in C++. I
- Page 581 and 582: Point Location the n edges for inte
- Page 583 and 584: Point Location More recently, there
- Page 585 and 586: Intersection Detection ● Do you w
- Page 587 and 588: Intersection Detection Implementati
- Page 589 and 590: Bin Packing Next: Medial-Axis Trans
- Page 591 and 592: Bin Packing approach for general sh
- Page 593 and 594: Medial-Axis Transformation Next: Po
- Page 595 and 596: Medial-Axis Transformation Implemen
- Page 597 and 598: Polygon Partitioning number of piec
- Page 599 and 600: Simplifying Polygons Next: Shape Si
- Page 601 and 602: Simplifying Polygons vertices and o
- Page 603 and 604: Shape Similarity Next: Motion Plann
- Page 605 and 606: Shape Similarity or how close it is
- Page 607 and 608: Motion Planning There is a wide ran
- Page 609 and 610: Motion Planning often arise in the
- Page 611 and 612: Maintaining Line Arrangements Think
- Page 613 and 614: Maintaining Line Arrangements Next:
- Page 615 and 616: Minkowski Sum where x+y is the vect
- Page 617 and 618: Set and String Problems Next: Set C
- Page 619 and 620: Set Cover Next: Set Packing Up: Set
- Page 621 and 622: Set Cover Figure: Hitting set is du
- Page 623 and 624: Set Packing Next: String Matching U
- Page 625: Set Packing Notes: An excellent exp
- Page 629 and 630: String Matching and texts, I recomm
- Page 631 and 632: Approximate String Matching This sa
- Page 633 and 634: Approximate String Matching http://
- Page 635 and 636: Text Compression Next: Cryptography
- Page 637 and 638: Text Compression code string. ASCII
- Page 639 and 640: Cryptography Next: Finite State Mac
- Page 641 and 642: Cryptography ● How can I validate
- Page 643 and 644: Cryptography MD5 [Riv92] is the sec
- Page 645 and 646: Finite State Machine Minimization F
- Page 647 and 648: Finite State Machine Minimization S
- Page 649 and 650: Longest Common Substring than edit
- Page 651 and 652: Longest Common Substring include [A
- Page 653 and 654: Shortest Common Superstring Finding
- Page 655 and 656: Software systems Next: LEDA Up: Alg
- Page 657 and 658: LEDA Next: Netlib Up: Software syst
- Page 659 and 660: Netlib Algorithms Mon Jun 2 23:33:5
- Page 661 and 662: The Stanford GraphBase Next: Combin
- Page 663 and 664: Algorithm Animations with XTango Ne
- Page 665 and 666: Programs from Books Next: Discrete
- Page 667 and 668: Handbook of Data Structures and Alg
- Page 669 and 670: Algorithms from P to NP Next: Compu
- Page 671 and 672: Algorithms in C++ Next: Data Source
- Page 673 and 674: Textbooks Next: On-Line Resources U
- Page 675 and 676: On-Line Resources Next: Literature
String Matching<br />
Next: Approximate String Matching Up: Set and String Problems Previous: Set Packing<br />
String Matching<br />
Input description: A text string t of length n. A pattern string p of length m.<br />
Problem description: Find the first (or all) instances of the pattern in the text.<br />
Discussion: String matching is fundamental to database and text processing applications. Every text<br />
editor must contain a mechanism to search the current document for arbitrary strings. Pattern matching<br />
programming languages such as Perl and Awk derive much of their power from their built-in string<br />
matching primitives, making it easy to fashion programs that filter and modify text. Spelling checkers<br />
scan an input text for words in the dictionary and reject any strings that do not match.<br />
Several issues arise in identifying the right string matching algorithm for the job:<br />
● Are your search patterns and/or texts short? - If your strings are sufficiently short and your<br />
queries sufficiently infrequent, the brute-force O(m n)-time search algorithm will likely suffice.<br />
For each possible starting position , it tests whether the m characters starting<br />
from the ith position of the text are identical to the pattern.<br />
For very short patterns, say , you can't hope to beat brute force by much, if at all, and you<br />
file:///E|/BOOK/BOOK5/NODE203.HTM (1 of 4) [19/1/2003 1:32:08]