The Algorithm Design Manual. Springer-Verlag, 1998.
Approximate String Matching
An important task in text processing is string matching - finding all the occurrences of a word in the text. Unfortunately, many words in documents are mispelled (sic). How can we search for the string closest to a given pattern in order to account for spelling errors?
To be more precise, let P be a pattern string and T a text string over the same alphabet. The edit distance between P and T is the smallest number of changes sufficient to transform a substring of T into P, where the changes may be:
1. Substitution - two corresponding characters may differ: KAT → CAT.
2. Insertion - we may add a character to T that is in P: CT → CAT.
3. Deletion - we may delete from T a character that is not in P: CAAT → CAT.
For example, P=abcdefghijkl can be matched to T=bcdeffghixkl using exactly three changes, one of each of the above types.
Approximate string matching arises in many applications, as discussed in Section . It seems like a difficult problem, because we have to decide where to delete and insert characters in pattern and text. But let us think about the problem in reverse. What information would we like to have in order to make the final decision, i.e. what should happen with the last character in each string? The last characters may either be matched, if they are identical, or otherwise substituted one for the other. The only other options are inserting or deleting a character to take care of the last character of either the pattern or the text.
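This last-character reasoning translates directly into a short recursive sketch (in Python, as my own illustration; it is not the book's implementation, and the memoization anticipates the dynamic-programming table defined next):

```python
from functools import lru_cache

def edit_distance(p, t):
    """Edit distance by reasoning about the last characters: either
    match/substitute them, or charge one edit for the final character
    of the pattern or of the text."""
    @lru_cache(maxsize=None)
    def d(i, j):                  # distance between p[:i] and t[:j]
        if i == 0:
            return j              # only insertions of text characters remain
        if j == 0:
            return i              # only deletions of pattern characters remain
        return min(d(i - 1, j - 1) + (p[i - 1] != t[j - 1]),  # match/substitute
                   d(i - 1, j) + 1,                           # extra pattern char
                   d(i, j - 1) + 1)                           # extra text char
    return d(len(p), len(t))
```

On the single-change examples above, `edit_distance("KAT", "CAT")`, `edit_distance("CT", "CAT")`, and `edit_distance("CAAT", "CAT")` each return 1.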
More precisely, let D[i,j] be the minimum number of differences between the first i characters of the pattern and the segment of T ending at position j.
D[i,j] is the minimum of the three possible ways to extend smaller strings:
1. If P[i] = T[j], then D[i,j] = D[i-1, j-1], else D[i,j] = D[i-1, j-1]+1. This means we either match or substitute the ith and jth characters, depending upon whether they do or do not match.
2. D[i,j] = D[i-1, j]+1. This means that there is an extra character in the pattern to account for, so we do not advance the text pointer and pay the cost of an insertion.
3. D[i,j] = D[i, j-1]+1. This means that there is an extra character in the text to remove, so we do not advance the pattern pointer and pay the cost of a deletion.
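The recurrence can be evaluated bottom-up by filling in a table of D[i,j] values, taking O(mn) time and space for strings of lengths m and n. A minimal Python sketch (my own, not the book's C program), using the whole-pattern-versus-whole-text boundary conditions discussed below:

```python
def edit_distance_table(p, t):
    """Fill the dynamic-programming table row by row; D[i][j] is the
    minimum number of changes between the first i characters of p and
    the first j characters of t."""
    m, n = len(p), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i            # delete the first i characters of the pattern
    for j in range(n + 1):
        D[0][j] = j            # delete the first j characters of the text
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j - 1] + (p[i - 1] != t[j - 1]),  # case 1
                          D[i - 1][j] + 1,                           # case 2
                          D[i][j - 1] + 1)                           # case 3
    return D[m][n]
```

On the example above, `edit_distance_table("abcdefghijkl", "bcdeffghixkl")` returns 3, one change of each type.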
The alert reader will notice that we have not specified the boundary conditions of this recurrence relation. It is critical to get the initialization right if our program is to return the correct edit distance. The value of D[0,i] will correspond to the cost of matching the first i characters of the text with none of the pattern. What this value should be depends upon what you want to compute. If you seek to match the entire pattern against the entire text, this means that we must delete the first i characters of the text, so D[0,i] = i to pay the cost of the deletions. But what if we want to find where the pattern occurs in a long text? It should not cost more if the matched pattern starts far into the text than if it is near the front. Therefore, the starting cost should be equal for all positions. In this case, D[0,i] = 0, since we pay no cost for deleting the first i characters of the text. In both cases, D[i,0] = i, since we cannot excuse deleting the first i characters of the pattern without penalty.
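The pattern-searching initialization can be sketched as follows (again my own illustration; the helper name pattern_search_cost is not from the text). Setting D[0][j] = 0 lets a match begin at any text position for free, and, symmetrically, taking the cheapest value in the last row lets the match end at any position:

```python
def pattern_search_cost(p, t):
    """Cost of the best approximate occurrence of pattern p anywhere in
    text t: row 0 is all zeros, so a match may start at any position,
    and the minimum over the last row lets it end at any position."""
    m, n = len(p), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i            # deleting pattern characters still costs
    # D[0][j] stays 0: no charge for skipping the first j text characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j - 1] + (p[i - 1] != t[j - 1]),
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    return min(D[m])
```

So `pattern_search_cost("CAT", "THECATINTHEHAT")` returns 0 because CAT occurs exactly, no matter how deep in the text, while the full-string distance between those two strings would be much larger.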