06.06.2013 Views

Patentee Name Harmonisation - ecoom.be


6 DIRECTIONS FOR FURTHER DEVELOPMENT

6.1 Approximate string searching

Approximate string searching aims to identify patentee names that are slightly different but might, in fact, be the same. Rather than applying exact string matching, distance measures based on the similarity between the strings are used. However, it will become apparent that one is confronted with a trade-off between accuracy and completeness: identifying similar patentee names might result in matching unrelated patentee names and vice versa.

Several algorithms are available (e.g. bigrams, trigrams), but the Levenshtein measure (an edit-distance measure), which computes string similarity, can be considered an appropriate and easy-to-implement starting point (for an example of an implementation approach, see the AGREP tool¹¹). When using the Levenshtein distance, a string P is said to be at distance k to a string Q if P can be transformed to equal Q with a sequence of k single-character operations: insertions (at arbitrary places in P), deletions, or substitutions¹². The Levenshtein distance between P and Q is the smallest such k.

Using this measure, all patentee names within a certain distance of a given patentee name can be identified.
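The definition above can be illustrated with a minimal sketch in Python. It is not the AGREP implementation and not the method used in this report; the function names `levenshtein` and `names_within` are hypothetical, and all three operations are given a cost of 1 (the weighted variants mentioned in the footnote are not covered).

```python
def levenshtein(p: str, q: str) -> int:
    """Smallest number of single-character insertions, deletions,
    or substitutions needed to transform p into q (unit costs)."""
    # previous[j] = distance between the first i-1 characters of p
    # and the first j characters of q.
    previous = list(range(len(q) + 1))
    for i, cp in enumerate(p, start=1):
        current = [i]  # turning i characters of p into "" takes i deletions
        for j, cq in enumerate(q, start=1):
            cost = 0 if cp == cq else 1
            current.append(min(
                previous[j] + 1,         # delete cp from p
                current[j - 1] + 1,      # insert cq into p
                previous[j - 1] + cost,  # substitute cp by cq (or keep it)
            ))
        previous = current
    return previous[-1]


def names_within(target: str, names: list[str], k: int) -> list[str]:
    """All names other than target whose distance to target is at most k."""
    return [n for n in names if n != target and levenshtein(target, n) <= k]


if __name__ == "__main__":
    # Short names taken from the example discussed in the text.
    print(names_within("NEC", ["NEC", "NAC", "NIC", "NEI", "NEA", "IBM"], 1))
```

With a maximum distance of 1, the call above returns every three-letter variant except "IBM", which already hints at why short names are problematic for this measure.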

However, implementation is not straightforward because of the following issues:

• Performance: approximate string searching is not as fast as exact string searching. Moreover, if a list has to be matched with itself (e.g. one wants to compare all names in a list with all other names in that list), processing time will increase quadratically with the number of items in the list (a list of 100 items requires 10,000 distances to be calculated and assessed; a list of 1,000 items requires 1,000,000 calculations).

• Large output: approximate string searching can produce a long list of matched strings. Moreover, if a list is matched with itself, numerous duplicates will appear in the matching list (if a string matches 50 other strings, the same combinations will appear 50 x 50 times).

• Relevance of error: if the Levenshtein distance is used, the acceptable distance depends on the length of the string. For a string of 3 characters, name variants with a Levenshtein distance of 1 will in most cases result in false hits (e.g. NEC, NAC, NIC, NEI, and NEA). A minimum string length might therefore be required.

• Validation: even if a small Levenshtein distance between two larger strings is observed, this does not automatically mean that the underlying names are identical. A difference of one character can result in significantly different names. In order to be accurate, additional validation efforts need to be undertaken.
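Some of these issues can be reduced, though not removed, by simple precautions. The sketch below is one possible approach under stated assumptions, not part of the outlined method: it compares each unordered pair of distinct names once (4,950 pairs instead of 10,000 comparisons for 100 names), skips names below a minimum length, and scales the acceptable distance with string length. The names `max_distance` and `candidate_pairs`, the threshold values 8 and 0.1, and the misspelled test name are all hypothetical. The resulting pairs are candidates only and still require validation.

```python
from itertools import combinations


def levenshtein(p: str, q: str) -> int:
    """Unit-cost edit distance between p and q."""
    previous = list(range(len(q) + 1))
    for i, cp in enumerate(p, start=1):
        current = [i]
        for j, cq in enumerate(q, start=1):
            cost = 0 if cp == cq else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def max_distance(length: int, min_length: int = 8, ratio: float = 0.1) -> int:
    """Hypothetical rule: no approximate matching for names shorter than
    min_length; otherwise allow roughly `ratio` edits per character."""
    if length < min_length:
        return 0
    return max(1, int(length * ratio))


def candidate_pairs(names: list[str]) -> list[tuple[str, str, int]]:
    """Return (name_a, name_b, distance) for each unordered pair of
    distinct names whose distance is within the length-dependent limit."""
    unique = sorted(set(names))  # drop exact duplicates first
    pairs = []
    for a, b in combinations(unique, 2):  # each pair visited once
        limit = max_distance(min(len(a), len(b)))
        if limit == 0:
            continue  # at least one name is too short to match reliably
        if abs(len(a) - len(b)) > limit:
            continue  # the length difference is a lower bound on the distance
        distance = levenshtein(a, b)
        if distance <= limit:
            pairs.append((a, b, distance))
    return pairs


if __name__ == "__main__":
    names = [
        "NEC", "NAC",
        "INTERNATIONALBUSINESSMACHINES",
        "INTERNATIONALBUSINESMACHINES",  # hypothetical misspelling
        "INTERNATIONALBUSINESSMACHINES",
    ]
    for a, b, d in candidate_pairs(names):
        print(a, b, d)
```

The number of comparisons still grows quadratically with the length of the list; halving it and filtering on length only postpones the performance problem for very large name lists.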

In view of these issues, applying approximate string searching in an automated way for all names within the name list has not been included in the outlined method. At the same time, approximate string searching offers significant potential, especially for longer names, as the following illustrations demonstrate. Implementing it does, however, require extensive manual validation before it can be automated.

Table 2 contains the name variations of “INTERNATIONAL BUSINESS MACHINES” identified with approximate string searching (legal form indications were removed first and resulting names were condensed).

Table 2: “INTERNATIONAL BUSINESS MACHINES” name variations identified by approximate string searching

IBMCORPINTERNATIONALBUSINESSMACHINESCORP

IBMCORPORATIONINTERNATIONALBUSINESSMACHINESCORPORATION

IBMINTERNATIONALBUSINESSMACHINES

¹¹ S. Wu and U. Manber, “AGREP -- A fast approximate pattern-matching tool”, Proc. Winter 1992 USENIX Technical Conference.

¹² Variants of this logic introduce weights for the different operations; e.g. deletions 3, insertions 2, and substitutions 1.

