06.06.2013 Views

Patentee Name Harmonisation - ecoom.be

Implementation

Implementation is straightforward. All spelling variations identified in Table 19, Table 20 and Table 21 are converted into search and replace statements or rules, as in the previous step (legal form indication treatment).

All identified and validated occurrences of common company words were removed by a program that reads the search and replace rules and executes an update query on the data. The query replaces the given keyword (a spelling variation of a common company word) with a given string (an empty string simply removes the word) and, at the same time, stores the spelling variation that was found in a new field.
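As an illustration of the rule-driven update described above, here is a minimal Python sketch; it is not the program actually used. The table name `names`, the field `found_word`, the three rules and the sample names are all hypothetical; the real rules come from Tables 19, 20 and 21.

```python
import sqlite3

# Hypothetical rules: (keyword, replacement, position).
# The real rules are derived from Tables 19-21 of the report.
RULES = [
    (" COMPANY", "", "end"),
    ("THE ", "", "begin"),
    (" INTERNATIONAL ", " ", "anywhere"),
]

def apply_rule(name, keyword, replacement, position):
    """Return (new_name, matched); matched is False if the rule does not apply."""
    if position == "end" and name.endswith(keyword):
        return name[: len(name) - len(keyword)] + replacement, True
    if position == "begin" and name.startswith(keyword):
        return replacement + name[len(keyword):], True
    if position == "anywhere" and keyword in name:
        return name.replace(keyword, replacement), True
    return name, False

def run_rules(conn, rules):
    """Apply each rule to each row and record the matched keyword in found_word."""
    for keyword, replacement, position in rules:
        rows = conn.execute("SELECT id, name FROM names").fetchall()
        for row_id, name in rows:
            new_name, matched = apply_rule(name, keyword, replacement, position)
            if matched:
                conn.execute(
                    "UPDATE names SET name = ?, found_word = ? WHERE id = ?",
                    (new_name, keyword.strip(), row_id),
                )

# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT, found_word TEXT)")
conn.executemany(
    "INSERT INTO names (name) VALUES (?)",
    [("ACME COMPANY",), ("THE WIDGET WORKS",), ("AMTECH",)],
)
run_rules(conn, RULES)
result = conn.execute("SELECT name, found_word FROM names ORDER BY id").fetchall()
```

In this sketch `found_word` keeps only the keyword of the last rule that matched a name; whether the original program kept one or several matches per name is not stated.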

Result

Common company words were removed at the end of 68,152 names, at the beginning of 2,463 names, and anywhere in the name in 7,662 names.

Not every common word that lacks distinctive value in a name is removed; only the most frequently used ones are identified, using the last word index, first word index and full text index. A more in-depth analysis of the indexes could reveal additional words that are safe to remove.

Impact

The number of unique names drops from 392,226 to 385,771, an additional reduction of 6,455 names, which brings the total reduction to 57,951 names (13.1%).
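The impact figures can be cross-checked with a few lines of arithmetic. The starting count of 443,722 unique names is not stated in this section; it is derived here from the current count plus the total reduction, on the assumption that the 13.1% is measured against that starting count.

```python
before_step = 392_226      # unique names before this step (from the text)
after_step = 385_771       # unique names after this step (from the text)
total_reduction = 57_951   # cumulative reduction so far (from the text)

# Reduction achieved by this step alone.
step_reduction = before_step - after_step

# Derived, not stated in the text: unique names before any harmonization.
initial_count = after_step + total_reduction

# Cumulative reduction as a percentage of the derived starting count.
total_pct = round(100 * total_reduction / initial_count, 1)
```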

2.3 Spelling variation harmonization<br />

Description

One of the causes of name variations is spelling variation (mistakes, typographical errors, etc.). Approximate string searching (for example, based on Levenshtein distance, also called edit distance) can be used to find similar words and thereby identify candidate spelling variations. The problem is that such variations cannot be validated in proper names.

For example, “AMTECH” and “IMTECH” have a Levenshtein distance of 1, but is it possible to combine them into one organization name?
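For reference, the Levenshtein distance mentioned above can be computed with the standard dynamic-programming recurrence. The sketch below is a generic textbook implementation, not code from the harmonization procedure, and the function name `levenshtein` is chosen only for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]

d_proper_names = levenshtein("AMTECH", "IMTECH")
d_plain_words = levenshtein("SYSTEM", "SYSTEMS")
```

Both pairs are one edit apart, which shows why the distance alone cannot separate two distinct proper names from two spellings of the same ordinary word.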

However, spelling, language and grammatical variations can be identified in the case of ordinary words, whether in English or in other languages.

For example, “SYSTEM”, “SYSTEMS”, “SYSTEMEN” and “SYSTEMES” can all be harmonized to “SYSTEM” or “SYSTEMS”.

Spelling variation harmonization can mutilate organization names and make them less comprehensible. However, the idea is not to use these spelling-variation harmonized names as final harmonized names but as a kind of technical search name that can be used to identify name variations of the same organization.
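The search-name idea can be sketched in Python as follows. The mapping uses only the variants named in the example above; matching on whole words separated by spaces is an assumption of this sketch, and the organization names and the function name `search_name` are hypothetical.

```python
# Variants from the example above, all mapped to "SYSTEM".
HARMONIZE = {
    "SYSTEMS": "SYSTEM",
    "SYSTEMEN": "SYSTEM",
    "SYSTEMES": "SYSTEM",
}

def search_name(name):
    """Replace each known spelling variant with its harmonized form.
    Assumption: variants are matched as whole, space-separated words."""
    return " ".join(HARMONIZE.get(word, word) for word in name.split())

# Hypothetical name variations of one organization.
variants = ["ACME SYSTEMS", "ACME SYSTEMES", "ACME SYSTEM"]
search_names = {search_name(v) for v in variants}
```

All three hypothetical variants collapse to one search name, which is then used only to group candidates, not as the final harmonized name.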

Analysis

Spelling variations that can be harmonized were identified by using a full text index of the organization names.

Sorting the index by number of occurrences reveals the most commonly used words. Sorting the index alphabetically then reveals variations of those commonly used words, because they appear next to each other.
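A small Python sketch of this two-way sorting is given below. It assumes the full text index is simply a count of every space-separated word across all names; how the original index was built is not described, and the sample names are hypothetical.

```python
from collections import Counter

def build_word_index(names):
    """Count how often each space-separated word occurs across all names."""
    index = Counter()
    for name in names:
        index.update(name.split())
    return index

# Hypothetical sample of organization names.
names = [
    "ACME SYSTEMS",
    "NOVA SYSTEMS",
    "DELTA SYSTEME",
    "ORION SYSTEMS INTERNATIONAL",
]
index = build_word_index(names)

# Sorted by number of occurrences: the most common words come first.
by_frequency = index.most_common()

# Sorted alphabetically: variations of the same word end up adjacent.
alphabetical = sorted(index.items())
```

In the alphabetical listing, "SYSTEME" and "SYSTEMS" sit next to each other, which is what makes the variations easy to spot by eye.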

Table 22 contains spelling variations of words that can be harmonized.

Table 22: Spelling variations and their harmonized equivalent

KEYWORD             NBR       REMARKS
"SYSTEMEN"          48        "SYSTEM"
"SYSTEMES"          164       "SYSTEM"
"SYSTEME"           1,140     "SYSTEM"
"SYSTEMS"           10,104    "SYSTEM"
"INTERNATIONALE"    109       "INTERNATIONAL"
