Patentee Name Harmonisation - ecoom.be
4.2.3 Spelling variation harmonization<br />
Typographical errors and spelling mistakes are responsible for considerable name<br />
variations. These kinds of error can be identified by assessing word similarities. Whilst this type<br />
of analysis is straightforward for common English words, proper names usually require manual<br />
validation efforts in order to ensure accuracy. For example, “AMTECH” and “IMTECH” only differ<br />
in a single character, but it would be incorrect to automatically assume that the names refer to<br />
one and the same patentee. For common words, spelling and language variations can be<br />
identified without ambiguity and, therefore, harmonized effortlessly. For example, “SYSTEM”,<br />
“SYSTEMS”, “SYSTEMEN”, and “SYSTEMES” can all be harmonized to “SYSTEM” or “SYSTEMS”.<br />
Spelling variation harmonization replaces all variants of common words with one harmonized<br />
variant that will be used to match name variants.<br />
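The two ideas in this section can be illustrated with a minimal Python sketch. It is not the procedure documented in Appendix 1: the lookup table `COMMON_WORD_VARIANTS`, the function names, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical lookup table: each known variant of a common word maps to one
# harmonized form. The entries are taken from the example in the text.
COMMON_WORD_VARIANTS = {
    "SYSTEMS": "SYSTEM",
    "SYSTEMEN": "SYSTEM",
    "SYSTEMES": "SYSTEM",
}


def harmonize_spelling(name: str) -> str:
    """Replace every known variant of a common word with its harmonized form."""
    words = name.upper().split()
    return " ".join(COMMON_WORD_VARIANTS.get(word, word) for word in words)


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a, b).ratio()
```

With this sketch, `harmonize_spelling("ACME SYSTEMES")` returns `"ACME SYSTEM"` ("ACME" is a made-up name). `similarity("AMTECH", "IMTECH")` is about 0.83, which shows why a high similarity score alone cannot justify merging two proper names without manual validation.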
4.2.4 Condensing<br />
Significant name variations are also caused by word separation, punctuation, and non-alphanumerical<br />
characters, which clearly have no relevance in identifying the distinctive<br />
characteristics of a name (e.g. “3 COM” and “3COM”, and “AAF-MCQUAY”, “AAF MCQUAY” and<br />
“AAF – MCQUAY”). Condensing removes all non-alphanumerical characters so that a harmonized<br />
variant can be used to match names.<br />
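Condensing as described above can be sketched in a few lines of Python. The function name `condense` is hypothetical, and using `str.isalnum` to decide which characters to keep is an assumption of this sketch, not necessarily the rule applied in Appendix 1.

```python
def condense(name: str) -> str:
    """Drop every character that is not a letter or a digit."""
    # str.isalnum() is Unicode-aware, so letters such as "Ü" are kept
    # while spaces, hyphens, dashes and punctuation are removed.
    return "".join(ch for ch in name.upper() if ch.isalnum())
```

Under this sketch, “3 COM” and “3COM” both condense to `3COM`, and “AAF-MCQUAY” condenses to `AAFMCQUAY`, so variants that differ only in separators end up with the same matching key.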
4.2.5 Umlaut harmonization<br />
Although accented characters have already been replaced (see section 4.1.1 - Character<br />
cleaning), German characters with a diacritic mark (umlaut: “ä”, “ö”, “ü”) still generate spelling<br />
variations because words containing them can occur in three varieties: one with an umlaut (e.g.<br />
“für”), an alternative spelling without an umlaut but with an additional “e” (e.g. “fuer”), and a<br />
simplified form with neither an umlaut nor an additional “e” (e.g. “fur”). Umlaut harmonization<br />
identifies and matches different variants of words including “ä”, “ö” and “ü”.<br />
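One simple way to bring the three varieties together is to reduce each word to a common key. The sketch below is a deliberately naive illustration, not the algorithm of Appendix 1: `UMLAUT_FORMS` and `umlaut_key` are hypothetical names, and collapsing every “AE”, “OE” and “UE” will also alter words that never contained an umlaut (for example “BLUE” becomes “BLU”), so the keys are only suitable for proposing candidate matches.

```python
# Each tuple lists the umlaut form, the "additional e" form, and the plain form.
UMLAUT_FORMS = (("Ä", "AE", "A"), ("Ö", "OE", "O"), ("Ü", "UE", "U"))


def umlaut_key(word: str) -> str:
    """Map the three spelling varieties of an umlaut word to one shared key."""
    key = word.upper()
    for umlaut, digraph, plain in UMLAUT_FORMS:
        # Collapse both the umlaut and its "e" spelling to the plain vowel.
        key = key.replace(umlaut, plain).replace(digraph, plain)
    return key
```

With this sketch, `umlaut_key("für")`, `umlaut_key("fuer")` and `umlaut_key("fur")` all return `"FUR"`, so the three variants can be matched on the same key.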
5 RESULTS AND IMPACT<br />
The complete name cleaning and harmonization procedure has been applied to an<br />
integrated set of EPO and USPTO patentee names. In the case of the EPO dataset, all 270,635<br />
applicant names are included from all 1,600,812 EPO patent applications published between<br />
1978 and 2004 (based on the EPO ESPACE ACCESS product). For USPTO data, all 223,665<br />
assignee names are included from all 1,614,224 USPTO granted patents published between<br />
1991 and 2003 (based on the USPTO Grant Red Book product). All names were converted to<br />
uppercase. Combining these two datasets produced a name list of 443,722 unique patentee<br />
names (after conversion to uppercase and removal of leading and trailing spaces). Of these<br />
names, 50,578 appear both in EPO and USPTO; 173,087 only appear in USPTO, while 220,057<br />
only appear in EPO.<br />
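The counts above are mutually consistent, which a short Python check makes explicit. The variable names are illustrative; the figures are those quoted in the text.

```python
# Name counts quoted in the text.
both = 50_578          # names appearing in both EPO and USPTO
uspto_only = 173_087   # names appearing only in USPTO
epo_only = 220_057     # names appearing only in EPO

epo_names = both + epo_only                   # all EPO applicant names
uspto_names = both + uspto_only               # all USPTO assignee names
unique_names = both + uspto_only + epo_only   # size of the combined list

print(epo_names, uspto_names, unique_names)   # → 270635 223665 443722
```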
As indicated previously, Appendix 1 contains a detailed description of the algorithms<br />
that have been applied and a detailed analysis of the impact of the name cleaning and<br />
harmonization procedure on the set of EPO and USPTO patentee names. In this section, we<br />
highlight the major findings.<br />
Overall, harmonization has reduced the number of unique patentee names by 17.6%,<br />
from 443,722 to 365,866 names. The average number of patents per patentee increases from<br />
7.2 before harmonization to 8.8 after harmonization.<br />
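The two averages can be reproduced from figures given earlier in this section. The sketch assumes that the total patent volume is simply the sum of the EPO applications and USPTO grants; that assumption, and the variable names, are not stated in the source.

```python
# Patent volumes quoted earlier in this section (assumed to be additive).
epo_applications = 1_600_812
uspto_grants = 1_614_224
total_patents = epo_applications + uspto_grants

names_before = 443_722  # unique names before harmonization
names_after = 365_866   # unique names after harmonization

avg_before = total_patents / names_before
avg_after = total_patents / names_after

print(round(avg_before, 1), round(avg_after, 1))  # → 7.2 8.8
```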
13.5% or 49,449 of the harmonized names are matched with more than one original<br />
patentee name. The distribution of the number of matched names is skewed, ranging from 2 to<br />
51 original names, with an average of 2.6.<br />
When names are matched, an average increase in patent volume of 7.2 patents is<br />
observed. The most extreme case is IBM for which name harmonization results in an additional<br />
12,704 patents. When comparing the number of additional patents allocated to harmonized<br />
names with the highest volume pertaining to one of the matched set of names, an average<br />