Patentee Name Harmonisation - ecoom.be

More documents

Recommendations

Info

4 A CONTENT-DRIVEN NAME HARMONIZATION APPROACH FOCUSING ON ACCURACY As indicated in the introduction, name harmonization involves a trade-off between completeness and accuracy. It has been a deliberate choice in the methodology outlined here to favor accuracy over completeness for reasons of transparency, as it is easier to combine additional names than separate combined names. An accurate but somewhat incomplete set of harmonized names provides users with ample opportunities to extend the methodology and its results to a broad range of applications. Given an accurate set of harmonized names, additional name matches that are considered relevant can be identified and added in a straightforward manner. Reverse operations, starting with a more complete set, are much more complicated since previous steps undertaken to achieve a more complete result might need to be undone or ‘reverse engineered’. In practice, this would prove to be a much more complicated endeavor than combining disaggregated names. Hence, this methodology, conceived as a transparent and accurate set of harmonized names in which completeness can be gradually improved, is considered far more appealing than a more complete set which contains the risk of not being accurate or being unsuited to specific analytical purposes. As a result, the development of the methodology is based on the underlying principle that every step in the cleaning and harmonization process must increase completeness without decreasing accuracy. Every action that jeopardizes accuracy will ultimately be excluded from the methodology, as combining two names belonging to two different legal entities has to be avoided at all cost. Moreover, in order to achieve sufficient levels of accuracy, several of the procedures and rules that have been developed take into account the specificities of the full original name list. This content-driven approach results in a partly manual, and hence laborintensive, development process. The final procedure can be completely automated in a modular approach to allow further refinements and improvements. The entire procedure is organized as a series of generic steps and sub-steps that are implemented by taking into account the nature of the source data. It should be noted that while the more generic parts of the procedure can be used for all kinds of name-harmonization applications, some procedures are highly content-specific and additional analysis and refinements might be needed if applied to a different set of organization names. Figure 1 contains an overview of the comprehensive methodology that consists of a sequence of steps, including both data pre-processing and name-harmonizing activities. An example patentee name is included to see the results of each step (string parts that will be affected in the next processing step are highlighted in bold). Appendix 1 describes in detail all steps in the name-cleaning and harmonization procedure as applied to the dataset with EPO applicants and USPTO assignees, including examples, detailed analysis and implementation. The main principles underlying each step are explained in the following paragraphs. This will facilitate discussion of the results in section 5 - Results and Impact 5
Figure 1: Overview schema name cleaning and harmonization 4.1 Data pre-processing In the pre-processing steps, data are prepared for processing to facilitate actual name cleaning and harmonization. The individual impact of each step on the number of unique patentee names is limited but it smoothes progression through consecutive steps and considerably increases the overall impact. Data pre-processing is highly dependent on the content of the underlying data. Consequently, extensive refinements or adaptations may be needed when processing names from a different data source. 4.1.1 Character cleaning Depending on the data source, non-letter (A to Z) and non-digit (0 to 9) characters can be coded or represented in a variety of ways (e.g. ANSI, SGML), inducing additional name 6
Page 1 and 2: Data Production Methods for Harmoni
Page 3 and 4: Table of contents 1 Introduction...
Page 5 and 6: 2 PATENTEE NAME HARMONIZATION AND L
Page 7: 3.2 DERWENT WPI company name harmon
Page 11 and 12: 4.2.3 Spelling variation harmonizat
Page 13 and 14: 6 DIRECTIONS FOR FURTHER DEVELOPMEN
Page 15 and 16: A considerable number of name varia
Page 17 and 18: IBM INTERNATION BUSINESS MACHINES I
Page 19 and 20: DIAGNOSTICS MITSUBISHIJUKOGYO MITSU
Page 21 and 22: APPENDIX 1: STEP-BY-STEP METHODOLOG
Page 23 and 24: 1.1.3 Replace propriety coded chara
Page 25 and 26: 220 "Ü" "U" 221 "Ý" "Y" 159 "Ÿ"
Page 27 and 28: Impact From 438,069 unique names to
Page 29 and 30: “%,GMBH.%” “, GMBH.” “%,G
Page 31 and 32: 2 NAME CLEANING 2.1 Legal form indi
Page 33 and 34: INCORPORATED Incorporated AS Akties
Page 35 and 36: Some legal form indications are not
Page 37 and 38: " GMBH & CO. KG " 191 GMBH Replace
Page 39 and 40: "AND COMPANY" 120 "& COMPANY" 10,90
Page 41 and 42: "TECHNOLOGIES" 7,587 "TECHNOLOGY" "
Page 43 and 44: marked in the previous step as a na
Page 45 and 46: 3 HARMONIZATION RESULTS As the fina
Page 47 and 48: 13 37 12 35 11 61 10 82 9 136 8 179
Page 49 and 50: 3.3 Patent distribution amongst pat
Page 51 and 52: product). Appendix 5 contains the t
Page 53 and 54: The absolute differences in the num
Page 55 and 56: 1056 ", IN.C" 4 INCORPORATED Remove
Page 57 and 58: 1199 " MFG., LTD." 23 LIMITED Repla
Page 59 and 60:
1314 " (I.P) LIMITED" 1 LIMITED Rem
Page 61 and 62:
1450 " P.L.C." 60 PLC Remove 1451 "
Page 63 and 64:
1585 " A.S." 240 AS Remove 1586 " A
Page 65 and 66:
1715 CO." " GESELLSCHAFT M.B.H & CO
Page 67 and 68:
1844 " AKTIENGESELLSCHAFT & CO. KG"
Page 69 and 70:
1981 " MBH." 19 GMBH Remove 1982 "
Page 71 and 72:
2053 " KOGYOLKABUSHIKI KAISHA" 1 20
Page 73 and 74:
LAST WORD NBR OF CUM % LAST WORD NB
Page 75 and 76:
FIRST WORD NBR OF CUM % FIRST WORD
Page 77 and 78:
52 NIKON CORPORATION 4,500 615,317
Page 79 and 80:
177 UNISYS CORPORATION 1,653 942,72
Page 81 and 82:
54 HENKEL KOMMANDITGESELLSCHAFT AUF
Page 83 and 84:
177 CASIO COMPUTER COMPANY 1,848 1,
Page 85:
DIMPLEX NORTH AMERICA 1 22 1 22 0 0
show all

Patentee Name Harmonisation - ecoom.be

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?