06.06.2013 Views

Patentee Name Harmonisation - ecoom.be


Result

383,707 names contain spaces or non-alphanumerical characters and have been condensed.

Impact

From 384,235 unique names to 365,866 unique names, an additional reduction of 18,369 names, or a total reduction of 77,856 names (17.5%).
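The figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check: the starting count before any cleaning step is not stated here, so the sketch assumes it equals the current count plus the total reduction.

```python
# Figures quoted in the text.
names_before = 384_235
names_after = 365_866
total_reduction = 77_856

# Reduction achieved by this step alone.
step_reduction = names_before - names_after

# Assumption: the 17.5% is relative to the number of unique names before any
# cleaning step, reconstructed as the current count plus the total reduction.
initial_names = names_after + total_reduction
total_share = total_reduction / initial_names

print(step_reduction)        # 18369
print(initial_names)         # 443722 (derived, not stated in the text)
print(f"{total_share:.1%}")  # 17.5%
```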

2.5 Umlaut harmonization

Description

As described in a previous step (replace accented characters), German characters with a diacritic mark (’umlaut’: “ä”, “ö”, “ü”) cause spelling variations because words containing these characters can occur in three guises: one with an umlaut (e.g. “für”), one with the alternative spelling without an umlaut but with an additional “e” (e.g. “fuer”), and a simplified form without an umlaut and without an additional “e” (e.g. “fur”).
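As an illustration of the three guises, the following minimal sketch derives them for a single lowercase word. The helper name `spelling_variants` and the mapping table are assumptions made for this example, not part of the described method.

```python
# Hypothetical mapping covering only the three lowercase German umlauts.
UMLAUT_TO_BASE = {"ä": "a", "ö": "o", "ü": "u"}


def spelling_variants(word: str) -> set[str]:
    """Return the umlaut form, the form with an added "e", and the plain form."""
    with_e = word
    plain = word
    for umlaut, base in UMLAUT_TO_BASE.items():
        with_e = with_e.replace(umlaut, base + "e")  # "ü" -> "ue"
        plain = plain.replace(umlaut, base)          # "ü" -> "u"
    return {word, with_e, plain}


print(sorted(spelling_variants("für")))  # ['fuer', 'fur', 'für']
```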

Since all of these spelling variations appear in the organization names, simply replacing every character carrying an umlaut with its underlying equivalent without an umlaut and without an additional “e”, as in the earlier cleaning step, will not match all equivalent names.

This additional step tries to match the spelling variant without an umlaut but with an additional “e” with the other spelling variations.

Other languages such as Hungarian also suffer from this problem but, in this step, emphasis is placed on the German umlaut and its equivalent with an additional “e”.

Analysis

Since all three variations appear in organization names (sometimes more than one variation in a name, e.g. “PATENT-TREUHEND-GESELLSCHAFT FUER ELEKTRISCHE GLÜHLAMPEN MBH”), no straightforward solution is available.

Given the former example, creating two variations of every name with umlauts, one without an umlaut but with an additional “e” and one without an umlaut and without an additional “e”, will not work because all kinds of combinations can appear in one name. Even if “PATENT-TREUHEND-GESELLSCHAFT FUER ELEKTRISCHE GLÜHLAMPEN MBH” could be harmonized both to “PATENT-TREUHEND-GESELLSCHAFT FUR ELEKTRISCHE GLUHLAMPEN MBH” and to “PATENT-TREUHEND-GESELLSCHAFT FUER ELEKTRISCHE GLUEHLAMPEN MBH”, the variation “PATENT-TREUHEND-GESELLSCHAFT FUR ELEKTRISCHE GLUEHLAMPEN MBH” would still not be matched.
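The mismatch can be reproduced with a small sketch. The two helper functions are hypothetical and merely generate the two variations named in the example; they are not the method that is eventually implemented.

```python
import re

NAME = "PATENT-TREUHEND-GESELLSCHAFT FUER ELEKTRISCHE GLÜHLAMPEN MBH"
MIXED = "PATENT-TREUHEND-GESELLSCHAFT FUR ELEKTRISCHE GLUEHLAMPEN MBH"


def variant_without_e(name: str) -> str:
    """Umlaut -> base letter, and existing AE/OE/UE -> A/O/U."""
    name = name.translate(str.maketrans("ÄÖÜ", "AOU"))
    return re.sub(r"([AOU])E", r"\1", name)


def variant_with_e(name: str) -> str:
    """Umlaut -> base letter plus "E"; existing AE/OE/UE are left alone."""
    for umlaut, base in (("Ä", "A"), ("Ö", "O"), ("Ü", "U")):
        name = name.replace(umlaut, base + "E")
    return name


variants = {variant_without_e(NAME), variant_with_e(NAME)}

# The mixed spelling equals neither generated variation, so it stays unmatched.
print(MIXED in variants)  # False
```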

Therefore, it is not enough to harmonize only the names containing umlauts: all names have to be scanned for possible matches with a name containing an umlaut.

Simply adding an “e” to, or removing it from, all occurring “a”, “o” or “u” leads to many mismatches, especially in the case of proper names containing “a”, “o” or “u”.

To eliminate these mismatches, only groups of matched names in which at least one name originally contained an umlaut are retained for the name harmonization. However, this additional step to maintain accuracy greatly reduces the number of matches.
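A minimal sketch of this filter is given below. The matching key is simplified (it only collapses “AE”, “OE” and “UE”), and the function names and sample names are invented for illustration; none of them come from the actual data.

```python
import re
from collections import defaultdict

UMLAUTS = "ÄÖÜ"


def match_key(name: str) -> str:
    """Simplified key: strip umlauts, then collapse AE/OE/UE to A/O/U."""
    name = name.translate(str.maketrans(UMLAUTS, "AOU"))
    return re.sub(r"([AOU])E", r"\1", name)


def retained_groups(names: list[str]) -> dict[str, list[str]]:
    """Group names by key; keep groups where some original name has an umlaut."""
    groups = defaultdict(list)
    for name in names:
        groups[match_key(name)].append(name)
    return {
        key: members
        for key, members in groups.items()
        if len(members) > 1
        and any(ch in UMLAUTS for member in members for ch in member)
    }


# Hypothetical sample: the BAUER/BAUR pair is a false match between two
# different proper names and is dropped because neither contains an umlaut.
names = ["MÜLLER GMBH", "MUELLER GMBH", "MULLER GMBH", "BAUER AG", "BAUR AG"]
print(retained_groups(names))
# {'MULLER GMBH': ['MÜLLER GMBH', 'MUELLER GMBH', 'MULLER GMBH']}
```

The price of this filter is visible in the same sample: a genuine pair such as two umlaut-free spellings of one company would be discarded as well.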

Implementation

First, all occurrences of “AE”, “OE” and “UE” are replaced with “A”, “O” and “U” respectively in all names (including names that originally contained no umlauts) by executing a series of update queries on the data.

Next, all occurrences of “A”, “O” and “U” are in turn replaced with “AE”, “OE” and “UE” respectively in all names (again including names that originally contained no umlauts) by executing a series of update queries on the data.

Next, all names originally containing an umlaut are marked by executing an update query on the data.
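The three steps so far can be sketched in Python as plain string operations. The text describes update queries on a database; the functions below are a stand-in written for illustration, with hypothetical names, and they assume the input has already passed the earlier cleaning step (upper case, umlauts replaced by their base letters).

```python
def remove_e(name: str) -> str:
    """Step 1: replace AE, OE and UE with A, O and U."""
    for digraph, base in (("AE", "A"), ("OE", "O"), ("UE", "U")):
        name = name.replace(digraph, base)
    return name


def add_e(name: str) -> str:
    """Step 2: replace every A, O and U with AE, OE and UE."""
    for base in "AOU":
        name = name.replace(base, base + "E")
    return name


def preliminary_name(cleaned_name: str) -> str:
    """Preliminary umlaut-harmonized name: remove the "E" first, then add it."""
    return add_e(remove_e(cleaned_name))


def has_umlaut(original_name: str) -> bool:
    """Step 3: mark names whose original spelling contained an umlaut."""
    return any(ch in "ÄÖÜäöü" for ch in original_name)


# Both spellings end up with the same preliminary name.
print(preliminary_name("FUR ELEKTRISCHE GLUHLAMPEN"))
print(preliminary_name("FUER ELEKTRISCHE GLUEHLAMPEN"))
```

Note that the second step also expands every ordinary “A”, “O” and “U”, so the preliminary name is only a matching key and not a readable harmonized name.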

Next, all names having a preliminary umlaut harmonized name (obtained by first removing the “E” from, and then adding an “E” to, “A”, “O” and “U”) that is equal to a preliminary umlaut harmonized name
