06.06.2013 Views

Patentee Name Harmonisation - ecoom.be


marked in the previous step as a name originally containing an umlaut are also marked by executing an update query on the data.

Finally, all preliminary umlaut-harmonized names not marked in the previous two steps are reverted to the previous cleaned name (the name after condensing) by executing an update query on the data.
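The mark-and-revert logic described above can be sketched in Python. This is only an illustration under stated assumptions: the report runs update queries on a database, and the start of the method description is not part of this excerpt. The rule in `preliminary_key` (treating Ä/AE/A, Ö/OE/O and Ü/UE/U as one spelling) is an assumption chosen to be consistent with the single example given in section 2.6, and all function names and the `(original, condensed)` record layout are hypothetical.

```python
import re

UMLAUT_BASE = {"Ä": "A", "Ö": "O", "Ü": "U"}


def preliminary_key(name):
    """Assumed preliminary rule: treat Ä/AE/A, Ö/OE/O, Ü/UE/U as one spelling."""
    s = name.upper()
    for umlaut, base in UMLAUT_BASE.items():
        s = s.replace(umlaut, base)
    s = re.sub(r"([AOU])E", r"\1", s)     # collapse AE -> A, OE -> O, UE -> U
    return re.sub(r"([AOU])", r"\1E", s)  # expand A -> AE, O -> OE, U -> UE


def had_umlaut(original):
    """True if the original name contained an umlaut."""
    return any(ch in original.upper() for ch in UMLAUT_BASE)


def harmonize_umlauts(records):
    """records: list of (original_name, condensed_name) pairs (hypothetical layout)."""
    keys = [preliminary_key(condensed) for _, condensed in records]
    # Mark the preliminary names of records whose original contained an umlaut.
    marked = {key for (original, _), key in zip(records, keys) if had_umlaut(original)}
    # Names sharing a marked preliminary name keep it; all others are reverted
    # to the condensed name.
    return [key if key in marked else condensed
            for (_, condensed), key in zip(records, keys)]


records = [
    ("Müller GmbH", "MÜLLER"),
    ("Mueller GmbH", "MUELLER"),
    ("Muller Inc", "MULLER"),
    ("Bauer AG", "BAUER"),
]
print(harmonize_umlauts(records))
# ['MUELLER', 'MUELLER', 'MUELLER', 'BAUER']
```

In this sketch, BAUER is reverted because no name sharing its preliminary form originally contained an umlaut, which mirrors the conservative behaviour described above.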

Result

Umlauts have been harmonized in 9,443 names.

Not all umlaut variations have been harmonized, because refining the method to increase the number of matches tends to increase the number of mismatches quite substantially.

The method presented here is very safe (100% correct matches), but it could well be improved to cover more names.

Impact

From 365,866 unique names to 365,564 unique names, an additional reduction of 302 names, or a total reduction of 78,158 names (17.6%).

2.6 Cleaned name

Final result

The final cleaned name is the name after character cleaning, punctuation cleaning, legal form indication treatment, common company word removal, spelling variation harmonization, condensing and umlaut harmonization.

During cleaning, the original name can become heavily mutilated and unrecognizable (e.g., from “" NEUSON" -ÖLFELDSCHIEBER GESELLSCHAFT M.B.H.” to “NEUESOENOELFELDSCHIEBER”). At this stage, the cleaned name is only usable to identify matching names. In a subsequent step, the cleaned name will be converted back to a more readable and usable name closer to the original.
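As an illustration of how such a chain of steps can be applied in a fixed order, the following Python sketch runs a few heavily simplified stand-ins for the steps and counts the unique names after each one. Everything in it is an assumption: the function names, the tiny `LEGAL_FORMS` list and the rules inside each step are hypothetical, and common company word removal, spelling variation harmonization and umlaut harmonization are left out, so the output stops short of the final cleaned name quoted above.

```python
import re

# Hypothetical, very small list; the report's legal form treatment is far richer.
LEGAL_FORMS = ("GESELLSCHAFT MBH", "GMBH", "INC", "LTD")


def character_cleaning(name):
    # Assumption: upper-case and trim only.
    return name.upper().strip()


def punctuation_cleaning(name):
    # Assumption: periods are dropped, other punctuation becomes a space.
    name = name.replace(".", "")
    name = re.sub(r"[^\w\s]", " ", name)
    return re.sub(r"\s+", " ", name).strip()


def legal_form_removal(name):
    # Assumption: only a trailing legal form is removed.
    for form in LEGAL_FORMS:
        if name.endswith(" " + form):
            return name[: -len(form)].strip()
    return name


def condensing(name):
    # Remove all whitespace.
    return re.sub(r"\s+", "", name)


STEPS = [character_cleaning, punctuation_cleaning, legal_form_removal, condensing]


def clean(name):
    """Apply every step in order to a single name."""
    for step in STEPS:
        name = step(name)
    return name


def unique_counts(names):
    """Number of unique names before cleaning and after each step."""
    current = list(names)
    counts = [("original", len(set(current)))]
    for step in STEPS:
        current = [step(n) for n in current]
        counts.append((step.__name__, len(set(current))))
    return counts


print(clean('" NEUSON" -ÖLFELDSCHIEBER GESELLSCHAFT M.B.H.'))
# NEUSONÖLFELDSCHIEBER
print(unique_counts(["Acme Inc.", "ACME INC", "Acme, Inc.", "Acme Ltd"]))
```

In the second call, the legal form step only merges all four variants because punctuation cleaning has already normalized "INC." and ", INC." to "INC".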

In total, all 443,722 names have been affected by one of the cleaning or harmonization steps.

Final impact

All cleaning and harmonization steps resulted in a reduction from 443,722 unique original names to 365,564 unique cleaned names, a reduction of 78,158 names or 17.6%.

Table 23 contains an overview of the impact of every step: the number of unique names before and after the particular cleaning or harmonization step, the reduction in the number of names, the cumulative reduction for that step and all previous steps, the relative reduction compared to the total number of original unique names, and the cumulative relative reduction compared to the total number of original unique names.

Table 23: Step-by-step results of cleaning and harmonization

STEP                              FROM     TO       REDUCTION  CUM. REDUCTION  %     CUM. %
Character cleaning                443,722  438,366      5,356           5,356   1.2     1.2
Punctuation cleaning              438,366  437,336      1,030           6,386   0.2     1.4
Legal form removal                437,336  392,226     45,110          51,496  10.2    11.6
Common company word removal       392,226  385,771      6,455          57,951   1.5    13.1
Spelling variation harmonization  385,771  384,235      1,536          59,487   0.3    13.4
Condensing                        384,235  365,866     18,369          77,856   4.1    17.5
Umlaut harmonization              365,866  365,564        302          78,158   0.1    17.6
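The derived columns of Table 23 follow directly from the FROM and TO counts. As a minimal sketch (the variable and function names are illustrative, not from the report), the reductions and percentages relative to the 443,722 original unique names can be recomputed as follows:

```python
ORIGINAL = 443_722  # unique original names

# (step, unique names before, unique names after), taken from Table 23
STEP_COUNTS = [
    ("Character cleaning", 443_722, 438_366),
    ("Punctuation cleaning", 438_366, 437_336),
    ("Legal form removal", 437_336, 392_226),
    ("Common company word removal", 392_226, 385_771),
    ("Spelling variation harmonization", 385_771, 384_235),
    ("Condensing", 384_235, 365_866),
    ("Umlaut harmonization", 365_866, 365_564),
]


def summarize(step_counts, original):
    """Return (step, reduction, cumulative reduction, %, cumulative %) rows."""
    rows = []
    cumulative = 0
    for step, before, after in step_counts:
        reduction = before - after
        cumulative += reduction
        rows.append((
            step,
            reduction,
            cumulative,
            round(100 * reduction / original, 1),
            round(100 * cumulative / original, 1),
        ))
    return rows


for row in summarize(STEP_COUNTS, ORIGINAL):
    print(row)
```

Both percentage columns use the original total as denominator, which is why the cumulative percentage is not always the exact sum of the rounded per-step percentages.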

Legal form removal and condensing are, by far, the most important steps. This does not mean that all other steps can be neglected, as these steps prepare the data for subsequent steps. Even if the impact of a particular step is low, the results of the step can greatly improve the impact of the steps that follow.
