14.01.2014 Views

Segmentation of heterogeneous document images : an ... - Tel


log(page’s width / component’s width)

They are normalized because the document pages of our training dataset come in different resolutions and sizes.

• Logarithm of 1/solidity⁸ and 1/elongation⁹. The importance of solidity and elongation is noted above. Instead of computing the mean or standard deviation of these features over the components of a page, the two features are provided as raw values, which allows the classifier to compare them globally across all images of the training dataset.

tel-00912566, version 1 - 2 Dec 2013<br />

• Logarithm of Hu-moments of the component’s pixels¹⁰. Image moments are particular weighted averages of the image pixels’ intensities. They are usually chosen to have some attractive property or interpretation. Hu moments are special moments that are proven to be invariant to image scale, rotation and reflection. Seven such moments are popular in the literature; of these, the seventh is not invariant to reflection. Empirically, we found that the first four moments improve the classification results.

• Parents, children and siblings. Text characters that appear on a page usually neither contain nor are contained in any other component, except when they belong to a table. Large drawings, on the other hand, often contain many broken pieces. If a component contains another component, the former is assigned the role of a parent and the latter the role of a child. Any component that has a parent might also have siblings. The numbers of parents, children and siblings of a component serve as features well suited to our purpose.
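The last two features in the list above can be sketched in Python with NumPy. This is only an illustrative sketch, not the implementation used in this work. The function names `hu_log_features` and `family_counts` are hypothetical. The first applies the log(HuMomentX+1) transform of footnote 10 to the first four Hu moments of a binary component mask. The second assumes that containment is decided on bounding boxes, that every enclosing component counts as a parent, and that siblings are components sharing at least one parent; the text does not state these details.

```python
import numpy as np

def hu_log_features(mask):
    """Return log(Hu_k + 1), k = 1..4, for a binary component mask."""
    mask = np.asarray(mask, dtype=float)
    ys, xs = np.mgrid[:mask.shape[0], :mask.shape[1]]
    m00 = mask.sum()
    xbar = (xs * mask).sum() / m00
    ybar = (ys * mask).sum() / m00

    def eta(p, q):
        # Central moment, normalized for scale.
        mu = (((xs - xbar) ** p) * ((ys - ybar) ** q) * mask).sum()
        return mu / m00 ** (1 + (p + q) / 2.0)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    h1 = n20 + n02
    h2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    h3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2
    h4 = (n30 + n12) ** 2 + (n21 + n03) ** 2
    # The first four Hu moments are non-negative, so log(h + 1) is defined.
    return np.log(np.array([h1, h2, h3, h4]) + 1.0)

def family_counts(boxes):
    """Count (parents, children, siblings) per box (x0, y0, x1, y1).

    Assumption: containment is tested on bounding boxes only.
    """
    boxes = [tuple(b) for b in boxes]

    def inside(inner, outer):
        return (inner != outer
                and outer[0] <= inner[0] and outer[1] <= inner[1]
                and inner[2] <= outer[2] and inner[3] <= outer[3])

    parents = [{j for j, o in enumerate(boxes) if inside(b, o)} for b in boxes]
    counts = []
    for i, b in enumerate(boxes):
        n_children = sum(1 for c in boxes if inside(c, b))
        n_siblings = sum(1 for j in range(len(boxes))
                         if j != i and parents[i] & parents[j])
        counts.append((len(parents[i]), n_children, n_siblings))
    return counts
```

On a page with one large box enclosing two small ones, the large box would get two children and each small box one parent and one sibling, while an isolated text character would get zeros throughout.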

3.3 Feature analysis

Given a set of features, it is useful to know the contribution that each feature brings to the classification task. Moreover, if a feature has no correlation with the true labeling of the data, it may do more harm than good and should therefore be pruned. To this end, we apply several feature selection methods to our dataset and its true labels; each method assigns every feature a weight that indicates the feature's significance according to that method. All parts of this analysis are carried out with RapidMiner [63], an open-source software package, and for this purpose we use the dataset from the ICDAR2009 page segmentation competition because it contains a good number of irregular graphical components and tables.

Table 3.1 summarizes the results of the feature analysis by showing the weights obtained with four feature analysis methods. The first method calculates the relevance of a feature by computing the information gain in class distribution.
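To make the notion of information gain concrete, the following Python sketch computes the textbook entropy-based gain of one continuous feature with respect to the class labels. It is not RapidMiner's implementation. The function names and the equal-width binning with a `bins` parameter are assumptions made for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels, bins=10):
    """Entropy of the labels minus the entropy left after splitting on the feature.

    Assumption: the feature is discretized into equal-width bins.
    """
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    edges = np.histogram_bin_edges(feature, bins=bins)
    binned = np.digitize(feature, edges[1:-1])  # bin index per sample
    gain = entropy(labels)
    for b in np.unique(binned):
        sel = binned == b
        # Weight each bin's entropy by the fraction of samples it holds.
        gain -= sel.mean() * entropy(labels[sel])
    return gain
```

A feature whose bins separate the classes cleanly receives a gain equal to the label entropy, whereas a feature that is distributed identically in every class receives a gain of zero.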

8 log(1/solidity)<br />

9 log(1/elo)<br />

10 log(HuMomentX+1)<br />
