Segmentation of heterogeneous document images : an ... - Tel
log(page’s width / component’s width)
They are normalized because the document pages of our training
dataset come in different resolutions and sizes.
• Logarithm of 1/solidity 8 and 1/elongation 9. The importance of
solidity and elongation is noted above. Instead of computing the mean
or standard deviation of these features over the components of a page, the two
features are provided as raw values, which allows the classifier to compare
them globally across all images of the training dataset.
tel-00912566, version 1 - 2 Dec 2013<br />
• Logarithm of Hu-moments of the component’s pixels 10. Image
moments are particular weighted averages of the image pixels’ intensities.
They are usually chosen to have some attractive property or interpretation.
Hu moments are special moments that are proven to be invariant to
image scale, rotation and reflection. There are seven such
moments in the literature. Of these, the seventh is not
invariant to reflection. Empirically, we found that the first four moments
improve the classification results.
• Parents, children and siblings. Text characters that appear on
a page are usually not contained in, and do not contain, any other components, except
when they belong to a table. On the other hand, large drawings often
contain many broken pieces. If a component contains another component,
the former is assigned the role of a parent and the latter is assigned the
role of a child. Any component that has a parent may also have
siblings. The numbers of parents, children and siblings of a component
therefore serve as well-suited features for our purpose.
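The last two features in the list above can be illustrated with a minimal Python sketch. It is not the implementation used in this work. The function names `log_hu_features` and `family_counts` are hypothetical, components are assumed to be given as binary masks and as axis-aligned bounding boxes `(x0, y0, x1, y1)`, containment is assumed to be tested on bounding boxes, and siblings are assumed to be the other components that share the same smallest enclosing box.

```python
import math


def log_hu_features(mask):
    """Return log(h_k + 1) for the first four Hu moments of a binary mask.

    `mask` is a list of rows holding 0/1 values (1 = component pixel).
    """
    pixels = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    m00 = len(pixels)
    if m00 == 0:
        raise ValueError("empty component")
    cx = sum(x for x, _ in pixels) / m00
    cy = sum(y for _, y in pixels) / m00

    def eta(p, q):
        # Central moment about the centroid, normalized for scale invariance.
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in pixels)
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    hu = [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2,
        (n30 + n12) ** 2 + (n21 + n03) ** 2,
    ]
    # The first four Hu moments are non-negative, so log(h + 1) is defined.
    return [math.log(h + 1) for h in hu]


def contains(a, b):
    # True if box a encloses box b and the two boxes are not identical.
    return a != b and a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def family_counts(boxes):
    """Return a (parents, children, siblings) tuple for every bounding box."""
    n = len(boxes)
    parents = [[j for j in range(n) if contains(boxes[j], boxes[i])] for i in range(n)]
    children = [sum(1 for j in range(n) if contains(boxes[i], boxes[j])) for i in range(n)]
    # Assumption: the direct parent is the smallest enclosing box.
    direct = [min(p, key=lambda j: area(boxes[j])) if p else None for p in parents]
    counts = []
    for i in range(n):
        if direct[i] is None:
            siblings = 0
        else:
            siblings = sum(1 for j in range(n) if j != i and direct[j] == direct[i])
        counts.append((len(parents[i]), children[i], siblings))
    return counts
```

For example, a large drawing box that encloses two small fragments gets two children and no parents, while each fragment gets one parent and one sibling. An isolated text character gets zeros in all three counts.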
3.3 Feature analysis
Given a set of features, it is useful to know the contribution that each
feature brings to the task of classification. Moreover, if a feature has no correlation
with the true labeling of the data, it may do more harm than good and
thus should be pruned. To this end, we apply several feature selection methods to our
dataset and its true labels. Each method assigns a weight to each feature,
which indicates the significance of the feature according to that method. All parts
of this analysis are carried out with an open-source software package called RapidMiner [63].
For this purpose we use the dataset from the ICDAR2009 page segmentation
competition, because it contains a good number of irregular graphical
components and tables.
Table 3.1 summarizes the results of the feature analysis by showing the weights
obtained from four feature analysis methods. The first method calculates the relevance
of a feature by computing the information gain with respect to the class distribution.
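As an illustration of the information gain criterion, the following minimal Python sketch computes the gain of a single feature as the entropy of the class labels minus the entropy that remains once the feature value is known. It is not RapidMiner's implementation. The function names are hypothetical, and the feature is assumed to be discrete already, so a continuous feature would have to be binned first.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the label entropy inside each feature-value group.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

With two balanced classes, a feature that separates the classes exactly has a gain of one bit, while a feature whose values are distributed identically in both classes has a gain of zero.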
8 log(1/solidity)<br />
9 log(1/elo)<br />
10 log(HuMomentX+1)<br />