Segmentation of heterogeneous document images : an ... - Tel
is assigning a label to a region of the image, it tends to extract features based<br />
on texture analysis or projection profiles.<br />
The method we describe here is a hybrid approach for separating text from<br />
graphics. It assigns a label to a single connected component, but its features<br />
are gathered from both the characteristics of the component itself and those of the<br />
components in its neighborhood.<br />
tel-00912566, version 1 - 2 Dec 2013<br />
We consider 16 candidate features for separating<br />
text and graphics components. Each feature may help to discriminate components<br />
in a particular situation. The first three features are computed relative to<br />
the other components of a page. Using these three features, the aim is to estimate<br />
how likely it is that the current component differs from the other components on the<br />
same page with regard to its elongation, height or solidity. The rest of the features<br />
are global features, extracted from all components of the pages in our training<br />
dataset. The height and width of a component are the height and width<br />
of its bounding box. Elongation and solidity are defined as<br />
follows:<br />
\[ \mathrm{elo} = \frac{\min(\mathrm{height}, \mathrm{width})}{\max(\mathrm{height}, \mathrm{width})} \]<br />
and<br />
\[ \mathrm{solidity} = \frac{\text{number of pixels of the component}}{\mathrm{height} \times \mathrm{width}} \]<br />
The features considered are:<br />
• Log-normal distribution of 1/solidity¹. For most characters, the<br />
number of black pixels divided by the area of their bounding box, namely<br />
solidity, falls within a fixed range. This property does not hold for tables,<br />
borders and many graphical drawings. Looking for components that are outliers<br />
on the page with respect to this criterion may help in classifying them as<br />
non-textual components. Figure 3.3 shows histograms of 1/solidity and<br />
log(1/solidity) for text components over the whole corpus, which visually<br />
justify the use of a log-normal distribution.<br />
\[ \frac{\mathrm{solidity}}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right) \]<br />
where x is 1/solidity, and µ and σ are the mean and standard deviation of<br />
the natural logarithm of 1/solidity over all connected components on the page.<br />
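As a rough illustration of the feature just defined, the following Python sketch computes elongation and solidity from a component's bounding box and pixel count, then evaluates the log-normal density of 1/solidity with µ and σ estimated from the components of one page. The `Component` structure, the function names and the use of the population standard deviation are assumptions made for this example; they are not taken from the thesis.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical summary of a connected component (not from the thesis)."""
    height: int    # bounding-box height in pixels
    width: int     # bounding-box width in pixels
    n_pixels: int  # number of foreground pixels in the component


def elongation(c: Component) -> float:
    # elo = min(height, width) / max(height, width)
    return min(c.height, c.width) / max(c.height, c.width)


def solidity(c: Component) -> float:
    # solidity = number of pixels / (height * width)
    return c.n_pixels / (c.height * c.width)


def log_normal_density(x: float, mu: float, sigma: float) -> float:
    # Standard log-normal density; with x = 1/solidity the factor 1/x
    # equals solidity, which matches the expression in the text.
    coeff = 1.0 / (x * sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((math.log(x) - mu) ** 2) / (2.0 * sigma ** 2))


def solidity_scores(components: list[Component]) -> list[float]:
    """Log-normal density of 1/solidity for every component of one page."""
    logs = [math.log(1.0 / solidity(c)) for c in components]
    mu = sum(logs) / len(logs)
    # Assumption: population standard deviation (divide by N).
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    if sigma == 0.0:
        raise ValueError("all components have the same solidity; sigma is zero")
    return [log_normal_density(1.0 / solidity(c), mu, sigma) for c in components]


if __name__ == "__main__":
    # Four character-like components and one large, sparse, frame-like one.
    page = [
        Component(10, 6, 30),
        Component(12, 7, 40),
        Component(9, 6, 25),
        Component(11, 8, 45),
        Component(100, 120, 600),
    ]
    for comp, score in zip(page, solidity_scores(page)):
        print(comp, round(score, 4))
```

In this toy page the large, sparse component receives the lowest density, in line with the idea that outliers under this criterion are candidates for non-textual components.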
• Log-normal distribution of 1/elongation². Except for some characters<br />
such as 1, l and I, which resemble a very small rule line, the ratio between the height<br />
and width of a character, namely elongation, is located roughly in a<br />
1 LogNormalDist(1/solidity)<br />
2 LogNormalDist(1/elo)<br />