Segmentation of heterogeneous document images : an ... - Tel
fixed range. When the elongation approaches one, the component is close to a square.<br />
Most characters have an elongation of about 0.57. When the value<br />
approaches zero, however, the component is usually a rule line. Figure 3.4 shows<br />
histograms of 1/elongation and log(1/elongation) for text components over<br />
the whole corpus.<br />
$$\frac{1}{x\,\sigma\sqrt{2\pi}}\,\exp\left(-\frac{(\ln x-\mu)^2}{2\sigma^2}\right)$$
where x is 1/elongation, and µ and σ are the mean and standard deviation<br />
of the natural logarithm of 1/elongation over all connected components on the<br />
page.<br />
tel-00912566, version 1 - 2 Dec 2013<br />
• Log-normal distribution of height<sup>3</sup>. On pages written in<br />
a single font size, component heights are roughly equal. Even in handwriting with<br />
connected scripts, the widths of connected components (CCs) may differ, but their heights<br />
remain similar. This feature targets outliers such as large graphical<br />
elements and tables, whose height is considerably larger than that of the<br />
other CCs on the page. Figure 3.5 displays histograms of the height of text<br />
components and of its logarithm over the whole corpus.<br />
$$\frac{1}{x\,\sigma\sqrt{2\pi}}\,\exp\left(-\frac{(\ln x-\mu)^2}{2\sigma^2}\right)$$
where x is the height of a component, and µ and σ are the mean and<br />
standard deviation of the natural logarithm of the height over all components of the<br />
page.<br />
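The two log-normal features above share the same computation, which can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `lognormal_features` and the sample heights are hypothetical, and the use of the population standard deviation (dividing by n) is an assumption, since the text does not say which estimator is used. For the elongation feature, the input would be the list of 1/elongation values instead of heights.

```python
import math

def lognormal_features(values):
    """Log-normal density of each value, with mu and sigma estimated
    from the natural logarithms of all values on the page."""
    logs = [math.log(v) for v in values]
    n = len(logs)
    mu = sum(logs) / n
    # Assumption: population standard deviation (divide by n).
    sigma = math.sqrt(sum((l - mu) ** 2 for l in logs) / n)
    if sigma == 0.0:
        raise ValueError("all values are identical; the density is undefined")
    norm = sigma * math.sqrt(2.0 * math.pi)
    return [
        math.exp(-((l - mu) ** 2) / (2.0 * sigma ** 2)) / (v * norm)
        for v, l in zip(values, logs)
    ]

# Hypothetical page: six text-sized components and one large graphic (in pixels).
heights = [20, 22, 21, 19, 23, 20, 400]
densities = lognormal_features(heights)
for h, d in zip(heights, densities):
    print(h, d)
```

In this toy page the 400-pixel component receives a much lower density than the text-sized components, which is the kind of outlier signal the feature is meant to expose to the classifier.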
• Normalized X<sup>4</sup> and Y<sup>5</sup> coordinates of the component’s center.<br />
These two features range between 0 and 1. They are simply the X<br />
and Y coordinates of a component’s center, divided by the width and height of<br />
the document image respectively. These features are used because<br />
noisy components, borders and frames are most often located<br />
near the boundaries of a document. By themselves they do not provide a<br />
direct way to locate non-textual elements, but a classifier can use<br />
this information alongside other features to form boundaries in feature<br />
space that better discriminate textual and non-textual components.<br />
• Logarithm of normalized height<sup>6</sup> and width<sup>7</sup>. These two features<br />
are computed as follows:<br />
<sup>3</sup> LogNormalDist(height)
<sup>4</sup> center.x/src.cols
<sup>5</sup> center.y/src.rows
<sup>6</sup> log(src.rows/height)
<sup>7</sup> log(src.cols/width)
$$\log\left(\frac{\text{page’s height}}{\text{component’s height}}\right)$$
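The position and size features can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `position_and_size_features` and the bounding-box parameters are hypothetical, the center is taken to be the center of the bounding box (the text does not say whether it is the box center or the centroid), and `log` is assumed to be the natural logarithm.

```python
import math

def position_and_size_features(x, y, w, h, page_w, page_h):
    """Features for one component with bounding box (x, y, w, h),
    where (x, y) is the top-left corner, on a page of page_w x page_h pixels."""
    # Assumption: the component's center is the bounding-box center.
    cx = x + w / 2.0
    cy = y + h / 2.0
    return {
        "norm_x": cx / page_w,                       # center.x / src.cols
        "norm_y": cy / page_h,                       # center.y / src.rows
        "log_norm_height": math.log(page_h / h),     # log(src.rows / height)
        "log_norm_width": math.log(page_w / w),      # log(src.cols / width)
    }

# Hypothetical component near the top-left corner of a 1000 x 2000 page.
feats = position_and_size_features(x=100, y=50, w=40, h=20,
                                   page_w=1000, page_h=2000)
print(feats)
```

A component as large as the page gives a log-normalized size of 0, and the value grows as the component gets smaller relative to the page.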