Segmentation of heterogeneous document images : an ... - Tel


fixed range. When the elongation approaches one, the component is close to a square. Most characters have an elongation of about 0.57. When the value approaches zero, however, the component is usually a rule line. Figure 3.4 shows histograms of 1/elongation and log(1/elongation) for text components over the whole corpus.

\[
\frac{\text{elongation}}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x-\mu)^{2}}{2\sigma^{2}}\right)
\]

where x is 1/elongation and µ and σ are the mean and standard deviation of the natural logarithm of 1/elongation for all connected components on the page.
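As an illustration, the feature above can be sketched in a few lines of Python. This is a minimal sketch, not the thesis implementation: the function names are hypothetical, the elongation values are assumed to be already computed for each connected component, and the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
import math


def log_normal_density(x, mu, sigma):
    """Log-normal density of x > 0, given the mean and std of ln(x)."""
    exponent = -((math.log(x) - mu) ** 2) / (2.0 * sigma ** 2)
    return math.exp(exponent) / (x * sigma * math.sqrt(2.0 * math.pi))


def elongation_features(elongations):
    """Return one feature value per connected component of a page.

    `elongations` holds one elongation value (> 0) per component.
    """
    xs = [1.0 / e for e in elongations]  # x = 1/elongation
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    # Assumption: population standard deviation over the page's components.
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    if sigma == 0.0:
        raise ValueError("all components have the same elongation")
    # Note that 1/x equals the elongation, which is why the elongation
    # appears in the numerator of the formula.
    return [log_normal_density(x, mu, sigma) for x in xs]
```

With elongations such as `[0.57, 0.6, 0.55, 0.5, 0.05]`, the last component (a thin, line-like shape) receives a much lower value than the character-like ones.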

tel-00912566, version 1 - 2 Dec 2013<br />

• Log-normal distribution of height³. For pages written in a single font size, component heights are roughly equal. Even in handwriting with connected scripts, the widths of CCs may differ but their heights are still roughly the same. This feature looks for outliers, such as large graphical elements and tables, whose height is considerably larger than that of the other CCs on the page. Figure 3.5 displays histograms of the height of text components, and of its logarithm, over the whole corpus.

\[
\frac{1}{(\text{height})\,\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x-\mu)^{2}}{2\sigma^{2}}\right)
\]

where x is the height of a component and µ and σ are the mean and standard deviation of the natural logarithm of the height for all components of the page.
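To see how this feature singles out unusually tall components, here is a small hedged example. The function name `height_feature` and the sample heights are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption rather than something the text specifies.

```python
import math
import statistics


def height_feature(heights):
    """Log-normal density of each component height, fitted per page."""
    logs = [math.log(h) for h in heights]
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs)  # assumption: population estimator
    if sigma == 0.0:
        raise ValueError("all components have the same height")
    norm = sigma * math.sqrt(2.0 * math.pi)
    return [
        math.exp(-((math.log(h) - mu) ** 2) / (2.0 * sigma ** 2)) / (h * norm)
        for h in heights
    ]


# Five text-like components and one tall graphical element (made-up values).
values = height_feature([20, 22, 21, 19, 20, 300])
```

In this made-up page the tall component gets a value far below those of the text-sized components, which is the kind of contrast a classifier can exploit.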

• Normalized X⁴ and Y⁵ coordinates of the component's center. These two features range between 0 and 1. They are simply the X and Y coordinates of a component's center, divided by the width and height of the document image respectively. The reason for using these features is that noisy components, borders and frames are most often located near the boundaries of a document. By themselves they do not provide a direct way of locating non-textual elements, but a classifier can use this information alongside other features to form boundaries in the feature space that better discriminate between text and non-textual components.
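A minimal sketch of these two features follows. It assumes that the component's center is the center of its bounding box, given as `(x, y, width, height)` in pixels; the text only says "component's center", so this choice and the function name are illustrative assumptions.

```python
def normalized_center(bbox, image_width, image_height):
    """Return (center.x / image width, center.y / image height).

    `bbox` is (x, y, width, height) of the component's bounding box.
    Assumption: the center is taken as the bounding-box center.
    """
    x, y, w, h = bbox
    center_x = x + w / 2.0
    center_y = y + h / 2.0
    return center_x / image_width, center_y / image_height
```

For example, a 100 x 50 pixel component at the top-left corner of a 1000 x 500 image has a normalized center of about (0.05, 0.05), a position close to the page boundary.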

• Logarithm of normalized height⁶ and width⁷. These two features are computed as follows:

³ LogNormalDist(height)
⁴ center.x/src.cols
⁵ center.y/src.rows
⁶ log(src.rows/height)
⁷ log(src.cols/width)

\[
\log\left(\frac{\text{page's height}}{\text{component's height}}\right)
\]
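Following the footnote expressions log(src.rows/height) and log(src.cols/width), both features might be computed as in the sketch below. The function name is hypothetical, and the natural logarithm is an assumption, since the text writes "log" without stating a base.

```python
import math


def log_normalized_size(width, height, image_width, image_height):
    """Return (log(page height / height), log(page width / width)).

    `image_height` and `image_width` play the role of src.rows and src.cols.
    Assumption: natural logarithm.
    """
    return (
        math.log(image_height / height),
        math.log(image_width / width),
    )
```

A component as large as the page yields 0 for both features, and the values grow as the component gets smaller relative to the page.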
