14.01.2014 Views

Segmentation of heterogeneous document images : an ... - Tel


is assigning a label to a region of the image, it tends to extract features based on texture analysis or projection profiles.

The method we describe here is a hybrid approach for separating text from graphics. It assigns a label to a single connected component, but the features are gathered from both the characteristics of the component itself and those of the components in its neighborhood.

tel-00912566, version 1 - 2 Dec 2013

We consider 16 candidate features for separating text and graphics components. Each feature may help to discriminate components in a particular situation. The first three features are computed relative to the other components of the page. With these three features, the aim is to estimate how likely it is that the current component differs from the other components on the same page with regard to its elongation, height or solidity. The rest of the features are global features, extracted from all components of the pages in our training dataset. The height and width of a component are equal to the height and width of the bounding box of the component. Elongation and solidity are defined as follows:

$$\text{elo} = \frac{\min(\text{height}, \text{width})}{\max(\text{height}, \text{width})}$$

and

$$\text{solidity} = \frac{\text{Number of pixels of the component}}{\text{height} \times \text{width}}$$

Considered features are:

• Log-normal distribution of 1/solidity¹. For most characters, the number of black pixels divided by the area of their bounding box, namely solidity, lies in a fixed range. This property does not hold for tables, borders and many graphical drawings. Looking for components of the page that are outliers with respect to this criterion may help in classifying them as non-textual components. Figure 3.3 shows histograms of 1/solidity and log(1/solidity) for text components on the whole corpus, which visually justify the use of a log-normal distribution.

$$\frac{\text{solidity}}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$$

where x is 1/solidity, and µ and σ are the mean and standard deviation of the natural logarithm of 1/solidity over all connected components on the page.

• Log-normal distribution of 1/elongation². Except for some characters such as 1, l and I that resemble a very small rule line, the ratio of height and width of a character, namely elongation, is located roughly in a

¹ LogNormalDist(1/solidity)

² LogNormalDist(1/elo)
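The definitions of elongation and solidity and the page-relative log-normal feature described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the representation of a component as a (height, width, pixel count) tuple, the use of the population standard deviation, and the small floor on σ are all assumptions made for illustration only.

```python
import math
from statistics import mean, pstdev


def elongation(height, width):
    """elo = min(height, width) / max(height, width) of the bounding box."""
    return min(height, width) / max(height, width)


def solidity(pixel_count, height, width):
    """Number of pixels of the component divided by its bounding-box area."""
    return pixel_count / (height * width)


def lognormal_pdf(x, mu, sigma):
    """Log-normal density: exp(-(ln x - mu)^2 / (2 sigma^2)) / (x sigma sqrt(2 pi))."""
    exponent = -((math.log(x) - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (x * sigma * math.sqrt(2 * math.pi))


def lognormal_feature(values):
    """Density of each value under a log-normal fitted to one page.

    `values` holds 1/solidity (or 1/elongation) for every connected
    component of the page. mu and sigma are the mean and standard
    deviation of the natural logarithms of these values.
    """
    logs = [math.log(v) for v in values]
    mu = mean(logs)
    # Assumption: population standard deviation, floored so that a page
    # of identical components does not cause a division by zero.
    sigma = max(pstdev(logs), 1e-6)
    return [lognormal_pdf(v, mu, sigma) for v in values]


# Hypothetical page: five character-like components and one large, thin frame.
# Each tuple is (height, width, number of pixels).
page = [
    (20, 12, 120),
    (18, 10, 90),
    (22, 14, 140),
    (20, 11, 100),
    (19, 12, 125),
    (200, 300, 2000),  # border-like component with very low solidity
]

inverse_solidity = [1.0 / solidity(n, h, w) for (h, w, n) in page]
features = lognormal_feature(inverse_solidity)

for (h, w, n), value in zip(page, features):
    print(f"{h}x{w}, {n} px: feature = {value:.4f}")
```

In this toy page the frame receives a much lower density than the character-like components, which is the outlier behaviour the feature is meant to expose. The same `lognormal_feature` function can be applied to a list of 1/elongation values for the second feature.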
