Segmentation of heterogeneous document images : an ... - Tel

Although it is not mandatory to segment text regions into paragraphs before passing text lines to optical character recognition modules, correct paragraphs are necessary for reading order detection. Thus, in this work we also group text lines into paragraphs after identifying each text region. It is worth noting that there is more than one correct way to segment a document image into text regions, depending on the reading order in the ground truth, but there is only one solution for segmenting text regions into paragraphs.

tel-00912566, version 1 - 2 Dec 2013

Because of the important role page segmentation plays in document layout analysis, and of its direct effect on the optical character recognition step, it has been explored deeply for the last four decades by the document imaging community, and many algorithms have been proposed in the literature. A comprehensive overview of these algorithms is provided by Nagy in [65] and Cattoni et al. in [22]. In this work we review many of these algorithms in the domains of text/graphics separation and text line and region detection. We then propose a new system for page segmentation that improves the results in areas where current algorithms still have problems.

1.3 Overview of the approach

We view a document as a scene of connected components (CCs). Thus, the first step in our method is image binarization and extraction of all connected components. The goal of the system is to find the locations of all paragraphs inside the document image. To form paragraphs, we need to detect text lines correctly, and to do so, we have to detect text regions by considering the geometric alignment of CCs. Text lines in multi-column documents should be separated from one another, and where side notes exist, we have to use the alignment of the CCs to separate them from the main text. All these operations should be carried out without reading the actual text or understanding its context.
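The first step described above, binarization followed by extraction of connected components, can be illustrated with a minimal sketch. It is not the thesis's implementation: the fixed global threshold of 128, the use of 8-connectivity, and the function names `binarize` and `connected_components` are all assumptions made for illustration, and a real system would likely use an adaptive binarization method.

```python
from collections import deque

def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = background.
    The fixed threshold is an assumption, not the thesis's binarization method."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def connected_components(binary):
    """Return one bounding box (top, left, bottom, right) per 8-connected ink blob."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# Toy 4x6 grayscale page: two dark blobs on a light background.
page = [
    [255,  10,  10, 255, 255, 255],
    [255,  10, 255, 255,  20, 255],
    [255, 255, 255, 255,  20, 255],
    [255, 255, 255, 255,  20, 255],
]
boxes = connected_components(binarize(page))
print(boxes)
```

Each bounding box stands for one CC; later stages would work on these boxes and their geometric alignment rather than on the recognized text.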

After binarization, we have a set of connected components, each belonging to either a text or a non-text region. The next step is to classify each connected component correctly. Each CC has a set of intrinsic features (height, width, eccentricity, ...) and a set of extrinsic features (features from the surrounding area). Using these features, we train a set of weak classifiers with the help of boosting, which allows us to classify components as text or non-text. Chapter 3 describes this method in detail.
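To make the idea of boosting weak classifiers over CC features concrete, here is a small sketch. It rests on several assumptions: the passage does not name the boosting variant, so discrete AdaBoost with one-feature decision stumps is used as a stand-in; aspect ratio and ink density are hypothetical substitutes for features such as eccentricity; extrinsic features are omitted; and the training data is a hand-made toy set, not data from the thesis.

```python
import math

def intrinsic_features(box, ink_pixels):
    """Features computed from one CC alone; box is (top, left, bottom, right).
    Aspect ratio and ink density are illustrative choices, not the thesis's list."""
    top, left, bottom, right = box
    height, width = bottom - top + 1, right - left + 1
    return [height, width, width / height, ink_pixels / (height * width)]

def stump_predict(stump, x):
    """A weak classifier: threshold a single feature."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity

def train_adaboost(samples, labels, rounds=3):
    """Discrete AdaBoost over decision stumps; labels are +1 (text) or -1 (non-text)."""
    n = len(samples)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, stump)
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for feature in range(len(samples[0])):
            for threshold in sorted({x[feature] for x in samples}):
                for polarity in (1, -1):
                    stump = (feature, threshold, polarity)
                    error = sum(w for w, x, y in zip(weights, samples, labels)
                                if stump_predict(stump, x) != y)
                    if best is None or error < best[0]:
                        best = (error, stump)
        error, stump = best
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples, then renormalise.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for w, x, y in zip(weights, samples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def classify(ensemble, x):
    """Weighted vote of all stumps: +1 = text, -1 = non-text."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score > 0 else -1

# Toy training set: character-sized CCs versus rule lines and a large picture.
samples = [
    intrinsic_features((0, 0, 9, 7), 40),       # text
    intrinsic_features((0, 0, 11, 9), 60),      # text
    intrinsic_features((0, 0, 9, 5), 30),       # text
    intrinsic_features((0, 0, 1, 199), 400),    # rule line
    intrinsic_features((0, 0, 149, 119), 9000), # picture
    intrinsic_features((0, 0, 2, 299), 900),    # rule line
]
labels = [1, 1, 1, -1, -1, -1]
ensemble = train_adaboost(samples, labels)
```

On this toy set a single width threshold already separates the classes, so every round selects the same stump; on real pages, where no single feature suffices, the reweighting step is what pushes later stumps toward the components earlier ones got wrong.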

At this point, we want to segment a document image into regions of text that have some form of alignment. We consider both the textual and the non-textual CCs for this task. Note that some non-text CCs, such as tables and rule lines, are highly effective in separating regions, so we do not discard non-textual components. In chapter 4 we show how the method makes the most of both sets to separate text inside tables.

Once we have all text elements and their regions, the next step is detecting text lines. Chapter 2 thoroughly reviews major methods in the literature for text line
