Segmentation of heterogeneous document images : an ... - Tel
Although it is not mandatory to segment text regions into paragraphs before
passing text lines to optical character recognition modules, correct paragraphs
are necessary for reading order detection. Thus, in this work we also
group text lines into paragraphs after identifying each text region. It is worth
noting that there is more than one correct way to segment a document image
into text regions, depending on the reading order in the ground
truth, but there is only one solution for segmenting text regions into paragraphs.
tel-00912566, version 1 - 2 Dec 2013
Because of the important role page segmentation plays in document layout
analysis, and of its direct effect on the optical character recognition step, it has
been explored in depth over the last four decades by the document imaging community,
and many algorithms have been proposed in the literature. A comprehensive
overview of these algorithms is provided by Nagy in [65] and Cattoni et al.
in [22]. In this work we review many of these algorithms in the domains of
text/graphics separation and text line and region detection. We then propose a new
system for page segmentation that improves segmentation results in
areas where current algorithms still have problems.
1.3 Overview of the approach
We view a document as a scene of connected components (CCs). Thus, the
first step in our method is image binarization and extraction of all connected
components. The goal of the system is to find the locations of all paragraphs inside
the document image. In order to form paragraphs, we need to detect text lines
correctly, and to do so, we have to detect text regions by considering the geometric
alignment of CCs. Text lines in multi-column documents should be separated
from one another, and where side notes exist, we have to use the
alignment of the CCs to separate side notes from the main text. All these operations
should be carried out without reading the actual text or understanding
its context.
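As a minimal sketch of this first step, the snippet below binarizes a small grayscale array and extracts one bounding box per connected component. The fixed global threshold, the 8-connectivity, and the function names `binarize` and `connected_components` are assumptions made for illustration only; the text above does not specify which binarization method or connectivity the system uses.

```python
from collections import deque

def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = background.

    A fixed global threshold is an assumption made for this sketch.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def connected_components(binary):
    """Return one bounding box (top, left, bottom, right) per 8-connected CC."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# Toy 4x6 "page": dark pixels (low values) are ink.
page = [
    [255, 255, 255, 255, 255, 255],
    [255,  10,  20, 255, 255, 255],
    [255,  15, 255, 255,  30, 255],
    [255, 255, 255, 255,  25, 255],
]
boxes = connected_components(binarize(page))
print(boxes)
```

On the toy page this yields two components, one per ink blob. Later stages would work on these boxes (and the pixels inside them) rather than on recognized characters, in keeping with the requirement that no text is read.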
After binarization, we have a set of connected components, each of which belongs
to either a text or a non-text region. The next step is to correctly classify each
connected component. Each CC has a set of intrinsic features (height, width, eccentricity,
...) and a set of extrinsic features (features from the surrounding area). Using these
features, we train a set of weak classifiers with boosting, which allows
us to classify components as text or non-text. Chapter 3 describes
this method in detail.
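To illustrate how boosting combines weak classifiers, the sketch below trains discrete AdaBoost over single-feature threshold stumps on a toy set of CC features. The choice of AdaBoost with stumps, the two features (height, width), the sample values, and the names `stump`, `train_adaboost`, and `predict` are all assumptions for illustration; the actual features and boosting variant are those described in chapter 3.

```python
import math

def stump(sample, f, thr, pol):
    """Weak classifier: vote `pol` if feature f is <= thr, otherwise -pol."""
    return pol if sample[f] <= thr else -pol

def train_adaboost(samples, labels, rounds=5):
    """Discrete AdaBoost; labels are +1 (text) or -1 (non-text)."""
    n = len(samples)
    weights = [1.0 / n] * n
    ensemble = []  # entries: (alpha, feature index, threshold, polarity)
    for _ in range(rounds):
        # Exhaustively pick the stump with the lowest weighted error.
        best = None
        for f in range(len(samples[0])):
            for thr in sorted({s[f] for s in samples}):
                for pol in (1, -1):
                    err = sum(w for w, s, y in zip(weights, samples, labels)
                              if stump(s, f, thr, pol) != y)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol)
        err, f, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, thr, pol))
        # Raise the weight of misclassified samples, then renormalize.
        weights = [w * math.exp(-alpha * y * stump(s, f, thr, pol))
                   for w, s, y in zip(weights, samples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, sample):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump(sample, f, thr, pol) for a, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical (height, width) features in pixels: small CCs stand in for
# characters, large ones for pictures or other graphics.
samples = [(12, 9), (14, 11), (11, 7), (13, 10),
           (180, 240), (150, 300), (220, 90), (160, 200)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

model = train_adaboost(samples, labels)
print([predict(model, s) for s in samples])
```

The toy data is separable on a single feature, so one stump already suffices here. On real pages, elements such as rule lines or large headings overlap with text in any single feature, which is where combining many weak classifiers and adding extrinsic features becomes useful.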
At this point, we are interested in segmenting a document image into regions
of text that have some form of alignment. We consider both the textual and
non-textual sets of CCs for this task. Note that some non-text CCs, such as tables and
rule lines, are highly effective in separating regions, so we do not discard
non-textual components. In chapter 4 we show how the method makes
the best of these sets to separate text inside tables.
Once all text elements and their regions are available, detecting text lines is the next
step. Chapter 2 thoroughly reviews major methods in the literature for text line