Segmentation of heterogeneous document images : an ... - Tel
Segmentation of heterogeneous document images : an ... - Tel
Segmentation of heterogeneous document images : an ... - Tel
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
2.2 Text region detection<br />
Text region detection is the actual process <strong>of</strong> segmenting a page into separate<br />
homogeneous text regions. In our view, text region detection is only one part <strong>of</strong><br />
the page segmentation process along side node separation, graphics separation<br />
<strong>an</strong>d other pre-processing <strong>an</strong>d post-processing steps. However, in the literature<br />
authors <strong>of</strong>ten use page segmentation, page decomposition or geometrical layout<br />
<strong>an</strong>alysis on behalf <strong>of</strong> region detection. Maybe one reason is that in some cases,<br />
methods do not explicitly separate text <strong>an</strong>d graphics before segmenting page<br />
into text regions. In one case that we reviewed before [99], separation <strong>of</strong> text<br />
<strong>an</strong>d graphics comes even after detecting regions. So the point is that region<br />
detection, page segmentation, page decomposition may all refer to the same<br />
process with the same goal.<br />
tel-00912566, version 1 - 2 Dec 2013<br />
M<strong>an</strong>y methods have been proposed for this goal. Back to the days when<br />
Nagy wrote his famous review on <strong>document</strong> image <strong>an</strong>alysis [65], he divided page<br />
segmentation methods into two major top-down <strong>an</strong>d bottom-up approaches categories.<br />
Top-down <strong>an</strong>alysis attempts to find larger components, like columns<br />
<strong>an</strong>d block <strong>of</strong> text before proceeding to find text lines, words <strong>an</strong>d characters.<br />
Bottom-up <strong>an</strong>alysis forms words into text lines, lines into text blocks <strong>an</strong>d so<br />
on. Then within the last decade <strong>an</strong>d due to adv<strong>an</strong>ces in computation <strong>an</strong>d optimization<br />
techniques, more <strong>an</strong>d more authors beg<strong>an</strong> to report a new emerged<br />
category. Khurshid wrote [45]: ”M<strong>an</strong>y other methods that do not fit into either<br />
<strong>of</strong> these categories are therefore called hybrid methods.”.<br />
It is part <strong>of</strong> our belief that this type <strong>of</strong> categorization does not represent the<br />
majority <strong>of</strong> the available methods in the literature <strong>an</strong>y more. It is more convenient<br />
to divide current methods into three categories; dist<strong>an</strong>ce-based, texturebased<br />
<strong>an</strong>d white space <strong>an</strong>alysis. Dist<strong>an</strong>ce-based methods try to <strong>an</strong>alyze the<br />
dist<strong>an</strong>ce between connected components <strong>an</strong>d to separate regions that fall apart.<br />
Examples <strong>of</strong> these methods use Voronoi diagram [2, 47, 48], Delauney tessellation<br />
[102] or run-length features [93, 32], map <strong>of</strong> foreground <strong>an</strong>d background<br />
runs .<br />
Texture-based methods use wavelets <strong>an</strong>alysis [49, 1], spline wavelets [27],<br />
wavelet packets [30], Markov r<strong>an</strong>dom fields (MRF) with pixel density features<br />
[69], auto correlation [44] in a single or multi-resolution framework to segment<br />
pages into text regions. Finally, methods that focus in the white background <strong>of</strong><br />
the <strong>document</strong> <strong>an</strong>d try to find the maximal empty rect<strong>an</strong>gles that c<strong>an</strong> separate<br />
columns <strong>of</strong> text [13, 85, 14].<br />
In [84], Shafait et al, evaluate the perform<strong>an</strong>ce <strong>of</strong> six famous algorithms,<br />
including X-Y cut [66], run-length smearing [101], whitespace <strong>an</strong>alysis [8], Docstrum<br />
[71], <strong>an</strong>d Voronoi-based [48] page segmentation. Also several methods for<br />
historical <strong>document</strong> layout <strong>an</strong>alysis are reported in [4], but unfortunately except<br />
for their concise description in the paper, none <strong>of</strong> them is available publicly for<br />
study <strong>an</strong>d experimentation.<br />
In the rest <strong>of</strong> this section, we briefly highlight some adv<strong>an</strong>tages <strong>an</strong>d weaknesses<br />
<strong>of</strong> each category, <strong>an</strong>d then we provide <strong>an</strong> overview <strong>of</strong> our approach.<br />
18