14.01.2014 Views

Segmentation of heterogeneous document images : an ... - Tel

Segmentation of heterogeneous document images : an ... - Tel

Segmentation of heterogeneous document images : an ... - Tel

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

2.2 Text region detection<br />

Text region detection is the actual process <strong>of</strong> segmenting a page into separate<br />

homogeneous text regions. In our view, text region detection is only one part <strong>of</strong><br />

the page segmentation process along side node separation, graphics separation<br />

<strong>an</strong>d other pre-processing <strong>an</strong>d post-processing steps. However, in the literature<br />

authors <strong>of</strong>ten use page segmentation, page decomposition or geometrical layout<br />

<strong>an</strong>alysis on behalf <strong>of</strong> region detection. Maybe one reason is that in some cases,<br />

methods do not explicitly separate text <strong>an</strong>d graphics before segmenting page<br />

into text regions. In one case that we reviewed before [99], separation <strong>of</strong> text<br />

<strong>an</strong>d graphics comes even after detecting regions. So the point is that region<br />

detection, page segmentation, page decomposition may all refer to the same<br />

process with the same goal.<br />

tel-00912566, version 1 - 2 Dec 2013<br />

M<strong>an</strong>y methods have been proposed for this goal. Back to the days when<br />

Nagy wrote his famous review on <strong>document</strong> image <strong>an</strong>alysis [65], he divided page<br />

segmentation methods into two major top-down <strong>an</strong>d bottom-up approaches categories.<br />

Top-down <strong>an</strong>alysis attempts to find larger components, like columns<br />

<strong>an</strong>d block <strong>of</strong> text before proceeding to find text lines, words <strong>an</strong>d characters.<br />

Bottom-up <strong>an</strong>alysis forms words into text lines, lines into text blocks <strong>an</strong>d so<br />

on. Then within the last decade <strong>an</strong>d due to adv<strong>an</strong>ces in computation <strong>an</strong>d optimization<br />

techniques, more <strong>an</strong>d more authors beg<strong>an</strong> to report a new emerged<br />

category. Khurshid wrote [45]: ”M<strong>an</strong>y other methods that do not fit into either<br />

<strong>of</strong> these categories are therefore called hybrid methods.”.<br />

It is part <strong>of</strong> our belief that this type <strong>of</strong> categorization does not represent the<br />

majority <strong>of</strong> the available methods in the literature <strong>an</strong>y more. It is more convenient<br />

to divide current methods into three categories; dist<strong>an</strong>ce-based, texturebased<br />

<strong>an</strong>d white space <strong>an</strong>alysis. Dist<strong>an</strong>ce-based methods try to <strong>an</strong>alyze the<br />

dist<strong>an</strong>ce between connected components <strong>an</strong>d to separate regions that fall apart.<br />

Examples <strong>of</strong> these methods use Voronoi diagram [2, 47, 48], Delauney tessellation<br />

[102] or run-length features [93, 32], map <strong>of</strong> foreground <strong>an</strong>d background<br />

runs .<br />

Texture-based methods use wavelets <strong>an</strong>alysis [49, 1], spline wavelets [27],<br />

wavelet packets [30], Markov r<strong>an</strong>dom fields (MRF) with pixel density features<br />

[69], auto correlation [44] in a single or multi-resolution framework to segment<br />

pages into text regions. Finally, methods that focus in the white background <strong>of</strong><br />

the <strong>document</strong> <strong>an</strong>d try to find the maximal empty rect<strong>an</strong>gles that c<strong>an</strong> separate<br />

columns <strong>of</strong> text [13, 85, 14].<br />

In [84], Shafait et al, evaluate the perform<strong>an</strong>ce <strong>of</strong> six famous algorithms,<br />

including X-Y cut [66], run-length smearing [101], whitespace <strong>an</strong>alysis [8], Docstrum<br />

[71], <strong>an</strong>d Voronoi-based [48] page segmentation. Also several methods for<br />

historical <strong>document</strong> layout <strong>an</strong>alysis are reported in [4], but unfortunately except<br />

for their concise description in the paper, none <strong>of</strong> them is available publicly for<br />

study <strong>an</strong>d experimentation.<br />

In the rest <strong>of</strong> this section, we briefly highlight some adv<strong>an</strong>tages <strong>an</strong>d weaknesses<br />

<strong>of</strong> each category, <strong>an</strong>d then we provide <strong>an</strong> overview <strong>of</strong> our approach.<br />

18

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!