Segmentation of heterogeneous document images : an ... - Tel


2.1.1 Connected component based methods

As the name suggests, connected component based methods work with connected
components to discriminate text from graphical elements within the document
image. Perhaps one of the earliest methods, and one still popular for its
robustness and usability with increasingly complex documents, is that of
Fletcher and Kasturi [33]. The method applies the Hough transform to the
centers of the components' bounding boxes and groups aligned components into
strings of characters. It then classifies all isolated components as graphics.
This approach has major drawbacks:

• Tables and borders around advertisements have a center that is usually
located inside the text area, so they are easily grouped incorrectly as part
of a character chain unless some constraint governs the size of a component.

tel-00912566, version 1 - 2 Dec 2013

• The classification of short strings of characters is not reliable, because
they receive too few votes in the Hough space to be discriminated effectively.

• The method may find diagonal alignments when text lines are packed closely
and the gaps between them are too small.

• Punctuation marks, diacritics and broken characters are not aligned with the
other components in a string of text, and they may become a seed for
misclassification.
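The grouping step described above can be illustrated with a minimal sketch. It is not the implementation of Fletcher and Kasturi [33]; it only shows the general idea of letting bounding-box centers vote in a (theta, rho) Hough accumulator and treating components that never join a sufficiently voted line as graphics. The function names, the bin sizes `theta_step_deg` and `rho_step`, and the threshold `min_votes` are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def hough_collinear_groups(boxes, theta_step_deg=1.0, rho_step=2.0, min_votes=3):
    """Group boxes whose centers vote for the same (theta, rho) cell.

    boxes: list of (x_min, y_min, x_max, y_max) tuples.
    Returns lists of box indices, one per accumulator cell that received
    at least min_votes votes, strongest cell first.
    All parameter values are illustrative assumptions.
    """
    n_theta = int(round(180.0 / theta_step_deg))
    accumulator = defaultdict(list)
    for index, (x0, y0, x1, y1) in enumerate(boxes):
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        for t in range(n_theta):
            theta = math.radians(t * theta_step_deg)
            # Normal form of a line: rho = x cos(theta) + y sin(theta).
            rho = cx * math.cos(theta) + cy * math.sin(theta)
            accumulator[(t, int(round(rho / rho_step)))].append(index)
    groups = [m for m in accumulator.values() if len(m) >= min_votes]
    groups.sort(key=len, reverse=True)
    return groups


def classify_components(boxes, **hough_params):
    """Label a box 'text' if it joins any voted line, else 'graphics'."""
    aligned = set()
    for members in hough_collinear_groups(boxes, **hough_params):
        aligned.update(members)
    return ["text" if i in aligned else "graphics" for i in range(len(boxes))]
```

With five small boxes whose centers lie on one horizontal line and one distant box, the sketch labels the five as text and the distant one as graphics. Adding a frame that encloses the five boxes reproduces the first drawback listed above: its center falls on the same line, so it is labeled as text unless a size constraint is added.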

Despite all these limitations, the authors of [96] have recently published a
paper that reports improved results. The difficulty of the problem lies not
only in the classification of these components but also in the separation of
interacting components. When textual and non-textual elements interact
locally, finding a solution becomes more difficult. Figure 2.1 shows two cases
of such a problem. In [28], Doermann tries to address this issue with a method
based on stroke-level properties to separate components. As an illustration of
the potential discriminative power of stroke-level properties, it is noted
that in hand-completed forms and pre-printed boxes, lines are produced by a
machine and are more regular than the associated handwritten text. Considering
only the widths of the strokes and examining the population of widths at the
cross-section level, strong separability can be achieved between the two
populations.
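The width-population argument can be made concrete with a small sketch. It is not Doermann's method [28]; as a simplifying assumption, cross sections are approximated by horizontal runs of foreground pixels, which only measures width correctly for roughly vertical strokes. The helper names and the use of the coefficient of variation as a regularity measure are likewise assumptions.

```python
from statistics import mean, pstdev


def horizontal_run_widths(image):
    """Return the lengths of all horizontal runs of foreground pixels.

    image: list of rows, each a list of 0 (background) or 1 (foreground).
    """
    widths = []
    for row in image:
        run = 0
        for pixel in row:
            if pixel:
                run += 1
            elif run:
                widths.append(run)
                run = 0
        if run:  # a run that reaches the right border
            widths.append(run)
    return widths


def width_spread(image):
    """Coefficient of variation of the run widths (0.0 = perfectly regular)."""
    widths = horizontal_run_widths(image)
    if not widths:
        return 0.0
    return pstdev(widths) / mean(widths)


# A machine-drawn vertical rule: the same width in every cross section.
ruled_line = [[0, 1, 1, 0] for _ in range(5)]

# A toy "handwritten" stroke whose width changes from row to row.
handwritten = [
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]
```

Under these assumptions the ruled line has zero spread while the irregular stroke has a clearly positive one, so a threshold on the spread would separate the two toy populations. Real documents would require cross sections taken perpendicular to the local stroke direction.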

Another problem that frequently arises in connected component based methods is
that large graphical components are often broken into pieces, so that they
consist of many small isolated components that behave like text components.
Many methods try to address this problem by isolating a graphical component
and its sub-elements as a whole rather than classifying each one separately.
Figure 2.2 illustrates graphical elements from two documents in our corpus
that exhibit this issue.
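One simple way to treat a graphical component and its sub-elements as a whole is sketched below. It is not taken from any of the cited methods. It assumes that the graphic still has one large component (for example its outline) whose bounding box encloses the fragments, and it attaches every smaller box to the smallest large box that contains it. The function names and the `min_container_area` threshold are hypothetical.

```python
def area(box):
    """Area of an (x_min, y_min, x_max, y_max) box."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def contains(outer, inner):
    """True if the box `inner` lies entirely inside the box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def group_by_container(boxes, min_container_area):
    """Map each large box index to the indices of the small boxes inside it.

    A box counts as a container when its area reaches min_container_area
    (an illustrative threshold). Small boxes outside every container are
    left out so they can be classified on their own.
    """
    containers = [i for i, b in enumerate(boxes) if area(b) >= min_container_area]
    groups = {i: [] for i in containers}
    for j, box in enumerate(boxes):
        if j in groups:
            continue
        enclosing = [i for i in containers if contains(boxes[i], box)]
        if enclosing:
            # Prefer the tightest enclosing container.
            best = min(enclosing, key=lambda i: area(boxes[i]))
            groups[best].append(j)
    return groups
```

Fragments grouped this way can then be labeled together with their container instead of being mistaken for characters one by one. The sketch does not help when the graphic has no enclosing component at all, which is one of the cases the text describes.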

One possible solution is to apply a method like the one proposed by B. Waked
in his thesis [99]. The idea is that text regions can be regarded as a set of
small bounding boxes that are regular in height and usually aligned
horizontally or vertically, whereas a non-text image or half-tone graphic is
irregular. The
