Segmentation of heterogeneous document images : an ... - Tel
2.1.1 Connected component based methods<br />
As the name suggests, connected component based methods work with connected<br />
components to discriminate text from graphical elements within the<br />
document image. One of the earliest methods, still popular for its robustness<br />
and usability with increasingly complex documents, is that of Fletcher<br />
and Kasturi [33]. The method applies the Hough transform to the<br />
centers of the bounding boxes in order to group aligned components into<br />
strings of characters. It then classifies all isolated components as graphics.<br />
There are major drawbacks to this approach:<br />
• Tables and borders around advertisements have a center that is usually<br />
located inside the text area, so they are easily grouped incorrectly as<br />
part of a string of characters unless some constraint governs<br />
the size of components.<br />
tel-00912566, version 1 - 2 Dec 2013<br />
• The classification of short strings of characters is not reliable, because<br />
they receive too few votes in the Hough space to be discriminated efficiently.<br />
• The method may find diagonal alignments when text lines are packed<br />
closely and there are not enough gaps between them.<br />
• Punctuation marks, diacritics and broken characters are not aligned with<br />
other components in a string of text, and they may become a seed for<br />
misclassification.<br />
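The grouping step described above can be sketched as follows. This is a minimal illustration of Hough voting on bounding-box centers, not the implementation of [33]: the function name `group_aligned_components`, the parameters `rho_step`, `max_tilt_deg` and `min_votes`, and the greedy peak selection are all assumptions made for this example.

```python
import math
from collections import defaultdict


def group_aligned_components(boxes, rho_step=4.0, theta_step_deg=1.0,
                             max_tilt_deg=5.0, min_votes=3):
    """Group components whose box centers are nearly collinear.

    boxes: list of (x0, y0, x1, y1) bounding boxes.
    Returns (strings, graphics): lists of box indices per detected string,
    and the indices of components left isolated.
    All thresholds are illustrative assumptions, not values from [33].
    """
    centers = [((x0 + x1) / 2.0, (y0 + y1) / 2.0) for x0, y0, x1, y1 in boxes]

    # A near-horizontal line has a normal angle theta close to 90 degrees.
    n_steps = int(round(2 * max_tilt_deg / theta_step_deg))
    thetas = [math.radians(90.0 - max_tilt_deg + k * theta_step_deg)
              for k in range(n_steps + 1)]

    # Each center votes for every quantized (theta, rho) line through it.
    accumulator = defaultdict(set)
    for idx, (cx, cy) in enumerate(centers):
        for t_idx, theta in enumerate(thetas):
            rho = cx * math.cos(theta) + cy * math.sin(theta)
            accumulator[(t_idx, int(round(rho / rho_step)))].add(idx)

    # Greedy peak picking: strongest cells first, each component used once.
    strings, assigned = [], set()
    for cell in sorted(accumulator, key=lambda c: len(accumulator[c]),
                       reverse=True):
        members = accumulator[cell] - assigned
        if len(members) < min_votes:
            continue
        strings.append(sorted(members, key=lambda i: centers[i][0]))
        assigned |= members

    # Everything that never joined a string is treated as graphics.
    graphics = [i for i in range(len(boxes)) if i not in assigned]
    return strings, graphics


# Five characters on one line, a two-character string, and one isolated box.
boxes = [(10, 10, 20, 20), (30, 10, 40, 20), (50, 10, 60, 20),
         (70, 10, 80, 20), (90, 10, 100, 20),
         (10, 100, 20, 110), (30, 100, 40, 110),
         (200, 300, 220, 310)]
strings, graphics = group_aligned_components(boxes)
```

With these boxes, the five-character line forms a single string, while the two-character string collects fewer than `min_votes` votes and is left with the isolated box as graphics. This is the short-string weakness listed above.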
Despite all these limitations, the authors of [96] have recently published a paper<br />
reporting improved results. The difficulty of the problem lies not only in<br />
the classification of these components but also in the separation of interacting<br />
components. When textual and non-textual elements interact locally, finding a<br />
solution becomes more difficult. Figure 2.1 shows two cases of such a problem.<br />
In [28], Doermann tries to address this issue with a method that separates<br />
components based on stroke-level properties. As an illustration of the<br />
discriminative power of these properties, it is noted that in hand-completed<br />
forms and pre-printed boxes, the lines are produced by a machine and are more<br />
regular than the associated handwritten text. Examining only the stroke widths,<br />
as a population of widths measured at the cross-section level, already yields<br />
strong separability between the two populations.<br />
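The stroke-width argument can be illustrated with a small sketch. It is not the method of [28]: the helper names `cross_section_widths` and `width_variation` are hypothetical, cross sections are approximated by horizontal runs of foreground pixels (adequate only for roughly vertical strokes), and the coefficient of variation is an assumed choice of regularity measure.

```python
from statistics import mean, pstdev


def cross_section_widths(mask):
    """Lengths of horizontal runs of foreground pixels (1s) in a binary mask."""
    widths = []
    for row in mask:
        run = 0
        for pixel in row:
            if pixel:
                run += 1
            elif run:
                widths.append(run)
                run = 0
        if run:  # run that reaches the right edge of the mask
            widths.append(run)
    return widths


def width_variation(mask):
    """Coefficient of variation of the widths; lower means more regular."""
    widths = cross_section_widths(mask)
    if not widths:
        return 0.0
    return pstdev(widths) / mean(widths)


# A machine-printed vertical rule: every cross section is 2 pixels wide.
printed_rule = [[0, 1, 1, 0]] * 4

# A handwriting-like stroke: the width changes from row to row.
handwritten = [[0, 1, 0, 0, 0],
               [1, 1, 1, 0, 0],
               [0, 1, 1, 0, 0],
               [1, 1, 1, 1, 0]]

printed_cv = width_variation(printed_rule)
handwritten_cv = width_variation(handwritten)
```

In this toy case the printed rule has zero variation while the handwritten stroke does not, so a threshold on the variation would separate the two populations. Real images would need a proper stroke-width estimate and many more samples.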
Another type of problem that frequently arises in connected component<br />
based methods is that large graphical components are often broken into pieces:<br />
they end up composed of many small isolated components that behave like text<br />
components. Many methods try to address this problem by isolating a graphical<br />
component and its sub-elements as a whole rather than classifying each one<br />
separately. Figure 2.2 illustrates graphical elements from two documents in our<br />
corpus that exhibit this issue.<br />
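One simple way to treat a graphical component and its sub-elements as a whole is to attach small components to a large component whose bounding box contains them. The sketch below is only an assumed illustration of that idea, not a method taken from the cited works; the function name `absorb_fragments` and the threshold `min_parent_area` are hypothetical.

```python
def absorb_fragments(boxes, min_parent_area=10000):
    """Attach each small box to a large box that fully contains it.

    boxes: list of (x0, y0, x1, y1) bounding boxes.
    Returns a dict mapping the index of each large box to the indices of
    the boxes nested inside it. The area threshold is an assumption.
    """
    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])

    def contains(outer, inner):
        return (outer[0] <= inner[0] and outer[1] <= inner[1]
                and outer[2] >= inner[2] and outer[3] >= inner[3])

    parents = [i for i, b in enumerate(boxes) if area(b) >= min_parent_area]
    groups = {p: [] for p in parents}
    for i, box in enumerate(boxes):
        if i in groups:
            continue  # large boxes are parents, never fragments
        for p in parents:
            if contains(boxes[p], box):
                groups[p].append(i)
                break  # attach each fragment to a single parent
    return groups


# One large graphic, two fragments inside it, one unrelated small component.
boxes = [(0, 0, 200, 200), (10, 10, 20, 20), (50, 60, 58, 70),
         (300, 300, 310, 312)]
groups = absorb_fragments(boxes)
```

Here the two fragments are grouped with the large graphic and can be classified together with it, while the component outside the graphic stays on its own. Containment alone does not help when no large enclosing component survives the fragmentation.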
One possible solution is to apply a method like the one proposed by B. Waked<br />
in his thesis [99]. The idea is that text regions can be regarded as a set of small<br />
bounding boxes that are regular in height and are usually aligned horizontally<br />
or vertically, whereas non-text images and half-tone graphics are irregular. The<br />