Segmentation of heterogeneous document images : an ... - Tel
detection and we conclude that the text line detection algorithm by Vassilis Papavassiliou
[75] is the most effective among them. The original method
overlooks line detection in side notes, but because it is the best of the many
methods compared and we have already separated problematic areas such as side notes,
we adopt it for our own system. A detailed explanation of this
method is given in chapter 5.
The final stage of the system is to group text lines into paragraphs. In
chapter 6 we propose a method based on a trainable binary tree model that
maximizes the probability of preserving groups of lines using a goodness criterion
for paragraphs.
A brief outline of these processes is given here.
tel-00912566, version 1 - 2 Dec 2013<br />
• Image binarization is the process that converts a given input gray-scale
image into a bi-level representation. In our case, pixels that belong to text
characters are assigned a value of 0 and pixels of non-textual components
and background have a value of 1.
• Connected components analysis is the process of extracting and labeling
connected components from an image. In our case, all pixels of
text and non-text elements that are connected and have the same value
are extracted and assigned to a separate component.
• Noise removal tries to detect and remove noise pixels from the document
image. At this stage we only remove components that contain fewer than
a predefined number of pixels. The exact number of pixels can only be
determined through trial and error. A large number may remove dots,
diacritics and punctuation marks. On the other hand, a small number
may leave some dust and speckles in the image.
• Text/Graphics separation is the process that classifies each component
as being part of text or graphics.
• Text region detection is the process that separates homogeneous regions
of text that belong to separate columns. It is also responsible for
separating side notes from the main text region.
• Text line detection is the process that finds text lines inside every text
region. It is also responsible for splitting characters from two adjacent
lines that touch and have incorrectly formed a single component.
• Paragraph detection is the process that groups text lines into paragraphs
based on their indentations and geometry.
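The first three stages of the outline above (binarization, connected components analysis and size-based noise removal) can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this dissertation: the fixed global threshold, the 8-connectivity, the function names and the toy image are all assumptions made for illustration. Only the pixel polarity (text is 0, background is 1) and the rule of removing components below a pixel count are taken from the text.

```python
from collections import deque

TEXT, BACKGROUND = 0, 1  # polarity from the text: text pixels are 0


def binarize(gray, threshold=128):
    """Map a gray-scale image (rows of 0..255 ints) to a bi-level image.

    A fixed global threshold is an assumption made for this sketch.
    """
    return [[TEXT if value < threshold else BACKGROUND for value in row]
            for row in gray]


def connected_components(binary):
    """Return a list of components, each a list of (row, col) text pixels.

    8-connectivity is assumed here.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != TEXT or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited text pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and not seen[ny][nx]
                                and binary[ny][nx] == TEXT):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def remove_small_components(binary, components, min_pixels):
    """Erase components with fewer than min_pixels pixels.

    Returns the cleaned image and the list of components that were kept.
    """
    cleaned = [row[:] for row in binary]
    kept = []
    for pixels in components:
        if len(pixels) < min_pixels:
            for y, x in pixels:
                cleaned[y][x] = BACKGROUND
        else:
            kept.append(pixels)
    return cleaned, kept


# Toy 4x6 image: one three-pixel blob and one isolated dark speckle.
gray = [
    [255, 255, 255, 255, 255, 255],
    [255,  10,  20, 255, 255,  30],
    [255,  15, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 255],
]
binary = binarize(gray)
components = connected_components(binary)
cleaned, kept = remove_small_components(binary, components, min_pixels=2)
```

With `min_pixels=2` the isolated speckle is erased and the three-pixel blob is kept. Raising `min_pixels` to 4 would also erase the blob, which mirrors the trade-off described above: too large a value removes dots and diacritics, too small a value leaves speckles behind.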
1.4 Contribution of this dissertation
The main contributions that are presented in this dissertation are:<br />
1. A new hybrid method for text/graphics separation. Figure 1.2 shows the
ability of our text/graphics separation method in segmenting a document