14.01.2014 Views

Segmentation of heterogeneous document images : an ... - Tel


detection and we conclude that the text line detection algorithm by Vassilio Papavassilious [75] is the most effective among those considered. The original method overlooks line detection in side notes, but because it performs best among the many candidates and we have already separated problematic areas such as side notes, we adopt it for our system. A detailed explanation of this method is given in chapter 5.

The final stage of the system is to group text lines into paragraphs. In chapter 6 we propose a method based on a trainable binary tree model that maximizes the probability of preserving groups of lines using a goodness criterion for paragraphs.

A brief outline of these processes is given here.

tel-00912566, version 1 - 2 Dec 2013

• Image binarization is the process that converts a given input gray-scale image into a bi-level representation. In our case, pixels that belong to text characters are assigned a value of 0, and pixels of non-textual components and background have a value of 1.

• Connected components analysis is the process of extracting and labeling connected components from an image. In our case, all pixels of text and non-text elements that are connected and have the same value are extracted and assigned to a separate component.

• Noise removal tries to detect and remove noise pixels from the document image. At this stage we only remove components that contain fewer than a predefined number of pixels. The exact number of pixels can only be determined through trial and error. A large threshold may remove dots, diacritics and punctuation marks. A small threshold, on the other hand, may leave some dust and speckles in the image.

• Text/Graphics separation is the process that classifies each component as either text or graphics.

• Text region detection is the process that separates homogeneous regions of text that belong to separate columns. It is also responsible for separating side notes from the main text region.

• Text line detection is the process that finds text lines inside every text region. It is also responsible for splitting characters from two adjacent lines that touch each other and have incorrectly formed a single component.

• Paragraph detection is the process that groups text lines into paragraphs based on their indentations and geometry.
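
The first three stages of the outline above (binarization, connected components analysis and size-based noise removal) can be illustrated with a minimal sketch. It is not the implementation used in this dissertation: the fixed global threshold of 128, the 8-connectivity rule, and the function names binarize, connected_components and remove_small_components are all assumptions chosen for illustration. Only the pixel convention (text = 0, background = 1) and the idea of discarding components below a pixel count come from the text.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Map dark pixels (assumed to be text) to 0 and all others to 1.

    The fixed global threshold is an assumption for this sketch.
    """
    return [[0 if value < threshold else 1 for value in row] for row in gray]


def connected_components(binary, foreground=0):
    """Collect foreground pixels into components, each a list of (row, col).

    Pixels are treated as connected when they touch horizontally,
    vertically or diagonally (8-connectivity, an assumption).
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != foreground or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and not seen[ny][nx]
                                and binary[ny][nx] == foreground):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def remove_small_components(components, min_pixels):
    """Keep only components with at least min_pixels pixels."""
    return [comp for comp in components if len(comp) >= min_pixels]


# Tiny hypothetical gray-scale image: a 2x2 dark blob and one dark speckle.
gray = [
    [255, 255, 255, 255, 255, 255],
    [255,  10,  20, 255, 255, 255],
    [255,  15,  12, 255,  30, 255],
    [255, 255, 255, 255, 255, 255],
]
binary = binarize(gray)
components = connected_components(binary)
kept = remove_small_components(components, min_pixels=2)
```

With min_pixels=2 the isolated speckle is dropped and the 2x2 blob is kept. Raising min_pixels above 4 would also discard the blob, which mirrors the trade-off described above: too large a value removes dots and diacritics, too small a value leaves speckles behind.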

1.4 Contribution of this dissertation

The main contributions that are presented in this dissertation are:

1. A new hybrid method for text/graphics separation. Figure 1.2 shows the ability of our text/graphics separation method in segmenting a document
