14.01.2014 Views

Segmentation of heterogeneous document images : an ... - Tel



Chapter 6

Paragraph detection

tel-00912566, version 1 - 2 Dec 2013

Having specified all text lines, in this stage of the method we focus on regrouping these lines to produce paragraphs. At this point, text lines from different columns or marginal notes are expected to be segmented properly and to belong to separate text regions. However, the text region detection method described in chapter 4 focuses on separating text regions that are located far from one another or belong to different columns. It overlooks several factors, namely the size of the text lines and the paragraph model that includes them. For example, a text line that has considerably larger text components than its adjacent text lines is expected to be a title, header or sub-header. Such text lines should be separated from the rest of the text lines in a region.

Moreover, several paragraph models exist in both handwritten and printed historical documents. They can be recognized immediately from the geometric location of the text lines and their indentations. The simplest paragraph model is the one used in articles, magazines and technical journals, in which the first line of the paragraph has a larger left indentation than the rest of the text lines. More complicated paragraph models can be found in historical documents. In one such model, all text lines of the paragraph except the first have a large left indentation. Another paragraph model, seen mostly in poems, has center-justified text lines. The endings of text lines may or may not be aligned. The paragraph detection module that we describe in this section should be able to handle all these cases. Our paragraph model is applied to each text region independently.
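The paragraph models listed above can be told apart from line geometry alone. The following is only a minimal sketch of that idea, not the detection method of this chapter: the `TextLine` fields, the function names, the pixel tolerance `tol` and the height ratio `ratio` are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextLine:
    left: float    # x-coordinate of the line's left edge
    right: float   # x-coordinate of the line's right edge
    height: float  # representative height of the line's text components


def classify_paragraph(lines, tol=5.0):
    """Guess the layout model of one paragraph from its line geometry.

    `tol` (in pixels) is an assumed tolerance, not a value from the text.
    """
    if len(lines) < 2:
        return "unknown"
    first, rest = lines[0], lines[1:]
    body_left = median(l.left for l in rest)
    # If the left edges of the body lines agree, the offset of the
    # first line decides between the indentation-based models.
    if all(abs(l.left - body_left) <= tol for l in rest):
        if first.left > body_left + tol:
            return "first-line indent"
        if first.left < body_left - tol:
            return "hanging indent"
        return "flush left"
    # Otherwise, centred text shares a common midpoint instead.
    centres = [(l.left + l.right) / 2 for l in lines]
    mid = median(centres)
    if all(abs(c - mid) <= tol for c in centres):
        return "centred"
    return "unknown"


def is_title_candidate(line, neighbours, ratio=1.5):
    """Flag a line whose components are much taller than its neighbours'."""
    return line.height > ratio * median(n.height for n in neighbours)
```

Right edges are deliberately ignored by the indentation tests, since the endings of text lines may or may not be aligned. On real scans, fixed thresholds such as these would likely need to be scaled by the text size of the region.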

An overview of the different stages of paragraph detection is as follows:

• A minimum spanning tree (MST) is applied to connect all text lines so that exactly one path exists between each pair of lines. The weights of the MST are specified so that the natural reading order of the text lines is preserved.

• The MST is converted to a binary partition tree of text lines. The leaves of the tree represent individual text lines. The remaining nodes represent groups of text lines that are obtained by merging text lines represented by
