Segmentation of heterogeneous document images : an ... - Tel
Chapter 6
Paragraph detection
tel-00912566, version 1 - 2 Dec 2013
Having identified all text lines, at this stage of the method we focus
on regrouping these lines to produce paragraphs. At this point it is
expected that text lines from different columns or marginal notes are
segmented properly and belong to separate text regions. However,
the text region detection method described in Chapter 4 focuses on
separating text regions that are located far from one another or belong to different
columns. Factors that it overlooks include the size of the text lines
and the paragraph model that contains them. For example, a text line whose
text components are considerably larger than those of its adjacent text lines is expected
to be a title, header or sub-header. Such text lines should be separated
from the rest of the text lines in a region. Moreover, several paragraph
models exist in either handwritten or printed historical documents. They
can be recognized immediately from the geometric location
of text lines and their indentations. The simplest paragraph model, found in
articles, magazines and technical journals, is one in which the first line of the paragraph
has a larger left indentation than the rest of the text lines. More complicated
paragraph models can be found in historical documents. In one such
model, all text lines of the paragraph except the first
have a large left indentation. Another paragraph model, which can be
seen mostly in poems, has text lines that are center justified. The endings of
text lines may or may not be aligned. The paragraph detection module that we
describe in this section should be able to handle all these cases.
Our paragraph model should be applied to each text region independently.
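The geometric cues described above can be illustrated with a minimal Python sketch. It is not the method developed in this chapter: the `TextLine` record, the function names, the `ratio` and `tol` thresholds, and the fallback label "block" are all assumptions made for illustration. It only shows how component size and left indentation could separate a probable title and distinguish the three paragraph models mentioned in the text.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextLine:
    """Hypothetical record for one text line (coordinates in pixels)."""
    left: float    # x-coordinate of the line's left edge
    right: float   # x-coordinate of the line's right edge
    height: float  # typical height of the line's text components


def is_probable_title(lines, index, ratio=1.5):
    """Flag a line whose components are much taller than its neighbours'.

    The ratio threshold is an assumed value, not one from the thesis.
    """
    neighbours = [lines[j].height
                  for j in (index - 1, index + 1)
                  if 0 <= j < len(lines)]
    if not neighbours:
        return False
    return lines[index].height > ratio * max(neighbours)


def classify_paragraph_model(lines, tol=5.0):
    """Guess the layout model of one paragraph from line geometry.

    tol is an assumed tolerance for treating two positions as aligned.
    """
    if len(lines) < 2:
        return "unknown"
    first, rest = lines[0], lines[1:]
    rest_left = median(line.left for line in rest)
    centers = [(line.left + line.right) / 2 for line in lines]
    lefts = [line.left for line in lines]
    lefts_vary = max(lefts) - min(lefts) > tol
    # Centered text: left edges differ but the line midpoints coincide.
    if lefts_vary and max(centers) - min(centers) <= tol:
        return "centered"
    # First line starts further right than the remaining lines.
    if first.left - rest_left > tol:
        return "first-line indent"
    # Remaining lines start further right than the first line.
    if rest_left - first.left > tol:
        return "hanging indent"
    return "block"  # fallback label, not a model named in the text
```

For example, a paragraph whose lines start at x = 30, 10 and 10 would be labelled "first-line indent" under these assumptions, regardless of where the line endings fall.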
An overview of the stages of paragraph detection is as follows:
• A minimum spanning tree (MST) is built to connect all text lines in a
way that exactly one path exists between each pair of lines. The weights of
the MST are specified in a way that the natural reading order of the text
lines is preserved.
• The MST is converted to a binary partition tree of text lines. The leaves
of the tree represent individual text lines. The remaining nodes represent
groups of text lines that are obtained by merging text lines represented by