Segmentation of heterogeneous document images : an ... - Tel
Chapter 5
Text line detection
tel-00912566, version 1 - 2 Dec 2013
Text line detection refers to the segmentation of each text region into
distinct entities, namely text lines. In chapter 2 we mentioned and
analyzed many methods for detecting text lines.
Our text line detection method is a variant of the method proposed by
Papavassiliou in [75]. The original method segments a document image into
non-overlapping vertical zones of equal width. The height of each zone is
equal to the height of the document image. Its width is equal to 5% of the
width of the document image: narrow enough that the skew of text lines can be
ignored within a zone, yet wide enough to contain a reasonable number of
characters. The original method also disregards zones situated close to the
left and right borders of the page, mainly because they do not contain a
sufficient amount of text.
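The zone layout described above can be sketched as follows. This is a minimal illustration, not the implementation from [75]: the function name `vertical_zones`, the rounding of the zone width, and the choice to merge leftover columns into the last zone are all assumptions made for the example. Only the 5% width comes from the text.

```python
def vertical_zones(page_width, zone_fraction=0.05):
    """Return (x_start, x_end) column ranges of equal-width vertical zones.

    zone_fraction=0.05 follows the 5% width quoted in the text. How leftover
    columns at the right edge are handled is an assumption of this sketch:
    they are merged into the last zone so that the zones tile the page.
    Each zone implicitly spans the full page height.
    """
    if page_width <= 0:
        raise ValueError("page_width must be positive")
    zone_width = max(1, round(page_width * zone_fraction))
    n_zones = max(1, page_width // zone_width)
    zones = []
    for i in range(n_zones):
        x_start = i * zone_width
        # The last zone absorbs any remainder columns.
        x_end = page_width if i == n_zones - 1 else x_start + zone_width
        zones.append((x_start, x_end))
    return zones


# Example: a page 2000 pixels wide is cut into zones of 100 columns each.
zones = vertical_zones(2000)
print(len(zones), zones[0], zones[-1])
```

The ranges are half-open, so consecutive zones share a boundary value without sharing a pixel column.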
Since documents in our corpus contain side notes, it is not wise to dismiss
zones merely because they contain less text than zones in the
middle of the document. One reason the original method neglects these
zones is that large gaps distort the overall estimation
of the model parameters. To solve this problem, we estimate the parameters of
the model only from detected text regions. We also ensure that detected
lines do not cross from one text region to another, as happens in the original
method.
5.1 Initial text line separators<br />
The first step is to calculate the projection profile of each vertical zone onto
the y axis. Let PR_i be the projection profile of the i-th vertical zone onto the
y axis. Peaks and valleys of the profiles give a rough indication of the location
of text lines; however, when the writing style results in large gaps between
successive words, a vertical zone may not contain enough foreground pixels for
every text line. In order to lessen the influence of these instances on PR_i, a
smoothed projection profile SPR_i is estimated as a normalized weighted sum of
the M profiles on either side of the i-th zone. The dimension of PR_i and SPR_i
is 1 × the page height. In
figures 5.1 and 5.2, the bar chart views of PR_i and SPR_i are rendered at the