
Segmentation of heterogeneous document images : an ... - Tel



Chapter 5

Text line detection

tel-00912566, version 1 - 2 Dec 2013

Text line detection refers to the segmentation of each text region into distinct entities, namely text lines. In chapter 2 we mentioned and analyzed many methods for detecting text lines.

Our text line detection method is a variant of the method proposed by Papavassiliou in [75]. The original method segments a document image into non-overlapping vertical zones of equal width. The height of each zone is equal to the height of the document image. Its width is equal to 5% of the width of the document image: narrow enough to ignore the effect of skewed text lines, and wide enough to contain a decent number of characters. The original method also disregards zones situated close to the left and right borders of the page, mainly because they do not contain a sufficient amount of text.
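The zone layout described above can be sketched as follows. This is a minimal illustration, not the implementation of [75]: the function name `vertical_zones`, the rounding of the zone width, and the treatment of leftover columns as a narrower last zone are all assumptions made for the example. Only the 5% width comes from the text.

```python
def vertical_zones(image_width, zone_fraction=0.05):
    """Return (x_start, x_end) column ranges of non-overlapping vertical zones.

    zone_fraction=0.05 is the 5% width stated in the text. Each zone spans
    the full image height, so only the column range is needed.
    Assumption: columns left over after the last full zone form a
    narrower final zone instead of being discarded.
    """
    zone_width = max(1, round(image_width * zone_fraction))
    zones = []
    x = 0
    while x < image_width:
        zones.append((x, min(x + zone_width, image_width)))
        x += zone_width
    return zones


# A 1000-pixel-wide page yields 20 zones of 50 columns each.
zones = vertical_zones(1000)
```

Unlike the original method, no zone near the page borders is dropped here, which matches the intent of keeping side notes.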

Since documents in our corpus contain side notes, it is not wise to dismiss zones that contain less text than zones in the middle of the document. One reason the original method neglects these zones is that large gaps affect the overall estimation of the model parameters. To solve this problem, we ensure that the parameters of the model are estimated from detected text regions. We also ensure that detected lines do not cross from one text region to another, as happens in the original method.

5.1 Initial text line separators<br />

The first step is to calculate the projection profile of each vertical zone onto the y axis. Let $PR_i$ be the projection profile of the $i$-th vertical zone onto the y axis. Peaks and valleys of the $PR$s give a rough indication of the location of text lines; however, when the writing style results in large gaps between successive words, a vertical zone may not contain enough foreground pixels for every text line. In order to lessen the influence of these instances on $PR_i$, a smoothed projection profile $SPR_i$ is estimated as a normalized weighted sum of the $M$ profiles on either side of the $i$-th zone. The dimension of $PR_i$ and $SPR_i$ is 1 × the page height. In figures 5.1 and 5.2, the bar chart views of $PR_i$ and $SPR_i$ are rendered at the

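As a rough illustration of the projection profile and its smoothing, the sketch below computes a per-zone profile and then a normalized weighted sum over neighbouring zones. Several details are assumptions, since the text does not specify them: the weights (an exponential decay with the distance between zones), the inclusion of the zone's own profile in the sum, and the renormalization over fewer neighbours at the page borders. The function names are hypothetical.

```python
import math


def projection_profile(zone):
    """Count foreground pixels in each row of a zone.

    `zone` is a list of rows, each row a list of 0/1 pixel values,
    so the result has one entry per image row (1 x page height).
    """
    return [sum(row) for row in zone]


def smoothed_profiles(profiles, M=2):
    """Smooth each zone's profile using up to M zones on either side.

    Assumptions (not given in the text): weights decay as exp(-|j|)
    with the zone offset j, the zone itself (j = 0) is included, and
    weights are renormalized over the neighbours that exist.
    """
    n = len(profiles)
    height = len(profiles[0])
    smoothed = []
    for i in range(n):
        acc = [0.0] * height
        total = 0.0
        for j in range(-M, M + 1):
            k = i + j
            if 0 <= k < n:
                w = math.exp(-abs(j))
                total += w
                for y in range(height):
                    acc[y] += w * profiles[k][y]
        smoothed.append([v / total for v in acc])
    return smoothed
```

With this kind of smoothing, a zone that falls in a gap between words still receives a non-zero value on rows where its neighbours contain text, which is the effect the paragraph above describes.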
