
Segmentation of heterogeneous document images : an ... - Tel


tel-00912566, version 1 - 2 Dec 2013

Figure 2.3: Two documents in our corpus that exhibit variability of font sizes in newspaper-style documents.

lines are adaptable for detecting printed text lines.

Here are some challenges that arise in text line detection:

• Variability of font size is a challenge in some documents (e.g. newspaper-style documents). Most methods assume that the font sizes on a page follow a Gaussian distribution and therefore tune their parameters for the average font size. This often causes large characters in titles to be overlooked, or produces errors where the font size changes abruptly. Figure 2.3 illustrates this problem.

• Slanted text lines are straight text lines that are not aligned with the x-axis. Most algorithms start with a skew correction for the whole page and then proceed to detect text lines. However, this remains a challenge when text lines with different slant angles are present on the same page.

• Touching text lines frequently occur in text regions, especially in handwritten manuscripts where text lines are located close to one another. This situation is a challenge for most methods. Connected component-based methods, which work by aligning and grouping connected components, fail because two touching characters from two different lines are registered as one large connected component that is not aligned with either of the text lines. If not dealt with properly, this leads to either under-segmentation of two lines into one line, or over-segmentation of two different text lines into

