Segmentation of heterogeneous document images : an ... - Tel
tel-00912566, version 1 - 2 Dec 2013
Figure 2.3: Two newspaper-style documents in our corpus that exhibit variability of font sizes.
lines are adaptable for detecting printed text lines.
Here are some challenges that arise in text line detection:
• Variability of font size is a challenge in some documents (e.g. newspaper
style). Most methods assume that the distribution of font sizes on
one page is Gaussian and therefore tune their parameters for the average
font size. This often causes large characters in titles to be overlooked, or
produces errors where the font size changes abruptly. Figure 2.3 illustrates
this problem.
• Slanted text lines are straight text lines that are not aligned with the
x-axis. Most algorithms start with a skew correction of the
whole page and then proceed to detect text lines. However, a single
page-wide correction does not help when text lines with different slant
angles are present on the same page.
• Touching text lines frequently occur in text regions, especially in handwritten
manuscripts where text lines are located close to one another.
This situation is a challenge for most methods. Connected component-based
methods, which work by aligning and grouping connected
components, fail here because two characters from two touching lines
are registered as one large connected component that is not aligned
with either of the text lines.
If not dealt with properly, this leads to either under-segmentation of
two lines into one line, or over-segmentation of two different text lines into