Segmentation of heterogeneous document images : an ... - Tel
tel-00912566, version 1 - 2 Dec 2013
Figure 1.1: A sample document image correctly segmented by ABBYY FineReader 2011.
improve the segmentation quality for the corpus provided to us, which mostly
consists of handwritten documents with degraded quality and side notes, forms,
and books. We start with an introduction to document page segmentation.
1.2 Document page segmentation<br />
Document page segmentation is the main component of geometric layout analysis.
Given an image of a document, the goal of page segmentation is to decompose
the image into smaller homogeneous regions (zones or segments) of
handwritten and printed text. Layout analysis differs from page segmentation
in that layout analysis algorithms take these segments as input, assign
contextual labels (title, author, footnote, ...) to them, and determine their
reading order. Figure 1.1 shows a correctly segmented document image.
In multi-column documents, a page segmentation algorithm is also responsible
for segmenting text columns separately, so that text lines from different columns
are not merged.
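
To make the column-separation requirement concrete, here is a minimal sketch of one classical approach, a vertical projection profile, which looks for wide blank vertical bands between columns. It is an illustration only and not the method developed in this work; the function name `split_columns` and the `min_gap` threshold are hypothetical, and the sketch assumes a clean, deskewed binary image, which degraded handwritten pages rarely provide.

```python
from typing import List, Tuple


def split_columns(image: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges (end exclusive) of text columns.

    image:   binary page, 1 = ink, 0 = background, rows of equal length.
    min_gap: hypothetical threshold; blank runs narrower than this are
             treated as spacing inside a column, not as a column gutter.
    """
    if not image:
        return []
    width = len(image[0])
    # Vertical projection profile: number of ink pixels at each x position.
    profile = [sum(row[x] for row in image) for x in range(width)]

    columns = []
    start = None  # x position where the current column began
    gap = 0       # length of the current run of blank x positions
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The blank run is wide enough: close the column before it.
                columns.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close the last column, excluding any trailing blank positions.
        columns.append((start, width - gap))
    return columns


# Two narrow columns separated by a three-pixel blank band.
page = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
]
print(split_columns(page))
```

Text lines would then be extracted within each returned x-range separately, so that lines from neighbouring columns cannot be merged.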
Documents are segmented into smaller regions in the first place because
text regions are passed to an OCR (optical character recognition) module or a
reading order detection module for further processing, including conversion to
ASCII format. The regions obtained should therefore also be classified as
containing text or non-text elements: character recognition modules assume that
the incoming data contains text, so their output is unpredictable for regions
containing graphics or other non-textual components.
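
The routing described above can be sketched as follows. This is a hedged illustration, not an actual pipeline: the `Region` type, its `kind` field, and the `ocr` callable are all hypothetical stand-ins for whatever a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    """A rectangular zone produced by page segmentation (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int
    kind: str  # "text" or "non-text", as assigned by a region classifier


def recognize_text_regions(regions: List[Region],
                           ocr: Callable[[Region], str]) -> List[str]:
    """Run OCR on text regions only; non-text regions never reach it."""
    return [ocr(region) for region in regions if region.kind == "text"]


# Stub standing in for a real OCR engine (an assumption for this sketch).
def fake_ocr(region: Region) -> str:
    return f"text@({region.x},{region.y})"


regions = [
    Region(0, 0, 100, 20, "text"),
    Region(0, 30, 100, 80, "non-text"),  # e.g. a graphic
    Region(0, 120, 100, 20, "text"),
]
results = recognize_text_regions(regions, fake_ocr)
print(results)
```

Filtering on the region label before recognition keeps graphics away from the recognizer, whose output on such input would otherwise be unpredictable.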