14.01.2014 Views

Segmentation of heterogeneous document images : an ... - Tel


tel-00912566, version 1 - 2 Dec 2013

Figure 1.1: A sample of a correctly segmented document image by ABBYY FineReader 2011.

improve the segmentation quality for the corpus provided to us, which mostly consists of handwritten documents with degraded quality and side notes, forms and books. We start with an introduction to document page segmentation.

1.2 Document page segmentation

Document page segmentation is the main component of geometric layout analysis. Given an image of a document, the goal of page segmentation is to decompose the image into smaller homogeneous regions (zones or segments) of handwritten and printed text. Page segmentation differs from layout analysis in that layout analysis algorithms take these segments as input, assign contextual labels (title, author, footnote, ...) to them, and determine their reading order. Figure 1.1 shows a correctly segmented document image. In multi-column documents, a page segmentation algorithm is also responsible for segmenting text columns separately, so that text lines from different columns are not merged.
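To make the column-separation requirement concrete, here is a minimal sketch of one classical heuristic: a vertical projection profile that looks for empty gutters between columns. It is an illustration only, not the method developed in this work, and it assumes a clean, deskewed, binarized page. The function name `split_columns` and the `min_gap` parameter are hypothetical.

```python
from typing import List, Tuple


def split_columns(page: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges (end exclusive) of text columns.

    page[y][x] is 1 for an ink pixel and 0 for background.
    min_gap is an assumed minimum gutter width, in pixels.
    """
    if not page:
        return []
    width = len(page[0])
    # Vertical projection profile: number of ink pixels in each pixel column.
    profile = [sum(row[x] for row in page) for x in range(width)]

    columns: List[Tuple[int, int]] = []
    start = None  # left edge of the text column currently being built
    gap = 0       # length of the current run of empty pixel columns
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The gutter is wide enough: close the column where the gap began.
                columns.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close the last column, dropping any narrow trailing gap.
        columns.append((start, width - gap))
    return columns


if __name__ == "__main__":
    row = [1, 1, 0, 0, 0, 1, 1, 0]
    print(split_columns([row, row], min_gap=2))
```

A heuristic like this breaks down on skewed scans, side notes, or handwriting that crosses the gutter, which is exactly the kind of degraded material described above.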

The reason for segmenting documents into smaller regions in the first place is that text regions are sent to OCR (optical character recognition) or reading order detection modules for further processing and conversion into ASCII format. Hence, the obtained regions should also be classified as containing text or non-text elements: character recognition modules assume that the incoming data contains text, so their output is unpredictable for regions containing graphics or other non-textual components.
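The role of the text/non-text label can be sketched as a simple filter placed in front of the recognizer. The `Region` type, the `recognize_text_regions` function, and the `ocr` callable below are hypothetical names introduced for illustration; they do not correspond to any particular OCR library or to the system described in this work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Region:
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page coordinates
    is_text: bool                    # label produced by the segmentation step


def recognize_text_regions(
    regions: List[Region],
    ocr: Callable[[Region], str],
) -> Dict[int, str]:
    """Run the recognizer on text regions only.

    Non-text regions (graphics, photos, ...) are skipped, because a
    character recognizer assumes that its input contains text.
    Returns a mapping from region index to recognized string.
    """
    results: Dict[int, str] = {}
    for index, region in enumerate(regions):
        if region.is_text:
            results[index] = ocr(region)
    return results


if __name__ == "__main__":
    regions = [
        Region((0, 0, 100, 20), is_text=True),
        Region((0, 30, 100, 90), is_text=False),  # e.g. a figure
    ]
    # Stand-in recognizer: a real system would call an OCR engine here.
    print(recognize_text_regions(regions, lambda r: "text@%d" % r.bbox[1]))
```

Keeping the original region index as the key lets a later reading-order step refer back to every region, including the ones that were not recognized.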
