Segmentation of heterogeneous document images : an ... - Tel

Chapter 1

Introduction

tel-00912566, version 1 - 2 Dec 2013

1.1 Introduction

For centuries, paper documents such as handwritten manuscripts and books have been used as the main source for preserving knowledge.

Then, with the advent of printing, printed books, magazines and newspapers replaced the traditional way of preserving information.

Nowadays, the most common way to store documents is in a digital format, which is fast to search, cheap and portable. While many ancient documents have been lost over the years, a substantial number have been preserved. Not only are these documents fragile, they are also inaccessible to the public. As a consequence, many libraries and organizations around the world have decided to convert their collections to a digital format.

The two most common ways of digitizing documents are image scanners and digital cameras. However, both methods produce images that demand a large amount of storage and are unsearchable. This is where document image analysis comes into play. Document image analysis is the subfield of digital image processing whose goal is to convert document images into a searchable text form. The whole process starts with segmenting a document image into different parts such as text, graphics and drawings, and forming a layout structure of the document. Finally, given the layout structure, methods can determine the reading order of the document or send text regions to an optical character recognition (OCR) module, which converts them into a searchable format.
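As a rough illustration of the last two steps described above (ordering the segmented regions and passing the text regions to OCR), the following minimal sketch may help. It is not the method developed in this thesis: the `Region` type, the naive top-to-bottom ordering, and the `fake_ocr` stand-in are all assumptions introduced here for illustration, and the segmentation step itself is taken as already done.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    """Hypothetical output of a segmentation step: a labelled bounding box."""
    kind: str  # "text", "graphics" or "drawing"
    x: int     # left edge, in pixels
    y: int     # top edge, in pixels
    w: int     # width, in pixels
    h: int     # height, in pixels


def reading_order(regions: List[Region]) -> List[Region]:
    """Naive single-column order: top to bottom, then left to right."""
    return sorted(regions, key=lambda r: (r.y, r.x))


def to_searchable_text(regions: List[Region],
                       ocr: Callable[[Region], str]) -> str:
    """Send only the text regions, in reading order, to an OCR callable."""
    text_regions = [r for r in reading_order(regions) if r.kind == "text"]
    return "\n".join(ocr(r) for r in text_regions)


def fake_ocr(region: Region) -> str:
    """Stand-in for a real OCR module (assumption, for illustration only)."""
    return f"text@{region.y}"


# A made-up layout: two text blocks with a figure between them.
layout = [
    Region("text", 10, 200, 300, 50),
    Region("graphics", 10, 80, 300, 100),
    Region("text", 10, 10, 300, 50),
]
searchable = to_searchable_text(layout, fake_ocr)
```

A real system would replace `fake_ocr` with an actual OCR engine and would need a far more careful reading-order rule for multi-column or noisy pages.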

In this thesis, we develop a method for segmenting a document image into its different parts. Currently, Safig, the leader of Demat-Factory, uses software to do the same task, but under human supervision. The undisclosed software that they use has many tunable parameters that need to be monitored carefully for each document in order to produce a correct segmentation result. Our goal is to automate this process and generalize the method so that it can be applied to a broad range of documents. There are many open-source and commercial applications, such as Tesseract-OCR, OCRopus and ABBYY FineReader, that perform the same task. However, as we will examine later, they are better suited to well-formatted printed documents that are free of noise. Our goal is to
