Segmentation of heterogeneous document images : an ... - Tel
Chapter 1<br />
Introduction<br />
tel-00912566, version 1 - 2 Dec 2013<br />
1.1 Introduction<br />
For centuries, paper documents such as handwritten manuscripts and<br />
books were the main means of preserving knowledge.<br />
Then, with the advent of printing, printed books, magazines and newspapers<br />
replaced the traditional way of preserving information.<br />
Nowadays, the most common way to store documents is in a<br />
digital format, which is fast to search, cheap and portable. While many of the<br />
ancient documents have been lost over the years, a substantial number have been<br />
preserved. Not only are these documents fragile, they are also inaccessible to<br />
the public. As a consequence, many libraries and organizations around the<br />
world have decided to convert their collections to a digital format.<br />
The two most common ways of digitizing documents are image scanners<br />
and digital cameras. However, both produce images that demand a large amount of<br />
storage and are not searchable. Here, document image analysis comes into<br />
play. Document image analysis is the subfield of digital image processing whose<br />
goal is to convert document images into a searchable text form. The whole<br />
process starts with segmenting a document image into different parts such as<br />
text, graphics and drawings, and forming a layout structure of the document.<br />
Finally, given the layout structure, methods can determine the reading order of<br />
the document or send text regions to an optical character recognition (OCR)<br />
module, which converts them into a searchable format.<br />
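To make the stages above concrete, the following is a minimal sketch of how a layout structure could be represented and passed on to later stages. It is an illustration only, not the method developed in this thesis: the `Region` class, its fields, and the top-to-bottom, left-to-right ordering rule are all assumptions, and the ordering rule would only hold for a simple single-column page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Region:
    """A rectangular zone of a page image (hypothetical structure)."""
    kind: str    # assumed labels: "text", "graphics" or "drawing"
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


def reading_order(regions: List[Region]) -> List[Region]:
    """Naive single-column order: top to bottom, then left to right."""
    return sorted(regions, key=lambda r: (r.y, r.x))


def regions_for_ocr(regions: List[Region]) -> List[Region]:
    """Keep only the text regions, in reading order, for an OCR module."""
    return [r for r in reading_order(regions) if r.kind == "text"]


# A made-up layout, as a segmentation step might produce it.
layout = [
    Region("text", 50, 400, 500, 200),
    Region("graphics", 50, 150, 500, 220),
    Region("text", 50, 40, 500, 80),
]

ocr_queue = regions_for_ocr(layout)
print([(r.kind, r.y) for r in ocr_queue])
```

Here the graphics region stays in the layout structure but is not sent to OCR. Multi-column or noisy pages would need a more careful ordering rule than this sort.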
In this thesis, we develop a method for segmenting a document image into its<br />
different parts. Currently, Safig, the leader of Demat-Factory, uses software to<br />
do the same task, but under human supervision. The undisclosed software<br />
that they use has many tunable parameters that need to be monitored carefully<br />
for each document in order to produce a correct segmentation result. Our goal<br />
is to automate this process and generalize the method so that it can be<br />
applied to a broad range of documents. There are many open-source and commercial<br />
applications, such as Tesseract-OCR, OCRopus and ABBYY FineReader,<br />
that perform the same task. However, as we will examine later, they are better<br />
suited to well-formatted printed documents that are free of noise. Our goal is to<br />