Segmentation of heterogeneous document images : an ... - Tel
of text lines using a criterion that depends on the geometric appearances
and indentations of text lines in a paragraph.
1.5 Our datasets
One of the datasets used for training and testing in this work is a selection of
document images from a large corpus that was provided for the purpose of
this project. The original corpus contains historical handwritten manuscripts
and old printed forms, documents and advertisements, mostly written in
French. We have chosen a subset of 100 documents from this corpus that
is representative of the original set. Figure 1.5 shows some samples from our own dataset.
tel-00912566, version 1 - 2 Dec 2013<br />
In addition, we gathered 61 documents used as part of the dataset in the
ICDAR2009 [6] page segmentation competition and 100 documents from the dataset
used in the ICDAR2011 [4] historical document layout analysis competition, so
that we can compare our results with state-of-the-art methods. Some samples
from these datasets are shown in figure 1.6.
A ground truth file is available for each document. It contains the true
structure of text regions and text lines for that document. Ground truth data
are by far the most important part of the work, because they are used for
both training and testing various parts of the system. They are usually in
text format and contain the coordinates of objects in a hierarchical structure. For
text and graphics separation we only need the original document image and the
locations of the true text regions. Every component that does not belong to a text
region is considered a non-text element. A collection of textual and non-textual
components is used for training. For text region detection, we
divide every document into sites of predefined height and width. Knowing
whether a site lies on a text or a non-text region of the ground truth,
we can easily generate the true labels for training our region
detector. For text line and paragraph detection we consider the geometric appearance
and relative position of the text lines defined in the ground truth structure.
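The site labelling described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used in this work: text regions are taken to be axis-aligned rectangles, and a site is labelled as text when its centre falls inside a ground truth text region. The function name site_labels and the centre-point rule are hypothetical; the text does not specify how a site is matched to a region.

```python
from typing import List, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Rect = Tuple[int, int, int, int]


def site_labels(page_w: int, page_h: int, site_w: int, site_h: int,
                text_regions: List[Rect]) -> List[List[int]]:
    """Return a grid of labels: 1 for a text site, 0 for a non-text site.

    Assumption: a site is a text site when its centre lies inside
    any ground truth text region.
    """
    # Number of sites needed to cover the page (border sites may be partial).
    rows = (page_h + site_h - 1) // site_h
    cols = (page_w + site_w - 1) // site_w
    labels = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Centre of the site, clipped to the page for partial border sites.
            cx = min(c * site_w + site_w // 2, page_w - 1)
            cy = min(r * site_h + site_h // 2, page_h - 1)
            for (x0, y0, x1, y1) in text_regions:
                if x0 <= cx < x1 and y0 <= cy < y1:
                    labels[r][c] = 1
                    break
    return labels
```

For example, site_labels(100, 60, 20, 20, [(0, 0, 40, 20)]) yields a 3 by 5 grid in which only the first two sites of the first row are labelled as text. An overlap-ratio rule could replace the centre-point test if sites straddling a region boundary need finer treatment.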
In the first stages of the project, we had no ground truth data for any of
the datasets, so we developed a software tool to generate our own. Figure
1.7 shows a screenshot of this tool. After opening an image,
the software renders a scene of all connected components extracted from the
image. The user can group several connected components into a text line and,
in the same way, group several text lines into a paragraph. If a
large connected component (a table or the frame of an advertisement) occludes other
smaller components, the user has the option to deactivate that particular component.
Finally, the application generates an XML file for each document with
the correct structure to be used as ground truth. Figure 1.8 shows part of an
XML file created by our software.
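To make the hierarchical grouping concrete, here is a minimal sketch of how such a ground truth file could be serialised with Python's standard xml.etree.ElementTree module. The element and attribute names (Document, Paragraph, TextLine, Component, left, top, right, bottom) and the function build_ground_truth are hypothetical and chosen only for illustration; the actual schema produced by our software is the one shown in figure 1.8.

```python
import xml.etree.ElementTree as ET


def build_ground_truth(image_name, paragraphs):
    """Serialise grouped connected components as an XML string.

    paragraphs: list of paragraphs; each paragraph is a list of text lines;
    each text line is a list of (left, top, right, bottom) bounding boxes.
    All element and attribute names are illustrative assumptions.
    """
    root = ET.Element("Document", {"image": image_name})
    for p_idx, lines in enumerate(paragraphs):
        par = ET.SubElement(root, "Paragraph", {"id": str(p_idx)})
        for l_idx, boxes in enumerate(lines):
            line = ET.SubElement(par, "TextLine", {"id": str(l_idx)})
            for (x0, y0, x1, y1) in boxes:
                # One node per connected component, stored as its bounding box.
                ET.SubElement(line, "Component", {
                    "left": str(x0), "top": str(y0),
                    "right": str(x1), "bottom": str(y1),
                })
    return ET.tostring(root, encoding="unicode")
```

Calling build_ground_truth("doc_001.png", [[[(10, 10, 20, 30)]]]) returns a document with one paragraph containing one text line and one component. Nesting components under text lines and text lines under paragraphs mirrors the two grouping steps the user performs in the tool.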
Later, we got access to the ground truth data for the ICDAR2009 dataset. The
new ground truth data are also provided in XML format, but the nodes and
elements of the new XML documents are different from ours. Because the
dataset and ground truth data belong to the Prima group at the University of Salford,