
Segmentation of heterogeneous document images : an ... - Tel


of text lines using a criterion that depends on the geometric appearances and indentations of text lines in a paragraph.

1.5 Our datasets

One of the datasets used for training and testing in this work is a selection of document images from a large corpus that was provided for the purpose of this project. The original corpus contains historical handwritten manuscripts and old printed forms, documents and advertisements, mostly written in French. We have chosen a subset of 100 documents from this corpus that is representative of the original set. Figure 1.5 shows some samples of our own dataset.

tel-00912566, version 1 - 2 Dec 2013<br />

In addition, we gathered 61 documents used as part of the dataset in the ICDAR2009 [6] page segmentation competition and 100 documents from the dataset used in the ICDAR2011 [4] historical document layout analysis competition, so that we can compare our results with state-of-the-art methods. Some samples from these datasets are shown in Figure 1.6.

For each document a ground truth file is available. The ground truth file contains the true structure of text regions and text lines for each document. Ground truth data are by far the most important part of the work, since they are needed for both training and testing various parts of the system. They are usually in text format and contain coordinates of objects in a hierarchical structure. For text and graphics separation we only need the original document image and the location of the true text regions. Every component that does not belong to a text region is considered a non-text element. A collection of textual and non-textual components is used for the purpose of training. In text region detection, we divide every document into sites with predefined heights and widths. By knowing whether a site is located on a text or non-text region of the ground truth, we can easily generate the true labels for training our region detector. For text line and paragraph detection we consider the geometric appearance and relative position of the text lines defined in the ground truth structure.
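The site labelling described above can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes that ground truth text regions are axis-aligned rectangles and that a site counts as text when its centre falls inside one of them. The function name `site_labels` and the box convention are hypothetical.

```python
from typing import List, Tuple

# Assumed convention: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def site_labels(page_w: int, page_h: int,
                site_w: int, site_h: int,
                text_regions: List[Box]) -> List[List[int]]:
    """Return a grid of labels, 1 for text sites and 0 for non-text sites.

    Assumption for illustration: a site is labelled text when its centre
    lies inside at least one ground truth text rectangle.
    """
    labels = []
    for top in range(0, page_h, site_h):
        row = []
        for left in range(0, page_w, site_w):
            # Sites on the right and bottom borders may be clipped by the page.
            cx = left + min(site_w, page_w - left) / 2
            cy = top + min(site_h, page_h - top) / 2
            inside = any(x0 <= cx < x1 and y0 <= cy < y1
                         for x0, y0, x1, y1 in text_regions)
            row.append(1 if inside else 0)
        labels.append(row)
    return labels


# A 40x20 page cut into 10x10 sites, with one text region in the top-left.
grid = site_labels(40, 20, 10, 10, [(0, 0, 20, 10)])
```

Other decision rules, such as labelling by the fraction of the site covered by text, would fit the same description; the centre test is only the simplest choice.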

In the first stages of the project, we had no ground truth data for any of the datasets, so we developed a software tool to generate our own. Figure 1.7 shows a screenshot of this tool. After opening an image, the software renders a scene of all connected components extracted from the image. The user can group several connected components as a text line. With the same strategy the user can group several text lines as a paragraph. In case a large connected component (a table or the frame of an advertisement) occludes other smaller components, the user has the option to deactivate that particular component. Finally, the application generates an XML file for each document with the correct structure to be used as ground truth. Figure 1.8 shows part of an XML file created by our software.
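As an illustration of the hierarchy described above (components grouped into text lines, text lines grouped into paragraphs), the following sketch serialises such a grouping to XML. The element and attribute names (`document`, `paragraph`, `textline`, `component`, `x`, `y`, `width`, `height`) are assumptions made for this example only; the actual schema produced by the tool is the one shown in Figure 1.8.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Assumed bounding box of a connected component: (x, y, width, height).
Component = Tuple[int, int, int, int]
TextLine = List[Component]
Paragraph = List[TextLine]


def build_ground_truth(image_name: str, paragraphs: List[Paragraph]) -> str:
    """Serialise a paragraph > text line > component grouping to XML.

    All tag and attribute names are hypothetical placeholders.
    """
    root = ET.Element("document", {"image": image_name})
    for p_idx, lines in enumerate(paragraphs):
        p_el = ET.SubElement(root, "paragraph", {"id": str(p_idx)})
        for l_idx, components in enumerate(lines):
            l_el = ET.SubElement(p_el, "textline", {"id": str(l_idx)})
            for x, y, w, h in components:
                ET.SubElement(l_el, "component", {
                    "x": str(x), "y": str(y),
                    "width": str(w), "height": str(h),
                })
    return ET.tostring(root, encoding="unicode")


# One paragraph with two text lines, holding one and two components.
xml_text = build_ground_truth(
    "page.png",
    [[[(1, 2, 3, 4)], [(5, 6, 7, 8), (9, 10, 11, 12)]]],
)
```

Keeping the grouping as nested lists mirrors the way the user builds it interactively: components first, then lines, then paragraphs.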

Later, we got access to ground truth data for the ICDAR2009 dataset. The new ground truth data are also provided in XML format, but the nodes and elements of the new XML documents are different from ours. Because the dataset and ground truth data belong to the PRImA group at the University of Salford,
