Segmentation of heterogeneous document images : an ... - Tel

4.7.1 Post-processing

Three post-processing steps are applied to the output of the CRF:

• Removing regions whose width or height is smaller than the width or height of the average character.
• Opening each region separately from the other regions. Since side notes are very close to the main text body, care should be taken not to merge two text regions together.
• Applying a hole-filling method to fill holes inside each text region separately.

tel-00912566, version 1 - 2 Dec 2013

Four pages are shown in figure 4.12 after applying these three post-processing steps.
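The three post-processing steps can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the label-map representation (0 for background, k > 0 for text region k), the 3x3 structuring element, the 4-connected hole definition, and all function names are assumptions made for the example.

```python
from collections import deque

def bounding_box(mask):
    """Return (height, width) of the bounding box of the True cells."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return 0, 0
    return max(rows) - min(rows) + 1, max(cols) - min(cols) + 1

def _neighbourhood(mask, combine):
    """Apply all() or any() over each 3x3 window; outside cells are background."""
    h, w = len(mask), len(mask[0])
    def at(r, c):
        return 0 <= r < h and 0 <= c < w and mask[r][c]
    return [[combine(at(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1))
             for c in range(w)] for r in range(h)]

def opening(mask):
    """Binary opening: erosion followed by dilation (3x3 square, assumed size)."""
    return _neighbourhood(_neighbourhood(mask, all), any)

def fill_holes(mask):
    """Fill background cells that are not 4-connected to the image border."""
    h, w = len(mask), len(mask[0])
    outside = [[False] * w for _ in range(h)]
    queue = deque((r, c) for r in range(h) for c in range(w)
                  if (r in (0, h - 1) or c in (0, w - 1)) and not mask[r][c])
    for r, c in queue:
        outside[r][c] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < h and 0 <= nc < w and not mask[nr][nc] and not outside[nr][nc]:
                outside[nr][nc] = True
                queue.append((nr, nc))
    return [[not outside[r][c] for c in range(w)] for r in range(h)]

def post_process(labels, char_h, char_w):
    """labels: 2-D list of ints, 0 = background, k > 0 = text region k."""
    h, w = len(labels), len(labels[0])
    out = [[0] * w for _ in range(h)]
    for k in sorted({v for row in labels for v in row if v}):
        mask = [[v == k for v in row] for row in labels]
        bh, bw = bounding_box(mask)
        if bh < char_h or bw < char_w:
            continue                        # step 1: drop regions smaller than a character
        mask = fill_holes(opening(mask))    # steps 2 and 3, one region at a time
        for r in range(h):
            for c in range(w):
                if mask[r][c] and out[r][c] == 0:
                    out[r][c] = k           # never overwrite another region
    return out
```

Processing one region mask at a time is what keeps the opening from merging a side note with the neighbouring main text body. With NumPy arrays, `scipy.ndimage.binary_opening` and `scipy.ndimage.binary_fill_holes` could replace the hand-written helpers.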

4.8 Results and discussion

At this point, text regions are detected and ready for text line detection, but paragraphs are yet to be found. Paragraphs may or may not be separated, depending on the distance between them. However, since in most ground-truth data for competitions such as the ICDAR2011 Historical Document Layout Competition paragraphs are annotated separately, an evaluation of the results based on region matching with the true ground-truth data is meaningless for the purpose of comparison.

We report the current success rate for site-wise classification. The statistics aim to show how far the results are from the closest acceptable region segmentation for the purpose of text line detection. This means that, instead of preparing a ground-truth that separates all the paragraphs, we generate the ground-truth data by correcting the segmentation results to make them acceptable for text line detection.

Table 4.1 gives the percentage of misclassified sites in the output of our CRF model.

Table 4.1: NUMBER OF MISCLASSIFIED SITES (%) FROM THE OUTPUT OF OUR CRF MODEL

Total sites (%)   Textual sites (%)   Non-textual sites (%)   Gap between columns (%)
0.97              1.32                0.88                    3.5
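As an illustration of how site-wise misclassification percentages of this kind could be computed, the sketch below compares predicted site labels with the corrected ground-truth labels. The function name, the label strings, the optional `column_gap` mask, and the choice to normalise each rate by the number of ground-truth sites in its category are assumptions; the text does not specify how the table was computed.

```python
def misclassification_rates(predicted, truth, column_gap=None):
    """Percentage of misclassified sites, overall and per ground-truth category.

    predicted, truth: equal-length sequences of "text" / "non-text" labels.
    column_gap: optional booleans marking sites lying in a gap between
                columns (hypothetical input for the last table column).
    """
    n = len(truth)
    if len(predicted) != n:
        raise ValueError("predicted and truth must have the same length")

    def rate(indices):
        indices = list(indices)
        if not indices:
            return None                      # category absent from this page
        wrong = sum(predicted[i] != truth[i] for i in indices)
        return 100.0 * wrong / len(indices)

    rates = {
        "total": rate(range(n)),
        "text": rate(i for i in range(n) if truth[i] == "text"),
        "non-text": rate(i for i in range(n) if truth[i] == "non-text"),
    }
    if column_gap is not None:
        rates["column gap"] = rate(i for i in range(n) if column_gap[i])
    return rates
```

Here `truth` stands for the closest acceptable segmentation obtained by correcting the output, not for a paragraph-level annotation.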

Two other tables, 4.2 and 4.3, show region segmentation success rates for different images. The indicated rates are computed between the segmentation output and the closest acceptable segmentation for text line detection. The closest acceptable segmentation is a segmentation that is capable of producing
