Segmentation of heterogeneous document images : an ... - Tel

tel-00912566, version 1 - 2 Dec 2013

Algorithm 1 State decoding

procedure DecodeStates()
    F_preserve ← feature vector from current paragraph
    if this paragraph has no children then
        state ← "Preserve"
        F ← F_preserve
    else
        First child → DecodeStates()
        F_1 ← feature vector from first child
        Second child → DecodeStates()
        F_2 ← feature vector from second child
        F_remove ← F_1 + F_2
        C_preserve ← W^T F_preserve    ⊲ Cost of preserving this paragraph
        C_remove ← W^T F_remove        ⊲ Cost of removing this paragraph
        if C_preserve < C_remove then
            state ← "Preserve"
            F ← F_preserve
        else
            state ← "Remove"
            F ← F_remove
        end if
    end if
end procedure
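The recursion in Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used for the experiments. The `Paragraph` class, its field names, and the `decode_states` and `dot` helpers are hypothetical names introduced here. The sketch also assumes that the feature vector taken "from" a child is the vector F that the child's own decoding produced, and that every non-leaf paragraph has exactly two children.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Paragraph:
    """Hypothetical node of the binary paragraph tree (names are assumptions)."""
    features: List[float]                 # F_preserve of this paragraph
    first: Optional["Paragraph"] = None   # first child, if any
    second: Optional["Paragraph"] = None  # second child, if any
    state: str = ""                       # set to "Preserve" or "Remove"


def dot(w: Sequence[float], f: Sequence[float]) -> float:
    """Inner product W^T F."""
    return sum(a * b for a, b in zip(w, f))


def decode_states(node: Paragraph, weights: Sequence[float]) -> List[float]:
    """Label the subtree rooted at `node` and return its decoded vector F."""
    f_preserve = list(node.features)

    # A paragraph without children is always preserved.
    if node.first is None or node.second is None:
        node.state = "Preserve"
        return f_preserve

    # Decode both children first; assumption: their returned vectors
    # play the role of F_1 and F_2 in Algorithm 1.
    f1 = decode_states(node.first, weights)
    f2 = decode_states(node.second, weights)
    f_remove = [a + b for a, b in zip(f1, f2)]

    c_preserve = dot(weights, f_preserve)  # cost of preserving this paragraph
    c_remove = dot(weights, f_remove)      # cost of removing this paragraph

    if c_preserve < c_remove:
        node.state = "Preserve"
        return f_preserve
    node.state = "Remove"
    return f_remove
```

Calling `decode_states(root, weights)` on the root of the tree labels every visited paragraph in place. As in the pseudocode, a tie between the two costs resolves to "Remove" because the comparison is strict.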

6.6 Results

We apply our paragraph detection method to three datasets: 55 documents from the ICDAR2009 page segmentation competition [6], 100 documents from the ICDAR2011 historical document layout analysis competition [4], and 100 documents from our own corpus. For comparison, we also apply the Tesseract-OCR [89] and EPITA [4] page segmentation methods to all datasets.

The evaluations are carried out with the Prima Layout Evaluation Tool [25]. The evaluation method is described in full in appendix A. Tables 6.1, 6.2 and 6.3 summarize the results. Among the reported percentages, the area-weighted errors and success rates are the more important ones, because they also account for the size of the area in which a violation occurs.

The results show that all methods perform roughly the same on the well-formatted magazines and journals of the ICDAR2009 dataset, with our method doing slightly better.

On the historical documents of the ICDAR2011 competition dataset, on the other hand, our method and EPITA perform significantly better than Tesseract-OCR, with our method performing slightly better. According to this result, the results reported in [4] are updated in figure 6.3, computed based on scaled estimates.
