Segmentation of heterogeneous document images : an ... - Tel
tel-00912566, version 1 - 2 Dec 2013<br />
Algorithm 1 State decoding<br />
procedure DecodeStates()<br />
  F_preserve ← feature vector from current paragraph<br />
  if this paragraph has no children then<br />
    state ← "Preserve"<br />
    F ← F_preserve<br />
  else<br />
    First child → DecodeStates()<br />
    F_1 ← feature vector from first child<br />
    Second child → DecodeStates()<br />
    F_2 ← feature vector from second child<br />
    F_remove ← F_1 + F_2<br />
    C_preserve ← W^T F_preserve ⊲ Cost of preserving this paragraph<br />
    C_remove ← W^T F_remove ⊲ Cost of removing this paragraph<br />
    if C_preserve < C_remove then<br />
      state ← "Preserve"<br />
      F ← F_preserve<br />
    else<br />
      state ← "Remove"<br />
      F ← F_remove<br />
    end if<br />
  end if<br />
end procedure<br />
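The recursion in Algorithm 1 can be sketched in Python as follows. This is only an illustrative reading of the pseudocode, not the thesis implementation: the `Paragraph` class, its field names, and the `dot` helper are hypothetical, and the sketch assumes that the "feature vector from a child" is the vector F that the child selected during its own decoding.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Paragraph:
    # Hypothetical node type: a paragraph candidate in a binary tree.
    features: List[float]                 # F_preserve for this node
    first: Optional["Paragraph"] = None   # first child (None for a leaf)
    second: Optional["Paragraph"] = None  # second child (None for a leaf)
    state: str = ""                       # set by decode_states


def dot(w: List[float], f: List[float]) -> float:
    # W^T F for plain Python lists.
    return sum(wi * fi for wi, fi in zip(w, f))


def decode_states(node: Paragraph, w: List[float]) -> List[float]:
    """Label the subtree bottom-up; return the vector F chosen for `node`."""
    f_preserve = node.features
    if node.first is None or node.second is None:
        # A paragraph without children is always preserved.
        node.state = "Preserve"
        return f_preserve
    # Assumption: each child contributes the F it selected for itself.
    f1 = decode_states(node.first, w)
    f2 = decode_states(node.second, w)
    f_remove = [a + b for a, b in zip(f1, f2)]
    if dot(w, f_preserve) < dot(w, f_remove):
        node.state = "Preserve"
        return f_preserve
    node.state = "Remove"
    return f_remove
```

With two leaves whose features are `[1.0, 0.0]` and `[0.0, 1.0]`, a parent with features `[3.0, 3.0]`, and weights `[1.0, 1.0]`, preserving the parent costs 6 and removing it costs 2, so the parent is labelled "Remove" and passes the summed child vector upward.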
6.6 Results<br />
We apply our paragraph detection method to three datasets: 55 documents<br />
from the ICDAR2009 page segmentation competition [6], 100 documents from the<br />
ICDAR2011 historical document layout analysis competition [4], and 100<br />
documents from our own corpus. For comparison, we also apply the Tesseract-OCR<br />
[89] and EPITA [4] page segmentation methods to all datasets.<br />
The evaluations are carried out with the Prima Layout Evaluation Tool [25]. The<br />
full description of the evaluation method is given in appendix A. Tables 6.1,<br />
6.2 and 6.3 summarize the results. Among the reported percentages, the area<br />
weighted errors and success rates are the more important ones, because they<br />
also account for the size of the area in which a violation occurs.<br />
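To see why area weighting matters, consider the toy calculation below. It is not the formula used by the Prima Layout Evaluation Tool (that is described in appendix A); the function name, the `(area, correct)` representation, and the numbers are hypothetical and serve only to contrast a count-based rate with an area-weighted one.

```python
def success_rates(regions):
    """regions: list of (area, correct) pairs.

    Returns (count_based, area_weighted) success rates.
    """
    total_area = sum(area for area, _ in regions)
    count_based = sum(1 for _, ok in regions if ok) / len(regions)
    area_weighted = sum(area for area, ok in regions if ok) / total_area
    return count_based, area_weighted


# Hypothetical page: one large correct region, two small incorrect ones.
regions = [(9000, True), (500, False), (500, False)]
count_based, area_weighted = success_rates(regions)
```

Here only one of three regions is correct, yet the correct region covers 9000 of the 10000 area units, so the area-weighted rate is much higher than the count-based one.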
The results show that all methods perform roughly the same on the well<br />
formatted magazines and journals of the ICDAR2009 dataset, with our method<br />
doing slightly better.<br />
On the other hand, on the historical documents of the ICDAR2011 competition<br />
dataset, our method and the EPITA method perform significantly better than<br />
Tesseract-OCR, with our method again performing slightly better. According to<br />
this result, the results reported in [4] are updated in figure 6.3, computed based on scaled estimates.