Segmentation of heterogeneous document images : an ... - Tel
tel-00912566, version 1 - 2 Dec 2013
Figure 3.7: Some of the misclassified components gathered from our own documents. Black components are labelled correctly. Red components should have been assigned a graphics label but are incorrectly labelled as text. Blue components, on the other hand, have a text label in the ground truth but are misclassified as graphics.
A serious challenge in some documents arises from underlines. Underlines that appear in the middle of a text region, as shown in figure 3.7, pose two problems. First, they are treated as graphical components and removed from the set of text components; in text region detection, graphical components are then used to separate text regions. This behavior is expected from a true graphical component, but an underline in the middle of a text region may split that region in two, an understandable side effect in this situation. Second, when text characters are attached to the underline, not only does the underline disappear from the text region, it also takes some characters with it and leaves large gaps in the middle of the region. When this happens, it has a negative effect on our region detection stage.
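To make the side effect concrete, the following sketch shows how an underline-like component might be separated from text components before region detection. The thresholds and the aspect-ratio rule are hypothetical illustrations, not the classifier actually used in this chapter (which is a trained classifier, not a fixed heuristic):

```python
# Hypothetical heuristic for illustration only: a connected component
# whose bounding box is very wide and short is treated as an
# underline-like graphical component and removed from the text set.
# The thesis's method uses a trained classifier instead of this rule.

def looks_like_underline(bbox, min_aspect=8.0, max_height=6):
    """bbox = (x, y, width, height) of a connected component, in pixels."""
    x, y, w, h = bbox
    return h <= max_height and w / max(h, 1) >= min_aspect

def split_text_and_graphics(components):
    """Partition bounding boxes into (text, graphics) lists."""
    text, graphics = [], []
    for bbox in components:
        (graphics if looks_like_underline(bbox) else text).append(bbox)
    return text, graphics

# A long thin box (an underline) is routed to graphics; a roughly
# square box (a character) stays in the text set. If characters touch
# the underline, they merge into one wide component and are removed
# with it, producing the gaps described above.
components = [(10, 50, 120, 3), (10, 10, 12, 18)]
text, graphics = split_text_and_graphics(components)
```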
Here is another comparison between the results of the method described here and the results of text and graphics separation from the Tesseract-OCR and EPITA methods. The classifier for our method is trained on 26 documents selected from both the ICDAR2011 dataset and our own corpus. Tables 3.4, 3.5 and 3.6 show the results.
In conclusion, this chapter presents a method for separating text and graphics components with good separation accuracy.