14.01.2014 Views

Segmentation of heterogeneous document images : an ... - Tel

tel-00912566, version 1 - 2 Dec 2013

Figure 3.7: Some of the misclassified components gathered from our own documents. Black components are labelled correctly. Red components should have been assigned a graphics label but are incorrectly labelled as text. Blue components, on the other hand, have a text label in the ground truth but are misclassified as graphics.
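As a rough illustration of the color coding used in Figure 3.7, the mapping from a component's ground-truth and predicted labels to its display color could be sketched as follows. The label strings and the function name `error_color` are assumptions made for this example; they are not taken from the implementation described in this chapter.

```python
# Illustrative sketch only: label names and function name are hypothetical.
TEXT, GRAPHICS = "text", "graphics"


def error_color(ground_truth: str, predicted: str) -> str:
    """Return the Figure 3.7 display color for one connected component."""
    if ground_truth == predicted:
        return "black"  # labelled correctly
    if ground_truth == GRAPHICS and predicted == TEXT:
        return "red"    # graphics component mislabelled as text
    if ground_truth == TEXT and predicted == GRAPHICS:
        return "blue"   # text component mislabelled as graphics
    raise ValueError(f"unknown labels: {ground_truth!r}, {predicted!r}")
```

Counting the red and blue components separately keeps the two error types apart, which matters because they affect the later region detection stage differently.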

A serious challenge in some documents arises from underlines. An underline that appears in the middle of a text region, as shown in Figure 3.7, poses two problems. First, such underlines are treated as graphical components and removed from the set of text components, yet in text region detection they are used to separate regions of text. This behavior is expected of a true graphical component, but an underline in the middle of a text region may split the region in two, which is an understandable side effect in this situation. Second, where text characters are attached to the underline, not only does the underline disappear from the text region, it also takes some characters with it and leaves large gaps in the middle of the region. When this happens, it has a negative effect on our region detection stage.
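The splitting effect can be reproduced with a toy example. The sketch below is not the region detection method of this thesis; it is a minimal stand-in that groups text lines into regions and starts a new region whenever a graphical separator lies between two consecutive lines. The bounding-box format, the function names and the coordinates are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1), with y growing downwards.
Box = Tuple[int, int, int, int]


def overlaps_horizontally(a: Box, b: Box) -> bool:
    """True if the horizontal extents of the two boxes intersect."""
    return a[0] < b[2] and b[0] < a[2]


def group_lines(lines: List[Box], separators: List[Box]) -> List[List[Box]]:
    """Group text lines top to bottom, breaking at any separator that
    lies vertically between two consecutive lines."""
    regions: List[List[Box]] = []
    for line in sorted(lines, key=lambda b: b[1]):
        if regions:
            prev = regions[-1][-1]
            blocked = any(
                overlaps_horizontally(sep, line)
                and prev[3] <= sep[1]   # separator starts below the previous line
                and sep[3] <= line[1]   # and ends above the current line
                for sep in separators
            )
            if not blocked:
                regions[-1].append(line)
                continue
        regions.append([line])
    return regions


lines = [(0, 0, 100, 10), (0, 14, 100, 24), (0, 30, 100, 40)]
underline = (20, 25, 60, 26)  # sits between the second and third line

regions_without = group_lines(lines, [])
regions_with = group_lines(lines, [underline])
print(len(regions_without), len(regions_with))
```

With no separators the three lines form a single region; once the underline is passed in as a separator, the same lines are split into two regions, which mirrors the first problem described above.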

Here is another comparison between the results of the method described here and the results of text and graphics separation from the Tesseract-OCR and EPITA methods. The classifier for our method is trained on 26 documents selected from both the ICDAR2011 dataset and our own corpus. Tables 3.4, 3.5 and 3.6 show the results.

In conclusion, this chapter provides a method for separating text/graphics components with good separation accuracy.
