Table 3.4: COMPARISON OF TEXT/GRAPHICS SEPARATION WITH LOGITBOOST, TESSERACT-OCR AND EPITA ON ICDAR2009 (61 DOCUMENTS)

Method          Text precision   Text recall   Graphics precision   Graphics recall   Text Accuracy
LogitBoost      97.45            98.04         79.21                88.00             97.52
TesseractOCR    93.32            95.44         88.52                85.87             92.96
Epita           94.95            96.25         81.62                92.45             95.78

Table 3.5: COMPARISON OF TEXT/GRAPHICS SEPARATION WITH LOGITBOOST, TESSERACT-OCR AND EPITA ON ICDAR2011 (100 DOCUMENTS)

Method          Text precision   Text recall   Graphics precision   Graphics recall   Text Accuracy
LogitBoost      98.05            93.42         56.58                73.52             94.22
TesseractOCR    94.76            87.60         84.66                94.08             90.16
Epita           97.85            95.43         62.33                85.29             95.23

Table 3.6: COMPARISON OF TEXT/GRAPHICS SEPARATION WITH LOGITBOOST, TESSERACT-OCR AND EPITA ON OUR CORPUS (97 DOCUMENTS)

Method          Text precision   Text recall   Graphics precision   Graphics recall   Text Accuracy
LogitBoost      93.82            93.62         59.41                74.11             92.40
TesseractOCR    88.58            95.90         76.80                63.15             89.90
Epita           95.75            90.20         61.28                85.23             91.20
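The precision, recall, and accuracy figures reported above follow directly from per-component classification counts. The sketch below is a minimal illustration, assuming a component-level evaluation in which every connected component carries one ground-truth label ('text' or 'graphics') and one predicted label; whether the evaluation is weighted per component or per pixel is not stated here, so that choice is an assumption.

def separation_metrics(gt, pred):
    """Per-class precision/recall and overall accuracy for a text/graphics
    separation, given parallel lists of ground-truth and predicted labels.
    (Component-level weighting is an assumption, not the thesis' protocol.)"""
    def precision_recall(cls):
        tp = sum(1 for g, p in zip(gt, pred) if g == cls and p == cls)
        fp = sum(1 for g, p in zip(gt, pred) if g != cls and p == cls)
        fn = sum(1 for g, p in zip(gt, pred) if g == cls and p != cls)
        precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
        recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    text_p, text_r = precision_recall('text')
    graphics_p, graphics_r = precision_recall('graphics')
    accuracy = 100.0 * sum(g == p for g, p in zip(gt, pred)) / len(gt)
    return text_p, text_r, graphics_p, graphics_r, accuracy

# Toy usage: five components, one graphics component mislabelled as text.
gt   = ['text', 'text', 'text', 'graphics', 'graphics']
pred = ['text', 'text', 'text', 'text',     'graphics']
print(separation_metrics(gt, pred))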

Chapter 4

Region detection

After obtaining two separate images, one containing textual components and the other containing graphical components, we apply a region detection method to separate text regions. As already mentioned, many authors do not separate text and graphics; they directly apply their methods to segment regions of text, and the decision about which region is text and which is graphics comes after segmentation. As mentioned before, when parts of a graphical drawing are located close to text characters and are aligned with them, they may be merged incorrectly into the text regions. Other authors do apply text and graphics separation, but then discard all graphical components and apply text line or region detection only to the textual components. Our method incorporates both textual and graphical components. It is designed to take advantage of graphical components such as rule lines to separate text regions.

Although in many situations rule lines help to separate text regions or columns of text, there are many examples where such rules are simply not available. Many authors have successfully overcome this problem using a distance-based approach or an analysis of white space. However, the challenge remains when columns of text lie very close to one another. One hard case is separating side notes from the main text body: in most documents from our corpus, side notes appear so close to the main text that a distance-based method alone fails to separate them. We propose a framework based on two-dimensional conditional random fields (CRFs) to separate regions of text from one another, which also aims to separate side notes and text strings inside a table structure.

The first motivation for using CRFs rather than other locally trained machine learning methods is the long-distance communication between sites in different parts of the image. In Figure 4.1, three regions are shown in red. Considering only the local information around each of these regions separately, it would be difficult to recognize them as gaps between side notes and the main text. However, by taking advantage of the CRFs' long-distance message passing between these regions and considering the alignment of same-labelled regions, column separators can be easily detected.
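To make this idea concrete before the formal presentation, the following is a minimal sketch of a grid-structured (two-dimensional) labelling of page cells as text or separator. The local ink-density feature, the hand-set potentials, and the use of simple ICM inference are illustrative assumptions chosen for brevity; they are not the actual model developed in this chapter.

import numpy as np

TEXT, SEP = 0, 1

def unary_cost(density):
    """Cost of each label for one site, from the local ink density in [0, 1].
    High density favours TEXT, low density favours SEPARATOR (assumed feature)."""
    return np.array([1.0 - density,   # cost of labelling the site TEXT
                     density])        # cost of labelling the site SEPARATOR

def pairwise_cost(la, lb, vertical, w_smooth=0.5, w_align=1.5):
    """Smoothness between 4-connected neighbours.  Vertical neighbours that
    disagree are penalised more, which propagates SEPARATOR labels along a
    column -- a crude stand-in for the long-distance alignment cue."""
    if la == lb:
        return 0.0
    return w_align if vertical else w_smooth

def icm(density, n_iters=10):
    """Iterated conditional modes: greedily relabel each site given its
    neighbours until convergence (a simple approximate MAP inference)."""
    h, w = density.shape
    labels = (density < 0.1).astype(int)          # initial guess: empty cell -> SEP
    for _ in range(n_iters):
        changed = False
        for y in range(h):
            for x in range(w):
                costs = unary_cost(density[y, x])
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        for lab in (TEXT, SEP):
                            costs[lab] += pairwise_cost(
                                lab, labels[ny, nx], vertical=(dx == 0))
                best = int(np.argmin(costs))
                if best != labels[y, x]:
                    labels[y, x] = best
                    changed = True
        if not changed:
            break
    return labels

# Toy example: a page split into a 6x8 grid of cells with an almost-empty
# column (x == 4) between the side notes and the main text body.
density = np.full((6, 8), 0.8)
density[:, 4] = 0.05
print(icm(density))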
