14.01.2014 Views

Segmentation of heterogeneous document images : an ... - Tel

Segmentation of heterogeneous document images : an ... - Tel

Segmentation of heterogeneous document images : an ... - Tel

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

them.<br />

In such circumst<strong>an</strong>ces, it would be logical to take adv<strong>an</strong>tage <strong>of</strong> methods that<br />

work on a block or a region <strong>of</strong> <strong>document</strong> instead <strong>of</strong> the connected components.<br />

The idea is that regions <strong>of</strong> text have a particular texture that c<strong>an</strong> be revealed<br />

with methods such as wavelet <strong>an</strong>alysis, co-occurrence matrices, Radon tr<strong>an</strong>sform,<br />

etc.<br />

Shape <strong>of</strong> the regions c<strong>an</strong> be rect<strong>an</strong>gular or free form. They c<strong>an</strong> be strips <strong>of</strong><br />

overlapping blocks or coming from a multi-resolution structure like quad-tree or<br />

image pyramids. To name a few methods, Journet [44] uses a method based on<br />

orientation <strong>of</strong> auto-correlation <strong>an</strong>d frequency features to segment regions <strong>of</strong> text<br />

in historical <strong>document</strong> <strong>images</strong>. In [46], Kim et al propose <strong>an</strong> approach based on<br />

support vector machine (SVM) <strong>an</strong>d CAMSHIFT algorithm.<br />

tel-00912566, version 1 - 2 Dec 2013<br />

In CC-based methods, the geometry <strong>of</strong> the components is clearly defined.<br />

However, no matter how well the methods work, there is always <strong>an</strong> issue <strong>of</strong><br />

confidence at the boundaries <strong>of</strong> regions. This issue becomes larger as the gap<br />

between text <strong>an</strong>d non-textual components gets smaller. Using large block implies<br />

high confidence on the computed features, but poor confidence on regions’<br />

boundaries. Using small blocks implies just the opposite. One solution is to use<br />

overlapping blocks, but then the problem becomes to assign a label to pixels<br />

when two or more overlapping blocks have a disagreement about the label <strong>of</strong> the<br />

pixels. Confidence based methods for combination <strong>of</strong> evidence (e.g. Dempster-<br />

Shafer ) also may fail. Consider the case where two blocks have overlapping<br />

pixels. One block is located on the mostly text area <strong>of</strong> the <strong>document</strong>, <strong>an</strong>d the<br />

other block is located on the mostly graphical one. The pixels in between get<br />

evidences from two sources that have a high degree <strong>of</strong> conflict. The Dempster-<br />

Shafer theory is criticized for being incompetence in the mentioned situation<br />

[104].<br />

The other problem with region-based methods is their ineffectiveness to detect<br />

frames <strong>an</strong>d lines that belong to a table. Thus, the result for parts <strong>of</strong> a table<br />

that contain text, comes back as text for every pixel <strong>of</strong> that region, whereas it<br />

includes separating rule lines.<br />

2.1.3 Conclusion<br />

Whenever it is feasible to compute the connected components <strong>of</strong> a <strong>document</strong>,<br />

it is preferable to assign the label to the connected component itself <strong>an</strong>d avoid<br />

using a region-based method. However, to overcome some <strong>of</strong> the drawbacks<br />

<strong>of</strong> the component based methods, a hybrid approach should be adapted that<br />

also takes adv<strong>an</strong>tage <strong>of</strong> texture features around the component. Our method is<br />

based on this strategy <strong>an</strong>d is described in chapter 3.<br />

17

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!