Segmentation of heterogeneous document images : an ... - Tel


tel.archives.ouvertes.fr
14.01.2014 Views

Chapter 7 Conclusion and future work

tel-00912566, version 1 - 2 Dec 2013

In this thesis, we have provided a new framework for document page segmentation that improves upon current state-of-the-art methods. Unlike most methods, which apply page segmentation to all components of the page, we have presented a method that identifies non-textual components and removes them from the document image before page segmentation is applied. Moreover, the identified non-textual components, such as rulers, table structures and advertisement frames, are used again to separate text regions.

For text region detection, we have introduced a framework based on two-dimensional conditional random fields that gathers various observations from whitespace analysis, graphical components and run-lengths, and separates regions of text according to these observations. As a result, we were able to improve the separation of text columns that are situated very close to one another. This framework also prevents the contents of a table cell from being merged with the contents of adjacent cells. Furthermore, the method does not allow text regions inside a frame to be merged with the text regions around it.

After carefully examining text line detection methods that can cope with the variety and variation of text lines in handwritten and printed documents, we have adopted a variant of Papavassiliou's text line detection method [75] to segment text regions into text lines. Finally, a novel trainable method based on a binary partition tree is proposed to determine paragraph structure from the appearance of text lines in a text region.

7.1 Future direction

As with every new work, there are many problems and challenges that could be addressed by improving the methods presented in this thesis. We propose several ideas that would clearly benefit this work. We are currently working on some of these ideas to improve our results, and we decided to set aside others due to time limitations.

For text and graphics separation, it would be beneficial to use three labels, "text", "graphics" and "text containers", instead of just the two labels "text" and "non-text". Text containers are tables and frames that contain text. Currently we identify them by thresholding the number of children and the solidity feature of non-textual components. Learning this from a training dataset would make it more robust to errors, but it requires a modification to the ground truth data that we use.

For text region detection there is more room for improvement. Currently we compute several observations from Gabor features with different window sizes and pass these observations to our feature functions. However, to make the method more robust and multi-scale, the window size of the Gabor filter should change locally according to the local height of text connected components. To do this, we would have to compute a complete set of Gabor filters and convolve each of them with the original image to generate many filtering results. Then, for each site, we would pick the value from the filtering result that corresponds to the average local height of text components at that site. Clearly, this approach is infeasible due to its computational burden. It would be desirable to have a formulation and implementation of non-stationary Gabor filters in which the size of the kernel changes locally according to the local height of text components.
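The brute-force scheme described above (filter with every window size, then select per site) can be sketched as follows. This is only an illustration under stated assumptions, not the thesis implementation: the kernel parameterisation (`sigma = size / 4`, wavelength equal to half the window size), the single orientation, and the helper names `gabor_kernel`, `convolve_same` and `select_by_local_height` are all hypothetical.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta=0.0):
    """Real (cosine) Gabor kernel; sigma is tied to the window size (assumption)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)
    sigma = size / 4.0
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * xr / wavelength)

def convolve_same(image, kernel):
    """Linear 2-D convolution via FFT, cropped back to the input size."""
    h, w = image.shape
    kh, kw = kernel.shape
    shape = (h + kh - 1, w + kw - 1)
    spectrum = np.fft.rfft2(image, shape) * np.fft.rfft2(kernel, shape)
    full = np.fft.irfft2(spectrum, shape)
    top, left = (kh - 1) // 2, (kw - 1) // 2
    return full[top:top + h, left:left + w]

def select_by_local_height(image, local_height, sizes):
    """Filter with every window size, then keep, for each site, the response
    whose window size is closest to the local text height at that site."""
    sizes = np.asarray(sizes)
    bank = np.stack([
        convolve_same(image, gabor_kernel(int(s), wavelength=int(s) / 2.0))
        for s in sizes
    ])
    # Index of the nearest window size for every pixel.
    idx = np.abs(local_height[None, :, :] - sizes[:, None, None]).argmin(axis=0)
    selected = np.take_along_axis(bank, idx[None, :, :], axis=0)[0]
    return selected, idx
```

The cost is one full-image convolution per window size, which is the computational burden the paragraph refers to; a non-stationary formulation would avoid computing responses that are never selected.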

Feature functions in our CRF model depend on several observation maps from the image. However, the number of features is not enough to separate sites with great confidence. Also, the dependencies between these observations are not exploited in our framework. As a result, the contribution of edge potentials in our CRF model is very limited due to the small number of feature functions. One can benefit from more effective features such as Ferns [74, 73] to define feature functions in the CRF model.

As for inference in conditional random fields, we use ICM [9] due to its simplicity and its fast training time, which is less than 4 to 6 hours. However, it is known that ICM fails to capture long-range interactions between the labels of sites. We evaluated loopy belief propagation [64] in this work, but it did not converge to a good solution within a considerable amount of time. One may benefit from other inference methods to improve the results of the CRF model.
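To make the locality of ICM concrete, here is a minimal sketch of iterated conditional modes on a 4-connected grid. It assumes a generic Potts smoothness term with weight `beta` and a precomputed array of unary costs; the feature functions and potentials of the CRF in this thesis are not reproduced, and the function name `icm` is hypothetical.

```python
import numpy as np

def icm(unary, beta=1.0, max_sweeps=10):
    """Iterated conditional modes on a 4-connected grid.

    unary: (H, W, K) array of per-site label costs (lower is better).
    beta:  Potts penalty paid for each neighbour carrying a different label.
    Returns an (H, W) integer label map.
    """
    h, w, _ = unary.shape
    labels = unary.argmin(axis=2)  # start from the best unary label
    for _ in range(max_sweeps):
        changed = False
        for i in range(h):
            for j in range(w):
                cost = unary[i, j].astype(float)  # astype returns a copy
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        # Every label except the neighbour's pays beta.
                        cost += beta
                        cost[labels[ni, nj]] -= beta
                best = int(cost.argmin())
                if best != labels[i, j]:
                    labels[i, j] = best
                    changed = True
        if not changed:  # no site changed: a local minimum was reached
            break
    return labels
```

Each update looks only at the four immediate neighbours of a site with all other labels held fixed, so the procedure settles in a local minimum close to its initialisation, which is consistent with the limitation on long-range interactions noted above.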

Finally, in the method for text line detection, the global median height of text connected components is used inside the transition and emission probabilities of the HMM. One can reformulate these probabilities to use the local average height of text components instead of the global statistic.
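As a rough illustration of the difference between the two statistics, the sketch below computes a local average height for each component from the components whose centres lie within a fixed radius. The neighbourhood definition, the `radius` parameter and the function name `local_average_height` are assumptions made for this example; the HMM probabilities themselves are not reproduced here.

```python
import numpy as np

def local_average_height(centers, heights, radius):
    """Mean height of the components whose centre lies within `radius`
    of each component's centre (the component itself is included)."""
    centers = np.asarray(centers, dtype=float)
    heights = np.asarray(heights, dtype=float)
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    near = dist <= radius
    return (near * heights[None, :]).sum(axis=1) / near.sum(axis=1)

# Two groups of components with different text sizes on the same page.
centers = [(0, 0), (10, 0), (20, 0), (500, 0), (510, 0)]
heights = [10, 12, 14, 40, 44]
local = local_average_height(centers, heights, radius=30)
global_median = float(np.median(heights))
```

In this toy layout the global median is a single value for the whole page, while the local average follows each group of components separately, which is the behaviour the proposed reformulation would rely on.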
