Segmentation of heterogeneous document images : an ... - Tel


tel.archives.ouvertes.fr

Chapter 7 Conclusion and future work

tel-00912566, version 1 - 2 Dec 2013

In this thesis, we have provided a new framework for document page segmentation that brings some improvements upon current state-of-the-art methods. Unlike most methods, which apply page segmentation to all components of the page, we have presented a method that identifies non-textual components and removes them from the document image before a page segmentation method is applied. Moreover, the identified non-textual components, such as rulers, table structures and advertisement frames, are reused to separate text regions.

For text region detection, we have introduced a framework based on two-dimensional conditional random fields that gathers various observations from whitespace analysis, graphical components and run-lengths, and separates regions of text according to these observations. As a result, we were able to improve the separation of text columns that are situated very close to each other. This framework also prevents the contents of a table cell from being merged with the contents of adjacent cells. Furthermore, the method does not allow text regions inside a frame to be merged with the text regions around it.

After carefully examining methods for text line detection that can sustain the variety and variation of text lines in handwritten and printed documents, we have adopted a variant of Papavassiliou’s text line detection method [75] to segment text regions into text lines.

Finally, a novel trainable method based on a binary partition tree is proposed to determine the paragraph structures based on the appearance of text lines in a text region.

7.1 Future direction

As with every new work, there are many problems and challenges that can be overcome with an improvement to the methods presented in this thesis. We propose several ideas which would clearly be of benefit to this work. We are currently working on some of these ideas to improve our results, and we decided to set aside others due to time limitations.

For text and graphics separation, it would be beneficial to use three labels, "text", "graphics" and "text containers", instead of just the two labels "text" and "non-text". Text containers refer to tables and frames that contain text. Currently we identify them by thresholding the number of children and the solidity feature of non-textual components. Learning this from a training dataset would make it more robust to errors, but it requires a modification to the ground truth data that we use.

For text region detection there is more room for improvement. Currently we compute several observations from Gabor features with different window sizes and pass these observations to our feature functions. However, to make the method more robust and multi-scale, the window size of the Gabor filter should change locally according to the local height of text connected components. To do this, we would have to compute a complete set of Gabor filters and convolve them with the original image to generate many filtering results. Then, for each site, based on the average local height of text components, we would have to pick the value from the filtering result that corresponds to that local height. Clearly, this approach is infeasible due to its computational burden. It would be desirable to have a formulation and implementation of non-stationary Gabor filters in which the size of the kernel changes locally according to the local height of text components.

Feature functions in our CRF model depend on several observation maps from the image. However, the number of features is not enough to separate sites with great confidence. Also, the dependencies among these observations are not exploited in our framework.
As a result, the contribution of edge potentials in our CRF model is very limited due to the small number of feature functions. One can benefit from more effective features such as Ferns [74, 73] to define feature functions in the CRF model.

As for inference in conditional random fields, we are using ICM [9] due to its simplicity and fast training time, which takes less than 4 to 6 hours. However, it is known that ICM fails to capture long-range interactions between sites’ labels. We evaluated loopy belief propagation [64] in this work, and it did not converge to a good solution after a considerable amount of time. One may benefit from other inference methods to improve results for the CRF model.

Finally, in the method for text line detection, the global median height of text connected components is used inside the transition and emission probabilities of the HMM. One can reformulate these probabilities to use the local average height of text components instead of the global statistic.
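
To make the inference discussion above more concrete, the following is a minimal sketch of iterated conditional modes (ICM) on a 4-connected grid. It is not the model used in this thesis: the Potts pairwise term, the `unary` cost array, the `beta` weight and the function name `icm` are all assumptions chosen for illustration. The sketch shows why ICM is simple and fast, and also why it only sees immediate neighbours: each site greedily picks its best label given the current labels around it.

```python
import numpy as np

def icm(unary, beta=1.0, max_sweeps=20):
    """Iterated conditional modes on a 4-connected grid (illustrative sketch).

    unary: array of shape (H, W, K) with per-site label costs (lower is better).
    beta:  assumed Potts penalty paid for each neighbour with a different label.
    """
    h, w, k = unary.shape
    labels = unary.argmin(axis=2)  # initialise from node costs only
    for _ in range(max_sweeps):
        changed = False
        for y in range(h):
            for x in range(w):
                cost = unary[y, x].astype(float)  # astype returns a fresh copy
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        # Potts term: every label except the neighbour's pays beta
                        penalty = np.full(k, float(beta))
                        penalty[labels[ny, nx]] = 0.0
                        cost += penalty
                best = int(cost.argmin())
                if best != labels[y, x]:
                    labels[y, x] = best
                    changed = True
        if not changed:  # local minimum reached: no single-site change helps
            break
    return labels
```

Because every update conditions only on the four adjacent labels, information travels at most one site per sweep, which is consistent with the remark that ICM fails to capture long-range interactions between sites’ labels.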

