Segmentation of heterogeneous document images : an ... - Tel

Although text regions need not be segmented into paragraphs before their text lines are passed to optical character recognition modules, correct paragraphs are necessary for reading order detection. Thus, in this work we also group text lines into paragraphs after identifying each text region. It is worth noting that there can be more than one correct way to segment a document image into text regions, depending on the reading order in the ground truth, but there is only one solution for segmenting text regions into paragraphs.

tel-00912566, version 1 - 2 Dec 2013

Because of the important role page segmentation plays in document layout analysis, and its direct effect on the optical character recognition step, it has been explored in depth over the last four decades by the document imaging community, and many algorithms have been proposed in the literature. A comprehensive overview of these algorithms is provided by Nagy in [65] and Cattoni et al. in [22]. In this work we review many of these algorithms in the domains of text/graphics separation and text line and region detection. We then propose a new system for page segmentation that improves segmentation results in areas where current algorithms still have problems.

1.3 Overview of the approach

We view a document as a scene of connected components (CCs). Thus, the first step in our method is image binarization followed by extraction of all connected components. The goal of the system is to find the locations of all paragraphs inside the document image. To form paragraphs, we need to detect text lines correctly, and to do so we have to detect text regions by considering the geometric alignment of CCs. Text lines in multi-column documents should be kept separate from one another, and where side notes exist, we have to use the alignment of the CCs to separate side notes from the main text.
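The first step described above (binarization followed by connected component extraction) can be illustrated with a minimal sketch. This is not the implementation used in this work: the fixed global threshold, the 8-connectivity, the function names (`binarize`, `connected_components`, `bounding_box`) and the toy image are all assumptions made for illustration. Ink pixels are given the value 0 and background pixels the value 1.

```python
from collections import deque

TEXT, BACKGROUND = 0, 1  # assumed bi-level convention: ink is 0, background is 1


def binarize(gray, threshold=128):
    """Map gray levels to TEXT (dark) or BACKGROUND (light).

    A fixed global threshold is an illustrative assumption only.
    """
    return [[TEXT if v < threshold else BACKGROUND for v in row] for row in gray]


def connected_components(binary):
    """Return the ink components as lists of (row, col) pixels (8-connectivity assumed)."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != TEXT or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == TEXT and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, height, width) of one component."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys) - min(ys) + 1, max(xs) - min(xs) + 1


# Toy 5x6 gray-scale image: three dark blobs on a light background.
gray = [
    [250, 250, 250, 250, 250, 250],
    [250,  20,  30, 250, 250,  10],
    [250,  25, 250, 250, 250,  15],
    [250, 250, 250, 250, 250, 250],
    [250, 250, 250,  40, 250, 250],
]
binary = binarize(gray)
components = connected_components(binary)
boxes = [bounding_box(p) for p in components]
```

Each bounding box gives the height and width of a component, which are the kind of simple geometric measurements that later stages can use when reasoning about the alignment of CCs.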
All these operations should be carried out without reading the actual text or understanding its context.

After binarization, we have a set of connected components, each belonging to either a text or a non-text region. The next step is to classify each connected component correctly. Each CC has a set of intrinsic features (height, width, eccentricity, ...) and a set of extrinsic features (features from the surrounding area). Using these features, we train a set of weak classifiers with the help of boosting, which allows us to classify components as text or non-text. Chapter 3 describes this method in detail.

At this point, we want to segment the document image into regions of text that share some form of alignment. We consider both the textual and the non-textual CCs for this task. Note that some non-text CCs, such as tables and rule lines, are highly effective in separating regions, so we do not discard non-textual components. In chapter 4 we show how the method makes the best of both sets to separate text inside tables.

Having all text elements and their regions, detecting text lines is the next step. Chapter 2 thoroughly reviews the major methods in the literature for text line detection, and we conclude that the text line detection algorithm by Vassilis Papavassiliou [75] is the most effective among them. The original method overlooks line detection in side notes, but because it is the best among the methods reviewed and we have already separated problematic areas such as side notes, we adopt it in our system. Chapter 5 explains this method in detail.

The final stage of the system is to group text lines into paragraphs. In chapter 6 we propose a method based on a trainable binary tree model that maximizes the probability of preserving groups of lines using a goodness criterion for paragraphs.

A brief outline of these processes is given here.

• Image binarization is the process that converts a given input gray-scale image into a bi-level representation. In our case, pixels that belong to text characters are assigned a value of 0 and pixels of non-textual components and background have a value of 1.
• Connected components analysis is the process of extracting and labeling connected components from an image. In our case, all pixels of text and non-text elements that are connected and have the same value are extracted and assigned to a separate component.
• Noise removal tries to detect and remove noise pixels from the document image. At this stage we only remove components that contain fewer than a predefined number of pixels. The exact threshold can only be determined through trial and error: too large a value may remove dots, diacritics and punctuation marks, while too small a value may leave some dust and speckles in the scene.
• Text/Graphics separation is the process that classifies each component as part of text or graphics.
• Text region detection is the process that separates homogeneous regions of text that belong to separate columns. It is also responsible for separating side notes from the main text region.
• Text line detection is the process that finds text lines inside every text region. It is also responsible for splitting characters from two adjacent lines that touch each other and have incorrectly formed a single component.
• Paragraph detection is the process that groups text lines into paragraphs based on their indentation and geometry.

1.4 Contribution of this dissertation

The main contributions presented in this dissertation are:

1. A new hybrid method for text/graphics separation. Figure 1.2 shows the ability of our text/graphics separation method in segmenting a document