unified detection and recognition for reading text in scene images

2.2.1 Model Likelihood

Substituting the parametric form (2.4) for the model likelihood in (2.14), we have

\[
\mathcal{L}(\theta; D) \equiv \sum_k \Biggl( \sum_{C \in \mathcal{C}^{(k)}} U_C\bigl(y_C^{(k)}, x^{(k)}; \theta_C\bigr) - \log Z\bigl(x^{(k)}\bigr) \Biggr) \tag{2.15}
\]

The set of functions $\{U_C\}_{C \in \mathcal{C}^{(k)}}$ depends on the particular unknowns $y^{(k)}$, and thus $\mathcal{C}$ is indexed by the particular example $k$.

For certain forms of compatibility functions, it can be shown that the objective function $\mathcal{L}(\theta; D)$ is convex, which means that global optima can be found by gradient ascent or other convex optimization techniques [13]. In particular, if the compatibility functions are linear in the parameters, then the log likelihood (2.14) is convex. Throughout this thesis, we use linear compatibility functions, which have the general form

\[
U_C(y_C, x_C; \theta_C) = \theta_C(y_C) \cdot F_C(x), \tag{2.16}
\]

where $F_C : \Omega \to \mathbb{R}^{d(C)}$ is a vector of features of the observation, the dimensionality of which depends on the particular set $C$. The parameter vector $\theta_C \in \mathbb{R}^{|\mathcal{Y}_C| \times d(C)}$ is conveniently thought of as a function $\theta_C : \mathcal{Y}_C \to \mathbb{R}^{d(C)}$ that takes an assignment $y_C$ and returns an associated set of weights for the features $F_C$.
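As a concrete illustration of the linear form (2.16), the following minimal Python sketch treats $\theta_C$ as a lookup table from assignments $y_C$ to weight vectors; the alphabet, feature dimension, and random values are illustrative assumptions, not details from the thesis.

```python
import numpy as np

def compatibility(theta_C, y_C, features):
    """Linear compatibility U_C(y_C, x_C; theta_C) = theta_C(y_C) . F_C(x)."""
    return float(np.dot(theta_C[y_C], features))

# Illustrative unary clique: 3 labels, d(C) = 4 features (all values assumed).
rng = np.random.default_rng(0)
theta_C = {y: rng.standard_normal(4) for y in "abc"}  # theta_C : Y_C -> R^d(C)
F_C = rng.standard_normal(4)                          # F_C(x) in R^d(C)

u = compatibility(theta_C, "a", F_C)
```

Viewing $\theta_C$ as a dictionary keyed by assignments mirrors the "function $\theta_C : \mathcal{Y}_C \to \mathbb{R}^{d(C)}$" reading above.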

Taking the gradient of the objective (2.15) with respect to the parameters yields

\[
\nabla_\theta \mathcal{L}(\theta; D) = \sum_k \sum_{C \in \mathcal{C}^{(k)}} \Bigl( \nabla_{\theta_C} U_C\bigl(y_C^{(k)}, x^{(k)}; \theta_C\bigr) - \mathrm{E}_C\bigl[ \nabla_{\theta_C} U_C\bigl(y_C, x^{(k)}; \theta_C\bigr) \bigm| x^{(k)}; \theta_C \bigr] \Bigr) \tag{2.17}
\]
\[
= \sum_k \sum_{C \in \mathcal{C}^{(k)}} \Bigl( F_C(x) - \mathrm{E}_C\bigl[ F_C(x) \bigm| x^{(k)}; \theta_C \bigr] \Bigr), \tag{2.18}
\]

where $\mathrm{E}_C$ indicates an expectation with respect to the marginal probability distribution $p(y_C \mid x, \theta, I)$. Equation (2.17) is the gradient for general compatibility functions, while (2.18) is for linear compatibilities (2.16).
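To make (2.18) concrete, here is a hedged sketch of the simplest possible case: one training example and a single clique over one variable, so the marginal is the full model distribution and the expectation is a sum over three labels. The label set, feature dimension, and random parameters are illustrative assumptions. The gradient component for each label is the indicator of the observed label times $F_C(x)$, minus the expected term $p(y \mid x)\,F_C(x)$; since the marginal sums to one, the components sum to the zero vector.

```python
import numpy as np

labels = ["a", "b", "c"]                              # illustrative label set
rng = np.random.default_rng(1)
theta = {y: rng.standard_normal(4) for y in labels}   # theta_C(y) in R^4
F = rng.standard_normal(4)                            # feature vector F_C(x)
y_obs = "b"                                           # observed label y^(k)

# Model marginal p(y | x) proportional to exp(theta(y) . F); the normalizer
# Z(x) is computed here by direct summation over the three labels.
scores = np.array([theta[y] @ F for y in labels])
p = np.exp(scores - scores.max())
p /= p.sum()

# Per-label gradient of the log-likelihood, specializing (2.18): indicator of
# the observed label times F, minus the expectation term p(y | x) * F.
grad = {y: (float(y == y_obs) - p[i]) * F for i, y in enumerate(labels)}

# Because the marginal sums to one, the components sum to the zero vector.
total = sum(grad.values())
```

The "observed minus expected features" structure is what makes gradient ascent on (2.15) push probability mass toward the labeled assignments.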

To calculate the log likelihood and its gradient, and thus find the optimal model $\theta$, we will need to be able to calculate $\log Z(x)$, the so-called log partition function, and the marginal probabilities of each $y_C$. In general, these both involve combinatorial sums, so approximations must be made. Most of these are described in Section 2.3.2, but we describe two here that are more closely related to the log-likelihood and the objective function.
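The combinatorial nature of $\log Z(x)$ can be seen in a toy chain model. The sketch below (with illustrative sizes and random scores, not values from the thesis) enumerates all $m^n$ assignments, and also computes the same value with the standard forward recursion, which is exact for chains and linear in $n$; this is an illustration only, not the approximation machinery of Section 2.3.2.

```python
import itertools
import math
import numpy as np

labels = "ab"                     # illustrative alphabet, m = 2
n = 3                             # chain length; enumeration costs m**n terms
rng = np.random.default_rng(2)
unary = rng.standard_normal((n, len(labels)))       # per-position label scores
pairwise = rng.standard_normal((len(labels),) * 2)  # bigram compatibilities

def score(y):
    """Total compatibility of a full assignment y (a length-n string)."""
    s = sum(unary[i, labels.index(c)] for i, c in enumerate(y))
    s += sum(pairwise[labels.index(a), labels.index(b)]
             for a, b in zip(y, y[1:]))
    return s

# Brute force: sum exp(score) over all 2**3 = 8 assignments.
log_Z = math.log(sum(math.exp(score(y))
                     for y in map("".join, itertools.product(labels, repeat=n))))

# Forward recursion: the same log Z in O(n * m^2) time, exact for chains.
alpha = np.exp(unary[0])
for i in range(1, n):
    alpha = np.exp(unary[i]) * (alpha @ np.exp(pairwise))
log_Z_dp = math.log(alpha.sum())
```

For graphs with cycles no such exact linear-time recursion exists in general, which is why the approximations of Section 2.3.2 are needed.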

2.2.1.1 Parameter Decoupling

One simple approximation that may be made is to decouple the parameters in $\theta$ during training. For instance, in the example described at the end of §2.1.2 and shown in Figure 2.1, there are two types of compatibility functions, one for recognizing characters based on their appearance and another for weighting bigrams. If the parameter vector is decomposed into the parameters for the recognition and bigram
