
Unified Detection and Recognition for Reading Text in Scene Images


with the detection stage altogether. One important facet being overlooked by all of these methods is the opportunity to improve testing performance by unifying the training of detection and recognition stages.

Later in the thesis we will examine how to tie detection and recognition together during learning as well as the operation of the system. This will rely on sharing intermediate levels of the classifier—i.e., features. In the remainder of this section, we review some related feature selection strategies as they apply to computer vision tasks and some recent work related to sharing features for different tasks.

1.4.1 Feature Selection

Several general frameworks exist for selecting features. The two most basic are greedy forward and backward schemes. Forward schemes incrementally add features to a model based on some criterion of feature utility. Examples include the work of Viola and Jones [121], who use single-feature linear classifiers as weak learners in a boosting framework, adding the features with the lowest weighted error to the ensemble. A similar forward method by Berger et al. [9] involves adding only those candidates that most increase the likelihood of a probabilistic model. Backward schemes, by contrast, selectively prune features from a model. The ℓ1 or Laplacian prior [128] for neural networks, maximum entropy models, logistic regression, and so on belongs to this category. In this scheme, features are effectively eliminated from a model during training by driving their corresponding weights to zero. Many other variants for selecting a subset of features are possible; see Blum and Langley [12] for a more thorough review.
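The greedy forward scheme can be sketched generically: at each step, score every remaining candidate feature together with the already-selected set, and add the one that helps most. The nearest-centroid utility criterion below is an illustrative stand-in of our own choosing, not the boosting or likelihood criteria of the cited works:

```python
def forward_select(X, y, n_features, utility):
    """Greedy forward selection: at each step, add the candidate feature
    whose inclusion most improves the utility criterion."""
    selected, remaining = [], list(range(len(X[0])))
    for _ in range(n_features):
        best = max(remaining, key=lambda j: utility(selected + [j], X, y))
        selected.append(best)
        remaining.remove(best)
    return selected

def centroid_accuracy(cols, X, y):
    """Stand-in utility: training accuracy of a nearest-class-centroid
    classifier restricted to the chosen feature columns."""
    classes = sorted(set(y))
    cent = {c: [sum(x[j] for x, t in zip(X, y) if t == c) / y.count(c)
                for j in cols] for c in classes}
    def predict(x):
        return min(classes, key=lambda c: sum((x[j] - m) ** 2
                                              for j, m in zip(cols, cent[c])))
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

# Toy data: feature 0 is uninformative noise, feature 1 separates the classes.
X = [[0.5, 0.0], [0.4, 0.1], [0.5, 1.0], [0.4, 0.9]]
y = [0, 0, 1, 1]
print(forward_select(X, y, 1, centroid_accuracy))  # → [1]
```

Swapping in a weighted-error criterion over single-feature classifiers recovers the flavor of the boosting-based selection described above.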

Feature types and selection strategies for visual tasks have varied widely. The Viola and Jones object detector [121] employs outputs of simple image difference features, which are similar to wavelets. There are many possible filters and only a few are discriminative, so a selection process is required primarily for computational efficiency. Other methods use image fragments or patches as feature descriptors. These patches may be taken directly from the image [119, 1] or an intermediate wavelet-based representation [103, 83].
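These rectangular difference features can be evaluated in constant time from a summed-area (integral) image, as in Viola and Jones [121]; a minimal sketch of the two-rectangle variant follows (the toy image is illustrative, not from the cited work):

```python
def integral_image(img):
    """Summed-area table: ii[r][c] = sum of img over rows < r, cols < c."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum over any rectangle in O(1) via four integral-image lookups."""
    return (ii[top + height][left + width] - ii[top][left + width]
            - ii[top + height][left] + ii[top][left])

def two_rect_feature(ii, top, left, height, width):
    """Two-rectangle difference feature: left half minus right half."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))

# Toy 2x4 image: bright left half, dark right half.
img = [[9, 9, 1, 1],
       [9, 9, 1, 1]]
ii = integral_image(img)
print(two_rect_feature(ii, 0, 0, 2, 4))  # → 32
```

The constant-time evaluation is what makes it feasible to score the enormous pool of candidate filters during selection.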

1.4.2 Feature Sharing

One of the most prominent examples of shared features in the general machine learning literature is Caruana’s work on multitask learning [19]. The central idea is to use a shared representation while learning tasks in parallel.

In particular, the high-dimensional patch-based image features mentioned above are often densely sampled and vector quantized to create a discrete codebook representation.
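A minimal sketch of this dense-sampling and vector-quantization pipeline, assuming an illustrative 2×2 patch size, a two-word codebook, and first-k initialization for k-means (none of these choices are taken from the cited works):

```python
def kmeans(vectors, k, iters=10):
    """Plain k-means; for simplicity, initialize from the first k vectors."""
    codebook = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        # Assignment step: group vectors by nearest code word.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[quantize(v, codebook)].append(v)
        # Update step: move each code word to its cluster mean.
        codebook = [[sum(col) / len(c) for col in zip(*c)] if c else codebook[i]
                    for i, c in enumerate(clusters)]
    return codebook

def quantize(v, codebook):
    """Map a patch vector to the index of its nearest code word."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(v, codebook[i])))

def dense_patches(img, size=2):
    """Densely sample all size x size patches, flattened to vectors."""
    h, w = len(img), len(img[0])
    return [[img[r + dr][c + dc] for dr in range(size) for dc in range(size)]
            for r in range(h - size + 1) for c in range(w - size + 1)]

# Toy image with a flat dark region and a flat bright region.
img = [[0, 0, 8, 8],
       [0, 0, 8, 8],
       [0, 0, 8, 8]]
patches = dense_patches(img)
codebook = kmeans(patches, k=2)
codes = [quantize(p, codebook) for p in patches]
print(codes)  # → [0, 1, 1, 0, 1, 1]
```

The resulting discrete codes are what the methods below then merge, re-cluster, or re-learn to improve discrimination.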

Winn et al. [130] iteratively merge code words that do not contribute to discrimination overall, while Jurie et al. [56] create a clustering algorithm designed to cover the space with code words for better discrimination. Bernstein and Amit [10] cluster patches in a generative model for characters. Alternatively, LeCun et al. [65] learn (rather than select for) a discriminative intermediate feature representation. However, all of these methods are focused on one type of recognition, either catego-
