11.04.2014 Views

unified detection and recognition for reading text in scene images


ectification, as do Gao et al. [34]. Still, both of these methods rely on hand-coded rules.

In contrast, Zhang and Chang [137] use triplet cliques on image segments to flexibly handle the linearity of text in a probabilistic model with learned parameters. Similarly, Shen and Coughlan [104] classify vertical and horizontal stroke features with learnable compatibilities in a probabilistic model.

Probabilistic systems for layout analysis are more flexible and require less manual parameter tuning than heuristic methods. As such, they should be able to directly handle text that is not strictly horizontal or vertical, if they are trained on such data. What these two approaches lack is feedback between the feature computation (i.e., segmentation [137], stroke detection [104]) and the classification of the features as text or non-text. Thus, top-down interpretation cannot influence the lower-level features. To address this issue, we propose joint feature selection for detection and recognition in Chapter 6.

1.2.5 Summary<br />

Many of the approaches to text detection and segmentation above may be classified as texture-based, since they use various features and statistics of intensities and gradients [53, 132, 24, 67, 135, 113]. Some make use of edge features or morphological operator outputs [36, 34, 35]. Others combine some of these features [131], e.g., at different stages in a pipelined classifier [21] or to handle different sizes of text [29]. By contrast, some other methods are region-based [137] or use higher-level features as input to a classifier [104].

With a few exceptions, most of these systems are divorced from the recognition process. As such, many do only a rough initial or windowed detection [53, 24, 21, 135], while others may track text regions across video frames [67, 131]. Some use heuristics for combining text regions and/or pruning non-text regions [132, 36, 34, 35, 113], while very few have learned or learnable layout analyses [137, 104].

Prior to recognition, many of the systems binarize text regions [132, 36, 131, 21, 137, 113], while one isolates characters (via binarization) but subsequently classifies features of the grayscale image [35].

By learning the spatial properties of text to improve detection and bridging the gap between detection and recognition during training, the methods we present in this thesis overcome many of the limitations present in earlier approaches to text detection.

Next we discuss some of the relevant approaches to recognizing text in document processing and more recent camera- and video-based systems.

1.3 Text Recognition<br />

A great deal of prior knowledge can be brought to bear on the text recognition problem, from the essentials like the general appearance of characters, to more complex factors such as language and lexicon awareness. Many of these have been employed in one form or another, but typically they are not fully integrated. In this
