
\[
\begin{aligned}
\frac{\partial l(\lambda)}{\partial \lambda_k}
&= \sum_{s \in S} \left( f_k(y_s, x_s) - \sum_{y \in Y} \frac{f_k(y, x_s)\, \exp\left( \sum_{i=1}^{F} \lambda_i f_i(y, x_s) \right)}{Z(x_s, \lambda)} \right) \\
&= \sum_{s \in S} \left( f_k(y_s, x_s) - \sum_{y \in Y} f_k(y, x_s)\, P(y \mid x_s) \right) \\
&= \sum_{s \in S} \left( f_k(y_s, x_s) - E_{P(y \mid x_s)}\!\left[ f_k(y, x_s) \right] \right)
\end{aligned}
\]


where $Y = \{\text{text}, \text{non-text}\}$ and $E_{P(\cdot)}[\cdot]$ is the expected value of the model under the conditional probability distribution. At the maximum likelihood solution this gradient equals zero, and therefore the expectation of the feature $f_k$ with respect to the model distribution must be equal to the expected value of $f_k$ with respect to the empirical distribution. However, calculating the expectation requires the enumeration of all the $y$ labels. In linear-chain models, inference techniques based on a variation of the forward-backward algorithm can be performed to compute this expectation efficiently. In two-dimensional CRFs, however, approximation techniques are needed to simplify the computations. One solution is to use a voted perceptron method.

4.6.1 Collins' voted perceptron method

Perceptrons [81] use an approximation of the gradient of the unregularized conditional log-likelihood. Perceptron-based training methods consider one misclassified instance at a time, along with its contribution to the gradient. The expectation of the features is further approximated by a point estimate of the feature function vector at the best possible labeling. The approximation for the $i$th instance and the $k$th feature function can be written as:

\[
\nabla_k\, l(\lambda) \approx f_k(y_i, x_i) - f_k(\hat{y}_i, x_i)
\]

where

\[
\hat{y}_i = \arg\max_{y} \sum_{k} \lambda_k f_k(y, x_i)
\]
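With only two labels in $Y$, this maximization reduces to comparing two scores. A minimal sketch, assuming the same hypothetical feature functions `f` and weights `lam` as above:

```python
def best_labeling(f, lam, x_i):
    """Point estimate: argmax over y of sum_k lambda_k * f_k(y, x_i)."""
    Y = ["text", "non-text"]
    return max(Y, key=lambda y: sum(l * fk(y, x_i) for l, fk in zip(lam, f)))
```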

Using this approximate gradient, the following first-order update rule can be used for maximization:

\[
\lambda_k^{t+1} = \lambda_k^{t} + \alpha \left( f_k(y_i, x_i) - f_k(\hat{y}_i, x_i) \right).
\]

where $\alpha$ is the learning rate. This update step is applied once for each misclassified instance $x_i$ in the training set, and multiple passes are made over the training dataset. However, it has been noted that the final weights obtained this way suffer from over-fitting. As a solution, Collins [26] suggests a voting scheme where, in a particular pass over the training data, all the updates are collected and the intermediate weight vectors are retained, so that the final model combines them rather than relying on the last weight vector alone.
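A minimal sketch of this training loop, reusing the `best_labeling` helper above; taking a per-example weight snapshot and averaging the snapshots is one common realisation of Collins' scheme (the original formulation also allows explicit voting at prediction time), and all names here are hypothetical:

```python
def voted_perceptron_train(f, sites, labels, passes=5, alpha=1.0):
    """Perceptron updates with weight snapshots, averaged at the end."""
    lam = [0.0] * len(f)
    snapshots = []
    for _ in range(passes):
        for x_i, y_i in zip(sites, labels):
            y_hat = best_labeling(f, lam, x_i)
            if y_hat != y_i:  # update only on misclassified instances
                for k, fk in enumerate(f):
                    lam[k] += alpha * (fk(y_i, x_i) - fk(y_hat, x_i))
            snapshots.append(list(lam))  # one snapshot per example seen
    # Averaging the snapshots approximates the vote of all intermediate models.
    return [sum(ws) / len(ws) for ws in zip(*snapshots)]
```

Because a weight vector that survives many examples unchanged appears in many snapshots, the average implicitly weights each intermediate model by how long it remained correct.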

