Segmentation of heterogeneous document images : an ... - Tel
\[
\frac{\partial l(\lambda)}{\partial \lambda_k}
= \sum_{s \in S} \left( f_k(y_s, x_s) - \frac{\sum_{y \in Y} f_k(y, x_s) \exp\left( \sum_{i=1}^{F} \lambda_i f_i(y, x_s) \right)}{Z(x_s, \lambda)} \right)
\]
\[
= \sum_{s \in S} \left( f_k(y_s, x_s) - \sum_{y \in Y} f_k(y, x_s) \, P(y \mid x_s) \right)
\]
\[
= \sum_{s \in S} \left( f_k(y_s, x_s) - E_{P(y \mid x_s)}\left[ f_k(y, x_s) \right] \right)
\]
tel-00912566, version 1 - 2 Dec 2013<br />
where Y = {text, non-text} and E_{P(·)}[·] is the expected value of the feature under the model's conditional probability distribution. At the maximum-likelihood solution this gradient equals zero, and therefore the expectation of each feature f_k with respect to the model distribution must equal the expected value of f_k with respect to the empirical distribution. However, calculating this expectation requires enumerating all the label assignments y. In linear-chain models, inference techniques based on variants of the forward-backward algorithm can compute this expectation efficiently. In two-dimensional CRFs, however, approximation techniques are needed to simplify the computation. One solution is to use the voted perceptron method.
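As an illustration, the gradient above can be computed exactly when the label set is small, since the expectation over Y = {text, non-text} is just a two-term softmax-weighted sum per site. The sketch below is hypothetical: the feature function and data are illustrative stand-ins, not the features used in this work.

```python
import math

LABELS = ["text", "non-text"]

def features(y, x):
    """Toy feature vector f(y, x); purely illustrative."""
    return [1.0 if y == "text" else 0.0,
            x if y == "text" else 0.0]

def gradient(lmbda, sites):
    """grad_k = sum_s ( f_k(y_s, x_s) - E_{P(y|x_s)}[f_k(y, x_s)] )."""
    grad = [0.0] * len(lmbda)
    for y_s, x_s in sites:
        # Unnormalised scores exp(sum_i lambda_i f_i(y, x_s)) for each label y.
        scores = {y: math.exp(sum(l * f for l, f in zip(lmbda, features(y, x_s))))
                  for y in LABELS}
        Z = sum(scores.values())  # partition function Z(x_s, lambda)
        for k in range(len(lmbda)):
            empirical = features(y_s, x_s)[k]
            # Model expectation: sum_y P(y|x_s) f_k(y, x_s).
            expected = sum(scores[y] / Z * features(y, x_s)[k] for y in LABELS)
            grad[k] += empirical - expected
    return grad

print(gradient([0.0, 0.0], [("text", 0.9), ("non-text", 0.1)]))
```

With all weights at zero the model assigns P(y|x_s) = 0.5 to each label, so the gradient is simply the empirical feature counts minus their uniform average. This exact enumeration is exactly what becomes intractable once Y grows to joint labelings of a 2D grid.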
4.6.1 Collins' voted perceptron method
Perceptrons [81] use an approximation of the gradient of the unregularized conditional log-likelihood. Perceptron-based training methods consider one misclassified instance at a time, along with its contribution to the gradient. The expectation of the features is further approximated by a point estimate of the feature-function vector at the best possible labeling. The approximation for the i-th instance and the k-th feature function can be written as:
\[
\nabla_k \, l(\lambda) \approx f_k(y_i, x_i) - f_k(\hat{y}_i, x_i)
\]
where
\[
\hat{y}_i = \arg\max_{y} \sum_{k} \lambda_k f_k(y, x_i)
\]
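The point estimate and the resulting update can be sketched as follows. This is a hypothetical illustration with the same toy two-label feature function as before, not the thesis implementation; `perceptron_pass` performs one pass of Collins-style updates.

```python
LABELS = ["text", "non-text"]

def features(y, x):
    """Toy feature vector f(y, x); purely illustrative."""
    return [1.0 if y == "text" else 0.0,
            x if y == "text" else 0.0]

def best_label(lmbda, x):
    """y_hat = arg max_y sum_k lambda_k f_k(y, x); ties go to the first label."""
    return max(LABELS,
               key=lambda y: sum(l * f for l, f in zip(lmbda, features(y, x))))

def perceptron_pass(lmbda, data, alpha=1.0):
    """One pass over the data: on each misclassified instance, move the
    weights toward the true labeling and away from the predicted one."""
    lmbda = list(lmbda)
    for y_i, x_i in data:
        y_hat = best_label(lmbda, x_i)
        if y_hat != y_i:  # update only on misclassification
            f_true = features(y_i, x_i)
            f_pred = features(y_hat, x_i)
            for k in range(len(lmbda)):
                lmbda[k] += alpha * (f_true[k] - f_pred[k])
    return lmbda
```

Note that only the single best labeling ŷ_i is needed, so the intractable sum over all labelings disappears; for a 2D CRF the arg max itself would still require approximate inference.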
Using this approximate gradient, the following first-order update rule can be used for maximization:
\[
\lambda_k^{t+1} = \lambda_k^{t} + \alpha \left( f_k(y_i, x_i) - f_k(\hat{y}_i, x_i) \right).
\]
where α is the learning rate. This update step is applied once for each misclassified instance x_i in the training set, and multiple passes are made over the training dataset. However, it has been noted that the final weights suffer from over-fitting. As a solution, Collins [26] suggests a voting scheme where, in a particular pass of the training data, all the updates are collected,