Segmentation of heterogeneous document images : an ... - Tel
• Maximum right indentation of the text lines in the middle of the paragraph.
• Minimum right indentation of the text lines in the middle of the paragraph.
• Average vertical distance between text lines.
• Average height of text lines.
• Standard deviation of height for text lines.
• Average width of text lines.
• Standard deviation of width for text lines.
• Average size of connected components.
• Standard deviation of size of connected components.
• Number of connected components.
• Paragraph's width.
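The feature list above can be sketched as a small function. The bounding-box format `(x, y, w, h)` for text lines and the separate list of connected-component sizes are assumptions for illustration, not the thesis's actual data structures:

```python
import statistics

def paragraph_features(lines, cc_sizes):
    """Feature vector for a candidate paragraph.

    `lines` is a list of text-line bounding boxes (x, y, w, h) and
    `cc_sizes` a list of connected-component sizes; both formats are
    assumptions for illustration.
    """
    rights = [x + w for x, y, w, h in lines]
    widths = [w for x, y, w, h in lines]
    heights = [h for x, y, w, h in lines]
    left = min(x for x, y, w, h in lines)
    right = max(rights)
    # Right indentation of the lines in the middle of the paragraph,
    # measured from the paragraph's right edge.
    middle = rights[1:-1] if len(rights) > 2 else rights
    indents = [right - r for r in middle]
    tops = sorted(y for x, y, w, h in lines)
    gaps = [b - a for a, b in zip(tops, tops[1:])] or [0.0]
    return [
        max(indents),                  # maximum right indentation
        min(indents),                  # minimum right indentation
        statistics.mean(gaps),         # average vertical distance
        statistics.mean(heights),      # average line height
        statistics.pstdev(heights),    # std. dev. of line height
        statistics.mean(widths),       # average line width
        statistics.pstdev(widths),     # std. dev. of line width
        statistics.mean(cc_sizes),     # average connected-component size
        statistics.pstdev(cc_sizes),   # std. dev. of component size
        float(len(cc_sizes)),          # number of connected components
        right - left,                  # paragraph width
    ]
```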
6.4 State decoding<br />
State decoding is a procedure that determines the state of each node given the weight vector. Two possible states are defined for each node: "Preserve" or "Remove". If a node takes the "Preserve" state, that node and all its children should be grouped as one paragraph. On the other hand, a "Remove" state indicates that the two children of that node should not be grouped as one paragraph. Leaves of the tree are single text lines and should always be preserved. The procedure is described in algorithm 1. W is the weight vector that should be trained.
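A minimal sketch of the decoding step, assuming each internal node is scored as the dot product of W with its feature vector and a positive score means "Preserve" (the `Node` structure and the sign convention are assumptions; algorithm 1 may differ in detail):

```python
PRESERVE, REMOVE = "preserve", "remove"

class Node:
    """A node of the binary merge tree; leaves are single text lines."""
    def __init__(self, features=None, left=None, right=None):
        self.features = features   # feature vector of the merged region
        self.left = left
        self.right = right
        self.state = None

def decode(node, W):
    # Leaves are single text lines and are always preserved.
    if node.left is None and node.right is None:
        node.state = PRESERVE
        return
    # Score the node with the trained weight vector; a positive score
    # is taken to mean the two children are grouped ("Preserve").
    score = sum(w * f for w, f in zip(W, node.features))
    node.state = PRESERVE if score > 0 else REMOVE
    # Once a node is preserved its whole subtree forms one paragraph,
    # but we still decode descendants so every node carries a state.
    decode(node.left, W)
    decode(node.right, W)
```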
6.5 Training method<br />
The training method for our paragraph model is very similar to Collins's voted perceptron with regard to computing the direction of the gradients. The major difference is that the values of the features are not available beforehand and must be computed online during training. For each document image in our trainset, we start from the leaves of the tree, where the values of the features are already known, and move upward toward the root. The implementation can be done with dynamic programming: we point to the root, and the root asks its children for their feature values. The pseudo code for computing the ∆ values is provided in algorithm 2. In each iteration we initialize ∆ to zero and call the procedure on the root node of the binary tree of the first document. Then we continue calling the procedure for all other documents in our training dataset. After one pass over all documents, ∆ is divided by the total number of processed nodes in the whole trainset. Each element of the ∆ vector is then added to the current weight vector, and the next iteration starts over with the new weights; we continue until convergence.
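The iteration described above can be sketched as follows. The `Node` structure, the `gold_preserve` labels, and the mistake-driven update rule are assumptions modeled on the voted-perceptron analogy, not algorithm 2 verbatim:

```python
class Node:
    """Binary-tree node with a gold "Preserve" label for training."""
    def __init__(self, features=None, left=None, right=None,
                 gold_preserve=False):
        self.features = features
        self.left = left
        self.right = right
        self.gold_preserve = gold_preserve

def accumulate(node, W, delta):
    """Bottom-up pass over one tree: children are visited first (feature
    values are known at the leaves), then the node itself is scored.
    Returns the number of scored nodes so delta can be averaged."""
    if node is None or node.features is None:
        return 0   # leaves (single text lines) contribute no update
    n = accumulate(node.left, W, delta) + accumulate(node.right, W, delta)
    score = sum(w * f for w, f in zip(W, node.features))
    predicted = score > 0
    if predicted != node.gold_preserve:        # mistake-driven update
        sign = 1.0 if node.gold_preserve else -1.0
        for i, f in enumerate(node.features):
            delta[i] += sign * f
    return n + 1

def train(roots, init_W, iterations=10):
    """One perceptron-style update per pass over all documents."""
    W = list(init_W)
    for _ in range(iterations):
        delta = [0.0] * len(W)
        count = 0
        for root in roots:                     # all documents in the trainset
            count += accumulate(root, W, delta)
        # Divide delta by the total number of processed nodes, then add
        # it to the current weights; repeat until convergence.
        W = [w + d / max(count, 1) for w, d in zip(W, delta)]
    return W
```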