Segmentation of heterogeneous document images : an ... - Tel


tel-00912566, version 1 - 2 Dec 2013

• Maximum right indentation of the text lines in the middle of the paragraph.

• Minimum right indentation of the text lines in the middle of the paragraph.

• Average vertical distance between text lines.

• Average height of text lines.

• Standard deviation of height for text lines.

• Average width of text lines.

• Standard deviation of width for text lines.

• Average size of connected components.

• Standard deviation of size of connected components.

• Number of connected components.

• Paragraph's width.
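For concreteness, several of the line-based statistics above could be computed as follows. This is only a sketch: the `(x, y, width, height)` bounding-box representation of a text line is our assumption, and the vertical distance is approximated by the difference of top coordinates.

```python
import statistics

def paragraph_features(lines):
    """Compute a few of the paragraph features listed above (sketch).

    `lines` is a list of (x, y, width, height) text-line bounding boxes,
    ordered top to bottom; this tuple layout is a hypothetical
    representation, not the one used in the thesis.
    """
    heights = [h for (_, _, _, h) in lines]
    widths = [w for (_, _, w, _) in lines]
    # vertical distance approximated by the gap between top coordinates
    gaps = [lines[i + 1][1] - lines[i][1] for i in range(len(lines) - 1)]
    return {
        "avg_line_height": statistics.mean(heights),
        "std_line_height": statistics.pstdev(heights),
        "avg_line_width": statistics.mean(widths),
        "std_line_width": statistics.pstdev(widths),
        "avg_line_distance": statistics.mean(gaps) if gaps else 0.0,
        "paragraph_width": max(x + w for (x, _, w, _) in lines)
                           - min(x for (x, _, _, _) in lines),
    }
```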

6.4 State decoding<br />

State decoding is a procedure that determines the state of each node given the weights. Two possible states are defined for each node: "Preserve" or "Remove". If a node takes the "Preserve" state, that node and all its children are grouped as one paragraph. On the other hand, a "Remove" state indicates that the two children of that node should not be grouped as one paragraph. Leaves of the tree are single text lines and are always preserved. The procedure is described in algorithm 1. W is the weight vector that should be trained.
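As a rough sketch of this decoding step (algorithm 1 itself is not reproduced here), one possible implementation follows. The scoring rule, taking the sign of the dot product between the node's feature vector and W, is our assumption; the node structure and feature storage are likewise simplified.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    features: List[float]          # feature vector of this subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    state: str = "Preserve"        # "Preserve" or "Remove"

def decode(node: Node, w: List[float]) -> None:
    """Assign a state to every node of the binary tree given weights w.

    Sketch only: scoring by the sign of the dot product with the
    trained weight vector W is an assumption.
    """
    if node.left is None and node.right is None:
        node.state = "Preserve"    # leaves are single text lines: always kept
        return
    score = sum(wi * fi for wi, fi in zip(w, node.features))
    node.state = "Preserve" if score >= 0 else "Remove"
    decode(node.left, w)
    decode(node.right, w)
```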

6.5 Training method<br />

The training method for our paragraph model is very similar to Collins's voted perceptron with regard to computing the direction of the gradients. The major difference is that the values of the features are not available beforehand and must be computed online during training. For each document image in our training set, we start from the leaves of the tree, where the feature values are already known, and move upward toward the root. The implementation can be done with dynamic programming: we call the procedure on the root, and the root asks its children for their feature values. The pseudo-code for computing the ∆ values is provided in algorithm 2.

In each iteration we initialize ∆ to zero and call the procedure on the root node of the binary tree of the first document. We then continue calling the procedure for all other documents in the training set. After one pass over all documents, ∆ is divided by the total number of processed nodes in the whole training set. Each element of the ∆ vector is then added to the current weight vector, the next iteration starts over with the new weights, and we continue until convergence.
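The iteration described above can be sketched as follows, under several assumptions: a hypothetical `label` field holds the ground-truth state of each internal node, the update is a simple perceptron-style correction applied when the predicted state disagrees with that label, and feature vectors are stored on the nodes rather than computed online as in the thesis.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    features: List[float]
    label: str                        # ground-truth state (hypothetical field)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def accumulate_delta(node, w, delta, count):
    """Recursive procedure (sketch of algorithm 2): visit children first,
    then accumulate a perceptron-style update for each internal node."""
    if node.left is None:             # leaf: always "Preserve", no update
        return
    accumulate_delta(node.left, w, delta, count)
    accumulate_delta(node.right, w, delta, count)
    score = sum(wi * fi for wi, fi in zip(w, node.features))
    predicted = "Preserve" if score >= 0 else "Remove"
    if predicted != node.label:       # push the score toward the true state
        sign = 1.0 if node.label == "Preserve" else -1.0
        for i, fi in enumerate(node.features):
            delta[i] += sign * fi
    count[0] += 1

def train(roots, dim, iterations=10):
    w = [0.0] * dim
    for _ in range(iterations):
        delta, count = [0.0] * dim, [0]
        for root in roots:            # one pass over the whole training set
            accumulate_delta(root, w, delta, count)
        if count[0]:                  # average Delta over all processed nodes
            w = [wi + di / count[0] for wi, di in zip(w, delta)]
    return w
```

In this sketch, convergence checking is replaced by a fixed iteration count, and the voting step of Collins's voted perceptron is omitted for brevity.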

