Segmentation of heterogeneous document images : an ... - Tel


leaves of the tree which contain only one text line.

For each pair of lines $p$ and $q$, let $a = (x^L_1, y^L_1)$ and $b = (x^R_1, y^R_1)$ be the coordinates of the axis points of the line at the top, and let $e = (x^L_2, y^L_2)$ and $f = (x^R_2, y^R_2)$ be the coordinates of the axis points of the second line. Then $LL = d(a, e)$ is the Euclidean distance between the two points on the left and $RR = d(b, f)$ is the Euclidean distance between the two points on the right.

The criterion for selecting a link at each node of the BPT is:

$$W_{bpt}(p, q) = (1 + \min\{LL, RR\})\,(1 + W_{mst}(p, q)).$$

Starting from the root node, at each level the algorithm selects the link with maximum $W_{bpt}$ and removes it. The two resulting sets of lines are passed to the child nodes of the node being processed.
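The top-down construction described above can be sketched as follows. This is a minimal illustration, not the thesis implementation. It assumes the links at each node form a tree over the lines of that node, and that the weight $W_{mst}$ of every link has already been computed and is supplied as input. The names `TextLine`, `Node`, `w_bpt`, `component` and `build_bpt` are hypothetical.

```python
import math
from dataclasses import dataclass, field

Point = tuple[float, float]

@dataclass
class TextLine:
    left: Point   # left axis point (x^L, y^L)
    right: Point  # right axis point (x^R, y^R)

@dataclass
class Node:
    lines: list[int]                      # indices of the text lines in this node
    children: list["Node"] = field(default_factory=list)

def w_bpt(p: TextLine, q: TextLine, w_mst: float) -> float:
    """W_bpt(p, q) = (1 + min{LL, RR}) * (1 + W_mst(p, q))."""
    ll = math.dist(p.left, q.left)    # LL: distance between the left axis points
    rr = math.dist(p.right, q.right)  # RR: distance between the right axis points
    return (1 + min(ll, rr)) * (1 + w_mst)

def component(start: int, edges: list[tuple[int, int, float]]) -> set[int]:
    """Line indices reachable from `start` through the given links."""
    adj: dict[int, list[int]] = {}
    for p, q, _ in edges:
        adj.setdefault(p, []).append(q)
        adj.setdefault(q, []).append(p)
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def build_bpt(lines: list[TextLine], members: set[int],
              edges: list[tuple[int, int, float]]) -> Node:
    """Recursively remove the link with maximum W_bpt and split the node in two.

    `edges` holds (p, q, w_mst) triples; w_mst is assumed to be given.
    """
    node = Node(sorted(members))
    if not edges:                      # a single text line: leaf of the tree
        return node
    i = max(range(len(edges)),
            key=lambda k: w_bpt(lines[edges[k][0]], lines[edges[k][1]], edges[k][2]))
    cut, rest = edges[i], edges[:i] + edges[i + 1:]
    side_a = component(cut[0], rest)   # lines still connected to one end of the cut
    side_b = set(members) - side_a
    for side in (side_a, side_b):
        sub = [e for e in rest if e[0] in side]
        node.children.append(build_bpt(lines, side, sub))
    return node
```

For example, with four horizontal lines at heights 0, 10, 40 and 50 that are linked in a chain, the link across the large gap has the highest weight, so the root is split into the two line pairs first.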

tel-00912566, version 1 - 2 Dec 2013

6.3 Paragraph features<br />

Now that the BPT is generated and each node of the BPT contains a potential paragraph, a set of features should be extracted from each paragraph. Features are as follows:

• Left indentation of the first line. All indentations are computed relative to the bounding box of the paragraph in question.

• Right indentation of the first line.

• Indentation ratio of the first line, which is the minimum of the left and right indentations divided by the maximum of the two.

• Difference between left and right indentations of the first line.

• Left indentation of the last line.

• Right indentation of the last line.

• Indentation ratio of the last line.

• Difference between left and right indentations of the last line.

• Average value of left indentations of the text lines in the paragraph excluding the first and the last line.

• Average value of right indentations of the text lines in the paragraph excluding the first and the last line.

• Average indentation ratio, which is computed based on the average left and right indentations of the text lines in the middle of the paragraph.

• Maximum left indentation of the text lines in the middle of the paragraph.

• Minimum left indentation of the text lines in the middle of the paragraph.
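As an illustration, the features listed so far could be computed roughly as below. This is a sketch under several assumptions that the text does not state: each text line is reduced to the horizontal extent of its bounding box, the difference is taken as left minus right, a ratio with both indentations equal to zero is set to 1.0, and statistics over the middle lines are 0.0 when the paragraph has fewer than three lines. The names `LineBox`, `ratio` and `paragraph_features` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LineBox:
    x_left: float   # left edge of the text line
    x_right: float  # right edge of the text line

def ratio(left: float, right: float) -> float:
    """Minimum of the two indentations divided by the maximum.

    Assumption: returns 1.0 when both indentations are zero.
    """
    hi = max(left, right)
    return min(left, right) / hi if hi > 0 else 1.0

def paragraph_features(lines: list[LineBox]) -> dict[str, float]:
    """Indentation features of a paragraph; `lines` is ordered top to bottom."""
    # Indentations are measured against the bounding box of the paragraph.
    box_left = min(l.x_left for l in lines)
    box_right = max(l.x_right for l in lines)
    left = [l.x_left - box_left for l in lines]
    right = [box_right - l.x_right for l in lines]

    # Middle lines: every line except the first and the last.
    mid_l, mid_r = left[1:-1], right[1:-1]
    avg_l = sum(mid_l) / len(mid_l) if mid_l else 0.0  # assumption: 0.0 if empty
    avg_r = sum(mid_r) / len(mid_r) if mid_r else 0.0

    return {
        "first_left": left[0],
        "first_right": right[0],
        "first_ratio": ratio(left[0], right[0]),
        "first_diff": left[0] - right[0],   # assumption: left minus right
        "last_left": left[-1],
        "last_right": right[-1],
        "last_ratio": ratio(left[-1], right[-1]),
        "last_diff": left[-1] - right[-1],
        "avg_left": avg_l,
        "avg_right": avg_r,
        "avg_ratio": ratio(avg_l, avg_r),
        "max_left_middle": max(mid_l, default=0.0),
        "min_left_middle": min(mid_l, default=0.0),
    }
```

A paragraph with an indented first line and a short last line would, under these assumptions, show a large `first_left` and a large `last_right`, while the middle-line statistics stay close to zero.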
