Segmentation of heterogeneous document images : an ... - Tel

tel-00912566, version 1 - 2 Dec 2013

2.2.1 Distance-based methods

Run-length smearing (RLSA) [101] is perhaps the oldest known technique for
segmenting a page into homogeneous regions. It smears components along the
perceived text direction so that they form a distinct block of text. The
Docstrum algorithm [71] starts by finding the K nearest neighbors of each
connected component and connecting them by edges. Histograms of the distances
and angles of all edges are then computed. Text lines are found by grouping
pairs of closest neighbors, and the method proceeds to form text blocks by
grouping text lines. Ferilli et al. [32] indicate that white run lengths embody
a distance-like feature similar to the one explored in the Docstrum method, and
borrow the idea to merge closest neighbors based on the values of the
horizontal and vertical white runs between pairs of connected components. If
adjacent regions from different columns have a similar homogeneity, they are
often merged. Because of this, other variants of the run-length method have
been proposed to improve the results, among them the Constrained Run-Length
Algorithm (CRLA) [98] and selective CRLA [93].
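The smearing idea can be illustrated with a short Python sketch. It is not the implementation from [101]: it assumes a binary image stored as a NumPy array with 1 for ink and 0 for background, and the names smear_rows, rlsa, h_gap and v_gap are hypothetical. It follows one common formulation in which the page is smeared horizontally and vertically and the two results are combined with a logical AND.

```python
import numpy as np

def smear_rows(binary, max_gap):
    """Fill every white run of length <= max_gap that lies between two ink
    pixels on the same row. `binary` is a 2-D array, 1 = ink, 0 = background."""
    out = binary.copy()
    for row in out:                      # each row is a view into `out`
        ink = np.flatnonzero(row)        # column indices of ink pixels
        for a, b in zip(ink[:-1], ink[1:]):
            # The white run between a and b has length b - a - 1.
            if 1 < b - a <= max_gap + 1:
                row[a + 1:b] = 1
    return out

def rlsa(binary, h_gap, v_gap):
    """Smear horizontally and vertically, then keep pixels set in both maps.
    The gap thresholds are assumptions that must be tuned per document."""
    horizontal = smear_rows(binary, h_gap)
    vertical = smear_rows(binary.T, v_gap).T   # smear columns via transpose
    return horizontal & vertical
```

With a small h_gap the characters of a word fuse into one run, while a larger value also bridges the gaps between words, which is why the choice of threshold determines whether the result approximates words, lines or blocks.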

Methods based on Voronoi diagrams, such as [47, 48], follow the same strategy:
they filter the edges of a graph to separate regions of text. Voronoi++ [2] is
the most recently proposed of these methods. It uses dynamic distance
thresholding instead of global thresholds, which, to some extent, addresses the
problems of over-segmentation around large characters and of grouping
dissimilar text sizes. Despite a 33% increase in region detection accuracy
compared to [48], the text precision and recall of the method on the University
of Washington III (UW-III) database are only 68.78% and 78.30%, respectively.

The crux of all these methods is how well they can apply distance thresholds to
remove the appropriate edges and group the remaining connected components into
text regions. The cited works set different thresholds, not all of which are
suitable for other datasets. In other words, these methods rely on certain
assumptions about document layouts, and they fail when those assumptions are
not met. Another issue is that, because they rarely take advantage of the
alignment of components, they often merge side notes with the main text body.
In fact, we are not aware of a method that uses the alignment of components,
the shape of regions and the distance between components all at the same time.
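The sensitivity to thresholds can be made concrete with a minimal Python sketch. It does not reproduce any of the cited methods: it assumes each connected component is reduced to its centroid, links every component to its k nearest neighbors, discards links longer than a single global threshold, and groups what remains with union-find. The names group_by_distance, k and max_dist are hypothetical.

```python
import math

def group_by_distance(centroids, k, max_dist):
    """Link each component to its k nearest neighbours, drop links longer
    than max_dist, and return the connected groups as lists of indices."""
    n = len(centroids)
    parent = list(range(n))              # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    for i, p in enumerate(centroids):
        # Distances from component i to every other component, nearest first.
        neighbours = sorted(
            (math.dist(p, q), j) for j, q in enumerate(centroids) if j != i
        )
        for d, j in neighbours[:k]:
            if d <= max_dist:            # the global threshold (an assumption)
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Two clusters of components, e.g. a text body and a distant side note.
points = [(0, 0), (10, 0), (20, 0), (100, 0), (110, 0)]
print(group_by_distance(points, k=2, max_dist=15))    # [[0, 1, 2], [3, 4]]
print(group_by_distance(points, k=2, max_dist=100))   # [[0, 1, 2, 3, 4]]
```

The same input yields two regions or one depending only on max_dist, which mirrors the problem described above: a threshold tuned for one layout merges a side note into the main text on another.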

2.2.2 Whitespace analysis

Methods in this category analyze the background structure (whitespace) of the
document image to determine the physical layout of the document. The
rudimentary algorithms for whitespace analysis were limited to axis-aligned
rectangles. As a consequence, they have to correct document rotation before
every search operation. In 2003, Breuel [13] developed a method that can find
maximal empty rectangles more efficiently, and which could consequently be
applied to considerably more complex types of documents. The new method
