Segmentation of heterogeneous document images : an ... - Tel
tel-00912566, version 1 - 2 Dec 2013
2.2.1 Distance-based methods
The run-length smearing algorithm (RLSA) [101] is perhaps the oldest known technique for
segmenting a page into homogeneous regions. It smears components along the perceived
text direction so that they form distinct blocks of text. The Docstrum algorithm
[71] starts by finding the K nearest neighbors of each connected component and
connecting them by edges. Histograms of the distances and angles of all edges are then
computed. Text lines are found by grouping pairs of closest neighbors,
and the method then forms text blocks by grouping text lines. Ferilli et
al. [32] indicate that white run lengths embody a distance-like feature similar
to the one explored in the Docstrum method, and borrow the idea to merge
closest neighbors based on the lengths of the horizontal and vertical white runs between
pairs of connected components. If adjacent regions from different columns
have a similar homogeneity, they are often merged. For this reason, other variants
of the run-length method have been proposed to improve the results,
for example the Constrained Run-Length Algorithm (CRLA) [98] and selective
CRLA [93].
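The smearing step of RLSA can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows (1 = ink, 0 = background), fills only background runs that are bounded by ink on both sides, and combines the horizontal and vertical smears with a logical AND, which is one common formulation of the algorithm. The function names and the two threshold parameters are illustrative and are not taken from [101].

```python
def smear_row(row, threshold):
    """Fill background runs of length <= threshold that lie between two ink pixels."""
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, value in enumerate(row):
        if value == 1:
            if last_ink is not None:
                gap = i - last_ink - 1
                if 0 < gap <= threshold:
                    for j in range(last_ink + 1, i):
                        out[j] = 1
            last_ink = i
    return out


def rlsa(image, h_threshold, v_threshold):
    """Smear horizontally and vertically, then keep pixels set in both results."""
    horizontal = [smear_row(row, h_threshold) for row in image]
    # Vertical smearing: smear each column, then transpose back to rows.
    smeared_columns = [smear_row(list(col), v_threshold) for col in zip(*image)]
    vertical = [list(row) for row in zip(*smeared_columns)]
    return [
        [h & v for h, v in zip(h_row, v_row)]
        for h_row, v_row in zip(horizontal, vertical)
    ]
```

With a threshold of 2, the row `[1, 0, 0, 1, 0, 0, 0, 0, 1]` becomes `[1, 1, 1, 1, 0, 0, 0, 0, 1]`: the short gap is closed and the long one is preserved. The choice of thresholds decides which gaps count as intra-block spacing, which is why regions from neighboring columns can be merged when the gutter is narrow.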
Methods based on Voronoi diagrams, such as [47, 48], follow the same
strategy: they filter the edges of a graph to separate regions of text. Voronoi++
[2] is the latest proposed method of this kind. It uses dynamic distance thresholding
instead of global thresholds, which, to some extent, addresses the problems of
over-segmentation around large characters and of grouping text of dissimilar sizes. Despite
an increase of 33% in region detection accuracy compared to [48], the
text precision and recall of the method on the University of Washington III (UW-III) database
are 68.78% and 78.30%, respectively.
The crux of all these methods is how well they can apply
distance thresholds to remove the appropriate edges and group the remaining connected
components into text regions. The works mentioned above set different thresholds,
not all of which are suitable for other datasets. In other
words, these methods rely on certain assumptions about document layouts,
and they fail when those assumptions are not met. The other issue
with these methods is that, because they rarely take advantage of the alignment
of components, they often merge side notes with the main text body. In fact,
we are not aware of a method that uses the alignment of components, the shape of
regions, and the distance between components all at the same time.
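The common core of these methods, linking neighboring components and discarding edges that exceed a distance threshold, can be sketched as follows. This is a generic illustration and not the procedure of any cited work: it assumes each connected component is reduced to its centroid, uses a brute-force nearest-neighbor search, and applies a single global threshold `max_dist`. All names are hypothetical.

```python
import math


def group_by_distance(centroids, k, max_dist):
    """Group components whose k-nearest-neighbor edges are no longer than max_dist."""
    n = len(centroids)
    parent = list(range(n))  # union-find forest over component indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, p in enumerate(centroids):
        # Brute-force k nearest neighbors by Euclidean distance.
        neighbours = sorted(
            (math.dist(p, q), j) for j, q in enumerate(centroids) if j != i
        )[:k]
        for dist, j in neighbours:
            if dist <= max_dist:  # keep the edge: merge the two groups
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For the centroids `[(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]` with `k=2`, a threshold of 1.5 yields two groups, `[[0, 1, 2], [3, 4]]`, whereas a threshold of 20 merges everything into one. The second case mirrors the failure described above: a threshold tuned for one layout merges a side note with the main text body in another.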
2.2.2 Whitespace analysis
Methods in this category analyze the background structure (whitespace)
of the document image to determine the physical layout of the document. The
rudimentary algorithms for whitespace analysis were limited to axis-aligned rectangles.
As a consequence, they have to correct document rotation before every
search operation. In 2003, Breuel [13] developed a method that can
find maximal empty rectangles more efficiently and could consequently be
applied to considerably more complex types of documents. The new method