Segmentation of heterogeneous document images : an ... - Tel
Chapter 3
Text/graphics separation
tel-00912566, version 1 - 2 Dec 2013
Separation of text and graphics in document images is the first step in
our methodology. This is an important stage: it improves
the results of text region and text line detection by preventing graphical
drawings from being merged with text regions, and it also helps
separate text regions, as we will see in the next chapter.
3.1 Preprocessing<br />
Preprocessing is the first step after a document image is read
into memory. The aim is to prepare the document for text and graphics separation.
Our method works with connected components. Each component can
be a character, a punctuation mark, noise, part of a handwritten word, a rule line or
a graphical element. To extract these components, the image is first binarized.
Several binarization methods exist. When a document
image is in good condition, Otsu’s binarization method performs better than
other methods such as Sauvola [82, 11] or Niblack [68]. The reason is that
Otsu’s method often keeps all parts of a graphical drawing in a single piece. On
the other hand, Sauvola’s binarization method should be used when dealing with
low-quality historical documents or poor lighting conditions. Figure 3.1 shows
part of a gray-level document with poor lighting and compares the
results of binarization by the Otsu and Sauvola methods. Figure 3.2 illustrates the
advantage of applying Otsu’s method to documents that contain drawings and
graphical components. After binarization, a connected component analysis
extracts all connected components (CCs) of the image.
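The preprocessing described above can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes an 8-bit gray-level image stored as a list of rows, ink darker than paper, and 8-connectivity between foreground pixels. The function names (otsu_threshold, binarize, connected_components) are hypothetical and chosen only for illustration.

```python
from collections import deque


def otsu_threshold(gray):
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with level <= t
    sum0 = 0  # sum of the levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, t):
    """Mark pixels at or below the threshold as foreground (assumed dark ink)."""
    return [[1 if value <= t else 0 for value in row] for row in gray]


def connected_components(binary):
    """Return the 8-connected foreground regions as lists of (row, col) pixels."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components
```

Each returned pixel list corresponds to one CC, so simple per-component quantities such as the pixel count or the bounding box can be computed directly from it. A single global threshold of this kind is what makes Otsu's method sensitive to uneven lighting, which is consistent with the recommendation to prefer a locally adaptive method such as Sauvola's in that case.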
3.2 Features<br />
As already mentioned in the previous chapter, most methods use either a block-based
or a component-based approach for labeling components. To clarify this,
if a method assigns a label {text, graphics} to a single component, then it tends
to extract component-based features such as size, area, etc. And if the method