Segmentation of heterogeneous document images : an ... - Tel


tel.archives.ouvertes.fr
14.01.2014 Views

Chapter 3 Text/graphics separation

tel-00912566, version 1 - 2 Dec 2013

Separation of text and graphics in document images is the first step in our methodology. This is an important stage: it improves the results of text region and text line detection by preventing graphical drawings from being merged with text regions, and it also assists in separating text regions, as we will see in the next chapter.

3.1 Preprocessing

Preprocessing is the first step after a document image is read into memory. The aim is to prepare the document for text and graphics separation. Our method works with connected components. Each component can be a character, a punctuation mark, noise, part of a handwritten word, a rule line or a graphical element. To extract these components, the image is first binarized.

Different binarization methods exist. When a document image is in good condition, Otsu’s binarization method performs better than other methods such as Sauvola [82, 11] or Niblack [68]. The reason is that Otsu’s method often keeps all parts of a graphical drawing in a single piece. On the other hand, Sauvola’s binarization method should be used when dealing with low-quality historical documents or bad lighting conditions. Figure 3.1 shows part of a gray-level document captured in bad lighting conditions and compares the results of binarization by the Otsu and Sauvola methods. Figure 3.2 illustrates the advantage of applying Otsu’s method to documents that contain drawings and graphical components.

After binarization, a connected component analysis extracts all connected components (CCs) of the image.

3.2 Features

As already mentioned in the previous chapter, most methods use either a block-based or a component-based approach for labeling components. To clarify, if a method assigns a label {text, graphics} to a single component, then it tends to extract component-based features such as size, area, etc. And if the method

Figure 3.1: (a) Image. (b) Otsu binarization. (c) Sauvola binarization. When dealing with low-quality documents or documents captured in bad lighting conditions, Sauvola’s binarization method is preferable to Otsu’s binarization. Image is from [11].
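The preprocessing pipeline of Section 3.1 (binarization followed by connected component extraction) can be illustrated with a minimal pure-Python sketch. It is not the thesis implementation. The function names `otsu_threshold`, `binarize` and `connected_components`, the toy `page` image, the use of 8-connectivity, and the assumption of dark ink on a light background are all illustrative choices that the text does not specify. The sketch also computes two of the component-based features mentioned in Section 3.2 (area and bounding box).

```python
from collections import deque

def otsu_threshold(gray):
    """Return the gray level that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with level <= t
    sum0 = 0  # sum of the levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray, t):
    """Assumption: ink is dark, so foreground is True where level <= t."""
    return [[v <= t for v in row] for row in gray]

def connected_components(binary):
    """Extract 8-connected components with their area and bounding box."""
    h = len(binary)
    w = len(binary[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            area = 0
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                area += 1
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            comps.append({"area": area, "bbox": (x0, y0, x1, y1)})
    return comps

# Toy 5x6 gray-level "page": light background with two dark blobs.
page = [
    [250, 250, 250, 250, 250, 250],
    [250,  20,  20, 250, 250, 250],
    [250,  20, 250, 250,  30, 250],
    [250, 250, 250, 250,  30, 250],
    [250, 250, 250, 250, 250, 250],
]
threshold = otsu_threshold(page)
components = connected_components(binarize(page, threshold))
```

On this toy image the two dark blobs come out as two separate components. A locally adaptive method such as Sauvola would replace the single global `threshold` with a per-pixel threshold computed from a neighborhood window, which is what makes it more robust to uneven lighting.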

