
Segmentation of heterogeneous document images : an ... - Tel


Chapter 3

Text/graphics separation

tel-00912566, version 1 - 2 Dec 2013

Separation of text and graphics in document images is the first step in our methodology. This is an important stage: it improves the results of text region and text line detection by preventing graphical drawings from being merged with text regions, and it also provides assistance for separating text regions, as we will see in the next chapter.

3.1 Preprocessing

Preprocessing is the first step after a document image is read into memory. Its aim is to prepare the document for text and graphics separation. Our method works with connected components. Each component can be a character, a punctuation mark, noise, part of a handwritten word, a rule line or a graphical element. To extract these components, the image is first binarized. Several binarization methods exist. When a document image is in good condition, Otsu's binarization method performs better than other methods such as Sauvola [82, 11] or Niblack [68], because Otsu's method often keeps all parts of a graphical drawing in a single piece. On the other hand, Sauvola's binarization method should be used when dealing with low-quality historical documents or bad lighting conditions. Figure 3.1 shows part of a gray-level document with bad lighting conditions and compares the results of binarization by the Otsu and Sauvola methods. Figure 3.2 illustrates the advantage of applying Otsu's method to documents that contain drawings and graphical components. After binarization, a connected component analysis extracts all connected components (CCs) of the image.
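The preprocessing steps described above can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes dark ink on a light background, an 8-bit gray-level image stored as a list of rows, and 8-connectivity between pixels. The function names (otsu_threshold, binarize, connected_components) and the toy image are hypothetical and chosen only for illustration.

```python
from collections import deque


def otsu_threshold(gray):
    """Return the Otsu threshold t for a 2-D list of 8-bit gray levels.

    Pixels <= t form the dark class, pixels > t the light class; t is the
    level that maximizes the between-class variance.
    """
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class
    sum0 = 0  # sum of gray levels in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, t):
    """Mark ink pixels (gray level <= t) with 1 and background with 0."""
    return [[1 if v <= t else 0 for v in row] for row in gray]


def connected_components(binary):
    """Return the 8-connected components of ink pixels as lists of (row, col)."""
    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


# Toy gray-level image: two dark strokes on a light background.
page = [
    [200, 200, 200, 200, 200, 200],
    [200,  20,  20, 200, 200, 200],
    [200,  20, 200, 200,  20, 200],
    [200, 200, 200, 200,  20, 200],
    [200, 200, 200, 200, 200, 200],
]
t = otsu_threshold(page)
ink = binarize(page, t)
ccs = connected_components(ink)
```

Because Otsu's method uses a single global threshold, a drawing with uniform ink tends to stay in one component, which matches the motivation given above. A locally adaptive method such as Sauvola's would replace otsu_threshold and binarize while leaving the component extraction unchanged.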

3.2 Features

As already mentioned in the previous chapter, most methods use either a block-based or a component-based approach for labeling components. To clarify, if a method assigns a label {text, graphics} to a single component, it tends to extract component-based features such as size, area, etc. And if the method
