Segmentation of heterogeneous document images : an ... - Tel


tel.archives.ouvertes.fr

Appendix B Implementation and software

tel-00912566, version 1 - 2 Dec 2013

During the course of this PhD thesis, many applications and command line tools were developed. Some were developed for document, data or feature visualization, and others to perform computation for different parts of the system. Finally, all the pieces came together as a single unified program that can be applied to any document image using one command line tool. The command line tool is written in C++ using the Qt, OpenCV and libLBFGS libraries. The cross-platform software, with 6800 lines of code, was developed on a Windows machine using Microsoft Visual Studio and ported to Linux for testing and evaluation.

The general syntax of the command line tool is:

• DematSeg [Options] FolderPath

where FolderPath is the location of the .TIFF document images. Without any option, the application opens each document image in the folder path in a multi-threaded framework and applies page segmentation to every document image. Options can be used to generate features or to train different parts of the system. Options are:

• "-gn": This option directs the application to process all document images in the folder path in order to extract connected components and generate features for them. All features are collected in a single file that is later used for training. The application expects to find the corresponding XML ground-truth file for each document image in the same folder. XML ground-truth files should have a name consisting of the base name of the document plus one of the suffixes " GT" or " PrimaGT", or the base name with "pc-" as a prefix (ICDAR2009 default naming). If a ground-truth file is not available, the application simply ignores that document and continues processing the remaining documents in the folder.

• "-tn": This option uses the dataset generated with option "-gn" to train the LogitBoost classifier for text/graphics separation.

• "-gr": This option directs the application to process all document images in the folder path in order to generate observations and features for the two-dimensional random fields model. As with option "-gn", the application expects to find the XML ground-truth file for each document image. All features are appended to the database for later training.

• "-tr": This option uses the dataset generated with option "-gr" to train the two-dimensional random fields model using either Collins' voted perceptron or the L-BFGS training algorithm. The training method, as well as many other details of the training, can be changed inside an .ini file located in the same folder as the command line executable.

• "-tp": This option directs the application to process all document images in the folder path in order to train the paragraph detection module. Unlike the other training options, features are computed online and none are saved to storage.

The .ini file alongside the executable can be edited to change the behavior of the application. It is worth noting that the binarization method, the parameters for Sauvola binarization, the details of morphological operations, the parallelization method and the training parameters can all be set or changed inside this file.
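The command line syntax and options described above can be illustrated with a small argument parser. This is only a minimal sketch, not the actual DematSeg source, which the thesis does not show. The names `Mode`, `Invocation`, `modeFromOption` and `parseArgs` are hypothetical, and the sketch assumes that at most one option is given, followed by the folder path. It requires C++17 for `std::optional`.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical mode names; only the option strings come from the text.
enum class Mode {
    Segment,                // no option: page segmentation
    GenerateTextFeatures,   // "-gn"
    TrainTextClassifier,    // "-tn"
    GenerateFieldFeatures,  // "-gr"
    TrainFieldModel,        // "-tr"
    TrainParagraphs         // "-tp"
};

struct Invocation {
    Mode mode;
    std::string folderPath;
};

// Maps an option string to a mode; returns nullopt for unknown options.
std::optional<Mode> modeFromOption(const std::string& opt) {
    if (opt == "-gn") return Mode::GenerateTextFeatures;
    if (opt == "-tn") return Mode::TrainTextClassifier;
    if (opt == "-gr") return Mode::GenerateFieldFeatures;
    if (opt == "-tr") return Mode::TrainFieldModel;
    if (opt == "-tp") return Mode::TrainParagraphs;
    return std::nullopt;
}

// Parses the arguments that follow the program name.
// Assumption: either "FolderPath" alone or "Option FolderPath".
std::optional<Invocation> parseArgs(const std::vector<std::string>& args) {
    if (args.size() == 1) {
        // A lone known option without a folder path is rejected.
        if (modeFromOption(args[0])) return std::nullopt;
        return Invocation{Mode::Segment, args[0]};
    }
    if (args.size() == 2) {
        if (auto mode = modeFromOption(args[0])) {
            return Invocation{*mode, args[1]};
        }
    }
    return std::nullopt;
}
```

In a real `main`, the vector would be built from `argv + 1` to `argv + argc`, and a `nullopt` result would typically trigger a usage message.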

