Segmentation of heterogeneous document images : an ... - Tel
Appendix B Implementation and software
tel-00912566, version 1 - 2 Dec 2013

During the course of this PhD thesis, many applications and command line tools were developed. Some serve to visualize documents, data or features; others perform the computation for different parts of the system. In the end, all the pieces were brought together in a single unified program that can be applied to any document image through one command line tool. The tool is written in C++ using the Qt, OpenCV and libLBFGS libraries. The cross-platform software, about 6800 lines of code, was developed on a Windows machine using Microsoft Visual Studio and then ported to Linux for testing and evaluation.

The general syntax of the command line tool is:
• DematSeg [Options] FolderPath
where FolderPath is the location of the .TIFF document images. Without any option, the application opens each document image in the folder within a multi-threaded framework and applies page segmentation to it. Options can be used to generate features or to train different parts of the system. The options are:
• "-gn": This option makes the application process all document images in the folder in order to extract connected components and generate features for them. All features are collected in a single file that is later used for training. The application expects to find the corresponding XML ground-truth file for each document image in the same folder. The name of the ground-truth file should consist of the base name of the document followed by one of the suffixes " GT" or " PrimaGT", or preceded by the prefix "pc-" (the default ICDAR2009 naming). If a ground-truth file is not available, the application simply ignores that document and continues with the remaining documents in the folder.
• "-tn": This option uses the dataset generated with option "-gn" to train the LogitBoost classifier for text/graphics separation.
• "-gr": This option makes the application process all document images in the folder in order to generate observations and features for the two-dimensional random fields model. As with option "-gn", the application expects to find the XML ground-truth file for each document image. All features are appended to the database for later training.
• "-tr": This option uses the dataset generated with option "-gr" to train the two-dimensional random fields model using either Collins' voted perceptron or the L-BFGS training algorithm. The training method, as well as many other details of the training, can be changed in an .ini file located in the same folder as the executable.
• "-tp": This option makes the application process all document images in the folder in order to train the paragraph detection module. Unlike the other training options, features are computed online and none are saved to disk.

The .ini file alongside the executable can be edited to change the behavior of the application. It is worth noting that the binarization method, the parameters of Sauvola binarization, the details of the morphological operations, the parallelization method and the training parameters can all be set or changed in this file.
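
As an illustration of the kind of settings such an .ini file might hold, here is a minimal Python sketch. The thesis does not list the file's sections or keys, so every section name, key name and value below is a hypothetical placeholder chosen for the example; the tool itself is written in C++ and its real schema may look quite different.

```python
import configparser

# Hypothetical .ini content: all section names, key names and values are
# illustrative assumptions, not the tool's actual configuration schema.
EXAMPLE_INI = """
[Binarization]
method = sauvola
sauvola_window = 31
sauvola_k = 0.34

[Training]
random_field_trainer = lbfgs
max_iterations = 100
"""

# The two training algorithms named in the text, under assumed spellings.
KNOWN_TRAINERS = ("lbfgs", "voted_perceptron")


def load_settings(ini_text):
    """Parse the .ini text and return a dict of typed settings.

    Missing sections or keys fall back to the (assumed) defaults.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)

    trainer = parser.get("Training", "random_field_trainer", fallback="lbfgs")
    if trainer not in KNOWN_TRAINERS:
        raise ValueError(f"unknown training method: {trainer!r}")

    return {
        "binarization": parser.get("Binarization", "method", fallback="sauvola"),
        "sauvola_window": parser.getint("Binarization", "sauvola_window", fallback=31),
        "sauvola_k": parser.getfloat("Binarization", "sauvola_k", fallback=0.34),
        "trainer": trainer,
        "max_iterations": parser.getint("Training", "max_iterations", fallback=100),
    }


if __name__ == "__main__":
    print(load_settings(EXAMPLE_INI))
```

Keeping such choices in a file rather than in command line flags means that the same executable can be rerun with a different training method or binarization setup without recompiling.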
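
The ground-truth lookup described for the "-gn" and "-gr" options (try each naming convention, skip the document if none matches) could be sketched in Python as follows. This is not the tool's C++ code. The underscore in the suffixes, the ".xml" extension and the accepted image extensions are assumptions made for the example, and the function names are hypothetical.

```python
from pathlib import Path

# Assumptions for this sketch: the text names the suffixes "GT" and "PrimaGT"
# and the prefix "pc-"; the underscore separator and the ".xml" extension are
# guesses, so adjust them to the actual naming of your dataset.
GT_SUFFIXES = ("_GT", "_PrimaGT")
GT_PREFIX = "pc-"
GT_EXTENSION = ".xml"
IMAGE_EXTENSIONS = (".tif", ".tiff")


def find_ground_truth(image_path):
    """Return the ground-truth file next to image_path, or None if absent."""
    image_path = Path(image_path)
    base = image_path.stem
    # Suffix conventions first, then the ICDAR2009-style prefix convention.
    candidates = [image_path.with_name(base + s + GT_EXTENSION) for s in GT_SUFFIXES]
    candidates.append(image_path.with_name(GT_PREFIX + base + GT_EXTENSION))
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def images_with_ground_truth(folder):
    """List (image, ground_truth) pairs, skipping images without ground truth."""
    pairs = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        ground_truth = find_ground_truth(image)
        if ground_truth is None:
            continue  # mirror the described behavior: ignore and move on
        pairs.append((image, ground_truth))
    return pairs
```

Calling `images_with_ground_truth("FolderPath")` on a folder of scans returns only the documents that can be used for feature generation, which matches the behavior of silently ignoring documents that have no ground-truth file.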