For each of the transformed images, we create a 2D value co-occurrence histogram p[i, j], i.e., the histogram of how often an original grayscale value i occurs in the region of interest of image I together with a value j at the same position in the transformed image J:

p[i, j] = \frac{\#\{(x, y) \in \mathrm{ROI} : I[x, y] = i \wedge J[x, y] = j\}}{\#\{(x, y) \in \mathrm{ROI}\}}

Figure 5. (a) Contour image and dilation; (b) ROI in original and binarization. A region of interest (ROI) mask is constructed by dilating the contour image (a). The cross-correlation between the ROI pixels in the original image and in the binarized image (b) measures how close the contour region is to an ideal step edge.

From each such histogram, we extract the four features contrast, correlation, energy, and homogeneity for classification:

\mathrm{contrast} = \sum_{i,j} |i - j|^2 \, p[i, j]

\mathrm{correlation} = \sum_{i,j} \frac{(i - \mu_x)(j - \mu_y) \, p[i, j]}{\sigma_x \sigma_y}

\mathrm{energy} = \sum_{i,j} p^2[i, j]

\mathrm{homogeneity} = \sum_{i,j} \frac{p[i, j]}{1 + |i - j|}

where

\mu_x = \sum_{i,j} i \, p[i, j], \quad \mu_y = \sum_{i,j} j \, p[i, j]

Figure 6. Local binary pattern: a 3 × 3 window around each position is extracted (left). Its center value is used as threshold for binarization (center). The 8 binarized neighbors are multiplied with a mask of powers of two and summed up (right). This arranges them in clockwise order, forming a byte value, in this example 1 + 2 + 4 + 16 + 128 = 151.

Local Binary Maps. We calculate the rotation-invariant local binary map [8]. It can be illustrated in the following way: assume a 3 × 3 window in the image. We calculate a binary version of this window using the center point as threshold. The resulting 8 single-bit values are combined into a bitstring in clockwise order, resulting in an 8-bit integer value. This process is illustrated in Figure 6. The resulting value depends on the bit with which the combination starts.
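As a minimal sketch, the basic (rotation-dependent) pattern value just described can be computed as follows; the starting corner of the clockwise traversal and the use of >= as the binarization rule are assumptions, since the text does not fix these conventions:

```python
def lbp_byte(window):
    """Basic 3x3 local binary pattern: binarize the 8 neighbors
    against the center value and read them as bits in clockwise order."""
    center = window[1][1]
    # clockwise neighbor coordinates, starting at the top-left corner
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    value = 0
    for bit, (y, x) in enumerate(coords):
        if window[y][x] >= center:  # neighbor at least as bright as center -> 1
            value |= 1 << bit       # weight by the corresponding power of two
    return value

# A window whose binarized neighbors reproduce the Figure 6 example:
# bits 0, 1, 2, 4, 7 are set, giving 1 + 2 + 4 + 16 + 128 = 151.
print(lbp_byte([[9, 9, 9],
                [9, 5, 1],
                [1, 1, 9]]))  # -> 151
```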
To make the feature rotationally invariant, we use each bit as a starting point once and keep the minimum resulting value as output. The same process can also be applied to larger window sizes. In our experiments, we use a 5 × 5 window. The values at its boundary are binarized and combined into a 16-bit output value, which is again made rotationally invariant by minimizing over all 16 possible rotations.

The standard deviations used in the correlation feature above are

\sigma_x = \sqrt{\sum_{i,j} (i - \mu_x)^2 \, p[i, j]}, \quad \sigma_y = \sqrt{\sum_{i,j} (j - \mu_y)^2 \, p[i, j]}

This results in a 4D feature vector per transformation, i.e., the total vector of texture features is 12-dimensional.

3.3 Classification

For classification, the 15-dimensional feature vectors are reduced to 12 dimensions using the PCA transform and normalized to the range [0, 1]. Afterwards, they are used as input to a support vector machine with a Gaussian kernel (RBF-SVM). Support vector machines have proven to be a powerful tool for classification tasks where high accuracy is required. By using a Gaussian kernel function, we avoid the frequent problem of parameter selection, because an RBF-SVM has only two free parameters, the penalty C and the kernel width σ, and there are straightforward testing methods for choosing them [9]. In our case, the parameters were set to C = 20 and σ = 1.

Another reason for choosing an RBF-SVM was that it exploits the advantages of a high-dimensional (even infinite-dimensional) classification space and can be thought of as subsuming the more fundamental linear-kernel SVMs as well [10].

3.4 Visualization

A big problem of fully automatic classification systems is that users do not fully trust their decisions. This is even more the case in sensitive areas like document authenticity,
where matters of national security or financial transactions are involved. We have therefore designed our system not to make decisions independently, but to guide the user into making his or her own decision. This is done by presenting the user with the classification results in the form of a color annotation of the original document image: green represents laser-printed characters and red represents inkjet.

This representation allows the user to see at a single glance whether a document is uniformly colored in the way he or she expects, indicating that it is an original; whether the color is uniformly wrong, indicating a complete forgery; or whether it consists of different colors, which can be an indicator of a partial forgery. In the last case, the color coding shows its special strength, because the user can immediately see which parts of the document have been detected as potentially forged. Based on the semantic meaning of these parts, he or she can decide either to consult the original document for an in-depth check, or to attribute the detection to a false positive and proceed. Typically, names and numbers are positions where missing a forgery is potentially dangerous, whereas in the company logo or the running text of a letter, a detection is more likely to be an error. Encoding this information into the system would require a lot of background knowledge about the documents to be checked, something that a generic counterfeit detection system typically does not have. By instead leaving the choice to the user in an interactive fashion, the system stays generic overall.
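A minimal sketch of this color annotation step, assuming classified regions arrive as axis-aligned bounding boxes with class labels (the data layout and helper name are hypothetical):

```python
# green for laser, red for inkjet, as in the annotation scheme above
LABEL_COLORS = {"laser": (0, 255, 0), "inkjet": (255, 0, 0)}

def annotate(width, height, regions):
    """Paint each classified region's bounding box in its class color
    on a white RGB canvas (row-major list of rows of RGB tuples)."""
    canvas = [[(255, 255, 255)] * width for _ in range(height)]
    for (x0, y0, x1, y1), label in regions:
        color = LABEL_COLORS[label]
        for y in range(y0, y1):
            for x in range(x0, x1):
                canvas[y][x] = color
    return canvas
```

In a real system the colored boxes would be blended over the scanned document image rather than drawn on a blank canvas.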
We also believe that the acceptance rate among users will be higher, because of the positive feeling of deciding for oneself instead of giving control to "a machine".

4 Experiments

4.1 Setup

We implemented a prototype of the described system in MatLab and tested it on a dataset of 26 printouts from 8 laser and 5 inkjet printers. All documents contain only text and were scanned at a resolution of 3200 dpi. Since our algorithm works on the level of individual letters, the document images were split into connected components, resulting in 9217 image regions in total.

For evaluation, we performed two kinds of experiments. To measure the numerical classification accuracy, we performed leave-one-out testing, each time selecting all regions of one document for testing and all regions of all other documents for training. This resulted in a classification accuracy of 94.8%. However, the classification accuracy varied rather strongly, for some documents reaching 100%, but in one case also dropping as low as 78%. We believe that this is caused by a lack of diversity in the training material, since we only had access to a limited number of printer and paper combinations.

To demonstrate our approach to visualization and counterfeit detection, we created an example document from laser-printed stationery with an inkjet-printed text body. It was classified, and the result is presented in Figure 7. As one can see, the laser part is recognized perfectly, whereas in the inkjet section some letters are falsely marked as laser print. However, the visualization makes it possible to see that the errors occur mainly for very small bounding boxes, in particular punctuation, and not for critical parts like names or bank data.
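The document-level leave-one-out protocol described above can be sketched as follows; a nearest-centroid rule on scalar stand-in features replaces the RBF-SVM here, purely to keep the illustration self-contained:

```python
def leave_one_document_out(regions):
    """Hold out all regions of one document for testing and train on the
    regions of all other documents; returns per-document accuracy.
    `regions` is a list of (doc_id, feature, label) tuples."""
    per_doc_accuracy = {}
    for held_out in sorted({doc for doc, _, _ in regions}):
        train = [(f, y) for d, f, y in regions if d != held_out]
        test = [(f, y) for d, f, y in regions if d == held_out]
        # stand-in classifier: nearest class centroid (not the paper's RBF-SVM)
        centroids = {}
        for label in {y for _, y in train}:
            feats = [f for f, y in train if y == label]
            centroids[label] = sum(feats) / len(feats)
        correct = sum(
            1 for f, y in test
            if min(centroids, key=lambda c: abs(f - centroids[c])) == y
        )
        per_doc_accuracy[held_out] = correct / len(test)
    return per_doc_accuracy
```

Reporting accuracy per held-out document, as this sketch does, exposes exactly the per-document variance discussed above.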
A human observer should therefore be able to classify the letter as non-forged.

4.2 Discussion

For the prototypical stage of the system, we believe that the results presented are good. It also looks promising that they can be improved by more careful parameter selection and, in particular, more training material. However, the current error rate of 5.2% is not low enough to establish a fully automatic system. For that, the task of counterfeit detection itself is much too sensitive to errors: a single forged letter makes a whole document a forgery. Therefore, a single false positive can cause a genuine document to be detected as a forgery, and a single false negative can cause a forged document to be overlooked.

A system working with such a strict decision rule would instead require an error rate in the range of 0.01% (one error per 10 documents), which appears unreachable with current methods. We have avoided this problem by not letting the system decide for itself, but by presenting the user with a visualization of the classification results in the form of an assistance system.

5 Conclusion

We have presented a system that uses machine learning to detect different types of printing techniques on the level of individual letters, or even parts of letters. Furthermore, we have described a setup to detect counterfeiting or manipulation of printed documents. Here, the aim was not to build a completely autonomous system in the form of a black-box classifier, but a system that assists a possible user in his or her own decision. The reported results indicate that automatic classification of the technique used to create a printed document is indeed possible with ordinary consumer hardware.
So far we have used two classes, inkjet and laser, but by switching to a multi-class classifier and with additional research on suitable features, it should be possible to distinguish even more types of print and also other writing devices like ballpoint pens. Our goal is to create a system that can indeed fulfill a useful purpose outside of the lab environment, e.g. by integration into a content management system handling invoices or receipts, as they are frequently used today at