The metaclasses at the node were modeled using mixtures of Gaussians, with the number of Gaussians corresponding to the number of classes at that node. The initial parameters of the Gaussians were estimated using the corresponding class data from Area 1. In the E-step of the algorithm, the Gaussians were used to determine the posterior probabilities of the Area 2 data. These estimated probabilities were then used to update the parameters of the Gaussians (M-step). EM iterations were performed until the average change in the posterior probabilities between two iterations fell below a specified threshold [3]. A new Fisher feature extractor, based on the statistics of the metaclasses at that iteration, was also computed at each EM iteration. The updated extractor was then used to project the data into the corresponding Fisher space prior to the estimation of the class-conditional pdfs.
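To make this procedure concrete, the following minimal Python sketch runs the EM loop at a single node. It is a sketch under stated assumptions, not the authors' implementation: numpy and scipy are assumed available; init_means, init_covs, and priors are hypothetical inputs holding the Area 1 class statistics; X2 holds the unlabeled Area 2 pixels; and the per-iteration Fisher extractor update is reduced to a comment.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_transfer(X2, init_means, init_covs, priors, tol=1e-4, max_iter=100):
        # Start from the Gaussian parameters estimated on the Area 1 class data.
        means = [m.copy() for m in init_means]
        covs = [c.copy() for c in init_covs]
        priors = np.asarray(priors, dtype=float)
        prev_post = None
        for _ in range(max_iter):
            # E-step: posterior probability of each Area 2 sample under each class.
            lik = np.column_stack([
                multivariate_normal.pdf(X2, mean=m, cov=c, allow_singular=True)
                for m, c in zip(means, covs)])
            post = lik * priors
            post /= post.sum(axis=1, keepdims=True)
            # Stop when the average change in posteriors falls below the threshold.
            if prev_post is not None and np.abs(post - prev_post).mean() < tol:
                break
            prev_post = post
            # M-step: re-estimate the Gaussian parameters from the soft memberships.
            nk = post.sum(axis=0)
            priors = nk / len(X2)
            for k in range(len(means)):
                means[k] = (post[:, k] @ X2) / nk[k]
                d = X2 - means[k]
                covs[k] = (post[:, k][:, None] * d).T @ d / nk[k]
            # (In the paper, a new Fisher extractor would be computed here from
            # the updated metaclass statistics before re-estimating the pdfs.)
        return means, covs, priors, post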
Analysis of the results showed that while this approach yielded somewhat higher overall classification accuracies than a direct application of the original classifier, the errors were mostly concentrated in a few classes. A closer inspection revealed that the spectral signatures of these classes had changed sufficiently that, had adequate amounts of labeled data from Area 2 been available, they would have been grouped differently in the BHC hierarchies. This suggested that we should have obtained multiple trees from Area 1, such that some of them would be more suitable for the new area.

Thus, our second approach was to introduce randomization into the structure of the BHC tree. The design space for the BHC offers many possibilities for randomizing the tree structure. In our earlier work [28], we generated randomized BHC trees by varying factors such as the percentage of the available training data, the number of features selected at each node, and the class priors, and by randomly switching the class labels for a small percentage of the labeled data points. In this paper, randomized BHC trees were generated by choosing an internal node of the tree and randomly interchanging the classes drawn from its right and left children; the corresponding feature extractors and classifiers at that node (and its children) were then updated to reflect the perturbation (a short sketch of this operation appears at the end of this subsection). Note that in the absence of any labeled data from Area 2, there is no way to evaluate which of the randomly generated BHC trees best suits the spatially/temporally different data. Hence, we can only generate an ensemble of classifiers using the training data, hoping that the ensemble contains some classifiers that are better suited to Area 2.

The key to the success of an ensemble of classifiers is choosing classifiers that make independent errors. If the classifiers are not independent, the ensemble might actually perform worse than the best member of the ensemble. Hence, a number of diversity measures have been proposed for choosing a good subset of classifiers [29]. Of the ten diversity measures studied, the authors recommend the Q_av, the ρ_av, and the κ measures for their easy interpretability. They further promote the Q-diversity measure because of its relationship with the majority vote of an ensemble and its ease of calculation. Hence, we made use of the Q-diversity measure in our earlier study [28]. However, on experimenting with the κ measure [30], we found that it yielded overall classification accuracies comparable to, if not better than, those of the Q measure. Further, unlike the Q-diversity measure, the κ measure does not require access to any labeled data. Hence, in this paper, the κ-diversity measure, which indicates the degree of disagreement between a pair of classifiers, was used to ensure the diversity of our classifier ensemble.

The data from Area 2 were labeled using each tree in the classifier ensemble, and these labels were then used to obtain the κ measure between each pair of classifiers. The classification results of a smaller set of classifiers with the lowest average pairwise κ measure (i.e., higher diversity) were then combined via simple majority voting.
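The node-swap perturbation referenced above can be sketched in a few lines. The representation below is a hypothetical simplification in which each child of an internal node is summarized by the set of classes it covers; one class drawn from each child is interchanged, and the required re-estimation of the node's extractors and classifiers is left as a comment.

    import random

    def swap_classes(left_classes, right_classes, rng=None):
        """Randomly interchange one class between the two children of a BHC node."""
        rng = rng or random.Random()
        c_left = rng.choice(sorted(left_classes))
        c_right = rng.choice(sorted(right_classes))
        left_classes.discard(c_left)
        left_classes.add(c_right)
        right_classes.discard(c_right)
        right_classes.add(c_left)
        # The feature extractors and classifiers at this node and its children
        # must now be re-estimated on the training data to reflect the new split.
        return left_classes, right_classes

    # Example: perturb an internal node that splits classes {1, 2, 3} | {4, 5}.
    left, right = swap_classes({1, 2, 3}, {4, 5})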

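The κ-based pruning described above can be operationalized as below. This is a sketch under assumptions: pairwise agreement is computed as Cohen's κ between two classifiers' predicted labels (no ground truth is needed), and the "smaller set" is chosen here by ranking classifiers on their average pairwise κ; the paper's exact subset-selection procedure may differ.

    import numpy as np
    from itertools import combinations

    def cohen_kappa(a, b):
        """Chance-corrected agreement between two predicted-label vectors."""
        p_obs = np.mean(a == b)
        p_chance = sum(np.mean(a == c) * np.mean(b == c) for c in np.union1d(a, b))
        return (p_obs - p_chance) / (1.0 - p_chance)

    def most_diverse(predictions, k):
        """Indices of the k classifiers with the lowest average pairwise kappa.
        `predictions` is an (n_classifiers, n_samples) array of Area 2 labels."""
        n = len(predictions)
        kappa = np.zeros((n, n))
        for i, j in combinations(range(n), 2):
            kappa[i, j] = kappa[j, i] = cohen_kappa(predictions[i], predictions[j])
        return np.argsort(kappa.sum(axis=1) / (n - 1))[:k]

    def majority_vote(predictions):
        """Per-pixel simple majority vote (labels assumed small nonnegative ints)."""
        return np.array([np.bincount(col).argmax() for col in predictions.T])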
B. Semisupervised Case

If small amounts of labeled data are available, knowledge transfer mechanisms can improve classification accuracies, especially if they exploit the added information. In this section, we generalize both knowledge transfer methods to leverage the labeled data and determine how much labeled data are required from the spatially separate area before the advantages of transferring information from the original solution are no longer realized.

The ensemble-based approach was modified in two stages. First, after the set of classifiers was pruned with the κ-diversity measure to improve the diversity of the ensemble, we further pruned the remaining classifiers to include only those that had yielded higher classification accuracies on the labeled data. A scheme similar to the online weighted majority algorithm [31], which assigns each classifier a weight, was then used to weight the different classifiers. Prior to learning, the weights of all the classifiers are equal. As each data sample is presented to the ensemble, a classifier's weight is reduced multiplicatively if that example is misclassified. For each new example, the ensemble then returns the class with the maximum total weighted vote over all the classifiers. Thus, the algorithm used for computing the class label predicted by the BHC ensemble is as follows (a code sketch appears at the end of this subsection).

Weighted majority vote for BHC ensemble
1) Initialize the weights w_1, ..., w_n of all n BHCs to 1.
2) For each labeled data point, let y_1, ..., y_n be the set of class labels predicted by the BHCs.
3) Output class h_i if, for all h_j ≠ h_i, j = 1, ..., m, where m is the number of classes,

$$\sum_{k=1;\, y_k = h_i}^{n} w_k \;\geq\; \sum_{k=1;\, y_k = h_j}^{n} w_k.$$

4) On observing the correct class label, if h_i is wrong, multiply the weight of each incorrect BHC by 0.5; if h_i is correct, do not modify the weights.

At the end of this learning, the "winnowing property" of the weighted majority scheme assigns lower weights to those classifiers with poorer classification accuracies on the incoming data. Thus, by reducing the contribution of the inaccurate classifiers to the final decision, the voting scheme ensures that the performance of the ensemble is not much worse than that of the best individual predictor, regardless of the dependence between the members of the ensemble [31].

For the semisupervised implementation, the EM-based method was modified to perform a constrained EM. Here, the E-step updates the posterior probabilities (memberships) only for the unlabeled data while fixing the memberships of the labeled instances according to the known class assignments [32]. The labeled data were also used to initialize the mean vectors and the covariance matrices of the metaclasses at the nodes of the binary trees in the ensemble pruned by the κ-diversity measure. The labeled and unlabeled data from Area 2 were then used for the constrained EM while updating the Fisher extractors in each of the binary trees. The classification results of the resulting ensemble were then combined using the weighted majority algorithm as detailed previously.
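For concreteness, the weighted-majority vote listed in the four steps above might be rendered as follows. This is a sketch: class labels are assumed to be integers in [0, n_classes), all_preds is a hypothetical (n_samples, n_bhcs) array of per-BHC predictions, and beta = 0.5 is the halving factor from step 4.

    import numpy as np

    def weighted_majority(all_preds, true_labels, n_classes, beta=0.5):
        n_bhcs = all_preds.shape[1]
        w = np.ones(n_bhcs)                  # step 1: all weights start at 1
        outputs = []
        for preds, truth in zip(all_preds, true_labels):
            votes = np.zeros(n_classes)      # steps 2-3: weighted vote per class
            for k in range(n_bhcs):
                votes[preds[k]] += w[k]
            h = int(votes.argmax())
            outputs.append(h)
            if h != truth:                   # step 4: on an ensemble error,
                w[preds != truth] *= beta    # halve every incorrect BHC's weight
        return np.array(outputs), w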
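Likewise, the constrained E-step described above amounts to a small change to the EM sketch given earlier: soft memberships are recomputed only for the unlabeled samples, while labeled samples keep fixed one-hot memberships. In the hypothetical helper below, labels holds a class index for labeled samples and -1 for unlabeled ones.

    import numpy as np

    def constrained_memberships(post_unlabeled, labels, n_classes):
        """Assemble the membership matrix fed to the M-step of constrained EM."""
        post = np.zeros((len(labels), n_classes))
        unlabeled = labels < 0
        post[unlabeled] = post_unlabeled     # E-step posteriors, unlabeled only
        rows = np.flatnonzero(~unlabeled)
        post[rows, labels[rows]] = 1.0       # clamp labeled rows to known class
        return post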
IV. EXPERIMENTAL EVALUATION

In this section, we provide empirical evidence that, in the absence of labeled data from the spatially/temporally separate area, using knowledge transfer is better than the direct application of existing classifiers to the new area. We also present results showing that, with small amounts of labeled data from the new areas, our framework yields higher overall accuracies in our experiments than the current state-of-the-art ECOC multiclassifier system [6] with support vector machines (SVMs) [33] as the binary classifiers. Besides the ECOC classifier, we also compare our framework with two EM-based ML (ML-EM) techniques. The first ML-EM classifier is the unsupervised approach suggested in [1]. The second is the knowledge transfer method proposed in [3], which we refer to as seeded ML-EM, since it uses the Area 1 data only to initialize the Gaussians prior to performing the EM iterations. The parameters of the Gaussians and the Fisher feature extractors are then updated via EM using the unlabeled data (and, if available, the labeled data) from Area 2.

A. Data Sets

The knowledge transfer approaches described above were tested on hyperspectral data sets obtained from two sites: NASA's John F. Kennedy Space Center (KSC), Florida [27], and the Okavango Delta, Botswana [4].

1) KSC: The NASA Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) acquired the data over the KSC on March 23, 1996. AVIRIS acquires data in 242 bands of 10-nm width from 400 to 2500 nm. The KSC data, collected from an altitude of approximately 20 km, have a spatial resolution of 18 m. Removal of noisy and water absorption bands resulted in 176 candidate features. Training data were selected using land-cover maps derived by the KSC staff from color infrared photography, Landsat Thematic Mapper (TM) imagery, and field checks. Discrimination of land-cover types in this environment is difficult due to the similarity of the spectral signatures of certain vegetation types and the existence of mixed classes. The 512 × 614 spatially removed test set (Area 2) is a different subset of the flight line than the 512 × 614 data set from Area 1 [34]. While the number of classes in the two regions differs, we restrict ourselves to those classes that are present in both regions. Details of the ten land-cover classes considered in the KSC area are in Table I.

[TABLE I: CLASS NAMES AND NUMBER OF DATA POINTS FOR THE KSC DATA SET]

2) Botswana: This 1476 × 256 pixel study area is located in the Okavango Delta, Botswana, and has 14 different land-cover types consisting of seasonal swamps, occasional swamps, and drier woodlands located in the distal portion of the delta. Data from this region were obtained by the NASA Earth Observing 1 (EO-1) satellite for the calibration/validation portion of the mission in 2001. The Hyperion sensor on EO-1 acquires data at 30-m pixel resolution over a 7.7-km strip in 242 bands, covering the 400–2500-nm portion of the spectrum in 10-nm windows. Uncalibrated and noisy bands that cover water absorption features were removed, resulting in 145 features. The land-cover classes in this study were chosen to reflect the impact of flooding on vegetation in the study area. Training data were selected manually using a combination of global positioning system (GPS)-located vegetation surveys, aerial photography from the Aquarap (2000) project, and 2.6-m resolution IKONOS multispectral imagery. The spatially removed test data for the May 31, 2001 acquisition were sampled from spatially contiguous clusters of pixels that were within the same scene but disjoint from those used for the training data [34]. Details of the Botswana data are listed in Table II.

[TABLE II: CLASS NAMES AND NUMBER OF DATA POINTS FOR THE BOTSWANA DATA SET]

Multitemporal data: In order to test the efficacy of the knowledge transfer framework for multitemporal images, data were also obtained from the Okavango region in June and July 2001. While the May scene is characterized by the onset of the annual flooding cycle and some newly burned areas, the progression of the flood and the corresponding vegetation responses are seen in the June and July data. The Botswana data acquired in May had 14 classes, but only nine classes were identified for the June and July images, as the data were acquired over a slightly different area due to a change in the satellite pointing. Additionally, some classes identified in the May 2001 image were excessively fine grained for this sequence, so some of the finer grained classes were aggregated. The classes representing the various land-cover types that occur in this environment are listed in Table III.
