Multiple Criteria for Evaluating Machine Learning Algorithms for Land Cover Classification from Satellite Data

R. S. DeFries* and Jonathan Cheung-Wai Chan†

* Earth System Science Interdisciplinary Center and Department of Geography, University of Maryland, College Park
† Department of Geography, University of Maryland, College Park

Address correspondence to R. DeFries, Dept. of Geography, 2181 Lefrak Hall, Univ. of Maryland, College Park, MD 20742. E-mail: rd63@umail.umd.edu

Received 6 December 1999; revised 14 April 2000.

REMOTE SENS. ENVIRON. 74:503–515 (2000). ©Elsevier Science Inc., 2000. PII S0034-4257(00)00142-5.

Operational monitoring of land cover from satellite data will require automated procedures for analyzing large volumes of data. We propose multiple criteria for assessing algorithms for this task. In addition to standard classification accuracy measures, we propose criteria to account for computational resources required by the algorithms, stability of the algorithms, and robustness to noise in the training data. We also propose that classification accuracy take account, through estimation of misclassification costs, of unequal consequences to the user depending on which cover types are confused. In this article, we apply these criteria to three variants of decision tree classifiers: a standard decision tree implemented in C5.0 and two techniques recently proposed in the machine learning literature known as "bagging" and "boosting." Each of these algorithms is applied to two data sets, a global land cover classification from 8 km AVHRR data and a Landsat Thematic Mapper scene in Peru. Results indicate comparable accuracy of the three variants of the decision tree algorithms on the two data sets, with boosting providing marginally higher accuracies. The bagging and boosting algorithms, however, are both substantially more stable and more robust to noise in the training data compared with the standard C5.0 decision tree. The bagging algorithm is most costly in terms of computational resources, while the standard decision tree is least costly. The results illustrate that the choice of the most suitable algorithm requires consideration of a suite of criteria in addition to the traditional accuracy measures, and that there are likely to be trade-offs between algorithm performance and required computational resources. ©Elsevier Science Inc., 2000

INTRODUCTION

Over the past few decades, satellite data have become one of the primary sources for obtaining information about the vegetation on the Earth's land surface. At the global scale, land cover data sets have been derived for application in a range of earth system models, primarily from data acquired by the Advanced Very High Resolution Radiometer (AVHRR) onboard NOAA meteorological satellites (DeFries et al., 1998; Hansen et al., 2000; Loveland and Belward, 1997). At regional and local scales, data acquired by higher resolution sensors such as Landsat and SPOT have been used extensively to extract detailed information about the land surface in specific locations (e.g., Skole and Tucker, 1993).

A wide variety of techniques has been used to classify land cover over large areas from satellite data. Techniques range from unsupervised clustering algorithms (e.g., Loveland and Belward, 1997; Loveland et al., 1991) to parametric supervised algorithms such as maximum likelihood (e.g., DeFries and Townshend, 1994; Tucker et al., 1985) to machine learning algorithms such as decision trees (e.g., Friedl et al., 1999) and neural networks (e.g., Gopal et al., 1996). For a general review of these techniques, see Jensen (1996), Richards (1993), and Quinlan (1993). While some comparisons of algorithm performance have been published (Friedl et al., 1999), there are no generally accepted criteria for selection of the most appropriate classification algorithm for a given set of circumstances.


With recent and upcoming launches of a large number of satellites, notably NASA's Earth Observing System, the volume of data available for analysis of land cover will increase manyfold (Kalluri et al., 2000). Techniques for extracting land cover information need to be automated to the degree possible to process these large volumes of data. In addition, the techniques need to be objective, reproducible, and feasible to implement within available resources. For example, the international effort on Global Observations of Forest Cover (Ahern et al., 1998; Janetos and Ahern, 1997) aims to characterize the extent of forest cover globally from satellite data at repeated intervals over time. This task can only realistically be achieved through techniques that minimize time-consuming human interpretation and maximize automated procedures for data analysis.

Comparisons of algorithm performance for land cover classification have generally been based on the single criterion of classification accuracy (Friedl et al., 1999; Hansen and Reed, 2000). In an operational context for monitoring land cover from satellite data, there are multiple criteria for assessing the suitability of algorithms in addition to accuracy. Is the algorithm efficient in terms of speed? Does it produce stable results, or is it unacceptably sensitive to minor variations in the input data? How robust is the algorithm to noisy data? Are there tradeoffs between speed and accuracy, for example, that should be considered?

This article sets out a number of criteria for evaluating algorithms for classifying land cover from satellite data. The criteria are intended to highlight the tradeoffs that would be faced in selecting algorithms for operational monitoring of land cover from satellite data. We illustrate methods for quantifying these criteria using two data sets, one derived from AVHRR and one derived from Landsat Thematic Mapper data. For this article, we demonstrate the use of these criteria with various types of univariate decision tree algorithms. The criteria, however, could also be applied to other classification algorithms.

DATA

To explore criteria for assessing algorithms for land cover classification, we use two data sets: multitemporal AVHRR Pathfinder Land data for 1984 and a Landsat Thematic Mapper scene (path/row 006/066, centered on 8.684°S, 74.167°W) around Pucallpa, Peru, acquired 16 October 1996. These data sets were selected because reliable land cover classifications have been derived from them using field knowledge, expert consultation, and human interpretation. We consequently have a high degree of confidence in these land cover classifications. In the absence of true validation data from ground-based measurements, these land cover classifications serve as a basis for test data against which we can compare the performance of the machine learning algorithms. The data sets are described below.

8 km Global Land Cover Classification

DeFries et al. (1998) derived a global land cover classification of 13 cover types based on the AVHRR Pathfinder Land data (Agbu and James, 1994) for 1984. The classification was based on 24 metrics describing the temporal dynamics of vegetation over an annual cycle. These metrics are: maximum annual, minimum annual, mean annual, and amplitude (maximum minus minimum) for each of the AVHRR channels, including the normalized difference vegetation index (NDVI, defined as (Channel 2 − Channel 1)/(Channel 2 + Channel 1)) and Channels 1 (visible reflectance, 0.58–0.69 µm), 2 (near-infrared reflectance, 0.725–1.1 µm), 3 (thermal infrared, 3.55–3.93 µm), 4 (thermal, 10.3–11.3 µm), and 5 (thermal, 11.5–12.5 µm). A decision tree algorithm was used for the classification, but not in a completely automated mode. The decision tree was modified based on human knowledge of global vegetation to obtain the final global land cover map (DeFries et al., 1998).

Training data for the classifier used to generate the 8 km global land cover classification were obtained from a global network of 156 Landsat scenes. As described in DeFries et al. (1998), these scenes were visually interpreted through consultation with ancillary maps and regional experts to identify locations for which the land cover type is known with a high degree of confidence. The scenes were coregistered with the 8 km AVHRR data. Those 8 km pixels containing over 90% of the cover type identified from the Landsat scene were labeled as training data. Approximately 9000 AVHRR pixels of training data were obtained.

Table 1. Cover Types and Number of Pixels in Training and Test Data for 8 km AVHRR Data Used in This Study

Cover Type                       No. of Training Pixels   No. of Test Pixels
Evergreen needleleaf forest               667                     859
Evergreen broadleaf forest               1302                    1089
Deciduous needleleaf forest                48                     164
Deciduous broadleaf forest                473                     313
Mixed forest                              575                     358
Woodlands                                 686                    1174
Wooded grasslands/shrublands              374                     704
Closed bushlands or shrublands            293                     356
Open shrublands                           617                     894
Grasses                                  1309                    1119
Croplands                                1520                    1049
Bare                                     1204                    1313
Mosses and lichens                        202                     652
Total                                    9306                  10,044


For the study described in this article, both data to train the classifiers and data to test the classification results are required (Table 1). For training, we use the 24 metrics from the 9000 pixels identified by overlaying Landsat scenes on the 8 km AVHRR data. Each training pixel is labeled as a cover type based on interpretation of the Landsat scene. For the test data, we obtain a random sample of 10,000 pixels (distributed in proportion to the area occupied by each cover type in the final classification) from the final classification results derived by DeFries et al. (1998). Because this final classification result was examined and modified through human knowledge of global vegetation distributions, we believe that the test data have a high degree of confidence, although it is possible that errors do occur.

Landsat Thematic Mapper Data

In contrast to the coarse resolution 8 km AVHRR data based on multitemporal information, we also test the criteria described in this article using data from the Landsat Thematic Mapper scene around Pucallpa, Peru. This scene was classified by the Landsat Pathfinder project mapping deforestation in the humid tropics (Townshend et al., 1995) for the purpose of determining the extent of deforestation between the 1970s, 1980s, and 1990s. The TM scene includes five bands at 30 m resolution (0.45–0.53 µm, 0.52–0.60 µm, 0.63–0.69 µm, 0.76–0.90 µm, and 1.55–1.75 µm). The scene was classified into six classes (Table 2). The classification approach was a combination of unsupervised and supervised classification techniques using a high degree of human interpretation and expert knowledge about the location (A. Desch, personal communication). As such, we have a high degree of confidence in the classification result.

Table 2. Cover Types and Number of Pixels in Training and Test Data for Landsat Thematic Mapper Data Used in This Study

Cover Type              No. of Training Pixels   No. of Test Pixels
Forest                          1164                   2328
Water                            963                   1959
Cloud                            958                   1939
Shadow                           980                   1990
Degraded forest                  937                   1912
Nonforest vegetation             956                   1956
Total                           5958                 12,084

Training data for this study were selected by sampling the classification result in proportion to the area covered by each class; 5958 pixels were randomly selected. For testing, we randomly selected an additional 12,084 pixels (Table 2). Because both the training and test data were derived from the same classification result and were not independently derived, we expect the accuracies derived in this study to overestimate those that would be obtained in a realistic situation where a classification result is not available. However, we believe that these data sets can nevertheless be used to illustrate the criteria for evaluating the machine learning algorithms.

METHODS AND RESULTS

Data mining techniques have been developing over the past few decades for a large number of applications, ranging from computer security to medical diagnosis to detection of volcanoes on Venus (Brodley et al., 1999). Machine learning, one means of data mining, refers to algorithms that analyze the information, recognize patterns, and improve prediction accuracy through repeated learning from training instances. Decision trees, a machine learning technique particularly suited to applications where it is important for a human to understand the classification structure, have successfully been applied to multidimensional satellite data for extraction of land cover categories (DeFries et al., 1998; Friedl et al., 1999; Hansen et al., 2000).

The multiple criteria for assessing algorithm performance are illustrated in this article with several variants of a basic decision tree algorithm. As no single machine learning algorithm has been demonstrated to be superior for all applications (Kohavi et al., 1996), it is necessary to test a number of algorithms for the specific application, in this case repeatable and objective classification of satellite data into land cover types. While this article illustrates the criteria through various decision tree algorithms, the same criteria could be applied to other types of algorithms such as neural networks, maximum likelihood, and even unsupervised classification techniques.

Decision Tree Algorithms

Decision tree theory (Breiman et al., 1984) has previously been applied to land cover classification from satellite data (DeFries et al., 1998; Friedl and Brodley, 1997; Friedl et al., 1999; Hansen et al., 1996; 2000; Swain and Hauska, 1977). Decision trees predict class membership by recursively partitioning a data set into more homogeneous subsets. Different variables and splits are then used to split the subsets into further subsets. In univariate decision trees, as used for this study, each node is formed from a binary split of one variable. The grown tree can be pruned based on decision rules to produce more stable predictions of class membership.

The decision tree has a number of advantages over traditional classification algorithms (Hansen et al., 1996). First, the univariate decision tree is not based on any assumptions of normality within training statistics and is well suited to situations where a single cover type is represented by more than one cluster in the spectral space. Second, the decision tree can reveal nonlinear and hierarchical relationships between input variables and use these to predict class membership. Third, the decision tree yields a set of rules which are easy to interpret and suitable for deriving a physical understanding of the classification process.


In this study, we use the C5.0 decision tree estimation algorithm, a univariate decision tree algorithm that is the commercial successor of C4.5 (Quinlan, 1993). In a decision tree estimation algorithm, the most important component is the method used to estimate splits at each internal node of the tree. It is this method that determines which features are selected to form the classifier. C5.0 uses the "information gain ratio" to estimate splits at each internal node of the tree. The information gain measures the reduction in entropy in the data produced by a split. Using this metric, the test at each node within a tree is selected based on that subdivision of the data that maximizes the reduction in entropy of the descendant nodes. Given a training data set $T$ composed of observations belonging to one of $k$ classes $\{C_1, C_2, \ldots, C_k\}$, the amount of information required to identify the class for an observation in $T$ is

$$\mathrm{info}(T) = -\sum_{j=1}^{k} \frac{\mathrm{freq}(C_j, T)}{|T|} \log_2 \frac{\mathrm{freq}(C_j, T)}{|T|}, \tag{1}$$

where $\mathrm{freq}(C_j, T)$ is equal to the number of cases in $T$ belonging to class $C_j$, and $|T|$ is the total number of observations in $T$. Given a test $X$ that partitions $T$ into $n$ outcomes $T_1, \ldots, T_n$, the total information content after applying $X$ is

$$\mathrm{info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \, \mathrm{info}(T_i). \tag{2}$$

The information gained by splitting $T$ using $X$ is

$$\mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T). \tag{3}$$

The "gain criterion" selects the test for which $\mathrm{gain}(X)$ is maximum. To compensate for favoring tests with large numbers of splits, $\mathrm{gain}(X)$ is normalized by

$$\mathrm{split\,info}(X) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}. \tag{4}$$

The splitting metric is

$$\mathrm{gain\,ratio}(X) = \mathrm{gain}(X)/\mathrm{split\,info}(X). \tag{5}$$

$T$ is recursively split such that the gain ratio is maximized at each node of the tree. This procedure continues until each leaf node contains only observations from a single class or no gain in information is yielded by further splitting. For a univariate decision tree using continuous attributes, as is the case for this study, the data are partitioned into two outcomes ($n = 2$) at each node based on a threshold value for a single attribute. The threshold value with the greatest gain ratio is selected at each node in the decision tree.

The decision tree resulting from this procedure may be overfit to noise in the training data, so the tree must be pruned to reduce classification errors when data outside of the training set are to be classified. C5.0 uses error-based pruning to remove features from the classifier that are spurious and not supported by the data. For more detail, see Quinlan (1993).
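The split selection can be made concrete in a few lines. The sketch below is ours, not part of the original article (C5.0 is a commercial product whose code is not reproduced here); it implements Eqs. (1)–(5) in Python with NumPy for a binary threshold test on a single continuous attribute, with hypothetical helper names `info` and `gain_ratio`.

```python
# Sketch of Eqs. (1)-(5): entropy, information gain, and gain ratio for a
# binary threshold split on one continuous attribute. Helper names are ours.
import numpy as np

def info(labels):
    """Eq. (1): entropy, in bits, of the class labels in a subset T."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(values, labels, threshold):
    """Eqs. (2)-(5) for the two-outcome test X: value <= threshold."""
    left, right = labels[values <= threshold], labels[values > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0                                    # degenerate split
    w = np.array([len(left), len(right)]) / len(labels)
    info_x = w[0] * info(left) + w[1] * info(right)   # Eq. (2)
    gain = info(labels) - info_x                      # Eq. (3)
    split_info = float(-np.sum(w * np.log2(w)))       # Eq. (4)
    return gain / split_info                          # Eq. (5)

# Pick the threshold with the greatest gain ratio, as described above:
values = np.array([0.10, 0.35, 0.40, 0.75, 0.80, 0.90])
labels = np.array([0, 0, 0, 1, 1, 1])
cuts = (np.sort(values)[:-1] + np.sort(values)[1:]) / 2
best = max(cuts, key=lambda t: gain_ratio(values, labels, t))
```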

Bagging and Boosting

A number of refinements of this basic decision tree algorithm have recently been developed in the machine learning community, including "boosting" and "bagging." Boosting and bagging techniques construct ensembles of individual classifiers and obtain classification decisions by voting from the individual classifiers (Quinlan, 1996). These techniques can be applied to any supervised classification algorithm. In this article we refer only to the application of boosting and bagging for decision trees.

Bagging, proposed by Breiman (1996), generates an ensemble of individual decision trees by bootstrap sampling of the training data set. Multiple samples from the training set are generated by sampling with replacement from the training data. A decision tree classifier is generated for each sample. The final classification result is obtained by plurality vote of the individual classifiers. Bagging has been shown to improve the performance on test data sets in domains other than remote sensing in cases where small changes in the training set cause large changes in the classifier (Breiman, 1996; Quinlan, 1996). Experiments indicate that performance gain reaches a plateau at no more than 100 individual trees (Indurkhya and Weiss, 1998).
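The bootstrap-and-vote structure of bagging can be sketched directly. The example below is our illustration rather than the authors' implementation: scikit-learn's CART-style `DecisionTreeClassifier` stands in for the commercial C5.0 tree, and class labels are assumed to be coded as integers 0..L−1.

```python
# Sketch of bagging (Breiman, 1996): bootstrap the training set, grow one
# tree per sample, classify by plurality vote. CART stands in for C5.0.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_trees):
        # Bootstrap sample: draw |T| observations with replacement.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.asarray(votes)
    # Plurality vote across the ensemble for each test pixel.
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

scikit-learn's `BaggingClassifier` packages the same procedure; the explicit loop is written out here only to expose the bootstrap sampling and the plurality vote.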

Boosting, proposed by Freund and Schapire (1996), is also an ensemble technique where multiple iterations of decision tree classifiers are generated. In this case, the entire training set is used to generate the decision tree. For each iteration of the decision tree, a weight is assigned to each training observation. Observations misclassified in the previous iteration are assigned a heavier weight, so the decision tree is forced to concentrate on those observations that were misclassified in the previous iteration. Each iteration generates a decision tree that aims to correct errors in the previous iteration. The final classifier is generated by voting from the classifications generated from the individual classifiers. In the AdaBoost.M1 algorithm implemented in C5.0 (Quinlan, 1996), voting from the individual decision tree classifiers is weighted by the accuracy of the classifier [see Freund and Schapire (1996) and Friedl et al. (1999) for explanation of how the weightings are calculated in AdaBoost.M1].
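The reweighting loop can be sketched in the same spirit. This is our simplified rendering of an AdaBoost.M1-style procedure (after Freund and Schapire, 1996), not the C5.0 implementation; it again substitutes scikit-learn trees and assumes integer labels 0..L−1.

```python
# Sketch of AdaBoost.M1-style boosting: reweight misclassified observations,
# then combine trees by accuracy-weighted voting. Not the C5.0 implementation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_iter=10):
    w = np.full(len(X), 1.0 / len(X))    # one weight per training observation
    trees, alphas = [], []
    for _ in range(n_iter):
        tree = DecisionTreeClassifier().fit(X, y, sample_weight=w)
        miss = tree.predict(X) != y
        err = w[miss].sum()              # weighted error of this iteration
        if err == 0 or err >= 0.5:       # AdaBoost.M1 stopping conditions
            break
        beta = err / (1 - err)
        w[~miss] *= beta                 # down-weight correct cases so the
        w /= w.sum()                     # next tree concentrates on errors
        trees.append(tree)
        alphas.append(np.log(1 / beta))  # vote weight grows with accuracy
    return trees, alphas

def boosted_predict(trees, alphas, X, n_classes):
    score = np.zeros((len(X), n_classes))
    for tree, a in zip(trees, alphas):
        score[np.arange(len(X)), tree.predict(X)] += a
    return score.argmax(axis=1)
```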


Boosting has been shown to reduce misclassification rates of land cover based on monthly NDVI values obtained from AVHRR data by 20–50%, with most of the benefit achieved after seven boosting iterations (Friedl et al., 1999). On data sets other than in the remote sensing domain, boosting substantially improved classification accuracy on most data sets but severely reduced accuracy on others (Quinlan, 1996). Dietterich (1998) shows that boosting is generally more accurate than bagging for 33 domains in the repository of machine learning databases maintained at the University of California, Irvine (Merz and Murphy, 1996). However, in the presence of noise in the training set, bagging proved more accurate than boosting.

Multiple Criteria for Evaluating Land Cover Classification Algorithms

Selection of the most appropriate algorithm for land cover classification from satellite data in an operational setting will depend on specific circumstances and available resources. Most certainly, the decision involves tradeoffs between a number of important criteria, including accuracy, computational speed, and ability to automate the process. One of the important criteria is the degree to which human interpretation and involvement in the process are feasible and desirable. For example, the unsupervised classification approach used to generate the IGBP DISCover global land cover classification from 1 km AVHRR data (Loveland and Belward, 1997), in which each cluster was interpreted and labeled based on ancillary information, involved many person-years but eliminated the need for cumbersome collection of training data. On the other hand, the global land cover classifications using a decision tree approach (DeFries et al., 1998; Hansen et al., 2000) were able to generate classifications in a quasiautomated fashion but required extensive and time-consuming collection of a global training data set. For future efforts such as the Global Observations of Forest Cover, it will be necessary to evaluate a number of approaches for obtaining land cover information.

Here we present a number of criteria, and methods to quantify them, relevant to the consideration of the most appropriate algorithm for land cover classification. These criteria are: classification accuracy, computational resources, stability of the algorithm, and robustness to noise in the training data.

Classification Accuracy

Classification accuracy is the primary criterion for algorithm comparisons in the literature. Accuracy is commonly measured as the percentage of pixels correctly classified in the test set. It is necessary to consider both overall accuracy (percentage of all test pixels correctly classified) and mean class accuracy (mean accuracy of all classes computed individually) to avoid domination of the accuracy measure by those classes with disproportionate numbers of test pixels. Other measures such as producer's and user's accuracy can also be computed from an error matrix (Congalton, 1991; Congalton and Green, 1999), though these are less commonly reported in the remote sensing literature.

In addition to the overall and mean class accuracies, misclassification between certain classes may be more or less important depending on the application of the land cover classification (DeFries and Los, 1999). For example, misclassification between a needleleaf evergreen and a mixed forest may be inconsequential in a modeling application that does not distinguish between these forest types. In this case, the misclassification cost is zero. Calculation of overall accuracy assumes that all misclassification costs are equal.


Figure 1. Accuracy of standard C5.0 decision tree and boosting and bagging with the decision tree on the 8 km data for overall, mean class, and adjusted accuracy (a, b, and c, respectively) and Landsat data (d, e, and f). Lines in the box plots indicate the median value of the ten trials and shaded boxes give values for the 50th percentiles.


In machine learning, Receiver Operating Characteristic (ROC) analysis has been proposed to describe the predictive behavior of a classifier independent of class distributions or misclassification costs for two-class problems (Provost and Fawcett, 1997; Provost et al., 1998). In ROC analysis of a true-false classification problem, the true positive rate (positives correctly classified/total positives) is plotted against the false positive rate (negatives incorrectly classified/total negatives). If one algorithm dominates the ROC space, meaning that the ROC curves for all other algorithms are beneath it, it can be concluded that the algorithm is better than all others for all possible costs and class distributions. It is, however, possible that a particular algorithm may dominate in only a portion of the curve. If this is the case, selection of the "best" algorithm needs to be done by considering the desired rate of false positive outcomes. For example, false positive outcomes may be less acceptable than false negative outcomes in the case of medical diagnosis. A false positive will precipitate unnecessary medical treatment, but a false negative would lead to neglect when treatment is needed. These techniques in machine learning have only been applied to problems with two classes, and extension to multiclass problems is an active research area (Provost et al., 1998).
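For a two-class problem, the ROC curve can be traced by sweeping a decision threshold over a classifier's confidence scores, as in the following minimal sketch (ours; note that the present study does not itself apply ROC analysis to its multiclass problems).

```python
# Sketch: trace ROC points by sweeping a threshold over two-class scores.
import numpy as np

def roc_points(scores, truth):
    """scores: classifier confidence for 'positive'; truth: 1 = pos, 0 = neg."""
    points = []
    for t in np.sort(np.unique(scores))[::-1]:
        pred = scores >= t
        tpr = np.sum(pred & (truth == 1)) / np.sum(truth == 1)  # true positives
        fpr = np.sum(pred & (truth == 0)) / np.sum(truth == 0)  # false positives
        points.append((fpr, tpr))
    return points
```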

In the absence of a framework to evaluate algorithm performance independent of misclassification cost, we use a "loss matrix" (Margineantu and Dietterich, 1999) to account for unequal misclassification costs (Tables 3 and 4). These loss matrices were constructed rather arbitrarily, based on a presumption that there is a greater cost to confusion of forest types with nonforest types than to confusion within forest types. The actual misclassification costs depend on the specific application of the land cover classification. In the use of a land cover classification within a land surface model, for example, the misclassification costs vary even within the model depending on the scheme for aggregating land cover types for estimating each parameter (DeFries and Los, 1999).

Table 3. Misclassification Costs Used in This Study to Adjust Accuracy Measure for Global Land Cover Classification from 8 km AVHRR Data (a)

Category   Group 1 (b)   Group 2 (c)   Group 3 (d)   Group 4 (e)
Group 1        0             0.3           0.6           1
Group 2        0.3           0             0.3           0.6
Group 3        0.6           0.3           0             0.3
Group 4        1             0.6           0.3           0

(a) Actual misclassification costs would vary with specific applications of the land cover classification.
(b) Group 1: evergreen needleleaf forest; evergreen broadleaf forest; deciduous broadleaf forest; mixed forest; woodlands.
(c) Group 2: wooded grasslands/shrubs; closed bushlands or shrublands.
(d) Group 3: grasses; croplands; mosses and lichens.
(e) Group 4: open shrubland; bare.

Table 4. Misclassification Costs Used in This Study to Adjust Accuracy Measure for Classification from Landsat TM Scene

                       Forest   Water   Cloud   Shadow   Degraded Forest   Nonforest Vegetation
Forest                   0       0.6      1       1            0.3                 0.3
Water                    0.6     0        0.3     0.3          0.6                 0.6
Cloud                    1       0.3      0       0.3          1                   1
Shadow                   1       0.3      0.3     0            1                   1
Degraded forest          0.3     0.6      1       1            0                   0.3
Nonforest vegetation     0.3     0.6      1       1            0.3                 0

For this study, we test accuracy against the test data by using a bootstrap sample of 90% of the training data (with replacement) 10 times. Accuracy measures are calculated as the mean value over the 10 bootstrap training samples. This procedure was carried out to ensure that the accuracy reported is representative of multiple trials rather than a single test.
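All three accuracy measures reported below can be derived from the error (confusion) matrix of a single trial. The sketch below is our interpretation rather than the authors' code; in particular, it folds the loss matrix in as one minus the mean per-pixel misclassification cost, which is one plausible reading of the adjustment described here.

```python
# Sketch: overall, mean class, and cost-adjusted accuracy from a confusion
# matrix `conf` (rows: true class, cols: predicted) and a loss matrix such
# as Table 3 or 4. The adjusted measure is our plausible interpretation.
import numpy as np

def accuracies(conf, loss):
    conf = np.asarray(conf, dtype=float)
    overall = conf.trace() / conf.sum()
    per_class = conf.diagonal() / conf.sum(axis=1)     # accuracy of each class
    mean_class = per_class.mean()
    adjusted = 1.0 - (conf * loss).sum() / conf.sum()  # 1 - mean per-pixel cost
    return overall, mean_class, adjusted
```

Averaging these measures over the 10 bootstrap trials, as described above, then yields the values plotted in Figure 1.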

Figure 2. Overall accuracy with number of iterations used with bagging for a) 8 km data and b) Landsat data. These runs were done using the intentionally mislabeled training data as described in the text.

Figure 3. κ-error diagrams for a) 8 km data and b) Landsat data using 10 classifiers with 10% of training data. Blue indicates results from the standard C5.0 decision tree, red from the results using boosting, and green from the results using bagging.


Figure 4. Location of training data in the 8 km data. Open squares are sites of Landsat scenes used to derive training data. Closed squares are locations of training data intentionally mislabeled for this study.

We report accuracy in three ways (Fig. 1): 1) overall accuracy, 2) mean class accuracy, and 3) adjusted accuracy taking into account the misclassification costs indicated in the loss matrices (Tables 3 and 4). The weightings in these loss matrices are intended to illustrate the concept, based on what the costs might be for a typical application, rather than to quantify the misclassification costs for all applications. These misclassification costs will vary with the application and need to be developed with respect to the specific application of the land cover classification.

Another active area of research in machine learning is incorporation of misclassification costs in the generation of the decision tree itself (Margineantu and Dietterich, 1999; Schiffers, 1997), in which the splitting criteria are based on the misclassification costs. A feature is available in the C5.0 algorithm to assign misclassification costs through a loss matrix. However, we did not experience improved performance when using this feature, probably because it is a complex problem to identify optimum weights from the loss matrix when the number of classes is greater than two (Margineantu and Dietterich, 1999).

Results indicate that the decision tree, bagging, and boosting provide fairly similar accuracies by all three measures on both data sets (Fig. 1). In all cases, boosting provides the highest accuracies, but the differences are small; accuracies are generally within 5% for the three algorithms by all measures. Differences between accuracies are even smaller when they are adjusted for misclassification costs, indicating that error occurs in higher proportion in those classes with low misclassification costs than in those classes with high costs. Judging from this accuracy criterion alone, there is only marginal advantage in choosing boosting over the standard C5.0 decision tree, and even less advantage in choosing bagging of the C5.0 decision tree, for these data sets.

Computational Resources

The computational resources required for the classification are likely to be a key consideration in choosing an algorithm. In the case of decision tree algorithms, minimal resources are required to grow the tree on the training data set, while vast resources might be required to classify unseen cases according to the decision rules.

To compare algorithms, it is necessary to have a measure independent of the computer, programming language, programmer, and implementation details such as counting array indexes or setting pointers in data structures. For example, it would not be reasonable to establish a criterion based on execution time because it varies with the computer. A count of all statements executed by a program would likewise depend on the programming. In machine learning, algorithms are compared based on the "amount of work done" or "complexity measure" (Baase, 1988). This measure simply counts the number of basic operations performed.

In the case of decision trees, the number of basic operations is the number of decision points in the tree (or ensemble of trees in the case of bagging and boosting). When using the same input data, we can compare the algorithms based on the total number of decision points traversed to classify all pixels. For the standard C5.0 decision tree, 80,926 and 97,818 operations are performed to classify the test data based on a tree generated from the training data for the 8 km and Landsat TM data, respectively (Table 5).

Table 5. Number of Operations Required for Decision Tree Algorithms to Classify Data Sets Used in This Study

Algorithm                          8 km AVHRR Data   Landsat TM Data
C5.0 decision tree                      80,926            97,818
C5.0 decision tree with bagging        100 times         100 times
C5.0 decision tree with boosting        10 times          10 times
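For a single tree, this count is straightforward to approximate with scikit-learn's `decision_path`, which reports the nodes each sample visits. The sketch below is ours, with synthetic stand-in data in place of the study's training sets.

```python
# Sketch: count decision points traversed to classify all test pixels with
# one tree. Data here are synthetic stand-ins for the study's training sets.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 24))          # e.g., 24 AVHRR metrics
y_train = rng.integers(0, 13, size=1000)       # e.g., 13 cover types
X_test = rng.normal(size=(500, 24))

tree = DecisionTreeClassifier().fit(X_train, y_train)
# decision_path returns a sparse indicator of every node each sample visits.
nodes_visited = tree.decision_path(X_test).sum()
operations = int(nodes_visited) - len(X_test)  # drop the leaf of each path
# For bagging or boosting, sum the same count over every tree in the ensemble.
```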


Figure 5. Classified Landsat scene used for this study. The box indicates the location of training data intentionally mislabeled for this study.

With bagging, this number of operations increases in proportion to the number of bootstrap samples used to generate the ensemble of individual decision trees. The number of samples is conventionally 100 (Breiman, 1996). Our investigation shows, however, that accuracy does not increase substantially beyond approximately 50 iterations for the 8 km data and 20 iterations for the Landsat data (Fig. 2). Therefore, the number of basic operations required to achieve the benefits of bagging is somewhat less for these data sets than the convention would indicate.

For boosting, the number of operations is in proportion to the number of iterations used. In this study, we used 10 iterations because previous studies with nonremote sensing data suggest that this number provides maximum improvement in classification accuracy (Freund and Schapire, 1997). With application of boosting to remote sensing data, Friedl et al. (1999) report that little accuracy is gained beyond seven iterations.

The comparison of the algorithms according to the "amount of work done" indicates that the standard C5.0 decision tree requires fewer resources than boosting and substantially fewer than bagging. While this conclusion is obvious in the case of the univariate decision trees used for this study, the "amount of work done" provides a framework for assessing the computational resources required for other types of classifiers where such a comparison is not as straightforward.

Figure 6. κ-error diagrams for 8 km data with random noise in the training data at a) 10%, b) 30%, and c) 50%. Blue indicates results from the standard C5.0 decision tree, red from the results using boosting, and green from the results using bagging.

Stability of the Algorithm

It is desirable that an algorithm produce stable results when faced with minor variability in the input data. In the use of satellite data for monitoring land cover, algorithm instability could erroneously indicate changes in land cover when none actually occurred. If training data are used from the same locations at repeated intervals, variability in reflectances would be expected due to bidirectional effects, solar zenith angle, and a variety of other factors. If an algorithm is used to classify land cover type at these intervals, it is necessary to have confidence that it is not overly sensitive to these variations.


To test the stability of the decision tree algorithms, we use κ-error diagrams as introduced by Margineantu and Dietterich (1997). These diagrams help visualize the relationship between accuracy and stability of the decision tree algorithms generated from training sets with minor variability. To approximate training sets with minor variability, we randomly sample 10% of the training data 10 times to generate 10 different training sets.

The stability, or conversely the diversity, of each pair of classifications performed on the 10 training sets is measured by computing a degree-of-agreement statistic κ. A scatter plot is constructed in which each point corresponds to a pair of classifiers: the x coordinate is the diversity value (κ) and the y coordinate is the mean accuracy (or error rate) of the classifiers. κ is defined as

$$\kappa = \frac{\theta_1 - \theta_2}{1 - \theta_2}. \tag{6}$$

$\theta_1$ is an estimate of the probability that the two classifiers agree and is given by

$$\theta_1 = \frac{\sum_{i=1}^{L} C_{ii}}{m}, \tag{7}$$

where $L$ is the number of classes, $C$ is an $L \times L$ square array such that $C_{ij}$ contains the number of test examples assigned to class $i$ by the first classifier and to class $j$ by the second classifier, and $m$ is the total number of test examples. $\theta_2$ is an estimate of the probability that the two classifiers agree by chance:

$$\theta_2 = \sum_{i=1}^{L} \left( \sum_{j=1}^{L} \frac{C_{ij}}{m} \cdot \sum_{j=1}^{L} \frac{C_{ji}}{m} \right). \tag{8}$$

κ = 0 when the agreement of the two classifiers equals that expected by chance, and κ = 1 when the two classifiers agree on every example. Negative values occur when agreement is less than that expected by chance, that is, in the case of systematic bias.
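Eqs. (6)–(8) translate directly into code. The sketch below (ours) computes κ for one pair of classifiers from their predicted labels on the same m test pixels; each pair among the 10 classifiers then contributes one point (κ, mean error) to the diagram.

```python
# Sketch of Eqs. (6)-(8): pairwise agreement statistic kappa for two vectors
# of predicted class labels (coded 0..L-1) on the same m test pixels.
import numpy as np

def kappa(pred_a, pred_b, n_classes):
    m = len(pred_a)
    C = np.zeros((n_classes, n_classes))
    for i, j in zip(pred_a, pred_b):
        C[i, j] += 1                      # L x L pairwise agreement array
    theta1 = C.trace() / m                                       # Eq. (7)
    theta2 = np.sum((C.sum(axis=1) / m) * (C.sum(axis=0) / m))   # Eq. (8)
    return (theta1 - theta2) / (1 - theta2)                      # Eq. (6)
```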


Figure 7. κ-error diagrams for Landsat data with random noise in the training data at a) 10%, b) 30%, and c) 50%. Blue indicates results from the standard C5.0 decision tree, red from the results using boosting, and green from the results using bagging.

Figure 8. κ-error diagrams for a) 8 km and b) Landsat data with mislabeled training data introduced. Blue indicates results from the standard C5.0 decision tree, red from the results using boosting, and green from the results using bagging.

If an algorithm produces stable results, the κ-error diagram produces a compact cloud of points with κ values close to 1. A spread of κ values indicates that the algorithm is producing results that vary when using the different training sets.

For both the 8 km and Landsat data, the standard C5.0 decision tree produces a cloud of points with substantially larger spread than the boosting and bagging results (Fig. 3). For the 8 km data, boosting produced the most compact cloud of points. Coefficients of variation of the κ values were .016 for boosting, compared with .022 for bagging and .027 for the standard C5.0 decision tree. For the Landsat data, both bagging and boosting produce compact clouds (coefficients of variation .013 and .012, respectively), with bagging showing higher κ values for the same mean error value, indicating greater agreement between samples. When input data can be expected to show variability, as is the case with remotely sensed data, these results suggest that bagging and boosting provide a more stable classification result than a standard decision tree.

Robustness to Noise in Training Data

Remotely sensed training data are likely to be noisy due to many factors, including saturation of the signal, missing scans, mislabeling, problems with the sensor, and viewing geometry. Ideally, an algorithm would not be overly sensitive to the presence of noise in the training data. This criterion is related to the stability of the algorithm, but even a stable algorithm will not necessarily perform well in the presence of noise.

For this study, we investigate two types of noise that could realistically occur in the training data: random noise in the input data (input data are metrics in the case of the 8 km AVHRR data and reflectance values in the case of the Landsat TM data) and mislabeling of the cover type in the training data. For random noise, we introduce zero values randomly (10%, 30%, and 50%) into the training input data for both the 8 km and Landsat data to simulate missing data. For mislabeling of the 8 km data, we assigned class 13 to the class label in the training data for all training pixels derived from three Landsat scenes distributed around the world (Fig. 4). This type of mislabeling is likely to occur in the case of erroneous ancillary data or misinterpretation of the Landsat scene from which the training data were derived. For the Landsat data from Peru, we mislabeled approximately 10% of the training data as class 6 in a spatially heterogeneous portion of the scene (Fig. 5). A code sketch of both noise treatments appears at the end of this section.

The κ-error diagrams help in understanding the effect of noise on the algorithms. For the case of random noise, the standard C5.0 decision tree clearly has higher error rates and lower stability, as seen in the larger coefficients of variation, than the bagging and boosting results for both the 8 km and Landsat data (Figs. 6 and 7). Bagging and boosting perform comparably for the 8 km data, while bagging appears slightly more stable, with higher internal agreement, than the boosting result for the Landsat data. In general, the 8 km data have higher error rates than the Landsat data, possibly because the 8 km data contain several classes with relatively few training pixels or because the Landsat training data are derived from the classification result itself. For nonremote sensing data, Weiss (1995) illustrates that noise in the training set leads to a disproportionate number of errors in classes with a small number of training samples.

Mislabeling of training data causes more severe problems in terms of stability for the decision tree algorithms than random noise (Fig. 8). While, overall, the error rates are lower with mislabeled noise than with random noise, the spread of points is much larger. The standard C5.0 decision tree is least stable and has the highest error of all the algorithms for both the 8 km and Landsat data. Bagging produces slightly lower error rates for comparable stability compared with the boosting result.

In sum, bagging and boosting appear substantially more robust to random noise in the training data than the standard C5.0 decision tree. For the Landsat data set, bagging appears more robust than boosting, while in the 8 km data set, bagging and boosting appear comparable. For noise caused by mislabeled training data, all the algorithms produce less stable results than with random noise. However, as with the random noise, bagging and boosting have lower error rates and greater stability than the standard C5.0 decision tree.
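Both noise treatments are simple to reproduce. The sketch below (ours) zeroes a random fraction of the input values to mimic missing data and reassigns the labels of a chosen subset of training pixels, following the fractions and target classes described in this section.

```python
# Sketch of the two noise treatments: random zeroing of input values to
# simulate missing data, and mislabeling a chosen subset of training pixels.
import numpy as np

def add_random_noise(X, fraction, seed=0):
    rng = np.random.default_rng(seed)
    X = X.copy()
    X[rng.random(X.shape) < fraction] = 0.0   # fraction = 0.1, 0.3, or 0.5
    return X

def mislabel(y, index, target_class):
    y = y.copy()
    y[index] = target_class   # e.g., class 13 (8 km data) or class 6 (Landsat)
    return y
```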


SUMMARY AND CONCLUSIONS

In this article, we propose several criteria for evaluating machine learning algorithms for operational monitoring of land cover with satellite data. In addition to the standard criterion of classification accuracy, we compare the computational resources required by the algorithms by quantifying the number of operations that need to be performed. Through the use of κ-error diagrams, we also compare the stability of the algorithms when faced with variability in the training data set and the robustness of the algorithms to noise in the training data. With respect to classification accuracy, we also propose that misclassification costs be taken into account, because not all confusions between cover types have equal consequences for the user.

To illustrate these criteria, we apply them to three variants of decision tree algorithms used in machine learning: the standard decision tree from C5.0, the C5.0 decision tree with "bagging," and the C5.0 decision tree with "boosting." Each of these algorithms is applied to two data sets, a global land cover classification from 8 km AVHRR data and a Landsat Thematic Mapper scene from Pucallpa, Peru. These data sets have each been classified with extensive human interpretation and expert knowledge that provide reliable test data for assessing the classification results.

Results indicate comparable accuracy of the three variants of the decision tree algorithms on the two data sets for all three accuracy measures investigated here (overall accuracy, mean class accuracy, and adjusted accuracy to account for hypothetical misclassification costs). Accuracies are highest for boosting, but only by a few percent. However, the bagging and boosting algorithms are both more stable and more robust to noise in the training data compared with the standard C5.0 decision tree. These advantages are associated with a cost of increased requirements for computational resources. The bagging algorithm is most costly in terms of "amount of work done," while the standard decision tree is least costly (Table 6). The results presented here illustrate that multiple criteria need to be evaluated in assessing the most suitable algorithms for land cover classification. The choice of the most suitable algorithm requires consideration of a number of criteria in addition to the traditional accuracy measures.

Table 6. Relative Ranking (Low, Medium, and High) of Multiple Criteria to Assess Algorithm Performance on 8 km and Landsat Data

                                 Accuracy          Computational Resources   Stability   Robustness to Noise
8 km data
Standard C5.0 decision tree      Slightly lower    Low                       Low         Low
Decision tree with boosting      Slightly higher   Medium                    High        High
Decision tree with bagging       Medium            High                      High        High
Landsat data
Standard C5.0 decision tree      Slightly lower    Low                       Low         Low
Decision tree with boosting      Slightly higher   Medium                    Medium      Medium
Decision tree with bagging       Medium            High                      High        High

This research was supported by NASA Grants NAG56970 and NAG56004. The Landsat Pathfinder project for Deforestation in the Humid Tropics supplied the TM data. We thank Carla Brodley, Purdue University; Mark Friedl, Boston University; Arthur Desch, University of Maryland; and Matt Hansen, University of Maryland, for helpful comments and suggestions.

REFERENCES

Agbu, P. A., and James, M. E. (1994), The NOAA/NASA Pathfinder AVHRR Land Data Set User's Manual, Goddard Distributed Active Archive Center Publications, GCDG, Greenbelt, MD.

Ahern, F., Janetos, A., and Langham, E. (1998), Global Observations of Forest Cover: one component of CEOS' integrated global observing system strategy. In Proceedings of the 27th International Symposium on Remote Sensing of Environment, Tromso, Norway.

Baase, S. (1988), Computer Algorithms: Introduction to Design and Analysis, Addison-Wesley, Reading, MA.

Breiman, L. (1996), Bagging predictors. Mach. Learn. 24:123–140.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984), Classification and Regression Trees, Wadsworth, Monterey, CA.

Brodley, C., Lane, T., and Stough, T. (1999), Knowledge discovery and data mining. Am. Sci. (Jan./Feb.):54–61.

Congalton, R. G. (1991), A review of assessing the accuracy of classifications of remotely sensed data. Remote Sens. Environ. 37:35–46.

Congalton, R. G., and Green, K. (1999), Assessing the Accuracy of Remotely Sensed Data: Principles and Practices, Lewis Publishers, New York.

DeFries, R. S., and Los, S. O. (1999), Implications of land cover misclassification for parameter estimates in global land surface models: an example from the Simple Biosphere Model (SiB2). Photogramm. Eng. Remote Sens. 65:1083–1088.

DeFries, R. S., and Townshend, J. R. G. (1994), NDVI-derived land cover classification at global scales. Int. J. Remote Sens. 15:3567–3586.

DeFries, R., Hansen, M., Townshend, J. R. G., and Sohlberg, R. (1998), Global land cover classifications at 8 km spatial resolution: the use of training data derived from Landsat imagery in decision tree classifiers. Int. J. Remote Sens. 19:3141–3168.

Dietterich, T. G. (in press), An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach. Learn.

Freund, Y., and Schapire, R. E. (1996), Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, San Francisco, CA, pp. 148–156.

Freund, Y., and Schapire, R. E. (1997), A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1):119–139.

Friedl, M. A., and Brodley, C. E. (1997), Decision tree classification of land cover from remotely sensed data. Remote Sens. Environ. 61:399–409.

Friedl, M. A., Brodley, C. E., and Strahler, A. (1999), Maximizing land cover classification accuracies produced by decision trees at continental to global scales. IEEE Trans. Geosci. Remote Sens. 37:969–977.

Gopal, S., Woodcock, C., and Strahler, A. H. (1996), Fuzzy ARTMAP classification of global land cover from AVHRR data set. In Proceedings of the 1996 International Geoscience and Remote Sensing Symposium, Lincoln, NE, 27–31 May, pp. 538–540.

Hansen, M., and Reed, B. (2000), Comparison of IGBP DISCover and University of Maryland 1 km global land cover classifications. Int. J. Remote Sens. 21:1365–1374.

Hansen, M., DeFries, R., Townshend, J. R. G., and Sohlberg, R. (2000), Global land cover classification at 1 km spatial resolution using a classification tree approach. Int. J. Remote Sens. 21:1331–1364.

Hansen, M., Dubayah, R., and DeFries, R. (1996), Classification trees: an alternative to traditional land cover classifiers. Int. J. Remote Sens. 17:1075–1081.

Indurkhya, N., and Weiss, S. M. (1998), Estimating performance gains for voted decision trees. Intell. Data Anal. 2(4):1–10.

Janetos, A. C., and Ahern, F. (1997), CEOS Pilot Project: Global Observations of Forest Cover (GOFC), Report from meeting, Ottawa, Ontario, Canada.

Jensen, J. R. (1996), Introductory Digital Image Processing: A Remote Sensing Perspective, Prentice Hall, Upper Saddle River, NJ.

Kalluri, S. N. V., Jaja, J., and Bader, P. A. (2000), High performance computing algorithms for land cover dynamics using remote sensing data. Int. J. Remote Sens. 21(6):1513–1536.

Kohavi, R., Sommerfield, D., and Dougherty, J. (1996), Data mining using MLC++: a machine learning library in C++. Int. J. Artif. Intell. Tools 6:537–566.

Loveland, T. R., and Belward, A. S. (1997), The IGBP-DIS global 1 km land cover data set, DISCover: first results. Int. J. Remote Sens. 18:3289–3295.

Loveland, T. R., Merchant, J. W., Ohlen, D. O., and Brown, J. F. (1991), Development of a land-cover characteristics database for the conterminous U.S. Photogramm. Eng. Remote Sens. 57:1453–1463.

Margineantu, D., and Dietterich, T. (1997), Pruning adaptive boosting. In Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann, San Francisco.

Margineantu, D. D., and Dietterich, T. G. (1999), Learning decision trees for loss minimization in multi-class problems, Oregon State University, Corvallis.

Merz, C. J., and Murphy, P. M. (1996), UCI repository of machine learning databases, University of California, Irvine, CA.

Provost, F., and Fawcett, T. (1997), Analysis and visualization of classifier performance: comparison under imprecise class and cost distribution. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97), American Association for Artificial Intelligence, Huntington Beach, CA (www.aaai.org).

Provost, F., Fawcett, T., and Kohavi, R. (1998), Building the case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML-98), Madison, WI.

Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.

Quinlan, J. R. (1996), Bagging, boosting and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, AAAI Press, Portland, OR, pp. 725–730.

Richards, J. A. (1993), Remote Sensing Digital Image Analysis: An Introduction, Springer-Verlag, New York.

Schiffers, J. (1997), A classification approach incorporating misclassification costs. Intell. Data Anal. 1(1):1–10.

Skole, D., and Tucker, C. (1993), Tropical deforestation and habitat fragmentation in the Amazon: satellite data from 1978 to 1988. Science 260:1905–1910.

Swain, P. H., and Hauska, H. (1977), The decision tree classifier: design and potential. IEEE Trans. Geosci. Electron. GE-15:142–147.

Townshend, J. R. G., Bell, V., and Desch, A. (1995), The NASA Landsat Pathfinder Humid Tropical Deforestation Project. In Land Satellite Information in the Next Decade, ASPRS Conference, Vienna, VA, 25–28 September, pp. IV-76–IV-87.

Tucker, C. J., Townshend, J. R. G., and Goff, T. E. (1985), African land-cover classification using satellite data. Science 227:369–375.

Weiss, G. M. (1995), Learning with rare cases and small disjuncts. In Machine Learning: Proceedings of the Twelfth International Conference, Morgan Kaufmann, San Francisco, pp. 558–565.
