Multiple Criteria for Evaluating Machine Learning Algorithms for Land Cover Classification from Satellite Data

R. S. DeFries* and Jonathan Cheung-Wai Chan†

* Earth System Science Interdisciplinary Center and Department of Geography, University of Maryland, College Park
† Department of Geography, University of Maryland, College Park

Address correspondence to R. DeFries, Dept. of Geography, 2181 Lefrak Hall, Univ. of Maryland, College Park, MD 20742. E-mail: rd63@umail.umd.edu

Received 6 December 1999; revised 14 April 2000.

REMOTE SENS. ENVIRON. 74:503–515 (2000). ©Elsevier Science Inc., 2000. PII S0034-4257(00)00142-5.

Operational monitoring of land cover from satellite data will require automated procedures for analyzing large volumes of data. We propose multiple criteria for assessing algorithms for this task. In addition to standard classification accuracy measures, we propose criteria to account for computational resources required by the algorithms, stability of the algorithms, and robustness to noise in the training data. We also propose that classification accuracy take account, through estimation of misclassification costs, of unequal consequences to the user depending on which cover types are confused. In this article, we apply these criteria to three variants of decision tree classifiers: a standard decision tree implemented in C5.0 and two techniques recently proposed in the machine learning literature known as "bagging" and "boosting." Each of these algorithms is applied to two data sets, a global land cover classification from 8 km AVHRR data and a Landsat Thematic Mapper scene in Peru. Results indicate comparable accuracy of the three variants of the decision tree algorithms on the two data sets, with boosting providing marginally higher accuracies. The bagging and boosting algorithms, however, are both substantially more stable and more robust to noise in the training data compared with the standard C5.0 decision tree. The bagging algorithm is most costly in terms of computational resources, while the standard decision tree is least costly. The results illustrate that the choice of the most suitable algorithm requires consideration of a suite of criteria in addition to the traditional accuracy measures, and that there are likely to be trade-offs between algorithm performance and required computational resources. ©Elsevier Science Inc., 2000

INTRODUCTION

Over the past few decades, satellite data have become one of the primary sources for obtaining information about the vegetation on the Earth's land surface. At the global scale, land cover data sets have been derived for application in a range of earth system models, primarily from data acquired by the Advanced Very High Resolution Radiometer (AVHRR) onboard NOAA meteorological satellites (DeFries et al., 1998; Hansen et al., 2000; Loveland and Belward, 1997). At regional and local scales, data acquired by higher resolution sensors such as Landsat and SPOT have been used extensively to extract detailed information about the land surface in specific locations (e.g., Skole and Tucker, 1993).

A wide variety of techniques has been used to classify land cover over large areas from satellite data. Techniques range from unsupervised clustering algorithms (e.g., Loveland and Belward, 1997; Loveland et al., 1991) to parametric supervised algorithms such as maximum likelihood (e.g., DeFries and Townshend, 1994; Tucker et al., 1985) to machine learning algorithms such as decision trees (e.g., Friedl et al., 1999) and neural networks (e.g., Gopal et al., 1996). For a general review of these techniques, see Jensen (1996), Richards (1993), and Quinlan (1993). While some comparisons of algorithm performance have been published (Friedl et al., 1999), there are no generally accepted criteria for selection of the most appropriate classification algorithm for a given set of circumstances.


With recent and upcoming launches of a large number of satellites, notably NASA's Earth Observing System, the volume of data available for analysis of land cover will increase manyfold (Kalluri et al., 2000). Techniques for extracting land cover information need to be automated to the degree possible to process these large volumes of data. In addition, the techniques need to be objective, reproducible, and feasible to implement within available resources. For example, the international effort on Global Observations of Forest Cover (Ahern et al., 1998; Janetos and Ahern, 1997) aims to characterize the extent of forest cover globally from satellite data at repeated intervals over time. This task can only realistically be achieved through techniques that minimize time-consuming human interpretation and maximize automated procedures for data analysis.

Comparisons of algorithm performance for land cover classification have generally been based on the single criterion of classification accuracy (Friedl et al., 1999; Hansen and Reed, 2000). In an operational context for monitoring land cover from satellite data, there are multiple criteria for assessing the suitability of algorithms in addition to accuracy. Is the algorithm efficient in terms of speed? Does it produce stable results, or is it unacceptably sensitive to minor variations in the input data? How robust is the algorithm to noisy data? Are there tradeoffs between speed and accuracy, for example, that should be considered?

This article sets out a number of criteria for evaluating algorithms for classifying land cover from satellite data. The criteria are intended to highlight the tradeoffs that would be faced in selecting algorithms for operational monitoring of land cover from satellite data. We illustrate methods for quantifying these criteria using two data sets, one derived from AVHRR and one derived from Landsat Thematic Mapper data. For this article, we demonstrate the use of these criteria with various types of univariate decision tree algorithms. The criteria, however, could also be applied to other classification algorithms.

DATA

To explore criteria for assessing algorithms for land cover classification, we use two data sets: multitemporal AVHRR Pathfinder Land data for 1984 and a Landsat Thematic Mapper scene (path/row 006/066, centered on 8.684°S, 74.167°W) around Pucallpa, Peru, acquired 16 October 1996. These data sets were selected because reliable land cover classifications have been derived from them using field knowledge, expert consultation, and human interpretation. We consequently have a high degree of confidence in these land cover classifications. In the absence of true validation data from ground-based measurements, these land cover classifications serve as a basis for test data against which we can compare the performance of the machine learning algorithms. The data sets are described below.

8 km Global Land Cover Classification

DeFries et al. (1998) derived a global land cover classification of 13 cover types based on the AVHRR Pathfinder Land data (Agbu and James, 1994) for 1984. The classification was based on 24 metrics describing the temporal dynamics of vegetation over an annual cycle. These metrics are: maximum annual, minimum annual, mean annual, and amplitude (maximum minus minimum) for each of the AVHRR channels, including the normalized difference vegetation index (NDVI, defined as (Channel 2 − Channel 1)/(Channel 2 + Channel 1)) and Channels 1 (visible reflectance, 0.58–0.69 µm), 2 (near-infrared reflectance, 0.725–1.1 µm), 3 (thermal infrared, 3.55–3.93 µm), 4 (thermal, 10.3–11.3 µm), and 5 (thermal, 11.5–12.5 µm). A decision tree algorithm was used for the classification, but not in a completely automated mode. The decision tree was modified based on human knowledge of global vegetation to obtain the final global land cover map (DeFries et al., 1998).

Training data for the classifier used to generate the 8 km global land cover classification were obtained from a global network of 156 Landsat scenes. As described in DeFries et al. (1998), these scenes were visually interpreted through consultation with ancillary maps and regional experts to identify locations for which the land cover type is known with a high degree of confidence. The scenes were coregistered with the 8 km AVHRR data. Those 8 km pixels containing over 90% of the cover type identified from the Landsat scene were labeled as training data. Approximately 9000 AVHRR pixels of training data were obtained.

Table 1. Cover Types and Number of Pixels in Training and Test Data for 8 km AVHRR Data Used in This Study

Cover Type                       No. of Training Pixels   No. of Test Pixels
Evergreen needleleaf forest               667                     859
Evergreen broadleaf forest               1302                    1089
Deciduous needleleaf forest                48                     164
Deciduous broadleaf forest                473                     313
Mixed forest                              575                     358
Woodlands                                 686                    1174
Wooded grasslands/shrublands              374                     704
Closed bushlands or shrublands            293                     356
Open shrublands                           617                     894
Grasses                                  1309                    1119
Croplands                                1520                    1049
Bare                                     1204                    1313
Mosses and lichens                        202                     652
Total                                    9306                  10,044


For the study described in this article, both data to train the classifiers and data to test the classification results are required (Table 1). For training, we use the 24 metrics from the 9000 pixels identified by overlaying Landsat scenes on the 8 km AVHRR data. Each training pixel is labeled as a cover type based on interpretation of the Landsat scene. For the test data, we obtain a random sample of 10,000 pixels (distributed in proportion to the area occupied by each cover type in the final classification) from the final classification results derived by DeFries et al. (1998). Because this final classification result was examined and modified through human knowledge of global vegetation distributions, we believe that the test data have a high degree of confidence, although it is possible that errors do occur.

Landsat Thematic Mapper Data

In contrast to the coarse resolution 8 km AVHRR data based on multitemporal information, we also test the criteria described in this article using data from the Landsat Thematic Mapper scene around Pucallpa, Peru. This scene was classified by the Landsat Pathfinder project mapping deforestation in the humid tropics (Townshend et al., 1995) for the purpose of determining the extent of deforestation between the 1970s, 1980s, and 1990s. The TM scene includes five bands at 30 m resolution (0.45–0.53 µm, 0.52–0.60 µm, 0.63–0.69 µm, 0.76–0.90 µm, and 1.55–1.75 µm). The scene was classified into six classes (Table 2). The classification approach was a combination of unsupervised and supervised classification techniques using a high degree of human interpretation and expert knowledge about the location (A. Desch, personal communication). As such, we have a high degree of confidence in the classification result.

Table 2. Cover Types and Number of Pixels in Training and Test Data for Landsat Thematic Mapper Data Used in This Study

Cover Type              No. of Training Pixels   No. of Test Pixels
Forest                          1164                   2328
Water                            963                   1959
Cloud                            958                   1939
Shadow                           980                   1990
Degraded forest                  937                   1912
Nonforest vegetation             956                   1956
Total                           5958                 12,084

Training data for this study were selected by sampling the classification result in proportion to the area covered by each class; 5958 pixels were randomly selected. For testing, we randomly selected an additional 12,084 pixels (Table 2). Because both the training and test data were derived from the same classification result and were not independently derived, we expect the accuracies derived in this study to overestimate those that would be obtained in a realistic situation where a classification result is not available. However, we believe that these data sets can nevertheless be used to illustrate the criteria for evaluating the machine learning algorithms.

METHODS AND RESULTS

Data mining techniques have been developing over the past few decades for a large number of applications, ranging from computer security to medical diagnosis to detection of volcanoes on Venus (Brodley et al., 1999). Machine learning, one means of data mining, refers to algorithms that analyze the information, recognize patterns, and improve prediction accuracy through repeated learning from training instances. Decision trees, a machine learning technique particularly suited to applications where it is important for a human to understand the classification structure, have successfully been applied to multidimensional satellite data for extraction of land cover categories (DeFries et al., 1998; Friedl et al., 1999; Hansen et al., 2000).

The multiple criteria for assessing algorithm performance are illustrated in this article with several variants of a basic decision tree algorithm. As no single machine learning algorithm has been demonstrated to be superior for all applications (Kohavi et al., 1996), it is necessary to test a number of algorithms for the specific application, in this case repeatable and objective classification of satellite data into land cover types. While this article illustrates the criteria through various decision tree algorithms, the same criteria could be applied to other types of algorithms such as neural networks, maximum likelihood, and even unsupervised classification techniques.

Decision Tree Algorithms

Decision tree theory (Breiman et al., 1984) has previously been applied to land cover classification from satellite data (DeFries et al., 1998; Friedl and Brodley, 1997; Friedl et al., 1999; Hansen et al., 1996; 2000; Swain and Hauska, 1977). Decision trees predict class membership by recursively partitioning a data set into more homogeneous subsets. Different variables and splits are then used to split the subsets into further subsets. In univariate decision trees, as used for this study, each node is formed from a binary split of one variable. The grown tree can be pruned based on decision rules to produce more stable predictions of class membership.

The decision tree has a number of advantages over traditional classification algorithms (Hansen et al., 1996). First, the univariate decision tree is not based on any assumptions of normality within training statistics and is well suited to situations where a single cover type is represented by more than one cluster in the spectral space. Second, the decision tree can reveal nonlinear and hierarchical relationships between input variables and use these to predict class membership. Third, the decision tree yields a set of rules which are easy to interpret and suitable for deriving a physical understanding of the classification process.


In this study, we use the C5.0 decision tree estimation algorithm, a univariate decision tree algorithm that is the commercial successor of C4.5 (Quinlan, 1993). In a decision tree estimation algorithm, the most important component is the method used to estimate splits at each internal node of the tree. It is this method that determines which features are selected to form the classifier. C5.0 uses the "information gain ratio" to estimate splits at each internal node of the tree. The information gain measures the reduction in entropy in the data produced by a split. Using this metric, the test at each node within a tree is selected based on that subdivision of the data that maximizes the reduction in entropy of the descendant nodes. Given a training data set $T$ composed of observations belonging to one of $k$ classes $\{C_1, C_2, \ldots, C_k\}$, the amount of information required to identify the class for an observation in $T$ is

$$\mathrm{info}(T) = -\sum_{j=1}^{k} \frac{\mathrm{freq}(C_j, T)}{|T|} \log_2 \frac{\mathrm{freq}(C_j, T)}{|T|}, \tag{1}$$

where $\mathrm{freq}(C_j, T)$ is equal to the number of cases in $T$ belonging to class $C_j$, and $|T|$ is the total number of observations in $T$. Given a test $X$ that partitions $T$ into $n$ outcomes $T_1, \ldots, T_n$, the total information content after applying $X$ is

$$\mathrm{info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \, \mathrm{info}(T_i). \tag{2}$$

The information gained by splitting $T$ using $X$ is

$$\mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T). \tag{3}$$

The "gain criterion" selects the test for which $\mathrm{gain}(X)$ is maximum. To compensate for favoring tests with large numbers of splits, $\mathrm{gain}(X)$ is normalized by

$$\mathrm{split\,info}(X) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}. \tag{4}$$

The splitting metric is

$$\mathrm{gain\,ratio}(X) = \mathrm{gain}(X)/\mathrm{split\,info}(X). \tag{5}$$

$T$ is recursively split such that the gain ratio is maximized at each node of the tree. This procedure continues until each leaf node contains only observations from a single class or no gain in information is yielded by further splitting. For a univariate decision tree using continuous attributes, as is the case for this study, the data are partitioned into two outcomes ($n = 2$) at each node based on a threshold value for a single attribute. The threshold value with the greatest gain ratio is selected at each node in the decision tree.

The decision tree resulting from this procedure may be overfit to noise in the training data, so the tree must be pruned to reduce classification errors when data outside of the training set are to be classified. C5.0 uses error-based pruning to remove features from the classifier that are spurious and not supported by the data. For more detail, see Quinlan (1993).
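The split selection can be made concrete in a few lines. The sketch below is ours, not part of the original article (C5.0 is a commercial product whose code is not reproduced here); it implements Eqs. (1)–(5) in Python with NumPy for a binary threshold test on a single continuous attribute, with hypothetical helper names `info` and `gain_ratio`.

```python
# Sketch of Eqs. (1)-(5): entropy, information gain, and gain ratio for a
# binary threshold split on one continuous attribute. Helper names are ours.
import numpy as np

def info(labels):
    """Eq. (1): entropy, in bits, of the class labels in a subset T."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(values, labels, threshold):
    """Eqs. (2)-(5) for the two-outcome test X: value <= threshold."""
    left, right = labels[values <= threshold], labels[values > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0                                    # degenerate split
    w = np.array([len(left), len(right)]) / len(labels)
    info_x = w[0] * info(left) + w[1] * info(right)   # Eq. (2)
    gain = info(labels) - info_x                      # Eq. (3)
    split_info = float(-np.sum(w * np.log2(w)))       # Eq. (4)
    return gain / split_info                          # Eq. (5)

# Pick the threshold with the greatest gain ratio, as described above:
values = np.array([0.10, 0.35, 0.40, 0.75, 0.80, 0.90])
labels = np.array([0, 0, 0, 1, 1, 1])
cuts = (np.sort(values)[:-1] + np.sort(values)[1:]) / 2
best = max(cuts, key=lambda t: gain_ratio(values, labels, t))
```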

Bagging and Boosting

A number of refinements of this basic decision tree algorithm have recently been developed in the machine learning community, including "boosting" and "bagging." Boosting and bagging techniques construct ensembles of individual classifiers and obtain classification decisions by voting from the individual classifiers (Quinlan, 1996). These techniques can be applied to any supervised classification algorithm. In this article we refer only to the application of boosting and bagging for decision trees.

Bagging, proposed by Breiman (1996), generates an ensemble of individual decision trees by bootstrap sampling of the training data set. Multiple samples from the training set are generated by sampling with replacement from the training data. A decision tree classifier is generated for each sample. The final classification result is obtained by plurality vote of the individual classifiers. Bagging has been shown to improve the performance on test data sets in domains other than remote sensing in cases where small changes in the training set cause large changes in the classifier (Breiman, 1996; Quinlan, 1996). Experiments indicate that performance gain reaches a plateau at no more than 100 individual trees (Indurkhya and Weiss, 1998).
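The bootstrap-and-vote structure of bagging can be sketched directly. The example below is our illustration rather than the authors' implementation: scikit-learn's CART-style `DecisionTreeClassifier` stands in for the commercial C5.0 tree, and class labels are assumed to be coded as integers 0..L−1.

```python
# Sketch of bagging (Breiman, 1996): bootstrap the training set, grow one
# tree per sample, classify by plurality vote. CART stands in for C5.0.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_trees):
        # Bootstrap sample: draw |T| observations with replacement.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.asarray(votes)
    # Plurality vote across the ensemble for each test pixel.
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

scikit-learn's `BaggingClassifier` packages the same procedure; the explicit loop is written out here only to expose the bootstrap sampling and the plurality vote.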

Boosting, proposed by Freund and Schapire (1996), is also an ensemble technique where multiple iterations of decision tree classifiers are generated. In this case, the entire training set is used to generate the decision tree. For each iteration of the decision tree, a weight is assigned to each training observation. Observations misclassified in the previous iteration are assigned a heavier weight, so the decision tree is forced to concentrate on those observations that were misclassified in the previous iteration. Each iteration generates a decision tree that aims to correct errors in the previous iteration. The final classifier is generated by voting from the classifications generated from the individual classifiers. In the AdaBoost.M1 algorithm implemented in C5.0 (Quinlan, 1996), voting from the individual decision tree classifiers is weighted by the accuracy of the classifier [see Freund and Schapire (1996) and Friedl et al. (1999) for explanation of how the weightings are calculated in AdaBoost.M1].
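The reweighting loop can be sketched in the same spirit. This is our simplified rendering of an AdaBoost.M1-style procedure (after Freund and Schapire, 1996), not the C5.0 implementation; it again substitutes scikit-learn trees and assumes integer labels 0..L−1.

```python
# Sketch of AdaBoost.M1-style boosting: reweight misclassified observations,
# then combine trees by accuracy-weighted voting. Not the C5.0 implementation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_iter=10):
    w = np.full(len(X), 1.0 / len(X))    # one weight per training observation
    trees, alphas = [], []
    for _ in range(n_iter):
        tree = DecisionTreeClassifier().fit(X, y, sample_weight=w)
        miss = tree.predict(X) != y
        err = w[miss].sum()              # weighted error of this iteration
        if err == 0 or err >= 0.5:       # AdaBoost.M1 stopping conditions
            break
        beta = err / (1 - err)
        w[~miss] *= beta                 # down-weight correct cases so the
        w /= w.sum()                     # next tree concentrates on errors
        trees.append(tree)
        alphas.append(np.log(1 / beta))  # vote weight grows with accuracy
    return trees, alphas

def boosted_predict(trees, alphas, X, n_classes):
    score = np.zeros((len(X), n_classes))
    for tree, a in zip(trees, alphas):
        score[np.arange(len(X)), tree.predict(X)] += a
    return score.argmax(axis=1)
```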


Boosting has been shown to reduce misclassification rates of land cover based on monthly NDVI values obtained from AVHRR data by 20–50%, with most of the benefit achieved after seven boosting iterations (Friedl et al., 1999). On data sets other than in the remote sensing domain, boosting substantially improved classification accuracy on most data sets but severely reduced accuracy on others (Quinlan, 1996). Dietterich (1998) shows that boosting is generally more accurate than bagging for 33 domains in the repository of machine learning databases maintained at the University of California, Irvine (Merz and Murphy, 1996). However, in the presence of noise in the training set, bagging proved more accurate than boosting.

Multiple Criteria for Evaluating Land Cover Classification Algorithms

Selection of the most appropriate algorithm for land cover classification from satellite data in an operational setting will depend on specific circumstances and available resources. Most certainly, the decision involves tradeoffs between a number of important criteria, including accuracy, computational speed, and ability to automate the process. One of the important criteria is the degree to which human interpretation and involvement in the process are feasible and desirable. For example, the unsupervised classification approach used to generate the IGBP DISCover global land cover classification from 1 km AVHRR data (Loveland and Belward, 1997), in which each cluster was interpreted and labeled based on ancillary information, involved many person-years but eliminated the need for cumbersome collection of training data. On the other hand, the global land cover classifications using a decision tree approach (DeFries et al., 1998; Hansen et al., 2000) were able to generate classifications in a quasiautomated fashion but required extensive and time-consuming collection of a global training data set. For future efforts such as the Global Observations of Forest Cover, it will be necessary to evaluate a number of approaches for obtaining land cover information.

Here we present a number of criteria, and methods to quantify them, relevant to the consideration of the most appropriate algorithm for land cover classification. These criteria are: classification accuracy, computational resources, stability of the algorithm, and robustness to noise in the training data.

Classification Accuracy

Classification accuracy is the primary criterion for algorithm comparisons in the literature. Accuracy is commonly measured as the percentage of pixels correctly classified in the test set. It is necessary to consider both overall accuracy (percentage of all test pixels correctly classified) and mean class accuracy (mean accuracy of all classes computed individually) to avoid domination of the accuracy measure by those classes with disproportionate numbers of test pixels. Other measures such as producer's and user's accuracy can also be computed from an error matrix (Congalton, 1991; Congalton and Green, 1999), though these are less commonly reported in the remote sensing literature.

In addition to the overall and mean class accuracies, misclassification between certain classes may be more or less important depending on the application of the land cover classification (DeFries and Los, 1999). For example, misclassification between a needleleaf evergreen and a mixed forest may be inconsequential in a modeling application that does not distinguish between these forest types. In this case, the misclassification cost is zero. Calculation of overall accuracy assumes that all misclassification costs are equal.


Figure 1. Accuracy of standard C5.0 decision tree and boosting and bagging with the decision tree on the 8 km data for overall, mean class, and adjusted accuracy (a, b, and c, respectively) and Landsat data (d, e, and f). Lines in the box plots indicate the median value of the ten trials and shaded boxes give values for the 50th percentiles.


In machine learning, Receiver Operating Characteristic (ROC) analysis has been proposed to describe the predictive behavior of a classifier independent of class distributions or misclassification costs for two-class problems (Provost and Fawcett, 1997; Provost et al., 1998). In ROC analysis of a true-false classification problem, the true positive rate (positives correctly classified/total positives) is plotted against the false positive rate (negatives incorrectly classified/total negatives). If one algorithm dominates the ROC space, meaning that the ROC curves for all other algorithms are beneath it, it can be concluded that the algorithm is better than all others for all possible costs and class distributions. It is, however, possible that a particular algorithm may dominate in only a portion of the curve. If this is the case, selection of the "best" algorithm needs to be done by considering the desired rate of false positive outcomes. For example, false positive outcomes may be less acceptable than false negative outcomes in the case of medical diagnosis. A false positive will precipitate unnecessary medical treatment, but a false negative would lead to neglect when treatment is needed. These techniques in machine learning have only been applied to problems with two classes, and extension to multiclass problems is an active research area (Provost et al., 1998).
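For a two-class problem, the ROC curve can be traced by sweeping a decision threshold over a classifier's confidence scores, as in the following minimal sketch (ours; note that the present study does not itself apply ROC analysis to its multiclass problems).

```python
# Sketch: trace ROC points by sweeping a threshold over two-class scores.
import numpy as np

def roc_points(scores, truth):
    """scores: classifier confidence for 'positive'; truth: 1 = pos, 0 = neg."""
    points = []
    for t in np.sort(np.unique(scores))[::-1]:
        pred = scores >= t
        tpr = np.sum(pred & (truth == 1)) / np.sum(truth == 1)  # true positives
        fpr = np.sum(pred & (truth == 0)) / np.sum(truth == 0)  # false positives
        points.append((fpr, tpr))
    return points
```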

In the absence of a framework to evaluate algorithm performance independent of misclassification cost, we use a "loss matrix" (Margineantu and Dietterich, 1999) to account for unequal misclassification costs (Tables 3 and 4). These loss matrices were constructed rather arbitrarily, based on a presumption that there is a greater cost to confusion of forest types with nonforest types than to confusion within forest types. The actual misclassification costs depend on the specific application of the land cover classification. In the use of a land cover classification within a land surface model, for example, the misclassification costs vary even within the model depending on the scheme for aggregating land cover types for estimating each parameter (DeFries and Los, 1999).

Table 3. Misclassification Costs Used in This Study to Adjust Accuracy Measure for Global Land Cover Classification from 8 km AVHRR Data (a)

Category   Group 1 (b)   Group 2 (c)   Group 3 (d)   Group 4 (e)
Group 1        0             0.3           0.6           1
Group 2        0.3           0             0.3           0.6
Group 3        0.6           0.3           0             0.3
Group 4        1             0.6           0.3           0

(a) Actual misclassification costs would vary with specific applications of the land cover classification.
(b) Group 1: evergreen needleleaf forest; evergreen broadleaf forest; deciduous broadleaf forest; mixed forest; woodlands.
(c) Group 2: wooded grasslands/shrubs; closed bushlands or shrublands.
(d) Group 3: grasses; croplands; mosses and lichens.
(e) Group 4: open shrubland; bare.

Table 4. Misclassification Costs Used in This Study to Adjust Accuracy Measure for Classification from Landsat TM Scene

                       Forest   Water   Cloud   Shadow   Degraded Forest   Nonforest Vegetation
Forest                   0       0.6      1       1            0.3                 0.3
Water                    0.6     0        0.3     0.3          0.6                 0.6
Cloud                    1       0.3      0       0.3          1                   1
Shadow                   1       0.3      0.3     0            1                   1
Degraded forest          0.3     0.6      1       1            0                   0.3
Nonforest vegetation     0.3     0.6      1       1            0.3                 0

For this study, we test accuracy against the test data by using a bootstrap sample of 90% of the training data (with replacement) 10 times. Accuracy measures are calculated as the mean value over the 10 bootstrap training samples. This procedure was carried out to ensure that the accuracy reported is representative of multiple trials rather than a single test.
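All three accuracy measures reported below can be derived from the error (confusion) matrix of a single trial. The sketch below is our interpretation rather than the authors' code; in particular, it folds the loss matrix in as one minus the mean per-pixel misclassification cost, which is one plausible reading of the adjustment described here.

```python
# Sketch: overall, mean class, and cost-adjusted accuracy from a confusion
# matrix `conf` (rows: true class, cols: predicted) and a loss matrix such
# as Table 3 or 4. The adjusted measure is our plausible interpretation.
import numpy as np

def accuracies(conf, loss):
    conf = np.asarray(conf, dtype=float)
    overall = conf.trace() / conf.sum()
    per_class = conf.diagonal() / conf.sum(axis=1)     # accuracy of each class
    mean_class = per_class.mean()
    adjusted = 1.0 - (conf * loss).sum() / conf.sum()  # 1 - mean per-pixel cost
    return overall, mean_class, adjusted
```

Averaging these measures over the 10 bootstrap trials, as described above, then yields the values plotted in Figure 1.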

Figure 2. Overall accuracy with number of iterations used with bagging for a) 8 km data and b) Landsat data. These runs were done using the intentionally mislabeled training data as described in the text.

Figure 3. κ-error diagrams for a) 8 km data and b) Landsat data using 10 classifiers with 10% of training data. Blue indicates results from the standard C5.0 decision tree, red from the results using boosting, and green from the results using bagging.


Figure 4. Location of training data in the 8 km data. Open squares are sites of Landsat scenes used to derive training data. Closed squares are locations of training data intentionally mislabeled for this study.

We report accuracy in three ways (Fig. 1): 1) overall accuracy, 2) mean class accuracy, and 3) adjusted accuracy taking into account the misclassification costs indicated in the loss matrices (Tables 3 and 4). The weightings in these loss matrices are intended to illustrate the concept, based on what the costs might be for a typical application, rather than to quantify the misclassification costs for all applications. These misclassification costs will vary with the application and need to be developed with respect to the specific application of the land cover classification.

Another active area of research in machine learning is incorporation of misclassification costs in the generation of the decision tree itself (Margineantu and Dietterich, 1999; Schiffers, 1997), in which the splitting criteria are based on the misclassification costs. A feature is available in the C5.0 algorithm to assign misclassification costs through a loss matrix. However, we did not experience improved performance when using this feature, probably because it is a complex problem to identify optimum weights from the loss matrix when the number of classes is greater than two (Margineantu and Dietterich, 1999).

Results indicate that the decision tree, bagging, and boosting provide fairly similar accuracies by all three measures on both data sets (Fig. 1). In all cases, boosting provides the highest accuracies, but the differences are small; accuracies are generally within 5% for the three algorithms by all measures. Differences between accuracies are even smaller when they are adjusted for misclassification costs, indicating that error occurs in higher proportion in those classes with low misclassification costs than in those classes with high costs. Judging from this accuracy criterion alone, there is only marginal advantage in choosing boosting over the standard C5.0 decision tree, and even less advantage in choosing bagging of the C5.0 decision tree, for these data sets.

Computational Resources

The computational resources required for the classification are likely to be a key consideration in choosing an algorithm. In the case of decision tree algorithms, minimal resources are required to grow the tree on the training data set, while vast resources might be required to classify unseen cases according to the decision rules.

To compare algorithms, it is necessary to have a measure independent of the computer, programming language, programmer, and implementation details such as counting array indexes or setting pointers in data structures. For example, it would not be reasonable to establish a criterion based on execution time because it varies with the computer. A count of all statements executed by a program would likewise depend on the programming. In machine learning, algorithms are compared based on the "amount of work done" or "complexity measure" (Baase, 1988). This measure simply counts the number of basic operations performed.

In the case of decision trees, the number of basic operations is the number of decision points in the tree (or ensemble of trees in the case of bagging and boosting). When using the same input data, we can compare the algorithms based on the total number of decision points traversed to classify all pixels. For the standard C5.0 decision tree, 80,926 and 97,818 operations are performed to classify the test data based on a tree generated from the training data for the 8 km and Landsat TM data, respectively (Table 5).

Table 5. Number of Operations Required for Decision Tree Algorithms to Classify Data Sets Used in This Study

Algorithm                          8 km AVHRR Data   Landsat TM Data
C5.0 decision tree                      80,926            97,818
C5.0 decision tree with bagging        100 times         100 times
C5.0 decision tree with boosting        10 times          10 times
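For a single tree, this count is straightforward to approximate with scikit-learn's `decision_path`, which reports the nodes each sample visits. The sketch below is ours, with synthetic stand-in data in place of the study's training sets.

```python
# Sketch: count decision points traversed to classify all test pixels with
# one tree. Data here are synthetic stand-ins for the study's training sets.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 24))          # e.g., 24 AVHRR metrics
y_train = rng.integers(0, 13, size=1000)       # e.g., 13 cover types
X_test = rng.normal(size=(500, 24))

tree = DecisionTreeClassifier().fit(X_train, y_train)
# decision_path returns a sparse indicator of every node each sample visits.
nodes_visited = tree.decision_path(X_test).sum()
operations = int(nodes_visited) - len(X_test)  # drop the leaf of each path
# For bagging or boosting, sum the same count over every tree in the ensemble.
```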


Figure 5. Classified Landsat scene used for this study. The box indicates the location of training data intentionally mislabeled for this study.

With bagging, this number of operations increases in proportion to the number of bootstrap samples used to generate the ensemble of individual decision trees. The number of samples is conventionally 100 (Breiman, 1996). Our investigation shows, however, that accuracy does not increase substantially beyond approximately 50 iterations for the 8 km data and 20 iterations for the Landsat data (Fig. 2). Therefore, the number of basic operations required to achieve the benefits of bagging is somewhat less for these data sets than the convention would indicate.

For boosting, the number of operations is in proportion to the number of iterations used. In this study, we used 10 iterations because previous studies with nonremote sensing data suggest that this number provides maximum improvement in classification accuracy (Freund and Schapire, 1997). With application of boosting to remote sensing data, Friedl et al. (1999) report that little accuracy is gained beyond seven iterations.

The comparison of the algorithms according to the "amount of work done" indicates that the standard C5.0 decision tree requires fewer resources than boosting and substantially fewer than bagging. While this conclusion is obvious in the case of the univariate decision trees used for this study, the "amount of work done" provides a framework for assessing the computational resources required for other types of classifiers where such a comparison is not as straightforward.

Figure 6. κ-error diagrams for 8 km data with random noise in the training data at a) 10%, b) 30%, and c) 50%. Blue indicates results from the standard C5.0 decision tree, red from the results using boosting, and green from the results using bagging.

Stability of the Algorithm

It is desirable that an algorithm produce stable results when faced with minor variability in the input data. In the use of satellite data for monitoring land cover, algorithm instability could erroneously indicate changes in land cover when none actually occurred. If training data are used from the same locations at repeated intervals, variability in reflectances would be expected due to bidirectional effects, solar zenith angle, and a variety of other factors. If an algorithm is used to classify land cover type at these intervals, it is necessary to have confidence that it is not overly sensitive to these variations.


To test the stability of the decision tree algorithms, we use κ-error diagrams as introduced by Margineantu and Dietterich (1997). These diagrams help visualize the relationship between accuracy and stability of the decision tree algorithms generated from training sets with minor variability. To approximate training sets with minor variability, we randomly sample 10% of the training data 10 times to generate 10 different training sets.

The stability, or conversely the diversity, of each pair of classifications performed on the 10 training sets is measured by computing a degree-of-agreement statistic κ. A scatter plot is constructed in which each point corresponds to a pair of classifiers: the x coordinate is the diversity value (κ) and the y coordinate is the mean accuracy (or error rate) of the classifiers. κ is defined as

$$\kappa = \frac{\theta_1 - \theta_2}{1 - \theta_2}. \tag{6}$$

$\theta_1$ is an estimate of the probability that the two classifiers agree and is given by

$$\theta_1 = \frac{\sum_{i=1}^{L} C_{ii}}{m}, \tag{7}$$

where $L$ is the number of classes, $C$ is an $L \times L$ square array such that $C_{ij}$ contains the number of test examples assigned to class $i$ by the first classifier and to class $j$ by the second classifier, and $m$ is the total number of test examples. $\theta_2$ is an estimate of the probability that the two classifiers agree by chance:

$$\theta_2 = \sum_{i=1}^{L} \left( \sum_{j=1}^{L} \frac{C_{ij}}{m} \cdot \sum_{j=1}^{L} \frac{C_{ji}}{m} \right). \tag{8}$$

κ = 0 when the agreement of the two classifiers equals that expected by chance, and κ = 1 when the two classifiers agree on every example. Negative values occur when agreement is less than that expected by chance, that is, in the case of systematic bias.
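Eqs. (6)–(8) translate directly into code. The sketch below (ours) computes κ for one pair of classifiers from their predicted labels on the same m test pixels; each pair among the 10 classifiers then contributes one point (κ, mean error) to the diagram.

```python
# Sketch of Eqs. (6)-(8): pairwise agreement statistic kappa for two vectors
# of predicted class labels (coded 0..L-1) on the same m test pixels.
import numpy as np

def kappa(pred_a, pred_b, n_classes):
    m = len(pred_a)
    C = np.zeros((n_classes, n_classes))
    for i, j in zip(pred_a, pred_b):
        C[i, j] += 1                      # L x L pairwise agreement array
    theta1 = C.trace() / m                                       # Eq. (7)
    theta2 = np.sum((C.sum(axis=1) / m) * (C.sum(axis=0) / m))   # Eq. (8)
    return (theta1 - theta2) / (1 - theta2)                      # Eq. (6)
```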


Figure 7. κ-error diagrams for Landsat data with random noise in the training data at a) 10%, b) 30%, and c) 50%. Blue indicates results from the standard C5.0 decision tree, red from the results using boosting, and green from the results using bagging.

Figure 8. κ-error diagrams for a) 8 km and b) Landsat data with mislabeled training data introduced. Blue indicates results from the standard C5.0 decision tree, red from the results using boosting, and green from the results using bagging.

If an algorithm produces stable results, the κ-error diagram produces a compact cloud of points with κ values close to 1. A spread of κ values indicates that the algorithm is producing results that vary when using the different training sets.

For both the 8 km and Landsat data, the standard C5.0 decision tree produces a cloud of points with substantially larger spread than the boosting and bagging results (Fig. 3). For the 8 km data, boosting produced the most compact cloud of points. Coefficients of variation of the κ values were .016 for boosting, compared with .022 for bagging and .027 for the standard C5.0 decision tree. For the Landsat data, both bagging and boosting produce compact clouds (coefficients of variation .013 and .012, respectively), with bagging showing higher κ values for the same mean error value, indicating greater agreement between samples. When input data can be expected to show variability, as is the case with remotely sensed data, these results suggest that bagging and boosting provide a more stable classification result than a standard decision tree.

Robustness to Noise in Training Data

Remotely sensed training data are likely to be noisy due to many factors, including saturation of the signal, missing scans, mislabeling, problems with the sensor, and viewing geometry. Ideally, an algorithm would not be overly sensitive to the presence of noise in the training data. This criterion is related to the stability of the algorithm, but even a stable algorithm will not necessarily perform well in the presence of noise.

For this study, we investigate two types of noise that could realistically occur in the training data: random noise in the input data (input data are metrics in the case of the 8 km AVHRR data and reflectance values in the case of the Landsat TM data) and mislabeling of the cover type in the training data. For random noise, we introduce zero values randomly (10%, 30%, and 50%) into the training input data for both the 8 km and Landsat data to simulate missing data. For mislabeling of the 8 km data, we assigned class 13 to the class label in the training data for all training pixels derived from three Landsat scenes distributed around the world (Fig. 4). This type of mislabeling is likely to occur in the case of erroneous ancillary data or misinterpretation of the Landsat scene from which the training data were derived. For the Landsat data from Peru, we mislabeled approximately 10% of the training data as class 6 in a spatially heterogeneous portion of the scene (Fig. 5). A code sketch of both noise treatments appears at the end of this section.

The κ-error diagrams help in understanding the effect of noise on the algorithms. For the case of random noise, the standard C5.0 decision tree clearly has higher error rates and lower stability, as seen in the larger coefficients of variation, than the bagging and boosting results for both the 8 km and Landsat data (Figs. 6 and 7). Bagging and boosting perform comparably for the 8 km data, while bagging appears slightly more stable, with higher internal agreement, than the boosting result for the Landsat data. In general, the 8 km data have higher error rates than the Landsat data, possibly because the 8 km data contain several classes with relatively few training pixels or because the Landsat training data are derived from the classification result itself. For nonremote sensing data, Weiss (1995) illustrates that noise in the training set leads to a disproportionate number of errors in classes with a small number of training samples.

Mislabeling of training data causes more severe problems in terms of stability for the decision tree algorithms than random noise (Fig. 8). While, overall, the error rates are lower with mislabeled noise than with random noise, the spread of points is much larger. The standard C5.0 decision tree is least stable and has the highest error of all the algorithms for both the 8 km and Landsat data. Bagging produces slightly lower error rates for comparable stability compared with the boosting result.

In sum, bagging and boosting appear substantially more robust to random noise in the training data than the standard C5.0 decision tree. For the Landsat data set, bagging appears more robust than boosting, while in the 8 km data set, bagging and boosting appear comparable. For noise caused by mislabeled training data, all the algorithms produce less stable results than with random noise. However, as with the random noise, bagging and boosting have lower error rates and greater stability than the standard C5.0 decision tree.
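Both noise treatments are simple to reproduce. The sketch below (ours) zeroes a random fraction of the input values to mimic missing data and reassigns the labels of a chosen subset of training pixels, following the fractions and target classes described in this section.

```python
# Sketch of the two noise treatments: random zeroing of input values to
# simulate missing data, and mislabeling a chosen subset of training pixels.
import numpy as np

def add_random_noise(X, fraction, seed=0):
    rng = np.random.default_rng(seed)
    X = X.copy()
    X[rng.random(X.shape) < fraction] = 0.0   # fraction = 0.1, 0.3, or 0.5
    return X

def mislabel(y, index, target_class):
    y = y.copy()
    y[index] = target_class   # e.g., class 13 (8 km data) or class 6 (Landsat)
    return y
```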


SUMMARY AND CONCLUSIONS

In this article, we propose several criteria for evaluating machine learning algorithms for operational monitoring of land cover with satellite data. In addition to the standard criterion of classification accuracy, we compare the computational resources required by the algorithms by quantifying the number of operations that need to be performed. Through the use of κ-error diagrams, we also compare the stability of the algorithms when faced with variability in the training data set and the robustness of the algorithms to noise in the training data. With respect to classification accuracy, we also propose that misclassification costs be taken into account, because not all confusions between cover types have equal consequences for the user.

To illustrate these criteria, we apply them to three variants of decision tree algorithms used in machine learning: the standard decision tree from C5.0, the C5.0 decision tree with "bagging," and the C5.0 decision tree with "boosting." Each of these algorithms is applied to two data sets, a global land cover classification from 8 km AVHRR data and a Landsat Thematic Mapper scene from Pucallpa, Peru. These data sets have each been classified with extensive human interpretation and expert knowledge that provide reliable test data for assessing the classification results.

Results indicate comparable accuracy of the three variants of the decision tree algorithms on the two data sets for all three accuracy measures investigated here (overall accuracy, mean class accuracy, and adjusted accuracy to account for hypothetical misclassification costs). Accuracies are highest for boosting, but only by a few percent. However, the bagging and boosting algorithms are both more stable and more robust to noise in the training data compared with the standard C5.0 decision tree. These advantages are associated with a cost of increased requirements for computational resources. The bagging algorithm is most costly in terms of "amount of work done," while the standard decision tree is least costly (Table 6). The results presented here illustrate that multiple criteria need to be evaluated in assessing the most suitable algorithms for land cover classification. The choice of the most suitable algorithm requires consideration of a number of criteria in addition to the traditional accuracy measures.

Table 6. Relative Ranking (Low, Medium, and High) of Multiple Criteria to Assess Algorithm Performance on 8 km and Landsat Data

                                 Accuracy          Computational Resources   Stability   Robustness to Noise
8 km data
Standard C5.0 decision tree      Slightly lower    Low                       Low         Low
Decision tree with boosting      Slightly higher   Medium                    High        High
Decision tree with bagging       Medium            High                      High        High
Landsat data
Standard C5.0 decision tree      Slightly lower    Low                       Low         Low
Decision tree with boosting      Slightly higher   Medium                    Medium      Medium
Decision tree with bagging       Medium            High                      High        High

This research was supported by NASA Grants NAG56970 and NAG56004. The Landsat Pathfinder project for Deforestation in the Humid Tropics supplied the TM data. We thank Carla Brodley, Purdue University; Mark Friedl, Boston University; Arthur Desch, University of Maryland; and Matt Hansen, University of Maryland, for helpful comments and suggestions.

REFERENCES

Agbu, P. A., and James, M. E. (1994), The NOAA/NASA Pathfinder AVHRR Land Data Set User's Manual, Goddard Distributed Active Archive Center Publications, GCDG, Greenbelt, MD.

Ahern, F., Janetos, A., and Langham, E. (1998), Global Observations of Forest Cover: one component of CEOS' integrated global observing system strategy. In Proceedings of the 27th International Symposium on Remote Sensing of Environment, Tromso, Norway.

Baase, S. (1988), Computer Algorithms: Introduction to Design and Analysis, Addison-Wesley, Reading, MA.

Breiman, L. (1996), Bagging predictors. Mach. Learn. 24:123–140.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984), Classification and Regression Trees, Wadsworth, Monterey, CA.

Brodley, C., Lane, T., and Stough, T. (1999), Knowledge discovery and data mining. Am. Sci. (Jan./Feb.):54–61.

Congalton, R. G. (1991), A review of assessing the accuracy of classifications of remotely sensed data. Remote Sens. Environ. 37:35–46.

Congalton, R. G., and Green, K. (1999), Assessing the Accuracy of Remotely Sensed Data: Principles and Practices, Lewis Publishers, New York.

DeFries, R. S., and Los, S. O. (1999), Implications of land cover misclassification for parameter estimates in global land surface models: an example from the Simple Biosphere Model (SiB2). Photogramm. Eng. Remote Sens. 65:1083–1088.

DeFries, R. S., and Townshend, J. R. G. (1994), NDVI-derived land cover classification at global scales. Int. J. Remote Sens. 15:3567–3586.

DeFries, R., Hansen, M., Townshend, J. R. G., and Sohlberg, R. (1998), Global land cover classifications at 8 km spatial resolution: the use of training data derived from Landsat imagery in decision tree classifiers. Int. J. Remote Sens. 19:3141–3168.

Dietterich, T. G. (in press), An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach. Learn.

Freund, Y., and Schapire, R. E. (1996), Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, San Francisco, CA, pp. 148–156.

Freund, Y., and Schapire, R. E. (1997), A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1):119–139.

Friedl, M. A., and Brodley, C. E. (1997), Decision tree classification of land cover from remotely sensed data. Remote Sens. Environ. 61:399–409.

Friedl, M. A., Brodley, C. E., and Strahler, A. (1999), Maximizing land cover classification accuracies produced by decision trees at continental to global scales. IEEE Trans. Geosci. Remote Sens. 37:969–977.

Gopal, S., Woodcock, C., and Strahler, A. H. (1996), Fuzzy ARTMAP classification of global land cover from AVHRR data set. In Proceedings of the 1996 International Geoscience and Remote Sensing Symposium, Lincoln, NE, 27–31 May, pp. 538–540.

Hansen, M., and Reed, B. (2000), Comparison of IGBP DISCover and University of Maryland 1 km global land cover classifications. Int. J. Remote Sens. 21:1365–1374.

Hansen, M., DeFries, R., Townshend, J. R. G., and Sohlberg, R. (2000), Global land cover classification at 1 km spatial resolution using a classification tree approach. Int. J. Remote Sens. 21:1331–1364.

Hansen, M., Dubayah, R., and DeFries, R. (1996), Classification trees: an alternative to traditional land cover classifiers. Int. J. Remote Sens. 17:1075–1081.

Indurkhya, N., and Weiss, S. M. (1998), Estimating performance gains for voted decision trees. Intell. Data Anal. 2(4):1–10.

Janetos, A. C., and Ahern, F. (1997), CEOS Pilot Project: Global Observations of Forest Cover (GOFC), Report from meeting, Ottawa, Ontario, Canada.

Jensen, J. R. (1996), Introductory Digital Image Processing: A Remote Sensing Perspective, Prentice Hall, Upper Saddle River, NJ.

Kalluri, S. N. V., Jaja, J., and Bader, P. A. (2000), High performance computing algorithms for land cover dynamics using remote sensing data. Int. J. Remote Sens. 21(6):1513–1536.

Kohavi, R., Sommerfield, D., and Dougherty, J. (1996), Data mining using MLC++: a machine learning library in C++. Int. J. Artif. Intell. Tools 6:537–566.

Loveland, T. R., and Belward, A. S. (1997), The IGBP-DIS global 1 km land cover data set, DISCover: first results. Int. J. Remote Sens. 18:3289–3295.

Loveland, T. R., Merchant, J. W., Ohlen, D. O., and Brown, J. F. (1991), Development of a land-cover characteristics database for the conterminous U.S. Photogramm. Eng. Remote Sens. 57:1453–1463.

Margineantu, D., and Dietterich, T. (1997), Pruning adaptive boosting. In Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann, San Francisco.

Margineantu, D. D., and Dietterich, T. G. (1999), Learning decision trees for loss minimization in multi-class problems, Oregon State University, Corvallis.

Merz, C. J., and Murphy, P. M. (1996), UCI repository of machine learning databases, University of California, Irvine, CA.

Provost, F., and Fawcett, T. (1997), Analysis and visualization of classifier performance: comparison under imprecise class and cost distribution. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97), American Association for Artificial Intelligence, Huntington Beach, CA (www.aaai.org).

Provost, F., Fawcett, T., and Kohavi, R. (1998), Building the case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML-98), Madison, WI.

Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.

Quinlan, J. R. (1996), Bagging, boosting and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, AAAI Press, Portland, OR, pp. 725–730.

Richards, J. A. (1993), Remote Sensing Digital Image Analysis: An Introduction, Springer-Verlag, New York.

Schiffers, J. (1997), A classification approach incorporating misclassification costs. Intell. Data Anal. 1(1):1–10.

Skole, D., and Tucker, C. (1993), Tropical deforestation and habitat fragmentation in the Amazon: satellite data from 1978 to 1988. Science 260:1905–1910.

Swain, P. H., and Hauska, H. (1977), The decision tree classifier: design and potential. IEEE Trans. Geosci. Electron. GE-15:142–147.

Townshend, J. R. G., Bell, V., and Desch, A. (1995), The NASA Landsat Pathfinder Humid Tropical Deforestation Project. In Land Satellite Information in the Next Decade, ASPRS Conference, Vienna, VA, 25–28 September, pp. IV-76–IV-87.

Tucker, C. J., Townshend, J. R. G., and Goff, T. E. (1985), African land-cover classification using satellite data. Science 227:369–375.

Weiss, G. M. (1995), Learning with rare cases and small disjuncts. In Machine Learning: Proceedings of the Twelfth International Conference, Morgan Kaufmann, San Francisco, pp. 558–565.
