IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 42, NO. 8, AUGUST 2004

Classification of Hyperspectral Remote Sensing Images With Support Vector Machines

Farid Melgani, Member, IEEE, and Lorenzo Bruzzone, Senior Member, IEEE

Abstract—This paper addresses the problem of the classification of hyperspectral remote sensing images by support vector machines (SVMs). First, we propose a theoretical discussion and experimental analysis aimed at understanding and assessing the potentialities of SVM classifiers in hyperdimensional feature spaces. Then, we assess the effectiveness of SVMs with respect to conventional feature-reduction-based approaches and their performances in hypersubspaces of various dimensionalities. To sustain such an analysis, the performances of SVMs are compared with those of two other nonparametric classifiers (i.e., radial basis function neural networks and the K-nearest neighbor classifier). Finally, we study the potentially critical issue of applying binary SVMs to multiclass problems in hyperspectral data. In particular, four different multiclass strategies are analyzed and compared: the one-against-all, the one-against-one, and two hierarchical tree-based strategies. Different performance indicators have been used to support our experimental studies in a detailed and accurate way, i.e., the classification accuracy, the computational time, the stability to parameter setting, and the complexity of the multiclass architecture. The results obtained on a real Airborne Visible/Infrared Imaging Spectroradiometer hyperspectral dataset allow us to conclude that, whatever the multiclass strategy adopted, SVMs are a valid and effective alternative to conventional pattern recognition approaches (feature-reduction procedures combined with a classification method) for the classification of hyperspectral remote sensing data.

Index Terms—Classification, feature reduction, Hughes phenomenon, hyperspectral images, multiclass problems, remote sensing, support vector machines (SVMs).

Manuscript received November 4, 2003; revised May 16, 2004. This work was supported by the Italian Ministry of Education, Research and University (MIUR). The authors are with the Department of Information and Communication Technologies, University of Trento, I-38050 Trento, Italy (e-mail: melgani@dit.unitn.it; lorenzo.bruzzone@ing.unitn.it). Digital Object Identifier 10.1109/TGRS.2004.831865

I. INTRODUCTION

REMOTE sensing images acquired by multispectral sensors, such as the widely used Landsat Thematic Mapper (TM) sensor, have shown their usefulness in numerous earth observation (EO) applications. In general, the relatively small number of acquisition channels that characterizes multispectral sensors may be sufficient to discriminate among different land-cover classes (e.g., forestry, water, crops, urban areas, etc.). However, their discrimination capability is very limited when different types (or conditions) of the same species (e.g., different types of forest) are to be recognized. Hyperspectral sensors can be used to deal with this problem. These sensors are characterized by a very high spectral resolution that usually results in hundreds of observation channels. Thanks to these channels, it is possible to address various additional applications requiring very high discrimination capabilities in the spectral domain (including material quantification and target detection). From a methodological viewpoint, the automatic analysis of hyperspectral data is not a trivial task. In particular, it is made complex by many factors, such as: 1) the large spatial variability of the hyperspectral signature of each land-cover class; 2) atmospheric effects; and 3) the curse of dimensionality. In the context of supervised classification, one of the main difficulties is related to the small ratio between the number of available training samples and the number of features. This makes it impossible to obtain reasonable estimates of the class-conditional hyperdimensional probability density functions used in standard statistical classifiers. As a consequence, on increasing the number of features given as input to the classifier over a given threshold (which depends on the number of training samples and the kind of classifier adopted), the classification accuracy decreases (this behavior is known as the Hughes phenomenon [1]).

Much work has been carried out in the literature to overcome this methodological issue. Four main approaches can be identified: 1) regularization of the sample covariance matrix; 2) adaptive statistics estimation by the exploitation of the classified (semilabeled) samples; 3) preprocessing techniques based on feature selection/extraction, aimed at reducing/transforming the original feature space into another space of a lower dimensionality; and 4) analysis of the spectral signatures to model the classes.

The first approach uses the multivariate normal (Gaussian) probability density model, which is a widely accepted statistical model for optically remotely sensed data. For each information class, such a model requires the correct estimation of first- and second-order statistics. In the presence of an unfavorable ratio between the number of available training samples and features, the common way of estimating the covariance matrix may lead to inaccurate estimations (that may make it impossible to invert the covariance matrix in maximum-likelihood (ML) classifiers). Several alternative and improved covariance matrix estimators have been proposed to reduce the variance of the estimate for limited training samples [2], [3]. The main problem involved by improved covariance estimators is the risk that the estimated covariance matrices overfit the few available training samples and lead to a poor approximation of statistics for the whole image to be classified.

The second approach to overcoming the Hughes phenomenon proposes to use in an iterative way the semilabeled samples obtained after classification, in order to enhance statistics estimation and to improve classification accuracy.
Samples are initially classified by using the available training samples. Then, the classified samples, together with the training ones, are exploited iteratively to update the class statistics and, accordingly, the results of the classification up to convergence [4], [5]. The process of integration between these two typologies of samples (i.e., the training and the semilabeled samples) is carried out by the expectation–maximization (EM) algorithm, which represents a general and powerful solution to the problem of ML estimation of statistics in the presence of incomplete data [6], [7]. The main advantage of this approach is that it fits the true class distributions better, since a larger portion of the image (available with no extra cost) contributes to the estimation process. The main problems related to this second approach are twofold: 1) it is demanding from the computational point of view and 2) it requires that the initial class model estimated from the training samples match well enough the unlabeled samples, in order to avoid divergence of the estimation process and, accordingly, to improve the accuracy of the model parameter estimation.
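As an illustration of this iterative scheme, the following Python sketch mimics it in a strongly simplified form. It is not the method of [4], [5]: it uses hard semilabels and scikit-learn's Gaussian maximum-likelihood classifier instead of the soft EM reweighting, and the function name is hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def self_training_gaussian_ml(X_train, y_train, X_unlabeled, n_iter=5):
    """Simplified sketch: re-estimate Gaussian class statistics with
    semilabeled samples (hard labels instead of the EM soft weights)."""
    clf = QuadraticDiscriminantAnalysis(store_covariance=True)
    clf.fit(X_train, y_train)
    for _ in range(n_iter):
        # 1) classify the unlabeled pixels with the current class statistics
        y_semi = clf.predict(X_unlabeled)
        # 2) update first- and second-order statistics on the union of the
        #    training and semilabeled samples
        X_all = np.vstack([X_train, X_unlabeled])
        y_all = np.concatenate([y_train, y_semi])
        clf.fit(X_all, y_all)
    return clf
```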
In order to overcome the problem of the curse of dimensionality, the third approach proposes to reduce the dimensionality of the feature space by means of feature selection or extraction techniques. Feature-selection techniques perform a reduction of spectral channels by selecting a representative subset of original features. This can be done following: 1) a selection criterion and 2) a search strategy. The former aims at assessing the discrimination capabilities of a given subset of features according to statistical distance measures among classes (e.g., the Bhattacharyya distance, the Jeffries–Matusita distance, and the transformed divergence measure [8], [9]). The latter plays a crucial role in hyperdimensional spaces, since it defines the optimization approach necessary to identify the best (or a good) subset of features according to the used selection criterion. Since the identification of the optimal solution is computationally unfeasible, techniques that lead to suboptimal solutions are normally used. Among the search strategies proposed in the literature, it is worth mentioning the basic sequential forward selection (SFS) [10], the more effective sequential forward floating selection [11], and the steepest ascent (SA) techniques [12]. The feature-extraction approach addresses the problem of feature reduction by transforming the original feature space into a space of a lower dimensionality, which contains most of the original information. In this context, the decision boundary feature extraction (DBFE) method [13] has proved to be a very effective method, capable of providing a minimum number of transformed features that achieve good classification accuracy. However, this feature-extraction technique suffers from high computational complexity, which often makes it impractical. This problem can be overcome by coupling it with the projection pursuit (PP) algorithm [14], which plays the role of a preprocessor to the DBFE by applying a preliminary limited reduction of the feature space with (hopefully) an almost negligible information loss. An alternative feature-extraction method, whose class-specific nature makes it particularly attractive, was proposed by Kumar et al. [15]. It is based on a combination of subsets of (highly correlated) adjacent bands into fewer features by means of top-down and bottom-up algorithms. In general, it is evident that even if feature-reduction techniques take care of limiting the loss of information, this loss is often unavoidable and may have a negative impact on classification accuracy.

Finally, the approach inherited from spectroscopic methods in analytical chemistry to deal with hyperspectral data is worth mentioning. The idea behind this approach is that of looking at the response from each pixel in the hyperspectral image as a one-dimensional spectral signal (signature). Each information class is modeled by some descriptors of the shape of its spectra [16], [17]. The merit of this approach is that it significantly simplifies the formulation of the hyperspectral data classification problem. However, additional work is required to find appropriate shape descriptors capable of accurately capturing the spectral shape variability related to each information class.

Other methods also exist that are not included in the group of the four main approaches discussed above. In particular, it is interesting to mention the method based on the combination of different classifiers [18] and that based on cluster-space representation [19].

Recently, particular attention has been dedicated to support vector machines (SVMs) for the classification of multispectral remote sensing images [20]–[22]. SVMs have often been found to provide higher classification accuracies than other widely used pattern recognition techniques, such as the maximum-likelihood and the multilayer perceptron neural network classifiers. Furthermore, SVMs appear to be especially advantageous in the presence of heterogeneous classes for which only few training samples are available. In the context of hyperspectral image classification, some pioneering experimental investigations preliminarily pointed out the effectiveness of SVMs in analyzing hyperspectral data directly in the hyperdimensional feature space, without the need of any feature-reduction procedure [23]–[26]. In particular, in [24], the authors found that a significant improvement of classification accuracy can be obtained by SVMs with respect to the results achieved by the basic minimal-distance-to-means classifier and those reported in [3]. In order to show its relatively low sensitivity to the number of training samples, the accuracy of the SVM classifier was estimated on the basis of different proportions between the number of training and test samples. As will be explained in the following section, this mainly depends on the fact that SVMs implement a classification strategy that exploits a margin-based "geometrical" criterion rather than a purely "statistical" criterion. In other words, SVMs do not require an estimation of the statistical distributions of classes to carry out the classification task, but they define the classification model by exploiting the concept of margin maximization. The growing interest in SVMs [27]–[30] is confirmed by their successful implementation in numerous other pattern recognition applications, such as biomedical imaging [31], image compression [32], and three-dimensional object recognition [33]. Such an interest is justified by three main general reasons: 1) their intrinsic effectiveness with respect to traditional classifiers, which results in high classification accuracies and very good generalization capabilities; 2) the limited effort required for architecture design (i.e., they involve few control parameters); and 3) the possibility of solving the learning problem according to linearly constrained quadratic programming (QP) methods (which have been studied intensely in the scientific literature).
However, a major drawback of SVMs is that, from a theoretical point of view, they were originally developed to solve binary classification problems. This drawback becomes even more evident when dealing with data acquired from hyperspectral sensors, since they are intrinsically designed to discriminate among a broad range of land-cover classes that may be very similar from a spectral viewpoint. The implementation of SVMs in multiclass classification problems can be approached in two ways [23], [24], [34], [35]. The first consists of defining an architecture made up of an ensemble of binary classifiers. The decision is then taken by combining the partial decisions of the single members of the ensemble. The second is represented by SVMs formulated directly as a multiclass optimization problem. Because of the number of classes that are to be discriminated simultaneously, the number of parameters to be estimated increases considerably in a multiclass optimization formulation. This renders the method less stable and, accordingly, affects the classification performances in terms of accuracy. For this reason, multiclass optimization has not been as successful as the approach based on the two-class optimization.

In this paper, we present a theoretical discussion and an accurate experimental analysis that aim: 1) at assessing the properties of SVM classifiers in hyperdimensional feature spaces and 2) at evaluating the impact of the multiclass problem involved by SVM classifiers when applied to hyperspectral data by comparing different multiclass strategies. With regard to the experimental part of the first objective, assessment of SVM effectiveness is carried out through two different experiments. In the first, we propose to compare the performances of SVMs with those of two other nonparametric classifiers applied directly to the original hyperdimensional feature space: the radial basis function neural network, which is another kernel-based classification method (like SVMs) that uses a different classification strategy based on a "statistical" rather than a "geometrical" criterion; and the K-nearest neighbors classifier, which is widely used in pattern recognition as a reference classification method. The second experiment consists of a comparison of SVMs with the classical classification approach adopted for hyperspectral data, i.e., a conventional classifier combined with a feature-reduction technique. This also makes it possible to assess the performances of SVMs in hypersubspaces of various dimensionalities. As regards the second objective of this work, four different multiclass strategies are analyzed and compared. In particular, the widely used one-against-all and one-against-one strategies are considered. In addition, two strategies based on the hierarchical tree approach are investigated. The experimental studies were carried out on the basis of hyperspectral images acquired by the Airborne Visible/Infrared Imaging Spectroradiometer (AVIRIS) sensor in June 1992 on the Indian Pines area (Indiana) [36]. Different performance indicators are used to support our experimental analysis, namely, the classification accuracy, the computational time, the stability to parameter setting, and the complexity of the multiclass architecture adopted. Experimental results confirm the significant superiority of the SVM classifiers in the context of hyperspectral data classification over the conventional classification methodologies, whatever the multiclass strategy adopted to face the multiclass dilemma.

The rest of this paper is organized in four sections. Section II recalls the mathematical formulation of SVMs and discusses their potential properties in hyperspectral feature spaces. Section III describes different strategies that can be used to solve multiclass problems with binary SVMs and that are adopted in the experiments to assess the impact of the multiclass problem in a hyperdimensional context. Section IV deals with the experimental phase of the work. Finally, Section V summarizes the observations and concluding remarks to complete this paper.

II. SVM CLASSIFICATION APPROACH

A. SVM Mathematical Formulation

1) Linear SVM: Linearly Separable Case: Let us consider a supervised binary classification problem. Let us assume that the training set consists of $N$ vectors $\mathbf{x}_i$ ($i = 1, 2, \dots, N$) from the $d$-dimensional feature space $\mathbb{R}^d$. A target $y_i \in \{-1, +1\}$ is associated to each vector $\mathbf{x}_i$. Let us assume that the two classes are linearly separable. This means that it is possible to find at least one hyperplane (linear surface) defined by a vector $\mathbf{w} \in \mathbb{R}^d$ (normal to the hyperplane) and a bias $b \in \mathbb{R}$ that can separate the two classes without errors. The membership decision rule can be based on the function $\operatorname{sgn}[f(\mathbf{x})]$, where $f(\mathbf{x})$ is the discriminant function associated with the hyperplane and defined as

$$f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b. \quad (1)$$

In order to find such a hyperplane, one should estimate $\mathbf{w}$ and $b$ so that

$$y_i (\mathbf{w} \cdot \mathbf{x}_i + b) > 0, \quad \text{with } i = 1, 2, \dots, N. \quad (2)$$

The SVM approach consists in finding the optimal hyperplane that maximizes the distance between the closest training sample and the separating hyperplane. It is possible to express this distance as equal to $1/\|\mathbf{w}\|$ with a simple rescaling of the hyperplane parameters $\mathbf{w}$ and $b$ such that

$$\min_{i=1,\dots,N} |\mathbf{w} \cdot \mathbf{x}_i + b| = 1. \quad (3)$$

The geometrical margin between the two classes is given by the quantity $2/\|\mathbf{w}\|$. The concept of margin is central in the SVM approach, since it is a measure of its generalization capability. The larger the margin, the higher the expected generalization [27].

Accordingly, it turns out that the optimal hyperplane can be determined as the solution of the following convex quadratic programming problem:

$$\text{minimize:} \quad \frac{1}{2}\|\mathbf{w}\|^2$$
$$\text{subject to:} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, 2, \dots, N. \quad (4)$$

This classical linearly constrained optimization problem can be translated (using a Lagrangian formulation) into the following dual problem:

$$\text{maximize:} \quad \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$$
$$\text{subject to:} \quad \sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad \alpha_i \ge 0, \quad i = 1, 2, \dots, N. \quad (5)$$
Fig. 1. Optimal separating hyperplane in SVMs for a linearly nonseparable case. White and black circles refer to the classes "+1" and "−1," respectively. Support vectors are indicated by an extra circle.

The Lagrange multipliers $\alpha_i$ expressed in (5) can be estimated using quadratic programming (QP) methods [27]. The discriminant function associated with the optimal hyperplane becomes an equation depending both on the Lagrange multipliers and on the training samples, i.e.,

$$f(\mathbf{x}) = \sum_{i \in S} \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{x}) + b \quad (6)$$

where $S$ is the subset of training samples corresponding to the nonzero Lagrange multipliers $\alpha_i$. It is worth noting that the Lagrange multipliers effectively weight each training sample according to its importance in determining the discriminant function. The training samples associated to nonzero weights are called support vectors. These lie at a distance exactly equal to $1/\|\mathbf{w}\|$ from the optimal separating hyperplane.
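As a numerical check of (6), the short Python sketch below (an illustration assuming scikit-learn, not the solver used in the paper) trains a linear SVM on separable toy data and rebuilds the discriminant function from the support vectors and their Lagrange multipliers.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters standing in for the classes "+1" and "-1".
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=0)
y = 2 * y - 1  # relabel {0, 1} -> {-1, +1}

svm = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ separable case

# dual_coef_ stores alpha_i * y_i for the support vectors only, so (6) reads
# f(x) = sum_{i in S} (alpha_i y_i) <x_i, x> + b.
f_manual = svm.dual_coef_ @ (svm.support_vectors_ @ X.T) + svm.intercept_
print(np.allclose(f_manual.ravel(), svm.decision_function(X)))  # True
print("support vectors:", len(svm.support_vectors_))
```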
2) Linear SVM: Linearly Nonseparable Case: The SVM formulation described in the previous subsection holds only if data are linearly separable. Such an optimistic condition is difficult to satisfy in the classification of real data. In order to handle nonseparable data, the concept of optimal separating hyperplane has been generalized as the solution that minimizes a cost function that expresses a combination of two criteria: margin maximization (as in the case of linearly separable data) and error minimization (to penalize the wrongly classified samples). The new cost function is defined as

$$\psi(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i \quad (7)$$

where the $\xi_i$ are the so-called slack variables introduced to account for the nonseparability of data, and the constant $C$ represents a regularization parameter that controls the penalty assigned to errors. The larger the $C$ value, the higher the penalty associated to misclassified samples. The minimization of the cost function expressed in (7) is subject to the following constraints:

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \dots, N \quad (8)$$
$$\xi_i \ge 0, \quad i = 1, 2, \dots, N. \quad (9)$$

It is worth noting that, in the nonseparable case, two kinds of support vectors coexist: 1) margin support vectors that lie on the hyperplane margin and 2) nonmargin support vectors that fall on the "wrong" side of this margin (Fig. 1).
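The role of the regularization parameter $C$ in (7) can be visualized with another small sketch (same illustrative assumptions as above): as $C$ decreases, the margin $2/\|\mathbf{w}\|$ widens and more nonmargin support vectors are tolerated.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping clusters: the linearly nonseparable case of Fig. 1.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=1)

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(svm.coef_)  # geometrical margin 2/||w||
    print(f"C={C:<6} margin={margin:.3f}  support vectors={len(svm.support_vectors_)}")
```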
3) Nonlinear SVM: Kernel Method: A natural way to improve further the separation between two information classes consists in generalizing the above method to the category of nonlinear discriminant functions. Accordingly, one may think of mapping the data through a proper nonlinear transformation $\Phi$ into a higher dimensional feature space $\mathbb{R}^{d'}$ ($d' > d$), where a separation between the two classes can be looked for following the method described in the previous subsections, i.e., by means of an optimal hyperplane defined by a normal vector $\mathbf{w}' \in \mathbb{R}^{d'}$ and a bias $b'$. To identify the latter, one should solve a dual problem such as the one defined in (5) for the linearly separable case by replacing the inner products $\mathbf{x}_i \cdot \mathbf{x}_j$ in the original space with inner products $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$ in the transformed space. At this point, the main problem consists of the explicit computation of $\Phi$, which can prove expensive and at times unfeasible. The kernel method provides an elegant and effective way of dealing with this problem. Let us consider a kernel function $K$ that satisfies the condition stated in Mercer's theorem, so as to correspond to some type of inner product in the transformed (higher) dimensional feature space [27, pp. 423–424], i.e.,

$$K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j). \quad (10)$$
This kind of kernel function allows a considerable simplification of the solution of the dual problem, since it avoids the explicit computation of the inner products $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$ in the transformed space, i.e., as in

$$\text{maximize:} \quad \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$
$$\text{subject to:} \quad \sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad 0 \le \alpha_i \le C, \quad i = 1, 2, \dots, N. \quad (11)$$

The final result is a discriminant function conveniently expressed as a function of the data in the original (lower) dimensional feature space

$$f(\mathbf{x}) = \sum_{i \in S} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b. \quad (12)$$
The shape of the discriminant function depends on the kind of kernel functions adopted. A common example of kernel type that fulfills Mercer's condition is the Gaussian radial basis function

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2\right) \quad (13)$$

where $\gamma$ is a parameter inversely proportional to the width of the Gaussian kernel. Another extensively used kernel is the polynomial function of order $p$ expressed as

$$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^p. \quad (14)$$
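A minimal sketch of the kernel method (again assuming scikit-learn): the Gaussian kernel (13) can be supplied either through the built-in RBF option or as a precomputed Gram matrix; both routes solve the same dual (11) and give identical decisions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=2)
gamma = 0.25  # inversely proportional to the kernel width, as in (13)

# Gaussian RBF Gram matrix: K(xi, xj) = exp(-gamma * ||xi - xj||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)

svm_rbf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)
svm_pre = SVC(kernel="precomputed", C=10.0).fit(K, y)

# Same dual problem (11), hence the same discriminant function (12).
print(np.allclose(svm_rbf.decision_function(X),
                  svm_pre.decision_function(K)))  # True
```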
It is worth underlining that the kernel-based implementation of SVMs involves the problem of the selection of multiple parameters, including the kernel parameters (e.g., the $\gamma$ and $p$ parameters for the Gaussian and polynomial kernels, respectively) and the regularization parameter $C$. Recently, two interesting automatic techniques have been developed to deal with this issue [37], [38]. They are based on the idea of estimating the parameter values so that: 1) they maximize the margin; and 2) they minimize the estimate of the expected generalization error. The latter is expressed in analytical form by the well-known leave-one-out (LOO) procedure. Optimization of the parameters is then carried out using a gradient descent search over the space of the parameters.
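The techniques of [37], [38] follow a gradient descent on an analytical estimate of the LOO error; a simpler and widely used alternative, sketched below under the assumption of a scikit-learn backend, is an exhaustive cross-validated grid search over $C$ and $\gamma$.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=3)

# Candidate values are illustrative; in practice they span several decades.
param_grid = {"C": [1, 10, 50, 100], "gamma": [0.01, 0.1, 0.25, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```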
Since a detailed analysis of the theory of SVMs is beyond the scope of this paper, we refer the reader to [27]–[30] for greater detail on SVMs.

B. SVMs in Hyperspectral Feature Spaces

Unlike traditional learning techniques, SVMs do not depend explicitly on the dimensionality of input spaces. They solve classical statistical problems such as pattern recognition, regression, and density estimation in high-dimensional spaces [27]. In greater detail, as stated in the previous subsection, the input feature space is mapped by a kernel transformation into a higher dimensional space, where it is expected to find a linear separation that maximizes the margin between the two classes. In order to appreciate the potentialities of SVMs in high-dimensional spaces, it is useful to recall the statistical and geometrical properties of the data in such spaces.

First, in a hyperspectral space, normally distributed samples (a reasonable assumption for optically remotely sensed data) tend to fall toward the tails of the density function, with virtually no samples falling in the central region [39]. This can be illustrated by a simple geometric example [40]. Let us consider the ratio between the volume of a sphere of radius $r$ and that of a cube defined in the interval $[-r, r]$ in the $d$-dimensional space. It is equal to

$$\frac{V_{\text{sphere}}(r)}{V_{\text{cube}}(r)} = \frac{\pi^{d/2}}{d\, 2^{d-1}\, \Gamma(d/2)} \quad (15)$$

where $\Gamma(\cdot)$ represents the well-known gamma function. From (15), it is easy to show that the higher the dimensionality of the space, the lower the volume ratio. Accordingly, the volume of a hypercube is almost concentrated in its corners. In other words, turning back to our classification problem, the increase in dimensionality makes the space almost empty and results in a "centrifuge" effect such that data have a tendency to concentrate close to the tails of the distribution, where they are very likely to be in proximity of decision boundaries between the information classes. This statistical property is potentially of interest to pattern recognition approaches, such as SVMs, that define discriminant functions on the basis of samples situated near the decision boundaries, since the presence of a larger number of samples in this region allows the generation of more accurate and reliable discriminant functions.
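Evaluating (15) numerically makes this emptiness concrete; the sketch below (using scipy's gamma function) shows how quickly the ratio collapses.

```python
import numpy as np
from scipy.special import gamma

def sphere_to_cube_ratio(d):
    """Ratio (15) between the volume of the sphere of radius r and that of
    the cube [-r, r]^d (the radius r cancels out)."""
    return np.pi ** (d / 2) / (d * 2 ** (d - 1) * gamma(d / 2))

for d in (1, 2, 5, 10, 50, 100):
    print(d, sphere_to_cube_ratio(d))
# d=1 gives 1.0 and d=2 gives pi/4; by d=100 the ratio is ~1e-70, i.e.,
# virtually all of the hypercube's volume sits in its corners.
```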
In the second place, it is well known that as the dimensionality of the data increases, the distances between the samples (and consequently between the information classes) increase [41]. In this situation, local neighborhoods are almost certainly empty, requiring the bandwidth of estimation to be large and producing the effect of losing accuracy in density estimation for a statistical classifier [39]. On the contrary, the "geometrical" nature of SVMs results in a methodology that is not aimed at estimating the statistical distributions of classes over the entire hyperdimensional space. Indeed, SVMs are inspired by the following idea:

"If you possess a limited amount of information to solve a problem, try solving it directly and never solve a more general problem as an intermediate step. The available information may be sufficient for a direct solution, though insufficient to solve a more general intermediate problem." [27, p. 12]

In other words, SVMs do not involve a density estimation problem that can lead to the Hughes effect, but they directly exploit the geometrical behavior of data (space local emptiness), as it makes it more likely to find a decision boundary between classes that results in a small classification error. The above-discussed properties (statistical and geometrical) render SVMs potentially less sensitive to the curse of dimensionality.

Another important aspect to be pointed out is the intrinsic good generalization capability of SVMs, which stems from the selection of the hyperplane that maximizes the geometrical margin between classes. In a hyperspectral context, the maximum-margin solution makes it possible to fully exploit the discrimination capability of the relatively few training samples available. Accordingly, this solution deals with some of the major problems, such as the large spatial variability of the hyperspectral signature of each information class, in the best way in terms of generalization capability, given the limited information present in the training set. However, it is worth noting that to solve the problem of the spatial variability of the hyperspectral signature of classes effectively, good generalization properties of the classifiers should be coupled with other data analysis techniques.

III. SVMs: MULTICLASS STRATEGIES

As stated in the previous section, SVMs are intrinsically binary classifiers. However, the classification of hyperspectral remote sensing data usually involves the simultaneous discrimination of numerous information classes. In this section, we describe four different strategies of combination of SVMs considered to evaluate the impact of the multiclass problem in the context of hyperspectral data classification. Let $\Omega = \{\omega_1, \omega_2, \dots, \omega_T\}$ be the set of $T$ possible labels (information classes) associated with the $d$-dimensional hyperspectral image of the study area. In the multiclass case, the problem is to associate to each $d$-dimensional sample the label of the set $\Omega$ that optimizes a predefined classification criterion. In order to carry out this task, the general approach adopted in strategies based on binary classifiers consists of: 1) defining an ensemble of binary classifiers; and 2) combining them according to some decision rules.

The definition of the ensemble of binary classifiers involves the definition of a set of two-class problems, each modeled with two groups $\Omega_A$ and $\Omega_B$ of classes ($\Omega_A, \Omega_B \subset \Omega$ and $\Omega_A \cap \Omega_B = \emptyset$). Targets with values $+1$ and $-1$ are assigned to the samples of $\Omega_A$ and $\Omega_B$, respectively, for each SVM. The selection of these subsets depends on the kind of approach adopted to combine the ensemble. Two main approaches can be identified: the "parallel" and the "hierarchical tree-based" approaches. In the following, we describe two multiclass strategies from each approach, characterized by different classification complexity and computational cost properties.

Fig. 2. Block diagram of a parallel architecture for solving multiclass problems with binary SVMs. In the OAA strategy, M is equal to T (i.e., the number of information classes). By contrast, the OAO strategy involves a larger number of SVMs, and M is given by T(T − 1)/2.

A. Parallel Approach

1) One-Against-All Strategy: The one-against-all (OAA) strategy represents the earliest and most common multiclass approach used for SVMs [42]. It involves a parallel architecture made up of $T$ SVMs, one for each class (Fig. 2). Each SVM solves a two-class problem defined by one information class (e.g., $\omega_i$) against all the others, i.e.,

$$\Omega_A = \{\omega_i\}, \qquad \Omega_B = \Omega - \{\omega_i\}. \quad (16)$$

The "winner-takes-all" rule is used for the final decision, i.e., the winning class is the one corresponding to the SVM with the highest output (discriminant function value).
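A compact sketch of the OAA architecture and of the winner-takes-all rule (illustrative; scikit-learn also packages this scheme as OneVsRestClassifier):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, n_classes=4,
                           n_informative=6, random_state=4)
classes = np.unique(y)

# One SVM per information class: omega_i against all the others, as in (16).
machines = []
for c in classes:
    targets = np.where(y == c, 1, -1)
    machines.append(SVC(kernel="rbf", gamma=0.25, C=50.0).fit(X, targets))

# Winner-takes-all: the class whose SVM gives the largest discriminant value.
scores = np.stack([m.decision_function(X) for m in machines], axis=1)
y_pred = classes[np.argmax(scores, axis=1)]
print("training accuracy:", (y_pred == y).mean())
```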
2) One-Against-One Strategy: The main problem of the OAA strategy is that the discrimination between an information class and all the others often leads to the estimation of complex discriminant functions. In addition, a problem with strongly unbalanced prior probabilities should be solved by each SVM. The idea behind the one-against-one (OAO) strategy is that of a different reasoning, in which simple classification tasks are made possible thanks to a parallel architecture made up of a large number of SVMs [23], [43]. The OAO strategy involves $T(T-1)/2$ SVMs, which model all possible pairwise classifications. In this case, each SVM carries out a binary classification in which two information classes $\omega_i$ and $\omega_j$ ($i \ne j$) are analyzed against each other by means of a discriminant function $f_{ij}(\mathbf{x})$. Consequently, the grouping becomes

$$\Omega_A = \{\omega_i\}, \qquad \Omega_B = \{\omega_j\}. \quad (17)$$

Before the decision process, it is necessary to compute for each class $\omega_i$ a score function $S_i$, which sums the favorable and unfavorable votes expressed for the considered class

$$S_i(\mathbf{x}) = \sum_{j=1,\, j \ne i}^{T} \operatorname{sgn}\left[f_{ij}(\mathbf{x})\right] \quad (18)$$

(with the convention $f_{ji} = -f_{ij}$). The final decision in the OAO strategy is taken on the basis of the "winner-takes-all" rule, which corresponds to the following maximization:

$$\hat{\omega} = \omega_k, \qquad k = \arg\max_{i=1,\dots,T} S_i(\mathbf{x}). \quad (19)$$

Sometimes, conflict situations may occur between two different classes characterized by the same score. Such ambiguities can be solved by selecting the class with the highest prior probability.
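The following sketch mirrors the pairwise architecture and the score function (18) (illustrative only; note that scikit-learn's SVC internally uses exactly an OAO scheme for multiclass problems):

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, n_classes=4,
                           n_informative=6, random_state=5)
classes = np.unique(y)
T = len(classes)

# One SVM per pair of classes: T(T-1)/2 machines, grouped as in (17).
pair_svm = {}
for i, j in combinations(range(T), 2):
    mask = (y == classes[i]) | (y == classes[j])
    targets = np.where(y[mask] == classes[i], 1, -1)
    pair_svm[(i, j)] = SVC(kernel="rbf", gamma=0.25, C=50.0).fit(X[mask], targets)

# Score function (18): each pairwise decision casts a favorable vote for one
# class of the pair and an unfavorable one for the other.
scores = np.zeros((len(X), T))
for (i, j), m in pair_svm.items():
    votes = np.sign(m.decision_function(X))
    scores[:, i] += votes
    scores[:, j] -= votes
y_pred = classes[np.argmax(scores, axis=1)]  # winner-takes-all rule (19)
print("training accuracy:", (y_pred == y).mean())
```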
B. Hierarchical Tree-Based Approach

The idea of representing the data analysis process with a hierarchical tree is not new and has been under study in many pattern recognition application areas. Tree-based classifiers have represented an interesting and effective way to structure and solve complex classification problems [44]–[47]. The organization of information into a hierarchical tree makes it possible to achieve a faster processing capability and, at times, a higher accuracy of analysis. This is mainly explained by the fact that the nodes of the tree carry out very focused tasks, meaningless when taken individually but meaningful when taken as a whole. Turning back to our problem, the binary hierarchical tree (BHT) can be seen as an alternative to the OAA and the OAO strategies, since it allows a good tradeoff to be reached between the number of SVMs to be used and the complexity of the task assigned to each of them. Furthermore, the BHT does not implement a global decision scheme after evaluating the local decisions, as in the OAA and OAO strategies. Indeed, the final decision is implicitly made after running through the tree and reaching one of its terminal nodes.

Many BHT strategies have been proposed in the literature. In this paper, we investigate two different binary tree hierarchies aimed at reducing the computational load required by the OAA and OAO strategies, especially in the operational classification phase (the off-line training phase is less critical from the viewpoint of the computational time). This can become particularly important when large hyperspectral images are considered. As described in the following, both trees exploit the prior probabilities of the classes to define the hierarchy of binary SVMs. It is worth noting that alternative strategies that also exploit the underlying affinities among the individual classes to define the binary trees (like in [46]) could be considered.

1) BHT-Balanced Branches Strategy: In the BHT-balanced branches (BHT-BB) strategy, the tree is defined in such a way that each node (SVM) discriminates between two groups of classes $\Omega_A$ and $\Omega_B$ with similar cumulative prior probabilities. Fig. 3(a) shows an example of tree that can be found with the BHT-BB strategy for a general $T$-class classification problem. The algorithm that implements the BHT-BB strategy is described as follows (a sketch of its construction is given after the listing):

Fig. 3. Examples of BHTs for a T-class classification problem. (a) BHT-BB. (b) BHT-OAA.

Step 0: Root Node
— Set the level index $k = 0$.
— Divide $\Omega$ into two groups $\Omega_A^0$ and $\Omega_B^0$ such that $P(\Omega_A^0) \approx P(\Omega_B^0)$.
Step 1: k-Level Branching
— For each group defined at level $k$:
• If Card$(\Omega_A^k) > 1$, divide it into two groups $\Omega_A^{k+1}$ and $\Omega_B^{k+1}$ such that $P(\Omega_A^{k+1}) \approx P(\Omega_B^{k+1})$.
• If Card$(\Omega_B^k) > 1$, divide it in the same way into two groups with similar cumulative prior probabilities.
Step 2: Stop Condition
— If there exists a group $\Omega^k$ such that Card$(\Omega^k) > 1$, set $k = k + 1$ and go to Step 1. Otherwise, stop.
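The sketch announced above shows one way to realize the BHT-BB grouping rule in Python (illustrative: the greedy balancing heuristic is our own choice, as the paper does not prescribe how the equal-probability split is computed):

```python
def split_balanced(classes, prior):
    """Split classes into two groups with similar cumulative priors
    (greedy heuristic: assign classes, by decreasing prior, to the
    currently lighter group)."""
    group_a, group_b, p_a, p_b = [], [], 0.0, 0.0
    for c in sorted(classes, key=prior.get, reverse=True):
        if p_a <= p_b:
            group_a.append(c); p_a += prior[c]
        else:
            group_b.append(c); p_b += prior[c]
    return group_a, group_b

def build_bht_bb(classes, prior):
    """Recursively build the BHT-BB hierarchy; each internal node hosts
    one binary SVM discriminating its two subtrees."""
    if len(classes) == 1:
        return classes[0]  # terminal node: a single information class
    group_a, group_b = split_balanced(classes, prior)
    return (build_bht_bb(group_a, prior), build_bht_bb(group_b, prior))

# Hypothetical priors for a four-class toy problem.
prior = {"w1": 0.4, "w2": 0.3, "w3": 0.2, "w4": 0.1}
print(build_bht_bb(list(prior), prior))
# (('w1', 'w4'), ('w2', 'w3')): cumulative priors 0.5 vs 0.5
```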
2) BHT-One Against All Strategy: The second binary tree-based hierarchy, called BHT-one against all (BHT-OAA), represents a simplification of the OAA strategy obtained through its implementation in a hierarchical context. To this end, we propose to define the tree in such a way that each node discriminates between two groups of classes $\Omega_A^k = \{\omega^*\}$ and $\Omega_B^k$, where $\omega^*$ represents the information class with the highest prior probability among those belonging to the node. This kind of hierarchy leads to a tree with only one single branch, as depicted in Fig. 3(b). The algorithm of the BHT-OAA strategy is drawn up in the following:

Step 0: Root Node
— Set the level index $k = 0$ and set $\Omega^0 = \Omega$.
— Divide $\Omega^0$ into two groups $\Omega_A^0 = \{\omega^*\}$ and $\Omega_B^0 = \Omega^0 - \{\omega^*\}$, where $\omega^*$ is the class with the highest prior probability in $\Omega^0$.
Step 1: k-Level Branching
— Set $k = k + 1$ and $\Omega^k = \Omega_B^{k-1}$.
— Divide $\Omega^k$ into two groups $\Omega_A^k = \{\omega^*\}$ and $\Omega_B^k = \Omega^k - \{\omega^*\}$, where $\omega^*$ is the class with the highest prior probability in $\Omega^k$.
Step 2: Stop Condition
— If Card$(\Omega_B^k) > 1$, go to Step 1. Otherwise, stop.

It is worth noting that both BHT strategies reduce the number of required SVMs from $T$ and $T(T-1)/2$, respectively, for the OAA and OAO strategies, to $T - 1$.
Since the classification time depends linearly on the number of SVMs, and since classification tasks of medium complexity are assigned to the SVMs of the tree, we expect a lower classification time to be required by the two BHT-based strategies with respect to both the OAO and, especially, the standard OAA strategies.

IV. EXPERIMENTAL RESULTS

A. Dataset Description and Experiment Design

The hyperspectral dataset used in our experiments is a section of a scene taken over northwest Indiana's Indian Pines by the AVIRIS sensor in 1992 [36]. From the 220 spectral channels acquired by the AVIRIS sensor, 20 channels were discarded because they are affected by atmospheric problems. From the 16 different land-cover classes available in the original ground truth, seven were discarded, since only few training samples were available for them (this makes the experimental analysis more significant from the statistical viewpoint). The remaining nine land-cover classes were used to generate a set of 4757 training samples (used for learning the classifiers) and a set of 4588 test samples (exploited for assessing their accuracies) (see Table I). The experiments were run on a Sun Ultra 80 workstation.

Table I. Number of training and test samples used in the experiments.

The experimental analysis was organized into three main experiments. The first aims at analyzing the effectiveness of SVMs in classifying hyperspectral images directly in the original hyperdimensional feature space. A comparison with two other nonparametric classifiers is provided, as well as an assessment of the stability of these three classification methods versus the setting of their parameters. In the second experiment, SVMs are compared with the classical approach adopted for hyperspectral data classification, that is, a conventional pattern recognition system made up of a classification method combined with a feature-reduction technique. In these two experiments, we adopted the most popular multiclass strategy used for SVMs, that is, the OAA strategy. Finally, the third experiment aims at analyzing and comparing the effectiveness of the different multiclass strategies described in the previous section, that is, the OAA, OAO, BHT-BB, and BHT-OAA strategies.

B. Results of Experiment 1: Classification in the Original Hyperdimensional Feature Space

SVMs were compared with two widely used nonparametric classifiers: a radial basis function (RBF) neural network trained with the technique described in [48] and a conventional K-nearest neighbors (K-nn) classifier. The choice of the RBF classifier is motivated by the fact that it is a kernel-based method (like SVMs), which adopts a different strategy based on a "statistical" (rather than a "geometrical") criterion for defining the discriminant hyperplane in the transformed kernel space. The K-nn classifier was considered in our experiments, since it represents a reference classification method in pattern recognition. However, it is worth noting that we expect it to be sensitive to the curse of dimensionality. For both classifiers, different trials were carried out to determine empirically the best related parameters, namely, the number of nodes in the hidden layer and the variable $K$, respectively.

In the experiments, we considered two different kinds of SVMs: a linear SVM (SVM-Linear), which corresponds to an SVM without kernel transformation, and a nonlinear SVM based on Gaussian radial basis kernel functions (SVM-RBF). For both SVMs, the regularization parameter $C$ must be estimated, since data are not ideally separable. In addition, the nonlinear SVM requires the determination of the width parameter $\gamma$ of the Gaussian radial basis kernels, which tunes the smoothing of the discriminant function. For the considered dataset, the best values of the parameter $C$ were 50 and 40 for the linear and nonlinear SVMs, respectively. The optimal kernel width parameter $\gamma$ of the nonlinear SVM was found equal to 0.25. These values were estimated empirically on the basis of the available training samples.

The results in terms of classification accuracy and computational time provided by the different classifiers are summarized in Table II. The nonlinear SVM exhibited the best overall accuracy (OA), i.e., the best percentage of correctly classified pixels among all the test pixels considered, with a gain of 6.32%, 6.43%, and 9.48% over the linear SVM, the RBF, and the K-nn classifiers, respectively. In terms of class accuracies, the "corn-min till" class was the most critical. For this class, the nonlinear SVM still exhibited the best accuracy (87.76%), whereas the worst accuracy (61.16%) was obtained by the K-nn classifier.

Table II. Best overall and class-by-class accuracies, and computational times achieved on the test set by the different classifiers in the original hyperspectral space.
It is worth noting that, since the K-nn classifier is based on counting the number of nearest neighboring training samples, it requires the feature space to be filled with a significant number of training samples in order to obtain reliable local estimates of the conditional posterior probabilities of classes. However, in the considered dataset, the small number of training samples (4757) is not sufficient to fill in a proper way the emptiness of the hyperdimensional feature space. This explains the relatively poor classification accuracies of the K-nn classifier. By contrast, SVMs exploit a discriminant model that is defined on the basis of a particular portion of the training samples (the support vectors). As explained in Section II and confirmed by the obtained results, the behavior of the class distributions in hyperdimensional spaces makes it more effective to apply techniques that define discriminant functions on the basis of training samples located near the decision boundaries. Concerning computational cost, the nonlinear SVM exhibited a reasonable total computational time (given by the sum of the training and test times) compared to the other three classifiers. It is worth noting that the long computational time required by the linear SVM (40342 [s]) expresses the difficulties encountered by this kind of classifier in the training phase to find a reasonable linear separation between information classes.

In order to assess the robustness of each classifier to the parameter settings, we derived some statistics by looking at the overall accuracy (OA) and at the total computational time as random realizations obtained by varying the parameters in a predefined range of values. The results reported in Table III confirm the superiority of the nonlinear SVM in terms of both mean overall accuracy (92.64% and 92.51% by varying the parameters $\gamma$ and $C$, respectively) and stability (it provided the lowest variances). It is worth noting that the nonlinear SVM is less sensitive to the choice of the kernel width value $\gamma$ than to the regularization parameter $C$. The linear SVM showed the worst stability to the parameter $C$ (overall-accuracy variance equal to 4.94). This is explained by the fact that a linear separation between classes involves a large number of error samples, which lie on the wrong side of the separating hyperplane. This makes it more difficult to apply the regularization mechanism implemented in the SVM formulation, resulting in significant sensitivity of the classification accuracy to the value of the regularization parameter. Concerning the average total computational times, the obtained results confirm the conclusions drawn above on the basis of the total computational times obtained for the best parameter values of the four considered classifiers.

Table III. Analysis of the stability of the overall classification accuracy and of the computational time versus the setting of the parameters of the different classifiers.

C. Results of Experiment 2: Feature Reduction and Classification

As already discussed in Section I, the traditional approach adopted to address the problem of the classification of hyperspectral data consists of two main phases: 1) reducing the dimensionality of the feature space; and 2) applying the resulting subset of features to a conventional classifier. In this experiment, we propose to assess the effectiveness of SVMs with respect to a traditional feature-reduction-based approach and to evaluate their performances in hypersubspaces of various dimensionalities. To this end, we used the Jeffries–Matusita (JM) interclass distance measure [8] and the steepest ascent (SA) search strategy [12] to reduce the original hyperdimensional space into spaces of a lower dimensionality (the number of features was varied from 20 to 200 with a step of 10). The SA technique formulates the problem of defining the subset of features that maximizes the JM distance as a discrete optimization problem in a $d$-dimensional space, which is viewed as a space of binary strings. It starts with a randomly initialized binary string and performs an iterative local optimization of the adopted criterion function. At each iteration, the criterion is maximized over a neighborhood of the current solution under a predefined constraint.
In our experiments, each subset of selected features was given as input to all four considered classifiers (i.e., linear and nonlinear SVMs, RBF neural networks, and the K-nn classifier). Fig. 4 plots the overall accuracy versus the number of selected features for the four considered classifiers. As can be seen, the obtained results still confirm the strong superiority of nonlinear SVMs over the other classifiers even in lower dimensional feature spaces, with a gain in overall accuracy (averaged over all the subsets of features) of __%, __%, and __% with respect to the linear SVM, the K-nn, and the RBF neural network classifiers (see Table IV).

Fig. 4. Overall accuracy versus the number of features obtained on the test set by the four different classifiers considered in our investigation (i.e., linear and nonlinear SVMs, RBF and K-nn classifiers).

Table IV. First- and second-order statistics of the overall accuracies obtained on the test set by the different classifiers combined with the SA-based feature-selection procedure, for a number of features varying from 20 to 200 (with a step of 10).

In order to analyze the sensitivity of each classifier to the Hughes phenomenon, in the same table we reported the variance of the overall accuracy exhibited by each classification method when varying the number of features from 20 to 200. The lowest sensitivity was again obtained by the nonlinear SVM classifier, with a sharp reduction of the variance with respect to those achieved by the K-nn, the linear SVM, and the RBF neural network classifiers.
Table V. Classification accuracies yielded on the test set by the different classifiers with the subset of the best 30 features selected according to the SA-based feature-selection procedure. The difference in overall accuracy (DIFF-OA) for each classifier with respect to the accuracy achieved in the original hyperdimensional space is also given.

Table V reports the overall and class-by-class accuracies obtained for the hypersubspace made up of the best 30 selected features. The choice of this subspace is motivated by the fact that it represents a good compromise between a low dimensionality of the feature space and a high classification accuracy achieved on average by the four classifiers. In particular, one can see the greater capacity of the nonlinear SVMs to recognize each information class, with a gain in the average of the class-by-class accuracies of __%, __%, and __% with respect to the linear SVM, the K-nn, and the RBF neural network classifiers. In addition, the same table reports the difference in overall accuracy (DIFF-OA) for each classifier with respect to the accuracy achieved in the original hyperdimensional space. It is interesting to note the lower difference (associated with the expected lowest sensitivity to the problem of the curse of dimensionality) achieved by the SVM-RBF classifier (0.93%). The reduction in the number of features involved a decrease in accuracy of 3.36% for the linear SVM classifier. By contrast, significant increases in accuracy of 2.96% and 3.46% were obtained by the conventional K-nn and RBF classifiers, respectively, confirming their relatively high sensitivity to the curse of dimensionality.

In order to analyze the complexity of the decision boundaries produced by the nonlinear SVM classifier, we computed the number of SVs defined in each binary SVM of the OAA architecture in both the original hyperspace and the hypersubspace consisting of the best 30 selected features. These numbers are represented graphically in Fig. 5. It can be observed in general that the numbers of SVs are relatively small, except for the SVM associated with one of the classes. This suggests that decision boundaries of moderate complexity were enough to discriminate accurately between the information classes. Furthermore, as discussed in Section II-B, an important property related to the "geometrical" nature of SVMs seems confirmed, i.e., that the classification complexity does not depend on the dimension of the feature space, since the number of SVs is almost similar in both the original and the reduced spaces.

Fig. 5. Number of support vectors that characterize each binary SVM of the multiclass nonlinear SVM classifier (OAA strategy) in both the original hyperspace (d = 200) and the hypersubspace made up of the best 30 selected features (d = 30).

D. Results of Experiment 3: SVM and Multiclass Strategies

The third (and last) experiment addressed the application of SVMs to the multiclass problem in the hyperdimensional space. The different multiclass strategies described in Section III (i.e., the OAA, OAO, BHT-BB, and BHT-OAA strategies) were designed and trained using nonlinear SVMs based on the Gaussian radial basis kernel functions. The trees of SVMs defined for the BHT-BB and the BHT-OAA strategies are illustrated in Fig. 6. The class prior probabilities necessary to obtain such trees were computed on the basis of the training set. After the training phase, the four strategies were analyzed and compared based on three parameters: 1) classification accuracy; 2) computational time; and 3) architecture complexity. The obtained results are reported in Tables VI and VII. From the viewpoint of the accuracy, all four strategies yielded satisfactory results when compared with the two other nonparametric classifiers (i.e., the RBF neural networks and the K-nn classifier). In greater detail, the OAO strategy exhibited the best accuracy, with a gain in overall accuracy of __%, __%, and __% over the BHT-OAA, the BHT-BB, and the OAA strategies, respectively. This suggests that the decomposition of the multiclass problem into an ensemble of two-class problems of very low complexity represents an effective way of improving the overall discrimination capability. The significant reduction in the complexity of the classification problem assigned to each SVM of the OAO architecture is shown by the very small average number of SVs that characterizes each SVM of the same architecture. Indeed, this number is 130 against 333, 334, and 424 for the BHT-OAA, the BHT-BB, and the OAA strategies, respectively (Table VII). These values also explain why the time required to train the SVMs of the OAO strategy is the shortest, despite the greater number of SVMs required by the same strategy (212 [s] against 311 [s], 410 [s], and 2361 [s] to train the BHT-BB, BHT-OAA, and OAA strategies, respectively). It is worth noting that the smallest number of SVs was exhibited by the OAO strategy. Indeed, only nine SVs were necessary to discriminate between the fifth and ninth classes (hay-windrowed and woods, respectively) with an accuracy of 100%. On the other hand, the larger number of SVMs involved in the OAO strategy directly affects the computational time demanded during the classification of test samples (554 [s] against 125 [s], 155 [s], and 341 [s] for the BHT-BB, the BHT-OAA, and the OAA strategies, respectively).
Table VI. Overall and class-by-class accuracies obtained on the test set by SVMs with the different multiclass strategies considered.

Table VII. Computational time and classification complexity associated to the different SVM multiclass strategies considered.

Fig. 6. Hierarchical trees obtained on the considered dataset by (a) the BHT-BB strategy and (b) the BHT-OAA strategy.

Thanks to the small number of required SVMs and to the moderate complexity of the classification tasks assigned to each of them, the two BHT strategies seem particularly interesting in an operative phase involving the classification of large-scale images. Table VIII shows the overall accuracies achieved by each SVM involved in both the BHT-BB and the BHT-OAA strategies. It is worth noting that the relatively low accuracy (93.77%) obtained by the first SVM of the BHT-OAA architecture (SVM1), combined with the significant depth of its associated tree (involving a higher risk of error propagation), may explain why this strategy was slightly less accurate than the BHT-BB strategy. In general, from a computational point of view, the two investigated BHT-BB and BHT-OAA strategies proved effective, resulting in a significant decrease in computational time.

V. DISCUSSION AND CONCLUSION

In this paper, we addressed the problem of the classification of hyperspectral remote sensing data using support vector machines. In order to assess the effectiveness of this promising classification methodology, we considered two main objectives. The first was aimed at assessing the properties of SVMs in hyperdimensional spaces and hypersubspaces of various dimensionalities. In this context, the results obtained on the considered dataset allow us to identify the following three properties: 1) SVMs are much more effective than other conventional nonparametric classifiers (i.e., the RBF neural networks and the K-nn classifier) in terms of classification accuracy, computational time, and stability to parameter setting; 2) SVMs seem more effective than the traditional pattern recognition approach, which is based on the combination of a feature extraction/selection procedure and a conventional classifier, as implemented in this paper; and 3) SVMs exhibit low sensitivity to the Hughes phenomenon, resulting in an excellent approach to avoid the usually time-consuming phase required by any feature-reduction method. Indeed, as shown in the experiments, the improvement in accuracy obtained on the considered dataset by combining SVMs with a feature-reduction technique is definitely insufficient to justify the use of the latter.

The second objective of the work concerned the assessment of the effectiveness of strategies based on ensembles of binary SVMs used to solve multiclass problems in hyperspectral data. In particular, four different multiclass strategies were investigated and compared.
TABLE VIII
OVERALL ACCURACY YIELDED ON THE TEST SET BY EACH SINGLE SVM OF THE BHT-BB AND BHT-OAA STRATEGIES

the manner in which the classification problem complexity is


distributed over the single members (SVMs) of the architecture.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood
Compared with each other, the parallel architectures (OAA and from incomplete data via the EM algorithm,” J. R. Statist. Soc., vol. 19,
OAO) showed a better discrimination capability than the hierar- pp. 1–38, 1977.
chical tree-based architectures (BHT-BB and BHT-OAA). This [7] T. K. Moon, “The expectation-maximization algorithm,” Signal Process.
Mag., vol. 13, pp. 47–60, 1996.
can be explained by the fact that the BHT strategies may involve [8] J. A. Richards and X. Jia, Remote Sensing Digital Image Analysis,
the risk of propagation of errors, since the final decision is the Berlin, Germany: Springer-Verlag, 1999.
result of several hierarchical exchanges of partial decisions that [9] L. Bruzzone, F. Roli, and S. B. Serpico, “An extension to multiclass
cases of the Jeffries–Matusita distance,” IEEE Trans. Geosci. Remote
may accumulate errors. Accordingly, one may observe that the Sensing, vol. 33, pp. 1318–1321, Nov. 1995.
design of a BHT strategy should favor a large number of ramifi- [10] J. Kittler, “Feature set search algorithm,” in Pattern Recognition and
cations at the expense of a lower ramification depth, to attenuate Signal Processing, C. H. Chen, Ed. Alphen aan den Rijn, Netherlands:
Sijthoff and Noordhoff, 1978, pp. 41–60.
such a risk. Another reason that justifies the lower discrimina- [11] P. Pudil, J. Novovicova, and J. Kittler, “Floating search methods in fea-
tion capability of the two proposed BHT strategies can be found ture selection,” Pattern Recognit. Lett., vol. 15, pp. 1119–1125, 1994.
in the kind of information used to construct the tree. Indeed, [12] S. B. Serpico and L. Bruzzone, “A new search algorithm for feature
selection in hyperspectral remote sensing images,” IEEE Trans. Geosci.
the use of simple information, such as the class prior probabil- Remote Sensing, vol. 39, pp. 1360–1367, July 2001.
ities, cannot take into proper account the underlying affinities [13] C. Lee and D. A. Landgrebe, “Feature extraction based on decision
among individual classes (or metaclasses). However, from the boundaries,” IEEE Trans. Pattern Anal. Machine Intell., vol. 15, pp.
388–400, Apr. 1993.
viewpoint of computational time, the BHT-BB and BHT-OAA [14] L. O. Jimenez and D. A. Landgrebe, “Hyperspectral data analysis and
strategies proved the most effective. Consequently, depending feature reduction via projection pursuit,” IEEE Trans. Geosci. Remote
on the considered application, the multiclass strategy should be Sensing, vol. 37, pp. 2653–2667, Nov. 1999.
[15] S. Kumar, J. Ghosh, and M. M. Crawford, “Best-bases feature extraction
selected according to a proper tradeoff between classification algorithms for classification of hyperspectral data,” IEEE Trans. Geosc.
accuracy and computational time. As a final remark, it is impor- Remote. Sensing, vol. 39, pp. 1368–1379, May 2001.
tant to point out that the classification accuracies exhibited by [16] J. P. Hoffbeck and D. A. Landgrebe, “Classification of remote sensing
images having high-spectral resolution,” Remote Sens. Environ., vol. 57,
all four strategies suggest that the multiclass problem does not pp. 119–126, 1996.
significantly affect the performances of SVMs in the analysis of [17] F. Tsai and W. D. Philpot, “A derivative-aided hyperspectral image anal-
hyperspectral data. Indeed, all the strategies exhibited accura- ysis system for land-cover classification,” IEEE Trans. Geosc. Remote.
Sensing, vol. 40, pp. 416–425, Feb. 2002.
cies sharply higher than those of the nonparametric classifiers [18] J. A. Benediktsson and I. Kanellopoulos, “Classification of multisource
considered in our experimental analysis. and hyperspectral data based on decision fusion,” IEEE Trans. Geosci.
Remote Sensing, vol. 37, pp. 1367–1377, May 1999.
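For concreteness, the two parallel strategies can be sketched with off-the-shelf wrappers around a binary SVM. The following is a minimal scikit-learn illustration, not the implementation used in this work; the band count, class count, and data are synthetic placeholders:

```python
# Sketch of the two parallel multiclass strategies built from binary SVMs:
# OAA trains one SVM per class, OAO one SVM per pair of classes.
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))   # synthetic "pixels" with 100 bands
y = rng.integers(0, 8, size=300)  # 8 hypothetical land-cover classes

# OAA: one binary SVM per class (class k versus all the others).
oaa = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
# OAO: one binary SVM per pair of classes, i.e., 8 * 7 / 2 = 28 machines.
oao = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)

print(len(oaa.estimators_), len(oao.estimators_))  # 8 and 28
```

A BHT over K classes would instead arrange K - 1 binary SVMs in a tree, trading some accuracy (through the error propagation discussed above) for fewer, cheaper decisions at classification time.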
ACKNOWLEDGMENT

Support by the Italian Ministry of Education, Research and University (MIUR) for this work is gratefully acknowledged. The authors would like to thank D. Landgrebe for providing the AVIRIS data and T. Joachims for supplying the SVMlight software (http://svmlight.joachims.org/) used in the context of this work.

REFERENCES

[1] G. F. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Trans. Inform. Theory, vol. IT-14, pp. 55–63, 1968.
[2] J. P. Hoffbeck and D. A. Landgrebe, "Covariance matrix estimation and classification with limited training data," IEEE Trans. Pattern Anal. Machine Intell., vol. 18, pp. 763–767, July 1996.
[3] S. Tadjudin and D. A. Landgrebe, "Covariance estimation with limited training samples," IEEE Trans. Geosci. Remote Sensing, vol. 37, pp. 2113–2118, July 1999.
[4] Q. Jackson and D. A. Landgrebe, "An adaptive classifier design for high-dimensional data analysis with a limited training data set," IEEE Trans. Geosci. Remote Sensing, vol. 39, pp. 2664–2679, Dec. 2001.
[5] B. M. Shahshahani and D. A. Landgrebe, "The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon," IEEE Trans. Geosci. Remote Sensing, vol. 32, pp. 1087–1095, Sept. 1994.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. B, vol. 39, pp. 1–38, 1977.
[7] T. K. Moon, "The expectation-maximization algorithm," IEEE Signal Process. Mag., vol. 13, pp. 47–60, 1996.
[8] J. A. Richards and X. Jia, Remote Sensing Digital Image Analysis. Berlin, Germany: Springer-Verlag, 1999.
[9] L. Bruzzone, F. Roli, and S. B. Serpico, "An extension to multiclass cases of the Jeffries–Matusita distance," IEEE Trans. Geosci. Remote Sensing, vol. 33, pp. 1318–1321, Nov. 1995.
[10] J. Kittler, "Feature set search algorithm," in Pattern Recognition and Signal Processing, C. H. Chen, Ed. Alphen aan den Rijn, The Netherlands: Sijthoff and Noordhoff, 1978, pp. 41–60.
[11] P. Pudil, J. Novovicova, and J. Kittler, "Floating search methods in feature selection," Pattern Recognit. Lett., vol. 15, pp. 1119–1125, 1994.
[12] S. B. Serpico and L. Bruzzone, "A new search algorithm for feature selection in hyperspectral remote sensing images," IEEE Trans. Geosci. Remote Sensing, vol. 39, pp. 1360–1367, July 2001.
[13] C. Lee and D. A. Landgrebe, "Feature extraction based on decision boundaries," IEEE Trans. Pattern Anal. Machine Intell., vol. 15, pp. 388–400, Apr. 1993.
[14] L. O. Jimenez and D. A. Landgrebe, "Hyperspectral data analysis and feature reduction via projection pursuit," IEEE Trans. Geosci. Remote Sensing, vol. 37, pp. 2653–2667, Nov. 1999.
[15] S. Kumar, J. Ghosh, and M. M. Crawford, "Best-bases feature extraction algorithms for classification of hyperspectral data," IEEE Trans. Geosci. Remote Sensing, vol. 39, pp. 1368–1379, May 2001.
[16] J. P. Hoffbeck and D. A. Landgrebe, "Classification of remote sensing images having high spectral resolution," Remote Sens. Environ., vol. 57, pp. 119–126, 1996.
[17] F. Tsai and W. D. Philpot, "A derivative-aided hyperspectral image analysis system for land-cover classification," IEEE Trans. Geosci. Remote Sensing, vol. 40, pp. 416–425, Feb. 2002.
[18] J. A. Benediktsson and I. Kanellopoulos, "Classification of multisource and hyperspectral data based on decision fusion," IEEE Trans. Geosci. Remote Sensing, vol. 37, pp. 1367–1377, May 1999.
[19] X. Jia and J. A. Richards, "Cluster-space representation of hyperspectral data classification," IEEE Trans. Geosci. Remote Sensing, vol. 40, pp. 593–598, Mar. 2002.
[20] L. Hermes, D. Frieauff, J. Puzicha, and J. M. Buhmann, "Support vector machines for land usage classification in Landsat TM imagery," in Proc. IGARSS, Hamburg, Germany, 1999, pp. 348–350.
[21] F. Roli and G. Fumera, "Support vector machines for remote-sensing image classification," Proc. SPIE, vol. 4170, pp. 160–166, 2001.
[22] C. Huang, L. S. Davis, and J. R. G. Townshend, "An assessment of support vector machines for land cover classification," Int. J. Remote Sens., vol. 23, pp. 725–749, 2002.
[23] J. A. Gualtieri and R. F. Cromp, "Support vector machines for hyperspectral remote sensing classification," Proc. SPIE, vol. 3584, pp. 221–232, 1998.
[24] J. A. Gualtieri, S. R. Chettri, R. F. Cromp, and L. F. Johnson, "Support vector machine classifiers as applied to AVIRIS data," in Summaries 8th JPL Airborne Earth Science Workshop, 1999, JPL Pub. 99-17, pp. 217–227. [Online]. Available: ftp://popo.jpl.nasa.gov/pub/docs/workshops/99_docs/toc.html
[25] J. A. Gualtieri and S. Chettri, "Support vector machines for classification of hyperspectral data," in Proc. IGARSS, Honolulu, HI, 2000, pp. 813–815.
[26] F. Melgani and L. Bruzzone, "Support vector machines for classification of hyperspectral remote-sensing images," in Proc. IGARSS, Toronto, ON, Canada, 2002, pp. 506–508.
[27] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[28] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. 5th Annu. ACM Workshop Computational Learning Theory, 1992, pp. 144–152.
[29] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining Knowl. Discov., vol. 2, pp. 121–167, 1998.
[30] Set of tutorials on SVMs and kernel methods. [Online]. Available: http://www.kernel-machines.org/tutorial.html
[31] I. El-Naqa, Y. Yang, M. N. Wernick, N. P. Galatsanos, and R. M. Nishikawa, "A support vector machine approach for detection of microcalcifications," IEEE Trans. Med. Imag., vol. 21, pp. 1552–1563, Dec. 2002.
[32] J. Robinson and V. Kecman, "Combining support vector machine learning with the discrete cosine transform in image compression," IEEE Trans. Neural Networks, vol. 14, pp. 950–958, July 2003.
[33] M. Pontil and A. Verri, "Support vector machines for 3D object recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 637–646, June 1998.
[34] D. J. Sebald and J. A. Bucklew, "Support vector machines and the multiple hypothesis test problem," IEEE Trans. Signal Processing, vol. 49, pp. 2865–2872, Nov. 2001.
[35] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Trans. Neural Networks, vol. 13, pp. 415–425, Mar. 2002.
[36] AVIRIS NW Indiana's Indian Pines 1992 dataset. [Online]. Available: ftp://ftp.ecn.purdue.edu/biehl/MultiSpec/92AV3C (original files) and ftp://ftp.ecn.purdue.edu/biehl/PC_MultiSpec/ThyFiles.zip (ground truth)
[37] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, "Choosing multiple parameters for support vector machines," Mach. Learn., vol. 46, pp. 131–159, 2002.
[38] K.-M. Chung, W.-C. Kao, T. Sun, L.-L. Wang, and C.-J. Lin, "Radius margin bounds for support vector machines with the RBF kernel," Neural Comput., vol. 15, pp. 2643–2681, 2003.
[39] L. O. Jimenez and D. A. Landgrebe, "Supervised classification in high-dimensional space: Geometrical, statistical, and asymptotic properties of multivariate data," IEEE Trans. Syst., Man, Cybern. C, vol. 28, pp. 39–54, Jan. 1998.
[40] M. G. Kendall, A Course in the Geometry of n-Dimensions. New York: Hafner, 1961.
[41] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. New York: Academic, 1990.
[42] L. Bottou, C. Cortes, J. Denker, H. Drucker, I. Guyon, L. Jackel, Y. LeCun, U. Muller, E. Sackinger, P. Simard, and V. Vapnik, "Comparison of classifier methods: A case study in handwritten digit recognition," in Proc. Int. Conf. Pattern Recognition, 1994, pp. 77–87.
[43] U. H.-G. Kreßel, "Pairwise classification and support vector machines," in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1999, pp. 255–268.
[44] P. H. Swain and H. Hauska, "The decision tree classifier: Design and potential," IEEE Trans. Geosci. Electron., vol. GE-15, pp. 142–147, 1977.
[45] B. Kim and D. A. Landgrebe, "Hierarchical classifier design in high-dimensional, numerous class cases," IEEE Trans. Geosci. Remote Sensing, vol. 29, pp. 518–528, July 1991.
[46] J. T. Morgan, A. Henneguelle, M. M. Crawford, J. Ghosh, and A. Neuenschwander, "Adaptive feature spaces for land cover classification with limited ground truth data," in Proc. 3rd Int. Workshop on Multiple Classifier Systems (MCS 2002), Cagliari, Italy, June 2002, pp. 189–200.
[47] M. Datcu, F. Melgani, A. Piardi, and S. B. Serpico, "Multisource data classification with dependence trees," IEEE Trans. Geosci. Remote Sensing, vol. 40, pp. 609–617, Mar. 2002.
[48] L. Bruzzone and D. F. Prieto, "A technique for the selection of kernel-function parameters in RBF neural networks for classification of remote-sensing images," IEEE Trans. Geosci. Remote Sensing, vol. 37, pp. 1179–1184, Mar. 1999.

Farid Melgani (M'04) received the State Engineer degree in electronics from the University of Batna, Batna, Algeria, in 1994, the M.Sc. degree in electrical engineering from the University of Baghdad, Baghdad, Iraq, in 1999, and the Ph.D. degree in electronic and computer engineering from the University of Genoa, Genoa, Italy, in 2003.

From 1999 to 2002, he cooperated with the Signal Processing and Telecommunications Group, Department of Biophysical and Electronic Engineering, University of Genoa. He is currently an Assistant Professor of telecommunications at the University of Trento, Trento, Italy, where he teaches pattern recognition, radar remote sensing systems, and digital transmission. His research interests are in the area of processing and pattern recognition techniques applied to remote sensing images (classification, multitemporal analysis, and data fusion). He is coauthor of more than 30 scientific publications.

Dr. Melgani served on the Scientific Committee of the SPIE International Conferences on Signal and Image Processing for Remote Sensing VI (Barcelona, Spain, 2000), VII (Toulouse, France, 2001), VIII (Crete, 2002), and IX (Barcelona, Spain, 2003) and is a referee for the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING.

Lorenzo Bruzzone (S'95–M'99–SM'03) received the laurea (M.S.) degree in electronic engineering (summa cum laude) and the Ph.D. degree in telecommunications, both from the University of Genoa, Genoa, Italy, in 1993 and 1998, respectively.

He is currently Head of the Remote Sensing Laboratory in the Department of Information and Communication Technologies at the University of Trento, Trento, Italy. From 1998 to 2000, he was a Postdoctoral Researcher at the University of Genoa. From 2000 to 2001, he was an Assistant Professor at the University of Trento, where he has been an Associate Professor of telecommunications since November 2001. He currently teaches remote sensing, pattern recognition, and electrical communications. His current research interests are in the area of remote sensing image processing and recognition (analysis of multitemporal data, feature selection, classification, data fusion, and neural networks). He conducts and supervises research on these topics within the frameworks of several national and international projects. He is the author (or coauthor) of more than 100 scientific publications, including journals, book chapters, and conference proceedings. He is a referee for many international journals and has served on the Scientific Committees of several international conferences.

Dr. Bruzzone ranked first place in the Student Prize Paper Competition of the 1998 IEEE International Geoscience and Remote Sensing Symposium (Seattle, July 1998). He is the Delegate for the University of Trento in the scientific board of the Italian Consortium for Telecommunications (CNIT) and a member of the Scientific Committee of the India–Italy Center for Advanced Research. He was a recipient of the Recognition of IEEE Transactions on Geoscience and Remote Sensing Best Reviewers in 1999 and was a Guest Editor of a Special Issue of the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING on the subject of the analysis of multitemporal remote sensing images (November 2003). He was the General Co-chair of the First and Second IEEE International Workshop on the Analysis of Multi-temporal Remote-Sensing Images (Trento, Italy, September 2001, and Ispra, Italy, July 2003). Since 2003, he has been the Chair of the SPIE Conference on Image and Signal Processing for Remote Sensing (Barcelona, Spain, September 2003, and Maspalomas, Gran Canaria, September 2004). He is an Associate Editor of the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS. He is a member of the International Association for Pattern Recognition (IAPR) and of the Italian Association for Remote Sensing (AIT).
