This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Learning to Pay Attention on Spectral Domain: A Spectral Attention Module-Based Convolutional Network for Hyperspectral Image Classification

Lichao Mou, Student Member, IEEE, and Xiao Xiang Zhu, Senior Member, IEEE

Abstract— Over the past few years, hyperspectral image classification using convolutional neural networks (CNNs) has progressed significantly. In spite of their effectiveness, given that hyperspectral images are of high dimensionality, CNNs can be hindered by their modeling of all spectral bands with the same weight, as probably not all bands are equally informative and predictive. Moreover, the usage of useless spectral bands in CNNs may even introduce noise and weaken the performance of networks. For the sake of boosting the representational capacity of CNNs for spectral-spatial hyperspectral data classification, in this work, we improve networks by discriminating the significance of different spectral bands. We design a network unit, termed the spectral attention module, that makes use of a gating mechanism to adaptively recalibrate spectral bands by selectively emphasizing informative bands and suppressing less useful ones. We theoretically analyze and discuss why such a spectral attention module helps in a CNN for hyperspectral image classification. We demonstrate using extensive experiments that, in comparison with state-of-the-art approaches, the spectral attention module-based convolutional networks are able to offer competitive results. Furthermore, this work sheds light on how a CNN interacts with spectral bands for the purpose of classification.

Index Terms— Attention module, convolutional neural network (CNN), gating mechanism, hyperspectral image classification.

Manuscript received December 27, 2018; revised May 5, 2019; accepted June 23, 2019. This work was supported in part by the German Research Foundation (DFG) under Grant ZH 498/7-2, in part by the European Research Council (ERC) under the European Union's Horizon 2020 Research and Innovation Program (Acronym: So2Sat, www.so2sat.eu) under Grant ERC-2016-StG-714087, and in part by the Helmholtz Association under the framework of the Young Investigators Group "SiPEO" (www.sipeo.bgu.tum.de) under Grant VH-NG-1018. (Corresponding author: Xiao Xiang Zhu.) The authors are with the Remote Sensing Technology Institute (IMF), German Aerospace Center (DLR), 82234 Wessling, Germany, and also with the Signal Processing in Earth Observation (SiPEO), Technical University of Munich (TUM), 80333 Munich, Germany (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TGRS.2019.2933609. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION

HYPERSPECTRAL images encompass hundreds of continuous observation spectral bands, which are capable of precisely differentiating various materials of interest. Hence, in the remote sensing community, hyperspectral images have already been considered a vital data source for object identification and classification tasks. Consequently, numerous kinds of classification approaches, especially supervised models, have been developed for hyperspectral data classification, as found in the literature. Among them, random forest [1]–[3] and support vector machine (SVM) [4]–[8] are two examples of supervised classification approaches, which have been exploited for solving varied and numerous classification problems. Random forests are basically a kind of ensemble bagging or averaging algorithm. A random forest creates a set of decision trees using random subsamples of the training data and then aggregates their predictions via a maximum a posteriori (MAP) rule or voting to decide the final classes of test samples. On the other hand, an SVM seeks a hyperplane that is able to separate two-class data with the largest margin. However, the random forest and SVM are characterized as "shallow" models [9] as compared to deep networks, which are able to extract hierarchical, deep feature representations.

Deep learning, which is mainly characterized by deep networks, has been quite successful in solving a wide range of problems (e.g., natural language processing [10]–[12], computer vision [13]–[25], and remote sensing [26]). In the hyperspectral community, some studies have been published recently on the use of convolutional neural networks (CNNs) [27]–[42] as well as recurrent neural networks (RNNs) [43]–[49] for pattern recognition tasks. For instance, Kussul et al. [27] addressed the classification problem of crop types by making use of 1-D and 2-D CNNs and found that the 2-D CNN is superior to the 1-D CNN, but several tiny objects in the classification map of the 2-D CNN are a little oversmoothed and misclassified. In [28], Song et al. studied feature fusion in a residual learning-based 2-D CNN, aiming to build a more discriminative network for hyperspectral data classification tasks. Following the recent developments in 3-D CNNs for video analysis [50], where the third dimension is usually the time axis, 3-D CNNs have also been studied in hyperspectral data classification. Chen et al. [29] introduced an ℓ2-regularized 3-D CNN for learning spectral-spatial features, while [30] followed a similar idea for the purpose of classification. Paoletti et al. [51] introduced an improved 3-D CNN consisting of 5 layers which make use of all the spatial-spectral information in the hyperspectral image.

To avoid overfitting, Zhao and Du [32] jointly used a dimension reduction method and a 2-D CNN for spectral-spatial feature extraction.
Ghamisi et al. [33] first exploited a computational intelligence (particle swarm optimization) method to choose informative spectral bands and then trained a 2-D CNN using the selected bands. In [34], to properly train a CNN with limited ground truth data, the authors devised a pixel-pair CNN that takes as input a pair of hyperspectral pixels. By doing so, the amount of training data is greatly augmented. Furthermore, in order to access a huge amount of unlabeled hyperspectral data, unsupervised feature learning via a CNN is of great interest. Romero et al. [35] presented a CNN to address the problem of unsupervised spectral-spatial feature extraction and estimated network weights via a sparse learning approach in a greedy layer-wise fashion. Mou et al. [37] proposed a residual learning-based fully conv-deconv network, aiming at unsupervised spectral-spatial feature learning in an end-to-end manner. Better classification network architectures from computer vision (e.g., ResNet [17], DenseNet [18], and CapsuleNet [52]) also provide new insights into hyperspectral image classification [37]–[39], [53]. Moreover, the integration of networks and other traditional machine learning models, e.g., conditional random fields (CRFs) and active learning, has also received attention recently [54], [55].

The unique asset of hyperspectral images is their rich spectral content in comparison with high-resolution aerial images and natural images in the computer vision field. Although there already exist a number of works that have focused on using CNNs for hyperspectral data classification, we notice that in the community, the following questions have not been well addressed until now.
1) Do all spectral bands contribute equally to a CNN for classification tasks?
2) If not, how can informative bands that help hyperspectral data classification be found in a task-driven manner in an end-to-end network?
3) Is it possible to improve the classification results of a CNN by emphasizing informative bands and suppressing less useful ones in the network?

These questions give us an incentive to devise a novel network, called the spectral attention module-based convolutional network, for hyperspectral image classification. Inspired by recent advances in the attention mechanism of networks [56]–[58], which enables feature interactions to contribute differently to predictions, we design a channel attention mechanism for analyzing the significance of different spectral bands and recalibrating them. More importantly, the significance analysis is automatically learned from tasks and hyperspectral data in an end-to-end network without any human domain knowledge. Experiments show that the use of the proposed spectral attention module in a CNN for hyperspectral data classification serves two benefits: it not only offers better performance but also provides an insight into which spectral bands contribute more to predictions. This work's contributions are threefold.
1) We propose a learnable spectral attention module that explicitly allows the spectral manipulation of hyperspectral data within a CNN. This attention module exploits the global spectral-spatial context to produce a series of spectral gates which reflect the significance of spectral bands. The recalibrated spectral information given by these spectral gates can effectively improve the classification results.
2) We analyze and discuss why the proposed spectral attention module is able to offer better classification results from a theoretical perspective by diving into the backward propagation of the network. As far as we know, learning and analyzing such a spectral attention-based network for hyperspectral image classification have not been done yet.
3) We conduct experiments on four benchmark data sets. The empirical results demonstrate that our spectral attention module-based convolutional network is capable of offering competitive classification results, particularly in the situation of high dimensionality and inadequate training data.

The remainder of this article is organized as follows. After detailing hyperspectral image classification using CNNs in Section I, Section II introduces the proposed spectral attention module-based convolutional network. Section III verifies the proposed approach and presents the corresponding analysis and discussion. Finally, Section IV concludes the article.

II. METHODOLOGY

A. Problem Formulation

The spectral attention module in our model transforms a patch x of a hyperspectral image into a new representation z via the following mapping:

$$\mathcal{F} : x \rightarrow z \tag{1}$$

where $x, z \in \mathbb{R}^{H \times W \times C}$.

Our aim is to strengthen the representational capacity of a spectral-spatial classification network through explicitly modeling the significance of spectral bands. Therefore, we instantiate $\mathcal{F}$ as

$$z = x \odot g \tag{2}$$

where $\odot$ is a channel-wise multiplication operation and $g \in \mathbb{R}^{C}$ represents a set of spectral gates applied to the individual spectral bands of the patch x.

The motivation behind (2) is that we wish to make use of a gating mechanism to recalibrate the strength of different spectral bands of the input, i.e., selectively emphasize useful bands and suppress less informative ones, for image classification problems.

Fig. 1 illustrates the architecture of the spectral attention module-equipped convolutional network.
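To make the channel-wise recalibration in (2) concrete, here is a minimal NumPy sketch (our own illustration, not code released with the paper; the patch size and gate values are invented) that scales each band of a patch x by its gate:

```python
import numpy as np

# Hypothetical sizes: a 5 x 5 patch with C = 3 spectral bands.
H, W, C = 5, 5, 3
x = np.random.rand(H, W, C)        # hyperspectral patch
g = np.array([1.0, 0.1, 0.8])      # one scalar gate per band

# Channel-wise multiplication z = x (.) g: broadcasting scales
# every band x[:, :, c] by its scalar gate g[c].
z = x * g.reshape(1, 1, C)

assert z.shape == (H, W, C)
assert np.allclose(z[:, :, 1], 0.1 * x[:, :, 1])   # band 1 suppressed
```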
B. Modeling of Spectral Attention Module

The gating mechanism has been widely used in modeling and processing temporal sequences. For example, long short-term memory (LSTM)-based networks [59], [60] harness three gates to cope with vanishing gradients. Similarly, a gated recurrent unit (GRU) [61], [62] is designed to implement the modulation of information flow through the gating mechanism.

Fig. 1. The overall architecture of the proposed gating mechanism, the spectral attention module, for hyperspectral classification problems. We exploit this module to learn and recalibrate the strengths of different spectral bands, i.e., selectively emphasize useful bands and suppress less informative ones, for image classification problems. To this end, we first learn a set of spectral gates by using a global convolution and then apply them to the individual spectral bands. Moreover, in Section II-C, we theoretically analyze and discuss why the proposed spectral attention module can help a spectral-spatial classification network (e.g., a 2-D CNN) for hyperspectral image classification tasks.

In addition, several recent works in computer vision have shown the benefit of introducing the gating mechanism to vision problems. To name a few, Wang et al. [56] proposed a gating mechanism that is capable of dynamically balancing the contributions of the current event and its surrounding contexts in their model for dense video captioning tasks. Hu et al. [58] built a gated block for image classification tasks and demonstrated its good performance on large-scale image recognition. Liu et al. [57] addressed person re-identification tasks through utilizing a network module based on a soft gating mechanism, which enables the network to concentrate on significant local regions of an input image pair adaptively. In remote sensing, a very recently published, parallel work related to this article can be found in [63], where the authors introduced a visual attention technique that first calculates a mask and then applies it to features produced by a ResNet for hyperspectral data classification tasks.

Here, we would like to design our own gating mechanism, the spectral attention module, for analyzing the significance of different spectral bands and recalibrating them. Besides, we hope this module is task-driven and can be adaptively learned in an end-to-end spectral-spatial classification network. To this end, we need a way to aggregate the spectral-spatial information of x across the spatial domain to produce a collection of spectral gates g.

The convolution operation is an ideal candidate, as 1) it is able to spatially shrink the input patch and 2) its differentiability allows end-to-end learning. In general, a convolutional filter operates with a local receptive field (e.g., 3 × 3 in the VGG-16 network), which leads to the fact that the output is not capable of utilizing contextual information outside of this region. This is a severe issue for our case because the spectral gates g in our model are expected to be derived from the whole spectral-spatial information. To tackle this problem, we distill global spatial information into the spectral gates by using a global convolution. Formally, let $f = [f^1, f^2, \cdots, f^C]$ denote a set of convolutional filters whose sizes are both H × W, where $f^c$ refers to the c-th filter. Thus, the c-th spectral gate $g_c$ can be calculated as follows:

$$g_c = x * f^c = \sum_{i=1}^{C} x_i * f_i^c \tag{3}$$

where $*$ represents convolution and $f_i^c$ and $x_i$ are, respectively, the i-th channels of the c-th filter and of x. Taking into account that the field of view of the global convolution is equal to the spatial size of x, $g_c$ is actually calculated by the inner product of $x_i$ and $f_i^c$ (both $x_i$ and $f_i^c$ are vectorized into columns), i.e., (3) can be rewritten as follows:

$$g_c = \sum_{i=1}^{C} \langle x_i, f_i^c \rangle = \sum_{i=1}^{C} x_i^{T} f_i^{c}. \tag{4}$$

From (4), the spectral gates g can be considered as a series of global descriptors, which are capable of representing the spectral-spatial features of x.
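As a sanity check of (3) and (4), the sketch below (our own illustration with made-up shapes, not the paper's code) computes the gates as a global convolution reduced to inner products; because each filter covers the whole patch, each gate is a single scalar:

```python
import numpy as np

H, W, C = 5, 5, 3
x = np.random.rand(H, W, C)         # input patch
f = np.random.rand(C, H, W, C)      # C global filters, each H x W x C

# (4): gate c is the sum over channels i of the inner product
# between the vectorized band x_i and filter channel f_i^c.
g = np.array([
    sum(x[:, :, i].ravel() @ f[c, :, :, i].ravel() for i in range(C))
    for c in range(C)
])
assert g.shape == (C,)              # one scalar gate per spectral band
```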
Thus, according to (2), we can associate the c-th spectral gate $g_c$ with the c-th spectral band of x to obtain the recalibrated $z_c$ via

$$z_c = x_c \odot \sum_{i=1}^{C} x_i^{T} f_i^{c}. \tag{5}$$

So far, we can obtain an initial spectral attention module [as shown in (5)], but there still exist three issues which we should address:
1) Given the complex spectral-spatial properties of hyperspectral images, we wish that the spectral gates in this module are capable of learning a nonlinear mapping, instead of a linear one, from the input.
2) The attention module should model a nonmutually exclusive relationship between spectral bands, as we would like to ensure that multiple bands can be emphasized at the same time (unlike the one-hot activation in softmax).

3) The gates should be bounded (e.g., between 0 and 1), easily differentiable, and monotonic (good for convex optimization).

To meet these three requirements, we modify the spectral gates in the initial spectral attention module as follows:

$$g_c = \frac{1}{1 + \exp(-x * f^c)} = \frac{1}{1 + \exp\left(-\sum_{i=1}^{C} x_i^{T} f_i^{c}\right)}. \tag{6}$$

Hence, the final version of the spectral attention module can be written as

$$z_c = x_c \odot \frac{1}{1 + \exp\left(-\sum_{i=1}^{C} x_i^{T} f_i^{c}\right)}. \tag{7}$$
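Putting (6) and (7) together, a compact Keras layer can realize the module as a global convolution (kernel size equal to the patch size) followed by a sigmoid and a band-wise rescaling. This is a sketch under our own assumptions (the layer name, patch size, and batch size are illustrative), not the authors' published implementation:

```python
import tensorflow as tf

class SpectralAttention(tf.keras.layers.Layer):
    """Sketch of (7): global convolution -> sigmoid gates -> rescale."""

    def build(self, input_shape):
        h, w, c = input_shape[1], input_shape[2], input_shape[3]
        # C filters whose spatial size equals the patch size,
        # i.e., a global convolution producing one logit per band.
        self.global_conv = tf.keras.layers.Conv2D(
            filters=c, kernel_size=(h, w), use_bias=False)

    def call(self, x):
        gates = tf.sigmoid(self.global_conv(x))   # (B, 1, 1, C), in (0, 1)
        return x * gates                          # broadcast over H and W

# Illustrative shapes: a batch of 27 x 27 patches with 103 bands
# (the Pavia University band count); the patch size is an assumption.
patches = tf.random.normal([4, 27, 27, 103])
z = SpectralAttention()(patches)
assert z.shape == patches.shape
```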
Fig. 2 is an example showing how the proposed spectral attention module works in a CNN.

Fig. 2. Example showing how the proposed spectral attention module in a CNN corrects a wrong prediction (gravel) to the right one (bricks) via learned spectral gates.

C. Why Does the Spectral Attention Module Work?

In our experiments, we observed that a 2-D CNN with our spectral attention module can offer better classification results. However, how exactly does this attention module help a spectral-spatial classification network for hyperspectral data classification? We dive into the backward propagation of the network to seek the answer to this question.

For notational simplicity, we subsequently drop the subscript c and rewrite the final expression of the spectral attention module as follows:

$$z = x \odot \frac{1}{1 + \exp(-x * f)}. \tag{8}$$

Thus, the gradient of the spectral attention module can be written as

$$\nabla z = \nabla x \odot \frac{1}{1 + \exp(-x * f)} + x \odot \nabla\!\left(\frac{1}{1 + \exp(-x * f)}\right). \tag{9}$$

It can be seen that the term $\nabla x$ is weighted by the spectral gates $1/(1 + \exp(-x * f))$. This has the following interesting properties.
1) On the one hand, the existence of $\nabla x$ ensures that the gradient information on spectral-spatial features can be backpropagated directly, which helps to prevent the vanishing gradient problem.
2) On the other hand, for spectral bands where the spectral gates are close to 0 (less useful bands), the gradient propagation vanishes; on the contrary, for gate values that are close to 1, gradients (of informative bands) are directly propagated from z to x.

For the first point, a similar effect can be found in residual learning. He et al. [17] introduced residual learning into CNNs for large-scale image classification tasks and exhibited significantly improved network training characteristics, e.g., allowing network depths that were previously unattainable. Formally, denote by y a random variable representing the output of a residual block. It can then be expressed as

$$y = x + \mathcal{F}(x; w) \tag{10}$$

where $\mathcal{F}$ is a residual function, usually implemented by a couple of stacked convolutional layers, and w represents the learnable weights of this residual block. The gradient of a residual block can be calculated as

$$\nabla y = \nabla x + \nabla(\mathcal{F}(x; w)). \tag{11}$$

From (11), we can see that $\nabla y$ is a sum of the gradient of the input, $\nabla x$, and the gradient $\nabla(\mathcal{F}(x; w))$; as mentioned above, the term $\nabla x$ is a key to avoiding the vanishing gradient problem. The same holds for the first property of our spectral attention module.

Unlike $\nabla x$ in (9), $\nabla x$ in (11) is not weighted; in other words, the gradients of all spectral bands are indiscriminately backpropagated. In contrast, the spectral attention module has a selection mechanism regarding the significance of different spectral bands from the perspective of the gradient.
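The second property can be checked numerically. In the toy sketch below (our own illustration), the gate is frozen with tf.stop_gradient so that only the first term of (9) remains; the gradient reaching x then equals the gate value itself, vanishing for suppressed bands and passing through for emphasized ones:

```python
import tensorflow as tf

x = tf.Variable([2.0, -1.0, 0.5])        # three toy "bands"
logits = tf.constant([-6.0, 0.0, 6.0])   # pre-sigmoid gate logits

with tf.GradientTape() as tape:
    # Freezing the gate keeps only the first term of (9).
    gate = tf.stop_gradient(tf.sigmoid(logits))
    z = tf.reduce_sum(x * gate)

grad = tape.gradient(z, x)
print(grad.numpy())   # ~[0.0025, 0.5, 0.9975]: dz/dx equals the gates
```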

D. Network Training

We insert the spectral attention module into a 2-D CNN (between the input and the first convolutional layer) and then train the whole network. Note that the spectral attention module and the other layers are trained simultaneously. We use the TensorFlow framework to implement and train the networks. All network weights are initialized by a Glorot uniform initializer [64]. The Nesterov Adam [65] algorithm is chosen to optimize the networks because, in our experiments, it provides much faster convergence than stochastic gradient descent (SGD) with momentum [66] or Adam [67]. Almost all parameters of this optimizer are set as recommended in [65]. We utilize a relatively small learning rate of 2e-04. Finally, we train the networks on an NVIDIA Tesla P100 16 GB GPU. Table I exhibits an example of a CNN with the proposed attention module.

TABLE I. Configuration of a Spectral Attention Module-Based Convolutional Network for the Pavia University Data Set
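The following Keras snippet mirrors this training setup (Glorot uniform initialization, Nadam with a learning rate of 2e-4). The toy backbone and data shapes are placeholders of our own; the actual architecture is given in Table I:

```python
import tensorflow as tf

def build_classifier(num_classes, input_shape):
    # Placeholder backbone; the real configuration is in Table I.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu',
                               kernel_initializer='glorot_uniform',
                               input_shape=input_shape),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation='softmax',
                              kernel_initializer='glorot_uniform'),
    ])

model = build_classifier(num_classes=9, input_shape=(27, 27, 103))
model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=2e-4),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_patches, train_labels, validation_split=0.1, ...)
```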
III. EXPERIMENTS AND ANALYSIS

A. Data Description

1) Indian Pines Hyperspectral Data Set: The first data set was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over Northwest Indiana, USA, in 1992. It includes 145 × 145 pixels with a 20 m/pixel spatial resolution and 200 spectral bands covering from 400 to 2500 nm after removing 20 water absorption channels (220, 150-163, and 104-108). The ground truth includes 16 classes of interest, which are mostly various crops in different growth phases and are detailed in Table II. Since these 16 classes have similar spectral signatures, the precise classification of this scene is hard. The true-color composite image and the available ground truth data can be found in Fig. 3 (black color in the ground truth indicates unknown samples).

TABLE II. Amounts of Training and Test Data on the Indian Pines Scene

2) Pavia University Hyperspectral Data Set: The second data set was acquired over the city of Pavia, Italy, in 2002 by an airborne instrument, the Reflective Optics Spectrographic Imaging System (ROSIS). The aircraft was operated by the German Aerospace Center (DLR) within the context of the European Union funded HySens project. The data set is made up of 640 × 340 pixels with a 1.3 m/pixel spatial resolution and 103 bands covering from 430 to 860 nm after removing 12 noisy channels. Besides unknown pixels, 9 classes are manually annotated in the reference data. Fig. 4 displays a composite image of this data set and its reference map. Table III offers information on all 9 categories.

3) Salinas Hyperspectral Data Set: The third data set was also gathered by the AVIRIS sensor, over the region of Salinas Valley, CA, USA, and with a 3.7 m/pixel spatial resolution.

Fig. 3. Classification maps of different approaches for the Indian Pines data set. (Left to right) True-color composite image, training set, test set, RF-200, SVM-RBF, CCF-200, SICNN, 2-D CNN, and SpecAttenNet. Best zoomed-in view.

Fig. 4. Classification maps of different approaches for the Pavia University data set. (Left to right) Composite image, training samples, ground truth, RF-200, SVM-RBF, CCF-200, SICNN, 2-D CNN, and SpecAttenNet. Best zoomed-in view.

TABLE III. Amounts of Training and Test Data on the Pavia University Data Set

TABLE IV. Amounts of Training and Test Data on the Salinas Data Set

The Salinas scene is composed of 224 spectral bands and 512 × 217 pixels. Like the Indian Pines data set, 20 water absorption bands (224, 154-167, and 108-112) of the Salinas scene have been discarded. The data set presents 16 classes related to vegetables, vineyard fields, and bare soils. Table IV shows the amounts of training and test data on this data set.

4) Houston Hyperspectral Data Set: The fourth data set was acquired over the University of Houston campus and its neighboring urban area. It was collected with an ITRES-CASI 1500 sensor on June 23, 2012, between 17:37:10 and 17:39:50 UTC. The average altitude of the sensor was about 1676 m, which results in 2.5 m spatial resolution data consisting of 349 by 1905 pixels. The hyperspectral imagery consists of 144 spectral bands ranging from 380 to 1050 nm and was processed (radiometric correction, attitude processing, GPS processing, geo-correction, and so on) to yield the final geo-corrected image cube representing the sensor spectral radiance. Table V provides information about all 15 classes of this data set with their corresponding training and test samples. This data set was kindly made available by the Image Analysis and Data Fusion Technical Committee of IEEE GRSS in 2012.

B. Experiment Setup

To quantitatively compare different models for hyperspectral data classification tasks from various aspects, the following measurements are considered; a short computation sketch follows the list.
1) Overall Accuracy (OA): This criterion is calculated as the fraction of test samples that are differentiated correctly.
2) Per-Class Accuracy: To assess the performance with respect to each category in a data set, we also compute the per-class accuracy. This measurement is particularly useful when class labels are not uniformly distributed.
3) Average Accuracy (AA): This criterion is computed as the average of all per-class accuracies.
4) Kappa Coefficient: This statistical criterion is a robustness measurement quantifying the degree of agreement beyond chance.
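These aggregate criteria can all be computed from a confusion matrix, as in the sketch below (our own illustration; the toy matrix is invented):

```python
import numpy as np

def classification_scores(conf):
    """OA, AA, and kappa from a confusion matrix whose rows are
    reference classes and whose columns are predicted classes."""
    n = conf.sum()
    oa = np.trace(conf) / n                       # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)  # per-class accuracy
    aa = per_class.mean()                         # average accuracy
    # Chance agreement from the row and column marginals.
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Toy 3-class confusion matrix.
conf = np.array([[50, 2, 3], [5, 40, 5], [2, 3, 45]])
print(classification_scores(conf))
```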

TABLE V. Amounts of Training and Test Data on the Houston Data Set

Furthermore, we make use of a statistical test to validate the significance of the classification accuracies produced by various methods. Given that the samples used for two classification models are not independent, McNemar's test can be harnessed to estimate the significance of the difference between two classification maps. The test statistic is

$$z_{12} = \frac{f_{12} - f_{21}}{\sqrt{f_{12} + f_{21}}} \tag{12}$$

where $f_{ij}$ is the amount of data correctly recognized by method $i$ and incorrectly recognized by method $j$. McNemar's test is a statistical test for paired nominal data, and we can use it to compare the predicted accuracies of two models. The null hypothesis, meaning that neither of the two models performs better than the other, is rejected at the $p = 0.05$ significance level ($|z| > 1.96$).
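Computing (12) is a one-liner; in the sketch below (the disagreement counts are invented for illustration), |z| > 1.96 rejects the null hypothesis at p = 0.05:

```python
import math

def mcnemar_z(f12, f21):
    """z-statistic of (12): f12 counts samples correct under method 1
    but wrong under method 2, and f21 the reverse."""
    return (f12 - f21) / math.sqrt(f12 + f21)

z = mcnemar_z(f12=120, f21=60)      # hypothetical counts
print(z, abs(z) > 1.96)             # ~4.47, True -> significant
```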
Below are the methods included in our comparison.
1) RF-200: A random forest composed of 200 decision trees.
2) SVM-RBF: An SVM (implemented with LIBSVM: https://www.csie.ntu.edu.tw/~cjlin/libsvm/) with the widely used radial basis function (RBF) kernel. We make use of five-fold cross validation to search the optimal hyper-parameters γ (spread of the RBF kernel) and C (controlling the magnitude of penalization during the model optimization) in the ranges γ = 2^{-3}, 2^{-2}, ..., 2^{4} and C = 10^{-2}, 10^{-1}, ..., 10^{4}.
3) CCF-200: A canonical correlation forest (CCF, https://github.com/twgr/ccfs) [68], [69] with 200 trees.
4) SICNN: A CNN model which makes an attempt at solving the curse of dimensionality by first utilizing a computational intelligence (particle swarm optimization) algorithm to choose informative spectral bands and then training a 2-D CNN using the selected bands. The used network is made up of three convolutional layers. The first two convolutional layers are followed by max-pooling layers, and their fields of view are 4 × 4 and 5 × 5, respectively. The last convolutional layer is equipped with 4 × 4 filters. Moreover, 32, 64, and 128 convolutional filters are used for those three convolutional layers, respectively. For more details, refer to [33].
5) 2-D CNN: To demonstrate the superiority of the proposed method, we perform an ablation study, i.e., we design a 2-D CNN which has no spectral attention module but whose other parts are the same as in the proposed network (cf. Table I). The exact architecture of the 2-D CNN is a VGG-like network, in which we utilize three convolutional blocks and 3 × 3 filters for all the blocks. Spatial shrinkage is operated by three max-pooling layers following the convolutional blocks. Each convolutional block in this 2-D CNN has two convolutional layers, and 32, 64, and 128 filters are used for the convolutional layers of those three blocks, respectively. Overall, we keep the architecture of the 2-D CNN and that of the proposed network consistent.
6) SpecAttenNet: The proposed spectral attention module-based convolutional network (cf. Table I).

Note that, in order to make our model completely comparable with the other investigated approaches, we use the standard training and test sets for the Indian Pines, Pavia University, and Houston data sets. For the Salinas scene, training samples are generated by simple random sampling. For all data sets, 10% of the samples of the training set are randomly selected as validation samples. In other words, in the network training phase, we use 90% of the samples of the training set to iteratively update and optimize the network weights and the remaining ones as validation to tune the hyperparameters of the networks. Prior to training, we normalize each channel of the hyperspectral data to the range between 0 and 1. In addition, the network architecture is the same for all data sets.

C. Ablation Study

To validate the effectiveness of the proposed module, we perform ablation experiments. As mentioned above, the competitor 2-D CNN is a network that has no spectral attention module, but whose other parts are the same as in the proposed SpecAttenNet. From Tables VI–IX, we can see that SpecAttenNet outperforms the 2-D CNN on all indexes on all four data sets. Specifically, SpecAttenNet increases accuracies significantly by 7.46% of OA, 4.75% of AA, and 0.0849 of Kappa coefficient on the Indian Pines data set; by 2.21% of OA, 1.28% of AA, and 0.0293 of Kappa coefficient on the Pavia University data set; by 2.76% of OA, 2.87% of AA, and 0.0303 of Kappa coefficient on the Salinas scene; and by 3.1% of OA, 4.93% of AA, and 0.0333 of Kappa coefficient on the Houston scene. This shows that the recalibrated spectral bands obtained by our gating mechanism become more separable for a spectral-spatial classification network, as informative bands have been emphasized and less useful ones have been suppressed.

D. Results and Discussion

Tables VI–IX give information about the per-class accuracies, OAs, AAs, and kappa coefficients obtained by various spectral and spectral-spatial classification methods on the four data sets.

TABLE VI. Accuracy Comparisons for the Indian Pines Scene. Bold Numbers Indicate the Best Performance

TABLE VII. Accuracy Comparisons for the Pavia University Scene. Bold Numbers Indicate the Best Performance

TABLE VIII. Accuracy Comparisons for the Salinas Data. Bold Numbers Indicate the Best Performance

For spectral classification approaches, CCF-200 outperforms RF-200 and SVM-RBF. With respect to the obtained classification results, deep networks, including SICNN, 2-D CNN, and the proposed SpecAttenNet, show better performance than "shallow" models (i.e., random forest, SVM, and CCF) in regard to OA and kappa coefficient, mainly because: 1) they are capable of extracting hierarchical, deep feature representations and 2) spatial information can be fully exploited in them.

TABLE IX. Accuracy Comparisons for the Houston Scene. Bold Numbers Indicate the Best Performance

TABLE X. Assessments of the Significance of Classification Accuracies of the Proposed Method Compared to Other Investigated Approaches for the Four Data Sets

Fig. 5. Classification maps of different approaches for the Salinas data set. (Left to right) True-color composite of the hyperspectral image, reference data, RF-200, SVM-RBF, CCF-200, 2-D CNN, and SpecAttenNet. Best zoomed-in view.

These two properties make the deep networks more robust in finding appropriate decision boundaries and enable the models to handle nonlinearly separable data more efficiently.

On the other hand, in comparison with SICNN, which selects the most informative spectral bands as inputs of a CNN using a band selection approach, SpecAttenNet is capable of achieving accuracy increments of 7.09%, 2.69%, and 0.0797 for OA, AA, and Kappa coefficient, respectively, on the Indian Pines scene. Regarding the Pavia University scene, the accuracy increments on OA, AA, and Kappa coefficient are, respectively, 3.89%, 2.38%, and 0.0494. This observation reveals that, compared to conventional band selection methods, our data- and task-driven spectral attention mechanism can offer better results.

Table X demonstrates the results of McNemar's test, in which we compare our method with the other competitors in terms of the significance of the difference between their classification results. We can see that, on all data sets, the improvement in accuracy yielded by our approach is statistically significant as compared with the other methods.

Figs. 3–5 show the classification maps produced by RF-200, SVM-RBF, CCF-200, SICNN, 2-D CNN, and SpecAttenNet on three scenes. As displayed in these figures, spectral classifiers (i.e., random forest, SVM, and CCF) lead to salt-and-pepper noise in the classification maps, while this issue is addressed in the spectral-spatial classification networks (SICNN, 2-D CNN, and SpecAttenNet) by removing noisy scattered points of misclassification.

Moreover, we observe that the use of the spectral attention module alleviates the problem of misclassification. For instance, misclassification in the Indian Pines data set lies in similar objects (with extremely similar spectral characteristics), such as Alfalfa and Hay-windrowed. SpecAttenNet achieves the best average accuracy of 89.625% on these two classes, while the second best average accuracy is only 74.68%, as obtained by SICNN.

E. Analysis of the Spectral Attention Module

One challenge in hyperspectral data classification is that, due to the complex light scattering mechanism, some pixels of a hyperspectral image that belong to the same land cover class have different spectral signatures. Therefore, an approach that is capable of making the spectral signals of those pixels more similar should be able to offer a more accurate classification result. Here, to quantitatively verify the effectiveness of the spectral attention module, an index called the within-class similarity measure is used.

Fig. 6. Visualization by t-SNE [70] of original samples of the Pavia University data set and of samples recalibrated by the spectral attention module. Different colors represent different categories. As shown in this figure, after the attention module, samples of some classes (e.g., class 2 and class 6) gather together and form several groups, which means the outputs of the module are more useful for tasks like classification. This is mainly because, by making use of the proposed gating mechanism, bands that provide discriminative information are emphasized, while the others are suppressed.

Fig. 7. Average reflectance spectrum and average spectral gates of each class on the Pavia University data set.

The within-class similarity measure is defined as the trace of the within-class scatter matrix, which can be calculated as follows:

$$S_w = \sum_{c} \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^{T} \tag{13}$$

where

$$\mu_c = \frac{1}{N_c} \sum_{i \in c} x_i \tag{14}$$

and $N_c$ denotes the amount of test data belonging to the c-th category.

Table XI reports the calculated within-class similarity measures of features before and after the spectral attention module in our network on the Indian Pines, Pavia University, and Salinas data sets. We can observe that the recalibrated spectra (i.e., the outputs of the spectral attention module) in the same category have higher similarity.
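The measure in (13) and (14) can be computed without forming the C × C scatter matrix explicitly, since its trace equals the sum of squared deviations from the class means. A small sketch (our own illustration, with an invented toy input):

```python
import numpy as np

def within_class_similarity(features, labels):
    """Trace of the within-class scatter matrix S_w in (13) and (14).
    `features` has shape (N, C), `labels` shape (N,); smaller values
    mean spectra of the same class are more similar."""
    total = 0.0
    for c in np.unique(labels):
        xc = features[labels == c]
        mu = xc.mean(axis=0)              # class mean mu_c from (14)
        # tr(sum_i d_i d_i^T) == sum_i ||d_i||^2 with d_i = x_i - mu_c.
        total += ((xc - mu) ** 2).sum()
    return total

# Toy example: two classes of 1-D "spectra".
feats = np.array([[1.0], [1.2], [5.0], [5.4]])
labs = np.array([0, 0, 1, 1])
print(within_class_similarity(feats, labs))   # 0.02 + 0.08 = 0.1
```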

Fig. 8. Average reflectance spectrum of each class and learned spectral gates on the Indian Pines data set.

TABLE XI. Within-Class Similarity Measures of Features Before and After the Spectral Attention Module on the Indian Pines, Pavia University, and Salinas Data Sets. Smaller Is Better

Hence, the results demonstrate that the recalibrated spectra are more discriminative.

Furthermore, we use the t-SNE [70] technique to visualize the spectra before and after this module on the Pavia University scene in Fig. 6. As shown in this figure, after the attention module, samples of some classes (e.g., class 2 and class 6) gather together and form several groups, which means the outputs of the module are more useful for tasks like classification. This is mainly because, by making use of the proposed gating mechanism, bands that provide discriminative information are emphasized, while the others are suppressed.

Since the designed spectral attention mechanism is data- and task-driven, according to (3), different inputs have different spectral gates. For each class, we calculate the average of the spectral gates of the test samples belonging to this class and name it the average spectral gate. Fig. 7 exhibits the average reflectance spectrum and the average spectral gate learned by our attention module for each class on the Pavia University scene. As shown in this figure, classes with similar spectral signatures (e.g., Gravel and Bricks) have extremely similar spectral gates, while these similar classes can still be differentiated in detail; for example, we can see that the activations of some gates on the Gravel class and the Bricks class are different.

In Fig. 8, we also display the average reflectance spectrum of each class and the learned spectral gates on the Indian Pines data set. Note that since the spectral gates of all test samples learned on this scene are almost the same, we visualize the average spectral gate of all samples instead of that of each class. Interestingly, the learned spectral gate on this data set is nearly completely binary and quite different from the gates on the Pavia University scene. From Fig. 8, we can observe that the spectral attention module mainly pays attention to spectral bands that provide visual cues to distinguish different categories.

IV. CONCLUSION

This work proposed a simple, yet effective end-to-end trainable spectral attention module to make a spectral-spatial classification CNN learn a channel attention mechanism, i.e., how to pay attention to the spectral domain, for hyperspectral image classification. Our spectral attention module enhances the network by learning the importance of spectral bands with a gating mechanism and performing a dynamic band-wise recalibration, which improves not only the representational capacity but also the interpretability of the network. Extensive experiments validate the effectiveness of our network.

In the future, we will carry out further research and try to figure out the band importance induced by the spectral attention module, which may be helpful to related fields, e.g., band selection and hyperspectral data classification network pruning for model compression.

REFERENCES

[1] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[2] S. R. Joelsson, J. A. Benediktsson, and J. R. Sveinsson, "Random forest classifiers for hyperspectral data," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2005, p. 4.
[3] P. O. Gislason, J. A. Benediktsson, and J. R. Sveinsson, "Random Forests for land cover classification," Pattern Recognit. Lett., vol. 27, no. 4, pp. 294–300, Mar. 2006.
[4] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[5] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Mach. Learn., vol. 46, nos. 1–3, pp. 389–422, 2002.
[6] F. Melgani and L. Bruzzone, "Classification of hyperspectral remote sensing images with support vector machines," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 8, pp. 1778–1790, Aug. 2004.

[7] B. Waske and J. A. Benediktsson, "Fusion of support vector machines for classification of multisensor data," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 12, pp. 3858–3866, Dec. 2007.
[8] B. Waske, S. van der Linden, J. Benediktsson, A. Rabe, and P. Hostert, "Sensitivity of support vector machines to random feature selection in classification of hyperspectral data," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 7, pp. 2880–2889, Jul. 2010.
[9] G. F. Elsayed, D. Krishnan, H. Mobahi, K. Regan, and S. Bengio, "Large margin deep networks for classification," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2018, pp. 850–860.
[10] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2014, pp. 3104–3112.
[11] Y. Kim, "Convolutional neural networks for sentence classification," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Oct. 2014, pp. 1746–1751.
[12] K. Cho, B. van Merriënboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1–15.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1–9.
[14] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. IEEE Int. Conf. Learn. Represent. (ICLR), Apr. 2015, pp. 1–14.
[15] Y. Yuan, L. Mou, and X. Lu, "Scene recognition by manifold regularized deep learning architecture," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 10, pp. 2222–2233, Oct. 2015.
[16] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[18] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2261–2269.
[19] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[20] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," Jun. 2016, arXiv:1606.00915. [Online]. Available: https://arxiv.org/abs/1606.00915
[21] Q. Li, L. Mou, Q. Liu, Y. Wang, and X. X. Zhu, "HSF-Net: Multiscale deep feature embedding for ship detection in optical remote sensing imagery," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 12, pp. 7147–7161, Dec. 2018.
[22] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2015, pp. 91–99.
[23] L. Mou and X. X. Zhu, "Vehicle instance segmentation from aerial image and video using a multitask learning residual fully convolutional network," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 11, pp. 6699–6711, Nov. 2018.
[24] A. Miech, I. Laptev, and J. Sivic, "Learnable pooling with context gating for video classification," Jun. 2017, arXiv:1706.06905. [Online]. Available: https://arxiv.org/abs/1706.06905
[25] L. Mou and X. X. Zhu, "IM2HEIGHT: Height estimation from single monocular imagery via fully residual convolutional-deconvolutional network," Feb. 2018, arXiv:1802.10249. [Online]. Available: https://arxiv.org/abs/1802.10249
[26] X. X. Zhu et al., "Deep learning in remote sensing: A comprehensive review and list of resources," IEEE Geosci. Remote Sens. Mag., vol. 5, no. 4, pp. 8–36, Dec. 2017.
[27] N. Kussul, M. Lavreniuk, S. Skakun, and A. Shelestov, "Deep learning classification of land cover and crop types using remote sensing data," IEEE Geosci. Remote Sens. Lett., vol. 14, no. 5, pp. 778–782, May 2017.
[28] W. Song, S. Li, L. Fang, and T. Lu, "Hyperspectral image classification with deep feature fusion network," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 6, pp. 3173–3184, Jun. 2018.
[29] Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi, "Deep feature extraction and classification of hyperspectral images based on convolutional neural networks," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 10, pp. 6232–6251, Oct. 2016.
[30] Y. Li, H. Zhang, and Q. Shen, "Spectral–spatial classification of hyperspectral imagery with 3-D convolutional neural network," Remote Sens., vol. 9, no. 1, p. 67, 2017.
[31] X. Lu, W. Zhang, and X. Li, "A hybrid sparsity and distance-based discrimination detector for hyperspectral images," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 3, pp. 1704–1717, Mar. 2018.
[32] W. Zhao and S. Du, "Spectral–spatial feature extraction for hyperspectral image classification: A dimension reduction and deep learning approach," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 8, pp. 4544–4554, Aug. 2016.
[33] P. Ghamisi, Y. Chen, and X. X. Zhu, "A self-improving convolution neural network for the classification of hyperspectral data," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 10, pp. 1537–1541, Oct. 2016.
[34] W. Li, G. Wu, F. Zhang, and Q. Du, "Hyperspectral image classification using deep pixel-pair features," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 844–853, Feb. 2017.
[35] A. Romero, C. Gatta, and G. Camps-Valls, "Unsupervised deep feature extraction for remote sensing image classification," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 3, pp. 1349–1362, Mar. 2016.
[36] X. Lu, X. Zheng, and Y. Yuan, "Remote sensing scene classification by unsupervised representation learning," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 9, pp. 5148–5157, Sep. 2017.
[37] L. Mou, P. Ghamisi, and X. X. Zhu, "Unsupervised spectral–spatial feature learning via deep residual Conv–Deconv network for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 1, pp. 391–406, Jan. 2018.
[38] M. E. Paoletti, J. M. Haut, J. Plaza, and A. Plaza, "Deep&dense convolutional neural network for hyperspectral image classification," Remote Sens., vol. 10, no. 9, p. 1454, 2018.
[39] M. E. Paoletti, J. M. Haut, R. Fernandez-Beltran, J. Plaza, A. J. Plaza, and F. Pla, "Deep pyramidal residual networks for spectral–spatial hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 2, pp. 740–754, Feb. 2019. doi: 10.1109/TGRS.2018.2860125.
[40] L. Mou, P. Ghamisi, and X. X. Zhu, "Fully conv-deconv network for unsupervised spectral-spatial feature extraction of hyperspectral imagery via residual learning," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2017, pp. 5181–5184.
[41] X. Ma, A. Fu, J. Wang, H. Wang, and B. Yin, "Hyperspectral image classification based on deep deconvolution network with skip architecture," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 8, pp. 4781–4791, Aug. 2018.
[42] X. Lu, B. Wang, X. Zheng, and X. Li, "Exploring models and data for remote sensing image caption generation," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 2183–2195, Apr. 2018.
[43] L. Mou, P. Ghamisi, and X. X. Zhu, "Deep recurrent neural networks for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 7, pp. 3639–3655, Jul. 2017.
[44] L. Mou, L. Bruzzone, and X. X. Zhu, "Learning spectral-spatial-temporal features via a recurrent convolutional neural network for change detection in multispectral imagery," Mar. 2018, arXiv:1803.02642. [Online]. Available: https://arxiv.org/abs/1803.02642
[45] M. Rußwurm and M. Körner, "Temporal vegetation modelling using long short-term memory networks for crop identification from medium-resolution multi-spectral satellite images," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR) Workshop, Jul. 2017, pp. 1496–1504.
[46] H. Lyu, H. Lu, and L. Mou, "Learning a transferable change rule from a recurrent neural network for land cover change detection," Remote Sens., vol. 8, no. 6, p. 506, 2016.
[47] H. Lyu et al., "Long-term annual mapping of four cities on different continents by applying a deep information learning method to Landsat data," Remote Sens., vol. 10, no. 3, p. 471, 2018.
[48] H. Wu and S. Prasad, "Convolutional recurrent neural networks for hyperspectral data classification," Remote Sens., vol. 9, no. 3, p. 298, 2017.
[49] Y. Hua, L. Mou, and X. X. Zhu, "Recurrently exploring class-wise attention in a hybrid convolutional and bidirectional LSTM network for multi-label aerial image classification," ISPRS J. Photogramm. Remote Sens., vol. 149, pp. 188–199, Mar. 2019.
[50] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4489–4497.
[51] M. E. Paoletti, J. M. Haut, J. Plaza, and A. Plaza, "A new deep convolutional neural network for fast hyperspectral image classification," ISPRS J. Photogramm. Remote Sens., vol. 145, pp. 120–147, Nov. 2018.

[52] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2017, pp. 1–11.
[53] M. E. Paoletti et al., "Capsule networks for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 4, pp. 2145–2160, Apr. 2019.
[54] F. I. Alam, J. Zhou, A. W.-C. Liew, X. Jia, J. Chanussot, and Y. Gao, "Conditional random field and deep feature learning for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 3, pp. 1612–1628, Mar. 2019.
[55] C. Deng, Y. Xue, X. Liu, C. Li, and D. Tao, "Active transfer learning network: A unified deep joint spectral–spatial feature learning model for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 3, pp. 1741–1754, Mar. 2019.
[56] J. Wang, W. Jiang, L. Ma, W. Liu, and Y. Xu, "Bidirectional attentive fusion with context gating for dense video captioning," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7190–7198.
[57] H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan, "End-to-end comparative attention networks for person re-identification," IEEE Trans. Image Process., vol. 26, no. 7, pp. 3492–3506, Jul. 2017.
[58] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7132–7141.
[59] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[60] A. Graves, "Generating sequences with recurrent neural networks," Aug. 2013, arXiv:1308.0850. [Online]. Available: https://arxiv.org/abs/1308.0850
[61] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," in Proc. 8th Workshop Syntax, Semantics Struct. Stat. Transl. (SSST), Oct. 2014, pp. 103–167.
[62] Y. Gal and Z. Ghahramani, "A theoretically grounded application of dropout in recurrent neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2016, pp. 1019–1027.
[63] J. M. Haut, M. E. Paoletti, J. Plaza, A. Plaza, and J. Li, "Visual attention-driven hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., to be published. doi: 10.1109/TGRS.2019.2918080.
[64] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), 2010, pp. 249–256.
[65] T. Dozat. Incorporating Nesterov Momentum Into Adam. Accessed: Sep. 22, 2019. [Online]. Available: http://cs229.stanford.edu/proj2015/054_report.pdf
[66] Y. LeCun et al., "Backpropagation applied to handwritten zip code recognition," Neural Comput., vol. 1, no. 4, pp. 541–551, Dec. 1989.
[67] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. IEEE Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–15.
[68] T. Rainforth and F. Wood, "Canonical correlation forests," Jul. 2015, arXiv:1507.05444. [Online]. Available: https://arxiv.org/abs/1507.05444
[69] J. Xia, N. Yokoya, and A. Iwasaki, "Hyperspectral image classification with canonical correlation forests," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 1, pp. 421–431, Jan. 2017.
[70] L. van der Maaten, "Accelerating t-SNE using tree-based algorithms," J. Mach. Learn. Res., vol. 15, no. 1, pp. 3221–3245, Oct. 2014.

Lichao Mou (S'16) received the bachelor's degree in automation from the Xi'an University of Posts and Telecommunications, Xi'an, China, in 2012, and the master's degree in signal and information processing from the University of Chinese Academy of Sciences (UCAS), Beijing, China, in 2015. He is currently pursuing the Ph.D. degree with the German Aerospace Center (DLR), Wessling, Germany, and also with the Technical University of Munich (TUM), Munich, Germany.

In 2015, he spent six months at the Computer Vision Group, University of Freiburg, Freiburg im Breisgau, Germany. In 2019, he was a Visiting Researcher with the University of Cambridge, Cambridge, U.K. His research interests include remote sensing, computer vision, and machine learning, especially deep networks and their applications in remote sensing.

Mr. Mou was a recipient of the first place in the 2016 IEEE GRSS Data Fusion Contest and a finalist for the Best Student Paper Award at the 2017 Joint Urban Remote Sensing Event and the 2019 Joint Urban Remote Sensing Event.

Xiao Xiang Zhu (S'10–M'12–SM'14) received the master's (M.Sc.), D.E. (Dr.-Ing.), and Habilitation degrees in signal processing from the Technical University of Munich (TUM), Munich, Germany, in 2008, 2011, and 2013, respectively.

She was a Guest Scientist or a Visiting Professor with the Italian National Research Council (CNR-IREA), Naples, Italy, in 2009; Fudan University, Shanghai, China, in 2014; The University of Tokyo, Tokyo, Japan, in 2015; and the University of California at Los Angeles, Los Angeles, CA, USA, in 2016. Since 2019, she has been co-coordinating the Munich Data Science Research School. She is also leading the Helmholtz Artificial Intelligence Cooperation Unit (HAICU), Research Field "Aeronautics, Space and Transport." She is currently a Professor of signal processing in earth observation with the Technical University of Munich (TUM) and also with the German Aerospace Center (DLR); also the Head of the Department "EO Data Science," DLR's Earth Observation Center; and also the Head of the Helmholtz Young Investigator Group "SiPEO," DLR and TUM. Her research interests include remote sensing and Earth observation, signal processing, machine learning, and data science, with a special application focus on global urban mapping.

Dr. Zhu is a member of the young academy (Junge Akademie/Junges Kolleg) at the Berlin-Brandenburg Academy of Sciences and Humanities, the German National Academy of Sciences Leopoldina, and the Bavarian Academy of Sciences and Humanities. She is an Associate Editor of the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING.
