Paper 2
Abstract: Over the past few years, hyperspectral image classification using convolutional neural networks (CNNs) has progressed significantly. In spite of their effectiveness, given that hyperspectral images are of high dimensionality, CNNs can be hindered by their modeling of all spectral bands with the same weight, as probably not all bands are equally informative and predictive. Moreover, the use of uninformative spectral bands in CNNs may even introduce noise and weaken the performance of networks. To boost the representational capacity of CNNs for spectral-spatial hyperspectral data classification, in this work we improve networks by discriminating the significance of different spectral bands. We design a network unit, termed the spectral attention module, that makes use of a gating mechanism to adaptively recalibrate spectral bands by selectively emphasizing informative bands and suppressing less useful ones. We theoretically analyze and discuss why such a spectral attention module helps a CNN in hyperspectral image classification. We demonstrate through extensive experiments that, in comparison with state-of-the-art approaches, spectral attention module-based convolutional networks offer competitive results. Furthermore, this work sheds light on how a CNN interacts with spectral bands for the purpose of classification.

Index Terms: Attention module, convolutional neural network (CNN), gating mechanism, hyperspectral image classification.

I. INTRODUCTION

Consequently, numerous kinds of classification approaches, especially supervised models, have been developed for hyperspectral data classification, as found in the literature. Among them, random forest [1]–[3] and support vector machine (SVM) [4]–[8] are two examples of supervised classification approaches that have been exploited for solving a wide variety of classification problems. A random forest is essentially an ensemble bagging (averaging) algorithm. It creates a set of decision trees using random subsamples of the training data and then aggregates their predictions via a maximum a posteriori (MAP) rule or voting to decide the final classes of test samples. An SVM, on the other hand, seeks a hyperplane that separates two-class data with the largest margin. However, the random forest and the SVM are characterized as “shallow” models [9] as compared to deep networks, which are able to extract hierarchical, deep feature representations.

Deep learning, which is mainly characterized by deep networks, has been quite successful in solving a wide range of problems (e.g., natural language processing [10]–[12], computer vision [13]–[25], and remote sensing [26]).

In the hyperspectral community, some studies have been published recently on the use of convolutional neural networks (CNNs) [27]–[42] as well as recurrent neural networks (RNNs) [43]–[49] for pattern recognition tasks. For
feature extraction. Ghamisi et al. [33] first exploited a computational intelligence (particle swarm optimization) method to choose informative spectral bands and then trained a 2-D CNN using the selected bands. In [34], to properly train a CNN with limited ground truth data, the authors devised a pixel-pair CNN that takes as input a pair of hyperspectral pixels. By doing so, the amount of training data is greatly augmented. Furthermore, in order to exploit the huge amount of unlabeled hyperspectral data, unsupervised feature learning via a CNN is of great interest. Romero et al. [35] presented a CNN to address the problem of unsupervised spectral-spatial feature extraction and estimated network weights via a sparse learning approach in a greedy layer-wise fashion. Mou et al. [37] proposed a residual learning-based fully conv-deconv network, aiming at unsupervised spectral-spatial feature learning in an end-to-end manner. Better classification network architectures from computer vision (e.g., ResNet [17], DenseNet [18], and CapsuleNet [52]) also provide new insights into hyperspectral image classification [37]–[39], [53]. Moreover, the integration of networks with traditional machine learning models, e.g., conditional random fields (CRFs) and active learning, has also received attention recently [54], [55].

The unique asset of hyperspectral images is their rich spectral content in comparison with high-resolution aerial images and natural images in the computer vision field. Although a number of works have already focused on using CNNs for hyperspectral data classification, we notice that the following questions have not been well addressed in the community until now.
1) Do all spectral bands contribute equally to a CNN for classification tasks?
2) If not, how can informative bands that help hyperspectral data classification be found in a task-driven manner within an end-to-end network?
3) Is it possible to improve the classification results of a CNN by emphasizing informative bands and suppressing less useful ones in the network?

These questions give us an incentive to devise a novel network called the spectral attention module-based convolutional network for hyperspectral image classification. Inspired by recent advances in the attention mechanism of networks [56]–[58], which enables feature interactions to contribute differently to predictions, we design a channel attention mechanism for analyzing the significance of different spectral bands and recalibrating them. More importantly, the significance analysis is automatically learned from tasks and hyperspectral data in an end-to-end network without any human domain knowledge. Experiments show that the use of the proposed spectral attention module in a CNN for hyperspectral data classification brings two benefits: it not only offers better performance but also provides an insight into which spectral bands contribute more to predictions. This work’s contributions are threefold.
1) We propose a learnable spectral attention module that explicitly allows the spectral manipulation of hyperspectral data within a CNN. This attention module exploits the global spectral-spatial context for producing a series of spectral gates which reflects the significance of spectral bands. The spectral information recalibrated by these spectral gates can effectively improve the classification results.
2) We analyze and discuss why the proposed spectral attention module is able to offer better classification results from a theoretical perspective by diving into the backward propagation of the network. As far as we know, learning and analyzing such a spectral attention-based network for hyperspectral image classification has not been done yet.
3) We conduct experiments on four benchmark data sets. The empirical results demonstrate that our spectral attention module-based convolutional network is capable of offering competitive classification results, particularly in the situation of high dimensionality and inadequate training data.

The remainder of this article is organized as follows. After detailing hyperspectral image classification using CNNs in Section I, Section II introduces the proposed spectral attention module-based convolutional network. Section III verifies the proposed approach and presents the corresponding analysis and discussion. Finally, Section IV concludes the article.

II. METHODOLOGY

A. Problem Formulation

The spectral attention module in our model transforms a patch $x$ of a hyperspectral image into a new representation $z$ via the following mapping:

$$F: x \rightarrow z \tag{1}$$

where $x, z \in \mathbb{R}^{H \times W \times C}$.

Our aim is to strengthen the representational capacity of a spectral-spatial classification network through explicitly modeling the significance of spectral bands. Therefore, we instantiate $F$ as

$$z = x \odot g \tag{2}$$

where $\odot$ is a channel-wise multiplication operation and $g \in \mathbb{R}^{C}$ represents a set of spectral gates applied to individual spectral bands of the patch $x$.

The motivation behind (2) is that we wish to make use of a gating mechanism to recalibrate the strength of different spectral bands of the input, i.e., selectively emphasize useful bands and suppress less informative ones, for image classification problems.

Fig. 1 illustrates the architecture of the spectral attention module-equipped convolutional network.

B. Modeling of Spectral Attention Module

The gating mechanism has been widely used in modeling and processing temporal sequences. For example, long short-term memory (LSTM)-based networks [59], [60] harness three gates to cope with vanishing gradients. Similarly, a gated recurrent unit (GRU) [61], [62] is designed to implement the modulation of information flow through the gating mechanism.
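The channel-wise multiplication in (2) can be illustrated with a minimal NumPy sketch. The function name `apply_spectral_gates` and the toy values are assumptions made for illustration only; in the proposed module the gates are learned rather than fixed by hand.

```python
import numpy as np

def apply_spectral_gates(x, g):
    """Channel-wise multiplication z = x (.) g, as in (2).

    x: patch of shape (H, W, C); g: gate vector of shape (C,).
    """
    x = np.asarray(x, dtype=float)
    g = np.asarray(g, dtype=float)
    if x.ndim != 3 or g.shape != (x.shape[-1],):
        raise ValueError("expected x of shape (H, W, C) and g of shape (C,)")
    # Broadcasting scales every pixel of band c by the same scalar g[c].
    return x * g

# Hypothetical toy patch: 2 x 2 pixels, 3 bands, all ones.
x = np.ones((2, 2, 3))
g = np.array([1.0, 0.5, 0.0])  # keep band 0, halve band 1, suppress band 2
z = apply_spectral_gates(x, g)
```

Because a single scalar per band is applied, the spatial structure inside each band is preserved and only its overall strength changes.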
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
Fig. 1. The overall architecture of the proposed gating mechanism, spectral attention module, for hyperspectral classification problems. We would like to
exploit this module to learn and recalibrate strengths of different spectral bands, i.e., selectively emphasize useful bands and suppress less informative ones,
for image classification problems. To this end, we first learn a set of spectral gates by using global convolution and then apply them to individual spectral
bands. Moreover, in Section II-C, we theoretically analyze and discuss why the proposed spectral attention module can help a spectral-spatial classification
network (e.g., a 2-D CNN) for hyperspectral image classification tasks.
In addition, several recent works in computer vision have shown the benefit of introducing the gating mechanism to vision problems. To name a few, Wang et al. [56] proposed a gating mechanism that is capable of dynamically balancing the contributions of the current event and its surrounding contexts in their model for dense video captioning tasks. Hu et al. [58] built a gated block for image classification tasks and demonstrated its good performance on large-scale image recognition. Liu et al. [57] addressed person re-identification tasks by utilizing a network module based on a soft gating mechanism, which enables the network to concentrate adaptively on significant local regions of an input image pair. In remote sensing, a very recently published, parallel work related to this article can be found in [63], where the authors introduced a visual attention technique that first calculates a mask and then applies it to features produced by a ResNet for hyperspectral data classification tasks.

Here, we would like to design our own gating mechanism, the spectral attention module, for analyzing the significance of different spectral bands and recalibrating them. Besides, we hope this module is task-driven and can be adaptively learned in an end-to-end spectral-spatial classification network. To this end, we need a way to aggregate the spectral-spatial information. The convolution operation is an ideal candidate, as 1) it is able to spatially shrink the input patch and 2) its differentiability allows end-to-end learning. In general, a convolutional filter operates with a local receptive field (e.g., 3 × 3 in the VGG-16 network), so the output is not capable of utilizing contextual information outside of this region. This is a severe issue for our case because the spectral gates $g$ in our model are expected to be derived from the whole spectral-spatial information. To tackle this problem, we distill global spatial information into the spectral gates by using global convolution. Formally, let $f = [f^1, f^2, \cdots, f^C]$ denote a set of convolutional filters whose spatial size is $H \times W$, the same as that of $x$, where $f^c$ refers to the $c$-th filter. Thus, the $c$-th spectral gate $g_c$ can be calculated as follows:

$$g_c = x * f^c = \sum_{i=1}^{C} x_i * f_i^c \tag{3}$$

where $*$ represents convolution and $f_i^c$ and $x_i$ are the $i$-th channels of the $c$-th filter and of $x$, respectively. Taking into account that the field of view of global convolution is equal to the spatial size of $x$, $g_c$ is actually calculated by the inner product of $x_i$ and $f_i^c$ (both $x_i$ and $f_i^c$ are vectorized into columns), i.e., (3) can be rewritten as follows:

$$g_c = \sum_{i=1}^{C} \langle x_i, f_i^c \rangle = \sum_{i=1}^{C} x_i^T f_i^c. \tag{4}$$

From (4), the spectral gates $g$ can be considered as a series of global descriptors, which are capable of representing spectral-spatial features of $x$.

Thus, according to (2), we can associate the $c$-th spectral gate $g_c$ with the $c$-th spectral band of $x$ to obtain the recalibrated $z_c$ via

$$z_c = x_c \, g_c = x_c \sum_{i=1}^{C} x_i^T f_i^c. \tag{5}$$

So far, we have obtained an initial spectral attention module [as shown in (5)], but there still exist three issues that we should address:
1) Given the complex spectral-spatial properties of hyperspectral images, we wish the spectral gates in this module to be capable of learning a nonlinear mapping, instead of a linear one, from the input.
2) The attention module should model a non-mutually exclusive relationship between spectral bands, as we would like to ensure that multiple bands can be emphasized at the same time (unlike one-hot activation in softmax).
Fig. 2. Example showing how the proposed spectral attention module in a CNN corrects a wrong prediction (gravel) to the right one (bricks) via learned spectral gates.
3) The gates should be bounded (e.g., between 0 and 1), easily differentiable, and monotonic (good for convex optimization).

To meet these three requirements, we modify the spectral gates in the initial spectral attention module as follows:

$$g_c = \frac{1}{1 + \exp(-x * f^c)} = \frac{1}{1 + \exp\left(-\sum_{i=1}^{C} x_i^T f_i^c\right)}. \tag{6}$$

Hence, the final version of the spectral attention module can be written as

$$z_c = x_c \, \frac{1}{1 + \exp\left(-\sum_{i=1}^{C} x_i^T f_i^c\right)}. \tag{7}$$

Fig. 2 is an example showing how the proposed spectral attention module works in a CNN.

C. Why Does the Spectral Attention Module Work?

In our experiments, we observed that a 2-D CNN with our spectral attention module can offer better classification results. However, how exactly does this attention module help a spectral-spatial classification network for hyperspectral data classification? We dive into the backward propagation of the network to seek the answer to this question.

For notational simplicity, we subsequently drop the subscript $c$ and rewrite the final expression of the spectral attention module as follows:

$$z = x \, \frac{1}{1 + \exp(-x * f)}. \tag{8}$$

Thus, the gradient of the spectral attention module can be written as

$$\nabla z = \nabla x \, \frac{1}{1 + \exp(-x * f)} + x \, \nabla\!\left(\frac{1}{1 + \exp(-x * f)}\right). \tag{9}$$

It can be seen that the term $\nabla x$ is weighted by the spectral gates $1/(1 + \exp(-x * f))$. This has the following interesting properties.
1) On the one hand, the existence of $\nabla x$ ensures that the gradient information on spectral-spatial features can be backpropagated directly, which helps to prevent the vanishing gradient problem.
2) On the other hand, for spectral bands where the spectral gates are close to 0 (less useful bands), the gradient propagation vanishes; on the contrary, for gate values that are close to 1, gradients (of informative bands) propagate directly from $z$ to $x$.

For the first point, a similar effect can be found in residual learning. He et al. [17] introduced residual learning into CNNs for large-scale image classification tasks and exhibited significantly improved network training characteristics, e.g., allowing network depths that were previously unattainable. Formally, denote by $y$ a random variable representing the output of a residual block. It can then be expressed as

$$y = x + F(x; w) \tag{10}$$

where $F$ is a residual function, usually implemented by a couple of stacked convolutional layers, and $w$ represents the learnable weights of this residual block. The gradient of a residual block can be calculated as

$$\nabla y = \nabla x + \nabla(F(x; w)). \tag{11}$$
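The sigmoid-gated module of (6)-(7) and the product-rule decomposition of (9) can be checked numerically with a small NumPy sketch. This is only an illustration under stated assumptions: the function names, the toy sizes, and the random filters are hypothetical, the filter bank is stored as an array of shape (C, H, W, C), and the global convolution is evaluated directly as the inner product of (4) rather than through a deep learning framework.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gate_logits(x, f):
    """Global convolution of (3)-(4): s[c] = sum_i <x_i, f_i^c>.

    x: (H, W, C) patch; f: (C, H, W, C) bank of C filters, each as large as x.
    """
    return np.tensordot(f, x, axes=([1, 2, 3], [0, 1, 2]))

def spectral_attention(x, f):
    """Final module of (6)-(7): z_c = x_c * sigmoid(s[c])."""
    g = sigmoid(gate_logits(x, f))
    return x * g, g

def directional_derivative(x, f, d):
    """Product rule of (9), evaluated along a perturbation d of x."""
    g = sigmoid(gate_logits(x, f))
    # Chain rule through the sigmoid; gate_logits is linear in its first argument.
    dg = g * (1.0 - g) * gate_logits(d, f)
    # First term: direct path weighted by the gates; second term: path through the gates.
    return d * g + x * dg

rng = np.random.default_rng(0)
H, W, C = 3, 3, 4                        # toy sizes, chosen for illustration
x = rng.normal(size=(H, W, C))
f = 0.1 * rng.normal(size=(C, H, W, C))  # random stand-in for learned filters
d = rng.normal(size=(H, W, C))

z, g = spectral_attention(x, f)

# Central finite difference as an independent check of the product rule.
eps = 1e-6
numeric = (spectral_attention(x + eps * d, f)[0]
           - spectral_attention(x - eps * d, f)[0]) / (2 * eps)
analytic = directional_derivative(x, f, d)
```

The first term of `directional_derivative` mirrors the observation made about (9): when a gate is close to 0 the direct contribution of that band is suppressed, and when it is close to 1 the perturbation passes through almost unchanged.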
TABLE I
CONFIGURATION OF A SPECTRAL ATTENTION MODULE-BASED CONVOLUTIONAL NETWORK FOR THE PAVIA UNIVERSITY DATA SET
D. Network Training

We insert the spectral attention module into a 2-D CNN (between the input and the first convolutional layer) and then train the whole network. Note that the spectral attention module and the other layers are trained simultaneously. We use the TensorFlow framework to implement and train networks. All network weights are initialized by a Glorot uniform initializer [64]. The Nesterov Adam [65] algorithm is chosen to optimize networks because, in our experiments, it provides much faster convergence than stochastic gradient descent (SGD) with momentum [66] or Adam [67]. Almost all parameters of this optimizer are set as recommended in [65]. We utilize a relatively small learning rate of 2e−04. Finally, we train networks on an NVIDIA Tesla P100 16 GB GPU. Table I exhibits an example of a CNN with the proposed attention module.

III. EXPERIMENTS AND ANALYSIS

A. Data Description

1) Indian Pines Hyperspectral Data Set: The first data set was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over Northwest Indiana, USA, in 1992. It includes 145 × 145 pixels with a 20 m/pixel spatial resolution and 200 spectral bands covering from 400 to 2500 nm after removing 20 water absorption channels (220, 150-163, and 104-108). The ground truth includes 16 classes of interest, which are mostly various crops in different growth phases and are detailed in Table II. Since these 16 classes have similar spectral signatures, the precise classification of this scene is hard. The true-color composite image and the available ground truth data can be found in Fig. 3 (black color in the ground truth indicates unknown samples).

2) Pavia University Hyperspectral Data Set: The second data set was acquired over the city of Pavia, Italy, in 2002 by an airborne instrument, the Reflective Optics Spectrographic Imaging System (ROSIS). The aircraft was operated by the German Aerospace Center (DLR) within the context of the European Union funded HySens project. The data set is made up of 610 × 340 pixels with a 1.3 m/pixel spatial resolution and 103 bands covering from 430 to 860 nm after removing 12 noisy channels. Besides unknown pixels, 9 classes are manually annotated in the reference data. Fig. 4 displays a composite image of this data set and its reference map. Table III offers information on all 9 categories.

3) Salinas Hyperspectral Data Set: The third data set was also gathered by the AVIRIS sensor, over the region of Salinas Valley, CA, USA, with a 3.7-m/pixel spatial resolution.
Fig. 3. Classification maps of different approaches for the Indian Pines data set. (Left to right) True-color composite image, training set, test set, RF-200,
SVM-RBF, CCF-200, SICNN, 2-D CNN, and SpecAttenNet. Best zoomed-in view.
Fig. 4. Classification maps of different approaches for the Pavia University data set. (Left to right) Composite image, training samples, ground truth, RF-200,
SVM-RBF, CCF-200, SICNN, 2-D CNN, and SpecAttenNet. Best zoomed-in view.
TABLE VI
ACCURACY COMPARISONS FOR THE INDIAN PINES SCENE. BOLD NUMBERS INDICATE THE BEST PERFORMANCE
TABLE VII
ACCURACY COMPARISONS FOR THE PAVIA UNIVERSITY SCENE. BOLD NUMBERS INDICATE THE BEST PERFORMANCE
TABLE VIII
ACCURACY COMPARISONS FOR THE SALINAS DATA. BOLD NUMBERS INDICATE THE BEST PERFORMANCE
and spectral-spatial classification methods on the four data sets. For spectral classification approaches, CCF-200 outperforms RF-200 and SVM-RBF. With respect to the obtained classification results, deep networks, including SICNN, 2-D CNN, and the proposed SpecAttenNet, show better performance than “shallow” models (i.e., random forest, SVM,
TABLE IX
ACCURACY COMPARISONS FOR THE HOUSTON SCENE. BOLD NUMBERS INDICATE THE BEST PERFORMANCE
TABLE X
ASSESSMENTS OF THE SIGNIFICANCE OF CLASSIFICATION ACCURACIES OF THE PROPOSED METHOD COMPARED TO OTHER INVESTIGATED APPROACHES FOR THE FOUR DATA SETS
Fig. 6. Visualization of original samples and recalibrated ones by the spectral attention module of the Pavia University data set by t-SNE [70]. Different
colors represent different categories. As shown in this figure, after the attention module, samples of some classes (e.g., class 2 and class 6) gather together
and come into several groups, which means outputs of the module are more useful for tasks like classification. This is mainly because by making use of the
proposed gating mechanism, bands that provide discriminative information are emphasized, while the others are suppressed.
Fig. 7. Average reflectance spectrum and average spectral gates of each class on the Pavia University data set.
within-class scatter matrix, which can be calculated as follows:

$$S_w = \sum_{c} \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^T \tag{13}$$

where

$$\mu_c = \frac{1}{N_c} \sum_{i \in c} x_i \tag{14}$$

and $N_c$ denotes the amount of test data belonging to the $c$-th category.

Table XI reports calculated within-class similarity measures of features before and after the spectral attention module in our network on both data sets. We can observe that recalibrated spectra (i.e., outputs of the spectral attention module) in the same category have higher similarity.
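The within-class scatter matrix of (13), with class means from (14), can be computed as in the following NumPy sketch. The function name and the four toy samples are assumptions for illustration; how the scatter matrix is turned into the similarity measure reported in Table XI is not covered here.

```python
import numpy as np

def within_class_scatter(X, y):
    """S_w of (13), using the class means of (14).

    X: (N, C) spectra, one row per sample; y: (N,) integer class labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    S_w = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        X_c = X[y == c]
        mu_c = X_c.mean(axis=0)   # class mean, (14)
        D = X_c - mu_c            # deviations from the class mean
        S_w += D.T @ D            # sum of outer products, (13)
    return S_w

# Hypothetical toy data: two classes, two samples each, two bands.
X = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 1.0], [5.0, 3.0]])
y = np.array([0, 0, 1, 1])
S_w = within_class_scatter(X, y)
```

Applying the same function to spectra before and after the attention module would give the two scatter matrices being compared; smaller within-class scatter corresponds to samples of a class lying closer to their class mean.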
Fig. 8. Average reflectance spectrum of each class and learned spectral gates on the Indian Pines data set.
Hence, the results demonstrate that the recalibrated spectra are more discriminative.

Furthermore, we use the t-SNE [70] technique to visualize spectra before and after this module on the Pavia University scene in Fig. 6. As shown in this figure, after the attention module, samples of some classes (e.g., class 2 and class 6) gather together and come into several groups, which means outputs of the module are more useful for tasks like classification. This is mainly because by making use of the proposed gating mechanism, bands that provide discriminative information are emphasized, while others are suppressed.

Since the designed spectral attention mechanism is data- and task-driven, according to (3), different inputs have different spectral gates. For each class, we calculate the average of the spectral gates of test samples belonging to this class and name it the average spectral gate. Fig. 7 exhibits the average reflectance spectrum and the average spectral gate learned by our attention module of each class on the Pavia University scene. As shown in this figure, classes with similar spectral signatures (e.g., Gravel and Bricks) have extremely similar spectral gates, while these similar classes can be differentiated in detail; for example, we can see that activations of some gates on the Gravel class and the Bricks class are different.

In Fig. 8, we also display the average reflectance spectrum of each class and learned spectral gates on the Indian Pines data set. Note that since spectral gates of all test samples learned on this scene are almost the same, we visualize the average spectral gate of all samples instead of each class.

IV. CONCLUSION

This work proposed a simple, yet effective end-to-end trainable spectral attention module to make a spectral-spatial classification CNN learn a channel attention mechanism, i.e., how to pay attention on the spectral domain, for hyperspectral image classification. Our spectral attention module enhances the network by learning the importance of spectral bands with a gating mechanism and performing a dynamic band-wise recalibration, which improves not only the representational capacity but also the interpretability of the network. Extensive experiments validate the effectiveness of our network.

In the future, we will carry out further research and try to figure out the band importance induced by the spectral attention module, which may be helpful to related fields, e.g., band selection and hyperspectral data classification network pruning for model compression.

REFERENCES

[1] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[2] S. R. Joelsson, J. A. Benediktsson, and J. R. Sveinsson, “Random forest classifiers for hyperspectral data,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2005, p. 4.
[3] P. O. Gislason, J. A. Benediktsson, and J. R. Sveinsson, “Random Forests for land cover classification,” Pattern Recognit. Lett., vol. 27, no. 4, pp. 294–300, Mar. 2006.
[4] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[5] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Mach. Learn., vol. 46, nos. 1–3, pp. 389–422, 2002.
[6] F. Melgani and L. Bruzzone, “Classification of hyperspectral remote sensing images with support vector machines,” IEEE Trans. Geosci. Remote Sens., vol. 42, no. 8, pp. 1778–1790, Aug. 2004.
[7] B. Waske and J. A. Benediktsson, “Fusion of support vector machines for classification of multisensor data,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 12, pp. 3858–3866, Dec. 2007.
[8] B. Waske, S. van der Linden, J. Benediktsson, A. Rabe, and P. Hostert, “Sensitivity of support vector machines to random feature selection in classification of hyperspectral data,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 7, pp. 2880–2889, Jul. 2010.
[9] G. F. Elsayed, D. Krishnan, H. Mobahi, K. Regan, and S. Bengio, “Large margin deep networks for classification,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2018, pp. 850–860.
[10] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2014, pp. 3104–3112.
[11] Y. Kim, “Convolutional neural networks for sentence classification,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Oct. 2014, pp. 1746–1751.
[12] K. Cho, B. van Merriënboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1–15.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1–9.
[14] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. IEEE Int. Conf. Learn. Represent. (ICLR), Apr. 2015, pp. 1–14.
[15] Y. Yuan, L. Mou, and X. Lu, “Scene recognition by manifold regularized deep learning architecture,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 10, pp. 2222–2233, Oct. 2015.
[16] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[18] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2261–2269.
[19] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[20] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” Jun. 2016, arXiv:1606.00915. [Online]. Available: https://arxiv.org/abs/1606.00915
[21] Q. Li, L. Mou, Q. Liu, Y. Wang, and X. X. Zhu, “HSF-Net: Multiscale deep feature embedding for ship detection in optical remote sensing imagery,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 12, pp. 7147–7161, Dec. 2018.
[22] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2015, pp. 91–99.
[23] L. Mou and X. X. Zhu, “Vehicle instance segmentation from aerial image and video using a multitask learning residual fully convolutional network,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 11, pp. 6699–6711, Nov. 2018.
[24] A. Miech, I. Laptev, and J. Sivic, “Learnable pooling with context gating for video classification,” Jun. 2017, arXiv:1706.06905. [Online]. Available: https://arxiv.org/abs/1706.06905
[25] L. Mou and X. X. Zhu, “IM2HEIGHT: Height estimation from single monocular imagery via fully residual convolutional-deconvolutional network,” Feb. 2018, arXiv:1802.10249. [Online]. Available: https://arxiv.org/abs/1802.10249
[26] X. X. Zhu et al., “Deep learning in remote sensing: A comprehensive review and list of resources,” IEEE Geosci. Remote Sens. Mag., vol. 5, no. 4, pp. 8–36, Dec. 2017.
[27] N. Kussul, M. Lavreniuk, S. Skakun, and A. Shelestov, “Deep learning classification of land cover and crop types using remote sensing data,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 5, pp. 778–782, May 2017.
[30] Y. Li, H. Zhang, and Q. Shen, “Spectral–spatial classification of hyperspectral imagery with 3d convolutional neural network,” Remote Sens., vol. 9, no. 1, p. 67, 2017.
[31] X. Lu, W. Zhang, and X. Li, “A hybrid sparsity and distance-based discrimination detector for hyperspectral images,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 3, pp. 1704–1717, Mar. 2018.
[32] W. Zhao and S. Du, “Spectral–spatial feature extraction for hyperspectral image classification: A dimension reduction and deep learning approach,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 8, pp. 4544–4554, Aug. 2016.
[33] P. Ghamisi, Y. Chen, and X. X. Zhu, “A self-improving convolution neural network for the classification of hyperspectral data,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 10, pp. 1537–1541, Oct. 2016.
[34] W. Li, G. Wu, F. Zhang, and Q. Du, “Hyperspectral image classification using deep pixel-pair features,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 844–853, Feb. 2017.
[35] A. Romero, C. Gatta, and G. Camps-Valls, “Unsupervised deep feature extraction for remote sensing image classification,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 3, pp. 1349–1362, Mar. 2016.
[36] X. Lu, X. Zheng, and Y. Yuan, “Remote sensing scene classification by unsupervised representation learning,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 9, pp. 5148–5157, Sep. 2017.
[37] L. Mou, P. Ghamisi, and X. X. Zhu, “Unsupervised spectral–spatial feature learning via deep residual Conv–Deconv network for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 1, pp. 391–406, Jan. 2018.
[38] M. E. Paoletti, J. M. Haut, J. Plaza, and A. Plaza, “Deep&dense convolutional neural network for hyperspectral image classification,” Remote Sens., vol. 10, no. 9, p. 1454, 2018.
[39] M. E. Paoletti, J. M. Haut, R. Fernandez-Beltran, J. Plaza, A. J. Plaza, and F. Pla, “Deep pyramidal residual networks for spectral–spatial hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 2, pp. 740–754, Feb. 2019. doi: 10.1109/TGRS.2018.2860125.
[40] L. Mou, P. Ghamisi, and X. X. Zhu, “Fully conv-deconv network for unsupervised spectral-spatial feature extraction of hyperspectral imagery via residual learning,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2017, pp. 5181–5184.
[41] X. Ma, A. Fu, J. Wang, H. Wang, and B. Yin, “Hyperspectral image classification based on deep deconvolution network with skip architecture,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 8, pp. 4781–4791, Aug. 2018.
[42] X. Lu, B. Wang, X. Zheng, and X. Li, “Exploring models and data for remote sensing image caption generation,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 2183–2195, Apr. 2018.
[43] L. Mou, P. Ghamisi, and X. X. Zhu, “Deep recurrent neural networks for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 7, pp. 3639–3655, Jul. 2017.
[44] L. Mou, L. Bruzzone, and X. X. Zhu, “Learning spectral-spatial-temporal features via a recurrent convolutional neural network for change detection in multispectral imagery,” Mar. 2018, arXiv:1803.02642. [Online]. Available: https://arxiv.org/abs/1803.02642
[45] M. Rußwurm and M. Körner, “Temporal vegetation modelling using long short-term memory networks for crop identification from medium-resolution multi-spectral satellite images,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR) Workshop, Jul. 2017, pp. 1496–1504.
[46] H. Lyu, H. Lu, and L. Mou, “Learning a transferable change rule from a recurrent neural network for land cover change detection,” Remote Sens., vol. 8, no. 6, p. 506, 2016.
[47] H. Lyu et al., “Long-term annual mapping of four cities on different continents by applying a deep information learning method to Landsat data,” Remote Sens., vol. 10, no. 3, p. 471, 2018.
[48] H. Wu and S. Prasad, “Convolutional recurrent neural networks for hyperspectral data classification,” Remote Sens., vol. 9, no. 3, p. 298, 2017.
[49] Y. Hua, L. Mou, and X. X. Zhu, “Recurrently exploring class-wise attention in a hybrid convolutional and bidirectional LSTM network for multi-label aerial image classification,” ISPRS J. Photogramm. Remote Sens., vol. 149, pp. 188–199, Mar. 2019.
[28] W. Song, S. Li, L. Fang, and T. Lu, “Hyperspectral image classification [50] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning
with deep feature fusion network,” IEEE Trans. Geosci. Remote Sens., spatiotemporal features with 3D convolutional networks,” in Proc. IEEE
vol. 56, no. 6, pp. 3173–3184, Jun. 2018. Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4489–4497.
[29] Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi, “Deep feature extrac- [51] M. E. Paoletti, J. M. Haut, J. Plaza, and A. Plaza, “A new deep
tion and classification of hyperspectral images based on convolutional convolutional neural network for fast hyperspectral image classifica-
neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 10, tion,” ISPRS J. Photogramm. Remote Sens., vol. 145, pp. 120–147,
pp. 6232–6251, Oct. 2016. Nov. 2018.
[52] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2017, pp. 1–11.
[53] M. E. Paoletti et al., “Capsule networks for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 4, pp. 2145–2160, Apr. 2019.
[54] F. I. Alam, J. Zhou, A. W.-C. Liew, X. Jia, J. Chanussot, and Y. Gao, “Conditional random field and deep feature learning for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 3, pp. 1612–1628, Mar. 2019.
[55] C. Deng, Y. Xue, X. Liu, C. Li, and D. Tao, “Active transfer learning network: A unified deep joint spectral–spatial feature learning model for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 3, pp. 1741–1754, Mar. 2019.
[56] J. Wang, W. Jiang, L. Ma, W. Liu, and Y. Xu, “Bidirectional attentive fusion with context gating for dense video captioning,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7190–7198.
[57] H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan, “End-to-end comparative attention networks for person re-identification,” IEEE Trans. Image Process., vol. 26, no. 7, pp. 3492–3506, Jul. 2017.
[58] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7132–7141.
[59] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[60] A. Graves, “Generating sequences with recurrent neural networks,” Aug. 2013, arXiv:1308.0850. [Online]. Available: https://arxiv.org/abs/1308.0850
[61] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in Proc. 8th Workshop Syntax, Semantics Struct. Stat. Transl. (SSST), Oct. 2014, pp. 103–167.
[62] Y. Gal and Z. Ghahramani, “A theoretically grounded application of dropout in recurrent neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2016, pp. 1019–1027.
[63] J. M. Haut, M. E. Paoletti, J. Plaza, A. Plaza, and J. Li, “Visual attention-driven hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., to be published. doi: 10.1109/TGRS.2019.2918080.
[64] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), 2010, pp. 249–256.
[65] T. Dozat. Incorporating Nesterov Momentum Into Adam. Accessed: Sep. 22, 2019. [Online]. Available: http://cs229.stanford.edu/proj2015/054_report.pdf
[66] Y. LeCun et al., “Backpropagation applied to handwritten zip code recognition,” Neural Comput., vol. 1, no. 4, pp. 541–551, Dec. 1989.
[67] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. IEEE Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–15.
[68] T. Rainforth and F. Wood, “Canonical correlation forests,” Jul. 2015, arXiv:1507.05444. [Online]. Available: https://arxiv.org/abs/1507.05444
[69] J. Xia, N. Yokoya, and A. Iwasaki, “Hyperspectral image classification with canonical correlation forests,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 1, pp. 421–431, Jan. 2017.
[70] L. van der Maaten, “Accelerating t-SNE using tree-based algorithms,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 3221–3245, Oct. 2014.

Lichao Mou (S’16) received the bachelor’s degree in automation from the Xi’an University of Posts and Telecommunications, Xi’an, China, in 2012, and the master’s degree in signal and information processing from the University of Chinese Academy of Sciences (UCAS), Beijing, China, in 2015. He is currently pursuing the Ph.D. degree with the German Aerospace Center (DLR), Wessling, Germany, and also with the Technical University of Munich (TUM), Munich, Germany.
In 2015, he spent six months at the Computer Vision Group, University of Freiburg, Freiburg im Breisgau, Germany. In 2019, he was a Visiting Researcher with the University of Cambridge, Cambridge, U.K. His research interests include remote sensing, computer vision, and machine learning, especially deep networks and their applications in remote sensing.
Mr. Mou was a recipient of the first place in the 2016 IEEE GRSS Data Fusion Contest and was a finalist for the Best Student Paper Award at the 2017 Joint Urban Remote Sensing Event and the 2019 Joint Urban Remote Sensing Event.

Xiao Xiang Zhu (S’10–M’12–SM’14) received the master’s (M.Sc.), D.E. (Dr.-Ing.), and Habilitation degrees in signal processing from the Technical University of Munich (TUM), Munich, Germany, in 2008, 2011, and 2013, respectively.
She was a Guest Scientist or a Visiting Professor with the Italian National Research Council (CNR-IREA), Naples, Italy, in 2009; Fudan University, Shanghai, China, in 2014; The University of Tokyo, Tokyo, Japan, in 2015; and the University of California at Los Angeles, Los Angeles, CA, USA, in 2016. Since 2019, she has been co-coordinating the Munich Data Science Research School. She is also leading the Helmholtz Artificial Intelligence Cooperation Unit (HAICU)–Research Field “Aeronautics, Space and Transport.” She is currently a Professor of signal processing in earth observation with the Technical University of Munich (TUM) and also with the German Aerospace Center (DLR); also the Head of the Department “EO Data Science,” DLR’s Earth Observation Center; and also the Head of the Helmholtz Young Investigator Group “SiPEO,” DLR and TUM. Her research interests include remote sensing and Earth observation, signal processing, machine learning, and data science, with a special application focus on global urban mapping.
Dr. Zhu is a member of the young academy (Junge Akademie/Junges Kolleg) at the Berlin-Brandenburg Academy of Sciences and Humanities, the German National Academy of Sciences Leopoldina, and the Bavarian Academy of Sciences and Humanities. She is an Associate Editor of the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING.