1. Introduction
Changes to the Earth arise from both natural hazards, such as floods and earthquakes, and human activities, like urban development [1]. Consequently, Change Detection (CD) algorithms are essential tools for disaster and resource management. Among the various data sources for landscape CD, Remote Sensing (RS) stands out as particularly important [2,3,4,5,6]. RS data track changes in objects within specific regions over time [7], providing a valuable data source with several advantages, including frequent updates, the ability to monitor vast areas, and cost-effectiveness [8]. RS data are utilized in a wide range of CD applications, such as fire monitoring [9,10], climate change studies [11,12,13,14,15], and flood mapping [16,17,18,19].
One type of RS imagery that provides finer spectral resolution is Hyperspectral RS imagery (HSI) [20,21,22,23]. Owing to its high number of spectral bands, HSI improves CD performance for spectrally similar targets [24,25] compared to multispectral imagery. However, the specific nature of HSI makes extracting change information from multi-temporal imagery a significant challenge [26]. As a result, Hyperspectral Change Detection (HCD) remains a dynamic and challenging area of study. Atmospheric conditions, noise levels, and data overload are among the most challenging factors affecting HCD results [27]. Hyperspectral sensors can be divided into two categories: (1) airborne (e.g., the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS)); and (2) space-borne (e.g., the PRecursore IperSpettrale della Missione Applicativa (PRISMA) and EnMAP). In the near future, new space-borne sensors will be deployed (HyspIRI, SHALOM, and HypXIM) [27].
Numerous studies and methodologies have so far used HSI for HCD [27,28,29,30,31]. For example, Ertürk et al. [32] suggested a CD technique by applying sparse spectral unmixing to bi-temporal hyperspectral images. This method first predicts the changed areas using spectral unmixing and then creates a binary change map by thresholding the abundance maps. Ertürk [33] also designed an HCD framework based on a fuzzy fusion strategy that combines similarity measures, the spectral angle mapper (SAM) algorithm, and change vector analysis (CVA) to predict changed areas. The fuzzy inference fusion strategy was used to fuse the magnitude and angle measurements obtained by the CVA and SAM algorithms, respectively. Additionally, López-Fandiño et al. [30] proposed a two-step HCD framework for performing binary and multi-class CD. They first generated a binary change map based on segmentation and thresholding, using the SAM algorithm. The image differencing algorithm was then used to combine the multi-temporal images, and a Stacked Auto-encoder was employed to reduce the dimensionality of the HSI. Finally, the binary change map and the reduced HSI were used to produce the multi-class change map. In recent work, Ghasemian and Shah-Hosseini [34] also designed an HCD framework for binary and multiple CD based on several steps: (1) stacking the bi-temporal dataset and generating sample data based on the peak density clustering algorithm, (2) implementing target detection methods based on the produced sample data, (3) generating a binary change map based on Otsu thresholding, and (4) utilizing the sparse coding algorithm and the support vector domain description (SVDD) to generate multiple change maps. Furthermore, Saha et al. [35] proposed an HCD framework based on an untrained deep model. This method extracts deep features from the pre- and post-change hyperspectral images using the untrained model and measures the similarity of the deep features through the Euclidean norm.
Tong et al. [36] also proposed a framework for HCD by analyzing and transfer-learning uncertain areas. This method is applied in four main steps: (1) generating a binary change map according to the uncertain-area analysis using K-Means clustering, CVA, and rule-based methods, (2) classifying the source image based on an active learning framework, (3) classifying the second-date image based on improved transfer learning and a support vector machine (SVM) classifier, and (4) utilizing post-classification analysis to detect the multiple change map. Moreover, Seydi and Hasanlou [37] designed an HCD method based on a 3D convolutional neural network (3D-CNN) and an image differencing algorithm. This framework utilized the image differencing procedure to predict change and no-change areas and then employed the 3D-CNN to classify the change areas and generate a binary change map. Finally, Borsoi et al. [38] proposed a fast spectral unmixing method for HCD based on the high temporal correlation between the abundances. This method detects abrupt changes by considering the residuals of end-member selection.
Recent progress in HCD has emphasized the value of Siamese networks and double-stream architectures for improved spectral–spatial analysis [39,40]. Innovative methods such as meta-learning and self-supervised learning have also proven effective in overcoming HCD challenges. For instance, Wang et al. (2022) applied meta-learning with Siamese networks for target detection, while Huang et al. (2023) developed a contrastive self-supervised network for HSI analysis [41,42,43]. These advancements have not only enhanced HCD techniques but also laid the groundwork for our HCD-Net framework, which leverages these developments for more accurate HCD.
Although the current HCD methods have shown promising results, they usually have several limitations, including the following:
They require a threshold, and selecting a suitable threshold can be challenging.
They primarily focus on spectral data while ignoring the potential of spatial features to improve HCD results, a potential demonstrated by multiple studies.
Most HCD methods are complex to implement and computationally demanding.
Noise and atmospheric conditions can negatively affect the automatic generation of pseudo-sample data through simple predictors and thresholding methods.
Most HCD methods require additional pre-processing steps, such as highlighting changes (recognizing changes from no-changes) or dimensional reduction. The dependence of HCD results on the chosen method for conducting these pre-processing steps makes it difficult to obtain robust results in different study areas.
Given these limitations, a novel method is proposed in this study to minimize these challenges and improve HCD results. This study introduces a new framework for HCD based on double-stream CNNs, called the HCD-Net. The HCD-Net uses multiscale 3D/2D convolution layers and 3D/2D attention blocks. The advantages of the HCD-Net are: (1) the use of multiscale multi-dimensional kernels, (2) the use of 3D/2D attention blocks, and (3) high efficiency and robust results in HCD. The key contributions of this study are:
Proposing and implementing HCD-Net, a novel double-stream deep feature extraction framework for HCD that integrates both 3D and 2D convolution layers in an end-to-end manner without the need for additional processing.
Introducing a 3D/2D attention mechanism within HCD-Net to significantly enhance the extraction of informative deep features, leading to improved accuracy in detecting changes in HSI. This attention mechanism allows the model to focus on salient features in both spatial and spectral dimensions, providing a more nuanced understanding of the changes occurring within the scene.
Demonstrating the robustness and versatility of HCD-Net by evaluating its performance across diverse geographic regions using both space-borne and airborne HSI. The results showcase HCD-Net’s ability to adapt to various landscapes and sensor characteristics, establishing its efficacy in a wide range of HCD applications.
2. Methodology
The HCD-Net is designed in three main steps: (1) pre-processing, (2) parameter tuning and model training, and (3) CNN-based binary classification and accuracy assessment. The details of the HCD-Net are presented in Figure 1 and are further discussed in the following subsections.
2.1. Pre-Processing
The RS datasets require pre-processing before they can be utilized for CD. These steps can be divided into spectral and spatial corrections, with spectral correction performed first and spatial correction afterwards. Spectral rectification of the hyperspectral level-1 raw (L1R) Hyperion data (space-borne sensor) involved eliminating bands with no-data values, de-striping, de-noising, and smile, radiometric, and atmospheric corrections. Additionally, geometric correction was performed during pre-processing. After pre-processing, 154 spectral bands were used in this study for HCD. The pre-processing of the airborne sensor (AVIRIS) data had already been completed, and the pre-processed data are used in this article.
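As an illustration, a minimal sketch of the band-screening part of this step is given below. It assumes a single-date Hyperion scene is already loaded as a NumPy array of shape (rows, cols, bands); the no-data convention and the synthetic example values are illustrative assumptions, not the exact settings used in this study.

```python
import numpy as np

def screen_hyperion_bands(cube, nodata_value=0):
    """Drop bands that contain only no-data values from a hyperspectral cube.

    cube: ndarray of shape (rows, cols, bands) -- a single-date Hyperion scene.
    Returns the reduced cube and the indices of the retained bands.
    """
    # A band is discarded when every pixel equals the no-data value
    # (uncalibrated Hyperion bands are typically stored as all zeros).
    valid = ~np.all(cube == nodata_value, axis=(0, 1))
    return cube[:, :, valid], np.where(valid)[0]

# Example usage with a synthetic cube standing in for an L1R scene.
scene = np.random.randint(0, 4000, size=(100, 100, 242)).astype(np.float32)
scene[:, :, :7] = 0          # mimic a block of uncalibrated bands
reduced, kept = screen_hyperion_bands(scene)
print(reduced.shape, len(kept))
```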
2.2. CNN-Based HCD
Deep Learning (DL) methods can automatically extract informative features from input datasets with a high degree of abstraction [44]. Among all DL frameworks, including CNNs, deep belief networks (DBNs), generative adversarial networks (GANs), recurrent neural networks (RNNs), and auto-encoders (AEs), CNN is the most commonly employed method [45]. Renowned for its applications across various fields, CNN uses stacked convolutional kernels to learn spectral and texture information in the spatial domain [46], enabling classification based on the interrelationships between the input data and target labels. The CNN network consists of two main components: a feature extractor and a softmax classifier, which is typically implemented as a multi-layer perceptron (MLP) that assigns class labels. Furthermore, the CNN network includes several operational layers, such as convolutional layers, pooling layers, nonlinear activation functions, and normalization layers [22].
This research proposes a novel double-stream HCD framework using a CNN network, as illustrated in Figure 2, which details the architecture of our dual-stream CNN framework for HCD. To improve efficiency, the two input patch datasets are merged into a single patch dataset through image differencing, unlike many DL-based frameworks that stack the input patch data; this reduces computation time and processing.
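The following minimal sketch illustrates this differencing step under simple assumptions: both dates have already been co-registered and cropped into patch arrays of identical shape, the difference is taken as an absolute value (the sign convention is our assumption), and the patch size and band count are illustrative.

```python
import numpy as np

def difference_patches(patches_t1, patches_t2):
    """Merge bi-temporal patch sets into a single input by image differencing.

    patches_t1, patches_t2: arrays of shape (n_patches, win, win, bands)
    holding co-registered patches from the pre- and post-change images.
    """
    assert patches_t1.shape == patches_t2.shape
    # The spectral difference (here its absolute value) is fed to the network
    # instead of stacking the two dates, halving the input depth.
    return np.abs(patches_t2.astype(np.float32) - patches_t1.astype(np.float32))

# Example with random stand-ins for 7x7 patches of a 154-band image.
t1 = np.random.rand(32, 7, 7, 154).astype(np.float32)
t2 = np.random.rand(32, 7, 7, 154).astype(np.float32)
diff = difference_patches(t1, t2)
print(diff.shape)  # (32, 7, 7, 154)
```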
HCD-Net’s architecture combines 3D and 2D convolutional layers and SE blocks to effectively extract and analyze the complex spectral–spatial relationships in hyperspectral data, while keeping computational efficiency in mind. The first stream uses multi-scale 3D convolution blocks, pooling layers, and 3D-SE blocks to explore the spectral–spatial dimensions, capturing the detailed relationships across the various bands of hyperspectral images. This approach enables a thorough analysis, utilizing the full spectrum of spectral information to effectively detect subtle environmental changes.
On the other hand, the second stream focuses on extracting 2D deep features through 2D convolution layers and 2D-SE blocks. Designed to be deeper, this stream includes a broader range of multiscale convolution blocks without pooling layers, aiming to capture spatial details precisely. The 2D convolutional layers in our framework are especially effective at extracting high-resolution spatial features with lower computational demands, without considering the relationship between spectral bands, thus ensuring the model's efficiency. This differentiation in convolutional approaches allows for a comprehensive analysis, with the detailed spatial information complementing the spectral–spatial features identified by the 3D convolutions and enhancing the framework's overall ability to detect changes within the environment. Additionally, the attention modules, implemented as 3D-SE and 2D-SE blocks, are aimed at extracting highly informative features and enhancing the model's representational power. Rooted in the broader concept of attention mechanisms within CNNs, which prioritize crucial information within the input data, they refine the model's focus on salient spectral–spatial features, thereby improving its performance in detecting nuanced changes in hyperspectral imagery.
The 3D and 2D features from both branches are combined through a concatenation and flattening process. After aligning their dimensions, the 2D and transformed 3D features are merged along the channel dimension to form a composite feature map. This map is then flattened into a one-dimensional vector for analysis by dense layers. The effectiveness of this integration mechanism has been confirmed through the ablation analysis in Section 5, showing the significant potential of using both 3D and 2D convolutional layers to enhance the framework's performance.
These features are then brought together in a concatenating layer, which merges the deep features from both streams before passing them to the classification layers. A fully connected layer bridges the CNN framework and MLP layers, leading to a softmax layer that classifies the input feature data.
This dual-stream approach, characterized by its innovative use of 2D and 3D convolutional layers and SE blocks, underlines our commitment to developing a robust framework capable of leveraging the unique advantages of hyperspectral imagery for environmental change detection. Through this method, the HCD-Net framework ensures a balanced and comprehensive analysis, achieving high accuracy and efficiency in monitoring and analyzing environmental changes.
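For concreteness, a compact TensorFlow/Keras sketch of this dual-stream layout is given below (the framework choice is ours for illustration). It is not the exact HCD-Net configuration: the patch size, filter counts, kernel sizes, pooling settings, and layer depths are illustrative assumptions, and the SE blocks follow the generic squeeze-and-excitation pattern described in Section 2.3.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def se_block_2d(x, ratio=8):
    # Generic 2D squeeze-and-excitation: GAP -> two dense layers -> rescale.
    c = x.shape[-1]
    w = layers.GlobalAveragePooling2D()(x)
    w = layers.Dense(c // ratio, activation="relu")(w)
    w = layers.Dense(c, activation="sigmoid")(w)
    return layers.Multiply()([x, layers.Reshape((1, 1, c))(w)])

def se_block_3d(x, ratio=8):
    # 3D counterpart using global average pooling over the spatial-spectral cube.
    c = x.shape[-1]
    w = layers.GlobalAveragePooling3D()(x)
    w = layers.Dense(c // ratio, activation="relu")(w)
    w = layers.Dense(c, activation="sigmoid")(w)
    return layers.Multiply()([x, layers.Reshape((1, 1, 1, c))(w)])

def build_dual_stream(win=7, bands=154):
    # Single differenced patch as input (see the differencing step above).
    inp = layers.Input(shape=(win, win, bands))

    # Stream 1: 3D convolutions explore the spectral-spatial cube.
    x3 = layers.Reshape((win, win, bands, 1))(inp)
    x3 = layers.Conv3D(8, (3, 3, 7), padding="same", activation="relu",
                       kernel_initializer="he_normal")(x3)
    x3 = layers.MaxPooling3D((1, 1, 2))(x3)
    x3 = se_block_3d(x3)
    x3 = layers.Flatten()(x3)

    # Stream 2: 2D convolutions capture spatial detail without pooling.
    x2 = layers.Conv2D(32, (3, 3), padding="same", activation="relu",
                       kernel_initializer="he_normal")(inp)
    x2 = layers.Conv2D(32, (5, 5), padding="same", activation="relu",
                       kernel_initializer="he_normal")(x2)
    x2 = se_block_2d(x2)
    x2 = layers.Flatten()(x2)

    # Fuse both streams and classify change / no-change with a softmax head.
    fused = layers.Concatenate()([x3, x2])
    fused = layers.Dense(128, activation="relu")(fused)
    out = layers.Dense(2, activation="softmax")(fused)
    return Model(inp, out)

model = build_dual_stream()
model.summary()
```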
The HCD-Net has several differences compared to other HCD frameworks, including:
Taking advantage of SE blocks in extracting informative deep features.
Utilizing the advantages of spectral information in the hyperspectral dataset through 3D convolution layers.
Combining 3D and 2D convolutions to explore high-level spectral and spatial information.
Utilizing a multiscale convolution block to increase the robustness of the network against different object sizes.
Employing a differencing algorithm to reduce computational and time costs instead of concatenating deep features in the first layers.
The multidimensional kernel convolution used in this study encompasses 3D, 2D, and 1D kernel convolutions. The distinction between these kernel convolutions is illustrated in Figure 3.
2.3. Squeeze-and-Excitation (SE) Blocks
The proposed SE block adaptively adjusts channel-wise feature responses to explicitly model the interconnections between channels, thus improving channel interdependencies with minimal computational cost. The block consists of three components: (1) Squeeze, (2) Excitation, and (3) Rescale. The Squeeze module uses global average pooling (GAP) to reduce the spatial dimensions of the input feature data to a single value per channel. The Excitation module then processes the output of the Squeeze module, learning adaptive scaling weights for each channel through two MLP layers, the first with a ReLU activation and the second with a Sigmoid activation. Finally, the Rescale component multiplies the input features element-wise by the learned channel weights, restoring the features to their original size. In this research, 3D/2D SE blocks are used in the proposed architecture for HCD.
Figure 3 illustrates the differences between 2D/3D SE blocks. The main difference between the two blocks is noted in the Squeeze module, where the 3D Squeeze module uses a 3D GAP, and the 2D Squeeze module uses a 2D GAP.
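As a toy numeric illustration of this squeeze/excite/rescale flow (with made-up weights and a made-up 2D feature map, not values from the trained network), consider the following:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 2D feature map: 4x4 spatial grid with 8 channels.
features = rng.random((4, 4, 8)).astype(np.float32)

# Squeeze: 2D global average pooling reduces each channel to one value.
squeezed = features.mean(axis=(0, 1))            # shape (8,)

# Excitation: two small dense layers (ReLU then Sigmoid) with random weights.
w1, w2 = rng.random((8, 2)), rng.random((2, 8))  # illustrative reduction to 2 units
hidden = np.maximum(squeezed @ w1, 0.0)          # ReLU
scale = 1.0 / (1.0 + np.exp(-(hidden @ w2)))     # Sigmoid, shape (8,)

# Rescale: broadcast the per-channel weights over the spatial grid.
recalibrated = features * scale                  # same shape as the input
print(recalibrated.shape)                        # (4, 4, 8)
```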
2.4. Convolution Layers
The convolution layers form the central core of CNN frameworks and are capable of extracting deep, high-level features. They can be categorized into three types based on their kernel dimensionality: (1) 3D kernel convolution, (2) 2D kernel convolution, and (3) 1D kernel convolution. The proposed architecture leverages the benefits of both 3D and 2D convolution layers to extract deep features. Additionally, this research incorporates multiscale convolution blocks to enhance the network's resilience against variations in object size.
The strength of 3D convolution layers lies in considering both spatial and spectral features. In other words, the 3D convolution layers consider both the relation between the central pixel and its neighborhood and the relation between spectral bands. The feature map $H$ of the 3D convolution layer at position $(i, j, k)$ on the $y$th feature of the $x$th layer is given by Equation (1):

$$H_{x,y}^{i,j,k} = F\left( b_{x,y} + \sum_{m} \sum_{p=0}^{P_x - 1} \sum_{q=0}^{Q_x - 1} \sum_{r=0}^{R_x - 1} w_{x,y,m}^{p,q,r}\, H_{x-1,m}^{i+p,\, j+q,\, k+r} \right) \tag{1}$$

where $F$ is the activation function, $b_{x,y}$ is the bias parameter, $m$ indexes the feature maps of the $(x-1)$th layer, $w_{x,y,m}^{p,q,r}$ is the kernel weight, and $P_x$, $Q_x$, and $R_x$ are the width and height of the kernel and its depth along the spectral dimension, respectively. In the 2D convolution layer, the feature map value at position $(i, j)$ is computed analogously, without the summation over the spectral dimension:

$$H_{x,y}^{i,j} = F\left( b_{x,y} + \sum_{m} \sum_{p=0}^{P_x - 1} \sum_{q=0}^{Q_x - 1} w_{x,y,m}^{p,q}\, H_{x-1,m}^{i+p,\, j+q} \right) \tag{2}$$

The activation function is the rectified linear unit (ReLU), defined as:

$$F(z) = \max(0, z) \tag{3}$$

Furthermore, the Sigmoid activation function can be formulated by Equation (4):

$$\sigma(z) = \frac{1}{1 + e^{-z}} \tag{4}$$
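To make the multiscale convolution blocks mentioned above concrete, the sketch below builds a parallel 3D convolution block whose branches use different kernel sizes and whose outputs are concatenated; the specific filter counts and kernel sizes are illustrative assumptions, not the exact HCD-Net settings.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def multiscale_conv3d_block(x, filters=8):
    """Parallel 3D convolutions with different kernel sizes, concatenated.

    Applying several kernel scales side by side makes the extracted
    spectral-spatial features less sensitive to object size.
    """
    branches = []
    for k in [(1, 1, 3), (3, 3, 5), (3, 3, 7)]:   # illustrative kernel sizes
        branches.append(layers.Conv3D(filters, k, padding="same",
                                      activation="relu",
                                      kernel_initializer="he_normal")(x))
    return layers.Concatenate()(branches)

# Example: apply the block to a 7x7x154 differenced patch.
inp = layers.Input(shape=(7, 7, 154, 1))
out = multiscale_conv3d_block(inp)
Model(inp, out).summary()
```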
2.5. Model Parameters Optimization
The parameters of the HCD-Net are optimized using backpropagation. The model parameters are iteratively tuned by an optimizer based on the loss value. To achieve this, the model is trained on the training sample data, with the parameters initialized by the He-Normal method. The network error is then calculated using a loss function on the validation dataset, and the optimizer updates the network parameters based on the feedback from the loss value. The Adam optimization algorithm is utilized to tune the parameters of the model [27,47]. Furthermore, the cost function used is binary cross-entropy, defined as:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

where $y_i$ is the label (true value), $\hat{y}_i$ is the predicted probability, and $N$ is the number of samples.
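A minimal Keras sketch of this training setup is shown below. The stand-in model, the random data, the learning rate, batch size, and epoch count are all illustrative assumptions; only the He-Normal initialization, Adam optimizer, and binary cross-entropy loss come from the description above.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in for the dual-stream network of Section 2.2, plus random stand-ins
# for the prepared patch data (shapes are illustrative).
model = tf.keras.Sequential([
    layers.Input(shape=(7, 7, 154)),
    layers.Conv2D(16, 3, padding="same", activation="relu",
                  kernel_initializer="he_normal"),   # He-Normal initialization
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),
])
X_train = np.random.rand(256, 7, 7, 154).astype(np.float32)
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 2, 256), 2)
X_val = np.random.rand(64, 7, 7, 154).astype(np.float32)
y_val = tf.keras.utils.to_categorical(np.random.randint(0, 2, 64), 2)

# Adam optimizer and binary cross-entropy loss, as described for HCD-Net.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# The held-out validation set is used to monitor the loss during training.
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          batch_size=64, epochs=5)  # illustrative batch size / epochs
```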
2.6. Accuracy Assessment and Comparison with Other Methods
The primary purpose of this step is to evaluate the results obtained by comparing them with the reference dataset. To achieve this, several comparison indices, including Overall Accuracy (OA), Kappa Coefficient (KC), F1-Score, Recall, Precision, and Balanced Accuracy (BA), were used for accuracy assessment. The results were also compared with four state-of-the-art methods: the 2D Siamese Network, the 3D Siamese Network, the General End-to-end Two-dimensional CNN Framework (GETNET), and Iteratively Reweighted Multivariate Alteration Detection (IR-MAD) combined with a Support Vector Machine (SVM) [48,49,50]. The 2D Siamese Network has two deep feature extraction channels, with the first channel analyzing the pre-change hyperspectral dataset and the second channel focusing on the post-change hyperspectral dataset. The 3D Siamese Network has a similar architecture but utilizes 3D convolution layers.
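These indices can be computed from the predicted and reference change maps as in the following sketch, which uses scikit-learn implementations; the array names and the random example data are placeholders.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score,
                             balanced_accuracy_score)

def evaluate_change_map(y_true, y_pred):
    """Compute OA, KC, F1-Score, Recall, Precision, and BA for a binary change map.

    y_true, y_pred: 1D arrays of 0 (no-change) / 1 (change) labels.
    """
    return {
        "OA": accuracy_score(y_true, y_pred),
        "KC": cohen_kappa_score(y_true, y_pred),
        "F1-Score": f1_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "BA": balanced_accuracy_score(y_true, y_pred),
    }

# Example with random stand-ins for reference and predicted maps.
rng = np.random.default_rng(1)
ref = rng.integers(0, 2, 10000)
pred = rng.integers(0, 2, 10000)
print(evaluate_change_map(ref, pred))
```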
3. Case Study
The HCD-Net is evaluated against various types of HCD methods using three bi-temporal hyperspectral datasets. The quality of the ground truth data is a critical factor in such evaluations; hence, these datasets were selected primarily because reference (ground control) data are available for them, and they are widely utilized in HCD studies such as Refs. [48,51]. In these studies, the ground control datasets were created through visual inspection.
The first dataset was acquired near Yuncheng, Jiangsu province, China, on 3 May 2006 and 23 April 2007. Soil, river, trees, buildings, roads, and agricultural fields are the main characteristics of this location. The second dataset was collected over irrigated agricultural fields in Hermiston, a city in Umatilla County, OR, USA, on 1 May 2004 and 8 May 2007. The China and USA datasets were acquired by the Hyperion sensor and are available at https://rslab.ut.ac.ir/data (accessed on 25 February 2024) (Figure 4). The third dataset is a product of the AVIRIS sensor, taken in 2013 and 2015 over the Bay Area surrounding the city of Patterson, California, and is available at https://citius.usc.es/investigacion/datasets/hyperspectral-change-detection-dataset (accessed on 25 February 2024). Its land cover includes soil, irrigated fields, rivers, human constructions, cultivated land, and grassland. In all datasets, the changes are associated with land cover types and water bodies. The details of the datasets and the sample data for the three case study areas are given in Table 1 and illustrated in Figure 4.
For a fair comparison, 5% of the samples from the USA and China datasets and 1% of the samples from the Bay Area ground reference data were used for training the network. The sample data are divided into three parts: (1) training data (72%), (2) validation data (18%), and (3) test data (10%).
Figure 5 displays the pixels used for training, validation, and testing for each scene. Green represents no-change pixels and red represents change pixels for both training and validation. Test pixels are shown in beige.
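A sketch of this sampling scheme is given below; it assumes the labeled pixels are available as flat arrays and uses scikit-learn's stratified splitting (stratification is our assumption, not stated in the text), with random stand-in data for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def sample_and_split(features, labels, sample_fraction=0.05, seed=0):
    """Draw a labeled sample (e.g., 5% for the USA/China scenes, 1% for the
    Bay Area) and split it 72% / 18% / 10% into train / validation / test."""
    # Draw the labeled sample from the ground reference data.
    X_s, _, y_s, _ = train_test_split(features, labels,
                                      train_size=sample_fraction,
                                      stratify=labels, random_state=seed)
    # First carve off the 10% test portion, then split the rest 80/20,
    # which yields 72% training and 18% validation of the drawn sample.
    X_tmp, X_test, y_tmp, y_test = train_test_split(X_s, y_s, test_size=0.10,
                                                    stratify=y_s, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.20,
                                                      stratify=y_tmp, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

# Example with random stand-ins for per-pixel difference features and labels.
X = np.random.rand(20000, 154).astype(np.float32)
y = np.random.randint(0, 2, 20000)
train, val, test = sample_and_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))
```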
5. Discussion
The results indicate that HSI has a high potential for CD purposes. For example, the OAs of all HCD methods exceeded 90%. According to the visual and statistical inspections in Figure 6, Figure 7 and Figure 8 and Table 2, Table 3, Table 4, Table 5, Table 6 and Table 7, the various methods provided different results for the study areas. However, the HCD-Net provided robust results in all three case study areas. Compared to the other methods, the variation in the KC values of the HCD-Net is low. The mean KC value for the USA dataset is lower than the mean KCs of the other datasets; thus, the complexity of the classes may affect the HCD results. While this issue is negligible in the HCD-Net results, the other methods are highly dependent on the complexity of the dataset.
Table 8 presents the average OAs and BAs over the three study areas for HCD-Net and all other methods (GETNET, 3D-Siamese, and 2D-Siamese). The HCD-Net demonstrated promising results (more than 97%) for the three study areas, whereas the performance of the other HCD methods was not satisfactory across all study areas. The BA index evaluates the performance of CD in both the change and no-change classes. Because the datasets are imbalanced, most HCD algorithms favor either the change or the no-change class and therefore lose performance when both classes are considered. Based on Table 8, the HCD-Net had the highest performance, confirming the robustness of the HCD-Net in HCD. There is a trade-off between detecting change and no-change pixels in HCD, and an accurate HCD method should provide robust, high performance in detecting both classes. The HCD-Net could detect change and no-change pixels with a high level of accuracy, whereas the other methods had lower performance in detecting both classes.
The visual results (see Figure 6a, Figure 7a and Figure 8a) illustrate HCD-Net's superior performance in HCD compared to the IR-MAD SVM, showcasing its enhanced capability to accurately identify changes across diverse landscapes. The comparison between IR-MAD SVM (a non-CNN-based CD algorithm) and HCD-Net across the China, USA, and Bay Area datasets further reveals HCD-Net's superior performance in all evaluated metrics, including OA, Precision, Recall, F1-Score, BA, and KC (Table 2, Table 4 and Table 6). HCD-Net consistently outperforms IR-MAD SVM, demonstrating its effectiveness in hyperspectral change detection. These results underscore HCD-Net's advanced capability to accurately identify changes within diverse landscapes, benefiting from its DL architecture and attention mechanisms, which enhance its sensitivity to subtle spectral–spatial features not as effectively captured by IR-MAD SVM.
Many HCD methods tend to overlook spatial features and primarily concentrate on spectral features. Moreover, even when some methods do incorporate spatial features, they often fail to utilize optimized features. The optimization of features can be a daunting and time-consuming task, leading to potential shortcomings in HCD by traditional and other state-of-the-art methods. The HCD-Net, on the other hand, automatically extracts deep features encompassing both spatial and spectral aspects. Consequently, when conducting HCD with the HCD-Net, the results are notably accurate and reliable, especially in large-scale areas where ground truth verification is challenging. Additionally, the HCD-Net incorporates a multi-dimensional convolution layer, thereby enhancing HCD performance. Unlike many other HCD methods that focus solely on either 2D or 3D convolution layers, the HCD-Net leverages both 3D and 2D convolution layers. Furthermore, the integration of 3D/2D SE blocks facilitates the extraction of informative deep features.
Several HCD methods, such as GETNET, necessitate additional processing steps like dimensionality reduction and spectral unmixing. Given the high dimensionality of hyperspectral datasets, such preprocessing poses a significant challenge. In contrast, the HCD-Net does not require any additional processing for HCD, unlike the GETNET algorithm, which relies on spectral unmixing prior to CD.
To evaluate the effectiveness of each part of the HCD-Net framework, we conducted ablation studies for the China dataset. These studies examined the impact of removing or altering specific components of the model. We focused on four scenarios: (1) the model without the 3D convolutional channel (S#1), (2) the model without the 2D convolutional channel (S#2), (3) the model without the attention modules (S#3), and (4) the full HCD-Net model with all components (S#4).
The results, shown in Table 9, highlight the importance of the attention modules. The performance drop in scenario S#3 demonstrates their role in improving the model's focus and feature representation, which significantly enhances detection accuracy. Additionally, the comparison between S#1 and S#2 indicates that the 3D convolutional channel has a more significant impact on the model's performance than the 2D channel. This suggests that the 3D channel is crucial for capturing complex spectral–spatial relationships in hyperspectral data, which is essential for accurate change detection.
Table 10 shows a comparison of computational times for different models, including HCD-Net. With an execution time of 465.21 s, HCD-Net is faster than the 3D-Siamese model but not as quick as the 2D-Siamese or the IR-MAD SVM, the latter being the fastest. This comparison highlights HCD-Net’s effective balance of computational speed and high precision in change detection, making it a strong choice for scenarios where detailed accuracy is more critical than the fastest processing time.