1. Introduction
Change detection is an important technique in remote sensing image analysis. It compares two images of the same location acquired at different times to locate the changed areas. Change detection has been widely used in urban sprawl tracking [1], resource exploration [2,3], land utilization detection [4], and post-disaster monitoring [5]. Remote sensing images are extremely large, whereas the changed areas occupy only a very small proportion of the whole image. Therefore, comparing such large images manually is time-consuming and laborious [6]. Over the past few decades, many change detection methods have been proposed.
A key problem in change detection is modeling the temporal correlations between bitemporal images. Different atmospheric scattering conditions and complicated light scattering mechanisms make change detection highly nonlinear, so a task-driven, learning-based approach is required [7]. Depending on whether sufficient prior knowledge is available in the dataset, detection methods can be divided into unsupervised [8,9] and supervised [10,11] methods. Unsupervised methods do not need prior knowledge from labeled data, whereas supervised methods infer the changed areas from labeled training data.
Unsupervised learning methods have been applied to change detection in many recent studies [12,13]. Generally, these methods focus on the generation and analysis of difference images, extracting information from either the original images or the difference images to detect which areas have changed. Common methods include principal component analysis (PCA) combined with k-means clustering [14], multivariate alteration detection (MAD), and iteratively reweighted multivariate alteration detection (IR-MAD) [15]. PCA is one of the best-known subspace learning algorithms [16]. As a linear transformation technique, it decorrelates the images. However, PCA relies on the statistical characteristics of the images: if the data in the changed and unchanged areas are unbalanced, the model performance is seriously affected [17]. MAD is another unsupervised multitemporal image change analysis method whose mathematical essence is canonical correlation analysis (CCA) and band math from multivariate statistical analysis; however, this algorithm still cannot fully improve current multi-element remote sensing image methods. Based on MAD, the IR-MAD algorithm was proposed in combination with the expectation-maximization (EM) algorithm [18]. Its core idea is that the initial weight of each pixel is 1, and each iteration assigns a new weight to every pixel in the two images; unchanged pixels obtain larger weights, and the final weights are the basis for determining whether a pixel has changed. After several iterations, the weights stabilize, and the iteration stops once the change falls below a set threshold. Another classic method is change vector analysis (CVA), proposed by Malila [19]. It performs a simple differencing operation on the image data of each waveband in different periods, evaluates the change of each pixel in this way, and forms a change vector across the wavebands. Many algorithms have been improved on the basis of CVA. Nevertheless, unsupervised methods cannot exploit the prior knowledge of labeled data and instead rely on model assumptions or similar rules to distinguish changed areas. Furthermore, they require targeted model tuning to adapt to different environments, which is time-consuming and laborious. Overall, unsupervised methods have certain limitations in change detection research.
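As a concrete illustration of the CVA idea described above, the following minimal NumPy sketch (with hypothetical array names) computes a per-pixel change magnitude from two co-registered multiband images and thresholds it; the threshold value is an assumption and would normally be chosen empirically or by a method such as Otsu's.

```python
import numpy as np

def cva_change_map(img_t1: np.ndarray, img_t2: np.ndarray, threshold: float) -> np.ndarray:
    """Change vector analysis (CVA) on two co-registered images of shape (H, W, bands).

    The per-band differences form a change vector at every pixel; its Euclidean
    magnitude is compared against a threshold to decide changed vs. unchanged.
    """
    diff = img_t2.astype(np.float64) - img_t1.astype(np.float64)   # per-band change vectors
    magnitude = np.linalg.norm(diff, axis=-1)                      # per-pixel magnitude
    return magnitude > threshold                                   # boolean change map

# Hypothetical usage with random data standing in for two 512x512, 3-band images.
t1 = np.random.rand(512, 512, 3)
t2 = np.random.rand(512, 512, 3)
change_mask = cva_change_map(t1, t2, threshold=0.5)
```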
Supervised learning methods use labeled training data to learn which areas have changed. Traditional supervised learning methods include random forests (RFs) [20], convolutional neural networks (CNNs), and so forth. Owing to the rapid development of graphics processing units (GPUs), deep learning methods have been applied in various fields [21,22,23]. In the field of change detection, the general end-to-end 2-D convolutional neural network [24] effectively learned discriminative features at higher levels with a 2-D CNN and introduced a hybrid affinity matrix fused with subpixel representations to improve its generalization ability. Yang [25] performed change detection and land cover mapping simultaneously and used the land cover information to help predict changed areas. Wang [26] proposed region-based CNNs that extend object detection to image change detection tasks. Zhan [27] proposed a change detection method based on deep Siamese convolutional networks. This method obtains two feature maps by feeding the two images into the same CNN and then detects changes between the images based on the knowledge that the feature vectors of changed pixel pairs are far apart from each other.
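The Siamese strategy in [27] can be sketched as follows; this is not Zhan's exact architecture, only a minimal PyTorch illustration of the idea that both images pass through one weight-sharing CNN and changed pixels are expected to yield feature vectors that lie far apart.

```python
import torch
import torch.nn as nn

class TinySiamese(nn.Module):
    """Minimal weight-sharing feature extractor; real models are much deeper."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x1, x2):
        f1, f2 = self.features(x1), self.features(x2)   # the same weights process both inputs
        return torch.norm(f1 - f2, dim=1)                # per-pixel feature distance

# Pixels with a large distance are treated as more likely to have changed.
net = TinySiamese()
dist_map = net(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
```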
Because most change detection tasks are performed at the pixel level, which is naturally associated with semantic segmentation, a change detection task can be transformed into a two-class semantic segmentation problem. Semantic segmentation in deep learning is mostly based on fully convolutional networks (FCNs) [28], which classify each pixel in the image independently and quickly. Zhang [29] and Arabi [30] showed that FCN methods can be applied well to image change detection tasks. However, these works only optimized the structure of the original model; the features specific to image change detection were not exploited. For example, images acquired in different time periods may show deviations in building angles, which lowers the accuracy of the model in identifying changed areas.
Most of the images studied by the above deep learning methods have low spatial resolution; when the spatial resolution and the height of the objects increase, those methods no longer work well because of long calculation times and low accuracies. Therefore, this paper proposes a trilateral change detection network (TCDNet). The proposed model consists of three branches: the main branch is responsible for multiscale extraction of the overall information of the bitemporal Google Earth images to obtain a raw change prediction map, and the other two auxiliary branches are the difference module and the assimilation module, which carry out weight training on the changed and unchanged areas, respectively. The overall framework of the model is shown in Figure 1. Through the cooperation of the two auxiliary modules, the main network improves its prediction accuracy. The main contributions of this work are as follows. (1) The proposed network is end-to-end trainable, and no component in the network needs separate training. (2) Targeted optimization is carried out according to the characteristics of change detection tasks: a main module and two auxiliary modules cooperate with each other to promote information exchange between the changed and unchanged areas and thus to improve the accuracy of the prediction map; the two auxiliary modules are trained together with the main module, and no additional separate training is required. (3) A new bitemporal Google Earth image dataset was collected, covering categories such as roads, railways, factories, buildings, and farmland. There are enough change categories to ensure that the model can be trained to cope with most changing environments. All ground truth was manually annotated to ensure the accuracy of the data. Experiments show that the established dataset is suitable for the quantitative evaluation of change detection tasks.
The rest of this paper is organized as follows. Section 2 introduces the proposed model. Section 3 analyzes the composition of the dataset. Section 4 discusses the experimental results in detail. Section 5 summarizes the paper and outlines future work.
2. Methodology
In previous research, CNNs were proven to be effective for remote sensing image segmentation [31,32]. In many similar visual tasks, an extractor is usually needed to extract feature information at diverse scales from images. CNNs have pyramid-like structures with multiscale and multilevel characteristics, so it is natural to use them in our image change detection research. The network automatically learns multiscale features, from coarse to fine, through a series of convolutions from shallow to deep. Using a typical image classification network directly as the backbone of the model helps improve the accuracy of the algorithm. Furthermore, this paper addresses pixel-level classification, where a single pixel depends heavily on the information of its surrounding pixels, and the pooling and convolution layers in CNNs handle this well. In summary, we choose a convolutional neural network as the underlying framework of our model.
The TCDNet proposed in this paper consists of three modules: one main network module and two auxiliary modules. The main network module is responsible for the systematic feature extraction of bitemporal Google Earth image pairs, and it takes advantage of convolutional networks to obtain rich semantic information. The two auxiliary modules are a difference module and an assimilation module, which focus on the changed and unchanged areas, respectively, assisting the main network in change detection. To fuse the outputs of these three modules, a fusion module is placed at the end of the model. The overall framework of the model is shown in Figure 1.
2.1. Main Module
This module is mainly used for detailed feature extraction from bitemporal Google Earth image pairs. A mainstream strategy for selecting the backbone network is to use a model that performs well in image classification tasks (e.g., the Large Scale Visual Recognition Challenge [33]), such as the Visual Geometry Group network (VGG) [34], the Residual Network (ResNet) [35], and Densely Connected Convolutional Networks (DenseNet) [36]. Because the depth of the neural network influences the final result in change detection tasks, the general idea is to design a network as deep as possible. However, as the depth increases, the gradient vanishing phenomenon appears, training becomes less effective, and the accuracy may even decrease. At the same time, a shallower network cannot fully extract the feature information of the image. Therefore, ResNet is chosen as the backbone network because its residual modules alleviate the gradient vanishing problem.
The core of residual modules lies in identity mapping. These modules address the degradation problem in deep learning, in which the training accuracy saturates or even decreases as the number of network layers increases. It is relatively difficult for convolution layers to directly fit a potential identity mapping function H(x) = x, but converting it to H(x) = F(x) + x makes the fitting easier. When F(x) = 0, an identity mapping H(x) = x is obtained [31]. The main function of residual networks is to remove the identical parts and highlight minor changes in the data.
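A minimal PyTorch sketch of a residual block is given below to illustrate the identity mapping H(x) = F(x) + x discussed above; it mirrors the standard ResNet basic block rather than the exact configuration used in this paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = F(x) + x, so the layers only learn the residual F."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(residual + x)   # identity shortcut: if F(x) -> 0, then H(x) -> x

block = ResidualBlock(64)
out = block(torch.rand(1, 64, 32, 32))
```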
It is well known that the most commonly used strategy for improving the final prediction is to combine low-level semantic features with high-level semantic features [28]. Thus, the backbone network uses the 16× and 32× downsampled layers to extract features of different levels and construct a feature pyramid structure.
In order to extract more complete feature information, the module proceeds as follows. It first applies global average pooling to obtain the global information of the 16× and 32× downsampled layers, then applies a 1 × 1 convolution followed by the sigmoid function, and finally multiplies the obtained feature vector with the original 16× and 32× downsampled layers to achieve feature learning [37]. This weight vector adjusts the features, which is equivalent to feature selection and combination. The sigmoid function normalizes the elements of the feature vector to between 0 and 1: the closer a value is to 1, the more important the corresponding channel; conversely, the closer it is to 0, the less important the channel.
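The feature-reweighting step described above (global average pooling, a 1 × 1 convolution, and a sigmoid whose output rescales each channel) can be sketched in PyTorch as follows; the channel count is a placeholder, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class ChannelReweight(nn.Module):
    """Squeeze-and-excitation style reweighting: GAP -> 1x1 conv -> sigmoid -> scale."""
    def __init__(self, channels: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)            # global average pooling
        self.fc = nn.Conv2d(channels, channels, 1)    # 1x1 convolution on the pooled vector
        self.sigmoid = nn.Sigmoid()                   # weights in (0, 1)

    def forward(self, x):
        w = self.sigmoid(self.fc(self.gap(x)))        # one weight per channel
        return x * w                                  # channels near 1 are kept, near 0 suppressed

feat = torch.rand(1, 512, 16, 16)                     # e.g., a 16x-downsampled feature map
reweighted = ChannelReweight(512)(feat)
```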
2.2. Auxiliary Module
The essence of the image change detection algorithm lies in determining the changed areas from the bitemporal images. This task can be divided into two interrelated parts: detecting the changed areas and detecting the unchanged areas. The outputs of the three branches are combined to improve the accuracy. To accomplish this, two auxiliary modules are designed: the assimilation module and the difference module.
Because some objects show certain deviations under different sensor viewing angles, not all pixels of two images taken in different periods are paired one-to-one. This becomes a problem especially when the observation distance is short, the objects are tall, and the spatial resolution is very high. As shown in Figure 2, these areas were taken at different times at the same place. According to the figures, the parts boxed in red on the Google Earth images do not change; the visual differences are caused by the viewing angles of the sensors. Traditional methods (such as PCA k-means and IR-MAD) cannot solve this problem well, whereas deep learning methods can. Deep learning methods convert area parameters into single values for comparison. Taking the difference module as an example, the two images pass through three convolution layers, and the size of the obtained feature map is only one-eighth of the original image; that is, the information corresponding to an 8 × 8 region of the original image is mapped to a single value after the convolutions.
In Figure 3, the 4 × 4 receptive field in the blue feature layer is mapped to a 2 × 2 region in the orange feature layer through convolutions, and then to a 1 × 1 parameter in the green feature layer through further convolution operations. After two convolution layers, the parameters of a 4 × 4 area in the original image are finally mapped to a single parameter. This series of operations enables a single parameter in the final output feature layer to represent a small 4 × 4 patch of the original image. This manipulation extracts global information from a small area and converts the area parameters into single values for comparison, which handles well the position deviation caused by different sensor viewing angles; a minimal sketch of this patch-to-value mapping is given below.
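The collapse of an image patch into a single feature value can be verified with a short PyTorch sketch; the three stride-2 convolutions below are illustrative rather than the paper's exact layers, and show how an 8 × 8 input reduces to a 1 × 1 output that summarizes the whole patch.

```python
import torch
import torch.nn as nn

# Three stride-2, 3x3 convolutions: each halves the spatial size,
# so an 8x8 patch is reduced to a single 1x1 value per channel.
patch_encoder = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)

patch = torch.rand(1, 3, 8, 8)
print(patch_encoder(patch).shape)   # torch.Size([1, 16, 1, 1])
```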
Using deep learning methods to solve change detection problems is similar to image segmentation, which generates pixel-wise output. In image segmentation, the image is normally fed into a CNN or FCN. A CNN first performs convolutions and pooling, reducing the size of the image while increasing the receptive field; because image segmentation predicts a pixel-wise output, the smaller feature map obtained after pooling must be upsampled back to the original image size for prediction. There are thus two key points in image segmentation: pooling to reduce the image size and increase the receptive field, and upsampling to restore the image size. Some information is lost while the size is reduced and then enlarged, so a new operation is needed to obtain more information without pooling. Dilated convolution [38] meets all the above requirements and is widely used in computer vision. It makes the receptive field grow exponentially without increasing the model parameters or computation. In general, pooling and downsampling operations lead to information loss; dilated convolution can replace pooling, and the enlarged receptive field enables the output of each convolution to contain information from a larger range. In dilated convolution, the dilated factor (DF) indicates the size of the convolution expansion: the input remains the same, and instead of filling blank pixels between input pixels, the convolution skips some pixels by zero-filling the convolution kernel, thereby achieving a larger receptive field.
As shown in Figure 4, the three images show schematic diagrams of the receptive fields corresponding to different dilated factors, with all holes filled with zeroes. When the factor is 3, the receptive field can reach 121, which provides a wider field of view and is equivalent to using an 11 × 11 convolution kernel. The parameters of the convolution kernel remain unchanged, while the size of the receptive field increases exponentially as the dilated factor increases; the receptive field of a two-dimensional dilated convolution grows with the dilated factor i and with the receptive field size of the convolution kernel. The advantage of dilated convolution is that, without losing information through pooling, the receptive field of the convolution kernel can be enlarged so that the output of each convolution contains as large a range of information as possible.
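Dilated convolution is available directly through the `dilation` argument of `torch.nn.Conv2d`; the sketch below (with illustrative channel sizes) shows that the kernel keeps its 3 × 3 learnable weights while the input area it covers grows with the dilation rate. Note that the paper's dilated factor may be defined differently from PyTorch's dilation rate, so the spans printed here are those of the standard definition, not necessarily the paper's.

```python
import torch
import torch.nn as nn

x = torch.rand(1, 16, 64, 64)

for d in (1, 2, 3):
    # A 3x3 kernel with dilation d covers a (2*d + 1) x (2*d + 1) input area,
    # but the number of learnable weights stays at 3*3 per channel pair.
    conv = nn.Conv2d(16, 16, kernel_size=3, padding=d, dilation=d)
    y = conv(x)
    print(d, tuple(y.shape), f"effective kernel span: {2 * d + 1}x{2 * d + 1}")
```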
Figure 5 illustrates how dilated convolutions increase the scope of the receptive field. Figure 5a,b, respectively, show the receptive fields obtained by ordinary convolutions and by dilated convolutions after three convolution operations. It can be clearly seen from the figures that, with dilated convolutions, the information contained in the same parameter (red circle) of the feature map is about 3.8 times that of ordinary convolutions (blue circle). Thus, with dilated convolutions, the same parameter in the network can contain more surrounding information, and the building deviation caused by different sensor viewing angles can be handled in a friendlier way.
Furthermore, the difference module and the assimilation module perform their respective functions: the assimilation module focuses on comparing the unchanged areas, and the difference module is responsible for screening the characteristics of the changed areas. The three branches in the model constitute a whole network, and supervised training is carried out simultaneously to update the parameters of the network. The model judges whether a pixel lies in a changed area based on the surrounding context of the corresponding pixel pair, and all of this information is learned automatically by the model. In the training process, a loss function is set up to assist the training. The outputs of these two modules are visualized and overlaid on the original images so that the functions of the auxiliary modules can be understood intuitively.
The images in Figure 6a,b are taken at the same place at different times. The green and blue areas in Figure 6c,d are areas with dense weights, indicating that these areas receive more attention. Figure 6c is the heat map output by the difference module; it can be seen that the difference module focuses more on the changed areas of the bitemporal Google Earth images. Figure 6d is the heat map output by the assimilation module; comparing Figure 6d with Figure 6c shows that the assimilation module focuses on the unchanged areas.
2.3. Fusion Module
The proposed algorithm consists of three branches: a main network and two auxiliary networks. If their outputs are fused directly, the information streams will inevitably interfere with each other. To solve this problem, a fusion module is added at the end of the network to process the three information flows.
As shown in Figure 7, w and h represent the width and height of the feature map, respectively. The information of the different branches is first stacked along the channel dimension to obtain a feature layer U:

U = [U₁, U₂, U₃],  (1)

where U₁, U₂, and U₃ denote the feature maps output by the three branches. A feature layer U′ is then obtained by convolutions, as in Equation (2):

U′ = δ(B(f(U))),  (2)

where f denotes a 3 × 3 convolution kernel and B denotes batch normalization, whose calculation process can be described by Equations (3)–(6):

μ = (1/m) Σᵢ xᵢ,  (3)

σ = √((1/m) Σᵢ (xᵢ − μ)²),  (4)

x̂ᵢ = (xᵢ − μ) / √(σ² + ε),  (5)

yᵢ = γ x̂ᵢ + β.  (6)

μ in Equation (3) denotes the average of the output data of the previous layer; xᵢ and m denote the parameters and the number of parameters in the previous feature maps, respectively. The standard deviation σ of the output data of the previous layer is obtained from Equation (4). In Equation (5), x̂ᵢ is the normalized value and ε is a very small value introduced to avoid a zero denominator. Finally, yᵢ in Equation (6) is the batch normalized value. To allow the normal distribution of each output to differ, the learnable parameters γ and β are introduced. δ in Equation (2) denotes the ReLU function, which can be written as

δ(x) = max(0, x).  (7)
In order to let the model better obtain global information and use vectors to guide feature learning, an attention mechanism is added to reorganize the data in the feature maps:

t = S(g(U′)),  (8)

where g(·) denotes global average pooling: it sums all pixel values of a feature map to obtain the average value and then uses this average value to represent the corresponding feature map. Its main function is to replace fully connected layers, reducing the number of parameters and the amount of calculation while improving the robustness of the model. S in Equation (8) denotes the Sigmoid function, which normalizes values to between 0 and 1 and can be written as

S(x) = 1 / (1 + e⁻ˣ),  (9)

where x represents the input to the Sigmoid. The output of Equation (8) is a vector t with C elements, where C denotes the number of categories; in the change detection task, C is set to 2 because only the changed areas are distinguished from the unchanged ones. Finally, each channel of U′ is multiplied by the corresponding element of t to weight the parameters, and the obtained feature layer is added to U′ to obtain the final output V:

V = t ⊗ U′ + U′,  (10)

where ⊗ denotes channel-wise multiplication.
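Putting the pieces of this subsection together, a minimal PyTorch sketch of the fusion step might look like the following; the channel sizes, and the assumption that each branch contributes a two-channel map, are illustrative rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Stack branch outputs, apply conv+BN+ReLU, then reweight channels and add back."""
    def __init__(self, in_channels: int = 6, out_channels: int = 2):
        super().__init__()
        self.conv = nn.Sequential(                      # U' = ReLU(BN(f(U)))
            nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)              # global average pooling
        self.sigmoid = nn.Sigmoid()

    def forward(self, main_out, assim_out, diff_out):
        u = torch.cat([main_out, assim_out, diff_out], dim=1)   # stack the three branches
        u_prime = self.conv(u)
        t = self.sigmoid(self.gap(u_prime))             # per-channel weights in (0, 1)
        return u_prime * t + u_prime                    # V = t (x) U' + U'

fusion = FusionModule()
v = fusion(torch.rand(1, 2, 128, 128), torch.rand(1, 2, 128, 128), torch.rand(1, 2, 128, 128))
```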
2.4. Loss Function
Two auxiliary loss functions are added to the total loss function of the model to improve the network performance and to better train each component of the model. These two auxiliary loss functions enhance the training of the assimilation module and the difference module. Because the output of the model divides the image into two categories, the changed and the unchanged areas, the Softmax cross-entropy function is selected as the loss function, as shown in Equation (11):

ℓ = −(1/N) Σᵢ yᵢ log(ŷᵢ),  (11)

where ℓ is the loss value computed against the real data for each of the two auxiliary branches, ŷᵢ is the output prediction of the network, yᵢ is the real value, and N is the number of samples. L is the joint loss function of the network; ℓ_a and ℓ_d denote the corresponding loss outputs of the assimilation module and the difference module, respectively, and ℓ_p denotes the loss value output at the end of the main module's network.
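A hedged PyTorch sketch of how the three loss terms could be combined during training is given below; the simple unweighted sum and the use of `nn.CrossEntropyLoss` (softmax cross-entropy over the two classes) are assumptions, since the exact weighting is not stated here.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # softmax cross-entropy over {unchanged, changed}

def joint_loss(main_logits, assim_logits, diff_logits, target):
    """target: (N, H, W) tensor of class indices, 0 = unchanged, 1 = changed."""
    loss_p = criterion(main_logits, target)    # loss at the end of the main module
    loss_a = criterion(assim_logits, target)   # auxiliary loss, assimilation module
    loss_d = criterion(diff_logits, target)    # auxiliary loss, difference module
    return loss_p + loss_a + loss_d            # assumed unweighted sum

logits = torch.rand(1, 2, 128, 128)
target = torch.randint(0, 2, (1, 128, 128))
loss = joint_loss(logits, logits.clone(), logits.clone(), target)
```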
3. Dataset
A dataset was created to train and verify the validity of the proposed model. Compared with other published datasets, the dataset used in this work contains more details of the objects and their environments because of the lower viewing altitude; therefore, the requirements placed on the algorithm are higher. The dataset contains 3420 pairs of Google Earth images, each of size 512 × 512 pixels and covering an area of approximately 300 square meters. These Google Earth (GE) images were taken at different times from 2010 to 2019. GE is virtual earth software developed by Google that allows users to freely browse high-definition images of different time periods around the world. The satellite imagery of GE does not come from a single source but is an integration of multiple satellite images, mainly from DigitalGlobe's QuickBird commercial satellite and EarthSat (images mostly from Landsat-7). GE has a resolution of at least 100 m, usually 30 m, with a viewing altitude of about 15 km. For big cities, famous scenic spots, and other hot spots, however, high-precision images with resolutions of 1 m or 0.6 m are provided, with viewing altitudes of approximately 500 m or 350 m, respectively; images of such places are included in the dataset. The bitemporal Google Earth images were registered during data collection. The dataset has the following characteristics. (1) The proportion of positive to negative samples varies greatly; only small parts of a complete Google Earth image are changed areas. (2) The buildings in the images have some positional deviations due to different shooting angles. (3) The changed areas cover numerous scenes, including farmland, roads, high-speed rail, factories, buildings, and mining areas. (4) Seasonal factors have a great influence; for example, deciduous broad-leaved forests show significant differences between winter and summer. (5) In order to obtain the most accurate changed-area information, all ground truth was annotated manually.
Some labeled images are shown in Figure 8, and these data contain changed areas of different proportions. In order to evaluate the performance of the algorithm more thoroughly, the dataset also contains a small number of bitemporal Google Earth image pairs with no changed areas, as shown in Figure 8a.
Analyzing the proportion of changed area in each image provides a better overall understanding of the dataset; the resulting statistics are shown in Figure 9, where the abscissa is the proportion of the changed areas in the whole image and the ordinate is the number of images in each proportion range. The figure shows that the proportion of changed areas in most images lies between 5% and 50%. There are 337 positive samples with a proportion of 0–2%, of which 242 fall in 0–1% and 95 in 1–2%. This imbalance between samples makes it easier to distinguish good models from bad ones.
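Statistics like those in Figure 9 can be reproduced from the binary ground-truth masks with a short NumPy sketch such as the one below; file handling is omitted and the mask list is hypothetical stand-in data.

```python
import numpy as np

def changed_proportion(mask: np.ndarray) -> float:
    """mask: binary ground truth of shape (H, W), 1 = changed pixel, 0 = unchanged."""
    return float(mask.sum()) / mask.size

# Hypothetical stand-in for the 3420 annotated masks of the dataset.
masks = [np.random.rand(512, 512) < 0.1 for _ in range(10)]
proportions = [changed_proportion(m) for m in masks]

# Histogram of changed-area proportions, e.g., in 2% bins as plotted in Figure 9.
counts, bin_edges = np.histogram(proportions, bins=np.arange(0.0, 1.02, 0.02))
```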
To make the model more robust, it is necessary to enlarge the dataset. CNNs are sensitive to translation, rotation, flipping, and so forth; sometimes a translation of one pixel can even lead to a wrong classification, and rotation and scaling are even more destructive to the positional information of the image. Therefore, data augmentation is important and necessary. The general approach is to augment the data with appropriate translation; horizontal and vertical flipping; rotation; random shear; changes of hue, saturation, and value (HSV); and so forth. The essence of data augmentation is to increase the data by introducing prior knowledge and thus improve the generalization ability of the model. The specific parameter settings are shown in Table 1, and images after data augmentation are shown in Figure 10. The translation distance does not exceed 20% of the length and width of the original image; HSV saturation and random rotation share similar settings. Each image has a 50% probability of being flipped horizontally and vertically, respectively.
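Because the two images of a pair and their label mask must stay aligned, every geometric augmentation has to be applied with identical parameters to all three. The following sketch uses torchvision functional transforms; the specific ranges and probabilities are placeholders rather than the exact values of Table 1.

```python
import random
import torchvision.transforms.functional as TF

def augment_pair(img_t1, img_t2, label, max_rotation=15):
    """Apply the same random flip/rotation to both images and the label mask."""
    if random.random() < 0.5:                      # horizontal flip with probability 0.5
        img_t1, img_t2, label = TF.hflip(img_t1), TF.hflip(img_t2), TF.hflip(label)
    if random.random() < 0.5:                      # vertical flip with probability 0.5
        img_t1, img_t2, label = TF.vflip(img_t1), TF.vflip(img_t2), TF.vflip(label)
    angle = random.uniform(-max_rotation, max_rotation)
    img_t1, img_t2, label = (TF.rotate(img_t1, angle),
                             TF.rotate(img_t2, angle),
                             TF.rotate(label, angle))
    # Color changes (e.g., HSV jitter) would be applied to img_t1 and img_t2 only, never to the label.
    return img_t1, img_t2, label
```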
5. Conclusions
In this paper, a general framework called TCDNet is proposed. It consists of three parts: a main module and two auxiliary modules, namely a difference module and an assimilation module. The three branches output the final change detection map through a fusion module. First, the main module extracts different levels of information from the bitemporal Google Earth image pairs and roughly distinguishes the changed and unchanged areas; the outputs of the two auxiliary modules are then combined to refine the change detection map and remove redundant noise. Second, independent loss functions are added in the training phase to guide the weight updates of the auxiliary modules. Compared with the other algorithms, the proposed method shows better generalization and robustness and is superior in the various evaluation indexes. Deep learning methods depend on the number of categories in the dataset: the more categories, the better. If the type of a changed area is not included in the training data, the detection results will be affected to some extent. In addition, this paper only deals with locating the changed areas of Google Earth images; future research will focus on identifying the categories within changed areas. At the same time, because of the huge cost of creating datasets, we will also focus on applying semisupervised and unsupervised methods to change detection in the future.