This subsection first presents the overall architecture of CTCANet and then describes its main components: the Siamese backbone, the transformer module, the cascaded decoder, and CBAM.
3.1.2. Siamese Backbone
CTCANet employs a modified Siamese ResNet18 as the framework for feature extraction from bi-temporal images. ResNet18 is a convolutional network with residual connections, which allow the network to learn and reuse residual information from preceding layers during training. Residual connections mitigate the vanishing-gradient problem that can arise in deep neural networks; compared with plain convolutional networks, networks with residual connections converge more easily while the number of parameters remains unchanged.
The ResNet18 utilized in our study is derived from the original ResNet18 by removing both the global pooling layer and the fully connected layer. The modified ResNet18 preserves the first convolutional layer and four basic blocks, for a total of five stages. The features output by these five stages are denoted as $F_t^0$, $F_t^1$, $F_t^2$, $F_t^3$, and $F_t^4$ in turn, where $t \in \{1, 2\}$ represents two different time phases. The calculation of the first convolutional layer is as follows:
$$F_t^0 = \mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{Conv}_{7\times 7}(I_t)\big)\big),$$
where $I_t$ is the input image of time phase $t$, $\mathrm{Conv}_{7\times 7}(\cdot)$ is a $7\times 7$ convolution operation, $\mathrm{BN}(\cdot)$ refers to the batch normalization (BN) layer, and $\mathrm{ReLU}(\cdot)$ denotes the rectified linear unit (ReLU) layer.
In ResNet18, a basic block is the simplest building unit of the network architecture. It consists of two convolutional layers with a residual connection that skips over them.
Figure 2 shows that the basic blocks take two different forms depending on the stride used. When the stride is 1, the residual connection merges the input directly with the output of the second convolutional layer by element-wise summation (see
Figure 2a). When the stride is not 1, the input is first passed through a $1\times 1$ convolutional layer to increase the channel dimension and downsample before being added to the output (see
Figure 2b). In practice, all four basic blocks in our ResNet18 have a stride of 2, each of which halves the spatial size of the features. As a result, apart from the features produced in the initial stage, the features generated in the successive stages have sizes of $1/2$, $1/4$, $1/8$, and $1/16$ of the original images, respectively. The depths of the five stages are 64, 64, 128, 256, and 512 in succession. Except for the output of the last basic block, which is projected into semantic tokens by the tokenizer and further processed by the transformer module, the features produced by the remaining four stages are transmitted to the cascaded decoder to be concatenated with high-level features.
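For illustration, the modified backbone described above can be sketched in PyTorch roughly as follows; the $7\times 7$ stem convolution, one residual block per stage (standard ResNet18 uses two), and the channel widths are assumptions consistent with the text rather than the authors' released code.

```python
# Minimal PyTorch sketch of the modified Siamese ResNet18 backbone described above.
# Channel widths (64, 64, 128, 256, 512) and the stride-2 residual stages follow the text;
# the single residual block per stage and the 7x7 stem are simplifying assumptions.
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    """Two 3x3 convolutions with a residual connection (Figure 2).
    When stride != 1, a 1x1 convolution projects the input for the shortcut (Figure 2b)."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))


class SiameseBackbone(nn.Module):
    """Weight-shared backbone: the same network processes both temporal images."""

    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(                    # stage 0: full resolution, 64 channels
            nn.Conv2d(3, 64, 7, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
        )
        self.stage1 = BasicBlock(64, 64, stride=2)    # 1/2
        self.stage2 = BasicBlock(64, 128, stride=2)   # 1/4
        self.stage3 = BasicBlock(128, 256, stride=2)  # 1/8
        self.stage4 = BasicBlock(256, 512, stride=2)  # 1/16

    def forward_single(self, img):
        f0 = self.stem(img)
        f1 = self.stage1(f0)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        return [f0, f1, f2, f3, f4]

    def forward(self, img1, img2):
        # Siamese: identical weights applied to both time phases.
        return self.forward_single(img1), self.forward_single(img2)
```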
3.1.3. Tokenizer and Transformer Module
Context modelling is fundamental for helping the network concentrate on pertinent changes and distinguish spurious changes, such as those caused by variations in illumination. Therefore, the transformer module, which comprises both transformer encoder and transformer decoder components (see
Figure 1a), is incorporated to enable modelling of the spatio-temporal contextual information.
The highest-level features of each image extracted by the Siamese backbone are embedded into a set of semantic tokens by the Siamese tokenizer before being fed into the transformer module. Let $F_t^4 \in \mathbb{R}^{H\times W\times C}$ denote the input features, where $H\times W$ is the spatial size and $C$ is the channel dimension. For the features of each time phase, we divide and flatten them into a sequence of patches $F_t^p \in \mathbb{R}^{N\times (P^2\cdot C)}$, where $N = HW/P^2$ is the length of the patch sequence and $P\times P$ is the spatial size per patch. Afterward, a convolution operation with a filter size of $P\times P$ and a stride of $P$ projects the patch sequence into the latent embedding space ($\mathbb{R}^D$), thus obtaining a sequence of tokens. Finally, a trainable positional embedding $E_{pos}\in\mathbb{R}^{N\times D}$ is incorporated into the token sequence to retain the position information. Formally,
$$T_t = \mathrm{PE}(F_t^p) + E_{pos},$$
where $\mathrm{PE}(\cdot)$ is the patch embedding that maps patches into the latent space. Consequently, the semantic tokens $T_t \in \mathbb{R}^{N\times D}$ ($t \in \{1, 2\}$) are produced.
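For illustration, the tokenizer can be sketched as a strided convolution acting as the patch embedding plus a trainable positional embedding; the default values of the channel number, patch size, embedding dimension, and token count below are placeholders, not values from the paper.

```python
# Sketch of the Siamese tokenizer: a PxP convolution with stride P acts as the patch
# embedding PE, producing N = (H/P) * (W/P) tokens of dimension D, to which a trainable
# positional embedding is added. The default arguments are illustrative assumptions.
import torch
import torch.nn as nn


class Tokenizer(nn.Module):
    def __init__(self, in_channels=512, patch_size=2, embed_dim=256, num_patches=64):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, feat):                               # feat: (B, C, H, W)
        tokens = self.patch_embed(feat)                    # (B, D, H/P, W/P)
        tokens = tokens.flatten(2).transpose(1, 2)         # (B, N, D)
        return tokens + self.pos_embed                     # retain position information
```

The same tokenizer (shared weights) is applied to the stage-4 features of both time phases, yielding $T_1$ and $T_2$.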
Given the semantic tokens of the raw images, our transformer encoder establishes global semantic relations in token space and captures long-range dependencies among embedded tokens. As shown in
Figure 3, we employ $L$ layers of Siamese encoders, each consisting of a multi-head self-attention (MSA) block and a multi-layer perceptron (MLP) block, following the standard transformer architecture [
26]. Additionally, consistent with ViT [
27], layer normalization (LN) is performed before the MSA/MLP, while the residual connection is placed after each block. The input to MSA in layer $l$ is a triple (query $Q$, key $K$, value $V$) computed from the output of the prior layer through three linear projection layers. Formally,
$$Q = T^{l-1}W_Q,\quad K = T^{l-1}W_K,\quad V = T^{l-1}W_V,$$
where $W_Q, W_K, W_V \in \mathbb{R}^{D\times d}$ are the trainable parameter matrices, $d$ is their channel dimension, and $h$ is the number of self-attention heads. The self-attention mechanism models global dependencies by computing a weighted average of the values at each position. Formally,
$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathsf T}}{\sqrt{d}}\right)V,$$
where $\mathrm{softmax}(\cdot)$ denotes the Softmax function applied on the channel domain. To capture a wider spectrum of information, the transformer encoder uses MSA to jointly process semantic tokens from different positions. This procedure can be expressed by the following equation:
$$\mathrm{MSA}(T^{l-1}) = \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_h\big)W_O,\quad \mathrm{head}_j = \mathrm{Att}\big(T^{l-1}W_Q^j,\ T^{l-1}W_K^j,\ T^{l-1}W_V^j\big),$$
where $\mathrm{Concat}(\cdot)$ denotes concatenating the outputs of the independent self-attention heads, and $W_O \in \mathbb{R}^{hd\times D}$ denotes the linear projection matrix.
The MLP architecture comprises two linear layers sandwiching a Gaussian error linear unit (GELU) activation function [67]. Formally,
$$\mathrm{MLP}(T) = \mathrm{GELU}(TW_1)W_2,$$
where $W_1 \in \mathbb{R}^{D\times D_{m}}$ and $W_2 \in \mathbb{R}^{D_{m}\times D}$ are learnable linear projection matrices and $D_{m}$ is the hidden dimension of the MLP.
To summarize, the computational procedure of the transformer encoder at a specific layer $l$ can be written as:
$$\hat{T}^{l} = \mathrm{MSA}\big(\mathrm{LN}(T^{l-1})\big) + T^{l-1},$$
$$T^{l} = \mathrm{MLP}\big(\mathrm{LN}(\hat{T}^{l})\big) + \hat{T}^{l}.$$
After $L$ layers of encoding in the Siamese transformer encoder, the raw embedded tokens $T_1$ and $T_2$ are converted into context-rich tokens $T_1^{new}$ and $T_2^{new}$, respectively. The Siamese transformer encoder effectively captures high-level semantic information about the changes of interest.
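Since the encoder follows the standard pre-LN transformer design, one encoder layer can be sketched with PyTorch's built-in multi-head attention as follows; the embedding dimension, number of heads, and MLP width are illustrative choices, not values from the paper.

```python
# Sketch of one pre-LN Siamese encoder layer: LN -> MSA -> residual, then LN -> MLP -> residual.
import torch
import torch.nn as nn


class EncoderLayer(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8, mlp_dim=1024):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.msa = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim),
        )

    def forward(self, tokens):                             # tokens: (B, N, D)
        x = self.ln1(tokens)
        attn_out, _ = self.msa(x, x, x)                    # self-attention: Q = K = V
        tokens = tokens + attn_out                         # residual after MSA
        tokens = tokens + self.mlp(self.ln2(tokens))       # residual after MLP
        return tokens
```

Stacking $L$ such layers and applying the same stack (shared weights) to $T_1$ and $T_2$ yields the context-rich tokens $T_1^{new}$ and $T_2^{new}$.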
To capture strongly discriminative semantic information, the transformer decoder projects encoded tokens back into pixel space, resulting in refined feature representations enhanced with spatio-temporal context. In the proposed transformer module (see
Figure 3), the embedded tokens $T_1$ and $T_2$, derived from features $F_1^4$ and $F_2^4$, respectively, as well as the encoded context-rich tokens $T_1^{new}$ and $T_2^{new}$, are passed to the transformer decoder to develop correlations between each pixel of the differential features and the encoded differential tokens. In practice, the absolute differences of the raw tokens $T_1$ and $T_2$ and of the encoded tokens $T_1^{new}$ and $T_2^{new}$ are computed separately and input into the transformer decoder to directly generate pixel-level, highly discriminative differential features.
The transformer decoder comprises $L_d$ layers of decoders. The transformer encoder and decoder share the same architecture, except that the decoder uses multi-head attention (MA) blocks, while the encoder uses multi-head self-attention (MSA) blocks. Here, the differential tokens derived from $T_1$ and $T_2$ serve as queries, and those derived from $T_1^{new}$ and $T_2^{new}$ provide keys. At each layer $l$, the output $D^{l-1}$ of the prior layer and the encoded differential tokens serve as the input, and the decoder performs the following computations:
$$\hat{D}^{l} = \mathrm{MA}\big(\mathrm{LN}(D^{l-1}),\ \mathrm{LN}(\Delta T^{new})\big) + D^{l-1},$$
$$D^{l} = \mathrm{MLP}\big(\mathrm{LN}(\hat{D}^{l})\big) + \hat{D}^{l},$$
where $\mathrm{MA}(\cdot,\cdot)$ takes queries from its first argument and keys and values from its second, $\Delta T^{new} = \lvert T_1^{new} - T_2^{new}\rvert$ denotes the encoded differential tokens, and $D^{0} = \lvert T_1 - T_2\rvert$ denotes the raw differential tokens. Finally, the decoded refined semantic tokens are unfolded and reshaped into 3D differential features $F_{d}$.
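Under the reading above (differential raw tokens as queries, differential encoded tokens as keys and values), the decoder stage could be sketched as follows; the query/key assignment, the layer hyperparameters, and the reshaping step are assumptions based on the description rather than the authors' implementation.

```python
# Sketch of the transformer decoder under the reading above: queries come from the
# differential raw tokens |T1 - T2|, keys/values from the differential encoded tokens
# |T1_new - T2_new|; the decoded tokens are reshaped into a (B, D, h, w) feature map.
import torch
import torch.nn as nn


class DecoderLayer(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8, mlp_dim=1024):
        super().__init__()
        self.ln_q = nn.LayerNorm(embed_dim)
        self.ln_kv = nn.LayerNorm(embed_dim)
        self.ma = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(nn.Linear(embed_dim, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, embed_dim))

    def forward(self, queries, memory):                    # (B, N, D) each
        attn_out, _ = self.ma(self.ln_q(queries), self.ln_kv(memory), self.ln_kv(memory))
        queries = queries + attn_out                       # residual after multi-head attention
        queries = queries + self.mlp(self.ln2(queries))    # residual after MLP
        return queries


def decode_to_feature(decoder_layers, T1, T2, T1_new, T2_new, h, w):
    """Run the decoder on the differential tokens and unfold them into a 3D feature."""
    q = torch.abs(T1 - T2)                                 # differential raw tokens
    kv = torch.abs(T1_new - T2_new)                        # differential encoded tokens
    for layer in decoder_layers:
        q = layer(q, kv)
    return q.transpose(1, 2).reshape(q.size(0), -1, h, w)  # (B, D, h, w)
```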
3.1.4. Cascaded Decoder
Different layers of features extracted from the raw images contain different levels of information. Deep features are highly abstract but lack local detail, whereas shallow features contain richer local details but are less abstract. To learn both deep and shallow representations comprehensively, we propose a cascaded decoder, which combines highly abstract deep features with shallow features rich in local information through skip connections, thereby alleviating the loss of detail induced by global upsampling and localizing objects more precisely.
The cascaded decoder consists of four upsampling blocks arranged in a series, as illustrated in
Figure 1b. In each upsampling block, the differential features are upscaled and concatenated with the features extracted by the Siamese backbone to learn multilevel representations. A more intuitive description is shown in
Figure 4.
Each upsampling block sequentially performs $2\times$ upsampling, concatenation, and convolution operations on the input. In practice, the differential features reconstructed by the transformer module are first upsampled to the same scale as the penultimate-stage features extracted by the Siamese backbone. Then, the upsampled output is concatenated with the corresponding features of the individual raw images. Finally, the concatenated features successively pass through two convolutional layers, BN layers, and ReLU layers, yielding the result of the first upsampling block. The remaining upsampling blocks operate similarly, as shown in
Figure 5.
The following formulation can be used to express the computation carried out in the upsampling block:
$$F_{out} = \mathrm{ReLU}\Big(\mathrm{BN}\Big(\mathrm{Conv}\big(\mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{Conv}\big(\mathrm{Concat}(\mathrm{Up}_{2\times}(F_{in}),\ \mathrm{feature1},\ \mathrm{feature2})\big)\big)\big)\big)\Big)\Big),$$
where $\mathrm{Up}_{2\times}(\cdot)$ denotes the $2\times$ upsampling operation on the input, and $\mathrm{Concat}(\cdot)$ refers to the concatenation of the upsampled input, feature1, and feature2 in the channel dimension.
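For illustration, one upsampling block can be sketched as $2\times$ upsampling, channel concatenation with the two temporal backbone features, and two Conv-BN-ReLU layers; the bilinear interpolation mode and the channel widths are assumptions.

```python
# Sketch of one cascaded-decoder upsampling block: 2x upsampling, channel concatenation
# with the corresponding bi-temporal backbone features (feature1, feature2), and two
# Conv-BN-ReLU layers.
import torch
import torch.nn as nn


class UpsamplingBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + 2 * skip_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x, feature1, feature2):
        x = self.up(x)                                     # 2x upsampling of the input
        x = torch.cat([x, feature1, feature2], dim=1)      # concat in the channel dimension
        return self.conv(x)
```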
3.1.5. CBAM
Skip connections fuse the highly abstract differential features with the lower-level features from individual raw images. However, due to the semantic gaps between heterogeneous features, direct feature concatenation cannot achieve good training results. Thus, we introduce CBAM to the cascaded decoder to efficiently combine multilevel features. Since the semantic gap between the fused features in the last upsampling block is the largest, we add CBAM to this block to promote fusion, as shown in
Figure 1b.
CBAM is a lightweight module with low memory requirements and computational costs. CBAM consists of two sub-modules, channel attention module (CAM) and spatial attention module (SAM), which help to emphasize the change-related information across different domains. Specifically, during the final upsampling block, the operation within the dashed box in
Figure 5 undergoes CAM and SAM procedures at its front-end and back-end, respectively. The role of CAM is to emphasize channels that are pertinent to changes while suppressing those that are not. The function of SAM is to enlarge the distance between changed and unchanged pixels in the spatial dimension. In this way, the changed areas of interest are better identified in the change map.
We refer to the concatenated features in the fourth upsampling block as $F_{c}$. As shown in
Figure 1c, $F_{c}$ is forwarded into a max pooling layer and an average pooling layer to extract vectors with dimension $C\times 1\times 1$, where $C$ is the number of channels. Each vector then enters the weight-shared MLP, and the outputs are merged into a single vector by element-wise summation. Notably, the MLP in CBAM consists of two linear layers with a ReLU non-linear activation in between, which is different from the MLP in the transformer. Eventually, the Sigmoid function allocates attention weights to each channel, yielding the channel attention map denoted as $M_{c} \in \mathbb{R}^{C\times 1\times 1}$. Formally,
$$M_{c}(F_{c}) = \sigma\big(\mathrm{MLP}(\mathrm{MaxPool}(F_{c})) + \mathrm{MLP}(\mathrm{AvgPool}(F_{c}))\big),$$
where $\sigma(\cdot)$ symbolizes the Sigmoid function, and $\mathrm{MaxPool}(\cdot)$ and $\mathrm{AvgPool}(\cdot)$ denote the max pooling and average pooling operations, respectively. The channel-wise refined feature $F_{c}'$ is obtained by multiplying $F_{c}$ element-wise with the channel attention map $M_{c}$. Formally,
$$F_{c}' = M_{c}(F_{c}) \otimes F_{c},$$
where ⊗ means element-wise multiplication.
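A sketch of CAM following the standard CBAM formulation is given below; the reduction ratio of the shared MLP is an assumed hyperparameter, and $1\times 1$ convolutions on the pooled $C\times 1\times 1$ vectors play the role of the two linear layers.

```python
# Sketch of the channel attention module (CAM): global max/average pooling, a shared
# two-layer MLP with ReLU in between, element-wise summation, and a Sigmoid.
# The reduction ratio r = 16 is an assumed hyperparameter.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # 1x1 convolutions on the pooled (B, C, 1, 1) vectors act as the two linear layers.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):                                  # x: (B, C, H, W)
        max_vec = self.mlp(F.adaptive_max_pool2d(x, 1))    # (B, C, 1, 1)
        avg_vec = self.mlp(F.adaptive_avg_pool2d(x, 1))    # (B, C, 1, 1)
        m_c = torch.sigmoid(max_vec + avg_vec)             # channel attention map Mc
        return x * m_c                                     # channel-wise refined feature Fc'
```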
After the channel-wise refinement, $F_{c}'$ undergoes the convolution operations consistent with the previous three upsampling blocks, and the resulting feature is denoted as $F_{s}$. $F_{s}$ is further refined through SAM in the spatial domain. Specifically, the input feature passes through two pooling layers to generate matrices with dimension $1\times H_{0}\times W_{0}$, where $H_{0}\times W_{0}$ is the spatial size of $F_{s}$. Then, the concatenated matrices undergo a convolutional layer and a Sigmoid function to output the spatial attention map $M_{s} \in \mathbb{R}^{1\times H_{0}\times W_{0}}$ (see
Figure 1d). Formally,
$$M_{s}(F_{s}) = \sigma\big(\mathrm{Conv}_{7\times 7}\big(\mathrm{Concat}(\mathrm{MaxPool}(F_{s}),\ \mathrm{AvgPool}(F_{s}))\big)\big),$$
where $\mathrm{Concat}(\cdot)$ means concatenation and $\mathrm{Conv}_{7\times 7}(\cdot)$ means a convolution operation with a filter size of $7\times 7$. Eventually, the feature $F_{s}$ is improved in the spatial dimension through element-wise multiplication with $M_{s}$, producing the spatial-wise refined feature $F_{s}'$. Formally,
$$F_{s}' = M_{s}(F_{s}) \otimes F_{s}.$$
Overall, the concatenated feature is further enhanced across the channel and spatial domains during the last upsampling block to facilitate difference discrimination.
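SAM can be sketched analogously; the $7\times 7$ filter size follows the standard CBAM setting assumed above.

```python
# Sketch of the spatial attention module (SAM): channel-wise max/average pooling,
# concatenation, a 7x7 convolution, and a Sigmoid, followed by element-wise multiplication.
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                                  # x: (B, C, H0, W0)
        max_map, _ = x.max(dim=1, keepdim=True)            # (B, 1, H0, W0)
        avg_map = x.mean(dim=1, keepdim=True)              # (B, 1, H0, W0)
        m_s = torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))  # Ms
        return x * m_s                                     # spatial-wise refined feature Fs'
```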
Up to this point, we have obtained discriminative features $F_{s}'$ with a spatial size of $H_{0}\times W_{0}$, consistent with the size of the raw images. A classifier comprising a convolutional layer and a Softmax function is applied to $F_{s}'$ to generate a two-channel predicted change probability map $P$. A binary change map is then produced by performing an Argmax operation on $P$ in the channel dimension on a pixel-by-pixel basis.
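Finally, the classifier head can be sketched as a convolution followed by Softmax and a channel-wise Argmax; the $3\times 3$ filter size is an assumption.

```python
# Sketch of the classifier head: a convolution produces two logits per pixel, Softmax turns
# them into the change probability map P, and a channel-wise Argmax gives the binary map.
import torch
import torch.nn as nn


class Classifier(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 2, kernel_size=3, padding=1)

    def forward(self, feat):                               # feat: (B, C, H0, W0)
        prob = torch.softmax(self.conv(feat), dim=1)       # two-channel probability map P
        change_map = prob.argmax(dim=1)                    # binary change map, (B, H0, W0)
        return prob, change_map
```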