1. Introduction
Water body extraction, which refers to the correct and effective detection of water bodies in remote sensing images under the interference of complex spatial backgrounds, is a fundamental task in remote sensing image interpretation [1,2]. The extraction of water bodies is of great importance in disaster prevention [3], resource utilization, and environmental protection [4]. Remote sensing data from high-resolution and multi-spectral satellites are primarily used for water body extraction tasks. Many methods have been proposed to accurately extract water bodies from remote sensing images; they can be mainly divided into water body index-based spectral analysis methods and machine learning-based methods.
Water body index-based spectral analysis methods rely on threshold values. The normalized difference water index (NDWI) [5] exploits the different reflectance of water bodies in different spectral bands to detect water. To address the problem that NDWI cannot adequately handle image background noise, Xu et al. [6] replaced the near-infrared band in NDWI with a short-wave infrared band, naming the result the modified NDWI (MNDWI). A single threshold may misclassify water bodies because the shadows of some targets have spectral characteristics similar to those of water. Therefore, some researchers have combined the NDWI with other relevant indices to eliminate the influence of shadows [7,8]. Although the performance of water body index-based methods keeps improving, they are still affected by static thresholds and subjective factors. In addition, threshold methods are not appropriate for extracting small-area water bodies.
Machine learning-based water classification methods mostly use manually designed water features. The feature space formed by these features is fed into a machine-learning model for water extraction. Machine-learning approaches show better performance by avoiding the subjective selection of thresholds and making better use of the information in the images. Balázs et al. [9] used principal component analysis to extract information from images and successfully identified water bodies and saturated soils. Zhang et al. [10] proposed an SVM-based method that enhances edge feature extraction and reduces some segmentation errors; however, its accuracy still needs improvement, and the algorithm does not scale easily to large datasets. Despite the progress of the above methods, manually designed features have a limited ability to describe water bodies and require a certain amount of prior knowledge. Additionally, they cannot fully exploit deep features and global information to detect water bodies.
Traditional water extraction methods suffer from poor generalization ability and cannot remove noise interference well. Recently, convolutional neural networks (CNN) have been extensively used on remote sensing images due to their excellent feature extraction ability. Deep CNNs (DCNN) extract features at different scales of an image through multiple convolutional layers, avoiding the complicated feature selection process, and have gradually become a mainstream method for water extraction. Wang et al. [11] used a ResNet101 network with depth-wise separable convolution as the encoder and proposed a multi-scale densely connected module to enlarge the receptive field and improve the segmentation accuracy of small lake water bodies. Zhang et al. [12] designed a new multi-feature extraction module and a combined module to extract water bodies, considering feature information at different scales. Chen et al. [13] proposed a method based on feature pyramid enhancement and pixel pair matching, which alleviates the loss of detail in deep networks and reduces boundary classification errors. Dang et al. [14] considered the multi-scale and multi-shape characteristics of water bodies and proposed a multi-scale residual model trained with a self-supervised learning strategy for water body extraction.
DCNN-based methods improve the performance of water body extraction, but some limitations remain. The convolution operation has a limited receptive field and lacks the ability to model global information, since it only collects information from adjacent pixels. For semantic segmentation, modeling only local information ignores the relationships between pixels, resulting in low segmentation accuracy; with the help of global information, the semantics of each pixel become more precise. To overcome the local nature of the CNN and model global information in images, some researchers have combined attention mechanisms with networks [15,16]. Attention mechanisms adaptively weight important features and have also been applied to water body extraction tasks. Duan et al. [17] introduced channel and spatial attention mechanisms into the network and combined multi-scale features to enhance the continuity of water contours. To reduce the influence of noisy information on water body boundaries, Zhong et al. [18] introduced a two-way channel attention mechanism, which improved the network's accuracy for lake boundary segmentation. However, owing to interference from complex spatial backgrounds and shadows, the extraction of water body boundaries and tributaries still needs improvement. Moreover, the above attention mechanisms rely on convolution operations, which limits the representation of global features.
Transformer is an attention-based architecture [19]. It models the relationships between input tokens and handles long-range dependencies. Unlike a CNN, Transformer processes one-dimensional sequence features generated by flattening two-dimensional image features. The standard Transformer structure mainly consists of layer normalization (LN), multi-head self-attention (MHSA), a multilayer perceptron (MLP), and skip connections; Transformer blocks can be stacked to model the long-range dependencies of image features. By performing global self-attention over image features, Transformer effectively captures long-range dependencies and compensates for the limited ability of CNNs to extract global image information. Although some studies have achieved satisfactory results with Transformer, they mostly rely on large-scale pre-training [19,20]. MixFormer [21] is a Transformer structure with an excellent ability to capture global contextual information.
To extract water bodies accurately from remote sensing images, we embedded MixFormer into Unet [22] and proposed MU-Net, a hybrid MixFormer structure. We validated the network's performance on the GID [23] and LoveDA [24] datasets. MU-Net combines CNN and MixFormer to capture the local and global contexts of images, respectively. Because CNNs have an excellent spatial inductive bias [25], we used classical convolutional layers to extract shallow image features, which preserves high-resolution information and avoids large-scale pre-training. To obtain feature maps at different scales, the original image was downsampled via max pooling. To accurately identify water features in complex backgrounds and improve the integrity of tributaries and boundary contours, we first extracted local information from deep features with convolutional layers and then modeled global contextual information with a MixFormer block to mine deeper semantic features of water bodies. Afterward, the features generated by the encoder were refined by the attention mechanism module (AMM), which weights water-related features to suppress the interference of non-water features and noise. Finally, in the decoding process, we recovered the resolution and detail information of the image by bilinear interpolation and skip connections to generate the final water body extraction results. The main contributions of this paper can be summarized as follows:
- (1) MU-Net, a network embedding MixFormer into Unet, is proposed for the automatic extraction of water bodies. To enhance the integrity and accuracy of the extraction results, CNN and MixFormer are combined to model the local spatial and global contextual information of the deep features, further mining the details and deep semantic features of water bodies.
- (2) The AMM refines the features generated by the encoder to suppress image background noise and non-water body features.
- (3) Compared with other CNN- and Transformer-based segmentation networks, the proposed method achieves optimal performance on the GID and LoveDA datasets.
The rest of this paper is organized as follows. Section 2 introduces related work, including the background of the model concept. In Section 3, we describe the proposed method in detail, including the overall structure and the specific structure of each module. Section 4 presents the details of the experiments, including the datasets, the results of the comparisons with different networks, and the ablation experiments of the proposed method. Section 5 verifies the rationality of some settings in MU-Net, and Section 6 summarizes the work.
3. Proposed Method
This section introduces the proposed MU-Net. First, we give a general overview of the model's structure; then, we describe the structure of each module in detail and introduce the loss function.
3.1. Overall Structure
As shown in Figure 1, MU-Net is based on an encoder–decoder structure, which mainly consists of the Conv block, the MixFormer block, and the AMM.
In the encoding process, we used max pooling to downsample the image and generate a series of feature maps at different scales. Shallow features are produced by convolution operations so that they retain high-resolution details and focus on local contextual information; extracting these local features with convolution layers also avoids the large-scale pre-training required by Transformer. The shallow features contain rich location and detail information and focus on capturing the boundaries of water bodies, which is important for accurately extracting small water bodies and tributaries. Each Conv block consists of two classical convolution modules, each comprising a convolution with kernel size 3, batch normalization (BN), and a rectified linear unit (ReLU).
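For concreteness, the following is a minimal PyTorch sketch of such a Conv block; the class name and interface are illustrative assumptions, not the authors' released code:

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two classical convolution modules: (3x3 Conv -> BN -> ReLU), twice."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```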
Deep features focus on capturing semantic information to classify pixels accurately, which is important for identifying water features in different scenarios. For deep features, local information is first modeled by convolutional layers; the features are then flattened, and self-attention operations are performed in the MixFormer block. In this way, the detailed information of the image is retained while a global receptive field is captured, which helps to extract water bodies more accurately and completely. Considering the loss of detail information and the scale of the model, we introduced two layers of MixFormer blocks in the network. The deep features have a smaller spatial size, which reduces the computational cost of modeling global information through the MixFormer block. The sequence features output by the MixFormer block are then reshaped from 1D sequences back to 2D image features for the next convolution operation. Finally, the AMM refines the features produced by the encoder to suppress noise interference and strengthen water-related features.
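The flatten-and-reshape step around the MixFormer block can be sketched as follows; `run_mixformer_stage` and the `mixformer_block` callable are placeholders of ours (in practice, window-based attention would also need the spatial size to partition windows):

```python
import torch

def run_mixformer_stage(feat: torch.Tensor, mixformer_block) -> torch.Tensor:
    """Hypothetical glue code: 2D feature map -> 1D token sequence ->
    MixFormer block -> back to 2D features for the next convolution."""
    b, c, h, w = feat.shape
    tokens = feat.flatten(2).transpose(1, 2)   # (B, C, H, W) -> (B, H*W, C)
    tokens = mixformer_block(tokens)           # self-attention over the sequence
    return tokens.transpose(1, 2).reshape(b, c, h, w)
```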
The decoder retains low-level features via skip connections and uses bilinear interpolation to progressively recover the resolution and detail information of the image. Finally, the feature maps are upsampled to the original image resolution, and the predicted water body extraction results are generated.
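One decoder step might look like the following sketch (reusing the `ConvBlock` sketch above; the channel bookkeeping is our assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Bilinear upsampling + skip connection + convolution (illustrative)."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = ConvBlock(in_ch + skip_ch, out_ch)  # ConvBlock sketched earlier

    def forward(self, x, skip):
        # Recover resolution with bilinear interpolation, then fuse low-level detail.
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear", align_corners=False)
        x = torch.cat([x, skip], dim=1)
        return self.conv(x)
```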
3.2. MixFormer
In remote sensing images, complex spatial backgrounds interfere with feature extraction, and acquiring global information benefits the accurate segmentation of targets. A common solution is to add a convolution-based attention mechanism to the model, but such mechanisms struggle to capture global contextual information. Alternatively, a Transformer can be used as the encoder, which increases the model's complexity while losing image details. Therefore, we introduced MixFormer [21] to model global contextual information after the CNN-based encoder obtains the deep features, improving the modeling of long-range dependencies in images.
Although Transformer has an excellent ability to model global information, its computational complexity is much higher than that of convolution-based attention mechanisms, which seriously limits its applicability to remote sensing image processing. Some researchers have used local windows to reduce the computational cost of self-attention [20,49]. MixFormer [21] combines window-based self-attention with depth-wise convolution in parallel branches, enabling cross-window information interaction and expanding the receptive field. The depth-wise convolution also captures the local relationships of the image, complementing the global information obtained by self-attention. Furthermore, MixFormer includes bi-directional interaction paths that deliver complementary information between the parallel branches.
3.2.1. MixFormer Block
As shown in Figure 2, the MixFormer block consists of two normalization layers, a mixing attention module, and a multilayer perceptron. The process can be formulated as follows:

$$X' = \mathrm{CAT}\big(\mathrm{W\text{-}MSA}(\mathrm{LN}(X)),\ \mathrm{CONV}(\mathrm{LN}(X))\big) + X,$$
$$Y = \mathrm{MLP}(\mathrm{LN}(X')) + X',$$

where $\mathrm{W\text{-}MSA}$ represents the window-based self-attention branch, and $\mathrm{CONV}$ represents the depth-wise convolution branch. $\mathrm{CAT}$ represents the operation that fuses the features of the $\mathrm{W\text{-}MSA}$ and $\mathrm{CONV}$ branches. $\mathrm{LN}$ represents layer normalization, and $\mathrm{MLP}$ represents the multilayer perceptron composed of two linear layers and one GELU [50]. In the parallel branches, the outputs of the depth-wise convolution branch and the window-based self-attention branch are concatenated and then fed into the feedforward network.
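Read as code, the block could be sketched as below; the branch modules and the fusion projection are assumed interfaces of ours, and the bi-directional interactions of Section 3.2.3 are omitted for brevity:

```python
import torch
import torch.nn as nn

class MixFormerBlockSketch(nn.Module):
    """Simplified rendering of the two equations above (interactions omitted)."""
    def __init__(self, dim, w_msa, dw_conv):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.w_msa = w_msa        # window-based self-attention branch (assumed callable)
        self.dw_conv = dw_conv    # depth-wise convolution branch (assumed callable)
        self.fuse = nn.Linear(2 * dim, dim)  # CAT: concatenate branches, project back
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                     # x: (B, N, C) token sequence
        z = self.norm1(x)
        mixed = self.fuse(torch.cat([self.w_msa(z), self.dw_conv(z)], dim=-1))
        x = x + mixed                         # X' = CAT(W-MSA(LN(X)), CONV(LN(X))) + X
        return x + self.mlp(self.norm2(x))    # Y  = MLP(LN(X')) + X'
```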
3.2.2. Window-Based Self-Attention
The computational flow of self-attention and an illustration of the image window partition are shown in Figure 3. The computation process of window-based self-attention is as follows. The 2D feature map is first partitioned into non-overlapping 8 × 8 windows, and the 2D features within each window are flattened into a 1D sequence; self-attention is then performed inside the window. The number of channels of the 1D sequence is expanded 3 times by a linear layer and then split into three vectors: query ($Q$), key ($K$), and value ($V$). The self-attention calculation is expressed as follows:

$$A = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V,$$

where $QK^{T}$ obtains the relationship between pixels. The dimensions of $Q$, $K$, and $V$ are all $N \times C$, where $N$ represents the length of the sequence and $C$ represents the number of channels. First, $K$ is transposed ($K^{T}$); then, $Q$ is multiplied with the matrix $K^{T}$, followed by division by $\sqrt{d}$. Similar to the Swin Transformer method, a relative position encoding $B$ is added, and the result is passed through the Softmax function to generate a spatial attention matrix. Finally, this matrix is multiplied by $V$ to produce $A$, the feature map after the self-attention operation. More details of the window-based self-attention used in this paper can be found in Swin Transformer [20].
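A compact sketch of this computation (single head, with the relative position encoding $B$ simplified to a learned N × N table; both simplifications are ours) might look like:

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Self-attention inside one 8x8 window; single-head simplification."""
    def __init__(self, dim, window=8):
        super().__init__()
        n = window * window
        self.qkv = nn.Linear(dim, 3 * dim)            # expand channels 3x -> Q, K, V
        self.bias = nn.Parameter(torch.zeros(n, n))   # simplified relative position B
        self.scale = dim ** -0.5                      # 1 / sqrt(d)

    def forward(self, tokens):                        # tokens: (num_windows, N, C)
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.bias
        attn = attn.softmax(dim=-1)                   # spatial attention matrix
        return attn @ v                               # A = Softmax(QK^T/sqrt(d)+B) V
```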
3.2.3. Parallel Branches and Bi-Directional Interactions
As shown in Figure 2, the purpose of using window-based self-attention and depth-wise convolution as parallel branches is to solve the limited receptive field problem arising from performing self-attention within local windows. Considering inference efficiency, the parameter settings of MU-Net are the same as those of MixFormer [21], and the kernel size of the depth-wise convolution is set to 3 × 3.
The bi-directional interactions between the parallel branches are designed to improve the modeling ability of the network [21]. The self-attention branch performs self-attention within windows and shares weights along the channel dimension, which leads to weak modeling ability in the channel dimension; conversely, depth-wise convolution shares weights in the spatial dimension while modeling channel-dimension features [51]. As shown in Figure 2, the information extracted by the depth-wise convolution is passed through the channel interaction path to the window-based self-attention branch to improve the network's modeling ability in the channel dimension. In the same way, the information obtained from the window-based self-attention branch is passed through the spatial interaction path to the depth-wise convolution branch to enhance its modeling ability in the spatial dimension.
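The two interaction paths could be sketched as follows; the exact layer choices inside each path are assumptions based on the description above, not the authors' code:

```python
import torch.nn as nn

class ChannelInteraction(nn.Module):
    """Conv-branch features produce channel weights for the attention branch."""
    def __init__(self, dim, r=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // r, dim, 1), nn.Sigmoid(),
        )

    def forward(self, conv_feat, attn_feat):
        return attn_feat * self.gate(conv_feat)   # channel-wise re-weighting

class SpatialInteraction(nn.Module):
    """Attention-branch features produce a spatial map for the conv branch."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(dim, 1, 1), nn.Sigmoid())

    def forward(self, attn_feat, conv_feat):
        return conv_feat * self.gate(attn_feat)   # pixel-wise re-weighting
```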
3.3. AMM
The shallow features generated by the encoder contain rich spatial details but lack semantic information and are susceptible to noise; moreover, most non-water disturbances exist in the shallow features. We therefore propose an attention mechanism module (AMM) to refine the features and further enhance the network's ability to identify water bodies. The structure of the AMM is shown in Figure 4. We constructed two attention branches to improve the representation of encoder features in the spatial and channel dimensions, where the reduction ratio r is set to 16. Specifically, the channel attention branch generates a C × 1 × 1 feature vector by a global average pooling layer, where C represents the channel dimension; two 1 × 1 convolutions first compress the number of channels to 1/16 of C and then restore it to the original number, and the channel attention map is generated by a sigmoid function. The spatial attention branch uses depth-wise convolution to strengthen information in the spatial dimension, followed by two 1 × 1 convolutions with BN and ReLU that reduce the number of channels to 1. The channel and spatial attention maps are multiplied by the original input features, and the refined feature maps are generated by feature fusion.
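A minimal sketch of the AMM, assuming a sigmoid at the end of the spatial branch and additive fusion of the two refined maps (both are our reading of Figure 4, not confirmed details):

```python
import torch.nn as nn

class AMM(nn.Module):
    """Attention mechanism module sketch with reduction ratio r = 16."""
    def __init__(self, ch, r=16):
        super().__init__()
        self.channel = nn.Sequential(                 # channel attention branch
            nn.AdaptiveAvgPool2d(1),                  # -> (B, C, 1, 1)
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(                 # spatial attention branch
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),   # depth-wise convolution
            nn.Conv2d(ch, ch // r, 1), nn.BatchNorm2d(ch // r), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, 1, 1), nn.Sigmoid(),   # reduce channels to 1
        )

    def forward(self, x):
        # Multiply each attention map with the input, then fuse (sum assumed).
        return x * self.channel(x) + x * self.spatial(x)
```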
3.4. Loss Function
During the training stage, MU-Net was trained end-to-end using a target loss function, and the final prediction map was generated by a Softmax function. Water samples are unevenly distributed in remote sensing images, accounting for a much smaller proportion than negative samples. We used dice loss [52] to prevent the training from focusing on the background region, which would lead to inaccurate segmentation results. Our target loss function combines cross-entropy and dice loss [52]:

$$L = \alpha L_{CE} + \beta L_{Dice},$$
$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log p_{i} + (1 - y_{i})\log(1 - p_{i})\right],$$
$$L_{Dice} = 1 - \frac{2\sum_{i=1}^{N} y_{i} p_{i}}{\sum_{i=1}^{N} y_{i} + \sum_{i=1}^{N} p_{i}},$$

where $N$ represents the total number of pixels, $y_{i}$ represents the ground truth value of pixel $i$, and $p_{i}$ represents the predicted probability of pixel $i$ generated by the final Softmax function. α and β are two hyperparameters that assign different weights to the two losses; in our experiments, they were fixed to empirically chosen values.
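A sketch of this objective for the binary water/background case follows; the weight defaults and the smoothing term `eps` are placeholders of ours, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def combined_loss(prob, target, alpha=0.5, beta=0.5, eps=1e-6):
    """Weighted cross-entropy + dice loss on per-pixel water probabilities.

    prob and target: float tensors of the same shape, with values in [0, 1].
    alpha, beta, and eps are illustrative defaults only.
    """
    ce = F.binary_cross_entropy(prob, target)
    intersection = (prob * target).sum()
    dice = 1.0 - (2.0 * intersection + eps) / (prob.sum() + target.sum() + eps)
    return alpha * ce + beta * dice
```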