Article

DMTN-Net: Semantic Segmentation Architecture for Surface Unmanned Vessels

1 School of Ship and Port Engineering, Shandong Jiaotong University, Weihai 264209, China
2 Weihai Institute of Marine Information Science and Technology, Weihai 264200, China
3 School of Electrical and Information Technology, Wuhan Institute of Technology, Wuhan 430205, China
4 School of Shipping, Wuhan University of Technology, Wuhan 430070, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(22), 4539; https://doi.org/10.3390/electronics13224539
Submission received: 30 October 2024 / Revised: 12 November 2024 / Accepted: 15 November 2024 / Published: 19 November 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

Aiming at the problems of insufficient navigable area recognition accuracy, blurred obstacle segmentation boundaries, and high computational resource consumption in the autonomous navigation of waterborne platforms such as USVs, this paper proposes DMTN-Net, a network architecture based on DeeplabV3+ that improves the accuracy and efficiency of environment perception. First, DMTN-Net adopts the lightweight MobileNetV2 as the backbone, which reduces the amount of computation. Second, an innovative N-Decoder structure integrates cSE and Triplet Attention, which enhances feature representation and improves segmentation performance. Finally, experiments were conducted on the MassMIND dataset, the Pascal VOC2007 dataset, and related sea areas. The experimental results show that DMTN-Net performs well on the MassMIND and Pascal VOC2007 datasets; compared with other mainstream networks, the mIoU, mPA, and mPrecision metrics are significantly improved, while the computational cost is greatly reduced. In addition, offshore navigation experiments further validate its performance advantages and provide solid support for the practical deployment of vision-based perception on USVs.

1. Introduction

As an important representative of marine robotics, unmanned surface vehicles (USVs) are widely used in military reconnaissance, water safety surveillance, environmental monitoring, and other fields by virtue of their high flexibility, strong concealment, and excellent environmental adaptability, with high-precision autonomous navigation and environment perception at the core of their technology [1]. However, existing traditional sensing devices such as ranging sensors, LiDAR, millimeter-wave radar, sonar, and GPS, as well as multi-sensor fusion technology, suffer from high cost, limited accuracy, and poor system stability, which makes it difficult to meet the all-around perception needs of USVs under complex sea conditions. With the development of artificial intelligence, vision technology has been widely applied to unmanned vessel environment perception [2,3], especially semantic segmentation [4], by virtue of its excellent capability for processing structured data.
In USVs, visual perception uses vision sensors to collect images and combines them with deep learning frameworks to achieve navigable area segmentation and obstacle detection. The classical deep-learning-based semantic segmentation algorithms include U-Net [5], HRNet [6], PSPNet [7], SegFormer [8], and the Deeplab series (V1 [9], V2 [10], V3 [11], V3+ [12]). With the development of Internet technology, various improved algorithms based on traditional semantic segmentation and deep learning have emerged. For example, the WaSR network proposed by Bovcon [13] can effectively segment navigable water areas and ships, but its segmentation of small obstacles such as reefs is poor, which cannot ensure the safe navigation of vessels. The 3D point cloud method proposed by Muhovič [14] and the WODIS water surface obstacle segmentation algorithm proposed by Chen [15] both achieve accurate segmentation of obstacle edges, but these algorithms carry heavy parameter counts and require substantial computing power. Yao [16] introduced a spatial attention mechanism into U-Net to realize accurate inland water-shore segmentation, but the algorithm occupies a large amount of memory, which hinders model deployment. Xiong [17] proposed a fast segmentation network, DeeplabV3-CSPNet, which improves computing speed by introducing attention into the feature extraction and feature fusion parts and realizes a simulated application on USVs, but its adaptability to complex scenes is poor. Therefore, excessive parameter counts that hinder deployment, low segmentation accuracy, and weak generalization ability have become urgent problems in semantic segmentation of the navigable domain for unmanned vessels.
In order to solve the above problems, this paper adopts Deeplabv3+ as the baseline model and improves it to propose DMTN-Net. Compared with other mainstream networks, Deeplabv3+ has an advanced network structure and excellent performance, and it can handle objects of different scales, which makes it well suited to complex maritime scenes. First, Mobilenetv2 is adopted as the backbone network to reduce the number of parameters and make the model more lightweight. Second, the N-Decoder is proposed to improve the segmentation accuracy and generalization ability of the model; the cSE attention module and Triplet Attention are integrated into it to achieve information interaction between multiple channels, reduce the loss of spatial information, and enhance segmentation accuracy and generalization. Finally, the excellent performance of the algorithm is demonstrated by multi-dataset experiments and offshore sailing experiments.

2. DeeplabV3+

DeepLabV3+ [12] is a semantic segmentation network developed by Google in 2018, and its base model is shown in Figure 1. After an image is input, deep feature extraction is first performed by the backbone network Xception in the Encoder to obtain feature maps containing multi-scale information. These features are then passed to the Atrous Spatial Pyramid Pooling (ASPP) module, which applies atrous (dilated) convolutions at different scales (a 1 × 1 convolution and 3 × 3 atrous convolutions with dilation rates of 6, 12, and 18; a comparison of the receptive fields under different dilation rates is shown in Figure 2) as well as global average pooling. By enlarging the receptive fields of the convolutional layers in a nearly parameter-free manner, these operations enhance the model's ability to perceive contextual information at different scales and realize the refinement and multi-scale fusion of the feature maps. The fused features are up-sampled by a factor of four and fed into the Decoder module, where they are fused with the low-level feature maps from the Encoder; this process enhances the model's ability to capture detailed information. Finally, after a 3 × 3 convolution and bilinear up-sampling, a segmentation map matching the resolution of the input image is obtained, achieving accurate mapping from the pixel level to the semantic level.
This network performs well in multiple application scenarios, such as autonomous driving, medical imaging, and intelligent surveillance. Compared with its predecessor DeepLabV3, DeepLabV3+ introduces two major innovations in architectural design. First, it draws on the U-Net structure and adopts an Encoder-Decoder framework. In the Encoder stage, densely deployed atrous convolutions maintain the spatial resolution of the feature maps while greatly expanding the effective receptive field, enabling each convolutional unit to capture richer contextual information. This design facilitates the seamless fusion of high-level semantic features with low-level detail information, significantly improving the accuracy and boundary clarity of the segmentation results. Second, the backbone of DeepLabV3+ adopts the Xception network instead of ResNet101. The core advantage of Xception is its depthwise separable convolution, which decomposes the standard convolution into depthwise convolution and pointwise convolution, effectively reducing the total number of parameters and the computational complexity while improving efficiency. Combined with Xception's residual connections, DeepLabV3+ integrates multi-scale feature information more flexibly and demonstrates excellent recognition and segmentation of objects of different sizes in complex scenes.
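To make the ASPP design described above concrete, the following is a minimal PyTorch sketch of an ASPP block with a 1 × 1 branch, three 3 × 3 atrous branches at dilation rates 6, 12, and 18, and an image-level pooling branch, fused by a 1 × 1 projection. It is an illustrative reading of the standard DeepLabV3+ module, not the authors' implementation; the module and argument names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Minimal ASPP sketch: 1x1 conv, three atrous 3x3 convs (rates 6/12/18),
    and image-level pooling, fused by a final 1x1 projection."""
    def __init__(self, in_ch: int, out_ch: int = 256, rates=(6, 12, 18)):
        super().__init__()
        def conv_bn_relu(k, dilation=1):
            pad = 0 if k == 1 else dilation
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList(
            [conv_bn_relu(1)] + [conv_bn_relu(3, r) for r in rates])
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```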

3. DMTN-Net

In order to solve the problems of low navigable-area recognition accuracy, fuzzy obstacle segmentation, and low computational efficiency of semantic segmentation algorithms in USV navigation, we propose the DMTN-Net network based on the DeeplabV3+ model. The main improvements are as follows: (1) the backbone network is replaced with the lightweight MobileNetV2; (2) an enhanced N-Decoder structure is designed, with cSE attention inserted into it; (3) the nearly parameter-free Triplet Attention is introduced into the N-Decoder. The structure of the DMTN-Net network is shown in Figure 3.

3.1. MobileNetV2

DMTN-Net uses the lightweight network MobileNetV2 [18] as the backbone, a model proposed by Google in 2018 specifically for edge computing. It not only inherits the depthwise separable convolution idea of MobileNetV1, but also introduces the inverted residual block and the linear bottleneck structure. Compared with a traditional ResNet or CNN, MobileNetV2 significantly reduces the computational cost while maintaining high accuracy and is more lightweight.
Depthwise separable convolution is the core of MobileNetV2 [19]; it decomposes the standard convolution into two steps, depthwise convolution [20] and pointwise convolution [21], as shown in Figure 4. First, the depthwise convolution processes each input channel independently for initial feature extraction. Subsequently, the pointwise convolution uses a 1 × 1 convolution kernel to fuse the information of different channels and generate a new feature map. This decomposition strategy effectively reduces the amount of computation while maintaining the expressive power of the model.
The computational efficiency of the two convolution operations is compared here [22]. As shown in Equation (1), the ratio of the cost of a depthwise separable convolution (denoted $F_d$) to that of a standard convolution (denoted $F_c$) can be expressed in terms of a few parameters: $D_k$ is the size of the convolution kernel, $D_f$ covers the width and height of the feature matrix, and M and N denote the number of channels of the input and output feature matrices, respectively. The analysis shows that depthwise separable convolutions have a significant computational advantage over standard convolutions, directly reflecting their effectiveness in reducing computational complexity.
$$\frac{F_d}{F_c} = \frac{M \times D_f^2 \times D_k^2 + M \times N \times D_f^2}{M \times D_f^2 \times D_k^2 \times N} = \frac{1}{N} + \frac{1}{D_k^2} \quad (1)$$
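As a quick check of Equation (1), the sketch below builds a standard convolution and a depthwise separable counterpart in PyTorch and compares their parameter counts; for convolutions applied at a fixed output resolution, the same ratio holds for multiply-accumulate operations. The helper names and channel sizes are illustrative assumptions.

```python
import torch.nn as nn

def standard_conv(m: int, n: int, k: int = 3) -> nn.Module:
    # one standard k x k convolution: m * n * k * k weights
    return nn.Conv2d(m, n, k, padding=k // 2, bias=False)

def depthwise_separable_conv(m: int, n: int, k: int = 3) -> nn.Module:
    # depthwise k x k (m * k * k weights) followed by pointwise 1 x 1 (m * n weights)
    return nn.Sequential(
        nn.Conv2d(m, m, k, padding=k // 2, groups=m, bias=False),
        nn.Conv2d(m, n, 1, bias=False))

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

m, n, k = 32, 64, 3  # example channel counts and kernel size
ratio = n_params(depthwise_separable_conv(m, n, k)) / n_params(standard_conv(m, n, k))
print(f"measured ratio = {ratio:.4f}, predicted 1/N + 1/Dk^2 = {1 / n + 1 / k ** 2:.4f}")
```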
The inverted residual architecture is an evolution of the ResNet residual connection idea; the key is to optimize network performance and efficiency through a carefully designed sequence of convolutional operations, as shown in Figure 5. The architecture first uses a 1 × 1 pointwise convolution to expand the feature space; this step increases the number of channels of the feature map and introduces a richer feature representation. The expanded feature map is then processed by a depthwise convolutional layer which, thanks to parameter sharing and local connectivity, extracts and fuses features efficiently while keeping the computational cost low. Finally, a 1 × 1 pointwise convolution projects the feature map back to a lower dimension, which reduces the computational burden of subsequent layers and refines the feature representation through feature selection. This inverted residual design effectively retains high-dimensional information, giving the network a stronger feature representation for complex tasks, while significantly reducing the total number of parameters and the computational complexity. In addition, the built-in residual connection facilitates the smooth propagation of gradients through the network, making training more stable and supporting deeper architectures.
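The inverted residual block described above can be sketched as follows (1 × 1 expansion, 3 × 3 depthwise convolution, 1 × 1 linear projection, with a shortcut when the stride is 1 and the channel counts match). This mirrors the published MobileNetV2 design rather than the authors' exact code, and the expansion ratio of 6 is the usual default rather than a value stated in the paper.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style block: expand (1x1) -> depthwise (3x3) -> linear project (1x1)."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand_ratio: int = 6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_shortcut = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),               # 1x1 expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),                  # 3x3 depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),              # linear bottleneck (no activation)
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out
```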

3.2. N-Decoder Structure

In this paper, the N-Decoder structure is proposed on the basis of the original Decoder, as shown in Figure 6. First, the 1 × 1 convolution applied to the shallow feature map extracted by the backbone network is replaced with a 3 × 3 convolution. Second, the deep feature information produced by ASPP processing and 4× linear up-sampling is divided into two parts: the first semantic information and the second semantic information. The cSE module (shown in Figure 7) is embedded after the first semantic information; through global average pooling and an excitation operation it integrates and enhances the feature information along the channel dimension, so that the network pays more attention to key information and suppresses noise interference in complex tasks, thereby improving model accuracy. The result is then fused with the shallow feature map processed by the 3 × 3 convolution, passed through a 1 × 1 convolution, and a channel attention map is generated by a Sigmoid activation function; this attention map is multiplied element by element with the 3 × 3-convolved feature map to obtain the enhanced feature maps. Finally, the enhanced feature maps are fused with the second semantic information and passed through a 3 × 3 convolution, the Triplet Attention mechanism, and a 4× linear up-sampling operation to obtain the output.
In addition, the cSE attention module [23] was designed by Abhijit Guha Roy's team in 2018; it integrates channel information, enhances channel feature representation, and keeps the computational overhead low by combining a channel attention mechanism with dimensionality reduction and restoration. As shown in Figure 7, the module first compresses the input feature map from [C, H, W] to [C, 1, 1] through a global average pooling layer, preserving the statistics of the channel dimension. The compressed features are then transformed by two consecutive 1 × 1 convolutional layers: the first reduces the number of channels (dimensionality reduction) to minimize computation, and the second restores the number of channels (dimensionality restoration) and learns the nonlinear relationships between channels. The convolution output is then normalized by a sigmoid function into a vector of weights between 0 and 1 that reflects the importance of the different channels. Finally, the weight vector is multiplied channel by channel with the original feature map, recalibrating it at the channel level and enhancing the model's sensitivity to critical channel information.
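The recalibration described above can be written compactly; the sketch below follows the cSE recipe (global average pooling, channel-reducing and channel-restoring 1 × 1 convolutions, sigmoid gating, channel-wise multiplication). The reduction ratio of 16 is an assumed hyperparameter, and the class name is illustrative.

```python
import torch.nn as nn

class ChannelSE(nn.Module):
    """cSE-style channel attention: squeeze (GAP) -> excite (two 1x1 convs) -> rescale."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # [B, C, H, W] -> [B, C, 1, 1]
            nn.Conv2d(channels, channels // reduction, 1),  # dimensionality reduction
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # dimensionality restoration
            nn.Sigmoid())                                   # per-channel weights in (0, 1)

    def forward(self, x):
        return x * self.gate(x)  # channel-wise recalibration of the input features
```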

3.3. Triplet Attention

Triplet Attention [24] is an attention mechanism proposed by the Landskape team in 2021 that captures interactions across the channel and spatial (height and width) dimensions, enabling the model to capture complex patterns in the input data more comprehensively and improving its ability to understand and process multidimensional data such as images and videos.
The schematic is shown in Figure 8. Triplet Attention consists of three branches; the first two branches use rotation and Z-Pool operations to establish connections between the channel dimension and the spatial dimensions, and the weights of the branches are finally summarized by averaging.
The first branch aims to establish the interaction between the C and W dimensions; the specific process is as follows:
(1) Given a tensor $X \in \mathbb{R}^{C \times H \times W}$, first rotate it 90° counterclockwise along the H axis to obtain $\hat{X}_1$. The max-pooled and average-pooled feature representations of the W dimension are then aggregated by Z-Pool, and the two are concatenated to obtain $X_{avgmax}^{W}$. The formulas are as follows:
$$X_{avg}^{W} = \mathrm{AvgPool}(\hat{X}_1)$$
$$X_{max}^{W} = \mathrm{MaxPool}(\hat{X}_1)$$
$$X_{avgmax}^{W} = \mathrm{Concat}(X_{avg}^{W}, X_{max}^{W})$$
(2) The result is then mapped to one dimension by a Conv layer to obtain $M_{avgmax}^{W}$:
$$M_{avgmax}^{W} = \mathrm{Conv}(X_{avgmax}^{W})$$
(3) $M_{avgmax}^{W}$ is then passed through a gating mechanism to generate the weight representation $M^{W}$, which is multiplied with the input feature $\hat{X}_1$ to obtain the output $out^{W} \in \mathbb{R}^{C \times H \times W}$ of this branch; the result is then rotated clockwise by 90° along the H axis to recover the shape of the original input, giving $out_1 \in \mathbb{R}^{C \times H \times W}$. The formulas are as follows:
$$M^{W} = \mathrm{Sigmoid}(M_{avgmax}^{W})$$
$$out_1 = out^{W} = M^{W} \odot \hat{X}_1$$
The second branch aims to establish the interaction between the C and H dimensions; the process is the same as in the first branch, and the final result is:
$$M^{h} = \mathrm{Sigmoid}(M_{avgmax}^{h})$$
$$out_2 = out^{h} = M^{h} \odot \hat{X}_2$$
The third branch aims to establish the interaction between the H and W dimensions in much the same way as the first and second branches, except that no rotation is required, and the final result is:
$$M^{c} = \mathrm{Sigmoid}(M_{avgmax}^{c})$$
$$out_3 = out^{c} = M^{c} \odot X$$
Finally, the output of Triplet Attention is obtained by averaging $out_1$, $out_2$, and $out_3$. The network structure is shown in Figure 9. Through this triple attention mechanism, the model can capture richer contextual information with almost no additional parameters, and the multi-dimensional attention makes the model more resistant to noise and outliers, improving its robustness, accuracy, and generalization ability.
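For reference, the three-branch computation described above can be sketched as follows: each branch permutes the tensor so that a different pair of dimensions interacts, applies Z-Pool (concatenated channel-wise max and mean), a k × k convolution, and a sigmoid gate, and the three gated outputs are averaged. The kernel size of 7 follows the original Triplet Attention paper; the code is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

def z_pool(x: torch.Tensor) -> torch.Tensor:
    """Concatenate max- and average-pooled features along the (current) channel axis."""
    return torch.cat([x.max(dim=1, keepdim=True).values,
                      x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(1))

    def forward(self, x):
        return x * torch.sigmoid(self.conv(z_pool(x)))  # gate the permuted tensor

class TripletAttention(nn.Module):
    """Three branches: (C, W), (C, H) and (H, W) interactions, averaged at the end."""
    def __init__(self):
        super().__init__()
        self.gate_cw = AttentionGate()
        self.gate_ch = AttentionGate()
        self.gate_hw = AttentionGate()

    def forward(self, x):                                  # x: [B, C, H, W]
        out_cw = self.gate_cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)  # H on channel axis
        out_ch = self.gate_ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # W on channel axis
        out_hw = self.gate_hw(x)                                          # plain spatial gate
        return (out_cw + out_ch + out_hw) / 3.0
```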

4. Experiment

4.1. Experimental Configuration and Data

The experiments were carried out on a 64-bit Windows 11 operating system with 16 GB of RAM, an NVIDIA GeForce RTX 4060 GPU for accelerated computation, and a 13th-generation Intel Core i7-13700H processor clocked at 2.40 GHz. On the software side, Python 3.11.7, PyTorch 2.0.0, CUDA 11.8, cuDNN 8.7.0, and torchvision 0.15.1 were used.
Currently, domestic research in the field of sea surface obstacle segmentation is still in its infancy, and there is a lack of specialized, publicly available datasets to support in-depth exploration. In contrast, several related datasets have been released internationally in recent years, such as the Massachusetts Maritime Infrared Dataset (MassMIND) [25] and MaSTr1325 [26], which provide valuable experimental bases for researchers. In this paper, MassMIND, released in June 2023 by Shailesh Nirgudkar et al. and focused on the task of sea surface obstacle segmentation, is selected for this study. The dataset contains 2916 high-quality LWIR images acquired from coastal marine environments over a period of up to two years; this time span and environmental diversity ensure that the dataset covers a wide variety of weather conditions, lighting conditions, and sea surface obstacle types. For example, it includes common obstacles, living obstacles (e.g., boats, buoys), and bridges, as well as environmental elements such as the sky, the water, and the background. This rich diversity helps train segmentation models that are more robust and generalize better. In addition, each image in MassMIND is meticulously annotated, subdividing the scene into seven categories: sky, water body, common obstacles, living obstacles, bridges, self, and background. This careful classification helps the model recognize different types of obstacles more accurately, thus improving the accuracy and reliability of segmentation. To ensure the validity and reliability of the experiments, the dataset is divided into training and validation sets at a ratio of 9:1, so that the model can learn from sufficient data while its performance is evaluated on the validation set.

4.2. Evaluation Indicators

To ensure a comprehensive evaluation of the performance of DMTN-Net in navigable area and obstacle segmentation, this paper employs several accuracy metrics, including mean pixel accuracy (mPA), mean intersection over union (mIoU), mean precision (mPrecision), mean recall (mRecall), and the number of model parameters. Together, these metrics form a multidimensional framework for evaluating the effectiveness of the model.
Mean Pixel Accuracy (mPA): this metric calculates, for each category, the proportion of pixels that are correctly predicted out of all pixels that truly belong to that category, and then averages over all categories. In complex environments, unmanned vessels need to accurately recognize different categories of obstacles (e.g., drifting objects, debris, etc.) in order to adopt different obstacle-avoidance strategies, and this metric reflects the per-category accuracy of the segmentation results. The higher the mPA value, the better the network recognizes obstacles of different categories. The formula is as follows:
$$mPA = \frac{1}{n+1}\sum_{j=0}^{n}\frac{m_{jj}}{\sum_{k=0}^{n} m_{jk}}$$
Mean Intersection over Union (mIoU): this metric calculates, for each category, the ratio of the intersection to the union of the predicted results and the true labels, and then averages over all categories. It intuitively reflects the quality of the segmentation results: a higher mIoU value means the segmentation results are closer to the real labels and the unmanned vessel perceives the environment more accurately, which helps it recognize obstacles such as drifting objects and debris and make better obstacle-avoidance decisions. The formula is as follows:
$$mIoU = \frac{1}{n+1}\sum_{j=0}^{n}\frac{m_{jj}}{\sum_{k=0}^{n} m_{jk} + \sum_{k=0}^{n} m_{kj} - m_{jj}}$$
n denotes the number of categories, m j j is the number of correctly categorised pixels, m j k is the number of pixels from category j assigned to category k, and m k j is the number of pixels from category k assigned to category j.
Mean Precision (mPrecision): this metric calculates, for each category, the proportion of pixels that truly belong to the category out of all pixels predicted to be in that category, and then averages over all categories. It reflects the accuracy of the pixels predicted as positive samples in the segmentation results. A higher mPrecision value means the unmanned vessel is less likely to misjudge when recognizing obstacles, which helps it determine more reliably which objects are potential obstacles.
First, the pixel accuracy (CPA) for each category needs to be specified, which is calculated by the formula:
$$CPA = \frac{TP}{TP + FP}$$
TP: true positive, FP: false positive, FN: false negative.
Then, mPrecision is obtained by averaging the CPA values over all categories:
$$mPrecision = \frac{CPA_1 + CPA_2 + \cdots + CPA_n}{n}$$
where n is the total number of categories and $CPA_1, CPA_2, \ldots, CPA_n$ are the pixel accuracies of the individual categories.
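For reference, all three metrics can be computed from a single (n + 1) × (n + 1) confusion matrix whose entry m[j][k] counts pixels of true class j predicted as class k, matching the symbols defined above. The sketch below is illustrative and is not the evaluation code used in the paper.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """conf[j, k] = number of pixels with true class j predicted as class k."""
    tp = np.diag(conf).astype(float)
    gt_per_class = conf.sum(axis=1).clip(min=1)    # pixels truly in each class
    pred_per_class = conf.sum(axis=0).clip(min=1)  # pixels predicted as each class
    mpa = (tp / gt_per_class).mean()                           # mean pixel accuracy
    miou = (tp / (gt_per_class + pred_per_class - tp)).mean()  # mean IoU
    mprecision = (tp / pred_per_class).mean()                  # mean precision
    return mpa, miou, mprecision

# toy 3-class example
conf = np.array([[50, 2, 1],
                 [3, 40, 5],
                 [0, 4, 45]])
print(segmentation_metrics(conf))
```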
Hyperparameter selection: during training, stochastic gradient descent (SGD) is chosen as the weight-update strategy to accelerate the convergence of the model and improve segmentation performance.
The learning rate determines the convergence speed of the model: too high a learning rate may cause the model to diverge, while too low a rate makes convergence too slow. Based on multiple rounds of experiments, the learning rate of DMTN-Net is set to 0.001, which allows the model to converge quickly while ensuring that gradients propagate normally.
The total number of training epochs is set to 100, the batch size is set to 8, and an adaptive learning-rate adjustment strategy is used to improve training efficiency.
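Under the stated settings (SGD, initial learning rate 0.001, 100 epochs, batch size 8, adaptive learning-rate adjustment), the training loop could be configured roughly as below. The momentum and weight-decay values, the cosine-annealing scheduler (one possible form of adaptive adjustment), and the tiny stand-in model and dataset are all assumptions for illustration; in the paper the loss is the Dice loss described next rather than the cross-entropy stand-in used here.

```python
import torch
from torch.utils.data import DataLoader

# Stand-ins for the real network and dataset (assumed for illustration only).
model = torch.nn.Conv2d(3, 8, 3, padding=1)
train_set = [(torch.randn(3, 64, 64), torch.randint(0, 8, (64, 64))) for _ in range(16)]

loader = DataLoader(train_set, batch_size=8, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    for images, labels in loader:
        logits = model(images)
        loss = torch.nn.functional.cross_entropy(logits, labels)  # replace with Dice loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()  # adaptive learning-rate adjustment once per epoch
```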
The focal loss (Focal Loss) and the Dice similarity coefficient loss (Dice Loss) are often used to evaluate model performance, but because Focal Loss introduces additional attention factors and category-balancing weights, its loss computation is more complicated than the traditional cross-entropy loss. Therefore, this paper chooses Dice Loss, which has good anti-interference ability and a simple calculation process; its expression is as follows:
$$L_{dice} = \frac{FP + FN}{2 \times TP + FP + FN}$$
FP, FN, and TP denote false positives, false negatives, and true positives, respectively.
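A minimal multi-class Dice loss consistent with the expression above (computed per class from softmax probabilities and one-hot labels, then averaged over classes); in the hard-label limit it reduces to (FP + FN)/(2TP + FP + FN). This formulation is illustrative and not necessarily identical to the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """logits: [B, C, H, W]; target: [B, H, W] integer class labels (torch.long)."""
    num_classes = logits.shape[1]
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                                   # sum over batch and spatial dims
    intersection = (probs * one_hot).sum(dims)         # soft true positives per class
    cardinality = probs.sum(dims) + one_hot.sum(dims)  # soft 2TP + FP + FN per class
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice.mean()
```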

4.3. Ablation Experiment

In this paper, we carry out ablation experiments on the USV accessible areas obstacle segmentation dataset MassMind.
  • The original model of DeeplabV3+ (Xception) is denoted as D;
  • DeeplabV3+ (MobilenetV2) is denoted as D + M;
  • The addition of the proposed N-Decoder structure on the basis of D + M is denoted as D + M + N;
  • The embedding of the Triplet Attention mechanism on the basis of D + M is denoted as D + M + T;
  • Integrating the above module structure is denoted as DMTN-Net.
From the data in Table 1, it can be seen that the mIoU and mPrecision of Tests 2, 3, 4, and 5 are all improved compared with Test 1, which shows that the improvements in this paper are superior to the original model across the main segmentation metrics. The number of parameters is reduced by nearly a factor of ten after replacing the backbone network from Xception to MobilenetV2, and Test 4 shows no change in parameter count after adding Triplet Attention. Test 5 shows that the mIoU and mPrecision of the proposed DMTN-Net are improved by 7.31% and 8.39%, respectively, compared with the original model in Test 1. The improvement in mIoU indicates that the segmentation results of DMTN-Net are closer to the real labels, which helps the unmanned vessel recognize obstacles such as drifting objects and debris more accurately, and the improvement in mPrecision indicates that DMTN-Net makes fewer misjudgments when recognizing obstacles. Although the mPA of the original model is higher than that of DMTN-Net by 0.37%, the other metrics of DMTN-Net are clearly higher, and its number of parameters is much smaller than that of the original model.
The loss curve during training is shown in Figure 10; it can be observed that, as the number of training iterations increases, the loss value gradually decreases and stabilizes, indicating that the constructed network model has converged. This result again verifies the effectiveness of the model and the stability of the training process.
As shown in Figure 11, in order to visualise the performance difference between DeeplabV3+ (Xception) and DMTN-Net in the segmentation task, comparison experiments containing the original image, real labels, DeeplabV3+ segmentation results, and DMTN-Net segmentation results are conducted.
First, (a) line picture shows that DMTN-Net demonstrates its superior segmentation accuracy and fine-grained parsing ability in processing complex natural scenes that incorporate people and boats. In particular, when distinguishing highly similar fine structures (e.g., figure and boat silhouettes), DMTN-Net is able to delineate the boundaries more accurately, highlighting its advantages in extracting high-level semantic information and optimising edge accuracy in complex scenes.
(b) line picture shows that DeeplabV3+ has poor segmentation results when dealing with such tiny and detail-rich targets, whereas DMTN-Net effectively captures and segments these tiny floating objects with its enhanced context information perception, proving its efficiency in capturing the subtle correlation between local details and global context.
(c) line picture shows that DeeplabV3+ encounters boundary blurring and category confusion when segmenting large structures (e.g., bridges and buildings), incorrectly classifying some of the building pixels as bridges, demonstrating its limitations in dealing with structurally complex regions with blurred boundaries. On the contrary, DMTN-Net not only accurately defines the main contours of bridges and buildings through its advanced feature extraction and classification mechanism, but also handles the details more finely, which is of great significance for the safe delineation of unmanned vessel navigational areas and accurate navigation. Finally, (d) line picture shows that DeeplabV3+ has difficulty in effectively distinguishing neighbouring vessels, leading to recognition failures and thus limiting its applicability in complex water environments. DMTN-Net, on the other hand, achieves accurate segmentation of each vessel instance by virtue of its stronger instance differentiation ability, which is crucial for improving the target recognition, path planning, and obstacle avoidance capabilities of unmanned vessels in complex waters, significantly enhancing their operational safety and efficiency.
Therefore, compared with DeeplabV3+ (Xception), DMTN-Net shows better performance across multiple dimensions of the segmentation task, especially when dealing with complex scenes, small targets, large structures, and dense similar objects, and its segmentation completeness and accuracy are significantly improved, providing more solid technical support for the visual perception module of unmanned vessels and other autonomous navigation systems.

4.4. Network Performance Comparison Test

In order to verify the performance of DMTN-Net, this paper evaluates the fully convolutional network FCN (resnet50, resnet101), deeplabv3 (resnet50, resnet101), PSPNet (resnet50, mobilenetv2), LRASPP, and deeplabv3+ (xception) on the MassMIND dataset. The training parameters are kept consistent with those of DMTN-Net, which is compared against these mainstream segmentation networks; the results are shown in Table 2. It can be seen that the segmentation performance of DMTN-Net outperforms the mainstream networks, and after adopting mobilenetv2 as the backbone and the nearly parameter-free Triplet Attention, its number of parameters is much lower than that of the other algorithms, which demonstrates the lightweight property of DMTN-Net.
In order to deeply explore the performance of different segmentation networks in complex scenes, experiments are conducted to compare the segmentation efficacy of network architectures such as Pspnet (resnet50, mobilenetv2), Deeplabv3+ (Xception), and DMTN-Net. Figure 12 visualises the segmentation details and accuracy differences between these models in several challenging visual scenarios.
For the ship and crew segmentation task (a), Pspnet (resnet50, mobilenetv2) and DeeplabV3+ show some ambiguity in the accurate depiction of the ship’s outline and fail to identify the crew area effectively (marked in purple), demonstrating limitations in detail retention and complex structure recognition. In contrast, DMTN-Net not only accurately depicts the fine outline of the ship, but also successfully captures the rough outline of the crew, which demonstrates its superiority in dealing with mutual occlusion of complex objects and recognition of fine structures.
(b) Shows that Pspnet has significant segmentation confusion when dealing with the complex region of the sailboat texture, incorrectly classifying the distant building pixels as ships, reflecting a deficiency in the ability to distinguish between long-distance, low-contrast targets. DeeplabV3+, on the other hand, produces pixel confusion between the sailboat mast and the neighbouring buildings, again revealing a deficiency in dealing with the boundaries of similar features. DMTN-Net, on the other hand, demonstrates clearer and more accurate segmentation boundaries with its enhanced feature representation and contextual understanding, effectively reducing misclassification.
For the segmentation accuracy of long-distance building edges (c), compared with Pspnet and DeeplabV3+, DMTN-Net shows higher segmentation accuracy in the fine detailing of building edges by virtue of its ability to effectively capture and retain detailed information.
In the segmentation challenge of the bridge-water junction region (d), both Pspnet and DeeplabV3+ show pixel misclassification, which is manifested by the confusion of bridge and water pixels, which highlights the general difficulty in boundary recognition of complex scenes. In contrast, DMTN-Net not only effectively distinguishes the bridge from the water surface, but also demonstrates excellent performance in segmenting long-distance buildings under the bridge, which again verifies its strong capability in complex scene resolution and multi-scale target processing.

4.5. Generalization Experiment

In order to evaluate the generalization performance of the algorithm, the Pascal VOC 2012 dataset and the Sea-data dataset are used as test objects to validate the effectiveness of the proposed algorithm. The Pascal VOC 2012 dataset is the official dataset used by the Pascal VOC Challenge; each image is labelled, and the labelled objects span 20 categories, including people, animals (e.g., cats, dogs), vehicles (e.g., cars, boats, planes), and furniture (e.g., chairs, tables, sofas).
The Sea-data dataset consists of 1246 images collected in the field, labelled, and augmented by our team. The dataset contains seven categories, including background, sky, navigable area, vessel, and reef. Since all objects on the water that hinder navigation, such as ships, reefs, and floats, must be avoided during autonomous sailing, these impediments are grouped into a single obstacle category, represented in blue, as shown in Figure 13, with the captured images on the left and the labels on the right.
Table 3 clearly shows that the DMTN-Net model performs well in the segmentation task on the VOC dataset, and its performance is significantly improved compared to other mainstream network models.
In order to visualise the network segmentation effect, the deeplabv3 (resnet50) and pspnet networks, which achieve relatively high mIoU, are selected for comparison with DMTN-Net. Figure 14 shows that deeplabv3 suffers from segmentation clutter and incomplete segmentation, while pspnet shows unclear segmentation boundaries, incomplete segmentation, and pixel misclassification. In contrast, DMTN-Net segments the VOC dataset well, showing good generalization ability.
The experiments are then carried out on the Sea-data dataset, and comparison experiments of different algorithms are reported in Table 4; Deeplabv3, Pspnet, DeeplabV3+, and DMTN-Net are chosen for comparison. The mIoU and mPA metrics in the table show that the segmentation accuracy of DMTN-Net is higher than that of today's mainstream networks.
Figure 15 shows the visualisation results; Figure 15a–c shows the inference results of Deeplabv3, Pspnet, and Deeplabv3+, respectively. Because these networks lack sufficient spatial information extraction capability, they struggle to recognize the contours of mountain ranges, which leads to large pixel-classification errors among the sky, the sea, and the mountains, so the mountain ranges are recognized in a confused manner.
In Figure 15d, DMTN-Net enhances the recognition of mountain ranges: through the multi-scale and multi-channel interaction of feature information provided by the N-Decoder and Triplet Attention, it increases the correlation between pixels, reduces the loss of spatial information, and correctly recognizes many pixels that were previously misclassified as sky, sea water, or mountains. The enhanced feature representation achieves excellent results in both pixel classification and the delineation of mountain contours.

4.6. Offshore Navigation Experiment

In order to evaluate the effectiveness of DMTN-Net in practical applications, our team conducted an offshore sailing validation experiment in the sea area of Xiaoshidao, Weihai City, Shandong Province, China. The following is an overview of the experimental steps, as shown in Figure 16.
The image data of the test waters are collected in real time by a Hikvision shipboard vision sensor and transmitted to the server; the DMTN-Net network performs real-time segmentation on the image data and transmits the results to the client. The experimental results are shown in Figure 17, which shows that DMTN-Net achieves good segmentation accuracy for distant vessels and navigable areas. To further visualize the segmentation effect, we provide two application videos in the Supplementary Materials.

5. Conclusions

In order to solve the problems of insufficient navigable area recognition accuracy, fuzzy obstacle segmentation boundaries, and high computational resource consumption faced by unmanned surface vehicles (USVs) in autonomous navigation, this paper proposes a novel DMTN-Net network architecture, which is extended and optimized from the DeeplabV3+ baseline model. By integrating a dynamic feature extraction strategy with a multi-task learning framework, DMTN-Net significantly improves the accuracy and efficiency of environment sensing. First, to reduce computational complexity and improve the feasibility of model deployment, DMTN-Net adopts MobileNetV2 as the backbone network, which substantially reduces the model parameters and computation while maintaining high accuracy. Second, on the basis of the standard Decoder, an enhanced N-Decoder structure is proposed to improve segmentation accuracy, and the cSE attention mechanism and Triplet Attention are embedded into it. Triplet Attention introduces multi-branch attention to capture cross-dimensional interactions: a three-branch structure computes attention weights that focus on different dimensions of the input data, namely the spatial dimensions (height and width) and the channel dimension. In each branch, the input tensor is permuted, processed by Z-Pool and a k × k convolutional layer to capture cross-dimensional interaction features, and the attention weights generated by a sigmoid activation are applied to the permuted tensor, which is then permuted back to the original input shape. This multidimensional interaction increases the correlation between pixels, enhances target recognition, and reduces the loss of spatial information and the occurrence of pixel misclassification during segmentation, thereby further improving segmentation accuracy and the generalization ability of the model. On the MassMIND dataset, the mIoU, mPA, and mPrecision metrics reach 79.59%, 90.53%, and 85.41%, respectively, and the number of parameters is reduced to 5.778 M, a large improvement over the original model. Finally, ablation experiments and comparisons with mainstream segmentation networks are carried out on the MassMIND dataset and the publicly available Pascal VOC2007 dataset, and offshore navigation tests are conducted in the relevant sea areas. The results show that DMTN-Net demonstrates significant performance advantages in navigable area recognition and obstacle segmentation tasks, improving recognition accuracy while effectively reducing computational cost.
DMTN-Net’s excellent performance in the field of environment awareness for unmanned surface vessels (USVs) will have a far-reaching impact on the maritime industry, society, and national interests. From the perspective of the maritime industry, the model significantly enhances the autonomous navigation capability of USVs. Through accurate semantic segmentation technology, USVs can identify environmental obstacles and feasible paths in real time, optimize navigation planning, and thus reduce operational costs and improve navigation safety and efficiency. In terms of environmental protection, DMTN-Net helps USVs to monitor marine pollution, such as floating objects and oil pollution, in real time, so as to respond to potential sources of pollution and maintain the health of marine ecosystems. As for the working conditions of seafarers, DMTN-Net reduces the burden of seafarers, lowers their operational risks, and improves their working environment and quality of life by enhancing the autonomy of USVs. In terms of national security and technological autonomy, DMTN-Net, as a marine exploration technology with independent intellectual property rights, provides solid support for marine security and promotes the innovative development of China’s marine technology, laying a solid foundation for the country’s long-term development.
In the development of intelligent environment sensing technology for unmanned vessels, in order to improve the accuracy and efficiency of navigable area delineation and obstacle segmentation, future research should focus on the following aspects. First, diversified datasets should be continuously collected and expanded, covering a variety of weather conditions, lighting changes, and complex obstacle types, to enhance the model's generalization ability. Second, the network structure should be further optimized and innovated according to the needs of unmanned vessel applications, introducing efficient feature extraction and fusion mechanisms to improve the model's ability to analyze complex environmental information. Finally, image segmentation technology should be integrated into the unmanned vessel system and combined with shipborne UAVs equipped with target detection algorithms to realize all-around environment monitoring at sea and in the air. Meanwhile, autonomous landing technology for shipborne UAVs, three-dimensional measurement and modeling of water targets, and high-efficiency detection and tracking of ship exhaust are being actively researched and developed. All of these will provide strong technical support for cooperative operation between unmanned ships and unmanned aircraft and promote the continuous progress of unmanned system cooperation technology.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics13224539/s1. We provide two video materials to show the segmentation effect of DMTN-Net in real applications; in the videos, blue indicates obstacles, red the sky, and green the sea water.

Author Contributions

M.S.: software, validation, writing—original draft; X.L.: conceptualization, methodology, project administration, funding acquisition; T.Z.: writing—review and editing; Q.Z.: formal analysis; Y.S.: resources, data curation; H.Y. and C.X.: visualisation, supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Shandong Province issued by the Science and Technology Department of Shandong Province under grant number ZR2022QE201.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wei, L.; Tianwei, L.; Shangyue, Z.; Rongrong, Y. Technology development and prospect of unmanned surface vessels. Ship Electron. Eng. 2021, 41, 1–3. [Google Scholar]
  2. Tang, Z. Integration and application path of artificial intelligence technology and intelligent networked vehicle technology. Spec. Veh. 2024, 6, 77–79. [Google Scholar]
  3. Yan, M.; Li, C.; Zhu, D. Real-time obstacle avoidance path planning system for unmanned watercraft based on bio-inspired neural network. J. Shanghai Marit. Univ. 2024, 45, 10–15+48. [Google Scholar]
  4. Ding, S.; Liu, M.; Shang, S.; Liang, Y. Research on ship target detection based on deep learning and image segmentation. Autom. Appl. 2024, 65, 28–30+34. [Google Scholar]
  5. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  6. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
  7. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  8. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  9. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  10. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  11. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  12. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  13. Bovcon, B.; Kristan, M. WaSR—A water segmentation and refinement maritime obstacle detection network. IEEE Trans. Cybern. 2021, 52, 12661–12674. [Google Scholar] [CrossRef] [PubMed]
  14. Muhovič, J.; Mandeljc, R.; Bovcon, B.; Kristan, M.; Perš, J. Obstacle Tracking for Unmanned Surface Vessels Using 3-D Point Cloud. IEEE J. Ocean. Eng. 2020, 45, 786–798. [Google Scholar] [CrossRef]
  15. Chen, X.; Liu, Y.; Achuthan, K. WODIS: Water obstacle detection network based on image segmentation for autonomous surface vehicles in maritime environments. IEEE Trans. Instrum. Meas. 2021, 70, 7503213. [Google Scholar] [CrossRef]
  16. Yao, F.-F.; Ber, L.-Z.; Zhou, T. Improved waterfront segmentation algorithm for U-Net. Comput. Sci. Appl. 2022, 12, 2875. [Google Scholar]
  17. Xiong, R.; Cheng, L.; Hu, T.; Wu, J.; Wang, H.; Yan, X.; He, Y. Research on fast segmentation algorithm of feasible domain and obstacles for surface unmanned craft. J. Electron. Meas. Instrum. 2023, 37, 11–20. [Google Scholar]
  18. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  19. Cao, X.; Chen, X.; Wei, T. Deeply separable convolutional neural network gas pedal based on RISC-V. J. Comput. 2024, 47, 2536–2551. [Google Scholar]
  20. Tang, M.; Zhang, Y.; Zhang, K. Nadam algorithm optimized convolutional neural network for multi-fault coupled diagnosis of rolling bearing. Mech. Des. Manuf. 2024. [Google Scholar] [CrossRef]
  21. Liu, X.; Song, Y.; Li, Z. Enhanced point-by-point graphical convolutional network-based combined classification method for civil aviation short texts. J. Beijing Univ. Aeronaut. Astronaut. 2024. [Google Scholar] [CrossRef]
  22. Chen, Z.C.; Jiao, H.N.; Yang, J.; Zeng, H.F. A garbage image classification algorithm based on improved MobileNet v2. J. Zhejiang Univ. (Eng. Ed.) 2021, 55, 1490–1499. [Google Scholar]
  23. Qiao, W.; Liu, Q.; Wu, X.; Ma, B.; Li, G. Automatic pixel-level pavement crack recognition using a deep feature aggregation segmentation network with a scSE attention mechanism module. Sensors 2021, 21, 2902. [Google Scholar] [CrossRef] [PubMed]
  24. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3139–3148. [Google Scholar]
  25. Nirgudkar, S.; DeFilippo, M.; Sacarny, M.; Benjamin, M.; Robinette, P. Massmind: Massachusetts maritime infrared dataset. Int. J. Robot. Res. 2023, 42, 21–32. [Google Scholar] [CrossRef]
  26. Bovcon, B.; Muhovič, J.; Perš, J.; Kristan, M. The mastr1325 dataset for training deep usv obstacle detection models. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 3431–3438. [Google Scholar]
Figure 1. DeeplabV3+.
Figure 2. Comparison of sensory fields with different expansion rates.
Figure 3. Structure of DMTN-Net network.
Figure 4. Depth separable convolution schematic.
Figure 5. Inverted residual structure.
Figure 6. Structure of N-Decoder.
Figure 7. Structure of cSE module; m × n | p: convolution with an m × n kernel and p channels; σ( ): Sigmoid.
Figure 8. Triplet Attention.
Figure 9. Structure of Triplet Attention.
Figure 10. Graph of loss function.
Figure 11. Comparison of network segmentation. (a) Segmentation of ships and persons; (b) segmentation of floating objects in water; (c) segmentation of bridges and buildings; (d) segmentation of ships. Differences in segmentation performance are highlighted in red boxes.
Figure 12. Network performance comparison chart. (a,b) Segmentation of ships and persons; (c) segmentation of buildings; (d) segmentation of bridges and buildings. Differences in segmentation performance are highlighted in red boxes.
Figure 13. Examples of dataset.
Figure 14. Comparison of segmentation performance based on VOC dataset.
Figure 15. Sea-data dataset visualisation results. (a) Deeplabv3; (b) Pspnet; (c) Deeplabv3+; (d) DMTN-Net. The red boxes indicate differences in segmentation performance.
Figure 16. Offshore sailing experiment.
Figure 17. Offshore sailing results.
Table 1. Comparison of the performance of the improved structure; "√" denotes the module used by the network.

Test | Network | MobilenetV2 | N-Decoder | Triplet Attention | mIoU/% | mPA/% | mPrecision/% | Params/M
1 | D | - | - | - | 72.28 | 90.90 | 77.02 | 54.71
2 | D + M | √ | - | - | 75.98 | 89.51 | 81.96 | 5.815
3 | D + M + N | √ | √ | - | 75.08 | 87.18 | 82.12 | 5.778
4 | D + M + T | √ | - | √ | 77.01 | 88.16 | 83.91 | 5.815
5 | DMTN-Net | √ | √ | √ | 79.59 | 90.53 | 85.41 | 5.778
Table 2. Performance comparison with mainstream segmentation networks on the MassMIND dataset.

Network | mIoU/% | Params/M | IoU1/% | IoU2/% | IoU3/% | IoU4/% | IoU5/% | IoU6/% | IoU7/%
FCN (resnet50) | 67.5 | 25.6 | 87.9 | 80.4 | 58 | 51.2 | 98.4 | 96.9 | -
FCN (resnet101) | 63.7 | 44.5 | 85.9 | 79.5 | 41.6 | 44.2 | 98.2 | 96.6 | -
deeplabv3 (resnet50) | 67.7 | >25.6 | 87.1 | 81.9 | 54.7 | 54.7 | 98.4 | 96.8 | -
deeplabv3 (resnet101) | 63.7 | >44.5 | 84 | 79.2 | 39.6 | 47.4 | 97.8 | 96.4 | -
Pspnet (resnet50) | 76.1 | 23.77 | 89 | 96 | 37 | 62 | 66 | 98 | 84
Pspnet (mobilenetv2) | 70.19 | 10.77 | 86 | 95 | 31 | 52 | 51 | 97 | 78
LRASPP | 53.5 | - | 95.1 | 96.4 | 28.1 | 5.5 | 72.3 | 76.8 | -
deeplabv3+ (xception) | 72.28 | 55.48 | 91 | 97 | 27 | 60 | 52 | 98 | 82
DMTN-Net | 79.59 | 5.778 | 91 | 97 | 57 | 63 | 66 | 98 | 86
Table 3. Comparison experiments on the VOC 2012 dataset.

Network | mIoU/% | IoU1/% | IoU2/% | IoU3/% | IoU4/% | IoU5/% | IoU6/% | IoU7/%
FCN (resnet50) | 58.7 | 92.7 | 87.3 | 61.2 | 45.9 | 67.9 | 80.3 | 74.6
FCN (resnet101) | 43.7 | 89.6 | 79.2 | 43.9 | 58.6 | 52.1 | 45.7 | 44.2
deeplabv3 (resnet50) | 74.9 | 94.3 | 93.3 | 63.5 | 93.9 | 84 | 81 | 70.5
deeplabv3 (resnet101) | 65.7 | 89.8 | 89.5 | 58.1 | 87.2 | 61.4 | 68.9 | 88.3
Pspnet (mobilenetv2) | 75.75 | 84 | 89 | 57 | 83 | 62 | 84 | 83
DMTN-Net | 79.59 | 91 | 97 | 57 | 63 | 66 | 98 | 86

Network | IoU8/% | IoU9/% | IoU10/% | IoU11/% | IoU12/% | IoU13/% | IoU14/%
FCN (resnet50) | 80.3 | 57.9 | 22.3 | 30.3 | 46.7 | 46.8 | 43
FCN (resnet101) | 58.7 | 39.7 | 11.6 | 12.7 | 40.7 | 45.7 | 17.6
deeplabv3 (resnet50) | 92.9 | 76 | 23 | 71.4 | 73.8 | 81.3 | 74.8
deeplabv3 (resnet101) | 84.8 | 71.8 | 16.9 | 59.2 | 64.1 | 76.2 | 63.5
Pspnet (mobilenetv2) | 66 | 88 | 64 | 73 | 24 | 86 | 87
DMTN-Net | 71 | 89 | 68 | 84 | 45 | 83 | 84

Network | IoU15/% | IoU16/% | IoU17/% | IoU18/% | IoU19/% | IoU20/% | IoU21/%
FCN (resnet50) | 79.6 | 84 | 60.6 | 10.8 | 38.7 | 68.6 | 54.1
FCN (resnet101) | 50.6 | 74.1 | 55.5 | 5.6 | 20.8 | 42.8 | 29
deeplabv3 (resnet50) | 87.8 | 87.1 | 77 | 82 | 57.6 | 67.5 | 39.8
deeplabv3 (resnet101) | 79.4 | 80.3 | 41.7 | 76.8 | 28.3 | 68.8 | 24.8
Pspnet (mobilenetv2) | 92 | 78 | 71 | 92 | 42 | 90 | 94
DMTN-Net | 88 | 79 | 76 | 92 | 67 | 88 | 95
Table 4. Comparison experiments on the Sea-data dataset.

Model | mIoU/% | mPA/%
Deeplabv3 | 74.13 | 80.81
Pspnet | 75.47 | 83.88
Deeplabv3+ | 75.88 | 80.07
DMTN-Net | 76.65 | 88.18
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
