1. Introduction
As an important representative of marine robotics, unmanned surface vehicles (USVs) are widely used in military reconnaissance, water safety surveillance, environmental monitoring, and other fields by virtue of their high flexibility, strong concealment, and excellent environmental adaptability; the core of their technology is the fusion of high-precision autonomous navigation and environment sensing [1]. However, traditional sensing devices such as ranging sensors, LiDAR, millimeter-wave radar, sonar, and GPS, as well as multi-sensor fusion technology, suffer from high cost, limited accuracy, and poor system stability, making it difficult to meet the all-around sensing needs of USVs under complex sea conditions. With the development of artificial intelligence, vision technology has been widely applied to unmanned vessel environment sensing [2,3], especially semantic segmentation [4], by virtue of its excellent capability for processing structured data.
In USVs, visual perception uses visual sensors to collect images and combines them with deep learning frameworks to achieve accessible-area segmentation and obstacle detection. The classical deep learning-based semantic segmentation algorithms are U-Net [5], HRNet [6], PSPNet [7], SegFormer [8], and the DeepLab series (V1 [9], V2 [10], V3 [11], V3+ [12]). Building on these networks, various improved algorithms have emerged. For example, the WASR network proposed by Borja Bovcon [13] can effectively segment accessible areas and ships on the water, but its segmentation of small obstacles such as reefs is poor, which cannot ensure the safe navigation of ships. The 3D point cloud method proposed by Jon Muhovič [14] and the WODIS water surface obstacle segmentation algorithm proposed by Xiang Chen [15] both achieve accurate segmentation of obstacle edges, but they carry large numbers of parameters and demand high computing power. Yao Fufei [16] introduced a spatial attention mechanism into U-Net to achieve accurate inland water-shore segmentation, but the algorithm occupies a large amount of RAM, which is not conducive to model deployment. Xiong Rui [17] proposed a fast segmentation network, DeeplabV3-CSPNet, which improves computing speed by introducing attention into the feature extraction and feature fusion parts and has been applied in USV simulation, but its adaptability to complex scenes is poor. In summary, excessive parameter counts that hinder deployment, low segmentation accuracy, and weak generalization ability remain urgent problems for semantic segmentation of the feasible region of unmanned ships.
To solve the above problems, this paper adopts DeeplabV3+ as the benchmark model and improves it to propose DMTN-Net. Compared with other mainstream networks, DeeplabV3+ has an advanced network structure and excellent performance, and it can handle objects of different scales, making it well suited to complex scenes at sea. First, MobileNetV2 is adopted as the backbone network to reduce the number of parameters and make the model more lightweight. Second, an N-Decoder is proposed to improve the segmentation accuracy and generalization ability of the model; the cSE attention module and Triplet Attention are integrated into it to achieve information interaction across multiple channels, reduce the loss of spatial information, and enhance segmentation accuracy. Finally, the strong performance of the algorithm is demonstrated through multi-dataset experiments and offshore sailing experiments.
2. DeepLabV3+
DeepLabV3+ [12] is a semantic segmentation network developed by Google in 2018, and its base model is shown in Figure 1. After an image is input, deep feature extraction is first performed by the backbone network Xception in the Encoder to obtain feature maps containing multi-scale information. These features are then passed to the Atrous Spatial Pyramid Pooling (ASPP) module, which applies atrous convolutions at different scales (a 1 × 1 convolution and 3 × 3 atrous convolutions with dilation rates of 6, 12, and 18; a comparison of the receptive fields at different dilation rates is shown in Figure 2) as well as global average pooling. By enlarging the receptive fields of the convolutional layers in a nearly parameter-free manner, ASPP enhances the model's ability to perceive contextual information at different scales in the image and realizes the refinement and multi-scale fusion of the feature maps. The fused features are up-sampled by a factor of four and then enter the Decoder module for fusion with the low-level feature maps from the Encoder, a process that enhances the model's ability to capture detailed information. Finally, after a 3 × 3 convolution and bilinear interpolation up-sampling, a segmentation map matching the resolution of the input image is obtained, achieving accurate mapping from the pixel level to the semantic level.
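To make the ASPP computation concrete, the following is a minimal PyTorch sketch of such a block with the branches described above (a 1 × 1 convolution, 3 × 3 atrous convolutions with dilation rates 6/12/18, and global average pooling). The channel sizes and layer names are illustrative assumptions, not the official DeepLabV3+ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Minimal ASPP sketch: parallel atrous branches plus image-level pooling."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # 1x1 conv branch
        self.atrous = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
            for r in rates  # 3x3 atrous convs with dilation rates 6, 12, 18
        ])
        self.pool = nn.Sequential(  # global average pooling branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.branch1(x)] + [conv(x) for conv in self.atrous]
        pooled = F.interpolate(self.pool(x), size=(h, w), mode="bilinear",
                               align_corners=False)
        feats.append(pooled)
        return self.project(torch.cat(feats, dim=1))  # fuse multi-scale features
```

Because the dilation enlarges the receptive field without adding weights, each 3 × 3 branch has the same parameter count regardless of its rate, which is why ASPP widens the context "nearly parameter-free."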
DeepLabV3+ performs strongly in multiple application scenarios, such as autonomous driving, medical imaging, and intelligent surveillance. Compared with its predecessor DeepLabV3, it introduces two major innovations in architectural design. First, it draws on the essence of the U-Net structure and incorporates an Encoder-Decoder framework. In the Encoder stage, the dense deployment of atrous convolution not only maintains the spatial dimensions of the image but also greatly expands the effective receptive field of the convolutions, enabling each convolutional unit to capture richer contextual information. This design facilitates the seamless fusion of high-level semantic features with low-level detail information, significantly improving the accuracy and boundary clarity of the segmentation task. Second, the backbone of DeepLabV3+ adopts the Xception network instead of ResNet101. The core advantage of Xception is its depthwise separable convolution mechanism, which decomposes the standard convolution operation into two stages, namely depthwise convolution and pointwise convolution, effectively reducing the total number of model parameters and the computational complexity while improving computational efficiency. Combined with Xception's residual connections, DeepLabV3+ is able to integrate multi-scale feature information more flexibly, demonstrating excellent recognition and segmentation capabilities for target objects of different sizes in complex scenes.
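As a concrete illustration of this factorization, here is a minimal PyTorch sketch of a depthwise separable convolution; it is a generic formulation (channel counts illustrative), not Xception's exact block, which additionally interleaves nonlinearities and residual connections.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """K x K standard convolution factored into depthwise + pointwise stages.
    Parameter count drops from k*k*in_ch*out_ch to k*k*in_ch + in_ch*out_ch."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # depthwise: one k x k filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2,
                                   groups=in_ch, bias=False)
        # pointwise: 1 x 1 convolution that mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```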
4. Experiment
4.1. Experimental Configuration and Data
The experiments were carried out on a 64-bit Windows 11 operating system with a 13th-generation Intel Core i7-13700H processor clocked at 2.40 GHz, 16 GB of RAM, and an NVIDIA GeForce RTX 4060 GPU for accelerated computation. On the software side, Python 3.11.7, PyTorch 2.0.0, CUDA 11.8, cuDNN 8.7.0, and torchvision 0.15.1 were used.
Currently, domestic research in the field of sea surface obstacle segmentation is still in its infancy, and there is a lack of specialized, publicly available datasets to support in-depth exploration. In contrast, several related datasets have been released internationally in recent years, such as the Massachusetts Maritime Infrared Dataset (MassMIND) [25] and MaSTr1325 [26], which provide valuable experimental bases for researchers. This paper selects MassMIND, released in June 2023 by Shailesh Nirgudkar et al., which focuses on the segmentation of obstacles on the sea surface. The dataset contains 2916 high-quality long-wave infrared (LWIR) images acquired from coastal marine environments over a period of up to two years; this time span and environmental diversity ensure that the dataset covers a wide variety of weather conditions, lighting conditions, and sea surface obstacle types, including ordinary obstacles, living obstacles (e.g., boats, buoys), and bridges, as well as environmental elements such as the sky, the water itself, and the background. This rich diversity helps train segmentation models that are more robust and generalize better. In addition, each image in MassMIND is meticulously annotated, subdividing the scene into seven categories: sky, water body, common obstacles, living obstacles, bridges, self, and background. This careful classification helps the model recognize different types of obstacles more accurately, thereby improving the accuracy and reliability of segmentation. For the validity and reliability of the experiments, the dataset is divided into training and validation sets at a ratio of 9:1, ensuring that the model learns from sufficient data while its performance is evaluated on the validation set.
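For reference, a minimal sketch of such a 9:1 split, assuming the samples are held as a plain list of image/label path pairs; the function name and fixed seed are illustrative, not taken from the authors' pipeline.

```python
import random

def split_dataset(samples, train_ratio=0.9, seed=0):
    """Shuffle sample paths and split them 9:1 into train/validation lists."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```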
4.2. Evaluation Indicators
To comprehensively evaluate the performance of DMTN-Net in feasible-region and obstacle segmentation, this paper employs several accuracy metrics, including mean pixel accuracy (mPA), mean intersection over union (mIoU), mean precision (mPrecision), mean recall (mRecall), and the number of model parameters. Together, these metrics form a multidimensional framework for evaluating the effectiveness of the model.
Mean Pixel Accuracy (mPA): for each category, it calculates the proportion of correctly predicted pixels among all pixels that truly belong to that category, and then averages over all categories. In complex environments, unmanned vessels need to accurately recognize different categories of obstacles (e.g., drifting objects, debris) in order to adopt different obstacle avoidance strategies, and this metric reflects the per-category accuracy of the segmentation results. The higher the mPA value, the better the network recognizes obstacles of different categories. The formula is as follows:

$$\mathrm{mPA} = \frac{1}{n}\sum_{j=1}^{n}\frac{p_{jj}}{\sum_{k=1}^{n} p_{jk}}$$
Mean Intersection over Union (mIoU): it calculates, for each category, the ratio of the intersection to the union of the predicted results and the true labels, and then averages over all categories. This metric intuitively reflects the quality of the segmentation results: the higher the mIoU value, the closer the segmentation results are to the real labels and the more accurately the unmanned ship perceives the environment. This helps the unmanned ship recognize obstacles such as drifting objects and debris more accurately, and thus make better obstacle avoidance decisions. The formula is as follows:

$$\mathrm{mIoU} = \frac{1}{n}\sum_{j=1}^{n}\frac{p_{jj}}{\sum_{k=1}^{n} p_{jk} + \sum_{k=1}^{n} p_{kj} - p_{jj}}$$

where $n$ denotes the number of categories, $p_{jj}$ is the number of correctly classified pixels, $p_{jk}$ is the number of pixels from category $j$ assigned to category $k$, and $p_{kj}$ is the number of pixels from category $k$ assigned to category $j$.
Mean Category Pixel Accuracy (mPrecision): it calculates the proportion of pixels that truly belong to a category among all pixels predicted to be in that category, and then averages over all categories. This metric reflects the accuracy of the pixels predicted as positive samples in the segmentation results: the higher the mPrecision value, the less likely the unmanned ship is to misjudge when recognizing obstacles, helping it determine more accurately which objects are potential obstacles.
First, the pixel accuracy (CPA) for each category is calculated by the formula

$$\mathrm{CPA} = \frac{TP}{TP + FP}$$

where TP denotes true positives, FP false positives, and FN false negatives. Then, mPrecision is obtained by averaging the CPA over all categories:

$$\mathrm{mPrecision} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{CPA}_i$$

where $N$ is the total number of categories and $\mathrm{CPA}_1, \ldots, \mathrm{CPA}_N$ are the pixel accuracies of the individual categories.
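All of these per-category metrics derive from a single confusion matrix, so they can be computed together. The following NumPy sketch is illustrative; the function name and the epsilon guard against empty classes are assumptions, not the authors' evaluation code.

```python
import numpy as np

def segmentation_metrics(conf, eps=1e-10):
    """conf[j, k] = number of pixels of true class j predicted as class k."""
    tp = np.diag(conf).astype(float)
    pa = tp / (conf.sum(axis=1) + eps)    # per-class pixel accuracy -> mPA
    cpa = tp / (conf.sum(axis=0) + eps)   # per-class precision (CPA) -> mPrecision
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    iou = tp / (union + eps)              # per-class IoU -> mIoU
    return pa.mean(), iou.mean(), cpa.mean()
```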
Hyperparameter selection: during the training process, stochastic gradient descent (SGD) is chosen as the strategy for weight update to accelerate the convergence speed of the model and improve the segmentation performance.
The learning rate determines the convergence speed of the model: too high a learning rate may cause the model to diverge, while too low a rate makes convergence too slow. Based on multiple rounds of experiments, the learning rate of DMTN-Net is set to 0.001, which allows the model to converge quickly while ensuring that gradients propagate normally.
The total number of iterations is set to 100 epochs, the batch size is set to 8, and an adaptive learning rate adjustment strategy is used to improve training efficiency.
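A minimal PyTorch sketch of this training configuration is shown below. The paper states only that an adaptive learning rate strategy is used, so ReduceLROnPlateau is one plausible choice here; `criterion` (the Dice loss discussed next) and `validate` are passed in as assumed helpers.

```python
import torch

def train(model, train_loader, val_loader, criterion, validate, epochs=100):
    """Training-loop sketch: SGD with lr = 0.001, 100 epochs, adaptive LR.
    `criterion` is the loss (Dice Loss in this paper); `validate` returns
    the validation loss. Both are caller-supplied assumptions."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    # The adaptive schedule is not specified; ReduceLROnPlateau is one option.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=5)
    for epoch in range(epochs):
        model.train()
        for images, masks in train_loader:  # batch size 8 set in the DataLoader
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
        scheduler.step(validate(model, val_loader))  # adapt LR on plateau
```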
The focal loss (Focal Loss) and the similarity coefficient loss (Dice Loss) are often used to train such models, but because Focal Loss introduces additional attention factors and category-balancing weights, its loss computation is more complicated than traditional cross-entropy. Therefore, this paper chooses Dice Loss, which has good anti-interference ability and a simple calculation process. Its expression is as follows:

$$\mathrm{Dice\ Loss} = 1 - \frac{2\,TP}{2\,TP + FP + FN}$$

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
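Below is a minimal PyTorch sketch of a soft multi-class Dice loss consistent with this expression, under the assumption of softmax probabilities and one-hot targets; the tensor shapes and epsilon are illustrative.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for multi-class segmentation (a common formulation).

    pred:   (N, C, H, W) class probabilities (e.g., after softmax)
    target: (N, C, H, W) one-hot ground-truth masks
    """
    dims = (0, 2, 3)
    intersection = (pred * target).sum(dims)         # soft analogue of TP
    cardinality = pred.sum(dims) + target.sum(dims)  # soft 2TP + FP + FN
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice.mean()                         # average over classes
```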
4.3. Ablation Experiment
In this paper, we carry out ablation experiments on the MassMIND USV accessible-area and obstacle segmentation dataset. The module combinations are denoted as follows:
(i) The original DeeplabV3+ model (Xception backbone) is denoted as D;
(ii) DeeplabV3+ with the MobileNetV2 backbone is denoted as D + M;
(iii) Adding the proposed N-Decoder structure to (ii) is denoted as D + M + N;
(iv) Embedding the Triplet Attention mechanism into (ii) is denoted as D + M + T;
(v) Integrating all of the above modules is denoted as DMTN-Net.
From the data in Table 1, it can be seen that the mIoU, mPA, and mPrecision of Tests 2, 3, 4, and 5 are improved overall compared with Test 1, showing that the improvements in this paper outperform the original model across the segmentation metrics. The number of parameters is reduced by nearly a factor of ten after replacing the Xception backbone with MobileNetV2, and Test 4 shows no change in parameter count after adding Triplet Attention. Test 5 shows that the mIoU and mPrecision of the proposed DMTN-Net are improved by 7.31% and 8.39%, respectively, over the original model in Test 1. The improvement in mIoU indicates that the segmentation results of DMTN-Net are closer to the real labels, which helps the unmanned vessel recognize obstacles such as drifting objects and debris more accurately. Although the mPA of the original model is 0.37% higher than that of DMTN-Net, DMTN-Net is clearly superior on the other metrics and has far fewer parameters than the original model.
The loss curve of the training experiment is shown in Figure 10. It can be observed that as the number of training iterations increases, the loss value decreases and gradually stabilizes, indicating that the constructed network model has converged. This result again verifies the effectiveness of the model and the stability of the training process.
As shown in Figure 11, to visualize the performance difference between DeeplabV3+ (Xception) and DMTN-Net in the segmentation task, comparison experiments are conducted covering the original image, the real labels, the DeeplabV3+ segmentation results, and the DMTN-Net segmentation results.
First, row (a) shows that DMTN-Net demonstrates superior segmentation accuracy and fine-grained parsing ability when processing complex natural scenes containing people and boats. In particular, when distinguishing highly similar fine structures (e.g., figure and boat silhouettes), DMTN-Net delineates the boundaries more accurately, highlighting its advantages in extracting high-level semantic information and optimizing edge accuracy in complex scenes.
Row (b) shows that DeeplabV3+ segments such tiny, detail-rich targets poorly, whereas DMTN-Net effectively captures and segments these small floating objects with its enhanced perception of contextual information, proving its efficiency in capturing the subtle correlation between local details and global context.
Row (c) shows that DeeplabV3+ suffers boundary blurring and category confusion when segmenting large structures (e.g., bridges and buildings), incorrectly classifying some building pixels as bridge, demonstrating its limitations in structurally complex regions with blurred boundaries. In contrast, DMTN-Net not only accurately delineates the main contours of bridges and buildings through its advanced feature extraction and classification mechanism, but also handles the details more finely, which is of great significance for the safe delineation of unmanned vessel navigation areas and accurate navigation. Finally, row (d) shows that DeeplabV3+ has difficulty distinguishing neighboring vessels, leading to recognition failures and limiting its applicability in complex water environments. DMTN-Net, on the other hand, accurately segments each vessel instance by virtue of its stronger instance differentiation ability, which is crucial for improving the target recognition, path planning, and obstacle avoidance capabilities of unmanned vessels in complex waters, significantly enhancing their operational safety and efficiency.
Therefore, compared with DeeplabV3+ (Xception), DMTN-Net performs better across multiple dimensions of the segmentation task, especially when dealing with complex scenes, small targets, large structures, and dense similar objects; its segmentation completeness and accuracy are significantly improved, providing more solid technical support for the visual perception modules of unmanned ships and other autonomous navigation systems.
4.4. Network Performance Comparison Test
To verify the performance of DMTN-Net, this paper trains the fully convolutional network FCN (ResNet50, ResNet101), DeeplabV3+ (ResNet50, ResNet101), PSPNet (ResNet50, MobileNetV2), LRASPP, and DeeplabV3+ (Xception) on the MassMIND dataset. The training parameters are kept consistent with DMTN-Net, which is tested against these mainstream segmentation networks, and the results are shown in Table 2. It can be seen that the segmentation performance of DMTN-Net outperforms the mainstream networks, and after adopting MobileNetV2 as the backbone together with Triplet Attention, which adds almost no parameters, its parameter count is much lower than the other algorithms, proving the lightweight nature of DMTN-Net.
To explore in depth the performance of different segmentation networks in complex scenes, experiments compare the segmentation efficacy of architectures such as PSPNet (ResNet50, MobileNetV2), DeeplabV3+ (Xception), and DMTN-Net. Figure 12 visualizes the differences in segmentation detail and accuracy between these models in several challenging visual scenarios.
For the ship and crew segmentation task (a), PSPNet (ResNet50, MobileNetV2) and DeeplabV3+ show some ambiguity in accurately depicting the ship's outline and fail to identify the crew region effectively (marked in purple), demonstrating limitations in detail retention and complex structure recognition. In contrast, DMTN-Net not only accurately depicts the fine outline of the ship but also successfully captures the rough outline of the crew, demonstrating its superiority in handling mutual occlusion of complex objects and recognizing fine structures.
Row (b) shows that PSPNet exhibits significant segmentation confusion in the texture-complex region of the sailboat, incorrectly classifying distant building pixels as ship, reflecting a deficiency in distinguishing long-distance, low-contrast targets. DeeplabV3+ produces pixel confusion between the sailboat mast and the neighboring buildings, again revealing weakness in handling the boundaries of similar features. DMTN-Net, by contrast, delivers clearer and more accurate segmentation boundaries with its enhanced feature representation and contextual understanding, effectively reducing misclassification.
For long-distance building edges (c), DMTN-Net delineates the fine details of building edges more accurately than PSPNet and DeeplabV3+, by virtue of its ability to effectively capture and retain detailed information.
In the segmentation challenge of the bridge-water junction region (d), both PSPNet and DeeplabV3+ misclassify pixels, confusing bridge and water pixels, which highlights the general difficulty of boundary recognition in complex scenes. In contrast, DMTN-Net not only effectively distinguishes the bridge from the water surface, but also performs excellently in segmenting the distant buildings under the bridge, again verifying its strong capability in complex scene parsing and multi-scale target processing.
4.5. Generalization Experiment
To evaluate the generalization performance of the algorithm, the Pascal VOC 2012 dataset and the Sea-data dataset are used as test objects to validate the effectiveness of the proposed algorithm. The Pascal VOC 2012 dataset is the official dataset of the Pascal VOC Challenge and contains 20 object categories. Each image is labelled, and the labelled objects include people, animals (e.g., cats, dogs), vehicles (e.g., cars, boats, planes), and furniture (e.g., chairs, tables, sofas).
The Sea-data dataset consists of 1246 images collected, labelled, and augmented by the team in the field. The dataset contains seven categories, including background, sky, navigable area, vessel, and reef. Since autonomous sailing requires avoiding all objects on the water that hinder navigation, such as ships, reefs, and floats, all of these impediments are grouped into the major category of obstacles, represented in blue, as shown in Figure 13, with the captured images on the left and the labels on the right.
Table 3 clearly shows that the DMTN-Net model performs well in the segmentation task on the VOC dataset, and its performance is significantly improved compared to other mainstream network models.
To visualize the network segmentation effect, the DeepLabV3 (ResNet50) and PSPNet networks, which achieve relatively high mIoU, are selected for comparison experiments with DMTN-Net. Figure 14 shows that DeepLabV3 suffers from segmentation clutter and incomplete segmentation, while PSPNet suffers from unclear segmentation boundaries, incomplete segmentation, and wrong pixel classification. In contrast, DMTN-Net segments the VOC dataset well, showing good generalization ability.
Experiments are also carried out on the Sea-data dataset, and comparison results for the different algorithms are given in Table 4. DeepLabV3, PSPNet, DeeplabV3+, and DMTN-Net are chosen for the comparison experiments. The mIoU and mPA metrics in the table show that the segmentation accuracy of DMTN-Net is higher than that of today's mainstream networks.
Figure 15 shows the visualization results, with Figure 15a-c showing the inference results of DeepLabV3, PSPNet, and DeeplabV3+, respectively. Since these networks lack sufficient spatial information extraction capability, they are highly deficient in recognizing the contours of mountain ranges, which leads to large pixel classification errors among the sky, the sea, and the mountains, so the mountain ranges are recognized in a confused manner.
In Figure 15d, DMTN-Net strengthens inter-pixel correlation through the multi-scale, multi-channel feature interactions of the N-Decoder and Triplet Attention, reducing the loss of spatial information; it correctly reassigns to the mountain class many pixels that were misclassified as sky or sea water. The enhanced feature representation achieves excellent results in both the pixel classification and the contour delineation of the mountain ranges.
4.6. Offshore Navigation Experiment
To evaluate the effectiveness of DMTN-Net in practical applications, our team conducted an offshore sailing validation experiment in the sea area of Xiaoshidao, Weihai City, Shandong Province, China. An overview of the experimental steps is shown in Figure 16.
Image data of the test waters are collected in real time by Hikvision's shipboard vision sensor and transmitted to the server; the DMTN-Net network performs real-time segmentation of the image data and transmits the results to the client. The experimental results are shown in Figure 17, which shows that DMTN-Net achieves good segmentation accuracy for distant vessels and accessible areas. To further illustrate the segmentation effect, we provide two application videos in the Supplementary Materials.
5. Conclusions
To solve the problems of insufficient navigable-area recognition accuracy, fuzzy obstacle segmentation boundaries, and high computational resource consumption faced by unmanned surface vehicles (USVs) in autonomous navigation, this paper proposes a novel network architecture, DMTN-Net, which is extended and optimized from the DeeplabV3+ baseline model. By integrating a dynamic feature extraction strategy with a multi-task learning framework, DMTN-Net significantly improves the accuracy and efficiency of environment sensing. First, to reduce computational complexity and improve the feasibility of model deployment, DMTN-Net adopts MobileNetV2 as the backbone network, which markedly reduces the model parameters and computation while maintaining high classification accuracy. Second, an enhanced N-Decoder structure is proposed on the basis of the standard Decoder to improve segmentation accuracy, and the cSE attention mechanism and Triplet Attention are embedded into it. Triplet Attention introduces multi-channel attention to capture cross-dimensional interactions: a three-branch structure computes attention weights that focus on different dimensions of the input data, namely the spatial dimensions (height and width) and the channel dimension. In each branch, the input tensor is permuted and processed by Z-pool and a k × k convolutional layer to capture cross-dimensional interaction features; an attention weight is then generated by a sigmoid activation and applied to the permuted input tensor, which is finally permuted back to the original input shape. This multidimensional interaction increases the correlation between pixels to enhance target recognition, and the reduced loss of spatial information lessens pixel misclassification during segmentation, further improving segmentation accuracy and enhancing the generalization ability of the model. The mIoU, mPA, and mPrecision metrics reach 79.59%, 90.53%, and 85.41%, respectively, and the number of parameters is reduced to 5.778 M, a large improvement over the original model. Finally, ablation studies of DMTN-Net against mainstream segmentation networks are carried out on the MassMIND dataset and the publicly available Pascal VOC 2012 dataset, and comparison tests and offshore navigation tests are conducted in the relevant sea areas. The results show that DMTN-Net demonstrates significant performance advantages in navigable-area recognition and obstacle segmentation tasks, not only improving recognition accuracy but also effectively reducing computational cost.
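To make this computation concrete, below is a minimal PyTorch sketch of Triplet Attention following the description above (permute, Z-pool, k × k convolution, sigmoid gating, permute back, and averaging the three branches). The module names and the kernel size k = 7 follow the original Triplet Attention formulation and are assumptions rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and mean-pooling along the channel dimension."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-pool -> k x k conv -> sigmoid, producing a 2D attention map."""
    def __init__(self, k=7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Three-branch cross-dimension interaction sketched from the text above."""
    def __init__(self, k=7):
        super().__init__()
        self.cw = AttentionGate(k)  # channel-width interaction branch
        self.ch = AttentionGate(k)  # channel-height interaction branch
        self.hw = AttentionGate(k)  # spatial (height-width) branch

    def forward(self, x):  # x: (N, C, H, W)
        # permute so each branch attends over a different pair of dimensions,
        # then permute the gated result back to the original shape
        x_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)  # swap C and H
        x_ch = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # swap C and W
        x_hw = self.hw(x)
        return (x_cw + x_ch + x_hw) / 3.0  # average the three branches
```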
DMTN-Net's excellent performance in environment awareness for unmanned surface vessels (USVs) will have a far-reaching impact on the maritime industry, society, and national interests. From the perspective of the maritime industry, the model significantly enhances the autonomous navigation capability of USVs: through accurate semantic segmentation, USVs can identify environmental obstacles and feasible paths in real time and optimize navigation planning, thereby reducing operational costs and improving navigation safety and efficiency. In terms of environmental protection, DMTN-Net helps USVs monitor marine pollution, such as floating objects and oil spills, in real time, so as to respond to potential pollution sources and maintain the health of marine ecosystems. Regarding seafarers' working conditions, DMTN-Net reduces their workload, lowers their operational risks, and improves their working environment and quality of life by enhancing the autonomy of USVs. In terms of national security and technological autonomy, DMTN-Net, as a marine exploration technology with independent intellectual property rights, provides solid support for maritime security and promotes the innovative development of China's marine technology, laying a solid foundation for the country's long-term development.
In the development of intelligent environment sensing technology for unmanned vessels, in order to improve the accuracy and efficiency of accessible-area delineation and obstacle segmentation, future research should focus on the following aspects. First, continuously collect and expand diversified datasets covering a variety of weather conditions, lighting changes, and complex obstacle types, in order to enhance the model's generalization ability. Second, deeply optimize and innovate the network structure according to the needs of unmanned vessel applications, introducing efficient feature extraction and fusion mechanisms to improve the model's ability to analyze complex environmental information. Finally, integrate image segmentation technology into the unmanned ship system and combine it with shipborne UAVs equipped with target detection algorithms to realize all-around environment monitoring over air and sea. Meanwhile, autonomous landing technology for shipborne UAVs, three-dimensional measurement and modeling technology for water targets, and high-efficiency detection and tracking technology for ship exhaust are being actively researched and developed. All of these will provide strong technical support for cooperative operation between unmanned ships and unmanned aircraft, and promote the continuous progress of unmanned system cooperation technology.